Nvidia NVDEC - copy decoded frame to D3D11 NV12 texture - CUDA

I'm trying to copy the NV12 buffer decoded by NVDEC directly into an NV12 D3D11 texture. No luck so far. What I've managed is a two-pass copy using two D3D11 textures (luma + chroma), two cuGraphicsMapResources calls, two cuGraphicsSubResourceGetMappedArray calls, two CUDA_MEMCPY2D operations, and a pixel shader to merge it all, roughly as sketched below. I have found no way to perform a single-pass copy, and no response from the NVIDIA forum so far.
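A minimal sketch of that two-pass path (assumptions: lumaRes/chromaRes are my placeholder names for the two resources of the R8_UNORM and R8G8_UNORM textures, registered earlier via cuGraphicsD3D11RegisterResource, and dptr/pitch come from cuvidMapVideoFrame):

// Hedged sketch, not verbatim code: copy the two NV12 planes from the
// NVDEC output surface into two registered D3D11 textures.
CUgraphicsResource res[2] = { lumaRes, chromaRes };
cuGraphicsMapResources(2, res, stream);

CUarray lumaArr, chromaArr;
cuGraphicsSubResourceGetMappedArray(&lumaArr, lumaRes, 0, 0);
cuGraphicsSubResourceGetMappedArray(&chromaArr, chromaRes, 0, 0);

CUDA_MEMCPY2D cpy = {};
cpy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
cpy.dstMemoryType = CU_MEMORYTYPE_ARRAY;
cpy.srcDevice     = dptr;                // luma plane from cuvidMapVideoFrame
cpy.srcPitch      = pitch;
cpy.dstArray      = lumaArr;
cpy.WidthInBytes  = width;               // one byte per luma sample
cpy.Height        = height;
cuMemcpy2DAsync(&cpy, stream);

cpy.srcDevice = dptr + (size_t)pitch * surfaceHeight; // interleaved UV plane
cpy.dstArray  = chromaArr;
cpy.Height    = height / 2;              // NV12 chroma is half height
cuMemcpy2DAsync(&cpy, stream);

cuGraphicsUnmapResources(2, res, stream);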
I've found this old question describing a very similar problem; there is no solution there either.

Perhaps you need something like this. This code snippet is taken from the FFmpeg project (open source), file libavutil/hwcontext_cuda.c:
for (i = 0; i < FF_ARRAY_ELEMS(src->data) && src->data[i]; i++) {
    CUDA_MEMCPY2D cpy = {
        .srcMemoryType = CU_MEMORYTYPE_HOST,
        .dstMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcHost       = src->data[i],
        .dstDevice     = (CUdeviceptr)dst->data[i],
        .srcPitch      = src->linesize[i],
        .dstPitch      = dst->linesize[i],
        .WidthInBytes  = FFMIN(src->linesize[i], dst->linesize[i]),
        .Height        = src->height >> (i ? priv->shift_height : 0),
    };

    ret = CHECK_CU(cu->cuMemcpy2DAsync(&cpy, hwctx->stream));
    if (ret < 0)
        goto exit;
}

I'm not sure how this can be done with NVIDIA/CUDA, as I'm not familiar with it, but this is how I managed to do it with Direct3D (D3D11VA); it might help you translate it to your situation:
(NV12 NVDEC device).CopySubresourceRegion(src NV12 NVDEC texture, srcSubresourceArrayIndex, dst NV12 shared texture)
(Get Shared Handle for the newly created NV12 shared texture)
(Your Device).OpenSharedResource(NV12 shared handle)
(Prepare VideoProcessorInputView, VideoProcessorOutputView and Streams)
(Your Device).VideoProcessorBlt(src NV12 shared handle, dst Your RGBA/BGRA Render Texture)
This process uses video acceleration and happens entirely on the GPU (no CPU/RAM involved). You should also ensure that the GPU adapter supports it.
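Here is a hedged C++ sketch of those steps (illustrative variable names; error handling and the creation of the VideoProcessor, enumerator, and views are omitted):

// Sketch only: 'shared' is assumed to be an NV12 texture created on the
// decoder device with D3D11_RESOURCE_MISC_SHARED.
decoderContext->CopySubresourceRegion(shared, 0, 0, 0, 0,
                                      nv12DecoderTexture, srcArrayIndex, nullptr);

// Get a shared handle for the newly created NV12 shared texture.
IDXGIResource* dxgiRes = nullptr;
HANDLE sharedHandle = nullptr;
shared->QueryInterface(__uuidof(IDXGIResource), (void**)&dxgiRes);
dxgiRes->GetSharedHandle(&sharedHandle);

// Open it on your own device.
ID3D11Texture2D* sharedOnYourDevice = nullptr;
yourDevice->OpenSharedResource(sharedHandle, __uuidof(ID3D11Texture2D),
                               (void**)&sharedOnYourDevice);

// Prepare VideoProcessorInputView, VideoProcessorOutputView and streams, then:
D3D11_VIDEO_PROCESSOR_STREAM vpStream = {};
vpStream.Enable        = TRUE;
vpStream.pInputSurface = inputView;      // view over sharedOnYourDevice
videoContext->VideoProcessorBlt(videoProcessor, outputView, 0, 1, &vpStream);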

Related

How to transfer a float array (without serializing/deserializing) from Scala (JeroMQ) to C (ZMQ)?

Currently, I am using a JSON library to serialize the data at the sender (JeroMQ) and deserialize it at the receiver (C, ZMQ). But while parsing, the JSON library starts to consume a lot of memory and the OS kills the process. So I want to send the float array as-is, i.e. without using JSON.
The existing sender code is below (syn0 and syn1 are Double arrays). If syn0 and syn1 are around 100 MB each, the process is killed while parsing the received arrays, i.e. at the last line of the snippet below:
import org.zeromq.ZMQ
import com.codahale.jerkson
socket.connect("tcp://localhost:5556")
socket.send(json.JSONObject(Map("syn0"->json.JSONArray(List.fromArray(syn0Global)))).toString())
println("SYN0 Request sent”)
val reply_syn0 = socket.recv(0)
println("Response received after syn0: " + new String(reply_syn0))
logInfo("Sending Syn1 request … , size : " + syn1Global.length )
socket.send(json.JSONObject(Map("syn1"->json.JSONArray(List.fromArray(syn1Global)))).toString())
println("SYN1 Request sent")
val reply_syn1 = socket.recv(0)
socket.send(json.JSONObject(Map("foldComplete"->"Done")).toString())
println("foldComplete sent")
// Get the reply.
val reply_foldComplete = socket.recv(0)
val processedSynValuesJson = new String(reply_foldComplete)
val processedSynValues_jerkson = jerkson.Json.parse[Map[String,List[Double]]](processedSynValuesJson)
Can these arrays be transferred without using JSON?
Here I am transferring a float array between two C programs:
//client.c
#include <zmq.h>
#include <stdio.h>

int main (void)
{
    printf ("Connecting to hello world server…\n");
    void *context = zmq_ctx_new ();
    void *requester = zmq_socket (context, ZMQ_REQ);
    zmq_connect (requester, "tcp://localhost:5555");

    int request_nbr;
    float send_buffer[10];
    float recv_buffer[10];
    for (int i = 0; i < 10; i++)
        send_buffer[i] = i;

    for (request_nbr = 0; request_nbr != 10; request_nbr++) {
        printf ("Sending Hello %d…\n", request_nbr);
        zmq_send (requester, send_buffer, 10*sizeof(float), 0);
        zmq_recv (requester, recv_buffer, 10*sizeof(float), 0);
        printf ("Received World %.3f\n", recv_buffer[5]);
    }
    zmq_close (requester);
    zmq_ctx_destroy (context);
    return 0;
}
//server.c
#include <zmq.h>
#include <stdio.h>
#include <assert.h>

int main (void)
{
    // Socket to talk to clients
    void *context = zmq_ctx_new ();
    void *responder = zmq_socket (context, ZMQ_REP);
    int rc = zmq_bind (responder, "tcp://*:5555");
    assert (rc == 0);

    float recv_buffer[10];
    float send_buffer[10];
    while (1) {
        zmq_recv (responder, recv_buffer, 10*sizeof(float), 0);
        printf ("Received Hello\n");
        for (int i = 0; i < 10; i++)
            send_buffer[i] = recv_buffer[i] + 5;
        zmq_send (responder, send_buffer, 10*sizeof(float), 0);
    }
    return 0;
}
Finally, my unsuccessful attempt at doing something similar using Scala (below is the client code):
def main(args: Array[String]) {
  val context = ZMQ.context(1)
  val socket = context.socket(ZMQ.REQ)

  println("Connecting to hello world server…")
  socket.connect("tcp://localhost:5555")

  val msg: Array[Float] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  val bbuf = java.nio.ByteBuffer.allocate(4 * msg.length)
  bbuf.asFloatBuffer.put(java.nio.FloatBuffer.wrap(msg))

  for (request_nbr <- 1 to 10) {
    socket.sendByteBuffer(bbuf, 0)
  }
}
SER/DES? Size? No. An underlying transport-philosophy constraint matters more.
You started with a 0.1 GB transport payload and reported that the JSON library's allocations caused your O/S to kill the process.
Next, in another post, you asked about a 0.762 GB transport payload.
But there is a more important issue in ZeroMQ transport orchestration than the choice of an external SER/DES policy.
No one forbids you from trying to send as big a BLOB as possible, and the JSON-decorated string has already shown you the dark side of such an approach, but there are further reasons not to proceed this way.
ZeroMQ is without question a great and powerful toolbox. Still, it takes some time to gain the insight necessary for a truly smart and highly performant deployment that makes the most of this powerful workhorse.
One side-effect of the feature-rich internal ecosystem "under the hood" is a little-known policy hidden in the message-delivery concept.
One may send any reasonably sized message, but delivery is not guaranteed: a message is either delivered completely, or nothing gets out at all; as said above, nothing is guaranteed.
Ouch?!
Yes, not guaranteed.
Based on this core Zero-Guarantee philosophy, one shall take due care in deciding on steps and measures, the more so if you plan to move Gigabyte BEASTs there and back.
In this very sense, real SUT testing can quantitatively confirm that transporting the whole volume of data segmented into small messages, with careful re-assembly measures (if you indeed still need to move GBs; refer to the comment above, under the OP), ends up much faster and much safer end-to-end than brute force, i.e. instructing the code to dump about a GB of data onto whatever resources happen to be available (the Zero-Copy principle of ZeroMQ cannot and will not per se save you in these efforts).
For details on another hidden trap, related to the not fully Zero-Copy implementation, read Martin SUSTRIK's (co-father of ZeroMQ) remarks on Zero-Copy "till-kernel-boundary-only" (so expect at least double the memory-space allocations).
Solution:
Redesign the architecture so as to propagate small-sized messages, if not to keep the original data structure "mirrored" in the remote process(es), rather than attempting to make one-shot giga-transfers survivable.
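A minimal sketch of such a segmented transfer, kept in C to match the examples above (the PUSH/PULL pairing and the chunk size are illustrative choices of mine, and both sides are assumed to agree on size_t width and endianness):

/* Sender: split the array into chunk-sized messages. */
enum { CHUNK_FLOATS = 256 * 1024 };              /* ~1 MB per message */

void send_in_chunks(void *push, const float *data, size_t n)
{
    zmq_send(push, &n, sizeof n, 0);             /* announce total count */
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t count = n - off < CHUNK_FLOATS ? n - off : CHUNK_FLOATS;
        zmq_send(push, data + off, count * sizeof(float), 0);
    }
}

/* Receiver: re-assemble into a pre-sized buffer (error handling omitted). */
void recv_in_chunks(void *pull, float **out, size_t *n)
{
    zmq_recv(pull, n, sizeof *n, 0);
    *out = (float *) malloc(*n * sizeof(float));
    for (size_t off = 0; off < *n; ) {
        int got = zmq_recv(pull, *out + off, (*n - off) * sizeof(float), 0);
        off += (size_t) got / sizeof(float);
    }
}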
The best next step?
While it will not solve your trouble within a few SLOCs, the best thing, if you are serious about investing your intellectual powers in distributed processing, is to read Pieter HINTJENS' lovely book "Code Connected, Vol. 1".
Yes, it takes some time to generate one's own insight, but it will raise you in many aspects onto another level of professional code design. Worth the time. Worth the effort.
You'll need to serialize the data in some form or fashion: ultimately you're taking a structure in memory on one side and instructing the other side how to rebuild that structure (bonus points for using two separate languages, where the in-memory layout is likely different anyway). I'd suggest trying a different JSON library first, as the library appears to be where the problem lies, but there are more efficient protocols you could be using. Protocol Buffers enjoy good support across many languages; that might be the place I'd start.
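If you do go the raw-bytes route, here is a hedged sketch of the receiving C side without any serialisation layer, using zmq_msg_t so ZeroMQ sizes the buffer for you (you still have to agree on float width and endianness between the JVM and C):

/* Sketch: receive a raw float payload of unknown size. */
zmq_msg_t msg;
zmq_msg_init(&msg);
zmq_msg_recv(&msg, responder, 0);               /* error handling omitted */

size_t n = zmq_msg_size(&msg) / sizeof(float);
float *values = (float *) zmq_msg_data(&msg);   /* valid until close */
/* ... use values[0..n-1] ... */
zmq_msg_close(&msg);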

How can I record sound with 16 bits per sample (16 bit depth)?

I'm trying to record PCM sound from Flash (using the Microphone class), with the org.bytearray.micrecorder.MicRecorder helper class.
In the Microphone class I cannot find a property like bitDepth or bitsPerSample.
I always get 32 bits.
Is it possible to do this?
UPDATE: The asker John812 was able to solve this by using:
bit16_bytes.writeShort( data.readFloat() * 32767 ); see comments below for context
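For readers outside AS3, a hedged C++ illustration of the same conversion (float samples in [-1.0, 1.0] scaled to signed 16-bit PCM; the clamp guards against samples that overshoot the nominal range):

#include <algorithm>
#include <cstdint>
#include <vector>

// Convert 32-bit float samples in [-1.0, 1.0] to signed 16-bit PCM.
std::vector<int16_t> toPcm16(const std::vector<float>& samples)
{
    std::vector<int16_t> out;
    out.reserve(samples.size());
    for (float s : samples) {
        float clamped = std::max(-1.0f, std::min(1.0f, s));
        out.push_back(static_cast<int16_t>(clamped * 32767.0f));
    }
    return out;
}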
METHOD #2: Based on my experience with the loadPCMFromByteArray method
I have something you could try, but I've only used it with an actual 32-bit WAVE file, played back via the loadPCMFromByteArray command.
The AS3 Microphone class records 32-bit samples; you have to convert them to a different bit depth yourself. I have no idea how many samples you are processing, but the general code below shows how to convert. Note: * 512 means use your actual sample amount (for example * 4096 or * 8192). If you get the numbers wrong there will be hiss/distortion, so either experiment upward from small values or provide the full details in your question for a more helpful edit/answer.
CONVERT: Assuming your recorded ByteArray is called data:
public var bit16_bytes : ByteArray; //will hold the 16bit version

public function convert_to16Bit () : void
{
    bit16_bytes = new ByteArray();
    data.position = 0;

    //if you get noise/distortion try either: 256, 512, 1024, 2048, 4096 or 8192
    while (data.bytesAvailable >= 4) //read while a full 32-bit sample remains
    {
        bit16_bytes.writeShort( data.readInt() * 512 ); //multiply by samples amount
    }

    data = new ByteArray();        //recycle for re-use
    bit16_bytes.position = 0;      //reset or else E-O-File error
    bit16_bytes.readBytes( data ); //copy 16bit back into Data byte-array
}
To run the above function whenever you're ready, just call convert_to16Bit() inside whatever function handles your "recording complete" situation.

Actionscript 3: Endianness of bytearray via websocket

I store audio recorded from the microphone in a ByteArray and send it via the AS3WebSocket library (https://github.com/Worlize/AS3WebSocket) to a server:
private function processMicInput(event:SampleDataEvent):void {
    if (isRecording) {
        while (event.data.bytesAvailable) {
            recordingBuffer.writeShort(event.data.readFloat()*0x7fff);
        }
        websocket.sendBytes(recordingBuffer);
        recordingBuffer.clear();
    }
}
However, I want the data to be little-endian, and it doesn't seem to matter whether I set the recordingBuffer ByteArray to little-endian or big-endian: it always gets sent as big-endian.
Internally, the AS3WebSocket library seems to use a socket that is set to big-endian. Is this the problem?
If so, how can I work around it?
Interesting. In the library, under sendBytes, you can see that the data gets copied to two buffers on its way out. I'm not sure it's safe to modify what's happening inside the library, but you could change the order of the bytes as you write them into recordingBuffer:
while (event.data.bytesAvailable) {
    var val:int = event.data.readFloat()*0x7fff;
    recordingBuffer.writeShort( ((val >> 8) & 0xff) | ((val & 0xff) << 8) );
}
If you need to flip whole 32-bit words, you'll have to get fancier (see the sketch below). LMK if this is the case.
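For reference, the same shift/mask trick written out in C++ (illustration only), including the 32-bit variant:

#include <cstdint>

// Swap the two bytes of a 16-bit word (same trick as the AS3 line above).
uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

// Swap all four bytes of a 32-bit word.
uint32_t swap32(uint32_t v) {
    return ((v >> 24) & 0x000000ffu) |
           ((v >>  8) & 0x0000ff00u) |
           ((v <<  8) & 0x00ff0000u) |
           ((v << 24) & 0xff000000u);
}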
Good luck!

How can I optimise this method?

I have been working on an assets class that can generate dynamic TextureAtlas objects whenever I need them. The specific method is Assets.generateTextureAtlas(), and I am trying to optimise it as much as possible, as I quite frequently need to regenerate texture atlases and was hoping to beat my 53 ms average.
53 ms currently costs me about 3 frames, which can add up quickly with the number of items I need to pack inside a texture atlas and the frequency with which I need to generate them. So an answer covering all the pitfalls within my code would be great.
The entire class code is available here in a github gist.
The RectanglePacker class simply packs rectangles as closely together as possible (similar to TexturePacker) and can be found here.
For reference, here is the method:
public static function generateTextureAtlas(folder:String):void
{
    if (!_initialised) throw new Error("Assets class not initialised.");

    if (_renderTextureAtlases[folder] != null)
    {
        (_renderTextureAtlases[folder] as TextureAtlas).dispose();
    }

    var i:int;
    var image:Image = new Image(_blankTexture);
    var itemName:String;
    var itemNames:Vector.<String> = Assets.getNames(folder + "/");
    var itemsTexture:RenderTexture;
    var itemTexture:Texture;
    var itemTextures:Vector.<Texture> = Assets.getTextures(folder + "/");
    var noOfRectangles:int;
    var rect:Rectangle;
    var rectanglePacker:RectanglePacker = new RectanglePacker();
    var texture:Texture;

    noOfRectangles = itemTextures.length;
    if (noOfRectangles == 0)
    {
        return;
    }

    for (i = 0; i < noOfRectangles; i++)
    {
        rectanglePacker.insertRectangle(Math.round(itemTextures[i].width),
                                        Math.round(itemTextures[i].height), i);
    }

    rectanglePacker.packRectangles();

    if (rectanglePacker.rectangleCount != noOfRectangles)
    {
        throw new Error("Only " + rectanglePacker.rectangleCount + " out of " +
                        noOfRectangles + " rectangles packed for folder: " + folder);
    }

    itemsTexture = new RenderTexture(rectanglePacker.width, rectanglePacker.height);

    itemsTexture.drawBundled(function():void
    {
        for (i = 0; i < noOfRectangles; i++)
        {
            itemTexture = itemTextures[rectanglePacker.getRectangleId(i)];
            rect = rectanglePacker.getRectangle(i, rect);
            image.texture = itemTexture;
            image.readjustSize();
            image.x = rect.x + itemTexture.frame.x;
            image.y = rect.y + itemTexture.frame.y;
            itemsTexture.draw(image);
        }
    });

    _renderTextureAtlases[folder] = new TextureAtlas(itemsTexture);

    for (i = 0; i < noOfRectangles; i++)
    {
        itemName = itemNames[rectanglePacker.getRectangleId(i)];
        itemTexture = itemTextures[rectanglePacker.getRectangleId(i)];
        rect = rectanglePacker.getRectangle(i);
        (_renderTextureAtlases[folder] as TextureAtlas).addRegion(itemName, rect, itemTexture.frame);
    }
}
Reading the whole project and finding everything that could be optimised would certainly take time.
Start by removing the repeated rectanglePacker.getRectangleId(i) calls inside loops.
For example:
itemName = itemNames[rectanglePacker.getRectangleId(i)];
itemTexture = itemTextures[rectanglePacker.getRectangleId(i)];
rect = rectanglePacker.getRectangle(i);
could perhaps become:
var id:int = rectanglePacker.getRectangleId(i);
itemName = itemNames[id];
itemTexture = itemTextures[id];
rect = rectanglePacker.getRectangle(i);
if getRectangleId does indeed just 'get an id' and not set anything.
I think the bigger issue at hand is this: why oh why do you HAVE to do this at run-time, in a situation where it can't be allowed to take this much time? This IS an expensive operation, and no matter how much you optimise it you will probably still end up at around 40 ms or so when done in AS3.
This is why these kinds of operations should be done at compile time, or during loading screens or other transitions, when the frame rate is not critical and you can afford the cost.
Alternatively, create another system in C++ or some other language that can actually handle the number-crunching and hands you the finished result.
Also, when it comes to checking performance: yes, the entire function takes 53 ms, BUT where are those milliseconds actually spent? 53 ms on its own says nothing; the overall profiling only found the culprit. You need to break the function down into smaller chunks to gather reliable information about what ACTUALLY takes time inside it.
Inside that function you have three for loops, several calls to other classes, casts, deletes, and creations. It's not as if you are doing one thing; that function probably expands to roughly 500 lines of code and a bazillion CPU operations, and you have no idea where the time is spent. I would guess that rectanglePacker.packRectangles() takes 60% of it, but without profiling, neither you nor we know what to optimise; we simply don't have sufficient data.
If you HAVE to do this at run-time in AS3, I would recommend spreading the work out over several frames, distributing the load evenly across 10 frames or so. You could also do it with the help of another thread and workers. But most of all, this looks like a design error, since the work could probably be done at another time, and if not, then in another language better suited to these kinds of operations.
The easiest way to profile this is to add a couple of timestamps, similar to:
var timestamps:Array = [];
then push getTimer() into it at different places in the code, and print the values out when the function is done.
As others said, it's unlikely that the cause of the bad performance is unoptimised AS code. Output from a profiler (Adobe Scout, for example) would be very helpful. However, if your purpose is just to add new textures, I can suggest a couple of optimisations:
Why do you need to regenerate the whole atlas every time (calling Assets.getTextures() and creating a new render texture)? Why not just add new items to the existing atlas? Creating a new RenderTexture (and thus a new texture in GPU memory) is a very costly operation, because it requires a sync between the CPU and the GPU. Drawing into an existing RenderTexture, on the other hand, is carried out entirely on the GPU, so it takes much less time.
If you place every item on a grid, you can avoid using RectanglePacker altogether, as all of your rectangles can share the same dimensions, matching the dimensions of a grid cell (see the sketch below).
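A C++-style sketch of the grid idea (cell size and names are illustrative): with fixed cell dimensions, each item's slot is pure arithmetic, so no packing pass is needed.

// Sketch: place item i on a fixed grid instead of packing rectangles.
const int cellW = 64, cellH = 64;              // illustrative cell size
const int cols  = atlasWidth / cellW;

for (int i = 0; i < itemCount; ++i) {
    int x = (i % cols) * cellW;
    int y = (i / cols) * cellH;
    // draw item i into the atlas at (x, y)
}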
Edit:
To clarify: some time ago I had a similar problem, where I had to add new items to an existing atlas on a regular basis, and the performance of this operation was quite acceptable (about 8 ms on an iPad 3 using a 1024x1024 dynamic texture). But I reused the same RenderTexture and the same Sprite object that contained my dynamic atlas items. When I need to add a new item, I just create a new Image with the desired texture (stand-alone or from another static atlas), place it inside the Sprite container, and redraw this container into the RenderTexture. Deleting or modifying an item works similarly.

RTMP_Write function use

I'm trying to use the librtmp library, and it worked pretty well for pulling a stream. But now I am trying to publish a stream, and for that I believe I have to use the RTMP_Write function.
What I am trying to accomplish here is a simple C++ program that reads from a file and pushes the stream to a CRTMP server. The connection and stream creation are fine, but I'm quite puzzled by the use of RTMP_Write.
Here is what I did:
int Upload(RTMP * rtmp, FILE * file){
    int nRead = 0;
    unsigned int nWrite = 0;
    int diff = 0;
    int bufferSize = 64 * 1024;
    int byteSum = 0;
    int count = 0;
    char * buffer;

    buffer = (char *) malloc(bufferSize);

    do{
        nRead = fread(buffer+diff,1,bufferSize-diff,file);
        if(nRead != bufferSize){
            if(feof(file)){
                RTMP_LogPrintf("End of file reached!\n");
                break;
            }else if(ferror(file)){
                RTMP_LogPrintf("Error reading from file stream detected\n");
                break;
            }
        }
        count += 1;
        byteSum += nRead;
        RTMP_LogPrintf("Read %d from file, Sum: %d, Count: %d\n",nRead,byteSum,count);

        nWrite = RTMP_Write(rtmp,buffer,nRead);
        if(nWrite != nRead){
            diff = nRead - nWrite;
            memcpy(buffer,(const void*)(buffer+bufferSize-diff),diff);
        }
    }while(!RTMP_ctrlC && RTMP_IsConnected(rtmp) && !RTMP_IsTimedout(rtmp));

    free(buffer);
    return RD_SUCCESS;
}
In this Upload function I receive the already initialized RTMP structure and a pointer to an open file.
This actually works and I can see some video being displayed, but it soon gets lost and stops sending packets. I managed to work out that this happens whenever the buffer I set up (which I arbitrarily sized at 64 KB, for no special reason) happens to split the FLV tag (http://osflash.org/flv#flv_format) of a new packet.
To handle that, I modified the RTMP_Write function so that it verifies whether it will be able to decode a whole FLV tag (packet type, body size, timestamp, etc.); if it will not, it just returns the number of useful bytes left in the buffer:
if(s2 - 11 <= 0){
    rest = size - s2;
    return rest;
}
The calling code above takes notice of this: if the value returned by RTMP_Write is not the number of bytes it was supposed to send, it knows that the returned value is the number of useful bytes left in the buffer. It then copies those bytes to the beginning of the buffer and reads more from the file.
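For reference, the tag-completeness test I described, written out as a stand-alone sketch (my own illustration, not the actual librtmp code): an FLV tag header is 11 bytes (type, 24-bit body size, timestamp, stream id), and a full tag is the header plus the body plus the 4-byte PreviousTagSize field.

// Sketch: does the buffer hold at least one complete FLV tag?
bool tagComplete(const unsigned char *p, size_t avail)
{
    if (avail < 11)
        return false;                           // header itself incomplete
    size_t bodySize = ((size_t)p[1] << 16) | ((size_t)p[2] << 8) | p[3];
    return avail >= 11 + bodySize + 4;          // header + body + PrevTagSize
}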
But I keep running into problems with it, so I was wondering: what is the correct use of this function anyway? Is there a specific buffer size I should be using? (I don't think so.) Or is the function itself buggy?