cudaGraphicsMapResources slow speed when mapping DirectX texture - cuda

I'm writing to a texture in DirectX and then reading from it in a CUDA kernel. I call cudaGraphicsMapResources before launching the kernel, and sometimes it takes 10-30 ms, which of course causes a frame drop in the application. The texture is only written in DirectX and only read in CUDA; it isn't used anywhere else.
I tried different things, like waiting a few frames, but that doesn't always help. I also tried calling cudaGraphicsMapResources only once at the beginning (instead of calling it each time), but then I have no guarantee that DirectX has already finished writing the texture (sometimes it hasn't). I tried using threads, but it crashes when I call cudaGraphicsMapResources from a different thread.
I also have the impression that it mostly occurs when vsync is enabled.
Is this a known problem? What causes it? Is there a way to test whether the resource is ready in a non-blocking way? Or, in general, is there some workaround?
I have GeForce GTX 670, Windows 7 64 bit, driver ver. 331.82.

From the CUDA documentation on cudaGraphicsMapResources():
This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.
It could be that the delays you are seeing are caused by waiting for the drawing to complete, particularly since you mention that, when you don't map for each frame, the drawing sometimes hasn't completed.
Combining this with vsync could make the problem worse since graphics calls may have to wait for the next vsync before they start drawing.
A partial workaround for the issue when vsync is in use may be to use more back buffers.
If you haven't already, you could also try to call cudaGraphicsResourceSetMapFlags() with cudaGraphicsMapFlagsReadOnly.
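For reference, a minimal sketch of what that could look like, assuming Direct3D 11 interop; the d3dTexture and stream names are placeholders for your own texture and CUDA stream, and error checking is omitted:

#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>
#include <d3d11.h>

// One-time setup: register the D3D texture and mark it read-only for CUDA.
cudaGraphicsResource_t cudaRes = nullptr;
cudaGraphicsD3D11RegisterResource(&cudaRes, d3dTexture, cudaGraphicsRegisterFlagsNone);
cudaGraphicsResourceSetMapFlags(cudaRes, cudaGraphicsMapFlagsReadOnly);

// Per frame: map, fetch the underlying array, launch the kernel, unmap.
cudaGraphicsMapResources(1, &cudaRes, stream);
cudaArray_t texArray = nullptr;
cudaGraphicsSubResourceGetMappedArray(&texArray, cudaRes, 0, 0);
// ... bind texArray to a texture object and launch the kernel on `stream` ...
cudaGraphicsUnmapResources(1, &cudaRes, stream);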
Edit:
I think it waits only for drawing calls made by your own app to complete. The docs say:
The graphics API from which resources were registered should not access any resources while they are mapped by CUDA. If an application does so, the results are undefined.
And, of course, you have no control over drawing performed by other apps.
You may be able to check drawing status without blocking by calling the Present() Direct3D method with the D3DPRESENT_DONOTWAIT flag.
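If you are on Direct3D 9Ex, that check might look roughly like the sketch below (d3dDevice9Ex is a placeholder for your device pointer; on DXGI/Direct3D 11 the analogous flag is DXGI_PRESENT_DO_NOT_WAIT):

// Ask the driver to present without blocking; if the GPU is still busy with
// previously queued work, the call returns immediately with an error code.
HRESULT hr = d3dDevice9Ex->PresentEx(nullptr, nullptr, nullptr, nullptr, D3DPRESENT_DONOTWAIT);
if (hr == D3DERR_WASSTILLDRAWING)
{
    // Drawing hasn't finished yet: skip mapping this frame and try again next frame.
}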

Related

Determine if cuda device in use?

Is there a way to directly test whether a cuda device is currently in use by any kernels?
I have a background thread that launches "raw" cuda kernels at full occupancy for a fractal program. The thread builds up large image arrays that I then want to let the user smoothly pan, rotate and zoom.
My GUI thread would like to use the GPU, if it is not currently in use, for the large image transformations, since the GPU path runs at 100 fps. If the GPU is in use I can fall back to CPU code instead at 10-20 fps.
If the GUI-thread GPU code is used when a background thread kernel is already running then the GUI-thread will freeze noticeably until the background kernel finishes. This freezing is what I'm seeking to eliminate by switching instead to CPU code for those frames. I've looked into interrupting the background kernel but solutions I've seen that do this add computational cost to the kernel and/or reset the context, both of which seem like overkill.
Is there a way to directly (asynchronously) detect whether the GPU is in use (by any kernel)? I suppose the GPU is always technically in use as a 2-D display driver, so excluding that activity of course.
My workaround would be to have a flag in my program which keeps track of whether all the kernels have completed. I would need to pass that flag between the two host threads and between the most nested objects within Model and View in my program. I started writing this and thought it was a bit of a messy solution and even then not always 100% accurate. So I wondered if there was a better way and in particular if the GPU could be tested directly at the point in the GUI thread that the decision is needed on whether to use GPU or CPU code for the next frame.
I'm using Python 3.7, with CuPy to access the GPU, but I would be willing to try to adapt a C++ solution.
I've looked in the docs, but with only basic knowledge of cuda it feels like looking for a needle in a haystack:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE
This is the solution I used, following help from @RobertCrovella.
import cupy as cp

# Non-blocking check: True once all work queued on the current stream has finished.
stream_done: bool = cp.cuda.get_current_stream().done

if stream_done or worker_ready:
    ...  # use cupy to draw the next frame
else:
    ...  # use numpy to draw the next frame
Where worker_ready is a bool passed from the background worker GPU thread indicating its activity.
For stream_done, see the docs. In my program I'm only using one CUDA stream, the (unspecified) default stream; otherwise I imagine you would need to test each stream, depending on the problem.
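For anyone adapting this to C++, the runtime API offers the same non-blocking check via cudaStreamQuery; a minimal sketch (the default stream is used here to mirror the Python code above):

#include <cuda_runtime.h>

// Returns true if every operation queued on `stream` has completed.
// cudaStreamQuery never blocks: it reports cudaSuccess when the stream is idle
// and cudaErrorNotReady while work is still in flight.
bool streamDone(cudaStream_t stream = 0)
{
    return cudaStreamQuery(stream) == cudaSuccess;
}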
After a lot of testing I found that:
cp.cuda.get_current_stream().done is True in the background thread immediately after the kernel has run, but it can then become False at the point where I need to do the test, even though my code does not touch the GPU between the True and False states. I haven't been able to explain this behaviour, but I found I could not rely solely on stream_done. My testing suggests that if stream_done is True at the point required, it is always safe to use the GPU; if stream_done is False, it may or may not be safe to use the GPU.
I also have the background thread fire an event when it starts and stops; this event changes the worker_ready bool for the GUI thread. My testing showed worker_ready was more accurate than stream_done for determining whether the GPU could be used. In cases where stream_done was True and worker_ready was False, my testing showed the GPU code would still run quickly, presumably because the background thread was executing CPU code at that point in time.
So the best solution to the problem as I asked it was to use the GPU code if either condition is met. However even this didn't remove the visual lag I was seeking to eliminate.
The problem I was trying to solve was that when a background process is running on the GPU and the user tries to pan, there is occasionally a noticeable lag of at least 0.5 s. I attempted to quantify this lag by measuring the time from the mouse press to the panned image being displayed. The time delay measured was 0.1 s or less. Therefore, no matter how fast the code after the mouse click is, it cannot remove the lag, whether it uses the GPU or the CPU.
To me this implies that the initial mouse-press event itself is delayed in firing when the GPU is occupied, presumably because the GPU is also running the display driver. I don't have any solid evidence of this beyond:
If the background thread does not run then the lag is removed.
Making the kernels orders of magnitude shorter did not reduce the lag at all.
Increasing the block_size to move away from full occupancy seemed to remove the lag most of the time, although it did not eliminate it completely.

Direct3D texture resource life cycle

I have been working on a project with Direct3D on Windows Phone. It is just a simple game with 2D graphics, and I make use of DirectXTK to help me out with sprites.
Recently, I came across an out-of-memory error while debugging on the 512 MB emulator. The error was not common and was the result of a sequence of open, suspend, open, suspend, ...
Tracking it down, I found out that the textures are loaded on every activation of the app, eventually filling up the allowed memory. To solve it, I will probably edit the code so that textures are loaded only on a fresh launch and not on every activation from suspend; but after this problem I am curious about the correct life-cycle management of texture resources. While searching I came across Automatic (or "managed" by Microsoft) Texture Management http://msdn.microsoft.com/en-us/library/windows/desktop/bb172341(v=vs.85).aspx, which can probably help with some management of the textures in video memory.
However, I would also like to know about other methods, since I couldn't figure out a good way to incorporate managed textures into my code.
My best idea is to call the Release method on the ID3D11ShaderResourceView pointers I store, in the destructors, to prevent filling up the memory; but how do I ensure textures remain resident in memory while other apps want to use it (the memory)?
Windows phone uses Direct3D 11 which 'virtualizes' the GPU memory. Essentially every texture is 'managed'. If you want a detailed description of this, see "Why Your Windows Game Won't Run In 2,147,352,576 Bytes?". The link you provided is Direct3D 9 era for Windows XP XPDM, not any Direct3D 11 capable platform.
It sounds like the key problem is that your application is leaking resources or has too large a working set. You should enable the debug device and first make sure you have cleaned up everything as you expected. You may also want to check that you are following the recommendations for launching/resuming on MSDN. Also keep in mind that Direct3D 11 uses 'deferred destruction' of resources so just because you've called Release everywhere doesn't mean that all the resources are actually gone... To force a full destruction, you need to use Flush.
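A sketch of what that debugging setup could look like, assuming you hold the usual ID3D11Device/ID3D11DeviceContext pointers (the device and context names below are placeholders):

// Create the device with the debug layer so leaked objects are reported.
UINT flags = D3D11_CREATE_DEVICE_DEBUG;
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                  nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);

// After releasing your resources, force the deferred destruction to actually run
// and dump whatever is still alive to the debug output.
context->ClearState();
context->Flush();

ID3D11Debug* d3dDebug = nullptr;
if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11Debug), reinterpret_cast<void**>(&d3dDebug))))
{
    d3dDebug->ReportLiveDeviceObjects(D3D11_RLDO_DETAIL);
    d3dDebug->Release();
}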
With Windows phone 8.1 and Direct3D 11.2, there is a specific Trim functionality you can use to reduce an app's memory footprint, but I don't think that's actually your issue.
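For completeness, the Trim call mentioned above lives on IDXGIDevice3 and is meant to be called when the app is suspending; roughly (again just a sketch, with device as a placeholder for your ID3D11Device):

// On suspend, let the driver discard temporary GPU memory held for this app.
IDXGIDevice3* dxgiDevice = nullptr;
if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice3), reinterpret_cast<void**>(&dxgiDevice))))
{
    dxgiDevice->Trim();
    dxgiDevice->Release();
}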

How to uninitialise CUDA?

CUDA implicitly initialises when the first CUDA runtime function is called.
I'm timing the runtime of my code, repeating the measurement 100 times via a loop (for([100 times]) {[Time CUDA code and log]}), and the timing also needs to take into account the initialisation time for CUDA at each iteration. Thus I need to uninitialise CUDA after every iteration - how do I do this?
I've tried using cudaDeviceReset(), but it seems not to have uninitialised CUDA.
Many thanks.
cudaDeviceReset is the canonical way to destroy a context in the runtime API (and calling cudaFree(0) is the canonical way to create a context). Those are the only levels of "re-initialization" available to a running process. There are other per-process events which happen when a process loads the CUDA driver and runtime libraries and connects to the kernel driver, but there is no way I am aware of to make those happen programmatically short of forking a new process.
But I really doubt you want, or should need, to account for this sort of setup time when calculating performance metrics anyway.
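If you do want the per-iteration timings to include context creation anyway, the loop could look something like this minimal sketch (the kernel launch is left as a placeholder):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    for (int i = 0; i < 100; ++i)
    {
        auto t0 = std::chrono::high_resolution_clock::now();

        cudaFree(0);                 // forces context creation (the implicit initialisation)
        // ... launch and time the kernel under test here ...
        cudaDeviceSynchronize();

        auto t1 = std::chrono::high_resolution_clock::now();
        std::printf("iteration %d: %.3f ms\n", i,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());

        cudaDeviceReset();           // destroys the context so the next iteration re-initialises
    }
    return 0;
}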

Limitations of work-item load in GPU? CUDA/OpenCL

I have a compute-intensive image algorithm that, for each pixel, needs to read many distant pixels. The distance depends on a constant defined at compile time. My OpenCL algorithm performs well, but beyond a certain maximum distance - which results in heavier for loops - the driver seems to bail out. The screen goes black for a couple of seconds and then the command queue never finishes. A balloon message reveals that the driver is unhappy:
"Display driver AMD driver stopped responding and has successfully recovered."
(Running this on OpenCL 1.1 with an AMD FirePro V4900 (FireGL V) Graphics Adapter.)
Why does this occur?
Is it possible to, beforehand, tell the driver that everything is ok?
This is a known "feature" under Windows (not sure about Linux): if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long looks like a frozen driver. There is a watchdog timer that keeps track of this (2 seconds by default).
Your options are:
You need to make sure that your kernels are not too time-consuming (best).
You can turn off the watchdog timer: see Timeout Detection and Recovery of GPUs.
You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
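A common way to implement option 1 is to split the image into bands and enqueue one smaller NDRange per band, so no single launch approaches the watchdog limit. A rough sketch of the host-side loop, assuming you already have queue, kernel, image_width and image_height set up (those names are placeholders):

#include <CL/cl.h>
#include <algorithm>

// Process the image in horizontal bands so each enqueue finishes quickly.
// Inside the kernel, get_global_id(1) already includes the offset.
const size_t band_height = 64;   // tune so one band stays well under the watchdog limit
for (size_t y = 0; y < image_height; y += band_height)
{
    size_t offset[2] = {0, y};
    size_t global[2] = {image_width, std::min(band_height, image_height - y)};

    clEnqueueNDRangeKernel(queue, kernel, 2, offset, global,
                           nullptr, 0, nullptr, nullptr);
    clFinish(queue);             // give the driver a chance to service the display between bands
}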

How to determine why a task gets destroyed in VxWorks?

I have a VxWorks application running on an ARM microcontroller.
First, let me summarize the application:
The application consists of a 3rd-party stack and a gateway application.
We have implemented an operating-system abstraction layer (OSA) to support OS independence.
The underlying stack has its own memory management and control facility, which holds memory blocks in a doubly linked list.
For instance, we don't directly call malloc/new or free/delete. Instead we call the OSA layer's routines (XXAlloc, XXFree, XXReAlloc); the layer gets the memory from the OS, puts it in a list, and then returns this memory to the application.
And when freeing the memory we again use XXFree.
In fact each block is a struct which holds:
-magic numbers indicating the beginning and end of the memory block
-the size the user requested
-the real size, due to alignment
-previous and next pointers
-a pointer to the piece of memory given back to the application
-the link register value showing where in the application xxAlloc was called
With this block structure the stack can check whether a block is corrupted or not.
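For illustration, the header being described might look roughly like this (the field names are invented for the sketch, not taken from the actual stack):

#include <stddef.h>
#include <stdint.h>

/* Illustrative block header kept by the OSA allocator in a doubly linked list. */
typedef struct MemBlock
{
    uint32_t         magicStart;     /* marks the beginning of the block        */
    size_t           requestedSize;  /* size the caller asked for               */
    size_t           actualSize;     /* real size after alignment padding       */
    struct MemBlock *prev;           /* doubly linked list of live allocations  */
    struct MemBlock *next;
    void            *userPtr;        /* pointer handed back to the application  */
    void            *callerLR;       /* link register: where xxAlloc was called */
    /* payload follows; a trailing magic number marks the end of the block */
} MemBlock;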
We also have a pthread library, ported from Linux, which we use to
-create/terminate threads (currently there are 22 threads)
-provide synchronization objects (events, mutexes, ...)
There is a main task created with taskSpawn, and this task later creates the other threads.
That was a description of the application and its VxWorks interface.
The problem is:
One of the tasks suddenly gets destroyed by VxWorks, with no information about what went wrong.
I also have a JTAG debugger, and it hits the VxWorks taskDestroy() routine, but the call stack doesn't give any information, neither the PC nor r14.
I'm suspicious of a specific routine in the code where a huge xxAlloc is done, but the problem occurs very sporadically, giving no clue that I can map back to the source code.
I think the OS detects an exception and handles it silently.
Any help would be great.
Regards
It's resolved.
I did an isolated test: I allocated 20 MB with malloc, filled it with 0x55 using memset, and stopped my application's thread.
Then I wrote another thread which checks whether anything other than 0x55 has been written into my 20 MB.
And guess what: some other thread, belonging to other components on the CPU (developed by someone else), was writing into my allocated space.
Thanks for your help.
If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.
Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) You can get more insight by instrumenting the code and/or using VxWorks instrumentation (depending on which version). This allows you to get more visibility into what happens. Be sure to log everything to a file, so you can move back in time from the point where the task ends. Instrumentation is a worthwhile investment, as it will be handy on more occasions. Interesting hooks in VxWorks: taskHookLib (a sketch of a delete hook follows after these points).
2) Memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-threaded environment. If you have done this and no errors are found, I'd first start to look at why the task has ended.
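As a concrete starting point for 1), a delete hook from taskHookLib can log which task is going away; a minimal sketch (the hook name and message are illustrative, assuming a classic wind-kernel VxWorks):

#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <logLib.h>

/* Called by the kernel whenever any task is about to be deleted. */
void myDeleteHook (WIND_TCB *pTcb)
{
    logMsg ("task being deleted, TCB = 0x%x\n", (int)pTcb, 0, 0, 0, 0, 0);
}

void installDeleteHook (void)
{
    taskDeleteHookAdd ((FUNCPTR)myDeleteHook);
}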
Other possible causes
A task will also end when its work is done, so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.