Determine if CUDA device is in use?

Is there a way to directly test whether a cuda device is currently in use by any kernels?
I have a background thread that launches "raw" cuda kernels at full occupancy for a fractal program. The thread builds up large image arrays that I then want to let the user smoothly pan, rotate and zoom.
My GUI thread would like to use the GPU for the large image transformations if it is not currently in use, since the GPU path runs at 100 fps. If the GPU is in use I can fall back to CPU code instead at 10-20 fps.
If the GUI-thread GPU code is used when a background thread kernel is already running then the GUI-thread will freeze noticeably until the background kernel finishes. This freezing is what I'm seeking to eliminate by switching instead to CPU code for those frames. I've looked into interrupting the background kernel but solutions I've seen that do this add computational cost to the kernel and/or reset the context, both of which seem like overkill.
Is there a way to directly (asynchronously) detect whether the GPU is in use (by any kernel)? I suppose the GPU is always technically in use since it also drives the 2-D display, so excluding that activity of course.
My workaround would be to have a flag in my program which keeps track of whether all the kernels have completed. I would need to pass that flag between the two host threads and between the most nested objects within Model and View in my program. I started writing this and thought it was a bit of a messy solution and even then not always 100% accurate. So I wondered if there was a better way and in particular if the GPU could be tested directly at the point in the GUI thread that the decision is needed on whether to use GPU or CPU code for the next frame.
I'm using Python 3.7, with CuPy to access the GPU, but I would be willing to adapt a C++ solution.
I've looked in the docs, but with only basic knowledge of CUDA it feels like looking for a needle in a haystack:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE

This is the solution I used following help from @RobertCrovella.
import cupy as cp

stream_done: bool = cp.cuda.get_current_stream().done
if stream_done or worker_ready:
    ...  # use cupy to draw the next frame
else:
    ...  # use numpy to draw the next frame
Where worker_ready is a bool passed from the background worker GPU thread indicating its activity.
For stream_done, see the docs. In my program I'm only using one CUDA stream, the (unspecified) default stream. Otherwise I imagine you would need to test each stream, depending on the problem.
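For anyone adapting this to C++, the non-blocking equivalent of the .done check in the CUDA runtime API is cudaStreamQuery(), which returns cudaSuccess once all work queued in the stream has completed and cudaErrorNotReady while work is still pending. A rough, untested sketch (not part of my original solution):
#include <cuda_runtime.h>

// Returns true if all work previously queued in 'stream' has finished.
// Passing 0 (the default stream) mirrors the CuPy snippet above.
bool streamDone(cudaStream_t stream = 0)
{
    cudaError_t status = cudaStreamQuery(stream);
    if (status == cudaSuccess) return true;        // stream idle, safe to use the GPU
    if (status == cudaErrorNotReady) return false; // kernels still running
    return false;                                  // any other code is a real error; report it as needed
}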
After a lot of testing I found that:
cp.cuda.get_current_stream().done is True in the background thread immediately after the kernel has run, but it can then become False at the point where I need to do the test, despite my code not calling the GPU between the True and the False states. I haven't been able to explain this behaviour, but I found I could not rely solely on stream_done. My testing suggests that if stream_done is True at the point required then it is always safe to use the GPU; if stream_done is False it may or may not be safe to use the GPU.
I also have the background thread fire an event when it starts and stops; this event updates the worker_ready bool for the GUI thread. My testing showed worker_ready was more accurate than stream_done for determining whether the GPU could be used. In cases where stream_done was True and worker_ready was False, my testing showed the GPU code would still run quickly, presumably because the background thread was performing CPU work at that point in time.
So the best solution to the problem as I asked it was to use the GPU code if either condition is met. However, even this didn't remove the visual lag I was seeking to eliminate.
The problem I was trying to solve was that when a background process is running on the GPU and the user tries to pan, there is occasionally a noticeable lag of at least 0.5s. I attempted to quantify this lag by measuring the time from mouse press to the panned image being displayed. The delay measured was 0.1s or less. Therefore, no matter how fast the code runs after the mouse click, it cannot remove the lag, whether it uses the GPU or the CPU.
To me this implies that the initial mouse press event itself is delayed in firing when the GPU is occupied, presumably because the GPU is also driving the display. I don't have any solid evidence of this beyond the following:
If the background thread does not run then the lag is removed.
Making the kernels orders of magnitude shorter did not reduce the lag at all.
Increasing the block_size to move away from full occupancy seemed to remove the lag most of the time, although it did not eliminate it completely.

Related

Transferring data to GPU while kernel is running to save time

GPUs are really fast when it comes to parallel computation and outperform CPUs by a factor of 15-30 (some have reported even 50); however,
GPU memory is very limited compared to CPU memory, and communication between GPU memory and the CPU is not as fast.
Let's say we have some data that won't fit into GPU RAM but we still want to use its wonders to compute. What we can do is split that data into pieces and feed it to the GPU piece by piece.
Sending large data to the GPU can take time, and one might think: what if we split a data piece in two, feed the first half, run the kernel, and then feed the other half while the kernel is running?
By that logic we should save some time, because the data transfer would be going on while the computation is running, hopefully not interrupting it, and when the kernel finishes it can just continue its job without waiting for a new data transfer.
I must say that I'm new to GPGPU and new to CUDA, but I have been experimenting with simple CUDA code and have noticed that the function cudaMemcpy used to transfer data between CPU and GPU will block if a kernel is running. It will wait until the kernel is finished and then do its job.
My question: is it possible to accomplish something like what is described above, and if so, could someone show an example or provide some information on how it could be done?
Thank you!
is it possible to accomplish something like that described above
Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.
The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
I would also suggest this presentation which covers most of what is needed.
Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.
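For a rough sketch of what such a pipeline can look like (the kernel, chunk count and sizes below are invented for illustration; the host buffer must be pinned with cudaMallocHost for the copies to actually overlap the kernels):
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

int main()
{
    const int N = 1 << 22, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float));   // pinned host memory, required for async copies
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t streams[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&streams[c]);

    for (int c = 0; c < CHUNKS; ++c) {
        int offset = c * CHUNK;
        // copy chunk c while earlier chunks may still be computing in other streams
        cudaMemcpyAsync(d_data + offset, h_data + offset, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process<<<(CHUNK + 255) / 256, 256, 0, streams[c]>>>(d_data + offset, CHUNK);
        cudaMemcpyAsync(h_data + offset, d_data + offset, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();   // wait for all chunks to finish

    for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}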

cudaGraphicsMapResources slow speed when mapping DirectX texture

I'm writing to a texture in DirectX, then reading from it in a CUDA kernel. I'm using cudaGraphicsMapResources before launching the kernel. Sometimes it takes 10-30 ms, which of course causes a frame drop in the application. The texture is only written in DirectX and only read in CUDA; it is not used anywhere else.
I tried different things, like waiting a few frames, but that doesn't always help. I also tried calling cudaGraphicsMapResources only once at the beginning (instead of calling it each time), but then I have no guarantee that DirectX has already finished writing the texture (sometimes it hasn't). I tried using threads, but it crashes when I call cudaGraphicsMapResources from a different thread.
I also have the impression that it mostly occurs when vsync is enabled.
Is this a known problem? What causes this? Is there a way to test if the resource is ready in a non blocking way? Or in general is there some workaround?
I have GeForce GTX 670, Windows 7 64 bit, driver ver. 331.82.
From the CUDA documentation on cudaGraphicsMapResources():
This function provides the synchronization guarantee that any graphics calls issued before cudaGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.
It could be that the delays you are seeing are caused by waiting for the drawing to complete, particularly since you indicate that, when not mapping for each frame, the drawing has sometimes not completed.
Combining this with vsync could make the problem worse since graphics calls may have to wait for the next vsync before they start drawing.
A partial workaround for the issue when vsync is in use may be to use more back buffers.
If you haven't already, you could also try to call cudaGraphicsResourceSetMapFlags() with cudaGraphicsMapFlagsReadOnly.
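A rough sketch of where that flag call sits relative to mapping (the registration of the resource via cudaGraphicsD3D11RegisterResource, or the D3D9/D3D10 equivalent, is assumed to happen elsewhere in your code):
#include <cuda_runtime.h>

// 'resource' is assumed to have been registered with the appropriate
// cudaGraphicsD3D*RegisterResource call at startup.
void mapTextureForCudaReadOnly(cudaGraphicsResource_t resource, cudaStream_t stream)
{
    // Hint that CUDA will only read the texture; must be set while the resource is unmapped.
    cudaGraphicsResourceSetMapFlags(resource, cudaGraphicsMapFlagsReadOnly);

    cudaGraphicsMapResources(1, &resource, stream);
    // ... launch the kernel that reads the texture on 'stream' ...
    cudaGraphicsUnmapResources(1, &resource, stream);
}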
Edit:
I think it waits only for drawing calls made by your own app to complete. The docs say:
The graphics API from which resources were registered should not access any resources while they are mapped by CUDA. If an application does so, the results are undefined.
And, of course, you have no control over drawing performed by other apps.
You may be able to check drawing status without blocking by calling the Present() Direct3D method with the D3DPRESENT_DONOTWAIT flag.

Can CUDA handle its own work queues?

Sorry if this is obvious, but I'm studying C++ and CUDA right now and wanted to know if this is possible, so I can focus on the relevant sections.
Basically my problem is highly parallelizable; in fact I'm currently running it on multiple servers. My program gets a work item (a very small list), runs a loop on it, and makes one of three decisions:
Keep the data (save it),
Discard the data (do nothing with it),
Process the data further (it's unsure what to do, so it modifies the data and resends it to the queue for processing).
This used to be recursive, but I made each part independent. Although I'm no longer bound to one CPU, the downside is that there are a lot of messages passing back and forth. I understand at a high level how CUDA works and how to submit work to it, but is it possible for CUDA to manage the queue on the device itself?
My current thought was to manage the queue on the C++ host and send the processing to the device, after which the results are returned to the host and sent back to the device (and so on). I think that could work, but I wanted to see if it is possible to keep the queue in CUDA memory itself and have kernels take work from it and add work to it directly.
Is something like this possible with CUDA or is there a better way to do this?
I think what you're asking is if you can keep intermediate results on the device. The answer to that is yes. In other words, you should only need to copy new work items to the device and only copy finished items from the device. The work items that are still undetermined can stay on the device between kernel calls.
You may want to look into CUDA Thrust for this. Thrust has efficient algorithms for transformations, which can be combined with custom logic (search for "kernel fusion" in the Thrust manual.) It sounds like maybe your processing can be considered to be transformations, where you take a vector of work items and create two new vectors, one of items to keep and one of items that are still undetermined.
Is the host aware of (or can it monitor) memory on the device? My concern is how to keep track of and deal with data that starts to exceed the GPU's onboard memory.
It is possible to allocate and free memory from within a kernel but it's probably not going to be very efficient. Instead, manage memory by running CUDA calls such as cudaMalloc() and cudaFree() or, if you're using Thrust, creating or resizing vectors between kernel calls.
With this "manual" memory management you can keep track of how much memory you have used with cudaMemGetInfo().
Since you will be copying completed work items back to the host, you will know how many work items are left on the device and thus, what the maximum amount of memory that might be required in a kernel call is.
Maybe a good strategy would be to swap source and destination vectors for each transform. To take a simple example, say you have a set of work items that you want to filter in multiple steps. You create vector A and fill it with work items. Then you create vector B of the same size and leave it empty. After the filtering, some portion of the work items in A have been moved to B, and you have the count. Now you run the filter again, this time with B as the source and A as the destination.
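A minimal Thrust sketch of that swap pattern (the int work items and the predicate are placeholders, not from the question), with a cudaMemGetInfo() call thrown in to watch the memory headroom mentioned above:
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Placeholder predicate: in a real program the work item would be a struct
// and this test would encode the "still undetermined" decision.
struct StillUndetermined
{
    __host__ __device__ bool operator()(int item) const { return item > 0; }
};

void filterLoop(thrust::device_vector<int> &A)
{
    thrust::device_vector<int> B(A.size());   // destination buffer, same size as A
    StillUndetermined pred;

    for (int pass = 0; pass < 4; ++pass) {
        // Copy the still-undetermined items from A into B; kept/discarded items
        // would be handled by similar copy_if calls with other predicates.
        thrust::device_vector<int>::iterator end =
            thrust::copy_if(A.begin(), A.end(), B.begin(), pred);
        B.resize(end - B.begin());   // count of surviving items
        A.swap(B);                   // A now holds the filtered set; the old A is reused as
                                     // next pass's destination (always large enough, since
                                     // the item count only shrinks)

        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);   // headroom check before adding more work items
    }
}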

Any tips to avoid laggy display during long kernels?

Dear CUDA users, I am reposting a question from the NVIDIA boards:
I am currently doing image processing on the GPU and I have one kernel that takes something like 500 to 700 milliseconds when running on big images. It used to work perfectly on smaller images, but now the problem is that the whole display and even the mouse cursor get laggy (OS = Windows 7).
My idea was to split my kernel in 4 or 8 kernel launches, hoping that the driver could refresh more often (between each kernel launch).
Unfortunately it does not help at all, so what else could I try to avoid this freezing-display effect? It was suggested that I add a cudaStreamQuery(0) call between each kernel launch to avoid batching by the driver.
Note: I am prepared to trade performances for smoothness!
The GPU is not (yet) designed to context switch between kernel launches, which is why your long-running kernel is causing a laggy display. Breaking the kernel into multiple launches probably would help on platforms other than Windows Vista/Windows 7. On those platforms, the Windows Display Driver Model requires an expensive user->kernel transition ("kernel thunk") every time the CUDA driver wants to submit work to the GPU.
To amortize the cost of the kernel thunk, the CUDA driver queues up GPU commands and submits them in batches. The driver uses a heuristic to trade off the performance hit from the kernel thunk against the increased latency of not immediately submitting work. What's happening with your multiple-kernels solution is that the driver's submitting your kernel or series of kernels to the GPU all at once.
Have you tried the cudaStreamQuery(0) suggestion? The reason that might help is because it forces the CUDA driver to submit work to the GPU, even if very little work is pending.
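For reference, a sketch of what splitting the work plus a cudaStreamQuery(0) between launches could look like (the kernel and the row-tile arithmetic are invented for illustration):
#include <algorithm>
#include <cuda_runtime.h>

__global__ void processTile(float *image, int width, int rowStart, int rowEnd)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = rowStart + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < rowEnd)
        image[y * width + x] += 1.0f;   // placeholder per-pixel work
}

void processImageInTiles(float *d_image, int width, int height, int tiles)
{
    int rowsPerTile = (height + tiles - 1) / tiles;
    dim3 block(16, 16);

    for (int t = 0; t < tiles; ++t) {
        int rowStart = t * rowsPerTile;
        int rowEnd   = std::min(rowStart + rowsPerTile, height);
        dim3 grid((width + block.x - 1) / block.x,
                  (rowEnd - rowStart + block.y - 1) / block.y);

        processTile<<<grid, block>>>(d_image, width, rowStart, rowEnd);

        // Nudge the driver to submit this launch to the GPU now instead of
        // batching it with the following ones; the query itself does not block.
        cudaStreamQuery(0);
    }
}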

CPU and GPU timer in cuda visual profiler

So there are two timers in the CUDA Visual Profiler:
GPU Time: It is the execution time for the method on the GPU.
CPU Time: It is the sum of GPU time and the CPU overhead to launch that method. At the driver-generated data level, CPU Time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches are non-blocking by default, but if any profiler counters are enabled, kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? I measure the time; there is a GPU timer and a CPU timer as well. What's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention “the actual execution time” — that's a start. Do you mean the complete execution time of the problem — from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, in Unix-type systems I like to just measure the entire runtime of the program; /bin/time myprog, presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want elapsed time of some set of computations, cuda has very handy functions cudaEvent* which can be placed at various parts of the code — see the CUDA Best Practices Guide, s 2.1.2, Using CUDA GPU Timers — these you can put before and after important bits of code and print the results.
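A short sketch of that pattern (the kernel and sizes are stand-ins; only the event calls matter here):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   // stand-in for the code being timed
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // mark the start in stream 0
    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);   // the important bit of code
    cudaEventRecord(stop, 0);                        // mark the end

    cudaEventSynchronize(stop);                      // wait for the stop event to complete
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time in milliseconds
    printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}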
The GPU timer is based on events. That means that when an event is created it is placed in a queue on the GPU to be serviced, so there is a small overhead there too. From what I have measured, though, the differences are of minor importance.