I'm running CUDA 5.5 on a SUSE Linux machine with two M2050 cards installed, neither of which is used for running X11. I am trying to step through a kernel that runs only on device 0, using the Nsight Eclipse debugger. If I set an (unconditional) breakpoint inside a kernel, the debugger breaks on block 0/thread 0 first, and if I continue execution it breaks again at the same point 5 or 6 times on seemingly random threads in different blocks, before exiting the kernel and continuing to the next kernel. Execution inside the kernel is happening correctly and is displayed correctly. The host code debugs without problems.
If I make the same breakpoint conditional, as outlined in this post:
using nsight to debug
I see no difference in the behavior of the debugger. The condition on the breakpoint seems to be ignored, and the debugger breaks on 5 or 6 random threads before exiting the kernel. Neither of these behaviors makes much sense to me. I would think the unconditional breakpoint should break on thread 0 or on all threads, and I would think the conditional breakpoint should break only on the thread it's conditioned on. I've looked all over the NVIDIA documentation, Stack Overflow, etc., and seem to have exhausted my options at this point. I was wondering if anyone else has seen similar behavior or might have some pointers.
An unconditional breakpoint breaks for every new "batch" of threads arriving on the device. This is needed so that you can inspect all of your threads.
Because of some technical issues, conditional breakpoints should be set only after you have broken inside the kernel at least once. This will be fixed in CUDA Toolkit 6.0.
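In practice the workflow looks roughly like the following (a sketch only; Nsight Eclipse drives cuda-gdb underneath, and the kernel name, file, and line number here are placeholders):

    (cuda-gdb) break myKernel
    (cuda-gdb) run
    ... stops at block (0,0,0), thread (0,0,0) for the first batch of threads ...
    (cuda-gdb) break myKernel.cu:42 if blockIdx.x == 3 && threadIdx.x == 17
    (cuda-gdb) continue
    ... now only the requested thread should stop ...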
Related
I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or whether this is expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is a value being calculated differently? One other issue is with Unified Memory (UVM) or zero-copy memory: Nsight Compute is not able to restore those values before each replay. Are you using either of those in your application? If so, the application replay mode could help. It may be worth trying to see if anything changes.
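For reference, application replay can also be selected from the Nsight Compute command line, along these lines (a hedged example from memory of the ncu options; the binary name is a placeholder):

    ncu --replay-mode application ./my_app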
I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized shared memory).
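The fix amounted to something like the following (a minimal sketch, not the original kernel; the bin count and data layout are made up):

    __global__ void histogramKernel(const int *in, int n, int *out)
    {
        // Shared memory is not zeroed automatically; without this loop the
        // atomicAdd below accumulates into garbage, and kernel replay in
        // Nsight Compute makes that garbage differ from run to run.
        __shared__ int bins[256];
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            bins[i] = 0;
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&bins[in[idx] & 255], 1);
        __syncthreads();

        // Flush the per-block histogram to global memory.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&out[i], bins[i]);
    }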
I have a host in our cluster with 8 NVIDIA K80s, and I would like to set it up so that each device can run at most one process. Previously, if I ran multiple jobs on the host and each used a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process, i.e. exclusive process) via nvidia-smi -c 3, which I believe makes it so that each device can accept a job from only one CPU process. I then run two jobs (each of which takes only about 150 MB out of the 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This relies on behavior of the CUDA runtime that is officially undocumented (or else prove me wrong and point out the official documentation, please): when a device is set to an exclusive compute mode, the runtime automatically selects another available device if that one is already in use.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
Rather than relying on this, it is suggested that you use an external means, such as a job scheduler (e.g. Torque), to manage resources in a situation like this.
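If a scheduler is not an option, one way to reproduce the cascading behavior explicitly is to probe the devices yourself at startup (a hedged sketch; in exclusive-process mode, context creation on a busy device fails, so the loop simply moves on to the next one):

    #include <cuda_runtime.h>

    // Returns the index of the first device on which a context can be created,
    // or -1 if every device is busy or unusable.
    int pickFreeDevice()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess)
            return -1;
        for (int d = 0; d < count; ++d) {
            cudaSetDevice(d);                  // lazy: no context created yet
            if (cudaFree(0) == cudaSuccess)    // forces context creation
                return d;
            cudaGetLastError();                // clear the error and try the next device
        }
        return -1;
    }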
I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (constant). The threads are independent except for reading from a constant array and the tree. In each thread there can be some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I used 256 threads per block and (N+256-1)/256 blocks per grid.
The problem is that the program works for up to about N=100000 threads but fails for more than that. It also works on the CPU, or on the GPU one thread at a time. When N is large (e.g. >100000), the kernel crashes and then the cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1G; cudaDeviceSetLimit succeeded but the problem remains.
Does anyone know a common reason for this kind of problem, or have any hints for further debugging? I tried adding some printf's, but there is a huge amount of output. Moreover, once a thread crashes, all remaining printf output is discarded, so it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive, or the error codes may steer you in another direction.
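A minimal version of such checking looks something like this (a common pattern rather than anything specific to this code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap every CUDA runtime call; print the error string and bail out on failure.
    #define CUDA_CHECK(call)                                                 \
        do {                                                                 \
            cudaError_t err = (call);                                        \
            if (err != cudaSuccess) {                                        \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                        cudaGetErrorString(err), __FILE__, __LINE__);        \
                exit(EXIT_FAILURE);                                          \
            }                                                                \
        } while (0)

    // After a kernel launch, check both the launch error and the execution error:
    //   myKernel<<<grid, block>>>(...);
    //   CUDA_CHECK(cudaGetLastError());        // launch/configuration problems
    //   CUDA_CHECK(cudaDeviceSynchronize());   // execution problems (e.g. a TDR reset)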
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
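Since the kernel allocates with in-kernel malloc, it is also worth verifying that those allocations actually succeed: malloc returns NULL rather than aborting when the device heap set with cudaDeviceSetLimit is exhausted. A simplified sketch (not the actual BFS kernel):

    #include <cstdio>

    __global__ void bfsKernel(int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        // Returns NULL when cudaLimitMallocHeapSize is exhausted.
        int *queue = (int *)malloc(1024 * sizeof(int));
        if (queue == NULL) {
            printf("thread %d: device malloc failed\n", tid);
            return;                    // bail out instead of dereferencing NULL
        }

        // ... BFS work using the queue ...

        free(queue);                   // every successful malloc needs a matching free
    }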
I cannot even achieve overlapping memcpy and kernel execution with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue it is a problem with the WDDM driver in Windows:
Why it is not possible to overlap memHtoD with GPU kernel with GTX 590,
CUDA kernels not launching before CudaDeviceSynchronize
Time between Kernel Launch and Kernel Execution
and suggest the following:
flush the WDDM queue with cudaEventQuery() or cudaStreamQuery(). (Does not work).
submit streams in breadth first manner. (Does not work).
This thread argues it is a bug in Fermi:
How can I overlap memory transfers and kernel execution in a CUDA application?
This thread:
http://blog.icare3d.org/2010/04/tesla-compute-drivers.html
proposes a solution to mitigate the problems with WDDM on Windows. However, it only works for a Tesla card, and it requires an additional video card to drive the display, since the proposed drivers are compute-only drivers.
However, none of these threads provide a real solution. I would appreciate it if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing it.
TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When it is set to false, the issue appears. Instead, if you set the TDR delay value to a very high number and the "enabled" option to true, the issue goes away.
Read below for other (older) steps I followed before arriving at the solution above, and for some other possible causes.
I recently was able to mostly solve this problem! It is specific to Windows and Aero, I think. Please try these steps and post your results to help others! I have tried it on a GTX 650 and a GT 640.
Before you do anything, consider using both the onboard GPU (as display) and the discrete GPU (for computation), because there are verified issues with the NVIDIA driver for Windows! When you use the onboard GPU for display, the NVIDIA drivers are not fully loaded, so many bugs are avoided. Also, system responsiveness is maintained while working!
Make sure your concurrency problem is not related to other issues such as old drivers (including the BIOS), incorrect code, an incapable device, etc.
Go to Computer > Properties
Select Advanced system settings on the left side
Go to the Advanced tab
Under Performance, click Settings
In the Visual Effects tab, select the "Adjust for best performance" option.
This will disable Aero and almost all visual effects. If this configuration works, you can try re-enabling the visual-effect checkboxes one by one until you find the precise one that causes problems!
Alternatively, you can:
Right-click on the desktop and select Personalize
Select a theme from the Basic themes that doesn't use Aero.
This will also work like the above, but with more visual options enabled. For my two devices this setting also works, so I kept it.
Please, when you try these solutions, come back here and post your findings!
For me, it solved the problem in most cases (a tiled DGEMM I wrote), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency...
UPDATE: The problem is fully solved with a fresh Windows installation! The previous steps improved the behavior in some cases, but a fresh install solved all the problems!
I will try to find a less radical way of solving this problem, maybe restoring just the registry will be enough.
Dear CUDA users, I am reposting a question from the NVIDIA boards:
I am currently doing image processing on the GPU, and I have one kernel that takes something like 500 to 700 milliseconds when running on big images. It used to work perfectly on smaller images, but now the problem is that the whole display, and even the mouse cursor, gets laggy (OS = Windows 7).
My idea was to split my kernel into 4 or 8 kernel launches, hoping that the driver could refresh the display more often (between kernel launches).
Unfortunately, it does not help at all, so what else could I try to avoid this display-freezing effect? It was suggested that I add a cudaStreamQuery(0) call between kernels to avoid batching by the driver.
Note: I am prepared to trade performance for smoothness!
The GPU is not (yet) designed to context switch in the middle of a kernel's execution, which is why your long-running kernel is causing a laggy display. Breaking the kernel into multiple launches probably would help on platforms other than Windows Vista/Windows 7. On those platforms, the Windows Display Driver Model requires an expensive user->kernel transition ("kernel thunk") every time the CUDA driver wants to submit work to the GPU.
To amortize the cost of the kernel thunk, the CUDA driver queues up GPU commands and submits them in batches. The driver uses a heuristic to trade off the performance hit from the kernel thunk against the increased latency of not immediately submitting work. What is happening with your multiple-kernels solution is that the driver is still batching your kernel, or series of kernels, and submitting them to the GPU all at once.
Have you tried the cudaStreamQuery(0) suggestion? The reason it might help is that it forces the CUDA driver to submit work to the GPU immediately, even if very little work is pending.
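A rough sketch of that suggestion, under the assumption that the image can be processed in independent horizontal slices (the kernel and names here are placeholders, not the original code):

    #include <cuda_runtime.h>

    // Stand-in for the real image-processing kernel; each launch touches only
    // one horizontal slice, so no single launch runs for hundreds of milliseconds.
    __global__ void processSlice(float *img, int width, int height,
                                 int slice, int numSlices)
    {
        int rowsPerSlice = (height + numSlices - 1) / numSlices;
        int row = slice * rowsPerSlice + blockIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < height && col < width)
            img[row * width + col] *= 2.0f;   // placeholder for the real work
    }

    void runInSlices(float *d_img, int width, int height)
    {
        const int numSlices = 8;
        dim3 block(256);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + numSlices - 1) / numSlices);
        for (int s = 0; s < numSlices; ++s) {
            processSlice<<<grid, block>>>(d_img, width, height, s, numSlices);
            cudaStreamQuery(0);   // nudges the WDDM driver to submit the pending batch now
        }
        cudaDeviceSynchronize();  // wait for the last slice
    }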