CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or if this is expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.

I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue is with Unified Memory (UVM) or zero-copy memory: Nsight Compute is not able to restore those values before each replay. Are you using either of those in your application? If so, the application replay mode could help, and it may be worth trying to see if anything changes.

I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized shared memory).
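For anyone hitting the same symptom: shared memory is not zeroed for you, so an atomicAdd into an uninitialized shared buffer accumulates whatever garbage was already there, and kernel replay makes the effect more visible. A minimal sketch of the kind of fix involved, using an illustrative per-block histogram rather than my actual kernel:

    __global__ void histogramKernel(const int *data, int n, int *globalHist)
    {
        __shared__ int localHist[256];

        // Zero the shared buffer before any atomicAdd touches it.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            localHist[i] = 0;
        __syncthreads();   // make the initialization visible to every thread

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&localHist[data[idx] & 255], 1);
        __syncthreads();   // all per-block atomics finished before the merge

        // Merge the per-block counts into the global histogram.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&globalHist[i], localHist[i]);
    }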

Related

Cuda Compute Mode and 'CUBLAS_STATUS_ALLOC_FAILED'

I have a host in our cluster with 8 Nvidia K80s and I would like to set it up so that each device can run at most 1 process. Previously, if I ran multiple jobs on the host and each used a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3, which I believe makes it so that each device can accept a job from only one CPU process. I then run 2 jobs (each of which only takes about 150 MB out of the 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This relies on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime: when a device is set to an exclusive compute mode and is already in use, the runtime would automatically select another available device.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
Rather than relying on this, it's suggested that you instead use an external means, such as a job scheduler (e.g. Torque), to manage resources in a situation like this.
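If you don't want to depend on that runtime behavior, you can walk the devices yourself and keep the first one that accepts a context. A rough sketch of that pattern (the specific error returned for a busy exclusive-mode device can vary, so the check is deliberately just "did context creation succeed"):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Try each device in turn and return the first one we can create a context on.
    int pickFreeDevice()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess)
            return -1;

        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);
            // cudaFree(0) forces context creation; on an exclusive-mode device that
            // is already owned by another process this fails instead of succeeding.
            if (cudaFree(0) == cudaSuccess)
                return dev;
            cudaDeviceReset();   // discard the failed attempt before trying the next device
        }
        return -1;               // nothing available
    }

    int main()
    {
        int dev = pickFreeDevice();
        if (dev < 0) {
            std::fprintf(stderr, "no free GPU found\n");
            return 1;
        }
        std::printf("using device %d\n", dev);
        // ... launch the real work on the selected device ...
        return 0;
    }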

Can execution of CUDA kernels from two contexts overlap?

From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification: MPS not enabled, running on a single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how you set up the CUPTI activity traces. But two kernels can share a time-span on a single GPU even without the MPS server, though only one of them will actually be running on the GPU at any given moment.
If the CUDA MPS server is not in use, then kernels from different contexts cannot overlap. Assuming you're not using the MPS server, the time-sliced scheduler decides which context gets access to the GPU at any given moment; without MPS, a context can only access the GPU in the time slots that the time-sliced scheduler assigns to it. Thus, only kernels from a single context are running on the GPU at any one time (without the MPS server).
Note that it is possible for multiple kernels to share a time-span on the GPU, but within that time-span only kernels from a single context can access the GPU resources (this also assumes you are using a single GPU).
For more information you can also check the MPS Service documentation.
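For completeness, the CUPTI side of the question boils down to a handful of calls. A stripped-down sketch of what enabling concurrent-kernel activity tracing looks like, loosely modeled on the activity_trace_async sample the asker mentions (the buffer callbacks are simplified and the record handling only reports the record kind, so treat it as illustrative rather than a drop-in replacement):

    #include <cstdio>
    #include <cstdlib>
    #include <cupti.h>

    #define BUF_SIZE (32 * 1024)

    // CUPTI asks for a buffer to fill with activity records.
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords)
    {
        *buffer = (uint8_t *)malloc(BUF_SIZE);
        *size = BUF_SIZE;
        *maxNumRecords = 0;   // 0 = as many records as fit
    }

    // CUPTI hands back a filled buffer; walk the records it contains.
    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size, size_t validSize)
    {
        CUpti_Activity *record = NULL;
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
            // Kernel records carry start/end timestamps; here we only report the kind.
            printf("activity record of kind %d\n", (int)record->kind);
        }
        free(buffer);
    }

    void startTracing()
    {
        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    }

    void flushTracing()
    {
        cuptiActivityFlushAll(0);   // called once per logical timestep, as in the question
    }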

How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a constant tree. The threads are independent except for reading from a constant array and the tree. Each thread can perform some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I use 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for fewer than N = 100000 threads but fails for more than that. It also works on the CPU, or on the GPU run thread by thread. When N is large (e.g. > 100000), the kernel crashes and then the cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1G; cudaDeviceSetLimit succeeded but the problem remains.
Does anyone know some common reasons for the above problem? Or any hints for further debugging? I tried putting in some printf's, but there is a huge amount of output. Moreover, once a thread crashes, all remaining printf output is discarded, so it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive, or the error codes may steer you in another direction.
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
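As a concrete starting point for the error-checking suggestion, something along these lines after every API call and kernel launch will surface a TDR or an in-kernel fault as a readable message (the macro name here is just an example):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Abort with file/line information whenever a CUDA call reports an error.
    #define CUDA_CHECK(call)                                                    \
        do {                                                                    \
            cudaError_t err = (call);                                           \
            if (err != cudaSuccess) {                                           \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                        cudaGetErrorString(err), __FILE__, __LINE__);           \
                exit(EXIT_FAILURE);                                             \
            }                                                                   \
        } while (0)

    // Usage around a kernel launch:
    //   myKernel<<<blocks, threads>>>(args...);
    //   CUDA_CHECK(cudaGetLastError());        // catches launch-time errors
    //   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution (e.g. a TDR)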

Can GPU counters be read transparently to the application code

I am trying to profile the CUDA Rodinia benchmarks executing on a GTX 650. I am using the code /usr/local/cuda-5.0/extras/CUPTI/samples/event_sampling to read the instructions-executed counter. It seems strange that I do not see any change in the values reported by event_sampling whether I am executing the CUDA benchmarks or not.
The event_sampling code also has some calculations of its own for which it measures the instructions executed. Unlike on a CPU, do I need to make changes to the source code of the application to be able to read GPU counters such as instruction_executed?
CUPTI will only give you counter updates for kernels launched from the same process. You can, however, get some of these values, though not to the same level of precision, with the NVIDIA Visual Profiler or related environment variables, without modifying the code.
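The "same process" point is the key one: to read a counter such as inst_executed for your own kernels, you go through the CUPTI event API from inside the application. A rough sketch of that flow, with error checking omitted and with the caveat that the available event names and the exact point at which the counter may be read depend on the GPU and CUPTI version (the launchWork callback is just a placeholder for launching and synchronizing your kernels):

    #include <cstdint>
    #include <cuda.h>
    #include <cupti.h>

    // Read a single hardware counter (e.g. "inst_executed") for kernels launched
    // by this process. Error checking is omitted for brevity.
    uint64_t readCounter(CUcontext context, CUdevice device,
                         const char *eventName, void (*launchWork)())
    {
        CUpti_EventGroup group;
        CUpti_EventID eventId;

        // Collect the counter only while kernels from this context are running.
        cuptiSetEventCollectionMode(context, CUPTI_EVENT_COLLECTION_MODE_KERNEL);

        cuptiEventGetIdFromName(device, eventName, &eventId);
        cuptiEventGroupCreate(context, &group, 0);
        cuptiEventGroupAddEvent(group, eventId);
        cuptiEventGroupEnable(group);

        launchWork();   // launch and synchronize the kernels being measured

        uint64_t value = 0;
        size_t bytes = sizeof(value);
        cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                 eventId, &bytes, &value);

        cuptiEventGroupDisable(group);
        cuptiEventGroupDestroy(group);
        return value;
    }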

CUDA code produces incorrect result in release mode

My CUDA code produces correct results in Debug mode. However, in Release mode, the same code produces garbage results. Could the synchronization between threads behave differently between Debug and Release mode?
Code generated with -O0 is less optimal and performs significantly more global and local memory accesses, which may hide a race condition. If you think you may have a race condition in shared memory, you can try the new CUDA 5.0 preview memory checker, which supports some forms of race condition detection. Your best bet is to look for any location where you share memory between two threads and determine whether you are missing a thread fence or __syncthreads().
I think you have a race condition. You can reorganize your code and add synchronization where it's needed. In Debug mode your threads tend to execute in a more ordered fashion, so you may not hit the problem there.
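To make the "missing synchronization" point concrete, here is the classic shape of such a bug in a shared-memory reduction (illustrative only, assuming a power-of-two block size of 256; without the commented __syncthreads() calls, the optimized build is free to let threads race ahead of each other and read stale values):

    __global__ void sumReduce(const float *in, float *out, int n)
    {
        __shared__ float buf[256];            // assumes blockDim.x == 256

        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + tid;
        buf[tid] = (idx < n) ? in[idx] : 0.0f;
        __syncthreads();                      // required: all loads into buf must finish first

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();                  // required: each step reads what the previous wrote
        }

        if (tid == 0)
            out[blockIdx.x] = buf[0];
    }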