CUDA code produces incorrect results in release mode - cuda

My CUDA code produces correct results in Debug mode. However, in Release mode, the same code produces garbage results. Could synchronization between threads behave differently between Debug and Release mode?

Code generated with -O0 is less optimized and makes significantly more global and local memory accesses, which may hide a race condition. If you think you may have a race condition in shared memory, you can try the new CUDA 5.0 preview memory checker, which supports some forms of race condition detection. Your best bet is to look for any location where you share memory between two threads and determine if you are missing a thread fence or __syncthreads().

I think you have a race condition. You can reorganize your code and add synchronization where it's needed. In Debug mode your threads effectively execute in a more predictable order, so the problem doesn't surface.
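To make that concrete, here is a minimal sketch (not the asker's code) of the classic pattern where a missing __syncthreads() happens to work in Debug builds but breaks once the compiler optimizes: a block-wide reduction in shared memory.

```cuda
// Block-wide sum in shared memory: every omitted __syncthreads() below is a
// race that -O0 code may hide but optimized code exposes.
// Assumes blockDim.x is a power of two; launched with
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void blockSum(const float *in, float *out)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // all loads must complete first

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                  // partial sums must be visible
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];
}
```

Running the binary under cuda-memcheck --tool racecheck is a quick way to confirm whether a shared-memory hazard like this is present.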

Related

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or if this is expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue is that with Unified Memory (UVM) or zero-copy memory, Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help. It may be worth trying to see if anything changes.
I was able to resolve the issue by addressing my shared memory initialization. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
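For reference, a hedged sketch of the kind of fix described above (the histogram shape and names are assumptions, not the poster's actual kernel): zero the shared buffer cooperatively before any atomicAdd touches it.

```cuda
// Sketch of initializing shared memory before accumulating into it.
// HIST_BINS and the indexing are assumptions for illustration only.
#define HIST_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int s_bins[HIST_BINS];

    // Zero the shared buffer; without this, atomicAdd accumulates into
    // garbage and results differ from run to run (and replay to replay).
    for (int i = threadIdx.x; i < HIST_BINS; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_bins[data[i]], 1u);
    __syncthreads();

    // Fold the per-block histogram into the global one.
    for (int i = threadIdx.x; i < HIST_BINS; i += blockDim.x)
        atomicAdd(&bins[i], s_bins[i]);
}
```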

Jetson TK1 Multiple Streams parallel execution

Considering that the TK1 has a single SM, is it really possible to run streams concurrently? I have been unable to do so, even with the latest versions of the CUDA libraries.
So is it really possible? Any sample code would be great. The cuBLAS sample code also runs sequentially, as shown in the Visual Profiler.
Also, some better insight into what streams are good for on a single SM would help.
[Already asked on the NVIDIA dev forum; that forum isn't very active, I think.]
With a single Kepler SM, it is not possible to run several streams concurrently. Kepler does not support preemption. This is not related to the CUDA version; rather, it is related to the capability of the SM. Something related to preemption was discussed for Pascal at GTC 2016, but nothing before that.
Regarding the actual use of streams with a single SM: some async functions may behave slightly differently between stream 0 and other streams. Hence, I assume some corner cases of async memcpy and execution might benefit from streams even with a single SM, since the TK1 device query reports concurrent copy and execution with 1 copy engine (even though zero-copy may be a better approach on the TK1).
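As a hedged, generic illustration of that copy/compute overlap (not verified on a TK1), the usual pattern is pinned host memory plus cudaMemcpyAsync and kernels issued into different streams:

```cuda
// Generic copy/compute overlap sketch (names and sizes are assumptions).
// cudaMemcpyAsync only overlaps when the host buffer is pinned.
#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20, CHUNK = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host buffer
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        int off = i * CHUNK;
        // The copy of chunk 1 can overlap the kernel of chunk 0 because they
        // sit in different streams and the device has a copy engine.
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK);
    }
    cudaDeviceSynchronize();

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```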

Is it possible to set a limit on the number of cores to be used in CUDA programming for a given code?

Assume I have an NVIDIA K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e., instead of using all 2880, only use 400 cores, for example). Is this possible? Does it even make sense to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. That is not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate a certain number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (though I still think manual shared memory methods would be better in 99% of cases).
The way you could do this is to read the PTX identifiers %nsmid and %smid and put a conditional at the start of the kernel. You would have to launch only 1 block per streaming multiprocessor (SM) and then have each block return early depending on which SMs you want each kernel to run on.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware- and CUDA-version-dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly...
https://gist.github.com/allanmac/4751080
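Pulling those two references together, here is a hedged sketch of the idea (illustrative only; as noted above, the block-to-SM assignment is hardware- and driver-dependent and not a documented contract). Launch one block per SM, retire blocks on SMs you don't want to use, and let the surviving blocks pull work from a global counter so nothing is skipped:

```cuda
// Restrict work to the first maxSm SMs. smId() reads the %smid PTX register.
// workCounter is a device allocation zeroed before launch; blockDim.x is
// assumed to equal tileSize, and nTiles * tileSize covers the whole array.
__device__ __forceinline__ unsigned int smId()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void restrictedKernel(float *data, int nTiles, int tileSize,
                                 unsigned int maxSm, unsigned int *workCounter)
{
    if (smId() >= maxSm)                  // blocks on unwanted SMs retire at once
        return;

    __shared__ unsigned int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(workCounter, 1u);   // claim the next tile
        __syncthreads();
        if (tile >= (unsigned int)nTiles)
            break;                        // uniform decision across the block

        data[tile * tileSize + threadIdx.x] += 1.0f;   // placeholder work
        __syncthreads();                  // before thread 0 overwrites `tile`
    }
}
```

The kernel would be launched with one block per SM (the SM count is available from cudaDeviceGetAttribute with cudaDevAttrMultiProcessorCount), so that each SM holds at most one resident block.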
I am not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but I would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have a complicated execution graph with many kernels, some of which can execute in parallel, you want to load the GPU as fully and effectively as possible. But it seems the GPU on its own can occupy all SMs with single blocks of one kernel, i.e., if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I have seen such an effect. That kernel really will be faster (maybe 1.5x compared to 4 blocks of 256 threads per SM), but it will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this one with 30 blocks or not, i.e., whether it would be more effective to spread it across all SMs or not. So some manual way to express this should exist.
As to question 3, I suppose GPU profiling tools should show this: the Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I didn't try them. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say, for graphics rendering). Then something unusual happens and you need to execute some lighter kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one, which greatly shortens the time before the high-priority kernel completes.
I also found the following:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels where you don't wait for results in between). If you launch several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
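For context, a hedged sketch of the stream-capture path for CUDA Graphs (a generic example, not the code under discussion): the launch sequence is recorded once and then replayed with a single cudaGraphLaunch per iteration, which is where the reduced CPU launch cost comes from.

```cuda
// Capture a short sequence of kernel launches into a graph and replay it.
// kernelA/kernelB and their arguments are placeholders for illustration.
#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernelB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemset(d, 0, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the work once...
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
    kernelB<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
    cudaStreamEndCapture(stream, &graph);
    // 5-argument form used by CUDA 10/11; newer toolkits also accept
    // cudaGraphInstantiate(&graphExec, graph, 0).
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // ...then replay it with a single launch call per iteration.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```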

Can GPU counters be read transparently to the application code

I am trying to profile the CUDA Rodinia benchmarks executing on a GTX 650. I am using the code at /usr/local/cuda-5.0/extras/CUPTI/samples/event_sampling to read the instructions-executed counter. It seems strange that I do not see any change in the values reported by event_sampling whether I am executing the CUDA benchmarks or not. The event_sampling code also has some calculations of its own for which it measures the instructions executed. Unlike with a CPU, do I need to make changes to the source code of the application to be able to read GPU counters such as instruction_executed?
CUPTI will only give you counter updates for kernels launched in the same process. However, you can get some of these values, though not at the same level of precision, with the NVIDIA Visual Profiler or related environment variables, without modifying the code.
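To illustrate the "same process" point, here is a hedged sketch along the lines of the event_sampling sample (error checking omitted; the event name inst_executed and the continuous collection mode are assumptions that vary by GPU and CUPTI version). The counter only reflects kernels launched by this process:

```cuda
// In-process read of a GPU event counter via the CUPTI event API (sketch).
#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti_events.h>
#include <stdio.h>

__global__ void dummy(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    cudaSetDevice(0);
    cudaFree(0);                          // force runtime context creation
    cuInit(0);

    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxGetCurrent(&ctx);

    CUpti_EventGroup group;
    CUpti_EventID eventId;
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGetIdFromName(dev, "inst_executed", &eventId);
    cuptiEventGroupAddEvent(group, eventId);
    cuptiEventGroupEnable(group);

    // Only kernels launched by THIS process contribute to the counter;
    // Rodinia binaries running in other processes will not show up here.
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));
    dummy<<<4, 256>>>(d, 1024);
    cudaDeviceSynchronize();

    uint64_t value = 0;
    size_t bytes = sizeof(value);
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                             eventId, &bytes, &value);
    printf("inst_executed = %llu\n", (unsigned long long)value);

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
    cudaFree(d);
    return 0;
}
```

This would be built with nvcc, adding the CUPTI include path and linking against -lcuda and -lcupti.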

Is cudaMallocHost() , cudaCreateEvent() asynchronous with executing kernels?

I am running into a very strange issue with the CUDA Runtime API. Calls to functions like cudaMallocHost(), cudaEventCreate(), cudaFree(), etc. seem to be executed only when kernels finish execution on the GPU. These kernels are all launched on a stream created with the cudaStreamNonBlocking flag. What is the problem? Do I have to set some other flag somewhere?
They could be made asynchronous, but it wouldn't be surprising if they are not.
With respect to cudaMallocHost(), which requires that the host memory be mapped for the GPU: if the allocation can't be satisfied from a preallocated pool, the GPU's page tables must be edited. It would not surprise me in the least if the driver had a restriction where it could not edit the page tables of an executing kernel. (Esp. since the page table editing must be done by kernel mode driver code.)
With respect to cudaEventCreate(), that really should be asynchronous since those allocations generally can be satisfied from a preallocated pool. The main impediment there is that changing the behavior would break existing applications that rely on its current, synchronous behavior.
Freeing objects asynchronously requires the driver to track which objects are referenced in the command buffers submitted to the GPU, and defer the actual free operation until after the GPU has finished processing them. It is doable but I am not sure NVIDIA has done the work.
For cudaFree(), it is not possible to track references as you could for CUDA events (because pointers can be stored for running kernels to read and chase). So for large virtual address ranges that should be deallocated and unmapped, the free must be deferred until after all pending GPU operations have executed. Again, doable, but I am not sure NVIDIA has done the work.
I think NVIDIA generally expects developers to work around the lack of asynchrony in these entry points.
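As a hedged, generic sketch of that kind of workaround (not from the original answer): do the sync-prone allocations and event creation up front, keep only async calls in flight alongside kernels, and defer frees until the device is idle.

```cuda
// Sketch of working around the implicit synchronization: allocate and create
// events before any kernel is in flight, launch work, and free after a sync.
#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = i * 0.5f;
}

int main()
{
    const int N = 1 << 20;

    // 1. All sync-prone setup happens before any kernel runs.
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaEvent_t done;
    cudaEventCreate(&done);

    // 2. Launch work; only async calls are issued while the kernel runs.
    work<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // ... overlap independent CPU work here ...

    // 3. Only now wait, then release resources while the device is idle.
    cudaEventSynchronize(done);
    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```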