Understanding CUDA dependency check

The CUDA C Programming Guide provides the following statements:
For devices that support concurrent kernel execution and are of compute capability 3.0
or lower, any operation that requires a dependency check to see if a streamed kernel
launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any
stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel
launch being checked is complete.
I am quite lost here. What is a dependency check? Can I say that a kernel execution on some device memory requires a dependency check on all previous kernels or memory transfers involving the same device memory? If this is true (maybe it is not), this dependency check would block all later kernels from any other stream according to the statement above, and therefore no asynchronous or concurrent execution could happen afterward, which seems not to be true.
Any explanation or elaboration will be appreciated!

First of all, I suggest you visit NVIDIA's webinar site and watch the webinar on Concurrency & Streams.
Furthermore consider the following points:
commands issued to the same stream are treated as dependent
e.g. if you insert a kernel into a stream after a memcopy of some data that this kernel will access, the kernel "depends" on the data being available.
commands in the same stream are therefore guaranteed to be executed sequentially (or synchronously, which is often used as a synonym)
commands in different streams, however, are independent and can be run concurrently
so dependencies are only known to the programmer and are expressed using streams (to avoid errors)! A minimal sketch of this follows below.
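Here is a minimal sketch of that idea, assuming two hypothetical kernels k1 and k2; the names, sizes, and buffers are all illustrative:

    #include <cuda_runtime.h>

    __global__ void k1(float *d) { d[threadIdx.x] += 1.0f; }
    __global__ void k2(float *d) { d[threadIdx.x] *= 2.0f; }

    int main() {
        const int n = 256;
        float *h;                        // pinned host memory, so the copy can be truly async
        cudaMallocHost(&h, n * sizeof(float));
        float *dA, *dB;
        cudaMalloc(&dA, n * sizeof(float));
        cudaMalloc(&dB, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Same stream: k1 is guaranteed to run only after the copy has
        // completed, so it safely "depends" on the copied data.
        cudaMemcpyAsync(dA, h, n * sizeof(float), cudaMemcpyHostToDevice, s1);
        k1<<<1, n, 0, s1>>>(dA);

        // Different stream: k2 has no expressed dependency on s1's work
        // and may execute concurrently with k1, resources permitting.
        k2<<<1, n, 0, s2>>>(dB);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1); cudaStreamDestroy(s2);
        cudaFree(dA); cudaFree(dB); cudaFreeHost(h);
        return 0;
    }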
The following applies only to devices with compute capability 3.0 or lower (as stated in the guide). If you want to know more about the changes to stream scheduling behaviour with compute capability 3.5, have a look at HyperQ and the corresponding example. At this point I also want to reference this thread, where I found the HyperQ examples :)
About your second question: I do not quite understand what you mean by a "kernel execution on some device memory" or a "kernel execution involving device memory", so I reduced your statement to:
A kernel execution requires a dependency check on all the previous kernels and memory transfers.
Better would be:
A CUDA operation requires a dependency check to see whether preceding CUDA operations in the same stream have completed.
I think your problem here is with the expression "started executing". It means that there can still be independent kernel launches (that is, on different streams) that run concurrently with the previous kernels, provided the previous kernels have all started executing and enough device resources are available.

Related

Can execution of CUDA kernels from two contexts overlap?

From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
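For context, the configuration described above amounts to roughly the following sketch, based on the activity_trace_async sample; the callback bodies are reduced stand-ins for the sample's versions:

    #include <cstdint>
    #include <cstdlib>
    #include <cupti.h>

    // Reduced stand-ins for the sample's buffer callbacks.
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords) {
        *size = 1024;                       // matches the reduced device buffer size
        *buffer = (uint8_t *)malloc(*size);
        *maxNumRecords = 0;                 // let CUPTI fit as many records as possible
    }

    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize) {
        // The sample walks the records with cuptiActivityGetNextRecord here.
        free(buffer);
    }

    void setupTracing() {
        // Enable only the activity kinds mentioned above.
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);
        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);

        // Shrink the device buffer size and pool limit as described.
        size_t value = 1024, attrSize = sizeof(size_t);
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                                  &attrSize, &value);
        value = 1;
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT,
                                  &attrSize, &value);
    }
    // Then, once per logical timestep: cuptiActivityFlushAll(0);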
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification: MPS not enabled, running on a single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how to set up the CUPTI activity traces. But two kernels can share a time span on a single GPU even without the MPS server, though only one will run on the GPU at a time.
If the CUDA MPS server is not in use, then kernels from different contexts cannot overlap. Assuming you're not using the MPS server, a time-sliced scheduler decides which context may access the GPU at any given time. Without MPS, a context can only access the GPU in the time slots that the scheduler assigns to it, so only kernels from a single context run on the GPU at any one time.
Note that it is possible for multiple kernels to share a time span on a GPU, but within that span only kernels from a single context can access the GPU resources (I am also assuming that you're using a single GPU).
For more information you can also check the MPS Service documentation.

How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (which is constant). The threads are independent except for reading from a constant array and the tree. Each thread can perform some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I used 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for N=100000 threads or fewer, but fails for larger N. It also works on the CPU, or on the GPU thread by thread. When N is large (e.g. >100000), the kernel crashes and then cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1 GB. cudaDeviceSetLimit succeeded, but the problem remains.
Does anyone know a common reason for the above problem? Or any hints for further debugging? I tried putting in some printf's, but there is a ton of output. Moreover, once a thread crashes, all remaining printf output is discarded. Thus it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive, or the error codes may steer you in another direction. A minimal error-checking sketch follows below.
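As a sketch of the usual pattern (the macro name is arbitrary, not from any particular library):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Wrap every CUDA API call; print and abort on the first error.
    #define CUDA_CHECK(call)                                             \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",            \
                        cudaGetErrorString(err), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    // Usage around a kernel launch:
    //     myKernel<<<blocks, threads>>>(args);
    //     CUDA_CHECK(cudaGetLastError());       // catches launch errors
    //     CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors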
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA GPUs with compute capability 2.x or greater can execute up to 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is: what would be the expected behavior of these kernels? Will they execute concurrently as well, or, since they are launched by different processes, will they execute sequentially?
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question: if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised, and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing, and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
To add to the answer of @talonmies:
On newer architectures, multiple processes can launch multiple kernels concurrently by using MPS. So it is now definitely possible, which was not the case before. For a detailed understanding, read this article:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can also see the maximum number of concurrent kernels allowed per CUDA compute capability, as supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example, a GPU with CUDA compute capability 7.5 can run a maximum of 128 concurrent kernels.
Do you really need to have separate threads and contexts?
I believe the best practice is to use one context per GPU, because multiple contexts on a single GPU bring significant overhead.
To execute many kernels concurrently, you should create several CUDA streams in one CUDA context and queue each kernel into its own stream; they will then be executed concurrently if there are enough resources for it. A minimal sketch follows below.
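Something like the following sketch, with a hypothetical kernel named work, illustrates the one-context, many-streams pattern:

    #include <cuda_runtime.h>

    __global__ void work(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    }

    int main() {
        const int kStreams = 4, n = 4096;  // small grids, so overlap is plausible
        cudaStream_t streams[kStreams];
        float *buf[kStreams];

        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&buf[s], n * sizeof(float));
            // One kernel per stream: no expressed dependency between
            // streams, so the kernels are free to run concurrently.
            work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
        }

        cudaDeviceSynchronize();
        for (int s = 0; s < kStreams; ++s) {
            cudaFree(buf[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }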
If you need to make the context accessible from several CPU threads, you can use cuCtxPopCurrent() and cuCtxPushCurrent() to pass it around, but only one thread will be able to work with the context at any time.
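A minimal sketch of that hand-off, assuming a context ctx that was created elsewhere with the driver API:

    #include <cuda.h>

    CUcontext ctx;  // created once, e.g. with cuCtxCreate(&ctx, 0, dev)

    void threadWork() {
        cuCtxPushCurrent(ctx);  // make the shared context current on this thread
        // ... launch kernels, copy memory, etc. in this context ...
        cuCtxPopCurrent(NULL);  // release it so another thread can push it
    }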

Difference in time reported by NVVP and counters

I've been running kernels in CUDA programs. I observe that there is a considerable difference between the time reported by GPU counters and the time reported by NVVP for kernel execution. Why is such a difference usually observed?
Nsight Visual Studio Edition and the Visual Profiler support two mechanisms for capturing the duration of a kernel. Both of these methods will result in a value smaller and more accurate than what is reported by CUevent/cudaEvent. The methods are as follows:
Concurrent Kernel Timing
This is the default mode used by Nsight 2.x and Visual Profiler 5.0 to generate a timeline. The duration of a kernel is defined as the time from when the kernel code starts executing on the device to the time that it completes. This cannot be measured using CUDA events.
Serialized Kernel Timing
This is the default mode used by tools when collecting PM counters for each kernel. The duration of a kernel is defined as the time the GPU processes the launch request until the GPU idles after completion of the kernel. This mode specifically disables concurrent kernel execution. In almost all cases the reported duration will be slightly larger than the concurrent kernel trace duration as it includes time for the GPU to launch the first block and time for the GPU to complete all memory stores.
CUDA Event Range Timing
CUDA event timing is done by calling cu/cudaEventRecord before and after the kernel launch on the same stream. Each event record inserts a command into the GPU push buffer. When the command reaches the GPU, it writes a timestamp to memory. It is possible to push two event records without a launch; this allows a developer to measure the GPU time between the two timestamp commands. This method has the following disadvantages, which is why I encourage developers to use the tools (Nsight, Visual Profiler, and CUPTI); a sketch of the event-timing pattern follows the list:
a. The elapsed time between submitting the start event record and the launch can be affected by CPU overhead. Launch overhead is 5-8µs on Linux/TCC and potentially much higher on WDDM.
b. The GPU can context switch between the start event record and the kernel execution.
c. The start event record will include launch overhead including time to update driver buffers that need to be resized, copy parameters, copy texture bindings, ...
d. The elapsed time between submitting the kernel and the end event record can impact the timing.
e. The GPU can context switch between the end of the kernel execution and the end event record.
f. Incorrect use of events will break concurrent kernel execution.
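For reference, here is a minimal sketch of the event-timing pattern described above; the kernel is a hypothetical placeholder:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel() { /* ... */ }

    int main() {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);   // timestamp command pushed before the launch
        myKernel<<<1, 1>>>();
        cudaEventRecord(stop, 0);    // timestamp command pushed after the launch
        cudaEventSynchronize(stop);  // wait for the stop timestamp to be written

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("measured: %.3f ms\n", ms);  // subject to the effects in (a)-(f) above

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }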
The duration provided by each of these modes will be different. Furthermore, the definition of duration used by the tools differs from the one available through the use of events.
The NVIDIA tools define duration as best as possible as the time from when the GPU starts working on the kernel to when the GPU completes work on the kernel. If a developer is interested in collecting this information they should look at the CUPTI SDK included with the toolkit.

Time between Kernel Launch and Kernel Execution

I'm trying to optimize my CUDA program by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consists of a copy of host data to the device, execution of the kernels, and a copy of the results from the device back to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the launch of the kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 µs to launch the kernel, which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore, there is no overlap of kernel execution with the copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can tell by the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA driver API trace enabled. If you were to disable CUDA runtime trace, I would guess that you would reduce the width by 50 µs.
Since you are on a GTX 480 on Windows 7, you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead, the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications.
It appears you are submitting your work to streams in a depth-first manner. On compute capability 2.x and 3.0 devices, you will have better results if you submit to streams in a breadth-first manner. In your case, you may then see overlap between your kernels. A sketch of the two submission orders follows below.
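For illustration, here is a sketch of the two submission orders; the kernel, buffer names, and counts are hypothetical:

    #include <cuda_runtime.h>

    __global__ void step(float *d) { d[threadIdx.x] += 1.0f; }

    void submit(cudaStream_t *streams, float **dBuf, float **hBuf,
                int kStreams, size_t bytes, bool breadthFirst) {
        if (!breadthFirst) {
            // Depth-first: all work for one stream before the next. On
            // CC 2.x/3.0 the single hardware queue can serialize this.
            for (int s = 0; s < kStreams; ++s) {
                cudaMemcpyAsync(dBuf[s], hBuf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
                step<<<1, 256, 0, streams[s]>>>(dBuf[s]);
                cudaMemcpyAsync(hBuf[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
            }
        } else {
            // Breadth-first: issue each phase across all streams, giving
            // the scheduler more opportunity to overlap the kernels.
            for (int s = 0; s < kStreams; ++s)
                cudaMemcpyAsync(dBuf[s], hBuf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
            for (int s = 0; s < kStreams; ++s)
                step<<<1, 256, 0, streams[s]>>>(dBuf[s]);
            for (int s = 0; s < kStreams; ++s)
                cudaMemcpyAsync(hBuf[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
        }
    }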
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting only after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.
If you are waiting on all streams to complete, it is likely faster to do a single cudaDeviceSynchronize call than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.