CUDA: what does a stream abstract? - cuda

In the cuda C programming guide, stream is defined very abstractly: a sequence of cuda operations that are executed in order they are issued by the code.
My understanding of how instructions are executed in Nvidia GPU is: when a kernel is launched, the blocks are distributed to SMs in the device. Then the warps ( groups of 32 threads ) are schedueled by a warp schedueler in the SM for instructions to be processed warp-wise.
So, if two kernels are launched in the same stream, then the first is processed before the second ( since the instructions are processed in the order they are put in the stream ). Does that mean two kernels end up only using hardware resource of one kernel? Or does each kernel have their own resources, but the second one is pending until the first is complete?
And in general, how are streams implemented in hardware? I assume it provides ordering to the warp scheduler ( but then a warp scheduler is per-SM based, so how would this allow multi-SM kernels to use stream?).

A CUDA stream is merely a queue of actions to be performed by the GPU.
Every function through API can be issued in an asynchronous way - the CPU code continues while the instruction waits to be executed independently from the host code. Still, it is executed sychronously with respect to other instructions in the queue/stream.
If you want multiple operations on the GPU to be executed asynchronously, you need two or more queues/streams. For example, there is a chapter in the CUDA manual on how to mix kernel execution (first stream) with memory transfers (second stream).

Related

How to understand "All threads in a warp execute the same instruction at the same time." in GPU?

I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:
CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.
The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unifed synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.
The first paragraph mentions "All threads in a warp execute the same instruction at the same time.", while in the second paragraph, it says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior.". It makes me confused, and the above statements seems contradictory. Could anyone can explain it?
There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching CUDA introduces two concepts in the SIMT model
Predicated execution (See here)
Instruction replay/serialisation (See here)
Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction without a branch. Instruction replay is how a classic conditional branch is dealt with. All threads execute all branches of the conditionally executed code by replaying instructions. Threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, because it has a significant impact on performance.
This is how lock-step execution can support branching.

Behaviour of concurrent kernel and CUDA streams

I was wondering, if I run a kernel with 10 blocks of 1000 threads in one stream to analyse an array of data, and then launch a kernel that requires 10 blocks of 1000 threads to analyse another array in a second stream, what is going to happen?
Are the un-active threads on my card going to begin the process of analysing my second array ?
or is the second stream going to be paused until the first stream will have to finish ?
Thank you.
Generally speaking, if the kernels are issued from different (non-default) streams of the same application, and all requirements for execution of concurrent kernels are met, and there are enough resources available (SMs, especially -- I guess this is what you mean by "un-active threads") to schedule both kernels, then some of the blocks of the second kernel will begin executing along side of the blocks of the first kernel that are already executing. This may occur on the same SMs that the blocks of the first kernel are already scheduled on, or it may occur on other, unoccupied SMs, or both (for example if your GPU has 14 SMs, the work distributor would distribute the 10 blocks of the first kernel on 10 of the SMs, leaving 4 that are unused at that point.)
If on the other hand, your kernels had threadblocks requiring 32KB of shared memory usage, and your GPU had 8 SMs, then the threadblocks of the first kernel would effectively "use up" the 8 SMs, and the threadblocks of the second kernel would not begin executing until some of the threadblocks of the first kernel had "drained" i.e. completed and been retired. That's just one example of resource utilization that could inhibit concurrent execution. And of course, if you were launching kernels with many threadblocks each (say 100 or more) then the first kernel would mostly occupy the machine, and the second kernel would not begin executing until the first kernel had largely finished.
If you search in the upper right hand corner on "cuda concurrent kernels" you'll find a number of questions that highlight some of the challenges associated with observing concurrent kernel execution.

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA gpus with compute capability 2.x or greater can execute u pto 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is that what would be the expected behavior of these kernels. Will they execute concurrently as well or, since they are launched by different processes, they would execute sequentially.
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better to launch a single worker thread holding a GPU context and use messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
To add to the answer of #talonmies
In the newer architectures, by the use of MPS multiple processes can launch multiple kernels concurrently. So, now it is definitely possible which was not sometime before. For a detailed understanding read this article.
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can also see maximum number of concurrent kernels allowed per cuda compute capability type supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example a GPU with cuda compute capability of 7.5 can have maximum of 128 Cuda kernels launched to it.
Do you really need to have separate threads and contexts?
I believe that best practice is a usage one context per GPU, because multiple contexts on single GPU bring a sufficient overhead.
To execute many kernels concrurrenlty you should create few CUDA streams in one CUDA context and queue each kernel into its own stream - so they will be executed concurrently, if there are enough resources for it.
If you need to make the context accessible from few CPU threads - you can use cuCtxPopCurrent(), cuCtxPushCurrent() to pass them around, but only one thread will be able to work with the context at any time.

What happen if a CUDA kernel is called from multiple pthreads simultaneously?

I have a CUDA kernel that do my hard work, but I also have some hard work that need to be done in the CPU (calculations with two positions of the same array) that I could not write in CUDA (because CUDA threads are not synchronous, I need to perform a hard work on a position X of an array and after do z[x] = y[x] - y[x - 1], where y is the array resultant of a CUDA kernel where each thread works on one position of this array and z is another array storing the result). So I'm doing this in the CPU.
I have several CPU threads to do the CPU side work, but each one is calling a CUDA kernel passing some data. My question is: what happens on the GPU side when multiple CPU threads are making GPU calls? Would be better if I do the CUDA kernel call once and then create multiple CPU threads to do the CPU side work?
Kernel calls are queued and executed one by one in single stream.
However you can specify stream during kernel execution - then CUDA operations in different streams may run concurrently and operations from different streams may be interleaved.
Default stream is 0.
See: CUDA Streams and Concurrency
Things are similar when different processes use the same card.
Also remember that kernels are executed asynchronously from CPU stuff.
On CUDA 4.0 and later, multiple threads can share the same CUDA context, hence no need for cuPush/PopContext anymore. You just need to call cudaSetDevice for each thread. Then, mentioned #dzonder, you can run multple kernels simulatenously from different threads with streams.

Cuda: Kernel launch queue

I'm not finding much info on the mechanics of a kernel launch operation. The API say to see the CudaProgGuide. And I'm not finding much there either.
Being that kernel execution is asynch, and some machines support concurrent execution, I'm lead to believe there is a queue for the kernels.
Host code:
1. malloc(hostArry, ......);
2. cudaMalloc(deviceArry, .....);
3. cudaMemcpy(deviceArry, hostArry, ... hostToDevice);
4. kernelA<<<1,300>>>(int, int);
5. kernelB<<<10,2>>>(float, int));
6. cudaMemcpy(hostArry, deviceArry, ... deviceToHost);
7. cudaFree(deviceArry);
Line 3 is synchronous. Line 4 & 5 are asynchronous, and the machine supports concurrent execution. So at some point, both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes, before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Yes, there are a variety of queues on the GPU, and the driver manages those.
Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls are designated by the suffix Async if they are asynchronous. So to answer your question:
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
There are various queues. The GPU blocks/stalls the host on a synchronous call, but the kernel launch is not a synchronous operation. It returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observed overlapped execution for two kernels launched to the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code) but the preceding description is sufficient for understanding this question.
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and CUDA driver manage the execution sequence of cuda calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested to the device has been completed.
If the results of kernelB are independent from the results on kernelA, you can set this function right before the memory copy operation. If not, you will need to block the device before calling kernelB, resulting in two blocking operations.