Behaviour of concurrent kernels and CUDA streams

I was wondering: if I launch a kernel with 10 blocks of 1000 threads in one stream to analyse an array of data, and then launch a second kernel that also needs 10 blocks of 1000 threads to analyse another array in a second stream, what is going to happen?
Are the inactive threads on my card going to start analysing my second array?
Or is the second stream going to be paused until the first stream has finished?
Thank you.

Generally speaking, if the kernels are issued from different (non-default) streams of the same application, all requirements for concurrent kernel execution are met, and there are enough resources available (SMs especially; I guess that is what you mean by threads that are not active) to schedule both kernels, then some of the blocks of the second kernel will begin executing alongside the blocks of the first kernel that are already executing. This may occur on the same SMs that the blocks of the first kernel are already scheduled on, on other, unoccupied SMs, or both. For example, if your GPU has 14 SMs, the work distributor would distribute the 10 blocks of the first kernel across 10 of the SMs, leaving 4 unused at that point.
If, on the other hand, your kernels had threadblocks requiring 32KB of shared memory each, and your GPU had 8 SMs (with, say, 48KB of shared memory per SM, so that only one such threadblock fits on an SM at a time), then the threadblocks of the first kernel would effectively "use up" the 8 SMs, and the threadblocks of the second kernel would not begin executing until some of the threadblocks of the first kernel had "drained", i.e. completed and been retired. That's just one example of resource utilization that could inhibit concurrent execution. And of course, if you were launching kernels with many threadblocks each (say 100 or more), then the first kernel would mostly occupy the machine, and the second kernel would not begin executing until the first kernel had largely finished.
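For what it's worth, the shared-memory example can be put into numbers. This is only a rough sketch in Python: the 48KB of shared memory per SM and the limit of 16 resident blocks per SM are assumed values (they vary by architecture), the function names are made up for illustration, and real residency also depends on registers and thread counts.

```python
def blocks_per_sm(smem_per_sm_kb, smem_per_block_kb, max_blocks_per_sm):
    """Blocks that fit on one SM when shared memory is the only limit considered."""
    if smem_per_block_kb == 0:
        return max_blocks_per_sm
    return min(max_blocks_per_sm, smem_per_sm_kb // smem_per_block_kb)


def free_block_slots(num_sms, per_sm, blocks_launched):
    """Block slots left on the whole GPU once the first kernel's blocks are resident."""
    return max(0, num_sms * per_sm - blocks_launched)


# Assumed hardware limits (hypothetical, not taken from the question).
SMEM_PER_SM_KB = 48
MAX_BLOCKS_PER_SM = 16

per_sm = blocks_per_sm(SMEM_PER_SM_KB, 32, MAX_BLOCKS_PER_SM)
print("blocks per SM:", per_sm)
print("free slots for the second kernel:", free_block_slots(8, per_sm, 10))
```

With these assumed limits the first kernel's 10 blocks already exceed the 8 available slots, so the second kernel has to wait for blocks to retire.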
If you search in the upper right hand corner on "cuda concurrent kernels" you'll find a number of questions that highlight some of the challenges associated with observing concurrent kernel execution.

Related

How does a CUDA kernel work on multiple blocks, each of which takes a different amount of time?

Assume that we run a kernel function with 4 blocks {b1, b2, b3, b4}. The blocks require {10, 2, 3, 4} units of time, respectively, to complete their job, and our GPU can process only 2 blocks in parallel.
In that case, which schedule does the GPU actually follow?
To quote this document from Nvidia:
Threadblocks are assigned to SMs
Assignment happens only if an SM has sufficient resources for the entire threadblock
Resources: registers, SMEM, warp slots
Threadblocks that haven’t been assigned wait for resources to free up
The order in which threadblocks are assigned is not defined
Can and does vary between architectures
Thus, without more information, both schedules are theoretically possible. In practice, this is even more complex, since there are many SMs on a GPU and, AFAIK, each SM can now execute multiple blocks concurrently.
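To see why the assignment order matters, here is a small toy simulation in Python. It is not how the hardware scheduler works internally (that is unspecified); it simply assumes a greedy policy where each block, in some given order, starts on whichever of the 2 slots frees up first. The function name `simulate` is made up for this sketch.

```python
import heapq


def simulate(durations, slots):
    """Greedy scheduling: each block, in the given order, takes the slot that frees up first.

    Returns the list of (start, end) times per block and the overall finish time.
    """
    free_at = [0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    schedule = []
    for d in durations:
        start = heapq.heappop(free_at)
        end = start + d
        schedule.append((start, end))
        heapq.heappush(free_at, end)
    return schedule, max(end for _, end in schedule)


# The question's durations in issue order, then the same blocks in another order.
print(simulate([10, 2, 3, 4], slots=2))
print(simulate([2, 3, 4, 10], slots=2))
```

Under this assumed policy the same four blocks finish at different times depending on the order in which they are picked, which is why the total time cannot be predicted without knowing the order.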

CUDA: what does a stream abstract?

In the CUDA C programming guide, a stream is defined very abstractly: a sequence of CUDA operations that are executed in the order they are issued by the code.
My understanding of how instructions are executed on an Nvidia GPU is: when a kernel is launched, the blocks are distributed to the SMs in the device. Then the warps (groups of 32 threads) are scheduled by a warp scheduler in the SM so that instructions are processed warp-wise.
So, if two kernels are launched in the same stream, then the first is processed before the second (since the operations are processed in the order they are put in the stream). Does that mean the two kernels end up using only the hardware resources of one kernel? Or does each kernel have its own resources, with the second one pending until the first is complete?
And in general, how are streams implemented in hardware? I assume a stream provides ordering to the warp scheduler (but then a warp scheduler is per-SM, so how would this allow multi-SM kernels to use a stream?).
A CUDA stream is merely a queue of actions to be performed by the GPU.
Many API calls (kernel launches and the Async variants of memory copies, for instance) can be issued asynchronously: the CPU code continues while the operation waits in the queue to be executed, independently of the host code. It is still executed synchronously with respect to the other operations in the same queue/stream.
If you want multiple operations on the GPU to be executed asynchronously, you need two or more queues/streams. For example, there is a chapter in the CUDA manual on how to mix kernel execution (first stream) with memory transfers (second stream).
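To make the queue picture concrete, here is a toy model in Python (not CUDA code, and not how the driver is implemented). Each stream is a FIFO list of (name, duration) pairs, operations within a stream run back to back, and different streams are assumed to be fully independent, i.e. the GPU is assumed to have enough free resources and copy engines. All names and durations are invented for illustration.

```python
def stream_timeline(streams):
    """Return {op_name: (start, end)} for a dict of stream_id -> [(op_name, duration), ...]."""
    timeline = {}
    for ops in streams.values():
        t = 0  # every stream starts at time 0 in this toy model
        for name, duration in ops:
            timeline[name] = (t, t + duration)
            t += duration  # the next op in the same stream waits for this one
    return timeline


# Copy and kernel in the same stream: the kernel waits for the copy.
one_stream = stream_timeline({"s0": [("copy", 4), ("kernel", 5)]})
# Copy and kernel in different streams: they are allowed to overlap.
two_streams = stream_timeline({"s0": [("copy", 4)], "s1": [("kernel", 5)]})

print(one_stream)
print(two_streams)
```

The only thing the model is meant to show is the ordering rule: within a stream, in order; across streams, no ordering is implied.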

Concurrent Kernels with CUSPARSE Library

I would like to ask a question about concurrent kernel execution on Nvidia GPUs. Let me explain my situation. I have code that launches one sparse matrix multiplication for each of 2 different matrices. These matrix multiplications are performed with the cuSPARSE library. I want both operations to run concurrently, so I use 2 streams to launch them. With the Nvidia Visual Profiler, I've observed that both operations (cuSPARSE kernels) are completely overlapped. The time stamps for both kernels are:
Kernel 1) Start Time: 206.205 ms - End Time: 284.177 ms.
Kernel 2) Start Time: 263.519 ms - End Time: 278.916 ms.
I'm using a Tesla K20c with 13 SMs, each of which can execute up to 16 blocks. Both kernels have 100% occupancy and launch a large enough number of blocks:
Kernel 1) 2277 blocks, 32 registers/thread, 1.156 KB shared memory.
Kernel 2) 46555 blocks, 32 registers/thread, 1.266 KB shared memory.
With this configuration, both kernels shouldn't show this behaviour, since both kernels launch enough blocks to fill all SMs of the GPU. However, the Nvidia Visual Profiler shows that these kernels overlap. Why? Could anyone explain why this behaviour can occur?
Many thanks in advance.
"With this configuration, both kernels shouldn't show this behaviour, since both kernels launch enough blocks to fill all SMs of the GPU."
I think this is an incorrect statement. As far as I know, the low-level scheduling behavior of blocks is not specified in CUDA.
Newer devices (cc3.5+ with Hyper-Q) can more easily schedule blocks from concurrent kernels at the same time.
So, if you launch 2 kernels (A and B) concurrently, each with a large number of blocks, then you may observe any of the following:
blocks from kernel A execute concurrently with kernel B
all (or nearly all) of the blocks of kernel A execute before kernel B
all (or nearly all) of the blocks of kernel B execute before kernel A
Since there is no specification at this level, there is no direct answer. Any of the above are possible. The low level block scheduler is free to choose blocks in any order, and the order is not specified.
If a given kernel launch "completely saturates" the machine (i.e. uses enough resources to fully occupy the machine while it is executing) then there is no reason to think that the machine has extra capacity for a second concurrent kernel. Therefore there would be no reason to expect much, if any, speed up from running the two kernels concurrently as opposed to sequentially. In such a scenario, whether they execute concurrently or not, we would expect the total execution time for the 2 kernels running concurrently to be approximately the same as the total execution time if the two kernels are launched or scheduled sequentially (ignoring tail effects and launch overheads, and the like).
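As a sanity check on the numbers in the question, here is a short Python sketch. It reads the profiler time stamps with the decimal commas as decimal points (an assumption about the original notation), and it treats 13 SMs x 16 blocks as an upper bound on resident blocks, as stated by the asker. The helper name `overlap_ms` is made up for this example.

```python
def overlap_ms(a, b):
    """Length of the time window shared by two (start, end) intervals, in ms."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


kernel1 = (206.205, 284.177)   # (start, end) in ms, from the question
kernel2 = (263.519, 278.916)

shared = overlap_ms(kernel1, kernel2)
k2_inside_k1 = kernel1[0] <= kernel2[0] and kernel2[1] <= kernel1[1]

# Upper bound on blocks resident at once on the K20c described in the question.
resident_capacity = 13 * 16

print("overlap in ms:", round(shared, 3))
print("kernel 2 entirely inside kernel 1:", k2_inside_k1)
print("resident block capacity:", resident_capacity)
print("kernel 1 blocks exceed capacity:", 2277 > resident_capacity)
```

Since both grids are far larger than what can be resident at once, blocks have to be fed to the SMs over time anyway, and nothing in the programming model stops the block scheduler from drawing from both kernels as slots free up.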

Synchronization and CUDA scalability

I have read that CUDA prevents synchronization between threads from different blocks in order to allow scalability. How is that so, given that increasing the number of multiprocessors (MPs) in a device will increase execution speed whether threads are synchronized within the same block or between different blocks?
There are at least two related things to consider.
Scalability goes in both directions, up and down. The desire is that your CUDA program be able to run on a GPU other than the one(s) it was "designed" for. This implies it might be running on a larger GPU or on a smaller GPU. Suppose I had a synchronization requirement among 128 threadblocks (out of perhaps a larger program), which all must execute concurrently in order to exchange data and satisfy the synchronization requirement, before any of them can complete. If I run this program on a GPU with 16 SMs, each of which can schedule up to 16 threadblocks, it's reasonable to conclude the program might work, since the momentary capacity of the GPU is 256 threadblocks. But if I run the program on a GPU with 4 SMs, each of which can schedule 16 threadblocks, there is no circumstance under which the program can sensibly complete.
CUDA Scheduling will interfere (to some degree) with desired program execution. There is no guaranteed or specified order in which the CUDA scheduler may schedule threadblocks. Therefore, extending the above example, let's say for an 8 SM GPU (128 threadblocks momentary capacity), it might be that 127 of my 128 critical threadblocks get scheduled, while at the same time other, non-synchronization-critical threadblocks are scheduled. The 127 critical threadblocks will occupy 127 of the 128 available "slots", leaving only 1 slot remaining to process other threadblocks. The 127 critical threadblocks may well be idle, waiting for threadblock 128 to appear, consuming slots but not necessarily doing useful work. This will bring the performance of the GPU to a very low level, until the 128th threadblock eventually gets scheduled.
These are some examples of reasons why it is desirable not to have inter-threadblock synchronization requirements in the design of a CUDA program.
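The capacity argument in the examples above comes down to one comparison. As a minimal sketch in Python (the function name is invented, and the 16 threadblocks per SM figure is the one assumed in the examples above, not a universal constant):

```python
def can_all_be_resident(required_blocks, num_sms, blocks_per_sm=16):
    """True if all synchronizing threadblocks can be on the GPU at the same moment."""
    return required_blocks <= num_sms * blocks_per_sm


# 128 threadblocks that must all be resident before any of them can finish.
for sms in (16, 8, 4):
    capacity = sms * 16
    print(sms, "SMs, capacity", capacity, "->", can_all_be_resident(128, sms))
```

Note that the 8 SM case only passes on paper: as described above, it still depends on the scheduler happening to pick all 128 critical threadblocks, which is not guaranteed.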

How much is run concurrently on a GPU given its numbers of SM's and SP's?

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.
Could anybody please clarify the following:
An 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.
I was viewing Stanford University's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?
Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of running 768 threads concurrently when the maximum is 512?
A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads), and that SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU, it is executed as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
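As an illustration of how the two identifiers select data, here is the usual one-dimensional indexing idiom written out in Python. In a real kernel this would be `blockIdx.x * blockDim.x + threadIdx.x`; the block size (`blockDim`) is not mentioned above and is brought in here as the third ingredient. The function name is made up.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Element handled by a given thread in a 1D grid of 1D blocks."""
    return block_idx * block_dim + thread_idx


# A grid of 3 blocks with 4 threads each covers elements 0..11 exactly once.
block_dim = 4
covered = [global_index(b, block_dim, t) for b in range(3) for t in range(block_dim)]
print(covered)
```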
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you need to understand the hardware too, which means SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads: if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together, if one thread within a warp goes down the 'if' side of a if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem, those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is important.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
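The split of a block into warps described above is plain integer arithmetic. A quick Python sketch (helper names invented, 1D block assumed):

```python
WARP_SIZE = 32


def warp_id(thread_idx):
    """Which warp of its block a thread belongs to."""
    return thread_idx // WARP_SIZE


def lane_id(thread_idx):
    """Position of the thread inside its warp (0..31)."""
    return thread_idx % WARP_SIZE


def warps_per_block(threads_per_block):
    """Number of warps a block occupies; a partly filled last warp still counts."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


print(warp_id(31), warp_id(32), lane_id(33))
print(warps_per_block(128), warps_per_block(100))
```

The second call to `warps_per_block` shows why block sizes that are multiples of 32 are usually preferred: a block of 100 threads still occupies 4 warps, with most of the last one unused.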
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
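A back-of-the-envelope way to see the scaling argument, sketched in Python under two simplifying assumptions that are mine rather than the hardware's: all blocks take the same time, and each SM holds one block at a time by default. Function names are invented.

```python
import math


def waves(num_blocks, num_sms, blocks_per_sm=1):
    """Rounds of blocks needed if every block takes the same time."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))


def idle_sms(num_blocks, num_sms):
    """SMs left with nothing to do when the grid has fewer blocks than there are SMs."""
    return max(0, num_sms - num_blocks)


print(waves(240, 1), waves(240, 30))   # same grid on a 1 SM and a 30 SM GPU
print(waves(16, 30), idle_sms(16, 30)) # a small grid cannot use a big GPU
```

A grid of hundreds of blocks shrinks its number of rounds as SMs are added, while a grid of 16 blocks leaves part of a 30 SM GPU idle no matter what.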
The final point to make is that an SM can execute more than one block at any given time. This explains why an SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
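A stripped-down version of that calculation, as a Python sketch. It only uses the two limits quoted above (768 threads and 8 blocks per SM) and deliberately ignores registers and shared memory, which the real Occupancy Calculator does take into account. The function name is made up.

```python
def resident_blocks(threads_per_block, max_threads_per_sm=768, max_blocks_per_sm=8):
    """Blocks one SM can hold, considering only the thread and block-count limits."""
    return min(max_blocks_per_sm, max_threads_per_sm // threads_per_block)


for tpb in (512, 256, 128, 64):
    n = resident_blocks(tpb)
    print(tpb, "threads/block ->", n, "blocks,", n * tpb, "threads resident")
```

So a 512-thread block leaves a third of the SM's thread capacity unused, whereas 256-thread blocks fill it completely, and very small blocks run into the 8-block limit first.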
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.