CUDA blocks & warps - which can run in parallel on a single SM? - cuda

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to a SM, but nothing as to how they're ran. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads on them. Also if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel aswell? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!

The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issue instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Tabe 10. Technical Specifications per Compute Capability : Maximum number of residents blocks per multiprocessor to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q:From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.
Q: Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel as well?
Yes, warps can run in parallel.
Q: Because in the Fermi architecture it states that 2 warps are executed concurrently
Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. CUDA cores is the number of integer and floating point execution units. The SM has other types of execution units which I listed above. The GTX550 is a CC 2.1 device. On each cycle a SM has the potential to dispatch at most 4 instructions (128 threads) per cycle. Depending on the definition of execution the total threads in flight per cycle can range from many hundreds to many thousands.

I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.

One of the concepts that took a whle to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.
Consequently, a context-switch occurs on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on theri memory accesses. ne of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiples cores.
Best performance is achieved when no core is ever waiting on a memory access, and is achieved by having just enough contexts to ensure this happens.

Related

threadblocks and stream multiprocessors in C

I was told that in order to have a fully efficient CUDA C program, the number of threadblocks should be at least 3 or 4 times the number of stream multiprocessors.
My questions is: is the statement true? If yes/no why? what should the ratio ideally be?
It's generally a good idea to have multiple threadblocks that can be launched on each SM, when a kernel launches.
The GPU is a latency-hiding machine. In order to hide latency, it needs as much potential work as possible. Potential work can be translated into "warps that are ready to execute". This scenario can be maximized by having more than one threadblock per SM.
At some point, the GPU (SMs) run out of resources to host additional threadblocks. This running-out point might occur at around 3-4 threadblocks per SM, depending on the specifics of resource usage (registers, threads, shared memory, etc.) of the threadblocks, and the GPU type. Therefore, launching more than the amount that can be actually scheduled on the SMs won't help with concurrency, latency-hiding, occupancy, or other figures of merit for a parallel program. Those threadblocks will just wait until scheduling slots open up on the SMs.
There is no fixed ratio, but an analysis of typical threadblocks with 256 or 512 threads per block suggest that you will want at least 3-8 threadblocks per SM, to maximize occupancy (this varies based on GPU architecture as well). With 1024 threads per block, it might only require 2 threadblocks per SM.
GPU programs typically don't dramatically slow down if work is partitioned into more threadblocks, so the numbers are not a hard-and-fast rule, and the actual behavior will depend on other factors like shared memory usage (if any). It's just a general guideline.

blocks, threads, warpSize

There has been much discussion about how to choose the #blocks & blockSize, but I still missing something. Many of my concerns address this question: How CUDA Blocks/Warps/Threads map onto CUDA Cores? (To simplify the discussion, there is enough perThread & perBlock memory. Memory limits are not an issue here.)
kernelA<<<nBlocks, nThreads>>>(varA,constB, nThreadsTotal);
1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?
2) An SM can only execute one kernel at a time. That is all HWcores of that SM are executing only kernelA. (Not some HWcores running kernelA, while others run kernelB.) So if I have only one thread to run, I'm "wasting" the other HWcores. True?
3)If the warp-scheduler issues work in units of warpSize (32 threads), and each SM has 32 HWcores, then the SM would be full utilized. What happens when the SM has 48 HWcores? How can I keep all 48 cores full utilized when the scheduler is issuing work in chunks of 32? (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of HWcore size?)
4) It looks like the warp-scheduler queues up 2 tasks at a time. So that when the currently-executing kernel stalls or blocks, the 2nd kernel is swapped in. (It is not clear, but I'll guess the queue here is more than 2 kernels deep.) Is this correct?
5) If my HW has an upper limit of 512 threads-per-block (nThreadsMax), that doesn't mean the kernel with 512 threads will run fastest on one block. (Again, mem not an issue.) There is a good chance I'll get better performance if I spread the 512-thread kernel across many blocks, not just one. The block is executed on one or many SM's. True?
5a) I'm thinking the smaller the better, but does it matter how small I make nBlocks? The question is, how to choose the value of nBlocks that is decent? (Not necessarily optimal.) Is there a mathematical approach to choosing nBlocks, or is it simply trial-n-err.
1) Yes.
2) CC 2.0 - 3.0 devices can execute up to 16 grids concurrently. Each SM is limited to 8 blocks so in order to reach full concurrency the device has to have at least 2 SMs.
3) Yes the warp schedulers select and issue warps at at time. Forget the concept of CUDA cores they are irrelevant. In order to hide latency you need to have high instruction level parallelism or a high occupancy. It is recommended to have >25% for CC 1.x and >50% for CC >= 2.0. In general CC 3.0 requires higher occupancy than 2.0 devices due to the doubling of schedulers but only a 33% increase in warps per SM. The Nsight VSE Issue Efficiency experiment is the best way to determine if you had sufficient warps to hide instruction and memory latency. Unfortunately, the Visual Profiler does not have this metric.
4) The warp scheduler algorithm is not documented; however, it does not consider which grid the thread block originated. For CC 2.x and 3.0 devices the CUDA work distributor will distribute all blocks from a grid before distributing blocks from the next grid; however, this is not guaranteed by the programming model.
5) In order to keep the SM busy you have to have sufficient blocks to fill the device. After that you want to make sure you have sufficient warps to reach a reasonable occupancy. There are both pros and cons to using large thread blocks. Large thread blocks in general use less instruction cache and have smaller footprints on cache; however, large thread blocks stall at syncthreads (SM can become less efficient as there are less warps to choose from) and tend to keep instructions executing on similar execution units. I recommend trying 128 or 256 threads per thread block to start. There are good reasons for both larger and smaller thread blocks.
5a) Use the occupancy calculator. Picking too large of a thread block size will often cause you to be limited by registers. Picking too small of a thread block size can find you limited by shared memory or the 8 blocks per SM limit.
Let me try to answer your questions one by one.
That is correct.
What do you mean exactly by "HWcores"? The first part of your statement is correct.
According to the NVIDIA Fermi Compute Architecture Whitepaper: "The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream".
Furthermore, the NVIDIA Keppler Architecture Whitepaper states: "Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle."
The "excess" cores are therefore used by scheduling more than one warp at a time.
The warp scheduler schedules warps of the same kernel, not different kernels.
Not quite true: Each block is locked in to a single SM, since that's where its shared memory resides.
That's a tricky issue and depends on how your kernel is implemented. You may want to have a look at the nVidia Webinar Better Performance at Lower Occupancy by Vasily Volkov which explains some of the more important issues. Primarily, though, I would suggest you choose your thread count to improve occupancy, using the CUDA Occupancy Calculator.

Relation between number of blocks of threads and cuda cores on machine (in CUDA C)

I have CUDA 2.1 installed on my machine and it has a graphic card with 64 cuda cores.
I have written a program in which I initialize simultaneously 30000 blocks (and 1 thread per block). But am not getting satisfying results from the gpu (It performs slowly than the cpu)
Is it that the number of blocks must be smaller than or equal to the number of cores for good performance? Or is it that the performance has nothing to do with number of blocks
CUDA cores are not exactly what you might call a core on a classical CPU. Indeed, they have to be viewed as nothing more than ALUs (Arithmetic and Logic Units), which are just able to compute ready operations.
You might know that threads are handled per warps (groups of 32 threads) inside the blocks you've defined. When your blocks are dispatched on the different SMs (Streaming Multiprocessors, they are the actual cores of the GPU), each SM schedules warps within a block to optimize the computation time in regard of the memory access time needed to get threads' input data.
The problem is threads are always handled through their belonging warp, so if you have only one thread per block, the SM it is running on won't be able to schedule through warps and you won't take advantage of the multiple CUDA cores available. Your CUDA cores will be waiting for data to process, since CUDA cores compute far quicker than data are retrieved through memory.
Having lots of blocks with few threads is not what the GPU is awaiting. In this case, you face the block per SM limitation (this number depends on your device), which force your GPU to spend a lot of time to put blocks on SM and then remove them to treat the next ones. You should rather increase the number of threads in your blocks instead of the number of blocks in your application.
The warp size in all current CUDA hardware is 32. Using less than 32 threads per block (or not using a round multiple of 32 threads per block) just wastes cycles. As it stands, using 1 thread per block is leaving something like 95% of the ALU cycles of your GPU idle. That is the underlying reason for the poor performance.

CUDA determining threads per block, blocks per grid

I'm new to the CUDA paradigm. My question is in determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary number chosen for these things.
I'm considering a problem where I would be able to pass matrices - of any size - to a method for multiplication. So that, each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block, blocks/grid in this case?
In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.
A CUDA enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32 bit registers, shared memory, a maximum number of active blocks, AND a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means if you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048 thread limit. If you use 256 threads, you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However using 64 threads per block will only use 1024 threads when the 16 block limit is hit, so only 50% occupancy. If shared memory and register usage is not a bottleneck, this should be your main concern (other than your data dimensions).
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.
Regarding registers and shared memory, look at that next, as it may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the wikipedia page.
https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
You also need to consider shared memory because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads-per-block might be advantageous.
For example, in terms of context switching, any multiple of 32 works just the same. So for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into 1 length-64 vector, then the first option will be better (less memory overhead, every thread can access the same shared memory) than the second.
There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small amount of threads per block when a function offloaded to the GPU has several barriers, however, there are experiments showing that for some applications a small number of threads per block increases the overhead of synchronizations, imposing a larger overhead. In contrast, a larger number of threads per block may decrease the amount of synchronizations and improve the overall performance.
For an in-depth discussion (too lengthy for StackOverflow) about the impact of the number of threads per block on CUDA kernels, check this journal article, it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.

How much is run concurrently on a GPU given its numbers of SM's and SP's?

i am having some troubles understanding threads in NVIDIA gpu architecture with cuda.
please could anybody clarify these info:
an 8800 gpu has 16 SMs with 8 SPs each. so we have 128 SPs.
i was viewing Stanford University's video presentation and it was saying that every SP is capable of running 96 threads concurrently. does this mean that it (SP) can run 96/32=3 warps concurrently?
moreover, since every SP can run 96 threads and we have 8 SPs in every SM. does this mean that every SM can run 96*8=768 threads concurrently?? but if every SM can run a single Block at a time, and the maximum number of threads in a block is 512, so what is the purpose of running 768 threads concurrently and have a max of 512 threads?
a more general question is:how are blocks,threads,and warps distributed to SMs and SPs? i read that every SM gets a single block to execute at a time and threads in a block is divided into warps (32 threads), and SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU it is executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU, however to get performance you need to understand the hardware too which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads (if you have 128 threads in a block (for example) then threads 0-31 will be in one warp, 32-63 in the next and so on. Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together, if one thread within a warp goes down the 'if' side of a if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem, those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is important.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.