NVIDIA Architecture: CUDA threads and thread blocks - cuda

This is mostly from the book "Computer Architecture: A Quantitative Approach."
The book states that groups of 32 threads are grouped and executed together in what's called the thread block, but shows an example with a function call that has 256 threads per thread block, and CUDA's documentation states that you can have a maximum of 512 threads per thread block.
The function call looks like this:
int nblocks = (n+255)/256
daxpy<<<nblocks,256>>>(n,2.0,x,y)
Could somebody please explain how thread blocks are structured?

The question is a little unclear in my opinion. I will highlight a difference between thread warps and thread blocks that I find important in hopes that it helps answer whatever the true question is.
The number of threads per warp is defined by the hardware. Often, a thread warp is 32 threads wide (NVIDIA) because the SIMD unit on the GPU has exactly 32 lanes of execution, each with its own ALU (this is not always the case as far as I know; some architectures have only 16 lanes even though thread warps are 32 wide).
The size of a thread block is user defined (although, constrained by the hardware). The hardware will still execute thread code in 32-wide thread warps. Some GPU resources, such as shared memory and synchronization, cannot be shared arbitrarily between any two threads on the GPU. However, the GPU will allow threads to share a larger subset of resources if they belong to the same thread block. That's the main idea behind why thread blocks are used.

Related

Why does each thread have its own instruction address counter inside a warp?

Warps in CUDA always include 32 threads, and all of these 32 threads run the same instruction when the warp is running in SM. The previous question also says each thread has its own instruction counter as quoted below.
Then why does each thread need its own instruction address counter if all the 32 threads always execute the same instruction, could the threads inside 1 warp just share an instruction address counter?
Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data
I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the authors intent.
However, an independent program counter per thread is considered to be a new feature in Volta, see figure 21 and caption in the volta whitepaper:
Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.
The same whitepaper probably does about as good a job as you will find of why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:
Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to
make better use of execution resources or to allow one thread to wait for data to be produced by
another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines
how to group active threads from the same warp together into SIMT units. This retains the high
throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads
can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in
Volta will still group together threads which are executing the same code and run them in parallel
for maximum efficiency
Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.

What will happen if the number of threads in a warp are less than 32?

I read that the number of threads in a warp can be 32 or more. why is that? if the number is less than 32 threads, does that mean the resources goes underutilized or we will not be able to tolerate memory latency?
Your question needs clarification - perhaps you are confusing the CUDA "warp" and "block" concepts?
Regarding warps, it's important to remember that warp and their size is a property of the hardware. Warps are a grouping of hardware threads that execute the same instruction (these days) every cycle. In other words, the size width indicates the SIMD-style execution width, something that the programmer can not change. In CUDA you launch blocks of threads which, when mapped to the hardware, get executed in warp-sized bunches. If you start blocks with thread count that is not divisible by the warp size, the hardware will simply execute the last warp with some of the threads "masked out" (i.e. they do have to execute, but without any effect on the state of the GPU/memory).
For more details I recommend reading carefully the hardware and execution-related sections of the CUDA programming guide.

CUDA blocks & warps - which can run in parallel on a single SM?

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to a SM, but nothing as to how they're ran. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads on them. Also if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel aswell? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!
The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issue instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Tabe 10. Technical Specifications per Compute Capability : Maximum number of residents blocks per multiprocessor to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q:From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.
Q: Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel as well?
Yes, warps can run in parallel.
Q: Because in the Fermi architecture it states that 2 warps are executed concurrently
Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. CUDA cores is the number of integer and floating point execution units. The SM has other types of execution units which I listed above. The GTX550 is a CC 2.1 device. On each cycle a SM has the potential to dispatch at most 4 instructions (128 threads) per cycle. Depending on the definition of execution the total threads in flight per cycle can range from many hundreds to many thousands.
I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.
One of the concepts that took a whle to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.
Consequently, a context-switch occurs on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on theri memory accesses. ne of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiples cores.
Best performance is achieved when no core is ever waiting on a memory access, and is achieved by having just enough contexts to ensure this happens.

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device compile and execute the cudaDeviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and type of your CUDA device. For example Fermi has 2 thread schedulers for each stream multiprocessor and for current CPU clock will be scheduled 2 half-warps for calculation or memory load or transcendent function calculation. While one half-warp wait load or executed transcendent function CUDA cores may execute anything else. So you can get 96 threads on cores but if your code may get it. And, of course, you must have enough memory.

How much is run concurrently on a GPU given its numbers of SM's and SP's?

i am having some troubles understanding threads in NVIDIA gpu architecture with cuda.
please could anybody clarify these info:
an 8800 gpu has 16 SMs with 8 SPs each. so we have 128 SPs.
i was viewing Stanford University's video presentation and it was saying that every SP is capable of running 96 threads concurrently. does this mean that it (SP) can run 96/32=3 warps concurrently?
moreover, since every SP can run 96 threads and we have 8 SPs in every SM. does this mean that every SM can run 96*8=768 threads concurrently?? but if every SM can run a single Block at a time, and the maximum number of threads in a block is 512, so what is the purpose of running 768 threads concurrently and have a max of 512 threads?
a more general question is:how are blocks,threads,and warps distributed to SMs and SPs? i read that every SM gets a single block to execute at a time and threads in a block is divided into warps (32 threads), and SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU it is executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU, however to get performance you need to understand the hardware too which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads (if you have 128 threads in a block (for example) then threads 0-31 will be in one warp, 32-63 in the next and so on. Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together, if one thread within a warp goes down the 'if' side of a if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem, those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is important.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.