Determining number of warps allowed in CUDA SM - cuda

So if a streaming multiprocessor can allow a maximum of X threads, while each block in the SM allows Y threads, how many warps can we have in a block and how many warps can we have in an SM?
Here is my take on this question:
(1) A warp consists of 32 threads. In a block we can have Y/32, right?
(2) As for the number of warps per SM, we cannot exceed X, the maximum number of threads in an SM, so we can have X/32, right? I hope somebody can confirm these calculations.

(1) Yes, rounding up if needed (i.e. if number of threads Y per block is not evenly divisible by 32)
(2) Yes, that is one limit on the number of warps that may be active. Remember that the SM scheduler works by scheduling blocks first. The number of blocks that will be scheduled is a function of available resources (registers, shared memory, threads, etc.). A block will only be scheduled when there are enough resources available to support its needs. For example, if I have 1024 threads per block, I can schedule at most 1 block on an SM, because the limit of 1536 threads per SM (using CC 2.0 as an example here) prevents 2 blocks from being scheduled. So in that case, even though your X/32 number predicts a maximum of 48 warps, only 1024/32 = 32 warps will be scheduled (using CC 2.0 as an example, with a block structure of 1024 threads per block).
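To make that arithmetic concrete, here is a minimal host-side sketch (the block size Y = 1000 is just a hypothetical example) that queries the per-device limits with cudaGetDeviceProperties and applies the two formulas:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int Y = 1000;                        // hypothetical threads per block
    int warpSize = prop.warpSize;        // 32 on all current GPUs

    // (1) warps per block, rounding up when Y is not a multiple of 32
    int warpsPerBlock = (Y + warpSize - 1) / warpSize;

    // (2) upper bound on warps per SM, set by the max resident threads per SM;
    //     registers, shared memory, and block-count limits can lower the real number
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / warpSize;

    printf("warps per block          : %d\n", warpsPerBlock);
    printf("max warps per SM (limit) : %d\n", maxWarpsPerSM);
    return 0;
}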

Related

Is there a correlation between the exact meaning of gpu wave and thread block?

computation performed by a GPU kernel is partitioned into
groups of threads called thread blocks, which typically
execute in concurrent groups, resulting in waves of execution
What exactly does wave mean here? Isn't that the same meaning as warp?
A GPU can execute a maximum number of threads, grouped in a maximum number of thread blocks. When the whole grid for a kernel is larger than the maximum of either of those limits, or if there are concurrent kernels occupying the GPU, it will launch as many thread blocks as possible. When the last thread of a block has terminated, a new block will start.
Since blocks typically have equal run times and scheduling has a certain latency, this often results in bursts of activity on the GPU that you can see in the occupancy. I believe this is what is meant by that sentence.
Do not confuse this with the term "wavefront" which is what AMD calls a warp.
Wave: a group of thread blocks running concurrently on GPU.
Full Wave: (number of SMs on the device) x (max active blocks per SM)
Launching the grid with fewer thread blocks than a full wave results in low achieved occupancy. A launch is usually composed of some number of full waves and possibly one incomplete wave. It should be mentioned that the maximum size of the wave depends on how many blocks can fit on one SM, given the registers per thread, shared memory per block, etc.
If we look at the blog post by Julien Demoth and use those values to understand the issue:
max # of threads per SM: 2048 (NVIDIA Tesla K20)
kernel has 4 blocks of 256 threads per SM
Theoretical Occupancy: 50% (4*256/2048)
Full Wave: (# of SMs) x (max active blocks per SM) = 13x4 = 52 blocks
The kernel is launched with 128 blocks, so there are 2 full waves and 1 incomplete wave of 24 blocks. The full-wave value may be increased by using the launch_bounds attribute or by configuring the amount of shared memory per SM (for some devices; see also the related report), etc.
Also, the incomplete wave is called the partial last wave, and it has a negative effect on performance due to its low occupancy. This underutilization of the GPU is known as the tail effect, and it is especially dominant when launching only a few thread blocks in a grid.
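Here is a hedged sketch of that wave arithmetic (the placeholder kernel myKernel, the block size of 256, and the grid of 128 blocks simply mirror the example above; for a real kernel the max-active-blocks figure depends on its register and shared-memory usage):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }          // placeholder kernel for the occupancy query

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize  = 256;               // threads per block, as in the example
    int gridBlocks = 128;               // blocks launched, as in the example

    // How many blocks of this size can be resident on one SM for this kernel
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int fullWave   = prop.multiProcessorCount * maxBlocksPerSM;
    int fullWaves  = gridBlocks / fullWave;
    int tailBlocks = gridBlocks % fullWave;   // the partial last wave

    printf("full wave : %d blocks\n", fullWave);
    printf("launch    : %d full wave(s) + a tail of %d block(s)\n", fullWaves, tailBlocks);
    return 0;
}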

Why use thread blocks larger than the number of cores per multiprocessor

I have a Nvidia GeForce GTX 960M graphics card, which has the following specs:
Multiprocessors: 5
Cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total)
Max threads per multiprocessor: 2048
Max block size (x, y, z): (1024, 1024, 64)
Warpsize: 32
If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time. However, if I run 5 blocks of 128 threads, then each multiprocessor gets a block and all 640 threads are run concurrently. So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor can be as even as possible (assuming at least 640 threads in total).
My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?
If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time.
That isn't correct. All 640 threads run concurrently. The SM has instruction latency and is pipelined, so that all threads are active and have state simultaneously. Threads are not tied to a specific core and the execution model is very different from a conventional multi-threaded CPU execution model.
However, if I run 5 blocks of 128 threads then each multiprocessor gets a block and all 640 threads are run concurrently.
That may happen, but it is not guaranteed. All blocks will run. What SM they run on is determined by the block scheduling mechanism, and those heuristics are not documented.
So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor can be as even as possible (assuming at least 640 threads in total).
From the answers above, that does not follow either.
My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?
Because threads are not tied to cores, the architecture has a lot of latency and requires a significant number of threads in flight to hide all that latency and reach peak performance. Unfortunately, basically none of the premises in your question are correct or relevant to determining the optimal number of blocks or their size for a given device.
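If the practical question is just how to choose a block size without reasoning about core counts, the occupancy API can suggest one. A minimal sketch, assuming a trivial placeholder kernel myKernel; the suggested size maximizes theoretical occupancy and is a heuristic, not a guarantee of peak performance:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }          // placeholder kernel

int main() {
    int minGridSize = 0;                // minimum grid size for full occupancy
    int blockSize   = 0;                // block size suggested by the heuristic

    // Ask the runtime for a block size that maximizes theoretical occupancy
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    printf("suggested block size             : %d threads\n", blockSize);
    printf("min grid size for full occupancy : %d blocks\n", minGridSize);
    return 0;
}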

CUDA: Can a SM concurrently alternate between warps from different blocks?

Let's say a SM has been populated with 8 blocks of 64 threads each.
That gives us 2 warps/block, and 16 warps in total. SMs can alternate between warps in order to hide latencies. Must these warps belong to the same block, or can a warp from block 5 be replaced by a warp from block 8, for example?
Yes, the SM scheduler can "alternate" or choose warps for scheduling from any that are resident on that SM.
The fact that an SM has a maximum possible warp load (currently 64 for some GPUs) or thread load (currently 2048 for some GPUs) that exceeds the limit of a single block (currently 1024 for all GPUs supported by recent CUDA toolkits) exists precisely so that the SM can choose warps from different blocks for scheduling, to improve the possibilities for latency hiding.
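A small sketch that prints this headroom for whatever device is installed, using the limits exposed by cudaGetDeviceProperties (exact values vary by compute capability):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int maxWarpsPerSM    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    int maxWarpsPerBlock = prop.maxThreadsPerBlock / prop.warpSize;

    // The gap between these two numbers is what lets the scheduler keep
    // warps from several different blocks resident on one SM at once.
    printf("max warps per SM    : %d\n", maxWarpsPerSM);
    printf("max warps per block : %d\n", maxWarpsPerBlock);
    return 0;
}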

resident warps per SM in (GK20a GPU) tegra k1

How many resident warps are present per SM in (GK20a GPU) tegra k1?
As per the documents, I got the following information:
In Tegra K1 there is 1 SMX and 192 cores/multiprocessor
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Can anyone specify the value of the maximum blocks per SMX?
Is 32 * 4 = 128 (number of threads in a warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently?
If no, how many threads run concurrently?
Kindly help me to solve and understand it.
Can anyone specify the value of the maximum blocks per SMX?
The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.
Is 32 * 4 = 128 (number of threads in a warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If no, how many threads run concurrently?
There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".
Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).
Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for the 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).
Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.
So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.
However, the maximum thread-level instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (max) 2 instructions = 8 warp instructions, i.e. 8 * 32 = 256 thread instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may be achievable.
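To spell out that issue-rate arithmetic, here is a tiny sketch in plain host code, using the Kepler figures quoted above:

#include <cstdio>

int main() {
    int warpSchedulers   = 4;    // warp schedulers per Kepler SMX
    int maxIssuePerSched = 2;    // dual issue: up to 2 instructions per scheduler
    int warpSize         = 32;   // threads per warp

    int warpInstrPerClock   = warpSchedulers * maxIssuePerSched;   // 8
    int threadInstrPerClock = warpInstrPerClock * warpSize;        // 256

    printf("warp-level instructions issued per clock : %d\n", warpInstrPerClock);
    printf("thread-level instructions per clock      : %d\n", threadInstrPerClock);
    return 0;
}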
Each block deployed to an SM for execution requires certain resources, such as registers or shared memory. Let's imagine the following situation:
each thread of a certain kernel uses 64 32-bit registers (256 B of register memory per thread),
the kernel is launched with blocks of 1024 threads,
obviously such a block would consume 256*1024 B of registers on a particular SM.
I don't know about Tegra, but in the case of the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (~256 kB) available, so in this scenario all of the registers would be used by a single block deployed to that SM, and the limit of blocks per SM would be 1 in this case...
The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB, then two blocks could be deployed to an SM with 64 kB of shared memory. Worth mentioning is that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure at the moment whether there is some other blocking factor besides registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory is 2, then the lower number is the limit on the number of blocks per SM.
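Here is a hedged sketch of that blocking-factor calculation (the 64 registers per thread, 1024-thread blocks, and 32 kB of shared memory per block are just the hypothetical numbers from the example above; the regsPerMultiprocessor and sharedMemPerMultiprocessor fields should be available in recent CUDA toolkits, and the computation ignores allocation granularity, so treat the result as an estimate):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Hypothetical kernel footprint, mirroring the example above
    int    regsPerThread   = 64;          // 32-bit registers per thread
    int    threadsPerBlock = 1024;
    size_t smemPerBlock    = 32 * 1024;   // bytes of shared memory per block

    int regLimit  = prop.regsPerMultiprocessor / (regsPerThread * threadsPerBlock);
    int smemLimit = (int)(prop.sharedMemPerMultiprocessor / smemPerBlock);

    // The smaller of the two factors bounds the number of resident blocks per SM
    int blocksPerSM = regLimit < smemLimit ? regLimit : smemLimit;

    printf("register-limited blocks per SM   : %d\n", regLimit);
    printf("shared-mem-limited blocks per SM : %d\n", smemLimit);
    printf("combined limit                   : %d\n", blocksPerSM);
    return 0;
}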
As for your second question, how many threads can run concurrently, the answer is: as many as there are cores in one SM, so in the case of an SMX on the Kepler architecture it is 192. The number of concurrent warps is then 192 / 32.
If you are interested in this stuff I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are these limits for blocks per SM and threads per SM, but I was never able to reach them because my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices, but such info can also be found, for example in the case of the GK110 chip, on NVIDIA's pages in the related document.

How do CUDA blocks/warps/threads map onto CUDA cores?

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/thread.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
The programmer writes a kernel, and organize its execution in a grid of thread blocks.
Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM.
Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
The actual execution of a thread is performed by the CUDA Cores contained in the SM. There is no specific mapping between threads and cores.
If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
On the other hand if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
If a thread starts on a core and is then stalled waiting for a memory access or a long floating-point operation, its execution could resume on a different core.
Are they correct?
Now, I have a GeForce 560 Ti so according to the specifications it is equipped with 8 SM, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than the ones available in each SM, I imagined different approaches:
I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available for them)?
Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)
If I "submerge" the GPU in scheduled work (creating 1024 blocks of 1024 thread each, for example) is it reasonable to assume that all the cores will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
Is there any way to check these situations using the profiler?
Is there any reference for this stuff? I read the CUDA Programming guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application design and development"; but I could not get a precise answer.
Two of the best references are
NVIDIA Fermi Compute Architecture Whitepaper
GF104 Reviews
I'll try to answer each of your questions.
The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.
4'. There is a mapping between laneid (a thread's index within a warp) and a core.
5'. If a warp contains fewer than 32 threads, it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp has exited.
6'. A thread block will be divided into
WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize
There is no requirement for the warp schedulers to select two warps from the same thread block.
7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.
See reference 2 for differences between a GTX480 and GTX560.
If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.
1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.
2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.
8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions
In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
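A quick sketch of that warp-count arithmetic in plain host code (the 10 instructions per thread is the assumption from above; the cost scales with warp count because the hardware issues whole warps):

#include <cstdio>

int main() {
    int warpSize       = 32;
    int instrPerThread = 10;   // each thread executes 10 instructions (example)

    // Configuration A: 8 blocks of 48 threads -> 2 warps per block
    int warpsA = 8 * ((48 + warpSize - 1) / warpSize);     // 16 warps
    // Configuration B: 64 blocks of 6 threads -> 1 (mostly empty) warp per block
    int warpsB = 64 * ((6 + warpSize - 1) / warpSize);     // 64 warps

    printf("A: %d warps -> %d warp instructions\n", warpsA, warpsA * instrPerThread);
    printf("B: %d warps -> %d warp instructions\n", warpsB, warpsB * instrPerThread);
    return 0;
}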
3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time, or 8 SM * 48 warps = 512 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do special floating point operations then the SFU units will be idle.
4'. Parallel Nsight and the Visual Profiler show
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For a GTX560 a rough estimate would be IssuedIPC / MaxIPC.
For MaxIPC assume
GF100 (GTX480) is 2
GF10x (GTX560) is 4, but 3 is a better target.
"E. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run."
is incorrect. You are confusing cores in their usual sense (also used in CPUs) - the number of "multiprocessors" in a GPU - with cores in nVIDIA marketing speak ("our card has thousands of CUDA cores").
A CUDA core (see this SO answer) is a hardware concept and a thread is a software concept. Even with only 16 cores available, you can still run 32 threads. However, you may need 2 clock cycles to run them with only 16 hardware cores.
The CUDA core count represents the total number of single precision floating point or integer thread instructions that can be executed per cycle
The warp scheduler is responsible for finding cores to run instructions on (see this SO answer).
A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
A warp itself can only be scheduled on a single SM (multiprocessor, or streaming multiprocessor), and can run up to 32 threads at the same time (depending on the number of cores in the SM); it cannot use more than one SM.
The number "48 warps" is the maximum number of active warps (warps which may be chosen to be scheduled for work in the next cycle, at any given cycle) per multiprocessor, on NVIDIA GPUs with Compute Capability 2.x; and this number corresponds to 1536 = 48 x 32 threads.
Answer based on this webinar