CUDA: Can a SM concurrently alternate between warps from different blocks?

Let's say a SM has been populated with 8 blocks of 64 threads each.
That gives us 2 warps/block, and 16 warps in total. SMs can alternate between warps in order to hide latencies. Must these warps belong to the same block, or can a warp from block 5 be replaced by a warp from block 8, for example?

Yes, the SM scheduler can "alternate" or choose warps for scheduling from any that are resident on that SM.
The reason an SM's maximum warp load (currently 64 on some GPUs) or thread load (currently 2048 on some GPUs) exceeds the maximum size of a single block (currently 1024 on all GPUs supported by recent CUDA toolkits) is precisely so that the SM can choose warps from different blocks for scheduling, improving the opportunities for latency hiding.
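As a minimal sketch of how to see those limits on a given card (assuming a CUDA 11 or newer toolkit, since the maxBlocksPerMultiProcessor field is only reported by recent runtimes), you can query the device properties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0

        // The per-SM thread limit deliberately exceeds the per-block limit,
        // so warps from several blocks can be resident (and schedulable) at once.
        printf("max threads per block      : %d\n", prop.maxThreadsPerBlock);
        printf("max threads per SM         : %d\n", prop.maxThreadsPerMultiProcessor);
        printf("max resident warps per SM  : %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
        printf("max resident blocks per SM : %d\n", prop.maxBlocksPerMultiProcessor);
        return 0;
    }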

Related

Resident warps per SM in (GK20a GPU) Tegra K1

How many resident warps are present per SM in the (GK20a GPU) Tegra K1?
From the documentation I gathered the following information:
The Tegra K1 has 1 SMX with 192 cores per multiprocessor
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Can anyone specify the maximum number of blocks per SMX?
Are 32 * 4 = 128 threads (threads per warp * number of warps, since Kepler allows four warps to be issued and executed concurrently) running concurrently? If not, how many threads run concurrently?
Kindly help me understand this.
Can anyone specify the maximum number of blocks per SMX?
The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.
Are 32 * 4 = 128 threads (threads per warp * number of warps, since Kepler allows four warps to be issued and executed concurrently) running concurrently? If not, how many threads run concurrently?
There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".
Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).
Kepler has 4 warp schedulers, which can each issue up to two instructions from a given warp (one warp per scheduler, so 4 warps total across the 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions issued per clock cycle).
Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.
So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.
However, the maximum thread-instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (at most) 2 instructions x 32 threads per warp = 8 * 32 = 256. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may be achievable in practice.
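As a small worked example of that issue-rate arithmetic (the numbers are simply the ones from the paragraph above, not queried from hardware):

    #include <cstdio>

    int main() {
        // Kepler peak issue rate per SM per clock, using the figures above.
        const int warpSchedulers       = 4;   // warp schedulers per Kepler SM
        const int maxInstrPerScheduler = 2;   // dual issue per scheduler, best case
        const int threadsPerWarp       = 32;

        int warpInstrPerClock   = warpSchedulers * maxInstrPerScheduler;  // 8
        int threadInstrPerClock = warpInstrPerClock * threadsPerWarp;     // 256

        printf("peak per SM per clock: %d warp-instructions, %d thread-instructions\n",
               warpInstrPerClock, threadInstrPerClock);
        return 0;
    }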
Each block deployed to an SM for execution requires certain resources, such as registers or shared memory. Let's imagine the following situation:
each thread of a certain kernel uses 64 32-bit registers (256 B of register memory),
the kernel is launched with blocks of 1024 threads,
obviously, such a block would consume 256 * 1024 B of registers on a particular SM.
I don't know about Tegra, but on the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (~256 kB) available; therefore, in the scenario above, all of the registers would be used by a single block deployed to this SM, so the limit of blocks per SM would be 1 in this case...
The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB, then two blocks could be deployed to an SM with 64 kB of shared memory. Worth mentioning: as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure at the moment whether there is some limiting factor other than registers or shared memory, but obviously, if the limit from registers is 1 block and the limit from shared memory is 2 blocks, then the lower number is the limit on blocks per SM.
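If you don't want to do this arithmetic by hand, a minimal sketch using the runtime occupancy API can report the block limit for you (myKernel is just a placeholder kernel for illustration; substitute your own):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; its register and static shared-memory usage is what
    // the occupancy calculator takes into account.
    __global__ void myKernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (data) data[i] *= 2.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // How many blocks of 1024 threads fit on one SM for this kernel?
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      1024 /* block size */,
                                                      0 /* dynamic shared mem */);

        printf("registers per SM  : %d\n", prop.regsPerMultiprocessor);
        printf("shared mem per SM : %zu bytes\n", prop.sharedMemPerMultiprocessor);
        printf("blocks of 1024 threads resident per SM: %d\n", blocksPerSM);
        return 0;
    }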
As for your second question, how many threads can run concurrently: the answer is as many as there are cores in one SM, so for the Kepler SMX that is 192. The number of concurrent warps is then 192 / 32 = 6.
If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their limiting factors, and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are hard limits for blocks per SM and threads per SM, but I was never able to reach them because my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices; in the case of the GK110 chip, such info can also be found in the related documents on NVIDIA's pages.

Determining number of warps allowed in CUDA SM

So if a streaming multiprocessor allows a maximum of X threads, while each block in the SM allows Y threads, how many warps can we have in a block and how many warps can we have in an SM?
Here is my take on this question:
(1) A warp consists of 32 threads. In a block we can have Y/32, right?
(2) As far as the number of warps per SM, we cannot exceed X, the maximum number of threads per SM, so we can have X/32, right? I hope somebody can confirm these calculations.
(1) Yes, rounding up if needed (i.e. if the number of threads Y per block is not evenly divisible by 32).
(2) Yes, that is one limit on the number of warps that may be active. Remember that the SM scheduler works by scheduling blocks first. The number of blocks that will be scheduled is a function of available resources (registers, shared memory, threads, etc.). A block will only be scheduled when there are enough resources available to support its needs. So, for example, if I have 1024 threads per block, I can schedule at most 1 block on an SM, because the limit of 1536 threads per SM (using CC 2.0 as an example here) prevents 2 blocks from being scheduled. In that case, even though your X/32 number predicts a max of 48 warps, only 1024/32 = 32 warps will be scheduled (using CC 2.0 as an example, with a block structure of 1024 threads per block).
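Here is a small sketch of those two calculations, plugging in Y = 1024 threads per block and the CC 2.0 limit of X = 1536 threads per SM used in the example above:

    #include <cstdio>

    int main() {
        const int warpSize        = 32;
        const int threadsPerBlock = 1024;  // Y
        const int maxThreadsPerSM = 1536;  // X (CC 2.0 example)

        // Round up: a partially filled warp still occupies a whole warp slot.
        int warpsPerBlock  = (threadsPerBlock + warpSize - 1) / warpSize;  // 32
        int warpLimitPerSM = maxThreadsPerSM / warpSize;                   // 48

        // Only one 1024-thread block fits under the 1536-thread limit,
        // so the SM actually carries 32 warps, not 48.
        int blocksPerSM   = maxThreadsPerSM / threadsPerBlock;             // 1
        int residentWarps = blocksPerSM * warpsPerBlock;                   // 32

        printf("warps/block = %d, warp limit/SM = %d, resident warps = %d\n",
               warpsPerBlock, warpLimitPerSM, residentWarps);
        return 0;
    }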

Is actual warp execution sequential or in parallel?

If we have configured 256 threads/block for an SM, then in total there would be 3 blocks/SM (considering a maximum of 768 threads/SM). Now the total warps/block would be 256/32 = 8, and thus 8 * 3 = 24 warps/SM. So will these 8 warps per block be executed sequentially or in parallel, and will the 24 warps in the SM execute sequentially or in parallel?
It is already clear that at any time 3 blocks can be executed by the SM (in parallel).
Different blocks can be mapped to different SM's and hence executed in parallel. But, internally, blocks consist of warps which are scheduled for execution on an SM one at a time (on 1.x devices). However, the graphics hardware can switch between different warps with 0 overhead (owing to static register allocation). Therefore usually instructions from different warps (and possibly from different blocks) exist in the SM's pipeline at different stages.
Active warps are those that are ready to execute, i.e. not waiting on a barrier or a memory access and without register dependencies (like read-after-write). I am not sure how the hardware chooses the next warp to execute. Probably warps are prioritized by "age" (waiting time) and other factors to prevent starvation.
Concerning your questions:
On 1.x devices there could be at most 768 threads per SM, i.e. 24 warps/SM. On 2.x and higher, there are up to 1536 threads / 48 warps per SM (depending on register usage).
If there are 10 SMs per GPU, and you have enough registers/shared memory to run 24 warps per SM, then there could be at most 24 * 10 = 240 active warps per GPU. Though, it is rarely the case that all warps are active at the same time, since most of them will be waiting on memory accesses, register dependencies, or barriers, depending on your program logic. Remark that the actual execution of an instruction (not its scheduling!) can take up to 22 cycles on 1.x devices, hence a warp will be inactive until the instruction completes.

Why is only one of the warps executed by an SM in CUDA?

I frequently found the following words in some CUDA materials:
"At any time, only one of the warps is executed by a SM".
I don't quite understand this: since each SM can run hundreds to thousands of threads simultaneously, why can only a single warp, i.e. 32 threads, be executed at any point in time?
Thanks!
Details vary for different generations of CUDA hardware, but for example in earlier generations each SM has 8 execution units, each of which executes 4 threads (one instruction from each thread every 4 cycles). Hence you get 4 way SMT which gives 32 concurrent threads per SM.
Of course there are multiple SMs per GPU, e.g. 30, which would mean 30 x 32 thread warps = 960 threads executing at any given instant. On top of this warps can be switched in and out so you can have much more than, e.g. 960 "live" threads, even though only 960 of them are actually executing at any given time.
The statement is true of the Tesla architecture but it is incorrect for Fermi and Kepler. It is easier to look at the SM in terms of warp schedulers. On each cycle the warp scheduler selects an eligible warp (a warp that is not stalled) and dispatches one or two instructions from the warp to execution units. The number of execution units per SM is documented in the Fermi and Kepler whitepapers. CUDA cores roughly equate to the number of execution units that can perform integer and single precision floating point operations. There are additional execution units for load/store operations, branching, etc.
Compute Capability 1.x (Tesla)
1 warp scheduler per SM
Dispatch 1 instruction per warp scheduler
Compute Capability 2.0 (Fermi 1st Generation)
2 warp schedulers per SM
Dispatch 1 instruction per warp scheduler
Compute Capability 2.1 (Fermi 2nd Generation)
2 warp schedulers per SM
Dispatch 1 or 2 instructions per warp scheduler
Compute Capability 3.x (Kepler)
4 warp schedulers per SM
Dispatch 1 or 2 instructions per warp scheduler
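If it helps to see those numbers in one place, here is a toy lookup of the same values (the table is copied from the list above; it is not something the CUDA runtime reports, only the compute capability of the device is queried):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Issue widths per compute capability, copied from the list above.
    // ccMinor == -1 means "any minor version".
    struct SmIssueInfo {
        int ccMajor, ccMinor, warpSchedulers, maxDispatchPerScheduler;
        const char *arch;
    };

    const SmIssueInfo kIssueTable[] = {
        {1, -1, 1, 1, "Tesla"},
        {2,  0, 2, 1, "Fermi 1st gen"},
        {2,  1, 2, 2, "Fermi 2nd gen"},
        {3, -1, 4, 2, "Kepler"},
    };

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        for (const SmIssueInfo &e : kIssueTable) {
            if (e.ccMajor == prop.major && (e.ccMinor == -1 || e.ccMinor == prop.minor)) {
                printf("%s (cc %d.%d): %d warp schedulers, up to %d instructions each per cycle\n",
                       e.arch, prop.major, prop.minor,
                       e.warpSchedulers, e.maxDispatchPerScheduler);
                return 0;
            }
        }
        printf("cc %d.%d is newer than the architectures listed above\n", prop.major, prop.minor);
        return 0;
    }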

How do CUDA blocks/warps/threads map onto CUDA cores?

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/thread.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
The programmer writes a kernel, and organize its execution in a grid of thread blocks.
Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM.
Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
The actual execution of a thread is performed by the CUDA cores contained in the SM. There is no specific mapping between threads and cores.
If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
On the other hand, if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
If a thread starts on a core and is then stalled waiting for a memory access or a long floating point operation, its execution could resume on a different core.
Are they correct?
Now, I have a GeForce 560 Ti, so according to the specifications it is equipped with 8 SMs, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than the ones available in each SM, I imagined different approaches:
I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available for them)?
Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)
If I "submerge" the GPU in scheduled work (creating 1024 blocks of 1024 thread each, for example) is it reasonable to assume that all the cores will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
Is there any way to check these situations using the profiler?
Is there any reference for this stuff? I read the CUDA Programming guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application design and development"; but I could not get a precise answer.
Two of the best references are
NVIDIA Fermi Compute Architecture Whitepaper
GF104 Reviews
I'll try to answer each of your questions.
The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to an SM, the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch them to execution units. For more details on execution units and instruction dispatch see reference 1, p. 7-10, and reference 2.
4'. There is a mapping between laneid (a thread's index within its warp) and a core.
5'. If a warp contains fewer than 32 threads it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent block so threads that did not take the current path are marked inactive, or a thread in the warp has exited.
6'. A thread block will be divided into
WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize
There is no requirement for the warp schedulers to select two warps from the same thread block.
7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.
See reference 2 for differences between a GTX480 and GTX560.
If you read the reference material (a few minutes), I think you will find that your goal does not make sense. I'll try to respond to your points.
1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.
2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.
8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions
In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
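A small sketch of that comparison (the 10 instructions per thread are just the assumption made above):

    #include <cstdio>

    // Warp-instructions issued for a launch shape, assuming no divergence.
    int warpInstructions(int blocks, int threadsPerBlock, int instrPerThread) {
        const int warpSize = 32;
        // A partial warp still costs a full issue slot.
        int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
        return blocks * warpsPerBlock * instrPerThread;
    }

    int main() {
        printf("8 blocks x 48 threads : %d warp-instructions\n", warpInstructions(8, 48, 10));   // 160
        printf("64 blocks x 6 threads : %d warp-instructions\n", warpInstructions(64, 6, 10));   // 640
        return 0;
    }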
3'. A GTX560 can have 8 SMs * 8 blocks = 64 blocks at a time, or 8 SMs * 48 warps = 384 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do special floating point operations the SUFU units will be idle.
4'. Parallel Nsight and the Visual Profiler show
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For a GTX560 a rough estimate would be IssuedIPC / MaxIPC.
For MaxIPC assume
GF100 (GTX480) is 2
GF10x (GTX560) is 4, but 3 is a better target.
"E. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run."
is incorrect. You are confusing cores in their usual sense (also used in CPUs) - the number of "multiprocessors" in a GPU, with cores in nVIDIA marketing speak ("our card has thousands of CUDA cores").
A CUDA core is a hardware concept and a thread is a software concept. Even with only 16 cores available, you can still run 32 threads; however, you may need 2 clock cycles to run them with only 16 hardware cores.
The CUDA core count represents the total number of single precision floating point or integer thread instructions that can be executed per cycle.
The warp scheduler is responsible for finding cores to run instructions on.
A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
A warp itself can only be scheduled on an SM (multiprocessor, or streaming multiprocessor), and can run up to 32 threads at the same time (depending on the number of cores in the SM); it cannot use more than one SM.
The number "48 warps" is the maximum number of active warps (warps which may be chosen to be scheduled for work in the next cycle, at any given cycle) per multiprocessor, on NVIDIA GPUs with Compute Capability 2.x; and this number corresponds to 1536 = 48 x 32 threads.
Answer based on this webinar