Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what, and what is parallelized and how? And which is more efficient: maximizing the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core is able to execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?

The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across the warps of a block, you need to use __syncthreads().
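For illustration, here is a minimal kernel sketch (the kernel name, the block size of 256, and the buffers are made up) showing why __syncthreads() is needed: the warps of a block do not run in lock-step with each other, so a block-wide barrier is required before reading what other warps wrote to shared memory.

    // Hypothetical block-wide reduction: __syncthreads() is needed because
    // the warps of a block do not run in lock-step with each other.
    __global__ void blockSum(const float *in, float *out)
    {
        __shared__ float buf[256];              // assumes blockDim.x == 256

        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                        // all warps must have written buf

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();                    // sync across warps after each step
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];
    }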

For the GTX 970 there are 13 Streaming Multiprocessors (SM) with 128 Cuda Cores each. Cuda Cores are also called Stream Processors (SP).
You can define grids which map blocks to the GPU.
You can define blocks which map threads to Stream Processors (the 128 CUDA Cores per SM).
One warp is always formed by 32 threads, and all threads of a warp are executed simultaneously.
To use the full potential of a GPU you need many more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (a sufficient number of free SPs); then the block is loaded onto that SM and the SM starts to execute its warps. Since one warp has only 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The catch is that when threads do a memory access, they block until the memory request is satisfied. In numbers: an arithmetic operation on an SP has a latency of 18-22 cycles, while a non-cached global memory access can take 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to another warp if one is available, and if that warp blocks it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (i.e. from how many warps the SM can choose to execute). If the occupancy is high, it is less likely that there is no work for the SPs.
Your statement that each CUDA core will execute one block at a time is wrong. If you talk about Streaming Multiprocessors, they can execute warps from all threads which reside in the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks residing from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
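As a rough sketch of that mapping (the kernel name, data pointer and problem size are placeholders, and the 2048-threads-per-SM figure is the one assumed above): with 256-thread blocks, up to 2048 / 256 = 8 blocks can be resident on each SM.

    // Sketch only: myKernel and d_data are placeholders, n is an example size.
    __global__ void myKernel(float *data, int n) { /* ... */ }

    void launchExample(float *d_data)
    {
        int n = 1 << 20;                          // example problem size
        dim3 block(256);                          // 256 threads = 8 warps per block
        dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n
        // With 2048 resident threads per SM, up to 2048 / 256 = 8 of these
        // blocks can be resident on each SM at the same time.
        myKernel<<<grid, block>>>(d_data, n);
    }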
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can also download the Occupancy Calculator spreadsheet provided by Nvidia.
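The theoretical occupancy can also be queried programmatically with the CUDA occupancy API; the sketch below assumes a placeholder kernel and device 0.

    #include <cstdio>

    // Placeholder kernel so the occupancy query below has something to inspect.
    __global__ void myKernel(float *data) { /* ... */ }

    int main()
    {
        int blockSize = 256;
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of myKernel fit on one SM.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0 /* dyn. smem */);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("Theoretical occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
        return 0;
    }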

The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread block level resources such as shared memory are allocated, and the allocation creates sufficient warps for all threads in the thread block. The resource manager assigns warps to the SM sub-partitions in round-robin fashion. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore the warp will be restored to the same SM with the same warp-id.
When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp level resources, which include the warp-id and register file.
When all warps in a thread block complete, the block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from it. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction; the warp will then report itself as stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching between warps on a per-cycle basis.
This answer does not use the term CUDA core, as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM has other execution units, including load/store units, double precision floating point units, half precision floating point units, branch units, etc.
In order to maximize performance the developer has to understand the trade off of blocks vs. warps vs. registers/thread.
The term occupancy refers to the ratio of active warps to the maximum number of warps on an SM. The Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimum number of warps per SM should be at least equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction level parallelism then this could be reduced. Almost all kernels issue variable latency instructions such as memory loads that can take 80-1000 cycles; this requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised, as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer find that balance.
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can become a common stall reason. This can be alleviated by reducing the number of warps per thread block.
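As a complement to the profiler, the runtime occupancy helpers can suggest a block size; a sketch assuming a placeholder kernel (this does not replace profiling, since occupancy is only one factor in the trade-off described above):

    // Sketch: letting the runtime suggest a block size for a hypothetical
    // kernel, as one input into the blocks-vs-warps-vs-registers trade-off.
    __global__ void myKernel(float *data, int n) { /* ... */ }

    void pickLaunchConfig(float *d_data, int n)
    {
        int minGridSize = 0, blockSize = 0;
        // Returns a block size that maximizes theoretical occupancy for myKernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

        int gridSize = (n + blockSize - 1) / blockSize;
        myKernel<<<gridSize, blockSize>>>(d_data, n);
    }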

There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks, and each block may contain several threads.
An SM has multiple CUDA cores (as a developer you should not care about this, because it is abstracted by the warp), which work on threads. An SM always works on a warp of threads (always 32), and a warp only contains threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and shared memory.
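These limits can be queried from the runtime; a minimal sketch (device 0 assumed; the per-SM register and shared-memory fields require a reasonably recent CUDA toolkit):

    #include <cstdio>

    int main()
    {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        printf("SMs:                    %d\n", p.multiProcessorCount);
        printf("Warp size:              %d\n", p.warpSize);
        printf("Max threads / block:    %d\n", p.maxThreadsPerBlock);
        printf("Max threads / SM:       %d\n", p.maxThreadsPerMultiProcessor);
        printf("Registers / block:      %d\n", p.regsPerBlock);
        printf("Registers / SM:         %d\n", p.regsPerMultiprocessor);
        printf("Shared memory / block:  %zu bytes\n", p.sharedMemPerBlock);
        printf("Shared memory / SM:     %zu bytes\n", p.sharedMemPerMultiprocessor);
        return 0;
    }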

Related

task scheduling of NVIDIA GPU

I have some doubts about the task scheduling of NVIDIA GPUs.
(1) If a warp of threads in a block (CTA) has finished but other warps are still running, will this warp wait for the others to finish? In other words, do all threads in a block (CTA) release their resources only when all threads have finished? I think this should be right, since threads in a block share the shared memory and other resources, and these resources are allocated at CTA granularity.
(2) If all threads in a block (CTA) are stalled on some long-latency operation such as a global memory access, will the threads of a new CTA take over their resources, the way a CPU would switch? In other words, once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it hold its resources until it has finished?
I would appreciate it if someone could recommend a book or articles about GPU architecture. Thanks!
I recommend this article. It's somewhat outdated, but I think it is a good starting point. The article targets the Kepler architecture, so the most recent one, Pascal, may behave somewhat differently.
Answers for your specific questions (based on the article):
Q1. Do threads in a block release their resource only after all threads in the block finish running?
Yes. A warp that has finished running while other warps in the same block have not still holds its resources such as registers and shared memory.
Q2. When all the threads in a block are stalled, does the block still occupy its resources, or does a new block of threads take over them?
You are asking whether a block can be preempted. I've searched the web and got the answer from here.
On compute capabilities < 3.2 blocks are never preempted.
On compute capabilities 3.2+ the only two instances when blocks can be preempted are during device-side kernel launch (dynamic parallelism) or single-gpu debugging.
So blocks don't give up their resources when stalled by some global memory access. Rather than expecting the stalled warps to be preempted, you should design your CUDA code so that there are plenty of warps resident in an SM, waiting to be dispatched. In this case, even when some warps are waiting for global memory accesses to finish, the schedulers can issue from other warps, effectively hiding latency.
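One common way to expose plenty of resident warps is the grid-stride loop pattern; a sketch (the kernel and launch numbers are illustrative only):

    // Grid-stride loop sketch: with enough blocks launched, every SM has several
    // resident warps, so the scheduler can issue from one warp while others wait
    // on global memory.
    __global__ void scale(float *x, float a, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)   // each thread strides over the grid
            x[i] *= a;
    }

    void launchScale(float *d_x, float a, int n)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // Illustrative choice: several blocks per SM, so the warp schedulers
        // always have eligible warps to pick from.
        int blocks = prop.multiProcessorCount * 8;
        scale<<<blocks, 256>>>(d_x, a, n);
    }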

In GPU architecture, where is the data for all the non-active warps stored?

From my understanding of NVIDIA's CUDA architecture, the execution of threads happens in groups of ~32 called 'warps'. Multiple warps are scheduled at a time, and instructions are issued from any of the warps (depending on some internal algorithm).
Now, if I have say 16KB of shared memory on the device, and each thread uses 400 bytes of shared memory, then one warp will need 400*32 = 12.8 KB. Does this mean that the GPU cannot actually schedule more than 1 warp at a time, irrespective of how many threads I launch within a given block?
From a resource standpoint (registers, shared memory, etc.) the important unit is the threadblock, not the warp.
In order to schedule a threadblock for execution, there must be enough free resources on the SM to cover the needs of the entire threadblock. All threadblocks in a grid will have exactly the same resource requirements.
If the SM has no currently executing threadblocks, (such as at the point of kernel launch) then the SM should have at least enough resources to cover the needs of a single threadblock. If that is not the case, the kernel launch will fail. This could happen, for example, if the number of registers per thread, times the number of threads per block, exceeded the number of registers in the SM.
After the SM has a single threadblock scheduled, additional threadblocks can be scheduled depending on the available resources. So to extend the register analogy, if each threadblock required 30K registers (regs/thread * threads/block), and the SM had a maximum of 64K registers, then at most two threadblocks could be scheduled (i.e. their warps could possibly be brought into execution by the SM).
In this way, any warp that could possibly be brought into execution already has enough resources allocated for it. This is a principal part of the scheduling mechanism that allows the SM to switch execution from one warp to another with zero delay (fast context switching).
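Applying that per-block accounting to the numbers in the question (16 KB of shared memory per SM, 400 bytes per thread), a back-of-the-envelope sketch that considers only the shared-memory limit:

    #include <cstdio>

    int main()
    {
        const int smemPerSM       = 16 * 1024;  // bytes (question's assumption)
        const int smemPerThread   = 400;        // bytes (question's assumption)
        const int threadsPerBlock = 32;         // one warp per block, for example

        int smemPerBlock = smemPerThread * threadsPerBlock;   // 12,800 bytes
        int blocksPerSM  = smemPerSM / smemPerBlock;          // = 1
        printf("Resident blocks per SM (shared-memory limit): %d\n", blocksPerSM);
        printf("Resident warps per SM:                        %d\n",
               blocksPerSM * threadsPerBlock / 32);
        return 0;
    }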

clarification about CUDA number of threads executed per SM

I am new to CUDA programming and am reading about the G80 chip, which has 128 SPs (16 SMs, each with 8 SPs), in the book "Programming Massively Parallel Processors - A Hands-on Approach".
There is a comparison between Intel CPUs and the G80 chip.
Intel CPUs support 2 to 4 threads per core, depending on the machine model,
whereas the G80 chip supports 768 threads per SM, which sums up to about 12,000 threads (768 × 16 = 12,288) for the whole chip.
My question here is: can the G80 chip execute 768 threads simultaneously?
If not simultaneously, then what is meant by "Intel CPUs support 2 to 4 threads per core"? We can always have many threads/processes running on an Intel CPU, scheduled by the OS.
The G80 keeps the context for 768 threads per SM concurrently and interleaves their execution. This is the key difference between a CPU and a GPU. GPUs are deeply multithreaded processors that hide the memory accesses of some threads behind the computation of other threads. The latency of executing a single thread is much higher than on a CPU; the GPU is optimized for thread throughput instead of thread latency. In comparison, CPUs use out-of-order and speculative execution to reduce the execution delay of one thread. There are several techniques used by GPUs to reduce thread scheduling overhead. For example, GPUs group threads into coarser schedulable elements called warps or wavefronts and execute the threads of a warp on SIMD units. GPU threads are identical, making them a suitable fit for the SIMD model. In the eyes of the programmer, threads are executed in MIMD fashion and are grouped into thread blocks to reduce communication overhead.
Threads on a CPU core are used to fill the different execution units through dynamic scheduling. CPU threads are not necessarily of the same type; once one thread is busy with the floating point unit, another thread may find the ALU idle, so execution of these threads can proceed concurrently. Multiple threads per core are maintained to fill the different execution units, effectively preventing idle units. However, dynamic scheduling is costly in terms of power and energy consumption, so manufacturers use only a few threads per CPU core.
In answer to the second part of your question: threads on GPUs are scheduled by hardware (the per-SM warp schedulers); neither the OS nor the driver affects this scheduling.
As far as I know, 768 is the maximum number of resident threads in an SM, and the threads are executed in warps which consist of 32 threads. So in an SM, all 768 threads will not be executed at the same time; they will be scheduled in chunks of 32 threads, i.e. one warp at a time.
The analogous technology on CPUs is called "simultaneous multithreading" (SMT), or Hyper-Threading in Intel's marketing speak. It usually allows two, on some CPUs four, threads to be scheduled by the CPU itself in hardware.
This is different from the fact that the operating system may on top of that schedule a larger number of threads in software.

How Concurrent blocks can run a single GPU streaming multiprocessor?

I was studying the CUDA programming structure, and my impression after studying it is that, after creating the blocks and threads, each of these blocks is assigned to a streaming multiprocessor (e.g. I am using a GeForce 560 Ti, which has 14 streaming multiprocessors, and so at one time 14 blocks can be assigned across the streaming multiprocessors). But as I am going through several online materials such as this one:
http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf
where it has been mentioned that several blocks can be run concurrently on one multiprocessor. I am basically very confused about the execution of the threads and the blocks on the streaming multiprocessors. I know that the assignment of blocks and the execution of threads are arbitrary, but I would like to know how the mapping of the blocks and the threads actually happens so that concurrent execution can occur.
The Streaming Multiprocessors (SM) can execute more than one block at a time using Hardware Multithreading, a process akin to Hyper-Threading.
The CUDA C Programming Guide describes this as follows in Section 4.2:
4.2 Hardware Multithreading
The execution context (program counters, registers, etc.) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well as the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
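On the kernel side, register pressure (one of the resources the guide mentions) can be influenced with the __launch_bounds__ qualifier; a sketch with purely illustrative numbers:

    // Sketch: hinting the compiler about the launch configuration so that
    // register usage per thread allows the desired number of resident blocks.
    // The numbers below are illustrative, not a recommendation.
    __global__ void __launch_bounds__(256 /* max threads per block */,
                                      4   /* desired min blocks per SM */)
    scaleKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }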

CUDA warps and occupancy

I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and that this warp can be from any one of the thread blocks in the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?
Thank you.
"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.
The first two generations of CUDA capable GPU (so G80/90 and G200) only retire instructions from a single warp per four clock cycles. Compute 2.0 devices dual-issue instructions from two warps per two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out of order execution - still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction level parallelism, still issued from the same shared scheduler.