Task scheduling on NVIDIA GPUs - CUDA

I have some doubts about task scheduling on NVIDIA GPUs.
(1) If one warp of threads in a block (CTA) has finished but other warps are still running, will the finished warp wait for the others? In other words, do all threads in a block (CTA) release their resources only when every thread has finished? I think this should be the case, since threads in a block share shared memory and other resources, and these resources are allocated at CTA granularity.
(2) If all threads in a block (CTA) are stalled on some long-latency operation such as a global memory access, can a new CTA's threads take over the resources, the way a CPU would switch tasks? In other words, once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it hold its resources until it has finished?
I would appreciate it if someone could recommend books or articles about GPU architecture. Thanks!

The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread-block-level resources such as shared memory are allocated first. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round-robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context-switch restore, the warp will be restored to the same SM with the same warp-id.
When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp-level resources, which include the warp-id and register file.
When all warps in a thread block complete, the block-level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking its state. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest-priority eligible warp and issues 1-2 consecutive instructions from it. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction; it will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching between warps every cycle.
This answer does not use the term CUDA core, as it introduces an incorrect mental model. CUDA cores are pipelined single-precision floating-point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM also has other execution units, including load/store units, double-precision floating-point units, half-precision floating-point units, branch units, etc.
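The resource check described above can be sketched as a toy model. This is not NVIDIA's actual algorithm, and the per-SM limits below are assumed example values I picked, not the numbers for any specific architecture:

```python
# Toy model of the Compute Work Distributor's resource check: a block is
# placed on an SM only if ALL of its block-level resources fit. The per-SM
# limits below are assumed example values, not any specific GPU's.
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=49152, sm_warp_slots=64,
                  sm_block_slots=16, warp_size=32):
    """How many copies of this block fit on one SM simultaneously."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    limits = [
        sm_block_slots,                                    # block slots
        sm_warp_slots // warps_per_block,                  # warp slots
        sm_regs // (regs_per_thread * threads_per_block),  # register file
    ]
    if smem_per_block:
        limits.append(sm_smem // smem_per_block)           # shared memory
    return min(limits)

# 256-thread blocks, 32 regs/thread, 8 KiB shared memory per block:
print(blocks_per_sm(256, 32, 8192))  # -> 6 (shared memory is the limiter)
```

This min-over-resource-limits structure is essentially what occupancy calculators automate.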

I recommend this article. It's somewhat outdated, but I think it is a good starting point. The article targets the Kepler architecture, so the most recent one, Pascal, may behave somewhat differently.
Answers for your specific questions (based on the article):
Q1. Do threads in a block release their resources only after all threads in the block have finished running?
Yes. A warp that has finished running while other warps in the same block have not still holds its resources, such as registers and shared memory.
Q2. When all the threads in a block are stalled, does the block still occupy its resources, or does a new block of threads take them over?
You are asking whether a block can be preempted. I searched the web and found the answer here.
On compute capabilities < 3.2 blocks are never preempted.
On compute capabilities 3.2+ the only two instances when blocks can be preempted are during device-side kernel launch (dynamic parallelism) or single-gpu debugging.
So blocks don't give up their resources when stalled on a global memory access. Rather than expecting stalled warps to be preempted, you should design your CUDA code so that there are plenty of warps resident on an SM, waiting to be dispatched. Then, even while some warps are waiting for global memory accesses to finish, the schedulers can issue other warps, effectively hiding the latency.
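The latency-hiding advice can be made quantitative with a back-of-envelope steady-state model. All the numbers and the alternating compute/stall pattern are illustrative assumptions, not real hardware behavior:

```python
# Back-of-envelope latency-hiding model (illustrative numbers): each warp
# alternates `compute` cycles of issuable instructions with a `mem_latency`
# cycle stall, and the scheduler can issue one instruction per cycle.
def idle_fraction(n_warps, compute=20, mem_latency=300):
    period = compute + mem_latency          # one compute+stall period per warp
    busy = min(n_warps * compute, period)   # issue slots actually filled
    return 1 - busy / period

def warps_to_hide(compute=20, mem_latency=300):
    """Resident warps needed so the scheduler never runs dry."""
    return -(-(compute + mem_latency) // compute)  # ceil(period / compute)

print(idle_fraction(2))    # 0.875 -> scheduler idle 87.5% of the time
print(warps_to_hide())     # 16 resident warps fully hide the 300-cycle stall
```

With only 2 resident warps the scheduler sits idle most of the time; with 16 or more, under these assumptions, the memory stalls are fully hidden.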

Related

How can a warp cause another warp to be in the idle state?

As you can see in the title of the question, I want to know how a warp can cause another warp to go into the idle state. I have read a lot of the Q&A on SO but I cannot find the answer. Can only one warp in a block run at any given time? If so, the idle state of a warp has no meaning; but if multiple warps can run at the same time, each warp can do its work independently of the other warps.
The paper says: irregular work-items cause whole warps to be in the idle state (e.g., warp0 w.r.t. warp1 in the following figure).
The terms used by the Nsight VSE profiler for a warp's state are defined at http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm. These terms are also used in numerous GTC presentations on performance analysis.
The compute work distributor (CWD) will launch a thread block on an SM when all resources for the thread block are available. Resources include:
thread block slot
warp slots (sufficient for the block)
registers for each warp
shared memory for the block
barriers for the block
When an SM has sufficient resources, the thread block is launched on the SM. The thread block is rasterized into warps. Warps are assigned to warp schedulers, and resources are allocated to each warp. At this point a warp is in the active state, meaning that the warp can execute instructions.
On each cycle each warp scheduler selects from a list of eligible warps (active, not stalled) and issues 1-2 instructions for the warp. A warp can become stalled for numerous reasons. See the documentation above.
Kepler - Volta GPUs (except GP100) have 4 warp schedulers (sub-partitions) per streaming multiprocessor (SM). All warps of a thread block must be on the same SM. Therefore, on any given cycle a thread block may issue instructions for up to 4 of its warps (one per sub-partition).
Each warp scheduler can pick any of its eligible warps each cycle. The SM is pipelined, so all warps of a maximum-sized thread block (1024 threads == 32 warps) can have instructions in flight every cycle.
The only definitions of idle that I can determine without additional context are:
- If a warp scheduler has 2 eligible warps and only 1 is selected, then the other is reported in a state called not selected.
- If warps in a thread block execute a barrier (__syncthreads), then those warps stall on the barrier (they are not eligible) until the barrier's requirements are met.
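A minimal sketch of the barrier case, assuming (hypothetically) that we know the cycle at which each warp reaches the barrier:

```python
# Toy model of __syncthreads(): a warp that reaches the barrier is stalled
# (not eligible) until the last warp of the block arrives.
def barrier_stall_cycles(arrival_cycles):
    release = max(arrival_cycles)             # barrier is satisfied here
    return [release - a for a in arrival_cycles]

# Four warps reach the barrier at different cycles; the earliest arrival
# stalls the longest, and the last arrival does not stall at all.
print(barrier_stall_cycles([10, 14, 14, 30]))  # [20, 16, 16, 0]
```

This is how one irregular (slow) warp makes its sibling warps idle: every other warp of the block waits at the barrier for the straggler.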

In GPU architecture, where is the data for all the non-active warps stored?

From my understanding of NVIDIA's CUDA architecture, the execution of threads happens in groups of ~32 called 'warps'. Multiple warps are scheduled at a time, and instructions are issued from any of the warps (depending on some internal algorithm).
Now, if I have say 16KB of shared memory on the device, and each thread uses 400 bytes of shared memory, then one warp will need 400*32 = 12.8 KB. Does this mean that the GPU cannot actually schedule more than 1 warp at a time, irrespective of how many threads I launch within a given block?
From a resource standpoint (registers, shared memory, etc.) the important unit is the threadblock, not the warp.
In order to schedule a threadblock for execution, there must be enough free resources on the SM to cover the needs of the entire threadblock. All threadblocks in a grid will have exactly the same resource requirements.
If the SM has no currently executing threadblocks, (such as at the point of kernel launch) then the SM should have at least enough resources to cover the needs of a single threadblock. If that is not the case, the kernel launch will fail. This could happen, for example, if the number of registers per thread, times the number of threads per block, exceeded the number of registers in the SM.
After the SM has a single threadblock scheduled, additional threadblocks can be scheduled depending on the available resources. So, to extend the register example, if each threadblock required 30K registers (regs/thread * threads/block), and the SM had a maximum of 64K registers, then at most two threadblocks could be scheduled (i.e. their warps could possibly be brought into execution by the SM).
In this way, any warp that could possibly be brought into execution already has enough resources allocated for it. This is a principal part of the scheduling mechanism that allows the SM to switch execution from one warp to another with zero delay (fast context switching).
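The register example above, written out. The per-thread register count is a value I picked so that one block needs roughly the 30K registers used in the example:

```python
# Register-limited residency, using the hypothetical numbers from the answer.
regs_per_thread = 120          # assumed value chosen to give ~30K per block
threads_per_block = 256
regs_per_block = regs_per_thread * threads_per_block   # 30720 ~= "30K"
sm_register_file = 65536                               # "64K registers"

print(sm_register_file // regs_per_block)  # at most 2 blocks can be resident
```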

How to measure Streaming Multiprocessor use/idle times in CUDA?

A simple question, really: I have a kernel which runs with the maximum number of blocks per Streaming Multiprocessor (SM) possible and would like to know how much more performance I could theoretically extract from it. Ideally, I'd like to know the percentage of SM cycles that are idle, i.e. all warps are blocked on memory access.
I'm really just interested in finding that number. What I'm not looking for is
General tips on increasing occupancy. I'm using all the occupancy I can get, and even if I manage to get more performance, it won't tell me how much more would theoretically be possible.
How to compute the theoretical peak GFlops. My computations are not FP-centric, there's a lot of integer arithmetic and logic going on too.
Nsight Visual Studio Edition 2.1 and 2.2 Issue Efficiency experiments provide the information you are requesting. These counters/metrics should be added to the Visual Profiler in a release after CUDA 5.0.
Nsight Visual Studio Edition
From Nsight Visual Studio Edition 2.2 User Guide | Analysis Tools | Other Analysis Reports | Profiler CUDA Settings | Issue Efficiency Section
Issue Efficiency provides information about the device's ability to
issue the instructions of the kernel. The data reported includes
execution dependencies, eligible warps, and SM stall reasons.
For devices of compute capability 2.x, a multiprocessor has two warp
schedulers. Each warp scheduler manages at most 24 warps, for a total
of 48 warps per multiprocessor. The kernel execution configuration may
reduce the runtime limit. For information on occupancy, see the
Achieved Occupancy experiment. The first scheduler is in charge of the
warps with an odd ID, and the second scheduler is in charge of warps
with an even ID.
KEPLER UPDATE: For compute capability 3.x, a multiprocessor has four warp schedulers. Each warp scheduler manages at most 16 warps, for a total of 64 warps per multiprocessor.
At every instruction issue time, each scheduler will pick an eligible
warp from its list of active warps and issue an instruction. A warp is
eligible if the instruction has been fetched, the execution unit
required by the instruction is available, and the instruction has no
dependencies that have not been met.
The schedulers report the following statistics on the warps in the
multiprocessor:
Active Warps – A warp is active from the time it is scheduled on a
multiprocessor until it completes the last instruction. The active
warps counter increments by 0-48 per cycle. The maximum increment per
cycle is defined by the theoretical occupancy.
KEPLER UPDATE Range is 0-64 per cycle.
Eligible Warps – An active warp is eligible if it is able to issue the next instruction.
Warps that are not eligible will report an Issue Stall Reason. This counter will increment by 0-ActiveWarps per cycle.
UPDATE: On Fermi the Issue Stall Reason counters are updated only on cycles in which the warp scheduler had no eligible warps. On Kepler the Issue Stall Reason counters are updated every cycle, even if the warp scheduler issues an instruction.
Zero Eligible Warps – This counter increments each cycle by 1 if neither scheduler has a warp that can be issued.
One Eligible Warp – This counter increments each cycle by 1 if only one of the two schedulers has a warp that can be issued.
KEPLER UPDATE: On Kepler the counters are per scheduler so the One Eligible Warp means that the warp scheduler could issue an instruction. On Fermi there is a single counter for both schedulers so on Fermi you want One Eligible Warp counter to be as small as possible.
Warp Issue Holes – This counter increments each cycle by the number of active warps that are not eligible. This is the same as Active Warps minus Eligible Warps.
Long Warp Issue Holes – This counter increments each cycle by the number of active warps that have not been eligible to issue an instruction for more than 32 clock cycles. Long holes indicate that warps are stalled for long-latency reasons such as barriers and memory operations.
Issue Stall Reasons – Each cycle, each ineligible warp will increment one of the issue stall reason counters. The sum of all issue stall reason counters equals Warp Issue Holes. An ineligible warp will increment the Instruction Fetch stall reason if the next assembly instruction has not yet been fetched; the Execution Dependency stall reason if an input dependency is not yet available (this can be reduced by increasing the number of independent instructions); the Data Request stall reason if the request cannot currently be made because the required resources are not available or are fully utilized, or because too many operations of that type are already outstanding (if data requests make up a large portion of the stall reasons, you should also run the memory experiments to determine whether you can optimize existing transactions per request or whether you need to revisit your algorithm); the Texture stall reason if the texture sub-system is already fully utilized and currently unable to accept further operations; and the Synchronization stall reason if the warp is blocked at a __syncthreads(). If this last reason is large and the kernel execution configuration is limited to a small number of blocks, consider dividing the kernel grid into more thread blocks.
Visual Profiler 5.0
The Visual Profiler does not have counters that address your question. Until the counters are added you can use the following counters:
sm_efficiency[_instance]
ipc[_instance]
achieved_occupancy.
The target and max IPC per compute capability are:

Compute Capability   Target IPC   Max IPC
2.0                  1.7          2.0
2.x                  2.3          4.0
3.x                  4.4          7.0
The target IPC is for ALU-limited computation. The target IPC for memory-bound kernels will be lower. For compute capability 2.1 devices and higher it is harder to use IPC, because each warp scheduler can dual-issue.
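As a rough illustration, a measured `ipc` value can be compared against the target for the device's compute capability from the table above. The kernel and its measured IPC below are hypothetical:

```python
# Target/max IPC per compute capability, copied from the table above.
ipc_table = {"2.0": (1.7, 2.0), "2.x": (2.3, 4.0), "3.x": (4.4, 7.0)}

def fraction_of_target(measured_ipc, cc):
    """How close a kernel's measured ipc comes to the ALU-limited target."""
    target, _max = ipc_table[cc]
    return measured_ipc / target

# A hypothetical Kepler (3.x) kernel whose profiler reports ipc == 3.3:
print(round(fraction_of_target(3.3, "3.x"), 2))  # 0.75 of the target IPC
```

A kernel well below the target suggests issue stalls worth investigating; one near the target is close to ALU-limited.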

What is the context switching mechanism in GPU?

As I know, GPUs switch between warps to hide the memory latency. But I wonder in which condition, a warp will be switched out? For example, if a warp perform a load, and the data is there in the cache already. So is the warp switched out or continue the next computation? What happens if there are two consecutive adds?
Thanks
First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.
The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
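The arithmetic-latency point can be made concrete with a Little's-law style estimate, using the Fermi numbers quoted above (peak issue rate and pipeline depth are the ones from the answer, treated here as rough figures):

```python
# To keep the issue slots full, the number of warp-instructions in flight
# must be about latency x issue rate (Little's law). Fermi-era figures from
# the answer above, used as rough illustrative numbers.
alu_latency = 12   # cycles before an add's result can be consumed
issue_rate = 2     # peak warp-instructions issued per cycle per SM
in_flight_needed = alu_latency * issue_rate
print(in_flight_needed)  # 24 warp-instructions, even with zero memory traffic
```

So multiple warps are needed just to cover arithmetic pipeline latency, before any memory latency is considered.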
In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.

Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what, what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core executes one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
For the GTX 970 there are 13 streaming multiprocessors (SMs) with 128 CUDA cores each. CUDA cores are also called stream processors (SPs).
You can define grids, which map blocks to the GPU.
You can define blocks, which map threads to stream processors (the 128 CUDA cores per SM).
One warp is always formed by 32 threads, and all threads of a warp are executed simultaneously.
To use the full potential of a GPU you need many more threads per SM than the SM has SPs. For each compute capability there is a certain number of threads which can reside on one SM at a time. All the blocks you define are queued and wait for an SM to have sufficient resources (number of free SPs); then the block is loaded onto the SM, and the SM starts executing its warps. Since one warp has 32 threads, and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The catch is that when threads perform a memory access, they block until the memory request is satisfied. In numbers: an arithmetic operation on an SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to another warp, if one is available; if that warp blocks too, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from to execute). If the occupancy is high, it is less likely that there is no work for the SPs.
Your statement that each CUDA core executes one block at a time is wrong. If you talk about streaming multiprocessors, they can execute warps from all the threads which reside on the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks resident, from which the SM can choose warps to execute. All threads of the executing warps are executed in parallel.
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can download the Occupancy Calculator sheet from NVIDIA.
In order to maximize performance the developer has to understand the trade-off of blocks vs. warps vs. registers per thread.
The term occupancy is the ratio of active warps to maximum warps on an SM. Kepler - Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimum number of warps per SM should be at least equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction-level parallelism, this could be reduced. Almost all kernels issue variable-latency instructions, such as memory loads, that can take 80-1000 cycles; these require more active warps per warp scheduler to hide the latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised, as some other sacrifice will likely be made. The CUDA profiler can help identify the instruction issue rate, occupancy, and stall reasons in order to help the developer find that balance.
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can become a common stall reason. This can be alleviated by reducing the number of warps per thread block.
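The occupancy arithmetic from the paragraph above, written out (Maxwell/Pascal figures as quoted; treat them as rough numbers):

```python
# Minimum occupancy to cover dependent-instruction latency, per the answer.
warp_schedulers_per_sm = 4
dependent_latency = 6        # cycles between dependent instructions
max_warps_per_sm = 64

warps_needed = warp_schedulers_per_sm * dependent_latency  # 24 warps
occupancy = warps_needed / max_warps_per_sm
print(warps_needed, occupancy)  # 24 warps -> 0.375 (37.5%) occupancy
```

Memory latency pushes the practical warp count well above this floor, which is why the answer treats 37.5% as a minimum, not a target.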
There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks, and each block may contain several threads.
An SM has multiple CUDA cores (as a developer, you should not care about this, because it is abstracted by the warp), which work on threads. An SM always works on a warp of threads (always 32), and a warp only contains threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and the amount of shared memory.