How can a warp cause another warp to be in the idle state? - cuda

As the title says, I want to know how a warp can cause another warp to go into the idle state. I have read a lot of Q&As on SO but I cannot find the answer. Can only one warp in a block run at any given time? If so, the idle state of a warp has no meaning; but if multiple warps can run at the same time, each warp can do its work independently of the other warps.
The paper said: Irregular work-items cause whole warps to be in the idle state (e.g., warp0 w.r.t. warp1 in the following fig).

The terms used by the Nsight VSE profiler for a warp's state are defined at http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm. These terms are also used in numerous GTC presentations on performance analysis.
The compute work distributor (CWD) will launch a thread block on a SM when all resources for the thread block are available. Resources include:
thread block slot
warp slots (sufficient for the block)
registers for each warp
shared memory for the block
barriers for the block
When an SM has sufficient resources the thread block is launched on the SM. The thread block is rasterized into warps. Warps are assigned to warp schedulers. Resources are allocated to each warp. At this point a warp is in an active state, meaning that the warp can execute instructions.
On each cycle each warp scheduler selects from a list of eligible warps (active, not stalled) and issues 1-2 instructions for the warp. A warp can become stalled for numerous reasons. See the documentation above.
Kepler - Volta GPUs (except GP100) have 4 warp schedulers (subpartitions) per streaming multiprocessor (SM). All warps of a thread block must be on the same SM. Therefore, on each given cycle a thread block may issue instructions for up to 4 warps (one per subpartition) in the thread block.
Each warp scheduler can pick any of the eligible warps each cycle. The SM is pipelined, so all warps of a maximum-sized thread block (1024 threads == 32 warps) can have instructions in flight every cycle.
The only definitions of idle that I can determine without additional context are:
- If a warp scheduler has 2 eligible warps and 1 is selected then the other is stalled in a state called not selected.
- If warps in a thread block execute a barrier (__syncthreads) then the warps will stall on the barrier (not eligible) until the requirements of the barrier are met. The warps are stalled on the barrier.
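As a toy illustration of the two definitions above, the per-cycle selection can be sketched in host C++. The state names follow the Nsight terminology; the lowest-index selection policy is an assumption for illustration, not the real hardware heuristic:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of one scheduler cycle: stalled warps (barrier, memory load)
// are ineligible; of the eligible warps, one is selected to issue and the
// rest end the cycle in the "not selected" state described above.
struct Warp {
    int id;
    bool stalled;          // e.g. waiting on __syncthreads() or a load
    std::string state;     // filled in by the scheduler each cycle
};

void schedule_cycle(std::vector<Warp>& warps) {
    bool issued = false;
    for (Warp& w : warps) {
        if (w.stalled)      w.state = "stalled";
        else if (!issued) { w.state = "selected"; issued = true; }
        else                w.state = "not selected";
    }
}
```

With warp 0 stalled on a barrier and warps 1 and 2 eligible, warp 1 issues this cycle while warp 2 sits idle as "not selected" - the second kind of idle listed above.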

Related

task scheduling of NVIDIA GPU

I have some doubts about the task scheduling of NVIDIA GPUs.
(1) If one warp of threads in a block (CTA) has finished but other warps are still running, will this warp wait for the others to finish? In other words, do all threads in a block (CTA) release their resources only when every thread has finished? I think this should be right, since threads in a block share shared memory and other resources, and these resources are allocated at CTA granularity.
(2) If all threads in a block (CTA) hang on some long-latency operation such as a global memory access, will a new CTA's threads occupy their resources, the way a CPU switches tasks? In other words, once a block (CTA) has been dispatched to an SM (Streaming Multiprocessor), does it hold its resources until it has finished?
I would appreciate it if someone could recommend books or articles about GPU architecture. Thanks!
The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread block level resources such as shared memory are allocated. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round-robin to the SM sub-partitions. Each SM subpartition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a subpartition it will remain on that subpartition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore, the warp will be restored to the same SM and the same warp-id.
When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp level resources, which include the warp-id and register file.
When all warps in a thread block complete then block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a subpartition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report as stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.
This answer does not use the term CUDA core as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency is specific to each architecture. Each SM subpartition and SM has other execution units including load/store units, double precision floating point units, half precision floating point units, branch units, etc.
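The latency-hiding argument in the answer above reduces to simple arithmetic. A hedged sketch (an illustrative formula, not a documented hardware rule):

```cpp
#include <cassert>

// If a dependent instruction takes `latency` cycles and each warp exposes
// `ilp` independent instructions between dependencies, a scheduler issuing
// one instruction per cycle needs roughly latency / ilp eligible warps to
// never sit idle. Back-of-the-envelope only; real schedulers differ.
int warps_per_scheduler(int latency, int ilp) {
    return (latency + ilp - 1) / ilp;  // ceiling division
}
```

For example, a 6-cycle dependency with no instruction-level parallelism needs about 6 warps per scheduler, while two independent instructions per warp halve that to 3; a long memory load needs far more warps, which is why memory-bound kernels want more active warps.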
I recommend this article. It's somewhat outdated, but I think it is a good starting point. The article targets the Kepler architecture, so the most recent one, Pascal, may behave somewhat differently.
Answers for your specific questions (based on the article):
Q1. Do threads in a block release their resource only after all threads in the block finish running?
Yes. A warp that finished running while other warps in the same block are still running still holds its resources such as registers and shared memory.
Q2. When all threads in a block hang up, do they still occupy the resources? Or does a new block of threads take over the resources?
You are asking whether a block can be preempted. I searched the web and found the answer here.
On compute capabilities < 3.2 blocks are never preempted.
On compute capabilities 3.2+ the only two instances when blocks can be preempted are during device-side kernel launch (dynamic parallelism) or single-gpu debugging.
So blocks don't give up their resources when stalled on some global memory access. Rather than expecting the stalled warps to be preempted, you should design your CUDA code so that there are plenty of warps resident in an SM, waiting to be dispatched. In that case, even while some warps are waiting for global memory accesses to finish, the schedulers can issue instructions from other warps, effectively hiding the latency.
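To make "plenty of warps resident" concrete, here is a rough resident-warp estimate in host C++. The per-SM limits assumed below (64 warp slots, 16 block slots, 65536 registers) are illustrative, roughly Kepler-class values; real limits depend on compute capability:

```cpp
#include <algorithm>
#include <cassert>

// Estimate how many warps stay resident on one SM given a block size and
// register usage. Whichever resource runs out first caps the block count.
int resident_warps(int threads_per_block, int regs_per_thread) {
    const int warp_size = 32, max_warps = 64, max_blocks = 16, max_regs = 65536;
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    int limit_by_warps  = max_warps / warps_per_block;
    int limit_by_regs   = max_regs / (regs_per_thread * warps_per_block * warp_size);
    int blocks = std::min({limit_by_warps, limit_by_regs, max_blocks});
    return blocks * warps_per_block;
}
```

Under these assumed limits, 256-thread blocks at 32 registers per thread fit 8 blocks for 64 resident warps (full occupancy); doubling register usage to 64 per thread halves that to 32 warps, leaving the scheduler fewer warps to hide latency with.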

CUDA warp / block finalization

When a warp finishes a kernel, but another warp of the same block is still running, will the finished warp be blocked until the other warps of the same block finish, or will the finished warp be available for immediate reuse by another block while the other warps of the current block are still running?
A finished warp is retired, freeing up the warp slot in the scheduler queue for another warp, whether from the same block or another one. The number of warps that can be open at any time, ready for execution by the warp scheduler, is limited by the specific hardware type (compute capability). The number of thread blocks that can be open (scheduled) at any given time on an SM is also limited by compute capability. Therefore, if all but one of the warps of a particular block have finished and retired, the remaining active warp still uses up a warp slot, and the block it belongs to still uses up a block slot. Only when all the warps of a block are finished and retired does the block get retired, freeing its block slot for use by another block.
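The retirement rule above can be sketched as a pair of counters. This is a simplification; the slot counts (64 warps, 16 blocks) and the `retire_warp` helper are illustrative, not a real API:

```cpp
#include <cassert>

// Warp slots free one by one as warps retire, but the block slot is held
// until the block's last warp retires.
struct SM {
    int free_warp_slots  = 64;
    int free_block_slots = 16;
};

struct Block {
    int live_warps;  // warps not yet retired
};

void retire_warp(SM& sm, Block& b) {
    ++sm.free_warp_slots;          // warp slot frees immediately
    if (--b.live_warps == 0)
        ++sm.free_block_slots;     // block slot frees only with the last warp
}
```

Retiring 31 of a 32-warp block's warps frees 31 warp slots but no block slot; only the 32nd retirement releases the block slot for the next block.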

How to measure Streaming Multiprocessor use/idle times in CUDA?

A simple question, really: I have a kernel which runs with the maximum number of blocks per Streaming Multiprocessor (SM) possible and would like to know how much more performance I could theoretically extract from it. Ideally, I'd like to know the percentage of SM cycles that are idle, i.e. all warps are blocked on memory access.
I'm really just interested in finding that number. What I'm not looking for is
General tips on increasing occupancy. I'm using all the occupancy I can get, and even if I manage to get more performance, it won't tell me how much more would theoretically be possible.
How to compute the theoretical peak GFlops. My computations are not FP-centric, there's a lot of integer arithmetic and logic going on too.
Nsight Visual Studio Edition 2.1 and 2.2 Issue Efficiency experiments provide the information you are requesting. These counters/metrics should be added to the Visual Profiler in a release after CUDA 5.0.
Nsight Visual Studio Edition
From Nsight Visual Studio Edition 2.2 User Guide | Analysis Tools | Other Analysis Reports | Profiler CUDA Settings | Issue Efficiency Section
Issue Efficiency provides information about the device's ability to
issue the instructions of the kernel. The data reported includes
execution dependencies, eligible warps, and SM stall reasons.
For devices of compute capability 2.x, a multiprocessor has two warp
schedulers. Each warp scheduler manages at most 24 warps, for a total
of 48 warps per multiprocessor. The kernel execution configuration may
reduce the runtime limit. For information on occupancy, see the
Achieved Occupancy experiment. The first scheduler is in charge of the
warps with an odd ID, and the second scheduler is in charge of warps
with an even ID.
KEPLER UPDATE: For compute capability 3.x, a multiprocessor has four warp schedulers. Each warp scheduler manages at most 16 warps, for a total of 64 warps per multiprocessor.
At every instruction issue time, each scheduler will pick an eligible
warp from its list of active warps and issue an instruction. A warp is
eligible if the instruction has been fetched, the execution unit
required by the instruction is available, and the instruction has no
dependencies that have not been met.
The schedulers report the following statistics on the warps in the
multiprocessor:
Active Warps – A warp is active from the time it is scheduled on a
multiprocessor until it completes the last instruction. The active
warps counter increments by 0-48 per cycle. The maximum increment per
cycle is defined by the theoretical occupancy.
KEPLER UPDATE Range is 0-64 per cycle.
Eligible Warps – An active warp is eligible if it is able to issue the next instruction.
Warps that are not eligible will report an Issue Stall Reason. This
counter will increment by 0-ActiveWarps per cycle.
UPDATE On Fermi the Issue Stall Reason counters are updated only on cycles in which the warp scheduler had no eligible warps. On Kepler the Issue Stall Reason counters are updated every cycle even if the warp scheduler issues an instruction.
Zero Eligible Warps – This counter increments each cycle by 1 if
neither scheduler has a
warp that can be issued.
One Eligible Warp – This counter increments
each cycle by 1 if only one of the two schedulers has a warp that can
be issued.
KEPLER UPDATE: On Kepler the counters are per scheduler so the One Eligible Warp means that the warp scheduler could issue an instruction. On Fermi there is a single counter for both schedulers so on Fermi you want One Eligible Warp counter to be as small as possible.
Warp Issue Holes – This counter increments each cycle by
the number of active warps that are not eligible. This is the same as
Active Warps minus Eligible Warps.
Long Warp Issue Holes – This
counter increments each cycle by the number of active warps that have
not been eligible to issue an instruction for more than 32 clock
cycles. Long holes indicate that warps are stalled on long latency
reasons such as barriers and memory operations.
Issue Stall Reasons – Each cycle, each ineligible warp increments one of the issue stall reason counters. The sum of all issue stall reason counters is equal to Warp Issue Holes. An ineligible warp increments:
- Instruction Fetch – if the next assembly instruction has not yet been fetched.
- Execution Dependency – if an input dependency is not yet available. This can be reduced by increasing the number of independent instructions.
- Data Requests – if the request cannot currently be made because the required resources are not available or are fully utilized, or too many operations of that type are already outstanding. If data requests make up a large portion of the stall reasons, you should also run the memory experiments to determine whether you can optimize existing transactions per request or whether you need to revisit your algorithm.
- Texture – if the texture sub-system is already fully utilized and currently not able to accept further operations.
- Synchronization – if the warp is blocked at a __syncthreads(). If this reason is large and the kernel execution configuration is limited to a small number of blocks, consider dividing the kernel grid into more thread blocks.
Visual Profiler 5.0
The Visual Profiler does not have counters that address your question. Until the counters are added you can use the following counters:
sm_efficiency[_instance]
ipc[_instance]
achieved_occupancy.
The target and max IPCs for compute capabilities are:
Compute Capability   Target IPC   Max IPC
2.0                  1.7          2.0
2.x                  2.3          4.0
3.x                  4.4          7.0
The target IPC is for ALU limited computation. The target IPC for memory bound kernels will be less. For compute capability 2.1 devices and higher it is harder to use IPC as each warp scheduler can dual-issue.
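One hedged way to read the table in code: compare a measured ipc value against the target for the device's compute capability. The targets are copied from the table above; the 80% threshold is an arbitrary illustration, not an NVIDIA guideline:

```cpp
#include <cassert>

// Target/max IPC for an ALU-limited kernel, values from the table above.
struct IpcSpec { double target, max; };

IpcSpec spec_cc3x() { return {4.4, 7.0}; }

// A measured IPC near the target suggests the kernel is ALU limited; an
// IPC far below it suggests a latency- or memory-bound kernel.
bool likely_alu_limited(double measured_ipc, IpcSpec s) {
    return measured_ipc >= 0.8 * s.target;
}
```

For instance, an achieved ipc of 4.2 on a compute capability 3.x device is near the ALU-limited target, while 1.0 points at latency or memory limits and is worth checking against the stall reason counters.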

CUDA warps and occupancy

I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and that this warp can be from any of the thread blocks on the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?
Thank you.
"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.
The first two generations of CUDA capable GPUs (G80/90 and G200) only retire instructions from a single warp every four clock cycles. Compute 2.0 devices dual-issue instructions from two warps every two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out-of-order execution - still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction level parallelism, still issued from the same shared scheduler.

Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what, and how is it parallelized? And which is more efficient: maximizing the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core can execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
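The "4 clock cycles" figure above is just warp width divided by core count; a one-line arithmetic sketch:

```cpp
#include <cassert>

// Cycles needed to issue one instruction for a full warp when the SM has
// fewer cores than the warp is wide: 32 threads over 8 cores takes 4
// cycles, while 32 cores per issue unit would take 1.
int cycles_to_issue_warp(int warp_size, int cores) {
    return (warp_size + cores - 1) / cores;  // ceiling division
}
```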
For the GTX 970 there are 13 Streaming Multiprocessors (SMs) with 128 CUDA cores each. CUDA cores are also called Stream Processors (SPs).
You can define grids which maps blocks to the GPU.
You can define blocks which map threads to Stream Processors (the 128 Cuda Cores per SM).
One warp is always formed by 32 threads, and all threads of a warp are executed simultaneously.
To use the full power of a GPU you need many more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (number of SPs free); then they are loaded. The SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The thing is, if a thread does a memory access, the thread will block until its memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to execute another warp if available. And if this warp blocks, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose to execute from). If the occupancy is high, it is more unlikely that there is no work for the SPs.
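The latency hiding described above can be sketched as a toy per-cycle simulation: each warp issues `work` back-to-back instructions (one per cycle when selected), then stalls `latency` cycles on a memory access, and one instruction issues per cycle if any warp is ready. The 20 and 400 cycle figures follow the paragraph; the single-issue scheduler model is a simplification of real hardware:

```cpp
#include <cassert>
#include <vector>

// Fraction of cycles in which the scheduler could issue an instruction.
double issue_utilization(int num_warps, int work, int latency, int cycles) {
    std::vector<int> todo(num_warps, work);   // instructions until next stall
    std::vector<int> stall(num_warps, 0);     // remaining stall cycles
    int issued = 0;
    for (int c = 0; c < cycles; ++c) {
        bool done = false;
        for (int w = 0; w < num_warps; ++w) {
            if (stall[w] > 0) { --stall[w]; continue; }   // waiting on memory
            if (done) continue;                           // ready, not selected
            --todo[w]; ++issued; done = true;             // issue one instruction
            if (todo[w] == 0) { todo[w] = work; stall[w] = latency; }
        }
    }
    return double(issued) / cycles;
}
```

With a single warp (20 cycles of work, then a 400-cycle stall) the scheduler issues on under 5% of cycles; with 21 such warps resident, some warp is ready every cycle and utilization reaches 100%. That is the quantitative case for high occupancy made in the paragraph above.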
Your statement that each CUDA core will execute one block at a time is wrong. If you talk about Streaming Multiprocessors, they can execute warps from all threads which reside in the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks residing, from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can download an occupancy calculation spreadsheet from Nvidia (the CUDA Occupancy Calculator).
In order to maximize performance the developer has to understand the trade-off of blocks vs. warps vs. registers/thread.
Occupancy is the ratio of active warps to maximum warps on an SM. Kepler - Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimal number of warps per SM should at least be equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal) then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction level parallelism then this could be reduced. Almost all kernels issue variable latency instructions such as memory loads that can take 80-1000 cycles. This requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer find that balance.
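The 37.5% figure works out as plain arithmetic, reproduced here with the numbers quoted above (6-cycle dependent latency, 4 schedulers, 64 warp slots per SM):

```cpp
#include <cassert>

// Occupancy needed to cover a dependent execution latency, assuming one
// instruction per scheduler per cycle and no instruction-level parallelism:
// warps needed = latency x schedulers; occupancy = needed / warp slots.
double occupancy_to_cover(int latency, int schedulers, int warp_slots) {
    return double(latency * schedulers) / warp_slots;
}
```

6 cycles x 4 schedulers = 24 warps, and 24 / 64 warp slots = 0.375, i.e. 37.5% occupancy as the floor for hiding pure ALU latency; memory latency pushes the practical number higher.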
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers then barrier stalls can become a common stall reason. This can be alleviated by reducing the number of warps per thread block.
There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks. Each block may contain several threads.
An SM has multiple CUDA cores (as a developer, you should not care about this, because it is abstracted by the warp), which work on threads. An SM always works on warps of threads (always 32). A warp only works on threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and shared memory.