CUDA warp / block finalization

When a warp finishes a kernel, but another warp of the same block is still running, will the finished warp be blocked until the other warps of the same block finish, or will the finished warp be available for immediate reuse by another block while the other warps of the current block are still running?

A finished warp is retired, freeing up its warp slot in the scheduler for another warp, whether from the same block or from another one. The number of warps that can be open at any time and ready for execution by the warp scheduler is limited by the compute capability of the hardware, and so is the number of thread blocks that can be open (scheduled) on an SM at any given time. Therefore, if all the warps of a particular block but one have finished and retired, the remaining active warp still uses up a warp slot, and the block it belongs to still uses up a block slot. Only when all the warps of a block have finished and retired is the block itself retired, freeing its block slot for use by another block.
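As a rough sketch of what that means in practice (the kernel `unevenWarps` and its parameters are made up for illustration): the lower-half warps of each block below exit almost immediately and are retired, freeing their warp slots, while the block slot stays occupied until the long-running upper-half warps finish.

```cuda
// Hypothetical kernel: the lower-half warps of each block exit early and are
// retired, freeing their warp slots; the block slot (and any block-level
// resources such as shared memory) is only freed when the last warp finishes.
__global__ void unevenWarps(float *data, int iters)
{
    int tid          = blockIdx.x * blockDim.x + threadIdx.x;
    int warpInBlock  = threadIdx.x / warpSize;
    int warpsInBlock = blockDim.x / warpSize;

    float v = data[tid];
    if (warpInBlock < warpsInBlock / 2) {
        data[tid] = v;                    // early finishers: warp slots freed here
        return;
    }
    for (int i = 0; i < iters; ++i)       // long-running warps keep the block resident
        v = v * 1.0001f + 0.5f;
    data[tid] = v;
}
```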

Related

Warp and block scheduling in CUDA - what exactly happens, and questions about eligible warps

I understand how warps and blocks are scheduled in CUDA - but not how these two scheduling arrangements come together. I know that once there are enough execution resources in an SM to support a new block, a new block is executed, and I know that eligible warps are selected to be executed every clock cycle (if the spare execution resources allow). However, what exactly makes a warp "eligible"? And what if there are enough execution resources to support a new warp - but not a new block? Does the block scheduling include warp scheduling? Help will be highly appreciated, thanks!
Does the block scheduling include warp scheduling?
The block scheduler and the warp scheduler should be thought of as 2 separate entities. In fact I would view the block-scheduler as a device-wide entity whereas the warp scheduler is a per-SM entity.
You can imagine that there may be a "queue" of blocks associated with each kernel launch. As resources on a SM become available, the block scheduler will deposit a block from the "queue" onto that SM.
With that description, block scheduling does not include warp scheduling.
However, what exactly makes a warp "eligible"?
We're now considering a block that is already deposited on a SM. A warp is "eligible" when it has one or more instructions that are ready to be executed. The opposite of "eligible" is "stalled". A warp is "stalled" when it has no instructions that are ready to be executed. The GPU profiler documentation describes a variety of possible "stall reasons"(*), but a typical one would be a dependency: An instruction that depends on the results of a previous instruction (or operation, such as a memory read) is not eligible to be issued until the results from the previous instruction/operation are ready. Also note that the GPU currently is not an out-of-order machine. If the next instructions to be executed are currently stalled, the GPU does not search (very far) into the subsequent instruction stream for possible independently executable instructions.
And what if there are enough execution resources to support a new warp - but not a new block?
That doesn't provide anything useful. In order to schedule a new block (i.e. for the block scheduler to deposit a new block on a SM) there must be enough resources available for the entire block. (The block scheduler does not deposit blocks warp-by-warp. It is an all-or-nothing proposition, on a block-by-block basis.)
(*) There is one "stall reason" called "not selected", which does not actually indicate the warp is stalled. It means that the warp is in fact eligible, but it was not selected for instruction dispatch on that cycle, usually because the warp scheduler(s) chose instruction(s) from other warp(s), to issue in that cycle.
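As a small sketch of the dependency case described above (kernel and variable names are mine, purely illustrative): the add that consumes the loaded value cannot issue until the global load completes, so the warp is stalled (not eligible) at that point, whereas the multiply on `c` is independent and can still be issued while the load is in flight.

```cuda
// Hypothetical kernel illustrating a dependency stall.
__global__ void dependencyStall(const float *in, float *out, float c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a = in[tid];    // long-latency global load issued here
    float b = c * 2.0f;   // independent of the load: can issue while it is in flight
    float d = a + 1.0f;   // depends on the load: the warp stalls here (not eligible)
                          // until the loaded value arrives
    out[tid] = b + d;
}
```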

How can a warp cause another warp to be in the idle state?

As the title says, I want to know how a warp can cause another warp to go to the idle state. I have read a lot of Q&As on SO but I cannot find the answer. Can only one warp in a block run at any given time? If so, the idle state of a warp has no meaning; but if multiple warps can run at the same time, each warp can do its work independently of the others.
The paper says: "Irregular work-items lead whole warps to be in an idle state (e.g., warp0 w.r.t. warp1 in the following figure)."
The terms used by the Nsight VSE profiler for a warp's state are defined at http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm. These terms are also used in numerous GTC presentations on performance analysis.
The compute work distributor (CWD) will launch a thread block on a SM when all resources for the thread block are available. Resources include:
thread block slot
warp slots (sufficient for the block)
registers for each warp
shared memory for the block
barriers for the block
When a SM has sufficient resources the thread block is launched on the SM. The thread block is rasterized into warps. Warps are assigned to warp schedulers. Resources are allocated to each warp. At this point a warp is in an active state, meaning that the warp can execute instructions.
On each cycle each warp scheduler selects from a list of eligible warps (active, not stalled) and issues 1-2 instructions for the warp. A warp can become stalled for numerous reasons. See the documentation above.
Kepler through Volta GPUs (except GP100) have 4 warp schedulers (sub-partitions) per streaming multiprocessor (SM). All warps of a thread block must be on the same SM. Therefore, on any given cycle a thread block may issue instructions for up to 4 warps (one per sub-partition) in the thread block.
Each warp scheduler can pick any of the eligible warps each cycle. The SM is pipelined, so all warps of a maximum-sized thread block (1024 threads == 32 warps) can have instructions in flight every cycle.
The only definitions of "idle" that I can determine without additional context are:
- If a warp scheduler has 2 eligible warps and 1 is selected then the other is stalled in a state called not selected.
- If warps in a thread block execute a barrier (__syncthreads) then the warps will stall on the barrier (not eligible) until the requirements of the barrier are met. The warps are stalled on the barrier.
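A minimal sketch of the barrier case (hypothetical kernel, assuming 256 threads per block): warp 0 takes a slow path, so the other warps reach __syncthreads() first and sit stalled on the barrier until warp 0 arrives.

```cuda
// Hypothetical kernel: warp 0 does extra work before the barrier, so the
// other warps of the block are stalled on __syncthreads() until warp 0 arrives.
__global__ void barrierStall(float *data, int extraIters)
{
    __shared__ float tile[256];              // assumes blockDim.x == 256
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float v = data[tid];
    if (threadIdx.x / warpSize == 0)         // only warp 0 takes the slow path
        for (int i = 0; i < extraIters; ++i)
            v = v * 1.0001f + 0.5f;

    tile[threadIdx.x] = v;
    __syncthreads();                         // other warps wait (stalled) here

    data[tid] = tile[(threadIdx.x + 1) % blockDim.x];
}
```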

task scheduling of NVIDIA GPU

I have some doubts about the task scheduling of NVIDIA GPUs.
(1) If a warp of threads in a block (CTA) has finished but other warps are still running, will this warp wait for the others to finish? In other words, do all threads in a block (CTA) release their resources only when all threads have finished? I think this should be right, since threads in a block share the shared memory and other resources, and these resources are allocated at CTA granularity.
(2) If all threads in a block (CTA) are stalled on some long-latency operation such as a global memory access, will a new CTA's threads occupy their resources, the way a CPU switches tasks? In other words, once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it take up its resources until it has finished?
I would appreciate it if someone could recommend some books or articles about GPU architecture. Thanks!
The Compute Work Distributor will schedule a thread block (CTA) on a SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread block level resources such as shared memory are allocated. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round-robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore the warp will be restored to the same SM with the same warp-id.
When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete and then the resource manager releases the warp level resources, which include the warp-id and register file.
When all warps in a thread block complete then block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.
This answer does not use the term CUDA core as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency is specific to each architecture. Each SM subpartition and SM has other execution units including load/store units, double precision floating point units, half precision floating point units, branch units, etc.
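If you want to see the per-block resources the Compute Work Distributor has to find before it can place a block, the runtime can report them. Below is a small sketch using cudaFuncGetAttributes with a made-up kernel (the kernel itself is not from the question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel, only used to have something to query.
__global__ void myKernel(float *data)
{
    __shared__ float buf[1024];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x % 1024] = data[tid];
    __syncthreads();
    data[tid] = buf[(threadIdx.x + 1) % 1024] * 2.0f;
}

int main()
{
    // Per-block/per-thread resources that must fit on an SM before the
    // Compute Work Distributor can place a block of this kernel there.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```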
I recommend this article. It's somewhat outdated, but I think it is a good starting point. The article targets the Kepler architecture, so more recent ones, such as Pascal, may have some discrepancies in their behavior.
Answers for your specific questions (based on the article):
Q1. Do threads in a block release their resources only after all threads in the block finish running?
Yes. A warp that has finished running while other warps in the same block have not still holds its resources, such as registers and shared memory.
Q2. When all threads in a block are stalled, does the block still occupy its resources, or does a new block of threads take over the resources?
You are asking whether a block can be preempted. I've searched the web and got the answer from here.
On compute capabilities < 3.2 blocks are never preempted.
On compute capabilities 3.2+ the only two instances when blocks can be preempted are during device-side kernel launch (dynamic parallelism) or single-gpu debugging.
So blocks don't give up their resources when stalled by a global memory access. Rather than expecting stalled warps to be preempted, you should design your CUDA code so that there are plenty of warps resident on an SM, waiting to be dispatched. In that case, even when some warps are waiting for a global memory access to finish, the schedulers can issue other warps, effectively hiding the latency.
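As a rough illustration of "plenty of warps resident in an SM", the occupancy API can report how many blocks (and therefore warps) of a given kernel fit on one SM; the kernel below is a made-up placeholder:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel dominated by global-memory latency.
__global__ void copyScale(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid] * 2.0f;
}

int main()
{
    // How many 256-thread blocks of this kernel can be resident per SM?
    // More resident blocks means more warps the scheduler can switch to
    // while other warps are waiting on global memory.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copyScale, 256, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("resident blocks per SM: %d\n", blocksPerSM);
    printf("resident warps per SM : %d (hardware max %d)\n",
           blocksPerSM * 256 / 32, prop.maxThreadsPerMultiProcessor / 32);
    return 0;
}
```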

Why smaller block size (same overall thread count) exposes more parallelism?

I'm reading "Professional CUDA C Programming" by Cheng et al. and there are examples of how a (very simple, single-line) kernel is being run for example with <<<1024, 512>>> performs worse than one with <<<2048, 256>>>. And then they state (several times) that you might have expected this result because the second run has more blocks and therefore exposes more parallelism. I can't figure out why though. Isn't the amount of parallelism governed by the number of concurrent warps in the SM? What does block size have to do with that - it doesn't matter to which block these warps belong to - the same block or different blocks, so why would using smaller blocks expose more parallelism (on the contrary, if the block size is too small I'd hit the max blocks per SM limit, resulting in fewer concurrent warps)? The only scenario I can envision is blocks of 1024 threads = 32 warps on Fermi, which has a max of 48 concurrent warps per SM limit. This means that only 1 concurrent block, and only 32 concurrent warps are possible, reducing the amount of parallelism, but that's a very specific use case.
UPDATE:
Another thing I thought of after posting: a block cannot be evicted from the SM until all of the warps in it have finished. Thus, at the end of a block's execution there could be a situation where the last few "slowest" warps are holding the entire block in the SM, with most of the warps in that block already finished, but a new block cannot be loaded until those few executing warps are finished. So in this case the efficiency becomes low. Now if the blocks are smaller this will still happen, but the number of finished warps relative to executing warps is smaller, hence the efficiency is higher. Is this it?
Yes, this is it. The second paragraph in your question is a good answer.
In more detail, the number of warp schedulers inside every SM is limited (usually 2). Each warp scheduler keeps track of a number of active warps, and schedules a warp for execution only if the warp is able to move further in the program. The number of active warps tracked by a warp scheduler has a maximum (usually 32). Because the resources owned by the thread block (such as shared memory) cannot be released for a new thread block until all of its warps finish, a large block size can reduce the number of candidate active warps available to the scheduler if a few warps take a long time to finish. This can result in reduced performance, either due to resource idleness or due to the SM's inability to cover the latency of memory accesses. A bigger block size also increases the probability of warps stalling when synchronizing across the thread block using __syncthreads() or one of its variants, and may therefore lead to a similar phenomenon.
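For reference, here is a minimal sketch of the two launch configurations from the question (the kernel body is a placeholder; the book's actual kernel may differ):

```cuda
#include <cuda_runtime.h>

// Placeholder for the book's simple single-line kernel.
__global__ void simpleKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = data[tid] + 1.0f;
}

int main()
{
    const int n = 1024 * 512;            // same total thread count either way
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 512 threads/block = 16 warps/block: a few slow warps at the end of a
    // block keep 16 warps' worth of slots and block resources tied up.
    simpleKernel<<<1024, 512>>>(d_data);

    // 256 threads/block = 8 warps/block: the "tail" of a finishing block
    // holds fewer resources, so replacement blocks can be scheduled sooner.
    simpleKernel<<<2048, 256>>>(d_data);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```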

Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what, what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core is able to execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
For the GTX 970 there are 13 Streaming Multiprocessors (SMs) with 128 CUDA Cores each. CUDA Cores are also called Stream Processors (SPs).
You can define grids which map blocks to the GPU.
You can define blocks which map threads to Stream Processors (the 128 CUDA Cores per SM).
One warp is always formed by 32 threads and all threads of a warp are executed simultaneously.
To use the full possible power of a GPU you need many more threads per SM than the SM has SPs. For each compute capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (number of SPs free); then the block is loaded onto that SM, and the SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The thing is, if the threads do a memory access, they will block until the memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would have work. Therefore the scheduler switches to execute another warp if one is available, and if this warp blocks it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from to execute). If the occupancy is high it is more unlikely that there is no work for the SPs.
Your statement that each CUDA core will execute one block at a time is wrong. If you talk about streaming multiprocessors, they can execute warps from all threads which reside in the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks resident from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can download an occupancy calculation spreadsheet from NVIDIA (the CUDA Occupancy Calculator).
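If you prefer doing the calculation programmatically rather than in the spreadsheet, the runtime's occupancy API can do it; a small sketch with a made-up kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel, only used to have something to query.
__global__ void scaleKernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;
}

int main()
{
    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
    printf("suggested block size              : %d\n", blockSize);
    printf("min grid size to fully load device: %d\n", minGridSize);
    return 0;
}
```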
(The same scheduling description given above for "task scheduling of NVIDIA GPU" applies here: the Compute Work Distributor places a thread block on an SM only when sufficient resources are available, warps are allocated round-robin to the SM sub-partitions, warp-level resources are released when a warp completes, and block-level resources are released when all of the block's warps complete. As before, the term CUDA core is avoided because it introduces an incorrect mental model.)
In order to maximize performance the developer has to understand the trade off of blocks vs. warps vs. registers/thread.
The term occupancy is the ratio of active warps to maximum warps on an SM. Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimal number of warps per SM should at least be equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction level parallelism then this could be reduced. Almost all kernels issue variable-latency instructions such as memory loads that can take 80-1000 cycles. This requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer find that balance.
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can be a common stall reason. This can be alleviated by reducing the number of warps per thread block.
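One way to act on this trade-off in code is __launch_bounds__, which caps the block size and lets the compiler limit register use so that more blocks (and therefore more warps) can be resident per SM; the kernel and numbers below are illustrative:

```cuda
// Hypothetical kernel: at most 256 threads per block, and the compiler is
// asked to keep register use low enough for at least 4 resident blocks per SM.
__global__ void __launch_bounds__(256, 4)
boundedKernel(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid] * 0.5f + 1.0f;
}
```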
There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks. Each block may contain several threads.
An SM has multiple CUDA cores (as a developer, you should not care about this because it is abstracted by the warp), which work on threads. An SM always works on warps of threads (always 32 threads per warp). A warp only works on threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and shared memory.
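A minimal sketch tying the hierarchy together (all names are made up): the grid's blocks are distributed across SMs, each block's threads are grouped into warps of 32, and a warp never mixes threads from different blocks.

```cuda
#include <cuda_runtime.h>

// Made-up kernel that records, for each thread, which block and which
// warp within that block it belongs to.
__global__ void hierarchyDemo(int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int warp = threadIdx.x / warpSize;                   // warp index within the block
    out[tid] = blockIdx.x * 1000 + warp;
}

int main()
{
    const int blocks = 256, threadsPerBlock = 128;       // 4 warps per block
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));
    hierarchyDemo<<<blocks, threadsPerBlock>>>(d_out);   // blocks spread over the SMs
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```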