What is warp-level-programming (racecheck) - cuda

In the online racecheck documentation, the severity level has this description of hazard level WARNING:
An example of this are hazards due to warp level programming that make the assumption that threads are proceeding in groups.
The statement is confusing because threads are processed in groups. (The SM executes code across a warp.) If they are not processed in groups, then how are they processed?
What does "warp level programming" mean? (What would non warp level programming be?)

It's true that all processing is handled in warps. Warp-level programming, also called warp-synchronous programming, depends on this to ensure correct behavior. Many, perhaps most, codes do not depend on the concept of a warp, or on there being 32 threads per warp, to deliver correct behavior.
There are at least two concerns. First, in the presence of control structures such as if/then/else, it's possible that the threads in a warp are not all executing in lockstep. Second, there is no guarantee that future architectures will preserve the concept of a warp or the figure of 32 threads per warp.
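As a minimal sketch of the difference, the kernel below sums 32 values with a single warp through shared memory. Older warp-synchronous code typically left out the barrier inside the loop and relied on lock-step execution, which is the kind of assumption the WARNING description refers to. The names `warpSumKernel` and `warpSumOnDevice` are hypothetical, the launch is assumed to use exactly 32 threads, and `__syncwarp()` requires CUDA 9 or later.

```cuda
#include <cuda_runtime.h>

// Sums 32 ints with a single warp through shared memory.
// Assumption: the kernel is launched with exactly 32 threads in one block.
__global__ void warpSumKernel(const int *in, int *out)
{
    __shared__ int partial[32];
    const unsigned lane = threadIdx.x;

    partial[lane] = in[lane];
    __syncwarp();                        // make every store visible to the warp

    for (unsigned offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            partial[lane] += partial[lane + offset];
        // Legacy warp-synchronous code omitted this barrier and assumed the
        // warp ran in lock-step. Here the ordering is stated explicitly.
        __syncwarp();
    }
    if (lane == 0)
        *out = partial[0];
}

// Hypothetical host helper: copies 32 values to the device, returns their sum.
int warpSumOnDevice(const int *hostIn)
{
    int *dIn = nullptr, *dOut = nullptr, result = -1;
    cudaMalloc(&dIn, 32 * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, hostIn, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warpSumKernel<<<1, 32>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Within one loop step no lane writes a location that another lane reads, so a single barrier per step is enough to order one step's writes before the next step's reads.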

Related

Why is there a warp-level synchronization primitive in CUDA?

I have two questions regarding __syncwarp() in CUDA:
If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary?
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match with the groups, how would one guarantee correctness when using __syncwarp()?
If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized?
No. There can be warp level execution divergence (usually branching, but can be other things like warp shuffles, voting, and predicated execution), handled by instruction replay or execution masking. Note that in "modern" CUDA (CUDA 9 and later, in particular on Volta and newer GPUs), implicit warp synchronous programming is no longer safe, thus warp level synchronization is not just desirable, it is mandatory.
If so, what exactly does __syncwarp() do, and why is it necessary?
Because there can be warp level execution divergence, and this is how synchronization within a divergent warp is achieved.
Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how a more granular synchronization than __syncthreads() may be useful, but since the warps the block is split into may not match with the groups, how would one guarantee correctness when using __syncwarp()?
By ensuring that the split is always performed explicitly using calculated warp boundaries (or a suitable thread mask).
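One way to sketch "calculated warp boundaries" is shown below, assuming a one-dimensional block whose size is a multiple of 32 (for multi-dimensional blocks the linear thread index would have to be computed first, since warps are formed from consecutive linear thread IDs). Each thread publishes a value in shared memory and then reads its neighbour's value within the same warp. The names `rotateWithinWarpKernel` and `rotateWithinWarps` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Each 32-thread group rotates its values by one position via shared memory.
// Assumptions: 1D block, blockDim.x is a multiple of 32 and at most 1024.
__global__ void rotateWithinWarpKernel(const int *in, int *out)
{
    __shared__ int buf[1024];
    const int tid    = threadIdx.x;
    const int warpId = tid / warpSize;       // which group this thread is in
    const int lane   = tid % warpSize;       // position inside the group
    const int base   = warpId * warpSize;    // first thread of the group
    const int gid    = blockIdx.x * blockDim.x + tid;

    buf[tid] = in[gid];
    __syncwarp();                            // only this warp has to arrive

    // Read the value stored by the next lane of the same warp (wrapping).
    out[gid] = buf[base + (lane + 1) % warpSize];
}

// Hypothetical host helper; n is assumed to be a multiple of 1024.
void rotateWithinWarps(const int *hostIn, int *hostOut, int n)
{
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);
    rotateWithinWarpKernel<<<n / 1024, 1024>>>(dIn, dOut);
    cudaMemcpy(hostOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Because the groups are derived from `threadIdx.x / warpSize`, they coincide with the hardware warps by construction, which is what makes `__syncwarp()` sufficient here.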

Why does each thread have its own instruction address counter inside a warp?

Warps in CUDA always include 32 threads, and all of these 32 threads run the same instruction when the warp is running on an SM. The previous question also says each thread has its own instruction counter, as quoted below.
Then why does each thread need its own instruction address counter if all 32 threads always execute the same instruction? Could the threads inside one warp not just share an instruction address counter?
Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data
I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the author's intent.
However, an independent program counter per thread is considered to be a new feature in Volta; see figure 21 and its caption in the Volta whitepaper:
Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.
The same whitepaper probably does about as good a job as you will find of explaining why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:
Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency.
Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.
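A practical consequence can be sketched as follows: after a divergent branch, code should not assume the threads have reconverged, so an exchange between lanes names its participants with a mask. In this hedged example the even and odd lanes take different paths and then swap values with `__shfl_sync`, whose full mask states that all 32 lanes must arrive before the exchange happens. The names `divergeThenShuffleKernel` and `divergeThenShuffle` are hypothetical, and a launch with exactly 32 threads is assumed.

```cuda
#include <cuda_runtime.h>

// Even and odd lanes diverge, then every lane reads its right neighbour's value.
// Assumption: launched with exactly one warp (32 threads).
__global__ void divergeThenShuffleKernel(int *out)
{
    const int lane = threadIdx.x % warpSize;
    int value;

    if (lane % 2 == 0)
        value = 100 + lane;      // path taken by even lanes
    else
        value = 200 + lane;      // path taken by odd lanes

    // The mask 0xffffffff names all 32 lanes as participants, so the shuffle
    // itself waits for both sides of the branch before exchanging data.
    out[threadIdx.x] = __shfl_sync(0xffffffffu, value, (lane + 1) % warpSize);
}

// Hypothetical host helper: fills hostOut[0..31] with the exchanged values.
void divergeThenShuffle(int *hostOut)
{
    int *dOut = nullptr;
    cudaMalloc(&dOut, 32 * sizeof(int));
    divergeThenShuffleKernel<<<1, 32>>>(dOut);
    cudaMemcpy(hostOut, dOut, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
}
```

Lane 0 receives lane 1's value (201), lane 1 receives lane 2's value (102), and lane 31 wraps around to lane 0's value (100).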

How to understand "All threads in a warp execute the same instruction at the same time." in GPU?

I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:
CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.
The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.
The first paragraph says "All threads in a warp execute the same instruction at the same time.", while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior.". This confuses me, because the two statements seem contradictory. Could anyone explain it?
There is no contradiction. In the classic SIMT model (before Volta introduced independent thread scheduling), all active threads in a warp execute the same instruction in lock-step. To support conditional execution and branching, CUDA relies on two mechanisms in the SIMT model:
Predicated execution
Instruction replay/serialisation
Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction without a branch. Instruction replay is how a classic conditional branch is dealt with. All threads execute all branches of the conditionally executed code by replaying instructions. Threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, because it has a significant impact on performance.
This is how lock-step execution can support branching.
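The two kernels below are a rough illustration, not a statement of what the compiler emits. The first uses an ordinary branch that may diverge inside a warp. The second spells out the masking idea by hand: every thread evaluates both results and a per-thread condition selects one. Whether a given branch is compiled to predicated instructions or to a real branch is up to the compiler, and the names `stepWithBranch`, `stepWithSelect` and `runStep` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Ordinary branch: threads of one warp may take different paths.
__global__ void stepWithBranch(const int *in, int *out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)
        out[i] = in[i] / 2;
    else
        out[i] = 3 * in[i] + 1;
}

// Hand-written analogue of masking: both results are computed by every
// thread, and a per-thread condition picks which one is kept.
__global__ void stepWithSelect(const int *in, int *out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const int x          = in[i];
    const int evenResult = x / 2;
    const int oddResult  = 3 * x + 1;
    out[i] = (x % 2 == 0) ? evenResult : oddResult;
}

// Hypothetical host helper: runs one of the two kernels on n inputs.
void runStep(const int *hostIn, int *hostOut, int n, bool useSelect)
{
    int *dIn = nullptr, *dOut = nullptr;
    const int threads = 128;
    const int blocks  = (n + threads - 1) / threads;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);
    if (useSelect)
        stepWithSelect<<<blocks, threads>>>(dIn, dOut, n);
    else
        stepWithBranch<<<blocks, threads>>>(dIn, dOut, n);
    cudaMemcpy(hostOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

Both kernels produce the same output for the same input; they differ only in how the two paths are expressed.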

Can kernel change its block size?

The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using fewer threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with smaller block size.
What I'm asking is, provided unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting the work into two consecutive kernel executions?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However, there are at least two considerations:
First, the usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
Second, from an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() afterwards, you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
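The retire-by-warp idea can be sketched as below, assuming a one-dimensional block. All threads take part in a first phase that ends with `__syncthreads()`, then the surviving thread count is rounded up to a warp boundary so that only whole warps return, and no block-wide barrier is used after that point. The names `retireIdleWarpsKernel` and `retireIdleWarps` are hypothetical, and the arithmetic stands in for the real compaction work.

```cuda
#include <cuda_runtime.h>

// Phase 1 uses every thread; phase 2 uses only the first activeCount threads.
// Assumption: 1D block, data has one element per thread.
__global__ void retireIdleWarpsKernel(int *data, int activeCount)
{
    const int tid = threadIdx.x;

    data[tid] += 1;        // phase 1: placeholder for the compaction work
    __syncthreads();       // legal here, no thread has returned yet

    // Round up to a warp boundary so that only complete warps exit.
    const int keep = ((activeCount + warpSize - 1) / warpSize) * warpSize;
    if (tid >= keep)
        return;            // these warps retire as whole units

    // From here on, __syncthreads() must not be used.
    if (tid < activeCount)
        data[tid] *= 2;    // phase 2: placeholder for the remaining work
}

// Hypothetical host helper: one block of n threads (n <= 1024).
void retireIdleWarps(int *hostData, int n, int activeCount)
{
    int *dData = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    retireIdleWarpsKernel<<<1, n>>>(dData, activeCount);
    cudaMemcpy(hostData, dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
}
```

With 256 threads and `activeCount` of 70, threads 96 to 255 (five whole warps) return early, while the third warp stays resident because some of its threads still have work.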

What is the context switching mechanism in GPU?

As far as I know, GPUs switch between warps to hide memory latency. But under which conditions will a warp be switched out? For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds?
Thanks
First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.
The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.
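Since blocks stay resident once launched, the number of warps the SM can choose from is bounded by registers and shared memory. A hedged way to see that bound for a particular kernel is the runtime occupancy query sketched below. The kernel `axpyKernel` and the helper `residentWarpsPerSM` are hypothetical, and the number returned depends on the device and compiler.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel whose resource usage is being queried.
__global__ void axpyKernel(float a, const float *x, float *y, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Returns how many warps of axpyKernel can be resident on one SM at once
// for the given block size, as reported by the CUDA runtime.
int residentWarpsPerSM(int blockSize)
{
    int device = 0;
    int blocksPerSM = 0;
    cudaDeviceProp prop;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // Accounts for the kernel's register and shared-memory requirements.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, axpyKernel,
                                                  blockSize, 0);

    const int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    return blocksPerSM * warpsPerBlock;
}
```

The result says how many warps are available for latency hiding, but nothing about the order in which the scheduler issues from them.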