The CUDA programming guide has the following to say:
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.
I'm thinking lockstep because of the "one common instruction at a time" wording.
So what happens in the case where there is no branching and each thread needs to compute an O(n) operation?
Won't some threads in the warp complete before others if the value of the data they operate on is smaller?
If some threads do complete before others, do they remain idle until the others complete?
Each single instruction in a warp is performed in lockstep. The next instruction can be fetched only when the previous one has completed.
If an instruction needs a different amount of time for different threads (e.g. one thread loaded data from cache, while the other waits for global memory reads), then all threads have to wait.
That being said, I am not aware of any single instruction having O(n) complexity. What you are probably referring to is a loop of size n being executed by each of the threads in a warp. A loop, like any other control-flow construct, has a conditional jump. Threads that exit the loop early become masked and wait for the threads still in the loop. When all threads signal that they want to exit, they converge, and the following operations are once again performed in perfect sync.
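As an illustration (a hypothetical kernel with made-up names, not code from the question), this is the kind of data-dependent loop being described:

    // Hypothetical kernel: each thread runs a loop whose trip count depends
    // on its own data. Threads with a small n[i] leave the loop early, get
    // masked off, and idle until the slowest thread in their warp finishes.
    __global__ void perThreadLoop(const int *n, float *out, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= count) return;
        float acc = 0.0f;
        for (int k = 0; k < n[i]; ++k)   // data-dependent trip count => divergence
            acc += 1.0f / (k + 1);       // early finishers wait here, masked
        out[i] = acc;                    // all threads of the warp reconverge here
    }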
Update: As #knedlsepp points out (thank you!), since Volta this is no longer true. The GPU may split a warp into smaller pieces and run those independently, thus breaking the lockstep. You shouldn't assume too much, but warp synchronisation primitives may help.
In practice, the GPU will still try to run the whole warp in lockstep when possible, as this is most efficient. To my knowledge (although I cannot firmly confirm this anymore; someone may prove me wrong), there is still a single instruction being executed at once, but different branches with different masks can now be interleaved in time. For complex control flow, it may even happen that the same branch is executed multiple times, with different masks!
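For example, a warp-level reduction written against the post-Volta model might look like this minimal sketch (the helper name is mine; __shfl_down_sync is the standard intrinsic and takes an explicit participation mask instead of relying on implicit lockstep):

    // Sketch of a warp-level sum that does not assume implicit lockstep.
    // The *_sync primitives take a mask naming the lanes that participate.
    __device__ float warpSum(float v)
    {
        unsigned mask = 0xffffffffu;             // assumes all 32 lanes participate
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(mask, v, offset);
        return v;                                // lane 0 ends up holding the warp's sum
    }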
I remember speeding up my CUDA-based ray-tracer 2-3 times when I eliminated all break and mid-function return statements, which made it hard for the compiler to figure out the optimal control flow and masking.
Related
I have a question about branch predication in GPUs. As far as I know, GPUs handle branches with predication.
For example I have a code like this:
if (C)
    A
else
    B
So if A takes 40 cycles and B takes 50 cycles to finish execution, and assuming both A and B are executed for one warp, does it take 90 cycles in total to finish this branch? Or do they overlap A and B, i.e., some instructions of A are executed, then wait for a memory request, then some instructions of B are executed, then wait for memory, and so on?
Thanks
All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads which are not following the active path executing the functional equivalent of a NOP (I think I recall that there is a conditional execution flag attached to each thread in a warp which allows non-executing threads to be masked off).
So in your example, the 90 cycles answer is probably a better approximation of what really happens than the alternative.
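To make that concrete, here is a hypothetical kernel with the same shape as the example in the question, with A and B standing for the two arithmetic bodies:

    // Hypothetical kernel corresponding to the if (C) A else B example.
    // If both branches are taken by different threads of the same warp, the
    // warp issues the instructions of A (with the "else" threads masked off)
    // and then the instructions of B (with the "if" threads masked off),
    // so the cost is roughly cost(A) + cost(B), not max(A, B).
    __global__ void divergentBranch(const int *c, float *x, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= count) return;
        if (c[i])                        // C: data-dependent condition
            x[i] = x[i] * 2.0f + 1.0f;   // A: issued with "else" threads masked off
        else
            x[i] = sqrtf(x[i]) - 3.0f;   // B: issued with "if" threads masked off
    }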
While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
This appears to contradict the other paragraph, because it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe that before Volta each warp basically had a stack of "return addresses" together with the active mask (or, more likely, the inverse of the active mask, i.e., the mask for running the other part of the branch once you return). With this design, each warp can only have a single active branch at any point in time. A consequence is that the warp scheduler could only ever schedule the one active branch of a warp, which makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
I believe what they basically did with Volta is that there is now a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really want to find out about this implementation detail you maybe could design some experiment based on checking at which point you run out of stack space). This change gives all current branches an explicit representation and allows the warp scheduler to at any time pick threads from any branch to run. As a result, the warp scheduling can be made starvation-free, which gets rid of many of the restrictions that earlier architectures had…
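To make the scheduling limitation concrete, here is the classic intra-warp spin-lock sketch (my own illustration with made-up names, not code from the quoted guide). Pre-Volta, if two threads of the same warp contend, the warp may keep executing the spinning side of the divergence while the lock holder stays masked off, so the lock is never released and the warp can deadlock; with Volta's independent thread scheduling the lock holder can make progress and eventually release the lock.

    // Sketch of an intra-warp lock protecting a small critical section.
    __device__ void criticalSection(int *lock, int *counter)
    {
        while (atomicCAS(lock, 0, 1) != 0) { }   // spin until the lock is acquired
        *counter += 1;                            // protected work
        __threadfence();
        atomicExch(lock, 0);                      // release the lock
    }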
I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:
CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.
The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.
The first paragraph says "All threads in a warp execute the same instruction at the same time.", while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior.". This confuses me, and the two statements seem contradictory. Could anyone explain it?
There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model:
Predicated execution
Instruction replay/serialisation
Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction without a branch. Instruction replay is how a classic conditional branch is dealt with: all threads execute all branches of the conditionally executed code by replaying instructions, and threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, and it can have a significant impact on performance.
This is how lock-step execution can support branching.
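As an illustration (a hypothetical kernel, not taken from the book), a conditional with a very short body is the kind of code the compiler may turn into predicated instructions rather than a real branch; larger bodies typically fall back to the replay/serialisation mechanism:

    // Hypothetical example of a branch that is a good candidate for predication:
    // the assignment can be issued with a per-thread predicate derived from the
    // comparison, so no actual branch (and no divergence penalty) is needed.
    __global__ void clampToZero(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] < 0.0f)
            x[i] = 0.0f;          // short body: typically predicated
    }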
The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with a smaller block size.
What I'm asking is: provided the unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution resources?
Is there a common practice for this, other than splitting into two consecutive kernel executions?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - a group of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However, there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads(), you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
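As a sketch of both considerations (kernel names and the block size of 256 are made up): retiring whole warps with an early return is fine when no later __syncthreads() is needed; when a barrier is required, keep every thread alive and guard the work instead.

    // Version 1: retire whole warps early. Legal only if no __syncthreads()
    // follows, and efficient only if retirement happens on warp boundaries.
    __global__ void compactThenWork(const float *in, float *out, int liveCount)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= liveCount)       // ideally liveCount is a multiple of 32
            return;               // whole warps past liveCount retire
        out[i] = in[i] * 2.0f;
    }

    // Version 2: if a later __syncthreads() is required, do NOT return early;
    // guard the work instead, so every thread still reaches the barrier.
    __global__ void compactThenSync(const float *in, float *out, int liveCount)
    {
        __shared__ float tile[256];              // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < liveCount) ? in[i] : 0.0f;
        __syncthreads();                         // every thread of the block participates
        if (i < liveCount)
            out[i] = tile[threadIdx.x];
    }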
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
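For what the CDP route could look like, here is a minimal sketch (kernel names, the block size, and the compaction step are placeholders; it assumes compute capability 3.5+ and compilation with -rdc=true):

    // Child kernel sized to the compacted element count.
    __global__ void childWork(float *data, int liveCount)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < liveCount)
            data[i] *= 2.0f;
    }

    // Parent kernel: after compaction, one thread launches a child grid with
    // only as many threads as are actually needed.
    __global__ void parentKernel(float *data, int *liveCount)
    {
        // ... stream compaction happens here, producing *liveCount ...
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            int n = *liveCount;
            int block = 256;
            int grid = (n + block - 1) / block;
            childWork<<<grid, block>>>(data, n);   // device-side launch
        }
    }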
This is a problem about computer architecture, and I hope somebody has a clue. More specifically, it is about the MIPS pipelined instruction flow, but I feel unclear about some aspects of it. Because I currently do not have enough reputation, I cannot post an image.
1. Does an S (stall) mean that no following instruction can utilize the time slot taken by the stall?
2. Can two consecutive instructions both have an X (execute) in the same time slot?
3. Is it possible that the M (memory access) and W (write back) of an instruction come before those of its predecessor in a pipelined structure?
4. In the situation of a loop where the last instruction is a repetition of the first instruction, why are there 2 F's (fetch) for the last instruction?
For issue 1, in a simple, scalar pipeline, a stall introduces a pipeline bubble which cannot be "popped". To allow an instruction later in program order to fill a pipeline bubble, that instruction would have to go past the stalled instruction. Supporting such reordering of instructions increases the complexity of the pipeline, which tends to increase design and production costs and to increase either pipeline depth or cycle time (as well as use more energy per active cycle [out-of-order execution can be more energy efficient in total even when more energy is used when active]). The mechanisms needed to support such reordering also increases the complexity of explaining pipelines.
For issue 2, with a more complex pipeline it is possible to begin execution of more than one instruction at the same time. Such processors are called superscalar. With in-order execution, only instructions in a consecutive sequence (in program order) can begin execution at the same time, and this requires that the instructions do not have data dependencies and that sufficient hardware resources are available to execute the instructions and handle their results. For an in-order microarchitecture, the width of the earlier pipeline stages is typically the same as the width of later pipeline stages, though buffering would allow multiple instructions to accumulate behind a stall.
(Even at only two-wide execution, there are usually additional restrictions on what kinds of instructions can be executed in parallel. E.g., one execution port might not handle memory accesses or branches while the other execution port might handle those instructions but not shifts or multiplies. Having two copies of hardware for relatively expensive operations [like shifts and multiplies] increases size and limiting the data paths for memory accesses and branches can simplify design and potentially reduce delay.)
For issue 3, out-of-order execution allows the reordering of instructions, so an instruction later in program order could execute and writeback results to the register file before an earlier instruction. With some additional complexity in handling exceptions/interrupts and arbitrating register write port use (or increasing the number of write ports), it is also possible for an in-order processor to writeback results out of program order. The Motorola 88110 (from the early 1990s) is an example of a processor which did such. In order to handle exceptions, the 88110 had a history buffer to hold data that is overwritten by instructions that are later in program order than where the exception is. The 88110 had two additional read ports to each of the register files to read the data in the destination registers and write such to the history buffer.
For issue 4, I am guessing that you mean the case where the body of the loop is composed of only one instruction. For a typical RISC instruction set the branch instruction controlling the loop is a separate instruction from the instruction performing a computation or memory access, so the loop would actually contain two instructions. (Power, formerly PowerPC, could have a one-instruction delay loop using branch on counter, which decrements the special counter register, but optimizing instruction fetch of a simple implementation for such peculiar code would be foolish.)
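As a rough sketch (plain C, with an illustrative RISC-style mapping in the comments rather than exact generated code), even a loop whose body is a single operation still needs a separate loop-control branch:

    // A loop body consisting of a single addition still compiles, on a typical
    // RISC ISA such as MIPS, to at least two instructions per iteration:
    // one arithmetic instruction for the body and one conditional branch for
    // the loop control (plus the branch delay slot on classic MIPS).
    int accumulate(int step, int limit)
    {
        int acc = 0;
        while (acc != limit)   // -> the loop-control conditional branch
            acc += step;       // -> the single "body" instruction
        return acc;
    }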
For the simple classic 5-stage pipeline with delayed branches, it does not make sense from a performance perspective to avoid an instruction cache access since the loop branch does not introduce a pipeline bubble even when taken. This means that there is no opportunity to execute more instructions. However, in some microarchitectures where redirecting instruction fetch to a non-sequential address introduces a pipeline bubble (particularly if from instruction fetch taking more than one cycle), providing a small fast-access buffer can improve performance. (Instruction fetch bandwidth limitations could also justify a buffer for performance; a small buffer could provide higher bandwidth than a large cache or an off-chip memory.) In addition, to reduce energy use, the use of a loop buffer makes considerable sense, but one would almost certainly not want to limit the size of the buffer to only two instructions (the branch plus one "body" instruction) because such tiny loops are rare and even increasing the buffer size to eight instructions would only add a modest amount of hardware.
In order to specially handle the case of small-bodied loops, such loops must be detected. While the buffer could always be filled with the last N instructions (to avoid the first encounter of the short backward branch not "hitting" in the loop buffer--and such a buffer could also even out variations in instruction fetch which might be caused by crossing cache line boundaries, cache misses, fetch redirection delays, etc.), it would be necessary to check each branch instruction to see if it targeted an instruction within the buffer. (It would even be possible to provide special storage for the loop branch instruction, since storage is only needed for the condition checked, a small index into the loop buffer, and an indication of where the branch is, but short loops are probably not sufficiently common for such specialized hardware.) In effect, a loop buffer can be a very small Level 0 instruction cache.
(A branch target instruction cache [BTIC] is a mechanism similar to a loop buffer, but instead of caching instructions only from the target of the most recent loop branch a BTIC caches instructions from the targets of a number of recent branches. BTICs are primarily used to hide instruction fetch latency.)
When teaching pipelines, such complicating factors are usually avoided initially.