While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
This appears to contradict the other paragraph: it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe the way it used to work before Volta is that each warp basically kept a stack of "return addresses" together with the corresponding active mask (or, more likely, the inverse of the active mask, i.e., the mask for running the other part of the branch once you return). With this design, each warp can only have a single active branch at any point in time, so the warp scheduler could only ever schedule that one active branch. This makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
I believe what they basically did with Volta is to keep a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really wanted to find out about this implementation detail, you could perhaps design an experiment based on checking at which point you run out of stack space). This change gives every current branch an explicit representation and allows the warp scheduler to pick threads from any branch to run at any time. As a result, warp scheduling can be made starvation-free, which removes many of the restrictions that earlier architectures had.
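To make the practical consequence concrete, here is a minimal sketch (my own illustration, not taken from the question or from NVIDIA material) of the classic intra-warp lock pattern. With a single shared program counter the warp can only execute one side of the divergent branch at a time, so if the scheduler keeps replaying the spinning side, the lock holder never reaches the release and the kernel can hang; with Volta's independent thread scheduling the holder eventually runs and the loop completes.

```cuda
#include <cstdio>

// Hypothetical illustration: every thread of one warp takes the same lock.
__global__ void per_thread_lock(int *lock, int *acquired) {
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to acquire the lock
            acquired[threadIdx.x] = 1;      // critical section (trivial here)
            atomicExch(lock, 0);            // release the lock
            done = true;
        }
        // Threads that lost the CAS loop back and try again. Pre-Volta, the
        // spinning path may monopolize the warp's single program counter.
    }
}

int main() {
    int *lock, *acquired;
    cudaMalloc(&lock, sizeof(int));
    cudaMalloc(&acquired, 32 * sizeof(int));
    cudaMemset(lock, 0, sizeof(int));
    cudaMemset(acquired, 0, 32 * sizeof(int));

    per_thread_lock<<<1, 32>>>(lock, acquired);   // exactly one warp
    cudaDeviceSynchronize();

    int host[32];
    cudaMemcpy(host, acquired, sizeof(host), cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < 32; ++i) sum += host[i];
    printf("threads that acquired the lock: %d\n", sum);

    cudaFree(lock);
    cudaFree(acquired);
    return 0;
}
```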
Are these statements true regarding the behaviour of the active set of blocks assigned to a streaming multiprocessor (SMP) for execution, and what the CUDA programming model guarantees:
When a block is assigned to an SMP, it will never be moved to another SMP once it begins executing
If the kernel configuration limits it so there can only be N blocks in the active set per SMP, then when all N blocks are in the active set of the SMP, it cannot remove or add any new blocks from/to that set until one of the existing blocks in the set has finished
If a block within the active set continually spins, all other blocks in the set can still progress on the associated SMP
Copying shared memory and other execution state across from one SMP to another, or backing it up to and restoring it from global memory, doesn't seem like it would ever be a good idea, so I suspect the first two of these behaviours could be guaranteed by the CUDA programming model?
Blocks can move to different SMPs during Compute Preemption, which may be triggered e.g. by single-GPU debugging or context switching if more than one process uses the GPU.
Again, Compute Preemption can invalidate this assumption. Theoretically blocks from a different kernel with lower resource requirements can also start execution on SMPs that are not able to launch any more blocks of the original kernel, although I am not sure whether that is relevant to you.
While there are no guarantees about fairness, the throughput architecture of GPUs generally means that a spinning block will not completely prevent other blocks from making progress. The nanosleep() PTX instruction may help reduce the impact of a spinning thread on other blocks even further.
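As a minimal sketch of that last point (the function and names are my own, not from the answer), a polling loop can call the __nanosleep() intrinsic (compute capability 7.0 and later) to back off while spinning, so the waiting warp gives up issue slots to warps from other resident blocks on the same SM. Note that the pattern still assumes the producer of the flag is actually able to run, which the programming model does not guarantee for non-resident blocks.

```cuda
// Illustrative device-side wait loop with exponential backoff.
__device__ void wait_for_flag(volatile int *flag) {
    unsigned ns = 8;
    while (*flag == 0) {
        __nanosleep(ns);        // sleep this thread for roughly ns nanoseconds
        if (ns < 256) ns *= 2;  // simple exponential backoff
    }
}
```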
In CUDA 9, NVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed?
... and a similar question for the other builtins which now have _sync added to their names.
No, the semantics are not the same. The function calls themselves are different, one is not an alias for the other, new functionality has been exposed, and the implementation behavior now differs between the Volta architecture and previous architectures.
First of all, to set the groundwork, it's necessary to be cognizant of the fact that Volta introduced the possibility of independent thread scheduling, by introducing a per-thread program counter and other changes. As a result, Volta can behave in a non-warp-synchronous fashion for extended periods of time, including during periods of execution when previous architectures would still be warp-synchronous.
Most of the warp intrinsics work by only delivering expected results for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However, there are some requirements, in particular on Pascal and previous architectures. From the programming guide:
Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.
On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the indicated threads in the mask, in order to make the desired/indicated operation valid (assuming the appropriate _sync version of the intrinsic is used). To be clear, the warp execution engine will reconverge threads that are diverged on Volta in order to match the mask; however, it will not overcome programmer-induced errors, such as preventing a thread from participating in a _sync() intrinsic via conditional statements.
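As a minimal sketch of the difference (the kernel and buffer names are my own, for illustration only): __ballot_sync() takes an explicit member mask, and on Volta the hardware will (re)converge the named threads before evaluating the predicate, whereas on Pascal and earlier those threads must already be converged when the intrinsic is issued.

```cuda
// Each warp records a 32-bit ballot of which lanes see a positive value.
__global__ void count_positive(const int *data, unsigned *warp_ballots) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;

    unsigned mask = 0xffffffffu;                      // all 32 lanes participate
    unsigned ballot = __ballot_sync(mask, data[i] > 0);

    if (lane == 0)                                    // lane 0 stores the result
        warp_ballots[i / 32] = ballot;
}
```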
This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and the impact on warp level intrinsics. For that, I encourage reading of the programming guide.
The CUDA programming guide has the following to say:
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.
I'm thinking "lockstep" because of "one common instruction at a time".
So what happens in the case where there is no branching and each thread needs to compute an O(n) operation?
Won't some threads in the warp complete before others if the value of the data they operate on is smaller?
If some threads do complete before others, do they remain idle until the others complete?
Each single instruction in a warp is performed in lockstep. The next instruction can be fetched only when the previous one has completed.
If an instruction needs a different amount of time for different threads (e.g. one thread loaded data from cache, while the other waits for global memory reads), then all threads have to wait.
That being said, I am not aware of any single instruction having a complexity of O(n). What you are probably referring to is a loop of size n being executed by each of the threads in a warp. A loop, like any other control-flow construct, involves a conditional jump. Threads that exit the loop early become masked and wait for the threads still in the loop. When all threads signal that they want to exit, they converge, and the following operations are once again performed in perfect sync.
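A minimal sketch of that situation (the names are illustrative): each thread runs a data-dependent number of iterations. Threads with a small n[i] exit the loop early but are merely masked off; the warp keeps issuing the loop body until its slowest thread is done, and only then do all threads reconverge and continue together.

```cuda
// Per-thread loop with data-dependent trip count.
__global__ void per_thread_loop(const int *n, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < n[i]; ++k)      // trip count differs per thread
        acc += 1.0f / (k + 1);
    out[i] = acc;                       // executed after reconvergence
}
```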
Update: As @knedlsepp points out (thank you!), since Volta this strict lockstep no longer holds. The GPU may split a warp into smaller pieces and run those independently, thus breaking the lockstep. You shouldn't assume too much, but the warp synchronisation primitives may help.
In practice, the GPU will still try to run the whole warp in lockstep when possible, as this is most efficient. To my knowledge (although I cannot firmly confirm it anymore; someone may prove me wrong), there is still a single instruction being executed at once, but different branches with different masks can now be interleaved in time. For complex control flow, it may even happen that the same branch is executed multiple times, with different masks!
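As a minimal sketch of what those warp synchronisation primitives look like (my own example, assuming a single-warp block): since implicit lockstep can no longer be assumed after a divergent branch, __syncwarp() names the threads that must reconverge and also orders their memory accesses, so lanes can safely exchange data through shared memory.

```cuda
// Lanes take different branch paths, then exchange values with a neighbor.
__global__ void exchange_with_neighbor(float *data) {
    __shared__ float buf[32];
    int lane = threadIdx.x % 32;

    if (lane % 2 == 0)
        buf[lane] = data[lane] * 2.0f;   // even lanes take this path
    else
        buf[lane] = data[lane] + 1.0f;   // odd lanes take this path

    __syncwarp();                        // reconverge and order the writes
    data[lane] = buf[lane ^ 1];          // safely read the neighboring lane's slot
}
```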
I remember speeding up my CUDA-based ray tracer 2-3 times when I eliminated all break and mid-function return statements, which made it hard for the compiler to figure out the optimal control flow and masking.
This is a problem about computer architecture and I hope somebody has a clue. More specifically, it is about the MIPS pipelined instruction flow, but I feel unclear about some aspects of it. Because I currently do not have enough reputation, I cannot post an image.
Does an S (stall) mean no following instructions can utilize the time slot taken by the stall?
Can two consecutive instructions both have X (execute) in the same time slot?
Is it possible that the M (Memory Access) and W (Write Back) of an instruction come before those of its predecessor in a pipelined structure?
In the situation of a loop where the last instruction is a repetition of the first instruction, why are there 2 F's (fetch) in the last instruction?
For issue 1, in a simple, scalar pipeline, a stall introduces a pipeline bubble which cannot be "popped". To allow an instruction later in program order to fill a pipeline bubble, that instruction would have to go past the stalled instruction. Supporting such reordering of instructions increases the complexity of the pipeline, which tends to increase design and production costs and to increase either pipeline depth or cycle time (as well as use more energy per active cycle [out-of-order execution can be more energy efficient in total even when more energy is used when active]). The mechanisms needed to support such reordering also increase the complexity of explaining pipelines.
For issue 2, with a more complex pipeline it is possible to begin execution of more than one instruction at the same time. Such processors are called superscalar. With in-order execution, only instructions in a consecutive sequence (in program order) can begin execution at the same time, and this requires that the instructions do not have data dependencies and that sufficient hardware resources are available to execute the instructions and handle their results. For an in-order microarchitecture, the width of the earlier pipeline stages is typically the same as the width of later pipeline stages, though buffering would allow multiple instructions to accumulate behind a stall.
(Even at only two-wide execution, there are usually additional restrictions on what kinds of instructions can be executed in parallel. E.g., one execution port might not handle memory accesses or branches while the other execution port might handle those instructions but not shifts or multiplies. Having two copies of hardware for relatively expensive operations [like shifts and multiplies] increases size and limiting the data paths for memory accesses and branches can simplify design and potentially reduce delay.)
For issue 3, out-of-order execution allows the reordering of instructions, so an instruction later in program order could execute and writeback results to the register file before an earlier instruction. With some additional complexity in handling exceptions/interrupts and arbitrating register write port use (or increasing the number of write ports), it is also possible for an in-order processor to writeback results out of program order. The Motorola 88110 (from the early 1990s) is an example of a processor which did such. In order to handle exceptions, the 88110 had a history buffer to hold data that is overwritten by instructions that are later in program order than where the exception is. The 88110 had two additional read ports to each of the register files to read the data in the destination registers and write such to the history buffer.
For issue 4, I am guessing that you mean the case where the body of the loop is composed of only one instruction. For a typical RISC instruction set the branch instruction controlling the loop is a separate instruction from the instruction performing a computation or memory access, so the loop would actually contain two instructions. (Power, formerly PowerPC, could have a one instruction delay loop using branch on counter which decrements the special counter register, but optimizing instruction fetch for a simple implementation for such peculiar code would be foolish.)
For the simple classic 5-stage pipeline with delayed branches, it does not make sense from a performance perspective to avoid an instruction cache access since the loop branch does not introduce a pipeline bubble even when taken. This means that there is no opportunity to execute more instructions. However, in some microarchitectures where redirecting instruction fetch to a non-sequential address introduces a pipeline bubble (particularly if from instruction fetch taking more than one cycle), providing a small fast-access buffer can improve performance. (Instruction fetch bandwidth limitations could also justify a buffer for performance; a small buffer could provide higher bandwidth than a large cache or an off-chip memory.) In addition, to reduce energy use, the use of a loop buffer makes considerable sense, but one would almost certainly not want to limit the size of the buffer to only two instructions (the branch plus one "body" instruction) because such tiny loops are rare and even increasing the buffer size to eight instructions would only add a modest amount of hardware.
In order to specially handle the case of small-bodied loops, such loops must be detected. While the buffer could always be filled with the last N instructions (to avoid the first encounter of the short backward branch not "hitting" in the loop buffer; such a buffer could also even out variations in instruction fetch which might be caused by crossing cache line boundaries, cache misses, fetch redirection delays, etc.), it would be necessary to check each branch instruction to see if it targeted an instruction within the buffer. (It would even be possible to provide a special storage for the loop branch instruction, since storage is only needed for the condition checked, a small index into the loop buffer, and an indication of where the branch is, but short loops are probably not sufficiently common for such specialized hardware.) In effect, a loop buffer can be a very small Level 0 instruction cache.
(A branch target instruction cache [BTIC] is a mechanism similar to a loop buffer, but instead of caching instructions only from the target of the most recent loop branch a BTIC caches instructions from the targets of a number of recent branches. BTICs are primarily used to hide instruction fetch latency.)
When teaching pipelines, such complicating factors are usually avoided initially.
I always thought that branch divergence is only caused by branching code, like "if", "else", "for", "switch", etc. However, I recently read a paper which says:
"
One can clearly observe that the number of divergent branches taken by threads in each first exploration-based algorithm is at least twice more important than the full exploration strategy. This is typically the results from additional non-coalesced accesses to the global memory. Hence, such a threads divergence leads to many memory accesses that have to be serialized, increasing the total number of instructions executed.
One can observe that the number of warp serializations for the version using non-coalesced accesses is between seven and sixteen times more important than for its counterpart. Indeed, a threads divergence caused by non-coalesced accesses leads to many memory accesses that have to be serialized, increasing the instructions to be executed.
"
It seems like, according to the author, non-coalesced accesses can cause divergent branches. Is that true?
My question is, exactly how many causes are there for branch divergence?
Thanks in advance.
I think the author is unclear on the concepts and/or terminology.
The two concepts of divergence and serialization are closely related. Divergence causes serialization, as the divergent groups of threads in a warp must be executed serially. But serialization does not cause divergence, as divergence refers specifically to threads within a warp running different code paths.
Other things that cause serialization (but not divergence) are bank conflicts and atomic operations.
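A minimal sketch of the distinction (the kernels are my own illustrations): the first kernel genuinely diverges, because threads of the same warp take different branch paths; the second contains no divergent branch at all, yet each warp's shared-memory accesses are serialized, because with an unpadded 32x32 float tile all 32 lanes of a warp hit the same bank (a 32-way bank conflict).

```cuda
// Real branch divergence: even and odd lanes of a warp take different paths.
__global__ void divergent_branch(const int *in, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = in[i] * 2;   // even lanes
    else
        out[i] = in[i] + 1;   // odd lanes
}

// No divergence, but serialized shared-memory access (launch with a 32x32 block).
__global__ void serialized_but_uniform(const float *in, float *out) {
    __shared__ float tile[32][32];           // no padding, so bank conflicts occur
    int tx = threadIdx.x, ty = threadIdx.y;
    tile[tx][ty] = in[ty * 32 + tx];         // column-wise write: all lanes, same bank
    __syncthreads();
    out[ty * 32 + tx] = tile[tx][ty];        // column-wise read: serialized, no divergence
}
```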