I always thought that branch divergence is caused only by branching code, like "if", "else", "for", "switch", etc. However, I recently read a paper in which it says:
"
One can clearly observe that the number of divergent branches taken by threads in each first exploration-based algorithm is at least twice more important than the full exploration strategy. This is typically the results from additional non-coalesced accesses to the global memory. Hence, such a threads divergence leads to many memory accesses that have to be serialized, increasing the total number of instructions executed.
One can observe that the number of warp serializations for the version using non-coalesced accesses is between seven and sixteen times more important than for its counterpart. Indeed, a threads divergence caused by non-coalesced accesses leads to many memory accesses that have to be serialized, increasing the instructions to be executed.
"
It seems like, according to the author, non-coalesced accesses can cause divergent branches. Is that true?
My question is: how many causes of branch divergence are there, exactly?
Thanks in advance.
I think the author is unclear on the concepts and/or terminology.
The two concepts of divergence and serialization are closely related. Divergence causes serialization, as the divergent groups of threads in a warp must be executed serially. But serialization does not cause divergence, as divergence refers specifically to threads within a warp running different code paths.
Other things that cause serialization (but not divergence) are bank conflicts and atomic operations.
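To make the distinction concrete, here is a minimal hypothetical kernel (assuming a single block of 1024 threads): the first if/else is genuine divergence, while the strided shared-memory read at the end is serialization (a 32-way bank conflict) with no divergence at all.

```
__global__ void divergence_vs_serialization(const float *in, float *out)
{
    __shared__ float smem[32 * 32];
    int tid  = threadIdx.x;        // assumes blockDim.x == 1024, one block
    int lane = tid % 32;
    int warp = tid / 32;

    // Divergence: lanes of the same warp take different paths, so the warp
    // runs both paths serially with part of its lanes masked off.
    if (lane < 16)
        out[tid] = in[tid] * 2.0f;
    else
        out[tid] = in[tid] + 1.0f;

    smem[tid] = in[tid];
    __syncthreads();

    // Serialization without divergence: every lane executes the same load,
    // but the addresses are 32 floats apart, so all 32 lanes hit the same
    // shared-memory bank and the access is replayed (serialized) 32 times.
    out[tid] += smem[lane * 32 + warp];
}
```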
Related
While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
This appears to contradict the first paragraph, because it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe the way it used to work before Volta is that each warp would basically have a stack of "return addresses" together with the active mask (or, probably, actually the inverse of the active mask, i.e., the mask for running the other part of the branch once you return). With this design, each warp can only have a single active branch at any point in time. A consequence is that the warp scheduler could only ever schedule the one active branch of a warp. This makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
I believe what they basically did with Volta is that there is now a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really want to find out about this implementation detail you maybe could design some experiment based on checking at which point you run out of stack space). This change gives all current branches an explicit representation and allows the warp scheduler to at any time pick threads from any branch to run. As a result, the warp scheduling can be made starvation-free, which gets rid of many of the restrictions that earlier architectures had…
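To make the nested-branch case concrete, here is a tiny hypothetical kernel (launched with a single warp of 32 threads); the comments describe how the speculative mask/reconvergence-stack model above would handle it.

```
__global__ void nested_branches(int *data)
{
    int tid = threadIdx.x;              // assumes one warp of 32 threads

    if (tid < 16) {                     // outer branch: active mask 0x0000FFFF
        if (tid < 8) {                  // inner branch: push a reconvergence
            data[tid] = 1;              //   entry, run with mask 0x000000FF...
        } else {
            data[tid] = 2;              //   ...then the complement 0x0000FF00
        }                               // inner reconvergence: back to 0x0000FFFF
    } else {
        data[tid] = 3;                  // complement of the outer branch: 0xFFFF0000
    }
    // outer reconvergence point: full mask 0xFFFFFFFF again
    data[tid] += 10;                    // all 32 threads execute this together
}
```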
Suppose many warps in a (CUDA kernel grid) block are updating a fair-sized number of shared memory locations, repeatedly.
In which of the following cases will such work be completed faster?
The case of intra-warp access locality, e.g. the total number of memory positions accessed by each warp is small and most of them are indeed accessed by multiple lanes
The case of access anti-locality, where all lanes typically access distinct positions (and perhaps with an effort to avoid bank conflicts)?
and no less importantly - is this microarchitecture-dependent, or is it essentially the same on all recent NVIDIA microarchitectures?
Anti-localized access will be faster.
On SM 5.0 (Maxwell) and later GPUs, for shared memory atomics (assume atomicAdd) the shared memory unit will replay the instruction due to address conflicts (two lanes with the same address). Normal bank-conflict replays also apply. On Maxwell/Pascal the shared memory unit has fixed round-robin access between the two SM partitions (2 schedulers in each partition). For each partition, the shared memory unit will complete all replays of the instruction prior to moving to the next instruction. The Volta SM will complete the instruction prior to any other shared memory instruction.
Avoid bank conflicts
Avoid address conflicts
On the Fermi and Kepler architectures, a shared memory lock operation had to be performed prior to the read-modify-write operation. This blocked all other warp instructions.
Maxwell and newer GPUs have significantly faster shared memory atomic performance than Fermi/Kepler.
A very simple kernel could be written to micro-benchmark your two different cases. The CUDA profilers provide instruction executed counts and replay counts for shared memory accesses but do not differentiate between replays due to atomics and replays due to load/store conflicts or vector accesses.
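A hedged sketch of such a micro-benchmark (kernel names and the iteration count are made up for illustration): both kernels issue the same number of shared-memory atomics, but the first concentrates them on one address per warp while the second spreads them over distinct, conflict-free addresses.

```
// Case 1: intra-warp locality -- all 32 lanes of a warp hammer one address.
__global__ void atomics_localized(int *out, int iters)
{
    __shared__ int smem[32];
    int warp = threadIdx.x / 32;
    if (threadIdx.x < 32) smem[threadIdx.x] = 0;
    __syncthreads();
    for (int i = 0; i < iters; ++i)
        atomicAdd(&smem[warp], 1);          // same address for the whole warp
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}

// Case 2: anti-locality -- every lane updates its own conflict-free slot.
__global__ void atomics_antilocal(int *out, int iters)
{
    __shared__ int smem[1024];              // assumes blockDim.x <= 1024
    smem[threadIdx.x] = 0;
    __syncthreads();
    for (int i = 0; i < iters; ++i)
        atomicAdd(&smem[threadIdx.x], 1);   // distinct address per lane
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}
```

Timing the two (e.g. with cudaEvent timers) and comparing the shared-memory replay counts in the profiler should make the difference visible.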
There's a quite simple argument to be made even without needing to know anything about how shared memory atomics are implemented in CUDA hardware: At the end of the day, atomic operations must be serialized somehow at some point. This is true in general, it doesn't matter which platform or hardware you're running on. Atomicity kinda requires that simply by nature. If you have multiple atomic operations issued in parallel, you have to somehow execute them in a way that ensures atomicity. That means that atomic operations will always become slower as contention increases, no matter if we're talking GPU or CPU. The only question is: by how much. That depends on the concrete implementation.
So generally, you want to keep the level of contention, i.e., the number of threads that will be trying to perform atomic operations on the same memory location in parallel, as low as possible…
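One common way to lower that contention (a hypothetical sketch, not something claimed above) is to aggregate within the warp first and then issue a single atomic per warp:

```
// Reduce a per-thread value across the warp with shuffles, then let lane 0
// perform the only atomic update for all 32 lanes.
__device__ void warp_aggregated_add(int *counter, int value)
{
    for (int offset = 16; offset > 0; offset /= 2)
        value += __shfl_down_sync(0xFFFFFFFF, value, offset);
    if ((threadIdx.x % 32) == 0)
        atomicAdd(counter, value);          // 1 atomic instead of 32
}
```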
This is a speculative partial answer.
Consider the related question: Performance of atomic operations on shared memory and its accepted answer.
If the accepted answer there is correct (and continues to be correct even today), then warp threads in a more localized access would get in each other's way, making it slower for many lanes to operate atomically, i.e. making anti-locality of warp atomics better.
But to be honest - I'm not sure I completely buy into this line of argumentation, nor do I know if things have changed since that answer was written.
The CUDA programming guide has the following to say:
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.
I'm thinking lockstep because of the phrase "one common instruction at a time".
So what happens in the case where there is no branching and each thread needs to compute an O(n) operation?
Won't some threads in the warp complete before others if the value of the data they operate on is smaller?
If some threads do complete before others do they remain idle until the others complete?
Each single instruction in a warp is performed in lockstep. The next instruction can be fetched only when the previous one has completed.
If an instruction needs a different amount of time for different threads (e.g. one thread loaded data from cache, while the other waits for global memory reads), then all threads have to wait.
That being said, I am not aware of any single instruction having a complexity of O(n). What you are probably referring to is a loop of size n being executed by each of the threads in a warp. A loop, like any other control flow construct, has a conditional jump. Threads that exit the loop early become masked and wait for the threads still in the loop. When all threads signal that they want to exit, they converge, and the following operations are once again performed in perfect sync.
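As a concrete hypothetical example of such a data-dependent loop:

```
__global__ void per_thread_loop(const int *n, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    // Each thread runs n[tid] iterations. Lanes whose n[tid] is small exit
    // the loop early but are merely masked off: the warp keeps issuing loop
    // instructions until its largest n[tid] has been processed.
    for (int i = 0; i < n[tid]; ++i)
        acc += i * 0.5f;
    out[tid] = acc;     // executed in sync again once all lanes have converged
}
```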
Update: As @knedlsepp points out (thank you!), since Volta this is not true. The GPU may split a warp into smaller pieces and run those independently, thus breaking the lockstep. You shouldn't assume too much, but warp synchronization primitives may help.
In practice, the GPU will still try to run the whole warp in lockstep when possible, as this is most efficient. To my knowledge (although I cannot firmly confirm it anymore; someone may prove me wrong), there is still a single instruction being executed at once, but different branches with different masks can now be interleaved in time. For a complex control flow, it may even happen that the same branch is executed multiple times, with different masks!
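Because of this, code that used to rely on implicit lockstep should synchronize the warp explicitly; a minimal sketch (assuming one warp per block and compute capability 7.0+):

```
__global__ void rotate_within_warp(const float *in, float *out)
{
    __shared__ float smem[32];
    int lane = threadIdx.x % 32;            // assumes blockDim.x == 32

    smem[lane] = in[threadIdx.x];
    __syncwarp();                           // don't rely on lockstep: make the
                                            // warp reconverge before reading
    float v = smem[(lane + 1) % 32];        // read the neighbouring lane's value
    out[threadIdx.x] = v;
}
```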
I remember speeding up my CUDA-based ray tracer 2-3 times when I eliminated all break and mid-function return statements, which made it hard for the compiler to figure out the optimal control flow and masking.
The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with smaller block size.
What I'm asking is, provided unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting into two consecutive kernel executions?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
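A hedged sketch of that pattern (identifiers are hypothetical): after the compaction has produced num_items surviving elements, whole warps with no surviving work retire immediately, and no __syncthreads() appears after the early return.

```
__global__ void process_compacted(const float *items, int num_items, float *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;

    // Retire complete warps that have nothing left to do. Because every
    // thread of such a warp returns, the warp stops consuming execution
    // resources. Note: no __syncthreads() may follow this point.
    if (warp * 32 >= num_items)
        return;

    if (tid < num_items)
        out[tid] = items[tid] * 2.0f;       // remaining work on surviving threads
}
```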
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
According to this, http://www.nvidia.co.uk/content/PDF/isc-2011/Ziegler.pdf, I understand that replays in the GPU literature mean serializations. But what are the factors that contribute to the number of serializations?
To investigate this, I did some experiments. I profiled some kernels and found the number of replays (= issued instructions - executed instructions). Sometimes the number of bank conflicts is equal to the number of replays; other times the number of bank conflicts is smaller. This implies that bank conflicts are one contributing factor. What are the others?
According to the slides above (from slide 35 onward), there are some others:
. The instruction cache misses
. Constant memory bank conflicts
To my understanding, there can be two others:
. Branch divergence. Since both paths are executed, there are replays. But I'm not sure whether the number of issued instructions is affected by divergence or not.
. Cache misses. I have heard that long-latency memory requests are sometimes replayed. But in my experiments, L1 cache misses are often higher than replays.
Can anyone confirm that these are the factors that contribute to serializations? Is anything here incorrect, and am I missing anything else?
Thanks
As far as I know, branch divergence contributes to instruction replay.
I am not sure about cache misses. Those should be handled transparently by the memory controller without affecting the instruction count. The worst thing I can think of is that the pipeline stalls until the memory has been properly fetched.