How to avoid a branch here for GPU optimization [duplicate] - cuda

I have a question about branch divergence in GPUs. As far as I know, GPUs handle branches with predication.
For example I have a code like this:
if (C)
A
else
B
If A takes 40 cycles and B takes 50 cycles to finish execution, and assuming both A and B are executed for one warp, does it take 90 cycles in total to finish this branch? Or do A and B overlap, i.e., some instructions of A are executed, then the warp waits on a memory request, then some instructions of B are executed, then it waits for memory again, and so on?
Thanks

All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads not following the active path executing the functional equivalent of a NOP (I think I recall that there is a conditional execution flag attached to each thread in a warp which allows non-executing threads to be masked off).
So in your example, the 90 cycles answer is probably a better approximation of what really happens than the alternative.
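To make the serialization concrete, here is a minimal hypothetical kernel in which half the threads of a warp take each path; the kernel name and the specific operations are illustrative only:

```cuda
// Hypothetical kernel illustrating intra-warp divergence: threads 0-15
// take path A, threads 16-31 take path B. The warp executes both paths
// serially, with threads on the inactive path masked off, so the total
// time is roughly the sum of both paths' costs.
__global__ void divergent(const int *in, int *out)
{
    int tid = threadIdx.x;
    if (tid < 16)
        out[tid] = in[tid] * 2;    // path A: executed first, threads 16-31 masked
    else
        out[tid] = in[tid] + 100;  // path B: executed next, threads 0-15 masked
}
```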

Related

Why does each thread have its own instruction address counter inside a warp?

Warps in CUDA always include 32 threads, and all of these 32 threads run the same instruction when the warp is running in SM. The previous question also says each thread has its own instruction counter as quoted below.
Then why does each thread need its own instruction address counter if all the 32 threads always execute the same instruction, could the threads inside 1 warp just share an instruction address counter?
Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data
I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the author's intent.
However, an independent program counter per thread is considered to be a new feature in Volta; see Figure 21 and its caption in the Volta whitepaper:
Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.
The same whitepaper probably does about as good a job as you will find of why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:
Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency.
Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.
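One practical consequence is that code can no longer assume implicit warp-wide reconvergence before warp-level intrinsics. The following sketch (a hypothetical kernel of my own, not from the whitepaper) shows the explicit `__syncwarp()` that Volta-style independent scheduling makes necessary:

```cuda
// On Volta and later, diverged threads within a warp are not guaranteed
// to have reconverged at any particular point, so code must synchronize
// explicitly before using warp intrinsics that assume all lanes participate.
__global__ void voltaSafeShuffle(int *data)
{
    int v = data[threadIdx.x];
    if (threadIdx.x % 2 == 0)
        v *= 2;                  // even lanes diverge here
    __syncwarp();                // force reconvergence of the full warp
    // exchange values across lanes; the mask says all 32 lanes take part
    v += __shfl_down_sync(0xffffffff, v, 1);
    data[threadIdx.x] = v;
}
```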

Issued load/store instructions for replay

There are two nvprof metrics regarding load/store instructions: ldst_executed and ldst_issued. We know that executed <= issued. I expected that those loads/stores that are issued but not executed are related to branch predication and other incorrect predictions. However, according to this (slide 9) document and this topic, instructions that are issued but not executed are related to serialization and replay.
I don't know whether that reasoning applies to load/store instructions or not. Moreover, I would like to know why such terminology is used for issued-but-not-executed instructions. If there is serialization for any reason, instructions are executed multiple times. So why are they not counted as executed?
Any explanation for that?
The NVIDIA architecture optimizes memory throughput by issuing an instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element, then the access can be performed very efficiently. However, if each thread accesses data in a different cache line, or at a different address in the same bank, then there is a conflict and the instruction has to be replayed.
inst_executed is the count of instructions retired.
inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.
The distinction is made for two reasons:
1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays.
2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.
In the Fermi and Kepler SMs, if a memory conflict was encountered, the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles, reducing the SM's ability to issue instructions to the math pipes. In these SMs, issued > executed indicates an opportunity for optimization, especially if issued IPC is high.
In the Maxwell through Turing SMs, replays for vector accesses, address conflicts, and memory conflicts are handled by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. In these SMs, issued is very seldom more than a few percent above executed.
EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).
On a Kepler (CC 3.*) SM the instruction is issued 1 time, then replayed 31 additional times, as the Kepler L1 can only perform 1 tag lookup per request.
inst_executed = 1
inst_issued = 32
On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then
inst_executed = 1
inst_issued >= 64 (32 requests + 32 replays for misses)
On Maxwell - Turing architecture the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipe.
inst_executed = 1
inst_issued = 1
On Maxwell-Turing Nsight Compute/Perfworks expose throughput counters for each of the memory pipelines including number of cycles due to memory bank conflicts, serialization of atomics, address divergence, etc.
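The access pattern in the example above can be written out as a small kernel; this is my own illustrative sketch of a load where every lane touches a different cache line:

```cuda
// Each thread reads a different 128-byte cache line: a stride of
// 32 ints (4 bytes each) = 128 bytes. On Kepler this single load is
// issued once and replayed 31 more times (one tag lookup per line);
// on Maxwell-Turing the memory unit handles the serialization instead.
__global__ void stridedLoad(const int *in, int *out)
{
    int tid = threadIdx.x;
    out[tid] = in[tid * 32];  // one unique cache line per thread
}
```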
GPU architecture is based on maximizing throughput rather than minimizing latency. Thus, GPUs (currently) don't really do out-of-order execution or branch prediction. Instead of building a few cores full of complex control logic to make one thread run really fast (like you'd have on a CPU), GPUs rather use those transistors to build more cores to run as many threads as possible in parallel.
As explained on slide 9 of the presentation you linked, executed instructions are the instructions that control flow passes over in your program (basically, the number of lines of assembly code that were run). When you, e.g., execute a global load instruction and the memory request cannot be served immediately (it misses the cache), the GPU will switch to another thread. Once the value is ready in the cache and the GPU switches back to your thread, the load instruction will have to be issued again to complete fetching the value (see also this answer and this thread). When you, e.g., access shared memory and there are bank conflicts, the shared memory access will have to be replayed multiple times for different threads in the warp…
The main reason to differentiate between executed and issued instructions would seem to be that the ratio of the two can serve as a measurement for the amount of overhead your code produces due to instructions that cannot be completed immediately at the time they are executed…

Do CUDA threads execute in lockstep for O(n) operations?

The CUDA programming guide has the following to say:
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.
I'm thinking lockstep because of one common instruction at a time.
So what happens in the case where there is no branching and each thread needs to compute an O(n) operation?
Won't some threads in the warp complete before others if the value of the data they operate on is smaller?
If some threads do complete before others do they remain idle until the others complete?
Each single instruction in a warp is performed in lockstep. The next instruction can be fetched only when the previous one has completed.
If an instruction needs a different amount of time for different threads (e.g. one thread loaded data from cache, while the other waits for global memory reads), then all threads have to wait.
That being said, I am not aware of any single instruction having a complexity of O(n). What you are probably referring to is a loop of size n being executed by each of the threads in a warp. A loop, like any other control-flow construct, involves a conditional jump. Threads that exit the loop early become masked and wait for the threads still in the loop. When all threads signal that they want to exit, they converge, and the following operations are once again performed in perfect sync.
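The data-dependent loop scenario can be sketched as follows (a hypothetical kernel; the array `n` holding per-thread trip counts is an assumption of the example):

```cuda
// Data-dependent loop: each thread iterates n[tid] times. Threads with
// small trip counts exit the loop early but are masked off, effectively
// idling until the slowest thread in the warp finishes its iterations.
__global__ void variableLoop(const int *n, int *out)
{
    int tid = threadIdx.x;
    int acc = 0;
    for (int i = 0; i < n[tid]; ++i)  // per-thread trip count
        acc += i;
    out[tid] = acc;  // all threads of the warp reconverge here
}
```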
Update: As @knedlsepp points out (thank you!), since Volta this is not true. The GPU may split a warp into smaller pieces and run those independently, thus breaking the lockstep. You shouldn't assume too much, but warp synchronization primitives may help.
In practice, the GPU will still try to run the whole warp in lockstep when possible, as this is most efficient. To my knowledge (although I cannot firmly confirm it anymore; someone may prove me wrong), there is still a single instruction being executed at once, but different branches with different masks can now be interleaved in time. For complex control flow, it may even happen that the same branch is executed multiple times, with different masks!
I remember speeding up my CUDA-based ray tracer 2-3 times when I eliminated all break and mid-function return statements, which made it hard for the compiler to figure out the optimal control flow and masking.

How to understand "All threads in a warp execute the same instruction at the same time." in GPU?

I am reading Professional CUDA C Programming, and in GPU Architecture Overview section:
CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it then schedules for execution on available hardware resources.
The SIMT architecture is similar to the SIMD (Single Instruction, Multiple Data) architecture. Both SIMD and SIMT implement parallelism by broadcasting the same instruction to multiple execution units. A key difference is that SIMD requires that all vector elements in a vector execute together in a unified synchronous group, whereas SIMT allows multiple threads in the same warp to execute independently. Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior. SIMT enables you to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. The SIMT model includes three key features that SIMD does not:
➤ Each thread has its own instruction address counter.
➤ Each thread has its own register state.
➤ Each thread can have an independent execution path.
The first paragraph says "All threads in a warp execute the same instruction at the same time.", while the second paragraph says "Even though all threads in a warp start together at the same program address, it is possible for individual threads to have different behavior." This confuses me, as the two statements seem contradictory. Could anyone explain it?
There is no contradiction. All threads in a warp execute the same instruction in lockstep at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model:
Predicated execution (See here)
Instruction replay/serialisation (See here)
Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction without a branch. Instruction replay is how a classic conditional branch is dealt with. All threads execute all branches of the conditionally executed code by replaying instructions. Threads which do not follow a particular execution path are masked off and execute the equivalent of a NOP. This is the so-called branch divergence penalty in CUDA, because it has a significant impact on performance.
This is how lock-step execution can support branching.
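As a sketch of the predication case: a short conditional like the one below typically compiles to predicated instructions rather than an actual branch (this is a hypothetical example; whether the compiler predicates depends on the code and architecture):

```cuda
// A short, simple conditional is usually compiled to predicated
// instructions: both candidate results are computed and a per-thread
// predicate selects which one is written, with no branch and hence
// no divergence penalty.
__global__ void predicatedAbs(const int *in, int *out)
{
    int x = in[threadIdx.x];
    out[threadIdx.x] = (x > 0) ? x : -x;  // likely predicated, not branched
}
```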

Role of Warps in NVIDIA GPU architecture with OpenCL

I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and one thing is unclear to me: the necessity for warps.
I know a lot of questions have been asked on this subject, but after having read some articles I still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card, which is a Tesla, but I guess this easily translates to other boards):
A work-item is linked to a CUDA thread, several of which can be executed by a Streaming Processor (SP). BTW, does an SP treat those work-items in parallel?
Work-items are grouped into work-groups. Work-groups run on a Streaming Multiprocessor (SM) and cannot migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a. local memory). One or more work-groups may be executed by a Streaming Multiprocessor. BTW, does an SM treat those work-groups in parallel?
Work-items are executed in parallel inside a work-group. However, synchronization is NOT guaranteed; that's why you need concurrent-programming primitives such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me on the following. Is that true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' part of a work-group. And that's why, in the end, you need things like memory fences to synchronize work-items in work-groups.
2 - The warp scheduler allocates the registers for the 32 threads of a warp when it becomes active.
3 - Also, are the threads executing in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogy is that a warp is the vector that gets processed in parallel, not unlike an AVX or SSE vector on an Intel CPU. This makes an SM a 32-lane vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note also that a GPU takes hyperthreading to the extreme: a work-group will consist of multiple warps, which are all run more-or-less in parallel. You will need memory fences to synchronize all that.
Yes, typically all 32 work elements (CUDA: threads) in a warp will work in parallel. Note that you will typically have multiple registers per work element.
Not guaranteed, AFAIK.
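Since a work-group spans multiple warps, sharing data across warps needs a block-level barrier. In CUDA terms (the OpenCL equivalent of the barrier below is barrier(CLK_LOCAL_MEM_FENCE)), a minimal sketch might look like this; the kernel and block size are assumptions of the example:

```cuda
// A 64-thread block spans two warps. Cross-warp data sharing through
// shared (OpenCL: local) memory requires __syncthreads(), because the
// two warps are only scheduled "more-or-less" in parallel and one may
// run ahead of the other.
__global__ void blockShare(int *out)  // launch with 64 threads per block
{
    __shared__ int tile[64];
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();                  // barrier: every warp in the block reaches here
    out[threadIdx.x] = tile[63 - threadIdx.x];  // read a value another warp wrote
}
```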