Branch divergence in GPGPU - CUDA

__global__ void function(){
    ... some code ...
    if (0 < A && A < 512){
        ... a few lines of code ...
    }
    else if (512 <= A){
        ... a lot of code ...
    }
    ... some code ...
}
int main(){
    ... some code for GPU ...
    for (int i = 0; i < 1024; i++){
        A[i] = i;
    }
    ... some code for GPU ...
}
Divergence in GPGPU
I'm studying GPGPU, and I became curious about branch divergence.
I know that when the GPU handles branch divergence it uses a SIMT stack: the selected threads in a warp execute the instruction and the others do not. But if no thread in a warp is selected, are those instructions still executed, or not?
For example, in the above code, half of the threads will take the if branch and the other half will take the else if branch. If all values of A for the threads in a given warp are less than 512, will the instructions of the else if branch be executed, or just skipped?

For example, in the above code, half of the threads will take the if branch and the other half will take the else if branch. If all values of A for the threads in a given warp are less than 512, will the instructions of the else if branch be executed, or just skipped?
Just skip.
If all threads in the same warp have an instruction predicated off, the instruction will not be issued, and will not consume execution resources for that warp.
That is the reason for the statement/suggestion on slide 53 here:
Avoid diverging within a warp
– Example with divergence:
• if (threadIdx.x > 2) {...} else {...}
• Branch granularity < warp size
– Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) {...} else {...}
• Branch granularity is a whole multiple of warp size
Effectively there is "no divergence" if the conditional boundaries align with the warp boundaries. This is considered an "optimization".
To wit, from the same slide as above:
• Different warps can execute different code with no impact on performance
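As a hedged illustration of the slide's two cases (the kernels, WARP_SIZE, and the output array are assumptions added here, not taken from the slide):
#define WARP_SIZE 32

// Divergent: threads 0-2 and 3-31 of the same warp take different paths, so the
// warp issues both paths, with the non-selected threads masked off each time.
__global__ void diverging(float *out){
    if (threadIdx.x > 2) { out[threadIdx.x] = 1.0f; } else { out[threadIdx.x] = 2.0f; }
}

// Non-divergent: the condition is uniform within each warp, so every warp issues
// only the path its threads take and the other path is skipped for that warp.
__global__ void nonDiverging(float *out){
    if (threadIdx.x / WARP_SIZE > 2) { out[threadIdx.x] = 1.0f; } else { out[threadIdx.x] = 2.0f; }
}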


When can threads of a warp get scheduled independently on Volta+?

Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures.
From my understanding, this implies that if there is no divergence within the threads of a warp (i.e. all threads of a warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate it if someone could clarify this confusion.
From my understanding, this implies that if there is no divergence within the threads of a warp (i.e. all threads of a warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate it if someone could clarify this confusion.
You are asking:
Why is __syncwarp() needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp
A memory fence is needed in this case, to prevent the compiler from "optimizing" shared memory locations into registers.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken. Regardless of the way it behaves or what results it produces.
The volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
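As a hedged side note (not part of the answer above), warp shuffle intrinsics such as __shfl_down_sync are a common alternative for this kind of warp reduction: they exchange values through registers rather than shared memory, so the visibility question between steps does not arise. A minimal sketch:
// Sketch only: reduce a value across the 32 lanes of a full warp with shuffles.
__device__ int warp_reduce_sum(int v){
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);  // all 32 lanes participate
    return v;  // lane 0 ends up holding the warp's sum
}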

Where does the shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works when blocks use a lot of it.
So my GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same limit per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of shared memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program nevertheless terminates, whether I run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock and swapped out for block 2? If yes, where does the 48 KB of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (int i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = 1;
    while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
    printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);
uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"
So my GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same limit per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of shared memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on a SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore in the normal CUDA scheduling model, the only way a block can be non-resident is if it has never been scheduled yet on a SM. In that case, it uses no resources, while it is waiting in the queue.
The exceptions to this would be in the case of CUDA preemption. This mechanism is not well documented, but would occur for example at the point of a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However preemption is not applicable in the case where we are analyzing the behavior of a single kernel launch.
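As a hedged side note (not part of the original answer), one way to check this premise is to ask the runtime how many blocks of this kernel can be resident on one SM; with a statically allocated 48000-byte shared array on an SM with 48 KB of shared memory, the expected answer is 1:
// Sketch only: query theoretical residency for testKernel with 1 thread per block
// and no dynamic shared memory.
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, testKernel, 1, 0);
printf("resident blocks of testKernel per SM: %d\n", maxBlocksPerSM);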
You haven't provided a complete code example. However, for the n=2 case, your claim that these two blocks will somehow be deposited on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is or intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct (which has some warp-synchronous implications), but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000

__device__ volatile int test2 = 0;   // counter that every thread hammers with atomicAdd
__device__ int test3 = 32767;        // smallest jump observed in the counter

__global__ void kernel(){
  for (int i = 0; i < LOOPS; i++){
    unsigned long long time = clock64();
    // while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;
    atomicAdd((int *)&test2, 1);
    int end = test2;
    // idea: a warp's 32 atomicAdds land between one thread's two reads,
    // so diff is at least the warp size
    int diff = end - start;
    atomicMin(&test3, diff);
  }
}

int main() {
  kernel<<<1, 1024>>>();
  int result;
  cudaMemcpyFromSymbol(&result, test3, sizeof(int));
  printf("result = %d threads\n", result);
  return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with the maximum threads per block, read the special registers %smid and %warpid plus blockIdx, and write the values to memory. Group the data by those three variables; the size of the largest group is the warp size. This is even easier if you limit the launch to a single block, in which case you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue the first instruction of a warp on the same cycle
So on CC 1.0 - 3.5 devices (this may change in the future), threads in the block that read the same %clock value belong to the same warp.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with an increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
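A hedged sketch of SOLUTION 1 (the kernel and buffer names are assumptions added here): read %smid and %warpid through inline PTX and record them per thread; threads sharing the same (smid, warpid, blockIdx) triple belong to the same warp, so the size of the largest group observed is the warp size.
__global__ void record_ids(unsigned *smid, unsigned *warpid){
    unsigned s, w;
    asm volatile("mov.u32 %0, %%smid;"   : "=r"(s));   // which SM this thread runs on
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(w));   // warp slot within that SM
    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    smid[gid]   = s;
    warpid[gid] = w;
}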

CUDA atomics change flag

I have a piece of serial code which does something like this
if( ! variable )
{
    // do some initialization here
    variable = true;
}
I understand that this works perfectly fine in serial and will only be executed once. What atomics operation would be the correct one here in CUDA?
It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.
A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.
By themselves, atomics can only perform a single, very limited operation on a single variable. But atomics can be used to build a critical section.
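As a hedged aside (not part of the answer), if the only requirement were that the initialization body runs exactly once, and no other thread had to wait for it to finish, a single atomicCAS on the flag would be enough; the names below are hypothetical:
__device__ int initialized = 0;   // hypothetical flag: 0 = not yet done, 1 = claimed

__global__ void init_once(){
    if (atomicCAS(&initialized, 0, 1) == 0){
        // exactly one thread in the grid sees the 0 -> 1 transition and runs this
        // ... do some initialization here ...
    }
    // other threads fall through immediately and are NOT guaranteed to see the
    // results of the initialization, which is why a critical section is used below
}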
You should use the following code in your kernel to control thread access to a critical section:
__syncthreads();
if (threadIdx.x == 0)
    acquire_semaphore(&sem);
__syncthreads();
// begin critical section
// ... your critical section code goes here
// end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
    release_semaphore(&sem);
__syncthreads();
Prior to the kernel define these helper functions and device variable:
__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
    while (atomicCAS((int *)lock, 0, 1) != 0);
}

__device__ void release_semaphore(volatile int *lock){
    *lock = 0;
    __threadfence();
}
I have tested the above code and used it successfully. Note that it essentially arbitrates between threadblocks, using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical-section code if you want only one thread in the winning threadblock to execute it.
Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)
Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.
Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.
Here is an example of using binary_semaphore to implement a single device-wide global "lock" that could be used for access control to a critical section.
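For completeness, here is a minimal self-contained sketch assembling the pieces above; the protected counter result, the volatile access to it, and the launch configuration are assumptions added for illustration only:
#include <cstdio>

__device__ volatile int sem = 0;
__device__ int result = 0;           // hypothetical data protected by the critical section

__device__ void acquire_semaphore(volatile int *lock){
    while (atomicCAS((int *)lock, 0, 1) != 0);
}

__device__ void release_semaphore(volatile int *lock){
    *lock = 0;
    __threadfence();
}

__global__ void kernel(){
    __syncthreads();
    if (threadIdx.x == 0)
        acquire_semaphore(&sem);
    __syncthreads();
    // begin critical section: one block at a time, and within it only thread 0
    if (threadIdx.x == 0){
        volatile int *r = &result;   // volatile access so each block re-reads global memory
        *r = *r + 1;                 // deliberately non-atomic update, protected by the lock
    }
    // end critical section
    __threadfence();
    __syncthreads();
    if (threadIdx.x == 0)
        release_semaphore(&sem);
    __syncthreads();
}

int main(){
    kernel<<<64, 128>>>();
    int r = 0;
    cudaMemcpyFromSymbol(&r, result, sizeof(int));
    printf("result = %d (expected 64)\n", r);
    return 0;
}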

conditional syncthreads & deadlock (or not)

A follow-up question to: EarlyExit and DroppedThreads
According to the above links, the code below should deadlock.
Please explain why it does NOT deadlock. (CUDA 5 on a Fermi)
__device__ int add[144];
__device__ int result;

// launched as: add_kernel<<<1,96>>>();

__global__ void add_kernel() {
    for (int idx = 72 >> 1; idx > 0; idx >>= 1) {
        if (threadIdx.x < idx)
            add[threadIdx.x] += add[threadIdx.x + idx];
        else
            return;
        __syncthreads();
    }
    if (threadIdx.x == 0)
        result = add[0];
}
This is technically an ill-defined program.
Most, but not all (for example G80 does not), NVIDIA GPUs support early exit in this way because the hardware maintains an active thread count for each block, and this count is used for barrier synchronization rather than the initial thread count for the block.
Therefore, when the __syncthreads() in your code is reached, the hardware will not wait on any threads that have already returned, and the program runs without deadlock.
A more common use of this style is:
__global__ void foo(int n, ...) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    ... // do some computation with remaining threads
}
Important note: barrier counts are maintained per-warp (see here), not per-thread. So you may have the case where, say, only a few (or zero) threads of a warp return early; the warp's contribution to the barrier count is not removed. However, as long as at least one thread from each warp reaches the barrier, it will not deadlock.
So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work.
Edit: for your specific case.
Iteration idx==36: 2 active warps, so the barrier exit count is 64. All threads from warp 0 reach the barrier, incrementing the count from 0 to 32. 4 threads from warp 1 reach the barrier, incrementing the count from 32 to 64, and warps 0 and 1 are released from the barrier. Read the link above to understand why this happens.
Iteration idx==18: 1 active warp, so the barrier exit count is 32. 18 threads from warp 0 reach the barrier, incrementing the count from 0 to 32. The barrier is satisfied and warp 0 is released.
Etc...
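A hedged sketch of the well-defined way to write the same reduction: keep every thread in the loop so __syncthreads() is reached unconditionally (names follow the question's code; the kernel name is an assumption):
__global__ void add_kernel_safe() {
    for (int idx = 72 >> 1; idx > 0; idx >>= 1) {
        if (threadIdx.x < idx)
            add[threadIdx.x] += add[threadIdx.x + idx];
        __syncthreads();   // executed by all 96 threads in every iteration
    }
    if (threadIdx.x == 0)
        result = add[0];
}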