About warp voting functions - CUDA

The CUDA programming guide introduces the concept of warp vote functions: __all, __any and __ballot.
My question is: what applications would use these three functions?

The prototype of __ballot is the following
unsigned int __ballot(int predicate);
If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the thread's lane index within the warp.
Combined with atomicOr and __popc, it can be used to accumulate the number of threads in each warp having a true predicate.
Indeed, the prototype of atomicOr is
int atomicOr(int* address, int val);
and atomicOr reads the value pointed to by address, performs a bitwise OR with val, writes the result back to address, and returns the old value.
On the other hand, __popc returns the number of bits set within its 32-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
const u32 warp_num = threadIdx.x >> 5;   // index of this thread's warp within the block
atomicOr(&warp_shared_ballot[warp_num], __ballot(data[tid] > threshold));
atomicAdd(&block_shared_accumulate, __popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann
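Putting the pieces together, here is a minimal hedged sketch of the same warp-level counting idea (the kernel name, parameters, and the one-atomic-per-warp accumulation into a single global counter are illustrative assumptions; the book's snippet instead accumulates per-warp ballots in shared memory):

__global__ void count_over_threshold(const int *data, int threshold,
                                     unsigned int *count, int n)
{
    // __ballot is the pre-Volta intrinsic discussed above; on CC 7.0+ toolkits this
    // would be __ballot_sync(0xffffffff, pred).
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x & 31;             // lane index within the warp

    int pred = (tid < n) && (data[tid] > threshold);
    unsigned int ballot = __ballot(pred);          // bit N set iff lane N's predicate is true

    if (lane == 0)                                 // one atomic per warp, not one per thread
        atomicAdd(count, __popc(ballot));
}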

__ballot is used in CUDA histogram code and in the CUDA NPP library for the quick generation of bitmasks; combined with the __popc intrinsic, it makes a very efficient implementation of boolean reduction.
__all and __any were used in reduction before the introduction of __ballot, though I cannot think of any other use of them.
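For reference, the basic shape of a warp-uniform branch driven by __all looks like the following minimal sketch (the kernel, array names and the two arithmetic "paths" are hypothetical; on CC 7.0+ toolkits this would be __all_sync(0xffffffff, ...)):

__global__ void classify(const float *x, float *out, float threshold)
{
    // __all returns the same value to every lane of the warp, so either the whole warp
    // takes the first branch or the whole warp takes the second: no intra-warp divergence here.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the launch exactly covers the array
    if (__all(x[tid] > threshold))
        out[tid] = 2.0f * x[tid];   // stand-in for a cheaper warp-uniform fast path
    else
        out[tid] = 0.5f * x[tid];   // stand-in for the general path
}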

As an example of an algorithm that uses the __ballot API, I would mention the In-Kernel Stream Compaction by D. M. Hughes et al. __ballot is used in the prefix-sum part of the stream compaction to count (per warp) the number of elements that pass the predicate.
Here is the paper: In-Kernel Stream Compaction

CUDA provides several warp-wide broadcast and reduction operations that NVIDIA's architectures support efficiently. For example, the __ballot(predicate) instruction evaluates predicate for all active threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active [Reference: Flexible Software Profiling of GPU Architectures].

Related

CUDA _sync functions, how to handle unknown thread mask?

This question is about adapting to the change in semantics from lock-step execution to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for Volta?
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
For the second choice, this time quoting from __activemask() vs __ballot_sync():
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising the threads and then using __activemask():
__syncwarp();
__all_sync(__activemask(), predicate);
An NVIDIA blog post says that is also undefined: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: The correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use its result as the mask for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully and taking into account the previous usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function which is called from elsewhwere)
I think you have only 2 options:
grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct, however this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
Make the intent of the calling environment clear - add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function if you permit the possibility of warp divergence prior to entry to your function, because that divergence obscures the calling environment's intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.
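As a sketch of the first option (capturing __activemask() at function entry and reusing it later; the function name is illustrative):

__device__ int all_nonzero(int predicate)
{
    unsigned int entry_mask = __activemask();   // best guess at the calling environment's intent
    // ... function body; incidental divergence here is tolerated because __all_sync
    // will reconverge the lanes named in entry_mask ...
    // Legal provided every thread reported in entry_mask actually reaches this call,
    // i.e. there is no enforced divergence between entry and this point.
    return __all_sync(entry_mask, predicate);
}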

__activemask() vs __ballot_sync()

After reading this post on the CUDA Developer Blog, I am struggling to understand when it is safe/correct to use __activemask() in place of __ballot_sync().
In the section Active Mask Query, the authors wrote:
This is incorrect, as it would result in partial sums instead of a total sum.
and later, in the section Opportunistic Warp-level Programming, they use the function __activemask() because:
This may be difficult if you want to use warp-level programming inside a library function but you cannot change the function interface.
There is no __active_mask() in CUDA. That is a typo (in the blog article). It should be __activemask().
__activemask() is only a query. It asks the question "which threads in the warp are currently executing this instruction, in this cycle?" which is equivalent to asking "which threads in the warp are currently converged at this point?"
It has no effect on convergence. It will not cause threads to converge. It has no warp synchronizing behavior.
__ballot_sync() on the other hand has converging behavior (according to the supplied mask).
The primary differentiation here should be considered in light of the Volta warp execution model. Volta and beyond, because of hardware changes in the warp execution engine, can support threads in a warp being diverged in a larger number of scenarios, and for a longer time, than can previous architectures.
The divergence we are referring to here is incidental divergence due to previous conditional execution. Enforced divergence due to explicit coding is identical before or after Volta.
Let's consider an example:
if (threadIdx.x < 1){
    statement_A();
}
statement_B();
Assuming the threadblock X dimension is greater than 1, statement_A() is in an area of enforced divergence. The warp will be in a diverged state when statement_A() is executed.
What about statement_B() ? The CUDA execution model makes no particular statements about whether the warp will be in a diverged state or not when statement_B() is executed (see note 1 below). In a pre-Volta execution environment, programmers would typically expect that there is some sort of warp reconvergence at the closing curly-brace of the previous if statement (although CUDA makes no guarantees of that). Therefore the general expectation is that statement_B() would be executed in a non-diverged state.
However in the Volta execution model, not only are there no guarantees provided by CUDA, but in practice we may observe the warp to be in a diverged state at statement_B(). Divergence at statement_B() isn't required for code correctness (whereas it is required at statement_A()), nor is convergence at statement_B() required by the CUDA execution model. If there is divergence at statement_B() as may occur in the Volta execution model, I'm referring to this as incidental divergence. It is divergence that arises not out of some requirement of the code, but typically as a result of some kind of previous conditional execution behavior.
If we have no divergence at statement_B(), then these two expressions (if they were at statement_B()) should return the same result:
int mask = __activemask();
and
int mask = __ballot_sync(0xFFFFFFFF, 1);
So in the pre-Volta case, where we would typically not expect divergence at statement_B(), these two expressions in practice return the same value.
In the Volta execution model, we can have incidental divergence at statement_B(). Therefore these two expressions might not return the same result. Why?
The __ballot_sync() instruction, like all other CUDA 9+ warp-level intrinsics that have a mask parameter, has a synchronizing effect. If we have code-enforced divergence such that the synchronizing "request" indicated by the mask argument cannot be fulfilled (as would be the case above, where we are requesting full convergence), that represents illegal code.
However if we have incidental divergence (only, for this example), the __ballot_sync() semantics are to first reconverge the warp at least to the extent that the mask argument is requesting, then perform the requested ballot operation.
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
If you then created code that performed some warp level operation (such as a warp-level sum-reduction as suggested in the blog article) and selected the threads to participate based on __activemask() vs. __ballot_sync(0xFFFFFFFF, 1), you could conceivably get a different result, in the presence of incidental divergence. The __activemask() realization, in the presence of incidental divergence, would compute a result that did not include all threads (i.e. it would compute a "partial" sum). On the other hand, the __ballot_sync(0xFFFFFFFF, 1) realization, because it would first eliminate the incidental divergence, would force all threads to participate (computing a "total" sum).
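A hedged illustration of that point, using a simple lane count as a stand-in for the partial/total sum (assume this fragment sits at statement_B(), after the conditional code, and that no threads have exited; the function name is made up):

__device__ void compare_masks(int &partial_count, int &total_count)
{
    unsigned int m_active = __activemask();               // only lanes that happen to be converged here
    unsigned int m_ballot = __ballot_sync(0xFFFFFFFF, 1); // reconverges the warp first, then votes
    partial_count = __popc(m_active);   // may be < 32 under incidental divergence ("partial")
    total_count   = __popc(m_ballot);   // 32 in this scenario ("total")
}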
A similar example and description as what I have given here is given around listing 10 in the blog article.
An example of where it could be correct to use __activemask is given in the blog article on "opportunistic warp-level programming", here:
int mask = __match_all_sync(__activemask(), ptr, &pred);
This statement is saying "tell me which threads are converged" (i.e. the __activemask() request), and then "use (at least) those threads to perform the __match_all operation". This is perfectly legal and will use whatever threads happen to be converged at that point. As that listing 9 example continues, the mask computed in the above step is used in the only other warp-cooperative primitive:
res = __shfl_sync(mask, res, leader);
(which happens to be right after a piece of conditional code). This determines which threads are available, and then forces the use of those threads, regardless of what incidental divergence may have existed, to produce a predictable result.
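As a concrete illustration of that opportunistic pattern, here is a hedged sketch of a warp-aggregated atomic increment (loosely following the idea described above, not the blog's exact listing; the name is illustrative):

__device__ int atomicAggInc(int *ctr)
{
    unsigned int active = __activemask();       // lanes that happen to be converged here
    int lane   = threadIdx.x & 31;              // assumes a 1D block
    int leader = __ffs(active) - 1;             // lowest-numbered active lane is the leader
    int prev = 0;
    if (lane == leader)
        prev = atomicAdd(ctr, __popc(active));  // one atomic covers the whole converged group
    prev = __shfl_sync(active, prev, leader);   // broadcast the old counter value to the group
    return prev + __popc(active & ((1u << lane) - 1));  // unique offset for each lane
}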
As an additional clarification on the usage of the mask parameter, note the usage statements in the PTX guide. In particular, the mask parameter is not intended to be an exclusion method. If you wish threads to be excluded from the shuffle operation, you must use conditional code to do that. This is important in light of the following statement from the PTX guide:
The behavior of shfl.sync is undefined if the executing thread is not in the membermask.
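For instance, a hedged sketch of excluding the upper half-warp from a shuffle by using conditional code together with a mask that names exactly the participating lanes (values are illustrative; assumes the whole warp calls the function):

__device__ float lower_half_swap(float val)
{
    const unsigned int lower_mask = 0x0000FFFFu;   // lanes 0-15 participate
    unsigned int lane = threadIdx.x & 31;
    if (lane < 16)
        // every executing lane is named in lower_mask, and every lane in lower_mask executes
        val = __shfl_xor_sync(lower_mask, val, 8);
    return val;
}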
Furthermore, although not directly related to the above discussion, there is an "exception" to the forced divergence idea for __shfl_sync(). The programming guide states that this is acceptable on Volta and beyond:
if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}
The reason for this is hinted at there, and we can get a further explanation of behavior in this case from the PTX guide:
shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.
This means that the __shfl_sync() in the if-path and the __shfl_sync() in the else-path are effectively working together in this case to produce a defined result for all threads in the warp. A few caveats:
This statement applies to cc7.0 and higher
Other constructs are not necessarily going to work. For example this:
if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
}
will not provide interesting results for any thread in the warp.
This question/answer may also be of interest.
(note 1) This is perhaps a non-intuitive concept. The CUDA programming model allows for incidental divergence to occur or appear at any point in device code, for no reason at all. The blog article makes reference to this concept in several places, I will excerpt one here:
It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program, for instructions that are independent of each other.
However, do we have a guarantee that (dependent) memory operations, as seen from a single thread, are actually consistent? If I do, say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow serviced by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
The point here is that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N){
    x[threadIdx.x] = threadIdx.x;      // each thread writes its own index into x
    result[threadIdx.x] = 0;
    for(int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];   // then sums up whatever is currently in x
}
Here we have a dumb function which SHOULD fill a shared array (x) with the numbers from m ... n (read: from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect that the first time this code runs, x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
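As a hedged aside, a minimal sketch of how a barrier between the write phase and the read phase makes this deterministic within one block (the function name is made up; assumes blockDim.x == N and a single block):

__device__ void sum_all_sync(float *x, float *result, int N)
{
    x[threadIdx.x] = threadIdx.x;   // write phase: each thread fills its own slot
    __syncthreads();                // all of the block's writes are visible before anyone reads
    float acc = 0.0f;
    for (int i = 0; i < N; i++)
        acc += x[i];                // read phase: sum what the whole block wrote
    result[threadIdx.x] = acc;
}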
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally the same as those of a CPU, just with less control logic, a smaller instruction set and typically only single-precision support.
In a CUDA GPU there is one program counter for each warp (a group of 32 synchronous cores). Like on a CPU, the program counter advances by one instruction at a time, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in cases where the code is implemented "wrong" with respect to the compiler.
So, in short: your code will always be executed in the order it appears in memory, no matter whether it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equivalent functionality in machine code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understood, you're basically asking whether memory dependences and alias-analysis information are respected by the CUDA compiler.
The answer to that question is yes, assuming that the CUDA compiler is free of bugs, because, as Robert noted, the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even out of block scope) to avoid dangerous optimizations (e.g. you can't overwrite a live variable before its next read; the data may still be needed).
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it will do its best to be bug-free, the analyses that take place there should not bother you at all and should assure you that, at least in theory, what you just presented as an example (i.e. the dependent load completing faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is providing the compiler for you to use, and there are disclaimers for the exceptional cases where it doesn't hold :)
Also, aside from the compiler topic, instruction execution also depends on the hardware specification, in this case a SIMT hardware instruction-issuing unit; cf. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information.

Why are there no simple atomic increment, decrement operations in CUDA?

From the CUDA Programming guide:
unsigned int atomicInc(unsigned int* address, unsigned int val);
reads the 32-bit word old located at the address address in global or shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
This is nice and dandy. But where's
unsigned int atomicInc(unsigned int* address);
which simply increments the value at address and returns the old value? And
void atomicInc(unsigned int* address);
which simply increments the value at address and returns nothing?
Note: Of course I can 'roll my own' by wrapping the actual API call, but I would think the hardware has such simpler operations, which could be cheaper.
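For what it's worth, a hedged sketch of such wrappers built on atomicAdd (the names are made up):

__device__ unsigned int atomicIncSimple(unsigned int *address)
{
    return atomicAdd(address, 1u);   // increment by one, return the old value
}

__device__ void atomicIncDiscard(unsigned int *address)
{
    atomicAdd(address, 1u);          // increment by one, ignore the old value
}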
They haven't implemented the simple increase and decrease operations because the operations wouldn't have better performance. Each machine code instruction in current architectures takes the same amount of space, which is 64 bits. In other words, there's room in the instruction for a full 32-bit immediate value and since they have atomic instructions that support adding a full 32-bit value, they have already spent the transistors.
I think that dedicated inc and dec instructions on old processors are now just artifacts from a time when transistors were much more expensive and the instruction cache was small, thereby making it worth the effort to code instructions into as few bits as possible. My guess is that, on new CPUs, the inc and dec instructions are implemented in terms of a more general addition function internally and are mostly there for backwards compatibility.
You should be able to accomplish "simple" atomic increment using:
__global__ void mykernel(int *value){
    int my_old_val = atomicAdd(value, 1);
}
Likewise if you don't care about the return value, this is perfectly acceptable:
__global__ void mykernel(int *value){
    atomicAdd(value, 1);
}
You can do atomic decrement similarly with:
atomicSub(value, 1);
or even
atomicAdd(value, -1);
Here is the section on atomics in the programming guide, which covers support by compute capability for the various functions and data types.

CUDA finding the max value in a given array

I tried to develop a small CUDA program to find the max value in a given array,
int input_data[0...50] = 1,2,3,4,5....,50
max_value is initialized with the first value, input_data[0],
The final answer is stored in result[0].
The kernel is giving 0 as the max value. I don't know what the problem is.
I executed it with 1 block of 50 threads.
__device__ int lock = 0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;
    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();
    result[0] = max_value;  // Final result of max value
}
Even though there are built-in functions, I am just practicing on small problems.
You are trying to set up a "critical section", but this approach in CUDA can lead to a hang of your whole program; try to avoid it whenever possible.
Why does your code hang?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute synchronously. So the warp will stop in your do{} while(atomicCAS(&lock,0,1)) until all threads from your warp succeed in obtaining the lock. But obviously you want to prevent several threads from executing the critical section at the same time, so only one thread can hold the lock while the rest of its warp keeps spinning. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum # wikipedia
Parallel Reduction # CUDA website
NVIDIA's Guide to Reduction
Your code has a potential race. I'm not sure whether you defined the max_value variable in shared memory or not, but either way it is wrong.
1) If max_value is just a local variable, then each thread holds a local copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of the code, all threads write their own max_value to result[0], which results in undefined behavior.
2) If max_value is a shared variable, 49 threads will enter the if-statement's block, and they will try to update max_value one at a time using the lock. But the order of execution among those 49 threads is not defined, so some threads may overwrite the actual maximum value with smaller values. You would need to compare against the current maximum again inside the critical section.
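For reference, here is a minimal hedged sketch of the standard alternative for this particular 50-element, single-block case: a shared-memory tree reduction. The kernel name and the fixed block size of 64 are illustrative assumptions; it would be launched as max_kernel<<<1, 64>>>(input, output, 50).

#include <cfloat>   // for FLT_MAX

__global__ void max_kernel(const float *input_data, float *result, int n)
{
    __shared__ float s[64];                            // matches the assumed blockDim.x of 64
    int tid = threadIdx.x;
    s[tid] = (tid < n) ? input_data[tid] : -FLT_MAX;   // pad unused slots with a very small value
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);   // pairwise max
        __syncthreads();
    }

    if (tid == 0)
        result[0] = s[0];                              // the block-wide maximum
}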
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
There are different access scopes for variables. When you define a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid; __shared__ places the variable in the block's shared memory, and it is accessible only by the threads of that block; finally, if you don't use any such keyword, as in float max_value, the variable is placed in thread registers and can be accessed only within that thread. In your code, each thread has its own local variable max_value and does not see the corresponding variables in other threads.