CUDA Reduction - atomic vs single thread summation

I've recently been testing the reduction algorithm in CUDA (the one you can find, for example, at http://www.cuvilib.com/Reduction.pdf, page 16), but at the end of it I ran into trouble when not using atomics. Basically, I compute the sum of each block and store it in a shared array, then copy it back to the global array x (tdx is threadIdx.x, and i is the global index).
if (i == 0) {
    *sum = 0.; // initialize the result to 0
}
__syncthreads();
if (tdx == 0) {
    x[blockIdx.x] = s_x[tdx]; // copy each block's shared sum to global memory
}
__syncthreads();
Then I want to sum the first x elements (as many as there are blocks).
With atomics it works fine (same result as the CPU), but when I use the commented line below instead, it does not work and often yields "nan":
if (i == 0) {
    for (int k = 0; k < gridDim.x; k++) {
        atomicAdd(sum, x[k]); // works
        //sum[0] += x[k]; // or *sum += x[k]; does not work, often results in nan
    }
}
In fact I now use atomicAdd directly to accumulate the shared sums, but I would like to understand why the plain version does not work. An atomic add seems pointless when the operation is restricted to a single thread, so the simple sum should work fine!

__syncthreads() only synchronizes threads within the same block, not across different blocks, and CUDA has no safe synchronization mechanism across blocks within a single kernel launch.
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have actually finished their computations.
You should put the second code snippet in a separate kernel; the launch boundary between the two kernels enforces the synchronization, and the line sum[0] += x[k]; then works.
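A minimal sketch of that two-kernel structure, reusing the question's names (the float type, the kernel names, and the launch shapes are assumptions about the surrounding code):
__global__ void finalSum(float *x, float *sum, int numBlocks)
{
    // Launched after the reduction kernel has completed: the kernel-launch
    // boundary guarantees every block's partial sum is already in x.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float s = 0.f;
        for (int k = 0; k < numBlocks; k++)
            s += x[k];   // plain, non-atomic addition is now safe
        *sum = s;
    }
}
// host side:
// reduceKernel<<<numBlocks, threadsPerBlock>>>(...);
// finalSum<<<1, 1>>>(x, sum, numBlocks);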

As has been pointed out, your problem is due to missing synchronisation after the first pass since you cannot synchronise between blocks. There is a good walkthrough on reduction in the sample codes provided with the toolkit.
Having said that, I would strongly recommend that people don't write reduction kernels (or other primitives such as scan) where such primitives exist in library code. Much better to invest your effort elsewhere and reuse existing optimised code where it is available. This doesn't apply if you're doing this to learn of course!
I recommend you take a look at Thrust and CUB.
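For example, the whole reduction collapses to one call with Thrust. A self-contained sketch, assuming the data already lives on the device in a thrust::device_vector:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int main()
{
    int n = 1 << 20;                           // example size
    thrust::device_vector<float> d_x(n, 1.0f); // the values to be summed, on the device
    float sum = thrust::reduce(d_x.begin(), d_x.end(), 0.0f, thrust::plus<float>());
    return 0;
}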

In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute, e.g., __shfl_sync or <cg>.shfl)?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0, 32)>;
// Point being, some threads in the warp get removed by i < stop.
for (int i = k; i < stop; i += skip)
{
    // Point being, potentially more threads don't see the shuffle below.
    if (mem[threadIdx.x + i/2] == foo)
    {
        // Pre CUDA 9.0.
        f = __shfl(f, warpLeader);
    }
}
Maybe that's not the best example (the real code is too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
    unsigned mask = __activemask();
    val = input[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
}
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
1. Don't just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
2. Don't just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
3. Do analyze the program logic and understand the membership requirements. Compute the mask ahead of time based on your program logic.
4. If your program does opportunistic warp-synchronous programming, use "detective" functions such as __activemask() and __match_all_sync() to find the right mask.
5. Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in an "opportunistic" setting) then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.
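For that opportunistic case, a sketch in the spirit of the blog's listing 9 (my sketch, not the question's code; the leader here is elected from whoever happens to be converged, rather than using the question's precomputed warpLeader):
unsigned mask = __activemask();     // the threads that happen to be converged here
int leader = __ffs(mask) - 1;       // elect the lowest-numbered active lane as leader
f = __shfl_sync(mask, f, leader);   // broadcast the leader's f to all active lanes
If you instead need the value from a specific warpLeader that may not be in the active set, no opportunistic mask can help; you must derive a membership mask from the program logic, per steps 1-3 above.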

how to sync threads in this cuda example

I have the following rough code outline:
run a loop, millions of times
in that loop, compute values 'I's - see example of such functions below
After all 'I's have been computed, compute other values 'V's
repeat the loop
Each computation of an I or V could involve up to 20ish mathematical operations, (e.g. I1 = A + B/C * D + 1/exp(V1) - E + F + V2 etc).
There are roughly:
50 'I's
10 'V's
10 values in each I and V, i.e. they are vectors of length 10
At first I tried running a simple loop in C with a kernel call for each time step, but this was really slow. It seems I can get the code to run faster if the main loop is in a kernel that calls other kernels. However, I'm worried about kernel-call overhead (maybe I shouldn't be), so I came up with something like the following, where each I and V loops independently, with syncing between the kernels as necessary.
For reference, the variables below are hardcoded as __device__ values, but eventually I will pass some values into specific kernels to make the system interesting.
__global__ void compute_IL1()
{
    int id = threadIdx.x;
    //n_t = 1e6;
    for (int i = 0; i < n_t; i++) {
        IL1[id] = gl_1*(V1[id] - El_1);
        //atomic, sync, event????
    }
}

__global__ void compute_IK1()
{
    int id = threadIdx.x;
    for (int i = 0; i < n_t; i++) {
        Ik1[id] = gk_1*powf(0.75*(1 - H1[id]), 4)*(V1[id] - Ek_1);
        //atomic, sync, event?
    }
}

__global__ void compute_V1()
{
    int id = threadIdx.x;
    for (int i = 0; i < n_t; i++) {
        //wait for IL1 and Ik1 and others, but how????
        V1[id] = Ik1[id] + IL1[id] + ....
        //trigger the I's again
    }
}

//main function
compute_IL1<<<1,10,0,s0>>>();
compute_IK1<<<1,10,0,s1>>>();
//repeat this for 50 - 70 more kernels (Is and Vs)
So the question is, how would I sync these kernels? Is an event approach best? Is there a better paradigm to use here?
There is no sane mechanism I can think of to have multiple resident kernels synchronize without resorting to hacky atomic tricks, which may well not work reliably.
If you are running blocks with 10 threads and these kernels cannot execute concurrently for correctness reasons, you are (in the best possible case) using 1/64 of the computational capacity of your device. This problem, as you have described it, sounds completely ill-suited to a GPU.
So, I tried a couple of approaches.
A loop with a few kernel calls, where the last kernel call is dependent on the previous ones. This can be done with cudaStreamWaitEvent which can wait for multiple events. I found this on: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/ . Unfortunately, the kernel calls were too expensive.
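For reference, one iteration of that approach might look like this (a sketch; the _step kernel variants, the s2 stream, and the event names are made up, since the kernels above loop internally rather than per step):
cudaEvent_t IL1done, IK1done;
cudaEventCreateWithFlags(&IL1done, cudaEventDisableTiming);
cudaEventCreateWithFlags(&IK1done, cudaEventDisableTiming);

compute_IL1_step<<<1, 10, 0, s0>>>();   // hypothetical per-time-step variants
cudaEventRecord(IL1done, s0);
compute_IK1_step<<<1, 10, 0, s1>>>();
cudaEventRecord(IK1done, s1);

cudaStreamWaitEvent(s2, IL1done, 0);    // s2 will not start V1 until both
cudaStreamWaitEvent(s2, IK1done, 0);    // partial results are ready
compute_V1_step<<<1, 10, 0, s2>>>();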
Global variables between concurrent streams. The logic was pretty simple, having one thread pause until a global variable equaled the loop variable, indicating that all threads could proceed. This was then followed by a sync-threads call. Unfortunately, this did not work well.
Ultimately, I think I've settled on a nested loop, where the outer loop represents time and the inner loop indicates which of a set of instructions to run, based on dependencies. I also launched the maximum number of threads per block (1024) and broke the vectors that needed to be processed up into warps. The rough pseudocode is:
run_main<<<1,1024>>>();

__global__ void run_main(){
    int warp = threadIdx.x/32;
    int id = threadIdx.x - warp*32;
    if (id < 10){
        for (int i = 0; i < n_t; i++){
            for (int j = 0; j < n_j; j++){
                switch (j){
                case 0:
                    switch (warp){
                    case 0:
                        I1[id] = a + b + c*d ...
                        break;
                    case 1:
                        I2[id] = f*g/h ...
                        break;
                    }
                    break;
                //These things depend on case 0 OR
                //we've run out of space in the first pass
                //32 cases max [0 ... 31]
                case 1:
                    switch (warp){
                    case 0:
                        V1[id] = I1*I2 + ...
                        break;
                    case 1:
                        V2[id] = ...
                        break;
                    }
                    break;
                }
                //syncs across the block
                __syncthreads();
            }
        }
    }
}
This design is based on my impression that each set of 32 threads runs independently but should run the same code, otherwise things can slow down significantly.
So in the end, I'm running roughly 32*10 instructions simultaneously,
where 32 is the number of warps (which depends on how many different values I can compute at the same time, given the dependencies) and 10 is the number of elements in each vector. This is slowed down by any imbalance in the number of computations in each warp case, since all warps need to merge before moving on to the next step (due to the __syncthreads call). I'm running different parameters (a parameter sweep) on top of this, so I could potentially run 3 at a time in the block, multiplied by the number of streaming multiprocessors on the card.
One thing I need to change is that I'm currently testing on a video card that also drives a monitor. Apparently Windows will kill a kernel that runs for more than 5 seconds, so I need to call the kernel in chunked time steps, e.g. once every 1e5 time steps (in my case).
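A sketch of that chunking (the start/count parameters are hypothetical; run_main would need to be modified to accept them):
const int CHUNK = 100000;   // ~1e5 time steps per launch (assumed)
for (int t0 = 0; t0 < n_t; t0 += CHUNK) {
    int steps = (n_t - t0 < CHUNK) ? (n_t - t0) : CHUNK;
    run_main<<<1, 1024>>>(t0, steps);   // hypothetical chunked signature
    cudaDeviceSynchronize();            // each launch stays under the watchdog limit
}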

Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel; in this loop a certain predicate is evaluated on the current state, and depending on the result of that predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very, very low.
I would like to split the kernel into many little kernels, but every block is totally independent from the others, and I (think I) can't use a single thread in the host code to launch multiple small kernels.
I am not sure whether streams are adequate for this kind of work; I have never used them. But since I have the option of using dynamic parallelism, I would like to know whether that is a good way to implement this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data to global memory to make it available to a sub-kernel?
If I split my big kernel into many little ones, and leave the first kernel with a main loop which calls the required kernel when necessary (which allows me to move the temporary variables into each sub-kernel), will that help me increase occupancy?
I know it is a bit of a generic question, but I do not know this technology and I would like to know whether it fits my case, or whether streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;

    int tid = ...;
    systemState[tid] = initialData[tid];

    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;
        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;
        sample(sampleData, systemState);
        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }
    do_some_stats();
}
This is to show you that there is a main loop, that there are temporary data which are used in some places and not in others, that there are shared data, synchronization points, etc.
Threads are used to compute vector data, while there is, ideally, one single loop in each block (well, of course it is not literally true, but logically it is)... one "big flow" for each block.
Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host, I guess... but how do I coordinate all the blocks from a single loop? This is what leaves me most dubious. May I use streams from different host threads (one thread per block)?
I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure whether I would gain any advantage here.
I have benefitted from dynamic parallelism for solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;

for (int m = 0; m < (2*K+1); m++) {
    PP1 = calculate_PP1(i, m);
    phi_cap1 = calculate_phi_cap1(i, m);
    for (int n = 0; n < (2*K+1); n++) {
        PP2 = calculate_PP2(i, n);
        phi_cap2 = calculate_phi_cap2(i, n);
        atomicAdd(&result[PP1][PP2], data[i]*phi_cap1*phi_cap2);
    }
}
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them across a (2K+1)x(2K+1) child kernel.
From my (possibly incomplete) experience, dynamic parallelism will help if you have a small number of independent iterations. For a larger number of iterations, you could end up calling the child kernel several times, so you should check whether the kernel-launch overhead becomes the limiting factor.
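For orientation, the mechanics of a device-side launch look like this (a generic sketch, not the interpolation code above; it needs relocatable device code, nvcc -rdc=true, and note that device-side cudaDeviceSynchronize() is deprecated in recent CUDA versions):
__global__ void childKernel(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

__global__ void parentKernel(float *data, int n)
{
    // One thread issues the child launch; the launch itself has overhead,
    // so it only pays off if the child does enough work.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}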

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string-comparison jobs on CUDA, and I wonder how a __global__ function can return a value when it finds the exact string that I'm looking for.
I mean, I need the __global__ function, which contains a great number of threads, to find a certain string among a big, big string pool simultaneously, and I hope that once the exact string is caught, the __global__ function can stop all the threads and return to the main function, telling me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global__ void kernel(volatile bool *found, ...)
{
    while (!(*found) && workLeftToDo()) {
        bool iFoundIt = do_some_work(...); // see notes below
        if (iFoundIt) *found = true;
    }
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found (which must be a device pointer) to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly; a sketch of computing that resident-block count follows these notes.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
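A sketch of how that "maximal launch" block count might be computed with the occupancy API available in newer toolkits (blockSize is whatever block size you launch with):
int device, numSMs, blocksPerSM;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

// Maximum blocks of `kernel` that can be resident on one SM at this block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);

int maxResidentBlocks = blocksPerSM * numSMs;   // launch at most this many blocks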
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global__ void kernel(volatile bool *found, ...)
{
    volatile __shared__ bool someoneFoundIt;

    // initialize shared status
    if (threadIdx.x == 0) someoneFoundIt = *found;
    __syncthreads();

    while (!someoneFoundIt && workLeftToDo()) {
        bool iFoundIt = do_some_work(...);

        // if I found it, tell everyone they can exit
        if (iFoundIt) { someoneFoundIt = true; *found = true; }

        // if someone in another block found it, tell
        // everyone in my block they can exit
        if (threadIdx.x == 0 && *found) someoneFoundIt = true;

        __syncthreads();
    }
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
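For example, a minimal host-side version of that contrived return flag might look like this (a sketch; the names are made up):
bool h_found = false;
bool *d_found;
cudaMalloc(&d_found, sizeof(bool));
cudaMemcpy(d_found, &h_found, sizeof(bool), cudaMemcpyHostToDevice); // initialize to false

kernel<<<numBlocks, blockSize>>>(d_found /*, ... */);

cudaMemcpy(&h_found, d_found, sizeof(bool), cudaMemcpyDeviceToHost);
if (h_found) { /* "he did it!" */ }
cudaFree(d_found);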
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but it is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows you to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The __global__ function doesn't really contain a great number of threads the way you think it does. It is simply a kernel, a function that runs on the device, that is called by passing parameters that specify the thread model. The model that CUDA employs is a 2D grid of blocks and then a 3D thread model inside each block of the grid.
With the type of problem you have, it is not really necessary to use anything besides a 1D grid with 1D blocks of threads, because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication).
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudaMalloc and cudaMemcpy the data to the device before this code
dim3 dimGrid(10, 1);    // 1D grid with 10 blocks
dim3 dimBlocks(10, 1);  // 1D blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(strings, stringToMatch, answerIdx);
//cudaMemcpy answerIdx back to an integer on the host

//kernel (not positive on these types as my CUDA is very rusty)
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
    int idx = blockIdx.x * 10 + threadIdx.x;

    //Obviously use whatever function you've been using for string comparison
    //I'm just using == for example's sake
    if (strings[idx] == stringToMatch)
    {
        *answerIdx = idx;
    }
}
This is obviously not the most efficient approach, and most likely not the exact way to pass parameters and work with memory in CUDA, but I hope it gets the point across: split the workload, and understand that __global__ functions get executed on many different cores, so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed-up you will get by just dividing the workload onto the device (in a sensible fashion, of course) will already give you a tremendous performance improvement. To get a sense of the thread model, I highly recommend reading up on the documentation on Nvidia's site for CUDA. It will help tremendously and teach you the best way to set up grids and blocks for optimal performance.

Coding a CUDA Kernel that has many threads writing to the same index?

I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain it a bit more clearly with the variables.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;

    if (index_in < cLength)
    {
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}
First off, the number of connections in the network is cLength. For every connection, there is a source neuron and a target neuron, as well as a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index and the target neuron index of connection i. weightArray contains the weight information (so index i of weightArray corresponds to connection i).
As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for all the incoming connections to a neuron, sum all the weights. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
The problem is that it's wrong. I've done this serially, and the answers differ, usually by about 12-15x (so the right answer will be 700.0 and what I'm getting is something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.
+= is not atomic, and therefore not thread-safe. Use atomicAdd.
You should also avoid writing to the same memory cell: the problem is that these calls will be serialized, with threads standing in line waiting for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.
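Applied to the kernel in the question, that means replacing the += line with something like this (a direct substitution; fabsf is the single-precision variant, and the fence calls become unnecessary):
// each conflicting update is now performed atomically
atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabsf(weightArray[index_in]));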
You need to do a reduction.
Sum the elements assigned to each thread and place the result in an array cache[threadsPerBlock], then call __syncthreads().
Now reduce the resulting subtotals by adding successive neighboring subtotals:
int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA BY Example" (which is where the code fragment comes from).
There is one big caveat with this approach: the additions will not occur in the same order they would with serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.
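A quick illustration of that ordering effect (plain host C, but the same applies on the GPU):
#include <stdio.h>

int main(void)
{
    float a = 1e8f, b = 1.0f;
    printf("%f\n", (a + b) - a);   // 0.000000: b vanishes when added to the large value first
    printf("%f\n", (a - a) + b);   // 1.000000: a different order gives a different result
    return 0;
}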