Is the nvidia kepler shuffle "destructive"? - cuda

I'm using the implementation of the parallel reduction on CUDA using new kepler's shuffle instructions, similar to this:
http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
I was searching for the minima of rows in a given matrix, and in the end of the kernel I had the following code:
my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));
My blocks are 16*16, so everything worked fine, with that code I was getting minima in two sub-rows in the very same kernel.
Now I also need to return the indices of the smallest elements in every row of my matrix, so I was going to replace "min" with the "if" statement and handle these indices in a similar fashion, I got stuck at this code:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};
No cudaErrors whatsoever, but kernel returns trash now. Nevertheless I have fix for that:
myreg_tmp = __shfl_down(myreg,8,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
So, allocating new tmp variable to sneak into neighboring registers saves everything for me.
Now the question: Are the kepler shuffle instructions destructive ? in a sense that invoking same instruction twice doesn't issue the same result. I haven't assigned anything to those registers saying "my_reg > __shfl_down(my_reg,8,16)" - this adds up to my confusion. Can anyone explain me what is the problem with invoking shuffle twice? I'm pretty much a newbie in CUDA, so detailed explanation for dummies is welcomed

warp shuffle is not destructive. The operation, if repeated under the exact same conditions, will return the same result each time. The var value (myreg in your example) does not get modified by the warp shuffle function itself.
The problem you are experiencing is due to the fact that the number of participating threads on the second invocation of __shfl_down() in your first method is different than the other invocations, in either method.
First, let's remind ourselves of a key point in the documentation:
Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.
Now let's take a look at your first "broken" method:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
The first time you call __shfl_down() above (within the if-clause), all threads are participating. Therefore all values returned by __shfl_down() will be what you expect. However, once the if clause is complete, only threads that satisfied the if-clause will participate in the body of the if-statement. Therefore, on the second invocation of __shfl_down() within the if-statement body, only threads for which their my_reg value was greater than the my_reg value of the thread 8 lanes above them will participate. This means that some of these assignment statements probably will not return the value you expect, because the other thread may not be participating. (The participation of the thread 8 lanes above would be dependent on the result of the if comparison done by that thread, which may or may not be true.)
The second method you propose has no such issue, and works correctly according to your statements. All threads participate in each invocation of __shfl_down().

Related

In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute e.g., __shfl_sync or <cg>.shfl?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;
// Point being, some threads in the warp get removed by i < stop
for(int i = k; i < stop; i+=skip)
{
// Point being, potentially more threads don't see the shuffle below.
if(mem[threadIdx.x + i/2] == foo)
{
// Pre CUDA 9.0.
f = __shfl(f, warpLeader);
}
}
Maybe that's not the best example (real code too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
unsigned mask = __activemask();
val = input[threadIdx.x];
for (int offset = 16; offset > 0; offset /= 2)
val += __shfl_down_sync(mask, val, offset);
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in a "opportunistic" setting) then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.

Is there a performance penalty for CUDA method not running in sync?

If i have a kernel which looks back the last Xmins and calculates the average of all the values in a float[], would i experience a performance drop if all the threads are not executing the same line of code at the same time?
eg:
Say # x=1500, there are 500 data points spanning the last 2hr period.
# x = 1510, there are 300 data points spanning the last 2hr period.
the thread at x = 1500 will have to look back 500 places yet the thread at x = 1510 only looks back 300, so the later thread will move onto the next position before the 1st thread is finished.
Is this typically an issue?
EDIT: Example code. Sorry but its in C# as i was planning to use CUDAfy.net. Hopefully it provides a rough idea of the type of programming structures i need to run (Actual code is more complicated but similar structure). Any comments regarding whether this is suitable for a GPU / coprocessor or just a CPU would be appreciated.
public void PopulateMeanArray(float[] data)
{
float lookFwdDistance = 108000000000f;
float lookBkDistance = 12000000000f;
int counter = thread.blockIdx.x * 1000; //Ensures unique position in data is written to (assuming i have less than 1000 entries).
float numberOfTicksInLookBack = 0;
float sum = 0; //Stores the sum of difference between two time ticks during x min look back.
//Note:Time difference between each time tick is not consistent, therefore different value of numberOfTicksInLookBack at each position.
//Thread 1 could be working here.
for (float tickPosition = SDS.tick[thread.blockIdx.x]; SDS.tick[tickPosition] < SDS.tick[(tickPosition + lookFwdDistance)]; tickPosition++)
{
sum = 0;
numberOfTicksInLookBack = 0;
//Thread 2 could be working here. Is this warp divergence?
for(float pastPosition = tickPosition - 1; SDS.tick[pastPosition] > (SDS.tick[tickPosition - lookBkDistance]); pastPosition--)
{
sum += SDS.tick[pastPosition] - SDS.tick[pastPosition + 1];
numberOfTicksInLookBack++;
}
data[counter] = sum/numberOfTicksInLookBack;
counter++;
}
}
CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.
In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
Your algorithm is very simple and, as it stands, the algorithm does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way though.
A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.

Synchronization in CUDA

I read cuda reference manual for about synchronization in cuda but i don't know it clearly. for example why we use cudaDeviceSynchronize() or __syncthreads()? if don't use them what happens, program can't work correctly? what difference between cudaMemcpy and cudaMemcpyAsync in action? can you show an example that show this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load it's appropriate shared memory location(s), before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have some conditional paths in your code, and then you want to run an operation that depends on several array element.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reaches here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer based on the if condition. When considering the last line of the example, you need to make sure that all the threads had exactly finished the if-condition and updated myarray correctly. If this wasn't the case, y may use some updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string comparing jobs on CUDA, and i wonder how can a __global__ function return a value when it finds the exact string that I'm looking for.
I mean, i need the __global__ function which contains a great amount of threads to find a certain string among a big big string-pool simultaneously, and i hope that once the exact string is caught, the __global__ function can stop all the threads and return back to the main function, and tells me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global___ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global___ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great amount of threads like you think it does. It is simply a kernel, function that runs on device, that is called by passing paramaters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside of each block on the grid.
With the type of problem you have it is not really necessary to use anything besides a 1D grid with 1D of threads on in each block because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication)
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudamalloc and cudacopy to device up before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(, Height)
//cudaMemCpy answerIdx back to integer on host
//kernel (Not positive on these types as my CUDA is very rusty
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx
}
}
This is obviously not the most efficient and is most likely not the exact way to pass paramaters and work with memory w/ CUDA, but I hope it gets the point across of splitting the workload and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed up you will get by just dividing the workload onto the device (in a sensible fashion of course) will already give you tremendous performance improvements. To get a sense of the thread model I highly recommend reading up on the documents on Nvidia's site for CUDA. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.

CUDA finding the max value in given array

I tried to develop a small CUDA program for find the max value in the given array,
int input_data[0...50] = 1,2,3,4,5....,50
max_value initialized by the first value of the input_data[0],
The final answer is stored in result[0].
The kernel is giving 0 as the max value. I don't know what the problem is.
I executed by 1 block 50 threads.
__device__ int lock=0;
__global__ void max(float *input_data,float *result)
{
float max_value = input_data[0];
int tid = threadIdx.x;
if( input_data[tid] > max_value)
{
do{} while(atomicCAS(&lock,0,1));
max_value=input_data[tid];
__threadfence();
lock=0;
}
__syncthreads();
result[0]=max_value; //Final result of max value
}
Even though there are in-built functions, just I am practicing small problems.
You are trying to set up a "critical section", but this approach on CUDA can lead to hang of your whole program - try to avoid it whenever possible.
Why your code hangs?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute synchronously. So, the warp will stop in your do{} while(atomicCAS(&lock,0,1)) until all threads from your warp succeed with obtaining the lock. But obviously, you want to prevent several threads from executing the critical section at the same time. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum # wikipedia
Parallel Reduction # CUDA website
NVIDIA's Guide to Reduction
Your code has potential race. I'm not sure if you defined the 'max_value' variable in shared memory or not, but both are wrong.
1) If 'max_value' is just a local variable, then each thread holds the local copy of it, which are not the actual maximum value (they are just the maximum value between input_data[0] and input_data[tid]). In the last line of code, all threads write to result[0] their own max_value, which will result in undefined behavior.
2) If 'max_value' is a shared variable, 49 threads will enter the if-statements block, and they will try to update the 'max_value' one at a time using locks. But the order of executions among 49 threads is not defined, and therefore some threads may overwrite the actual maximum value to smaller values. You would need to compare the maximum value again within the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
There are different accesses for variables. when you define a variable by device then the variable is placed on GPU global memory and it is accessible by all threads in grid , shared places the variable in block shared memory and it is accessible only by the threads of that block , at the end if you don't use any keyword like float max_value then the variable is placed on thread registers and it can be accessed only in that thread.In your code each thread have local variable max_value and it doesn't identify variables in other threads.