How slow are comparison and branching on the GPU? (CUDA)

I have read that comparisons and branching are slow on the GPU. I would like to know how slow. (I'm familiar with OpenCL, but the question applies generally to CUDA, AMP, ...)
I would like to know this before I start porting my code to the GPU. In particular, I'm interested in finding the lowest value in the neighborhood (the 4 or 9 nearest neighbors) of each point in a 2D array, i.e. something like a convolution, but instead of summing and multiplying I need comparisons and branching.
For example, code like this (NOTE: this example code is not yet optimized for the GPU, to keep it readable; partitioning into workgroups, prefetching into local memory, etc. is missing):
for(int i=1; i<n-1; i++){ for(int j=1; j<n-1; j++){   // iterate over the 2D array
    float hij  = h[i][j];
    int   imin = 0, jmin = 0;
    float dh, dhmin = 0;
    // find the lowest neighboring element h[i+imin][j+jmin] of h[i][j]
    dh = h[i-1][j  ] - hij;  if( dh<dhmin ){ imin = -1; jmin =  0; dhmin = dh; }
    dh = h[i+1][j  ] - hij;  if( dh<dhmin ){ imin = +1; jmin =  0; dhmin = dh; }
    dh = h[i  ][j-1] - hij;  if( dh<dhmin ){ imin =  0; jmin = -1; dhmin = dh; }
    dh = h[i  ][j+1] - hij;  if( dh<dhmin ){ imin =  0; jmin = +1; dhmin = dh; }
    if( dhmin < -0.00001 ){  // if a lower neighbor exists
        // ... do something with hij, dhmin and save to h[i+imin][j+jmin] ...
    }
} }
Would it be worth porting to the GPU despite all the if branching and comparison? (i.e. if these 4-5 comparisons per element were 10x slower than the same 4-5 comparisons on a CPU, it would be a bottleneck)
Is there any optimization trick to minimize the slowdown from if branching and comparison?
This is the pattern I used in this hydraulic erosion code:
http://www.openprocessing.org/sketch/146982

Branching itself is not slow; divergence is what gets you. GPUs execute multiple work items (typically 16 or 32) in lock-step as "warps" or "wavefronts", and if different work items take different paths, they all execute all paths but gate the writes according to which path each one is on (using predicate flags). So if your work items always (or mostly) branch the same way, you're fine. If they don't, the penalty can rob you of performance.
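As a minimal sketch of what that means for your inner loop (assuming h is stored as a flat n*n float array on the device; the function name is just illustrative): the value part of the 4-neighbor search can be written with fminf, which maps to a single min instruction per call, so there is no data-dependent branch at all. Tracking (imin, jmin) is left out here; short if-bodies like yours typically get compiled to predicated instructions anyway, so their divergence cost is small.

__device__ float lowest_neighbor_drop(const float *h, int n, int i, int j)
{
    float hij   = h[i * n + j];
    float dhmin = fminf(h[(i - 1) * n + j] - hij,
                        h[(i + 1) * n + j] - hij);
    dhmin = fminf(dhmin, h[i * n + (j - 1)] - hij);
    dhmin = fminf(dhmin, h[i * n + (j + 1)] - hij);
    return fminf(dhmin, 0.0f);   // 0 matches the original dhmin = 0 starting value
}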

If you need to do comparisons and the array length n is really big, you can use a reduction instead of a sequential comparison. A reduction does the comparisons in parallel in O(log n) steps, as opposed to O(n) when done sequentially.
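For instance, a hedged sketch of a block-wide minimum reduction (the names and BLOCK size are illustrative; blockDim.x is assumed to equal BLOCK and be a power of two):

#include <cfloat>

#define BLOCK 256   // assumed block size, power of two

__global__ void min_reduce(const float *in, float *block_min, int n)
{
    __shared__ float s[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * BLOCK + tid;
    s[tid] = (gid < n) ? in[gid] : FLT_MAX;   // pad out-of-range slots
    __syncthreads();
    // halve the active range each step: O(log n) steps instead of O(n)
    for (unsigned int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fminf(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        block_min[blockIdx.x] = s[0];   // one partial minimum per block
}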
When a single GPU thread walks through memory sequentially, the accesses are not used efficiently. Instead, it's good to arrange the work so that adjacent threads read adjacent addresses, i.e. coalesced reads. You can find plenty of examples of this.
On GPUs, don't read the same global memory locations multiple times (GPU memory management and caching don't work quite like a CPU's). Instead, cache the global memory elements in a thread's private variables or in shared memory as much as possible.
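Putting the last two points together for your stencil, here is a hedged sketch (assuming h is a flat n*n array; dh_out and BLOCK are illustrative, and the kernel is launched with blockDim = (BLOCK, BLOCK)): each block stages a tile of h plus a one-cell halo in shared memory, so the four neighbor reads hit on-chip storage instead of global memory.

#define BLOCK 16   // illustrative tile size

__global__ void lowest_neighbor_tiled(const float *h, float *dh_out, int n)
{
    __shared__ float tile[BLOCK + 2][BLOCK + 2];    // tile plus a 1-cell halo

    int i  = blockIdx.y * BLOCK + threadIdx.y;      // global row
    int j  = blockIdx.x * BLOCK + threadIdx.x;      // global column
    int ty = threadIdx.y + 1, tx = threadIdx.x + 1; // position inside the tile
    bool in = (i < n && j < n);

    // each thread loads its own cell; edge threads also fetch the halo cells
    if (in)                                          tile[ty][tx]        = h[i * n + j];
    if (in && threadIdx.y == 0         && i > 0)     tile[0][tx]         = h[(i - 1) * n + j];
    if (in && threadIdx.y == BLOCK - 1 && i + 1 < n) tile[BLOCK + 1][tx] = h[(i + 1) * n + j];
    if (in && threadIdx.x == 0         && j > 0)     tile[ty][0]         = h[i * n + (j - 1)];
    if (in && threadIdx.x == BLOCK - 1 && j + 1 < n) tile[ty][BLOCK + 1] = h[i * n + (j + 1)];
    __syncthreads();

    // interior points only, matching the 1..n-2 loop bounds of the original code
    if (i > 0 && i < n - 1 && j > 0 && j < n - 1) {
        float hij   = tile[ty][tx];
        float dhmin = fminf(fminf(tile[ty - 1][tx], tile[ty + 1][tx]),
                            fminf(tile[ty][tx - 1], tile[ty][tx + 1])) - hij;
        dh_out[i * n + j] = fminf(dhmin, 0.0f);     // placeholder for the real update
    }
}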

Related

prefix sum using CUDA

I am having trouble understanding a CUDA code for a naive prefix sum.
This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In example 39-1 (naive scan), we have a code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // Load input into shared memory.
    // This is exclusive scan, so shift right by one
    // and set first element to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are:
Why do we need to create the shared-memory array temp?
Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block and at most 1024 threads here, can't we just use threadIdx.x to specify the element in the block?
In CUDA, do we use one thread to do one add operation? Is it like one thread doing what one iteration of a for loop would do (as in OpenMP, looping over threads or processors with one thread per array element)?
My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:
for d = 1 to log2 n do
    for all k in parallel do
        if k >= 2^d then
            x[k] = x[k - 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1- It's faster to put data in shared memory and do the calculations there rather than working in global memory. It's important to sync the threads after loading shared memory, hence the __syncthreads().
2- These variables are there to toggle between the two halves of the double buffer, i.e. which half is read and which is written:
temp[pout*n+thid] += temp[pin*n+thid - offset];
In the first iteration pout = 1 and pin = 0; in the second iteration pout = 0 and pin = 1; and so on.
It offsets the output by n on odd iterations and the input by n on even iterations. To come back to your question: you can't achieve the same thing with threadIdx.x alone, because it doesn't change within the loop.
3 & 4 - CUDA launches threads to run the kernel, meaning each thread runs that code separately. If you compare the pseudocode with the CUDA code, the "for all k in parallel" loop is what has been parallelized across CUDA threads (k is the thread index), while the outer loop over d becomes the loop over offset inside the kernel. So each thread runs that loop to the end, synchronizing with the other threads at each __syncthreads(), before the result is written to global memory.
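For completeness, a hedged usage sketch (not from the question) of how such a kernel is typically launched: one block of n threads, with the dynamic shared-memory size passed as the third launch parameter. Since temp is indexed as pout*n + thid with pout in {0, 1}, the double buffer needs room for 2*n floats. d_idata and d_odata are assumed to be device pointers to n floats each.

int n = 1024;   // one block, so n must not exceed the maximum threads per block
scan<<<1, n, 2 * n * sizeof(float)>>>(d_odata, d_idata, n);
cudaDeviceSynchronize();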
Hope it helps.

Can I parallelize my code or is it not worth it? [duplicate]

This question lacked details, so I decided to create another question instead of editing this one. The new question is here: Can I parallelize my code or is it not worth it?
I have a program running in CUDA, where one piece of the code runs within a loop (serialized, as you can see below). This piece of code is a search within an array that contains addresses and/or NULL pointers. All the threads execute the code below.
while (i < n) {
    if (array[i] != NULL) {
        return array[i];
    }
    i++;
}
return NULL;
Here n is the size of the array, and array is in shared memory. I'm only interested in the first address that is different from NULL (first match).
The whole code (I've posted only a piece; the whole code is big) runs fast, but the "heart" of the code (i.e., the part that is repeated the most) is serialized, as you can see. I want to know whether I can parallelize this part (the search) with some optimized algorithm.
Like I said, the program is already in CUDA (and the array is on the device), so there will be no memory transfers between host and device.
My problem is: n is not big. It will hardly ever be greater than 8.
I've tried to parallelize it, but my "new" code took more time than the code above.
I was studying reduction and min operations, but I've found that they are only useful when n is big.
So, any tips? Can I parallelize it efficiently, i.e., with low overhead?
Keeping things simple, one of the major limiting factors of GPGPU code is memory management. In most computers copying memory to the device (GPU) is a slow process.
As illustrated by http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf:
"The key requirement for obtaining effective
acceleration from GPU subroutine libraries is minimization of
I/O between the host and the GPU."
This is because I/O operations between host and device are SLOW!
Tying this back to your problem: it doesn't really make sense to run this on the GPU, since the amount of data you mention is so small. You would spend more time running the memcpy routines than it would take to run on the CPU in the first place - especially since you mention you are only interested in the first match.
One common misconception many people have is "if I run it on the GPU, it has more cores, so it will run faster", and this just isn't the case.
When deciding whether it is worth porting to CUDA or OpenCL, you must think about whether the process is inherently parallel: are you processing very large amounts of data, etc.?
Since you say the array is a shared memory resource, the result of this search is the same for each thread of a block. This means a first and simple optimization would be to only let a single thread do the search. This will free all but the first warp of the block from doing any work (they still need to wait for the result, yet don't have to waste any computing resources):
__shared__ void *result;

if(tid == 0)
{
    result = NULL;   // note: a __shared__ variable cannot have an initializer
    for(unsigned int i=0; i<n; ++i)
    {
        if (array[i] != NULL)
        {
            result = array[i];
            break;
        }
    }
}
__syncthreads();
return result;
A step further would then be to let the threads perform the search in parallel as a classic intra-block reduction. If you can guarantee n to always be <= 64, you can do this in a single warp and don't need any synchronization during the search (except for the complete synchronization at the end, of course).
for(unsigned int i=n/2; i>32; i>>=1)
{
    if(tid < i && !array[tid])
        array[tid] = array[tid+i];
    __syncthreads();
}
if(tid < 32)
{
    if(n > 32 && !array[tid]) array[tid] = array[tid+32];
    if(n > 16 && !array[tid]) array[tid] = array[tid+16];
    if(n >  8 && !array[tid]) array[tid] = array[tid+ 8];
    if(n >  4 && !array[tid]) array[tid] = array[tid+ 4];
    if(n >  2 && !array[tid]) array[tid] = array[tid+ 2];
    if(n >  1 && !array[tid]) array[tid] = array[tid+ 1];
}
__syncthreads();
return array[0];
Of course the example assumes n to be a power of two (and the array to be padded with NULLs accordingly), but feel free to tune it to your needs and optimize this further.
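A different, hedged sketch for the small-n case from the question (n never bigger than 8, so it always fits in one warp): a single warp vote can pick out the first non-NULL entry without any loop or in-place reduction, which also preserves the "first match" semantics. __ballot is the pre-CUDA-9 intrinsic; on newer toolkits the equivalent is __ballot_sync(0xffffffff, ...).

__shared__ void *result;
// each lane tests one slot; bit k of the vote is set if array[k] is non-NULL
unsigned int vote = __ballot(tid < n && array[tid] != NULL);
if (tid == 0)
    result = vote ? array[__ffs(vote) - 1] : NULL;   // __ffs is 1-based
__syncthreads();
return result;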

Parallel Reduction in CUDA for calculating primes

I have code to calculate primes which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
    check = 1;
    for (j = 2; j <= sqrt_limit; j++)
    {
        if ( !(j&1) && (i&(j-1)) == 0 )
        {
            check = 0;
            break;
        }
        if ( j&1 && i%j == 0 )
        {
            check = 0;
            break;
        }
    }
    if (check)
        pcount++;
}
I am trying to port it to the GPU, and I want to reduce the count as I did in the OpenMP example above. Below is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
    __shared__ int s_flags[NTHREADS];

    if (gid > sqrootN && gid < N)
        s_flags[tid] = flags[gid];
    else
        return;
    __syncthreads();

    s_flags[tid] = 1;
    for (j = 2; j <= sqrootN; j++)
    {
        if ( gid%j == 0 )
        {
            s_flags[tid] = 0;
            break;
        }
    }
    //reduce
    for(unsigned int s=1; s < blockDim.x; s*=2)
    {
        if( tid % (2*s) == 0 )
        {
            s_flags[tid] += s_flags[tid + s];
        }
        __syncthreads();
    }
    //write results of this block to the global memory
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast? I think the bottleneck is the for loop, and I am not sure how to replace it. And next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N); in this kernel I am calculating the primes from sqroot(N) to N, but I need to check whether each number in {sqroot(N), N} is divisible by the primes in {2, sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I understand the comment about __syncthreads better now); I realized that I do not need the flags array and that the global indices alone work in my case. What concerns me at this point is the slowness of the code (more than correctness), which could be attributed to the for loop. Also, after a certain data size (100000), the kernel produces incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member of the NVIDIA forum pointed out that this may be because my data size is not a power of 2).
So there are still three (maybe related) questions:
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
    unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    s_flags[tid] = 1;
    for (unsigned int j=2; j<=sqrootN; j++)
    {
        if ( gid % j == 0 )
            s_flags[tid] = 0;
    }
    __syncthreads();
    //reduce
    reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction; this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.
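To make the list concrete, here is a hedged sketch of the kernel with those fixes applied (not a tuned implementation): no early return before a barrier, every __syncthreads() executed unconditionally by the whole block, a barrier between sieving and reduction, and the shared array declared volatile as suggested. NTHREADS is assumed to be a power of two and equal to the block size; the primality test itself is left as plain trial division.

__global__ void sieve(int *o_flags, long int sqrootN, long int N)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    // 1 if gid lies in (sqrootN, N) and has no divisor in [2, sqrootN], else 0
    int flag = (gid > sqrootN && gid < N) ? 1 : 0;
    for (long int j = 2; flag && j <= sqrootN; j++)
        if (gid % j == 0)
            flag = 0;
    s_flags[tid] = flag;
    __syncthreads();                  // all flags written before the reduction

    // tree reduction over the block; every thread reaches every barrier
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            s_flags[tid] += s_flags[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];   // partial count for this block
}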

CUDA binary search implementation

I am trying to speed up a CPU binary search. Unfortunately, the GPU version is always much slower than the CPU version. Perhaps the problem is not suitable for the GPU, or am I doing something wrong?
CPU version (approx. 0.6ms):
using a sorted array of length 2000 and doing a binary search for a specific value
...
Lookup ( search[j], search_array, array_length, m );
...

int Lookup ( int search, int* arr, int length, int& m )
{
    int l(0), r(length-1);
    while ( l <= r )
    {
        m = (l+r)/2;
        if ( search < arr[m] )
            r = m-1;
        else if ( search > arr[m] )
            l = m+1;
        else
        {
            return index[m];
        }
    }
    if ( arr[m] >= search )
        return m;
    return (m+1);
}
GPU version (approx. 20ms):
using a sorted array of length 2000 and doing a binary search for a specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....

__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
    const int num_threads = blockDim.x * gridDim.x;
    const int thread = blockIdx.x * blockDim.x + threadIdx.x;
    int set_size = array_length;

    ret_val[0] = -1; // return value
    ret_val[1] = 0;  // offset

    while(set_size != 0)
    {
        // Get the offset of the array, initially set to 0
        int offset = ret_val[1];
        // I think this is necessary in case a thread gets ahead, and resets offset before it's read
        // This isn't necessary for the unit tests to pass, but I still like it here
        __syncthreads();
        // Get the next index to check
        int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
        // If the index is outside the bounds of the array then lets not check it
        if (index_to_check < array_length)
        {
            // If the next index is outside the bounds of the array, then set it to maximum array size
            int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
            if (next_index_to_check >= array_length)
            {
                next_index_to_check = array_length - 1;
            }
            // If we're at the mid section of the array reset the offset to this index
            if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
            {
                ret_val[1] = index_to_check;
            }
            else if (search == arr[index_to_check])
            {
                // Set the return var if we hit it
                ret_val[0] = index_to_check;
            }
        }
        // Since this is a p-ary search divide by our total threads to get the next set size
        set_size = set_size / num_threads;
        // Sync up so no threads jump ahead and get a bad offset
        __syncthreads();
    }
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I must admit I'm not entirely sure what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so, have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What you are doing is essentially trying to reduce the closest index to your query.)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead (see the sketch below).
Secondly, remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency of your global memory writes may actually make it appear as if they do).
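As one hedged sketch of the shared-memory suggestion (and assuming you can batch many independent queries into one launch, which is not shown in the original code): each thread runs an ordinary binary search for its own query against a copy of the 2000-element sorted array staged in shared memory. ARRAY_LEN and the pointer names are illustrative.

#define ARRAY_LEN 2000   // assumed fixed array length from the question

__global__ void batched_search(const int *arr, const int *queries,
                               int *results, int num_queries)
{
    __shared__ int s_arr[ARRAY_LEN];
    // cooperative, coalesced load of the whole sorted array into shared memory
    for (int i = threadIdx.x; i < ARRAY_LEN; i += blockDim.x)
        s_arr[i] = arr[i];
    __syncthreads();

    int q = blockIdx.x * blockDim.x + threadIdx.x;   // one query per thread
    if (q >= num_queries) return;

    int search = queries[q], l = 0, r = ARRAY_LEN - 1, m = 0;
    while (l <= r) {
        m = (l + r) / 2;
        if      (search < s_arr[m]) r = m - 1;
        else if (search > s_arr[m]) l = m + 1;
        else { results[q] = m; return; }
    }
    results[q] = (s_arr[m] >= search) ? m : m + 1;   // same fallback as the CPU code
}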