Parallel reduction and find index on CUDA - cuda

I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each. num_blocks = 50 and block_size = 400.
My code looks like this:
getmax <<< num_blocks,block_size >>> (d_in, d_out1, d_indices);
__global__ void getmax(float *in1, float *out1, int *index)
{
// Declare arrays to be in shared memory.
__shared__ float max[threads];
int nTotalThreads = blockDim.x; // Total number of active threads
float temp;
float max_val;
int max_index;
int arrayIndex;
// Calculate which element this thread reads from memory
arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
max[threadIdx.x] = in1[arrayIndex];
max_val = max[threadIdx.x];
max_index = blockDim.x*blockIdx.x + threadIdx.x;
__syncthreads();
while(nTotalThreads > 1)
{
int halfPoint = (nTotalThreads >> 1);
if (threadIdx.x < halfPoint)
{
temp = max[threadIdx.x + halfPoint];
if (temp > max[threadIdx.x])
{
max[threadIdx.x] = temp;
max_val = max[threadIdx.x];
}
}
__syncthreads();
nTotalThreads = (nTotalThreads >> 1); // divide by two.
}
if (threadIdx.x == 0)
{
out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
}
if(max[blockIdx.x] == max_val )
{
index[blockIdx.x] = max_index;
}
}
The problem/issue here is that at some point “nTotalThreads” is not exactly a power of 2, resulting in garbage value for the index. The array out1 gives me the maximum value in each block, which is correct and validated. But the value of the index is wrong. For example: the max value in the first block occurs at index=40, but the kernel gives the values of index as 15. Similarly the value of the max in the second block is at 440, but the kernel gives 416.
Any suggestions??

It should be easy to ensure that nTotalThreads is always a power of 2.
Make the first reduction a special case that gets the nTotalThreads to a power of 2. eg, since you start with 400 threads in a block, do the first reduction with 256 threads. Threads 0-199 will reduce from two values, and threads 200-255 just won't have to do a reduction in this initial step. From then on out you'd be fine.

Are you sure you really need the 'issue' “nTotalThreads” is not exactly a power of 2?
It makes the code less readable and I think it can interfere with the performance too.
Anyway if you substitute
nTotalThreads = (nTotalThreads >> 1);
with
nTotalThreads = (nTotalThreads +1 ) >> 1;
it should solve one bug concerning this 'issue'.
Francesco

Second Jeff's suggestion.
Take a look at the CUDA Thrust Library's reduce function. This is demonstrated to have 95+% efficiency compared with heavily hand-tuned kernels and is pretty flexible and easy to use.

check my kernel. You can put your blockresults to array(which can be in global memory) and get the result in global memory
And see how I call it in host code:
sumSeries<<<dim3(blockCount),dim3(threadsPerBlock)>>>(deviceSum,threadsPerBlock*blockCount);

Related

prefix sum using CUDA

I am having trouble understanding a cuda code for naive prefix sum.
This is code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In example 39-1 (naive scan), we have a code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
extern __shared__ float temp[]; // allocated on invocation
int thid = threadIdx.x;
int pout = 0, pin = 1;
// Load input into shared memory.
// This is exclusive scan, so shift right by one
// and set first element to 0
temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
__syncthreads();
for (int offset = 1; offset < n; offset *= 2)
{
pout = 1 - pout; // swap double buffer indices
pin = 1 - pout;
if (thid >= offset)
temp[pout*n+thid] += temp[pin*n+thid - offset];
else
temp[pout*n+thid] = temp[pin*n+thid];
__syncthreads();
}
g_odata[thid] = temp[pout*n+thid1]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need "pout" and "pin" variables? What do they do? Since we only use one block and 1024 threads at maximum here, can we only use threadId.x to specify the element in the block?
In CUDA, do we use one thread to do one add operation? Is it like, one thread does what could be done in one iteration if I use a for loop (loop the threads or processors in OpenMP given one thread for one element in an array)?
My previous two questions may seem to be naive... I think the key is I don't understand the relation between the above implementation and the pseudocode as following:
for d = 1 to log2 n do
for all k in parallel do
if k >= 2^d then
x[k] = x[k – 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1- It's faster to put stuff in Shared Memory (SM) and do calculations there rather than using the Global Memory. It's important to sync threads after loading the SM hence the __syncthreads.
2- These variables are probably there for the clarification of reversing the order in the algorithm. It's simply there for toggling certain parts:
temp[pout*n+thid] += temp[pin*n+thid - offset];
First iteration ; pout = 1 and pin = 0. Second iteration; pout = 0 and pin = 1.
It offsets the output for N amount at odd iterations and offsets the input at even iterations. To come back to your question, you can't achieve the same thing with threadId.x because the it wouldn't change within the loop.
3 & 4 - CUDA executes threads to run the kernel. Meaning that each thread runs that code separately. If you look at the pseudo code and compare with the CUDA code you already parallelized the outer loop with CUDA. So each thread would run the loop in the kernel until the end of loop and would wait each thread to finish before writing to the Global Memory.
Hope it helps.

How to collect individual results of the threads within a block?

In my Kernel, the threads are processing a small part of an array in global memory.
After processing I would also like to set a flag indicating that the result of the calculation is zero for all threads within a block:
__global__ void kernel( int *a, bool *blockIsNull) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int result = 0;
// {...} Here calculate result
a[tid] = result;
// some code here, but I don't know, that's my question...
if (condition)
blockIsNull[blockIdx.x] = true; // if all threads have returned result==0
}
Each individual thread owns the information. But I don't find an efficient way to collect it.
For example, I could have a counter in shared memory that is atomically incremented by each thread when result==0. So when the counter reaches blockDim.x it means that all threads have returned zero. Althought not tested, I am afraid that this solution will have a negative impact on performance (atomic functions are slow).
A zero result does not occur very often, so it is very unlikely to have zeros for all threads within a block. I would like to find a solution that has little impact on the performance in the general case.
What would be your recommendation ?
It sounds like you want to perform a block level reduction of the condition value across a block. Just about all CUDA hardware supports a set of very useful warp voting primitives. You could use the __all() warp vote to determine that each warp of threads satisfied the condition, and then use __all() again to check whether all warps satisfy the condition. In code, it might look like this:
__global__ void kernel( int *a, bool *blockIsNull) {
// assume that threads per block is <= 1024
__shared__ volatile int blockcondition[32];
int laneid = threadIdx.x % 32;
int warpid = threadIdx.x / 32;
// Set each condition value to non zero to begin
if (warpid == 0) {
blockcondition[threadIdx.x] = 1;
}
__syncthreads();
//
// your code goes here
//
// warpcondition holds the vote from each warp
int warpcondition = __all(condition);
// First thread in each warp loads the warp vote to shared memory
if (laneid == 0) {
blockcondition[warpid] = warpcondition;
}
__syncthreads();
// First warp reduces all the votes in shared memory
if (warpid == 0) {
int result = __all(blockcondition[threadIdx.x] != 0);
// first thread stores the block result to global memory
if (laneid == 0) {
blockIsNull[blockIdx.x] = (result !=0);
}
}
}
[ Huge disclaimer: written in browser, never compiled or tested, use at own risk ]
This code should (I think) work for any number of threads per block up to 1024. You could, if required, adjust the size of blockcondition to a smaller value if you were confident of an upper block size limit less than 1024. Probably the smartest way would be to use C++ templating and make the warp count limit a template parameter.

Implementing Max Reduce in Cuda

I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX; // 1
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
As you already discovered, your last line was not correct. Each block produces it's own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128, you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to it's initial value. This is actually unnecessary and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLOAT_MAX.
Again, we normally steer people way from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
I've uploaded the working file here: https://github.com/kenseehart/YAMR

CUDA binary search implementation

I am trying to speed up the CPU binary search. Unfortunately, GPU version is always much slower than CPU version. Perhaps the problem is not suitable for GPU or am I doing something wrong ?
CPU version (approx. 0.6ms):
using sorted array of length 2000 and do binary search for specific value
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
int l(0), r(length-1);
while ( l <= r )
{
m = (l+r)/2;
if ( search < arr[m] )
r = m-1;
else if ( search > arr[m] )
l = m+1;
else
{
return index[m];
}
}
if ( arr[m] >= search )
return m;
return (m+1);
}
GPU version (approx. 20ms):
using sorted array of length 2000 and do binary search for specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x;
int set_size = array_length;
ret_val[0] = -1; // return value
ret_val[1] = 0; // offset
while(set_size != 0)
{
// Get the offset of the array, initially set to 0
int offset = ret_val[1];
// I think this is necessary in case a thread gets ahead, and resets offset before it's read
// This isn't necessary for the unit tests to pass, but I still like it here
__syncthreads();
// Get the next index to check
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
// If the index is outside the bounds of the array then lets not check it
if (index_to_check < array_length)
{
// If the next index is outside the bounds of the array, then set it to maximum array size
int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length)
{
next_index_to_check = array_length - 1;
}
// If we're at the mid section of the array reset the offset to this index
if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
{
ret_val[1] = index_to_check;
}
else if (search == arr[index_to_check])
{
// Set the return var if we hit it
ret_val[0] = index_to_check;
}
}
// Since this is a p-ary search divide by our total threads to get the next set size
set_size = set_size / num_threads;
// Sync up so no threads jump ahead and get a bad offset
__syncthreads();
}
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I'm must admit I'm not entirely sure what what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so then have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What your are doing is essentially trying to reduce the closest index to your query)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead.
Secondly remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency from you global memory writes may actually make it appear as if they do)

CUDA - specifiying <<<x,y>>> for a for loop

Hey,
I have two arrays of size 2000. I want to write a kernel to copy one array to the other. The array represents 1000 particles. index 0-999 will contain an x value and 1000-1999 the y value for their position.
I need a for loop to copy up to N particles from 1 array to the other. eg
int halfway = 1000;
for(int i = 0; i < N; i++){
array1[i] = array2[i];
array1[halfway + i] = array[halfway + i];
}
Due to the number of N always being less than 2000, can I just create 2000 threads? or do I have to create several blocks.
I was thinking about doing this inside a kernel:
int tid = threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
and calling it as follows:
kernel<<<1,2000>>>(...);
Would this work? will it be fast? or will I be better off splitting the problem into blocks. I'm not sure how to do this, perhaps: (is this correct?)
int tid = blockDim.x*blockIdx.x + threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
kernel<<<4,256>>>(...);
Would this work?
Have you actually tried it?
It will fail to launch, because you are allowed to have 512 threads maximum (value may vary on different architectures, mine is one of GTX 200-series). You will either need more blocks or have fewer threads and a for-loop inside with blockDim.x increment.
Your multi-block solution should work as well.
Other approach
If this is the only purpose of the kernel, you might as well try using cudaMemcpy with cudaMemcpyDeviceToDevice as the last parameter.
The only way to answer questions about configurations is to test them. To do this, write your kernels so that they work regardless of the configuration. Often, I will assume that I will launch enough threads, which makes the kernel easier to write. Then, I will do something like this:
threads_per_block = 512;
num_blocks = SIZE_ARRAY/threads_per_block;
if(num_blocks*threads_per_block<SIZE_ARRAY)
num_blocks++;
my_kernel <<< num_blocks, threads_per_block >>> ( ... );
(except, of course, threads_per_block might be a define, or a command line argument, or iterated to test many configurations)
Is better to use more than one block for any kernel.
It Seems to me that you are simply copying from one array to another as a sequence of values with an offset.
If this is the case you can simply use the cudaMemcpy API call and specify
cudaMemcpyDeviceToDevice
cudaMemcpy(array1+halfway,array1,1000,cudaMemcpyDeviceToDevice);
The API will figure out the best partition of block / threads.