Parallel reduction example - CUDA

I found this parallel reduction code from Stanford which uses shared memory.
The example uses 1<<18 elements (which is 262144) and produces the correct result.
Why do I get the correct result for certain element counts, but for others, like 200000 or 25000, I get results that differ from what is expected?
It looks to me like it always launches the required number of thread blocks.

// launch a single block to compute the sum of the partial sums
block_sum<<<1,num_blocks,num_blocks * sizeof(float)>>>
This code causes the bug.
Suppose num_blocks is 13.
Then in the kernel, blockDim.x / 2 will be 6,
and
if(threadIdx.x < offset)
{
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
will only add the first 12 elements, causing the bug.
When the element count is 200000 or 25000, num_blocks comes out as a number that is not a power of two, and the bug appears; this in-place halving only produces the correct sum when num_blocks (the block dimension of that final launch) is a power of two.
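One common fix (a sketch only; the exact block_sum signature and the d_partial_sums/d_result names are assumptions, not taken from the Stanford code) is to round that final launch up to the next power of two of threads and have the extra threads contribute zero:
// hypothetical helper: smallest power of two >= x
unsigned int next_pow2(unsigned int x)
{
    unsigned int p = 1;
    while (p < x) p <<= 1;
    return p;
}
// launch a single block to compute the sum of the partial sums,
// padded up to a power-of-two number of threads
unsigned int final_threads = next_pow2(num_blocks);
block_sum<<<1, final_threads, final_threads * sizeof(float)>>>(d_partial_sums, d_result, num_blocks);
// inside block_sum, threads past the end must then load zero instead of reading a real partial sum:
// sdata[threadIdx.x] = (threadIdx.x < n) ? partial_sums[threadIdx.x] : 0.0f;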

This kernel is sensitive to its blocking parameters (grid and threadblock size). Are you invoking it with enough threads to cover the input size?
It is more robust to formulate kernels like this with for loops - instead of:
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
something like:
for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
i < N;
i += blockDim.x*gridDim.x ) {
sum += in[i];
}
The source code in the CUDA Handbook has lots of examples of "blocking agnostic" code. The reduction code is here:
https://github.com/ArchaeaSoftware/cudahandbook/tree/master/reduction
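For reference, here is a minimal, self-contained sketch of the grid-stride formulation described above (the kernel and variable names are mine, not taken from the Stanford code, and a power-of-two block size is assumed for the in-block tree):
__global__ void block_sum_kernel(float *per_block_results, const float *in, size_t N)
{
    extern __shared__ float sdata[];
    // grid-stride loop: each thread privately sums every (blockDim.x*gridDim.x)-th element
    float sum = 0.0f;
    for (size_t i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += blockDim.x*gridDim.x)
        sum += in[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();
    // in-block tree reduction; correct for power-of-two blockDim.x
    for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        per_block_results[blockIdx.x] = sdata[0];
}
Because the grid-stride loop decouples the element count from the launch configuration, this kernel produces the same per-block partial sums for any grid size, e.g.:
block_sum_kernel<<<64, 256, 256 * sizeof(float)>>>(d_partials, d_in, 200000);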

Related

Why do we need stride in CUDA kernel?

I was wondering: why does one need to use a grid-stride loop like the following?
for (int i = index; i < ITERATIONS; i += stride)
{
C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim.x = 4, and therefore stride will be equal to 20.
My point is that, whenever one uses such an approach, stride will always be equal to or larger than the number of elements in the calculation, so by the time the increment happens the loop will already be over.
And here is the question: why does one need to use a loop or stride at all? Why not just run with index, like this?
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question: how can I know, in this particular case, how many threads are running on my GPU simultaneously before they "jump" to another portion of a very big array (e.g. 2000000)?
My point is that, whenever one uses such an approach, stride will always be equal to or larger than the number of elements in the calculation, so by the time the increment happens the loop will already be over.
Therein lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device-wide occupancy on your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident", and each thread processes more than one input/output pair. The grid stride also preserves whatever memory-coalescing and cache-coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.
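To make this concrete, here is a sketch of such an occupancy-driven launch (float arrays and an explicit element count N are assumptions of mine, and the kernel signature is adapted accordingly; the occupancy API shown is available in newer CUDA toolkits):
__global__ void bench(const float *A, const float *B, float *C, int N)
{
    // grid-stride loop: each thread handles many elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
        C[i] = A[i] + B[i];
}
// host side: launch only enough resident blocks to fill the device once
int blockSize = 256;
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, bench, blockSize, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int numBlocks = blocksPerSM * prop.multiProcessorCount;
bench<<<numBlocks, blockSize>>>(A, B, C, N);
The grid-stride loop then makes each thread process roughly N / (numBlocks * blockSize) elements, however large N is.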

Struggling with intuition regarding how warp-synchronous thread execution works

I am new to CUDA. I am working on basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:
__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
extern __shared__ int sPartials[];
int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
i < N;
i += blockDim.x*gridDim.x ) {
sum += in[i];
}
sPartials[tid] = sum;
__syncthreads();
for ( int activeThreads = blockDim.x>>1;
activeThreads > 32;
activeThreads >>= 1 ) {
if ( tid < activeThreads ) {
sPartials[tid] += sPartials[tid+activeThreads];
}
__syncthreads();
}
if ( threadIdx.x < 32 ) {
volatile int *wsSum = sPartials;
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement, any example please?
wsSum[tid] += wsSum[tid + 16]; // how are these statements executed in parallel within a warp?
wsSum[tid] += wsSum[tid + 8];
wsSum[tid] += wsSum[tid + 4];
wsSum[tid] += wsSum[tid + 2];
wsSum[tid] += wsSum[tid + 1];
if ( tid == 0 ) {
volatile int *wsSum = sPartials;// why this statement is needed?
out[blockIdx.x] = wsSum[0];
}
}
}
Unfortunately it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread ids showing how the statements are executed? I think it is important to understand these concepts, so any help would be appreciated!
Let's look at the code in blocks, and answer your questions along the way:
int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
i < N;
i += blockDim.x*gridDim.x ) {
sum += in[i];
}
The above code travels through a data set of size N. An assumption we can make for understanding purposes is that N > blockDim.x*gridDim.x, this last term simply being the total number of threads in the grid. Since N is larger than the total number of threads, each thread sums multiple elements from the data set. From the standpoint of a given thread, it sums elements that are spaced apart by the total thread count of the grid (blockDim.x*gridDim.x). Each thread stores its running sum in a local (presumably register) variable named sum.
sPartials[tid] = sum;
__syncthreads();
As each thread finishes (i.e., as its for-loop index reaches or exceeds N) it stores its intermediate sum in shared memory, and then waits for all other threads in the block to finish.
for ( int activeThreads = blockDim.x>>1;
activeThreads > 32;
activeThreads >>= 1 ) {
if ( tid < activeThreads ) {
sPartials[tid] += sPartials[tid+activeThreads];
}
__syncthreads();
}
So far we haven't talked about the dimension of the block - it hasn't mattered. Let's assume each block has some integer multiple of 32 threads. The next step is to start gathering the various intermediate sums stored in shared memory into smaller and smaller groups of variables. The above code starts out by selecting half of the threads in the threadblock (blockDim.x>>1) and uses each of those threads to combine two of the partial sums in shared memory. So if our threadblock started out at 128 threads, we just used 64 of those threads to reduce 128 partial sums into 64 partial sums. This process continues repeatedly in the for loop, each time cutting the number of threads in half and combining partial sums, two at a time per thread. The process continues as long as activeThreads > 32. So if activeThreads is 64, then those 64 threads will combine 128 partial sums into 64 partial sums. But when activeThreads becomes 32, the for-loop terminates, without combining 64 partial sums into 32. So at the completion of this block of code, we have taken the (arbitrary multiple of 32 threads) threadblock and reduced however many partial sums we started out with down to 64. This process of combining, say, 256 partial sums to 128 partial sums to 64 partial sums must wait at each iteration for all threads (in multiple warps) to complete their work, so the __syncthreads(); statement is executed with each pass of the for-loop.
Keep in mind, at this point, we have reduced our threadblock to 64 partial sums.
if ( threadIdx.x < 32 ) {
For the remainder of the kernel after this point, we will only be using the first 32 threads (i.e. the first warp). All other threads will remain idle. Note that there are no __syncthreads(); after this point either, as that would be a violation of the rule for using it (all threads must participate in a __syncthreads();).
volatile int *wsSum = sPartials;
We are now creating a volatile pointer to shared memory. In theory, this tells the compiler that it should not do various optimizations, such as optimizing a particular value into a register, for example. Why didn't we need this before? Because __syncthreads(); also carries with it a memory-fencing function. A __syncthreads(); call, in addition to causing all threads to wait at the barrier for each other, also forces all thread updates back into shared or global memory. We can no longer depend on this feature, however, because from here on out we will not be using __syncthreads(); because we have restricted ourselves -- for the remainder of the kernel -- to a single warp.
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this
The previous reduction block left us with 64 partial sums. But we have at this point restricted ourselves to 32 threads. So we must do one more combination to gather the 64 partial sums into 32 partial sums, before we can proceed with the remainder of the reduction.
wsSum[tid] += wsSum[tid + 16]; //how these statements are executed in paralle within a warp
Now we are finally getting into some warp-synchronous programming. This line of code depends on the fact that 32 threads are executing in lockstep. To understand why (and how it works at all) it will be convenient to break this down into the sequence of operations needed to complete this line of code. It looks something like:
read the partial sum of my thread into a register
read the partial sum of the thread that is 16 higher than my thread, into a register
add the two partial sums
store the result back into the partial sum corresponding to my thread
All 32 threads will follow the above sequence in lock-step. All 32 threads will begin by reading wsSum[tid] into a (thread-local) register. That means thread 0 reads wsSum[0], thread 1 reads wsSum[1], etc. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], etc. It's true that we don't care about the wsSum[32] (and higher) values; we've already collapsed those into the first 32 wsSum[] values. However, as we'll see, only the first 16 threads (at this step) will contribute to the final result, so the first 16 threads are combining the 32 partial sums into 16. The next 16 threads will be acting as well, but they are just doing garbage work -- it will be ignored.
The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code:
wsSum[tid] += wsSum[tid + 8];
repeats this process with a granularity of 8. Again, all 32 threads are active, and the micro-sequence is something like this:
read the partial sum of my thread into a register
read the partial sum of the thread that is 8 higher than my thread, into a register
add the two partial sums
store the result back into the partial sum corresponding to my thread
So the first 8 threads combine the first 16 partial sums (wsSum[0..15]) into 8 partial sums (contained in wsSum[0..7]). The next 8 threads are also combining wsSum[8..23] into wsSum[8..15], but the writes to locations 8..15 occur after those values were read by threads 0..7, so the valid data is not corrupted. It's just extra junk work going on. Likewise for the other groups of 8 threads within the warp. So at this point we have combined the partial sums of interest into 8 locations.
wsSum[tid] += wsSum[tid + 4]; //this combines partial sums of interest into 4 locations
wsSum[tid] += wsSum[tid + 2]; //this combines partial sums of interest into 2 locations
wsSum[tid] += wsSum[tid + 1]; //this combines partial sums of interest into 1 location
And these lines of code follow a similar pattern as the previous two, partitioning the warp into 8 groups of 4 threads (only the first 4-thread group contributes to the final result) and then partitioning the warp into 16 groups of 2 threads, with only the first 2-thread group contributing to the final result. And finally, into 32 groups of 1 thread each, each thread generating a partial sum, with only the first partial sum being of interest.
if ( tid == 0 ) {
volatile int *wsSum = sPartials;// why this statement is needed?
out[blockIdx.x] = wsSum[0];
}
At last, in the previous step, we had reduced all partial sums down to a single value. It's now time to write that single value out to global memory. Are we done with the reduction? Perhaps, but probably not. If the above kernel were launched with only 1 threadblock, then we would be done -- our final "partial" sum is in fact the sum of all elements in the data set. But if we launched multiple blocks, then the final result from each block is still a "partial" sum, and the results from all blocks must be added together (somehow).
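For example, one simple way (a sketch, not necessarily the fastest) to finish a multi-block launch is to copy the per-block results back and add them on the host; d_out and numBlocks here are placeholder names for the device output array and the grid size:
int *h_partials = (int *)malloc(numBlocks * sizeof(int));
cudaMemcpy(h_partials, d_out, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < numBlocks; ++b)
    total += h_partials[b];   // sum of the per-block partial sums
free(h_partials);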
And to answer your final question:
I don't know why that statement is needed.
My guess is that it was left around from a previous iteration of the reduction kernel, and the programmer forgot to delete it, or didn't notice that it wasn't needed. Perhaps someone else will know the answer to this one.
Finally, the CUDA reduction sample provides very good reference code for study, and the accompanying PDF document does a good job of describing the optimizations that can be made along the way.
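As an aside (this is not part of the code above): on GPUs with warp shuffle support (compute capability 3.0 or later, and CUDA 9 or later for the _sync intrinsics), the final-warp portion is often written with register shuffles instead of volatile shared memory. A minimal sketch of that alternative, reusing the names from the kernel:
if (blockDim.x > 32) wsSum[tid] += wsSum[tid + 32];   // still fold 64 partial sums down to 32
int val = wsSum[tid];                                  // each of the 32 lanes takes its partial sum into a register
for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_down_sync(0xffffffffu, val, offset); // lane i accumulates the value from lane i+offset
if (tid == 0)
    out[blockIdx.x] = val;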
After the first two code blocks (separated by __syncthreads()), you are left with 64 values in each thread block (stored in that block's sPartials[]). The code from if ( threadIdx.x < 32 ) onward accumulates those 64 values within each sPartials[]. It is purely a speed optimization of the reduction: the remaining accumulation steps work on so little data that it is not worth continuing to halve the thread count inside the loop. You can simply change the condition in the second code block from
for ( int activeThreads = blockDim.x>>1;
activeThreads > 32;
activeThreads >>= 1 )
to
for ( int activeThreads = blockDim.x>>1;
activeThreads > 0;
activeThreads >>= 1 )
instead of
if ( threadIdx.x < 32 ) {
volatile int *wsSum = sPartials;
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
wsSum[tid] += wsSum[tid + 16];
wsSum[tid] += wsSum[tid + 8];
wsSum[tid] += wsSum[tid + 4];
wsSum[tid] += wsSum[tid + 2];
wsSum[tid] += wsSum[tid + 1];
for better understanding.
After the accumulation, each sPartials[] holds a single value, stored in sPartials[0] (wsSum[0] in your code).
And after the kernel finishes, you can accumulate the per-block results on the CPU to get the final sum.
The CUDA execution model in a nutshell: the computation is divided between blocks on a grid. The threads within a block can share some resources (shared memory).
Each block is executed on a single Streaming Multiprocessor (SM), which is what makes the fast shared memory possible.
The work for each block is again split into warps of 32 threads. You can look at the work done by warps as independent tasks. The SM switches between warps very quickly. For example, when a thread accesses global memory, the SM will switch to a different warp.
You know nothing about the order in which warps are executed. All you know is that, after a call to __syncthreads, all threads will have run up to that point, and all memory reads and writes have been completed.
The important thing to note is that all threads in a warp execute the same instruction; when there is a branch and different threads take different paths, some of them are paused.
So, in the reduction example, the first part may be executed by multiple warps. In the last part, there are only 32 threads left, so only one warp is active. The line
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
is there to add the partial sums computed by the other warps to the partial sums of our final warp.
The next lines work as follows. Because execution within a warp is synchronized, it is safe to assume that the write operation to wsSum[tid] is completed before the next read, and so there is no need for a __syncthreads call.
The volatile keyword lets the compiler know that values in the wsSum array may be changed by other threads, so it will make sure that the value of wsSum[tid + X] isn't read earlier, before it was updated by some thread in the previous instruction.
The last volatile declaration seems redundant: you could just as well use the existing wsSum variable.

creation 2D grid in CUDA for GPGPU using C++

I am trying to extend my grid from a 1D to a 2D grid. Is there any way to do this?
Here is my current code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
In the #include list I have these definitions:
#define BLOCKS_PER_GRID 102
#define THREADS_PER_BLOCK 1024
Given that you want 1024 threads per block, the block can be easily reshaped to 2D.
32 x 32 = 1024;
So your block will look like this:
dim3 Block(32,32); //1024 threads per block. Will only work for devices of at least 2.0 Compute Capability.
I don't know your exact requirement, but usually the number of blocks is not fixed (as you have defined it in the macro). The number of blocks depends on the input data size, so that the grid scales dynamically.
Going with your case of 102 blocks, the nearest fit for your grid comes out to be 17 x 6 or 6 x 17.
dim3 Grid(17,6);
Now you can call the kernel with these parameters:
kernel<<<Grid,Block>>>();
Inside the kernel, the 2-Dimensional index of the thread is calculated as follows:
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
Or if you follow the Row/Column convention instead of x/y, then:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
You can also have a 2D grid of 1-dimensional threadblocks, in order to get around the limitation of 65535 blocks per grid dimension (for pre-cc3.0 devices). This may be an easier way of extending a fundamentally 1-D problem past the limit without introducing a 2-D array representation for the data.
Let's assume we have a DATA_ELEMENTS parameter defined to be the number of elements (one element per thread) that your kernel will work on. If DATA_ELEMENTS is larger than 65535*1024, then you cannot handle them all using a 1-D grid, if each thread handles only 1 element.
You can leave your THREADS_PER_BLOCK parameter the same. Your thread index calculation inside the kernel will change to something like:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You will want to be sure to condition your kernel calculations with something like:
if (idx < DATA_ELEMENTS){
(kernel code)
}
Your grid dimensions will be as follows:
dim3 grid;
if (DATA_ELEMENTS > (65535*THREADS_PER_BLOCK)){ // create a 2-D grid
int gridx = 65535; // could choose another number here
int gridy = ((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)/gridx;
if ((((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)%gridx) != 0) gridy++;
grid.x=gridx;
grid.y=gridy;
grid.z=1;
}
else{ // create a 1-D grid
int gridx = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
grid.x=gridx;
grid.y=1;
grid.z=1;
}
and you would launch your kernel as:
kernel<<<grid, THREADS_PER_BLOCK>>>(...);
Another method to tackle this kind of problem is to create a 1-D grid of some dimension (let's say the total number of threads in the grid is NUM_THREADS_PER_GRID), and have each thread work on more than one element in the array of data elements, using something like a for-loop or while-loop:
while (idx < DATA_ELEMENTS) {
(code to process an element)
idx += NUM_THREADS_PER_GRID;  // NUM_THREADS_PER_GRID == blockDim.x * gridDim.x
}
I like Robert's solutions above. The only comment I have about his first solution is that it seems one should make gridx as small as one can when DATA_ELEMENTS > (65535*THREADS_PER_BLOCK). The reason is that if the number of data elements is 65535*THREADS_PER_BLOCK + 1 and gridx is 65535, then 65535*2*THREADS_PER_BLOCK threads are launched, so almost half of the threads will do nothing. If gridx is smaller, there will be fewer threads that do nothing; a sketch of that adjustment is below.
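A sketch of that adjustment, using the same names as above: first compute the total number of blocks needed, then shrink gridx so that gridx*gridy only just covers it:
int totalBlocks = (DATA_ELEMENTS + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
dim3 grid;
if (totalBlocks > 65535) {
    int gridy = (totalBlocks + 65534) / 65535;     // rows needed if gridx were the full 65535
    int gridx = (totalBlocks + gridy - 1) / gridy; // shrink gridx so gridx*gridy barely exceeds totalBlocks
    grid.x = gridx; grid.y = gridy; grid.z = 1;
}
else {
    grid.x = totalBlocks; grid.y = 1; grid.z = 1;
}
For the example above (65535*THREADS_PER_BLOCK + 1 elements), this launches 65536 blocks instead of 131070, so only THREADS_PER_BLOCK - 1 threads are idle.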

CUDA - specifiying <<<x,y>>> for a for loop

Hey,
I have two arrays of size 2000. I want to write a kernel to copy one array to the other. The arrays represent 1000 particles: indices 0-999 contain the x values and indices 1000-1999 the y values of their positions.
I need a for loop to copy up to N particles from one array to the other, e.g.
int halfway = 1000;
for(int i = 0; i < N; i++){
array1[i] = array2[i];
array1[halfway + i] = array2[halfway + i];
}
Since N is always less than 2000, can I just create 2000 threads? Or do I have to create several blocks?
I was thinking about doing this inside a kernel:
int tid = threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
and calling it as follows:
kernel<<<1,2000>>>(...);
Would this work? Will it be fast? Or would I be better off splitting the problem into blocks? I'm not sure how to do this; perhaps something like this (is it correct?):
int tid = blockDim.x*blockIdx.x + threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
kernel<<<4,256>>>(...);
Would this work?
Have you actually tried it?
It will fail to launch, because you are allowed a maximum of 512 threads per block (the value may vary on different architectures; mine is one of the GTX 200-series). You will either need more blocks, or fewer threads and a for-loop inside the kernel with a blockDim.x increment.
Your multi-block solution should work as well.
Other approach
If this is the only purpose of the kernel, you might as well try using cudaMemcpy with cudaMemcpyDeviceToDevice as the last parameter.
The only way to answer questions about configurations is to test them. To do this, write your kernels so that they work regardless of the configuration. Often, I will assume that I will launch enough threads, which makes the kernel easier to write. Then, I will do something like this:
threads_per_block = 512;
num_blocks = SIZE_ARRAY/threads_per_block;
if(num_blocks*threads_per_block<SIZE_ARRAY)
num_blocks++;
my_kernel <<< num_blocks, threads_per_block >>> ( ... );
(except, of course, threads_per_block might be a define, or a command line argument, or iterated to test many configurations)
It is better to use more than one block for any kernel.
It seems to me that you are simply copying from one array to the other as a sequence of values with an offset.
If this is the case you can simply use the cudaMemcpy API call and specify cudaMemcpyDeviceToDevice as the kind, e.g. (assuming float elements; note that the size argument is in bytes):
cudaMemcpy(array1, array2, N * sizeof(float), cudaMemcpyDeviceToDevice); // x values
cudaMemcpy(array1 + halfway, array2 + halfway, N * sizeof(float), cudaMemcpyDeviceToDevice); // y values
The runtime will then handle the device-to-device copy for you.