Creating a 2D grid in CUDA for GPGPU using C++

I am trying to extend my grid from a 1D to a 2D grid. Is there any way to do this?
Here is my current code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
Alongside my #includes I have these definitions:
#define BLOCKS_PER_GRID 102
#define THREADS_PER_BLOCK 1024

Given that you want 1024 threads per block, the block can be easily reshaped to 2D.
32 x 32 = 1024
So your block will look like this:
dim3 Block(32,32); //1024 threads per block. Will only work for devices of at least 2.0 Compute Capability.
I don't know what your exact requirement is, but usually the number of blocks is not fixed (as you have defined in the macro). The number of blocks depends on the input data size, so that the grid scales with the data.
Going with your case, you have many options, but since BLOCKS_PER_GRID is 102 and 102 = 17 x 6, the nearest optimal shape for your grid comes out to be 17 x 6 or 6 x 17.
dim3 Grid(17,6);
Now you can call the kernel with these parameters:
kernel<<<Grid,Block>>>();
Inside the kernel, the 2-Dimensional index of the thread is calculated as follows:
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
Or if you follow the Row/Column convention instead of x/y, then:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
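Putting the pieces together, a minimal sketch might look like this (the kernel name, the data array, and the width/height parameters are illustrative, not taken from your code):
__global__ void kernel2D(float *data, int width, int height)
{
    int column = blockIdx.x * blockDim.x + threadIdx.x;
    int row    = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && column < width)        // guard threads that fall outside the data
        data[row * width + column] *= 2.0f;    // example per-element work
}

// host side
dim3 Block(32,32);
dim3 Grid(17,6);
kernel2D<<<Grid,Block>>>(data, width, height);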

You can also have a 2D grid of 1-dimensional threadblocks, in order to get around the limitation of 65535 blocks per grid dimension (for pre-cc3.0 devices). This may be an easier way of extending a fundamentally 1-D problem past the limit without introducing a 2-D array representation for the data.
Let's assume we have a DATA_ELEMENTS parameter defined to be the number of elements (one element per thread) that your kernel will work on. If DATA_ELEMENTS is larger than 65535*1024, then you cannot handle them all using a 1-D grid, if each thread handles only 1 element.
You can leave your THREADS_PER_BLOCK parameter the same. Your thread index calculation inside the kernel will change to something like:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You will want to be sure to condition your kernel calculations with something like:
if (idx < DATA_ELEMENTS){
    (kernel code)
}
Your grid dimensions will be as follows:
dim3 grid;
if (DATA_ELEMENTS > (65535*THREADS_PER_BLOCK)){ // create a 2-D grid
    int gridx = 65535; // could choose another number here
    int gridy = ((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)/gridx;
    if ((((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)%gridx) != 0) gridy++;
    grid.x=gridx;
    grid.y=gridy;
    grid.z=1;
}
else{ // create a 1-D grid
    int gridx = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
    grid.x=gridx;
    grid.y=1;
    grid.z=1;
}
and you would launch your kernel as:
kernel<<<grid, THREADS_PER_BLOCK>>>(...);
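For concreteness, the kernel skeleton above might be fleshed out like this sketch (the data array and the per-element work are placeholders):
__global__ void kernel(float *data)
{
    int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
    if (idx < DATA_ELEMENTS){
        data[idx] += 1.0f;   // per-element work goes here
    }
}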
Another method to tackle this kind of problem is to create a 1-D grid of some dimension (let's say the total number of threads in the grid is NUM_THREADS_PER_GRID), and have each thread work on more than one element in the array of data elements, using something like a for-loop or while-loop:
while (idx < DATA_ELEMENTS) {
    (code to process an element)
    idx += NUM_THREADS_PER_GRID;
}
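As a sketch, such a loop-based kernel could look like this (again, the array name and the per-element work are placeholders):
__global__ void kernel(float *data)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;   // equals NUM_THREADS_PER_GRID for a 1-D grid
    while (idx < DATA_ELEMENTS) {
        data[idx] += 1.0f;                 // process one element
        idx += stride;                     // jump forward by the whole grid
    }
}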

I like Robert's solutions above. The only comment I have about his first solution is that it seems one should make gridx as small as one can when DATA_ELEMENTS > (65535*THREADS_PER_BLOCK). The reason is that if the number of data elements is 65535*THREADS_PER_BLOCK + 1 and gridx is 65535, then 65535*2*THREADS_PER_BLOCK threads are launched, so almost half of the threads will do nothing. If gridx is smaller, there will be fewer threads that do nothing.
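A hedged sketch of that refinement, reusing the macros from the answer above: choose the smallest gridx whose matching gridy still fits under the 65535 limit, so the number of idle threads is minimized.
int total_blocks = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
dim3 grid;
grid.x = (total_blocks + 65534) / 65535;        // smallest gridx that keeps gridy <= 65535
grid.y = (total_blocks + grid.x - 1) / grid.x;  // round up so every element is covered
grid.z = 1;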

Related

numba: how to understand the stride [duplicate]

I was wondering, why does one need to use a grid-stride loop like the following:
for (int i = index; i < ITERATIONS; i += stride)
{
    C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling the kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim.x = 4, and therefore stride will be equal to 20.
My point is that, whenever one uses such an approach, stride will always be equal to or bigger than the number of elements in the calculation, so as soon as the increment happens the loop will be over.
And here is the question: why does one need to use a loop or stride at all, why not just run with the index, like this?:
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question: how can I know, in this particular case, how many threads are running on my GPU simultaneously before making a "jump" to another portion of a very big array (e.g. 2000000)?
My point is that, whenever one uses such an approach, stride will always be equal to or bigger than the number of elements in the calculation, so as soon as the increment happens the loop will be over.
Therein lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device-wide occupancy for your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident" and process more than one input/output pair per thread. The grid stride also preserves whatever memory coalescing and cache coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.
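To address the second question (how many threads run before the "jump"), here is a hedged host-side sketch of sizing such a launch with the CUDA occupancy API; the kernel bench and the arrays A, B, C are from the question, while the block size of 256 is an assumption:
int blockSize = 256;    // assumed; tune for your kernel
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, bench, blockSize, 0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// enough resident blocks to saturate every SM; each thread then strides through the data
int numBlocks = blocksPerSM * prop.multiProcessorCount;
bench<<<numBlocks, blockSize>>>(A, B, C);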

Calculating Grid and Block dimensions of a Kernel

Suppose you want to write a kernel that operates on an image of size 400x900 pixels. You also want to assign one GPU thread to each pixel. Your thread blocks are square and you want to use the maximum number of threads per block possible on the device. The maximum number of threads per block is 1024. How would you select the grid dimensions and block dimensions of your kernel?
My understanding of how this works is that, assigning one thread to each pixel, I'd need 360,000 (400x900) threads. The data hierarchy goes grid -> block -> threads. I think the formula would end up being 360,000 = (# of blocks)*(# of threads per block), with the # of blocks having to be a perfect square and a multiple of 32.
I've tried the numbers from 2 to 4096 and none of them give me an even quotient when divided into 360,000. Does that mean the number of threads can be a decimal number?
When processing 2D images with CUDA, a natural intuition is to use a 2D block and grid shape. If we want to use the maximum possible block size, we have to make sure that the product of its dimensions does not exceed the block size limit. Keeping in mind the limit of 1024 threads per block, the following are a few examples of valid block sizes.
dim3 block(32,32); //32 x 32 = 1024
or
dim3 block(64,16); //64 x 16 = 1024
or
dim3 block(16,64); //16 x 64 = 1024 ... Duh
Next comes the calculation of the 2D grid size. If we want to map a thread to every pixel, then the grid should be created such that the total number of threads in each dimension is at least equal to the corresponding image dimension. Remember that grid size means the number of blocks in each dimension, so the total number of threads in a dimension is equal to the product of the grid size and block size in that dimension. For a 2D grid, the number of threads in the X dimension is equal to block.x * grid.x, and in the Y dimension equal to block.y * grid.y.
Assuming you have an image of size 400 x 900, the total number of threads in each corresponding dimension should be at least the same.
Let's say you choose a block of size (32,32). Then the number of blocks for the x and y dimensions of the image would be 400/32 and 900/32. Neither of the image dimensions is an integer multiple of the corresponding block dimension, so due to integer division we would end up creating a grid of size 12 x 28, which would result in a total number of threads equal to 384 x 896 (because 32 x 12 = 384 and 32 x 28 = 896).
As we can see, the total number of threads in each dimension is less than the corresponding image dimension. What we need to do is round up the number of blocks, so that if an image dimension is not a multiple of the block dimension, we create an additional block to cover the remaining pixels.
The following are two ways to do that.
Instead of integer division to calculate the number of blocks, we use floating point division and ceil the results.
int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = ceil( float(image_width)/block.x );
grid.y = ceil( float(image_height)/block.y );
Another smart way is to use the following formula:
int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1 )/block.x;
grid.y = (image_height + block.y - 1 )/block.y;
When the grid is created in either of the above-mentioned ways, you will end up creating a grid of size 13 x 29, which will result in a total number of threads equal to 416 x 928.
Now the total number of threads in each dimension is greater than the corresponding image dimension. This will cause some of the threads to access memory outside the image bounds, resulting in undefined behavior. The solution is to perform bounds checks inside the kernel and do processing only in those threads which fall inside the image bounds. Of course, to do that, we need to pass the image dimensions as arguments to the kernel. The following sample kernel shows this process.
__global__ void kernel(unsigned char* image, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index or row number
    if(xIndex < width && yIndex < height)
    {
        //Do processing only here
    }
}
TLDR
Create the grid and block like this:
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1)/block.x;
grid.y = (image_height + block.y - 1)/block.y;
Call the kernel and pass image dimensions as arguments like this:
kernel<<<grid, block>>>(...., image_width, image_height);
Perform bounds checks inside the kernel like this:
__global__ void kernel(unsigned char* image, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index or row number
    if(xIndex < width && yIndex < height)
    {
        //Do processing only here
    }
}
Usually, you make the dimensions the next multiple up of the size you need, and then do a bounds check in the kernel.
A simple example is here:
https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
Here the number of blocks is calculated so that the total number of threads is equal to, or up to 255 above, the number of threads needed:
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
And in the kernel, the calculation is only performed if it is required:
__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

Cuda block/grid dimensions: when to use dim3?

I need some clarification regarding the use of dim3 to set the number of threads in my CUDA kernel.
I have an image in a 1D float array, which I'm copying to the device with:
checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));
Now I need to set the grid and block sizes to launch my kernel:
dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)
I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using
unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x);
myKernel<<<num_blocks, 512>>>(...)
instead?
Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
And when I'm not using dim3, I'll just use one index?
Thank you very much,
The way you arrange the data in memory is independent of how you configure the threads of your kernel.
Memory is always a 1D contiguous space of bytes. However, the access pattern depends on how you interpret your data and also on how you access it with 1D, 2D, and 3D blocks of threads.
dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
The same applies to both the block and grid dimensions.
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
So, in both cases: dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...) you will always have access to threadIdx.y and threadIdx.z.
As the thread IDs start at zero, you can calculate a memory position in row-major order using the y dimension as well:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = img.col * y + x;
In the 1D launch configuration this reduces to gid = x, because blockIdx.y and threadIdx.y will be zero.
To sum up: it does not matter whether you use a dim3 structure. I would just be explicit about where the configuration of the threads is defined; the 1D, 2D, or 3D access pattern depends on how you interpret your data and on how you access it with 1D, 2D, or 3D blocks of threads.
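As a small illustration of that point (using the names from the question), the two launch forms configure the kernel identically, because unspecified dim3 components default to 1:
dim3 blockDims(512);          // same as dim3 blockDims(512, 1, 1)
dim3 gridDims(num_blocks);    // same as dim3 gridDims(num_blocks, 1, 1)
myKernel<<<gridDims, blockDims>>>(...);   // identical configuration to:
myKernel<<<num_blocks, 512>>>(...);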

Parallel reduction example

I found this parallel reduction code from Stanford which uses shared memory.
The code is an example with 1<<18 (= 262144) elements and gets correct results.
Why do I get correct results for certain numbers of elements, while for other numbers, like 200000 or 25000, the results differ from what is expected?
It looks to me like it's always launching the needed number of thread blocks.
// launch a single block to compute the sum of the partial sums
block_sum<<<1,num_blocks,num_blocks * sizeof(float)>>>
This code causes the bug.
Suppose num_blocks is 13. Then in the kernel, the first offset, blockDim.x / 2, will be 6 (by integer division), and
if(threadIdx.x < offset)
{
    // add a partial sum upstream to our own
    sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
will only add the first 12 elements (sdata[0..5] += sdata[6..11]), so the 13th partial sum, sdata[12], is dropped, causing the bug.
When the element count is 200000 or 250000, num_blocks comes out odd and the bug appears. Strictly speaking, the halving loop only covers every element when num_blocks is a power of two, since any odd intermediate value loses an element on that pass.
This kernel is sensitive to the blocking parameters (grid and threadblock size) of the kernel. Are you invoking it with enough threads to cover the input size?
It is more robust to formulate kernels like this with for loops. Instead of:
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
something like:
for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}
The source code in the CUDA Handbook has lots of examples of "blocking agnostic" code. The reduction code is here:
https://github.com/ArchaeaSoftware/cudahandbook/tree/master/reduction
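For illustration, here is a minimal sketch of a blocking-agnostic partial-sum kernel in that style; it is not the Stanford or CUDA Handbook code, the names are illustrative, and blockDim.x is assumed to be a power of two:
__global__ void block_sum_robust(const float *in, float *out, size_t N)
{
    extern __shared__ float sdata[];
    // grid-stride loop: each thread accumulates however many elements it must
    float sum = 0.0f;
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sdata[threadIdx.x] = sum;
    __syncthreads();
    // tree reduction within the block; safe because blockDim.x is a power of two
    for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];   // one partial sum per block
}
Launched with any grid size that fits the device, e.g. block_sum_robust<<<num_blocks, 512, 512*sizeof(float)>>>(in, out, N), it produces correct partial sums regardless of how N relates to the grid size.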