Calculating Grid and Block dimensions of a Kernel - cuda

Suppose you want to write a kernel that operates on an image of size 400x900 pixels. You also want to assign one GPU thread to each pixel. Your thread blocks are square and you want to use the maximum number of threads per block possible on the device. The maximum number of threads per block is 1024. How would you select the grid dimensions and block dimensions of your kernel?
My understanding of how this works is that attributing one thread to each pixel, I'd need 360,000 (400x900) threads. The data hierarchy goes grid -> block -> threads. I think the formula would end up being 360,000 = (# of blocks)*(# of threads per block), with # of blocks having to be a perfect square number and multiple of 32.
I've tried the numbers from 2 to 4096 and none of them gives me a whole-number quotient when dividing into 360,000. Does that mean the number of threads can be a decimal number?

When processing 2D images with CUDA, the natural choice is a 2D block and grid shape. If we want the maximum possible block size, we have to make sure that the product of its dimensions does not exceed the block size limit. Keeping the limit of 1024 threads per block in mind, the following are a few examples of valid block sizes.
dim3 block(32,32); //32 x 32 = 1024
or
dim3 block(64,16); //64 x 16 = 1024
or
dim3 block(16,64); //16 x 64 = 1024 ... Duh
Next comes the calculation of the 2D grid size. If we want to map one thread to every pixel, the grid should be created such that the total number of threads in each dimension is at least equal to the corresponding image dimension. Remember that grid size means the number of blocks in each dimension, so the total number of threads in a dimension is the product of grid size and block size in that dimension. For a 2D grid, the number of threads in the X dimension is block.x * grid.x and in the Y dimension block.y * grid.y.
Assuming you have an image of size 400 x 900, the total number of threads in each dimension should be at least the same.
Let's say you choose a block of size (32,32). Then the number of blocks for the x and y dimensions of the image should be 400/32 and 900/32. But neither image dimension is an integer multiple of the corresponding block dimension, so due to integer division we end up with a grid of size 12 x 28, which gives a total number of threads equal to 384 x 896 (because 32 x 12 = 384 and 32 x 28 = 896).
As we can see, the total number of threads in each dimension is less than the corresponding image dimension. What we need to do is round up the number of blocks so that, if the image dimension is not a multiple of the block dimension, we create an additional block that covers the remaining pixels.
Following are 2 ways to do that.
Instead of integer division to calculate the number of blocks, we use floating point division and ceil the results.
int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = ceil( float(image_width)/block.x );
grid.y = ceil( float(image_height)/block.y );
Another way, which avoids floating point entirely, is to use the following integer formula:
int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1 )/block.x;
grid.y = (image_height + block.y - 1 )/block.y;
When the grid is created in the above mentioned ways, you will end up creating a grid of size 13 x 29 which will result in total number of threads equal to 416 x 928.
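To check the 12 x 28 versus 13 x 29 arithmetic above on the host, here is a minimal plain C++ sketch. The helper names ceil_div and threads_covered are made up for this illustration and are not part of the CUDA API; the sketch assumes non-negative sizes and a positive block dimension.

```cpp
// Hypothetical host-side helpers (not CUDA API) for the round-up formula.

// Number of blocks needed to cover n items with `block` threads per block.
// Integer-only round-up; assumes n >= 0 and block > 0.
int ceil_div(int n, int block) {
    return (n + block - 1) / block;
}

// Total number of threads launched along one dimension for that grid size.
int threads_covered(int n, int block) {
    return ceil_div(n, block) * block;
}
```

For the 400 x 900 image with a 32 x 32 block, plain integer division gives 400/32 = 12 and 900/32 = 28, while ceil_div gives 13 and 29, so the launched threads cover 416 x 928.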
Now we have a total number of threads in each dimension that is greater than the corresponding image dimension. Some of the threads would therefore access memory outside the image bounds, causing undefined behavior. The solution is to perform bound checks inside the kernel and do the processing only in those threads that fall inside the image bounds. To do that, we need to pass the image dimensions as arguments to the kernel. The following sample kernel shows this process.
__global__ void kernel(unsigned char* image, int width, int height)
{
int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index or row number
if(xIndex < width && yIndex < height)
{
//Do processing only here
}
}
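To get a feel for how many threads the bound check filters out, the launch can be emulated on the CPU. This is only a sketch under the assumption that the grid is rounded up as described above; count_active_threads and LaunchCount are hypothetical names, and the nested loops stand in for blockIdx and threadIdx.

```cpp
// CPU emulation of the bound-checked launch (hypothetical helper, not CUDA API).
struct LaunchCount {
    long long launched;  // threads created by the launch
    long long active;    // threads that pass the bound check
};

LaunchCount count_active_threads(int width, int height, int block_x, int block_y) {
    const int grid_x = (width + block_x - 1) / block_x;   // rounded-up grid
    const int grid_y = (height + block_y - 1) / block_y;
    LaunchCount c{0, 0};
    for (int by = 0; by < grid_y; ++by)              // plays the role of blockIdx.y
        for (int bx = 0; bx < grid_x; ++bx)          // blockIdx.x
            for (int ty = 0; ty < block_y; ++ty)     // threadIdx.y
                for (int tx = 0; tx < block_x; ++tx) {  // threadIdx.x
                    const int xIndex = bx * block_x + tx;
                    const int yIndex = by * block_y + ty;
                    ++c.launched;
                    if (xIndex < width && yIndex < height) ++c.active;
                }
    return c;
}
```

For the 400 x 900 image and a 32 x 32 block this counts 416 x 928 = 386,048 launched threads, of which exactly 360,000 do work.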
TLDR
Create the grid and block like this:
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1)/block.x;
grid.y = (image_height + block.y - 1)/block.y;
Call the kernel and pass image dimensions as arguments like this:
kernel<<<grid, block>>>(...., image_width, image_height);
Perform bound checks inside the kernel like this:
__global__ void kernel(unsigned char* image, int width, int height)
{
int xIndex = blockIdx.x * blockDim.x + threadIdx.x; //image x index or column number
int yIndex = blockIdx.y * blockDim.y + threadIdx.y; //image y index or row number
if(xIndex < width && yIndex < height)
{
//Do processing only here
}
}

Usually, you make the dimensions the next multiple up of the size you need, and then do a bound check in the kernel.
A simple example is here:
https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
Here the number of blocks is calculated so that the total number of threads is equal to the number needed or exceeds it by at most 255 (the grid is rounded up to the next multiple of 256).
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
And in the kernel, the calculation is only performed if it is required:
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
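As a quick sanity check on how many surplus threads the (N+255)/256 launch creates, here is a small host-side sketch. The function name excess_threads is invented for this example, and it assumes n >= 1 and a positive block size.

```cpp
// Hypothetical helper: number of launched threads that fail the `i < n` check.
// Assumes n >= 1 and threads_per_block >= 1.
int excess_threads(int n, int threads_per_block) {
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    return blocks * threads_per_block - n;
}
```

With 256 threads per block the surplus is always between 0 and 255: it is 0 when N is a multiple of 256 and 255 when N is one more than a multiple of 256.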

Related

CUDA number of data cannot be divide by the CUDA threads evenly

For example, there are two 4-threads, but I have 5 data, the first 0-3 can be mapped to the first 4-threads, how about the rest, it only says there might be a runtime error, but how to fix it?
I think I ask this question in the wrong direction, now suppose I have
perfromwork<<<2,2>>>;
Now my dataIndex calculated by this pseudocode is smaller than the number of data elements(N=5), so what to do with the last one (5-2x2=1)? If I use another block for it, it will come across the same problem, the <<<2, 2>>> block will create a larger dataIndex.
There are two canonical approaches here:
1. Size the grid to be larger than or equal to the data set size, and make sure to use a "thread check" that prevents unneeded extra threads from doing any work.
2. Use a grid-stride loop, which allows the grid size to be determined independently from the data set size (if you wish) while still providing correct results.
Vector add example kernels for each:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size) // thread check
z[idx] = x[idx] + y[idx];
}
The above kernel does not use a grid-stride loop. It will require that you size the grid to be larger than or equal to the data set size, in order for all elements to be processed. That sizing code might look like this:
int size = MY_DATA_SET_SIZE;
dim3 block(256); // this is threads per block, the choice here is not critical for correctness, but must be 1 or larger and less than or equal to 1024;
dim3 grid((size+block.x-1)/block.x);
vectorAdd<<<grid,block>>>(...);
A kernel implementing a grid-stride loop to do the same thing might look like this:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < size; idx += blockDim.x*gridDim.x)
z[idx] = x[idx] + y[idx];
}
In this case, grid sizing can be arbitrary (1 or larger) and still yield correct results.
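To convince yourself that the grid-stride loop covers every element exactly once, even for the <<<2,2>>> launch with 5 elements from the question, the index pattern can be emulated on the CPU. This is a sketch only; grid_stride_visits is a hypothetical helper, and the two outer loops stand in for blockIdx.x and threadIdx.x.

```cpp
#include <vector>

// CPU emulation of the grid-stride index pattern (hypothetical helper).
// Returns, for each element, how many times some thread visits it.
std::vector<int> grid_stride_visits(int size, int grid_dim, int block_dim) {
    std::vector<int> visits(size, 0);
    for (int block = 0; block < grid_dim; ++block)          // blockIdx.x
        for (int thread = 0; thread < block_dim; ++thread)  // threadIdx.x
            for (int idx = thread + block_dim * block; idx < size;
                 idx += block_dim * grid_dim)               // stride = threads in grid
                ++visits[idx];
    return visits;
}
```

With size = 5, grid_dim = 2 and block_dim = 2, thread 0 of block 0 handles elements 0 and 4, and the other three threads handle one element each, so every entry of the result is 1.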

How to calculate individual thread coordinate indices in 3 D grids?

I have a 3 D grid consisting of 3D blocks. I wish to calculate the individual thread indexes of each coordinates every time the kernel is being called. I have these parameters:
dim3 blocks_query(32,32,32);
dim3 threads_query(32,32,32);
kernel<<< blocks_query,threads_query >>>();
Inside the kernel, I wish to calculate the individual values of x,y and z coordinates for instance, x=0,y=0,z=0, x=0,y=0,z=1, x=0,y=0,z=2,....thanks in advance....
Individual thread indices (x, y, z coordinates) can be calculated inside the kernel as follows:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
Keep in mind that the number of threads per block is limited by the GPU, so the block size you have created is invalid:
dim3 threads_query(32,32,32);
That is 32 x 32 x 32 = 32768 threads per block, which is not supported by any current CUDA device. Currently, a maximum of 1024 threads per block is supported on GPUs of compute capability 2.0 and above, and a maximum of 512 threads on older GPUs. You should reduce the block size, otherwise the kernel will not launch.
Another thing to note is that you are creating a 3D grid, which is supported only on CUDA GPUs of compute capability 2.0 and above.
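A block shape can be checked on the host before launching. The sketch below hard-codes the limits documented for compute capability 2.0 and later (1024 threads per block, at most 1024 in x and y and 64 in z); that is an assumption of this example, and on a real device you would read maxThreadsPerBlock and maxThreadsDim from cudaGetDeviceProperties instead. The function name block_is_valid is made up.

```cpp
// Hypothetical host-side check of a block shape against assumed device limits.
// The limits below are those documented for compute capability 2.0 and later.
bool block_is_valid(unsigned x, unsigned y, unsigned z) {
    const unsigned long long max_threads = 1024;
    const unsigned max_x = 1024, max_y = 1024, max_z = 64;
    if (x == 0 || y == 0 || z == 0) return false;          // every dimension must be >= 1
    if (x > max_x || y > max_y || z > max_z) return false; // per-dimension limits
    return static_cast<unsigned long long>(x) * y * z <= max_threads;
}
```

Under these limits (32,32,32) is rejected because it asks for 32768 threads, while (8,8,8) with 512 threads and (32,32,1) with 1024 threads are accepted.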
UPDATE
Suppose the dimensions of your 3D data are xDim, yDim and zDim, then a generic grid of thread blocks can be formed as follows:
dim3 threads_query(8,8,8);
dim3 blocks_query;
blocks_query.x = (xDim + threads_query.x - 1)/threads_query.x;
blocks_query.y = (yDim + threads_query.y - 1)/threads_query.y;
blocks_query.z = (zDim + threads_query.z - 1)/threads_query.z;
The above approach creates a total number of threads equal to or greater than the total data size. The extra threads may cause invalid memory accesses, so perform bound checks inside the kernel. You can do this by passing xDim, yDim and zDim as kernel arguments and adding the following line inside the kernel:
if(x>=xDim || y>=yDim || z>=zDim) return;
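Here is the same grid calculation as a small host-side sketch that can be run without a GPU. Dim3 is a hypothetical stand-in for CUDA's dim3, grid_for_volume is an invented name, and the 100 x 50 x 20 volume mentioned below is just an assumed example size.

```cpp
// Hypothetical stand-in for CUDA's dim3 so the sketch compiles as plain C++.
struct Dim3 {
    unsigned x, y, z;
};

// Rounded-up number of blocks per dimension for a volume of
// xDim * yDim * zDim elements; assumes all block dimensions are >= 1.
Dim3 grid_for_volume(unsigned xDim, unsigned yDim, unsigned zDim, Dim3 threads) {
    Dim3 blocks;
    blocks.x = (xDim + threads.x - 1) / threads.x;
    blocks.y = (yDim + threads.y - 1) / threads.y;
    blocks.z = (zDim + threads.z - 1) / threads.z;
    return blocks;
}
```

For a 100 x 50 x 20 volume and (8,8,8) blocks this gives a 13 x 7 x 3 grid, that is 104 x 56 x 24 threads, which is why the bound check is needed.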

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *((float*)((char*)Xsub.data + row*Xsub.pitch) + col); //parenthesized so that col offsets the pointer, not the loaded value
XTs[col][row] = *((float*)((char*)XTsub.data + row*XTsub.pitch) + col);
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running this on a Kepler card, GTX 670 4GB.
A smaller block size like 16x16 or 8x8 may be faster. These slides also show that a larger, non-square block/shared memory size may be faster for particular matrix sizes.
For the shared memory allocation, add a dummy element to the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid bank conflicts.
Try to unroll the inner for loop by using #pragma unroll.
On the other hand, you probably won't be much faster than MATLAB GPU code for large enough A'*A, since the performance bottleneck of MATLAB is the invocation overhead rather than the kernel performance.
The cuBLAS gemm routine (cublasSgemm() for single precision) probably has the highest performance for matrix multiplication. You could compare yours with it.
The MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's an open source project, so you may also get some ideas from their code.
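To see why the [BLOCK_SIZE][BLOCK_SIZE+1] padding suggested above helps, the bank mapping can be modelled on the host. This sketch assumes 32 banks of 4-byte words with consecutive words in consecutive banks (the default mode on compute capability 2.0 and later) and looks at the 32 threads of a warp walking down one column of a float tile. The function name is hypothetical.

```cpp
#include <set>

// Hypothetical model: number of distinct shared memory banks touched when
// thread t of a warp (t = 0..31) reads tile[t][column] from a float tile
// whose rows are row_stride_words floats apart. Assumes 32 banks of 4 bytes.
int distinct_banks_for_column_walk(int row_stride_words, int column) {
    std::set<int> banks;
    for (int t = 0; t < 32; ++t) {
        const int word_index = t * row_stride_words + column;
        banks.insert(word_index % 32);  // bank = word index modulo bank count
    }
    return static_cast<int>(banks.size());
}
```

With a row stride of 32 words all 32 threads land in a single bank (a 32-way conflict), while a stride of 33 spreads them over all 32 banks.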

creation 2D grid in CUDA for GPGPU using C++

I am trying to extend my grid from a 1d to a 2d grid. Is there any way to do this?
Here is my current code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
In the #include list I have these definitions:
#define BLOCKS_PER_GRID 102
#define THREADS_PER_BLOCK 1024
Given that you want 1024 threads per block, the block can be easily reshaped to 2D.
32 x 32 = 1024;
So your block will look like this:
dim3 Block(32,32); //1024 threads per block. Will only work for devices of at least 2.0 Compute Capability.
I don't know your exact requirement, but usually the number of blocks is not fixed (as you have defined it in the macro). The number of blocks depends on the input data size, so that the grid scales dynamically.
Going with your case, you have many options, but the most natural 2D shape for your 102 blocks is 17 x 6 or 6 x 17.
dim3 Grid(17,6);
Now you can call the kernel with these parameters:
kernel<<<Grid,Block>>>();
Inside the kernel, the 2-Dimensional index of the thread is calculated as follows:
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
Or if you follow the Row/Column convention instead of x/y, then:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
You can also have a 2D grid of 1-dimensional threadblocks, in order to get around the limitation of 65535 blocks per grid dimension (for pre-cc3.0 devices). This may be an easier way of extending a fundamentally 1-D problem past the limit without introducing a 2-D array representation for the data.
Let's assume we have a DATA_ELEMENTS parameter defined to be the number of elements (one element per thread) that your kernel will work on. If DATA_ELEMENTS is larger than 65535*1024, then you cannot handle them all using a 1-D grid, if each thread handles only 1 element.
You can leave your THREADS_PER_BLOCK parameter the same. Your thread index calculation inside the kernel will change to something like:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You will want to be sure to guard your kernel calculations with something like:
if (idx < DATA_ELEMENTS){
// kernel code
}
Your grid dimensions will be as follows:
dim3 grid;
if (DATA_ELEMENTS > (65535*THREADS_PER_BLOCK)){ // create a 2-D grid
int gridx = 65535; // could choose another number here
int gridy = ((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)/gridx;
if ((((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)%gridx) != 0) gridy++;
grid.x=gridx;
grid.y=gridy;
grid.z=1;
}
else{ // create a 1-D grid
int gridx = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
grid.x=gridx;
grid.y=1;
grid.z=1;
}
and you would launch your kernel as:
kernel<<<grid, THREADS_PER_BLOCK>>>(...);
Another method to tackle this kind of problem is to create a 1-D grid of some dimension (let's say the total number of threads in the grid is NUM_THREADS_PER_GRID), and have each thread work on more than one element in the array of data elements, using something like a for-loop or while-loop:
while (idx < DATA_ELEMENTS) {
// code to process element idx
idx += NUM_THREADS_PER_GRID;
}
I like Robert's solutions above. My only comment about his first solution is that gridx should be made as small as the grid limits allow when DATA_ELEMENTS > (65535*THREADS_PER_BLOCK). The reason is that if the number of data elements is 65535*THREADS_PER_BLOCK + 1 and gridx is 65535, then 65535*2*THREADS_PER_BLOCK threads are launched, so almost half of the threads will do nothing. If gridx is smaller, fewer threads will sit idle.
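One possible way to act on this, sketched on the host: first pick the smallest number of grid rows that fits within the 65535 limit, then spread the blocks evenly over those rows. The names Grid2D and balanced_grid are invented for this example, the 65535 default is the per-dimension limit discussed above, and the sketch assumes at least one data element.

```cpp
// Hypothetical helper: a 2-D grid of 1-D blocks with as few idle blocks as
// this simple scheme allows. Assumes data_elements >= 1.
struct Grid2D {
    long long x, y;
};

Grid2D balanced_grid(long long data_elements, long long threads_per_block,
                     long long max_grid_dim = 65535) {
    // Total blocks needed, rounded up.
    const long long blocks = (data_elements + threads_per_block - 1) / threads_per_block;
    // Fewest rows such that each row fits within max_grid_dim.
    const long long gy = (blocks + max_grid_dim - 1) / max_grid_dim;
    // Spread the blocks evenly across those rows, rounded up.
    const long long gx = (blocks + gy - 1) / gy;
    return {gx, gy};
}
```

For 65535*1024 + 1 elements and 1024 threads per block, 65536 blocks are needed; this yields a 32768 x 2 grid with no idle blocks, compared with 65535 x 2 = 131070 blocks when gridx is fixed at 65535.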

CUDA-Kernel supposed to be dynamic crashes depending upon block size

I want to do a Sparse Matrix, Dense Vector multiplication. Lets assume the only storage format for compressing the entries in the Matrix is compressed row storage CRS.
My kernel looks like the following:
__global__ void
krnlSpMVmul1(
float *data_mat,
int num_nonzeroes,
unsigned int *row_ptr,
float *data_vec,
float *data_result)
{
extern __shared__ float local_result[];
local_result[threadIdx.x] = 0;
float vector_elem = data_vec[blockIdx.x];
unsigned int start_index = row_ptr[blockIdx.x];
unsigned int end_index = row_ptr[blockIdx.x + 1];
for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
local_result[threadIdx.x] += (data_mat[index] * vector_elem);
__syncthreads();
// Reduction
// Writing accumulated sum into result vector
}
As you can see the kernel is supposed to be as naive as possible and it even does a few things wrong (e.g. vector_elem is just not always the correct value). I am aware of those things.
Now to my problem:
Suppose I am using a block size of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which is the result of the 17th multiplication, is just zero. Using a block size of 16 or 128 threads fixes the problem.
Since I am fairly new to CUDA I might have overlooked the simplest thing but I cannot make up any more situations to look at.
Help is very much appreciated!
Edit towards talonmies comment:
I printed the values which were in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:
The reduction part:
int k = blockDim.x / 2;
while (k != 0)
{
if (threadIdx.x < k)
local_result[threadIdx.x] += local_result[threadIdx.x + k];
else
return;
__syncthreads();
k /= 2;
}
and how I write the results back to global memory:
data_result[blockIdx.x] = local_result[0];
That's all I've got.
Right now I am testing a scenario with a matrix consisting of a single row with 17 elements, all of which are nonzero. The buffers look like this in pseudocode:
float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }
And here is an excerpt of my wrapper function:
float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;
// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));
// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
dev_data_mat,
data_mat,
num_nonzeroes * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_row_ptr,
row_ptr,
num_row_ptr * sizeof(unsigned int),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_vec,
data_vec,
dim_x * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_result,
data_result,
dim_y * sizeof(float),
cudaMemcpyHostToDevice));
// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);
// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
dev_data_mat,
num_nonzeroes,
dev_row_ptr,
dev_data_vec,
dev_data_result);
I hope this is straightforward but will explain things if it is of any interest.
One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again just the last value is not being computed.
Your dynamically allocated shared memory size is incorrect. Right now you are doing this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)
The shared memory size must be given in bytes, not in elements. In your 64 threads per block case, you are allocating 64 bytes, which is only enough shared memory for 16 float-sized words (64 / sizeof(float)). That explains why the magic 17 entries per row case fails: you have a shared buffer overflow, which will trigger a protection fault in the GPU and abort the kernel.
You should be doing something like this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)
That will give you the correct dynamic shared memory size and should eliminate the problem.
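The byte arithmetic behind this can be checked on the host with a tiny sketch. The helper names are made up for illustration, and the numbers assume a 4-byte float, which holds on the platforms CUDA targets.

```cpp
#include <cstddef>

// Hypothetical helpers for the dynamic shared memory size arithmetic.

// How many floats fit into a dynamic shared memory allocation of this many bytes.
std::size_t floats_that_fit(std::size_t shared_bytes) {
    return shared_bytes / sizeof(float);
}

// Bytes to request as the third launch parameter for one float per thread.
std::size_t shared_bytes_for(std::size_t block_size) {
    return block_size * sizeof(float);
}
```

Passing BLOCK_SIZE = 64 as the byte count leaves room for only 16 floats, and BLOCK_SIZE = 128 leaves room for 32, which matches the failures at 17 and 33 nonzeroes reported in the question. Requesting shared_bytes_for(BLOCK_SIZE) gives one float per thread.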