CUDA block/grid dimensions: when to use dim3? - cuda

I need some clarification regarding the use of dim3 to set the number of threads in my CUDA kernel.
I have an image in a 1D float array, which I'm copying to the device with:
checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));
Now I need to set the grid and block sizes to launch my kernel:
dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)
I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using
unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x);
myKernel<<<num_blocks, 512>>>(...)
instead?
Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
And when I'm not using dim3, I'll just use one index?
Thank you very much,

The way you arrange the data in memory is independent of how you configure the threads of your kernel.
Memory is always a 1D contiguous space of bytes. However, the access pattern depends on how you interpret your data and also on how you access it by 1D, 2D, or 3D blocks of threads.
dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
The same applies to both the block and the grid dimensions.
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
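For illustration, a minimal sketch of those defaults (the variable names are just examples):
dim3 blockDims(512);    // equivalent to dim3 blockDims(512, 1, 1)
dim3 gridDims(64, 2);   // equivalent to dim3 gridDims(64, 2, 1)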
So, in both cases: dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...) you will always have access to threadIdx.y and threadIdx.z.
Since thread IDs start at zero, you can calculate a memory position in row-major order, also using the y dimension:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = img.col * y + x;
This works even for a 1D launch, because blockIdx.y and threadIdx.y will then be zero, so gid reduces to x.
To sum up: it does not matter whether you use a dim3 structure. However, using dim3 makes it clear where the thread configuration is defined, and the 1D, 2D, or 3D access pattern depends on how you interpret your data and on how you access it by 1D, 2D, or 3D blocks of threads.
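As a concrete sketch under the question's assumptions (a 1D float image of img.row * img.col * 3 elements; total is a hypothetical element-count parameter), the two launch forms below are equivalent, and the integer round-up avoids the pitfall of applying ceil to an already-truncated integer division:
dim3 blockDims(512);
unsigned int num_blocks = (img.row * img.col * 3 + blockDims.x - 1) / blockDims.x; // integer round-up
myKernel<<<num_blocks, blockDims>>>(...);  // scalar grid count, dim3 block
myKernel<<<num_blocks, 512>>>(...);        // both launch a 512x1x1 block over a num_blocks x 1 x 1 grid
// Inside the kernel, a single index suffices for 1D data:
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < total) { /* process element i */ } // guard the tail block; 'total' is hypothetical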

Related

How to calculate individual thread coordinate indices in 3D grids?

I have a 3D grid consisting of 3D blocks. I wish to calculate the individual thread indices for each coordinate every time the kernel is called. I have these parameters:
dim3 blocks_query(32,32,32);
dim3 threads_query(32,32,32);
kernel<<< blocks_query,threads_query >>>();
Inside the kernel, I wish to calculate the individual values of the x, y and z coordinates, for instance x=0,y=0,z=0; x=0,y=0,z=1; x=0,y=0,z=2; .... Thanks in advance.
Individual thread indices (x, y, z coordinates) can be calculated inside the kernel as follows:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
Keep in mind that the number of threads per block is limited by the GPU. So the block size you have created is invalid.
dim3 threads_query(32,32,32)
That equals 32768 threads per block, which is not supported by any current CUDA device. Currently, a maximum of 1024 threads per block is supported on GPUs of compute capability 2.0 and above, and a maximum of 512 threads on older GPUs. You should reduce the block size, otherwise the kernel will not launch.
Another thing to note is that you are creating a 3D grid, which is supported only on CUDA GPUs of compute capability 2.0 and above.
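If in doubt, these limits can be queried at runtime with the standard CUDA runtime API (a minimal sketch; device 0 is assumed):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}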
UPDATE
Suppose the dimensions of your 3D data are xDim, yDim and zDim, then a generic grid of thread blocks can be formed as follows:
dim3 threads_query(8,8,8);
dim3 blocks_query;
blocks_query.x = (xDim + threads_query.x - 1)/threads_query.x;
blocks_query.y = (yDim + threads_query.y - 1)/threads_query.y;
blocks_query.z = (zDim + threads_query.z - 1)/threads_query.z;
The above approach will create a total number of threads equal to or greater than the total data size. The extra threads may cause invalid memory accesses, so perform bounds checks inside the kernel. You can do this by passing xDim, yDim and zDim as kernel arguments and adding the following line inside the kernel:
if(x>=xDim || y>=yDim || z>=zDim) return;
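Putting the pieces together, a minimal kernel sketch under these assumptions (a flat row-major data layout is assumed; the per-element work is a placeholder):
__global__ void kernel(const float *data, float *out, int xDim, int yDim, int zDim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= xDim || y >= yDim || z >= zDim) return;  // discard out-of-range threads
    int gid = (z * yDim + y) * xDim + x;              // flat row-major index
    out[gid] = data[gid];                             // placeholder per-element work
}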

Creating a 2D grid in CUDA for GPGPU using C++

I am trying to extend my grid from a 1D to a 2D grid. Is there any way to do this?
Here is my current code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
In the #include list I have these definitions:
#define BLOCKS_PER_GRID 102
#define THREADS_PER_BLOCK 1024
Given that you want 1024 threads per block, the block can be easily reshaped to 2D.
32 x 32 = 1024;
So your block will look like this:
dim3 Block(32,32); //1024 threads per block. Will only work for devices of at least 2.0 Compute Capability.
I don't know what your exact requirement is, but usually the number of blocks is not fixed (as you have defined in the macro). The number of blocks depends on the input data size, so that the grid scales dynamically.
Going with your case, you have many options, but the nearest fit for your 102 blocks comes out to be 17 x 6 or 6 x 17.
dim3 Grid(17,6);
Now you can call the kernel with these parameters:
kernel<<<Grid,Block>>>();
Inside the kernel, the 2-Dimensional index of the thread is calculated as follows:
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
Or if you follow the Row/Column convention instead of x/y, then:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
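If the underlying data is still a flat 1D array, as in the original code, the 2D indices fold back into a linear offset; a small sketch, assuming width holds the number of columns:
int row    = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
int idx    = row * width + column;   // row-major linear index, replacing the original 1D calculation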
You can also have a 2D grid of 1-dimensional threadblocks, in order to get around the limitation of 65535 blocks per grid dimension (for pre-cc3.0 devices). This may be an easier way of extending a fundamentally 1-D problem past the limit without introducing a 2-D array representation for the data.
Let's assume we have a DATA_ELEMENTS parameter defined to be the number of elements (one element per thread) that your kernel will work on. If DATA_ELEMENTS is larger than 65535*1024, then you cannot handle them all using a 1-D grid, if each thread handles only 1 element.
You can leave your THREADS_PER_BLOCK parameter the same. Your thread index calculation inside the kernel will change to something like:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
You will want to be sure to condition your kernel calculations with something like:
if (idx < DATA_ELEMENTS){
    (kernel code)
}
Your grid dimensions will be as follows:
dim3 grid;
if (DATA_ELEMENTS > (65535*THREADS_PER_BLOCK)){ // create a 2-D grid
    int gridx = 65535; // could choose another number here
    int gridy = ((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)/gridx;
    if ((((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)%gridx) != 0) gridy++;
    grid.x=gridx;
    grid.y=gridy;
    grid.z=1;
}
else { // create a 1-D grid
    int gridx = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
    grid.x=gridx;
    grid.y=1;
    grid.z=1;
}
and you would launch your kernel as:
kernel<<<grid, THREADS_PER_BLOCK>>>(...);
Another method to tackle this kind of problem is to create a 1-D grid of some dimension (let's say the total number of threads in the grid is NUM_THREADS_PER_GRID), and have each thread work on more than one element in the array of data elements, using something like a for-loop or while-loop:
while (idx < DATA_ELEMENTS) {
    (code to process an element)
    idx += NUM_THREADS_PER_GRID;
}
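A self-contained sketch of this second method, with placeholder per-element work (the grid size is chosen at launch, so the stride is computed in-kernel):
__global__ void kernel1D(float *data, int DATA_ELEMENTS)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    while (idx < DATA_ELEMENTS) {
        data[idx] *= 2.0f;                 // placeholder element processing
        idx += stride;
    }
}
// launched e.g. as kernel1D<<<num_blocks, THREADS_PER_BLOCK>>>(d_data, DATA_ELEMENTS);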
I like Robert's solutions above. The only comment I have about his first solution is that it seems one should make gridx as small as one can when DATA_ELEMENTS > (65535*THREADS_PER_BLOCK). The reason is that if the number of data elements is 65535*THREADS_PER_BLOCK + 1, and gridx is 65535, then 65535*2*THREADS_PER_BLOCK threads are launched, so almost half of the threads will do nothing. If gridx is smaller, then fewer threads will do nothing.

CUDA 2D Convolution kernel

I'm a beginner in CUDA and I'm trying to implement a Sobel edge detection kernel.
I'm using this code for it, but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1's and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;
    for (k1 = 0; k1 < 3; k1++)
        for (k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1*3+k2] * gpu_P[(X-k1)*H+Y-k2];
    gpu_Edge_Hor[X*H+Y] = sum/5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
First, your problem is that you process an image of 480x720 pixels. CUDA supports a maximum thread block size of 1024 for compute capability 2.0 and greater, and 512 for earlier devices, so you cannot execute that many threads in one block. The line dim3 dimBlock(W,H); is incorrect. You should divide your threads among several blocks.
Another problem is that CUDA processes data in row-major order, so you should change your memory access pattern.
The right memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
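For example, a minimal sketch of a launch configuration that splits the image into blocks (the 16x16 block size is just one reasonable choice; threads that fall past the image edge need a bounds check inside the kernel):
dim3 dimBlock(16, 16);                            // 256 threads per block, within the limit
dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,   // round up so every pixel is covered
             (H + dimBlock.y - 1) / dimBlock.y);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);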

cuda sub matrix

problem:
I have 4 matrices (64x64) of single precision numbers and need to do a calculation like:
R = A * sin(B) + C * cos(D)
idea:
To speed up the calculation, use shared memory. Since each block of threads has (in the case of my GPU) 16KB of shared memory and a float is 4 bytes, about 4000 floating point numbers can be stored in shared memory. So use 1000 elements for each matrix, which is about 31 elements per dimension.
So each matrix should be divided into 16 submatrices (16x16).
dim3 dimBlock(16, 16, 1)
dim3 dimGrid(4, 4, 1)
kernel:
int Tx = threadIdx.x;
int Ty = threadIdx.y;
int Bx = blockIdx.x;
int By = blockIdx.y;
int idx = Bx * blockDim.x + Tx;
int idy = By * blockDim.y + Ty;
__shared__ float s_A[16*16];
__shared__ float s_B[16*16];
__shared__ float s_C[16*16];
__shared__ float s_D[16*16];
// I am not sure how to write this part
s_A[(Tx * blockDim.x + Ty + By) + Bx] = A[idx * 64 + idy];
s_B[(Tx * blockDim.x + Ty + By) + Bx] = B[idx * 64 + idy];
s_C[(Tx * blockDim.x + Ty + By) + Bx] = C[idx * 64 + idy];
s_D[(Tx * blockDim.x + Ty + By) + Bx] = D[idx * 64 + idy];
R[idx * 64 + idy] = s_A[(Tx * blockDim.x + Ty + By) + Bx] * sin(s_B[(Tx * blockDim.x + Ty + By) + Bx]) + s_C[(Tx * blockDim.x + Ty + By) + Bx] * cos(s_D[(Tx * blockDim.x + Ty + By) + Bx]);
How do I divide the original matrices into submatrices so that each block has its own 4 submatrices to calculate on?
Unless I have misinterpreted your question, you don't need to and shouldn't use shared memory for this operation. Shared memory is useful for sharing and reusing data between threads within the same block, and for facilitating coalesced memory access. Your operation seems to require neither of those things to work correctly. Using shared memory in the way you propose would probably be slower than just reading from global memory directly. Also, because you are only doing element-wise operations, the indexing scheme of your kernel can be greatly simplified -- the fact that A, B, C and D are "matrices" is irrelevant to the calculations as I understand your question.
As a result, a near-optimal version of your kernel could be written as simply as:
__global__ void kernel(const float *A, const float *B, const float *C,
                       const float *D, const int n, float *R)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (tidx < n) {
        R[tidx] = A[tidx] * sinf(B[tidx]) + C[tidx] * cosf(D[tidx]);
        tidx += stride;
    }
}
In this code, you would launch as many blocks as would reach peak throughput of your GPU, and each thread will process more than one input/output value if the size of the array exceeds the size of the optimal 1D grid you have launched. Of course this is pretty academic if you are only processing 4096 elements in total -- that is probably about 2 orders of magnitude too small to get any benefit from using a GPU.
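A hedged launch sketch for this kernel (the block and grid sizes are illustrative, and d_A .. d_R are assumed to be device pointers allocated elsewhere):
const int n = 64 * 64;    // 4096 elements per matrix, per the question
const int threads = 256;
const int blocks = 32;    // illustrative; tune to your GPU
kernel<<<blocks, threads>>>(d_A, d_B, d_C, d_D, n, d_R);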
The problem here is that your operation/transfer ratio is of order 1. You might have a hard time actually getting any decent speed from your GPU because of the bandwidth bottleneck between the threads and global memory, with no way to reduce it.
A shared memory solution is usually best when some data is read repeatedly from global memory. Instead of loading this data repeatedly from the low-bandwidth, high-latency global memory, you load it once from there and do subsequent loads from the higher-bandwidth, lower-latency shared memory. Note, that's higher and lower, not high and low: there is still a performance penalty for using shared memory.
In your case, since elements aren't read several times from global memory, storing them in shared memory only adds the bandwidth limitations and latency that come with shared memory usage. So, in effect, this solution just adds the latency of shared memory access to your data loading.
Now, if you have several calculations to perform, and some of these matrices are used in them too, then combining them into one kernel might give you a speed boost, since you might be able to load these once for the whole thing instead of once per operation. If that's not the case, and you can't increase your operation/transfer ratio, then you'll have a hard time getting some decent speeds, and might be better off doing these calculations on the CPU.
You might even get some decent results from multithreading on the CPU.
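To illustrate the kernel-combining idea, a hypothetical sketch in which a second, made-up result is computed from the same inputs, so each matrix is read from global memory only once:
__global__ void fused(const float *A, const float *B, const float *C,
                      const float *D, const int n, float *R1, float *R2)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (tidx < n) {
        float a = A[tidx], b = B[tidx], c = C[tidx], d = D[tidx];  // one global load each
        R1[tidx] = a * sinf(b) + c * cosf(d);
        R2[tidx] = a * cosf(b) - c * sinf(d);   // hypothetical second calculation
        tidx += stride;
    }
}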

CUDA-Kernel supposed to be dynamic crashes depending upon block size

I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS).
My kernel looks like the following:
__global__ void
krnlSpMVmul1(
    float *data_mat,
    int num_nonzeroes,
    unsigned int *row_ptr,
    float *data_vec,
    float *data_result)
{
    extern __shared__ float local_result[];
    local_result[threadIdx.x] = 0;
    float vector_elem = data_vec[blockIdx.x];
    unsigned int start_index = row_ptr[blockIdx.x];
    unsigned int end_index = row_ptr[blockIdx.x + 1];
    for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
        local_result[threadIdx.x] += (data_mat[index] * vector_elem);
    __syncthreads();
    // Reduction
    // Writing accumulated sum into result vector
}
As you can see, the kernel is supposed to be as naive as possible, and it even does a few things wrong (e.g. vector_elem is just not always the correct value). I am aware of those things.
Now to my problem:
Suppose I am using a blocksize of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which is the result of the 17th multiplication, is just zero. Using a blocksize of 16 or 128 threads fixes the explained problem.
Since I am fairly new to CUDA I might have overlooked the simplest thing but I cannot make up any more situations to look at.
Help is very much appreciated!
Edit towards talonmies comment:
I printed the values which were in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:
The reduction part:
int k = blockDim.x / 2;
while (k != 0)
{
    if (threadIdx.x < k)
        local_result[threadIdx.x] += local_result[threadIdx.x + k];
    else
        return;
    __syncthreads();
    k /= 2;
}
and how I write the results back to global memory:
data_result[blockIdx.x] = local_result[0];
That's all I've got.
Right now I am testing a scenario with a matrix consisting of a single row with 17 elements, all of which are non-zero. The buffers look like this in pseudocode:
float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }
And that's an excerpt of my wrapper function:
float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;
// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));
// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
    dev_data_mat,
    data_mat,
    num_nonzeroes * sizeof(float),
    cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
    dev_row_ptr,
    row_ptr,
    num_row_ptr * sizeof(unsigned int),
    cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
    dev_data_vec,
    data_vec,
    dim_x * sizeof(float),
    cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
    dev_data_result,
    data_result,
    dim_y * sizeof(float),
    cudaMemcpyHostToDevice));
// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);
// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
    dev_data_mat,
    num_nonzeroes,
    dev_row_ptr,
    dev_data_vec,
    dev_data_result);
I hope this is straightforward but will explain things if it is of any interest.
One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again just the last value is not being computed.
Your dynamically allocated shared memory size is incorrect. Right now you are doing this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)
The shared memory size should be given in bytes. Using your 64-threads-per-block case, that means you would be allocating enough shared memory for only 16 float-sized words, which explains why the magic 17-entries-per-row case results in failure: you have a shared buffer overflow, which will trigger a protection fault in the GPU and abort the kernel.
You should be doing something like this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)
That will give you the correct dynamic shared memory size and should eliminate the problem.
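Equivalently, computing the size once makes the units explicit (a small sketch; BLOCK_SIZE is the thread count from the question):
size_t shmem_bytes = BLOCK_SIZE * sizeof(float);   // one float per thread, in bytes
krnlSpMVmul1<<<grid_dim, block_dim, shmem_bytes>>>(.....);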