Correctly computing gridDim for a CUDA kernel

I expected to see the numbers 0.0 to 999.0, but instead I am getting some very weird, very long numbers at some of the indices with the code below:
__global__ void kernel(double *res, int N)
{
    int i = (gridDim.y * blockIdx.y + blockIdx.x) * blockDim.x * blockDim.y
            + blockDim.y * threadIdx.y + threadIdx.x;
    if (i < N) res[i] = i;
}
void callGPU(int N)
{
    dim3 dimBlock(8, 8);
    dim3 dimGrid(2, 8);
    ...
    kernel<<<dimGrid, dimBlock>>>(res, N);
    ...
}
The same thing happens if I change dimGrid to (8,2) or (1,16), but if I change it to (16,1) then the indices come out right. Please can you show how to correctly set up the gridDim and the indexing for this case, if possible for arbitrary N? Many thanks!

Your indexing pattern is wrong.
First, compute the global index in the x and y dimensions:
int i_x = blockIdx.x * blockDim.x + threadIdx.x;
int i_y = blockIdx.y * blockDim.y + threadIdx.y;
Then compute the pitch, i.e. the total number of threads in the x dimension:
int pitch = gridDim.x * blockDim.x;
Finally, you can compute your 1D index from the 2D grid:
int i = i_y * pitch + i_x;
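Putting the pieces together, here is a minimal sketch of the corrected kernel plus a launch that covers an arbitrary N. It keeps the 8x8 block and the 8 blocks in the grid's y dimension from your code and only rounds the grid's x dimension up; treat it as a sketch rather than a drop-in replacement for your callGPU.
__global__ void kernel(double *res, int N)
{
    int i_x = blockIdx.x * blockDim.x + threadIdx.x;  // global x index
    int i_y = blockIdx.y * blockDim.y + threadIdx.y;  // global y index
    int pitch = gridDim.x * blockDim.x;               // total threads in x
    int i = i_y * pitch + i_x;                        // flattened 1D index
    if (i < N) res[i] = i;
}

void callGPU(double *res, int N)
{
    dim3 dimBlock(8, 8);
    int gridY = 8;  // keep 8 blocks in y, as in your launch
    // each unit of gridDim.x contributes blockDim.x * blockDim.y * gridY = 512 threads,
    // so round N up to a whole number of those units
    int threadsPerGridColumn = dimBlock.x * dimBlock.y * gridY;
    dim3 dimGrid((N + threadsPerGridColumn - 1) / threadsPerGridColumn, gridY);
    kernel<<<dimGrid, dimBlock>>>(res, N);
}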

Related

Cuda Dot Product Failing for Non Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.
Let's say I set the array size to 2500 and the max thread count per block to 1024.
In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as such:
nblocks = (n + 1024 - 1) / 1024;
So nblocks = 3.
This is my kernel function:
// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    if (i < n)
        s[tIdx] = a[i] * b[i];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
I call the kernel:
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
And everything works! Well, almost.
d_c, which has 3 elements (each one the dot product of a block), is thrown off on the last element:
d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number on the order of 10^18
Can someone point out why this is occurring? It only seems to work for multiples of 1024 (2048, 3072, and so on). Am I iterating over uninitialized values, or overflowing somewhere?
Thank you!
Edit:
// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);
// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));
// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
Initialization of the arrays is done in this function (n elements each):
void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f, 1.0f]
}
The basic problem here is that the sum reduction can only work correctly when you have a round power of two threads per block, with every entry in the shared memory initialised. That isn't a limitation in practice if you do something like this:
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512 or 1024). The while loop accumulates multiple values per thread and stores that partial dot product in shared memory, with every entry containing either 0 or a valid partial sum, and then the reduction happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one less than you think you require if the problem size is small). Performance will be improved as well at larger problem sizes.
If you haven't already seen it, here is a very instructive whitepaper written by Mark Harris from NVIDIA on the step-by-step optimisation of the basic parallel reduction. I highly recommend reading it.
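For completeness, a sketch of what the matching launch could look like: a power-of-two block size and a fixed, modest grid size rather than one block per 1024 elements. The particular values of ntpb and nblocks below are illustrative assumptions, not values taken from the question.
const int ntpb = 256;     // power-of-two threads per block (must match the shared-memory array size in the kernel)
const int nblocks = 32;   // a fixed, modest number of blocks, enough to keep the GPU busy

// d_c must be allocated with nblocks elements; each block writes one partial
// dot product into c[blockIdx.x], and those nblocks partial sums still need a
// final reduction (a second kernel launch, or a copy back and a host-side sum)
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);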

block per grid allocation habit in cuda

There is one common habit I have seen in CUDA examples when they calculate the grid size. The following is an example:
int main(){
    ...
    int numElements = 50000;
    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    ...
}
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}
What I am curious about is the initialization of blocksPerGrid. I don't understand why it's
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
rather than the straightforward
int blocksPerGrid = numElements / threadsPerBlock;
It seems to be quite a common habit; I have seen it in various projects, and they all do it this way.
I am new to CUDA. Any explanation of the reasoning behind this is welcome.
The calculation is done the way you see it to allow for cases where numElements isn't a round multiple of threadsPerBlock.
For example, using threadsPerBlock = 256 and numElements = 500:
(numElements + threadsPerBlock - 1) / threadsPerBlock = (500 + 255) / 256 = 2
whereas
numElements / threadsPerBlock = 500 / 256 = 1
In the first case, 512 threads are run, covering the 500 elements in the input data, but in the second case, only 256 threads are run, leaving 244 input items unprocessed.
Note also this kind of "guard" code in the kernel:
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
    ... Access input here
}
is essential to prevent any of the extra threads from performing out of bounds memory operations.
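Many codebases wrap this rounding-up idiom in a small helper so the intent is explicit. A sketch (the divUp name here is just an illustration, not part of the CUDA API):
// smallest number of blocks whose combined threads cover at least numElements items
inline int divUp(int numElements, int threadsPerBlock)
{
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

// usage, matching the example above: divUp(50000, 1024) == 49
int blocksPerGrid = divUp(numElements, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);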

CUDA Jacobian Relaxation

I am in the process of mapping this sequential computation to a CUDA computation. This computation is a 2-dimensional Jacobian relaxation on an NxN grid, where N is unknown. N is evenly divisible by 32.
Jacobi(float *a, float *b, int N){
    for (i=1; i<N+1; i++){
        for (j=1; j<N+1; j++) {
            a[i][j] = 0.8*(b[i+1][j] + b[i+1][j] + b[i][j+1] + b[i][j+1]);
        }
    }
}
I'm parallelizing the outer two loops, and each thread should compute just one element. The goal is to parallelize it using a cyclic distribution in the x and y dimensions. Can someone help me implement a Jacobi_GPU kernel with the appropriate indexing in CUDA that results in the following distribution?
dim3 dimGrid(N/32,N/32);
dim3 dimBlock(32,32);
Jacobi_GPU<<<dimGrid,dimBlock>>>(A,B,N)
This is the simple implementation. You can also apply a shared-memory optimization to this kernel:
__global__ void jacobi(float* a, const float* b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
    {
        a[j*N+i] = 0.8f * (2*b[(i+1)+j*N] + 2*b[i+N*(j+1)]);
    }
}
Or, if you want to use "arrays of arrays" rather than arrays:
__global__ void Jacobi(float** a, const float** b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
    {
        a[i][j] = 0.8f*(b[i+1][j] + b[i+1][j] + b[i][j+1] + b[i][j+1]);
    }
}

How to use 2D Arrays in CUDA?

How do I allocate a 2D array of size MxN, and how do I traverse that array in CUDA?
__global__ void test(int A[BLOCK_SIZE][BLOCK_SIZE], int B[BLOCK_SIZE][BLOCK_SIZE], int C[BLOCK_SIZE][BLOCK_SIZE])
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < BLOCK_SIZE && j < BLOCK_SIZE)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    int d_A[BLOCK_SIZE][BLOCK_SIZE];
    int d_B[BLOCK_SIZE][BLOCK_SIZE];
    int d_C[BLOCK_SIZE][BLOCK_SIZE];
    int C[BLOCK_SIZE][BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
        {
            d_A[i][j] = i + j;
            d_B[i][j] = i + j;
        }
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);
    test<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    cudaMemcpy(C, d_C, BLOCK_SIZE*BLOCK_SIZE, cudaMemcpyDeviceToHost);
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++)
        {
            printf("%d\n", C[i][j]);
        }
}
How to allocate 2D array:
#define BLOCK_SIZE 16
#define GRID_SIZE 1

int main(){
    int d_A[BLOCK_SIZE][BLOCK_SIZE];
    int d_B[BLOCK_SIZE][BLOCK_SIZE];
    /* d_A initialization */
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); // BLOCK_SIZE*BLOCK_SIZE = 256 threads per block in this case
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);    // 1*1 blocks in the grid
    YourKernel<<<dimGrid, dimBlock>>>(d_A, d_B); // kernel invocation
}
How to traverse that array:
__global__ void YourKernel(int d_A[BLOCK_SIZE][BLOCK_SIZE], int d_B[BLOCK_SIZE][BLOCK_SIZE]){
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= h || col >= w) return;
    /* whatever you wanna do with d_A[][] and d_B[][] */
}
I hope this is helpful.
You can also refer to the CUDA Programming Guide, page 22, for the matrix multiplication example.
The best way would be to store the two-dimensional array A in its vector (flattened) form.
For example, if you have a matrix A of size n x m, its (i,j) element in pointer-to-pointer representation will be
A[i][j] (with i=0..n-1 and j=0..m-1).
In vector form you can write
A[i*m+j] (with i=0..n-1 and j=0..m-1).
Using a one-dimensional array in this case also simplifies the copy process, which becomes:
double *A, *dev_A; // A - host pointer, dev_A - device pointer
A = (double*)malloc(n*m*sizeof(double));
cudaMalloc((void**)&dev_A, n*m*sizeof(double));
cudaMemcpy(dev_A, A, n*m*sizeof(double), cudaMemcpyHostToDevice); // in case A holds doubles
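As a quick illustration of traversing the flattened matrix on the device, here is a hypothetical kernel (the name and the element-wise scaling are only an example) that indexes A in its vector form:
__global__ void scaleMatrix(double *A, int n, int m, double factor)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index, 0..n-1
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index, 0..m-1
    if (i < n && j < m)
        A[i*m + j] *= factor;                       // the vector-form A[i][j]
}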

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my kernel function:
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);
    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    // Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed.
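For reference, a sketch of a corrected invocation that derives the grid size from Width so the whole output matrix is covered (this assumes the 4x4 block shape from the question; a bounds check inside the kernel would also be needed if Width is not an exact multiple of the block size):
dim3 dimBlock(4, 4, 1);
// round up so the grid covers all Width x Width output elements
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y,
             1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);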