My kernel only works in block (0,0) - cuda

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);
    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    // Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory, but I'm a bit lost. What should I do to make this code work across several blocks?

It turned out the problem was with my CUDA kernel invocation: the grid was far too small for the matrices being processed, so only the elements covered by block (0,0) were ever computed.
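For reference, a minimal sketch of a launch configuration that covers the whole output matrix, assuming Width is the matrix dimension as in the question (the 16x16 block size here is an arbitrary choice, not from the original post):

// Size the grid so that dimGrid * dimBlock covers Width x Width elements,
// rounding up so any matrix size is covered.
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y,
             1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

If Width is not a multiple of the block size, the kernel also needs a guard such as if (row < Width && col < Width) before it reads and writes.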

Related

CUDA: large kernel gives strange behavior

I recently bought a gtx550ti boost card. Programs that used to work on my old gf440 card fail. Here is an example. The following program works fine with smaller kernels, but goes wrong with larger ones.
#include "stdio.h"
__global__ void kernel(float * d_in, float * d_out){
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = x + y * blockDim.x * gridDim.x;
d_out[idx] = d_in[idx];
}
int main(){
const dim3 gridSize(10,10);
const dim3 blockSize(80,80);
const int size = 800*800;
float * h_in = new float[size];
float * h_out = new float[size];
float * d_in;
float * d_out;
cudaMalloc((void**)&d_in, sizeof(float)*size);
cudaMalloc((void**)&d_out, sizeof(float)*size);
for(int i = 0; i < size; i++)
h_in[i] = (float)i;
cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);
kernel<<<gridSize,blockSize>>>(d_in, d_out);
cudaMemcpy(h_out, d_out, sizeof(float)*size, cudaMemcpyDeviceToHost);
for(int i = 0; i < size; i++)
printf("%f\n",h_out[i]);
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
I expected it to output the indices as floats, but it outputs seemingly random values:
0.131061
2.520029
9.304665
0.000189
0.242134
0.525557
0.560013
size 100*100
Instead, when I switch to size 100*100:
const dim3 gridSize(10,10);
const dim3 blockSize(10,10);
const int size = 100*100;
it works fine (last 5 outputs):
9995.000000
9996.000000
9997.000000
9998.000000
9999.000000
size 500*500
But for larger size 500*500:
const dim3 gridSize(10,10);
const dim3 blockSize(50,50);
const int size = 500*500;
it outputs the wrong values (last 5 outputs):
512139.000000
512140.000000
512141.000000
512142.000000
512143.000000
I installed CUDA 5.5. Thanks!
Whenever you are having trouble with CUDA code, you should be doing proper CUDA error checking.
This is not valid:
const dim3 blockSize(80,80);
This is asking for a threadblock of 80*80 = 6400 threads. There are no GPUs that support 6400 threads per threadblock.
This is also not valid:
const dim3 blockSize(50,50);
2500 threads per block is also too many. Neither of these configurations would work on either of your cards.
This is acceptable:
const dim3 blockSize(10,10);
In the "not valid" cases, your kernel is not running. If you had done proper cuda error checking, you would have discovered this and even got a clue as to what might be wrong (invalid launch configuration).
You may also want to familiarize yourself with the deviceQuery cuda sample, and study the output for your GPUs.
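As a sketch of the kind of error checking meant above (a common pattern rather than an official API; the macro name is arbitrary):

#include <stdio.h>
#include <stdlib.h>

// Wrap every CUDA runtime call, and check again after kernel launches.
#define cudaCheck(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

// Usage with the code from the question:
cudaCheck(cudaMalloc((void**)&d_in, sizeof(float)*size));
kernel<<<gridSize, blockSize>>>(d_in, d_out);
cudaCheck(cudaGetLastError());      // catches invalid launch configurations
cudaCheck(cudaDeviceSynchronize()); // catches errors raised during kernel execution

With the 80x80 block size, cudaGetLastError() would have reported the invalid configuration immediately instead of leaving d_out uninitialized.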

correctly computing gridDim for CUDA kernel

I expected to see numbers from 0.0 to 999.0, but instead I am getting very weird and long numbers at some of the indices with the code below:
__global__ void kernel(double *res, int N)
{
    int i = (gridDim.y*blockIdx.y +
             blockIdx.x)*blockDim.x*blockDim.y +
             blockDim.y*threadIdx.y + threadIdx.x;
    if(i<N) res[i] = i;
}

void callGPU(int N)
{
    dim3 dimBlock(8, 8);
    dim3 dimGrid(2, 8);
    ...
    kernel<<<dimGrid, dimBlock>>>(res, N);
    ...
}
The same thing happens if I change dimGrid to (8,2) or (1,16), but if I change it to (16,1) then I get the indices right. Can you please show how to correctly compute the gridDim for this case, and if possible for arbitrary N? Many thanks!
Your indexing pattern is wrong.
First, compute the index in the x and y dimensions:
int i_x = blockIdx.x * blockDim.x + threadIdx.x;
int i_y = blockIdx.y * blockDim.y + threadIdx.y;
Then compute the pitch as the total number of threads in the x dimension:
int pitch = gridDim.x * blockDim.x;
Finally, you can compute your 1D index from the 2D grid:
int i = i_y * pitch + i_x;
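Putting the three steps together, a minimal sketch for arbitrary N (the grid-sizing formula and the choice of a 1D grid of 2D blocks are my additions, not part of the answer above; any grid shape works as long as the total thread count covers N):

__global__ void kernel(double *res, int N)
{
    int i_x = blockIdx.x * blockDim.x + threadIdx.x;
    int i_y = blockIdx.y * blockDim.y + threadIdx.y;
    int pitch = gridDim.x * blockDim.x;   // total threads along x
    int i = i_y * pitch + i_x;            // flattened 1D index
    if (i < N) res[i] = i;
}

// Host side: pick a block, then make the grid large enough to cover N threads.
dim3 dimBlock(8, 8);
int threadsPerBlock = dimBlock.x * dimBlock.y;
int blocksNeeded = (N + threadsPerBlock - 1) / threadsPerBlock;
dim3 dimGrid(blocksNeeded, 1);            // 1D grid of 2D blocks keeps the pitch simple
kernel<<<dimGrid, dimBlock>>>(res, N);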

CUDA Jacobian Relaxation

I am in the process of mapping this sequential computation to a CUDA computation. This computation is a 2-dimensional Jacobian relaxation on an NxN grid, where N is unknown. N is evenly divisible by 32.
void Jacobi(float **a, float **b, int N){
    for (int i=1; i<N+1; i++){
        for (int j=1; j<N+1; j++){
            a[i][j] = 0.8*(b[i+1][j]+b[i+1][j]+b[i][j+1]+b[i][j+1]);
        }
    }
}
I'm parallelizing the two outer loops, and each thread should compute just one element. The goal is to parallelize it using a cyclic distribution in the x and y dimensions. Can someone help me implement a Jacobi_GPU with the appropriate indexing in CUDA so that it works with the following launch configuration?
dim3 dimGrid(N/32,N/32);
dim3 dimBlock(32,32);
Jacobi_GPU<<<dimGrid,dimBlock>>>(A,B,N)
This is the simple implementation. You can also add a shared memory optimization to this kernel function:
__global__ void jacobi(int* a, const int* b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i<N && j<N)
    {
        a[j*N+i] = 0.8 * (2*b[(i+1)+j*N] + 2*b[i+N*(j+1)]);
    }
}
Or, if you want to use "arrays of arrays" rather than flat arrays:
__global__ void Jacobi(int** a, const int** b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i<N && j<N)
    {
        a[i][j] = 0.8*(b[i+1][j]+b[i+1][j]+b[i][j+1]+b[i][j+1]);
    }
}

cuda multiplication

Serial code snippet looks like this:
int i, j;
for(j=0; j<ny; j++)
{
    for(i=0; i<nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
I converted this to CUDA using this kernel:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for(tid = 0; tid < nx*ny; tid++)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
If this is the serial code:
int i, j;
for(j=0; j<ny; j++)
{
    for(i=0; i<nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
then you should be doing something like this:
__global__ void fn(float *x, const float *y, int nx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int j = tid/nx, i = tid - j * nx;
    x[tid] *= y[i];
}

fn<<<nx*ny/B, B>>>(x, y, nx); // with B = 256, 512, etc., and nx*ny a multiple of B
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're doing the loop less efficiently than the CPU version does; in other words, each thread does the same work as the single CPU thread, only less efficiently. It's no wonder this is slower; it should be much, much slower. Your CUDA kernel is:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for(tid = 0; tid < nx*ny; tid++)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you wrote to explain what your program does to a lay, non-CUDA audience, then you need to present your actual CUDA code so we can see what's going on. As it is, the trivial performance analysis I have done is all you can expect.
Given your comment to this answer:

    the nx * ny = 2205; so I used no. of blocks = (nx*ny+(threads-1))/threads and threads = 64.

which implies you intend to launch one thread per computation, the correct CUDA implementation would just be:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx;
int i = tid - j*nx;
if (tid < (nx*ny))
    x[tid] *= y[i];
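Spelled out with a kernel signature and the launch from that comment (a sketch; the kernel name and parameter list are my assumptions, since only the body is shown above):

__global__ void scale(float *x, const float *y, int nx, int ny)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int j = tid / nx;
    int i = tid - j * nx;
    if (tid < (nx * ny))
        x[tid] *= y[i];
}

// Host side: one thread per element, rounding the block count up as in the comment.
int threads = 64;
int blocks = (nx * ny + (threads - 1)) / threads;
scale<<<blocks, threads>>>(x, y, nx, ny);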
If you were intending for each thread to do more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU rather than using the same number of threads as the input size, and then do something like:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int gsize = blockDim.x * gridDim.x;
int i, j;
for(; tid < nx*ny; tid += gsize)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
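For completeness, a hedged sketch of how the host side might size that grid-stride launch, using the device's multiprocessor count as a rough target (the kernel name and the blocks-per-SM factor are illustrative assumptions, not part of the answer above):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);          // properties of device 0
int threads = 256;                          // a multiple of the warp size
int blocks = prop.multiProcessorCount * 8;  // enough blocks to keep every SM busy
// gridStrideFn would wrap the grid-stride loop above and take x, y, nx and ny
gridStrideFn<<<blocks, threads>>>(x, y, nx, ny);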
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but they would require more information about the problem than has been supplied in the question and subsequent comments.

Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. Having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would need a problem many orders of magnitude larger to realize a useful speedup from the GPU for this sort of low arithmetic intensity operation.
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.
Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarks you need to run it many times.
Try this version, which uses shared memory. It is one of the best implementations around:
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; // In number of elements
    float *elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Same as in previous example, except the followings:
    // d_A.width = d_A.stride = A.width;
    // d_B.width = d_B.stride = B.width;
    // d_C.width = d_C.stride = C.width;
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub.
    // Multiply each pair of sub-matrices together
    // and accumulate the results.
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
    {
        // Get sub-matrix Asub of A and Bsub of B
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory;
        // each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory;
    // each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
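The MatMul host code above is deliberately elided ("same as in previous example" refers back to the CUDA Programming Guide). A minimal sketch of what it could look like, assuming all dimensions are multiples of BLOCK_SIZE:

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Copy A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc((void**)&d_C.elements, size);

    // Invoke the kernel: one thread per element of C
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / BLOCK_SIZE, A.height / BLOCK_SIZE);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C back from device memory and free device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A.elements); cudaFree(d_B.elements); cudaFree(d_C.elements);
}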

How to write CUDA global function for this?

I want to convert the following function into CUDA.
void fun()
{
    for(i = 0; i < terrainGridLength; i++)
    {
        for(j = 0; j < terrainGridWidth; j++)
        {
            //CODE of function
        }
    }
}
I wrote the kernel like this:
__global__ void fun()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if((i < terrainGridLength) && (j < terrainGridWidth))
    {
        //CODE of function
    }
}
I declared both terrainGridLength and terrainGridWidth as constants and assigned the value 120 to both, and I am calling the function like this:
fun<<<30,500>>>()
But I am not getting correct output.
Is the code I wrote correct? I haven't understood much about the parallel execution of the code. Please explain how the code will work, and correct me if I have made any mistakes.
Your kernel uses the y dimension, which means it expects a 2D arrangement of threads, so you cannot invoke it with just:
int numBlock = 30;
int numThreadsPerBlock = 500;
fun<<<numBlock,numThreadsPerBlock>>>()
The invocation should be as follows (note that the blocks now contain a 2D arrangement of threads):
dim3 dimGrid(GRID_SIZE, GRID_SIZE); // 2D Grids with size = GRID_SIZE*GRID_SIZE
dim3 dimBlocks(BLOCK_SIZE, BLOCK_SIZE); //2D Blocks with size = BLOCK_SIZE*BLOCK_SIZE
fun<<<dimGrid, dimBlocks>>>()
Refer to the CUDA Programming Guide for further info. Also, if you want to work with 2D or 3D arrays, you are better off using cudaMallocPitch or cudaMalloc3D.
As for your code, I think something like this would work (I haven't tried it, but hopefully you can get the idea from it):
//main
dim3 dimBlocks(16, 16); // 2D blocks of 16*16 = 256 threads (a whole Width*Height block would exceed the per-block thread limit)
dim3 dimGrid((Width + dimBlocks.x - 1) / dimBlocks.x,
             (Height + dimBlocks.y - 1) / dimBlocks.y); // enough blocks to cover Width*Height
fun<<<dimGrid, dimBlocks>>>(Width, Height);
//kernel
__global__ void fun(int Width, int Height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if((i < Width) && (j < Height))
    {
        //CODE of function
    }
}
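Regarding the cudaMallocPitch suggestion above, a minimal sketch of how a pitched 2D allocation is typically used (the variable names are illustrative; the padded pitch keeps each row aligned for coalesced access):

float *d_terrain;
size_t pitchBytes;
cudaMallocPitch((void**)&d_terrain, &pitchBytes,
                terrainGridWidth * sizeof(float), terrainGridLength);
// Inside a kernel, a row is then addressed via the pitch in bytes:
// float *row = (float*)((char*)d_terrain + j * pitchBytes);
// float value = row[i];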