CUDA index blockDim.y is always 1 - cuda

I'm trying to solve the 2D Laplace equation using shared memory, but one strange thing is that the blockDim.y value is always 1. Could someone help me?
host code
checkCudaErrors(cudaMalloc((void**)&d_A, h*h * sizeof(float)));
checkCudaErrors(cudaMalloc((void**)&d_out, h*h * sizeof(float)));
checkCudaErrors(cudaMemcpy(d_A, A, h*h * sizeof(float), cudaMemcpyHostToDevice));
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
dim3 gridSize = ((h+BLOCK_SIZE-1)/BLOCK_SIZE, (h + BLOCK_SIZE - 1) / BLOCK_SIZE);
LaplaceDifference << <gridSize, blockSize >> > (d_A, h, d_out);
checkCudaErrors(cudaMemcpy(B, d_out, h*h * sizeof(float), cudaMemcpyDeviceToHost));
kernel code
int idx = blockIdx.x*blockDim.x + threadIdx.x;
int idy = blockIdx.y*blockDim.y + threadIdx.y;
__shared__ float A_ds[BLOCK_SIZE + 2][BLOCK_SIZE + 2];
int n = 1;
//Load data in shared memory
int halo_index_left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
int halo_index_up = (blockIdx.y - 1)*blockDim.y + threadIdx.y;
int halo_index_down = (blockIdx.y + 1)*blockDim.y + threadIdx.y;
A_ds[n + threadIdx.y][n + threadIdx.x] = A[idy * h +idx];
if (threadIdx.x >= blockDim.x - n) {
A_ds[threadIdx.y + n][threadIdx.x - (blockDim.x - n)] = (halo_index_left < 0) ? 0 : A[idy*h + halo_index_left];
}
if (threadIdx.x < n) {
A_ds[threadIdx.y + n][blockDim.x + n + threadIdx.x] = (halo_index_right >= h) ? 0 : A[idy*h + halo_index_right];
}
if (threadIdx.y >= blockDim.y - n) {
A_ds[threadIdx.y - (blockDim.y - n)][threadIdx.x+n] = (halo_index_up < 0) ? 0 : A[halo_index_up*h + idx];
}
if (threadIdx.y < n) {
A_ds[blockDim.y + n + threadIdx.y][threadIdx.x + n] = (halo_index_down >= h) ? 0 : A[halo_index_down*h + idx];
}
__syncthreads();
P[idy*h + idx] = 0.25*(A_ds[threadIdx.y + n - 1][threadIdx.x + n] + A_ds[threadIdx.y + n + 1][threadIdx.x + n]
                     + A_ds[threadIdx.y + n][threadIdx.x + n - 1] + A_ds[threadIdx.y + n][threadIdx.x + n + 1]);

(I spent quite some time looking for a dupe, but could not find it.)
A dim3 variable is a particular data type defined in the CUDA header file vector_types.h.
It provides several constructors. Here are a couple valid uses of constructors for this variable:
dim3 grid(gx, gy, gz);
dim3 grid = dim3(gx, gy, gz);
What you have shown:
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
won't work the way you expect.
Since there is no dim3 usage on the right hand side of the equal sign, the compiler will use some other method to process what is there. It is not a syntax error, because both the use of parentheses and the comma are legal in this form, from a C++ language perspective.
Hopefully you understand how parentheses work in C++. I'm not going to try to describe the comma operator; you can read about it here and here. The net effect is that the compiler will evaluate each of the two expressions (one on the left of the comma, one on the right) and it will evaluate the overall expression value as the value produced by the evaluation of the expression on the right. So this:
(BLOCK_SIZE, BLOCK_SIZE)
becomes this:
BLOCK_SIZE
which is quite obviously a scalar quantity, not multi-dimensional.
When you assign a scalar to a dim3 variable:
dim3 blockSize = BLOCK_SIZE;
You end up with a dim3 variable that has these dimensions:
(BLOCK_SIZE, 1, 1)
One method to fix what you have is as follows:
dim3 blockSize = dim3(BLOCK_SIZE, BLOCK_SIZE);
^^^^
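To see the difference directly, here is a minimal sketch (not part of the question's code; the kernel name and BLOCK_SIZE value are just for illustration) that prints the block dimensions the kernel actually receives with each form:
#include <cstdio>
__global__ void showBlockDim()
{
    if (threadIdx.x == 0 && threadIdx.y == 0)
        printf("blockDim = (%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
}
int main()
{
    const int BLOCK_SIZE = 16;
    dim3 wrong = (BLOCK_SIZE, BLOCK_SIZE);      // comma operator: ends up as (16, 1, 1)
    dim3 right = dim3(BLOCK_SIZE, BLOCK_SIZE);  // constructor: (16, 16, 1)
    showBlockDim<<<1, wrong>>>();
    cudaDeviceSynchronize();                    // prints blockDim = (16, 1, 1)
    showBlockDim<<<1, right>>>();
    cudaDeviceSynchronize();                    // prints blockDim = (16, 16, 1)
    return 0;
}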

This line:
dim3 blockSize = (BLOCK_SIZE, BLOCK_SIZE);
initializes a 1D block size. What you want is:
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);

Related

Box filter in CUDA using Google Colab

I have to implement a box filter on the GPU using CUDA, and I'm doing it on Google Colab. The code runs without any errors, but my resulting image is all black.
This is my blurring function:
__global__ void apply_box_blur(int height, int width, unsigned char* buffer, unsigned char* out) {
    int i, j;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < 2 || col < 2 || row >= height - 3 || col >= width - 3) return;
    float v = 1.0 / 9.0;
    float kernel[3][3] = { {v,v,v},
                           {v,v,v},
                           {v,v,v} };
    float sum0 = 0.0;
    float sum1 = 0.0;
    float sum2 = 0.0;
    for (i = -1; i <= 1; i++)
    {
        for (j = -1; j <= 1; j++)
        {
            // matrix multiplication with kernel with every color plane
            sum0 = sum0 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 0];
            sum1 = sum1 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 1];
            sum2 = sum2 + (float)kernel[i + 1][j + 1] * buffer[((row + i) * width + (col + j)) * 3 + 2];
        }
    }
    out[(row * width + col) * 3 + 0] = (unsigned char)sum0;
    out[(row * width + col) * 3 + 1] = (unsigned char)sum1;
    out[(row * width + col) * 3 + 2] = (unsigned char)sum2;
};
And my main function:
// device copies
unsigned char* d_buffer;
unsigned char* d_out;
// allocate space for device copies
cudaMalloc((void**)&d_buffer, size * 3 * sizeof(unsigned char));
cudaMalloc((void**)&d_out, size * 3 * sizeof(unsigned char));
// Copy inputs to device
cudaMemcpy(d_buffer, buffer, size * 3 * sizeof(unsigned char), cudaMemcpyHostToDevice);
// perform the Box blur and store the resulting pixels in the output buffer
dim3 block(16, 16);
dim3 grid(width / 16, height / 16);
apply_box_blur <<<grid, block>>> (height, width, d_buffer, d_out);
cudaMemcpy(out, d_out, size * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost);
Am I doing something wrong with the block and grid sizes? Or is there something wrong with my blurring function? Is it maybe a Google Colab issue?
Found the issue.
The block and grid sizes should've been this:
dim3 blockSize(16, 16, 1);
dim3 gridSize((size*3)/blockSize.x, (size*3)/blockSize.y, 1);
Also my Google Colab wasn't connected to a GPU.
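For completeness, a runtime error check after the launch would have surfaced both problems (a bad launch configuration and the missing GPU) immediately. A minimal sketch, assuming the same kernel call and variables as above:
apply_box_blur<<<grid, block>>>(height, width, d_buffer, d_out);
cudaError_t err = cudaGetLastError();          // reports launch/configuration failures
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                 // reports errors raised during kernel execution
if (err != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(err));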

What is a correct way to implement memcpy inside a CUDA kernel?

I am implementing a PDE solver (Lax-Friedrichs) in CUDA that I previously wrote in C. Please find the C code below:
void solve(int M, double u[M+3][M+3], double unp1[M+3][M+3], double params[3]){
    int i;
    int j;
    int n;
    for (n=0; n<params[0]; n++){
        for (i=0; i<M+2; i++)
            for (j=0; j<M+2; j++){
                unp1[i][j] = 0.25*(u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1])
                           - params[1]*(u[i+1][j] - u[i-1][j])
                           - params[2]*(u[i][j+1] - u[i][j-1]);
            }
        memcpy(u, unp1, pow(M+3,2)*sizeof(double));
        /*Periodic Boundary Conditions*/
        for (i=0; i<M+2; i++){
            u[0][i] = u[N+1][i];
            u[i][0] = u[i][N+1];
            u[N+2][i] = u[1][i];
            u[i][N+2] = u[i][1];
        }
    }
}
It works fine. But when I try to implement it in CUDA I do not get the same data. Unfortunately I cannot pinpoint the exact problem, since I am a total beginner to parallel programming, but I think it might have to do with the u[i*(N+3) + j] = unp1[i*(N+3) + j] in the solver: I cannot really perform a memcpy inside the kernel, and that assignment doesn't change anything, so I don't know how to proceed. I took a look at this previous answer, but it unfortunately couldn't help solve my problem. Here is the solver in CUDA I am trying to code:
#include <stdio.h>
#include <math.h>
#include <string.h>
#include <stdlib.h>
#include <iostream>
#include <algorithm>
/*Configuration of the grid*/
const int N = 100; //Number of nodes
const double xmin = -1.0;
const double ymin = -1.0;
const double xmax = 1.0;
const double ymax = 1.0;
const double tmax = 0.5;
/*Configuration of the simulation physics*/
const double dx = (xmax - xmin)/N;
const double dy = (ymax - ymin)/N;
const double dt = 0.009;
const double vx = 1.0;
const double vy = 1.0;
__global__ void initializeDomain(double *x, double *y){
/*Initializes the grid of size (N+3)x(N+3) to better accomodate Boundary Conditions*/
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int j=index ; j<N+3; j+=stride){
x[j] = xmin + (j-1)*dx;
y[j] = ymin + (j-1)*dy;
}
}
__global__ void initializeU(double *x, double *y, double *u0){
double sigma_x = 2.0;
double sigma_y = 6.0;
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int stride_y = blockDim.y * gridDim.y;
for (int i = index_x; i < N+3; i += stride_x)
for (int j = index_y; j < N+3; j+= stride_y){
u0[i*(N+3) + j] = exp(-200*(pow(x[i],2)/(2*pow(sigma_x,2)) + pow(y[j],2)/(2*pow(sigma_y,2))));
u0[i*(N+3) + j] *= 1/(2*M_PI*sigma_x*sigma_y);
//u[i*(N+3) + j] = u0[i*(N+3) + j];
//unp1[i*(N+3) + j] = u0[i*(N+3) + j];
}
}
void initializeParams(double params[3]){
params[0] = round(tmax/dt);
params[1] = vx*dt/(2*dx);
params[2] = vy*dt/(2*dy);
}
__global__ void solve(double *u, double *unp1, double params[3]){
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int stride_y = blockDim.y * gridDim.y;
for (int i = index_x; i < N+2; i += stride_x)
for (int j = index_y; j < N+2; j += stride_y){
unp1[i*(N+3) + j] = 0.25*(u[(i+1)*(N+3) + j] + u[(i-1)*(N+3) + j] + u[i*(N+3) + (j+1)] + u[i*(N+3) + (j-1)]) \
- params[1]*(u[(i+1)*(N+3) + j] - u[(i-1)*(N+3) + j]) \
- params[2]*(u[i*(N+3) + (j+1)] - u[i*(N+3) + (j-1)]);
}
}
__global__ void bc(double *u){
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
/*Also BC are set on parallel */
for (int i = index_x; i < N+2; i += stride_x){
u[0*(N+3) + i] = u[(N+1)*(N+3) + i];
u[i*(N+3) + 0] = u[i*(N+3) + (N+1)];
u[(N+2)*(N+3) + i] = u[1*(N+3) + i];
u[i*(N+3) + (N+2)] = u[i*(N+3) + 1];
}
}
int main(){
int i;
int j;
double *x = (double *)malloc((N+3)*sizeof(double));
double *y = (double *)malloc((N+3)*sizeof(double));
double *d_x, *d_y;
cudaMalloc(&d_x, (N+3)*sizeof(double));
cudaMalloc(&d_y, (N+3)*sizeof(double));
initializeDomain<<<1, 1>>>(d_x, d_y);
cudaDeviceSynchronize();
cudaMemcpy(x, d_x, (N+3)*sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(y, d_y, (N+3)*sizeof(double), cudaMemcpyDeviceToHost);
FILE *fout1 = fopen("data_x.csv", "w");
FILE *fout2 = fopen("data_y.csv", "w");
for (i=0; i<N+3; i++){
if (i==N+2){
fprintf(fout1, "%.5f", x[i]);
fprintf(fout2, "%.5f", y[i]);
}
else{
fprintf(fout1, "%.5f, ", x[i]);
fprintf(fout2, "%.5f, ", y[i]);
}
}
dim3 Block2D(1,1);
dim3 ThreadsPerBlock(1,1);
double *d_u0;
double *u0 = (double *)malloc((N+3)*(N+3)*sizeof(double));
cudaMalloc(&d_u0, (N+3)*(N+3)*sizeof(double));
initializeU<<<Block2D, ThreadsPerBlock>>>(d_x, d_y, d_u0);
cudaDeviceSynchronize();
cudaMemcpy(u0, d_u0, (N+3)*(N+3)*sizeof(double), cudaMemcpyDeviceToHost);
/*Initialize parameters*/
double params[3];
initializeParams(params);
/*Allocate memory for u and unp1 on device for the solver*/
double *d_u, *d_unp1;
cudaMalloc(&d_u, (N+3)*(N+3)*sizeof(double));
cudaMalloc(&d_unp1, (N+3)*(N+3)*sizeof(double));
cudaMemcpy(d_u, d_u0, (N+3)*(N+3)*sizeof(double), cudaMemcpyDeviceToDevice);
cudaMemcpy(d_unp1, d_u0, (N+3)*(N+3)*sizeof(double), cudaMemcpyDeviceToDevice);
/*Solve*/
for (int n=0; n<params[0]; n++){
solve<<<Block2D, ThreadsPerBlock>>>(d_u, d_unp1, params);
double *temp = d_u;
d_u = d_unp1;
d_unp1 = temp;
bc<<<1,1>>>(d_u);
cudaDeviceSynchronize();
}
/*Copy results on host*/
double *u = (double *)malloc((N+3)*(N+3)*sizeof(double));
cudaMemcpy(u, d_u, (N+3)*(N+3)*sizeof(double), cudaMemcpyDeviceToHost);
FILE *fu = fopen("data_u.csv", "w");
for (i=0; i<N+3; i++){
for(j=0; j<N+3; j++)
if (j==N+2)
fprintf(fu, "%.5f", u[i*(N+3) + j]);
else
fprintf(fu, "%.5f, ", u[i*(N+3) + j]);
fprintf(fu, "\n");
}
fclose(fu);
free(x);
free(y);
free(u0);
free(u);
cudaFree(d_x);
cudaFree(d_y);
cudaFree(d_u0);
cudaFree(d_u);
cudaFree(d_unp1);
return 0;
}
I unfortunately keep having the same issue: the data I get is all 0.0000.
One thing that is tripping you up is that your original algorithm has an ordering that is required for correctness:
Update unp from u
copy unp to u
enforce boundary condition
repeat
Your algorithm requires that step 1 be completed entirely before step 2 begins, and likewise that step 2 be completed before step 3. Your CUDA realization (putting steps 1 and 3, or steps 1, 2 and 3, in a single kernel) does not preserve or guarantee that ordering. CUDA threads can execute in any order. If you apply that fact rigorously to your code (for example, imagine that the thread with index 0 executes completely before any other thread begins; that would be valid CUDA execution), you will see that your kernel design does not preserve the ordering required.
So do something like this:
Create a solve kernel that is just the first step:
__global__ void solve(double *u, double *unp1, double params[3]){
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int stride_y = blockDim.y * gridDim.y;
for (int i = index_x; i < N+2; i += stride_x)
for (int j = index_y; j < N+2; j += stride_y){
unp1[i*(N+3) + j] = 0.25*(u[(i+1)*(N+3) + j] + u[(i-1)*(N+3) + j] + u[i*(N+3) + (j+1)] + u[i*(N+3) + (j-1)]) \
- params[1]*(u[(i+1)*(N+3) + j] - u[(i-1)*(N+3) + j]) \
- params[2]*(u[i*(N+3) + (j+1)] - u[i*(N+3) + (j-1)]);
u[i*(N+3) + j] = unp1[i*(N+3) + j];
}
}
Don't bother with the memcpy operation. The better way to do that is to swap pointers (in host code).
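In host code that pointer swap is a one-liner (a sketch; equivalent to the manual three-line swap shown in the loop further down, assuming <utility> is included):
std::swap(d_u, d_unp1);   // the buffer just written becomes the current solution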
Create a separate kernel to enforce the boundary:
__global__ void bc(double *u, double *unp1, double params[3]){
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int stride_x = blockDim.x * gridDim.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
int stride_y = blockDim.y * gridDim.y;
/*Also BC are set on parallel */
for (int i = index_x; i < N+2; i += stride_x){
u[0*(N+3) + i] = u[(N+1)*(N+3) + i];
u[i*(N+3) + 0] = u[i*(N+3) + (N+1)];
u[(N+2)*(N+3) + i] = u[1*(N+3) + i];
u[i*(N+3) + (N+2)] = u[i*(N+3) + 1];
}
}
Modify your host code to call these kernels in sequence, with the pointer swap in-between:
/*Solve*/
for(int n = 0; n<params[0]; n++){
solve<<<Block2D, ThreadsPerBlock>>>(d_u, d_unp1, params);
double *temp = d_u;
d_u = d_unp1;
d_unp1 = temp;
bc<<<Block2D, ThreadsPerBlock>>>(d_u, d_unp1, params);
cudaDeviceSynchronize();
}
(coded in browser, not tested)
This will enforce the ordering that your algorithm requires.
NOTE: As identified in the comments below, the solve kernel as depicted above (and in OP's original post, and in their posted CPU code version) has indexing errors at least associated with i-1 and j-1 indexing patterns. These should be fixed otherwise the code is broken. Fixing them requires some decision as to what to do for the edge cases, which OP provides no guidance on, therefore I have left that code as-is.
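For reference only, one plausible way to address the i-1 / j-1 issue, assuming the ghost-cell layout implied by the boundary-condition kernel (row and column 0 and N+2 act as halo cells), would be to restrict the update loop to the interior points. This is a sketch of that one choice, not necessarily what OP intends:
for (int i = index_x + 1; i < N+2; i += stride_x)
    for (int j = index_y + 1; j < N+2; j += stride_y){
        // only interior cells 1..N+1 are updated, so i-1 and j-1 never go out of bounds
        unp1[i*(N+3) + j] = 0.25*(u[(i+1)*(N+3) + j] + u[(i-1)*(N+3) + j] + u[i*(N+3) + (j+1)] + u[i*(N+3) + (j-1)])
                          - params[1]*(u[(i+1)*(N+3) + j] - u[(i-1)*(N+3) + j])
                          - params[2]*(u[i*(N+3) + (j+1)] - u[i*(N+3) + (j-1)]);
    }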

Calculating indices for nested loops in CUDA

I'm trying to learn CUDA and I'm a bit confused about calculating thread indices. Let's say I have this loop I'm trying to parallelize:
...
for(int x = 0; x < DIM_x; x++){
for(int y = 0; y < DIM_y; y++){
for(int dx = 0; dx < psize; dx++){
array[y*DIM_x + x + dx] += 1;
}
}
}
In PyCUDA, I set:
block = (8, 8, 8)
grid = (96, 96, 16)
Most of the examples I've seen for parallelizing loops calculate thread indices like this:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
return;
atomicAdd(&array[y*DIM_x + x + dx], 1)
DIM_x = 580, DIM_y = 550, psize = 50
However, if I print x, I see that multiple threads with the same thread Id are created, and the final result is wrong.
Instead, if I use this (3D grid of 3D blocks):
int blockId = blockIdx.x + blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
int x = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x) + threadIdx.x;
It fixes the multiple same thread Ids problem for x, but I'm not sure how I'd parallelize y and dx.
If anyone could help me understand where I'm going wrong, and show me the right way to parallelize the loops, I'd really appreciate it.
However, if I print x, I see that multiple threads with the same
thread Id are created, and the final result is wrong.
It would be normal for you to see multiple threads with the same x thread ID in a multi-dimensional grid, as it would also be normal to observe many iterations of the loops in your host code with the same x value. If the result is wrong, it has nothing to do with any of the code you have shown, viz:
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <assert.h>
void host(int* array, int DIM_x, int DIM_y, int psize)
{
for(int x = 0; x < DIM_x; x++){
for(int y = 0; y < DIM_y; y++){
for(int dx = 0; dx < psize; dx++){
array[y*DIM_x + x + dx] += 1;
}
}
}
}
__global__
void kernel(int* array, int DIM_x, int DIM_y, int psize)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;
if (x >= DIM_x || y >= DIM_y || dx >= psize)
return;
atomicAdd(&array[y*DIM_x + x + dx], 1);
}
int main()
{
dim3 block(8, 8, 8);
dim3 grid(96, 96, 16);
int DIM_x = 580, DIM_y = 550, psize = 50;
std::vector<int> array_h(DIM_x * DIM_y * psize, 0);
std::vector<int> array_hd(DIM_x * DIM_y * psize, 0);
thrust::device_vector<int> array_d(DIM_x * DIM_y * psize, 0);
kernel<<<grid, block>>>(thrust::raw_pointer_cast(array_d.data()), DIM_x, DIM_y, psize);
host(&array_h[0], DIM_x, DIM_y, psize);
thrust::copy(array_d.begin(), array_d.end(), array_hd.begin());
cudaDeviceSynchronize();
for(int i=0; i<DIM_x * DIM_y * psize; i++) {
assert( array_h[i] == array_hd[i] );
}
return 0;
}
which when compiled and run
$ nvcc -arch=sm_52 -std=c++11 -o looploop loop_the_loop.cu
$ cuda-memcheck ./looploop
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
emits no errors and passes the check of all elements against the host code in your question.
If you are getting incorrect results, it is likely that you have a problem with initialization of the device memory before running the kernel. Otherwise I fail to see how incorrect results could be emitted by the code you have shown.
In general, performing a large number of atomic memory transactions, as your code does, is not the optimal way to perform computation on the GPU. Using non-atomic transactions would probably require relying on other a priori information about the structure of the problem (such as a graph decomposition, or a precise description of the write patterns of the problem).
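One related note: the example above explicitly initializes its thrust::device_vector to zero. If you allocate with cudaMalloc directly, the memory starts out uninitialized, so it needs an explicit clear before accumulating into it. A short sketch:
int *array_d;
cudaMalloc(&array_d, DIM_x * DIM_y * psize * sizeof(int));
cudaMemset(array_d, 0, DIM_x * DIM_y * psize * sizeof(int));  // zero the buffer before the atomicAdd kernel runs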
In a 3D grid with 3D blocks, the thread ID is:
unsigned long blockId = blockIdx.x
+ blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
unsigned long threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
Not the x you computed; the x you computed is only the x index within that 3D grid.
There is a nice cheat sheet in this blog.
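If you prefer a single 1D launch with globally unique IDs, you can also recover the three loop indices from that one ID. A sketch (assuming DIM_x, DIM_y and psize as in the question; the kernel name is just for illustration):
__global__ void kernel_1d(int* array, int DIM_x, int DIM_y, int psize)
{
    long long tid   = (long long)blockIdx.x * blockDim.x + threadIdx.x;  // unique ID in a 1D grid
    long long total = (long long)DIM_x * DIM_y * psize;
    if (tid >= total) return;
    int dx = tid % psize;                        // innermost loop index
    int y  = (tid / psize) % DIM_y;              // middle loop index
    int x  = tid / ((long long)psize * DIM_y);   // outermost loop index
    atomicAdd(&array[y * DIM_x + x + dx], 1);
}
Launched with enough threads to cover DIM_x*DIM_y*psize elements (for example <<<(DIM_x*DIM_y*psize + 255)/256, 256>>>), this covers the same index space as the three nested loops.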

Coalesced access in the following matrix copy kernel

The following kernel performs matrix copy that I came across in this article:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
__global__ void copy(float *odata, const float *idata)
{
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
for (int j = 0; j < TILE_DIM; j+= BLOCK_ROWS)
odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
I am confused by the notation used. From what I understand, the data is in row-major format: "y" corresponds to rows and "x" corresponds to columns, so the linear index is calculated as data[y][x] = data[y*width + x].
How is odata[(y+j)*width + x] coalesced? In row-major order, elements in the same row are in successive locations, so accessing elements in the fashion (y,x), (y,x+1), (y,x+2), ... is contiguous.
However, "j" above is added to "y", which does not seem coalesced.
Is my understanding of the notation incorrect or am I missing something here?
Coalescing memory transactions only requires that threads from the same warp read and write into a contiguous block of memory which can be served by a single transaction. Your code
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
odata[(y+j)*width + x] = idata[(y+j)*width + x];
produces coalesced access because j is constant across every thread in a warp. So the access patterns become:
j = 0:              (y)*width + x;  (y)*width + x + 1;  (y)*width + x + 2; .....
j = BLOCK_ROWS:     (y + BLOCK_ROWS)*width + x;  (y + BLOCK_ROWS)*width + x + 1; .....
j = 2*BLOCK_ROWS:   (y + 2*BLOCK_ROWS)*width + x;  (y + 2*BLOCK_ROWS)*width + x + 1; .....
Within each warp, at any value of j, the accesses still touch sequential elements in memory, so the reads and writes will coalesce.
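For contrast (this variant is not from the article, just an illustration): if the roles of the indices were swapped so that consecutive threads in a warp step through memory with a stride of width, the accesses would no longer coalesce:
// consecutive threadIdx.x values now touch elements `width` floats apart,
// so each warp's accesses scatter across many memory segments
odata[x*width + (y+j)] = idata[x*width + (y+j)];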

Getting wrong results from CUDA matrix multiplication kernel [duplicate]

This question already has answers here:
Multiply Rectangular Matrices in CUDA
(5 answers)
Closed 7 years ago.
I am new to CUDA. I have a kernel to do matrix multiplication. It seems alright to me, but it is failing in some cases. Please help me find where the problem is.
__global__ void matrixMultiply(float * A, float * B, float * C,
int numARows, int numAColumns,
int numBRows, int numBColumns,
int numCRows, int numCColumns)
{
//## Insert code to implement matrix multiplication here
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if (numAColumns != numBRows) return;
if ((Row < numARows) && (Col < numBColumns)){
float Cvalue = 0;
for (int k = 0 ; k < numAColumns ; ++k )
Cvalue += A[Row*numAColumns + k] * B[k * numBColumns + Col];
C[Row*numCColumns + Col] = Cvalue;
__syncthreads();
}
}
I am invoking the kernel as follows.
int BLOCKX = (int)(ceil((numCRows / 8.0)));
int BLOCKY = (int)(ceil((numCColumns / 8.0)));
printf("Number of blocks: %d\t%d\n", BLOCKX, BLOCKY);
dim3 DimGrid(BLOCKX, BLOCKY);
dim3 DimBlock(8 , 8, 1);
Your code will deadlock in the block below:
if ((Row < numARows) && (Col < numBColumns)){
float Cvalue = 0;
for (int k = 0 ; k < numAColumns ; ++k )
Cvalue += A[Row*numAColumns + k] * B[k * numBColumns + Col];
C[Row*numCColumns + Col] = Cvalue;
__syncthreads();
}
Consider a block where the condition is satisfied for some threads but not for others. In that case, this will deadlock. Put __syncthreads() outside the if condition.
Also replace dim3 DimGrid(BLOCKX, BLOCKY); with dim3 DimGrid(BLOCKY, BLOCKX);. That should fix it.
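Putting both fixes together, a sketch of what the corrected kernel and launch could look like (the barrier is hoisted out of the conditional as suggested; since this kernel uses no shared memory, it could also simply be removed):
__global__ void matrixMultiply(float * A, float * B, float * C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (numAColumns != numBRows) return;   // uniform across the whole grid, so no divergence
    if ((Row < numARows) && (Col < numBColumns)){
        float Cvalue = 0;
        for (int k = 0 ; k < numAColumns ; ++k )
            Cvalue += A[Row*numAColumns + k] * B[k * numBColumns + Col];
        C[Row*numCColumns + Col] = Cvalue;
    }
    __syncthreads();   // now reached by every thread in the block
}
// grid.x must cover the columns of C, grid.y must cover the rows of C:
dim3 DimBlock(8, 8, 1);
dim3 DimGrid((int)ceil(numCColumns / 8.0), (int)ceil(numCRows / 8.0));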