Find max of matrix with window size in CUDA [duplicate]

I just started with CUDA and I have a question.
I have an N*N matrix and a window of size 8x8. I want to subdivide this matrix into multiple sub-matrices and find the max value of each. For example, for a 64*64 matrix I would have 64 small 8*8 matrices and would find 64 max values. Finally I save all the max values into a new array, but their order keeps changing. I want to find a solution that keeps them in the right order.
__global__ void calculate_emax_kernel(float emap[], float emax[], int img_height, int img_width, int windows_size)
{
    int x_index = blockIdx.x*blockDim.x+threadIdx.x;
    int y_index = blockIdx.y*blockDim.y+threadIdx.y;
    int num_row_block = img_height/windows_size;
    int num_col_block = img_width/windows_size;
    __shared__ float window_elements[256];
    __shared__ int counter;
    __shared__ int emax_count;
    if (threadIdx.x == 0) emax_count = 0;
    __syncthreads();
    int index;
    int emax_idx = 0;
    if (y_index >= img_height || x_index >= img_width) return;
    for (int i = 0; i < num_row_block; i++)
    {
        for (int j = 0; j < num_col_block; j++)
        {
            counter = 0;
            if (y_index >= i*windows_size && y_index < (i+1)*windows_size
                && x_index >= j*windows_size && x_index < (j+1)*windows_size)
            {
                int idx = y_index*img_height + x_index;
                index = atomicAdd(&counter, 1);
                window_elements[index] = emap[idx];
                __syncthreads();
                // reduction
                unsigned int k = (windows_size*windows_size)/2;
                while (k != 0)
                {
                    if (index < k)
                    {
                        window_elements[index] = fmaxf(window_elements[index], window_elements[index+k]);
                    }
                    k /= 2;
                }
                if (index == 0)
                {
                    emax[i*num_row_block+j] = window_elements[index];
                }
            }
            __syncthreads();
        }
        __syncthreads();
    }
    __syncthreads();
}
This is my configuration
void construct_emax(float *input, float *output, int img_height, int img_width)
{
    int windows_size = 4;
    float *d_input, *d_output;
    cudaMalloc(&d_input, img_width*img_height*sizeof(float));
    cudaMalloc(&d_output, img_width*img_height*sizeof(float));
    cudaMemcpy(d_input, input, img_width*img_height*sizeof(float), cudaMemcpyHostToDevice);
    dim3 blocksize(16,16);
    dim3 gridsize;
    gridsize.x = (img_width+blocksize.x-1)/blocksize.x;
    gridsize.y = (img_height+blocksize.y-1)/blocksize.y;
    calculate_emax_kernel<<<gridsize,blocksize>>>(d_input, d_output, img_height, img_width, windows_size);
}

With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.
For large window size, I don't think it is a problem. You could use one thread block to reduce one window. For example if you have a 16x16 window, you could simply use 16x16 thread block. If you have even larger window size, for example 64x64, you could still use 16x16 thread block. First reduce the 64x64 window to 16x16 elements during data loading, then reduce to 1 scalar within the thread block.
For window sizes smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for 16 4x4 windows. But this will not be optimal, because each 32-thread warp is organized in two parts (2x16). This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.
Alternatively, I would suggest you use a 1-D thread block with 256 threads. Every m threads reduce one mxm window. Then you can use a 2-D grid to cover the whole image.
const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);
In the kernel function, you could:
reduce each mxm window to a 1xm vector during global data loading;
use the tree reduction method to reduce the 1xm vector to a scalar.
The following code is a conceptual demo which works when m is a power of 2 and m <= 32. You could further modify it for arbitrary m and better boundary checking.
#include <assert.h>
#include <cuda.h>
#include <thrust/device_vector.h>

__global__ void calculate_emax_kernel(const float* input, float* output,
                                      int height, int width, int win_size,
                                      int out_width) {
    const int tid = threadIdx.x;
    const int i = blockIdx.y * win_size;   // first row of this strip of windows
    const int j = blockIdx.x * 256 + tid;  // column handled by this thread
    const int win_id = j % win_size;       // thread's position within its window
    __shared__ float smax[256];
    float tmax = -1e20;
    if (j < width) {
        // reduce each win_size x win_size window to a 1 x win_size vector:
        // every thread folds one column of the window into tmax
        for (int tile = 0; tile < win_size; tile++) {
            if (i + tile < height) {
                tmax = max(tmax, input[(i + tile) * width + j]);
            }
        }
    }
    smax[tid] = tmax;
    // windows never cross a warp boundary (win_size <= 32, power of 2),
    // so warp-level synchronization suffices for the shared-memory reduction
    __syncwarp();
    // tree reduction of the 1 x win_size vector to a scalar
    for (int shift = win_size / 2; shift > 0; shift /= 2) {
        if (win_id < shift) {
            smax[tid] = max(smax[tid], smax[tid + shift]);
        }
        __syncwarp();
    }
    if (win_id == 0 && j < width) {
        output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
    }
}

int main() {
    const int height = 1024;
    const int width = 1024;
    const int m = 4;
    thrust::device_vector<float> in(height * width);
    thrust::device_vector<float> out(
        ((height + m - 1) / m) * ((width + m - 1) / m));
    dim3 blocksize(256);
    dim3 gridsize((width + 255) / 256, (height + m - 1) / m);
    assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
    calculate_emax_kernel<<<gridsize, blocksize>>>(
        thrust::raw_pointer_cast(in.data()),
        thrust::raw_pointer_cast(out.data()),
        height, width, m, (width + m - 1) / m);
    return 0;
}

In case you're willing to use a library, a few pointers:
NPP, NVIDIA's set of image-processing primitives:
https://docs.nvidia.com/cuda/npp/group__image__filter__max.html
CUB, a lower-level library for other reduce operations and more granularity in how you use the hardware (from NVIDIA / NVLabs):
http://nvlabs.github.io/cub/
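For instance, CUB's segmented reduction computes one max per contiguous segment in a single call. It applies directly once each window's elements are contiguous in memory (e.g. after a one-time reordering of the image, not shown here). A hedged sketch, with our own helper name window_max_cub:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

// Computes d_out[i] = max of segment i, where segment i spans
// [i*window_elems, (i+1)*window_elems) of d_in.
void window_max_cub(const float* d_in, float* d_out,
                    int num_windows, int window_elems) {
    thrust::device_vector<int> offsets(num_windows + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, window_elems);
    const int* d_off = thrust::raw_pointer_cast(offsets.data());

    void* d_temp = NULL;
    size_t temp_bytes = 0;
    // First call only queries the required temporary storage size.
    cub::DeviceSegmentedReduce::Max(d_temp, temp_bytes, d_in, d_out,
                                    num_windows, d_off, d_off + 1);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceSegmentedReduce::Max(d_temp, temp_bytes, d_in, d_out,
                                    num_windows, d_off, d_off + 1);
    cudaFree(d_temp);
}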

Related

Having an issue detecting problem in my code | CUDA c

I'm having a hard time understanding where the bug in my code is.
The purpose of the project is to multiply two matrices and compare the time between the sequential and parallel versions.
When I print the matrices I see that the device matrix is basically empty.
Also, I treated the matrices as arrays of size n*n.
Thanks!
//This program computes the multiplication of two matrices on the GPU using CUDA
#include <stdio.h>
#include <cassert>

__global__ void matrixMul(int * m, int * n, int * p, int size)
{
    //Calculate row and column
    int row=threadIdx.y*blockDim.y+threadIdx.y;
    int column=threadIdx.x*blockDim.x+threadIdx.x;
    int p_sum=0;
    for (int i = 0; i < size; i++)
    {
        p_sum += m[row*size + i] * n[i*size +column];
    }
    p[row*size + column] = p_sum;
}

void matrixMul_seq(int * m, int * n, int * p, int size){
    for(int i = 0; i < size; i++){
        for(int j = 0; j < size; j++){
            for(int k = 0; k < size; k++){
                p[i*size +j] += m[i*size +k] +n[k*size +j];
            }
        }
    }
}

//Initialize matrices
void init_matricies(int * mat, int n){
    for(int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
        {
            mat[i*n+j]=rand()%1024;
        }
    }
}

int main(int argc, char **argv)
{
    //Set our problem size (default = 2^10 == 1024)
    int n = 1<<10;
    printf("Square matrix of size:%d\n",n);
    //Size in bytes
    size_t bytes=n*n*sizeof(bytes);
    //Host matrices
    int *h_m;
    int *h_p;
    int *h_n;
    int *h_p_seq;
    //Device matrices
    int *d_m;
    int *d_p;
    int *d_n;
    //Memory allocation for host matrices
    h_m=(int*)malloc(bytes);
    h_n=(int*)malloc(bytes);
    h_p=(int*)malloc(bytes);
    h_p_seq=(int*)malloc(bytes);
    init_matricies(h_m, n);
    init_matricies(h_n, n);
    //Allocate memory on the device side
    cudaMalloc(&d_n, bytes);
    cudaMalloc(&d_m, bytes);
    cudaMalloc(&d_p, bytes);
    //Copy data to the device
    cudaMemcpy(d_m, h_m, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_n, h_n, bytes, cudaMemcpyHostToDevice);
    int threads_per_block = 16;
    dim3 block_size(threads_per_block, threads_per_block);
    dim3 grid_size(n / block_size.x, n / block_size.y);
    printf("Grid size X:%d, Grid size y:%d\n", grid_size.x, grid_size.y);
    printf("THE RESULT OF THE SIZES: 2^6 * 2^4 * 2^6 * 2^4 \n");
    matrixMul<<<grid_size, block_size>>>(d_m, d_n, d_p, n);
    matrixMul_seq(h_m, h_n, h_p_seq, n);
    cudaMemcpy(h_p, d_p, bytes, cudaMemcpyDeviceToHost);
    for(int i = 0; i < n; i++){
        for(int j = 0; j < n; j++){
            //printf("Grid size X:%d, Grid size y:%d\n", h_p[n * i + j], h_p_seq[n * i + j]);
            assert(h_p[n * i + j] == h_p_seq[n * i + j]);
        }
    }
    free(h_m);
    free(h_p);
    free(h_n);
    free(h_p_seq);
    cudaFree(d_m);
    cudaFree(d_n);
    cudaFree(d_p);
    return 0;
}
You have a variety of problems in your code:
You are calculating kernel index variables incorrectly. This is incorrect:
int row=threadIdx.y*blockDim.y+threadIdx.y;
int column=threadIdx.x*blockDim.x+threadIdx.x;
it should be:
int row=blockIdx.y*blockDim.y+threadIdx.y;
int column=blockIdx.x*blockDim.x+threadIdx.x;
The matrix operations in your calculation functions don't match each other. Kernel:
p_sum += m[row*size + i] * n[i*size +column];
                         ^
                         multiplication
host code:
p[i*size +j] += m[i*size +k] +n[k*size +j];
                             ^
                             addition
We also observe, from above, that the host code is doing a summation to the output variable (+=), whereas the kernel is doing an assignment to the output variable (=):
p[row*size + column] = p_sum;
This has implications for the next issue.
malloc doesn't initialize data. Since this operation is creating the output array that will be used by the host code, which is doing a summation to it, we must initialize this allocation to zero:
h_p_seq=(int*)malloc(bytes);
memset(h_p_seq, 0, bytes); // must add this line to initialize to zero
The calculation of the size of your arrays in bytes is too large. You have defined your arrays to be of type int. But your size calculation is like this:
size_t bytes=n*n*sizeof(bytes);
An int is a 4-byte quantity, whereas a size_t variable like bytes is an 8-byte quantity. This doesn't cause an actual correctness problem, but it allocates twice as much memory as necessary. I would suggest changing it to:
size_t bytes=n*n*sizeof(int);
With the above items addressed, your code runs correctly for me.
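Putting the four fixes together, a minimal sketch showing only the changed pieces (everything else stays as in the question):

__global__ void matrixMul(int * m, int * n, int * p, int size)
{
    // fix 1: the block offset comes from blockIdx, not threadIdx
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int column = blockIdx.x*blockDim.x + threadIdx.x;
    int p_sum = 0;
    for (int i = 0; i < size; i++)
        p_sum += m[row*size + i] * n[i*size + column];
    p[row*size + column] = p_sum;
}

// fix 2: the host reference must multiply, matching the kernel:
//     p[i*size + j] += m[i*size + k] * n[k*size + j];
// fix 3: zero the host output before the += accumulation:
//     memset(h_p_seq, 0, bytes);
// fix 4: size the allocation by the element type:
//     size_t bytes = n*n*sizeof(int);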

Finding minimum distance between n points in cuda

I am trying to find the minimum distance between n points in CUDA. I wrote the code below. It works fine for point counts from 1 to 1024, i.e., one block. But if num_points is greater than 1024 I get a wrong value for the minimum distance. I check the GPU min value against the value found on the CPU using a brute-force algorithm.
The min value is stored in temp1[0] at the end of the kernel function.
I don't know what is wrong with this. Please help me out.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#define MAX_POINTS 50000

__global__ void minimum_distance(float * X, float * Y, float * D, int n) {
    __shared__ float temp[1024];
    float temp1[1024];
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int ref = tid + bid*blockDim.x;
    temp[ref] = 1E+37F;
    temp1[bid] = 1E+37F;
    float dx, dy;
    float Dij;
    int i;
    //each thread will take a point and find min dist to all
    //points greater than its unique id (ref)
    for (i = ref + 1; i < n; i++)
    {
        dx = X[ref] - X[i];
        dy = Y[ref] - Y[i];
        Dij = sqrtf(dx*dx + dy*dy);
        if (temp[tid] > Dij)
        {
            temp[tid] = Dij;
        }
    }
    __syncthreads();
    //In each block the min value is stored in temp[0]
    if (tid == 0)
    {
        if (bid == (n-1)/1024) {
            int end = n - (bid) * 1024;
            for (i = 1; i < end; i++)
            {
                if (temp[i] < temp[tid])
                    temp[tid] = temp[i];
            }
            temp1[bid] = temp[tid];
        }
        else {
            for (i = 1; i < 1024; i++)
            {
                if (temp[i] < temp[tid])
                    temp[tid] = temp[i];
            }
            temp1[bid] = temp[tid];
        }
    }
    __syncthreads();
    //Here the min value is stored in temp1[0]
    if (ref == 0)
    {
        for (i = 1; i <= (n-1)/1024; i++)
            if (temp1[bid] > temp1[i])
                temp1[bid] = temp1[i];
        *D = temp1[bid];
    }
}
//part of the main function
//kernel invocation: 1-D grid and block sizes
//Vx and Vy are arrays of x-coordinates and y-coordinates respectively
int main(int argc, char* argv[]) {
    .
    .
    blocks = (num_points-1)/1024 + 1;
    minimum_distance<<<blocks,1024>>>(Vx, Vy, dmin_dist, num_points);
    .
    .
I'd say what's wrong is your choice of algorithm. You can certainly do better than O(n^2) - even if yours is pretty straightforward. Sure, on 5,000 points it might not seem terrible, but try 50,000 points and you'll feel the pain...
I'd think about parallelizing the construction of a Voronoi Diagram, or maybe some kind of BSP-like structure which might be easier to query with less code divergence.
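To make that concrete, a hedged sketch of just the binning step of a uniform grid (a simpler spatial structure in the same spirit; the names and the choice of cell size are our own assumptions, and coordinates are assumed nonnegative): points are keyed by cell and sorted, after which each point only needs comparing against points in its own and the 8 neighboring cells.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Key each point by the uniform-grid cell containing it. `cell` is an
// assumed upper bound on the minimum distance (e.g. from sampling pairs).
__global__ void cell_ids(const float* X, const float* Y, int* cid,
                         int n, float cell, int cells_x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cid[i] = (int)(Y[i] / cell) * cells_x + (int)(X[i] / cell);
}

void bin_points(const float* d_X, const float* d_Y, int n,
                float cell, int cells_x) {
    thrust::device_vector<int> cid(n), order(n);
    thrust::sequence(order.begin(), order.end()); // original point indices
    cell_ids<<<(n + 255) / 256, 256>>>(
        d_X, d_Y, thrust::raw_pointer_cast(cid.data()), n, cell, cells_x);
    // After the sort, points sharing a cell are adjacent; a follow-up
    // kernel can restrict each point's search to its own and the 8
    // neighboring cells instead of all n points.
    thrust::sort_by_key(cid.begin(), cid.end(), order.begin());
}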

cuda multiplication

Serial code snippet looks like this:
int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
I converted this to CUDA using this kernel:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
If this is the serial code:
int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
then you should be doing this:
__global__ void fn(float *x, const float *y, int nx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int j = tid/nx, i = tid - j * nx;
    x[tid] *= y[i];
}

fn<<<nx*ny/B, B>>>(x, y, nx); // with B = 256, 512, etc.
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny, and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're actually doing the loop less efficiently than you did for the CPU version; in other words, you do the same thing in each thread, just less efficiently, than you are doing in 1 thread on the CPU. It's no wonder that this is slower; it should be much, much slower. Your CUDA kernel is:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Given your comment to this answer:
    nx * ny = 2205; so I used no. of blocks = (nx*ny+(threads-1))/threads and threads = 64.
which implies you intend to launch one thread per computation, the correct CUDA implementation would just be:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx;
int i = tid - j*nx;
if (tid < (nx*ny))
    x[tid] *= y[i];
If you were intending for each thread to perform more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU, rather than using the same number of threads as the input size, and then do something like:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int gsize = blockDim.x * gridDim.x;
int i, j;
for (; tid < nx*ny; tid += gsize)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but they would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. However, having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would require many orders of magnitude larger problems to realize useful speedup using the GPU for this sort of low-arithmetic-intensity operation.
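As one example of such an optimization, a hedged sketch (fn2d is our own name): with a 2-D launch, each thread derives i and j directly from its block and thread indices, so the per-element integer division disappears entirely.

__global__ void fn2d(float *x, const float *y, int nx, int ny)
{
    // 2-D indexing removes the per-element integer division
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < nx && j < ny)
        x[j * nx + i] *= y[i];
}

// launch:
// dim3 block(32, 8);
// dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
// fn2d<<<grid, block>>>(x, y, nx, ny);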
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.
Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
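A hedged sketch of benchmarking that way with CUDA events (the kernel fn and the launch dimensions stand in for your own):

// Warm-up launch absorbs the one-time JIT/setup cost before timing.
fn<<<grid, block>>>(x, y, nx);
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int r = 0; r < 100; r++)            // average over many runs
    fn<<<grid, block>>>(x, y, nx);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // total ms for the 100 runs
printf("avg kernel time: %f ms\n", ms / 100.0f);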
Try this version using shared memory; it is one of the better-known implementations, from the CUDA Programming Guide:
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; // In number of elements
    float *elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE; Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Same as in previous example, except the following:
    // d_A.width = d_A.stride = A.width;
    // d_B.width = d_B.stride = B.width;
    // d_C.width = d_C.stride = C.width;
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub.
    // Multiply each pair of sub-matrices together
    // and accumulate the results.
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
    {
        // Get sub-matrix Asub of A and Bsub of B
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory;
        // each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory;
    // each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
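For completeness, a sketch of the elided MatMul host code, modeled on the corresponding CUDA Programming Guide example (the "previous example" the comment refers to is not part of this post):

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Copy A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A.elements); cudaFree(d_B.elements); cudaFree(d_C.elements);
}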

Errors in Polynomial fitting problem on CUDA

I tried to use CUDA to run some simple loops on the device, but it seems that CUDA is hard to understand. I am getting 0 from every function call when I use a CUDA kernel function with normal C code.
The original code:
double evaluate(int D, double tmp[], long *nfeval)
{
    /* polynomial fitting problem */
    int i, j;
    int const M = 60;
    double px, x = -1, dx = (double)M, result = 0;
    (*nfeval)++;
    dx = 2/dx;
    for (i = 0; i <= M; i++)
    {
        px = tmp[0];
        for (j = 1; j < D; j++)
        {
            px = x*px + tmp[j];
        }
        if (px < -1 || px > 1) result += (1-px)*(1-px);
        x += dx;
    }
    px = tmp[0];
    for (j = 1; j < D; j++) px = 1.2*px + tmp[j];
    px = px - 72.661;
    if (px < 0) result += px*px;
    px = tmp[0];
    for (j = 1; j < D; j++) px = -1.2*px + tmp[j];
    px = px - 72.661;
    if (px < 0) result += px*px;
    return result;
}
I wanted to do the first for loop on CUDA:
double evaluate_gpu(int D, double tmp[], long *nfeval)
{
    /* polynomial fitting problem */
    int j;
    int const M = 60;
    double px, dx = (double)M, result = 0;
    (*nfeval)++;
    dx = 2/dx;
    int N = M;
    double *device_tmp = NULL;
    size_t size_tmp = sizeof tmp;
    cudaMalloc((double **) &device_tmp, size_tmp);
    cudaMemcpy(device_tmp, tmp, size_tmp, cudaMemcpyHostToDevice);
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    cEvaluate <<< n_blocks, block_size >>> (device_tmp, result, D);
    //cudaMemcpy(result, result, size_result, cudaMemcpyDeviceToHost);
    px = tmp[0];
    for (j = 1; j < D; j++) px = 1.2*px + tmp[j];
    px = px - 72.661;
    if (px < 0) result += px*px;
    px = tmp[0];
    for (j = 1; j < D; j++) px = -1.2*px + tmp[j];
    px = px - 72.661;
    if (px < 0) result += px*px;
    return result;
}
Where the device function looks like:
__global__ void cEvaluate_temp(double* tmp, double result, int D)
{
    int M = 60;
    double px;
    double x = -1;
    double dx = (double)M;
    int j;
    dx = 2/dx;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 60) //<==> if (idx < M)
    {
        px = tmp[0];
        for (j = 1; j < D; j++)
        {
            px = x*px + tmp[j];
        }
        if (px < -1 || px > 1)
        {
            __syncthreads();
            result += (1-px)*(1-px);
        }
        x += dx;
    }
}
I know that I have not specified the problem precisely, but it seems that I have more than one.
I do not know when to copy a variable to the device, and when it will be copied 'automatically'.
I am using CUDA 3.2 and there is a problem with emulation (I would like to use printf): when I run NVCC with make emu=1, there is no error when I use printf, but I also do not get any output.
Below is the simplest version of the device function I tested. Can anybody explain what will happen to the result value after incrementing it in parallel? I think I should use device shared memory and synchronization to do something like "+=".
__global__ void cEvaluate(double* tmp, double result, int D)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 60) //<==> if (idx < M)
    {
        result += 1;
        printf("res = %f ", result); //-deviceemu, make emu=1
    }
}
No, the variable result is not shared across multiple threads.
What I would suggest is to have an array of result values in shared memory, one result for each thread, compute every value, and then reduce it to a single value:
#define BLOCKSIZE 4  // must match the block size used at launch

__global__ void cEvaluate_temp(double* tmp, double* global_result, int D)
{
    const int M = 60;
    __shared__ double shared_result[BLOCKSIZE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double contribution = 0.0;
    if (idx <= M)  // the serial loop runs M+1 times (i = 0..M)
    {
        double dx = 2.0 / (double)M;
        double x = -1.0 + idx * dx;  // each thread evaluates one sample point
        double px = tmp[0];
        for (int j = 1; j < D; j++)
            px = x * px + tmp[j];
        if (px < -1 || px > 1)
            contribution = (1 - px) * (1 - px);
    }
    shared_result[threadIdx.x] = contribution;
    __syncthreads();  // no early return above, so every thread reaches this
    if (threadIdx.x == 0)
    {
        double total_result = 0.0;
        for (int t = 0; t < blockDim.x; t++)
            total_result += shared_result[t];
        // one partial sum per block; the host (or a second kernel) adds the
        // partials, which avoids a race on a single global value
        global_result[blockIdx.x] = total_result;
    }
}
Also, you need a cudaMemcpy after the kernel invocation to copy the result back. Kernel launches are asynchronous and need a synchronization point; a blocking cudaMemcpy provides one.
Also use error-check functions at each CUDA API invocation.
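For the error checking, a minimal helper macro of the kind commonly used (the name CUDA_CHECK is our own):

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage:
// CUDA_CHECK(cudaMalloc((void**)&device_tmp, size_tmp));
// CUDA_CHECK(cudaMemcpy(device_tmp, tmp, size_tmp, cudaMemcpyHostToDevice));
// kernel<<<n_blocks, block_size>>>(device_tmp, device_result, D);
// CUDA_CHECK(cudaGetLastError());  // catches kernel launch errors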

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);
    //Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    //Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed: a 4x4 grid of 4x4 blocks covers only a 16x16 region, so for any larger Width most of the output matrix was never computed.
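For reference, a minimal sketch of the fix under the usual one-thread-per-output-element mapping: derive the grid from Width and guard the write against the rounding overshoot.

// Round up so dimGrid * dimBlock covers the whole Width x Width output.
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// and inside the kernel, guard the write for the rounded-up case:
// if (row < Width && col < Width)
//     Pd[row * Width + col] = Pvalue;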