Issue regarding data of constant memory in CUDA

I have a CUDA application where I am trying to use constant memory. When I write the kernel in the same file as the main function, the data in constant memory is recognized inside the kernel. But if I declare the kernel function in some other file, the constant memory reads as 0 and the operation does not work properly. I am providing simple dummy code which explains the problem more easily. The program has a 48x48 matrix divided into 16x16 blocks, and I store random numbers from 1 to 50 in it. Inside the kernel I add the numbers stored in constant memory to each row of a block. The code is given below:
Header File:
#include <windows.h>
#include <dos.h>
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <curand.h>
#include <curand_kernel.h>
__constant__ int test_cons[16];
__global__ void test_kernel_1(int *,int *);
Main Program :
int main(int argc, char *argv[])
{
    int *mat, *dev_mat, *res, *dev_res;
    int i, j;
    int test[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};

    cudaMemcpyToSymbol(test_cons, test, 16*sizeof(int));

    mat = (int *)malloc(48*48*sizeof(int));
    res = (int *)malloc(48*48*sizeof(int));
    memset(res, 0, 48*48*sizeof(int));

    srand(time(NULL));
    for (i = 0; i < 48; i++)
    {
        for (j = 0; j < 48; j++)
        {
            mat[i*48+j] = rand()%(50-1)+1;
            printf("%d\t", mat[i*48+j]);
        }
        printf("\n");
    }

    cudaMalloc((void **)&dev_mat, 48*48*sizeof(int));
    cudaMemcpy(dev_mat, mat, 48*48*sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&dev_res, 48*48*sizeof(int));

    dim3 gridDim(48/16, 48/16, 1);
    dim3 blockDim(16, 16, 1);
    test_kernel_1<<<gridDim, blockDim>>>(dev_mat, dev_res);

    cudaMemcpy(res, dev_res, 48*48*sizeof(int), cudaMemcpyDeviceToHost);

    printf("\n\n\n\n");
    for (i = 0; i < 48; i++)
    {
        for (j = 0; j < 48; j++)
        {
            printf("%d\t", res[i*48+j]);
        }
        printf("\n");
    }

    cudaFree(dev_mat);
    cudaFree(dev_res);
    free(mat);
    free(res);
    exit(0);
}
Kernel Function :
__global__ void test_kernel_1(int *dev_mat, int *dev_res)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}
When I declare the kernel function in the same file as the main program, the constant memory values are correct; if it is in a different file, test_cons[threadIdx.x] reads as 0.
I came across this link which discusses more or less the same problem, but I am not following it properly. It would be very helpful if someone could tell me why this is happening and what I need to do to avoid this problem. Any sort of help would be highly appreciated. Thanks.

I just recently answered a similar question here.
CUDA can handle code that references device code (entry points) or symbols in other files, but it requires separate compilation with device linking (as described and linked in the answer above). (Separate compilation/linking requires compute capability 2.0 or greater.)
So if you modify the link steps, you can have your __constant__ variable in a given file and reference it from a different file.
If not (if you don't specify separate compilation and device linking), then the device code that references the __constant__ variable, the host code that references the __constant__ variable, and the definition/declaration of the variable itself all need to be in the same file.
So this:
__constant__ int test_cons[16];
This:
cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));
And this:
dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
all need to be in the same file.
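For reference, a build along these lines (a sketch only; the file names main.cu and kernel.cu are hypothetical, and the exact options depend on your toolchain version) enables relocatable device code and device linking:
$ nvcc -arch=sm_20 -dc main.cu kernel.cu
$ nvcc -arch=sm_20 main.o kernel.o -o app
The -dc option compiles each file to an object containing relocatable device code (it is shorthand for -rdc=true -c); when nvcc is then used for the final link, it performs the device-link step itself.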

The above answer is totally acceptable; I am adding this since the user was not able to make it work. You can accept the above answer, this is just for your reference.
Kernel.cu file:
#include <stdio.h>

__constant__ int test_cons[16];

void copymemory(int *test)
{
    cudaMemcpyToSymbol(test_cons, test, 16*sizeof(int));
}

__global__ void test_kernel_1(int *dev_mat, int *dev_res)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)
    {
        printf("testcons[0] is %d\n", test_cons[threadIdx.x]);
    }
    dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}
simple.cu file:
#include <stdio.h>
#include <stdlib.h>   // malloc/free, rand/srand
#include <string.h>   // memset
#include <time.h>     // time
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>

void copymemory(int *temp);
__global__ void test_kernel_1(int *, int *);

int main(int argc, char *argv[])
{
    int *mat, *dev_mat, *res, *dev_res;
    int i, j;
    int test[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};

    mat = (int *)malloc(48*48*sizeof(int));
    res = (int *)malloc(48*48*sizeof(int));
    memset(res, 0, 48*48*sizeof(int));

    copymemory(test);

    srand(time(NULL));
    for (i = 0; i < 48; i++)
    {
        for (j = 0; j < 48; j++)
        {
            mat[i*48+j] = rand()%(50-1)+1;
            //printf("%d\t", mat[i*48+j]);
        }
        //printf("\n");
    }

    cudaMalloc((void **)&dev_mat, 48*48*sizeof(int));
    cudaMemcpy(dev_mat, mat, 48*48*sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&dev_res, 48*48*sizeof(int));

    dim3 gridDim(48/16, 48/16, 1);
    dim3 blockDim(16, 16, 1);
    test_kernel_1<<<gridDim, blockDim>>>(dev_mat, dev_res);

    cudaMemcpy(res, dev_res, 48*48*sizeof(int), cudaMemcpyDeviceToHost);
    for (i = 0; i < 48; i++)
    {
        for (j = 0; j < 48; j++)
        {
            //printf("%d\t", res[i*48+j]);
        }
        //printf("\n");
    }

    cudaFree(dev_mat);
    cudaFree(dev_res);
    free(mat);
    free(res);
    exit(0);
}
I have commented out your printf calls. The printf in the kernel prints the value 1. I also tested by changing the value of test[0] in the main function, and it works perfectly.

Related

Need help optimizing thrust cuda code with nested iterator transform_reduce operations

I am working on code I would like to execute efficiently on a GPU. Most of the code has been easy to vectorize and prepare for parallel execution, and there are several nice examples on Stack Overflow that have helped me with the standard nested iterators. However, there is one section I have not been able to condense into an efficient thrust construct. I have taken that section of my code and made a minimal reproducible example. Any advice or hint on how to structure this code would be appreciated.
Thanks
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <ctime>
#include <cstdlib>            // rand/srand
#include <cmath>              // fmod, fabs, fmin on the host path
#include <thrust/reduce.h>
#include <thrust/extrema.h>   // thrust::min_element
#include <thrust/device_vector.h>

typedef thrust::device_vector<double> tDoubleVecDevice;
typedef tDoubleVecDevice::iterator tDoubleVecDeviceIter;

struct functorB {
    template <typename T>
    __host__ __device__
    double operator()(const T &my_tuple) { // do some math
        return fmod(thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple), 1.0);
    }
};

struct functorC {
    template <typename T>
    __host__ __device__
    double operator()(const T &my_tuple) { // do some math
        double distance = fabs(fmod(thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple), 1.0));
        return fmin(distance, 1.0 - distance) / 5.0;
    }
};

int main(void)
{
    tDoubleVecDevice resF(36);
    tDoubleVecDevice freqI(36);
    tDoubleVecDevice trialTs(128);

    std::srand(std::time(nullptr));
    for (tDoubleVecDeviceIter tIter = trialTs.begin(); tIter < trialTs.end(); tIter++) {
        (*tIter) = rand() % 10 + 1.5; // make some random numbers
    }
    // note: the loop condition originally compared fIter against resF.end();
    // fixed here to compare against the container fIter actually iterates over
    for (tDoubleVecDeviceIter rIter = resF.begin(), fIter = freqI.begin(); fIter < freqI.end(); rIter++, fIter++) {
        (*fIter) = rand() % 10 + 1.5; // make some random numbers
        (*rIter) = rand() % 10 + 1.5; // make some random numbers
    }

    tDoubleVecDevice trialRs(36);
    tDoubleVecDevice errorVect(128);

    for (tDoubleVecDeviceIter itTrial = trialTs.begin(), itError = errorVect.begin(); itTrial != trialTs.end(); itTrial++, itError++) {
        thrust::transform(
            thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial), freqI.begin())),
            thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial) + 36, freqI.end())),
            trialRs.begin(), functorB());

        (*itError) = thrust::transform_reduce(
            thrust::make_zip_iterator(thrust::make_tuple(trialRs.begin(), resF.begin())),
            thrust::make_zip_iterator(thrust::make_tuple(trialRs.end(), resF.end())),
            functorC(), (double)0, thrust::plus<double>());
    }

    // finds the index of the minimum element
    int minElementIndex = thrust::min_element(errorVect.begin(), errorVect.end()) - errorVect.begin();
    double result = trialTs[minElementIndex];
    std::cout << "result = " << result;
    return 0;
}
It looks like you need to expand your trialTs, trialRs, errorVect, freqI and resF vectors to 4608 elements. This will allow you to vectorize the loops. Derive a class from thrust::iterator_adaptor to make a cyclic iterator that expands freqI and resF into repeated sequences of the data in those vectors.
After you run your functors, use a reduce-by-key transform to produce your error result for each 36-element trial.
Give that a try, and if you get stuck I will provide some additional code.
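A minimal sketch of that flattened layout, assuming the sizes from the question (128 trials x 36 frequencies = 4608 elements); it substitutes thrust::permutation_iterator over a counting iterator for the hand-written iterator_adaptor, and fuses functorB and functorC into a single pass:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <cmath>

const int N_TRIALS = 128, N_FREQ = 36; // 128 * 36 = 4608 flattened work items

struct mod_n {
    typedef int result_type; // for older thrust versions
    int n;
    mod_n(int n) : n(n) {}
    __host__ __device__ int operator()(int i) const { return i % n; } // cyclic index
};
struct div_n {
    typedef int result_type;
    int n;
    div_n(int n) : n(n) {}
    __host__ __device__ int operator()(int i) const { return i / n; } // trial index
};

// functorB and functorC from the question, fused into one element-wise pass
struct functorBC {
    template <typename T>
    __host__ __device__ double operator()(const T &t) const {
        double r = fmod(thrust::get<0>(t) * thrust::get<1>(t), 1.0); // functorB
        double d = fabs(fmod(r - thrust::get<2>(t), 1.0));           // functorC
        return fmin(d, 1.0 - d) / 5.0;
    }
};

void compute_errors(const thrust::device_vector<double> &trialTs,
                    const thrust::device_vector<double> &freqI,
                    const thrust::device_vector<double> &resF,
                    thrust::device_vector<double> &errorVect)
{
    const int N = N_TRIALS * N_FREQ;
    thrust::counting_iterator<int> idx(0);
    // cyclic views of freqI/resF (repeat every 36 elements) and an expanded
    // view of trialTs (each trial value repeated 36 times)
    auto cycF = thrust::make_permutation_iterator(freqI.begin(),
                    thrust::make_transform_iterator(idx, mod_n(N_FREQ)));
    auto cycR = thrust::make_permutation_iterator(resF.begin(),
                    thrust::make_transform_iterator(idx, mod_n(N_FREQ)));
    auto expT = thrust::make_permutation_iterator(trialTs.begin(),
                    thrust::make_transform_iterator(idx, div_n(N_FREQ)));

    thrust::device_vector<double> dist(N);
    auto first = thrust::make_zip_iterator(thrust::make_tuple(expT, cycF, cycR));
    thrust::transform(first, first + N, dist.begin(), functorBC());

    // one sum per trial: the key for flattened element i is i / 36
    thrust::reduce_by_key(thrust::make_transform_iterator(idx, div_n(N_FREQ)),
                          thrust::make_transform_iterator(idx, div_n(N_FREQ)) + N,
                          dist.begin(),
                          thrust::make_discard_iterator(),
                          errorVect.begin());
}
errorVect then holds one accumulated error per trial, so the thrust::min_element search from the original code works unchanged.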

Multiple occurrence subvector search with cuda Thrust

I want to find occurrences of a subvector in a device vector on the GPU, with the thrust library.
Say for an array str = "aaaabaaab", I need to find the occurrences of the substring substr = "ab".
How shall I use the thrust::find function to search for a subvector?
In a nutshell, how shall I implement a string search algorithm with the thrust library?
I would agree with the comments that thrust doesn't provide a single function that does this in "typical thrust fashion", and you would not want to use a sequence of thrust functions (e.g. a loop), as that would likely be quite inefficient.
A fairly simple CUDA kernel can be written that does this in a brute-force fashion.
For relatively simple CUDA kernels, we can realize something equivalent in thrust in an "un-thrust-like" fashion, by simply passing the CUDA kernel code as a functor to a thrust per-element operation such as thrust::transform or thrust::for_each.
Here is an example:
$ cat t462.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

struct my_f
{
    char *array, *string;
    size_t arr_len;
    int str_len;
    my_f(char *_array, size_t _arr_len, char *_string, int _str_len) :
        array(_array), arr_len(_arr_len), string(_string), str_len(_str_len) {}
    __host__ __device__
    bool operator()(size_t idx){
        for (int i = 0; i < str_len; i++)
            if ((i+idx) >= arr_len) return false;
            else if (array[i+idx] != string[i]) return false;
        return true;
    }
};

int main(){
    char data[] = "aaaabaaab";
    char str[] = "ab";
    size_t data_len = sizeof(data)-1;
    int str_len = sizeof(str)-1;
    thrust::device_vector<char> d_data(data, data+data_len);
    thrust::device_vector<char> d_str(str, str+str_len);
    thrust::device_vector<bool> result(data_len);
    thrust::transform(thrust::counting_iterator<size_t>(0),
                      thrust::counting_iterator<size_t>(data_len),
                      result.begin(),
                      my_f(thrust::raw_pointer_cast(d_data.data()), data_len,
                           thrust::raw_pointer_cast(d_str.data()), str_len));
    thrust::copy(result.begin(), result.end(), std::ostream_iterator<bool>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t462 t462.cu
$ ./t462
0,0,0,1,0,0,0,1,0,
$
Whether or not such a "brute-force" approach is efficient for this type of problem, I don't know. There are probably better/more efficient methods, especially when searching for occurrences of longer strings.
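If the match positions themselves are wanted rather than the flag array, a stencil-based thrust::copy_if over a counting iterator can gather them; a short sketch, reusing result and data_len from the example above (thrust::identity lives in <thrust/functional.h>):
thrust::device_vector<size_t> positions(data_len);
size_t n_matches = thrust::copy_if(thrust::counting_iterator<size_t>(0),
                                   thrust::counting_iterator<size_t>(data_len),
                                   result.begin(),          // stencil: true where a match starts
                                   positions.begin(),
                                   thrust::identity<bool>()) - positions.begin();
positions.resize(n_matches);                                // holds {3, 7} for this input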

Pointers in structs changed in kernel function

I am trying to access data from an array allocated in CUDA. The first step was to allocate a struct defined by me. Then I pass the allocated struct to a kernel function that changes the values in the struct. Finally, I copy the struct and the array back to host variables so I can read them. But I am having a problem reading the allocated array.
#include <stdio.h>
#include <stdlib.h>

typedef struct x{
    float *y;
    float v;
} x_t;

__global__ void initTeste(x_t *param){
    param->v = 10;
    param->y[0] = 10;
    param->y[1] = 10;
}

int main(void) {
    x_t *hvar;
    x_t hvarBackup;
    float *temp = (float*)malloc(10*sizeof(float));
    cudaError_t result;
    cudaMalloc(&hvar, sizeof(x_t));
    cudaMalloc(&hvarBackup.y, 10*sizeof(float));
    cudaMemcpy(hvar, &hvarBackup, sizeof(x_t), cudaMemcpyHostToDevice);
    initTeste<<<1,1>>>(hvar);
    cudaMemcpy(&hvarBackup, hvar, sizeof(x_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(temp, &hvar->y, 10*sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f", (hvarBackup.v)); //here ok
    printf("%f", (temp[0]));      //here's the problem
    return 0;
}
You cannot do it like that, because you haven't properly allocated y for the device struct; copying from the y content to the host will just give you a segmentation fault. Aside from that, you have to allocate y for the device with the amount of 10*sizeof(float), and this is truthfully a pain in the a** of a job, especially when your struct becomes a huge container of arrays (and you should always know that arrays inside structs are best avoided in CUDA).
Here's what you can do with the current code:
int main(void) {
    x_t *h_hvar = (x_t*)malloc(sizeof(x_t));
    x_t *d_hvar;
    float *h_y = (float*)malloc(10*sizeof(float));
    float *d_y;
    cudaMalloc(&d_hvar, sizeof(x_t));
    cudaMalloc(&d_y, 10*sizeof(float));
    // Insert the float pointer you allocated in CUDA
    // into the host struct first, and then copy the whole thing
    // to the device area
    h_hvar->y = d_y;
    cudaMemcpy(d_hvar, h_hvar, sizeof(x_t), cudaMemcpyHostToDevice);
    initTeste<<<1,1>>>(d_hvar);
    cudaMemcpy(h_hvar, d_hvar, sizeof(x_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_y, d_y, 10*sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f", h_hvar->v);
    printf("%f", h_y[0]);
    return 0;
}
And that should give you the right values.
cudaMemcpy(temp, &hvar->y, 10*sizeof(float), cudaMemcpyDeviceToHost);
should be
cudaMemcpy(temp, hvar->y, 10*sizeof(float), cudaMemcpyDeviceToHost);
because hvar->y is already a pointer and you don't want the pointer to that pointer.
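Note, though, that hvar itself is a device pointer, so even the expression hvar->y dereferences device memory from host code. The struct has to be brought back to the host first; in the question's code hvarBackup already serves that purpose, so a sketch of the safe pattern is:
cudaMemcpy(&hvarBackup, hvar, sizeof(x_t), cudaMemcpyDeviceToHost);       // already done in the question
cudaMemcpy(temp, hvarBackup.y, 10*sizeof(float), cudaMemcpyDeviceToHost); // y is a device pointer stored on the host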

How to call a CUDA kernel from inside a class containing device member variables

I want to use CUDA 5.0 linking to write re-usable CUDA objects. I've set up this simple test, but my kernel fails silently (it runs without error or exception and outputs junk).
My simple test (below) allocates an array of integers in CUDA device memory. The CUDA kernel should populate the array with sequential entries (0,1,2,...,9). The device array is copied to CPU memory and output to the console.
Currently, this code outputs "0,0,0,0,0,0,0,0,0," instead of the desired "0,1,2,3,4,5,6,7,8,9,". It is compiled using VS2010 and CUDA 5.0 (with compute_35 and sm_35 set), running on Win7 64-bit with a GeForce 580.
In Test.h:
class Test
{
public:
    Test();
    ~Test();
    void Run();
private:
    int* cuArray;
};
In Test.cu:
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "Test.h"

#define ARRAY_LEN 10

__global__ void kernel(int *p)
{
    int elemID = blockIdx.x * blockDim.x + threadIdx.x;
    p[elemID] = elemID;
}

Test::Test()
{
    cudaMalloc(&cuArray, ARRAY_LEN * sizeof(int));
}

Test::~Test()
{
    cudaFree(cuArray);
}

void Test::Run()
{
    kernel<<<1,ARRAY_LEN>>>(cuArray);
    // Copy the array contents to CPU-accessible memory
    int cpuArray[ARRAY_LEN];
    cudaMemcpy(static_cast<void*>(cpuArray), static_cast<void*>(cuArray),
               ARRAY_LEN * sizeof(int), cudaMemcpyDeviceToHost);
    // Write the array contents to console
    for (int i = 0; i < ARRAY_LEN; ++i)
        printf("%d,", cpuArray[i]);
    printf("\n");
}
In main.cpp:
#include <iostream>
#include "Test.h"

int main()
{
    Test t;
    t.Run();
}
I've experimented with the DECLs (__device__ __host__) as suggested by @harrism, but to no effect.
Can anyone suggest how to make this work? (The code works when it isn't inside a class.)
The device you are using is a GTX 580, whose compute capability is 2.0. If you compile the code for any architecture greater than 2.0, the kernel will not run on your device, and the output will be garbage. Compile the code for compute capability 2.0 or lower and it will run fine.
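For example, on the nvcc command line (in VS2010 the same thing is set via the Code Generation property under the CUDA C/C++ project settings; the output name here is hypothetical), targeting a compute capability 2.0 device looks like:
$ nvcc -gencode arch=compute_20,code=sm_20 Test.cu main.cpp -o test
or, as a shorthand, -arch=sm_20.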

Error in the result of matrix multiplication example of CUDA C programming guide

I'm doing the matrix multiplication example from the CUDA C Programming Guide, page 35, for practice. I copied the code and completed the missing parts. I understand the logic of the program and how it should work, but I do not get the expected result.
Here is the complete code I made. I do not know if the error is mine or from the example.
The code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>
#include <stdio.h>
using namespace std;
#define BLOCK_SIZE 16
typedef struct
{
int width;
int height;
float *elements;
}Matrix;
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix C);

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    size_t size;
    // Matrix A creation and storage in device memory
    Matrix d_A;
    d_A.width = A.width;
    d_A.height = A.height;
    size = A.height*A.width*sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    // Matrix B creation and storage in device memory
    Matrix d_B;
    d_B.width = B.width;
    d_B.height = B.height;
    size = B.height*B.width*sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    // Matrix C creation and storage in device memory
    Matrix d_C;
    d_C.width = C.width;
    d_C.height = C.height;
    size = C.height*C.width*sizeof(float);
    cudaMalloc(&d_C.elements, size);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width/dimBlock.x, A.height/dimBlock.y);
    MatMulKernel<<<dimGrid,dimBlock>>>(d_A, d_B, d_C);

    // Copy the result matrix C from the device to the host.
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    // edit: the missing code.
    // for(int i=0;i<BLOCK_SIZE*BLOCK_SIZE;i++){cout<<C.elements[i]<<endl;}
    // results in random numbers

    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    float Cvalue = 0;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
    {
        Cvalue += A.elements[row*A.width+e] * B.elements[e*B.width+col];
    }
    C.elements[row*C.width+col] = Cvalue;
}
int main()
{
    cout << "Matrices" << endl;
    // Declaration of the A, B, C matrices
    float a[15][15];
    float b[15][15];
    float c[15][15];
    // Fill the matrices with some numbers.
    int cont0 = 0;
    for (int c = 0; c < 15; c++)
    {
        for (int v = 0; v < 15; v++)
        {
            a[v][c] = cont0;
            b[v][c] = cont0;
            cont0++;
        }
    }
    // Flatten the matrices for passing to the kernel
    int offset = 0;
    float a_t[256];
    float b_t[256];
    for (int y = 0; y < 15; y++)
    {
        for (int x = 0; x < 15; x++)
        {
            a_t[x+offset] = a[x][y];
            b_t[x+offset] = a[x][y];
        }
        offset = offset + 15;
    }
    float t_C[256];
    // Completing the matrix format for the kernel.
    Matrix m_A;
    m_A.height = 15;
    m_A.width = 15;
    m_A.elements = a_t;
    Matrix m_B;
    m_B.height = 15;
    m_B.width = 15;
    m_B.elements = b_t;
    Matrix m_C;
    m_C.height = 15;
    m_C.width = 15;
    m_C.elements = t_C;
    // Passing the formatted matrices to the kernel.
    MatMul(m_A, m_B, m_C);
    cout << "Final" << endl;
    return 0;
}
The program compiles and runs, but the result matrix C.elements from cudaMemcpy(C.elements,d_C.elements,size,cudaMemcpyDeviceToHost); is random numbers. I've tried to use it like a pointer to an array, but I don't get anything from it, and treating it like an array does not work either.
I would be glad if anyone can help me finish this.
Your code has a minor mismatch between the array indexing in the kernel and the initialization on the CPU. Here is the corrected code with the debugging suggested by @harrism:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>
#include <stdio.h>
using namespace std;
#define BLOCK_SIZE 16
typedef struct
{
int width;
int height;
float *elements;
}Matrix;
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix C);

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    size_t size;
    // Matrix A creation and storage in device memory
    Matrix d_A;
    d_A.width = A.width;
    d_A.height = A.height;
    size = A.height*A.width*sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    // Matrix B creation and storage in device memory
    Matrix d_B;
    d_B.width = B.width;
    d_B.height = B.height;
    size = B.height*B.width*sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    // Matrix C creation and storage in device memory
    Matrix d_C;
    d_C.width = C.width;
    d_C.height = C.height;
    //cudaMalloc(&d_C,sizeof(Matrix));
    //cudaMemcpy(d_C,C,sizeof(Matrix),cudaMemcpyHostToDevice);
    size = C.height*C.width*sizeof(float);
    cudaMalloc(&d_C.elements, size);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width/dimBlock.x, A.height/dimBlock.y);
    MatMulKernel<<<dimGrid,dimBlock>>>(d_A, d_B, d_C);

    // Copy the result matrix C from the device to the host.
    printf("error code: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    //printf("%d\n",threadIdx.x);
    float Cvalue = 0;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
    {
        Cvalue += A.elements[row*A.width+e] * B.elements[e*B.width+col];
    }
    C.elements[row*C.width+col] = Cvalue;
}
void print_matrix(Matrix A){
    printf("Matrix:\n");
    int i;
    for (i = 0; i < A.width*A.height; i++){
        if (i % A.width == 0) printf("\n");
        printf("%6.4f\t", A.elements[i]);
    }
    printf("\n");
}
int main()
{
    cout << "Matrices" << endl;
    // Declaration of the A, B, C matrices
    float a[BLOCK_SIZE][BLOCK_SIZE];
    float b[BLOCK_SIZE][BLOCK_SIZE];
    float c[BLOCK_SIZE][BLOCK_SIZE];
    // Fill the matrices with some numbers.
    int cont0 = 0;
    for (int c = 0; c < BLOCK_SIZE; c++)
    {
        for (int v = 0; v < BLOCK_SIZE; v++)
        {
            a[v][c] = cont0;
            b[v][c] = cont0;
            cont0++;
        }
    }
    // Flatten the matrices for passing to the kernel
    int offset = 0;
    float a_t[BLOCK_SIZE*BLOCK_SIZE];
    float b_t[BLOCK_SIZE*BLOCK_SIZE];
    for (int y = 0; y < BLOCK_SIZE; y++)
    {
        for (int x = 0; x < BLOCK_SIZE; x++)
        {
            a_t[x+offset] = a[x][y];
            b_t[x+offset] = a[x][y];
        }
        offset = offset + BLOCK_SIZE;
    }
    float t_C[BLOCK_SIZE*BLOCK_SIZE];
    // Completing the matrix format for the kernel.
    Matrix m_A;
    m_A.height = BLOCK_SIZE;
    m_A.width = BLOCK_SIZE;
    m_A.elements = a_t;
    Matrix m_B;
    m_B.height = BLOCK_SIZE;
    m_B.width = BLOCK_SIZE;
    m_B.elements = b_t;
    Matrix m_C;
    m_C.height = BLOCK_SIZE;
    m_C.width = BLOCK_SIZE;
    m_C.elements = t_C;
    // Passing the formatted matrices to the kernel.
    print_matrix(m_A);
    print_matrix(m_B);
    MatMul(m_A, m_B, m_C);
    print_matrix(m_C);
    cout << "Final" << endl;
    return 0;
}
Check the output. If you see that the results are wrong, check the kernel error reported in the output on your system.
Firstly, see here for how to get useful answers to your questions. In particular, you should always check the return value of your CUDA API calls and kernel launches. Also, running cuda-memcheck can often be very helpful to detect out-of-bounds accesses like this.
@harrism asked how you know the result is wrong, since you don't appear to do anything with it.
But more importantly, you have 15x15 matrices being computed with a 16x16 threadblock, and you're not taking care to disable the out-of-bounds threads. Since you're trying to create a simple example, just increase the matrix size to 16x16; if you want to handle odd sizes, you'll need to implement the control logic (or use cuBLAS!).
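A minimal sketch of that control logic, assuming the Matrix struct and launch code from the question: round the grid dimensions up so partial blocks get launched, and have each thread check its coordinates before touching memory.
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (row >= C.height || col >= C.width) return; // disable out-of-bounds threads
    float Cvalue = 0;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row*A.width+e] * B.elements[e*B.width+col];
    C.elements[row*C.width+col] = Cvalue;
}
// ...and in MatMul, round the grid up instead of truncating:
// dim3 dimGrid((B.width + dimBlock.x - 1)/dimBlock.x,
//              (A.height + dimBlock.y - 1)/dimBlock.y);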