CUDA matrix multiplication: wrong result

This is my code for matrix multiplication, but when I run it I get the correct result for the first row and wrong results for the second and third rows (mostly large negative numbers). This is my first program, so I used some code that I found on the net.
#include <iostream>
__global__ void MnozenjeMatrica(int* d_c, int* d_a, int* d_b)
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int d = 0;
for(int i=0; i<3; i++)
{
int x = d_a[row * 3 + i];
int y = d_b[i * 3 + col];
d += x * y;
}
d_c[row * 3 + col] = d;
}
int main()
{
const int SIZE = 9 * sizeof(int);
int a[3][3] = {{2, 4, 6}, {1, 3, 5}, {8, 4, 1}};
int b[3][3] = {{5, 8, 34}, {5, 7, 5}, {1, 4, 31}};
int c[3][3] = {{5, 8, 34}, {5, 7, 5}, {1, 4, 31}};
int* d_a;
int* d_b;
int* d_c;
cudaMalloc((void**) &d_a, SIZE);
cudaMalloc((void**) &d_b, SIZE);
cudaMalloc((void**) &d_c, SIZE);
cudaMemcpy(d_a, a, SIZE, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, SIZE, cudaMemcpyHostToDevice);
MnozenjeMatrica<<<3, 3>>>(d_c, d_a, d_b);
cudaMemcpy(c, d_c, SIZE, cudaMemcpyDeviceToHost);
for(int i=0; i<3; i++)
{
for(int j=0; j<3; j++)
{
printf("%d\t", c[i][j]);
}
printf("\n");
}
}

Completely agree with @talonmies.
More suggestions:
There are plenty of questions about CUDA matrix multiplication already posted; you might take a look at some of those to get some ideas.
You're not doing any CUDA error checking on the kernel launch or on the CUDA API calls, although it is recommended.
You might try running your code with cuda-memcheck and see what it says.
You could debug this kernel pretty quickly with a few well-placed printf statements. This is mostly C code after all, so consider using basic C troubleshooting techniques.
Since I was able to spot this quickly, I can tell you that your kernel depends on a 2-D threadblock structure to do anything useful:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
But you are launching a 1-D grid of 1-D threadblocks:
MnozenjeMatrica<<<3, 3>>>(d_c, d_a, d_b);
                  ^  ^
                  |  1-D threadblock (3 threads)
                  1-D grid (3 blocks)
So I'm not surprised it only works for a single row: with this launch, blockIdx.y and threadIdx.y are always 0, so row is always 0.
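To see which cells each launch shape reaches, the index arithmetic can be replayed on the host. The following is a minimal sketch in plain C++ (no GPU involved); `Dim3` and `launchedCells` are hypothetical helpers, not CUDA APIs, and they only mimic how the kernel derives `row` and `col`, with unspecified dimensions defaulting to 1 as in CUDA.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for CUDA's dim3: unspecified components default to 1.
struct Dim3 {
    int x, y, z;
    Dim3(int x_ = 1, int y_ = 1, int z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Replays the kernel's index arithmetic for every thread of a launch and
// returns the (row, col) pair each thread would compute.
std::vector<std::pair<int, int>> launchedCells(Dim3 grid, Dim3 block) {
    std::vector<std::pair<int, int>> cells;
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int row = by * block.y + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    int col = bx * block.x + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    cells.push_back({row, col});
                }
    return cells;
}
```

`launchedCells(Dim3(3), Dim3(3))` mirrors `<<<3, 3>>>`: every thread gets row 0 and col runs from 0 to 8, so the threads with col 3 to 8 fill the second and third rows of the result using reads past the end of d_b. `launchedCells(Dim3(1), Dim3(3, 3))` mirrors a single block of `dim3(3, 3)` threads, whose nine threads cover rows 0 to 2 and columns 0 to 2, which is the layout the kernel expects.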

Related

Cuda Dot Product Failing for Non Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.
Let's say I set the array size to 2500 and the max thread count per block to 1024.
In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as such:
nblocks = (n + 1024 - 1) / 1024
So, nblocks = 3
This is my kernel function:
// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
// store the product of a[i] and b[i] in shared memory
// sum the products in shared memory
// store the sum in c[blockIdx.x]
__shared__ float s[ntpb];
int tIdx = threadIdx.x;
int i = blockDim.x * blockIdx.x + threadIdx.x;
//calc product
if (i < n)
s[tIdx] = a[i] * b[i];
__syncthreads();
for (int stride = 1; stride < blockDim.x; stride <<= 1) {
if (tIdx % (2 * stride) == 0)
s[tIdx] += s[tIdx + stride];
__syncthreads();
}
if (threadIdx.x == 0){
c[blockIdx.x] = s[0];
}
}
I call the kernel:
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
And everything works! Well, almost.
d_c has 3 elements, one partial dot product per block, and only the last element is wrong.
d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number of 10^18
Can someone point out why this is occurring? It only seems to work when n is a multiple of 1024 (2048, 3072, etc.). Am I iterating over uninitialized values, or overflowing something?
Thank you!
Edit:
// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);
// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));
// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
The arrays are initialized by this function (n elements each):
void init(float* a, int n) {
float f = 1.0f / RAND_MAX;
for (int i = 0; i < n; i++)
a[i] = std::rand() * f; // [0.0f 1.0f]
}
The basic problem here is that the sum reduction only works correctly when you have a power-of-two number of threads per block and every entry in shared memory is initialised. In your kernel the threads with i >= n never write s[tIdx], so the last block sums uninitialised values. That isn't a limitation in practice if you do something like this:
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    //calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512 or 1024). The while loop accumulates multiple values and stores that partial dot product in shared memory, so every entry contains either 0 or a valid partial sum, and then the reduction happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one less than you think you require if the problem size is small). Performance will improve as well at larger problem sizes.
If you haven't already seen it, there is a very instructive whitepaper by Mark Harris of NVIDIA on the step-by-step optimisation of the basic parallel reduction. I highly recommend reading it.
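The zero-initialised accumulation followed by the tree reduction can be checked on the host before touching the GPU. This is a minimal sketch in plain C++, not the kernel itself: `emulateDotProduct` is a hypothetical helper that runs the "threads" of each block one after another, and it assumes `threads` is a power of two, as the answer requires.

```cpp
#include <vector>

// Host-side emulation of one launch of the corrected dotProduct kernel:
// each block gets a zero-initialised "shared" array, every thread runs the
// grid-stride accumulation loop, and then the interleaved tree reduction runs.
// Assumption: threads is a power of two.
std::vector<float> emulateDotProduct(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     int blocks, int threads) {
    const int n = static_cast<int>(a.size());
    std::vector<float> c(blocks, 0.0f);
    for (int blk = 0; blk < blocks; ++blk) {
        std::vector<float> s(threads, 0.0f);  // s[tIdx] = 0.f for every thread
        for (int t = 0; t < threads; ++t)
            for (int i = blk * threads + t; i < n; i += threads * blocks)
                s[t] += a[i] * b[i];          // grid-stride accumulation
        // Tree reduction; each pass of the outer loop is one step between
        // two __syncthreads() calls in the kernel.
        for (int stride = 1; stride < threads; stride <<= 1)
            for (int t = 0; t < threads; ++t)
                if (t % (2 * stride) == 0)
                    s[t] += s[t + stride];
        c[blk] = s[0];                        // one partial sum per block
    }
    return c;
}
```

For n = 2500 with every a[i] = 1 and b[i] = 2, three blocks of 1024 threads give the partial sums 2048, 2048 and 904, which add up to the expected 5000. The last block is only partly filled, and the zero initialisation is what keeps its unused slots from polluting the sum.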

CUDA Kernel for copying array location from neighbour location

I have a CUDA kernel that copies from the (i+1)th location to the ith location in a device array. The copy is not done for locations whose index is a multiple of 32: [32]->[31] is not copied, [64]->[63] is not copied. This happens irrespective of the block size. How can this be resolved?
Here is the full program. There are no calls to __syncthreads(), yet the problem still exists.
#include <cstdio>
struct SodA { float *df0; size_t pitch; };
__global__ void stream_kernel (SodA dA1, SodA dA2, int M, int N);
int main(int argc, char **argv){
int i, M=32, N=32;float *f0;
SodA dA1, dA2;
dim3 blockSize = dim3(32,32);
dim3 gridSize = dim3(1,1);
f0 = (float *)malloc(M*N*sizeof(float));
cudaMallocPitch((void **)&dA1.df0, &dA1.pitch, sizeof(float)*M, N);
cudaMallocPitch((void **)&dA2.df0, &dA2.pitch, sizeof(float)*M, N);
for (i=0; i<M*N; i++) f0[i] = (float)rand()/RAND_MAX;
cudaMemcpy2D((void *)dA1.df0, dA1.pitch, (void *)f0, sizeof(float)*M, sizeof(float)*M, N, cudaMemcpyHostToDevice);
printf("\n");
for(int i=28;i<70; i++)
printf("%5d ", i);
printf("\n\n");
printf("\n");
for(int i=28;i<70; i++)
printf("%.3f ", f0[i]);
printf("\n\n");
stream_kernel<<<gridSize, blockSize>>>(dA1, dA2, M, N);
cudaMemcpy2D( (void *)f0, sizeof(float)*M, (void *)dA2.df0, dA2.pitch,sizeof(float)*M, N, cudaMemcpyDeviceToHost);
printf("\n");
for(int i=28;i<70; i++)
printf("%.3f ", f0[i]);
printf("\n\n");
free(f0);cudaFree(dA2.df0);
cudaFree(dA1.df0);
printf("\n\n");
return 0;
}
__global__ void stream_kernel (SodA dA1, SodA dA2, int M, int N)
{
int i, j, i2d;
i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;
i2d = i + j * M;
if (i2d>0) { dA2.df0[i2d-1] = dA1.df0[i2d];}
}
The output
28 29 30 31 32 33 ....
0.999 0.218 0.513 0.839 0.613 0.296 0.638....
0.218 0.513 0.839 0.198 0.296 0.638 ....
Thanks for the comments. In a 2D array stored in row-major order, this kernel moves the element at position (i,j) to the previous position. Since the array is pitched, as mentioned in the comments, the element before the first element of each row cannot be found with a -1 offset. This special case is handled by computing the position of the last element of the previous row. I got the answer. Thanks.
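The addressing rule behind this fix can be written down without a GPU. Below is a minimal host-side sketch in plain C++; `pitchedOffsetBytes` and `previousElement` are hypothetical helper names, and the 512-byte pitch used in the note is only an assumed example, since the real value is whatever cudaMallocPitch returns on a given device.

```cpp
#include <cstddef>
#include <utility>

// Byte offset of element (row, col) in a pitched allocation of floats.
// pitchBytes is the pitch reported by cudaMallocPitch; it is measured in
// bytes and may be larger than width * sizeof(float).
std::size_t pitchedOffsetBytes(std::size_t row, std::size_t col,
                               std::size_t pitchBytes) {
    return row * pitchBytes + col * sizeof(float);
}

// Logical predecessor of (row, col) in row-major order for rows of `width`
// elements: the element before (row, 0) is (row - 1, width - 1).
// Precondition: (row, col) is not (0, 0).
std::pair<std::size_t, std::size_t> previousElement(std::size_t row,
                                                    std::size_t col,
                                                    std::size_t width) {
    if (col > 0) return {row, col - 1};
    return {row - 1, width - 1};
}
```

With 32 floats per row and an assumed pitch of 512 bytes, element (1, 0) starts at byte 512 while its logical predecessor (0, 31) sits at byte 124, so the two are not adjacent in memory. Inside a kernel the same rule is usually written as `(float*)((char*)base + row * pitch) + col`.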

Create 2D Array with CUDA

In the CUDA C Programming Guide there is a sample that shows a 2D array:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
I use a 2D array in the form below, and it works correctly:
dim3 grid[COLUMNS][ROWS];
kernel_Matrix<<<grid,1>>>(dev_strA, dev_strB, dev_Matrix);
__global__ void add(int *a, int *b, int *c)
{
int x = blockIdx.x;
int y = blockIdx.y;
int i = (COLUMNS*y) + x;
c[i] = a[i] + b[i];
}
Is there a way to implement a 2D array with a [ ][ ] definition? I tried it this way, but it does not work.
dim3 is not an array but a structure defined in a CUDA header file (vector_types.h). This structure is used to specify the dimensions of the grid in the execution configuration of __global__ functions, i.e. in <<< >>>. It doesn't hold the 'real' blocks; it just configures the number of blocks that will be executed.
The only two ways (to my knowledge) to initialize this structure are:
1. dim3 grid(x, y, z);
2. dim3 grid = {x, y, z};
EDIT:
Host code with dim3 initialization, and with the arrays passed to the kernel function in a way that lets you access their elements via [][]:
float A[N][N];
float B[N][N];
float C[N][N];
float (*d_A)[N]; //pointers to arrays of dimension N
float (*d_B)[N];
float (*d_C)[N];
for(int i = 0; i < N; i++) {
    for(int j = 0; j < N; j++) {
        A[i][j] = i;
        B[i][j] = j;
    }
}
//allocation
cudaMalloc((void**)&d_A, (N*N)*sizeof(float));
cudaMalloc((void**)&d_B, (N*N)*sizeof(float));
cudaMalloc((void**)&d_C, (N*N)*sizeof(float));
//copying from host to device
cudaMemcpy(d_A, A, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_C, C, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C);
//copying from device to host
cudaMemcpy(A, (d_A), (N*N)*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(B, (d_B), (N*N)*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(C, (d_C), (N*N)*sizeof(float), cudaMemcpyDeviceToHost);
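The reason the pointer-to-array declaration makes [][] work is that `p[i][j]` on a `float (*)[N]` addresses the same element as `flat[i * N + j]` on the underlying flat buffer. The following host-only C++ sketch checks that equivalence; `rowPointerMatchesFlatLayout` and the value of `N` are hypothetical, and `std::malloc` stands in for `cudaMalloc`.

```cpp
#include <cstdlib>

constexpr int N = 4;  // hypothetical small size for the illustration

// Allocates one flat buffer of N*N floats, views it through a pointer-to-row
// (the same shape as d_A in the answer), and checks that p[i][j] is the same
// element as flat[i * N + j]. Returns true if every element matches.
bool rowPointerMatchesFlatLayout() {
    void* raw = std::malloc(N * N * sizeof(float));
    if (raw == nullptr) return false;
    float (*p)[N] = static_cast<float (*)[N]>(raw);
    float* flat = static_cast<float*>(raw);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            p[i][j] = static_cast<float>(i * 10 + j);
    bool ok = true;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            ok = ok && (&p[i][j] == flat + i * N + j)
                    && (flat[i * N + j] == static_cast<float>(i * 10 + j));
    std::free(raw);
    return ok;
}
```

This only works because N is a compile-time constant: the compiler needs the row length to turn `p[i][j]` into an offset, which is also why the kernel parameter is declared as `float A[N][N]`.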

Get an int from a threadId in CUDA

I am pretty new to CUDA. I need to use a thread ID in a computation but it doesn't work: rem is always 0. I need the thread index to compute indices into arrays, so I can't convert it to float to do the computations.
The kernel is as follows:
__global__ void initializationCubes(float* dVer, float* dCub, int gridSize, float* test)
{
int index=blockIdx.x*blockDim.x+threadIdx.x;
if(index<(gridSize*gridSize*gridSize))
{
// conversion index -> i,j,k
int rem=index;
int qot=(rem/gridSize);
int i=rem-(qot*gridSize);
rem=(rem)/(gridSize);
qot=(rem/gridSize);
int j=rem-(qot*gridSize);
rem=(rem)/(gridSize);
qot=(rem/gridSize);
int k=rem-(qot*gridSize);
for(int x=0;x<7;x++){
// these first three are used to test
dCub[index*56+0+x] =index;
dCub[index*56+7+x] =rem;
dCub[index*56+14+x]=k;
dCub[index*56+21+x]=dVer[((i*(gridSize+1)+(j+1))*(gridSize+1)+k)*7+x];
dCub[index*56+28+x]=dVer[(((i+1)*(gridSize+1)+(j))*(gridSize+1)+k)*7+x];
dCub[index*56+35+x]=dVer[(((i+1)*(gridSize+1)+(j))*(gridSize+1)+k+1)*7+x];
dCub[index*56+42+x]=dVer[(((i+1)*(gridSize+1)+(j+1))*(gridSize+1)+k+1)*7+x];
dCub[index*56+49+x]=dVer[(((i+1)*(gridSize+1)+(j+1))*(gridSize+1)+k)*7+x];
}
}
}
__global__ void initializationVertices(float* dVer, int gridSize){
    int currentVertex=0;
    for(int i=0; i<gridSize+1; i++)
    {
        for(int j=0; j<gridSize+1; j++)
        {
            for(int k=0; k<gridSize+1; k++)
            {
                dVer[currentVertex+0]=((i*2.0f)/(gridSize)-1.0f)*2.0f;
                dVer[currentVertex+1]=((j*2.0f)/(gridSize)-1.0f)*2.0f;
                dVer[currentVertex+2]=((k*2.0f)/(gridSize)-1.0f)*2.0f;
                currentVertex+=7;
            }
        }
    }
}
extern "C"
void initializationCUDA1( const int verticesAtEndsOfEdges[24], const int eTable[256], int gSize, int numberParticles ) {
numParticles=numberParticles;
gridSize=gSize;
numVertices=(gridSize+1)*(gridSize+1)*(gridSize+1);
numCubes=(gridSize)*(gridSize)*(gridSize);
size_t pitchv=7;
cudaMallocPitch((void**)&dVer, &pitchv, 7 * sizeof(float), (gridSize+1)*(gridSize+1)*(gridSize+1));
size_t pitchc=7;
cudaMallocPitch((void**)&dCub, &pitchc, 7 * sizeof(float), (gridSize)*(gridSize)*(gridSize)*8);
cudaMalloc((void **)&verticesAtEnds, 24*sizeof(int));
cudaMalloc((void **)&dedgeTable, 256*sizeof(int));
cudaMalloc((void **)&dtriTable, 256*16*sizeof(int));
cudaMalloc((void **)&ballPoint, 3*sizeof(float));
cudaMalloc((void **)&dpositions, 3*numberParticles*sizeof(float));
cudaMalloc((void **)&dedgeVertices, numCubes*6*12*sizeof(float));
cudaMalloc((void **)&result, numCubes*18*sizeof(float));
output=(float*)malloc(numCubes*18*sizeof(float));
cudaMalloc((void **)&numFaces, 10*sizeof(int));
cudaMalloc((void **)&test, sizeof(float));
initializationVertices<<<1,1>>>(dVer, gridSize);
initializationCubes<<<128,256>>>( dVer, dCub, gridSize, test);
float* tmp =(float*)malloc(numCubes*56*(sizeof(float)));
cudaMemcpy(tmp, dCub, numCubes*56*sizeof(float), cudaMemcpyDeviceToHost);
for(int a=0;a<100;a++){
printf("%f\n",tmp[a]);
}
}
EDIT
gridSize is 40, so the thread indices go from 0 to 64000,
and when I print the values outside of my function, rem, i, j and k are all equal to 0.
size_t pitchv=7;
cudaMallocPitch((void**)&dVer, &pitchv, 7 * sizeof(float), (gridSize+1)*(gridSize+1)*(gridSize+1));
size_t pitchc=7;
cudaMallocPitch((void**)&dCub, &pitchc, 7 * sizeof(float), (gridSize)*(gridSize)*(gridSize)*8);
initializationCubes<<<1,1>>>( dVer, dCub, gridSize, test);
If gridSize is the size of the grid, as the name suggests, qot will always be zero after execution of your code, and rem will be zero for every index below gridSize*gridSize, because by then they have been divided by a value larger than themselves. With gridSize = 40 that covers the first 1600 threads, and the first 100 floats you print belong to threads 0 and 1 only.
If you are looking for indices into a three-dimensional grid, that is exactly why threadIdx and blockIdx have three components. No expensive division is required at all, just use this standard code snippet:
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = blockIdx.z * blockDim.z + threadIdx.z;
if (i < myBlockSize.x && j < myBlockSize.y && k < myBlockSize.z) {
    // your kernel code...
}
and launch your kernel with appropriate values for the y and z components of the block and grid sizes, as well as a parameter or global variable myBlockSize set to the desired grid size (in case it cannot be factored into integer block and grid dimensions).
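If a one-dimensional launch has to be kept, the linear index can still be split into three coordinates with integer division and remainder. The sketch below is plain host-side C++ under the assumption that i varies fastest; `Index3` and `decompose` are hypothetical names used only for this illustration.

```cpp
// Coordinates of one cell in a cube with gridSize cells along each edge.
struct Index3 { int i, j, k; };

// Splits a linear index into (i, j, k), with i varying fastest.
// Assumption: 0 <= index < gridSize * gridSize * gridSize.
Index3 decompose(int index, int gridSize) {
    Index3 r;
    r.i = index % gridSize;                  // position within a row
    r.j = (index / gridSize) % gridSize;     // row within a slice
    r.k = index / (gridSize * gridSize);     // slice number
    return r;
}
```

For gridSize = 40, k stays 0 for every index below 1600 and first becomes 1 at index 1600, so inspecting only the first few threads will always show k = 0.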

CUDA programming

I am new to CUDA. I have a question about a simple program; I hope someone can spot my mistake.
__global__ void ADD(float* A, float* B, float* C)
{
const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
if(ix < 16 && iy < 16)
{
for(int i = 0; i<256; i++)
C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
}
}
extern "C" void cuda_p(float* A, float* B, float* C)
{
float* dev_A;
float* dev_B;
float* dev_C;
cudaMalloc((void**) &dev_A, sizeof(float) * 256);
cudaMalloc((void**) &dev_B, sizeof(float) * 256);
cudaMalloc((void**) &dev_C, sizeof(float) * 256);
cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaFree(dev_A);
cudaFree(dev_B);
cudaFree(dev_C);
}
Are you sure about the kernel launch configuration? In your code you try to launch an unknown function, ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1), so in your kernel blockIdx.x is in [0..16) and threadIdx.x is in [0..16), while blockIdx.y and threadIdx.y are always 0. If I understood you right, then
ix = threadIdx.x;
iy = blockIdx.x;
Read about it in the CUDA Programming Guide (Appendix B.15).
But that is not the only mistake. When you accumulate values in C[i] you have a race condition: the 16 threads of a block (which share one warp) simultaneously read C[i], add some value (A[ix+iy*16] + B[ix+iy*16]) and write the results back to C[i], overwriting each other's updates. You should use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very slow.
Your primary issue is that the core of your kernel doesn't make sense. What you have is:
for(int i = 0; i<256; i++)
C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
This is going to have each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since each thread is doing this at the same time, they're going to step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a certain entry in C:
for(int i = 0; i<256; i++)
C[ix+iy*16] += A[i] + B[i];
If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if(ix < 16 && iy < 16)
    {
        C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
    }
}
Each thread grabs one entry from A and one from B, and writes one entry in C.
Your secondary issue is that you're launching the kernel wrong. You're doing:
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
This launches a one-dimensional grid of 16 blocks with 16 threads each (and of the misspelled kernel name). If you want to have your threads positioned in 2 dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:
// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;
// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;
// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);
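The difference between the two launch shapes can be counted on the host. This is a minimal sketch in plain C++, not CUDA: `Dim3` and `writesPerElement` are hypothetical helpers that replay the guard and index of the ADD kernel for every thread and record how many threads would write each of the 256 elements.

```cpp
#include <vector>

// Hypothetical stand-in for CUDA's dim3: unspecified components default to 1.
struct Dim3 {
    int x, y, z;
    Dim3(int x_ = 1, int y_ = 1, int z_ = 1) : x(x_), y(y_), z(z_) {}
};

// For each element of a 16x16 array, counts how many threads of the launch
// would write it under the guard and index used in the ADD kernel.
std::vector<int> writesPerElement(Dim3 grid, Dim3 block) {
    std::vector<int> writes(16 * 16, 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int ix = bx * block.x + tx;  // blockDim.x * blockIdx.x + threadIdx.x
                    int iy = by * block.y + ty;  // blockDim.y * blockIdx.y + threadIdx.y
                    if (ix < 16 && iy < 16)
                        ++writes[ix + iy * 16];
                }
    return writes;
}
```

`writesPerElement(Dim3(4, 4), Dim3(4, 4))` reports exactly one writer for each of the 256 elements. `writesPerElement(Dim3(16), Dim3(16))`, which mirrors `<<<16,16>>>`, reaches only elements 0 to 15, because iy is always 0 and the guard rejects every thread with ix of 16 or more.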
To avoid using atomicAdd, you can allocate shared memory, write the values into shared memory, add them there, and then write the result out. Note that you should not try to use atomicAdd on shared memory for this: it is even slower than atomicAdd on global memory. Only the int version of atomicAdd on shared memory is faster than the global one. Also note that writes into shared memory should avoid bank conflicts. In my tests, using shared memory made the algorithm 1-5% faster than atomicAdd, and synchronizing per warp can be even faster.
In general, my suggestions are:
Use shared memory instead of atomicAdd.
Use warp-level synchronization rather than __syncthreads() (this needs a special design).
With these you might see a 5-10% increase in speed.