CUDA programming

I am new to CUDA. I have a question about a simple program; I hope someone can spot my mistake.
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if(ix < 16 && iy < 16)
    {
        for(int i = 0; i < 256; i++)
            C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
    }
}
extern "C" void cuda_p(float* A, float* B, float* C)
{
float* dev_A;
float* dev_B;
float* dev_C;
cudaMalloc((void**) &dev_A, sizeof(float) * 256);
cudaMalloc((void**) &dev_B, sizeof(float) * 256);
cudaMalloc((void**) &dev_C, sizeof(float) * 256);
cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);
cudaFree(dev_A);
cudaFree(dev_B);
cudaFree(dev_C);
}

Are you sure about the kernel launch configuration? In your code you try to launch some unknown function, ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1). So in your kernel blockIdx.x = [0..16) and threadIdx.x = [0..16). If I understood you right, then
    ix = threadIdx.x;
    iy = blockIdx.x;
Read about it in the CUDA Programming Guide (Appendix B.15).
But that's not the only mistake. When you accumulate values into C[i] you have a race condition: every thread in the grid simultaneously reads C[i], adds some value (A[ix+iy*16] + B[ix+iy*16]), and writes the result back to C[i]. You should use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very, VERY slow...
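For illustration, a minimal sketch of the atomic variant (the kernel name ADD_atomic is ours, it assumes the fixed 16x16 problem from the question, and float atomicAdd requires compute capability 2.0 or later):
__global__ void ADD_atomic(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if(ix < 16 && iy < 16)
    {
        // Each thread adds its A+B contribution into every C[i];
        // atomicAdd serializes the conflicting updates instead of racing.
        const float contrib = A[ix + iy * 16] + B[ix + iy * 16];
        for(int i = 0; i < 256; i++)
            atomicAdd(&C[i], contrib);
    }
}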

Your primary issue is that the core of your kernel doesn't make sense. What you have is:
for(int i = 0; i < 256; i++)
    C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
This is going to have each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since every thread is doing this at the same time, they're going to step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a single entry in C:
for(int i = 0; i < 256; i++)
    C[ix+iy*16] += A[i] + B[i];
If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if(ix < 16 && iy < 16)
    {
        C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
    }
}
Each thread grabs one entry from A and one from B, and writes one entry in C.
Your secondary issue is that you're launching the kernel wrong. You're doing:
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
This launches a one-dimensional grid of 16 blocks, each with 16 threads (and of the typo'd kernel name ADDD, at that). If you want to have your threads positioned in 2 dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:
// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;
// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;
// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);

To avoid using atomicAdd, you can allocate shared memory, write the values into shared memory, add them there, and then write the result out. Note that you should not try to use atomicAdd on shared memory: it is even slower than atomicAdd on global memory (only integer atomicAdd in shared memory beats the global version). Also note that writes into shared memory should avoid bank conflicts. In my tests, using shared memory made the algorithm about 1-5% faster than atomicAdd, and warp-synchronous techniques can be faster still.
In general, my suggestions are:
Use shared memory instead of atomicAdd.
Use warp-level synchronization (__syncwarp()) instead of __syncthreads() (this needs special design).
You might then enjoy a 5-10% increase in speed.
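As a rough sketch of the shared-memory suggestion, here is a hypothetical per-block sum that reduces in shared memory and issues a single global write per block; the kernel name and the 256-thread block size are our assumptions, not code from the question:
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float s[256];               // one slot per thread; blockDim.x must be 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;       // stage values in shared memory
    __syncthreads();
    // Tree reduction in shared memory: no atomics required
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];            // one global write per block
}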

Related

Cuda Dot Product Failing for Non Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.
Let's say I set the array size to 2500 and the max thread count per block to 1024.
In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as such:
nblocks = (n + 1024 - 1)/1024
So, nblocks = (2500 + 1024 - 1)/1024 = 3
This is my kernel function:
// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    if (i < n)
        s[tIdx] = a[i] * b[i];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
I call the kernel:
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
And everything works! Well, almost.
d_c, which has 3 elements (each one the dot product of a block), is thrown off on the last element:
d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number on the order of 10^18
Can someone point out why this is occurring? It only seems to work for multiples of 1024. So... 2048, 3072, etc... Am I iterating over null values or stack overflowing?
Thank you!
Edit:
// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);
// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));
// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
Initialization of the arrays is done in this function (n times):
void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f, 1.0f]
}
The basic problem here is that the sum reduction can only work correctly when you have a power-of-two number of threads per block, with every entry in the shared memory initialised. That isn't a limitation in practice if you do something like this:
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512, or 1024). The while loop accumulates multiple values per thread and stores the partial dot product in shared memory, with every entry containing either 0 or a valid partial sum; the reduction then happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one fewer than you think you require if the problem size is small). Performance will also improve at larger problem sizes.
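For concreteness, a host-side sketch of that launch strategy (the grid size here is illustrative, and ntpb must match the kernel's compile-time shared-memory size):
const int nblocks = 32;  // enough blocks to "fill" the GPU, not one block per 1024 elements
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);  // each block writes one partial sum
// d_c now holds nblocks partial dot products; a second small kernel launch
// (or a host-side loop over the nblocks values) produces the final result.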
If you haven't already seen it, there is a very instructive whitepaper by Mark Harris of NVIDIA on the step-by-step optimisation of the basic parallel reduction ("Optimizing Parallel Reduction in CUDA"). I highly recommend reading it.

CUDA: large kernel gives strange behavior

I recently bought a GTX 550 Ti Boost card. Programs that used to work on my old GF440 card fail on it. Here is an example. The following program works fine with smaller kernels, but goes wrong with larger ones.
#include "stdio.h"
__global__ void kernel(float * d_in, float * d_out){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = x + y * blockDim.x * gridDim.x;
    d_out[idx] = d_in[idx];
}
int main(){
    const dim3 gridSize(10,10);
    const dim3 blockSize(80,80);
    const int size = 800*800;
    float * h_in = new float[size];
    float * h_out = new float[size];
    float * d_in;
    float * d_out;
    cudaMalloc((void**)&d_in, sizeof(float)*size);
    cudaMalloc((void**)&d_out, sizeof(float)*size);
    for(int i = 0; i < size; i++)
        h_in[i] = (float)i;
    cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);
    kernel<<<gridSize,blockSize>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(float)*size, cudaMemcpyDeviceToHost);
    for(int i = 0; i < size; i++)
        printf("%f\n",h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
I expected it to output the indices as floats. Instead it outputs some random floats:
0.131061
2.520029
9.304665
0.000189
0.242134
0.525557
0.560013
size 100*100
When I switch to size 100*100:
const dim3 gridSize(10,10);
const dim3 blockSize(10,10);
const int size = 100*100;
And it works fine (last 5 outputs):
9995.000000
9996.000000
9997.000000
9998.000000
9999.000000
size 500*500
But for larger size 500*500:
const dim3 gridSize(10,10);
const dim3 blockSize(50,50);
const int size = 500*500;
It outputs the wrong indices (last 5 outputs):
512139.000000
512140.000000
512141.000000
512142.000000
512143.000000
I installed CUDA 5.5. Thanks!
Whenever you are having trouble with CUDA code, you should be doing proper CUDA error checking.
This is not valid:
const dim3 blockSize(80,80);
This is asking for a threadblock of 80*80 = 6400 threads. There are no GPUs that support 6400 threads per threadblock.
This is also not valid:
const dim3 blockSize(50,50);
2500 threads is also too many. These configs would not work on either of your cards.
This is acceptable:
const dim3 blockSize(10,10);
In the "not valid" cases, your kernel is not running. If you had done proper cuda error checking, you would have discovered this and even got a clue as to what might be wrong (invalid launch configuration).
You may also want to familiarize yourself with the deviceQuery CUDA sample, and study its output for your GPUs.

Cuda matrix multiplication - wrong result

This is my code for matrix multiplication, but when I run it I get the correct result for the first row and wrong ones (mostly big negative numbers) for the second and third. This is my first program, so I used some code that I found on the net.
#include <iostream>
#include <cstdio> // for printf
__global__ void MnozenjeMatrica(int* d_c, int* d_a, int* d_b)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int d = 0;
    for(int i=0; i<3; i++)
    {
        int x = d_a[row * 3 + i];
        int y = d_b[i * 3 + col];
        d += x * y;
    }
    d_c[row * 3 + col] = d;
}
int main()
{
    const int SIZE = 9 * sizeof(int);
    int a[3][3] = {{2, 4, 6}, {1, 3, 5}, {8, 4, 1}};
    int b[3][3] = {{5, 8, 34}, {5, 7, 5}, {1, 4, 31}};
    int c[3][3] = {{5, 8, 34}, {5, 7, 5}, {1, 4, 31}};
    int* d_a;
    int* d_b;
    int* d_c;
    cudaMalloc((void**) &d_a, SIZE);
    cudaMalloc((void**) &d_b, SIZE);
    cudaMalloc((void**) &d_c, SIZE);
    cudaMemcpy(d_a, a, SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, SIZE, cudaMemcpyHostToDevice);
    MnozenjeMatrica<<<3, 3>>>(d_c, d_a, d_b);
    cudaMemcpy(c, d_c, SIZE, cudaMemcpyDeviceToHost);
    for(int i=0; i<3; i++)
    {
        for(int j=0; j<3; j++)
        {
            printf("%d\t", c[i][j]);
        }
        printf("\n");
    }
}
Completely agree with @talonmies.
More suggestions:
Plenty of people have posted questions about CUDA matrix multiplication; you might look at some of those to get ideas.
You're not doing any CUDA error checking on kernel launches and CUDA API calls (it's strongly recommended).
You might try running your code under cuda-memcheck and see what it says.
You could debug this kernel pretty quickly with a few choice printf statements; this is mostly C code after all, so consider using basic C troubleshooting techniques.
Since I was able to quickly spot this, I can tell you that your kernel is depending on a 2-D threadblock structure to do anything useful:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
But you are launching a 1D grid of 1D threadblocks:
MnozenjeMatrica<<<3, 3>>>(d_c, d_a, d_b);
                  ^  ^
                  |  1-D threadblock (3 threads)
                  1-D grid (3 blocks)
So I'm not surprised it only works for a single row.
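For this 3x3 case, a sketch of a launch that matches the kernel's 2-D indexing would be a single 3x3 block covering the whole matrix:
dim3 blockSize(3, 3);  // 3x3 = 9 threads, one per output element
MnozenjeMatrica<<<1, blockSize>>>(d_c, d_a, d_b);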

Read an Array With Threads in CUDA

I was wondering if it is possible, and what the best way is, to read cells from an array with threads in CUDA. To simplify what I mean, here is an example:
I have an array: {1,2,3,4,5,6,...} and I would like each thread to read n cells of my array, depending mainly on the array's size.
I have been trying a few things, but they don't seem to work, so if anyone could point out a (right) way to do it, that would be great.
Thank you.
Generally you want contiguous threads to read contiguous array indices. Doing so results in "coalesced" memory transactions. The simple way to think of it is that if 32 threads are running physically in parallel, and they all do a load, then if all 32 loads fall into the same cache line, then a single memory access can be performed to fill the cache line, rather than 32 separate ones.
So what you want to do is have each thread access n cells that are strided by the number of threads, like this (assuming input data is in the float array data).
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < numElements; i += stride) {
    float element = data[i];
    process(element);
}
If your algorithm requires that each thread reads n contiguous data elements, then you are going to incur non-coalesced loads, which will be much more expensive. In this case, I would consider re-designing the algorithm so this type of access is not required.
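Going back to the coalesced pattern above, here it is wrapped into a complete (hypothetical) kernel for concreteness, with a placeholder in place of process():
__global__ void processArray(const float* data, float* result, int numElements)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Grid-stride loop: consecutive threads read consecutive elements,
    // so each warp's loads coalesce into as few transactions as possible.
    for (int i = idx; i < numElements; i += stride) {
        float element = data[i];
        result[i] = element * 2.0f;  // placeholder for process(element)
    }
}
// launched with, e.g., processArray<<<64, 256>>>(d_data, d_result, numElements);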
You said that the threads have to look at the n next numbers, so you can use:
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE (N * NTHREAD)
// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid * N;
    // each thread writes its own contiguous stride of N cells
    for(int i = 0; i < N; i++){
        array[startId + i] = tid;
    }
}
// call the kernel by:
accessArray<<<NTHREAD/256, 256>>>(d_array);
Dump out the array and check whether the threads accessed it the way you wanted.
Full code:
#include <cuda.h>
#include <stdio.h>
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE N*NTHREAD
// develop the kernel as:
__global__ void accessArray(int *array){
int tid = blockDim.x * blockIdx.x + threadIdx.x;
int startId = tid*N;
// access thread's stride
for(int i=0; i<N; i++){
array[startId+i]=tid;
}
}
int main()
{
int h_array[ARRAYSIZE];
int *d_array;
size_t memsize= ARRAYSIZE * sizeof(float);
for (int i=0; i< ARRAYSIZE; i++) {
h_array[i] = 0;
}
cudaMalloc(&d_array, memsize);
cudaMemcpy(d_array, h_array, memsize, cudaMemcpyHostToDevice);
accessArray<<<NTHREAD/256, 256>>>(d_array);
cudaMemcpy(h_array, d_array, memsize, cudaMemcpyDeviceToHost);
for (int i=0; i<ARRAYSIZE; i++)
printf("A[%d] => %d\n",i,h_array[i]);
cudaFree(d_array);
}

cuda multiplication

Serial code snippet looks like this:
int i, j;
for(j=0; j<ny; j++)
{
    for(i=0; i<nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
I converted this to CUDA using this kernel:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i,j;
for(tid = 0; tid < nx*ny; tid++)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
If this is the serial code:
int i, j;
for(j=0; j<ny; j++)
{
    for(i=0; i<nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
then you should be doing this:
__global__ void fn(float *x, const float *y, int nx) // note: y must be passed in too
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int j = tid/nx, i = tid - j * nx;
    x[tid] *= y[i];
}
fn<<<nx*ny/B, B>>>(x, y, nx); // with B = 256, 512, etc.
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny and compute the same function as your CPU version! Moreover, instead of each thread just handling its own index, every thread does all of the work, and does the loop less efficiently than the CPU version does in a single thread. It's no wonder this is slower; it should be much, much slower. Your CUDA kernel is:
int tid = blockIdx.x * blockDim.x + threadIdx.x;  // tid is computed here...
int i,j;
for(tid = 0; tid < nx*ny; tid++)                  // ...then immediately overwritten by the loop
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Given your comment to this answer:
the nx * ny = 2205; so I used no. of blocks =
(nx*ny+(threads-1))/threads and threads = 64.
it seems you intend to launch one thread per computation, in which case the correct CUDA implementation would just be:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx;
int i = tid - j*nx;
if (tid < (nx*ny))
    x[tid] *= y[i];
If you intended each thread to perform more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU rather than using the same number of threads as the input size, and then do something like:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int gsize = blockDim.x * gridDim.x;
int i,j;
for(; tid < nx*ny; tid += gsize)
{
    j = tid/nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are further optimizations that could be made, but they would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and an integer multiply-add per calculation, which is a lot of overhead for a single FLOP per input value. Having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU: you would need problems many orders of magnitude larger to realize a useful speedup from the GPU for this sort of low arithmetic intensity operation.
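One way to remove that per-element division, sketched here under the assumption that a 2-D launch is acceptable (the kernel name fn2d and the block dimensions are illustrative), is to derive i and j directly from a 2-D grid:
__global__ void fn2d(float *x, const float *y, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i < nx && j < ny)
        x[i + j * nx] *= y[i];                      // no division or modulo needed
}
// dim3 block(32, 8);
// dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
// fn2d<<<grid, block>>>(x, y, nx, ny);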
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.
Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
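As a sketch, event-based timing that discards the first (warm-up) launch; fn and the block size B refer to the kernel sketched in the earlier answer:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
fn<<<nx*ny/B, B>>>(x, y, nx);  // warm-up launch (absorbs JIT/setup cost), untimed
cudaDeviceSynchronize();
cudaEventRecord(start);
for (int r = 0; r < 100; r++)  // average over many launches
    fn<<<nx*ny/B, B>>>(x, y, nx);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %.4f ms\n", ms / 100.0f);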
Try this version using shared memory; it is essentially the tiled matrix multiplication example from the CUDA C Programming Guide, and one of the best implementations around:
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; // In number of elements
    float *elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Same as in previous example, except the following:
    // d_A.width = d_A.stride = A.width;
    // d_B.width = d_B.stride = B.width;
    // d_C.width = d_C.stride = C.width;
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub; multiply each pair of
    // sub-matrices together and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
    {
        // Get sub-matrix Asub of A and Bsub of B
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}
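For completeness, a sketch of the elided MatMul host code, reconstructed along the lines the comments describe (this follows the standard CUDA Programming Guide pattern; error checking is omitted, so treat it as illustrative rather than verbatim):
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Copy A and B to device memory, setting stride = width for dense storage
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);
    // One BLOCK_SIZE x BLOCK_SIZE thread block per output tile
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    // Read C back and free device memory
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}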