CUDA: large kernel gives strange behavior

I recently bought a GTX 550 Ti boost card. Programs that used to work on my old GF440 card now fail. Here is an example: the following program works fine with smaller launch configurations, but goes wrong with larger ones.
#include "stdio.h"
__global__ void kernel(float * d_in, float * d_out){
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = x + y * blockDim.x * gridDim.x;
d_out[idx] = d_in[idx];
}
int main(){
const dim3 gridSize(10,10);
const dim3 blockSize(80,80);
const int size = 800*800;
float * h_in = new float[size];
float * h_out = new float[size];
float * d_in;
float * d_out;
cudaMalloc((void**)&d_in, sizeof(float)*size);
cudaMalloc((void**)&d_out, sizeof(float)*size);
for(int i = 0; i < size; i++)
h_in[i] = (float)i;
cudaMemcpy(d_in, h_in, sizeof(float)*size, cudaMemcpyHostToDevice);
kernel<<<gridSize,blockSize>>>(d_in, d_out);
cudaMemcpy(h_out, d_out, sizeof(float)*size, cudaMemcpyDeviceToHost);
for(int i = 0; i < size; i++)
printf("%f\n",h_out[i]);
cudaFree(d_in);
cudaFree(d_out);
return 0;
}
I expected it to output the indices as floats, but instead it prints seemingly random values:
0.131061
2.520029
9.304665
0.000189
0.242134
0.525557
0.560013
size 100*100
However, when I switch to size 100*100:
const dim3 gridSize(10,10);
const dim3 blockSize(10,10);
const int size = 100*100;
And it works fine (last 5 outputs):
9995.000000
9996.000000
9997.000000
9998.000000
9999.000000
size 500*500
But for larger size 500*500:
const dim3 gridSize(10,10);
const dim3 blockSize(50,50);
const int size = 500*500;
It outputs the wrong indices (last 5 outputs):
512139.000000
512140.000000
512141.000000
512142.000000
512143.000000
I installed CUDA 5.5. Thanks!

Whenever you are having trouble with CUDA code, you should be doing proper CUDA error checking.
This is not valid:
const dim3 blockSize(80,80);
This is asking for a thread block of 80*80 = 6400 threads. There are no GPUs that support 6400 threads per thread block.
This is also not valid:
const dim3 blockSize(50,50);
2500 threads is also too many. These configs would not work on either of your cards.
This is acceptable:
const dim3 blockSize(10,10);
In the "not valid" cases, your kernel is not running. If you had done proper cuda error checking, you would have discovered this and even got a clue as to what might be wrong (invalid launch configuration).
You may also want to familiarize yourself with the deviceQuery CUDA sample and study its output for your GPUs.
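For reference, a minimal form of that error checking might look like this (a sketch, not taken from the question's code; the macro name CUDA_CHECK is arbitrary):
#include <stdio.h>
#include <stdlib.h>

// Minimal error-checking macro: print the error string and exit on any CUDA error.
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

// Usage around the launch in the code above:
kernel<<<gridSize, blockSize>>>(d_in, d_out);
CUDA_CHECK(cudaGetLastError());      // reports launch errors such as an invalid configuration
CUDA_CHECK(cudaDeviceSynchronize()); // reports errors that occur while the kernel runs
With blockSize(80,80) or blockSize(50,50), cudaGetLastError() would have reported an invalid configuration argument instead of the program silently printing garbage.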

Related

cudaMallocManaged for 2D and 3D array

If one wants to copy arrays from the host to the device, one normally uses cudaMalloc and cudaMemcpy. But to lessen the hassle, one can just use cudaMallocManaged and skip those two steps, and life was never simpler.
The code looks like this (more or less):
__global__ void convert(float kelvin[], float celsius[]) // can pass arrays in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        kelvin[i] = celsius[i] + 273.15;
}

int main()
{
    float *celsius = (float *)malloc(N*sizeof(float));
    float *kelvin  = (float *)malloc(N*sizeof(float));
    cudaMallocManaged(&celsius, N*sizeof(float));
    cudaMallocManaged(&kelvin, N*sizeof(float));
    // init celsius here
    dim3 blocksPerGrid(1,1,1);   // use only one block
    dim3 threadsPerBlock(N,1,1); // use N threads in the block
    convert<<<blocksPerGrid, threadsPerBlock>>>(kelvin, celsius);
    cudaDeviceSynchronize();
    // Doing stuff with the output here
    return 0;
}
The previous example seems clear to me. But how do I do cudaMallocManaged for 2D and 3D arrays? I've been trying:
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{   // I think 2D arrays can be passed as pointer to pointers
    float **A = (float **)malloc(N*N*sizeof(float));
    float **B = (float **)malloc(N*N*sizeof(float));
    float **C = (float **)malloc(N*N*sizeof(float));
    cudaMallocManaged(&A, N*N*sizeof(float));
    cudaMallocManaged(&B, N*N*sizeof(float));
    cudaMallocManaged(&C, N*N*sizeof(float));
    A[N][N]={{1,0,0},{0,1,0},{0,0,1}};
    B[N][N]={{1,0,0},{0,1,0},{0,0,1}};
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    // outputs and all
}
But it shows the following errors:
matrix_add.cu(22): error: too many initializer values
matrix_add.cu(25): error: argument of type "float **" is incompatible with parameter of type "float (*)[3]"
Your help is highly appreciated.
You got a lot wrong in your attempt, so much so that it was faster to write a working version than to list out all the individual problems in the code in your question. So here is a working version of what it appears you were trying to do:
#include <algorithm>
#include <iostream>

const int N = 3;

__global__ void MatAdd(float A[][N], float B[][N], float C[][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    float* A; cudaMallocManaged(&A, N*N*sizeof(float));
    float* B; cudaMallocManaged(&B, N*N*sizeof(float));
    float* C; cudaMallocManaged(&C, N*N*sizeof(float));

    const float A_vals[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    const float B_vals[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    float (*C_vals)[N] = reinterpret_cast<float (*)[N]>(C);

    std::copy(&A_vals[0][0], &A_vals[0][0] + N*N, A);
    std::copy(&B_vals[0][0], &B_vals[0][0] + N*N, B);

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(1, 1);
    MatAdd<<<numBlocks, threadsPerBlock>>>( reinterpret_cast<float (*)[N]>(A),
                                            reinterpret_cast<float (*)[N]>(B),
                                            C_vals );
    cudaDeviceSynchronize();

    for(int i=0; i<N; i++) {
        for(int j=0; j<N; j++) {
            std::cout << C_vals[i][j] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
Some important points:
Managed memory allocation replaces standard host memory allocation and produces memory which is directly accessible on both the host and the device.
All arrays decay to a pointer when passed as arguments to a function by value. That decay is not recursive. See here for more details.
You can (and will need to) cast in order to use the [][] access syntax on linear memory allocated dynamically at runtime (this applies to malloc, new, or any of the CUDA host memory allocation APIs; see here for more details).
Initialization syntax and assignment syntax for arrays are not interchangeable.
All I can suggest is that you study it thoroughly until you understand how it works.
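As an aside (not part of the original answer), if you would rather avoid the reinterpret_cast entirely, you can keep the managed allocation as a flat array and compute the row-major index yourself; a minimal sketch:
__global__ void MatAddFlat(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n)
        C[i * n + j] = A[i * n + j] + B[i * n + j]; // row-major: element (i, j)
}
The indexing is equivalent to C[i][j] on a float[N][N] array; it just trades the cast for explicit index arithmetic.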

Read an Array With Threads in CUDA

I was wondering whether it is possible, and what the best way would be, to read cells from an array with threads in CUDA. To simplify what I mean, here is an example:
I have an array {1,2,3,4,5,6,...} and I would like each thread to read n cells of my array, depending mainly on its size.
I have been trying a few things, but it does not seem to work, so if anyone could point out a (right) way to do it, that would be great.
Thank you.
Generally you want contiguous threads to read contiguous array indices. Doing so results in "coalesced" memory transactions. The simple way to think of it is: if 32 threads are running physically in parallel and they all do a load, and all 32 loads fall into the same cache line, then a single memory access can fill the cache line, rather than 32 separate ones.
So what you want to do is have each thread access n cells that are strided by the number of threads, like this (assuming input data is in the float array data).
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;

for (int i = idx; i < numElements; i += stride) {
    float element = data[i];
    process(element);
}
If your algorithm requires that each thread reads n contiguous data elements, then you are going to incur non-coalesced loads, which will be much more expensive. In this case, I would consider re-designing the algorithm so this type of access is not required.
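For contrast, the per-thread-contiguous pattern described in the previous paragraph would look roughly like this (a sketch only; n is the per-thread element count from the question and numElements is assumed as above):
// Each thread reads n consecutive elements starting at its own offset.
// Neighboring threads now touch addresses n*sizeof(float) apart, so their
// loads cannot be combined into a single cache-line transaction.
int idx = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = idx * n; i < (idx + 1) * n && i < numElements; i++) {
    float element = data[i];
    process(element);
}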
You say that:
the threads have to look at the n next numbers
So you can use something like this:
#define N 2
#define NTHREAD 1024
#define ARRAYSIZE N*NTHREAD

// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its own stride of N consecutive elements
    for(int i=0; i<N; i++){
        array[startId+i] = tid;
    }
}
// call the kernel by:
accessArray<<<NTHREAD/256, 256>>>(d_array);
Then dump out the array and check whether that is how you want your threads to work.
Full code:
#include <cuda.h>
#include <stdio.h>

#define N 2
#define NTHREAD 1024
#define ARRAYSIZE N*NTHREAD

// develop the kernel as:
__global__ void accessArray(int *array){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int startId = tid*N;
    // each thread writes its own stride of N consecutive elements
    for(int i=0; i<N; i++){
        array[startId+i] = tid;
    }
}

int main()
{
    int h_array[ARRAYSIZE];
    int *d_array;
    size_t memsize = ARRAYSIZE * sizeof(int);

    for (int i=0; i<ARRAYSIZE; i++) {
        h_array[i] = 0;
    }

    cudaMalloc(&d_array, memsize);
    cudaMemcpy(d_array, h_array, memsize, cudaMemcpyHostToDevice);
    accessArray<<<NTHREAD/256, 256>>>(d_array);
    cudaMemcpy(h_array, d_array, memsize, cudaMemcpyDeviceToHost);

    for (int i=0; i<ARRAYSIZE; i++)
        printf("A[%d] => %d\n", i, h_array[i]);

    cudaFree(d_array);
    return 0;
}

stack overflow exception at program start (CUDA Monte Carlo Pi)

My problem is that I am receiving a stack overflow exception at program start, when the program first enters main. My program is a parallel Monte Carlo Pi calculator using CUDA. When I try to debug the program in Visual Studio, the exception pops up before any breakpoint I can select. Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>

#define NUM_THREAD 512
#define NUM_BLOCK 65534

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
    extern __shared__ int sdata[];

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_odata[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
        if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
    extern __shared__ int sdata[];
    // unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int k, incircle;
    float x, y, z;
    incircle = 0;

    curand_init(1234, i, 0, &states[i]);
    for(k = 0; k < trials; k++){
        x = curand_uniform(&states[i]);
        y = curand_uniform(&states[i]);
        z = sqrt(x*x + y*y);
        if (z <= 1) incircle++;
        else{}
    }
    __syncthreads();
    g_odata[i] = incircle;
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
    float* solution = (float*)calloc(100, sizeof(float));
    float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
    int trials, total;
    curandState *devStates;

    trials = 100;
    total = trials*NUM_THREAD*NUM_BLOCK;

    dim3 dimGrid(NUM_BLOCK,1,1);   // Grid dimensions
    dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
    size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); // Array memory size

    cudaMalloc((void **) &sumDev, size); // Allocate array on device
    cudaMalloc((void **) &devStates, size*sizeof(curandState));

    // Do calculation on device by calling CUDA kernel
    monteCarlo <<<dimGrid, dimBlock, size>>> (sumDev, trials, devStates);
    // call reduction function to sum
    reduce0 <<<dimGrid, dimBlock, size>>> (sumDev);

    // Retrieve result from device and store it in host array
    cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);

    *solution = 4*(sumHost[0]/total);
    printf("%.*f\n", 1000, *solution);
    free (solution);
    //*solution = NULL;
    return 0;
}
I would assume the problem is this:
float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
for
#define NUM_THREAD 512
#define NUM_BLOCK 65534
That leaves you with a roughly 130MB array with automatic (stack) storage inside main. I doubt the runtime can deal with such a large stack allocation, which is why you get an instant stack overflow. Replace it with a dynamic allocation and the stack overflow problem will go away. But then read Pavan's post carefully, because once you fix the stack overflow, the CUDA code itself also needs some redesign before it will work.
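For example, the change could look like this (a minimal sketch of the suggested fix, using the names from the code in the question):
// Allocate the host buffer on the heap instead of as a ~130MB local array:
float *sumHost = (float *)malloc(NUM_BLOCK * NUM_THREAD * sizeof(float));
if (sumHost == NULL) { /* handle allocation failure */ }

// ... kernel launches and cudaMemcpy as before ...

free(sumHost); // release the buffer when done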
You are declaring the size of the dynamic shared memory to be size, like here:
monteCarlo <<<dimGrid, dimBlock, size>>>
The value of size is 512 * 65534 * 4 ≈ 2^9 * 2^16 * 2^2 = 2^27 bytes, far more than the maximum amount of shared memory on any card I can think of.
But looking at your kernels, I think you want the amount of shared memory to match the number of threads you have (one int per thread).
So you either need to:
1) Use this to launch your kernels:
monteCarlo <<<dimGrid, dimBlock, (NUM_THREAD * sizeof(int))>>>
2) Or launch your kernels like this:
monteCarlo <<<dimGrid, dimBlock>>>
and declare your shared memory inside the kernel like this:
__shared__ int sdata[NUM_THREAD]; // Note: no extern before __shared__
I personally prefer method two for these kinds of kernels, because the shared memory usage is proportional to the number of threads and the number of threads is known to be constant. It is also slightly faster.
EDIT
Apart from the aforementioned problems, I suspect this might be causing trouble too:
cudaMalloc((void **) &devStates, size*sizeof(curandState));
because size itself is
size = NUM_BLOCK * NUM_THREAD * sizeof(float);
Maybe you wanted to do this instead?
cudaMalloc((void **) &devStates, (NUM_BLOCK * NUM_THREAD)*sizeof(curandState));
As for the actual stack overflow problem, you may want to look at talonmies' post.

CUDA programming

I am new to CUDA. I have a question about a simple program and hope someone can notice my mistake.
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if(ix < 16 && iy < 16)
    {
        for(int i = 0; i<256; i++)
            C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
    }
}

extern "C" void cuda_p(float* A, float* B, float* C)
{
    float* dev_A;
    float* dev_B;
    float* dev_C;

    cudaMalloc((void**) &dev_A, sizeof(float) * 256);
    cudaMalloc((void**) &dev_B, sizeof(float) * 256);
    cudaMalloc((void**) &dev_C, sizeof(float) * 256);

    cudaMemcpy(dev_A, A, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, sizeof(float) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_C, C, sizeof(float) * 256, cudaMemcpyHostToDevice);

    ADDD<<<16,16>>>(dev_A,dev_B,dev_C);

    cudaMemcpy(A, dev_A, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, dev_B, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(C, dev_C, sizeof(float) * 256, cudaMemcpyDeviceToHost);

    cudaFree(dev_A);
    cudaFree(dev_B);
    cudaFree(dev_C);
}
Are you sure about your kernel launch configuration? In your code you try to start some unknown function ADDD. And your execution configuration is gridDim = (16, 1, 1) and blockDim = (16, 1, 1). So in your kernel blockIdx.x = [0..16) and threadIdx.x = [0..16). If I understood you right, then
ix = threadIdx.x;
iy = blockIdx.x;
Read about it in the CUDA Programming Guide (Appendix B.15).
But that's not the only mistake. When you accumulate values in C[i] you have a race condition: many threads simultaneously read C[i], add some value (A[ix+iy*16] + B[ix+iy*16]) and write the result back to C[i]. You should use atomic add operations (CUDA Programming Guide, Appendix B.11.1.1) or redesign your kernel to maximize memory coalescing (CUDA C Best Practices Guide 3.2.1), because atomics are very, very slow.
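For reference, the atomic form of that accumulation would look roughly like this (a sketch only, assuming you really do want every C[i] to accumulate every thread's contribution; atomicAdd on float requires compute capability 2.0 or later):
for (int i = 0; i < 256; i++)
    atomicAdd(&C[i], A[ix + iy*16] + B[ix + iy*16]); // atomics serialize the conflicting writes, hence slow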
Your primary issue is that the core of your kernel doesn't make sense. What you have is:
for(int i = 0; i<256; i++)
    C[i] = A[ix+iy*16] + B[ix+iy*16] + C[i]; // << I wish to store all in C
This is going to have each thread go through and read every entry in C, add its own part of A and B to it, and write it back. Since every thread is doing this at the same time, they're going to step on each other. If you really want every entry in C to be the sum of all entries in A and all entries in B, you want to make each thread responsible for a certain entry in C:
for(int i = 0; i<256; i++)
    C[ix+iy*16] += A[i] + B[i];
If instead you want every entry in C to be the sum of the corresponding entries in A and B, which seems more likely, then you would get rid of the loop, and your kernel would look like:
__global__ void ADD(float* A, float* B, float* C)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if(ix < 16 && iy < 16)
    {
        C[ix+iy*16] = A[ix+iy*16] + B[ix+iy*16];
    }
}
Each thread grabs one entry from A and one from B, and writes one entry in C.
Your secondary issue is that you're launching the kernel wrong. You're doing:
ADDD<<<16,16>>>(dev_A,dev_B,dev_C);
This launches a one-dimensional grid of 16 blocks, each containing 16 threads along x only (and it refers to the typo'd kernel name ADDD). If you want to have your threads positioned in 2 dimensions (using both the x and y indexes), you need to use dim3 as your size specifier type. Something like:
// Use a grid of 4x4 blocks
dim3 gridSize;
gridSize.x = 4;
gridSize.y = 4;
// Use blocks of 4x4 threads.
dim3 blockSize;
blockSize.x = 4;
blockSize.y = 4;
// Run a 4x4 grid of blocks, each with 4x4 threads.
// So you end up with a 16x16 group of threads, matching your data layout.
ADD<<<gridSize,blockSize>>>(dev_A,dev_B,dev_C);
To avoid using atomicAdd, you can allocate shared memory, write the values into shared memory, add them there, and then write the result out. Note that you should not try to use atomicAdd on shared memory here; it is even slower than global-memory atomicAdd (only integer atomicAdd in shared memory is faster than the global version). Also, writes into shared memory should avoid bank conflicts. In my tests, using shared memory makes the algorithm about 1-5% faster than atomicAdd, and warp-synchronous code can be faster still.
In general, my suggestions are:
Use shared memory instead of atomicAdd.
Use warp-synchronous techniques rather than __syncthreads() (this needs a special design).
You might then enjoy a 5-10% increase in speed.
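As an illustration of the shared-memory idea in the last answer, here is a minimal block-level sum that avoids atomicAdd entirely (a sketch under the assumption of a power-of-two block size; the kernel and variable names are hypothetical):
__global__ void blockSum(const float* in, float* out)
{
    extern __shared__ float sdata[];               // one slot per thread, sized at launch
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = in[i];                            // stage each thread's value in shared memory
    __syncthreads();

    // Pairwise tree reduction within the block (requires power-of-two blockDim.x).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];                // one partial sum per block, no atomics needed
}

// Launch example: blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial);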

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);

    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();

    // Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory, but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation: the grid was far too small for the matrices being processed.
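To make that concrete (this is a sketch, not the poster's actual fix): a 4x4 grid of 4x4 blocks only covers a 16x16 tile of the output, so any Width larger than 16 leaves most of Pd untouched. Sizing the grid from Width, rounding up, and guarding against out-of-range threads is the usual pattern:
dim3 dimBlock(4, 4, 1);
// Round up so the grid covers the whole Width x Width matrix.
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y,
             1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

// Inside the kernel, skip threads that fall outside the matrix:
// if (row < Width && col < Width) { ... }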