How to write a CUDA __global__ function for this?

I want to convert the following function into CUDA.
void fun()
{
    for (i = 0; i < terrainGridLength; i++)
    {
        for (j = 0; j < terrainGridWidth; j++)
        {
            //CODE of function
        }
    }
}
I wrote the function like this :
__global__ void fun()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < terrainGridLength) && (j < terrainGridWidth))
    {
        //CODE of function
    }
}
I declared both terrainGridLength and terrainGridWidth as constants and assigned the value 120 to both, and I am calling the function like this:
fun<<<30,500>>>()
But I am not getting the correct output.
Is the code I wrote correct? I haven't understood much about the parallel execution of the code. Please explain how the code will work, and correct me if I have made any mistakes.

You use the y dimension, which means you are using a 2D arrangement of threads, so you cannot invoke the kernel with only:
int numBlock = 30;
int numThreadsPerBlock = 500;
fun<<<numBlock,numThreadsPerBlock>>>()
The invocation should be as follows (note that the blocks now contain a 2D arrangement of threads):
dim3 dimGrid(GRID_SIZE, GRID_SIZE); // 2D Grids with size = GRID_SIZE*GRID_SIZE
dim3 dimBlocks(BLOCK_SIZE, BLOCK_SIZE); //2D Blocks with size = BLOCK_SIZE*BLOCK_SIZE
fun<<<dimGrid, dimBlocks>>>()
Refer to the CUDA Programming Guide for further info. Also, if you want to work with 2D or 3D arrays, you are better off using cudaMallocPitch or cudaMalloc3D; a quick pitched-allocation sketch is shown below.
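A minimal sketch of the pitched allocation (the names h_arr, width and height are placeholders, not from the question): cudaMallocPitch pads each row to pitch bytes, so device code must index rows through the pitch rather than the logical width.
float *d_arr = NULL;
size_t pitch = 0;
cudaMallocPitch((void **)&d_arr, &pitch, width * sizeof(float), height);
cudaMemcpy2D(d_arr, pitch, h_arr, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);
// inside a kernel, row j starts at (char*)d_arr + j * pitch:
//   float *row = (float *)((char *)d_arr + j * pitch);
//   float value = row[i];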
As for your code, I think the following would work (I haven't tried it, but hopefully you can get the idea from it):
//main
dim3 dimGrid(1, 1); // 2D Grids with size = 1
dim3 dimBlocks(Width, Height); //2D Blocks with size = Height*Width
fun<<<dimGrid, dimBlocks>>>(Width, Height)
//kernel
__global__ void fun(int Width, int Height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < Width) && (j < Height))
    {
        //CODE of function
    }
}
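One caveat with the snippet above (a sketch, not tested): a single block of Width x Height threads only works while Width * Height stays within the device's per-block thread limit (1024 on current GPUs, 512 on the oldest ones). For the 120 x 120 case in the question you would normally split the domain across several blocks and round the grid up, for example:
dim3 dimBlocks(16, 16);                                  // 256 threads per block
dim3 dimGrid((Width  + dimBlocks.x - 1) / dimBlocks.x,   // ceil(Width / 16)
             (Height + dimBlocks.y - 1) / dimBlocks.y);  // ceil(Height / 16)
fun<<<dimGrid, dimBlocks>>>(Width, Height);
The kernel's bounds check (i < Width && j < Height) then discards the extra threads in the partial blocks.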

Related

The type of random number generator in cuRAND kernels

A typical example of random number generation in CUDA or pyCUDA is reported in the question How to generate random number inside pyCUDA kernel?, namely
#include <curand_kernel.h>

const int nstates = %(NGENERATORS)s;
__device__ curandState_t* states[nstates];

__global__ void initkernel(int seed)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < nstates) {
        curandState_t* s = new curandState_t;
        if (s != 0) {
            curand_init(seed, tidx, 0, s);
        }
        states[tidx] = s;
    }
}
__global__ void randfillkernel(float *values, int N)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < nstates) {
        curandState_t s = *states[tidx];
        for (int i = tidx; i < N; i += blockDim.x * gridDim.x) {
            values[i] = curand_uniform(&s);
        }
        *states[tidx] = s;
    }
}
Using this classical example, what is the random number generator that is activated (XORWOW, MTGP32, others)?
How is it possible to change the random number generator from within the kernel?
The default generator in the cuRAND device API is XORWOW, as defined by
typedef struct curandStateXORWOW curandState_t;
in the device API header. You can change to another generator by substituting another state type in the curand_init call. Note that some generators require different arguments to curand_init compared to the default.
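For example, here is a minimal sketch of the same pattern with the Philox4_32-10 generator. Philox and MRG32k3a accept the same curand_init(seed, subsequence, offset, &state) arguments as XORWOW, so only the state type changes; MTGP32 and the Sobol generators need different setup calls. The sketch passes the state array as a kernel argument instead of using the __device__ pointer array above, just to keep it self-contained:
#include <curand_kernel.h>

__global__ void initkernel_philox(curandStatePhilox4_32_10_t *states, int seed, int nstates)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < nstates) {
        curand_init(seed, tidx, 0, &states[tidx]);   // same arguments, different state type
    }
}

__global__ void randfillkernel_philox(curandStatePhilox4_32_10_t *states, float *values, int N, int nstates)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < nstates) {
        curandStatePhilox4_32_10_t s = states[tidx];   // work on a local copy
        for (int i = tidx; i < N; i += blockDim.x * gridDim.x) {
            values[i] = curand_uniform(&s);
        }
        states[tidx] = s;                              // write the state back
    }
}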

Shared memory mutex with CUDA - adding to a list of items

My problem is the following: I have an image in which I detect some points of interest using the GPU. The detection is a heavyweight test in terms of processing; however, only about 1 in 25 points passes the test on average. The final stage of the algorithm is to build up a list of the points. On the CPU this would be implemented as:
forall pixels x,y
{
    if(test_this_pixel(x,y))
        vector_of_coordinates.push_back(Vec2(x,y));
}
On the GPU I have each CUDA block processing 16x16 pixels. The problem is that I need to do something special to eventually have a single consolidated list of points in global memory. At the moment I am trying to generate a local list of points in shared memory per block which eventually will be written to global memory. I am trying to avoid sending anything back to the CPU because there are more CUDA stages after this.
I was expecting that I could use atomic operations to implement the push_back function on shared memory. However, I am unable to get this working. There are two issues. The first annoying issue is that I constantly run into the following compiler crash when using atomic operations: "nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)". It is hit or miss whether I can compile something. Does anyone know what causes this?
The following kernel will reproduce the error:
__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ unsigned int test;
    atomicInc(&test, 1000);
}
Secondly, my code which includes a mutex lock on shared memory hangs the GPU, and I don't understand why:
__device__ void lock(unsigned int *pmutex)
{
    while(atomicCAS(pmutex, 0, 1) != 0);
}

__device__ void unlock(unsigned int *pmutex)
{
    atomicExch(pmutex, 0);
}

__global__ void gpu_kernel_non_max_suppress(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ RtmPoint localPoints[64];
    __shared__ int localCount;
    __shared__ unsigned int mutex;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int blockid = blockIdx.y * gridDim.x + blockIdx.x;
    if(threadid==0)
    {
        localCount = 0;
        mutex = 0;
    }
    __syncthreads();
    if(x<w && y<h)
    {
        if(some_test_on_pixel(x,y))
        {
            RtmPoint point;
            point.x = x;
            point.y = y;
            // this is a local push_back operation
            lock(&mutex);
            if(localCount<64) // we should never get >64 points per block
                localPoints[localCount++] = point;
            unlock(&mutex);
        }
    }
    __syncthreads();
    if(threadid==0)
        pCounts[blockid] = localCount;
    if(threadid<localCount)
        pPoints[blockid * 64 + threadid] = localPoints[threadid];
}
In the example code at this site, the author manages to use atomic operations on shared memory successfully, so I am confused as to why my case does not work. If I comment out the lock and unlock lines, the code runs OK, but the list is obviously built incorrectly.
I would appreciate some advice about why this problem is happening and also perhaps if there is a better solution to achieving the goal, since I am concerned anyway about the performance issues with using atomic operations or mutex locks.
I suggest using a prefix sum to implement that part and increase parallelism. To do that you need to use a shared array. Basically, a prefix sum will turn an array (1,1,0,1) into (0,1,2,2), i.e., it calculates an in-place running exclusive sum so that each thread gets its own write index.
__shared__ uint8_t vector[NUMTHREADS];

....

bool emit = (x<w && y<h);
emit = emit && some_test_on_pixel(x,y);
__syncthreads();
scan(emit, vector);
if (emit) {
    pPoints[blockid * 64 + vector[TID]] = point;
}
prefix-sum example:
template <typename T>
__device__ T scan(T mark, T *output) {
#define GET_OUT (pout?output:values)
#define GET_INP (pin?output:values)
    // numWorkers: number of threads in the block (a compile-time constant)
    __shared__ T values[numWorkers];
    int pout=0, pin=1;
    int tid = threadIdx.x;
    values[tid] = mark;
    __syncthreads();
    for( int offset=1; offset < numWorkers; offset *= 2) {
        pout = 1 - pout; pin = 1 - pout;
        __syncthreads();
        if ( tid >= offset) {
            GET_OUT[tid] = (GET_INP[tid-offset]) + (GET_INP[tid]);
        }
        else {
            GET_OUT[tid] = GET_INP[tid];
        }
        __syncthreads();
    }
    if(!pout)
        output[tid] = values[tid];
    __syncthreads();
    return output[numWorkers-1];
#undef GET_OUT
#undef GET_INP
}
Based on the recommendations here, I include the code that I used in the end. It uses 16x16 pixel blocks. Note that I am now writing the data out in one global array without breaking it up. I used the global atomicAdd function to compute a base address for each block's set of results. Since this only gets called once per block, I did not find too much of a slowdown, and I gained a lot more convenience by doing this. I'm also avoiding shared buffers for the input and output of prefix_sum. pGlobalCount is set to zero prior to the kernel call.
#define BLOCK_THREADS 256

__device__ int prefixsum(int threadid, int data)
{
    __shared__ int temp[BLOCK_THREADS*2];
    int pout = 0;
    int pin = 1;
    if(threadid==BLOCK_THREADS-1)
        temp[0] = 0;
    else
        temp[threadid+1] = data;
    __syncthreads();
    for(int offset = 1; offset<BLOCK_THREADS; offset<<=1)
    {
        pout = 1 - pout;
        pin = 1 - pin;
        if(threadid >= offset)
            temp[pout * BLOCK_THREADS + threadid] = temp[pin * BLOCK_THREADS + threadid] + temp[pin * BLOCK_THREADS + threadid - offset];
        else
            temp[pout * BLOCK_THREADS + threadid] = temp[pin * BLOCK_THREADS + threadid];
        __syncthreads();
    }
    return temp[pout * BLOCK_THREADS + threadid];
}
__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pGlobalCount)
{
    __shared__ int write_base;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int valid = 0;
    if(x<w && y<h)
    {
        if(test_pixel(x,y))
        {
            valid = 1;
        }
    }
    int index = prefixsum(threadid, valid);
    if(threadid==BLOCK_THREADS-1)
    {
        int total = index + valid;
        if(total>64)
            total = 64; // global output buffer is limited to 64 points per block
        write_base = atomicAdd(pGlobalCount, total); // get a location to write them out
    }
    __syncthreads(); // ensure write_base is valid for all threads
    if(valid)
    {
        RtmPoint point;
        point.x = x;
        point.y = y;
        if(index<64)
            pPoints[write_base + index] = point;
    }
}
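A later note, not part of the original answers: recent CUDA toolkits ship the CUB library, whose cub::BlockScan can stand in for the hand-written prefixsum above and also hands back the block-wide total. A rough sketch, assuming the same 16x16 blocks, RtmPoint and test_pixel as above (and that the global pPoints buffer is large enough to hold every emitted point):
#include <cub/block/block_scan.cuh>

__global__ void gpu_kernel_cub(int w, int h, RtmPoint *pPoints, int *pGlobalCount)
{
    // BlockScan specialized for a 16x16 (256-thread) block
    typedef cub::BlockScan<int, 16, cub::BLOCK_SCAN_RAKING, 16> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;
    __shared__ int write_base;

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int threadid = threadIdx.y * blockDim.x + threadIdx.x;

    int valid = (x < w && y < h && test_pixel(x, y)) ? 1 : 0;

    // exclusive sum of 'valid' gives each thread its write index;
    // 'total' is the number of points emitted by the whole block
    int index = 0, total = 0;
    BlockScan(temp_storage).ExclusiveSum(valid, index, total);

    if (threadid == 0)
        write_base = atomicAdd(pGlobalCount, total);
    __syncthreads();

    if (valid)
    {
        RtmPoint point;
        point.x = x;
        point.y = y;
        pPoints[write_base + index] = point;
    }
}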

CUDA Jacobian Relaxation

I am in the process of mapping this sequential computation to a CUDA computation. This computation is a 2-dimensional Jacobian relaxation on an NxN grid, where N is unknown. N is evenly divisible by 32.
Jacobi(float *a,float *b,int N){
    for (i=1; i<N+1; i++){
        for (j=1; j<N+1; j++) {
            a[i][j]=0.8*(b[i+1][j]+b[i+1][j]+b[i][j+1]+b[i][j+1]);
        }
    }
}
I'm parallelizing the outer two loops, and each thread should compute just one element. The goal is to parallelize it using a cyclic distribution in the x and y dimensions. Can someone help me implement a Jacobi_GPU kernel with the appropriate indexing so that it works with the following launch configuration?
dim3 dimGrid(N/32,N/32);
dim3 dimBlock(32,32);
Jacobi_GPU<<<dimGrid,dimBlock>>>(A,B,N)
This is the simple implementation. You can also apply a shared-memory optimization to this kernel function (a sketch follows after the two versions below).
__global__ void jacobi(int* a, const int* b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i<N && j<N)
    {
        a[j*N+i] = 0.8* (2*b[(i+1)+j*N] + 2*b[i+N*(j+1)]);
    }
}
Or, if you want to use "arrays of arrays" rather than arrays:
__global__ void Jacobi(int** a, const int** b, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i<N && j<N)
    {
        a[i][j]=0.8*(b[i+1][j]+b[i+1][j]+b[i][j+1]+b[i][j+1]);
    }
}
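The shared-memory optimization mentioned above could look roughly like the sketch below (untested; it assumes row-major storage, the 32x32 blocks from the question's launch configuration, and the same update formula as the first kernel). Each block stages its tile of b plus a one-element halo on the +x and +y sides, then reads the neighbours from shared memory:
#define TILE 32

__global__ void jacobi_shared(int* a, const int* b, const int N)
{
    // tile of b with one extra column/row for the +x / +y neighbours
    __shared__ int tile[TILE + 1][TILE + 1];

    int i = blockIdx.x * TILE + threadIdx.x;   // column
    int j = blockIdx.y * TILE + threadIdx.y;   // row

    // stage this block's own elements
    if (i < N && j < N)
        tile[threadIdx.y][threadIdx.x] = b[j * N + i];

    // threads on the right/bottom edge of the block also load the halo
    if (threadIdx.x == TILE - 1 && i + 1 < N && j < N)
        tile[threadIdx.y][TILE] = b[j * N + (i + 1)];
    if (threadIdx.y == TILE - 1 && j + 1 < N && i < N)
        tile[TILE][threadIdx.x] = b[(j + 1) * N + i];

    __syncthreads();

    // same update as the first kernel, but the neighbours come from shared memory;
    // the guard also keeps the (i+1) and (j+1) reads inside the domain
    if (i + 1 < N && j + 1 < N)
        a[j * N + i] = 0.8 * (2 * tile[threadIdx.y][threadIdx.x + 1]
                            + 2 * tile[threadIdx.y + 1][threadIdx.x]);
}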

Errors in Polynomial fitting problem on CUDA

I tried to use CUDA to run some simple loops on the device, but it seems that CUDA is hard to understand. I am getting 0 from every function call when I use a CUDA kernel function together with normal C code.
The original code:
double evaluate(int D, double tmp[], long *nfeval)
{
    /* polynomial fitting problem */
    int i, j;
    int const M=60;
    double px, x=-1, dx=(double)M, result=0;
    (*nfeval)++;
    dx = 2/dx;
    for (i=0;i<=M;i++)
    {
        px = tmp[0];
        for (j=1;j<D;j++)
        {
            px = x*px + tmp[j];
        }
        if (px<-1 || px>1) result+=(1-px)*(1-px);
        x+=dx;
    }
    px = tmp[0];
    for (j=1;j<D;j++) px=1.2*px+tmp[j];
    px = px-72.661;
    if (px<0) result+=px*px;
    px = tmp[0];
    for (j=1;j<D;j++) px=-1.2*px+tmp[j];
    px = px-72.661;
    if (px<0) result+=px*px;
    return result;
}
I wanted to run the first for loop on CUDA:
double evaluate_gpu(int D, double tmp[], long *nfeval)
{
    /* polynomial fitting problem */
    int j;
    int const M=60;
    double px, dx=(double)M, result=0;
    (*nfeval)++;
    dx = 2/dx;
    int N = M;
    double *device_tmp = NULL;
    size_t size_tmp = sizeof tmp;
    cudaMalloc((double **) &device_tmp, size_tmp);
    cudaMemcpy(device_tmp, tmp, size_tmp, cudaMemcpyHostToDevice);
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    cEvaluate <<< n_blocks, block_size >>> (device_tmp, result, D);
    // cudaMemcpy(result, result, size_result, cudaMemcpyDeviceToHost);
    px = tmp[0];
    for (j=1;j<D;j++) px=1.2*px+tmp[j];
    px = px-72.661;
    if (px<0) result+=px*px;
    px = tmp[0];
    for (j=1;j<D;j++) px=-1.2*px+tmp[j];
    px = px-72.661;
    if (px<0) result+=px*px;
    return result;
}
Where the device function looks like:
__global__ void cEvaluate_temp(double* tmp, double result, int D)
{
    int M =60;
    double px;
    double x=-1;
    double dx=(double)M;
    int j;
    dx = 2/dx;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 60) //<==>if (idx < M)
    {
        px = tmp[0];
        for (j=1;j<D;j++)
        {
            px = x*px + tmp[j];
        }
        if (px<-1 || px>1)
        {
            __syncthreads();
            result+=(1-px)*(1-px); //+=
        }
        x+=dx;
    }
}
I know that I have not specified the problem precisely, but it seems that I have more than one problem.
I do not know when I have to copy a variable to the device, and when it is copied 'automatically'.
I am using CUDA 3.2, and there is a problem with emulation (I would like to use printf): when I build with make emu=1, there is no error when I use printf, but I also do not get any output.
Below is the simplest version of the device function that I tested. Can anybody explain what will happen to the result value after incrementing it in parallel? I think I should use device shared memory and synchronization to do something like "+=".
__global__ void cEvaluate(double* tmp, double result, int D)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 60) //<==>if (idx < M)
    {
        result+=1;
        printf("res = %f ",result); //-deviceemu, make emu=1
    }
}
No, the variable result is not shared across multiple threads.
What I would suggest is to keep a matrix of result values in shared memory, one result per thread, compute each value, and then reduce it to a single value.
#define BLOCKSIZE 4   // must match the block size used at launch

__global__ void cEvaluate_temp(double* tmp, double *global_result, int D)
{
    int const M = 60;
    double dx = 2.0 / (double)M;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // one partial result per thread, kept in shared memory
    __shared__ double shared_result[BLOCKSIZE];
    shared_result[threadIdx.x] = 0.0;

    if (idx < M)  // the sequential loop runs i = 0..M inclusive; widen the bound if that last point matters
    {
        double x = -1.0 + idx * dx;   // each thread evaluates one value of x
        double px = tmp[0];
        for (int j = 1; j < D; j++)
        {
            px = x * px + tmp[j];
        }
        if (px < -1 || px > 1)
            shared_result[threadIdx.x] = (1 - px) * (1 - px);
    }
    __syncthreads();

    // thread 0 reduces the per-thread values of its block
    if (threadIdx.x == 0)
    {
        double total_result = 0.0;
        for (int k = 0; k < BLOCKSIZE; k++)
            total_result += shared_result[k];
        // with more than one block, these per-block sums still have to be
        // combined (one slot per block, or an atomic add) before use
        global_result[0] = total_result;
    }
}
Also, you need a cudaMemcpy after the kernel invocation to bring the result back: kernels are asynchronous, and the copy (or an explicit synchronization call) is what makes the host wait for them.
Also use the error-checking functions at each CUDA API invocation; a host-side sketch is shown below.
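A minimal host-side sketch of that advice (the CUDA_CHECK macro, device_result and host_result are illustrative names, not from the question; it assumes <cstdio>/<cstdlib> are included and slots into evaluate_gpu after the existing cudaMalloc/cudaMemcpy for device_tmp):
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

double *device_result = NULL;
double host_result = 0.0;
CUDA_CHECK(cudaMalloc((void **)&device_result, sizeof(double)));
CUDA_CHECK(cudaMemset(device_result, 0, sizeof(double)));

cEvaluate_temp<<<n_blocks, block_size>>>(device_tmp, device_result, D);
CUDA_CHECK(cudaGetLastError());   // catches launch / configuration errors

// cudaMemcpy blocks until the kernel has finished, then brings the value back
CUDA_CHECK(cudaMemcpy(&host_result, device_result, sizeof(double),
                      cudaMemcpyDeviceToHost));
result += host_result;            // accumulate the GPU part into the host total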

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);

    //Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();

    //Write the matrix to device memory each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed: with 4x4 blocks in a 4x4 grid, only a 16x16 tile of the output gets a thread, so any matrix larger than 16x16 is left mostly uncomputed.
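A sketch of a launch configuration that scales with the matrix size (untested; it keeps the question's names and assumes Width is the full matrix dimension). The grid is rounded up, so the kernel should also gain a bounds check (row < Width && col < Width) before the final write for widths that are not multiples of the block size:
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,   // ceil(Width / 16) blocks in x
             (Width + dimBlock.y - 1) / dimBlock.y,   // ceil(Width / 16) blocks in y
             1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);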