Adding values on GPU - CUDA

I have a class called Product.
Each product has a value, and I want to add these values on the GPU. I filled my array on the host side:
int * h_A, * d_A;
h_A = (int*) malloc(enterNum * sizeof(int));
cudaMalloc((void **) &d_A, enterNum * sizeof(int));
while (i < enterNum) {
cout << "Enter product name:";
cin >> desc;
cout << "Enter product price:";
cin >> price;
Product p("Product", price);
h_A[i] = p.getValue();
i++;
}
cudaMemcpy(d_A, h_A, enterNum, cudaMemcpyHostToDevice);
priceSum<<<enterNum, 1024>>>(d_A,enterNum,result);
int result2 = 0;
cudaMemcpy(result, result2, enterNum, cudaMemcpyDeviceToHost);
Here the cudaMemcpy call gives an error because I am not using a pointer. What can I do here? I don't need to use a pointer here, do I?
This is my summation function:
__global__ void priceSum(int *dA, int count, int result) {
int tid = blockIdx.x;
if (tid < count){
result+= dA[tid];
}
}
Full code:
using namespace std;
#include "cuda_runtime.h"
#include <stdio.h>
#include <string.h>
#include <iostream>
#include <stdlib.h>
class Product {
private:
char * description;
int productCode;
int value;
static int lastCode;
public:
Product(char* descriptionP, int valueP) {
productCode = ++lastCode;
value = valueP;
description = new char[strlen(descriptionP) + 1];
strcpy(description, descriptionP);
}
Product(Product& other) {
productCode = ++lastCode;
description = new char[strlen(other.description) + 1];
strcpy(description, other.description);
}
~Product() {
delete[] description;
}
char* getDescription() const {
return description;
}
void setDescription(char* description) {
this->description = description;
}
int getValue() const {
return value;
}
void setValue(int value) {
this->value = value;
}
};
int Product::lastCode = 1000;
__global__ void priceSum(int *dA, int count, int * result) {
int tid = blockIdx.x;
if (tid < count)
result+= dA[tid];
}
int main(void) {
int enterNum, price, * result = 0;
string desc;
const char * desc2;
cout << "How many products do you want to enter?";
cin >> enterNum;
int * h_A, * d_A;
h_A = (int*) malloc(enterNum * sizeof(int));
cudaMalloc((void **) &d_A, enterNum * sizeof(int));
int i = 0;
while (i < enterNum) {
cout << "Enter product name:";
cin >> desc;
cout << "Enter product price:";
cin >> price;
Product p("Product", price);
h_A[i] = p.getValue();
i++;
}
cudaMemcpy(d_A, h_A, enterNum * sizeof(int), cudaMemcpyHostToDevice);
priceSum<<<enterNum, 1>>>(d_A,enterNum,result);
int result2 = 0;
cudaMemcpy(&result2, result, enterNum, cudaMemcpyDeviceToHost);
cout << result2;
return 0;
}

You should show the definition of result in your host code, but I assume it is:
int result;
based on how you are passing it to your priceSum kernel.
You have more than 1 problem here.
In your priceSum kernel, you are summing the values in dA[] and storing the answer in result. But you have passed the variable result to the kernel by value instead of by reference so the value you are modifying is local to the function, and will not show up anywhere else. When a function in C needs to modify a variable that is passed to it via the parameter list, and the modified variable is to show up in the function calling context, it's necessary to pass that parameter by reference (i.e. using a pointer) rather than by value. Note this is based on the C programming language and is not specific to CUDA. So you should rewrite your kernel definition as:
__global__ void priceSum(int *dA, int count, int *result) {
Regarding your cudaMemcpy call, there are several issues that need to be cleaned up. First, we need the storage for result to be properly created using cudaMalloc (before the kernel is called, because the kernel will store something there). Next, we need to fix the parameter list of the cudaMemcpy call itself. So your host code should be rewritten as:
cudaMemcpy(d_A, h_A, enterNum * sizeof(int), cudaMemcpyHostToDevice);
int *result;
cudaMalloc((void **)&result, sizeof(int));
priceSum<<<enterNum, 1024>>>(d_A,enterNum,result);
int result2 = 0;
cudaMemcpy(&result2, result, sizeof(int), cudaMemcpyDeviceToHost);
There appear to be other problems with your code, around the grouping of data for threads and blocks. But you haven't shown enough of your program for me to make sense of it. So let me point out that your code shows only a single value for result (and result2), yet the way your kernel is written, each thread will add its value of dA[tid] to result. You can't have a bunch of threads all updating a single value in global memory with no control mechanism, and expect to get a sensible result. Problems like this are usually best handled with a classical parallel reduction algorithm, but for the sake of simplicity, to try and get something working, you can use atomics:
atomicAdd(result, dA[tid]);
Sorry, but your kernel just makes no sense at all. You are using blockIdx.x as your tid variable, but let's note that blockIdx.x is a number that is the same for every thread in a particular block. So then going on to have every thread add dA[tid] to result in this fashion just doesn't make sense. I believe it will make more sense if you change your kernel invocation to:
priceSum<<<enterNum, 1>>>(d_A,enterNum,result);
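Putting those pieces together, a minimal sketch of what the corrected kernel and the relevant part of main() might look like (this is my own consolidation of the fixes above, using the atomicAdd route rather than a full parallel reduction, and keeping the variable names from the question):
__global__ void priceSum(int *dA, int count, int *result) {
    int tid = blockIdx.x;                  // one product per block with a <<<enterNum, 1>>> launch
    if (tid < count)
        atomicAdd(result, dA[tid]);        // serialize concurrent updates to the single sum
}

// ... inside main(), after h_A has been filled ...
cudaMemcpy(d_A, h_A, enterNum * sizeof(int), cudaMemcpyHostToDevice);

int *result;
cudaMalloc((void **)&result, sizeof(int));
cudaMemset(result, 0, sizeof(int));        // the kernel only adds, so the sum must start at zero

priceSum<<<enterNum, 1>>>(d_A, enterNum, result);

int result2 = 0;
cudaMemcpy(&result2, result, sizeof(int), cudaMemcpyDeviceToHost);
cout << result2;
cudaFree(result);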

Related

Zero padding on the fly with cuFFT

I have a float array and want to run FFTs over chunks of it, zero-padding each chunk of data up to a length of 2^N. I also want to overlap the chunks by a selectable factor.
So far I have a CUDA kernel with which I create another array in which I store the overlapped and padded data. Afterwards a cufftPlanMany is executed.
Because of the two factors, the amount of data becomes very large, and it is in principle only copies of the original data plus zeros, on which I waste my entire memory bandwidth.
I could not find anything on whether cuFFT supports zero padding or whether I have a possibility to hook in custom code.
(Nvidia Quadro P5000, C++14, Kubuntu)
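For reference, the kernel-based approach described above (writing the overlapped, zero-padded data into a second buffer and then running the batched FFT on that buffer) could look roughly like the following sketch. This is only my guess at what such a prepare kernel does; the parameter names mirror those in the callback code below, and the indexing assumes each overlapped chunk starts rgDataLen/overlap samples after the previous one:
__global__ void prepareOverlapPad(const float *in, float *out,
                                  unsigned int rgDataLen, unsigned int rgLen,
                                  unsigned int overlap, unsigned int rgCount)
{
    // one output sample per thread, grid-stride loop over all rgCount * rgLen samples
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t total = (size_t)rgCount * rgLen;
    for (; i < total; i += (size_t)blockDim.x * gridDim.x) {
        size_t rg  = i / rgLen;                       // index of the (overlapped) output chunk
        size_t pos = i - rg * rgLen;                  // position inside that chunk
        size_t start = rg * (rgDataLen / overlap);    // where this chunk begins in the input
        out[i] = (pos < rgDataLen) ? in[start + pos] : 0.0f;   // data first, zero padding after
    }
}
Every output sample is touched exactly once, but the kernel still has to write (and the FFT re-read) on the order of overlap * rgLen / rgDataLen times the input size, which is the bandwidth problem described above.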
Update
I have written a callback function which is called when the data is loaded into the FFT. Unfortunately this is still a little bit slower than my previous solution with a kernel that prepares the data in another array and then calls the FFT.
I need an average of 2.4 ms for the example with the given values.
My hope was that if I process the data on the fly, my memory bandwidth would no longer limit me. Unfortunately that does not seem to be the case at the moment.
Does anyone have an idea how I can speed this up even more?
// Don't forget to link against cufft_static (not cufft) and culibos, and set the flag -dc
#include <stdio.h>
#include <cstdint>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <math.h>
struct fft_CB_LD_callerInfo{
uint16_t rgLen;
uint16_t rgDataLen;
uint16_t overlapFactor;
};
static __device__ cufftReal myOwnCallback(void *dataIn,
size_t offset,
void *callerInfo,
void *sharedPtr) {
const fft_CB_LD_callerInfo *fftInfo = (fft_CB_LD_callerInfo*)callerInfo;
int idx_rg = offset/fftInfo->rgLen;
int idx_realRg = idx_rg/fftInfo->overlapFactor;
int idx_posInRg = offset-(size_t)idx_rg*fftInfo->rgLen;
if(idx_posInRg < fftInfo->rgDataLen){
const size_t idx_data = idx_posInRg
+ idx_realRg*fftInfo->rgDataLen
+ idx_rg - (idx_realRg*fftInfo->overlapFactor)*fftInfo->rgDataLen/fftInfo->overlapFactor;
return ((cufftReal*)dataIn)[idx_data];
}
else{
return 0.0f;
}
}
__device__ cufftCallbackLoadR myOwnCallbackPtr = myOwnCallback;
int main(){
// Data
float *dataHost;
float *data;
cufftComplex *spectrum;
cufftComplex *spectrumHost;
unsigned int rgDataLen = 400;
unsigned int rgLen = 2048;
unsigned int overlap = 8;
int peakPosHost[] = {0};
int *peakPos;
unsigned int rgCountClean = 52*16*4;
unsigned int rgCount = rgCountClean*overlap-(overlap-1);
int peakCountHost = 1;
int *peakCount;
// for FFT
cudaStream_t stream;
cufftHandle plan;
cufftResult result;
int fftRank = 1; // --- 1D FFTs
int fftIRide = 1, fftORide = 1; // --- Distance between two successive input/output elements
int fftInembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int fftOnembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int fftEachLen[] = { (int)rgLen }; // --- Size of the Fourier transform
int fftIDist = rgLen;
int fftODist = rgLen/2+1; // --- Distance between batches
// for Custom callback
cufftCallbackLoadR hostCopyOfCallbackPtr;
size_t worksize;
fft_CB_LD_callerInfo *fftInfo;
fft_CB_LD_callerInfo *fftInfoHost;
// Allocate host memory
dataHost = new float[rgDataLen*rgCountClean*peakCountHost];
spectrumHost = new cufftComplex[fftODist*rgCount];
fftInfoHost = new fft_CB_LD_callerInfo;
// create array with example data
for(int k=0; k<rgDataLen;k++){
for(int i=0; i<rgCountClean; i++){
dataHost[i*rgDataLen + k] = sin((2+i*4)*M_PI*k/rgDataLen);
}
}
fftInfoHost->overlapFactor = overlap;
fftInfoHost->rgDataLen = rgDataLen;
fftInfoHost->rgLen = rgLen;
// allocate device memory
cudaMalloc((void **)&data, sizeof(float) * rgDataLen*rgCountClean*peakCountHost);
cudaMalloc((void **)&peakPos, sizeof(int) * peakCountHost);
cudaMalloc((void **)&peakCount, sizeof(int));
cudaMalloc((void **)&spectrum, sizeof(cufftComplex)*fftODist*rgCount);
cudaMalloc((void **)&fftInfo, sizeof(fft_CB_LD_callerInfo));
// copy date from host to device
cudaMemcpy(data, dataHost, sizeof(float)*rgDataLen*rgCountClean*peakCountHost, cudaMemcpyHostToDevice);
cudaMemcpy(peakPos, peakPosHost, sizeof(int)*peakCountHost, cudaMemcpyHostToDevice);
cudaMemcpy(peakCount, &peakCountHost, sizeof(peakCountHost), cudaMemcpyHostToDevice);
cudaMemcpy(fftInfo, fftInfoHost, sizeof(fft_CB_LD_callerInfo), cudaMemcpyHostToDevice);
// get device pointer to custom callback function
cudaError_t error = cudaMemcpyFromSymbol(&hostCopyOfCallbackPtr, myOwnCallbackPtr, sizeof(hostCopyOfCallbackPtr));
if(error != 0) printf("cudaMemcpyFromSymbol faild with %d!\n", (int)error);
// Create a plan of FFTs to fast execute there later
cufftCreate(&plan);
result = cufftMakePlanMany(plan, fftRank, fftEachLen, fftInembed, fftIRide, fftIDist, fftOnembed, fftORide, fftODist, CUFFT_R2C, rgCount, &worksize);
if(result != CUFFT_SUCCESS) printf("cufftMakePlanMany failed with %d!\n", (int)result);
result = cufftXtSetCallback(plan, (void**)&hostCopyOfCallbackPtr, CUFFT_CB_LD_REAL, (void**)&fftInfo);
if(result != CUFFT_SUCCESS) printf("cufftXtSetCallback failed with %d!\n", (int)result);
// ----- Begin test area ---------------------------------------------------
if(cufftExecR2C(plan, data, spectrum) != CUFFT_SUCCESS)
printf("cufftExecR2C is failed!\n");
// ----- End test area ---------------------------------------------------
return 0;
}
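For completeness, since cuFFT callbacks require relocatable device code and the static cuFFT library, the build line hinted at in the comment at the top might look something like this (sm_61 is the architecture of the Quadro P5000; the file name and exact command are assumptions on my part):
nvcc -arch=sm_61 -rdc=true main.cu -o main -lcufft_static -lculibos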

terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: cudaErrorInvalidValue: invalid argument

I am trying to count the number of times curand_uniform() returns 1.0. However, I can't seem to get the following code to work for me:
#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
using namespace std;
__global__
void counts(int length, int *sum, curandStatePhilox4_32_10_t* state) {
int tempsum = int(0);
int i = blockIdx.x * blockDim.x + threadIdx.x;
curandStatePhilox4_32_10_t localState = state[i];
for(; i < length; i += blockDim.x * gridDim.x) {
double thisnum = curand_uniform( &localState );
if ( thisnum == 1.0 ){
tempsum += 1;
}
}
atomicAdd(sum, tempsum);
}
__global__
void curand_setup(curandStatePhilox4_32_10_t *state, long seed) {
int id = threadIdx.x + blockIdx.x * blockDim.x;
curand_init(seed, id, 0, &state[id]);
}
int main(int argc, char *argv[]) {
const int N = 1e5;
int* count_h = 0;
int* count_d;
cudaMalloc(&count_d, sizeof(int) );
cudaMemcpy(count_d, count_h, sizeof(int), cudaMemcpyHostToDevice);
int threads_per_block = 64;
int Nblocks = 32*6;
thrust::device_vector<curandStatePhilox4_32_10_t> d_state(Nblocks*threads_per_block);
curand_setup<<<Nblocks, threads_per_block>>>(d_state.data().get(), time(0));
counts<<<Nblocks, threads_per_block>>>(N, count_d, d_state.data().get());
cudaMemcpy(count_h, count_d, sizeof(int), cudaMemcpyDeviceToHost);
cout << count_h << endl;
cudaFree(count_d);
free(count_h);
}
I am getting this error in the terminal (on Linux):
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorInvalidValue: invalid argument
Aborted (core dumped)
And I am compiling like this:
nvcc -Xcompiler "-fopenmp" -o test uniform_one_hit_count.cu
I don't understand this error message.
This line:
thrust::device_vector<curandStatePhilox4_32_10_t> d_state(Nblocks*threads_per_block);
is initializing a new vector on the device. When thrust does that, it calls the constructor for the object in use, in this case curandStatePhilox4_32_10, a struct whose definition is in /usr/local/cuda/include/curand_philox4x32_x.h (on linux, anyway). Unfortunately that struct definition doesn't provide any constructors decorated with __device__, and this is causing trouble for thrust.
A simple workaround would be to assemble the vector on the host and copy it to the device:
thrust::host_vector<curandStatePhilox4_32_10_t> h_state(Nblocks*threads_per_block);
thrust::device_vector<curandStatePhilox4_32_10_t> d_state = h_state;
Alternatively, just use cudaMalloc to allocate space:
curandStatePhilox4_32_10_t *d_state;
cudaMalloc(&d_state, (Nblocks*threads_per_block)*sizeof(d_state[0]));
You have at least one other problem as well. This is not actually providing a proper allocation of storage for what the pointer should be pointing to:
int* count_h = 0;
after that, you should do something like:
count_h = (int *)malloc(sizeof(int));
memset(count_h, 0, sizeof(int));
and on your print-out line, you most likely want to do this:
cout << count_h[0] << endl;
The other way to address the count_h issue would be to start with:
int count_h = 0;
and this would necessitate a different set of changes to your code (to the cudaMemcpy operations).
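For completeness, a minimal sketch of that second variant (only the lines that change relative to the posted code):
int count_h = 0;                  // ordinary int on the host instead of a pointer
int *count_d;
cudaMalloc(&count_d, sizeof(int));
cudaMemcpy(count_d, &count_h, sizeof(int), cudaMemcpyHostToDevice);   // note the &
// ... kernel setup and launches unchanged ...
cudaMemcpy(&count_h, count_d, sizeof(int), cudaMemcpyDeviceToHost);   // note the &
cout << count_h << endl;
cudaFree(count_d);                // the free(count_h) call is no longer needed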

Does a bool variable in a kernel need to be synchronized?

I have a kernel consisting of a for loop that searches through an array for a specific int value. I'm using a grid block of 256 threads to do this. However, when one thread finds the value, I want to let the other threads know so they can exit. Currently I'm using a boolean flag, but I'm not sure if it's working properly. My concern is synchronization.
__device__ bool found;
__global__
void search()
{
for(int i = threadIdx.x; i<1000000; i += stride)
{
if(found == true)
{
break;
}
else if(arr[i] == x)
{
found = true;
break;
}
}
}
int main()
{
bool flag = false;
cudaMemcpyToSymbol(found, &flag, sizeof(bool), 0,cudaMemcpyHostToDevice);
}
As pointed out in comments, you can probably achieve what you want by declaring the global device flag to be volatile, which will inhibit caching, and by using a memory fence function. There really isn't a global synchronization primitive which would do what you want other than the new grid synchronization mechanism introduced in CUDA 9 and new hardware, but that probably isn't necessary in this case. Turning your pseudocode into a toy example:
#include <iostream>
#include <thrust/device_vector.h>
__device__ volatile bool found;
__device__ volatile size_t idx;
template<bool docheck>
__global__
void search(const int* arr, int x, size_t N)
{
size_t i = threadIdx.x + blockIdx.x * blockDim.x;
size_t stride = blockDim.x * gridDim.x;
for(; (i<N) && (!found); i += stride)
{
if(arr[i] == x)
{
if (docheck) found = true;
idx = i;
__threadfence();
break;
}
}
}
int main()
{
const size_t N = 1 << 24;
const size_t findidx = 280270;
const int findval = 0xdeadbeef;
thrust::device_vector<int> data(N,1);
data[findidx] = findval;
bool flag = false;
size_t zero = 0;
{
cudaMemcpyToSymbol(found, &flag, sizeof(bool));
cudaMemcpyToSymbol(idx, &zero, sizeof(size_t));
int blocks, threads;
cudaOccupancyMaxPotentialBlockSize(&blocks, &threads, search<false>);
search<false><<<blocks, threads>>>(thrust::raw_pointer_cast(data.data()), findval, N);
cudaDeviceSynchronize();
size_t result = 0;
cudaMemcpyFromSymbol(&result, idx, sizeof(size_t));
std::cout << "result = " << result << std::endl;
}
{
cudaMemcpyToSymbol(found, &flag, sizeof(bool));
cudaMemcpyToSymbol(idx, &zero, sizeof(size_t));
int blocks, threads;
cudaOccupancyMaxPotentialBlockSize(&blocks, &threads, search<true>);
search<true><<<blocks, threads>>>(thrust::raw_pointer_cast(data.data()), findval, N);
cudaDeviceSynchronize();
size_t result = 0;
cudaMemcpyFromSymbol(&result, idx, sizeof(size_t));
std::cout << "result = " << result << std::endl;
}
return 0;
}
and profiling it gives the following:
$ nvcc -arch=sm_52 -o notify notify.cu
$ nvprof ./notify
==3916== NVPROF is profiling process 3916, command: ./notify
result = 280270
result = 280270
==3916== Profiling application: ./notify
==3916== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 78.00% 1.6773ms 1 1.6773ms 1.6773ms 1.6773ms void search<bool=0>(int const *, int, unsigned long)
19.93% 428.63us 1 428.63us 428.63us 428.63us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int)
1.82% 39.199us 1 39.199us 39.199us 39.199us void search<bool=1>(int const *, int, unsigned long)
As you can see, the version which sets the found flag completes the search in 40 microseconds, whereas the version which does not set the flag takes 1.7 milliseconds. Given that the kernel is run with the maximum number of resident blocks in both cases, we can conclude that the early exit mechanism worked correctly and running blocks detected that the required value had been found.

Can't get matrix*vector multiplication to go faster in CUDA than in CPU

#include <iostream>
#include <assert.h>
#include <sys/time.h>
#define BLOCK_SIZE 32 // CUDA block size
__device__ inline int getValFromMatrix(int* matrix, int row, int col,int matSize) {
if (row<matSize && col<matSize) {return matrix[row*matSize + col];}
return 0;
}
__device__ inline int getValFromVector(int* vector, int row, int matSize) {
if (row<matSize) {return vector[row];}
return 0;
}
__global__ void matVecMultCUDAKernel(int* aOnGPU, int* bOnGPU, int* cOnGPU, int matSize) {
__shared__ int aRowShared[BLOCK_SIZE];
__shared__ int bShared[BLOCK_SIZE];
__shared__ int myRow;
__shared__ double rowSum;
int myIndexInBlock = threadIdx.x;
myRow = blockIdx.x;
rowSum = 0;
for (int m = 0; m < (matSize / BLOCK_SIZE + 1);m++) {
aRowShared[myIndexInBlock] = getValFromMatrix(aOnGPU,myRow,m*BLOCK_SIZE+myIndexInBlock,matSize);
bShared[myIndexInBlock] = getValFromVector(bOnGPU,m*BLOCK_SIZE+myIndexInBlock,matSize);
__syncthreads(); // Sync threads to make sure all fields have been written by all threads in the block to cShared and xShared
if (myIndexInBlock==0) {
for (int k=0;k<BLOCK_SIZE;k++) {
rowSum += aRowShared[k] * bShared[k];
}
}
}
if (myIndexInBlock==0) {cOnGPU[myRow] = rowSum;}
}
static inline void cudaCheckReturn(cudaError_t result) {
if (result != cudaSuccess) {
std::cerr <<"CUDA Runtime Error: " << cudaGetErrorString(result) << std::endl;
assert(result == cudaSuccess);
}
}
static void matVecMultCUDA(int* aOnGPU,int* bOnGPU, int* cOnGPU, int* c, int sizeOfc, int matSize) {
matVecMultCUDAKernel<<<matSize,BLOCK_SIZE>>>(aOnGPU,bOnGPU,cOnGPU,matSize); // Launch 1 block per row
cudaCheckReturn(cudaMemcpy(c,cOnGPU,sizeOfc,cudaMemcpyDeviceToHost));
}
static void matVecMult(int** A,int* b, int* c, int matSize) {
// Sequential implementation:
for (int i=0;i<matSize;i++) {
c[i]=0;
for (int j=0;j<matSize;j++) {
c[i]+=(A[i][j] * b[j]);
}
}
}
int main() {
int matSize = 1000;
int** A,* b,* c;
int* aOnGPU,* bOnGPU,* cOnGPU;
A = new int*[matSize];
for (int i = 0; i < matSize;i++) {A[i] = new int[matSize]();}
b = new int[matSize]();
c = new int[matSize]();
int aSizeOnGPU = matSize * matSize * sizeof(int), bcSizeOnGPU = matSize * sizeof(int);
cudaCheckReturn(cudaMalloc(&aOnGPU,aSizeOnGPU)); // cudaMallocPitch?
cudaCheckReturn(cudaMalloc(&bOnGPU,bcSizeOnGPU));
cudaCheckReturn(cudaMalloc(&cOnGPU,bcSizeOnGPU));
srand(time(NULL));
for (int i=0;i<matSize;i++) {
b[i] = rand()%100;
for (int j=0;j<matSize;j++) {
A[i][j] = rand()%100;
}
}
for (int i=0;i<matSize;i++) {cudaCheckReturn(cudaMemcpy((aOnGPU+i*matSize),A[i],bcSizeOnGPU,cudaMemcpyHostToDevice));}
cudaCheckReturn(cudaMemcpy(bOnGPU,b,bcSizeOnGPU,cudaMemcpyHostToDevice));
int iters=1;
timeval start,end;
// Sequential run:
gettimeofday(&start,NULL);
for (int i=0;i<iters;i++) {matVecMult(A,b,c,matSize);}
gettimeofday(&end,NULL);
std::cout << (end.tv_sec*1000000 + end.tv_usec) - (start.tv_sec*1000000 + start.tv_usec) << std::endl;
// CUDA run:
gettimeofday(&start,NULL);
for (int i=0;i<iters;i++) {matVecMultCUDA(aOnGPU,bOnGPU,cOnGPU,c,bcSizeOnGPU,matSize);}
gettimeofday(&end,NULL);
std::cout << (end.tv_sec*1000000 + end.tv_usec) - (start.tv_sec*1000000 + start.tv_usec) << std::endl;
cudaCheckReturn(cudaFree(aOnGPU));
cudaCheckReturn(cudaFree(bOnGPU));
cudaCheckReturn(cudaFree(cOnGPU));
for (int i = 0; i < matSize; ++i) {
delete[] A[i];
}
delete[] A;
delete[] b;
delete[] c;
}
Gives:
267171
580253
I've followed the guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory on how to do a matrix multiplication. I used shared memory for both the matrix (A) and the vector (B), but no matter what matrix size (100*100 to 20000*20000) or block size (32-1024) I choose, the sequential implementation always outperforms the CUDA implementation in terms of speed; it is about twice as fast.
Since I'm doing matrix*vector multiplication, the shared arrays and blocks are handled a bit differently; I'm using one block per row of the matrix instead of a 2D block over a part of the matrix.
Is my implementation wrong, or is CUDA simply not faster than the CPU?
First item: you perform boundary checks in the CUDA implementation that you don't perform on the CPU. Branching is really expensive on a GPU.
Second: you count the cudaMemcpy in the CUDA timing. It is very uncommon to perform only one multiplication before having to get the result back to the CPU.
Usually (in CG, for example) you perform several hundred multiplications on the GPU before having to copy back.
Third: don't try to implement this yourself (except for educational purposes); use vendor libraries (like CUBLAS, which ships with every CUDA release), which are extremely hard to outperform.
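To illustrate the third point, the whole custom kernel can be replaced by one cublasSgemv call. A minimal sketch (note this assumes float data rather than the int arrays in the question, and that cuBLAS expects column-major storage, so a row-major matrix is handled by passing CUBLAS_OP_T):
#include <cublas_v2.h>

// aOnGPU: matSize x matSize matrix stored row-major, bOnGPU: input vector, cOnGPU: output vector
void matVecMultCublas(cublasHandle_t handle, const float *aOnGPU,
                      const float *bOnGPU, float *cOnGPU, int matSize)
{
    const float alpha = 1.0f, beta = 0.0f;
    // computes c = alpha * op(A) * b + beta * c; OP_T compensates for the row-major layout
    cublasSgemv(handle, CUBLAS_OP_T, matSize, matSize,
                &alpha, aOnGPU, matSize, bOnGPU, 1, &beta, cOnGPU, 1);
}
The handle would be created once with cublasCreate() outside the timing loop and released with cublasDestroy(), and the program linked against -lcublas.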

Counting occurrences of specific events in CUDA kernels

Problem
I am trying to find the best way to count how many times my program ends up in some specific branches of my CUDA kernels. The idea is that some events should almost never happen, but since the data processed by the GPU is produced by a numerical optimization solver, there may be situations where ill-defined cases become more common. Thus, I want to be able to track/monitor these phenomena over multiple simulations to make some global statistics later.
Possible idea
The most straightforward way to do this may be to use a structure dedicated to monitoring such occurrences. Then, when entering a monitored branch, we increment the associated counter using atomicAdd. At the end of the simulation, we copy the counters back to the host and store them for some future statistics processing.
In my case, the cost of using atomicAdd should not be that important since I should not be entering those branches very often, but still, I may want to monitor some of the more common branches later on, so what would be a better approach in that case? Since this is just for monitoring, I do not want the overhead to be too large.
I guess I could also have one monitoring structure per block and do a sum at the end, since it should not use much global memory anyway (1 unsigned int per monitored branch).
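A minimal sketch of that per-block idea (my own illustration, reusing the Stats struct from the code example below): accumulate into a shared-memory counter and issue a single global atomicAdd per block:
__global__ void test_kernel_blockwise(int* A, int* B, Stats* stats)
{
    __shared__ unsigned int block_even;
    if (threadIdx.x == 0) block_even = 0;
    __syncthreads();

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int res = A[tid] + (int)tid;
    if (res % 2 == 0)
        atomicAdd(&block_even, 1u);                // cheap shared-memory atomic
    B[tid] = res;

    __syncthreads();
    if (threadIdx.x == 0 && block_even > 0)
        atomicAdd(&(stats->even), block_even);     // one global atomic per block
}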
Code example
#include <iostream>
#include <time.h>
#include <cuda.h>
#include <stdio.h>
#define CUDA_CHECK_ERROR() __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)
inline void __cuda_check_errors(const char *filename, const int line_number)
{
cudaError err = cudaDeviceSynchronize();
if(err != cudaSuccess)
{
printf("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString(err));
exit(-1);
}
}
inline void __cuda_safe_call(cudaError err, const char *filename, const int line_number)
{
if (err != cudaSuccess)
{
printf("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString(err));
exit(-1);
}
}
struct Stats
{
unsigned int even;
};
__global__ void test_kernel(int* A, int* B, Stats* stats)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int res = A[tid] + (int)tid;
if (res%2 == 0)
atomicAdd(&(stats->even), 1);
B[tid] = res;
}
int get_random_int(int min, int max)
{
return min + (rand() % (int)(max - min + 1));
}
void print_array(int* ar, unsigned int n)
{
for (unsigned int i = 0; i < n; ++i)
std::cout << ar[i] << " ";
std::cout << std::endl;
}
void print_stats(Stats* s)
{
std::cout << "even: " << s->even << std::endl;
}
int main()
{
// vector size
const unsigned int N = 10;
// device vectors
int *d_A, *d_B;
Stats *d_stats;
// host vectors
int *h_A, *h_B;
Stats *h_stats;
// allocate device memory
CUDA_SAFE_CALL(cudaMalloc(&d_A, N * sizeof(int)));
CUDA_SAFE_CALL(cudaMalloc(&d_B, N * sizeof(int)));
CUDA_SAFE_CALL(cudaMalloc(&d_stats, sizeof(Stats)));
// allocate host memory
h_A = new int[N];
h_B = new int[N];
h_stats = new Stats;
// initialize host data
srand(time(NULL));
for (unsigned int i = 0; i < N; ++i)
{
h_A[i] = get_random_int(0,10);
h_B[i] = 0;
}
memset(h_stats, 0, sizeof(Stats));
// copy data to the device
CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(d_stats, h_stats, sizeof(Stats), cudaMemcpyHostToDevice));
// launch kernel
dim3 grid_size, block_size;
grid_size.x = N;
test_kernel<<<grid_size, block_size>>>(d_A, d_B, d_stats);
// copy result back to host
CUDA_SAFE_CALL(cudaMemcpy(h_B, d_B, N * sizeof(int), cudaMemcpyDeviceToHost));
CUDA_SAFE_CALL(cudaMemcpy(h_stats, d_stats, sizeof(Stats), cudaMemcpyDeviceToHost));
print_array(h_B, N);
print_stats(h_stats);
// free device memory
CUDA_SAFE_CALL(cudaFree(d_A));
CUDA_SAFE_CALL(cudaFree(d_B));
CUDA_SAFE_CALL(cudaFree(d_stats));
// free host memory
delete [] h_A;
delete [] h_B;
delete h_stats;
}
Hardware/software information
The solution I am looking for should work for CC >= 2.0 devices and CUDA >= 5.0.
The atomicAdd is one possibility and I would probably go that route. If you do not use the return value of the atomicAdd function call, the compiler will emit a reduction operation such as RED.E.ADD. Reduction is very fast as long as there are not many conflicts happening (I actually use it sometimes even when I do not need the operation to be atomic, because it can be quicker than loading a value from global memory, doing an arithmetic operation and saving it back to global memory).
The second option you have is to use a profiler counter and use the profiler to analyze the result. Please see the Profiler Counter Function documentation for more details.
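For the second option, the device-side hook is __prof_trigger(int counter), which increments one of the per-multiprocessor hardware counters (indices 0-7) once per warp that executes the call; the profiler then reports the totals. A minimal sketch based on the kernel from the question:
__global__ void test_kernel_prof(int* A, int* B)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int res = A[tid] + (int)tid;
    if (res % 2 == 0)
        __prof_trigger(0);   // bump hardware counter 0 (counted once per warp hitting this line)
    B[tid] = res;
}
The accumulated value can then be inspected with the profiler, e.g. via the prof_trigger_00 event in nvprof (the exact event name depends on the profiler version); keep in mind the per-warp granularity means this is not an exact per-thread count.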