Has cudaMalloc changed to be asynchronous?

I've read in other places that cudaMalloc will synchronize across kernels.
(e.g. will cudaMalloc synchronize host and device?)
However, I just tested the code below, and based on what I'm seeing in the visual profiler, it seems that cudaMalloc is not synchronizing. If you add cudaFree into the loop, that does synchronize. I'm using CUDA 7.5. Does anyone know if cudaMalloc changed its behavior, or am I missing some subtlety? Thanks very much!
__global__ void slowKernel()
{
    float input = 5;
    for (int i = 0; i < 1000000; i++) {
        input = input * 0.9999999;
    }
}

__global__ void fastKernel()
{
    float input = 5;
    for (int i = 0; i < 100000; i++) {
        input = input * 0.9999999;
    }
}

void mallocSynchronize()
{
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    slowKernel<<<1, 1, 0, stream1>>>();
    int *dev_a = 0;
    for (int i = 0; i < 10; i++) {
        cudaMalloc(&dev_a, 4 * 1024 * 1024);
        fastKernel<<<1, 1, 0, stream2>>>();
        // cudaFree( dev_a ); // If you uncomment this, the second fastKernel launch will wait until slowKernel completes
    }
}

Your methodology is flawed, but your conclusion looks correct to me. (If you look at your profile data, you should see that both the long and short kernels take the same, very short, amount of time to run, because aggressive compiler optimisation is eliminating all the code in both cases.)
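(You can confirm the elimination yourself by compiling the kernels to PTX and inspecting the bodies, which come out essentially empty; the file names below are placeholders:)

nvcc -arch=sm_52 -ptx kernels.cu -o kernels.ptx
# slowKernel and fastKernel both reduce to an immediate ret - the loops are gone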
I turned your example into something more reasonable:
#include <time.h>

__global__ void slowKernel(float *output, bool write=false)
{
    float input = 5;
    #pragma unroll
    for (int i = 0; i < 10000000; i++) {
        input = input * 0.9999999;
    }
    if (write) *output -= input;
}

__global__ void fastKernel(float *output, bool write=false)
{
    float input = 5;
    #pragma unroll
    for (int i = 0; i < 100000; i++) {
        input = input * 0.9999999;
    }
    if (write) *output -= input;
}

void burntime(long val)
{
    struct timespec tv[] = {{0, val}};
    nanosleep(tv, 0);
}

void mallocSynchronize()
{
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    const size_t sz = 1 << 21;
    slowKernel<<<1, 1, 0, stream1>>>((float *)(0));
    burntime(500000000L); // 500ms wait - slowKernel around 1300ms
    int *dev_a = 0;
    for (int i = 0; i < 10; i++) {
        cudaMalloc(&dev_a, sz);
        fastKernel<<<1, 1, 0, stream2>>>((float *)(0));
        burntime(1000000L); // 1ms wait - fastKernel around 15ms
    }
}

int main()
{
    mallocSynchronize();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
[Note: this requires POSIX time functions, so it won't run on Windows.]
On a fairly fast Maxwell device (GTX 970), I see that the cudaMalloc calls in the loop overlap with the still-executing slowKernel call in the profile trace, and then with the running fastKernel calls in the other stream. I was willing to accept that minor timing variations could be causing the effect you saw in your broken example, but in this code a 0.5 second shift in synchronisation between the host and device traces seems very improbable. You might need to vary the duration of the burntime calls to get the same effect, depending on how fast your GPU is.
So this is a very long way of saying: yes, it looks like cudaMalloc is a non-synchronising call on Linux with CUDA 7.5 and a Maxwell device. I don't believe that has always been the case, but then again the documentation has never, as best as I can tell, said whether it should block/synchronise or not. I don't have access to older CUDA versions and supported hardware to see what this example would do with an older driver and a Fermi or Kepler device.
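As a quick cross-check that doesn't rely on reading the profiler timeline, you can query the slow stream right after an allocation (a sketch reusing slowKernel and stream1 from above): if cudaMalloc had synchronized the device, slowKernel would already have finished and the query would return cudaSuccess.

// Sketch: probe whether cudaMalloc drained stream1.
slowKernel<<<1, 1, 0, stream1>>>((float *)0);

int *probe = 0;
cudaMalloc(&probe, 1 << 21);              // the call under test

cudaError_t q = cudaStreamQuery(stream1); // cudaErrorNotReady => slowKernel is
                                          // still running, i.e. no device sync
printf("stream1 after cudaMalloc: %s\n",
       q == cudaErrorNotReady ? "still busy (no sync)" : "idle (synced)");
cudaFree(probe);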

Related

Cuda printf() overlapping when using multiple devices

I have a printf in my __global__ code. It works as intended most of the time. However, when using a multi-GPU system (it typically happens when run on a 4-8 GPU system), once in a while the prints will merge. By once in a while, I mean about 100-500 lines out of 167,000.
I was wondering how this situation can be remedied without adding too much overhead from transferring the data back to the host (if possible). I was thinking of trying a mutex lock for printing, but I don't think that sort of thing exists for use in a kernel. Any other solutions I could try?
Note: the actual kernel is a long-running kernel, usually taking around 20-50 minutes to complete depending on the GPU.
Note 2: I barely know what I'm doing with C/C++.
Example of merged Output
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869fxLuZJu3ZkX
qVOuKlQ0ZcMrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
What it should look like
JmHp8rwXAw,031aa97714c800de47971829beded204000cfcf5e0f3775552ccf3e9b387869f
MrhGXAnZ75,08bf3e90a57c31b7f355214cdf442748d9ff6ae1d49a96f7a8b9e3c86bd8e68a
qVOuKlQ0Zc,5231a9e969d53c64f75bb1f07b1c95bb81f685744ed46f56348c733389c56ca5
xLuZJu3ZkX,623f62b3198c8b62cd7a3b3cf8bf8ede5f9bfdccb7c1dc48a55530c7d5f59ce8
My Example Code:
#define BLOCKS 384
#define THREADS 64

typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;

pthread_mutex_t solutionLock;

__global__ void kernel(unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Print to console - randomStr,Hash */
    printf("%s,%s\n", random, data);
}

void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(rngSeed);
        cudaDeviceSynchronize();
    }
    cudaDeviceReset();
    return NULL;
}

int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    pthread_mutex_lock(&solutionLock);
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
I spent 4 days trying different things to no avail. I really don't understand memory management in C/C++ well enough to get past the endless segmentation fault errors.
What I ended up doing was using Unified Memory, as it seemed the easiest way to handle the memory for both device and host, and it doesn't seem to add too much overhead to the whole process. Then each CPU thread (GPU) can write to its own file. I did a couple of nvprof runs, and after the initial cudaMallocManaged setup the remaining overhead seemed to be measured in microseconds. Since each loop takes 20 minutes, these are barely noticeable.
I created two __device__ functions to copy the data over to the host-accessible arrays, because I wanted to utilize the #pragma unroll feature. I'm not really sure if that helps or what it even does, but I decided to do things this way.
If anyone has further suggestions on ways to improve this, I am open to trying more things out.
Here is my new example code:
#define BLOCKS 384
#define THREADS 64

typedef struct HandlerInput {
    unsigned char device;
} HandlerInput;

__device__ void mycpydigest(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 64
    for (BYTE i = 0; i < 64; i++) {
        dst[i] = src[i];
    }
    dst[64] = '\0';
}

__device__ void mycpyrandom(BYTE * __restrict__ dst, const BYTE * __restrict__ src) {
    #pragma unroll 10
    for (BYTE i = 0; i < 10; i++) {
        dst[i] = src[i];
    }
    dst[10] = '\0';
}

__global__ void kernel(BYTE **d_random, BYTE **d_hashes, unsigned long baseSeed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    BYTE random[RANDOM_LEN];
    BYTE data[DIGEST_LEN];
    SHA256_CTX ctx;
    /* Randomization routine */
    d_getRandomString((unsigned long)idx + baseSeed, random);
    /* Hashing routine */
    sha256_hash(&ctx, random, data, RANDOM_LEN);
    /* Send to host - randomStr & Hash */
    mycpydigest(d_hashes[idx], data);
    mycpyrandom(d_random[idx], random);
}

void *launchGPUHandlerThread(void *vargp) {
    HandlerInput *hi = (HandlerInput *)vargp;
    cudaSetDevice(hi->device);
    unsigned long rngSeed = timeus();
    int threadBlocks = BLOCKS * THREADS; // total threads per launch (BLOCKS/THREADS are the macros above)
    BYTE **randoms;
    BYTE **hashes;
    cudaMallocManaged(&randoms, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    cudaMallocManaged(&hashes, sizeof(BYTE *) * threadBlocks, cudaMemAttachGlobal);
    for (int i = 0; i < threadBlocks; i++) {
        cudaMallocManaged(&randoms[i], sizeof(BYTE) * RANDOM_LEN, cudaMemAttachGlobal);
        cudaMallocManaged(&hashes[i], sizeof(BYTE) * DIGEST_LEN, cudaMemAttachGlobal);
    }
    while (1) {
        hostRandomGen(&rngSeed);
        kernel<<<BLOCKS, THREADS>>>(randoms, hashes, rngSeed);
        cudaDeviceSynchronize();
        print2File(randoms, hashes, threadBlocks, hi->device);
    }
    cudaFree(hashes);
    cudaFree(randoms);
    cudaDeviceReset();
    return NULL;
}

int main() {
    int GPUS;
    cudaGetDeviceCount(&GPUS);
    pthread_t *tids = (pthread_t *)malloc(sizeof(pthread_t) * GPUS);
    for (int i = 0; i < GPUS; i++) {
        HandlerInput *hi = (HandlerInput *)malloc(sizeof(HandlerInput));
        hi->device = i;
        pthread_create(tids + i, NULL, launchGPUHandlerThread, hi);
        usleep(23);
    }
    for (int i = 0; i < GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
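One caveat about the cleanup in this code: cudaFree(hashes) and cudaFree(randoms) only release the outer pointer arrays, not the per-thread buffers allocated in the loop. It's unreachable as written because of the while (1), but it matters if an exit condition is ever added. A sketch of a fuller teardown:

// Hypothetical teardown if the while (1) loop ever gains an exit path:
// free the inner managed buffers before the outer pointer arrays.
for (int i = 0; i < threadBlocks; i++) {
    cudaFree(randoms[i]);
    cudaFree(hashes[i]);
}
cudaFree(randoms);
cudaFree(hashes);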
I want to thank @paleonix for the help in the comments. I was working on this issue for a week before I posted, and your comments helped guide me down a different path.

cudaStream strange performance

I am trying to develop a Sobel example with CUDA streams. Here is the program:
void SobelStream(void)
{
    cv::Mat imageGrayL2 = cv::imread("/home/xavier/Bureau/Image1.png", 0);

    u_int8_t *u8_PtImageHost;
    u_int8_t *u8_PtImageDevice;
    u_int8_t *u8_ptDataOutHost;
    u_int8_t *u8_ptDataOutDevice;
    u_int8_t u8_Used[NB_STREAM];

    u8_ptDataOutHost = (u_int8_t *)malloc(WIDTH*HEIGHT*sizeof(u_int8_t));
    checkCudaErrors(cudaMalloc((void**)&u8_ptDataOutDevice, WIDTH*HEIGHT*sizeof(u_int8_t)));

    u8_PtImageHost = (u_int8_t *)malloc(WIDTH*HEIGHT*sizeof(u_int8_t));
    checkCudaErrors(cudaMalloc((void**)&u8_PtImageDevice, WIDTH*HEIGHT*sizeof(u_int8_t)));

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<unsigned char>();
    checkCudaErrors(cudaMallocArray(&Array_PatchsMaxDevice, &channelDesc, WIDTH, HEIGHT));
    checkCudaErrors(cudaBindTextureToArray(Image, Array_PatchsMaxDevice));

    dim3 threads(BLOC_X, BLOC_Y);
    dim3 blocks(ceil((float)WIDTH/BLOC_X), ceil((float)HEIGHT/BLOC_Y));
    ClearKernel<<<blocks, threads>>>(u8_ptDataOutDevice, WIDTH, HEIGHT);

    int blockh = HEIGHT/NB_STREAM;

    Stream = (cudaStream_t *)malloc(NB_STREAM * sizeof(cudaStream_t));
    for (int i = 0; i < NB_STREAM; i++)
    {
        checkCudaErrors(cudaStreamCreate(&(Stream[i])));
    }

    // for (int i = 0; i < NB_STREAM; i++)
    // {
    //     cudaSetDevice(0);
    //     cudaStreamCreate(&Stream[i]);
    // }

    cudaEvent_t Start;
    cudaEvent_t Stop;
    cudaEventCreate(&Start);
    cudaEventCreate(&Stop);

    cudaEventRecord(Start, 0);

    // Stage 1: copy the image to the device, one horizontal band per stream
    for (int i = 0; i < NB_STREAM; i++)
    {
        if (i == 0)
        {
            int localHeight = blockh;
            checkCudaErrors(cudaMemcpy2DToArrayAsync(Array_PatchsMaxDevice,
                                                     0,
                                                     0,
                                                     imageGrayL2.data, //u8_PtImageDevice,
                                                     WIDTH,
                                                     WIDTH,
                                                     blockh,
                                                     cudaMemcpyHostToDevice,
                                                     Stream[i]));

            dim3 threads(BLOC_X, BLOC_Y);
            dim3 blocks(ceil((float)WIDTH/BLOC_X), ceil((float)localHeight/BLOC_Y));
            SobelKernel<<<blocks, threads, 0, Stream[i]>>>(u8_ptDataOutDevice, 0, WIDTH, localHeight-1);
            checkCudaErrors(cudaGetLastError());
            u8_Used[i] = 1;
        }
        else
        {
            int ioffsetImage = WIDTH*(HEIGHT/NB_STREAM);
            int hoffset = HEIGHT/NB_STREAM * i;
            int hoffsetkernel = HEIGHT/NB_STREAM - 1 + HEIGHT/NB_STREAM * (i-1);
            int localHeight = min(HEIGHT - (blockh*i), blockh);
            //printf("hoffset: %d hoffsetkernel %d localHeight %d rest %d ioffsetImage %d \n", hoffset, hoffsetkernel, localHeight, HEIGHT - (blockh+1+blockh*(i-1)), ioffsetImage*i/WIDTH);

            checkCudaErrors(cudaMemcpy2DToArrayAsync(Array_PatchsMaxDevice,
                                                     0,
                                                     hoffset,
                                                     &imageGrayL2.data[ioffsetImage*i], //&u8_PtImageDevice[ioffset*i],
                                                     WIDTH,
                                                     WIDTH,
                                                     localHeight,
                                                     cudaMemcpyHostToDevice,
                                                     Stream[i]));
            u8_Used[i] = 1;
            if (HEIGHT - (blockh+1+blockh*(i-1)) <= 0)
            {
                break;
            }
        }
    }

    // Stage 2: run the Sobel kernel on each band
    for (int i = 0; i < NB_STREAM; i++)
    {
        if (i == 0)
        {
            int localHeight = blockh;
            dim3 threads(BLOC_X, BLOC_Y);
            dim3 blocks(1, 1);
            SobelKernel<<<blocks, threads, 0, Stream[i]>>>(u8_ptDataOutDevice, 0, WIDTH, localHeight-1);
            checkCudaErrors(cudaGetLastError());
            u8_Used[i] = 1;
        }
        else
        {
            int ioffsetImage = WIDTH*(HEIGHT/NB_STREAM);
            int hoffset = HEIGHT/NB_STREAM * i;
            int hoffsetkernel = HEIGHT/NB_STREAM - 1 + HEIGHT/NB_STREAM * (i-1);
            int localHeight = min(HEIGHT - (blockh*i), blockh);
            dim3 threads(BLOC_X, BLOC_Y);
            dim3 blocks(1, 1);
            SobelKernel<<<blocks, threads, 0, Stream[i]>>>(u8_ptDataOutDevice, hoffsetkernel, WIDTH, localHeight);
            checkCudaErrors(cudaGetLastError());
            u8_Used[i] = 1;
            if (HEIGHT - (blockh+1+blockh*(i-1)) <= 0)
            {
                break;
            }
        }
    }

    // Stage 3: copy the results back to the host, one band per stream
    for (int i = 0; i < NB_STREAM; i++)
    {
        if (i == 0)
        {
            int localHeight = blockh;
            checkCudaErrors(cudaMemcpyAsync(u8_ptDataOutHost, u8_ptDataOutDevice, WIDTH*(localHeight-1)*sizeof(u_int8_t), cudaMemcpyDeviceToHost, Stream[i]));
            u8_Used[i] = 1;
        }
        else
        {
            int ioffsetImage = WIDTH*(HEIGHT/NB_STREAM);
            int hoffset = HEIGHT/NB_STREAM * i;
            int hoffsetkernel = HEIGHT/NB_STREAM - 1 + HEIGHT/NB_STREAM * (i-1);
            int localHeight = min(HEIGHT - (blockh*i), blockh);
            checkCudaErrors(cudaMemcpyAsync(&u8_ptDataOutHost[hoffsetkernel*WIDTH], &u8_ptDataOutDevice[hoffsetkernel*WIDTH], WIDTH*localHeight*sizeof(u_int8_t), cudaMemcpyDeviceToHost, Stream[i]));
            u8_Used[i] = 1;
            if (HEIGHT - (blockh+1+blockh*(i-1)) <= 0)
            {
                break;
            }
        }
    }

    for (int i = 0; i < NB_STREAM; i++)
    {
        cudaStreamSynchronize(Stream[i]);
    }
    cudaEventRecord(Stop, 0);

    cudaEventSynchronize(Start);
    cudaEventSynchronize(Stop);

    float dt_ms;
    cudaEventElapsedTime(&dt_ms, Start, Stop);
    printf("dt_ms %f \n", dt_ms);
}
The performance of this program is really strange. I profiled the example, and in the profiler timeline it seems that the streams are each waiting for the others; nothing overlaps.
Can someone help me with that?
First of all, in the future, please provide a complete example. I'm also working off of your cross-posting here to fill in some details, such as kernel sizes.
You have two issues to address:
First, any time you wish to use cudaMemcpyAsync, you will most likely want to be working with pinned host allocations. If you use allocations created e.g. with malloc, you will not get the expected behavior from cudaMemcpyAsync as far as asynchronous concurrent execution is concerned. This necessity is covered in the programming guide:
If host memory is involved in the copy, it must be page-locked.
So the first change to make to your code is to convert this:
u8_PtImageHost = (u_int8_t *)malloc(WIDTH*HEIGHT*sizeof(u_int8_t));
u8_ptDataOutHost = (u_int8_t *)malloc(WIDTH*HEIGHT*sizeof(u_int8_t));
to this:
checkCudaErrors(cudaHostAlloc(&u8_PtImageHost, WIDTH*HEIGHT*sizeof(u_int8_t), cudaHostAllocDefault));
checkCudaErrors(cudaHostAlloc(&u8_ptDataOutHost, WIDTH*HEIGHT*sizeof(u_int8_t), cudaHostAllocDefault));
With that change alone, your execution duration drops from about 21 ms to 7 ms according to my testing. The reason is that without the change, the profiler timeline shows no overlap whatsoever.
With the change, the H->D and D->H copy activity can overlap with each other and with kernel execution.
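As an aside, if you cannot easily change the allocation site, pinning the existing malloc'd buffers in place should also work; a sketch using cudaHostRegister (error checking as in your code):

// Sketch: pin the already-malloc'd host buffers instead of replacing
// malloc with cudaHostAlloc. cudaMemcpyAsync can then treat them as
// page-locked memory.
checkCudaErrors(cudaHostRegister(u8_PtImageHost, WIDTH*HEIGHT*sizeof(u_int8_t), cudaHostRegisterDefault));
checkCudaErrors(cudaHostRegister(u8_ptDataOutHost, WIDTH*HEIGHT*sizeof(u_int8_t), cudaHostRegisterDefault));
// ... streamed copies and kernels as before ...
checkCudaErrors(cudaHostUnregister(u8_PtImageHost));
checkCudaErrors(cudaHostUnregister(u8_ptDataOutHost));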
The second issue you face to get to concurrent kernel execution is that your kernels are just too large (too many blocks/threads):
#define WIDTH 6400
#define HEIGHT 4800
#define NB_STREAM 10
#define BLOC_X 32
#define BLOC_Y 32
dim3 threads(BLOC_X,BLOC_Y);
dim3 blocks(ceil((float)WIDTH/BLOC_X),ceil((float)HEIGHT/BLOC_Y));
I would suggest that if these are the sizes of kernels you need to run, there's probably not much benefit in striving for kernel overlap - each kernel launches enough blocks to "fill" the GPU, so you have already exposed enough parallelism to keep the GPU busy. But if you are desperate to witness kernel concurrency, you could make your kernels use a smaller number of blocks while causing each kernel to spend more time executing. We could do this by launching just 1 block and having the threads in that block perform the whole of each band's filtering, as sketched below.
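A sketch of that idea (the strided kernel below is not in the original code; it assumes SobelKernel's per-pixel arithmetic can be factored out this way):

// Sketch: one block per launch, with each thread striding over many
// pixels of its band. Small launches like this can be co-resident on
// the GPU, so kernels in different streams can actually overlap.
__global__ void SobelKernelStrided(u_int8_t *out, int yoff, int width, int height)
{
    for (int y = threadIdx.y; y < height; y += blockDim.y)
        for (int x = threadIdx.x; x < width; x += blockDim.x)
        {
            // ... same per-pixel Sobel arithmetic as in SobelKernel,
            //     operating on row (yoff + y) ...
        }
}

// launched per stream as:
//   SobelKernelStrided<<<1, threads, 0, Stream[i]>>>(u8_ptDataOutDevice, hoffsetkernel, WIDTH, localHeight);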

Does calling a CUDA kernel multiple times affect execution speed?

I am trying to measure the performance difference of a GPU between allocating memory using 'malloc' in a kernel function vs. using pre-allocated storage from 'cudaMalloc' on the host. To do this, I have two kernel functions, one that uses malloc, one that uses a pre-allocated array, and I time the execution of each function repeatedly.
The problem is that the first execution of each kernel function takes between 400 - 2500 microseconds, but all subsequent runs take about 15 - 30 microseconds.
Is this behavior expected, or am I witnessing some sort of carryover effect from previous runs? If this is carryover, what can I do to prevent it?
I have tried putting in a kernel function that zeros out all memory on the GPU between each timed test run to eliminate that carryover, but nothing changed. I have also tried reversing the order in which I run the tests, and that has no effect on relative or absolute execution times.
const int TEST_SIZE = 1000;

struct node {
    node* next;
    int data;
};

int main() {
    int numTests = 5;
    for (int i = 0; i < numTests; ++i) {
        memClear();
        staticTest();
        memClear();
        dynamicTest();
    }
    return 0;
}

__global__ void staticMalloc(int* sum) {
    // start a linked list
    node head[TEST_SIZE];
    // initialize nodes
    for (int j = 0; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        head[j].next = NULL;
        head[j].data = j;
    }
    // verify creation by adding up values
    int total = 0;
    for (int j = 0; j < TEST_SIZE; j++) {
        total += head[j].data;
    }
    sum[0] = total;
}

/**
 * This is a test that will time execution of static allocation
 */
int staticTest() {
    int expectedValue = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedValue += i;
    }
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check static allocation time
    runTime = 0;
    runTimer.start();
    staticMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // allocated with new[], so delete[]
    return 0;
}

__global__ void dynamicMalloc(int* sum) {
    // start a linked list
    node* headPtr = (node*) malloc(sizeof(node));
    headPtr->data = 0;
    headPtr->next = NULL;
    node* curPtr = headPtr;
    // add nodes to test in-kernel malloc
    for (int j = 1; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        node* nodePtr = (node*) malloc(sizeof(node));
        nodePtr->data = j;
        nodePtr->next = NULL;
        // add it to the linked list
        curPtr->next = nodePtr;
        curPtr = nodePtr;
    }
    // verify creation by adding up values
    curPtr = headPtr;
    int total = 0;
    while (curPtr != NULL) {
        // add and increment current value
        total += curPtr->data;
        curPtr = curPtr->next;
        // clean up memory
        free(headPtr);
        headPtr = curPtr;
    }
    sum[0] = total;
}

/**
 * Host function that prepares data array and passes it to the CUDA kernel.
 */
int dynamicTest() {
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check dynamic allocation time
    runTime = 0;
    runTimer.start();
    dynamicMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // allocated with new[], so delete[]
    return 0;
}

__global__ void clearMemory(char *zeros) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    zeros[i] = 0;
}

void memClear() {
    char *zeros[1024]; // device pointers
    for (int i = 0; i < 1024; ++i) {
        cudaMalloc((void**) &(zeros[i]), 4 * 1024 * 1024);
        // 4M threads, one per byte of the 4 MB buffer. Note 1024 is the
        // per-block thread limit; the original <<<1024, 4*1024>>> exceeds
        // it and would fail to launch.
        clearMemory<<<4 * 1024, 1024>>>(zeros[i]);
    }
    for (int i = 0; i < 1024; ++i) {
        cudaFree(zeros[i]);
    }
}
The first execution of a kernel takes more time because a lot of stuff has to be loaded onto the GPU (kernel code, libraries, etc.). To prove it, you can just measure how long it takes to launch an empty kernel, and you will see that it takes some time. Try something like:
time -> start
launch emptykernel
time -> end
firstTiming = end - start
time -> start
launch empty kernel
time -> end
secondTiming = end - start
You will see that secondTiming is significantly smaller than firstTiming.
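Concretely, that experiment could look like this (a sketch using std::chrono; any host clock works):

#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

int main()
{
    // First launch pays the one-time CUDA initialization cost.
    auto t0 = std::chrono::high_resolution_clock::now();
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    // Second launch measures just the launch + sync overhead.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();

    using us = std::chrono::microseconds;
    printf("firstTiming:  %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    printf("secondTiming: %lld us\n", (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    return 0;
}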
The first CUDA (kernel) call initializes the CUDA system transparently. You can avoid this by calling an empty kernel first. Note that this initialization is required in e.g. OpenCL too, but there you have to do all that init-stuff manually. CUDA does it for you in the background.
Then there are some problems with your timing: CUDA kernel calls are asynchronous. So (assuming your Timer class is a host timer like time()) you are currently measuring the kernel launch time (plus, for the first call, the init time of CUDA), not the kernel execution time.
At the very least you HAVE to do a cudaDeviceSynchronize() before starting AND before stopping the timer.
You are better off using CUDA events, which can exactly measure the kernel execution time and only that. Using host timers you still include the launch overhead. See https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
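For reference, a minimal event-based timing sketch (reusing d_sum, gridsize, and blocksize from the staticTest above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
staticMalloc<<<gridsize, blocksize>>>(d_sum); // kernel under test
cudaEventRecord(stop);
cudaEventSynchronize(stop);                   // wait until the stop event is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       // GPU-measured time in milliseconds
printf("kernel execution: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);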

Cuda kernel producing the resultant vector as zero

Here is the kernel that I am launching to calculate some array in parallel.
__device__ bool mult(int colsize, int rowsize, int *Aj, int *Bi, int *val)
{
    for (int j = 0; j < rowsize; j++)
    {
        for (int k = 0; k < colsize; k++)
        {
            if (Aj[j] == Bi[k])
            {
                return true;
            }
        }
    }
    return false;
}

__global__ void kernel(int *Aptr, int *Aj, int *Bptr, int *Bi, int rows, int cols, int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    if (tid < cols)
    {
        int beg = Bptr[tid];
        int end = Bptr[tid+1];
        for (i = 0; i < rows; i++)
        {
            int cbeg = Aptr[i];
            int cend = Aptr[i+1];
            if (mult(end - beg, cend - cbeg, Aj+cbeg, Bi+beg))
            {
                Cjc[tid+1] += 1;
                //atomicAdd(Cjc+tid+1, 1);
            }
        }
    }
}
And here is how I decide the grid and block configuration:

int numBlocks, numThreads;
if (q % 32 == 0)
{
    numBlocks = q/32;
    numThreads = 32;
}
else
{
    numBlocks = (q+31)/32;
    numThreads = 32;
}
findkernel<<<numBlocks, numThreads>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);
I am using a GTX 480 with compute capability 2.0.
Now the problem I am facing is that whenever q increases beyond 4096, the values in the Cjc array all come out as 0.
I know the maximum number of blocks I can use in the X direction is 65535, and each block can have at most (1024, 1024, 64) threads. So why does this kernel compute the wrong output for the Cjc array?
It seems like there are a couple of things wrong with the code you posted:
1. I guess findkernel is kernel in the CUDA code above?
2. kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
3. In kernel, you test if(tid < cols) - I guess this should be if(tid < count)?
4. Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
5. Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
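Whichever of these it turns out to be, it helps to wrap every launch in explicit error checks so the failure is reported where it happens; a minimal sketch of the usual pattern:

// Minimal error-checking sketch around the launch shown above.
findkernel<<<numBlocks, numThreads>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);
cudaError_t err = cudaGetLastError();   // catches bad launch configurations
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // catches errors raised during execution
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));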
OK, so I finally figured out, using cudaError_t, that when I tried to cudaMemcpy the d_Cjc array from device to host, it threw the following error:
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel take a reasonably large amount of time, which causes the display driver to terminate the program because of the OS's 'watchdog' time limit.
I believe I will have to shut down the X server, or run on the GPU machine over ssh (from another machine) with its display removed; that way the watchdog limit no longer applies and the calculations have time to finish.

CUDA: Fermi (Tesla M2090) generating CUDA_EXCEPTION_10 without reason

I have a small piece of code which runs perfectly on Nvidia's old architecture (a Tesla T10 processor) but not on Fermi (a Tesla M2090).
I learned that Fermi behaves slightly differently: unsafe code that happens to work correctly on the old architectures gets caught as a bug on Fermi.
But I don't know how to resolve it.
Here is my code:
__global__ void exec(int *arr_ptr, int size, int *result) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    *result = arr_ptr[-2];
}

void run(int *arr_dev, int size, int *result) {
    cudaStream_t stream = 0;
    int *arr_ptr = arr_dev + 5;
    dim3 threads(1, 1, 1);
    dim3 grid(1, 1);
    exec<<<grid, threads, 0, stream>>>(arr_ptr, size, result);
}
Since I am accessing arr_ptr[-2], Fermi throws CUDA_EXCEPTION_10, Device Illegal Address. But the address should be legal: arr_ptr points 5 elements into arr_dev, so arr_ptr[-2] is element 3 of the array.
Can anyone help me with this?
My driver code is:

int main() {
    int *arr;
    int *arr_dev = NULL;
    int result = 1;
    arr = (int*)malloc(10*sizeof(int));
    for (int i = 0; i < 10; i++)
        arr[i] = i;
    if (arr_dev == NULL)
    {
        cudaMalloc((void**)&arr_dev, 10);
        cudaMemcpy(arr_dev, arr, 10*sizeof(int), cudaMemcpyHostToDevice);
    }
    run(arr_dev, 10, &result);
    printf("%d \n", result);
    return 0;
}
Fermi cards have much better memory protection on the device and will detect out-of-bounds conditions which will appear to "work" on older cards. Use cuda-memcheck (or the cuda-memcheck mode in cuda-gdb) to get a better handle on what is going wrong.
EDIT:
This is the culprit:
cudaMalloc((void**)&arr_dev, 10);
which should be
cudaMalloc((void**)&arr_dev, 10*sizeof(int));
This will result in this code
int *arr_ptr = arr_dev + 5;
passing a pointer to the device which is out of bounds.
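For completeness, a sketch of a corrected driver. Besides the allocation size, note that the original passes &result, a host address, as the kernel's result pointer; writing through it on the device is itself an illegal access, so this sketch gives the kernel a device buffer and copies the value back:

int main() {
    int arr[10];
    int result = 1;
    for (int i = 0; i < 10; i++)
        arr[i] = i;

    int *arr_dev = NULL;
    int *result_dev = NULL;
    cudaMalloc((void**)&arr_dev, 10 * sizeof(int)); // the size fix from above
    cudaMalloc((void**)&result_dev, sizeof(int));   // device-side result slot
    cudaMemcpy(arr_dev, arr, 10 * sizeof(int), cudaMemcpyHostToDevice);

    run(arr_dev, 10, result_dev);                   // kernel writes *result_dev
    cudaMemcpy(&result, result_dev, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d \n", result);
    return 0;
}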