cudaDeviceReset causes memory leak? - cuda

I tried the following code with CUDA 7.0.
If I set n_repeat to 1 and remove the last cudaDeviceReset, the code runs fine.
If I set n_repeat to 1 and keep the cudaDeviceReset, I can run the code segment, but my memory leak detector reports a leak after the program finishes.
If I set n_repeat to 2 and keep the cudaDeviceReset, I get an error the second time I reach cublasCreate. The error code is CUBLAS_STATUS_NOT_INITIALIZED.
Can someone tell me what the problem is here, and whether cudaDeviceReset is intended for cleaning up between different uses of the GPU, like what I'm trying to do here?
int device_id_ = 0;
cublasHandle_t blas_;
curandGenerator_t rand_gen_;
long alloc_size = 1000;
char* raw_;
int n_repeat = 2;
for (int i = 0; i < n_repeat; ++i) {
    CHECK_CUDA(cudaSetDevice(device_id_));
    CHECK_CUDA(cublasCreate(&blas_));
    CHECK_CUDA(curandCreateGenerator(&rand_gen_, CURAND_RNG_PSEUDO_DEFAULT));
    CHECK_CUDA(cudaMalloc((void **)&raw_, alloc_size));
    CHECK_CUDA(curandDestroyGenerator(rand_gen_));
    CHECK_CUDA(cublasDestroy(blas_));
    CHECK_CUDA(cudaFree(raw_));
    CHECK_CUDA(cudaDeviceReset());
}

I had the same problem, even with the example from Robert Crovella, on CUDA 7, Ubuntu 14.04, with a K40c.
Adding cudaDeviceSynchronize() after cudaSetDevice and before cublasCreate() made it work for me.
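For reference, a minimal sketch of where that call goes, reusing the loop and the CHECK_CUDA macro from the question (the macro's definition is not shown in the question, so this is only illustrative):
for (int i = 0; i < n_repeat; ++i) {
    CHECK_CUDA(cudaSetDevice(device_id_));
    // Wait until the (re)created context is fully initialized before
    // creating the cuBLAS handle on it.
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_CUDA(cublasCreate(&blas_));
    // ... rest of the loop body unchanged ...
    CHECK_CUDA(cudaDeviceReset());
}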

Related

Free memory occupied by cudaMemGetInfo

I have the following simple code to find available GPUs
int * getFreeGpuList(int *numFree) {
    int * gpuList;
    int nDevices;
    int i, j = 0, count = 0;
    cudaGetDeviceCount(&nDevices);
    gpuList = (int *) malloc(nDevices * sizeof(int));
    for (i = 0; i < nDevices; ++i) {
        cudaSetDevice(i);
        size_t freeMem;
        size_t totalMem;
        cudaMemGetInfo(&freeMem, &totalMem);
        if (freeMem > .9 * totalMem) {
            gpuList[j] = i;
            count++;
            j++;
        }
    }
    *numFree = count;
    return gpuList;
}
The problem is that cudaMemGetInfo occupies some memory (~150MB in my case) in each GPU. This code is a part of a bigger program that runs for a long time, and I often run several processes at the same time, so in the end the memory occupied by this function is significant. Could you please let me know how I can free the GPU memory occupied by cudaMemGetInfo? Thanks!
Based on the insight from talonmies above that cudaSetDevice creates a context and occupies some memory on the device, I found out that cudaDeviceReset "explicitly destroys and cleans up all resources associated with the current device in the current process" without affecting other processes on the same device.
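For illustration, a sketch of how the query loop in getFreeGpuList can be changed so that the context created on each device is destroyed again once the query is done (variable names taken from the code above):
for (i = 0; i < nDevices; ++i) {
    cudaSetDevice(i);                  // the first runtime call creates a context on device i
    size_t freeMem;
    size_t totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (freeMem > .9 * totalMem) {
        gpuList[j] = i;
        count++;
        j++;
    }
    cudaDeviceReset();                 // destroys that context; other processes are unaffected
}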
Update Nov 26: If one only wants to query GPU info, it's better to use the NVML library. In my experience it is much faster and does not take up device memory for simple memory and name queries.
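For reference, a minimal sketch of such an NVML-based query (the name getFreeGpuListNvml is made up here; you need to include nvml.h and link against the NVML library, error checking is omitted, and NVML's device ordering may differ from the CUDA runtime's):
#include <nvml.h>
#include <stdlib.h>

// Queries free memory through NVML; this creates no CUDA context,
// so no device memory is consumed by the query itself.
int * getFreeGpuListNvml(int *numFree) {
    unsigned int nDevices = 0, i;
    int *gpuList, count = 0;
    nvmlInit();
    nvmlDeviceGetCount(&nDevices);
    gpuList = (int *) malloc(nDevices * sizeof(int));
    for (i = 0; i < nDevices; ++i) {
        nvmlDevice_t dev;
        nvmlMemory_t mem;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetMemoryInfo(dev, &mem);   // mem.free and mem.total are in bytes
        if (mem.free > 0.9 * mem.total)
            gpuList[count++] = (int) i;
    }
    nvmlShutdown();
    *numFree = count;
    return gpuList;
}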

Different running time for cublasSetMatrix on similar matrices

In the following code I'm using the function cublasSetMatrix on 3 random matrices of size 200x200. I measured the time of each of these calls in the code:
clock_t t1, t2, t3, t4;
int m = 200, n = 200;
float * bold1 = new float[m*n];
float * bold2 = new float[m*n];
float * bold3 = new float[m*n];
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
    {
        bold1[i*n+j] = rand() % 10;
        bold2[i*n+j] = rand() % 10;
        bold3[i*n+j] = rand() % 10;
    }
float * dev_bold1, * dev_bold2, * dev_bold3;
cudaMalloc((void**)&dev_bold1, sizeof(float)*m*n);
cudaMalloc((void**)&dev_bold2, sizeof(float)*m*n);
cudaMalloc((void**)&dev_bold3, sizeof(float)*m*n);
t1 = clock();
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);
t2 = clock();
cublasSetMatrix(m, n, sizeof(float), bold2, m, dev_bold2, m);
t3 = clock();
cublasSetMatrix(m, n, sizeof(float), bold3, m, dev_bold3, m);
t4 = clock();
cout << double(t2-t1)/CLOCKS_PER_SEC << " - " << double(t3-t2)/CLOCKS_PER_SEC << " - " << double(t4-t3)/CLOCKS_PER_SEC;
delete [] bold1;
delete [] bold2;
delete [] bold3;
cudaFree(dev_bold1);
cudaFree(dev_bold2);
cudaFree(dev_bold3);
The output of this code is something like this:
0.121849 - 0.000131 - 0.000141
Every time I run the code, the time for cublasSetMatrix on the first matrix is larger than for the other two, although all matrices have the same size and are filled with random numbers.
Can anyone help me find the reason for this result?
Usually the first CUDA API call in any CUDA program will incur some start-up overhead - the CUDA runtime requires time to initialize everything.
Whenever CUDA libraries are used, there will be some additional one-time start up overhead associated with initialization of the library. This overhead will often be observed to impact the timing of the first library call.
That seems to be what is happening here. If you place another cuBLAS API call before the first one you are measuring, the start-up overhead is paid by that earlier call, and you no longer measure it on the first cublasSetMatrix() call.
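One way to confirm this is to move the start-up cost out of the timed region with an untimed warm-up call, for example (a sketch based on the question's code; any earlier cuBLAS or CUDA runtime call would do):
// Untimed warm-up: pay the one-time CUDA context / cuBLAS initialization cost here.
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);

t1 = clock();
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);   // now measures only the transfer
t2 = clock();
// ... the remaining timed transfers as before ...
cublasDestroy(handle);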

Simplest way to clear CUDA shared memory between kernel runs

I am trying to implement a box filter in C/CUDA, starting with a matrix-averaging problem first. When I run the following code with the lines inside the for loops uncommented, I get a certain output. But when I comment those lines out, I get exactly the same output again!
if (tx == 0)
    for (int i = 1; i <= radius; i++)
    {
        //sharedTile[radius+ty][radius-i] = 6666.0;
    }
if (tx == (Dx-1))
    for (int i = 0; i < radius; i++)
    {
        //sharedTile[radius+ty][radius+Dx+i] = 7777;
    }
if (ty == 0)
    for (int i = 1; i <= radius; i++)
    {
        //sharedTile[radius-i][radius+tx] = 8888;
    }
if (ty == (Dy-1))
    for (int i = 0; i < radius; i++)
    {
        //sharedTile[radius+Dy+i][radius+tx] = 9999;
    }
if ((tx == 0) && (ty == 0))
    for (int i = globalRow, l = 0; i < HostPaddedRow, l < radius; i++, l++)
    {
        for (int j = globalCol, m = 0; j < HostPaddedCol, m < radius; j++, m++)
        {
            //sharedTile[l][m] = 8866;
        }
    }
if ((tx == (Dx-1)) && (ty == (Dx-1)))
    for (int i = (HostPaddedRow+1), l = (radius+Dx); i < (HostPaddedRow+1+radius), l < (TILE+2*radius); i++, l++)
    {
        for (int j = HostPaddedCol, m = (radius+Dx); j < (HostPaddedCol+radius), m < (TILE+2*radius); j++, m++)
        {
            //sharedTile[l][m] = 7799.0;
        }
    }
if ((tx == (Dx-1)) && (ty == 0))
    for (int i = (globalRow), l = 0; i < HostPaddedRow, l < radius; i++, l++)
    {
        for (int j = (HostPaddedCol+1), m = (radius+Dx); j < (HostPaddedCol+1+radius), m < (TILE+2*radius); j++, m++)
        {
            //sharedTile[l][m] = 9966;
        }
    }
if ((tx == 0) && (ty == (Dy-1)))
    for (int i = (HostPaddedRow+1), l = (radius+Dy); i < (HostPaddedRow+1+radius), l < (TILE+2*radius); i++, l++)
    {
        for (int j = globalCol, m = 0; j < HostPaddedCol, m < radius; j++, m++)
        {
            //sharedTile[l][m] = 0.0;
        }
    }
__syncthreads();
You can ignore the for-loop conditions; they are irrelevant here.
My basic question is: why am I getting the same values even after commenting out those lines? I tried making some modifications in my main program and kernel as well, even introduced deliberate errors and removed them again, recompiled, and executed the same code, but I still get the same values. Is there any way to clear cache memory in CUDA?
I am using Nsight + RedHat + CUDA 5.5.
Thanks in advance.
why am I getting the same values even after commenting those lines?
It seems that sharedTile is pointing to the same piece of memory between multiple consecutive runs, which is completely normal. So the commented-out code does not "generate" anything; your pointer is simply pointing at the same memory, which was never flushed.
Is there any way to clear cache memory in CUDA
I believe you are talking about clearing shared memory? If so, you can use an analogue of the approach described here. Instead of using cudaMemset in host code, you zero out your shared memory from inside the kernel. The simplest approach is to place the following code at the beginning of the kernel that declares sharedTile (this is for one-dimensional thread blocks and a one-dimensional shared memory array):
__global__ void your_kernel(int count) {
    // Dynamically sized shared memory; its size is set at kernel launch.
    extern __shared__ float sharedTile[];
    // Every thread zeroes a strided portion of the tile before any use.
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedTile[i] = 0.0f;
    __syncthreads();
    // your code here
}
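Since sharedTile is declared extern here, its size has to be supplied as the third launch-configuration parameter; a hypothetical launch would look like:
// count floats of dynamic shared memory, zeroed at the top of the kernel
your_kernel<<<numBlocks, threadsPerBlock, count * sizeof(float)>>>(count);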
The following approaches do not guarantee cleared shared memory, as Robert Crovella pointed out in the comment below:
Or possibly call nvidia-smi with the --gpu-reset parameter.
Yet another solution was offered in another SO thread, which involves unloading and reloading the driver.

Kernel Launch Failure

I'm working on a Linux system with a Tesla C2075. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step-by-step averaged version (time_avg) of a large data set (result). See the code below.
"result" and "time_avg" have the same size, equal to "nsamps". "time_avg" contains successively averaged sets of the array result: the first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of every 8 samples, and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
    __shared__ float temp[1024];
    int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
    int start = 0, index;
    unsigned int npts = *nsamps;
    printf("here here\n");
    // Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
    if ((2*gtid) < npts) {
        temp[2*ltid] = result[2*gtid];
        temp[2*ltid+1] = result[2*gtid + 1];
    }
    for (stride = 1; stride < blockDim.x; stride >>= 1) {
        __syncthreads();
        if (ltid % (stride*2) == 0) {
            if ((2*gtid) < npts) {
                temp[2*ltid] += temp[2*ltid + stride];
                index = (int)(start + gtid/stride);
                time_avg[index] = (float)(temp[2*ltid]/(2.0*stride));
            }
        }
        start += npts/(2*stride);
    }
    __syncthreads();
    if (ltid == 0)
    {
        atomicAdd(mean, temp[0]);
    }
    __syncthreads();
    printf("%f\n", *mean);
}
The launch configuration is 40 blocks of 512 threads. The data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check the value of mean) after the kernel call, the program hangs completely. If I remove it, the program runs and exits. In neither case do I get the "here here" output or the mean value printed. I understand that unless the kernel executes successfully, the printfs won't print.
Does this have to do with __syncthreads() in a recursion? All threads go to the same depth, so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous: if the kernel starts successfully, your host code continues to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization, then your kernel probably didn't finish running - it is either stuck in an infinite loop or waiting on a __syncthreads() or some other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left, not right: stride<<=1.
You mentioned recursion, but your code contains only one __global__ function and there are no recursive calls.
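For completeness, a typical checking pattern around a kernel launch looks roughly like this (a sketch with made-up argument names, not the question's exact code; the cudaDeviceSynchronize call will of course still hang for as long as the kernel is stuck in the infinite loop described above):
timeavg_mean<<<40, 512>>>(d_result, d_nsamps, d_time_avg, d_mean);
cudaError_t err = cudaGetLastError();      // reports launch/configuration errors only
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();             // reports errors that occur while the kernel runs
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));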
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

Get rid of busy waiting during asynchronous cuda stream executions

I'm looking for a way to get rid of busy waiting in the host thread in the following code (do not copy this code, it only shows the idea of my problem and it has many basic bugs):
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    while (true) {
        if (cudaStreamQuery(streams[sid]) == cudaSuccess) { // BUSY WAITING !!!!
            cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
            kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
            break;
        }
        sid = (sid + 1) % S_N;
    }
}
Is there a way to idle the host thread and somehow wait for some stream to finish, and then prepare and run another stream?
EDIT: I added while(true) to the code to emphasize the busy waiting. Now I execute all the streams and check which of them has finished in order to run a new one. cudaStreamSynchronize waits for a particular stream to finish, but I want to wait for whichever of the streams finishes its job first.
EDIT2: I got rid of the busy-waiting in the following way:
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
But it appears to be a little slower than the version with busy-waiting on the host thread. I think that is because I now distribute the jobs statically across the streams, so when one stream finishes its work it sits idle until every stream has finished. The previous version dynamically dispatched the work to the first idle stream, so it was more efficient, but it busy-waited on the host thread.
The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.
However, you need to understand how streams and synchronization work before you will be able to use them in your code.
What happens if you do not use streams at all? Consider the following code:
kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();
The kernel is launched, and the host moves on to execute host_func1, so the host function and the kernel run concurrently. Then the host and the device are synchronized, i.e. the host waits for the kernel to finish before moving on to host_func2().
Now, what if you have two different kernels?
kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);
kernel1 is launched asynchronously! The host moves on, and kernel2 is launched before kernel1 finishes! However, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:
kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);
There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.
So, I think the functionality you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host moves on). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
}
Now, on to streams. You can use streams to manage concurrent device execution.
Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, launches in stream 1 and launches in stream 2 are asynchronous with respect to each other! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.
That said, keep in mind that anything put into stream 1 is executed in order, so don't bother synchronizing within it. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...
cudaStream_t streams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(&streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
No explicit device synchronization necessary.
My idea for solving this problem is to have one host thread per stream. Each host thread would invoke cudaStreamSynchronize to wait until its stream's commands are completed.
Unfortunately this is not possible in CUDA 3.2, since it allows only one host thread to deal with one CUDA context; that means one host thread per CUDA-enabled GPU.
Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news
EDIT: I have tested this in the CUDA 4.0 RC, using OpenMP. I created one host thread per CUDA stream, and it started to work.
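For illustration, a rough sketch of that one-host-thread-per-stream idea with OpenMP on CUDA 4.0 or later (d_data is assumed here to be an array of per-stream device buffers; omp.h is required and error checking is omitted):
// Each OpenMP thread owns one stream and blocks only on its own work,
// so there is no busy waiting in the main host thread.
#pragma omp parallel num_threads(S_N)
{
    int sid = omp_get_thread_num();
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int d = sid * DATA_STEP; d < DATA_SIZE; d += S_N * DATA_STEP) {
        cudaMemcpyAsync(d_data[sid], h_data + d, DATA_STEP, cudaMemcpyHostToDevice, stream);
        kernel<<<gridDim, blockDim, smSize, stream>>>(d_data[sid], DATA_STEP);
        cudaStreamSynchronize(stream);   // blocks this host thread only
    }
    cudaStreamDestroy(stream);
}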
There are cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.
Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?
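For example, a sketch of an event-based variant of the question's round-robin loop (names reused from the question; events created with cudaEventBlockingSync make the waiting host thread sleep instead of spin, and cudaEventSynchronize on an event that has never been recorded returns immediately, so the first pass over each stream does not wait):
cudaEvent_t done[S_N];
for (int i = 0; i < S_N; i++)
    cudaEventCreateWithFlags(&done[i], cudaEventBlockingSync);

int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaEventSynchronize(done[sid]);   // wait until stream sid's previous batch has finished
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    cudaEventRecord(done[sid], streams[sid]);
    sid = (sid + 1) % S_N;
}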
Instead of cudaStreamQuery, you want cudaStreamSynchronize:
int sid = 0;
for (int d = 0; d < DATA_SIZE; d += DATA_STEP) {
    cudaStreamSynchronize(streams[sid]);
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize, streams[sid]>>>(d_data, DATA_STEP);
    sid = (sid + 1) % S_N;
}
(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)
You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.
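For example, requesting blocking (yielding) synchronization looks roughly like this; note that the flag is named cudaDeviceBlockingSync in older toolkits and cudaDeviceScheduleBlockingSync in newer ones:
// Must be called before the CUDA context for this device is created.
cudaSetDeviceFlags(cudaDeviceBlockingSync);
// Subsequent synchronization calls (cudaStreamSynchronize, cudaThreadSynchronize)
// then put the host thread to sleep instead of spin-waiting.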
You need to issue the data-chunk copies and the kernel launches that operate on those chunks in separate for loops. That will be more efficient.
Like this:
size = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
    offset = i * N / nStreams;
    cudaMemcpyAsync(a_d + offset, a_h + offset, size, cudaMemcpyHostToDevice, stream[i]);
}
for (i = 0; i < nStreams; i++) {
    offset = i * N / nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d + offset);
}
This way, a memory copy doesn't have to wait for the kernel execution of the previous stream, and vice versa. (Note that for the copies to actually overlap with kernel execution, the host buffers also need to be page-locked, e.g. allocated with cudaMallocHost.)