Illegal write of size 4 in CUDA - cuda

I wrote a code which is facing kernel launch failure due to Device Illegal Address when I run it using cuda-gdb for a particular input. I ran it using cuda-memcheck and got Invalid write of size 4 error.The code is too big so I will explain the scenario here.
I have a main kernel to which I am passing an array pointer which serves as a stack.
I have a device function which is called from the main kernel and uses the stack.
__device__ void find(int v , int* p, int* pv,int n, int* d_stackContents)
{
int d_stackTop;
d_stackTop = -1;
*pv = p[v];
if(*pv == -1){
*pv = v;
}
else{
cuPrintf("Stack top is %d\n",d_stackTop);
d_stackTop = d_stackTop + 1;
d_stackContents[d_stackTop] = v;
cuPrintf("Stack top is %d\n",d_stackTop);
while(*pv != -1){
d_stackTop = d_stackTop + 1;
d_stackContents[d_stackTop] = *pv;
cuPrintf("Stack top is %d\n",d_stackTop);
*pv = p[*pv];
}
}
The error is occurring at d_stackContents[d_stackTop] = *pv;
I am calling the device function in the main kernel as follows:
find(v[idx], p,&pv,n, d_stackContents);
where idx = threadIdx.x + blockDim.x * blockIdx.x and I have declared pv as int pv;
Also, the d_stackContents array is allocated in main using cudaMalloc and passed as an argument to the main kernel

This won't work unless you call your kernel with a single thread in a single block. Otherwise all threads scribble over each other's stack. If you then dereference a pointer that was stored on the corrupted stack, it would immediately explain why your code tries to access an illegal address.
You need to use separate stacks for each thread, or a single stack with a stack pointer in global memory that is manipulated only via atomic operations.

Related

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas; // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;
static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
const int i = blockIdx.x*blockDim.x + threadIdx.x;
// Do stuff with the line indicated by index i
// ...
// Save result into array, gets reduced to single number in setDist
phs_deltas[i] = phs_delta;
}
static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
// Final step; does it need to be it's own kernel if it only runs once??
d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);
// Save some other stuff to refer to next frame
// ...
}
void fftExec(unsigned __int32 *host_data)
{
// Copy image to device, do FFT, etc
// ...
// Last parallel analysis step, sets d_phs_deltas
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
// Should this be a serial part at the end of getDists somehow?
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}
// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;
__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
__shared__ bool isLastBlockDone;
float partialSum = calculatePartialSum(array, N);
if (threadIdx.x == 0) {
result[blockIdx.x] = partialSum;
// Thread 0 makes sure that the incrementation
// of the "count" variable is only performed after
// the partial sum has been written to global memory.
__threadfence();
// Thread 0 signals that it is done.
unsigned int value = atomicInc(&count, gridDim.x);
// Thread 0 determines if its block is the last
// block to be done.
isLastBlockDone = (value == (gridDim.x - 1));
}
// Synchronize to make sure that each thread reads
// the correct value of isLastBlockDone.
__syncthreads();
if (isLastBlockDone) {
// The last block sums the partial sums
// stored in result[0 .. gridDim.x-1] float totalSum =
calculateTotalSum(result);
if (threadIdx.x == 0) {
// Thread 0 of last block stores the total sum
// to global memory and resets the count
// varilable, so that the next kernel call
// works properly.
result[0] = totalSum;
count = 0;
}
}
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.

CUDA streams performance

I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each block contributes its own reduced sum.
I also have a host function dot_gpu that calls the kernel and reduces the vector result to the final dot product value.
The synchronous version does just this:
// copy to device
copy_to_device<double>(x_h, x_d, n);
copy_to_device<double>(y_h, y_d, n);
// kernel
double result = dot_gpu(x_d, y_d, n, blockNum, blockSize);
while the async one goes like:
double result[numChunks];
for (int i = 0; i < numChunks; i++) {
int offset = i * chunkSize;
// copy to device
copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
// kernel
result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
for (int i = 0; i < numChunks; i++) {
finalResult += result[i];
cudaStreamDestroy(stream[i]);
}
I am getting worse performance when using streams and was trying to investigate the reasons. I tried to pipeline the downloads, kernel calls and uploads, but with no results.
// accumulate the result of each block into a single value
double dot_gpu(const double *x, const double* y, int n, int blockNum, int blockSize, cudaStream_t stream=NULL)
{
double* result = malloc_device<double>(blockNum);
dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result, n);
#if ASYNC
double* r = malloc_host_pinned<double>(blockNum);
copy_to_host_async<double>(result, r, blockNum, stream);
CudaEvent copyResult;
copyResult.record(stream);
copyResult.wait();
#else
double* r = malloc_host<double>(blockNum);
copy_to_host<double>(result, r, blockNum);
#endif
double dotProduct = 0.0;
for (int i = 0; i < blockNum; i ++) {
dotProduct += r[i];
}
cudaFree(result);
#if ASYNC
cudaFreeHost(r);
#else
free(r);
#endif
return dotProduct;
}
My guess is that the problem is inside the dot_gpu() functions that doesn't only call the kernel. Tell me if I understand correctly the following stream executions
foreach stream {
cudaMemcpyAsync( device[stream], host[stream], ... stream );
LaunchKernel<<<...stream>>>( ... );
cudaMemcpyAsync( host[stream], device[stream], ... stream );
}
The host executes all the three instructions without being blocked, since cudaMemcpyAsync and kernel return immediately (however on the GPU they will execute sequentially as they are assigned to the same stream). So host goes on to the next stream (even if stream1 who knows what stage it is at, but who cares.. it's doing his job on the GPU, right?) and executes the three instructions again without being blocked.. and so on and so forth. However, my code blocks the host before it can process the next stream, somewhere inside the dot_gpu() function. Is it because I am allocating & freeing stuff, as well as reducing the array returned by the kernel to a single value?
Assuming your objectified CUDA interface does what the function and method names suggest, there are three reasons why work from subsequent calls to dot_gpu() might not overlap:
Your code explicitly blocks by recording an event and waiting for it.
If it weren't blocking for 1. already, your code would block on the pinned host side allocation and deallocation, as you suspected.
If your code weren't blocking for 2. already, work from subsequent calls to dot_gpu() might still not overlap depending on compute capbility. Devices of compute capability 3.0 or lower do not reorder operations even if they are enqueued to different streams.
Even for devices of compute capability 3.5 and higher the number of streams whose operations can be reordered is limited by the CUDA_​DEVICE_​MAX_​CONNECTIONS environment variable, which defaults to 8 and can be set to values as large as 32.

Does only one Thread executing a Kernel implement the declaration of an array in Shared Memory in CUDA

I am implementing the following CUDA kernel that stores an array in Shared Memory:
// Difference between adjacent array elements
__global__ void kernel( int* in, int* out ) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
// Allocate a shared array, one element per thread
__shared__ int sh_arr[BOCK_SIZE];
// each thread reads one element to sh_arr
sh_arr[i] = in[i];
// Ensure reads from all Threads in Block complete before continuing
__syncthreads();
if( i > 0 )
out[i] = sh_arr[i] - sh_arr[i-1];
// Ensure writes from all Threads in Block complete before continuing
__syncthreads();
}
BLOCK_SIZE is a constant declared outside the kernel.
It seems like every Thread that executes this Kernel will create a new array because every Thread that executes this Kernel will see this line:
__shared__ int sh_arr[BOCK_SIZE];
Is it the case that only the first Thread that executes this Kernel will "see" this line, and all subsequent kernels overlook this line?
Shared variables in CUDA are shared between threads in the same block. I don't know exactly how it is done under the hood but threads in the same thread-block will see __shared__ int sh_arr[BOCK_SIZE]; however, since it has the __shared__ modifier, only one thread will create the array while the others will just use it.

Kernel Launch Failure

I'm operating on a Linux system and a Tesla C2075 machine. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step by step averaged version(time_avg) of a large data set (result). See code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
__shared__ float temp[1024];
int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
int start = 0, index;
unsigned int npts = *nsamps;
printf("here here\n");
// Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
if ( (2*gtid) < npts ){
temp[2*ltid] = result[2*gtid];
temp[2*ltid+1] = result[2*gtid + 1];
}
for (stride=1; stride<blockDim.x; stride>>=1) {
__syncthreads();
if (ltid % (stride*2) == 0){
if ( (2*gtid) < npts ){
temp[2*ltid] += temp[2*ltid + stride];
index = (int)(start + gtid/stride);
time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
}
}
start += npts/(2*stride);
}
__syncthreads();
if (ltid == 0)
{
atomicAdd(mean, temp[0]);
}
__syncthreads();
printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check for the value of mean) after the kernel call, the program hangs completely after the kernel call. If I remove it, program runs and exits. In neither case, do I get the outputs "here here" or the mean value printed. I understand that unless the kernel executes successfully, the printf's won't print.
Has this got to do with __syncthreads() in a recursion? All threads will go till the same depth so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous, if the kernel starts successfully your host code will continue to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization than your kernel probably didn't finished running - it is either running some infinite loop or it is waiting on some __synchthreads() or some other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

CUDA synchronization and reading global memory

I have something like this:
__global__ void globFunction(int *arr, int N) {
int idx = blockIdx.x* blockDim.x+ threadIdx.x;
// calculating and Writing results to arr ...
__syncthreads();
// reading values of another threads(ex i+1)
int val = arr[idx+1]; // IT IS GIVING OLD VALUE
}
int main() {
// declare array, alloc memory, copy memory, etc.
globFunction<<< 4000, 256>>>(arr, N);
// do something ...
return 0;
}
Why am I getting the old value when I read arr[idx+1]? I called __syncthreads, so I expect to see the updated value. What did I do wrong? Am I reading a cache or what?
Using the __syncthreads() function only synchronizes the threads in the current block. In this case this would be the 256 threads per block you created when you launched the kernel. So in your given array, for each index value that crosses over into another block of threads, you'll end up reading a value from global memory that is not synchronized with respect to the threads in the current block.
One thing you can do to circumvent this issue is create shared thread-local storage using the __shared__ CUDA directive that allows the threads in your blocks to share information among themselves, but prevents threads from other blocks accessing the memory allocated for the current block. Once your calculation within the block is complete (and you can use __syncthreads() for this task), you can then copy back into the globally accessible memory the values in the shared block-level storage.
Your kernel could look something like:
__global__ void globFunction(int *arr, int N)
{
__shared__ int local_array[THREADS_PER_BLOCK]; //local block memory cache
int idx = blockIdx.x* blockDim.x+ threadIdx.x;
//...calculate results
local_array[threadIdx.x] = results;
//synchronize the local threads writing to the local memory cache
__syncthreads();
// read the results of another thread in the current thread
int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
//write back the value to global memory
arr[idx] = val;
}
If you must synchronize threads across blocks, you should be looking for another way to solve your problem, since the CUDA programing model works most effectively when a problem can be broken down into blocks, and threads synchronization only needs to take place within a block.