CUDA: Unspecified Launch Failure

I was using CUDA-GDB to find out what was wrong with my kernel execution. The program always outputs: Cuda error: kernel execution: unspecified launch failure. That is about the worst error to get, because it gives no indication of what is going on.
Back to CUDA-GDB. When I ran the debugger, it stopped at the kernel and output:
Breakpoint 1, myKernel (__cuda_0=0x200300000, __cuda_1=0x200400000, __cuda_2=320, __cuda_3=7872, __cuda_4=0xe805c0, __cuda_5=0xea05e0, __cuda_6=0x96dfa0, __cuda_7=0x955680, __cuda_8=0.056646065580379823, __cuda_9=-0.0045986640087569072, __cuda_10=0.125,
__cuda_11=18.598229033761132, __cuda_12=0.00048828125, __cuda_13=5.9604644775390625e-08)
at myFunction.cu:60
Then I typed: next.
Output:
0x00007ffff7f7a790 in __device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd ()
from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
The notable part of that output is that the mangled name contains the tag of a typedef'd data type (tagCOMPLEX16). COMPLEX16 is defined as: typedef double complex COMPLEX16
Then I typed: next.
Output:
Single stepping until exit from function Z84_device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_ddddddPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd#plt,
which has no line number information.
0x00007ffff7f79560 in ?? () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
Then I typed next again.
Output:
Cannot find bounds of current function
Then I typed continue and got:
Cuda error: kernel execution: unspecified launch failure.
This is the same error I get without debugging. I have seen forum topics about something similar, where the debugger cannot find the bounds of the current function, possibly because the library is not linked correctly. The ?? was said to appear because the debugger is somewhere in a shell and not inside any function.
I believe the real problem lies in the unusual data types in my code: COMPLEX16 and REAL8.
Here is my kernel:
__global__ void chisquared_LogLikelihood_Kernel(REAL8 *d_temp, double *d_sum, int lower, int dataSize,
COMPLEX16 *freqModelhPlus_Data,
COMPLEX16 *freqModelhCross_Data,
COMPLEX16 *freqData_Data,
REAL8 *oneSidedNoisePowerSpectrum_Data,
double FplusScaled,
double FcrossScaled,
double deltaF,
double twopit,
double deltaT,
double TwoDeltaToverN)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
__shared__ REAL8 ssum[MAX_THREADS];
if (idx < dataSize)
{
idx += lower; //accounts for the shift that was made in the original loop
memset(ssum, 0, MAX_THREADS * sizeof(*ssum));
int tid = threadIdx.x;
int bid = blockIdx.x;
REAL8 plainTemplateReal = FplusScaled * freqModelhPlus_Data[idx].re
+ freqModelhCross_Data[idx].re;
REAL8 plainTemplateImag = FplusScaled * freqModelhPlus_Data[idx].im
+ freqModelhCross_Data[idx].im;
/* do time-shifting... */
/* (also un-do 1/deltaT scaling): */
double f = ((double) idx) * deltaF;
/* real & imag parts of exp(-2*pi*i*f*deltaT): */
double re = cos(twopit * f);
double im = - sin(twopit * f);
REAL8 templateReal = (plainTemplateReal*re - plainTemplateImag*im) / deltaT;
REAL8 templateImag = (plainTemplateReal*im + plainTemplateImag*re) / deltaT;
double dataReal = freqData_Data[idx].re / deltaT;
double dataImag = freqData_Data[idx].im / deltaT;
/* compute squared difference & 'chi-squared': */
double diffRe = dataReal - templateReal; // Difference in real parts...
double diffIm = dataImag - templateImag; // ...and imaginary parts, and...
double diffSquared = diffRe*diffRe + diffIm*diffIm ; // ...squared difference of the 2 complex figures.
//d_temp[idx - lower] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);
//ssum[tid] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);
/***** REDUCTION *****/
//__syncthreads(); //all the temps should have data before we add them up
//for (int i = blockDim.x / 2; i > 0; i >>= 1) { /* per block */
// if (tid < i)
// ssum[tid] += ssum[tid + i];
// __syncthreads();
//}
//d_sum[bid] = ssum[0];
}
}
When I am not debugging (that is, -g -G are not on the command line), the kernel only runs fine if I leave out the lines that begin with d_temp[idx - lower] and ssum[tid]. I added the d_temp line only to rule out a shared memory error. I also tried ssum[tid] = 20.0 and various other constants to rule out a problem with the assigned value, and both of those ran fine. As soon as either of the original lines is included, the kernel exits with the CUDA error above.
Please ask if something is unclear or confusing.

My question lacked context. The assumption was probably that I had called cudaMalloc and done the other preliminary steps for ALL the pointers involved before launching the kernel. However, I had only done so for d_temp and d_sum (I was changing so many things that I barely noticed I had added the other four pointers). Once I called cudaMalloc and cudaMemcpy for the remaining data, everything ran perfectly.
Thanks for the insight.

Related

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>
extern "C" {
__device__ float get_random_number(curandState* global_state, int thread_id) {
curandState local_state = global_state[thread_id];
float num = curand_uniform(&local_state);
global_state[thread_id] = local_state;
return num;
}
__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell
// make sure this cell is within bounds of grid
if (x < grid_size && y < grid_size) {
int thread_id = y * grid_size + x; // thread index
grid_b[thread_id] = grid_a[thread_id]; // copy current cell
float num;
// ignore cell if it is not already populated
if (grid_a[thread_id] > 0.0) {
num = get_random_number(global_state, thread_id);
// agents in this cell die
if (num < survival_probabilities[thread_id]) {
grid_b[thread_id] = 0.0; // cell dies
//printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
}
}
}
}
} // extern "C"
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)).astype(np.float32))  # the kernel expects float*, not double*
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a mismatch between the code in the PyCUDA curandom interface that determines how much internal generator state it allocates and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is unlikely, particularly at large grid sizes. You need to either
1. modify your code to use the same block and grid sizes that the curandom module uses internally for whichever generator you choose, or
2. allocate and manage your own state scratch space, so that enough state is allocated to serve the block and grid sizes you select.
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
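One way to picture the first option is a grid-stride mapping: launch only as many threads as there are generator states and let each thread walk over several cells. The sketch below is host-side C++ that emulates this mapping with loops; it is not PyCUDA code. The function name `assign_states` and the parameter `num_states` are hypothetical, since the number of states the curandom module allocates is not given in this thread.

```cpp
#include <cassert>
#include <vector>

// Emulates `num_states` threads, each owning exactly one generator state and
// visiting cells s, s + num_states, s + 2 * num_states, ...
// Returns, for every cell, the index of the state that served it.
std::vector<int> assign_states(int total_cells, int num_states) {
    assert(total_cells >= 0 && num_states > 0);
    std::vector<int> state_of_cell(total_cells, -1);
    for (int s = 0; s < num_states; ++s) {                  // one "thread" per state
        for (int cell = s; cell < total_cells; cell += num_states) {
            state_of_cell[cell] = s;                        // state index stays below num_states
        }
    }
    return state_of_cell;
}
```

In a kernel the inner loop would correspond to a stride of blockDim.x * gridDim.x, so the state array is never indexed past its allocated size regardless of how large the matrix is.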

Tricky array arithmetics inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". To perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position and then copies a piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
float *d_Dst,
float *d_Src,
int imageW,
int imageH,
int pitch
)
{
__shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];
//Offset to the left halo edge
const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;
d_Src += baseY * pitch + baseX;
d_Dst += baseY * pitch + baseX;
//Load main data
#pragma unroll
for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
{
s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
}
...
As far as I understand this code, each thread calculates its own values of baseX and baseY, and after that all active threads increment the pointers d_Src and d_Dst simultaneously.
To my knowledge, this would be correct if the arrays d_Src and d_Dst were in local memory (that is, if each thread had its own copy of these arrays). But these arrays are in global device memory! So all active threads will increment the pointers, and the result will be incorrect. Can someone explain why this works?
Thanks
It works because every thread has its own copy of the pointer.
void foo(float* bar){
bar++;
}
float* test = 0;
foo(test);
cout<<test<<endl; //will print 0
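The same point can be checked on the host. The sketch below is plain C++, not CUDA: a loop stands in for the threads, and `read_for_thread` is a hypothetical helper that mimics `d_Src += baseY * pitch + baseX`. Each call receives its own copy of the pointer, so the caller's pointer never moves, while the data it points to stays shared.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one CUDA thread: it receives the pointer BY VALUE,
// advances its private copy (like `d_Src += baseY * pitch + baseX`),
// and reads through it.
float read_for_thread(const float* src, int baseX, int baseY, int pitch) {
    src += baseY * pitch + baseX;  // modifies only this call's copy of the pointer
    return *src;
}

// Runs every "thread" of a width x height image in a loop. Returns true if each
// one read the element at its own (x, y) and the caller's pointer is unchanged.
bool pointer_copies_are_independent(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<float> image(static_cast<std::size_t>(width) * height);
    for (std::size_t i = 0; i < image.size(); ++i) image[i] = static_cast<float>(i);

    const float* base = image.data();
    const float* const before = base;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (read_for_thread(base, x, y, width) != static_cast<float>(y * width + x))
                return false;
    return base == before;  // the caller's pointer never moved
}
```

Kernel arguments behave the same way: the pointer value is copied to every thread, and only the memory it refers to lives in global device memory.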

Will other threads block in this code with CUDA?

I am new to CUDA programming and am seeing some strange behaviour.
I have a kernel like this:
__global__ void myKernel (uint64_t *input, int numOfBlocks, uint64_t *state) {
int const t = blockIdx.x * blockDim.x + threadIdx.x;
int i;
for (i = 0; i < numOfBlocks; i++) {
if (t < 32) {
if (t < 8) {
state[t] = state[t] ^ input[t];
}
if (t < 25) {
deviceFunc(device_state); /* will use some printf() */
}
}
}
}
I launch this kernel with these parameters:
myKernel<<<1, 32>>>(input, numOfBlocks, state);
If 'numOfBlocks' is equal to 1, it works fine: I get the result I expect, and the printf() calls inside deviceFunc() appear in the correct order.
If 'numOfBlocks' is equal to 2, it does not work: the result is not what I expected, and the printf() output is out of order (I only call printf() from thread 0)!
So my question is: will the remaining threads (32 - 25 of them), which do NOT call deviceFunc(), wait and block at this position, or will they carry on and start the next iteration of the for loop? I always thought that every line in the kernel was synchronized within the same block.
I worked on this the whole day and finally found a solution. First, you are right that I had many RAW (read-after-write) hazards in my deviceFunc(). I started putting __syncthreads() after every write operation, but I think this slows down my program, and I don't think __syncthreads() is the usual way to resolve them. Oddly, the result is the same with and without __syncthreads().
But the actual problem in my code above is that I used
input[t]
which was wrong, because I had to include the loop index i (which runs up to 'numOfBlocks') in the calculation of the index:
input[(NUM_OF_XOR_THREADS * i) + t]
Now the result is correct and my problem is solved.
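The effect of the fix can be reproduced on the host. This is a minimal C++ sketch, not the original kernel: loops stand in for the threads, deviceFunc() is left out, and `NUM_OF_XOR_THREADS = 8` is an assumption taken from the `t < 8` test in the question. The function name `absorb_blocks` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed value for illustration: 8 threads take part in the XOR,
// matching the `t < 8` test in the kernel above.
constexpr int NUM_OF_XOR_THREADS = 8;

// Host-side emulation of the kernel's outer loop: iteration i must read the
// i-th chunk of `input`, i.e. input[NUM_OF_XOR_THREADS * i + t], not input[t].
std::vector<std::uint64_t> absorb_blocks(const std::vector<std::uint64_t>& input,
                                         int numOfBlocks) {
    assert(numOfBlocks >= 0);
    assert(input.size() >= static_cast<std::size_t>(NUM_OF_XOR_THREADS) * numOfBlocks);
    std::vector<std::uint64_t> state(NUM_OF_XOR_THREADS, 0);
    for (int i = 0; i < numOfBlocks; ++i)
        for (int t = 0; t < NUM_OF_XOR_THREADS; ++t)  // stands in for threads 0..7
            state[t] ^= input[NUM_OF_XOR_THREADS * i + t];
    return state;
}
```

With the original `input[t]` indexing, two iterations would XOR the same eight words twice and cancel them out, which would explain why only the single-block case looked correct.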

Cuda kernel seems to block, launch timed out and was terminated

I wrote a library in CUDA that loads JPEG files, and a viewer that displays them.
Both parts make heavy use of CUDA; the sources are on SourceForge:
cuview & cujpeg
I store an image as RGB data in GPU memory, and I have a function bitblt that copies a rectangular array of RGB data from one image into another.
The code worked fine on my previous PC with a GTX580 and CUDA 3.x (I can't restore that setup any more).
Now I have a GTX680 and use CUDA 4.x.
The kernel looks like this; it worked fine on GTX580 / CUDA 3.x:
__global__ void cujpeg_k_bitblt(CUJPEG* dd, CUJPEG* src, int sx, int sy, int tx, int ty, int w, int h)
{
unsigned char* sb;
unsigned char* s;
unsigned char* db;
unsigned char* d;
int tid;
int x, y;
int xs, ys, xt, yt;
int ws, wt;
sb = src->dev_rgb;
db = dd->dev_rgb;
ws = src->stride;
wt = dd->stride;
for(tid = threadIdx.x + blockIdx.x * blockDim.x; tid < w * h; tid += blockDim.x * gridDim.x) {
y = tid / w;
x = tid - y * w;
xs = x + sx;
ys = y + sy;
xt = x + tx;
yt = y + ty;
s = sb + (ys * ws + xs) * 3;
d = db + (yt * wt + xt) * 3;
d[0] = s[0];
d[1] = s[1];
d[2] = s[2];
}
}
I wonder what this could be related to, maybe the higher numbers for several properties on the GTX680 generate an overflow somewhere?
threads in warp 32
max threads per block 1024
max thread dim 1024 1024 64
max grid dim 2147483647 65535 65535
Any hints would be really appreciated.
I develop on Linux, use OpenSuSE 12.1.
Best regards,
Torsten.
Edit, 2012-08-22:
I use:
devdriver_4.0_linux_64_270.40.run
cudatools_4.0.13_linux_64.run
cudatoolkit_4.0.13_linux_64_suse11.2.run
Regarding the timing of the bitblt function:
On my previous PC with CUDA 3.x and a GTX580, the function took a few milliseconds.
Now it times out after several seconds.
Other kernels are running as well; if I comment out the call to bitblt, everything runs fine.
Using printf() I can also see that all calls before bitblt succeed and that nothing is executed after bitblt.
I can't really believe that the kernel itself is the problem, but I don't know what else could cause the behaviour I see.
Best regards,
Torsten.
OK, I found the problem. Because the JPEG decoder is a library, I give the user some flexibility in decoding: when calling CUDA kernels I don't use fixed parameters for grids and threads, but pre-initialised values that I set at initialisation and that the user can overwrite. I take these default values from the CUDA properties of the GPU in use, but I was not using the correct ones. The grid size was set to 2147483647, whereas 65535 is the maximum value allowed.
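A defensive way to derive the launch size is to compute the number of blocks actually needed and clamp it to the limit. The following host-side C++ sketch illustrates the idea; `blocks_for` is a hypothetical helper, not part of cujpeg, and the limit is passed in by the caller (65535 in the case described above).

```cpp
#include <algorithm>
#include <cassert>

// Number of blocks for a 1-D launch over `n` items, clamped to `maxGridX`.
// `maxGridX` is an assumption supplied by the caller: 65535 in the case
// described above, not the 2147483647 reported in the device properties.
int blocks_for(long long n, int threadsPerBlock, int maxGridX) {
    assert(n >= 0 && threadsPerBlock > 0 && maxGridX > 0);
    // ceil(n / threadsPerBlock) without floating point
    const long long needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    // launch at least one block, and never more than the limit
    return static_cast<int>(std::min<long long>(std::max<long long>(needed, 1), maxGridX));
}
```

Because the bitblt kernel already uses a grid-stride loop (tid += blockDim.x * gridDim.x), a clamped grid still visits all w * h pixels; each thread simply handles more than one pixel.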

cudaMemcpyDeviceToHost error in basic example

I have recently started learning CUDA and I've integrated my CUDA into MS Visual Studio 2010 with Nsight. I have also acquired the book "CUDA by Example" and I'm going through all the examples and compiling them. I have come across an error however, which I do not understand.
The program comes from chapter 4 and it's the julia_gpu example. Original code:
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b ) : r(a), i(b) {}
__device__ float magnitude2( void ) {
return r * r + i * i;
}
__device__ cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+(const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y ) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
int i = 0;
for (i=0; i<200; i++) {
a = a * a + c;
if (a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel( unsigned char *ptr ) {
// map from blockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
// now calculate the value at that position
int juliaValue = julia( x, y );
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
// globals needed by the update routine
struct DataBlock {
unsigned char *dev_bitmap;
};
int main( void ) {
DataBlock data;
CPUBitmap bitmap( DIM, DIM, &data );
unsigned char *dev_bitmap;
HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
data.dev_bitmap = dev_bitmap;
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
bitmap.image_size(),
cudaMemcpyDeviceToHost ) );
HANDLE_ERROR( cudaFree( dev_bitmap ) );
bitmap.display_and_exit();
}
Visual Studio, however, forces me to mark the cuComplex constructor as __device__, otherwise the code won't compile (it tells me I cannot use the constructor later in the julia function), which I guess is fair enough. So I have:
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
But when I run the example (having added the includes it needs to build in VS, namely cuda_runtime.h and device_launch_parameters.h, and having copied glut32.dll into the same folder as the exe), it quickly fails, killing my device driver and reporting an unknown error in line 94, which is the cudaMemcpy call in main. To be exact, it is the line containing "cudaMemcpyDeviceToHost". However, when I set breakpoints line by line, the driver dies at the kernel call.
Could someone please tell me what might be wrong? I am new to CUDA and have no real idea why a trivial example would kill itself like that. What could I be doing wrong? Frankly, I don't even know what to investigate.
I have the CUDA 4.1 toolkit, Nsight 2.1 and a GeForce GT445M with compute capability 2.1, and version 295 of the drivers.
I haven't had time to test this yet, but I think your graphics card may be "timing out" as far as Windows is concerned.
Since Vista, Windows by default tells the graphics driver to recover if the GPU has not responded within 2 seconds. If your job takes longer than that, it gets killed. You can lengthen or disable this timeout through the registry. I assume a reboot is needed, because I just made the changes and they have not taken effect yet.
See this link for details:
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx
...
Timeout Detection and Recovery : Windows Vista attempts to detect these
problematic hang situations and recover a responsive desktop
dynamically. In this process, the Windows Display Driver Model (WDDM)
driver is reinitialized and the GPU is reset. No reboot is necessary,
which greatly enhances the user experience. The only visible artifact
from the hang detection to the recovery is a screen flicker, which
results from resetting some portions of the graphics stack, causing a
screen redraw. Some older Microsoft DirectX applications may render to
a black screen at the end of this recovery. The end user would have to
restart these applications. The following is a brief overview of the
TDR process: ....
This is why it looks like such a weird bug: the memory copy error appears at different problem sizes for different people, depending on how fast their graphics card is.
This is a known issue in CUDA.
You can try changing this:
const float scale = 1.5;
to something larger like 3.5, 4.5, 5.5.
example:
const float scale = 5.5;