cudaMemcpyDeviceToHost error in basic example - cuda

I have recently started learning CUDA and have integrated it into MS Visual Studio 2010 with Nsight. I have also acquired the book "CUDA by Example" and I'm going through all the examples and compiling them. However, I have come across an error which I do not understand.
The program comes from chapter 4 and it's the julia_gpu example. Original code:
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b ) : r(a), i(b) {}
__device__ float magnitude2( void ) {
return r * r + i * i;
}
__device__ cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+(const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y ) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
int i = 0;
for (i=0; i<200; i++) {
a = a * a + c;
if (a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel( unsigned char *ptr ) {
// map from blockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
// now calculate the value at that position
int juliaValue = julia( x, y );
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
// globals needed by the update routine
struct DataBlock {
unsigned char *dev_bitmap;
};
int main( void ) {
DataBlock data;
CPUBitmap bitmap( DIM, DIM, &data );
unsigned char *dev_bitmap;
HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
data.dev_bitmap = dev_bitmap;
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
bitmap.image_size(),
cudaMemcpyDeviceToHost ) );
HANDLE_ERROR( cudaFree( dev_bitmap ) );
bitmap.display_and_exit();
}
Visual Studio, however, forces me to mark the cuComplex constructor as __device__, otherwise it won't compile (it tells me I cannot use it later in the julia function), which I guess is fair enough. So I have:
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
But when I run the example (having added the includes needed to build it through VS, namely cuda_runtime.h and device_launch_parameters.h, and copied glut32.dll into the same folder as the exe), it quickly fails, killing my device driver and reporting an unknown error at line 94, which is the cudaMemcpy call in main; to be exact, the line containing cudaMemcpyDeviceToHost. However, having set breakpoints line after line, I can see the driver actually dies at the kernel call.
Could someone please tell me what might be wrong? I am a CUDA beginner and have no real idea why such a trivial example would fail like this. What could I be doing wrong? Frankly, I don't even know what to investigate.
I have the CUDA 4.1 toolkit, Nsight 2.1, and a GeForce GT 445M with compute capability 2.1, running driver version 295.

I haven't had time to test this yet, but I think it may be your GPU "timing out" as far as Windows is concerned.
Since Vista, Windows has a default behaviour of telling the graphics driver to recover if the GPU is unresponsive for about 2 seconds. If your kernel runs longer than that, it gets killed. You can increase or remove this timeout through the registry (the TdrDelay / TdrLevel values under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers). I assume a reboot is needed for this, because I just made the changes and it's not working yet.
See this link for detail:
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx
...
Timeout Detection and Recovery : Windows Vista attempts to detect these
problematic hang situations and recover a responsive desktop
dynamically. In this process, the Windows Display Driver Model (WDDM)
driver is reinitialized and the GPU is reset. No reboot is necessary,
which greatly enhances the user experience. The only visible artifact
from the hang detection to the recovery is a screen flicker, which
results from resetting some portions of the graphics stack, causing a
screen redraw. Some older Microsoft DirectX applications may render to
a black screen at the end of this recovery. The end user would have to
restart these applications. The following is a brief overview of the
TDR process: ....
Clearly this is why it's such a weird bug: it will give you that memcpy error at different scales for different people, depending on how fast their graphics card is.
This is a known issue in CUDA.
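To confirm that the failure really comes from the kernel (and not from the memcpy it gets reported against), you can check for errors right after the launch. A minimal sketch, reusing the HANDLE_ERROR macro from the book's common headers:

kernel<<<grid,1>>>( dev_bitmap );

// The launch returns immediately; check for launch errors first, then
// synchronize so any execution error (e.g. a watchdog reset) is reported
// here instead of at the later cudaMemcpy.
HANDLE_ERROR( cudaGetLastError() );
HANDLE_ERROR( cudaDeviceSynchronize() );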

You can try changing this:
const float scale = 1.5;
to something larger, like 3.5, 4.5 or 5.5. With a larger scale most points escape after only a few iterations, so the kernel finishes well within the watchdog limit.
For example:
const float scale = 5.5;
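If you want to see how close the kernel gets to the roughly two-second watchdog window on your own card, you can time it with CUDA events. A minimal sketch, reusing the grid and dev_bitmap from the example:

cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );

HANDLE_ERROR( cudaEventRecord( start, 0 ) );
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );

float elapsedMs = 0.0f;
HANDLE_ERROR( cudaEventElapsedTime( &elapsedMs, start, stop ) );
printf( "kernel time: %.1f ms\n", elapsedMs );   // anything near 2000 ms risks a TDR reset

HANDLE_ERROR( cudaEventDestroy( start ) );
HANDLE_ERROR( cudaEventDestroy( stop ) );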

Related

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>

extern "C" {

__device__ float get_random_number(curandState* global_state, int thread_id) {
    curandState local_state = global_state[thread_id];
    float num = curand_uniform(&local_state);
    global_state[thread_id] = local_state;
    return num;
}

__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
    int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
    int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell

    // make sure this cell is within bounds of grid
    if (x < grid_size && y < grid_size) {
        int thread_id = y * grid_size + x; // thread index
        grid_b[thread_id] = grid_a[thread_id]; // copy current cell
        float num;

        // ignore cell if it is not already populated
        if (grid_a[thread_id] > 0.0) {
            num = get_random_number(global_state, thread_id);
            // agents in this cell die
            if (num < survival_probabilities[thread_id]) {
                grid_b[thread_id] = 0.0; // cell dies
                //printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
            }
        }
    }
}
}
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between the code in the PyCUDA curandom module that determines how much internal generator state is allocated, and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is obviously not going to happen, particularly at large grid sizes. You either need to
1. Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
2. Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select (a sketch follows below).
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
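For the second option, here is a minimal sketch of what managing your own state could look like (assumptions: one curandState per cell of your grid, and a hypothetical init_rng kernel added to the kernel_code string; none of this is part of the PyCUDA curandom API):

// Hypothetical initialisation kernel: one state per grid cell, sized to the
// launch configuration you actually use, so indexing by thread_id is always in bounds.
__global__ void init_rng(curandState* states, unsigned long long seed, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        curand_init(seed, tid, 0, &states[tid]);
}

The host side would then allocate a buffer big enough for matrix_size * matrix_size curandState objects, launch init_rng once over that many threads, and pass that buffer to survival_of_the_fittest instead of generator.state.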

Cuda, two streams created by a NPP function

I'm working on an image processing project with CUDA 7.5 and a GeForce GTX 650 Ti. I decided to use two streams: one where I apply the algorithms responsible for enhancing the image, and another where I run an algorithm that is independent of the rest of the processing.
I wrote an example to show my problem. In this example I created a stream and then used nppSetStream.
I invoked the function nppiThreshold_LTValGTVal_32f_C1R, but two streams are used when the function is executed.
Here is a code example:
#include <npp.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

int main(void) {
    int srcWidth = 1344;
    int srcHeight = 1344;
    int paddStride = 0;

    float* srcArrayDevice;
    float* srcArrayDevice2;
    unsigned char* dstArrayDevice;

    int status = cudaMalloc((void**)&srcArrayDevice, srcWidth * srcHeight * 4);
    status = cudaMalloc((void**)&srcArrayDevice2, srcWidth * srcHeight * 4);
    status = cudaMalloc((void**)&dstArrayDevice, srcWidth * srcHeight);

    cudaStream_t testStream;
    cudaStreamCreateWithFlags(&testStream, cudaStreamNonBlocking);
    nppSetStream(testStream);

    NppiSize roiSize = { srcWidth, srcHeight };
    //status = cudaMemcpyAsync(srcArrayDevice, &srcArrayHost, srcWidth*srcHeight*4, cudaMemcpyHostToDevice, testStream);

    int yRect = 100;
    int xRect = 60;
    float thrL = 50;
    float thrH = 1500;
    NppiSize sz = { 200, 400 };

    for (int i = 0; i < 10; i++) {
        int status3 = nppiThreshold_LTValGTVal_32f_C1R(srcArrayDevice + (srcWidth*yRect + xRect)
                                                       , srcWidth * 4
                                                       , srcArrayDevice2 + (srcWidth*yRect + xRect)
                                                       , srcWidth * 4
                                                       , sz
                                                       , thrL
                                                       , thrL
                                                       , thrH
                                                       , thrH);
    }

    int length = (srcWidth + paddStride)*srcHeight;
    int status6 = nppiScale_32f8u_C1R(srcArrayDevice, srcWidth * 4, dstArrayDevice + paddStride, srcWidth + paddStride, roiSize, 0, 65535);
    //int status7 = cudaMemcpyAsync(dstPinPtr, dstTest, length, cudaMemcpyDeviceToHost, testStream);

    cudaFree(srcArrayDevice);
    cudaFree(srcArrayDevice2);
    cudaFree(dstArrayDevice);
    cudaStreamDestroy(testStream);
    cudaProfilerStop();
    return 0;
}
This is what I got from the NVIDIA Visual Profiler (screenshot: image_width1344).
Why are there two streams if I set only one stream? This causes errors in my original project, so I'm thinking of switching to a single stream.
I noticed that this behaviour depends on the size of the image; if srcWidth and srcHeight are set to 1500, the result is this (screenshot: image_width1500).
Why does changing the size of the image produce another stream?
Why are there two streams if I set only one stream?
It appears that nppiThreshold_LTValGTVal_32f_C1R creates its own internal stream for executing one of the kernels it uses. The other is launched either into the default stream, or the stream you specified with nppSetStream.
I think this is really a documentation oversight/user expectation problem. nppSetStream is doing what it says, but nowhere is it stated that the library is limited to using one stream. It probably should be more explicit in the documentation about how many streams the library uses internally, and how nppSetStream interacts with the library. If this is a problem for your application, I suggest you raise a bug report with NVIDIA.
Why does changing the size of the image produce another stream?
My guess would be that there are some performance heuristics at work, and whether the second stream is used depends on image size. The library is closed source, however, so I can't say for sure.
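If the ordering between the NPP calls and your own work is what breaks in your original project, one pragmatic workaround (a sketch of a defensive pattern, not something the NPP documentation promises) is to synchronize the whole device after the NPP section, so that any internal stream the library created has also finished:

nppSetStream(testStream);
// ... nppiThreshold_LTValGTVal_32f_C1R / nppiScale_32f8u_C1R calls ...

// cudaStreamSynchronize(testStream) only covers the stream you set;
// cudaDeviceSynchronize() also covers any stream NPP created internally.
cudaDeviceSynchronize();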

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like:
__global__ void feedForward(float *input, float *output, float **weight) {
    //Here the threadId uniquely identifies weight in a neuron
    int weightIndex = threadIdx.x;
    //Here the blockId uniquely identifies a neuron
    int neuronIndex = blockIdx.x;
    if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
        output[neuronIndex] += weight[neuronIndex][weightIndex]
                               * input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx:
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index and the thread index to reference the weight matrix?
Or does the problem lie elsewhere?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in your output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
    output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block are writing concurrently into the same memory cell, so undefined results are expected. To avoid this, I suggest reducing all the values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;

    __shared__ float out_reduce[NO_OF_WEIGHTS];
    out_reduce[weightIndex] =
        (weightIndex < NO_OF_WEIGHTS && neuronIndex < NO_OF_NEURONS) ?
            weight[neuronIndex][weightIndex] * input[weightIndex]
            : 0.0;
    __syncthreads();

    // Tree reduction in shared memory; assumes NO_OF_WEIGHTS is a power of two.
    // Start at NO_OF_WEIGHTS / 2 so the first pass never reads past the end of out_reduce.
    for (int s = NO_OF_WEIGHTS / 2; s > 0; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }

    if (weightIndex == 0) output[neuronIndex] += out_reduce[0];
}
It turned out that I had to rewrite half of your small kernel to add the reduction code...
I built a very simple MLP network using CUDA. You can find my code here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky: the third parameter should be in bytes, while the fourth parameter is not, i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as a result you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault.
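Note also that a pitched allocation is a single linear block, not an array of row pointers, so weight[neuronIndex][weightIndex] cannot work on it directly; inside the kernel you index rows via the pitch. A minimal sketch of that pattern (the changed kernel signature and launch are illustrative, not your actual code):

// Pass the pitched buffer as a flat pointer plus its pitch in bytes.
__global__ void feedForward(const float *input, float *output,
                            const float *weight, size_t pitchBytes)
{
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    if (neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS) {
        // Row start = base pointer + row index * pitch (pitch is in bytes).
        const float *row = (const float *)((const char *)weight + neuronIndex * pitchBytes);
        // Still racy across the threads of a block; combine with the
        // shared-memory reduction shown in the other answer.
        output[neuronIndex] += row[weightIndex] * input[weightIndex];
    }
}

// Launch, passing the pitch obtained from cudaMallocPitch:
// feedForward<<<NO_OF_NEURONS, NO_OF_WEIGHTS>>>(d_Input, d_Output, d_Weight, pitch_W);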

Cuda kernel seems to block, launch timed out and was terminated

I wrote a library in CUDA that loads JPEG files, and a viewer that displays them.
Both parts make heavy use of CUDA; the sources are on SourceForge:
cuview & cujpeg
I store an image as RGB data in GPU memory and I have a function bitblt that copies a rectangular array of RGB data from one image into another.
The code worked fine on my last PC with a GTX 580 and CUDA 3.x (which I can't restore any more).
Now I have a GTX 680 and use CUDA 4.x.
The kernel looks like this; it worked fine on the GTX 580 / CUDA 3.x:
__global__ void cujpeg_k_bitblt(CUJPEG* dd, CUJPEG* src, int sx, int sy, int tx, int ty, int w, int h)
{
    unsigned char* sb;
    unsigned char* s;
    unsigned char* db;
    unsigned char* d;
    int tid;
    int x, y;
    int xs, ys, xt, yt;
    int ws, wt;

    sb = src->dev_rgb;
    db = dd->dev_rgb;
    ws = src->stride;
    wt = dd->stride;

    for(tid = threadIdx.x + blockIdx.x * blockDim.x; tid < w * h; tid += blockDim.x * gridDim.x) {
        y = tid / w;
        x = tid - y * w;
        xs = x + sx;
        ys = y + sy;
        xt = x + tx;
        yt = y + ty;
        s = sb + (ys * ws + xs) * 3;
        d = db + (yt * wt + xt) * 3;
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
    }
}
I wonder what this could be related to; maybe the higher values of several device properties on the GTX 680 cause an overflow somewhere?
threads in warp: 32
max threads per block: 1024
max thread dim: 1024 1024 64
max grid dim: 2147483647 65535 65535
Any hints would be really appreciated.
I develop on Linux, using OpenSuSE 12.1.
Best regards,
Torsten.
Edit, 2012-08-22:
I use:
devdriver_4.0_linux_64_270.40.run
cudatools_4.0.13_linux_64.run
cudatoolkit_4.0.13_linux_64_suse11.2.run
Regarding the timing of that function bitblt:
On my last PC with CUDA 3.x and the GTX 580, that function took a few milliseconds.
Now it times out after several seconds.
There are other kernels running; if I comment out the call to bitblt, everything runs fine.
Also, using printf() I can see that all calls before bitblt complete fine and nothing after bitblt is executed.
I can't really believe that the kernel itself is the problem, but I don't know what could influence the behaviour I see.
Best regards,
Torsten.
OK, I found the problem. As the JPEG decoder is a library, I give the user some flexibility in decoding, so when calling CUDA kernels I don't have fixed parameters for grids / threads, but use pre-initialised values that I set at initialisation and that the user can overwrite. I take these default values from the CUDA device properties of the GPU in use, but I was not using the right ones: the grid dimension came out as 2147483647, while 65535 is the maximum value actually allowed for this launch.
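Since the kernel already uses a grid-stride loop, a clamped launch is enough. A minimal sketch (assuming 256 threads per block and the 65535 grid.x limit that applied here):

// Derive the block count from the work size, but never exceed the grid.x
// limit that applies to this build; the grid-stride loop covers the rest.
int threads = 256;
int blocks  = (w * h + threads - 1) / threads;
if (blocks > 65535)
    blocks = 65535;
cujpeg_k_bitblt<<<blocks, threads>>>(dd, src, sx, sy, tx, ty, w, h);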

Keeping unused variables in CUDA

I made some kernels for testing bandwidth and they do no useful computations. A minimal example is
__global__ void testKernel(float* a)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
}
When I compile, I get (not surprisingly)
warning: variable "x" was set but never used
and the kernel runs as quickly as an empty kernel:
__global__ void donothing()
{
}
This indicates that the read of a[i] has been optimized out.
I have tried tricks such as
volatile float x;
if(x);
(void)x;
and they suppress the warning, but the kernel still finishes too quickly.
How can I make sure that the useless instructions actually get executed?
I found the option CU_JIT_OPTIMIZATION_LEVEL, but Google mostly turns up links to the documentation rather than examples of how to use it. Would this option help me, and how do I use it?
Try introducing a branch which stores the variable:
__global__ void testKernel(float* a, float *b)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x;
    x = a[i];
    if(b)
    {
        *b = x;
    }
}
The cost of the branch compared to the cost of memory transfer is negligible.
At the kernel launch site, simply pass a null pointer:
testKernel<<<...>>>(a, static_cast<float*>(0));
nvcc will not perform constant folding at this granularity, so your load should not be removed because the compiler cannot prove it is useless.