Memory Error in CUDA Program for Fermi GPU

Memory Error in CUDA Program for Fermi GPU - cuda

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
char4 *pc4Buf_h = NULL;
char4 *pc4Buf_d = NULL;
float2 *pf2InX_d = NULL;
float2 *pf2InY_d = NULL;
dim3 dimBCopy(1, 1, 1);
dim3 dimGCopy(1, 1);
...
/* i do check for errors in the actual code */
pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
(void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
(void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
(void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
...
dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
dimGCopy.x = N / 1024;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
pf2InX_d,
pf2InY_d);
...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
float2 *pf2FFTInX,
float2 *pf2FFTInY)
{
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
pf2FFTInY[i].x = (float) pc4Data[i].z;
pf2FFTInY[i].y = (float) pc4Data[i].w;
return;
}
One other thing I noticed in my program is that if I comment out any two char-to-float assignment statements in my kernel, there's no memory error. One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX), and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 5 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.

The problem was a bad GPU card (see the comments). [I'm Adding this answer to remove the question from the unanswered list and make it more useful.]

Related

How to debug numba error involving synchronization [duplicate]

I'm having trouble here. I launch two kernels , check if some value is the one expected (memcpy to the host), if it is I stop, if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances, when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But 8 blocks and 1024 threads, and this happens, the kernel doesn't work, neither 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. Still doesnt work, nothing gets printed, even if the printf is right after the three initial lines and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.

__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't due to the problem mentioned above, so the output never shows up. You could have figured out this by testing the return codes all CUDA function calls for errors.

Memory access in CUDA kernel functions (simple example)

I am novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVidia "CUDA by examples" book.
And I do not understand properly how thread access and change variables in such a simple example (dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any sequence between threads? For example, the first are the thread from the first block, then threads from the second block come into play and so on (this is connected to the question why this is necessary to divide threads into blocks).
Do all threads have their own copy of the "temp" variable or not (if not, why is there no race condition?)
How is it operated? What exactly goes to the variable temp in the while loop? The array cache stores values of temp for different threads. How does the summation go on? It seems that temp already contains all sums necessary for dot product because variable tid goes from 0 to N-1 in the while loop.

Despite the code you provide is incomplete, here are some clarifications about what you are asking :
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make threads work only on one or a few elements.
For instance, if you have to treat 100 integers with a specific algorithm, you probably want 100 threads to treat 1 element each.
In CUDA the amount of blocks and threads is defined at the kernel launch on host side :
myKernel<<<grid, threads>>>(...);
Where grids and threads are dim3, which define the size on three dimensions.
There is no specific order in the execution of threads and blocks. As you can read there :
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is defined in the kernel in no specific way, it is not distributed and each thread will have this value stored in a register.
This is equivalent of what is done on CPU side. So yes, this means each threads has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using access to device arrays.
Again, this is equivalent of what is done on CPU side.
I think you should probably check if you are used enough to C/C++ programming on CPU side before going further into GPU programming. Meaning no offense, it seems you have a lack in several main topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.

How to properly add in global memory in CUDA?

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I'm getting the x and y index of the current thread and using those to get the pixel in the image arrays I'm dealing with in the current thread. Then I calculate the absolute difference, wait for all the threads to finish calculating that, then if the current thread is within the block in the image we care about the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
int id = idx * cuda_Blocksize + idy;
int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
__syncthreads();
if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
atomicAdd( cuda_SAD, AD );
}
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?

Your grid and block size settings looks odd.
Usually we use the settings for image pixels similar as follows.
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You could refer to the following section in cuda programming guide for more information on how to distribute workloads to CUDA threads.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

You really should show all the code and identify the GPU you are running on. At least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper cuda error
checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your
threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.

Number of threads in a block

I used x & y for calculating cells of a matrix in device.
when I used more than 32 for lenA & lenB, the breakpoint (in int x= threadIdx.x; in device code) can't work and output isn't correct.
in host code:
int lenA=52;
int lenB=52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);
in device code:
int x= threadIdx.x;
int y= threadIdx.y;
...

Your threadsPerBlock dim3 variable must satisfy the requirements for the compute capability that you are targetting.
CC 1.x devices can handle up to 512 threads per block
CC 2.0 - 8.6 devices can handle up to 1024 threads per block.
Your dim3 variable at (32,32) is specifying 1024 (=32x32) threads per block. When you exceed that you are getting a kernel launch fail.
If you did cuda error checking on your kernel launch, you would see the error.
Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
Additional notes:
You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.
If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.

GPGPU - CUDA: global store efficiency

I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVidia's profiler (I am using CUDA 5 toolkit preview release, on a Fermi GPU).
As far as I understood, this value is the ratio of requested memory transactions to actual nb of transcations performed, therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency both for global load and for global store, as expected:
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = input[offset];
output[offset] = tmp;
}
What I don't understand is why when I start adding useful code in between the input read and the output write, the global store efficiency begins to drop, whereas I have not changed the memory write pattern or the thread block geometry ? The global load stays at 100%, as I expect, though.
Could someone please shed a light on why this happens ? I thought, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and using a "coalescing-friendly" pattern, I should still get 100% whatever I do before, but obviously I must be misunderstanding something on either the meaning of global store efficiency, or on the conditions for global store coalescing.
Thx,
EDIT :
Here is an example: if I use this code (just adding a "round" operation on input), global store efficiency drops from 100% to 95%
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = round(input[offset]);
output[offset] = tmp;
}

Unsure if this is the case, but round probably converts its argument to a double and if there is a register spilling, then each thread would access 8 bytes of memory, which would then be coerced into 4 bytes of tmp. Accessing 8 bytes would reduce the coalescing to half-warp.
However, I believe register spilling shouldn't happen since the number of local variables in your kernel is small. You could check with nvcc --ptxas-options=-v for the spill.

Ok, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result : 100% store efficiency in both cases.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008