CUDA kernel seems to block, launch timed out and was terminated

I wrote a library in CUDA that loads JPEG files, and a viewer that displays them.
Both parts make heavy use of CUDA; the sources are on SourceForge:
cuview & cujpeg
I store an image as RGB data in GPU memory, and I have a function bitblt that copies a rectangular block of RGB data from one image into another.
The code worked fine on my last PC with a GTX580 and CUDA 3.x (a setup I can't restore any more).
Now I have a GTX680 and use CUDA 4.x.
The kernel looks like this (it worked fine on the GTX580 / CUDA 3.x):
__global__ void cujpeg_k_bitblt(CUJPEG* dd, CUJPEG* src, int sx, int sy, int tx, int ty, int w, int h)
{
    unsigned char* sb;
    unsigned char* s;
    unsigned char* db;
    unsigned char* d;
    int tid;
    int x, y;
    int xs, ys, xt, yt;
    int ws, wt;

    sb = src->dev_rgb;
    db = dd->dev_rgb;
    ws = src->stride;
    wt = dd->stride;

    for(tid = threadIdx.x + blockIdx.x * blockDim.x; tid < w * h; tid += blockDim.x * gridDim.x) {
        y = tid / w;
        x = tid - y * w;
        xs = x + sx;
        ys = y + sy;
        xt = x + tx;
        yt = y + ty;
        s = sb + (ys * ws + xs) * 3;
        d = db + (yt * wt + xt) * 3;
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
    }
}
I wonder what this could be related to; maybe the larger values of several device properties on the GTX680 cause an overflow somewhere? The relevant properties are:
threads in warp:       32
max threads per block: 1024
max thread dim:        1024 1024 64
max grid dim:          2147483647 65535 65535
Any hints would be really appreciated.
I develop on Linux, using openSUSE 12.1.
Best regards,
Torsten.
Edit, 2012-08-22:
I use:
devdriver_4.0_linux_64_270.40.run
cudatools_4.0.13_linux_64.run
cudatoolkit_4.0.13_linux_64_suse11.2.run
Regarding the timing of the bitblt function:
On my last PC with CUDA 3.x and the GTX580 this function took a few milliseconds.
Now it times out after several seconds.
There are other kernels running; if I comment out the call to bitblt, everything runs fine.
Also, using printf() I can see that all calls before bitblt complete fine and nothing after bitblt is executed.
I can't really believe that the kernel itself is the problem, but I don't know what else could cause the behaviour I'm seeing.
Best regards,
Torsten.

OK, I found the problem. Since the JPEG decoder is a library, I give the user some flexibility in decoding, so when calling CUDA kernels I don't use fixed grid / block parameters but pre-initialised values that I set at initialisation and that the user can overwrite. I derive these defaults from the CUDA device properties of the GPU in use, but I was reading the wrong values: the reported maximum grid dimension is 2147483647, whereas 65535 is the maximum actually allowed for my launches.
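For reference, here is a minimal sketch of how such launch defaults could be derived from the device properties and clamped to a safe grid size; the helper function and the clamp to 65535 are my own illustration, not the actual cujpeg code:

#include <cuda_runtime.h>

// Hypothetical helper: pick default launch dimensions for a 1-D grid-stride
// kernel and clamp the grid size to a conservative limit so the launch stays
// valid even if the device reports a much larger maximum (e.g. 2147483647).
static void pick_default_launch(int work_items, int* blocks, int* threads)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    *threads = prop.maxThreadsPerBlock;               // e.g. 1024 on a GTX680
    int needed = (work_items + *threads - 1) / *threads;

    int limit = prop.maxGridSize[0];                  // may report 2147483647
    if (limit > 65535) limit = 65535;                 // conservative clamp
    *blocks = (needed < limit) ? needed : limit;
}

Because cujpeg_k_bitblt uses a grid-stride loop, clamping the grid this way still covers all w * h pixels.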

Related

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number, when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on two NVIDIA Tesla K20X GPUs.
Kernel code and setup:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.curandom as curandom
from pycuda.compiler import SourceModule

kernel_code = '''
#include <curand_kernel.h>
#include <math.h>

extern "C" {

__device__ float get_random_number(curandState* global_state, int thread_id)
{
    curandState local_state = global_state[thread_id];
    float num = curand_uniform(&local_state);
    global_state[thread_id] = local_state;
    return num;
}

__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column index of cell
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row index of cell

    // make sure this cell is within bounds of grid
    if (x < grid_size && y < grid_size) {
        int thread_id = y * grid_size + x;  // thread index
        grid_b[thread_id] = grid_a[thread_id];  // copy current cell
        float num;

        // ignore cell if it is not already populated
        if (grid_a[thread_id] > 0.0) {
            num = get_random_number(global_state, thread_id);
            // agents in this cell die
            if (num < survival_probabilities[thread_id]) {
                grid_b[thread_id] = 0.0;  // cell dies
                //printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
            }
        }
    }
}
} // extern "C"
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims

grid_a = gpuarray.to_gpu(np.ones((matrix_size, matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size, matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0, 1, (matrix_size, matrix_size)).astype(np.float32))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between the code in the PyCUDA curandom module that determines how much internal generator state gets allocated, and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is obviously not going to happen, particularly at large grid sizes. You either need to:
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
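As a rough illustration of the second option, here is a minimal CUDA C sketch (not PyCUDA-specific) of allocating one curandState per launched thread and initialising it before use; the kernel name, seed and launch configuration are made up for the example:

#include <curand_kernel.h>

// Hypothetical setup kernel: give every launched thread its own state, so the
// state array is sized to match the launch configuration instead of whatever
// the library allocated internally.
__global__ void setup_states(curandState* states, unsigned long long seed, int n_states)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x
           + (threadIdx.y + blockIdx.y * blockDim.y) * gridDim.x * blockDim.x;
    if (id < n_states)
        curand_init(seed, id, 0, &states[id]);
}

// Host side (error checking omitted):
// int n_states = (grid_dims * block_dims) * (grid_dims * block_dims);
// curandState* d_states;
// cudaMalloc(&d_states, n_states * sizeof(curandState));
// setup_states<<<dim3(grid_dims, grid_dims), dim3(block_dims, block_dims, 1)>>>(d_states, 1234ULL, n_states);

The state array d_states can then be passed to survival_of_the_fittest in place of generator.state.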

Tricky pointer arithmetic inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". To perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that copies a piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    //Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    //Load main data
#pragma unroll
    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
    ...
As far as I understand this code, each thread calculates its own values of baseX and baseY, and after that all active threads start to advance the pointers d_Src and d_Dst simultaneously.
According to my understanding, this would be correct if d_Src and d_Dst were in local memory (i.e. each thread had its own copy of these arrays). But these arrays are in global device memory! So, as I see it, all active threads will advance the pointers and the result will be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer. Kernel parameters are passed by value, so d_Src and d_Dst inside the kernel are private, per-thread copies; incrementing them does not move the underlying data in global memory. The same thing happens on the host:
#include <iostream>

void foo(float* bar) {
    bar++;                          // modifies only the local copy of the pointer
}

int main() {
    float* test = 0;
    foo(test);
    std::cout << test << std::endl; // will print 0
}
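The same thing holds on the device; a tiny illustrative kernel (not from the convolution sample):

__global__ void offset_demo(float* p)
{
    p += threadIdx.x;        // modifies only this thread's private copy of the pointer
    *p = (float)threadIdx.x; // writes to base + threadIdx.x
}

// offset_demo<<<1, 32>>>(d_buf);  // the caller's d_buf still points to the
//                                 // start of the buffer afterwards

Each thread offsets its own copy of p, exactly like each thread in convolutionRowsKernel offsets its own copies of d_Src and d_Dst.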

Why does this CUDA example kernel have a for loop?

I have been looking at the following example from the official CUDA website:
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cufft
Download here: http://developer.download.nvidia.com/compute/DevZone/C/Projects/x64/simpleCUFFT.zip
It contains the following kernel:
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = threadID; i < size; i += numThreads)
    {
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
    }
}
My question is: why is there a for loop here? Doesn't CUDA launch an array of threads simultaneously? I removed the loop, replacing it with the following code, and it produced the same output.
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    a[threadID] = ComplexScale(ComplexMul(a[threadID], b[threadID]), scale);
}
As this is an official example on the CUDA website, I imagine I must be missing something.
Your version is basically what happens when numThreads is equal to size (but only then).
What the official example does is the following: suppose numThreads is equal to 4 (for simplicity; usually it will be much larger), and consider the array positions (for both a and b):

a or b:                 x x x x x x x x
thread that works here: 0 1 2 3 0 1 2 3

Then the first thread (thread 0) will work on all array positions whose index is divisible by 4, et cetera.
The problem with your version is that the caller of your function has to make sure there are at least as many threads as the array has elements. For example, if you call your version with a one-dimensional grid and both gridDim.x and blockDim.x equal to 2, but on vectors of length 8, then half of your vector isn't processed!
The official example works regardless: no matter how many threads the caller assigns to it, the entire vector will be processed.
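To make that concrete, here is a small self-contained sketch with a plain scale kernel standing in for ComplexPointwiseMulAndScale, launched with far fewer threads than elements:

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride version of a simple scale kernel (same pattern as the sample).
__global__ void scale(float* a, int size, float s)
{
    const int numThreads = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += numThreads)
        a[i] *= s;
}

int main()
{
    const int size = 1 << 20;                 // 1M elements
    float* d_a;
    cudaMalloc(&d_a, size * sizeof(float));
    cudaMemset(d_a, 0, size * sizeof(float));

    // 32 * 256 = 8192 threads, far fewer than 1M elements: the grid-stride
    // loop still touches every element; a loop-free kernel launched with
    // these dimensions would only process the first 8192.
    scale<<<32, 256>>>(d_a, size, 2.0f);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_a);
    return 0;
}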

CUDA different threads per block for different functions

I am writing a CUDA program and am stuck on a problem. I have two functions:
__global__ void cal_freq_pl(float *, char *, char *, int *, int *)
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)
I call the first function like this:
cal_freq_pl<<<M,512>>>( ... );
M is a number about 15, so I'm not worried about it. 512 is the maximum threads per block on my GPU. This works fine and gives the expected output for all M*512 values.
But when I call the 2nd function in a similar way:
cal_sum_vfreq_pl<<<M,512>>>( ... );
it does not work. After debugging the crap out of that function, I finally found out that it runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 less than 512. It shows no error with 512, but incorrect results.
I currently only have access to a compute capability 1.0 device, an NVIDIA Quadro FX4600, on a 64-bit Windows machine.
I have no idea why such behaviour should happen; I am positive that the 1st function runs with 512 threads and the 2nd only runs with 384 (or fewer).
Can someone please suggest some possible solution?
Thanks in advance...
EDIT:
Here is the kernel code:
__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k)
{
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv - 1; int wv = kv - 2;
    int woffset = index * wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp = 0;
    char wI[20], Iw[20];
    int Iwi, wIi;

    for(int i = 0; i < wv; i++)
        Iw[i+1] = wI[i] = wstrings[woffset + i];

    for(int l = 0; l < 4; l++) {
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }

    freq[index + m*block_dim] = 0.5*temppp;
}
It seems your second kernel uses a lot of registers. You cannot always reach the maximum number of threads per block, because of hardware resource limits such as the number of registers available per block.
CUDA provides a tool to help calculate the proper number of threads per block:
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
You can also find this .xls file in your CUDA installation directory.
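In addition to the occupancy calculator, the runtime can report the per-thread register count of a compiled kernel, and a failed launch can be read back explicitly instead of silently producing wrong results. A minimal sketch, assuming it lives in the same source file as the kernel from the question:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void cal_sum_vfreq_pl(float*, float*, char*, char*, int*);  // kernel from the question

void check_kernel_resources()
{
    // Query how many registers each thread of the kernel uses and the largest
    // block size the hardware will accept for this particular kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, cal_sum_vfreq_pl);
    printf("registers per thread: %d, max threads per block for this kernel: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);
}

// After a launch:
// cal_sum_vfreq_pl<<<M, 512>>>(...);
// cudaError_t err = cudaGetLastError();
// if (err != cudaSuccess)
//     printf("launch failed: %s\n", cudaGetErrorString(err));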

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels myself for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D. I first allocate enough space on the device for X and X_transpose*X and copy X to the device; my kernel takes two inputs, the X matrix and the space to write the X_transpose * X result.
Using the profiler, the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size, to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, which seemed to help significantly. Using the Visual Profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my MATLAB execution (just the operation X'*X), but it appears to take about 3 seconds. I was hoping I could get a much better speedup than MATLAB using CUDA.
The NVIDIA Visual Profiler is unable to find any issues with my kernel, so I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX)
{
    //find location in output matrix
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;

    Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
    float Cvalue = 0;

    for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
        //Get sub-matrices
        Matrix Xsub  = GetSubMatrix(X, m, blockCol);
        Matrix XTsub = GetSubMatrix(X, m, blockRow);

        __shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];

        //Xs[row][col] = GetElement(Xsub, row, col);
        //XTs[row][col] = GetElement(XTsub, col, row);
        Xs[row][col]  = *((float*)((char*)Xsub.data  + row*Xsub.pitch)  + col);
        XTs[col][row] = *((float*)((char*)XTsub.data + row*XTsub.pitch) + col);
        __syncthreads();

        for(int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += Xs[e][row] * XTs[col][e];
        __syncthreads();
    }

    //write the result to the XTX matrix
    //SetElement(XTXsub, row, col, Cvalue);
    ((float*)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention that I am running this on a Kepler card, a GTX 670 4GB.
A smaller block size such as 16x16 or 8x8 may be faster. These slides also demonstrate that a larger, non-square block / shared-memory size can be faster for particular matrix sizes.
For the shared-memory allocation, add a dummy element on the leading dimension, i.e. use [BLOCK_SIZE][BLOCK_SIZE+1], to avoid bank conflicts (see the generic sketch after these tips).
Try to unroll the inner for loop with #pragma unroll.
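A generic illustration of the padding trick, using a simple tile transpose rather than the XTXKernel above (BLOCK_SIZE is assumed to be 32, as in the question):

#define BLOCK_SIZE 32

// The +1 padding makes each row of the tile 33 floats instead of 32, so when
// a warp reads a "column" of the tile (first index varying across threads)
// the accesses fall in different shared-memory banks and are conflict-free.
__global__ void transpose_tile(const float* in, float* out, int width, int height)
{
    __shared__ float tile[BLOCK_SIZE][BLOCK_SIZE + 1];   // padded leading dimension

    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * BLOCK_SIZE + threadIdx.x;            // transposed block offset
    y = blockIdx.x * BLOCK_SIZE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}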
On the other hand, you probably won't be much faster than MATLAB's GPU code for a sufficiently large X'*X, since for MATLAB the bottleneck is the invocation overhead rather than the kernel performance.
The cuBLAS routine cublasSgemm() probably has the highest performance for matrix multiplication; you could compare yours against it (a sketch follows below).
MAGMA's gemm routines have higher performance than cuBLAS in some cases. It is an open-source project, so you may also get some ideas from their code.
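For the comparison with cuBLAS, a minimal sketch of computing X'*X with cublasSgemm; it assumes plain column-major storage allocated with cudaMalloc rather than the pitched layout from the question, and omits error checking:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = X^T * X for an h-by-w matrix X stored column-major with leading
// dimension h; the result C is w-by-w. Link with -lcublas.
void xtx_cublas(const float* d_X, float* d_C, int h, int w)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    // op(A) = X^T is w x h, op(B) = X is h x w, so C = X^T * X is w x w.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                w, w, h,
                &alpha, d_X, h,
                        d_X, h,
                &beta,  d_C, w);

    cublasDestroy(handle);
}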