CUDA: different threads per block for different kernels

I'm writing a CUDA program and am stuck on a problem. I have two kernels:
__global__ void cal_freq_pl(float *, char *, char *, int *, int *)
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)
I call the first function like this:
cal_freq_pl<<<M,512>>>( ... );
M is about 15, so I'm not worried about it. 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.
But when I call the 2nd function in a similar way:
cal_sum_vfreq_pl<<<M,512>>>( ... );
it does not work. After debugging the crap out of that function, I finally found out that it only runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 fewer than 512. It shows no error with 512, but produces incorrect results.
I currently only have access to compute capability 1.0 hardware, an NVIDIA Quadro FX 4600, on a 64-bit Windows machine.
I have no idea why this behavior happens. I am positive that the first kernel runs correctly with 512 threads per block, while the second only runs correctly with 384 (or fewer).
Can someone please suggest some possible solution?
Thanks in advance...
EDIT:
Here is the kernel code:
__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv-1; int wv = kv-2;
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp = 0;
    char wI[20], Iw[20]; int Iwi, wIi;
    for(int i = 0; i < wv; i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for(int l = 0; l < 4; l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}

It seems your second kernel uses a lot of registers. You cannot always reach the maximum number of threads per block because of hardware resource limits, such as the number of registers per block.
CUDA provides a tool to help calculate the proper number of threads per block, the Occupancy Calculator:
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
You can also find this .xls file in your CUDA installation dir.
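If register pressure is indeed the limit, here is a minimal sketch of how to check and cap it. On compute 1.0 hardware there are 8192 registers per block, so 512 threads leave at most 16 registers per thread; the figures in the comments are illustrative, and __launch_bounds__ requires a toolkit version that supports it:
// Print per-kernel register usage at compile time:
//   nvcc --ptxas-options=-v file.cu
// which produces a line like: ptxas info : Used NN registers, ...
//
// Ask the compiler to fit 512 threads per block (it will spill to
// local memory if it has to); the kernel body stays unchanged:
__global__ void __launch_bounds__(512)
cal_sum_vfreq_pl(float *freq, float *v_freq_vectors,
                 char *wstrings, char *vstrings, int *k)
{
    // ... body as in the question ...
}
// Alternatively, cap registers for the whole compilation unit:
//   nvcc -maxrregcount=16 file.cu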

Related

CUDA atomicAdd using thread.x not returning expected results

I've been experimenting with atomic operations in CUDA, but I can't get thread index numbers to be included in the operations; it looks like they are just treated as zeros, as in the examples shown below.
Is there anything I'm doing wrong in the code below?
Code 1: adding the thread index value to dest[10] (not working: dest[10] is 0 after running, but I would expect it to be greater than 0, since the index value should be added to dest[10] for each thread)
__global__ void add_test(int* dest, float *a, float *b, float *c)
{
    int ix = ((blockIdx.x * blockDim.x) + threadIdx.x);
    int idx = threadIdx.x;
    atomicAdd(dest+10, idx);
}
Code 2: if I use a constant, then it seems to work (at the end of the run dest[10] = 2, but again I would expect it to be greater than 2, as it should add 2 for every running thread/block):
__global__ void add_test(int* dest, float *a, float *b, float *c)
{
    int ix = ((blockIdx.x * blockDim.x) + threadIdx.x);
    int idx = threadIdx.x;
    atomicAdd(dest+10, 2);
}
My test call looks like:
add_test<<<(1024,1,1), (41,1584,1)>>>
This isn't a valid kernel launch:
add_test<<<(1024,1,1), (41,1584,1)>>>
You cannot ask for thread block dimensions of (41,1584,1).
My guess is that you are not doing proper CUDA error checking and have not run your code with cuda-memcheck; either of these would have indicated the error. Your kernel is not actually running.
The maximum in either of the first two dimensions is either 512 or 1024, and the maximum combined dimensions (i.e. the product of the dimensions = total threads) is 512 or 1024 depending on GPU.
In the future, please provide complete, compilable code when asking for help with code that is not working. SO expects this, and not providing it is a valid close reason for your question.
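For reference, a minimal error-checking sketch (the macro name is arbitrary); wrapping each API call and checking after the launch would have reported the invalid configuration immediately:
#include <cstdio>
#include <cstdlib>

// Abort with file/line on any CUDA error.
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

// Usage after a kernel launch:
//   add_test<<<grid, block>>>(dest, a, b, c);
//   CUDA_CHECK(cudaGetLastError());      // catches an invalid launch configuration
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during execution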

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
    // Here the threadId uniquely identifies a weight in a neuron
    int weightIndex = threadIdx.x;
    // Here the blockId uniquely identifies a neuron
    int neuronIndex = blockIdx.x;
    if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
        output[neuronIndex] += weight[neuronIndex][weightIndex]
                             * input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index and the thread index to reference the weight matrix?
Or does the problem lie elsewhere?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in your output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
    output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block write concurrently into one memory cell, so undefined results are expected. To avoid this, I suggest reducing all values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    __shared__ float out_reduce[NO_OF_WEIGHTS];
    out_reduce[weightIndex] =
        (weightIndex < NO_OF_WEIGHTS && neuronIndex < NO_OF_NEURONS) ?
            weight[neuronIndex][weightIndex] * input[weightIndex]
            : 0.0f;
    __syncthreads();
    // tree reduction in shared memory; assumes NO_OF_WEIGHTS is a power of two
    // (starting at NO_OF_WEIGHTS itself would read past the end of the array)
    for (int s = NO_OF_WEIGHTS / 2; s > 0; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }
    if (weightIndex == 0) output[neuronIndex] += out_reduce[0];
}
It turned out that I had to rewrite half of your small kernel to help with the reduction code...
I built a very simple MLP network using CUDA. You can find my code here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not. i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means a segfault, i.e. an out-of-bounds access on the device.
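Note also that cudaMallocPitch returns linear memory plus a pitch in bytes, not a float**. A sketch of the usual row addressing inside a kernel (the names here are placeholders, not the asker's variables):
__global__ void usePitched(float* devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // the pitch is in bytes, so step through rows via a char* cast
    float* row = (float*)((char*)devPtr + y * pitch);
    row[x] *= 2.0f; // example access
}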

Cuda kernel seems to block, launch timed out and was terminated

I wrote a library in CUDA that loads JPEG files, and a viewer that displays them.
Both parts make heavy use of CUDA; the sources are on SourceForge:
cuview & cujpeg
I store an image as RGB data in GPU memory, and I have a function bitblt that copies a rectangular array of RGB data from one image into another.
The code worked fine on my last PC, a GTX 580 with CUDA 3.x (a setup I can't restore any more).
Now I have a GTX 680 and use CUDA 4.x.
The kernel looks like this; it worked fine on the GTX 580 / CUDA 3.x:
__global__ void cujpeg_k_bitblt(CUJPEG* dd, CUJPEG* src, int sx, int sy, int tx, int ty, int w, int h)
{
    unsigned char* sb;
    unsigned char* s;
    unsigned char* db;
    unsigned char* d;
    int tid;
    int x, y;
    int xs, ys, xt, yt;
    int ws, wt;
    sb = src->dev_rgb;
    db = dd->dev_rgb;
    ws = src->stride;
    wt = dd->stride;
    for(tid = threadIdx.x + blockIdx.x * blockDim.x; tid < w * h; tid += blockDim.x * gridDim.x) {
        y = tid / w;
        x = tid - y * w;
        xs = x + sx;
        ys = y + sy;
        xt = x + tx;
        yt = y + ty;
        s = sb + (ys * ws + xs) * 3;
        d = db + (yt * wt + xt) * 3;
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
    }
}
I wonder what this could be related to; maybe the higher values of several device properties on the GTX 680 cause an overflow somewhere?
threads in warp 32
max threads per block 1024
max thread dim 1024 1024 64
max grid dim 2147483647 65535 65535
Any hints would be really appreciated.
I develop on Linux and use openSUSE 12.1.
Best regards,
Torsten.
Edit, 2012-08-22:
I use:
devdriver_4.0_linux_64_270.40.run
cudatools_4.0.13_linux_64.run
cudatoolkit_4.0.13_linux_64_suse11.2.run
Regarding the timing of that bitblt function:
On my last PC with CUDA 3.x and the GTX 580, that function took a few milliseconds.
Now it times out after several seconds.
There are other kernels running; if I comment out the call to bitblt, everything runs fine.
Also, using printf() I can see that all calls before bitblt were fine and that nothing executes after bitblt.
I can't really believe that the kernel itself is the problem, but I don't know what else could influence the behaviour I see.
Best regards,
Torsten.
OK, I found the problem. Since the JPEG decoder is a library, I give the user some flexibility in decoding, so when calling CUDA kernels I don't use fixed parameters for grids/threads, but pre-initialised values that I set at initialisation and that the user can overwrite. I take these default values from the CUDA properties of the GPU in use, but I was not using the correct values: the reported maximum grid dimension is 2147483647, but 65535 is the maximum actually allowed for my launches.
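A sketch of the kind of clamping that avoids this, assuming a grid-stride kernel like the bitblt above (256 threads per block is just a common choice):
int threads = 256;                              // threads per block
int total   = w * h;                            // work items
int blocks  = (total + threads - 1) / threads;  // ceil(total / threads)
if (blocks > 65535) blocks = 65535;             // clamp to the real grid.x limit
cujpeg_k_bitblt<<<blocks, threads>>>(dd, src, sx, sy, tx, ty, w, h);
The grid-stride loop inside the kernel already picks up the remaining elements whenever blocks * threads < total.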

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance is not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was hoping to use shared memory to avoid global memory latency. But something is apparently wrong, because my application became really slow once I started using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
    int j = 0;
    // index GPU - Transaction-wise
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    // shared variables
    extern __shared__ int shared_memory_data[];
    extern __shared__ int shared_b1_data[];
    extern __shared__ int shared_b2_data[];
    // copy from global memory into shared memory and sync threads
    shared_b1_data[tid] = b1_memory[tid];
    shared_b2_data[tid] = b2_memory[tid];
    __syncthreads();
    // AND each int bitwise
    for(j = 0; j < b1_size; j++)
        shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
    // write result for this block to global memory
    _memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the sizes of b1 and b2, since they depend on the number of customers, which I only know at runtime (but both always have the same size).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
    int* _memory_device;
    int* b1_memory;
    int* b2_memory;
    int b1_size = b1.getIntSize();
    // allocate memory on GPU
    (cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
    // copy values to GPU
    (cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    (cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    (cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);
    AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
    // return values
    (cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost));
    // free memory
    (cudaFree(b1_memory));
    (cudaFree(b2_memory));
    (cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits per element. The number of elements depends on the number of customers. Also, I have a problem with the kernel's launch parameters: if I add blocks or threads, AndBitwiseOperation() does not give me the correct result. With just 1 block and 1 thread per block the result is correct, but then the kernel is not parallel.
Every advice is welcomed :)
Thank you
I did not really understand what your kernel is meant to do, and you should read more about CUDA and GPU programming, but I have tried to point out some of the mistakes.
Shared memory (sm) should reduce global memory (gm) reads. Analyze your global memory read and write operations per thread:
a. You read gm twice and write sm twice.
b. (Ignoring the nonsensical loop, which never uses its index) you read sm twice and write sm once.
c. You read sm once and write gm once.
So in total you have gained nothing; you could use global memory directly.
You use all the threads to write out one value at the block index "i". You should use only one thread to write this data out; it makes no sense for multiple threads to output the same data, since the writes just get serialized.
You use a loop but never use the loop counter.
You write at "tid" and read at "i", which do not match, so this assignment is pure overhead:
unsigned int tid = threadIdx.x;
The results cannot be correct with more than one block, since only with one block is tid == i! All the mismatched indexing produces wrong calculations as soon as you use more than one block.
The shared memory at "i" is never written:
_memory_device[i] = shared_memory_data[i];
My assumption of what your kernel should do:
/*
 * Call kernel with x-block usage and up to a 3D grid
 */
__global__ void bitwiseAnd(int* outData_g,
                           const long long int inSize_s,
                           const int* inData1_g,
                           const int* inData2_g)
{
    // get unique block index
    const unsigned long long int blockId = blockIdx.x // 1D
        + blockIdx.y * gridDim.x // 2D
        + gridDim.x * gridDim.y * blockIdx.z; // 3D
    // get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
    // check global unique thread range
    if(threadId >= inSize_s)
        return;
    // output bitwise and
    outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
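A possible host-side launch for this kernel, assuming n int elements and a 1D grid (the block size of 256 is an arbitrary but typical choice):
int n = b1_size;                            // number of int elements
int threads = 256;
int blocks  = (n + threads - 1) / threads;  // ceil(n / threads)
bitwiseAnd<<<blocks, threads>>>(_memory_device, n, b1_memory, b2_memory);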
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size of the extern __shared__ arrays, and defaults to 0.
I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}
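Applied to the original kernel, the launch would have to pass the byte count explicitly; a sketch assuming one int per thread and the blocks/threads values from above:
size_t sharedBytes = threads * sizeof(int); // Ns: dynamic shared memory per block
AndBitwiseOperation<<<blocks, threads, sharedBytes>>>(_memory_device, b1_size, b1_memory, b2_memory);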

Using CUDA Occupancy Calculator

I am using the Occupancy Calculator but I cannot understand how to get the registers per thread / shared memory per block. I read the documentation. I use Visual Studio, so in the project properties under CUDA Build Rule -> Command Line -> Additional Options I added --ptxas-options=-v. The program compiles fine, but I do not see any output. Can anybody help?
Thanks
With this switch on, there should be a line in the compiler output window that tells you the number of registers and the amount of shared memory.
Do you see anything at all in the compiler output window? Can you copy and paste it into the question?
It should look something like
ptxas info : Used 3 registers, 2084+1060 bytes smem, 40 bytes cmem[0], 12 bytes cmem[1]
Try this simple rule:
All the local variables in your kernel, like int a or float b, are stored in registers, but only as long as they remain within the limit of available registers on a multiprocessor (see Limits). However, if you declare a thousand integers, like int a[1000], the array will not be stored in registers; it will be stored in local memory (DRAM).
The amount of shared memory used in your kernel code is the Shared Memory/Block. For example, if you define __shared__ float shMem[256], you are using 256 * 4 (size of float) = 1024 bytes of shared memory.
The following sample code (it will not work properly; it's just an example) uses nine 32-bit registers per thread: int xIndex, yIndex, Idx, shY, shX, aLocX, aLocY and float t, temp. It uses 324 bytes of shared memory per block, since with BLOCK_DIM = 16 the shared array is (BLOCK_DIM+2)*(BLOCK_DIM+2) = 18*18 = 324 bytes.
__global__ void averageFilter (unsigned char * outImage,
                               int imageWidth,
                               int imageHeight,
                               cuviPoint2 loc){
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int Idx = yIndex*imageWidth + xIndex;
    float t = INC;
    if(xIndex >= imageWidth || yIndex >= imageHeight)
        return;
    else if(xIndex==0 || xIndex==imageWidth-1 || yIndex==0 || yIndex==imageHeight-1){
        for (int i=-1; i<=1; i++)
            for (int j=-1; j<=1; j++)
                t += tex1Dfetch(texMem, Idx + i*imageWidth + j);
        outImage[Idx] = t/6;
    }
    __shared__ unsigned char shMem[BLOCK_DIM+2][BLOCK_DIM+2];
    unsigned int shY = threadIdx.y + 1;
    unsigned int shX = threadIdx.x + 1;
    if (threadIdx.x==0 || threadIdx.x==BLOCK_DIM-1 || threadIdx.y==0 || threadIdx.y==BLOCK_DIM-1){
        for (int i=-1; i<=1; i++)
            for (int j=-1; j<=1; j++)
                shMem[shY+i][shX+j] = tex1Dfetch(texMem, Idx + i*imageWidth + j);
    }
    else
        shMem[shY][shX] = tex1Dfetch(texMem, Idx);
    __syncthreads();
    if(xIndex==0 || xIndex==imageWidth-1 || yIndex==0 || yIndex==imageHeight-1)
        return;
    int aLocX = loc.x, aLocY = loc.y;
    float temp = INC;
    for (int i=aLocY; i<=aLocY+2; i++)
        for (int j=aLocX; j<=aLocX+2; j++)
            temp += shMem[shY+i][shX+j];
    outImage[Idx] = floor(temp/9);
}
shoosh's answer is probably the easiest way to find the register and shared memory usage. Make sure you're looking at the output pane first (select "Output" in the "View" pull-down menu) and then re-compile. The compiler should print the ptxas info for all kernels in the output pane.
Another way to find this information is to use the Visual Profiler or NVIDIA Parallel Nsight.
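As a side note, newer toolkits (CUDA 6.5 and later) can also report occupancy programmatically; a minimal sketch, with myKernel standing in for your own kernel:
#include <cstdio>
int numBlocks = 0;
int blockSize = 256;
// how many blocks of myKernel fit per multiprocessor at this block size?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, blockSize, 0 /* dynamic smem bytes */);
printf("Active blocks per SM at block size %d: %d\n", blockSize, numBlocks);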