Using CUDA Occupancy Calculator

I am using the occupancy calculator but I cannot understand how to get the registers per thread / shared memory per block. I have read the documentation. I use Visual Studio, so in the project properties under CUDA Build Rule -> Command Line -> Additional Options I added --ptxas-options=-v. The program compiles fine, but I do not see any output. Can anybody help?
Thanks

With this switch on, there should be a line in the compiler output window that tells you the number of registers and the amount of shared memory.
Do you see anything at all in the compiler output window? Can you copy and paste it into the question?
It should look something like
ptxas info : Used 3 registers, 2084+1060 bytes smem, 40 bytes cmem[0], 12 bytes cmem[1]
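If you build from the command line rather than through Visual Studio, the same output can be obtained by passing the flag to nvcc directly (a minimal example; kernel.cu is just a placeholder file name):
nvcc --ptxas-options=-v -c kernel.cu
nvcc -Xptxas -v -c kernel.cu
Both spellings are equivalent and make ptxas print the register, shared, constant and local memory usage for every kernel it assembles.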

Try this simple rule:
All the local variables (int a, float b, etc.) in your kernel are stored in registers, but only as long as they fit within the number of registers available on a multiprocessor (see Limits). However, if you declare a large local array such as int a[1000], then a will not be stored in registers; it will spill to local memory (DRAM).
The amount of shared memory used by your kernel is the Shared Memory / Block figure. For example, if you declare __shared__ float shMem[256], you are using 256 * 4 (size of float) = 1024 bytes of shared memory.
The following sample code (it won't work properly, it is just an example) uses nine 32-bit registers per thread: int xIndex, yIndex, Idx, shY, shX, aLocX, aLocY and float t, temp. It uses 324 bytes of shared memory per block, since with BLOCK_DIM = 16 the shared array holds (BLOCK_DIM+2) * (BLOCK_DIM+2) = 18 * 18 unsigned chars.
__global__ void averageFilter (unsigned char * outImage,
                               int imageWidth,
                               int imageHeight,
                               cuviPoint2 loc){
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int Idx = yIndex*imageWidth + xIndex;
    float t = INC;

    if(xIndex >= imageWidth || yIndex >= imageHeight)
        return;
    else if(xIndex==0 || xIndex==imageWidth-1 || yIndex==0 || yIndex==imageHeight-1){
        for (int i=-1; i<=1; i++)
            for (int j=-1; j<=1; j++)
                t += tex1Dfetch(texMem, Idx + i*imageWidth + j);
        outImage[Idx] = t/6;
    }

    __shared__ unsigned char shMem[BLOCK_DIM+2][BLOCK_DIM+2];
    unsigned int shY = threadIdx.y + 1;
    unsigned int shX = threadIdx.x + 1;

    if (threadIdx.x==0 || threadIdx.x==BLOCK_DIM-1 || threadIdx.y==0 || threadIdx.y==BLOCK_DIM-1){
        for (int i=-1; i<=1; i++)
            for (int j=-1; j<=1; j++)
                shMem[shY+i][shX+j] = tex1Dfetch(texMem, Idx + i*imageWidth + j);
    }
    else
        shMem[shY][shX] = tex1Dfetch(texMem, Idx);

    __syncthreads();

    if(xIndex==0 || xIndex==imageWidth-1 || yIndex==0 || yIndex==imageHeight-1)
        return;

    int aLocX = loc.x, aLocY = loc.y;
    float temp = INC;
    for (int i=aLocY; i<=aLocY+2; i++)
        for (int j=aLocX; j<=aLocX+2; j++)
            temp += shMem[shY+i][shX+j];
    outImage[Idx] = floor(temp/9);
}

shoosh's answer is probably the easiest way to find the register and shared memory usage. Make sure you're looking at the output pane first (select "Output" in the "View" pull-down menu) and then re-compile. The compiler should then print the ptxas info for every kernel to the output pane.

Another way to find this information is to use the Visual Profiler or NVIDIA Parallel Nsight.

Related

read 4 char per thread in 1 transaction in cuda

I have been learning CUDA recently, and I have a question about memory transactions.
What I understand is that in each transaction, 32 consecutive threads (in the same block), i.e. a warp, can concurrently access a consecutive 128 bytes (32 single-precision words) of memory.
But in the examples I have seen, each thread always accesses a (4-byte) word as one whole variable. So my question is: if my array in global memory is of type char, can all 32 threads access this piece of memory and each read 4 consecutive chars at the same time?
So, for example, if I write the code:
__global__
void kernel(char *d_mask)
{
    extern __shared__ char s_tmp[];
    const unsigned int thId = threadIdx.x;
    const unsigned int elementId = 4 * (threadIdx.x + blockDim.x * blockIdx.x);
    // each thread copies its own 4 consecutive chars into shared memory
    s_tmp[4 * thId]     = d_mask[elementId];
    s_tmp[4 * thId + 1] = d_mask[elementId + 1];
    s_tmp[4 * thId + 2] = d_mask[elementId + 2];
    s_tmp[4 * thId + 3] = d_mask[elementId + 3];
    __syncthreads();
    /* calculation */
}
Then, will each thread read its 4 bytes concurrently? And if not, how can I manage to do it? Should I use some API like memcpy?
In order to get a properly efficient read, it's necessary to combine the bytes being read into a single transaction; we generally can't do this by breaking things up across several lines of code.
To combine things into a single transaction, there are vector types which combine multiple elements into a single type. As long as we pay attention to proper alignment, we can treat char or unsigned char arrays as arrays of e.g. uchar4 which is a vector type that combines four characters into a single (32-bit) type. You can find lots more goodies in the cuda header files vector_types.h and vector_functions.h.
Anyway, we could re-write your sample like this, to take advantage of a "vector load":
__global__
void kernel(char *d_mask)
{
    extern __shared__ char s_tmp[];
    const unsigned int thId = threadIdx.x;
    const unsigned int elementId = threadIdx.x + blockDim.x * blockIdx.x;
    uchar4 *s_tmp_v  = reinterpret_cast<uchar4 *>(s_tmp);
    uchar4 *d_mask_v = reinterpret_cast<uchar4 *>(d_mask);
    s_tmp_v[thId] = d_mask_v[elementId];
    __syncthreads();
    /* calculation */
}
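For completeness, here is a minimal sketch of how such a kernel could be launched. The sizes are assumptions for illustration, not from the original question; the important part is that the dynamic shared memory argument covers one uchar4 per thread:
// hypothetical launch parameters, only to show the shared-memory sizing
const int threadsPerBlock = 256;
const int numChars        = 1 << 20;   // total chars, assumed a multiple of 4*threadsPerBlock
const int blocks          = numChars / (4 * threadsPerBlock);
kernel<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(uchar4)>>>(d_mask);
Note that d_mask must be 4-byte aligned for the reinterpret_cast to uchar4 to be valid; a pointer returned by cudaMalloc satisfies this.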

CUDA different threads per block for different functions

I am making a CUDA program and am stuck on a problem. I have two functions:
__global__ void cal_freq_pl(float *, char *, char *, int *, int *)
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)
I call the first function like this:
cal_freq_pl<<<M,512>>>( ... );
M is around 15, so I'm not worried about it. 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.
But when I call the 2nd function in a similar way:
cal_sum_vfreq_pl<<<M,512>>>( ... );
it does not work. After debugging the crap out of that function, I finally found out that it only runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 fewer than 512. It shows no error with 512, but produces incorrect results.
I currently only have access to a compute capability 1.0 device, an NVIDIA Quadro FX 4600 card, on a 64-bit Windows machine.
I have no idea why this behaviour happens; I am positive that the 1st function runs with 512 threads while the 2nd only runs with 384 (or fewer).
Can someone please suggest some possible solution?
Thanks in advance...
EDIT:
Here is the kernel code:
__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv-1; int wv = kv-2;
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp = 0;
    char wI[20], Iw[20]; int Iwi, wIi;
    for(int i=0; i<wv; i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for(int l=0; l<4; l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}
It seems your second kernel uses a lot of registers. You cannot always reach the maximum number of threads per block because of hardware resource limits such as the number of registers per block.
CUDA provides a tool to help calculate the proper number of threads per block.
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
You can also find this .xls file in your CUDA installation dir.
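You can also query this limit at runtime with cudaFuncGetAttributes, which reports the register usage and the maximum block size for one specific kernel. A minimal sketch (assuming it is compiled in the same file as the kernel; error checking omitted):
#include <cstdio>

void printKernelLimits()
{
    // ask the runtime how many threads per block this particular kernel can be
    // launched with, given its actual register and shared memory usage
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, cal_sum_vfreq_pl);
    printf("registers per thread           : %d\n", attr.numRegs);
    printf("static shared memory per block : %d bytes\n", (int)attr.sharedSizeBytes);
    printf("max threads per block (kernel) : %d\n", attr.maxThreadsPerBlock);
}
If maxThreadsPerBlock comes back as 384, that confirms the register-pressure explanation.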

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
    // Here the thread index uniquely identifies a weight within a neuron
    int weightIndex = threadIdx.x;
    // Here the block index uniquely identifies a neuron
    int neuronIndex = blockIdx.x;
    if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
        output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index and the thread index to reference the weight matrix, or does the problem lie elsewhere?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in your output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
    output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all the threads in a single block write concurrently into one memory cell, so undefined results are to be expected. To avoid this I suggest reducing all the values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    __shared__ float out_reduce[NO_OF_WEIGHTS];

    out_reduce[weightIndex] =
        (weightIndex < NO_OF_WEIGHTS && neuronIndex < NO_OF_NEURONS) ?
            weight[neuronIndex][weightIndex] * input[weightIndex] : 0.0f;
    __syncthreads();

    // tree reduction in shared memory; assumes NO_OF_WEIGHTS is a power of two
    for (int s = NO_OF_WEIGHTS / 2; s > 0; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }
    if (weightIndex == 0) output[neuronIndex] += out_reduce[0];
}
It turned out that I had to rewrite half of your small kernel to add the reduction code...
I built a very simple MLP network using CUDA. You can find my code here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but you don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not, i.e.:
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, you might be allocating too little memory (i.e. you think you're requesting 64 elements, but you actually get 64 bytes), and as a result you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault.
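Note also that cudaMallocPitch returns a single flat pitched allocation, not an array of row pointers, so a kernel that reads it should take the base pointer plus the pitch rather than a float** parameter, and address rows through the pitch. A minimal sketch of pitched addressing (the names here are illustrative, not taken from the question's code):
__global__ void sumPitchedRows(const float *d_Weight, size_t pitch_W,
                               float *rowSums, int neurons, int weights)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron (row)
    if (n < neurons) {
        // step down n rows of pitch_W *bytes*, then index floats within that row
        const float *row = (const float *)((const char *)d_Weight + n * pitch_W);
        float s = 0.0f;
        for (int w = 0; w < weights; ++w)
            s += row[w];
        rowSums[n] = s;
    }
}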

cuda shared memory overwrite?

I am trying to write a parallel prefix scan in CUDA by following this tutorial.
I am trying the work-inefficient 'double buffered' version as explained in the tutorial.
This is what I have:
// double buffered naive.
// d = number of iterations, N = size, in = input.
__global__ void prefixsum(int* in, int d, int N)
{
    // get the global thread index
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // allocate shared memory
    extern __shared__ int temp_in[], temp_out[];
    // copy data to it
    temp_in[idx] = in[idx];
    temp_out[idx] = 0;
    // block until all threads copy
    __syncthreads();

    for (int i = 1; i <= d; i++)
    {
        if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
        {
            // copy new result to temp_out
            temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
        }
        else
        {
            // if the element is to remain unchanged, copy the same thing
            temp_out[idx] = temp_in[idx];
        }
        // block until all threads do this
        __syncthreads();
        // copy the result to temp_in for the next iteration
        temp_in[idx] = temp_out[idx];
        // wait for all threads to do so
        __syncthreads();
    }
    // finally copy everything back to global memory
    in[idx] = temp_in[idx];
}
Can you point out what's wrong with this? I have written comments for what I think should happen.
This is the kernel invocation -
prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);
This is the grid and block allocations:
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
The problem is that I don't get the correct output for any input that's more than 8 elements long.
I see two problems in your code
Problem 1: extern shared memory
Agh.... I hate extern __shared__ memory. The problem is that the compiler does not know how big the arrays are. As a result, they both point to the same piece of memory!
So, in your case: temp_in[5] and temp_out[5] refer to the same word in shared memory.
If you really want the extern __shared__ memory, you can manually offset the second array, for example something like this:
size_t size = .... //the size of your array
extern __shared__ int memory[];
int* temp_in=memory;
int* temp_out=memory+size;
Problem 2: Shared array index
Shared memory is private to each block. That is, temp[0] in one block can be different from temp[0] in another block. However, you index it by blockIdx.x*blockDim.x + threadIdx.x, as if the temp arrays were shared between the blocks.
Instead, you should most likely index your temp arrays just by threadIdx.x.
Of course, the in array is in global memory, and you index that one correctly.
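Putting both fixes together, a corrected version of the kernel could look like the sketch below. This is my own sketch, not code from the tutorial: it scans each block's chunk independently (just like the original) and must be launched with 2 * blockDim.x * sizeof(int) bytes of dynamic shared memory.
// naive double-buffered inclusive scan, with the two fixes applied:
//  - the two shared arrays are manually carved out of one extern allocation
//  - shared memory is indexed by threadIdx.x, global memory by the global index
__global__ void prefixsum_fixed(int* in, int d, int N)
{
    extern __shared__ int smem[];
    int* temp_in  = smem;
    int* temp_out = smem + blockDim.x;

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    temp_in[tid] = (gid < N) ? in[gid] : 0;
    __syncthreads();

    for (int i = 1; i <= d; i++)
    {
        int offset = 1 << (i - 1);
        if (tid >= offset)
            temp_out[tid] = temp_in[tid] + temp_in[tid - offset];
        else
            temp_out[tid] = temp_in[tid];
        __syncthreads();
        temp_in[tid] = temp_out[tid];
        __syncthreads();
    }
    if (gid < N) in[gid] = temp_in[tid];
}
It can be launched with the same arguments as before, adding the shared-memory size: prefixsum_fixed<<<dimGrid, dimBlock, 2*numThreadsPerBlock*sizeof(int)>>>(d_arr, log(SIZE)/log(2), N); combining the per-block results into a scan of the whole array still needs a second pass over the block sums.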

CUDA: Shared memory allocation with overlapping borders

Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?
The simple example is string searching; say I want to dice up the input text, have each thread in each block search for a pattern starting from text[thread_id], but want the data assigned to each block to overlap by the pattern length so that matching cases that fall across the border are still found.
I.e. the total amount of shared memory allocated to each block is
(blocksize+patternlength)*sizeof(char)
I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.
UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).
Say I have the dataset QWERTYUIOP, I want to search for a 3-character match, and I dice up the dataset (arbitrarily) into 4s for each thread block: QWER TYUI OPxx
This is simple enough to accomplish, but the algorithm fails if the 3-character match is actually IOP.
In this case, what I want is for each block to have in shared memory:
QWERTY TYUIOP OPxxxx
i.e. each block gets assigned blocksize+patternlength-1 characters so that no memory-border issues occur.
Hope that explains things better.
Since @jmilloy is being persistent... :P
//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (T[startIndex+i] != P[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}

//VERSION 2: Texture
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}

//VERSION 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    extern __shared__ char shaP[];
    for (int i=0; threadIdx.x+i<lenP; i+=blockDim.x){
        shaP[threadIdx.x+i] = tex1Dfetch(texP,threadIdx.x+i);
    }
    __syncthreads();
    //At this point shaP is populated with the pattern
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    // only continue if an earlier instance hasn't already been found
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}
What I would like to have done is to put the text into shared memory chunks, as described in the rest of the question, instead of keeping the text in texture memory for the later versions.
I am not sure that the question makes all that much sense. You can dynamically size a shared memory allocation at runtime like this:
__global__ void kernel()
{
    extern __shared__ int buffer[];
    ....
}

kernel<<< gridsize, blocksize, buffersize >>>();
but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap that you want to ensure that your pattern matching will work as you want it to.
No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.
As far as I know, shared memory actually resides on the multiprocessors, and a thread can only access the shared memory from the multiprocessor that it is running on. So this is a physical limitation. (I guess if two blocks reside on one mp, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block).
Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.
I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is that you start with a version that solves the problem without using any shared memory first. In order for that to work, you will solve your overlapping problem and the shared memory implementation will be easy!
edit 2: after the answer was marked as correct
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    extern __shared__ char shared[];
    char* shaP = &shared[0];
    char* shaT = &shared[lenP];

    //copy the pattern into shaP in parallel
    if(threadIdx.x < lenP)
        shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);

    //determine the texT start and length for this block
    int blockStartIndex = blockIdx.x * blockDim.x;
    int lenS = blockDim.x + lenP - 1;

    //copy the text (including the overlap) into shaT in parallel
    shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
    if(blockDim.x + threadIdx.x < lenS)
        shaT[blockDim.x + threadIdx.x] = tex1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x);
    __syncthreads();

    //We have one copy of the pattern in shaP for the whole block
    //We have the necessary portion of the text (with overlaps) in shaT
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}
key notes:
we only need one copy of the pattern in shared memory per block - all the threads can use it
shared memory needed per block is lenP + lenS bytes (where lenS is your blocksize + patternlength - 1); see the launch sketch after these notes
the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
we can copy into shared memory in parallel (no need for for loops if you have enough threads)
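For the edit-2 kernel above, the dynamic shared memory size from the second note has to be passed as the third launch-configuration argument. A minimal sketch (blockSize and d_pFound here are assumptions for illustration):
// dynamic shared memory: lenP bytes for the pattern + (blockSize + lenP - 1) bytes of text
int blockSize = 256;                  // assumed block size
int gridSize  = lenT / blockSize;     // assumes lenT is padded to a multiple of blockSize
size_t shBytes = lenP + blockSize + lenP - 1;
// d_pFound: device pointer holding the current best match index
gpuSearchTexSha<<<gridSize, blockSize, shBytes>>>(lenT, lenP, d_pFound);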
Overlapping shared memory is not good; the threads would have to synchronize each time they want to access the same address in shared memory (although on architectures >= 2.0 this has been mitigated).
The simplest idea that comes into my mind is to duplicate the portion of the text that you want to be overlapped.
Instead of reading from global memory in exact chunks:
AAAA BBBB CCCC DDDD EEEE
read with overlapping chunks, where each chunk also carries the start of the next one:
AAAAB BBBBC CCCCD DDDDE EEEE