How to properly add in global memory in CUDA? - cuda

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I get the x and y index of the current thread and use them to locate the pixel in the image arrays that this thread handles. I then calculate the absolute difference and wait for all the threads to finish calculating it. Finally, if the current thread is within the block of the image we care about, the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during the write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idx * cuda_Blocksize + idy;
    int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
    __syncthreads();
    if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
        atomicAdd( cuda_SAD, AD );
    }
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?

Your grid and block size settings look odd.
For image pixels we usually pick a fixed block size and round the grid size up so that the grid covers the whole region, for example:
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
See the following section of the CUDA C Programming Guide for more information on how to distribute work across CUDA threads.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
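The rounding in the grid size above can be checked on the host. This is a minimal sketch in plain C++ (no CUDA calls); the helper names blocksToCover and inBounds are hypothetical and not part of the original answer.

```cpp
// Number of blocks needed along one axis so that blocks of `threadsPerDim`
// threads cover `extent` pixels. Integer division rounded up, i.e. the same
// arithmetic as (imageROISize + 15) / 16 when threadsPerDim is 16.
int blocksToCover(int extent, int threadsPerDim) {
    return (extent + threadsPerDim - 1) / threadsPerDim;
}

// Because the grid is rounded up, some threads fall outside the image.
// A kernel would use a test like this before touching memory.
bool inBounds(int x, int y, int extent) {
    return x < extent && y < extent;
}
```

For imageROISize = 1000 this gives 63 blocks per axis, which is 1008 threads per axis, so the last 8 threads in each direction have to be masked off by the bounds test.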

You really should show all the code and identify the GPU you are running on. At least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper CUDA error checking on all CUDA API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
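The arithmetic above can be reproduced with a small host-side sketch in plain C++. The function names are hypothetical, and the 512-thread limit is passed in as a parameter rather than queried from a device.

```cpp
// Threads per block requested by the question's setup:
// block_sizeX = block_sizeY = Blocksize / 4, using integer division.
int requestedThreadsPerBlock(int blocksize) {
    int side = blocksize / 4;
    return side * side;
}

// True if the request stays within the given per-block thread limit
// (512 for cc 1.x devices, as stated in the answer).
bool fitsThreadLimit(int blocksize, int maxThreadsPerBlock) {
    return requestedThreadsPerBlock(blocksize) <= maxThreadsPerBlock;
}
```

With a limit of 512, Blocksize = 91 requests 484 threads and fits, while Blocksize = 92 requests 529 and does not, which matches the point at which the results turn to 0.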

Related

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in this simple example (the dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads from the first block run first, then the threads from the second block, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
Does every thread have its own copy of the "temp" variable? (If not, why is there no race condition?)
How does it operate? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation proceed? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provided is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make threads work only on one or a few elements.
For instance, if you have to treat 100 integers with a specific algorithm, you probably want 100 threads to treat 1 element each.
In CUDA the amount of blocks and threads is defined at the kernel launch on host side :
myKernel<<<grid, threads>>>(...);
Here grid and threads are dim3 values, which define the size in up to three dimensions.
There is no specific order in the execution of threads and blocks, as stated in:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is declared in the kernel without a memory-space qualifier such as __shared__, it is not shared between threads: each thread has its own copy, normally stored in a register.
This is equivalent to a local variable on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using access to device arrays.
Again, this is equivalent of what is done on CPU side.
I think you should make sure you are comfortable with C/C++ programming on the CPU side before going further into GPU programming. No offense meant, but you seem to be missing several basic concepts.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.
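To make the per-thread behaviour of temp concrete, here is a host-side emulation of the while loop from the question, written in plain C++ as a sketch. The function names are hypothetical, the launch shape is passed in as ordinary integers, and the threads are emulated one after another, which is only valid because they do not depend on each other in this part of the kernel.

```cpp
#include <vector>

// Emulates the kernel's while loop: every (block, thread) pair gets its own
// temp and walks the arrays with a stride of blockDim.x * gridDim.x.
// Returns one partial sum per emulated thread.
std::vector<float> perThreadPartials(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     int blocks, int threadsPerBlock) {
    const int n = static_cast<int>(a.size());
    const int totalThreads = blocks * threadsPerBlock;
    std::vector<float> partial(totalThreads, 0.0f);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            const int globalId = thread + block * threadsPerBlock;
            int tid = globalId;
            float temp = 0.0f;          // private to this emulated thread
            while (tid < n) {
                temp += a[tid] * b[tid];
                tid += totalThreads;    // blockDim.x * gridDim.x
            }
            partial[globalId] = temp;
        }
    }
    return partial;
}

// The partial sums still have to be added together to get the dot product.
float sumPartials(const std::vector<float>& partial) {
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With 10 elements and 2 blocks of 2 threads, thread 0 only sees elements 0, 4 and 8, so no single temp holds the full dot product; each holds one slice of it, and the slices are combined in a later step.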

Number of threads in a block

I use x and y to calculate the cells of a matrix on the device.
When I use more than 32 for lenA and lenB, the breakpoint (at int x= threadIdx.x; in the device code) is never hit and the output isn't correct.
in host code:
int lenA=52;
int lenB=52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);
in device code:
int x= threadIdx.x;
int y= threadIdx.y;
...
Your threadsPerBlock dim3 variable must satisfy the requirements for the compute capability that you are targeting.
CC 1.x devices can handle up to 512 threads per block.
CC 2.0 and later devices can handle up to 1024 threads per block.
Your dim3 variable at (32,32) is specifying 1024 (=32x32) threads per block. When you exceed that, you get a kernel launch failure.
If you did cuda error checking on your kernel launch, you would see the error.
Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
Additional notes:
You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.
If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.
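As a rough host-side check of the numbers above, here is a sketch in plain C++ with no CUDA calls. The helper names and the 16x16 tile are assumptions chosen for illustration and are not from the original post.

```cpp
// Total threads requested per block by a dim3-style (x, y, z) block shape.
int threadsInBlock(int x, int y, int z = 1) {
    return x * y * z;
}

// Blocks needed along one axis when each block covers `tile` elements.
int blocksAlongAxis(int len, int tile) {
    return (len + tile - 1) / tile;
}
```

For lenA = lenB = 52, a (52,52) block asks for 2704 threads, well over 1024. One possible alternative is a (16,16) block with a 4x4 grid; that covers 64x64 cells, so the kernel would need to compute x and y from blockIdx as well as threadIdx and skip cells with x or y of 52 or more.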

How to properly coalesce writes from global memory into global memory?

Please understand me, but I don't know English.
My Computing environment is
CPU : Intel Xeon x5690 3.46Ghz * 2EA
OS : CentOS 5.8
VGA : Nvidia Geforce GTX580 (CC is 2.0)
I have already read the documentation on "coalesced memory access" in the CUDA C Programming Guide.
But I can't apply it to my case.
I have 32x32 blocks per grid and 16x16 threads per block.
That corresponds to the following code.
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
How can I achieve coalesced memory access in this case?
I use the following code in the kernel.
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because the total number of threads is 512x512, that is, grid_size x block_size in each dimension.
But the Visual Profiler reports "Low Global Memory Store Efficiency[9.7% avg, for kernels accounting for 100% of compute]".
The profiler's help suggests using coalesced memory access.
But I don't know how I should index the memory to achieve that.
For more detailed code, see: The result of an experiment different from CUDA Occupancy Calculator
Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz} varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
int tid_in_block = threadIdx.x + threadIdx.y * blockDim.x;
int bid_in_grid = blockIdx.x + blockIdx.y * gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;
global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
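The difference between the two indexing schemes can be seen by counting how many memory segments one warp touches. The following is a simplified host-side model in plain C++, not a description of the actual hardware path: it assumes 4-byte words, 128-byte aligned segments, a 16x16 block as in the question, and hypothetical helper names.

```cpp
#include <set>

// Counts the distinct 128-byte segments touched by one warp of a 16x16 block
// when each thread stores one 4-byte word at word index addr(i, j).
template <typename AddrFn>
int segmentsTouchedByWarp(int warp, int blockIdxX, int blockIdxY, AddrFn addr) {
    const int blockDimX = 16;
    const int warpSize = 32;
    const int wordsPerSegment = 128 / 4;
    std::set<int> segments;
    for (int lane = 0; lane < warpSize; ++lane) {
        const int linear = warp * warpSize + lane;  // threadIdx.x varies fastest
        const int tx = linear % blockDimX;
        const int ty = linear / blockDimX;
        const int i = blockIdxX * 16 + tx;
        const int j = blockIdxY * 16 + ty;
        segments.insert(addr(i, j) / wordsPerSegment);
    }
    return static_cast<int>(segments.size());
}

int rowMajorAddr(int i, int j) { return i * 512 + j; }  // indexing in the question
int colMajorAddr(int i, int j) { return i + 512 * j; }  // indexing in the answer
```

Under this model the question's indexing spreads a warp over 16 segments, while the answer's indexing needs only 2 (one per row of 16 threads), which is consistent with the low store efficiency the profiler reported.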

Memory coalescing while implementing FDTD equations

I was trying to implement FDTD equations on the GPU. I initially had implemented the kernel which used global memory. The memory coalescing wasn't that great. Hence I implemented another kernel which used shared memory to load the values. I am working on a grid of 1024x1024.
The code is below
__global__ void update_Hx(float *Hx, float *Ez, float *coef1, float* coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];
    /*int top = offset + x_index_dim;*/
    if(threadIdx.y == (blockDim.y - 1)){
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
    }
    else{
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
    }
}
The constants BLOCKSIZE_HX = 16 and BLOCKSIZE_HY = 16.
When I run the visual profiler, it still says that the memory is not coalesced.
EDIT:
I am using a GT 520 graphics card with CUDA compute capability 2.1.
My Global L2 transactions / Access = 7.5, i.e. there are 245,760 L2 transactions for 32,768 executions of the line
Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
Global memory load efficiency is 50%.
Global memory load efficiency = 100 * gld_requested_throughput/ gld_throughput
I am not able to figure out why there are so many memory accesses, though my threads are looking at 16 consecutive values. Can somebody point to me what I am doing wrong?
EDIT: Thanks for all the help.
Your memory access pattern is the problem here. You are getting only 50% efficiency (for both L1 and L2) because you are accessing consecutive regions of 16 floats, that is, 64 bytes, but the L1 transaction size is 128 bytes. This means that for every 64 bytes requested, 128 bytes must be loaded into L1 (and in consequence also into L2).
You also have a problem with shared memory bank conflicts, but that is currently not negatively affecting your global memory load efficiency.
You could solve the load efficiency problem in several ways. The easiest would be to change the x dimension block size to 32. If that is not an option, you could change the global memory data layout so that each pair of consecutive blockIdx.y values ([0, 1], [2, 3], etc.) maps to a contiguous memory block. If even that is not an option and you have to load the global data only once anyway, you could use non-cached global memory loads to bypass the L1. That would help because L2 uses 32-byte transactions, so your 64 bytes would be loaded in two L2 transactions without overhead.
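The 50% figure follows directly from the transaction sizes quoted above. Here is a small sketch of that arithmetic in plain C++; the function names are hypothetical, and the model assumes each request is contiguous and aligned to the transaction size.

```cpp
// Bytes actually moved when `requestedBytes` contiguous, aligned bytes are
// served in whole transactions of `transactionBytes` each.
int bytesFetched(int requestedBytes, int transactionBytes) {
    const int transactions =
        (requestedBytes + transactionBytes - 1) / transactionBytes;
    return transactions * transactionBytes;
}

// Load efficiency in percent: requested bytes divided by fetched bytes.
int efficiencyPercent(int requestedBytes, int transactionBytes) {
    return 100 * requestedBytes / bytesFetched(requestedBytes, transactionBytes);
}
```

A row of 16 floats (64 bytes) served through 128-byte L1 transactions gives 50%, a row of 32 floats (128 bytes) gives 100%, and 64 bytes served through 32-byte L2 transactions also gives 100% using two transactions.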

GPGPU - CUDA: global store efficiency

I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVidia's profiler (I am using CUDA 5 toolkit preview release, on a Fermi GPU).
As far as I understand, this value is the ratio of requested memory transactions to the actual number of transactions performed, and therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency both for global load and for global store, as expected:
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = input[offset];
output[offset] = tmp;
}
What I don't understand is why, when I start adding useful code between the input read and the output write, the global store efficiency begins to drop, even though I have not changed the memory write pattern or the thread block geometry. The global load efficiency stays at 100%, as I expect.
Could someone please shed some light on why this happens? I thought that, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and use a "coalescing-friendly" pattern, I should still get 100% whatever I do before it, but obviously I must be misunderstanding something about either the meaning of global store efficiency or the conditions for global store coalescing.
Thx,
EDIT :
Here is an example: if I use this code (just adding a "round" operation on input), global store efficiency drops from 100% to 95%
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = round(input[offset]);
output[offset] = tmp;
}
Unsure if this is the case, but round probably converts its argument to a double and if there is a register spilling, then each thread would access 8 bytes of memory, which would then be coerced into 4 bytes of tmp. Accessing 8 bytes would reduce the coalescing to half-warp.
However, I believe register spilling shouldn't happen since the number of local variables in your kernel is small. You could check with nvcc --ptxas-options=-v for the spill.
Ok, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result : 100% store efficiency in both cases.