Memory coalescing while implementing FDTD equations - cuda

I am trying to implement the FDTD equations on the GPU. I initially
implemented a kernel that used only global memory, but the memory
coalescing wasn't that great. Hence I implemented another kernel
that uses shared memory to load the values. I am working on a grid
of 1024x1024.
The code is below:
__global__ void update_Hx(float *Hx, float *Ez, float *coef1, float* coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];
    /*int top = offset + x_index_dim;*/
    if(threadIdx.y == (blockDim.y - 1)){
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
    }
    else{
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
    }
}
The constants BLOCKSIZE_HX = 16 and BLOCKSIZE_HY = 16.
When I run the visual profiler, it still says that the memory is not coalesced.
EDIT:
I am using a GT 520 graphics card with CUDA compute capability 2.1.
My Global L2 transactions / Access = 7.5, i.e. there are 245,760 L2 transactions for
32,768 executions of the line
Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
Global memory load efficiency is 50%.
Global memory load efficiency = 100 * gld_requested_throughput/ gld_throughput
I am not able to figure out why there are so many memory accesses, even though my threads are reading 16 consecutive values. Can somebody point out what I am doing wrong?
EDIT: Thanks for all the help.

Your memory access pattern is the problem here. You are getting only 50% efficiency (for both L1 and L2) because you are accessing consecutive regions of 16 floats, that is 64 bytes, but the L1 transaction size is 128 bytes. This means that for every 64 bytes requested, 128 bytes must be loaded into L1 (and consequently also into L2).
You also have a problem with shared memory bank conflicts, but that is currently not negatively affecting your global memory load efficiency.
You could solve the load efficiency problem in several ways. The easiest would be to change the x dimension block size to 32. If that is not an option, you could change the global memory data layout so that each pair of consecutive blockIdx.y values ([0, 1], [2, 3] etc.) maps to a contiguous memory block. If even that is not an option and you have to load the global data only once anyway, you could use non-cached global memory loads to bypass the L1. That would help because L2 uses 32-byte transactions, so your 64 bytes would be loaded in two L2 transactions without overhead.
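The 50% figure can be reproduced with a little host-side arithmetic. The sketch below is plain C++ rather than a kernel, the helper name load_efficiency is made up for illustration, and it assumes every request starts on a transaction boundary.

```cpp
#include <cstddef>

// Fraction of the transferred bytes that were actually requested.
// Hypothetical helper (not a CUDA API); assumes the request starts on a
// transaction boundary.
inline double load_efficiency(std::size_t requested_bytes,
                              std::size_t transaction_bytes) {
    // Round the request up to a whole number of transactions.
    std::size_t transactions =
        (requested_bytes + transaction_bytes - 1) / transaction_bytes;
    return static_cast<double>(requested_bytes) /
           static_cast<double>(transactions * transaction_bytes);
}
```

Under those assumptions, load_efficiency(64, 128) models a 16-float row going through 128-byte L1 lines and gives 0.5, while load_efficiency(64, 32) models the same row with 32-byte L2 transactions and gives 1.0. A 32-thread-wide row, load_efficiency(128, 128), also gives 1.0.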

Related

Global load transaction count when in coalesced memory access

I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is:
__global__
void copy_coalesced(float * d_in, float * d_out)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    d_out[tid] = d_in[tid];
}
When I run this with the following kernel configurations
#define BLOCKSIZE 32
int data_size = 10240; //always a multiple of the BLOCKSIZE
int gridSize = data_size / BLOCKSIZE;
copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
Since the data access in the kernel is fully coalesced, and since the data type is float (4 bytes), the expected number of load/store transactions can be found as follows:
Load Transaction Size = 32 bytes
Number of floats that can be loaded per transaction = 32 bytes / 4 bytes = 8
Number of transactions needed to load 10240 floats = 10240/8 = 1280 transactions
The same number of transactions is expected for writing the data as well.
But the nvprof metrics gave the following results:
gld_transactions 2560
gst_transactions 1280
gld_transactions_per_request 8.0
gst_transactions_per_request 4.0
I cannot figure out why loading the data takes twice as many transactions as it should need, even though both the load and store efficiency metrics report 100%.
What am I missing out here?
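The expected counts in the question can be checked with a small host-side calculation. This is only a sketch in plain C++; transactions_needed is a made-up helper name, and it assumes perfectly coalesced, aligned accesses.

```cpp
#include <cstddef>

// Number of fixed-size transactions needed to move `elements` values of
// `element_bytes` each, assuming fully coalesced and aligned accesses.
// Hypothetical helper for illustration only.
inline std::size_t transactions_needed(std::size_t elements,
                                       std::size_t element_bytes,
                                       std::size_t transaction_bytes) {
    std::size_t total_bytes = elements * element_bytes;
    return (total_bytes + transaction_bytes - 1) / transaction_bytes;
}
```

transactions_needed(10240, 4, 32) evaluates to 1280, matching the reported gst_transactions, and transactions_needed(32, 4, 32) evaluates to 4 per warp-wide request, matching gst_transactions_per_request. The reported load figures (2560 and 8.0) are exactly twice these values.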
I reproduced your results on Linux:
1  gld_transactions           Global Load Transactions          2560
1  gst_transactions           Global Store Transactions         1280
1  l2_tex_read_transactions   L2 Transactions (Texture Reads)   1280
1  l2_tex_write_transactions  L2 Transactions (Texture Writes)  1280
However, on Windows using Nsight Visual Studio Edition, I get values that appear to be better.
You may want to contact NVIDIA as it could simply be a display issue in nvprof.

Maximum threads per block vs shared memory size

Is there any relation between the size of the shared memory and the maximum number of threads per block? In my case I use max threads per block = 512; my program makes use of all the threads and uses a considerable amount of shared memory.
Each thread has to do a particular task repeatedly. For example my kernel might look like,
int threadsPerBlock = (blockDim.x * blockDim.y * blockDim.z);
int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;
for(j = 0; j <= N; j++) {
    tId = threadIdx.x + (j * threadsPerBlock);
    uniqueTid = bId*blockDim.x + tId;
    curand_init(uniqueTid, 0, 0, &seedValue);
    randomP = (float) curand_uniform( &seedValue );
    if(randomP <= input_value)
        /* Some task */
    else
        /* Some other task */
}
But my threads are not going into the next iteration (say j = 2). Am I missing something obvious here?
You have to distinguish between shared memory and global memory. The former is always allocated per block. The latter refers to the off-chip memory that is available on the GPU.
So generally speaking, the only relation to the thread count is this: the maximum amount of shared memory per block stays the same no matter how many threads the block has, so with more threads per block there is less shared memory available per thread.
Also refer to e.g. Using Shared Memory in CUDA C/C++.
There is no immediate relationship between the maximum number of threads per block and the size of the shared memory (not 'device memory' - they're not the same thing).
However, there is an indirect relationship, in that with different Compute Capabilities, both these numbers change:
Compute Capability               1.x      2.x - 3.x
Threads per block                512      1024
Max shared memory (per block)    16KB     48KB
As one of them has increased with newer CUDA devices, so has the other.
Finally, there is a block-level resource which is used up by launching more threads: the register file. There is a single register file which all threads of a block share, and the constraint is
ThreadsPerBlock x RegistersPerThread <= RegisterFileSize
It is not trivial to determine by reading the source how many registers your kernel code is using (compiling with nvcc --ptxas-options=-v reports the per-thread count); but as a rule of thumb, if you use "a lot" of local variables, function call parameters etc., you might hit the above limit, and will not be able to schedule as many threads.
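The register constraint above is easy to check on the host once the per-thread register count is known. The following is a minimal sketch in plain C++; the function name and the numbers used afterwards are illustrative assumptions, not properties of any particular device.

```cpp
// Checks ThreadsPerBlock x RegistersPerThread <= RegisterFileSize.
// Hypothetical helper; the caller supplies the register file size for the
// device in question.
inline bool block_fits_register_file(int threads_per_block,
                                     int registers_per_thread,
                                     int register_file_size) {
    // Widen before multiplying so large inputs cannot overflow int.
    long long needed =
        static_cast<long long>(threads_per_block) * registers_per_thread;
    return needed <= register_file_size;
}
```

For example, assuming a register file of 32768 registers and a kernel that uses 40 registers per thread, 512 threads need 20480 registers and fit, whereas 1024 threads need 40960 and do not.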

How to properly add in global memory in CUDA?

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I'm getting the x and y index of the current thread and using those to get the pixel in the image arrays I'm dealing with in the current thread. Then I calculate the absolute difference, wait for all the threads to finish calculating that, then if the current thread is within the block in the image we care about the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idx * cuda_Blocksize + idy;
    int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
    __syncthreads();
    if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
        atomicAdd( cuda_SAD, AD );
    }
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?
Your grid and block size settings look odd.
For image pixels we usually use settings similar to the following.
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You could refer to the following section of the CUDA C Programming Guide for more information on how to distribute workloads to CUDA threads.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
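The (imageROISize+15)/16 expression above is a round-up division, which can be written generically. A minimal host-side sketch in plain C++ follows; blocks_to_cover is a made-up name.

```cpp
// Number of blocks needed to cover `extent` pixels along one dimension when
// each block is `block` threads wide. Rounds up so that a final partial tile
// is still covered. Hypothetical helper for illustration.
inline int blocks_to_cover(int extent, int block) {
    return (extent + block - 1) / block;
}
```

blocks_to_cover(1000, 16) is 63, so the grid spans 63 * 16 = 1008 threads per dimension. The 8 surplus threads per dimension fall outside the image, so the kernel still needs a bounds check before it reads or writes any pixel.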
You really should show all the code and identify the GPU you are running on. At least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper cuda error checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
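The arithmetic in this answer can be reproduced on the host. The sketch below is plain C++ that mirrors the question's Blocksize/4 setup; the function names are made up, and the 512 limit is passed in as a parameter because it depends on the device.

```cpp
// Threads per block requested by the question's setup, which uses
// Blocksize/4 threads in both x and y (integer division, as in the question).
inline int threads_requested(int blocksize) {
    int side = blocksize / 4;
    return side * side;
}

// True if the requested block does not exceed the device's per-block limit.
inline bool launch_fits(int blocksize, int max_threads_per_block) {
    return threads_requested(blocksize) <= max_threads_per_block;
}
```

threads_requested(91) is 484 and threads_requested(92) is 529, so with a limit of 512 launch_fits(91, 512) is true and launch_fits(92, 512) is false, which lines up with the reported break between 91 and 92.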

How to properly coalesce writes from global memory into global memory?

Please bear with me; my English is not good.
My computing environment is:
CPU : Intel Xeon X5690 3.46 GHz x 2
OS : CentOS 5.8
GPU : NVIDIA GeForce GTX 580 (CC 2.0)
I have already read the material about "coalesced memory access" in the CUDA C Programming Guide,
but I can't apply it to my case.
I have 32x32 blocks per grid and 16x16 threads per block.
That corresponds to the following code:
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
How can I get coalesced memory access with this configuration?
I used the following code in the kernel:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because there are 512x512 threads in total (grid size x block size).
But Visual Profiler reports "Low Global Memory Store Efficiency [9.7% avg, for kernels accounting for 100% of compute]".
Its help text says to use coalesced memory access,
but I don't know how I should index the memory to achieve that.
For the detailed code, see: The result of an experiment different from CUDA Occupancy Calculator
Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz} varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
int tid_in_block = threadIdx.x + threadIdx.y * blockDim.x;
int bid_in_grid = blockIdx.x + blockIdx.y * gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;
global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
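To make the difference between the indexing schemes concrete, the following host-side sketch counts how many distinct 128-byte segments the first warp of block (0,0) touches under each scheme. It is plain C++ and only a simplified model, not a profiler: it assumes 4-byte words, a 16x16 block as in the question, and that a warp's store is served in units of 128-byte segments. All function names are made up for illustration.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct 128-byte segments (32 words of 4 bytes) touched by the
// first warp of block (0,0) when each thread stores one word at
// index_of(threadIdx.x, threadIdx.y). Simplified model for illustration.
template <typename IndexFn>
std::size_t segments_touched_by_first_warp(IndexFn index_of,
                                           int block_dim_x = 16,
                                           int warp_size = 32,
                                           int words_per_segment = 32) {
    std::set<long> segments;
    for (int lane = 0; lane < warp_size; ++lane) {
        // Threads are linearised with threadIdx.x varying fastest.
        int tx = lane % block_dim_x;
        int ty = lane / block_dim_x;
        segments.insert(index_of(tx, ty) / words_per_segment);
    }
    return segments.size();
}

// The indexing schemes from the question and answer, for blockIdx = (0,0).
inline long row_pitch_index(int tx, int ty) { return tx * 512L + ty; }  // i*512 + j
inline long col_pitch_index(int tx, int ty) { return tx + 512L * ty; }  // i + 512*j
inline long linear_index(int tx, int ty)    { return tx + ty * 16L; }   // tid_in_grid
```

In this model the original i*512+j scheme touches 16 segments per warp, the i+512*j variant touches 2, and the fully linear tid_in_grid scheme touches 1.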

Why is global + shared faster than global alone

I need some help understanding the behavior of Rob Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I don't understand how the use of shared memory gives faster performance than the non-shared-memory version. If I add a few more index calculation steps and another read/write cycle to access the shared memory, how can this be faster than using global memory alone? The same number of read/write cycles access global memory in either case. The data is still accessed only once per kernel instance. Data still goes in and out through global memory. The number of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster? (We are not removing any processing steps.) Essentially we are doing more work, and it is getting done faster.
Shared memory access is much faster than global memory access, but its cost is not zero (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    // Block until all threads in the block have written their data to shared mem
    __syncthreads();
    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the writer of the article was clearly talking about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
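The half-warp rule described above can be modelled on the host to see why the two kernels differ on early devices. This is a simplified sketch in plain C++, not the hardware's exact behaviour: it assumes 4-byte words, requires thread k to access word base + k with base aligned to 16 words, and otherwise charges 16 transactions. The function names and the launch dimensions used afterwards are hypothetical.

```cpp
#include <array>

// Simplified model of the compute capability 1.0/1.1 rule: a half warp's
// accesses collapse into one transaction only if thread k touches word
// base + k and base is aligned to 16 words; otherwise 16 are issued.
inline int half_warp_transactions_cc10(const std::array<long, 16>& word_index) {
    const long base = word_index[0];
    if (base % 16 != 0) return 16;
    for (int k = 0; k < 16; ++k) {
        if (word_index[k] != base + k) return 16;
    }
    return 1;
}

// Word indices written to d_out by the first half warp of block 0 in the
// "slow" kernel: out = outOffset + (blockDim.x - 1 - threadIdx.x).
inline std::array<long, 16> slow_write_indices(int block_dim, int grid_dim) {
    std::array<long, 16> idx{};
    const long out_offset = static_cast<long>(block_dim) * (grid_dim - 1);
    for (int t = 0; t < 16; ++t) idx[t] = out_offset + (block_dim - 1 - t);
    return idx;
}

// Same for the "fast" kernel: out = outOffset + threadIdx.x.
inline std::array<long, 16> fast_write_indices(int block_dim, int grid_dim) {
    std::array<long, 16> idx{};
    const long out_offset = static_cast<long>(block_dim) * (grid_dim - 1);
    for (int t = 0; t < 16; ++t) idx[t] = out_offset + t;
    return idx;
}
```

For an assumed launch of 4 blocks of 256 threads, the model charges 16 transactions per half warp for the slow kernel's descending writes and 1 for the fast kernel's ascending writes.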
This is because the shared memory is closer to the computing units, hence the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication).
But the most important reason is that a lot of the numbers in the tile are reused by many threads. If you access them from global memory, you retrieve those numbers multiple times. Writing them once to shared memory eliminates that wasted bandwidth.
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both reads and writes forwards. I think the fast code is faster because the cache hierarchy is optimized, in some way, for accessing global memory in ascending order (towards higher memory addresses).
CPUs do some speculative fetching, where they fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.