Analyzing memory access coalescing of my CUDA kernel - cuda

I would like to read (BS_X+1)*(BS_Y+1) global memory locations by BS_x*BS_Y threads moving the contents to the shared memory and I have developed the following code.
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];
In my understanding, coalescing is the parallel equivalent of consecutive memory reads of sequential processing. How can I detect now if the global memory accesses are coalesced? I remark that there is an index jump from (i1,j1) to (i2,j2).
Thanks in advance.

I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows the code less exploits the coalescing. Here is the coalescing analyzer that you may find useful:
#include <stdio.h>
#include <malloc.h>
typedef struct dim3_t{
int x;
int y;
} dim3;
// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16
// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))
int main(){
// fixed dim3 variables
// grid and block size
dim3 blockDim,gridDim;
blockDim.x=BLOCKDIMX;
blockDim.y=BLOCKDIMY;
gridDim.x=GRIDDIMX;
gridDim.y=GRIDDIMY;
// counters
int unq_accesses=0;
int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
int total_unq_accesses=0;
// iter over total number of threads
// and count the number of memory requests (the coalesced requests)
int I, II, III;
for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
dim3 blockIdx;
blockIdx.x = I%GRIDDIMX;
blockIdx.y = I/GRIDDIMX;
for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
if(II%COALESCINGWIDTH==0){
// new coalescing bunch
total_unq_accesses+=unq_accesses;
unq_accesses=0;
}
dim3 threadIdx;
threadIdx.x=II%BLOCKDIMX;
threadIdx.y=II/BLOCKDIMX;
////////////////////////////////////////////////////////
// Change this section to evaluate different accesses //
////////////////////////////////////////////////////////
// do your indexing here
#define BLOCK_SIZE_X BLOCKDIMX
#define BLOCK_SIZE_Y BLOCKDIMY
#define xdim 32
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
// calculate the accessed location and offset here
// change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
int size = sizeof(double);
//////////////////////////
// End of modifications //
//////////////////////////
printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
// check whether it can be merged with existing requests or not
short merged=0;
for(III=0; III<unq_accesses; III++){
if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
merged=1;
break;
}
}
if(!merged){
// new cache block accessed over this coalescing width
unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
unq_accesses++;
}
}
}
printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}
The code will run for every thread of the grid and calculates the number of merged requests, metric of memory access coalescing.
To use the analyzer, paste the index calculation portion of your code in the specified region and decompose the memory accesses (array) into 'address' and 'size'. I've already done this for your code where the indexings are:
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
and the memory access is:
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
The analyzer reports 4096 threads access to 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.

As GPUs have evolved, the requirements for getting coalesced accesses have become less restrictive. Your description of coalesced accesses is more accurate for the earlier GPU architectures than the more recent ones. In particular, Fermi (compute capability 2.0) significantly loosened the requirements. On Fermi and later, it is not important to access the memory locations consecutively. Instead, focus has shifted to accessing memory with as few memory transactions as possible. On Fermi, global memory transactions are 128 bytes wide. So, when the 32 threads in a warp hit an instruction that performs a load or store, 128-byte transactions will be scheduled to service all the threads in the warp. Performance then depends on how many transactions are necessary. If all the threads access values within a 128-byte area that is aligned to 128 bytes, a single transaction is necessary. If all the threads access values in different 128-byte areas, 32 transactions will be necessary. That would be the worst case scenario for servicing the requests for a single instruction in a warp.
You use one of the CUDA profilers to determine the average for how many transactions were required for servicing the requests. The number should be as close to 1 as possible. Higher numbers mean that you should see if there are opportunities for optimizing the memory accesses in your kernel.

The visual profiler is a great tool for checking your work. After you have a piece of code functionally correct, then run it from within the visual profiler. On linux for example, assuming you have an X session, just run nvvp from a terminal window. You will then be given a wizard which will prompt you for the application to profile along with any command line parameters.
The profiler will then do a basic run of your app to collect statistics. You can also select more advanced statistic gathering (requiring addtional runs), and one of these will be memory utilization statistics. It will report memory utilization as a percentage of peak and will also flag warnings for what it considers to be low utilization that merits your attention.
If you have a utilzation number above 50%, your app is probably running the way you expect. If you have a low number, you have probably missed some coalescing details. It will report statistics separately for memory reads and memory writes. To get 100% or close to it, you will also need to make sure that your coalesced reads and writes from the warp are aligned on 128 byte boundaries.
A common mistake in these situations is to use the threadIdx.y based variable to be the most rapidly changing index. It doesn't seem to me that you've made that error. e.g. it's a common mistake to do shared[threadIdx.x][threadIdx.y] since this is often the way we think about it in C. But threads are grouped together first in the x axis, so we want to use shared[threadIdx.y][threadIdx.x] or something similar. If you do make this mistake, your code can still be functionally correct but you will get low percentage utilization numbers in the profiler, like around 12% or even 3%.
And as already stated, to get above 50% and close to 100%, you will want to make sure that not only are all your thread requests are adjacent, but that they are aligned on a 128B boundary. Due to L1/L2 caches, these aren't hard and fast rules, but guidelines. The caches may mitigate some mistakes, to some degree.

Related

why unroll can not accelerate transpose matrix?

I am following a tutorial to learn cuda now and I learn that unroll a kernel function will accelerate the program. And it indeed works when I write a function which used to summarize a array.
But when I write a function used to transpose matrix following tutorial, it dosen't work.
The origin function like below:
__global__ void transform_matrix_read_col(
int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
int ix = threadIdx.x + blockDim.x * blockIdx.x;
int iy = threadIdx.y + blockDim.y * blockIdx.y;
int row_idx = iy*col_num + ix;
int col_idx = ix*row_num + iy;
if(ix < col_num && iy < row_num){
mat_b[row_idx] = mat_a[col_idx];
}
}
and unrool function:
__global__ void transform_matrix_read_col_unrool(
int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
int ix = threadIdx.x +(blockDim.x * blockIdx.x * 4);
int iy = threadIdx.y + blockDim.y * blockIdx.y;
int row_idx = iy*col_num + ix;
int col_idx = ix*row_num + iy;
if(ix < col_num && iy < row_num){
mat_b[row_idx] = mat_a[col_idx];
mat_b[row_idx + blockDim.x*1] = mat_a[col_idx + row_num*blockDim.x*1];
mat_b[row_idx + blockDim.x*2] = mat_a[col_idx + row_num*blockDim.x*2];
mat_b[row_idx + blockDim.x*3] = mat_a[col_idx + row_num*blockDim.x*3];
}
}
and the main function:
size_t width = 128 , height = 128,
array_size = width*height,array_bytes = array_size * sizeof(int);
int* matrix_data = nullptr,*output_data = nullptr;
cudaMallocHost(&matrix_data, array_bytes);
cudaMallocHost(&output_data, array_bytes);
util::init_array_int(matrix_data,array_size);//this func will random generate some integer
int* matrix_data_dev = nullptr,* output_matrix_dev = nullptr;
cudaMalloc(&matrix_data_dev, array_bytes);
cudaMemcpy(matrix_data_dev, matrix_data, array_bytes, cudaMemcpyHostToDevice);
cudaMalloc(&output_matrix_dev, array_bytes);
dim3 block(32,16);
dim3 grid((width-1)/block.x+1,(height-1)/block.y+1);
dim3 gridUnrool4((width-1)/(block.x*4)+1,(height-1)/block.y +1);
transform_matrix_read_col<<<grid,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
transform_matrix_read_col_unrool<<<gridUnrool4,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
and the staticstis of nsys(run on linux with a rtx 3090):
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- -------- ------- ------- ---------------------------------------------------------------------------
6.3 3,456 1 3,456.0 3,456 3,456 transform_matrix_read_col_unrool(int*, int*, unsigned long, unsigned long)
5.2 2,880 1 2,880.0 2,880 2,880 transform_matrix_read_col(int*, int*, unsigned long, unsigned long)
We can see that unrool version slower a lot.
But on the tutorial , it say that unroll will acclerate transpose actually.
So What cause this problem? And how to accelerate transpose matrix ?
Unrolling only help if the computation is compute bound so that a higher (useful) instruction throughput can decrease the execution time. Memory-bound code tends not to be much faster once unrolled because memory-bound instruction are slowed down by the contention of the memory controller.
A transposition may not seem memory-bound at first glance because of a low apparent memory throughput, but one need to care about cache lines. Indeed, when a single value is requested from memory from the user code, the hardware actually request a pretty big cache line for (subsequent) contiguous accesses to be fast.
Another consideration to take into account is that the code can also be latency bound. Indeed, the inefficient strided accesses can be slow due to the memory latency. The memory controller may not be able to fully saturate the RAM in this case (although this is quite unlikely on GPUs, especially regarding the large cache lines). If so, adding more instruction do not help because they are typically executed in an in-order way as opposed to modern mainstream CPUs. Using larger blocks and more blocks helps to provide more parallelism to the GPUs which can then perform more concurrent memory accesses and possibly better use the memory.
The key with the transposition is to make accesses as contiguous as possible and reuse cache lines. The most critical thing is to operate on small 2D blocks and not on full row/lines (ie. not a 1D kernel) to increase the cache locality. Moreover, one efficient well-known solution is to use the shared memory: each threads of a CUDA block fetch a part of a 2D array block and can then perform the transposition in shared memory possibly more efficiently. It is not so easy due to possible shared memory conflicts that can impact performance. Fortunately, there are few research papers and articles talking about that since the last decades.
The simplest efficient solution to this problem is basically to use cuBLAS which is heavily optimized. This post may also be useful.
Note that a 128x128 transposition is very small for a GPU. GPUs are designed to compute bigger datasets (or far more expensive computations on such small input). If the input array is initially stored on the CPU, then I strongly advise you to do that directly on the CPU as moving data on the GPU will likely be already slower than computing the transposition efficiently on the CPU directly. Indeed, data cannot be moved faster than the main RAM permit and a 128x128 transposition can be implemented in a way it saturate the main RAM (in fact, it can be likely done directly in the CPU caches that are significantly faster than the main RAM).

Implementing Max Reduce in Cuda

I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX; // 1
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
As you already discovered, your last line was not correct. Each block produces it's own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128, you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to it's initial value. This is actually unnecessary and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLOAT_MAX.
Again, we normally steer people way from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
I've uploaded the working file here: https://github.com/kenseehart/YAMR

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.
Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.

2D Finite Difference Time Domain (FDTD) in CUDA

I would like to write an electromagnetic 2D Finite Difference Time Domain (FDTD) code in CUDA language.
The C code for the update of the magnetic field is the following
// --- Update for Hy and Hx
for(int i=n1; i<=n2; i++)
for(int j=n11; j<=n21; j++){
Hy[i*ydim+j]=A[i*ydim+j]*Hy[i*ydim+j]+B[i*ydim+j]*(Ezx[(i+1)*ydim+j]-Ezx[i*ydim+j]+Ezy[(i+1)*ydim+j]-Ezy[i*ydim+j]);
Hx[i*ydim+j]=G[i*ydim+j]*Hx[i*ydim+j]-H[i*ydim+j]*(Ezx[i*ydim+j+1]-Ezx[i*ydim+j]+Ezy[i*ydim+j+1]-Ezy[i*ydim+j]);
}
}
My first parallelization attempt has been the following kernel:
__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h, double* A_h, double* B_h,double* G_h, double* H_h, int n1, int n2, int n11, int n21)
{
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
if ((idx <= n2 && idx >= n1)&&(idy <= n21 && idy >= n11)) {
Hy_h[idx*ydim+idy]=A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]+B_h[idx*ydim+idy]*(Ezx_h[(idx+1)*ydim+idy]-Ezx_h[idx*ydim+idy]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
Hx_h[idx*ydim+idy]=G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]-H_h[idx*ydim+idy]*(Ezx_h[idx*ydim+idy+1]-Ezx_h[idx*ydim+idy]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]); }
}
However, by also using the Visual Profiler, I have been unsatisfied by this solution for two reasons:
1) The memory accesses are poorly coalesced;
2) The shared memory is not used.
I then decided to use the following solution
__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h, double* A_h, double* B_h,double* G_h, double* H_h, int n1, int n2, int n11, int n21)
{
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
__shared__ double Ezx_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
__shared__ double Ezy_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
if (((blockIdx.x*BLOCK_SIZE_X+i1)<xdim)&&((blockIdx.y*BLOCK_SIZE_Y+j1)<ydim))
Ezx_h_shared[i1][j1]=Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i1)*ydim+(blockIdx.y*BLOCK_SIZE_Y+j1)];
if (((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))&&(((blockIdx.x*BLOCK_SIZE_X+i2)<xdim)&&((blockIdx.y*BLOCK_SIZE_Y+j2)<ydim)))
Ezx_h_shared[i2][j2]=Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i2)*xdim+(blockIdx.y*BLOCK_SIZE_Y+j2)];
__syncthreads();
if ((idx <= n2 && idx >= n1)&&(idy <= n21 && idy >= n11)) {
Hy_h[idx*ydim+idy]=A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]+B_h[idx*ydim+idy]*(Ezx_h_shared[i+1][j]-Ezx_h_shared[i][j]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
Hx_h[idx*ydim+idy]=G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]-H_h[idx*ydim+idy]*(Ezx_h_shared[i][j+1]-Ezx_h_shared[i][j]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]); }
}
The index trick is needed to make a block of BS_x * BS_y threads read (BS_x+1)*(BS_y+1) global memory locations to the shared memory.
I believe that this choice is better than the previous one, due to the use of the shared memory, although not all the accesses are really coalesced, see
Analyzing memory access coalescing of my CUDA kernel
My question is that if anyone of you can address me to a better solution in terms of coalesced memory access. Thank you.
Thank you for providing the profiling information.
Your second algorithm is better because you are getting a higher IPC. Still, on CC 2.0, max IPC is 2.0, so your average in the second solution of 1.018 means that only half of your available compute power is utilized. Normally, that means that your algorithm is memory bound, but I'm not sure in your case, because almost all the code in your kernel is inside if conditionals. A significant amount of warp divergence will affect performance, but I haven't checked if instructions which results are not used count towards the IPC.
You may want to look into reading through the texture cache. Textures are optimized for 2D spatial locality and better support semi-random 2D access. It may help your [i][j] type accesses.
In the current solution, make sure that it's the Y position ([j]) that changes the least between two threads with adjacent thread IDs (to keep memory accesses as sequential as possible).
It could be that the compiler has optimized this for you, but you recalculate idx*ydim+idy many times. Try calculating it once and reusing the result. That would have more potential for improvement if your algorithm was compute bound though.
I believe that in this case your first parallel solution is better because each thread reads each global memory array element only once. Therefore storing these arrays in shared memory doesn't bring you expected improvement.
It could speed up your program due to better coalesced access to global memory during storing data in shared memory but IMHO this is balanced by caching of global memory accesses if you are using Compute Capability 2.x and also using of shared memory can be downgraded due to bank conflicts.

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".