Thread management in the nbody code of the CUDA SDK

When I read the nbody code in the CUDA SDK, I went through some of its lines and found that it is a little different from the paper in GPU Gems 3, "Fast N-Body Simulation with CUDA".
My questions are: First, why is blockIdx.x still involved in loading from global to shared memory, as in the following code?
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line
        positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];                    //this line

    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);

    __syncthreads();
}
Isn't it supposed to be like this, according to the paper? I wonder why.
sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
    multithreadBodies ?
    positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
    positions[WRAP(tile, gridDim.x) * p + threadIdx.x];
Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Isn't it supposed to be a fixed value, or not involved at all, since the sum is only over threadIdx.y?
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line

    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x,i).x; //this line
            acc.y += SX_SUM(threadIdx.x,i).y; //this line
            acc.z += SX_SUM(threadIdx.x,i).z; //this line
        }
    }
}
Can anyone explain this to me? Is it some kind of optimization for faster code?

I am an author of this code and the paper. Numbered answers correspond to your numbered questions.
The blockIdx.x offset to the WRAP macro is not mentioned in the paper because this is a micro-optimization. I'm not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without the blockIdx.x offset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.
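A small illustration of the effect (a sketch only; the function below is a stand-in for the SDK's WRAP macro, which I assume here simply reduces its first argument modulo the second):

// Illustration only: a stand-in for WRAP(blockIdx.x + tile, gridDim.x),
// assuming WRAP behaves like a modulo reduction.
__host__ __device__ int tileLoadedAt(int blockIdxX, int iteration, int gridDimX)
{
    return (blockIdxX + iteration) % gridDimX;
}
// Iteration 0: block 0 -> tile 0, block 1 -> tile 1, block 2 -> tile 2, ...
// Every block still sweeps the same set of tiles, just starting at a different
// offset, so concurrently running blocks read different regions of global
// memory on the same iteration. Without the blockIdx.x offset, all blocks
// would load tile 0, then tile 1, ... in lockstep, hitting the same addresses.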
The sum is across threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to use threadIdx.x to index the right column of the (conceptually 2D) shared memory array.
To answer SystmD's question in his (not really correct) answer: gridDim.y is only 1 in the (default/common) 1D block case.

1)
The array sharedPos is loaded into the shared memory of each block (i.e. for each tile) before the threads of the block are synchronized (with __syncthreads()). According to the algorithm, blockIdx.x is the index of the tile.
Each thread (with indices threadIdx.x, threadIdx.y) loads one part of the shared array sharedPos; blockIdx.x refers to the index of the tile (in the case without multithreading).
2)
acc is the float3 acceleration of the body with index blockIdx.x * blockDim.x + threadIdx.x (see the beginning of the integrateBodies function).
I ran into trouble with multithreadBodies=true during this sum when q > 4 (128 bodies, p = 16, q = 8, gridx = 8, on a GTX 680): some sums were not carried out over the whole of blockDim.y.
I changed the code as follows to avoid that. It works, but I don't really know why...
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x;
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y;
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z;

    __syncthreads();

    for (int i = 0; i < blockDim.y; i++)
    {
        acc.x += SX_SUM(threadIdx.x,i).x;
        acc.y += SX_SUM(threadIdx.x,i).y;
        acc.z += SX_SUM(threadIdx.x,i).z;
    }
}
Another question:
In the first loop:
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
}
I don't know why blockIdx.y is used, since grid.y = 1.
3) For faster code, I use asynchronous H2D and D2D memory copies (my code only uses the gravitation kernel).
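For reference, here is a minimal sketch of the kind of asynchronous copy I mean (the helper name is my own; the host buffer must be allocated with cudaMallocHost, i.e. pinned, for the copy to actually overlap with other work):

#include <cuda_runtime.h>

// Sketch: queue a host-to-device upload and the kernel in the same stream,
// so the copy can overlap with work issued in other streams.
void uploadPositionsAsync(float4* h_pos, float4* d_pos, int numBodies, cudaStream_t stream)
{
    cudaMemcpyAsync(d_pos, h_pos, numBodies * sizeof(float4),
                    cudaMemcpyHostToDevice, stream);
    // ... launch the gravitation kernel in the same stream ...
    // cudaStreamSynchronize(stream) when the result is needed on the host.
}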

Related


Why do we need stride in CUDA kernel?

I was wondering: why does one need to use a grid-stride loop, like the following?
for (int i = index; i < ITERATIONS; i += stride)
{
    C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling the kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim.x = 4, and therefore stride will be equal to 20.
My point is that whenever one uses this approach, stride will always be equal to or greater than the number of elements in the calculation, so by the time the loop reaches the increment it will already be over.
And here is the question: why does one need to use a loop or a stride at all? Why not just run with index, like this?
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question: how can I know, in this particular case, how many threads are running on my GPU simultaneously before they "jump" to another portion of a very big array (e.g. 2,000,000 elements)?
My point is that whenever one uses this approach, stride will always be equal to or greater than the number of elements in the calculation, so by the time the loop reaches the increment it will already be over.
There lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device wide occupancy for your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident" and process more than one input/output pair per thread. The grid stride also preserves whatever memory coalescing and cache coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.
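To make that concrete, here is a minimal sketch (my own code, not the questioner's) of launching such a grid-stride kernel with only enough blocks to fill the device. Note that I pass n as a kernel parameter instead of the ITERATIONS constant from the question, and error checking is omitted:

#include <cuda_runtime.h>

__global__ void bench(const float *A, const float *B, float *C, int n)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)   // each resident thread handles many elements
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 2000000;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * sizeof(float));
    cudaMallocManaged(&B, n * sizeof(float));
    cudaMallocManaged(&C, n * sizeof(float));

    int blockSize = 256, numSMs = 0, blocksPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, bench, blockSize, 0);

    // Enough blocks to saturate the device; the grid-stride loop covers the rest of the array.
    bench<<<numSMs * blocksPerSM, blockSize>>>(A, B, C, n);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}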

Matrix Multiplication of matrix and its transpose in Cuda

I am relatively new to CUDA programming, so there are some unsolved issues for which I hope I can get some hints in the right direction.
The case is that I want to multiply a 2D array with its transpose; to be precise, I want to execute the operation A^T * A.
I have already used the cuBLAS Dgemm function, and now I am trying to do the same operation with a tiled algorithm, very similar to the one from the CUDA programming guide.
While the initial algorithm runs properly, I want to calculate only the upper triangular part of the product, hoping to achieve a better time for the operation, and I am not sure how to extract the tiles/blocks which will hold the respective elements.
If you could enlighten me on this or give me any hint, I would be grateful, because I have been stuck on that for a while.
This is the code of the kernel:
__shared__ double Ads1[TILE_WIDTH][TILE_WIDTH];
__shared__ double Ads2[TILE_WIDTH][TILE_WIDTH];

// block row and column
// we save them in registers for faster access
int by = blockIdx.y;
int bx = blockIdx.x;
int ty = threadIdx.y;
int tx = threadIdx.x;

int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
double Rvalue = 0;

if (row >= width || col >= width) return;

// Each thread block computes one sub-matrix Rsub of the result R
for (int i = 0; i < (int) ceil(((double) height / TILE_WIDTH)); ++i)
{
    Ads1[tx][ty] = Ad[(i * TILE_WIDTH + ty) * width + col];
    Ads2[tx][ty] = Ad[(i * TILE_WIDTH + tx) * width + row];

    __syncthreads();

    for (int j = 0; j < TILE_WIDTH; ++j)
    {
        if ((i * TILE_WIDTH + j) > height) break; // in order not to exceed the matrix's height
        Rvalue += Ads1[j][tx] * Ads2[ty][j];
    }

    __syncthreads();
}

Rd[row * width + col] = Rvalue;
You may want to use the batched dgemm API function described here, recursively dividing your output matrix into diagonal blocks and corner blocks. You also want to balance the smallest block size against the compute overhead, to avoid very small invocations. Finally, note that matrix multiplication becomes memory bound below a certain size, which on modern GPUs can be fairly large.
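If you stay with your tiled kernel instead, a simple way to restrict the work to the upper triangle (a sketch built on your own kernel, not the batched-dgemm approach above) is to drop whole output tiles that lie strictly below the diagonal; since the element (row, col) belongs to the upper triangle when col >= row, any tile with bx < by contains only lower-triangular elements:

    int by = blockIdx.y;
    int bx = blockIdx.x;

    // Output tiles strictly below the diagonal hold only lower-triangular
    // elements of the symmetric product, so the whole block can exit early.
    if (bx < by) return;

    // ... rest of the tiled kernel as before; tiles on the diagonal still
    // compute a few lower-triangular elements, which can simply be ignored.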

Tricky array arithmetics inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". In order to perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that writes a piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    // Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    // Load main data
#pragma unroll
    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
...
As far as I understand this code, each thread will calculate its own values of baseX and baseY, and after that all active threads will start to increment the pointers d_Src and d_Dst simultaneously.
So, according to my understanding, this would be correct if the arrays d_Src and d_Dst were in local memory (i.e. if each thread had its own copy of these arrays). But these arrays are in global device memory! So if all active threads increment the pointers, the result will be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer.
#include <iostream>

void foo(float* bar) {
    bar++; // modifies only the local copy of the pointer
}

int main() {
    float* test = 0;
    foo(test);
    std::cout << test << std::endl; // will print 0
}
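The same thing happens with the kernel parameters: d_Src and d_Dst are passed to convolutionRowsKernel by value, so each thread works on its own private copy and only moves that copy. A minimal sketch of the idea (my own kernel, not the SDK's):

// Each thread advances its private copies of the parameter pointers;
// the pointers held by the caller (and by other threads) are unchanged.
__global__ void offsetDemo(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    src += i;          // moves this thread's copy only
    dst += i;
    if (i < n)
        *dst = *src;   // equivalent to dst[i] = src[i]
}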

Cuda Kernel with reduction - logic errors for dot product of 2 matrices

I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to compute the dot product of two matrices, but I am getting the right answer only for matrices of size 2. For any other size I get it wrong.
This is only a test, so I am keeping the matrix size very small: only about 100 elements, so a single block fits it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code:
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my CUDA kernel:
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code : http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that, and take a look at why it fails for some other case, let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.
In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But this doesn't make sense and is not what we want at all. What we want is something like:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with the code below, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n smaller than the maximum thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
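For completeness, a minimal host-side sketch of how such a single-block launch might look (my own fragment, assuming n is no larger than the maximum block size and ha, hb, hc are the host arrays from the question; error checking omitted):

    // allocate and fill the device arrays
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    // one block of n threads, so blockDim.x == n inside the kernel
    sum_reduce<<<1, n>>>(da, db, dc, n);

    // the reduced dot product ends up in dc[0]
    cudaMemcpy(hc, dc, sizeof(float), cudaMemcpyDeviceToHost);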