Blocks-per-grid allocation habit in CUDA

There is one common habit I've seen in CUDA examples when they compute the grid size. The following is an example:
int
main(){
    ...
    int numElements = 50000;
    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    ...
}
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}
What I am curious about is the initialization of blocksPerGrid. I don't understand why it is
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
rather than the straightforward
int blocksPerGrid = numElements / threadsPerBlock;
It seems to be quite a common habit; I've seen it in various projects, and they all do it this way.
I am new to CUDA. Any explanation of the reasoning behind this is welcome.

The calculation is done the way you see to allow for cases where numElements isn't a round multiple of threadsPerBlock.
For example, using threadsPerBlock = 256 and numElements = 500:
(numElements + threadsPerBlock - 1) / threadsPerBlock = (500 + 255) / 256 = 2
whereas
numElements / threadsPerBlock = 500 / 256 = 1
In the first case, 512 threads are run, covering all 500 elements in the input data; in the second case, only 256 threads are run, leaving 244 input items unprocessed.
Note also this kind of "guard" code in the kernel:
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
    ... // access input here
}
is essential to prevent any of the extra threads from performing out-of-bounds memory operations.
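The idiom is just an integer ceiling division, and you will often see it wrapped in a small helper. A minimal sketch (the divUp name is my own, not something from the sample above):
// Integer ceiling division: smallest number of blocks that covers n elements.
inline int divUp(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// int blocksPerGrid = divUp(numElements, threadsPerBlock);  // divUp(50000, 1024) == 49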

Related

CUDA Dot Product Failing for Non-Multiples of 1024

I'm just looking for some help here when it comes to calculating the dot product of two arrays.
Let's say I set the array size to 2500 and the max thread count per block to 1024.
In essence, I want to calculate the dot product of each block, and then sum the dot products in another kernel function. I calculate the number of blocks as follows:
nblocks = (n + 1024 - 1) / 1024
So, nblocks = 3.
This is my kernel function:
// calculate the dot product block by block
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    if (i < n)
        s[tIdx] = a[i] * b[i];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
I call the kernel:
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);
And everything works! Well, almost.
d_c, which has 3 elements (each one the dot product of a block), is thrown off on the last element:
d_c[0] = correct
d_c[1] = correct
d_c[2] = some massive number on the order of 10^18
Can someone point out why this is occurring? It only seems to work for multiples of 1024. So... 2048, 3072, etc... Am I iterating over null values or stack overflowing?
Thank you!
Edit:
// host vectors
float* h_a = new float[n];
float* h_b = new float[n];
init(h_a, n);
init(h_b, n);
// device vectors (d_a, d_b, d_c)
float* d_a;
float* d_b;
float* d_c;
cudaMalloc((void**)&d_a, n * sizeof(float));
cudaMalloc((void**)&d_b, n * sizeof(float));
cudaMalloc((void**)&d_c, nblocks * sizeof(float));
// copy from host to device h_a -> d_a, h_b -> d_b
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
Initialization of the arrays (n elements each) is done in this function:
void init(float* a, int n) {
    float f = 1.0f / RAND_MAX;
    for (int i = 0; i < n; i++)
        a[i] = std::rand() * f; // [0.0f, 1.0f]
}
The basic problem here is that the sum reduction can only work correctly when you have a round power of two threads per block, with every entry in the shared memory initialised. That isn't a limitation in practice if you do something like this:
__global__ void dotProduct(const float* a, const float* b, float* c, int n){
    // store the product of a[i] and b[i] in shared memory
    // sum the products in shared memory
    // store the sum in c[blockIdx.x]
    __shared__ float s[ntpb];
    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // calc product
    s[tIdx] = 0.f;
    while (i < n) {
        s[tIdx] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride <<= 1) {
        if (tIdx % (2 * stride) == 0)
            s[tIdx] += s[tIdx + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0){
        c[blockIdx.x] = s[0];
    }
}
and run a power-of-two number of threads per block (i.e. 32, 64, 128, 256, 512 or 1024). The while loop accumulates multiple values per thread and stores that partial dot product in shared memory, with every entry containing either 0 or a valid partial sum, and then the reduction happens as normal. Instead of running as many blocks as the data size dictates, run only as many as will "fill" your GPU simultaneously (or one less than you think you require if the problem size is small). Performance will also be improved at larger problem sizes.
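A rough host-side sketch of that launch strategy (the factor of 32 blocks per multiprocessor is only an illustrative heuristic, not a tuned value):
// Size the grid to the GPU rather than to the data.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocksToFill = 32 * prop.multiProcessorCount;   // illustrative heuristic
int blocksNeeded = (n + ntpb - 1) / ntpb;
int nblocks = blocksNeeded < blocksToFill ? blocksNeeded : blocksToFill;
dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);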
If you haven't already seen it, here is a very instructive whitepaper written by Mark Harris from NVIDIA on step by step optimisation of the basic parallel reduction. I highly recommend reading it.

Faster array copy when using fewer threads in CUDA

I tested two different approaches to copy a 2D array in a CUDA kernel.
The first one launches blocks of TILE_DIM x TILE_DIM threads. Each block copies a tile of the array, assigning one thread per element:
__global__ void simple_copy(float *outdata, const float *indata){
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    outdata[y*width + x] = indata[y*width + x];
}
The second one is taken from the NVIDIA blog. It is similar to the previous kernel but uses TILE_DIM x BLOCK_ROWS threads per block. Each thread loops over multiple elements of the matrix:
__global__ void fast_copy(float *outdata, const float *indata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    for (int k = 0 ; k < TILE_DIM ; k += BLOCK_ROWS)
        outdata[(y+k)*width + x] = indata[(y+k)*width + x];
}
I ran a test to compare these two approaches.
Both kernels perform coalesced access to global memory, yet the second one seems to be noticeably faster.
The NVIDIA Visual Profiler confirms this.
So how does the second kernel manage to achieve a faster copy?
This is the complete code I used to test the kernels:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <conio.h>
#define TILE_DIM 32
#define BLOCK_ROWS 8
/* KERNELS */
__global__ void simple_copy(float *outdata, const float *indata){
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    outdata[y*width + x] = indata[y*width + x];
}
//###########################################################################
__global__ void fast_copy(float *outdata, const float *indata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    for (int k = 0 ; k < TILE_DIM ; k += BLOCK_ROWS)
        outdata[(y+k)*width + x] = indata[(y+k)*width + x];
}
//###########################################################################
/* MAIN */
int main(){
    float *indata, *dev_indata, *outdata1, *dev_outdata1, *outdata2, *dev_outdata2;
    cudaEvent_t start, stop;
    float time1, time2;
    int i, j, k;
    int n_iter = 100;
    int N = 2048;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 grid(N/TILE_DIM, N/TILE_DIM);
    dim3 threads1(TILE_DIM, TILE_DIM);
    dim3 threads2(TILE_DIM, BLOCK_ROWS);
    // Allocations
    indata   = (float *)malloc(N*N*sizeof(float));
    outdata1 = (float *)malloc(N*N*sizeof(float));
    outdata2 = (float *)malloc(N*N*sizeof(float));
    cudaMalloc( (void**)&dev_indata,   N*N*sizeof(float) );
    cudaMalloc( (void**)&dev_outdata1, N*N*sizeof(float) );
    cudaMalloc( (void**)&dev_outdata2, N*N*sizeof(float) );
    // Initialisation
    for(j=0 ; j<N ; j++){
        for(i=0 ; i<N ; i++){
            indata[i + N*j] = i + N*j;
        }
    }
    // Transfer to Device
    cudaMemcpy( dev_indata, indata, N*N*sizeof(float), cudaMemcpyHostToDevice );
    // Simple copy
    cudaEventRecord( start, 0 );
    for(k=0 ; k<n_iter ; k++){
        simple_copy<<<grid, threads1>>>(dev_outdata1, dev_indata);
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &time1, start, stop );
    printf("Elapsed time with simple copy: %f\n", time1);
    // Fast copy
    cudaEventRecord( start, 0 );
    for(k=0 ; k<n_iter ; k++){
        fast_copy<<<grid, threads2>>>(dev_outdata2, dev_indata);
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &time2, start, stop );
    printf("Elapsed time with fast copy: %f\n", time2);
    // Transfer to Host
    cudaMemcpy( outdata1, dev_outdata1, N*N*sizeof(float), cudaMemcpyDeviceToHost );
    cudaMemcpy( outdata2, dev_outdata2, N*N*sizeof(float), cudaMemcpyDeviceToHost );
    // Check for error
    float error = 0;
    for(j=0 ; j<N ; j++){
        for(i=0 ; i<N ; i++){
            error += outdata1[i + N*j] - outdata2[i + N*j];
        }
    }
    printf("error: %f\n", error);
    /* // Print the copied matrix
    printf("Copy\n");
    for(j=0 ; j<N ; j++){
        for(i=0 ; i<N ; i++){
            printf("%f\t", outdata1[i + N*j]);
        }
        printf("\n");
    }*/
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    free(indata);
    free(outdata1);
    free(outdata2);
    cudaFree(dev_indata);
    cudaFree(dev_outdata1);
    cudaFree(dev_outdata2);
    cudaDeviceReset();
    getch();
    return 0;
}
//###########################################################################
I think you will find the answer by comparing the microcode for the two kernels.
When I compile these kernels for SM 3.0, the compiler completely unrolls the loop in the second kernel (since it knows it will iterate 4x). That probably explains the performance difference - CUDA hardware can use registers to cover memory latency as well as instruction latency. Vasily Volkov did a terrific presentation "Better Performance At Low Occupancy" on the topic a couple years ago (https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf).
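To illustrate, the fully unrolled loop is roughly equivalent to this hand-written sketch (for TILE_DIM = 32 and BLOCK_ROWS = 8, i.e. four iterations):
__global__ void fast_copy_unrolled(float *outdata, const float *indata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;
    // With no loop bookkeeping, the four independent loads can be scheduled
    // early, so each thread keeps several memory transactions in flight.
    outdata[(y + 0*BLOCK_ROWS)*width + x] = indata[(y + 0*BLOCK_ROWS)*width + x];
    outdata[(y + 1*BLOCK_ROWS)*width + x] = indata[(y + 1*BLOCK_ROWS)*width + x];
    outdata[(y + 2*BLOCK_ROWS)*width + x] = indata[(y + 2*BLOCK_ROWS)*width + x];
    outdata[(y + 3*BLOCK_ROWS)*width + x] = indata[(y + 3*BLOCK_ROWS)*width + x];
}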
Launching threads costs some GPU time. Fewer threads doing more work per thread means less thread-launch overhead. That's why fast_copy() is faster.
But of course you still need enough threads and blocks to fully utilize the GPU.
In fact, the following blog post expands this idea further. It uses a fixed number of blocks/threads to process work of arbitrary size by means of grid-stride loops. Several advantages of this method are discussed:
https://developer.nvidia.com/content/cuda-pro-tip-write-flexible-kernels-grid-stride-loops
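For this copy example, a grid-stride version would look something like the sketch below (treating the matrix as a flat array of N*N floats; numSMs stands in for the device's multiprocessor count):
// Grid-stride copy: a fixed-size grid walks the whole array, each thread
// handling every (blockDim.x * gridDim.x)-th element.
__global__ void stride_copy(float *outdata, const float *indata, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        outdata[i] = indata[i];
    }
}

// Sized to the GPU rather than to the data, e.g.:
// stride_copy<<<8 * numSMs, 256>>>(dev_outdata1, dev_indata, N*N);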

Using CUDA Shared Memory to Improve Global Access Patterns

I have the following kernel to get the magnitude of a bunch of vectors:
__global__ void norm_v1(double *in, double *out, int n)
{
    const uint i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        double x = in[3*i], y = in[3*i+1], z = in[3*i+2];
        out[i] = sqrt(x*x + y*y + z*z);
    }
}
However, due to the packing of in as [x0,y0,z0,...,xn,yn,zn], it performs poorly, with the profiler indicating 32% global load efficiency. Repacking the data as [x0, x1, ..., xn, y0, y1, ..., yn, z0, z1, ..., zn] improves things greatly (with the offsets for x, y, and z changing accordingly). Runtime is down and efficiency is up to 100%.
However, this packing is simply not practical for my application. I therefore wish to investigate the use of shared memory. My idea is for each thread in a block to copy three values (blockDim.x apart) from global memory -- yielding coalesced access. Under the assumption of a maximum blockDim.x = 256 I came up with:
#define BLOCKDIM 256
__global__ void norm_v2(double *in, double *out, int n)
{
    __shared__ double invec[3*BLOCKDIM];
    const uint i = blockIdx.x * blockDim.x + threadIdx.x;
    invec[0*BLOCKDIM + threadIdx.x] = in[0*BLOCKDIM+i];
    invec[1*BLOCKDIM + threadIdx.x] = in[1*BLOCKDIM+i];
    invec[2*BLOCKDIM + threadIdx.x] = in[2*BLOCKDIM+i];
    __syncthreads();
    if (i < n)
    {
        double x = invec[3*threadIdx.x];
        double y = invec[3*threadIdx.x+1];
        double z = invec[3*threadIdx.x+2];
        out[i] = sqrt(x*x + y*y + z*z);
    }
}
However, this is clearly deficient when n % blockDim.x != 0, requires knowing the maximum blockDim in advance, and generates incorrect results for out[i > 255] when tested with n = 1024. How should I best remedy this?
I think this can solve the out[i > 255] problem:
__shared__ double shIn[3*BLOCKDIM];
const uint blockStart = blockIdx.x * blockDim.x;
shIn[0*blockDim.x+threadIdx.x] = in[ blockStart*3 + 0*blockDim.x + threadIdx.x];
shIn[1*blockDim.x+threadIdx.x] = in[ blockStart*3 + 1*blockDim.x + threadIdx.x];
shIn[2*blockDim.x+threadIdx.x] = in[ blockStart*3 + 2*blockDim.x + threadIdx.x];
__syncthreads();
double x = shIn[3*threadIdx.x];
double y = shIn[3*threadIdx.x+1];
double z = shIn[3*threadIdx.x+2];
out[blockStart+threadIdx.x] = sqrt(x*x + y*y + z*z);
As for n % blockDim.x != 0 I would suggest padding the input/output arrays with 0 to match the requirement.
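A host-side sketch of the padding idea (h_in and the other names here are placeholders of mine, not from the question):
int blockSize = 256;
int nPadded = ((n + blockSize - 1) / blockSize) * blockSize;   // round n up
double *d_in, *d_out;
cudaMalloc((void**)&d_in, 3 * nPadded * sizeof(double));
cudaMalloc((void**)&d_out, nPadded * sizeof(double));
cudaMemset(d_in, 0, 3 * nPadded * sizeof(double));             // zero-fill the padding
cudaMemcpy(d_in, h_in, 3 * n * sizeof(double), cudaMemcpyHostToDevice);
// launch with nPadded / blockSize blocks; the padded tail just computes sqrt(0) = 0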
If you dislike the BLOCKDIM macro, explore using extern __shared__ shArr[] and then passing a 3rd parameter to the kernel launch configuration:
norm_v2<<<gridSize,blockSize,dynShMem>>>(...)
Here dynShMem is the dynamic shared memory size (in bytes). This is an extra shared memory pool whose size is specified at run time, and it is where all extern __shared__ variables are placed.
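A sketch of how that looks for this kernel (norm_v3, d_in and d_out are placeholder names of mine; like the snippet above, it assumes n has been padded to a multiple of blockDim.x):
__global__ void norm_v3(double *in, double *out, int n)
{
    extern __shared__ double shIn[];   // sized at launch via the 3rd <<<...>>> parameter
    const uint blockStart = blockIdx.x * blockDim.x;
    shIn[0*blockDim.x + threadIdx.x] = in[blockStart*3 + 0*blockDim.x + threadIdx.x];
    shIn[1*blockDim.x + threadIdx.x] = in[blockStart*3 + 1*blockDim.x + threadIdx.x];
    shIn[2*blockDim.x + threadIdx.x] = in[blockStart*3 + 2*blockDim.x + threadIdx.x];
    __syncthreads();
    double x = shIn[3*threadIdx.x];
    double y = shIn[3*threadIdx.x + 1];
    double z = shIn[3*threadIdx.x + 2];
    out[blockStart + threadIdx.x] = sqrt(x*x + y*y + z*z);
}

// norm_v3<<<gridSize, blockSize, 3 * blockSize * sizeof(double)>>>(d_in, d_out, n);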
What GPU are you using? Fermi or Kepler might help your original code with their L1 caching.
If you don't want to pad your in array, or you end up doing similar trick somewhere else, you may want to consider implementing a device-side memcopy, something like this:
template <typename T>
__device__ void memCopy(T* destination, T* source, size_t numElements) {
    // assuming sizeof(T) is a multiple of sizeof(int)
    // assuming a one-dimensional kernel (only threadIdx.x and blockDim.x matter)
    size_t totalSize = numElements*sizeof(T)/sizeof(int);
    int* intDest = (int*)destination;
    int* intSrc = (int*)source;
    for (size_t i = threadIdx.x; i < totalSize; i += blockDim.x) {
        intDest[i] = intSrc[i];
    }
    __syncthreads();
}
It basically treats any array as an array of ints and copies the data from one location to another. You may want to replace the underlying int type with double or long long int if you work with 64-bit types only.
Then you can replace the copying lines with:
memCopy(invec, in+blockStart*3, min(blockDim.x, n-blockStart));

Correctly computing gridDim for a CUDA kernel

I expected to see numbers from 0.0 to 999.0, but instead I am getting some very weird and long numbers at some of the indices with the code below:
__global__ void kernel(double *res, int N)
{
    int i = (gridDim.y*blockIdx.y+
             blockIdx.x)*blockDim.x*blockDim.y+
            blockDim.y*threadIdx.y+threadIdx.x;
    if(i<N) res[i] = i;
}
void callGPU(int N)
{
    dim3 dimBlock(8, 8);
    dim3 dimGrid(2, 8);
    ...
    kernel<<<dimGrid, dimBlock>>>(res, N);
    ...
}
The same happens if I change dimGrid to (8,2) or (1,16), but if I change it to (16,1) then I get the indices right. Please can you show how to correctly set up the gridDim and indexing for this case, if possible for arbitrary N? Many thanks!
Your indexing pattern is wrong.
First, you should compute the thread's index in the x and y dimensions:
int i_x = blockIdx.x * blockDim.x + threadIdx.x;
int i_y = blockIdx.y * blockDim.y + threadIdx.y;
Then compute the pitch as the total number of threads in the x dimension:
int pitch = gridDim.x * blockDim.x;
Finally, you can compute your 1D index from the 2D grid:
int i = i_y * pitch + i_x;
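Putting those pieces together, here is a sketch of how the kernel could look (the launch shapes in the comment are just the ones from the question):
__global__ void kernel(double *res, int N)
{
    // global 2D thread coordinates
    int i_x = blockIdx.x * blockDim.x + threadIdx.x;
    int i_y = blockIdx.y * blockDim.y + threadIdx.y;
    // row pitch of the whole thread grid
    int pitch = gridDim.x * blockDim.x;
    // unique, contiguous 1D index
    int i = i_y * pitch + i_x;
    if (i < N) res[i] = i;
}

// Any grid/block shape whose total thread count covers N now works, e.g.:
// dim3 dimBlock(8, 8);
// dim3 dimGrid(2, 8);   // 16 blocks * 64 threads = 1024 >= 1000
// kernel<<<dimGrid, dimBlock>>>(res, N);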

My kernel only works in block (0,0)

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);
    // Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    // Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed.
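For reference, a sketch of a launch configuration that scales with Width, using the same ceiling-division idiom as in the first question (the 16x16 block shape is arbitrary, and the kernel would then also need an if (row < Width && col < Width) guard for sizes that are not multiples of the block dimensions):
dim3 dimBlock(16, 16, 1);
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,
             (Width + dimBlock.y - 1) / dimBlock.y,
             1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);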