CUDA summation reduction puzzle - cuda

Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made, but allocation takes place "elsewhre" (e.g. in another C file context in general). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
...
sum<<<M,L>>>(sum, data);
}
I have yet another question? How many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.

The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). The declaration of shared memory using extern denotes to the compiler that shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
The second question is normally to launch the number of blocks which will completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels will accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two: That usually gives you 32, 64, 128, 256, 512 (or 1024 if you have a Fermi or Kepler GPU). It is a very finite search space, just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.

Related

Cuda Tensor Cores: What is the effect of NumBlocks and ThreadsPerBlock?

I am wondering what the effect of NumBlocks and ThreadsPerBlock on this simple matrix multiplication routine is
__global__ void wmma_matrix_mult(half *a, half *b, half *out) {
// Declare the fragments
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;
// Initialize the output to zero
wmma::fill_fragment(c_frag, 0.0f);
// Load the inputs
wmma::load_matrix_sync(a_frag, a, N);
wmma::load_matrix_sync(b_frag, b, N);
// Perform the matrix multiplication
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
// Store the output
wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}
Calling
`wmma_matrix_mult<<1, 1>>`: Incorrect
`wmma_matrix_mult<<1, 2>>`: Incorrect
`wmma_matrix_mult<<1, 4>>`: Incorrect
`wmma_matrix_mult<<1, 8>>`: Incorrect
`wmma_matrix_mult<<1, 16>>`: Incorrect
`wmma_matrix_mult<<1, 32>>`: Correct
Why does the number of threads per block even matter if every thread is doing then same execution? As you can see, I am not doing anything with threadIdx.x inside the kernel.
Tensor core operations happen at the warp level. The w in wmma signifies that. Referring to the documentation:
This requires co-operation from all threads in a warp.
Each tensorcore unit can accept one matrix multiply operation (i.e. wmma::mma_sync), from a warp, per clock cycle.
This means that a full warp (32 threads) must be available and participating, for the operation to make any sense (i.e. to be legal). All of the wmma:: operations are collective ops, which means that an entire warp is expected to be executing them, and is necessary for correct usage.
If you have multiple warps participating (e.g. a threadblock size of 64, or 128, etc.), you are effectively asking for multiple operations to be done, just like any other CUDA code.
Like any other CUDA code, launching an operation with multiple blocks is just a way to scale up the work being done, and of course is necessary if you want to engage the resources of a GPU that has multiple SMs. Since tensorcore units are a per-SM resource, this would be necessary to witness a CUDA GPU delivering anything approaching its full rated throughput for tensorcore ops.
Why does the number of threads per block even matter if every thread is doing then same execution?
Every thread is not doing the same thing. The wmma:: collective ops are hiding code under the hood that is specializing thread behavior according to which warp lane it belongs to. For example, the thread in warp lane 0 will select different elements of the fragment to associate with (i.e. load, store) than any thread in any other warp lane.

Influence of division operation in cuda kernel on number of registers per thread

Im was writing a program which includes a cuda kernel. I found that if you are using#define OPERATOR * one thread will use 11 registers, but I you will use #define OPERATOR / (division operator) one thread will use 52 registers!! Whats wrong? I must
decrease register number (I dot want to set maxregcount)! How can I decrease number of registers when Im using devision operator in cuda kernel?
#include <stdio.h>
#include <stdlib.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define OPERATOR /
__global__ void kernel(double* array){
for (int curEl=0;curEl<BLOCK_SIZE;++curEl){
array[curEl]=array[curEl] OPERATOR 10;
}
}
int main(void) {
double *devPtr=NULL,*data=(double*)malloc(sizeof(double)*BLOCK_SIZE);
cudaFuncAttributes cudaFuncAttr;
cudaFuncGetAttributes(&cudaFuncAttr,kernel);
for (int curElem=0;curElem<BLOCK_SIZE;++curElem){
data[curElem]=curElem;
}
cudaMalloc(&devPtr,sizeof(double)*BLOCK_SIZE);
cudaMemcpy(devPtr,data,sizeof(double)*BLOCK_SIZE,cudaMemcpyHostToDevice);
kernel<<<1,BLOCK_SIZE>>>(devPtr);
printf("1 thread needs %d regs\n",cudaFuncAttr.numRegs);
return 0;
}
The increase in register use when switching from a double-precision multiplication to a double-precision division in kernel computation is due to the fact that double-precision multiplication is a built-in hardware instruction, while double-precision division is a sizable called software subroutine (that is, a function call of sorts). This is easily verified by inspection of the generated machine code (SASS) with cuobjdump --dump-sass.
The reason that double-precision divisions (and in fact all divisions, including single-precision division and integer division) are emulated either by inline code or called subroutines is due to the fact that the GPU hardware has no direct support for division operations, in order to keep the individual computational cores ("CUDA cores") as simple and as small as possible, which ultimately leads to higher peak performance for a given size chip. It likely also improves the efficiency of the cores as measured by the GFLOPS/watt metric.
For release builds, the typical increase in register use caused by the introduction of double-precision division is around 26 registers. These additional registers are needed to store intermediate variables in the division computation, where each double-precision temporary variable requires two 32-bit registers.
As Marco13 points out in a comment above, it may be possible to manually replace division by multiplication with the reciprocal. However, this causes slight numerical differences in most cases, which is why the CUDA compiler does not apply this transformation automatically.
Generally speaking, register use can be controlled with compilation-unit granularity through the -maxrregcount nvcc compiler flag, or with per-function granularity using the __launch_bounds__ function attribute. However, forcing lower register use by more than a few registers below the level determined by the compiler frequently leads to register spilling in the generated code, which usually has a negative impact on kernel performance.

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.
First, as mentioned in this post, shared memory may declare in two different ways:
Either dynamically shared memory allocated, like the following
// Lunch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may use inside a kernel as mention:
extern __shared__ int s[];
Or static shared memory, which can use in kernel call like the following:
__shared__ int s[64];
Both are use for different reasons, however which one is better and why ?
Second, I'm running a multi blocks, 256 threads per block kernel. I'm using static shared memory in global and device kernels, both of them uses shared memory. An example is given:
__global__ void startKernel(float* p_d_array)
{
__shared double matA[3*3];
float a1 =0 ;
float a2 = 0;
float a3 = 0;
float b = p_d_array[threadidx.x];
a1 += reduce( b, threadidx.x);
a2 += reduce( b, threadidx.x);
a3 += reduce( b, threadidx.x);
// continue...
}
__device__ reduce ( float data , unsigned int tid)
{
__shared__ float data[256];
// do reduce ...
}
I'd like to know how the shared memory is allocated in such case. I presume each block receive its own shared memory.
What's happening when block # 0 goes into reduce function?
Does the shared memory is allocated in advance to the function call?
I call three different reduce device function, in such case, theoretically in block # 0 , threads # [0,127] may still execute ("delayed due hard word") on the first reduce call, while threads # [128,255] may operate on the second reduce call. In this case, I'd like to know if both reduce function are using the same shared memory?
Even though if they are called from two different function calls ?
On the other hand, Is that possible that a single block may allocated 3*256*sizeof(float) shared memory for both functions calls? That's seems superfluous in CUDA manners, but I still want to know how CUDA operates in such case.
Third, is that possible to gain higher performance in shared memory due to compiler optimization using
const float* p_shared ;
or restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter be it set by your code or by code generated by the compiler.
Re: 2nd, compiler will sum the shared memory requirement from the kernel function and functions called by kernel.

Why is cuda shared memory dynamic allocation not valid here?

This works fine:
a_size=FindSizeAtrunTime();
Kernel<<< gridDim, blockDim, a_size >>>(count)
But this shows error
__global__ void Kernel(int count_a, int count_b)
{
a_size=FindSizeAtrunTime();
__shared__ int a[a_size];
}
error: expression must have a constant value
In both cases size is being determined at runtime. So why the first case is ok, and not the second case?
The second is illegal on two levels.
Firstly, C++98 (which is what CUDA mostly derived from), doesn't allow statically declared, dynamically sized arrays. The language doesn't permit it, and so neither does CUDA.
Secondly, and more importantly, the size of dynamic shared memory allocations must be known before a kernel can be launched. The GPU must know how much shared memory to reserve before a block can be scheduled. That isn't possible in your second example, whereas it is in your first example.

Kernel calls in CUDA

I am new with CUDA, and I am confuse with the kernel calls.
When you call a Kernel method you specify the number of blocks and the thread per block, like this kernelMethod<<< block, Threads >>>(parameters);"
So why it is possible to use a 3rd parameter?
kernelMethod<<< block, Threads, ???>>>(parameters);
Using cudaDeviceProp you can read the number of thread per block in the variable maxThreadsPerBlock. But how can I know the maximum number of blocks?
Thanks!!
The third parameter specifies the amount of shared memory per block to be dynamically allocated. The programming guide provides additional detail about shared memory, as well as a description and example.
Shared memory can be allocated statically in a kernel:
__shared__ int myints[256];
or dynamically:
extern __shared__ int myints[];
In the latter case, it's necessary to pass as an additional kernel config parameter (the 3rd parameter you mention) the size of the shared memory to be allocated in bytes.
In that event, the pointer myints then points to the beginning of that dynamically allocated region.
The maximum number of blocks is specified per grid dimension (x, y, z) and can also be obtained through the device properties query. It is specified in the maxGridSize parameter. You may want to refer to the deviceQuery sample for a worked example.