Cuda: Access Block and Thread Count within Kernel? [duplicate] - cuda

I get what blockDim is, but I have a problem with gridDim. blockDim gives the size of the block, but what is gridDim? On the Internet it says gridDim.x gives the number of blocks in the x direction.
How can I know what blockDim.x * gridDim.x gives?
And how can I know how many gridDim.x values there are along the x direction?
For example, consider the code below:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
double temp = a[tid];
tid += blockDim.x * gridDim.x;
while (tid < count)
{
    if (a[tid] > temp)
    {
        temp = a[tid];
    }
    tid += blockDim.x * gridDim.x;
}
I know that tid starts with 0. The code then has tid+=blockDim.x * gridDim.x. What is tid now after this operation?

blockDim.x,y,z gives the number of threads in a block, in the particular direction
gridDim.x,y,z gives the number of blocks in a grid, in the particular direction
blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case)
block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to only create 1-D blocks and grids.
In the CUDA documentation, these variables are defined here
In particular, when the total number of threads in the x-dimension (gridDim.x*blockDim.x) is less than the size of the array I wish to process, it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must move to the next unprocessed location, which is given by tid += blockDim.x*gridDim.x. In effect, the entire grid of threads is jumping through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
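As a minimal sketch of such a grid-striding loop (the kernel name, the scale factor, and the parameter names are placeholders for illustration, not from the question):
__global__ void scale(float *data, int n, float factor)
{
    // Each thread starts at its unique global index and then strides by the
    // total number of threads in the grid, so any n is covered regardless of
    // how many blocks and threads were actually launched.
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        data[i] *= factor;
    }
}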
You might want to consider taking some introductory CUDA webinars, for example the first 4 units. It would be 4 hours well spent if you want to understand these concepts better.

Paraphrased from the CUDA Programming Guide:
gridDim: This variable contains the dimensions of the grid.
blockIdx: This variable contains the block index within the grid.
blockDim: This variable contains the dimensions of the block.
threadIdx: This variable contains the thread index within the block.
You seem to be a bit confused about the thread hierarchy that CUDA has; in a nutshell, for a kernel there will be 1 grid (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2); would have 10*10*2 total blocks. In turn, each block is a 3-dimensional cube of threads.
With that said, it's common to only use the x-dimension of the blocks and grids, which is what the code in your question appears to be doing. This is especially relevant if you're working with 1D arrays. In that case, threadIdx.x + blockIdx.x * blockDim.x is in effect the unique index of each thread within your grid, and blockDim.x * gridDim.x (the quantity added by your tid += blockDim.x * gridDim.x line) is the total number of threads in the grid. This is because blockDim.x is the size of each block and gridDim.x is the total number of blocks.
So if you launch a kernel with parameters
dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);
then in your kernel, if you had threadIdx.x + blockIdx.x*blockDim.x, you would effectively have:
threadIdx.x ranging over [0, 128)
blockIdx.x ranging over [0, 10)
blockDim.x equal to 128
gridDim.x equal to 10
Hence in calculating threadIdx.x + blockIdx.x*blockDim.x, you would have values within the range defined by [0, 128) + 128 * [0, 10), which means your tid values would range over {0, 1, 2, ..., 1279}.
This is useful for when you want to map threads to tasks, as this provides a unique identifier for all of your threads in your kernel.
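As a minimal sketch of that one-to-one mapping (the kernel name square and the parameter n are illustrative, not from the question), the usual pattern is to compute the unique index once and guard it with a bounds check in case the grid is larger than the data:
__global__ void square(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // unique id across the whole grid
    if (tid < n)                                     // guard against the extra threads
        data[tid] = data[tid] * data[tid];
}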
However, if you have
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;
then you'll essentially have tid = [0, 128) + 128 * [0, 10) + (128 * 10), and your tid values would range over {1280, 1281, ..., 2559}.
I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is central to any kernel launch: when you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. As long as you don't exceed your hardware limits (for modern cards, a maximum of 2^10 threads per block and 2^16 - 1 blocks per grid dimension), you're free to use whatever mapping suits your problem.

In this source code, even though we only have 4 threads, the kernel function can access all 10 elements of the array. How?
#include <stdio.h>

#define N 10 //(33*1024)

__global__ void add(int *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // unique thread index within the grid
    if (tid < N)
        c[tid] = 1;
    while (tid < N)
    {
        c[tid] = 1;
        tid += blockDim.x * gridDim.x; // jump ahead by the total number of threads
    }
}

int main(void)
{
    int c[N];
    int *dev_c;
    cudaMalloc( (void**)&dev_c, N*sizeof(int) );
    for(int i=0; i<N; ++i)
    {
        c[i] = -1;
    }
    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
    add<<< 2, 2 >>>(dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );
    for(int i=0; i< N; ++i)
    {
        printf("c[%d] = %d \n", i, c[i] );
    }
    cudaFree( dev_c );
}
Why don't we just create 10 threads, e.g. add<<<2,5>>> or add<<<5,2>>>?
Because we have to create a reasonably small number of threads even when N is much larger than 10, e.g. 33*1024.
This source code is an example of that case: the array has 10 elements, but there are only 4 CUDA threads.
How can all 10 elements be accessed by only 4 threads?
See the CUDA documentation for the detailed meaning of threadIdx, blockIdx, blockDim, and gridDim.
In this source code:
gridDim.x : 2, the number of blocks in the x direction
gridDim.y : 1, the number of blocks in the y direction
blockDim.x : 2, the number of threads per block in the x direction
blockDim.y : 1, the number of threads per block in the y direction
So our total number of threads is 4, because 2 blocks * 2 threads per block.
In the add kernel function, the threads start out with indices 0, 1, 2, 3:
tid = threadIdx.x + blockIdx.x * blockDim.x
① 0 + 0*2 = 0
② 1 + 0*2 = 1
③ 0 + 1*2 = 2
④ 1 + 1*2 = 3
How do they access the remaining indices 4, 5, 6, 7, 8, 9?
There is a calculation in the while loop:
tid += blockDim.x * gridDim.x
** thread with initial tid 0 **
- loop 1: 0 + 2*2 = 4
- loop 2: 4 + 2*2 = 8
- loop 3: 8 + 2*2 = 12 (12 >= N, so the while loop exits)
** thread with initial tid 1 **
- loop 1: 1 + 2*2 = 5
- loop 2: 5 + 2*2 = 9
- loop 3: 9 + 2*2 = 13 (13 >= N, loop exits)
** thread with initial tid 2 **
- loop 1: 2 + 2*2 = 6
- loop 2: 6 + 2*2 = 10 (10 >= N, loop exits)
** thread with initial tid 3 **
- loop 1: 3 + 2*2 = 7
- loop 2: 7 + 2*2 = 11 (11 >= N, loop exits)
So every index 0, 1, 2, ..., 9 is reached by some tid value.
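If you want to watch this happen, here is a small hedged sketch (not part of the original source) that uses in-kernel printf to show which indices each of the 4 threads touches when launched as <<<2, 2>>>:
#include <stdio.h>

__global__ void show_mapping(void)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x; // 0, 1, 2 or 3 for a <<<2,2>>> launch
    while (tid < 10)                                 // N = 10, as in the example above
    {
        printf("block %d, thread %d handles index %d\n", blockIdx.x, threadIdx.x, tid);
        tid += blockDim.x * gridDim.x;               // stride of 2*2 = 4
    }
}

int main(void)
{
    show_mapping<<<2, 2>>>();
    cudaDeviceSynchronize(); // flush the device-side printf output
    return 0;
}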
For more detail, refer to this page:
http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html

First, see the figure "Grid of Thread Blocks" in the official CUDA documentation.
Usually, we write a kernel like this:
__global__ void kernelname(...){
    const int id_x = blockDim.x * blockIdx.x + threadIdx.x;
    const int id_y = blockDim.y * blockIdx.y + threadIdx.y;
    ...
}
// invoke the kernel
// assume we have assigned proper values to gridsize and blocksize
kernelname<<<gridsize, blocksize>>>(...);
The meaning of some variables:
gridsize: the number of blocks per grid, corresponding to gridDim
blocksize: the number of threads per block, corresponding to blockDim
threadIdx.x varies in [0, blockDim.x)
blockIdx.x varies in [0, gridDim.x)
So, let's try to calculate the index in the x direction when we have threadIdx.x and blockIdx.x. According to the figure, blockIdx.x determines which block you are in, and threadIdx.x determines which thread you are within that block. Hence, we have:
which_blk = blockDim.x * blockIdx.x; // starting offset of the block you are in
final_index_x = which_blk + threadIdx.x; // add threadIdx.x to get the final location within the grid
that is:
final_index_x = blockDim.x * blockIdx.x + threadIdx.x;
which is the same as the sample code above.
Similarly, we can get the index at y or z direction respectively.
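For example, a sketch of a 2-D kernel that writes into a row-major array (the kernel name and the width and height parameters are illustrative):
__global__ void kernel2d(float *data, int width, int height)
{
    const int id_x = blockDim.x * blockIdx.x + threadIdx.x; // column
    const int id_y = blockDim.y * blockIdx.y + threadIdx.y; // row
    if (id_x < width && id_y < height)
        data[id_y * width + id_x] = 0.0f; // row-major: row * width + column
}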
As you can see, we usually don't need gridDim in this kind of code, because that information is already expressed by the range of blockIdx. In contrast, we do have to use blockDim, even though the corresponding information also appears as the range of threadIdx; the reason is shown step by step above.
I hope this answer may help resolve your confusion.

Related

Summing up elements in array using managedCuda

Problem Description
I am trying to get a kernel that sums up all elements of an array to work. The kernel is intended to be launched with 256 threads per block and an arbitrary number of blocks. The length of the array passed in as a is always a multiple of 512; in fact, it is #blocks * 512. One block of the kernel should sum up 'its' 512 elements (256 threads can sum up 512 elements using this algorithm), storing the result in out[blockIdx.x]. The final summation over the values in out, and therefore over the results of the blocks, will be done on the host.
This kernel works fine for up to 6 blocks, meaning up to 3072 elements. But launching it with more than 6 blocks results in the first block calculating a strictly greater, wrong result than the other blocks (e.g. out = {572, 512, 512, 512, 512, 512, 512}). This wrong result is reproducible; the wrong value is the same across multiple executions.
I guess this means there is a structural error somewhere in my code which has something to do with blockIdx.x, but the only use of that is to calculate blockStart, and that calculation seems to be correct, also for the first block.
I verified that my host code computes the correct number of blocks for the kernel and passes in an array of the correct size. That's not the problem.
Of course I read a lot of similar questions here on Stack Overflow, but none seems to describe my problem (see e.g. here or here).
The kernel is called via managedCuda (C#); I don't know if this might be a problem.
Hardware
I use an MX150 with the following specifications:
Revision Number: 6.1
Total global memory: 2147483648
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Max Threads per block: 1024
Max Blocks: 2147483648
Number of multiprocessors: 3
Code
Kernel
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x * 2;
    int i = tid + blockStart;
    int leftSumElementIdx = blockStart + tid * 2;
    a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    __syncthreads();
    if (tid < 128)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 64)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 32)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 16)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 8)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 4)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid < 2)
    {
        a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
    }
    __syncthreads();
    if (tid == 0)
    {
        out[blockIdx.x] = a[blockStart] + a[blockStart + 1];
    }
}
Kernel Invocation
//Get the cuda kernel
//PathToPtx and MangledKernelName must be replaced
CudaContext cntxt = new CudaContext();
CUmodule module = cntxt.LoadModule("pathToPtx");
CudaKernel vectorReduceAsSumKernel = new CudaKernel("MangledKernelName", module, cntxt);
//Get an array to reduce
float[] array = new float[4096];
for(int i = 0; i < array.Length; i++)
{
array[i] = 1;
}
//Calculate execution info for the kernel
int threadsPerBlock = 256;
int numOfBlocks = array.Length / (threadsPerBlock * 2);
//Memory on the device
CudaDeviceVariable<float> m_d = array;
CudaDeviceVariable<float> out_d = new CudaDeviceVariable<float>(numOfBlocks);
//Give the kernel necessary execution info
vectorReduceAsSumKernel.BlockDimensions = threadsPerBlock;
vectorReduceAsSumKernel.GridDimensions = numOfBlocks;
//Run the kernel on the device
vectorReduceAsSumKernel.Run(out_d.DevicePointer, m_d.DevicePointer);
//Fetch the result
float[] out_h = out_d;
//Sum up the partial sums on the cpu
float sum = 0;
for(int i = 0; i < out_h.Length; i++)
{
sum += out_h[i];
}
//Verify the correctness
if(sum != 4096)
{
throw new Exception("Thats the wrong result!");
}
Update:
The very helpful and only answer addressed all my problems. Thank you! The problem was an unforeseen race condition.
Important Hint:
In the comments, the author of managedCuda pointed out that all NPP methods are indeed already implemented in managedCuda (using ManagedCuda.NPP.NPPsExtensions;). I wasn't aware of that, and I guess neither are many people reading this question.
You are not correctly incorporating into your code the idea that each block will process 512 elements out of your total array. According to my testing, you need to make at least 2 changes to fix this:
In the kernel, you have incorrectly calculated the starting point for each block:
int blockStart = blockDim.x * blockIdx.x;
since blockDim.x is 256, but each block processes 512 elements, you must multiply this by 2. (the multiplication by 2 in your calculation of leftSumElementIdx doesn't take care of this -- since it is only multiplying tid).
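In other words, the corrected line (which the updated kernel in the question already reflects) is:
int blockStart = blockDim.x * blockIdx.x * 2; // each block owns 2 * blockDim.x elements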
In your host code, your number of blocks calculation is incorrect:
vectorReduceAsSumKernel.GridDimensions = array.Length / threadsPerBlock;
for a value of 2048 for array.Length and a value of 256 for threadsPerBlock, this creates 8 blocks. But as you already indicate, your intention is to launch four blocks (2048/512). So you need to multiply the denominator by 2:
vectorReduceAsSumKernel.GridDimensions = array.Length / (2*threadsPerBlock);
In addition, your reduction sweep pattern is broken. It is warp-execution-order dependent, to give the proper result, and CUDA does not specify a warp execution order.
To see why, let's take a simple example. Let's consider just a single threadblock, with a starting point of the array being all 1, just as you have initialized it.
Now, warp 0 consists of threads 0-31. Your reduction sweep operation is like this:
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
So each thread in warp 0 will collect two other values and add them, and store them. Thread 31 will take the values a[62] and a[63] and add them together. If the values of a[62] and a[63] are still 1, as initialized, then this will work as expected. But the values of a[62] and a[63] are written to by warp 1, consisting of threads 32-63. So if warp 1 executes before warp 0 (perfectly legal), then you will get a different result. This is a global memory race condition. It is arising due to the fact that your input array is both the source and destination of your intermediate results, and __syncthreads() will not sort this out for you. It doesn't force warps to execute in any particular order.
One possible solution is to fix your sweep pattern. On any given reduction cycle, let's have a sweep pattern where each thread writes and reads values that are not touched by any other thread during that cycle. The following adaptation of your kernel code accomplishes that:
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
    int tid = threadIdx.x;
    int blockStart = blockDim.x * blockIdx.x * 2;
    int i = tid + blockStart;
    for (int j = blockDim.x; j > 0; j >>= 1) {
        if (tid < j)
            a[i] += a[i + j];
        __syncthreads();
    }
    if (tid == 0)
    {
        out[blockIdx.x] = a[i];
    }
}
For general purpose reductions, this is still a very slow method. This tutorial covers how to write faster reductions. And, as already pointed out, managedCuda may have methods to avoid writing a kernel at all.
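As one illustration of the kind of faster reduction that tutorial describes (a sketch only, with assumed names warpReduceSum and blockReduceSum, not drop-in code for the question), the warp-shuffle intrinsics available since CUDA 9 let each warp reduce in registers and avoid most of the shared-memory traffic; the per-block partial sums would still be combined in a second kernel or on the host:
__inline__ __device__ float warpReduceSum(float val)
{
    // Each iteration adds the value held by the lane 'offset' positions above.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val; // lane 0 ends up with the sum of the whole warp
}

__global__ void blockReduceSum(const float *in, float *out, int n)
{
    __shared__ float warpSums[32];                  // one slot per warp (up to 1024 threads)
    float sum = 0.0f;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        sum += in[i];                               // grid-stride accumulation

    sum = warpReduceSum(sum);                       // reduce within each warp
    if ((threadIdx.x & 31) == 0)
        warpSums[threadIdx.x >> 5] = sum;           // lane 0 of each warp stores its result
    __syncthreads();

    if (threadIdx.x < 32) {                         // first warp reduces the per-warp sums
        int numWarps = blockDim.x >> 5;             // assumes blockDim.x is a multiple of 32
        sum = (threadIdx.x < numWarps) ? warpSums[threadIdx.x] : 0.0f;
        sum = warpReduceSum(sum);
        if (threadIdx.x == 0)
            out[blockIdx.x] = sum;                  // one partial sum per block
    }
}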

CUDA Reduction: Warp Unrolling (School)

I am currently working on a project in which I am unrolling the last warp of a reduction. I have finished the code below; however, some modifications were made by guessing, and I'd like an explanation of why they work. The code I have written is only the function kernel4
// in is input array, out is where to store result, n is number of elements from in
// T is a float (32bit)
__global__ void kernel4(T *in, T *out, unsigned int n)
which is a reduction algorithm, the rest of the code was already provided.
Code:
#include <stdlib.h>
#include <stdio.h>
#include "timer.h"
#include "cuda_utils.h"
typedef float T;
#define N_ (8 * 1024 * 1024)
#define MAX_THREADS 256
#define MAX_BLOCKS 64
#define MIN(x,y) ((x < y) ? x : y)
#define tid threadIdx.x
#define bid blockIdx.x
#define bdim blockDim.x
#define warp_size 32
unsigned int nextPow2( unsigned int x ) {
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return ++x;
}
void getNumBlocksAndThreads(int whichKernel, int n, int maxBlocks, int maxThreads, int &blocks, int &threads)
{
if (whichKernel < 3) {
threads = (n < maxThreads) ? nextPow2(n) : maxThreads;
blocks = (n + threads - 1) / threads;
} else {
threads = (n < maxThreads*2) ? nextPow2((n + 1)/ 2) : maxThreads;
blocks = (n + (threads * 2 - 1)) / (threads * 2);
}
if (whichKernel == 5)
blocks = MIN(maxBlocks, blocks);
}
T reduce_cpu(T *data, int n) {
T sum = data[0];
T c = (T) 0.0;
for (int i = 1; i < n; i++)
{
T y = data[i] - c;
T t = sum + y;
c = (t - sum) - y;
sum = t;
}
return sum;
}
__global__ void
kernel4(T *in, T *out, unsigned int n)
{
    __shared__ volatile T d[MAX_THREADS];
    unsigned int i = bid * bdim + tid;
    n >>= 1;
    d[tid] = (i < n) ? in[i] + in[i+n] : 0;
    __syncthreads ();
    for(unsigned int s = bdim >> 1; s > warp_size; s >>= 1) {
        if(tid < s)
            d[tid] += d[tid + s];
        __syncthreads ();
    }
    if (tid < warp_size) {
        if (n > 64) d[tid] += d[tid + 32];
        if (n > 32) d[tid] += d[tid + 16];
        d[tid] += d[tid + 8];
        d[tid] += d[tid + 4];
        d[tid] += d[tid + 2];
        d[tid] += d[tid + 1];
    }
    if(tid == 0)
        out[bid] = d[0];
}
int main(int argc, char** argv)
{
T *h_idata, h_odata, h_cpu;
T *d_idata, *d_odata;
struct stopwatch_t* timer = NULL;
long double t_kernel_4, t_cpu;
int whichKernel = 4, threads, blocks, N, i;
if(argc > 1) {
N = atoi (argv[1]);
printf("N: %d\n", N);
} else {
N = N_;
printf("N: %d\n", N);
}
getNumBlocksAndThreads (whichKernel, N, MAX_BLOCKS, MAX_THREADS, blocks, threads);
stopwatch_init ();
timer = stopwatch_create ();
h_idata = (T*) malloc (N * sizeof (T));
CUDA_CHECK_ERROR (cudaMalloc (&d_idata, N * sizeof (T)));
CUDA_CHECK_ERROR (cudaMalloc (&d_odata, blocks * sizeof (T)));
srand48(time(NULL));
for(i = 0; i < N; i++)
h_idata[i] = drand48() / 100000;
CUDA_CHECK_ERROR (cudaMemcpy (d_idata, h_idata, N * sizeof (T), cudaMemcpyHostToDevice));
dim3 gb(blocks, 1, 1);
dim3 tb(threads, 1, 1);
kernel4 <<<gb, tb>>> (d_idata, d_odata, N);
cudaThreadSynchronize ();
stopwatch_start (timer);
kernel4 <<<gb, tb>>> (d_idata, d_odata, N);
int s = blocks;
while(s > 1) {
threads = 0;
blocks = 0;
getNumBlocksAndThreads (whichKernel, s, MAX_BLOCKS, MAX_THREADS, blocks, threads);
dim3 gb(blocks, 1, 1);
dim3 tb(threads, 1, 1);
kernel4 <<<gb, tb>>> (d_odata, d_odata, s);
s = (s + threads * 2 - 1) / (threads * 2);
}
cudaThreadSynchronize ();
t_kernel_4 = stopwatch_stop (timer);
fprintf (stdout, "Time to execute unrolled GPU reduction kernel: %Lg secs\n", t_kernel_4);
double bw = (N * sizeof(T)) / (t_kernel_4 * 1e9); // total bits / time
fprintf (stdout, "Effective bandwidth: %.2lf GB/s\n", bw);
CUDA_CHECK_ERROR (cudaMemcpy (&h_odata, d_odata, sizeof (T), cudaMemcpyDeviceToHost));
stopwatch_start (timer);
h_cpu = reduce_cpu (h_idata, N);
t_cpu = stopwatch_stop (timer);
fprintf (stdout, "Time to execute naive CPU reduction: %Lg secs\n", t_cpu);
if(abs (h_odata - h_cpu) > 1e-5)
fprintf(stderr, "FAILURE: GPU: %f CPU: %f\n", h_odata, h_cpu);
else
printf("SUCCESS: GPU: %f CPU: %f\n", h_odata, h_cpu);
return 0;
}
My first question is: when declaring
__shared__ volatile T d[MAX_THREADS];
I would like to verify my understanding of volatile. Volatile prevents compilers from incorrectly optimizing my code and promises that load/stores are completed through the cache and not just registers (please correct me if wrong). For reduction, if partial reduction sums are still stored in registers, why is this a problem?
My second question is: when doing the actual warp reduction
if (tid < warp_size) { // Final log2(32) = 5 strides
    if (n > 64) d[tid] += d[tid + 32];
    if (n > 32) d[tid] += d[tid + 16];
    d[tid] += d[tid + 8];
    d[tid] += d[tid + 4];
    d[tid] += d[tid + 2];
    d[tid] += d[tid + 1];
}
The reduction sum will yield incorrect results without (n > 64) and (n > 32) conditions. The results I get are:
FAILURE: GPU: 41.966557 CPU: 41.946209
Across 5 trials, the GPU reduction consistently yields an error of 0.0204. I am wary of concluding that this is just a floating-point rounding error.
To be honest as well, my teacher's assistant suggested adding the (n > 64) and (n > 32) conditions but did not explain why they fix the code.
Since n in my trials is over 64, why do these conditionals change the results? I am having difficulty tracing the problem because I cannot use print functions like I would on a CPU.
Let's start with a few preface comments before we tackle your two questions:
I encourage you to read NVIDIA's canonical reduction tutorial
Reductions written like this make several assumptions, one of which is that the block size is a power-of-2 (for "correctness").
Your code is using warp-synchronous programming at the final reduction stage. You appear to know what you are doing, so I won't provide a detailed description of that, but it is certainly relevant for understanding here. You can google it and get descriptions if needed. It is relevant to the discussion below, but I'm not going to call out its relevance in each situation.
OK, now your questions:
I would like to verify my understanding of volatile. Volatile prevents compilers from incorrectly optimizing my code and promises that load/stores are completed through the cache and not just registers (please correct me if wrong). For reduction, if partial reduction sums are still stored in registers, why is this a problem?
Regarding a definition of volatile, I would refer you to the CUDA programming guide. I have seen summary descriptions referring to this as preventing a register optimization or preventing reordering of loads and stores. I prefer the former and will use that as a working definition.
The basic idea is that volatile forces any reference (read or write) to that variable to actually go to the memory subsystem. By this I mean it will perform a read or write, and will not attempt to use a value previously loaded into a register. Without this qualifier, the compiler is free to load a value once (for example) from the actual memory location, and then maintain that value (and any updates to it) in a register, for as long as it deems appropriate. Compilers do this with an eye toward performance. (As an aside, note that you used the word "cache" here. I would avoid that usage here. Shared memory has no cache interposed between it and the processor load/store mechanism.)
Without volatile in this type of warp-synchronous coding, we will run into a problem if we allow the compiler to "optimize" (i.e. maintain) intermediate values into registers. This primarily comes about due to inter-thread communication. To see clearly why, let's look at the last 2 steps in your final reduction:
d[tid] += d[tid + 2];
d[tid] += d[tid + 1];
Let's consider just threads whose tid values are 0-1. In the second-last step, thread 0 will pick up the d[2] value and add it to the d[0] value, while thread 1 will pick up the d[3] value and add it to the d[1] value. At this point, if we don't use volatile, the compiler is not obligated to write the d[1] value accumulated by thread 1 back out to shared memory. It is allowed to maintain that in a register. So the d[1] value as seen in shared memory is not "up-to-date".
Now let's go to the last step. In this step, thread 0 reads the d[1] value from shared memory and adds it to the d[0] value. But without volatile, we saw in the previous step that the shared memory contents of d[1] are no longer accurate. OTOH, if we use volatile, then the write to shared memory in the previous step will actually take place, and in the final step, thread 0 will pick up the correct value when it reads d[1]. A CUDA thread is a standalone model. By that, I mean that one thread cannot directly access values contained in registers belonging to another thread. So inter-thread communication at the warp level will normally be accomplished either through shared memory, or via warp-shuffle operations.
__syncthreads() has a similar behavior: it forces all register-optimized values like this to be written out to memory, so that they are "visible" to other threads in the block. Therefore, a more sophisticated optimization would be to only switch to a volatile qualified pointer when the reduction switches from the loop-driven __syncthreads() based reduction to the final warp-synchronous reduction. You can see an example in the tutorial slides I linked at the beginning of this answer.
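Here is a hedged sketch of that optimization, adapted from the kernel in the question (T, MAX_THREADS, tid, bid, bdim and warp_size as defined there): the shared array itself is not declared volatile, and only the warp-synchronous tail goes through a volatile pointer. It still relies on the warp-synchronous assumption discussed next.
__global__ void kernel4_volatile_tail(T *in, T *out, unsigned int n)
{
    __shared__ T d[MAX_THREADS];           // no volatile: this phase is guarded by __syncthreads()
    unsigned int i = bid * bdim + tid;
    n >>= 1;
    d[tid] = (i < n) ? in[i] + in[i + n] : 0;
    __syncthreads();
    for (unsigned int s = bdim >> 1; s > warp_size; s >>= 1) {
        if (tid < s)
            d[tid] += d[tid + s];
        __syncthreads();                   // the barrier makes the updates visible to all threads
    }
    if (tid < warp_size) {
        volatile T *vd = d;                // volatile view only for the warp-synchronous tail
        if (n > 64) vd[tid] += vd[tid + 32];
        if (n > 32) vd[tid] += vd[tid + 16];
        vd[tid] += vd[tid + 8];
        vd[tid] += vd[tid + 4];
        vd[tid] += vd[tid + 2];
        vd[tid] += vd[tid + 1];
        if (tid == 0)
            out[bid] = vd[0];              // read the final value through the volatile view
    }
}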
As another aside, warp-synchronous programming of this kind is (more officially) deprecated in CUDA 9. Instead, you should use cooperative groups.
The reduction sum will yield incorrect results without (n > 64) and (n > 32) conditions.
These conditionals are primarily used because the code is designed to be "correct" for any block configuration that has a power-of-2 size. If we assume that the block size (number of threads per block) is a power of 2 and greater than 64, it must be 128 or larger, for example. In the later reduction passes, getNumBlocksAndThreads chooses the block size to be roughly half of n, and the kernel then halves n itself:
n >>= 1;
Therefore, if we want to ensure the correctness of this line of code:
d[tid] += d[tid + 32];
then we should only apply that operation when the thread block size is 64 (at least) which is like saying that n is greater than 64:
if (n > 64) d[tid] += d[tid + 32];
regarding this question, the claim is made that the posted code behaves differently if the if (n > 64) is included or not. The reason for this is that the posted code includes a loop which recalculates thread count and block count as the reduction proceeds:
int s = blocks;
while(s > 1) {
threads = 0;
blocks = 0;
getNumBlocksAndThreads (whichKernel, s, MAX_BLOCKS, MAX_THREADS, blocks, threads);
This loop eventually results in a block size that is smaller than 128, which means the omission of the if conditions leads to breakage. (Simply print out the threads variable during this loop to see this.)
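For instance, adding a host-side printf right after the recalculation (illustrative only) makes the shrinking block size visible:
getNumBlocksAndThreads (whichKernel, s, MAX_BLOCKS, MAX_THREADS, blocks, threads);
printf("next pass: s = %d -> blocks = %d, threads = %d\n", s, blocks, threads);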
regarding this:
I am having difficulty tracing back the problem because I cannot use print functions like I would in a CPU.
I'm not sure what the problem is there. printf should work from within kernel code.
Shared variables cannot have an initializer as part of their declaration, according to this answer. So if n < 64, we add some random (uninitialized) shared memory data to the sum, which causes the error.

What is the general way to launch appropriate amount of reduction kernels?

As I have read in NVIDIA's presentation at this link http://www.cuvilib.com/Reduction.pdf, for arrays bigger than blockSize, I should launch multiple reduction kernels to achieve global synchronization. What is the general way to determine how many times I should launch the reduction kernel? I tried the approach below, but I need to cudaMalloc 2 additional pointers, which takes a lot of processing time.
My job is to reduce the array d_logLuminance to one minimum value, min_logLum.
void your_histogram_and_prefixsum(const float* const d_logLuminance,
float &min_logLum,
const size_t numRows,
const size_t numCols)
{
const dim3 blockSize(512);
unsigned int pixel = numRows*numCols;
const dim3 gridSize(pixel/blockSize.x+1);
//Reduction kernels to find max and min value
float *d_tempMin, *d_min;
checkCudaErrors(cudaMalloc((void**) &d_tempMin, sizeof(float)*pixel));
checkCudaErrors(cudaMalloc((void**) &d_min, sizeof(float)*pixel));
checkCudaErrors(cudaMemcpy(d_min, d_logLuminance, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
dim3 subGrid = gridSize;
for(int reduceLevel = pixel; reduceLevel > 0; reduceLevel /= blockSize.x) {
checkCudaErrors(cudaMemcpy(d_tempMin, d_min, sizeof(float)*pixel, cudaMemcpyDeviceToDevice));
reduceMin<<<subGrid,blockSize,blockSize.x*sizeof(float)>>>(d_tempMin, d_min);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
subGrid.x = subGrid.x / blockSize.x + 1;
}
checkCudaErrors(cudaMemcpy(&min_logLum, d_min, sizeof(float), cudaMemcpyDeviceToHost));
std::cout<< "Min value = " << min_logLum << std::endl;
checkCudaErrors(cudaFree(d_tempMin));
checkCudaErrors(cudaFree(d_min));
}
And if you are curious, here is my reduction kernel:
__global__
void reduceMin(const float* const g_inputRange,
float* g_outputRange)
{
extern __shared__ float sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
sdata[tid] = g_inputRange[i];
__syncthreads();
for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
if (tid < s){
sdata[tid] = min(sdata[tid],sdata[tid+s]);
}
__syncthreads();
}
if(tid == 0){
g_outputRange[blockIdx.x] = sdata[0];
}
}
There are many ways to skin the cat, but if you want to minimize kernel launches, it can always be done with at most two kernel launches.
The first kernel launch is composed of up to as many blocks as the number of threads per block that your device supports. Newer devices will support 1024, older devices 512.
Each of these (at most 512 or 1024) blocks in the first kernel will participate in a grid-looping sum of all the data elements in global memory.
Each of these blocks will then do a partial reduction and write a partial result to global memory. There will be 512 or 1024 of these partial results.
The second kernel launch will be composed of 512 or 1024 threads in a single block. Each thread will pick up one of the partial results from global memory, and then the threads in that single block will cooperatively reduce the partial results to a single final result, and write it back to global memory.
The "grid-looping sum" is described in reduction #7 here as "multiple add/thread". All of the reductions described in this document are available in the NVIDIA reduction sample code

Cuda gridDim and blockDim


OpenCl equivalent of finding Consecutive indices in CUDA

In CUDA, to cover multiple blocks and thus increase the range of indices for arrays, we do something like this:
Host side Code:
dim3 dimgrid(9,1);   // total 9 blocks will be launched
dim3 dimBlock(16,1); // each block has 16 threads
                     // total no. of threads in the grid is thus 16 x 9 = 144
Device side code
...
...
idx=blockIdx.x*blockDim.x+threadIdx.x;// idx will range from 0 to 143
a[idx]=a[idx]*a[idx];
...
...
What is the equivalent in OpenCL for achieving the above?
On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance:
size_t global_work_size[1] = { 144 }; // 16 * 9 == 144
size_t local_work_size[1] = { 16 };
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
global_work_size, local_work_size,
0, NULL, NULL);
In your kernel, use:
size_t get_global_size(uint dim);
size_t get_global_id(uint dim);
size_t get_local_size(uint dim);
size_t get_local_id(uint dim);
to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.
The equivalent of your idx will thus be simply size_t idx = get_global_id(0);
See the OpenCL Reference Pages.
Equivalences between CUDA and OpenCL are:
blockIdx.x*blockDim.x+threadIdx.x = get_global_id(0)
LocalSize = blockDim.x
GlobalSize = blockDim.x * gridDim.x
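For completeness, a minimal OpenCL C kernel matching the CUDA device code in the question might look like this (illustrative sketch only):
__kernel void square(__global float *a)
{
    size_t idx = get_global_id(0); // ranges over 0..143 for the launch shown above
    a[idx] = a[idx] * a[idx];
}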