Cuda, executional thread order in a 3d-block

Cuda, executional thread order in a 3d-block - cuda

As title, I would like to know the right execution order in case we have a 3d block
I think to remember that I read already something regarding it, but it was some time ago, I dont remember where but it was coming by someone who didnt look so reliable..
Anyway I would like to have some confirmations about it.
Is it as the following (divided in warps)?
[0, 0, 0]...[blockDim.x, 0, 0] - [0, 1, 0]...[blockDim.x, 1, 0] - (...) - [0, blockDim.y, 0]...[blockDim.x, blockDim.y, 0] - [0, 0, 1]...[blockDim.x, 0, 1] - (...) - [0, blockDim.y, 1]...[blockDim.x, blockDim.y, 1] - (...) - [blockDim.x, blockDim.y, blockDim.z]

Yes, that is the correct ordering; threads are ordered with the x dimension varying first, then y, then z (equivalent to column-major order) within a block. The calculation can be expressed as
int threadID = threadIdx.x +
blockDim.x * threadIdx.y +
(blockDim.x * blockDim.y) * threadIdx.z;
int warpID = threadID / warpSize;
int laneID = threadID % warpsize;
Here threadID is the thread number within the block, warpID is the warp within the block and laneID is the thread number within the warp.
Note that threads are not necessarily executed in any sort of predicable order related to this ordering within a block. The execution model guarantees that threads in the same warp are executed "lock-step", but you can't infer any more than that from the thread numbering within a block.

Related

Cuda: Access Block and Thread Count within Kernel? [duplicate]

I get what blockDim is, but I have a problem with gridDim. Blockdim gives the size of the block, but what is gridDim? On the Internet it says gridDim.x gives the number of blocks in the x coordinate.
How can I know what blockDim.x * gridDim.x gives?
How can I know that how many gridDim.x values are there in the x line?
For example, consider the code below:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
double temp = a[tid];
tid += blockDim.x * gridDim.x;
while (tid < count)
{
if (a[tid] > temp)
{
temp = a[tid];
}
tid += blockDim.x * gridDim.x;
}
I know that tid starts with 0. The code then has tid+=blockDim.x * gridDim.x. What is tid now after this operation?

blockDim.x,y,z gives the number of threads in a block, in the
particular direction
gridDim.x,y,z gives the number of blocks in a grid, in the
particular direction
blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case)
block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to only create 1-D blocks and grids.
In the CUDA documentation, these variables are defined here
In particular, when the total threads in the x-dimension (gridDim.x*blockDim.x) is less than the size of the array I wish to process, then it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must then move to the next unprocessed location, which is given by tid+=blockDim.x*gridDim.x; In effect, the entire grid of threads is jumping through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
You might want to consider taking an introductory CUDA webinars For example, the first 4 units. It would be 4 hours well spent, if you want to understand these concepts better.

Paraphrased from the CUDA Programming Guide:
gridDim: This variable contains the dimensions of the grid.
blockIdx: This variable contains the block index within the grid.
blockDim: This variable and contains the dimensions of the block.
threadIdx: This variable contains the thread index within the block.
You seem to be a bit confused about the thread hierachy that CUDA has; in a nutshell, for a kernel there will be 1 grid, (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2); would have 10*10*2 total blocks. In turn, each block is a 3-dimensional cube of threads.
With that said, it's common to only use the x-dimension of the blocks and grids, which is what it looks like the code in your question is doing. This is especially revlevant if you're working with 1D arrays. In that case, your tid+=blockDim.x * gridDim.x line would in effect be the unique index of each thread within your grid. This is because your blockDim.x would be the size of each block, and your gridDim.x would be the total number of blocks.
So if you launch a kernel with parameters
dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);
then in your kernel had threadIdx.x + blockIdx.x*blockDim.x you would effectively have:
threadIdx.x range from [0 ~ 128)
blockIdx.x range from [0 ~ 10)
blockDim.x equal to 128
gridDim.x equal to 10
Hence in calculating threadIdx.x + blockIdx.x*blockDim.x, you would have values within the range defined by: [0, 128) + 128 * [1, 10), which would mean your tid values would range from {0, 1, 2, ..., 1279}.
This is useful for when you want to map threads to tasks, as this provides a unique identifier for all of your threads in your kernel.
However, if you have
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;
then you'll essentially have: tid = [0, 128) + 128 * [1, 10) + (128 * 10), and your tid values would range from {1280, 1281, ..., 2559}
I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is pretty central to any kernel launch, and you're the one who determines how it should be done. When you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. As long as you don't exceed your hardware limits (for modern cards, you can have a maximum of 2^10 threads per block and 2^16 - 1 blocks per grid)

In this source code, we even have 4 threds, the kernel function can access all of 10 arrays. How?
#define N 10 //(33*1024)
__global__ void add(int *c){
int tid = threadIdx.x + blockIdx.x * gridDim.x;
if(tid < N)
c[tid] = 1;
while( tid < N)
{
c[tid] = 1;
tid += blockDim.x * gridDim.x;
}
}
int main(void)
{
int c[N];
int *dev_c;
cudaMalloc( (void**)&dev_c, N*sizeof(int) );
for(int i=0; i<N; ++i)
{
c[i] = -1;
}
cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
add<<< 2, 2>>>(dev_c);
cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );
for(int i=0; i< N; ++i)
{
printf("c[%d] = %d \n" ,i, c[i] );
}
cudaFree( dev_c );
}
Why we do not create 10 threads ex) add<<<2,5>>> or add<5,2>>>
Because we have to create reasonably small number of threads, if N is larger than 10 ex) 33*1024.
This source code is example of this case.
arrays are 10, cuda threads are 4.
How to access all 10 arrays only by 4 threads.
see the page about meaning of threadIdx, blockIdx, blockDim, gridDim in the cuda detail.
In this source code,
gridDim.x : 2 this means number of block of x
gridDim.y : 1 this means number of block of y
blockDim.x : 2 this means number of thread of x in a block
blockDim.y : 1 this means number of thread of y in a block
Our number of thread are 4, because 2*2(blocks * thread).
In add kernel function, we can access 0, 1, 2, 3 index of thread
->tid = threadIdx.x + blockIdx.x * blockDim.x
①0+0*2=0
②1+0*2=1
③0+1*2=2
④1+1*2=3
How to access rest of index 4, 5, 6, 7, 8, 9.
There is a calculation in while loop
tid += blockDim.x + gridDim.x in while
** first call of kernel **
-1 loop: 0+2*2=4
-2 loop: 4+2*2=8
-3 loop: 8+2*2=12 ( but this value is false, while out!)
** second call of kernel **
-1 loop: 1+2*2=5
-2 loop: 5+2*2=9
-3 loop: 9+2*2=13 ( but this value is false, while out!)
** third call of kernel **
-1 loop: 2+2*2=6
-2 loop: 6+2*2=10 ( but this value is false, while out!)
** fourth call of kernel **
-1 loop: 3+2*2=7
-2 loop: 7+2*2=11 ( but this value is false, while out!)
So, all index of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 can access by tid value.
refer to this page.
http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html
I cannot upload image, because low reputation.

First, see this figure Grid of thread blocks from the CUDA official document
Usually, we use kernel as this:
__global__ void kernelname(...){
const id_x = blockDim.x * blockIdx.x + threadIdx.x;
const id_y = blockDim.y * blockIdx.y + threadIdx.y;
...
}
// invoke kernel
// assume we have assigned the proper gridsize and blocksize
kernelname<<<gridsize, blocksize>>>(...)
The meaning of some variables:
gridsize the number of blocks per grid, corresponding to the gridDim
blocksize the number of threads per block, corresponding to the blockDim
threadIdx.x varies in [0, blockDim.x)
blockIdx.x varies in [0, gridDim.x)
So, let's try to calculate the index at the x direction when we have threadIdx.x and blockIdx.x. According to the figure, blockIdx.x determines which block you are, and threadIdx.x determines which thread you are when the location of block is given. Hence, we have:
which_blk = blockDim.x * blockIdx.x; // which block you are
final_index_x = which_blk + threadIdx.x; // based on the given block, we can have the final location by adding the threadIdx.x
that is:
final_index_x = blockDim.x * blockIdx.x + threadIdx.x;
which is same as the sample code above.
Similarly, we can get the index at y or z direction respectively.
As we can see, we usually don't use gridDim in our code, because this information is performed as the range of blockIdx. On the contrary, we have to use blockDim although this information is performed as the range of threadIdx. The reason I have shown above step by step.
I hope this answer may help resolve your confusion.

Summing up elements in array using managedCuda

Problem Description
I try to get a kernel summing up all elements of an array to work. The kernel is intended to be launched with 256 threads per block and an arbitary number of blocks. The length of the array passsed in as a is always a multiple of 512, in fact it is #blocks * 512. One block of the kernel should sum up 'its' 512 elements (256 threads can sum up 512 elements using this algorithm), storing the result in out[blockIdx.x]. The final summation over the values in out ,and therefore the results of the blocks, will be done on the host.
This kernel works fine for up to 6 blocks, meaning up to 3072 elements. But launching it with more than 6 blocks result in the first block calculating a strictly greater, wrong result than the other blocks (i. e. out = {572, 512, 512, 512, 512, 512, 512}), this wrong result is reproducable, the wrong value is the same for multiple executions.
I guess this means there is a structural error somewhere in my code, which has something to do with blockIdx.x, but the only use this is to calculate blockStart, and this seams to be a correct calculation, also for the first block.
I verified if my host code computes the correct number of blocks for the kernel and passes in an array of correct size. That's not the problem.
Of course I read a lot of similar questions here on stackoverflow, but none seems to describe my problem (See i. e. here or here)
The kernel is called via managedCuda (C#), I don't know if this might be a problem.
Hardware
I use a MX150 with the follwing specifications:
Revision Number: 6.1
Total global memory: 2147483648
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Max Threads per block: 1024
Max Blocks: 2147483648
Number of multiprocessors: 3
Code
Kernel
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
int tid = threadIdx.x;
int blockStart = blockDim.x * blockIdx.x * 2;
int i = tid + blockStart;
int leftSumElementIdx = blockStart + tid * 2;
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
__syncthreads();
if (tid < 128)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if(tid < 64)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid < 32)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid < 16)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid < 8)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid < 4)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid < 2)
{
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
}
__syncthreads();
if (tid == 0)
{
out[blockIdx.x] = a[blockStart] + a[blockStart + 1];
}
}
Kernel Invocation
//Get the cuda kernel
//PathToPtx and MangledKernelName must be replaced
CudaContext cntxt = new CudaContext();
CUmodule module = cntxt.LoadModule("pathToPtx");
CudaKernel vectorReduceAsSumKernel = new CudaKernel("MangledKernelName", module, cntxt);
//Get an array to reduce
float[] array = new float[4096];
for(int i = 0; i < array.Length; i++)
{
array[i] = 1;
}
//Calculate execution info for the kernel
int threadsPerBlock = 256;
int numOfBlocks = array.Length / (threadsPerBlock * 2);
//Memory on the device
CudaDeviceVariable<float> m_d = array;
CudaDeviceVariable<float> out_d = new CudaDeviceVariable<float>(numOfBlocks);
//Give the kernel necessary execution info
vectorReduceAsSumKernel.BlockDimensions = threadsPerBlock;
vectorReduceAsSumKernel.GridDimensions = numOfBlocks;
//Run the kernel on the device
vectorReduceAsSumKernel.Run(out_d.DevicePointer, m_d.DevicePointer);
//Fetch the result
float[] out_h = out_d;
//Sum up the partial sums on the cpu
float sum = 0;
for(int i = 0; i < out_h.Length; i++)
{
sum += out_h[i];
}
//Verify the correctness
if(sum != 4096)
{
throw new Exception("Thats the wrong result!");
}
Update:
The very helpfull and only answer did address all my problems. Thank you! The problem was an unforeseen race condition.
Important Hint:
In the comments the author of managedCuda pointed out all NPPs methods are indeed already implmented in managedCuda (using ManagedCuda.NPP.NPPsExtensions;). I wasn't aware of that, and i guess so are many people reading ths question.

You are not correctly incorporating into your code the idea that each block will process 512 elements out of your total array. According to my testing, you need to make at least 2 changes to fix this:
In the kernel, you have incorrectly calculated the starting point for each block:
int blockStart = blockDim.x * blockIdx.x;
since blockDim.x is 256, but each block processes 512 elements, you must multiply this by 2. (the multiplication by 2 in your calculation of leftSumElementIdx doesn't take care of this -- since it is only multiplying tid).
In your host code, your number of blocks calculation is incorrect:
vectorReduceAsSumKernel.GridDimensions = array.Length / threadsPerBlock;
for a value of 2048 for array.Length and a value of 256 for threadsPerBlock, this creates 8 blocks. But as you already indicate, your intention is to launch for blocks (2048/512). So you need to multiply the denominator by 2:
vectorReduceAsSumKernel.GridDimensions = array.Length / (2*threadsPerBlock);
In addition, your reduction sweep pattern is broken. It is warp-execution-order dependent, to give the proper result, and CUDA does not specify a warp execution order.
To see why, let's take a simple example. Let's consider just a single threadblock, with a starting point of the array being all 1, just as you have initialized it.
Now, warp 0 consists of threads 0-31. Your reduction sweep operation is like this:
a[i] = a[leftSumElementIdx] + a[leftSumElementIdx + 1];
So each thread in warp 0 will collect two other values and add them, and store them. Thread 31 will take the values a[62] and a[63] and add them together. If the values of a[62] and a[63] are still 1, as initialized, then this will work as expected. But the values of a[62] and a[63] are written to by warp 1, consisting of threads 32-63. So if warp 1 executes before warp 0 (perfectly legal), then you will get a different result. This is a global memory race condition. It is arising due to the fact that your input array is both the source and destination of your intermediate results, and __syncthreads() will not sort this out for you. It doesn't force warps to execute in any particular order.
One possible solution is to fix your sweep pattern. On any given reduction cycle, let's have a sweep pattern where each thread writes and reads values that are not touched by any other thread during that cycle. The following adaptation of your kernel code accomplishes that:
__global__ void Vector_Reduce_As_Sum_Kernel(float* out, float* a)
{
int tid = threadIdx.x;
int blockStart = blockDim.x * blockIdx.x * 2;
int i = tid + blockStart;
for (int j = blockDim.x; j > 0; j>>=1){
if (tid < j)
a[i] += a[i+j];
__syncthreads();}
if (tid == 0)
{
out[blockIdx.x] = a[i];
}
}
For general purpose reductions, this is still a very slow method. This tutorial covers how to write faster reductions. And, as already pointed out, managedCuda may have methods to avoid writing a kernel at all.

CUDA Illegal memory access with possibly 'insufficient' shared memory

I have a simple CUDA kernel that can do vector accumulation by basic reduction. I am scaling it up to be able to handle larger data by splitting it across multiple blocks. However, my assumption about allocating an appropriate amount of shared memory to be used by the kernel is failing with illegal memory access. It goes away when I increase this limit, but I want to know why.
Here is the code that I am talking about:
CORE KERNEL:
__global__ static
void vec_add(int *buffer,
int numElem, // The actual number of elements
int numIntermediates) // The next power of two of numElem
{
extern __shared__ unsigned int interim[];
int index = blockDim.x * blockIdx.x + threadIdx.x;
// Copy global intermediate values into shared memory.
interim[threadIdx.x] =
(index < numElem) ? buffer[index] : 0;
__syncthreads();
// numIntermediates2 *must* be a power of two!
for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
if (threadIdx.x < s) {
interim[threadIdx.x] += interim[threadIdx.x + s];
}
__syncthreads();
}
if (threadIdx.x == 0) {
buffer[blockIdx.x] = interim[0];
}
}
And this is the caller:
void accumulate (int* buffer, int numElem)
{
unsigned int numReductionThreads =
nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.
const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock
unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;
while (numReductionThreads > 1) {
numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
numReductionThreads : maxThreadsPerBlock;
numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;
reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);
vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
(buffer, numElem, numReductionThreads);
numReductionThreads = nextPowerOfTwo(numReductionBlocks);
}
}
I tried this code with a sample set of 1152 elements on my GPU with the following configuration:
Type: Quadro 600
MaxThreadsPerBlock: 1024
MaxSharedMemory: 48KB
OUTPUT:
Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8
CUDA Error 77: an illegal memory access was encountered
Suspecting that my 'interim' shared memory was causing illegal memory access, I arbitrarily increased the shared memory by two times in the following line:
reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);
And my kernel started working fine!
What I do not understand is - why I had to provide this extra shared memory to make my problem go away (temporarily).
As a further experiment to check this magic number I ran my code with a much larger data-set with 6912 points. This time, even 2X or 4X didn't help me.
Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384
Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128
CUDA Error 77: an illegal memory access was encountered
But the problem again went away when I increased the shared memory size by 8X.
Of course, I cannot be arbitrarily picking this scaling factor for larger and larger data-sets because I will soon run out of the 48KB shared memory limit. So I want to know a legitimate way of fixing my issue.

Thanks to #havogt for pointing out the out-of-index access.
The issue was that I was using the wrong argument as numIntermediates to the vec_add method. The intention was for the kernel to operate on exactly the same number of data points as the number of threads, which should have been 1024 all the time.
I fixed it by using numThreadsPerBlock as the argument:
vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
(buffer, numElem, numThreadsPerBlock);

Heisenbug in CUDA kernel, global memory access

About two years ago, I wrote a kernel for work on several numerical grids simultaneously. Some very strange behaviour emerged, which resulted in wrong results. When hunting down the bug utilizing printf()-statements inside the kernel, the bug vanished.
Due to deadline constraints, I kept it that way, though recently I figured that this was no appropriate coding style. So I revisited my kernel and boiled it down to what you see below.
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
This kernel computes the derivative of some value at all grid points for a number of numerical 1D-grids.
Things to consider: (see full code base at the bottom)
a grid consists of 1300 grid points
each grid has to be worked upon by two blocks (due to memory/register limitations)
each block successively works on 37 grids (or better: grid halves, the while-loop takes care of that)
each thread is responsible for the same grid point in each grid
for the derivative to be computed, the threads need access to data from the four next grid points
in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
due to the boiling down process, the two threads on each side of the blocks do no computations, in the original they are responsible for writing the corresponing grid values to shared memory
The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r).
Each grid occupies 1300 consecutive values in each of these arrays.
The while-loop in the kernel iterates over 37 grids for each block.
To evaluate the workings of the kernel, each grid is initialized with the exact same values, so a deterministic program will produce the same result for each grid.
This does not happen with my code.
The weirdness of the Heisenbug:
I compared the computed values of grid 0 with each of the other grids, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous.
The first thing that came to mind was that two threads at this overlap try to concurrently write to memory, though that does not seem to be the case (I checked.... and re-checked).
Commenting or un-commenting lines or using printf() for debugging purposes will alter
the outcome of the program as well: When "asking" the threads responsible for the grid points in question, they tell me that everything is allright, and they are actually correct. As soon as I force a thread to print out its variables, they will be computed (and more importantly: stored) correctly.
The same goes for debugging with Nsight Eclipse.
Memcheck / Racecheck:
cuda-memcheck (memcheck and racecheck) report no memory/racecondition problems, though even the usage of one of these tools have the ability to impact the correctness of the results.
Valgrind gives some warnings, though I think they have something to do with the CUDA API which I can not influence and which seem unrelated to my problem.
(Update)
As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand has a race condition on d_u, i.e., global memory.
Testing environment:
The original kernel has been tested on different CUDA devices and with different compute capabilities (2.0, 3.0 and 3.5) with the bug showing up in every configuration (in some form or another).
My (main) testsystem is the following:
2 x GTX 460, tested on both the GPU that ran the X-server as well as
the other one
Driver Version: 340.46
Cuda Toolkit 6.5
Linux Kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)
State of solution:
By now I am pretty sure that some memory access is the culprit, maybe some optimization from the compiler or use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm.
The fact that printf() statements inside the kernel (which through some dark magic have to utilize device and host memory as well) and memcheck algorithms (cuda-memcheck and valgrind) influence
the bevavior point in the same direction.
I am sorry for this somewhat complicated kernel, but I boiled the original kernel and invocation down as much as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.
Two "solutions", which force the kernel do work as intended, are marked in the code.
(Update) As mentioned in the correct answer below, the problem with my code is a race condition at the border of the thread-blocks. As there are two blocks working on each grid and there is no guarantee as to which block works first, resulting in the behavior outlined below. It also explains the correct results when employing "Solution 1" as mentioned in the code, because the input/output value d_u is not altered when uchange = 1.0.
The simple solution is to split this kernel into two kernels, one computing d_u, the other computing the derivative d_du. It would be more desirable to have just one kernel invocation instead of two, though I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, though the overhead for the second kernel invocation is negligible.
heisenbug.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
const float r_sol = 6.955E8f;
__constant__ float d_fourpoint_constant = 0.2f;
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
}
bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid){
bool retval = true;
for(uint i=0; i<numNodesPerGrid; ++i)
if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
{
printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
label, i, id0, id1, matarray[id0 * numNodesPerGrid + i], matarray[id1 * numNodesPerGrid + i],
matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
retval = false;
}
return retval;
}
int main(int argc, const char* argv[])
{
float *d_u;
float *d_du;
float *d_r;
float *u_arr;
float *du_arr;
float *r_arr;
int numNodesPerGrid = 1300;
int numBlocksPerSM = 2;
int numGridsPerSM = 37;
int numSM = 7;
int TPB = 672;
int radius = 2;
int numGrids = 259;
int memsize_grid = sizeof(float) * numNodesPerGrid;
int numBlocksPerGrid = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);
printf("---------------------------------------------------------------------------\n");
printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
printf("---------------------------------------------------------------------------\n\n");
cudaSetDevice(0);
cudaDeviceReset();
cudaMalloc((void **) &d_u, memsize_grid * numGrids);
cudaMalloc((void **) &d_du, memsize_grid * numGrids);
cudaMalloc((void **) &d_r, memsize_grid * numGrids);
u_arr = new float[numGrids * numNodesPerGrid];
du_arr = new float[numGrids * numNodesPerGrid];
r_arr = new float[numGrids * numNodesPerGrid];
for(uint k=0; k<numGrids; ++k)
for(uint i=0; i<numNodesPerGrid; ++i)
{
uint index = k * numNodesPerGrid + i;
if (i < 585)
r_arr[index] = i * (6000.0f);
else
{
if (i == 585)
r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
else
r_arr[index] = r_arr[index - 1] + 1.02102f * ( r_arr[index - 1] - r_arr[index - 2] );
}
u_arr[index] = 1E-10f * (i+1);
du_arr[index] = 0.0f;
}
/*
printf("\n\nbefore kernel start\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
bool equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Input values are not identical for different grids!\n\n");
else
printf("All grids contain the same values at same grid points.!\n\n");
cudaMemcpy(d_u, u_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_du, du_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_r, r_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
printf("Configuration:\n\n");
printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
printf("Kernel launch parameters:\n\n");
printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
printf("Launching Kernel...\n\n");
heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
cudaDeviceSynchronize();
cudaMemcpy(u_arr, d_u, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(du_arr, d_du, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(r_arr, d_r, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
/*
printf("\n\nafter kernel finished\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Results are wrong!!\n");
else
printf("All went well!\n");
cudaFree(d_u);
cudaFree(d_du);
cudaFree(d_r);
delete [] u_arr;
delete [] du_arr;
delete [] r_arr;
return 0;
}
Makefile:
CUDA = 1
DEFINES =
ifeq ($(CUDA), 1)
DEFINES += -DCUDA
CUDAPATH = /usr/local/cuda-6.5
CUDAINCPATH = -I$(CUDAPATH)/include
CUDAARCH = -arch=sm_20
endif
CXX = g++
CXXFLAGS = -pipe -g -std=c++0x -fPIE -O0 $(DEFINES)
VALGRIND = valgrind
VALGRIND_FLAGS = -v --leak-check=yes --log-file=out.memcheck
CUDAMEMCHECK = cuda-memcheck
CUDAMC_FLAGS = --tool memcheck
RACECHECK = $(CUDAMEMCHECK)
RACECHECK_FLAGS = --tool racecheck
INCPATH = -I. $(CUDAINCPATH)
LINK = g++
LFLAGS = -O0
LIBS =
ifeq ($(CUDA), 1)
NVCC = $(CUDAPATH)/bin/nvcc
LIBS += -L$(CUDAPATH)/lib64/
LIBS += -lcuda -lcudart -lcudadevrt
NVCCFLAGS = -g -G -O0 --ptxas-options=-v
NVCCFLAGS += -lcuda -lcudart -lcudadevrt -lineinfo --machine 64 -x cu $(CUDAARCH) $(DEFINES)
endif
all:
$(NVCC) $(NVCCFLAGS) $(INCPATH) -c -o $(DST_DIR)heisenbug.o $(SRC_DIR)heisenbug.cu
$(LINK) $(LFLAGS) -o heisenbug heisenbug.o $(LIBS)
clean:
rm heisenbug.o
rm heisenbug
memrace: all
./heisenbug > out
$(VALGRIND) $(VALGRIND_FLAGS) ./heisenbug > out.memcheck.log
$(CUDAMEMCHECK) $(CUDAMC_FLAGS) ./heisenbug > out.cudamemcheck
$(RACECHECK) $(RACECHECK_FLAGS) ./heisenbug > out.racecheck

Note that in the entirety of your writeup, I do not see a question being explicitly asked, therefore I am responding to:
I am looking forward to learning what is going on here.
You have a race condition on d_u.
by your own statement:
•in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
Furthermore, if you comment out the write to d_u, according to your statement in the code, the problem disappears.
CUDA threadblocks can execute in any order. You have at least 2 different blocks that are reading from grid points 666, 667, 668, 669. The results will be different depending on which case actually occurs:
both blocks read the value before any writes occur.
one block reads the value, then a write occurs, then the other block reads the value.
The blocks are not independent of each other (contrary to your statement) if one block is reading a value that can be written to by another block. The order of block execution will determine the result in this case, and CUDA does not specify the order of block execution.
Note that cuda-memcheck with the -tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, therefore I would not expect cuda-memcheck to report anything.
cuda-memcheck, in order to gather its data, does influence the order of block execution, so it's not surprising that it affects the behavior.
in-kernel printf represents a costly function call, writing to a global memory buffer. So it also affects execution behavior/patterns. And if you are printing out a large amount of data, exceeding the buffer lines of output, the effect is extremely costly (in terms of execution time) in the event of buffer overflow.
As an aside, Linux Mint is not a supported distro for CUDA, as far as I can see. However I don't think this is relevant to your issue; I can reproduce the behavior on a supported config.

Cuda gridDim and blockDim

I get what blockDim is, but I have a problem with gridDim. Blockdim gives the size of the block, but what is gridDim? On the Internet it says gridDim.x gives the number of blocks in the x coordinate.
How can I know what blockDim.x * gridDim.x gives?
How can I know that how many gridDim.x values are there in the x line?
For example, consider the code below:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
double temp = a[tid];
tid += blockDim.x * gridDim.x;
while (tid < count)
{
if (a[tid] > temp)
{
temp = a[tid];
}
tid += blockDim.x * gridDim.x;
}
I know that tid starts with 0. The code then has tid+=blockDim.x * gridDim.x. What is tid now after this operation?

blockDim.x,y,z gives the number of threads in a block, in the
particular direction
gridDim.x,y,z gives the number of blocks in a grid, in the
particular direction
blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case)
block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to only create 1-D blocks and grids.
In the CUDA documentation, these variables are defined here
In particular, when the total threads in the x-dimension (gridDim.x*blockDim.x) is less than the size of the array I wish to process, then it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must then move to the next unprocessed location, which is given by tid+=blockDim.x*gridDim.x; In effect, the entire grid of threads is jumping through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
You might want to consider taking an introductory CUDA webinars For example, the first 4 units. It would be 4 hours well spent, if you want to understand these concepts better.

Paraphrased from the CUDA Programming Guide:
gridDim: This variable contains the dimensions of the grid.
blockIdx: This variable contains the block index within the grid.
blockDim: This variable and contains the dimensions of the block.
threadIdx: This variable contains the thread index within the block.
You seem to be a bit confused about the thread hierachy that CUDA has; in a nutshell, for a kernel there will be 1 grid, (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2); would have 10*10*2 total blocks. In turn, each block is a 3-dimensional cube of threads.
With that said, it's common to only use the x-dimension of the blocks and grids, which is what it looks like the code in your question is doing. This is especially revlevant if you're working with 1D arrays. In that case, your tid+=blockDim.x * gridDim.x line would in effect be the unique index of each thread within your grid. This is because your blockDim.x would be the size of each block, and your gridDim.x would be the total number of blocks.
So if you launch a kernel with parameters
dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);
then in your kernel had threadIdx.x + blockIdx.x*blockDim.x you would effectively have:
threadIdx.x range from [0 ~ 128)
blockIdx.x range from [0 ~ 10)
blockDim.x equal to 128
gridDim.x equal to 10
Hence in calculating threadIdx.x + blockIdx.x*blockDim.x, you would have values within the range defined by: [0, 128) + 128 * [1, 10), which would mean your tid values would range from {0, 1, 2, ..., 1279}.
This is useful for when you want to map threads to tasks, as this provides a unique identifier for all of your threads in your kernel.
However, if you have
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;
then you'll essentially have: tid = [0, 128) + 128 * [1, 10) + (128 * 10), and your tid values would range from {1280, 1281, ..., 2559}
I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is pretty central to any kernel launch, and you're the one who determines how it should be done. When you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. As long as you don't exceed your hardware limits (for modern cards, you can have a maximum of 2^10 threads per block and 2^16 - 1 blocks per grid)

In this source code, we even have 4 threds, the kernel function can access all of 10 arrays. How?
#define N 10 //(33*1024)
__global__ void add(int *c){
int tid = threadIdx.x + blockIdx.x * gridDim.x;
if(tid < N)
c[tid] = 1;
while( tid < N)
{
c[tid] = 1;
tid += blockDim.x * gridDim.x;
}
}
int main(void)
{
int c[N];
int *dev_c;
cudaMalloc( (void**)&dev_c, N*sizeof(int) );
for(int i=0; i<N; ++i)
{
c[i] = -1;
}
cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
add<<< 2, 2>>>(dev_c);
cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );
for(int i=0; i< N; ++i)
{
printf("c[%d] = %d \n" ,i, c[i] );
}
cudaFree( dev_c );
}
Why we do not create 10 threads ex) add<<<2,5>>> or add<5,2>>>
Because we have to create reasonably small number of threads, if N is larger than 10 ex) 33*1024.
This source code is example of this case.
arrays are 10, cuda threads are 4.
How to access all 10 arrays only by 4 threads.
see the page about meaning of threadIdx, blockIdx, blockDim, gridDim in the cuda detail.
In this source code,
gridDim.x : 2 this means number of block of x
gridDim.y : 1 this means number of block of y
blockDim.x : 2 this means number of thread of x in a block
blockDim.y : 1 this means number of thread of y in a block
Our number of thread are 4, because 2*2(blocks * thread).
In add kernel function, we can access 0, 1, 2, 3 index of thread
->tid = threadIdx.x + blockIdx.x * blockDim.x
①0+0*2=0
②1+0*2=1
③0+1*2=2
④1+1*2=3
How to access rest of index 4, 5, 6, 7, 8, 9.
There is a calculation in while loop
tid += blockDim.x + gridDim.x in while
** first call of kernel **
-1 loop: 0+2*2=4
-2 loop: 4+2*2=8
-3 loop: 8+2*2=12 ( but this value is false, while out!)
** second call of kernel **
-1 loop: 1+2*2=5
-2 loop: 5+2*2=9
-3 loop: 9+2*2=13 ( but this value is false, while out!)
** third call of kernel **
-1 loop: 2+2*2=6
-2 loop: 6+2*2=10 ( but this value is false, while out!)
** fourth call of kernel **
-1 loop: 3+2*2=7
-2 loop: 7+2*2=11 ( but this value is false, while out!)
So, all index of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 can access by tid value.
refer to this page.
http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html
I cannot upload image, because low reputation.

First, see this figure Grid of thread blocks from the CUDA official document
Usually, we use kernel as this:
__global__ void kernelname(...){
const id_x = blockDim.x * blockIdx.x + threadIdx.x;
const id_y = blockDim.y * blockIdx.y + threadIdx.y;
...
}
// invoke kernel
// assume we have assigned the proper gridsize and blocksize
kernelname<<<gridsize, blocksize>>>(...)
The meaning of some variables:
gridsize the number of blocks per grid, corresponding to the gridDim
blocksize the number of threads per block, corresponding to the blockDim
threadIdx.x varies in [0, blockDim.x)
blockIdx.x varies in [0, gridDim.x)
So, let's try to calculate the index at the x direction when we have threadIdx.x and blockIdx.x. According to the figure, blockIdx.x determines which block you are, and threadIdx.x determines which thread you are when the location of block is given. Hence, we have:
which_blk = blockDim.x * blockIdx.x; // which block you are
final_index_x = which_blk + threadIdx.x; // based on the given block, we can have the final location by adding the threadIdx.x
that is:
final_index_x = blockDim.x * blockIdx.x + threadIdx.x;
which is same as the sample code above.
Similarly, we can get the index at y or z direction respectively.
As we can see, we usually don't use gridDim in our code, because this information is performed as the range of blockIdx. On the contrary, we have to use blockDim although this information is performed as the range of threadIdx. The reason I have shown above step by step.
I hope this answer may help resolve your confusion.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008