numba: how to understand the stride [duplicate] - cuda

I was wondering, why do one need to use a grid-stride stride in the following loop:
for (int i = index; i < ITERATIONS; i =+ stride)
{
C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim = 4 and there for stride will be equal 20.
My point is that, whenever one uses such approach, stride will always be equal or bigger than number of elements in calculation, so every time when it will come to increment loop will be over.
And here is the question, why one need to use loop or stride at all, why just not to run with index, like this?:
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question, how can I now, in this particular case, how many thread is running on my GPU simultaneously before give a “jump” to another portion of a very big array (ex. 2000000)?

My point is that, whenever one uses such approach, stride will always
be equal or bigger than number of elements in calculation, so every
time when it will come to increment loop will be over.
There lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device wide occupancy for your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident" and process more than one input/output pair per thread. The grid stride also preserves whatever memory coalescing and cache coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.

Related

Why do we need stride in CUDA kernel?

I was wondering, why do one need to use a grid-stride stride in the following loop:
for (int i = index; i < ITERATIONS; i =+ stride)
{
C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim = 4 and there for stride will be equal 20.
My point is that, whenever one uses such approach, stride will always be equal or bigger than number of elements in calculation, so every time when it will come to increment loop will be over.
And here is the question, why one need to use loop or stride at all, why just not to run with index, like this?:
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question, how can I now, in this particular case, how many thread is running on my GPU simultaneously before give a “jump” to another portion of a very big array (ex. 2000000)?
My point is that, whenever one uses such approach, stride will always
be equal or bigger than number of elements in calculation, so every
time when it will come to increment loop will be over.
There lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device wide occupancy for your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident" and process more than one input/output pair per thread. The grid stride also preserves whatever memory coalescing and cache coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.

Parallel reduction example

I found this parallel reduction code from Stanford which uses shared memory.
The code is an example of 1<<18 number of elements which is equal to 262144 and gets correct results.
Why for certain number of elements I get the correct results and for other number of elements, like 200000 or 25000 I get different results from what is to be expected?
It looks to me it's always appointing the needed thread blocks
// launch a single block to compute the sum of the partial sums
block_sum<<<1,num_blocks,num_blocks * sizeof(float)>>>
this code causes the bug.
suppose numblocks is 13,
Then in the kernal blockDim.x / 2 will be 6,
and
if(threadIdx.x < offset)
{
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
will only add the first 12 elements causing the bug.
when the element count is 200000 or 250000, num_blocks will be odd numbers and causes the bug, for even num_blocks it will work fine
This kernel is sensitive to the blocking parameters (grid and threadblock size) of the kernel. Are you invoking it with enough threads to cover the input size?
It is more robust to formulate kernels like this with for loops - instead of:
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
something like:
for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
i < N;
i += blockDim.x*gridDim.x ) {
sum += in[i];
}
The source code in the CUDA Handbook has lots of examples of "blocking agnostic" code. The reduction code is here:
https://github.com/ArchaeaSoftware/cudahandbook/tree/master/reduction

creation 2D grid in CUDA for GPGPU using C++

I am trying to extend my grid from a 1d to a 2d grid. Is there any way to do this?
Here is my current code:
int idx = threadIdx.x + blockDim.x * blockIdx.x;
In the #include list I have these definitions:
#define BLOCKS_PER_GRID 102
#define THREADS_PER_BLOCK 1024
Given that you want 1024 threads per block, the block can be easily reshaped to 2D.
32 x 32 = 1024;
So your block will look like this:
dim3 Block(32,32); //1024 threads per block. Will only work for devices of at least 2.0 Compute Capability.
I don't know what is your exact requirement, but usually number of blocks is not fixed (as you have defined in the macro). The number of blocks depend on the input data size, so that the grid scales dynamically.
Going with you case, you have many options, but the nearest optimal size for your grid comes out to be 17 x 6 or 6 x 17.
dim3 Grid(17,6);
Now you can call the kernel with these parameters:
kernel<<<Grid,Block>>>();
Inside the kernel, the 2-Dimensional index of the thread is calculated as follows:
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
Or if you follow the Row/Column convention instead of x/y, then:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int column = blockIdx.x * blockDim.x + threadIdx.x;
You can also have a 2D grid of 1-dimensional threadblocks, in order to get around the limitation of 65535 blocks per grid dimension (for pre-cc3.0 devices). This may be an easier way of extending a fundamentally 1-D problem past the limit without introducing a 2-D array representation for the data.
Let's assume we have a DATA_ELEMENTS parameter defined to be the number of elements (one element per thread) that your kernel will work on. If DATA_ELEMENTS is larger than 65535*1024, then you cannot handle them all using a 1-D grid, if each thread handles only 1 element.
you can leave your THREADS_PER_BLOCK parameter the same. Your thread index calculation inside the kernel will change to something like:
int idx = threadIdx.x + (blockDim.x * ((gridDim.x * blockIdx.y) + blockIdx.x));
you will want to be sure to condition your kernel calculations with something like:
if (idx < DATA_ELEMENTS){
(kernel code)
}
Your grid dimensions will be as follows:
dim3 grid;
if (DATA_ELEMENTS > (65535*THREADS_PER_BLOCK)){ // create a 2-D grid
int gridx = 65535; // could choose another number here
int gridy = ((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)/gridx;
if ((((DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK)%gridx) != 0) gridy++;
grid.x=gridx;
grid.y=gridy;
grid.z=1;
}
else{ // create a 1-D grid
int gridx = (DATA_ELEMENTS+(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;
grid.x=gridx;
grid.y=1;
grid.z=1;
}
and you would launch your kernel as:
kernel<<<grid, THREADS_PER_BLOCK>>>(...);
Another method to tackle this kind of problem is to create a 1-D grid of some dimension (let's say the total number of threads in the grid is NUM_THREADS_PER_GRID), and have each thread work on more than one element in the array of data elements, using something like a for-loop or while-loop:
while (idx < DATA_ELEMENTS) {
(code to process an element)
idx += NUM_THREADS_PER_GRID
}
I like Robert's solutions above. The only comment I have about his first solution is that it seems that one should make gridx as small as one can when DATA_ELEMENTS > (65535*THREADS_PER_BLOCK). The reason is that if the number of data elements is 65535*THREADS_PER_BLOCK + 1, and gridx is 65535, then 65535*2*THREADS_PER_BLOCK are launched, so almost half of the threads will do nothing. If gridx is smaller, then there will be less threads that do nothing.

CUDA efficient division?

I would like to know if there is, by any chance an efficient way of dividing elements of an array. I am running with matrix values 10000x10000 and it a considerable amount of time in comparison with other kernels. Division are expensive operations, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
int row = blockIdx.x * blockDim.x + threadIdx.x;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if((row < N) && (col <= row) ){
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
kernel launched with
int N = 10000;
int threads = 32
int blocks = (N+threads-1)/threads
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
int k = blockIdx.x * blockDim.x + threadIdx.x;
int kmax = N*(N+1)/2
int i,j;
if(k< kmax){
row = (int)(sqrt(0.25+2.0*k)-0.5);
col = k - (row*(row+1))>>1;
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
launched with
int threads =192;
int totalThreadsNeeded = (N*(N+1)/2;
int blocks = ( threads + (totalThreadsNeeded)-1 )/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even if the threadIds are the correct one? what is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So a enormous amount of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using isn't allowing coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on a lower-triangle of a square array. If this is the case, it could be equally done using something like this:
__global__
void division(int N, float* A, int* B)
{
for(int row=blockIdx.x; row<N; row+=gridDim.x) {
for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
int val = max(1,B[row*N+col]);
A[row*N+col] /= (float)val;
}
}
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one dimension grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SM with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel. (a / b == a * 1 / b).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here, just exchange rows and columns for to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.

CUDA - specifiying <<<x,y>>> for a for loop

Hey,
I have two arrays of size 2000. I want to write a kernel to copy one array to the other. The array represents 1000 particles. index 0-999 will contain an x value and 1000-1999 the y value for their position.
I need a for loop to copy up to N particles from 1 array to the other. eg
int halfway = 1000;
for(int i = 0; i < N; i++){
array1[i] = array2[i];
array1[halfway + i] = array[halfway + i];
}
Due to the number of N always being less than 2000, can I just create 2000 threads? or do I have to create several blocks.
I was thinking about doing this inside a kernel:
int tid = threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
and calling it as follows:
kernel<<<1,2000>>>(...);
Would this work? will it be fast? or will I be better off splitting the problem into blocks. I'm not sure how to do this, perhaps: (is this correct?)
int tid = blockDim.x*blockIdx.x + threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
kernel<<<4,256>>>(...);
Would this work?
Have you actually tried it?
It will fail to launch, because you are allowed to have 512 threads maximum (value may vary on different architectures, mine is one of GTX 200-series). You will either need more blocks or have fewer threads and a for-loop inside with blockDim.x increment.
Your multi-block solution should work as well.
Other approach
If this is the only purpose of the kernel, you might as well try using cudaMemcpy with cudaMemcpyDeviceToDevice as the last parameter.
The only way to answer questions about configurations is to test them. To do this, write your kernels so that they work regardless of the configuration. Often, I will assume that I will launch enough threads, which makes the kernel easier to write. Then, I will do something like this:
threads_per_block = 512;
num_blocks = SIZE_ARRAY/threads_per_block;
if(num_blocks*threads_per_block<SIZE_ARRAY)
num_blocks++;
my_kernel <<< num_blocks, threads_per_block >>> ( ... );
(except, of course, threads_per_block might be a define, or a command line argument, or iterated to test many configurations)
Is better to use more than one block for any kernel.
It Seems to me that you are simply copying from one array to another as a sequence of values with an offset.
If this is the case you can simply use the cudaMemcpy API call and specify
cudaMemcpyDeviceToDevice
cudaMemcpy(array1+halfway,array1,1000,cudaMemcpyDeviceToDevice);
The API will figure out the best partition of block / threads.