I have a question about the dimensions of blocks and grids in (py)CUDA. I know there are limits on the total size of a block, and that the actual block size influences the runtime. What I'm wondering is: for a block of 256 threads, does it make a difference whether I launch it as (256,1), as (128,2), as (64,4), etc.?
If it makes a difference: which is the fastest?
Yes, it makes a difference.
(256,1) creates a (1D) block of 256 threads in the X-dimension, all of which have a y-index of 0.
(128,2) creates a (2D) block of 128x2 threads, i.e. 128 in the x-dimension and 2 in the y-dimension. These threads will have an x-index ranging from 0 to 127 and a y-index ranging from 0 to 1.
The structure of your kernel code must account for this thread indexing/numbering scheme.
For example if your kernel code starts with something like:
int idx=threadIdx.x+blockDim.x*blockIdx.x;
and doesn't create any other index variables, it's probably assuming a 1D threadblock and 1D grid.
If, on the other hand, your kernel code starts with something like:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
It's probably expecting a 2D grid and 2D threadblocks.
Generally speaking, the two approaches are not interchangeable, meaning you cannot launch a kernel that expects a 1D grid with a 2D grid and expect everything to work normally, and vice-versa.
For a constant block size of 128 (cores per MP):
I did a performance comparison of a grid specified as a 2D dim3(WIDTH, HEIGHT) versus a flattened 1D grid of WIDTH * HEIGHT blocks, where WIDTH and HEIGHT can be arbitrarily large values representing a 2D array/matrix, so long as the int doesn't overflow in C.
According to my research, such as this answer here (Maximum blocks per grid: CUDA), only 65535 blocks should be supported in a single dimension.
Yet with WIDTH = 4000, HEIGHT = 4000, the speed results end up essentially the same over multiple trials regardless of whether the grid has one dimension or two. E.g., given gridDim { x = 125000, y = 1, z = 1 }, I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).
I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between dim3 and a flattened 1D grid, with threads per block of 128 (# of cores per MP), is the same from my tests. What's the point then of using dim3 with multiple dimensions instead of a single dimension for the grid size?
Could someone please enlighten me here?
Thank you.
As can be seen in Table 15. Technical Specifications per Compute Capability, the x-dimension is not restricted to 65535 like the other two dimensions, instead it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer as this seems like an old implementation detail.
The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.
The dimensionality of the grid does not matter for "wasting cores". The number of threads per block (together with the use of shared memory and registers) is what is important for utilization. And even there, the dimensionality is just to make the code easier to write and read, as many GPU use cases are not one-dimensional.
The number of blocks in a grid (together with the number of blocks that can fit onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.
I have never seen any information about the dimensionality of the grid or blocks mattering directly to performance in a way that could not be emulated using 1D grids and blocks (i.e., 2D tiles for, e.g., matrix multiplication are important for performance, but one could emulate them with 1D blocks instead), so I view them as just a handy abstraction to keep index computations in user code at a reasonable level.
I have a toy code heavily borrowing from NVIDIA's simpleTensorCoreGEMM.cu. I swapped out their random generation of matrices for a function that reads matrices in from files.
Using this toy code and multiplying two matrices of size [2000 x 10000] * [10000 x 3008] works beautifully. The output is as expected.
When I try a much larger multiplication, [20000 x 10000] * [10000 x 30000], the output goes horribly wrong and two-thirds of the rows are zeros.
I'm convinced that this is a result of me not understanding the lines of code:
// blockDim.x must be a multiple of warpSize
// 128x4 means we have 16 warps and a block computes a 64x64 output tile
blockDim.x = 128;
blockDim.y = 4;
gridDim.x = (MATRIX_M + (WMMA_M * blockDim.x / 32 - 1)) / (WMMA_M * blockDim.x / 32);
gridDim.y = (MATRIX_N + WMMA_N * blockDim.y - 1) / (WMMA_N * blockDim.y);
Even if it is not the source of my error, I should still understand what this code is doing. I understand setting blockDim.*: there are 32 threads per warp, so 128*4/32 = 16 warps.
QUESTION: Could someone explain to me the logic behind the chosen values and the computation of gridDim.x and gridDim.y? Correct usage of the tensor cores seems to be very sensitive to using the correct values for gridDim.*.
A couple introductory points:
For understanding, this code is intended to accompany this blog article. The last part of that blog, the section "Programmatic Access to Tensor Cores in CUDA 9.0", is definitely useful for understanding this code.
As mentioned in the readme for that code, an easier way to access the performance of tensor cores (especially for the basic matrix multiply operations you seem to be playing with) is simply to use a cuBLAS function such as cublasGemmEx, which will make intelligent use of tensor cores under the right circumstances.
Now to your question:
Could someone explain to me the logic behind the values of and the computation of gridDim.x and gridDim.y?
These values are sizing the CUDA grid to be sufficient for the size of the matrix multiply problem requested. We need to approach this hierarchically.
First of all, the tensor core capability is accessed at the warp level. The blog article indicates that "The strategy we’ll employ is to have a single warp responsible for a single 16×16 section of the output matrix" Therefore the output matrix dimensions will drive the dimensions of the CUDA grid used to compute the result. (Typical naive realizations of matrix multiply also determine grid size based on output matrix size. More specifically they assign one thread per output point. Here we are assigning one 32-thread-warp to be responsible for one 16x16 tile of the output matrix.) The code uses WMMA_M (i.e. how many rows) and WMMA_N (i.e. how many columns) to define what a single warp-level tensor core operation will handle. These values are 16, and this drives the choice of using a 16x16 tile in the output, per warp.
As is often the case in CUDA, block dimensions can be somewhat arbitrary, but they frequently affect the grid size variables. Warps exist at the block level, and the number of warps in a block effectively determines how many 16x16 tiles in the output matrix will be handled per block. In this particular case, the code chooses block dimensions of 128 (blockDim.x) by 4 (blockDim.y). This happens to be 4 warps "wide" by 4 warps "high", so each block handles a 4x4 set of tiles in the output, which means each block is responsible for 64x64 output points. Note that these blockDim and gridDim variables in host code are logically separate from (although end up being numerically the same as) the blockDim and gridDim built-in variables in CUDA device code.
Given the above, the m,n, and k parameters of a typical BLAS GEMM operation have the same meaning here. m is the number of rows of the left hand side input matrix. n is the number of columns of the right hand side input matrix. k is the number of columns of the left matrix, which must match the number of rows of the right matrix. Therefore m,n define the dimensions of the output matrix. These are indicated in the code as MATRIX_M and MATRIX_N respectively.
With the above groundwork laid, we can then state the arithmetic needed to compute gridDim.x and gridDim.y in host code.
1. We must choose enough threads in the x dimension so that, when divided by 32 (the width of a warp in the x dimension) and then multiplied by WMMA_M (the output tile width that warp is responsible for), we have enough threads to cover the width of the output matrix.
2. We must choose enough threads in the y dimension so that, when divided by 1 (the "height" of a warp in the y dimension) and then multiplied by WMMA_N (the output tile height that warp is responsible for), we have enough threads to cover the height of the output matrix. Note that the "height" of the warp in the y dimension is definitely 1 in this case, because the code requires that the block width dimension be a whole-number multiple of the warp size. Therefore any warp has a constant threadIdx.y component across the warp.
3. To go from threads determined in 1 and 2 above to blocks in each dimension, we must scale (divide) each by the corresponding threadblock dimension. Therefore the grid thread dimension in x, scaled as in 1 above, must be divided by blockDim.x (in host code) to get the total grid dimension (number of blocks) in x. This division is the usual CUDA "round up" integer division, to make the number of blocks equal to or larger than the number of threads needed, to account for matrix sizes that are not evenly divisible by the block size.
Putting all that together, we have:
gridDim.x = (MATRIX_M + (WMMA_M * blockDim.x / 32 - 1)) / (WMMA_M * blockDim.x / 32);
    ^            ^                              ^                    ^
    |            |                              |     divided by the block size scaled for the
    |            |                              |     portion of the output matrix it covers.
    |            |                         rounded up
    |       the matrix size
The grid in blocks is
And similarly for the y grid dimension. The only real difference is that 32 threads in x (a warp-width) are responsible for a 16x16 output tile, whereas a single thread in y (a warp "height") is responsible for that 16x16 output tile.
I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
int threadId = blockDim.x*blockIdx.y*gridDim.x //rows preceding current row in grid
+ blockDim.x*blockIdx.x //blocks preceding current block
+ threadIdx.x;
if (threadId < nbThreads)
{
statement();
statement();
statement();
}
}
They think there are some situations where CUDA might launch more threads than specified for alignment/warping's sake, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, due to the granularity of block dimensions (e.g., it's desirable to have the block dimensions be a multiple of 32, and block size is limited to 1024 or 512 threads), it is frequently difficult to make a grid of threads numerically equal to the desired problem size.
In these cases, the typical behavior is to launch more threads, effectively rounding up to the next even size based on the block granularity, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
if (threadId < problem_size)
which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
I have a question about my code and whether I can run it on my current device or not.
Basically, I want to do a 3D interpolation.
When I launch my interpolation kernel, I get the following error: kernel failure: invalid configuration argument
I saw in this discussion that it can happen if you call too many threads or blocks, but I am not sure it is the case in my code. Could someone have a look at it and tell me what's wrong?
Here is how I call my kernel:
dim3 blockSize(6,6,6);
dim3 threadSize(dimX/blockSize.x,dimY/blockSize.y,dimZ/blockSize.z);
d_interpolate_kernel<<<blockSize,threadSize>>>(output,dimX,dimY,dimZ);
My dimensions are dimX = 54 or 108, dimY = dimZ = 42 or 84.
So I have blockSize(6,6,6) and threadSize(9,7,7) or (18,14,14).
My card has the following capabilities:
MAX_BLOCK_DIM_X = 512
MAX_BLOCK_DIM_Y = 512
MAX_BLOCK_DIM_Z = 64
MAX_GRID_DIM_X = 65535
MAX_GRID_DIM_Y = 65535
MAX_GRID_DIM_Z = 1
Do I get the error because MAX_GRID_DIM_Z is 1?
If yes, is there a way around this?
Thank you!
One problem is you have your blockSize and threadSize variables reversed in your kernel call.
You want something like this:
d_interpolate_kernel<<<threadSize,blockSize>>>(output,dimX,dimY,dimZ);
The first configuration argument is the size of the grid in blocks.
The second configuration argument is the size of the block in threads.
Since you have them reversed, your (18,14,14) values are not acceptable block sizes (too many threads), since the max number of threads per block is 512 (for cc1.x) or 1024 (otherwise), whereas 18x14x14 = 3528.
For me, threadSize is a confusing name. I would have called it gridSize or something like that.
The second problem as you've pointed out is that for a cc1.x card (which seems to be what you have) your Z grid dimension must be 1. At least for your 42 case, you can fix this by re-structuring the thread blocks to have a dimension of, say, (2,2,42) and your grid a dimension of, say, (27, 21, 1).
Otherwise, these indices are just arbitrary numbering schemes. You can come up with a 2D grid that covers all of your 3D volume, using a (6, 6, 6) block size if that is what you want. You just need to get creative about how you map the blockIdx.x and blockIdx.y built-in variables in your interpolation kernel, to simulate a 3D grid.
The discussion is restricted to compute capability 2.x
Question 1
The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in memory?
Question 2
The OP of Passing structs to CUDA kernels states that "the align part was unnecessary". But without alignment, access to that structure will be split into two separate accesses to a and b. Right?
Question 3
struct Position
{
    double x, y, z;
};
Suppose each thread is accessing the structure above:
int globalThreadID=blockIdx.x*blockDim.x+threadIdx.x;
Position positionRegister=positionGlobal[globalThreadID];
To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?
Thanks for your time!
(1) They are placed contiguously in memory.
(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
(3) Performance can often be improved by using a struct of arrays instead of an array of structs. This just means that you pack all your x values together in one array, then the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp reach the point where, for instance, x is needed. By having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word. The code example you gave might cause the compiler to keep the values in registers until they are needed.