Why is it said the max blocks per grid dimension is 65535? - cuda

For a constant block size of 128 (cores per MP):
I did a performance comparison of a grid using a 2D dim3(WIDTH, HEIGHT) versus a flattened 1D grid of size WIDTH * HEIGHT, where WIDTH and HEIGHT can be arbitrarily large values representing a 2D array/matrix, so long as int doesn't overflow in C.
According to my research, such as this answer here: Maximum blocks per grid: CUDA, only 65535 blocks should be supported in a single dimension.
Yet with WIDTH = 4000 and HEIGHT = 4000, the speed results end up essentially the same over multiple trials regardless of whether the grid has one dimension or two. E.g., given gridDim
{ x = 125000, y = 1, z = 1 }
I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).
I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between dim3 and a flattened 1D grid, with threads per block of 128 (# of cores per MP), is the same from my tests. What's the point then of using dim3 with multiple dimensions instead of a single dimension for the grid size?
Could someone please enlighten me here?
Thank you.

As can be seen in Table 15. Technical Specifications per Compute Capability, the x-dimension is not restricted to 65535 like the other two dimensions, instead it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer as this seems like an old implementation detail.
The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.
The dimensionality of the grid does not matter for "wasting cores". The number of threads per block (together with the use of shared memory and registers) is what is important for utilization. And even there the dimensionality is just to make the code easier to write and read, as many GPU use-cases are not one-dimensional.
The number of blocks in a grid (together with the number of blocks that can be fitted onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.
I have never seen any information about the dimensionality of the grid or blocks mattering directly to performance in a way that could not be emulated using 1D grids and blocks (i.e. 2D tiles for e.g. matrix multiplication are important for performance, but one could emulate them and use 1D blocks instead), so I view them just as a handy abstraction to keep index computations in user code at a reasonable level.

Related

Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link (http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.
How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)",
because those are obviously very different numbers.
Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing runs. A runtime error is emitted at launch.
Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself.
If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference.

Understanding the tiling of tensor cores using CUDA on V100

I have a toy code heavily borrowing from NVIDIA's simpleTensorCoreGEMM.cu. I swapped out their random generation of matrices for a function that reads in matrices from files.
Using this toy code, multiplying two matrices of size [2000 x 10000] * [10000 x 3008] works beautifully. The output is as expected.
When I try a much larger multiplication, [20000 x 10000] * [10000 x 30000], the output goes horribly wrong and two-thirds of the rows are zeros.
I'm convinced that this is a result of me not understanding the lines of code:
// blockDim.x must be a multiple of warpSize
// 128x4 means we have 16 warps and a block computes a 64x64 output tile
blockDim.x = 128;
blockDim.y = 4;
gridDim.x = (MATRIX_M + (WMMA_M * blockDim.x / 32 - 1)) / (WMMA_M * blockDim.x / 32);
gridDim.y = (MATRIX_N + WMMA_N * blockDim.y - 1) / (WMMA_N * blockDim.y);
Even if it is not the source of my error, I should still understand what this code is doing. I understand setting blockDim.*: there are 32 threads per warp, so 128*4/32 = 16 warps.
QUESTION: Could someone explain the logic behind the values and the computation of gridDim.x and gridDim.y? The correct usage of the tensor cores seems to be very sensitive to using the correct values for gridDim.*.
A couple introductory points:
For understanding, this code is intended to accompany this blog article. The last part of that blog, the section "Programmatic Access to Tensor Cores in CUDA 9.0" is definitely useful for understanding this code.
As mentioned in the readme for that code an easier method to access the performance of tensor cores (especially for the basic matrix multiply operations you seem to be playing with) is simply to use a CUBLAS function, such as cublasGemmEx which will make intelligent use of tensor cores under the right circumstances.
Now to your question:
Could someone explain the logic behind the values and the computation of gridDim.x and gridDim.y?
These values are sizing the CUDA grid to be sufficient for the size of the matrix multiply problem requested. We need to approach this hierarchically.
First of all, the tensor core capability is accessed at the warp level. The blog article indicates that "The strategy we’ll employ is to have a single warp responsible for a single 16×16 section of the output matrix" Therefore the output matrix dimensions will drive the dimensions of the CUDA grid used to compute the result. (Typical naive realizations of matrix multiply also determine grid size based on output matrix size. More specifically they assign one thread per output point. Here we are assigning one 32-thread-warp to be responsible for one 16x16 tile of the output matrix.) The code uses WMMA_M (i.e. how many rows) and WMMA_N (i.e. how many columns) to define what a single warp-level tensor core operation will handle. These values are 16, and this drives the choice of using a 16x16 tile in the output, per warp.
As is often the case in CUDA, block dimensions can be somewhat arbitrary, but they do frequently affect the grid size (variables). Warps exist at the block level and the number of warps in a block effectively determine how many 16x16 tiles in the output matrix will be handled per block. In this particular case, the code is choosing block dimensions of 128 (blockDim.x) by 4 (blockDim.y). This happens to be 4 warps "wide" by 4 warps "high", so each block is handling a 4x4 set of tiles in the output, which means each block is responsible for 64x64 output points. Note that these blockDim and gridDim variables in host code are logically separate from (although end up being the same numerically as) the blockDim and gridDim built-in variables in CUDA device code.
Given the above, the m,n, and k parameters of a typical BLAS GEMM operation have the same meaning here. m is the number of rows of the left hand side input matrix. n is the number of columns of the right hand side input matrix. k is the number of columns of the left matrix, which must match the number of rows of the right matrix. Therefore m,n define the dimensions of the output matrix. These are indicated in the code as MATRIX_M and MATRIX_N respectively.
With the above groundwork laid, we can then state the arithmetic needed to compute gridDim.x and gridDim.y in host code.
We must choose enough threads in the x dimension, so that when divided by 32 (the width of a warp in the x dimension) and then multiplied by WMMA_M (the output tile width responsibility of that warp), we have enough threads to cover the width of the output matrix.
We must choose enough threads in the y dimension, so that when divided by 1 (the "height" of a warp in the y dimension) and then multiplied by WMMA_N (the output tile height responsibility of that warp), we have enough threads to cover the height of the output matrix. Note that the "height" of the warp in the y dimension is definitely 1 in this case, because the code requires that the block width dimension be a whole number multiple of the warp size. Therefore any warp has a constant threadIdx.y component, across the warp.
To go from threads determined in 1 and 2 above, to blocks in each dimension, we must scale (divide) each by the corresponding threadblock dimension. Therefore the grid thread dimension in x must be divided by blockDim.x (in host code), scaled as in 1 above, to get the total grid dimension (number of blocks) in x. This division operation is the usual CUDA "round up" integer divide operation, to scale the number of blocks to be equal to or larger than the threads needed, to account for matrix sizes that are not evenly divisible by the block size.
Putting all that together, we have:
gridDim.x = (MATRIX_M + (WMMA_M * blockDim.x / 32 - 1)) / (WMMA_M * blockDim.x / 32);
In words: the grid dimension in blocks is the matrix size, rounded up, divided by the block size scaled for the portion of the output matrix it covers.
And similarly for the y grid dimension. The only real difference is that 32 threads in x (a warp-width) are responsible for a 16x16 output tile, whereas a single thread in y (a warp "height") is responsible for that 16x16 output tile.

Should I check the number of threads in kernel code?

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x * blockIdx.y * gridDim.x  // rows preceding current row in grid
                 + blockDim.x * blockIdx.x              // blocks preceding current block
                 + threadIdx.x;
    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}
They think there are some situations where CUDA might launch more threads than specified for alignment/warp-size reasons, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, due to the granularity of block dimensions (e.g. it's desirable to have block dimensions be a multiple of 32, and it is limited in size to 1024 or 512), it is frequently the case that it is difficult to match a grid of threads to be numerically equal to the desired problem size.
In these cases, the typical behavior is to launch more threads, effectively rounding up to the next even size based on the block granularity, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
if (threadId < problem_size)
which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.

kernel failure: invalid configuration argument

I have a question about my code and whether I can run it on my current device or not.
Basically, I want to do a 3D interpolation.
When I launch my interpolation kernel, I get the following error: kernel failure: invalid configuration argument
I saw in this discussion that it can happen if you call too many threads or blocks, but I am not sure it is the case in my code. Could someone have a look at it and tell me what's wrong?
Here is how I call my kernel:
dim3 blockSize(6,6,6);
dim3 threadSize(dimX/blockSize.x,dimY/blockSize.y,dimZ/blockSize.z);
d_interpolate_kernel<<<blockSize,threadSize>>>(output,dimX,dimY,dimZ);
My dimensions are dimX = 54 or 108, dimY = dimZ = 42 or 84.
So I have blockSize(6,6,6) and threadSize(9,7,7) or (18,14,14).
My card has the following capabilities:
MAX_BLOCK_DIM_X = 512
MAX_BLOCK_DIM_Y = 512
MAX_BLOCK_DIM_Z = 64
MAX_GRID_DIM_X = 65535
MAX_GRID_DIM_Y = 65535
MAX_GRID_DIM_Z = 1
Do I get the error because MAX_GRID_DIM_Z is 1?
If yes, is there a way around this?
Thank you!
One problem is you have your blockSize and threadSize variables reversed in your kernel call.
You want something like this:
d_interpolate_kernel<<<threadSize,blockSize>>>(output,dimX,dimY,dimZ);
The first configuration argument is the size of the grid in blocks.
The second configuration argument is the size of the block in threads.
Since you have them reversed, your (18,14,14) values are not acceptable block sizes (too many threads), since the max number of threads per block is 512 (for cc1.x) or 1024 (otherwise), whereas 18x14x14 = 3528.
For me, threadSize is a confusing name. I would have called it gridSize or something like that.
The second problem as you've pointed out is that for a cc1.x card (which seems to be what you have) your Z grid dimension must be 1. At least for your 42 case, you can fix this by re-structuring the thread blocks to have a dimension of, say, (2,2,42) and your grid a dimension of, say, (27, 21, 1).
Otherwise, these indices are just arbitrary numbering schemes. You can come up with a 2D grid that covers all of your 3D volume, using a (6, 6, 6) block size if that is what you want. You just need to get creative about how you map the blockIdx.x and blockIdx.y built-in variables in your interpolation kernel, to simulate a 3D grid.

Grid and Block dimension in (py)CUDA [duplicate]

This question already has answers here:
How do I choose grid and block dimensions for CUDA kernels?
(3 answers)
Closed 9 years ago.
I got a question regarding the dimensions of the blocks and grids in (py)CUDA. I know that there are limits on the total size of a block, but not of the grid, and that the actual block size influences the runtime. But what I'm wondering about is: does it make a difference whether I start a block of 256 threads as (256,1), as (128,2), as (64,4), etc.?
If it makes a difference: which is the fastest?
Yes, it makes a difference.
(256,1) creates a (1D) block of 256 threads in the x-dimension, all of which have a y-index of 0.
(128,2) creates a (2D) block of 128x2 threads, i.e. 128 in the x-dimension and 2 in the y-dimension. These threads will have an x-index ranging from 0 to 127 and a y-index ranging from 0 to 1.
The structure of your kernel code must comprehend the thread indexing/numbering.
For example if your kernel code starts with something like:
int idx=threadIdx.x+blockDim.x*blockIdx.x;
and doesn't create any other index variables, it's probably assuming a 1D threadblock and 1D grid.
If, on the other hand, your kernel code starts with something like:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
It's probably expecting a 2D grid and 2D threadblocks.
Generally speaking, the two approaches are not interchangeable, meaning you cannot launch a kernel that expects a 1D grid with a 2D grid and expect everything to work normally, and vice-versa.