Theoretically, you can have up to 65535 blocks per grid dimension, i.e. up to 65535 * 65535 * 65535 blocks in total.
If you call a kernel like this:
kernel<<< BLOCKS,THREADS >>>()
(without dim3 objects), what is the maximum number available for BLOCKS?
In an application of mine, I set it to 192000 and it seemed to work fine... The problem is that the kernel I used changes the contents of a huge array, so although I checked some parts of the array and they seemed fine, I can't be sure the kernel didn't behave strangely at other parts.
For the record I have a 2.1 GPU, GTX 500 ti.
With compute capability 3.0 or higher, you can have up to 2^31 - 1 blocks in the x-dimension, and at most 65535 blocks in the y and z dimensions. See Table H.1. Feature Support per Compute Capability of the CUDA C Programming Guide Version 9.1.
As Pavan pointed out, if you do not provide a dim3 for the grid configuration, you will only use the x-dimension, so it is the x-dimension limit that applies here.
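For illustration, a minimal sketch (the kernel body and sizes are hypothetical, with the block count taken from the question) showing that a scalar launch argument is implicitly converted to dim3(n, 1, 1), which is why only the x-dimension limit matters here:

__global__ void kernel(int *data)
{
    // Only the x-dimension is used; blockIdx.y and blockIdx.z are always 0
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

int main()
{
    const int BLOCKS = 192000;   // legal on cc >= 3.0, where the x-limit is 2^31 - 1
    const int THREADS = 256;
    int *d_data;
    cudaMalloc(&d_data, (size_t)BLOCKS * THREADS * sizeof(int));
    // These two launches are equivalent: scalars convert to dim3(n, 1, 1)
    kernel<<<BLOCKS, THREADS>>>(d_data);
    kernel<<<dim3(BLOCKS, 1, 1), dim3(THREADS, 1, 1)>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}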
In case anybody lands here based on a Google search (as I just did):
Nvidia changed the specification since this question was asked. With compute capability 3.0 and newer, the x-dimension of a grid of thread blocks is allowed to be up to 2,147,483,647, i.e. 2^31 - 1.
See the current Technical Specifications table.
65535 in a single dimension. Here's the complete table.
I manually checked on my laptop (MX130): the program crashes when the number of blocks exceeds 678*1024+651. Each block had a single thread, and adding even one more block gave a segfault. The kernel code used a linear (1D) grid only.
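As a side note, the x-dimension grid limit on a compute capability 5.0 device such as the MX130 should be 2^31 - 1, so a crash at roughly 694000 blocks more likely points at a bug in the program than at the grid limit. A rejected launch configuration is reported as a CUDA error rather than a segfault, and you can check for it explicitly; a minimal sketch (kernel name and arguments hypothetical):

// right after the kernel launch:
kernel<<<numBlocks, 1>>>(d_data);
cudaError_t err = cudaGetLastError();   // reports an invalid launch configuration
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // reports errors during kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));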
For a constant block size of 128 (cores per MP):
I did a performance comparison of a grid configured as a 2D dim3(WIDTH, HEIGHT) versus a flattened 1D grid of WIDTH * HEIGHT blocks, where WIDTH and HEIGHT can be arbitrarily large values representing a 2D array/matrix, so long as the int doesn't overflow in C.
According to my research, such as this answer here: Maximum blocks per grid: CUDA, only 65535 blocks should be supported in a single dimension.
Yet with WIDTH = 4000 and HEIGHT = 4000, the speed results end up essentially the same over multiple trials, regardless of whether the grid has 1 dimension or 2. E.g., given gridDim { x = 125000, y = 1, z = 1 }, I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).
I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between a dim3 grid and a flattened 1D grid, with 128 threads per block (the number of cores per MP), is the same in my tests. What's the point, then, of using dim3 with multiple dimensions instead of a single dimension for the grid size?
Could someone please enlighten me here?
Thank you.
As can be seen in Table 15 (Technical Specifications per Compute Capability), the x-dimension is not restricted to 65535 like the other two dimensions; instead, it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer, as this seems like an old implementation detail.
The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.
The dimensionality of the grid does not matter for "wasting cores". The number of threads per block (together with the use of shared memory and registers) is what matters for utilization. And even there, the dimensionality is just to make the code easier to write and read, as many GPU use-cases are not one-dimensional.
The number of blocks in a grid (together with the number of blocks that can be fitted onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.
I have never seen the dimensionality of the grid or blocks matter directly to performance in a way that could not be emulated using 1D grids and blocks (e.g. 2D tiles for matrix multiplication are important for performance, but one could emulate them with 1D blocks instead). So I view multiple dimensions as just a handy abstraction to keep index computations in user code at a reasonable level.
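To illustrate that last point, here is a minimal sketch (all names hypothetical) of the same elementwise work written once with a 2D grid and once with a flattened 1D grid. Both use 128 threads per block and should perform identically:

__global__ void kernel2d(float *a, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        a[y * width + x] *= 2.0f;       // placeholder for the real work
}

__global__ void kernel1d(float *a, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // flattened index
    int x = i % width;                              // recover the 2D coordinates
    int y = i / width;
    if (y < height)
        a[y * width + x] *= 2.0f;
}

// 2D launch: 16 * 8 = 128 threads per block
// kernel2d<<<dim3((width + 15) / 16, (height + 7) / 8), dim3(16, 8)>>>(d_a, width, height);
// 1D launch: same work; the x-dimension allows up to 2^31 - 1 blocks
// kernel1d<<<(width * height + 127) / 128, 128>>>(d_a, width, height);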
I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link (http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.
How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 2^20 / 1024 = 1024 blocks.
Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing is launched. A runtime error is emitted instead.
Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself, typically with a grid-stride loop (see the sketch at the end of this answer).
If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference.
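Regarding question 3: the usual way to decouple the amount of work from the launch configuration is a grid-stride loop, in which a fixed number of threads each iterate over several elements. A minimal sketch (the per-element work is a placeholder):

__global__ void process(float *data, size_t n)
{
    // Each thread handles elements i, i + stride, i + 2*stride, ...
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;                // placeholder for the real work
}

// For N = 2^10, i.e. N^2 = 2^20 elements: 2^20 / 1024 = 1024 blocks of 1024 threads
// cover everything in one pass, but with the stride loop even a smaller grid works:
// process<<<1024, 1024>>>(d_data, 1 << 20);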
I'm writing CUDA C code to process images. For example, I created a swap function (it swaps blocks of the matrix), but it does not work every time. I think I have a problem with the number of blocks and the number of threads when I launch my kernel.
For example, if I take an image of size 2048*2048 with
threadsPerBlock.x=threadsPerBlock.y=64 and numBlocks.x=numBlocks.y=2048/threadsPerBlock.x
then swap<<<threadsPerBlock,numBlocks>>>(...) works fine.
But if I take an image of size 2560*2160, with threadsPerBlock.x=threadsPerBlock.y=64, numBlocks.x=2560/64 and numBlocks.y=2160/64+1, I get error 9, which is "invalid configuration argument".
I'm using CUDA 7.5 and a GPU with compute capability 5.0
The maximum number of threads per block for your compute capability 5.0 device is 1024. The source of your problem is that you have the arguments in the kernel launch reversed: the grid dimensions come first, then the block dimensions. As long as the largest image dimension is no greater than 2048, the reversed launch happens to produce a legal block size (2048/64 = 32, and 32 * 32 = 1024 threads at most). Beyond 2048, the resulting block size exceeds 1024 and becomes illegal.
If you do something like this:
threadsPerBlock.x=threadsPerBlock.y=32
numBlocks.x=numBlocks.y=2048/threadsPerBlock.x
swap<<<numBlocks,threadsPerBlock>>>(...)
You should find the kernel launch works unconditionally. For image sizes that are not an exact multiple of the block size (such as 2560*2160), round the grid dimensions up and guard against out-of-range threads inside the kernel, as in the sketch below.
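A minimal sketch of that general case (width, height, d_img and the swap signature are hypothetical):

dim3 threadsPerBlock(32, 32);   // 1024 threads, the maximum on compute capability 5.0
dim3 numBlocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
swap<<<numBlocks, threadsPerBlock>>>(d_img, width, height);

// inside the kernel, skip threads that fall outside the image:
// int x = blockIdx.x * blockDim.x + threadIdx.x;
// int y = blockIdx.y * blockDim.y + threadIdx.y;
// if (x >= width || y >= height) return;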
I want to start a kernel in such a fashion:
kernel_code<<<NUMBER_BLOCKS, NUMBER_THREADS_PER_BLOCK>>> (param1, param2, param3, param4);
Thus, using only the x-dimension of the grid. I want to call the kernel with the maximum number of blocks possible. I thought the max. number of blocks in a grid for one dimension would be 65535.
However, I explored the constant CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, which sounds like exactly the number I want to find out. However, this constant returns 1899336 on my GeForce 210 (compute capability 1.2). What am I getting wrong?
Referring to the driver API documentation for cuDeviceGetAttribute, the parameter which gives the maximum number of blocks in the x-direction of a grid is:
•CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X: Maximum x-dimension of a grid;
For comparison, the related attribute that gives the maximum number of threads in a block (x-dimension) is:
•CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X: Maximum x-dimension of a block;
On a GeForce 210, the MAX_GRID_DIM_X parameter should be 65535. (True for all cc 1.x devices.)
If you are getting some other number, there is either something wrong with your code that you are using to retrieve this data (which you haven't shown), or else something wrong with your machine setup.
Try running and also inspecting the code for the CUDA driver API deviceQuery sample.
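For reference, a minimal driver API sketch that queries both attributes (error checking omitted; link against the driver library, e.g. nvcc query.cu -lcuda):

#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    int maxGridX = 0, maxBlockX = 0;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&maxGridX,  CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X,  dev);
    cuDeviceGetAttribute(&maxBlockX, CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X, dev);
    printf("max grid dim x:  %d\n", maxGridX);    // 65535 on a cc 1.x device
    printf("max block dim x: %d\n", maxBlockX);   // 512 on a cc 1.x device
    return 0;
}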
Say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half-warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand it correctly, in case I break my data into 32-byte (16 shorts) segments, do threads 0, 16, 32, ... have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Advice is welcome. Thanks.
According to Section G.3.2.1 of the CUDA Programming Guide, short will not coalesce on Compute Capability 1.0 and 1.1 devices under any circumstances. Specifically, it states:
The size of the words accessed by the threads must be 4, 8, or 16 bytes.
You can however use vector types such as short2, short4, or even short8 to get coalesced access. The coalescing rules for these types are spelled out in Section G.3.2.1 as well. However, as far as coalescing is concerned, a short2 is just like a 32-bit int.
FWIW, devices with Compute Capability 1.3 or greater handle types like char and short much better. Reading chars on a 1.3 device might give you as much as ~60% of peak memory bandwidth vs. ~10% of peak memory bandwidth on a 1.0 or 1.1 device.
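For illustration, here is a minimal sketch (all names hypothetical) of staging shorts through shared memory via short2, so that each thread issues one 4-byte access and satisfies the word-size rule quoted above:

__global__ void load_shorts(const short2 *in, short2 *out, int n2)
{
    __shared__ short2 tile[128];        // staging area in shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // One 4-byte word per thread: consecutive threads in a half-warp read
        // consecutive words, which coalesces on compute capability 1.0/1.1.
        tile[threadIdx.x] = in[i];
        out[i] = tile[threadIdx.x];     // each thread touches only its own slot,
    }                                   // so no __syncthreads() is needed here
}

// launch with 128 threads per block, n2 = number of short pairs:
// load_shorts<<<(n2 + 127) / 128, 128>>>(d_in, d_out, n2);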