Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link (http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.

How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)",
because those are obviously very different numbers.
Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
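For reference, a minimal sketch of querying these limits with the runtime API (this is essentially what deviceQuery prints):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}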
Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing runs. The launch fails and a runtime error is emitted.
Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself.
If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference.
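For illustration, the two launch forms below are identical: scalar arguments in the execution configuration are implicitly converted to dim3 with the remaining dimensions set to 1 (myKernel and number are hypothetical):

__global__ void myKernel(int n) { }

int main()
{
    int number = 2048, n = 1 << 20;
    myKernel<<<number, 512>>>(n);                         // scalar form
    myKernel<<<dim3(number, 1, 1), dim3(512, 1, 1)>>>(n); // dim3 form
    return cudaDeviceSynchronize() != cudaSuccess;
}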

Related

Why is it said max blocks per grid dimension are 65535?

For a constant block size of 128 (cores per MP):
I did a performance comparison of a 2D grid with dimensions dim3(WIDTH, HEIGHT) versus a flattened 1D grid of WIDTH * HEIGHT blocks, where WIDTH and HEIGHT can be arbitrarily large values representing a 2D array/matrix, so long as int doesn't overflow in C.
According to my research, such as this answer here (Maximum blocks per grid: CUDA), only 65535 blocks should be supported in a single dimension.
Yet with WIDTH = 4000, HEIGHT = 4000, the speed results end up essentially the same over multiple trials regardless of whether the grid has 1 dimension or 2. E.g., given gridDim { x = 125000, y = 1, z = 1 }, I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).
I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between dim3 and a flattened 1D grid, with 128 threads per block (the number of cores per MP), is the same in my tests. What's the point, then, of using dim3 with multiple dimensions instead of a single dimension for the grid size?
Could someone please enlighten me here?
Thank you.
As can be seen in Table 15. Technical Specifications per Compute Capability, the x-dimension is not restricted to 65535 like the other two dimensions, instead it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer as this seems like an old implementation detail.
The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.
The dimensionality of the grid does not matter for "wasting cores". The number of threads per block (together with the use of shared memory and registers) is what is important for utilization. And even there, the dimensionality just makes the code easier to write and read, as many GPU use-cases are not one-dimensional.
The number of blocks in a grid (together with the number of blocks that can fit onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.
I have never seen any information about the dimensionality of the grid or blocks mattering directly to performance in a way that could not be emulated using 1D grids and blocks (i.e. 2D tiles for e.g. matrix multiplication are important for performance, but one could emulate them and use 1D blocks instead), so I view them just as a handy abstraction to keep index computations in user code at a reasonable level.
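As a minimal illustration of that last point, here is a hypothetical elementwise kernel written both ways; the 2D index can always be recovered arithmetically from a 1D grid:

__global__ void scale2D(float *a, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        a[y * width + x] *= 2.0f; // natural (x, y) indexing from a 2D grid
}

__global__ void scale1D(float *a, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < width * height)
        a[i] *= 2.0f; // same work; the 2D index would be (i % width, i / width)
}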

Should I check the number of threads in kernel code?

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x * blockIdx.y * gridDim.x // rows preceding current row in grid
                 + blockDim.x * blockIdx.x             // blocks preceding current block
                 + threadIdx.x;                        // thread index within current block
    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}
They think there are some situations where CUDA might launch more threads than specified, for alignment/warp-size reasons, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, due to the granularity of block dimensions (e.g., it's desirable for block dimensions to be a multiple of 32, and the block size is limited to 1024 threads, or 512 on older hardware), it is frequently difficult to match a grid of threads numerically to the desired problem size.
In these cases, the typical behavior is to launch more threads, effectively rounding up to the next even size based on the block granularity, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
if (threadId < problem_size)
which communicates the intent: only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
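A minimal sketch of that example (d_a, d_b, and d_c are assumed device pointers, allocated and populated elsewhere):

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) // thread check: the 240 "extra" threads do nothing
        c[i] = a[i] + b[i];
}

// Launch: round the grid size up so every element is covered.
int n = 10000;
int threads = 256;
int blocks = (n + threads - 1) / threads; // 40 blocks -> 10240 threads
vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);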

What is the maximum block count possible in CUDA?

Theoretically, you can have 65535 blocks per dimension of the grid, up to 65535 * 65535 * 65535.
If you call a kernel like this:
kernel<<< BLOCKS,THREADS >>>()
(without dim3 objects), what is the maximum number available for BLOCKS?
In an application of mine, I've set it to 192000 and it seemed to work fine... The problem is that the kernel I used changes the contents of a huge array, so although I checked some parts of the array and they seemed fine, I can't be sure whether the kernel behaved strangely at other parts.
For the record I have a 2.1 GPU, GTX 500 ti.
With compute capability 3.0 or higher, you can have up to 2^31 - 1 blocks in the x-dimension, and at most 65535 blocks in the y and z dimensions. See Table H.1. Feature Support per Compute Capability of the CUDA C Programming Guide Version 9.1.
As Pavan pointed out, if you do not provide a dim3 for grid configuration, you will only use the x-dimension, hence the per dimension limit applies here.
In case anybody lands here based on a Google search (as I just did):
Nvidia changed the specification since this question was asked. With compute capability 3.0 and newer, the x-dimension of a grid of thread blocks is allowed to be up to 2,147,483,647, or 2^31 - 1.
See the current: Technical Specification
65535 in a single dimension. Here's the complete table
I manually checked on my laptop (MX130): the program crashes when #blocks > 678*1024 + 651, each block with 1 thread. Adding even a single block beyond that gives a segfault. The kernel code used no grid structure, only linear (1D) indexing.
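On the question's "I can't be sure whether the kernel behaved strangely": an oversized grid is not silently truncated, it is reported as a launch error, so one way to be sure is to check for errors explicitly. A minimal sketch, using kernel, BLOCKS, and THREADS from the question:

kernel<<<BLOCKS, THREADS>>>();
cudaError_t err = cudaGetLastError(); // catches an invalid launch configuration
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize(); // catches errors during kernel execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(err));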

How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

I have a question about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if (K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
store the 32 bit word in shared memory
after the shared memory has been populated, (and __syncthreads() called) then each thread K can process work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word K / numElementsPerWord. This would seem to not coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N + 3) / 4, such that each thread will be responsible for loading and processing (4/sizeof(T)) elements (each thread processes one 32-bit word - possibly 1, 2, or 4 elements depending on sizeof(T)). However I am concerned that this approach will not be as fast as possible since each thread must then do twice (if T=short) or even quadruple (if T=unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?
You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, thread 0 of a given warp (collections of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
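A sketch of the scheme the answer describes, under assumptions not stated in the original (T = short, n a multiple of blockDim.x, blockDim.x even, and the input pointer 4-byte aligned); the first half of the threads issue contiguous 32-bit loads, which coalesce, and each thread then computes its own element from shared memory (the bank-conflict mitigation from point 2 is omitted for brevity):

__global__ void process(const short *in, short *out, int n)
{
    extern __shared__ short s[]; // launch with blockDim.x * sizeof(short) bytes
    int tid = threadIdx.x;
    int base = blockIdx.x * blockDim.x; // first element handled by this block

    // Reinterpret the block's slice of the input as 32-bit words: thread k
    // loads word k, so consecutive threads read consecutive words (coalesced).
    const int *in32 = reinterpret_cast<const int *>(in + base);
    int *s32 = reinterpret_cast<int *>(s);
    if (tid < blockDim.x / 2)
        s32[tid] = in32[tid];
    __syncthreads();

    // Each thread K now works on element K from shared memory.
    int gid = base + tid;
    if (gid < n)
        out[gid] = s[tid] + 1; // placeholder computation
}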

maximum number of threads per block

I have the following information:
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Does this mean that the maximum number of threads in a 2D thread block is 512 x 512, which gives me 262144 threads in every block?
If yes, then is it good practice to have this number of threads in a kernel of at least 256 blocks?
No, it means that the maximum number of threads per block is 512.
You can decide how to lay that out over [1 ... 512] x [1 ... 512] x [1 ... 64], as long as the product of the dimensions does not exceed 512.
For instance, 16 x 16 would be OK in 2D.
As for deciding on the size of the block, lots of things come into consideration, like the amount of memory a block needs and how big a half-warp is on the hardware (I don't remember if it's always 16 on Nvidia hardware).
No, it means that a single block dimension can be at most 512 in X or Y, or 64 in Z, but not all at the same time: the total number of threads per block cannot exceed 512. In fact, your info already said the maximum block size is 512 threads.
Now, there is no universally optimal block size, as it depends on the hardware your code is running on, and also on your specific algorithm.
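A minimal sketch illustrating the product limit (dummy is a hypothetical empty kernel; on a 512-thread-limit device the second launch fails even though each dimension is individually within range):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() { }

int main()
{
    dim3 ok(16, 16); // 16 * 16 = 256 threads: within the 512 limit
    dummy<<<1, ok>>>();
    printf("16x16:   %s\n", cudaGetErrorString(cudaGetLastError()));

    dim3 tooBig(512, 512); // 262144 threads: exceeds the per-block limit
    dummy<<<1, tooBig>>>();
    printf("512x512: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}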