CUDA kernel returns 0's for other grid and block sizes? [closed] - cuda

I've been trying to use different grid and block sizes for my CUDA application, but when I print the results, any launch configuration other than <<< (size+255)/256, 256 >>> seems to return all zeros.
You can see my device properties:
Device 0: NVIDIA GeForce RTX 2060
Maximum number of threads per block: 1024
Maximum number of threads per multiprocessor: 1024
Maximum number of warps per multiprocessor: 32
Maximum number of blocks per multiprocessor: 16
Maximum number of threads per block dimension (x, y, z): (1024, 1024, 64)
Maximum grid size (x, y, z): (2147483647, 65535, 65535)
Device 0 has 30 multiprocessors.
Even if I change the kernel parameters to <<< (size-1024-1)/1024, 1024>>> it returns all zeros.
I am working with very large arrays, between 1 million and 10 million elements. I want to use all available resources on my GPU. What is the best configuration for my kernel launch?
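The question does not show the kernel, so the cause of the zeros cannot be determined from the text alone. One thing that can be checked on the host is whether a launch configuration supplies at least one thread per element. The sketch below assumes a one-thread-per-element kernel; `blocksFor` and `coversAll` are hypothetical helper names, not CUDA API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= size (ceiling division).
std::size_t blocksFor(std::size_t size, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}

// True when a launch of `blocks` blocks of `threadsPerBlock` threads
// provides at least one thread per element.
bool coversAll(std::size_t size, std::size_t blocks, std::size_t threadsPerBlock) {
    return blocks * threadsPerBlock >= size;
}
```

For size = 1000000, the expression (size-1024-1)/1024 yields 975 blocks, which is 998400 threads and so fewer than the number of elements, whereas (size+1024-1)/1024 yields 977. Under-coverage by itself would leave only the tail of the array untouched, so it is also worth checking the value returned by cudaGetLastError() after the launch to see whether the kernel ran at all.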

Related

Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link (http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.
How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)",
because those are obviously very different numbers.
Read the relevant documentation, or build and run the deviceQuery sample. In either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing runs. The kernel launch fails and a runtime error is emitted.
Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself.
If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference.
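The arithmetic in these answers can be sketched on the host. `Dim3` below is a stand-in written for illustration (it is not the CUDA type); it mirrors the fact that CUDA's dim3 defaults unspecified components to 1, which is why a scalar launch argument and a one-component dim3 describe the same launch. `volume` and `blocksForSquare` are hypothetical helper names.

```cpp
#include <cassert>

// Stand-in for CUDA's dim3, written out so the sketch builds without the
// CUDA toolkit: unspecified components default to 1, as in the real type.
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Total number of blocks (or threads) described by a Dim3.
unsigned long long volume(const Dim3& d) {
    return 1ULL * d.x * d.y * d.z;
}

// Blocks needed to supply n * n threads with the given block size.
unsigned long long blocksForSquare(unsigned long long n, unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n * n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For N = 2^10 this gives 2048 blocks of 512 threads or 1024 blocks of 1024 threads, both far below the grid limits quoted in the question.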

What do the CUDA variables mean?

What do these CUDA variables mean?
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
For example, "Maximum sizes of each dimension of a grid": does it mean there are 2147483647 grids, and each grid contains 65535 blocks?
No, those are the maximum hardware limits that you can use. The maximum block dimensions are 1024x1024x64, but the limit on threads per block is 1024, so you can use a block dimension of 1024x1x1 or 32x32x1, etc. You can't have more, but of course you can use less.
Generally, it is up to you how you set your grid and block dimensions (within the limits); it depends on what you need. The basic hierarchy is that you have a grid of blocks, and each block contains threads. So if you have grid dimensions 2x2x2 and block dimensions 16x1x1, there are 8 blocks and each block has 16 threads, so 128 threads are launched in total.
CUDA has great documentation, so I suggest you start there.
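The thread count in that example is just the product of the grid and block extents. A minimal host-side sketch, where `Extent`, `count` and `totalThreads` are names made up for illustration:

```cpp
#include <array>
#include <cassert>

// (x, y, z) extents of a grid or of a block; an illustrative type, not CUDA API.
using Extent = std::array<unsigned, 3>;

// Number of cells in an extent: x * y * z.
unsigned long long count(const Extent& e) {
    return 1ULL * e[0] * e[1] * e[2];
}

// Threads launched = (blocks in the grid) * (threads in each block).
unsigned long long totalThreads(const Extent& grid, const Extent& block) {
    return count(grid) * count(block);
}
```

With a 2x2x2 grid and 16x1x1 blocks, `count` gives 8 blocks and `totalThreads` gives 128, matching the figures above.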

How to avoid using number of threads exceeding the maximum allowed on GPU?

As described in a previous post:
how to find the number of maximum available threads in CUDA?
I found that the maximum number of threads on my GPU card is 21504. However, when I assigned more than that number to the kernel, everything seemed to run smoothly.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy()
{
}

int main()
{
    //int N=21504;
    int N=21504*40;
    dummy<<<1,N>>>();
    return 0;
}
I don't know what happened, but I believe this should be avoided, and I'm not sure how to do that.
Your example did not run correctly. It only appeared to run correctly because you did not check the CUDA error status after the kernel launch.
The comment I made on your other question also applies here:
The maximum number of threads per multiprocessor is the upper limit to how many threads can be "in flight" at the same time. Other limiting factors will normally limit the number further. This value does not affect how many threads can be launched at the same time and it is not very useful for finding out the number of threads needed for optimal performance.
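Your card is a compute capability 2.0 device. See the Features and Technical Specifications section in the CUDA Programming Guide for details on the limitations of your device. In particular, your device is limited to a grid size of 65535 in each of the X, Y and Z dimensions. You attempted to launch with a size of X = 21504*40 = 860160, Y = 1, Z = 1. Note that in the code as posted (dummy<<<1,N>>>) this value is the block size rather than the grid size, so it exceeds the per-block thread limit described below; it would also exceed the grid limit if it were used as the grid size.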
Your device is limited to 1024 threads per block. So, in theory, you can launch up to 65535 * 65535 * 65535 blocks, each with 1024 threads at the same time.
There is no performance penalty to launching kernels with many more threads than the maximum number of resident threads your device supports.
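One way to stay within these limits is to split the desired thread count into blocks on the host before launching. The sketch below is a hedged illustration: `LaunchPlan` and `planLaunch` are hypothetical names, and the limits are passed in as parameters (the values 1024 and 65535 come from the answer above; other devices report different values).

```cpp
#include <cassert>

// Hypothetical container for a 1D launch configuration.
struct LaunchPlan {
    unsigned long long blocks;   // grid size in x
    unsigned threadsPerBlock;    // block size in x
    bool fits;                   // false if blocks exceeds maxGridX
};

// Split `totalThreads` into blocks of `maxThreadsPerBlock` threads and
// report whether the block count fits in the grid's x dimension.
LaunchPlan planLaunch(unsigned long long totalThreads,
                      unsigned maxThreadsPerBlock,
                      unsigned long long maxGridX) {
    assert(maxThreadsPerBlock > 0);
    unsigned long long blocks =
        (totalThreads + maxThreadsPerBlock - 1) / maxThreadsPerBlock;
    return LaunchPlan{blocks, maxThreadsPerBlock, blocks <= maxGridX};
}
```

For the 21504*40 threads in the question, this gives 840 blocks of 1024 threads, which fits. In the actual program, the launch would still be followed by a call to cudaGetLastError() so that an invalid configuration is reported instead of being silently ignored.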

maximum number of threads per block

I have the following information:
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Does this mean that the maximum number of threads in a 2D thread block is 512x512, which gives me 262144 threads in every block?
If yes, is it good practice to have this number of threads in a kernel of at least 256 blocks?
No, it means that the maximum number of threads per block is 512.
You can decide how to lay that out over [1 ... 512] x [1 ... 512] x [1 ... 64].
For instance, 16x16 would be OK in 2D.
As for deciding on the size of the block, lots of things come into consideration, like the amount of memory a block needs and how big a half-warp is on the hardware (I don't remember if it's always 16 on Nvidia hardware).
No, that means that your block can have 512 maximum X/Y or 64 Z, but not all at the same time. In fact, your info already said the maximum block size is 512 threads.
Now, there is no optimal block, as it depends on the hardware your code is running on, and also depends on your specific algorithm.
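Both answers describe the same two-part rule: each dimension has its own cap, and the product of the three dimensions has a separate cap. A small host-side check, with `BlockLimits` and `isValidBlock` as hypothetical names and the limit values assumed from the question:

```cpp
#include <cassert>

// Device limits as reported in the question (assumed values; query your
// own device for the real ones).
struct BlockLimits {
    unsigned maxX, maxY, maxZ;     // per-dimension limits
    unsigned maxThreadsPerBlock;   // limit on x * y * z
};

// A block shape is valid only if every dimension is within its own limit
// AND the total thread count is within the per-block limit.
bool isValidBlock(unsigned x, unsigned y, unsigned z, const BlockLimits& lim) {
    if (x == 0 || y == 0 || z == 0) return false;
    if (x > lim.maxX || y > lim.maxY || z > lim.maxZ) return false;
    return 1ULL * x * y * z <= lim.maxThreadsPerBlock;
}
```

With limits of 512 x 512 x 64 and 512 threads per block, a 16x16x1 block (256 threads) passes, while 512x512x1 fails on the product even though each dimension is individually allowed.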

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

How are threads organized to be executed by a GPU?
Hardware
If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn).
Software
Threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be identified (indexed) using 1D (x), 2D (x,y) or 3D (x,y,z) indexes, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may be also indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter
the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are
being executed simultaneously).
Now a simple case: processing a 512x512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks
(so to have 512x512 threads = 4096*64)
It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
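The grid computation above divides exactly because 512 is a multiple of 8. For image sizes that are not a multiple of the block size, a common variation is to round the block count up and have out-of-range threads do nothing. The host-side sketch below mirrors the kernel's index formula so it can be checked without a GPU; `Grid2D`, `gridFor`, `globalIndex` and `inImage` are hypothetical names.

```cpp
#include <cassert>

// Number of blocks in x and y; an illustrative type, not CUDA API.
struct Grid2D { unsigned x, y; };

// Blocks per dimension, rounded up so a partial tile at the edge is covered.
Grid2D gridFor(unsigned width, unsigned height, unsigned blockX, unsigned blockY) {
    assert(blockX > 0 && blockY > 0);
    return Grid2D{(width + blockX - 1) / blockX, (height + blockY - 1) / blockY};
}

// Host-side mirror of the kernel's index computation:
// i = blockIdx.x * blockDim.x + threadIdx.x (and likewise for j).
unsigned globalIndex(unsigned blockIdx, unsigned blockDim, unsigned threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Threads in edge blocks can fall outside the image and should do nothing.
bool inImage(unsigned i, unsigned j, unsigned width, unsigned height) {
    return i < width && j < height;
}
```

For the 512x512 image this reproduces the 64 x 64 grid. For a 500x300 image it gives 63 x 38 blocks, and the kernel would then guard its body with the equivalent of `inImage(i, j, width, height)`.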
Suppose a 9800GT GPU:
it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)
https://www.tutorialspoint.com/cuda/cuda_threads.htm
A block cannot have more than 512 active threads, so __syncthreads can only synchronize a limited number of threads. For example, if you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads
Note:
The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate it if someone tested this again to get more accurate information.