Maximum number of threads per block - CUDA

I have the following information:
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Does this mean that the maximum number of threads in a 2D thread block is 512x512, which gives me 262144 threads in every block?
If yes, then is it good practice to have this number of threads in a kernel of at least 256 blocks?

No, it means that the maximum number of threads per block is 512.
You can decide how to lay that out over [1 ... 512] x [1 ... 512] x [1 ... 64].
For instance, 16x16 would be OK in 2D.
As for deciding on the size of the block, lots of things come into consideration, like the amount of memory a block needs and how big a half-warp is on the hardware (I don't remember if it's always 16 on Nvidia hardware).
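To make that concrete, here is a minimal sketch of a 16x16, 2D launch (the kernel and variable names are made up for illustration):

__global__ void scale(float *data, int n, float factor)
{
    // 2D indexing: each thread handles one element of an n x n matrix.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n)
        data[y * n + x] *= factor;
}

// Host side: 16 x 16 = 256 threads per block, grid rounded up to cover n x n elements.
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
scale<<<grid, block>>>(d_data, n, 2.0f);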

No, it means that your block can have at most 512 threads along X or Y and at most 64 along Z, but not all at the same time. In fact, your info already says the maximum block size is 512 threads.
Now, there is no single optimal block size, as it depends on the hardware your code is running on and also on your specific algorithm.
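If you just want a reasonable starting point, recent CUDA runtimes (6.5 and later) can suggest an occupancy-friendly block size for a given kernel; a minimal sketch (my_kernel is a placeholder):

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);
// blockSize is an occupancy-friendly number of threads per block for this device and kernel;
// minGridSize is the smallest number of blocks needed to fully occupy the GPU.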


Maximum number of CUDA blocks?

I want to implement an algorithm in CUDA that takes an input of size N and uses N^2 threads to execute it (this is the way the particular algorithm works). I've been asked to make a program that can handle up to N = 2^10. I think for my system a given thread block can have up to 512 threads, but for N = 2^10, having N^2 threads would mean having N^2 / 512 = 2^20 / 512 blocks. I read at this link (http://www.ce.jhu.edu/dalrymple/classes/602/Class10.pdf) that the number of blocks "can be as large as 65,535 (or larger 2^31 - 1)".
My questions are:
1) How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)", because those are obviously very different numbers.
2) Is it possible to run an algorithm that requires 2^20 / 512 threads?
3) If the number of threads that I need (2^20 / 512) is greater than what CUDA can provide, what happens? Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
4) If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
If you can provide any insight into any of these ^^ questions, I'd appreciate it.
How do I find the actual maximum number of blocks? I'm not sure what the quote ^^ meant when it said "65,535 (or larger 2^31 - 1)",
because those are obviously very different numbers.
Read the relevant documentation, or build and run the deviceQuery utility. But in either case, the limit is much larger than 2048 (which is what 2^20 / 512 equals). Note also that the block size limit on all currently supported hardware is 1024 threads per block, not 512, so you might need as few as 1024 blocks.
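If you prefer to query the limits programmatically rather than through deviceQuery, a minimal sketch using the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}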
Is it possible to run an algorithm that requires 2^20 / 512 threads[sic]?
Yes
If the number of threads[sic] that I need .... is greater than what CUDA can provide, what happens?
Nothing runs. The kernel launch fails and a runtime error is emitted.
Does it just fill all the available threads, and then re-assign those threads to the additional waiting tasks once they're done computing?
No. You would have to explicitly implement such a scheme yourself.
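If you do want a fixed-size grid to cover more work items than it has threads, the usual pattern is a grid-stride loop, sketched below (the kernel and array names are placeholders):

__global__ void process(float *data, size_t n)
{
    // Each thread starts at its global index and strides by the total
    // number of threads in the grid until all n elements are covered.
    size_t idx    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride)
        data[i] *= 2.0f;
}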
If I want to use the maximum number of threads in each block, should I just set the number of threads to 512 like <<<number, 512>>>, or is there an advantage to using a dim3 value?
There is no difference.
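The scalar form is implicitly converted to a dim3 whose y and z components are 1, so these two launches are identical (kernel and argument names are placeholders):

kernel_foo<<<number, 512>>>(args);                    // scalar form
kernel_foo<<<dim3(number), dim3(512, 1, 1)>>>(args);  // equivalent dim3 form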

Why am I allowed to run a CUDA kernel with more blocks than my GPU's CUDA core count?

Can I have more thread blocks than the number of CUDA cores? How does warp size relate to what I am doing?
I am running a cuda program using the following code to launch cuda kernels:
cuda_kernel_func<<<960, 1>>> (... arguments ...)
I thought this would be the limit of what I would be allowed to do, as I have a GTX670MX graphics processor on a laptop, which according to Nvidia's website has 960 CUDA cores.
So I tried changing 960 to 961 assuming that the program would crash. It did not...
What's going on here?
This is the output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 670MX"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221028864 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 670MX
Result = PASS
I am not sure how to interpret this information. It says here "960 CUDA cores", but then "2048 threads per multiprocessor" and "1024 threads per block".
I am slightly confused about what these things mean, and therefore what the limitations of the cuda_kernel_func<<<..., ...>>> arguments are. (And how to get the maximum performance out of my device.)
I guess you could also interpret this question as "What do all the statistics about my device mean?" For example, what actually is a CUDA core / thread / multiprocessor / texture dimension size?
It didn't crash because the number of 'CUDA cores' has nothing to do with the number of blocks. Not all blocks necessarily execute in parallel. CUDA just schedules some of your blocks after others, returning after all block executions have taken place.
You see, NVIDIA is misstating the number of cores in its GPUs, so as to make for a simpler comparison with single-threaded non-vectorized CPU execution. Your GPU actually has 5 cores in the proper sense of the word (the 5 multiprocessors in your deviceQuery output), but each of these can execute a lot of instructions in parallel on a lot of data. Note that the bona fide cores on Kepler GPUs are called "SMX"es (and are described here briefly).
So:
[Number of actual cores] x [max number of instructions a single core can execute in parallel] = [Number of "CUDA cores"]
e.g. 5 x 192 = 960 for your card.
Even this is a rough description of things, and what happens within an SMX doesn't always allow us to execute 192 instructions in parallel per cycle. For example, when each block has only 1 thread, that number goes down by a factor of 32 (!)
Thus even if you use 960 rather than 961 blocks, your execution isn't as well-parallelized as you would hope. You should really use more threads per block to utilize the GPU's capabilities for parallel execution. More importantly, you should find a good book on CUDA/GPU programming.
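Concretely, instead of launching one thread per block as in the question, pick a block size that is a multiple of the warp size and derive the block count from the problem size; a typical sketch (N stands in for your element count):

int threadsPerBlock = 256;                                   // a multiple of the warp size (32)
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // round up so all N elements are covered
cuda_kernel_func<<<blocks, threadsPerBlock>>>(/* ... arguments ... */);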
A simpler answer:
Blocks do not all execute at the same time. Some blocks may finish before others have even started. The GPU takes X blocks at a time, finishes those, grabs more blocks, and continues until all blocks are finished.
Aside: This is why __syncthreads() only synchronizes threads within a block, and not across the whole kernel.
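If you are curious how many of your blocks the hardware can actually keep resident at once, the runtime can estimate it; a sketch (using cuda_kernel_func from the question and a block size of 1 thread):

int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, cuda_kernel_func,
                                              1 /* threads per block */, 0 /* dynamic shared mem */);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Blocks that can be resident concurrently; the rest wait and are scheduled as earlier blocks finish.
printf("Up to %d blocks resident at once\n", blocksPerSM * prop.multiProcessorCount);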

Understanding GPU architecture (NVIDIA)

Specifications of graphics card:
Device 0: "GeForce GTX 650
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1058 MHz (1.06 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
I don't understand it completely. I tried looking it up on the internet and ended up getting more confused.
What I know is that I can launch kernels with blocks having a maximum of 1024 threads, which can be of the form (1024,1,1) or (32,32,1) and so on.
What is the significance of ( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores? How can I use this information for the benefit of the program? If the GPU takes care of it, I guess I shouldn't bother.
What do we mean by "Max dimension size of a thread block (x,y,z): (1024, 1024, 64)", particularly in contrast to the fact that the max number of threads in a block is 1024?
1. What is the significance of ( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores? How can I use this information for the benefit of the program? If the GPU takes care of it, I guess I shouldn't bother.
It is useful as it tells you how powerful your GPU is and lets you compare it with others. A GPU on the Kepler K10 board, for example, has the same compute capability but 8 multiprocessors (SMs) of 192 cores each, which is obviously going to deliver significantly more performance.
There's little you can do with this information yourself in most cases. When you have a number of blocks which is of the same order of magnitude as the number of SMs you have, there can be some rather sharp performance spikes if the blocks are scheduled such that in the "tail" some SMs are idle, but the way blocks map to SMs is unspecified, so optimisations based on this knowledge may be invalidated at any time by, for example, a driver update.
2. What do we mean by "Max dimension size of a thread block (x,y,z): (1024, 1024, 64)", particularly in contrast to the fact that the max number of threads in a block is 1024?
As JackOLantern said in his comment:
... it means that you can have (1024,1,1) or (1,1024,1) blocks, but not (1,1,1024) blocks. Along z, the maximum number of threads is limited to 64.
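To see why in code, you can check a candidate block shape against the per-dimension and total limits reported by the runtime; a minimal sketch:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

dim3 block(1, 1, 1024);  // 1024 threads, but all along z
bool ok = block.x <= (unsigned)prop.maxThreadsDim[0]
       && block.y <= (unsigned)prop.maxThreadsDim[1]
       && block.z <= (unsigned)prop.maxThreadsDim[2]   // fails here: the z limit is 64
       && block.x * block.y * block.z <= (unsigned)prop.maxThreadsPerBlock;
// ok is false for (1,1,1024); it would be true for (1024,1,1), (1,1024,1) or (32,32,1).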

What do these CUDA variables mean?

What do these CUDA variables mean?
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
For example, for "Maximum sizes of each dimension of a grid", does it mean there are 2147483647 grids, and each grid contains 65535 blocks?
No, those are the maximum HW limits you can use. The maximum block dimensions are 1024 x 1024 x 64, but the limit on threads per block is 1024, so you can use block dimensions such as 1024x1x1 or 32x32x1, etc. You can't have more, but of course you can use less.
Generally, it is up to you how you set your grid and block dimensions (within the limits); it depends on what you need. The very basic hierarchy is that you have a grid of blocks, and each block contains threads. So if you have grid dimensions 2x2x2 and block dimensions 16x1x1, there are 8 blocks and each block has 16 threads, so there are 128 threads running.
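A sketch of that 2x2x2 / 16x1x1 example (the kernel name is made up), where each of the 128 threads computes a unique global index:

__global__ void hello()
{
    // Linearize the block index within the 2x2x2 grid, then the thread index within the grid.
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y;
    int threadId = blockId * blockDim.x + threadIdx.x;  // block is 16x1x1, so only threadIdx.x varies
    printf("global thread %d\n", threadId);             // prints 0 .. 127 (in no particular order)
}

dim3 grid(2, 2, 2);    // 8 blocks
dim3 block(16, 1, 1);  // 16 threads per block
hello<<<grid, block>>>();  // 8 * 16 = 128 threads in total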
CUDA has great documentation, so I suggest you start there.

CUDA Block parallelism

I am writing some code in CUDA and am a little confused about what is actually run in parallel.
Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have 96 threads running concurrently at a time? (See device_query below.)
I wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel and how? I mean, is it done one block after the other or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? If yes, then how many threads can each block have running concurrently (8, 96, or 512) when all 12 blocks are also running concurrently? (See device_query below.)
Another question is that if A had a value of ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume there is no thread synchronization required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.