Discovering my GPU capabilities - cuda

I am trying to understand how the memory organization of my GPU is working.
According to the technical specification which are tabulated below my GPU can have 8 active blocks/SM and 768 threads/SM. Based on that I was thinking that in order to take advantage of the above each block should have 96 (=768/8) threads. The closest block that has this number of threads I think it is a 9x9 block, 81 threads. Using the fact that 8 blocks can run simultaneously in one SM we will have 648 threads. What about the rest 120 (= 768-648)?
I know that something wrong is happening with these thoughts. A simple example describing the connection between the maximum number of SM threads the maximum number of threads per block and the warp size based on my GPU specifications it would be very helpful.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0

You could find the technical specification of your device in the cuda programming guide as follows, rather than the output of a sample program of cuda.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
From the hardware point of view, we generally try to maximize the warp occupancy per Multiprocessor (SM) to get max performance. The max occupancy is limited by 3 types of hardware resources: #warp/SM, #register/SM and #shared memory/SM.
You could try the following tool in your cuda installation dir to understand how to do the calculation. It will give you a clearer understanding of the connections between #threads/SM, #threads/block, #warp/SM, etc.
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls

Related

How to maximise the use of the GPU without having blocks waiting to be scheduled?

A device query on my Titan-XP shows that I have 30 multiprocessors with a maximum number of 2048 threads per multiprocessor. Is it correct to think that the maximum number of threads that can simultaneously be executed physically on the hardware is 30 * 2048? I.e: will a kernel configuration like the following exploit this?
kernel<<<60, 1024>>>(...);
I'd really like to physically have the maximum number of blocks executing while avoiding having blocks waiting to be scheduled. Here's the full output of device query:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "TITAN Xp"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12190 MBytes (12781682688 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5705 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = TITAN Xp
Result = PASS
Yes, your conclusion is correct. The maximum number of threads that can be "in-flight" is 2048 * # of SMs for all GPUs supported by CUDA 9 or CUDA 9.1. (Fermi GPUs, supported by CUDA 8, are a bit lower at 1536 * # of SMs)
This is an upper bound, and the specifics of your kernel (resource utilization) may mean that fewer than this number can actually be "resident" or "in flight". This is in the general topic of GPU occupancy. CUDA includes an occupancy calculator spreadsheet and also a programmatic occupancy API to help determine this, for your specific kernel.
The usual kernel strategy to have a limited number of threads (e.g. 60 * 1024 in your case) handle an arbitrary data set size is to use some form of a construct called a grid striding loop.

Why am I allowed to run a CUDA kernel with more blocks than my GPU's CUDA core count?

Comments / Notes
Can I have more thread blocks than the maximum number of CUDA cores?
How does warp size relate to what I am doing?
Begin
I am running a cuda program using the following code to launch cuda kernels:
cuda_kernel_func<<<960, 1>>> (... arguments ...)
I thought this would be the limit of what I would be allowed to do, as I have a GTX670MX graphics processor on a laptop, which according to Nvidia's website has 960 CUDA cores.
So I tried changing 960 to 961 assuming that the program would crash. It did not...
What's going on here?
This is the output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 670MX"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221028864 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 670MX
Result = PASS
I am not sure how to interpret this information. It says here "960 CUDA cores", but then "2048 threads per multiprocessor" and "1024 threads per block".
I am slightly confused about what these things mean, and therefore what the limitations of the cuda_kernel_func<<<..., ...>>> arguments are. (And how to get the maximum performance out of my device.)
I guess you could also interpret this question as "What do all the statistics about my device mean." For example, what actually is a CUDA core / thread / multiprocessor / texture dimension size?
It didn't crash because the number of 'CUDA cores' has nothing to do with the number of blocks. Not all blocks necessarily execute in parallel. CUDA just schedules some of your blocks after others, returning after all block executions have taken place.
You see, NVIDIA is misstating the number of cores in its GPUs, so as to make for a simpler comparison with single-threaded non-vectorized CPU execution. Your GPU actually has 6 cores in the proper sense of the word; but each of these can execute a lot of instructions in parallel on a lot of data. Note that the bona-fide cores on a Kepler GPUs are called "SMx"es (and are described here briefly).
So:
[Number of actual cores] x [max number of instructions a single core can execute in parallel] = [Number of "CUDA cores"]
e.g. 6 x 160 = 960 for your card.
Even this is a rough description of things, and what happens within an SMx doesn't always allow us to execute 160 instructions in parallel per cycle. For example, when each block has only 1 thread, that number goes down by a factor of 32 (!)
Thus even if you use 960 rather than 961 blocks, your execution isn't as well-parallelized as you would hope. And - you should really use more threads per block to utilize a GPU's capabilities for parallel execution. More importantly, you should find a good book on CUDA/GPU programming.
A simpler answer:
Blocks do not all execute at the same time. Some blocks may finish before others have even started. The GPU takes does X amount of blocks at a time, finishes those, grabs more blocks, and continues until all blocks are finished.
Aside: This is why thread_sync only synchronizes threads within a block, and not within the whole kernel.

Understanding GPU architecture (NVIDIA)

Specifications of graphics card:
Device 0: "GeForce GTX 650
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1058 MHz (1.06 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
>Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
I don't understand it completely. I tried looking up in the internet and ended up getting more confused.
What I know is I can launch kernels with blocks having maximum threads 1024 which can be of the form (1024,1,1) or (32,32,1) and so on.
What is the significance of ( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores? How can I use this information for the benefit of the program? If the GPU takes cares of it, I guess I shouldn't bother.
What do we mean by Max dimension size of a thread block (x,y,z): (1024, 1024, 64)? particularly, in contrast to the fact that max no. of threads in a block is 1024.
What is the significance of ( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores? How can I use this information for the benefit of the program? If the GPU takes cares of it, I guess I shouldn't bother.
It is useful as it tells you how powerful your GPU is, and lets you compare with others. A GPU on the Kepler K10 board, for example, has the same compute capability, but 8 multiprocessors (SM) of 192 cores each. This is obviously going to have rather significantly more performance.
There's little you can do with this information yourself in most cases. When you have a number of blocks which is of the same order of magnitude as the number of SMs you have, there can be some rather sharp performance spikes if the blocks are scheduled such that in the "tail" some SMs are idle, but the way blocks map to SMs is unspecified, so optimisations based on this knowledge may be invalidated at any time by, for example, a driver update.
2 . What do we mean by Max dimension size of a thread block (x,y,z): (1024, 1024, 64)? particularly, in contrast to the fact that max no. of threads in a block is 1024.
As JackOLantern said in his comment:
... it means that you can have (1024,1,1) or (1,1024,1) blocks, but not (1,1,1024) blocks. Along z, the maximum number of threads is limited to 64.

Number of threads in GeForce GTX 560Ti

I ran the deviceQuery and got the following result
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 560 Ti"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
(8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1701 MHz (1.70 GHz)
Memory Clock rate: 2052 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D= (2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 560 Ti
My understanding is that I can create maximum of 65535 x 65535 x 65535 blocks with 1024 threads per block. Does that I can have 65535 x 65535 x 65535 x 1024 threads of maximum ? If not what is the maximum number of threads I can have ?
Can anyone clarify this doubt ?
Your understanding is correct. You can launch 65535 x 65535 x 65535 x 1024 threads theoretically but due to resource constraints you may be not able to hit the maximum.
You can't just multiply all the maximum grid dimensions and assume that you can have that many threads, unfortunately. You have 8 MPs and a maximum number of threads per MP = 1536, so that makes 8 * 1536 = 12288 threads max.

CUDA Block parallelism

I am writing some code in CUDA and am a little confused about the what is actually run parallel.
Say I am calling a kernel function like this: kenel_foo<<<A, B>>>. Now as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so that means I can have 96 threads running concurrently at a time? (See device_query below).
I wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel and how? I mean is it done one block after the other or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? If yes, the each block can have a maximum of how many threads, 8, 96 or 512 running concurrently when all the 12 blocks are also running concurrently? (See device_query below).
Another question is that if A had a value ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assuming there is no thread syncronization required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.