CUDA occupancy calculator

I used --ptxas-options=-v while compiling my .cu code, and it gave the following:
ptxas info: Used 74 registers, 124 bytes smem, 16 bytes cmem[1]
deviceQuery for my card returns the following:
rev: 2.0
name: tesla c2050
total shared memory per block: 49152
total reg. per block: 32768
Now, I input this data into the CUDA Occupancy Calculator as follows:
1.) 2.0
1.b) 49152
2.) threads per block: x
registers per thread: 74
shared memory per block (bytes): 124
I was varying x (threads per block) so that x*74 <= 32768; for example, I entered 128 (or 256) in place of x. Am I entering all the values required by the Occupancy Calculator correctly? Thanks.

--ptxas-options=--verbose (or -v) produces output of the format:
ptxas : info : Compiling entry function '_Z13matrixMulCUDAILi16EEvPfS0_S0_ii' for 'sm_10'
ptxas : info : Used 15 registers, 2084 bytes smem, 12 bytes cmem[1]
The critical information is:
The 1st line has the target architecture.
The 2nd line has <Registers Per Thread>, <Static Shared Memory Per Block>, and <Constant Memory Per Kernel>.
When you fill in the Occupancy Calculator:
Set field 1.) Select Compute Capability to the target architecture (1.0 for 'sm_10' in the above example).
Set field 2.) Registers Per Thread to <Registers Per Thread> (15 in the example).
Set field 2.) Shared Memory Per Block to <Static Shared Memory Per Block> (2084 in the example) + the DynamicSharedMemoryPerBlock passed as the 3rd parameter to <<<GridDim, BlockDim, DynamicSharedMemoryPerBlock, Stream>>>.
The Occupancy Calculator Help tab contains additional information.
In your example I believe you are not correctly setting field 1: the Fermi architecture (compute capability 2.x) is limited to 63 registers per thread, while sm_1* supports a limit of 124 registers per thread. Since ptxas reported 74 registers, your code was most likely compiled for an sm_1x target rather than for your compute capability 2.0 card.
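As a cross-check on the spreadsheet, you can also ask the runtime directly. Below is a minimal sketch using the occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, added in CUDA 6.5, so newer than the toolkits contemporary with the C2050); myKernel and its parameter are stand-ins, not code from the question:

#include <cstdio>

__global__ void myKernel(float *data) { /* hypothetical kernel body */ }

int main() {
    int blockSize = 128;      // threads per block, the "x" being varied above
    size_t dynSmem = 0;       // dynamic shared memory per block, if any
    int numBlocks = 0;
    // How many blocks of this kernel fit on one SM at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynSmem);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Blocks/SM: %d, occupancy: %.1f%%\n",
           numBlocks, 100.0 * activeWarps / maxWarps);
    return 0;
}

Compile with nvcc and the same -arch you intend to run with, since register counts (and therefore occupancy) vary by target architecture.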

Related

Nsight Compute not showing achieved occupancy in the metrics

I want to calculate the achieved occupancy and compare it with the value that is being displayed in Nsight Compute.
ncu says: Theoretical Occupancy [%] 100, and Achieved Occupancy [%] 93.04. What parameters do I need to calculate this value?
I can see the theoretical occupancy using the occupancy api, which comes out as 1.0 or 100%.
I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum. I can see the formula to calculate the achieved occupancy in this SO answer.
A few details, if that might help:
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 7
Minor revision number: 5
Name: NVIDIA GeForce GTX 1650
Total global memory: 4093181952
Total constant memory: 65536
Total shared memory per block: 49152
Total registers per block: 65536
Total registers per multiprocessor: 65536
Warp size: 32
Maximum threads per block: 1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor: 16
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1515000
Maximum memory pitch: 2147483647
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: Yes
ptxas info : Used 18 registers, 360 bytes cmem[0]
Shorter:
In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
Longer:
The metrics you have indicated:
I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum.
are not applicable to nsight compute. NVIDIA has made a variety of different profilers, and these metric names apply to other profilers. The article you reference refers to a different profiler (the original profiler on Windows used the nsight name but was not nsight compute).
This blog article discusses different ways to get valid nsight compute metric names with references to documentation links that present the metrics in different ways.
I would also point out for others that nsight compute has a whole report section dedicated to occupancy, and so for typical interest, that is probably the easiest way to go. Additional instructions for how to run nsight compute are available in this blog.
To come up with metrics that represent occupancy the way the nsight compute designers intended, my suggestion would be to look at their definitions. Each report section in nsight compute has "human-readable" files that indicate how the section is assembled. Since there is a report section for occupancy that includes reporting both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.
The methodology for how the occupancy section is computed is contained in 2 files which are part of a CUDA install. On a standard linux CUDA install, these will be in /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The python file gives the exact names of the metrics that are used as well as the calculation method(s) for other displayed topics related to occupancy (e.g. notes, warnings, etc.) In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
You could retrieve both the Occupancy section report (which is part of the "default" "set") as well as these specific metrics with a command line like this:
ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./my-app
Here is an example output from such a run:
$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./t2140
Testing with mask size = 3
==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.
________________________________________________________________________
Testing with mask size = 5
==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.
________________________________________________________________________
Testing with mask size = 7
==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.
________________________________________________________________________
Testing with mask size = 9
==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.
________________________________________________________________________
==PROF== Disconnected from process 31551
[31551] t2140@127.0.0.1
convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 815.21
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 47,929
Memory [%] % 23.96
DRAM Throughput % 15.23
Duration usecond 42.08
L1/TEX Cache Throughput % 26.90
L2 Cache Throughput % 10.54
SM Active Cycles cycle 42,619.88
Compute (SM) [%] % 37.09
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 38
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1,048,576
Waves Per SM 12.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 1
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 40.42
Achieved Active Warps Per SM warp 25.87
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 771.98
SM Frequency cycle/nsecond 1.07
Elapsed Cycles cycle 31,103
Memory [%] % 40.61
DRAM Throughput % 24.83
Duration usecond 29.12
L1/TEX Cache Throughput % 46.39
L2 Cache Throughput % 18.43
SM Active Cycles cycle 27,168.03
Compute (SM) [%] % 60.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,156
Registers Per Thread register/thread 31
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 4.10
Threads thread 1,183,744
Waves Per SM 7.22
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 24
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 84.01
Achieved Active Warps Per SM warp 53.77
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
as well as across blocks of the same kernel.
<sections repeat for each kernel launch>
$
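As a sanity check, the achieved occupancy reported above can be reproduced from the warp counts in the same report. For convolution_2D, Achieved Active Warps Per SM is 25.87 and the device used for this run evidently supports 64 warps per SM (32 theoretical active warps corresponds to 50% theoretical occupancy), so 25.87 / 64 = 40.42%, matching both the Occupancy section and sm__warps_active.avg.pct_of_peak_sustained_active. For convolution_2D_tiled, 53.77 / 64 = 84.01%, again matching.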

Interpreting the verbose output of ptxas, part II

This question is a continuation of Interpreting the verbose output of ptxas, part I .
When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with --ptxas-options=-v, we get a few lines of output such as:
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
(the same example as in the linked-to question, but with name demangling)
This question regards the last line. A few more examples from other kernels:
ptxas info : Used 19 registers, 336 bytes cmem[0], 4 bytes cmem[2]
...
ptxas info : Used 19 registers, 336 bytes cmem[0]
...
ptxas info : Used 6 registers, 16 bytes smem, 328 bytes cmem[0]
How do we interpret the information on this line, other than the number of registers used? Specifically:
Is cmem short for constant memory?
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
smem probably stands for shared memory; is it only static shared memory?
Under which conditions does each kind of entry appear on this line?
Is cmem short for constant memory?
Yes
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
They represent different constant memory banks. cmem[0] is the reserved bank for kernel arguments and statically sized constant values.
smem probably stands for shared memory; is it only static shared memory?
It is only static shared memory; dynamic shared memory is specified at launch time, so ptxas cannot know about it.
Under which conditions does each kind of entry appear on this line?
Mostly answered here.
Collected and reformatted...
Resources on the last ptxas info line:
registers - in the register file on every SM (multiprocessor)
gmem - Global memory
smem - Static Shared memory
cmem[N] - Constant memory bank with index N.
cmem[0] - Bank reserved for kernel arguments and statically-sized constant values
cmem[2] - ???
cmem[4] - ???
cmem[14] - ???
Each of these categories will be shown if the kernel uses any such memory (Registers - probably always shown); thus it is no surprise all the examples show some cmem[0] usage.
You can read a bit more on the CUDA memory hierarchy in Section 2.3 of the Programming Guide and the links there. Also, there's this blog post about static vs dynamic shared memory.
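To see these categories in practice, here is a minimal sketch (an illustrative kernel, not taken from any question above) that should make ptxas report both static shared memory and constant memory usage; exact byte counts will vary with architecture and toolkit version:

__constant__ float coeffs[16];               // statically-sized constant data

__global__ void demo(float *out, const float *in, int n) {
    __shared__ float tile[256];              // static shared memory: counted in "bytes smem"
    extern __shared__ float dyn[];           // dynamic shared memory: NOT counted by ptxas,
                                             // since its size is only known at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // assumes blockDim.x <= 256
    if (i < n) {
        tile[threadIdx.x] = in[i] * coeffs[threadIdx.x % 16];
        dyn[threadIdx.x]  = tile[threadIdx.x];
        out[i] = dyn[threadIdx.x];
    }
}

// Launched as, e.g.: demo<<<grid, 256, 256 * sizeof(float)>>>(out, in, n);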

NVCC ptxas=-v output

I compiled my program with "nvcc -ccbin=icpc source/* -Iinclude -arch=sm_35 --ptxas-options=-v". The output is below:
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : 450 bytes gmem
ptxas info : Compiling entry function '_Z21process_full_instancePiPViS1_S_' for 'sm_35'
ptxas info : Function properties for _Z21process_full_instancePiPViS1_S_
408 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 174 registers, 9748 bytes smem, 352 bytes cmem[0]
I think gmem refers to global memory, but why do the first and third lines have different values (0 vs 450) for gmem?
smem is shared memory, how about cmem?
Is the memory usage for a block or an SM (streaming multiprocessor)? Blocks are dynamically assigned to SMs. Can we infer how many blocks will concurrently run on an SM?
My GPU is K20.
smem is shared memory, how about cmem?
cmem stands for constant memory
gmem stands for global memory
smem stands for shared memory
lmem stands for local memory
stack frame is part of local memory
spill loads and stores use part of the stack frame
Is the memory usage for a block or an SM (streaming multiprocessor)?
No, the number of registers is per thread while the shared memory is per block.
Can we infer how many blocks will concurrently run on an SM?
No. Since you can't determine the number of threads per block from the compiler output, you cannot calculate the resources each block requires.
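Once you do fix a block size, though, a back-of-the-envelope version of the calculation is possible. As an illustration only (the 256-thread block size is an assumption, and this ignores the register-allocation granularity that the occupancy calculator models), using the numbers above on a K20 (sm_35: 65536 registers and 49152 bytes of shared memory per SM):

registers: 174 regs/thread * 256 threads/block = 44544 regs/block, so floor(65536 / 44544) = 1 block/SM
shared memory: floor(49152 / 9748) = 5 blocks/SM

So at 256 threads per block this kernel would be register-limited to a single concurrent block per SM.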

CUDA Block parallelism

I am writing some code in CUDA and am a little confused about what actually runs in parallel.
Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now, as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have only 96 threads running concurrently at a time? (See device_query below.)
I wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after the other, or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads in each block can run concurrently when all 12 blocks are running: 8, 96, or 512? (See device_query below.)
Another question is: if A had a value of ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume no thread synchronization is required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.
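To make those pointers concrete for this device: a compute capability 1.0 SM can hold at most 768 resident threads, so only one 512-thread block fits per SM (a second would need 1024 > 768), and across 12 SMs that is 12 resident blocks, i.e. 12 * 512 = 6144 resident threads (assuming registers and shared memory don't limit it first). Only 96 threads (one per CUDA core) are executing in any given cycle, but the hardware switches among resident warps at essentially no cost to hide latency, which is why keeping many more resident threads than cores is the intended mode of operation.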

CUDA: Chasing an insufficient resource issue

The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with: kernelA<<<20,512>>>(float parmA, int paramB); and it will run fine.
Launch with: kernelA<<<20,513>>>(float parmA, int paramB); and it get the out of resources error. (too many resources requested for launch).
The Fermi device properties: 48KB of shared mem per SM, constant mem 64KB, 32K registers per SM, 1024 maximum threads per block, comp capable 2.1 (sm_21)
I'm using all my shared mem space.
I'll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I commented out the printf's. Reg count = 45.
When it was running, it had the printf's coded. Reg count = 63, with plenty of spills.
I suspect each thread really has 64 registers, with only 63 available to the program.
64 registers * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of available "code" registers for a block = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (beyond that, values spill over to lmem). I suspect this 63 is a HW addressing limitation.
So it looks like I'm running out of register space.
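A quick check of that arithmetic: at an effective 64 registers per thread, 64 * 512 = 32768, exactly the 32K register file, while 64 * 513 = 32832 exceeds it, which matches the failure at 513 threads. If registers are indeed the limiter, one possible workaround (a sketch, with the caveat that capping registers can force spills to local memory and slow the kernel down) is to bound register usage at compile time, either globally with the nvcc flag --maxrregcount=N or per kernel with __launch_bounds__:

// Sketch: allow 1024-thread blocks on this sm_21 device.
// 32768 registers per SM / 1024 threads = at most 32 registers per thread.
__global__ void __launch_bounds__(1024) kernelA(float parmA, int paramB)
{
    // ... kernel body ...
}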