Understanding counters in CUDA profiler

I'm having difficulty understanding the sm_cta_launched counter in the CUDA profiler. I'm launching 128 blocks and my launch bounds are __launch_bounds__(192,8), but the profiler shows 133 for a particular run. I profiled the app several times and it is around 133 every time. What does this counter indicate? Using a Tesla C2075, Linux 32-bit.

NVIDIA GPUs have performance monitor units in multiple locations on the chip. On Fermi devices the sm_cta_launched signal is collected by the GPC monitor, not the SM monitor. The Fermi GPC performance monitor is limited to observing 1 SM per GPC. The C2075 has 4 GPCs and 14 SMs, so it could be configured as 2 GPCs with 4 SMs and 2 GPCs with 3 SMs. The CUDA profiler collects the counter for 1 SM per GPC and multiplies the result by the number of SMs in that GPC, so the final value can be higher or lower than the expected value. For example:
GPC  SMs  Counter  Extrapolated (Counter x SMs)
 0    4      8      32
 1    4      8      32
 2    3     11      33
 3    3     12      36
---------------------------------------------
                    Total: 133
In the document Compute Command Line Profiler this information is specified under the countermodeaggregate option.
countermodeaggregate
If this option is selected then aggregate counter values will be
output. For a SM counter the counter value is the sum of the counter
values from all SMs. For l1*, tex*, sm_cta_launched,
uncached_global_load_transaction and global_store_transaction counters
the counter value is collected for 1 SM from each GPC and it is
extrapolated for all SMs. This option is supported only for CUDA
devices with compute capability 2.0 or higher.
A more accurate value can be obtained from warps_launched, which is collected per SM, using the formula:
thread_blocks_launched = warps_launched
    / (((threadblocksizeX * threadblocksizeY * threadblocksizeZ) + WARP_SIZE - 1) / WARP_SIZE)
where WARP_SIZE is 32 on all current devices. The divisor is simply the number of warps per block, i.e. the block size rounded up to a whole number of warps.
NOTE: This approach will not be correct for Dynamic Parallelism.
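As a quick illustration, here is that calculation for the question's 192-thread blocks (a minimal sketch; the warps_launched value of 768 is hypothetical, simply what 128 such blocks would produce):

#include <cstdio>

int main(void)
{
    const unsigned int WARP_SIZE = 32;
    unsigned int block_threads = 192;                                            // X * Y * Z
    unsigned int warps_per_block = (block_threads + WARP_SIZE - 1) / WARP_SIZE;  // = 6
    unsigned long long warps_launched = 768;                                     // read from the profiler
    printf("thread_blocks_launched = %llu\n", warps_launched / warps_per_block); // prints 128
    return 0;
}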

Some CUDA library functions are also implemented using kernels internally, so it is not surprising that the total number of blocks executed is slightly higher than what you explicitly launched yourself.

Related

NSight Compute not showing achieved occupancy in the metrics

I want to calculate the achieved occupancy and compare it with the value that is being displayed in Nsight Compute.
ncu says: Theoretical Occupancy [%] 100, and Achieved Occupancy [%] 93.04. What parameters do I need to calculate this value?
I can see the theoretical occupancy using the occupancy API, which comes out as 1.0 or 100% (a sketch of that calculation is included after the device details below).
I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum, but all of them say: Failed to find metric sm__active_warps_sum. I can see the formula to calculate the achieved occupancy from this SO answer.
A few details, in case they help:
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 7
Minor revision number: 5
Name: NVIDIA GeForce GTX 1650
Total global memory: 4093181952
Total constant memory: 65536
Total shared memory per block: 49152
Total registers per block: 65536
Total registers per multiprocessor: 65536
Warp size: 32
Maximum threads per block: 1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor: 16
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1515000
Maximum memory pitch: 2147483647
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: Yes
ptxas info : Used 18 registers, 360 bytes cmem[0]
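For context, the occupancy-API calculation referred to above looks roughly like this (a sketch with a hypothetical kernel my_kernel and a 1024-thread block; not the asker's actual code):

#include <cstdio>

__global__ void my_kernel() { }                      // placeholder for the real kernel

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 1024;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, my_kernel,
                                                  block_size, 0 /* dynamic smem */);

    double theoretical = (double)(max_blocks_per_sm * block_size)
                       / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", theoretical * 100);  // 100% on this GTX 1650
    return 0;
}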
Shorter:
In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
Longer:
The metrics you have indicated:
I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum.
are not applicable to nsight compute. NVIDIA has made a variety of different profilers, and these metric names apply to other profilers. The article you reference refers to a different profiler (the original profiler on windows used the nsight name but was not nsight compute.)
This blog article discusses different ways to get valid nsight compute metric names with references to documentation links that present the metrics in different ways.
I would also point out for others that nsight compute has a whole report section dedicated to occupancy, and so for typical interest, that is probably the easiest way to go. Additional instructions for how to run nsight compute are available in this blog.
To come up with metrics that represent occupancy the way the nsight compute designers intended, my suggestion would be to look at their definitions. Each report section in nsight compute has "human-readable" files that indicate how the section is assembled. Since there is a report section for occupancy that includes reporting both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.
The methodology for how the occupancy section is computed is contained in 2 files which are part of a CUDA install. On a standard linux CUDA install, these will be in /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The python file gives the exact names of the metrics that are used as well as the calculation method(s) for other displayed topics related to occupancy (e.g. notes, warnings, etc.) In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
You could retrieve both the Occupancy section report (which is part of the "default" "set") as well as these specific metrics with a command line like this:
ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./my-app
Here is an example output from such a run:
$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./t2140
Testing with mask size = 3
==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.
________________________________________________________________________
Testing with mask size = 5
==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.
________________________________________________________________________
Testing with mask size = 7
==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.
________________________________________________________________________
Testing with mask size = 9
==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.
________________________________________________________________________
==PROF== Disconnected from process 31551
[31551] t2140#127.0.0.1
convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 815.21
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 47,929
Memory [%] % 23.96
DRAM Throughput % 15.23
Duration usecond 42.08
L1/TEX Cache Throughput % 26.90
L2 Cache Throughput % 10.54
SM Active Cycles cycle 42,619.88
Compute (SM) [%] % 37.09
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 38
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1,048,576
Waves Per SM 12.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 1
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 40.42
Achieved Active Warps Per SM warp 25.87
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 771.98
SM Frequency cycle/nsecond 1.07
Elapsed Cycles cycle 31,103
Memory [%] % 40.61
DRAM Throughput % 24.83
Duration usecond 29.12
L1/TEX Cache Throughput % 46.39
L2 Cache Throughput % 18.43
SM Active Cycles cycle 27,168.03
Compute (SM) [%] % 60.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,156
Registers Per Thread register/thread 31
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 4.10
Threads thread 1,183,744
Waves Per SM 7.22
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 24
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 84.01
Achieved Active Warps Per SM warp 53.77
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
as well as across blocks of the same kernel.
<sections repeat for each kernel launch>
$

Is blockIdx correlated to the order of block execution?

Is there any relationship between blockIdx and the order in which thread blocks are executed on the GPU device?
My motivation is that I have a kernel in which multiple blocks will read from the same location in global memory, and it would be nice if these blocks would run concurrently (because L2 cache hits are nice). In deciding how to organize these blocks into a grid, would it be safe to say that blockIdx.x=0 is more likely to run concurrently with blockIdx.x=1 than with blockIdx.x=200? And that I should try to assign consecutive indices to blocks that read from the same location in global memory?
To be clear, I'm not asking about inter-block dependencies (as in this question) and the thread blocks are completely independent from the point of view of program correctness. I'm already using shared memory to broadcast data within a block, and I can't make the blocks any larger.
EDIT: Again, I am well aware that
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series.
and the blocks are fully independent---they can run in any order and produce the same output. I am just asking if the order in which I arrange the blocks into a grid will influence which blocks end up running concurrently, because that does affect performance via L2 cache hit rate.
I found a writeup in which a CS researcher used micro-benchmarking to reverse engineer the block scheduler on a Fermi device:
http://cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html
I adapted his code to run on my GPU device (GTX 1080, with the Pascal GP104 GPU) and to randomize the runtimes.
Methods
Each block contains only 1 thread, and is launched with enough shared memory that only 2 blocks can be resident per SM. The kernel records its start time (obtained via clock64()) and then runs for a random amount of time (the task, appropriately enough, is generating random numbers using the multiply-with-carry algorithm).
The GTX 1080 comprises 4 Graphics Processing Clusters (GPCs) with 5 streaming multiprocessors (SMs) each. Each GPC has its own clock, so I used the same method described in the link to determine which SMs belonged to which GPCs, and then subtracted a fixed offset to convert all of the clock values to the same time zone.
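For reference, a rough reconstruction of the kind of timing kernel described above, for the 1-D case (my sketch under the stated assumptions, not the original code from the writeup; names, constants, and the iteration count are illustrative):

#include <cstdio>

__global__ void timing_kernel(long long *start_clk, unsigned int *smids,
                              unsigned int *sink, unsigned int iters)
{
    start_clk[blockIdx.x] = clock64();            // per-SM clock; re-aligned per GPC afterwards
    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));        // which SM this block ran on
    smids[blockIdx.x] = sm;

    // multiply-with-carry generator as busy work; iters can be randomized per block
    unsigned int x = blockIdx.x + 1, c = 12345;
    for (unsigned int i = 0; i < iters; ++i) {
        unsigned long long t = 4294957665ull * x + c;
        x = (unsigned int)t;
        c = (unsigned int)(t >> 32);
    }
    sink[blockIdx.x] = x;                         // keep the loop from being optimized away
}

int main(void)
{
    const int blocks = 128;
    long long *start_clk; unsigned int *smids, *sink;
    cudaMallocManaged(&start_clk, blocks * sizeof(long long));
    cudaMallocManaged(&smids, blocks * sizeof(unsigned int));
    cudaMallocManaged(&sink, blocks * sizeof(unsigned int));

    // One thread per block, 48 KiB of dynamic shared memory per block so that
    // (with 96 KiB of shared memory per SM on GP104) at most 2 blocks are resident per SM.
    timing_kernel<<<blocks, 1, 48 * 1024>>>(start_clk, smids, sink, 1 << 20);
    cudaDeviceSynchronize();

    for (int b = 0; b < blocks; ++b)
        printf("block %3d started on SM %2u at clock %lld\n", b, smids[b], start_clk[b]);
    return 0;
}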
Results
For a 1-D block grid, I found that the blocks were indeed launched in consecutive order:
We have 40 blocks starting immediately (2 blocks per SM * 20 SMs) and the subsequent blocks start when the previous blocks end.
For 2-D grids, I found the same linear-sequential order, with blockIdx.x being the fast dimension and blockIdx.y the slow dimension:
NB: I made a terrible typo when labeling these plots. All instances of "threadIdx" should be replaced with "blockIdx".
And for a 3-d block grid:
Conclusions
For a 1-D grid, these results match what Dr. Pai reported in the linked writeup. For 2-D grids, however, I did not find any evidence for a space-filling curve in block execution order, so this may have changed somewhere between Fermi and Pascal.
And of course, the usual caveats with benchmarking apply, and there's no guarantee that this isn't specific to a particular processor model.
Appendix
For reference, here's a plot showing the results for random vs. fixed runtimes:
The fact that we see this trend with randomized runtimes gives me more confidence that this is a real result and not just a quirk of the benchmarking task.
Yes, there definitely is a correlation (although of course it is not guaranteed).
You are probably best off just trying it out on your device. You can use the %globaltimer and %smid special PTX registers with a bit of inline assembly:
#include <stdio.h>

__managed__ unsigned long long starttime;

// Read the global nanosecond timer via the %globaltimer special register.
__device__ unsigned long long globaltime(void)
{
    unsigned long long time;
    asm("mov.u64 %0, %%globaltimer;" : "=l"(time));
    return time;
}

// Read the ID of the SM this thread is running on via the %smid special register.
__device__ unsigned int smid(void)
{
    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));
    return sm;
}

__global__ void logkernel(void)
{
    unsigned long long t = globaltime();
    // Record the first block's start time once, then report times relative to it.
    unsigned long long t0 = atomicCAS(&starttime, 0ull, t);
    if (t0 == 0) t0 = t;
    printf("Started block %2u on SM %2u at %llu.\n", blockIdx.x, smid(), t - t0);
}

int main(void)
{
    starttime = 0;
    logkernel<<<30, 1, 49152>>>();   // 48 KiB of dynamic shared memory per block
    cudaDeviceSynchronize();
    return 0;
}
I've used 48K of shared memory to make the results a bit more interesting - you should substitute your kernel of interest, with its actual launch configuration, instead.
If I run this code on my laptop with a GTX 1050, I get output like the following:
Started block 1 on SM 1 at 0.
Started block 6 on SM 1 at 0.
Started block 8 on SM 3 at 0.
Started block 0 on SM 0 at 0.
Started block 3 on SM 3 at 0.
Started block 5 on SM 0 at 0.
Started block 2 on SM 2 at 0.
Started block 7 on SM 2 at 0.
Started block 4 on SM 4 at 0.
Started block 9 on SM 4 at 0.
Started block 10 on SM 3 at 152576.
Started block 11 on SM 3 at 152576.
Started block 18 on SM 1 at 153600.
Started block 16 on SM 1 at 153600.
Started block 17 on SM 0 at 153600.
Started block 14 on SM 0 at 153600.
Started block 13 on SM 2 at 153600.
Started block 12 on SM 2 at 153600.
Started block 19 on SM 4 at 153600.
Started block 15 on SM 4 at 153600.
Started block 20 on SM 0 at 210944.
Started block 21 on SM 3 at 210944.
Started block 22 on SM 0 at 211968.
Started block 23 on SM 3 at 211968.
Started block 24 on SM 1 at 214016.
Started block 26 on SM 1 at 215040.
Started block 25 on SM 2 at 215040.
Started block 27 on SM 2 at 215040.
Started block 28 on SM 4 at 216064.
Started block 29 on SM 4 at 217088.
So you see there is indeed a strong correlation.

why the difference in cuda cores between nvidia control panel and device query?

Q1: Why is the information I get from the Nvidia control panel -> system information different from the device query example in the CUDA SDK?
system information:
    cuda cores: 384 cores
    memory data rate: 1800 MHz
device query output:
    cuda cores = 3 MP x 192 SP/MP = 576 cuda cores
    memory clock rate: 900 MHz
Q2: How can I calculate the GFLOPS of my GPU using the device query data?
The most commonly used formula I found was the one mentioned here, which suggests using the number of mul-add units and the number of mul units, which I don't know:
Max GFLOPS = cores x SIMDs x ([mul-add] x 2 + [mul] x 1) x clock speed
Q1: The device query output tells you right there, just above that line:
MapSMtoCores for SM 5.0 is undefined. Default to use 192 Cores/SM
Maxwell, the architecture behind the GeForce 840M, uses 128 "cores" per "SMM"
3 * 128 = 384
Q2: "Cores" * frequency * 2 (because each core can do a multiply+add)

CUDA Block parallelism

I am writing some code in CUDA and am a little confused about what actually runs in parallel.
Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can only have 96 threads running concurrently at a time? (See device_query below.)
I wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel and how? I mean, is it done one block after the other, or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block runs on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads of each block, 8, 96, or 512, run concurrently when all 12 blocks are also running? (See device_query below.)
Another question is that if A had a value of ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume there is no thread synchronization required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.
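As a rough worked example based on the deviceQuery output above (my sketch, ignoring register and shared-memory limits; the 8-blocks-per-SM figure is the architectural maximum for compute capability 1.x):

#include <cstdio>

int main(void)
{
    // Limits taken from the deviceQuery listing: 12 SMs, 768 threads/SM, 512 threads/block.
    const int num_sms = 12, max_threads_per_sm = 768, max_blocks_per_sm = 8;
    int block_size = 512;

    int blocks_per_sm = max_threads_per_sm / block_size;       // thread limit allows only 1
    if (blocks_per_sm > max_blocks_per_sm) blocks_per_sm = max_blocks_per_sm;

    printf("Resident blocks per SM : %d\n", blocks_per_sm);                        // 1
    printf("Resident threads total : %d\n", num_sms * blocks_per_sm * block_size); // 6144
    return 0;
}

So far more threads can be resident (6144 here) than there are CUDA cores (96); the cores limit instruction throughput per cycle, not how many threads are in flight, which is exactly the oversubscription the linked answer talks about.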

Why are overlapping data transfers in CUDA slower than expected?

When I run the simpleMultiCopy in the SDK (4.0) on the Tesla C2050 I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
This shows that the expected runtime is 2.7 ms, while it actually takes 4.3. What is it exactly that causes this discrepancy? (I've also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)
The first kernel launch cannot start until the first memcpy is completed, and the last memcpy cannot start until the last kernel launch is completed. So, there is "overhang" that introduces some of the overhead you are observing. You can decrease the size of the "overhang" by increasing the number of streams, but the streams' inter-engine synchronization incurs its own overhead.
It's important to note that overlapping compute+transfer doesn't always benefit a given workload - in addition to the overhead issues described above, the workload itself has to spend roughly equal amounts of time on compute and on data transfer to see the full benefit. Due to Amdahl's Law, the potential speedup of 2x or 3x falls off as the workload becomes either transfer-bound or compute-bound.
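For reference, the overlap pattern under discussion looks roughly like this (a minimal sketch with a hypothetical kernel process and illustrative sizes, not the simpleMultiCopy sample itself). Note how each stream's kernel still has to wait for that stream's host-to-device copy, and the very last device-to-host copy has to wait for the last kernel - that is the "overhang" described above:

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int N = 1 << 22, N_STREAMS = 4, CHUNK = N / N_STREAMS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory is required for async overlap
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[N_STREAMS];
    for (int i = 0; i < N_STREAMS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < N_STREAMS; ++i) {
        int off = i * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < N_STREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}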