This is the output I get from nvprof (CUDA 5.5):
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K40c (0)"
Kernel: MyKernel(double const *, double const *, double*, int, int, int)
60 inst_replay_overhead Instruction Replay Overhead 0.736643 0.925197 0.817188
60 shared_replay_overhead Shared Memory Replay Overhead 0.000000 0.000000 0.000000
60 global_replay_overhead Global Memory Replay Overhead 0.108972 0.108972 0.108972
60 global_cache_replay_overhead Global Memory Cache Replay Ove 0.000000 0.000000 0.000000
60 local_replay_overhead Local Memory Cache Replay Over 0.000000 0.000000 0.000000
60 gld_transactions Global Load Transactions 25000 25000 25000
60 gst_transactions Global Store Transactions 75000 75000 75000
60 warp_nonpred_execution_efficie Warp Non-Predicated Execution 99.63% 99.63% 99.63%
60 cf_issued Issued Control-Flow Instructio 44911 45265 45101
60 cf_executed Executed Control-Flow Instruct 39533 39533 39533
60 ldst_issued Issued Load/Store Instructions 273117 353930 313341
60 ldst_executed Executed Load/Store Instructio 50016 50016 50016
60 stall_data_request Issue Stall Reasons (Data Requ 65.21% 68.93% 67.86%
60 inst_executed Instructions Executed 458686 458686 458686
60 inst_issued Instructions Issued 789220 879145 837129
60 issue_slots Issue Slots 716816 803393 759614
The kernel uses 356 bytes cmem[0] and no shared memory. Also, no register spills.
My question is, what is the reason for instruction replays in this case? We see an overhead of 81% but the numbers do not add up.
Thanks!
some possible reasons:
shared memory bank conflicts (which you don't have)
constant memory conflicts (i.e. different threads in a warp requesting different locations in constant memory from the same instruction)
warp-divergent code (if..then..else taking different paths for different threads in a warp)
This presentation may be of interest, especially slides 8-11.
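To make the last two items concrete, here is a minimal sketch (my own toy kernel, not the kernel being profiled) that deliberately provokes both: a per-lane index into __constant__ memory, which serializes the broadcast when lanes in a warp request different locations, and a data-dependent branch that splits the warp.

#include <cuda_runtime.h>

__constant__ float lut[256];

__global__ void replay_demo(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Constant-memory conflict: lanes in a warp read different lut entries,
    // so the single constant load is serialized (replayed) per distinct address.
    float c = lut[threadIdx.x % 256];

    // Warp-divergent branch: lanes take different paths depending on the data.
    float v = in[tid];
    if (v > 0.0f) v = v * c;
    else          v = -v + c;

    out[tid] = v;
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    float host_lut[256] = {1.0f};                      // arbitrary table contents
    cudaMemcpyToSymbol(lut, host_lut, sizeof(host_lut));
    replay_demo<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}

Profiling a toy like this with the same metrics should show non-zero replay overhead even with no shared memory usage and no register spilling, per the reasons listed above.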
I want to calculate the achieved occupancy and compare it with the value that is being displayed in Nsight Compute.
ncu says: Theoretical Occupancy [%] 100, and Achieved Occupancy [%] 93.04. What parameters do I need to calculate this value?
I can see the theoretical occupancy using the occupancy api, which comes out as 1.0 or 100%.
I tried looking for the metrics achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum, but all of them say: Failed to find metric sm__active_warps_sum. I can see the formula to calculate the achieved occupancy from this SO answer.
A few details, in case they help:
There are 1 CUDA devices.
CUDA Device #0
Major revision number: 7
Minor revision number: 5
Name: NVIDIA GeForce GTX 1650
Total global memory: 4093181952
Total constant memory: 65536
Total shared memory per block: 49152
Total registers per block: 65536
Total registers per multiprocessor: 65536
Warp size: 32
Maximum threads per block: 1024
Maximum threads per multiprocessor: 1024
Maximum blocks per multiprocessor: 16
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1515000
Maximum memory pitch: 2147483647
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: Yes
ptxas info : Used 18 registers, 360 bytes cmem[0]
Shorter:
In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
Longer:
The metrics you have indicated:
I tried looking for the metric achieved_occupancy, sm__active_warps_sum, sm__active_cycles_sum but all of them say: Failed to find metric sm__active_warps_sum.
are not applicable to nsight compute. NVIDIA has made a variety of different profilers, and these metric names apply to other profilers. The article you reference refers to a different profiler (the original profiler on windows used the nsight name but was not nsight compute.)
This blog article discusses different ways to get valid nsight compute metric names with references to documentation links that present the metrics in different ways.
I would also point out for others that nsight compute has a whole report section dedicated to occupancy, and so for typical interest, that is probably the easiest way to go. Additional instructions for how to run nsight compute are available in this blog.
To come up with metrics that represent occupancy the way the nsight compute designers intended, my suggestion would be to look at their definitions. Each report section in nsight compute has "human-readable" files that indicate how the section is assembled. Since there is a report section for occupancy that includes reporting both theoretical and achieved occupancy, we can discover how those are computed by inspecting those files.
The methodology for how the occupancy section is computed is contained in 2 files which are part of a CUDA install. On a standard linux CUDA install, these will be in /usr/local/cuda-XX.X/nsight-compute-zzzzzz/sections/Occupancy.py and .../sections/Occupancy.section. The python file gives the exact names of the metrics that are used as well as the calculation method(s) for other displayed topics related to occupancy (e.g. notes, warnings, etc.) In a nutshell, the theoretical occupancy is given by metric name sm__maximum_warps_per_active_cycle_pct and the achieved occupancy is given by metric name sm__warps_active.avg.pct_of_peak_sustained_active.
You could retrieve both the Occupancy section report (which is part of the "default" "set") as well as these specific metrics with a command line like this:
ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./my-app
Here is an example output from such a run:
$ ncu --set default --metrics sm__maximum_warps_per_active_cycle_pct,sm__warps_active.avg.pct_of_peak_sustained_active ./t2140
Testing with mask size = 3
==PROF== Connected to process 31551 (/home/user2/misc/t2140)
==PROF== Profiling "convolution_2D" - 1: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 2: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 460.922913 ms.
________________________________________________________________________
Testing with mask size = 5
==PROF== Profiling "convolution_2D" - 3: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 4: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 429.748230 ms.
________________________________________________________________________
Testing with mask size = 7
==PROF== Profiling "convolution_2D" - 5: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 6: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 500.704254 ms.
________________________________________________________________________
Testing with mask size = 9
==PROF== Profiling "convolution_2D" - 7: 0%....50%....100% - 20 passes
==PROF== Profiling "convolution_2D_tiled" - 8: 0%....50%....100% - 20 passes
Time elapsed on naive GPU convolution 2d tiled ( 32 ) block 449.445892 ms.
________________________________________________________________________
==PROF== Disconnected from process 31551
[31551] t2140#127.0.0.1
convolution_2D(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:44, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 50
sm__warps_active.avg.pct_of_peak_sustained_active % 40.42
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 815.21
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 47,929
Memory [%] % 23.96
DRAM Throughput % 15.23
Duration usecond 42.08
L1/TEX Cache Throughput % 26.90
L2 Cache Throughput % 10.54
SM Active Cycles cycle 42,619.88
Compute (SM) [%] % 37.09
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 38
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1,048,576
Waves Per SM 12.80
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 1
Block Limit Shared Mem block 32
Block Limit Warps block 2
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 40.42
Achieved Active Warps Per SM warp 25.87
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the number of required registers
convolution_2D_tiled(float *, const float *, float *, unsigned long, unsigned long, unsigned long), 2022-Oct-29 13:02:45, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__maximum_warps_per_active_cycle_pct % 100
sm__warps_active.avg.pct_of_peak_sustained_active % 84.01
---------------------------------------------------------------------- --------------- ------------------------------
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 771.98
SM Frequency cycle/nsecond 1.07
Elapsed Cycles cycle 31,103
Memory [%] % 40.61
DRAM Throughput % 24.83
Duration usecond 29.12
L1/TEX Cache Throughput % 46.39
L2 Cache Throughput % 18.43
SM Active Cycles cycle 27,168.03
Compute (SM) [%] % 60.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see
what the compute pipelines are spending their time doing. Also, consider whether any computation is
redundant and could be reduced or moved to look-up tables.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1,024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1,156
Registers Per Thread register/thread 31
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 4.10
Threads thread 1,183,744
Waves Per SM 7.22
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 24
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 84.01
Achieved Active Warps Per SM warp 53.77
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
theoretical (100.0%) and measured achieved occupancy (84.0%) can be the result of warp scheduling overheads
or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
as well as across blocks of the same kernel.
<sections repeat for each kernel launch>
$
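One more note, since the question mentions the occupancy API: the Theoretical Occupancy figure can also be estimated programmatically, independent of ncu. A minimal sketch, assuming a placeholder kernel called my_kernel and the block size and dynamic shared memory you actually launch with (both are stand-ins for your real values):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel() {}   // placeholder; substitute your real kernel

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 1024;        // the block size you actually launch with
    size_t dynamicSmem = 0;      // dynamic shared memory per block, if any

    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, my_kernel, blockSize, dynamicSmem);

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    double theoretical = 100.0 * (maxBlocksPerSM * warpsPerBlock) / maxWarpsPerSM;
    printf("Theoretical occupancy: %.1f%%\n", theoretical);
    return 0;
}

Achieved occupancy, on the other hand, can only come from the profiler (sm__warps_active.avg.pct_of_peak_sustained_active), because it depends on how the warps actually behaved at run time.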
I am running a memory coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another one. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request.
I ran the experiment on a TITAN Xp and a GeForce GTX 1080 TI. Same results.
#include <stdio.h>
#include <cstdint>
#include <assert.h>
#define N 1ULL*1024*1024*1024
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void copy_kernel(
const float* __restrict__ data, float* __restrict__ data2) {
for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid < N; tid += blockDim.x * gridDim.x) {
data2[tid] = data[tid];
}
}
int main() {
float* d_data;
gpuErrchk(cudaMalloc(&d_data, sizeof(float) * N));
assert(d_data != nullptr);
uintptr_t d = reinterpret_cast<uintptr_t>(d_data);
assert(d%128 == 0); // check alignment, just to be sure
float* d_data2;
gpuErrchk(cudaMalloc(&d_data2, sizeof(float)*N));
assert(d_data2 != nullptr);
copy_kernel<<<1024,1024>>>(d_data, d_data2);
gpuErrchk(cudaDeviceSynchronize());
}
Compiled with CUDA version 10.1:
nvcc coalescing.cu -std=c++11 -Xptxas -dlcm=ca -gencode arch=compute_61,code=sm_61 -O3
Profiled with:
nvprof -m all ./a.out
There are a few confusing parts in the profiling results:
gld_transactions = 536870914, which means that every global load transaction should on average be 4GB/536870914 = 8 bytes. This is consistent with gld_transactions_per_request = 16.000000: Each warp reads 128 bytes (1 request) and if every transaction is 8 bytes, then we need 128 / 8 = 16 transactions per request. Why is this value so low? I would expect perfect coalescing, so something along the lines of 4 (or even 1) transactions/request.
gst_transactions = 134217728 and gst_transactions_per_request = 4.000000, so storing memory is more efficient?
Requested and achieved global load/store throughput (gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput) is 150.32GB/s each. I would expect a lower throughput for loads than for stores since we have more transactions per request.
gld_transactions = 536870914 but l2_read_transactions = 134218800. Global memory is always accessed through the L1/L2 caches. Why is the number of L2 read transactions so much lower? It can't all be cached in the L1. (global_hit_rate = 0%)
I think I am reading the nvprof results wrong. Any suggestions would be appreciated.
Here is the full profiling result:
Device "GeForce GTX 1080 Ti (0)"
Kernel: copy_kernel(float const *, float*)
1 inst_per_warp Instructions per warp 1.4346e+04 1.4346e+04 1.4346e+04
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 99.99% 99.99% 99.99%
1 inst_replay_overhead Instruction Replay Overhead 0.000178 0.000178 0.000178
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 16.000000 16.000000 16.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 536870914 536870914 536870914
1 gst_transactions Global Store Transactions 134217728 134217728 134217728
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 134218800 134218800 134218800
1 l2_write_transactions L2 Write Transactions 134217741 134217741 134217741
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 50.00% 50.00% 50.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
1 tex_cache_throughput Unified Cache Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_read_throughput L2 Throughput (Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_write_throughput L2 Throughput (Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 5.8711KB/s 5.8711KB/s 5.8701KB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified Cache Transactions 134217728 134217728 134217728
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 470089728 470089728 470089728
1 inst_issued Instructions Issued 470173430 470173430 470173430
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.79% 0.79% 0.79%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.46% 1.46% 1.46%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 96.16% 96.16% 96.16%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 1.13% 1.13% 1.13%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.00% 0.00% 0.00%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.07% 0.07% 0.07%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 1.0742e+10 1.0742e+10 1.0742e+10
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 1073741824 1073741824 1073741824
1 inst_compute_ld_st Load/Store Instructions 2147483648 2147483648 2147483648
1 inst_misc Misc Instructions 1077936128 1077936128 1077936128
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 470173430 470173430 470173430
1 cf_issued Issued Control-Flow Instructions 33619968 33619968 33619968
1 cf_executed Executed Control-Flow Instructions 33619968 33619968 33619968
1 ldst_issued Issued Load/Store Instructions 268500992 268500992 268500992
1 ldst_executed Executed Load/Store Instructions 67174400 67174400 67174400
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 134217728 134217728 134217728
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.39% 0.39% 0.39%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 134217728 134217728 134217728
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 pcie_total_data_transmitted PCIe Total Data Transmitted 1024 1024 1024
1 pcie_total_data_received PCIe Total Data Received 0 0 0
1 inst_executed_global_loads Warp level instructions for global loads 33554432 33554432 33554432
1 inst_executed_local_loads Warp level instructions for local loads 0 0 0
1 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
1 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
1 inst_executed_global_stores Warp level instructions for global stores 33554432 33554432 33554432
1 inst_executed_local_stores Warp level instructions for local stores 0 0 0
1 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
1 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
1 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
1 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
1 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
1 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
1 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
1 inst_executed_tex_ops Warp level instructions for texture 0 0 0
1 l2_global_load_bytes Bytes read from L2 for misses in Unified Cache for global loads 4294967296 4294967296 4294967296
1 l2_local_load_bytes Bytes read from L2 for misses in Unified Cache for local loads 0 0 0
1 l2_surface_load_bytes Bytes read from L2 for misses in Unified Cache for surface loads 0 0 0
1 l2_local_global_store_bytes Bytes written to L2 from Unified Cache for local and global stores. 4294967296 4294967296 4294967296
1 l2_global_reduction_bytes Bytes written to L2 from Unified cache for global reductions 0 0 0
1 l2_global_atomic_store_bytes Bytes written to L2 from Unified cache for global atomics 0 0 0
1 l2_surface_store_bytes Bytes written to L2 from Unified Cache for surface stores. 0 0 0
1 l2_surface_reduction_bytes Bytes written to L2 from Unified Cache for surface reductions 0 0 0
1 l2_surface_atomic_store_bytes Bytes transferred between Unified Cache and L2 for surface atomics 0 0 0
1 global_load_requests Total number of global load requests from Multiprocessor 134217728 134217728 134217728
1 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
1 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
1 global_store_requests Total number of global store requests from Multiprocessor 134217728 134217728 134217728
1 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
1 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
1 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
1 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
1 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
1 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
1 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
1 l2_tex_hit_rate L2 Cache Hit Rate 0.00% 0.00% 0.00%
1 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 unique_warps_launched Number of warps launched 32768 32768 32768
1 sm_efficiency Multiprocessor Activity 99.63% 99.63% 99.63%
1 achieved_occupancy Achieved Occupancy 0.986477 0.986477 0.986477
1 ipc Executed IPC 0.344513 0.344513 0.344513
1 issued_ipc Issued IPC 0.344574 0.344574 0.344574
1 issue_slot_utilization Issue Slot Utilization 8.61% 8.61% 8.61%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.592326 0.592326 0.592326
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 tex_fu_utilization Texture Function Unit Utilization Low (1) Low (1) Low (1)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 dram_read_transactions Device Memory Read Transactions 134218560 134218560 134218560
1 dram_write_transactions Device Memory Write Transactions 134176900 134176900 134176900
1 dram_read_throughput Device Memory Read Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 dram_write_throughput Device Memory Write Throughput 150.27GB/s 150.27GB/s 150.27GB/s
1 dram_utilization Device Memory Utilization High (7) High (7) High (7)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 ecc_transactions ECC Transactions 0 0 0
1 ecc_throughput ECC Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 dram_read_bytes Total bytes read from DRAM to L2 cache 4294993920 4294993920 4294993920
1 dram_write_bytes Total bytes written from L2 cache to DRAM 4293660800 4293660800 4293660800
With Fermi and Kepler GPUs, when a global transaction was issued, it was always for 128 bytes, and the L1 cacheline size (if enabled) was 128 bytes. With Maxwell and Pascal, these characteristics changed. In particular, a read of a portion of an L1 cacheline does not necessarily trigger a full 128-byte width transaction. This is fairly easily discoverable/provable with microbenchmarking.
Effectively, the size of a global load transaction changed, subject to a certain quantum of granularity. Based on this change of transaction size, it's possible that multiple transactions could be required, where previously only 1 was required. As far as I know, none of this is clearly published or detailed, and I won't be able to do that here. However I think we can address a number of your questions without giving a precise description of how global load transactions are calculated.
gld_transactions = 536870914, which means that every global load transaction should on average be 4GB/536870914 = 8 bytes. This is consistent with gld_transactions_per_request = 16.000000: Each warp reads 128 bytes (1 request) and if every transaction is 8 bytes, then we need 128 / 8 = 16 transactions per request. Why is this value so low? I would expect perfect coalescing, so something along the lines of 4 (or even 1) transactions/request.
This mindset (1 transaction per request for fully coalesced loads of a 32-bit quantity per thread) would have been correct in the Fermi/Kepler timeframe. It is no longer correct for Maxwell and Pascal GPUs. As you've already calculated, the transaction size appears to be smaller than 128 bytes, and therefore the number of transactions per request is higher than 1. But this doesn't indicate an efficiency problem per se (as it would have in Fermi/Kepler timeframe). So let's just acknowledge that the transaction size can be smaller and therefore transactions per request can be higher, even though the underlying traffic is essentially 100% efficient.
gst_transactions = 134217728 and gst_transactions_per_request = 4.000000, so storing memory is more efficient?
No, that's not what this means. It simply means that the subdivision quanta can be different for loads (load transactions) and stores (store transactions). These happen to be 32-byte transactions. In either case, loads or stores, the transactions are and should be fully efficient in this case. The requested traffic is consistent with the actual traffic, and other profiler metrics confirm this. If the actual traffic were much higher than the requested traffic, that would be a good indication of inefficient loads or stores:
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
Requested and achieved global load/store throughput (gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput) is 150.32GB/s each. I would expect a lower throughput for loads than for stores since we have more transactions per request.
Again, you'll have to adjust your way of thinking to account for variable transaction sizes. Throughput is driven by the needs and efficiency associated with fulfilling those needs. Both loads and stores are fully efficient for your code design, so there is no reason to think there is or should be an imbalance in efficiency.
gld_transactions = 536870914 but l2_read_transactions = 134218800. Global memory is always accessed through the L1/L2 caches. Why is the number of L2 read transactions so much lower? It can't all be cached in the L1. (global_hit_rate = 0%)
This is simply due to the different size of the transactions. You've already calculated that the apparent global load transaction size is 8 bytes, and I've already indicated that the L2 transaction size is 32 bytes, so it makes sense that there would be a 4:1 ratio between the total number of transactions, since they reflect the same movement of the same data, viewed through two different lenses. Note that there has always been a disparity in the size of global transactions vs. the size of L2 transactions, or transactions to DRAM. It's simply that the ratios of these may vary by GPU architecture, and possibly other factors, such as load patterns.
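As a quick sanity check using the numbers in the question, the load, store, and L2 counters all describe the same ~4 GB of traffic, just measured in different transaction sizes:

536,870,914 global load transactions   x  8 bytes ≈ 4,294,967,312 bytes ≈ 4 GB
134,218,800 L2 read transactions       x 32 bytes ≈ 4,295,001,600 bytes ≈ 4 GB
134,217,728 global store transactions  x 32 bytes =  4,294,967,296 bytes = 4 GB

The small excesses on the read side come from the handful of extra transactions visible in the counts (the +2 and +1072 over the exact powers of two).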
Some notes:
I won't be able to answer questions such as "why is it this way?", or "why did Pascal change from Fermi/Kepler?" or "given this particular code, what would you predict as the needed global load transactions on this particular GPU?", or "generally, for this particular GPU, how would I calculate or predict transaction size?"
As an aside, there are new profiling tools (Nsight Compute and Nsight Systems) being advanced by NVIDIA for GPU work. Many of the efficiency and transactions per request metrics which are available in nvprof are gone under the new toolchain. So these mindsets will have to be broken anyway, because these methods of ascertaining efficiency won't be available moving forward, based on the current metric set.
Note that the use of compile switches such as -Xptxas -dlcm=ca may affect (L1) caching behavior. I don't expect caches to have much performance or efficiency impact on this particular copy code, however.
This possible reduction in transaction size is generally a good thing. It results in no loss of efficiency for traffic patterns such as presented in this code, and for certain other codes it allows (less-than-128byte) requests to be satisfied with less wasted bandwidth.
Although not specifically Pascal, here is a better defined example of the possible variability in these measurements for Maxwell. Pascal will have similar variability. Also, some small hint of this change (especially for Pascal) was given in the Pascal Tuning Guide. It by no means offers a complete description or explains all of your observations, but it does hint at the general idea that the global transactions are no longer fixed to a 128-byte size.
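If you want to observe the transaction-size behavior yourself, a small microbenchmark along these lines (my sketch; the kernel and the OFFSET macro are made up for illustration) makes it visible: swap a kernel like this into the program from the question, build it several times with different offsets, and compare gld_transactions and gld_transactions_per_request, e.g. with nvprof -m gld_transactions,gld_transactions_per_request ./a.out.

// Hypothetical microbenchmark kernel: OFFSET is set at compile time, e.g.
//   nvcc -DOFFSET=0 ...   then   nvcc -DOFFSET=1 ...   then   nvcc -DOFFSET=17 ...
// A misaligned start shifts every warp's 128-byte request so it straddles
// additional sectors, which should raise transactions per request.
#ifndef OFFSET
#define OFFSET 0
#endif

__global__ void offset_copy(const float* __restrict__ in,
                            float* __restrict__ out, size_t n)
{
    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (tid + OFFSET < n)
        out[tid] = in[tid + OFFSET];   // OFFSET shifts each load by 4*OFFSET bytes
}

With OFFSET=0 you should reproduce the fully coalesced numbers from the question; with a misaligned offset the per-request transaction count should rise.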
I am trying to understand how the memory organization of my GPU is working.
According to the technical specifications, which are tabulated below, my GPU can have 8 active blocks/SM and 768 threads/SM. Based on that, I was thinking that in order to take full advantage of the SM, each block should have 96 (= 768/8) threads. The closest block shape with about this many threads that I can think of is a 9x9 block, i.e. 81 threads. With 8 blocks running simultaneously on one SM, that gives 648 threads. What about the remaining 120 (= 768 - 648)?
I know something is wrong with this reasoning. A simple example describing the connection between the maximum number of threads per SM, the maximum number of threads per block, and the warp size, based on my GPU's specifications, would be very helpful.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
You can find the technical specifications of your device in the CUDA programming guide, rather than relying on the output of a CUDA sample program:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
From the hardware point of view, we generally try to maximize the warp occupancy per Multiprocessor (SM) to get max performance. The max occupancy is limited by 3 types of hardware resources: #warp/SM, #register/SM and #shared memory/SM.
You could try the following tool in your cuda installation dir to understand how to do the calculation. It will give you a clearer understanding of the connections between #threads/SM, #threads/block, #warp/SM, etc.
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
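As a worked example with the figures from the listing above (768 threads/SM, 8 blocks/SM, warp size 32), and assuming registers and shared memory are not the limiting factor:

81 threads/block (9x9)   -> ceil(81/32) = 3 warps/block, i.e. 96 thread slots with 15 idle lanes
8 blocks/SM x 3 warps    = 24 warps = 768/32, so every warp slot on the SM is occupied
idle lanes               = 8 x 15 = 120, which is exactly the "missing" 768 - 648 threads
96 threads/block         -> 3 full warps/block, 8 blocks/SM = 768 active threads, no idle lanes

In other words, the hardware always allocates whole warps, so block sizes that are a multiple of 32 avoid wasting lanes.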
I am writing some code in CUDA and am a little confused about what actually runs in parallel.
Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed that I will have 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have 96 threads running concurrently at a time? (See device_query below.)
I also wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after the other, or are blocks parallelized too? If yes, then how many blocks can run 512 threads each in parallel? It says here that one block is run on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads can each block have running concurrently (8, 96, or 512) when all 12 blocks are also running concurrently? (See device_query below.)
Another question: if A had a value of ~50, is it better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume no thread synchronization is required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.
The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with: kernelA<<<20,512>>>(float parmA, int paramB); and it will run fine.
Launch with: kernelA<<<20,513>>>(float parmA, int paramB); and I get the out-of-resources error (too many resources requested for launch).
The Fermi device properties: 48KB of shared mem per SM, constant mem 64KB, 32K registers per SM, 1024 maximum threads per block, comp capable 2.1 (sm_21)
I'm using all my shared mem space.
I'll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I commented out the printf's. Reg count= 45
When it was running, it had the printf's coded. Reg count= 63 w/plenty of spill "reg's".
I suspect each thread really has 64 reg's, with only 63 available to the program.
64 reg's * 512 threads = 32K - The maximum available to a single block.
So I suggest the number of available "code" registers for a block = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (otherwise they spill over to lmem). I suspect this limit of 63 is a HW addressing limitation.
So it looks like I'm running out of register space.
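One practical follow-up (a sketch, assuming the 63-register build described above): if registers per block really are the limiter, you can cap registers per thread so that 1024 threads fit in the 32K register file, either per kernel with __launch_bounds__ or file-wide with the --maxrregcount compiler flag. The kernel name and parameters below are the ones from the question; the body is elided.

// For a 1024-thread block to fit in 32K registers, each thread may use at most
// 32768 / 1024 = 32 registers. __launch_bounds__ tells ptxas to respect that
// limit, spilling to local memory if necessary.
__global__ void __launch_bounds__(1024)
kernelA(float parmA, int paramB)
{
    // ... kernel body ...
}

// Equivalent compile-time alternative for the whole translation unit:
//   nvcc -arch=sm_21 --maxrregcount=32 ...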