Trivia
In NVIDIA Nsight Systems you can use the --stats=true flag to get details of the data transfers between the GPU and CPU. The output includes a section similar to the following:
CUDA Memory Operation Statistics (KiB)
     Total  Operations     Average   Minimum     Maximum  Name
----------  ----------  ----------  --------  ----------  ------------------
  8192.000           2    4096.000  4096.000    4096.000  [CUDA memcpy HtoD]
528384.000           2  264192.000  4096.000  524288.000  [CUDA memcpy DtoD]
Question
Is it possible to get these statistics per API call? That is, can we get the amount of data transferred between host and device in each of the cudaMemcpy* calls?
If you want to do this purely from the CLI, I suggest following the guidance given in this blog starting at "Extending the Summary Statistics". The basic steps are to export the profile data as a sqlite database, then formulate a database query to extract the data you want. I acknowledge this is not a complete recipe.
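As a rough sketch of that route (the table and column names below are from memory and can vary between nsys versions, so inspect the exported schema first with sqlite3's .schema command):

nsys export --type=sqlite report1.qdrep
sqlite3 report1.sqlite "SELECT start, end, bytes, copyKind FROM CUPTI_ACTIVITY_KIND_MEMCPY ORDER BY start;"

Each returned row should correspond to one individual memcpy operation, including the number of bytes it transferred.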
If using the GUI is acceptable, I think it is pretty straightforward. Suppose we had a very simple CUDA program:
#include <cuda_runtime.h>

int main(){
  int *d1, *d2;
  int *h1, *h2;
  h1 = new int[8192];     // 32 KB host buffer
  h2 = new int[262144];   // 1 MB host buffer
  cudaMalloc(&d1, 32768);
  cudaMalloc(&d2, 1048576);
  cudaMemcpy(d1, h1, 32768, cudaMemcpyHostToDevice);    // 32 KB HtoD copy
  cudaMemcpy(d2, h2, 1048576, cudaMemcpyHostToDevice);  // 1 MB HtoD copy
}
These are the steps:
You could either do interactive profiling directly from the GUI as covered here or you could start with the CLI. To start with the CLI, run a command like this:
nsys profile --trace=cuda ./my_app
Among other things, this will create a report file named reportX.qdrep, where X is a number such as 1, 2, 3, etc.
Open the GUI and use File...Open to load the above reportX.qdrep file. The GUI need not be on the same machine, but its version should be greater than or equal to that of the CLI used to create the report file.
Fully expand all the rows in the timeline pertaining to the CUDA activities.
Hover your mouse over the memcpy operation of interest; a pop-up will show the details of that individual operation, including the amount of data transferred.
Related
I have noticed that when I run nsys on my machine
nsys profile --stats=true -o output-report ./input
It outputs the data like this:
NVIDIA Nsight Systems version 2022.4.2.50-32196742v0
[5/8] Executing 'cudaapisum' stats report
Time (%)  Total Time (ns)  Num Calls      Avg (ns)      Med (ns)    Min (ns)     Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  ------------  ------------  ----------  -----------  ------------  ----------------------
    46.7      100,404,793          3  33,468,264.3      22,463.0      12,434  100,369,896  57,938,512.8  cudaMallocManaged
    39.5       84,938,847          1  84,938,847.0  84,938,847.0  84,938,847   84,938,847           0.0  cudaDeviceSynchronize
    13.8       29,677,781          3   9,892,593.7   9,610,457.0   9,514,092   10,553,232     574,154.9  cudaFree
     0.0           82,478          1      82,478.0      82,478.0      82,478       82,478           0.0  cuLibraryLoadData
     0.0           40,588          1      40,588.0      40,588.0      40,588       40,588           0.0  cudaLaunchKernel
     0.0              892          1         892.0         892.0         892          892           0.0  cuModuleGetLoadingMode
The section is headed "Executing 'cudaapisum' stats report" instead of the usual title like "CUDA API Statistics". So I'm wondering if there's a flag that I can use to output the stats like the one below:
The output below isn't from my machine; it's from an AWS machine.
NVIDIA Nsight Systems version 2021.1.1.66-6c5c5cb
CUDA API Statistics:
Time(%)  Total Time (ns)  Num Calls      Average    Minimum    Maximum  Name
-------  ---------------  ---------  -----------  ---------  ---------  ---------------------
   61.5        250696605          3   83565535.0      36197  250541972  cudaMallocManaged
   32.8        133916228          1  133916228.0  133916228  133916228  cudaDeviceSynchronize
    5.7         23226526          3    7742175.3    6373371    9064987  cudaFree
    0.0            56395          1      56395.0      56395      56395  cudaLaunchKernel
Another thing I should mention is that on my machine the profile file is automatically written with a .nsys-rep extension rather than the .qdrep extension. Are the two the same or different?
I've been trying to find information in the nsys documentation, but I couldn't find any. I've also searched Stack Overflow and NVIDIA's Nsight forum, but nothing has come up so far. Maybe I've missed something. Any help would be appreciated.
Note: both runs use the same command, just with a slightly different input file.
Another thing I should mention is that on my machine the profile file is automatically written with a .nsys-rep extension rather than the .qdrep extension. Are the two the same or different?
.nsys-rep is the new extension for what used to be .qdrep files; the format is the same. The change happened in version 2021.4.
Specifically, from the release notes of the aforementioned version:
Result file rename
In order to make the Nsight tools family more consistent, all versions of Nsight Systems starting with 2021.4 will use the “.nsys-rep” extension for generated report files by default.
Older versions of Nsight Systems used “.qdrep”.
Nsight Systems GUI 2021.4 and higher will continue to support opening older “.qdrep” reports.
Versions of Nsight Systems GUI older than 2021.4 will not be able to open “.nsys-rep” reports.
Please note that the versions of the tool on your local machine and the AWS machine are different.
So I'm wondering if there's a flag that I can use to output the stats like the one below
There isn't a flag to control the output in the way you are describing. You could modify your workflow slightly: profile your application without the --stats CLI switch and collect the report file (.nsys-rep/.qdrep). Then use the nsys stats command to apply specific stats reports to that report file.
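For example (report names and output formatting vary between nsys versions, so check nsys stats --help for what your installation supports), something like the following should produce just the CUDA API summary for the report collected with your command above:

nsys stats --report cudaapisum output-report.nsys-rep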
If you have feature requests for the Nsight Systems tool, please let us know through the NVIDIA Developer Forum.
Suppose I have an executable myapp which needs no command-line argument, and launches a CUDA kernel mykernel. I can invoke:
nv-nsight-cu-cli -k mykernel myapp
and get output looking like this:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 1234
[1234] myapp#127.0.0.1
mykernel(), 2020-Oct-25 01:23:45, Context 1, Stream 7
Section: GPU Speed Of Light
--------------------------------------------------------------------
Memory Frequency        cycle/nsecond            1.62
SOL FB                  %                        1.58
Elapsed Cycles          cycle               4,421,067
SM Frequency            cycle/nsecond            1.43
Memory [%]              %                       61.76
Duration                msecond                  3.07
SOL L2                  %                        0.79
SM Active Cycles        cycle            4,390,420.69
(etc. etc.)
--------------------------------------------------------------------
(etc. etc. - other sections here)
So far, so good. But now I just want the overall kernel duration of mykernel, and no other output. Looking at nv-nsight-cu-cli --query-metrics, I see, among others:
gpu__time_duration incremental duration in nanoseconds; isolated measurement is same as gpu__time_active
gpu__time_active total duration in nanoseconds
So, it must be one of these, right? But when I run
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_duration,gpu__time_active
I get:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 12345
[12345] myapp#127.0.0.1
mykernel(), 2020-Oct-25 12:34:56, Context 1, Stream 7
Section: GPU Speed Of Light
Section: Command line profiler metrics
---------------------------------------------------------------
gpu__time_active (!) n/a
gpu__time_duration (!) n/a
---------------------------------------------------------------
My questions:
Why am I getting "n/a" values?
How can I get the actual values I'm after, and nothing else?
Notes:
I'm using CUDA 10.2 with NSight Compute version 2019.5.0 (Build 27346997).
I realize I can filter the standard output stream of the unqualified invocation, but that's not what I'm after.
I actually just want the raw number, but I'm willing to settle for using --csv and taking the last field.
Couldn't find anything relevant in the nvprof transition guide.
tl;dr: You need to specify the appropriate 'submetric':
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg
(Based on @RobertCrovella's comments.)
CUDA's profiling mechanism collects 'base metrics', which are indeed what --query-metrics lists. For each of these, multiple samples are taken. In version 2019.5 of Nsight Compute you can't just get the raw samples; you can only get 'submetric' values.
'Submetrics' are essentially aggregations of the sequence of samples into a scalar value. Different metrics have different kinds of submetrics (see this listing); for gpu__time_active, these are .min, .max, .sum and .avg. Yes, if you're wondering: second-moment submetrics like the variance or the sample standard deviation are missing.
So you must either specify one or more submetrics (see the example above), or upgrade to a newer version of Nsight Compute, which apparently lets you get all the samples directly.
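If you also want machine-readable output, a rough sketch (the exact flags and column layout may differ slightly in your version of the tool) is to combine a submetric with --csv; the duration then appears as the last field of the data row:

nv-nsight-cu-cli -k mykernel --metrics gpu__time_duration.sum --csv myapp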
The command "ctx=mx.cpu()" is taking all available CPU. How to restrict to use a certain number only - say 6 out of 8 core
Unfortunately, no. Even though the cpu context takes an int as an input argument:
def cpu(device_id=0):
"""Returns a CPU context.
according to the official documentation:
Parameters
----------
device_id : int, optional
The device id of the device. `device_id` is not needed for CPU.
This is included to make interface compatible with GPU.
However, in theory this might change in the future, since the device_id argument is already there. For now, though, MXNet uses all available cores.
I'm just looking at the following output and trying to wrap my mind around the numbers:
==2906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.04%  10.9573s     16436  666.67us  64.996us  1.5927ms  sgemm_sm35_ldg_tn_32x16x64x8x16
 22.28%  10.5968s     14088  752.18us  612.13us  1.6235ms  sgemm_sm_heavy_nt_ldg
 18.09%  8.60573s     14088  610.86us  513.05us  1.2504ms  sgemm_sm35_ldg_nn_128x8x128x16x16
 16.48%  7.84050s     68092  115.15us  1.8240us  503.00us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
  0.25%  117.53ms      4744  24.773us     896ns  11.803ms  [CUDA memcpy HtoD]
  0.23%  107.32ms     37582  2.8550us  1.8880us  8.9556ms  [CUDA memcpy DtoH]
...
==2906== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 83.47%  41.8256s     42326  988.18us  16.923us  13.332ms  cudaMemcpy
  9.27%  4.64747s    326372  14.239us  10.846us  11.601ms  cudaLaunch
  1.49%  745.12ms   1502720     495ns     379ns  1.7092ms  cudaSetupArgument
  1.37%  688.09ms      4702  146.34us     879ns  615.09ms  cudaFree
...
When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? At first it looks like memcpy only takes 117.53 + 107.32 ms (both directions combined), but then there is the API call cudaMemcpy: 41.8256 s, which is much more. Also, the min/avg/max columns don't match between the upper and the lower output block.
Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?
EDIT: A second question: is there a way to figure out who is calling e.g. axpy_kernel_val (and how many times)?
The difference in total time is due to the fact that work is launched on the GPU asynchronously. If you have a long-running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched well before the kernel(s) have finished executing. The total time of the API call is measured from the moment it is launched to the moment it completes, so it will overlap with executing kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).
The difference in min time is due to launch overhead. While the API profiling takes all of the launch overhead into account, the kernel timing only contains a small part of it. Launch overhead can be ~10-20 us, as you can see here.
In general, the API calls section tells you what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case, I'd argue you're underusing the CPU, as arguably the cudaMemcpy is launched too early and CPU cycles are wasted. In practice, however, it's often hard or impossible to get anything useful out of these spare cycles.
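A minimal sketch of this effect (the kernel and buffer sizes here are made up purely for illustration, they are not from the question):

__global__ void busy_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)   // artificially long-running work
            d[i] = d[i] * 0.5f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data, *h_data = new float[n];
    cudaMalloc(&d_data, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);   // returns immediately (asynchronous)

    // This call blocks until the kernel has finished AND the copy is done, so the
    // profiler's cudaMemcpy API time includes most of the kernel's execution time,
    // while the [CUDA memcpy DtoH] row only reflects the copy itself.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    return 0;
}

Inserting a cudaDeviceSynchronize() before the cudaMemcpy would move that waiting time out of the cudaMemcpy API call in the profile.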
How do I declare a struct in device code where one member is an array, and then dynamically allocate memory for it? For example, with the code below the compiler says: error : calling a __host__ function("malloc") from a __global__ function("kernel_ScoreMatrix") is not allowed. Is there another way to perform this action?
The type of dev_size_idx_threads is int*; its value is passed to the kernel and used to allocate memory.
struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

struct struct_matrix matrix[BLOCK_SIZE_Y];
matrix->idx_threads_x = (int *) malloc((*dev_size_idx_threads) * sizeof(int));
In device code, dynamic memory allocation (malloc and new) is supported only on devices of compute capability 2.0 and greater. If you have a cc2.0 device or greater and you pass an appropriate flag to nvcc (such as -arch=sm_20), you should not see this error. Note that if you are passing multiple compilation targets (sm_10, sm_20, etc.), even a single target that does not meet the cc2.0+ requirement will produce this error.
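For example, assuming the source file is called myapp.cu (the file name here is just illustrative), the compile command could look like:

nvcc -arch=sm_20 -o myapp myapp.cu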
If you have a cc1.x device, you will need to perform these types of allocations from the host (e.g. using cudaMalloc) and pass appropriate pointers to your kernel.
If you choose that route (allocating from the host), you may also be interested in my answer to questions like this one.
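A rough sketch of the host-allocation approach, reusing the struct from the question (BLOCK_SIZE_Y, N and the kernel body are illustrative placeholders, not taken from the original code):

struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

__global__ void kernel_ScoreMatrix(struct_matrix *m, int n)
{
    // the pointers stored in m already point to device memory
    if (threadIdx.x < n)
        m[blockIdx.x].idx_threads_x[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int BLOCK_SIZE_Y = 16;   // illustrative
    const int N = 32;              // illustrative element count per array

    // set up the struct array on the host, allocating the member arrays with cudaMalloc
    struct_matrix h_matrix[BLOCK_SIZE_Y];
    for (int i = 0; i < BLOCK_SIZE_Y; ++i) {
        cudaMalloc(&h_matrix[i].idx_threads_x, N * sizeof(int));
        cudaMalloc(&h_matrix[i].idx_threads_y, N * sizeof(int));
        h_matrix[i].idx_length = N;
    }

    // copy the struct array (which now contains device pointers) to the device
    struct_matrix *d_matrix;
    cudaMalloc(&d_matrix, BLOCK_SIZE_Y * sizeof(struct_matrix));
    cudaMemcpy(d_matrix, h_matrix, BLOCK_SIZE_Y * sizeof(struct_matrix), cudaMemcpyHostToDevice);

    kernel_ScoreMatrix<<<BLOCK_SIZE_Y, N>>>(d_matrix, N);
    cudaDeviceSynchronize();
    return 0;
}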
EDIT: responding to questions below:
In Visual Studio (2008 Express; it should be similar for other versions), you can set the compilation target as follows: open the project, select Project...Properties, then Configuration Properties...CUDA Runtime API...GPU. In the right-hand pane you will see entries like GPU Architecture (1) (and (2), etc.). These are drop-downs you can click to select the target(s) you want to compile for. If your GPU is sm_21, I would select that for (1) and leave the others blank, or select compatible versions like sm_20.
To see worked examples, please follow the link I gave above. A couple of worked examples, along with a description of how it is done, are also linked from my answer here.