How to show the title for nsys profile?

I have noticed that when I use nsys in my machine
nsys profile --stats=true -o output-report ./input
It outputs the data like this:
NVIDIA Nsight Systems version 2022.4.2.50-32196742v0
[5/8] Executing 'cudaapisum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------ ----------------------
46.7 100,404,793 3 33,468,264.3 22,463.0 12,434 100,369,896 57,938,512.8 cudaMallocManaged
39.5 84,938,847 1 84,938,847.0 84,938,847.0 84,938,847 84,938,847 0.0 cudaDeviceSynchronize
13.8 29,677,781 3 9,892,593.7 9,610,457.0 9,514,092 10,553,232 574,154.9 cudaFree
0.0 82,478 1 82,478.0 82,478.0 82,478 82,478 0.0 cuLibraryLoadData
0.0 40,588 1 40,588.0 40,588.0 40,588 40,588 0.0 cudaLaunchKernel
0.0 892 1 892.0 892.0 892 892 0.0 cuModuleGetLoadingMode
The section is described by "Executing 'cudaapisum' stats report" instead of the normal title like "CUDA API Statistics". So I'm wondering if there's a flag that I can use to output the stats like the one below:
The output below isn't from my machine; it's from an AWS machine.
NVIDIA Nsight Systems version 2021.1.1.66-6c5c5cb
CUDA API Statistics:
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
------- --------------- --------- ----------- --------- --------- ---------------------
61.5 250696605 3 83565535.0 36197 250541972 cudaMallocManaged
32.8 133916228 1 133916228.0 133916228 133916228 cudaDeviceSynchronize
5.7 23226526 3 7742175.3 6373371 9064987 cudaFree
0.0 56395 1 56395.0 56395 56395 cudaLaunchKernel
And the other thing I have to mention is that on my machine it automatically outputs the profile file to a .nsys-rep extension not the .qdrep extension. Are both of them the same or different?
I've been trying to find information in the nsys documentation, but I couldn't find any. I've tried searching in stackoverflow & nvidia's forum on Nsight but none came up so far. Maybe I've missed something. Any help will be appreciated.
Note: both outputs were produced with the same command, just with a slightly different file.

And the other thing I have to mention is that on my machine it automatically outputs the profile file to a .nsys-rep extension not the .qdrep extension. Are both of them the same or different?
.nsys-rep is the new extension for what used to be .qdrep files; the format is the same. The change happened with version 2021.4.
Specifically, from the release notes of the aforementioned version:
Result file rename
In order to make the Nsight tools family more consistent, all versions of Nsight Systems starting with 2021.4 will use the “.nsys-rep” extension for generated report files by default.
Older versions of Nsight Systems used “.qdrep”.
Nsight Systems GUI 2021.4 and higher will continue to support opening older ".qdrep" reports.
Versions of Nsight Systems GUI older than 2021.4 will not be able to open “.nsys-rep” reports.
Please note that the versions of the tool on your local machine and the AWS machine are different.
So I'm wondering if there's a flag that I can use to output the stats like the one below
There isn't a flag to control the output you are mentioning. You could modify your workflow slightly: profile your application without the --stats CLI switch and collect the report file (nsys-rep/qdrep). Then you can use the nsys stats command and apply specific stats reports to your report file, as sketched below.
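A minimal sketch of that workflow (the report name cudaapisum is taken from the output above, but exact report names and flags vary between nsys versions, so check nsys stats --help on your machine):
nsys profile -o output-report ./input                    # produces output-report.nsys-rep, no stats printed
nsys stats --report cudaapisum output-report.nsys-rep    # run just the CUDA API summary report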
If you have feature requests for the Nsight Systems tool, please let us know through the NVIDIA Developer Forum.

How to measure the amount of data copied in NVIDIA nsight systems?

Trivia
In NVIDIA Nsight Systems you can use the --stats=true flag to get the details for data transfer between GPU and CPU. The output includes a section similar to what follows:
CUDA Memory Operation Statistics (KiB)
Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- -------------------
8192.000 2 4096.000 4096.000 4096.000 [CUDA memcpy HtoD]
528384.000 2 264192.000 4096.000 524288.000 [CUDA memcpy DtoD]
Question
Is it possible to get these statistics per API call? That is, can we get the amount of data transferred between host and device in each of the cudaMemcpyXXX calls?
If you want to do this purely from the CLI, I suggest following the guidance given in this blog, starting at "Extending the Summary Statistics". The basic steps are to export the profile data as a SQLite database, then formulate a database query to extract the data that you want. I acknowledge this is not a complete recipe.
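A rough sketch of that approach follows; the table and column names (CUPTI_ACTIVITY_KIND_MEMCPY, bytes, copyKind) are assumptions based on what a recent nsys SQLite export contained, so inspect the database with .schema if they differ in your version:
nsys export --type sqlite -o report1.sqlite report1.nsys-rep
sqlite3 report1.sqlite "SELECT start, end, bytes, copyKind FROM CUPTI_ACTIVITY_KIND_MEMCPY ORDER BY start;"
Each returned row is one memcpy operation, so the bytes column gives the amount of data moved by each individual call.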
If using the GUI is acceptable, I think it is pretty straightforward. Suppose we had a very simple CUDA program:
#include <cuda_runtime.h>

int main(){
  int *d1, *d2;
  int *h1, *h2;
  h1 = new int[8192];      // 8192 ints = 32 KiB (assuming 4-byte int)
  h2 = new int[262144];    // 262144 ints = 1 MiB
  cudaMalloc(&d1, 32768);
  cudaMalloc(&d2, 1048576);
  cudaMemcpy(d1, h1, 32768, cudaMemcpyHostToDevice);    // 32 KiB HtoD copy
  cudaMemcpy(d2, h2, 1048576, cudaMemcpyHostToDevice);  // 1 MiB HtoD copy
}
These are the steps:
You could either do interactive profiling directly from the GUI as covered here or you could start with the CLI. To start with the CLI, run a command like this:
nsys profile --trace=cuda ./my_app
Among other things, this will create a report file named reportX.qdrep, where X is a number such as 1, 2, 3, etc.
Open up the GUI, and File...Open the above reportX.qdrep file. In this case, the GUI need not be on the same machine, but it should be of a version greater than or equal to the CLI version used to create the report file.
Fully expand all the rows in the timeline pertaining to the CUDA activities.
Hover your mouse over the desired operation of interest; a pop-up will show the details of that operation, including the amount of memory copied.

How can I get a kernel's execution time with NSight Compute 2019 CLI?

Suppose I have an executable myapp which needs no command-line argument, and launches a CUDA kernel mykernel. I can invoke:
nv-nsight-cu-cli -k mykernel myapp
and get output looking like this:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 1234
[1234] myapp#127.0.0.1
mykernel(), 2020-Oct-25 01:23:45, Context 1, Stream 7
Section: GPU Speed Of Light
--------------------------------------------------------------------
Memory Frequency cycle/nsecond 1.62
SOL FB % 1.58
Elapsed Cycles cycle 4,421,067
SM Frequency cycle/nsecond 1.43
Memory [%] % 61.76
Duration msecond 3.07
SOL L2 % 0.79
SM Active Cycles cycle 4,390,420.69
(etc. etc.)
--------------------------------------------------------------------
(etc. etc. - other sections here)
So far, so good. But now I just want the overall kernel duration of mykernel, and no other output. Looking at nv-nsight-cu-cli --query-metrics, I see, among others:
gpu__time_duration incremental duration in nanoseconds; isolated measurement is same as gpu__time_active
gpu__time_active total duration in nanoseconds
So, it must be one of these, right? But when I run
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_duration,gpu__time_active
I get:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 12345
[12345] myapp#127.0.0.1
mykernel(), 2020-Oct-25 12:34:56, Context 1, Stream 7
Section: GPU Speed Of Light
Section: Command line profiler metrics
---------------------------------------------------------------
gpu__time_active (!) n/a
gpu__time_duration (!) n/a
---------------------------------------------------------------
My questions:
Why am I getting "n/a" values?
How can I get the actual values I'm after, and nothing else?
Notes:
I'm using CUDA 10.2 with NSight Compute version 2019.5.0 (Build 27346997).
I realize I can filter the standard output stream of the unqualified invocation, but that's not what I'm after.
I actually just want the raw number, but I'm willing to settle for using --csv and taking the last field.
Couldn't find anything relevant in the nvprof transition guide.
tl;dr: You need to specify the appropriate 'submetric':
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg
(Based on @RobertCrovella's comments)
CUDA's profiling mechanism collects 'base metrics', which are indeed what --query-metrics lists. For each of these, multiple samples are taken. In version 2019.5 of NSight Compute you can't just get the raw samples; you can only get 'submetric' values.
'Submetrics' are essentially some aggregation of the sequence of samples into a scalar value. Different metrics have different kinds of submetrics (see this listing); for gpu__time_active, these are: .min, .max, .sum, .avg. Yes, if you're wondering, they're missing second-moment submetrics like the variance or the sample standard deviation.
So you must either specify one or more submetrics (see the example above), or alternatively upgrade to a newer version of NSight Compute, which apparently does let you get the raw samples.
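If you also want machine-readable output (the question mentions settling for --csv and taking the last field), something along these lines should work, assuming the 2019.5 CLI accepts --csv together with --metrics:
nv-nsight-cu-cli -k mykernel --metrics gpu__time_active.avg --csv myapp
The duration is then the last field of the data row, which is easy to extract with standard text tools.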

In Deep Learning "mxnet", restrict number of core (cpu)

The command "ctx=mx.cpu()" is taking all available CPU. How to restrict to use a certain number only - say 6 out of 8 core
Unfortunately, no. Even though the cpu context accepts an int as an input argument:
def cpu(device_id=0):
"""Returns a CPU context.
according to the official documentation:
Parameters
----------
device_id : int, optional
The device id of the device. `device_id` is not needed for CPU.
This is included to make interface compatible with GPU.
In theory, this might change in the future, since the device_id argument is there. But for now, MXNet uses all available cores.

How can I modify xorg.conf file to force X server to run on a specific GPU? (I am using multiple GPUs) [closed]

I'm running 2 GPUs and I'm trying to force X server to run on one GPU.
According to this website: http://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x, here is how I should proceed:
The X display should be forced onto a single GPU using the BusID
parameter in the relevant "Display" section of the xorg.conf file. In
addition, any other "Display" sections should be deleted. For example:
BusID "PCI:34:0:0"
Here is my xorg.conf file :
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 304.64 (buildmeister#swio-display-x86-rhel47-12) Tue Oct 30 12:04:46 PDT 2012
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0"
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
SubSection "Display"
Depth 24
EndSubSection
EndSection
So I tried to modify the Display subsection with the correct BusID, but it still does not work. I also tried putting it in the Device section.
Does anyone know how I could do that?
If you have 2 NVIDIA GPUs, get the BusID parameters for both. The doc you linked explains a couple ways to do that, but nvidia-smi -a is pretty easy.
You will need to figure out which GPU you want to keep for display, and which you want to keep for CUDA. Again, this should be pretty obvious from nvidia-smi -a.
Let's suppose your nvidia-smi -a includes a section like this:
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x06D910DE
Bus Id : 0000:02:00.0
Then modify the device section like this:
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:2:0:0"
EndSection
Then reboot.
Make sure the one you are keeping for display is the one with the display cable attached!
You may also be interested in reading the nvidia driver readme and search on "BusID" for additional tips.
The document you linked references a "Display" section but that should be the "Device" section.
Since I cannot add comments to the answer above due to the reputation restriction, I will just leave my solution here.
I followed the solution provided by @Robert Crovella, but it still did not work for me until I changed the BusID to decimal format.
Let me give more details.
Two GPUs: a GTX 1080 Ti (device0) and a GTX 960 (device1). I want to use the GTX 1080 Ti (device0) as the computing card and the GTX 960 (device1) for the Xorg display.
Find their BusIDs: you can find the BusIDs via the command lspci | grep VGA, which gives the following:
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
So we get the BusID 03:00.0 for device0 and 82:00.0 for device1, but these are hexadecimal numbers, so convert 0x03 and 0x82 to decimal: 3 and 130, respectively.
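For reference, the hexadecimal-to-decimal conversion can be done directly in a shell:
printf "%d\n" 0x03   # prints 3
printf "%d\n" 0x82   # prints 130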
Add the BusID to the Device section in the xorg.conf file:
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:130:0:0"
EndSection
Pay attention to the BusID format: it uses colons (e.g. "130:0:0"), not dots (e.g. "130.0.0"). Also use the same device in the "Screen" section:
Section "Screen"
Identifier "Screen0"
Device "Device1"
...
EndSection
With the monitors connected to the display GPU, reboot your computer.
I found this solution when I read the comment above by @Piotr Dobrogost and this article, and double-checked that the BusID used in the xorg.conf file must be in decimal format, which is different from the BusID reported by the lspci command.

Only one node owns data in a Cassandra cluster

I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found that all data are stored on one node (in the output below, I have replaced the IP addresses with node numbers manually):
Datacenter: 105
==========
Address Rack Status State Load Owns Token
4
node-1 155 Up Normal 249.89 KB 100.00% 0
node-2 155 Up Normal 265.39 KB 0.00% 1
node-3 155 Up Normal 262.31 KB 0.00% 2
node-4 155 Up Normal 98.35 KB 0.00% 3
node-5 155 Up Normal 113.58 KB 0.00% 4
and in their cassandra.yaml files I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list the non-IP-address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "node-1, node-2"
internode_compression: none
and all nodes use the same seeds.
Can I know where I might do wrong in the config? And please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8 you should try using the vnodes feature. Instead of setting initial_token, uncomment # num_tokens: 256 in cassandra.yaml, and leave initial_token blank or comment it out. Then you don't have to calculate token positions. Each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means that you don't have to "rebalance" your cluster every time you add or remove nodes.
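In cassandra.yaml terms, the change described above looks like this on every node (256 is the commented-out default in the stock config):
# initial_token:
num_tokens: 256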
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. The assigned token determines the node's position in the ring and the range of data it stores. When you generate tokens, the aim is to use up the entire range from 0 to (2^127 - 1). Tokens aren't IDs like in a MySQL cluster where you have to increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of the tokens. And if you want to understand the meaning of the numbers that are generated check this article out.
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh, you'll see which replication_factor is currently set for your keyspace (I assume the answer is 1).
More details here
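If the keyspace already exists with replication_factor 1, a sketch of changing it in place (standard CQL; after altering, running nodetool repair on each node is normally needed so existing data gets replicated):
ALTER KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 3};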