How to profile CUDA code on a headless node? - cuda

I'm working on a CUDA application that I'd like to profile. Up to now, all I've used is the command-line profiler, nvprof, which just displays summarized statistics.
I thought about using the GUI profiler, NVVP. The problem is that the remote Linux node I'm running the application on doesn't have any GUI stack (not even X.org). Moreover, even if I managed to get some X11 stack onto the remote node, keeping my own laptop alive for the whole duration of the profiling would be, well, tedious.
I tried collecting all the needed information in the following way:
nvprof --analysis-metrics -o application.nvprof ./myapplication
Then I copy the output file onto my laptop and view it in NVVP. This has three problems, though.
First of all, I don't get any data-transfer (memcpy) information when I load the output file into NVVP; it's not shown in the NVVP window at all.
Secondly, the timeline is completely distorted. The gaps between kernel launches are at least 100x bigger than the kernel durations, which makes any dependency and flow analysis impossible.
Lastly, my application uses a lot of GPU memory. During profiling the device runs out of memory, which is not the case during a standalone run.
How should I properly profile my CUDA application on a headless node?

NVVP supports headless nodes as first-class citizens: remote profiling is a major feature of NVVP.
The way this works is that NVVP runs on your local, GUI-enabled machine, invokes nvprof on the headless machine, generates the required files there, copies them over, and opens them. All of this happens transparently and automatically. You can run further analyses from NVVP as usual, and it will repeat these steps for you.
To use remote profiling, open NVVP, then File->New Session. Add a Connection instead of using Local, filling in the details of the headless machine. Click on Manage... to point NVVP to the toolkit path on the remote machine. Once this one-time setup is done, enter the path to the executable and run as usual.
You can read about remote profiling in the relevant documentation.
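If you would rather keep driving nvprof by hand, one workaround is to split the collection into two passes, since gathering analysis metrics forces kernel replay (which likely explains both the distorted timeline and the extra memory pressure you saw). A sketch, with output file names chosen arbitrarily:

```shell
# Pass 1: timeline only -- no kernel replay, so the gaps between kernel
# launches reflect the real execution
nvprof --export-profile timeline.nvprof ./myapplication

# Pass 2: analysis metrics for NVVP's guided analysis (kernels are replayed
# here, so this run is slow and not timeline-accurate)
nvprof --analysis-metrics -o metrics.nvprof ./myapplication
```

Copy both files to your laptop and import them into the same NVVP session (File->Import->Nvprof), so the timeline view comes from the first pass and the metric data from the second.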

Related

What is the difference between qemu-sparc64 and qemu-system-sparc64?

I am trying to run some simple SPARC tests on bare-metal QEMU. I am using qemu-sparc64 -g 1234 simple_example and it seems to work fine (I can connect gdb to localhost:1234, step through, etc.), but I was wondering what qemu-system-sparc64 does. I tried running it with the same command-line switches but got some errors. Any help is appreciated, thank you.
For any QEMU architecture target, the qemu-system-foo binary runs a complete system emulation of the CPU and all other devices that make up a machine using that CPU type. It typically is used to run a guest OS kernel, like Linux; it can run other bare-metal guest code too.
The qemu-foo binary (sometimes also named qemu-foo-static if it has been statically linked) is QEMU's "user-mode" or "linux-user" emulation. This expects to run a single Linux userspace binary, and it translates all the system calls that process makes into direct host system calls.
If you're running qemu-sparc64 then you are not running your program in a bare-metal environment -- it's a proper Linux userspace process, even if you're not necessarily using all of the facilities that allows. If you want bare metal then you need qemu-system-sparc64, but your program needs to actually be compiled to run correctly on the specific machine type that you tell QEMU to emulate (e.g. the sun4u hardware, which is the default). Also, by default qemu-system-sparc64 will run the OpenBIOS firmware, so your bare-metal guest code needs to either run under that OpenBIOS environment, or else you need to tell QEMU not to run the BIOS (and then you get to deal with all the hardware setup that the BIOS would do for you).
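To make the distinction concrete, here is a sketch of the two invocations (the -kernel path is a placeholder; run qemu-system-sparc64 -M help for the list of supported machine types):

```shell
# User-mode: runs a single Linux userspace binary, translating its system
# calls into host system calls; -g 1234 exposes a gdbstub on port 1234
qemu-sparc64 -g 1234 ./simple_example

# Full-system: emulates a whole machine (sun4u by default), boots OpenBIOS,
# and expects a kernel or bare-metal image built for that hardware
qemu-system-sparc64 -M sun4u -nographic -kernel ./baremetal.elf
```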

How to set breakpoint on symbols in qemu user mode emulated process?

Since qemu user-mode emulation doesn't support the ptrace system call, I am trying to debug a qemu user-mode emulated process via qemu's gdbstub, using another gdb instance to connect to it via target remote :1234.
This works fine for basic commands like si to single-step instructions, but I cannot set breakpoints on symbols in the main emulated executable, such as main. Simply running break main sets the breakpoint at some raw, un-relocated address (like 0x63a), and if I hit c in the gdb client, the symbols for the main executable are never resolved to their real virtual addresses, so the breakpoint is never hit.
Is this a general issue with debugging qemu user-mode emulated processes, and is there any way to set the breakpoint correctly?
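For reference, the kind of session I mean looks roughly like this (the binary name is a placeholder):

```shell
# Terminal 1: run the binary under the user-mode emulator with a gdbstub
qemu-sparc64 -g 1234 ./myprog

# Terminal 2: attach gdb to the stub, then try to break on a symbol
gdb ./myprog -ex 'target remote localhost:1234' -ex 'break main'
```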

How to setup and save qemu running option

I'm using qemu to replace bochs (since bochs isn't updated anymore).
In bochs, I can save the running settings into files and reload them. Furthermore, a table of the running options is listed at boot.
I'm wondering if I can do the same with qemu: save running settings such as the CPU model and other options into a file, and reload it the next time I run the emulation.
I'd also like to know whether there is a full table of running options, so I can get a complete view of which options I can set.
Thanks a lot!
For this sort of UI and management of VMs, you should look at a "management layer" program that sits on top of QEMU; libvirt's "virt-manager" is one common choice here. A management layer will generally allow you to define options for a VM and save them, so you can start and stop that VM without having to specify all the command-line options every time. It will also configure QEMU in a more secure and performant way than you get by default, which often requires rather long QEMU command lines.
QEMU itself doesn't provide this kind of facility because its philosophy is to just be the low-level tool which runs a VM, and leave the UI and persistent-VM-management to other software which can do a better job of it.
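As a sketch of what that looks like with libvirt's command-line tools (the VM name, sizes, and paths below are placeholders; check the virt-install and virsh man pages for the full option list):

```shell
# Define and install the VM once -- libvirt persists the configuration as XML
virt-install --name demo-vm --memory 2048 --vcpus 2 \
    --disk path=/var/lib/libvirt/images/demo.qcow2,size=10 \
    --cdrom /path/to/installer.iso --os-variant generic

# Later runs reuse the saved definition; no long QEMU command line needed
virsh start demo-vm
virsh shutdown demo-vm

# Inspect or tweak the persisted settings
virsh dumpxml demo-vm
virsh edit demo-vm
```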

Tcl: How to run an executable on another machine

While working on machine xyz (which could be Linux or Windows), I need to start an executable program on another Windows machine (just a trigger would do).
How could that be done?
As per the comments provided, I did the following purely on a trial basis:
- Ran a custom listening Tcl script on the target machine (it monitors a shared file for writes); I'm also trying to use the Windows Task Scheduler to make it run each time the user logs in
- On the machine that sends the command to the remote machine, wrote a Tcl proc that writes to the shared file mentioned above
Thus, whenever the file is written to, the script running on the target machine reads the instructions and executes them accordingly.
This is working for now; there are a few issues I am still facing, but I think it will get better eventually.
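The polling protocol described above can be sketched in shell terms (the real scripts are in Tcl; the file path, polling interval, and function name here are illustrative only):

```shell
#!/bin/sh
# watcher.sh -- runs on the target machine and polls a shared command file
CMD_FILE=${CMD_FILE:-/shared/run_me.txt}

# If the shared file is non-empty, run its contents once and clear it.
run_pending() {
    if [ -s "$CMD_FILE" ]; then
        cmd=$(cat "$CMD_FILE")
        : > "$CMD_FILE"        # truncate so the command is not re-run
        eval "$cmd"            # start the requested executable
    fi
}

# When launched with --watch, poll once per second forever.
if [ "${1:-}" = "--watch" ]; then
    while true; do
        run_pending
        sleep 1
    done
fi
```

The sender side then just writes the command line it wants executed into the shared file. Note that eval-ing file contents trusts whoever can write to that share, which is the same trust assumption the Tcl scheme makes.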

How do I capture console output from a remote NSight session?

I have a set of CUDA apps that write to the console via cout. I have a host machine with VS and the Nsight plug-in, and a target machine with the Nsight service. However, when I execute the console app, it actually runs on the target machine (a console literally pops up there).
So here's the question: how can I get the console to show up on the host and only the GPU work to execute on the target? Is this even possible?
Thanks!
The short answer is that it is currently not possible. The application on the target is executed by the Nsight Monitor process, but Nsight Monitor currently does not forward the output back to the host.
For now, your only option is to take care of it yourself by capturing the output of your application on the target and somehow displaying it on the host.
If this feature is important to you, I suggest you file a feature request via your NVIDIA developer account.
The CUDA application runs entirely on the target machine, so the console or UI for the application will only be seen on the target machine. You can set breakpoints in the GPU code on the VS side (your host machine), and it should break there.
If the application quits too quickly and is not launching the kernels as expected (and you are not hitting the breakpoints), it may be that you have not deployed all the required DLLs to the target machine (e.g. the CUDA runtime, cudart).