Is it possible to programmatically determine if the CUDA profiler is running?

The problem I'm trying to solve: in most of our command-line apps, when run from Visual Studio, we force the user to hit a key before exiting, so we can see the output in the Visual Studio console while we're debugging.
This doesn't work at all when profiling. One way to fix that would be to determine whether the profiler is running or not.
The API to the CUDA profiler is rather limited:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER.html
It appears to support:
Initialization: cudaProfilerInitialize
Starting: cudaProfilerStart
Stopping: cudaProfilerStop
But no way to determine if it's actually running?
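For reference, here is a minimal sketch of how the start/stop calls are used (names from the header documented above; note that neither call reports whether a profiler is attached):

#include <cuda_profiler_api.h>

int main(void)
{
    /* ... set up device memory, etc. ... */
    cudaProfilerStart();   /* begin collecting profile data; effectively a no-op
                              when no profiler is attached */
    /* launch the kernels you want profiled here */
    cudaProfilerStop();    /* stop collecting */
    /* Neither call returns any indication of whether a profiler is running. */
    return 0;
}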

Well, an ugly and surely sub-optimal solution is just searching for nvprof among the running processes...
On Linux, you can do this with readproc():
#include <proc/readproc.h>
proc_t* readproc(PROCTAB *PT, proc_t *return_buf);
For more information on how to use the functions in readproc.h, have a look at:
How does the ps command work?
on SuperUser.com, and particularly at this answer.
Note: Don't forget nvprof might be running but not profiling your process.
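Here is a minimal sketch of that approach, assuming libprocps is available (link with -lprocps); the helper name is mine:

#include <proc/readproc.h>
#include <string.h>

/* Sketch: scan the process table for a process whose command name is
 * "nvprof". Returns 1 if found, 0 otherwise. As noted above, finding
 * nvprof does not prove it is profiling *this* process. */
static int nvprof_is_running(void)
{
    PROCTAB *pt = openproc(PROC_FILLSTAT);
    proc_t proc;
    int found = 0;

    if (!pt)
        return 0;
    memset(&proc, 0, sizeof(proc));
    while (readproc(pt, &proc) != NULL) {
        if (strcmp(proc.cmd, "nvprof") == 0) {
            found = 1;
            break;
        }
    }
    closeproc(pt);
    return found;
}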

Related

How to count the number of guest instructions QEMU executed from the beginning to the end of a run?

I want to benchmark guest instructions per second of QEMU to compare it with other simulators.
How can I obtain the guest instruction count? I'm interested in both user and full-system mode.
The only solutions I have now would be to log all instructions with either the exec_tb simple trace event or -d in_asm (see: How to use QEMU's simple trace backend?) and then count the instructions from there. But this would likely reduce simulation performance considerably due to the output operations, so I would have to run the test program twice, once with and once without the trace, and hope that both executions are similar (they should be, especially for single-threaded user-mode simulation).
I saw the -icount option, which sounds promising from the name, but when I passed it to QEMU 4.0.0, I didn't see anything happen. Should it print an instruction count somewhere? The following patch appears unmerged and suggests not: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01275.html
Basic Profiling
To follow up on Peter's answer, I have recently run into a situation where I wanted to get the instruction count of a program run under QEMU (I'm using v4.2.0, the first version in which plugins are available).
One of the example plugins, insn.c, does exactly what you want: it prints the count of executed instructions on plugin exit.
(I assume you already know how to run QEMU, so I'll strip this down to the important flags)
qemu-system-arm ... -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin
The first part loads the plugin and passes a single argument, "inline", to it. The -d plugin part enables plugin output. You can redirect the plugin output to a different file by adding -D filename to the command-line invocation.
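If you're curious what such a plugin looks like inside, here is a stripped-down sketch written against the v4.2 plugin API (my paraphrase, not the actual insn.c source; it uses the slower per-instruction callback mode rather than the "inline" mode passed above):

#include <inttypes.h>
#include <stdio.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count;

/* Called once per executed instruction. */
static void vcpu_insn_exec(unsigned int vcpu_index, void *udata)
{
    insn_count++;   /* not thread-safe; adequate for a single-vCPU sketch */
}

/* Called whenever a block is (re)translated: hook every instruction in it. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t i, n = qemu_plugin_tb_n_insns(tb);
    for (i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        qemu_plugin_register_vcpu_insn_exec_cb(insn, vcpu_insn_exec,
                                               QEMU_PLUGIN_CB_NO_REGS, NULL);
    }
}

/* Print the total when the guest exits. */
static void plugin_exit(qemu_plugin_id_t id, void *p)
{
    char buf[64];
    snprintf(buf, sizeof(buf), "insns: %" PRIu64 "\n", insn_count);
    qemu_plugin_outs(buf);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}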
More Advanced Profiling
When I was looking for possible ways to profile a program run under QEMU, this was one of the few promising results of my search. In the spirit of creating a good record for others searching in the future, here are some links to code that I have written to do just that.
Profiling Plugin code, docs.
Disclaimer: I wrote the above code.
Current released versions of QEMU don't provide any means for doing this. The upcoming "TCG plugin" support, which should go out in the 4.2 release at the end of the year, will allow you to write a simple "count the instructions executed" plugin, but this (as with the -d tracing) will add overhead.
The -icount option is certainly confusing, but what it does is make the emulated CPU (try to) run at a specific number of executed instructions per second, as opposed to the default of "as fast as possible". This has higher overhead (and it will stop QEMU using multiple host threads for SMP guests), but is more deterministic.
Philosophically speaking, "instructions per second" is a rather misleading metric for emulators, because the time taken to execute an instruction varies far more than it does on hardware. Loads and stores are rather slower than on real hardware. Floating-point instructions are incredibly slow (perhaps a factor of 10 or worse relative to an integer arithmetic instruction, where real hardware could execute both in one cycle).
JIT emulators like QEMU also have a start-stop performance profile, where execution stops entirely while a block of code is translated, whereas a real CPU or an interpreting emulator will not have these pauses. How much effect the JIT time has will depend on whether your code frequently reruns previously translated hot code or spends most of its time running "new" code, and whether it does things that force the JIT to discard old translations (e.g. self-modifying code, or frequent between-process context switches).
If you had an "IPS meter" on your emulator, you'd see the value it reported fluctuate wildly as the guest code executed and did different things. You're probably better off just picking a benchmark you think is representative of your actual use case, running it on the various emulators, and comparing the wall-clock time each takes to complete.

Cannot profile CUDA code with nvprof when using CUPTI functions inside

I'm doing a simple experiment. You may know the callback_metric sample code that ships with CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It contains just a simple code path that reads a metric while running a vectorAdd kernel. Everything works when I compile and run the code.
But when I run this code under nvprof command (nvprof ./callback_metric), I get an error message as:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based programs work fine when run separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.

Is there any difference in the output of nvvp (visual) and nvprof (command line)?

To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <metric_name> ./<application>
I also measured the same metrics on the Visual profiler nvvp. I noticed no difference in the values I get.
I did notice a difference in output when I chose a metric like achieved_occupancy. But that metric varies with every execution, which is probably why I get different results each time I run it, irrespective of whether I use nvvp or nvprof.
The question:
I was under the impression that nvvp and nvprof are exactly the same, and that nvvp is simply a GUI built on top of nvprof for ease of use. However I have been given this advice:
Always use the visual profiler. Never use the command line.
Also, this question says:
I do not want to use the command line profiler as I need the global load/store efficiency, replay and DRAM utilization, which are much more visible in the visual profiler.
Apart from 'dynamic' metrics like achieved_occupancy, I never noticed any differences in results. So, is this advice valid? Is there some sort of deficiency in the way nvprof works? I would like to know the advantages of using the visual profiler over the command line form, if there are any.
More specifically, are there metrics for which nvprof gives wrong results?
Note:
My question is not the same as this or this because these are asking about the difference between nvvp and Nsight.
I'm not sure why someone would give you the advice:
Never use the command line.
assuming by "command line" you do in fact mean nvprof. That's not sensible. There are situations where it makes sense to use nvprof. (Note that if you actually meant the command line profiler, then that advice might be somewhat sensible, although still a matter of preference. It is separate from nvprof so has a separate learning curve. I personally would use nvprof instead of the command line profiler.)
nvvp uses nvprof under the hood in order to do all of its measurement work. However, nvvp may combine measured metrics in various interesting ways, e.g. to facilitate guided analysis.
nvprof should not give you "wrong results", and if it did for some reason, then nvvp should be equally susceptible to such errors.
Use of nvvp vs. nvprof may be simply a matter of taste or preference.
Many folks will like the convenience of the GUI. And the nvvp GUI offers a "Guided Analysis" mode which nvprof does not. I'm sure an exhaustive list of other differences could be assembled by going through the documentation. But whatever nvvp does, it does using nvprof. It doesn't have an alternate method to query the device for profiler data -- it uses nvprof.
I would use nvprof when it's inconvenient to use nvvp, perhaps when I am running on a compute cluster node where it's difficult or impossible to launch nvvp. You might also use it if you are doing targeted profiling (measuring a single metric, e.g. shared_replay_overhead -- nvprof is certainly quicker than firing up the GUI and running a session), or if you are collecting metrics to tabulate over a large series of runs.
In most other cases, I personally would use nvvp. The timeline feature itself is hugely more convenient than trying to assemble a sequence in your head from the output of nvprof --print-gpu-trace ... which is essentially the same info as the timeline.

Profile debug or release cuda code?

I have been profiling an application with nvprof and nvvp (5.5) in order to optimize it. However, I get totally different results for some metrics/events, like inst_replay_overhead, ipc, or branch_efficiency, when I profile the debug (-G) and release versions of the code.
So my question is: which version should I profile, release or debug? Or does the choice depend on what I'm looking for?
I found CUDA - Visual Profiler and Control Flow Divergence, where it is stated that a debug (-G) build is needed to properly measure the divergent-branch metric, but I am not sure about other metrics.
Profiling usually implies that you care about performance.
If you care about performance, you should profile the release version of a CUDA code.
The debug version (-G) will generate different code, which usually runs slower. There's little point in doing performance analysis (including execution time measurement, benchmarking, profiling, etc.) on a debug version of a CUDA code, in my opinion, for this reason.
The -G switch turns off most optimizations that the device-code compiler would ordinarily make, which has a large effect on code generation and often a large effect on performance as well. Optimizations are disabled to facilitate debugging, which is the primary purpose of the -G switch and of a debug build of your code.
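For example, typical build invocations might look like this (illustrative file and target names; -lineinfo is the usual way to keep source-line correlation in the profiler without disabling optimizations):

nvcc -G -o app_debug app.cu
nvcc -lineinfo -o app_release app.cu

The first produces a debug build with device optimizations disabled; the second produces an optimized build that still carries line information for the profiler.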

Computation between two different kernels in Cuda [duplicate]

I'm writing a CUDA program but I'm getting the obnoxious warning:
Warning: Cannot tell what pointer points to, assuming global memory space
this is coming from nvcc and I can't disable it.
Is there any way to filter out warnings from third-party tools (like nvcc)?
I'm asking for a way to filter errors/warnings coming from custom build tools out of the output window log.
I had the same annoying warnings and found help on this thread: link.
You can either remove the -G flag from the nvcc command line,
or
change compute_10,sm_10 to compute_20,sm_20 in the CUDA C/C++ options of your project if you're using Visual Studio.
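For context, here is an illustrative (hypothetical) snippet of the kind of device code that triggers the warning on the old compute_10/sm_10 targets, where the compiler had to resolve each pointer to a specific memory space at compile time. This also explains both fixes: without -G the compiler can inline the device function and resolve each pointer's space, and compute_20/sm_20 and later use a unified address space, so the question never arises:

__device__ void scale(float *p, float s)
{
    /* 'p' is a generic pointer: on sm_1x the compiler must guess which
       memory space it points to, hence "assuming global memory space". */
    *p *= s;
}

__global__ void kernel(float *data)
{
    __shared__ float tile[32];
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    scale(&tile[threadIdx.x], 2.0f);  /* called with a shared-memory pointer */
    scale(&data[threadIdx.x], 2.0f);  /* called with a global-memory pointer */
}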