How do I detect the presence of nvprof inside a CUDA program? - cuda

I have a small CUDA program that I want to profile with nvprof. The problem is that I want to write the program in such a way that:
When I run nvprof my_prog, it will invoke cudaProfilerStart and cudaProfilerStop.
When I run my_prog on its own, it will not invoke either of those APIs, and therefore avoids any profiling overhead.
The problem therefore becomes how to make my code aware of the presence of nvprof at run time, without adding a command line argument.

Have you measured and verified that the cudaProfilerStart/Stop calls introduce measurable overhead when nvprof is not attached? I highly doubt that this is the case.
If this is a problem, you can use #ifdef directives to exclude these calls from your release builds.
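A minimal sketch of that approach (PROFILE_BUILD is a made-up macro name here; you would define it only for the build you intend to profile, e.g. with nvcc -DPROFILE_BUILD):

    #include <cuda_profiler_api.h>   // declares cudaProfilerStart / cudaProfilerStop

    int main()
    {
    #ifdef PROFILE_BUILD
        cudaProfilerStart();         // only compiled in when -DPROFILE_BUILD is given
    #endif

        // ... allocate, launch kernels, do the real work ...

    #ifdef PROFILE_BUILD
        cudaProfilerStop();
    #endif
        return 0;
    }

Combined with nvprof --profile-from-start off, profiling is then restricted to the region between the two calls, and the plain build contains no profiler API calls at all.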
There is no way of detecting whether nvprof is running; an application that "senses" the profiler and changes its behavior would rather defeat the purpose of profiling.

Related

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or whether this is expected behavior.
Note: The application also gives correct & consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information. So things like print statements in the kernel will show up multiple times. Could it be related to that or is it a value being calculated differently? One other issue is with Unified Memory (UVM) or zero copy memory Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help. It may be worth trying to see if anything changes.
I was able to resolve the issue by fixing my shared memory initialization. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
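In case it helps anyone with the same symptom, here is a rough sketch of the kind of fix involved (the kernel and the 256-bin size are hypothetical, not the original code): zero the shared memory before accumulating into it with atomicAdd, and synchronize around the initialization.

    __global__ void block_histogram(const int *in, int *out, int n)
    {
        __shared__ int bins[256];

        // Zero shared memory first; without this, every Nsight Compute replay
        // starts from whatever garbage happens to be left in shared memory.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            bins[i] = 0;
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&bins[in[idx] & 255], 1);   // now accumulating into initialized memory
        __syncthreads();

        // Flush the block-local bins to the global result.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&out[i], bins[i]);
    }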

Is there any difference in the output of nvvp (visual) and nvprof (command line)?

To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <<metric_name>>
I also measured the same metrics in the Visual Profiler, nvvp. For most metrics I noticed no difference in the values I get.
I did notice a difference when I chose a metric like achieved_occupancy. But that metric varies with every execution, which is probably why I get different results each time I run it, irrespective of whether I am using nvvp or nvprof.
The question:
I was under the impression that nvvp and nvprof are exactly the same, and that nvvp is simply a GUI built on top of nvprof for ease of use. However I have been given this advice:
Always use the visual profiler. Never use the command line.
Also, this question says:
I do not want to use the command line profiler as I need the global load/store efficiency, replay and DRAM utilization, which are much more visible in the visual profiler.
Apart from 'dynamic' metrics like achieved_occupancy, I never noticed any differences in results. So, is this advice valid? Is there some sort of deficiency in the way nvprof works? I would like to know the advantages of using the visual profiler over the command line form, if there are any.
More specifically, are there metrics for which nvprof gives wrong results?
Note:
My question is not the same as this or this because these are asking about the difference between nvvp and Nsight.
I'm not sure why someone would give you the advice:
Never use the command line.
assuming by "command line" you do in fact mean nvprof. That's not sensible. There are situations where it makes sense to use nvprof. (Note that if you actually meant the command line profiler, then that advice might be somewhat sensible, although still a matter of preference. It is separate from nvprof so has a separate learning curve. I personally would use nvprof instead of the command line profiler.)
nvvp uses nvprof under the hood in order to do all of its measurement work. However, nvvp may combine measured metrics in various interesting ways, e.g. to facilitate guided analysis.
nvprof should not give you "wrong results", and if it did for some reason, then nvvp should be equally susceptible to such errors.
Use of nvvp vs. nvprof may be simply a matter of taste or preference.
Many folks will like the convenience of the GUI. And the nvvp GUI offers a "Guided Analysis" mode which nvprof does not. I'm sure an exhaustive list of other differences could be compiled by going through the documentation. But whatever nvvp does, it does it using nvprof. It doesn't have an alternate method to query the device for profiler data -- it uses nvprof.
I would use nvprof when it's inconvenient to use nvvp, perhaps when I am running on a compute cluster node where it's difficult or impossible to launch nvvp. You might also use it if you are doing targeted profiling (measuring a single metric, e.g. shared_replay_overhead - nvprof is certainly quicker than firing up the GUI and running a session), or if you are collecting metrics in tabular form over a large series of runs.
In most other cases, I personally would use nvvp. The timeline feature itself is hugely more convenient than trying to assemble a sequence in your head from the output of nvprof --print-gpu-trace ... which is essentially the same info as the timeline.
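For instance (./my_app is a placeholder for your executable), targeted metric collection and a GPU trace from the command line would look something like:

    nvprof --metrics shared_replay_overhead ./my_app
    nvprof --print-gpu-trace ./my_app

The first collects just the one metric; the second prints the per-kernel trace that nvvp would display as a timeline.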

Profile debug or release cuda code?

I have been profiling an application with nvprof and nvvp (5.5) in order to optimize it. However, I get totally different results for some metrics/events like inst_replay_overhead, ipc or branch_efficiency, etc. when I'm profiling the debug (-G) and release version of the code.
So my question is: which version should I profile, the release or the debug version? Or does the choice depend on what I'm looking for?
I found CUDA - Visual Profiler and Control Flow Divergence, where it is stated that a debug (-G) build is needed to properly measure the divergent branches metric, but I am not sure about other metrics.
Profiling usually implies that you care about performance.
If you care about performance, you should profile the release version of a CUDA code.
The debug version (-G) will generate different code, which usually runs slower. There's little point in doing performance analysis (including execution time measurement, benchmarking, profiling, etc.) on a debug version of a CUDA code, in my opinion, for this reason.
The -G switch turns off most optimizations that the device code compiler might ordinarily make, which has a large effect on code generation and also often a large effect on performance. The reason for the disabling of optimizations is to facilitate debug of code, which is the primary reason for the -G switch and for a debug version of your code.
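As a concrete illustration (file and executable names are placeholders), the variants would typically be built along these lines:

    nvcc -o my_app my_app.cu                  # release: device optimizations on, profile this one
    nvcc -lineinfo -o my_app_prof my_app.cu   # release code plus source-line info for the profilers
    nvcc -G -g -o my_app_dbg my_app.cu        # debug: device optimizations disabled

If you want source correlation in the profiler without giving up optimized device code, -lineinfo is the usual middle ground rather than -G.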

Is it possible to kill a running CUDA kernel?

Let us say I have numerous CUDA kernels that I can ask the GPU to execute. I don't want to modify the kernel code in any way (to include a trap, for example).
Is there a way to kill such a running kernel?
I intend to auto generate kernels (Genetic Programming). These kernels are likely to have behavior where they may take a very long time to complete. If I can kill a kernel while it is running I can maintain a timer and kill if required.
cudaDeviceReset() will kill any running kernel(s).
It will also wipe out any allocations done on the device, so you will need to re-allocate any data areas if you intend to use them again.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
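A rough sketch of the kind of host-side watchdog this enables (the kernel, timeout, and polling interval are made up for illustration): launch the kernel asynchronously, poll an event for completion, and reset the device if a wall-clock deadline passes.

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <cuda_runtime.h>

    // Stand-in for an auto-generated kernel that may never terminate.
    __global__ void generated_kernel(int *stop)
    {
        while (atomicAdd(stop, 0) == 0) { }   // spins until *stop becomes nonzero (it never does here)
    }

    int main()
    {
        int *d_stop = nullptr;
        cudaMalloc(&d_stop, sizeof(int));
        cudaMemset(d_stop, 0, sizeof(int));

        cudaEvent_t done;
        cudaEventCreate(&done);

        generated_kernel<<<1, 1>>>(d_stop);   // kernel launches are asynchronous
        cudaEventRecord(done);

        auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
        while (cudaEventQuery(done) == cudaErrorNotReady) {
            if (std::chrono::steady_clock::now() > deadline) {
                printf("Kernel exceeded the time limit, resetting device\n");
                cudaDeviceReset();            // kills the running kernel and wipes all device allocations
                return 1;                     // per the caveat above, exit rather than reuse the context
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }

        cudaFree(d_stop);
        cudaEventDestroy(done);
        return 0;
    }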

Can GPU counters be read transparently to the application code

I am trying to profile the CUDA rodinia benchmarks executing on a GTX 650.
I am using the code /usr/local/cuda-5.0/extras/CUPTI/samples/event_sampling to read the instructions-executed counter. It seems strange that I do not see any change in the values reported by event_sampling whether I am executing the CUDA benchmarks or not. The event_sampling code also has some calculations of its own for which it measures the instructions executed. Unlike on the CPU, do I need to make changes to the source code of the application to be able to read GPU counters such as instruction_executed?
CUPTI will only give you counter updates for kernels launched in the same process. You can, however, get some of these values, though not to the same level of precision, with the NVIDIA Visual Profiler or the related environment variables, without modifying the code.
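For example, without touching the benchmark's source at all, something along these lines (the executable name is a placeholder) reports the inst_executed event per kernel launch:

    nvprof --events inst_executed ./rodinia_benchmark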