Same as this question but for caffe. I want a command I can put in my python script to check if gpu is utilized.
I have checked nvidia-smi while my model is running and I see that python is recognized as a process but Usage is N/A.
I also tried running the caffe.set_mode_cpu() command thinking that the times would be very different but the times with the command and without where the same.
I would like to suggest the use of GPUSTAT. You can query the GPU and check if your process is in the list returned by the command.
It is simple, not too elegant but it works.
Related
I want to benchmark guest instructions per second of QEMU to compare it with other simulators.
How to obtain the guest instruction count? I'm interested both in user and full system mode.
The only solutions I have now would be to log all instructions with either simple trace exec_tb or -d in_asm: How to use QEMU's simple trace backend? and then count the instructions from there. But this would likely considerably reduce simulation performance due to the output operations, so I would likely have to run the test program twice, one with and another without the trace, and hope that both executions are similar (should be, especially for single threaded user mode simulation).
I saw the -icount option, which sounds promising from the name, but when I passed it to QEMU 4.0.0, I didn't see anything happen. Should it print an instruction count somewhere? The following patch appears unmerged and suggests not: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01275.html
Basic Profiling
To follow up on Peter's answer, I have recently run into a situation where I wanted to get the instruction count of a program run under QEMU (I'm using v4.2.0, the first where plugins became available).
One of the example plugins, insn.c, does exactly what you want, and returns the count of executed instructions on plugin exit.
(I assume you already know how to run QEMU, so I'll strip this down to the important flags)
qemu-system-arm ... -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin
The first part loads the plugin and passes a single argument, "inline" to it. The next part enables printing of the plugin. You can redirect the plugin output to a different file by adding -D filename to the command line invocation.
More Advanced Profiling
When I was looking for possible ways to profile a program run under QEMU, this is one of the only results of my search that was promising. In the spirit of creating a good record for other searching in the future, here are some links to code that I have written to do just that.
Profiling Plugin code, docs.
Disclaimer: I wrote the above code.
Current released versions of QEMU don't provide any means for doing this. The upcoming "TCG plugin" support which should go out in the 4.2 release at the end of the year would allow you to write a simple "count the instructions executed" plugin, but this (as with the -d tracing) will add an overhead.
The -icount option is certainly confusing, but what it does is make the emulated CPU (try to) run at a specific number of executed instructions per second, as opposed to the default of "as fast as possible". This has higher overhead (and it will stop QEMU using multiple host threads for SMP guests), but is more deterministic.
Philosophically speaking, "instructions per second" is a rather misleading metric for emulators, because the time taken to execute an instruction can vary vastly compared to hardware. Loads and stores are rather slower than on real hardware. Floating point instructions are incredibly slow (perhaps a factor of 10 or worse of an integer arithmetic instruction, where real hardware could execute both in one cycle). JIT emulators like QEMU have a start-stop performance profile where execution stops entirely while we translate a block of code, whereas a real CPU or an interpreting emulator will not have these pauses. How much effect the JIT time has will depend on whether your code reruns previously translated hot code frequently or if it spends most of its time running "new" code, and whether it does things that result in the JIT having to discard the old code (eg self modifying code, or frequent between-process context switches). If you had an "IPS meter" on your emulator you'd see the value it reported fluctuate wildly as the guest code executed and did different things. You're probably better off just picking a benchmark which you think is representative of your actual use case, running it on various emulators, and comparing the wall-clock time it takes to complete.
i've cloned a "PointPillars" repo for 3D detection using just point cloud as input. But when I came to run it, I noted it use cuda and numba. With any prior knowledge about these two, I'm asking if there is any way to remove or disable numba and cuda. I want to run it on local server with CPU only, so I want your advice to solve.
The actual code matters here.
If the usage is only of vectorize or guvectorize using the target=cuda parameter, then "removal" of CUDA should be trivial. Just remove the target parameter.
However if there is use of the #cuda.jit decorator, or explicit copying of data between host and device, then other code refactoring would be involved. There is no simple answer here in that case, the code would have to be converted to an alternate serial or parallel realization via refactoring or porting.
The problem I'm trying to solve. Most of our command line apps, when run from Visual Studio, we like to force the user to hit a key to exit, so we can see the output in Visual Studio while we're debugging.
This doesn't work at all with profiling. One way to fix that would be to determine if the profiler was running or not.
The API to the CUDA profiler is rather limited:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER.html
It appears to support:
Initialization cudaProfilerInitialize
Starting cudaProfilerStart
Stopping cudaProfilerStop
But no way to determine if it's actually running?
Well, an ugly and surely sub-optimal solution is just searching for nvprof among the running processes...
On Linux, you can do this with readproc():
#include <proc/readproc.h>
proc_t* readproc(PROCTAB *PT, proc_t *return_buf);
For more information on how to use the functions in readproc.h, have a look at:
How does the ps command work?
on SuperUser.com, and particularly at this answer.
Note: Don't forget nvprof might be running but not profiling your process.
To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <<metric_name>>
I also measured the same metrics on the Visual profiler nvvp. I noticed no difference in the values I get.
I noticed a difference in output when I choose a metric like achieved_occupancy. But this varies with every execution and that's probably why I get different results each time I run it, irrespective of whether I am using nvvp or nvprof.
The question:
I was under the impression that nvvp and nvprof are exactly the same, and that nvvp is simply a GUI built on top of nvprof for ease of use. However I have been given this advice:
Always use the visual profiler. Never use the command line.
Also, this question says:
I do not want to use the command line profiler as I need the global load/store efficiency, replay and DRAM utilization, which are much more visible in the visual profiler.
Apart from 'dynamic' metrics like achieved_occupancy, I never noticed any differences in results. So, is this advice valid? Is there some sort of deficiency in the way nvprof works? I would like to know the advantages of using the visual profiler over the command line form, if there are any.
More specifically, are there metrics for which nvprof gives wrong results?
Note:
My question is not the same as this or this because these are asking about the difference between nvvp and Nsight.
I'm not sure why someone would give you the advice:
Never use the command line.
assuming by "command line" you do in fact mean nvprof. That's not sensible. There are situations where it makes sense to use nvprof. (Note that if you actually meant the command line profiler, then that advice might be somewhat sensible, although still a matter of preference. It is separate from nvprof so has a separate learning curve. I personally would use nvprof instead of the command line profiler.)
nvvp uses nvprof under the hood, in order to do all of its measurement work. However nvvp may combined measured metrics in various interesting ways, e.g. to facilitate guided analysis.
nvprof should not give you "wrong results", and if it did for some reason, then nvvp should be equally susceptible to such errors.
Use of nvvp vs. nvprof may be simply a matter of taste or preference.
Many folks will like the convenenience of the GUI. And the nvvp GUI offers a "Guided Analysis" mode which nvprof does not. I'm sure there could be created an exhaustive list of other differences if you go through the documentation. But whatever nvvp does, it does it using nvprof. It doesn't have an alternate method to query the device for profiler data -- it uses nvprof.
I would use nvprof when it's inconvenient to use nvvp, perhaps when I am running on a compute cluster node where it's difficult or impossible to launch nvvp. You might also use it if you are doing targetted profiling (measuring a single metric, e.g. shared_replay_overhead - nvprofis certainly quicker than firing up the GUI and running a session), or if you are collecting metrics for tabular generation over a large series of runs.
In most other cases, I personally would use nvvp. The timeline feature itself is hugely more convenient than trying to assemble a sequence in your head from the output of nvprof --print-gpu-trace ... which is essentially the same info as the timeline.
Im trying to benchmark my CUDA application with Compute Visual Profiler. However the program is unable to fill any data in the .csv files. All the paths to CUDA are set properly in the profiler application.
After few runs on the exe file it returns the error:
Error in Profiler data file
'C:/..../temp_compute_profiler_0_0.csv'
at line number 1. No column found.
There are many possible reasons... some of them to check for
the execution time out. make sure that the profiler is not set to time out too soon
the program doesn't finish executing (even if the kernel does). make sure there isn't a getchar at the end of your code
try adding an explicit call to cudaThreadExit at the end of your code, and check for errors.
One of the most common reason for that kind of error is that your program never manages to launch a CUDA kernel or that it failed during its execution.