Profile debug or release CUDA code?

I have been profiling an application with nvprof and nvvp (5.5) in order to optimize it. However, I get totally different results for some metrics/events like inst_replay_overhead, ipc or branch_efficiency, etc. when I'm profiling the debug (-G) and release version of the code.
So my question is: which version should I profile, the release or the debug version? Or does the choice depend on what I'm looking for?
I found CUDA - Visual Profiler and Control Flow Divergence, where it is stated that a debug (-G) build is needed to properly measure the divergent branches metric, but I am not sure about other metrics.

Profiling usually implies that you care about performance.
If you care about performance, you should profile the release version of a CUDA code.
The debug version (-G) will generate different code, which usually runs slower. There's little point in doing performance analysis (including execution time measurement, benchmarking, profiling, etc.) on a debug version of a CUDA code, in my opinion, for this reason.
The -G switch turns off most optimizations that the device code compiler might ordinarily make, which has a large effect on code generation and also often a large effect on performance. The reason for the disabling of optimizations is to facilitate debug of code, which is the primary reason for the -G switch and for a debug version of your code.
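As a rough sketch (exact flags depend on your build setup, and my_app.cu is just a placeholder name), the two builds might look like this, with profiling done on the release build:

    nvcc -G -g -o my_app_debug my_app.cu      # debug: device code optimizations disabled
    nvcc -O3 -o my_app_release my_app.cu      # release: optimized device code
    nvprof ./my_app_release                   # profile the release build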

How do I detect the presence of nvprof from inside a CUDA program?

I have a small CUDA program that I want to profile with nvprof. The problem is that I want to write the program in such a way that:
1. When I run nvprof my_prog, it invokes cudaProfilerStart and cudaProfilerStop.
2. When I run my_prog on its own, it does not invoke either of these APIs, and therefore avoids any profiling overhead.
The problem hence becomes how to make my code aware of the presence of nvprof when it runs, without an additional command-line argument.
Have you measured and verified that the cudaProfilerStart/Stop calls introduce measurable overhead when nvprof is not attached? I highly doubt that this is the case.
If this is a problem, you can use #ifdef directives to exclude these calls from your release builds.
There is no way of detecting whether nvprof is running, since that would rather defeat the purpose of profiling: if the profiled application "senses" the profiler and changes its behavior, you are no longer measuring the program as it normally runs.
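A minimal sketch of the #ifdef approach (ENABLE_PROFILING and my_kernel are illustrative names, not anything mandated by CUDA):

    #include <cuda_profiler_api.h>

    __global__ void my_kernel() { }

    int main()
    {
    #ifdef ENABLE_PROFILING          // pass -DENABLE_PROFILING to nvcc for profiling builds only
        cudaProfilerStart();         // start collection when running under nvprof
    #endif
        my_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    #ifdef ENABLE_PROFILING
        cudaProfilerStop();          // stop collection
    #endif
        return 0;
    }

The release build simply omits -DENABLE_PROFILING, so the calls compile away entirely.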

Is there any difference in the output of nvvp (visual) and nvprof (command line)?

To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <metric_name> ./<application>
I also measured the same metrics on the Visual profiler nvvp. I noticed no difference in the values I get.
I noticed a difference in output when I choose a metric like achieved_occupancy. But this varies with every execution and that's probably why I get different results each time I run it, irrespective of whether I am using nvvp or nvprof.
The question:
I was under the impression that nvvp and nvprof are exactly the same, and that nvvp is simply a GUI built on top of nvprof for ease of use. However I have been given this advice:
Always use the visual profiler. Never use the command line.
Also, this question says:
I do not want to use the command line profiler as I need the global load/store efficiency, replay and DRAM utilization, which are much more visible in the visual profiler.
Apart from 'dynamic' metrics like achieved_occupancy, I never noticed any differences in results. So, is this advice valid? Is there some sort of deficiency in the way nvprof works? I would like to know the advantages of using the visual profiler over the command line form, if there are any.
More specifically, are there metrics for which nvprof gives wrong results?
Note:
My question is not the same as this or this because these are asking about the difference between nvvp and Nsight.
I'm not sure why someone would give you the advice:
Never use the command line.
assuming by "command line" you do in fact mean nvprof. That's not sensible. There are situations where it makes sense to use nvprof. (Note that if you actually meant the command line profiler, then that advice might be somewhat sensible, although still a matter of preference. It is separate from nvprof so has a separate learning curve. I personally would use nvprof instead of the command line profiler.)
nvvp uses nvprof under the hood to do all of its measurement work. However, nvvp may combine measured metrics in various interesting ways, e.g. to facilitate guided analysis.
nvprof should not give you "wrong results", and if it did for some reason, then nvvp should be equally susceptible to such errors.
Use of nvvp vs. nvprof may be simply a matter of taste or preference.
Many folks will like the convenience of the GUI. The nvvp GUI also offers a "Guided Analysis" mode which nvprof does not. I'm sure an exhaustive list of other differences could be put together by going through the documentation. But whatever nvvp does, it does using nvprof. It doesn't have an alternate method to query the device for profiler data -- it uses nvprof.
I would use nvprof when it's inconvenient to use nvvp, perhaps when I am running on a compute cluster node where it's difficult or impossible to launch nvvp. You might also use it if you are doing targeted profiling (measuring a single metric, e.g. shared_replay_overhead - nvprof is certainly quicker than firing up the GUI and running a session), or if you are collecting metrics for tabular output over a large series of runs.
In most other cases, I personally would use nvvp. The timeline feature itself is hugely more convenient than trying to assemble a sequence in your head from the output of nvprof --print-gpu-trace ... which is essentially the same info as the timeline.
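As a sketch of those command-line uses (my_app and the kernel name are placeholders):

    nvprof --metrics shared_replay_overhead ./my_app                        # targeted, single-metric collection
    nvprof --kernels my_kernel --metrics shared_replay_overhead ./my_app    # restrict collection to one kernel
    nvprof --print-gpu-trace ./my_app                                       # per-call trace, roughly the timeline data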

qemu performance same with and without multi-threading and inconsistent behaviour

I am new to the QEMU simulator. I want to emulate our existing pure C H.264 (video decoder) code on an ARM platform (Cortex-A9) using QEMU on Ubuntu 12.04, and I have done that successfully by following links available on the internet.
We also have multithreaded (pthreads) code in our application to speed up the process. If we enable multithreading, we get the same performance as the single-threaded build (i.e. without multithreading).
E.g. single thread: 9.75 sec
Multithread: 9.76 sec
Since QEMU is supposed to support parallel processing, we do not understand why we are not getting the performance gain.
The steps followed are:
1. Compile the code using the arm-linux-gnueabi toolchain.
2. Execute the code: qemu-arm -L executable
3. QEMU version: 1.6.1
Is there any option or setting that has to be enabled in QEMU to measure multithreaded performance? We want to see the difference between single-threaded and multithreaded runs using QEMU, since we do not have any ARM board with us.
Moreover, the multithreaded application hangs if we run it a third or fourth time, i.e. inconsistent behaviour in QEMU.
Can we rely on this QEMU simulation at all, given that it is not cycle accurate?
You will not be able to use QEMU to estimate real hardware speed.
Also, QEMU currently runs SMP guests in a single thread: your guest OS will see multiple CPUs but will not receive additional cycles, since all of the emulation occurs in a single thread.
Note that IO is delegated to separate threads, so if your VM is doing CPU and IO work, you will usually see at least 1.5+ cores on the host being used.
There has been a lot of research into parallelizing the CPU emulation in QEMU, but without much success. I suggest you buy some real hardware and run on that, especially considering that Cortex-A9 hardware is cheap these days.

Why is the initialization of the GPU taking so long on the Kepler architecture, and how can I fix this?

When running my application, the very first cudaMalloc takes 40 seconds, which is due to the initialization of the GPU. When I build in debug mode this reduces to 5 seconds, and when I run the same code on a Fermi device it takes far less than a second (not even worth measuring in my case).
Now the funny thing is that if I compile for this specific architecture, using the flag sm_35 instead of sm_20, it becomes fast again. As I should not use any new sm_35 features just yet, how can I compile for sm_20 without this huge delay? Also, I am curious what is causing the delay: is the machine code recompiled on the fly into sm_35 code?
P.S. I run on Windows, but a colleague of mine encountered the same problem, probably also on Windows. The device is a Kepler, driver version 320.
Yes, the machine code is recompiled on the fly. This is called the JIT-compile step, and it will occur any time the machine code does not match the device that is being used (and assuming valid PTX code exists in the executable.)
You can learn more about JIT-compile here. Note the discussion of the cache which should alleviate the issue after the first run.
If you specify compilation for both sm_20 and sm_35, you can build a binary/executable that will run quickly on both types of devices, and you will also get notification if you are using a sm_35 feature that is not supported on sm_20 (during the compile process).
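As a sketch (file and output names are placeholders), a fat binary covering both architectures can be built like this, so that sm_35 devices load native code instead of JIT-compiling sm_20 PTX at startup:

    nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -o my_app my_app.cu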

Dynamically detecting a CUDA-enabled NVIDIA card and only then initializing the CUDA runtime: how to do it?

I have an application which has an algorithm accelerated with CUDA. There is also a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time there won't be an NVIDIA card to run the accelerated CUDA code. What I want is to first check whether the user's system has a CUDA-enabled NVIDIA card and, if it does, initialize the CUDA runtime afterwards. If the system does not support CUDA, then I want to execute the CPU path. This question is very similar to mine, but I don't want to use any libraries other than the plain CUDA runtime. OpenCL is an alternative, but there isn't enough time to implement an OpenCL version of the algorithm for the first release. Without any CUDA existence check, the program will surely crash since it can't find the needed .dlls for the CUDA runtime, and we surely don't want that. So, I need advice on how to handle this initialization step.
Use the calls cudaGetDeviceCount and cudaGetDeviceProperties to find CUDA devices on the running system. First find out how many there are, then loop through all the available devices and inspect the properties to decide which ones qualify. What I mean by "qualify" depends on your application: do you want to require a certain compute capability? Or a certain amount of memory? If there's more than one device, you might want to sort on some criteria, then set the device with cudaSetDevice. If there are no devices, or none that are sufficient, then fall back on the CPU code path.
I'd also suggest having some mechanism to force CUDA mode off, in case some user's environment just doesn't work due to driver issues, or an old board, or something else. You can use a command-line option, or an environment variable, or whatever...
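A minimal sketch of that logic, assuming a compute capability 2.0 requirement purely as an example (the threshold and the function name initCuda are illustrative):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Returns true and selects a device if a usable CUDA GPU is found.
    bool initCuda()
    {
        int deviceCount = 0;
        // Fails (or reports zero devices) if no CUDA driver / device is present.
        if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0)
            return false;

        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
                continue;
            if (prop.major >= 2) {          // example "qualifies" test: compute capability >= 2.0
                cudaSetDevice(dev);
                return true;
            }
        }
        return false;
    }

    int main()
    {
        if (initCuda())
            printf("Using the GPU path\n");   // run the CUDA-accelerated algorithm here
        else
            printf("Using the CPU path\n");   // fall back to the standard CPU implementation
        return 0;
    }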
EDIT:
Regarding DLLs, you should package cudart[whatever].dll with your application. That will ensure that the program starts, and at least the CUDA query functions will operate.