CUDA profiling speed concern

When I build an executable using nvcc, I can, by default, profile it using nvprof or the NVIDIA Visual Profiler GUI. My concern is that, even when I am not actually profiling it, my executable may not be running optimally, because it is 'able' to record or emit profiling information. It feels as though executables are built with profiling enabled by default.
It is strange that this question has not been asked before, and the answer is not obvious to me. Is there a compiler option to disable profiling, especially for release mode? Or is profiling completely free?

Is there a compiler option to disable profiling, especially for release mode? Or is profiling completely free?
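There is no compiler option to disable profiling, and none is needed. The ability to be profiled is completely free: nvcc does not build profiling instrumentation into the executable, so a run that is not under nvprof or the Visual Profiler pays no profiling cost.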


Does KVM or QEMU have out-of-order execution?

Does KVM have an out-of-order execution feature? If not, can we implement one to increase virtual machine performance, or will the underlying processor take care of it?
If the question is whether QEMU-KVM emulates out-of-order (OOO) execution, then no, it doesn't. QEMU can emulate an instruction set architecture (so you could run ARM code on x86, for example), but not at the level of instruction reordering. Doing this in software would probably only add extra overhead.
On the other hand, if you run native code inside a VM (an x86 binary on x86, but virtualized), then all unprivileged instructions are executed exactly as they would be on bare metal. So if your CPU can execute out of order, it will also do so for the code of your VM. How privileged processor instructions are executed depends on whether you are using the KVM module alongside QEMU.
If you think your QEMU is too slow, check whether the KVM module is being used: add the -enable-kvm argument to the QEMU command line. Also make sure your processor has virtualization support.
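Both checks can be scripted. The following is only a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo lists CPU flags (vmx for Intel VT-x, svm for AMD-V) and where a loaded KVM module exposes /dev/kvm. The function names are made up for illustration.

```python
import os
import re


def cpu_virtualization_flags(cpuinfo_text):
    """Return the hardware-virtualization flags found in /proc/cpuinfo text.

    'vmx' indicates Intel VT-x and 'svm' indicates AMD-V.
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux, each logical CPU has a line starting with "flags".
        if line.startswith("flags"):
            found |= set(re.findall(r"\b(vmx|svm)\b", line))
    return found


def kvm_device_available(path="/dev/kvm"):
    """Return True if the KVM device node exists (assumed to mean the module is loaded)."""
    return os.path.exists(path)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not a Linux host, or /proc is unavailable
    print("virtualization flags:", sorted(cpu_virtualization_flags(text)))
    print("/dev/kvm present:", kvm_device_available())
```

If no flag is reported, virtualization support may be absent or disabled in the firmware settings; if /dev/kvm is missing, QEMU is most likely not able to use KVM acceleration.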

Can't debug CUDA: CUDA dynamic parallelism debugging is not supported in preemption mode

I have CUDA 5.5, the latest drivers, and Nsight studio 3.1 for VC2010 on 64-bit Windows 7.
The target machine has a headless Titan card and another simple NVIDIA card, to which the monitor is connected.
I'm trying to debug my CUDA code, which includes some dynamic parallelism. Whenever I click "Start CUDA Debugging" in VC, I get this error from Nsight Monitor: CUDA dynamic parallelism debugging is not supported in preemption mode. From what little I found regarding this issue, this happens when you try to debug CUDA on the same device that drives the screen. That is not the case here: as I mentioned, I have a separate card to drive the screen.
I went even further with this, disconnected the monitor from the second card as well, rebooted, and set up remote debugging from a different machine. Same result.
Does anyone have an idea how to tackle this?
Right-click the Nsight Monitor tray icon and open "Options\CUDA\Debugger". By default, all GPUs except TCC GPUs are forced to use software preemption.
You can set "Desktop GPUS must use Software Preemption" and "Headless GPUs must use software preemption" to false. Also make sure that in Visual Studio the setting "Nsight\Options\CUDA\Preemption Preference" is "Prefer no Software Preemption".

Debugger in CUDA 5

NVIDIA has released an extended Eclipse for CUDA 5. They also have an Nsight plugin for VS2010. In VS2010 we can stop program execution at a breakpoint in a kernel, but how can this be done in Eclipse on Linux? I don't see any Nsight-specific keys to stop execution. I tried changing the perspective, but it debugs as a normal C/C++ application. I'm using a Tesla C2070 in an 8-core Intel Xeon machine running Linux.
I'm from the Nsight Eclipse Edition team.
Our goal is specifically for the application to be debugged as a normal C/C++ application. This means that you can set breakpoints, use "run to line", etc. regardless of whether you debug host or device code.
Basically, the process is quite standard for Eclipse:
1. Create a project (you can also import an existing executable).
2. Click the debug button.
3. The debugger will run and, by default, break in the main function. Note that no device code is running on the device at this point, so you will only see the host thread.
4. Set a breakpoint in the device code and hit resume (note that the Breakpoints view toolbar also lets you set a breakpoint on any CUDA kernel launch).
5. The debugger will break when device code reaches the breakpoint. You can inspect your application state using the visual debugger UI.
I changed a couple of things, and I'm not sure which one solved the issue. I updated the drivers to the latest ones with RC5.0, and I chose to run a VNC server instead of a native X server. The CUDA card(s) are now dedicated to my apps and debugging. It works like a charm, and the machine is now accessible from everywhere.
Eugene,
I just installed CUDA 5, and I wasn't able to break in any kernel code. It was a clean install of CentOS 5.5 with a fresh download of CUDA 5, and I am running on an ASUS G71x laptop that has a GTX 260M installed.
I thought that maybe you still can't run the display and debug on one device, so I switched to a non-NVIDIA X display, but I still had the same issue: I can't stop in the kernel code.
Have you tried CUDA 5.0 RC1? It is available now, so you can download and try it. I have tried the Nsight included in it, and it works well for debugging.
Best regards!
The 304.43 NVIDIA Driver does not let users other than root debug their CUDA application.
That problem is not present in any past or future public releases. The CUDA documentation recommends using only drivers listed in the CUDA DevZone. The 304.43 driver is not one of them.
That may or may not be the issue you are hitting. But I thought it was worth mentioning.

CUDA Visual Profiler doesn't generate timeline

I'm trying to determine where a slowdown is occurring in my GPU code. I've verified that the code runs correctly on its own (it doesn't throw any errors, outputs are correct, finishes cleanly, etc). When I try to profile the code in Visual Profiler, it seems to run normally, dumping correct intermediate outputs to stdout. The GPU is being used (I've checked with cuda-gdb and dumping printf()s from inside my kernels). Once all the code has completed, Visual Profiler reports that viper has terminated the executable. However, no timeline is generated. Instead, the main window shows 0, 10, 20, 25 microseconds all "collapsed" on top of one another. When I tell the Visual Profiler to run all analysis options, it proceeds through the 24 runs without problems, but still no timeline is generated.
I'm using CUDA 4.2, driver version 295.41 on Ubuntu x86_64 with a GeForce 460.
When the visual profiler fails to generate a timeline it is typically because it cannot locate a component required for profiling. This component is a shared library found in /usr/local/cuda/lib64 called libcuinj.so. Is that path on your LD_LIBRARY_PATH? How are you launching the Visual Profiler? The script in /usr/local/cuda/bin/nvvp should set the path correctly for you.
The 4.2 version of the visual profiler does not do a good job of reporting errors when this shared library is not found. The upcoming 5.0 version of the visual profiler has much better error reporting in this regard.
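To check whether the library is actually reachable, you could scan the directories on LD_LIBRARY_PATH yourself. Below is a minimal sketch. The helper name is hypothetical, and matching on the prefix "libcuinj" is an assumption made so that versioned or suffixed file names are also caught.

```python
import os


def dirs_containing(prefix, search_path):
    """Return the directories in a PATH-style string that hold a file starting with prefix."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries and entries that are not existing directories.
        if not directory or not os.path.isdir(directory):
            continue
        if any(name.startswith(prefix) for name in os.listdir(directory)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    found = dirs_containing("libcuinj", path)
    if found:
        print("profiler injection library found in:", found)
    else:
        print("no libcuinj* on LD_LIBRARY_PATH; try adding /usr/local/cuda/lib64")
```

An empty result when run from the same shell that launches the Visual Profiler would be consistent with the missing-timeline symptom described above.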
I don't know if it's the same under Linux, but in Nsight under Windows there are two basic types of profiling that you can run: "Application trace" and "Profile". You only get timelines under Application trace, which records the timestamps at which CUDA calls and kernel launches were made. The Profile setting offers options to analyze the kernels. It reads the hardware counters from the GPU and generates performance information for one or more kernels (and no timelines).

GPGPU CUDA debug server

I have access to a server machine with 3 CUDA-enabled GPUs in it, and I would like to use NVIDIA Parallel Nsight to debug remotely on the machine.
This works just fine.
Now, is it possible to start another debug session (possibly by another developer) on the same machine, but on another GPGPU?
Is it possible to do this if I use gdb on Linux?
Thanks,
krisy
Krisy, yes this is possible.
However, the scenario you mentioned has not been actively tested internally by the Nsight team yet. I tried this out real quick on a system with a setup similar to the one you mentioned, and I was able to debug 2 different instances of a CUDA app simultaneously (provided each app runs on a different, unique device that is not connected to any output display).
The stability of this is not guaranteed. From what I've tried so far, this worked for me and it should work in theory as well but there were instances where I experienced sluggish behavior on my system.
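One way to keep each application on its own device is the CUDA_VISIBLE_DEVICES environment variable, which limits the GPUs a CUDA process can see. This is only a sketch of that idea, not something confirmed for Nsight in this thread. The helper name and the commented-out launch command are hypothetical.

```python
import os


def env_for_device(device_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to a CUDA process."""
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime only enumerates the devices listed in this variable.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


if __name__ == "__main__":
    # Hypothetical usage: each developer launches a debugger pinned to a different GPU, e.g.
    #   subprocess.Popen(["cuda-gdb", "./my_app"], env=env_for_device(1))
    print(env_for_device(1)["CUDA_VISIBLE_DEVICES"])
```

Inside the launched process the selected GPU is renumbered as device 0, so the application itself does not need to change its device selection.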
For other developers who are interested to know more about this, please take a look at: http://forums.nvidia.com/index.php?showtopic=201211