Is there a way in NVIDIA Nsight Systems to limit threads displayed? - cuda

I have a project with thousands of threads, and I want to use Nsight Systems to profile the CUDA code. However, loading the report takes a while, which I believe is due to the large amount of per-thread information, in addition to all the visual clutter from threads I don't currently care about.
Is there a way to toggle collecting thread information, or to limit it before loading a report in the Nsight Systems GUI?

Is there a way to toggle collecting thread information?
If profiling through the CLI
Check the -s/--sample and --cpuctxsw options for the profile or start commands (see the documentation). You can set both to none to minimize the amount of information collected from the CPU side.
If profiling a Linux target: also check the -t/--trace option for the profile or launch commands. Essentially, you want to exclude osrt from the trace options; it is enabled by default.
If you want to collect only CUDA events, then you can use nsys profile -t cuda -s none --cpuctxsw=none <app>.
If profiling through the GUI
You can deselect the "Collect CPU IP/backtrace samples" and "Collect CPU context switch trace" boxes.
If profiling a Linux target: you can additionally deselect the "Collect OS runtime libraries trace" box.
Is there a way to limit it before loading a report in the Nsight System GUI?
If the data has been collected, it is not possible to exclude it from rendering in the GUI. You can minimize threads, or hide them by right-clicking on "Threads" -> "Show less".

Related

Dependency Analysis options in CUDA Profiler

I have implemented a program that uses a single GPU, using the cudaStreamWaitEvent() function to set dependencies between two streams via events.
In order to verify this dependency, is it possible to use the "Dependency Analysis" view in the NVIDIA Visual Profiler?
If not, what does each of the following options in the dependency analysis view provide?
Focus Critical Path
Highlight Execution Dependencies
Detailed information on those options doesn't seem to be available on the official NVIDIA website or here.
Yes, you should be able to use the dependency analysis feature to verify your usage of most CUDA synchronization APIs, including cudaStreamWaitEvent.
To use either of the two mentioned options, you must first have computed the dependencies in your application trace. To do that in NVIDIA Visual Profiler, select "Unguided Analysis" and then "Dependency Analysis".
Now you can enable "Highlight Execution Dependencies", which will highlight the incoming and outgoing dependencies for each analyzed activity on the timeline in red, once you hover over it or select it.
If you use cudaStreamWaitEvent to block one kernel until another kernel in another independent stream has finished, those will be highlighted in red if they are direct dependencies.
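For illustration, here is a minimal sketch (with hypothetical kernels and sizes, not taken from the question) of the kind of cross-stream dependency being described; this is the execution dependency the view can highlight once the analysis has been run:

    // Minimal sketch (hypothetical kernels): kernelB in stream2 waits on kernelA
    // in stream1 via an event recorded with cudaEventRecord.
    __global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *x) { x[threadIdx.x] *= 2.0f; }

    int main() {
        float *d_x;
        cudaMalloc(&d_x, 256 * sizeof(float));

        cudaStream_t stream1, stream2;
        cudaEvent_t event;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);
        cudaEventCreate(&event);

        kernelA<<<1, 256, 0, stream1>>>(d_x);
        cudaEventRecord(event, stream1);        // marks completion of kernelA in stream1
        cudaStreamWaitEvent(stream2, event, 0); // kernelB will not start until the event fires
        kernelB<<<1, 256, 0, stream2>>>(d_x);

        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }

With "Highlight Execution Dependencies" enabled, selecting kernelB on the timeline should then highlight kernelA as its incoming dependency.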

How to choose a non busy CUDA device?

I'm working on a cluster with a lot of nodes, and each node has two GPUs. On the cluster I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assigns me the same node for two different jobs, I end up with two tasks running on the same GPU.
My question is: is there a way to check at runtime whether the device is busy or not?
Thanks
Your cluster managers should install and use cluster management (job-scheduling) software that allows them to assign and track GPUs just like CPUs and memory. There are a number of job schedulers that can do this. Even without explicit GPU support in the job-scheduler, it's possible to build job entry/exit scripts that will assign GPUs properly.
You can effectively include the same functionality that nvidia-smi uses by embedding NVML in your applications. Any query or data item reported on by nvidia-smi can be accessed programmatically through NVML.
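As a sketch of what embedding NVML might look like (assuming the NVML device order matches the CUDA device order, which is not guaranteed on every system), the utilization query below reads the same numbers nvidia-smi reports; compile and link with -lnvidia-ml:

    // Hypothetical helper: pick the GPU with the lowest current utilization via NVML.
    // Error handling is omitted for brevity.
    #include <nvml.h>

    int pick_least_busy_gpu(void) {
        unsigned int count = 0, best = 0, best_util = 101;
        nvmlInit();
        nvmlDeviceGetCount(&count);
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlUtilization_t util;
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetUtilizationRates(dev, &util); // same utilization nvidia-smi reports
            if (util.gpu < best_util) { best_util = util.gpu; best = i; }
        }
        nvmlShutdown();
        return (int)best; // candidate index for cudaSetDevice()
    }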
It's also not clear to me why you could not launch a script for your job which checks which devices are busy using nvidia-smi, then picks an un-busy device.
But keep in mind that any runtime check you might do would be subject to the behavior of other applications. If those applications (whether launched by you or other users) have unusual behavior, your runtime check can easily be defeated.

'Flush records' warning in Parallel Nsight profiling results

I'm trying to profile my CUDA-Kernels running on a Windows 7 32 bit machine with a NVIDIA GTX 480 board. I'm using the CUDA 4.1 32 bit toolkit and the Parallel Nsight 2.1 edition for VS 2010.
The profiling results of my program always show the same warning on an irregular basis:
Message: Flush records, Event Type: Range, Level: 50
After this event there is always a processing break of several milliseconds. Then the GPU resumes computing at the speed it had before.
I haven't found any information about this warning in the CUDA documentation or on the web, and I don't even know if it is a problem that only occurs during profiling.
Has anyone an idea what this warning is about and how to avoid it?
The warning "Flush Record" is used to show when the Nsight CUDA Trace Activity is adding additional overhead to your application. This is to allow you to interpret periods of high CPU activity. There is no way to remove this warning. Your application is not doing anything wrong.
The Nsight CUDA Trace Activity collects timestamps for the start and end of GPU work including kernels launches, memory copies, and memory sets. When an application launches a task on the GPU the tool allocates a trace record for the task and programs the GPU to write a time stamp into the record. The collection of timestamps is done in a way that should not break concurrency and should not stall the CPU. When the work is completed the tools collects the information and streams it to memory. The Flush range includes the time to collect the results and write out the information. This can include time to perform additional kernel launches and copy memory from device to host. The tool will collect results when the application synchronizes a context (cuCtxSynchronize or cuda{Thread, Device}Synchronize) or when it runs out of trace records.
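As a rough sketch of that last point (placeholder kernel and arbitrary batch sizes), synchronizing after each batch of launches gives the trace activity a known place to collect its records, so the flush tends to land at a predictable point instead of wherever the tool happens to run out of trace records:

    // Placeholder kernel; the loop bounds are arbitrary. The explicit
    // cudaDeviceSynchronize() after each batch is a point where the trace
    // records can be collected.
    __global__ void busy(float *x) { x[threadIdx.x] += 1.0f; }

    int main() {
        float *d_x;
        cudaMalloc(&d_x, 256 * sizeof(float));
        for (int batch = 0; batch < 10; ++batch) {
            for (int i = 0; i < 1000; ++i)
                busy<<<1, 256>>>(d_x);
            cudaDeviceSynchronize(); // the tool can collect its timestamps here
        }
        cudaFree(d_x);
        return 0;
    }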
I will enter a bug to improve the user documentation and tool tips.

Is there any power mode selection feature for nvidia gpu?

Is there any power-mode selection feature for NVIDIA GPUs? For example, normal, high-performance, or power-save mode? If so, is it possible to select a power mode when I compile my CUDA program with nvcc? And is it possible to check the current power mode of my GPU?
I could not find any information about this, even though I have searched the web for quite a while.
Depending on your GPU, you may be able to use nvidia-smi or NVML (check the documentation for more information) to read the current power state. The GPU dynamically changes the power state to conserve power when idle and to provide performance when under load.
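For reading only, a small NVML sketch along those lines (error handling omitted; the power reading is not supported on every board, and you link with -lnvidia-ml):

    // Read the current performance state (P-state) and power draw of device 0.
    #include <nvml.h>
    #include <stdio.h>

    int main(void) {
        nvmlDevice_t dev;
        nvmlPstates_t pstate;
        unsigned int milliwatts = 0;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlDeviceGetPerformanceState(dev, &pstate); // P0 = maximum performance, higher numbers = lower power
        nvmlDeviceGetPowerUsage(dev, &milliwatts);   // may report "not supported" on some GPUs

        printf("P-state: P%d, power draw: %u mW\n", (int)pstate, milliwatts);
        nvmlShutdown();
        return 0;
    }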
As a user it is not possible to set the power state of the GPU. The Tesla product line does have power capping for the server products, but that is obviously not under user control either.

CUDA apps time out & fail after several seconds - how to work around this?

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they fail and exit. I realize it's ideal not to have a CUDA application run that long, but assuming that CUDA is the correct choice and that, due to the amount of sequential work per thread, it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious.
To disable it, use regedit to create the REG_DWORD value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck and set it to 1.
You may also need to do something in the NVIDIA control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
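In CUDA terms, a rough sketch of that chunking idea (placeholder kernel; the 1,000,000/100,000 split mirrors the example above): each launch covers a slice of the output and the host synchronizes between slices, so no single submission comes near the watchdog limit.

    __global__ void compute_slice(float *out, int offset, int count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            out[offset + i] = (float)(offset + i) * 0.5f; // stand-in for the real per-pixel work
    }

    int main() {
        const int total = 1000000, chunk = 100000, threads = 256;
        float *d_out;
        cudaMalloc(&d_out, total * sizeof(float));

        // 10 launches of 100,000 elements instead of one launch of 1,000,000.
        for (int offset = 0; offset < total; offset += chunk) {
            int count = (total - offset < chunk) ? (total - offset) : chunk;
            compute_slice<<<(count + threads - 1) / threads, threads>>>(d_out, offset, count);
            cudaDeviceSynchronize(); // drain the queue so each slice is its own submission
        }
        cudaFree(d_out);
        return 0;
    }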
You should not have to read all of your data back and forth across the PCIe bus on every time slice; you can leave your textures, etc. in GPU local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - Windows 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a higher amount, so that Windows allows a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the registry area HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers to create the new key. There will probably already be one value in there called DxgKrnlVersion, as a DWORD.
Right-click and create a new REG_DWORD value named TdrDelay. The value assigned to it is the number of seconds before TDR kicks in; it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC; the new value does not take effect until you reboot.
Source: Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and it works fine.
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with can complete in time; save all the state information and stop there; then start again from that point.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that, for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you switch to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from the CPU (including the option to abort), without the costly device<-->host memory transfers between iterations.
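A minimal sketch of that pattern (placeholder kernel and sizes):

    #include <cstdlib>

    __global__ void step(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 0.5f * x[i] + 1.0f; // stand-in for one iteration of real work
    }

    int main() {
        const int n = 1 << 20, iters = 100, threads = 256;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);          // 1. pass everything to the device
        for (int it = 0; it < iters; ++it) {
            step<<<(n + threads - 1) / threads, threads>>>(d, n); // 2. iterate on device-resident memory
            cudaDeviceSynchronize(); // CPU keeps control between iterations (could abort here)
        }
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);          // 3. copy back only after the last iteration

        cudaFree(d);
        free(h);
        return 0;
    }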
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM. It is possible to modify the settings (timeout, behaviour on reaching the timeout, etc.) with some registry keys; see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.