Is there a command on Linux to flush or reset Nvidia GPUs? - cuda

I have a small program that generates an image using cuda and I'm attempting to run the program several times in a row. It works for around 30 invocations but always hangs somewhere between invocation 30 and 40. It doesn't error out - the program just hangs. I'm invoking the program using exponential backoff - sometimes the second invocation succeeds, but eventually I get to a point where the max number of invocations is reached and my script terminates.
I'm wondering if there's a way to flush or reset the GPU between program invocations? The only thing I've done with success so far is rebooting the computer after the hang and starting from where I left off, which works, but I'm hoping there's a quicker solution. Any tips? Thanks!
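A possible starting point, assuming a driver recent enough to ship nvidia-smi with its reset option, root access, and no process still holding the GPU, is the driver-level reset:
sudo nvidia-smi --gpu-reset -i 0
Whether this helps with the particular hang above depends on what state the GPU is stuck in.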

Related

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet, the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or if this is expected behavior.
Note: The application also gives correct & consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue is with Unified Memory (UVM) or zero-copy memory: Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help. It may be worth trying to see if anything changes.
I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
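To make that concrete, here is a minimal sketch of the kind of fix being described (the kernel and names here are hypothetical, not from the original code): zero the shared memory and synchronize before any thread does an atomicAdd into it.

__global__ void histogram256(const int *data, int n, int *bins)
{
    __shared__ int s_bins[256];

    // initialize shared memory before anyone accumulates into it
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&s_bins[data[idx] & 255], 1);
    __syncthreads();

    // fold the block-local counts into the global result
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], s_bins[i]);
}

Without the zeroing loop and the first __syncthreads(), each replay starts from whatever happened to be left in shared memory, which is exactly the kind of run-to-run difference described above.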

Why does my CPU seem to lose the ability to decode?

I ran into this problem while finishing a lab for my OS course. We are trying to implement a kernel with system call support (platform: QEMU/i386).
When testing the kernel, a problem occurred: after the kernel loads the user program into memory and switches the CPU from kernel mode to user mode using the 'iret' instruction, the CPU behaves strangely, as follows.
The %EIP register increases by 2 each time, no matter how long the current instruction is.
No instruction seems to be executed, since no other registers change in the meantime.
Your guest has probably ended up executing a block of zeroed-out memory. In i386, zeroed memory disassembles to a succession of "add BYTE PTR [eax],al" instructions, each of which is two bytes long (0x00 0x00), and if eax happens to point to memory which reads as zeroes, this is effectively a two-byte no-op, which corresponds to what you are seeing. This might happen because you set up the iret incorrectly and it isn't returning to the address you expected, or because you've got the MMU setup wrong and the userspace program isn't in the memory where you expect it to be, for instance.
You could confirm this theory using QEMU's debug options (e.g. -d in_asm,cpu,exec,int,unimp,guest_errors -D qemu.log will log a lot of execution information to a file), which should (among a lot of other data) show you what instructions it is actually executing.
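As a quick way to see what all-zero bytes decode to (assuming binutils is installed; the file name is arbitrary):
dd if=/dev/zero of=zeros.bin bs=32 count=1
objdump -D -b binary -m i386 -M intel zeros.bin
Every 0x00 0x00 pair should show up as add BYTE PTR [eax],al, matching the pattern described above.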

How to count the number of guest instructions QEMU executed from the beginning to the end of a run?

I want to benchmark guest instructions per second of QEMU to compare it with other simulators.
How can I obtain the guest instruction count? I'm interested in both user and full system mode.
The only solutions I have now would be to log all instructions with either the simple trace exec_tb event or -d in_asm (see: How to use QEMU's simple trace backend?) and then count the instructions from there. But this would likely reduce simulation performance considerably due to the output operations, so I would likely have to run the test program twice, once with and once without the trace, and hope that both executions are similar (they should be, especially for single-threaded user-mode simulation).
I saw the -icount option, which sounds promising from the name, but when I passed it to QEMU 4.0.0, I didn't see anything happen. Should it print an instruction count somewhere? The following patch appears unmerged and suggests not: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01275.html
Basic Profiling
To follow up on Peter's answer, I recently ran into a situation where I wanted to get the instruction count of a program run under QEMU (I'm using v4.2.0, the first version where plugins became available).
One of the example plugins, insn.c, does exactly what you want, and returns the count of executed instructions on plugin exit.
(I assume you already know how to run QEMU, so I'll strip this down to the important flags)
qemu-system-arm ... -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin
The first part loads the plugin and passes a single argument, "inline", to it. The next part enables printing of plugin output. You can redirect the plugin output to a different file by adding -D filename to the command-line invocation.
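For the user-mode case mentioned in the question, my understanding is that the user-mode binaries accept the same plugin flags, so something along these lines should work (untested here; the path and program name are placeholders):
qemu-x86_64 -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin ./my_program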
More Advanced Profiling
When I was looking for possible ways to profile a program run under QEMU, this was one of the only promising results of my search. In the spirit of creating a good record for others searching in the future, here are some links to code that I have written to do just that.
Profiling Plugin code, docs.
Disclaimer: I wrote the above code.
Current released versions of QEMU don't provide any means for doing this. The upcoming "TCG plugin" support, which should go out in the 4.2 release at the end of the year, would allow you to write a simple "count the instructions executed" plugin, but this (as with the -d tracing) will add overhead.
The -icount option is certainly confusing, but what it does is make the emulated CPU (try to) run at a specific number of executed instructions per second, as opposed to the default of "as fast as possible". This has higher overhead (and it will stop QEMU using multiple host threads for SMP guests), but is more deterministic.
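For reference, the option takes a shift parameter, for example:
qemu-system-arm ... -icount shift=7
which (per the QEMU documentation) makes the virtual CPU execute one instruction every 2^7 ns of virtual time; shift=auto picks the value dynamically. It does not print an instruction count anywhere.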
Philosophically speaking, "instructions per second" is a rather misleading metric for emulators, because the time taken to execute an instruction can vary vastly compared to hardware. Loads and stores are rather slower than on real hardware. Floating point instructions are incredibly slow (perhaps a factor of 10 or worse compared to an integer arithmetic instruction, where real hardware could execute both in one cycle). JIT emulators like QEMU have a start-stop performance profile where execution stops entirely while we translate a block of code, whereas a real CPU or an interpreting emulator will not have these pauses. How much effect the JIT time has will depend on whether your code reruns previously translated hot code frequently or spends most of its time running "new" code, and whether it does things that result in the JIT having to discard the old code (e.g. self-modifying code, or frequent between-process context switches). If you had an "IPS meter" on your emulator you'd see the value it reported fluctuate wildly as the guest code executed and did different things. You're probably better off just picking a benchmark which you think is representative of your actual use case, running it on various emulators, and comparing the wall-clock time it takes to complete.

CUDA iterative program stops unexpectedly - runs only after the PC is restarted each time

I have an iterative CUDA program which iteratively computes new values as required.
It is confidential code so I can't share it, but I want to discuss the problem.
The iterative program runs properly on my PC when I work with small amounts of data.
I have proper allocation and deallocation code.
No matter how many times I run the program, it runs properly with small amounts of data.
But with huge amounts of data, it runs properly once but not multiple times, giving the error "****.exe has stopped working.....".
The same error persists until I restart the PC, each time.
It is not feasible for me to restart the PC each time just to start the program. So what might be the reason behind it?
Most likely a memory error.
You should try running cuda-memcheck; this will make any memory errors obvious.
Other options include using error handling within your code, which would catch the problems as they arise.
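To illustrate the error-handling suggestion (a generic sketch, not the poster's code), a common pattern is to wrap every CUDA runtime call and to check both the launch and the completion of each kernel:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with a readable message as soon as any CUDA call fails
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// usage after a kernel launch:
//   myKernel<<<grid, block>>>(args...);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution

Run like this, an allocation failure or illegal access with the large data set should show up as a clear error message instead of the generic "stopped working" dialog.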

CUDA apps time out & fail after several seconds - how to work around this?

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit. I realize it's not ideal to have a CUDA application run that long, but assuming that CUDA is the correct choice and that, due to the amount of sequential work per thread, it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious.
To disable it, open regedit, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD named DisableBugCheck, and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
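In CUDA the natural equivalent is simply to issue several smaller kernel launches instead of one long one, since the watchdog applies to each launch individually. A hypothetical host-side sketch (renderChunk, d_image and numPixels are placeholders):

// render the image in slices so that no single launch runs long enough
// to trip the watchdog
const int chunk = 100000;                      // pixels per launch
for (int offset = 0; offset < numPixels; offset += chunk) {
    int count = (numPixels - offset < chunk) ? (numPixels - offset) : chunk;
    renderChunk<<<(count + 255) / 256, 256>>>(d_image, offset, count);
}
cudaDeviceSynchronize();                       // wait for the final slice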
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a higher amount, so that Windows will allow a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the correct registry key area to create the new key: HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > GraphicsDrivers.
There will probably be one value in there already, called DxgKrnlVersion, as a DWORD.
Right-click and select to create a new REG_DWORD value, and name it TdrDelay. The value assigned to it is the number of seconds before TDR kicks in - it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC. You need to restart the PC before the value will take effect.
Source: Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and it works fine.
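If you prefer a command line to clicking through Regedit, the same value can be created from an elevated prompt (the 8-second delay here is just an example); a reboot is still required:
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 8 /f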
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then start again from that point.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you switch to a non-X full-screen terminal (virtual console).
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to the device.
2. Run iterative versions of the algorithm, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally, transfer memory back to the host only after all iterations have ended.
This enables control over the iterations from the CPU (including the option to abort), without the costly device<-->host memory transfers between iterations.
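A skeleton of that pattern (the kernel, sizes and stop condition are placeholders, not taken from the answer above):

// copy the inputs to the device once
cudaMemcpy(d_state, h_state, bytes, cudaMemcpyHostToDevice);

for (int it = 0; it < maxIterations && !shouldAbort; ++it) {
    stepKernel<<<grid, block>>>(d_state, n);   // operates entirely on device memory
    cudaDeviceSynchronize();                   // each launch is short; the CPU can abort here
}

// copy the result back once, after the last iteration
cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);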
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X configuration (likely /etc/X11/xorg.conf).
Adding Option "Interactive" "0" to the Device section of your GPU does the job.
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.
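For reference, a minimal Device section with that option might look like this (the Identifier, and whether a BusID is needed, will differ per system):

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Interactive" "0"
EndSection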