Does anyone know what the limit is on the maximum number of breakpoints that can be set with Puppeteer? I plan to collect all stack traces when the debugger hits these breakpoints, via the Debugger.paused event.
Presumably it is limited only by hardware resources, as suggested in https://stackoverflow.com/a/8224612/11527186.
I've noticed that, on a host with two working CUDA SM_2.x devices, the first of which is running the display, calling cudaSetDevice(1) in the debugger throws CUDA error 10 (invalid device). It works fine when executed outside of the debugger, however. I also note that the device which normally has ID 1 has device ID 0 inside the debugger.
Are my suspicions confirmed that device ID 0 is assigned only to the first available device, rather than the device installed in the first PCIe slot?
If so, is there a way of ensuring that e.g. cudaSetDevice(1) always selects the same device, irrespective of how CUDA assigns device IDs?
The really short answer is, no, there is no way to do this. Having said that, hardcoding a fixed device id is never the correct thing to do. You want to either:
Select an ID from the list of available devices that the API returns for you (there are a number of very helpful APIs to let you get the device you want), or
Use no explicit device selection at all in your code and rely on appropriate driver compute-mode settings and/or the CUDA_VISIBLE_DEVICES environment variable to have the driver automatically select a suitable valid device ID for you.
Which you choose will probably be dictated by the environment in which your code ends up being deployed.
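A minimal sketch of the first option, assuming (purely as an illustration) that the criterion you care about is "compute capability 2.x and no kernel run-time limit", which usually excludes the device driving the display:

    // Pick a device by its properties instead of hard-coding an ID.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        int chosen = -1;
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // Example criterion: compute capability >= 2.0 and no kernel
            // run-time limit (a display device usually has the limit enabled).
            if (prop.major >= 2 && !prop.kernelExecTimeoutEnabled) {
                chosen = i;
                break;
            }
        }
        if (chosen < 0) {
            fprintf(stderr, "No suitable device found\n");
            return 1;
        }
        cudaSetDevice(chosen);
        printf("Using device %d\n", chosen);
        return 0;
    }

For the second option, launching the program with e.g. CUDA_VISIBLE_DEVICES=1 restricts enumeration to that device, which the application then sees as device 0.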
From looking online, it appears that the Chrome profiler samples 1000 times a second. This seems to be a reasonable default that balances information collection against overhead. However, I'm finding the default not aggressive enough for my current task.
I was wondering if there is a way to configure this default so I could try a few higher values. I'm absolutely willing to take the increased overhead while sampling this task.
Thanks!
There's a high-resolution profiling option in the DevTools settings. If enabled, it samples at a 10 kHz rate.
I cannot even achieve overlapping memcpy and kernel execution with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue it is a problem with the WDDM driver in Windows:
Why it is not possible to overlap memHtoD with GPU kernel with GTX 590,
CUDA kernels not launching before CudaDeviceSynchronize
Time between Kernel Launch and Kernel Execution
and suggest to:
flush the WDDM queue with cudaEventQuery() or cudaStreamQuery(). (Does not work.)
submit work to the streams in a breadth-first manner, as shown in the sketch after this list. (Does not work.)
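For reference, here is a minimal sketch of what breadth-first submission looks like (the kernel, buffer sizes, and stream count are placeholders): each stage is issued for all streams before the next stage, instead of issuing copy/kernel/copy per stream:

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= 2.0f;
    }

    int main()
    {
        const int nStreams = 4;
        const int n = 1 << 20;                    // elements per stream (placeholder)
        const size_t bytes = n * sizeof(float);

        float *h_data[nStreams], *d_data[nStreams];
        cudaStream_t streams[nStreams];
        for (int i = 0; i < nStreams; ++i) {
            cudaMallocHost((void**)&h_data[i], bytes);  // pinned memory, required for async copies
            cudaMalloc((void**)&d_data[i], bytes);
            cudaStreamCreate(&streams[i]);
        }

        // Breadth-first: issue each stage for all streams before the next stage.
        for (int i = 0; i < nStreams; ++i)
            cudaMemcpyAsync(d_data[i], h_data[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        for (int i = 0; i < nStreams; ++i)
            scaleKernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_data[i], n);
        for (int i = 0; i < nStreams; ++i)
            cudaMemcpyAsync(h_data[i], d_data[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
        cudaDeviceSynchronize();

        for (int i = 0; i < nStreams; ++i) {
            cudaStreamDestroy(streams[i]);
            cudaFreeHost(h_data[i]);
            cudaFree(d_data[i]);
        }
        return 0;
    }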
This thread argues it is a bug in fermi:
How can I overlap memory transfers and kernel execution in a CUDA application?
This thread:
http://blog.icare3d.org/2010/04/tesla-compute-drivers.html
proposes a solution to mitigate the problems with WDDM on Windows. However, it only works for a Tesla card and it requires an additional video card to drive the display, since the proposed drivers are compute-only drivers.
However, none of these threads provide a real solution. I would appreciate it, if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing this problem.
TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When set to false, the issue appears. Instead, if you set the TDR delay value to a very high number, and the "enabled" option to true, the issue goes away.
Read below for other (older) steps I followed before I came to the solution above, and some other possible causes.
I only recently was able to mostly solve this problem! It is specific to Windows and Aero, I think. Please try these steps and post your results to help others! I have tried it on a GTX 650 and a GT 640.
Before you do anything, consider using both the onboard GPU (as the display) and the discrete GPU (for computations), because there are verified issues with the NVIDIA driver for Windows! When you use the onboard GPU for the display, those drivers don't get fully loaded, so many bugs are evaded. Also, system responsiveness is maintained while working!
Make sure your concurrency problem is not related to other issues like old drivers (including the BIOS), wrong code, an incapable device, etc.
Go to Computer > Properties
Select Advanced system settings on the left side
Go to the Advanced tab
Under Performance, click Settings
In the Visual Effects tab, select the "Adjust for best performance" option.
This will disable Aero and almost all visual effects. If this configuration works, you can try re-enabling the visual-effect checkboxes one by one until you find the precise one that causes problems!
Alternatively, you can:
Right-click on the desktop and select Personalize
Select a basic theme that doesn't use Aero.
This will also work like the above, but with more visual options enabled. For my two devices this setting also works, so I kept it.
Please, when you try these solutions, come back here and post your findings!
For me, it solved the problem for most cases (a tiled DGEMM I wrote), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency...
UPDATE: The problem was fully solved by a fresh Windows installation!! The previous steps improved the behavior for some cases, but a fresh install solved all the problems!
I will try to find a less radical way of solving this problem; maybe restoring just the registry will be enough.
I'm looking for an example of code that samples a signal from the microphone and plays it on the speakers. I need a solution that has a reasonably constant delay on different platforms (PC, Android, iPhone). A delay of around 1-2 s is OK for me, and I don't mind if it varies every time the application starts.
I tried using the SampleDataEvent.SAMPLE_DATA event on the Sound and Microphone classes. One event would put data into a buffer, and the other would read the data. But it seems impossible to maintain a constant delay: either the delay grows constantly, or it shrinks to the point where I have fewer than 2048 samples to put out and the Sound class stops generating SampleDataEvent.SAMPLE_DATA events.
I want to process each incoming frame, so using setLoopBack(true) is not an option.
PS: This is more of a problem on Android devices than on the PC, although when I start to resize the application window on the PC the delay starts to grow as well.
Please help.
Unfortunately, this won't be possible... at least not directly.
Some sound devices will use a different clock between recording and playback. This will be especially true for cell phones where what is running the microphone may very well be different hardware than the headphone audio output.
Basically, if you record at 44.1kHz and play back at 44.1kHz, but those clocks are not in sync, you may be recording at 44.099kHz and play back at 44.101kHz. Over time, this drift will mean that you won't have enough data in the buffer to send to the output.
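To put rough numbers on that, using the illustrative rates above and a placeholder value for how much slack the buffer has before playback starves:

    #include <cstdio>

    int main()
    {
        const double recordHz     = 44099.0;  // actual recording clock (illustrative)
        const double playbackHz   = 44101.0;  // actual playback clock (illustrative)
        const double slackSamples = 2048.0;   // placeholder: slack before playback starves

        double drainPerSecond    = playbackHz - recordHz;       // samples lost per second
        double secondsToUnderrun = slackSamples / drainPerSecond;
        printf("Buffer drains at %.1f samples/s; underrun after ~%.0f s\n",
               drainPerSecond, secondsToUnderrun);
        return 0;
    }

The numbers are made up, but they show why the failure takes a while to appear and why it cannot be avoided without resynchronizing.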
Another complication (and more than likely your problem) is that your record and playback sample rates may be different. If you record from the microphone at 11kHz and playback at 48kHz, you will note that 11 doesn't evenly fit into 48. Software is often used to up-sample the recording. Sometimes this is done with a nice algorithm which is guaranteed to give you the necessary output. Other times, that 11kHz will get pushed to 44kHz and is deemed "close enough".
In short, you cannot rely on recording and playback devices being in sync, and will need to synchronize yourself. There are many algorithms out there for handling this. The easiest method is to add a sample here and there that averages the sample before and after it. If you do this with just a few samples, it will be inaudible. Depending on the kind of drift problem you are having, this may be sufficient.
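A minimal sketch of that last idea in plain C++ (the function name and usage are hypothetical): when the playback side is about to run dry, stretch a block by one sample by inserting the average of two neighbouring samples.

    #include <vector>

    // Insert one interpolated sample after position `pos`, lengthening the
    // block by a single sample. Done only occasionally, this is inaudible.
    std::vector<float> stretchByOne(const std::vector<float> &in, size_t pos)
    {
        std::vector<float> out;
        out.reserve(in.size() + 1);
        for (size_t i = 0; i < in.size(); ++i) {
            out.push_back(in[i]);
            if (i == pos && i + 1 < in.size())
                out.push_back(0.5f * (in[i] + in[i + 1]));  // average of neighbours
        }
        return out;
    }

How often you apply this (or the mirror operation that drops a sample when the buffer grows) depends on the drift you measure between the fill levels of the record and playback buffers.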
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they fail and exit. I realize it's ideal not to have a CUDA application run that long, but assuming CUDA is the correct choice and that, due to the amount of sequential work per thread, it must run that long: is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, open regedit, create a REG_DWORD named DisableBugCheck under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it computes fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU to compute 100,000 each.
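A minimal sketch of that chunking idea, assuming device-resident data and a placeholder per-thread workload; each launch is kept small enough to finish well inside the watchdog limit:

    #include <algorithm>
    #include <cuda_runtime.h>

    __global__ void processChunk(float *data, int offset, int count)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < count) {
            float v = data[offset + idx];
            for (int i = 0; i < 1000; ++i)   // placeholder for the real per-thread work
                v = sqrtf(v + 1.0f);
            data[offset + idx] = v;
        }
    }

    void processAll(float *d_data, int total, int chunk)
    {
        for (int offset = 0; offset < total; offset += chunk) {
            int count = std::min(chunk, total - offset);
            processChunk<<<(count + 255) / 256, 256>>>(d_data, offset, count);
            cudaDeviceSynchronize();  // let each command buffer complete before queuing more
        }
    }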
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a higher amount, so that Windows will allow a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the correct registry area to create the new key: HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > GraphicsDrivers.
There will probably be one value in there already, called DxgKrnlVersion, as a DWORD.
Right-click and create a new REG_DWORD named TdrDelay. The value assigned to it is the number of seconds before TDR kicks in; it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC. You need to restart the PC before the value will take effect.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and it works fine.
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with can complete in time, save all the state information there and stop, then start again from that point.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X display up at the time will also work, so X does not need to be exited as long as you switch to a non-X full-screen virtual terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
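A minimal sketch of that pattern (the update rule is a placeholder): copy the data to the device once, launch a short kernel per iteration on device-resident memory while the loop (and any abort or convergence check) stays on the CPU, and copy back once at the end:

    #include <cuda_runtime.h>

    __global__ void iterateStep(float *state, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) state[idx] = 0.5f * (state[idx] + 1.0f);  // placeholder update rule
    }

    void runIterations(float *h_state, int n, int maxIters)
    {
        float *d_state;
        const size_t bytes = n * sizeof(float);
        cudaMalloc((void**)&d_state, bytes);
        cudaMemcpy(d_state, h_state, bytes, cudaMemcpyHostToDevice);   // 1. pass data once

        for (int it = 0; it < maxIters; ++it) {                        // 2. iterate on device
            iterateStep<<<(n + 255) / 256, 256>>>(d_state, n);
            cudaDeviceSynchronize();  // each short launch stays within the watchdog limit
            // ...a host-side convergence or abort check could go here...
        }

        cudaMemcpy(h_state, d_state, bytes, cudaMemcpyDeviceToHost);   // 3. copy back once
        cudaFree(d_state);
    }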
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM. It is possible to modify the settings (timeout, behaviour on reaching timeout, etc.) with some registry keys; see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X configuration (likely /etc/X11/xorg.conf).
Adding Option "Interactive" "0" to the Device section for your GPU does the job.
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.