cublas failed to synchronize stop event? - cuda

I'm playing with the matrixMulCUBLAS sample code and tried changing the default matrix sizes to something slightly more fun rows=5k x cols=2.5k and then the example fails with the error Failed to synchronize on the stop event (error code unknown error)! at line #377 when all the computation is done and it is apparently cleaning up cublas. What does this mean? and how can be fixed?
I've got cuda 5.0 installed with an EVGA FTW nVidia GeForce GTX 670 with 2GB memory. The driver version is 314.22 latest one as of today.

In general, when using CUDA on windows, it's necessary to make sure the execution time of a single kernel is not longer than about 2 seconds. If the execution time becomes longer, you may hit a windows TDR event. This is a windows watchdog timer that will reset the GPU driver if it does not respond within a certain period of time. Such a reset halts the execution of your kernel and generates bogus results, as well as usually a briefly "black" display and a brief message in the system tray. If your kernel execution is triggering the windows watchdog timer, you have a few options:
If you have the possibility to use more than one GPU in your system (i.e. usually not talking about a laptop here) and one of your GPUs is a Quadro or Tesla device, the Quadro or Tesla device can usually be placed in TCC mode. This will mean that GPU can no longer driver a physical display (if it was driving one) and that it is removed from the WDDM subsystem, so is no longer subject to the watchdog timer. You can use the nvidia-smi.exe tool that ships with the NVIDIA GPU driver to modify the setting from WDDM to TCC for a given GPU. Use your windows file search function to find nvidia-smi.exe and then use nvidia-smi --help to get command line help for how to switch from WDDM to TCC mode.
If the above method is not available to you (don't have 2 GPUs, don't have a Quadro or Tesla GPU...) then you may want to investigate changing the watchdog timer setting. Unfortunately this requires modifying the system registry, and the process and specific keys vary by OS. There are a number of resources on the web, such as here from Microsoft, as well as other questions on Stack Overflow, such as here, which may help with this.
A third option is simply to limit the execution time of your kernel(s). Successive operations might be broken into multiple kernel calls. The "gap" between kernel calls will allow the display driver to respond to the OS, and prevent the watchdog timeout.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.

Related

Cuda Compute Mode and 'CUBLAS_STATUS_ALLOC_FAILED'

I have a host in our cluster with 8 Nvidia K80s and I would like to set it up so that each device can run at most 1 process. Before, if I ran multiple jobs on the host and each use a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3 which I believe makes it so that each device can accept a job from only one CPU process. I then run 2 jobs (each of which only takes about ~150 MB out of 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED, rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This is relying on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime that would, when a device was set to an Exclusive compute mode, automatically select another available device, when one is in use.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
It's suggested, rather than relying on this, that you instead use an external means, such as a job scheduler (e.g. Torque) to manage resources in a situation like this.

Multiple GPUs and Multiple Executables

Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, CUDA driver only supports a single active context at a time. If you run more that one application on a GPU, the processes will both have contexts which just contend with one another in a "first come, first serve" basis. If one application blocks the other with a long running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that will impose an upper limit of a few seconds before the application will get its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty to having multiple processes contending for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously - you would run out of available memory well before that number. Trying to run more applications would result in an context establishment failure.
As for the behaviour of the driver distributing multiple applications on multiple GPUs, AFAIK the linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try and find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute exclusive (either thread or process) or marked prohibited, then the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, then the application will fail with a no valid device available error.
So in summary,for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite) and I would expected it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general purpose HPC clusters deal with scheduling in multi-gpu environments.

Limitations of work-item load in GPU? CUDA/OpenCL

I have a compute-intensive image algorithm that, for each pixel, needs to read many distant pixels. The distance is dependent on a constant defined at compile-time. My OpenCL algorithm performs well, but at a certain maximum distance - resulting in more heavy for loops - the driver seems to bail out. The screen goes black for a couple of seconds and then the command queue never finishes. A balloon message reveals that the driver is unhappy:
"Display driver AMD driver stopped responding and has successfully recovered."
(Running this on OpenCL 1.1 with an AMD FirePro V4900 (FireGL V) Graphics Adapter.)
Why does this occur?
Is it possible to, beforehand, tell the driver that everything is ok?
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
You need to make sure that your kernels are not too time-consuming (best).
You can turn-off the watchdog timer: Timeout Detection and Recovery of GPUs.
You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.

Time between Kernel Launch and Kernel Execution

I'm trying to optimize my CUDA programm by using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consits of a copy of host data to the device, execution of the kernels and copy of the results from the device to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernel in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before launch of the Kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 us to launch the kernel which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can tell by the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.
Since you are on GTX 480 on Windows 7 you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work which introduces a lot of overhead. To avoid reduce this overhead the CUDA driver buffers requests in an internal SW queue and sends the requests to the driver when the queue is full you it is flushed by a synchronize call. It is possible tu use cudaEventQuery to force the driver to flush the work but this can have other performance implications.
It appears you are submitting your work to streams in a depth first manner. On compute capability 2.x and 3.0 devices you will have better results if you submit to streams in a breadth first manner. In your case you may see overlap between your kernels.
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting after completion of all of the kernels. Given the API call pattern I you should be able to see transfers starting after each streams completes its launch.
If you are waiting on all streams to complete it is likely faster to do a cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.

Any tips to avoid laggy display during long kernels?

Dear CUDA users I am reposting a question from nvidia boards:
I am currently doing image processing on GPU and I have one kernel that takes something like 500 to 700 milliseconds when running on big images. It used to work perfectly on smaller images but now the problem is that the whole display and even the mouse cursor are getting laggy (OS=win7)
My idea was to split my kernel in 4 or 8 kernel launches, hoping that the driver could refresh more often (between each kernel launch).
Unfortunately it does not help at all, so what else could I try to avoid this freezing display effect? I was suggested to add a cudaStreamQuery(0) call between each kernel to avoid packing by the driver.
Note: I am prepared to trade performances for smoothness!
The GPU is not (yet) designed to context switch between kernel launches, which is why your long-running kernel is causing a laggy display. Breaking the kernel into multiple launches probably would help on platforms other than Windows Vista/Windows 7. On those platforms, the Windows Display Driver Model requires an expensive user->kernel transition ("kernel thunk") every time the CUDA driver wants to submit work to the GPU.
To amortize the cost of the kernel thunk, the CUDA driver queues up GPU commands and submits them in batches. The driver uses a heuristic to trade off the performance hit from the kernel thunk against the increased latency of not immediately submitting work. What's happening with your multiple-kernels solution is that the driver's submitting your kernel or series of kernels to the GPU all at once.
Have you tried the cudaStreamQuery(0) suggestion? The reason that might help is because it forces the CUDA driver to submit work to the GPU, even if very little work is pending.