Disabling ECC support for a Tesla C2070 on Ubuntu 12.04 - cuda

I have a headless workstation running Ubuntu 12.04 Server and recently installed a new Tesla C2070 card, but when running the examples from the CUDA SDK, I get the following error:
NVIDIA_GPU_Computing_SDK/C/bin/linux/release% ./reduction
[reduction] starting...
Using Device 0: Tesla C2070
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks
reduction.cpp(473) : cudaSafeCallNoSync() Runtime API error 39 : uncorrectable ECC error encountered.
Actually, this error occurs with all the other examples except deviceQuery.
I'm using kernel 3.2.0, NVIDIA driver 295.41 and CUDA 4.2.9.
After a lot of searching, I found a suggestion to disable ECC support with:
nvidia-smi -g 0 --ecc-config=0
which worked. But the question is: how reliable will GPU computing be with ECC support disabled?
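For reference, the resulting ECC state can also be double-checked from code; here is a minimal sketch using the runtime API's device properties (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;  // device 0 assumed
    // ECCEnabled is 1 while ECC is active, 0 once it has been disabled
    std::printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "on" : "off");
    return 0;
}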
Any advice, suggestion or solution will be highly appreciated.
-Konstantin

I'm wondering if this may be some sort of compatibility issue rather than a bad card. I'm suffering from the same problem with a Tesla C2075 on the same Ubuntu version. We contacted NVIDIA, and they told us that double-bit ECC errors (as seen using nvidia-smi -q on Linux) meant that the card was probably broken. We obtained a replacement, but it has exactly the same issues.
It seems unlikely that both the boards I have had are broken in the same way, so we're going to try it in another machine if we can find a suitable one.
I'll post anything interesting that we learn.

I'll echo what aland said and add my own experience.
I worked with a number of Fermi-equipped compute clusters and tested them both with ECC on and with ECC off. We turned ECC off to increase the amount of available memory and the speed of the computations, and the gain was noticeable. nvidia-smi never reported any ECC errors for the cards that had ECC on, nor did we ever encounter runtime errors indicative of ECC-related problems.
If your card is detecting uncorrectable ECC problems, that indicates a flaw in the hardware, and turning ECC off is only masking the problem. The runtime is rightly warning you that something bad has gone wrong, and you can't depend on the results.
You can try running your calculations anyway and see what happens, but be prepared for things to go absolutely crazy for no apparent reason. A single bit flipped here or there can have enormous consequences for floating-point math, for example, and may flat out crash your kernel if an instruction gets corrupted.
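To see how big "a single bit flipped here or there" can be, here is a tiny host-side illustration (plain C++, nothing CUDA-specific; the flipped bit was chosen for dramatic effect):

#include <cstdio>
#include <cstring>
#include <cstdint>

int main() {
    float x = 1.0f;
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // view the float's raw bits
    bits ^= 1u << 30;                     // flip a single exponent bit
    std::memcpy(&x, &bits, sizeof x);
    std::printf("after one bit flip: %g\n", x);  // prints "inf": 1.0f became infinity
    return 0;
}

One flipped exponent bit turns 1.0 into infinity; a flipped bit in an address or instruction is typically fatal rather than merely wrong.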
If you can, I would try to get the card replaced rather than masking the symptoms.

It turned out my case is the same as carthurs's. I also got my card replaced, but the error didn't go away. Only after setting the motherboard's onboard VGA as primary in the BIOS did it disappear. There should be a warning about this in the Tesla installation manual!
Thanks everybody for the help.

Once a GPU uncorrectable ECC error occurs, the GPU might be in an unstable state (e.g. data corruption could have occurred not only in user-allocated memory but also in memory regions necessary for GPU operation). To recover the GPU, you need to either power cycle/reboot your system or try a GPU reset from nvidia-smi:
nvidia-smi -h
...
-r --gpu-reset Trigger secondary bus reset of the GPU.
Can be used to reset GPU HW state in situations
that would otherwise require a machine reboot.
Typically useful if a double bit ECC error has
occurred.
--id= switch is mandatory for this switch
Type man nvidia-smi for more help on that topic
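So, with the id of the affected GPU filled in (0 here, purely as an example), the reset would look something like:
nvidia-smi --gpu-reset --id=0
As far as I know, the reset only succeeds if no process is currently using the GPU.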

Related

Nvidia CUDA Profiler's timeline contains many large gaps

I'm trying to profile my code using the Nvidia Profiler, but I'm getting strange gaps in the timeline, as shown below:
Note: the operations on both edges of each gap are cudaMemcpyAsync (Host-to-Device) calls.
I'm running on Ubuntu 14.04 with the latest version of CUDA (8.0.61) and the latest Nvidia display driver.
The Intel integrated graphics drives the display, not the Nvidia card, so the Nvidia GPU is only running my code, nothing else.
I've enabled CPU profiling as well to check these gaps, but nothing is shown there.
Also, no debugging options are enabled (neither -G nor -g),
and this is a release build.
My laptop's specs:
Intel Core i7 4720HQ
Nvidia GTX 960M
16 GB DDR3 RAM
1 TB Hard Drive
Is there any way to trace what's happening in these empty time slots?
Thanks,
I'm afraid there are no automatic methods, but you can add custom traces in your code to find out what's happening.
To do that you can use NVTX;
follow the links for tutorials and documentation.
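For example, a minimal NVTX range looks like this (a sketch: the function and range name are made up, and you need to link with -lnvToolsExt):

#include <nvToolsExt.h>

void prepare_batch() {                // hypothetical host-side function
    nvtxRangePushA("prepare_batch");  // opens a named range on the profiler timeline
    // ... the host work you suspect fills the gap: file I/O, allocations, etc.
    nvtxRangePop();                   // closes the range
}

The named ranges then show up as bars in the profiler timeline, so you can see exactly what occupies the otherwise empty slots.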
These profiling holes are probably due to data loading and to memory allocations/initialisations done by the host between your kernel executions.

"no CUDA-capable device is detected" with CUDA-capable GPU installed Win7

I have installed CUDA 7.0.28 on my laptop. I tried to run one of the sample files; I ran the deviceQuery project and got this message:
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Then I ran nvidia-smi.exe and got this message:
As you can see, it says "Not Supported". What should I do?
nvidia-smi returning "Not Supported" does not necessarily mean that your GPU lacks the ability to run CUDA code. It only means that you don't have the ability to see the active CUDA process name using nvidia-smi.
Cuda-z might be of help here. Take a look at what it is here: http://cuda-z.sourceforge.net/
Also, I have to say I had quite a few problems getting CUDA running on Windows. If you really need to run it on Windows, make sure you go through this first: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/#axzz3cNkYKZDP
Have you tried to run it on Linux on the same machine? It was much easier to get it working there.
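If you want to see exactly what the runtime reports, independent of the SDK samples, a minimal probe is enough (a sketch using only cudaGetDeviceCount):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        // 38 is cudaErrorNoDevice, i.e. "no CUDA-capable device is detected"
        std::printf("cudaGetDeviceCount failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", n);
    return 0;
}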
NVIDIA now provides a toolkit to install CUDA on Windows (Linux or Mac as well). It does a handy check of your system to see whether it meets the necessary requirements for CUDA, which is useful if you are unsure about your GPU:
https://developer.nvidia.com/cuda-80-ga2-download-archive
I've noticed that when my NVIDIA driver is updated during the system package update process (on Ubuntu), I get this message. It is resolved by a reboot, or probably an X restart, although I haven't tried that.
This was disconcerting the first time it happened since it was one of those "Hey! My code just ran fine. WTF happened?" moments.

Getting Theano to use the GPU and all CPU cores (at the same time)

I managed to get Theano working with either GPU or multicore CPU on Ubuntu 14.04 by following this tutorial.
First I got multicore working (I could verify that in System Monitor).
Then, after adding the config below to .theanorc, I got GPU working:
[global]
device = gpu
floatX = float32
I verified it by running the test from the tutorial and checking the execution times, and also by the log message when running my program:
"Using gpu device 0: GeForce GT 525M"
But as soon as the GPU started working, I no longer saw multicore usage in System Monitor; it used just one core at 100%, like before.
How can I use both? Is it even possible?
You can't fully utilize both the multicore CPU and the GPU at the same time.
Maybe this will be improved in the future.
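One thing that might be worth trying, though (untested, and whether it helps depends on your graph): Theano has an openmp flag that lets ops which only have CPU implementations use several cores even while device = gpu, e.g. in .theanorc:
[global]
device = gpu
floatX = float32
openmp = True
This only pays off if some ops in your graph actually fall back to the CPU.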

CUDA samples cause machine to crash

I was planning on starting to use CUDA on a machine with Kubuntu 12.04 LTS and a Quadro card. I installed CUDA 5.5 using the .deb from here, and the installation seems to have gone fine. Then I built the CUDA samples, again everything went fine.
When I run the samples in sequence, however, some of them botch my display, and others simply crash my computer.
What causes the crash? How can I fix it?
I'll mention that my NVidia card is the only display adapter the machine has, but that shouldn't make CUDA crash and burn.
The problem was due to the X server using the FOSS nouveau drivers, which are known to conflict with NVIDIA's way of accessing the card. When I restarted X (actually, I restarted the machine), the samples ran and worked properly.
Not all of the samples are runnable if you have just installed CUDA on a clean Ubuntu system. Some of them require additional libraries, and some require particular compute capability (CC) versions.
You can read the CUDA samples documentation for the crashing samples for more information:
http://docs.nvidia.com/cuda/cuda-samples/index.html
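If you are unsure which compute capability your card has, a small query does it; this is essentially a stripped-down deviceQuery (a sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability (CC) the samples check
        std::printf("Device %d: %s (CC %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}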

CUDA 5.0 cuda-gdb on Linux Needs Dedicated GPU?

With a fresh CUDA 5.0 Linux install on CentOS 5.5, I am not able to use cuda-gdb. So I am wondering: do you still need a dedicated GPU for cuda-gdb on Linux? I tried it with the VESA device driver for X11, but get the same result. Profiling works, and running the app works, but trying to run cuda-gdb gives:
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Any suggestions?
cuda-gdb still needs a GPU that is not being used by the graphical environment (e.g. if you are running GNOME/KDE/etc., you need a system with several GPUs; not all of them have to be NVIDIA GPUs).
This particular message is not about that problem; you can ignore it. cuda-gdb will tell you if it fails because no GPU can be used for debugging.
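As an aside, on a multi-GPU system you can also steer the debugged application away from the display GPU with the CUDA_VISIBLE_DEVICES environment variable, e.g. (assuming device 1 is the one not driving X, and with ./myapp as a placeholder for your application):
CUDA_VISIBLE_DEVICES=1 cuda-gdb ./myapp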