Caffe installation

Caffe installation - cuda

I'm installing Caffe.
I'm using Ubuntu 14.04.
I tried to install cuda. On Caffe site is written that I need to install the library and the latest standalone driver separately.
I downloaded driver from there. I tried every product type, but I get the same error:
You do not appear to have an NVIDIA GPU supported by the 346.46
NVIDIA Linux graphics driver installed in this system. For further
details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in
the README available on the Linux driver download page at www.nvidia.com.
And then
You appear to be running an X server; please exit X before
installing. For further details, please see the section INSTALLING
THE NVIDIA DRIVER in the README available on the Linux driver
download page at www.nvidia.com.
And
Installation has failed. Please see the file
'/var/log/nvidia-installer.log' for details. You may find
suggestions on fixing installation problems in the README available
on the Linux driver download page at www.nvidia.com.
I successfuly installed cuda and cuDNN.
Then I downloaded Caffe from here.
Then I tried to compile and after I did make all and make test,
I did make runtest and get this error:
Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
Also I found that I need to verify that I have a CUDA-Capable GPU.
This command: lspci | grep -i nvidia doesn't return anything. update-pciids doesn't help neither, though it returns Downloaded daily snapshot dated.
Can anyone help me install Caffe and everything correctly?

Your system apparently does not have a CUDA compatible GPU. Depending on what type of system you are using (most likely a desktop or server with appropriate free PCI-e slot(s), case space, and sufficient power supply capacity), it might be possible to purchase and install such a GPU.
Still you can get started with Caffe, by not using GPU by uncommenting CPU_ONLY flag in Makefile.config

Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device
is detected
Assuming you have a GPU card, the above error can come if NVDIA Driver is not installed / used by the system.
Please check this link - https://askubuntu.com/questions/670485/how-to-inspect-the-currently-used-nvidia-driver-version-and-switch-it-to-another
Check the latest driver version from Nvidia site for your card. Then add the relevant repository and install via that. Better to restart
sudo apt-add-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-3xx
sudo modeporbe nvidia (also ran this before restart)
Check via nvidia-smi command
alex#alex-Lenovo-G400s-Touch:~$ nvidia-smi
Tue Feb 28 15:10:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 720M Off | 0000:01:00.0 N/A | N/A |
| N/A 51C P0 N/A / N/A | 271MiB / 1985MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Install samples and test via deviceQuery after making the samples --> http://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 720M"
...
After that Reconfigure Caffe and do a clean make
Below are the CMake settings for reference and the CMake file http://pastebin.com/qAd40uvh

Probably you don't have CUDA compatible card. Also, you may have it, but you are not using it. i.e. If you have a NVidia card and Integrated graphics, you should make sure your monitor has plugged in your NVidia Card output interface.
You should make sure that your graphics card indeed support CUDA at http://www.geforce.com/hardware/technology/cuda/supported-gpus?field_gpu_type_value=All. Find your graphics card in this list, until you find your card.
p.s. To find your graphics card info, you can run lspci | grep VGA in the shell.

Related

CUDA is installed, but PyTorch v1.13 on Windows 10 not working. PyTorch not using GPU; how to fix PyTorch out of sync with CUDA 11.x drivers?

How can I find where CUDA 11.x for PyTorch-GPU 1.13 get installed on Windows 10 on my computer?
What I tried:
I installed the NVIDIA CUDA drivers and toolkit for Windows from the NVIDIA website. I can verify this by typing: !nvidia-smi in Jupyter Lab, which gives me the following output. This indicates that the CUDA tools are installed, but not being used by my PyTorch package. I need to find out what version of CUDA drivers are installed so I can install the correct PyTorch-GPU package.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 513.63 Driver Version: 513.63 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2000 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 N/A / N/A | 0MiB / 4096MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I find many Ubuntu questions and answers for locating CUDA to add it to my PATH, but nothing specific for Windows 10.
For example:
Pytorch CUDA installation fails,
Pytorch CUDA installation using conda,
pytorch-says-that-cuda-is-not-available
What are the equivalent Python commands on Windows 10 to locate the CUDA 11.x toolkits and driver version that my PyTorch-GPU package must use? And then how to fix the problem if PyTorch is out of sync?

I am answering my own question here...
PyTorch-GPU must be compiled against specific CUDA binary drivers.
I finally found this hint Why torch.cuda.is_available() returns False even after installing pytorch with cuda? which identifies the issue.
import torch
torch.zeros(1).cuda()
The return value clearly identifies the problem.
AssertionError Traceback (most recent call last)
Cell In [222], line 2
1 import torch
----> 2 torch.zeros(1).cuda()
File C:\ProgramData\Anaconda3\envs\tf210_gpu\lib\site-packages\torch\cuda\__init__.py:221, in _lazy_init()
217 raise RuntimeError(
218 "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
219 "multiprocessing, you must use the 'spawn' start method")
220 if not hasattr(torch._C, '_cuda_getDeviceCount'):
--> 221 raise AssertionError("Torch not compiled with CUDA enabled")
222 if _cudart is None:
223 raise AssertionError(
224 "libcudart functions unavailable. It looks like you have a broken build?")
AssertionError: Torch not compiled with CUDA enabled
The problem is: "Torch not compiled with CUDA enabled"
Now I have to see if I can just re-install PyTorch-GPU to replace the current PyTorch-CPU version with one that is compiled against my CUDA CUDA-GPU v11.6 driver, without rebuilding the entire conda environment. I would rather not rebuild the conda environment from scratch unless it is really necessary.

NVIDIA Visual Profiler: Insufficient kernel bounds data

I am trying to get some insight of why my CUDA kernel has a relatively low performance and I am hoping to get some answers with the NVIDIA profiler.
My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the kernel in question. The program launches the kernel several times in order to measure it's execution time as a mean over multiple launches. After the timing loop a memory copy from device to host is issued to make sure all kernel calls have finished. The program is written in CUDA C++.
This is how I built the program:
main.o: main.cu
nvcc -res-usage -arch=sm_61 -c $<
main: main.o stopwatch.o
g++ -o $# $^ -lcudart -L/usr/local/cuda-11.0/lib64
This test was done on a PC with Intel CPU and an NVIDIA GeForce GTX 1070. The OS is Ubuntu 20.04 with a freshly installed CUDA 11 from the NVIDIA website along with driver 450.51.06:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 On | 00000000:01:00.0 On | N/A |
| 28% 38C P8 8W / 151W | 317MiB / 8111MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The following command was used to generate the profiling file:
sudo /usr/local/cuda-11.0/bin/nvprof -o main.nvvp --profile-from-start
off ./main
I also tried with profiling from start but it leads to the same issue below.
The following command was used to launch the visual profiler:
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java main.nvvp
The Visual profiler walks me through several steps and when it comes to "Perform Kernel Analysis" the program tells me:
Insufficient kernel bounds data. The data needed to calculate compute,
memory, and latency bounds for the kernel could not be collected
Is this sort of detailed profiling not available on my GPU? (maybe because it's a gamer card)

nvprof by default will capture only a small amount of information in the output file it generates. This is enough to generate an application timeline, when the output file is imported into nvvp, but not enough information to enable all of the different capabilities of nvvp.
According to the documentation, the --analysis-metrics switch for nvprof is recommended for this type of use.
--analysis-metrics is referred to about 6 different times in the profiler documentation, so you may simply want to search on it to see all of the references or recommendations for its use.
Note that --analysis-metrics can capture a large amount of information. For a large, complex application, it may substantially increase the times the profilers spend processing data. Therefore if you know specifically which data you are looking for, you may wish to specify specific metrics instead. Without --analysis-metrics, however, various nvvp analysis tools may not work correctly when you import the file.

How to Install the CUDA Driver for TensorFlow (installing from source)

I'm trying to build TensorFlow from source and run it with GPU support. To install the toolkit I use the runfile, to install the driver I used the Additional Drivers Tool, since I did not get Ubuntu to boot into Text mode as specified in the CUDA documentation and stop lightdm and start lightdm does not work either, it gives me (also with sudo):
Name com.ubuntu.Upstart does not exist
So far I could build a release from the TensorFlow repository. However, when I'm trying to run the example as specified in the how-to
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
the GPU apparently cannot be found:
jonas#jonas-Aspire-V5-591G:~/Documents/repos/tensoflow_fork$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version
4.9.2 (Ubuntu 4.9.2-10ubuntu13) """
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])
Aborted
I'm using a clean Ubuntu 15.04 installation on an Acer Notebook with the GTX950M.
Can anybody tell me how to properly install the driver?

Can you run deviceQuery (comes with cuda installation)? Can you see nvidia present in lspci/lsmod/nvidia-smi?
lsmod |grep nvidia
dmesg | grep -i nvidia
lspci | grep -i nvidia
nvidia-smi
You can reload nvidia module and look for error messages
modprobe -r nvidia
dmesg | tail
sudo dmesg | grep NVRM
Related issue https://github.com/tensorflow/tensorflow/issues/601

CUDA Installation for NVidia Quadro FX 3800

I'm having trouble installing CUDA 7.0 (to use with TensorFlow) on a workstation with the Nvidia Quadro FX 3800. I'm wondering if this is because the GPU is no longer supported.
Installation of the driver (340.96) seems to work fine:
$ sh ./NVIDIA-Linux-x86_64-340.96.run
Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64
(version: 340.96) is now complete. Please update your XF86Config or
xorg.conf file as appropriate; see the file
/usr/share/doc/NVIDIA_GLX-1.0/README.txt for details.
However, I think I may be having trouble with the following:
$ ./cuda_7.0.28_linux.run --kernel-source-path=/usr/src/linux-headers-3.13.0-76-generic
The driver installation is unable to locate the kernel source. Please make sure
that the kernel source packages are installed and set up correctly. If you know
that the kernel source packages are installed and set up correctly, you may pass
the location of the kernel source with the '--kernel-source-path' flag.
...
Logfile is /tmp/cuda_install_1357.log
$ vi /tmp/cuda_install_1357.log
WARNING: The NVIDIA Quadro FX 3800 GPU installed in this system is
supported through the NVIDIA 340.xx legacy Linux graphics drivers.
Please visit http://www.nvidia.com/object/unix.html for more
information. The 346.46 NVIDIA Linux graphics driver will ignore
this GPU.
WARNING: You do not appear to have an NVIDIA GPU supported by the 346.46
NVIDIA Linux graphics driver installed in this system. For
further details, please see the appendix SUPPORTED NVIDIA GRAPHICS
CHIPS in the README available on the Linux driver download page at
www.nvidia.com.
...
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most
frequently when this kernel module was built against the wrong or
improperly configured kernel sources, with a version of gcc that
differs from the one used to build the target kernel, or if a driver
such as rivafb, nvidiafb, or nouveau is present and prevents the
NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
device(s), or no NVIDIA GPU installed in this system is supported by
this NVIDIA Linux graphics driver release.
...
Please see the log entries 'Kernel module load error' and 'Kernel
messages' at the end of the file '/var/log/nvidia-installer.log' for
more information.
Is the installation failure due to CUDA dropping support for this graphics card?
I followed the link trail: https://developer.nvidia.com/cuda-gpus > https://developer.nvidia.com/cuda-legacy-gpus > http://www.nvidia.com/object/product_quadro_fx_3800_us.html and I would have thought the Quadro FX 3800 supported CUDA (at least at the beginning).

Yes, the Quadro FX 3800 GPU is no longer supported by CUDA 7.0 and beyond.
The last CUDA version that supported that GPU was CUDA 6.5.
This answer and this answer may be of interest. Your QFX 3800 is a compute capability 1.3 device.
If you review the release notes that come with CUDA 7, you will find a notice of the elimination of support for these earlier GPUs. Likewise, the newer CUDA driver versions also don't support those GPUs.

CUDA Runtime API error 38: no CUDA-capable device is detected

The Situation
I have a 2 gpu server (Ubuntu 12.04) where I switched a Tesla C1060 with a GTX 670. Than I installed CUDA 5.0 over the 4.2. Afterwards I compiled all examples execpt for simpleMPI without error. But when I run ./devicequery I get following error message:
foo#bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
What I have tried
To solve this I tried all of the thinks recommended by CUDA-capable device, but to no avail:
/dev/nvidia* is there and the permissions are 666 (crw-rw-rw-) and owner root:root
foo#bar-serv2:/dev$ ls -l nvidia*
crw-rw-rw- 1 root root 195, 0 Oct 24 18:51 nvidia0
crw-rw-rw- 1 root root 195, 1 Oct 24 18:51 nvidia1
crw-rw-rw- 1 root root 195, 255 Oct 24 18:50 nvidiactl
I tried executing the code with sudo
CUDA 5.0 installs driver and libraries at the same time
PS here is lspci | grep -i nvidia:
foo#bar-serv2:/dev$ lspci | grep -i nvidia
03:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 670] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation G94 [Quadro FX 1800] (rev a1)
[update]
foo#bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ nvidia-smi -a
NVIDIA: API mismatch: the NVIDIA kernel module has version 295.59,
but this NVIDIA driver component has version 304.54. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
Failed to initialize NVML: Unknown Error
How could that be, if I use the CUDA 5.0 installer to install driver and libs at the same time. Could the old 4.2 version, that is still lying around mess things up?

I came across this issue, and running
nvidia-smi
informed me of an API mismatch. The problem was that my Linux distro had installed updates that required a system restart, so restarting resolved the issue.

See this stack overflow question Installing cuda 5 samples in Ubuntu 12.10.
Ubuntu 12 is not a supported Linux distro (yet). For reference see CUDA 5.0 Toolkit Release Notes And Errata
** Distributions Currently Supported
Distribution 32 64 Kernel GCC GLIBC
----------------- -- -- --------------------- ---------- -------------
Fedora 16 X X 3.1.0-7.fc16 4.6.2 2.14.90
ICC Compiler 12.1 X
OpenSUSE 12.1 X 3.1.0-1.2-desktop 4.6.2 2.14.1
Red Hat RHEL 6.x X 2.6.32-131.0.15.el6 4.4.5 2.12
Red Hat RHEL 5.5+ X 2.6.18-238.el5 4.1.2 2.5
SUSE SLES 11 SP2 X 3.0.13-0.27-pae 4.3.4 2.11.3
SUSE SLES 11.1 X X 2.6.32.12-0.7-pae 4.3.4 2.11.1
Ubuntu 11.10 X X 3.0.0-19-generic-pae 4.6.1 2.13
Ubuntu 10.04 X X 2.6.35-23-generic 4.4.5 2.12.1
If you want to do it run on Ubuntu 12 anyway then see answer of rpardo. It looks like this distro instead of installing 64 bit libraries to /usr/lib64 installs them to /usr/lib/x86_64-linux-gnu/
I'd suggest searching for all instances of libcuda.so and libnvidia-ml.so on the system. Since the driver doesn't support this distro it might have installed libraries to a path that is not pointed by LD_LIBRARY_PATH. Then move the libraries around and/or change the LD_LIBRARY_PATH to point to this location (it should be the first path on the left). Then retry nvidia-smi or deviceQuery
Good luck

I got error 38 for cudaGetDeviceCount on a windows machine with GTX980 GPU.
After I downloaded the latest driver for GTX 980 fro the NVIDIA site, installed it and restarted, everything is fine. Looks like the CUDA installer is not installing the latest driver.

Try running the sample using sudo (or, you might do a 'sudo su', set LD_LIBRARY_PATH to the path of cuda libraries and run the sample while being root). Apparently, since you've probably installed CUDA 5.0 using sudo, the samples doesn't run with normal user. However, if you run a sample with root, then you'll be able to run samples with the regular user too! I've not yet restarted the system to see if samples work with normal user even after reboot, or each time you should run at least one CUDA application with root.
The problem might completely disappear if you install CUDA TookKit without using sudo.

I had very similar problem on Debian and it turns out that loaded nvidia module had different version than libcuda1.
To check for installed nvidia module you should do:
$ sudo modinfo nvidia-current | grep version
version: 319.82
If it doesn't match version of libcuda1 this the root of your problems.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008