Unable to get cuda to work in tensorflow - cuda

I'm trying to use cuda to accelerate tensorflow. I'm running tensorflow using the docker image.
Firstly, when I launch the gpu image, there is a mismatch in the LD_LIBRARY_PATH environment variable:
~# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64:
root@d578acbbc2cd:~# ls /usr/local/
bin cuda cuda-7.0 etc games include lib man sbin share src
There's no nvidia directory there. When I try to run the convolutional.py demo, it can't initialise the cuda support:
# python models/image/mnist/convolutional.py
Succesfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Succesfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Succesfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Succesfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.2.0-23-generic/modules.dep.bin'
E tensorflow/stream_executor/cuda/cuda_driver.cc:466] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:98] retrieving CUDA diagnostic information for host: d578acbbc2cd
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:106] hostname: d578acbbc2cd
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:131] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:242] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.68 Tue Dec 1 17:24:11 PST 2015
GCC version: gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:135] kernel reported version is: 352.68
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA:
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
It then goes on to train using cpu only.
# find /usr -name libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so
So in the docker image, there's only the gnu cpu cuda implementation. No NVIDIA stuff. In the host ubuntu 15.10 session, I have libcuda.so installed:
$ find /usr -name libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/i386-linux-gnu/libcuda.so
/usr/local/cuda-7.5/targets/x86_64-linux/lib/stubs/libcuda.so
So these seem to be stubs ... not sure why.
Is there some trick to getting this to work?
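The mismatch described in the question (LD_LIBRARY_PATH entries that point at directories missing from the container) can be checked with a short script. This is only a sketch: the helper name missing_library_dirs is made up for illustration, and it only reports missing directories, not whether libcuda.so can actually be loaded.

```python
import os


def missing_library_dirs(ld_library_path):
    """Return the LD_LIBRARY_PATH entries that are not existing directories.

    Hypothetical helper for illustration; empty entries (from a leading or
    trailing ':') are ignored.
    """
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    # Inspect the variable as seen by the current process.
    for path in missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("missing:", path)
```

Run inside the container, this should list /usr/local/nvidia/lib and /usr/local/nvidia/lib64 if they really are absent.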

Try rebuilding the Docker image directly from the Tensorflow repository (i.e. don't rely on the image on the container registry) and use https://github.com/NVIDIA/nvidia-docker to run the container (the Docker command described in the Tensorflow documentation is not portable).

I had a similar problem, though not in Docker. The libcuda.so in /usr/local/cuda/lib64/stubs was a broken symlink. When I searched for libcuda.so, it only turned up a file in a lib32 folder.
It seems that the problem was how I originally installed the NVIDIA device driver. At some point in the driver install process you're given the option to install the lib32 drivers. I had thought this meant in addition to lib64 drivers so I selected it. Turns out it only installs lib32 and not lib64 drivers.
I reinstalled the NVIDIA device driver, this time not selecting the lib32 'option'. Now tensorflow finds libcuda.so.
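A quick way to spot a broken libcuda.so symlink like the one described above is to look for links whose target no longer exists. The sketch below assumes you pass the directory to inspect; the function name find_broken_symlinks and the default prefix are illustrative choices.

```python
import os


def find_broken_symlinks(directory, prefix="libcuda.so"):
    """Return paths in `directory` starting with `prefix` that are dangling symlinks."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows symlinks, so it is False for a dangling link.
        if name.startswith(prefix) and os.path.islink(path) and not os.path.exists(path):
            broken.append(path)
    return broken
```

For example, find_broken_symlinks("/usr/local/cuda/lib64/stubs") would report the stub link if its target is missing.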

I had the same problem running tensorflow on an Ubuntu machine after I upgraded my driver to 352.63 and 352.93. (I remember it working with 346.*, but when I try to install 346.*, it installs 352.* automatically for some reason.)
I finally figured out that it was caused by a permissions issue (I could run it as root). So I changed the permissions of the libcuda.so.352-63 file to make it executable by anyone, and it works well now.
Hope this will be helpful to those still struggling with this issue.
I didn't try the docker one, but I guess it's also caused by permission setting.
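To check whether a library file is accessible to non-root users before changing anything, the permission bits can be inspected directly. This is a minimal sketch; other_permissions is a hypothetical helper, and it only looks at the "other" bits of the file itself, not at the directories above it.

```python
import os
import stat


def other_permissions(path):
    """Report whether users outside the owner and group may read/execute `path`."""
    mode = os.stat(path).st_mode
    return {
        "read": bool(mode & stat.S_IROTH),
        "execute": bool(mode & stat.S_IXOTH),
    }
```

Pass it the full path of the versioned libcuda.so file that the driver installed on your machine.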

Try this command
sudo apt-get install nvidia-modprobe
As mentioned here:
https://github.com/tensorflow/tensorflow/issues/394
and
http://kkjkok.blogspot.in/2016_08_01_archive.html

After I updated the NVIDIA driver to 378.09 on Ubuntu 14.10 I had the same error,
although all the permissions on the lib files were set correctly.
Thanks to @PhoenixQ, I tried running with sudo and it worked.
After that I tried running without sudo one more time and the error disappeared. I'm not sure what exactly happened, but maybe something was configured during the call with sudo that was not possible without sudo.
So the solution:
Try to run the same thing with sudo.
After this, try running without sudo. Worked for me.

Related

Where does cuda-repo-cross-<identifier>-all.deb come from?

I am trying to set up a cross-compile environment on an AWS EC2 Ubuntu box targeting NVIDIA Xavier devices on CUDA 10.2. I tried following the "instructions" at https://docs.nvidia.com/cuda/archive/10.2/cuda-installation-guide-linux/index.html#cross-platform which say to install
sudo dpkg -i cuda-repo-cross-<identifier>_all.deb
but there is no clue as to where I might get hold of that .deb file, or what <identifier> should be replaced with. I have installed the native package cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb and there are a load of .deb files in /var/cuda-repo-10-2-local-10.2.89-440.33.01, but none of them are that one.
So it turns out that the instructions that can be found by googling for, for instance, "cuda install cross compile" are wrong, or at least so incomplete as makes no difference.
Instead, use the SDK manager https://developer.nvidia.com/nvidia-sdk-manager to install just the host tools. It does run without a GUI.

nvcc not found but cuda runs fine?

I was trying to run nvcc -V to check cuda version but I got the following error message.
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
But GPU acceleration is working fine for training models on CUDA. Is there another way to find out the CUDA compiler tools version? I know nvidia-smi doesn't give the right version.
Is there a way to install or configure nvcc so that I don't have to install a whole new toolkit?
Most of the time, nvcc and other CUDA SDK binaries are not in the environment variable PATH. Check the installation path of CUDA; if it is installed under /usr/local/cuda, add its bin folder to the PATH variable in your ~/.bashrc:
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
You can apply the changes with source ~/.bashrc, or the next time you log in, everything is set automatically.
As @pQB and @talonmies mentioned above, you only need to install the GPU drivers (versions 430-470 these days) to use PyTorch. If you are using your GPU display port you should be fine.
For the CUDA compilation tools you need to install the whole toolkit, which includes the driver as well. If you install manually from the downloaded file on the command line, the installer gives you the option to choose which components to install or skip.
Generally, it is recommended to install the compilation tools (which are system wide) and GPU drivers together because it avoids compatibility issues.
Append:
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
to
~/.bashrc
Note: your path to cuda may include a version, so navigate to /usr/local/, check for cudaXX.XX, and modify the commands in ~/.bashrc to point to it.
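Before editing ~/.bashrc, it can help to confirm where nvcc actually lives. The sketch below first asks the current PATH and then falls back to looking under /usr/local/cuda*/bin; the function names and the default search root are assumptions for illustration, so adjust them if your toolkit is installed elsewhere.

```python
import glob
import os
import shutil


def nvcc_candidates(search_root="/usr/local"):
    """Return executable nvcc binaries found under <search_root>/cuda*/bin."""
    pattern = os.path.join(search_root, "cuda*", "bin", "nvcc")
    return sorted(p for p in glob.glob(pattern) if os.access(p, os.X_OK))


def find_nvcc(search_root="/usr/local"):
    """Prefer nvcc from PATH; otherwise list candidates under the search root."""
    on_path = shutil.which("nvcc")
    return [on_path] if on_path else nvcc_candidates(search_root)


if __name__ == "__main__":
    print(find_nvcc())
```

If the result is empty, no toolkit was found in those locations and the PATH change alone will not help.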

Qt Library 'mysql' is not defined

I have a problem connecting Qt to MySQL. When I run this code:
QSqlDatabase DBObject = QSqlDatabase::addDatabase("QMYSQL");
DBObject.setHostName("localhost");
DBObject.setDatabaseName("SingleDB");
DBObject.setUserName("root");
DBObject.setPassword("abc123");
bool ok = DBObject.open();
I get this: QSqlDatabase: QMYSQL driver not loaded
I have already done this as well:
sudo apt-get install libmysqlclient
and
/home/wrm/Qt/5.12.3/gcc_64/bin/qmake "INCLUDEPATH+=/usr/local/include" "LIBS+=-L/usr/local/lib -lmysqlclient_r" mysql.pro
and here I get this error: Project ERROR: Library 'mysql' is not defined
Any idea?
Perhaps you need to install mysql-devel.
According to the Qt Docs QMYSQL for MySQL 4 and higher:
How to Build the QMYSQL Plugin on Unix and macOS
You need the MySQL header files, as well as the shared library libmysqlclient.so. Depending on your Linux distribution, you may need to install a package which is usually called "mysql-devel".
Google doesn't have a readily available answer, so answering this old question:
Aside from needing the development files as pointed out above (like apt install libmysqlclient-dev), you need to generate a config:
# Just for making my snippet work. Feel free to hardcode paths.
export QTDIR=/home/you/Qt/
export QTVERSION=5.9.5
cd $QTDIR/$QTVERSION/Src/qtbase/src/plugins/sqldrivers
$QTDIR/$QTVERSION/gcc_64/bin/qmake sqldrivers.pro
cd mysql
make
make install # if you want; it installs it in the bin dir of $QTVERSION
In the past, this was not necessary for Qt 5.5 (where I did this last time).
On a side note, there is no longer a special thread-safe version of libmysqlclient (libmysqlclient_r). There is just the one library. Last time I ran into that link error, I just edited the generated Makefile to use the non-_r name.
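After building, you can check whether the MySQL plugin file ended up where Qt looks for SQL drivers. The sketch below assumes the usual layout of an online-installer Qt, with plugins under <Qt dir>/<version>/gcc_64/plugins/sqldrivers and the MySQL plugin named libqsqlmysql.so. Both are assumptions to verify against your own install, and the helper names are made up.

```python
import os


def mysql_plugin_path(qt_dir, qt_version, kit="gcc_64"):
    """Build the expected plugin path (assumed layout, see the note above)."""
    return os.path.join(qt_dir, qt_version, kit,
                        "plugins", "sqldrivers", "libqsqlmysql.so")


def mysql_plugin_installed(qt_dir, qt_version, kit="gcc_64"):
    """True if the MySQL driver plugin file exists at the expected path."""
    return os.path.isfile(mysql_plugin_path(qt_dir, qt_version, kit))
```

Even when the file exists, "driver not loaded" can still appear if the plugin's own dependency (libmysqlclient) cannot be found at run time.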

Tensorflow cannot open libcuda.so.1

I have a laptop with a GeForce 940 MX. I want to get Tensorflow up and running on the gpu. I installed everything from their tutorial page, now when I import Tensorflow, I get
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:119] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH:
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: workLaptop
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] LD_LIBRARY_PATH:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1093] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libnvidia-fatbinaryloader.so.367.57: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
>>>
after which I think it just switches to running on the cpu.
EDIT: After I nuked everything and started from scratch, I now get this:
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:119] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: workLaptop
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1092] LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1093] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libnvidia-fatbinaryloader.so.367.57: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
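A note on reading the log above: the dlerror part of the "failed to find libcuda.so" line names the file the loader could not open, which here is libnvidia-fatbinaryloader.so.367.57 rather than libcuda.so.1 itself. A small sketch for pulling that name out of such a line follows; the regular expression is based only on the message format shown in this log and may not cover other variants.

```python
import re

# Matches the "dlerror: <file>: cannot open shared object file" fragment.
_DLERROR = re.compile(r"dlerror: (\S+): cannot open shared object file")


def missing_shared_object(log_line):
    """Return the library named in a dlerror message, or None if absent."""
    match = _DLERROR.search(log_line)
    return match.group(1) if match else None
```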
libcuda.so.1 is a symlink to a file that is specific to the version of your NVIDIA drivers. It may be pointing to the wrong version or it may not exist.
# See where the link is pointing.
ls /usr/lib/x86_64-linux-gnu/libcuda.so.1 -la
# My result:
# lrwxrwxrwx 1 root root 19 Feb 22 20:40 \
# /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> ./libcuda.so.375.39
# Make sure it is pointing to the right version.
# Compare it with the installed NVIDIA driver.
nvidia-smi
# Replace libcuda.so.1 with a link to the correct version
cd /usr/lib/x86_64-linux-gnu
sudo ln -f -s libcuda.so.<yournvidia.version> libcuda.so.1
Now in the same way, make another symlink from libcuda.so.1 to a link of the same name in your LD_LIBRARY_PATH directory.
You may also find that you need to create a link named libcuda.so in /usr/lib/x86_64-linux-gnu that points to libcuda.so.1.
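The manual comparison above (where the link points versus what nvidia-smi reports) can be partly scripted. This sketch resolves the symlink and extracts the driver version from the target's file name; linked_driver_version is a hypothetical helper, and it assumes the target is named libcuda.so.<version> with at least two numeric components.

```python
import os
import re


def linked_driver_version(link_path):
    """Return the driver version encoded in the file that `link_path` resolves to.

    Returns None if the resolved name does not look like libcuda.so.<major>.<minor>.
    """
    target = os.path.basename(os.path.realpath(link_path))
    match = re.fullmatch(r"libcuda\.so\.(\d+\.\d+(?:\.\d+)*)", target)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(linked_driver_version("/usr/lib/x86_64-linux-gnu/libcuda.so.1"))
```

If the printed version differs from the driver version shown by nvidia-smi, the link is pointing at the wrong file.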
In case anyone still encounters this: first make sure to add the --runtime=nvidia parameter when you run your container.
docker run --runtime=nvidia -t tensorflow/serving:latest-gpu
where tensorflow/serving:latest-gpu is the name of the docker image.
In the case I just solved, the fix was updating the GPU driver to the latest version and installing the CUDA toolkit. First, the PPA was added and the GPU driver installed:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-390
After adding the ppa, it showed options for driver versions, and 390 was the latest 'stable' version that was shown.
Then install the cuda toolkit:
sudo apt install nvidia-cuda-toolkit
Then reboot:
sudo reboot
It updated the drivers to a newer version than the 390 originally installed in the first step (it was 410; this was a p2.xlarge instance on AWS).
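After a driver change like this, one way to confirm which version the kernel module reports is to read /proc/driver/nvidia/version, the file TensorFlow's diagnostics try to open in the log in the question. The sketch below parses the "Kernel Module <version>" form seen in TensorFlow logs; other driver builds may word this line differently, so treat the pattern as an assumption.

```python
import re


def kernel_module_version(version_text):
    """Extract the version following 'Kernel Module' from the file's contents."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", version_text)
    return match.group(1) if match else None


def read_kernel_module_version(path="/proc/driver/nvidia/version"):
    """Read and parse the version file; returns None if it cannot be read."""
    try:
        with open(path) as handle:
            return kernel_module_version(handle.read())
    except OSError:
        # Missing file or "Permission denied", as in the question's log.
        return None
```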

MacPorts is unusable

I've recently installed MacPorts as explained on MacPorts website. All the process went well. The .profile file in my home directory has been updated (in this file the paths "/opt/local/bin" and "/opt/local/sbin" are added to the environment variable PATH) and all the macports files are in the directory "/opt". When I type "which port" in the shell, it returns "/opt/local/bin/port".
But something weird happens when I ask to install the port "octave-devel" (I've installed MacPorts to use Octave on my Mac in the first place). So when I enter the command "sudo port install octave-devel +atlas+docs" (as explained in GNU Octave wiki) in the terminal and type my password, the shell replies "Error: Port octave-devel not found". However the port "octave-devel" seems to exist because I've found its description on this page of the macports website.
Because I had to use Octave quickly I first wanted to uninstall MacPorts and install Fink instead and I tried the method described on the MacPorts website but after I typed "sudo port -fp uninstall installed" it returned "Error: No ports matched the given expression". I couldn't even uninstall this software! I really think that it is a problem of MacPorts itself and not the octave port but I can't find what exactly.
Eventually I used Octave on a Windows computer, but it annoys me not to know what is wrong with MacPorts on my computer. Mainly, I want to be able to use GNU Octave on my Mac because I need it for school.
Thank you in advance and happy holidays.
I'm not sure which version of OSX you are running, however, I have octave (not octave-devel) version 3.6.4 installed via macports on a machine running OSX 10.9.1. This was built using:
sudo port install octave
which yields a known bug building the atlas dependency that results from a missing fortran compiler. At this point you have two options. Before attempting to install octave first try to install atlas separately, either overriding the standard clang compiler with the gcc4x flag, or install atlas using:
sudo port install atlas +nofortran
which runs fine using clang. With atlas installed, octave should build to completion although there is a possibility that you will find an error regarding the use of arpack by apple as a vector library. Using +arpack is preferred, so it may be useful to load this by hand as well before starting your octave install.
Trying to install Octave using MacPorts I ran into a similar problem.
Summary
My solution was to first clean and build atlas separately using gcc48 instead of the default mpclang34, and then to build the default octave.
Details
This is on a MacBook running an older OS (10.7.5). The standard Octave (3.8.2) package failed to build: it hung on building the atlas dependency.
Solution:
sudo port clean atlas
sudo port -v install atlas +gcc48
sudo port -v install octave +atlas+docs
I'm currently going through the process of installing Octave via MacPorts. I used the following command which I found on Shifteleven.com:
sudo port install octave-devel +gcc45
It seems to be working so far. You also need to make sure you've installed the Xcode command line tools, which is something that I forgot to do the first time I tried.
I also ran into problems installing Octave using MacPorts on OSX 10.10.1 and solved them, similar to @Tom_N_PDX and @isak.
Short version
I got it working using one of the options described by @isak.
More detailed version
Running sudo port install octave failed because of the missing Fortran compiler problem.
I next installed Fortran using Macports sudo port install gcc48 and then tried re-installing Octave
sudo port clean octave
sudo port install octave
This "hung" on Atlas, as others have mentioned, although I now realize it just takes a long time and I killed it before it finished. It likely would have worked, as the output said it had found Fortran:
Selected C compiler: /usr/bin/clang
Selected F77 compiler: gfortran48
I then installed atlas separately, using the +gcc48 flag, as suggested by @isak
sudo port install atlas +gcc48
but it displayed the same compiler information as above (consistent with my conjecture that the above would have worked). This process took about 4 hours. You can monitor the progress of the task in the logfile (found with the command sudo port logfile atlas), which reassures you it's doing something and not "hung". (Oddly the output does halt mid-message, but it always eventually resumed. Also there were a lot of warning messages.)
Last, running the following worked:
sudo port clean octave
sudo port install octave +arpack
I actually first tried without the +arpack option and it worked, but I got the following message, consistent with @isak's answer:
WARNING: Dependency 'arpack' is installed with the +accelerate variant, using Apple's Vector Libraries which have some known bugs that can cause Octave to crash if using certain functions in arpack. The +atlas variant does not have these issues with Octave, but does take many hours to compile even on modern hardware.
When I reinstalled Octave with the +arpack flag it took less than a minute (because I had already installed Atlas).
I had a similar problem with MacPorts. I would recommend using Homebrew instead. Here are the commands to install Octave with Homebrew:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew update
brew upgrade
brew install octave