'which' doesn't find cuda after 'sudo dnf install cuda' - cuda

I ran sudo dnf install cuda on Fedora 27. The output is:
Last metadata expiration check: 0:01:05 ago on Thu 05 Jul 2018 10:32:51 AM CEST.
Package cuda-1:9.1.85.3-7.fc27.x86_64 is already installed, skipping.
Dependencies resolved.
Nothing to do.
Complete!
But when I do which cuda on the terminal, I get:
/usr/bin/which: no cuda in (/home/username/anaconda3/bin:/home/username/anaconda2/bin:/home/username/anaconda3/bin:/home/username/anaconda2/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/username/.local/bin:/home/username/bin)
Do I have CUDA installed?
Linux distribution:
x86_64
Fedora release 27 (Twenty Seven)
GPU available (output of lspci | grep -i nvidia):
01:00.0 3D controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] (rev a1)

There is no executable called 'cuda', so 'which cuda' finds nothing even when the package is installed. The libraries get installed someplace like /usr/lib/x86_64-linux-gnu/libcuda.so.
You can try running 'nvcc --version', but you can't be sure the installation works until you're able to run a CUDA-based application.
More details here:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#verify-installation
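Since no binary named cuda exists, a more useful check is to look for nvcc. Here is a minimal Python sketch of that idea. The fallback directory /usr/local/cuda/bin is an assumption about where the toolkit commonly lands (it is often not on PATH), and find_nvcc is a hypothetical helper name, not part of any NVIDIA tool.

```python
import os
import shutil

# Assumption: the toolkit is commonly installed under /usr/local/cuda,
# which is often missing from PATH. Adjust for your system.
DEFAULT_CUDA_BIN = "/usr/local/cuda/bin"


def find_nvcc(path=None, extra_dirs=(DEFAULT_CUDA_BIN,)):
    """Return the full path to nvcc, or None if it cannot be found.

    Searches `path` (default: the PATH environment variable) first,
    then each directory in `extra_dirs`.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    dirs = [d for d in path.split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    return shutil.which("nvcc", path=os.pathsep.join(dirs))


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc:
        print(f"nvcc found at {nvcc}; run '{nvcc} --version' to see the toolkit release")
    else:
        print("nvcc not found on PATH or in the assumed default location")
```

Finding nvcc only shows that the toolkit is present. As the answer says, the real test is still running a CUDA application.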

No devices were found

I get a "No devices were found" error when installing a ZOTAC GAMING GeForce RTX 3080 Ti Trinity OC in Ubuntu 20.04.
cat /proc/driver/nvidia/version command gives
NVRM version: NVIDIA UNIX x86_64 Kernel Module 525.85.05 Sat Jan 14 00:49:50 UTC 2023
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
lspci | grep -i nvidia command gives
06:00.0 VGA compatible controller: NVIDIA Corporation Device 2216 (rev a1)
06:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
The system has NVIDIA driver version 525.85.05 and CUDA 11.6, both installed successfully using run files.
But the nvidia-smi command gives
No devices were found
What is wrong with the installation?
EDIT:
The dpkg -l | grep -i nvidia command outputs the following:
ii libnvidia-compute-510:i386 510.108.03-0ubuntu0.20.04.1 i386 NVIDIA libcompute package
ii libnvidia-decode-510:i386 510.108.03-0ubuntu0.20.04.1 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-510:i386 510.108.03-0ubuntu0.20.04.1 i386 NVENC Video Encoding runtime library
ii libnvidia-fbc1-510:i386 510.108.03-0ubuntu0.20.04.1 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel

How to Install the CUDA Driver for TensorFlow (installing from source)

I'm trying to build TensorFlow from source and run it with GPU support. I installed the toolkit from the runfile and the driver with the Additional Drivers tool, because I could not get Ubuntu to boot into text mode as specified in the CUDA documentation. Running stop lightdm and start lightdm does not work either; it gives me (also with sudo):
Name com.ubuntu.Upstart does not exist
So far I could build a release from the TensorFlow repository. However, when I try to run the example as specified in the how-to:
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
the GPU apparently cannot be found:
jonas@jonas-Aspire-V5-591G:~/Documents/repos/tensoflow_fork$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version
4.9.2 (Ubuntu 4.9.2-10ubuntu13) """
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])
Aborted
I'm using a clean Ubuntu 15.04 installation on an Acer Notebook with the GTX950M.
Can anybody tell me how to properly install the driver?
Can you run deviceQuery (comes with cuda installation)? Can you see nvidia present in lspci/lsmod/nvidia-smi?
lsmod |grep nvidia
dmesg | grep -i nvidia
lspci | grep -i nvidia
nvidia-smi
You can reload the nvidia module and look for error messages:
sudo modprobe -r nvidia
sudo modprobe nvidia
dmesg | tail
sudo dmesg | grep NVRM
Related issue https://github.com/tensorflow/tensorflow/issues/601
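The lsmod check above can be scripted. This is a rough Python sketch that takes the text printed by lsmod and returns the loaded modules whose names start with "nvidia". The function name loaded_nvidia_modules is made up for illustration, and the code assumes the usual lsmod layout with a header line followed by one module per line.

```python
def loaded_nvidia_modules(lsmod_output):
    """Return names of loaded kernel modules whose name starts with 'nvidia'.

    `lsmod_output` is the text printed by `lsmod`; the first line is
    assumed to be the "Module  Size  Used by" header and is skipped.
    """
    modules = []
    for line in lsmod_output.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules
```

You could feed it the output of subprocess.run(["lsmod"], capture_output=True, text=True).stdout. An empty list suggests the kernel module is not loaded, which would fit the "No GPU devices available" message, though it does not prove the cause.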

Installing general package in octave has error

I get an error when installing the general package with this command:
pkg install -forge general
The output is:
octave:3> pkg install -forge general
In file included from /usr/local/octave/3.8.0/lib/gcc47/gcc/x86_64-apple-darwin13/4.7.3/include/stdint.h:3:0,
from /usr/local/octave/3.8.0/include/octave-3.8.0/octave/oct-conf-post.h:167,
from /usr/local/octave/3.8.0/include/octave-3.8.0/octave/config.h:3351,
from /usr/local/octave/3.8.0/include/octave-3.8.0/octave/../octave/oct.h:31,
from SHA1.cc:19:
/usr/local/octave/3.8.0/lib/gcc47/gcc/x86_64-apple-darwin13/4.7.3/include-fixed/stdint.h:27:32: fatal error: sys/_types/_int8_t.h: No such file or directory
compilation terminated.
make: *** [SHA1.oct] Error 1
/usr/local/octave/3.8.0/bin/mkoctfile-3.8.0 SHA1.cc
pkg: error running `make' for the general package.
error: called from 'configure_make' in file /usr/local/octave/3.8.0/share/octave/3.8.0/m/pkg/private/configure_make.m near line 82, column 9
error: called from:
error: /usr/local/octave/3.8.0/share/octave/3.8.0/m/pkg/private/install.m at line 199, column 5
error: /usr/local/octave/3.8.0/share/octave/3.8.0/m/pkg/pkg.m at line 394, column 9
octave:3>
I have no idea how to solve this problem. My OS is Mac OS X 10.9.3 Mavericks and my Octave version is 3.8.0:
octave:1> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.0
GNU Octave License: GNU General Public License
Operating System: Darwin 13.2.0 Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.
Does anyone have an idea?
I found the solution! Run this command:
xcode-select --install
and the installation succeeds:
octave:1> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.0
GNU Octave License: GNU General Public License
Operating System: Darwin 13.2.0 Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
no packages installed.
octave:2> pkg install -forge general
For information about changes from previous versions of the general package, run 'news general'.
octave:3> ver
----------------------------------------------------------------------
GNU Octave Version 3.8.0
GNU Octave License: GNU General Public License
Operating System: Darwin 13.2.0 Darwin Kernel Version 13.2.0: Thu Apr 17 23:03:13 PDT 2014; root:xnu-2422.100.13~1/RELEASE_X86_64 x86_64
----------------------------------------------------------------------
Package Name | Version | Installation directory
--------------+---------+-----------------------
general | 1.3.4 | /Users/apple/octave/general-1.3.4
I was having the same issue when trying to install the Octave signal package. The following steps finally appear to work:
Run xcode-select --install from a Terminal window to install the command line tools.
Install MacPorts. It is a standard installer that you can download from the MacPorts site.
sudo port install gcc48 --> This provides a Fortran compiler, which is necessary for installing octave-general
sudo port install octave-general [NOTE: THIS TOOK A VERY LONG TIME, and I had to disable Spotlight indexing...Hours on a Macbook Pro]
sudo port install octave-control
sudo port install octave-signal
While looking at how to install the control package, I found this in the Arch Wiki:
Note: Some Octave's packages, like control, need the gcc-fortran ArchLinux's package in order to compile and install.
(https://wiki.archlinux.org/index.php/Octave)
So you might have to install gcc-fortran first.

Cannot run CUDA code that queries NVML - error regarding libnvidia-ml.so

Recently a colleague needed to use NVML to query device information, so I downloaded the Tesla development kit 3.304.5 and copied the file nvml.h to /usr/include. To test, I compiled the example code in tdk_3.304.5/nvml/example and it worked fine.
Over a weekend, something changed in the system (I cannot determine what was changed and I am not the only one with access to the machine) and now any code that uses nvml.h, such as the example code, fails with the following error:
Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
However, I can still run nvidia-smi and read information about my K20m's state, and as far as I am aware nvidia-smi is just a set of calls to nvml.h. The error message I receive is somewhat cryptic, but I believe it is telling me that the nvidia-ml.so file needs to match the Tesla driver that I have installed on my system. Just to ensure everything is correct, I re-downloaded CUDA 5.0 and installed the driver, CUDA runtime, and the test files. I am certain that the nvidia-ml.so file matches the driver (both are 304.54) so I am quite confused as to what could be going wrong. I can compile and run the test code with nvcc as well as run my own CUDA code, as long as it doesn't include nvml.h.
Has anyone encountered this error or have any thoughts on rectifying the issue?
$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54
$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
$ whereis nvml.h
nvml: /usr/include/nvml.h
$ ldd example
linux-vdso.so.1 => (0x00007fff2da66000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
/lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
EDIT: The solution was to remove all extra instances of libnvidia-ml.so. For some reason there were a LOT of them.
$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1
You are getting this error because the application that is trying to use nvml is loading the stub library that is located in:
...tdk_install_path/lib64/libnvidia-ml.so
instead of the one in:
/usr/lib64/libnvidia-ml.so
I was able to reproduce your error when I added the stub library path to my LD_LIBRARY_PATH environment variable. So that is one possible source of error, if someone added the path of the stub library that comes with the tdk distribution to your LD_LIBRARY_PATH environment variable, but probably not the only way this could happen. If someone in an unusual fashion copied the stub library to some system path, that might also be an issue.
You'll need to try and figure out why your system is loading that stub library in place of the correct one in /usr/lib64. Alternatively, for discovery purposes, you could try deleting all instances of the stub library anywhere on your system (leave the correct libraries in /usr/lib and /usr/lib64 alone), and you should be able to observe correct behavior.
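To see which copies might be picked up first, here is a small Python sketch that lists every libnvidia-ml.so* file found in LD_LIBRARY_PATH and then in the system directories named in the warning. It only approximates the dynamic loader (it ignores RPATH/RUNPATH and the ld.so cache), and find_nvml_candidates is a hypothetical helper, so treat the output as a hint about where stub copies live.

```python
import glob
import os

# Assumption: the driver's real libraries live in these directories,
# as the NVML warning in the question states.
SYSTEM_LIB_DIRS = ("/usr/lib64", "/usr/lib")


def find_nvml_candidates(ld_library_path=None, system_dirs=SYSTEM_LIB_DIRS):
    """List libnvidia-ml.so* files in roughly the order the loader searches.

    Directories from LD_LIBRARY_PATH come first, then `system_dirs`.
    RPATH/RUNPATH and the ld.so cache are not consulted.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [d for d in ld_library_path.split(os.pathsep) if d]
    dirs.extend(system_dirs)
    found = []
    for d in dirs:
        for path in sorted(glob.glob(os.path.join(d, "libnvidia-ml.so*"))):
            if path not in found:
                found.append(path)
    return found
```

If the first entry is outside /usr/lib or /usr/lib64, that copy is a likely suspect for the stub library.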
I solved the problem this way on a GTX 1070 under Windows 10: go to Device Manager, select the GPU that is having the problem, disable it, and enable it again.
I was having this same or similar issue with EWBF Cuda Miner for zCash.
Here is a way to automatically implement Pro7ech's answer (which worked for me) for WIN10:
Install WDK for Windows 10 if you don't already have it: This will give you the ability to use devcon.exe which allows manipulation of devices via batch scripts:
https://learn.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
You might also need the Windows SDK if you don't have visual studio with Desktop development with C++ workload:
https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk
To make things easier, you might want to add the installation path to your PATH environment variable:
https://www.howtogeek.com/118594/how-to-edit-your-system-path-for-easy-command-line-access/
Devcon.exe was installed here for me:
C:\Program Files (x86)\Windows Kits\10\Tools\x64
So now run this or similar in a cmd.exe prompt to get the device id:
devcon findall * | find /i "nvidia"
Here is what mine looks like:
C:\Users\Soenhay>devcon findall * | find /i "nvidia"
HDAUDIO\FUNC_01&VEN_10DE&DEV_0083&SUBSYS_38426674&REV_1001\5&1C277AD4&0&0001: NVIDIA High Definition Audio
SWD\MMDEVAPI\{0.0.0.00000000}.{574980C3-9747-42EF-A78C-4C304E070B81}: SAMSUNG (NVIDIA High Definition Audio)
ROOT\UNNAMED_DEVICE\0000 : NVIDIA Virtual Audio Device (Wave Extensible) (WDM)
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000: NVIDIA GeForce GTX 1070
From that I see that my graphics device id is:
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000
So I create a batch file with the following to disable and re-enable the driver:
devcon disable "@PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
devcon enable "@PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
Now, when I get the NVML error when starting the miner, I just run this batch file and it fixes it. You could also add those two lines to the beginning of your start.bat file to do this every time, but I found that the error no longer happens every time I restart the miner.
References:
superuser post
devcon commands
devcon examples
No matching devices found.
NOTE:
The command should have the @ symbol at the beginning of the device id.
The batch script should be run as administrator.
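If you don't want to copy the device id by hand, the lookup can be scripted. This Python sketch takes the text printed by devcon findall * and builds the two command lines. It relies on devcon treating a leading @ as the marker for a device instance ID. The helper names are made up for illustration, and the filter is coarse: any PCI device whose description mentions NVIDIA will match.

```python
def nvidia_pci_ids(devcon_output):
    """Return PCI device instance IDs whose description mentions NVIDIA.

    Each input line is expected to look like "<instance id>: <description>".
    """
    ids = []
    for line in devcon_output.splitlines():
        device_id, sep, description = line.partition(":")
        device_id = device_id.strip()
        if sep and device_id.upper().startswith("PCI\\") and "nvidia" in description.lower():
            ids.append(device_id)
    return ids


def toggle_commands(device_id):
    """Build the devcon disable/enable command lines for one instance ID."""
    return [f'devcon {action} "@{device_id}"' for action in ("disable", "enable")]
```

Write the two strings from toggle_commands to a .bat file and run it as administrator, as noted above.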
I faced the same error.
The solution I found was to run:
nvidia-uninstall

CUDA Runtime API error 38: no CUDA-capable device is detected

The Situation
I have a 2 GPU server (Ubuntu 12.04) where I replaced a Tesla C1060 with a GTX 670. Then I installed CUDA 5.0 over 4.2. Afterwards I compiled all the examples except simpleMPI without error. But when I run ./deviceQuery I get the following error message:
foo@bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
What I have tried
To solve this I tried all of the things recommended in "CUDA-capable device", but to no avail:
/dev/nvidia* is there, with permissions 666 (crw-rw-rw-) and owner root:root
foo@bar-serv2:/dev$ ls -l nvidia*
crw-rw-rw- 1 root root 195, 0 Oct 24 18:51 nvidia0
crw-rw-rw- 1 root root 195, 1 Oct 24 18:51 nvidia1
crw-rw-rw- 1 root root 195, 255 Oct 24 18:50 nvidiactl
I tried executing the code with sudo
CUDA 5.0 installs driver and libraries at the same time
PS: here is the output of lspci | grep -i nvidia:
foo@bar-serv2:/dev$ lspci | grep -i nvidia
03:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 670] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation G94 [Quadro FX 1800] (rev a1)
[update]
foo@bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ nvidia-smi -a
NVIDIA: API mismatch: the NVIDIA kernel module has version 295.59,
but this NVIDIA driver component has version 304.54. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
Failed to initialize NVML: Unknown Error
How can that be, if I used the CUDA 5.0 installer to install the driver and libs at the same time? Could the old 4.2 version that is still lying around mess things up?
I came across this issue, and running
nvidia-smi
informed me of an API mismatch. The problem was that my Linux distro had installed updates that required a system restart, so restarting resolved the issue.
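The mismatch message names both versions, so a script can pull them out and tell you what to compare. A minimal Python sketch, assuming the wording matches the message quoted in the question (other driver releases may phrase it differently); parse_api_mismatch is a hypothetical helper name.

```python
import re

# Assumption: the wording matches the message quoted in the question.
_PATTERN = re.compile(
    r"kernel module has version (\d+(?:\.\d+)*), "
    r"but this NVIDIA driver component has version (\d+(?:\.\d+)*)"
)


def parse_api_mismatch(message):
    """Return (kernel_module_version, component_version), or None.

    None means the text does not contain the mismatch message.
    """
    normalized = " ".join(message.split())  # collapse wrapped lines into one
    match = _PATTERN.search(normalized)
    return (match.group(1), match.group(2)) if match else None
```

If the two versions differ, the loaded kernel module is older or newer than the user-space libraries. A reboot after a driver update, as described above, is one way they get back in sync.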
See this Stack Overflow question: Installing cuda 5 samples in Ubuntu 12.10.
Ubuntu 12 is not a supported Linux distro (yet). For reference, see the CUDA 5.0 Toolkit Release Notes And Errata:
** Distributions Currently Supported
Distribution 32 64 Kernel GCC GLIBC
----------------- -- -- --------------------- ---------- -------------
Fedora 16 X X 3.1.0-7.fc16 4.6.2 2.14.90
ICC Compiler 12.1 X
OpenSUSE 12.1 X 3.1.0-1.2-desktop 4.6.2 2.14.1
Red Hat RHEL 6.x X 2.6.32-131.0.15.el6 4.4.5 2.12
Red Hat RHEL 5.5+ X 2.6.18-238.el5 4.1.2 2.5
SUSE SLES 11 SP2 X 3.0.13-0.27-pae 4.3.4 2.11.3
SUSE SLES 11.1 X X 2.6.32.12-0.7-pae 4.3.4 2.11.1
Ubuntu 11.10 X X 3.0.0-19-generic-pae 4.6.1 2.13
Ubuntu 10.04 X X 2.6.35-23-generic 4.4.5 2.12.1
If you want to run it on Ubuntu 12 anyway, see rpardo's answer. It looks like this distro installs 64-bit libraries to /usr/lib/x86_64-linux-gnu/ instead of /usr/lib64.
I'd suggest searching for all instances of libcuda.so and libnvidia-ml.so on the system. Since the driver doesn't support this distro, it might have installed libraries to a path that is not listed in LD_LIBRARY_PATH. Then move the libraries around and/or change LD_LIBRARY_PATH to point to this location (it should be the first path on the left). Then retry nvidia-smi or deviceQuery.
Good luck
I got error 38 for cudaGetDeviceCount on a Windows machine with a GTX 980 GPU.
After I downloaded the latest driver for the GTX 980 from the NVIDIA site, installed it, and restarted, everything was fine. It looks like the CUDA installer does not install the latest driver.
Try running the sample using sudo (or do a 'sudo su', set LD_LIBRARY_PATH to the path of the CUDA libraries, and run the sample as root). Apparently, since you probably installed CUDA 5.0 using sudo, the samples don't run as a normal user. However, once you run a sample as root, you'll be able to run samples as the regular user too! I have not yet restarted the system to see whether the samples still work for a normal user after a reboot, or whether you have to run at least one CUDA application as root each time.
The problem might disappear completely if you install the CUDA Toolkit without using sudo.
I had a very similar problem on Debian, and it turned out that the loaded nvidia module had a different version than libcuda1.
To check the version of the installed nvidia module, run:
$ sudo modinfo nvidia-current | grep version
version: 319.82
If it doesn't match the version of libcuda1, this is the root of your problem.