How to prevent syslogging "Error inserting nvidia" on cudaGetDeviceCount()? - cuda

I have a tool that can be run on both, GPU and CPU. In some init-step I check cudaGetDeviceCount() for the available GPUs. If the tool is being executed on a node without video cards, this results in the following syslog message:
Sep 13 00:21:10 [...] NVRM: No NVIDIA graphics adapter found!
How can I prevent the nvidia driver from flooding my syslog server with this message? It's OK if the node doesn't have a video card, it's not that critical, so I just want to get rid of the message.

That message gets inserted into the syslog by the NVIDIA driver. So the most direct solution would be to not install the NVIDIA driver on a node that does not have a GPU.
If you need some NVIDIA driver components on that node, for example to build CUDA driver API codes on a GPU-less login node, then you will need to use some special switches during driver installation.
You can find out more about driver install switches by using the --help switch on the driver installer package.
A sequence of switches like this may do the trick:
sudo sh NVIDIA-Linux-x86_64-319.72.run --no-nvidia-modprobe --no-kernel-module --no-kernel-module-source -z

Related

ERR_NVGPUCTRPERM error when launching nvprof with all metrics to profile CUDA application

GPU Tesla M60
Driver: 510.47.03
OSL Ubuntu 20.04.5 LTS
CUDA Version: 11.6
Trying the code below to get back full metrics on profiling a CUDA application results in teh error below.
Code
nvprof --metrics all ./myapp
Error
==8169== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
I tried using sudo as suggested but was unable to find the nvcc program.
The easiest solution is to run the profiler as root as below, noting that it may be necessary to use the fully qualified path to find nvcc if it is not in your sudo path.
sudo /usr/local/cuda/bin/nvprof --metrics all ./myapp
There are more permanent solutions available as per https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters such as changing permission settings with modprobe. However, I was not able to get these to work.

Why won't DraftSight run on Fedora 26 with Intel graphics?

DraftSight 2017SP1 Linux (beta) worked on Fedora 24. It fails after upgrading to Fedora 26. Running it from the command line so you can see the low-level errors,
/opt/dassault-systemes/DraftSight/Linux/DraftSight
Qt: Session management error: None of the authentication protocols specified are supported
Could not parse stylesheet of object 0x238a050
Could not parse stylesheet of object 0x238a050
In the graphics environment you see the usual start screens, then error pop-ups which offer to report the error and then close the application when clicked. One says that error-reporting is not available.
Similarly with 2017SP3 and 2018SP0. Fedora updates are current as of today.
This system is an Intel core i3. lspci reports "Intel Corp Xeon E3-1200 v3/4th Gen core processor Integrated Graphics Controller (rev 06)"
2018SP0 does work once an Nvidia GT710 card and the nvidia driver module are installed. It does not work with the nouveau driver module and the same card.
Does anybody have any insight as to the cause? A regression in Fedora, or a latent bug in DraftSight, or anything else?
Knowing whether it works with Fedora 26 and AMD graphics might be very helpful.
Edit March 2018
Doesn't work but differently on a system with AMD R5 230. No "Could not parse" errors, not anything else on the terminal window, but Draftsight starts up with the display all wrong and then locks up. Clicking the "X" gets to "the program is not responding".
Also worth noting that this isn't a Wayland issue. Systems are running Cinnamon and lightdm, so it's good old X.
Also a work-around, if performance is unimportant. (And it probably isn't, with Gen 4 Intel Graphics). Run it as a "remote" application on localhost, on a system with Intel graphics.
$ ssh -X 127.0.0.1
password:
Last login: Wed Mar ...
-bash-4.4$ /opt/dassault-systemes/DraftSight/Linux/DraftSight
(success)
Further update Fedora 29, DraftSight 2018SP3
New wrinkles for Nvidia, Cinnamon as above
Needs invocation
LD_PRELOAD=/usr/lib64/libfreetype.so.6 /opt/dassault-systemes/DraftSight/Linux/DraftSight
otherwise fails with /lib64/libfontconfig.so.1 lookup error FT_DOne_MM_Var
Also kernel 4.20 plus NVidia 390.87 fails to build. There's a patched NVidia installer that does work at if_not_false_then_true site.
Also does not install a .desktop file into /usr/share/applications
I had similar problems when I updated Fedora 24 to 25. The parse stylesheet messages still show up but I can run draftsight from an Xorg session, (not Wayland), using the nouveau drivers but only under root privileges using sudo .
You might try the following script:
sudo DISPLAY=$DISPLAY vblank_mode=1 /opt/dassault-systemes/DraftSight/Linux/DraftSight
I can only get DraftSight to run as root under Fedora 27, 4.18.16-100.fc27.x86_64. I have installed a VM with Ubuntu, and it runs fine, without elevated privileges.

multi process multi GPU with tensorflow, windows

I'm little bit new with tensor-flow.. so please be gentle with me..
I have problem with creating second process that load tensorflow on already working GPU.
the error I get is:
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
\cuda\cuda_dnn.cc:392] error retrieving driver version: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
\cuda\cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
\kernels\conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
Hardware details :
super micro - 4028GR-TRT
8 GPU's 1080
CUDA: 8
cudnn: 5.1
windows: 10
tensorflow: 0.12.1 / 1.0.1
My PC shouldn't be a problem
windows 7
gpu 1070
cuda 8
cudnn 5.1
tensorflow 0.12.1
Can someone tell me why on my PC everything is ok but not on the big one(supermicro)?
is this windows / driver issues maybe?
I try to update NVIDIA driver.. no help on that ..
TensorFlow is not always good at sharing GPUs with other processes (including other instances of itself!). The typical workaround is to use the %CUDA_VISIBLE_DEVICES% environment variable to prevent the two processes from clashing over the same GPU. For example:
C:\>set CUDA_VISIBLE_DEVICES=0
C:\>python tensorflow_program_1.py
While in another command prompt you could tell TensorFlow to use a different GPU as follows:
C:\>set CUDA_VISIBLE_DEVICES=1
C:\>python tensorflow_program_2.py

How to Install the CUDA Driver for TensorFlow (installing from source)

I'm trying to build TensorFlow from source and run it with GPU support. To install the toolkit I use the runfile, to install the driver I used the Additional Drivers Tool, since I did not get Ubuntu to boot into Text mode as specified in the CUDA documentation and stop lightdm and start lightdm does not work either, it gives me (also with sudo):
Name com.ubuntu.Upstart does not exist
So far I could build a release from the TensorFlow repository. However, when I'm trying to run the example as specified in the how-to
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
the GPU apparently cannot be found:
jonas#jonas-Aspire-V5-591G:~/Documents/repos/tensoflow_fork$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version
4.9.2 (Ubuntu 4.9.2-10ubuntu13) """
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])
Aborted
I'm using a clean Ubuntu 15.04 installation on an Acer Notebook with the GTX950M.
Can anybody tell me how to properly install the driver?
Can you run deviceQuery (comes with cuda installation)? Can you see nvidia present in lspci/lsmod/nvidia-smi?
lsmod |grep nvidia
dmesg | grep -i nvidia
lspci | grep -i nvidia
nvidia-smi
You can reload nvidia module and look for error messages
modprobe -r nvidia
dmesg | tail
sudo dmesg | grep NVRM
Related issue https://github.com/tensorflow/tensorflow/issues/601

Cannot run CUDA code that queries NVML - error regarding libnvidia-ml.so

Recently a colleague needed to use NVML to query device information, so I downloaded the Tesla development kit 3.304.5 and copied the file nvml.h to /usr/include. To test, I compiled the example code in tdk_3.304.5/nvml/example and it worked fine.
Over a weekend, something changed in the system (I cannot determine what was changed and I am not the only one with access to the machine) and now any code that uses nvml.h, such as the example code, fails with the following error:
Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
However, I can still run nvidia-smi and read information about my K20m's state, and as far as I am aware nvidia-smi is just a set of calls to nvml.h. The error message I receive is somewhat cryptic, but I believe it is telling me that the nvidia-ml.so file needs to match the Tesla driver that I have installed on my system. Just to ensure everything is correct, I re-downloaded CUDA 5.0 and installed the driver, CUDA runtime, and the test files. I am certain that the nvidia-ml.so file matches the driver (both are 304.54) so I am quite confused as to what could be going wrong. I can compile and run the test code with nvcc as well as run my own CUDA code, as long as it doesn't include nvml.h.
Has anyone encountered this error or have any thoughts on rectifying the issue?
$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54
$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
$ whereis nvml.h
nvml: /usr/include/nvml.h
$ ldd example
linux-vdso.so.1 => (0x00007fff2da66000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
/lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
EDIT: The solution was to remove all extra instances of libnvidia-ml.so. For some reason there were a LOT of them.
$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1
You are getting this error because the application that is trying to use nvml is loading the stub library that is located in:
...tdk_install_path/lib64/libnvidia-ml.so
instead of the one in:
/usr/lib64/libnvidia-ml.so
I was able to reproduce your error when I added the stub library path to my LD_LIBRARY_PATH environment variable. So that is one possible source of error, if someone added the path of the stub library that comes with the tdk distribution to your LD_LIBRARY_PATH environment variable, but probably not the only way this could happen. If someone in an unusual fashion copied the stub library to some system path, that might also be an issue.
You'll need to try and figure out why your system is loading that stub library in place of the correct one in /usr/lib64. Alternatively, for discovery purposes, you could try deleting all instances of the stub library anywhere on your system (leave the correct libraries in /usr/lib and /usr/lib64 alone), and you should be able to observe correct behavior.
I solved the problem this way on a GTX 1070 using windows 10 : go to device manager, select the GPU that is having a problem, disable the GPU and enable back.
I was having this same or similar issue with EWBF Cuda Miner for zCash.
Here is a way to automatically implement Pro7ech's answer (which worked for me) for WIN10:
Install WDK for Windows 10 if you don't already have it: This will give you the ability to use devcon.exe which allows manipulation of devices via batch scripts:
https://learn.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk
You might also need the Windows SDK if you don't have visual studio with Desktop development with C++ workload:
https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk
To make things easier, you might want to add the installation path to your PATH environment variable:
https://www.howtogeek.com/118594/how-to-edit-your-system-path-for-easy-command-line-access/
Devcon.exe was installed here for me:
C:\Program Files (x86)\Windows Kits\10\Tools\x64
So now run this or similar in a cmd.exe prompt to get the device id:
devcon findall * | find /i "nvidia"
Here is what mine looks like:
C:\Users\Soenhay>devcon findall * | find /i "nvidia"
HDAUDIO\FUNC_01&VEN_10DE&DEV_0083&SUBSYS_38426674&REV_1001\5&1C277AD4&0&0001: NVIDIA High Definition Audio
SWD\MMDEVAPI\{0.0.0.00000000}.{574980C3-9747-42EF-A78C-4C304E070B81}: SAMSUNG (NVIDIA High Definition Audio)
ROOT\UNNAMED_DEVICE\0000 : NVIDIA Virtual Audio Device (Wave Extensible) (WDM)
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000: NVIDIA GeForce GTX 1070
From that I see that my graphics device id is:
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000
So I create a batch file with the following to disable and re-enable the driver:
devcon disable "#PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
devcon enable "#PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
Now, when I get the NVML error when starting the miner I just run this batch file and it fixes it. You could also just add those 2 lines to the beginning of your start.bat file to do this every time but I found that the error does not always happen every time I restart the miner time now.
References:
superuser post
devcon commands
devcon examples
No matching devices found.
NOTE:
The command should have the # symbol at the beginning of the device id.
The batch script should be run as administrator.
I have faced the same error.
Found a solutions is to run command:
nvidia-uninstall