make error in Caffe (I guess related to OpenMP, gcc)

I am building https://github.com/liulei01/DRBox and the error appears in the output of make -j8.
OS : ubuntu 14.04
cuda : 7.5
cudnn : v5.1
gcc : 4.6.4
What can I do?

Related

fail to link cuda example with clang++-9 under Ubuntu 18.04

I am trying to follow the example in
https://llvm.org/docs/CompileCudaWithLLVM.html#invoking-clang
I use Ubuntu 18.04.3 LTS, clang version 9.0.0-2
The device I have is (snippet from the output of deviceQuery):
Detected 1 CUDA Capable device(s)
Device 0: "Quadro P520"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 6.1
I ran the command:
clang++-9 --verbose --cuda-path=/usr/local/cuda-10.2 axpy.cu -o axpy --cuda-gpu-arch=sm_61 -L/usr/local/cuda-10.2 -lcudart_static -ldl -lrt -pthread
And the output is:
clang version 9.0.0-2~ubuntu18.04.1 (tags/RELEASE_900/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/8
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/7.4.0
Candidate multilib: .;#m64
Selected multilib: .;#m64
Found CUDA installation: /usr/local/cuda-10.2, version unknown
clang: error: cannot find libdevice for sm_61. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice.
As far as I can tell, libdevice is right where it should be:
~>ls /usr/local/cuda-10.2/nvvm/libdevice/
libdevice.10.bc
What am I doing wrong ?
Added Nov 2020:
Following @ArtemB's comment, I tried running it with clang++-10, which issues a warning but compiles and runs just fine.
Short answer: The version of CUDA that my driver supports (10.2) is too new for my clang (9.0.0).
Here is the top of the output of nvidia-smi on my machine:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
So my driver indeed supports cuda-10.2. However, it seems this version is not supported by clang 9.0.0. Indeed when running the above command with the extra flag -nocudalib , one gets the following response (only showing the last lines):
In file included from <built-in>:1:
/usr/lib/llvm-9/lib/clang/9.0.0/include/__clang_cuda_runtime_wrapper.h:52:2: error: "Unsupported CUDA version!"
#error "Unsupported CUDA version!"
^
axpy.cu:23:7: error: use of undeclared identifier cudaConfigureCall
axpy<<<1, kDataLen>>>(a, device_x, device_y);
^
2 errors generated when compiling for sm_61.
When inspecting the offending file (the clang cuda runtime wrapper), one sees the following in lines 48-53:
#include "cuda.h"
#if !defined(CUDA_VERSION)
#error "cuda.h did not define CUDA_VERSION"
#elif CUDA_VERSION < 7000 || CUDA_VERSION > 10010
#error "Unsupported CUDA version!"
#endif
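To see why CUDA 10.2 trips this check, recall that cuda.h encodes CUDA_VERSION as major * 1000 + minor * 10, so 10.2 becomes 10020, which is above the 10010 upper bound. Below is a minimal Python sketch of the same range test. The function names are made up for illustration; only the two bounds come from the header excerpt above.

```python
def cuda_version_code(major: int, minor: int) -> int:
    """Encode a CUDA version the way cuda.h does for CUDA_VERSION."""
    return major * 1000 + minor * 10


def clang9_wrapper_accepts(major: int, minor: int) -> bool:
    """Mirror the range test quoted from __clang_cuda_runtime_wrapper.h.

    Hypothetical helper: it only reproduces the 7000..10010 bounds
    shown in the excerpt, nothing else clang does.
    """
    code = cuda_version_code(major, minor)
    return 7000 <= code <= 10010


if __name__ == "__main__":
    for major, minor in [(7, 0), (10, 1), (10, 2)]:
        code = cuda_version_code(major, minor)
        verdict = "accepted" if clang9_wrapper_accepts(major, minor) else "rejected"
        print(f"CUDA {major}.{minor} -> CUDA_VERSION {code}: {verdict}")
```

Under these assumptions CUDA 10.1 passes and CUDA 10.2 does not, which matches the "Unsupported CUDA version!" error.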
Until recently clang was rather particular about CUDA versions. I've relaxed it a bit lately, so clang-10 is more lenient and will attempt to use a newer CUDA version at feature parity with the latest supported CUDA version (currently 10.1). It will also issue a warning. It works with CUDA-11.0 well enough to compile TensorFlow.
CUDA-11.1 (and I believe 11.0 update1 on windows) have dropped the version.txt file from the distribution and that will break CUDA compilation with the currently released clang versions, again. This should be fixed in clang-11.0.1 when it's released (version match with CUDA is purely coincidental).

multi process multi GPU with tensorflow, windows

I'm fairly new to TensorFlow, so please be gentle with me.
I have a problem creating a second process that loads TensorFlow on a GPU that is already in use.
The error I get is:
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
\cuda\cuda_dnn.cc:392] error retrieving driver version: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
\cuda\cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
\kernels\conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
Hardware details:
Supermicro 4028GR-TRT
GPUs: 8 x 1080
CUDA: 8
cudnn: 5.1
windows: 10
tensorflow: 0.12.1 / 1.0.1
My PC, which does not have the problem:
windows 7
gpu 1070
cuda 8
cudnn 5.1
tensorflow 0.12.1
Can someone tell me why everything is OK on my PC but not on the big machine (the Supermicro)?
Could this be a Windows or driver issue?
I tried updating the NVIDIA driver, but that did not help.
TensorFlow is not always good at sharing GPUs with other processes (including other instances of itself!). The typical workaround is to use the %CUDA_VISIBLE_DEVICES% environment variable to prevent the two processes from clashing over the same GPU. For example:
C:\>set CUDA_VISIBLE_DEVICES=0
C:\>python tensorflow_program_1.py
While in another command prompt you could tell TensorFlow to use a different GPU as follows:
C:\>set CUDA_VISIBLE_DEVICES=1
C:\>python tensorflow_program_2.py
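The same separation can be done from a single launcher script instead of two command prompts. This is a minimal sketch, assuming each worker should see exactly one GPU. The helper names are hypothetical, and the child command here is a stand-in that only prints the variable; in practice it would be something like the tensorflow_program_1.py script above. The variable has to be in the child's environment before CUDA is initialized in that process.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index: int) -> dict:
    """Return a copy of the current environment restricted to one GPU."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(args: list, gpu_index: int) -> subprocess.Popen:
    """Start a child process that can only see the given GPU."""
    return subprocess.Popen(args, env=env_for_gpu(gpu_index))


if __name__ == "__main__":
    # Stand-in child: prints the GPU list it was given.
    child = [
        sys.executable,
        "-c",
        "import os; print('visible GPUs:', os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    procs = [launch_on_gpu(child, gpu) for gpu in (0, 1)]
    for proc in procs:
        proc.wait()
```

Each child receives its own copy of the environment, so the parent process and the other workers are not affected.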

How to Install the CUDA Driver for TensorFlow (installing from source)

I'm trying to build TensorFlow from source and run it with GPU support. I installed the toolkit from the runfile and the driver through the Additional Drivers tool, because I could not get Ubuntu to boot into text mode as the CUDA documentation specifies. Running stop lightdm and start lightdm does not work either; even with sudo it gives me:
Name com.ubuntu.Upstart does not exist
So far I could build a release from the TensorFlow repository. However, when I'm trying to run the example as specified in the how-to
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
the GPU apparently cannot be found:
jonas@jonas-Aspire-V5-591G:~/Documents/repos/tensoflow_fork$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: jonas-Aspire-V5-591G
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 GCC version: gcc version
4.9.2 (Ubuntu 4.9.2-10ubuntu13) """
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])
Aborted
I'm using a clean Ubuntu 15.04 installation on an Acer Notebook with the GTX950M.
Can anybody tell me how to properly install the driver?
Can you run deviceQuery (comes with cuda installation)? Can you see nvidia present in lspci/lsmod/nvidia-smi?
lsmod |grep nvidia
dmesg | grep -i nvidia
lspci | grep -i nvidia
nvidia-smi
You can reload the nvidia module and look for error messages:
sudo modprobe -r nvidia
sudo modprobe nvidia
dmesg | tail
sudo dmesg | grep NVRM
Related issue https://github.com/tensorflow/tensorflow/issues/601

nvcc -arch sm_52 gives error "Value 'sm_52' is not defined for option 'gpu-architecture'"

I updated my CUDA toolkit from 5.5 to 6.5. Since then, the following command
nvcc -arch=sm_52
gives me an error:
nvcc fatal : Value 'sm_52' is not defined for option 'gpu-architecture'
Is this a bug, or does nvcc 6.5 not support the Maxwell virtual architecture?
CUDA Toolkit 6.5 was released before sm_52 architecture came into production.
After the arrival of sm_52 architecture, an update to CUDA 6.5 was released which enabled nvcc to generate code for sm_52.
Make sure you download the newer version of CUDA Toolkit 6.5.
P.S: I would rather use the latest version of toolkit (currently 7.0).

CUDA Runtime API error 38: no CUDA-capable device is detected

The Situation
I have a 2-GPU server (Ubuntu 12.04) where I swapped a Tesla C1060 for a GTX 670. Then I installed CUDA 5.0 over 4.2. Afterwards I compiled all examples except simpleMPI without error. But when I run ./deviceQuery I get the following error message:
foo@bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
What I have tried
To solve this I tried all of the things recommended in "CUDA-capable device", but to no avail:
/dev/nvidia* is there and the permissions are 666 (crw-rw-rw-) and owner root:root
foo@bar-serv2:/dev$ ls -l nvidia*
crw-rw-rw- 1 root root 195, 0 Oct 24 18:51 nvidia0
crw-rw-rw- 1 root root 195, 1 Oct 24 18:51 nvidia1
crw-rw-rw- 1 root root 195, 255 Oct 24 18:50 nvidiactl
I tried executing the code with sudo
CUDA 5.0 installs driver and libraries at the same time
PS here is lspci | grep -i nvidia:
foo@bar-serv2:/dev$ lspci | grep -i nvidia
03:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 670] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GK104 HDMI Audio Controller (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation G94 [Quadro FX 1800] (rev a1)
[update]
foo@bar-serv2:~/NVIDIA_CUDA-5.0_Samples/bin/linux/release$ nvidia-smi -a
NVIDIA: API mismatch: the NVIDIA kernel module has version 295.59,
but this NVIDIA driver component has version 304.54. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
Failed to initialize NVML: Unknown Error
How could that be, if I used the CUDA 5.0 installer to install the driver and libs at the same time? Could the old 4.2 version, which is still lying around, mess things up?
I came across this issue, and running
nvidia-smi
informed me of an API mismatch. The problem was that my Linux distro had installed updates that required a system restart, so restarting resolved the issue.
See this Stack Overflow question: Installing cuda 5 samples in Ubuntu 12.10.
Ubuntu 12 is not a supported Linux distro (yet). For reference, see the CUDA 5.0 Toolkit Release Notes And Errata:
** Distributions Currently Supported
Distribution       32 64  Kernel                GCC         GLIBC
-----------------  -- --  --------------------  ----------  -------------
Fedora 16          X  X   3.1.0-7.fc16          4.6.2       2.14.90
ICC Compiler 12.1     X
OpenSUSE 12.1         X   3.1.0-1.2-desktop     4.6.2       2.14.1
Red Hat RHEL 6.x      X   2.6.32-131.0.15.el6   4.4.5       2.12
Red Hat RHEL 5.5+     X   2.6.18-238.el5        4.1.2       2.5
SUSE SLES 11 SP2      X   3.0.13-0.27-pae       4.3.4       2.11.3
SUSE SLES 11.1     X  X   2.6.32.12-0.7-pae     4.3.4       2.11.1
Ubuntu 11.10       X  X   3.0.0-19-generic-pae  4.6.1       2.13
Ubuntu 10.04       X  X   2.6.35-23-generic     4.4.5       2.12.1
If you want to run it on Ubuntu 12 anyway, see rpardo's answer. It looks like this distro installs 64-bit libraries to /usr/lib/x86_64-linux-gnu/ instead of /usr/lib64.
I'd suggest searching for all instances of libcuda.so and libnvidia-ml.so on the system. Since the driver doesn't support this distro, it might have installed the libraries to a path that is not listed in LD_LIBRARY_PATH. Then move the libraries around and/or change LD_LIBRARY_PATH to point to this location (it should be the first path on the left). Then retry nvidia-smi or deviceQuery.
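The search for stray copies of the libraries can be scripted. This is a minimal sketch, assuming the libraries live somewhere under the usual /usr trees. The helper name and the list of root directories are illustrative choices, not something the installer documents.

```python
import os


def find_libraries(roots, prefixes=("libcuda.so", "libnvidia-ml.so")):
    """Return sorted paths of files whose names start with one of the prefixes."""
    hits = []
    for root in roots:
        # os.walk silently yields nothing for roots that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.startswith(prefixes):
                    hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Assumed search roots; adjust for the system at hand.
    for path in find_libraries(["/usr/lib", "/usr/lib64", "/usr/local"]):
        print(path)
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "(unset)"))
```

Comparing the printed directories with LD_LIBRARY_PATH shows whether the loader can reach the copies that were found. Names such as libcudart.so are deliberately not matched, because the prefix includes the ".so" part.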
Good luck
I got error 38 for cudaGetDeviceCount on a Windows machine with a GTX 980 GPU.
After I downloaded the latest driver for the GTX 980 from the NVIDIA site, installed it and restarted, everything was fine. It looks like the CUDA installer does not install the latest driver.
Try running the sample using sudo (or do a 'sudo su', set LD_LIBRARY_PATH to the path of the CUDA libraries and run the sample as root). Apparently, since you probably installed CUDA 5.0 using sudo, the samples don't run as a normal user. However, once you have run a sample as root, you'll be able to run samples as the regular user too! I have not yet restarted the system to see whether the samples still work as a normal user after a reboot, or whether you have to run at least one CUDA application as root each time.
The problem might disappear completely if you install the CUDA Toolkit without using sudo.
I had a very similar problem on Debian, and it turned out that the loaded nvidia module had a different version than libcuda1.
To check the version of the installed nvidia module, run:
$ sudo modinfo nvidia-current | grep version
version: 319.82
If it doesn't match the version of libcuda1, this is the root of your problems.
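This comparison can also be done on captured command output. Below is a minimal sketch, assuming the modinfo and nvidia-smi output formats quoted in this thread. Both helper names are hypothetical, and the regular expressions are fitted only to those two samples.

```python
import re


def parse_modinfo_version(text: str):
    """Extract the module version from `modinfo` output, or None if absent."""
    # Anchored at line start so that "srcversion:" is not matched.
    match = re.search(r"^version:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None


def parse_api_mismatch(text: str):
    """Return (kernel_module_version, driver_component_version) or None."""
    match = re.search(
        r"kernel module has version (\d+(?:\.\d+)*),\s*"
        r"but this NVIDIA driver component has version (\d+(?:\.\d+)*)",
        text,
    )
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    modinfo_out = "version:        319.82\n"
    smi_out = (
        "NVIDIA: API mismatch: the NVIDIA kernel module has version 295.59,\n"
        "but this NVIDIA driver component has version 304.54. Please make\n"
    )
    print("module version:", parse_modinfo_version(modinfo_out))
    kernel, userspace = parse_api_mismatch(smi_out)
    print("mismatch:", kernel, "vs", userspace, "->", kernel != userspace)
```

If the two versions differ, the kernel module that is loaded and the user-space libraries come from different driver releases, which is the situation described above.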