Can I update cuda version through conda? - cuda

My current:
nvidia-smi
Wed Aug 4 01:40:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:0C.0 Off | 0 |
| N/A 34C P0 37W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:0D.0 Off | 0 |
| N/A 34C P0 36W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:0E.0 Off | 0 |
| N/A 33C P0 39W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:0F.0 Off | 0 |
| N/A 37C P0 41W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I want to install Tensorflow 2.3/2.4, so I need to upgrade cuda to 10.1 at least in Conda. I know how to install cudakit in conda:
conda install cudatoolkit=10.1
But this seems not enough:
Status: CUDA driver version is insufficient for CUDA runtime version
If I want to keep the old version cuda 10.0, can I update cuda to 10.1 through Conda? This won't work:
conda install cuda=10.1
I am using Python 3.8. If I can't keep cuda 10.0, how to directly upgrade cuda to 10.1 with or without conda? It's best if I can upgrade in Conda.
ADDITION:
I installed cudatoolkit=10.1, but the cuda driver still not good. My conda env list shows:
cudatoolkit 10.1.243 h6bb024c_0
tensorflow-gpu 2.3.0 pypi_0 pypi
The following test is good:
import tensorflow as tf
2021-08-04 04:21:31.110443: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
In [3]: print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
2021-08-04 04:21:34.499432: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-08-04 04:21:34.665738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.666369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:0c.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 04:21:34.666459: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.667017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:00:0d.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 04:21:34.667064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.667613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 2 with properties:
pciBusID: 0000:00:0e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-04 04:21:34.667644: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-08-04 04:21:34.670275: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-08-04 04:21:34.672971: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-08-04 04:21:34.673378: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-08-04 04:21:34.676043: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-08-04 04:21:34.677370: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-08-04 04:21:34.681850: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-08-04 04:21:34.681989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.682604: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.683196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.683782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.684353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.684961: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-04 04:21:34.685513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2
Num GPUs Available: 3
But the following test failed:
import tensorflow as tf
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
with tf.Session() as sess:
print (sess.run(c))
The error message:
2021-08-04 04:27:30.934969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1, 2
2021-08-04 04:27:30.935028: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
......
InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
If this statement is true, why my installation is still bad, because I already installed cudatoolkit=10.1 in Conda:
If you want to install a GPU driver, you could install a newer CUDA toolkit, which will have a newer GPU driver (installer) bundled with it.
cudatoolkit and cuda driver still not match?

No, you can't update the GPU driver via conda, and that is what is needed in your case to support CUDA 10.1 or something newer. See here:
Anaconda requires that the user has installed a recent NVIDIA driver that meets the version requirements in the table below.
(the up-to-date table is here)
If you want to install a GPU driver, you could install a newer CUDA toolkit, which will have a newer GPU driver (installer) bundled with it. Or you can retrieve a driver here and install it. By newer CUDA toolkit, I mean the CUDA toolkit installers provided by NVIDIA, which are available here, not via conda. You cannot do the driver update via conda.
I suggest you study the CUDA linux install guide, because the methodology used to install the previous driver (runfile or package manager) is probably the one you want to use for your next driver.
As an alternative (for example if you don't have or can't get admin access to the system), you can investigate CUDA forward compatibility. (This may also be of interest regarding compatibility.)

Related

CUDA is installed, but PyTorch v1.13 on Windows 10 not working. PyTorch not using GPU; how to fix PyTorch out of sync with CUDA 11.x drivers?

How can I find where CUDA 11.x for PyTorch-GPU 1.13 get installed on Windows 10 on my computer?
What I tried:
I installed the NVIDIA CUDA drivers and toolkit for Windows from the NVIDIA website. I can verify this by typing: !nvidia-smi in Jupyter Lab, which gives me the following output. This indicates that the CUDA tools are installed, but not being used by my PyTorch package. I need to find out what version of CUDA drivers are installed so I can install the correct PyTorch-GPU package.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 513.63 Driver Version: 513.63 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2000 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 N/A / N/A | 0MiB / 4096MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I find many Ubuntu questions and answers for locating CUDA to add it to my PATH, but nothing specific for Windows 10.
For example:
Pytorch CUDA installation fails,
Pytorch CUDA installation using conda,
pytorch-says-that-cuda-is-not-available
What are the equivalent Python commands on Windows 10 to locate the CUDA 11.x toolkits and driver version that my PyTorch-GPU package must use? And then how to fix the problem if PyTorch is out of sync?
I am answering my own question here...
PyTorch-GPU must be compiled against specific CUDA binary drivers.
I finally found this hint Why torch.cuda.is_available() returns False even after installing pytorch with cuda? which identifies the issue.
import torch
torch.zeros(1).cuda()
The return value clearly identifies the problem.
AssertionError Traceback (most recent call last)
Cell In [222], line 2
1 import torch
----> 2 torch.zeros(1).cuda()
File C:\ProgramData\Anaconda3\envs\tf210_gpu\lib\site-packages\torch\cuda\__init__.py:221, in _lazy_init()
217 raise RuntimeError(
218 "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
219 "multiprocessing, you must use the 'spawn' start method")
220 if not hasattr(torch._C, '_cuda_getDeviceCount'):
--> 221 raise AssertionError("Torch not compiled with CUDA enabled")
222 if _cudart is None:
223 raise AssertionError(
224 "libcudart functions unavailable. It looks like you have a broken build?")
AssertionError: Torch not compiled with CUDA enabled
The problem is: "Torch not compiled with CUDA enabled"
Now I have to see if I can just re-install PyTorch-GPU to replace the current PyTorch-CPU version with one that is compiled against my CUDA CUDA-GPU v11.6 driver, without rebuilding the entire conda environment. I would rather not rebuild the conda environment from scratch unless it is really necessary.

How to convert a U-Net segmentation model to TensorRT on NVIDIA Jetson Nano ? (process killed error)

I trained a U-Net segmentation model with Keras (using TF backend).
I am trying to convert its frozen graph (.pb) to TensorRT format on the Jetson Nano but the process is killed (as seen below). I’ve seen on other posts that it could be related to an « out of memory » problem.
To be known, I already have an SSD MobileNet V2 model running on the Jetson Nano. 
If I stop the systemctl, I can make inference with the U-Net model without converting it to TensorRT (just using the frozen graph model loaded with Tensorflow). As this way doesn't work when I start the systemctl (so when the other neural network is running), I try to convert my U-Net segmentation model to TensorRT to get an optimized version of it (which failed because of a killed process), but it may not be the right way to do this.
Is it possible to run two neural networks on a Jetson Nano ? Is there any other way to do this ? 
For information, here is the way I try to convert the frozen graph to TensorRT : 
trt_graph = trt.create_inference_graph(
input_graph_def=frozen_graph_gd, # Pass the parsed graph def here
outputs=['conv2d_24/Sigmoid'],
max_batch_size=1,
max_workspace_size_bytes=1 << 32, # I have tried 25 and 32 here
precision_mode='FP16'
)
And here is when the process is killed (conversion of the U-Net frozen graph to TensorRT) :
2020-10-05 16:00:58.200269: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2020-10-05 16:01:11.976893: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-10-05 16:01:11.994472: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From convert_pb_to_tensorrt.py:14: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
2020-10-05 16:01:13.678101: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-10-05 16:01:15.506432: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-05 16:01:15.512224: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:15.512359: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-10-05 16:01:15.512638: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-10-05 16:01:15.532712: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-10-05 16:01:15.533264: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x328fd900 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-05 16:01:15.533318: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-05 16:01:15.632451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:15.632757: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x30d0edb0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-05 16:01:15.632808: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2020-10-05 16:01:15.633163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:15.633276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1634] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2020-10-05 16:01:15.633348: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-10-05 16:01:15.633500: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-05 16:01:15.716786: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-05 16:01:15.903326: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-05 16:01:16.060655: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-05 16:01:16.141950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-05 16:01:16.142219: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-10-05 16:01:16.142553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:16.142878: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:16.142991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1762] Adding visible gpu devices: 0
2020-10-05 16:01:16.143133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-10-05 16:01:27.700226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1175] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-05 16:01:27.700377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] 0 
2020-10-05 16:01:27.700417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1194] 0: N 
2020-10-05 16:01:27.713559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:27.713897: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] ARM64 does not support NUMA - returning NUMA node zero
2020-10-05 16:01:27.714101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 200 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
Killed
If the model has unsupported layers, converting to tensor RT won't be achieved. If it's the case, using tensorflow's version or TRT can yield results as this version handles well unsupported layers (they will be handled by tensorflow alongside your tens rt converted layers).
Hope the answer is close to your problem. Tensor rt is a messy ecosystem

Problems with installing nvidia grid driver

I want to use gpu acceleration for my android emulator in a compute engine instance.
I added tesla t4 gpu and now trying to install the gpu grid driver according to here.
I use ubuntu 20. please advise
https://cloud.google.com/compute/docs/gpus/install-grid-drivers
I get an error:
in file included from /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nv-rsync.c:24:
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/common/inc/nv-linux.h:1775:6: error: "NV_BUILD_MODULE_INSTA
NCES" is not defined, evaluates to 0 [-Werror=undef]
1775 | #if (NV_BUILD_MODULE_INSTANCES != 0)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
c1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:275: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nv_uvm_int
erface.o] Error 1
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_linux.c: In function ‘nvlink_sleep’:
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_linux.c:570:5: error: implicit declaration of
function ‘do_gettimeofday’; did you mean ‘efi_gettimeofday’? [-Werror=implicit-function-declaration]
570 | do_gettimeofday(&tm_aux);
| ^~~~~~~~~~~~~~~
| efi_gettimeofday
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:275: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_lin
ux.o] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1731: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-1021-gcp'
make: *** [Makefile:79: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find sug
gestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.co
m.
(END)
The document you are using to install NVIDIA GRID® drivers for virtual workstations, only contains examples of the commands needed to install the GRID drivers.
The example contained in that guide, is for installing the NVIDIA 410.92 driver, this driver is for GRID7.1, but I recommend to use the latest version of GRID, you can consult the following table to see the drivers available.
I’ve reproduced this scenario on my own project and I was able to install GRID11.0, using the NVIDIA 450.51.05 driver.
I’m using an instance with the following characteristics:
Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
GPUs: 1 x NVIDIA Tesla T4
OS ubuntu-minimal-2004-focal-v20200702
Keep in mind that you need to have the option Enable Virtual Workstation (NVIDIA GRID) enabled at the creation moment to avoid issues.
I used the following commands for this installation:
user#instance-1:~$ curl -O https://storage.googleapis.com/nvidia-drivers-us-public/GRID/GRID11.0/NVIDIA-Lin
ux-x86_64-450.51.05-grid.run
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 139M 100 139M 0 0 72.2M 0 0:00:01 0:00:01 --:--:-- 72.1M
user#instance-1:~$ sudo bash NVIDIA-Linux-x86_64-450.51.05-grid.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 450.51.05.....................................
................................................................................................................
................................................................................................................
................................................................................................................
................................................................................................................
........................................................................
user#instance-1:~$ nvidia-smi
Mon Jul 27 21:11:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 |
| N/A 73C P8 21W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
In my case I needed to install some dependencies like the gcc compiler, and I only used the command
$ sudo apt install build-essential
I hope this information is useful for you.

Caffe-SSD installed with GPU support thorows error "Cannot use GPU in CPU-only Caffe"

I have installed caffe-ssd with OpenCV version 3.2.0, CUDA version 9.2.148 and CuDNN version 7.2.1.38.
These are my settings in Makefile.config
# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1
# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1
# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
/usr/local/lib/python2.7/dist-packages/numpy/core/include
# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial/
All tests were passed.
[----------] Global test environment tear-down
[==========] 1266 tests from 168 test cases ran. (45001 ms total)
[ PASSED ] 1266 tests.
Thereafter I follow this link for SSD. The LMDB creation works without a problem but when I run
python examples/ssd/ssd_pascal.py
I get the following error
I0820 14:16:29.089138 22429 caffe.cpp:217] Using GPUs 0
F0820 14:16:29.089301 22429 common.cpp:66] Cannot use GPU in CPU-only Caffe: check mode.
*** Check failure stack trace: ***
# 0x7f97322a00cd google::LogMessage::Fail()
# 0x7f97322a1f33 google::LogMessage::SendToLog()
# 0x7f973229fc28 google::LogMessage::Flush()
# 0x7f97322a2999 google::LogMessageFatal::~LogMessageFatal()
# 0x7f973284f8a0 caffe::Caffe::SetDevice()
# 0x55b05fe50dcb (unknown)
# 0x55b05fe4c543 (unknown)
# 0x7f9730ae3b97 __libc_start_main
# 0x55b05fe4cffa (unknown)
Aborted (core dumped)
I have an NVIDIA GeForce GTX 1080 Ti graphics card.
Mon Aug 20 14:26:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.51 Driver Version: 396.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 44% 37C P8 19W / 250W | 18MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1356 G /usr/lib/xorg/Xorg 9MiB |
| 0 1391 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
I've tried compiling a simple Cuda code with nvcc and run it without any problem. I'm able to import caffe without any issue.
I have checked this question and that's not my problem.
for the error error == cudaSuccess (7 vs. 0)
change from gpus = "0,1,2,3" to gpus = "0" in ssd_pascal.py and also check the path of cuda in CUDA_DIR in Makefile.config and update it with the proper path and version that is installed in your system.
and for error “Cannot use GPU in CPU-only Caffe” build the ssd again using make test command

Caffe installation

I'm installing Caffe.
I'm using Ubuntu 14.04.
I tried to install cuda. On Caffe site is written that I need to install the library and the latest standalone driver separately.
I downloaded driver from there. I tried every product type, but I get the same error:
You do not appear to have an NVIDIA GPU supported by the 346.46
NVIDIA Linux graphics driver installed in this system. For further
details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in
the README available on the Linux driver download page at www.nvidia.com.
And then
You appear to be running an X server; please exit X before
installing. For further details, please see the section INSTALLING
THE NVIDIA DRIVER in the README available on the Linux driver
download page at www.nvidia.com.
And
Installation has failed. Please see the file
'/var/log/nvidia-installer.log' for details. You may find
suggestions on fixing installation problems in the README available
on the Linux driver download page at www.nvidia.com.
I successfuly installed cuda and cuDNN.
Then I downloaded Caffe from here.
Then I tried to compile and after I did make all and make test,
I did make runtest and get this error:
Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
Also I found that I need to verify that I have a CUDA-Capable GPU.
This command: lspci | grep -i nvidia doesn't return anything. update-pciids doesn't help neither, though it returns Downloaded daily snapshot dated.
Can anyone help me install Caffe and everything correctly?
Your system apparently does not have a CUDA compatible GPU. Depending on what type of system you are using (most likely a desktop or server with appropriate free PCI-e slot(s), case space, and sufficient power supply capacity), it might be possible to purchase and install such a GPU.
Still you can get started with Caffe, by not using GPU by uncommenting CPU_ONLY flag in Makefile.config
Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device
is detected
Assuming you have a GPU card, the above error can come if NVDIA Driver is not installed / used by the system.
Please check this link - https://askubuntu.com/questions/670485/how-to-inspect-the-currently-used-nvidia-driver-version-and-switch-it-to-another
Check the latest driver version from Nvidia site for your card. Then add the relevant repository and install via that. Better to restart
sudo apt-add-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-3xx
sudo modeporbe nvidia (also ran this before restart)
Check via nvidia-smi command
alex#alex-Lenovo-G400s-Touch:~$ nvidia-smi
Tue Feb 28 15:10:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 720M Off | 0000:01:00.0 N/A | N/A |
| N/A 51C P0 N/A / N/A | 271MiB / 1985MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Install samples and test via deviceQuery after making the samples --> http://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 720M"
...
After that Reconfigure Caffe and do a clean make
Below are the CMake settings for reference and the CMake file http://pastebin.com/qAd40uvh
Probably you don't have CUDA compatible card. Also, you may have it, but you are not using it. i.e. If you have a NVidia card and Integrated graphics, you should make sure your monitor has plugged in your NVidia Card output interface.
You should make sure that your graphics card indeed support CUDA at http://www.geforce.com/hardware/technology/cuda/supported-gpus?field_gpu_type_value=All. Find your graphics card in this list, until you find your card.
p.s. To find your graphics card info, you can run lspci | grep VGA in the shell.