Caffe-SSD installed with GPU support throws error "Cannot use GPU in CPU-only Caffe"

I have installed caffe-ssd with OpenCV version 3.2.0, CUDA version 9.2.148 and cuDNN version 7.2.1.38.
These are my settings in Makefile.config:
# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1
# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1
# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
/usr/local/lib/python2.7/dist-packages/numpy/core/include
# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial/
All tests passed:
[----------] Global test environment tear-down
[==========] 1266 tests from 168 test cases ran. (45001 ms total)
[ PASSED ] 1266 tests.
Thereafter I followed this link for SSD. The LMDB creation works without a problem, but when I run
python examples/ssd/ssd_pascal.py
I get the following error
I0820 14:16:29.089138 22429 caffe.cpp:217] Using GPUs 0
F0820 14:16:29.089301 22429 common.cpp:66] Cannot use GPU in CPU-only Caffe: check mode.
*** Check failure stack trace: ***
# 0x7f97322a00cd google::LogMessage::Fail()
# 0x7f97322a1f33 google::LogMessage::SendToLog()
# 0x7f973229fc28 google::LogMessage::Flush()
# 0x7f97322a2999 google::LogMessageFatal::~LogMessageFatal()
# 0x7f973284f8a0 caffe::Caffe::SetDevice()
# 0x55b05fe50dcb (unknown)
# 0x55b05fe4c543 (unknown)
# 0x7f9730ae3b97 __libc_start_main
# 0x55b05fe4cffa (unknown)
Aborted (core dumped)
I have an NVIDIA GeForce GTX 1080 Ti graphics card.
Mon Aug 20 14:26:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.51 Driver Version: 396.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 44% 37C P8 19W / 250W | 18MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1356 G /usr/lib/xorg/Xorg 9MiB |
| 0 1391 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
I've tried compiling a simple CUDA program with nvcc and running it without any problem. I'm able to import caffe without any issue.
I have checked this question and that's not my problem.
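For reference, this one-liner should reproduce the same failure outside ssd_pascal.py (a sketch using the standard pycaffe calls caffe.set_mode_gpu and caffe.set_device; device 0 matches my single GPU):
python -c "import caffe; caffe.set_mode_gpu(); caffe.set_device(0)"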

For the error error == cudaSuccess (7 vs. 0): change gpus = "0,1,2,3" to gpus = "0" in ssd_pascal.py, and also check the CUDA path in CUDA_DIR in Makefile.config, updating it to the proper path and version installed on your system.
For the error "Cannot use GPU in CPU-only Caffe": rebuild SSD, running make test again, as sketched below.
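A minimal rebuild sketch, assuming a standard caffe-ssd checkout with the stock Makefile targets (the path and the -j value are placeholders):
cd ~/caffe-ssd   # placeholder: your caffe-ssd checkout
make clean
make -j8
make test -j8
make runtest     # optional: confirm the tests still pass
make pycaffe     # rebuild the Python bindings used by ssd_pascal.py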


How can I find where CUDA 11.x for PyTorch-GPU 1.13 gets installed on Windows 10 on my computer?
What I tried:
I installed the NVIDIA CUDA drivers and toolkit for Windows from the NVIDIA website. I can verify this by typing !nvidia-smi in Jupyter Lab, which gives me the following output. This indicates that the CUDA tools are installed but not being used by my PyTorch package. I need to find out which version of the CUDA drivers is installed so I can install the correct PyTorch-GPU package.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 513.63 Driver Version: 513.63 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P2000 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 N/A / N/A | 0MiB / 4096MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I find many Ubuntu questions and answers for locating CUDA to add it to my PATH, but nothing specific for Windows 10.
For example:
Pytorch CUDA installation fails,
Pytorch CUDA installation using conda,
pytorch-says-that-cuda-is-not-available
What are the equivalent Python commands on Windows 10 to locate the CUDA 11.x toolkit and driver version that my PyTorch-GPU package must use? And then, how do I fix the problem if PyTorch is out of sync?
I am answering my own question here...
PyTorch-GPU must be compiled against a specific CUDA driver/toolkit version.
I finally found this hint, Why torch.cuda.is_available() returns False even after installing pytorch with cuda?, which identifies the issue.
import torch
torch.zeros(1).cuda()
The return value clearly identifies the problem.
AssertionError Traceback (most recent call last)
Cell In [222], line 2
1 import torch
----> 2 torch.zeros(1).cuda()
File C:\ProgramData\Anaconda3\envs\tf210_gpu\lib\site-packages\torch\cuda\__init__.py:221, in _lazy_init()
217 raise RuntimeError(
218 "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
219 "multiprocessing, you must use the 'spawn' start method")
220 if not hasattr(torch._C, '_cuda_getDeviceCount'):
--> 221 raise AssertionError("Torch not compiled with CUDA enabled")
222 if _cudart is None:
223 raise AssertionError(
224 "libcudart functions unavailable. It looks like you have a broken build?")
AssertionError: Torch not compiled with CUDA enabled
The problem is: "Torch not compiled with CUDA enabled"
Now I have to see if I can simply reinstall PyTorch-GPU to replace the current PyTorch-CPU version with one compiled against my CUDA v11.6 driver, without rebuilding the entire conda environment. I would rather not rebuild the conda environment from scratch unless it is really necessary.
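A minimal sketch of that check and swap, assuming pip inside the activated conda environment and the cu116 wheel index that PyTorch documented for the 1.13 release (verify the exact command on pytorch.org for your setup). torch.version.cuda prints None on a CPU-only build:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
pip install --force-reinstall torch --extra-index-url https://download.pytorch.org/whl/cu116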

How do I know my GPU supports multiprocessing by default?

I came across this post:
How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?
But when I run ./mps_run before launching MPS, I get
kernel duration: 4.999370s
kernel duration: 5.012310s
And when I check nvidia-smi within those 5 seconds:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 28C P0 38W / 250W | 508MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
It looks like the GPU I am using supports multiprocessing somehow.
When I run nvidia-smi -i 2 -c EXCLUSIVE_PROCESS, it turned out No devices were found.
This is weird.
How do I know whether my GPU supports multiprocessing or not?
The GPU I am using: Tesla P100 (GP100GL)
In that post you linked, in the UPDATE section of my answer, I indicated that the GPU scheduler has changed in Pascal and beyond (your Tesla P100 is a Pascal GPU).
MPS is supported on all current NVIDIA GPUs.
The results you got are expected (in the non-MPS case) because the GPU scheduler allows both kernels to run in a time-sliced fashion. All currently supported CUDA GPUs support multiprocessing (in the Default compute mode). However, older GPUs (e.g. Kepler) would run the kernel from one process to completion, then the kernel from the other process. Pascal and newer GPUs will run the kernel from one process for a period of time, then the other process for a period of time, then the first process again, and so on, in a round-robin time-sliced fashion.
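As a side note on the "No devices were found" message: a plausible explanation (an assumption based on the nvidia-smi output you posted, which shows only GPU index 0) is that -i 2 addresses a GPU index that does not exist on your machine. A minimal sketch:
nvidia-smi -L                               # list GPUs and their indices
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # set the compute mode on GPU 0 instead
nvidia-smi -q -i 0 | grep "Compute Mode"    # verify the mode took effect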

NVIDIA Visual Profiler: Insufficient kernel bounds data

I am trying to get some insight into why my CUDA kernel has relatively low performance, and I am hoping to get some answers with the NVIDIA profiler.
My CUDA program is a 'boiled down' version of a larger application, isolating and exercising the kernel in question. The program launches the kernel several times in order to measure its execution time as a mean over multiple launches. After the timing loop, a memory copy from device to host is issued to make sure all kernel calls have finished. The program is written in CUDA C++.
This is how I built the program:
main.o: main.cu
	nvcc -res-usage -arch=sm_61 -c $<
main: main.o stopwatch.o
	g++ -o $@ $^ -lcudart -L/usr/local/cuda-11.0/lib64
This test was done on a PC with an Intel CPU and an NVIDIA GeForce GTX 1070. The OS is Ubuntu 20.04 with a freshly installed CUDA 11 from the NVIDIA website, along with driver 450.51.06:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 On | 00000000:01:00.0 On | N/A |
| 28% 38C P8 8W / 151W | 317MiB / 8111MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The following command was used to generate the profiling file:
sudo /usr/local/cuda-11.0/bin/nvprof -o main.nvvp --profile-from-start off ./main
I also tried profiling from the start, but it leads to the same issue below.
The following command was used to launch the visual profiler:
nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java main.nvvp
The Visual profiler walks me through several steps and when it comes to "Perform Kernel Analysis" the program tells me:
Insufficient kernel bounds data. The data needed to calculate compute,
memory, and latency bounds for the kernel could not be collected
Is this sort of detailed profiling not available on my GPU? (maybe because it's a gamer card)
nvprof by default will capture only a small amount of information in the output file it generates. This is enough to generate an application timeline, when the output file is imported into nvvp, but not enough information to enable all of the different capabilities of nvvp.
According to the documentation, the --analysis-metrics switch for nvprof is recommended for this type of use.
--analysis-metrics is referred to about 6 different times in the profiler documentation, so you may simply want to search for it to see all of the references and recommendations for its use.
Note that --analysis-metrics can capture a large amount of information. For a large, complex application, it may substantially increase the time the profiler spends processing data. Therefore, if you know specifically which data you are looking for, you may wish to specify particular metrics instead. Without --analysis-metrics, however, various nvvp analysis tools may not work correctly when you import the file.
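As a hedged example, reusing the nvprof invocation from the question with the switch added:
sudo /usr/local/cuda-11.0/bin/nvprof --analysis-metrics -o main.nvvp --profile-from-start off ./main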

Problems installing the NVIDIA GRID driver

I want to use GPU acceleration for my Android emulator in a Compute Engine instance.
I added a Tesla T4 GPU and am now trying to install the GRID driver according to this guide:
https://cloud.google.com/compute/docs/gpus/install-grid-drivers
I am using Ubuntu 20.04. Please advise.
I get an error:
in file included from /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nv-rsync.c:24:
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/common/inc/nv-linux.h:1775:6: error: "NV_BUILD_MODULE_INSTANCES" is not defined, evaluates to 0 [-Werror=undef]
1775 | #if (NV_BUILD_MODULE_INSTANCES != 0)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:275: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nv_uvm_interface.o] Error 1
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_linux.c: In function ‘nvlink_sleep’:
/tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_linux.c:570:5: error: implicit declaration of function ‘do_gettimeofday’; did you mean ‘efi_gettimeofday’? [-Werror=implicit-function-declaration]
570 | do_gettimeofday(&tm_aux);
| ^~~~~~~~~~~~~~~
| efi_gettimeofday
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:275: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel/nvidia/nvlink_linux.o] Error 1
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1731: /tmp/selfgz11598/NVIDIA-Linux-x86_64-410.92-grid/kernel] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-1021-gcp'
make: *** [Makefile:79: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
The document you are using to install NVIDIA GRID® drivers for virtual workstations only contains examples of the commands needed to install the GRID drivers.
The example contained in that guide installs the NVIDIA 410.92 driver, which is for GRID 7.1, but I recommend using the latest version of GRID; you can consult the following table to see the drivers available.
I’ve reproduced this scenario on my own project and was able to install GRID 11.0, using the NVIDIA 450.51.05 driver.
I’m using an instance with the following characteristics:
Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
GPUs: 1 x NVIDIA Tesla T4
OS: ubuntu-minimal-2004-focal-v20200702
Keep in mind that you need to have the option Enable Virtual Workstation (NVIDIA GRID) enabled at creation time to avoid issues; a sketch of an equivalent gcloud command follows.
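For reference, a sketch of creating a comparable instance with gcloud. The flags below are assumptions to adapt to your own project; the nvidia-tesla-t4-vws accelerator type is what selects the Virtual Workstation (NVIDIA GRID) option:
gcloud compute instances create grid-test \
  --zone=us-central1-a \
  --machine-type=n1-standard-1 \
  --accelerator=type=nvidia-tesla-t4-vws,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-minimal-2004-lts \
  --image-project=ubuntu-os-cloud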
I used the following commands for this installation:
user@instance-1:~$ curl -O https://storage.googleapis.com/nvidia-drivers-us-public/GRID/GRID11.0/NVIDIA-Linux-x86_64-450.51.05-grid.run
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 139M 100 139M 0 0 72.2M 0 0:00:01 0:00:01 --:--:-- 72.1M
user@instance-1:~$ sudo bash NVIDIA-Linux-x86_64-450.51.05-grid.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 450.51.05................
user@instance-1:~$ nvidia-smi
Mon Jul 27 21:11:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 |
| N/A 73C P8 21W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
In my case I needed to install some dependencies like the gcc compiler, and I only used the command
$ sudo apt install build-essential
I hope this information is useful for you.

Sample deviceQuery CUDA program

I have an Intel Xeon machine with an NVIDIA GeForce GTX 1080 configured and CentOS 7 as the operating system. I have installed NVIDIA driver 410.93 and CUDA toolkit 10.0. After compiling the cuda-samples, I tried to run ./deviceQuery.
But it fails like this:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
Some command outputs:
lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
nvidia-smi
Wed Feb 13 16:08:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93 Driver Version: 410.93 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 0% 54C P0 46W / 240W | 175MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6275 G /usr/bin/X 94MiB |
| 0 7268 G /usr/bin/gnome-shell 77MiB |
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
PATH & LD_LIBRARY_PATH
PATH =/usr/local/cuda-10.0/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/local/sbin:
LD_LIBRARY_PATH = /usr/local/cuda-10.0/lib64:/usr/local/cuda/lib64:
lsmod | grep nvidia
nvidia_drm 39819 3
nvidia_modeset 1036573 6 nvidia_drm
nvidia 16628708 273 nvidia_modeset
drm_kms_helper 179394 1 nvidia_drm
drm 429744 6 drm_kms_helper,nvidia_drm
ipmi_msghandler 56032 2 ipmi_devintf,nvidia
lsmod | grep nvidia-uvm
no output
dmesg | grep NVRM
[ 8.237489] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.93 Thu Dec 20 17:01:16 CST 2018 (using threaded interrupts)
Is this problem related to modprobe or nvidia-uvm?
I asked this on the NVIDIA DevTalk forum, but there has been no reply yet.
Please give some suggestions.
Thanks in advance.
I debugged it. The problem was a version mismatch between the NVIDIA driver (410.93) and CUDA (driver 410.48 came with the CUDA run file). I autoremoved all the drivers and reinstalled from the beginning, and deleted all the link files in /var/lib/dkms/nvidia/*.
Now it works fine, and nvidia-uvm is also loaded.
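A rough sketch of that cleanup and reinstall on CentOS 7 (the runfile name matches the CUDA 10.0 release that bundles driver 410.48; treat the exact package-removal command as an assumption for your setup):
sudo yum remove "nvidia*" "cuda*"        # drop the mismatched 410.93 driver packages
sudo rm -rf /var/lib/dkms/nvidia/*       # delete the stale DKMS link files
sudo sh cuda_10.0.130_410.48_linux.run   # reinstall the toolkit plus the bundled 410.48 driver
lsmod | grep nvidia_uvm                  # confirm nvidia-uvm is loaded afterwards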
lsmod | grep nvidia
nvidia_uvm 786031 0
nvidia_drm 39819 3
nvidia_modeset 1048491 6 nvidia_drm
nvidia 16805034 274 nvidia_modeset,nvidia_uvm
drm_kms_helper 179394 1 nvidia_drm
drm 429744 6 drm_kms_helper,nvidia_drm
ipmi_msghandler 56032 2 ipmi_devintf,nvidia
nvidia-smi
Fri Feb 15 11:46:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 0% 45C P8 10W / 240W | 242MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6063 G /usr/bin/X 120MiB |
| 0 7502 G /usr/bin/gnome-shell 119MiB |
+-----------------------------------------------------------------------------+
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8119 MBytes (8513585152 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1797 MHz (1.80 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS