Caffe: GPU CUDA error after training: Check failed: error == cudaSuccess (30 vs. 0) unknown error - cuda

Sometimes after training, or when I stop training manually by pressing CTRL + C, I get this CUDA error:
Check failed: error == cudaSuccess (30 vs. 0) unknown error
This only started happening recently, though. Has anyone experienced this before, or do you know how to fix it or what the problem is?
Complete log:
I1027 09:29:37.779079 11959 caffe.cpp:217] Using GPUs 0
I1027 09:29:37.780676 11959 caffe.cpp:222] GPU 0: �|���
F1027 09:29:37.780697 11959 common.cpp:151] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
# 0x7f6cc4f465cd google::LogMessage::Fail()
# 0x7f6cc4f48433 google::LogMessage::SendToLog()
# 0x7f6cc4f4615b google::LogMessage::Flush()
# 0x7f6cc4f48e1e google::LogMessageFatal::~LogMessageFatal()
# 0x7f6cc5558032 caffe::Caffe::SetDevice()
# 0x40b3f8 train()
# 0x407590 main
# 0x7f6cc3eb7830 __libc_start_main
# 0x407db9 _start
# (nil) (unknown)

Use the nvidia-smi command to see which programs are still running on the GPU. If any unwanted instance of caffe is still running after you press CTRL + C, kill it by its process ID, like below:
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 0000:01:00.0 On | N/A |
| 58% 83C P2 188W / 260W | 1164MiB / 6142MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 980 Ti Off | 0000:02:00.0 Off | N/A |
| 53% 73C P2 127W / 260W | 585MiB / 6143MiB | 35% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1101 C ...-xx/build/tools/caffe 788MiB |
| 0 1570 G /usr/bin/X 235MiB |
| 0 1594 C /usr/bin/python 102MiB |
| 0 2387 G compiz 10MiB |
| 0 3984 G /usr/local/MATLAB/R2016a/bin/glnxa64/MATLAB 2MiB |
| 1 25056 C /usr/bin/caffe 563MiB |
+-----------------------------------------------------------------------------+
You should kill the stale process with a command like: sudo kill -9 1101
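As a rough sketch of that cleanup step (the pattern "caffe" below is an assumption; adjust it to match your binary, and double-check the PIDs before killing anything):
# List compute processes with their PIDs and memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Kill every leftover process whose command line matches "caffe"
for pid in $(pgrep -f caffe); do
    sudo kill -9 "$pid"
done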

Try running make all, then make test, then make runtest. It should work.
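For reference, a minimal rebuild sequence from the Caffe source tree might look like this (the checkout path and the parallel -j flag are assumptions):
cd ~/caffe                 # path to your Caffe checkout (assumption)
make clean
make all -j"$(nproc)"
make test -j"$(nproc)"
make runtest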

After running make all, I noticed some errors regarding the libcudnn libraries: I had them duplicated in /usr/lib/x86_64-linux-gnu and /usr/local/cuda-8.0/lib64. After leaving only the ones in /usr/lib/x86_64-linux-gnu and restarting the laptop, everything worked.
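If you suspect the same duplication, a quick diagnostic sketch is to ask the dynamic linker which libcudnn copies it sees and to list both directories mentioned above:
# Every libcudnn the dynamic linker currently knows about
ldconfig -p | grep libcudnn
# Check both candidate locations for duplicate copies
ls -l /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda-8.0/lib64/libcudnn* 2>/dev/null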

CUDA runtime error (30) can occur if your program is unable to create or open the /dev/nvidia-uvm device file. This is usually fixed by installing the nvidia-modprobe package:
sudo apt-get install nvidia-modprobe
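To verify whether this is your case, you can check for the device file and the kernel module before and after installing the package; a minimal sketch:
# Is the unified-memory device node present?
ls -l /dev/nvidia-uvm
# Is the nvidia_uvm kernel module loaded?
lsmod | grep nvidia_uvm
# Try loading it manually if it is missing
sudo modprobe nvidia-uvm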

Try reinstalling/rebuilding the NVIDIA driver for the current kernel:
sudo apt-get install --reinstall nvidia-375
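After a kernel update it is worth confirming that the loaded kernel module and the user-space tools report the same driver version; the nvidia-375 package above is just the Ubuntu 16.04 example, yours may differ:
# Version of the NVIDIA kernel module currently loaded
cat /proc/driver/nvidia/version
# Version reported by the user-space driver tools
nvidia-smi --query-gpu=driver_version --format=csv,noheader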


Related

Orion Context Broker already running error

I have a problem when I start Orion Context Broker as a Docker container: it always exits, saying that Orion is already running. How can I fix this issue?
srdjan-orion-1 | time=2022-07-27T09:53:34.508Z | lvl=ERROR | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=contextBroker.cpp[432]:pidFile | msg=PID-file '/tmp/contextBroker.pid' found. A broker seems to be running already
srdjan-orion-1 exited with code 1
When I run sudo ps aux | grep contextBroker, this is the result: srdjan 31012 0.0 0.0 11568 652 pts/0 S+ 11:56 0:00 grep --color=auto contextBroker
The problem is that every time I execute this command, the first number after my username is different, and when I try to run the kill command I get this: sudo kill 31012 returns kill: (31012): No such process,
and also this: sudo kill 11568 returns kill: (11568): No such process.
Thanks for the help!
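A side note on the process lookup above: the only line returned is the grep command itself, so no contextBroker process is actually running on the host (that is why the PID keeps changing and kill reports "No such process"). A small sketch for searching without matching the grep process:
# pgrep matches the full command line but never matches itself
pgrep -af contextBroker
# Equivalent plain-grep trick: the brackets stop grep from matching its own command line
sudo ps aux | grep '[c]ontextBroker'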

CUDA kernel failed : no kernel image is available for execution on the device, Error when running PyTorch model inside Google Compute VM

I have a Docker image of a PyTorch model that returns this error when run inside a Google Compute Engine VM running on Debian with a Tesla P4 GPU and the Google deep learning image:
CUDA kernel failed : no kernel image is available for execution on the device
This occurs on the line where my model is called. The PyTorch model includes custom C++ extensions; I'm using this model: https://github.com/daveredrum/Pointnet2.ScanNet
My image installs these at runtime.
The image runs fine on my local system. Both the VM and my system have these versions:
Cuda compilation tools 10.1, V10.1.243
torch 1.4.0
torchvision 0.5.0
The main difference, as far as I'm aware, is the GPU.
Local:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 N/A / N/A | 361MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
VM:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 23W / 75W | 0MiB / 7611MiB | 3% Default |
If I SSH into the VM, torch.cuda.is_available() returns True.
Therefore I suspect it must be something to do with the compilation of the extensions.
This is the relevant part of my docker file:
ENV CUDA_HOME "/usr/local/cuda-10.1"
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda-10.1/bin:${PATH}
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
ENV FORCE_CUDA=1
# CUDA 10.1-specific steps
RUN conda install -c open3d-admin open3d
RUN conda install -y -c pytorch \
cudatoolkit=10.1 \
"pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0" \
"torchvision=0.5.0=py36_cu101" \
&& conda clean -ya
RUN pip install -r requirements.txt
RUN pip install flask
RUN pip install plyfile
RUN pip install scipy
# Install OpenCV3 Python bindings
RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends \
libgtk2.0-0 \
libcanberra-gtk-module \
libgl1-mesa-glx \
&& sudo rm -rf /var/lib/apt/lists/*
RUN dir
RUN cd pointnet2 && python setup.py install
RUN cd ..
I have already tried re-running this line over SSH in the VM:
TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install
which I think targets the build at the Tesla P4's compute capability?
Is there some other setting or troubleshooting step I can try?
I didn't know anything about Docker/VMs/PyTorch extensions until a couple of days ago, so I'm somewhat shooting in the dark. Also, this is my first Stack Overflow post; apologies if I'm not following some etiquette, feel free to point it out.
I resolved this in the end by manually deleting all the folders except for "src" in the folder containing setup.py,
then rebuilding the Docker image.
When building the image I ran TORCH_CUDA_ARCH_LIST="6.1" python setup.py install, to build the CUDA extensions targeting the correct compute capability for the GPU on the VM,
and it worked!
I guess just running setup.py without deleting the previously built folders doesn't fully overwrite the extension.
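A minimal sketch of that clean rebuild, assuming the extension lives in the pointnet2 directory from the Dockerfile above and that the stale artifacts sit in the usual setuptools folders (build, dist, *.egg-info):
cd pointnet2
# Remove stale build artifacts so the extension is recompiled from scratch
rm -rf build dist *.egg-info
# Rebuild targeting the Tesla P4 (compute capability 6.1)
TORCH_CUDA_ARCH_LIST="6.1" python setup.py install
# Optional sanity check: print the compute capability of the GPU in the VM
python -c "import torch; print(torch.cuda.get_device_capability(0))"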

openstack compute service list --service nova-compute empty

After the installation of nova-compute on the compute node, it fails to start, and this command run from the controller node returns an empty result:
openstack compute service list --service nova-compute
And the nova-compute.log file contains these two messages:
2018-11-19 12:06:05.446 986 INFO os_vif [-] Loaded VIF plugins: ovs, linux_bridge
2018-11-19 12:30:13.784 1140 INFO os_vif [-] Loaded VIF plugins: ovs, linux_bridge
openstack compute service list:
returns three service components for the controller, all in a down state:
+----+------------------+------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host       | Zone     | Status  | State | Updated At                 |
+----+------------------+------------+----------+---------+-------+----------------------------+
|  2 | nova-conductor   | Controller | internal | enabled | down  | 2018-11-17T17:32:48.000000 |
|  4 | nova-scheduler   | Controller | internal | enabled | down  | 2018-11-17T17:32:49.000000 |
|  5 | nova-consoleauth | Controller | internal | enabled | down  | None                       |
+----+------------------+------------+----------+---------+-------+----------------------------+
service nova-compute status :
Active
How can I resolve these problems?
This is probably because you missed creating the database for nova_cell0.
# mysql -u root -p
MariaDB [(none)]> CREATE DATABASE nova_cell0;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'localhost' IDENTIFIED BY 'NOVA_DBPASS';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'%' IDENTIFIED BY 'NOVA_DBPASS';
# su -s /bin/sh -c "nova-manage cell_v2 map_cell0" nova
# su -s /bin/sh -c "nova-manage cell_v2 create_cell --name=cell1 --verbose" nova
109e1d4b-536a-40d0-83c6-5f121b82b650
# su -s /bin/sh -c "nova-manage db sync" nova
# nova-manage cell_v2 list_cells
# su -s /bin/sh -c "nova-manage api_db sync" nova
Also make sure that in /etc/nova/nova.conf on the compute node you have added the following configuration:
[DEFAULT]
enabled_apis = osapi_compute,metadata
transport_url = rabbit://openstack:RABBIT_PASS@controller
Then restart the compute services and try the command openstack compute service list again.
This solution also applies when openstack compute service list or nova hypervisor-list returns an empty result.
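A rough verification sketch after applying the above; the systemd unit names are the ones used by the Ubuntu packages and may differ on your distribution:
# On the controller: restart the nova services
sudo systemctl restart nova-api nova-scheduler nova-conductor
# On the compute node: restart the compute agent
sudo systemctl restart nova-compute
# Back on the controller: the nova-compute service should now be listed and "up"
openstack compute service list --service nova-compute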

Executing the deviceQuery CUDA sample on Ubuntu [duplicate]

When I go to /usr/local/cuda/samples/1_Utilities/deviceQuery and execute
moose@pc09 /usr/local/cuda/samples/1_Utilities/deviceQuery $ sudo make clean
rm -f deviceQuery deviceQuery.o
rm -rf ../../bin/x86_64/linux/release/deviceQuery
moose@pc09 /usr/local/cuda/samples/1_Utilities/deviceQuery $ sudo make
"/usr/local/cuda-7.0"/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -o deviceQuery.o -c deviceQuery.cpp
"/usr/local/cuda-7.0"/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -o deviceQuery deviceQuery.o
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
moose@pc09 /usr/local/cuda/samples/1_Utilities/deviceQuery $ ./deviceQuery
I keep getting
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version Result = FAIL
I have no idea how to fix it.
My System
moose@pc09 ~ $ cat /etc/issue
Linux Mint 17 Qiana \n \l
moose@pc09 ~ $ uname -a
Linux pc09 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
moose@pc09 ~ $ lspci -v | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GK110B [GeForce GTX Titan Black] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1066
Kernel driver in use: nvidia
01:00.1 Audio device: NVIDIA Corporation GK110 HDMI Audio (rev a1)
Subsystem: NVIDIA Corporation Device 1066
moose@pc09 ~ $ sudo lshw -c video
*-display
description: VGA compatible controller
product: GK110B [GeForce GTX Titan Black]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:96 memory:fa000000-faffffff memory:d0000000-d7ffffff memory:d8000000-d9ffffff ioport:e000(size=128) memory:fb000000-fb07ffff
moose@pc09 ~ $ nvidia-settings -q NvidiaDriverVersion
Attribute 'NvidiaDriverVersion' (pc09:0.0): 331.79
moose@pc09 ~ $ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 331.79 Sun May 18 03:55:59 PDT 2014
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
moose@pc09 ~ $ lsmod | grep -i nvidia
nvidia_uvm 34855 0
nvidia 10703828 40 nvidia_uvm
drm 303102 5 ttm,drm_kms_helper,nvidia,nouveau
moose@pc09 ~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_16_22:59:02_CST_2015
Cuda compilation tools, release 7.0, V7.0.27
moose@pc09 ~ $ nvidia-smi
Thu Nov 12 11:23:24 2015
+------------------------------------------------------+
| NVIDIA-SMI 331.79 Driver Version: 331.79 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:01:00.0 N/A | N/A |
| 26% 35C N/A N/A / N/A | 132MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Update your NVIDIA driver. At the moment you have a driver that only supports CUDA 6 or lower, and you are trying to use the CUDA 7.0 toolkit with it.
I ran into this exact same error message with the CUDA 8.0 toolkit on Ubuntu 16.04. I tried reinstalling the toolkit, cuDNN, etc., and it didn't help. The solution turned out to be very simple: update to the latest NVIDIA driver. I installed NVIDIA-Linux-x86_64-367.57.run and the error went away.
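A quick way to spot this kind of driver/toolkit mismatch is to compare the loaded kernel driver version with the toolkit you compile against; a minimal sketch:
# Kernel driver version currently loaded
cat /proc/driver/nvidia/version
# CUDA toolkit version used for compilation
nvcc -V
# Driver version as seen by the management tools
nvidia-smi --query-gpu=driver_version --format=csv,noheader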
My two cents: this error may also be related to the selected GPU mode (Performance/Power Saving). When the integrated Intel GPU is selected (with the nvidia-settings utility) and you execute the deviceQuery sample, you get this error:
-> CUDA driver version is insufficient for CUDA runtime version
But this error is misleading: selecting the NVIDIA GPU (Performance mode) again with the nvidia-settings utility makes the problem disappear.
It is not a version problem (in my scenario).
Regards
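If you are on Ubuntu with the nvidia-prime package installed (an assumption), the active GPU profile can also be checked and switched from the command line instead of the nvidia-settings GUI:
# Show which profile is active (intel or nvidia)
prime-select query
# Switch back to the NVIDIA GPU, then log out and back in (or reboot)
sudo prime-select nvidia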

Cannot set IndexMemory and DataMemory on MySQL Cluster

I am in the process of setting up a MySQL Cluster (version 7.2.4) on 64-bit Debian Linux. The cluster has two management/SQL nodes and two data nodes. Each server has the following in /var/lib/mysql-cluster/config.ini:
[NDBD DEFAULT]
NoOfReplicas=2
DataDir=/var/lib/mysql-cluster
DataMemory=256M
IndexMemory=64M
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]
# Management node 1
[NDB_MGMD]
NodeId=1
HostName=192.168.25.10
DataDir=/var/lib/mysql-cluster
# Management node 2
[NDB_MGMD]
NodeId=2
HostName=192.168.25.11
DataDir=/var/lib/mysql-cluster
# Storage node 1
[NDBD]
NodeId=3
HostName=192.168.25.12
# Storage node 2
[NDBD]
NodeId=4
HostName=192.168.25.13
[MYSQLD]
NodeId=5
HostName=192.168.25.10
[MYSQLD]
NodeId=6
HostName=192.168.25.11
[MYSQLD]
[MYSQLD]
The documentation and my own research on Google lead me to believe that this will set the data memory to 256 MB and the index memory to 64 MB. However, when the cluster is started with this configuration, these settings are not honored:
mysql> SELECT node_id, memory_type, total FROM ndbinfo.memoryusage;
+---------+--------------+----------+
| node_id | memory_type  | total    |
+---------+--------------+----------+
|       3 | Data memory  | 83886080 |
|       3 | Index memory | 19136512 |
|       4 | Data memory  | 83886080 |
|       4 | Index memory | 19136512 |
+---------+--------------+----------+
4 rows in set (0.03 sec)
For each node, the data memory is 80 MB and the index memory is 18 MB, which are the default values according to the MySQL Cluster documentation.
I've tried a few minor tweaks, such as changing the [NDBD DEFAULT] to [ndbd default], but nothing has worked. Does anyone know why I'm not able to change these two settings?
As always, any help will be greatly appreciated. Thanks!
When starting both management servers with the new configuration: did you restart the management nodes with --reload to actually load the new configuration from the config file? If --reload is not used, the cached version of the previous valid configuration will be used. For the configuration to be picked up by the data nodes, they also have to be restarted.
Bernd
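A rough outline of that restart sequence, assuming the standard MySQL Cluster binaries and the config.ini path shown in the question; restart the data nodes one at a time so the cluster stays available:
# On each management node: stop the running ndb_mgmd, then start it with --reload
# so it re-reads config.ini instead of the cached configuration
ndb_mgmd -f /var/lib/mysql-cluster/config.ini --reload
# From the management client: rolling restart of the data nodes (node IDs 3 and 4 here)
ndb_mgm -e "3 RESTART"
ndb_mgm -e "4 RESTART"
# Verify that the new DataMemory/IndexMemory limits are in effect
ndb_mgm -e "ALL REPORT MEMORYUSAGE"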