What to do with 'Bus error' in caffe while training? - caffe

I am using NVIDIA Jetson TX1 and caffe to train the AlexNet on my own data.
I have 104,000 training and 20,000 validation images fed to my model, with a batch size of 16 for both train and test.
I run the solver for training and I get this Bus error after 1300 iterations:
.
.
.
I0923 12:08:37.121116 2341 sgd_solver.cpp:106] Iteration 1300, lr = 0.01
*** Aborted at 1474628919 (unix time) try "date -d #1474628919" if you are using GNU date ***
PC: # 0x0 (unknown)
*** SIGBUS (#0x7ddea45000) received by PID 2341 (TID 0x7faa9fdf70) from PID 18446744073149894656; stack trace: ***
# 0x7fb4b014e0 (unknown)
# 0x7fb3ebe8b0 (unknown)
# 0x7fb4057248 (unknown)
# 0x7fb40572b4 (unknown)
# 0x7fb446e120 caffe::db::LMDBCursor::value()
# 0x7fb4587624 caffe::DataReader::Body::read_one()
# 0x7fb4587a90 caffe::DataReader::Body::InternalThreadEntry()
# 0x7fb458a870 caffe::InternalThread::entry()
# 0x7fb458b0d4 boost::detail::thread_data<>::run()
# 0x7fafdf7ef0 (unknown)
# 0x7fafcfde48 start_thread
Bus error
I use Ubuntu 14 on an NVIDIA Tegra X1 with 3.8 GB of RAM.
As I understand it, this is a memory issue. Could you please explain it in more detail and help me figure out how to solve it?
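One thing I am considering (just a sketch, assuming the crash really is memory pressure while the data prefetch thread reads from LMDB and that no swap is currently enabled) is adding a swap file before restarting training, or alternatively reducing the batch size of 16 in my network prototxt:
# check how much memory and swap the board currently has
free -m
# create and enable a 4 GB swap file (size and path are arbitrary choices)
sudo dd if=/dev/zero of=/mnt/swapfile bs=1M count=4096
sudo chmod 600 /mnt/swapfile
sudo mkswap /mnt/swapfile
sudo swapon /mnt/swapfile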
If any other information is needed please let me know.

Related

Memory limit on composer installation

I have a DigitalOcean droplet with 1 GB of RAM.
I need to set up a Docker environment with Laravel, MySQL and Nginx. I found Laradock and installed it without problems, but when I run Composer inside the container I get a memory limit error.
Error running: composer install
root@b9864446a1e1:/var/www/site# composer install
Loading composer repositories with package information
Updating dependencies (including require-dev)
mmap() failed: [12] Cannot allocate memory
mmap() failed: [12] Cannot allocate memory
PHP Fatal error: Out of memory (allocated 677388288) (tried to allocate 4096 bytes) in phar:///usr/local/bin/composer/src/Composer/DependencyResolver/RuleWatchGraph.php on line 52
Fatal error: Out of memory (allocated 677388288) (tried to allocate 4096 bytes) in phar:///usr/local/bin/composer/src/Composer/DependencyResolver/RuleWatchGraph.php on line 52
Error when trying to change the memory settings:
WARNING: Your kernel does not support swap limit capabilities or the
cgroup is not mounted. Memory limited without swap.
This could be happening because the VPS runs out of memory and has no Swap space enabled.
free -m
total used free shared buffers cached
Mem: xxxx xxx xxxx x x xxx
-/+ buffers/cache: xxx xxxx
Swap: 0 0 0
To enable swap you can, for example, run:
/bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
/sbin/mkswap /var/swap.1
/sbin/swapon /var/swap.1
You can make a permanent swap file following this tutorial from DigitalOcean.
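For example, a minimal sketch of making it permanent (assuming the /var/swap.1 file created above) is appending an entry to /etc/fstab and verifying it:
# enable the swap file on every boot
echo '/var/swap.1 none swap sw 0 0' >> /etc/fstab
# confirm the swap space is now visible
swapon -s
free -m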

Segfault triggered by multiple GPUs

I am running a training script with Caffe on an 8-GPU (1080 Ti) server.
If I train on 6 or fewer GPUs (using CUDA_VISIBLE_DEVICES), everything is fine.
(I set export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 and specify these GPUs in the training script.)
But if I train on 7 or 8 GPUs, I consistently see this error at the start of training:
Error: (unix time) try if you are using GNU date
SIGSEGV (#0x70) received by PID 17206 (TID 0x7fc678ffd700) from PID 112; stack trace:
# 0x7fc86186b4b0 (unknown)
# 0x7fc861983f75 (unknown)
# 0x7fc863c4b4c7 std::__cxx11::basic_string<>::_M_construct<>()
# 0x7fc863c4c60b _ZN5caffe2db10LMDBCursor5valueB5cxx11Ev
# 0x7fc863ace3e7 caffe::AnnotatedDataLayer<>::DataLayerSetUp()
# 0x7fc863a6e4d5 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
# 0x7fc863cbf2b4 caffe::Net<>::Init()
# 0x7fc863cc11ae caffe::Net<>::Net()
# 0x7fc863bb9c9a caffe::Solver<>::InitTestNets()
# 0x7fc863bbb84d caffe::Solver<>::Init()
# 0x7fc863bbbb3f caffe::Solver<>::Solver()
# 0x7fc863ba7d61 caffe::Creator_SGDSolver<>()
# 0x7fc863ccc1c2 caffe::Worker<>::InternalThreadEntry()
# 0x7fc863cf94c5 caffe::InternalThread::entry()
# 0x7fc863cfa38e boost::detail::thread_data<>::run()
# 0x7fc85350d5d5 (unknown)
# 0x7fc83fee56ba start_thread
# 0x7fc86193d41d clone
# 0x0 (unknown)
The Error: (unix time) ... at the start of the trace is apparently thrown by glog.
It appears to be thrown when a general failure happens.
This thread shows many different issues triggering Error: (unix time)... and similar traces.
In the thread, it is noted that multiple GPUs may trigger this error.
That is what appears to be the root cause in my case.
Are there things I can further look into to understand what is happening?
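Two checks that might narrow this down (a sketch, not a definitive recipe; the solver path below is a placeholder for your own): inspect the GPU peer-to-peer/PCIe topology used for multi-GPU training, and rerun the failing 7-GPU configuration under gdb to get a fuller backtrace.
# show how the GPUs are interconnected (P2P / PCIe topology)
nvidia-smi topo -m
# reproduce the crash under gdb and inspect the faulting frame
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
gdb --args ./build/tools/caffe train --solver=models/my_model/solver.prototxt --gpu 0,1,2,3,4,5,6
# inside gdb: type "run", and after the SIGSEGV type "bt" for the backtrace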

Caffe-SSD installed with GPU support throws error "Cannot use GPU in CPU-only Caffe"

I have installed caffe-ssd with OpenCV version 3.2.0, CUDA version 9.2.148 and CuDNN version 7.2.1.38.
These are my settings in Makefile.config
# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1
# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1
# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
/usr/local/lib/python2.7/dist-packages/numpy/core/include
# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial/
All tests passed:
[----------] Global test environment tear-down
[==========] 1266 tests from 168 test cases ran. (45001 ms total)
[ PASSED ] 1266 tests.
Thereafter I followed this link for SSD. The LMDB creation works without a problem, but when I run
python examples/ssd/ssd_pascal.py
I get the following error
I0820 14:16:29.089138 22429 caffe.cpp:217] Using GPUs 0
F0820 14:16:29.089301 22429 common.cpp:66] Cannot use GPU in CPU-only Caffe: check mode.
*** Check failure stack trace: ***
# 0x7f97322a00cd google::LogMessage::Fail()
# 0x7f97322a1f33 google::LogMessage::SendToLog()
# 0x7f973229fc28 google::LogMessage::Flush()
# 0x7f97322a2999 google::LogMessageFatal::~LogMessageFatal()
# 0x7f973284f8a0 caffe::Caffe::SetDevice()
# 0x55b05fe50dcb (unknown)
# 0x55b05fe4c543 (unknown)
# 0x7f9730ae3b97 __libc_start_main
# 0x55b05fe4cffa (unknown)
Aborted (core dumped)
I have an NVIDIA GeForce GTX 1080 Ti graphics card.
Mon Aug 20 14:26:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.51 Driver Version: 396.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 44% 37C P8 19W / 250W | 18MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1356 G /usr/lib/xorg/Xorg 9MiB |
| 0 1391 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
I've tried compiling a simple CUDA program with nvcc and running it without any problem. I'm also able to import caffe without any issue.
I have checked this question and that's not my problem.
For the error error == cudaSuccess (7 vs. 0), change gpus = "0,1,2,3" to gpus = "0" in ssd_pascal.py, and also check the CUDA path in CUDA_DIR in Makefile.config, updating it to the path and version actually installed on your system.
For the error "Cannot use GPU in CPU-only Caffe", build the SSD branch again using the make test command.
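For reference, such a rebuild might look like the following (a sketch, assuming you are in the caffe-ssd source root, that CPU_ONLY := 1 stays commented out in Makefile.config, and that -j4 is just an example parallelism level):
make clean
make all -j4
make test -j4
make runtest
make pycaffe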

Check failed: error == cudaSuccess during training SSD

I am training SSD and I get this error:
I0116 13:10:31.206343 3447 net.cpp:761] Ignoring source layer drop6
I0116 13:10:31.207219 3447 net.cpp:761] Ignoring source layer drop7
I0116 13:10:31.207229 3447 net.cpp:761] Ignoring source layer fc8
I0116 13:10:31.207233 3447 net.cpp:761] Ignoring source layer prob
F0116 13:10:31.227303 3447 parallel.cpp:130] Check failed: error == cudaSuccess (10 vs. 0) invalid device ordinal
*** Check failure stack trace: ***
# 0x7f158382e5cd google::LogMessage::Fail()
# 0x7f1583830433 google::LogMessage::SendToLog()
# 0x7f158382e15b google::LogMessage::Flush()
# 0x7f1583830e1e google::LogMessageFatal::~LogMessageFatal()
# 0x7f158412f7bd caffe::DevicePair::compute()
# 0x7f15841354e0 caffe::P2PSync<>::Prepare()
# 0x7f1584135fee caffe::P2PSync<>::Run()
# 0x40af10 train()
# 0x407608 main
# 0x7f1581fbd830 __libc_start_main
# 0x407ed9 _start
# (nil) (unknown)
Aborted (core dumped)
My graphics card is a Quadro K4200.
./deviceQuery gives me
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro K4200"
CUDA Driver Version / Runtime Version 9.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4034 MBytes (4230479872 bytes)
( 7) Multiprocessors, (192) CUDA Cores/MP: 1344 CUDA Cores
GPU Max Clock rate: 784 MHz (0.78 GHz)
Memory Clock rate: 2700 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K4200
Result = PASS
I can successfully test the SSD library; I only get this error during training.
Is the graphics card not powerful enough to train with this library?
I found the error.
If we run the command python examples/ssd/ssd_pascal.py in SSD, the training command it launches next is as follows:
gdb --args ./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0,1,2,3 2>&1 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log
This --gpu 0,1,2,3 option is what causes the issue: there is only one GPU in the machine, so device ordinals 1-3 are invalid. I changed it to --gpu 0 and ran the training command directly:
./build/tools/caffe train --solver="models/VGGNet/VOC0712/SSD_300x300/solver.prototxt" --weights="models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel" --gpu 0 | tee jobs/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300.log
That solved it.
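Equivalently, one can change the gpus variable inside ssd_pascal.py itself instead of editing the generated command (a sketch, assuming the script contains the literal line gpus = "0,1,2,3"):
# point the script at the single available device and rerun it
sed -i 's/gpus = "0,1,2,3"/gpus = "0"/' examples/ssd/ssd_pascal.py
python examples/ssd/ssd_pascal.py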

Asking for the installation of Caffe

I was installing the Caffe library on macOS, but when I typed 'make runtest' I encountered the following problem. What should I do? Thanks in advance. My MacBook doesn't have a CUDA-capable GPU; does this affect the installation?
.build_release/test/test_all.testbin 0 --gtest_shuffle
Cuda number of devices: 32767
Setting to use device 0
Current device id: 0
Current device name:
Note: Randomizing tests' orders with a seed of 14037 .
[==========] Running 1927 tests from 259 test cases.
[----------] Global test environment set-up.
[----------] 4 tests from BlobSimpleTest/0, where TypeParam = f
[ RUN ] BlobSimpleTest/0.TestPointersCPUGPU
E0306 11:45:15.035683 2126779136 common.cpp:104] Cannot create Cublas handle. Cublas won't be available.
E0306 11:45:15.114891 2126779136 common.cpp:111] Cannot create Curand generator. Curand won't be available.
F0306 11:45:15.115012 2126779136 syncedmem.cpp:55] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
*** Check failure stack trace: ***
# 0x10d2c976a google::LogMessage::Fail()
# 0x10d2c8f14 google::LogMessage::SendToLog()
# 0x10d2c93c7 google::LogMessage::Flush()
# 0x10d2cc679 google::LogMessageFatal::~LogMessageFatal()
# 0x10d2c9a4f google::LogMessageFatal::~LogMessageFatal()
# 0x10e023406 caffe::SyncedMemory::to_gpu()
# 0x10e022c5e caffe::SyncedMemory::gpu_data()
# 0x108021d9c caffe::BlobSimpleTest_TestPointersCPUGPU_Test<>::TestBody()
# 0x10849ba5c testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x10848a1ba testing::Test::Run()
# 0x10848b0e2 testing::TestInfo::Run()
# 0x10848b7d0 testing::TestCase::Run()
# 0x108491f86 testing::internal::UnitTestImpl::RunAllTests()
# 0x10849c264 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x108491c99 testing::UnitTest::Run()
# 0x107f8c89a main
# 0x7fff903e15c9 start
# 0x3 (unknown)
make: *** [runtest] Abort trap: 6
I had the same issue, but I have a graphics card specifically to run Caffe on, so CPU_ONLY was not an option ;-)
To check whether it has the same cause as mine, try running the CUDA Samples deviceQuery example.
I fixed it using the runfile verification steps from the CUDA Installation Guide:
sudo chmod 0666 /dev/nvidia*
Finally, I found a solution by setting CPU_ONLY := 1 in Makefile.config (uncomment the original line by removing the '#' in front of "CPU_ONLY := 1") and rerunning "make clean", "make all", "make test", and then "make runtest", referring to this link: https://github.com/BVLC/caffe/issues/736
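Put together, that might look like this on macOS (a sketch; the sed call assumes the line reads exactly "# CPU_ONLY := 1" and uses BSD sed's in-place syntax, or you can simply edit Makefile.config by hand):
# uncomment CPU_ONLY in Makefile.config, then rebuild from scratch
sed -i '' 's/^# *CPU_ONLY := 1/CPU_ONLY := 1/' Makefile.config
make clean
make all
make test
make runtest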