Segfault triggered by multiple GPUs - caffe

I am running a training script with Caffe on an 8-GPU (1080Ti) server.
If I train on 6 or fewer GPUs (using CUDA_VISIBLE_DEVICES), everything is fine.
(I set export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 and specify these GPUs in the training script.)
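For reference, training is launched roughly like this (the solver file name is just a placeholder):
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
./build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3,4,5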
But if I train on 7 or 8 GPUs, I consistently see this error at the start of training:
Error: (unix time) try if you are using GNU date
SIGSEGV (#0x70) received by PID 17206 (TID 0x7fc678ffd700) from PID 112; stack trace:
# 0x7fc86186b4b0 (unknown)
# 0x7fc861983f75 (unknown)
# 0x7fc863c4b4c7 std::__cxx11::basic_string<>::_M_construct<>()
# 0x7fc863c4c60b _ZN5caffe2db10LMDBCursor5valueB5cxx11Ev
# 0x7fc863ace3e7 caffe::AnnotatedDataLayer<>::DataLayerSetUp()
# 0x7fc863a6e4d5 caffe::BasePrefetchingDataLayer<>::LayerSetUp()
# 0x7fc863cbf2b4 caffe::Net<>::Init()
# 0x7fc863cc11ae caffe::Net<>::Net()
# 0x7fc863bb9c9a caffe::Solver<>::InitTestNets()
# 0x7fc863bbb84d caffe::Solver<>::Init()
# 0x7fc863bbbb3f caffe::Solver<>::Solver()
# 0x7fc863ba7d61 caffe::Creator_SGDSolver<>()
# 0x7fc863ccc1c2 caffe::Worker<>::InternalThreadEntry()
# 0x7fc863cf94c5 caffe::InternalThread::entry()
# 0x7fc863cfa38e boost::detail::thread_data<>::run()
# 0x7fc85350d5d5 (unknown)
# 0x7fc83fee56ba start_thread
# 0x7fc86193d41d clone
# 0x0 (unknown)
The Error: (unix time) ... line at the start of the trace is apparently emitted by glog.
It appears to be printed when a general failure occurs.
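The timestamp it prints can be decoded with GNU date; for example (the value below is only an illustration, not the one from my run):
date -d @1474628919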
This thread shows many different issues triggering Error: (unix time)... and a similar trace.
In the thread, it is noted that multiple GPUs may trigger this error.
That is what appears to be the root cause in my case.
Are there things I can further look into to understand what is happening?

Related

How to configure slurm on ubuntu 20.04 with minimum requirements?

I am trying to set up the configuration file on Ubuntu 20.04. I have tried several things and searched for the errors on other websites (link1, link2, link3) and on the Slurm website as well. There is another similar question on SO as well.
Given the following information about my computer, what is the minimum information that must be provided in the slurm.conf file?
The general information for my computer:
RAM: 125.5 GB
CPU: 1-20 (Intel® Xeon(R) CPU E5-2687W v3 @ 3.10GHz × 20)
Graphics: NVIDIA Corporation GP104 [GeForce GTX 1080] / NVIDIA Corporation
OS: Ubuntu 20.04.2 LTS 64 bit
I want to have 2 nodes with 10 CPUs each and 1 node for the GPU.
I have tried the following. After configuration, running
>sudo systemctl restart slurmctld
completed with no error. But I got an error with slurmd:
> sudo systemctl restart slurmd
The error is shown below:
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
If I run "systemctl status slurmd.service", I get:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2021-06-06 21:47:26 CEST; 1min 14s ago
Docs: man:slurmd(8)
Process: 52710 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Here is my configuration file slurm.conf, generated by configurator_easy.html and saved in /etc/slurm-llnl/slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=myhostname
#
AuthType=auth/menge
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
FirstJobId=0
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
KillWait=30
MinJobAge=300
MaxJobCount=10000
#PluginDir=/usr/local/lib
ReturnToService=0
SlurmdPort=6818
SlurmctldPort=6817
SlurmdSpoolDir=/var/spool/slurmd.spool
StateSaveLocation=/var/spool/slurm-llnl/slurm.state
SwitchType=switch/none
TmpFS=/tmp
WaitTime=30
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmUser=slurm
SlurmdUser=root
TaskPlugin=task/affinity
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=info
#SlurmdLogFile=
#
# COMPUTE NODES
NodeName=Linux[1-32] State=UP
NodeName=DEFAULT State=UNKNOWN
PartitionName=Linux[1-32] Default=YES
I have Ubuntu 20.04 running on WSL and was also struggling with setting up Slurm. It looks like everything is running fine now. I am still a beginner.
I recommend that you really check the logs:
cat /var/log/slurmctld.log
cat /var/log/slurmd.log
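Note that with the paths in your slurm.conf, slurmctld actually logs to /var/log/slurm-llnl/SlurmctldLogFile, and since SlurmdLogFile is not set, slurmd messages go to syslog/journal, so (assuming the packaged systemd units) something like:
sudo cat /var/log/slurm-llnl/SlurmctldLogFile
sudo journalctl -u slurmd -e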
In my case I had some permission issues, so I had to make sure the Slurm-related directories were owned by the SlurmUser defined in your config.
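For example, with the paths from your slurm.conf, something along these lines (a sketch; adjust the paths if you change them in the config):
sudo mkdir -p /var/spool/slurm-llnl /var/log/slurm-llnl
sudo chown -R slurm:slurm /var/spool/slurm-llnl /var/log/slurm-llnl
ls -ld /var/spool/slurm-llnl /var/log/slurm-llnl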
At first glance, comparing the settings with mine, I see the following lines in your config which could cause the problem:
I am surprised that you defined NodeName twice.
In my case, the first NodeName entry takes the value of SlurmctldHost.
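For a single machine, the compute-node section could be reduced to something like this (just a sketch, assuming the node is the same machine as SlurmctldHost and roughly 20 cores / 125 GB RAM; the exact values should come from the output of slurmd -C, and the partition name is arbitrary):
NodeName=myhostname CPUs=20 RealMemory=120000 State=UNKNOWN
PartitionName=debug Nodes=myhostname Default=YES MaxTime=INFINITE State=UP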
I hope something mentioned above helps.
Regards
Edit: I would also refer to the following post, which could be similar to your issue if you run your command with sudo.

Problems running podman in Ubuntu 20.04

Different problems, mainly related to the number of UIDs/GIDs available in the user namespace.
Trying to run a pre-built image yields:
podman run --rm -it -p8080:8080 --env LOGSTASH_CONF_STRING=$LOGSTASH_CONF_STRING --name logstash bitnami/logstash:latest
Completed short name "bitnami/logstash" with unqualified-search registries (origin: /etc/containers/registries.conf)
Trying to pull docker.io/bitnami/logstash:latest...
Getting image source signatures
# layers
Writing manifest to image destination
Storing signatures
Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:4 for /var/log/apt/term.log): Check /etc/subuid and /etc/subgid: lchown /var/log/apt/term.log: invalid argument
Trying to pull quay.io/bitnami/logstash:latest...
Getting image source signatures
# layers...
Storing signatures
Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:4 for /var/log/apt/term.log): Check /etc/subuid and /etc/subgid: lchown /var/log/apt/term.log: invalid argument
Error: 2 errors occurred while pulling:
* Error committing the finished image: error adding layer with blob "sha256:8a5e287f7d41a454c717077151d24db164054831d7cd1399ee81ab2dfba4bcb2": Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:4 for /var/log/apt/term.log): Check /etc/subuid and /etc/subgid: lchown /var/log/apt/term.log: invalid argument
* Error committing the finished image: error adding layer with blob "sha256:8a5e287f7d41a454c717077151d24db164054831d7cd1399ee81ab2dfba4bcb2": Error processing tar file(exit status 1): potentially insufficient UIDs or GIDs available in user namespace (requested 0:4 for /var/log/apt/term.log): Check /etc/subuid and /etc/subgid: lchown /var/log/apt/term.log: invalid argument
So I can't really run this image. Then, trying to run a locally built container which creates a user, I get:
Error: OCI runtime error: container_linux.go:370: starting container process caused: setup user: invalid argument
Configuration has been set up with subuid
jmerelo:100000:65536
However, the user namespace mapping only shows a single entry:
podman unshare cat /proc/self/uid_map
0 1000 1
So there must be something that I'm missing, or something that I should restart. Even if I log in again, it's still the same result.
I had the same problem and solved it by following this:
rm -rf ~/.config/containers ~/.local/share/containers
podman system migrate
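After the migrate, the user namespace should pick up the subuid/subgid ranges again; a quick check (assuming the jmerelo:100000:65536 entry is present in both /etc/subuid and /etc/subgid) is:
podman unshare cat /proc/self/uid_map
which should now show two lines, the 0 1000 1 mapping plus the 100000:65536 range.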

Compile error during Caffe installation on OS X 10.11

I've configured the Caffe environment on my Mac several times, but this time I encountered a problem I've never met before:
I use Intel's MKL for accelerating computation instead of ATLAS, and I use Anaconda 2.7 and OpenCV 2.4, with Xcode 7.3.1 on OS X 10.11.6.
When I run
make all -j8
in the terminal under Caffe's root directory, the error info is:
AR -o .build_release/lib/libcaffe.a
LD -o .build_release/lib/libcaffe.so.1.0.0-rc5
clang: warning: argument unused during compilation: '-pthread'
ld: can't map file, errno=22 file '/usr/local/cuda/lib' for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [.build_release/lib/libcaffe.so.1.0.0-rc5] Error 1
make: *** Waiting for unfinished jobs....
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: .build_release/lib/libcaffe.a(parallel.o) has no symbols
I've tried many times; can anyone help me out?
This looks like you haven't changed Makefile.config from GPU to CPU mode. There shouldn't be anything trying to actively link that library. I think the only CUDA one you should need is libicudata.so
Look for the lines
# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1
and remove the octothorpe from the front of the second line.
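After the edit, the switch in Makefile.config should read as below (a sketch of the stock BVLC Makefile.config; your copy may differ slightly), followed by a clean rebuild with make clean and make all -j8 so the flag is picked up:
# CPU-only switch (uncomment to build without GPU support).
CPU_ONLY := 1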

What to do with 'Bus error' in caffe while training?

I am using an NVIDIA Jetson TX1 and Caffe to train AlexNet on my own data.
I have 104,000 training and 20,000 validation images fed to my model, with a batch size of 16 for both test and train.
I run the solver for training and I get this Bus error after 1300 iterations:
.
.
.
I0923 12:08:37.121116 2341 sgd_solver.cpp:106] Iteration 1300, lr = 0.01
*** Aborted at 1474628919 (unix time) try "date -d @1474628919" if you are using GNU date ***
PC: # 0x0 (unknown)
*** SIGBUS (#0x7ddea45000) received by PID 2341 (TID 0x7faa9fdf70) from PID 18446744073149894656; stack trace: ***
# 0x7fb4b014e0 (unknown)
# 0x7fb3ebe8b0 (unknown)
# 0x7fb4057248 (unknown)
# 0x7fb40572b4 (unknown)
# 0x7fb446e120 caffe::db::LMDBCursor::value()
# 0x7fb4587624 caffe::DataReader::Body::read_one()
# 0x7fb4587a90 caffe::DataReader::Body::InternalThreadEntry()
# 0x7fb458a870 caffe::InternalThread::entry()
# 0x7fb458b0d4 boost::detail::thread_data<>::run()
# 0x7fafdf7ef0 (unknown)
# 0x7fafcfde48 start_thread
Bus error
I use Ubuntu 14, an NVIDIA Tegra X1, and 3.8 GB of RAM.
As I understand it, this is a memory issue. Could you please explain more about it and help me figure out how to solve it?
If any other information is needed please let me know.

Asking for the installation of Caffe

I was installing the Caffe library on macOS, but when I typed 'make runtest' I encountered the following problem. What should I do? Thanks in advance. My MacBook doesn't have a CUDA-capable GPU; does this affect the installation?
.build_release/test/test_all.testbin 0 --gtest_shuffle
Cuda number of devices: 32767
Setting to use device 0
Current device id: 0
Current device name:
Note: Randomizing tests' orders with a seed of 14037 .
[==========] Running 1927 tests from 259 test cases.
[----------] Global test environment set-up.
[----------] 4 tests from BlobSimpleTest/0, where TypeParam = f
[ RUN ] BlobSimpleTest/0.TestPointersCPUGPU
E0306 11:45:15.035683 2126779136 common.cpp:104] Cannot create Cublas handle. Cublas won't be available.
E0306 11:45:15.114891 2126779136 common.cpp:111] Cannot create Curand generator. Curand won't be available.
F0306 11:45:15.115012 2126779136 syncedmem.cpp:55] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
*** Check failure stack trace: ***
# 0x10d2c976a google::LogMessage::Fail()
# 0x10d2c8f14 google::LogMessage::SendToLog()
# 0x10d2c93c7 google::LogMessage::Flush()
# 0x10d2cc679 google::LogMessageFatal::~LogMessageFatal()
# 0x10d2c9a4f google::LogMessageFatal::~LogMessageFatal()
# 0x10e023406 caffe::SyncedMemory::to_gpu()
# 0x10e022c5e caffe::SyncedMemory::gpu_data()
# 0x108021d9c caffe::BlobSimpleTest_TestPointersCPUGPU_Test<>::TestBody()
# 0x10849ba5c testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x10848a1ba testing::Test::Run()
# 0x10848b0e2 testing::TestInfo::Run()
# 0x10848b7d0 testing::TestCase::Run()
# 0x108491f86 testing::internal::UnitTestImpl::RunAllTests()
# 0x10849c264 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x108491c99 testing::UnitTest::Run()
# 0x107f8c89a main
# 0x7fff903e15c9 start
# 0x3 (unknown)
make: *** [runtest] Abort trap: 6
I had the same issue. But I have a graphics card specifically to run Caffe on, so CPU_ONLY was not an option ;-)
To check if it has the same cause as mine, try running the CUDA Samples deviceQuery example.
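A quick way to do that, assuming the samples were installed with the toolkit under /usr/local/cuda (the path may differ on your system):
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery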
I fixed it using the runfile verifications from the CUDA Installation Guide:
sudo chmod 0666 /dev/nvidia*
Finally, I found a solution by setting CPU_ONLY := 1 in Makefile.config (uncomment the original line by removing the '#' in the line "# CPU_ONLY := 1"), then rerunning "make clean", "make all", "make test", and "make runtest", referring to this link: https://github.com/BVLC/caffe/issues/736
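For anyone following the same route, the rebuild sequence after uncommenting CPU_ONLY := 1 is roughly (adjust -j to your core count):
make clean
make all -j8
make test -j8
make runtest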