Caffe Framework Runtest Core dumped error - cuda

I have been installing the Caffe framework with the following GPU:
GeForce 9500 GT
CUDA 6.5 (it does not work with 7.0)
When I run make runtest, the following errors appear and I don't know the reason:
make runtest
.build_debug/tools/caffe
caffe: command line brew
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
device_query show GPU diagnostic information
time benchmark model execution time
Flags from tools/caffe.cpp:
-gpu (Run in GPU mode on given device ID.) type: int32 default: -1
-iterations (The number of iterations to run.) type: int32 default: 50
-model (The model definition protocol buffer text file..) type: string
default: ""
-snapshot (Optional; the snapshot solver state to resume training.)
type: string default: ""
-solver (The solver definition protocol buffer text file.) type: string
default: ""
-weights (Optional; the pretrained weights to initialize finetuning. Cannot
be set simultaneously with snapshot.) type: string default: ""
.build_debug/test/test_all.testbin 0 --gtest_shuffle
Cuda number of devices: 1
Setting to use device 0
Current device id: 0
Note: Randomizing tests' orders with a seed of 60641 .
[==========] Running 1356 tests from 214 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from PowerLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] PowerLayerTest/3.TestPower
F0616 20:08:47.978885 31913 math_functions.cu:81] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
*** Check failure stack trace: ***
# 0x2b2716c57daa (unknown)
# 0x2b2716c57ce4 (unknown)
# 0x2b2716c576e6 (unknown)
# 0x2b2716c5a687 (unknown)
# 0x2b27187a66fd caffe::caffe_gpu_memcpy()
# 0x2b27186fa15d caffe::SyncedMemory::to_gpu()
# 0x2b27186f9b44 caffe::SyncedMemory::gpu_data()
# 0x2b27186a4701 caffe::Blob<>::gpu_data()
# 0x2b27187b3a70 caffe::PowerLayer<>::Forward_gpu()
# 0x4adaca caffe::Layer<>::Forward()
# 0x5a033e caffe::PowerLayerTest<>::TestForward()
# 0x59f381 caffe::PowerLayerTest_TestPower_Test<>::TestBody()
# 0x7cf479 testing::internal::HandleSehExceptionsInMethodIfSupported<>()
# 0x7caa12 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x7b7e29 testing::Test::Run()
# 0x7b85c2 testing::TestInfo::Run()
# 0x7b8bb0 testing::TestCase::Run()
# 0x7bda3a testing::internal::UnitTestImpl::RunAllTests()
# 0x7d04ac testing::internal::HandleSehExceptionsInMethodIfSupported<>()
# 0x7cb6c9 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x7bc7ce testing::UnitTest::Run()
# 0x480a13 main
# 0x2b27196a1ec5 (unknown)
# 0x480819 (unknown)
# (nil) (unknown)
make: *** [runtest] Aborted (core dumped)

It turns out that (as @Jez suggested) my GPU does not support the double precision used by Caffe's math functions. That's the reason for the crashes. I have searched for a workaround for this issue but haven't found one. Maybe the only solution is to use a more modern GPU.
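For context, CUDA offers native double-precision arithmetic only on devices of compute capability 1.3 or higher, and the GeForce 9500 GT is a compute capability 1.1 part, so every GPUDevice&lt;double&gt; test in the suite is expected to fail on it. A minimal sketch of that check (the helper name is mine, not Caffe's):

```python
# Minimal sketch: native FP64 support arrived with compute capability 1.3
# (the GT200 generation); the GeForce 9500 GT (G96) is compute capability 1.1.
def supports_fp64(major, minor):
    """True if a CUDA device of this compute capability has native doubles."""
    return (major, minor) >= (1, 3)

print(supports_fp64(1, 1))  # GeForce 9500 GT -> False
print(supports_fp64(1, 3))  # first FP64-capable generation -> True
```

If rebuilding a float-only test set is not an option, the test binary shown in the log (.build_debug/test/test_all.testbin) accepts standard googletest flags, so a filter such as --gtest_filter=-*/3.* might let the float cases run; that filter is a guess based on the "TypeParam = caffe::GPUDevice&lt;double&gt;" suffix on the /3 instantiations in the output, not something verified in the original post.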

Related

OpenShift upgrade error 4.11.x -> 4.12.2 Marking Degraded due to: unexpected on-disk state validating against rendered-worker

I'm administering a RHEL OpenShift cluster and upgrading from 4.10.x -> 4.11.x -> 4.12.2.
There are 3 masters and 7 worker nodes.
All 3 masters are updated; 3 of the 8 workers are updated.
The upgrade is now stuck on worker0 with:
oc logs machine-config-daemon-4bs9x -n openshift-machine-config-operator
< snip >
I0216 21:00:08.555947 3136 daemon.go:1255] Current config: rendered-worker-8ebd95b2c00a22992daf1248ebc5640f
I0216 21:00:08.555986 3136 daemon.go:1256] Desired config: rendered-worker-263c6ea5fafb6f1da35a31749a1180d7
I0216 21:00:08.555992 3136 daemon.go:1258] state: Degraded
I0216 21:00:08.566365 3136 update.go:2089] Running: rpm-ostree cleanup -r
Deployments unchanged.
I0216 21:00:08.647332 3136 update.go:2104] Disk currentConfig rendered-worker-263c6ea5fafb6f1da35a31749a1180d7 overrides node's currentConfig annotation rendered-worker-8ebd95b2c00a22992daf1248ebc5640f
I0216 21:00:08.651201 3136 daemon.go:1564] Validating against pending config rendered-worker-263c6ea5fafb6f1da35a31749a1180d7
E0216 21:00:10.291740 3136 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-263c6ea5fafb6f1da35a31749a1180d7: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a866916d3c75fb02ee", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17" ("b5390d80a8b7f90b0b64f9db3e92848591c967612740716c656c6e88696e0c3f")
I've had this problem before and followed the Red Hat solutions to run the following command, but this is now failing.
oc debug node/worker0.xx.com
sh-4.4# chroot /host
sh-4.4# rpm-ostree status
State: idle
Deployments:
* db83d20cf09a263777fcca78594b16da00af8acc245d29cc2a1344abc3f0dac2
Version: 412.86.202301311551-0 (2023-01-31T15:54:05Z)
sh-4.4#
sh-4.4# /run/bin/machine-config-daemon pivot "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee"
I0216 21:02:54.449270 3962714 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-821872843 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a866916d3c75fb02ee
I0216 21:03:48.349962 3962714 rpm-ostree.go:209] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17
I0216 21:03:49.926169 3962714 rpm-ostree.go:246] No com.coreos.ostree-commit label found in metadata! Inspecting...
I0216 21:03:49.926234 3962714 rpm-ostree.go:412] Running captured: ostree refs --repo /run/mco-machine-os-content/os-content-821872843/srv/repo
error: error running ostree refs --repo /run/mco-machine-os-content/os-content-821872843/srv/repo: exit status 1
error: opening repo: opendir(/run/mco-machine-os-content/os-content-821872843/srv/repo): No such file or directory
sh-4.4#
After a reboot and a retry, I'm now getting:
sh-4.4# /run/bin/machine-config-daemon pivot "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a866916375fb02ee"
I0217 19:10:06.928154 1443914 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-903744214 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee
error: "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee" is not a valid image reference: invalid checksum digest length
W0217 19:10:07.176459 1443914 run.go:45] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-903744214 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee failed: error: "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee" is not a valid image reference: invalid checksum digest length
: exit status 1; retrying...
^C
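The "invalid checksum digest length" failure above is mechanical rather than cluster-related: a sha256 digest must be exactly 64 hex characters, and the digest typed into the pivot command is shorter than the one the machine-config-daemon logged. A small sketch (my helper, not part of any OpenShift tooling) that sanity-checks a pullspec before running the command:

```python
import re

def has_valid_sha256(pullspec):
    """True if the pullspec ends in sha256: followed by 64 hex characters."""
    return re.search(r"sha256:[0-9a-f]{64}$", pullspec) is not None

# Digest as it appears in the MCD log (64 hex chars):
print(has_valid_sha256(
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:"
    "f454144d2c32aa6fd99b8c68082f59554751282865dce6a866916d3c75fb02ee"))  # True

# Digest as pasted into the failing pivot command (characters missing):
print(has_valid_sha256(
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:"
    "f454144d2c32aa6fd99b8c68082f59554751282865dce6a8669163c75fb02ee"))  # False
```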
I tried this:
/run/bin/machine-config-daemon pivot "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:f454144d2c32aa6fd99b8c68082f59554751282865dce6a866916375fb02ee"
expecting this result (from a previous upgrade problem):
sh-4.4# chroot /host
sh-4.4# /run/bin/machine-config-daemon pivot "quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17"
I0208 21:50:00.408235 2962835 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-3432684387 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17
I0208 21:50:29.727695 2962835 rpm-ostree.go:353] Running captured: rpm-ostree status --json
I0208 21:50:29.780350 2962835 rpm-ostree.go:261] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:7c252d64354d207cd7fb2a6e2404e611a29bf214f63a97345dee1846055c15d8
I0208 21:50:31.456928 2962835 rpm-ostree.go:293] Pivoting to: 411.86.202301242231-0 (b5390d80a8b7f90b0b64f9db3e92848591c967612740716c656c6e88696e0c3f)
I0208 21:50:31.456966 2962835 rpm-ostree.go:325] Executing rebase from repo path /run/mco-machine-os-content/os-content-3432684387/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17 and checksum b5390d80a8b7f90b0b64f9db3e92848591c967612740716c656c6e88696e0c3f
I0208 21:50:31.457048 2962835 update.go:1972] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-3432684387/srv/repo:b5390d80a8b7f90b0b64f9db3e92848591c967612740716c656c6e88696e0c3f --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev#sha256:73b311468554ffe8bdd0dd51df7dafd7a791a16c3147374cc7b28f0d3d7fcc17 --custom-origin-description Managed by machine-config-operator
0 metadata, 0 content objects imported; 0 bytes content written
Staging deployment... done
Upgraded:
NetworkManager 1:1.30.0-16.el8_4 -> 1:1.36.0-12.el8_6
< snip>
zlib 1.2.11-18.el8_4 -> 1.2.11-19.el8_6
Removed:
ModemManager-glib-1.10.8-2.el8.x86_64
libmbim-1.20.2-1.el8.x86_64
libqmi-1.24.0-1.el8.x86_64
openvswitch2.16-2.16.0-108.el8fdp.x86_64
redhat-release-coreos-410.84-2.el8.x86_64
Added:
WALinuxAgent-udev-2.3.0.2-2.el8_6.3.noarch
glibc-gconv-extra-2.28-189.5.el8_6.x86_64
libbpf-0.4.0-3.el8.x86_64
openvswitch2.17-2.17.0-67.el8fdp.x86_64
redhat-release-8.6-0.1.el8.x86_64
redhat-release-eula-8.6-0.1.el8.x86_64
shadow-utils-subid-2:4.6-16.el8.x86_64
Run "systemctl reboot" to start a reboot
sh-4.4# systemctl reboot

COLMAP running error in remote server while running Non-Rigid NeRF

I was checking the GitHub code of LLFF (https://github.com/Fyusion/LLFF) and Non-Rigid NeRF (https://github.com/facebookresearch/nonrigid_nerf) and followed the suggested steps to install the requirements. While running a preprocessing script that recovers poses from images by SfM using COLMAP, I got the following error on a remote server. Can anyone please help me solve this?
python preprocess.py --input data/example_sequence1/
Need to run COLMAP
qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.
Available platform plugins are: eglfs, minimal, minimalegl, offscreen, vnc, webgl, xcb.
*** Aborted at 1660905461 (unix time) try "date -d #1660905461" if you are using GNU date ***
PC: # 0x0 (unknown)
*** SIGABRT (#0x3e900138a9f) received by PID 1280671 (TID 0x7f5740d49000) from PID 1280671; stack trace: ***
# 0x7f57463a2197 google::(anonymous namespace)::FailureSignalHandler()
# 0x7f574421f420 (unknown)
# 0x7f5743bf300b gsignal
# 0x7f5743bd2859 abort
# 0x7f57442be35b QMessageLogger::fatal()
# 0x7f574477c799 QGuiApplicationPrivate::createPlatformIntegration()
# 0x7f574477cb6f QGuiApplicationPrivate::createEventDispatcher()
# 0x7f57443dbb62 QCoreApplicationPrivate::init()
# 0x7f574477d1e1 QGuiApplicationPrivate::init()
# 0x7f5744c03bc5 QApplicationPrivate::init()
# 0x562bbb634975 colmap::RunFeatureExtractor()
# 0x562bbb61d1a0 main
# 0x7f5743bd4083 __libc_start_main
# 0x562bbb620e39 (unknown)
Traceback (most recent call last):
File "imgs2poses.py", line 18, in <module>
gen_poses(args.scenedir, args.match_type)
File "/data1/user_data/ashish/NeRF/LLFF/llff/poses/pose_utils.py", line 268, in gen_poses
run_colmap(basedir, match_type)
File "/data1/user_data/ashish/NeRF/LLFF/llff/poses/colmap_wrapper.py", line 35, in run_colmap
feat_output = ( subprocess.check_output(feature_extractor_args, universal_newlines=True) )
File "/home/ashish/anaconda3/envs/nrnerf/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/home/ashish/anaconda3/envs/nrnerf/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['colmap', 'feature_extractor', '--database_path', 'scenedir/database.db', '--image_path', 'scenedir/images', '--ImageReader.single_camera', '1']' died with <Signals.SIGABRT: 6>.
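The qt.qpa.xcb lines above mean the COLMAP binary links against Qt and is trying to open an X display that a headless remote server does not have. A common workaround (my suggestion, not from the post) is to force one of the plugins the error itself lists, e.g. offscreen, via the QT_QPA_PLATFORM environment variable when LLFF's colmap_wrapper.py spawns the process; running the command under xvfb-run is another option. A sketch:

```python
import os
import subprocess

# Force Qt's "offscreen" platform plugin so colmap does not need an X display.
# "offscreen" appears in the error's own list of available plugins.
env = dict(os.environ, QT_QPA_PLATFORM="offscreen")

# The same feature_extractor invocation that colmap_wrapper.py builds:
feature_extractor_args = [
    "colmap", "feature_extractor",
    "--database_path", "scenedir/database.db",
    "--image_path", "scenedir/images",
    "--ImageReader.single_camera", "1",
]
# Uncomment on a machine with colmap installed:
# subprocess.check_output(feature_extractor_args, universal_newlines=True, env=env)
```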

openpose issue when running the example: check failed: error == cudaSuccess (2 vs. 0) out of memory, result in core dumped

Has anyone encountered this issue when using OpenPose 1.7 under Ubuntu 20.04?
I cannot run the provided example; it simply core dumps. CUDA version: 11.3, NVIDIA driver version: 465.19.01, GPU: GeForce RTX 3070.
dys#dys:~/Desktop/openpose$ ./build/examples/openpose/openpose.bin --video examples/media/video.avi
Starting OpenPose demo...
Configuring OpenPose...
Starting thread(s)...
Auto-detecting all available GPUs... Detected 1 GPU(s), using 1 of them starting at GPU 0.
F0610 18:34:51.300406 28248 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
# 0x7f63e2bc51c3 google::LogMessage::Fail()
# 0x7f63e2bca25b google::LogMessage::SendToLog()
# 0x7f63e2bc4ebf google::LogMessage::Flush()
# 0x7f63e2bc56ef google::LogMessageFatal::~LogMessageFatal()
# 0x7f63e28ffe2a caffe::SyncedMemory::mutable_gpu_data()
# 0x7f63e27796a6 caffe::Blob<>::mutable_gpu_data()
# 0x7f63e293a9ee caffe::CuDNNConvolutionLayer<>::Forward_gpu()
# 0x7f63e28bfb62 caffe::Net<>::ForwardFromTo()
# 0x7f63e327a25e op::NetCaffe::forwardPass()
# 0x7f63e32971ea op::PoseExtractorCaffe::forwardPass()
# 0x7f63e329228b op::PoseExtractor::forwardPass()
# 0x7f63e328fd80 op::WPoseExtractor<>::work()
# 0x7f63e32c0c7f op::Worker<>::checkAndWork()
# 0x7f63e32c0e0b op::SubThread<>::workTWorkers()
# 0x7f63e32ce8ed op::SubThreadQueueInOut<>::work()
# 0x7f63e32c5981 op::Thread<>::threadFunction()
# 0x7f63e2f04d84 (unknown)
# 0x7f63e2c07609 start_thread
# 0x7f63e2d43293 clone
Aborted (core dumped)
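Since the failure is in CuDNNConvolutionLayer&lt;&gt;::Forward_gpu() with error 2 (out of memory), the usual first mitigation is to shrink the network input with OpenPose's --net_resolution flag, which trades some accuracy for GPU memory; as I understand it, OpenPose expects the dimensions to be multiples of 16, and -1 lets it derive the width from the aspect ratio. A hedged sketch that just prints candidate commands to try (the helper is mine):

```python
def net_resolution_flag(height):
    """Build a --net_resolution value; heights must be multiples of 16,
    and -1 tells OpenPose to derive the width from the aspect ratio."""
    if height % 16 != 0:
        raise ValueError("net_resolution dimensions must be multiples of 16")
    return f"-1x{height}"

# Progressively smaller input resolutions use progressively less GPU memory:
for h in (368, 256, 160):
    print("./build/examples/openpose/openpose.bin"
          " --video examples/media/video.avi"
          f" --net_resolution {net_resolution_flag(h)}")
```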

3D caffe make runtest error

I'm building 3D-Caffe: make all and make test are fine, but make runtest fails as shown here.
It looks like it's related to the GPU setting, but I am not sure.
[----------] 4 tests from SoftmaxWithLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] SoftmaxWithLossLayerTest/3.TestGradient
*** Aborted at 1493416676 (unix time) try "date -d #1493416676" if you are using GNU date ***
PC: # 0x7f4ddfd59a05 caffe::Blob<>::gpu_data()
*** SIGSEGV (#0x17ec) received by PID 15580 (TID 0x7f4de5f8fac0) from PID 6124; stack trace: ***
# 0x7f4ddf3a1390 (unknown)
# 0x7f4ddfd59a05 caffe::Blob<>::gpu_data()
# 0x7f4ddfd93ad0 caffe::SoftmaxWithLossLayer<>::Forward_gpu()
# 0x45ba59 caffe::Layer<>::Forward()
# 0x4844a0 caffe::GradientChecker<>::CheckGradientSingle()
# 0x487603 caffe::GradientChecker<>::CheckGradientExhaustive()
# 0x5d44c7 caffe::SoftmaxWithLossLayerTest_TestGradient_Test<>::TestBody()
# 0x8ac7d3 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x8a5dea testing::Test::Run()
# 0x8a5f38 testing::TestInfo::Run()
# 0x8a6015 testing::TestCase::Run()
# 0x8a72ef testing::internal::UnitTestImpl::RunAllTests()
# 0x8a7613 testing::UnitTest::Run()
# 0x4512a9 main
# 0x7f4ddefe7830 __libc_start_main
# 0x4577c9 _start
# 0x0 (unknown)
Makefile:468: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault (core dumped)
I would love to help you, but your message is unclear. I would suggest checking your grammar and word choice next time.
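As a first debugging step (my suggestion; the thread has no real answer), the crashing case can be rerun in isolation: as the log in the first question shows, Caffe's test binary takes a device ID as its first argument and accepts standard googletest flags, so a --gtest_filter narrows the run to the single failing instantiation. A sketch that builds the command (the binary path assumes a default Makefile layout; adjust .build_release vs .build_debug to match your build):

```python
import shlex

# Rerun only the failing GPUDevice<double> case in isolation.
binary = ".build_release/test/test_all.testbin"   # assumption: default Makefile layout
device_id = "0"
gtest_filter = "SoftmaxWithLossLayerTest/3.TestGradient"

cmd = [binary, device_id, f"--gtest_filter={gtest_filter}"]
print(" ".join(shlex.quote(part) for part in cmd))
```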

Sidekiq server is not processing scheduled jobs when started using systemd

I have a Cuba application which I want to use Sidekiq with.
This is how I set up the config.ru:
require './app'
require 'sidekiq'
require 'sidekiq/web'
environment = ENV['RACK_ENV'] || "development"
config_vars = YAML.load_file("./config.yml")[environment]
Sidekiq.configure_client do |config|
config.redis = { :url => config_vars["redis_uri"] }
end
Sidekiq.configure_server do |config|
config.redis = { url: config_vars["redis_uri"] }
config.average_scheduled_poll_interval = 5
end
# run Cuba
run Rack::URLMap.new('/' => Cuba, '/sidekiq' => Sidekiq::Web)
I started Sidekiq using systemd. This is the systemd unit, which I adapted from the sidekiq.service file on the Sidekiq site:
#
# systemd unit file for CentOS 7, Ubuntu 15.04
#
# Customize this file based on your bundler location, app directory, etc.
# Put this in /usr/lib/systemd/system (CentOS) or /lib/systemd/system (Ubuntu).
# Run:
# - systemctl enable sidekiq
# - systemctl {start,stop,restart} sidekiq
#
# This file corresponds to a single Sidekiq process. Add multiple copies
# to run multiple processes (sidekiq-1, sidekiq-2, etc).
#
# See Inspeqtor's Systemd wiki page for more detail about Systemd:
# https://github.com/mperham/inspeqtor/wiki/Systemd
#
[Unit]
Description=sidekiq
# start us only once the network and logging subsystems are available,
# consider adding redis-server.service if Redis is local and systemd-managed.
After=syslog.target network.target
# See these pages for lots of options:
# http://0pointer.de/public/systemd-man/systemd.service.html
# http://0pointer.de/public/systemd-man/systemd.exec.html
[Service]
Type=simple
Environment=RACK_ENV=development
WorkingDirectory=/media/temp/bandmanage/repos/fall_prediction_verification
# If you use rbenv:
#ExecStart=/bin/bash -lc 'pwd && bundle exec sidekiq -e production'
ExecStart=/home/froy001/.rvm/wrappers/fall_prediction/bundle exec "sidekiq -r app.rb -L log/sidekiq.log -e development"
# If you use the system's ruby:
#ExecStart=/usr/local/bin/bundle exec sidekiq -e production
User=root
Group=root
UMask=0002
# if we crash, restart
RestartSec=1
Restart=on-failure
# output goes to /var/log/syslog
StandardOutput=syslog
StandardError=syslog
# This will default to "bundler" if we don't specify it
SyslogIdentifier=sidekiq
[Install]
WantedBy=multi-user.target
The code calling the worker is :
raw_msg = JSON.parse(req.body.read, {:symbolize_names => true})
if raw_msg
ts = raw_msg[:ts]
waiting_period = (1000*60*3) # wait 3 min before checking
perform_at_time = Time.at((ts + waiting_period)/1000).utc
FallVerificationWorker.perform_at((0.5).minute.from_now, raw_msg)
my_res = { result: "success", status: 200}.to_json
res.status = 200
res.write my_res
else
my_res = { result: "not found", status: 404}.to_json
res.status = 404
res.write my_res
end
I am only using the default queue.
My problem is that the job is not being processed at all.
After you run systemctl enable sidekiq (so that it starts at boot) and systemctl start sidekiq (so that it starts immediately), you should have some logs to review, which will provide detail about any failure to start:
sudo journalctl -u sidekiq
Review the logs, review the systemd docs, and adjust your unit file as needed. You can find all the installed systemd documentation with apropos systemd. Some of the most useful man pages to review are systemd.service, systemd.exec, and systemd.unit.
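One detail worth double-checking in the unit file above (my observation, not part of the original answer): systemd does not pass ExecStart through a shell, so wrapping the Sidekiq arguments in quotes makes systemd hand bundle the entire quoted string as a single argument, and bundle then tries to exec a program literally named "sidekiq -r app.rb -L log/sidekiq.log -e development". The line is usually shaped like this, with each word a separate argument (a sketch; verify the rvm wrapper path for your gemset):

```ini
ExecStart=/home/froy001/.rvm/wrappers/fall_prediction/bundle exec sidekiq -r app.rb -L log/sidekiq.log -e development
```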