Failed to install ROCm on Ubuntu 20.04 [closed] - deep-learning

I would like to set up an AMD Radeon GPU for deep learning on Ubuntu. The main libraries for my work are Keras and PyTorch. I strictly followed the ROCm installation guide here, but it failed at the third step, on the command sudo apt install rocm-dkms. The error messages were as follows.
Setting up dkms (2.8.1-5ubuntu1) ...
Setting up hip-rocclr (4.0.20496.5685.40000-23) ...
Setting up rock-dkms (1:4.0-23) ...
Loading new amdgpu-4.0-23 DKMS files...
Building for 5.8.0-41-generic
Building for architecture x86_64
Building initial module for 5.8.0-41-generic
Error! Bad return status for module build on kernel: 5.8.0-41-generic (x86_64)
Consult /var/lib/dkms/amdgpu/4.0-23/build/make.log for more information.
dpkg: error processing package rock-dkms (--configure):
installed rock-dkms package post-installation script subprocess returned error
exit status 10
Setting up g++-9 (9.3.0-17ubuntu1~20.04) ...
Setting up g++ (4:9.3.0-1ubuntu2) ...
update-alternatives: using /usr/bin/g++ to provide /usr/bin/c++ (c++) in auto mode
Setting up build-essential (12.8ubuntu1.1) ...
dpkg: dependency problems prevent configuration of rocm-dkms:
rocm-dkms depends on rock-dkms; however:
Package rock-dkms is not configured yet.
dpkg: error processing package rocm-dkms (--configure):
dependency problems - leaving unconfigured
Setting up gcc-multilib (4:9.3.0-1ubuntu2) ...
No apport report written because the error message indicates its a followup error from a previous failure.
Setting up g++-9-multilib (9.3.0-17ubuntu1~20.04) ...
Setting up g++-multilib (4:9.3.0-1ubuntu2) ...
Processing triggers for sgml-base (1.29.1) ...
Setting up x11proto-dev (2019.2-1ubuntu1) ...
Setting up libxau-dev:amd64 (1:1.0.9-0ubuntu1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
Processing triggers for man-db (2.9.1-1) ...
Setting up libxdmcp-dev:amd64 (1:1.1.3-0ubuntu1) ...
Setting up x11proto-core-dev (2019.2-1ubuntu1) ...
Setting up libxcb1-dev:amd64 (1.14-2) ...
Setting up libx11-dev:amd64 (2:1.6.9-2ubuntu1.1) ...
Setting up libglx-dev:amd64 (1.3.2-1~ubuntu0.20.04.1) ...
Setting up libgl-dev:amd64 (1.3.2-1~ubuntu0.20.04.1) ...
Setting up mesa-common-dev:amd64 (20.2.6-0ubuntu0.20.04.1) ...
Setting up rocm-opencl-dev (3.6Beta-17-g875c1f8-rocm-rel-4.0-23) ...
Setting up rocm-clang-ocl (0.5.0.64-rocm-rel-4.0-23-50fb51a) ...
Setting up rocm-utils (4.0.0.40000-23) ...
Setting up rocm-dev (4.0.0.40000-23) ...
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
Errors were encountered while processing:
rock-dkms
rocm-dkms
E: Sub-process /usr/bin/dpkg returned an error code (1)
My kernel version is 5.8.0-41-generic, my graphics card is a Gigabyte Radeon RX 6900 XT, and my CPU is an AMD Ryzen 9 3900 XT. I tried several solutions suggested in previous posts, but none of them solved the problem. Do you have any suggestions for fixing this?

I've been having the same issue as well. The only fix I found was to roll back to the 5.6.0-1042-oem kernel. The AMD drivers don't seem to support any kernel newer than this one.
Edit: This is also a way to get the amdgpu-pro drivers to install without a problem.
WARNING: I'm writing all this after the fact and I might have missed a step or something along the way. Please be very careful, especially when removing kernels and working in your boot directory. If you're uncomfortable with the idea of wrecking your system, you can instead set GRUB's default selection, which is a lot safer than removing an initramfs.
Here's how I got ROCm working:
sudo apt install linux-image-5.6.0-1042-oem linux-headers-5.6.0-1042-oem && reboot
Make sure you boot into the 5.6 kernel by accessing the Ubuntu advanced options in grub.
sudo apt remove linux-image-5.8.0-41-generic linux-headers-5.8.0-41-generic && sudo apt autoremove && reboot
Again, you'll have to boot into 5.6 through the advanced options. (Hold the Shift key after the BIOS finishes loading to get the Ubuntu Advanced Options menu.) Once you're back in, it's a good idea to put the kernel image and headers packages on hold, because a kernel update will most likely break ROCm.
sudo apt-mark hold linux-image-generic linux-headers-generic
Now we're going to try and flush out the 5.8 kernel. Start by flushing out the temporary files.
sudo rm -rv ${TMPDIR:-/var/tmp}/mkinitramfs-*
Now list all of the kernels installed.
dpkg -l | tail -n +6 | grep -E 'linux-image-[0-9]+'
And try to remove the 5.8 kernel. Do this for any kernel you have above the 5.6 one we installed.
sudo update-initramfs -d -k 5.8.0-41-generic
The vmlinuz, System.map, and config files for that kernel are still present in the boot directory, so we need to clear those out to get grub working properly again.
cd /boot/
sudo rm vmlinuz-5.8.0-41-generic System.map-5.8.0-41-generic config-5.8.0-41-generic
Now you should be finally ready to update grub
sudo update-grub && reboot
Now, when you boot back in, you should be able to install ROCm:
sudo apt install rocm-dkms
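If removing the 5.8 kernel feels too risky, the safer route mentioned in the warning above is to make GRUB boot the 5.6 kernel by default. Here is a minimal sketch, assuming the stock Ubuntu menu titles ("Advanced options for Ubuntu" and "Ubuntu, with Linux <version>"). The helper name grub_default_entry is made up for illustration, and the titles should be checked against /boot/grub/grub.cfg before relying on them.

```shell
# Hypothetical helper: print a GRUB_DEFAULT value for a given kernel version.
# Assumes stock Ubuntu menu titles; verify them in /boot/grub/grub.cfg.
grub_default_entry() {
    printf 'Advanced options for Ubuntu>Ubuntu, with Linux %s\n' "$1"
}

# Usage (not run automatically):
#   grub_default_entry 5.6.0-1042-oem
#   # put the printed string in /etc/default/grub as GRUB_DEFAULT="...",
#   # then run: sudo update-grub
```

With this in place the 5.8 kernel can stay installed as a fallback, and nothing under /boot has to be deleted by hand.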

As per the official notes in this link, the AMD ROCm platform is designed to support Ubuntu 20.04.1 (kernels 5.4 and 5.6-oem) and 18.04.5 (kernel 5.4).
So kernel version 5.8 is not supported. Downgrading is an option, but before rushing into that you can simply boot into an older kernel that is already installed.
Try the following steps:
Restart your computer.
Wait for the GRUB menu to open (how to open the GRUB menu: link).
Select "Advanced options for Ubuntu".
Select an alternate kernel from the list shown.
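Before retrying the install, it may help to confirm which kernel is actually running. The sketch below checks a release string only against the series named in this answer (5.4 and 5.6-oem). The function name is_rocm_supported_kernel is hypothetical, and the patterns are a rough reading of the notes, not an official compatibility test.

```shell
# Hypothetical helper: succeed only for the kernel series this answer lists
# as supported on Ubuntu 20.04.1 (5.4 and 5.6-oem).
is_rocm_supported_kernel() {
    case "$1" in
        5.4.*)      return 0 ;;
        5.6.*-oem)  return 0 ;;
        *)          return 1 ;;
    esac
}

# Usage (not run automatically):
#   if is_rocm_supported_kernel "$(uname -r)"; then
#       echo "kernel looks supported"
#   else
#       echo "boot an older kernel before installing rocm-dkms"
#   fi
```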


Gnome Boxes on Fedora 33 fails to open

I attempt to load gnome-boxes from the terminal (I'm running Fedora 33) and get the following error
$ gnome-boxes
(gnome-boxes:3194): Gtk-WARNING **: 12:34:57.343: GtkFlowBox with a model will ignore sort and filter functions
(gnome-boxes:3194): Gtk-WARNING **: 12:34:57.344: GtkListBox with a model will ignore sort and filter functions
(gnome-boxes:3194): Boxes-WARNING **: 12:34:57.904: libvirt-machine.vala:83: Failed to disable 3D Acceleration
(gnome-boxes:3194): Boxes-WARNING **: 12:34:57.913: libvirt-broker.vala:70: Failed to update domain 'fedora33-wor-2': Failed to set domain configuration: XML error: Invalid PCI address 0000:04:00.0. slot must be >= 1
(gnome-boxes:3194): Boxes-CRITICAL **: 12:34:57.916: boxes_vm_importer_get_source_media: assertion 'self != NULL' failed
Segmentation fault (core dumped)
My system:
$uname -a
Linux localhost.localdomain 5.9.16-200.fc33.x86_64 #1 SMP Mon Dec 21 14:08:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I don't know whether it's related, but I recently updated from kernel 5.9.11 directly to 5.9.16 (I hadn't used the PC in question for some weeks), and before that gnome-boxes was working as normal.
Please advise how I can restore gnome-boxes - I have some virtual machines that I need to access...
I faced this issue when I force-stopped GNOME Boxes while it was cloning a VM.
Deleting the conflicting VM will resolve your issue (in your case, 'fedora33-wor-2').
To delete the VM on Fedora, install "libvirt-client", which provides "virsh", using the command
dnf install libvirt-client
then double-check the available VMs using
virsh list --all
Delete the VM using the command
virsh undefine VM_Name
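Since undefine throws the definition away, one cautious variation is to save the domain XML first. This is only a sketch: the function name backup_and_undefine and the backup directory are made up for illustration, and I have not checked that dumpxml succeeds for a domain left in this broken state. If it fails, the function stops without undefining anything.

```shell
# Hypothetical helper: save a domain's XML before undefining it, so the
# definition can be inspected (or restored with `virsh define`) later.
backup_and_undefine() {
    vm_name=$1
    backup_dir=${2:-$HOME/libvirt-backup}   # assumed location, adjust freely
    mkdir -p "$backup_dir" || return 1
    # Stop here if the XML cannot be dumped; do not undefine blindly.
    virsh dumpxml "$vm_name" > "$backup_dir/$vm_name.xml" || return 1
    virsh undefine "$vm_name"
}

# Usage (not run automatically):
#   backup_and_undefine fedora33-wor-2
```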
Channel Fun's answer solved the problem of starting up gnome-boxes.
But the real problem is in the cloning procedure: the XML describing the new machine is malformed. Cloning with virt-clone instead:
virt-clone --original fedora33-ser --auto-clone
works properly.
I know this is an old thread, but I had the same problem recently.
I closed GNOME Boxes while it was cloning a VM, and then shut down the machine.
After that I couldn't open Boxes, as it would just crash.
I was able to delete the VM itself, and then deleted the XML file associated with it.
To delete the VM itself, go to:
$HOME/.var/app/org.gnome.Boxes/data/gnome-boxes/images (which in my case is a symbolic link to a data drive)
and delete the VM with the name that you were cloning to (or safer, just move it somewhere).
To delete the XML file associated with it:
$HOME/.var/app/org.gnome.Boxes/config/libvirt/qemu/
and delete (or safer move) the file that is named VM_NAME.xml.
Then boxes should open ok, at least it worked for me.
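The "safer, just move it" variant of these two steps could be scripted roughly as below. The paths are the Flatpak locations quoted in this answer. The function name quarantine_boxes_vm and the backup directory are assumptions for illustration, so adjust them if your images live elsewhere.

```shell
# Hypothetical helper: move a half-cloned VM's disk image and XML definition
# into a backup directory instead of deleting them.
quarantine_boxes_vm() {
    vm_name=$1
    backup_dir=${2:-$HOME/boxes-quarantine}   # assumed location
    base=$HOME/.var/app/org.gnome.Boxes
    mkdir -p "$backup_dir" || return 1
    for f in "$base/data/gnome-boxes/images/$vm_name" \
             "$base/config/libvirt/qemu/$vm_name.xml"; do
        # Skip files that are already gone rather than failing.
        if [ -e "$f" ]; then
            mv -- "$f" "$backup_dir/" || return 1
        fi
    done
}

# Usage (not run automatically):
#   quarantine_boxes_vm VM_NAME
```

Moving the files back restores the previous state if Boxes still refuses to start.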
Extending Channel Fun's answer: in the Ubuntu repos the package is libvirt-clients (note the plural s):
sudo apt install libvirt-clients
Check the available VMs using:
virsh list --all
Delete the VM using:
virsh undefine VM_Name
If you receive the error:
error: Refusing to undefine while domain managed save image exists
then you can explicitly remove that as well using the --managed-save flag:
virsh undefine VM_Name --managed-save

Failed to start zabbix3.0 in centos7

I ran into an issue when installing Zabbix 3.0 from packages on CentOS 7.
After I had finished setting up MySQL, PHP, and Apache and the configuration in zabbix.conf,
I ran systemctl start zabbix-server.service. It didn't work and showed:
Job for zabbix-server.service failed. See 'systemctl status zabbix-server.service' and 'journalctl -xn' for details.
Then my colleague told me to install trousers and gnutls, and after that zabbix-server worked. What are these two packages for? If they are necessary, why aren't they included in the Zabbix package?
Zabbix Server 3.0 won't start on CentOS 7 unless SELinux is disabled.
You can disable SELinux in /etc/selinux/config.
After that, you must reboot your server with reboot or shutdown -r now.
After the reboot, confirm that the getenforce command returns Disabled.
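The edit to /etc/selinux/config comes down to setting the SELINUX= line to disabled. Here is a rough sketch of doing that non-interactively. The function name disable_selinux_in_config is made up, GNU sed is assumed, and a .bak copy of the file is kept.

```shell
# Hypothetical helper: set SELINUX=disabled in an SELinux config file.
# Keeps a .bak copy; GNU sed is assumed (as shipped with CentOS 7).
disable_selinux_in_config() {
    config_file=${1:-/etc/selinux/config}
    sed -i.bak -E \
        's/^SELINUX=(enforcing|permissive|disabled)[[:space:]]*$/SELINUX=disabled/' \
        "$config_file"
}

# Usage (not run automatically, needs root):
#   getenforce                      # show the current mode
#   disable_selinux_in_config       # edit /etc/selinux/config
#   reboot
#   getenforce                      # should now print Disabled
```

The pattern is anchored on SELINUX= so the SELINUXTYPE= line in the same file is left alone.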
Most likely, you didn't install those packages, but upgraded them. They are linked in through the Jabber/XMPP support.
This was a bug in the Red Hat packages that took some time to resolve; see this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1071171
And this is the Zabbix issue tracking the same problem: https://support.zabbix.com/browse/ZBX-7790

CUDA 7.0 Error while compiling samples

I'm trying to install CUDA 7.0 on Ubuntu 14.04. I've followed the installation instructions as outlined here. Specifically, I've followed steps in section 3.6 and Chapter 6. While compiling the examples (Section 6.2.2.2) using make, I'm getting the following error:
make[1]: Entering directory `/usr/local/cuda-7.0/samples/3_Imaging/cudaDecodeGL'
/usr/local/cuda-7.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_20,
code=compute_20 -o cudaDecodeGL FrameQueue.o ImageGL.o VideoDecoder.o
VideoParser.o VideoSource.o cudaModuleMgr.o cudaProcessFrame.o
videoDecodeGL.o -L../../common/lib/linux/x86_64 -L/usr/lib/"nvidia-346"
-lGL -lGLU -lX11 -lXi -lXmu -lglut -lGLEW -lcuda -lcudart -lnvcuvid
/usr/bin/ld: cannot find -lnvcuvid
collect2: error: ld returned 1 exit status
make[1]: *** [cudaDecodeGL] Error 1
make[1]: Leaving directory `/usr/local/cuda-7.0/samples/3_Imaging/cudaDecodeGL'
make: *** [3_Imaging/cudaDecodeGL/Makefile.ph_build] Error 2
Notice the -L/usr/lib/"nvidia-346" in the link command. In my case, I have nvidia-349 installed. What worked for me was to edit NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk and change UBUNTU_PKG_NAME = "nvidia-346" to nvidia-349.
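The same edit can be made from the command line. This is a sketch only: the function name set_ubuntu_pkg_name is hypothetical, GNU sed is assumed, and the pattern assumes the assignment sits on its own line in the form UBUNTU_PKG_NAME = "nvidia-NNN", possibly indented.

```shell
# Hypothetical helper: rewrite the hard-coded driver package name in a
# findgllib.mk file. GNU sed is assumed; a .bak copy is kept.
set_ubuntu_pkg_name() {
    mk_file=$1
    driver_version=$2        # e.g. 349
    sed -i.bak -E \
        "s/^([[:space:]]*UBUNTU_PKG_NAME[[:space:]]*=[[:space:]]*)\"nvidia-[0-9]+\"/\\1\"nvidia-${driver_version}\"/" \
        "$mk_file"
}

# Usage (not run automatically):
#   set_ubuntu_pkg_name NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk 349
```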
To install CUDA 7.0 properly on Ubuntu 14.04, you need NVIDIA driver version 346 or higher.
If you're using the .deb installation method, the NVIDIA graphics driver is installed automatically.
If you used the .run file installation method and chose not to install the NVIDIA driver, you can install it manually afterwards through the package manager:
sudo apt-add-repository ppa:xorg-edgers/ppa && sudo apt-get update
sudo apt-get install nvidia-346 nvidia-346-dev nvidia-346-uvm libcuda1-346 nvidia-libopencl1-346 nvidia-icd-346
In my case, I installed nvidia-352 afterwards due to a bug in nvidia-346 and I stumbled upon the same error.
andoum's approach of manually changing the hard-coded UBUNTU_PKG_NAME = "nvidia-346" to UBUNTU_PKG_NAME = "nvidia-352" in NVIDIA_CUDA-7.0_Samples/3_Imaging/cudaDecodeGL/findgllib.mk worked fine for me.
I ran into the same issue; the solution was to add the NVIDIA library path to the system environment:
sudo gedit /etc/environment
Add this line to the file:
LIBRARY_PATH=/usr/lib/your_nvidia_edition:$LIBRARY_PATH
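One caveat with this approach: as far as I know, /etc/environment is read as plain KEY=value pairs rather than as a shell script, so the trailing $LIBRARY_PATH is likely to be taken literally instead of being expanded. A per-user alternative is to set the variable from a shell startup file such as ~/.profile. The sketch below uses a made-up helper name, prepend_library_path, and your_nvidia_edition remains a placeholder for the real driver directory.

```shell
# Hypothetical helper: prepend a directory to LIBRARY_PATH (searched by gcc
# at link time) without leaving a stray ':' when the variable is empty.
prepend_library_path() {
    case ":${LIBRARY_PATH:-}:" in
        *":$1:"*) ;;                                   # already present
        *) LIBRARY_PATH="$1${LIBRARY_PATH:+:$LIBRARY_PATH}" ;;
    esac
    export LIBRARY_PATH
}

# Usage (not run automatically), e.g. in ~/.profile:
#   prepend_library_path /usr/lib/your_nvidia_edition
```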
I ran into this problem when running make, with CUDA 8.0 installed on Ubuntu 16.04. It confused me for several weeks, and after going through many suggestions found via Google I was close to reinstalling Ubuntu, but I finally resolved it myself.
First of all, replace every UBUNTU_PKG_NAME = "nvidia-3xx" with the version of the NVIDIA driver you actually have installed, as recommended above. You will then probably get link errors when you run make again. In my case, the errors looked like this:
/usr/bin/ld: warning: libGLX.so.0, needed by /usr/lib/nvidia-375/libGL.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libGLdispatch.so.0, needed by /usr/lib/nvidia-375/libGL.so, not found (try using -rpath or -rpath-link)
....
or other errors about missing libraries. Locate the missing files, for example:
$ locate libGLX.so.
/usr/lib/nvidia-375/libGLX.so.0
/usr/lib32/nvidia-375/libGLX.so.0
$ locate libGLdispatch.so.0
/usr/lib/nvidia-375/libGLdispatch.so.0
/usr/lib32/nvidia-375/libGLdispatch.so.0
The errors above are probably caused by the linker not finding these files in the library directories you configured, so you just need to copy the missing files to /usr/lib/nvidia-3xx/ (the actual path in your case). This worked for me; if it doesn't work for you, you could try symlinking the located files to where they are expected:
$ sudo ln -s (requested file) (requesting file)
Hope this will help.

JRuby disabled stack guard

I've installed JRuby 1.7.4, and every time a program is executed I get the following error:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/jruby/lib/native/arm-Linux/libjffi-1.2.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
I tried to get rid of the error message with "execstack -c /opt/jruby/lib/native/arm-Linux/libjffi-1.2.so", but the error keeps creeping in.
How can I disable or fix this error message?
Simply install a reasonably recent JRuby 1.7.x release, e.g. 1.7.17 at the time of writing.

I think OSX-gcc-installer broke some things and I'm not sure what to do

Long story short, I foolishly installed OSX-gcc-installer about a month ago, and at first it prevented me, I believe, from installing Ruby gems on my machine. With someone's help I did get that one issue fixed. Going through the other posts on this website I was not able to find the solution I'm looking for, so I decided to post here.
When I run brew doctor I get the following:
albys-mbp:folder alby$ brew doctor
Please note that these warnings are just used to help the Homebrew maintainers
with debugging if you file an issue. If everything you use Homebrew for is
working fine: please don't worry and just ignore them. Thanks!
Warning: You have an outdated version of /usr/bin/install_name_tool installed.
This will cause binary package installations to fail.
This can happen if you install osx-gcc-installer or RailsInstaller.
To restore it, you must reinstall OS X or restore the binary from
the OS packages.
Warning: Broken symlinks were found. Remove them with `brew prune`:
/usr/local/bin/aclocal
/usr/local/bin/aclocal-1.14
/usr/local/bin/autoconf
/usr/local/bin/autoheader
/usr/local/bin/autom4te
/usr/local/bin/automake
/usr/local/bin/automake-1.14
/usr/local/bin/autoreconf
/usr/local/bin/autoscan
/usr/local/bin/autoupdate
/usr/local/bin/cscope
/usr/local/bin/erb
/usr/local/bin/gem
/usr/local/bin/glibtool
/usr/local/bin/glibtoolize
/usr/local/bin/gpg-error
/usr/local/bin/gpg-error-config
/usr/local/bin/ifnames
/usr/local/bin/irb
/usr/local/bin/ksba-config
/usr/local/bin/ocs
/usr/local/bin/pkg-config
/usr/local/bin/rake
/usr/local/bin/rdoc
/usr/local/bin/ri
/usr/local/bin/ruby
/usr/local/bin/testrb
/usr/local/include/gpg-error.h
/usr/local/include/ksba.h
/usr/local/include/libltdl
/usr/local/include/ltdl.h
/usr/local/include/ruby-2.1.0
/usr/local/include/yaml.h
/usr/local/lib/libgpg-error.0.dylib
/usr/local/lib/libgpg-error.dylib
/usr/local/lib/libksba.8.dylib
/usr/local/lib/libksba.dylib
/usr/local/lib/libltdl.7.dylib
/usr/local/lib/libltdl.a
/usr/local/lib/libltdl.dylib
/usr/local/lib/libruby.2.1.0-static.a
/usr/local/lib/libruby.2.1.0.dylib
/usr/local/lib/libruby.2.1.dylib
/usr/local/lib/libruby.dylib
/usr/local/lib/libyaml-0.2.dylib
/usr/local/lib/libyaml.a
/usr/local/lib/libyaml.dylib
/usr/local/lib/pkgconfig/ruby-2.1.pc
/usr/local/lib/pkgconfig/yaml-0.1.pc
/usr/local/lib/ruby/2.1.0
/usr/local/lib/ruby/gems
/usr/local/share/aclocal/README
/usr/local/share/aclocal/argz.m4
/usr/local/share/aclocal/dirlist
/usr/local/share/aclocal/gpg-error.m4
/usr/local/share/aclocal/ksba.m4
/usr/local/share/aclocal/libtool.m4
/usr/local/share/aclocal/ltdl.m4
/usr/local/share/aclocal/ltoptions.m4
/usr/local/share/aclocal/ltsugar.m4
/usr/local/share/aclocal/ltversion.m4
/usr/local/share/aclocal/lt~obsolete.m4
/usr/local/share/aclocal/pkg.m4
/usr/local/share/aclocal-1.14
/usr/local/share/autoconf
/usr/local/share/automake-1.14
/usr/local/share/common-lisp
/usr/local/share/doc/automake
/usr/local/share/doc/pkg-config/pkg-config-guide.html
/usr/local/share/emacs
/usr/local/share/info/gpgrt.info
/usr/local/share/info/ksba.info
/usr/local/share/info/libtool.info
/usr/local/share/info/libtool.info-1
/usr/local/share/info/libtool.info-2
/usr/local/share/libtool
/usr/local/share/man/man1/aclocal-1.14.1
/usr/local/share/man/man1/aclocal.1
/usr/local/share/man/man1/autoconf.1
/usr/local/share/man/man1/autoheader.1
/usr/local/share/man/man1/autom4te.1
/usr/local/share/man/man1/automake-1.14.1
/usr/local/share/man/man1/automake.1
/usr/local/share/man/man1/autoreconf.1
/usr/local/share/man/man1/autoscan.1
/usr/local/share/man/man1/autoupdate.1
/usr/local/share/man/man1/config.guess.1
/usr/local/share/man/man1/config.sub.1
/usr/local/share/man/man1/cscope.1
/usr/local/share/man/man1/erb.1
/usr/local/share/man/man1/glibtool.1
/usr/local/share/man/man1/glibtoolize.1
/usr/local/share/man/man1/gpg-error-config.1
/usr/local/share/man/man1/ifnames.1
/usr/local/share/man/man1/irb.1
/usr/local/share/man/man1/pkg-config.1
/usr/local/share/man/man1/rake.1
/usr/local/share/man/man1/ri.1
/usr/local/share/man/man1/ruby.1
/usr/local/Library/LinkedKegs/autoconf
/usr/local/Library/LinkedKegs/automake
/usr/local/Library/LinkedKegs/cscope
/usr/local/Library/LinkedKegs/libgpg-error
/usr/local/Library/LinkedKegs/libksba
/usr/local/Library/LinkedKegs/libtool
/usr/local/Library/LinkedKegs/libyaml
/usr/local/Library/LinkedKegs/pkg-config
/usr/local/Library/LinkedKegs/ruby
Warning: You seem to have osx-gcc-installer installed.
Homebrew doesn't support osx-gcc-installer. It causes many builds to fail and
is an unlicensed distribution of really old Xcode files.
Please run `xcode-select --install` to install the CLT.
Warning: Some installed formula are missing dependencies.
You should `brew install` the missing dependencies:
brew install openssl
Run `brew missing` for more details.
alby-mbp:folder alby$
As you can also see above, I am missing openssl. This is because I uninstalled it and attempted to reinstall it thinking that this would help, but I was not able to reinstall. Here is what I get when I try to do brew install openssl:
albys-mbp:folder alby$ brew install openssl
Warning: You seem to have osx-gcc-installer installed.
Homebrew doesn't support osx-gcc-installer. It causes many builds to fail and
is an unlicensed distribution of really old Xcode files.
Please run `xcode-select --install` to install the CLT.
Warning: You have an outdated version of /usr/bin/install_name_tool installed.
This will cause binary package installations to fail.
This can happen if you install osx-gcc-installer or RailsInstaller.
To restore it, you must reinstall OS X or restore the binary from
the OS packages.
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/openssl-1.0.1j_1.mavericks.bottle.tar.gz
Already downloaded: /Library/Caches/Homebrew/openssl-1.0.1j_1.mavericks.bottle.tar.gz
Error: SHA1 mismatch
Expected: 65e125a4777eb6dfb63f01a18f724246123dd79e
Actual: eac5e2d21af64224fc533ebb793b99a2aea434c7
Archive: /Library/Caches/Homebrew/openssl-1.0.1j_1.mavericks.bottle.tar.gz
To retry an incomplete download, remove the file above.
Warning: Bottle installation failed: building from source.
==> Installing openssl dependency: makedepend
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/makedepend-1.0.5.mavericks.bottle.tar.gz
Already downloaded: /Library/Caches/Homebrew/makedepend-1.0.5.mavericks.bottle.tar.gz
Error: SHA1 mismatch
Expected: 83db1daee01e4eb752c711934eb88850b3ee70d6
Actual: eac5e2d21af64224fc533ebb793b99a2aea434c7
Archive: /Library/Caches/Homebrew/makedepend-1.0.5.mavericks.bottle.tar.gz
To retry an incomplete download, remove the file above.
Warning: Bottle installation failed: building from source.
Error: /usr/local/opt/pkg-config not present or broken
Please reinstall pkg-config. Sorry :(
albys-mbp:folder alby$
Another issue is that when I try and run mysql in bash, I get the following:
albys-mbp:folder alby$ mysql
dyld: Library not loaded: ##HOMEBREW_PREFIX##/opt/openssl/lib/libssl.1.0.0.dylib
Referenced from: /usr/local/bin/mysql
Reason: image not found
Trace/BPT trap: 5
albys-mbp:folder alby$
I am mostly new to programming and am not sure where to begin to solve this issue. I was able to back up my HD to an external HD, but only after installing the gcc-installer, which overwrote my Command Line Tools. So even if I were to do a reinstall, would that even work? I've never had to reinstall OS X before; would I be able to pick and choose files from the external HD, or is it all restored?
Also, I have OS X 10.9.
Thanks!