I am trying to set up a Slurm configuration file on Ubuntu 20.04. I have tried several things and searched for the errors on other websites (link1, link2, link3) and on the Slurm website as well. There is another similar question on SO as well.
Given the following information about my computer, what is the minimum information that must be provided in the slurm.conf file?
The general information for my computer:
RAM: 125.5 GB
CPU: 1-20 (Intel® Xeon(R) CPU E5-2687W v3 @ 3.10GHz × 20)
Graphics: NVIDIA Corporation GP104 [GeForce GTX 1080] / NVIDIA Corporation
OS: Ubuntu 20.04.2 LTS 64 bit
I want to have 2 nodes with 10 CPUs each and 1 node for the GPU.
Here is what I have tried. After writing the configuration, running
sudo systemctl restart slurmctld
completes with no error, but I get an error with slurmd:
sudo systemctl restart slurmd
The error is as below:
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
If I run systemctl status slurmd.service, I get:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2021-06-06 21:47:26 CEST; 1min 14s ago
Docs: man:slurmd(8)
Process: 52710 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Here is my configuration file slurm.conf, generated by configurator_easy.html and saved as /etc/slurm-llnl/slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=myhostname
#
AuthType=auth/menge
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
FirstJobId=0
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
KillWait=30
MinJobAge=300
MaxJobCount=10000
#PluginDir=/usr/local/lib
ReturnToService=0
SlurmdPort=6818
SlurmctldPort=6817
SlurmdSpoolDir=/var/spool/slurmd.spool
StateSaveLocation=/var/spool/slurm-llnl/slurm.state
SwitchType=switch/none
TmpFS=/tmp
WaitTime=30
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmUser=slurm
SlurmdUser=root
TaskPlugin=task/affinity
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=info
#SlurmdLogFile=
#
# COMPUTE NODES
NodeName=Linux[1-32] State=UP
NodeName=DEFAULT State=UNKNOWN
PartitionName=Linux[1-32] Default=YES
I have Ubuntu 20.04 running on WSL and was also struggling with setting up Slurm. It looks like everything is running fine for me now. I am still a beginner.
I recommend you really check the logs:
cat /var/log/slurmctld.log
cat /var/log/slurmd.log
In my case I had some permission issues and therefore had to make sure the Slurm-related directories were owned by the SlurmUser defined in your config.
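For example, on my setup the fix looked roughly like this (a sketch, assuming SlurmUser=slurm and the directories from your slurm.conf; adjust the paths if yours differ):
sudo mkdir -p /var/spool/slurm-llnl /var/log/slurm-llnl
sudo chown -R slurm:slurm /var/spool/slurm-llnl /var/log/slurm-llnl
Since your config sets SlurmdUser=root, the slurmd spool directory (/var/spool/slurmd.spool) can stay owned by root.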
At first glance, comparing your settings with mine, I see the following lines in your config which could cause the problem:
I wonder why you defined NodeName twice.
In my case the first NodeName entry has the value of SlurmctldHost.
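For comparison, a minimal compute-node section for a single 20-core host might look like the sketch below (the hostname, memory value and partition name are assumptions, and note that a PartitionName line needs a Nodes= list; splitting one physical machine into several Slurm nodes additionally requires running one slurmd per emulated node):
NodeName=myhostname CPUs=20 RealMemory=120000 State=UNKNOWN
PartitionName=debug Nodes=myhostname Default=YES State=UP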
Hope something of the above helps.
Regards
Edit: I would also refer to the following post, which could be similar to yours, if you run your command with sudo.
I attempt to launch gnome-boxes from the terminal (I'm running Fedora 33) and get the following error:
$ gnome-boxes
(gnome-boxes:3194): Gtk-WARNING **: 12:34:57.343: GtkFlowBox with a model will ignore sort and filter functions
(gnome-boxes:3194): Gtk-WARNING **: 12:34:57.344: GtkListBox with a model will ignore sort and filter functions
(gnome-boxes:3194): Boxes-WARNING **: 12:34:57.904: libvirt-machine.vala:83: Failed to disable 3D Acceleration
(gnome-boxes:3194): Boxes-WARNING **: 12:34:57.913: libvirt-broker.vala:70: Failed to update domain 'fedora33-wor-2': Failed to set domain configuration: XML error: Invalid PCI address 0000:04:00.0. slot must be >= 1
(gnome-boxes:3194): Boxes-CRITICAL **: 12:34:57.916: boxes_vm_importer_get_source_media: assertion 'self != NULL' failed
Segmentation fault (core dumped)
My system:
$uname -a
Linux localhost.localdomain 5.9.16-200.fc33.x86_64 #1 SMP Mon Dec 21 14:08:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I don't know whether it's related, but I recently updated from kernel 5.9.11 directly to 5.9.16 (I hadn't used the PC in question for some weeks), and before that gnome-boxes was working normally.
Please advise how I can restore gnome-boxes - I have some virtual machines that I need to access...
I faced this issue when I force-stopped Gnome Boxes while cloning a VM.
Deleting the conflicting VM will resolve your issue (in your case 'fedora33-wor-2').
To delete the VM in Fedora, install "libvirt-client", which provides "virsh", using the command
dnf install libvirt-client
then double-check the available VMs using
virsh list --all
Delete the VM using the command
virsh undefine VM_Name
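For example, with the domain name from the error message above (double-check the exact name in the virsh list output first):
virsh undefine fedora33-wor-2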
#channel-fun solved the problem of starting up gnome-boxes.
But the real problem is in the cloning procedure. The XML describing the new machine is malformed.
virt-clone --original fedora33-ser --auto-clone
works properly.
I know this is an old thread, but I had the same problem recently.
I shut down Gnome Boxes whilst it was cloning a VM, and then shut down the machine.
I then couldn't open Boxes, as it would just crash.
I was able to delete the VM itself, and then deleted the XML file associated with it.
To delete the VM itself, go to:
$HOME/.var/app/org.gnome.Boxes/data/gnome-boxes/images (which in my case is a symbolic link to a data drive)
and delete the VM with the name that you were cloning to (or safer, just move it somewhere).
To delete the XML file associated with it:
$HOME/.var/app/org.gnome.Boxes/config/libvirt/qemu/
and delete (or, safer, move) the file named VM_NAME.xml.
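Put together, the manual cleanup might look like this (a sketch for the Flatpak build of Boxes; VM_NAME and the backup directory are placeholders):
mkdir -p ~/boxes-backup
mv "$HOME/.var/app/org.gnome.Boxes/data/gnome-boxes/images/VM_NAME" ~/boxes-backup/
mv "$HOME/.var/app/org.gnome.Boxes/config/libvirt/qemu/VM_NAME.xml" ~/boxes-backup/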
Then boxes should open ok, at least it worked for me.
Extending Channel Fun's answer: for Ubuntu repos the package is libvirt-clients (note the plural s):
sudo apt install libvirt-clients
Check the available VMs using:
virsh list --all
Delete the VM using:
virsh undefine VM_Name
If you receive the error:
error: Refusing to undefine while domain managed save image exists
Then you can explicitly remove that also using the --managed-save flag:
virsh undefine VM_Name --managed-save
I am installing Galera 4 on top of MySQL 8 on Debian but can't make it work. Once I start the first node with the bootstrap command:
mysqld_bootstrap
it starts with the following options:
/usr/sbin/mysqld $$'$\'$\\\'--wsrep-new-cluster --wsrep-on\\\'\'' --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
The problem is that no PID file is created, and even though mysqld appears to be running, I can't connect to the database.
There is nothing going to the log file either, so I think it is ignoring the config files.
I have tried running the config validator:
mysqld --validate_config
but it hangs on a futex (checked with strace). In both cases it is not possible to kill mysqld normally and the -9 option has to be used.
LXC is used to run this instance, with the following kernel:
Linux node01 4.15.18-26-pve #1 SMP PVE 4.15.18-54 (Sat, 15 Feb 2020 15:34:24 +0100) x86_64 GNU/Linux
The answer was pretty obvious after some investigation: rsync, which is used to sync the cluster, was not installed on the nodes, so they could not sync with each other.
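The fix is then simply to install rsync on every node (assuming the cluster uses the rsync SST method, i.e. wsrep_sst_method=rsync):
sudo apt install rsync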
This is the error I encountered when I updated my CentOS 8.1/RHEL 8.1 machines, and all the KVMs are showing the error below:
error: internal error: process exited while connecting to monitor: 2020-06-09T12:41:10.410896Z qemu-kvm: -machine pc-q35-rhel8.1.0,accel=kvm,usb=off,vmport=off,smm=on,dump-guest-core=off: unsupported machine type
Use -machine help to list supported machines
Note: The problem is that the machine type Q35 is not correctly stated/configured in your Kernel-based Virtual Machines running on RHEL 8/CentOS 8.
[Step 1:] cat /etc/libvirt/qemu/*.xml | grep '<name\|machine'
This will list the machine type in all of the KVMs installed.
[Output Snippet]
machine pc-q35-rhel8.1.0
[Step 2:] cd /etc/libvirt/qemu; ll
This will list all the XML files associated with your KVMs.
[Step 3:] In /etc/libvirt/qemu, use virsh edit <KVM file> ### Don't include .xml ###
Navigate to the machine attribute.
[Output Snippet]
<os>
<type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
<loader readonly='yes' secure='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>
<nvram>/var/lib/libvirt/qemu/nvram/Loadbalancer_VARS.fd</nvram>
<boot dev='hd'/>
</os>
Change machine='pc-q35-rhel8.1.0' to machine='q35'
Press Shift + ZZ to save and quit.
[Step 4:]
systemctl restart libvirtd && systemctl status -l libvirtd
virsh list --all
virsh start --domain <KVM>
Check the status of your running KVMs
virsh list --state-running
Now the issue should be resolved and your KVMs should be humming away.
Note though, if you head back in and check the configuration XML file with virsh edit, you'll see that q35 converts to pc-q35-rhel7.6.0 automatically.
But this shouldn't be an issue.
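If you have many guests, the same edit can be scripted rather than done one virsh edit at a time; a rough sketch (it assumes every domain uses the pc-q35-rhel8.1.0 type, so review the dumped XML before defining it back):
for vm in $(virsh list --all --name); do
  virsh dumpxml "$vm" | sed 's/pc-q35-rhel8\.1\.0/q35/' > "/tmp/$vm.xml"
  virsh define "/tmp/$vm.xml"
done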
Cheers :)
I would like to test whether my server creates a crash dump upon an OS crash. I can see the /etc/sysconfig/kdump config file is configured.
So I issued the kernel panic command echo c > /proc/sysrq-trigger; it crashed the server, but it never created a dump file for some reason. This is an HP BL460g7 blade with ASR disabled.
When I trigger the kernel panic the server crashes and sits there for about 10 minutes (it looks like it's trying to save a crash dump), but it never does. I checked the message logs but cannot see a reason why it's not dumping. The main problem is finding out why it's not dumping a crash file; are there any logs I can check to see what has really gone wrong?
I'm using SUSE Linux Enterprise Server 11 (x86_64) SP 1.
Did you follow the steps explained here?
SUSE Support - Configure kernel core dump capture
The most important tasks should be (see the sketch after this list):
install kdump, kexec-tools and makedumpfile
add crashkernel=... to the kernel command line (Grub)
chkconfig boot.kdump on
ensure that you have enough free space in /var/crash (default dir)
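On SLES 11 those tasks translate roughly into the following (a sketch; the crashkernel size is a placeholder, choose one that fits your RAM):
zypper install kdump kexec-tools makedumpfile
# append e.g. crashkernel=256M to the kernel line in /boot/grub/menu.lst
chkconfig boot.kdump on
df -h /var/crash    # confirm there is enough free space for a full dump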
Then please reboot your system and run:
sync; echo c >/proc/sysrq-trigger
After another boot please check for new files in /var/crash. If this doesn't work for you, please show us the content of /etc/sysconfig/kdump and at least the output of
cat /proc/cmdline
chkconfig boot.kdump
Do you have a display connected to the machine?
I have a KVM virtual machine running CentOS 7 as the guest OS. I'm trying to attach an additional disk to it on the fly (without shutting it down) using this command:
$ sudo virsh attach-disk centos --source /var/lib/libvirt/images/newdisk.img --target sdb --persistent
But receive an error:
error: Failed to attach disk
error: internal error: cannot update AppArmor profile 'libvirt-d2e7bbb8-c7b3-44ec-b0ea-27539e0df732'
If I do the same with a Debian guest, everything is OK.
What is the difference, and how can this be solved?
UPDATE:
I have a comment!
I compared the two VMs' XML and saw that the CentOS one has the QEMU agent in its configuration:
<channel type="unix">
<source mode="bind" path="/var/lib/libvirt/qemu/channel/target/centos_auto.org.qemu.guest_agent.0"></source>
<target name="org.qemu.guest_agent.0" type="virtio"></target>
<address bus="0" controller="0" port="1" type="virtio-serial"></address>
</channel>
Then I removed the qemu-ga channel, restarted the VM, and checked the "hot add" feature. It worked.
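For reference, the same removal can be done with virsh instead of editing the XML by hand; a sketch (it assumes the domain is named centos, as in the attach-disk command above, and that the <channel> block shown earlier was saved to channel.xml):
virsh detach-device centos channel.xml --config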
I tested it on other VMs (CentOS, Fedora, Debian) and saw the same.
As a result:
If I enable the qemu-agent, I cannot use hot plug.
If I use "hot plug", I must forget about the agent.
Is it my mistake in configuration or these features can't work together?
Host-OS: Ubuntu 15.10
QEMU emulator: now 2.4.92 (tested 2.3 and 2.4.1)
VMM: 1.3.0
This is a clear bug in the AppArmor security driver for libvirt. The existence of the QEMU guest agent config in the XML should have no impact on the ability to hotplug disks into a guest. This bug should be reported to the libvirt upstream, or to the Ubuntu bug trackers.
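A possible interim workaround (with an obvious security cost, so use it with care) is to disable the sVirt confinement in /etc/libvirt/qemu.conf and restart the daemon:
# /etc/libvirt/qemu.conf: run guests without the AppArmor security driver
security_driver = "none"
On Ubuntu 15.10 the restart would be sudo service libvirt-bin restart.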