Mesos DC/OS: how to configure multiple zones in 1.9 - configuration

I want to split up my agent nodes into multiple zones depending on the hardware the agent nodes are running on. How do I add zones in the setup configuration when installing?
And can an agent node be in multiple zones at the same time, i.e. both zone a and zone b, or just one?
From the DC/OS 1.9 install page:
All agents within a zone should be tagged with an attribute (e.g., zone:us-east-1a)
current config:
---
agent_list:
- 10.0.0.1
- 10.0.0.2
- 10.0.0.3
bootstrap_url: file:///opt/dcos_install_tmp
cluster_name: DC/OS
exhibitor_storage_backend: static
ip_detect_path: genconf/ip-detect
master_discovery: static
master_list:
- 10.0.0.3
process_timeout: 10000
public_agent_list:
- 10.0.0.5
resolvers:
- 8.8.8.8
- 8.8.4.4
ssh_key_path: genconf/ssh_key
ssh_port: 22
ssh_user: centos

I know you asked this 6 months ago... but if you are still using DC/OS 1.9 and seeking an answer:
I believe the issue you are seeing with zones is that the Mesos attributes are not set in the cluster's config.yaml file, but in a file that lives on each host node.
TL;DR: you need to create or edit /var/lib/dcos/mesos-slave-common on each agent so that it contains a list of Mesos attributes separated by semicolons, like so: MESOS_ATTRIBUTES=<key>:<value>;<key>:<value>
As an example (you can create any key:value pairs you would like):
MESOS_ATTRIBUTES=aws_instance_type:m4.xlarge;aws_availability_zone:us-east-1b
The next step is to remove the slave state and restart the agent. This will allow you to see (and restrict offers to) these attributes. Note that removing the latest slave state will kill any running tasks on the agent, because Mesos treats the addition of attributes as an agent re-registration event.
systemctl stop dcos-mesos-slave
rm -f /var/lib/mesos/slave/meta/slaves/latest
systemctl start dcos-mesos-slave
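To verify the attributes were picked up after the restart, one option is to query the Mesos master's /slaves endpoint (a sketch; it assumes direct access to the master on port 5050, and that jq is installed — on DC/OS the same state is also proxied under /mesos):
curl -s http://<master>:5050/slaves | jq '.slaves[] | {hostname, attributes}'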
The explicit DC/OS documentation on updating agents:
https://docs.mesosphere.com/1.9/administering-clusters/update-a-node/
How to launch Marathon tasks using those attributes:
https://github.com/mesosphere/marathon/blob/master/docs/docs/constraints.md
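As a quick illustration of such a constraint, here is a minimal sketch of a Marathon app definition that pins its tasks to the zone attribute set above (the app id and command are placeholders):
{
  "id": "/zone-pinned-app",
  "cmd": "sleep 3600",
  "instances": 2,
  "constraints": [["aws_availability_zone", "CLUSTER", "us-east-1b"]]
}
The CLUSTER operator restricts tasks to agents whose attribute equals the given value; LIKE and UNLIKE accept regular expressions if you want to match several zones at once. If you need an agent to match several groupings, you can attach additional attribute keys (e.g. a separate rack or tier key) alongside the zone one, as in the example above.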
Hope this helps!

Where do I find slurm diagnostic information when a job just hangs?

I am running Slurm 20.11.8 on a system with 1 login node and 3 compute nodes. The status information I can find is shown below.
$ slurmd -V
slurm 20.11.8
$ sinfo -N
NODELIST NODES PARTITION STATE
pauli-node-01 1 normal* idle
pauli-node-02 1 normal* idle
pauli-node-03 1 normal* idle
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2021-10-05 22:04:10 CDT; 10h ago
Main PID: 11802 (slurmctld)
Tasks: 7
Memory: 6.7M
CGroup: /system.slice/slurmctld.service
└─11802 /usr/sbin/slurmctld -D
Oct 05 22:04:10 pauli.mer.utexas.edu systemd[1]: Started Slurm controller daemon.
Here is my configuration file:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=pauli
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=pauli
#AccountingStoragePass=abcdef
#AccountingStoragePort=1234
AccountingStorageType=accounting_storage/none
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=pauli
#JobCompLoc=slurm_comp_db
#JobCompPass=abcdef
#JobCompPort=1234
JobCompType=jobcomp/none
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
#SlurmctldLogFile=
SlurmdDebug=info
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=pauli-node-0[1-2] CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
NodeName=pauli-node-03 CPUs=40 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
PartitionName=normal Nodes=pauli-node-0[1-3] Default=YES MaxTime=INFINITE State=UP
When I try to run a simple job on 1 node with srun (along the lines of Slurm: Quick Start User Guide), the job just hangs. Does anyone know where I should look for diagnostic information to figure out why the job hangs?
$ srun -N1 -n1 -l hostname
One of the first things to check is network connectivity, and make sure no firewall is in the way. You can check that with
scontrol ping
on the control nodes. Also, srun has a -v option that can tell you where it is blocked (you can repeat the option to increase the verbosity).
And of course, the log files for both the controller and the slurmd's may contain information. Again, the log level can be increased with scontrol setdebug.
The usual suspects, besides the firewall, are often SELinux, netmasks, and IP routes. Also make sure the clocks are in sync and munge is running OK.
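For instance, with the same test job as above (debug2 is one of the levels scontrol setdebug accepts):
srun -vvv -N1 -n1 -l hostname
scontrol setdebug debug2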
SOLVED. The firewall for all compute nodes must be either "off" or configured to trust the other nodes in the system.
See Compute node firewall must be off
I was able to run (on Red Hat Linux)
firewall-cmd --zone=trusted --add-source=10.xxx.xxx.xxx --add-source=10.xxx.xxx.xxx --add-source=10.xxx.xxx.xxx
on each compute node, in order to avoid turning off the firewall altogether. I think the reason the problem came up recently is that the firewall had probably been deactivated, and after a recent reboot of the system, it came back up.
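A follow-up note: by default firewall-cmd only changes the runtime configuration, so the rule above is lost on the next reboot. A minimal sketch to persist it (addresses are placeholders, as above):
firewall-cmd --permanent --zone=trusted --add-source=10.xxx.xxx.xxx
firewall-cmd --reload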
Thanks @damienfrancois for helping point me to firewall problems.

Changing the default behavior of Kubernetes

I have set up a K8S cluster (1 master and 2 slaves) using kubeadm on my laptop.
I deployed 6 replicas of a pod; 3 of them were scheduled onto each of the slaves.
I then did a shutdown of one of the slaves.
It took ~6 minutes for its 3 pods to be rescheduled on the running node.
Initially, I thought it had something to do with my K8S setup. After some digging I found out it is because of the K8S defaults for the Controller Manager and Kubelet, as mentioned here. That made sense. I checked the K8S documentation on where to change these configuration properties and also checked the configuration files on the cluster nodes, but couldn't figure it out.
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
Could someone point out what needs to be done to make the above-mentioned configuration changes permanent, and what the different options for doing so are?
On the kubelet side, change this file on all of your nodes:
/var/lib/kubelet/kubeadm-flags.env
Add the option at the end of, or anywhere on, this line:
KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin
--cni-conf-dir=/etc/cni/net.d --network-plugin=cni
--resolv-conf=/run/systemd/resolve/resolv.conf
--node-status-update-frequency=4s <== add this (the default is 10s; the question asks for 4s)
On your kube-controller-manager, change the following file on the master:
/etc/kubernetes/manifests/kube-controller-manager.yaml
In this section:
containers:
- command:
- kube-controller-manager
- --address=127.0.0.1
- --allocate-node-cidrs=true
- --cloud-provider=aws
- --cluster-cidr=192.168.0.0/16
- --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
- --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
- --controllers=*,bootstrapsigner,tokencleaner
- --kubeconfig=/etc/kubernetes/controller-manager.conf
- --leader-elect=true
- --node-cidr-mask-size=24
- --root-ca-file=/etc/kubernetes/pki/ca.crt
- --service-account-private-key-file=/etc/kubernetes/pki/sa.key
- --use-service-account-credentials=true
- --node-monitor-period=2s <== add this line (the default is 5s)
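The other controller-manager timeouts from the question go into the same command list; a sketch with the values the question asks for:
- --node-monitor-grace-period=16s
- --pod-eviction-timeout=30s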
On your master, do a sudo systemctl restart docker (the controller manager runs as a static pod, so restarting the container runtime recreates it from the updated manifest).
On all your nodes, do a sudo systemctl restart kubelet
The new configs should then take effect.
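One way to confirm the flags are live (a sketch) is to inspect the running processes:
ps -ef | grep kube-controller-manager | tr ' ' '\n' | grep node-monitor
ps -ef | grep kubelet | tr ' ' '\n' | grep node-status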
Hope it helps.

deploy router status Pending

I built an openshift-origin environment with the Ansible playbook, using the following hosts.
[node]
openshift-master.example.com
openshift-node01.example.com
openshift-node02.example.com
openshift-etcd.example.com
[/etc/ansible/hosts]
[OSEv3:children]
masters
nodes
etcd
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
ansible_ssh_user=root
deployment_type=origin
[masters]
openshift-master.example.com
[etcd]
openshift-etcd.example.com
# host group for nodes, includes region info
[nodes]
openshift-master.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
openshift-node01.example.com openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
openshift-node02.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
Then I logged in to OpenShift with the following command:
[login command]
oc login -u system:admin -n default
Then I tried to scale the router to two replicas with the following command:
[create router command]
oc scale dc/router --replicas=2
The following event occurs, and the second router replica cannot be scheduled.
[scheduling event]
Failed scheduling
pod (router-2-ievkl) failed to fit in any node
fit failure on node (openshift-node01.example.com): CheckServiceAffinity
fit failure on node (openshift-node02.example.com): CheckServiceAffinity
fit failure on node (openshift-master.example.com): PodFitsHostPorts
Given this situation, what do I need to change so that the second router replica can be scheduled?
Got the same issue after a clean install of Origin.
Uncordoning the masters did the trick. Thanks to lorenzvth7.
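For reference, a sketch of making a master schedulable again (the exact syntax varies by release; newer clients have oc adm uncordon, while older Origin releases used oadm manage-node <node> --schedulable=true):
oc adm uncordon openshift-master.example.com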
During advanced installation, the openshift_hosted_router_selector and openshift_registry_selector Ansible settings are set to region=infra by default. The default router and registry will only be automatically deployed if a node exists that matches the region=infra label.
Also, regarding the "PodFitsHostPorts" error in the topic starter's output:
Routers attach directly to ports 80 and 443 on all interfaces of a host. Restrict routers to hosts where ports 80/443 are available and not consumed by another service, and enforce this using node selectors and the scheduler configuration. For example, you can achieve this by dedicating infrastructure nodes to run services such as routers.
So this means you should re-label, for example, openshift-node01.example.com with region=infra.
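A minimal sketch of that relabeling, using the default region=infra selector mentioned above:
oc label node openshift-node01.example.com region=infra --overwrite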

Questions on starting a Locator using the snappydata/bin> ./snappy-shell script

SnappyData v0.5
Here's the command I used to start a Locator:
ubuntu#ip-172-31-8-115:/snappydata-0.5-bin/bin$ ./snappy-shell locator start
Starting SnappyData Locator using peer discovery on:
0.0.0.0[10334] Starting DRDA server for SnappyData at address localhost/127.0.0.1[1527]
Logs generated in /snappydata-0.5-bin/bin/snappylocator.log
SnappyData Locator pid: 9352 status: running
It looks like it starts the DRDA server locally, with no outside interface for a client to connect to. So, I cannot reach my SnappyData Locator using this JDBC URL from an outside client host (e.g. my SquirrelSQL editor).
This does not connect:
jdbc:snappydata://MY-AWS-PUBLIC-IP-HERE:1527/
What property do I pass to my ./snappy-shell locator start command to get the DRDA server to start on a public IP address instead of "localhost/127.0.0.1"?
Use the -client-bind-address and -client-port options. For a locator, also use the -peer-discovery-address and -peer-discovery-port options to specify the bind address for other locators/servers/leads (which is what they pass to their -locators=<address>:<port> setting):
snappy-shell locator start -peer-discovery-address=<internal IP for peers> -client-bind-address=<public IP for clients>
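For example, a sketch using the asker's internal address from the prompt above (0.0.0.0 binds the client port on all interfaces, so outside JDBC clients can reach it):
snappy-shell locator start -peer-discovery-address=172.31.8.115 -client-bind-address=0.0.0.0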
See the output of snappy-shell locator --help for commonly used options.
For SnappyData releases, you may find it much easier to use the global configuration for all of the locators, servers, and leads. Check "configuring the cluster" in the docs.
That allows specifying all options for all JVMs of the cluster in conf/locators, conf/leads, and conf/servers, then starting everything with snappy-start-all.sh, checking status with snappy-status-all.sh, and stopping it all with snappy-stop-all.sh.
On a related note, we at SnappyData Inc. are developing scripts to enable users to quickly launch a SnappyData cluster on AWS.
If you want to try them out, the steps below will guide you. We would love to hear your feedback on this.
Download the development branch: git clone https://github.com/SnappyDataInc/snappydata.git -b SNAP-864 (you don't need to clone the repo for this otherwise, but I could not find a way to attach the scripts here.)
Go to the ec2 directory: cd snappydata/cluster/ec2
Run snappy-ec2: ./snappy-ec2 -k ec2-keypair-name -i /path/to/keypair/private/key/file launch your-cluster-name
See this README for more details.

Emerge issue in Gentoo

I have an issue with my Gentoo system. I am trying to install BIND, but every time I attempt the install I get an error message.
Here is what happens in my Konsole:
emerge --ask net-dns/bind
* IMPORTANT: 3 config files in '/etc/portage' need updating.
* See the CONFIGURATION FILES section of the emerge
* man page to learn how to update config files.
These are the packages that would be merged, in order:
Calculating dependencies... done!
[ebuild R ] dev-libs/openssl-1.0.1g USE="-bindist*"
[ebuild N ] net-dns/bind-9.9.4_p2 USE="berkdb dlz gost ipv6 ldap odbc ssl -caps -doc -filter-aaaa -fixed-rrset -geoip -gssapi -idn -mysql -postgres -python -rpz -rrl -sdb-ldap (-selinux) -static-libs -threads -urandom -xml"
!!! Multiple package instances within a single package slot have been pulled
!!! into the dependency graph, resulting in a slot conflict:
dev-libs/openssl:0
(dev-libs/openssl-1.0.1g::gentoo, ebuild scheduled for merge) pulled in by
>=dev-libs/openssl-1.0.0:0[-bindist] required by (net-dns/bind-9.9.4_p2::gentoo, ebuild scheduled for merge)
dev-libs/openssl:0[-bindist] required by (net-dns/bind-9.9.4_p2::gentoo, ebuild scheduled for merge)
(dev-libs/openssl-1.0.1g::gentoo, installed) pulled in by
>=dev-libs/openssl-0.9.6d:0[bindist] required by (net-misc/openssh-5.9_p1-r4::gentoo, installed)
It may be possible to solve this problem by using package.mask to
prevent one of those packages from being selected. However, it is also
possible that conflicting dependencies exist such that they are
impossible to satisfy simultaneously. If such a conflict exists in
the dependencies of two different packages, then those packages can
not be installed simultaneously. You may want to try a larger value of
the --backtrack option, such as --backtrack=30, in order to see if
that will solve this conflict automatically.
For more information, see MASKED PACKAGES section in the emerge man
page or refer to the Gentoo Handbook.
!!! The following installed packages are masked:
- media-libs/mesa-9.0::gentoo (masked by: package.mask)
/usr/portage/profiles/package.mask:
# Chí-Thanh Christopher Nguyễn <chithanh@gentoo.org> (26 Mar 2014)
# Affected by multiple vulnerabilities, #445916, #471098 and #472280
For more information, see the MASKED PACKAGES section in the emerge
man page or refer to the Gentoo Handbook.
Can anyone show me how to resolve this issue on my Gentoo system? I have a hard time installing anything.
UPDATED
emerge --ask net-dns/bind
* IMPORTANT: 3 config files in '/etc/portage' need updating.
* See the CONFIGURATION FILES section of the emerge
* man page to learn how to update config files.
These are the packages that would be merged, in order:
Calculating dependencies... done!
[ebuild R ] dev-libs/openssl-1.0.1g USE="-bindist*"
[ebuild N ] net-dns/bind-9.9.4_p2 USE="berkdb dlz gost ipv6 ldap odbc ssl -caps -doc -filter-aaaa -fixed-rrset -geoip -gssapi -idn -mysql -postgres -python -rpz -rrl -sdb-ldap (-selinux) -static-libs -threads -urandom -xml"
The following USE changes are necessary to proceed:
see "package.use" in the portage(5) man page for more details)
# required by net-dns/bind-9.9.4_p2[ssl]
# required by net-dns/bind (argument)
=dev-libs/openssl-1.0.1g -bindist
Use --autounmask-write to write changes to config files (honoring
CONFIG_PROTECT). Carefully examine the list of proposed changes,
paying special attention to mask or keyword changes that may expose
experimental or unstable packages.
!!! The following installed packages are masked:
- media-libs/mesa-9.0::gentoo (masked by: package.mask)
/usr/portage/profiles/package.mask:
# Chí-Thanh Christopher Nguyễn <chithanh@gentoo.org> (26 Mar 2014)
# Affected by multiple vulnerabilities, #445916, #471098 and #472280
For more information, see the MASKED PACKAGES section in the emerge
man page or refer to the Gentoo Handbook.
Two steps to solve this problem:
First, create /etc/portage/package.use/bind containing:
net-dns/bind -ipv6 dlz
dev-libs/openssl -bindist
net-misc/openssh -bindist
Then recompile openssl and openssh, and install bind:
emerge -Uav dev-libs/openssl net-misc/openssh
emerge -av net-dns/bind
The USE flags for bind (equery comes from app-portage/gentoolkit):
equery uses bind -i
[ Legend : U - final flag setting for installation]
[ : I - package is installed with flag ]
[ Colors : set, unset ]
* Found these USE flags for net-dns/bind-9.10.2_p2:
U I
+ + berkdb : Add support for sys-libs/db (Berkeley DB
for MySQL)
+ + caps : Use Linux capabilities library to control
privilege
+ + dlz : Enables dynamic loaded zones, 3rd party
extension
- - doc : Add extra documentation (API, Javadoc,
etc). It is recommended to enable per
package instead of globally
- - filter-aaaa : Enable filtering of AAAA records over IPv4
- - fixed-rrset : Enables fixed rrset-order option
- - geoip : Add geoip support for country and city
lookup based on IPs
- - gost : Enables gost OpenSSL engine support
- - gssapi : Enable gssapi support
- - idn : Enable support for Internationalized Domain
Names
- - ipv6 : Add support for IP version 6
- - json : Enable JSON statistics channel
- - ldap : Add LDAP support (Lightweight Directory
Access Protocol)
- - mysql : Add mySQL Database support
- - nslint : Build and install the nslint util
- - odbc : Add ODBC Support (Open DataBase
Connectivity)
- - postgres : Add support for the postgresql database
- - python : Add optional support/bindings for the
Python language
+ + python_targets_python2_7 : Build with Python 2.7
+ + python_targets_python3_3 : Build with Python 3.3
- - python_targets_python3_4 : Build with Python 3.4
- - rpz : Enable response policy rewriting (rpz)
- - seccomp : Enable seccomp for system call filtering
+ + ssl : Add support for Secure Socket Layer
connections
- - static-libs : Build static versions of dynamic libraries
as well
+ + threads : Add threads support for various packages.
Usually pthreads
- - urandom : Use /dev/urandom instead of /dev/random
- - xml : Add support for XML files
Maybe this helps:
# vi /etc/portage/package.use
and add this line (this line was changed):
dev-libs/openssl -bindist
If it doesn't work I have no other ideas, sorry :(
Maybe you can get help from the Gentoo forums.
Good luck.
emerge net-dns/bind --autounmask-write
etc-update
emerge net-dns/bind
(the autounmask step is what writes the -bindist USE change for you)
Just to help other people hitting the same error: you need to add the line shown under "# required by" to your package.use file.
echo "=dev-libs/openssl-1.0.1g -bindist" >> /etc/portage/package.use/zz-autounmask
or
nano -w /etc/portage/package.use/zz-autounmask
and then manually copy the line into the file.
Replace "=dev-libs/openssl-1.0.1g -bindist" with what's required to be added to your package.use