Systemd: service doesn't restart with WatchdogSec option set - qemu

I'm trying to manage my VM with systemd. In case qemu crashes, I want to implement a watchdog for the service; the following is the unit file.
[Unit]
Description=vm manager
After=network.target
Before=shutdown.target reboot.target poweroff.target halt.target
[Service]
Type=forking
ExecStart=/root/vm/vm-manager.sh start-vm
ExecStop=/root/vm/vm-manager.sh stop-vm
KillSignal=SIGCONT
PIDFile=/root/vm/run/pid
WatchdogSec=30s
Restart=on-failure
[Install]
WantedBy=multi-user.target
I didn't call sd_notify(0, "WATCHDOG=1") in my application; the above is the background. I have two questions:
In my opinion, this service should be restarted after 30s. Why does it keep running until I kill or stop it?
When I kill qemu manually (I take the qemu process as the main process), the service restarts immediately, without waiting.
Besides the two questions, if there's anything wrong with the unit file, or you have any suggestions about it, please raise them freely.
Thanks!

You need to set
NotifyAccess=main
(or NotifyAccess=all if the keep-alive is sent by a child process); if you don't, systemd will not enable the watchdog feature for your service.
The man page for systemd.service says:
WatchdogSec=
If this option is used, NotifyAccess= (see below) should be set to open access to the notification socket provided by systemd. If NotifyAccess= is not set, it will be implicitly set to main. Defaults to 0, which disables this feature. The service can check whether the service manager expects watchdog keep-alive notifications. See sd_watchdog_enabled(3) for details. sd_event_set_watchdog(3) may be used to enable automatic watchdog notification support.
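Putting that together, here is a minimal sketch of the adjusted [Service] section, keeping the paths from the question:
[Service]
Type=forking
ExecStart=/root/vm/vm-manager.sh start-vm
ExecStop=/root/vm/vm-manager.sh stop-vm
PIDFile=/root/vm/run/pid
WatchdogSec=30s
# Accept notifications from any process in the unit's cgroup,
# so a helper started by vm-manager.sh can send the keep-alive.
NotifyAccess=all
# on-failure already covers watchdog timeouts.
Restart=on-failure
Something inside the service still has to send WATCHDOG=1 more often than every 30s, otherwise systemd treats the watchdog as expired and restarts the unit. As an assumption (not part of the original vm-manager.sh), a shell keep-alive loop started by the script could look like this; note that very short-lived notifiers like systemd-notify can occasionally race with systemd's sender attribution:
# Ping the watchdog every 10s, well under WatchdogSec=30s.
# If qemu dies, the loop exits and the pings stop, so systemd
# restarts the service once the watchdog timeout elapses.
while kill -0 "$(cat /root/vm/run/pid)" 2>/dev/null; do
    systemd-notify WATCHDOG=1
    sleep 10
done &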


Unable to connect worker node to master using K3S

I am trying to set up a K3S cluster for learning purposes, but I am having trouble connecting the master node with agents. I have looked at several tutorials and discussions on this but I can't find a solution. I know I am probably missing something obvious (due to my lack of knowledge), but help would still be much appreciated.
I am using two AWS t2.micro instances with default configuration.
I ssh'd into the master and installed K3S using
curl -sfL https://get.k3s.io | sh -s - --no-deploy traefik --write-kubeconfig-mode 644 --node-name k3s-master-01
with kubectl get nodes, I am able to see the master
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 13s v1.23.6+k3s1
So far it seems I am doing things right. From what I understand, I am supposed to configure the kubeconfig file. So, I accessed it by using
cat /etc/rancher/k3s/k3s.yaml
I copied the configuration file and changed the server entry to match the private IP I took from the AWS console, resulting in something like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <lots_of_info>
    server: https://<master_private_IP>:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: <my_certificate_data>
    client-key-data: <my_key_data>
Then I ran vi ~/.kube/config and pasted the kubeconfig file there.
Finally, I grabbed the token with cat /var/lib/rancher/k3s/server/node-token, ssh'd into the other machine, and ran the following:
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-01 K3S_URL=https://<master_private_IP>:6443 K3S_TOKEN=<master_token> sh -
The output is
[INFO] Finding release for channel stable
[INFO] Using v1.23.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] systemd: Starting k3s-agent
From this output, it looks like I have created an agent. However, when I run kubectl get nodes on the master, I still get
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 12m v1.23.6+k3s1
What was I supposed to do in order to get the agent connected to the master? I guess I am probably missing something simple, but I just can't seem to find the solution. I've read all the documentation, but it is still not clear to me where I am making the mistake. I've also tried saving the master's private IP and token on the agent as environment variables with export K3S_TOKEN=<master_token> and export K3S_URL=<master_private_IP> and then simply running curl -sfL https://get.k3s.io | sh -, but I still can't see the worker node when running kubectl get nodes.
Any help would be appreciated.
It might be your VM instance firewall that prevents the proper connection from your master to the worker node (and vice versa). The official Rancher documentation advises disabling the firewall on (Red Hat/CentOS) Enterprise Linux:
It is recommended to turn off firewalld:
systemctl disable firewalld --now
If enabled, it is required to disable nm-cloud-setup and reboot the node:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
If you are using Ubuntu on your VMs, there is a different firewall tool (ufw).
In my case, allowing TCP connections on ports 6443 and 443 (not sure if the latter is required) worked fine.
Allow TCP connections on port 6443 on all of your cluster machines:
sudo ufw allow 6443/tcp
Then apply k3s installation script in your worker node(s):
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-1 K3S_URL=https://<k3s-master-1 IP>:6443 K3S_TOKEN=<k3s-master-1 TOKEN> sh -
This should work. If not, you can try adding an additional allow rule for TCP port 443 as well.
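Since you are on AWS, also check the security groups attached to the two t2.micro instances, which act as a firewall in front of the hosts. As a sketch based on the K3s networking requirements (an assumption, not something verified in this thread), these are the rules that typically matter for a single-server cluster; mirror them as inbound rules in the security group if the nodes talk over their VPC IPs:
# Run on every node (ufw shown; add matching inbound rules in the AWS security group).
sudo ufw allow 6443/tcp    # Kubernetes API server (agents -> server)
sudo ufw allow 8472/udp    # flannel VXLAN traffic between nodes
sudo ufw allow 10250/tcp   # kubelet metrics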
A few options to check.
Check Journalctl for errors
journalctl -u k3s-agent.service -n 300 -xn
If using a Raspberry Pi for a worker node, make sure you have
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
at the very end of your /boot/cmdline.txt file. DO NOT PUT THIS VALUE ON A NEW LINE! It should just be appended to the end of the existing line.
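Purely for illustration (the existing parameters shown here are hypothetical and will differ on your Pi), the end result should be one single line along these lines:
console=serial0,115200 console=tty1 root=PARTUUID=<your-partuuid> rootfstype=ext4 fsck.repair=yes rootwait cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1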
If your master node(s) have self-signed certs, make sure you copy the master node's self-signed cert to your worker node(s). On Linux or Raspberry Pi, copy the cert to /usr/local/share/ca-certificates, then issue a
sudo update-ca-certificates
on the worker node
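A sketch of that step, assuming the certificate has already been copied over from the master as k3s-master.crt (the filename is hypothetical):
# On the worker node: trust the master's self-signed certificate.
sudo cp k3s-master.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates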
Don't forget to reboot the worker node after you make these changes!
Hope this helps someone!

Questions on starting Locator using snappydata/bin> ./spark-shell.sh script

SnappyData v0.5
Here's the command I used to start a Locator:
ubuntu@ip-172-31-8-115:/snappydata-0.5-bin/bin$ ./snappy-shell locator start
Starting SnappyData Locator using peer discovery on: 0.0.0.0[10334]
Starting DRDA server for SnappyData at address localhost/127.0.0.1[1527]
Logs generated in /snappydata-0.5-bin/bin/snappylocator.log
SnappyData Locator pid: 9352 status: running
It looks like it starts the DRDA server locally, with no outside interface for a client to connect to. So, I cannot reach my SnappyData Locator using this JDBC URL from an outside client host (e.g. my SquirrelSQL editor).
This does not connect:
jdbc:snappydata://MY-AWS-PUBLIC-IP-HERE:1527/
What property do I pass to my ./snappy-shell.sh locator start command to get the DRDA server to start on a public IP address instead of "localhost/127.0.0.1"?
Use the -client-bind-address and -client-port options. For the locator, also use the -peer-discovery-address and -peer-discovery-port options to specify the bind address for other locators/servers/leads (which is what they pass to their -locators=<address>:<port>):
snappy-shell locator start -peer-discovery-address=<internal IP for peers> -client-bind-address=<public IP for clients>
See the output of snappy-shell locator --help for commonly used options.
For SnappyData releases, you may find it much easier to use the global configuration for all of the locators, servers, and leads. Check the documentation on configuring the cluster.
This lets you specify all options for all JVMs of the cluster in conf/locators, conf/leads, and conf/servers, then start everything with snappy-start-all.sh, check status with snappy-status-all.sh, and stop everything with snappy-stop-all.sh.
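As a sketch only (the hostnames, ports, and addresses below are placeholders, not taken from this thread), each conf file lists one cluster member per line, the member's hostname followed by its options:
# conf/locators -- bind clients to the public interface
locator-host -peer-discovery-port=10334 -client-bind-address=<public IP> -client-port=1527
# conf/servers -- data server(s) pointing at the locator
server-host -locators=locator-host:10334
# conf/leads -- lead node(s) pointing at the locator
lead-host -locators=locator-host:10334
With those in place, snappy-start-all.sh brings up the whole cluster and snappy-stop-all.sh takes it down.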
On a related note, we at SnappyData Inc. are developing scripts to enable users to quickly launch a SnappyData cluster on AWS.
If you want to try it out, the steps below will guide you. We would love to hear your feedback on this.
Download its development branch: git clone https://github.com/SnappyDataInc/snappydata.git -b SNAP-864 (You don't need to clone the repo for this, but I could not find a way to attach the scripts here.)
Go to the ec2 directory: cd snappydata/cluster/ec2
Run snappy-ec2: ./snappy-ec2 -k ec2-keypair-name -i /path/to/keypair/private/key/file launch your-cluster-name
See this README for more details.

Unable to create tap device vnet%d: Operation not permitted

I am trying to add a bridge network to my guest VM on a CentOS 6 host.
I have created a bridge br0 by adding a file:
/etc/sysconfig/network-scripts/ifcfg-br0:
DEVICE=br0
TYPE=Bridge
BOOTPROTO=dhcp
STP=on
ONBOOT=yes
Also, I have added a line in my /etc/sysconfig/network-scripts/ifcfg-eth0:
BRIDGE=br0
Now, I tried to create a VM using:
virt-install -n ubuntu_vm --disk path=kvm-images/ubuntu-12.04.qcow2,size=30,format=qcow2 --ram=2048 --cdrom= --os-type=linux --network bridge=br0 --os-variant=ubuntuprecise --graphics vnc,listen=0.0.0.0
Now, I am getting the following error:
Starting install...
**ERROR Unable to create tap device vnet%d: Operation not permitted**
Domain installation does not appear to have been successful.
If it was, you can restart your domain by running:
virsh --connect qemu:///session start ubuntu_new_vm
otherwise, please restart your installation.
I see that this problem was fixed before libvirt 0.10.2, which is the version I am currently using, but I am still getting the same error:
http://www.redhat.com/archives/libvir-list/2012-May/msg00678.html
From the error message I see that you are running virt-install as an unprivileged user connecting to qemu:///session. With the unprivileged libvirtd instance you are very limited in what networking modes you can use, and in particular the 'network' mode is not available, as your user won't have privileges to manage the TAP devices.
The alternatives are to use the privileged libvirtd instance (qemu:///system) to run the VM, which gives it full network access, or to enable the QEMU setuid network helper. The latter lets you use --network bridge=NAME with virt-install when running unprivileged, delegating TAP device setup to the setuid helper program.
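Concretely, the first alternative only changes which instance virt-install connects to, while the second needs an ACL entry for the bridge plus the setuid bit on the helper. Paths and package layout vary by distro (and CentOS 6 ships an old QEMU), so treat this as a sketch rather than exact commands:
# Option 1: run the install against the privileged libvirtd instance.
virt-install --connect qemu:///system -n ubuntu_vm --network bridge=br0 ...
# Option 2: let the unprivileged session use the setuid bridge helper.
echo "allow br0" | sudo tee -a /etc/qemu/bridge.conf      # bridge ACL; path may differ
sudo chmod u+s /usr/libexec/qemu-bridge-helper            # helper location varies by distro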

Hot reconfiguration of HAProxy still leads to failed requests, any suggestions?

I found there are still failed requests when the traffic is high, using a command like this
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
to hot reload the updated config file.
Below is the pressure-testing result using webbench:
/usr/local/bin/webbench -c 10 -t 30 targetHProxyIP:1080
Webbench – Simple Web Benchmark 1.5
Copyright (c) Radim Kolar 1997-2004, GPL Open Source Software.
Benchmarking: GET targetHProxyIP:1080
10 clients, running 30 sec.
Speed=70586 pages/min, 13372974 bytes/sec.
**Requests: 35289 susceed, 4 failed.**
I ran the command
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
several times during the pressure testing.
In the haproxy documentation, it is mentioned:
They will receive the SIGTTOU signal to ask them to temporarily stop listening to the ports so that the new process can grab them
so there is a time period during which the old process is no longer listening on the port (say 80) and the new process has not yet started listening on it, and during that window new connections will fail. Does that make sense?
So is there any approach to reloading the haproxy configuration that does not impact either existing connections or new connections?
On recent kernels where SO_REUSEPORT is finally implemented (3.9+), this dead period does not exist anymore. While a patch has been available for older kernels for something like 10 years, it's obvious that many users cannot patch their kernels. If your system is more recent, then the new process will succeed in its attempt to bind() before asking the previous one to release the port, so there's a period where both processes are bound to the port instead of neither.
There is still a very tiny possibility that a connection arrived in the leaving process' queue at the moment it closes it. There is no reliable way to stop this from happening though.
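A quick way to check which case you are in; the only addition here is the kernel-version check, the reload command is the one from the question:
# SO_REUSEPORT needs kernel 3.9 or newer; on such kernels the new haproxy
# process binds before the old one releases the port, so there is no dead period.
uname -r
# Reload as before: -sf asks the old process to hand over the listening sockets.
haproxy -f /etc/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)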

Monit to monitor 2 searchd instances on one server

I have 2 Rails apps hosted on the same server, and each of them has its own configuration for thinking_sphinx/searchd with different ports configured. I managed to get this setup working and I have 2 instances of searchd running.
My problem is getting Monit to monitor these 2 instances. Even though the 2 instances of searchd have their own PID files in separate directories, I was not able to define the configuration in the monitrc because the process names in this case are the same, namely searchd.
In my monitrc, I have 2 separate check blocks as follows:
check process searchd with pidfile /var/www/app1/shared/pids/production.sphinx.pid
start program=....
stop program=....
check process searchd with pidfile /var/www/app2/shared/pids/production.sphinx.pid
start program=...
stop program=...
Monit requires a unique process name. Is it possible to start up my second instance of searchd using a different process name?
Thanks for the help.
You can call the process whatever you wish in monit configuration files - it doesn't need to match the executable. So:
check process searchd_app1 with pidfile /var/www/app1/shared/pids/production.sphinx.pid
start program=....
stop program=....
check process searchd_app2 with pidfile /var/www/app2/shared/pids/production.sphinx.pid
start program=...
stop program=...
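For completeness, here is a sketch of what the filled-in blocks might look like for a thinking_sphinx deployment; the rake tasks, RAILS_ENV, and current/ paths are assumptions about a typical Capistrano-style setup, not taken from the question:
check process searchd_app1 with pidfile /var/www/app1/shared/pids/production.sphinx.pid
  start program = "/bin/bash -c 'cd /var/www/app1/current && bundle exec rake ts:start RAILS_ENV=production'"
  stop program = "/bin/bash -c 'cd /var/www/app1/current && bundle exec rake ts:stop RAILS_ENV=production'"
check process searchd_app2 with pidfile /var/www/app2/shared/pids/production.sphinx.pid
  start program = "/bin/bash -c 'cd /var/www/app2/current && bundle exec rake ts:start RAILS_ENV=production'"
  stop program = "/bin/bash -c 'cd /var/www/app2/current && bundle exec rake ts:stop RAILS_ENV=production'"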