I have this simple script that checks whether MySQL on two remote servers (db-test-1 and db-test-2) is in SST mode and sends a message to a Slack channel. The script runs on a third server dedicated to cron jobs. Here is the code:
#!/bin/bash
time=$(date)
array=( db-test-1 db-test-2 )
for i in "${array[@]}"
do
    S=$(ssh "$i" ps -ef | grep mysql | grep wsrep_sst_xtrabackup-v2)
    if [[ "$S" != "" ]]; then
        curl -X POST --data-urlencode "payload={\"channel\": \"#db-share-test\", \"username\": \"wsrep_local_state_comment\", \"text\": \"*$i*: ${time}\n>State transfer in progress, setting sleep higher mysqld\", \"icon_emoji\": \":scorpion:\"}" https://hooks.slack.com/services/G824ZJS9N/B6QS5JEKP/ZjV1hmM1k4dZGsf9HDC1o1jd
        exit 0
    else
        curl -X POST --data-urlencode "payload={\"channel\": \"#db-share-test\", \"username\": \"wsrep_local_state_comment\", \"text\": \"*$i*: ${time}\n>State transfer is complete. Server is Synced now.\", \"icon_emoji\": \":scorpion:\"}" https://hooks.slack.com/services/G824ZJS9N/B6QS5JEKP/ZjV1hmM1k4dZGsf9HDC1o1jd
        exit 2
    fi
done
The two servers, db-test-1 and db-test-2, are part of a PXC cluster. So when I start db-test-1 in SST to join the cluster, I get the following in my Slack channel, as expected:
*db-test-1*: Sun Aug 27 15:12:44 CST 2017
>State transfer in progress, setting sleep higher mysqld
*db-test-1*: Sun Aug 27 15:12:49 CST 2017
State transfer in progress, setting sleep higher mysqld
*db-test-1*: Sun Aug 27 15:12:51 CST 2017
State transfer in progress, setting sleep higher mysqld
*db-test-1*: Sun Aug 27 15:12:54 CST 2017
State transfer in progress, setting sleep higher mysqld
So messages are being posted approximately every 3 seconds. However, the cron job executing this script is scheduled to run every minute, so I'm not sure why it is sending results every 3 seconds or so, as shown above.
How can I ensure that the results are posted only once per minute, to avoid flooding my channel with the same message every 3 seconds? Also, how can I make sure that when the SST is finished, a single message is sent to Slack to indicate that the state transfer is complete, instead of sending that message non-stop every time the two DB servers are not in SST mode?
Besides checking that the cron entry is properly set - it should be something like:
* * * * * /path/to/script
or
*/1 * * * * /path/to/script
It could also be good to ensure that only one instance of the script is running. Try changing the cron entry used for the job to:
*/1 * * * * pgrep script > /dev/null || /path/to/script
or by adding something like this at the top of the script:
#!/bin/sh
# Exit if another copy of the script is already running.
# (grep -v "$$" excludes this process's own entry, otherwise the check
# would always match the script itself and exit immediately.)
if ps -ef | grep -v grep | grep your_script_name | grep -v "$$" | grep -q . ; then
    exit 0
fi
# your code goes below
# ...
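Another option, which avoids the pitfalls of grepping `ps` output entirely, is to take an exclusive lock at the top of the script. This is a minimal sketch, assuming `flock(1)` from util-linux is available; the lock-file path is hypothetical:

```shell
#!/bin/bash
# Take an exclusive, non-blocking lock on a dedicated lock file.
# A second copy of the script started while the first is still
# running fails to get the lock and exits immediately.
LOCKFILE=/tmp/sst_check.lock   # hypothetical path, adjust to taste
exec 200>"$LOCKFILE"
if ! flock -n 200; then
    echo "another instance is running, exiting" >&2
    exit 0
fi
echo "lock acquired, doing work"
# ... the SST check / Slack notification code goes here ...
```

The lock is tied to the open file descriptor, so it is released automatically when the script exits, even if it crashes.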
I want to execute an action only once per day: when the day's first alert appears in Zabbix, the action should restart the service/process. If the same alert comes up again in Zabbix later that day, no action should be performed.
I am using Zabbix version 6.2.4, and on the client side it is zabbix-agent2.
I have created an action on the trigger.
Item --> checks whether the process is running or not:
system.run["ps -aux | grep -i 'proces' | grep -v 'grep' | wc -l"]
Trigger --> generates an event when the process is down:
last(/Monitor process/system.run["ps -aux | grep -i 'proces' | grep -v 'grep' | wc -l"])<1
0 --> throw the alert in Zabbix
1 --> do not do anything.
Action --> when the trigger fires in Zabbix, the action executes the service start script.
Up to this point everything works fine.
But I want this action to run only once per day: only the day's first alert should restart the service/process.
Problem history is not available as an action condition, so within Zabbix you cannot know whether the event has already triggered, and when. You can, however, manage it inside the restart script.
The following script restarts the service only when more than 24 hours have passed since the last restart.
#!/bin/bash
SEMAPHORE=/tmp/zabbix.service1.lastrestart
first_time=0
day=$(expr 24 \* 60 \* 60)
delta=$(expr $(date +%s) - $(stat -c %Y "$SEMAPHORE" 2>/dev/null) 2>/dev/null)
if [ -z "$delta" ]; then
    # semaphore file doesn't exist yet
    first_time=1
elif [ "$delta" -gt "$day" ]; then
    # more than 1 day since last restart
    first_time=1
fi
if [ "$first_time" -eq 1 ]; then
    systemctl restart service1
    touch "$SEMAPHORE"
fi
Please note that system.run[] is deprecated; you should use it for tests only. In this case the appropriate key may be proc.num[], or even the systemd.unit.info[] item, since you are using an Agent2 newer than 4.4 (see https://www.zabbix.com/documentation/6.2/en/manual/discovery/low_level_discovery/examples/systemd?hl=systemd).
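As a hedged sketch (keeping the process name `proces` and the host `Monitor process` from the question, which are assumptions), the item key and trigger expression could become:

```
proc.num[proces]
last(/Monitor process/proc.num[proces])<1
```

Note that proc.num[] matches the exact process name, unlike the original `grep -i` substring match, so adjust the name accordingly.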
I want to implement a shutdown script that is called when my VM is about to be preempted on Google Compute Engine. That VM runs Docker containers that execute long-running batches, so I send them a signal to make them exit gracefully.
The shutdown script works well when I execute it manually, yet it breaks in a real preemption, or when I kill the VM myself.
I got this error:
... logs from my containers ...
A 2019-08-13T16:54:07.943153098Z time="2019-08-13T16:54:07Z" level=error msg="error waiting for container: unexpected EOF"
(just after this error, I can see what I put in the 1st line of my shutting-down script, see code below)
A 2019-08-13T16:54:08.093815210Z 2019-08-13 16:54:08: Shutting down! TEST SIGTERM SHUTTING DOWN (this is the 1st line of my shutdown script)
A 2019-08-13T16:54:08.093845375Z docker ps -a
(no result)
A 2019-08-13T16:54:08.155512145Z ps -ef
... a lot of things, but nothing related to docker ...
2019-08-13 16:54:08: task_subscriber not running, shutting down immediately.
I use a preemptible VM on GCE, with image Container-Optimized OS 73-11647.267.0 stable. I run my Docker containers as services with systemctl, yet I don't think this is related - [edit] Actually I could solve my issue thanks to this.
Right now, I am pretty sure that a lot of things happen when Google sends the ACPI signal to my VM, even before the shutdown-script is fetched from the VM metadata and called.
My guess is that all the services are being stopped at the same time, possibly including docker.service itself.
When my container is running, I can get the same level=error msg="error waiting for container: unexpected EOF" with a simple sudo systemctl stop docker.service
Here is a part of my shutdown script:
#!/bin/bash
# This script must be added in the VM metadata as "shutdown-script" so that
# it is executed when the instance is being preempted.
CONTAINER_NAME="task_subscriber"

logTime() {
    local datetime="$(date +"%Y-%m-%d %T")"
    echo -e "$datetime: $1" # Console
    echo -e "$datetime: $1" >>/var/log/containers/$CONTAINER_NAME.log
}

logTime "Shutting down! TEST SIGTERM SHUTTING DOWN"
echo "docker ps -a" >>/var/log/containers/$CONTAINER_NAME.log
docker ps -a >>/var/log/containers/$CONTAINER_NAME.log
echo "ps -ef" >>/var/log/containers/$CONTAINER_NAME.log
ps -ef >>/var/log/containers/$CONTAINER_NAME.log

if [[ ! "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; then
    logTime "${CONTAINER_NAME} not running, shutting down immediately."
    sleep 10 # Give time to send logs
    exit 0
fi

logTime "Sending SIGTERM to ${CONTAINER_NAME}"
#docker kill --signal=SIGTERM ${CONTAINER_NAME}
systemctl stop taskexecutor.service

# Portable waitpid equivalent
while [[ "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; do
    sleep 1
    logTime "Waiting for ${CONTAINER_NAME} termination"
done

logTime "${CONTAINER_NAME} is done, shutting down."
logTime "TEST SIGTERM SHUTTING DOWN BYE BYE"
sleep 10 # Give time to send logs
If I simply call systemctl stop taskexecutor.service manually (without actually shutting down the server), the SIGTERM signal is sent to my container, and my app handles it properly and exits.
Any idea?
-- How I solved my issue --
I could solve it by adding this dependency on docker in my service config:
[Unit]
Wants=gcr-online.target docker.service
After=gcr-online.target docker.service
I don't know how the magic works behind the execution of the shutdown-script stored in metadata by Google. But I think they should fix something in their Container-Optimized OS so that this magic happens before Docker is stopped. Otherwise, we cannot rely on it to gracefully shut down even a basic script (fortunately I was using systemd here)...
From the documentation [1], usage of shutdown scripts on preemptible VM instances is feasible. However, there are some limitations: Compute Engine executes shutdown scripts only on a best-effort basis, and in rare cases cannot guarantee that the shutdown script will complete. I would also like to mention that preemptible instances have only 30 seconds after preemption begins [2], which might be killing Docker before the shutdown script is executed.
From the error message provided in your use case, this seems to be expected behaviour for Docker containers that run continuously for a long time [3].
[1]https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#handle_preemption
[2] https://cloud.google.com/compute/docs/shutdownscript#limitations
[3] https://github.com/docker/for-mac/issues/1941
I want to ping an external IP from all of my servers that run the Zabbix agent.
I searched and found some articles about Zabbix user parameters.
In /etc/zabbix/zabbix_agentd.conf.d/ I created a file named userparameter_ping.conf with following content:
UserParameter=checkip[*],ping -c4 8.8.8.8 && echo 0 || echo 1
I created an item named checkip on the Zabbix server, with a graph, but got no data. After some more digging I found zabbix_get and tested my user parameter, but I got the error: ZBX_NOTSUPPORTED
# zabbix_get -s 172.20.4.43 -p 10050 -k checkip
My Zabbix version:
Zabbix Agent (daemon) v2.4.5 (revision 53282) (21 April 2015)
Does anybody know what I can do to address this?
After some changes and talks with folks on the mailing list it finally worked. Here is how:
First I created a file in:
/etc/zabbix/zabbix_agentd.conf.d/
and added this line:
UserParameter=checkip[*],ping -W1 -c2 $1 >/dev/null 2>&1 && echo 0 || echo 1
Then I ran this command:
./sbin/zabbix_agentd -t checkip["8.8.8.8"]
checkip[8.8.8.8] [t|0]
So everything works, but the Timeout option is very important for us. Add a timeout in /etc/zabbix/zabbix_agentd.conf:
Timeout=30
The Timeout default is 3 seconds, so if you run
time ping -W1 -c2 8.8.8.8
you may see that it takes more than 3 seconds, in which case you get the error:
ZBX_NOTSUPPORTED
It can be many things: for example the timeout - the default timeout is 3 seconds and ping -c4 needs at least 3 seconds - or the permissions/path to ping, an agent that was not restarted, ...
Increase the debug level, restart the agent and check the Zabbix logs. You can also test zabbix_agentd directly:
zabbix_agentd -t checkip[]
[m|ZBX_NOTSUPPORTED] [Timeout while executing a shell script.] => Timeout problem. Edit zabbix_agentd.conf and increase the Timeout setting. The default of 3 seconds is not enough for your ping, which needs 3+ seconds.
If you need more than 30 s for the execution, you can use a nohup (command..) & combination to work around the timeout restriction.
That way, if the command writes its results to a file, on the next pass you can read the file and return the results without any need to wait.
For those who may be experiencing other issues with the same error message.
It is important to run zabbix_agentd with the -c parameter:
./sbin/zabbix_agentd -c zabbix_agentd.conf --test checkip["8.8.8.8"]
Otherwise zabbix might not pick up on the command and will thus yield ZBX_NOTSUPPORTED.
It also helps to isolate the command into a script file, as Zabbix will butcher in-line commands in UserParameter= much more than you'd expect.
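For instance (paths hypothetical), the ping check from the earlier answer could be moved into its own script file and referenced from the user parameter:

```shell
#!/bin/sh
# /etc/zabbix/scripts/checkip.sh (hypothetical path)
# Prints 0 if the target answers within the ping timeout, 1 otherwise.
ping -W1 -c2 "$1" >/dev/null 2>&1 && echo 0 || echo 1
```

with the user parameter reduced to a plain script call:

UserParameter=checkip[*],/etc/zabbix/scripts/checkip.sh $1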
I defined two user parameters like this for sync checking between two Samba DCs.
/etc/zabbix/zabbix_agentd.d/userparameter_samba.conf:
UserParameter=syncma, sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com
UserParameter=syncam, sudo samba-tool drs replicate smb2 smb1 cn=schema,cn=configuration,dc=domain,dc=com
and also granted sudo access for the zabbix user to execute the command, in /etc/sudoers.d/zabbix:
Defaults:zabbix !syslog
Defaults:zabbix !requiretty
zabbix ALL=(ALL) NOPASSWD: /usr/bin/samba-tool
zabbix ALL=(ALL) NOPASSWD: /usr/bin/systemctl
And "EnableRemoteCommands" is enabled in my zabbix_agentd.conf. Sometimes when I run
zabbix_get -s CLIENT_IP -p10050 -k syncma or
zabbix_get -s CLIENT_IP -p10050 -k syncam
I get the error ZBX_NOTSUPPORTED: Timeout while executing a shell script.
But after executing /sbin/zabbix_agentd -t syncam on the client, the Zabbix server responds normally:
Replicate from smb2 to smb1 was successful.
and when it has a problem I get the error below in my zabbix.log:
failed to kill [ sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com]: [1] Operation not permitted
It looks like a permission error, but it resolved itself after I executed /sbin/zabbix_agentd -t syncam. I am not sure whether the error is gone permanently or will return at the next Zabbix item check interval.
My system is Debian 7, 64-bit.
Hi, I am new to Varnish and I have encountered an error that I can't seem to solve.
I have the following in my /etc/default/varnish configuration file:
DAEMON_OPTS="-a :80 \
-T localhost:6082 \
-f /etc/varnish/default.vcl \
-u www-data -g www-data \
-S /etc/varnish/secret \
-p thread_pools=2 \
-p thread_pool_min=25 \
-p thread_pool_max=250 \
-p thread_pool_add_delay=2 \
-p timeout_linger=50 \
-p sess_workspace=262144 \
-p cli_timeout=40 \
-s malloc,512m"
When I restarted the varnish service it failed with the error Unknown parameter "sess_workspace".
I checked /var/log/varnish/ and no logs were generated.
I also checked syslog; the only logs varnish wrote had to do with the startup and shutdown of varnish after the install and when I ran service varnish restart. Here are all the relevant syslog entries (cat syslog | grep varnish):
Nov 6 00:56:27 HOSTNAME varnishd[7557]: Platform: Linux,3.2.0-4-amd64,x86_64,-smalloc,-smalloc,-hcritbit
Nov 6 00:56:27 HOSTNAME varnishd[7557]: child (7618) Started
Nov 6 00:56:27 HOSTNAME varnishd[7557]: Child (7618) said Child starts
Nov 6 01:04:58 HOSTNAME varnishd[7557]: Manager got SIGINT
Nov 6 01:04:58 HOSTNAME varnishd[7557]: Stopping Child
Nov 6 01:04:59 HOSTNAME varnishd[7557]: Child (7618) ended
Nov 6 01:04:59 HOSTNAME varnishd[7557]: Child (7618) said Child dies
Nov 6 01:04:59 HOSTNAME varnishd[7557]: Child cleanup complete
I have searched the vast seas of Google but with no luck (I have even compared with example code from the Varnish site, still no luck).
Any ideas that may help me?
A bit delayed, but in case anyone else has this issue I will post what I found.
Found this here: http://www.0550go.com/proxy/varnish/varnish-v4-config.html
sess_workspace: In 3.0 it was often necessary to increase sess_workspace if a lot of VMODs, complex header operations or ESI were in use. This is no longer necessary, because ESI scratch space happens elsewhere in 4.0. If you are using a lot of VMODs, you may need to increase either workspace_backend or workspace_client, based on where your VMOD is doing its work.
I am trying to run a script every other Monday using this cron job (in /etc/crontab):
45 8 * * mon root expr `date +%U` % 2 >/dev/null || /home/joe/Scripts/test1.sh
This morning I checked my /var/log/syslog and found these entries.
/USR/SBIN/CRON[874]: (root) CMD (expr `date +)
/USR/SBIN/CRON[872]: (CRON) error (grandchild #874 failed with exit status 2)
/USR/SBIN/CRON[872]: (CRON) info (No MTA installed, discarding output)
I also tried typing this directly into the command line while logged in as root.
/bin/sh -c "(export PATH=/usr/bin:/bin; expr `date +%U` % 2 >/dev/null || /home/joe/Scripts/test1.sh </dev/null >/dev/null 2>&1)"
It worked, with no output in the syslog. Does anyone know why my cron job is giving this error?
Percent is a reserved character in cron and must be escaped with a backslash.
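Note that every % on the command line needs escaping, not just the one inside the date format (an unescaped % is treated by cron as end-of-command, which is why the log shows the command truncated at "expr `date +"). The line from the question would then read:

```
45 8 * * mon root expr `date +\%U` \% 2 >/dev/null || /home/joe/Scripts/test1.sh
```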