How can I automatically kill idle GCE instances based on CPU usage? - google-compute-engine

I'm running some slightly unreliable software on some instances in an instance group. The software is installed and run by a startup script, and most of the time it works without issue, but about ~10% of the new instances run out of memory and crash due to some sort of memory leak in the software. I can't get this leak fixed myself, so in the meantime, I've been checking the instances every few hours and killing any that show an idle CPU (the software consumes all available CPU power normally).
However, I'm using preemptible instances, and they can be killed off and restarted at any time, leaving dead instances running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard, the rest of which is wasted.
Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.

The following worked for me. It's a bash script which uses the uptime UNIX command to check whether the 15-minute average load on the CPU is below a threshold, and automatically shuts down the system if this is true on ten consecutive checks. You need to run this within your VM instance.
Credit, and more detailed explanation: Rohit Rawat's blog.
#!/bin/bash
threshold=0.4
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
res=$(echo $load'<'$threshold | bc -l)
if (( $res ))
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
sudo poweroff
fi
sleep 60
done

It seems like there are two parts to this question:
Identifying dead instances.
Killing off those instances.
In terms of identifying dead instances, one way to do this would be to have a separate, management instance that does not run this software and that keeps tabs on the other instances. For example, it could do this by periodically sending a health request to the various instances and marking non-responsive instances or instances reporting an overly high CPU usage as unhealthy.
Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset those other instances using the API (I'm guessing the reset command) or by executing the same operation using the gcloud commandline tool.

I wish I could add this as a comment to viswajithiii answer but I'm just shy of the reputations necessary to comment.
I found the static threshold variable to be inappropriate when I am using cloud VM's with variable numbers of cpu's as the output of uptime scales with the number of CPU's as discussed here.
My updated script adds two lines below the threshold assignment to scale the threshold by the number of cpu's. This allows me to set a percentage cpu utilization that will work across VM's with different numbers of cpu's.
Otherwise, the script is the same as viswajithiii's.
#!/bin/bash
threshold=0.4
n_cpu=$( grep 'model name' /proc/cpuinfo | wc -l )
threshold=$( echo $n_cpu*$threshold | bc )
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
res=$(echo $load'<'$threshold | bc -l)
if (( $res ))
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
sudo poweroff
fi
sleep 60
done

This works without bc (not in GCP Container OS) using viswajithiii's answer and this post:
How can I replace 'bc' tool in my bash script?
It also appends the history list to file before poweroff. I set my threshold very low, but the load is showing 0.00 even when I'm editing files via cli. Might work better if instance is under heavy load.
#!/bin/bash
threshold=10
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
load2=$(awk -v a="$load" 'BEGIN {print a*100}')
echo $load2
if [ $load2 -lt $threshold ]
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
history -a
sudo poweroff
fi
sleep 60
done
That's not working for my low cpu, but this seems too:
#!/bin/bash
threshold=1
count=0
while true
do
load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat))
load2=$(printf "%.0f\n" $load)
echo $load
echo $load2
if [[ $load2 -lt $threshold ]]
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
history -a
sudo poweroff
fi
sleep 60
done
It only works with both echo loads for some reason.
credits:
How to get overall CPU usage (e.g. 57%) on Linux
https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer
FYI: according to here, GCP monitoring agent is not available for N type instances: Google Cloud Platform: how to monitor memory usage of VM instances
Put this in a startup script in /etc/my_init.d and make it executable:
sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh
Actually, that's being deleted. Instead add to Custom Metadata in Edit for the instance: startup-script and #! /bin/bash \n~./autooff.sh

Related

How to send binary flashing file to embedded system with only serial console?

I have an embedded Linux system that uses ramdisk boot so it has run time no persistent storage available (it does have Flash to store kernel and ramdisk).
The only connectivity is RS-232 serial login console. So I am limited by what is provided by its built in busybox. I want to retrieve the ramdisk, modify it, and rewrite the ramdisk. The kernel does not have Flash filesystem support built-in. The ramdisk partition size is about 10 MBytes. When all files in the user directory are deleted, the free ramdisk size is about 14 MBytes.
The command dd is available so I can copy the ramdisk partition to the ramdisk, and can write to the flash from a ramdisk file. flashcp is also available.
So my problem is now how to receive and send binary files through the RS-232 serial console?
I research the followings and none is useful for me:
Linux command to send binary file to serial port with HW flow control? on stackoverflow
Binary data over serial terminal on stackoverflow
Transferring files using serial console on k.japko.eu
File transfer over a serial line on superuser.com
How to get file to a host when all you have is a serial console? on stackexchange
Mostly because x/y/zmodem are not available in the busybox.
Any idea? Thanks!
Per the request, here's what I should have included in the first place.
Available u-boot commands:
U-Boot >?
? - alias for 'help'
askenv - get environment variables from stdin
base - print or set address offset
bdinfo - print Board Info structure
boot - boot default, i.e., run 'bootcmd'
bootd - boot default, i.e., run 'bootcmd'
bootm - boot application image from memory
cmp - memory compare
coninfo - print console devices and information
cp - memory copy
crc32 - checksum calculation
crc32_chk_uimage- checksum calculation of an image for u-boot
echo - echo args to console
editenv - edit environment variable
env - environment handling commands
exit - exit script
false - do nothing, unsuccessfully
fatinfo - print information about filesystem
fatload - load binary file from a dos filesystem
fatls - list files in a directory (default /)
fatwrite- write file into a dos filesystem
go - start application at address 'addr'
gpio - input/set/clear/toggle gpio pins
help - print command description/usage
i2c - I2C sub-system
iminfo - print header information for application image
imxtract- extract a part of a multi-image
itest - return true/false on integer compare
loadb - load binary file over serial line (kermit mode)
loads - load S-Record file over serial line
loady - load binary file over serial line (ymodem mode)
loop - infinite loop on address range
md - memory display
mdc - memory display cyclic
mm - memory modify (auto-incrementing address)
mw - memory write (fill)
mwc - memory write cyclic
nm - memory modify (constant address)
printenv- print environment variables
reset - Perform RESET of the CPU
run - run commands in an environment variable
saveenv - save environment variables to persistent storage
saves - save S-Record file over serial line
setenv - set environment variables
sf - SPI flash sub-system
showvar - print local hushshell variables
sleep - delay execution for some time
source - run script from memory
sspi - SPI utility command
test - minimal test like /bin/sh
true - do nothing, successfully
usb - USB sub-system
usbboot - boot from USB device
version - print monitor, compiler and linker version
U-Boot >
Available busybox commands:
BusyBox v1.13.2 (2015-03-16 10:50:56 EDT) multi-call binary
Copyright (C) 1998-2008 Erik Andersen, Rob Landley, Denys Vlasenko
and others. Licensed under GPLv2.
See source distribution for full notice.
Usage: busybox [function] [arguments]...
or: function [arguments]...
BusyBox is a multi-call binary that combines many common Unix
utilities into a single executable. Most people will create a
link to busybox for each function they wish to use and BusyBox
will act like whatever it was invoked as!
Currently defined functions:
[, [[, addgroup, adduser, ar, ash, awk, basename, blkid,
bunzip2, bzcat, cat, chattr, chgrp, chmod, chown, chpasswd,
chroot, chvt, clear, cmp, cp, cpio, cryptpw, cut, date,
dc, dd, deallocvt, delgroup, deluser, df, dhcprelay, diff,
dirname, dmesg, du, dumpkmap, dumpleases, echo, egrep, env,
expr, false, fbset, fbsplash, fdisk, fgrep, find, free,
freeramdisk, fsck, fsck.minix, fuser, getopt, getty, grep,
gunzip, gzip, halt, head, hexdump, hostname, httpd, hwclock,
id, ifconfig, ifdown, ifup, inetd, init, insmod, ip, kill,
killall, klogd, last, less, linuxrc, ln, loadfont, loadkmap,
logger, login, logname, logread, losetup, ls, lsmod, makedevs,
md5sum, mdev, microcom, mkdir, mkfifo, mkfs.minix, mknod,
mkswap, mktemp, modprobe, more, mount, mv, nc, netstat,
nice, nohup, nslookup, od, openvt, passwd, patch, pidof,
ping, ping6, pivot_root, poweroff, printf, ps, pwd, rdate,
rdev, readahead, readlink, readprofile, realpath, reboot,
renice, reset, rm, rmdir, rmmod, route, rtcwake, run-parts,
sed, seq, setconsole, setfont, sh, showkey, sleep, sort,
start-stop-daemon, strings, stty, su, sulogin, swapoff,
swapon, switch_root, sync, sysctl, syslogd, tail, tar, tcpsvd,
tee, telnet, telnetd, test, tftp, tftpd, time, top, touch,
tr, traceroute, true, tty, udhcpc, udhcpd, udpsvd, umount,
uname, uniq, unzip, uptime, usleep, vconfig, vi, vlock,
watch, wc, wget, which, who, whoami, xargs, yes, zcat
In uboot you could use loady/loadx to get file from pc via uart.I usually use teraterm to send file.
The process should be this:
run loady in uboot
use teraterm send data
the file is transfer to you device's memory located in 0x01000000.
Independently I found a way to upload binary files through the Linux console and I'll document the steps here in case others find it useful since I had a hard time looking for this information on the net.
Here's the theory: change the console mode to raw so all the binary traffic are't interpretted as console command, e.g. ctrl-C. Turn off echo so it doesn't add extra serial traffic. Run tar to accept input from the stdin. Since ctrl-C won't work, and tar won't know when to terminate, use a background task to kill the login shell so you can login again to do your staff.
Steps:
Create a script to run in the background. Change myvar variable so it kills the login shell after the transfer is complete. Currently 120 corresponds to 1200 seconds, sufficient for a 10 MBytes file. In addition edit the 808 to match your login shell PID:
create bg file:
myvar=120
while [ $myvar -gt 0 ]
do
myvar=$(( $myvar-1 ))
echo -e " $myvar \n"
ls -l
sleep 10
done
kill -9 808
Launch the script in the background:
in console type:
source ./bg &
Use stty to change console to raw mode and do not echo
in console type:
stty raw -echo
Start tar to untar stdin. Note: I have to use ctrl-J since no longer work after the stty command
in console type and ends with ctrl-j, not :
tar zx -f - 1> 1.log 2> 2.log
Start Teraterm to send binary file
Wait for completion and the new login prompt
I forgot I asked this question. I figured out how to make ssh connection which in turn allows many more things to be done more easily. Of course it requires sshd in addition to nc and stty so you are out of luck if these are not available on your embedded Linux. I have tried it several times and it seems to work well, allowing multiple ssh sessions to be established, and mc to transfer files.
You will need two shell sessions on the host computer, one to loop the serial port to socket, and the other for the ssh, and more if you want to establish more ssh sessions.
First you need to setup the serial port. The '--noreset' option for picocom does this:
sudo picocom --noreset -b 115200 -e b /dev/ttyUSB3
Quit picocom once this is done (^B^X to exit).
Next we need to verify that the line endings are not translated or else ssh won't work. In the first shell run:
cat /dev/ttyUSB3 | hexdump -C
In the second shell run:
echo "echo -e \"LFLF\\n\\nCRCR\\r\\rEND\"" > /dev/ttyUSB3
You may see that \n (0x0A) is translated to \r\n (0x0D0x0A)
Use stty to set raw mode without echo and you should see no more translation:
echo "stty raw -echo" > /dev/ttyUSB3
echo "echo -e \"LFLF\\n\\nCRCR\\r\\rEND\"" > /dev/ttyUSB3
Finally in the first shell run nc to funnel local traffic between the serial port and ssh socket:
cat /dev/ttyUSB3 | nc -l -p 2222 > /dev/ttyUSB3
and funnel remote serial traffic to sshd:
echo "while true ; do nc localhost 22 ; done" > /dev/ttyUSB3
and connect ssh with port forwarding:
ssh -vvv root#localhost -p 2222 -L 0.0.0.0:22022:localhost:22
you can make more ssh connections simultaneously:
ssh -vvv root#localhost -p 22022
if you use mc, you can connect to it so you can easily browse the remote file system and copy files:
sh://root#localhost:22022
Last words: nc strips the TCP headers so the ssh packets are no checksumed and are not retried. If there were data error, the connection will break. If you remember your login shell PID, you can kill it and login again, otherwise you have to reboot. The '-vvv' flag for the ssh is for debugging.

How to accelerate a shell script for a mongoDB application?

I started shell scripting for my work, but I must admit I'm still far away from even being a rookie. Therefore I wanted to ask you for your help/advice.
I build a script for a big data application (taking the quick and dirty approach, patching stuff from the internet together) to recursively go through a folder structure and convert convert all XML files to JSON.
The status quo of my script is:
#!/bin/sh
# Shell script to find out all the files under a directory and
#its subdirectories. This also takes into consideration those files
#or directories which have spaces or newlines in their names
cd /Users/q337498/Desktop/Archiv/2014/01/10
DIR="."
function list_files()
{
if !(test -d "$1")
then echo $1; return;
fi
cd "$1"
#echo; echo `pwd`:; #Display Directory name
for i in *
do
if test -d "$i"; then # if dictionary
if [ "$(ls -A $i)" ]; then
list_files "$i" #recursively list files
cd ..
else
echo "$i is Empty"
fi
else
java -jar /Users/q337498/Desktop/XML2JSON/SaxonEE9-5-1-4J/saxon9ee.jar -s:"$i" -xsl:/Users/q337498/Desktop/xsltjson-master/conf/xml-to-json.xsl -o:output/$(pwd)/${i%%[.]*}
# if jsonlint /Users/q337498/Desktop/Archiv/2014/01/08/$(pwd)/${i%%[.]*} -q; then
# echo "GOOD"
# else
# echo "NOT GOOD"
# fi
# echo ${i%%[.]*}
# echo "$i"; #Display File name
fi
done
}
if [ $# -eq 0 ]
then list_files .
exit 0
fi
for i in $*
do
DIR="$1"
list_files "$DIR"
shift 1 #To read next directory/file namedone
done
This code works, but the problem is that for 60000 files it takes up to 15 hours on a macbook pro with 16gb RAM and an 2.8 Ghz i7. And I need to convert 10 million files.
How do you think that I could accelerate the script? parallelize? take some commands out? What options do I have, and how would I actually implement them?
The files are ultimately going to end up in a MongoDB, so if someone knows a better way to convert xml to json and upload it to mongo his input is also welcome.
Cheers,
Dudu
I see 2 immediate problems here:
You are invoking java once for each file, therefore incurring the JVM startup time for every file which is going to add up to a huge chunk of time.
You are running single threaded
So I would suggest that you:
Write a Java program that does the directory traversal and does your transformation
Benchmark the performance difference
Try other Java libraries for doing the XML->JSON conversion https://github.com/beckchr/staxon/wiki/Benchmark
If necessary for performance, add multi-threading to your application using java.util.concurrent.

Recording busy network traffic with tcpdump

I have set up a system on my Raspberry Pi to record some TCPDUMP data. This system works under a light workload, but for some unknown reason, doesn't work under my "heavy" traffic (27 relevant packets per second).
Under the last heavy traffic system I tried to record, my monitor.log file had 35,200 rows that only contained the last 16 minutes worth of data (judging by the timestamps). My filter.log also only goes back 16 minutes worth. There should be something like 1 million rows.
Could anyone advise on how to find the possible bug, bottle-necks, dropped pipe data, etc?
RC.LOCAL:
java -jar filter.jar > filter.log 2>&1 &
bash ./monitor &
MONITOR:
TCPDUMP -l | SED | tee monitor.log | tee myFIFO
You may try an utility like iptraf to monitor traffic.
try sed with stream option (unbuffered) ( -u on aix and --unbuffered on GNU sed) so sed does not wait for an EOF or assimilate (like a >> file from a discontinu stream)

Multi-GPU profiling (Several CPUs , MPI/CUDA Hybrid)

I had a quick look on the forums and I don't think this question has been asked already.
I am currently working with an MPI/CUDA hybrid code, made by somebody else during his PhD.
Each CPU has its own GPU.
My task is to gather data by running the (already working) code, and implement extra things.
Turning this code into a single CPU / Multi-GPU one is not an option at the moment (later, possibly.).
I would like to make use of performance profiling tools to analyse the whole thing.
For now an idea is to have each CPU launch nvvp for its own GPU and gather data, while another profiling tool will take care of general CPU/MPI part (I plan to use TAU, as I usually do).
Problem is, launching nvvp's interface 8 simultaneous times (if running with 8 CPU/GPUs) is extremely annoying. I would like to avoid going through the interface, and get a command line that directly writes the data in a file, that I can feed to nvvc's interface later and analyse.
I'd like to get a command line that will be executed by each CPU and will produce for each of them a file giving data about their own GPU. 8 (GPUs/CPUs) = 8 files.
Then I plan to individually feed and analyse these files with nvcc one by one, comparing the data manually.
Any idea ?
Thanks !
Take a look at nvprof, part of the CUDA 5.0 Toolkit (currently available as a release candidate). There are some limitations - it can only collect a limited number of counters in a given pass and it cannot collect metrics (so for now you'd have to script multiple launches if you want more than a few events). You can get more information from the nvvp built-in help, including an example MPI launch script (copied here but I suggest you check out the nvvp help for an up-to-date version if you have anything newer than the 5.0 RC).
#!/bin/sh
#
# Script to launch nvprof on an MPI process. This script will
# create unique output file names based on the rank of the
# process. Examples:
# mpirun -np 4 nvprof-script a.out
# mpirun -np 4 nvprof-script -o outfile a.out
# mpirun -np 4 nvprof-script test/a.out -g -j
# In the case you want to pass a -o or -h flag to the a.out, you
# can do this.
# mpirun -np 4 nvprof-script -c a.out -h -o
# You can also pass in arguments to nvprof
# mpirun -np 4 nvprof-script --print-api-trace a.out
#
usage () {
echo "nvprof-script [nvprof options] [-h] [-o outfile] a.out [a.out options]";
echo "or"
echo "nvprof-script [nvprof options] [-h] [-o outfile] -c a.out [a.out options]";
}
nvprof_args=""
while [ $# -gt 0 ];
do
case "$1" in
(-o) shift; outfile="$1";;
(-c) shift; break;;
(-h) usage; exit 1;;
(*) nvprof_args="$nvprof_args $1";;
esac
shift
done
# If user did not provide output filename then create one
if [ -z $outfile ] ; then
outfile=`basename $1`.nvprof-out
fi
# Find the rank of the process from the MPI rank environment variable
# to ensure unique output filenames. The script handles Open MPI
# and MVAPICH. If your implementation is different, you will need to
# make a change here.
# Open MPI
if [ ! -z ${OMPI_COMM_WORLD_RANK} ] ; then
rank=${OMPI_COMM_WORLD_RANK}
fi
# MVAPICH
if [ ! -z ${MV2_COMM_WORLD_RANK} ] ; then
rank=${MV2_COMM_WORLD_RANK}
fi
# Set the nvprof command and arguments.
NVPROF="nvprof --output-profile $outfile.$rank $nvprof_args"
exec $NVPROF $*
# If you want to limit which ranks get profiled, do something like
# this. You have to use the -c switch to get the right behavior.
# mpirun -np 2 nvprof-script --print-api-trace -c a.out -q
# if [ $rank -le 0 ]; then
# exec $NVPROF $*
# else
# exec $*
# fi
Another option is since you are already using TAU to profile the CPU side of the application you could also use TAU to collect the GPU performance data. TAU supports multi-gpu execution along with MPI, take a look at http://www.nic.uoregon.edu/tau-wiki/Guide:TAUGPU for instructions on how to get started using TAU's GPU profiling capabilites. TAU uses CUPTI (CUda Performance Tools Interface) underneath and so the data you will be able to collect with TAU will be very similar to what to can collect with nVidia's Visual Profiler.
Things have changed since CUDA 5.0 and now we can simply use %h, %p and %q{ENV} as mentioned here instead of using a wrapper script:
$ mpirun -np 2 -host c0-0,c0-1 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app
Apparently since 2015 it is possible to auto-annotated MPI calls via NVTX and mpi_interceptions.so library when using nvprof profiler:
https://devblogs.nvidia.com/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/
http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf
TAO still does not support distributed deep learning according to this presentation:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7684-allen-malony-performance-analysis-of-cuda-deep-learning-networks-using-tau.pdf

MySql - replication monitoring tool [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I have a master/slave MySql replication.
Im looking for a tool that will allow me to monitor the replication (see it has no error, check on the lag, etc.)
I prefer a visual tool that will allow all team members get visibility on the status and not a script tool.
any ideas?
We are using the following bash script. You could do the same idea in php and web base the code.
#!/bin/sh
## Joel Chaney##
## joel.chaney#mongoosemetrics.com (look at robots.txt) ##
## 2012-02-03 ##
repeat_alert_interval=30 # minutes for lock file life
lock_file=/tmp/slave_alert.lck # location of lock file
EMAIL=YOURNAME#YOURCOMPANY.DOM # where to send alerts
SSTATUS=/tmp/sstatus # location of sstatus file
### Code -- do not edit below ##
NODE=`uname -n`
## Check if alert is locked ##
function check_alert_lock () {
if [ -f $lock_file ] ; then
current_file=`find $lock_file -cmin -$repeat_alert_interval`
if [ -n "$current_file" ] ; then
# echo "Current lock file found"
return 1
else
# echo "Expired lock file found"
rm $lock_file
return 0
fi
else
touch $lock_file
return 0
fi
}
SLAVE=mysql
$SLAVE -e 'SHOW SLAVE STATUS\G' > $SSTATUS
function extract_value {
FILENAME=$1
VAR=$2
grep -w $VAR $FILENAME | awk '{print $2}'
}
Master_Binlog=$(extract_value $SSTATUS Master_Log_File )
Master_Position=$(extract_value $SSTATUS Exec_Master_Log_Pos )
Master_Host=$(extract_value $SSTATUS Master_Host)
Master_Port=$(extract_value $SSTATUS Master_Port)
Master_Log_File=$(extract_value $SSTATUS Master_Log_File)
Read_Master_Log_Pos=$(extract_value $SSTATUS Read_Master_Log_Pos)
Slave_IO_Running=$(extract_value $SSTATUS Slave_IO_Running)
Slave_SQL_Running=$(extract_value $SSTATUS Slave_SQL_Running)
Slave_ERROR=$(extract_value $SSTATUS Last_Error)
ERROR_COUNT=0
if [ "$Master_Binlog" != "$Master_Log_File" ]
then
ERRORS[$ERROR_COUNT]="master binlog ($Master_Binlog) and Master_Log_File ($Master_Log_File) differ"
ERROR_COUNT=$(($ERROR_COUNT+1))
fi
POS_DIFFERENCE=$(echo ${Master_Position}-${Read_Master_Log_Pos}|bc)
if [ $POS_DIFFERENCE -gt 1000 ]
then
ERRORS[$ERROR_COUNT]="The slave is lagging behind of $POS_DIFFERENCE"
ERROR_COUNT=$(($ERROR_COUNT+1))
fi
if [ "$Slave_IO_Running" == "No" ]
then
ERRORS[$ERROR_COUNT]="Replication is stopped"
ERROR_COUNT=$(($ERROR_COUNT+1))
fi
if [ "$Slave_SQL_Running" == "No" ]
then
ERRORS[$ERROR_COUNT]="Replication (SQL) is stopped"
ERROR_COUNT=$(($ERROR_COUNT+1))
fi
if [ $ERROR_COUNT -gt 0 ]
then
if [ check_alert_lock == 0 ]
then
SUBJECT="${NODE}-ERRORS in replication"
BODY=''
CNT=0
while [ "$CNT" != "$ERROR_COUNT" ]
do
BODY="$BODY ${ERRORS[$CNT]}"
CNT=$(($CNT+1))
done
BODY=$BODY" \n${Slave_ERROR}"
echo $BODY | mail -s "$SUBJECT" $EMAIL
fi
else
echo "Replication OK"
fi
#!/bin/bash
HOST=your-server-ip
USER=mysql-user
PASSWORD=mysql-password
SUBJECT="Mysql replication problem"
EMAIL=your#email.address
RESULT=`mysql -h $HOST -u$USER -p$PASSWORD -e 'show slave status\G' | grep Last_SQL_Error | sed -e 's/ *Last_SQL_Error: //'`
if [ -n "$RESULT" ]; then
echo "$RESULT" | mail -s "$SUBJECT" $EMAIL
fi
You can use any programming language to query mysql and fetch the results from:
show slave status; <-- execute on slave
show master status; <-- execute on master
If you think this is a bad idea, then install phpmyadmin, there is an already built-in GUI for replication monitoring, like: http://demo.phpmyadmin.net/master-config/ (replication)
If you're just interested in whether the slave is up to date or not:
mysql 'your connection info' -e 'show slave status\G' | grep -i seconds_behind
I've used a few different approaches, the most basic being using a PHP web page to check the slave status, and then getting standard monitoring tools to monitor the page. This is a nice approach as it means your existing monitoring tools can be used for alerts by checking the web page.
Example: to check the status of a database server at host db1.internal
http://mywebserver.com/replicationtest.php?host=db1.internal
Should always return "Yes"
replicationtest.php:
<?php
$username="myrepadmin";
$password="";
$database="database";
mysql_connect($_REQUEST['host'],$username,$password);
#mysql_select_db($database) or die( "Unable to select database");
$query="show slave status;";
$result=mysql_query($query);
$arr = mysql_fetch_assoc($result);
echo $arr['Slave_SQL_Running'] ;
mysql_close();
?>
You could also monitor Seconds_Behind_Master, Last_IO_Errno, Last_SQL_Errno etc.
You can monitor this web page externally, or add it into many standard monitoring tools that can check a web page. I've used free service http://monitor.us
Alternatively, if you don't mind running code from 3rd parties on your internal infrastructure http://newrelic.com offer great server monitoring tools with a web interface, and include a MySQL plugin that provides lots of great info such as Query Analysis, InnoDB metrics and Replication status with lag monitors. New Relic specialise on web app monitoring but the free service allows you to monitor an unlimited number of servers.
I currently use a combination of these tools with the above web page being used to trigger alerts for emergencies, and the NewRelic tools for viewing long term performance and trend analysis.
The question is:
do you want to know if your Mysql replication is OK
OR do you want to know if your data are consistent ?
You can't rely only on SHOW SLAVE STATUS output to know if your slave is identical to Master: a (bad) try to solve an error that halted your replication may imply that some INSERT or UPDATE or whatever, never occured on your slave.
To check this you have to read SHOW SLAVE STATUS, of course, everything has to be ok in that output, but you also have to compare data (ie number of rows, checksum, ...).
I've written a PHP tool to do that: https://bitbucket.org/verticalassertions/verticalslave
It features :
check replication (checks on show slave status values, check table, checksum table, ...)
check replication with auto-repair (same as above + dump of replicated tables in errors)
dump of non-replicated databases (listed in config)
reset replication when you broke all to the ground - basically a dump of replicated databases and start slave)
send shorten report by mail, link to full report on website version
store past reports
can be run from CLI (crontab) or manually from website you set up
Feel free to fork and improve. I'm sure some tools are better (especially in layout xD ), but I needed a tool that does exactly what I ask and no fancy things I couldn't figure.