Bash - Faster way to check for file changes than md5? - mysql

I've got a MySQL DB set up on my system for local testing, and I'm monitoring the tables to see when a change is made.
Step 1 - Go to DIR
cd /usr/local/mysql-5.7.16-osx10.11-x86_64/data/blog_atom_tables/
Step 2 - Run Script
watchDB
Where watchDB() is (slightly modified for readability)...
function watchDB() {
    declare -A aa   # Associative array of filenames and their md5 hashes
    declare k       # Holder for current md5
    while true; do  # Run forever
        # Loop through all table files within directory
        for i in *.ibd; do
            k=$(sudo md5 -q "$i")   # md5 of file (table)
            # If table has not been hashed yet
            if [[ ${aa[$(echo "$i" | cut -f 1 -d '.')]} == "" ]]; then
                aa[$(echo "$i" | cut -f 1 -d '.')]=$k
            # If table has been hashed, and md5 differs (i.e. table changed)
            elif [[ ${aa[$(echo "$i" | cut -f 1 -d '.')]} != "$k" ]]; then
                echo "$i"
                aa[$(echo "$i" | cut -f 1 -d '.')]=$k
            fi
        done
    done
}
TL;DR Loop through all the table files within the directory, save a copy of each md5, and continue looping through checking for a change.
I don't need to see what rows/columns have been changed, only that the table itself is different. For the most part, this works exactly as I want, but calculating the md5 for every table takes a noticeable amount of time. For only 25 tables, it takes between 3 and 5 seconds to execute each loop.
Is there a quicker way to do this, other than md5? I'd use something like cmp, but I need to save a reference of the current state of the file, so I have something to compare it against.
This is only about 1/6 of the total tables that will eventually be in there, so any improvement on speed is welcome.

While it doesn't actually check the content of the file, you could use file system attributes as a simple way to monitor for changes. Unless the filesystem is mounted with timestamps disabled, you can monitor the access and modification timestamps:
stat -f "%m" <filename>
The filesystem driver knows when reads and writes occur and subsequently updates the timestamps.
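For example, a minimal sketch of the same watch loop keyed on the modification timestamp instead of an md5 hash (using the BSD/macOS stat syntax shown above, and assuming Bash 4+ for the associative array, as in the question's script):
function watchDB() {
    declare -A mtimes                     # filename -> last seen modification time
    while true; do
        for i in *.ibd; do
            m=$(stat -f "%m" "$i")        # epoch seconds, maintained by the filesystem driver
            if [[ -n ${mtimes[$i]} && ${mtimes[$i]} != "$m" ]]; then
                echo "$i"                 # table file changed since the last pass
            fi
            mtimes[$i]=$m
        done
        sleep 1                           # avoid a hot loop
    done
}
Since stat only reads inode metadata, each pass should be far cheaper than hashing every file.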

Related

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (two times a day) I need to compare a list (10,000 entries) with all csv files. If one of the entries is identical with the third or fourth column of one of the csv files I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"

for e in "${array[@]}"
do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done
This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 entries out of about 10,000.
MySQL
The next try was to import all csv files to a mysql database. But there I had problems with my table at around 50,000,000 entries.
So I wrote a script which created a new table after 49,000,000 entries and so I was able to import all csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either; it slowed the import down from a few hours to days.
The select statement was horrible, but it worked. It was much faster than the "grep" solution, but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an SSD, but I hope there are other ways.
This is unlikely to offer you meaningful benefits, but here are some improvements to your script:
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). This alone should give you a real, meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
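Put together, the whole loop collapses to a single command; a minimal sketch, reusing the found output file from your script:
# one pass over every csv file, matching all entries as fixed whole words
grep -Fwf entries.log ./csv/* >> found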
In awk, assuming all the csv files change; otherwise it would be wise to keep track of the already-checked files. But first, some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk '                       # awk
NR==FNR {                     # process the list file
    a[$1]                     # hash list entries to a
    next                      # next list item
}
($3 in a) || ($4 in a) {      # if 3rd or 4th field entry is in the hash
    print >> "out"            # append whole record to file "out"
}' list test/*                # first the list, then the rest of the files
The script hashes all the list entries into a and reads through the csv files looking for 3rd and 4th field entries in the hash, outputting the whole record when there is a match.
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
    printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"  # find value in third column
    printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"  # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches

Bash read line from variables

Hi & thanks in advance.
I'm trying to update a column (version) in a MySQL table from a Bash script.
I've populated a variable with the version numbers, but it fails after applying the first version in the list.
CODE:
UP_VER=`seq ${DB_VER} ${LT_VER} | sed '1d'`
UP_DB=`echo "UPDATE client SET current_db_vers='${UP_VER}' WHERE client_name='${CLIENT}'" | ${MYSQL_ID}`
while read -r line
do
${UP_DB}
if [[ "${OUT}" -eq "0" ]]; then
echo "Database upgraded.."
else
echo "Failed to upgrade.."
exit 1
fi
done < "${UP_VER}"
Thanks
Hopefully solved... My $UP_VER is in a row, not a column.
You're misunderstanding what several shell constructs do:
var=`command`       # This executes the command immediately, and stores
                    # its result (NOT the command itself) in the variable

... < "${UP_VER}"   # Treats the contents of $UP_VER as a filename, and tries
                    # to use that file as input

if [[ "${OUT}" -eq "0" ]]; then     # $OUT is not defined anywhere

... current_db_vers='${UP_VER}' ... # this sets current_db_vers to the entire
                                    # list of versions at once
Also, in the shell it's best to use lowercase (or mixed-case) variable names to avoid conflicts with the variables that have special meanings (which are all uppercase).
To fix the first problem, my recommendation is don't try to store shell commands in variables, it doesn't work right. (See BashFAQ #50: I'm trying to put a command in a variable, but the complex cases always fail!.) Either use a function, or just write the command directly where it's going to be executed. In this case I'd vote for just putting it directly where it's going to be executed. BTW, you're making the same mistake with ${MYSQL_ID}, so I'd recommend fixing that as well.
For the second problem, you can use <<< "${UP_VER}" to feed a variable's contents as input (although this is a bashism, and not available in generic posix shells). But in this case I'd just use a for loop:
for ((ver=db_ver+1; ver<=lt_ver; ver++)); do
For the third problem, the simplest way to test the success of a command is to put it directly in the if:
if somecommand; then
echo "Database upgraded.."
else # ... etc
So, here's my take at a rewrite:
mysql_id() {
    # appropriate function definition goes here...
}

for ((ver=db_ver+1; ver<=lt_ver; ver++)); do
    if echo "UPDATE client SET current_db_vers='${ver}' WHERE client_name='${client}'" | mysql_id; then
        echo "Database upgraded.."
    else
        echo "Failed to upgrade.."
        exit 1
    fi
done
... but I'm not sure I understand what it's supposed to do. It seems to be updating current_db_vers one number at a time until it reaches $lt_ver... but why not set it directly to $lt_ver in a single UPDATE?
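For reference, a sketch of that single-UPDATE variant; mysql_id is the same placeholder wrapper as above, and db_name stands in for whatever the real client invocation needs:
mysql_id() {
    # hypothetical wrapper; substitute the real mysql client invocation/credentials
    mysql --batch "$db_name"
}

if echo "UPDATE client SET current_db_vers='${lt_ver}' WHERE client_name='${client}'" | mysql_id; then
    echo "Database upgraded.."
else
    echo "Failed to upgrade.."
    exit 1
fi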
Try something like:
done <<< "${UP_VER}"

Split large directory into subdirectories

I have a directory with about 2.5 million files and is over 70 GB.
I want to split this into subdirectories, each with 1000 files in them.
Here's the command I've tried using:
i=0; for f in *; do d=dir_$(printf %03d $((i/1000+1))); mkdir -p $d; mv "$f" $d; let i++; done
That command works for me on a small scale, but I can leave it running for hours on this directory and it doesn't seem to do anything.
I'm open to doing this in any way via the command line: perl, python, etc. Just whatever way would be the fastest to get this done.
I suspect that if you checked, you'd notice your program was actually moving the files, albeit really slowly. Launching a program is rather expensive (at least compared to making a system call), and you do so three or four times per file! As such, the following should be much faster:
perl -e'
    my $base_dir_qfn = ".";
    my $i = 0;
    my $dir_qfn;

    opendir(my $dh, $base_dir_qfn)
        or die("Can'\''t open dir \"$base_dir_qfn\": $!\n");

    while (defined( my $fn = readdir($dh) )) {
        next if $fn =~ /^(?:\.\.?|dir_\d+)\z/;

        my $qfn = "$base_dir_qfn/$fn";

        if ($i % 1000 == 0) {
            $dir_qfn = sprintf("%s/dir_%03d", $base_dir_qfn, int($i/1000)+1);
            mkdir($dir_qfn)
                or die("Can'\''t make directory \"$dir_qfn\": $!\n");
        }

        rename($qfn, "$dir_qfn/$fn")
            or do {
                warn("Can'\''t move \"$qfn\" into \"$dir_qfn\": $!\n");
                next;
            };

        ++$i;
    }
'
Note: ikegami's helpful Perl-based answer is the way to go - it performs the entire operation in a single process and is therefore much faster than the Bash + standard utilities solution below.
To perform reasonably, a bash-based solution needs to avoid loops in which external utilities are called.
Your own solution calls two external utilities and creates a subshell in each loop iteration, which means that you'll end up creating about 7.5 million processes(!) in total.
The following solution avoids loops, but, given the sheer number of input files, will still take quite a while to complete (you'll end up creating 4 processes for every 1000 input files, i.e., ca. 10,000 processes in total):
printf '%s\0' * | xargs -0 -n 1000 bash -O nullglob -c '
    dirs=( dir_*/ )
    dir=dir_$(printf %04d $(( 1 + ${#dirs[@]} )))
    mkdir "$dir"; mv "$@" "$dir"' -
printf '%s\0' * prints a NUL-separated list of all files in the dir.
Note that since printf is a Bash builtin rather than an external utility, the max. command-line length as reported by getconf ARG_MAX does not apply.
xargs -0 -n 1000 invokes the specified command with chunks of 1000 input filenames.
Note that xargs -0 is nonstandard, but supported on both Linux and BSD/OSX.
Using NUL-separated input robustly passes filenames without fear of inadvertently splitting them into multiple parts, and even works with filenames with embedded newlines (though such filenames are very rare).
bash -O nullglob -c executes the specified command string with option nullglob turned on, which means that a globbing pattern that matches nothing will expand to the empty string.
The command string counts the output directories created so far, so as to determine the name of the next output dir with the next higher index, creates the next output dir, and moves the current batch of (up to) 1000 files there.
If the directory is not in use, I suggest the following:
find . -maxdepth 1 -type f | split -l 1000 -d -a 5
This will create a number of files named x00000 - x02500 (5 digits just to be safe, although 4 would work too). You can then move the 1000 files listed in each of those files to a corresponding directory.
Perhaps set -o noclobber to eliminate the risk of overwrites in case of a name clash.
To move the files, it's easier to use brace expansion to iterate over the file names:
for c in x{00000..02500}; do
    d="d$c"
    mkdir "$d"
    cat "$c" | xargs -I f mv f "$d"
done
Moving files around is always a challenge. IMHO all the solutions presented so far have some risk of destroying your files. This may be because the challenge sounds simple, but there is a lot to consider and to test when implementing it.
We must also not underestimate the efficiency of the solution as we are potentially handling a (very) large number of files.
Here is a script, carefully and intensively tested with my own files. But of course, use it at your own risk!
This solution:
is safe with filenames that contain spaces.
does not use xargs -L because this will easily result in "Argument list too long" errors
is based on Bash 4 and does not depend on awk, sed, tr etc.
is scaling well with the amount of files to move.
Here is the code:
if [[ "${BASH_VERSINFO[0]}" -lt 4 ]]; then
echo "$(basename "$0") requires Bash 4+"
exit -1
fi >&2
opt_dir=${1:-.}
opt_max=1000
readarray files <<< "$(find "$opt_dir" -maxdepth 1 -mindepth 1 -type f)"
moved=0 dirnum=0 dirname=''
for ((i=0; i < ${#files[#]}; ++i))
do
if [[ $((i % opt_max)) == 0 ]]; then
((dirnum++))
dirname="$opt_dir/$(printf "%02d" $dirnum)"
fi
# chops the LF printed by "find"
file=${files[$i]::-1}
if [[ -n $file ]]; then
[[ -d $dirname ]] || mkdir -v "$dirname" || exit
mv "$file" "$dirname" || exit
((moved++))
fi
done
echo "moved $moved file(s)"
For example, save this as split_directory.sh. Now let's assume you have 2001 files in some/dir:
$ split_directory.sh some/dir
mkdir: created directory some/dir/01
mkdir: created directory some/dir/02
mkdir: created directory some/dir/03
moved 2001 file(s)
Now the new reality looks like this:
some/dir contains 3 directories and 0 files
some/dir/01 contains 1000 files
some/dir/02 contains 1000 files
some/dir/03 contains 1 file
Calling the script again on the same directory is safe and returns almost immediately:
$ split_directory.sh some/dir
moved 0 file(s)
Finally, let's take a look at the special case where we call the script on one of the generated directories:
$ time split_directory.sh some/dir/01
mkdir: created directory 'some/dir/01/01'
moved 1000 file(s)
real 0m19.265s
user 0m4.462s
sys 0m11.184s
$ time split_directory.sh some/dir/01
moved 0 file(s)
real 0m0.140s
user 0m0.015s
sys 0m0.123s
Note that this test ran on a fairly slow, veteran computer.
Good luck :-)
This is probably slower than a Perl program (1 minute for 10,000 files) but it should work with any POSIX-compliant shell.
#! /bin/sh
nd=0
nf=0
/bin/ls | \
while read file
do
    case $(expr $nf % 10) in
    0)
        nd=$(/usr/bin/expr $nd + 1)
        dir=$(printf "dir_%04d" $nd)
        mkdir $dir
        ;;
    esac
    mv "$file" "$dir/$file"
    nf=$(/usr/bin/expr $nf + 1)
done
With bash, you can use arithmetic expansion $((...)).
And of course this idea can be improved by using xargs - should not take longer than ~ 45 sec for 2.5 million files.
nd=0
ls | xargs -L 1000 echo | \
while read cmd
do
    nd=$((nd+1))
    dir=$(printf "dir_%04d" $nd)
    mkdir $dir
    mv $cmd $dir
done
I would use the following from the command line:
find . -maxdepth 1 -type f | split -l 1000
for i in `ls x*`
do
    mkdir dir$i
    mv `cat $i` dir$i 2>/dev/null &
done
Key is the "&" which threads out each mv statement.
Thanks to karakfa for the split idea.

Load web-application logs generated every 10 mins into MySQL, automatically

My purpose is to analyse web-application logs, using MySQL as the database. First, I filter out some useless information with awk to generate a filtered log, then I use LOAD DATA to import this log into MySQL.
My problem is: the original logs are generated every 10 minutes, every day. How can I generate filtered logs as soon as new web-application logs are created? And after new filtered logs are generated, how can I import those files into MySQL automatically?
The original logs:
20150414/0900.log
20150414/0910.log
I've created a little script that should illustrate one way to do it. An awk program keeps track of all the files that have already been read. If the number of log files has grown the next time the listing runs, the script extracts the new names and appends them to the "readFiles" file, which the awk program checks to make sure a file wasn't read before.
Please check that your system won't erase old logs, and be careful to split the control file, or create a new one each day, to avoid very big files.
# this will give you today's date
date +%Y%m%d
This is the code:
echo "x" > readFiles
lastnum=0
num=0
count=0
while true
do
echo "LOOKING FOR NEW FILES. LASTCOUNT="$lastcount
count=`ls ./2015*/*.log | wc -l`
echo $count
if [ $count -gt $lastnum ]
then
lastnum=$count
`ls ./2015*/*.log | awk -F"/" 'BEGIN {
while(( getline < "readFiles") > 0 ) {
readedFiles[$0]
}}
{if(!($0 in readedFiles)){print $0}}
'`>> readFiles
echo "WAITING RESTART"
sleep 10
else
echo "NO NEW FILES FOUND"
sleep 10
fi
done
Instead of writing a script to poll for new logs, I use inotify-tools to trigger scripts on filesystem events; just a few lines get things done.
NOW=$(date +"%Y%m%d")
while true ;
do
    inotifywait -r -e create,move /rsynclog/logs/$NOW && \
        /rsynclog/logs/generate.sh
done
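The generate.sh script referenced above isn't shown in the question; here is a hedged sketch of what it could look like, assuming an awk filter like the one already applied to the original logs, a target table named weblog, LOCAL INFILE enabled on the server, and credentials configured elsewhere (e.g. ~/.my.cnf). The filter condition, field separators, and names are all placeholders:
#!/bin/bash
# generate.sh - filter the most recently written log and bulk-load it into MySQL
NOW=$(date +"%Y%m%d")
src=$(ls -t /rsynclog/logs/"$NOW"/*.log | head -n 1)   # newest raw log
out="${src%.log}.filtered"

# placeholder filter: keep only lines whose 9th field is a 200 status
awk '$9 == 200' "$src" > "$out"

mysql --local-infile=1 logdb -e "
    LOAD DATA LOCAL INFILE '$out'
    INTO TABLE weblog
    FIELDS TERMINATED BY ' '
    LINES TERMINATED BY '\n';"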

Processing MySQL result in bash

I currently have a bash script with a few thousand lines which sends various queries to MySQL to generate applicable output for munin.
Up until now the results were simply numbers, which weren't a problem, but now I'm facing the challenge of working with a more complex query of the form:
$ echo "SELECT id, name FROM type ORDER BY sort" | mysql test
id name
2 Name1
1 Name2
3 Name3
From this result I need to store the id and name (and their respective association) and based on the IDs need to perform further queries, e.g. SELECT COUNT(*) FROM somedata WHERE type = 2 and later output that result paired with the associated name column from the first result.
I'd easily know how to do it in PHP/Ruby, but I'd like to avoid forking another process, especially since it's polled regularly; however, I'm completely lost on where to start with bash.
Maybe using bash is the wrong approach anyway and I should just fork out?
I'm using GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu).
My example is not Bash, but I'd like to point out the parameters I use when invoking the mysql command: they suppress the boxing and the headers.
#!/bin/sh
mysql dbname -B -N -s -e "SELECT * FROM tbl" | while read -r line
do
    echo "$line" | cut -f1   # outputs col #1
    echo "$line" | cut -f2   # outputs col #2
    echo "$line" | cut -f3   # outputs col #3
done
You would use a while read loop to process the output of that command.
echo "SELECT id, name FROM type ORDER BY sort" | mysql test | while read -r line
do
# you could use an if statement to skip the header line
do_something "$line"
done
or store it in an array:
while read -r line
do
    array+=("$line")
done < <(echo "SELECT id, name FROM type ORDER BY sort" | mysql test)
That's a general overview of the technique. If you have more specific questions post them separately or if they're very simple post them in a comment or as an edit to your original question.
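For the concrete case in your question, a short sketch that keeps the id/name association and runs the follow-up count per id (table and column names are taken from your question; -N and -B suppress the header and the boxing, as shown in the other answer):
# read tab-separated id/name pairs, then count rows in somedata for each id
while IFS=$'\t' read -r id name; do
    count=$(mysql -N -B test -e "SELECT COUNT(*) FROM somedata WHERE type = ${id}")
    echo "${name}: ${count}"
done < <(mysql -N -B test -e "SELECT id, name FROM type ORDER BY sort")
Note that each mysql invocation here still forks a new client and opens a new connection, which is the trade-off discussed in the next answer.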
You're going to "fork out," as you put it, to the mysql command line client program anyhow. So either way you're going to have process-creation overhead. With your approach of using a new invocation of mysql for each query you're also going to incur the cost of connecting to and authenticating to the mysqld server multiple times. That's expensive, but the expense may not matter if this app doesn't scale up.
Making it secure against sql injection is another matter. If you prompt a user for her name and she answers "sally;drop table type;" she's laughing and you're screwed.
You might be wise to use a language that's more expressive in the areas that are important for database access for some of your logic. Ruby, PHP, and Perl are all good choices. Perl happens to be tuned and designed to run snappily under shell script control.