Speed up Bash ID3 to MySQL import

I have 2 Bash scripts that go through a directory, extract ID3 info from MP3s, and then import the tag info into a MySQL DB. It takes quite a while to finish running, so I was hoping somebody could help me make the scripts a bit more efficient.
Scripts are as follows:
makeid3dbentry.sh
TRACK=$(id3info "$1" | grep '^=== TIT2' | sed -e 's/.*: //g')
ARTIST=$(id3info "$1" | grep '^=== TPE1' | sed -e 's/.*: //g')
ALBUM=$(id3info "$1" | grep '^=== TALB' | sed -e 's/.*: //g')
ALBUMARTIST=$(id3info "$1" | grep '^=== TPE2' | sed -e 's/.*: //g')
COLS='`artist`,`name`,`album`,`albumartist`,`filename`'
# Replace all: ${string//substring/replacement} to escape "
VALS='"'${ARTIST//\"/\\\"}'","'${TRACK//\"/\\\"}'","'${ALBUM//\"/\\\"}'","'${ALBUMARTIST//\"/\\\"}'","'${1}'"'
SETLIST='`artist`="'${ARTIST//\"/\\\"}'",`name`="'${TRACK//\"/\\\"}'",`album`="'${ALBUM//\"/\\\"}'",`albumartist`="'${ALBUMARTIST//\"/\\\"}'",`filename`="'${1}'"'
echo 'INSERT INTO `music` ('${COLS}') VALUES ('${VALS}') ON DUPLICATE KEY UPDATE '${SETLIST}';'
exit
That produces an INSERT statement like
INSERT INTO `music` (`artist`,`name`,`album`,`albumartist`,`filename`) VALUES ("1200 Micrograms","Ayahuasca","1200 Micrograms","1200 Micrograms","/mnt/sharedmedia/music/Albums/1200 Micrograms/1200 Micrograms [2002]/1-01 - 1200 Micrograms - Ayahuasca.mp3") ON DUPLICATE KEY UPDATE `artist`="1200 Micrograms",`name`="Ayahuasca",`album`="1200 Micrograms",`albumartist`="1200 Micrograms",`filename`="/mnt/sharedmedia/music/Albums/1200 Micrograms/1200 Micrograms [2002]/1-01 - 1200 Micrograms - Ayahuasca.mp3";
That is then called from the main update script:
updatemusicdb.sh
DIRFULLPATH="${1}"
DIRECTORY=$(basename "${DIRFULLPATH}")
SQLFILE="/var/www/html/scripts/sql/rebuilddb_${DIRECTORY}.sql"
find "${DIRFULLPATH}" -type f -iname "*.mp3" -exec /var/www/html/scripts/bash/makeid3dbentry.sh {} > "${SQLFILE}" \;
mysql --defaults-extra-file=/var/www/html/config/website.cnf --default-character-set=utf8 "website" < "${SQLFILE}"
Unfortunately I don't know Bash and the Linux environment well enough to see where the bottlenecks are and how to improve these scripts. I would appreciate any advice on improving the scripts, or even a different approach if it's better or faster.

You can avoid running id3info multiple times, as already suggested in a comment.
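For instance, a minimal sketch of makeid3dbentry.sh with id3info called only once per file (assuming Bash for the here-strings; the extraction logic is unchanged from the original):
# capture the id3info output once; id3info is the expensive call here
INFO=$(id3info "$1")
TRACK=$(grep '^=== TIT2' <<< "$INFO" | sed -e 's/.*: //g')
ARTIST=$(grep '^=== TPE1' <<< "$INFO" | sed -e 's/.*: //g')
ALBUM=$(grep '^=== TALB' <<< "$INFO" | sed -e 's/.*: //g')
ALBUMARTIST=$(grep '^=== TPE2' <<< "$INFO" | sed -e 's/.*: //g')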
You can also make makeid3dbentry.sh take multiple arguments by putting a simple for file in "$@" loop around your code. Then you can run find with -exec yourscript.sh {} + (which works like xargs). This way you greatly reduce the number of invocations of your script. I'd probably recommend just doing this whole thing as one script, though: wrap the INSERT statement generation in a loop over your find command's output (without the -exec parameter) and redirect that output to the file.
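For instance, a sketch of the batched approach (paths copied from your scripts; the INSERT-building code stays exactly as it is):
# makeid3dbentry.sh: loop over every filename that find passes in
for file in "$@"; do
    INFO=$(id3info "$file")
    # ... build and echo the INSERT statement for "$file" as in the original ...
done
and in updatemusicdb.sh:
# -exec ... + hands many filenames to each invocation instead of one
find "${DIRFULLPATH}" -type f -iname "*.mp3" -exec /var/www/html/scripts/bash/makeid3dbentry.sh {} + > "${SQLFILE}"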
Assuming your MySQL database is using InnoDB you can speed up insertion by telling MySQL to skip committing the data until the job is done (instead of doing it on every insert).
Insert START TRANSACTION; at the top of your SQLFILE and COMMIT; at the bottom.
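For example, a minimal sketch of the loading step with the transaction wrapper added (paths copied from updatemusicdb.sh):
# wrap all generated statements in a single transaction so InnoDB commits only once
{
    echo 'START TRANSACTION;'
    cat "${SQLFILE}"
    echo 'COMMIT;'
} | mysql --defaults-extra-file=/var/www/html/config/website.cnf --default-character-set=utf8 "website"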
See http://dev.mysql.com/doc/refman/5.6/en/commit.html , http://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb-bulk-data-loading.html

Related

Shell script - math operations and loops

I'm writing an sh file to run via cron jobs, but I have no shell scripting experience. I want to get the row count from a MySQL query, divide it by 200, ceil the result, and start a loop from 0 to that amount.
After a long search, I wrote this line to get the row count from the MySQL query, and it works fine:
total=`mysql database -uuser -ppassword -s -N -e "SELECT count(id) as total FROM users"`
but nothing I found on Google helps me complete my work. I tried "expr" and "let" for the math operations, but I don't know why they aren't working.
Even the loop examples I found on Google aren't working.
Can you help me with this script?
I guess you are using Bash. Here is how you divide:
#!/usr/bin/env bash
# ^
# |
# |
# --> This should be at the top of your script to make sure you run Bash
value=$((total / 200)) # total => value returned from mysql
for ((i = 0; i <= $value; i++)); do
# your code here
done
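Since you mentioned wanting to ceil the result, note that $((total / 200)) rounds down; a common idiom to round up instead is:
value=$(( (total + 199) / 200 ))   # integer ceiling of total / 200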
Stack Overflow has an astounding number of examples. I recommend taking a look at the Bash documentation here: https://stackoverflow.com/tags/bash/info.

Split large directory into subdirectories

I have a directory with about 2.5 million files that is over 70 GB in total.
I want to split this into subdirectories, each with 1000 files in them.
Here's the command I've tried using:
i=0; for f in *; do d=dir_$(printf %03d $((i/1000+1))); mkdir -p $d; mv "$f" $d; let i++; done
That command works for me on a small scale, but I can leave it running for hours on this directory and it doesn't seem to do anything.
I'm open to doing this any way via the command line: perl, python, etc. Just whatever way would be the fastest to get this done...
I suspect that if you checked, you'd notice your program was actually moving the files, albeit really slowly. Launching a program is rather expensive (at least compared to making a system call), and you do so three or four times per file! As such, the following should be much faster:
perl -e'
my $base_dir_qfn = ".";
my $i = 0;
my $dir_qfn;
opendir(my $dh, $base_dir_qfn)
or die("Can'\''t open dir \"$base_dir_qfn\": $!\n");
while (defined( my $fn = readdir($dh) )) {
next if $fn =~ /^(?:\.\.?|dir_\d+)\z/;
my $qfn = "$base_dir_qfn/$fn";
if ($i % 1000 == 0) {
$dir_qfn = sprintf("%s/dir_%03d", $base_dir_qfn, int($i/1000)+1);
mkdir($dir_qfn)
or die("Can'\''t make directory \"$dir_qfn\": $!\n");
}
rename($qfn, "$dir_qfn/$fn")
or do {
warn("Can'\''t move \"$qfn\" into \"$dir_qfn\": $!\n");
next;
};
++$i;
}
'
Note: ikegami's helpful Perl-based answer is the way to go - it performs the entire operation in a single process and is therefore much faster than the Bash + standard utilities solution below.
A bash-based solution needs to avoid loops in which external utilities are called in order to perform reasonably.
Your own solution calls two external utilities and creates a subshell in each loop iteration, which means that you'll end up creating about 7.5 million processes(!) in total.
The following solution avoids loops, but, given the sheer number of input files, will still take quite a while to complete (you'll end up creating 4 processes for every 1000 input files, i.e., ca. 10,000 processes in total):
printf '%s\0' * | xargs -0 -n 1000 bash -O nullglob -c '
dirs=( dir_*/ )
dir=dir_$(printf %04d $(( 1 + ${#dirs[@]} )))
mkdir "$dir"; mv "$@" "$dir"' -
printf '%s\0' * prints a NUL-separated list of all files in the dir.
Note that since printf is a Bash builtin rather than an external utility, the max. command-line length as reported by getconf ARG_MAX does not apply.
xargs -0 -n 1000 invokes the specified command with chunks of 1000 input filenames.
Note that xargs -0 is nonstandard, but supported on both Linux and BSD/OSX.
Using NUL-separated input robustly passes filenames without fear of inadvertently splitting them into multiple parts, and even works with filenames with embedded newlines (though such filenames are very rare).
bash -O nullglob -c executes the specified command string with option nullglob turned on, which means that a globbing pattern that matches nothing expands to nothing rather than to itself.
The command string counts the output directories created so far, so as to determine the name of the next output dir with the next higher index, creates the next output dir, and moves the current batch of (up to) 1000 files there.
If the directory is not in use, I suggest the following:
find . -maxdepth 1 -type f | split -l 1000 -d -a 5
This will create a number of list files named x00000 - x02500 (5 digits just to be safe, although 4 will work too). You can then move the 1000 files listed in each list file to a corresponding directory.
Perhaps set -o noclobber to eliminate the risk of overwrites in case of a name clash.
To move the files, it's easier to use brace expansion to iterate over the list-file names:
for c in x{00000..02500};
do d="d$c";
mkdir $d;
cat $c | xargs -I f mv f $d;
done
Moving files around is always a challenge. IMHO all the solutions presented so far carry some risk of destroying your files. This may be because the challenge sounds simple, but there is a lot to consider and to test when implementing it.
We must also not underestimate the efficiency of the solution, as we are potentially handling a (very) large number of files.
Here is a script that I carefully and intensively tested with my own files. But of course, use it at your own risk!
This solution:
is safe with filenames that contain spaces.
does not use xargs -L because this will easily result in "Argument list too long" errors
is based on Bash 4 and does not depend on awk, sed, tr etc.
is scaling well with the amount of files to move.
Here is the code:
if [[ "${BASH_VERSINFO[0]}" -lt 4 ]]; then
echo "$(basename "$0") requires Bash 4+"
exit -1
fi >&2
opt_dir=${1:-.}
opt_max=1000
readarray files <<< "$(find "$opt_dir" -maxdepth 1 -mindepth 1 -type f)"
moved=0 dirnum=0 dirname=''
for ((i=0; i < ${#files[@]}; ++i))
do
if [[ $((i % opt_max)) == 0 ]]; then
((dirnum++))
dirname="$opt_dir/$(printf "%02d" $dirnum)"
fi
# chops the LF printed by "find"
file=${files[$i]::-1}
if [[ -n $file ]]; then
[[ -d $dirname ]] || mkdir -v "$dirname" || exit
mv "$file" "$dirname" || exit
((moved++))
fi
done
echo "moved $moved file(s)"
For example, save this as split_directory.sh. Now let's assume you have 2001 files in some/dir:
$ split_directory.sh some/dir
mkdir: created directory some/dir/01
mkdir: created directory some/dir/02
mkdir: created directory some/dir/03
moved 2001 file(s)
Now the new reality looks like this:
some/dir contains 3 directories and 0 files
some/dir/01 contains 1000 files
some/dir/02 contains 1000 files
some/dir/03 contains 1 file
Calling the script again on the same directory is safe and returns almost immediately:
$ split_directory.sh some/dir
moved 0 file(s)
Finally, let's take a look at the special case where we call the script on one of the generated directories:
$ time split_directory.sh some/dir/01
mkdir: created directory 'some/dir/01/01'
moved 1000 file(s)
real 0m19.265s
user 0m4.462s
sys 0m11.184s
$ time split_directory.sh some/dir/01
moved 0 file(s)
real 0m0.140s
user 0m0.015s
sys 0m0.123s
Note that this test ran on a fairly slow, veteran computer.
Good luck :-)
This is probably slower than a Perl program (1 minute for 10,000 files) but it should work with any POSIX-compliant shell.
#! /bin/sh
nd=0
nf=0
/bin/ls | \
while read file;
do
case $(expr $nf % 1000) in
0)
nd=$(/usr/bin/expr $nd + 1)
dir=$(printf "dir_%04d" $nd)
mkdir $dir
;;
esac
mv "$file" "$dir/$file"
nf=$(/usr/bin/expr $nf + 1)
done
With bash, you can use arithmetic expansion $((...)).
And of course this idea can be improved by using xargs - should not take longer than ~ 45 sec for 2.5 million files.
nd=0
ls | xargs -L 1000 echo | \
while read cmd;
do
nd=$((nd+1))
dir=$(printf "dir_%04d" $nd)
mkdir $dir
mv $cmd $dir
done
I would use the following from the command line:
find . -maxdepth 1 -type f | split -l 1000
for i in `ls x*`
do
mkdir dir$i
mv `cat $i` dir$i 2>/dev/null &
done
The key is the "&", which runs each mv statement in the background.
Thanks to karakfa for the split idea.

linux command line to zip files based on mysql resultset

I have a table where some filenames are stored.
I would like to find all the files having that name under a specific folder and zip all of them.
On disk the structure is similar to this:
/folder/sub1/file1
/folder/sub1/file2
/folder/sub2/file1 <- same name as under sub1
/folder/sub2/file2
So I am looking for something similar to:
mysql -e "select file from table" | find /folder -type f -name <the value of file from mysql result set> | zip <all files found by all find commands>
Thanks.
A couple of additions to your command:
Firstly, you want to use mysql in batch mode, so you do this:
mysql -Be "select file from table"
It gives you a single column table with no borders, so you get rid of the headers by piping it to tail starting at the second line:
tail -n +2
Then you pipe that to xargs, but before you do, hack it a bit with concat (you'll see why in a sec):
mysql -Be "select concat(' -o -name ', file) from table"
NOW you pipe it to xargs:
xargs find /folder -false
This does a "false" test (i.e. a no-op), and xargs appends to it a whole pile of arguments like -o -name somename.file, each of which performs a boolean OR (with -false originally, then with all the other file names) and ultimately returns the list of files that match.
...which you finally pipe to zip, with another xargs:
xargs zip files.zip
Again, this puts the file names as arguments to zip.
Here's the total line:
mysql -Be "select concat(' -o -name ', file) from table" | tail -n +2 | xargs find /folder -false | xargs zip files.zip
Bear in mind that this assumes you have no spaces in your filenames. If you do, that adds a bit of complexity: you can work around it by using -print0 and -0 in find and xargs respectively, although zip will have a harder time with that, so you'd need to add another intermediate stage (or use zip -r).
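For example, a minimal sketch of that safer variant (table and column names as in the question; it runs one find per filename, so it is slower, but it handles spaces):
# -print0 / -0 keep filenames intact even if they contain spaces
mysql -B -N -e "select file from table" |
while IFS= read -r name; do
    find /folder -type f -name "$name" -print0
done | xargs -0 zip files.zip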

Parse ClamAV logs in Bash script using Regex to insert in MySQL

Morning/Evening all,
I've got a problem where I'm making a script for work that uses ClamAV to scan for malware and then places its results in MySQL, by running grep and awk over the resulting ClamAV logs to turn the right parts of the log into variables. The problem I have is that while I've handled the summary OK, the syntax of the detection lines makes it slightly more difficult. I'm no regex expert by any means and this is a bit of a learning experience, so there is probably a far better way of doing it than mine!
The lines I'm trying to parse looks like these:
/net/nas/vol0/home/recep/SG4rt.exe: Worm.SomeFool.P FOUND
/net/nas/vol0/home/recep/SG4rt.exe: moved to '/srv/clamav/quarantine/SG4rt.exe'
As far as I was able to establish, I need a positive lookbehind to match what happens after and before the colon, without actually matching the colon or the space after it, and I can't see a clear way of doing it from RegExr without it thinking I'm trying to look for two colons. To make matters worse, we sometimes get these too...
WARNING: Can't open file /net/nas/vol0/home/laser/samples/sample1.avi: Permission denied
The end result should be a MySQL query that inserts the path, the malware found, and where it was moved to, or, if there was an error, the path and the error encountered, converting each element into a variable inside a while statement.
I've done the scan summary as follows:
The summary looks like this:
----------- SCAN SUMMARY -----------
Known viruses: 329
Engine version: 0.97.1
Scanned directories: 17350
Scanned files: 50342
Infected files: 3
Total errors: 1
Data scanned: 15551.73 MB
Data read: 16382.67 MB (ratio 0.95:1)
Time: 3765.236 sec (62 m 45 s)
Parsing like this:
SCANNED_DIRS=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned directories" | awk '{gsub("Scanned directories: ", "");print}')
SCANNED_FILES=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned files" | awk '{gsub("Scanned files: ", "");print}')
INFECTED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Infected files" | awk '{gsub("Infected files: ", "");print}')
DATA_SCANNED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data scanned" | awk '{gsub("Data scanned: ", "");print}')
DATA_READ=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data read" | awk '{gsub("Data read: ", "");print}')
TIME_TAKEN=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Time" | awk '{gsub("Time: ", "");print}')
END_TIME=$(date +%s)
mysql -u scanner_parser --password=removed sc_live -e "INSERT INTO bs.live.bs_jobstat VALUES (NULL, '$CURRTIME', '$PID', '$IY', '$SCANNED_DIRS', '$SCANNED_FILES', '$INFECTED', '$DATA_SCANNED', '$DATA_READ', '$TIME_TAKEN', '$END_TIME');"
rm -f /srv/clamav/$IY-scan-$LOGTIME.log
Some of those variables are from other parts of the script and can be ignored. The reason I'm doing this is to save logfile clutter and have a simple web based overview of the status of the system.
Any clues? Am I going about all this the wrong way? Thanks for help in advance, I do appreciate it!
From what I can determine from the question, it seems like you are asking how to distinguish the lines you want from the logger lines that start with WARNING, ERROR, INFO.
You can do this without getting too fancy with lookahead or lookbehind. Just grep for lines beginning with
"/net/nas/vol0/home/recep/SG4rt.exe: "
then using awk you can extract the remainder of the line. Or you can gsub the prefix out like you are doing in the summary processing section.
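For example, a minimal sketch that keys on the trailing FOUND instead, pulling the path and the signature out with plain parameter expansion (log path reused from your summary code):
grep ' FOUND$' "/srv/clamav/$IY-scan-$LOGTIME.log" | while IFS= read -r line; do
    path=${line%: *}                       # everything before the ": " separator
    sig=${line##*: } ; sig=${sig% FOUND}   # signature name without the trailing FOUND
    echo "path='$path' signature='$sig'"
done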
As far as the question about processing the summary goes, what strikes me most is that you are processing the entire file multiple times, each time pulling out one kind of line. For tasks like this, I would use Perl, Ruby, or Python and make one pass through the file, collecting the pieces of each line after the colon, storing them in regular programming language variables (not env variables), and forming the MySQL insert string using interpolation.
Bash is great for some things but IMHO you are justified in using a more general scripting language (Perl, Python, Ruby come to mind).
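That said, if you do want to stay in the shell, here is a minimal sketch of a single awk pass over the summary (log path and variable names taken from your script; it assumes Bash for the process substitution):
LOG="/srv/clamav/$IY-scan-$LOGTIME.log"
# one awk pass collects every field; read splits them back apart on "|"
IFS='|' read -r SCANNED_DIRS SCANNED_FILES INFECTED DATA_SCANNED DATA_READ TIME_TAKEN < <(
    awk -F': ' '
        /^Scanned directories:/ { dirs = $2 }
        /^Scanned files:/       { files = $2 }
        /^Infected files:/      { inf = $2 }
        /^Data scanned:/        { ds = $2 }
        /^Data read:/           { dr = $2 }
        /^Time:/                { t = $2 }
        END { print dirs "|" files "|" inf "|" ds "|" dr "|" t }
    ' "$LOG"
)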

Processing MySQL result in bash

I currently have a Bash script of a few thousand lines which sends various queries to MySQL to generate applicable output for munin.
Up until now the results were simply numbers, which weren't a problem, but now I'm facing the challenge of working with a more complex query of the form:
$ echo "SELECT id, name FROM type ORDER BY sort" | mysql test
id name
2 Name1
1 Name2
3 Name3
From this result I need to store the id and name (and their respective association), and based on the IDs I need to perform further queries, e.g. SELECT COUNT(*) FROM somedata WHERE type = 2, and later output that result paired with the associated name column from the first result.
I'd easily know how to do it in PHP/Ruby, but I'd like to avoid forking another process, especially since it's polled regularly, and I'm completely lost as to where to start with bash.
Maybe using bash is the wrong approach anyway and I should just fork out?
I'm using GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu).
My example is not Bash, but I'd like to point out the parameters I use when invoking the mysql command; they suppress the table borders and the headers.
#!/bin/sh
mysql dbname -B -N -s -e "SELECT * FROM tbl" | while read -r line
do
echo "$line" | cut -f1 # outputs col #1
echo "$line" | cut -f2 # outputs col #2
echo "$line" | cut -f3 # outputs col #3
done
You would use a while read loop to process the output of that command.
echo "SELECT id, name FROM type ORDER BY sort" | mysql test | while read -r line
do
# you could use an if statement to skip the header line
do_something "$line"
done
or store it in an array:
while read -r line
do
array+=("$line")
done < <(echo "SELECT id, name FROM type ORDER BY sort" | mysql test)
That's a general overview of the technique. If you have more specific questions post them separately or if they're very simple post them in a comment or as an edit to your original question.
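Putting that together with the -B and -N flags shown in the other answer, a minimal sketch (table and column names taken from your question):
# read the two tab-separated columns into separate variables, then run the follow-up count query
echo "SELECT id, name FROM type ORDER BY sort" | mysql test -B -N | while IFS=$'\t' read -r id name
do
    count=$(mysql test -B -N -e "SELECT COUNT(*) FROM somedata WHERE type = $id")
    echo "$name: $count"
done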
You're going to "fork out," as you put it, to the mysql command line client program anyhow. So either way you're going to have process-creation overhead. With your approach of using a new invocation of mysql for each query you're also going to incur the cost of connecting to and authenticating to the mysqld server multiple times. That's expensive, but the expense may not matter if this app doesn't scale up.
Making it secure against SQL injection is another matter. If you prompt a user for her name and she answers "sally;drop table type;" she's laughing and you're screwed.
You might be wise to use a language that's more expressive in the areas that are important for database access for some of your logic. Ruby, PHP, and Perl are all good choices. Perl happens to be tuned and designed to run snappily under shell-script control.