linux command line to zip files based on mysql resultset

I have a table where some filenames are stored.
I would like to find all the files with those names under a specific folder and zip all of them.
On disk the structure is similar to this:
/folder/sub1/file1
/folder/sub1/file2
/folder/sub2/file1 <- same name as under sub1
/folder/sub2/file2
So I am looking for something similar to:
mysql -e "select file from table" | find /folder -type f -name <the value of file from mysql result set> | zip <all files found by all find commands>
Thanks.

Couple of additions to your command:
Firstly, you want to use mysql in batch mode, so you do this:
mysql -Be "select file from table"
It gives you a single column table with no borders, so you get rid of the headers by piping it to tail starting at the second line:
tail -n +2
Then you pipe that to xargs, but before you do, hack it a bit with concat (you'll see why in a sec):
mysql -Be "select concat(' -o -name ', file) from table"
NOW you pipe it to xargs:
xargs find /folder -false
This starts from a "false" test (i.e. a no-op), and xargs appends a whole pile of arguments like -o -name somename.file, each of which ORs another name test onto the expression (with false originally, then with all the other file names), so find ultimately returns the list of files that match any of them.
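To make this concrete: if the table contained just file1 and file2 (the names from the example at the top), what actually runs is equivalent to:
find /folder -false -o -name file1 -o -name file2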
...which you finally pipe to zip, with another xargs:
xargs zip files.zip
Again, this puts the file names as arguments to zip.
Here's the total line:
mysql -Be "select concat(' -o -name ', file) from table" | tail -n +2 | xargs find /folder -false | xargs zip files.zip
Bear in mind that this assumes you have no spaces in your filenames. If you do, that'll add a bit of complexity: You can work around that by using -print0 and -0 in find and xargs respectively, although zip will have a harder time with that so you'd need to add another intermediate stage (or use zip -r).
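If spaces in filenames are a real concern, here is a minimal alternative sketch (assuming GNU find and xargs for -print0/-0; it is slower because it runs one find per filename, and mysql's -N drops the column header so tail isn't needed):
mysql -NBe "select file from table" | while IFS= read -r name; do
    find /folder -type f -name "$name" -print0
done | xargs -0 zip files.zip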

Related

Export Redis hashes to csv

This answer doesn't work for me
I run this command to find the keys that I want
SCAN 0 MATCH "test_user:*"
so I got a (very long) list of hashes that I want to export to CSV.
I tried
SCAN 0 MATCH "test_user:*" > list.csv
or simply
SCAN 0 MATCH "test_user:*" > list.txt
but I always get a syntax error in response.
Any idea?
The only way I found is this (creating a shell script):
redis-cli --scan --pattern 'test_user:*' |\
grep -e "^test_user:[^:]*$" |\
awk '{print "hmget " $0 " id display_name reputation location"}' |\
redis-cli --csv > test_user.csv
It works very well for scanning by pattern, and you can use a regex for better accuracy.
Then an awk script builds the redis 'hmget' command for each key.
Finally the output is written to a csv file via the --csv option.
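For reference, the awk stage simply turns every matching key into an hmget command for the second redis-cli; assuming hypothetical keys such as test_user:42 and test_user:57, the stream it feeds downstream looks like:
hmget test_user:42 id display_name reputation location
hmget test_user:57 id display_name reputation location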
https://rdbtools.com/blog/redis-export-hashes-as-csv-using-cli/

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (two times a day) I need to compare a list (10,000 entries) with all csv files. If one of the entries is identical to the third or fourth column of one of the csv files, I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}
getArray "entries.log"
for e in "${array[@]}"
do
    echo "$e"
    /bin/grep "$e" ./csv/* >> found
done
This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 of about 10,000 entries.
MySQL
The next try was to import all csv files into a mysql database. But there I had problems with my table at around 50,000,000 entries.
So I wrote a script which created a new table after 49,000,000 entries, and that way I was able to import all csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either; it slowed the import down to days instead of only a few hours.
The select statement was horrible, but it worked. It was much faster than the "grep" solution but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an ssd. But I hope there are other ways.
Some improvements to your script (the first is unlikely to offer you meaningful benefits on its own):
Use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
Use grep with a file of patterns and the appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). This alone should give you a real, meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
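Putting both suggestions together, the whole job collapses to a single command (a sketch; -h drops the filename prefix so only the csv rows land in the output, with the caveat that -w matches an entry anywhere on the line, not only in columns 3 and 4):
grep -hFwf entries.log ./csv/* > found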
In awk, assuming all the csv files change each time; otherwise it would be wise to keep track of the already-checked files. But first, some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk '                            # awk
  NR==FNR {                        # process the list file
    a[$1]                          # hash list entries to a
    next                           # next list item
  }
  ($3 in a) || ($4 in a) {         # if 3rd or 4th field entry in hash
    print >> "out"                 # append whole record to file "out"
  }' list test/*                   # first list then the rest of the files
The script hashes all the list entries into a and reads through the csv files, looking for 3rd or 4th field entries in the hash and outputting the record when there is a match. (If your real files are comma-separated rather than whitespace-separated, add -F, so awk splits fields on commas.)
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
    printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"   # find value in third column
    printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"   # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches
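For a hypothetical entry foo123 in entries.log, patterns.dat would end up containing these two lines (keep in mind that -E treats them as extended regexes, so entries containing regex metacharacters would need escaping):
^[^,]+,[^,]+,foo123,[^,]+$
^[^,]+,[^,]+,[^,]+,foo123$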

How to combine thousands of .csv files into one master file?

I have a folder of a little over 10,000 .csv files that I want to combine into one master file. They are all categorized the same way (Column A B C D E F are the same thing in each file). I'd prefer to do it in a shell script.
I tried
cat *.csv > Everything.csv
and it returns an "Argument list too long" error.
I also tried
copy *.csv > Everything.csv
and it returns the same error.
How do I get it to combine about 10,000 files into one Master file?
This question discusses the error you're seeing: Argument list too long error for rm, cp, mv commands
One possible solution would be something like:
find . -name "*.csv" ! -name Everything.csv -exec cat '{}' >> ./Everything.csv ';'
(The extra ! -name test keeps the output file itself from being matched and endlessly re-read.)
I've done it using:
cat *.csv >> /tmp/master_file_name.csv
A single '>' would re-write the file instead of appending to it.
You can use a simple for loop:
for file in *.csv; do
    cat "$file" >> master_file
done
find . -name "*.csv" ! -name Everything.csv | xargs -I{} cat '{}' > Everything.csv
(edit) Ah well, beaten to the punch...

Finding all kinds of extensions referenced in a html file

Here is my problem statement :
There is a folder with many html and text files. I need to recursively go through each one of them and find all kinds of file extensions referenced in these html/text files like .jpg, .tif, .png etc
The problem is I don't have a defined list of the extensions I want to search for.
What would be the best way to achieve this using a shell script ?
Should I come up with a regex that searches for all occurrences of a dot followed by 3 or 4 letters, and then filter out the ones followed by a space, a comma, a quote, etc.?
Any suggestions would be helpful.
You could write a shell script to parse the file names with a regex, but the straightforward version is pretty simple:
$ cat *.{txt,html} | grep -oP '\b[A-Za-z0-9_]+\.[A-Za-z0-9]{1,4}\b' | awk -F. '{ print "." $(NF) }' | sort -u
For recursive search:
find . \( -name '*.txt' -o -name '*.html' \) -exec grep -oP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} \; | awk -F. '{ print "." $(NF) }' | sort -u
(The parentheses are needed so that the -exec applies to both name patterns.)
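If you also want to see how often each referenced extension occurs, a small variant of the same pipeline (a sketch, assuming GNU grep and coreutils) replaces the final sort -u with a count:
find . \( -name '*.txt' -o -name '*.html' \) -exec grep -oP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} \; | awk -F. '{ print "." $(NF) }' | sort | uniq -c | sort -rn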

Can aspell output line number and not offset in pipe mode?

Can aspell output the line number instead of the offset in pipe mode for html and xml files? I can't read the file line by line, because then aspell can't identify a closing tag (when the tag sits on the next line).
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt would be the raw aspell pipe-mode (ispell style) output, if you capture it
results.txt is where you might redirect the final output (example below)
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a Windows machine :(). But maybe you could pipe the html file through head (with a byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
I use the following script to perform spell-checking and to work around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell, by simply ignoring everything that aspell reports which is not a standalone word.
#!/bin/bash
set +o pipefail
if [ -t 1 ] ; then
    color="--color=always"
fi
! for file in "$@" ; do
    <"$file" aspell pipe list -p ./dict --mode=html |
    grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
    grep '[[:alpha:]]\+' -o |
    while read word ; do
        grep $color -n "\<$word\>" "$file"
    done
done | grep .
You even get colored output if the stdout of the script is a terminal, and the exit status is 1 if the script found spelling mistakes, otherwise 0.
Also, the script protects itself against pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but doesn't work for this script. Last but not least, the script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like the German äöüÄÖÜß and others. [a-zA-Z] can match those too (depending on the locale), but that tends to come as a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
    for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
        grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
    done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.
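Assuming the snippet above is saved as a script, say aspell-lines.sh (a hypothetical name), and made executable, it could be invoked over a set of HTML files like this, printing line:word matches sorted by line number for each file:
./aspell-lines.sh page1.html page2.html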