I have a csv file with approximately 300 columns.
I'm using awk to create a subset of this file where the 24th column is "CA".
Here's what I am trying:
awk -F "," '{if($24~/CA/)print}' myfile.csv > subset.csv
After approximately 10 minutes, the subset file had grown to 400 MB, and I killed the job because this is too slow.
How can I speed this up? Perhaps a combination of sed / awk?
tl;dr:
awk implementations can significantly differ in performance.
In this particular case, see if using gawk (GNU awk) helps.
Ubuntu comes with mawk as the default awk, and mawk is usually considered faster than gawk. In the case at hand, however, gawk turned out to be significantly faster (perhaps related to line length?), at least based on the following simplified tests, which I ran
in a VM on Ubuntu 14.04 on a 1-GB file with 300 columns of length 2.
The tests also include an equivalent sed and grep command.
Hopefully they provide at least a sense of comparative performance.
Test script:
#!/bin/bash
# Pass in test file
f=$1
# Suppress stdout
exec 1>/dev/null
awkProg='$24=="CA"'
echo $'\n\n\t'" $(mawk -W version 2>&1 | head -1)" >&2
time mawk -F, "$awkProg" "$f"
echo $'\n\n\t'" $(gawk --version 2>&1 | head -1)" >&2
time gawk -F, "$awkProg" "$f"
sedProg='/^([^,]+,){23}CA,/p'
echo $'\n\n\t'" $(sed --version 2>&1 | head -1)" >&2
time sed -En "$sedProg" "$f"
grepProg='^([^,]+,){23}CA,'
echo $'\n\n\t'" $(grep --version 2>&1 | head -1)" >&2
time grep -E "$grepProg" "$f"
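To run it, save the script (say as timings.sh; the name is illustrative) and pass the test file as the only argument:
$ bash timings.sh test-1gb.csv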
Results:
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
real 0m11.341s
user 0m4.780s
sys 0m6.464s
GNU Awk 4.0.1
real 0m3.560s
user 0m0.788s
sys 0m2.716s
sed (GNU sed) 4.2.2
real 0m9.579s
user 0m4.016s
sys 0m5.504s
grep (GNU grep) 2.16
real 0m50.009s
user 0m42.040s
sys 0m7.896s
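One likely contributor to grep's poor showing is locale-aware multibyte matching; forcing the C locale often speeds GNU grep up dramatically. A variant worth adding to the test script (not timed here, so any speedup on this data is an assumption):
time LC_ALL=C grep -E "$grepProg" "$f"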
Related
I have a string assigned to a variable:
#!/bin/bash
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
I need to extract only l0ng_Str1ng.of.d1fF3erent_charAct3rs without quotes and assign that to another variable.
I understand I can use awk, sed, or cut but I am having trouble getting around the special characters in the original string.
Thanks in advance!
EDIT: I was not fully awake; I should have specified that this is JSON. Thanks for the replies so far.
EDIT2: I am using BSD (macOS)
It looks like you have a JSON string there. Keep in mind that the key/value pairs in a JSON object are unordered, so most sed, awk, and cut solutions will fail if your string arrives next time with the pairs in a different order.
It is most robust to use a JSON parser.
You could use ruby with its JSON parser library:
$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
Or, if you don't want the quoted string (which is useful for Bash):
$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Or with jq:
$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
All these solutions will work even if the JSON string is in a different order:
$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a lookbehind in GNU grep (the quote is moved into the lookbehind so that only the value itself is printed):
$ echo "$fullToken" | grep -oP '(?<="token":")[^"]*'
Or in Perl:
$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'
Both of those also work if the string is in a different order.
Or, with POSIX awk:
$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'
Or, with POSIX sed, you can do:
$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'
Those solutions are presented from the most robust (use a JSON parser) to the most fragile (sed). The sed solution shown here is better than simpler position-based ones, though, because it matches the "token" key explicitly and therefore still works when the key/value pairs in the JSON string are in a different order.
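For example, the sed command still extracts the value when the pairs are swapped:
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs","type":"APP"}' | sed -E 's/.*"token":"([^"]*).*/\1/'
l0ng_Str1ng.of.d1fF3erent_charAct3rs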
PS: if you just want to remove the quotes from a line, that is a great job for sed:
$ echo '"quoted string"'
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string
In awk:
$ awk -v f="$fullToken" '
BEGIN{
while(match(f,/[^:{},]+:[^:{},]+/)) { # search key:value pairs
p=substr(f,RSTART,RLENGTH) # set pair to p
f=substr(f,RSTART+RLENGTH) # remove p from f
split(p,a,":") # split to get key and value
for(i in a) # remove leading and trailing "
gsub(/^"|"$/,"",a[i])
if(a[1]=="token") { # if key is token
print a[2] # output value
exit # no need to process further
}
}
}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Caveat: this works only if the value itself contains none of the characters : { } or ,.
GNU sed:
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
echo "$fullToken"|sed -r 's/.*"(.*)".*/\1/'
A grep method would be:
$ grep -oP '[^"]+(?="[^"]+$)' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Brief explanation:
[^"]+ : match a run of non-" characters
(?="[^"]+$) : up to the " that is followed by the last run of non-" characters on the line
You may also use a sed method to do that:
$ sed -E 's/.*"([^"]+)"[^"]+$/\1/' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
If the source of your string is JSON, then you should use JSON-specific tools. If not, then consider:
Using awk
$ fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
$ echo "$fullToken" | awk -F'"' '{print $8}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using cut
$ echo "$fullToken" | cut -d'"' -f8
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using sed
$ echo "$fullToken" | sed -E 's/.*"([^"]*)"[^"]*$/\1/'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using bash and one of the above
The above all work with POSIX shells. If the shell is bash, then we can use a here-string and eliminate the pipeline. Taking cut as the example:
$ cut -d'"' -f8 <<<"$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
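And since the original goal was to assign the value to another variable, any of these can be wrapped in a command substitution (the variable name token is just an example):
$ token=$(cut -d'"' -f8 <<<"$fullToken")
$ echo "$token"
l0ng_Str1ng.of.d1fF3erent_charAct3rs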
Given a plain-text document with several lines like:
c48 7.587 7.39
c49 7.508 7.345983
c50 5.8 7.543
c51 8.37454546 7.34
I need to add some info 2 spaces after the end of the line, so for each line I would get:
c48 7.587 7.39  def
c49 7.508 7.345983  def
c50 5.8 7.543  def
c51 8.37454546 7.34  def
I need to do this for thousands of files. I guess this is possible to do with sed, but I don't know how. Any hint? Could you also give me a link to a tutorial or reference table for cases like this?
Thanks
If all your files are in one directory:
sed -i.bak 's/$/  def/' *.txt
To do it recursively (GNU find):
find /path -type f -iname '*.txt' -exec sed -i.bak 's/$/  def/' {} +
See the sed documentation for an introduction.
Other ways you can use,
awk
for file in *
do
awk '{print $0"  def"}' "$file" > temp
mv temp "$file"
done
Bash shell
for file in *
do
while IFS= read -r line
do
echo "$line  def"
done < "$file" > temp
mv temp "$file"
done
for file in ${thousands_of_files} ; do
sed -i ".bak" -e "s/$/ def/" file
done
The key here is the search-and-replace s/// command: we replace the end of the line ($) with 2 spaces and your string.
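For example, on a single line:
$ echo 'c48 7.587 7.39' | sed 's/$/  def/'
c48 7.587 7.39  def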
Find the sed documentation at http://sed.sourceforge.net/#docs
I need some help with the batch-file equivalent of grep -v Wildcard and grep -o.
This is my code in shell.
result=`mysqlshow --user=$dbUser --password=$dbPass sample | grep -v Wildcard | grep -o sample`
The batch equivalent of grep (not including third-party tools like GnuWin32 grep) is findstr.
grep -v finds lines that don't match the pattern. The findstr version of this is findstr /V
grep -o shows only the part of the line that matches the pattern. Unfortunately, findstr has no equivalent of this, but you can run the command and then check its exit status along the lines of
if %errorlevel% equ 0 echo sample
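Putting it together, a rough batch sketch of the original pipeline might look like this (the variable names mirror the shell script, and what you do with the result is an assumption):
rem Emulate: mysqlshow ... | grep -v Wildcard | grep -o sample
mysqlshow --user=%dbUser% --password=%dbPass% sample | findstr /V "Wildcard" | findstr "sample" >nul
if %errorlevel% equ 0 (set result=sample)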
I have a problem here. I have to print a column in a text file using awk. However, the columns are not separated by spaces at all, only by a single comma. It looks something like this:
column1,column2,column3,column4,column5,column6
How would I print out 3rd column using awk?
Try:
awk -F',' '{print $3}' myfile.txt
Here the -F option tells awk to use , as the field separator.
If your only requirement is to print the third field of every line, with each field delimited by a comma, you can use cut:
cut -d, -f3 file
-d, sets the delimiter to a comma
-f3 specifies that only the third field is to be printed
Try this awk
awk -F, '{$0=$3}1' file
column3
-F, : split fields on ,
$0=$3 : set the line to only field 3
1 : print the line
This could also be used:
awk -F, '{print $3}' file
A simple, although awk-less solution in bash:
while IFS=, read -r a a a b; do echo "$a"; done <inputfile
It works faster than awk for small files (<100 lines) since it uses fewer resources (it avoids forking and exec-ing an external process). The third field ends up in a because each successive read into a overwrites the previous value, while b soaks up the rest of the line.
EDIT from Ed Morton (sorry for hijacking the answer, I don't know if there's a better way to address this):
To put to rest the myth that shell will run faster than awk for small files:
$ wc -l file
99 file
$ time while IFS=, read -r a a a b; do echo "$a"; done <file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
$ time awk -F, '{print $3}' file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
I expect that if you get a REALLY small enough file then you will see the shell script run a fraction of a blink of an eye faster than the awk script, but who cares?
And if you don't believe that it's harder to write robust shell scripts than awk scripts, look at this bug in the shell script you posted:
$ cat file
a,b,-e,d
$ cut -d, -f3 file
-e
$ awk -F, '{print $3}' file
-e
$ while IFS=, read -r a a a b; do echo "$a"; done <file
$
I've written a bash script, initiated on cron, that backups all databases on a particular machine nightly and weekly. The script correctly removes old databases, except for those cases when there's been a change in month.
As an example, let's say it's November 2nd. The script runs at 11:00 pm and correctly removes the backup made on November 1st. But come December 1st, the script gets confused and does not correctly remove the backup made on November 30th.
How can I fix this script to correctly remove the old backups in this case?
DATABASES=$(echo 'show databases;' | mysql -u backup --password='(password)' | grep -v ^Database$)
LIST=$(echo $DATABASES | sed -e "s/\s/\n/g")
DATE=$(date +%Y%m%d)
DAYOLD=$(($DATE-1))
SUNDAY=$(date +%a)
WEEKOLD=$(($DATE-7))
for i in $LIST; do
if [[ $i != "mysql" ]]; then
mysqldump --single-transaction $i > /mnt/backups/mariadb/daily/$i.$DATE.sql
if [ -f /mnt/backups/mariadb/daily/$i.$DAYOLD.sql ]; then
rm -f /mnt/backups/mariadb/daily/$i.$DAYOLD.sql
fi
if [[ $SUNDAY == "Sun" ]]; then
cp /mnt/backups/mariadb/daily/$i.$DATE.sql /mnt/backups/mariadb/weekly/$i.$DATE.sql
rm -f /mnt/backups/mariadb/weekly/$i.$WEEKOLD.sql
fi
fi
done
If you know how many backups were made in a given range of time (say you know that from Nov 2nd until Dec 2nd exactly 30 backups were made) and you now want to erase those, just work with the number of backups; it's super simple to do, and you don't have to deal with dates, which is pretty complex in bash:
$ (ls -t|head -n 30;ls)|grep -v ^Database|sort|uniq -u|xargs rm -rf
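The trick here: the subshell prints the 30 newest names twice (once via head, once via the plain ls), so after sort, uniq -u keeps only the names that appear exactly once, i.e. everything except the 30 newest. A minimal illustration of the pattern:
$ (printf 'new1\nnew2\n'; printf 'old\nnew1\nnew2\n') | sort | uniq -u
old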
You can then easily automate this by removing the oldest backup each day, so you always keep a fixed number of backups:
#! /bin/bash
# Create new full backup
BACKUP_DIR="/path-to-backups/"
BACKUP_DAYS=1
# Prepare backup
cd ${BACKUP_DIR}
latest=`ls -rt | grep 201 | head -1`
# Change latest reference
ln -sf ${BACKUP_DIR}${latest} latest
# Cleanup older than one week (n days)
to_remove=`(ls -t | grep 201 | head -n 3;ls)|sort|uniq -u`
echo "Cleaning up... $to_remove"
(ls -t|head -n ${BACKUP_DAYS};ls)|sort|uniq -u|xargs rm -rf
echo "Backup Finished"
exit 0
Then you can hook it into a daily cron job. This blog entry explains the approach in a very straightforward fashion (but with hot backups via xtrabackup, no mysqldump): http://codeispoetry.me/index.php/mariadb-daily-hot-backups-with-xtrabackup/
I was making this too complicated. Subtracting 1 from a YYYYMMDD number breaks whenever the month rolls over (e.g. 20141201 - 1 = 20141200, which never matches a file). Instead of doing date arithmetic at all, I'm just matching backups by file age with:
find /mnt/backups/mariadb/weekly/* -type f -mtime +8 -exec rm -f {} \;
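With GNU find, the -delete action is a terser equivalent of the -exec rm -f form:
find /mnt/backups/mariadb/weekly/* -type f -mtime +8 -delete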
So the entire script becomes:
DATABASES=$(echo 'show databases;' | mysql -u backup --password='foo' | grep -v ^Database$)
LIST=$(echo $DATABASES | sed -e "s/\s/\n/g")
DATE=$(date +%Y%m%d)
SUNDAY=$(date +%a)
for i in $LIST; do
if [[ $i != "mysql" ]]; then
/bin/nice mysqldump --single-transaction $i > /mnt/backups/mariadb/daily/$i.$DATE.sql
find /mnt/backups/mariadb/daily/* -type f -mtime +1 -exec rm -f {} \;
if [[ $SUNDAY == "Sun" ]]; then
cp /mnt/backups/mariadb/daily/$i.$DATE.sql /mnt/backups/mariadb/weekly/$i.$DATE.sql
find /mnt/backups/mariadb/weekly/* -type f -mtime +8 -exec rm -f {} \;
fi
fi
chown -R backup.backup /mnt/backups
done