Running math on a few columns in awk - csv

This seems like a simple problem but I find everything command line confusing.
I have a CSV with 5 columns. I want to multiply everything in columns 2-5 by a variable defined earlier in my bash script.
Obviously the below doesn't work, but it shows what I'm trying to achieve:
awk -F , -v OFS=, 'seq($2 $5)*='$MULTIPLIER in.csv > out.csv

Generally speaking:
awk -F, -v OFS=, -v m="${MULTIPLIER}" '{for (i=2;i<=5;i++) $i*=m}1' in.csv > out.csv
Assuming there's a header record, and a variation on setting FS/OFS:
awk -v m="${MULTIPLIER}" 'BEGIN {FS=OFS=","} NR>1 {for (i=2;i<=5;i++) $i*=m}1' in.csv > out.csv
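As a quick check of the header-aware variant, here is a minimal sketch run against made-up data. The sample rows, the value of MULTIPLIER, and the temporary directory are assumptions for illustration only, not part of the original question.

```shell
# Hypothetical sample data; MULTIPLIER and the file contents are illustrative.
MULTIPLIER=2
workdir=$(mktemp -d)
printf '%s\n' 'id,a,b,c,d' 'x,1,2,3,4' 'y,0.5,10,20,30' > "$workdir/in.csv"

# Leave the header alone (NR>1), scale fields 2-5, then print every record.
awk -v m="$MULTIPLIER" 'BEGIN {FS=OFS=","} NR>1 {for (i=2;i<=5;i++) $i*=m} 1' \
    "$workdir/in.csv" > "$workdir/out.csv"

cat "$workdir/out.csv"
```

With this input the header line passes through unchanged and the two data rows become x,2,4,6,8 and y,1,20,40,60. Assigning to a field makes awk rebuild the record using OFS, which is why OFS has to be set to a comma as well.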

Related

Merge csv files with same header: passing multiple files to awk with xargs

I need to merge together all csv files of a directory that have the same header (1st line). Let's say we have:
file a.txt:
head1,head2,head3
1,2,"abc"
8,42,"def"
file b.txt:
head4,head2
"aa",2
file c.txt:
head1,head2,head3
12,2,"z"
15,2,"z"
If I want all files with header "head1,head2,head3", then it should merge files a and c and produce:
awk 'FNR==1 && NR!=1{next;}{print}' a.txt c.txt
head1,head2,head3
1,2,"abc"
8,42,"def"
12,2,"z"
15,2,"z"
Now I can, for a given header, detect the files to merge automatically, but I can't pass the resulting list to awk. I am using the following command:
head -n1 -v * | grep -B1 "head1,head2,head3" | awk "/==>/{ print \$2 }" | xargs -l -0 awk 'FNR==1 && NR!=1{next;}{print}'
awk: fatal: cannot open file `a.txt
c.txt
' for reading (No such file or directory)
where head lists the file names and first lines, grep keeps only the headers that are matching (and associated filenames on the preceding lines with -B1), and the first call to awk keeps only the file names, one per line.
I tried as well (adding tr '\n' ' '):
head -n1 -v * | grep -B1 "head1,head2,head3" | awk "/==>/{ print \$2 }" | tr '\n' ' ' | xargs -l -0 awk 'FNR==1 && NR!=1{next;}{print}'
awk: fatal: cannot open file `a.txt c.txt ' for reading (No such file or directory)
I eventually tried the following (using tr '\n' '\0' instead):
head -n1 -v * | grep -B1 "head1,head2,head3" | awk "/==>/{ print \$2 }" | tr '\n' '\0' | xargs -l -0 awk 'FNR==1 && NR!=1{next;}{print}'
head1,head2,head3
1,2,"abc"
8,42,"def"
head1,head2,head3
12,2,"z"
15,2,"z"
(although I'm not sure I understand exactly how \0 is interpreted). At least this command works, but it looks like each file is processed separately by awk, as the header is printed twice.
What am I missing?
Does this help?
$ awk -v h='head1,head2,head3' 'BEGIN{print h} FNR==1{f=$0==h?1:0; next} f' *.txt
head1,head2,head3
1,2,"abc"
8,42,"def"
12,2,"z"
15,2,"z"
-v h='head1,head2,head3' save the header line to check in a variable h
BEGIN{print h} print the header (assumes there'll be at least one file that matches)
FNR==1{f=$0==h?1:0; next} set/clear flag based on first line of file matching contents of h
f print if flag is set
*.txt list of files to merge
With GNU awk, you can skip unnecessarily reading files that don't match by using FNR==1{if($0!=h) nextfile; next} 1
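The BEGIN block above prints the header even when no file matches. One possible variation, sketched here with made-up copies of the question's files in a temporary directory (the directory and the counter name seen are assumptions), prints the header only when the first matching file is found:

```shell
# Hypothetical files mirroring the question's a.txt, b.txt and c.txt.
workdir=$(mktemp -d)
printf '%s\n' 'head1,head2,head3' '1,2,"abc"' '8,42,"def"' > "$workdir/a.txt"
printf '%s\n' 'head4,head2' '"aa",2' > "$workdir/b.txt"
printf '%s\n' 'head1,head2,head3' '12,2,"z"' '15,2,"z"' > "$workdir/c.txt"

# On the first line of each file, set flag f; print the header only once.
merged=$(awk -v h='head1,head2,head3' '
    FNR==1 { f = ($0 == h); if (f && !seen++) print; next }
    f
' "$workdir"/*.txt)

printf '%s\n' "$merged"
```

This should print the header once followed by the four data rows from a.txt and c.txt, and nothing at all if no file carries the requested header.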

Using awk to extract a token from a larger JSON string

I have a string assigned to a variable:
#!/bin/bash
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
I need to extract only l0ng_Str1ng.of.d1fF3erent_charAct3rs without quotes and assign that to another variable.
I understand I can use awk, sed, or cut but I am having trouble getting around the special characters in the original string.
Thanks in advance!
EDIT: I was not awake I should specify this is JSON. Thanks for the replies so far.
EDIT2: I am using BSD (macOS)
It looks like you have a JSON string there. Keep in mind that the keys of a JSON object are unordered, so most sed, awk, and cut solutions will fail if your string next arrives with its keys in a different order.
It is most robust to use a JSON parser.
You could use ruby with its JSON parser library:
$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
Or, if you don't want the quoted string (which is useful for Bash):
$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Or with jq (add the -r option if you want the raw string without the surrounding quotes):
$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
All these solutions will work even if the JSON string is in a different order:
$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a lookbehind in GNU grep:
$ echo "$fullToken" | grep -oP '(?<="token":)"([^"]*)'
Or in Perl:
$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'
Both of those also work if the string is in a different order.
Or, with POSIX awk:
$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'
Or, with POSIX sed, you can do:
$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'
These solutions are presented from most robust (use a JSON parser) to most fragile (sed). That said, the sed solution here is better than the other sed approaches because it still works when the key/value pairs in the JSON string appear in a different order.
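Since the goal was to assign the token to another variable, here is a minimal sketch that wraps the sed command above in command substitution. The helper name extract_token and the reordered test string are assumptions for illustration; sed -E is accepted by both GNU and BSD sed.

```shell
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
# Hypothetical variant with the keys in the opposite order.
reordered='{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs","type":"APP"}'

# extract_token is a hypothetical helper: it prints the value of the "token" key.
extract_token() {
    printf '%s\n' "$1" | sed -E 's/.*"token":"([^"]*).*/\1/'
}

token=$(extract_token "$fullToken")
token2=$(extract_token "$reordered")
printf '%s\n' "$token" "$token2"
```

Both variables should end up holding l0ng_Str1ng.of.d1fF3erent_charAct3rs without quotes. This is still pattern matching rather than parsing, so it would break on escaped quotes inside the value or on whitespace around the colon.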
Ps: If you want to remove the quotes from a line, that is a great job for sed:
$ echo '"quoted string"'
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string
In awk:
$ awk -v f="$fullToken" '
BEGIN {
    while (match(f, /[^:{},]+:[^:{},]+/)) {  # search for key:value pairs
        p = substr(f, RSTART, RLENGTH)       # set pair to p
        f = substr(f, RSTART+RLENGTH)        # remove p from f
        split(p, a, ":")                     # split to get key and value
        for (i in a)                         # remove leading and trailing "
            gsub(/^"|"$/, "", a[i])
        if (a[1] == "token") {               # if key is token
            print a[2]                       # output value
            exit                             # no need to process further
        }
    }
}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Note that the token value can't contain any of the characters : { } , for this to work.
GNU sed:
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
echo "$fullToken"|sed -r 's/.*"(.*)".*/\1/'
grep method would be,
$ grep -oP '[^"]+(?="[^"]+$)' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Brief explanation:
[^"]+ : matches a run of characters that are not "
(?="[^"]+$) : lookahead requiring that the match be followed by the last " on the line, with only non-" characters after it
You may also use sed method to do that,
$ sed -E 's/.*"([^"]+)"[^"]+$/\1/' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
If the source of your string is JSON, then you should use JSON-specific tools. If not, then consider:
Using awk
$ fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
$ echo "$fullToken" | awk -F'"' '{print $8}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using cut
$ echo "$fullToken" | cut -d'"' -f8
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using sed
$ echo "$fullToken" | sed -E 's/.*"([^"]*)"[^"]*$/\1/'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using bash and one of the above
The above all work with POSIX shells. If the shell is bash, then we can use a here-string and eliminate the pipeline. Taking cut as the example:
$ cut -d'"' -f8 <<<"$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs

Outputting data from 5gb file with awk

I have a csv file with approximately 300 columns.
I'm using awk to create a subset of this file where the 24th column is "CA".
Example of data:
Here's what I am trying:
awk -F "," '{if($24~/CA/)print}' myfile.csv > subset.csv
After approximately 10 minutes the subset file grew to 400 mb, and then I killed it because this is too slow.
How can I speed this up? Perhaps a combination of sed / awk?
tl;dr:
awk implementations can significantly differ in performance.
In this particular case, see if using gawk (GNU awk) helps.
Ubuntu comes with mawk as the default awk, which is usually considered faster than gawk. However, in the case at hand it seems that gawk is significantly faster (related to line length?), at least based on the following simplified tests, which I ran in a VM on Ubuntu 14.04 on a 1-GB file with 300 columns of length 2.
The tests also include an equivalent sed and grep command.
Hopefully they provide at least a sense of comparative performance.
Test script:
#!/bin/bash
# Pass in test file
f=$1
# Suppress stdout
exec 1>/dev/null
awkProg='$24=="CA"'
echo $'\n\n\t'" $(mawk -W version 2>&1 | head -1)" >&2
time mawk -F, "$awkProg" "$f"
echo $'\n\n\t'" $(gawk --version 2>&1 | head -1)" >&2
time gawk -F, "$awkProg" "$f"
sedProg='/^([^,]+,){23}CA,/p'
echo $'\n\n\t'" $(sed --version 2>&1 | head -1)" >&2
time sed -En "$sedProg" "$f"
grepProg='^([^,]+,){23}CA,'
echo $'\n\n\t'" $(grep --version 2>&1 | head -1)" >&2
time grep -E "$grepProg" "$f"
Results:
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
real 0m11.341s
user 0m4.780s
sys 0m6.464s
GNU Awk 4.0.1
real 0m3.560s
user 0m0.788s
sys 0m2.716s
sed (GNU sed) 4.2.2
real 0m9.579s
user 0m4.016s
sys 0m5.504s
grep (GNU grep) 2.16
real 0m50.009s
user 0m42.040s
sys 0m7.896s
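Note that the test script uses the exact comparison $24=="CA", while the question used the regex match $24~/CA/. The two are not equivalent, as this small sketch suggests. The three-column sample is made up, with field 2 standing in for column 24.

```shell
# Hypothetical sample; field 2 plays the role of column 24.
sample=$(printf '%s\n' 'a,CA,1' 'b,CAN,2' 'c,NY,3' 'd,"CA",4')

# Regex match: selects any row whose field contains the substring CA.
regex_hits=$(printf '%s\n' "$sample" | awk -F, '$2 ~ /CA/')

# String equality: selects only rows whose field is exactly CA.
exact_hits=$(printf '%s\n' "$sample" | awk -F, '$2 == "CA"')

printf 'regex:\n%s\nexact:\n%s\n' "$regex_hits" "$exact_hits"
```

The regex form also keeps the CAN row and the row whose field is quoted, while the equality form keeps only the a,CA,1 row. If the real file quotes its fields, the comparison string would need to include the quotes.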

BATCH: grep equivalent

I need some help with the equivalent code for grep -v Wildcard and grep -o in a batch file.
This is my code in shell.
result=`mysqlshow --user=$dbUser --password=$dbPass sample | grep -v Wildcard | grep -o sample`
The batch equivalent of grep (not including third party tools like GnuWin32 grep), will be findstr.
grep -v finds lines that don't match the pattern. The findstr version of this is findstr /V
grep -o shows only the part of the line that matches the pattern. Unfortunately, there's no equivalent of this, but you can run the command and then have a check along the lines of
if %errorlevel% equ 0 echo sample

Printing column separated by comma using Awk command line

I have a problem here. I have to print a column in a text file using awk. However, the columns are not separated by spaces at all, only using a single comma. Looks something like this:
column1,column2,column3,column4,column5,column6
How would I print out 3rd column using awk?
Try:
awk -F',' '{print $3}' myfile.txt
Here -F tells awk to use , as the field separator.
If your only requirement is to print the third field of every line, with each field delimited by a comma, you can use cut:
cut -d, -f3 file
-d, sets the delimiter to a comma
-f3 specifies that only the third field is to be printed
Try this awk
awk -F, '{$0=$3}1' file
column3
-F, Divide fields by ,
$0=$3 Set the line to only field 3
1 Print the line (a true condition with no action runs awk's default action, which is print)
This could also be used:
awk -F, '{print $3}' file
A simple, although awk-less solution in bash:
while IFS=, read -r a a a b; do echo "$a"; done <inputfile
It works faster for small files (<100 lines) than awk as it uses fewer resources (it avoids calling the expensive fork and execve system calls).
EDIT from Ed Morton (sorry for hi-jacking the answer, I don't know if there's a better way to address this):
To put to rest the myth that shell will run faster than awk for small files:
$ wc -l file
99 file
$ time while IFS=, read -r a a a b; do echo "$a"; done <file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
$ time awk -F, '{print $3}' file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
I expect that if you get a REALLY small file then you will see the shell script run a fraction of a blink of an eye faster than the awk script, but who cares?
And if you don't believe that it's harder to write robust shell scripts than awk scripts, look at this bug in the shell script you posted:
$ cat file
a,b,-e,d
$ cut -d, -f3 file
-e
$ awk -F, '{print $3}' file
-e
$ while IFS=, read -r a a a b; do echo "$a"; done <file
$
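The empty output above comes from bash's echo builtin treating -e as an option rather than as data. If a shell loop is wanted anyway, a minimal sketch of a safer version uses printf with an explicit format. The variable names f1, f2, f3 and rest, and the second sample row, are assumptions for illustration.

```shell
# Hypothetical input, including the problematic -e field from the example above.
workdir=$(mktemp -d)
printf '%s\n' 'a,b,-e,d' 'w,x,y,z' > "$workdir/file"

# printf '%s\n' prints its argument literally, so -e is not taken as an option.
third=$(while IFS=, read -r f1 f2 f3 rest; do
    printf '%s\n' "$f3"
done < "$workdir/file")

printf '%s\n' "$third"
```

This should print -e followed by y. The loop still shares the other usual limitations of line-by-line shell reading, so the awk or cut one-liners remain the simpler choice.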