AWK to filter CSV files - csv

I want to filter out the lines that has "synonymous" in the 3rd column. The command is like below
awk '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
But the fileCSV2.csv still contains the word "synonymous" in the 3rd column. I wonder what might be wrong?
Two lines from the fileCSV.csv:
"exonic","LINC00115","synonymous SNV","uc010nxx.2:c.C299T:p.P100L",,"0.99",,0.56,rs3115849,,,,,,,,,,,,,chr1,762273,762273,G,A,"chr1","762273",".","G","A","30483.62","PASS","AC=24;AF=1.00;AN=24;DP=2972;FS=0.000;MLEAC=8;MLEAF=1.00;MQ0=0;VQSLOD=19.50;culprit=FS;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,2:2:6:66,6,0","1/1:0,297:297:99:10476,951,0","1/1:0,304:304:99:10098,950,0","1/1:0,295:295:99:9869,929,0","1/1:0,292:292:99:8655,895,0","1/1:0,304:304:99:10006,965,0","1/1:0,179:179:99:5862,568,0","1/1:0,273:273:99:9328,851,0","1/1:0,279:279:99:7946,850,0","1/1:0,283:283:99:9214,866,0","1/1:0,8:8:21:229,21,0","1/1:0,456:456:99:16385,1285,0"
"exonic","SAMD11","synonymous SNV","uc001abw.1:c.T1027C:p.W343R","559;Name=lod=249",,,1.00,rs6672356,1,0.916445,N,0.0,T,0.0,B,0.998605,N,4.19E-4,N,3.17,chr1,877831,877831,T,C,"chr1","877831",".","T","C","3594.56","PASS","AC=24;AF=1.00;AN=24;DP=387;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=60.00;MQ0=0;VQSLOD=15.00;culprit=DP;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,3:3:9:97,9,0","1/1:0,3:3:12:113,12,0","1/1:0,64:64:99:1805,189,0","1/1:0,57:57:99:1605,168,0","1/1:0,30:30:90:768,90,0","1/1:0,69:69:99:2026,216,0","1/1:0,15:15:45:428,45,0","1/1:0,23:23:81:809,81,0","1/1:0,22:22:69:562,69,0","1/1:0,40:40:99:1142,117,0","1/1:0,3:3:9:94,9,0","1/1:0,58:58:99:14,7,0"

If your fileCSV.csv has columns separated by , than you need to
awk -F, '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
If -F does not work with your version of awk try
awk 'BEGIN{FS=","} $3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
EDIT: you also need to take " into account, so use /^"synonymous/

To process csv file using awk I would prefer the following method to automatically account for quotation marks, namely preprocess with sed.
For your concrete question I would use
sed -e 's/^"//;s/"$//' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
If you also want to correctly process files with string fields containing quotation marks (which will be represented by double quotation marks in csv files), you need to change the sed expression the following way,
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
This method has the advantage that it allows you to correctly print or process some fields using awk. For example if you want to print the first and fifth field from the filtered lines, seperated by a : you can now use
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/ { print $1,":",$5}'
(If the difference between the methods is not clear to you, you can try the last awk command without the sed preprocessing)

Related

prefix every header column with string using awk

I have a bunch of big csv I want to prefix every header column with fixed string. There is more than 500 columns in every file.
suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get (note fist column is missing prefix and separator is removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could use either of the above to change all of your CSV files at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub would brutally replace any nonalphabetic character anywhere in the input with the prefix - including your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; this simply means "this condition is true; perform the default action for every line (i.e. print it all)" though this is just a stylistic change.
awk -F';' 'NR==1{
sub(/^/, "input_file."); gsub(/;/, ";input_file."); }
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
I would use GNU AWK following way, let file.txt content be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by prefix. Then in first line I add prefix to first column, which trigger string rebuilding. No modification is done in any other line, thus they are printed as is.
(tested in GNU Awk 5.0.1)
Also with awk using for loop and printf:
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS)}' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount

How to extract JSON string from a longer mixed string in a shell script

Given the following string:
arn:aws:secretsmanager:us-east-1:3264873466873:secret:foo/bar 1564681234.974 foo/bar {"username":"admin","password":"admin123","secret_key":"KASJDFJHAKHFKAHASDF"} 4e397333-3797-4f0b-ad7e-8c1cc0ed041c VERSIONSTAGES AWSCURRENT
Within a shell script, how do you extract just the JSON portion to end up like this:
{"username":"admin","password":"admin123","secret_key":"KASJDFJHAKHFKAHASDF"}
I was able to do it using two sed commands:
echo $longString | sed 's/^.*{/{/' | sed 's/}.*$/}/'
but was wondering if there is a way to do it using only one command.
To extract continuous part of the input, you can use grep with its -o option (if supported on your system). It tells grep to only output the matching part.
grep -o '{.*}'
For extracting columns, use awk:
echo $longString | awk '{print $4}'
Or cut:
echo $longString | cut -f 4 -d ' '
Beware if you have spaces in your JSON data. You might be better off using jq to process the results of aws secretsmanager list-secrets and similar.
You can use
echo $longString | sed -n 's|.*\({.*}\).*|\1|p'
to match and print the desired pattern
you can just join the sed commands to a single command
sed 's/^.*{/{/;s/}.*$/}/'
This awk should do. I will handle if there are any space in the string.
echo $string | awk -F"[{}]" '{print $2}'
"username":"admin","password":"admin123","secret_key":"KASJDFJHAKHFKAHASDF"

Find Values in CSV that only Appear Once

I have a csv file with thousands of lines in it. I'd like to be able to find values that only appear once in this file.
For instance
dog
dog
cat
dog
bird
I'd like to get as my result:
cat
bird
I tried using the following awk command but it returned one of each value in the file:
awk -F"," '{print $1}' test.csv|sort|uniq
Returns:
dog
cat
bird
Thank you for your help!
Just with awk:
awk -F, '{count[$1]++} END {for (key in count) if (count[key] == 1) print key}' test.csv
Close. Try:
awk -F"," '{print $1}' test.csv |sort | uniq -c | awk '{if ($1 == 1) print $2}'
the -c flag on uniq will give you counts. Next awk will look for any items with the count of 1 (first field) and print the value of the second field ($2)
Only caveat is that this will return bird before cat due to it being previously sroted. you could pipe once more to sort -r to reverse the sort direction. This would be identical to the expected answer you asked for, but it is not the original sort order.
Cutting to first field, then sorting and displaying only uniques:
cut -d ',' -f 1 test.csv | sort | uniq -u
That is, if you append -u to your command, it'd work. This is just using cut instead of awk.
If Perl is an option, this code is similar to #glenn jackman's:
perl -F, -lane '$c{$F[0]}++; END{for $k (sort keys %c){print $k if $c{$k} == 1}}' test.csv
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
#F is the array of words in each line, indexed starting with $F[0]

removing commas from numbers in CSV file

I have a file that has many columns and I only need two of those columns. I am getting the columns I need using
cut -f 2-3 -d, file1.csv > file2.csv
The issue I am having is that the first column is ID and once it gets past 999 it becomes 1,000 and so it is treated as an extra column now. I cant get rid of all commas because I need them to separate the data. Is there a way to use sed to remove commas that only show up between 0-9?
I'd use a real CSV parser, and count backwards from the end of the line:
ruby -rcsv -ne '
row = $_.parse_csv
puts row[-5..-4].to_csv :force_quotes => true
' <<END
999,"someone#example.com","Doe, John","Doe","555-1212","address"
1,234,"email#email.com","name","lastname","phone","address"
END
"someone#example.com","Doe, John"
"email#email.com","name"
This works for the example in the comments:
awk -F'"?,"' '{print $2, $3}' file
The field separator is zero or one " followed by ,". This means that the comma in the first number doesn't count.
To separate the two fields with a comma instead of a space, you can change the OFS variable like this:
awk -F'"?,"' -v OFS=',' '{print $2, $3}' file
Or like this:
awk -F'"?,"' 'BEGIN{OFS=","}{print $2, $3}' file
Alternatively, if you want the quotes as well, you can use printf:
awk -F'"?,"' '{printf "\"%s\",\"%s\"\n", $2, $3}' file
From your comments, it sounds like there is a comma and a space (', ') pattern between tokens.
If this is the case, you can do this easily with sed. The strategy is to first replace all occurrences of , with some unique character sequence (like maybe ||).
's:, :||:g'
From there you can remove all commas:
's:,::g'
Finally, replace the double pipes with comma-space again.
's:||:, :g'
Putting it into one statement:
sed -i -e 's:, :||:g;s:,::g;s:||:, :g' your_odd_file.csv
And a command-line example to try before you buy:
bash$ sed -e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000, hello world, 123,456"
1200000, hello world, 123456
If you are in the unfortunate situation where there is not a space between fields in the CSV - you can attempt to 'fake it' by detecting changes in data type - like where there is a numeric field followed by a text field.
's:,\([^0-9]\):, \1:g' # numeric followed by non-numeric
's:\([^0-9]\),:\1, :g' # non-numeric field followed by something (anything)
You can put this all together into one statement, but you are venturing into dangerous waters here - this will definitely be a one-off solution and should be taken with a large grain of salt.
sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' file1.csv > file2.csv
And another example:
bash$ sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000,hello world,123,456"
1200000, hello world, 123456

parse a csv file that contains commans in the fields with awk

i have to use awk to print out 4 different columns in a csv file. The problem is the strings are in a $x,xxx.xx format. When I run the regular awk command.
awk -F, {print $1} testfile.csv
my output `ends up looking like
307.00
$132.34
30.23
What am I doing wrong.
"$141,818.88","$52,831,578.53","$52,788,069.53"
this is roughly the input. The file I have to parse is 90,000 rows and about 40 columns
This is how the input is laid out or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be in a
$307.00
$132.34
$30.23
Oddly enough I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2 because of the field separator ^". Small price to pay for a short 1-liner if you ask me.
I think what you're saying is that you want to split the input into CSV fields while not getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and then you restore them at the end of your pipeline. Like this:
csvquote testfile.csv | awk -F, {print $1} | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{ gsub("\"", "", $1);
print $1 }
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23