Parsing csv using a conditional statement in awk

I have a csv file
file.csv
C75ADANXX,5,20,,AGGCAGAA,AGAGTAGA,,,,,AB
C75ADANXX,5,21,,AGGCAGAA,GTAAGGAG,,,,,AB
C75ADANXX,5,22,,AGGCAGAA,ACTGCATA,,,,,AB
C75ADANXX,5,23,,AGGCAGAA,AAGGAGTA,,,,,TC
C75ADANXX,5,24,,AGGCAGAA,CTAAGCCT,,,,,TC
C75ADANXX,5,25,,TCCTGAGC,GCGTAAGA,,,,,TC
When I run the following awk command:
awk -F "," '{print $11}' file.csv ## prints the last column
I want to extract the lines with TC, but the following command prints nothing:
awk -F "," '{if($11==TC){print$0}}' file.csv
Where am I going wrong in writing the command? Thank you.

Modified the command to
awk -F "," '{if($11=="TC\r"){print $0}}' file.csv
This file was copied from Windows; it had a carriage return character at the end of each line, which is obviously not visible when you print only the last column.
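A fix that does not depend on the line endings is to strip the carriage return inside awk before comparing; a minimal sketch (re-assigning $0 via sub makes awk re-split the fields):
awk -F, '{sub(/\r$/, "")} $11 == "TC"' file.csv
Alternatively, convert the file once with dos2unix or tr -d '\r'.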

Modify
awk -F "," '{if($11==TC){print$0}}' file.csv
To
awk -F "," '{if($11=="TC"){print$0}}' file.csv
Or, even simpler:
awk -F, '$11=="TC"' file.csv
In if($11==TC), the variable TC is never defined (with no quotes, awk treats TC as a variable, not as the string "TC"), so it evaluates to the empty string; the comparison is therefore always false and nothing is printed.
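If you do want the value to live in a variable, define it with -v so it is set before awk runs; a sketch (tag is an arbitrary name chosen here for illustration):
awk -F, -v tag="TC" '$11 == tag' file.csv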

Try this one:
grep ",TC$" file.csv

Related

Trying to get number of rows containing '/2020' in a column

I have a sizeable dataset of about 7 million lines and I am trying to find the number of rows in column $2 that contain "/2020" in the date ($2 is all dates in the format mm/dd/yyyy). However, all of the awk commands I'm trying are either giving me 0 or aren't printing anything at all, and I'm not sure why.
awk -F',' '$2 == "/2020" { count++ } END { print count }' file.csv
prints nothing
awk -v variable="2020" '$2 ~ variable' file.csv | wc -l
prints 0
awk ' BEGIN {count=0;} { if ($2 =="2020") count += 1} END {print count}' file.csv
prints 0
I'd appreciate some help. Thanks!
The syntax to use is:
awk -F, '$2 ~ /\/2020/{cnt++} END {print cnt}' file.csv
== would mean the second field is exactly equal to the pattern, while ~ means it matches the pattern, so just a part of the field can match.
See also the related part of the GNU awk manual
Also, your second attempt would have worked if you had added the field separator; note that here you match only the year, without the slash.
awk -F, -v variable="2020" '$2 ~ variable' file.csv | wc -l
Note: this assumes that there are no separators (commas) nested inside quoted fields in your file, at least for the first two fields. If there are, a more complex pattern should be used as the field separator.
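For example, GNU awk can match the fields themselves rather than the separators via its FPAT variable; a sketch (gawk only), treating a field as either an unquoted run or a double-quoted string:
awk -v FPAT='[^,]*|"[^"]*"' '$2 ~ /\/2020/ {cnt++} END {print cnt}' file.csv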
A combination of the best parts of your attempts is:
$ awk -F, -v variable=2020 '$2~variable{c++}END{print c}' file
2
Since $2 is all dates in the format mm/dd/yyyy, there is no need to put the / in the query (avoiding an escape); 2020 is enough - when using earthbound calendars...
But without a proper sample this is still all guessing.
Could you please try the following if you want to use a variable:
awk -v variable="2020" 'BEGIN{FS=","} $2 ~ ("/"variable){cnt++} END{print cnt}' file.csv
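And if you later need counts for every year rather than just 2020, an associative array keyed on the year is a natural extension; a sketch, assuming field 2 is always in mm/dd/yyyy form:
awk -F, '{split($2, d, "/"); cnt[d[3]]++} END {for (y in cnt) print y, cnt[y]}' file.csv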

How to change csv file delimiter

Here's a csv file, items.txt:
item-number,item-description,item-category,cost,quantity-available
I tried to change the field separator from , to \n using awk, and I need an easy way to do it.
1) This command does not work:
awk -F, 'BEGIN{OFS="\n"} {print}' items.txt
2) This command works, but the real csv I need to process has 15+ columns, and I don't want to write out a variable for every column:
awk -F, 'BEGIN{OFS="\n"} {print $1,$2}' items.txt
Thanks in advance.
If your fields do not contain commas, you may use tr:
tr ',' '\n' < infile > outfile
You were close. After setting OFS to a newline, you need to make awk reconstruct the record so the fields are re-joined with the new separator.
$1=$1 re-evaluates the fields and joins them with OFS, which by default is a space. Since we set OFS to RS, which by default is a newline, you get the desired output.
The 1 at the end of the statement is the idiomatic way of saying print the line.
$ awk 'BEGIN{FS=",";OFS=RS}{$1=$1}1' file
item-number
item-description
item-category
cost
quantity-available
You need to get awk to re-evaluate the record by changing a field.
awk -F, 'BEGIN{OFS="\n"} {$1 = $1; print}' items.txt
Or, if you're sure the first column is always non-empty and nonzero, you could use the somewhat simpler:
awk -F, 'BEGIN{OFS="\n"} $1 = $1' items.txt
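If you would rather avoid the re-evaluation trick altogether, replacing the separators in the whole record also works; again, this assumes no field contains an embedded comma:
awk '{gsub(/,/, "\n")} 1' items.txt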

Print out only the first column in a non-standard CSV file

I have a file that is delimited by comma ",", but some rows have only one column, and some rows have multiple columns separated by ",". For example:
NM_001066
NM_015378,NM_018156
NM_001006624,NM_001006625,NM_006474,NM_198389
As you can see above, the third line has 4 columns delimited by ",", but I only need to get the first column in every line.
I tried to use awk: cat fileName.txt | awk '{print $1}', but it does not work. I am looking for help with this. Thank you!
I guess you're looking for this:
awk -F, '{print $1}' file.txt
-F, tells awk to use comma as the field separator.
In this simple case, the same thing is simpler with cut:
cut -f1 -d, file.txt
You are close:
awk -F, '{print $1}' file
or
awk -F, '$0=$1' file
(note the second form skips any line whose first field is empty or 0, since the assignment's value is then false)

Awk a range of numbers from a specific column in a csv file

I am trying to print the rows from a csv file that have a third column value greater than -39. I have tried using awk but have not been able to get the command to work:
awk -F "," '{$3 > -39}' momenttensors.csv
You have your comparison inside an action block. You want it in the pattern section. Remove the { and }.
awk -F, '$3 > -39' momenttensors.csv
Try this:
awk -F, '$3 > -39' momenttensors.csv
You need it in the test (pattern) part; {} is the action part.
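Equivalently, if you prefer to keep an action block, make the comparison the condition of an explicit if; a sketch:
awk -F, '{if ($3 > -39) print}' momenttensors.csv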

parse a csv file that contains commas in the fields with awk

I have to use awk to print out 4 different columns in a csv file. The problem is the strings are in a $x,xxx.xx format. When I run the regular awk command:
awk -F, '{print $1}' testfile.csv
my output ends up looking like
307.00
$132.34
30.23
What am I doing wrong?
"$141,818.88","$52,831,578.53","$52,788,069.53"
This is roughly the input. The file I have to parse is 90,000 rows and about 40 columns.
This is how the input is laid out, or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be in a
$307.00
$132.34
$30.23
Oddly enough I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2, because the leading quote matches the ^" separator and leaves an empty $1. A small price to pay for a short one-liner, if you ask me.
I think what you're saying is that you want to split the input into CSV fields while not getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
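Since you need four different columns, it may be easier to strip the quotes from every field in one pass rather than keep one temporary variable per column; a sketch printing fields 1 and 3 (the field numbers are just for illustration):
awk -F'","' '{for (i = 1; i <= NF; i++) gsub(/"/, "", $i); print $1, $3}' testfile.csv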
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and then you restore them at the end of your pipeline. Like this:
csvquote testfile.csv | awk -F, '{print $1}' | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{ gsub("\"", "", $1);
  print $1 }
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23