AWK using file to remove csv rows - csv

I have the following csv:
old.csv
irrelevant,irrelevant,Abc#gmail.com,irrelevant
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,yZ#yahoo.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,this#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
irrelevant,irrelevant,anoTher#mydomain.com,irrelevant
and I need to remove the rows whose emails appear in this file:
remove.txt
abc#gmail.com
yz#yahoo.com
this#email.com
another#mydomain.com
And I need the output to be this:
new.csv
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
I've tried this, but it doesn't work. Can anyone help?
awk -F, 'BEGIN{IGNORECASE = 1};NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt old.csv > new.csv

With grep:
grep -v -i -f remove.txt all.csv
Here,
-f - Obtain patterns from FILE, one per line.
-i - Ignore case
-v - Invert the matching
With awk:
awk -F, 'BEGIN{IGNORECASE=1} NR==FNR{a[$1]++;next} {for(var in a) if($3 ~ var) next} 1' remove.txt all.csv
Another awk:
awk -F, 'NR==FNR{a[tolower($1)]++;next} !(tolower($3) in a){print}' remove.txt all.csv
In your case, it won't work, because
IGNORECASE=1
only affects regular-expression matching such as if (x ~ /ab/); it has no effect on array-index lookups like
index in array
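To illustrate (a made-up gawk one-liner, not from the original question):
echo "Abc" | gawk 'BEGIN{IGNORECASE=1; remove["abc"]} {print ($0 ~ /abc/), ($0 in remove)}'
This prints 1 0: the regex match honours IGNORECASE, the array lookup does not.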

IGNORECASE is gawk-specific, you may not be using gawk.
You're testing the wrong field: the emails are in $3, but your script checks $1.
Incrementing the array element (remove[$1]++) does nothing useful; just creating the index is enough.
Try this:
awk -F, 'NR==FNR{remove[tolower($1)];next}!(tolower($3) in remove)' remove.txt old.csv > new.csv
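With the sample old.csv and remove.txt above, that produces the expected new.csv:
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant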

prefix every header column with string using awk

I have a bunch of big CSV files and I want to prefix every header column with a fixed string. There are more than 500 columns in every file.
suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get this (note the first column is missing the prefix and the separator is removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could change all of your CSV files in place at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub replaces every character in the header line that is not a lowercase letter or underscore with the prefix - including your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; it simply means "this condition is true; perform the default action for every line (i.e. print it)". This is just a stylistic change.
awk -F';' 'NR==1{
sub(/^/, "input_file."); gsub(/;/, ";input_file."); }
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
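For example, rough sketches of both approaches (untested; the loop writes each result via a temporary file):
# in-place via a shell loop and a temp file
for f in *.csv; do
  awk -F';' 'NR==1{sub(/^/, "input_file."); gsub(/;/, ";input_file.")} 1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
# concatenate everything to stdout, keeping only the first file's (prefixed) header
awk -F';' 'NR==1{sub(/^/, "input_file."); gsub(/;/, ";input_file.")} NR==1 || FNR!=1' *.csv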
I would use GNU AWK the following way. Let file.txt content be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by the prefix. Then on the first line I prepend the prefix to the first column, which triggers rebuilding of the record using the new OFS. No other line is modified, so they are printed as-is.
(tested in GNU Awk 5.0.1)
Also with awk using for loop and printf:
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS)}' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount

Trying to get number of rows containing '/2020' in a column

I have a sizeable dataset of about 7 million lines and I am trying to find the number of rows in column $2 that contain "/2020" in the date ($2 is all dates in the format mm/dd/yyyy). However, all of the awk commands I'm trying are either giving me 0 or aren't printing anything at all, and I'm not sure why.
awk -F',' '$2 == "/2020" { count++ } END { print count }' file.csv
prints nothing
awk -v variable="2020" '$2 ~ variable' file.csv | wc -l
prints 0
awk ' BEGIN {count=0;} { if ($2 =="2020") count += 1} END {print count}' file.csv
prints 0
I'd appreciate some help. thanks!
The syntax to use is:
awk -F, '$2 ~ /\/2020/{cnt++} END {print cnt}' file.csv
== would require the second field to be exactly equal to the string, while ~ tests whether the field matches the pattern, so the pattern only needs to match a part of the field.
See also the related part of the GNU awk manual
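To see the difference on a sample line (a made-up one-liner, not from the original post):
echo "x,01/15/2020" | awk -F, '{print ($2 == "/2020"), ($2 ~ /\/2020/)}'
It prints 0 1: the equality test fails because the field is the whole date, while the regex matches the /2020 inside it.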
Also, your second attempt would have worked if you had added the field separator; note that there you match only the year, without the slash.
awk -F, -v variable="2020" '$2 ~ variable' file.csv | wc -l
Note: this assumes that there are no separators (commas) nested inside quoted fields in your file, at least for the first two fields. If there are, a more elaborate field-splitting approach is needed, as sketched below.
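For instance, GNU awk's FPAT lets you describe fields by their content rather than by a separator; a sketch (not part of the original answer, requires gawk):
gawk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"} $2 ~ /\/2020/ {cnt++} END {print cnt+0}' file.csv
The cnt+0 just makes it print 0 instead of an empty line when nothing matches.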
Combination of the best parts of your trials is:
$ awk -F, -v variable=2020 '$2~variable{c++}END{print c}' file
2
Since $2 is all dates in the format mm/dd/yyyy there is no need to put the / in the query (and no escaping is needed); 2020 is enough - when using earthbound calendars...
But without a proper sample this is still all guessing.
Could you please try the following if you want to use a variable.
awk -v variable="2020" 'BEGIN{FS=","} $2 ~ ("/"variable){cnt++} END{print cnt}' file.csv

Find Values in CSV that only Appear Once

I have a csv file with thousands of lines in it. I'd like to be able to find values that only appear once in this file.
For instance
dog
dog
cat
dog
bird
I'd like to get as my result:
cat
bird
I tried using the following awk command but it returned one of each value in the file:
awk -F"," '{print $1}' test.csv|sort|uniq
Returns:
dog
cat
bird
Thank you for your help!
Just with awk:
awk -F, '{count[$1]++} END {for (key in count) if (count[key] == 1) print key}' test.csv
Close. Try:
awk -F"," '{print $1}' test.csv |sort | uniq -c | awk '{if ($1 == 1) print $2}'
The -c flag on uniq will give you counts. The next awk looks for any items with a count of 1 (first field) and prints the value of the second field ($2).
The only caveat is that this will return bird before cat because the output has been sorted alphabetically. You could pipe once more to sort -r to reverse the sort direction. That matches the expected answer you asked for, but it is not the original input order.
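For example, appending sort -r to the pipeline above:
awk -F"," '{print $1}' test.csv | sort | uniq -c | awk '{if ($1 == 1) print $2}' | sort -r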
Cutting to first field, then sorting and displaying only uniques:
cut -d ',' -f 1 test.csv | sort | uniq -u
That is, if you append -u to your command, it'd work. This is just using cut instead of awk.
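Applied to your original awk pipeline, that would be:
awk -F"," '{print $1}' test.csv | sort | uniq -u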
If Perl is an option, this code is similar to @glenn jackman's:
perl -F, -lane '$c{$F[0]}++; END{for $k (sort keys %c){print $k if $c{$k} == 1}}' test.csv
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]

Print out only the first column in a non-standard CSV file

I have a file that is delimited by comma ",", but some rows have only one column, and some rows have multiple columns separated by ",". For example:
NM_001066
NM_015378,NM_018156
NM_001006624,NM_001006625,NM_006474,NM_198389
As you can see above, the third line has 4 columns delimited by ",", but I only need to get the first column in every line.
I tried to use awk: cat fileName.txt | awk '{print $1}', but it does not work. I am looking for help with this. Thank you!
I guess you're looking for this:
awk -F, '{print $1}' file.txt
-F, tells awk to use comma as the field separator.
In this simple case, the same thing is simpler with cut:
cut -f1 -d, file.txt
You are close:
awk -F, '{print $1}' file
or
awk -F, '$0=$1' file

Awk a range of numbers from a specific column in a csv file

I am trying to print the rows from a csv file that have a third column value greater than -39. I have tried using awk but have not been able to get the command to work:
awk -F "," '{$3 > -39}' momenttensors.csv
You have your comparison inside an action block. You want it in the pattern section. Remove the { and }.
awk -F, '$3 > -39' momenttensors.csv
Try this:
awk -F, '$3 > -39' momenttensors.csv
You need it in the test (pattern) part; {} is the action part.