I have a bunch of big CSV files and I want to prefix every header column with a fixed string. There are more than 500 columns in every file.
suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get this (note the first column is missing the prefix and the separators are removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
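The trick here is the & in the replacement text, which stands for whatever the regex matched, so each match (the start of the line, or a ;) is kept and the prefix appended right after it. A minimal sketch with a throwaway prefix X.:

```shell
# "&" in the gsub replacement is the matched text itself: the empty match
# at start-of-line and each ";" are kept, with the prefix appended after them
echo 'a;b' | awk '{gsub(/^|;/, "&X.")} 1'
# X.a;X.b
```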
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could use either of the above to change all of your CSV files at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub would replace every character that isn't a lowercase letter or underscore, anywhere in the line, with the prefix string - including your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; it is a condition that is always true, so awk performs the default action for every line, which is to print it. This is purely a stylistic change.
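A minimal demonstration of the idiom, on toy input:

```shell
# "1" is an always-true condition with no action, so awk falls back to
# the default action for every record: printing it
printf 'a\nb\n' | awk '1'
# a
# b
```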
awk -F';' 'NR==1{
sub(/^/, "input_file."); gsub(/;/, ";input_file."); }
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
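If your sed lacks -i, a temp-file loop works too; a minimal sketch (the directory and file names here are only for illustration):

```shell
# Sketch: prefix the header of every CSV in a directory via a shell loop.
# Writing to a temp file and moving it back avoids relying on sed -i.
dir=$(mktemp -d)                                  # demo directory
printf 'number;date\n1;2\n' > "$dir/sample.csv"   # demo file

for f in "$dir"/*.csv; do
  sed -E '1s/^|;/&input_file./g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

head -1 "$dir/sample.csv"
# input_file.number;input_file.date
```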
I would use GNU AWK in the following way. Let file.txt content be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by the prefix. Then in the first line I prepend the prefix to the first column; assigning to a field triggers rebuilding of the record with the new OFS. No other line is modified, so they are printed as-is.
(tested in GNU Awk 5.0.1)
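The rebuilding trigger can be seen in isolation; a toy sketch (separators chosen arbitrarily):

```shell
# Assigning to any field (even $1 = $1) forces awk to rebuild the record,
# joining the fields with the current OFS
printf 'a;b;c\n' | awk 'BEGIN{FS=";"; OFS="-"} {$1 = $1; print}'
# a-b-c
```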
Also with awk, using a for loop and printf:
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS)}' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
I have a sizeable dataset of about 7 million lines and I am trying to find the number of rows in column $2 that contain "/2020" in the date ($2 is all dates in the format mm/dd/yyyy). However, all of the awk commands I'm trying are either giving me 0 or aren't printing anything at all, and I'm not sure why.
awk -F',' '$2 == "/2020" { count++ } END { print count }' file.csv
prints nothing
awk -v variable="2020" '$2 ~ variable' file.csv | wc -l
prints 0
awk ' BEGIN {count=0;} { if ($2 =="2020") count += 1} END {print count}' file.csv
prints 0
I'd appreciate some help. thanks!
The syntax to use is:
awk -F, '$2 ~ /\/2020/{cnt++} END {print cnt}' file.csv
== would mean that the second field is exactly equal to the string, while ~ means that it matches the pattern; with ~, just a part of the field needs to match.
See also the related part of the GNU awk manual
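A toy comparison of the two operators (the sample line is invented for illustration):

```shell
# == compares the whole field against a string; ~ tests whether a regex
# matches anywhere inside the field
echo 'x,03/14/2020' | awk -F, '$2 == "/2020" {print "eq"}'     # no output: field is not exactly "/2020"
echo 'x,03/14/2020' | awk -F, '$2 ~ /\/2020/ {print "match"}'
# match
```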
Also, your second attempt would have worked if you had added the field separator; note that here you match only the year, without the slash.
awk -F, -v variable="2020" '$2 ~ variable' file.csv | wc -l
Note: this assumes there are no separators (commas) nested inside quoted fields in your file, at least for the first two fields. If there are, a more complex pattern has to be used as the field separator.
Combination of the best parts of your trials is:
$ awk -F, -v variable=2020 '$2~variable{c++}END{print c}' file
2
Since $2 is all dates in the format mm/dd/yyyy, there is no need to put the / in the query (which avoids an escape): 2020 is enough - when using earthbound calendars...
But without a proper sample this is still all guessing.
Could you please try the following if you want to use a variable.
awk -v variable="2020" 'BEGIN{FS=","} $2 ~ ("/"variable){cnt++} END{print cnt}' file.csv
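The parentheses are what concatenate the literal / with the variable into the dynamic regex /2020; a small sketch with made-up rows:

```shell
# ("/" variable) builds the regex string "/2020" by concatenation,
# so only dates in that year match
printf 'a,01/05/2020\nb,01/05/2019\n' | awk -F, -v y=2020 '$2 ~ ("/" y)'
# a,01/05/2020
```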
I would like to remove all rows in my CSV where under the column "Owner" The Value is Fishbowl Digital Content.
Here is a sample CSV
ID,Name,Function,Media,Owner
415,Sam,Run,Footage,Production
213,Raj,Catch,Footage,Fishbowl Digital Content
214,Jack,Hold,Website,Salvage
256,Jason,Catch,Website,Fishbowl Digital Content
I have tried
awk -F , '$4 != "Fishbowl Digital Content" {print $0}' Test.csv >TestModified.csv
But I still see lines where the Value under "Owner" is 'Fishbowl Digital Content'
Here are the desired Results:
ID,Name,Function,Media,Owner
415,Sam,Run,Footage,Production
214,Jack,Hold,Website,Salvage
I know it's not awk, but I find that the Miller way (http://johnkerl.org/miller/doc/) is very easy and useful:
mlr --csv filter -x '$Owner=="Fishbowl Digital Content"' inputFile.csv
1st solution (with hard-coded field position): could you please try the following.
awk -F, 'FNR==1{print;next} $NF!="Fishbowl Digital Content"' Input_file
2nd solution (generic, without hard-coding the field position): a more generic solution which first checks which field holds the Owner value and then uses it to skip lines. This means we are NOT hard-coding the position of the Owner column here.
awk -F, 'FNR==1{for(i=1;i<=NF;i++){if($i=="Owner"){val=i}};print;next} $val!="Fishbowl Digital Content"' Input_file
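A quick run of the generic version against part of the sample data, piped in here just for illustration:

```shell
# Only rows whose Owner field differs from the target survive; the header
# line is always printed and used to locate the Owner column
printf '%s\n' \
  'ID,Name,Function,Media,Owner' \
  '415,Sam,Run,Footage,Production' \
  '213,Raj,Catch,Footage,Fishbowl Digital Content' |
awk -F, 'FNR==1{for(i=1;i<=NF;i++){if($i=="Owner"){val=i}};print;next} $val!="Fishbowl Digital Content"'
# ID,Name,Function,Media,Owner
# 415,Sam,Run,Footage,Production
```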
You can also use grep in case you are matching the first or the last column. For the last column (note the sample data has no quotes around the values), here is the filter:
grep -v ',Fishbowl Digital Content$' file
If there could be trailing spaces, then:
grep -vE ',Fishbowl Digital Content[[:space:]]*$' file
I have the following csv:
old.csv
irrelevant,irrelevant,Abc#gmail.com,irrelevant
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,yZ#yahoo.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,this#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
irrelevant,irrelevant,anoTher#mydomain.com,irrelevant
that I need to remove the rows containing emails from this file:
remove.txt
abc#gmail.com
yz#yahoo.com
this#email.com
another#mydomain.com
And I need the output to be this:
new.csv
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
I've tried this, but it doesn't work. Can anyone help?
awk -F, 'BEGIN{IGNORECASE = 1};NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt old.csv > new.csv
With grep:
grep -v -i -f remove.txt all.csv
Here,
-f - Obtain patterns from FILE, one per line.
-i - Ignore case
-v - Invert the matching
With awk:
awk -F, 'BEGIN{IGNORECASE=1} NR==FNR{a[$1]++;next} {for(var in a){if($3 ~ var){print}}}' remove.txt all.csv
Another awk:
awk -F, 'NR==FNR{a[tolower($1)]++;next} !(tolower($3) in a){print}' remove.txt all.csv
In your case, it won't work, because
IGNORECASE=1
only affects regex matching, such as if (x ~ /ab/), and not array-index lookups such as index in array.
IGNORECASE is gawk-specific, you may not be using gawk.
You're testing the wrong field.
Incrementing the array element does nothing useful.
Try this:
awk -F, 'NR==FNR{remove[tolower($1)];next}!(tolower($3) in remove)' remove.txt old.csv > new.csv
I have to use awk to print out 4 different columns in a CSV file. The problem is the values are in a $x,xxx.xx format. When I run a regular awk command,
awk -F, '{print $1}' testfile.csv
my output ends up looking like
307.00
$132.34
30.23
What am I doing wrong?
"$141,818.88","$52,831,578.53","$52,788,069.53"
This is roughly the input. The file I have to parse is 90,000 rows and about 40 columns.
This is how the input is laid out, or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be in a
$307.00
$132.34
$30.23
Oddly enough, I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2, because of the ^" field separator. A small price to pay for a short one-liner, if you ask me.
I think what you're saying is that you want to split the input into CSV fields while not getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and then you restore them at the end of your pipeline. Like this:
csvquote testfile.csv | awk -F, '{print $1}' | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{ gsub("\"", "", $1);
print $1 }
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23
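One caveat worth noting: RS="," splits on every comma, including those inside quoted amounts like "$141,818.88" from the earlier sample, so this script only suits files whose fields contain no embedded commas. A small sketch of the failure mode:

```shell
# With a comma embedded in a quoted field, the record boundaries fall in
# the wrong places and the amount is split in two
printf '"$141,818.88","$30.23"\n' | awk 'BEGIN{RS=","} {gsub("\"", "", $1); print $1}'
# $141
# 818.88
# $30.23
```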