awk CSV Split with headers Windows - csv

I have a CSV file that I need to split based on a column value. The split itself works, but I cannot get the header to print in each file.
Currently I use:
awk "FS =\",\" {output=$3\".csv\"; print $0 > output}" test.csv
This splits the file based on column 3, but I don't know how to add the header to each output file.
I've searched high and low but can't find a solution that works as a one-liner.
UPDATE
OK to date we have a working one liner:
awk -F, "NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3\".csv\"}{print>$3\".csv\"}" test.csv
Or in test.awk:
BEGIN{FS=","} NR==1 {hdr=$0;next}!($3 in files) {files[$3]=1;print hdr>$3".csv"}{print>$3".csv"}
Command used to run it:
awk -f test.awk test.csv
I really appreciate the help here, I've been trying for hours and have a few things left to work out.
1) Blank line inserted after header
2) Sort the data on specified fields
Further down the line I also want to do a row count and cut a reference number from another file. Is this possible with AWK, or am I using the wrong tool for the job?
Thanks again.

UPDATED#2
Blank line after header line
UPDATED
Try this:
On Unix/cygwin (I tested on cygwin):
awk -F, 'NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr"\n">$3".csv"}{print>$3".csv"}' test.csv
Or adding Kent's ideas:
awk -F, 'NR==1{hdr=$0;next}{out=$3".csv"}!($3 in files){files[$3];print hdr"\n">out}{print>out}' test.csv
On windows cmd (not tested):
awk -F, "NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr\"\n\">$3\".csv\"}{print>$3\".csv\"}" test.csv
This stores the header line of test.csv in hdr. For each following line it checks whether the value of column 3 has already been seen. If not, it records the value in the files hash and prints the header line to the new file. In either case it then prints the whole line to the file.
Example file:
$ cat test.csv
A,B,C,D
1,2,a,3
4,5,b,4
Output
$ cat a.csv
A,B,C,D
1,2,a,3
$ cat b.csv
A,B,C,D
4,5,b,4
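The same logic can be spelled out over several lines, which may be easier to follow than the one-liner. This is only a sketch that reuses the sample test.csv above; the scratch directory name split_demo and the array name seen are assumptions made for the illustration, and the header is written without the extra blank line.

```shell
# Recreate the sample test.csv in a scratch directory (split_demo is a made-up name).
mkdir -p split_demo
printf 'A,B,C,D\n1,2,a,3\n4,5,b,4\n' > split_demo/test.csv

(
  cd split_demo || exit 1
  awk -F, '
  NR == 1 { hdr = $0; next }      # remember the header, do not write it yet
  {
      out = $3 ".csv"             # target file is named after column 3
      if (!(out in seen)) {       # first row for this file: write the header once
          print hdr > out
          seen[out] = 1
      }
      print > out                 # every data row goes to its own file
  }' test.csv
)

cat split_demo/a.csv split_demo/b.csv
```

Keeping the file name in a variable (out) also sidesteps the ambiguity some awk implementations have with an unparenthesized expression after >.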
ADDED
If you would like to put the awk script into a file, you could try this (I cannot test it, sorry).
test.awk
BEGIN{FS=","}
NR==1 {hdr=$0;next}
!($3 in files) {files[$3]=1;print hdr"\n">$3".csv"}
{print>$3".csv"}
Then You may call it as
awk -f test.awk test.csv

awk -F, 'NR==1{h=$0;next}{out=$3".csv";
if(!(out in a))print h> out; print $0 > out;a[out]}' test.csv

Try something like this:
awk -F, '
BEGIN {
getline header
}
{
out=$3".csv"
if (!($3 in seen)) {
print header > out
}
print $0 > out
seen[$3]
}' test.csv
Windows version, written on one line because cmd does not continue a quoted string across lines: (Not tested)
awk -F, "BEGIN{getline header}{out=$3\".csv\";if(!($3 in seen))print header>out;print $0>out;seen[$3]}" test.csv

awk 'NR==1{ hdr=$0; next } { output=$3".csv"; if( !($3 in a)) print hdr > output; a[$3]
print > output}' FS=, test.csv
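Regarding the row count mentioned in the question: one possible approach, sketched here under assumptions, is to count rows per output file in an array and report the totals in an END block. The directory count_demo, the extra sample row, and the array name rows are all made up for the example.

```shell
# Sample input with two rows for key "a" and one for key "b" (hypothetical data).
mkdir -p count_demo
printf 'A,B,C,D\n1,2,a,3\n4,5,b,4\n7,8,a,9\n' > count_demo/test.csv

counts=$(
  cd count_demo || exit 1
  awk -F, '
  NR == 1 { hdr = $0; next }
  {
      out = $3 ".csv"
      if (!(out in rows)) print hdr > out   # header once per output file
      print > out
      rows[out]++                           # data rows written to this file
  }
  END { for (f in rows) print f, rows[f] }' test.csv | sort
)
printf '%s\n' "$counts"
```

The order of a for (f in rows) loop is unspecified in awk, which is why the summary is piped through sort.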

Related

Increment field value provided another field matches a string

I am trying to increment a value in a csv file, provided it matches a search string. Here is the script that was utilized:
awk -i inplace -F',' '$1 == "FL" { print $1, $2+1} ' data.txt
Contents of data.txt:
NY,1
FL,5
CA,1
Current Output:
FL 6
Intended Output:
NY,1
FL,6
CA,1
Thanks.
$ awk 'BEGIN{FS=OFS=","} $1=="FL"{++$2} 1' data.txt
NY,1
FL,6
CA,1
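The original command printed only the FL line because its print sat inside the action guarded by $1 == "FL"; the version above prints every record. If GNU awk's -i inplace is not available, a temporary file gives a similar effect. A minimal sketch, assuming a scratch directory inc_demo and a temp file name chosen for the example:

```shell
mkdir -p inc_demo
printf 'NY,1\nFL,5\nCA,1\n' > inc_demo/data.txt

# Increment column 2 on FL rows, print every row, then replace the original file.
awk 'BEGIN { FS = OFS = "," } $1 == "FL" { $2++ } { print }' inc_demo/data.txt \
    > inc_demo/data.txt.tmp && mv inc_demo/data.txt.tmp inc_demo/data.txt

cat inc_demo/data.txt
```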
Intended Output:
NY,1 FL,6 CA,1
I would use GNU AWK for this task in the following way. Let the content of file.txt be
NY,1
FL,5
CA,1
then
awk 'BEGIN{FS=OFS=",";ORS=" "}{print $1,$2+($1=="FL")}' file.txt
gives output
NY,1 FL,6 CA,1
Explanation: I tell GNU AWK that the field separator (FS) and the output field separator (OFS) are both a comma and that the output record separator (ORS) is a space, in accordance with your requirements. Then for each line I print the 1st field followed by the 2nd field increased by the result of the comparison $1=="FL", which is 1 when it holds and 0 when it does not. If you want to know more about FS, OFS or ORS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
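Note that with ORS set to a space the output ends in a trailing space and has no final newline. A possible variant, shown only as a sketch, joins the records with printf and adds the newline in END; the variable name result is an assumption for the example.

```shell
result=$(
  printf 'NY,1\nFL,5\nCA,1\n' |
  awk 'BEGIN { FS = OFS = "," }
       # ($1 == "FL") evaluates to 1 or 0, so only FL rows are incremented
       { printf "%s%s", (NR > 1 ? " " : ""), $1 OFS ($2 + ($1 == "FL")) }
       END { print "" }'
)
printf '%s\n' "$result"
```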
Use this Perl one-liner:
perl -i -F',' -lane 'if ( $F[0] eq "FL" ) { $F[1]++; } print join ",", @F;' data.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Adding a column in multiple csv file using awk

I want to add a column to multiple (500) CSV files, all with the same dimensions. The column should act as an identifier for the individual file. I want to create a bash script using awk (I am a newbie in awk). The CSV files do come with headers.
For eg.
Input File1.csv
#name,#age,#height
A,12,4.5
B,13,5.0
Input File2.csv
#name,#age,#height
C,11,4.6
D,12,4.3
I want to add a new column "#ID" to both files, where the value of ID is the same for every row within one file but differs between files.
Expected Output
File1.csv
#name,#age,#height,#ID
A,12,4.5,1
B,13,5.0,1
Expected File2.csv
#name,#age,#height,#ID
C,11,4.6,2
D,12,4.3,2
Please suggest.
If you don't need to extract the id number from the filename, this should do.
$ c=1; for f in File*.csv;
do
sed -i '1s/$/,#ID/; 2,$s/$/,'$c'/' "$f";
c=$((c+1));
done
Note that this edits the files in place. Perhaps make a backup or test first.
UPDATE
If you don't need the individual files to be updated, this may work better for you
$ awk -v OFS=, 'BEGIN {f="allFiles.csv"}
FNR==1 {c++; print $0,"#ID" > f; next}
{print $0,c > f}' File*.csv
awk -F, -v OFS=, '
FNR == 1 {
$(NF + 1) = "#ID"
i++
f = FILENAME
sub(/Input/, "Output", f)
} FNR != 1 {
$(NF + 1) = i
} {
print > f
}' Input*.csv
With GNU awk for inplace editing and ARGIND:
awk -i inplace -v OFS=, '{print $0, (FNR==1 ? "#ID" : ARGIND)}' File*.csv
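Where GNU awk is not available, a similar result can probably be had with plain awk by writing each result to a sibling file and renaming it afterwards. This is a sketch under assumptions: the directory id_demo and the .new suffix are made up, and close() is called so that 500 inputs do not exhaust the open-file limit.

```shell
mkdir -p id_demo
printf '#name,#age,#height\nA,12,4.5\nB,13,5.0\n' > id_demo/File1.csv
printf '#name,#age,#height\nC,11,4.6\nD,12,4.3\n' > id_demo/File2.csv

(
  cd id_demo || exit 1
  awk -v OFS=, '
  FNR == 1 {                       # first line of each input file
      id++                         # one ID per file, in argument order
      if (out) close(out)          # release the previous output file
      out = FILENAME ".new"
      print $0, "#ID" > out
      next
  }
  { print $0, id > out }' File*.csv
  # Replace the originals only after awk has finished.
  for f in File*.csv; do mv "$f.new" "$f"; done
)

cat id_demo/File1.csv id_demo/File2.csv
```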

How do I conditionally append the occurrence of a field using awk?

I have a file that looks like this:
Level,Member
HIGH,John
HIGH,John
HIGH,Paul
HIGH,George
REG,George
REG,George
REG,George
REG,John
REG,Paul
REG,Paul
REG,Ringo
If I want to append a count of the occurrence of data in the second column, this works great:
awk 'BEGIN{ FS=OFS="," }{ $0=$0 OFS (++a[$2]) }1' file
But I'm having trouble figuring out how to add an if/else statement so that I can conditionally count by level so that my output looks like this:
Level,Member,1
HIGH,George,1
HIGH,John,1
HIGH,John,2
HIGH,Paul,1
REG,George,1
REG,George,2
REG,George,3
REG,John,1
REG,Paul,1
REG,Paul,2
REG,Ringo,1
Please note that the count starts over when the level changes from HIGH to REG. The file is already sorted by level and then by member.
or just..
$ awk '{$0=$0","++a[$0]}1' file
Level,Member,1
HIGH,John,1
HIGH,John,2
HIGH,Paul,1
HIGH,George,1
REG,George,1
REG,George,2
REG,George,3
REG,John,1
REG,Paul,1
REG,Paul,2
REG,Ringo,1
Keep it simple:
awk '{print $0","(++c[$0])}' file
Your command was already pretty fine. I have just changed the key of associative array a[]:
awk 'BEGIN{ FS=OFS="," }{ $0=$0 OFS (++a[$2$1]) }1' file
or:
awk 'BEGIN{ FS=OFS="," }{ $0=$0 OFS (++a[$0]) }1' file
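A key built by plain concatenation such as $2$1 could in principle collide for different field pairs. A small sketch of the same idea using awk's comma subscript, which joins the parts with SUBSEP; the file name members.csv and the shortened sample are assumptions for the example.

```shell
printf 'Level,Member\nHIGH,John\nHIGH,John\nHIGH,Paul\nREG,John\nREG,Paul\nREG,Paul\n' > members.csv

# count[$1, $2] keeps a separate counter per (level, member) pair.
numbered=$(awk 'BEGIN { FS = OFS = "," } { print $0, (++count[$1, $2]) }' members.csv)
printf '%s\n' "$numbered"
```

Because the level is part of the key, the counter for John starts again at 1 when the level changes from HIGH to REG.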

awk compare one column from two CSV files and display fields from both files

I want to compare the second column of the 1st file with the first column of the 2nd file and, if a match is found, display all fields from the 2nd file followed by all fields from the 1st file.
file1:
"971525408953","a8:5b:78:5a:dd:dc","TRUE"
"971558216784","ec:1f:72:24:7b:30","TRUE"
"971506509910","e8:50:8b:d8:f3:b5","TRUE"
"971509525934","c8:14:79:b4:bc:da","FALSE"
"971506904830","58:48:22:83:87:7f","TRUE"
file2:
"fc:e9:98:1e:a2:a2",2016-03-07 23:39:29,"TRUE"
"c8:14:79:b4:bc:da",2016-03-08 04:26:06,"TRUE"
"78:a3:e4:87:df:19",2015-12-30 01:22:42,"TRUE"
"18:f6:43:b1:82:47",2016-03-08 08:38:41,"TRUE"
"58:48:22:83:87:7f",2015-12-22 01:22:42,"TRUE"
output expected:
"c8:14:79:b4:bc:da",2016-03-08 04:26:06,"TRUE","971509525934","c8:14:79:b4:bc:da","FALSE"
"58:48:22:83:87:7f",2015-12-22 01:22:42,"TRUE","971506904830","58:48:22:83:87:7f","TRUE"
But if I run the following command, I get this output, with n[$2] and n[$3] empty:
awk -F"," 'NR==FNR { n[$2] = $1; next } ($1 in n) {print $1,$2,$3,n[$1],n[$2],n[$3] }' file1 file2
"c8:14:79:b4:bc:da",2016-03-08 04:26:06,"TRUE","971509525934",,
"58:48:22:83:87:7f",2015-12-22 01:22:42,"TRUE","971506904830",,
Can any one help me on this?
awk -F"," -v OFS="," 'NR==FNR { n[$2] = $0; next } ($1 in n) {print $1,$2,$3, n[$1] }' file1 file2
output:
"c8:14:79:b4:bc:da",2016-03-08 04:26:06,"TRUE","971509525934","c8:14:79:b4:bc:da","FALSE"
"58:48:22:83:87:7f",2015-12-22 01:22:42,"TRUE","971506904830","58:48:22:83:87:7f","TRUE"

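The array in the question is keyed by column 2 of file1 and only holds column 1, so n[$2] and n[$3] look up keys that were never stored. A minimal sketch of the two-file idiom that stores the whole file1 row instead; the directory join_demo, the shortened sample files, and the array name row are assumptions for the example.

```shell
mkdir -p join_demo
cat > join_demo/file1 <<'EOF'
"971525408953","a8:5b:78:5a:dd:dc","TRUE"
"971509525934","c8:14:79:b4:bc:da","FALSE"
"971506904830","58:48:22:83:87:7f","TRUE"
EOF
cat > join_demo/file2 <<'EOF'
"fc:e9:98:1e:a2:a2",2016-03-07 23:39:29,"TRUE"
"c8:14:79:b4:bc:da",2016-03-08 04:26:06,"TRUE"
"58:48:22:83:87:7f",2015-12-22 01:22:42,"TRUE"
EOF

joined=$(awk -F, -v OFS=, '
NR == FNR { row[$2] = $0; next }   # first file: index whole rows by column 2
$1 in row { print $0, row[$1] }    # second file: append the matching file1 row
' join_demo/file1 join_demo/file2)
printf '%s\n' "$joined"
```

NR == FNR is true only while the first file is being read, which is what separates the indexing pass from the lookup pass.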
AWK Split large CSV file with headers and print output files based on column value

I have a CSV file of around 800 mb which I need to split up using AWK.
The file has a column with ID's in them which I want to use to split the file on.
I'm familiar/know how to accomplish this with Perl but not with AWK since I've only used it a few times.
(In perl I would use the Text::CSV module but I don't have the option in this case)
I found this answer: https://stackoverflow.com/a/16795137 which is basically what I want but with a small alteration. It has to contain an if statement so it will only print if the column I want to split on is a number. This is necessary because the file columns can shift sometimes, and I want to send the non-numeric lines to a separate file (junk.csv).
I'm using the windows cmd version for testing right now but I'll eventually run it on linux. (Below original code)
awk -F, "NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr\"\n\">$3\".csv\"}{print>$3\".csv\"}" test.csv
And my intention is this:
awk -F";" "{if ($3 ~ /^[0-9]+$/){"NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3\".csv\"}{print>$3\".csv\"}"" test.csv
I can't figure out how to do this in AWK (just yet). The double quotes are also throwing me off (because of the windows version). Where am I going wrong?
This is my error output:
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk: ^ syntax error
errcount: 4
This is my (sample) data:
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
And I want to accomplish this:
File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
You can simplify the awk to:
awk -F\; '{print > $3".csv"}' input
This produces the following CSV files:
26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
NOTE
If you want to send the lines that have non-digits in column 3 to junk.csv, a small change to the above awk can be helpful:
awk -F\; '$3 ~ /^[0-9]+$/{print > $3".csv"; next} {print > "junk.csv"}' input
$3 ~ /^[0-9]+$/ performs a regex match on column 3. If it matches, the line is sent to the corresponding CSV file and next skips the second block; otherwise the line is written to junk.csv.
OR
a much simpler version like
awk -F\; '{file=$3~/^[0-9]+$/?$3:"junk";print >file".csv"}' input
Thanks to Jidder for the suggestion.
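If the real file does start with a header line, as the NR==1 part of the question's code suggests, the digit check and the header logic can be combined. This is a sketch under assumptions: the header names, the shortened rows, the malformed line, and the directory junk_demo are all invented for the example, since the sample data above has no header.

```shell
mkdir -p junk_demo
printf 'id;price;cat;url\n10003062;19.99;26;u1\n10002394;22.98;48;u2\nbroken;row\n10048375;26.99;26;u3\n' \
    > junk_demo/test.csv

(
  cd junk_demo || exit 1
  awk -F';' '
  NR == 1 { hdr = $0; next }                        # keep the header for later
  {
      # numeric column 3 picks the file, anything else goes to junk.csv
      out = ($3 ~ /^[0-9]+$/ ? $3 : "junk") ".csv"
      if (out != "junk.csv" && !(out in seen)) {    # header once per split file
          print hdr > out
          seen[out] = 1
      }
      print > out
  }' test.csv
)

cat junk_demo/26.csv
cat junk_demo/junk.csv
```

Putting the program in single quotes works on Linux; on Windows cmd the same program is easier to run from a file with awk -f than to re-quote.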