How to combine several GAWK statements? - csv

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit overkill and was wondering whether I could optimise it or at least streamline it into a one-liner?
I've tried piping the data but keep getting stuck with a mix of errors.
Thanks

In awk you can append multiple actions as:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, awk processes it by first applying pattern1-action1, then pattern2-action2, and so on, in the order the rules are written.
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, COMBINED.csv)
You could essentially also do the sorting in GNU awk, but to mimic the sort exactly we would need to know your input format.
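To see how the chained rules act on a single record, here is a minimal sketch on a made-up three-line sample (the IDs and dates are hypothetical, and it prints to stdout instead of writing into processed/):

```shell
# Hypothetical sample: two rows end up with a 14-character key, one does not.
sample='ab12345678901234,2021/01/02,x
short,2021/01/03,y
cd43210987654321,2021/02/03,z'

result=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "," }
    { sub(/[[:lower:]]+/, "", $1) }   # rule 1: strip the first run of lowercase letters
    length($1) != 14 { next }         # rule 2: skip records whose key is not 14 characters
    { gsub("/", "-", $2) }            # rule 3: slashes to hyphens in column 2
    { print }                         # rule 4: emit whatever survived the earlier rules
')
printf '%s\n' "$result"
```

The second row is dropped by the `next` in rule 2, so rules 3 and 4 never see it.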

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns - the count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
split into 3 columns and ";" as a separator
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created after end of file, since I used just one on this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following code, which combines xargs and awk. Note that the padding is hard-coded for 3 columns.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
For an awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
if (NR%n) { # pad only when the last row is incomplete
for(i=NR%n; i<n-1; i++) printf(";")
printf ORS
}
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
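For the paste variant, the number of - arguments can be generated instead of typed. A minimal sketch, assuming a POSIX shell; the function name to_columns is made up for illustration:

```shell
# to_columns N: read lines on stdin, write N ";"-separated columns per row.
to_columns() {
    n=$1
    set --                       # reuse the positional parameters as an argument list
    i=0
    while [ "$i" -lt "$n" ]; do
        set -- "$@" -            # one "-" (stdin) per desired column
        i=$((i + 1))
    done
    paste -d';' "$@"             # paste pads the last row with empty fields
}

result=$(printf 'L1\nL2\nL3\nL4\nL5\nL6\nL7\n' | to_columns 3)
printf '%s\n' "$result"
```

This reads the input once and needs no line count up front, since paste itself fills the missing fields at end of file.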

Increment field value provided another field matches a string

I am trying to increment a value in a csv file, provided it matches a search string. Here is the script that was utilized:
awk -i inplace -F',' '$1 == "FL" { print $1, $2+1} ' data.txt
Contents of data.txt:
NY,1
FL,5
CA,1
Current Output:
FL 6
Intended Output:
NY,1
FL,6
CA,1
Thanks.
$ awk 'BEGIN{FS=OFS=","} $1=="FL"{++$2} 1' data.txt
NY,1
FL,6
CA,1
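The command above prints to stdout and leaves data.txt untouched. If the file itself should be updated and -i inplace is not available, one common pattern is to write to a temporary file and move it over the original. A sketch under that assumption, run in a scratch directory so nothing real is overwritten:

```shell
tmpdir=$(mktemp -d)
printf 'NY,1\nFL,5\nCA,1\n' > "$tmpdir/data.txt"

# Write the modified copy first; replace the original only if awk succeeded.
awk 'BEGIN { FS = OFS = "," } $1 == "FL" { ++$2 } 1' "$tmpdir/data.txt" > "$tmpdir/data.txt.new" &&
    mv "$tmpdir/data.txt.new" "$tmpdir/data.txt"

result=$(cat "$tmpdir/data.txt")
rm -rf "$tmpdir"
printf '%s\n' "$result"
```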
Intended Output:
NY,1 FL,6 CA,1
I would harness GNU AWK for this task following way, let file.txt content be
NY,1
FL,5
CA,1
then
awk 'BEGIN{FS=OFS=",";ORS=" "}{print $1,$2+($1=="FL")}' file.txt
gives output
NY,1 FL,6 CA,1
Explanation: I tell GNU AWK that the field separator (FS) and output field separator (OFS) are both , and that the output record separator (ORS) is a space, in accordance with your requirements. Then for each line I print the 1st field followed by the 2nd field plus the result of the comparison $1=="FL", which is 1 when it holds and 0 when it does not. If you want to know more about FS, OFS or ORS, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
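The comparison-as-number trick can be made visible by printing the comparison result next to the sum. This is only an illustration on the sample data, using the default ORS so each record stays on its own line; the variable name hit is made up:

```shell
result=$(printf 'NY,1\nFL,5\nCA,1\n' | awk 'BEGIN { FS = OFS = "," }
{
    hit = ($1 == "FL")       # 1 when the first field is FL, otherwise 0
    print $1, $2 + hit, hit  # third column shows what was added
}')
printf '%s\n' "$result"
```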
Use this Perl one-liner:
perl -i -F',' -lane 'if ( $F[0] eq "FL" ) { $F[1]++; } print join ",", @F;' data.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

AWK : comparing 2 columns from 2 csv files, outputting to a third. How do I also get the output that doesnt match to another file?

I currently have the following script:
awk -F, 'NR==FNR { a[$1 FS $4]=$0; next } $1 FS $4 in a { printf a[$1 FS $4]; sub($1 FS $4,""); print }' file1.csv file2.csv > combined.csv
This compares columns 1 & 4 from both csv files and outputs the matching lines from both files to combined.csv. Is it possible to output the lines from file 1 & file 2 that don't match to other files with the same awk line? Or would I need to do separate passes?
File1
ResourceName,ResourceType,PatternType,User,Host,Operation,PermissionType
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow
File2
topic,groupName,Name,User,email,team,contact,teamemail,date,clienttype
BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
combined
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
Wanted additional files:
non matched file1:
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
non matched file2:
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
again, I might be trying to do too much in one line? would it be wiser to run another parse?
Assuming the key pairs of $1 and $4 are unique within each input file then using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { next }
{ key = $1 FS $4 }
NR==FNR {
file1[key] = $0
next
}
key in file1 {
print file1[key], $0 > "out_combined"
delete file1[key]
next
}
{
print > "out_file2_only"
}
END {
for (key in file1) {
print file1[key] > "out_file1_only"
}
}
$ awk -f tst.awk file{1,2}
$ head out_*
==> out_combined <==
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
==> out_file1_only <==
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
==> out_file2_only <==
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
The order of lines in out_file1_only will be shuffled by the in operator - if that's a problem let us know as it's an easy tweak to retain the input order.
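One possible form of that tweak, as a sketch rather than the answerer's own version: remember the file1 keys in arrival order in a second array and walk that array in END. The array name order, the counter n, and the tiny sample files are all made up for illustration.

```shell
tmpdir=$(mktemp -d)
# Hypothetical inputs: a header line, then key columns in $1 and $4.
printf '%s\n' 'k,a,b,u' 'K1,x,x,U1' 'K2,x,x,U2' 'K3,x,x,U3' > "$tmpdir/file1"
printf '%s\n' 'k,a,b,u,e' 'K2,y,y,U2,e' 'K9,y,y,U9,e' > "$tmpdir/file2"

( cd "$tmpdir" && awk 'BEGIN { FS = OFS = "," }
FNR == 1 { next }                          # skip each header
{ key = $1 FS $4 }
NR == FNR { file1[key] = $0; order[++n] = key; next }   # remember arrival order
key in file1 { print file1[key], $0 > "out_combined"; delete file1[key]; next }
{ print > "out_file2_only" }
END {
    # walk keys in input order; matched keys were deleted above
    for (i = 1; i <= n; i++)
        if (order[i] in file1) print file1[order[i]] > "out_file1_only"
}' file1 file2 )

only1=$(cat "$tmpdir/out_file1_only")
only2=$(cat "$tmpdir/out_file2_only")
combined=$(cat "$tmpdir/out_combined")
rm -rf "$tmpdir"
printf '%s\n' "$only1"
```

The cost is one extra array entry per file1 line; the matching logic is unchanged.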

Awk multiple transformation / separators in once

I have to transform (preprocess) a CSV file, by generating / inserting a new column, being the result of the concat of existing columns.
For example, transform:
A|B|C|D|E
into:
A|B|C|D|C > D|E
In this example, I do it with:
cat myfile.csv | awk 'BEGIN{FS=OFS="|"} {$4 = $4 OFS $3" > "$4} 1'
But now I have something more complex to do, and I can't figure out how to do it.
I have to transform:
A|B|C|x,y,z|E
into
A|B|C|x,y,z|C > x,C > y,C > z|E
How can it be done in awk (or another command) efficiently (my csv file can contain thousands of lines)?
Thanks.
With GNU awk (for gensub which is a GNU extension):
awk -F'|' '{$6=$5; $5=gensub(/(^|,)/,"\\1" $3 " > ","g",$4); print}' OFS='|'
You can split the 4th field into an array:
awk 'BEGIN{FS=OFS="|"} {n=split($4,a,",");t="";for(i=1;i<=n;i++)t=(t? t "," : "") $3 " > " a[i];$4=$4 OFS t} 1' myfile.csv
A|B|C|x,y,z|C > x,C > y,C > z|E
There are many ways to do this, but the simplest is the following:
$ awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
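We make a copy of the fourth field in variable t. In that copy, we replace every run of characters that contains no comma (the inner separator) with the content of the third field, followed by " > " and the original matched string (&). The original fourth field and t are then joined with OFS, which inserts t as a new column.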
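Running the gsub approach on the sample line from the question gives a quick check. This sketch assumes the third field contains no & or backslash, since those are special in a gsub replacement string:

```shell
result=$(printf 'A|B|C|x,y,z|E\n' | awk 'BEGIN { FS = OFS = "|" }
{
    t = $4                        # work on a copy so the original column is kept
    gsub(/[^,]+/, $3 " > &", t)   # & stands for each matched comma-free run
    $4 = $4 OFS t                 # original column, separator, new column
}
1')
printf '%s\n' "$result"
```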

Adding a column in multiple csv file using awk

I want to add a column to multiple (500) CSV files (all with the same dimensions). The column should act as an identifier for the individual file. I want to create a bash script using awk (I am a newbie in awk). The CSV files do come with headers.
For eg.
Input File1.csv
#name,#age,#height
A,12,4.5
B,13,5.0
Input File2.csv
#name,#age,#height
C,11,4.6
D,12,4.3
I want to add a new column "#ID" in both the files, where the value of ID will be same for an individual file but not for both the file.
Expected Output
File1.csv
#name,#age,#height,#ID
A,12,4.5,1
B,13,5.0,1
Expected File2.csv
#name,#age,#height,#ID
C,11,4.6,2
D,12,4.3,2
Please suggest.
If you don't need to extract the id number from the filename, this should do.
$ c=1; for f in File*.csv;
do
sed -i '1s/$/,#ID/; 2,$s/$/,'$c'/' "$f";
c=$((c+1));
done
Note that this edits the files in place, so make a backup or test on copies first.
UPDATE
If you don't need the individual files to be updated, this may work better for you
$ awk -v OFS=, 'BEGIN {f="allFiles.csv"}
FNR==1 {c++; print $0,"#ID" > f; next}
{print $0,c > f}' File*.csv
awk -F, -v OFS=, '
FNR == 1 {
$(NF + 1) = "ID#"
i++
f = FILENAME
sub(/Input/, "Output", f)
} FNR != 1 {
$(NF + 1) = i
} {
print > f
}' Input*.csv
With GNU awk for inplace editing and ARGIND:
awk -i inplace -v OFS=, '{print $0, (FNR==1 ? "#ID" : ARGIND)}' File*.csv
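If the originals should stay untouched and GNU awk is not available, a variant is to write each result under the same name into a separate directory. A minimal sketch, assuming the inputs sort in the desired ID order; the variable outdir and the scratch layout are made up for illustration. With 500 files it closes each output before opening the next, so the number of open files stays small.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/out"
printf '#name,#age,#height\nA,12,4.5\nB,13,5.0\n' > "$tmpdir/File1.csv"
printf '#name,#age,#height\nC,11,4.6\nD,12,4.3\n' > "$tmpdir/File2.csv"

awk -v OFS=, -v outdir="$tmpdir/out" '
FNR == 1 {                              # first line of each input file
    if (out != "") close(out)           # done with the previous output file
    id++
    n = split(FILENAME, parts, "/")     # keep only the base name
    out = outdir "/" parts[n]
    print $0, "#ID" > out
    next
}
{ print $0, id > out }' "$tmpdir"/File*.csv

result1=$(cat "$tmpdir/out/File1.csv")
result2=$(cat "$tmpdir/out/File2.csv")
rm -rf "$tmpdir"
printf '%s\n' "$result2"
```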