Awk multiple transformations / separators at once - csv

I have to transform (preprocess) a CSV file by generating / inserting a new column that is the concatenation of existing columns.
For example, transform:
A|B|C|D|E
into:
A|B|C|D|C > D|E
In this example, I do it with:
cat myfile.csv | awk 'BEGIN{FS=OFS="|"} {$4 = $4 OFS $3" > "$4} 1'
But now I have something more complex to do, and I can't find how to do it.
I have to transform:
A|B|C|x,y,z|E
into
A|B|C|x,y,z|C > x,C > y,C > z|E
How can it be done efficiently in awk (or another command)? My CSV file can contain thousands of lines.
Thanks.

With GNU awk (for gensub which is a GNU extension):
awk -F'|' '{$6=$5; $5=gensub(/(^|,)/,"\\1" $3 " > ","g",$4); print}' OFS='|'

You can split the 4th field into an array:
awk 'BEGIN{FS=OFS="|"} {split($4,a,",");$4="";for(i=1;i in a;i++)$4=($4? $4 "," : "") $3 " > " a[i]} 1' myfile.csv
A|B|C|C > x,C > y,C > z|E

There are many ways to do this, but the simplest is the following:
$ awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
We make a copy of the fourth field in variable t. In it, we replace every maximal run of characters that are not the separator (,) with the content of the third field, followed by > and the original matched string (&).
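For example, against the sample line from the question, this should print (the echo is just a quick check; the data would normally come from myfile.csv):
$ echo 'A|B|C|x,y,z|E' | awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
A|B|C|x,y,z|C > x,C > y,C > z|E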

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns - the count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator; expected output:
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created at the end, since only one value was left for this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
  sub(/\r$/, "")                       # remove DOS line break, if present
  printf "%s", $0 (NR%n ? ";" : ORS)
}
END {
  # pad the last record with empty columns if needed
  if (NR % n) {
    for (i=1; i < (n - (NR % n)); ++i)
      printf ";"
    print ""
  }
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following, which uses an xargs + awk combination to achieve the needed outcome.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
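Here xargs -n3 first regroups the input into three space-separated tokens per line, and the awk part then pads short lines and converts the separators. The intermediate step looks like this:
$ xargs -n3 < Input_file
L1 L2 L3
L4 L5 L6
L7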
For an awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
  if (NR%n) {                          # pad only when the last row is incomplete
    for(i=NR%n; i<n-1; i++) printf(";")
    printf ORS
  }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
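The second version relies on assigning to NF, which pads the record with empty fields and rebuilds $0 with OFS. A quick check of that behaviour in GNU awk (other awks may differ):
$ echo 'L7' | awk -v OFS=';' '{NF=3; print}'
L7;;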
Or you can use ruby if you want more options:
ruby -le '
  n = 3
  puts $<.read.
    split($/).
    each_slice(n).
    map{ |sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
    join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
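If n is not fixed in advance, the list of dashes for paste can be generated on the fly; a small sketch, assuming a Bourne-like shell with seq available (the variable n here is ours):
n=3
paste -d';' $(printf '%.0s- ' $(seq "$n")) < file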

How to combine several GAWK statements?

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit of an overkill, and I was wondering whether I could optimise it, or at least streamline it into a one-liner.
I've tried piping the data but I'm getting stuck in a mix of errors.
Thanks
In awk you can append multiple actions as:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, awk processes it by first evaluating pattern1 { action1 }, then pattern2 { action2 }, and so on.
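For instance, in the toy program below (unrelated to the question's data) both rules run, in order, on every record:
$ printf 'a b\nc d\n' | awk '{ $1 = toupper($1) } { print NR": "$0 }'
1: A b
2: C d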
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, combined.csv)
You could essentially also do the sorting in GNU awk, but to mimic the sort exactly, we would need to know your input format.
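If you do want a gawk-only version, a rough sketch could look like this. It assumes the goal is simply a numeric sort on the first comma-separated field and uses gawk's arrays of arrays plus PROCINFO["sorted_in"], both GNU extensions; it only replaces the external sort, the main processing stays as above:
gawk 'BEGIN{FS=OFS=","}
      { recs[$1+0][NR] = $0 }                   # group lines by numeric value of column 1
      END {
        PROCINFO["sorted_in"] = "@ind_num_asc"  # traverse indices in ascending numeric order
        for (key in recs)
          for (i in recs[key])                  # input order within equal keys
            print recs[key][i]
      }' combined.csv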

finding patterns in csv file

I am trying to sort a CSV file based on its repeating rows.
awk -F, 'NR>1{arr[$4,",",$5,",",$6,",",$7,",",$8,",",$9]++}END{for (a in arr) printf "%s\n", arr[a] "-->" a}' test.txt
Input file
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
I want to create a file with:
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,2,t,z,s-->2
a,b,k,1,4,5,5,5,6,1,k,r,s-->1
where the last column contains the number of occurrences of the pattern of numbers from the 4th field through the 9th field.
Count and sort the duplicate lines
I got to the point that I have the patterns with the count - but I don't know how to add the rest of the columns to the line:
thank you for the support.
A solution where the data is read twice: on the first pass the duplicates are counted, and on the second the output is produced:
$ awk -F, '
NR==FNR {
a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9]++ # count
next
}
{
print $0 "-->" a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9] # output
}' file file
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,t,z,s-->2
a,b,k,1,4,5,5,5,6,k,r,s-->1
Could you please try the following, which reads the Input_file a single time only.
awk '
BEGIN{
  FS=OFS=","
}
{
  a[FNR]=$0
  b[FNR]=$4 FS $5 FS $6 FS $7 FS $8 FS $9
  c[$4 FS $5 FS $6 FS $7 FS $8 FS $9]++
}
END{
  for(i=1;i<=FNR;i++){
    print a[i]" ---->" c[b[i]]
  }
}' Input_file
The answer of James Brown is a very simple double-pass solution, which has the advantage that you don't need to store the file in memory, but the disadvantage of having to read the file twice. The following solution does the inverse: it reads the file only once, but has to keep it in memory. To this end we need 3 arrays: array a to keep the lines in their original order, array b to record each line's key, and array c to keep track of the count.
Furthermore, we will make use of multidimensional array indices:
A valid array index shall consist of one or more <comma>-separated expressions, similar to the way in which multi-dimensional arrays are indexed in some programming languages. Because awk arrays are really one-dimensional, such a <comma>-separated list shall be converted to a single string by concatenating the string values of the separate expressions, each separated from the other by the value of the SUBSEP variable. Thus, the following two index operations shall be equivalent:
var[expr1, expr2, ... exprn]
var[expr1 SUBSEP expr2 SUBSEP... SUBSEP exprn]
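A quick way to see that equivalence in practice (a toy check, unrelated to the question's data):
$ awk 'BEGIN {
  a[1,2] = "x"
  if ((1 SUBSEP 2) in a) print "found via SUBSEP"
  if ((1,2) in a)        print "found via (1,2)"
}'
found via SUBSEP
found via (1,2)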
The solution now reads:
{ a[NR] = $0
  b[NR] = $4 SUBSEP $5 SUBSEP $6 SUBSEP $7 SUBSEP $8 SUBSEP $9
  c[$4,$5,$6,$7,$8,$9]++ }
END { for(i=1;i<=NR;++i) print a[i] "-->" c[b[i]] }
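Saved in a file, say count.awk (the name is just for illustration), it should reproduce the output of the two-pass version:
$ awk -F, -f count.awk test.txt
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,t,z,s-->2
a,b,k,1,4,5,5,5,6,k,r,s-->1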
Since the problem resembles an SQL grouping pattern, you can also use sqlite. Check this out:
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ cat sqllite_cols4_to_9.sh
#!/bin/sh
sqlite3 <<EOF
create table data(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12);
.separator ','
.import "$1" data
select t1.*, " --> " || t2.cw from data t1, ( select c4,c5,c6,c7,c8,c9, count(*) as cw from data group by c4,c5,c6,c7,c8,c9 ) t2
where t1.c4=t2.c4 and t1.c5=t2.c5 and t1.c6=t2.c6 and t1.c7=t2.c7 and t1.c8=t2.c8 and t1.c9=t2.c9;
EOF
$ ./sqllite_cols4_to_9.sh shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z, --> 2
k,s,t,1,2,3,4,5,6,t,z,s, --> 2
a,b,k,1,4,5,5,5,6,k,r,s, --> 1
$
You can try Perl also. The file is read only once, so it will be faster. Check this out:
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ perl -F, -lane ' $v=join(",",@F[3..8]);$kv{$_}{$v}=$kv2{$v}++; END { while(($x,$y)=each (%kv)){ while(($p,$q)=each (%{$y})) { print "$x --> $kv2{$p}" }}}' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$
Another Perl - shorter code
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { $t=join(",",(split /,/)[3..8]); print "$_ --> $kv2{$t}" } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
or
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { print "$_ --> ",$kv2{join(",",(split /,/)[3..8])} } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$

Adding a column in multiple csv file using awk

I want to add a column to multiple (500) CSV files (all with the same dimensions). The column should act as an identifier for the individual file. I want to create a bash script using awk (I am a newbie in awk). The CSV files come with headers.
For eg.
Input File1.csv
#name,#age,#height
A,12,4.5
B,13,5.0
Input File2.csv
#name,#age,#height
C,11,4.6
D,12,4.3
I want to add a new column "#ID" to both files, where the value of ID is the same within an individual file but different between files.
Expected Output
File1.csv
#name,#age,#height,#ID
A,12,4.5,1
B,13,5.0,1
Expected File2.csv
#name,#age,#height,#ID
C,11,4.6,2
D,12,4.3,2
Please suggest.
If you don't need to extract the id number from the filename, this should do.
$ c=1; for f in File*.csv;
do
sed -i '1s/$/,#ID/; 2,$s/$/,'$c'/' "$f";
c=$((c+1));
done
Note that this edits the files in place. Perhaps make a backup or test first.
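With GNU sed you can also let -i keep backups by giving it a suffix, e.g. -i.bak keeps the original as File1.csv.bak:
$ c=1; for f in File*.csv; do sed -i.bak '1s/$/,#ID/; 2,$s/$/,'$c'/' "$f"; c=$((c+1)); done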
UPDATE
If you don't need the individual files to be updated, this may work better for you
$ awk -v OFS=, 'BEGIN {f="allFiles.csv"}
FNR==1 {c++; print $0,"#ID" > f; next}
{print $0,c > f}' File*.csv
awk -F, -v OFS=, '
FNR == 1 {
  $(NF + 1) = "#ID"
  i++
  f = FILENAME
  sub(/Input/, "Output", f)
} FNR != 1 {
  $(NF + 1) = i
} {
  print > f
}' Input*.csv
With GNU awk for inplace editing and ARGIND:
awk -i inplace -v OFS=, '{print $0, (FNR==1 ? "#ID" : ARGIND)}' File*.csv

How to replace a string/pattern in nth column/field in a comma-separated .csv file using sed/awk?

I have a .csv file; I need to replace a string in the 5th column with another string. Columns are separated by ',' and each element is enclosed in double quotes, as shown below.
"ID","CIRCLE","IP_ADDRESS","DESCRIPTION","Current_Status"
"6","local","127.0.0.1","localhost","3"
"7","RPOP1","10.10.10.1","router1","3"
I need to replace all elements that are '3' in the 5th column with the string 'Alive'.
I have tried the script below, which was posted earlier on Stack Overflow, but it is not working for my case:
/usr/bin/awk -F, '$5 ~ /3/ { OFS= ","; $5 = "Alive"; }' /tmp/HOST_REPORT.csv
Please provide a simple solution.
Thank you.
but it is not working for my case
Because you didn't choose to print it. Try:
/usr/bin/awk -F, '$5 ~ /3/ { OFS= ","; $5 = "Alive"; }1' /tmp/HOST_REPORT.csv
(note the added 1 at the end, which triggers awk's default action of printing the record)
In order to preserve the quotes, you could say:
/usr/bin/awk -F, '$5 ~ /3/ { OFS="," ; $5 = "\"Alive\""; }1' /tmp/HOST_REPORT.csv
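Run against the sample lines (assuming they are in /tmp/HOST_REPORT.csv), the quote-preserving version should print:
"ID","CIRCLE","IP_ADDRESS","DESCRIPTION","Current_Status"
"6","local","127.0.0.1","localhost","Alive"
"7","RPOP1","10.10.10.1","router1","Alive"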