Finding patterns in a CSV file

I am trying to sort a CSV file based on its repeating rows:
awk -F, 'NR>1{arr[$4,",",$5,",",$6,",",$7,",",$8,",",$9]++}END{for (a in arr) printf "%s\n", arr[a] "-->" a}' test.txt
Input file
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
Create a file with
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,t,z,s-->2
a,b,k,1,4,5,5,5,6,k,r,s-->1
where the last column contains the number of occurrences of the pattern of numbers that start in the 4th place till the 9th place.
Count and sort the duplicate lines
I got to the point that I have the patterns with the count - but I don't know how to add the rest of the columns to the line:
Thank you for the support.

A solution where the data is read twice: on the first pass the duplicates are counted, and on the second pass the lines are output:
$ awk -F, '
NR==FNR {
a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9]++ # count
next
}
{
print $0 "-->" a[$4 ORS $5 ORS $6 ORS $7 ORS $8 ORS $9] # output
}' file file
a,b,d,1,2,3,4,5,6,y,x,z-->2
k,s,t,1,2,3,4,5,6,t,z,s-->2
a,b,k,1,4,5,5,5,6,k,r,s-->1

Could you please try the following, which reads the Input_file a single time only.
awk '
BEGIN{
FS=OFS=","
}
{
a[FNR]=$0
b[FNR]=$4 FS $5 FS $6 FS $7 FS $8 FS $9
c[$4 FS $5 FS $6 FS $7 FS $8 FS $9]++
}
END{
for(i=1;i<=FNR;i++){
print a[i]" ---->" c[b[i]]
}
}' Input_file
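For reference, running it against the sample data should print each line followed by its pattern count (a sketch: here the script above is assumed to be saved as count_patterns.awk, a file name chosen only for illustration):
$ awk -f count_patterns.awk test.txt
a,b,d,1,2,3,4,5,6,y,x,z ---->2
k,s,t,1,2,3,4,5,6,t,z,s ---->2
a,b,k,1,4,5,5,5,6,k,r,s ---->1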

The answer of James Brown is a very simple double-pass solution which has the advantage that you don't need to store the file in memory, but the disadvantage of having to read the file twice. The following solution does the inverse: it reads the file only once but has to keep it in memory. To this end we need 3 arrays: array c to keep track of the count, array b to act as a buffer and array a to keep track of the original order.
Furthermore, we will make use of multidimensional array indices:
A valid array index shall consist of one or more <comma>-separated expressions, similar to the way in which multi-dimensional arrays are indexed in some programming languages. Because awk arrays are really one-dimensional, such a <comma>-separated list shall be converted to a single string by concatenating the string values of the separate expressions, each separated from the other by the value of the SUBSEP variable. Thus, the following two index operations shall be equivalent:
var[expr1, expr2, ... exprn]
var[expr1 SUBSEP expr2 SUBSEP... SUBSEP exprn]
The solution now reads:
{ a[NR] = $4 SUBSEP $5 SUBSEP $6 SUBSEP $7 SUBSEP $8 SUBSEP $9
  b[NR]  = $0
  c[$4,$5,$6,$7,$8,$9]++ }
END { for(i=1;i<=NR;++i) print b[i],"-->",c[a[i]] }
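The snippet does not set FS itself, so invoke it with -F, (a sketch of the invocation against the question's test.txt):
$ awk -F, '{ a[NR] = $4 SUBSEP $5 SUBSEP $6 SUBSEP $7 SUBSEP $8 SUBSEP $9
  b[NR]  = $0
  c[$4,$5,$6,$7,$8,$9]++ }
END { for(i=1;i<=NR;++i) print b[i],"-->",c[a[i]] }' test.txt
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
a,b,k,1,4,5,5,5,6,k,r,s --> 1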

Since the problem resembles an SQL group-by pattern, you can also use sqlite. Check this out.
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ cat sqllite_cols4_to_9.sh
#!/bin/sh
sqlite3 <<EOF
create table data(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12);
.separator ','
.import "$1" data
select t1.*, " --> " || t2.cw from data t1, ( select c4,c5,c6,c7,c8,c9, count(*) as cw from data group by c4,c5,c6,c7,c8,c9 ) t2
where t1.c4=t2.c4 and t1.c5=t2.c5 and t1.c6=t2.c6 and t1.c7=t2.c7 and t1.c8=t2.c8 and t1.c9=t2.c9;
EOF
$ ./sqllite_cols4_to_9.sh shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z, --> 2
k,s,t,1,2,3,4,5,6,t,z,s, --> 2
a,b,k,1,4,5,5,5,6,k,r,s, --> 1
$

You can also try Perl. The file is read only once, so it will be faster. Check this out:
$ cat shimon.txt
a,b,d,1,2,3,4,5,6,y,x,z
k,s,t,1,2,3,4,5,6,t,z,s
a,b,k,1,4,5,5,5,6,k,r,s
$ perl -F, -lane ' $v=join(",",@F[3..8]);$kv{$_}{$v}=$kv2{$v}++; END { while(($x,$y)=each (%kv)){ while(($p,$q)=each (%{$y})) { print "$x --> $kv2{$p}" }}}' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$
Another Perl - shorter code
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { $t=join(",",(split /,/)[3..8]); print "$_ --> $kv2{$t}" } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
or
$ perl -F, -lane ' $kv{$_}=$kv2{join(",",@F[3..8])}++; END { for(keys %kv) { print "$_ --> ",$kv2{join(",",(split /,/)[3..8])} } } ' shimon.txt
a,b,k,1,4,5,5,5,6,k,r,s --> 1
a,b,d,1,2,3,4,5,6,y,x,z --> 2
k,s,t,1,2,3,4,5,6,t,z,s --> 2
$


Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns, with the column count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created after the end of the file, since only one value was left for this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following xargs + awk combination to achieve the needed outcome.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
For an awk solution, I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
  if (NR%n) {                              # only pad when the last row is incomplete
    for(i=NR%n; i<n-1; i++) printf(";")
    printf ORS
  }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
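The second variant relies on the fact that, in GNU awk, assigning to NF rebuilds $0 with OFS and pads any missing fields with empty strings. A quick one-liner illustrates the effect (gawk assumed):
$ echo 'L7' | gawk -v OFS=';' '{NF=3; print}'
L7;;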
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # with -l, $/ and $\ (the input and output record separators) are set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)

AWK: comparing 2 columns from 2 csv files, outputting to a third. How do I also get the output that doesn't match to another file?

I currently have the following script:
awk -F, 'NR==FNR { a[$1 FS $4]=$0; next } $1 FS $4 in a { printf a[$1 FS $4]; sub($1 FS $4,""); print }' file1.csv file2.csv > combined.csv
This compares columns 1 & 4 from both csv files and outputs the result from both files to combined.csv. Is it possible to output the lines from file1 & file2 that don't match to other files with the same awk line, or would I need to do separate passes?
File1
ResourceName,ResourceType,PatternType,User,Host,Operation,PermissionType
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow
File2
topic,groupName,Name,User,email,team,contact,teamemail,date,clienttype
BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
combined
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
Wanted additional files:
non matched file1:
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
non matched file2:
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
Again, I might be trying to do too much in one line? Would it be wiser to run another parse?
Assuming the key pairs of $1 and $4 are unique within each input file then using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { next }
{ key = $1 FS $4 }
NR==FNR {
file1[key] = $0
next
}
key in file1 {
print file1[key], $0 > "out_combined"
delete file1[key]
next
}
{
print > "out_file2_only"
}
END {
for (key in file1) {
print file1[key] > "out_file1_only"
}
}
$ awk -f tst.awk file{1,2}
$ head out_*
==> out_combined <==
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby@example.com,team 1,Bobby,boys@example.com,2021-11-26T10:10:17Z,Consumer
==> out_file1_only <==
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
==> out_file2_only <==
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady@example.com,team 1,Susan,girls@example.com,2021-11-26T10:10:17Z,Producer
The order of lines in out_file1_only will be shuffled by the in operator - if that's a problem let us know as it's an easy tweak to retain the input order.
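If the input order of out_file1_only matters, a sketch of that tweak (not part of the original answer) is to remember the order in which file1's keys were first seen and loop over that in END instead of using for (key in file1):
BEGIN { FS=OFS="," }
FNR==1 { next }
{ key = $1 FS $4 }
NR==FNR {
    file1[key] = $0
    order[++n] = key                  # remember insertion order of file1 keys
    next
}
key in file1 {
    print file1[key], $0 > "out_combined"
    delete file1[key]
    next
}
{
    print > "out_file2_only"
}
END {
    for (i=1; i<=n; i++)
        if (order[i] in file1)        # still present means it never matched
            print file1[order[i]] > "out_file1_only"
}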

Awk: multiple transformations / separators at once

I have to transform (preprocess) a CSV file by generating / inserting a new column, which is the result of concatenating existing columns.
For example, transform:
A|B|C|D|E
into:
A|B|C|D|C > D|E
In this example, I do it with:
cat myfile.csv | awk 'BEGIN{FS=OFS="|"} {$4 = $4 OFS $3" > "$4} 1'
But now I have something more complex to do, and I can't find how to do it.
I have to transform:
A|B|C|x,y,z|E
into
A|B|C|x,y,z|C > x,C > y,C > z|E
How can it be done in awk (or another command) efficiently (my csv file can contain thousands of lines)?
Thanks.
With GNU awk (for gensub which is a GNU extension):
awk -F'|' '{$6=$5; $5=gensub(/(^|,)/,"\\1" $3 " > ","g",$4); print}' OFS='|'
You can split the 4th field into an array and build the new column from it:
awk 'BEGIN{FS=OFS="|"} {split($4,a,",");t="";for(i=1;i in a;i++)t=(t? t "," : "") $3 " > " a[i];$4=$4 OFS t} 1' myfile.csv
A|B|C|x,y,z|C > x,C > y,C > z|E
There are many ways to do this, but the simplest is the following:
$ awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
We make a copy of the fourth field in variable t. In it, we replace every run of characters that does not contain the separator (,) with the content of the third field, followed by " > " and the original matched string (&), and then append the result to the record as a new field after $4.
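For example, with the sample row from the question piped in, it should print:
$ echo 'A|B|C|x,y,z|E' | awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
A|B|C|x,y,z|C > x,C > y,C > z|E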

How to read a csv file into arrays and compare and replace it with entries from another csv file?

I have two csv files file1.csv and file2.csv.
file1.csv contains 4 columns.
file1:
Header1,Header2,Header3,Header4
aaaaaaa,bbbbbbb,ccccccc,ddddddd
eeeeeee,fffffff,ggggggg,hhhhhhh
iiiiiii,jjjjjjj,kkkkkkk,lllllll
mmmmmmm,nnnnnnn,ooooooo,ppppppp
file2:
"Header1","Header2","Header3"
"aaaaaaa","cat","dog"
"iiiiiii","doctor","engineer"
"mmmmmmm","sky","blue"
So what I am trying to do is read file1.csv line by line, put each entry into an array, then compare the first element of that array with the first column of file2.csv; if a match exists, replace column 1 and column 2 of file1.csv with the corresponding columns of file2.csv.
So my desired output is:
cat,dog,ccccccc,ddddddd
eeeeeee,fffffff,ggggggg,hhhhhhh
doctor,engineer,kkkkkkk,lllllll
sky,blue,ooooooo,ppppppp
I am able to do it when there is only one column to replace.
Here is my code:
awk -F'"(,")?' '
NR==FNR { r[$2] = $3; next }
{ for (n in r) gsub(n,r[n]) } IGNORECASE=1' file2.csv file1.csv>output.csv
My final step is to dump the entire array into a file with 10 columns.
Any suggestions where I can improve or correct my code?
EDIT: Considering that your Input_file2 has its fields wrapped in double quotes, like "test","test2", try the following (thanks to Tiw for providing the samples in their post).
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
gsub(/\"/,"")
a[tolower($1)]=$0
next
}
a[tolower($1)]{
print a[tolower($1)],$NF
next
}
1' file2.csv file1.csv
Could you please try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]=$0
next
}
a[$1]{
print a[$1],$NF
next
}
1' Input_file2 Input_file1
OR, in case you could have a combination of lower-case and capital letters in the Input_file(s), then try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[tolower($1)]=$0
next
}
a[tolower($1)]{
print a[tolower($1)],$NF
next
}
1' Input_file2 Input_file1
With any awk and any number of fields in either file:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
gsub(/"/,"")
key = tolower($1)
}
NR==FNR {
for (i=2; i<=NF; i++) {
map[key,i] = $i
}
next
}
{
for (i=2; i<=NF; i++) {
$(i-1) = ((key,i) in map ? map[key,i] : $(i-1))
}
print
}
$ awk -f tst.awk file2 file1
Header2,Header3,Header3,Header4
cat,dog,ccccccc,ddddddd
eeeeeee,fffffff,ggggggg,hhhhhhh
doctor,engineer,kkkkkkk,lllllll
sky,blue,ooooooo,ppppppp
Given your sample data, and the description from your comments, please try this:
(Judging from your own code, you may have quotes around fields, which I didn't try to handle here.)
awk 'BEGIN{FS=OFS=","}
NR==FNR{gsub(/^"|"$/,"");gsub(/","/,",");a[$1]=$2;b[$1]=$3;next}
$1 in a{$2=b[$1];$1=a[$1];}
1' file2.csv file1.csv
Eg:
$ cat file1.csv
Header1,Header2,Header3,Header4
aaaaaaa,bbbbbbb,ccccccc,ddddddd
eeeeeee,fffffff,ggggggg,hhhhhhh
iiiiiii,jjjjjjj,kkkkkkk,lllllll
mmmmmmm,nnnnnnn,ooooooo,ppppppp
$ cat file2.csv
"Header1","Header2","Header3"
"aaaaaaa","cat","dog"
"iiiiiii","doctor","engineer"
"mmmmmmm","sky","blue"
$ awk 'BEGIN{FS=OFS=","}
NR==FNR{gsub(/^"|"$/,"");gsub(/","/,",");a[$1]=$2;b[$1]=$3;next}
$1 in a{$2=b[$1];$1=a[$1];}
1' file2.csv file1.csv
Header2,Header3,Header3,Header4
cat,dog,ccccccc,ddddddd
eeeeeee,fffffff,ggggggg,hhhhhhh
doctor,engineer,kkkkkkk,lllllll
sky,blue,ooooooo,ppppppp
Another way, more verbose, but I think it's better for you to understand (GNU awk):
awk 'BEGIN{FS=OFS=","}
NR==FNR{for(i=1;i<=NF;i++)$i=gensub(/^"(.*)"$/,"\\1",1,$i);a[$1]=$2;b[$1]=$3;next}
$1 in b{$2=b[$1];}
$1 in a{$1=a[$1];}
1' file2.csv file1.csv
Note a pitfall here: since $1 is the key, we should change $1 last.
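To see why the order matters, here is a sketch of the wrong order: once $1 has been replaced, the lookup for $2 uses the new value and finds nothing:
$1 in a{$1=a[$1];}   # $1 becomes e.g. "cat" ...
$1 in b{$2=b[$1];}   # ... and "cat" is not a key of b, so $2 is never replaced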
A case insensitive solution:
awk 'BEGIN{FS=OFS=","}
NR==FNR{gsub(/^"|"$/,"");gsub(/","/,",");k=tolower($1);a[k]=$2;b[k]=$3;next}
{k=tolower($1);if(a[k]){$2=b[k];$1=a[k]}}
1' file2.csv file1.csv
To keep the code concise, I added a variable k and moved the "if" inside the main block.

Package to query tab-separated files in bash

I often have to run very simple queries on tab-separated files in bash, for example summing/counting/max/min all the values in the n-th column. I usually do this in awk on the command line, but I've grown tired of re-writing the same one-line scripts over and over, and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory; obviously this will be necessary in some cases, but it doesn't work for me as I often work with large files.
Wintermute's answer: Tools like q that can run SQL queries directly on CSVs.
Ed Morton's answer: Refer https://stackoverflow.com/a/15765479/1745001
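If all you need is the simple sum case, a tiny shell wrapper around awk already gives the query 'sum($3)' file syntax from the question. The following is only a sketch (the function name and the supported syntax are invented here for illustration, not an existing package):
query() {
    # hypothetical wrapper: query 'sum($N)' file  -- handles only the plain sum case
    col=$(printf '%s\n' "$1" | sed -n 's/^sum(\$\([0-9][0-9]*\))$/\1/p')
    file=$2
    if [ -z "$col" ]; then
        echo "unsupported query: $1" >&2
        return 1
    fi
    awk -v c="$col" '{ s += $c } END { print "sum: " s }' "$file"
}
With the sample test.txt above, query 'sum($3)' test.txt would print sum: 9; the where and group by forms would need the expression translated into an awk condition and an array, respectively, which is roughly what the tools mentioned in the answers do more generally.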