Enhance awk script by printing top 5 occurring data elements from each column - csv

I have an awk script that processes a csv file and produces a report that counts, for each column named in the header row, the number of rows containing data matching /[A-Za-z0-9]/. What I would like to do is enhance the script to also print the top 5 most duplicated data elements in each column.
Here is sample data:
Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
This is the current script, to which SiegeX has substantially contributed:
$ cat matrix2.awk
NR==1{
    for (i=1; i<=NF; i++)
        head[i] = $i
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i && !arr[i,$i]++)
            n[i]++
        if (arr[i,$i] > 1)
            f[i] = 1
    }
}
END{
    for (i=1; i<=length(head); i++) {
        printf("%-6d%s\n", n[i], head[i])
        if (f[i]) {
            for (x in arr) {
                split(x, b, SUBSEP)
                if (b[1]==i && b[2])
                    printf("% -6d %s\n", arr[i,b[2]], b[2])
            }
        }
    }
}
This is the current output:
$ awk -F "|" -f matrix2.awk testlist.csv
20    Food
6     Type
 6     Fruit
 4     Vegetable
 3     Meat
 1     Fish
 4     Dairy
 1     Bread
2     Spicy
 17    No
 2     Yes
And this is the desired output:
$ awk -F "|" -f matrix2.awk testlist.csv
20    Food
6     Type
 6     Fruit
 4     Vegetable
 4     Dairy
 3     Meat
 1     Fish
2     Spicy
 17    No
 2     Yes
The only thing left that I would like to add is a general function that limits each column's output to the top 5 most duplicated fields. As mentioned below, a columnar version of sort | uniq -c | sort -nr | head -5.
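A minimal sketch of that columnar top 5, assuming GNU awk 4.0+ (it relies on true multidimensional arrays and PROCINFO["sorted_in"]; cnt and n are illustrative names):

awk -F'|' '
NR==1 { for (i=1; i<=NF; i++) head[i] = $i; next }
{ for (i=1; i<=NF; i++) if ($i != "" && !cnt[i][$i]++) n[i]++ }
END {
    PROCINFO["sorted_in"] = "@val_num_desc"   # for-in loops now walk counts high to low
    for (i=1; i in head; i++) {
        printf "%-6d%s\n", n[i], head[i]
        top = 0
        for (v in cnt[i]) {
            if (++top > 5) break              # keep only the top 5 per column
            printf " %-6d%s\n", cnt[i][v], v
        }
    }
}' testlist.csv

Setting PROCINFO["sorted_in"] to "@val_num_desc" makes each for (v in cnt[i]) loop visit values from the highest count down, so the loop can simply break after five.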

The following script is both extensible and scalable, as it will work with an arbitrary number of columns; nothing is hardcoded:
awk -F'|' '
NR==1{
    for (i=1; i<=NF; i++)
        head[i] = $i
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i && !arr[i,$i]++)
            n[i]++
        if (arr[i,$i] > 1)
            f[i] = 1
    }
}
END{
    for (i=1; i<=length(head); i++) {
        printf("%-32s%d\n", head[i], n[i])
        if (f[i]) {
            for (x in arr) {
                split(x, b, SUBSEP)
                if (b[1]==i && b[2])
                    printf(" %-28s%d\n", b[2], arr[i,b[2]])
            }
        }
    }
}' infile
Output
$ ./report
Food                            9
Type                            5
 Meat                        2
 Bread                       1
 Vegetable                   2
 Fruit                       2
 Fish                        1
Spicy                           2
 Yes                         2
 No                          6
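To bolt the top-5 limit onto the script above without GNU extensions, a sketch that replaces its END block and pipes each column's counts through the very sort | head pipeline mentioned in the question (counts are printed first so sort -rn can key on them; fflush() and close() keep each header ahead of its piped output):

END{
    for (i=1; i<=length(head); i++) {
        printf("%-32s%d\n", head[i], n[i])
        fflush()                           # emit the header before the subshell writes
        if (f[i]) {
            cmd = "sort -rn | head -5"     # the columnar sort | uniq -c | sort -nr | head -5
            for (x in arr) {
                split(x, b, SUBSEP)
                if (b[1]==i && b[2])
                    printf(" %-6d%s\n", arr[i,b[2]], b[2]) | cmd
            }
            close(cmd)                     # flush this column's top 5 before the next header
        }
    }
}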

Not a complete solution, but something to get you started -
awk -F"|" '
NR>1{
a[$1]++;
b[$2]++;
c[$3]++}
END{
print "Food\t\t\t" length(a);
print "Type\t\t\t" length(b);
for (x in b)
if (x!="")
{
printf ("\t%-16s%s\n",x,b[x]);
}
print "Spicy\t\t\t" length(c);
for (y in c)
if (y!="")
{
printf ("\t%-16s%d\n",y,c[y])
}
}' testlist.csv
TEST:
[jaypal:~/Temp] cat testlist.csv
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
[jaypal:~/Temp] awk -F"|" 'NR>1{a[$1];b[$2]++;c[$3]++}END{print "Food\t\t\t" length(a); print "Type\t\t\t"length(b); for (x in b) if (x!="") printf ("\t%-16s%s\n",x,b[x]) ;print "Spicy\t\t\t"length(c); for (y in c) if (y!="") {printf ("\t%-16s%d\n",y,c[y])}}' testlist.csv
Food			9
Type			6
	Fruit           2
	Vegetable       2
	Bread           1
	Meat            2
	Fish            1
Spicy			3
	Yes             2
	No              6

Related

How can I clean a TSV file having record or field separators in one of its fields?

Given a TSV file with a col2 that can contain a field or record separator (FS/RS), respectively a tab or a newline, escaped by being surrounded with quotes.
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
+------+---------+------+
| col1 | col2    | col3 |
+------+---------+------+
| 1    | "A B"   | 1234 |
| 2    | "CD     | 567  |
|      | EF"     |      |
+------+---------+------+
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
+------+---------+------+
| col1 | col2    | col3 |
+------+---------+------+
| 1    | "A B"   | 1234 |
| 2    | "CD EF" | 567  |
+------+---------+------+
I cannot get miller 6.2 to make the proper transformation (tried with DSL put/gsub) because it doesn't recognize the tab or newline as being part of a quoted field, which breaks the field count:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr : mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
A good library cleanly handles things like embedded newlines and quoted separators in fields.
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;

my $file = shift // die "Usage: $0 filename\n";
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    s/\s+/ /g for @$row;   # collapse multiple spaces, tabs, newlines
    $csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities.
This can fit in a one-liner; its functional interface (with csv) is particularly well suited for that.
If you run
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace
col1 col2 col3
1 A B 1234
2 CD EF 567
I'm using mlr 6.2.
A way to do it in miller 5 is to simply use the put verb:
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --tsv put -S 'for (k in $*) {$[k] = gsub($[k], "\n", " ")}' then clean-whitespace
perl -MText::CSV_XS=csv -e'
csv
    in       => *ARGV,
    on_in    => sub { s/\s+/ /g for @{$_[1]} },
    sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer.
Can be placed all on one line.
Input is accepted from file named by argument or STDIN.
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
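To see that mechanism at work, a small hedged sketch (GNU awk again) that just prints each record alongside the RT that delimited it:

awk -v RS='"([^"]|"")*"' '{ printf "REC=[%s] RT=[%s]\n", $0, RT }' file

Each RT is one complete quoted string, and each REC is the unquoted text between them, which is why substituting inside RT only touches separators that are protected by quotes.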
You absolutely don't need gawk to get this done. Here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"")(FS)(ORS = _) \
         (RS = "^$"), _+=_^=_<_, _)
}
END {
    printbefore()
    for (_^=_<_; _<=NF; _++) {
        sub(/[\t-\r]+/, ($_~__) ? " " : "&", $_)
    }
    print
}
function printbefore(_)
{
    printf("\n\n--before--\n%s\n------"\
           "------AFTER------\n\n", $+_) > ("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
Strip out the printbefore() part, which is more for debugging purposes, and then it's just:
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"") FS \
         (ORS = _) (RS = "^$"), _+=_^=_<_, _)
}
END {
    for (--_; _<=NF; _++) {
        sub(/[\t-\r]+/, $_~__ ? " " : "&", $_)
    }
    print
}'

Use multiple lines with Awk

I have a CSV that has underscore delimiters. I have 8 lines that need to be converted to one in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(in different boxes preferably)
The problem is that I don't know how to use multiple lines in awk to form a single line. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(last part all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:
use strict;
use warnings;

my %hash = ();

while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1))} }, $2;
    }
}

print "$_ " . join(' ', @{ $hash{$_} }) . "\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, as a Perl one-liner:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1))} }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ " . join(" ", @{ $hash{$_} }) . "\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
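Since the question asked about awk, here is a hedged awk sketch of the same grouping idea, assuming the first three underscore-separated fields form the key and everything after the variable name is the value (out and order are illustrative names):

awk -F'_' '
{
    key = $1 " " $2 " " $3
    val = $5                      # the value starts at the 5th field...
    for (i=6; i<=NF; i++)
        val = val "_" $i          # ...rejoined, since values like found_berry contain underscores
    if (!(key in out)) {
        order[++n] = key          # remember first-appearance order of each key
        out[key] = key
    }
    out[key] = out[key] " " val
}
END {
    for (i=1; i<=n; i++)
        print out[order[i]]
}' file.txt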

Filtering CSV files by multiple columns, sort them and create 2 new files

I have been searching for how to do the following for a couple of hours and could not find it. I apologize if I am repeating something.
I have 22 csv files with 14 columns and 17,392 lines in each. I am using awk to filter the original files using the following commands:
First I need to get the lines that have values in column 14 smaller than 0.05:
awk -F '\t' '$14 < 0.05 { print $0 }' file1 > file2
Next I need to get the lines where column 10 is smaller than -1, and those where it is greater than 1:
awk -F '\t' '$10 < -1 { print $0 }' file2 > file3
awk -F '\t' '$10 > 1 { print $0 }' file2 > file4
My last step is to get the lines that have values in column 7 OR 8 higher than 1 (e.g. column 7 could be 0 if column 8 is 1):
awk -F '\t' '$7<=1 {print $0}' file3 > file5
awk -F '\t' '$8>=1 {print $0}' file4 > file6
My problem is that I create several intermediate files. I would need just two files at the end: file3 and file4, where columns 7 or 8 have values equal to or greater than 1. How can I make an awk command that does all of that at once?
Thank you.
Your question is ambiguous, so there are many possible answers. However, you can combine conditions in awk and you can write to separate files in a single pass, so you might mean:
awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 > 1 { print > "file5" }
$14 < 0.05 && $10 > +1 && $8 > 1 { print > "file6" }' file1
This command should give you the same output in file5 and file6 as you got from your original sequence of operations, but it makes only one pass over the data instead of many. (Strictly, it produces the same answer only if you change your $7<=1 to $7>1 to agree with your description of wanting column 7 or 8 higher than 1, though that contradicts your example 'on 7 could be 0 if on 8 it is 1'.)
Given an input file:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
The output in file5 is:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
and the output in file6 is:
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
If you need to combine the conditions differently, then you need to clarify your question.
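For instance, if the intent really is "column 7 or column 8 at least 1", a hedged variant of the same one-pass approach (file names taken from the question):

awk -F '\t' '
$14 < 0.05 && $10 < -1 && ($7 >= 1 || $8 >= 1) { print > "file5" }
$14 < 0.05 && $10 >  1 && ($7 >= 1 || $8 >= 1) { print > "file6" }
' file1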
You could try:
awk -F'\t' '($14 < 0.05) && ($10 < -1) && ($7 <= 1) {print}' file1 > file3

Dynamically edit lists within a csv file

I have a csv file that looks like that:
col1|col2
1|a
2|g
3|f
1|m
3|k
2|n
2|a
1|d
4|r
3|s
where | separates the columns, and I would like to transform it into something homogeneous like:
------------------------
fields >  1  2  3  4
record1   a  g  f
record2   m  n  k
record3   d  a  s  r
------------------------
Is there a way to do that? What would be better: using MySQL, or editing the csv file?
I wrote this; it works for your example (gawk is required for asorti). Note that RS="" puts awk in paragraph mode, where records are separated by blank lines and newlines also act as field separators, which is why the fields alternate between column number and value:
awk -F'|' -v RS="" '{
    for (i=1; i<=NF; i+=2)
        a[$i] = $(i+1)
    asorti(a, d)
    for (i=1; i<=length(a); i++)
        printf "%s", a[d[i]] ((i == length(a)) ? "" : " ")
    delete a; delete d
    print ""
}' file
example:
kent$ cat file
1|a
2|g
3|f
1|m
3|k
2|n
2|a
1|d
4|r
3|s
kent$ awk -F'|' -v RS="" '{for(i=1;i<=NF;i+=2)a[$i]=$(i+1);asorti(a,d);
for(i=1;i<=length(a);i++)printf "%s", a[d[i]]((i==length(a))?"":" ");delete a;delete d;print ""}' file
a g f
m n k
d a s r
Here is an awk solution:
BEGIN {
    RS=""
    FS="\n"
}
FNR==NR && FNR>1 {
    for (i=1; i<=NF; i++) {
        split($i, d, "|")
        if (d[1] > max)
            max = d[1]
    }
    next
}
FNR>1 && !header {
    printf "%s\t", "fields >"
    for (i=1; i<=max; i++)
        printf "%s\t", i
    print ""
    header = 1
}
FNR>1 {
    printf "record%s\t\t", FNR-1
    for (i=1; i<=NF; i++) {
        split($i, d, "|")
        val[d[1]] = d[2]
    }
    for (i=1; i<=max; i++)
        printf "%s\t", val[i] ? val[i] : "NULL"
    print ""
    delete val
}
Save it as script.awk and run it like this (note that it uses a two-pass approach, so you need to give the file twice):
$ awk -f script.awk file file
fields >	1	2	3	4
record1		a	g	f	NULL
record2		m	n	k	NULL
record3		d	a	s	r
Adding the line 5|b to the first record in file gives the output:
$ awk -f script.awk file file
fields >	1	2	3	4	5
record1		a	g	f	NULL	b
record2		m	n	k	NULL	NULL
record3		d	a	s	r	NULL
$ cat file
col1|col2
1|a
2|g
3|f
5|b
1|m
3|k
2|n
2|a
1|d
4|r
3|s
$
$ awk -f tst.awk file
fields >    1    2    3    4    5
record1     a    g    f NULL    b
record2     m    n    k NULL NULL
record3     d    a    s    r NULL
$
$ cat tst.awk
BEGIN{ RS=""; FS="\n" }
NR>1 {
    ++numRecs
    for (i=1; i<=NF; i++) {
        split($i, fldNr2val, "|")
        fldNr = fldNr2val[1]
        val   = fldNr2val[2]
        recNrFldNr2val[numRecs,fldNr] = val
        numFlds = (fldNr > numFlds ? fldNr : numFlds)
    }
}
END {
    printf "fields >"
    for (fldNr=1; fldNr<=numFlds; fldNr++) {
        printf " %4s", fldNr
    }
    print ""
    for (recNr=1; recNr<=numRecs; recNr++) {
        printf "record%d ", recNr
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            printf " %4s", ((recNr,fldNr) in recNrFldNr2val ? recNrFldNr2val[recNr,fldNr] : "NULL")
        }
        print ""
    }
}

Using AWK to get text and looping in csv

I'm new to awk and I want to ask...
I have a csv file like this:
IVALSTART IVALEND IVALDATE
23:00:00 23:30:00 4/9/2012

STATUS LSN LOC
K lskpg 1201
K lntrjkt 1201
K lbkkstp 1211
and I want to change it to look like this:
IVALSTART IVALEND
23:00:00 23:30:00

STATUS LSN LOC IVALDATE
K lskpg 1201 4/9/2012
K lntrjkt 1201 4/9/2012
K lbkkstp 1211 4/9/2012
How can I do this in awk?
Thanks and best regards!
Try this:
awk '
NR == 1 { name = $3; print $1, $2 }   # remember the IVALDATE header, print the first two headers
NR == 2 { date = $3; print $1, $2 }   # remember the date value
NR == 3 { print "" }                  # the blank separator line
NR == 4 { $4 = name; print }          # append IVALDATE to the STATUS header line
NR > 4  { $4 = date; print }          # append the date to each data row
' FILE
If you need formatting, change print to printf with appropriate format specifiers.
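For example, a sketch with fixed column widths (the widths are illustrative assumptions, not part of the original answer):

awk '
NR == 1 { name = $3; printf "%-10s %s\n", $1, $2 }
NR == 2 { date = $3; printf "%-10s %s\n", $1, $2 }
NR == 3 { print "" }
NR == 4 { printf "%-7s %-9s %-5s %s\n", $1, $2, $3, name }
NR > 4  { printf "%-7s %-9s %-5s %s\n", $1, $2, $3, date }
' FILE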