How to count instances of a string in a tab separated value file? - csv

How to count instances of strings in a tab separated value (tsv) file?
The tsv file has hundreds of millions of rows, each of which is of form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
How can I count the instances of each unique integer in the second column over the entire file and, ideally, append the count as a fifth value in each row?
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I prefer a solution using only UNIX command line stream processing programs.

I'm not entirely clear on what you want to do. Do you want to add 0/1, depending on the value of the second column, as the fifth column, or do you want to get the distribution of the values in the second column, in total for the entire file?
In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
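If neither case is quite it, and the goal is to append to every row the total number of rows sharing its second-column value, a two-pass awk sketch reads the file twice and only keeps one counter per distinct value in memory, so it scales to hundreds of millions of rows:

```shell
# Pass 1 (NR==FNR is true only while reading the first copy of the file):
# count how often each distinct column-2 value occurs.
# Pass 2: print every row with its value's total count appended as a fifth column.
awk -F'\t' -v OFS='\t' 'NR==FNR { count[$2]++; next } { print $0, count[$2] }' file file
```

Because the file is read twice, this needs a regular file rather than a pipe.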

One solution using Perl, assuming that the values of the second column are sorted, i.e. once a value like 2 is found, all lines with the same value are consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so it shouldn't be a problem regardless of how big the input file is:
Content of script.pl:
use warnings;
use strict;

my (%lines, $count);

while ( <> ) {
    ## Remove last '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat the line as malformed if it doesn't have four fields and skip it.
    next unless @f == 4;

    ## Save lines in a hash until a different value appears in the second column.
    ## The first line is special, because the hash will always be empty then.
    ## On the last line avoid reading the next one, otherwise I would lose lines
    ## saved in the hash.
    ## The hash will only have one key at a time.
    if ( exists $lines{ $f[1] } or $. == 1 ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next if ! eof;
    }

    ## At this point, the second field of the file has changed (or this is the
    ## last line), so print the previous lines saved in the hash, remove them
    ## and begin saving lines with the new value.

    ## The value of the second column will be the key of the hash, get it now.
    my ($key) = keys %lines;

    ## Read each line of the hash and print it, appending the repeat count as
    ## the last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }

    ## Clear hash.
    %lines = ();

    ## Add the current line to the hash, initialize the counter and repeat the
    ## whole process until end of file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;
}
Content of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
With following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2

Related

import csv to redis with hash datatype

awk -F, 'NR > 0{print "SET", "\"calc_"NR"\"", "\""$0"\"" }' files/calc.csv | unix2dos | redis-cli --pipe
I use the above command to import a csv file into a redis database with the string datatype, something like:
set cal_1 product,cost,quantity
set cal_2 t1,100,5
How do I import it as the hash datatype in awk, with the Redis key built from the row count, the field names from the column headers and the values from the column values?
HMSET calc:1 product "t1" cost 100 quantity 5
HMSET calc:2 product "t2" cost 500 quantity 4
Input file Example:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
Can I get this result from awk, for each row present in the csv file:
HMSET calc_'row no' 1st row 1st column value current row 1st column value 1st row 2nd column value current row 2nd column value 1st row 3rd column value current row 3rd column value
so for the above example,
HMSET calc_1 product t1 cost 100 quantity 5
HMSET calc_2 product t2 cost 500 quantity 4
HMSET calc_3 product t3 cost 600 quantity 9
for all the rows dynamically?
You can use the following awk command:
awk '{if(NR==1){col1=$1; col2=$2; col3=$3}else{product[NR]=$1;cost[NR]=$2;quantity[NR]=$3;tmp=NR}}END{printf "[("col1","col2","col3"),"; for(i=2; i<=tmp;i++){printf "("product[i]","cost[i]","quantity[i]")";}print "]";}' input_file.txt
on your input file:
product cost quantity
t1 100 5
t2 500 4
t3 600 9
it gives the following output:
[(product,cost,quantity),(t1,100,5)(t2,500,4)(t3,600,9)]
awk commands:
# gawk profile, created Fri Dec 29 15:12:39 2017
# Rule(s)
{
if (NR == 1) { # 1
col1 = $1
col2 = $2
col3 = $3
} else {
product[NR] = $1
cost[NR] = $2
quantity[NR] = $3
tmp = NR
}
}
# END rule(s)
END {
printf "[(" col1 "," col2 "," col3 "),"
for (i = 2; i <= tmp; i++) {
printf "(" product[i] "," cost[i] "," quantity[i] ")"
}
print "]"
}
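For the HMSET lines actually asked for, here is a sketch along the same lines that takes the field names from the header row (the calc_ prefix and the file name input_file.txt are taken from the question; adjust as needed):

```shell
# Store the header fields on the first line, then emit one HMSET per data row,
# pairing each header name with the corresponding value from that row.
awk 'NR==1 { for (i=1; i<=NF; i++) hdr[i]=$i; next }
     { printf "HMSET calc_%d", NR-1
       for (i=1; i<=NF; i++) printf " %s %s", hdr[i], $i
       print "" }' input_file.txt
```

The output can be piped to redis-cli --pipe the same way as the SET version in the question.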

Print multiple tcl lists in a uniform manner

I have a group of lists: some with strings, some with numbers and some with both. All these lists have variable lengths. I would like to know the best way to print them to a file so that they all have equal spacing between them.
For example, I use,
set numbers {0 1 2 3 4}
set type {dog reallybigbaddog thisisaevenlargersentence cat bird}
set paths {aaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb ccc ddddddddddddddddd efgh}
puts $fid "NUMBERS\t\tTYPE\tPATHS"
foreach numbersval $numbers typeval $type pathsval $paths {
puts $fid "$numbersval\t\t$typeval\t$pathsval"
}
The result was,
NUMBERS TYPE PATHS
0 dog AAA
1 reallybigbaddog bbbbbbbbbbbbbbbbbbbbbbbb
2 thisisaevenlargersentence ccc
3 cat ddddddddddddddddd
4 bird efgh
I tried using "format" based on one of the suggestions on this site, but that resulted in similar output. I guess we need a way of determining what the longest string is and can't arbitrarily use "\t"? I would appreciate any better suggestions.
For reference, this is how you could do it with struct::matrix and report:
package require struct::matrix
package require report
set nrows 5
set ncols 3
set npads [expr {$ncols + 1}]
struct::matrix m
m add rows $nrows
m add column {0 1 2 3 4}
m add column {dog reallybigbaddog thisisaevenlargersentence cat bird}
m add column {aaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb ccc ddddddddddddddddd efgh}
m insert row 0 {NUMBERS TYPE PATHS}
report::report r $ncols
r data set [lrepeat $npads \t]
m format 2string r
(This uses only a fraction of the formatting power of report.) This method can handle values with spaces in them.
Result (there is a tab character to the left of the first column on each row, but it's lost in the formatting here.):
NUMBERS TYPE PATHS
0 dog aaa
1 reallybigbaddog bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
2 thisisaevenlargersentence ccc
3 cat ddddddddddddddddd
4 bird efgh
Documentation: expr, lrepeat, package, report package, set, struct::matrix package
In this case, I'd call out to column -t to do the work for me:
set all "NUMBERS TYPE PATHS\n"
foreach n $numbers t $type p $paths {
append all "$n $t $p\n"
}
set formatted [exec column -t << $all]
puts $formatted
NUMBERS TYPE PATHS
0 dog aaa
1 reallybigbaddog bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
2 thisisaevenlargersentence ccc
3 cat ddddddddddddddddd
4 bird efgh
A pure Tcl way to do this:
array set maxl {numbers 0 type 0 paths 0}
foreach l {numbers type paths} {
foreach e [concat $l [set $l]] {
if {[set len [string length $e]] > $maxl($l)} {
set maxl($l) $len
}
}
}
puts [format "%-*s %-*s %-*s" $maxl(numbers) NUMBERS $maxl(type) TYPE $maxl(paths) "PATH LISTS"]
foreach n $numbers t $type p $paths {
puts [format "%-*s %-*s %-*s" $maxl(numbers) $n $maxl(type) $t $maxl(paths) $p]
}

awk script to read txt file

How can I create the output below using awk? I couldn't work out the loop for the comma-separated data.
awk '{print "echo "$1"\nadd "$2"\nremove "$3"\nlist "$4}' test.txt
test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
Output:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
I always feel like awk loses a bit of its pizazz when I have to do my own split and loop through the resulting array, but here is a straightforward way using a function to add the second loop in order to process your space-separated fields (that are themselves comma-separated values):
$ cat test.awk
function print_all(label, values) {
split(values, v, ",")
for (i=1; i<=length(v); ++i) {
print label " " v[i]
}
}
{
print "echo " $1
print_all("add", $2)
print_all("remove", $3)
print_all("list", $4)
}
$ cat test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
$ awk -f test.awk test.txt
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
Another alternative approach is a two stage awk
$ awk '{ print "echo " $1; print "add " $2; print "remove " $3; print "list " $4 }' file
| awk -F'[ ,]' 'NF==3{print $1,$2; print $1,$3;next}1'
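Spelled out as a runnable sketch (assuming, as in the sample, at most two comma-separated values per field):

```shell
# Stage 1: one labelled line per input field (values may still contain a comma).
# Stage 2: split on both spaces and commas; a line with three fields carries two
#          values, so print it as two label/value lines. The trailing 1 passes
#          every other line through unchanged.
awk '{ print "echo " $1; print "add " $2; print "remove " $3; print "list " $4 }' test.txt |
awk -F'[ ,]' 'NF==3 { print $1, $2; print $1, $3; next } 1'
```

With more than two values per field, an explicit split loop as in the other answers is the safer choice.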
You have two loops, which is why you have a problem - you need to split your line on whitespace, but then split the subelements on commas.
I would suggest using perl:
#!/usr/bin/env perl
use strict;
use warnings;

my @actions = qw ( echo add remove list );

#iterate the lines
while ( <DATA> ) {
    #split on whitespace
    my @fields = split;
    #iterate actions and fields
    foreach my $action ( @actions ) {
        #split each field on ,
        #print action and field for each.
        print "$action $_\n" for split ( ",", shift @fields );
    }
}
__DATA__
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
This gives us:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
Which I think is what you wanted?
This can be reduced to a one liner:
perl -ane 'for my $act ( qw ( echo add remove list ) ) { print "$act $_\n" for split ",", shift @F }' test.txt
Not necessarily recommended, but if you're looking for compact you could replace the commas with the extra text, including the line breaks:
awk '{
  a = "," $2; b = "," $3; c = "," $4
  gsub(/,/, "\nadd ", a)
  gsub(/,/, "\nremove ", b)
  gsub(/,/, "\nlist ", c)
  print "echo " $1 a b c
}' test.txt
$ cat tst.awk
BEGIN { split("echo add remove list",names) }
{
for (fldNr=1;fldNr<=NF;fldNr++) {
split($fldNr,subFlds,/,/)
for (subFldNr=1;subFldNr in subFlds; subFldNr++) {
print names[fldNr], subFlds[subFldNr]
}
}
}
$ awk -f tst.awk file
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
awk -F" " '{for(i=1;i<=NF;i++){a[i]=$i;} {print "echo "a[1]"\n""add "a[2]"\nremove "a[3]"\nlist "a[4];}}' filename | awk -F" " '{sub(/,/,"\n"$1" ",$0);print}'
The above code can be used. I would also welcome input from others on an optimized version of this snippet.

Write partly tab-delimited data to MySQL database

I have a MySQL database with 7 columns (chr, pos, num, iA, iB, iC, iD) and a file that contains 40 million lines, each holding one data set. Each line has 4 tab-delimited columns, where the first three columns always contain data and the fourth column can contain several different key=value pairs separated by semicolons:
chr pos num info
1 10203 3 iA=0.34;iB=nerv;iC=45;iD=dskf12586
1 10203 4 iA=0.44;iC=45;iD=dsf12586;iB=nerv
1 10203 5
1 10213 1 iB=nerv;iC=49;iA=0.14;iD=dskf12586
1 10213 2 iA=0.34;iB=nerv;iD=cap1486
1 10225 1 iD=dscf12586
The key=value pairs in the info column have no specific order. I'm also not sure whether a key can occur twice (I hope not).
I'd like to write the data into the database. The first three columns are no problem, but extracting the values from the info column puzzles me, since the key=value pairs are unordered and not every key has to be present in a line.
For a similar data set (with an ordered info column) I used a Java program with regular expressions, which allowed me to (1) check and (2) extract the data, but now I'm stranded.
How can I solve this task, preferably with a bash script or directly in MySQL?
You did not mention exactly how you want to write the data, but the example below with awk shows how you can get each individual key and value in each line. Instead of the printf, you can use your own logic to write the data.
[[bash_prompt$]]$ cat test.sh; echo "###########"; awk -f test.sh log
{
if(length($4)) {
split($4,array,";");
print "In " $1, $2, $3;
for(element in array) {
key=substr(array[element],0,index(array[element],"="));
value=substr(array[element],index(array[element],"=")+1);
printf("found %s key and %s value for %d line from %s\n",key,value,NR,array[element]);
}
}
}
###########
In 1 10203 3
found iD= key and dskf12586 value for 1 line from iD=dskf12586
found iA= key and 0.34 value for 1 line from iA=0.34
found iB= key and nerv value for 1 line from iB=nerv
found iC= key and 45 value for 1 line from iC=45
In 1 10203 4
found iB= key and nerv value for 2 line from iB=nerv
found iA= key and 0.44 value for 2 line from iA=0.44
found iC= key and 45 value for 2 line from iC=45
found iD= key and dsf12586 value for 2 line from iD=dsf12586
In 1 10213 1
found iD= key and dskf12586 value for 4 line from iD=dskf12586
found iB= key and nerv value for 4 line from iB=nerv
found iC= key and 49 value for 4 line from iC=49
found iA= key and 0.14 value for 4 line from iA=0.14
In 1 10213 2
found iA= key and 0.34 value for 5 line from iA=0.34
found iB= key and nerv value for 5 line from iB=nerv
found iD= key and cap1486 value for 5 line from iD=cap1486
In 1 10225 1
found iD= key and dscf12586 value for 6 line from iD=dscf12586
Awk solution from @abasu, extended with INSERTs, that also handles the unordered key-value pairs.
parse.awk :
NR>1 {
col["iA"]=col["iB"]=col["iC"]=col["iD"]="null";
if(length($4)) {
split($4,array,";");
for(element in array) {
split(array[element],keyval,"=");
col[keyval[1]] = "'" keyval[2] "'";
}
}
print "INSERT INTO tbl VALUES (" $1 "," $2 "," $3 "," col["iA"] "," col["iB"] "," col["iC"] "," col["iD"] ");";
}
Test/run :
$ awk -f parse.awk file
INSERT INTO tbl VALUES (1,10203,3,'0.34','nerv','45','dskf12586');
INSERT INTO tbl VALUES (1,10203,4,'0.44','nerv','45','dsf12586');
INSERT INTO tbl VALUES (1,10203,5,null,null,null,null);
INSERT INTO tbl VALUES (1,10213,1,'0.14','nerv','49','dskf12586');
INSERT INTO tbl VALUES (1,10213,2,'0.34','nerv',null,'cap1486');
INSERT INTO tbl VALUES (1,10225,1,null,null,null,'dscf12586');
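For 40 million rows, executing one INSERT per line is slow. An alternative sketch (the table name tbl and the column order chr, pos, num, iA, iB, iC, iD are assumptions carried over from the answer above) is to emit a clean TSV, with \N marking SQL NULL, and bulk-load it:

```shell
# Normalize each data row to exactly seven tab-separated columns,
# writing \N (the LOAD DATA NULL marker) for any missing key.
awk -v OFS='\t' 'NR>1 {
    col["iA"] = col["iB"] = col["iC"] = col["iD"] = "\\N"
    n = split($4, kv, ";")
    for (i = 1; i <= n; i++) {
        split(kv[i], pair, "=")
        col[pair[1]] = pair[2]
    }
    print $1, $2, $3, col["iA"], col["iB"], col["iC"], col["iD"]
}' file > out.tsv
```

Then a single statement such as LOAD DATA LOCAL INFILE 'out.tsv' INTO TABLE tbl; replaces 40 million round trips.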

Parsing line and selecting values corresponding to a key

There is a set of data which is arranged in a specific manner (as a tree), as given below: basically key=value pairs, with some additional values at the end which indicate how many children the branch has, plus some junk value.
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
My problem is to get the branch from the above data which matches exactly the values I give as input.
Suppose my input values to search for in the above data stack are:
5=0
4=0
6=10
3=11
11=1
1=45
0=138
9=16
2=34
7=0
8=0
For the above list of key=value pairs, the function should return 11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 as the match.
Likewise, for another input file in which another set of keys is given:
5=0
4=0
6=10
3=11
11=1
1=45
9=16
2=34
7=0
8=0
the function should return 11=1>1=45>9=16>2=34>7=0>8=0 1 as the match, not the last line; that one would also match all the values given in my input keys, but I want only the exact match.
Also, I want to find out how many nodes were selected in the returned branch (separated by >).
What would be the best way to implement this kind of scenario?
use strict;
use warnings;

my $tree;

while (<DATA>) {
    my @data = split /\>/, (/^([^ ]*)/)[0];
    my $ptr = \$tree;
    for my $key (@data) {
        $ptr = \$$ptr->{$key};
    }
}

my @inputs = (
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 0=138 9=16 2=34 7=0 8=0)],
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 9=16 2=34 7=0 8=0)]
);

sub getKey {
    my ( $lu, $node ) = @_;
    exists $lu->{$_} and return $_ for keys %$node;
}

for my $input (@inputs) {
    my %lu;
    @lu{@$input} = ();
    my @result;
    my $node = $tree;
    while (%lu) {
        my $key = getKey( \%lu, $node );
        if ($key) {
            $node = $node->{$key};
            push @result, $key;
            delete $lu{$key};
        }
        else {
            last;
        }
    }
    print join( '>', @result ), "\n";
}
__DATA__
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0