awk script to read txt file - csv

How can I create the output below using awk? I couldn't work out the loop for the comma-separated data. This is my attempt:
awk '{print "echo "$1"\nadd "$2"\nremove "$3"\nlist "$4}' test.txt
test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
Output:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer

I always feel like awk loses a bit of its pizazz when I have to do my own split and loop through the resulting array, but here is a straightforward way, using a function for the second loop, to process your space-separated fields (which are themselves comma-separated values):
$ cat test.awk
function print_all(label, values,    v, n, i) {
    # split() returns the element count, which is more portable than
    # length(v) on an array (a gawk extension)
    n = split(values, v, ",")
    for (i = 1; i <= n; ++i) {
        print label " " v[i]
    }
}
{
    print "echo " $1
    print_all("add", $2)
    print_all("remove", $3)
    print_all("list", $4)
}
$ cat test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
$ awk -f test.awk test.txt
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer

Another alternative is a two-stage awk pipeline:
$ awk '{print "echo " $1; print "add " $2; print "remove " $3; print "list " $4}' file |
awk -F'[ ,]' 'NF==3{print $1,$2; print $1,$3;next}1'
The first stage emits one labelled line per field; the second splits any labelled line whose value still contains a comma.
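As a sanity check, here is what the first stage alone emits for the first record (the comma-joined values that the second stage then splits):
$ awk '{print "echo " $1; print "add " $2; print "remove " $3; print "list " $4}' file | head -4
echo 1
add abc,bcd
remove xyz,yza
list qwe,wer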

You need two loops, which is why you're having a problem: you need to split each line on whitespace, but then split the subelements on commas.
I would suggest using perl:
#!/usr/bin/env perl
use strict;
use warnings;

my @actions = qw ( echo add remove list );

# Iterate the lines.
while ( <DATA> ) {
    # Split on whitespace.
    my @fields = split;
    # Iterate actions and fields in parallel.
    foreach my $action ( @actions ) {
        # Split each field on "," and print the action with each value.
        print "$action $_\n" for split( ",", shift @fields );
    }
}
__DATA__
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
This gives us:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
Which I think is what you wanted?
This can be reduced to a one-liner:
perl -ane 'for my $act ( qw ( echo add remove list ) ) { print "$act $_\n" for split ",", shift @F }' test.txt

Not necessarily recommended, but if you're looking for compact, you could replace the commas with the extra text, including the line breaks (prepending a comma to each field means every value, including the first, gets a label):
a = "," $2; b = "," $3; c = "," $4;
gsub(/,/, "\nadd ", a);
gsub(/,/, "\nremove ", b);
gsub(/,/, "\nlist ", c);
print "echo " $1 a b c

$ cat tst.awk
BEGIN { split("echo add remove list",names) }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        split($fldNr,subFlds,/,/)
        for (subFldNr=1; subFldNr in subFlds; subFldNr++) {
            print names[fldNr], subFlds[subFldNr]
        }
    }
}
$ awk -f tst.awk file
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer

awk -F" " '{for(i=1;i<=NF;i++){a[i]=$i;} {print "echo "a[1]"\n""add "a[2]"\nremove "a[3]"\nlist "a[4];}}' filename | awk -F" " '{sub(/,/,"\n"$1" ",$0);print}'
The above code can be used. I would also welcome input from others on a more optimized version; one possibility is sketched below.
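One possible single-pass variant (a sketch, only checked against the sample data above): each comma marks the start of another value for the same label, so gsub() can expand the fields in place and the second awk stage disappears entirely:
$ awk '{
    gsub(/,/, "\nadd ", $2)      # "abc,bcd" becomes "abc\nadd bcd"
    gsub(/,/, "\nremove ", $3)
    gsub(/,/, "\nlist ", $4)
    print "echo " $1 "\nadd " $2 "\nremove " $3 "\nlist " $4
}' filename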

Related

awk extract rows with N columns

I have a tsv file with a varying number of columns
1 123 123 a b c
1 123 b c
1 345 345 a b c
I would like to extract only rows with 6 columns
1 123 123 a b c
1 345 345 a b c
How can I do that in bash (awk, sed or something else)?
Using Awk
$ awk -F'\t' 'NF==6' file
1 123 123 a b c
1 345 345 a b c
FYI, most of the existing solutions have one potential pitfall:
echo "1\t2\t3\t4\t5\t" |
mawk '$!NF = "\n\n\t NF == "( NF ) \
" :\f\b<( "( $_ )" )>\n\n"' FS='\11'
NF == 6 :
<( 1 2 3 4 5 )>
If the input file happens to have a trailing tab (\t), awk will still report an NF count of 6. Whether such a line logically has 5 columns or 6 is open to interpretation.
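If that edge case matters for your data, one way to guard against it (assuming an empty trailing field should not count as a column) is to also require a non-empty last field:
$ awk -F'\t' 'NF==6 && $NF!=""' file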
Using GNU sed: let the content of file.txt be
1 123 123 a b c
1 123 b c
1 345 345 a b c
1 777 777 a b c d
then
sed -n '/^[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*$/p' file.txt
gives the output
1 123 123 a b c
1 345 345 a b c
Explanation: -n turns off default printing, so the sole action is to print (p) lines matching the pattern, which is anchored at the beginning (^) and end ($) and consists of 6 columns of non-TABs separated by single TABs. This uses only very basic sed features but, as you can see, it is longer than the awk version and not as easy to adjust for a different N.
(tested in GNU sed 4.2.2)
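If a different N is needed, one workaround (a sketch: it builds the same kind of pattern with a BRE interval, so the shell does the counting instead of you editing the regex by hand) is:
$ n=6
$ sed -n "/^[^\t]*\(\t[^\t]*\)\{$((n-1))\}\$/p" file.txt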
This might work for you (GNU sed):
sed -nE 's/\S+/&/6p' file
This will print lines with 6 or more fields.
sed -nE 's/\S+/&/6;T;s//&/7;t;p' file
This will print lines with only 6 fields.

How can I clean a TSV file having record or fields separators in one of its fields?

Given a TSV file with a col2 that can contain either the field or the record separator (respectively a tab or a newline), escaped/surrounded by quotes.
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
+------+---------+------+
| col1 | col2    | col3 |
+------+---------+------+
| 1    | "A B"   | 1234 |
| 2    | "CD     | 567  |
|      | EF"     |      |
+------+---------+------+
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
+------+---------+------+
| col1 | col2    | col3 |
+------+---------+------+
| 1    | "A B"   | 1234 |
| 2    | "CD EF" | 567  |
+------+---------+------+
I cannot get miller 6.2 to make the proper transformation (I tried with the DSL put/gsub verbs) because it doesn't recognize the tab or the newline as being part of a column, which breaks the field count:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr : mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
A good library cleanly handles things like embedded newlines and quoted separators (in fields).
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;

my $file = shift // die "Usage: $0 filename\n";

my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );

open my $fh, '<', $file or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    s/\s+/ /g for @$row;   # collapse multiple spaces, tabs, newlines
    $csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities.
This can fit in a one-liner; its functional interface (with csv) is particularly well suited for that.
If you run
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace
col1 col2 col3
1 A B 1234
2 CD EF 567
I'm using mlr 6.2.
A way to do it in miller 5 is to simply use the put verb:
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --tsv put -S 'for (k in $*) {$[k] = gsub($[k], "\n", " ")}' then clean-whitespace
The Text::CSV one-liner mentioned above, using the module's functional csv interface:
perl -MText::CSV_XS=csv -e'
    csv in       => *ARGV,
        on_in    => sub { s/\s+/ /g for @{$_[1]} },
        sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer.
Can be placed all on one line.
Input is accepted from a file named as an argument, or from STDIN.
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
You absolutely don't need gawk to get this done. Here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"")(FS)(ORS = _) \
                (RS = "^$"), _+=_^=_<_, _)
}
END {
    printbefore()
    for (_^=_<_; _<=NF; _++) {
        sub(/[\t-\r]+/, ($_ ~ __) ? " " : "&", $_)
    }
    print
}
function printbefore(_)
{
    printf("\n\n--before--\n%s\n------" \
           "------AFTER------\n\n", $+_) > ("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
Strip out the printbefore() part, which is mostly there for debugging, and it reduces to:
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"") FS \
                (ORS = _)(RS = "^$"), _+=_^=_<_, _)
}
END {
    for (--_; _<=NF; _++) {
        sub(/[\t-\r]+/, $_ ~ __ ? " " : "&", $_)
    }
    print
}'

TCL how to traverse and print all values for a variable having string and list

I have a 3rd-party API that gives me the output below:
puts [GetDesc $desc " "] ;# prints the data below
#A_Name 9023212134(M) emp#121 M { 41 423 }
How can I access all the tokens of the printed value, and the list { 41 423 }?
The output is a list of 5 items, where the last is a list of two elements. You extract elements of a list using lindex:
set var {A_Name 9023212134(M) emp#121 M { 41 423 }}; # A_Name 9023212134(M) emp#121 M { 41 423 }
lindex $var 0; # A_Name
lindex $var 4; # 41 423 (Note: leading and trailing spaces are preserved)
lindex $var 4 0; # 41
lindex $var 4 1; # 423
You can iterate over the values in the result with foreach:
foreach value [GetDesc $desc " "] {
puts ">>> $value <<<"
}
This will print something like this (note the extra spaces with the last item; they're part of the value):
>>> A_Name <<<
>>> 9023212134(M) <<<
>>> emp#121 <<<
>>> M <<<
>>> 41 423 <<<
Another approach is to use lassign to put those values into variables:
lassign [GetDesc $desc " "] name code1 code2 code3 pair_of_values
Then you can work with $pair_of_values easily enough on its own.

Use multiple lines with Awk

I have a CSV-like file that uses underscore delimiters. I have 8 lines that need to be converted to one in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(in different boxes preferably)
The problem is that I don't know how to use multiple lines in awk to form a single line. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(the last part is all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:
use strict;
use warnings;

my %hash = ();

while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{ join(' ', split('_', $1)) } }, $2;
    }
}

print "$_ " . join(' ', @{ $hash{$_} }) . "\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, as a Perl one-liner:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1) )} }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ ". join(" ", @{ $hash{$_} })."\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
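Since the question asked for awk: here is a hedged awk equivalent of the same idea (a sketch; like the Perl versions it assumes the first three underscore-separated fields form the key and everything after the fourth underscore is the value):
$ awk -F'_' '{
    key = $1 " " $2 " " $3                       # first three fields form the key
    val = $0
    sub(/^[^_]*_[^_]*_[^_]*_[^_]*_/, "", val)    # strip the key and the variable name
    out[key] = (key in out) ? out[key] " " val : key " " val
}
END { for (k in out) print out[k] }' file.txt
(If the input contains several distinct keys, the output order of the groups is unspecified.)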

How to count instances of string in a tab separated value file?

How do I count instances of strings in a tab-separated value (tsv) file?
The tsv file has hundreds of millions of rows, each of which is of the form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
How can I count the instances of each unique integer in the entire second column of the file, and ideally add the count as the fifth value in each row?
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I prefer a solution using only UNIX command line stream processing programs.
I'm not entirely clear what you want to do. Do you want to add 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, totalled over the entire file?
In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
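To actually append the count as a fifth column, as in the expected output, a common two-pass awk idiom works: the first pass builds the counts, the second prints each line with its count appended. It reads the file twice, so it needs a real file rather than a pipe:
awk -F'\t' -v OFS='\t' 'NR==FNR { cnt[$2]++; next } { print $0, cnt[$2] }' file file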
One solution using perl, assuming that the values of the second column are sorted, i.e. when a value like 2 is found, all lines with that value are consecutive. The script keeps lines until it finds a different value in the second column, prints them with the count, and frees the memory, so it shouldn't be a problem regardless of how big the input file is:
Content of script.pl:
use warnings;
use strict;

my (%lines, $count);

while ( <> ) {
    ## Remove the trailing '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat lines without exactly four fields as malformed and skip them.
    next unless @f == 4;

    ## Save lines in a hash until a different value shows up in the second
    ## column. The first line is special, because the hash is still empty.
    ## On the last line, avoid reading the next one, otherwise the lines
    ## saved in the hash would be lost.
    ## The hash only ever holds one key at a time.
    if ( exists $lines{ $f[1] } or $. == 1 ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next if ! eof;
    }

    ## At this point the second field has changed (or this is the last line),
    ## so print the lines saved in the hash, clear it, and begin saving lines
    ## with the new value.

    ## The value of the second column is the key of the hash; get it now.
    my ($key) = keys %lines;

    ## Print each saved line, appending the repeat count as the last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }

    ## Clear the hash.
    %lines = ();

    ## Add the current line to the hash, reset the counter, and repeat
    ## until end of file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;

    ## A group that starts on the very last line would otherwise never be
    ## printed, so flush it here.
    if ( eof and $f[1] ne $key ) {
        printf qq[%s\t%d\n], $_, $count;
    }
}
Content of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
This produces the following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2