I have a CSV that uses underscore delimiters. I have 8 lines that need to be combined into one, in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(preferably in separate fields)
The problem is that I don't know how to use multiple lines in awk to form a single line. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(the last part is all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
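For what it's worth, the same merging can be sketched directly in awk (an untested sketch, assuming the first three underscore-separated fields form the group key, the fourth is the variable name to drop, and the rest of the line, which may itself contain underscores, is the value):

awk -F'_' '{
    key = $1 " " $2 " " $3                      # first three fields identify the group
    rest = $0
    sub(/^[^_]*_[^_]*_[^_]*_[^_]*_/, "", rest)  # strip the key fields plus the variable name
    vals[key] = vals[key] " " rest              # append this line's value
}
END { for (k in vals) print k vals[k] }' file.txt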
You can use Perl:
use strict;
use warnings;
my %hash = ();
while (<DATA>) {
    # key = the leading digits/underscores; skip the variable name; capture the value
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1))} }, $2;
    }
}
print "$_ " . join(' ', @{ $hash{$_} }) . "\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, the Perl one-liner version:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1) )} }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ ". join(" ", @{ $hash{$_}})."\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Given a TSV file where col2 can contain either a field or a record separator (FS/RS), respectively a tab or a newline, escaped by surrounding quotes:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD | 567 |
| | EF" | |
+------+---------+------+
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD EF" | 567 |
+------+---------+------+
I cannot get miller 6.2 to make the proper transformation (I tried with the DSL put/gsub) because it doesn't recognize the tab or CR/LF as being part of the columns, which breaks the field count:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr : mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
A good library cleanly handles things like embedded newlines and quoted separators in fields.
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;
my $file = shift // die "Usage: $0 filename\n";
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
    s/\s+/ /g for @$row;   # collapse multiple spaces, tabs, newlines
    $csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities.
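For instance, a more permissive constructor might look like this (a sketch; which options to enable depends on the data at hand, see the Text::CSV documentation):

my $csv = Text::CSV->new({
    binary             => 1,    # allow embedded newlines and binary data
    sep_char           => "\t",
    auto_diag          => 1,    # report parse errors with diagnostics
    allow_whitespace   => 1,    # strip whitespace around separators
    allow_loose_quotes => 1,    # tolerate stray quotes in unquoted fields
});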
This can fit in a one-liner; its functional interface (with csv) is particularly well suited for that.
If you run
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace
col1 col2 col3
1 A B 1234
2 CD EF 567
I'm using mlr 6.2.
A way to do it in miller 5 is to simply use the put verb:
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --tsv put -S 'for (k in $*) {$[k] = gsub($[k], "\n", " ")}' then clean-whitespace
perl -MText::CSV_XS=csv -e'
csv
    in       => *ARGV,
    on_in    => sub { s/\s+/ /g for @{$_[1]} },
    sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer.
Can be placed all on one line.
Input is accepted from file named by argument or STDIN.
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
You absolutely don't need gawk to get this done. Here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
    __ = substr((OFS = FS = "\t\"")(FS)(ORS = _)\
         (RS = "^$"), _+=_^=_<_, _)
}
END {
    printbefore()
    for (_^=_<_; _<=NF; _++) {
        sub(/[\t-\r]+/, ($_~__) ? " " : "&", $_)
    }
    print
}
function printbefore(_)
{
    printf("\n\n--before--\n%s\n------"\
           "------AFTER------\n\n", $+_) > ("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
Strip out the printbefore() part, which is more for debugging purposes, and then it's just:
{m,n,g}awk '
BEGIN { __=substr((OFS=FS="\t\"") FS \
(ORS=_) (RS="^$"),_+=_^=_<_,_)
} END {
for(--_;_<=NF;_++) {
sub(/[\t-\r]+/, $_~__?" ":"&",$_) } print }'
I have a file with the following structure (see below). I need help finding a way to match every ">Cluster" string and, for each one, count the number of lines until the next ">Cluster", and so on until the end of the file.
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
The desired Output should look like below:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
I tried with awk as below, but it gives me only the line numbers.
awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head -20
1 >Cluster 0
2 0 10565nt, >CL9602.Contig1_All... *
3 1 1331nt, >CL9602.Contig2_All... at -/98.05%
4 >Cluster 1
5 0 3798nt, >CL3196.Contig1_All... at +/97.63%
6 1 9084nt, >CL3196.Contig3_All... *
7 >Cluster 2
8 0 8710nt, >Unigene21841_All... *
9 >Cluster 3
10 0 8457nt, >Unigene10299_All... *
I also tried with sed, but it only prints the lines and even omits some of them.
sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0 1518nt, >CL2313.Contig1_All... at -/95.13%
1 8323nt, >CL2313.Contig8_All... *
In addition, I tried awk and sed in combination with wc, but that gives me only the total count of occurrences of the string match.
I thought of extracting the lines not matching '>Cluster' with grep's -v option, then extracting every line matching '>Cluster', and writing each set to a new file, e.g.
grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
0 8710nt, >Unigene21841_All... *
0 8457nt, >Unigene10299_All... *
0 1518nt, >CL2313.Contig1_All... at -/95.13%
grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4
But the problem is that the number of lines following each '>Cluster' isn't constant: each '>Cluster' string is followed by 1, 2, 3 or more lines until the next one occurs.
I decided to post my question after extensively searching for help among previously answered questions, but I couldn't find a helpful answer.
Thanks
Could you please try the following.
awk '
/^>Cluster/{
    if(count){
        print prev,count
    }
    sub(/^>/,"")
    prev=$0
    count=""
    next
}
{
    count++
}
END{
    if(count && prev){
        print prev,count
    }
}
' Input_file
Explanation: adding an explanation for the above code.
awk ' ##Starting awk program from here.
/^>Cluster/{ ##Checking condition if a line is having string Cluster then do following.
if(count){ ##Checking condition if variable count is NOT NULL then do following.
print prev,count ##Printing prev and count variable here.
} ##Closing BLOCK for if condition here.
sub(/^>/,"") ##Using sub for substitution of starting > with NULL in current line.
prev=$0 ##Creating a variable named prev whose value is current line.
count="" ##Nullifying count variable here.
next ##next will skip all further statements from here.
} ##Closing BLOCK for Cluster condition here.
{
count++ ##Doing increment of variable count each time cursor comes here.
}
END{ ##Mentioning END BLOCK for this program.
if(count && prev){ ##Checking condition if variable count and prev are NOT NULL then do following.
print prev,count ##Printing prev and count variable here.
} ##Closing BLOCK for if condition here.
} ##Closing BLOCK for END BLOCK of this program.
' Input_file ##Mentioning Input_file name here.
Output will be as follows.
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
With GNU awk for multi-char RS:
$ awk -v RS='(^|\n)(>|$)' -F'\n' 'NR>1{print $1, NF-1}' file
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The above just separates the input into records that start with > at the start of a line and then prints the number of lines in each record (subtracting 1 for the >Cluster... line).
Here's an, albeit quite verbose, one-liner in Perl. I'm really not good at this golfing stuff.
perl -n -e "if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }" All-Unigene_Clustered.fa.clstr
This is for Windows. For a unix shell you probably need to change the double to single quotes.
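For example, the single-quoted form for a Unix shell would be:

perl -n -e 'if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }' All-Unigene_Clustered.fa.clstr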
In Perl, the code can take the following form:
use strict;
use warnings;
my $cluster;
my $count;
while( <DATA> ) {
    chomp;
    if( /Cluster \d+/ ) {
        print "$cluster $count\n" if defined $cluster;
        s/>//;
        $cluster = $_;
        $count = 0;
    } else {
        $count++;
    }
}
print "$cluster $count\n" if defined $cluster;
__DATA__
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
output
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The following awk also works for 0-count lines (i.e. no lines before the next match):
BEGIN {
    count = 0;
    prev = "";
}
{
    if ($0 ~ /LOGCL/) {
        if (prev) {
            print (prev ": " count);
        }
        # reset count & assign line to prev
        count = 0;
        prev = $0;
    } else {
        count++;
    }
}
END {
    print (prev ": " count);
}
So for the following file (put the above into count.awk and invoke with
awk -f count.awk test.txt):
test.txt:
LOGCL One
Blah
LOGCL Two
Blah
Blah
LOGCL Three
LOGCL Four
Blah
LOGCL Five
blah
blah
blah
Output is:
LOGCL One: 1
LOGCL Two: 2
LOGCL Three: 0
LOGCL Four: 1
LOGCL Five: 3
Particularly handy for analyzing log files to see how many SQL queries are being run after certain points in code...
How can I create the output below using awk? I couldn't work out the loop for the comma-separated data.
awk '{print "echo "$1"\nadd "$2"\nremove "$3"\nlist "$4}' test.txt
test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
Output:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
I always feel like awk loses a bit of its pizazz when I have to do my own split and loop through the resulting array, but here is a straightforward way using a function to add the second loop in order to process your space-separated fields (that are themselves comma-separated values):
$ cat test.awk
function print_all(label, values,   v, n, i) {
    # extra parameters make v, n, i local to the function
    n = split(values, v, ",")
    for (i=1; i<=n; ++i) {
        print label " " v[i]
    }
}
{
    print "echo " $1
    print_all("add", $2)
    print_all("remove", $3)
    print_all("list", $4)
}
$ cat test.txt
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
$ awk -f test.awk test.txt
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
Another alternative approach is a two-stage awk:
$ awk '{print "echo " $1; print "add " $2; print "remove " $3; print "list " $4}' file |
    awk -F'[ ,]' 'NF==3{print $1,$2; print $1,$3;next}1'
You have two loops, which is why you have a problem: you need to split your line on whitespace, but then split the subelements on commas.
I would suggest using perl:
#!/usr/bin/env perl
use strict;
use warnings;

my @actions = qw ( echo add remove list );

# iterate the lines
while ( <DATA> ) {
    # split on whitespace
    my @fields = split;
    # iterate actions and fields
    foreach my $action ( @actions ) {
        # split each field on "," and print action and field for each
        print "$action $_\n" for split( ",", shift @fields );
    }
}
__DATA__
1 abc,bcd xyz,yza qwe,wer
2 abc xyz qwe
3 abc xyz,yza qwe,wer
4 abc,bcd xyz wer
This gives us:
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
Which I think is what you wanted?
This can be reduced to a one liner:
perl -ane 'for my $act ( qw ( echo add remove list ) ) { print "$act $_\n" for split ",", shift @F }' test.txt
Not necessarily recommended, but if you're looking for compact, you could replace the commas with the extra text, including the line breaks:
a = "," $2; b = "," $3; c = "," $4;
gsub(/,/, "\nadd ", a);
gsub(/,/, "\nremove ", b);
gsub(/,/, "\nlist ", c);
print "echo " $1 a b c "\n"
$ cat tst.awk
BEGIN { split("echo add remove list",names) }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        split($fldNr,subFlds,/,/)
        for (subFldNr=1; subFldNr in subFlds; subFldNr++) {
            print names[fldNr], subFlds[subFldNr]
        }
    }
}
$ awk -f tst.awk file
echo 1
add abc
add bcd
remove xyz
remove yza
list qwe
list wer
echo 2
add abc
remove xyz
list qwe
echo 3
add abc
remove xyz
remove yza
list qwe
list wer
echo 4
add abc
add bcd
remove xyz
list wer
awk -F" " '{for(i=1;i<=NF;i++){a[i]=$i;} {print "echo "a[1]"\n""add "a[2]"\nremove "a[3]"\nlist "a[4];}}' filename | awk -F" " '{sub(/,/,"\n"$1" ",$0);print}'
The above code can be used.
I would also welcome input from others on an optimized version of the above snippet.
I have been searching for how to do the following for a couple of hours and could not find it. I apologize if I am repeating something.
I have 22 csv files with 14 columns and 17,392 lines in each. I am using awk to filter the original files using the following commands:
First I need to get the lines that have values in column 14 smaller than 0.05:
awk -F '\t' '$14 < 0.05 { print $0 }' file1 > file2
Next I need to get the lines with column 10 values higher than 1 or smaller than -1:
awk -F '\t' '$10 < -1 { print $0 }' file2 > file3
awk -F '\t' '$10 > 1 { print $0 }' file2 > file4
My last step is to get the lines that have values in column 7 OR 8 higher than 1 (e.g. column 7 could be 0 if column 8 is 1):
awk -F '\t' '$7<=1 {print $0}' file3 > file5
awk -F '\t' '$8>=1 {print $0}' file4 > file6
My problem is that I create several intermediate files. I need just two files at the end: file3 and file4 filtered so that column 7 or 8 has a value equal to or greater than 1. How can I make an awk command that does all of this at once?
Thank you.
Your question is ambiguous, so there are many possible answers. However, you can combine conditions in awk and you can write to separate files in a single pass, so you might mean:
awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 > 1 { print > "file5" }
$14 < 0.05 && $10 > +1 && $8 > 1 { print > "file6" }' file1
This command should give you the same output in file5 and file6 as you got from your original sequence of operations (but it only makes one pass over the data, not many). (Strictly, it produces the same answer if you change your $7<=1 to $7>1 to agree with your description of wanting column 7 or 8 higher than 1, though that contradicts your example 'on 7 could be 0 if on 8 it is 1'.)
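For comparison, a single-pass version that keeps the literal comparisons from the original pipeline (rather than the prose description) would be:

awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 <= 1 { print > "file5" }
             $14 < 0.05 && $10 >  1 && $8 >= 1 { print > "file6" }' file1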
Given an input file:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
The output in file5 is:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
and the output in file6 is:
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
If you need to combine the conditions differently, then you need to clarify your question.
You could try:
awk -F'\t' '($14 < 0.05) && ($10 < -1) && ($7 <= 1) {print}' file1 > file3
I have an awk script that processes a csv file and produces a report that counts, for each column named in the header, the number of rows containing data matching /[A-Za-z0-9]/. What I would like to do is enhance the script to print the top 5 most duplicated data elements in each column.
Here is sample data:
Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
This is the current script that SiegeX has substantially contributed:
$ cat matrix2.awk
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-6d%s\n",n[i],head[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf("% -6d %s\n",arr[i,b[2]],b[2])
            }
        }
    }
}
This is the current output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
3 Meat
1 Fish
4 Dairy
1 Bread
2 Spicy
17 No
2 Yes
And this is the desired output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
4 Dairy
3 Meat
1 Fish
2 Spicy
17 No
2 Yes
The only thing left that I would like to add is a general function that limits each column's output to the top 5 most duplicated fields; in other words, a columnar version of sort | uniq -c | sort -nr | head -5.
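For reference, one way to sketch that top-5 limit in plain awk (untested beyond the sample data; it avoids sorting by re-scanning for the current maximum, which is fine for small inputs) is:

awk -F'|' '
NR==1 { for (i=1; i<=NF; i++) head[i]=$i; next }
{
    for (i=1; i<=NF; i++) {
        if ($i && !cnt[i SUBSEP $i]++) n[i]++
        if (cnt[i SUBSEP $i] > 1) dup[i] = 1
    }
}
END {
    for (i=1; i in head; i++) {
        printf "%-6d%s\n", n[i], head[i]
        if (!dup[i]) continue   # only list columns that contain duplicates
        for (r=1; r<=5; r++) {  # pick the 5 most frequent values by repeated max-scan
            max = 0; best = ""
            for (k in cnt) {
                split(k, b, SUBSEP)
                if (b[1] == i && !(k in used) && cnt[k] > max) {
                    max = cnt[k]; best = k
                }
            }
            if (best == "") break
            used[best] = 1
            split(best, b, SUBSEP)
            printf " %-5d %s\n", max, b[2]
        }
    }
}' testlist.csv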
The following script is both extensible and scalable, as it will work with an arbitrary number of columns. Nothing is hardcoded.
awk -F'|' '
NR==1{
    for(i=1;i<=NF;i++)
        head[i]=$i
    next
}
{
    for(i=1;i<=NF;i++) {
        if($i && !arr[i,$i]++)
            n[i]++
        if(arr[i,$i] > 1)
            f[i]=1
    }
}
END{
    for(i=1;i<=length(head);i++) {
        printf("%-32s%d\n",head[i],n[i])
        if(f[i]) {
            for(x in arr) {
                split(x,b,SUBSEP)
                if(b[1]==i && b[2])
                    printf(" %-28s%d\n",b[2],arr[i,b[2]])
            }
        }
    }
}' infile
Output
$ ./report
Food 9
Type 5
Meat 2
Bread 1
Vegetable 2
Fruit 2
Fish 1
Spicy 2
Yes 2
No 6
Not a complete solution, but something to get you started:
awk -F"|" '
NR>1{
a[$1]++;
b[$2]++;
c[$3]++}
END{
print "Food\t\t\t" length(a);
print "Type\t\t\t" length(b);
for (x in b)
if (x!="")
{
printf ("\t%-16s%s\n",x,b[x]);
}
print "Spicy\t\t\t" length(c);
for (y in c)
if (y!="")
{
printf ("\t%-16s%d\n",y,c[y])
}
}' testlist.csv
TEST:
[jaypal:~/Temp] cat testlist.csv
Food|Type|Spicy
Broccoli|Vegetable|No
Jalapeno|Vegetable|Yes
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
[jaypal:~/Temp] awk -F"|" 'NR>1{a[$1];b[$2]++;c[$3]++}END{print "Food\t\t\t" length(a); print "Type\t\t\t"length(b); for (x in b) if (x!="") printf ("\t%-16s%s\n",x,b[x]) ;print "Spicy\t\t\t"length(c); for (y in c) if (y!="") {printf ("\t%-16s%d\n",y,c[y])}}' testlist.csv
Food 9
Type 6
Fruit 2
Vegetable 2
Bread 1
Meat 2
Fish 1
Spicy 3
Yes 2
No 6