How can I count the number of lines after a string match until the next specific match occurs - language-agnostic

I have a file with the following structure (see below). I need help finding a way to match every ">Cluster" string and, for each one, count the number of lines until the next ">Cluster", and so on until the end of the file.
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
The desired output should look like this:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
I tried awk as below, but it gives me only the line numbers.
awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head -20
1 >Cluster 0
2 0 10565nt, >CL9602.Contig1_All... *
3 1 1331nt, >CL9602.Contig2_All... at -/98.05%
4 >Cluster 1
5 0 3798nt, >CL3196.Contig1_All... at +/97.63%
6 1 9084nt, >CL3196.Contig3_All... *
7 >Cluster 2
8 0 8710nt, >Unigene21841_All... *
9 >Cluster 3
10 0 8457nt, >Unigene10299_All... *
I also tried sed, but it only prints the lines, and even omits some of them.
sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0 1518nt, >CL2313.Contig1_All... at -/95.13%
1 8323nt, >CL2313.Contig8_All... *
In addition, I tried awk and sed in combination with wc, but that only gives me the total count of occurrences of the string match.
I thought of extracting the lines not matching the string '>Cluster' using the -v option of grep, then extracting every line matching '>Cluster', and writing both to a new file, e.g.:
grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
0 8710nt, >Unigene21841_All... *
0 8457nt, >Unigene10299_All... *
0 1518nt, >CL2313.Contig1_All... at -/95.13%
grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4
but the problem is that the number of lines following each '>Cluster' isn't constant; each '>Cluster' string is followed by 1, 2, 3 or more lines until the next one occurs.
I decided to post my question after extensively searching within previously answered questions, but I couldn't find a helpful answer.
Thanks

Could you please try the following.
awk '
/^>Cluster/{
    if(count){
        print prev,count
    }
    sub(/^>/,"")
    prev=$0
    count=""
    next
}
{
    count++
}
END{
    if(count && prev){
        print prev,count
    }
}
' Input_file
Explanation: here is the same code with comments.
awk '                    ##Starting awk program from here.
/^>Cluster/{             ##Checking if a line starts with the string >Cluster; if so, do the following.
    if(count){           ##Checking if variable count is NOT NULL; if so, do the following.
        print prev,count ##Printing prev and count variables here.
    }                    ##Closing BLOCK for if condition here.
    sub(/^>/,"")         ##Using sub to replace the leading > with NULL in the current line.
    prev=$0              ##Creating a variable named prev whose value is the current line.
    count=""             ##Nullifying count variable here.
    next                 ##next will skip all further statements from here.
}                        ##Closing BLOCK for Cluster condition here.
{                        ##This block runs for every non-Cluster line.
    count++              ##Incrementing variable count each time the cursor comes here.
}                        ##Closing BLOCK for non-Cluster lines here.
END{                     ##Mentioning END BLOCK for this program.
    if(count && prev){   ##Checking if variables count and prev are NOT NULL; if so, do the following.
        print prev,count ##Printing prev and count variables here.
    }                    ##Closing BLOCK for if condition here.
}                        ##Closing BLOCK for END BLOCK of this program.
' Input_file             ##Mentioning Input_file name here.
Output will be as follows.
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
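As a quick self-contained check (an illustration, not part of the original answer), the same logic can be fed a here-doc subset of the question's sample instead of a file:

```shell
# Same counting logic as above, run against an inline sample.
awk '
  /^>Cluster/ { if (count) print prev, count
                sub(/^>/, ""); prev = $0; count = ""; next }
  { count++ }
  END { if (count && prev) print prev, count }
' <<'EOF'
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
EOF
```

This prints Cluster 0 2 and Cluster 1 1; the last cluster is emitted by the END block.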

With GNU awk for multi-char RS:
$ awk -v RS='(^|\n)(>|$)' -F'\n' 'NR>1{print $1, NF-1}' file
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The above just separates the input into records that start with > at the start of a line and then prints the number of lines in each record (subtracting 1 for the >Cluster... line).

Here's an, albeit quite verbose, one-liner in Perl. I'm really not good at this golfing stuff.
perl -n -e "if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }" All-Unigene_Clustered.fa.clstr
This is for Windows. For a Unix shell you probably need to change the double quotes to single quotes.

In Perl, the code can take the following form:
use strict;
use warnings;

my $cluster;
my $count;

while( <DATA> ) {
    chomp;
    if( /Cluster \d+/ ) {
        print "$cluster $count\n" if defined $cluster;
        s/>//;
        $cluster = $_;
        $count = 0;
    } else {
        $count++;
    }
}
print "$cluster $count\n" if defined $cluster;
__DATA__
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
Output:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1

The following awk also works for 0-count lines (i.e. no lines until the next match):
BEGIN {
    count = 0;
    prev = "";
}
{
    if ($0 ~ /LOGCL/) {
        if (prev) {
            print (prev ": " count);
        }
        # reset count & assign line to prev
        count = 0;
        prev = $0;
    } else {
        count++;
    }
}
END {
    print (prev ": " count);
}
So for the following file (put the above into count.awk and invoke with
awk -f count.awk test.txt):
test.txt:
LOGCL One
Blah
LOGCL Two
Blah
Blah
LOGCL Three
LOGCL Four
Blah
LOGCL Five
blah
blah
blah
Output is:
LOGCL One: 1
LOGCL Two: 2
LOGCL Three: 0
LOGCL Four: 1
LOGCL Five: 3
Particularly handy for analyzing log files to see how many SQL queries are being run after certain points in code...
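If preferred, the same logic fits on one line (a condensed sketch of the script above; the END guard skips output on an empty file):

```shell
# One-liner form of the count.awk script; LOGCL is just the example marker.
awk '/LOGCL/ { if (prev) print prev ": " count; count = 0; prev = $0; next }
     { count++ }
     END { if (prev) print prev ": " count }' test.txt
```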


How to write a for loop to perform an operation N times in the ash shell?

I'm looking to run a command a given number of times in an Alpine Linux Docker container, which features the /bin/ash shell.
In Bash, this would be:
bash-3.2$ for i in {1..3}
> do
> echo "number $i"
> done
number 1
number 2
number 3
However, the same syntax doesn't seem to work in ash:
> docker run -it --rm alpine /bin/ash
/ # for i in 1 .. 3
> do echo "number $i"
> done
number 1
number ..
number 3
/ # for i in {1..3}
> do echo "number $i"
> done
number {1..3}
/ #
I had a look at https://linux.die.net/man/1/ash but wasn't able to easily find out how to do this; does anyone know the correct syntax?
I ended up using seq with command substitution:
/ # for i in $(seq 10)
> do echo "number $i"
> done
number 1
number 2
number 3
number 4
number 5
number 6
number 7
number 8
number 9
number 10
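As a side note, seq also accepts explicit start, increment, and end arguments, so ranges other than 1..N work the same way:

```shell
# seq FIRST INCREMENT LAST: count from 2 to 6 in steps of 2.
for i in $(seq 2 2 6)
do echo "number $i"
done
```

which prints number 2, number 4, number 6.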
Simply, as with bash or any other shell:
$ ash -c "for i in a b c 1 2 3; do echo i = \$i; done"
output:
i = a
i = b
i = c
i = 1
i = 2
i = 3
Another POSIX-compatible alternative, which does not use potentially slow expansion:
i=1; while [ ${i} -le 3 ]; do
echo ${i}
i=$(( i + 1 ))
done
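If this is needed in several places, the counting loop can be wrapped in a small POSIX sh function (the name repeat is our own, not from the thread):

```shell
# Run a command N times; works in ash, dash, and bash.
repeat() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        "$@"              # run the given command once per iteration
        i=$((i + 1))
    done
}

repeat 3 echo hello       # prints "hello" three times
```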

Identify the empty lines in a text file and loop over the resulting groups in Tcl

I have a file which has the following kind of data
A 1 2 3
B 2 2 2
c 2 4 5
d 4 5 6
From the above file I want to execute a loop of three iterations, where the first iteration covers the A and B elements, the 2nd the c element, and the 3rd the d element, so that my HTML table will look like
Week1 | week2 | week3
----------------------------
A 1 2 3 | c 2 4 5 | d 4 5 6
B 2 2 2
I found this on SO: catch multiple empty lines in file in tcl, but it's not exactly what I want.
I would suggest using arrays:
# Counter
set week 1
# Create file channel
set file [open filename.txt r]
# Read the file line by line, storing each line in the variable $line
while {[gets $file line] != -1} {
    if {$line != ""} {
        # if the line is not empty, add it to the array element for week $week
        lappend Week($week) $line
    } else {
        # else, increment the week number
        incr week
    }
}
# close file channel
close $file
# print Week array
parray Week
# Week(1) = {A 1 2 3} {B 2 2 2}
# Week(2) = {c 2 4 5}
# Week(3) = {d 4 5 6}

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (each preceded by the header shown).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever there are n lines of HC recorded for an entry, I want to extract the sum of those n values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get it with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }'
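Since the snippet above was posted untested, here is a quick sanity run against a subset of the sample (an illustration, fed through a here-doc):

```shell
# Sum the 3rd field between header lines; a header line flushes the running sum.
awk -F';' '
  $3 == "HC" {
      if (NR > 1) { print sum; sum = 0 }
      next }
  { sum += $3 }
  END { print sum }' <<'EOF'
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
EOF
```

This prints 14 and then 51 (44 + 07), matching the expected per-entry sums.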
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum          # print current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to a number; sum it up
    p=1                # set p to 1
}
END{print sum}' input.txt    # print the last sum
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For given inputs:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care for trickier input, for example:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"
Explanation:
Get contents of the file using cat
Take only the third column using cut delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14;
Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out.
Replace # with + to add the numbers together.
Execute the command, but with a true || 0; expr ... to avoid a guaranteed syntax error on the first line.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if this helps you.
awk -F";" '
/^H/ && $3!="HC"{
    flag=""
    next
}
/^H/ && $3=="HC"{
    if(NR>1){
        printf("%d\n",sum)
    }
    sum=0
    flag=1
    next
}
flag{
    sum+=$3
}
END{
    printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

Use multiple lines with Awk

I have a CSV that uses underscore delimiters. I have 8 lines that need to be converted to one, in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(in different boxes preferably)
The problem is that I don't know how to make awk combine multiple lines into a single one. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(last part all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:
use strict;
use warnings;

my %hash = ();
while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1))} }, $2;
    }
}
print "$_ " . join(' ', @{ $hash{$_} }) . "\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, perl one line version:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1)) } }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ " . join(" ", @{ $hash{$_} }) . "\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
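An awk sketch of the same join is also possible (an illustration, not from the thread), assuming the first three _-separated fields form the key, the fourth is the variable name to drop, and everything after it is the value (values may themselves contain underscores):

```shell
# Join all lines sharing the same 3-field key into one output line.
awk -F'_' '
  {
      key = $1 " " $2 " " $3
      val = $5                                  # rebuild the value from field 5 onward,
      for (i = 6; i <= NF; i++) val = val "_" $i   # re-adding any underscores inside it
      out[key] = out[key] " " val
  }
  END { for (k in out) print k out[k] }
' file.txt
```

Values are appended in input order; with multiple keys, the order of the output lines themselves is unspecified.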

Parsing line and selecting values corresponding to a key

There is a set of data which is arranged in a specific manner (as a tree), as given below: basically key=value pairs, with some additional values at the end that indicate how many children the branch has, plus some junk value.
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
My problem is to get the branch from the above data which satisfies exactly the values I give as input.
Suppose my input values to search for in the above data are:
5=0
4=0
6=10
3=11
11=1
1=45
0=138
9=16
2=34
7=0
8=0
For the above list of key=value pairs, the function should return 11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 as the match.
Likewise, for another input file, in which another set of keys is given:
5=0
4=0
6=10
3=11
11=1
1=45
9=16
2=34
7=0
8=0
the function should return 11=1>1=45>9=16>2=34>7=0>8=0 as the match, not the last line; the last line would also contain all the given keys, but I want only the exact match.
Also, I want to find out how many nodes (separated by >) were selected.
What would be the best way to implement this kind of scenario?
use strict;
use warnings;

my $tree;

while (<DATA>) {
    my @data = split /\>/, (/^([^ ]*)/)[0];
    my $ptr  = \$tree;
    for my $key (@data) {
        $ptr = \$$ptr->{$key};
    }
}

my @inputs = (
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 0=138 9=16 2=34 7=0 8=0)],
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 9=16 2=34 7=0 8=0)]
);

sub getKey {
    my ( $lu, $node ) = @_;
    exists $lu->{$_} and return $_ for keys %$node;
}

for my $input (@inputs) {
    my %lu;
    @lu{@$input} = ();
    my @result;
    my $node = $tree;
    while (%lu) {
        my $key = getKey( \%lu, $node );
        if ($key) {
            $node = $node->{$key};
            push @result, $key;
            delete $lu{$key};
        }
        else {
            last;
        }
    }
    print join( '>', @result ), "\n";
}
__DATA__
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
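For illustration only (not from the thread): the "exact match" requirement can also be expressed in awk, by demanding that a line's node count equal the number of input keys and that every node appear in the key set:

```shell
# Print only the line whose >-separated nodes are exactly the given key set.
awk -v keys='11=1 1=45 9=16 2=34 7=0 8=0' '
  BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
  {
      m = split($1, nodes, ">")
      if (m != n) next                       # exact match: same number of nodes
      for (i = 1; i <= m; i++)
          if (!(nodes[i] in want)) next      # a node outside the key set disqualifies
      print $1
  }
' data.txt
```

With the sample data in data.txt, this prints only 11=1>1=45>9=16>2=34>7=0>8=0; longer branches that also contain all the keys are rejected by the node-count check.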