How can I count the number of lines after a string match until the next specific match occurs - language-agnostic

I have a file with the following structure (see below). I need help finding a way to match every ">Cluster" string and, for each one, count the number of lines until the next ">Cluster", and so on until the end of the file.
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
The desired output should look like this:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
I tried awk as below, but it gives me only the line numbers.
awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head -20
1 >Cluster 0
2 0 10565nt, >CL9602.Contig1_All... *
3 1 1331nt, >CL9602.Contig2_All... at -/98.05%
4 >Cluster 1
5 0 3798nt, >CL3196.Contig1_All... at +/97.63%
6 1 9084nt, >CL3196.Contig3_All... *
7 >Cluster 2
8 0 8710nt, >Unigene21841_All... *
9 >Cluster 3
10 0 8457nt, >Unigene10299_All... *
I also tried sed, but it only prints the lines, and even omits some of them.
sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0 1518nt, >CL2313.Contig1_All... at -/95.13%
1 8323nt, >CL2313.Contig8_All... *
In addition, I tried awk and sed in combination with wc, but that only gives me the total count of occurrences of the string match.
I thought of extracting the lines not matching the string '>Cluster' using the -v option of grep, then extracting every line matching '>Cluster', and writing both to a new file, e.g.:
grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
0 8710nt, >Unigene21841_All... *
0 8457nt, >Unigene10299_All... *
0 1518nt, >CL2313.Contig1_All... at -/95.13%
grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4
but the problem is that the number of lines following each '>Cluster' isn't constant; each '>Cluster' string is followed by 1, 2, 3 or more lines until the next one occurs.
I decided to post my question after extensively searching within previously answered questions, but I couldn't find a helpful answer.
Thanks

Could you please try the following.
awk '
/^>Cluster/{
    if(count){
        print prev,count
    }
    sub(/^>/,"")
    prev=$0
    count=""
    next
}
{
    count++
}
END{
    if(count && prev){
        print prev,count
    }
}
' Input_file
Explanation: here is the same code with comments.
awk '                    ##Starting awk program from here.
/^>Cluster/{             ##Checking if a line starts with the string >Cluster; if so, do the following.
    if(count){           ##Checking if variable count is NOT NULL; if so, do the following.
        print prev,count ##Printing prev and count variables here.
    }                    ##Closing BLOCK for if condition here.
    sub(/^>/,"")         ##Using sub to replace the leading > with NULL in the current line.
    prev=$0              ##Creating a variable named prev whose value is the current line.
    count=""             ##Nullifying count variable here.
    next                 ##next will skip all further statements from here.
}                        ##Closing BLOCK for Cluster condition here.
{                        ##This block runs for every non-Cluster line.
    count++              ##Incrementing variable count each time the cursor comes here.
}                        ##Closing BLOCK for non-Cluster lines here.
END{                     ##Mentioning END BLOCK for this program.
    if(count && prev){   ##Checking if variables count and prev are NOT NULL; if so, do the following.
        print prev,count ##Printing prev and count variables here.
    }                    ##Closing BLOCK for if condition here.
}                        ##Closing BLOCK for END BLOCK of this program.
' Input_file             ##Mentioning Input_file name here.
Output will be as follows.
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
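As a quick self-contained check (an illustration, not part of the original answer), the same logic can be fed a here-doc subset of the question's sample instead of a file:

```shell
# Same counting logic as above, run against an inline sample.
awk '
  /^>Cluster/ { if (count) print prev, count
                sub(/^>/, ""); prev = $0; count = ""; next }
  { count++ }
  END { if (count && prev) print prev, count }
' <<'EOF'
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
EOF
```

This prints Cluster 0 2 and Cluster 1 1; the last cluster is emitted by the END block.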

With GNU awk for multi-char RS:
$ awk -v RS='(^|\n)(>|$)' -F'\n' 'NR>1{print $1, NF-1}' file
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The above just separates the input into records that start with > at the start of a line and then prints the number of lines in each record (subtracting 1 for the >Cluster... line).

Here's an, albeit quite verbose, one-liner in Perl. I'm really not good at this golfing stuff.
perl -n -e "if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }" All-Unigene_Clustered.fa.clstr
This is for Windows. For a Unix shell you probably need to change the double quotes to single quotes.

In Perl, the code can take the following form:
use strict;
use warnings;

my $cluster;
my $count;

while( <DATA> ) {
    chomp;
    if( /Cluster \d+/ ) {
        print "$cluster $count\n" if defined $cluster;
        s/>//;
        $cluster = $_;
        $count = 0;
    } else {
        $count++;
    }
}
print "$cluster $count\n" if defined $cluster;
__DATA__
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
Output:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1

The following awk also works for 0-count lines (i.e. no lines until the next match):
BEGIN {
    count = 0;
    prev = "";
}
{
    if ($0 ~ /LOGCL/) {
        if (prev) {
            print (prev ": " count);
        }
        # reset count & assign line to prev
        count = 0;
        prev = $0;
    } else {
        count++;
    }
}
END {
    print (prev ": " count);
}
So for the following file (put the above into count.awk and invoke with
awk -f count.awk test.txt):
test.txt:
LOGCL One
Blah
LOGCL Two
Blah
Blah
LOGCL Three
LOGCL Four
Blah
LOGCL Five
blah
blah
blah
Output is:
LOGCL One: 1
LOGCL Two: 2
LOGCL Three: 0
LOGCL Four: 1
LOGCL Five: 3
Particularly handy for analyzing log files to see how many SQL queries are being run after certain points in code...
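If preferred, the same logic fits on one line (a condensed sketch of the script above; the END guard skips output on an empty file):

```shell
# One-liner form of the count.awk script; LOGCL is just the example marker.
awk '/LOGCL/ { if (prev) print prev ": " count; count = 0; prev = $0; next }
     { count++ }
     END { if (prev) print prev ": " count }' test.txt
```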


How to write a for loop to perform an operation N times in the ash shell?

I'm looking to run a command a given number of times in an Alpine Linux Docker container, which features the /bin/ash shell.
In Bash, this would be:
bash-3.2$ for i in {1..3}
> do
> echo "number $i"
> done
number 1
number 2
number 3
However, the same syntax doesn't seem to work in ash:
> docker run -it --rm alpine /bin/ash
/ # for i in 1 .. 3
> do echo "number $i"
> done
number 1
number ..
number 3
/ # for i in {1..3}
> do echo "number $i"
> done
number {1..3}
/ #
I had a look at https://linux.die.net/man/1/ash but wasn't able to easily find out how to do this; does anyone know the correct syntax?
I ended up using seq with command substitution:
/ # for i in $(seq 10)
> do echo "number $i"
> done
number 1
number 2
number 3
number 4
number 5
number 6
number 7
number 8
number 9
number 10
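As a side note, seq also accepts explicit start, increment, and end arguments, so ranges other than 1..N work the same way:

```shell
# seq FIRST INCREMENT LAST: count from 2 to 6 in steps of 2.
for i in $(seq 2 2 6)
do echo "number $i"
done
```

which prints number 2, number 4, number 6.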
Simply, as with bash or any other shell:
$ ash -c "for i in a b c 1 2 3; do echo i = \$i; done"
output:
i = a
i = b
i = c
i = 1
i = 2
i = 3
Another POSIX-compatible alternative, which does not use potentially slow expansion:
i=1; while [ ${i} -le 3 ]; do
echo ${i}
i=$(( i + 1 ))
done
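If this is needed in several places, the counting loop can be wrapped in a small POSIX sh function (the name repeat is our own, not from the thread):

```shell
# Run a command N times; works in ash, dash, and bash.
repeat() {
    n=$1; shift
    i=1
    while [ "$i" -le "$n" ]; do
        "$@"              # run the given command once per iteration
        i=$((i + 1))
    done
}

repeat 3 echo hello       # prints "hello" three times
```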

Identify the empty lines in a text file and loop over the resulting groups in Tcl

I have a file which has the following kind of data
A 1 2 3
B 2 2 2
c 2 4 5
d 4 5 6
From the above file I want to execute a loop of three iterations, where the first iteration covers the A and B elements, the 2nd the c element, and the 3rd the d element, so that my HTML table will look like
Week1 | week2 | week3
----------------------------
A 1 2 3 | c 2 4 5 | d 4 5 6
B 2 2 2
I found this on SO: catch multiple empty lines in file in tcl, but it's not exactly what I want.
I would suggest using arrays:
# Counter
set week 1
# Create file channel
set file [open filename.txt r]
# Read the file line by line, storing each line in the variable $line
while {[gets $file line] != -1} {
    if {$line != ""} {
        # if the line is not empty, add it to the array element for week $week
        lappend Week($week) $line
    } else {
        # else, increment the week number
        incr week
    }
}
# close file channel
close $file
# print Week array
parray Week
# Week(1) = {A 1 2 3} {B 2 2 2}
# Week(2) = {c 2 4 5}
# Week(3) = {d 4 5 6}

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (each preceded by the header shown).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever there are n lines of HC recorded for an entry, I want to extract the sum of those n values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get it with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }'
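Since the snippet above was posted untested, here is a quick sanity run against a subset of the sample (an illustration, fed through a here-doc):

```shell
# Sum the 3rd field between header lines; a header line flushes the running sum.
awk -F';' '
  $3 == "HC" {
      if (NR > 1) { print sum; sum = 0 }
      next }
  { sum += $3 }
  END { print sum }' <<'EOF'
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
EOF
```

This prints 14 and then 51 (44 + 07), matching the expected per-entry sums.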
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum          # print current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to a number; sum it up
    p=1                # set p to 1
}
END{print sum}' input.txt    # print the last sum
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For given inputs:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care for trickier input, for example:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv|cut -d ";" -f3 |sed -e s/"HC"/"0; expr 0"/g |tr '\n' '#'|sed -e s/"##"/""/g|sed -e s/"#"/" + "/g)"
Explanation:
Get contents of the file using cat
Take only the third column using cut delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14;
Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out.
Replace # with + to add the numbers together.
Execute the command, but with a true || 0; expr ... to avoid a guaranteed syntax error on the first line.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if this helps you.
awk -F";" '
/^H/ && $3!="HC"{
    flag=""
    next
}
/^H/ && $3=="HC"{
    if(NR>1){
        printf("%d\n",sum)
    }
    sum=0
    flag=1
    next
}
flag{
    sum+=$3
}
END{
    printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

Use multiple lines with Awk

I have a CSV that uses underscore delimiters. I have 8 lines that need to be converted to one, in this way:
101_1_variableName_(value)
101_1_variableName1_(value2)
into:
101 1 (value) (value2)
(in different boxes preferably)
The problem is that I don't know how to make awk combine multiple lines into a single one. Any help is appreciated.
UPDATE: (input + output)
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
(last part all one line)
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
You can use Perl:
use strict;
use warnings;

my %hash = ();
while (<DATA>) {
    if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/) {
        push @{ $hash{join(' ', split('_', $1))} }, $2;
    }
}
print "$_ " . join(' ', @{ $hash{$_} }) . "\n" for (keys %hash);
__DATA__
101_1_1_trialOutcome_found_berry
101_1_1_trialStartTime_2014-08-05 11:26:49.510000
101_1_1_trialOutcomeTime_2014-08-05 11:27:00.318000
101_1_1_trialResponseTime_0:00:05.804000
101_1_1_whichResponse_d
101_1_1_bearPosition_6
101_1_1_patch_9
101_1_1_food_11
Prints:
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
Or, perl one line version:
$ perl -lane '
> push @{ $hash{join(" ", split("_", $1)) } }, $2 if (m/^([0-9_]+)_(?:[^_]+)_(.*?)\s*$/);
> END { print "$_ " . join(" ", @{ $hash{$_} }) . "\n" for (keys %hash); }
> ' file.txt
101 1 1 found_berry 2014-08-05 11:26:49.510000 2014-08-05 11:27:00.318000 0:00:05.804000 d 6 9 11
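An awk sketch of the same join is also possible (an illustration, not from the thread), assuming the first three _-separated fields form the key, the fourth is the variable name to drop, and everything after it is the value (values may themselves contain underscores):

```shell
# Join all lines sharing the same 3-field key into one output line.
awk -F'_' '
  {
      key = $1 " " $2 " " $3
      val = $5                                  # rebuild the value from field 5 onward,
      for (i = 6; i <= NF; i++) val = val "_" $i   # re-adding any underscores inside it
      out[key] = out[key] " " val
  }
  END { for (k in out) print k out[k] }
' file.txt
```

Values are appended in input order; with multiple keys, the order of the output lines themselves is unspecified.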

Parsing line and selecting values corresponding to a key

There is a set of data which is arranged in a specific manner (as a tree), as given below: basically key=value pairs, with some additional values at the end that indicate how many children the branch has, plus some junk value.
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
My problem is to get the branch from the above data which satisfies exactly the values I give as input.
Suppose my input values to search for in the above data are:
5=0
4=0
6=10
3=11
11=1
1=45
0=138
9=16
2=34
7=0
8=0
For the above list of key=value pairs, the function should return 11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 as the match.
Likewise, for another input file, in which another set of keys is given:
5=0
4=0
6=10
3=11
11=1
1=45
9=16
2=34
7=0
8=0
the function should return 11=1>1=45>9=16>2=34>7=0>8=0 as the match, not the last line; the last line would also contain all the given keys, but I want only the exact match.
Also, I want to find out how many nodes (separated by >) were selected.
What would be the best way to implement this kind of scenario?
use strict;
use warnings;

my $tree;

while (<DATA>) {
    my @data = split /\>/, (/^([^ ]*)/)[0];
    my $ptr  = \$tree;
    for my $key (@data) {
        $ptr = \$$ptr->{$key};
    }
}

my @inputs = (
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 0=138 9=16 2=34 7=0 8=0)],
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 9=16 2=34 7=0 8=0)]
);

sub getKey {
    my ( $lu, $node ) = @_;
    exists $lu->{$_} and return $_ for keys %$node;
}

for my $input (@inputs) {
    my %lu;
    @lu{@$input} = ();
    my @result;
    my $node = $tree;
    while (%lu) {
        my $key = getKey( \%lu, $node );
        if ($key) {
            $node = $node->{$key};
            push @result, $key;
            delete $lu{$key};
        }
        else {
            last;
        }
    }
    print join( '>', @result ), "\n";
}
__DATA__
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
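For illustration only (not from the thread): the "exact match" requirement can also be expressed in awk, by demanding that a line's node count equal the number of input keys and that every node appear in the key set:

```shell
# Print only the line whose >-separated nodes are exactly the given key set.
awk -v keys='11=1 1=45 9=16 2=34 7=0 8=0' '
  BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
  {
      m = split($1, nodes, ">")
      if (m != n) next                       # exact match: same number of nodes
      for (i = 1; i <= m; i++)
          if (!(nodes[i] in want)) next      # a node outside the key set disqualifies
      print $1
  }
' data.txt
```

With the sample data in data.txt, this prints only 11=1>1=45>9=16>2=34>7=0>8=0; longer branches that also contain all the keys are rejected by the node-count check.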