Parsing a line and selecting values corresponding to a key - language-agnostic

There is a set of data arranged in a specific manner (as a tree), as given below: basically key=value pairs, with some additional values at the end that tell how many children the branch has, plus some junk value.
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
My problem is to get the branch from the above data that satisfies exactly the values I give as input.
Suppose the input values to search for in the above data are:
5=0
4=0
6=10
3=11
11=1
1=45
0=138
9=16
2=34
7=0
8=0
For the above list of keys and values, the function should return 11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 as the match.
Likewise, for another input file, in which another set of keys is given:
5=0
4=0
6=10
3=11
11=1
1=45
9=16
2=34
7=0
8=0
the function should return 11=1>1=45>9=16>2=34>7=0>8=0 as the match, not the last line; that one would also match all the values given in my input keys, but I want only the exact match.
Also, I want to find out how many nodes (separated by >) were selected.
What will be the best way to implement this kind of scenario?

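One approach in Perl: build the tree as a hash of hashes from the path segments, then, for each input list, walk the tree, at each level taking whichever child key is still among the input keys: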
use strict;
use warnings;

my $tree;
while (<DATA>) {
    # Take the path (everything before the first space) and split it on '>'.
    my @data = split /\>/, (/^([^ ]*)/)[0];
    # Descend through the hash of hashes, autovivifying each level.
    my $ptr = \$tree;
    for my $key (@data) {
        $ptr = \$$ptr->{$key};
    }
}

my @inputs = (
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 0=138 9=16 2=34 7=0 8=0)],
    [qw(5=0 4=0 6=10 3=11 11=1 1=45 9=16 2=34 7=0 8=0)]
);

# Return the first key of this node that is still in the lookup set.
sub getKey {
    my ( $lu, $node ) = @_;
    exists $lu->{$_} and return $_ for keys %$node;
}

for my $input (@inputs) {
    my %lu;
    @lu{@$input} = ();
    my @result;
    my $node = $tree;
    while (%lu) {
        my $key = getKey( \%lu, $node );
        if ($key) {
            $node = $node->{$key};
            push @result, $key;
            delete $lu{$key};
        }
        else {
            last;
        }
    }
    print join( '>', @result ), "\n";
}
__DATA__
11=1 123 2
11=1>1=45 234 1
11=1>1=45>9=16 345 1
11=1>1=45>9=16>2=34 222 1
11=1>1=45>9=16>2=34>7=0 2234 1
11=1>1=45>9=16>2=34>7=0>8=0 22345 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138 22234 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0 5566 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0 664 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10 443 1
11=1>1=45>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 445 0
11=1>1=47 4453 1
11=1>1=47>9=16 887 1
11=1>1=47>9=16>2=34 67 1
11=1>1=47>9=16>2=340>7=0 98 1
11=1>1=47>9=16>2=34>7=0>8=0 654 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138 5789 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0 9870 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0 3216 1
11=1>1=47>9=16>2=34>7=0>8=0>0=138>5=0>4=0>6=10>3=11 66678 0
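To also get the number of nodes selected, it is simply the number of elements in @result; for example, the final print could be changed to:
print join( '>', @result ), ' (', scalar @result, " nodes)\n";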

Related

how can I count the number of lines after a string match until the next specific match occurs

I have a file with the following structure (see below). I need help finding a way to match every ">Cluster" string and, for each one, count the number of lines until the next ">Cluster", and so on until the end of the file.
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
The desired Output should look like below:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
I tried with awk as below, but it gives me only the line numbers.
awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head - 20
==> standard input <==
1 >Cluster 0
2 0 10565nt, >CL9602.Contig1_All... *
3 1 1331nt, >CL9602.Contig2_All... at -/98.05%
4 >Cluster 1
5 0 3798nt, >CL3196.Contig1_All... at +/97.63%
6 1 9084nt, >CL3196.Contig3_All... *
7 >Cluster 2
8 0 8710nt, >Unigene21841_All... *
9 >Cluster 3
10 0 8457nt, >Unigene10299_All... *
I also tried with sed, but it only prints the lines, while even omitting some of them.
sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0 1518nt, >CL2313.Contig1_All... at -/95.13%
1 8323nt, >CL2313.Contig8_All... *
In addition I tried awk and sed in combination with wc, but they give me only the total count of occurrences of the string match.
I thought of subtracting the lines not matching the string '>Cluster' using the -v option of grep, then subtracting every line matching the string '>Cluster', and adding both to a new file, e.g.
grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
0 8710nt, >Unigene21841_All... *
0 8457nt, >Unigene10299_All... *
0 1518nt, >CL2313.Contig1_All... at -/95.13%
grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4
but the problem is that the number of lines following each '>Cluster' isn't constant; each '>Cluster' string is followed by 1, 2, 3 or more lines until the next one occurs.
I have decided to post my question after extensively searching within previously answered questions, but I couldn't find any helpful answer.
Thanks
Could you please try the following.
awk '
/^>Cluster/{
  if(count){
    print prev,count
  }
  sub(/^>/,"")
  prev=$0
  count=""
  next
}
{
  count++
}
END{
  if(count && prev){
    print prev,count
  }
}
' Input_file
Explanation: adding an explanation for the above code.
awk '                ##Start the awk program here.
/^>Cluster/{         ##If a line contains the string Cluster, do the following.
  if(count){         ##If the variable count is NOT NULL, do the following.
    print prev,count ##Print the prev and count variables.
  }                  ##Close the BLOCK for the if condition.
  sub(/^>/,"")       ##Use sub to replace the starting > of the current line with NULL.
  prev=$0            ##Create a variable named prev whose value is the current line.
  count=""           ##Nullify the count variable.
  next               ##next skips all further statements from here.
}                    ##Close the BLOCK for the Cluster condition.
{
  count++            ##Increment the variable count each time the cursor comes here.
}
END{                 ##The END BLOCK of this program.
  if(count && prev){ ##If the variables count and prev are NOT NULL, do the following.
    print prev,count ##Print the prev and count variables.
  }                  ##Close the BLOCK for the if condition.
}                    ##Close the END BLOCK of this program.
' Input_file         ##Mention the Input_file name here.
Output will be as follows.
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
With GNU awk for multi-char RS:
$ awk -v RS='(^|\n)(>|$)' -F'\n' 'NR>1{print $1, NF-1}' file
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The above just separates the input into records that start with > at the start of a line and then prints the number of lines in each record (subtracting 1 for the >Cluster... line).
Here's an, albeit quite verbose, one-liner in Perl. I'm really not good at this golfing stuff.
perl -n -e "if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }" All-Unigene_Clustered.fa.clstr
This is for Windows. For a Unix shell you probably need to change the double quotes to single quotes.
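For example, on a Unix shell the same command would be (only the quoting changes; the Perl code itself is identical):
perl -n -e 'if ( /^>(.+)/ ) { print qq($last, $count\n) if $count; $last = $1; $count = 0; } else { $count++ } END { print qq($last, $count) }' All-Unigene_Clustered.fa.clstr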
In Perl the code can be in the following form
use strict;
use warnings;

my $cluster;
my $count;
while( <DATA> ) {
    chomp;
    if( /Cluster \d+/ ) {
        print "$cluster $count\n" if defined $cluster;
        s/>//;
        $cluster = $_;
        $count = 0;
    } else {
        $count++;
    }
}
print "$cluster $count\n" if defined $cluster;
__DATA__
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
Output:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
The following awk also works for 0-count lines (i.e. no lines until the next match):
BEGIN {
    count = 0;
    prev = "";
}
{
    if ($0 ~ /LOGCL/) {
        if (prev) {
            print (prev ": " count);
        }
        # reset count & assign line to prev
        count = 0;
        prev = $0;
    } else {
        count++;
    }
}
END {
    print (prev ": " count);
}
So for the following file (put the above into count.awk and invoke with awk -f count.awk test.txt):
test.txt:
LOGCL One
Blah
LOGCL Two
Blah
Blah
LOGCL Three
LOGCL Four
Blah
LOGCL Five
blah
blah
blah
Output is:
LOGCL One: 1
LOGCL Two: 2
LOGCL Three: 0
LOGCL Four: 1
LOGCL Five: 3
Particularly handy for analyzing log files to see how many SQL queries are being run after certain points in the code...

How do I count multiple columns with a vertical value?

I'm new to programming. I have a table
check_1,check_2,check_3 ..etc
------------------------------
1 1 1
0 0 1
1 0 1
And I want this output:
column_name, count_true
-----------------------
check_1 2
check_2 1
check_3 3
I've tried it with MySQL using UNION, but when I try to apply it in Laravel I have trouble with the union. Is there a union-less query that can produce such output?
Thanks in advance
You can do it this way, with one query to the DB:
$records = DB::table('your_table')->get();
$check1Count = $records->where('check_1', 1)->count();
$check2Count = $records->where('check_2', 1)->count();
$check3Count = $records->where('check_3', 1)->count();
......
Or
$records = DB::table('your_table')->get();
$columns = ['check_1', 'check_2', 'check_3', ...];
$data = [];
foreach ($columns as $column) {
    $data[] = [
        'column_name' => $column,
        'count_true' => $records->where($column, 1)->count(),
    ];
}
You can also do it this way, but it runs many queries:
$check1Count = DB::table('your_table')
    ->selectRaw('COUNT(check_1) as count')
    ->where('check_1', 1)
    ->first()
    ->count;
$check2Count = DB::table('your_table')
    ->selectRaw('COUNT(check_2) as count')
    ->where('check_2', 1)
    ->first()
    ->count;
.....
A normalised approach might look like this:
response_id checkbox
1 1
1 2
1 3
2 3
3 1
3 3
You don't need UNION; just query all the data and process it in Laravel. Let's say you query all the data using Eloquent into a $data variable; then you should do it like this:
// preparing a variable to hold all 24 database field values
$total = [];
foreach ($data as $d) {
    // do this for all 24 database fields
    if ($d->check_1) {
        $total[1]++;
    }
    if ($d->check_2) {
        $total[2]++;
    }
    ...
}
By using that approach you can check the result in the $total[index] variable. And yes, there is a better way to store your data instead of saving every field for each user. You can just store each checked value in the database, which would look like this:
user_id checkbox_id
1 3
1 5
1 9
1 24
2 23
3 2
3 3
It is more efficient since you don't need to save the unchecked checkbox values, especially if users are likely to leave most of the checkboxes unchecked.

Reading values from CSV and outputting them in the format Column1:Row1, Column2:Row1

I am trying to write simple PowerShell code to read two columns' values from a CSV and print them in a manner like
1
201
2
202
.
.
CSV File:
PRODID DEVID Name
1 201 Application
2 202 Product
3 203 Component
4 204 Author
5 205 Version
PowerShell code:
$DEVID = Import-Csv C:\test\install.csv | % {$_.DEVID}
$PRODID = Import-Csv C:\test\install.csv | % {$_.PRODID}
ForEach ($DEVI in $DEVID)
{
    Write-Host $DEVI
    ForEach ($PRODI in $PRODID)
    {
        Write-Host $PRODI
    }
}
But I am not getting the expected output, though I have tried break and continue.
Can anyone help me with this, please?
You only need to import your CSV once. Then just iterate over it and output your desired records:
Import-Csv 'C:\test\install.csv' | Foreach {
    $_.PRODID; $_.DEVID;
}
Output:
1
201
2
202
3
203
4
204
5
205
If this doesn't work, you will have to show us your CSV file.

Read in search strings from text file, search for string in second text file and output to CSV

I have a text file named file1.txt that is formatted like this:
001 , ID , 20000
002 , Name , Brandon
003 , Phone_Number , 616-234-1999
004 , SSNumber , 234-23-234
005 , Model , Toyota
007 , Engine ,V8
008 , GPS , OFF
and I have file2.txt formatted like this:
#==============================================
# 005 : Model
#------------------------------------------------------------------------------
[Model] = Honda
option = 0
length = 232
time = 1000
hp = 75.0
k1 = 0.3
k2 = 0.0
k1 = 0.3
k2 = 0.0
#------------------------------------------------------------------------------
[Model] = Toyota
option = 1
length = 223
time = 5000
speed = 50
CCNA = 1
#--------------------------------------------------------------------------
[Model] = Miata
option = 2
CCNA = 1
#==============================================
# 007 : Engine
#------------------------------------------------------------------------------
[Engine_Type] = V8 #1200HP
option = 0
p = 12.0
pp = 12.0
map = 0.4914
k1mat = 100
k2mat = 600
value =12.00
mep = 79.0
cylinders = 8
#------------------------------------------------------------------------------
[Engine_Type] = v6 #800HP
option = 1
active = 1
cylinders = 6
lim = 500
lim = 340
rpm = 330
start = 350
ul = 190.0
ll = 180.0
ul = 185.0
#==============================================
# 008 : GPS
#------------------------------------------------------------------------------
[GPS] = ON
monitor = 0
#------------------------------------------------------------------------------
[GPS] = OFF
monitor = 1
Enable = 1
#------------------------------------------------------------------------------
[GPS] = Only
monitor = 2
Enable = 1
#==============================================
# 014 :Option
#------------------------------------------------------------------------------
[Option] = Disable
monitor = 0
#------------------------------------------------------------------------------
[Option] = Enable
monitor = 1
#==============================================
# 015 : Weight
#------------------------------------------------------------------------------
[lbs] = &1
weight = &1
#==============================================
The expected output is supposed to look like this. Since there are only options 005-008 in file1.txt, the output would be:
#==============================================
# 005 : Model
#------------------------------------------------------------------------------
[Model] = Toyota
option = 1
length = 223
time = 5000
speed = 50
CCNA = 1
#==============================================
# 007 : Engine
#------------------------------------------------------------------------------
[Engine_Type] = V8 #1200HP
option = 0
p = 12.0
pp = 12.0
map = 0.4914
k1mat = 100
k2mat = 600
value =12.00
mep = 79.0
cylinders = 8
#==============================================
# 008 : GPS
#------------------------------------------------------------------------------
[GPS] = OFF
monitor = 1
Enable = 1
#-----------------------------------------------------------------
Now, using awk and the values from the 2nd and 3rd columns in file1, I want to search for those strings in file2 and output everything in that section to a CSV file, i.e. from where the string is found to where the #------------- demarcation is.
Could someone please help me with this and explain it as well? I am new to awk.
Thank you!
I wouldn't really use awk for this job as specified, but here's a little snippet to get started:
awk -F'[ ,]+' 'FNR == NR { section["[" $2 "]"] = $3; next }
/^\[/ && section[$1] == $3, /^#/' file1.txt file2.txt
1) The -F'[ ,]+' sets the field separator to one or more spaces and/or commas (since file1.txt doesn't look like a proper CSV file).
2) FNR == NR (record number in the file equals total record number) is only true while reading file1.txt. So for each line in file1.txt, we record [second_field] as the pattern to look for, with the third field as its value.
3) Then we look for lines that begin with a [ and where the value stored in section for the first field of that line matches the third field of that line (/^\[/ && section[$1] == $3), and print from that line until the next line that begins with a #.
The output for your example input is:
[Model] = Toyota
option = 1
length = 223
time = 5000
speed = 50
CCNA = 1
#--------------------------------------------------------------------------
[GPS] = OFF
monitor = 1
Enable = 1
#------------------------------------------------------------------------------
The matched lines in step 3 were [Model] = Toyota and [GPS] = OFF. The Engine line is missing because file2.txt has Engine_Type instead. Also, I didn't bother with the section headers; it would be easy to add another condition to print them all, but printing only the ones that will have matching content requires lookahead (at the time you read the header, you don't yet know whether a match will be found inside it). For that, I would switch to another language (e.g., Ruby).
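If you do switch languages, here is a rough Perl sketch of the same range logic. It mirrors the awk answer's behavior and limitations: it still skips the section headers, it does not reconcile Engine vs Engine_Type, and the file names are taken from the question:
use strict;
use warnings;

# Build the lookup from file1.txt: "[Model]" => "Toyota", etc.
my %want;
open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
while (<$f1>) {
    chomp;
    my (undef, $key, $val) = split /\s*,\s*/;
    $want{"[$key]"} = $val if defined $val;
}
close $f1;

# Print from a matching "[key] = value" line up to (and including)
# the next comment/divider line, as the awk range expression does.
my $printing = 0;
open my $f2, '<', 'file2.txt' or die "file2.txt: $!";
while (<$f2>) {
    if (/^(\[\w+\])\s*=\s*(\S+)/) {
        # A section start: print it only if it was requested in file1.txt.
        $printing = defined $want{$1} && $want{$1} eq $2;
    }
    elsif (/^#/) {
        print if $printing;    # keep the closing demarcation line
        $printing = 0;
        next;
    }
    print if $printing;
}
close $f2;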

How to count instances of a string in a tab-separated value file?

How to count instances of strings in a tab-separated value (tsv) file?
The tsv file has hundreds of millions of rows, each of which is of the form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
How do I count instances of each unique integer in the entire second column of the file, and ideally add the count as the fifth value in each row?
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I prefer a solution using only UNIX command line stream processing programs.
I'm not entirely clear on what you want to do. Do you want to add 0/1 depending on the value of the second column as the fifth column, or do you want to get the distribution of the values in the second column, totalled for the entire file?
In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
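If you want the count appended as a fifth column without assuming sorted input, one common trick is to pass the file twice: count on the first pass, print on the second. A sketch in Perl (file stands for your input file):
perl -F'\t' -lane 'if (@ARGV) { $c{$F[1]}++ } else { print "$_\t$c{$F[1]}" }' file file
While reading the first copy, @ARGV still holds the second file name, so the counts get built; on the second copy, @ARGV is empty, so each line is printed with its count appended. Only the distinct second-column values are kept in memory, at the cost of reading the file twice.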
One solution using Perl, assuming that the values of the second column are sorted; I mean, when value 2 is found, all lines with the same value are consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so it shouldn't be a problem regardless of how big the input file is:
Content of script.pl:
use warnings;
use strict;
my (%lines, $count);
while ( <> ) {
    ## Remove last '\n'.
    chomp;
    ## Split line on spaces.
    my @f = split;
    ## Assume the line is malformed if it doesn't have four fields and omit it.
    next unless @f == 4;
    ## Save lines in a hash until a different value is found in the second column.
    ## The first line is special, because the hash will always be empty.
    ## On the last line avoid reading the next one, otherwise I would lose lines
    ## saved in the hash.
    ## The hash will only have one key at a time.
    if ( exists $lines{ $f[1] } or $. == 1 ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next if ! eof;
    }
    ## At this point, the second field of the file has changed (or this is the last
    ## line), so I will print the previous lines saved in the hash, remove them and
    ## begin saving lines with the new value.
    ## The value of the second column will be the key of the hash; get it now.
    my ($key) = keys %lines;
    ## Read each line of the hash and print it, appending the count of repeated
    ## values as the last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }
    ## Clear the hash.
    %lines = ();
    ## Add the current line to the hash, initialize the counter and repeat the whole
    ## process until end of file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;
}
Content of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
With the following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2