Best way to Grep for html - html

I'm having trouble using grep to through some html code.
I'm trying to find similar strings to this
<td>product description here</td><td> $<font color='red'>0.25</font>
i'm trying to generalize formula to count each line that is under $0.25 the parts that will vary are the:
href='/go/12229' the number after /go/ will change but always be a number 5 digits long
the product description can be alphanumeric with spaces and special characters
and the price can be anything from 0.01 to 0.25
I've tried making formulas like the one below but it either does not work or returns nothing.
grep -c "href='/go/'[*] target="_blank" rel="nofollow">*</a></td><td> $<font color='red'>[0].[0-2][0-9]</font>"
I think it has to do with me not escaping special characters correctly, but i'm not sure.
Any help is appreciated.

Okay - this requires that each line be formated as in your example, but this should give you the link, description and prices where each line is between 0.01 and 0.25. The the contents of this code an put them in a file like "priceawk" and make it executable:
grep 'go\/[0-9]\{5\}' | awk -F"<" '
{
split( $7, price_arr, ">" )
if( price_arr[ 2 ] > 0.00 && price_arr[ 2 ] < 0.26 )
{
split( $3, link_arr, "'\''" )
split( link_arr[ 3 ], desc_arr, ">" )
printf( "%s %s %s\n", link_arr[ 2 ], desc_arr[ 2 ], price_arr[ 2 ] )
}
} '
Then use it like:
cat input | priceawk
With a test input file I made from your line, I get the following kinds of output:
/go/12229 product description here 0.25
/go/13455 find this line2 0.01
/go/12334 find this line3 0.23
/go/34455 find this line4 0.16
The printf() can be improved to give your output in a different form, with a more useful delimiter than the current space.

Related

How does the 'k' modifier in FINDC() work in SAS?

I'm reading through the book, "SAS Functions by Example - Second Edition" and having trouble trying to understand a certain function due to the example and output they get.
Function: FINDC
Purpose: To locate a character that appears or does not appear within a string. With optional arguments, you can define the starting point for the search, set the direction of the search, ignore case or trailing blanks, or look for characters except the ones listed.
Syntax: FINDC(character-value, find-characters <,'modifiers'> <,start>)
Two of the modifiers are i and k:
i ignore case
k count only characters that are not in the list of find-characters
So now one of the examples has this:
Note: STRING1 = "Apples and Books"
FINDC(STRING1,"aple",'ki')
For the Output, they said it returns 1 because the position of "A" in Apple. However this is what confuses me, because I thought the k modifier says to find characters that are not in the find-characters list. So why is it searching for a when the letter "A", case-ignored, is in the find-characters list. To me, I feel like this example should output 6 for the "s" in Apples.
Is anyone able to help explain the k modifier to me any better, and why the output for this answer is 1 instead of 6?
Edit 1
Reading the SAS documentation online, I found this example which seems to contradict the book I'm reading:
Example 3: Searching for Characters and Using the K Modifier
This example searches a character string and returns the characters that do
not appear in the character list.
data _null_;
string = 'Hi, ho!';
charlist = 'hi';
j = 0;
do until (j = 0);
j = findc(string, charlist, "k", j+1);
if j = 0 then put +3 "That's all";
else do;
c = substr(string, j, 1);
put +3 j= c=;
end;
end;
run;
SAS writes the following output to the log:
j=1 c=H
j=3 c=,
j=4 c=
j=6 c=o
j=7 c=!
That's all
So, is the book wrong?
The book is wrong.
511 data _null_;
512 STRING1 = "Apples and Books" ;
513 x=FINDC(STRING1,"aple",'ki');
514 put x=;
515 if x then do;
516 ch=char(string1,x);
517 put ch=;
518 end;
519 run;
x=6
ch=s

Highlight a matched pattern in a DNA sequence with HTML markup using Perl

I am working on generating an HTML page using a CGI script in Perl.
I need filter some sequences in order to check whether they contain a specific pattern; if they contain it I need to print those sequences on my page with 50 bases per line, and highlight the pattern in the sequences. My sequences are in an hash called %hash; the keys are the names, the values are the actual sequences.
my %hash2;
foreach my $key (keys %hash) {
if ($hash{$key} =~ s!(aaagg)!<b>$1</b>!) {
$hash2{$key} = $hash{$key}
}
}
foreach my $key (keys %hash2) {
print "<p> <b> $key </b> </p>";
print "<p>$_</p>\n" for unpack '(A50)*', $hash2{$key};
}
This method "does" the job however if I highlight the pattern "aaagg" using this method I am messing up the unpacking of the line (for unpack '(A50)*'); because now the sequences contains the extra characters of the bold tags which are included in the unpacking count. This beside making the lines of different length it is also a big problem if the tag falls between 2 lines due to unpacking 50 characters, it basically remains open and everything after that is bold.
The script below uses a single randomly generated DNA sequence of length 243 (generated using http://www.bioinformatics.org/sms2/random_dna.html) and a variable length pattern.
It works by first recording the positions which need to be highlighted instead of changing the sequence string. The highlighting is inserted after the sequence is split into chunks of 50 bases.
The highlighting is done in reverse order to minimize bookkeeping busy work.
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use YAML::XS;
my $PRETTY_WIDTH = 50;
# I am using bold-italic so the highlighting
# is visible on Stackoverflow, but in real
# life, this would be something like:
# my #PRETTY_MARKUP = ('<span class="highlighted-match">', '</span>');
my #PRETTY_MARKUP = ('<b><i>', '</i></b>');
use constant { BAŞ => 0, SON => 1, ROW => 0, COL => 1 };
my $sequence = q{ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgcttttccccgtaatgagggccccatattcaggccgtcgtccggaattgtcttggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatatctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagaccggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct};
my $wanted = 'c..?gg';
my #pos;
while ($sequence =~ /($wanted)/g) {
push #pos, [ pos($sequence) - length($1), pos($sequence) ];
}
print Dump \#pos;
my #output = unpack "(A$PRETTY_WIDTH)*", $sequence;
print Dump \#output;
while (my $pos = pop #pos) {
my #rc = map pos_to_rc($_, $PRETTY_WIDTH), #$pos;
substr($output[ $rc[$_][ROW] ], $rc[$_][COL], 0, $PRETTY_MARKUP[$_]) for SON, BAŞ;
}
print Dump \#output;
sub pos_to_rc {
my $r = int( $_[0] / $_[1] );
my $c = $_[0] - $r * $_[1];
[ $r, $c ];
}
Output:
C:\...\Temp> perl s.pl
---
- - 0
- 4
- - 76
- 80
- - 87
- 91
- - 97
- 102
- - 104
- 108
- - 165
- 170
- - 184
- 188
- - 198
- 202
- - 226
- 231
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
Especially since this turns out to have been a homework assignment, it is now up to you to take this and apply it to all sequences in your hash table.

Fixing broken csv files using awk

I have some csv files which are broken since there are junk such as control characters, enters and delimiters in some of the fields. An example mockup data without control characters:
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
I am processing above crap with awk:
BEGIN { FS=OFS=";" } # delimiters
NR==1 { nf=NF; } # header record is fine, use the NF
NR>1 {
if(NF<nf) { # if NF less that header's NF
prev=$0 # store $0
if(getline==1) { # read the "next" line
succ=$0 # set the "next" line to succ
$0=prev succ # rebuild a current record
}
}
if(NF!=nf) # if NF is still not adequate
$0=succ # expect original line to be malformed
if(NF!=nf) # if the "next" line was malformed as well
next # well skip "next" line and move to next
} 1
Naturally above program will fail records 4 and 6 (as the actual data has several fields where the extra delimiter may lurk) and 8 (since I only read the next line if NF is too short. I can live with loosing 4 and 6 but 8 might be doable?
Also, three successive ifs scream for a for loop but it's Friday afternoon here and my day is nearing $ and I just can't spin my head around it anymore. Do you guys have any brain reserve left I could borrow? Any best practices I didn't think of?
The key her is to keep a buffer containing the lines that are still not "complete"; once they are, print them and clear the buffer:
awk -F';' 'NF>=4 && !nf {print; next} # normal lines are printed
{ # otherwise,
if (nf>0) { # continue with a "broken" line by...
buff=buff OFS $0 # appending to the buffer
nf+=NF-1 # and adding NF
} else { # new "broken" line, so...
buff=$0 # start buffer
nf=NF # set number of fields already seen
}
}
nf>=4{ # once line is complete
print buff # print it
buff=""; nf=0 # and remove variables
}' file
Here, buff is such buffer and nf an internal counter to keep track of how many fields have been seen so far for the current record (like you did in your attempt).
We are adding NF-1 when appending to the buffer (that is, from the 2nd line of a broken stream) because a line with NF==1 does not add any record but just concatenates with the last field of the previous line:
8;data 18;cut # NF==3 |
in # NF==1 but it just continues $3 | all together, NF==4
three;data 38 # NF==2 but $1 continues $3 |
With your sample input:
$ awk -F';' 'NF>=4 && !nf {print; next} {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)} nf>=4{print buff; buff=""; nf=0}' a
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut in three;data 38
9;data 19;data 29;data 39

A shorter non-repeating alphanumeric code than UUID in MySQL

Is it possible for MySQL database to generate a 5 or 6 digit code comprised of only numbers and letters when I insert a record? If so how?
Just like goo.gl, bit.ly and jsfiddle do it. For exaple:
http://bit.ly/3PKQcJ
http://jsfiddle.net/XzKvP
cZ6ahF, 3t5mM, xGNPN, xswUdS...
So UUID_SHORT() will not work because it returns a value like 23043966240817183
Requirements:
Must be unique (non-repeating)
Can be but not required to be based off of primary key integer value
Must scale (grow by one character when all possible combinations have been used)
Must look random. (item 1234 cannot be BCDE while item 1235 be BCDF)
Must be generated on insert.
Would greatly appreciate code examples.
Try this:
SELECT LEFT(UUID(), 6);
I recommend using Redis for this task, actually. It has all the features that make this task suitable for its use. Foremost, it is very good at searching a big list for a value.
We will create two lists, buffered_ids, and used_ids. A cronjob will run every 5 minutes (or whatever interval you like), which will check the length of buffered_ids and keep it above, say, 5000 in length. When you need to use an id, pop it from buffered_ids and add it to used_ids.
Redis has sets, which are unique items in a collection. Think of it as a hash where the keys are unique and all the values are "true".
Your cronjob, in bash:
log(){ local x=$1 n=2 l=-1;if [ "$2" != "" ];then n=$x;x=$2;fi;while((x));do let l+=1 x/=n;done;echo $l; }
scale=`redis-cli SCARD used_ids`
scale=`log 16 $scale`
scale=$[ scale + 6]
while [ `redis-cli SCARD buffered_ids` -lt 5000 ]; do
uuid=`cat /dev/urandom | tr -cd "[:alnum:]" | head -c ${1:-$scale}`
if [ `redis-cli SISMEMBER used_ids $uuid` == 1]; then
continue
fi
redis-cli SADD buffered_ids $uuid
done
To grab the next uid for use in your application (in pseudocode because you did not specify a language)
$uid = redis('SPOP buffered_ids');
redis('SADD used_ids ' . $uid);
edit actually there's a race condition there. To safely pop a value, add it to used_ids first, then remove it from buffered_ids.
$uid = redis('SRANDMEMBER buffered_ids');
redis('SADD used_ids ' . $uid);
redis('SREM buffered_ids ' . $uid);

awk set elements in array

I have a large .csv file to to process and my elements are arranged randomly like this:
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
xxxxxx,xx,MLOCAL,341993,22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
where fields LOCAL, REMOTE, MLOCAL or MREMOTE are displayed like:
when they are displayed as pairs (LOCAL/REMOTE) if 3rd field is MLOCAL, and 4th field is MREMOTE, then 5th and 7th field represent the value and date of MLOCAL, and 6th and 8th represent the value and date of MREMOTE
when they are displayed as single (only LOCAL or only REMOTE) then the 4th and 5th fields represent the value and date of field 3.
Now, I have split these rows using:
nawk 'BEGIN{
while (getline < "'"$filedata"'")
split($0,ft,",");
name=ft[1];
ID=ft[2]
?=ft[3]
?=ft[4]
....................
but because I can't find a pattern for the 3rd and 4th field I'm pretty stuck to continue to assign var names for each of the array elements in order to use them for further processing.
Now, I tried to use "case" statement but isn't working for awk or nawk (only in gawk is working as expected). I also tried this:
if ( ft[3] == "MLOCAL" && ft[4]!= "MREMOTE" )
{
MLOCAL=ft[3];
MLOCAL_qty=ft[4];
MLOCAL_TIMESTAMP=ft[5];
}
else if ( ft[3] == MLOCAL && ft[4] == MREMOTE )
{
MLOCAL=ft[3];
MREMOTE=ft[4];
MOCAL_qty=ft[5];
MREMOTE_qty=ft[6];
MOCAL_TIMESTAMP=ft[7];
MREMOTE_TIMESTAMP=ft[8];
}
else if ( ft[3] == MREMOTE && ft[4] != MOCAL )
{
MREMOTE=ft[3];
MREMOTE_qty=ft[4];
MREMOTE_TIMESTAMP=ft[5];
..........................................
but it's not working as well.
So, if you have any idea how to handle this, I would be grateful to give me a hint in order to be able to find a pattern in order to cover all the possible situations from above.
EDIT
I don't know how to thank you for all this help. Now, what I have to do is more complex than I wrote above, I'll try to describe as simple as I can otherwise I'll make you guys pretty confused.
My output should be like following:
NAME,UNIQUE_ID,VOLUME_ALOCATED,MLOCAL_VALUE,MLOCAL_TIMESTMP,MLOCAL_limit,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,MREMOTE_VALUE,MREMOTE_TIMESTAMP,REMOTE_VALUE,REMOTE_TIMESTAMP
(where MLOCAL_limit and LOCAL_limit are a subtract result between VOLUME_ALOCATED and MLOCAL_VALUE or LOCAL_VALUE)
So, in my output file, fields position should be arranged like:
4th field =MLOCAL_VALUE,5th field =MLOCAL_TIMESTMP,7th field=LOCAL_VALUE,
8th field=LOCAL_TIMESTAMP,10th field=MREMOTE_VALUE,11th field=MREMOTE_TIMESTAMP,12th field=REMOTE_VALUE,13th field=REMOTE_TIMESTAMP
Now, an example would be this:
for the following input: name,ID,VOLUME_ALLOCATED,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
name,ID,VOLUME_ALLOCATED,REMOTE,234455,19/12/2012
I should process this line and the output should be this:
name,ID,VOLUME_ALLOCATED,33222,22/10/2012,MLOCAL_LIMIT, ,,,56,18/10/2012,,
7th ,8th, 9th,12th, and 13th fields are empty because there is no info related to: LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,REMOTE_VALUE, and REMOTE_TIMESTAMP
OR
name,ID,VOLUME_ALLOCATED,,,,,,,,,234455,9/12/2012
4th,5th,6th,7th,8th,9th,10thand ,11th, fields should be empty values because there is no info about: MLOCAL_VALUE,MLOCAL_TIMESTAMP,MLOCAL_LIMIT,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_LIMIT,MREMOTE_VALUE,MREMOTE_TIMESTAMP
VOLUME_ALLOCATED is retrieved from other csv file (called "info.csv") based on the ID field which is processed earlier in the script like:
info.csv
VOLUME_ALLOCATED,ID,CLIENT
5242881,64,subscriber
567743,24,visitor
data.csv
NAME,64,MLOCAL,341993,23/10/2012
NAME,24,LOCAL$REMOTE,2347$4324,19/12/2012$18/12/2012
Now, my code is this:
#! /usr/bin/bash
input="info.csv"
filedata="data.csv"
outfile="out"
nawk 'BEGIN{
while (getline < "'"$input"'")
{
split($0,ft,",");
volume=ft[1];
id=ft[2];
client=ft[3];
key=id;
volumeArr[key]=volume;
clientArr[key]=client;
}
close("'"$input"'");
while (getline < "'"$filedata"'")
{
gsub(/\$/,","); # substitute the $ separator with comma
split($0,ft,",");
volume=volumeArr[id]; # Get the volume from the volumeArr, using "id" as key
segment=clientArr[id]; # Get the client mode from the clientArr, using "id" as key
NAME=ft[1];
id=ft[2];
here I'm stuck, I can't find the right way to set the rest of the
fields since I don't know how to handle the 3rd and 4th fields.
? =ft[3];
? =ft[4];
Sorry, if I make you pretty confused but this is my current situation right now.
Thanks
You didn't provide the expected output from your sample input but here's a start to show how to get the values for the 2 different formats of input line:
$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
delete value # or use split("",value) if your awk cant delete arrays
if ($4 ~ /LOCAL|REMOTE/) {
value[$3] = $5
date[$3] = $7
value[$4] = $6
date[$4] = $8
}
else {
value[$3] = $4
date[$3] = $5
}
print
for (type in value) {
printf "%15s%15s%15s\n", type, value[type], date[type]
}
}
$ awk -f tst.awk file
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
MREMOTE 56 18/10/2012
MLOCAL 33222 22/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
MREMOTE 33222 22/10/2012
MLOCAL 56 18/10/2012
xxxxxx,xx,MLOCAL,*341993,22/10/2012*
MLOCAL *341993 22/10/2012*
xxxxxx,xx,MREMOTE,9356828,08/10/2012
MREMOTE 9356828 08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
REMOTE 15253 22/10/2012
LOCAL 19316 22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
REMOTE 1865871 22/10/2012
LOCAL 383666 22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
REMOTE 1180306134 19/10/2012
and if you post the expected output we could help you more.