Why doesn't the match operator match anything? - html

I'm trying to parse this HTML block:
<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">
to capture the redirect link:
/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=
and video title:
The Valley Downs Chicago
When I use this simple Perl code:
foreach $_ (@promotedVideos)
{
if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/six)
{
print $1;
print $2;
}
}
nothing prints. While I'm troubleshooting this, I thought I'd ask you the experts if you see anything wrong or problematic. Thanks so much in advance for your help!

The /x modifier is the problem: it makes Perl ignore unescaped whitespace inside the pattern. Remove it.
That is, it should be
if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si)
With /x, the literal spaces in your pattern are dropped, so your regex is equivalent to the following:
/\s<divclass="v120WrapperInner"><ahref="([^"]*)"title="([^"]*)"><img/si
which will not match.
Also, the \s at the beginning may break things: it requires a whitespace character immediately before <div, so the pattern fails when the string starts with the tag.
This is the code I've used for testing:
use strict;
my $inp = '<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">';
print "$inp\n";
if ( $inp =~ /<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si )
{
print "m:\n$1\n$2\n";
}
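If you would rather keep /x for readability, a minimal sketch along these lines should work, assuming the markup looks like the sample above. Every literal space is written as \s+ so that /x cannot drop it. The sub name extract_link and the shortened href in the sample string are illustrative stand-ins, not part of the original code.

```perl
use strict;
use warnings;

# Shortened stand-in for the HTML block from the question.
my $html = '<div class="v120WrapperInner"><a href="/redirect?q=example" '
         . 'title="The Valley Downs Chicago"><img class="vimg120">';

# Under /x, unescaped whitespace in the pattern is ignored and "#" starts
# a comment, so each literal space is spelled out as \s+.
sub extract_link {
    my ($text) = @_;
    if ($text =~ m{
            <div \s+ class="v120WrapperInner">   # wrapper div
            <a \s+ href="([^"]*)"                # capture 1: link target
            \s+ title="([^"]*)">                 # capture 2: title text
            <img
        }xi) {
        return ($1, $2);
    }
    return;
}

my ($href, $title) = extract_link($html);
print "$href\n$title\n";
```

The trade-off is that the pattern now also accepts runs of whitespace (including newlines) between attributes, which is usually what you want when matching HTML.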

Okay, this is not exactly what you are asking, but I think (based on this and your older question) that you are parsing HTML.
Let me tell you this: regexes aren't the right tool for that. You should use HTML::TreeBuilder to parse HTML documents, because real-world HTML is horribly messy.
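#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
# Reads the HTML from the __DATA__ section at the end of this script.
my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $div ($root->find_by_tag_name('div')) {
    # attr() returns undef for a div without a class attribute
    next unless ($div->attr('class') // '') eq 'v120WrapperInner';
    # iterate over every <a> inside the div, not just the first one
    foreach my $link ($div->find_by_tag_name('a')) {
        print "m:\n", $link->attr('href'), "\n", $link->attr('title'), "\n";
    }
}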

It is good that you are gaining experience with regexes in Perl, but for this type of work you might consider using a DOM parser like XML::DOM.

G'day,
If you're having problems understanding regexps, can I suggest having a read of the regexp intro in Dale Dougherty's excellent book "sed & awk"?
Definitely one of the best intros to regexps around.
HTH
cheers,

Related

Regex to detect all characters outside of the <img> tag

I don't have experience in regex. I am just trying to find a way to detect
and delete every character outside of the img tag. In other words I want to
strip a given html code from all text and tags and just keep everything within
the img tags. The result should show just the image tags like that:
<img src="sourcehere">
Is there a way to do this?
UPDATE:
I need specifically a regex that goes in preg_replace.
This is what I have done, but it doesn't work:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('(?i)<(?!img|/img).*?>', '', $buffer);
echo $buffer; /* should output <img src='image.jpg'> but it doesn't */
What are your plans -- do you want to log it to a file or just display in a console, or output it in some way. This worked for me, but actually 'stringing' it out might take extra work.
This is jQuery. As I understand it, you want to remove everything but the images from your document.
var arr2 = Array.prototype.slice.call( document.images );
jQuery('body').contents().remove();
for (var i = 0; i < arr2.length; i++) {
jQuery('body').append(arr2[i]);
}
This doesn't need to be some big and fancy regex.
<img[^>]*>
This matches the text "<img", then any number of characters other than ">", followed by the closer ">".
Once you have the matches you would just want to write out the matches to a string, or to the document, or however you want to represent them.
EDIT:
To complete what the OP is showing in PHP, you would want to collect the matches instead of replacing. You don't really need to replace all of the non-matching sections. You can just keep the results. Note that preg_match() stops after the first match, so use preg_match_all() to get every image:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
// $matches[0] holds the full text of every match
preg_match_all("/<img[^>]*>/i", $buffer, $matches);
foreach ($matches[0] as $match){
echo $match;
}
prints out:
<img src='image.jpg'>
EDIT:
The problem I am seeing with trying to replace every other tag will be when you have contents between the tags. If you don't care about that, then here is something that works using preg_replace().
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('/(?i)<\\/*(?!img)[^>]*>/', '', $buffer);
echo $buffer; /* outputs <img src='image.jpg'> */
Use
preg_replace('/<img[^>]*>(*SKIP)(*FAIL)|./si', '', $buffer)
EXPLANATION
--------------------------------------------------------------------------------
  <img                     '<img'
--------------------------------------------------------------------------------
  [^>]*                    any character except: '>' (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skips the match
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  .                        any character

preg_replace not working

I have this function in my website.
function autolink($content) {
$pattern = "/>>[0-9]/i" ;
$replacement = ">>$0";
return preg_replace($pattern, $replacement, $content, -1);
This is for making certain characters into a clickable hyperlink.
For example, when a user types '>>4' in a thread to refer to reply number 4, the function should turn it into a link.
But it's not working. The characters are not converted into a hyperlink. They just remain as plain text, not clickable.
Could someone tell me what is wrong with the function?
So the objective is to convert:
This is a reference to the >>4 reply
...into:
This is a reference to the <a href="...">&gt;&gt;4</a> reply
...where "&gt;" is the HTML entity for ">". (remember, you don't want to create HTML issues)
The problems: (1) you forgot to escape the quotes in the replacement; (2) since you want to isolate the number, you need to use parentheses to create a sub-pattern for later reference.
Once you do this, you arrive at:
function autolink($contents) {
    // [0-9]+ so that multi-digit reply numbers are captured whole
    return preg_replace( "/>>([0-9]+)/i",
        "<a href=\"...\">&gt;&gt;$1</a>",
        $contents,
        -1
    );
}
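For comparison with the Perl questions on this page, here is a rough sketch of the same capture-group idea as a Perl substitution. The "#reply-N" link target is a hypothetical anchor scheme chosen for illustration; use whatever your page actually links to.

```perl
use strict;
use warnings;

# Wrap each ">>N" reference in a link. The "#reply-N" target is a
# hypothetical anchor scheme, not something taken from the question.
sub autolink {
    my ($content) = @_;
    # The parentheses capture the number so it can be reused as $1.
    $content =~ s{>>([0-9]+)}{<a href="#reply-$1">&gt;&gt;$1</a>}g;
    return $content;
}

print autolink('This is a reference to the >>4 reply'), "\n";
```

Using s{...}{...} delimiters avoids having to escape the quotes and the slash in the replacement.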
Good luck

Parse html using Perl works for 2 lines but not multiple

I have written the following Perl script-
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
my $source = "foo";
my @time = "10-14-2011";
my $name = $html->find('a')->as_text;
my $comment = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');
Which outputs- foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish
Which is perfect and what I wanted from the test html but
this only works fine when I put in the aforementioned HTML, which I did for test purposes.
However, the full HTML file contains multiple 'a' and 'b' elements, for instance, and when I print the results, these columns are blank.
How can I account for multiple values for specific searches?
Without sight of your real HTML it is hard to help, but $html->find returns a list of <a> elements, so you could write something like
foreach my $anchor ($html->find('a')) {
print $anchor->as_text, "\n";
}
But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
I cannot begin to guess about your problem with comments without seeing what data you are dealing with.
If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo element with a ~text tag name and a text attribute, for instance <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');
OUTPUT
1 h
User
: There are not enough
big
fish
in the lake ;

How can I reliably parse a QuakeLive player profile using Perl?

I'm currently working on a Perl script to gather data from the QuakeLive website.
Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.
I tried regexing up to the favorites image, but without succeeding. If it's any use, I'm already using WWW::Mechanize in the script.
I think that the problem could be related to the class name of the paragraphs where those elements are, while the previous one was classless.
You can find an example profile HERE.
Note that for the previous part of the page, it worked using code like:
$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";
The immediate problem is that you have:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,
my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};
would save you the trouble of having to trim whitespace afterwards.
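For values that can contain spaces (a game type such as "Clan Arena", say), the first pattern plus an explicit trim is the safer of the two. A minimal sketch, assuming the markup looks like the fragment above; the helper name favourite() and the sample fragment are illustrative, not taken from the real page:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that follows <b>LABEL:</b>,
# up to the next tag, with surrounding whitespace removed.
sub favourite {
    my ($content, $label) = @_;
    my ($value) = $content =~ m{<b>\Q$label\E:</b>\s*([^<]+)};
    return unless defined $value;
    $value =~ s/\s+\z//;    # [^<]+ also captures the trailing newline
    return $value;
}

# Illustrative fragment in the shape of the profile markup.
my $content = <<'END_HTML';
<p class="prf_faves">
<b>Game Type:</b> Clan Arena
<div class="cl"></div>
</p>
END_HTML

print favourite($content, 'Game Type'), "\n";
```

This is still a regex over HTML, so the caveats about fragility apply unchanged.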
Note that it is easy for such regex based solutions to be fooled with simple things like commented out snippets in the source. E.g., if the source were to be changed to:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
your script would be in trouble, whereas a solution using an HTML parser would not.
An example using HTML::TokeParser::Simple:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );
while ( my $tag = $p->get_tag('p') ) {
next unless $tag->is_start_tag;
next unless defined (my $class = $tag->get_attr('class'));
next unless grep { /^prf_faves\z/ } split ' ', $class;
my $fav = $p->get_tag('b');
my $type = $p->get_text('/b');
my $value = $p->get_text('/p');
$value =~ s/^\s+|\s+\z//g;    # trim both ends
print "$type $value\n";       # $type already ends with a colon
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
And, here is an example using HTML::TreeBuilder:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;
use YAML;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');
my @p = $tree->look_down(_tag => 'p', sub {
        return unless defined (my $class = $_[0]->attr('class'));
        return unless grep { /^prf_faves\z/ } split ' ', $class;
        return 1;
    }
);
for my $p ( @p ) {
    my $text = $p->as_text;
    $text =~ s/^\s+//;
    my ($type, $value) = split ': ', $text;
    print "$type: $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.
Using regular expressions for this particular task is less than ideal. There are just too many things that might change, and you're not taking advantage of inherent structure of HTML pages. Have you considered using something like HTML::TreeBuilder instead? It will allow you to say "get me the value of the 3rd table cell in the table named weapons", etc.

How can I convert a file to an HTML table using Perl?

I am trying to write a simple Perl CGI script that:
runs a CLI script
reads the resulting .out file and converts the data in the file to an HTML table.
Here is some sample data from the .out file:
10.255.202.1 2472327594 1720341
10.255.202.21 2161941840 1484352
10.255.200.0 1642646268 1163742
10.255.200.96 1489876452 1023546
10.255.200.26 1289738466 927513
10.255.202.18 1028316222 706959
10.255.200.36 955477836 703926
Any help would be much appreciated.
The following is untested and probably needs a lot of polishing
but it gives a rough idea:
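use strict;
use warnings;
use CGI qw/:standard *table/;
print
    header,                               # HTTP header, required for CGI output
    start_html('clicommand results'),
    start_table;
# Read the command's output through a pipe.
open(my $csvh, '-|', 'clicommand') or die "Cannot run clicommand: $!";
while (<$csvh>) {
    # split with no arguments splits $_ on whitespace
    print Tr(map { td($_) } split);
}
close($csvh);
print
    end_table,
    end_html;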
This doesn't directly answer your question, but is it possible to use AWK instead? It shouldn't be too difficult to wrap the whole content, then each column entry with the appropriate html tags to create a basic table.
You'll very likely want to make the HTML prettier by using a CSS stylesheet or adding borders to the table, but here's a simple start.
#!/usr/bin/perl
use strict;
use warnings;
my $output = `cat dat`;
my @lines = split /\n/, $output;
my @data;
foreach my $line (@lines) {
    chomp $line;
    my @d = split /\s+/, $line;
    push @data, \@d;
}
print <<HEADER;
<html>
<table>
HEADER
foreach my $d (@data) {
    print "\t", "<tr>";
    print map { "<td>$_</td>" } @$d;
    print "</tr>", "\n";
}
print <<FOOTER;
</table>
</html>
FOOTER
This makes the following output:
<html>
<table>
<tr><td>10.255.202.1</td><td>2472327594</td><td>1720341</td></tr>
<tr><td>10.255.202.21</td><td>2161941840</td><td>1484352</td></tr>
<tr><td>10.255.200.0</td><td>1642646268</td><td>1163742</td></tr>
<tr><td>10.255.200.96</td><td>1489876452</td><td>1023546</td></tr>
<tr><td>10.255.200.26</td><td>1289738466</td><td>927513</td></tr>
<tr><td>10.255.202.18</td><td>1028316222</td><td>706959</td></tr>
<tr><td>10.255.200.36</td><td>955477836</td><td>703926</td></tr>
</table>
</html>
To understand how to modify the look of your HTML tables, the w3schools website entry on the table tag is a good start.
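One thing the scripts above skip is escaping: if the .out file ever contains characters such as < or &, they would end up in the page as markup. A minimal sketch that builds the rows with escaping, using only core Perl; the sub names escape_html and table_row are illustrative, and the two sample lines are copied from the question's data.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Turn one whitespace-separated line into a <tr> row.
sub table_row {
    my ($line) = @_;
    my @cells = split ' ', $line;    # also discards leading whitespace
    return '<tr>' . join('', map { '<td>' . escape_html($_) . '</td>' } @cells) . '</tr>';
}

print "<table>\n";
print "\t", table_row($_), "\n" for (
    '10.255.202.1 2472327594 1720341',
    '10.255.202.21 2161941840 1484352',
);
print "</table>\n";
```

Keeping the row-building in its own sub makes it easy to feed it lines from a file handle or a pipe instead of the hard-coded list.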