How can I reliably parse a QuakeLive player profile using Perl?

I'm currently working on a Perl script to gather data from the QuakeLive website.
Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.
I tried regexing up to the favorites image, but without succeeding. If it's any use, I'm already using WWW::Mechanize in the script.
I think that the problem could be related to the class name of the paragraphs where those elements are, while the previous one was classless.
You can find an example profile HERE.
Note that for the previous part of the page, it worked using code like:
$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";

The immediate problem is that you have:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,
my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};
would save you the trouble of having to trim whitespace afterwards.
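As a rough sketch of how the fragile regex approach could be extended to all three favourites: only the Arena paragraph below is taken from the page, while the "Game Type" and "Weapon" paragraphs are assumed to follow the same `<b>Label:</b> value` layout, and `%fav` is just an illustrative name.

```perl
use strict;
use warnings;

# Assumed sample: only the Arena paragraph comes from the real page;
# the other two are guesses that follow the same layout.
my $content = <<'END_HTML';
<p class="prf_faves">
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
<p class="prf_faves">
<b>Game Type:</b> Clan Arena
<div class="cl"></div>
</p>
<p class="prf_faves">
<b>Weapon:</b> Rocket Launcher
<div class="cl"></div>
</p>
END_HTML

my %fav;
for my $label ('Arena', 'Game Type', 'Weapon') {
    # Capture everything up to the next tag, then trim trailing whitespace.
    next unless $content =~ m{<b>\Q$label\E:</b>\s*([^<]+)};
    (my $value = $1) =~ s/\s+\z//;
    $fav{$label} = $value;
}

print "$_: $fav{$_}\n" for sort keys %fav;
```

Capturing with `[^<]+` and trimming afterwards keeps multi-word values such as "Clan Arena" intact, which `\S+` would cut short.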
Note that it is easy for such regex based solutions to be fooled with simple things like commented out snippets in the source. E.g., if the source were to be changed to:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
your script would be in trouble, whereas a solution using an HTML parser would not.
An example using HTML::TokeParser::Simple:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );
while ( my $tag = $p->get_tag('p') ) {
next unless $tag->is_start_tag;
next unless defined (my $class = $tag->get_attr('class'));
next unless grep { /^prf_faves\z/ } split ' ', $class;
my $fav = $p->get_tag('b');
my $type = $p->get_text('/b');
my $value = $p->get_text('/p');
$value =~ s/^\s+|\s+\z//g;   # trim both ends
print "$type $value\n";      # $type already ends with a colon
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
And, here is an example using HTML::TreeBuilder:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;
use YAML;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');
my @p = $tree->look_down(_tag => 'p', sub {
return unless defined (my $class = $_[0]->attr('class'));
return unless grep { /^prf_faves\z/ } split ' ', $class;
return 1;
}
);
for my $p ( @p ) {
my $text = $p->as_text;
$text =~ s/^\s+//;
my ($type, $value) = split ': ', $text;
print "$type: $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.

Using regular expressions for this particular task is less than ideal. There are just too many things that might change, and you're not taking advantage of inherent structure of HTML pages. Have you considered using something like HTML::TreeBuilder instead? It will allow you to say "get me the value of the 3rd table cell in the table named weapons", etc.

Related

Regex to detect all characters outside of the <img> tag

I don't have experience in regex. I am just trying to find a way to detect
and delete every character outside of the img tag. In other words I want to
strip a given html code from all text and tags and just keep everything within
the img tags. The result should show just the image tags like that:
<img src="sourcehere">
Is there a way to do this?
UPDATE:
I need specifically a regex that goes in preg_replace.
This is what I have done, but it doesn't work:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('(?i)<(?!img|/img).*?>', '', $buffer);
echo $buffer; /* should output <img src='image.jpg'> but it doesn't */
What are your plans: do you want to log it to a file, just display it in a console, or output it in some other way? This worked for me, but actually 'stringing' it out might take extra work.
This is jQuery. From my understanding, you want to remove everything but the images from your document.
var arr2 = Array.prototype.slice.call( document.images );
jQuery('body').contents().remove();
for(i = 0; i < arr2.length;i++){
jQuery('body').append(arr2[i])
}
This doesn't need to be some big and fancy regex.
<img[^>]*>
This matches the text "<img", followed by any number of characters other than ">", followed by the closer ">".
Once you have the matches you would just want to write out the matches to a string, or to the document, or however you want to represent them.
EDIT:
To complete what the OP is showing in PHP, you would want to call match instead of replace. You don't really need to replace all of the non-matching sections. You can just keep the results:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
preg_match_all("/<img[^>]*>/i", $buffer, $matches); // preg_match() would stop at the first image
foreach ($matches[0] as $match){
echo $match;
}
prints out:
<img src='image.jpg'>
EDIT:
The problem I am seeing with trying to replace every other tag will be when you have contents between the tags. If you don't care about that, then here is something that works using preg_replace().
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('/(?i)<\\/*(?!img)[^>]*>/', '', $buffer);
echo $buffer; /* outputs <img src='image.jpg'> */
Use
preg_replace('/<img[^>]*>(*SKIP)(*FAIL)|./si', '', $buffer)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
<img '<img'
--------------------------------------------------------------------------------
[^>]* any character except: '>' (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
> '>'
--------------------------------------------------------------------------------
(*SKIP)(*FAIL) skips the match
--------------------------------------------------------------------------------
| or
--------------------------------------------------------------------------------
. any character

Parse html using Perl works for 2 lines but not multiple

I have written the following Perl script-
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
User: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
my $source = "foo";
my @time = "10-14-2011";
my $name = $html->find('a')->as_text;
my $comment = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');
Which outputs- foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish
Which is perfect and what I wanted from the test html but
this only works fine when I put in the aforementioned HTML, which I did for test purposes.
However, the full HTML file has multiple instances of 'a' and 'b' elements, so when I print the results, these columns are blank.
How can I account for multiple values for specific searches?
Without sight of your real HTML it is hard to help, but $html->find returns a list of <a> elements, so you could write something like
foreach my $anchor ($html->find('a')) {
print $anchor->as_text, "\n";
}
But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
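A minimal sketch of what that could look like, assuming markup in which each entry tags the user name and the time with a class: the `user` and `time` class names and the sample HTML are assumptions for illustration, not taken from the real file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: the class names "user" and "time" are assumptions.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<div><span class="time">1 h</span> <a class="user" href="/u/1">Alice</a>: <b>big</b> <b>fish</b></div>
<div><span class="time">2 h</span> <a class="user" href="/u/2">Bob</a>: <b>small</b> pond</div>
<a href="/about">About</a>
END_HTML

# In list context look_down returns every element matching all criteria,
# so the unrelated "About" link is skipped.
my @names = map { $_->as_text } $tree->look_down(_tag => 'a',    class => 'user');
my @times = map { $_->as_text } $tree->look_down(_tag => 'span', class => 'time');

print "names: @names\n";
print "times: @times\n";

$tree->delete;
```

Collecting the results into arrays, rather than calling `->as_text` on a single `find` result, is what handles multiple occurrences.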
I cannot begin to guess about your problem with comments without seeing what data you are dealing with.
If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo element with a ~text tag name and a text attribute, for instance <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
User: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');
OUTPUT
1 h
User
: There are not enough
big
fish
in the lake ;

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!
I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
#put all opened tags into an array
preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result );
$openedtags = $result[1];
#put all closed tags into an array
preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
$closedtags = $result[1];
$len_opened = count ( $openedtags );
# all tags are closed
if( count ( $closedtags ) == $len_opened )
{
return $html;
}
$openedtags = array_reverse ( $openedtags );
# close tags
for( $i = 0; $i < $len_opened; $i++ )
{
if ( !in_array ( $openedtags[$i], $closedtags ) )
{
$html .= "</" . $openedtags[$i] . ">";
}
else
{
unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
}
}
return $html;
}
// close opened html tags
?>
you can use this function like
<?php echo closetags("your content <p>test test"); ?>
You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion or another, here for example PHP.
This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.
You can parse the data entered by the user. That's what an XML parser does. You may need to replace the special HTML or XML characters such as '<', '>' and '&' with their entities '&lt;', '&gt;' and '&amp;'.
In this way you can achieve whatever you want.
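A minimal sketch of that escaping step, written in Perl; `escape_html` is a hypothetical helper name, not a library function.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are special in HTML
# with their entities, in a single pass so nothing is escaped twice.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

my $escaped = escape_html('Hi, my name is <b>John');
print "$escaped\n";
```

The trade-off is that every tag is then shown literally instead of being rendered, including the tags the site intends to allow.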
There is a way to do this using HTML and javascript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
<div>Hi, my name is <b>John</div>
</noscript>
... and then add javascript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
$(this).replaceWith(this.innerText);
});
Note that users without javascript will still experience the "bleeding" that you are trying to avoid.

Why doesn't the match operator match anything?

I'm trying to parse this HTML block:
<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">
to capture the redirect link:
/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=
and video title:
The Valley Downs Chicago
When I use this simple Perl code:
foreach $_ (@promotedVideos)
{
if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/six)
{
print $1;
print $2;
}
}
nothing prints. While I'm troubleshooting this, I thought I'd ask you the experts if you see anything wrong or problematic. Thanks so much in advance for your help!
The /x modifier is the problem: it makes Perl ignore literal whitespace inside the pattern. Remove it.
That is, it should be
if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si)
With /x, your regex is equivalent to the following:
/\s<divclass="v120WrapperInner"><ahref="([^"]*)"title="([^"]*)"><img/si
which will not match.
The \s at the beginning may also break things, because it requires a whitespace character immediately before <div.
This is the code I've used for testing:
use strict;
my $inp = '<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&adtype=pyv&event=ad&usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">';
print "$inp\n";
if ( $inp =~ /<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si )
{
print "m:\n$1\n$2\n";
}
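To see the effect of /x in isolation, here is a small sketch on a shortened, made-up tag (the `$text` value is an assumption, not the real page). It compares the same pattern with and without /x, and with the whitespace written explicitly.

```perl
use strict;
use warnings;

my $text = '<a href="x" title="y">';

# Under /x the literal space in the pattern is ignored, so this looks for "<ahref".
my $with_x    = $text =~ /<a href="([^"]*)"/x      ? 1 : 0;
# Without /x the space is matched literally.
my $without_x = $text =~ /<a href="([^"]*)"/       ? 1 : 0;
# With /x, whitespace has to be written explicitly, for example as \s+.
my $explicit  = $text =~ /<a \s+ href="([^"]*)"/x  ? 1 : 0;

print "with /x: $with_x, without /x: $without_x, explicit \\s+: $explicit\n";
```

If you want to keep /x for readability, writing each space as `\s`, `\ ` or `[ ]` keeps the pattern matching.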
Okay, this is not exactly what you are asking, but I think (based in this and your older question) that you are parsing HTML.
Let me tell you this: regexes aren't the solution. You should use HTML::TreeBuilder to parse HTML documents, because HTML documents are horribly messy.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $div ($root->find_by_tag_name('div')) {
if (($div->attr('class') // '') eq 'v120WrapperInner') {   # class may be undefined
foreach my $link ($div->find_by_tag_name('a')) {           # iterate over every <a>, not just the first
print "m:\n", $link->attr('href'), "\n", $link->attr('title'), "\n";
}
}
}
It is good that you are gaining experience with regex in perl, but for this type of work you might consider using a DOM parser like XML::DOM.
G'day,
If you're having problems understanding regexps, can I suggest having a read of the regexp intro in Dale Dougherty's excellent book "sed & awk" (sanitised Amazon link).
Definitely one of the best intros to regexps around.
HTH
cheers,

How can I extract HTML img tags wrapped in anchors in Perl?

I am working on parsing HTML to obtain all the hrefs that match a particular url (let's call it "target url") and then get the anchor text. I have tried LinkExtractor, TokenParser, Mechanize, TreeBuilder modules. For below HTML:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like to see blank value or a string like "image" returned so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using other module or Perl regex?
Thanks,
You should post some examples that you tried with "LinkExtractor, TokenParser, Mechanize & TreeBuilder" so that we can help you.
Here is something which works for me in pQuery:
use feature 'say';
use pQuery;
my $data = '
<html>
Not yahoo anchor text
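<a href="http://www.yahoo.com"><img src="somepath/nw.gif" alt="Open this result in new window"></img></a>
<a href="http://www.yahoo.com">just text for yahoo</a>
<a href="http://www.yahoo.com">anchor text only<img src="blah" alt="alt text"/></a>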
</html>
';
pQuery( $data )->find( 'a' )->each(
sub {
say $_->innerHTML
if $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
}
);
# produces:
#
# => <img alt="Open this result in new window" src="somepath/nw.gif"></img>
# => just text for yahoo
# => anchor text only<img /="/" alt="alt text" src="blah"></img>
#
And if you just want the text:
pQuery( $data )->find( 'a' )->each(
sub {
return unless $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
if ( my $text = pQuery($_)->text ) { say $text }
}
);
# produces:
#
# => just text for yahoo
# => anchor text only
#
/I3az/
Use a proper parser (like HTML::Parser or HTML::TreeBuilder). Using regular expressions to parse SGML (HTML/XML included) isn't really all that effective because of funny multiline tags and attributes like the one you've run into.
If the HTML you are working with is fairly close to well formed you can usually load it into an XML module that supports HTML and use it to find and extract data from the parts of the document you are interested in.
My method of choice is XML::LibXML and XPath.
use feature 'say';
use XML::LibXML;
my $parser = XML::LibXML->new();
my $html = ...;
my $doc = $parser->parse_html_string($html);
my @links = $doc->findnodes('//a[@href = "http://example.com"]');
for my $node (@links) {
say $node->textContent();
}
The string passed to findnodes is an XPath expression that looks for all 'a' element descendants of $doc that have an href attribute equal to "http://example.com".