How can I extract HTML img tags wrapped in anchors in Perl?

I am parsing HTML to obtain all the hrefs that match a particular URL (let's call it the "target url") and then get the anchor text. I have tried the LinkExtractor, TokenParser, Mechanize, and TreeBuilder modules. For the HTML below:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like a blank value or a string like "image" to be returned, so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using another module or a Perl regex?
Thanks,

You should post some examples that you tried with "LinkExtractor, TokenParser, Mechanize & TreeBuilder" so that we can help you.
Here is something which works for me in pQuery:
use 5.010;  # enables say
use pQuery;
my $data = '
<html>
Not yahoo anchor text
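<a href="http://www.yahoo.com"><img src="somepath/nw.gif" alt="Open this result in new window"></img></a>
<a href="http://www.yahoo.com">just text for yahoo</a>
<a href="http://www.yahoo.com">anchor text only<img src="blah" alt="alt text"/></a>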
</html>
';
pQuery( $data )->find( 'a' )->each(
sub {
say $_->innerHTML
if $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
}
);
# produces:
#
# => <img alt="Open this result in new window" src="somepath/nw.gif"></img>
# => just text for yahoo
# => anchor text only<img /="/" alt="alt text" src="blah"></img>
#
And if you just want the text:
pQuery( $data )->find( 'a' )->each(
sub {
return unless $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
if ( my $text = pQuery($_)->text ) { say $text }
}
);
# produces:
#
# => just text for yahoo
# => anchor text only
#
/I3az/
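The same idea, sketched outside Perl: collect only the text nodes inside each matching anchor, so an img alt attribute never counts as anchor text, and fall back to the string "image" when nothing is left. This is a minimal illustration using Python's standard html.parser, not the pQuery API; the class name AnchorText, the helper anchor_texts and the sample markup are made up for the example, and the "image" fallback is also returned for an anchor that is simply empty.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect (href, text) for every <a href=...>, ignoring img alt attributes."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._chunks = []    # text nodes seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._chunks).strip()
            # Assumption for this sketch: no text at all is reported as "image".
            self.links.append((self._href, text or "image"))
            self._href = None


def anchor_texts(html, target):
    """Return the anchor text of every link whose href equals target."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return [text for href, text in parser.links if href == target]


sample = '''
<a href="http://www.yahoo.com"><img src="somepath/nw.gif" alt="Open this result in new window"></a>
<a href="http://www.yahoo.com">just text for yahoo</a>
<a href="http://www.yahoo.com">anchor text only<img src="blah" alt="alt text"/></a>
'''

print(anchor_texts(sample, "http://www.yahoo.com"))
# ['image', 'just text for yahoo', 'anchor text only']
```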

Use a proper parser (like HTML::Parser or HTML::TreeBuilder). Using regular expressions to parse SGML (HTML/XML included) isn't really all that effective because of funny multiline tags and attributes like the one you've run into.

If the HTML you are working with is fairly close to well formed you can usually load it into an XML module that supports HTML and use it to find and extract data from the parts of the document you are interested in.
My method of choice is XML::LibXML and XPath.
use 5.010;  # enables say
use XML::LibXML;
my $parser = XML::LibXML->new();
my $html = ...;
my $doc = $parser->parse_html_string($html);
my @links = $doc->findnodes('//a[@href = "http://example.com"]');
for my $node (@links) {
say $node->textContent();
}
The string passed to findnodes is an XPath expression that looks for all 'a' element descendants of $doc that have an href attribute equal to "http://example.com".
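For readers who want to try the XPath idea without installing XML::LibXML, here is a rough equivalent using Python's standard xml.etree.ElementTree. It is only a sketch: ElementTree supports a limited XPath subset and, unlike parse_html_string, requires well-formed markup. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; ElementTree will not accept sloppy HTML.
markup = """<html><body>
<a href="http://example.com">first link</a>
<a href="http://example.org">other site</a>
<p><a href="http://example.com">second <b>link</b></a></p>
</body></html>"""

root = ET.fromstring(markup)

# './/a' selects every <a> descendant; the predicate keeps those whose
# href attribute equals the target URL.
links = root.findall(".//a[@href='http://example.com']")

# itertext() walks all nested text nodes, similar to textContent().
texts = ["".join(node.itertext()) for node in links]
print(texts)  # ['first link', 'second link']
```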

Related

Whats the most efficient/nicest way to extract a text value from a HTML tag using Symfony DOM Crawler?

Given the following HTML code snippet:
<div class="item">
large
<span class="some-class">size</span>
</div>
I'm looking for the best way to extract the string "large" using Symfony's Crawler.
$crawler = new Crawler($html);
Here I could use $crawler->html() then apply a regex search. Is there a better solution?
Or how would you do it exactly?
I've just found a solution that looks the cleanest to me:
$crawler = new Crawler($html);
$result = $crawler->filterXPath('//text()')->text();
This is a bit tricky as the text that you're trying to get is a text node that the DOMCrawler component doesn't (as far as I know) allow you to extract. Thankfully DOMCrawler is just a layer over the top of PHP's DOM classes which means you could probably do something like:
$crawler = new Crawler($html);
$crawler = $crawler->filterXPath('//div[@class="item"]');
$domNode = $crawler->getNode(0);
$text = null;
foreach ($domNode->childNodes as $domChild) {
if ($domChild instanceof \DOMText) {
$text = $domChild->wholeText;
break;
}
}
This wouldn't help with HTML like:
<div>
text
<span>hello</span>
other text
</div>
So you would only get "text", not "text other text" in this instance. Take a look at the DOMText documentation for more details.
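The difference between "the first text node" and "all direct text nodes" can be seen in a small sketch. It uses Python's standard xml.etree.ElementTree rather than PHP's DOM classes, purely as an illustration: there, the text before the first child lives in .text and the text after each child lives in that child's .tail.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div>\ntext\n<span>hello</span>\nother text\n</div>")

# Text before the first child element only (what the loop with `break` finds).
first_only = (div.text or "").strip()

# All direct text: the leading text plus whatever follows each child element.
pieces = [div.text or ""] + [child.tail or "" for child in div]
all_direct = " ".join(p.strip() for p in pieces if p.strip())

print(first_only)  # text
print(all_direct)  # text other text
```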
$crawler = new Crawler($html);
$node = $crawler->filterXPath('//div[@class="item"]');
$domElement = $node->getNode(0);
foreach ($node->children() as $child) {
$domElement->removeChild($child);
}
dump($node->text()); die();
Afterwards you still have to trim the whitespace.

as_html in HTML::TagParser

I'm working in Perl.
I would like to ask whether HTML::TagParser has something like
$value->as_html()
from HTML::TreeBuilder.
I extracted the tag I needed with HTML::TagParser, but now the only option is:
$value->innerText();
which gives me only the text, without the HTML tags.
Or can I somehow pass the result from HTML::TagParser to HTML::TreeBuilder and get my HTML tags that way?
The HTML::TagParser does not only read the element content. It also keeps the element name and the attribute key/value pairs for each selected element. Therefore you can easily reproduce the complete HTML code of the element.
Actually, the HTML::TagParser CPAN page contains an example for this: The following code extracts all <a>nchor tags from a web page and reproduces them into an HTML fragment listing precisely these tags.
use HTML::TagParser;
my $url = 'http://www.kawa.net/xp/index-e.html';
my $html = HTML::TagParser->new( $url );
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
print "<$tagname";
foreach my $key ( sort keys %$attr ) {
print " $key=\"$attr->{$key}\"";
}
if ( $text eq "" ) {
print " />\n";
} else {
print ">$text</$tagname>\n";
}
}
This works pretty well for simple element scanning. For more complex tasks (e.g. mixed inner HTML content) I would prefer to work with HTML::Parser.
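The rebuild-from-parts approach, and its limit with mixed inner content, can be sketched as follows. This is not HTML::TagParser; it is an illustration with Python's standard html.parser, and the class name AnchorRebuilder and the sample markup are invented. Note how the nested <b> element is flattened to plain text, which is the case where a fuller parser is the better tool.

```python
from html import escape
from html.parser import HTMLParser


class AnchorRebuilder(HTMLParser):
    """Rebuild each <a> element from its tag name, attributes and inner text."""

    def __init__(self):
        super().__init__()
        self.fragments = []  # rebuilt HTML strings, one per anchor
        self._attrs = None   # attributes of the anchor currently open
        self._text = []      # text nodes seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._attrs = attrs
            self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._attrs is None:
            return
        # Sort attributes by name, as the Perl example does with sort keys.
        parts = []
        for key, value in sorted(self._attrs, key=lambda kv: kv[0]):
            parts.append(' %s="%s"' % (key, escape(value or "", quote=True)))
        text = "".join(self._text).strip()
        if text:
            self.fragments.append("<a%s>%s</a>" % ("".join(parts), escape(text)))
        else:
            self.fragments.append("<a%s />" % "".join(parts))
        self._attrs = None


rebuilder = AnchorRebuilder()
rebuilder.feed('<p><a href="/a?b=1&amp;c=2" class="x">Home <b>page</b></a> <a name="top"></a></p>')
rebuilder.close()
for fragment in rebuilder.fragments:
    print(fragment)
# <a class="x" href="/a?b=1&amp;c=2">Home page</a>
# <a name="top" />
```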

How extract text from a tag using Nokogiri

Example:
<p>http://localhost:3000/replies/279<br><p>
Currently using Nokogiri to grab the href from the <a>:
doc.search('a').each do |node|
href = node.attributes['href'].try(:value)
end
I need to make sure what's in the text part is what's in the href and I'm not sure how to extract that.
Here are the basics for checking:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>http://localhost:3000/replies/279<br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => false
Modifying the HTML so the HREF and text match:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p><a href="http://localhost:3000/replies/279">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => true
at returns only the first node that matches the selector, so if you're looking to check multiple nodes you'll want to use search and iterate over the NodeSet it returns.

Perl parse links from HTML Table

I'm trying to get links from a table in HTML. Using HTML::TableExtract, I'm able to parse the table and get the text (i.e. Ability, Abnormal in the example below), but I cannot get the links contained in the table. For example,
<table id="AlphabetTable">
<tr>
<td>
Ability <span class="count">2650</span>
</td>
<td>
Abnormal <span class="count">26</span>
</td>
</table>
Is there a way to get the links using HTML::TableExtract, or another module that could be used in this situation? Thanks
Part of my code:
$mech->get($link->url());
$te->parse($mech->content);
foreach $ts ($te->tables){
foreach $row ($ts->rows){
print $row->[0]; # it only prints the text part
# but I want its link
}
}
Use HTML::LinkExtor, passing the extracted cell content to its parse method.
my $le = HTML::LinkExtor->new();
foreach $ts ($te->tables){
foreach $row ($ts->rows){
$le->parse($row->[0]);
for my $link_tag ( $le->links ) {
my ($tag, %links) = @$link_tag;
# next if $tag ne 'a'; # exclude other kinds of links?
print for values %links;
}
}
}
Use the keep_html option in the constructor.
keep_html
Return the raw HTML contained in the cell, rather than just the visible text. Embedded tables are not retained in the HTML extracted from a cell. Patterns for header matches must take into account HTML in the string if this option is enabled. This option has no effect if extracting into an element tree structure.
$te = HTML::TableExtract->new( keep_html => 1, headers => [qw(field1 ... fieldN)]);

How can I reliably parse a QuakeLive player profile using Perl?

I'm currently working on a Perl script to gather data from the QuakeLive website.
Everything was going fine until I couldn't get a set of data.
I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.
I tried regexing up to the favorites image, but without succeeding. If it's any use, I'm already using WWW::Mechanize in the script.
I think that the problem could be related to the class name of the paragraphs where those elements are, while the previous one was classless.
You can find an example profile HERE.
Note that for the previous part of the page, it worked using code like:
$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";
The immediate problem is that you have:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};
That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,
my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};
would save you the trouble of having to trim whitespace afterwards.
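The difference between the two capture groups is easy to check in isolation. The sketch below uses Python's re module instead of Perl, on a cut-down copy of the markup; the "Game Type" paragraph is an assumed layout based on the output shown in the parser examples, and the helper name fav is made up.

```python
import re

# Cut-down sample; the second paragraph is an assumed layout.
content = '''<p class="prf_faves">
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
<p class="prf_faves">
<b>Game Type:</b> Clan Arena
<div class="cl"></div>
</p>'''


def fav(label, capture):
    """Return what `capture` grabs after '<b>label:</b> ', or None."""
    m = re.search(r"<b>" + re.escape(label) + r":</b> " + capture, content)
    return m.group(1) if m else None


# [^<]+ runs up to the next tag, so the trailing newline comes along.
print(repr(fav("Arena", r"([^<]+)")))      # 'Campgrounds\n'
# \S+ needs no trimming, but stops at the first space.
print(repr(fav("Arena", r"(\S+)")))        # 'Campgrounds'
print(repr(fav("Game Type", r"(\S+)")))    # 'Clan'
print(repr(fav("Game Type", r"([^<]+)").strip()))  # 'Clan Arena'
```

So the \S+ variant is only safe for single-word values; for anything else, capture with [^<]+ and trim.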
Note that it is easy for such regex based solutions to be fooled with simple things like commented out snippets in the source. E.g., if the source were to be changed to:
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena: </b> here -->
<b>Arena:</b> Campgrounds
<div class="cl"></div>
</p>
your script would be in trouble, whereas a solution using an HTML parser would not.
An example using HTML::TokeParser::Simple:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );
while ( my $tag = $p->get_tag('p') ) {
next unless $tag->is_start_tag;
next unless defined (my $class = $tag->get_attr('class'));
next unless grep { /^prf_faves\z/ } split ' ', $class;
my $fav = $p->get_tag('b');
my $type = $p->get_text('/b');
my $value = $p->get_text('/p');
$value =~ s/\s+\z//;
print "$type = $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
And, here is an example using HTML::TreeBuilder:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;
use YAML;
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');
my @p = $tree->look_down(_tag => 'p', sub {
return unless defined (my $class = $_[0]->attr('class'));
return unless grep { /^prf_faves\z/ } split ' ', $class;
return 1;
}
);
for my $p ( @p ) {
my $text = $p->as_text;
$text =~ s/^\s+//;
my ($type, $value) = split ': ', $text;
print "$type: $value\n";
}
Output:
Arena: Campgrounds
Game Type: Clan Arena
Weapon: Rocket Launcher
Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.
Using regular expressions for this particular task is less than ideal. There are just too many things that might change, and you're not taking advantage of inherent structure of HTML pages. Have you considered using something like HTML::TreeBuilder instead? It will allow you to say "get me the value of the 3rd table cell in the table named weapons", etc.