How do I extract an HTML element based on its class? - html

I'm just starting out in Perl, and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder to do most of the work, but I've run into some trouble. I have the following HTML:
<table class="winsTable">
<thead>...</thead>
<tbody>
<tr>
<td class = "wins">15</td>
</tr>
</tbody>
</table>
I know there are some modules that get data from tables, but this is a special case; not all the data I want is in a table. So, I tried:
my $tree = HTML::TreeBuilder->new_from_url( $url );
my @data = $tree->find('td class = "wins"');
But @data returned empty. I know this method would work without the class name, because I've successfully parsed data with $tree->find('strong'). So, is there a module that can handle this type of HTML syntax? I scanned through the HTML::TreeBuilder documentation and didn't find anything that appeared to, but I could be wrong.

You could use the look_down method to find the specific tag and attributes you're looking for. This is in the HTML::Element module (which is imported by HTML::TreeBuilder).
my $data = $tree->look_down(
    _tag  => 'td',
    class => 'wins'
);
print $data->content_list, "\n" if $data; # prints '15' using the given HTML

$data = $tree->look_down(
    _tag  => 'td',
    class => 'losses'
);
print $data->content_list, "\n" if $data; # prints nothing using the given HTML

I'm using the excellent (but sometimes a bit slow) HTML::TreeBuilder::XPath module:
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my @data = $tree->findvalues('//table[ @class = "winsTable" ]//td[@class = "wins"]');

(This is kind of a supplementary answer to dspain's)
Actually you missed a spot in the HTML::TreeBuilder documentation where it says,
Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.
(Note that the bold formatting is mine, it's not in the documentation)
This indicates that you should read HTML::Element's documentation as well, where you would find the find method which says
This is just an alias to find_by_tag_name
This should tell you that it doesn't work for class names, but its description also mentions a look_down method which can be found slightly further down. If you look at the example, you'd see that it does what you want. And dspain's answer shows precisely how in your case.
To be fair, the documentation is not that easy to navigate.
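For example, class attributes can hold several space-separated names, so an exact-string match on class can miss elements; look_down also accepts a regex (or a code ref) as the attribute test. A minimal sketch, with made-up markup modeled on the question's HTML:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: the cell carries two class names
my $tree = HTML::TreeBuilder->new_from_content(
    '<table><tr><td class="wins highlighted">15</td></tr></table>'
);

# class => 'wins' would fail the exact-string comparison here,
# but a word-boundary regex still matches the multi-class value
my $cell = $tree->look_down( _tag => 'td', class => qr/\bwins\b/ );
print $cell->as_text, "\n" if $cell;    # 15
```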

I found this link the most useful in showing me how to extract specific information from HTML content. I used the last example on the page:
use v5.10;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get( 'http://htmlparsing.com/' );
# Find all <h1> tags
my @list = $mech->find('h1');

# or this way -- very useful for pinpointing exact classes within the HTML
my @list = $mech->look_down( '_tag'  => 'h1',
                             'class' => 'main_title' );

# Now just iterate and process
foreach (@list) {
    say $_->as_text();
}
This seemed so much simpler to get up and running than any of the other modules that I looked at. Hope this helps!

Related

Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in Perl, and wrote these lines:
if(INFILE =~ \$txt_TeamNumber\) {
$teamNumber = \$txt_TeamNumber\
}
and I need to get the txt_TeamNumber, go 21 spaces forward, and get the next 1-5 numbers. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>
This is a very good example of the benefits of using a ready-made parser.
One of the standard modules for parsing HTML is HTML::TreeBuilder. Its effectiveness rests to a good extent on its use of HTML::Element, so always have that page ready for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped in the needed tags, and load it from that file. If it actually comes from the internet, please change accordingly.
use warnings;
use strict;
use Path::Tiny;
use HTML::TreeBuilder;
my $file = "snippet.html";
my $html = path($file)->slurp;    # or open and slurp by hand

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @nodes = $tree->look_down(_tag => 'input');

foreach my $node (@nodes) {
    next unless $node->look_down('name', qr/\$txt_TeamNumber/);
    my $val = $node->attr('value');
    print "'value': $val\n";
}
This prints the line: 'value': 186. Note that we never had to parse anything by hand.
I assume that the 'name' attribute is identified by literal $txt_TeamNumber, thus $ is escaped.
The code uses the excellent Path::Tiny to slurp the file. If there are issues with installing a module just read the file by hand into a string (if it does come from a file and not from internet).
See docs and abundant other examples for the full utility of the HTML parsing modules used above. There are of course other ways and approaches, made ready for use by yet other good modules. Please search for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with regex.
Watch for variable scoping. You should be able to get it with a simple regexp capture:
if ($line =~ /$txt_TeamNumber/) {    # $line holds the current line read from INFILE
    ($value) = $line =~ /$txt_TeamNumber.*?value="(.*?)"/;
}

How to get a canonical link from HTML head using Nokogiri

I'm trying to get the defined canonical link from a webpage using Nokogiri:
<link rel="canonical" href="https://test.com/somepage">
It's the href I'm after.
Whatever I try it doesn't seem to work. This is what I have:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//canonical/@href')
puts canon
This doesn't return anything, not even an error.
You are trying to get the attribute but that is not how you do it.
You can use this:
page.xpath('//link[@rel="canonical"]/@href')
What it says is: get me a link element anywhere in the document that has a rel attribute that equals "canonical" and when you find that node, get me its href attribute.
The full answer is:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//link[@rel="canonical"]/@href')
puts canon
What you tried to do is get a node that is called "canonical", not the attribute.
I'm a fan of using CSS selectors over XPath as they're more readable:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
There's some confusion about what Nokogiri, and XPath, are returning when accessing a node parameter. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
Here's how I'd do it using CSS:
doc.at('link[rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('link[rel="canonical"]')['href'].class # => String
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
XPath, while more powerful, is also capable of making you, or Nokogiri, Ruby or the CPU, do more work.
First, xpath, which is the XPath-specific version of search, returns a NodeSet, not a node or an element. A NodeSet is akin to an array of nodes, which can bite you if you're not aware of what you've got. From the NodeSet documentation:
A NodeSet contains a list of Nokogiri::XML::Node objects. Typically a NodeSet is returned as a result of searching a Document via Nokogiri::XML::Searchable#css or Nokogiri::XML::Searchable#xpath
If you are looking for a specific node, or only a single instance of a particular type of node, then use at, or if you want to be picky, use at_css or at_xpath. (Nokogiri can usually figure out what you mean when using at or search but sometimes you have to use the specific method to give Nokogiri a nudge in the right direction.) Using at in the above example shows it returns the node itself, and once you've got the node it's trivial to get the value of any parameter by treating it as a hash.
xpath, search and css all return NodeSets, so, like an array, you need to point to the actual element you want then access the parameter:
doc.xpath('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::NodeSet
doc.xpath('//link[@rel="canonical"]/@href').first.class # => Nokogiri::XML::Attr
doc.xpath('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
Notice that '//link[@rel="canonical"]/@href' results in Nokogiri returning an Attr object, not text. You can print that object, and Ruby will stringify it, but it won't behave like a String, resulting in errors if you try to treat it like one. For instance:
doc.xpath('//link[@rel="canonical"]/@href').first.downcase # => NoMethodError: undefined method `downcase' for #<Nokogiri::XML::Attr:0x007faace115d20>
Instead, use text or content to get the text value:
doc.at('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::Attr
doc.at('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
or get the element itself and then access the parameter like you would a hash:
doc.at('//link[@rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('//link[@rel="canonical"]')['href'] # => "https://test.com/somepage"
either of which will return a String.
Also notice I'm not using @href to return the Attr in this example, I'm only getting the Node itself then using ['href'] to return the text of the parameter. It's a shorter selector and makes more sense, at least to me, since Nokogiri isn't having to return the Attr object which you then have to convert using text or possibly run into problems when you accidentally treat it as a String.

How can I strip HTML tags from a string in the model before I get to the view

Trying to determine how to strip the HTML tags from a string in Ruby. I need this to be done in the model before I get to the view. So using:
ActionView::Helpers::SanitizeHelper#strip_tags
won't work. I was looking into using Nokogiri, but can't figure out how to do it.
If I have a string:
description = '<a href="http://www.google.com">google</a>'
I need it to be converted to plain text without including HTML tags, so it would just come out as "google".
Right now I have the following which will take care of HTML entities:
def simple_description
  simple_description = Nokogiri::HTML.parse(self.description)
  simple_description.text
end
You can call the sanitizer directly like this:
Rails::Html::FullSanitizer.new.sanitize('<b>bold</b>')
# => "bold"
There are also other sanitizer classes that may be useful: LinkSanitizer, Sanitizer, and WhiteListSanitizer.
Nokogiri is a great choice if you don't own the HTML generator and you want to reduce your maintenance load:
require 'nokogiri'
description = '<a href="http://www.google.com">google</a>'
Nokogiri::HTML::DocumentFragment.parse(description).at('a').text
# => "google"
The good thing about a parser vs. using patterns is that the parser keeps working through changes to the tags or format of the document, whereas patterns get tripped up by those things.
While using a parser is a little slower, it more than makes up for that by the ease of use and reduced maintenance.
The code above breaks down to:
Nokogiri::HTML(description).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"http://www.google.com\">google</a></body></html>\n"
Rather than let Nokogiri add the normal HTML headers, I told it to parse only that one node into a document fragment:
Nokogiri::HTML::DocumentFragment.parse(description).to_html
# => "<a href=\"http://www.google.com\">google</a>"
at finds the first occurrence of that node:
Nokogiri::HTML::DocumentFragment.parse(description).at('a').to_html
# => "<a href=\"http://www.google.com\">google</a>"
text finds the text in the node.
Maybe you could use a regular expression in Ruby, like the following:
des = '<a href="http://www.google.com">google</a>'
p des[/<.*>(.*)\<\/.*>/, 1]
The result will be "google".
Regular expressions are powerful, and you can customize one to fit your needs.

Parse HTML Page For Links With Regex Using Perl [duplicate]

Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best at Perl, but I've done stuff like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which is better in Perl). I've been searching the Perl docs as well as looking at regex tutorials, and most if not all seem geared towards using =~ to match stuff rather than to capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;

my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a' );

$tree = $tree->delete;
# Do stuff with the links array
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
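For what it's worth, a somewhat more tolerant pattern can cope with double, single, or missing quotes — still a heuristic rather than a parser, and it will still break on pathological markup. The sample anchors here are made up for illustration:

```perl
use strict;
use warnings;
use feature 'say';

my @samples = (
    '<a href="/en/page-one">x</a>',
    "<a href='/en/page-two'>y</a>",
    '<a href=/en/page-three>z</a>',
);

my @urls;
for my $html (@samples) {
    # Accept "...", '...', or an unquoted value ending at whitespace or '>'
    if ( $html =~ /href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i ) {
        push @urls, $1 // $2 // $3;
    }
}
say for @urls;
```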

How can I extract URL and link text from HTML in Perl?

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work with lists of URLs.
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();

for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
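To make that navigation concrete, here's a small self-contained sketch; the page content is made up and written to a temp file so the example runs without a live site:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use WWW::Mechanize;

# Write a tiny page to disk so the example needs no network
my ( $fh, $file ) = tempfile( SUFFIX => '.html' );
print $fh '<a href="http://www.google.com">Google</a>'
        . '<a href="http://www.apple.com">Apple</a>';
close $fh;

my $mech = WWW::Mechanize->new;
$mech->get("file://$file");    # LWP (and therefore Mech) speaks file:// too

# find_all_links filters by url, text, or regex versions of either
my @links = $mech->find_all_links( text_regex => qr/Google/ );
print $_->url, "\n" for @links;

# $mech->follow_link( text => 'Google' );  # would fetch that URL like a browser click
```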
Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if($#ARGV < 0) {
    print "$0: Need URL argument.\n";
    exit 1;
}

my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/, @content);

foreach my $c (@links) {
    $c =~ /<a.*href="([\s\S]+?)".*>/;
    $link = $1;
    $c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
    $title = $1;
    print "$title, $link\n";
}
There are likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).
I like using pQuery for things like this...
use pQuery;
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);
Also check out this previous stackoverflow.com question, Emulation of lex like functionality in Perl or Python, for similar answers.
Another way to do this is to use XPath to query the parsed HTML. It is needed in complex cases, like extracting all links in a div with a specific class. Use HTML::TreeBuilder::XPath for this.
my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
    my $t = $node->attr('title');
}
Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
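A minimal sketch of that HTML::TreeBuilder approach, with the sample HTML inlined (made up here to stand in for a fetched page):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up snippet standing in for a downloaded page
my $html = '<p><a href="http://www.google.com">Google</a> and '
         . '<a href="http://www.apple.com">Apple</a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Every <a> node can report both its text and its href attribute
my @pairs;
for my $anchor ( $tree->look_down( _tag => 'a' ) ) {
    push @pairs, $anchor->as_text . ', ' . $anchor->attr('href');
}
print "$_\n" for @pairs;

$tree->delete;    # break the tree's circular references when done
```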
Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.
use XML::LibXML;
my $doc = XML::LibXML->load_html(IO => \*DATA);

for my $anchor ( $doc->findnodes("//a[\@href]") )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}

__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
Google -> http://www.google.com
Apple -> http://www.apple.com
HTML::LinkExtractor is better than HTML::LinkExtor: it can give you both the link text and the URL.
Usage:
use HTML::LinkExtractor;

my $input = q{If <a href="http://www.apple.com/">Apple</a> }; # HTML string
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse(\$input);

for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}
HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
We can also use a regular expression to extract the link along with its link text. This is another way.
local $/;    # slurp the whole DATA section

my $a = <DATA>;
while ( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
    print "Link:$1 \t Text: $2\n";
}

__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>