How can I extract URL and link text from HTML in Perl? - html

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.

Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
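To make the difference concrete, here is a minimal HTML::LinkExtor sketch (the sample links are the ones from the question): the callback only ever sees the tag name and attributes, never the link text.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $html = q{<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>};

# HTML::LinkExtor fires the callback once per link-bearing tag;
# only the tag name and attributes are passed in.
my @urls;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, $attr{href} if $tag eq 'a' and defined $attr{href};
});
$extor->parse($html);
$extor->eof;
print "$_\n" for @urls;
```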

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if ($#ARGV < 0) {
    print "$0: Need URL argument.\n";
    exit 1;
}
my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/, @content);
foreach my $c (@links) {
    $c =~ /<a.*href="([\s\S]+?)".*>/;
    my $link = $1;
    $c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
    my $title = $1;
    print "$title, $link\n";
}
There's likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc).

I like using pQuery for things like this...
use 5.010;  # for say()
use pQuery;
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);
Also check out this previous stackoverflow.com question, Emulation of lex like functionality in Perl or Python, for similar answers.

Another way to do this is to use XPath to query the parsed HTML. It's needed in complex cases, like extracting all links in a div with a specific class. Use HTML::TreeBuilder::XPath for this.
my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
    my $t = $node->attr('title');
}

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
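As a rough sketch of that approach, assuming the sample links from the question:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample HTML from the question
my $html = q{<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>};

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down finds every <a> element; as_text gives the link text
my @pairs = map { [ $_->as_text, $_->attr('href') ] }
            $tree->look_down(_tag => 'a');
printf "%s, %s\n", @$_ for @pairs;
$tree->delete;   # free the tree's memory when done
```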

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.

Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.
use XML::LibXML;
my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes("//a[\@href]") )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}
__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
         Google -> http://www.google.com
          Apple -> http://www.apple.com

HTML::LinkExtractor is better than HTML::LinkExtor
It can give both link text and URL.
Usage:
use HTML::LinkExtractor;
my $input = q{If <a href="http://www.apple.com">Apple</a> }; # HTML string
my $LX = new HTML::LinkExtractor(undef, undef, 1);
$LX->parse(\$input);
for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}

HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.

We can use a regular expression to extract the link and its link text. This is one more way to do it.
local $/;  # slurp mode
my $a = <DATA>;
while( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
print "Link:$1 \t Text: $2\n";
}
__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>

Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in perl, and wrote this line:
if (INFILE =~ /$txt_TeamNumber/) {
    $teamNumber = /$txt_TeamNumber/
}
and I need to get the txt_TeamNumber, go 21 spaces forward, and get the next 1-5 numbers. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>
This is a very good example of the benefits of using a ready-made parser.
One of the standard modules for parsing HTML is HTML::TreeBuilder. Its effectiveness rests to a good extent on its use of HTML::Element, so always have that page ready for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped it with the needed tags, and load it from that file. If it comes from the internet instead, please change the code accordingly.
use warnings;
use strict;
use Path::Tiny;
use HTML::TreeBuilder;
my $file = "snippet.html";
my $html = path($file)->slurp; # or open and slurp by hand
my $tree = HTML::TreeBuilder->new_from_content($html);
my @nodes = $tree->look_down(_tag => 'input');
foreach my $node (@nodes) {
    my $found = $node->look_down('name', qr/\$txt_TeamNumber/) or next;
    my $val = $found->attr('value');
    print "'value': $val\n";
}
This prints the line: 'value': 186. Note that we never had to parse anything by hand.
I assume that the 'name' attribute is identified by the literal $txt_TeamNumber, thus the $ is escaped in the pattern.
The code uses the excellent Path::Tiny to slurp the file. If there are issues with installing a module, just read the file into a string by hand (if it does come from a file and not from the internet).
See the docs and the abundant other examples for the full utility of the HTML parsing modules used above. There are of course other ways and approaches, made ready for use by yet other good modules. Please search for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with regexes.
Watch for variable scoping. You should be able to get it with a simple regexp capture:
if (INFILE =~ /$txt_TeamNumber/) {
    $teamNumber = /$txt_TeamNumber/;
    ($value) = /$txt_TeamNumber.*?value="(.*?)"/;
}

Perl Regular Expression to extract value from nested html tags

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/){
$title = $1;
}else {
$title="";
}
print"$title";
OUTPUT: Google</b></h1>
It should be: Google
I'm unable to extract the value from the link using a regex in Perl; it could have one more or one less level of nesting:
<a href="#google"><h1><b><i>Google</i></b></h1></a>
Please try these:
1) <td><a href="#">Unix shell</a>
2) <a href="#"><h1><b>HP</b></h1></a>
3) <a href="#">generic</a></td>);
4) <a href="#"><span>[</span>1<span>]</span></a>
OUTPUT:
Unix shell
HP
generic
[1]
Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:
use Mojo::DOM;
my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->at('a[href="#google"]')->all_text, "\n";
Or with HTML::TreeBuilder::XPath:
use HTML::TreeBuilder::XPath;
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->findvalue('//a[@href="#google"]'), "\n";
Try this:
if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)
That should capture everything after the href and between the <b>...</b> tags.
Instead, to get everything after the last > and before the first </, you can use
<a.*?href.*?>([^>]*?)<\/
For this simple case you could use the regex below, but the requirements are no longer simple; look at @amon's answer for how to use an HTML parser.
/<a.*?>([^<]+)</
Match an opening a tag, followed by anything until you find something between > and <.
Though as others have mentioned, you should generally use a HTML parser.
echo '<td><a href="#">Unix shell</a>
<a href="#"><h1><b>HP</b></h1></a>
<a href="#">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
I came up with this regex that works for all your sampled inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*
(?<=>)((?:\w+)(?:\s*))(?1)*
Just take the first element of the returned array, i.e. array[0].

How do I extract an HTML element based on its class?

I'm just starting out in Perl, and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder to do most of the work, but I've run into some trouble. I have the following HTML:
<table class="winsTable">
<thead>...</thead>
<tbody>
<tr>
<td class = "wins">15</td>
</tr>
</tbody>
</table>
I know there are some modules that get data from tables, but this is a special case; not all the data I want is in a table. So, I tried:
my $tree = HTML::TreeBuilder->new_from_url( $url );
my @data = $tree->find('td class = "wins"');
But @data returned empty. I know this method would work without the class name, because I've successfully parsed data with $tree->find('strong'). So, is there a module that can handle this type of HTML syntax? I scanned through the HTML::TreeBuilder documentation and didn't find anything that appeared to, but I could be wrong.
You could use the look_down method to find the specific tag and attributes you're looking for. This is in the HTML::Element module (which is imported by HTML::TreeBuilder).
my $data = $tree->look_down(
_tag => 'td',
class => 'wins'
);
print $data->content_list, "\n" if $data; #prints '15' using the given HTML
$data = $tree->look_down(
_tag => 'td',
class => 'losses'
);
print $data->content_list, "\n" if $data; #prints nothing using the given HTML
I'm using the excellent (but sometimes a bit slow) HTML::TreeBuilder::XPath module:
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my @data = $tree->findvalues('//table[ @class = "winsTable" ]//td[@class = "wins"]');
(This is kind of a supplementary answer to dspain's)
Actually you missed a spot in the HTML::TreeBuilder documentation where it says,
Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.
(Note that the bold formatting is mine, it's not in the documentation)
This indicates that you should read HTML::Element's documentation as well, where you would find the find method which says
This is just an alias to find_by_tag_name
This should tell you that it doesn't work for class names, but its description also mentions a look_down method which can be found slightly further down. If you look at the example, you'd see that it does what you want. And dspain's answer shows precisely how in your case.
To be fair, the documentation is not that easy to navigate.
I found this link the most useful in telling me how to extract specific information from HTML content. I used the last example on the page:
use v5.10;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get( 'http://htmlparsing.com/' );
# Find all <h1> tags
my @list = $mech->find('h1');
# or this way <----- I found this way very useful to pinpoint exact classes within some html
my @list = $mech->look_down('_tag' => 'h1',
    'class' => 'main_title');
# Now just iterate and process
foreach (@list) {
say $_->as_text();
}
This seemed so much simpler to get up and running than any of the other modules that I looked at. Hope this helps!

Print content of <P> html, perl

I'm doing Perl programming. I'm opening an input .html file. I want to copy the content of the <P> tag into variables so that I can use just the content and make some changes to it.
Below is my code:
use utf8;
package MyParser;
use base qw(HTML::Parser);
$lines = <INPUT>;
my $parser = MyParser->new;
$parser->parse( $lines );
print $lines;
but it prints only (!DOCTYPE html PUBLIC ......)
Does anyone know how to do it?
Thanks in advance.
Consider using HTML::TokeParser::Simple for simple stream parsing of HTML documents.
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(...);
while (my $tag = $parser->get_tag('p')) {
print $parser->get_trimmed_text('/p'), "\n";
}
If you want the entire document tree to query and change, HTML::TreeBuilder will give you an HTML::Tree.
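A minimal sketch of that tree-based route; the HTML snippet here is invented for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A made-up snippet standing in for the input document
my $html = '<html><body><p>First paragraph.</p><p>Second one.</p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Grab the plain text of every <p>; each element could also be
# edited in place (e.g. with ->delete_content and ->push_content)
my @paras = map { $_->as_text } $tree->look_down(_tag => 'p');
print "$_\n" for @paras;
$tree->delete;
```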
I highly recommend using a parser (HTML::Parser), and avoiding regular expressions for this kind of task.

Parse HTML Page For Links With Regex Using Perl [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best in Perl, but I've done stuff like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using =~ to match stuff rather than capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;
while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}
print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;
my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a');
$tree = $tree->delete;
# Do stuff with links array
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
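A quick illustration of that caveat, using made-up sample tags in the style of the question:

```perl
use strict;
use warnings;

# the regex from above
my $re = qr/href=\"([^\"]+)\"/i;

# Hypothetical sample tags for illustration
my ($double) = q{<a href="/en/subtitles/1/x">x</a>} =~ $re;
my ($single) = q{<a href='/en/subtitles/1/x'>x</a>} =~ $re;

print defined $double ? "double quotes: $double\n" : "double quotes: no match\n";
print defined $single ? "single quotes: $single\n" : "single quotes: no match\n";
```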