Parse HTML Page For Links With Regex Using Perl [duplicate] - html

Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best with Perl, but I've done things like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those paths into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seem geared towards using =~ to match stuff rather than capture matches.
Thanks,
Cody

Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en

Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;
my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a' );
$tree = $tree->delete;
# Do stuff with @links

URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
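To collect every matching path into an array, as the question asks, the same pattern can be used with a global match in list context. A minimal sketch, assuming the page content is already in $html and that href values are always double-quoted:
my @hrefs = $html =~ /href="([^"]+)"/gi;          # every capture, in list context
my @paths = grep { m{^/en/subtitles/} } @hrefs;   # keep only the subtitle paths
print "$_\n" for @paths;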

Related

perl - matching greater than character in regex

$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";
I need to extract "Alpha-Seeking" and "No Underlying Index ," from the above 2 strings.
Basically, need everything from ('>) to the last character of the string.
Tried two ways,
1) The standard intuitive
if ($string1=~ /\'>(.*?)/) {print "got $1";}
but this does not seem to work with the '>' symbol.
2) Also tried
if ($string1=~ /(?=>)(.*?)/) {print "got $1";}
based on inputs from Greater than and less than symbol in regular expressions, but it is not working.
Any inputs will be useful.
PS: Also, if the answer can include matching the "less than" symbol ("<"), that would be great!
Thanks
Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.
For example:
<tag>
outer
<tag>
middle
<tag>inner</tag>
middle
</tag>
outer
</tag>
Instead, use an HTML parser and search tools such as XPath.
Here is a demonstration using XML::LibXML.
use strict;
use warnings;
use v5.10;
use XML::LibXML;
my $html = q{
    <html>
    <body>
        <a href='/channels/folder1'>Alpha-Seeking</a>
        <a href='/channels/folder2'>No Underlying Index</a>
    </body>
    </html>
};

# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);

# Find all links.
for my $node ($dom->findnodes('//a')) {
    # Print their text.
    say $node->textContent;
}
I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.
Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.
Here's what you have:
if ($string1=~ /\'>(.*?)/) {print "got $1";}
And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. The simplest thing that .*? can capture is nothing - the empty string.
Regexes are greedy by default; they match as much as possible. You add the ? to remove that greediness and make them match as little as possible. But you don't want that here. Here, you want their greediness. So just remove that ?.
use warnings;
use strict;

my @strings = (
    "<a href='/channels/folder1'>Alpha-Seeking",
    "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings) {
    if ($string =~ /'>(.*)/) { # Note: No "?" here
        print "got $1\n";
    }
}
This displays:
got Alpha-Seeking
got No Underlying Index ,
This works for me
use warnings;
use strict;

my @strings = (
    "<a href='/channels/folder1'>Alpha-Seeking",
    "<a href='/channels/folder2'>No Underlying Index ,"
);

for my $string (@strings)
{
    if ($string =~ /'>(.*?)$/)
    {
        print "got $1\n";
    }
}
Running it gives:
$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,
While exploring various options, I managed to get this working with the following:
Replace the greater than sign with some other generic symbol (like a pipe)
$string=~ s/>/\|/g; #Interestingly, '>' matches here without any issues
After that, split on the pipe char, and print/parse the second part:
($o1,$o2) = split(/\|/, $string);
print "$o2|";
Works perfectly as a work-around.
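Put together as a runnable snippet (a minimal sketch; the input string is taken from the question):
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";
$string =~ s/>/\|/g;                   # replace every '>' with a pipe
my ($o1, $o2) = split /\|/, $string;   # split on the pipe
print "$o2\n";                         # prints: Alpha-Seeking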

Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in Perl and written this:
if(INFILE =~ \$txt_TeamNumber\) {
$teamNumber = \$txt_TeamNumber\
}
and I need to get the txt_TeamNumber, go 21 spaces forward, and get the next 1-5 numbers. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>
This is a very good example of the benefits of using ready-made parsers.
One of the standard modules for parsing HTML is HTML::TreeBuilder. Its effectiveness rests to a good extent on its use of HTML::Element, so always keep that page handy for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped with the needed tags, and load it from that file. If it actually comes from the internet, please change accordingly.
use warnings;
use strict;

use Path::Tiny;
use HTML::TreeBuilder;

my $file = "snippet.html";
my $html = path($file)->slurp;  # or open and slurp by hand

my $tree = HTML::TreeBuilder->new_from_content($html);

my @nodes = $tree->look_down(_tag => 'input');

foreach my $node (@nodes) {
    my $val = $node->look_down('name', qr/\$txt_TeamNumber/)->attr('value');
    print "'value': $val\n";
}
This prints the line 'value': 186. Note that we never have to parse anything by hand.
I assume that the 'name' attribute is identified by literal $txt_TeamNumber, thus $ is escaped.
The code uses the excellent Path::Tiny to slurp the file. If there are issues with installing a module, just read the file into a string by hand (if it does come from a file and not from the internet).
See docs and abundant other examples for the full utility of the HTML parsing modules used above. There are of course other ways and approaches, made ready for use by yet other good modules. Please search for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with a regex.
Watch for variable scoping. You should be able to get it with a simple regexp capture:
while (my $line = <INFILE>) {
    if ( $line =~ /$txt_TeamNumber/ ) {
        # capture the value="..." that follows on the same line
        ($teamNumber) = $line =~ /$txt_TeamNumber.*?value="(.*?)"/;
    }
}

Count html tags with Perl regex

I'm trying to parse an HTML file to count HTML tags. I'm not very familiar with regexes, though.
My current code counts only line by line, not tag by tag. It returns the whole line.
while(<SUB>){
    while(/(<[^\/][a-z].*>)/gi){
        print $_;
        $count++;
    }
}
Suppose we have a line like this in the file:
<div>blahblahblah</div><h1>hello</h1><p>blah</>
I need to extract the opening tag of every HTML element, as well as tags like <hr>, <br> and <img>.
Could you please point me in the right direction?
If you want to count HTML tags within a document, I suggest that you use HTML::TreeBuilder.
use strict;
use HTML::Tree;
use LWP::Simple;

my $ex = "http://www.google.com";
my $content = get($ex);

my $tree = HTML::Tree->new();
$tree->parse($content);

my @a_tags = $tree->look_down( '_tag', 'div' );
my $size = @a_tags;
print $size;
Now you can specify different tag names instead of div and count all the different tags you require. I suggest studying HTML::TreeBuilder, as it is a very useful module and you may find other methods it offers useful.
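A rough sketch of counting every tag in the tree rather than just div, using a coderef criterion with look_down and the tag method from HTML::Element (the input file name here is hypothetical):
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('page.html');  # hypothetical input file
my %count;
$count{ $_->tag }++ for $tree->look_down( sub { 1 } );     # every element in the tree
printf "%-6s %d\n", $_, $count{$_} for sort keys %count;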

Perl Regular Expression to extract value from nested html tags

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if ($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/) {
    $title = $1;
} else {
    $title = "";
}
print "$title";
OUTPUT: Google</b></h1>
It should be: Google
Unable to extract the value from the link using a regex in Perl; it could have one more or one less level of nesting:
<a href="#google"><h1><b><i>Google</i></b></h1></a>
Please try this:
1) <td>Unix shell
2) <h1><b>HP</b></h1>
3) generic</td>);
4) <span>[</span>1<span>]</span>
OUTPUT:
Unix shell
HP
generic
[1]
Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:
use Mojo::DOM;
my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->at('a[href="#google"]')->all_text, "\n";
Or with HTML::TreeBuilder::XPath:
use HTML::TreeBuilder::XPath;
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->findvalue('//a[@href="#google"]'), "\n";
Try this:
if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)
That should take everything after the href and between the <b>...</b> tags.
Instead, to get everything after the last > and before the first </, you can use
<a.*?href.*?>([^>]*?)<\/
For the simple case originally asked, you could use the pattern below; however, the requirements are no longer simple, so look at @amon's answer for how to use an HTML parser.
/<a.*?>([^<]+)</
Match an opening a tag, followed by anything until you find something between > and <.
Though as others have mentioned, you should generally use an HTML parser.
echo '<td>Unix shell
<h1><b>HP</b></h1>
generic</td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
I came up with this regex that works for all your sampled inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*
(?<=>)((?:\w+)(?:\s*))(?1)*
Just take the first element of the returned array, i.e. array[0].
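A quick way to try this on one of the sample inputs (a sketch only; the (?1) recursion needs Perl 5.10 or later):
use strict;
use warnings;

my $s = '<h1><b>HP</b></h1>';
my @matches = $s =~ /(?<=>)((?:\w+)(?:\s*))(?1)*/;   # list context: captured groups
print "$matches[0]\n";                               # prints: HP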

How can I extract URL and link text from HTML in Perl?

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
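For comparison, a minimal HTML::LinkExtor sketch that collects only the URLs (assuming $html already holds the page content):
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new;    # no callback: links are collected internally
$p->parse($html);
$p->eof;
for my $link ( $p->links ) {     # each entry is [ $tag, attr_name => url, ... ]
    my ($tag, %attr) = @$link;
    print "$attr{href}\n" if $tag eq 'a' and $attr{href};
}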
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if($#ARGV < 0) {
    print "$0: Need URL argument.\n";
    exit 1;
}

my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/, @content);

foreach my $c (@links){
    $c =~ /<a.*href="([\s\S]+?)".*>/;
    $link = $1;
    $c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
    $title = $1;
    print "$title, $link\n";
}
There's likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc).
I like using pQuery for things like this...
use 5.010;    # for say
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);
Also check out the previous stackoverflow.com question Emulation of lex like functionality in Perl or Python for similar answers.
Another way to do this is to use XPath to query the parsed HTML. It is needed in complex cases, like extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.
use HTML::TreeBuilder::XPath;

my $tree  = HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
    my $t = $node->attr('title');
}
Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
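A minimal HTML::TreeBuilder sketch along the lines suggested above (assuming $html already holds the page content):
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $a ( $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('href') } ) ) {
    printf "%s, %s\n", $a->as_text, $a->attr('href');   # link text, then URL
}
$tree->delete;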
Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set recover option when parsing badly formed HTML.
use XML::LibXML;

my $doc = XML::LibXML->load_html(IO => \*DATA);

for my $anchor ( $doc->findnodes("//a[\@href]") )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}

__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
Google -> http://www.google.com
Apple -> http://www.apple.com
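If the input is badly formed HTML, the recover behaviour mentioned above can be requested through the parser options; a hedged sketch (recover and suppress_errors are standard XML::LibXML parser options):
my $doc = XML::LibXML->load_html(
    IO              => \*DATA,
    recover         => 2,    # 2 = recover silently, 1 = recover but emit warnings
    suppress_errors => 1,
);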
HTML::LinkExtractor is better than HTML::LinkExtor
It can give both link text and URL.
Usage:
use HTML::LinkExtractor;

my $input = q{If <a href="http://www.apple.com/">Apple</a> }; # HTML string (the URL here is illustrative)
my $LX = new HTML::LinkExtractor(undef, undef, 1);
$LX->parse(\$input);
for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}
HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
We can use a regular expression to extract the link along with its link text. This is also one way to do it.
local $/ = '';
my $a = <DATA>;

while( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
    print "Link:$1 \t Text: $2\n";
}
__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>