I'm programming in Perl. I'm opening an .html file as input. I want to copy the content of the <P> tags into variables so that I can use just the content and make some changes to it.
Below is my code:
use utf8;
package MyParser;
use base qw(HTML::Parser);
$lines = <INPUT>;
my $parser = MyParser->new;
$parser->parse( $lines );
print $lines;
But it prints only (!DOCTYPE html PUBLIC ......)
Does anyone know how to do it?
Thanks in advance.
Consider using HTML::TokeParser::Simple for simple stream parsing of HTML documents.
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(...);
while (my $tag = $parser->get_tag('p')) {
print $parser->get_trimmed_text('/p'), "\n";
}
If you want the entire document tree to query and change, HTML::TreeBuilder will give you an HTML::Tree.
I highly recommend using a parser (HTML::Parser) and avoiding regular expressions for this kind of task.
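As a rough illustration of the HTML::Parser route, here is a minimal sketch that collects the text of each <p> element into an array. The sample HTML string and the variable names are assumptions made up for this example, and it presumes HTML::Parser is installed from CPAN. One likely reason the code in the question prints only the DOCTYPE is that `<INPUT>` in scalar context reads a single line, so the parser never sees the rest of the file.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document; in the question this would be the
# whole contents of the input file, not just its first line.
my $html = '<html><body><p>First <b>para</b></p><div>skip</div><p>Second</p></body></html>';

my @paragraphs;    # collected text, one entry per <p>
my $in_p = 0;      # true while we are inside a <p> element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Opening tag: start a new entry when a <p> begins.
    start_h => [ sub { my ($tag) = @_; if ( $tag eq 'p' ) { $in_p = 1; push @paragraphs, '' } }, 'tagname' ],
    # Closing tag: stop collecting when the <p> ends.
    end_h   => [ sub { my ($tag) = @_; $in_p = 0 if $tag eq 'p' }, 'tagname' ],
    # Text: append decoded text to the current entry while inside a <p>.
    text_h  => [ sub { my ($text) = @_; $paragraphs[-1] .= $text if $in_p }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @paragraphs;
```

Each element of @paragraphs can then be changed independently of the markup around it.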
Related
I'm trying to parse an HTML file to count HTML tags. I'm not very familiar with regular expressions, though.
My current code counts line by line, not tag by tag, and prints the whole line.
while(<SUB>){
while(/(<[^\/][a-z].*>)/gi){
print $_;
$count++;
}
}
suppose that we have a line like this in the file
<div>blahblahblah</div><h1>hello</h1><p>blah</>
I need to extract the opening tag of every HTML element, including tags like <hr>, <br> and <img>.
Could you please point me in the right direction?
If you want to count HTML tags within a document, I suggest that you use HTML::TreeBuilder.
use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::Simple;
my $ex = "http://www.google.com";
my $content = get($ex);
my $tree = HTML::TreeBuilder->new();
$tree->parse($content);
$tree->eof();
my @a_tags = $tree->look_down( '_tag' , 'div' );
my $size = @a_tags;
print $size;
Now you can specify different tag names instead of div and count all the different tags that you require. I suggest studying HTML::TreeBuilder, as it is a very useful module and has other methods you may find useful.
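If every opening tag should be counted in one pass rather than one tag name at a time, a stream parser is another option. The following is only a sketch, assuming HTML::Parser is installed; the sample line is adapted from the question, and the <hr>, <br> and <img> tags (and the file name x.png) are added purely for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample line adapted from the question; the void tags and the
# file name x.png are hypothetical additions.
my $html = '<div>blahblahblah</div><h1>hello</h1><p>blah</p><hr><br><img src="x.png">';

my %count;    # tag name => number of opening tags seen

my $parser = HTML::Parser->new(
    api_version => 3,
    # start_h fires once per opening tag, including void tags such as <br>.
    start_h => [ sub { my ($tag) = @_; $count{$tag}++ }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

my $total = 0;
$total += $_ for values %count;

print "$_: $count{$_}\n" for sort keys %count;
print "total: $total\n";
```

Because the handler only sees start events, closing tags such as </div> are never counted.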
$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/){
$title = $1;
}else {
$title="";
}
print"$title";
OUTPUT: Google</b></h1>
It Should be : Google
Unable to extract value from link using Regex in Perl, it could have one more or less nesting:
<h1><b><i>Google</i></b></h1>
Please Try this:
1) <td>Unix shell
2) <h1><b>HP</b></h1>
3) generic</td>);
4) <span>[</span>1<span>]</span>
OUTPUT:
Unix shell
HP
generic
[1]
Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:
use Mojo::DOM;
my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->at('a[href="#google"]')->all_text, "\n";
Or with HTML::TreeBuilder::XPath:
use HTML::TreeBuilder::XPath;
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->findvalue('//a[@href="#google"]'), "\n";
Try this:
if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)
That should take everything after the href and between the <b>...</b> tags.
Instead, to get everything after the last > and before the first </, you can use
<a.*?href.*?>([^>]*?)<\/
The requirements are no longer simple; look at @amon's answer for how to use an HTML parser. For the original simple case you could use:
/<a.*?>([^<]+)</
Match an opening a tag, followed by anything until you find something between > and <.
Though as others have mentioned, you should generally use an HTML parser.
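To see why that pattern copes with a varying amount of nesting, here is a small sketch using only core Perl. The sample links and their href values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample links; the href values are invented for this example.
my @samples = (
    '<a href="#google"><h1><b>Google</b></h1></a>',
    '<a href="#unix">Unix shell</a>',
    '<a href="#deep"><h1><b><i>Deep</i></b></h1></a>',
);

my @texts;
for my $html (@samples) {
    # [^<]+ needs at least one non-"<" character, so the lazy .*? keeps
    # extending past nested opening tags until real text follows a ">".
    if ( $html =~ /<a.*?>([^<]+)</ ) {
        push @texts, $1;
    }
}

print "$_\n" for @texts;
```

Note that whitespace between tags also counts as text for [^<]+, so pretty-printed HTML may capture a newline instead of the link text, which is one more reason to prefer a parser.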
echo '<td>Unix shell
<h1><b>HP</b></h1>
generic</td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
I came up with this regex that works for all your sampled inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*
(?<=>)((?:\w+)(?:\s*))(?1)*
Just take the first element of the returned array, i.e. array[0].
I'm trying to split a chunk of HTML code by the "table" tag and its contents.
So, I tried
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my @values = split(/<table*.*\/table>/, $html);
After this, I want the @values array to look like this:
array('aaa', 'bbb', 'ccc').
But it returns this array:
array('aaa', 'ccc').
Can anyone tell me how I can specify to the split function that each table should be parsed separately?
Thank you!
Your regex is greedy, change it to /<table.*?\/table>/ and it will do what you want. But you should really look into a proper HTML parser if you are going to be doing any serious work. A search of CPAN should find one that is suited to your needs.
Your regex .* is greedy, therefore chewing its way to the last part of the string. Change it to .*? and it should work better.
Use a ? to specify non-greedy wild-card char slurping, i.e.
my @values = split(/<table*.*?\/table>/, $html);
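The difference between the two quantifiers can be checked side by side. This is a minimal sketch using the string from the question; the array names @greedy and @lazy are just labels chosen for this example.

```perl
use strict;
use warnings;

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';

# Greedy: .* runs to the last "/table>", so both tables and the
# "bbb" between them are consumed as a single delimiter.
my @greedy = split /<table.*\/table>/, $html;

# Non-greedy: .*? stops at the first "/table>", so each table
# is treated as its own delimiter.
my @lazy = split /<table.*?\/table>/, $html;

print join( ',', @greedy ), "\n";    # aaa,ccc
print join( ',', @lazy ),   "\n";    # aaa,bbb,ccc
```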
Maybe using an HTML parser is a bit overkill for your example, but it will pay off later when your example grows. A solution using HTML::TreeBuilder:
use HTML::TreeBuilder;
use Data::Dump qw(dd);
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my $tree = HTML::TreeBuilder->new_from_content($html);
# remove all <table>....</table>
$_->delete for $tree->find('table');
dd($tree->guts); # ("aaa", "bbb", "ccc")
Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best in Perl, but I've done stuff like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using =~ to match stuff rather than capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;
while ( my $anchor = $parser->get_tag('a') ) {
if ( my $href = $anchor->get_attr('href') ) {
push @hrefs, $href if $href =~ m!/en/subtitles/!;
}
}
print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;
my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a');
$tree = $tree->delete;
# Do stuff with links array
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
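To capture every match rather than just the first, the same pattern can be used with the /g flag in list context. The sketch below uses core Perl only; the sample markup and paths are made up, and the second link deliberately uses single quotes to show the limitation described above.

```perl
use strict;
use warnings;

# Hypothetical sample markup; the second link uses single quotes on purpose.
my $html = q{<a href="/en/subtitles/1/a">A</a> }
         . q{<a href='/en/subtitles/2/b'>B</a> }
         . q{<a href="/en/subtitles/3/c">C</a>};

# In list context, a /g match returns all captured groups at once.
my @urls = $html =~ /href=\"([^\"]+)\"/gi;

# Only the two double-quoted links are found; the single-quoted one is skipped.
print "$_\n" for @urls;
```

The silently skipped link is the kind of failure a parser avoids, since it normalizes attribute quoting for you.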
I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Have a look at HTML::LinkExtractor and HTML::LinkExtor (the latter is part of the HTML::Parser distribution).
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if($#ARGV < 0) {
print "$0: Need URL argument.\n";
exit 1;
}
my @content = split(/\n/,`wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/,@content);
foreach my $c (@links){
$c =~ /<a.*href="([\s\S]+?)".*>/;
$link = $1;
$c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
$title = $1;
print "$title, $link\n";
}
There are likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).
I like using pQuery for things like this...
use feature 'say';
use pQuery;
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
sub {
say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
}
);
Also check out the previous stackoverflow.com question "Emulation of lex like functionality in Perl or Python" for similar answers.
Another way to do this is to use XPath to query the parsed HTML. This helps in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.
my $tree=HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes=$tree->findnodes(q{//map[@name='map1']/area});
while (my $node=$nodes->shift) {
my $t=$node->attr('title');
}
Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.
use XML::LibXML;
my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes("//a[\@href]") )
{
printf "%15s -> %s\n",
$anchor->textContent,
$anchor->getAttribute("href");
}
__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
Google -> http://www.google.com
Apple -> http://www.apple.com
HTML::LinkExtractor is better than HTML::LinkExtor
It can give both link text and URL.
Usage:
use HTML::LinkExtractor;
my $input = q{If Apple }; #HTML string
my $LX = new HTML::LinkExtractor(undef,undef,1);
$LX->parse(\$input);
for my $Link( @{ $LX->links } ) {
if( $$Link{_TEXT}=~ m/Apple/ ) {
print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
}
}
HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
We can also use a regular expression to extract each link together with its link text. This is one more way to do it.
local $/ = '';
my $a = <DATA>;
while( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
print "Link:$1 \t Text: $2\n";
}
__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>