How can I remove external links from HTML using Perl? - html

I am trying to remove external links from an HTML document but keep the anchors but I'm not having much luck. The following regex
$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;
will match the beginning of an anchor tag and the end of an external link tag e.g.
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
so I end up with nothing instead of
1
some other html
It just so happens that all anchors have their href attribute in uppercase, so I know I can do a case sensitive match, but I don't want to rely on it always being the case in the future.
Is there something I can change so it only matches the one a tag?

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.
If you are going to take the s/// route, at least be honest, do depend on href attributes being all upper case instead of putting up an illusion of flexibility.
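For comparison, here is a minimal sketch of that quick-and-dirty route. It assumes, as the question states, that anchors are written with an upper-case HREF and external links with a lower-case href. The sample markup and the sub name strip_lowercase_links are hypothetical, and the pattern still breaks as soon as a link carries extra attributes or different quoting.

```perl
use strict;
use warnings;

# Hypothetical helper: unwrap <a href="..."> links written in lower case only.
# There is no /i flag, so <A HREF="#..."> anchors are deliberately left alone.
sub strip_lowercase_links {
    my ($html) = @_;
    $html =~ s{<a href="[^"]*">(.*?)</a>}{$1}sg;
    return $html;
}

# Assumed sample input; the real document may look different.
my $sample = qq{<A HREF="#FN1">1</A>\nsome other html\n<a href="155.htm">No. 155\n</a>\n};
print strip_lowercase_links($sample);
```

The /s flag lets the captured link text span the line break before the end tag.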
Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
while ( my $token = $parser->get_token ) {
if ($token->is_start_tag('a')) {
my $href = $token->get_attr('href');
if (defined $href and $href !~ /^#/) {
print $parser->get_trimmed_text('/a');
$parser->get_token; # discard </a>
next;
}
}
print $token->as_is;
}
__DATA__
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>
<p>Maybe you did not consider click here >>>
either</p>
Output:
C:\Temp> hjk
1
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered
<p>Maybe you did not consider click here >>>
either</p>
NB: The regex based solution you checked as ''correct'' breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else and you should rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.

A bit more like a SAX type parser is HTML::Parser:
use strict;
use warnings;
use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;
my $omitted;
sub tag_handler {
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) {
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        # Anchors without an href are never omitted
        $omitted = defined $href && substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}
sub end_handler {
    my $tag_name = shift;
    if ( $tag_name eq 'a' && $omitted ) {
        $omitted = 0;    # Perl has no bareword "false"
        return;
    }
    print shift;
}
my $parser
    = HTML::Parser->new( api_version => 3
    , default_h => [ sub { print shift; }, 'text' ]
    , start_h => [ \&tag_handler, 'self,tagname,text,attr' ]
    , end_h => [ \&end_handler, 'tagname,text' ]
    );
my $path_to_file = shift @ARGV;    # HTML file to process
$parser->parse_file( $path_to_file ) or die $OS_ERROR;

Yet another solution. I love HTML::TreeBuilder and family.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $a ($root->find_by_tag_name('a')) {
if ($a->attr('href') !~ /^#/) {
$a->replace_with_content($a->as_text);
}
}
print $root->as_HTML(undef, "\t");
__DATA__
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>
<p>Maybe you did not consider click here >>>
either</p>

Why not just only remove links for which the href attribute doesn't begin with a pound sign? Something like this:
$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;

Even more simple, if you don't care about tag attributes:
$html =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;

Related

How to stop at the next specific character in regex

I have many links in a large variable, and am using regex to extract links. The most ideal link would look like
View Stock
And my regex works perfectly looking for two matches: The full Link and the vendornum.
/<a href="\/search\/\product/(.*?)\/.*?>(.*?)<\/a>/igm
But occasionally, the link will include other info such as a class, which has its own quotes
<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
And the extra "s throw me off. I cannot figure out the first match, which would be the first two "s
<a href="([^"]+)".*[^>].*?>View Stock</a>
I know regex can be very challenging, and I am using RegEx101.com, a real life saver.
But I just can't seem to figure out how to match the first pattern, the full href link, while excluding any other attributes with their own quotes before I reach the closing >
Can any regex experts guide me?
There is generally no reason to build an HTML parser by hand, from scratch, while there's usually trouble awaiting down the road; regex are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of a few great HTML libraries?
An example with HTML::TreeBuilder (also extracting links, need stated in a comment)
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $links_string =
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );
my $dom = HTML::TreeBuilder->new_from_content($links_string);
my @links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) {
    push @links_html, $tag->as_HTML; # the whole link, as is
    my $href = $tag->attr("href");
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/; #/
    say "$name = $value";
    say $tag->as_trimmed_text; # or: ->as_text, keep some spaces
    # Or:
    # say for $tag->content_list; # all children, and/or text
};
#say for @links_html;
I use a string with a newline between links for your "many links in a large variable", perhaps with some spaces around as well. This doesn't affect parsing done by the library.
A few comments
The workhorse here is HTML::Element class, with its powerful and flexible look_down method. If the string indeed has just links then you can probably use that class directly, but when done as above a full HTML document would parse just as well
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust if there can be more pairs, or let me know. Above all, use URI if there's more to it
The as_trimmed_text returns text parts of element's children, which in this case is presumably just the text of the link. The content_list returns all child nodes (same here)
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986
This prints
vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock
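If the query string can hold several parameters, the URI module mentioned above can split it instead of a hand-written regex. This is a minimal sketch, assuming URI is installed from CPAN. The helper name query_params and the extra "note" parameter in the sample href are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: return all query parameters of a (relative) URL as a hashref.
# query_form yields a flat key/value list with percent-escapes already decoded.
sub query_params {
    my ($href) = @_;
    my %param = URI->new($href)->query_form;
    return \%param;
}

# Sample href; the "note" parameter is an assumption, not from the question.
my $params = query_params('/search/title/?vendornum=StaplesA03&note=in%20stock');
print "$_ = $params->{$_}\n" for sort keys %$params;
```

Assigning the list to a hash keeps only the last value of a repeated parameter name, so keep the flat list if repeated names matter.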
Another option is Mojo::DOM, which is a part of a whole ecosystem
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $links_string = q( ... ); # as above
my $dom = Mojo::DOM->new($links_string);
my @links_html;
foreach my $node ( $dom->find('a')->each ) {
    push @links_html, $node->to_string; # or $node, gets stringified to HTML
    my $href = $node->attr('href');
    my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/; #/
    say "$name = $value";
    say $node->text;
}
#say for @links_html;
I use the same approach as above, and this prints the same. But note that Mojolicious provides for yet other, convenient ways. Often, calls are chained using a range of its useful methods, and very fine navigation through HTML is easily done using CSS selectors.
While it is probably useful here to loop as above, as an example we can also do
my $v = $dom -> find('a')
-> map(
sub {
my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;
say "$name = $value";
say $_->text;
}
);
which prints the same as above. See Mojo::Collection to better play with this.
The parameters in the URL can be parsed using Mojo::URL if you really know the name
my $value = Mojo::URL->new($href)
-> query
-> param('vendornum');
If these aren't fixed then Mojo::Parameters is useful
my $param_names = Mojo::Parameters
-> new( Mojo::URL->new($href)->query )
-> names;
where $param_names is an arrayref with names of all parameters in the query, or use
my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query ) -> pairs;
# Or
# my %pairs = @{ Mojo::Parameters->new(Mojo::URL->new($href)->query) -> pairs };
which returns an arrayref with all name, value pairs listed in succession (which can be directly assigned to a hash, for instance).
An HTML document can be nicely parsed using XML::LibXML as well.
If I read correctly, you'd like to extract the vendornum value from the URL, and the link text. Best to use an html parser.
If you want to live dangerously with code that can break you can use a regex to parse html:
my $html = '<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>';
if($html =~ /<a href="[^\?]*\?vendornum=([^"]*)[^>]*>([^<]*).*$/) {
print "vendornum: $1, link text: $2\n";
} else {
print "no match";
}
Output:
vendornum: StaplesA03, link text: View Stock
Explanation:
vendornum=([^"]*) - scan for vendornum=, and capture everything after that until just before "
[^>]*> - scan over remaining attributes, such as class="", up to closing angle bracket
([^<]*) - capture link text
.*$ - scan up to end of text
First of all you should consider using HTML::TreeBuilder for things like this. Once you get the hang of it it can be easier than coming up with regexes. However for quick and dirty tasks, a regex is fine.
use Data::Dump;
$text =
'<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';
$regex =
qr{<a\s[^>]*?href="(?<link>[^"]*?\?vendornum=(?<vendornum>\w+)[^"]*)"[^>]*?>(?<desc>(?:(?!</a>).)*)</a>}i;
while ($text =~ m/$regex/g) { Data::Dump::dd(\%+); }
Returns
{
# tied Tie::Hash::NamedCapture
desc => "View Stock",
link => "/search/title/?vendornum=StaplesA03",
vendornum => "StaplesA03",
}
{
# tied Tie::Hash::NamedCapture
desc => "View Stockings",
link => "/search/product/?Vendornum=651687",
vendornum => 651687,
}
HTH

Perl web scraper, retrieve data from text inside a script tag

So far I was using perl to obtain data from web pages using HTML::TreeBuilder. This was OK when the data was contained inside meta or div tags; but now I stumbled upon a new structure that I don't know how to crawl, though it looks pretty trivial.
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
The example displays the relevant part of the content that I get from the web. I would like to get the values for units and horsePower.
Fragments of the code I was using so far:
use strict;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder;
[...]
$reply = $ua->get($url, @ns_headers);
# printing the reply would get us the first code snippet.
print $reply->content;
unless ($reply->is_success) {
[...]
}
my $tree = HTML::TreeBuilder->new_from_content($reply->content);
my @unit_array = $tree -> look_down(_tag=>'meta','itemprop'=>'unit');
my $unit = $unit_array[0]->attr('content');
[...]
Any one knows how to obtain the relevant data and whether I should use something other than HTML::TreeBuilder for that matter? I haven't found any similar cases searching through stackoverflow and the web.
You are basically on the right path. But HTML::TreeBuilder doesn't understand anything about JavaScript.
The approach:
find the <script> nodes
extract the JSON content from those nodes
NOTE: this will be easy for the example given, but will require more finesse for more complicated <script> content
The escape \; in the regex isn't really required, but the SO syntax highlighter gets confused without it
use JSON to decode the string to Perl data structures
access those data structures in your script
A first rough solution without error checking. I left some debugging lines, commented out, in the code so that you can trace what each step is doing:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;
use JSON;
my $decoder = JSON->new;
my $tree = HTML::TreeBuilder->new_from_file(\*DATA);
#$tree->dump;
my @scripts = $tree->look_down(_tag => 'script');
#$scripts[0]->dump;
# NOTE 1: ->as_text() *DOES NOT* return <script> content!
# NOTE 2: ->as_HTML() probably doesn't work for all cases, i.e. escaping
my $javascript = ($scripts[0]->content_list())[0];
#print "${javascript}\n";
my($json) = $javascript =~ /(\{.+\})\;/s;
#print "${json}\n";
my $object = $decoder->decode($json);
print Dumper($object);
print "FOUND: units: ", $object->{data}->{units},
" horsepower: ", $object->{data}->{horsePower}, "\n";
# IMPORTANT: $tree needs to be destroyed by hand when you're done with it!
$tree->delete;
exit 0;
__DATA__
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
Test run:
$ perl dummy.pl
$VAR1 = {
'data' => {
'horsePower' => '100',
'units' => 'kW'
}
};
FOUND: units: kW horsepower: 100
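For pages with several script blocks, one way to add the extra finesse mentioned above is to anchor the regex on the variable name rather than on the first brace. This is a rough sketch using the core JSON::PP module. The helper name extract_bootstrap is hypothetical, and it assumes the assigned value is strict JSON whose closing brace is followed by a semicolon.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: find the object literal assigned to
# panel.web.bootstrapData in raw JavaScript text and decode it.
sub extract_bootstrap {
    my ($javascript) = @_;
    my ($json) = $javascript =~ /panel\.web\.bootstrapData\s*=\s*(\{.*?\})\s*;/s
        or return;
    return JSON::PP->new->decode($json);
}

# Sample script content; the first assignment is a made-up distraction.
my $js = <<'JS';
var other = { "ignored": true };
panel.web.bootstrapData = {
    "data": {
        "units": "kW",
        "horsePower": 100.00
    }
};
JS

my $data = extract_bootstrap($js) or die "bootstrap data not found\n";
print "units: $data->{data}{units}, horsePower: $data->{data}{horsePower}\n";
```

The non-greedy match stops at the first closing brace that is followed by a semicolon, so a string value containing "};" would still confuse it.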

How to get a matched substring from a string with regular expression in perl [duplicate]

Possible Duplicate:
How can I extract URL and link text from HTML in Perl?
I am trying to get a substring from a string. There could be more than one matching string with that name in the string.
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
From the above string I want to get the file name "65378161_12011_Q.pdf".
if($line=~ m/((.*)Q\.pdf)/i ){
my $inside=$2;
print " file name:$inside \n";
}
This is what I tried, but it does not get the right substring.
Can someone help with this?
I would really appreciate it if someone could answer my question.
Use an HTML parser.
use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
HTML
$w->find('a')->attr('href');
# expression returns '65378161_12011_Q.pdf'
$w->find('a')->text;
# expression returns ' 65378161_12011_Q.pdf '
See the following script:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = "65378161_12011_Q.pdf";
if ($string =~ m/((.*?)Q\.pdf)/i) {
    # $1 is the whole file name, $2 the part before "Q.pdf"
    my $inside = $2;
    print " file name:$inside \n";
}
Your code just lacks the '?' after the quantifier, which tells the regex not to be greedy.
Another way is to match every character that is not a 'Q' before the 'Q' itself:
m/(^[^Q]+)?Q\.pdf/i
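The difference between a greedy, a non-greedy, and an optional group is easiest to see on a string that contains the target twice. A small sketch follows; the sample string is made up.

```perl
use strict;
use warnings;

# Made-up input containing two candidate file names.
my $string = 'a_Q.pdf b_Q.pdf';

my ($greedy)   = $string =~ /(.*)Q\.pdf/i;     # .*  grabs as much as possible
my ($lazy)     = $string =~ /(.*?)Q\.pdf/i;    # .*? stops at the first Q.pdf
my ($optional) = $string =~ /(.*)?Q\.pdf/i;    # (...)? is optional, but still greedy

print "greedy:   '$greedy'\n";      # 'a_Q.pdf b_'
print "lazy:     '$lazy'\n";        # 'a_'
print "optional: '$optional'\n";    # 'a_Q.pdf b_'
```

A '?' placed after the closing parenthesis makes the whole group optional. Only a '?' placed directly after the quantifier makes it non-greedy.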
Edit:
Because you edited your post with a different spec:
If you need to parse HTML, I recommend using a proper module:
Don't parse or modify html with regular expressions! See one of
HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple,
HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response
begins "that's overkill. i only want to..." you are wrong.
http://en.wikipedia.org/wiki/Chomsky_hierarchy and
here for why not to use regex on HTML
(This is a reminder about using regex to parse HTML from #perl channel on irc.freenode.org)
Edit 2:
Here is a complete working example:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content('
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf
</A>
');
$tree->look_down("_tag", "a")->as_text =~ m/(^[^Q]+)Q\.pdf/i && print "$1\n";
Since . will match everything, simply remove the parentheses around it.
#!/usr/bin/perl
my $line = "65378161_12011_Q.pdf";
if ($line =~ m/(.*Q\.pdf)/i )
{
my $inside = $1;
print "filename = $inside\n";
}
Produces the correct output.
Hope it helps.
Manny

How can I modify HTML files in Perl?

I have a bunch of HTML files, and what I want to do is to look in each HTML file for the keyword 'From Argumbay' and change this with some href that I have.
I thought it was very simple at first, so I opened each HTML file and loaded its content into an array (list), then looked for each keyword, replaced it with s///, and dumped the contents back to the file. What's the problem? Sometimes the keyword also appears inside an href, in which case I don't want it replaced, or it can appear inside some tags and such.
An EXAMPLE: http://www.astrosociety.org/education/surf.html
I would like my script to replace each occurrence of the word 'here' with some href that I have in $href, but as you can see, there is another 'here' which is already href'ed; I don't want it to href this one again.
In this case there aren't additional 'here's apart from the one in the href, but let's assume that there are.
I want to replace the keyword only if it's just text. Any idea?
BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I'm missing here?
To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.
A common idiom with HTML-Tree is to use a recursive function that modifies the tree:
use strict;
use warnings;
use 5.008;
use File::Slurp 'read_file';
use HTML::TreeBuilder;
sub replace_keyword
{
my $elt = shift;
return if $elt->is_empty;
$elt->normalize_content; # Make sure text is contiguous
my $content = $elt->content_array_ref;
for (my $i = 0; $i < @$content; ++$i) {
if (ref $content->[$i]) {
# It's a child element, process it recursively:
replace_keyword($content->[$i])
unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
} else {
# It's text:
if ($content->[$i] =~ /here/) { # your keyword or regexp here
$elt->splice_content(
$i, 1, # Replace this text element with...
substr($content->[$i], 0, $-[0]), # the pre-match text
# A hyperlink with the keyword itself:
[ a => { href => 'http://example.com' },
substr($content->[$i], $-[0], $+[0] - $-[0]) ],
substr($content->[$i], $+[0]) # the post-match text
);
} # end if text contains keyword
} # end else text
} # end for $i in content index
} # end replace_keyword
my $content = read_file('foo.shtml');
# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
my $body = $html->look_down(qw(_tag body));
replace_keyword($body);
# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;
print STDOUT $content; # Replace STDOUT with a suitable filehandle
The output from as_HTML will be syntactically correct HTML, but not necessarily nicely-formatted HTML for people to view the source of. You can use HTML::PrettyPrinter to write out the file if you want that.
If tags matter in your search and replace, you'll need to use HTML::Parser.
This tutorial looks a bit easier to understand than the documentation with the module.
If you wanted to go a regular-expression-only type method and you're prepared to accept the following provisos:
this will not work correctly within HTML comments
this will not work where the < or > character is used within a tag
this will not work where the < or > character is used and not part of a tag
this will not work where a tag spans multiple lines (if you're processing one line at a time)
If any of the above conditions do exist then you will have to use one of the HTML/XML parsing strategies outlined by other answers.
Otherwise:
my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";
1 while $html =~ s/
\A # beginning of string
( # group all non-searchfor text
( # sub group non-tag followed by tag
[^<]*? # non-tags (non-greedy)
<[^>]*> # whole tags
)*? # zero or more (non-greedy)
)
\Q$searchfor\E # search text
/$1$replacewith/sx;
Note that this will NOT work if $replacewith contains text that matches $searchfor, because the "1 while" loop would then never terminate (so don't put "From Argumbay" back into the replacement text).
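An alternative under the same provisos is a single pass that splits the string into tags and text, and replaces the keyword only in text that is not inside an existing link. Because each text piece is visited once, the replacement may safely contain the search text. This is only a sketch: the sub name link_keyword and the sample markup are hypothetical, and it is no more robust against comments or stray angle brackets than the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: link a keyword in text outside tags and outside <a>...</a>.
sub link_keyword {
    my ($html, $searchfor, $replacewith) = @_;
    my $in_anchor = 0;
    my $out       = '';
    # The capturing group makes split return the tags as well as the text between them.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ /^</) {
            # Track whether we are currently inside an <a> element.
            $in_anchor++ if $piece =~ m{^<a\b}i;
            $in_anchor-- if $piece =~ m{^</a\s*>}i && $in_anchor > 0;
        }
        elsif (!$in_anchor) {
            $piece =~ s/\Q$searchfor\E/$replacewith/g;
        }
        $out .= $piece;
    }
    return $out;
}

# Made-up sample: the keyword appears as text, in an attribute, and in a link.
my $html = '<p>From Argumbay and <a href="x.html" title="From Argumbay">From Argumbay</a></p>';
print link_keyword($html, 'From Argumbay',
    q{<a href="http://example.com/">From Argumbay</a>}), "\n";
```

Only the first occurrence is linked here; the ones in the title attribute and inside the existing link are left as they are.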

How can I remove unused, nested HTML span tags with a Perl regex?

I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.
I'm having a problem with my regex not picking the correct pair of start and end tags to remove.
my $a = 'a <span>b <span style="color:red;">c</span> d</span>e';
$a =~ s/<span\s*>(.*?)<\/span>/$1/g;
print "$a\n";
returns
a b <span style="color:red;">c d</span>e
but I want it to return
a b <span style="color:red;">c</span> de
Help appreciated.
Try HTML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my @print_span;
my $p = HTML::Parser->new(
    start_h => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
            my $print_tag = %$attr;
            push @print_span, $print_tag;
            return if !$print_tag;
        }
        print $text;
    }, 'text,tagname,attr'],
    end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
            return if !pop @print_span;
        }
        print $text;
    }, 'text,tagname'],
    default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;
__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).
This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
(<span[^>]*>.*+(?1)?.*+<\/span>)
See perlfaq 6.11.
Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
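To illustrate the counting involved (not as a replacement for a real parser), the bookkeeping can be sketched with an explicit stack that remembers, for each open span, whether its tags are kept. The sub name strip_bare_spans is hypothetical, and the sketch assumes no attribute value contains a '>' character.

```perl
use strict;
use warnings;

# Hypothetical helper: drop attribute-less <span> tags and their matching </span>.
sub strip_bare_spans {
    my ($html) = @_;
    my @keep;        # one flag per open span: true if its tags are kept
    my $out = '';
    # The capturing group makes split return the span tags as separate pieces.
    for my $piece (split m{(</?span\b[^>]*>)}i, $html) {
        if ($piece =~ m{^<span\b([^>]*)>}i) {
            my $attrs    = $1;
            my $has_attr = $attrs =~ /\S/ ? 1 : 0;
            push @keep, $has_attr;
            $out .= $piece if $has_attr;
        }
        elsif ($piece =~ m{^</span}i) {
            # An unmatched </span> is passed through unchanged.
            my $kept = @keep ? pop @keep : 1;
            $out .= $piece if $kept;
        }
        else {
            $out .= $piece;
        }
    }
    return $out;
}

print strip_bare_spans('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
# prints: a b <span style="color:red;">c</span> de
```

Each end tag pops the flag of the span it closes, which is the pairing that a single substitution cannot express.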
Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:
a <span>b <span style="color:red;" title="being closed with </span>">c</span> de
Would you have thought of that?
Use an XML processor instead.
Also see the Related Questions (to the right) for your question.
With all your help I've published a script that does everything I need.
http://github.com/timabell/decrufter/