perl - matching greater than character in regex - html

$string1="<a href='/channels/folder1'>Alpha-Seeking";
$string2="<a href='/channels/folder2'>No Underlying Index ,";
I need to extract "Alpha-Seeking" and "No Underlying Index ," from the above 2 strings.
Basically, need everything from ('>) to the last character of the string.
Tried two ways,
1) The standard, intuitive approach:
if ($string1=~ /\'>(.*?)/) {print "got $1";}
but this does not seem to work on '>' symbol.
2) Also tried
if ($string1=~ /(?=>)(.*?)/) {print "got $1";}
based on inputs from Greater than and less than symbol in regular expressions, but it is not working.
Any inputs will be useful.
PS: Also, if the answer can include matching the "less than" symbol ("<"), that will be great!
Thanks

Do not parse HTML with a regex. Regexes are very bad at parsing complex, balanced text like HTML.
For example:
<tag>
outer
<tag>
middle
<tag>inner</tag>
middle
</tag>
outer
</tag>
Instead, use an HTML parser and search tools such as XPath.
Here is a demonstration using XML::LibXML.
use strict;
use warnings;
use v5.10;
use XML::LibXML;
my $html = q{
<html>
<body>
<a href='/channels/folder1'>Alpha-Seeking</a>
<a href='/channels/folder2'>No Underlying Index</a>
</body>
</html>
};
# Parse the HTML
my $dom = XML::LibXML->load_html(string => $html);
# Find all links.
for my $node ($dom->findnodes('//a')) {
# Print their text.
say $node->textContent;
}
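To see why the nested example above trips up a regex, here is a minimal sketch using only core Perl. The one-line `$html` string is a hypothetical flattening of that nested `<tag>` sample, not anything from the question.

```perl
use strict;
use warnings;

# Hypothetical one-line version of the nested <tag> example.
my $html = '<tag>outer<tag>middle<tag>inner</tag>middle</tag>outer</tag>';

# Lazy quantifier: stops at the FIRST closing tag, so the capture is unbalanced.
my ($lazy) = $html =~ m{<tag>(.*?)</tag>};

# Greedy quantifier: runs to the LAST closing tag. It looks right here only
# because the sample contains a single outermost element.
my ($greedy) = $html =~ m{<tag>(.*)</tag>};

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```

Neither pattern knows which closing tag belongs to which opening tag, which is the job a real parser does.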

I must start by reiterating that it's incredibly unwise to parse HTML or XML with regexes. Please consider using a proper HTML parser.
Having said that, your problem here is pretty simple to fix. What you call the "standard intuitive approach" works fine with a simple tweak.
Here's what you have:
if ($string1=~ /\'>(.*?)/) {print "got $1";}
And your regex is \'>(.*?). That means "find a literal quote mark, followed by a greater than sign and then capture the minimum amount of anything following that". It's "the minimum amount" that's the problem. The simplest thing that .*? can capture is nothing - the empty string.
Quantifiers such as * are greedy by default; they match as much as possible. Adding ? removes that greediness and makes them match as little as possible. But you don't want that here. Here, you want the greediness. So just remove that ?.
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings) {
if ($string =~ /'>(.*)/) { # Note: No "?" here
print "got $1\n";
}
}
This displays:
got Alpha-Seeking
got No Underlying Index ,
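As a side-by-side comparison, the following sketch runs the lazy, greedy, and end-anchored variants against the first string from the question. It also covers the PS: neither `<` nor `>` is a regex metacharacter, so both match literally without escaping. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

my ($lazy)     = $string =~ /'>(.*?)/;    # lazy with nothing after it: captures ''
my ($greedy)   = $string =~ /'>(.*)/;     # greedy: captures to the end of the line
my ($anchored) = $string =~ /'>(.*?)$/;   # lazy, but the anchor forces it to the end
my ($tag)      = $string =~ /<(\w+)/;     # '<' needs no escaping either

print "lazy=[$lazy] greedy=[$greedy] anchored=[$anchored] tag=[$tag]\n";
```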

This works for me
use warnings;
use strict;
my @strings = (
"<a href='/channels/folder1'>Alpha-Seeking",
"<a href='/channels/folder2'>No Underlying Index ,"
);
for my $string (@strings)
{
if ($string =~ /'>(.*?)$/)
{
print "got $1\n";
}
}
running it gives
$ perl /tmp/abc.pl
got Alpha-Seeking
got No Underlying Index ,

While exploring various options, I managed to get this working with the following:
Replace the greater than sign with some other generic symbol (like a pipe)
$string=~ s/>/\|/g; #Interestingly, '>' matches here without any issues
After that, split on the pipe char, and print/parse the second part:
($o1,$o2) = split(/\|/, $string);
print "$o2|";
Works perfectly as a work-around.
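The substitution step is probably not needed, since split can use `>` as the separator directly. A minimal sketch, assuming the text of interest is everything after the first `>`:

```perl
use strict;
use warnings;

my $string = "<a href='/channels/folder1'>Alpha-Seeking";

# Split on the first '>' only (limit of 2) and keep everything after it.
my (undef, $text) = split />/, $string, 2;

print "$text\n";
```

The limit of 2 keeps any later `>` characters inside the captured text instead of splitting on them too.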

Related

Perl Regular Expression to extract value from nested html tags

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/){
$title = $1;
}else {
$title="";
}
print"$title";
OUTPUT: Google</b></h1>
It Should be : Google
Unable to extract value from link using Regex in Perl, it could have one more or less nesting:
<h1><b><i>Google</i></b></h1>
Please Try this:
1) <td>Unix shell
2) <h1><b>HP</b></h1>
3) generic</td>);
4) <span>[</span>1<span>]</span>
OUTPUT:
Unix shell
HP
generic
[1]
Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:
use Mojo::DOM;
my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->at('a[href="#google"]')->all_text, "\n";
Or with HTML::TreeBuilder::XPath:
use HTML::TreeBuilder::XPath;
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));
print $dom->findvalue('//a[@href="#google"]'), "\n";
Try this:
if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)
That should take "everything after the href and between the <b>...</b> tags".
Instead, to get "everything after the last > and before the first </", you can use
<a.*?href.*?>([^>]*?)<\/
For the original simple case you could use the regex below. (The requirements are no longer simple; see @amon's answer for how to use an HTML parser.)
/<a.*?>([^<]+)</
Match an opening a tag, followed by anything until you find something between > and <.
Though as others have mentioned, you should generally use a HTML parser.
echo '<td>Unix shell
<h1><b>HP</b></h1>
generic</td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
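If the nesting depth inside the link varies and a parser really is not an option, one possible approach is to capture the whole anchor body first and strip any nested tags afterwards. This is only a sketch: `anchor_text` is a hypothetical helper, the sample strings and their href values are assumptions, and it will still break on anything irregular (a `>` inside an attribute, nested anchors, comments).

```perl
use strict;
use warnings;

# Return the visible text of the first <a ...>...</a>, with nested tags removed.
sub anchor_text {
    my ($html) = @_;
    my ($inner) = $html =~ m{<a\b[^>]*>(.*?)</a>}is or return '';
    $inner =~ s/<[^>]*>//g;    # drop every nested tag, whatever the depth
    return $inner;
}

# Hypothetical inputs with different nesting depths.
print anchor_text('<a href="#google"><h1><b>Google</b></h1></a>'), "\n";
print anchor_text('<a href="#ref"><span>[</span>1<span>]</span></a>'), "\n";
```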
I came up with this regex that works for all your sampled inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*
(?<=>)((?:\w+)(?:\s*))(?1)*
Just take the first element of the returned array, ie array[0]

perl: strip html tags, manipulate text, and then return html tags to their original positions

I'm using the HTML::Strip module to remove all html tags from a file. I want to then manipulate the resulting text (stripped of html) and finally return the html tags to their original positions.
The text manipulation I'm doing requires breaking the text into arrays using split(/ /, $text). I then do some natural language processing of the resulting arrays (including adding new html tags to some key words). Once I'm finished processing the text, I'd like to return the original tags to their places while keeping the text manipulations I've done in the meantime intact.
I would be satisfied if I could simply remove all whitespace from within the original tags (since whitespace within tags is ignored by browsers). That way my NLProcessing could simply ignore words that are tags (contain '<' or '>').
I've tried diving into the guts of HTML::Strip (in an effort to modify it to my needs), but I can't understand what the following piece of code does:
my $stripped = $self->strip_html( $text );
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Seems like strip_html is a sub, but I can't find that sub anywhere.
Anyway thanks for any and all advice.
... the next day...
After a bit of back and forth with @amon, I have come up with a solution that I believe is sufficient for my purposes. amon pushed me in the right direction even though he recommended I not do what I've done anyway, haha.
It is a brutish method, but gets the job done satisfactorily. Gonna leave it here in the off chance that someone else has the same wishes as me and doesn't mind a quick and dirty solution:
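use HTML::Strip;
open my $fh, '<', 'text.html' or die "Cannot open text.html: $!";
my $input = do { local $/; <$fh> }; # slurp the whole file into one string
my $hs = HTML::Strip->new;
my $stripped = $hs->parse($input);
$hs->eof;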
so now I have two string variables. One is the html file I want to manipulate, and the other is the same file stripped of html.
use List::Util qw(uniq); # provides uniq
my @marks = split(/\s/, $stripped);
@marks = uniq(@marks);
Now I have a list of all non-HTMLtag-associated words that appear in my file.
$input = HTML::Entities::decode($input);
$input =~ s/\</ \</g;
$input =~ s/\>/\> /g;
$input =~ s/\n/ \n /g;
$input =~ s/\r/ \r /g;
$input =~ s/\t/ \t /g;
Now I've decoded my HTML containing var and have ensured that no word is up against a "<", or ">" or non-space whitespace character.
foreach my $mark (@marks) { $input =~ s/ \Q$mark\E / TAQ+${mark}TAQ /g; }
$input =~ s/TAQ\+TAQ//g;
Now I've "tagged" each word with a "+" and have separated words from non-words with the TAQ delimiter. I can now split on TAQ and ignore any item that does not contain a "+" when performing my NLP and text manipulation. Once I'm done, I rejoin and strip all of the "+". Follow that with some clever encoding, remove all the extra spaces I inserted, and BAM! I've now got my NLProcessing completed, have manipulated the text, and still have all of my HTML in the right places.
There are a lot of caveats here, and I'm not going to go into all of them. Most problematic is the need to decode and then encode, coupled with the fact that HTML::Strip doesn't always strip all the javascript or invalid HTML. There are ways to work around that, but again I don't have room or time to discuss that here.
Thanks amon for your help, and I welcome any criticism or suggestions. I'm new to this.
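A lighter-weight variation on the same idea, offered only as a sketch: split with a capturing group keeps the tags in the resulting list, so the text chunks can be processed in place and everything joined back together afterwards. `transform_text_only` is a hypothetical helper, and it assumes that no `>` appears inside an attribute value, comment, or script.

```perl
use strict;
use warnings;

# Apply $transform to the text between tags, leaving the tags untouched.
sub transform_text_only {
    my ($html, $transform) = @_;
    # A capturing group in the split pattern keeps the delimiters (the tags).
    my @chunks = split /(<[^>]*>)/, $html;
    for my $chunk (@chunks) {
        next if $chunk =~ /^</;          # a tag: leave it exactly as it was
        $chunk = $transform->($chunk);   # plain text: hand it to the callback
    }
    return join '', @chunks;
}

my $out = transform_text_only('<p>Hello <b>big</b> world</p>', sub { uc shift });
print "$out\n";
```

The callback here just uppercases the text; the NLP step would go in its place.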
The module HTML::Strip uses the XS glue language to connect the Perl code with C code. You can find the XS file e.g. on (meta-)cpan. It includes a file strip_html.c that implements the actual algorithm. Due to the definitions in the XS file, a strip_html sub is available in the Perl code as part of the HTML::Strip package. Therefore, it can be invoked as a method on an appropriate object.
Explanation of that piece of code
my $stripped = $self->strip_html( $text );
This will invoke the C function on the contents of $text to strip all the HTML tags. The stripped data will then be assigned to $stripped.
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Suffixing variable names with -p is a lispish tradition to indicate boolean variables (or predicates, in mathematics). Here, it indicates if HTML::Entities could be loaded: my $_html_entities_p = eval 'require HTML::Entities';. If the configuration option decode_entities was set to a true value, and HTML::Entities could be loaded, then entities will be decoded in the stripped data.
Example: given the input
<code> $x &lt; $y </code>
then stripping would produce
$x &lt; $y
and decoding the entities would make it
$x < $y
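The optional-module check described above can be sketched on its own like this. `$have_entities` and `decode_basic` are hypothetical names, and the fallback table is an assumption that covers only four common named entities; it is not what HTML::Strip itself does.

```perl
use strict;
use warnings;

# True only if HTML::Entities could be loaded, in the spirit of $_html_entities_p.
my $have_entities = eval { require HTML::Entities; 1 };

sub decode_basic {
    my ($text) = @_;
    if ($have_entities) {
        HTML::Entities::decode_entities($text);   # void context: decodes in place
        return $text;
    }
    # Fallback (assumption): handle just a few common named entities.
    my %map = (lt => '<', gt => '>', amp => '&', quot => '"');
    $text =~ s/&(lt|gt|amp|quot);/$map{$1}/g;
    return $text;
}

print decode_basic('$x &lt; $y'), "\n";
```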

Perl - split html code by "table" tag and its contents

I'm trying to split a chunk of html code by the "table" tag and its contents.
So, I tried
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my @values = split(/<table*.*\/table>/, $html);
After this, I want the @values array to look like this:
array('aaa', 'bbb', 'ccc').
But it returns this array:
array('aaa', 'ccc').
Can anyone tell me how I can specify to the split function that each table should be parsed separately?
Thank you!
Your regex is greedy, change it to /<table.*?\/table>/ and it will do what you want. But you should really look into a proper HTML parser if you are going to be doing any serious work. A search of CPAN should find one that is suited to your needs.
Your regex .* is greedy, therefore chewing its way to the last part of the string. Change it to .*? and it should work better.
Use a ? to specify non-greedy wild-card char slurping, i.e.
my @values = split(/<table*.*?\/table>/, $html);
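To make the difference concrete, here is a small sketch that runs both variants on the string from the question. The patterns are slightly tidied versions of the ones above (`<table` rather than `<table*`), which is my own simplification.

```perl
use strict;
use warnings;

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';

my @greedy = split /<table.*<\/table>/,  $html;   # one big match spanning both tables
my @lazy   = split /<table.*?<\/table>/, $html;   # one match per table

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

If a table can span several lines, the pattern would also need the s modifier so that . matches newlines.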
Maybe using HTML parser is a bit overkill for your example, but it will pay off later when your example grows. Solution using HTML::TreeBuilder:
use HTML::TreeBuilder;
use Data::Dump qw(dd);
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my $tree = HTML::TreeBuilder->new_from_content($html);
# remove all <table>....</table>
$_->delete for $tree->find('table');
dd($tree->guts); # ("aaa", "bbb", "ccc")

Parse HTML Page For Links With Regex Using Perl [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best in Perl, but I've done stuff like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which one's better in Perl). I've been searching the perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using =~ to match stuff rather than capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;
while ( my $anchor = $parser->get_tag('a') ) {
if ( my $href = $anchor->get_attr('href') ) {
push @hrefs, $href if $href =~ m!/en/subtitles/!;
}
}
print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;
my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a');
$tree = $tree->delete;
# Do stuff with links array
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
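On the "capture rather than match" part of the question: in list context, a match with the g modifier returns every capture, so the results can go straight into an array. A minimal sketch, assuming well-behaved input; the second link is a made-up example to show the single-quote case.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a href="/en/subtitles/3586224/death-becomes-her-en" class="bnone">Death Becomes Her</a>
<a href='/en/subtitles/1/example-en'>Example</a>
HTML

# In list context, /g returns the captured group from every match.
my @hrefs = $html =~ /href\s*=\s*["']([^"']+)["']/gi;

print "$_\n" for @hrefs;
```

This still fails on unquoted attribute values and on URLs that contain quote characters, so the caveats above apply unchanged.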

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on how many opening parentheses come before it in the regex.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
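Here is a small sketch that exercises all three modifiers at once. The `$chunk` string is a hypothetical upper-case, multi-line variant of the blog markup, chosen so that each flag actually matters, and the pattern is a simplified version of the one in the script.

```perl
use strict;
use warnings;

# Hypothetical sample: upper-case tags (needs i) and a multi-line body (needs s).
my $chunk = qq{<A NAME="2004-10-25"><STRONG>October 25th</STRONG></A></DT>\n<DD>\nline one\nline two\n</DD>};

my ($name, $title, $body) = $chunk =~ /<a\sname="
        ([^"]*)             # first capture: everything up to the closing quote
        ">\s*<strong>
        ([^<]*)             # second capture: everything up to the next tag
        <\/strong>.*?<dd>
        (.*)                # third capture: the body, newlines included
        <\/dd>/six;

print "name=$name title=$title\n";
```

Without x the layout whitespace and comments would be treated as part of the pattern, without i the upper-case tags would not match, and without s the . could not cross the newlines in the body.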
1) [^"]* means "any string of zero or more characters that doesn't contain a quotation mark". It sits between two literal quotes, so together they match a quoted string, the kind that follows <a name=
2) [^>]* is similar to the above; it means any string that doesn't contain >. Note here that you probably mean [^<], to match until the opening < of the next tag, not including the actual opening.
3) Those are regex modifier flags. They are not PHP-specific, although PHP's PCRE functions accept the same letters. i means case-insensitive, s lets . match newlines, and x allows whitespace and comments inside the pattern.
1) The code is an extended regex (the x modifier), which allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise it behaves like a normal regex.
2) Same.
3) The characters are regex modifiers.