perl extract text between html tags using regex - html

I'm new to Perl and I'm trying to extract the text between all <li> </li> tags in a string and assign the results to an array using regex or split/join.
e.g.
my $string = "<ul>
<li>hello</li>
<li>there</li>
<li>everyone</li>
</ul>";
So that this code...
foreach $value (@array) {
print "$value\n";
}
...results in this output:
hello
there
everyone

Note: Do not use regular expressions to parse HTML.
This first option uses HTML::TreeBuilder, one of many HTML parsers available. You can visit the link provided above to read the documentation and see the examples given there.
use strict;
use warnings;
use HTML::TreeBuilder;
my $str
= "<ul>"
. "<li>hello</li>"
. "<li>there</li>"
. "<li>everyone</li>"
. "</ul>"
;
# Now create a new tree to parse the HTML from String $str
my $tr = HTML::TreeBuilder->new_from_content($str);
# And now find all <li> tags and create an array with the values.
my @lists =
map { $_->content_list }
$tr->find_by_tag_name('li');
# And loop through the array returning our values.
foreach my $val (@lists) {
print $val, "\n";
}
If you decide you want to use a regular expression here (which I don't recommend), you could do something like this:
my $str
= "<ul>"
. "<li>hello</li>"
. "<li>there</li>"
. "<li>everyone</li>"
. "</ul>"
;
my @matches;
while ($str =~/(?<=<li>)(.*?)(?=<\/li>)/g) {
push @matches, $1;
}
foreach my $m (@matches) {
print $m, "\n";
}
Output:
hello
there
everyone
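For input as regular as this, a global match in list context collects every capture in one statement, which is a common shorthand for the while/push loop above. This is a minimal sketch using only core Perl; the same caveat applies to real-world HTML (attributes, nesting, line breaks inside tags).

```perl
use strict;
use warnings;

my $str = "<ul><li>hello</li><li>there</li><li>everyone</li></ul>";

# In list context, m//g returns all captured groups at once.
# m{...} is used as the delimiter so '/' needs no escaping.
my @matches = $str =~ m{<li>(.*?)</li>}g;

print "$_\n" for @matches;
```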

Note: Do not use regular expressions to parse HTML.
hwnd has already provided one HTML Parser solution.
However, for a more modern HTML parser based on CSS selectors, you can check out Mojo::DOM. There is a very informative 8 minute intro video at Mojocast episode 5.
use strict;
use warnings;
use Mojo::DOM;
my $html = do {local $/; <DATA>};
my $dom = Mojo::DOM->new($html);
for my $li ($dom->find('li')->map('text')->each) {
print "$li\n";
}
__DATA__
<ul>
<li>hello</li>
<li>there</li>
<li>everyone</li>
</ul>
Outputs:
hello
there
everyone

Related

Regex to parse html for sentences?

I know that HTML::Parser is a thing, and from reading around I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things. But for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, the words per sentence, and hopefully the average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
push @sentences, $1;
}
#for debugging, comment out when running
print join("\n",@sentences);
}
print "</pre>";
Your regex should be />[^<]*?\./gis
The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
There may be other problems.
Now read this
A first improvement would be to write $html =~ />([^<.]+)\. /gs. You need to capture the match with parentheses, and to allow more than one letter per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html=~ m{>([^<]*)<}g) { push @text_content, $1};
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.
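The two-stage idea above can be sketched as a small runnable script. The sample string in $html is a hypothetical stand-in for the contents of sample.html, and the patterns are slightly tightened ([^<]+ and a leading \s*) to skip empty fragments and leading whitespace; treat it as an illustration under those assumptions rather than a robust sentence splitter.

```perl
use strict;
use warnings;

# Hypothetical sample input; the question reads sample.html instead.
my $html = '<p>First one. Second one. </p><div>Third here.</div>';

# Stage 1: grab each non-empty run of text between a '>' and the next '<'.
my @text_content = $html =~ m{>([^<]+)<}g;

# Stage 2: split each fragment into sentences that end in a period.
my @sentences;
for my $fragment (@text_content) {
    while ($fragment =~ m{\s*([^.]+)\.}g) {
        push @sentences, $1;
    }
}

print "$_\n" for @sentences;
```

From here, the sentence count is simply scalar(@sentences), and words per sentence can be derived by splitting each entry on whitespace.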
I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
$contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
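$dom->find('p')->each( sub{
my $p_tag = $_->text;
++$totalWordCount while $p_tag =~ /\S+/g;
# Count sentence terminators without modifying the string mid-match.
++$sentenceCount while $p_tag =~ /[.!?]+/g;
# Count non-whitespace characters once per paragraph.
(my $no_space = $p_tag) =~ s/\s//g;
$char_count += length($no_space);
});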
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = $totalWordCount / $sentenceCount;
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average characters per word: $rounded2.\n\n";

perl test content to see if contains HTML

I would like to test content that is submitted by users to see if contains HTML or not, and I'd prefer to do it without having to write my own regex. Does anyone know of a module that does this or a good way to do this in perl?
You can check out the HTML::Restrict module, which lets you restrict content to a set of allowed tags.
Example:
use 5.012;
use strict;
use warnings;
use HTML::Restrict;
use Data::Dumper;
my @texts = map { { "has_html", 0, "text", $_ } }
split(/==cut-here==/, do{ local $/; <DATA> });
my $res = HTML::Restrict->new();
foreach my $text (@texts) {
my $tmp = $text->{text};
my $plain = $res->process($tmp);
$plain =~ s/\s//gs;
$tmp =~ s/\s//gs;
$text->{has_html} = $tmp cmp $plain ? "YES" : "NO";
}
say Dumper(\@texts);
__DATA__
<img src="image.jpg" alt="tricky>text" />
text with html
==cut-here==
plain
text here
==cut-here==again <!-- a > b --> with html==cut-here==
plain
This will check four chunks of text and report whether each contains HTML. If you configure HTML::Restrict, you can also distinguish between allowed and disallowed HTML tags.

HTML parser using perl

I'm trying to parse an HTML file using a Perl script, and I want to grab all the text inside <p> tags. If I view the source, the data is written in this format:
<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>
Here is my code.
use HTML::TagParser();
use URI::Fetch;
//my @list = $html->getElementsByTagName( "p" );
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
foreach $_ (@array) {
# print "$_\n";
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
}
}
}
$end = $#array1+1;
print "Elements in the array: $end\n";
close $html_fh;
The problem I'm facing is that the generated output is 4.60 MB and a lot of the array elements are just repeated sentences. How can I avoid this repetition? Is there a more efficient way to grab only the lines I'm interested in? Can anybody help me out with this issue?
The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text"); # this array is printed below
foreach $_ (@array) { # This is inside the other loop
# print "$_\n";
print $html_fh "$_\n"; # here comes the print
chomp ($_);
push (@array1, "$_");
}
}
So for example, if you have an array "foo", "bar", "baz", it would print:
foo # first iteration
foo # second
bar
foo # third
bar
baz
So, to fix your duplication errors, move the second loop outside the first one.
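The restructuring described above can be sketched as follows. The @texts list is a hypothetical stand-in for the innerText of each <p> element, so that the example runs without HTML::TagParser; the point is only that collecting and printing happen in two separate, non-nested loops.

```perl
use strict;
use warnings;

# Stand-in for the text of each <p> element (hypothetical data).
my @texts = ("foo\n", "bar\n", "baz\n");

# First loop: collect only.
my @array;
for my $text (@texts) {
    chomp(my $clean = $text);    # strip the trailing newline before storing
    push @array, $clean;
}

# Second loop, now outside the first: each element is printed exactly once.
print "$_\n" for @array;
print "Elements in the array: " . @array . "\n";
```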
Some other notes:
You should always use these two pragmas:
use strict;
use warnings;
They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors that appear more than makes up for the massively reduced time spent debugging.
//my @list = $html->getElementsByTagName( "p" );
Comments in perl start with #. Not sure if this is a typo, because you use this array below.
foreach my $elem ( @list ) {
You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):
for my $elem ($html->getElementsByTagName("p")) {
These variables are also intermediate, and two of them unused.
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
Also note that you never have to quote a variable this way. You can simply do this:
push @array, $elem->innerText;
foreach $_ (@array) {
The $_ variable is used by default, no need to specify it explicitly.
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
I'm not sure why you chomp the variable after printing it but before storing it in the second array; that order doesn't make sense to me. Also, the second array will contain exactly the same elements as the first, only duplicated.
$end = $#array1+1;
This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:
$end = @array1; # size = last index + 1
But you can do this in one go:
print "Elements in the array: " . @array1 . "\n";
Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , it would have list context, and the array would have been expanded into a list of its elements. This is a typical way to manipulate by context.
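A small sketch of the context rules just described, using only core Perl. The array contents and variable names are illustrative and not taken from the question.

```perl
use strict;
use warnings;

my @array1 = ('foo', 'bar', 'baz');

my $size       = @array1;     # scalar assignment: number of elements
my $last_index = $#array1;    # index of the last element

# Concatenation forces scalar context, so the array yields its size.
my $with_dot = "Elements: " . @array1;

# The comma builds a list, so the array expands into its elements.
my $with_comma = join '', "Elements: ", @array1;

print "$size $last_index\n";    # prints "3 2"
print "$with_dot\n";            # prints "Elements: 3"
print "$with_comma\n";          # prints "Elements: foobarbaz"
```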
close $html_fh;
Explicitly closing a file handle is not required, as it will be closed automatically when the script ends.
If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):
#!/usr/bin/env perl
use strict;
use warnings qw(all);
use URI;
use Web::Scraper;
my $result = scraper {
process 'p',
'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));
for my $test (@{$result->{paragraph}}) {
print "$test\n";
}
print "Elements in the array: " . (scalar @{$result->{paragraph}});
Here is another way to get all the content from between <p> tags, this time using Mojo::DOM, part of the Mojolicious project.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10; # say
use Mojo::DOM;
my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END
my $dom = Mojo::DOM->new($html);
my @paragraphs = $dom->find('p')->map('text')->each;
say for @paragraphs;

Perl regular expression for html

I need to extract the IMDB id (for example, tt0416449 for the movie 300) for a movie specified by the variable URL. I have looked at the page source and come up with the following regex:
use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";
if (is_success( $content = LWP::Simple::get($url) ) ) {
print "$url is alive!\n";
} else {
print "No movies found";
}
$code = "";
if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
$code = $1;
}
I am getting an internal server error at this line
$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
I am very new to perl, and would be grateful if anyone could point out my mistake(s).
Use an HTML parser. Regular expressions cannot parse HTML.
Anyway, the reason for the error is probably that you forgot to escape a forward slash in your regex. Your code also reads $1, so the id needs to be wrapped in capturing parentheses. It should look like this:
/<td class="number">1\.<\/td><td class="image"><a href="\/title\/(tt[\d]{1,7})"/s
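One way to sidestep slash escaping entirely is to pick a different delimiter for the match operator. The sketch below uses m{...} and a capture group so that $1 is populated; the $content string is a hypothetical fragment shaped like the markup the question targets, not actual IMDB output.

```perl
use strict;
use warnings;

# Hypothetical fragment shaped like the markup the question's regex targets.
my $content = '<td class="number">1.</td><td class="image"><a href="/title/tt0416449/">';

my $code = '';
# With m{...} as the delimiter, '/' needs no escaping;
# the parentheses capture the id into $1.
if ($content =~ m{<td class="number">1\.</td><td class="image"><a href="/title/(tt\d{1,7})}s) {
    $code = $1;
}

print "$code\n";    # prints "tt0416449"
```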
A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.
Long version
The combination of its UserAgent, DOM and URL classes can work in a very robust way:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;
# preparations
my $ua = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";
# try to load the page
my $tx = $ua->get($url);
# error handling
die join ', ' => $tx->error unless $tx->success;
# extract the url
my $movie_link = $tx->res->dom('a[href^=/title]')->first;
my $movie_url = Mojo::URL->new($movie_link->attrs('href'));
say $movie_url->path->parts->[-1];
Output:
tt0381061
Short version
The funny one liner helper module ojo helps to build a very short version:
$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attrs("href") =~ m|([^/]+)/?$|'
Output:
tt0381061
I agree XML is anti-line-editing thus anti-unix but, there is AWK.
If awk can do, perl can surely do. I can produce a list:
curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '
$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq
Pass search string to "ID".
Basically it's all about how you choose your tokenizer in awk, I use the <a> tag. Should be easier in perl.

Extract text from HTML - Perl using HTML::TreeBuilder

I'm trying to read a set of .html files and extract the text in their <p> tags. Logically, my code below should work: using HTML::TreeBuilder, I parse the HTML and then extract the text in <p> using find_by_attribute("p"). But my script produces empty directories. Did I leave anything out?
#!/usr/bin/perl
use strict;
use HTML::TreeBuilder 3;
use FileHandle;
my @task = ('ar','cn','en','id','vn');
foreach my $lang (@task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my @files = grep (/\.html/,readdir ($dir));
closedir ($dir);
foreach my $file (@files) {
open (my $fh, '<', "./$lang/$file") or die "$!";
my $root = HTML::TreeBuilder->new;
$root->parse_file("./$lang/$file");
my @all_p = $root->find_by_attribute("p");
foreach my $p (@all_p) {
my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
my $filewrite = substr($file, 0, -5);
open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
print $outwrite $ptag->as_text . "\n";
my $pcontents = $ptag->as_text;
print $pcontents . "\n";
close (outwrite);
}
close (FH);
}
}
My .html files are the plain text htmls from .asp websites e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp
My .html files are saved in:
./ar/*
./cn/*
./en/*
./id/*
./vn/*
You are confusing element with attribute. The program can be written much more concisely:
#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);
foreach my $lang (qw(ar cn en id vn)) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
foreach my $file (bsd_glob "./$lang/*.html") {
my $basename = file($file)->basename;
$basename =~ s/[.]html$/.txt/;
open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
$out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
close $out;
}
}
Use find_by_tag_name to search for tag names, not find_by_attribute.
You want find_by_tag_name, not find_by_attribute:
my @all_p = $root->find_by_tag_name("p");
From the docs:
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or under $h that have
any of the specified tag names. In scalar context, returns the first
(in pre-order traversal of the tree) such element found, or undef if
none.
You might want to take a look at Mojo::DOM which lets you use CSS selectors.