Regex to parse html for sentences?

Regex to parse html for sentences? - html

I know that HTML:Parser is a thing and from reading around, I've realized that trying to parse html with regex is usually a suboptimal way of doing things, but for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved html doc. Eventually I want to be able to calculate the number of sentences, words/sentence and hopefully average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
#sentences;
while ($html =~ />[^<]\. /gis) {
push #sentences, $1;
}
#for debugging, comment out when running
print join("\n",#sentences);
}
print "</pre>";

Your regex should be />[^<]*?./gis
The *? means match zero or more non greedy. As it stood your regex would match only a single non < character followed by a period and a space. This way it will match all non < until the first period.
There may be other problems.
Now read this

A first improvement would be to write $html =~ />([^<.]+)\. /gs, you need to capture the match with the parents, and to allow more than 1 letter per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html=~ m{>([^<]*<}g) { push #text_content, $1};
foreach (#text_content) { while( m{([^.]*)\.}gs) { push #sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.

I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
$contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
$dom->find('p')->each( sub{
my $p_tag = $_->text;
++$totalWordCount while $p_tag =~ /\S+/g;
while ($p_tag =~ /[.!?]+/g){
$p_tag =~ s/\s//g;
$char_count += (length($p_tag));
$sentenceCount++;
}
});
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = $totalWordCount / $sentenceCount;
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average words per sentence: $rounded2.\n\n";

Related

Delete one character at End of File in PERL

So I have encountered a problem while programming with PERL. I use a foreach loop to get some data out of the hash, so it has to loop through it.
The Code:
foreach $title (keys %FilterSPRINTHASH) {
$openSP = $FilterSPRINTHASH{$title}{openSP};
$estSP = $FilterSPRINTHASH{$title}{estSP};
$line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP}\n";
print $outfile "$line\n";
}
The thing is, that I am creating a seperate File with the PERL's writting to a file expression, which will be a JSONP text (later used for HTML).
Back to the problem:
As JSONP requires comma's "," after every line that is not the last one, i had to put a comma at the end of line, however when the last line comes in, I have to remove the comma.
I have tried with CHOP function, but not sure where to put it, since if I put it at the end of foreach, it will just chop the comma in $line, but this wont chop it in the new file I created.
I have also tried with while (<>) statement, with no success.
Any ideas appreaciated.
BR

Using JSON module is far less error prone; no need to reinvent the wheel
use JSON;
print $outfile encode_json(\%FilterSPRINTHASH), "\n";

You can check if it is the last iteration of the loop, then remove the comma from line.
So something like
my $count = keys %FilterSPRINTHASH; #Get number of keys (scalar context)
my $loop_count = 1; #Use a variable to count number of iteration
foreach $title (keys %FilterSPRINTHASH){
$openSP = $FilterSPRINTHASH{$title}{openSP};
$estSP = $FilterSPRINTHASH{$title}{estSP};
$line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP}\n";
if($loop_count == $count){
#this is the last iteration, so remove the comma from line
$line =~ s/,+$//;
}
print $outfile "$line\n";
$loop_count++;
}

i would approach this by storing your output in an array and then joining that with the line separators you wish:
my #output; # storage for output
foreach $title (keys %FilterSPRINTHASH) {
# create each line
my $line = sprintf "'%s':{'openSP' : %s, 'estSP' : %s}", $title, $FilterSPRINTHASH{$title}{openSP}, $FilterSPRINTHASH{$title}{estSP};
# and put it in the output container
push #output, $line;
}
# join all outputlines with comma and newline and then output
print $outfile (join ",\n", #output);

HTML parser using perl

I'm trying to parse the html file using perl script. I'm trying to grep all the text with html tag p. If I view the source code the data is written in this format.
<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>
Here is the following code.
use HTML::TagParser();
use URI::Fetch;
//my #list = $html->getElementsByTagName( "p" );
foreach my $elem ( #list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text");
foreach $_ (#array) {
# print "$_\n";
print $html_fh "$_\n";
chomp ($_);
push (#array1, "$_");
}
}
}
$end = $#array1+1;
print "Elements in the array: $end\n";
close $html_fh;
The problem that I'm facing is that the output which is generated is 4.60 Mb and lot of the array elements are just repetition sentences. How can I avoid such repetition? Is there any other efficient way to grep the lines which I'm interested. Can anybody help me out with this issue?

The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.
foreach my $elem ( #list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text"); # this array is printed below
foreach $_ (#array) { # This is inside the other loop
# print "$_\n";
print $html_fh "$_\n"; # here comes the print
chomp ($_);
push (#array1, "$_");
}
}
So for example, if you have an array "foo", "bar", "baz", it would print:
foo # first iteration
foo # second
bar
foo # third
bar
baz
So, to fix your duplication errors, move the second loop outside the first one.
Some other notes:
You should always use these two pragmas:
use strict;
use warnings;
They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors that appear more than make up for the massively reduced time spent debugging.
//my #list = $html->getElementsByTagName( "p" );
Comments in perl start with #. Not sure if this is a typo, because you use this array below.
foreach my $elem ( #list ) {
You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):
for my $elem ($html->getElementsByTagName("p")) {
These variables are also intermediate, and two of them unused.
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text");
Also note that you never have to quote a variable this way. You can simply do this:
push #array, $elem->innerText;
foreach $_ (#array) {
The $_ variable is used by default, no need to specify it explicitly.
print $html_fh "$_\n";
chomp ($_);
push (#array1, "$_");
I'm not sure why you are chomping the variable after you print it, but before you store it in this other array, but it doesn't seem to make sense to me. Also, this other array will contain the exact same elements as the other array, only duplicated.
$end = $#array1+1;
This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:
$end = #array1; # size = last index + 1
But you can do this in one go:
print "Elements in the array: " . #array1 . "\n";
Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , it would have list context, and the array would have been expanded into a list of its elements. This is a typical way to manipulate by context.
close $html_fh;
Explicitly closing a file handle is not required as it will automatically closed when the script ends.

If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):
#!/usr/bin/env perl
use strict;
use warnings qw(all);
use URI;
use Web::Scraper;
my $result = scraper {
process 'p',
'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));
for my $test (#{$result->{paragraph}}) {
print "$test\n";
}
print "Elements in the array: " . (scalar #{$result->{paragraph}});

Here is another way to get all the content from between <p> tags, this time using Mojo::DOM part of the Mojolicious project.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10; # say
use Mojo::DOM;
my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END
my $dom = Mojo::DOM->new($html);
my #paragraphs = $dom->find('p')->pluck('text')->each;
say for #paragraphs;

HTML sorting with Perl regex

I have an HTML file consisting of an HTML table with links to Scientific Papers and Authors and with their year of publishing. The html is sorted from oldest to newest. I need to resort the table by parsing the file and getting a new file with the sorted source code from newest to oldest.
Here is a small perl script that should be doing the job but it produces semi-sorted results
local $/=undef;
open(FILE, "pubTable.html") or die "Couldn't open file: $!";
binmode FILE;
my $html = <FILE>;
open (OUTFILE, ">>sorted.html") || die "Can't oupen output file.\n";
map{print OUTFILE "<tr>$_->[0]</tr>"}
sort{$b->[1] <=> $a->[1]}
map{[$_, m|, +(\d{4}).*</a>|]}
$html =~ m|<tr>(.*?)</tr>|gs;
close (FILE);
close (OUTFILE);
And here is my input file:
link
and what I get as an output:
link
From the output you can see the order is going well but then I get the year 1993 after the year 1992 and not in the beginning of the list.

There was a problem with the regex in the map because of the following lines in the html.
,{UCLA}-Report 982051,Los Angeles,,1989,</td> </tr>
and
Phys.Rev.Lett., <b> 60</b>, 1514, 1988</td> </tr>
Phys. Rev. B, <b> 45</b>, 7115, 1992</td> </tr>
J.Chem.Phys., <b> 96</b>, 2269, 1992</td> </tr>
In the 1989 line the year includes a comma at the end and there's no whitespace in front. Because of that, the script threw a lot of warnings and always put that line in the bottom.
The other three lines have a four-digit number (\d{4}) with something behind it .* (the year). So the sorting used the other numbers (7115, 2269, 1514) to sort and those were mixed up with the years.
You need to adjust the regex accordingly to fix those issues.
Before:
map{[$_, m|, +(\d{4}).*</a>|]}
After:
map{[$_, m|, *(\d{4}),?</a>|]}

And a solution with XML::Twig, which can also be used to process HTML. It's fairly robust: it won't process other tables in the file, it will accommodate typos like the one in the year in the UCLA report...
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $IN = 'sort_input.html';
my $OUT = 'sort_output.html';
my $t= XML::Twig->new( twig_handlers => { 'table[#class="pubtable"]' => \&sort_table,
},
pretty_print => 'indented',
)
->parsefile_html( $IN)
->print_to_file( $OUT);
sub sort_table
{ my( $t, $table)= #_;
$table->sort_children( sub { if($_[0]->last_child( 'td')->text=~ m{(\d+)\D*$}s) { $1; } },
type => 'numeric', order => 'reverse'
);
}

Solution with a robust HTML parsing/manipulation library:
use strictures;
use autodie qw(:all);
use Web::Query qw();
my $w = Web::Query->new_from_file('pubTable.html');
$w->find('table')->html(
join q(),
map { $_->[0]->html }
sort { $a->[1] <=> $b->[1] }
#{
$w->find('tr')->map(sub {
my (undef, $row) = #_;
my ($year) = $row->find('.pubname')->text =~ /(\d\d\d\d) ,? \s* \z/msx;
return [$row => $year];
})
}
);
open my $out, '>:encoding(UTF-8)', 'sorted.html';
print {$out} $w->html;
close $out;

Perl: HTML::PrettyPrinter - Handling self-closing tags

I am a newcomer to Perl (Strawberry Perl v5.12.3 on Windows 7), trying to write a script to aid me with a repetitive HTML formatting task. The files need to be hand-edited in future and I want them to be human-friendly, so after processing using the HTML package (HTML::TreeBuilder etc.), I am writing the result to a file using HTML::PrettyPrinter. All of this works well and the output from PrettyPrinter is very nice and human-readable. However, PrettyPrinter is not handling self-closing tags well; basically, it seems to be treat the slash as an HTML attribute. With input like:
<img />
PrettyPrinter returns:
<img /="/" >
Is there anything I can do to avoid this other than preprocessing with a regex to remove the backslash?
Not sure it will be helpful, but here is my setup for the pretty printing:
my $hpp = HTML::PrettyPrinter->new('linelength' => 120, 'quote_attr' => 1);
$hpp->allow_forced_nl(1);
my $output = new FileHandle ">output.html";
if (defined $output) {
$hpp->select($output);
my $linearray_ref = $hpp->format($internal);
undef $output;
$hpp->select(undef),
}

You can print formatted human readable html with TreeBuilder method:
$h = HTML::TreeBuilder->new_from_content($html);
print $h->as_HTML('',"\t");
but if you still prefer this bugged prettyprinter try to remove problem tags, no idea why someone need ...
$h = HTML::TreeBuilder->new_from_content($html);
while(my $n = $h->look_down(_tag=>img,'src'=>undef)) { $n->delete }
UPD:
well... then we can fix the PrettyPrinter. It's pure perl module so lets see...
No idea where on windows perl modules are for me it's /usr/local/share/perl/5.10.1/HTML/PrettyPrinter.pm
maybe not an elegant solution, but will work i hope.
this sub parse attribute/value pairs, a little fix and it will add single '/' at the end
~line 756 in PrettyPrinter.pm
I've marked the stings that i added with ###<<<<<< at the end
#
# format the attributes
#
sub _attributes {
my ($self, $e) = #_;
my #result = (); # list of ATTR="value" strings to return
my $self_closing = 0; ###<<<<<<
my #attrs = $e->all_external_attr(); # list (name0, val0, name1, val1, ...)
while (#attrs) {
my ($a,$v) = (shift #attrs,shift #attrs); # get current name, value pair
if($a eq '/') { ###<<<<<<
$self_closing=1; ###<<<<<<
next; ###<<<<<<
} ###<<<<<<
# string for output: 1. attribute name
my $s = $self->uppercase? "\U$a" : $a;.
# value part, skip for boolean attributes if desired
unless ($a eq lc($v) &&
$self->min_bool_attr &&.
exists($HTML::Tagset::boolean_attr{$e->tag}) &&
(ref($HTML::Tagset::boolean_attr{$e->tag}).
? $HTML::Tagset::boolean_attr{$e->tag}{$a}.
: $HTML::Tagset::boolean_attr{$e->tag} eq $a)) {
my $q = '';
# quote value?
if ($self->quote_attr || $v =~ tr/a-zA-Z0-9.-//c) {
# use single quote if value contains double quotes but no single quotes
$q = ($v =~ tr/"// && $v !~ tr/'//) ? "'" : '"'; # catch emacs ");
}
# add value part
$s .= '='.$q.(encode_entities($v,$q.$self->entities)).$q;
}
# add string to resulting list
push #result, $s;
}
push #result,'/' if $self_closing; ###<<<<<<
return #result; # return list ('attr="val"','attr="val"',...);
}

Convert CSS Style Attributes to HTML Attributes using Perl

Real quick background : We have a PDFMaker (HTMLDoc) that converts html into a pdf. HTMLDoc doesn't consistently pick up the styles that we need from the html that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to height=80 width=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!#ARGV[0] || !#ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", #ARGV[0] or die $!;
open OUTFILE, ">", #ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!#ARGV[0] || !#ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", #ARGV[0] or die $!;
open OUTFILE, ">", #ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (#match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (#match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;

You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.

Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is quick example using HTML::TreeBuilder which adds border="1" attribute to any tag that has style attribute with border content:
use strict;
use warnings;
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
say $tree->as_HTML;
Which will reproduce the HTML but with border="1" added just to the span tag.
In unison to these modules you can also have a look at CSS and CSS::DOM to help parse the CSS bit.

I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Regex to parse html for sentences? - html

Your regex should be />[^<]?./gis The ? means match zero or more non greedy. As it stood your regex would match only a single non < character followed by a period and a space. This way it will match all non < until the first period. There may be other problems. Now read this

Related

Delete one character at End of File in PERL

HTML parser using perl

HTML sorting with Perl regex

Perl: HTML::PrettyPrinter - Handling self-closing tags

Convert CSS Style Attributes to HTML Attributes using Perl

Categories

Resources

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Regex to parse html for sentences? - html

Your regex should be />[^<]*?./gis The *? means match zero or more non greedy. As it stood your regex would match only a single non < character followed by a period and a space. This way it will match all non < until the first period. There may be other problems. Now read this

Related

Delete one character at End of File in PERL

HTML parser using perl

HTML sorting with Perl regex

Perl: HTML::PrettyPrinter - Handling self-closing tags

Convert CSS Style Attributes to HTML Attributes using Perl

Categories

Resources

Your regex should be />[^<]?./gis The ? means match zero or more non greedy. As it stood your regex would match only a single non < character followed by a period and a space. This way it will match all non < until the first period. There may be other problems. Now read this