HTML parser using perl

I'm trying to parse an HTML file with a Perl script. I want to grab all the text inside the p tags. In the page source, the data is written in this format:
<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>
Here is my code:
use HTML::TagParser();
use URI::Fetch;
//my @list = $html->getElementsByTagName( "p" );
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
foreach $_ (@array) {
# print "$_\n";
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
}
}
}
$end = $#array1+1;
print "Elements in the array: $end\n";
close $html_fh;
The problem I'm facing is that the generated output is 4.60 MB, and many of the array elements are just repeated sentences. How can I avoid this repetition? Is there a more efficient way to grep the lines I'm interested in? Can anybody help me with this?

The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text"); # this array is printed below
foreach $_ (@array) { # This is inside the other loop
# print "$_\n";
print $html_fh "$_\n"; # here comes the print
chomp ($_);
push (@array1, "$_");
}
}
So for example, if you have an array "foo", "bar", "baz", it would print:
foo # first iteration
foo # second
bar
foo # third
bar
baz
So, to fix your duplication errors, move the second loop outside the first one.
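A minimal sketch of the corrected structure, using a plain list of strings in place of the parsed elements. The sample data and the names @texts and @collected are assumptions for illustration, not part of the original script:

```perl
use strict;
use warnings;

# Stand-in for the innerText of each <p> element (hypothetical sample data).
my @texts = ('foo', 'bar', 'baz');

# First loop: only collect.
my @collected;
for my $text (@texts) {
    push @collected, $text;
}

# Second loop, now outside the first: each element is printed exactly once.
for my $line (@collected) {
    print "$line\n";
}

print "Elements in the array: " . @collected . "\n";
```

With the loops separated, three input elements produce three output lines instead of six.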
Some other notes:
You should always use these two pragmas:
use strict;
use warnings;
They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors that appear more than makes up for the massively reduced time spent debugging.
//my @list = $html->getElementsByTagName( "p" );
Comments in perl start with #. Not sure if this is a typo, because you use this array below.
foreach my $elem ( @list ) {
You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):
for my $elem ($html->getElementsByTagName("p")) {
These variables are also intermediate, and two of them unused.
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (@array,"$text");
Also note that you never have to quote a variable this way. You can simply do this:
push @array, $elem->innerText;
foreach $_ (@array) {
The $_ variable is used by default, no need to specify it explicitly.
print $html_fh "$_\n";
chomp ($_);
push (@array1, "$_");
I'm not sure why you are chomping the variable after you print it but before you store it in the other array; it doesn't make sense to me. Also, this second array will contain exactly the same elements as the first one, only duplicated.
$end = $#array1+1;
This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:
$end = @array1; # size = last index + 1
But you can do this in one go:
print "Elements in the array: " . @array1 . "\n";
Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , it would have list context, and the array would have been expanded into a list of its elements. This is a typical way to manipulate by context.
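The difference between the two contexts can be sketched as follows. The array @items and the other variable names are made up for the example:

```perl
use strict;
use warnings;

my @items = ('a', 'b', 'c');

# Concatenation imposes scalar context: the array yields its size.
my $with_dot = "Count: " . @items;

# A list imposes list context: the array is expanded into its elements.
my @with_comma = ("Count: ", @items);

# scalar() forces scalar context explicitly.
my $size = scalar @items;

print "$with_dot\n";
print "List has ", scalar(@with_comma), " elements\n";
print "Size: $size\n";
```

Here $with_dot holds "Count: 3", while @with_comma holds four elements: the label followed by the three items.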
close $html_fh;
Explicitly closing a file handle is not required, as it will be closed automatically when the script ends.

If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):
#!/usr/bin/env perl
use strict;
use warnings qw(all);
use URI;
use Web::Scraper;
my $result = scraper {
process 'p',
'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));
for my $test (@{$result->{paragraph}}) {
print "$test\n";
}
print "Elements in the array: " . (scalar @{$result->{paragraph}});

Here is another way to get all the content from between <p> tags, this time using Mojo::DOM, part of the Mojolicious project.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10; # say
use Mojo::DOM;
my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END
my $dom = Mojo::DOM->new($html);
# map('text') is used here because pluck('text') was removed from later Mojolicious releases
my @paragraphs = $dom->find('p')->map('text')->each;
say for @paragraphs;

Related

How do I add link to cells using HTML::TagTree?

I am using HTML::TagTree as it seems to create the html file for table easily. I want to add html links to the text within some cells.
From the documentation provided here, I am not able to get a clear answer about how to add a new tag for the text inside a cell. Here's my code.
Main line to focus on: $new_row->td($1,'style=text-align:center','a:href="second_page.html"')
I don't think I clearly understand how to add more tags and attributes. Can someone please help?
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TagTree;
my $filename = 'list.txt';
my $html = HTML::TagTree->new('html'); # Define the top of the tree of objects.
my $head = $html->head(); # Put a 'head' branch on the tree.
my $body = $html->body(); # Put a 'body' branch on the tree
$head->title("Report");
$head->meta('', 'name=author CONTENT="xxx"');
$body->div->h1('Main page name'); # Example of method chaining to create
# a long branch.
my $table = $body->table('', 'width=100% border=1');
my $row1 = $table->tr();
$row1->td('Feature Code','style=background-color:khaki;text-align:center');
$row1->td('Feature Name','style=background-color:khaki;text-align:center');
$row1->td('% completed','style=background-color:khaki;text-align:center');
open(my $fh, '<', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>){
if($row =~ m/([.\d]+): (.+)/){
my $new_row = $table->tr();
$new_row->td($1,'style=text-align:center','a:href="page_for_each_item.html"');
$new_row->td($2);
}
}
# Print to STDOUT the actual HTML representation of the tree
$html->print_html();

After some tries I think I found one way of doing it:
my $new_row = $table->tr();
my $text = $html->object();
$text->a($original_text,"href=second_page.html");
$new_row->td($text,'style=text-align:center');

How can I delete parts of a JSON web response?

I have a simple Perl script, and I want to remove everything up to the word "city", or remove everything up to the nth occurrence (the 2nd in my particular case) of a comma. Here's what it looks like:
#!/usr/bin/perl
use warnings;
use strict;
my $CMD = `curl http://ip-api.com/json/8.8.8.8`;
chomp($CMD);
my $find = "^[^city]*city";
$CMD =~ s/$find//;
print $CMD;
The output is this:
{"as":"AS15169 Google Inc.","city":"Mountain View","country":"United States","countryCode":"US","isp":"Google","lat" :37.386,"lon":-122.0838,"org":"Google","query":"8.8.8.8","region":"CA","regionName":"California","status":"success","timezone":"America/Los_Angeles","zip":"94035"}
So I want to drop
" {"as":"AS15169 Google Inc.","
or drop up to
{"as":"AS15169 Google Inc.","city":"Mountain View",
EDIT:
I see I was doing far too much when matching the string. I simplified the fix for my problem with removing all before "city". My $find has been changed to
my $find = ".*city";
While I also changed the replace function like so,
$CMD =~ s/$find/city/;
Still haven't figured out how to remove all before the nth occurrence of a comma or any character / string for that matter.

The content you get back is JSON, so you can easily turn it into a Perl data structure, play with it, and even turn it back into JSON if you like. That's the point! And, it's so easy:
use Mojo::UserAgent;
use Mojo::JSON qw(decode_json encode_json);
my $ua = Mojo::UserAgent->new;
my $tx = $ua->get( 'http://ip-api.com/json/8.8.8.8' );
my $json = $tx->res->body;
my $perl = decode_json( $json );
delete $perl->{'as'};
my $new_json = encode_json( $perl );
print $new_json;
Mojolicious is wonderful for this. It's my preferred way of dealing with JSON even without the user-agent stuff. If you play with the JSON string directly, you're likely to have problems when the order of elements changes or it contains wide characters.
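The same decode, modify, encode round trip can also be sketched with the core JSON::PP module instead of Mojo::JSON. This is a swapped-in alternative, not the approach above, and the shortened response body is a hypothetical stand-in for the HTTP result:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical, shortened response body standing in for the HTTP result.
my $json = '{"as":"AS15169 Google Inc.","city":"Mountain View","country":"United States"}';

my $data = decode_json($json);   # JSON text -> Perl hash reference
delete $data->{as};              # drop the key, wherever it appeared in the text

# canonical() sorts the keys so the output order is stable.
my $new_json = JSON::PP->new->canonical->encode($data);
print "$new_json\n";
```

Because the key is removed from the data structure rather than from the string, the result does not depend on where "as" appeared in the response.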

You don't have to manually decode_json() with Mojolicious. Simply do this:
my $tx = $ua->get('http://ip-api.com/json/8.8.8.8');
my $json = $tx->res->json;
my $as = $json->{as};
You can even go fancy with JSON pointers:
my $as = $tx->res->json("/as");

Something like
#!/usr/bin/perl -w
my $results = `curl http://ip-api.com/json/8.8.8.8`;
chomp $results;
$results =~ s/^.*city":"\w+\s?\w+",//g;
print $results . "\n";
should do the trick, unless there's a misunderstanding of what you want to keep vs. remove.
FYI, http://regexr.com/ is totally my go to for regex happiness.
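For the other half of the question, removing everything up to the nth comma, a quantified group is one possible sketch. The sample string is shortened and the variable names are assumptions. Note that this treats the response as plain text, so it will miscount if a value ever contains a comma:

```perl
use strict;
use warnings;

my $text = '{"as":"AS15169 Google Inc.","city":"Mountain View","country":"United States"}';
my $n    = 2;   # drop everything up to and including the 2nd comma

# (?:[^,]*,){$n} matches exactly $n runs of "non-commas followed by a comma".
(my $rest = $text) =~ s/^(?:[^,]*,){$n}//;
print "$rest\n";
```

With $n set to 2, what remains starts at the "country" key.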

Delete one character at End of File in PERL

I have encountered a problem while programming in Perl. I use a foreach loop to get some data out of a hash, so it has to loop through it.
The Code:
foreach $title (keys %FilterSPRINTHASH) {
$openSP = $FilterSPRINTHASH{$title}{openSP};
$estSP = $FilterSPRINTHASH{$title}{estSP};
$line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP}\n";
print $outfile "$line\n";
}
The thing is, I am creating a separate file using Perl's file-writing syntax, and it will contain JSONP text (later used for HTML).
Back to the problem:
As JSONP requires a comma "," after every line that is not the last one, I had to put a comma at the end of each line; however, when the last line comes, I have to remove the comma.
I have tried the chop function, but I'm not sure where to put it: if I put it at the end of the foreach, it will just chop the comma in $line, but it won't chop it in the new file I created.
I have also tried a while (<>) statement, with no success.
Any ideas appreciated.
BR

Using the JSON module is far less error-prone; there is no need to reinvent the wheel:
use JSON;
print $outfile encode_json(\%FilterSPRINTHASH), "\n";
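Since the goal is JSONP, the encoded text still needs to be wrapped in a function call. A minimal sketch using the core JSON::PP module follows; the sample hash contents and the function name callback are placeholders, not taken from the question:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample data in the shape used in the question.
my %FilterSPRINTHASH = (
    'Sprint 1' => { openSP => 5, estSP => 8 },
    'Sprint 2' => { openSP => 3, estSP => 13 },
);

# canonical() sorts keys so the output is reproducible.
my $json  = JSON::PP->new->canonical->encode(\%FilterSPRINTHASH);
my $jsonp = "callback($json);";   # 'callback' is a placeholder function name
print "$jsonp\n";
```

The encoder takes care of commas, quoting, and escaping, so there is no trailing comma to remove.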

You can check if it is the last iteration of the loop, then remove the comma from line.
So something like
my $count = keys %FilterSPRINTHASH; #Get number of keys (scalar context)
my $loop_count = 1; #Use a variable to count number of iteration
foreach $title (keys %FilterSPRINTHASH){
$openSP = $FilterSPRINTHASH{$title}{openSP};
$estSP = $FilterSPRINTHASH{$title}{estSP};
$line = "'$title':{'openSP' : $openSP, 'estSP' : $estSP},\n"; # every line gets a trailing comma
if($loop_count == $count){
#this is the last iteration, so remove the comma from line
$line =~ s/,+$//;
}
print $outfile "$line\n";
$loop_count++;
}

I would approach this by storing your output in an array and then joining that with the line separators you wish:
my @output; # storage for output
foreach $title (keys %FilterSPRINTHASH) {
# create each line
my $line = sprintf "'%s':{'openSP' : %s, 'estSP' : %s}", $title, $FilterSPRINTHASH{$title}{openSP}, $FilterSPRINTHASH{$title}{estSP};
# and put it in the output container
push @output, $line;
}
# join all outputlines with comma and newline and then output
print $outfile (join ",\n", @output);
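A self-contained version of the same idea, printing to STDOUT instead of $outfile. The sample hash contents are hypothetical, and the keys are sorted only to make the output order predictable:

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape used in the question.
my %FilterSPRINTHASH = (
    'Sprint 1' => { openSP => 5, estSP => 8 },
    'Sprint 2' => { openSP => 3, estSP => 13 },
);

my @output;
for my $title (sort keys %FilterSPRINTHASH) {
    # The hash slice pulls both values out in a fixed order.
    push @output, sprintf "'%s':{'openSP' : %s, 'estSP' : %s}",
        $title, @{ $FilterSPRINTHASH{$title} }{qw(openSP estSP)};
}

# join puts the separator only between elements, never after the last one.
my $joined = join ",\n", @output;
print "$joined\n";
```

Because join never emits a separator after the final element, the trailing-comma problem does not arise in the first place.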

Regex to parse html for sentences?

I know that HTML::Parser is a thing, and from reading around I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things, but for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, words per sentence, and hopefully the average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
push @sentences, $1;
}
#for debugging, comment out when running
print join("\n",@sentences);
}
print "</pre>";

Your regex should be />[^<]*?\./gis
The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
There may be other problems.

A first improvement would be to write $html =~ />([^<.]+)\. /gs: you need to capture the match with parentheses, and to allow more than one letter per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html=~ m{>([^<]*)<}g) { push @text_content, $1};
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.
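A runnable sketch of this two-step approach, under the same caveats. The sample HTML and the variable names are made up for the example, and the patterns are tightened slightly (+ instead of *, plus a leading \s*) to skip empty fragments and leading spaces:

```perl
use strict;
use warnings;

my $html = '<p>First one. Second one.</p><div>Third here.</div>';

# Step 1: collect every run of text between a '>' and the next '<'.
my @text_content;
while ($html =~ m{>([^<]+)<}g) {
    push @text_content, $1;
}

# Step 2: split each fragment into sentences that end with a period.
my @sentences;
for my $fragment (@text_content) {
    while ($fragment =~ m{\s*([^.]+)\.}g) {
        push @sentences, $1;
    }
}

print "$_\n" for @sentences;
print "Sentences found: " . @sentences . "\n";
```

This finds all three sentences in the sample, including the second one inside the first element, which the single-regex version misses.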

I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
$contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
$dom->find('p')->each( sub{
my $p_tag = $_->text;
++$totalWordCount while $p_tag =~ /\S+/g;
++$sentenceCount while $p_tag =~ /[.!?]+/g;
# Count non-whitespace characters on a copy, once per paragraph,
# so the string being matched above is never modified mid-loop
(my $chars = $p_tag) =~ s/\s//g;
$char_count += length($chars);
});
die "No words or sentences found in p tags\n" unless $totalWordCount && $sentenceCount;
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = sprintf("%.2f", $totalWordCount / $sentenceCount);
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average characters per word: $rounded2.\n\n";

Convert CSS Style Attributes to HTML Attributes using Perl

Real quick background: We have a PDFMaker (HTMLDoc) that converts HTML into a PDF. HTMLDoc doesn't consistently pick up the styles that we need from the HTML that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to width=80 height=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (@match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (@match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;

You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
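Putting these pieces together, one possible sketch handles several style attributes per line by calling a helper from the replacement side of the substitution. The helper name style_to_attrs and the sample line are hypothetical, and only width and height are kept, as described in the question:

```perl
use strict;
use warnings;

my $line = '<td style="width:80px;height:90px;color:red">x</td>';

# Hypothetical helper: turn one style value into width/height attributes.
sub style_to_attrs {
    my ($guts) = @_;
    my @attrs;
    for my $style (split /;/, $guts) {
        my ($name, $value) = split /:/, $style, 2;
        next unless defined $value;
        s/^\s+|\s+$//g for $name, $value;    # trim surrounding whitespace
        next unless $name =~ /^(?:width|height)$/i;
        $value =~ s/px$//i;                  # strip the unit
        push @attrs, lc($name) . qq{="$value"};
    }
    return join ' ', @attrs;
}

# /e evaluates the replacement as code; /g covers every style attribute on the line.
$line =~ s/style="(.+?)"/style_to_attrs($1)/gie;
print "$line\n";
```

Any property other than width and height is dropped, so a style attribute containing neither is replaced by an empty string.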

Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is a quick example using HTML::TreeBuilder which adds a border="1" attribute to any tag that has a style attribute with border content:
use strict;
use warnings;
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
print $tree->as_HTML, "\n"; # print rather than say, which would need "use feature 'say'"
Which will reproduce the HTML but with border="1" added just to the span tag.
Alongside these modules, you can also have a look at CSS and CSS::DOM to help parse the CSS bit.

I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.
