Perl : Extract an HTML element with a particular class using HTML::TokeParser

I'm trying to extract the content of <td> tags that have the class "tablehead1".
<td class="tablehead1"> Market </td>
While parsing, I'm getting the text content of every <td> tag in the whole HTML file.
But I need only the content of <td> tags with the particular class "tablehead1".
Where am I going wrong in the code below?
use HTML::TokeParser;
open(DATA,"<KeyStats.html") or die "Can't open data";
my $p = HTML::TokeParser->new(*DATA);
while (my $token = $p->get_tag('td')) {
    my $url = $token->[1]{class} || "tablehead1";
    my $text = $p->get_trimmed_text("/td");
    if (length($text)<30&&length($text)>0) { print "$text\n"; }
}

You never actually check whether the class is tablehead1. The line
my $url = $token->[1]{class} || "tablehead1";
only assigns a default value to $url, which is never used afterwards.
Replacing it with
next unless $token->[1]{class} eq "tablehead1";
should give you the expected results. You should also check that the <td> actually has a class attribute (otherwise eq compares against an undefined value, which warns under use warnings), e.g. with
next unless grep( /^class$/, @{$token->[2]} ) && $token->[1]{class} eq "tablehead1";
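The check described above can be put together into a small self-contained sketch. The helper name cells_with_class and the sample HTML are made up for illustration (the sample stands in for KeyStats.html), and the comparison is an exact match, so a cell with several classes in one attribute would not be picked up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the trimmed text of every <td> whose class attribute equals $class.
# (Hypothetical helper, not part of HTML::TokeParser.)
sub cells_with_class {
    my ($html, $class) = @_;

    # A reference to a scalar is parsed as the literal document.
    my $p = HTML::TokeParser->new(\$html)
        or die "Can't parse document";

    my @texts;
    while (my $token = $p->get_tag('td')) {
        my $attr = $token->[1];    # hash ref: attribute name => value
        next unless defined($attr->{class}) && $attr->{class} eq $class;
        push @texts, $p->get_trimmed_text('/td');
    }
    return @texts;
}

# Made-up sample input, standing in for KeyStats.html.
my $sample = <<'HTML';
<table>
<tr><td class="tablehead1"> Market </td><td class="other">Ignored</td></tr>
<tr><td>No class here</td><td class="tablehead1">Sector</td></tr>
</table>
HTML

print "$_\n" for cells_with_class($sample, 'tablehead1');
```

Using defined() instead of the grep over the attribute sequence is a design choice of this sketch; both guard against cells that have no class attribute at all.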

Related

Compare a current user against a list of employees logged into computers Perl, CGI

I'm not sure I am doing this correctly, as it doesn't appear to be working.
Is this the correct way to declare my users, and is the if statement correctly formatted?
At the top I have declared:
my $las = 'jpietrza hpietrza oszones';
These are the employees we are checking against the current users.
Further down in the code, I want to change the color of the printed text depending on whether the user is in the list.
while ( $sth->fetch() ) {
    next unless defined $currentuser;
    $lastlogin =~ s/ .*$//;
    $host_name =~ s/1408//;
    foreach ( @las ) {
        if ( $currentuser eq "$_" ) {
            $lacolor = "black";
            last;
        }
        else {
            $lacolor = "red";
        }
    }
    print "<tr>";
    print "<td>$host_name</td>";
    print "<td><font color=\"$lacolor\">System In-Use (User Undisclosed)</font></td><td> </td>";
}

Maybe there was nothing wrong with the if statement, but the whole nine lines of code can be condensed to one very readable line:
$lacolor = (any { $_ eq $currentuser } @las) ? "black" : "red";
The parentheses around the any call are needed: without them, any takes the whole ?: expression as its list. Comparing with eq also avoids interpolating the user name into a regex. any comes from List::Util:
use List::Util qw/any/;
while ($sth->fetch()) {
    # Is $currentuser assigned between the 'while' and this 'next' statement?
    # If not, test it once outside the loop and skip the loop entirely unless it is defined.
    next unless defined $currentuser;
    $lastlogin =~ s/ .*$//;
    $host_name =~ s/1408//;
    $lacolor = (any { $_ eq $currentuser } @las) ? "black" : "red";
    print "<tr>";
    print "<td>$host_name</td>";
    print "<td><font color=\"$lacolor\">System In-Use (User Undisclosed)</font></td><td> </td>";
}
Please also add use strict; and use warnings;

I figured out how to put the array together correctly:
my @las = qw(
jpietrza
hpietrza
oszones
);
instead of:
my $las='
jpietrza
hpietrza
oszones
';

Firstly, as I think you have worked out now, the scalar variable $las and the array @las are completely different. As you've seen, you should declare and initialise your array like this:
my @las = qw(
jpietrza
hpietrza
oszones
);
Actually, I suspect this all gets easier if you store the names in a hash rather than an array:
my %las = map { $_ => 1 } qw(jpietrza hpietrza oszones);
Then your check just becomes:
my $lacolour = $las{$currentuser} ? 'black' : 'red';
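As a runnable sketch of the hash approach: the helper name colour_for and the user name 'someone' are invented for illustration, and the defined() guard is an addition that keeps an undefined user from triggering a warning.

```perl
use strict;
use warnings;

# Users from the question, stored as hash keys so membership is a single lookup.
my %las = map { $_ => 1 } qw(jpietrza hpietrza oszones);

# Return the display colour for a user name (hypothetical helper).
sub colour_for {
    my ($user) = @_;
    return (defined($user) && $las{$user}) ? 'black' : 'red';
}

# 'someone' is a made-up name that is not in the list.
for my $user (qw(hpietrza someone)) {
    printf "%s => %s\n", $user, colour_for($user);
}
```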
A few more points:
Please add use strict and use warnings. And understand and fix the problems they reveal.
The quotes are unnecessary in if ($currentuser eq "$_").
Using a templating system to create the output will make your life a lot easier.
Update: Oh, and one I forgot earlier. It's 2017. No-one has used the font element in HTML for fifteen years. Take a look at CSS.

creating table from 2-Dimensional array in perl has different outputs

Hi, I am generating a table from a 2-dimensional array in Perl.
But the output of my program looks different when viewed in the browser than when viewing the page source with the developer tools in Chrome.
Let me explain.
I have a subroutine that prints the table from the @RESULT array; the code is below:
sub printTableFormattedEmpty {
    my @array = @_;
    print "<table border='0' cellspacing='0' bgcolor='#cfcfcf' cellpadding='0'>\n";
    for (my $row_i = 0; $row_i < @array; $row_i++) {
        print "<tr style='background-color:#B39DB3;'>\n";
        for (my $column_i = 0; $column_i < @{ $array[$row_i] }; $column_i++) {
            my $th = ($row_i == 0) ? "th" : "td";
            print "</$th>";
            print "$array[$row_i][$column_i]";
            my $close = ($row_i == 0) ? 'th' : 'td';
            print "</$close> \n";
        }
        print "</tr> \n";
    }
    print "</table> \n";
}
and I am calling the subroutine like this:
{
    print "Table starts here!\n";
    # $RESULT[0] is an array of many elements, as you can see in the output image
    $RESULT[1][0] = 'No Active bookings available for you !';
    $RESULT[2][0] = 'Click here to create new Booking !';
    &printTableFormattedEmpty(@RESULT);
}
Here I am not getting the expected output in a table; I get different output, as shown in the second figure:
When I inspect the table with Inspect Element, I get:
But when I view the page source, the output is formatted as a table, as shown in the figure:
I am really confused by these two kinds of output; both images are of the same page, without refreshing.
How is this possible?
Did I make a mistake in my program, or is it something else?
Please help me with this.

This is a typo!
There is a slash / in the output of your opening HTML tag.
    for (my $column_i = 0; $column_i < @{ $array[$row_i] }; $column_i++) {
        my $th = ($row_i == 0) ? "th" : "td";
        #       V HERE
        print "</$th>";
        print "$array[$row_i][$column_i]";
        my $close = ($row_i == 0) ? 'th' : 'td';
        print "</$close> \n";
    }
Remove that slash and it will be fine.
As to why your two outputs are different: the HTML inspector shows the DOM structure after the browser has parsed the page. It does not include invalid elements. Since stray closing tags are not valid, the parser most likely just dropped them, so they are gone.
Viewing the source code, on the other hand, shows the real, unparsed code, which still contains the faulty HTML tags. That is also where I saw the extra slashes. (Read: your variable names are badly chosen. You would have seen it yourself had they been something like $open_tag and $closing_tag.)
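The naming advice above could look like this in practice. This is only a sketch: the subroutine name table_html and the sample rows are invented, the styling attributes from the question are left out, and cell values are inserted without escaping. It returns the markup as a string instead of printing it, which makes the result easy to inspect.

```perl
use strict;
use warnings;

# Build an HTML table from a list of array refs; the first row becomes the header.
# (Hypothetical helper.)
sub table_html {
    my @rows = @_;
    my $html = "<table>\n";
    for my $row_i (0 .. $#rows) {
        my $cell_tag = $row_i == 0 ? 'th' : 'td';
        $html .= "<tr>\n";
        for my $cell (@{ $rows[$row_i] }) {
            # Descriptive names make a stray slash in the opening tag easy to spot.
            my $open_tag  = "<$cell_tag>";
            my $close_tag = "</$cell_tag>";
            $html .= "$open_tag$cell$close_tag\n";
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

# Made-up sample data in the shape of @RESULT from the question.
my @sample_rows = (
    [ 'Booking', 'Status' ],
    [ 'No active bookings available for you!' ],
);
print table_html(@sample_rows);
```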

as_html in HTML::TagParser

I'm working in Perl.
I would like to ask whether HTML::TagParser has something like
$value->as_html()
from HTML::TreeBuilder.
I extracted the tag I needed with HTML::TagParser, but now the only option is:
$value->innerText();
which gives me only the text, without the HTML tags.
Or can I somehow connect the result from HTML::TagParser with HTML::TreeBuilder and get my HTML tags that way?

The HTML::TagParser does not only read the element content. It also keeps the element name and the attribute key/value pairs for each selected element. Therefore you can easily reproduce the complete HTML code of the element.
Actually, the HTML::TagParser CPAN page contains an example for this: The following code extracts all <a>nchor tags from a web page and reproduces them into an HTML fragment listing precisely these tags.
my $url = 'http://www.kawa.net/xp/index-e.html';
my $html = HTML::TagParser->new( $url );
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr = $elem->attributes;
    my $text = $elem->innerText;
    print "<$tagname";
    foreach my $key ( sort keys %$attr ) {
        print " $key=\"$attr->{$key}\"";
    }
    if ( $text eq "" ) {
        print " />\n";
    } else {
        print ">$text</$tagname>\n";
    }
}
This works pretty well for simple element scanning. For more complex tasks (e.g. mixed inner HTML content) I would prefer to work with HTML::Parser.

how to find all <p> tags under heading

I have to extract data from this link: http://bit.ly/l1rF5x
What I want to do is extract all <p> tags that come under the <a> tag with the attribute rel="bookmark". My only requirement is that only the <p> tags under this heading are parsed, and the rest are left as they are. For example, on the page I linked, all <p> tags under the heading "IIFT question paper 2006" should be parsed.
Help, please.

You can try using the following :
$(function(){
    var results = '';
    $('a[rel="bookmark"] p').each(function(i,e){
        results += $(e).html() + "\n";
    });
    alert(results);
});
Variable results will be alerted with the required content.
Example : http://jsfiddle.net/eGmWw/1/

Since you haven't provided any information about the language / environment you want to use to extract this information, I've gone ahead and hacked something together with jQuery.
(Updated) You can see it in action here: JS Fiddle.
If you wanted to use PHP, I recommend simplehtmldom
Here is an example using simplehtmldom:
$url = 'http://school-listing.mba4india.com/page/7/';
$html = file_get_html($url);
$data = array();
// Find all anchors with the desired rel attribute
foreach ($html->find('a[rel="bookmark"]') as $a) {
    $h4 = $a->parent(); // Get the anchor's parent (in this case an h4)
    // We're assuming the next sibling is a p tag here - should test for this here
    $p = $h4->next_sibling();
    $content = '';
    // Iterate over all following p tags, until we run out of siblings or find one
    // that isn't a p tag
    while ($p) {
        $content .= (string) $p;
        if ($p->next_sibling() && $p->next_sibling()->tag == 'p') {
            $p = $p->next_sibling();
        } else {
            break;
        }
    }
    $data[] = array('h4' => $h4, 'content' => $content);
}
$br = '<br/>';
foreach ($data as $datum) {
    echo $datum['h4'] . $br . $datum['content'];
    echo $br.$br;
}
Refer to Simplehtmldom Documentation for more!

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!

I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
    # put all opened tags into an array
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result );
    $openedtags = $result[1];
    # put all closed tags into an array
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
    $closedtags = $result[1];
    $len_opened = count ( $openedtags );
    # all tags are closed
    if ( count ( $closedtags ) == $len_opened )
    {
        return $html;
    }
    $openedtags = array_reverse ( $openedtags );
    # close tags
    for ( $i = 0; $i < $len_opened; $i++ )
    {
        if ( !in_array ( $openedtags[$i], $closedtags ) )
        {
            $html .= "</" . $openedtags[$i] . ">";
        }
        else
        {
            unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
        }
    }
    return $html;
}
// close opened html tags
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>

You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion or another, here for example PHP.

This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.

You can parse the data entered by the user, much as an XML parser would. You may need to replace the characters that have a special meaning in HTML or XML, such as '<', '>' and '&', with their entities '&lt;', '&gt;', '&amp;', etc.
That way the user's text is displayed literally and cannot affect the rest of the page.
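A minimal sketch of such a replacement in Perl, assuming the text should be shown literally. The helper name escape_html is made up, and note that escaping everything also disables the limited subset of HTML the question wants to allow, so it only fits if that subset is re-enabled separately or not needed.

```perl
use strict;
use warnings;

# Replace the characters that are significant in HTML with entities.
# '&' is handled first so the entities added afterwards are not escaped again.
# (Hypothetical helper.)
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print escape_html('Hi, my name is <b>John'), "\n";
```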

There is a way to do this using HTML and JavaScript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
    <div>Hi, my name is <b>John</div>
</noscript>
... and then add JavaScript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
    $(this).replaceWith(this.innerText);
});
Note that users without JavaScript will still experience the "bleeding" that you are trying to avoid.
