as_html in HTML::TagParser - html

I'm working in Perl.
I would like to ask whether HTML::TagParser has something like
$value->as_html()
from HTML::TreeBuilder.
I extracted the tag I needed with HTML::TagParser, but the only option I have found is
$value->innerText();
which gives me only the text, without the HTML tags.
Or can I somehow combine the result from HTML::TagParser with HTML::TreeBuilder and get my HTML tags that way?

The HTML::TagParser does not only read the element content. It also keeps the element name and the attribute key/value pairs for each selected element. Therefore you can easily reproduce the complete HTML code of the element.
Actually, the HTML::TagParser CPAN page contains an example for this: The following code extracts all <a>nchor tags from a web page and reproduces them into an HTML fragment listing precisely these tags.
use HTML::TagParser;

my $url  = 'http://www.kawa.net/xp/index-e.html';
my $html = HTML::TagParser->new( $url );
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr    = $elem->attributes;
    my $text    = $elem->innerText;
    print "<$tagname";
    # emit the attributes in a stable (sorted) order
    foreach my $key ( sort keys %$attr ) {
        print " $key=\"$attr->{$key}\"";
    }
    if ( $text eq "" ) {
        print " />\n";
    } else {
        print ">$text</$tagname>\n";
    }
}
This works pretty well for simple element scanning. For more complex tasks (e.g. mixed inner HTML content) I would prefer to work with HTML::Parser.
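One caveat with printing attribute values straight into the markup: a value that contains a double quote would produce broken HTML. Below is a minimal sketch of the same reconstruction idea with escaping added, written in JavaScript purely for illustration. The helper names escapeHtml and elementToHtml are made up for this example and are not part of HTML::TagParser.

```javascript
// Hypothetical helper: escape the characters that are special in HTML
// text and in double-quoted attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: rebuild an HTML string from a tag name,
// an attribute map and the element's plain text content.
function elementToHtml(tagName, attributes, text) {
  const attrs = Object.keys(attributes)
    .sort() // stable attribute order, as in the Perl example
    .map((key) => ` ${key}="${escapeHtml(attributes[key])}"`)
    .join("");
  if (text === "") {
    return `<${tagName}${attrs} />`;
  }
  return `<${tagName}${attrs}>${escapeHtml(text)}</${tagName}>`;
}

console.log(elementToHtml("a", { href: "/x?a=1&b=2", title: "Home" }, "Go home"));
```

Like the Perl loop, this only handles plain text content; nested child elements would still need a real parser.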

Related

creating table from 2-Dimensional array in perl has different outputs

Hi, I am generating a table from a 2-dimensional array in Perl.
But the output of my program looks different when I view the page in the browser than when I view the page source with the developer tools in Chrome.
Let me explain.
I have a subroutine that prints the table from the @RESULT array; the code is below:
sub printTableFormattedEmpty {
    my @array = @_;
    print "<table border='0' cellspacing='0' bgcolor='#cfcfcf' cellpadding='0'>\n";
    for (my $row_i = 0; $row_i < @array; $row_i++) {
        print "<tr style='background-color:#B39DB3;'>\n";
        for (my $column_i = 0; $column_i < @{ $array[$row_i] }; $column_i++) {
            my $th = ($row_i == 0) ? "th" : "td";
            print "</$th>";
            print "$array[$row_i][$column_i]";
            my $close = ($row_i == 0) ? 'th' : 'td';
            print "</$close> \n";
        }
        print "</tr> \n";
    }
    print "</table> \n";
}
and I am calling the subroutine like this:
{
    print "Table starts here!\n";
    # $RESULT[0] is an array of many elements; you can see it in the output image
    $RESULT[1][0] = 'No Active bookings available for you !';
    $RESULT[2][0] = 'Click here to create new Booking !';
    &printTableFormattedEmpty(@RESULT);
}
Now I am not getting the expected output in a table; I am getting different output, as shown in the two figures.
When I inspect the table element, I get:
But when I view the page source, the output is formatted as a table, as shown in the figure:
I am really confused by these two kinds of output; both images are of the same page, without refreshing.
How is this possible?
Did I make a mistake in my program, or is it something else?
Please help me with this.
This is a typo!
There is a slash / in your opening HTML tag output.
for (my $column_i = 0; $column_i < @{ $array[$row_i] }; $column_i++) {
    my $th = ($row_i == 0) ? "th" : "td";
    #       V HERE
    print "</$th>";
    print "$array[$row_i][$column_i]";
    my $close = ($row_i == 0) ? 'th' : 'td';
    print "</$close> \n";
}
Remove that slash and it will be fine.
As to why your two outputs are different: the HTML inspector shows the DOM structure after the browser has parsed the markup. It does not include invalid elements. Since stray closing tags are not valid, the parser most likely dropped them, so they are gone.
Viewing the source code on the other hand shows the real, unparsed code, which contains the wrong markup with the faulty HTML tags included. That is also where I saw the extra slashes. (read: your variable names are badly chosen. You would have seen it yourself had it been something like $open_tag and $closing_tag).
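To illustrate the naming point, here is a small sketch of the same row/cell loop with separate variables for the opening and the closing tag. It is written in JavaScript rather than Perl, and renderTable, openTag and closeTag are hypothetical names, not code from the question.

```javascript
// Hypothetical helper: render a 2-dimensional array as an HTML table.
// The first row is rendered with <th> cells, all other rows with <td>.
function renderTable(rows) {
  const lines = ["<table>"];
  rows.forEach((row, rowIndex) => {
    const cellTag = rowIndex === 0 ? "th" : "td";
    lines.push("<tr>");
    for (const cell of row) {
      const openTag = `<${cellTag}>`;   // no slash here
      const closeTag = `</${cellTag}>`; // slash only in the closing tag
      lines.push(openTag + cell + closeTag);
    }
    lines.push("</tr>");
  });
  lines.push("</table>");
  return lines.join("\n");
}

console.log(renderTable([["Name"], ["Ann"]]));
```

With the two tags named for what they are, an opening tag that starts with a slash stands out immediately.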

how to find all <p> tags under heading

I have to extract data from this link: http://bit.ly/l1rF5x
What I want to do is extract all <p> tags that come under the <a> tag with the attribute rel="bookmark". My only requirement is that only the <p> tags under this heading are parsed; the rest should be left as they are. For example, on the page I have linked, all <p> tags under the heading "IIFT question paper 2006" should be parsed.
Please help.
You can try using the following :
$(function(){
    var results = '';
    $('a[rel="bookmark"] p').each(function(i, e){
        results += $(e).html() + "\n";
    });
    alert(results);
});
Variable results will be alerted with the required content.
Example : http://jsfiddle.net/eGmWw/1/
Since you haven't provided any information about the language / environment you want to use to extract this information, I've gone ahead and hacked something together with jQuery.
(Updated) You can see it in action here: JS Fiddle.
If you wanted to use PHP, I recommend simplehtmldom
Here is an example using simplehtmldom:
$url = 'http://school-listing.mba4india.com/page/7/';
$html = file_get_html($url);
$data = array();
// Find all anchors with the desired rel attribute
foreach ($html->find('a[rel="bookmark"]') as $a) {
    $h4 = $a->parent(); // Get the anchor's parent (in this case an h4)
    // We're assuming the next sibling is a p tag here - should test for this here
    $p = $h4->next_sibling();
    $content = '';
    // Iterate over all following p tags, until we run out of siblings or find one
    // that isn't a p tag
    while ($p) {
        $content .= (string) $p;
        if ($p->next_sibling() && $p->next_sibling()->tag == 'p') {
            $p = $p->next_sibling();
        } else {
            break;
        }
    }
    $data[] = array('h4' => $h4, 'content' => $content);
}
$br = '<br/>';
foreach ($data as $datum) {
    echo $datum['h4'] . $br . $datum['content'];
    echo $br.$br;
}
Refer to Simplehtmldom Documentation for more!

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!
I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
    # put all opened tags into an array
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result );
    $openedtags = $result[1];
    # put all closed tags into an array
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
    $closedtags = $result[1];
    $len_opened = count ( $openedtags );
    # all tags are closed
    if ( count ( $closedtags ) == $len_opened )
    {
        return $html;
    }
    $openedtags = array_reverse ( $openedtags );
    # close tags
    for ( $i = 0; $i < $len_opened; $i++ )
    {
        if ( !in_array ( $openedtags[$i], $closedtags ) )
        {
            $html .= "</" . $openedtags[$i] . ">";
        }
        else
        {
            unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
        }
    }
    return $html;
}
// close opened html tags
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
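Comparing the lists of opened and closed tags does not really track nesting order. A stack is one way to do that; the following is a rough JavaScript sketch of the idea, not a hardened sanitizer. The function name closeOpenTags and the short VOID_TAGS list are assumptions made for the example, and the regex ignores comments and attribute values that contain a '>' character.

```javascript
// Assumption: a short, incomplete list of elements that never take a closing tag.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]);

// Hypothetical helper: append closing tags for every element that is
// still open at the end of the snippet, innermost first.
function closeOpenTags(html) {
  const stack = [];
  // Captures: optional leading slash, tag name, optional trailing slash.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === "/";
    if (VOID_TAGS.has(name) || isSelfClosing) {
      continue; // nothing to keep track of
    }
    if (!isClosing) {
      stack.push(name);
    } else {
      // Drop the most recent matching open tag, if there is one.
      const index = stack.lastIndexOf(name);
      if (index !== -1) {
        stack.splice(index, 1);
      }
    }
  }
  let result = html;
  for (let i = stack.length - 1; i >= 0; i--) {
    result += `</${stack[i]}>`;
  }
  return result;
}

console.log(closeOpenTags("Hi, my name is <b>John"));
```

A snippet whose tags are already balanced comes back unchanged, so the function can be applied to every user submission before it is placed inside the surrounding div.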
You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion or another, here for example PHP.
This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.
You can parse the data entered by the user. That's what an XML parser does. You may need to parse or replace the standard HTML or XML symbols like '<', '>', '/', '&', etc. with '&lt;', '&gt;', etc.
In this way you can achieve whatever you want.
There is a way to do this using HTML and javascript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
<div>Hi, my name is <b>John</div>
</noscript>
... and then add javascript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
$(this).replaceWith(this.innerText);
});
Note that users without javascript will still experience the "bleeding" that you are trying to avoid.

PHP: Inject iframe right after body tag

I would like to place an iframe right below the start of the body tag. This has some issues since the body tag can have various attributes and odd whitespace. My guess is this will require regular expressions to do correctly.
EDIT: This solution has to work with php 4 & performance is a concern of mine. It's for this http://drupal.org/node/586210#comment-2567398
You can use DOMDocument and friends. Assuming you have a variable $html containing the existing HTML document as a string, the basic code is:
$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
$iframe = $doc->createElement('iframe');
$body->insertBefore($iframe, $body->firstChild);
To retrieve the modified HTML text, use
$html = $doc->saveHTML();
EDIT: For PHP4, you can try DOM XML.
Both PHP 4 and PHP 5 should be happy with preg_split():
/* split the string contained in $html in three parts:
* everything before the <body> tag
* the body tag with any attributes in it
* everything following the body tag
*/
$matches = preg_split('/(<body.*?>)/i', $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
/* assemble the HTML output back with the iframe code in it */
$injectedHTML = $matches[0] . $matches[1] . $iframeCode . $matches[2];
Using regular expressions brings up performance concerns... This is what I'm going for
<?php
$html = file_get_contents('http://www.yahoo.com/');
$start = stripos($html, '<body');
$end = stripos($html, '>', $start);
$body = substr_replace($html, '<IFRAME INSERT>', $end+1, 0);
echo htmlentities($body);
?>
Thoughts?
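For what it's worth, the same plain string-search idea can be sketched like this, with a guard for documents that have no body tag. This is JavaScript for illustration only; insertAfterBodyTag is a hypothetical name, and the sketch assumes that no attribute value inside the body tag contains a '>' character.

```javascript
// Hypothetical helper: insert `snippet` directly after the opening <body ...> tag.
// Returns the input unchanged when no body tag is found.
function insertAfterBodyTag(html, snippet) {
  // Case-insensitive search for "<body" followed by whitespace or ">".
  const start = html.search(/<body[\s>]/i);
  if (start === -1) {
    return html;
  }
  // End of the opening tag (assumes no ">" inside an attribute value).
  const end = html.indexOf(">", start);
  if (end === -1) {
    return html;
  }
  return html.slice(0, end + 1) + snippet + html.slice(end + 1);
}

const page = '<html><BODY class="x" >\n<p>hi</p></body></html>';
console.log(insertAfterBodyTag(page, '<iframe src="about:blank"></iframe>'));
```

The work is one search for the tag start and one search for the closing '>', which is the same shape as the stripos() version above.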

How can I extract HTML img tags wrapped in anchors in Perl?

I am working on parsing HTML to obtain all the hrefs that match a particular URL (let's call it the "target url") and then get the anchor text. I have tried the LinkExtractor, TokenParser, Mechanize, and TreeBuilder modules. For the HTML below:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like to see blank value or a string like "image" returned so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using other module or Perl regex?
Thanks,
You should post some examples that you tried with "LinkExtractor, TokenParser, Mechanize & TreeBuilder" so that we can help you.
Here is something which works for me in pQuery:
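use feature 'say';
use pQuery;
# sample markup; the href of the first (non-Yahoo) link is a placeholder
my $data = '
<html>
<a href="http://www.example.com">Not yahoo anchor text</a>
<a href="http://www.yahoo.com"><img src="somepath/nw.gif" alt="Open this result in new window"></img></a>
<a href="http://www.yahoo.com">just text for yahoo</a>
<a href="http://www.yahoo.com">anchor text only<img src="blah" alt="alt text"/></a>
</html>
';
pQuery( $data )->find( 'a' )->each(
    sub {
        say $_->innerHTML
            if $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
    }
);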
# produces:
#
# => <img alt="Open this result in new window" src="somepath/nw.gif"></img>
# => just text for yahoo
# => anchor text only<img /="/" alt="alt text" src="blah"></img>
#
And if you just want the text:
pQuery( $data )->find( 'a' )->each(
    sub {
        return unless $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
        if ( my $text = pQuery($_)->text ) { say $text }
    }
);
# produces:
#
# => just text for yahoo
# => anchor text only
#
/I3az/
Use a proper parser (like HTML::Parser or HTML::TreeBuilder). Using regular expressions to parse SGML (HTML/XML included) isn't really all that effective because of funny multiline tags and attributes like the one you've run into.
If the HTML you are working with is fairly close to well formed you can usually load it into an XML module that supports HTML and use it to find and extract data from the parts of the document you are interested in.
My method of choice is XML::LibXML and XPath.
use feature 'say';
use XML::LibXML;
my $parser = XML::LibXML->new();
my $html = ...;
my $doc = $parser->parse_html_string($html);
my @links = $doc->findnodes('//a[@href = "http://example.com"]');
for my $node (@links) {
    say $node->textContent();
}
The string passed to findnodes is an XPath expression that looks for all 'a' element descendants of $doc that have an href attribute equal to "http://example.com".