How to find all <p> tags under a heading - html

I have to extract data from this link: http://bit.ly/l1rF5x
I want to extract all <p> tags that come under the <a> tag with the attribute rel="bookmark". My only requirement is that only the <p> tags under this heading are parsed; everything else should be left as it is. For example, on the page linked above, all <p> tags under the heading "IIFT question paper 2006" should be parsed.
Help please.

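You can try the following:
$(function () {
    var results = '';
    // Assuming the <p> tags follow the heading that wraps the anchor (they are
    // not descendants of the anchor), start from the anchor's parent and take
    // the sibling <p> tags that come right after it
    $('a[rel="bookmark"]').parent().nextUntil(':not(p)').each(function (i, e) {
        results += $(e).html() + "\n";
    });
    alert(results);
});
The results variable is alerted with the required content.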
Example : http://jsfiddle.net/eGmWw/1/

Since you haven't provided any information about the language / environment you want to use to extract this information, I've gone ahead and hacked something together with jQuery.
(Updated) You can see it in action here: JS Fiddle.
If you want to use PHP, I recommend simplehtmldom.
Here is an example using simplehtmldom:
include 'simple_html_dom.php'; // adjust the path to wherever simplehtmldom lives
$url = 'http://school-listing.mba4india.com/page/7/';
$html = file_get_html($url);
$data = array();
// Find all anchors with the desired rel attribute
foreach ($html->find('a[rel="bookmark"]') as $a) {
    $h4 = $a->parent(); // Get the anchor's parent (in this case an h4)
    $p = $h4->next_sibling();
    $content = '';
    // Iterate over the following siblings until we run out of them or find one
    // that isn't a p tag (this also covers a first sibling that isn't a p tag)
    while ($p && $p->tag == 'p') {
        $content .= (string) $p;
        $p = $p->next_sibling();
    }
    $data[] = array('h4' => $h4, 'content' => $content);
}
$br = '<br/>';
foreach ($data as $datum) {
    echo $datum['h4'] . $br . $datum['content'];
    echo $br . $br;
}
Refer to Simplehtmldom Documentation for more!
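If neither jQuery nor PHP is an option, the same sibling walk can be sketched with Python's standard library. This is only a rough illustration, not a drop-in solution: the names BookmarkParagraphs and paragraphs_under_bookmarks are made up for this example, and it assumes the markup is reasonably well formed, with every <p> explicitly closed and the paragraphs sitting directly after the heading that wraps the rel="bookmark" anchor.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BookmarkParagraphs(HTMLParser):
    """Collect the <p> text that directly follows a heading wrapping <a rel="bookmark">."""

    def __init__(self):
        super().__init__()
        self.sections = []        # [(heading text, [paragraph text, ...]), ...]
        self._heading = None      # name of the heading tag currently open
        self._bookmark = False    # open heading contains a rel="bookmark" anchor
        self._collecting = False  # following <p> siblings belong to the last heading
        self._in_p = False        # inside a <p> that is being collected
        self._text = []           # text pieces of the current heading or paragraph

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._heading, self._bookmark = tag, False
            self._collecting, self._in_p, self._text = False, False, []
        elif tag == "a" and self._heading and dict(attrs).get("rel") == "bookmark":
            self._bookmark = True
        elif tag == "p" and self._collecting:
            self._in_p, self._text = True, []
        elif self._collecting and not self._in_p:
            self._collecting = False  # any other sibling ends the run of paragraphs

    def handle_endtag(self, tag):
        if tag == self._heading:
            if self._bookmark:
                self.sections.append(("".join(self._text).strip(), []))
                self._collecting = True
            self._heading = None
        elif tag == "p" and self._in_p:
            self.sections[-1][1].append("".join(self._text).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._heading or self._in_p:
            self._text.append(data)


def paragraphs_under_bookmarks(html):
    """Return a list of (heading text, [paragraph texts]) pairs."""
    parser = BookmarkParagraphs()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Calling paragraphs_under_bookmarks() on the downloaded page source gives one entry per bookmarked heading, so picking out "IIFT question paper 2006" is a matter of comparing the heading text.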

Related

Echo a variable from a multidimensional array outside a function

The code below works for a string value but not when I try to access the variable directly.
The data being accessed is a table at http://webrates.truefx.com/rates/connect.html?f=html
My code strips the tags and puts the values in an array, $row0, which is then passed to a function. But I can't get the result out of the function. The function is simplified for this question; I intend to concatenate some of the variables inside it once I find out what I'm doing wrong.
$row0 = array();
include "scrape/simple_html_dom.php";
$url = "http://webrates.truefx.com/rates/connect.html?f=html";
$html = new simple_html_dom();
$html->load_file($url);
foreach ($html->find('tr') as $i => $row) {
    foreach ($row->find('td') as $j => $col) {
        $row0[$i][$j]= strip_tags($col);
    }
}
myArray($row0); //table stripped of tags
function myArray($arr) {
    $a = 'hello'; //$arr[0][0]; HELLO will come out but not the variable
    $b = $arr[1][0];
    $r[0] = $a;
    $r[1] = $b;
    //echo $r[1]; If the //'s are removed one can see the proper value here but not outside the function.
    return $r;
}
$arrayToEcho = myArray($arr);
echo $arrayToEcho[0]; // will echo "first"
I have tried all the suggestions from here:
http://stackoverflow.com/questions/3451906/multiple-returns-from-function
http://stackoverflow.com/questions/5692568/php-function-return-array
Suggestions are appreciated, and more info is available if required. Thank you very much for viewing.
You need to get the innertext of $col in your loop. Like this:
$row0[$i][$j]= $col->innertext;
The next thing is:
myArray($row0);
This call will correctly return the parsed array; try echoing it and you'll see. But when you do this:
$arrayToEcho = myArray($arr);
...you're referring to $arr, which is a local variable (a parameter, actually) inside your function myArray and does not exist outside it. So what you probably meant was this:
$arrayToEcho = myArray($row0);
Hope this helps!
UPDATE
Look, I show you what happens when you call a function:

as_html in HTML::TagParser

I'm working in Perl.
I would like to ask whether HTML::TagParser has something like
$value->as_html()
from HTML::TreeBuilder.
I extracted the tag I needed with HTML::TagParser, but now the only option is:
$value->innerText();
which gives me only the text, without the HTML tags.
Or can I somehow combine the result from HTML::TagParser with HTML::TreeBuilder and get my HTML tags that way?
HTML::TagParser does not only read the element content. It also keeps the element name and the attribute key/value pairs for each selected element, so you can easily reproduce the complete HTML code of the element.
Actually, the HTML::TagParser CPAN page contains an example for this: the following code extracts all anchor (<a>) tags from a web page and reproduces them as an HTML fragment listing precisely these tags.
use HTML::TagParser;
my $url = 'http://www.kawa.net/xp/index-e.html';
my $html = HTML::TagParser->new( $url );
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
    my $tagname = $elem->tagName;
    my $attr = $elem->attributes;
    my $text = $elem->innerText;
    # Rebuild the opening tag from the tag name and its attributes
    print "<$tagname";
    foreach my $key ( sort keys %$attr ) {
        print " $key=\"$attr->{$key}\"";
    }
    # Elements without text are written as self-closing tags
    if ( $text eq "" ) {
        print " />\n";
    } else {
        print ">$text</$tagname>\n";
    }
}
This works pretty well for simple element scanning. For more complex tasks (e.g. mixed inner HTML content) I would prefer to work with HTML::Parser.

Basic information extraction from html?

I have a project where users submit many links to external sites and I need to parse the HTML of these submitted links and extract basic information from the page in the same way that Digg and Facebook do when a link is submitted.
I want to retrieve:
main title or heading (could be in title, h1, h2, p etc...)
intro or description text (could be in div, p etc...)
main image
My main problem is that there seem to be too many options to explore here and I'm getting a little confused, to say the least. Many of the solutions I have looked at so far seem to be either inadequate or huge overkill.
You would pick a server side language to do this.
For example, with PHP, you could use get_meta_tags() for the meta tags...
$meta = get_meta_tags('http://google.com');
And you could use DOMDocument to get the title element (some may argue if needing the title element, you may as well use DOMDocument to get the meta tags as well).
$dom = new DOMDocument;
// loadHTML() expects an HTML string; loadHTMLFile() fetches a file or URL
$dom->loadHTMLFile('http://google.com');
$title = $dom
    ->getElementsByTagName('head')
    ->item(0)
    ->getElementsByTagName('title')
    ->item(0)
    ->nodeValue;
As for getting main image, that would require some sort of extraction of what may be considered the main image. You could get all img elements and look for the largest one on the page.
$dom = new DOMDocument;
// loadHTML() expects an HTML string; loadHTMLFile() fetches a file or URL
$dom->loadHTMLFile('http://google.com');
$imgs = $dom
    ->getElementsByTagName('body')
    ->item(0)
    ->getElementsByTagName('img');
$imageSizes = array();
foreach ($imgs as $img) {
    if ( ! $img->hasAttribute('src')) {
        continue;
    }
    $src = $img->getAttribute('src');
    // May need to prepend relative path
    // Assuming Apache, http and port 80
    $relativePath = 'http://' . rtrim($_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'], '/') . '/';
    if (substr($src, 0, strlen($relativePath)) !== $relativePath) {
        $src = $relativePath . $src;
    }
    // getimagesize() returns false when the image cannot be read
    $imageInfo = getimagesize($src);
    if ( ! $imageInfo) {
        continue;
    }
    list($width, $height) = $imageInfo;
    // Key each image by its pixel area
    $imageSizes[$width * $height] = $img;
}
// Sort by area so that the largest image ends up last
ksort($imageSizes);
$mainImage = end($imageSizes);
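The title and meta description lookups described above are not specific to PHP. As a rough sketch of the same idea using only Python's standard library (the names PageSummary and summarize are invented for this example, and fetching the page is left out on purpose), one could do something like:

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Remember the first <title> text and the first meta description."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            if self.description is None:
                self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self.title = "".join(self._buf).strip()
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._buf.append(data)


def summarize(html):
    """Return the title and meta description found in an HTML string."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description}
```

Either value comes back as None when the page does not provide it, which is the point where a fallback to the first h1 or p, as mentioned in the question, would have to take over.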

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!
I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
    #put all opened tags into an array
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result );
    $openedtags = $result[1];
    #put all closed tags into an array
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
    $closedtags = $result[1];
    $len_opened = count ( $openedtags );
    # all tags are closed
    if( count ( $closedtags ) == $len_opened )
    {
        return $html;
    }
    $openedtags = array_reverse ( $openedtags );
    # close tags
    for( $i = 0; $i < $len_opened; $i++ )
    {
        if ( !in_array ( $openedtags[$i], $closedtags ) )
        {
            $html .= "</" . $openedtags[$i] . ">";
        }
        else
        {
            unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
        }
    }
    return $html;
}
// close opened html tags
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion or another; PHP, for example.
This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.
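One way to do that automatic clean-up is to track which tags are still open and append the missing closers. The following is a minimal sketch in Python's standard library, not a full sanitizer: close_tags, OpenTagTracker and the VOID_TAGS set are names made up for this illustration, and it only repairs unclosed tags, it does not restrict which tags a user may enter.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative, not exhaustive)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class OpenTagTracker(HTMLParser):
    """Keep a list of the tags that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Drop the most recent matching opener; stray closers are ignored
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def close_tags(fragment):
    """Append closing tags for everything the fragment left open."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    closers = "".join("</%s>" % tag for tag in reversed(tracker.open_tags))
    return fragment + closers
```

With the example from the question, close_tags("Hi, my name is <b>John") appends the missing </b>, so the bold formatting ends with the user's text.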
You can parse the data entered by the user; that's what an XML parser does. You may need to replace the special HTML or XML characters like '<', '>', '&', etc. with the entities '&lt;', '&gt;', '&amp;', etc.
In this way you can achieve whatever you want.
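As a small illustration of that replacement (in Python here, purely as a sketch; escape_user_text is a hypothetical helper name), the standard library already has a function for it. Note that this shows the user's tags as literal text instead of rendering them, so it only fits if losing the formatting is acceptable.

```python
import html


def escape_user_text(text):
    """Replace &, <, > and quotes with their HTML entities."""
    return html.escape(text)


print(escape_user_text("Hi, my name is <b>John"))
```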
There is a way to do this using HTML and javascript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
<div>Hi, my name is <b>John</div>
</noscript>
... and then add javascript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
    $(this).replaceWith(this.innerText);
});
Note that users without javascript will still experience the "bleeding" that you are trying to avoid.

PHP: Inject iframe right after body tag

I would like to place an iframe right below the start of the body tag. This has some issues, since the body tag can have various attributes and odd whitespace. My guess is that this will require regular expressions to do correctly.
EDIT: This solution has to work with php 4 & performance is a concern of mine. It's for this http://drupal.org/node/586210#comment-2567398
You can use DOMDocument and friends. Assuming you have a variable $html containing the existing HTML document as a string, the basic code is:
$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
$iframe = $doc->createElement('iframe');
$body->insertBefore($iframe, $body->firstChild);
To retrieve the modified HTML text, use
$html = $doc->saveHTML();
EDIT: For PHP4, you can try DOM XML.
Both PHP 4 and PHP 5 should be happy with preg_split():
/* split the string contained in $html in three parts:
* everything before the <body> tag
* the body tag with any attributes in it
* everything following the body tag
*/
$matches = preg_split('/(<body[^>]*>)/i', $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE); // [^>]* also matches a body tag that spans several lines
/* assemble the HTML output back with the iframe code in it */
$injectedHTML = $matches[0] . $matches[1] . $iframeCode . $matches[2];
Using regular expressions brings up performance concerns... This is what I'm going for
<?php
$html = file_get_contents('http://www.yahoo.com/');
$start = stripos($html, '<body');
$end = stripos($html, '>', $start);
$body = substr_replace($html, '<IFRAME INSERT>', $end+1, 0);
echo htmlentities($body);
?>
Thoughts?
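For comparison, here is the same offset-based idea as a rough Python sketch (inject_after_body is a made-up name, and the code is an illustration of the technique, not something PHP 4 can run). It finds the start of the body tag, looks for the next '>' and splices the snippet in right after it. It assumes no attribute value inside the body tag contains a literal '>', and it returns the document unchanged when there is no body tag.

```python
import re

# Only the start of the tag is matched with a (cheap) regex, so the search is
# case-insensitive; the end of the tag is located with a plain string search
BODY_START = re.compile(r"<body\b", re.IGNORECASE)


def inject_after_body(html, snippet):
    """Insert snippet immediately after the opening <body ...> tag."""
    match = BODY_START.search(html)
    if match is None:
        return html
    end = html.find(">", match.end())
    if end == -1:
        return html
    return html[:end + 1] + snippet + html[end + 1:]
```

Guarding against a missing body tag matters for the PHP version as well: stripos() returns false when nothing is found, and the substr_replace() call would then insert the iframe at a wrong offset.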