Basic information extraction from HTML?

I have a project where users submit many links to external sites and I need to parse the HTML of these submitted links and extract basic information from the page in the same way that Digg and Facebook do when a link is submitted.
I want to retrieve:
main title or heading (could be in title, h1, h2, p etc...)
intro or description text (could be in div, p etc...)
main image
My main problem is that there seem to be too many options to explore here and I'm getting a little confused, to say the least. Many of the solutions I have looked at so far seem to be either inadequate or huge overkill.

You would pick a server side language to do this.
For example, with PHP, you could use get_meta_tags() for the meta tags...
$meta = get_meta_tags('http://google.com');
And you could use DOMDocument to get the title element (some may argue if needing the title element, you may as well use DOMDocument to get the meta tags as well).
$dom = new DOMDocument;
$dom->loadHTMLFile('http://google.com'); // loadHTML() expects markup, loadHTMLFile() takes a path or URL
$title = $dom
->getElementsByTagName('head')
->item(0)
->getElementsByTagName('title')
->item(0)
->nodeValue;
As for getting main image, that would require some sort of extraction of what may be considered the main image. You could get all img elements and look for the largest one on the page.
$pageUrl = 'http://google.com';
$dom = new DOMDocument;
@$dom->loadHTMLFile($pageUrl); // @ suppresses warnings from malformed markup
$mainImage = null;
$maxArea = 0;
foreach ($dom->getElementsByTagName('img') as $img) {
    if ( ! $img->hasAttribute('src')) {
        continue;
    }
    $src = $img->getAttribute('src');
    // Relative URLs must be resolved against the page that was fetched.
    // Naive approach: assumes http and that $pageUrl is the site root.
    if (strpos($src, '//') === 0) {
        $src = 'http:' . $src;
    } elseif ( ! preg_match('#^https?://#i', $src)) {
        $src = rtrim($pageUrl, '/') . '/' . ltrim($src, '/');
    }
    // getimagesize() fetches the image and returns false on failure
    $imageInfo = @getimagesize($src);
    if ( ! $imageInfo) {
        continue;
    }
    list($width, $height) = $imageInfo;
    // Keep the image with the largest area seen so far
    if ($width * $height > $maxArea) {
        $maxArea = $width * $height;
        $mainImage = $img;
    }
}
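Relative src values have to be resolved against the URL of the page that was fetched, which is the step most likely to go wrong here. A minimal JavaScript sketch of just that step, assuming a runtime with the standard URL class (browsers, Node.js); resolveImageUrls and the example.com addresses are hypothetical and only for illustration:

```javascript
// Hypothetical helper: resolve every img src against the URL of the page it came from.
function resolveImageUrls(pageUrl, srcs) {
  const resolved = [];
  for (const src of srcs) {
    try {
      // new URL(relative, base) handles "img/x", "/x", "//host/x" and absolute URLs
      resolved.push(new URL(src, pageUrl).href);
    } catch (err) {
      // Skip values that cannot be parsed as a URL at all
    }
  }
  return resolved;
}

const urls = resolveImageUrls("http://example.com/news/story.html", [
  "img/photo.jpg",
  "/logo.png",
  "//cdn.example.com/banner.png",
  "http://other.example.org/pic.gif",
]);
console.log(urls);
```

The resolved URLs can then be passed to whatever measures the images; the sketch deliberately leaves the size lookup out.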

Related

as_html in HTML::TagParser

I'm working in Perl.
I would like to ask if HTML::TagParser has something like
$value->as_html()
from HTML::TreeBuilder.
I extracted the tag I needed with HTML::TagParser, but now the only option is:
$value->innerText();
which gives me only the text, without the HTML tags.
Or can I somehow combine the result from HTML::TagParser with HTML::TreeBuilder and get my HTML tags that way?
The HTML::TagParser does not only read the element content. It also keeps the element name and the attribute key/value pairs for each selected element. Therefore you can easily reproduce the complete HTML code of the element.
Actually, the HTML::TagParser CPAN page contains an example for this: The following code extracts all <a>nchor tags from a web page and reproduces them into an HTML fragment listing precisely these tags.
use HTML::TagParser;
my $url = 'http://www.kawa.net/xp/index-e.html';
my $html = HTML::TagParser->new( $url );
my @list = $html->getElementsByTagName( "a" );
foreach my $elem ( @list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
print "<$tagname";
foreach my $key ( sort keys %$attr ) {
print " $key=\"$attr->{$key}\"";
}
if ( $text eq "" ) {
print " />\n";
} else {
print ">$text</$tagname>\n";
}
}
This works pretty well for simple element scanning. For more complex tasks (e.g. mixed inner HTML content) I would prefer to work with HTML::Parser.

MediaWiki: changing the label of a category at the bottom of the page

In MediaWiki, is it possible to change the label of a 'Category' at the bottom of an article?
For example for the following article:
=Paris=
blablablablablabla
[[Category:place_id]]
I'd like to see something more verbose like (the example below doesn't work):
=Paris=
blablablablablabla
[[Category:place_id|France]]
Note: I don't want to use a 'redirect' and I want to keep my strange ids because they are linked to an external database.
I do not think MediaWiki supports this feature.
However, how about using:
[[Category:France]]
in your page, and placing that category in the category named with your id? France would then just be a subcategory of "place_id", and you could use more terms, all linked to the parent category. For this, you just need to edit the category page for "France", inserting:
[[Category:place_id]]
An alternative would be to put your page in both categories, but in this case, the id would still be displayed:
[[Category:place_id]]
[[Category:France]]
You could do this with an OutputPageMakeCategoryLinks hook. Alas, the interface for that hook seems to be a bit inconvenient: as far as I can tell, it's pretty much only good for replacing the standard category link generation code entirely. Still, you could do that if you want:
function myOutputPageMakeCategoryLinks( &$out, $categories, &$links ) {
foreach ( $categories as $category => $type ) {
$title = Title::makeTitleSafe( NS_CATEGORY, $category );
$text = $title->getText();
if ( $text == 'Place id' ) {
// set $text to something else
}
$links[$type][] = Linker::link( $title, htmlspecialchars( $text ) );
}
return false; // skip default link generation
}
$wgHooks['OutputPageMakeCategoryLinks'][] = 'myOutputPageMakeCategoryLinks';
(The code above is based on the default category link generation code in OutputPage.php, somewhat simplified; I assume you're not using language variant conversion on your wiki, so I removed the parts that deal with that. Note that this code is untested! Use at your own risk.)

How to stop tags within a div from affecting elements outside a div?

So, on my website, I have user generated content. I have a wysiwyg editor, but there is a view source part. I have a few approved tags.
But it occurred to me: what if a user just puts in an opening tag without closing it? Then it affects the rest of the page.
How can I get around this?
I actually wanted to tell you to look at how WordPress does this, until I tested it and found out that WordPress does not care: I can easily break my page by opening divs and not closing them.
Anyway, I found this:
/**
 * close all open xhtml tags at the end of the string
 *
 * @param string $html
 * @return string
 * @author Milian <mail@mili.de>
 */
function closetags($html) {
    #put all opened tags into an array
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    #put all closed tags into an array
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    # all tags are closed
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    # close tags
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)){
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}
This should be what you are looking for.
This is a very complex topic in general and there's no easy shortcut. Before you start reinventing the wheel, use a library that has been designed to do exactly that. Supposedly HTML Purifier is one of the few libraries, if not the only one, that get it right.
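The regex-based closetags() above only compares lists of opened and closed tag names, so it does not track nesting precisely and would also append a closing tag for void elements such as <br>. A stack gives more predictable results. Below is a rough JavaScript sketch of that idea, not a sanitizer and not a replacement for HTML Purifier; closeOpenTags and VOID_TAGS are hypothetical names, and the tokenizing regex assumes reasonably simple markup (no script blocks, no ">" inside attribute values):

```javascript
// Elements that never take a closing tag in HTML
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: append closing tags for elements still open at the end of a fragment.
function closeOpenTags(html) {
  const stack = [];
  // Group 1 is "/" for closing tags, group 2 the tag name, group 3 "/" for self-closing tags
  const tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/gi;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === "/";
    if (VOID_TAGS.has(name) || isSelfClosing) {
      continue;
    }
    if (!isClosing) {
      stack.push(name);
    } else {
      const index = stack.lastIndexOf(name);
      // Stray closing tags are ignored; otherwise drop the element and anything opened inside it
      if (index !== -1) {
        stack.length = index;
      }
    }
  }
  let result = html;
  // Close whatever is still open, innermost element first
  while (stack.length > 0) {
    result += "</" + stack.pop() + ">";
  }
  return result;
}

console.log(closeOpenTags("Hi, my name is <b>John"));
console.log(closeOpenTags("<div><p>one<br>two"));
```

Closing innermost-first keeps the appended tags properly nested, which is the main thing the list-comparison approach cannot guarantee.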

how to find all <p> tags under heading

I have to extract data from this link: http://bit.ly/l1rF5x
What I want to do is extract all <p> tags that come under the <a> tag with the attribute rel="bookmark". My only requirement is that only the <p> tags that come under this heading are parsed, and the rest are left as they are. For example, on the page I linked, all <p> tags that come under the heading "IIFT question paper 2006" should be parsed.
Help please.
You can try using the following :
$(function(){
var results= '';
$('a[rel="bookmark"] p').each(function(i,e){
results += $(e).html() + "\n";
});
alert(results);
});
Variable results will be alerted with the required content.
Example : http://jsfiddle.net/eGmWw/1/
Since you haven't provided any information about the language / environment you want to use to extract this information, I've gone ahead and hacked something together with jQuery.
(Updated) You can see it in action here: JS Fiddle.
If you wanted to use PHP, I recommend simplehtmldom
Here is an example using simplehtmldom:
$url = 'http://school-listing.mba4india.com/page/7/';
$html = file_get_html($url);
$data = array();
// Find all anchors with the desired rel attribute
foreach ($html->find('a[rel="bookmark"]') as $a) {
$h4 = $a->parent(); // Get the anchors parent (in this case an h4)
// We're assuming the next sibling is a p tag here - should test for this here
$p = $h4->next_sibling();
$content = '';
// Iterate over all following p tags, until we run out of siblings or find one
// that isn't a p tag
while ($p) {
$content .= (string) $p;
if ($p->next_sibling() && $p->next_sibling()->tag == 'p') {
$p = $p->next_sibling();
} else {
break;
}
}
$data[] = array('h4' => $h4, 'content' => $content);
}
$br = '<br/>';
foreach ($data as $datum) {
echo $datum['h4'] . $br . $datum['content'];
echo $br.$br;
}
Refer to Simplehtmldom Documentation for more!
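The sibling walk used in the PHP example can also be done with the plain DOM API. The following JavaScript sketch assumes, like the PHP code, that the bookmark link sits inside a heading and that the paragraphs are the heading's following siblings; collectFollowingParagraphs is a hypothetical helper name. Unlike the PHP loop, it checks the first sibling too, so nothing is collected if the heading is not followed by a <p>:

```javascript
// Hypothetical helper: collect the run of <p> elements that directly follows a heading.
// Works with DOM elements, or any objects exposing tagName and nextElementSibling.
function collectFollowingParagraphs(heading) {
  const paragraphs = [];
  let node = heading.nextElementSibling;
  // Stop at the first sibling that is not a <p>, or when siblings run out
  while (node && node.tagName.toLowerCase() === "p") {
    paragraphs.push(node);
    node = node.nextElementSibling;
  }
  return paragraphs;
}

// Browser usage (assumes the markup structure described above):
// document.querySelectorAll('a[rel="bookmark"]').forEach(function (a) {
//   const texts = collectFollowingParagraphs(a.parentElement).map(function (p) {
//     return p.textContent;
//   });
//   console.log(texts);
// });
```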

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!
I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
#put all opened tags into an array
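// (?<!/) is meant to skip self-closing tags such as <br />; a lookahead in this position cannot exclude them
preg_match_all ( "#<([a-z]+)( .*)?(?<!/)>#iU", $html, $result );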
$openedtags = $result[1];
#put all closed tags into an array
preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
$closedtags = $result[1];
$len_opened = count ( $openedtags );
# all tags are closed
if( count ( $closedtags ) == $len_opened )
{
return $html;
}
$openedtags = array_reverse ( $openedtags );
# close tags
for( $i = 0; $i < $len_opened; $i++ )
{
if ( !in_array ( $openedtags[$i], $closedtags ) )
{
$html .= "</" . $openedtags[$i] . ">";
}
else
{
unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
}
}
return $html;
}
// close opened html tags
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion or another, here for example PHP.
This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.
You can parse the data entered by the user. That's what an XML parser does. You may need to escape or replace the standard HTML or XML symbols like '<', '>', '/', '&', etc. with '&lt;', '&gt;', etc.
In this way you can achieve whatever you want.
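As a rough illustration of the escaping approach, here is a small JavaScript sketch; escapeHtml and HTML_ENTITIES are hypothetical names, and the entity list is a common minimal set, not an exhaustive one. Note that escaping everything also neutralizes the tags users are allowed to use, so on its own it only fits fields where no markup should survive:

```javascript
// Characters with special meaning in HTML text and attribute values
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replace each special character with its entity in a single pass
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, function (ch) {
    return HTML_ENTITIES[ch];
  });
}

console.log(escapeHtml("Hi, my name is <b>John"));
// prints: Hi, my name is &lt;b&gt;John
```

Doing the replacement in one pass avoids double-escaping the ampersands that the other entities introduce.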
There is a way to do this using HTML and javascript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
<div>Hi, my name is <b>John</div>
</noscript>
... and then add javascript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
$(this).replaceWith(this.innerText);
});
Note that users without javascript will still experience the "bleeding" that you are trying to avoid.