How to stop tags within a div from affecting elements outside a div? - html

On my website I have user-generated content. I have a WYSIWYG editor, but it also has a view-source mode, and I allow a few approved tags.
But it occurred to me: what if a user puts in a tag without closing it? Then it affects the rest of the page.
How can I get around this?

I actually wanted to tell you to look at how WordPress does this, until I tested it and found out that WordPress does not care: I can easily break my page by opening divs and not closing them.
Anyway, I found this:
/**
 * Close all open XHTML tags at the end of the string.
 *
 * @param string $html
 * @return string
 * @author Milian <mail@mili.de>
 */
function closetags($html) {
    # put all opened tags into an array
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    # put all closed tags into an array
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    # all tags are closed
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    # close tags
    for ($i = 0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</' . $openedtags[$i] . '>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}
This should be what you are looking for.
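The function above counts opened and closed tags, so it does not track nesting and will also append closers for void elements such as br. As a rough sketch of an alternative (close_open_tags() and its void-element list are illustrative names, not part of the quoted code), a stack can record tags in document order and append the missing closers innermost-first:

```php
<?php
// Sketch only: close_open_tags() is a hypothetical helper, not the quoted author's function.
function close_open_tags($html)
{
    // Elements that never take a closing tag.
    $void = array('area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                  'input', 'link', 'meta', 'source', 'track', 'wbr');

    // Match every opening or closing tag in document order.
    // Group 1 is "/" for a closing tag, group 3 is "/" for a self-closing tag.
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>#i', $html, $tags, PREG_SET_ORDER);

    $stack = array();
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true) || $tag[3] === '/') {
            continue; // void or self-closing: nothing to balance
        }
        if ($tag[1] === '') {
            $stack[] = $name; // opening tag
            continue;
        }
        // Closing tag: drop the most recent matching opener and anything opened after it.
        $pos = array_search($name, array_reverse($stack, true), true);
        if ($pos !== false) {
            $stack = array_slice($stack, 0, $pos);
        }
    }

    // Append closers for whatever is still open, innermost first.
    while ($stack) {
        $html .= '</' . array_pop($stack) . '>';
    }
    return $html;
}
```

This only appends closers at the end of the string; it does not repair tags that are closed in the wrong order, and it is still regex-based, so it should not be treated as a sanitizer.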

This is a very complex topic in general and there's no easy shortcut. Before you start reinventing the wheel, use a library that has been designed to do exactly that. Supposedly HTML Purifier is one of the few libraries, if not the only one, that gets it right.

Related

What's the best way to scrape specific content from multiple HTML files?

I have quite a few HTML files of webpages with many pieces of information. I am trying to extract some of the content and place it into an XML file or possibly an Excel spreadsheet. All the webpages are quite similar in design, and the information is placed in the same locations across all pages. Does anybody know of a way to do this?
There are many scraper libraries that can help you extract data from HTML pages.
Web scraping and crawling is not always so straightforward, so it depends on what you’re trying to achieve. Different products, SDK, libraries, etc., focus on different aspects of scraping or crawling. Here are a few you can check out:
Apify - (formerly Apifier) is a cloud-based web scraper that extracts structured data from any website using a few simple lines of JavaScript.
Diffbot - which extracts data from web pages automatically and returns structured JSON.
Espion - a headless browser that enables you to inject JavaScript code directly into your target web pages.
Also, if you know Node.js, node-osmosis is a really cool and easy-to-use library.
I strongly recommend this library:
http://sourceforge.net/projects/simplehtmldom/
/**
* Website: http://sourceforge.net/projects/simplehtmldom/
* Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
* Contributions by:
* Yousuke Kumakura (Attribute filters)
* Vadim Voituk (Negative indexes supports of "find" method)
* Antcs (Constructor with automatically load contents either text or file/url)
*
* all affected sections have comments starting with "PaperG"
*
* Paperg - Added case insensitive testing of the value of the selector.
* Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
* This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,
* it will almost always be smaller by some amount.
* We use this to determine how far into the file the tag in question is. This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.
* but for most purposes, it's a really good estimation.
* Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
* Allow the user to tell us how much they trust the html.
* Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node.
* This allows for us to find tags based on the text they contain.
* Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
* Paperg: added parse_charset so that we know about the character set of the source document.
* NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the
* last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
*
* Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that.
* PaperG (John Schlick) Added get_display_size for "IMG" tags.
*
* Licensed under The MIT License
* Redistributions of files must retain the above copyright notice.
*
* @author S.C. Chen <me578022@gmail.com>
* @author John Schlick
* @author Rus Carroll
* @version 1.5 ($Rev: 196 $)
* @package PlaceLocalInclude
* @subpackage simple_html_dom
*/
/**
* All of the Defines for the classes below.
* @author S.C. Chen <me578022@gmail.com>
*/
Here's an example:
$html = file_get_html($ad_bachecubano_url);
// Proceed to capture the text
$anuncio['header'] = $html->find('.headingText', 0)->plaintext;
$anuncio['body'] = $html->find('.showAdText', 0)->plaintext;
$precio = $html->find('#lineBlock');
foreach ($precio as $possibleprice) {
    $item = $possibleprice->find('.headingText2', 0)->plaintext;
    $precio = 0;
    if ($item == "Precio: ") {
        $precio = $possibleprice->find('.normalText', 0)->plaintext;
        $anuncio['price'] = $this->getFinalPrice($precio);
    } else {
        continue;
    }
}
$contactbox = $html->find('#contact');
foreach ($contactbox as $contact) {
    $boxes = $contact->find('#lineBlock');
    foreach ($boxes as $box) {
        $key = $box->find('.headingText2', 0)->plaintext;
        $value = $box->find('.normalText', 0)->plaintext;
        if ($key == "Nombre: ") {
            $anuncio['nombre'] = $value;
        }
        if ($key == "Teléfono: ") {
            $anuncio['phone'] = $value;
        }
    }
}
$anuncio['email'] = scrapeemail($anuncio['body'])[0][0];
if (!isset($anuncio['email']) || $anuncio['email'] == '') {
    $anuncio['email'] = "";
}
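If adding a third-party library is not an option, much the same kind of lookup can be sketched with PHP's bundled DOMDocument and DOMXPath classes. The markup, the class names reused from the example above, and the first_text() helper are all assumptions made for illustration:

```php
<?php
// Hypothetical markup standing in for one of the saved pages.
$page = '<html><body>'
      . '<h1 class="headingText">Bike for sale</h1>'
      . '<div class="showAdText big">Almost new, barely used.</div>'
      . '</body></html>';

libxml_use_internal_errors(true); // scraped HTML is rarely valid; keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Return the trimmed text of the first element carrying the given class, or null.
// first_text() is an illustrative helper, not part of any library.
function first_text(DOMXPath $xpath, $class)
{
    $query = sprintf('//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    $nodes = $xpath->query($query);
    return ($nodes && $nodes->length > 0) ? trim($nodes->item(0)->textContent) : null;
}

$ad = array(
    'header' => first_text($xpath, 'headingText'),
    'body'   => first_text($xpath, 'showAdText'),
);
```

Because the pages share one layout, the same set of queries can be run over every file in a loop and the resulting arrays written out as rows.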

Trim long html for preview without breaking html tags

I ran into this problem when working on a blog post preview list.
I need to shorten the content without breaking any HTML tags by leaving them open.
I have heard that regex is not a good option. I am looking for something simple that works.
I appreciate your help in advance, as always (SO has ended up being a very nice place to come to with problems like this :-)
WordPress has a built-in function that generates an excerpt from the actual blog post.
You didn't specify which language you want to use for the trim function, so here is the WordPress version. It can easily be modified and repurposed for use outside of WordPress if need be.
wp_trim_words() function reference
/**
 * Generates an excerpt from the content, if needed.
 *
 * The excerpt word amount will be 55 words and if the amount is greater than
 * that, then the string ' […]' will be appended to the excerpt. If the string
 * is less than 55 words, then the content will be returned as is.
 *
 * The 55 word limit can be modified by plugins/themes using the excerpt_length filter
 * The ' […]' string can be modified by plugins/themes using the excerpt_more filter
 *
 * @since 1.5.0
 *
 * @param string $text Optional. The excerpt. If set to empty, an excerpt is generated.
 * @return string The excerpt.
 */
function wp_trim_excerpt($text = '') {
    $raw_excerpt = $text;
    if ( '' == $text ) {
        $text = get_the_content('');
        $text = strip_shortcodes( $text );
        $text = apply_filters('the_content', $text);
        $text = str_replace(']]>', ']]&gt;', $text);
        $excerpt_length = apply_filters('excerpt_length', 55);
        $excerpt_more = apply_filters('excerpt_more', ' ' . '[…]');
        $text = wp_trim_words( $text, $excerpt_length, $excerpt_more );
    }
    return apply_filters('wp_trim_excerpt', $text, $raw_excerpt);
}
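Outside WordPress, the core of that approach can be approximated in a few lines: strip the markup first, then cut on word boundaries, so no tag can be left open in the preview. The sketch below is a simplified stand-in, not WordPress's implementation; trim_words() and its defaults are names chosen here for illustration:

```php
<?php
// Simplified stand-in for a word-based excerpt; trim_words() is a hypothetical helper.
function trim_words($html, $limit = 55, $more = ' […]')
{
    // Remove all tags so the excerpt cannot contain an unclosed element.
    $text  = trim(strip_tags($html));
    // Split on any run of whitespace, dropping empty pieces.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    if (count($words) <= $limit) {
        return implode(' ', $words); // short enough: return the plain text unchanged
    }
    return implode(' ', array_slice($words, 0, $limit)) . $more;
}
```

The trade-off is that the preview loses its formatting. Keeping formatting while truncating requires walking a parsed DOM rather than a string.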

Basic information extraction from html?

I have a project where users submit many links to external sites and I need to parse the HTML of these submitted links and extract basic information from the page in the same way that Digg and Facebook do when a link is submitted.
I want to retrieve:
main title or heading (could be in title, h1, h2, p etc...)
intro or description text (could be in div, p etc...)
main image
My main problem is that there seem to be too many options to explore here and I'm getting a little confused, to say the least. Many of the solutions I have looked at so far seem to be either inadequate or huge overkill.
You would pick a server side language to do this.
For example, with PHP, you could use get_meta_tags() for the meta tags...
$meta = get_meta_tags('http://google.com');
And you could use DOMDocument to get the title element (some may argue if needing the title element, you may as well use DOMDocument to get the meta tags as well).
$dom = new DOMDocument;
// loadHTMLFile() fetches the URL; loadHTML() expects the markup itself
$dom->loadHTMLFile('http://google.com');
$title = $dom
    ->getElementsByTagName('head')
    ->item(0)
    ->getElementsByTagName('title')
    ->item(0)
    ->nodeValue;
As for getting main image, that would require some sort of extraction of what may be considered the main image. You could get all img elements and look for the largest one on the page.
$dom = new DOMDocument;
// loadHTMLFile() fetches the URL; loadHTML() expects the markup itself
$dom->loadHTMLFile('http://google.com');
$imgs = $dom
    ->getElementsByTagName('body')
    ->item(0)
    ->getElementsByTagName('img');
$imageSizes = array();
foreach ($imgs as $img) {
    if ( ! $img->hasAttribute('src')) {
        continue;
    }
    $src = $img->getAttribute('src');
    // May need to prepend relative path
    // Assuming Apache, http and port 80
    $relativePath = rtrim($_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'], '/') . '/';
    if (substr($src, 0, strlen($relativePath)) !== $relativePath) {
        $src = $relativePath . $src;
    }
    // getimagesize() returns false if the image cannot be read
    $imageInfo = getimagesize($src);
    if ( ! $imageInfo) {
        continue;
    }
    list($width, $height) = $imageInfo;
    $imageSizes[$width * $height] = $img;
}
// Sort by area so that end() returns the largest image
ksort($imageSizes);
$mainImage = end($imageSizes);
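Many pages already declare a title, description and preview image in their head, which can save the guesswork about which img is the main one. A minimal sketch, assuming the page provides a description meta tag and an Open Graph og:image tag (the sample markup and the $info array are made up for illustration):

```php
<?php
// Hypothetical page source; in practice this string would come from the submitted URL.
$page = '<html><head><title>Example page</title>'
      . '<meta name="description" content="A short summary.">'
      . '<meta property="og:image" content="http://example.com/cover.png">'
      . '</head><body><h1>Heading</h1></body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();

$info = array('title' => null, 'description' => null, 'image' => null);

$titles = $dom->getElementsByTagName('title');
if ($titles->length > 0) {
    $info['title'] = trim($titles->item(0)->textContent);
}

foreach ($dom->getElementsByTagName('meta') as $meta) {
    // <meta name="..."> and Open Graph's <meta property="..."> both carry a content attribute.
    $key     = strtolower($meta->getAttribute('name') ?: $meta->getAttribute('property'));
    $content = $meta->getAttribute('content');
    if ($key === 'description' && $info['description'] === null) {
        $info['description'] = $content;
    } elseif ($key === 'og:image' && $info['image'] === null) {
        $info['image'] = $content;
    }
}
```

Any field that is still null afterwards would need a fallback, such as the first h1 for the title or the largest-image search shown above.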

Ignoring unclosed tags from another <div>?

I have a website where members can input text using a limited subset of HTML. When a page is displayed that contains a user's text, if they have any unclosed tags, the formatting "bleeds" across into the next area. For example, if the user entered:
Hi, my name is <b>John
Then, the rest of the page will be bold.
Ideally, there'd be something I could do that would be this simple:
<div contained>Hi, my name is <b>John</div>
And no tags could bleed out of that div. Assuming there isn't anything this simple, how would I accomplish a similar effect? Or, is there something this easy?
Importantly, I do not want to validate the user's input and return an error if they have unclosed tags, since I want to provide the "easiest" user interface possible for my users.
Thanks!
I have a solution for PHP:
<?php
// close opened html tags
function closetags ( $html )
{
    # put all opened tags into an array
    # (?<!/) skips self-closing tags such as <br />
    preg_match_all ( "#<([a-z]+)( .*)?(?<!/)>#iU", $html, $result );
    $openedtags = $result[1];
    # put all closed tags into an array
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
    $closedtags = $result[1];
    $len_opened = count ( $openedtags );
    # all tags are closed
    if ( count ( $closedtags ) == $len_opened )
    {
        return $html;
    }
    $openedtags = array_reverse ( $openedtags );
    # close tags
    for ( $i = 0; $i < $len_opened; $i++ )
    {
        if ( !in_array ( $openedtags[$i], $closedtags ) )
        {
            $html .= "</" . $openedtags[$i] . ">";
        }
        else
        {
            unset ( $closedtags[array_search ( $openedtags[$i], $closedtags )] );
        }
    }
    return $html;
}
// close opened html tags
?>
You can use this function like this:
<?php echo closetags("your content <p>test test"); ?>
You can put the HTML snippet through Tidy, which will do its best to fix it. Many languages include it in some fashion; PHP, for example, has a tidy extension.
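Where Tidy is not installed, a similar repair can be sketched with PHP's bundled DOMDocument, whose HTML parser closes open elements when their container ends. The balance_fragment() helper below is an assumption for illustration, not a Tidy API, and it assumes ASCII or otherwise correctly declared input:

```php
<?php
// Sketch: let the HTML parser close dangling tags. balance_fragment() is a hypothetical helper.
function balance_fragment($fragment)
{
    libxml_use_internal_errors(true); // malformed input is expected; suppress warnings
    $dom = new DOMDocument();
    // The wrapper div gives the parser one known container to close everything inside.
    $dom->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child); // serialize each child without the wrapper itself
    }
    return $out;
}

echo balance_fragment('Hi, my name is <b>John');
```

A stray closing div in the user's text would end the wrapper early, so this is a repair step to combine with tag whitelisting, not a replacement for it.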
This can't be done.
Don't let users invalidate your HTML.
If you don't want to let users fix their errors, then try to clean it up automatically for them.
You can parse the data entered by the user; that's what an XML parser does. You may need to parse or replace the standard HTML or XML symbols like '<', '>', '/', '&', etc. with entities such as '&lt;', '&gt;', etc.
In this way you can achieve whatever you want.
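In PHP the replacement described here is what htmlspecialchars() does. A minimal sketch, with the input string taken from the question; note that it escapes every tag, so approved tags would be shown as text as well:

```php
<?php
// User input containing an unclosed tag.
$input = 'Hi, my name is <b>John';

// Convert <, >, & and both quote characters to entities so the browser shows them as text.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// The tag can no longer bleed out of the div because it is no longer a tag.
echo '<div>' . $safe . '</div>';
```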
There is a way to do this using HTML and JavaScript. I wouldn't recommend this method for public-facing websites; you should clean your data before it reaches the browser. But it might be useful in other situations.
The idea is to put the potentially invalid content into a noscript tag, like this:
<noscript class="contained">
<div>Hi, my name is <b>John</div>
</noscript>
... and then add JavaScript that will load it into the DOM. Using jQuery (but probably not necessary):
$("noscript.contained").each(function () {
    $(this).replaceWith(this.innerText);
});
Note that users without JavaScript will still experience the "bleeding" that you are trying to avoid.

PHP: Inject iframe right after body tag

I would like to place an iframe right below the start of the body tag. This has some issues, since the body tag can have various attributes and odd whitespace. My guess is that this will require regular expressions to do correctly.
EDIT: This solution has to work with PHP 4, and performance is a concern of mine. It's for this http://drupal.org/node/586210#comment-2567398
You can use DOMDocument and friends. Assuming you have a variable $html containing the existing HTML document as a string, the basic code is:
$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
$iframe = $doc->createElement('iframe');
$body->insertBefore($iframe, $body->firstChild);
To retrieve the modified HTML text, use
$html = $doc->saveHTML();
EDIT: For PHP4, you can try DOM XML.
Both PHP 4 and PHP 5 should be happy with preg_split():
/* split the string contained in $html in three parts:
* everything before the <body> tag
* the body tag with any attributes in it
* everything following the body tag
*/
$matches = preg_split('/(<body.*?>)/i', $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
/* assemble the HTML output back with the iframe code in it */
$injectedHTML = $matches[0] . $matches[1] . $iframeCode . $matches[2];
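The split-and-reassemble code above assumes the document has text before the body tag and that the tag is present at all. A more defensive sketch of the same idea locates the tag with preg_match() and inserts at the reported offset; inject_after_body() is a name made up for this example, and it has only been reasoned about against current PHP, not PHP 4:

```php
<?php
// Sketch: insert $snippet right after the opening <body ...> tag.
// inject_after_body() is a hypothetical helper, not from the answers above.
function inject_after_body($html, $snippet)
{
    // [^>]* also spans newlines, so attributes split over several lines are handled.
    if (!preg_match('/<body\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        return $html; // no body tag found: leave the document untouched
    }
    // $m[0][0] is the matched tag, $m[0][1] its byte offset in $html.
    $insertAt = $m[0][1] + strlen($m[0][0]);
    return substr_replace($html, $snippet, $insertAt, 0);
}
```

Only one regex match is performed and the rest is plain string slicing, which keeps the cost close to the offset-based approach while still tolerating attributes and odd whitespace. An attribute value containing a literal > would still confuse it.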
Using regular expressions brings up performance concerns... This is what I'm going for
<?php
$html = file_get_contents('http://www.yahoo.com/');
$start = stripos($html, '<body');
$end = stripos($html, '>', $start);
$body = substr_replace($html, '<IFRAME INSERT>', $end+1, 0);
echo htmlentities($body);
?>
Thoughts?