What's the best way to scrape specific content from multiple HTML files? - html

I have quite a few HTML files of webpages with many pieces of information. I am trying to extract some of the content and place it into an XML file or possibly an Excel spreadsheet. All the pages are quite similar in design, and the information is in the same locations on every page. Does anybody know of a way to do this?

There are many scraping libraries that can help you extract data from HTML pages.
Web scraping and crawling are not always straightforward, so it depends on what you're trying to achieve. Different products, SDKs, libraries, etc. focus on different aspects of scraping or crawling. Here are a few you can check out:
Apify (formerly Apifier) - a cloud-based web scraper that extracts structured data from any website using a few simple lines of JavaScript.
Diffbot - extracts data from web pages automatically and returns structured JSON.
Espion - a headless browser that enables you to inject JavaScript code directly into your target web pages.
Also, if you know Node.js, node-osmosis is a really cool and easy-to-use library.

I strongly recommend this library:
http://sourceforge.net/projects/simplehtmldom/
/**
* Website: http://sourceforge.net/projects/simplehtmldom/
* Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
* Contributions by:
* Yousuke Kumakura (Attribute filters)
* Vadim Voituk (Negative indexes supports of "find" method)
* Antcs (Constructor with automatically load contents either text or file/url)
*
* all affected sections have comments starting with "PaperG"
*
* Paperg - Added case insensitive testing of the value of the selector.
* Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
* This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,
* it will almost always be smaller by some amount.
* We use this to determine how far into the file the tag in question is. This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.
* but for most purposes, it's a really good estimation.
* Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
* Allow the user to tell us how much they trust the html.
* Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node.
* This allows for us to find tags based on the text they contain.
* Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
* Paperg: added parse_charset so that we know about the character set of the source document.
* NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the
* last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
*
* Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that.
* PaperG (John Schlick) Added get_display_size for "IMG" tags.
*
* Licensed under The MIT License
* Redistributions of files must retain the above copyright notice.
*
* @author S.C. Chen <me578022#gmail.com>
* @author John Schlick
* @author Rus Carroll
* @version 1.5 ($Rev: 196 $)
* @package PlaceLocalInclude
* @subpackage simple_html_dom
*/
/**
* All of the Defines for the classes below.
* @author S.C. Chen <me578022#gmail.com>
*/
Here's an example:
$html = file_get_html($ad_bachecubano_url);
// Capture the ad's heading and body text
$anuncio['header'] = $html->find('.headingText', 0)->plaintext;
$anuncio['body'] = $html->find('.showAdText', 0)->plaintext;
// Each #lineBlock holds a label (.headingText2) and a value (.normalText)
foreach ($html->find('#lineBlock') as $possibleprice) {
    $item = $possibleprice->find('.headingText2', 0)->plaintext;
    if ($item == "Precio: ") {
        $precio = $possibleprice->find('.normalText', 0)->plaintext;
        // getFinalPrice() is a method of the surrounding class (not shown)
        $anuncio['price'] = $this->getFinalPrice($precio);
    }
}
foreach ($html->find('#contact') as $contact) {
    foreach ($contact->find('#lineBlock') as $box) {
        $key = $box->find('.headingText2', 0)->plaintext;
        $value = $box->find('.normalText', 0)->plaintext;
        if ($key == "Nombre: ") {
            $anuncio['nombre'] = $value;
        }
        if ($key == "Teléfono: ") {
            $anuncio['phone'] = $value;
        }
    }
}
// scrapeemail() is a helper defined elsewhere (not shown); fall back to an empty string
$emails = scrapeemail($anuncio['body']);
$anuncio['email'] = isset($emails[0][0]) ? $emails[0][0] : "";
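For the original task (many local files with the same layout, output to something Excel can open), here is a minimal Python sketch using only the standard library. It assumes the wanted values sit in elements identifiable by a CSS class; the class names, the folder name, and the helper names `extract_fields` and `scrape_folder` are hypothetical and would need adapting to the real pages.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying each wanted class."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.found = {}
        self._current = None  # class name currently being captured
        self._depth = 0       # nesting depth inside the captured element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        for name in (dict(attrs).get("class") or "").split():
            if name in self.wanted and name not in self.found:
                self._current, self._depth, self._buf = name, 1, []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.found[self._current] = " ".join("".join(self._buf).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)


def extract_fields(html, fields):
    """Return {field: text} for one page; missing fields map to ''."""
    parser = ClassTextExtractor(fields)
    parser.feed(html)
    parser.close()
    return {name: parser.found.get(name, "") for name in fields}


def scrape_folder(folder, fields, out_csv):
    """Write one CSV row per *.html file in `folder`."""
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *fields])
        writer.writeheader()
        for path in sorted(Path(folder).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": path.name, **extract_fields(html, fields)})
```

Calling something like `scrape_folder("pages", ["headingText", "showAdText"], "ads.csv")` would produce a CSV that opens directly in Excel. For messier markup, a dedicated parsing library is probably a better fit than this sketch.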

Related

 (OBJ) symbol in WordPress URL?

I have a question about a WordPress URL in Google Chrome 94.0.4606.81:
I was reading a WordPress article recently and noticed that there is an  (OBJ) symbol in the URL. The symbol is also in the webpage title.
Take Ownership and Select Owner
Question:
What is the purpose of the  (OBJ) symbol -- and how is it possible that it has been included in a URL?
It seems this symbol ended up in the title field of the article. You can remove it from there. If you can't see it, select everything in the field with Ctrl + A and retype the title.
Honestly, I don't know the cause of this copy/paste issue in WP with the "Object Replacement Character".
To keep this character from appearing, it's enough to use the Ctrl+Shift+V shortcut (Paste Text Without Formatting) when pasting into the WP post title field.
If you want to be sure the post slug (that is, the post URL) is protected, you can use this snippet in your functions.php:
/**
* Remove the strange [OBJ] character in the post slug
* See: https://github.com/WordPress/gutenberg/issues/38637
*/
add_filter("wp_unique_post_slug", function($slug, $post_ID, $post_status, $post_type, $post_parent, $original_slug) {
return preg_replace('/(%ef%bf%bc)|(efbfbc)|[^\w-]/', '', $slug);
}, 10, 6);
Here preg_replace deletes the string "%ef%bf%bc" or "efbfbc" (the hex-encoded UTF-8 bytes of the OBJ character) and any character that is not a basic alphanumeric character, an underscore, or a dash.
Since you mentioned it also made it into the title: I use this to filter the title on save and remove these special characters.
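To see what that pattern actually removes, here is a small Python sketch of the same regular expression. The function name `clean_slug` and the sample slugs are made up for illustration; `re.ASCII` is used on the assumption that the PHP pattern runs without the `/u` flag, so `\w` only matches ASCII letters, digits, and the underscore.

```python
import re

# Same three alternatives as the PHP snippet: the percent-encoded OBJ bytes,
# the bare hex form, or any character that is not [A-Za-z0-9_-].
OBJ_SLUG_PATTERN = re.compile(r"(%ef%bf%bc)|(efbfbc)|[^\w-]", re.ASCII)


def clean_slug(slug):
    """Strip the encoded OBJ character and other unexpected characters."""
    return OBJ_SLUG_PATTERN.sub("", slug)


print(clean_slug("take-ownership-%ef%bf%bcand-select-owner"))
```

One thing to keep in mind: because the last alternative also strips `%`, a slug that legitimately contains percent-encoded non-ASCII characters would be mangled by this filter, so it is probably only suitable for sites with plain ASCII slugs.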
function sbnc_filter_title($title) {
// Concatenate separate diacritics into one character if we can
if ( function_exists('normalizer_normalize') && strlen( normalizer_normalize( $title ) ) < strlen( $title ) ) {
$title = normalizer_normalize( $title );
}
// Replace no-break-space with regular space
$title = preg_replace( '/\x{00A0}/u', ' ', $title );
// Remove whitespaces from the ends
$title = trim($title);
// Remove any invisible and control characters
$title = preg_replace('/[^\x{0020}-\x{007e}\x{00a1}-\x{FFEF}]/u', '', $title);
return $title;
}
add_filter('title_save_pre', 'sbnc_filter_title');
Please note that you may need to extend the allowed Unicode ranges in the preg_replace call based on the languages you support. The range in the example should suit most languages actively used in the world, but if your article titles may include archaic scripts like Linear B, Gothic, etc., you may need to extend the ranges.
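For anyone who wants to check the ranges outside WordPress, this is a rough Python equivalent of the title filter. The name `clean_title` is hypothetical, and unlike the PHP version it always applies NFC normalization rather than only when that shortens the string.

```python
import re
import unicodedata

# Keep printable ASCII plus U+00A1..U+FFEF; everything else is removed.
# U+FFFC (the OBJ character) lies just above U+FFEF, so it is dropped.
_DISALLOWED = re.compile(r"[^\u0020-\u007e\u00a1-\uffef]")


def clean_title(title):
    # Combine base letters and separate diacritics where possible.
    title = unicodedata.normalize("NFC", title)
    # Replace no-break spaces with regular spaces, then trim the ends.
    title = title.replace("\u00a0", " ").strip()
    # Remove invisible and control characters.
    return _DISALLOWED.sub("", title)


print(clean_title("Take Ownership\ufffc and\u00a0Select Owner "))
```

Note that with these ranges anything above U+FFEF is removed as well, which includes emoji.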
If you copy-pasted it from somewhere, like I did, remember to paste as text using Ctrl + Shift + V to avoid this.
Also, this [OBJ] only appears in Chromium-based browsers like Chrome, Edge, etc.; Firefox, I believe, discards it by default.

Trim long html for preview without breaking html tags

I ran into this problem while working on a blog post preview list.
I need to shorten the content without breaking any HTML tags by leaving them open.
I have heard that regex is not a good option. I am looking for something simple that works.
I appreciate your help in advance as always (SO ended up being a very nice place to come over with problems like that :-)
WordPress has a built-in function for generating an excerpt from the actual blog post.
You didn't specify which language you wanted for the trim function, so here is the WordPress version. It can easily be modified and repurposed for use outside of WordPress if need be.
wp_trim_words() function reference
/**
* Generates an excerpt from the content, if needed.
*
* The excerpt word amount will be 55 words and if the amount is greater than
* that, then the string ' […]' will be appended to the excerpt. If the string
* is less than 55 words, then the content will be returned as is.
*
* The 55 word limit can be modified by plugins/themes using the excerpt_length filter
* The ' […]' string can be modified by plugins/themes using the excerpt_more filter
*
* @since 1.5.0
*
* @param string $text Optional. The excerpt. If set to empty, an excerpt is generated.
* @return string The excerpt.
*/
function wp_trim_excerpt($text = '') {
$raw_excerpt = $text;
if ( '' == $text ) {
$text = get_the_content('');
$text = strip_shortcodes( $text );
$text = apply_filters('the_content', $text);
$text = str_replace(']]>', ']]&gt;', $text);
$excerpt_length = apply_filters('excerpt_length', 55);
$excerpt_more = apply_filters('excerpt_more', ' ' . '[…]');
$text = wp_trim_words( $text, $excerpt_length, $excerpt_more );
}
return apply_filters('wp_trim_excerpt', $text, $raw_excerpt);
}
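If the markup itself has to survive the trimming, one parser-based approach is to count only the text characters, stop once the limit is reached, and then close whatever tags are still open. Below is a minimal Python sketch of that idea using the standard library; `truncate_html` and the sample input are illustrative names, not part of WordPress.

```python
from html import escape
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    def __init__(self, max_chars):
        super().__init__(convert_charrefs=True)
        self.remaining = max_chars
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # keep the tag as written
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            self.out.append(escape(data[:self.remaining], quote=False) + "…")
            self.remaining = 0
            self.done = True
        else:
            self.out.append(escape(data, quote=False))
            self.remaining -= len(data)


def truncate_html(html, max_chars):
    """Cut the text content at max_chars and close any tags left open."""
    parser = _Truncator(max_chars)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<p>Hello <b>brave new</b> world</p>", 9))
# prints: <p>Hello <b>bra…</b></p>
```

This sketch cuts in the middle of a word and drops comments; cutting at a word boundary would be a small change inside `handle_data`.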

How to stop tags within a div from affecting elements outside a div?

So, on my website, I have user-generated content. I have a WYSIWYG editor, but it has a view-source mode. I have a few approved tags.
But it occurred to me: what if a user just puts in an opening tag without closing it? Then it affects the rest of the page.
How can I get around this?
I actually wanted to tell you to look at how WordPress does this, until I tested it and found out that WordPress does not care ^^ I can easily break my page by opening divs and not closing them.
Anyway, I found this:
/**
 * close all open xhtml tags at the end of the string
 *
 * @param string $html
 * @return string
 * @author Milian <mail#mili.de>
 */
function closetags($html) {
    # put all opened tags into an array
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    # put all closed tags into an array
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    # all tags are closed
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    # close tags
    for ($i = 0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</' . $openedtags[$i] . '>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}
This should be what you are looking for.
This is a very complex topic in general and there's no easy shortcut. Before you start reinventing the wheel, use a library that has been designed to do exactly that. Supposedly HTML Purifier is one of the few, if not the only, libraries that gets it right.
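To illustrate the general idea of combining an allowlist with tag balancing using a real parser rather than regular expressions, here is a minimal Python sketch. The allowlist and the name `balance_fragment` are hypothetical, all attributes are dropped, and this is only an illustration of the technique, not a replacement for a vetted sanitizer such as the one recommended above.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; a real site would define its own approved tags.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "ul", "ol", "li"}


class _Balancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # allowed tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray or disallowed closing tag: ignore it
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text is re-escaped so it cannot introduce new markup.
        self.out.append(escape(data, quote=False))


def balance_fragment(html):
    """Keep only allowed tags and close any that the user left open."""
    parser = _Balancer()
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.stack))
    return "".join(parser.out) + closing


print(balance_fragment("<b>bold <i>both"))
# prints: <b>bold <i>both</i></b>
```

Because every allowed tag that is opened inside the fragment is also closed inside it, an unclosed tag can no longer leak into the rest of the page.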

Basic information extraction from html?

I have a project where users submit many links to external sites and I need to parse the HTML of these submitted links and extract basic information from the page in the same way that Digg and Facebook do when a link is submitted.
I want to retrieve:
main title or heading (could be in title, h1, h2, p etc...)
intro or description text (could be in div, p etc...)
main image
My main problem is that there seem to be too many options to explore here, and I'm getting a little confused, to say the least. Many solutions I have looked at so far seem to be either inadequate or huge overkill.
You would pick a server side language to do this.
For example, with PHP, you could use get_meta_tags() for the meta tags...
$meta = get_meta_tags('http://google.com');
And you could use DOMDocument to get the title element (some may argue if needing the title element, you may as well use DOMDocument to get the meta tags as well).
$dom = new DOMDocument;
$dom->loadHTMLFile('http://google.com'); // loadHTML() expects markup, not a URL
$title = $dom
->getElementsByTagName('head')
->item(0)
->getElementsByTagName('title')
->item(0)
->nodeValue;
As for getting main image, that would require some sort of extraction of what may be considered the main image. You could get all img elements and look for the largest one on the page.
$dom = new DOMDocument;
$dom->loadHTMLFile('http://google.com'); // loadHTML() expects markup, not a URL
$imgs = $dom
    ->getElementsByTagName('body')
    ->item(0)
    ->getElementsByTagName('img');
$imageSizes = array();
foreach ($imgs as $img) {
    if ( ! $img->hasAttribute('src')) {
        continue;
    }
    $src = $img->getAttribute('src');
    // May need to prepend relative path
    // Assuming Apache, http and port 80
    // NOTE: this uses the current request's host; for a remote page,
    // resolve against that page's URL instead
    $relativePath = rtrim($_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'], '/') . '/';
    if (substr($src, 0, strlen($relativePath)) !== $relativePath) {
        $src = $relativePath . $src;
    }
    $imageInfo = getimagesize($src); // returns false if the image cannot be read
    if ( ! $imageInfo) {
        continue;
    }
    list($width, $height) = $imageInfo;
    // Key by area; images with the same area overwrite each other
    $imageSizes[$width * $height] = $img;
}
// Sort by area so that end() returns the largest image
ksort($imageSizes);
$mainImage = end($imageSizes);
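The same kind of extraction (title, description, a candidate main image) can be sketched in Python with the standard library alone. This sketch assumes the HTML has already been downloaded into a string, and it prefers `og:` meta tags when the page provides them, which is an assumption on my part rather than something the snippets above do. `summarize_page` and the sample URL are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _SummaryParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}    # lower-cased meta name/property -> content
        self.images = []  # src attributes in document order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta.setdefault(key.lower(), attrs["content"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(html, base_url):
    """Return title, description and a candidate main image URL."""
    parser = _SummaryParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    image = meta.get("og:image") or (parser.images[0] if parser.images else None)
    return {
        "title": meta.get("og:title") or parser.title.strip(),
        "description": meta.get("og:description") or meta.get("description", ""),
        # Resolve relative image paths against the page's own URL.
        "image": urljoin(base_url, image) if image else None,
    }
```

Falling back to the first `img` is a crude guess; comparing image dimensions, as in the PHP loop above, would likely pick a better main image when no `og:image` is present.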

Are there functions in ColdFusion to get just 2 lines of text from a string?

I know this works in other languages, but wanted to see if there is existing code/functions.
This string can be populated from numerous different queries, but they need to be all displayed the same way, same length etc.
I have a function, to control string length by word count, but I would prefer to make sure that I have at least 2 sentences or 2 lines of text at most.
Thanks
I had a similar task at my job. You have to pick an arbitrary number, and it looks like you've chosen 190. That being said, you can't just hope that the characters/words returned are relevant. You have to ensure that they are, if it's something you care about, which it seems like you do, looking at your comments.
Try to find the keyword in the string and use the mid() function to get a certain number of characters on either side of the keyword:
<cfscript>
max_chars = 190;
full_article = #the full article#;
keyword_position = find(keyword, full_article);
if( keyword_position != 0 ) {
excerpt = mid(full_article,
int(keyword_position + len(keyword) / 2 - max_chars / 2),
max_chars);
}
</cfscript>
...or something like that. I'll leave it to you to make sure that you're not trying to get characters before the start of the full_article, or after the end of it, and adding ellipses and stuff.
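For comparison, here is the same keyword-centred excerpt idea as a small Python sketch, including the bounds checks and ellipses mentioned above. The function name `excerpt_around` and its defaults are made up for illustration.

```python
def excerpt_around(text, keyword, max_chars=190):
    """Return up to max_chars of text, roughly centred on the keyword."""
    pos = text.find(keyword)
    if pos == -1:
        # Keyword not found: fall back to the start of the text.
        return text[:max_chars]
    # Centre the window on the middle of the keyword.
    start = pos + len(keyword) // 2 - max_chars // 2
    # Clamp the window so it stays inside the text.
    start = max(0, min(start, max(0, len(text) - max_chars)))
    end = min(len(text), start + max_chars)
    prefix = "…" if start > 0 else ""
    suffix = "…" if end < len(text) else ""
    return prefix + text[start:end] + suffix


sample = "a" * 100 + "KEY" + "b" * 100
print(excerpt_around(sample, "KEY", 11))
# prints: …aaaaKEYbbbb…
```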
Try something like fullLeft or dig through the other string manipulation UDFs at CFLib. If you're looking for something more specific could you show us a comparable function in another language and we'd be better able to point you to something similar.
_TestString = "I know this works in other languages, but wanted to see if there is existing code/functions. This string can be populated from numerous different queries, but they need to be all displayed the same way, same length etc.";
if ( len(_TestString) GT 190)
{
_TestString = Left(_TestString,190) & "...";
}
That will output:
I know this works in other languages, but wanted to see if there is existing code/functions. This string can be populated from numerous different queries, but they need to be all displayed t...
You probably don't want to do anything more than that. String manipulation can get expensive for no reason, and you shouldn't waste processing on the display layer unless you have to.
CFLIB has plenty of string manipulation functions on offer. You may find abbreviate() is useful, especially for search results: http://cflib.org/udf/abbreviate
<cfscript>
/**
* Abbreviates a given string to roughly the given length, stripping any tags, making sure the ending doesn't chop a word in two, and adding an ellipsis character at the end.
* Fix by Patrick McElhaney
* v3 by Ken Fricklas kenf#accessnet.net, takes care of too many spaces in text.
*
* @param string String to use. (Required)
* @param len Length to use. (Required)
* @return Returns a string.
* @author Gyrus (gyrus#norlonto.net)
* @version 3, September 6, 2005
*/
function abbreviate(string,len) {
var newString = REReplace(string, "<[^>]*>", " ", "ALL");
var lastSpace = 0;
newString = REReplace(newString, " \s*", " ", "ALL");
if (len(newString) gt len) {
newString = left(newString, len-2);
lastSpace = find(" ", reverse(newString));
lastSpace = len(newString) - lastSpace;
newString = left(newString, lastSpace) & " &##8230;";
}
return newString;
}
</cfscript>
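Since the question asks for roughly two sentences rather than a fixed number of characters, here is a hedged Python sketch of that variant: take the first two sentences, and fall back to a word-boundary cut if they exceed the limit. The sentence splitting is a naive regular expression (it will mis-split on abbreviations like "e.g."), and `first_sentences` is a hypothetical name.

```python
import re


def first_sentences(text, count=2, max_chars=190):
    """Return the first `count` sentences, capped at max_chars."""
    # Naive split: a sentence ends with . ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    excerpt = " ".join(sentences[:count])
    if len(excerpt) <= max_chars:
        return excerpt
    # Too long: cut at the last space before the limit and add an ellipsis.
    return excerpt[:max_chars].rsplit(" ", 1)[0] + "…"


print(first_sentences("One. Two! Three? Four."))
# prints: One. Two!
```

The same logic should translate to ColdFusion with REFind or a list loop over sentence delimiters, though I have not tested that.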