Extract content from a page - html

I need to recognize the main content of a page, much like http://www.alchemyapi.com/api/text/ does (I need to get the HTML, so I can't use this API).
What logic can I use to accomplish this? (The programming language doesn't matter.)
Here is what I did (with good results, though it still needs a lot of fixes):
1. Find the largest run of text in the page that contains no breaking tags, ignoring inline tags (span, b, etc.)
2. Go up one level and count the breaking tags (br, p, div, etc.)
3. Go up another level and count the tags
4. Compare the tag count from step 2 with the count from step 3
5. If the difference is large, stop here; if not, go back to step 3
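The steps above can be sketched roughly as follows. This is only one reading of the heuristic, using Python's standard html.parser; the names Node, TreeBuilder, find_content_node and the jump threshold are made up for illustration, and the INLINE and VOID sets are partial lists chosen by assumption.

```python
from html.parser import HTMLParser

INLINE = {"a", "b", "i", "em", "strong", "span", "u", "small", "code"}
VOID = {"br", "hr", "img", "input", "meta", "link"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None, attrs=None):
        self.tag, self.parent, self.attrs = tag, parent, attrs or {}
        self.children = []   # breaking (non-inline) child elements
        self.text_len = 0    # characters of text directly inside this element


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.current = Node("root")
        self.nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in INLINE:
            return                       # inline tags do not break a text run
        node = Node(tag, self.current, dict(attrs))
        self.current.children.append(node)
        self.nodes.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        if tag in INLINE or tag in VOID:
            return
        node = self.current              # close the nearest open element with this tag
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def count_tags(node):
    """Number of breaking tags anywhere below node."""
    return sum(1 + count_tags(child) for child in node.children)


def find_content_node(html, jump=2.0):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in builder.nodes if n.tag not in SKIP]
    start = max(candidates, key=lambda n: n.text_len)   # step 1
    container = start.parent or start                   # step 2
    while container.parent is not None:
        here = max(count_tags(container), 1)
        above = count_tags(container.parent)            # step 3
        if above / here >= jump:                        # steps 4 and 5
            break
        container = container.parent
    return container
```

The returned node's attrs (for example its id) identify the element to extract. The jump value of 2.0 is a guess and would need tuning on real pages.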

Look for the Boilerpipe library. It is a comprehensive solution.
Using the Boilerpipe library, you can specify the output as HTML, so you get the main content (the article) while still preserving its HTML tags.

Another good alternative would be to use Goose.
It extracts more fields (published date, author, main image in the article and a few more) than Boilerpipe (title, content).

You need a parser to navigate the DOM. Among the NuGet packages you can find some helpful parser tools, such as this one.

Related

Filtering (hide or show) schema in HTML5 only

Using html5 only, without css or anything else, is it possible to have a sentence like this,
She loves him.
be schema'd with HTML5 so that the word "She" is tagged with metadata such as "Subject", "loves" is tagged with "verb", "him" is tagged with "object", and all of it together including the period, "She loves him." is tagged with "complete sentence".
Then, a user could filter what they want to display - elements tagged as Subjects, verbs, objects, or complete sentences... or mixtures of these.
For example, maybe you want to see all complete sentences and all objects, regardless of whether they are in a complete sentence or not.
Another example, if you had a bunch of sentences on a webpage, you could quickly filter to see only the verbs.
I'm looking for a way to accomplish this - a game plan for how to structure the tags. If I use divs, will that be robust enough to let me tag complete sentences and the individual words inside them, or is there a cleaner way, a more concise way, etc?
You're not going to be able to do this with just HTML. You're going to need to use JavaScript to show and hide those elements. You'll need CSS to make it look good.
A way you could get started is to put spans around different parts of the sentence.
<span class="full-sentence"><span class="subject">She</span> <span class="verb">loves</span> <span class="object">him</span>.</span>
Then you need to look into using jQuery to show and hide the different parts of the sentence.
If you don't want to use JavaScript, you could alternatively use CSS on the span classes to color the subject, verb and object.

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind that this is for a school project, and I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded; that is, I can't assume I already have the whole HTML page downloaded. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL cases, just be satisfactory most of the time.
"I am not allowed to use any kind of external library"
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly; it is certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
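As a minimal sketch of that approach, here Python's standard-library html.parser stands in for a library like the HTML Agility Pack (an assumption made purely for illustration). The class name VisibleWords and the INVISIBLE set are hypothetical. It can be fed chunks as they arrive, and it ignores CSS entirely, so text hidden by styling is still reported.

```python
import re
from html.parser import HTMLParser

# Elements whose text is never rendered in the page body (partial list, by assumption)
INVISIBLE = {"script", "style", "title", "noscript", "template"}


class VisibleWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while inside an invisible element
        self.pending = ""       # possibly incomplete word at the end of the last chunk
        self.words = []

    def _flush(self):
        if self.pending:
            self.words.append(self.pending)
            self.pending = ""

    def handle_starttag(self, tag, attrs):
        self._flush()           # simplification: every tag ends the current word
        if tag in INVISIBLE:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        self._flush()
        if tag in INVISIBLE and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth:
            return
        # keep the last fragment back in case the word continues in the next chunk
        *complete, self.pending = re.split(r"\s+", self.pending + data)
        self.words.extend(word for word in complete if word)

    def close(self):
        super().close()
        self._flush()
```

Call feed() with each downloaded chunk and read the growing words list after every call; call close() once the download finishes.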
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
I'd consider writing a regex to remove all HTML tags, which should leave you with the text you want. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags. Note, however, that you first need to remove the script, link, style, ... elements together with their content.
If you decide to go this way, I can help you with the regular expressions needed.
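The two passes suggested in the answers above (drop invisible elements first, then strip the remaining tags) could look roughly like this in Python. It is a near-enough sketch, not a real HTML parser: the function name strip_html is made up, and replacing each tag with a space will split a word that has an inline tag in the middle of it.

```python
import html
import re

COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# Pass 1: whole elements whose content is never visible
INVISIBLE_BLOCKS = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Pass 2: any remaining tag, allowing '>' inside quoted attribute values
TAGS = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")


def strip_html(source):
    text = COMMENTS.sub(" ", source)
    text = INVISIBLE_BLOCKS.sub(" ", text)
    text = TAGS.sub(" ", text)
    text = html.unescape(text)       # turn &amp; and friends back into characters
    return " ".join(text.split())    # collapse runs of whitespace
```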
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.

Stripping HTML but retaining block/inline structure

I would like to convert HTML to plain text but retain the minimum structure.
All sections that contain material only the browser needs to see, such as <script> and <style>, should be stripped completely.
Convert all block tags to <div> and all inline ones to <span>, or remove the inline tags completely without leaving whitespace, and turn anything delineated by block-level elements into paragraphs separated by two line breaks.
The idea is to turn random web pages into something suitable for natural language text processing, without the artefacts left by naively removing markup, such as artificially broken-up words or unrelated blocks that look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source, preferably machine-readable, with a full list of elements defining which are block, which are inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying to do.
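For illustration, here is a small Python sketch of the conversion the question describes, using the standard html.parser rather than the Perl modules mentioned above. The BLOCK set is a hand-picked partial list (an assumption, not a standard source), br is treated as a break for convenience, and the names StructuredText and html_to_paragraphs are invented.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript", "template"}
BLOCK = {
    "address", "article", "aside", "blockquote", "br", "dd", "div", "dl", "dt",
    "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "hr", "li", "main", "nav", "ol", "p", "pre", "section", "table",
    "td", "th", "tr", "ul",
}


class StructuredText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.blocks = [[]]          # each block is a list of text fragments

    def _break(self):
        if self.blocks[-1]:         # start a new block only if the current one has content
            self.blocks.append([])

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCK:
            self._break()
        # any other tag is treated as inline and dropped without adding whitespace

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK:
            self._break()

    def handle_data(self, data):
        if not self.skip_depth:
            self.blocks[-1].append(data)


def html_to_paragraphs(source):
    parser = StructuredText()
    parser.feed(source)
    parser.close()
    paragraphs = (" ".join("".join(block).split()) for block in parser.blocks)
    return "\n\n".join(p for p in paragraphs if p)
```

Because inline tags add no whitespace, a word such as Un<b>break</b>able comes out in one piece, while block boundaries become blank lines.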
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you also may want to save div tag information if they use the id attribute. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.

Equivalent of LaTeX's \label and \ref in HTML

I have an FAQ in HTML (example) in which the questions refer to each other a lot. That means whenever we insert/delete/rearrange the questions, the numbering changes. LaTeX solves this very elegantly with \label and \ref -- you give items simple tags and LaTeX worries about converting to numbers in the final document.
How do people deal with that in HTML?
ADDED: Note that this is no problem if you don't have to actually refer to items by number, in which case you can set a tag with
<a name="foo">
and then link to it with
<a href="#foo">some non-numerical way to refer to foo</a>.
But I'm assuming "foo" has some auto-generated number, say from an <ol> list, and I want to use that number to refer to and link to it.
There is nothing like this in HTML.
The way you would normally solve this, is by having the HTML for the links generated, by either parsing the HTML itself and inserting the TOC (you can do that on the server, before you send the HTML out to the browser, or on the client, by traversing the DOM with a little piece of ECMAScript and simply collecting and inspecting all <a> elements) or generating the entire HTML document from a higher level source like a database, an XML document, markdown or – why not? – even LaΤΕΧ.
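A rough sketch of the "generate the links" idea in Python, run over the page before it is served. The {{ref:key}} placeholder syntax and the function name resolve_refs are invented for this example. It assumes a single flat <ol> whose items carry id attributes, and it uses regular expressions for brevity where a real tool would use an HTML parser.

```python
import re

ITEM = re.compile(r"<li\b([^>]*)>", re.IGNORECASE)
ITEM_ID = re.compile(r'\bid="([^"]+)"')
REF = re.compile(r"\{\{ref:([^}]+)\}\}")   # hypothetical placeholder syntax


def resolve_refs(source):
    # Number every <li> in document order, the way an <ol> would display them
    numbers = {}
    for n, item in enumerate(ITEM.finditer(source), start=1):
        found = ITEM_ID.search(item.group(1))
        if found:
            numbers[found.group(1)] = n

    def link(match):
        key = match.group(1)
        if key not in numbers:
            return match.group(0)          # leave unknown references untouched
        return f'<a href="#{key}">{numbers[key]}</a>'

    return REF.sub(link, source)
```

Reordering the questions then only requires rerunning the script; the ids play the role of \label and the placeholders the role of \ref.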
I know it's not widely supported by browsers, but you can do this using CSS counters.
Also, consider using ids instead of names for your anchors.
Instead of \label{key} use <a name="key" />. Then link using <a href="#key">Link</a>.
PrinceXML can do that, but that's about it. I suppose it'd be best to use server-side scripting.
Here's how I ended up solving this with a PHP script:
http://yootles.com/genfaq
It's roughly as convenient as \label and \ref in LaTeX and even auto-generates the index of questions.
And I put it on an etherpad instance which is handy when multiple people are contributing questions to the FAQ.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with both well-formed and terrible markup, whether somebody uses paragraph tags to make paragraphs or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique contents, so that these nodes are prioritized for other pages.
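The core of the comparison can be sketched in a few lines of Python. To keep it short, this assumes the pages have already been fetched and split into text blocks (one string per DOM node), so the crawling and tree walking are left out; the function name unique_blocks is made up.

```python
from collections import Counter


def unique_blocks(pages):
    """pages maps a URL to the list of text blocks extracted from that page.

    Returns, per URL, the longest block that appears on no other page.
    """
    seen_on = Counter()
    for blocks in pages.values():
        seen_on.update(set(blocks))      # count each block once per page

    result = {}
    for url, blocks in pages.items():
        unique = [block for block in blocks if seen_on[block] == 1]
        result[url] = max(unique, key=len, default="")
    return result
```

Exact string matching is the simplest possible test for "redundant"; identifying merely similar nodes, as the answer suggests, would need a fuzzier comparison.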
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element such as a div as a document), gather some properties of each section and convert them to a vector. (As other people suggested, these could be the number of words, number of links, number of images; the more the better.)
First, start with a large set of documents (100-1000) in which you have already marked which part is the main part. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
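As a rough illustration of the "convert it to a vector" step only, here is a Python sketch that turns one block of HTML into a small feature vector. The choice of features, their order, and the names BlockFeatures and to_vector are assumptions; training the SVM or Bayesian model on such vectors is left out.

```python
from html.parser import HTMLParser


class BlockFeatures(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = self.links = self.images = self.link_words = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self.link_depth += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        count = len(data.split())        # whitespace-separated tokens
        self.words += count
        if self.link_depth:
            self.link_words += count


def to_vector(block_html):
    """Return [words, links, images, share of words that sit inside links]."""
    parser = BlockFeatures()
    parser.feed(block_html)
    parser.close()
    return [parser.words, parser.links, parser.images,
            parser.link_words / max(parser.words, 1)]
```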
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
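One way to put that question into code, sketched in Python: score each div by its own text minus the text that sits inside links, and take the highest. Crediting text only to the innermost open div and using a plain subtraction as the score are choices made for this example; DivScorer and best_div are invented names.

```python
from html.parser import HTMLParser


class DivScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.divs = []         # one record per <div>, in document order
        self.open_divs = []    # stack of indexes into self.divs
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append({"attrs": dict(attrs), "text": 0, "link_text": 0})
            self.open_divs.append(len(self.divs) - 1)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if not self.open_divs:
            return
        # credit only the innermost open div, so outer wrappers do not win by default
        div = self.divs[self.open_divs[-1]]
        div["link_text" if self.link_depth else "text"] += len(data.strip())


def best_div(html):
    scorer = DivScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.divs, key=lambda d: d["text"] - d["link_text"], default=None)
```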
The content area is almost always the area with the greatest width on the page.
I would probably start with the title and anything else in the head element, then filter down through heading tags in order (i.e. h1, h2, h3, etc.). Beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
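A small Python sketch of this heuristic, assuming the page has already been reduced to (tag, text) pairs in document order. The five-word minimum and the names looks_like_sentence and content_span are arbitrary choices for illustration.

```python
import re

# terminal punctuation, optionally followed by a closing quote or bracket
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")


def looks_like_sentence(text, min_words=5):
    text = text.strip()
    return len(text.split()) >= min_words and SENTENCE_END.search(text) is not None


def content_span(blocks):
    """blocks is a list of (tag, text) pairs; returns the slice judged to be content."""
    hits = [i for i, (_, text) in enumerate(blocks) if looks_like_sentence(text)]
    if not hits:
        return []
    first, last = hits[0], hits[-1]
    # pull in a heading that sits immediately before the first sentence
    if first > 0 and re.fullmatch(r"h[1-6]", blocks[first - 1][0]):
        first -= 1
    return blocks[first:last + 1]
```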
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites use a blogging platform.
So I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could turn to the other solutions: identifying the biggest chunk of text and so on.
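A sketch of the rule-based lookup in Python, using the two class names given above. It returns None when no rule matches, which is the signal to fall back to another heuristic. ClassExtractor and extract_post are invented names, and real themes may use other class names.

```python
from html.parser import HTMLParser

KNOWN_CLASSES = {"entry", "post-body"}   # the two platform rules from the answer


class ClassExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside a matched div (0 = outside)
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1                       # nested div inside the post
        else:
            classes = (dict(attrs).get("class") or "").split()
            if KNOWN_CLASSES.intersection(classes):
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def extract_post(html):
    parser = ClassExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text).split())
    return text or None     # None: no rule matched, try another approach
```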
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in code showing how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the Readability.js mentioned above.