Any ideas on how to identify the main content of the page? - html

If you had to identify the main text of a page (e.g. the post's content on a blog page), what would you do? What do you think is the simplest way to do it?
Get the page content with cURL
Maybe use a DOM parser to identify the elements of the page

That's a pretty hard task, but I would start by counting spaces inside DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to wrap their content in paragraph tags, so you could look at all p tags with at least n spaces and at least one punctuation mark.
You could also use the number of paragraph tags grouped inside an element. So if a div has N paragraph children, it could very well be the content you want to extract.
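A minimal sketch of the two heuristics above, using only Python's standard library. The names ParagraphScorer and best_container, the threshold of 5 spaces and the punctuation set are my own assumptions, and it expects reasonably well-formed markup (explicitly closed p tags):

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Counts, per parent element, the <p> children that look like prose."""

    def __init__(self, min_spaces=5):
        super().__init__()
        self.min_spaces = min_spaces  # the "n spaces" threshold (assumed value)
        self.elements = []            # (tag, id) of every element seen
        self.stack = []               # indices into self.elements, innermost last
        self.p_parent = None          # parent index of the <p> being read
        self.p_text = None            # text pieces of the <p> being read
        self.scores = Counter()       # parent index -> prose-like <p> count

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p":
            self.p_parent = self.stack[-1] if self.stack else None
            self.p_text = []
        self.elements.append((tag, dict(attrs).get("id")))
        self.stack.append(len(self.elements) - 1)

    def handle_data(self, data):
        if self.p_text is not None:
            self.p_text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.p_text is not None:
            text = "".join(self.p_text)
            looks_like_prose = (text.count(" ") >= self.min_spaces
                                and any(mark in text for mark in ".!?"))
            if looks_like_prose and self.p_parent is not None:
                self.scores[self.p_parent] += 1
            self.p_text = None
        # Pop back to the matching open tag, tolerating sloppy nesting.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.elements[self.stack[i]][0] == tag:
                del self.stack[i:]
                break


def best_container(html, min_spaces=5):
    """Returns (tag, id, count) for the element with the most prose-like <p> children."""
    scorer = ParagraphScorer(min_spaces)
    scorer.feed(html)
    scorer.close()
    if not scorer.scores:
        return None
    index, count = scorer.scores.most_common(1)[0]
    tag, element_id = scorer.elements[index]
    return tag, element_id, count
```

Calling best_container(page_source) on a typical blog page should point at the wrapper around the post body, but short posts or pages that put navigation in long p tags will fool it.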

There are some frameworks that can achieve this. One of them is http://code.google.com/p/boilerpipe/, which uses some statistics.
Some features that can help detect the HTML block with the main content:
p, div tags
amount of text inside/outside
amount of links inside/outside (i.e. remove menus)
some CSS class names and ids (frequently those blocks have classes or ids such as main, main_block, content, etc.)
relation between the title and the text inside the content
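Here is a rough sketch of how a few of these features (text amount, link amount, class/id hints) could be combined into a score. This is not how boilerpipe works internally; the tag list, the hint words and the doubling bonus are arbitrary assumptions for illustration, and text is credited only to the nearest enclosing block:

```python
from html.parser import HTMLParser

# Assumed set of block-level containers worth scoring.
BLOCK_TAGS = {"div", "article", "section", "main", "nav", "aside", "header", "footer"}
# Assumed hint words to look for in id and class attributes.
HINTS = ("main", "content", "article", "post")


class BlockFeatures(HTMLParser):
    """Collects text length and link-text length for each block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one dict per block element, in document order
        self.open_blocks = []  # indices of the block elements currently open
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            attrs = dict(attrs)
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
            self.blocks.append({"tag": tag, "label": label.strip().lower(),
                                "text": 0, "link_text": 0})
            self.open_blocks.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in BLOCK_TAGS and self.open_blocks:
            self.open_blocks.pop()

    def handle_data(self, data):
        length = len(data.strip())
        if length and self.open_blocks:
            block = self.blocks[self.open_blocks[-1]]  # nearest enclosing block
            block["text"] += length
            if self.link_depth:
                block["link_text"] += length


def score(block):
    """More non-link text is better; a hint word in id/class doubles the score."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link_text"] / block["text"]
    value = block["text"] * (1.0 - link_density)
    if any(hint in block["label"] for hint in HINTS):
        value *= 2  # arbitrary bonus, tune for your sites
    return value


def rank_blocks(html):
    """Returns the block dicts sorted from most to least content-like."""
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    return sorted(parser.blocks, key=score, reverse=True)
```

A menu that consists only of links ends up with a link density of 1 and a score of 0, which is the "remove menus" effect mentioned in the list.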

You might consider:
Boilerpipe: "The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings."
Ruby Readability: "Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project."
The Readability API: "If you'd like access to the Readability parser directly, the Content API is available upon request. Contact us if you're interested."

It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.
If the author uses "common" tags, you could look for a container
element ID'd as "content" or "main."
If the author is using HTML5, you should in theory be able to query for the <article> element, if it's a page with only one "story" to tell.
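As a sketch of that lookup with Python's standard library only: it takes the first element in document order that is either an article element or has an id of "content" or "main", and returns its text. The names ContainerText and main_text are made up for this example, and it assumes the container's tags are properly closed:

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Captures the text of the first <article> or id="content"/"main" element."""

    def __init__(self, wanted_ids=("content", "main")):
        super().__init__()
        self.wanted_ids = wanted_ids
        self.target = None  # tag name of the container being captured
        self.depth = 0      # nesting level of same-named tags inside the container
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.target is None:
            if tag == "article" or dict(attrs).get("id") in self.wanted_ids:
                self.target, self.depth = tag, 1
        elif tag == self.target:
            self.depth += 1  # e.g. a <div> nested inside the target <div>

    def handle_endtag(self, tag):
        if self.target is not None and not self.done and tag == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # the container itself was closed

    def handle_data(self, data):
        if self.target is not None and not self.done:
            self.parts.append(data)


def main_text(html):
    """Returns the whitespace-normalized container text, or None if nothing matched."""
    parser = ContainerText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split()) or None
```

If main_text returns None, the page uses neither convention and you are back to heuristics.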

Recently I faced the same problem. I developed a news article scraper and had to detect the main textual content of the article pages. Many news sites display lots of other textual content beside the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because some news sites also used <p> for other elements like navigation, 'read more', etc. Some time ago I stumbled on the Boilerpipe library.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites, because it was often not able to parse the whole text of the news article. I don't know why, but think that the boilerpipe algorithm can't deal with badly written html. So in many cases it just returned an empty string and not the main content of the news article.
After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:
<html>
<!-- depth: 1 -->
<nav>
<!-- depth: 2 -->
<ul>
<!-- depth: 3 -->
<li>Site<!-- depth: 4 --></li>
<li>Site<!-- depth: 4 --></li>
</ul>
</nav>
<div id='text'>
<!-- depth: 2 -->
<p>Thats the main content...<!-- depth: 3 --></p>
<p>main content, bla, bla bla ... <!-- depth: 3 --></p>
<p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
<p>whatever, bla... <!-- depth: 3 --></p>
</div>
</html>
As you can see, to filter out the surplus "clutter" with this algorithm, things like navigation elements, "you may like" sections, etc. must be on a different depth than the main content. Or in other words: the surplus "clutter" must be described with more (or fewer) HTML tags than the main textual content. The steps are:
Calculate the depth of every html element.
Find the depth with the highest amount of textual content.
Select all textual content with this depth
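The three steps can be sketched in a few lines of Python (standard library only). This is not the Ruby script mentioned in this answer, just an illustration of the idea; the names DepthText and extract_by_depth are made up, and it does not skip script or style contents, which a real scraper would have to do:

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthText(HTMLParser):
    """Step 1: records every text fragment under the depth of its element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.by_depth = defaultdict(list)  # depth -> text fragments at that depth

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_depth[self.depth].append(text)


def extract_by_depth(html):
    """Steps 2 and 3: pick the depth with the most text and return its text."""
    parser = DepthText()
    parser.feed(html)
    parser.close()
    if not parser.by_depth:
        return None, ""
    best = max(parser.by_depth,
               key=lambda depth: sum(len(t) for t in parser.by_depth[depth]))
    return best, "\n".join(parser.by_depth[best])
```

For the example markup above, the paragraphs inside the div win over the list items in the navigation because they carry more characters.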
To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com API, which you can use for free.
Greetings,
David

It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).

I've ported the original boilerpipe Java code into a pure Ruby implementation, Ruby Boilerpipe, and there is also a JRuby version wrapping the original Java code, Jruby Boilerpipe.

Related

What is the difference between html <var> and <p 'font-style: italic'></p>?

I searched the web for the purpose of the HTML <var> tag and didn't find a good explanation, or let's say I'm not satisfied yet. I can read what they say about it, but I don't understand the purpose. I tried two different lines of code and both give me the same result, so now I need to know what exactly <var> is and why we should use it rather than a single style.
<var>y</var> = <var>m</var><var>x</var> + <var>b</var>
<p style='font-style:italic'>y = mx + b</p>
Reference, to name only one: https://html.com/tags/var/
Funny, because I read the explanation but I still don't see what the use of <var> is other than just making the text italic!
Here is how W3Schools defines HTML:
HTML stands for Hyper Text Markup Language
HTML is the standard markup language for creating Web pages
HTML describes the structure of a Web page
HTML consists of a series of elements
HTML elements tell the browser how to display the content
HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.
The way I see it, even though <var> and <i> produce the same output in the browser, they mean different things, especially if you are "reading" pages without opening a browser, like search engines do.
This is not particular to the example you mentioned. Look at the example on <b> and <strong> (https://www.w3schools.com/html/html_formatting.asp). They also have the same output but mean different things.
Semantics.
<p> tags are generic paragraph elements, typically used for text.
<var> elements represent the name of a variable in a mathematical expression or a programming context.
If you italicize a paragraph it may resemble the default styling of the <var> element, but that's where the similarities end. Also, they're different to screen readers.
Here's an example using both elements and you can see that semantically, it's a paragraph of text that contains references to variables in a mathematical sense:
<p>The volume of a box is <var>l</var> × <var>w</var> × <var>h</var>, where <var>l</var> represents the length, <var>w</var> the width and <var>h</var> the height of the box.</p>

Ruby Nokogiri Ordered HTML tags

Background:
I am working on a simple web scraper for learning purposes. I am trying to scrape the main headings (<h2>) and the sub-headings (<h3>) from the Wikipedia page about the Ruby programming language. I can access each of these individually, but I would like to write my code in a way that any Wikipedia article could be substituted in.
Main question:
I am looking for a way to list all the <h3> elements that lie between the <h2> elements on the page. Is there a way to do that directly via Nokogiri, or will it involve using some Ruby as a work around?
Basically, I want to be able to list each main heading with its accompanying sub-headings, but I cannot see a way to group them, as Wikipedia does not group them in its HTML.
Thank you for your time.
-M
I would use Nokogiri's CSS selectors. The Bastard's Book of Ruby has a good primer on that. http://ruby.bastardsbook.com/chapters/html-parsing/
In your case, you'd want to use the following:
page.css('h2:not([id]) > span.mw-headline, h3:not([id]) > span.mw-headline')
Based on what I see in the dev tools console for Wikipedia pages, the main headings and subheadings do not have ID attributes, which is why I use the :not([id]) pseudo-selector. It'll look for all h2 and h3 elements that do not have IDs. Each nested span with the heading title has the .mw-headline class.
If you only want the h3 elements (each section's sub-heading), you can just have:
page.css('h3:not([id]) > span.mw-headline')
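Note that the selectors above return a flat list, so the grouping the question asks for still needs a pass over the headings in document order. Here is a sketch of that pass in Python with the standard library only (not Nokogiri); the names HeadingOutline and group_headings are made up for this example, and any extra text inside a heading, such as an edit link, would end up in the title:

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Groups every <h3> under the closest preceding <h2>."""

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (h2 title, [h3 titles]) in document order
        self.current = None  # heading tag currently being read, if any
        self.parts = []      # text pieces of that heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.current, self.parts = tag, []

    def handle_data(self, data):
        if self.current:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        title = " ".join("".join(self.parts).split())
        if tag == "h2":
            self.outline.append((title, []))
        elif self.outline:  # an <h3> before any <h2> is ignored
            self.outline[-1][1].append(title)
        self.current = None


def group_headings(html):
    """Returns [(main heading, [sub-headings]), ...] for the given page source."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline
```

The same loop translates directly to Ruby: walk the h2 and h3 nodes in document order and start a new group whenever you hit an h2.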

<s> vs <del> in HTML

So I'm writing a list of todos in HTML. Some of these todos are, well, done.
<h1>TODO</h1>
<ul>
<li>I'm still to be done</li>
<li>I'm done</li>
</ul>
Now I'm wondering what the best way to mark up and style these items is. When an item is done, I could mark it up with <s>, which seems much more acceptable these days, as it's 'no longer relevant':
<li><s>I'm done</s></li>
I could go for <del> as, in some sense, the user has edited the list and set this item for removal (kinda):
<li><del>I'm done</del></li>
I could add a class to say what this item means:
<li class="todo done">I'm done</li>
Or some combination of the three. Or something else entirely.
My concerns are accessibility and semantics - I want the markup to convey the meaning of a 'done' item.
What's the best way of doing this?
Both answers are great. s and del are indeed semantic tags, so in principle they are good for accessibility. Unfortunately, no browsers surface those tags in the accessibility tree, so screen readers cannot convey any information about them. But you can work around it with CSS. There's a simple blog post about the <mark> element, which is also semantic and also does not convey info to screen readers, and the post gives a workaround. You can do something similar with s or del. That is, use either s or del, but also use CSS to augment the tag for screen reader users.
A class has no semantic meaning, so if accessibility and semantics are important to you, then you have to use del or s.
Whether you should use s or del is not easy to say. For s the specs have this example:
<p>Buy our Iced Tea and Lemonade!</p>
<p><s>Recommended retail price: $3.99 per bottle</s></p>
<p><strong>Now selling for just $2.99 a bottle!</strong></p>
So you want to show the reader the old information, but also tell them that the information is not relevant anymore.
Your TODO example is covered in the specs, in the del section (4.6.2 The del element):
<h1>To Do</h1>
<ul>
<li>Empty the dishwasher</li>
<li><del datetime="2009-10-11T01:25-07:00">Watch Walter Lewin's lectures</del></li>
<li><del datetime="2009-10-10T23:38-07:00">Download more tracks</del></li>
<li>Buy a printer</li>
</ul>
I think the main difference is that you have more possibilities to add semantic information to del than to s. So del is the better fit when the fact that something was deleted is really important, e.g. for tracking changes (diff tool), for a TODO list, or to show that a part of a specification was removed. And s is more of an informal, additional hint.
In plain HTML, class values don’t convey any meaning. You can make use of classes in addition to making use of semantic elements (classes can be useful for CSS, JavaScript, documentation purposes etc.), but you should not use classes instead of semantic elements.
With del and s, you found the two relevant elements that can make sense in this context. Which one to use? It, most likely, doesn’t make a practical difference.
The semantic differences are subtle:
With del, you convey that the content was removed from the document (semantically, it doesn’t matter if the content is still visible or if you visually hide it with CSS). It represents the actual edit to the document.
With s, you convey that the content is no longer relevant.
I guess the purpose why you show the done items can help in making the choice which element to use:
If the todo list could work as well without showing the done items (so showing them has the purpose of tracking changes, or detecting errors), go for del. In theory, a user agent could offer viewing the list at a specific point in time (making use of the datetime attribute), and a default view could only show the actual current content (i.e., without any content in del).
If it’s relevant for the meaning of the todo list to show what has already been done, go for s.
If there is a relevant difference in your case between removing an item (e.g., because it was added by accident, or didn’t make sense etc.) and marking an item as done, then you might want to use del for the former and s for the latter. You can ignore this if there is no relevant difference (e.g., if you would not keep showing items from the former case anyway).
(Side note: If using del, it would make sense to also use ins for adding new todo items.)

When is it appropriate to use semantic elements?

I do not really understand how (often) I should use semantic elements like time, header, footer.
Do article, nav, figure, time, etc just replace div id="post", div id="navbar", div id="illustration", span id="time" and, therefore, I should use them only when I need wrap some content for styling purposes or they are something more than that?
General: You should use them as often as you can.
If you are developing an intranet application, most of the time no one will care about it.
The benefit of semantic markup shows in the public area (the Internet):
A search engine wants to know about semantics, so it can better understand your page
A screen reader can say "This is a blockquote", "This is a navigation", "This is a footer"
Semantics are not for styling a page; semantics are for understanding a page. Blind people don't see your CSS, for example, but a well-structured website works better with text-to-speech and helps blind people.
Also take a look at https://schema.org/
And what about the following example : there is a story with some
dates on a webpage. Do I have to put these dates inside time tags?
Yes. Take a look at the Mozilla documentation here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time
According to it, you can do this:
<time datetime="2001-05-15 19:00">May 15</time>
You can even do it without the time of day (which is optional according to the documentation):
<time datetime="2001-05-15">May 15</time>
A <footer> typically contains the author of the document, copyright information, links to terms of use, contact information, etc. The same applies to a <header>. You can use multiple <footer> and <header> elements in your HTML file.
The <time> element can be used as a way to encode dates and times in a machine-readable way so that, for example, user agents can offer to add birthday reminders or scheduled events to the user's calendar, and search engines can produce smarter search results.
An example:
<p>We open at <time>10:00</time> every morning.</p>
<p>I have a date on <time datetime="2008-02-14">Valentines day</time>.</p>
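To illustrate what "machine-readable" means here, this is a small sketch of how a program could read these elements: it takes the datetime attribute when present and falls back to the element's text otherwise. It uses only Python's standard library, and the names TimeCollector and collect_times are made up for this example:

```python
from html.parser import HTMLParser


class TimeCollector(HTMLParser):
    """Collects (machine-readable value, visible text) for every <time> element."""

    def __init__(self):
        super().__init__()
        self.times = []
        self.attr = None   # datetime attribute of the <time> being read
        self.parts = None  # text pieces of the <time> being read

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            self.attr = dict(attrs).get("datetime")
            self.parts = []

    def handle_data(self, data):
        if self.parts is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "time" and self.parts is not None:
            text = "".join(self.parts).strip()
            # Without a datetime attribute, the text itself is the value.
            self.times.append((self.attr or text, text))
            self.parts = None


def collect_times(html):
    parser = TimeCollector()
    parser.feed(html)
    parser.close()
    return parser.times
```

For the two paragraphs above, the second element yields the value 2008-02-14 even though a human reader only sees "Valentines day".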
Here is a link to the W3Schools page:
Footer page
Time page
I hope this answers your question :)

Correct use of the <small> tag, or how to markup "less important" text

Yet another tag that was given new meaning in HTML5, <small> apparently lives on:
http://www.w3.org/TR/html-markup/small.html#small
The small element represents so-called “fine print” or “small print”,
such as legal disclaimers and caveats.
This unofficial reference seems to take it a little further:
http://html5doctor.com/small-hr-element/
<small> is now for side comments, which are the inline equivalent of
<aside> — content which is not the main focus of the page. A common
example is inline legalese, such as a copyright statement in a page
footer, a disclaimer, or licensing information. It can also be used
for attribution.
I have a list of people I want to display, which includes their real name and nickname. The nickname is sort of an "aside", and I want to style it with lighter text:
<li>Laurence Tureaud <small>(Mr.T)</small></li>
I'll need to do something like this for several sections of the site (people, products, locations), so I'm trying to develop a sensible standard. I know I can use <span class="quiet"> or something like that, but I'm trying to avoid arbitrary class names and use the correct HTML element (if there is one).
Is <small> appropriate for this, or is there another element or markup structure that would be appropriate?
The spec you're looking at is old; you should look at the HTML5 spec:
https://html.spec.whatwg.org/multipage/
I suggest <em> here instead of small:
<p>Laurence Tureaud also called <em>Mr. T</em> is famous for his role
in the tv series A-TEAM.</p>
<small> is not used commonly in an article sentence, but like this:
<footer>
<p>
Search articles about Laurence Tureaud,
<small>or try articles about A-TEAM.</small>
</p>
</footer>
<footer>
<p>
Call the Laurence Tureaud's "life trainer chat line" at
555-1122334455 <small>($1.99 for 1 minute)</small>
</p>
</footer>
Article sentence:
<p>
My job is very interesting and I love it: I work in an office
<small>(123 St. Rome, Italy)</small> with a lot of funny guys that share
my exact interests.
</p>
Personally, I don't think <small> is the correct tag for this, as it suggests the text will be physically smaller, which doesn't seem to be the case in your example. I think using a <span> would be more appropriate, or possibly the HTML <aside>. http://dev.w3.org/html5//spec-author-view/the-aside-element.html
You should ask yourself how you would prefer the document to be displayed when style sheets are not applied. Select the markup according to this, instead of scholarly or scholastic theories about “semantic markup” (see my pragmatic guide to HTML).
If smaller size is what you want, then use <small> or <font size=2>. The former is more concise and easier to style, and it is more “resistant” (on some browsers, settings that tell the browser to ignore font sizes specified on web pages do not remove the effect of small). So it’s a rather simple choice.
On the other hand, font size variation inside a line of text is typographically questionable. In printed matter, it is much more often accidental, an error, rather than intentional. Putting something in parentheses is normally a sufficient indication of being somehow secondary