What does "semantically correct" mean? - html

I have seen this phrase a lot in discussions about CSS. What does "semantically correct" mean?

Labeling correctly
It means that you're calling something what it actually is. The classic example is that if something is a table, it should contain rows and columns of data. To use that for layout is semantically incorrect - you're saying "this is a table" when it's not.
Another example: a list (<ul> or <ol>) should generally be used to group similar items (<li>). You could use a div for the group and a <span> for each item, and style each span to be on a separate line with a bullet point, and it might look the way you want. But "this is a list" conveys more information.
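As a rough illustration of that difference (the shopping items and class names are invented for the example), both fragments below could be styled to look alike, but only the second one tells a browser or screen reader that it is a list of three items:

```html
<!-- Non-semantic: looks like a list only after styling -->
<div class="list">
  <span class="item">Milk</span>
  <span class="item">Eggs</span>
  <span class="item">Bread</span>
</div>

<!-- Semantic: the markup itself says "this is a list" -->
<ul>
  <li>Milk</li>
  <li>Eggs</li>
  <li>Bread</li>
</ul>
```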
Fits the ideal behind HTML
HTML stands for "HyperText Markup Language"; its purpose is to mark up, or label, your content. The more accurately you mark it up, the better. New elements are being introduced in HTML5 to more accurately label common web page parts, such as headers and footers.
Makes it more useful
All of this semantic labeling helps machines parse your content, which helps users. For instance:
Knowing what your elements are lets browsers use sensible defaults for how they should look and behave. This means you have less customization work to do and are more likely to get consistent results in different browsers.
Browsers can correctly apply your CSS (Cascading Style Sheets), describing how each type of content should look. You can offer alternative styles, or users can use their own; as long as you've labeled your elements semantically, rules like "I want headlines to be huge" will be usable.
Screen readers for the blind can help them fill out a form more easily if the logical sections are broken into fieldsets with one legend for each one. A blind user can hear the legend text and decide, "oh, I can skip this section," just as a sighted user might do by reading it.
Mobile phones can switch to a numeric keyboard when they see a form input of type="tel" (for telephone numbers).
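A minimal sketch of the last two points, assuming a simple sign-up form (the action URL, field names and legend texts are placeholders, not taken from any real site):

```html
<form action="/signup" method="post">
  <!-- Each logical section gets its own fieldset and legend -->
  <fieldset>
    <legend>Contact details</legend>
    <label for="name">Name</label>
    <input type="text" id="name" name="name">
    <label for="phone">Phone</label>
    <!-- type="tel" lets phones offer a numeric keypad -->
    <input type="tel" id="phone" name="phone">
  </fieldset>
  <fieldset>
    <legend>Newsletter (optional)</legend>
    <label for="email">Email</label>
    <input type="email" id="email" name="email">
  </fieldset>
  <button type="submit">Sign up</button>
</form>
```

A screen reader can announce "Newsletter (optional)" when the user enters the second group, which is what lets them decide to skip it.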

Semantics basically means "The study of meaning".
Usually, when people talk about code being semantically correct, they mean code that accurately describes the content it marks up.
In (x)HTML, there are certain tags that give meaning to the content they contain. For example:
An H1 tag describes the data it contains as a level-1 heading. An H2 tag describes the data it contains as a level-2 heading. The implied meaning behind this is that each H2 under an H1 is in some way related (i.e. heading and subheading).
When you code in a semantic way, you basically give meaning to the data you're describing.
Consider the following 2 samples of semantic VS non-semantic:
<h1>Heading</h1>
<h2>Subheading</h2>
VS a non-semantic equivalent:
<p><strong>Heading</strong></p>
<p><em>Subheading</em></p>
Sometimes you might hear people in a debate say "You're just talking semantics now". This usually means that both sides are expressing the same meaning and only disagree about the words used.

"Semantically correct usage of elements means that you use them for what they are meant to be used for. It means that you use tables for tabular data but not for layout, it means that you use lists for listing things, strong and em for giving text an emphasis, and the like."
From: http://www.codingforums.com/archive/index.php/t-53165.html

HTML elements have meaning. "Semantically correct" means that your elements mean what they are supposed to.
For instance, your definition lists are represented by <dl> elements in code, your abbreviations are <abbr>s, etc.

It means that HTML elements are used in the right context (for example, tables are not used for layout purposes), CSS classes are named in a human-understandable way, and the document itself has a structure that can be processed by non-browser clients such as screen readers and automatic parsers trying to extract the information and its structure from the document.
For example, you use lists to build menus. This way a screen reader will know that these list items are parts of the same menu level, so it will read them in sequence for the person to make a choice.

I've never heard it in a purely CSS context, but when talking about CSS and HTML, it means using the proper tags (for example, avoiding the use of the table tag for non-tabular data), providing proper values for the class and id that identify what the contained data is (and using microformats as appropriate), and so on.
It's all about making sure that your data can be understood by humans (everything is displayed properly) and computers (everything is properly identified and marked up).

Related

div tag and nav tag uses in HTML

Why use HTML5 semantic tags like header, section, nav, and article instead of simply a div with the preferred CSS applied to it?
I created a webpage and used those tags, but they do not behave any differently from a div. What is their main purpose?
Is it only about giving the tags appropriate names, or is there more to it than that?
Please explain. I have gone through many sites, but I could not find these basics.
The Oxford Dictionary states:
semantics: the branch of linguistics and logic concerned with meaning.
As their name says, these tags are meant to improve the meaning of your web page. Good semantics plays an important role in the automated processing of documents. This automated processing happens more often than you realize: every website ranking produced by a search engine is derived from automated processing of all the websites out there.
If you visit a (well designed) web page, you as the human reader can immediately (visually) distinguish all the page elements and more importantly understand the content. In the top left you see the company logo, next to it is the site navigation, there is a search bar and some text about the company, a link to a product you can buy and a legal disclaimer at the bottom.
However, machines are dumb and cannot do this:
Looking at the same page as you, all the web crawler would see is an image, a list of anchor tags, a text node, an input field and an image with a link on it. At the bottom there is another text node.
Now, how should it know which part of the document you intended to be the navigation, the main article, or some not-so-important footnote? It can guess by analyzing your document structure using some common criteria that hint at a specific kind of element.
E.g. an ul list of internal links is most likely some kind of page navigation and the text at the end of the document is something necessary but not so important to the everyday viewer (the legal disclaimer).
Now imagine instead of a plain div, a nav element would be used – the machine immediately knows what the purpose of this element is:
// machine: okay, this structure looks like it might be a navigation element?
<div><ul><li><a href="internal_link">...</div>
// machine: ah, a navigation element!
<nav><ul><li><a>...</nav>
Now the text inside a main tag – this is clearly the most important information of the page! Over there to the left, that text node, the image and the anchor node all belong together, because they are grouped inside a section tag, and down at the bottom there is some text inside a footer element (they still don't know the meaning of that text, but now they can deduce it's some sort of fine print).
Example:
You, as the user (reading a page without seeing the actual markup), don't care if an element is enclosed in an <i> or <em> tag. In most browsers both of these tags will be rendered identically – as italic text – and as long as it stands out between the surrounding text it serves its purpose.
However, there is a big difference in terms of semantics:
<i> means italic - it's simply a presentational hint for the browser on how to render it (italic) and does not necessarily contain deeper semantic information.
<em> means emphasis - it indicates an important piece of information. Now the browser is not bound to the italic instruction any more, but could render it in italic or bold or underlined or in a different color... For visually impaired users, a screen reader can raise its voice - whatever method seems most suited in a specific situation to emphasise this important information.
Final thought:
Semantic tags are not the end. There are things like metadata, ontologies, resource description languages which go a step further and help connect data between different web pages and can even help create new knowledge!
E.g. Wikipedia is doing a really bad job at semantically presenting data.
https://en.wikipedia.org/wiki/Barack_Obama
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org/wiki/Joe_Biden
All three are persons who at some point in time were president of the USA.
All three articles contain a sidebar that displays this information, and you can compare them (by opening the pages and then switching back and forth), but they are not semantically described.
Instead, if Wikipedia used an ontology to describe a person: http://dbpedia.org/ontology/Person
<!-- President is a subclass of Politician which is a subclass of Person -->
<President>
<birthname>Barack Hussein Obama II</birthname>
<birthdate>1961-08-04</birthdate>
<headOf>country::USA</headOf>
<tenure>2009-01-20 – 2017-01-20</tenure>
</President>
Not only could you (and machines) now compare those three directly (on a dynamically generated page!), but you could even create new knowledge, e.g. show a list of all presidents of the United States - quite boring - but also cool stuff like: who are all the current world leaders, how many female world leaders do we have, who is the youngest leader, how many types of leaders are there (presidents/emperors/queens/dictators), who served the longest, how many of them are taller than 175cm and have brown eyes, etc.
In conclusion, good semantics is super cool (but also – on a technical level – hard to achieve and maintain).
There's a nice little article on HTML5 semantics on HTML5Doctor.
Semantics have been a part of HTML in some form or another. It helps you understand what's happening where on the page.
Earlier when <div> was used for pretty much everything, we still implemented semantics by giving it a "semantic" class name or an id name.
These tags help in proper structuring and understanding of the layout.
If you do,
<div class="nav"></div>
as opposed to,
<nav></nav>
OR
<div class="sidebar"></div>
as opposed to,
<aside></aside>
there's nothing wrong with the former, but the latter provides better readability for you as well as for crawlers, readers, etc.
With a div you have to give an id (or class) that tells what kind of content it holds: body, header, footer, etc.
With the semantic elements of HTML5, the element name itself clearly states what kind of content it holds and which part of the website it is for.
Semantic elements are <header>, <footer>, <section>, <aside>, etc.
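As a sketch of how these elements might fit together on one page (the headings, link targets and text are placeholders chosen for the example, and other arrangements are equally valid):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <header>
    <h1>Site name</h1>
    <nav>
      <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/about">About</a></li>
      </ul>
    </nav>
  </header>
  <main>
    <article>
      <h2>Article title</h2>
      <section>
        <h3>First part</h3>
        <p>Main content goes here.</p>
      </section>
    </article>
  </main>
  <aside>
    <p>Related links or a sidebar.</p>
  </aside>
  <footer>
    <p>Legal disclaimer.</p>
  </footer>
</body>
</html>
```

Without any CSS this renders much like the same structure built from divs, which is why the difference is not visible in the browser; it shows up for crawlers and assistive technology.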

HTML semantics and styling

My question is this – is it still best practice to use certain HTML tags even if you would then need to style them differently to how a browser interprets and displays those tags?
For example – the HTML5 <blockquote> tag will start on its own line, with default margin and padding, and with an indent.
However – if you do not wish there to be an indent, should you still use the <blockquote> tags in order to convey meaning to the browser and search engine, and then apply CSS to reduce or remove the indent, or should you just use a <p> tag, for example?
It’s not much effort to restyle the blockquote element, and I assume that it is important to use the tags that most accurately convey the meaning of their contents, but at the same time I do not want to get into a habit of writing extra code if it is considered best practice not to do so.
"I assume that it is important to use the tags that most accurately convey the meaning of their contents"
You assume correctly.
I would say yes, absolutely: always try to use the appropriate element for the type of content/purpose you intend. Elements have names/designations for a reason, so that your code can be structured in a way that makes semantic sense. Why is this important? Well, even ignoring SEO, in which it plays an important part, and ease of access regarding your code, this is the intended design of HTML.
Not using a specific element because it has default styling applied is not a sensible or logical course of action in this context, especially since, when you compare browser-specific styling, most elements have some default styles applied.
"I do not want to get into a habit of writing extra code if it is considered best practice not to do so"
Whether something is bad practice is to some degree subjective; otherwise it would simply be right or wrong. In this case it won't be game-breaking to not use the correct tags; however, you would be going against their intended use according to the specification, so it would most certainly be bad practice.
See:
Semantic HTML (Wikipedia)
Semantic Web (Wikipedia)
Yes, use the tags which are most aligned with the semantics of your content. Don't concern yourself with styling when constructing your HTML doc, as you could have different styles for different resolutions (smartphone, tablet, desktop) or even different media (web browser, screen reader, braille display, whatever devices come out in the future...)
The following talks about class names specifically, but it does drive home the importance of semantic correctness:
ref: http://www.w3.org/QA/Tips/goodclassnames
HTML5 gives us many new elements to describe parts of a Web page, such as header, footer, nav, section, article, aside and so on. These exist because we Web developers actually wanted such semantics. How did the authors of the HTML5 specification know this? Because in 2005 Google analyzed 1 billion pages to see what authors were using as class names on divs and other elements. More recently, in 2008, Opera MAMA analyzed 3 million URLs to see the top class names and top IDs used in the wild. These analyses revealed that authors wanted to mark up these areas of the page but had no elements to do so, other than the humble and generic div, to which they then added descriptive classes and IDs.
(HTML5 Doctor has many articles about HTML5 semantics.)
You use the blockquote tag to denote that the content is a quote.
The default style provided by the mark-up communicates information to the one interpreting or decoding the sign.
http://html5doctor.com/blockquote-q-cite/
It is not only a matter of the default style of the mark-up element.
By using CSS you can create a variation of that basic style, and this is certainly valid for the blockquote. The default style does not provide quotation marks, just margins. Quotation marks come from CSS: the quotes property defines which marks to use, and the content property inserts them. And this is just one way of styling a quote. You can have a left border on the quote, italic text, ...
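blockquote {
  quotes: "\201C" "\201D" "\2018" "\2019";
}
/* The quotes property only defines the marks; content inserts them */
blockquote::before { content: open-quote; }
blockquote::after { content: close-quote; }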
http://css-tricks.com/snippets/css/simple-and-nice-blockquote-styling/
This is largely opinion-based, and opinions on matters like this are often expressed almost as religious convictions, with little or no factual evidence or arguments presented. However, I will try to deal with the issue on a technical basis, focusing on the example given. (It’s a good example and does not take us too easily too deep into semantic jungles.)
The blockquote element is defined as structural rather than semantic. It does not say anything about the meaning of its content (it might be about apples, God, or screwdrivers), only about its structural relation with the enclosing document: the content is a copy of some content taken from an external source, i.e. from outside the document – that is, a quotation – and it is a “block quotation” as opposed to an “inline quotation”, which means little (if anything) more than that its default rendering is a block.
In practice, blockquote has widely been used simply to indent text, especially in the old days, before CSS was generally available. This is one reason why there is no sign of search engines (or browsers) making any use of the structural relationship (or “semantics”, if you want to put it that way) nominally expressed by the element. There are many things that search engines could do with such information. They just don’t. It is questionable if they ever will: too much content would be incorrectly treated as quoted, and there is too much content that is actually quoted but not marked with blockquote.
Thus, the effect of blockquote is in practice just the default rendering, with certain margins on all sides. When CSS is disabled or overridden for some reason, this is the rendering, and you should try to use HTML markup that gives you tolerable rendering even without CSS. This can be seen as the good reason for using blockquote for block quotations. On the other hand, if your block quotations have been clearly indicated as quotations by other means, such as introductory phrases or headings, this argument does not really apply.
On the other hand, there is no good reason not to use blockquote just because you don’t want the indent. In that case, you must have some other method of visually distinguishing the block quotation from other content, such as using a background color, a different font, or maybe large decorative quotation marks. You would naturally be using CSS for that, and compared to what you do with it, setting margin-left: 0 is a trivial thing to do.
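A minimal sketch of that approach, assuming a left border and italics as the alternative visual cue (the colours, sizes, quoted text and cite URL are placeholders; in a full page the style element would sit in the document head):

```html
<style>
  /* Remove the default indent and mark the quotation some other way */
  blockquote {
    margin-left: 0;
    padding-left: 1em;
    border-left: 4px solid #999;
    font-style: italic;
  }
</style>

<p>As the original source puts it:</p>
<blockquote cite="https://example.com/source">
  <p>Quoted text taken from an external source.</p>
</blockquote>
```

The markup stays a blockquote; only one extra declaration is needed to drop the indent.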
Moreover, even though browsers and general search engines ignore the structural meaning of blockquote, you (or your organization) may decide to do otherwise. You can choose to mark block quotations uniformly in order to be able to find them easily (for statistical, checking, or other purposes). You could achieve the same by using, say, class=quote consistently, but why not use an element when you can?

What is the actual meaning of separation of content and presentation?

What is the actual meaning of separation of content and presentation?
Does it just mean avoiding inline CSS?
Does it mean that the design should be able to be manipulated without changing the HTML?
Can we really make any change in design from CSS only?
If we want to change the size of images, we have to go into the HTML code.
If we want to add one more line break in a paragraph, we again have to go into the HTML code.
If we want to add one more separator at some place, we again have to go into the HTML code.
Which X/HTML tags should we avoid in order to keep content and presentation separate?
Is separation of content and presentation also helpful for accessibility/screen reader users? ... and for programmers/developers/designers?
When defining what is content and presentation, see your HTML document as a data container. Then ask yourself the following on each element and attribute:
Does the attribute/element represent a meaningful entity in my data?
For example, are the words inside the <b> tag in bold simply for display purposes, or did I want to add emphasis to that data?
Am I using the proper attribute/element to properly represent the type of data I want to represent?
Since I want to add emphasis to that particular section, I should use <em> (it doesn't mean italic, it means emphasis, and can be styled bold) or <strong>, depending on the level of emphasis wanted.
Am I using the attribute/element only for display purposes? If yes, can the element be removed and the parent element styled using CSS?
Sometimes a presentational tag can simply be replaced by CSS rules on the parent element. In that case, the presentational tag should be removed.
After asking yourself these three simple questions, you are usually able to make a pretty informed decision. An example:
Original Code:
<label for="name"><b>Name:</b></label>
Checking the <b> tag...
Does the attribute/element represent a meaningful entity in my data?
No, the tag doesn't represent a data node. It is there purely for presentation.
Am I using the proper attribute/element to properly represent the type of data I want to represent?
<b> is used for presentation of bold elements.
Am I using the attribute/element only for display purposes? If yes, can the element be removed and the parent element styled using CSS?
Since <b> is presentational and I am using it for presentation, yes. And since the <b> element affects the whole of <label>, it can be removed and the style applied to the <label> instead.
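The result of that reasoning might look like the sketch below. The input element is an assumption added so the label has something to point at, and styling every label on the page is only one possible selector choice:

```html
<style>
  /* Presentation lives in the stylesheet, not in the markup */
  label {
    font-weight: bold;
  }
</style>

<label for="name">Name:</label>
<input type="text" id="name" name="name">
```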
Semantic HTML's goal is not to simplify design and redesign or to avoid inline styling, but to help a parser understand what each particular tag represents in your document. That way, applications (e.g. search engines) can intelligently decide what your content signifies and classify it accordingly.
Therefore, it makes sense to use the CSS content property to add quotes around text located in a <q> tag (the quotes have no value to the data contained in your document other than presentation), but it makes no sense to use the same CSS property to add a © symbol in your footer, as that does have a value in your data.
The same applies to attributes. Using the width and height attributes on an <img> tag representing an icon at size 16x16 makes semantic sense, as the size is important for understanding the meaning of the <img> tag (an icon can have different representations depending on the size it is displayed at). Using the same attributes on an <img> tag representing a thumbnail of a larger image does not.
Sometimes you will need to add non-semantic elements to be able to achieve your wanted presentation, but usually those are avoidable.
There are no wrong elements. There are wrong uses of particular elements. <b> should not be used when adding emphasis. <small> should be used for legal sub-text, not to make text smaller (see HTML5 - Section 4.6.4 for why), etc... All elements have a particular usage scenario and they all represent data (minus presentational elements, but they do have a use in some cases). No elements should be set aside.
Attributes are a different thing. Most of the attributes are presentational in nature. Attributes such as <img border> and <body bgcolor> rarely have any significance in the data you are representing, therefore you should not use them (except in those rare cases).
Search engines are a good example of why semantic documents are so important. Microformats are a predefined set of elements and classes which you can use to represent data in a way that search engines will understand. The product price information in Google searches is an example of semantics at work.
Using the predefined rules of established standards to store information in your document allows third-party programs to understand what would otherwise seem to be a wall of text, without resorting to heuristic algorithms that may be prone to failure. It also helps screen readers and other accessibility applications to more easily understand the context in which the information is presented. It also greatly helps the maintainability of your markup, as everything is tied to a set definition.
The best example is probably the CSS Zen Garden.
The goal of this site is to showcase what is possible with CSS-based design only, with a strict separation of content from the design. Style sheets contributed by various graphic designers are used to change the visual presentation of a single HTML file, producing hundreds of different designs. The HTML markup itself never changes between the different designs.
On each design page, you'd have a link to view the CSS file of that design.
What is the actual meaning of separation of content and presentation?
It is a design philosophy rather than something concrete. In general, it means that you should preserve the semantics of the content: think of your content as a piece of structured information. That also means that you should keep all aesthetic details away from this structured information.
Does it just mean avoiding inline CSS?
As noted above, inline styles have nothing to do with the semantics of your content and should be avoided at all costs. But it isn't just that.
Does it mean that, once the HTML has been written for a design, any later change to the design should be made with CSS only, without touching the HTML?
Unfortunately, it is not always possible to achieve some concrete aesthetic goals without modifying the underlying markup; CSS3 tries its best to address these issues.
Which X/HTML tag we should avoid to use to keep separation of content and presentation?
Look for deprecated tags in W3C HTML 4.01 / XHTML 1.0 Reference
Is separation of content and presentation also helpful for accessibility/screen reader users?
Surely. Better structured information generally remains readable even if certain browsers render styles incorrectly (or do not render them at all). Such content may also look better on printed media (though print styles may be applied to achieve even better aesthetics -- they, again, have nothing to do with content semantics).
Is separation of content and presentation also helpful for programmer/developer/designer ?
Of course. The separation of content and presentation has its roots in a more general philosophy, the separation of concerns. Everybody benefits from the separation: the content supplier does not have to be a good designer and vice versa.
Putting in line breaks at certain points is inevitable; there will usually be some overlap of presentation and content. You should always aim for perfect separation though.
Take the other extreme: a page containing loads and loads of tables that are used for layout purposes only. This is the definite anti-pattern that should be avoided at all costs. The content plays second fiddle to the layout here; it's often not in the right order and therefore hardly machine readable. Content that is not machine readable is bad for accessibility and bad for the page's search engine ranking.
By marking up content without concern for presentation, you are first and foremost making it machine readable. You are then also in a position to serve the same content to different clients in different formats, say in a mobile-optimized version. You can also change the presentation easily without having to mess with the HTML files, say for a big redesign.
Another benefit that comes naturally by separating content and presentation (HTML - CSS files) is that you have less to type and less to maintain, plus your pages can have a consistent styling applied very easily. Contrast thousands of inline styles vs. one style definition in one CSS file, which is "naturally" applied to all elements with the same "meaning" (markup).
Ideally your (X)HTML consists only of meaningful, semantic markup and your CSS of styles using this markup for its selectors. In the real world you'll often mix classes and IDs into your markup that add no extra meaning, because you need these extra "hooks" to style everything the way you want to. But even here there's a difference between class="blue right-aligned" and class="contact-info secondary". Always try to add meaning to the content, not style. Balancing this is quite an art in itself. :)
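To make the contrast between those two class attributes concrete, here is a small sketch (the email address and the chosen colour are invented for the example):

```html
<!-- Class names describe appearance: they become wrong as soon as the design changes -->
<div class="blue right-aligned">
  <p>support@example.com</p>
</div>

<!-- Class names describe the content: the stylesheet decides how it looks -->
<div class="contact-info secondary">
  <p>support@example.com</p>
</div>

<style>
  .contact-info { text-align: right; }
  .contact-info.secondary { color: #336699; }
</style>
```

If the redesign calls for green, left-aligned contact details, only the two CSS rules change in the second version, while the first would need its markup edited on every page.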

Quantify the semantic value of <p> as opposed to <div>

I'm transforming some XML, which I have no control over, to XHTML. The XML schema defines a <para> tag for paragraphs and <unordered-list> and <ordered-list> for lists.
Frequently in this XML, I find lists nested within paragraphs. So, a straight-forward transformation causes <ul>s to get nested within <p>s, which is illegal in XHTML.
I've created a list of ways to deal with it and here are the most obvious:
Just don't worry about it. The browsers will do fine. Who cares. (I don't like this option, but it's an option!)
Write a fancy-pants component to my transform that makes sure all <para> tags get closed before unordered lists start, and re-opened afterward. (I like this option the most, but it's complicated due to multiple levels of nesting, and we may not have the budget for this)
Just transform <para> to <div> and set the margins on the divs so it looks like a paragraph in the browser. This is the easiest solution that emits valid XHTML, but it takes away from the semantic value of the markup.
My questions are:
How much value do I lose if I go with option 3?
Does it really matter?
What is the actual effect on the user experience?
If you can cite references, please do (this is easy to speculate on). For example, I was thinking it might affect search results from a Google Search Appliance that we are using.
If search terms appear in divs, do they carry less weight?
Or is there less of an association between them and preceding header tags?
How can I find this out?
I've come up against this too.
Personally, I consider it a grave mistake on part of the standard that a p cannot contain lists. I think it's typographically legal, so it should be legal in what was originally intended to be a markup for text.
I may be flamed for this, but XHTML has crashed and burned in the real world, regardless of whether it was a good idea or not. The often horrible tag soup that is today's HTML markup will continue to survive for a goodly long time, if only because bad markup and lenient browsers will continue to perpetuate each other forever.
Thus, I tend to go with Option 1.
Option 3 is also viable, in my opinion. While I don't have proof, I'm pretty sure no search engine is crazy enough to actually put any trust in most of the formatting tags we apply to our HTML. meta and a tags are obvious exceptions, of course.
First of all, unless you set every CSS property available now plus every one possibly available in the future, then you can't guarantee your <div> will match up, WRT styles, with <p>. (Though I agree you can get close and this is probably good enough, but read on.) I don't know of any visual browsers or other tools that would seriously treat them differently, but this is just as much an artifact, IMHO, of the current widespread loose interpretation on the web, as it is of them being close in meaning.
Is <ul> the right transformation for every <unordered-list> in your source data? If they are always displayed as block-level content instead of 1) an, 2) inline, 3) list; then that's a safe bet. If so, you can break the paragraph into two (and wrap the whole thing in <div> if you like).
Example input:
<para>Yadda yadda: <unordered-list/> And so fin.</para>
Output:
<div>
<p>Yadda yadda:</p>
<ul/>
<p>And so fin.</p>
</div>
The good news is that any of these 3 options would work.
There are many, many people on SO that will tell you "if it works, forget semantics and do it." So Option 1 would probably be a site favorite if everyone here was asked.
Option 2 is my favorite and would be the best semantically. I would definitely do it if time/budget allows.
However, Option 3 is a close second, and hopefully this will answer your question: the <div> element and the <p> element are near-identical. In fact, the biggest difference is semantics. In most browsers' default style sheets both are display: block, and the only other difference is that <p> gets default top and bottom margins.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (with submenus possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique contents, so that these nodes are prioritized for other pages.
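Here is a rough sketch of the "strip what repeats across pages" part of the steps above. It is simplified on purpose: it compares flat text blocks instead of walking full DOM trees, it assumes the pages have already been downloaded into a dict (the URLs and the function names are made up for illustration), and it skips the scoring idea entirely.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlockCollector(HTMLParser):
    """Collects the whitespace-normalized text chunks of one HTML document."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    collector = TextBlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def unique_blocks(pages):
    """Map each URL to the text blocks that occur on no other page."""
    per_page = {url: text_blocks(html) for url, html in pages.items()}
    # Count on how many pages each block occurs (set() ignores repeats within a page).
    seen_on = Counter(b for blocks in per_page.values() for b in set(blocks))
    return {url: [b for b in blocks if seen_on[b] == 1]
            for url, blocks in per_page.items()}


# Hypothetical, already-fetched pages from the same site.
pages = {
    "/a": "<div id='nav'><a href='/'>Home</a></div><p>First article body.</p>",
    "/b": "<div id='nav'><a href='/'>Home</a></div><p>Second article body.</p>",
}
print(unique_blocks(pages))
```

The shared "Home" navigation text drops out because it appears on both pages, and what is left per URL is the candidate content for further processing.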
Sometimes a page defines a stylesheet for the CSS 'print' media type. Its intended use is for printing the page, e.g. via 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element such as a div as a document), gather some properties of each section, and convert them to a vector. (As other people suggested, these could be the number of words, number of links, number of images; the more the better.)
First, start with a large set of documents (100-1000) in which you have already marked which part is the main part. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
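To make the "convert it to a vector" step concrete, here is a small sketch that turns every <div> into a [words, links, images] vector. The choice of these three features and the function name are just assumptions for illustration; training and running the SVM or Bayesian classifier is not shown.

```python
from html.parser import HTMLParser


class DivFeatureExtractor(HTMLParser):
    """Builds one [words, links, images] vector per <div>, nested content included."""

    def __init__(self):
        super().__init__()
        self._open = []     # vectors of the <div>s that are currently open
        self.vectors = []   # finished vectors, in the order the <div>s close

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._open.append([0, 0, 0])
        elif tag == "a":
            for vec in self._open:
                vec[1] += 1
        elif tag == "img":
            for vec in self._open:
                vec[2] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._open:
            self.vectors.append(self._open.pop())

    def handle_data(self, data):
        words = len(data.split())
        for vec in self._open:
            vec[0] += words


def div_feature_vectors(html):
    extractor = DivFeatureExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.vectors


sample = ('<div><a href="#">Home</a> <a href="#">About</a></div>'
          '<div><p>one two three four five</p><img src="x.png"></div>')
print(div_feature_vectors(sample))  # [[2, 2, 0], [5, 0, 1]]
```

Each vector, paired with a hand-made "main content / not main content" label, would be one training example for whichever classifier you pick.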
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
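A quick sketch of that heuristic, under some assumptions of mine: text is credited only to the innermost enclosing <div> (otherwise a page-wide wrapper div would always win), and the score is simply words outside links minus words inside links. The names and the scoring formula are illustrative, not a tested recipe.

```python
from html.parser import HTMLParser


class DivScorer(HTMLParser):
    """For every <div>, counts words outside links and words inside links."""

    def __init__(self):
        super().__init__()
        self._open = []       # stack of [label, plain_words, link_words]
        self._link_depth = 0
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attr = dict(attrs)
            label = attr.get("id") or attr.get("class") or "(unnamed div)"
            self._open.append([label, 0, 0])
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._open:
            self.results.append(tuple(self._open.pop()))
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._open:
            # Credit the text to the innermost open <div> only.
            column = 2 if self._link_depth else 1
            self._open[-1][column] += len(data.split())


def best_div(html):
    """Return (label, plain_words, link_words) of the best-scoring <div>, or None."""
    scorer = DivScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.results, key=lambda r: r[1] - r[2], default=None)


page = """
<div id="page">
  <div id="nav"><a href="/">Home</a> <a href="/about">About us</a></div>
  <div id="content"><p>This is the long article text with
      <a href="/x">one link</a> in it.</p></div>
  <div id="ad">Buy now!</div>
</div>
"""
print(best_div(page))  # ('content', 9, 2)
```

On real pages you would probably want to weight the link penalty and set a minimum length, since a short ad with no links still gets a small positive score here.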
The content area is almost always the area with the greatest width on the page.
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
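Roughly, the "first and last element with real sentences" idea could look like the sketch below. It assumes the page has already been reduced to a list of text blocks in document order, and the "at least five words and ends in . ! or ?" test is my own stand-in for "sentence with punctuation". The special case for Hn headers directly before the first sentence is not handled.

```python
def looks_like_sentence(block, min_words=5):
    """Crude test: enough words, and ends in sentence punctuation."""
    # Ignore trailing quotes or brackets before checking the final character.
    text = block.strip().rstrip("\"')]")
    return len(text.split()) >= min_words and text.endswith((".", "!", "?"))


def content_span(blocks):
    """Return everything from the first to the last sentence-like block."""
    hits = [i for i, block in enumerate(blocks) if looks_like_sentence(block)]
    if not hits:
        return []
    return blocks[hits[0]:hits[-1] + 1]


# Hypothetical text blocks, in document order.
blocks = [
    "Home", "About", "Contact",
    "A Headline",
    "The first paragraph has several words, and a comma.",
    "Subheading",
    "Another paragraph follows here, ending properly.",
    "Copyright 2010",
]
print(content_span(blocks))
```

With this input the menu entries, the headline, and the footer fall outside the span, while the subheading between the two paragraphs is kept.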
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites use a blogging platform.
So I would create a set of rules by which I would search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
WordPress posts are typically marked by (this can vary with the theme):
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back on the other solutions: identifying the biggest chunk of text and so on.
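As a sketch of such a rule set: the snippet below pulls the text out of the first <div> whose class list contains one of the known names, and returns None when nothing matches so the caller can fall back to another method. Only the two class names above come from the platforms; the function and variable names are made up.

```python
from html.parser import HTMLParser

# Class names from the two rules above; extend per platform as needed.
KNOWN_CONTENT_CLASSES = {"entry", "post-body"}


class ClassContentExtractor(HTMLParser):
    """Collects the text inside the first <div> carrying a known class."""

    def __init__(self, classes):
        super().__init__()
        self.classes = set(classes)
        self.depth = 0        # <div> nesting depth inside the match, 0 = outside
        self.done = False     # True once the first matching <div> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1   # nested <div> inside the matched one
        else:
            names = (dict(attrs).get("class") or "").split()
            if self.classes.intersection(names):
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_class(html, classes=KNOWN_CONTENT_CLASSES):
    """Return the matched text with whitespace collapsed, or None if no rule hit."""
    extractor = ClassContentExtractor(classes)
    extractor.feed(html)
    extractor.close()
    text = " ".join("".join(extractor.parts).split())
    return text or None


page = ('<div class="sidebar">Links</div>'
        '<div class="post-body"><p>Hello <b>world</b>.</p></div>')
print(extract_by_class(page))  # Hello world.
```

A None result is the signal to try one of the generic approaches instead.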
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in code that shows how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the Readability.js mentioned above.