So, as the title suggests, I'm curious how much HTML5 outlining helps boost SEO in general.
There are a lot of tutorials and explanations of what HTML5 will do for XYZ, and a lot of how-tos, but for some of the pluses that HTML5 brings (now and in the future), there isn't a clear mention (that I know of) of how much better it makes, or will make, things.
I work in a... normal team, I guess, where we have dedicated graphic designers, coders, programmers, etc., and part of this team, of course, are the SEO guys.
Now, running MANY websites through the document outline tool shows that their designers/developers are using HTML5, but their document outlines are all over the place and even contain unnamed sections.
My guess is that they're just using the short HTML5 doctype, <!DOCTYPE HTML>, and rolling with it instead of typing the long doctype and everything therein.
But that said, and as opposed to the regular "old" rules and debates about H1 tags etc.:
Since many of HTML5's features are already in play, and most current browsers support them (at different levels), does it hurt to have semi-messed-up, unnamed sections in your outline?
Overall, if I convert my page from the HTML of old to the new HTML5 standards with proper outlining etc., will it improve my standings?
As a simple example: if my current page ranking is, say, 5/100 on Google,
will implementing a better, newer document outline with HTML5 bring me higher, say to 3/100 or even 1/100?
You are under the mistaken impression that the formatting of the code, or the use of standards, has something to do with how search engines rank your page. This isn't actually the case.
What semantic tags/classes/ids, validating documents, and standards-compliant markup do is make it a lot easier for search engines to interpret the page properly.
The content, and its level, order and importance (say, via the header tags), indicate how relevant it is; a properly formed document only helps in recognizing the different sections and has no inherent effect on how well it's ranked.
In the end, though, SEO is partly guessing how smart the search engine algorithms are. One might, for instance, assume that Google gives some consideration to whether a page uses HTML4 or HTML5 markup: since HTML5 is more recent and HTML4 is outdated, HTML5 might get some leverage over HTML4 markup.
There is no boost from using HTML5. Semantic markup helps SEO but the version you use, or format (as in microformats) does not. It's the content that gets ranked, not your markup.
I'm not sure what outlining means. Do you intend to verify that your document doesn't have untitled sections like in this article?
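If that is what you mean, a rough way to spot untitled sections is to walk the markup and flag every sectioning element that closes without having seen a heading of its own. The sketch below uses Python's standard html.parser. The names UntitledSectionFinder and untitled_sections are made up for illustration, it assumes reasonably well-nested markup, and it is a simplification, not the HTML5 outline algorithm itself.

```python
from html.parser import HTMLParser

# Elements that open a new section in the outline, and the heading elements.
SECTIONING = {"section", "article", "nav", "aside"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class UntitledSectionFinder(HTMLParser):
    """Hypothetical helper: records sectioning elements that contain no heading."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open sectioning elements as [tag, has_heading]
        self.untitled = []  # tags of sections that closed without a heading

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            self.stack.append([tag, False])
        elif tag in HEADINGS and self.stack:
            # A heading names the innermost open section only.
            self.stack[-1][1] = True

    def handle_endtag(self, tag):
        # Assumption: sections are closed in the order they were opened.
        if tag in SECTIONING and self.stack and self.stack[-1][0] == tag:
            name, has_heading = self.stack.pop()
            if not has_heading:
                self.untitled.append(name)


def untitled_sections(markup):
    """Return the tag names of sectioning elements without a heading."""
    finder = UntitledSectionFinder()
    finder.feed(markup)
    finder.close()
    return finder.untitled


if __name__ == "__main__":
    page = (
        "<body><section><h1>News</h1><nav><a href='#'>Top</a></nav></section>"
        "<article><p>No heading here</p></article></body>"
    )
    print(untitled_sections(page))
```

Whether a search engine cares about the result is a separate matter; this only tells you what an outline tool would be complaining about.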
Which is your question:
Does Google like HTML5 more than other versions of HTML?
Does Google like valid HTML5 more than other valid HTML that doesn't have a valid HTML5 structure (i.e. messy or untitled sections)?
This is just my gut feeling, but I'm guessing it doesn't matter what HTML you use. This article indicates code validity is not an issue in page ranking. Here it says the doctype isn't even registered by Google. (Just a couple of links; I don't know how reliable they are.) It makes sense, though. The PageRank algorithm can't really (again, just my gut feeling) be so simple that it would put a lot of weight on the validity of the code or the HTML version used. Unique content, locality, links to other quality sites etc. are far more important than a few missing title tags or the fact that the source code is written using the latest HTML specification. Where would it end? Should PHP pages rank better than "pure" HTML? Is JavaScript a boosting factor or not? Which JS library would be the most important one?
Sorry if I'm splitting hairs, but the technology used shouldn't matter; content should. (I don't really know if only content should matter, though...)
It's a well known fact that browsers will accept invalid HTML and do their best trying to make sense out of it. If you create a web page containing only the following code:
<html>
<head>
<title>This is bad HTML</title>
<body>
<h1>Bad HTML</h2>
<p>This is a paragraph
</body>
then you will get a webpage parsed in a way that will show an acceptable view. Whether it is what you meant or not, depends on each browser's understanding of your mistakes.
This, to me, is the same as if JavaScript could be written like this:
if (some_var == 1) {
say_something("some text');
else {
do_something_else();
// END OF CODE
which a JavaScript compiler, written with the same effort to make sense of invalid code, could probably parse as you meant, or make its own sense of it but run it anyway.
I've seen several articles and questions on "Is it even worth writing valid HTML?", which present several opinions on the pros and cons of writing valid HTML. However, what this really makes me wonder is:
Why are browsers accepting invalid HTML in the first place?
NOTE: The following questions are not more questions, but a way to give context to the only question I'm asking here:
Why aren't browsers strict?
Why don't they reject invalid code with errors, just like any other programming language? (Not that I'm calling HTML a programming language, but you get the point.)
Wouldn't that force all developers to write HTML code that will be interpreted exactly the same in any browser?
If browsers refused to parse invalid markup, wouldn't that effectively result in valid markup everywhere and from anyone wanting to publish content on the web?
If this comes from historical reasons and backward compatibility, isn't it time already to change when we already see sites like adsense.google.com refusing compatibility with IE < v10?
EDIT: Those voting to close this question, please reconsider. This is not a broad question, nor is it an opinion-based one. It's a very specific question on a very specific subject, completely related to the programming world, and it can definitely be answered with a real answer by those who actually know it. Thanks.
"Why are browsers accepting invalid HTML in the first place?"
For compatibility reasons, and in the case of newer browsers, because HTML5 dictates an algorithm for parsing even invalid documents.
Earlier HTML specifications were ambiguous in many situations, such as what happens when the wrong tag is seen or when tags are nested inconsistently, as in <b><i></b></i>. Even so, many documents "just work" because some earlier browsers ignore unexpected tags or even "correct" incorrect nesting.
But now the HTML5 specification includes a much less ambiguous algorithm for parsing HTML documents. Note that the algorithm includes points where "parse errors" can occur. But these parse errors usually don't stop a modern browser from displaying an HTML document, although the browser is free to display parse errors in its developer tools if it chooses to:
[U]ser agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification. [Emphasis added.]
But again, no modern browser, to my knowledge, aborts parsing a document this early because of parse errors (barring extraordinary situations, such as running out of memory).
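To make the contrast with a strict parser concrete, here is a small sketch in Python (my choice for illustration; it is not what browsers run). The standard library's html.parser is lenient but does not implement the HTML5 parsing algorithm; it simply reports the tags it sees without complaining. xml.etree.ElementTree is a strict XML parser and rejects the same input. The helper names lenient_events and strict_parse_ok are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# The broken page from the question: unclosed <head> and <p>, <h1> closed by </h2>.
BAD_HTML = """<html>
<head>
<title>This is bad HTML</title>
<body>
<h1>Bad HTML</h2>
<p>This is a paragraph
</body>
"""


class TagLogger(HTMLParser):
    """Records start and end tags in the order a lenient tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(markup):
    """Tokenize the markup leniently; mismatched tags are reported, not rejected."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(lenient_events(BAD_HTML))   # the stray </h2> shows up as an ordinary event
    print(strict_parse_ok(BAD_HTML))  # the strict parser refuses the document
```

A browser sits on the lenient side of this split, with the difference that HTML5 also specifies how to recover from each error, so conforming browsers should end up with the same DOM.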
On the adsense.google.com situation: This probably has nothing to do with invalid HTML, but rather, perhaps, because IE9 and earlier's DOM support is not sufficient for adsense.google.com's needs.
I don't know why they allowed it from the start, but here is why they can't switch now: legacy support. If a browser enforced strict HTML, huge parts of the internet would just break. Yes, some people would update their code, but some pages would simply be lost. There is no incentive for browsers to do this, because to the consumer it would look as if that browser just doesn't work on some pages, and they would switch to another one that still supports less-than-optimal HTML.
Basically, because it was allowed from the beginning, it has to be allowed now.
To avoid opinion-based answers, this type of question requires an answer based on an authoritative reference with credible and/or official sources.
The following excerpts are quotes from the W3C Validator Help & FAQ that address "Why are browsers accepting invalid HTML in the first place?" and some related concerns.
About Markup
Most pages on the World Wide Web are written in computer languages
(such as HTML) that allow Web authors to structure text, add
multimedia content, and specify what appearance, or style, the result
should have.
As for every language, these have their own grammar, vocabulary and
syntax, and every document written with these computer languages are
supposed to follow these rules. The (X)HTML languages, for all
versions up to XHTML 1.1, are using machine-readable grammars called
DTDs, a mechanism inherited from SGML.
However, Just as texts in a natural language can include spelling or
grammar errors, documents using Markup languages may (for various
reasons) not be following these rules.
[...]
Concepts
One of the important maxims of computer programming is: "Be
conservative in what you produce; be liberal in what you accept."
Browsers follow the second half of this maxim by accepting Web pages
and trying to display them even if they're not legal HTML. Usually
this means that the browser will try to make educated guesses about
what you probably meant. The problem is that different browsers (or
even different versions of the same browser) will make different
guesses about the same illegal construct; worse, if your HTML is
really pathological, the browser could get hopelessly confused and
produce a mangled mess, or even crash.
That's why you want to follow the first half of the maxim by making
sure your pages are legal HTML.
[...]
Validity might not mean quality, and invalidity might not mean poor quality
A valid Web page is not necessarily a good web page, but an invalid
Web page has little chance of being a good web page.
For that reason, the fact that the W3C Markup Validator says that one
page passes validation does not mean that W3C assesses that it is a
good page. It only means that a tool (not necessarily without flaws)
has found the page to comply with a specific set of rules. No more, no
less. This is also why the "valid ..." icons should never be
considered as a "W3C seal of quality".
Unexpected browser behavior might mean that they actually don't accept invalid markup
While contemporary Web browsers do an increasingly good job of parsing
even the worst HTML “tag soup”, some errors are not always caught
gracefully. Very often, different software on different platforms will
not handle errors in a similar fashion, making it extremely difficult
to apply style or layout consistently.
Using standard, interoperable markup and stylesheets, on the other
hand, offers a much greater chance of having one's page handled
consistently across platforms and user-agents.
[...]
Compatibility problems
Checking that a page “displays fine” in several contemporary browsers
may be a reasonable insurance that the page will “work” today, but it
does not guarantee that it will work tomorrow.
In the past, many authors who relied on the quirks of Netscape 1.1
suddenly found their pages appeared totally blank in Netscape 2.0.
Whilst Internet Explorer initially set out to be bug-compatible with
Netscape, it too has moved towards standards compliance in later
releases.
[...]
Relying too much on 3rd party tools
The answer to this one is that markup languages are no more than data
formats. So a website doesn't look like anything at all! It only takes
on a visual appearance when it is presented by your browser.
In practice, different browsers can and do display the same page very
differently. This is deliberate, and doesn't imply any kind of browser
bug. A term sometimes used for this is WYSINWOG - What You See Is Not
What Others Get (unless by coincidence). It is indeed one of the
principal strengths of the web, that (for example) a visually impaired
user can select very large print or text-to-speech without a publisher
having to go to the trouble and expense of preparing a separate
edition.
As far as I understand, the only real advantage of HTML5 semantic markup is for search engines and web crawlers to interpret the document better.
Since intranet applications have nothing to do with search engines or web crawlers, what are the advantages of using semantic markup in HTML5?
There is no straightforward example to point out, but the website (even intranet) can be consumed by different user agents (on different devices).
You are probably familiar with Skype (and the iOS Safari) making phone-number-like words clickable. In the future I can easily imagine mobile browsers being smarter to assist the user in completing tasks on the page, like importing a clearly indicated contact to the address book.
Screenreaders for blind people?
While there is not a whole lot of immediate benefit for non-disabled people, it is still good practice. Does your company not have any externally facing sites? If it does, do those people not look at internal page code? Good practices spread just like bad ones.
see also: http://en.wikipedia.org/wiki/Semantic_HTML
Simply put, the only rule you have to follow when you create your HTML documents is that they are valid HTML; otherwise the browser will try to correct your broken syntax, which may result in defects in the visual representation of your content.
In modern browsers you can use the CSS display property to give any element (with some limitations, e.g. for input elements) the same visual look and behavior as any other element.
So if you ask what the advantages of using semantic markup in HTML5 are, you should ask why to use any semantic markup at all if it is possible to get the same result using CSS.
The short answer is that no one will stop you if it is your own project and you are the one responsible, except a client who gives you requirements.
It is similar to asking: "Language xyz provides comments, and there is a syntax for doc-comments, but why should I use them?"
Using the semantics wisely increases the readability and thus maintainability. You are not required to use every possibility of semantics at all costs.
Using them will help you get back into the code if you haven't looked at it for a long time, e.g. by distinguishing the elements that encapsulate logical parts from the elements that are used for styling. This is especially true if you use a template engine to create your code or need to search for certain elements across multiple files.
Even if you are currently the only one who works on the code, you may need other people to work on it as the project grows. And if for some reason you are not available and someone else needs to maintain your code, good markup is essential.
Using the correct markup and additions like WAI-ARIA is not only essential for users with disabilities, but also allows the browser to recognize the meaning of elements, which can, for example, improve keyboard navigation. Especially in a work environment where you type a lot, it is often faster to navigate with the keyboard than with a trackpad or a mouse.
I keep making attempts at properly using HTML5 but I feel like it's still not even close to anything semantically valuable.
My attempts:
HTML5 Article node Architecture
HTML5 Blog Page Architecture
But there are such subtleties in every single tag!
My question is, what specific software out there on the web is actually doing things like processing our HTML DOM, calculating and comparing elements to say "oh, this is a <header>, and it's just after <section>, and it has <time> in it, so the <time> tag must be "metadata" in relation to the <header>...", and saying "The content within the <time> tag not only is the "published time", but also relates to the author's birthday, so it must be a special post (say because there was also a <cite> or <address class='vcard'> tag in there too)".
I mean, what benefit am I ever going to get in using HTML5 if I don't know the algorithms that are interpreting it? If I just stuck with the basic div, ol, ul, li, p, a, h[1-6] tags, I could do everything with half the number of DOM elements.
Looking forward to some specific algorithms that I can use to shape how I structure the DOM from here on out.
I'm at the point where I don't even think we should be using HTML5 tags at all. For example, on the iPhone especially, the goal should be to minimize DOM elements to decrease load time. Plus, if the iPhone site is a mirror of the traditional browser version, the search engines won't even see the iPhone site (ideally). So there's no real point in making the DOM semantic. So if I can use half the number of <div> tags to achieve the same layout as a somewhat "semantic HTML5" rendition, and that's a good thing for the iPhone, why don't I do that for the regular browser too? That's where I'm coming from.
Articles like this are basically saying it's pointless to worry about semantic HTML.
What algorithms are reading your semantic HTML? Google, that's who. Their algorithm tries to extract every bit of meaning from pages that it can, because that helps Google construct smart, relevant search results. For one example, Google tries to determine the dates of things by reading the HTML and gives headers extra consideration in determining the overall topic of a page.
Also, your assertion that we shouldn't use HTML5 tags on the iPhone "to minimize dom elements" has no technical basis. HTML5 doesn't dictate that we use more DOM elements, and in fact it can let us leave out tags that would be required by XHTML. You should use HTML5 on the iPhone more than anywhere else. For example, the new input types like number and email don't do much on the desktop, but that extra information can really make things nicer on the iPhone by allowing it to present an appropriate interface.
Whenever a "machine" tries to make sense of your content.
In addition to search engines (→ SEO), screen readers (→ Accessibility) interpret the markup. They get better from version to version.
Also, think of all the tools that might come one day. The great thing about the Web is that all the web pages could still exist 5, 10, 100 … years from now. Imagine the user-agents and algorithms and search tools that might exist then, and how they could extract the meaning of your old documents.
Search engines can/will better interpret your pages which combined with other factors will result in better rankings for your pages.
Moreover if you use the tags consistently and semantically, you could build your own reusable widgets and libraries that derive knowledge from the HTML structure independent of how the data is stored in the backend.
Consider this sample Google search where you can filter results by date. By using semantic HTML, for let's say, <article> and <time>, you can write a simple crawler that recreates this functionality or allows users to specify a timespan within which to search articles in your own site(s).
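As a rough illustration of that idea, here is a minimal sketch in Python using only the standard library. It assumes each article carries a <time> element whose datetime attribute starts with an ISO date (YYYY-MM-DD); the names ArticleDateParser and dates_between are invented for this example, and a real crawler would also need fetching, error handling and so on.

```python
from datetime import date
from html.parser import HTMLParser


class ArticleDateParser(HTMLParser):
    """Collects the dates of <time datetime="..."> elements found inside <article>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # how many <article> elements are currently open
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "time" and self.article_depth > 0:
            value = dict(attrs).get("datetime")
            if value:
                try:
                    # Assumption: the attribute begins with YYYY-MM-DD.
                    self.dates.append(date.fromisoformat(value[:10]))
                except ValueError:
                    pass  # ignore values that are not full dates

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def dates_between(markup, start, end):
    """Return the article dates that fall within [start, end]."""
    parser = ArticleDateParser()
    parser.feed(markup)
    parser.close()
    return [d for d in parser.dates if start <= d <= end]


if __name__ == "__main__":
    sample = """
    <article><h1>First</h1><time datetime="2011-03-05">5 March</time></article>
    <article><h1>Second</h1><time datetime="2012-07-19T10:00">19 July</time></article>
    <div><time datetime="2010-01-01">not in an article</time></div>
    """
    print(dates_between(sample, date(2012, 1, 1), date(2012, 12, 31)))
```

The point is that the filter needs no knowledge of class names or of how the site stores its posts; the tags alone carry enough meaning.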
Off the top of my head, I don’t know of any algorithms making use of the new semantic tags in HTML5. (Obviously, that doesn’t mean there aren’t any.)
But the idea that you should tailor your HTML to specific algorithms is, I think, a bit contrary to how the web works. The web is worldwide, and will hopefully be around for a long time. We can’t know what uses our HTML will be put to, and useful algorithms can’t be written until there’s a good amount of actual content out there.
The <a> tag wasn’t designed with Google’s PageRank algorithm in mind. Some people thought links would be useless if they weren’t inherently two-way, because you’d get too many broken links when one end went away.
Of course, if the vague possibility of undefined future benefits makes it not worth using some or all HTML5 tags for whatever project you’re working on, don’t use them.
For me, the benefit of using them is that there’s a well-known, public, non-proprietary specification that tells you, and anyone else working on the code, what we’ve agreed the tags mean. Future developers don’t just get a <div> with a class name that I made up in a coffee-fuelled 7 p.m. code sprint, they get a tag designed and documented by people smarter and more experienced than me. There’s also the chance that the code will become more useful in future if people use the meaning contained in HTML5 tags in algorithms, whereas there’s less chance of that if it’s all just a bunch of <div>s.
I don’t think the size increase of our pages from HTML5 tags is particularly worth worrying about though. After gzipping, the size increases aren’t enough to worry about, especially as mobile performance is as much hampered by the latency (which you can’t do much about) as the bandwidth. Plus mobile bandwidth is likely to trend up, rather than down.
I am taking a peek at Dive Into HTML5. It seems nice and interesting, but I am puzzled.
In the 1990s, at the time when Netscape was the browser and HTML was HTML2 or HTML3, there were a lot of tags: address, cite, code... Most of them are unused as of today, probably even obsolete.
HTML5 introduces tags that attach "semantic meaning" to the tag itself. This is all fun and games, but I see something very strange in this approach. Technically, the semantics can be very open-ended. HTML5 has tags for article, time, navigation bars, footer. Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to? (I'm confident <rant> and <nsfw> would be very important tags.) I thought XML was the strategy for assigning semantics to stuff. Nothing forbids you from putting an XML chunk under an XHTML div element and assigning a stylesheet to it so as to style it properly, or delegating the handling of that namespace to the proper viewer (for example, when handling RSS or SVG).
In conclusion, I don't understand the reason behind these extensions focused on semantics, when it's clear that semantics is a very broad topic, which is guaranteed to require a potentially infinite number of semantic tags. Since I am pretty sure there are clever people at W3C, I think I'm wrong, but I'd like to know why.
Why are tags for article, time, navigation bars, footer useful?
Because they make parsing easier for text-processing tools like Google.
It has nothing to do with semantics (at least in the 'broad' sense). Instead they just say: here is the body of the page (the most important text part) and there is the navigation bar full of links. With such an approach you can easily extract just what you need.
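For example, a text-processing tool could keep the page body and drop the navigation and footer just by looking at tag names. The following is only a sketch in Python with the standard html.parser; which elements to skip (nav, footer, aside) and the names MainTextExtractor and main_text are my own assumptions, not how any search engine is known to work.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is not part of the main content.
SKIPPED = {"nav", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def main_text(markup):
    """Return the page text with navigation, footer and aside content removed."""
    extractor = MainTextExtractor()
    extractor.feed(markup)
    extractor.close()
    return " ".join(extractor.chunks)


if __name__ == "__main__":
    page = (
        '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>Copyright</footer>"
    )
    print(main_text(page))
```

With plain <div>s the same tool would have to guess from class names or layout, which is exactly the work these tags save.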
I too hate the way that W3C is going with their specs. There are many things that I don't like, and this "semantics" fad is one of them. (Others include taking forever to complete their specs and leaving too many important details for the browsers to implement as they choose)
Most of all I don't like it because it makes my work as a web developer more difficult. I often have to make a choice whether to make the webpage "semantically correct" or "visually/aesthetically pleasing". The latter wins of course, because that is what the users want, but as a result validations start failing and the whole thing gets quite non-semantic (tables for layout and other things).
Another issue at which I frown is that they have officially declared that the "class" attribute is for semantics, but then they used it for visual presentation selectors in CSS.
Bottom line - DON'T MIX SEMANTICS AND VISUAL REPRESENTATION. If you use some mechanism for describing semantics (like tag names, attribute values, or whatever else), then don't use it for functional/visual purposes and vice versa.
If I were to design HTML, I would simply add a "semantic" attribute which could (like the "class" attribute) be added to any tag. Then there would be a number of predefined values like all those headers/footers/articles/quotes/etc.
Tags would define functionality. Basically you could reduce HTML tags to just a handful, like "div", "table/tr/td", "a", "img", "form", "input" and "select". I probably missed a few but this is the bulk. Visual styling would be accomplished through CSS.
This way the three areas - semantics, visual representation, and functionality - would be completely independent and wouldn't clash in real life solutions.
Of course, I don't think W3C is interested in practical solutions...
There is already a lot of semantics in HTML markup in the form of classes and IDs, for which there is a (near) infinite number of possibilities, and everyone has their own way of handling these semantics. One of the goals of HTML5 is to try to bring some structure to this. You will still be able to extend the semantics of tags with classes and IDs. It will also most likely make things easier for search engines.
Look at it from the angle of trying to make statements either about the page, or about objects referenced from the page. If you see a <footer> tag, all you can say is "stuff in here is a footer" and pass it by. As such, adding custom tags is not as generic a solution as adding attributes and allowing people to use their own choice of URIs to specify predicates and optionally values - RDFa wins hands-down because you can express any triple-statement you like from RDF in a page, one way or another.
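To make the triple idea concrete, here is a deliberately tiny sketch in Python. It handles only three RDFa attributes (`about`, `property`, `content`) and ignores prefixes, nesting rules and everything else a real RDFa processor has to deal with. The class and function names and the `example.org` URI are invented for illustration; the predicates are Dublin Core terms.

```python
from html.parser import HTMLParser


class SimpleRDFaExtractor(HTMLParser):
    """Toy subset of RDFa: `about` sets the subject, `property` names the
    predicate, and the object is the `content` attribute or the element text.
    Simplification: the subject is never reset when its element closes."""

    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None  # predicate still waiting for its text value
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            if "content" in attrs:
                self.triples.append(
                    (self.subject, attrs["property"], attrs["content"])
                )
            else:
                self.pending = attrs["property"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending and text:
            self.triples.append((self.subject, self.pending, text))
            self.pending = None


def extract_triples(html):
    """Hypothetical helper: return (subject, predicate, object) tuples."""
    parser = SimpleRDFaExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


# Made-up markup; the subject URI is purely illustrative.
SAMPLE = """
<div about="http://example.org/post/1">
  <span property="http://purl.org/dc/terms/title">Why semantics?</span>
  <span property="http://purl.org/dc/terms/creator">Alice</span>
  <meta property="http://purl.org/dc/terms/created" content="2010-01-01">
</div>
"""

for triple in extract_triples(SAMPLE):
    print(triple)
```

Each printed tuple is a statement about the post, and the vocabulary can be any URI the author chooses, which is exactly what a fixed tag such as `<footer>` cannot offer.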
I just want to address one part of your question. You say:
In the nineties, at the time when Netscape was the browser and html was HTML2 or HTML3, there were a lot of tags: address, cite, code... Most of them are unused as of today, probably even obsolete.
There are a great many tags to choose from in HTML, but lack of usage does not imply that they are obsolete. In particular the heading tags (<h1>, etc.) and the list tags <ul> and <ol>, which join items into lists, are used in a way I consider semantic. Many people may not use tags semantically, but the effort to create microformats is an ongoing continuation of the idea you consider an artifact of the 1990s. Efforts to make the semantic web a winner keep going, despite full-text search and link analysis (in the form of Google) being the winner as far as how to find and understand the web.
It would be great to see an updated version of Google's Web Stats, which shows "html as she is spoke." But you are right that many tags are underused.
Whether HTML5 will be successful is an open and interesting question, but the tags you describe as obsolete didn't go anywhere; they were there in HTML 4.01 and XHTML. HTML5 seems to be an effort to solidify what is useful in tags. In the end, if HTML5 gets support in browsers and makes the job of web developers easier, it will succeed. XHTML2 failed because it roundly failed to gain adoption in browsers and did nothing to make the job of web page makers easier. The forces working on HTML5 seem keenly aware of the failure of XHTML2, and I think they are avoiding having HTML5 suffer a similar fate.
"Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to (I'm confident <rant> and <nsfw> would be very important tags)?"
In early HTML5 drafts, <dialog> was meant for describing conversations or comments (the element was later redefined as a dialog box). Rant and NSFW are subjective terms, so it makes sense not to have tags for them.
From what I understand, a group of experienced web developers did research into what most websites have in common in their HTML. They noticed that most websites use attributes such as id="header", id="footer", id="section" and id="nav", so they decided that we need HTML tags to replace those ids. In other words, don't expect them to give you a HUGE HTML vocabulary. The aim is to keep it as simple as possible while covering the MOST commonly needed tags.
The <nav> tag is VERY important for accessibility as well. Assistive technology can tell users where the navigation is, rather than forcing them to work out whether a group of links is navigation or not.
I disagree with adding extra tags. If detailed vocabulary were actually important, then there could be a different tag name for every word in the dictionary. Additional tag names are not helpful: they may communicate additional meaning to humans, but they do nothing to facilitate machine parsing of the language. This is why I don't like the "semantic" tags in HTML5. I believe they are a slippery slope towards a vocabulary that is too complex, while offering only a weak solution to a problem that is not fully addressed.
In my opinion, markup languages structure data as much as they describe it, in the form of a tree. Through parsing of the structure and proper use of semantic conventions, such as RDFa, context can be leveraged to give specific meaning to otherwise generic tag names. In such a case an excessive vocabulary need not exist, and structurally redundant tag names, such as footer and aside, could be eliminated. The final objective is to make content faster and more accurate to interpret by both humans and machines simultaneously, while using as little code as possible to achieve that result. How that result is achieved is less important, except to HTML5.
I thought XML was the strategy to assign semantics to stuff.
As far as I know, no it wasn’t. XML allows new languages to be defined which are all parsed in the same way, because they all use the XML syntax.
It doesn’t, of itself, provide any way to add meaning (“semantic” just means “meaningful”) to those languages. And until computers get artificial intelligence, they don’t actually understand meaning, so meaning is just what is agreed between human beings. HTML is the most commonly-used language with agreed meaning of its tags.
As HTML is so common, it’s helpful to add a few meaningful tags to it that are quite general in their application. The new HTML5 tags are aimed at that. The HTML5 spec’s authors could indeed carry on down this route, creating tags for every specific bit of meaning possible, but as they’re not robots, they probably won’t.
<section> is useful, and general enough to be meaningfully applicable in lots of documents. <author-last-name> isn’t. Distinguishing between the two is a judgment call, which is why humans, and not computers, write the spec.
For custom semantics that are too specific to be added to HTML as tags, HTML5 defines microdata.
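As a small sketch of what that looks like in practice, the Python below pulls `itemprop` values out of a flat microdata item. It assumes one un-nested item whose properties are plain text, and the class name, function name and sample markup are invented for this example; `givenName` and `familyName` are properties of the schema.org `Person` type.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Toy reader for a single flat microdata item: maps each `itemprop`
    name to the text of the element that carries it."""

    def __init__(self):
        super().__init__()
        self.current_prop = None
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Boolean attributes such as `itemscope` arrive with a value of None.
        self.current_prop = dict(attrs).get("itemprop")

    def handle_endtag(self, tag):
        self.current_prop = None

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            self.properties[self.current_prop] = text


def extract_properties(html):
    """Hypothetical helper: return {itemprop name: text} for the markup."""
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties


# Made-up markup: the kind of detail that would never get its own HTML tag.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="givenName">Ada</span>
  <span itemprop="familyName">Lovelace</span>
</div>
"""

print(extract_properties(SAMPLE))
# prints {'givenName': 'Ada', 'familyName': 'Lovelace'}
```

So the very specific meaning lives in attribute values drawn from a shared vocabulary, while the tag set itself stays small and general.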
I've been reading Andy Clarke's book Transcending CSS (page 33).
... it is now widely accepted that presentational names such as header, left, or red that describe an element's look or position are poor choices.
After reading these lines I asked myself: hey, aren't there elements in the HTML5 spec such as header and footer? Why is footer more semantic? In his book Andy advocates using site-info for the ID of the footer div, and this makes more sense IMHO. Footer is a presentational name (it describes the element's position).
In a word, Ajax. The new tags are meant to support what real-world developers are doing by replacing some of the <div class="sidebar-wrap"><div class="styling-hook"><div><ul class="nav"> type of divitis many websites suffer from. The only <div> left in HTML5 is the styling hook.
The semantics that get promoted from classes to tags are those that developers have freely adopted en masse as best practices, given an extended XHTML/CSS adoption period. Check out the sections page of the WHATWG developer's edition of the spec here. The document itself is a pleasure, but I won't spoil it if you haven't seen it yet.
One of the less obvious reasons for some decisions made by the W3C is the importance of WebKit. If you look, you can see that they were better than some at taking the current work of the HTML5 Working Group and implementing ideas. They have historically been way out ahead in compliance (see here). The W3C placed a high priority on them (i.e. Android, iPhone, the Googlebot, Chrome, Safari, Dreamweaver, etc.). Google, framework users, WordPress/Movable Type/Joomla! type users and others wanted self-contained building blocks, so this is the style we get.
Facebook is modular. Responsive design's grids are modular. WordPress is modular. Ajax works best with modular page structures. Widgets are modules. Plug-ins are modules. It would seem that we should be trying to figure out how to apply these tags to make it easier to hook the appropriate elements and activate them in our document/application/info-network hybrid Web 2.0.
In closing, HTML5 can be written as XML (again, see the spec), which ensures that tools and machines making Ajax requests for a portion of a document will get a well-formed, useful response. How awesome in combination with things like media queries for devices like feed readers, braille printers, annotators, etc. I see a (near) future where anything with good semantic content is its own newsfeed automagically! This only happens if developers adopt and write compliant documents.
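To show why a well-formed fragment matters to a machine consumer, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The fragment stands in for the body of an Ajax response and is made up for this example, as are the function names; it only assumes the markup uses the XML syntax with the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by HTML elements when a document is written in XML syntax.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Made-up fragment, as a tool might receive it when asking for one module.
FRAGMENT = """
<article xmlns="http://www.w3.org/1999/xhtml">
  <header><h1>Release notes</h1></header>
  <section><p>First item.</p></section>
  <footer><p>Posted by the team.</p></footer>
</article>
"""


def is_well_formed(xml_fragment):
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(xml_fragment.strip())
        return True
    except ET.ParseError:
        return False


def section_text(xml_fragment):
    """Hypothetical helper: return the text of the first child <section>."""
    root = ET.fromstring(xml_fragment.strip())
    section = root.find(f"{XHTML}section")
    return "".join(section.itertext()).strip()


print(is_well_formed(FRAGMENT))                          # prints True
print(section_text(FRAGMENT))                            # prints "First item."
print(is_well_formed("<article><p>unclosed</article>"))  # prints False
```

A feed reader or annotator built this way needs no HTML-specific error recovery, but it will reject anything that is not well-formed, which is why compliant documents are the precondition.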