Practically speaking, why semantic markup? - html

Does Google really care if I use an <h5> as a <b> tag?
What are some real-world, practical reasons I should care about semantic markup?

A few examples
Many visually impaired people rely on speech browsers to read pages back to them. These programs cannot interpret pages very well unless they are clearly explained. In other words, semantic code aids accessibility.
Search engines need to understand what your content is about in order to rank you properly on search engines.
Semantic code tends to improve your placement on search engines, as it is easier for the "search engine spiders" to understand.
However, semantic code has other benefits too:
As you can see from the example above, semantic code is shorter and so downloads faster.
Semantic code makes site updates easier because you can apply design style to headings across an entire site instead of on a per page basis.
Semantic code is easier for people to understand too so if a new web designer picks up the code they can learn it much faster.
Because semantic code does not contain design elements it is possible to change the look and feel of your site without recoding all of the HTML.
Once again, because design is held separately from your content, semantic code allows anybody to add or edit pages without having to have a good eye for design.
You simply describe the content and the cascading style sheet defines what that content looks like.
Source: boagworld
Semantics and the Web
Semantics are the implied meaning of a subject, like a word or sentence. It aids how humans (and these days, machines) interpret subject matter. On the web, HTML serves both humans and machines, suggesting the purpose of the content enclosed within an HTML tag. Since the dawn of HTML, elements have been revised and adapted based on actual usage on the web, ideally so that authors can navigate markup with ease and create carefully structured documents, and so that machines can infer the context of the wonderful collection of data we humans can read.
Until — and perhaps even after — machines can understand language and all its nuances at the same level as a human, we need HTML to help machines understand what we mean. A computer doesn’t care if you had pizza for dinner. It likely just wants to know what on earth it should do with that information.
HTML semantics are a nuanced subject, widely debated and easily open to interpretation. Not everyone agrees on the same thing right away, and this is where problems arise.
Allow me to paint a picture:
You are busy creating a website.
You have a thought, “Oh, now I have to add an element.”
Then another thought, “I feel so guilty adding a div. Div-itis is terrible, I hear.”
Then, “I should use something else. The aside element might be appropriate.”
Three searches and five articles later, you’re fairly confident that aside is not semantically correct.
You decide on article, because at least it’s not a div.
You’ve wasted 40 minutes, with no tangible benefit to show for it.
— Divya Manian
This generated a storm of responses, both positive and negative. In "Pursuing Semantic Value", Jeremy Keith argued that being semantically correct is not fruitless, and he even gave an example of how <section> can be used to adjust a document’s outline. He concludes:
But if you can get past the blustery tone and get to the kernel of the article, it’s a fairly straightforward message: don’t get too hung up on semantics to the detriment of other important facets of web development.
— Jeremy Keith
Naming Things
Of all the possible new element names in HTML5, the spec is pretty set on things like <nav> and <footer>. If you’ve used either of those as a class or id in your own markup, it’s no coincidence. Studies of the web from the likes of Google and Opera (amongst others) looked at which names people were using to hint at the purpose of a part of their HTML documents. The authors of the HTML5 spec recognised that developers needed more semantic elements and looked at what classes and IDs were already being used to convey such meaning.
Of course, it isn’t possible to use all of the names researched, and of the millions of words in the English language that could have been used, it’s better to focus on a small subset that meets the demands of the web. Yet some people feel that the spec isn’t yet doing so.
Source: html5doctor (This goes on for quite a while so I've only put a few examples here.)
Hope this helps!

Related

What is the use of HTML5 semantic markup for intranet applications?

As far as I understand, the only real advantage of HTML5 semantic markup is for search engines and web crawlers to interpret the document better.
Since intranet applications have nothing to do with search engines or web crawlers, what are the advantages of using semantic markup in HTML5?
There is no straightforward example to point out, but the website (even intranet) can be consumed by different user agents (on different devices).
You are probably familiar with Skype (and the iOS Safari) making phone-number-like words clickable. In the future I can easily imagine mobile browsers being smarter to assist the user in completing tasks on the page, like importing a clearly indicated contact to the address book.
Screenreaders for blind people?
While there is not a whole lot of immediate benefit for non-disabled people, it is still good practice. Does your company not have any externally facing sites? If it does, do those people not look at internal page code? Good practices spread just like bad ones.
see also: http://en.wikipedia.org/wiki/Semantic_HTML
Simply put, the only rule you have to follow when creating HTML documents is that they are valid HTML. Otherwise the browser will try to correct your broken syntax, which may cause defects in the visual presentation of your content.
In modern browsers you can use the CSS display property to give almost any element the same visual look and behavior as any other element, with some limitations (e.g. the input element).
So if you ask what the advantages of semantic markup in HTML5 are, you should really ask why to use any semantic markup at all if the same result is possible with CSS.
The short answer is that no one will stop you if it is your own project and you are the one responsible for it, except perhaps a client who gives you requirements.
It is similar to asking: "Language xyz provides comments, and there is a syntax for doc-comments, but why should I use them?"
Using the semantics wisely increases the readability and thus maintainability. You are not required to use every possibility of semantics at all costs.
Using them will help you get back into the code if you haven't looked at it for a long time, e.g. by distinguishing the elements that encapsulate logical parts from the elements that are used only for styling. This is especially true if you use a template engine to generate your code, or need to search for certain elements across multiple files.
Even if you are currently the only one working on the code, the project may grow and need other people to work on it. And if you are unavailable for some reason and someone else has to maintain your code, good markup is essential.
Using the correct markup and additions like WAI-ARIA is not only essential for people with disabilities; it also lets the browser recognize the meaning of elements, which can, for example, improve keyboard navigation. Especially in a work environment where you need to type a lot, it is often faster to navigate with the keyboard than with a trackpad or a mouse.

We hear so much about "semantic html". Where/what are the algorithms reading our semantic html?

I keep making attempts at properly using HTML5 but I feel like it's still not even close to anything semantically valuable.
My attempts:
HTML5 Article node Architecture
HTML5 Blog Page Architecture
But there are such subtleties in every single tag!
My question is, what specific software out there on the web is actually doing things like processing our HTML DOM, calculating and comparing elements to say "oh, this is a <header>, and it's just after <section>, and it has <time> in it, so the <time> tag must be "metadata" in relation to the <header>...", and saying "The content within the <time> tag not only is the "published time", but also relates to the author's birthday, so it must be a special post (say because there was also a <cite> or <address class='vcard'> tag in there too)".
I mean, what benefit am I ever going to get in using HTML5 if I don't know the algorithms that are interpreting it? If I just stuck with the basic div, ol, ul, li, p, a, h[1-6] tags, I could do everything with half the number of DOM elements.
Looking forward to some specific algorithms that I can use to shape how I structure the DOM from here on out.
I'm at the point where I don't even think we should be using HTML5 tags at all. For example, on the iPhone especially, the goal should be to minimize dom elements to decrease load time. Plus, if the iPhone site is a mirror of the traditional browser version, the search engines won't even see the iPhone site (ideally). So there's no real point in making the DOM semantic. So if I can use 1/2 the amount of <div> tags to achieve the same layout as if I used a somewhat "semantic HTML5" rendition, and that's a good thing for the iPhone, why don't I do that for the regular browser too? That's where I'm coming from.
Articles like this are basically saying it's pointless to worry about semantic HTML.
What algorithms are reading your semantic HTML? Google, that's who. Their algorithm tries to extract every bit of meaning from pages that it can, because that helps Google construct smart, relevant search results. For one example, Google tries to determine the dates of things by reading the HTML and gives headers extra consideration in determining the overall topic of a page.
Also, your assertion that we shouldn't use HTML5 tags on the iPhone "to minimize dom elements" isn't founded in any technical basis. HTML5 doesn't dictate that we use more DOM elements, and in fact it can let us leave out tags that would be required by XHTML. You should use HTML5 on the iPhone more than anywhere else. For example, the new input types like number and email don't do much on the desktop, but that extra information can really make things nicer on the iPhone by allowing it to present an appropriate interface.
Whenever a "machine" tries to make sense of your content.
In addition to search engines (→ SEO), screen readers (→ Accessibility) interpret the markup. They get better from version to version.
Also, think of all the tools that might come one day. The great thing about the Web is, that all the web pages could still exist in 5, 10, 100 … years from now. Imagine the user-agents and algorithms and search tools that might exist then, and how they could extract the meaning of your old documents.
Search engines can/will better interpret your pages which combined with other factors will result in better rankings for your pages.
Moreover if you use the tags consistently and semantically, you could build your own reusable widgets and libraries that derive knowledge from the HTML structure independent of how the data is stored in the backend.
Consider this sample Google search where you can filter results by date. By using semantic HTML, for let's say, <article> and <time>, you can write a simple crawler that recreates this functionality or allows users to specify a timespan within which to search articles in your own site(s).
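As a rough sketch of the "simple crawler" idea above, here is how a script could pull publication dates out of <article> and <time> elements and filter by a timespan. The sample page, the class name ArticleDates, the function articles_between, and the assumption that each article's title lives in an <h2> are all made up for illustration; a real site would need to match its own markup.

```python
from datetime import date
from html.parser import HTMLParser

# Hypothetical page fragment; a real crawler would fetch this over HTTP.
PAGE = """
<article><h2>Old post</h2><time datetime="2009-03-14">14 March 2009</time></article>
<article><h2>New post</h2><time datetime="2010-06-01">1 June 2010</time></article>
"""


class ArticleDates(HTMLParser):
    """Collects (title, date) pairs from <article> elements containing a <time>."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._in_article = False
        self._in_title = False
        self._title = ""
        self._date = None

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._in_article, self._title, self._date = True, "", None
        elif self._in_article and tag == "h2":
            self._in_title = True
        elif self._in_article and tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                # Assumes an ISO date (YYYY-MM-DD) at the start of the attribute;
                # anything else raises ValueError.
                self._date = date.fromisoformat(value[:10])

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False
        elif tag == "article" and self._in_article:
            if self._date is not None:
                self.articles.append((self._title.strip(), self._date))
            self._in_article = False

    def handle_data(self, data):
        if self._in_title:
            self._title += data


def articles_between(html, start, end):
    """Return titles of articles whose <time> falls within [start, end]."""
    parser = ArticleDates()
    parser.feed(html)
    parser.close()
    return [title for title, published in parser.articles if start <= published <= end]


print(articles_between(PAGE, date(2010, 1, 1), date(2010, 12, 31)))
```

The point is only that the date lives in a predictable place (the datetime attribute of <time>), so the script does not have to guess which text on the page is a date.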
Off the top of my head, I don’t know of any algorithms making use of the new semantic tags in HTML5. (Obviously, that doesn’t mean there aren’t any.)
But the idea that you should tailor your HTML to specific algorithms is, I think, a bit contrary to how the web works. The web is worldwide, and will hopefully be around for a long time. We can’t know what uses our HTML will be put to, and useful algorithms can’t be written until there’s a good amount of actual content out there.
The <a> tag wasn’t designed with Google’s PageRank algorithm in mind. Some people thought links would be useless if they weren’t inherently two-way, because you’d get too many broken links when one end went away.
Of course, if the vague possibility of undefined future benefits makes it not worth using some or all HTML5 tags for whatever project you’re working on, don’t use them.
For me, the benefit of using them is that there’s a well-known, public, non-proprietary specification that tells you, and anyone else working on the code, what we’ve agreed the tags mean. Future developers don’t just get a <div> with a class name that I made up in a coffee-fuelled 7 p.m. code sprint, they get a tag designed and documented by people smarter and more experienced than me. There’s also the chance that the code will become more useful in future if people use the meaning contained in HTML5 tags in algorithms, whereas there’s less chance of that if it’s all just a bunch of <div>s.
I don’t think the size increase of our pages from HTML5 tags is particularly worth worrying about though. After gzipping, the size increases aren’t enough to worry about, especially as mobile performance is as much hampered by the latency (which you can’t do much about) as the bandwidth. Plus mobile bandwidth is likely to trend up, rather than down.

How does Google use HTML tags to enhance the search engine?

I know that Google’s search algorithm is mainly based on pagerank. However, it also does analysis and uses the structure of the document H1, H2, title and other HTML tags to enhance the search results.
What is the name of this technique "using the document structure to enhance the search results"?
And are there any academic papers to help me study this area?
The fact that Google takes the HTML structure into account is well covered in SEO articles; however, I could not find it in academic papers.
I think it's called "Semantic Markup"
[...] semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information. http://www.digital-web.com/articles/writing_semantic_markup/
A more practical article here
http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/
SEO has become almost a religion to some people where they obsess about minutiae. Frankly, I'm not convinced that all this effort is justified.
My advice? Ignore what so-called pundits say and just follow Google's guidelines.
You might be looking for an academic answer but honestly, this isn't an academic question beyond the very basics of how Web indexing works. The reality of a modern page indexing and ranking algorithm is far more complex.
You may want to look at one of the earlier works on search engines. Note the authors' names. You may also want to read Google Patent application 20050071741.
These general principles aside, Google's search algorithm is constantly tweaked based on actual and desired results. The exact workings are a closely guarded secret just to make it harder for people to game the system. Much of the "advice" or descriptions on how Google's search algorithm works is pure supposition.
So, apart from having a title and having well-formed and valid HTML, I don't think you're going to find what you're looking for.
Google very deliberately doesn't give away too much information about its search algorithm, so it's unlikely you will find a definitive answer or academic paper that confirms this. If you're interested from an SEO point of view, just write your pages so they are good for humans and the robots will like them too.
To make a page good for humans, you SHOULD use tags such as h1, h2 and so on to create a hierarchical page outline... a bit like this...
h1 "Contact Us"
...h2 "Contact Details"
......h3 "Telephone Numbers"
......h3 "Email Addresses"
...h2 "How To Find Us"
......h3 "By Car"
......h3 "By Train"
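A hierarchy like this is also easy for a program to recover, which is part of why it helps. The following is a minimal sketch, using only Python's standard library, that rebuilds the indented outline from heading tags. The sample markup and the names OutlineParser and format_outline are assumptions made for illustration; this is not how any particular search engine works.

```python
from html.parser import HTMLParser

# Hypothetical page using the heading structure described above.
PAGE = (
    "<h1>Contact Us</h1>"
    "<h2>Contact Details</h2><h3>Telephone Numbers</h3><h3>Email Addresses</h3>"
    "<h2>How To Find Us</h2><h3>By Car</h3><h3>By Train</h3>"
)


class OutlineParser(HTMLParser):
    """Records (level, text) for every h1-h6 element, in document order."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self.outline.append((self._level, ""))

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self._level = None

    def handle_data(self, data):
        if self._level is not None:
            level, text = self.outline[-1]
            self.outline[-1] = (level, text + data)


def format_outline(html):
    """Return one indented line per heading, three dots per nesting level."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return ["..." * (level - 1) + f'h{level} "{text.strip()}"'
            for level, text in parser.outline]


for line in format_outline(PAGE):
    print(line)
```

If the same page used styled <div> or <b> elements for its headings, this script would find nothing, because the structure would exist only visually.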
The difficulty with your question is that if you put something in your h1 tag hoping that it would increase your position in Google, but it didn't match up with other content on your page, you could look like you are spamming. Similarly, if your page is made up of too many headings and not enough actual content, you could look like you are spamming. It's not as simple as add a h1 and h2 tag and you'll go up! That's why you need to write websites for humans, not robots.
I have found this paper: A New Study on Using HTML Structures to Improve Retrieval.
However, it is an old paper (1999); I am still looking for more recent ones.
Check out
http://jcmc.indiana.edu/vol12/issue3/pan.html
http://www.springerlink.com/content/l22811484243r261/
Some time spent on scholar.google.com might help you find what you are looking for
You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.
It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)
Like cletus said, follow Google's guidelines.
I did a few tests and came to the conclusion that the title, image alt text and h tags are the most important. Also worth mentioning is Google AdSense. I had the feeling that if you implement these, the rank of your site increases.
I believe what you are interested in is called structural fingerprinting, and it is often used to determine the similarity of two structures. In Google's case, this would mean applying a weight to different tags and feeding them to a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply rooted in information theory. If you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
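To make the idea of a tag-frequency fingerprint concrete, here is a minimal sketch that counts start tags per page and compares two pages with cosine similarity. Cosine similarity is my own choice of measure for illustration; nothing here is known to match what Google actually does, and the names TagCounter, fingerprint and cosine_similarity are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def fingerprint(html):
    """Return a Counter mapping tag name to frequency."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine_similarity(a, b):
    """Cosine of the angle between two tag-frequency vectors (0.0 to 1.0)."""
    dot = sum(a[tag] * b[tag] for tag in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical pages: two with the same structure, one completely different.
page_a = fingerprint("<h1>A</h1><p>x</p><p>y</p>")
page_b = fingerprint("<h1>B</h1><p>z</p><p>w</p>")
page_c = fingerprint("<ul><li>1</li><li>2</li></ul>")

print(cosine_similarity(page_a, page_b))
print(cosine_similarity(page_a, page_c))
```

Pages with identical tag frequencies score close to 1, and pages that share no tags score 0. Weighting individual tags, as the answer suggests, would amount to multiplying each count by a per-tag factor before comparing.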
I would also suggest looking at Microformats and RDF. Both are used to enhance searching. These are mostly search-engine agnostic, but there are some engine-specific things as well. For Google-specific guidelines for HTML content read this link.
In short; very carefully. In long:
Quote from "The Anatomy of a Large-Scale Hypertextual Web Search Engine":
[...] This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font. [...]
It goes on:
[...] Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. [...]
The Challenges in a web search engine addresses these issues in a more modern fashion:
[...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]:
And adds:
[...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the <H1> tag means that any text in an H1 context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.
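As a toy illustration of "determining which words the author thinks are particularly important", the sketch below scores each word by the most prominent tag it appears inside. The weights are invented for this example (real engines do not publish theirs), and the names WEIGHTS, WeightedTerms and score_terms are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Made-up weights for illustration only; text outside these tags counts as 1.0.
WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "strong": 2.0}


class WeightedTerms(HTMLParser):
    """Adds up a per-word score, weighting words by their enclosing tags."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._open = []  # weighted tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        # Use the heaviest enclosing tag, or 1.0 for plain body text.
        weight = max((WEIGHTS[t] for t in self._open), default=1.0)
        for raw in data.lower().split():
            word = raw.strip(".,;:!?")
            if word:
                self.scores[word] += weight


def score_terms(html):
    parser = WeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.scores


scores = score_terms("<h1>Semantic markup</h1><p>Markup with <b>meaning</b> helps.</p>")
print(scores.most_common(3))
```

As the quote notes, this kind of weighting is only as trustworthy as the link between tag and appearance: once CSS can make an <h1> invisible or tiny, the tag alone says less about what the user actually sees.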
There are a number of issues a modern search engine needs to combat, for example web spam and blackhat SEO schemes.
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess because no-one has identical structures. There are maps, games, video, photos (flickr) and lots and lots of user generated content. In other words, the web is still very unpredictable.
Resources
Hypertext and the web:
Extracting knowledge from the World Wide Web
Rich media and web 2.0
Thresher: automating the unwrapping of semantic content from the World Wide Web
Information retrieval
Webspam papers
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
To keep it painfully simple. Make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember the search engine algorithms were written by people trying to interpret language.
The Basic Process is:
Write well structured HTML - using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information. Lists for lists, headers for major topics.
Supply relevant alt text and names for any visual elements, and then use simple CSS to arrange these elements.
If the site works well for users and contains relevant information, you don't risk becoming a black listed spammer, and search engine algorithms will favor your page.
I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.
I suggest trying Google scholar as one of your avenues when looking for academic articles
semantic search
I found it interesting that, with no meta keywords or description provided, in a scenario like this:
<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>
the "text for section one" is always what is shown on the search result page.
Google has also introduced a canonical link element, <link rel="canonical">, which can now be used as well.

Is semantic markup too open-ended? [closed]

I am taking a peek at Dive Into HTML5. It seems nice and interesting, but I am puzzled.
In the 1990s, at the time when Netscape was the browser and HTML was HTML2 or HTML3, there were a lot of tags: address, cite, code... Most of them are unused as of today, probably even obsolete.
HTML5 introduces tags to express "semantic meaning" in the tag itself. This is all fun and games, but I see something very strange in this approach. Technically, the semantics can be very open ended. HTML5 has tags for article, time, navigation bars, footer. Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to (I'm confident <rant> and <nsfw> would be very important tags)? I thought XML was the strategy to assign semantics to stuff. Nothing forbids you from putting an XML chunk under an XHTML div element and assigning a stylesheet to it to style it properly, or delegating the handling of that namespace to the proper viewer (for example, when handling RSS or SVG).
In conclusion, I don't understand the reasoning behind these extensions focused on semantics, when it's clear that semantics is a very broad topic, one that is guaranteed to require a potentially infinite number of semantic tags. Since I am pretty sure there are clever people at W3C, I think I'm wrong, but I'd like to know why.
Why are tags for article, time, navigation bars, footer useful?
Because they facilitate parsing for text processing tools like Google.
It's nothing about semantics (at least in 'broad' meaning). Instead they just say: here is the body of page (most important text part) and there is the navigation bar full of links. With such an approach you can easily extract just what you need.
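Here is a minimal sketch of that kind of extraction: keep the page's text but ignore anything inside <nav> or <footer>. The sample page and the names MainTextExtractor and main_text are assumptions for illustration, not a description of how Google or any other tool does it.

```python
from html.parser import HTMLParser

# Hypothetical page with navigation, an article body and a footer.
PAGE = (
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer>"
)


class MainTextExtractor(HTMLParser):
    """Collects text that is not inside a <nav> or <footer> element."""

    SKIP = {"nav", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # how many skipped elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(main_text(PAGE))  # Title Body text.
```

With <div id="nav"> and <div id="footer"> the same script would need to know each site's naming conventions; with the HTML5 tags it does not.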
I too hate the way that W3C is going with their specs. There are many things that I don't like, and this "semantics" fad is one of them. (Others include taking forever to complete their specs and leaving too many important details for the browsers to implement as they choose)
Most of all I don't like it because it makes my work as a web developer more difficult. I often have to make a choice whether to make the webpage "semantically correct" or "visually/aesthetically pleasing". The latter wins of course, because that is what the users want, but as a result validations start failing and the whole thing gets quite non-semantic (tables for layout and other things).
Another issue I frown at is that they have officially declared that the "class" attribute is for semantics, but then they used it for visual presentation selectors in CSS.
Bottom line - DON'T MIX SEMANTICS AND VISUAL REPRESENTATION. If you use some mechanism for describing semantics (like tag names, attribute values, or whatever else), then don't use it for functional/visual purposes, and vice versa.
If I were designing HTML, I would simply add a "semantic" attribute which could (like the "class" attribute) be added to any tag. Then there would be a number of predefined values like all those headers/footers/articles/quotes/etc.
Tags would define functionality. Basically you could reduce HTML tags to just a handful, like "div", "table/tr/td", "a", "img", "form", "input" and "select". I probably missed a few but this is the bulk. Visual styling would be accomplished through CSS.
This way the three areas - semantics, visual representation, and functionality - would be completely independent and wouldn't clash in real life solutions.
Of course, I don't think W3C is interested in practical solutions...
There is already a lot of semantics in HTML markup in the form of classes and IDs, of which there is a (near) infinite number of possibilities, and everyone has their own way of handling these semantics. One of the goals of HTML5 is to try to bring some structure to this. You will still be able to extend the semantics of tags with classes and IDs. It will also most likely make things easier for search engines.
Look at it from the angle of trying to make statements either about the page, or about objects referenced from the page. If you see a <footer> tag, all you can say is "stuff in here is a footer" and pass it by. As such, adding custom tags is not as generic a solution as adding attributes and allowing people to use their own choice of URIs to specify predicates and optionally values - RDFa wins hands-down because you can express any triple-statement you like from RDF in a page, one way or another.
I just want to address one part of your question. You say:
In the nineties, at the time when Netscape was the browser and html was HTML2 or HTML3, there were a lot of tags: address, cite, code... Most of them are unused as of today, probably even obsolete.
There are a great many tags to choose from in HTML, but lack of usage does not imply that they are obsolete. In particular the heading tags (<h1>, etc.) are widely used, and <ul> and <ol> are used to join items into lists, in a way I consider semantic. Many people may not use tags semantically, but the effort to create microformats is an ongoing continuation of the idea you consider an artifact of the 1990s. Efforts to make the semantic web a winner keep going, despite full-text search and link analysis (in the form of Google) being the winner as far as how to find and understand the web.
It would be great to see an updated version of Google's Web Stats which show "html as she is spoke." But you are right that many tags are underused.
Whether html5 will be successful is an open and interesting question, but the tags you describe as obsolete didn't go anywhere, they were there in HTML 4.01 and xhtml. HTML5 seems to be an effort to solidify what is useful in tags. In the end if html5 gets support in browsers and makes the job of web developers easier, it will succeed. xhtml2 failed because it roundly failed to gain adoption in browsers and did nothing to make the job of web page makers easier. The forces working on html5 seem keenly aware of the failure of xhtml2, and I think are avoiding having html5 suffer a similar fate.
"Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to (I'm confident <rant> and <nsfw> would be very important tags)?"
You use <dialog> to describe conversations or comments. Rant and NSFW are subjective terms therefore it makes sense not to use them.
From what I understand, a bunch of experienced web developers did research and looked at what most websites have in common in their HTML. They noticed that most websites use id="header", id="footer", id="section" and id="nav" attributes, so they decided that we need HTML tags to replace those ids. In other words, don't expect them to give you a HUGE HTML vocabulary. The aim is to keep it as simple as possible while covering the MOST commonly needed HTML tags.
The NAV tag is VERY important for accessibility as well. You want users of assistive technology to know where the navigation is, rather than forcing them to work out whether links are for navigation or not.
I disagree with adding extra tags. If detailed vocabulary were actually important, there could be a different tag name for every word in the dictionary. Additional tag names are not helpful: they may communicate additional meaning to humans, but they do nothing to facilitate machine parsing of the language. This is why I don't like the "semantic" tags in HTML5; I believe this is a slippery slope toward a vocabulary that is too complex while providing only a weak solution to a problem not fully addressed.
In my opinion, markup languages structure data, in the form of a tree, as much as they describe it. Through parsing of the structure and proper use of semantic conventions, such as RDFa, context can be leveraged to give specific meaning to otherwise generic tag names. In such a case excessive vocabulary need not exist, and structurally redundant tag names, such as footer and aside, could be eliminated. The final objective is to make content faster and more accurate to interpret by both humans and machines simultaneously, while using as little code as possible to achieve that result. How that is achieved is less important, except to HTML5.
I thought XML was the strategy to assign semantics to stuff.
As far as I know, no it wasn’t. XML allows new languages to be defined which are all parsed in the same way, because they all use the XML syntax.
It doesn’t, of itself, provide any way to add meaning (“semantic” just means “meaningful”) to those languages. And until computers get artificial intelligence, they don’t actually understand meaning, so meaning is just what is agreed between human beings. HTML is the most commonly-used language with agreed meaning of its tags.
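A small Python sketch of that distinction, using the standard library's xml.etree.ElementTree. The two vocabularies (recipe and invoice) and the helper tag_names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two made-up XML languages. Both happen to use a <title> element.
RECIPE = "<recipe><title>Soup</title><step>Boil water</step></recipe>"
INVOICE = "<invoice><title>Order 17</title><total>42</total></invoice>"


def tag_names(xml_text):
    """Parse any well-formed XML and list its element names in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]
```

The same parser handles both documents without knowing either language. What it returns is structure only: nothing in the result says that <title> means a dish name in one document and an order label in the other. That meaning exists only in whatever the people using each language have agreed on.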
As HTML is so common, it’s helpful to add a few meaningful tags to it that are quite general in their application. The new HTML5 tags are aimed at that. The HTML5 spec’s authors could indeed carry on down this route, creating tags for every specific bit of meaning possible, but as they’re not robots, they probably won’t.
<section> is useful, and general enough to be meaningfully applicable in lots of documents. <author-last-name> isn’t. Distinguishing between the two is a judgment call, which is why humans, and not computers, write the spec.
For custom semantics that are too specific to be added to HTML as tags, HTML5 defines microdata.
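To make that concrete, here is a hedged Python sketch: a snippet that marks up an author's name with microdata attributes (itemscope, itemtype, itemprop) instead of a hypothetical <author-last-name> tag, plus a deliberately simplified reader for it. The property names givenName and familyName come from the schema.org Person type. The class ItempropCollector and helper read_props are made up here, and they ignore most of the real microdata algorithm (nested items, values taken from attributes such as href or content).

```python
from html.parser import HTMLParser

# Generic <div>/<span> tags; the specific meaning lives in the attributes.
SNIPPET = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="givenName">Ada</span>
  <span itemprop="familyName">Lovelace</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Simplified: map each itemprop name to the text inside that element."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop whose text we are currently collecting

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._current = name
            self.props[name] = ""

    def handle_data(self, data):
        if self._current:
            self.props[self._current] += data

    def handle_endtag(self, tag):
        # Assumes itemprop elements contain plain text only (no nested tags).
        self._current = None


def read_props(html):
    collector = ItempropCollector()
    collector.feed(html)
    collector.close()
    return collector.props
```

The HTML vocabulary stays small, and the very specific semantics are layered on top by agreement on the property names.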
I've been reading Andy Clarke's book Transcending CSS (page 33).
"... it is now widely accepted that presentational names such as header, left, or red that describe an element's look or position are poor choices."
After reading these lines I asked myself: hey, aren't there elements in the HTML5 spec such as header and footer? Why is footer more semantic? In his book Andy advocates using site-info as the ID of the footer div, and this makes more sense IMHO. Footer is a presentational name (it describes the element's position).
In a word, AJAX. The new tags are meant to support what real-world developers are doing by replacing some of the <div class="sidebar-wrap"><div class="styling-hook"><div><ul class="nav"> type of divitis many websites suffer from. The only <div> left in the HTML5 is the styling hook.
The semantics that get promoted from classes to tags are those that developers have freely adopted en masse as best practices, given an extended XHTML/CSS adoption period. Check out the sections page of the WHATWG developer's edition of the spec here. The document itself is a pleasure, but I won't spoil it if you haven't seen it yet.
One of the less obvious reasons for some decisions made by the W3C is the importance of Webkit. If you look, you can see that they were better than some at taking the current work of the HTML5 Working Group and implementing ideas. They have historically been way out ahead in compliance (see here). The W3C placed a high priority on Webkit's users (i.e. Android, iPhone, the Googlebot, Chrome, Safari, Dreamweaver, etc.). Google, framework users, WordPress/Movable Type/Joomla! type users and others wanted self-contained building blocks, so this is the style we get.
Facebook is modular. Responsive design's grids are modular. Wordpress is modular. Ajax works best with modular page structures. Widgets are modules. Plug-ins are modules. It would seem that we should be trying to figure out stuff like how to apply these tags to make it easier to hook the appropriate elements and activate them in our document/application/info-network hybrid Web 2.0.
In closing, HTML5 can also be written as XML (the XHTML syntax; again, see the spec), which ensures that tools and machines making Ajax requests for a portion of a document will get a well-formed, useful response. How awesome in combination with things like media queries for devices like feed readers, braille printers, annotators, etc. I see a (near) future where anything with good semantic content is its own newsfeed automagically! This only happens if developers adopt and write compliant documents.

Did HTML's loose standards hurt or help the internet

I was reading O'Reilly's Learning XML book and read the following:
"HTML was in some ways a step backward. To achieve the simplicity necessary to be truly useful, some principles of generic coding had to be sacrificed. ... To return to the ideals of generic coding, some people tried to adapt SGML for the web ... This proved too difficult."
This reminded me of a StackOverflow Podcast where they discussed the poorly formed HTML that works on browsers.
My question is, would the Internet still be as successful if the standards were as strict as developers would want them to be now?
Lack of standard enforcement didn't hurt the adoption of the web in the slightest. If anything, it helped it. The web was originally designed for scientists (who generally have little patience for programming) to post research results. So liberal parsers allowed them to not care about the markup - good enough was good enough.
If it hadn't been successful with scientists, it never would have migrated to the rest of academia, nor from there to the wider world, and it would still today be an academic exercise.
But now that it's out in the wider world, should we clamp down? I see no incentive for anyone to do so. Browser makers want market share, and they don't get it by being pissy about which pages they display properly. Content sites want to reach people, and they don't do that by only appearing correctly in Opera. The developer lobby, such as it is, is not enough.
Besides, one of the reasons front-end developers can charge a lot of money (vs. visual designers) is because they know the ins and outs of the various browsers. If there's only one right way, then it can be done automatically, and there's no longer a need for those folks - well, not at programmer salaries, anyway.
Most of the ambiguity and inconsistency on the web today isn't from things like unclosed tags - it's from CSS semantics being inconsistent from one browser to the next. Even if all web pages were miraculously well-formed XML, it wouldn't help much.
The fact that HTML simply "marks up" text and is not a language with operators, loops, functions and other common programming language elements is what allows it to be loosely interpreted.
One could argue that this loose interpretation makes the markup language more accessible and easier to use, thus giving more "uneducated" people access to the language.
My personal opinion is that this has little to do with the success of the Internet. Instead, it's the ability to communicate and share information that makes the Internet "successful."
It hurt the Internet big time.
I recall listening to a podcast interview with someone who worked on the HTML 2.0 spec and IIRC there was a big debate at the time surrounding the strictness of parsers adhering to the standard.
The winners of the argument used the "a well implemented system should be liberal in what it accepts and strict in what it outputs" approach which was popular at the time.
AFAICT many people now regard this approach as overly simplistic - it sounds good in principle, but actually rarely works in practice.
IMO, even if HTML had been super strict from the outset, it would still have been simple enough for most people to grasp. Uptake might have been marginally slower at first, but a huge amount of time and money (billions of dollars) would have been saved in the medium to long term.
There is a principle that describes how HTML and web browsers are able to work and interoperate with any success at all:
Be liberal in what you accept, and conservative in what you output.
There needs to be some latitude between what is "correct" and "acceptable" HTML. Because HTML was designed to be "human +rw", we shouldn't be surprised that there are so many flavours of tag soup. Flexibility is HTML's strength wherever humans need to be involved.
However, that flexibility adds processing overhead which can be hard to justify when you need to create something for machine consumption. This is the reason for XHTML and XML: it takes away some of that flexibility in exchange for predictable input.
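The trade-off can be sketched in Python with two standard-library parsers. html.parser stands in for a lenient consumer and xml.etree.ElementTree for a strict one. The names TagLogger, lenient_tags and strict_parse_ok and the sample markup are made up for this illustration, and html.parser is only a tokenizer, not a model of how a real browser recovers from errors.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Tag soup: unclosed <p> elements and wrongly nested <b>/<i>.
TAG_SOUP = "<p>First paragraph<p>Second <b>bold<i>both</b> italic?</i>"
WELL_FORMED = "<p>First <b>bold</b> paragraph</p>"


class TagLogger(HTMLParser):
    """Lenient consumer: records every start tag and never complains."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_tags(text):
    """Return the start tags seen by the lenient HTML tokenizer."""
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.start_tags


def strict_parse_ok(text):
    """Return True only if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

The lenient side accepts the tag soup and leaves it to the consumer to decide what the author meant. The strict side rejects it outright, which is inconvenient for a human author but gives a machine consumer input it can rely on.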
If HTML had been stricter, something easier to use would have come along and generated the network effect needed for the Internet to become mainstream.