HTML5 and well-formedness

I'm looking into HTML5 and I'm puzzled why it goes so easy on well-formedness.
<div id="main">
<DIV ID="main">
<DIV id=main>
are all valid and produce the same result. I thought with XHTML we moved to XML compliant code at no cost (I don't count closing tags as a cost!). Now the HTML5 spec looks to be written by lazy coders and/or anarchists. The result is that from the start of HTML5 we have two versions: HTML5 and the XML compliant XHTML5. Would you consider it an asset if C would suddenly allow you to write a for construct in the following ways?
for(i = 0; i < 10; i++) {
for(i = o; i < 1o; i++) { // you can use "o" instead of "0"
for(i = 0, i < 10, i++) { // commas instead of semicolons are alright!
Frankly, as a longtime XHTML coder I feel a bit insulted by the HTML5 spec.
Wadya think?
Steven
edit:
Mind the "wadya": would you as a customer accept a letter with "wadya" written instead of "What do you"? :-)

HTML 5 is not an XML dialect like XHTML is.
What made HTML so popular was the fact that it tolerated mistakes, so just about anyone could write an HTML page.
XHTML made it much more difficult and never got widely adopted. At the same time, further development of HTML/XHTML stagnated, so an industry group, the WHATWG, formed and started work on the next generation of HTML, deciding to revert to a non-XML standard for HTML 5.
Since XML is stricter than HTML, you can always write your HTML to be XML compliant: keep element and attribute names lowercase, quote attribute values, give elements closing tags, and use correct XML escaping where needed.
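The three spellings from the question (`<div id="main">`, `<DIV ID="main">`, `<DIV id=main>`) really are treated alike by lenient parsers. A minimal sketch using Python's stdlib html.parser (the class name DivCollector is my own) shows all three normalizing to the same lowercase tag and attribute:

```python
from html.parser import HTMLParser

# Collect (tag, attrs) pairs emitted by Python's lenient stdlib parser.
class DivCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        self.seen.append((tag, dict(attrs)))

variants = ['<div id="main">', '<DIV ID="main">', '<DIV id=main>']
results = []
for markup in variants:
    p = DivCollector()
    p.feed(markup)
    results.append(p.seen[0])

# All three spellings, quoted or not, yield the identical result.
assert results[0] == results[1] == results[2] == ("div", {"id": "main"})
print(results[0])
```

Writing the XML-compliant first form costs you nothing, while a consumer that only accepts well-formed XML will reject the other two.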

HTML was never intended to convey media, and therefore never intended
for any kind of marketing or merchandising. HTML was only intended to
convey text and to provide some sort of descriptive structure for the
text it was describing. The people originally using HTML were professors
and scientists who needed the ability to describe their communications
in more depth than line breaks and quotes would allow. In other words,
HTML was only intended to be a document storage mechanism. Keep in mind
there were no web browsers at this time.
HTML was first made popular with the release of the web browsers.
Initially web browsers were just text parsers that provided a handy GUI
for navigating the hyperlinks between documents, but this changed
almost immediately. At this time there was still no actual standard for
HTML. There was the list of tags and a description of mechanisms
initially created for HTML, which, along with an understanding of SGML,
was all that was required to create an HTML parser.
With web browsers came the immediate demand to extend HTML in ways it
was never intended to be used. It was at this point that the inventors
and original users completely lost control of the web. Tags were added,
such as center and font, and tables became the primary mechanism for
laying things out on a page instead of describing data. Web browsers
supplied a media demand completely orthogonal to the intentions of HTML.
Marketing people, being what they are, care very much about the
appearance and expressive nature of communications and don't give a crap
about the technology which makes such communication possible. As a
result parsers became more lax to accommodate the incompetent. You have
to understand that HTML was already lax because there were no standard
parsing rules, and SGML, being so very obtuse, encourages a lax nature
outside of parsing instruction tags.
It's not that these early technology pioneers were stupid (although it's
easy to argue the contrary); they simply had other priorities. When the
web went mainstream there was an immediate obsession to conquer specific
business niches in this new medium. All costs were driven towards
marketing, market share, traffic acquisition, and brand awareness. Many
web businesses operate today with similar agendas, but today's web is
not a fair comparison. In the 90s marketing was all that mattered and
technology costs were absolutely ignored. The problem was so widespread
and the surge of investment so grand that it completely defied all
rational rules of economics. This is why there was an implosion. The
only web businesses that survived this crash were those that confronted
their technology costs up front or those who channeled investment monies
into technology expenses as opposed to additional marketing expense.
http://en.wikipedia.org/wiki/Dot-com_bubble
After the crash things changed. Consider the crash good timing, because
although it was entirely driven by bad business decisions, foolish
investments, and irrational economics, there were positive technology
developments going on behind the scenes. The founders of the web were
completely aware that they had lost all control of their technology.
They sought to solve this problem and set things straight by creating
the World Wide Web Consortium (W3C). They invited experts and software
companies to participate. Although solving many of the technology
problems introduced to the web by marketing-driven motivations was a
lost cause, many future problems could be avoided if the language were
implemented in accordance with an agreed-upon standard. It was during
this time that HTML 2 (the first standard form of HTML), HTML 3, and
HTML 4 were written.
At the same time the W3C also began work on XML, which was never
intended to be an HTML replacement. XML was created because SGML was too
complex; a simpler syntax based upon similar rules was needed. XML was
immediately written off by marketing people and immediately praised by
data evangelists at Microsoft and IBM. Because the holy wars around XML
were trivial, insignificant, and short-lived compared to the problems
plaguing HTML, XML's development occurred at rocket speed. Almost
immediately after XML was finalized, the first version of XML Schema
followed.
XML Schema was an extraordinary work that most people either choose to
ignore or take for granted. An abstraction model for accessing the
structure of HTML was also standardized based upon XML Schema, known as
the Document Object Model (DOM). It is important to note that the DOM
was initially developed by browser vendors to provide an API for
JavaScript to access HTML, but the standard DOM released by the W3C had
nothing to do with JavaScript directly. It quickly became obvious that
many of the technology problems plaguing HTML could be solved by
creating an XML compliant form of HTML. This is called XHTML.
Unfortunately, the path of adoption from HTML to XHTML was introduced in
a confused manner that is still not widely understood years after
clarification finally occurred.
So, there was a crash, and leading up to this period of economic
collapse there were some fantastic technology developments. The ultimate
source of technology corruption, the web browsers, were finally just
starting to innovate around adoption of the many fantastic technology
solutions dreamed up at the W3C, but with the crash came an almost
complete loss of development motivation from the browser vendors. At
this time there were really only Netscape, IE, and Opera. Opera was not
free software, so it was never widely adopted, and Netscape went under.
This essentially left only IE, and Microsoft pulled all their developers
off IE. Years later, development on IE would be revived when competition
arose from Firefox and when Opera adopted free licensing.
About the same time that browsers were coming back to life, the W3C was
moving forward with development of XHTML2. XHTML2 was an ambitious
project and was not related to XHTML1, which created much confusion.
The W3C was attempting to solve technology problems associated with HTML
that had been allowed to fester for too long, and their intentions were
valid and solid. Unfortunately, there was some contention in the XHTML2
working group. The failure to communicate how and why to transition from
HTML to XHTML, combined with the unrelated nature of XHTML2 and its
infighting, made people worry.
The marketing interference that allowed the web to crash regressed with
the web crash, but it did not die; it was reviving during this period
as well. Let's not forget that marketing motivations don't give a damn
about technology concerns. Marketing motivations are about instant
gratification. All flavors of XHTML, especially XHTML2, were an
abomination to instant gratification. XHTML2 would eventually be killed
after only a single draft was published. This fear and disgust led to
the establishment of a separate standards body whose interests were
aligned with moving HTML forward in the name of instant-gratification
silliness. This new group would call itself the WHATWG and would carry
the marketing torch forward.
The WHATWG was united, because its motivations were simple even if its
visions of the technology were ambitious: essentially, to make it easier
for developers to make things pretty and interactive, and to reduce
complexity around media integration. The WHATWG was also successful
because the web had contracted since the crash. There were fewer major
players around, and each had a specific set of priorities that were more
and more in alignment.
The web is a media channel and its primary business is advertising.
Web businesses that make money from advertising tend to be significantly
larger than web businesses that make money from goods or services. As a
result the priorities of the web would eventually become the priorities
of media and advertising distribution. For instance, why did JavaScript
become so much faster in the browser? Because Google, an advertising
company, made it a priority to release a web browser that was
significantly faster at processing JavaScript; to compete, other
browsers would need to become 20 to 30 times faster to keep up. This is
important because JavaScript is the primary means by which advertisement
metrics are measured, which is the basis of Google's revenue.
Since HTML5 is a marketing-friendly specification it allows a lax
syntax. Browser vendors are economically justified in spending more
money writing more complex parsing mechanisms for sloppy markup, because
it allows more rapid development of published media and thus deeper
penetration of advertising. This is economically qualified because all
five of the major web browsers available now are primarily funded by
advertising revenue. Unfortunately, this is nothing but cost for anybody
else who wants to write a parser, and it is limiting or harmful to any
later interpretation of structured data. The result is a lack of regard
for the technology and the rise of hidden costs that limit technology
innovation within the given medium.
This is why HTML syntax continues to be shit. The only solution is to
propose an alternate and technologically superior communication medium,
one that emphasizes a decentralization of the contracting market
concerns.

For natural parsing the quotes aren't necessary in the first place.
Regarding case, HTML elements are reserved regardless of case; for example, you can't define your own DiV or Div.
HTML is a markup language where speed and simplicity are a greater priority than consistency.
While arguable, this matters greatly to search engines; documents with quoted attributes and any kind of error are very expensive to process. Amusingly, the quoted example in the HTML docs has 'be evil' in quotes, as if to say that not using quotes is not being evil.

Better that the spec allows it than that it forbids it while everyone does it anyway and browsers have to error-correct.
XHTML never really took off, not least because MSIE never supported it (pretending it was HTML by sending a text/html content type notwithstanding).

Honestly, your question answers itself. "We have two different specs." Each spec addresses a different level of conformance, and they do so for a reason. As much as we might loathe the notion of "backwards compatibility," it's a burden we have to bear, and HTML5 is far better at maintaining it than XHTML5 will ever be.


Practically speaking, why semantic markup?

Does Google really care if I use an <h5> as a <b> tag?
What are some real-world, practical reasons I should care about semantic markup?
A few examples
Many visually impaired people rely on speech browsers to read pages back to them. These programs cannot interpret pages very well unless they are clearly explained. In other words, semantic code aids accessibility.
Search engines need to understand what your content is about in order to rank you properly.
Semantic code tends to improve your placement on search engines, as it is easier for the "search engine spiders" to understand.
However, semantic code has other benefits too:
Semantic code is shorter and so downloads faster.
Semantic code makes site updates easier because you can apply design style to headings across an entire site instead of on a per page basis.
Semantic code is easier for people to understand too so if a new web designer picks up the code they can learn it much faster.
Because semantic code does not contain design elements it is possible to change the look and feel of your site without recoding all of the HTML.
Once again, because design is held separately from your content, semantic code allows anybody to add or edit pages without having to have a good eye for design.
You simply describe the content and the cascading style sheet defines what that content looks like.
Source: boagworld
Semantics and the Web
Semantics are the implied meaning of a subject, like a word or sentence. It aids how humans (and these days, machines) interpret subject matter. On the web, HTML serves both humans and machines, suggesting the purpose of the content enclosed within an HTML tag. Since the dawn of HTML, elements have been revised and adapted based on actual usage on the web, ideally so that authors can navigate markup with ease and create carefully structured documents, and so that machines can infer the context of the wonderful collection of data we humans can read.
Until — and perhaps even after — machines can understand language and all its nuances at the same level as a human, we need HTML to help machines understand what we mean. A computer doesn’t care if you had pizza for dinner. It likely just wants to know what on earth it should do with that information.
HTML semantics are a nuanced subject, widely debated and easily open to interpretation. Not everyone agrees on the same thing right away, and this is where problems arise.
Allow me to paint a picture:
You are busy creating a website.
You have a thought, “Oh, now I have to add an element.”
Then another thought, “I feel so guilty adding a div. Div-itis is terrible, I hear.”
Then, “I should use something else. The aside element might be appropriate.”
Three searches and five articles later, you’re fairly confident that aside is not semantically correct.
You decide on article, because at least it’s not a div.
You’ve wasted 40 minutes, with no tangible benefit to show for it.
— Divya Manian
This generated a storm of responses, both positive and negative. In "Pursuing Semantic Value", Jeremy Keith argued that being semantically correct is not fruitless, and he even gave an example of how <section> can be used to adjust a document's outline. He concludes:
But if you can get past the blustery tone and get to the kernel of the article, it’s a fairly straightforward message: don’t get too hung up on semantics to the detriment of other important facets of web development.
— Jeremy Keith
Naming Things
Of all the possible new element names in HTML5, the spec is pretty set on things like <nav> and <footer>. If you’ve used either of those as a class or id in your own markup, it’s no coincidence. Studies of the web from the likes of Google and Opera (amongst others) looked at which names people were using to hint at the purpose of a part of their HTML documents. The authors of the HTML5 spec recognised that developers needed more semantic elements and looked at what classes and IDs were already being used to convey such meaning.
Of course, it isn’t possible to use all of the names researched, and of the millions of words in the English language that could have been used, it’s better to focus on a small subset that meets the demands of the web. Yet some people feel that the spec isn’t yet doing so.
Source: html5doctor (This goes on for quite a while so I've only put a few examples here.)
Hope this helps!

Is HTML5 a programming language?

Nowadays we can use HTML5 to make apps on Android, Firefox OS, iPhone, BlackBerry and others. But I heard that HTML is a markup language, not a programming language.
Even with app features, does HTML continue to be only a markup language?
Programming languages have certain features, like branching, looping, that sort of thing, that HTML5 lacks. HTML5 defines markup for some interactive features, but the markup is almost entirely static (there's some interaction implied in the definition of select elements and such). A lot of "HTML5" features you hear about aren't HTML5 at all, but rather things you can do with JavaScript (a programming language) in a modestly-capable browser.
HTML5 is increasingly taking over (or has taken over) the role of defining both the structure of web pages and the API for interacting with them from a programming language. Those used to be quite separate, in the DOM specs, but a lot of that is now being folded into the HTML5 specification. But again, that's just defining APIs. The actual coding using those APIs requires (in almost all cases) an actual programming language.
Short Answer: No.
Long Answer: No, it isn't. HTML as defined by the standard is just a markup language, exactly as it was in its previous versions.
But what does that mean? It means that it is supposed to structure your data, allowing you also to define semantics with the use of markers, but it cannot process or modify your data as you would with a programming language. It also has no concept of input or output, as programming languages do, where you get an input to analyze and produce an output.
By the way, HTML5 is coming out alongside a wider interest in the web and stronger technologies (such as newer versions of JavaScript and CSS) which make new web applications even more powerful.
Please, read this great resource to learn more about HTML5.
HTML5 is considered a technology.
Yes, there is a 5th release of the HTML markup language, but probably you didn't mean that.
HTML5 is more often considered to be a technology stack including HTML, CSS3 and JavaScript, and most of all their support in tools like browsers. So as a matter of fact it can be considered something that requires programming.
"Programming" does not mean a Turing-complete language. It's a linguistic matter: to program means to plan something, and this HTML does very well.
program (n.)
1630s, "public notice," from Late Latin programma "proclamation, edict," from Greek programma "a written public notice," from stem of prographein "to write publicly," from pro "forth" (see pro-) + graphein "to write" (see -graphy).
The meaning "written or printed list of pieces at a concert, playbill" is recorded by 1805 and retains the original sense. The sense of "broadcasting presentation" is from 1923.
The general sense of "a definite plan or scheme, method of operation or line of procedure prepared or announced beforehand" is recorded from 1837. The computer sense of "series of coded instructions which directs a computer in carrying out a specific task" is from 1945.
The sense of "objects or events suggested by music" is from 1854 (program music is attested by 1877). Spelling programme, established in Britain, is from French in modern use and began to be used early 19c., originally especially in the "playbill" sense.
source

HTML 5 "how to recover from errors"

I read the following in the HTML 5 tag reference at W3Schools:
HTML5 improves interoperability and reduces development costs by
making precise rules on how to handle all HTML elements, and how to
recover from errors.
While I understand that there are some attributes like "pattern" and "required", are they talking about the same thing? Do they mean form validation when they mention "recovery from error"?
If not, what HTML 5 elements/tags are they referring to which helps "recovering from error"?
Thanks
From the source: http://dev.w3.org/html5/spec/Overview.html#an-introduction-to-error-handling-and-strange-cases-in-the-parser
HTML 5 defines a standard for the handling of specific exceptional situations.
Why it's important
I have written a few HTML parsers for commercial use and, while by no means an expert on the subject, I know firsthand how painful it can be to deal with malformed content. As hard as developers try (or fail to try), many major sites have poor, non-standard markup. Content management systems driven by non-technical users only compound the problem, as most WYSIWYG editors don't produce perfect markup.
So what do you do? You make assumptions and you relax the rules, rather than failing the whole process or rendering radically incorrect content, when you know that was probably not the intention of the developer.
The HTML spec (version 5 and previous) defines rules for how user agents should handle the rendering of content. To my knowledge, the HTML 5 spec has the richest definition of how exceptional cases should be handled.
If all user agents (browsers) treat exceptional cases the same, you achieve consistency while still allowing for the inevitable human error. That said, I wish more people would take the warnings on validator.w3.org seriously (or at least read them!)
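To see that "keep going rather than fail" behavior in miniature, here is a sketch using Python's stdlib html.parser. Note it is only a lenient tokenizer, not an HTML5 tree builder, so it illustrates the tolerance, not the spec's exact recovery rules; the class name EventLogger is my own:

```python
from html.parser import HTMLParser

# Log every event the lenient parser can recover from sloppy markup.
class EventLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

# Unclosed <li>s, an unquoted attribute, and a stray </b>:
# the parser reports what it sees and never aborts.
p = EventLogger()
p.feed("<ul class=menu><li>Home<li>About</b></ul>")
print(p.events)
```

The HTML5 spec goes further than this: it prescribes exactly which DOM tree every conforming user agent must build from input like the above, so recovery is consistent across browsers rather than vendor-specific.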
FWIW, most people on this site (myself included) don't trust w3schools as a reference.

HTML 5 - Early Adoption Where Possible - Good or Bad?

This question was inspired a bit by this question, in which the most upvoted answer recommended using a feature from HTML 5. It certainly seemed to be a good method to me, but it made me curious about using features from a future spec in general.
HTML 5 offers a lot of nice improvements, many of which can be used without causing problems in current browsers.
Some examples:
<!-- new, simple HTML5 doctype (puts browsers in standards mode) -->
<!DOCTYPE html>
<!-- new input types, for easy, generic client side validation -->
<input type="email" name="emailAddress"/>
<input type="number" name="userid"/>
<input type="date" name="dateOfBirth"/>
<!-- the boolean "required" attribute indicates that a field is required -->
<input type="text" name="userName" required/>
<!-- new 'data-' prefixed attributes -->
<!-- for easy insertion of js-accessible metadata in dynamic pages -->
<div data-price="33.23">
<!-- -->
</div>
<button data-item-id="93024">Add Item</button>
Many of these new features are designed to make it possible for browsers to automatically validate forms, as well as give them better inputs (for example a date picker). Some are just convenient and seem like a good way to get ready for the future.
They currently don't break anything (as far as I can tell) in current browsers and they allow for clean, generic clientside code.
However, even though they are all valid in HTML 5, they are NOT valid for HTML 4, and HTML 5 is still a draft at this point.
Is it a good idea to go ahead and use these features early?
Are there browser implementation issues with them that I haven't realized?
Should we be developing web pages now that make use of HTML 5 draft features?
There are several things to consider:
First, validation doesn't mean that much, because an HTML page can very well be valid but badly authored, inaccessible, etc. See Say no to "Valid HTML" icons and Sending XHTML as text/html Considered Harmful (in reference to the hobo-web tests mentioned in another response)
Given this, I'd highly recommend using the new DOCTYPE: the only reason it exists in HTML5 is that it's the smallest string that triggers standards mode in browsers, so if you want standards mode, go with it; you have little to no reason to use another, verbose, error-prone DOCTYPE.
As for the forms enhancements, you can use Weston Ruter's webforms2 JS library to bring them to non-aware browsers.
And finally, about the data-* attributes: they a) work in all browsers (as long as you use getAttribute()), b) are still better than abusing the title or class attributes, and c) won't bother you with validation, since, as said earlier, validation isn't that important (of course it is, but it doesn't matter that your page is invalid if the validity errors are willful; and you can already use HTML5 validation in the W3C validator). So there's no real reason not to use them either.
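As a rough illustration of how little machinery data-* attributes require outside the browser, this Python sketch (the class name DataAttrs is my own) extracts the same values that element.getAttribute('data-price') would return in a browser:

```python
from html.parser import HTMLParser

# Harvest every data-* attribute from a fragment of markup.
class DataAttrs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("data-"):
                self.found[name] = value

p = DataAttrs()
p.feed('<div data-price="33.23"><button data-item-id="93024">'
       'Add Item</button></div>')
print(p.found)
```

Because the convention is just "any attribute whose name starts with data-", any tool that can read attributes at all can consume this metadata, which is exactly why it beats overloading title or class.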
Good question!
In short: it depends on your context, and risk tolerance :)
Slightly longer:
I think it's always good to push the envelope on early adoption of technology. It gives you an advantage over late-comers in the commercial world, and also gives you much more leverage in influencing the technology as it emerges.
If you don't want to have to re-write code, or update your source, then early adoption may not be for you. It's perfectly respectable to want to write solid, stable code that never has to change, but it's entirely up to you (and your business context)
If your page relies heavily on search engine placement, it may be worth considering that some engines give priority to validating HTML (Source: http://www.hobo-web.co.uk/seo-blog/index.php/official-google-prefers-valid-html-css/).
Also, it is worth considering that while relying on the new date input elements (such as those in Opera, possibly others) is more convenient for the developer, it typically precludes including more complex JavaScript controls which would better serve older browsers (typically falling back to a simple text input field).
Of course and as always, don't rely on browser side checks and validate all input server side.
Please don't use the new features before you can test them in at least one browser. For example, if you use the new form features, be sure to test in Opera. Otherwise, you'll likely do more harm than good by contributing to a poisoned legacy out there.
When a feature is already implemented in browsers and you are testing with those browsers, sure, please use the new features.
See also an older answer.
See the Robustness principle:
In RFC 761 (Transmission Control Protocol, 1980) American computer scientist Jon Postel summarized earlier communications of desired interoperability criteria for the Internet Protocol (cf. IEN 111, RFC 760) as follows:
TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
So, imho, no.
I will not implement new features from HTML5 until they at least have support from all major browsers.
Clients don't care if your page is valid; they care much more whether it works cross-browser. Even if we fight to implement the latest standards, there will still be clients and companies that will never shed their IE6, and IE6 will be on their browser requirements list for a while yet.
The new form types are welcome; nevertheless, forms still have to be validated on the server side.
Porting existing documents to HTML5 will require a lot of effort and adaptation and, in my estimate, will not happen overnight. Expect at least 3 years until it hits the mainstream.
I would use HTML 5 just for fun and learning, but I definitely wouldn't touch any of my production code (existing code) with this new standard, at least for now and until I have a valid reason to support this move.

Did HTML's loose standards hurt or help the internet

I was reading O'Reilly's Learning XML book and read the following:
HTML was in some ways a step backward. To achieve the simplicity necessary to be truly useful, some principles of generic coding had to be sacrificed. ... To return to the ideals of generic coding, some people tried to adapt SGML for the web ... This proved too difficult.
This reminded me of a StackOverflow Podcast where they discussed the poorly formed HTML that works on browsers.
My question is, would the Internet still be as successful if the standards were as strict as developers would want them to be now?
Lack of standard enforcement didn't hurt the adoption of the web in the slightest. If anything, it helped it. The web was originally designed for scientists (who generally have little patience for programming) to post research results. So liberal parsers allowed them to not care about the markup - good enough was good enough.
If it hadn't been successful with scientists, it never would have migrated to the rest of academia, nor from there to the wider world, and it would still today be an academic exercise.
But now that it's out in the wider world, should we clamp down? I see no incentive for anyone to do so. Browser makers want market share, and they don't get it by being pissy about which pages they display properly. Content sites want to reach people, and they don't do that by only appearing correctly in Opera. The developer lobby, such as it is, is not enough.
Besides, one of the reasons front-end developers can charge a lot of money (vs. visual designers) is because they know the ins and outs of the various browsers. If there's only one right way, then it can be done automatically, and there's no longer a need for those folks - well, not at programmer salaries, anyway.
Most of the ambiguity and inconsistency on the web today isn't from things like unclosed tags - it's from CSS semantics being inconsistent from one browser to the next. Even if all web pages were miraculously well-formed XML, it wouldn't help much.
The fact that HTML simply "marks up" text and is not a language with operators, loops, functions and other common programming language elements is what allows it to be loosely interpreted.
One could argue that this loose interpretation makes the markup language more accessible and easily used, thus allowing more "uneducated" people access to the language.
My personal opinion is that this has little to do with the success of the Internet. Instead, it's the ability to communicate and share information that makes the internet "successful."
It hurt the Internet big time.
I recall listening to a podcast interview with someone who worked on the HTML 2.0 spec and IIRC there was a big debate at the time surrounding the strictness of parsers adhering to the standard.
The winners of the argument used the "a well implemented system should be liberal in what it accepts and strict in what it outputs" approach which was popular at the time.
AFAICT many people now regard this approach as overly simplistic - it sounds good in principle, but actually rarely works in practice.
IMO, even if HTML was super strict from the outset, it would still have been simple enough for most people to grasp. Uptake might have been marginally slower at the outset, but a huge amount of time/money (billions of dollars) would have been saved in the medium-long term.
There is a principle that describes how HTML and web browsers are able to work and interoperate with any success at all:
Be liberal in what you accept, and conservative in what you output.
There needs to be some latitude between what is "correct" and "acceptable" HTML. Because HTML was designed to be "human +rw", we shouldn't be surprised that there are so many flavours of tag soup. Flexibility is HTML's strength wherever humans need to be involved.
However, that flexibility adds processing overhead which can be hard to justify when you need to create something for machine consumption. This is the reason for XHTML and XML: it takes away some of that flexibility in exchange for predictable input.
If HTML had been more strict, something easier would have come along instead and generated the network effect needed for the internet to become mainstream.