Since HTML and XML both describe the structure of data, is the only difference between them, from a technical point of view, that an HTML tag has a predefined CSS style? Am I correct?
Not really.
HTML, for historical reasons, does not conform to XML syntax requirements (unless you're talking about XHTML). The most obvious example is the <br> tag, which does not have a closing tag. Also, browsers are extremely forgiving in what they accept and will try to do something halfway meaningful even if the HTML is not valid. This is in stark contrast to XML parsers, which will reject any XML that is not well-formed.
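To make the contrast concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in for a strict XML parser and html.parser as a stand-in for a lenient HTML tokenizer. Neither is a browser, and the helper names (is_well_formed_xml, html_start_tags) are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML tokenizer that only records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


snippet = "<p>first line<br>second line</p>"
print(is_well_formed_xml(snippet))   # the unclosed <br> is rejected as XML
print(html_start_tags(snippet))      # the HTML tokenizer carries on regardless
print(is_well_formed_xml("<p>first line<br/>second line</p>"))
```

The same markup that the XML parser refuses is tokenized without complaint on the HTML side; only the explicitly closed <br/> variant is well-formed XML.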
You are correct that browsers implement a default CSS stylesheet, but there are subtle differences between browsers and between different versions of browsers, so many frameworks clear all defaults and re-specify the CSS for every element.
I really like to use the "short closing" <tag/> format for tags, but unfortunately using it in a browser (e.g. Chrome) causes quite unexpected behavior.
When my document contains:
<div/><div/>
it's interpreted as
<div>
  <div></div>
</div>
No matter which DOCTYPE I use (XHTML or HTML5), it is parsed the wrong way.
I'm also using this notation for custom tags in a namespace, e.g. <widget:aSampleWidgetA/> <widget:aSampleWidgetB/>, which run into the same problem.
I don't want to use the full closing notation because it makes a lot of visual mess in the code.
Is there some way to force the browser to parse those tags as proper XML?
Apologies, I can't find great documentation on this, but I suspect it is because a div is not a valid self-closing tag. In the XHTML DTD, empty tags are specifically marked as EMPTY and div is not, so Chrome instead behaves as if the document is HTML5, where closing tags can be left off, and takes a best guess as to where to close them.
Alternatively, if you don't like the look of html, perhaps you might prefer something like haml or jade templates.
There is a way to make browsers (except IE8 and below) parse the markup as XML. You need to serve it with the proper XHTML content type, application/xhtml+xml.
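As a rough illustration of "serve it with the proper content type", here is a minimal sketch using Python's built-in http.server. The class name XHTMLHandler, the serve helper, the port and the .xhtml extension are assumptions made for this example; a real deployment would configure the equivalent setting in whatever web server it uses.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class XHTMLHandler(SimpleHTTPRequestHandler):
    """Static file handler that labels .xhtml files as real XHTML."""

    # Copy the default extension map and make the XHTML mapping explicit,
    # so browsers pick their XML parser instead of the HTML one.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".xhtml": "application/xhtml+xml",
    }


def serve(port=8000):
    """Serve the current directory on localhost (blocks until interrupted)."""
    ThreadingHTTPServer(("127.0.0.1", port), XHTMLHandler).serve_forever()
```

Calling serve() and opening a page saved with the .xhtml extension should then send Content-Type: application/xhtml+xml, which is what switches the browser to XML parsing.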
The doctype is irrelevant for parsing; it affects only the rendering mode (Standards or Quirks). When served as text/html, all pages are parsed by HTML rules (HTML5 rules in modern browsers), which effectively means that the trailing slash in the "self-closing" syntax is simply ignored, and whether an element can be "self-closed" is hard-coded in the parser. Divs and custom tags don't have this ability.
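The effect can be modelled with a toy tree builder. This is only a sketch of the one rule described above (ignore the slash unless the element is hard-coded as empty), not the real HTML5 tree-construction algorithm; the names SlashIgnoringBuilder, html_outline and xml_outline are invented for the example, and Python's html.parser and xml.etree.ElementTree stand in for the browser's two parsers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements the HTML parser treats as always empty (the HTML void elements).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class SlashIgnoringBuilder(HTMLParser):
    """Toy builder: the slash in <tag/> is ignored unless the tag is void."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # stays open until an explicit end tag

    def handle_startendtag(self, tag, attrs):
        # Model text/html parsing: <div/> is handled exactly like <div>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def outline(node):
    """Leaf elements become 'tag'; elements with children become (tag, [...])."""
    return [
        child.tag if not child.children else (child.tag, outline(child))
        for child in node.children
    ]


def html_outline(markup):
    builder = SlashIgnoringBuilder()
    builder.feed(markup)
    builder.close()
    return outline(builder.root)


def xml_outline(markup):
    return [child.tag for child in ET.fromstring(markup)]


print(html_outline("<div/><div/>"))              # second div nested in the first
print(html_outline("<br/><br/>"))                # void elements stay siblings
print(xml_outline("<root><div/><div/></root>"))  # XML: two sibling divs
```

In this model the two divs end up nested when parsed by the HTML-style rule, while the XML parser sees two siblings, which matches the behaviour described in the question.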
http://www.w3schools.com/tags/tag_doctype.asp
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
What standard is HTML5 based on, if not SGML?
The HTML5 standard specifies two serializations of HTML5: "html" and "xml". "xml" is a valid XML serialization (and XML is in turn a subset of SGML). "html" is no longer based on any specific serialization standard; it has its own complete serialization. Herein lies the difference: HTML4 has an "sgml" serialization and an "xml" serialization (called XHTML 1.0).
Of course HTML5 is for a large part based on HTML4 (based on SGML) and XHTML (based on HTML4 and XML).
Also see the history section of the HTML5 specification
What is the HTML 5 standard based on?
It is based on what browsers actually do.
In 2002-2005 Ian Hickson went through every browser, and found every parsing edge case for the DOM tree they create when presented with some HTML.
For Example
For example, what should the DOM tree of this (invalid) HTML be:
<!DOCTYPE html><em><p>XY</p></em>
Browsers seemed to agree on the tree:
DOCTYPE: html
HTML
  HEAD
  BODY
    EM
      P
        #text: XY
Even though it is invalid HTML, browsers were happy to parse it into what you meant. The last thing your browser should do is refuse to display what is perfectly understandable HTML.
Now what about this invalid html:
<!DOCTYPE html><em><p>X</em>Y</p>
IE: Y is a child of both p and body. This violates the DOM spec (a node is supposed to have only one parent), but is what the author of the HTML wanted.
Opera: Makes a valid DOM tree, but X isn't emphasised - violating CSS spec.
Mozilla and Safari: make it a valid DOM tree, but Y isn't emphasised (which is what the author wanted)
DOCTYPE: html
HTML
  HEAD
  BODY
    EM
    P
      EM
        #text: X
      #text: Y
Which means that different browsers had different ideas on how to handle HTML (hence the need for an HTML standard).
A parser can't say:
Well, HTML is supposed to be a subset of SGML. And if your HTML isn't well-formed, then the results are undefined.
Not good enough
The web needs a standard to reflect how browsers should parse HTML. The W3C wasn't doing it. They hated HTML, and wanted everyone to move to their beautiful XML-ified version of HTML: XHTML.
The HTML5 standard is meant to be used in the real world. It needs to define how browsers should handle HTML that is not well-formed. It was based on a survey of all existing implementations, choosing either what the consensus is or what the consensus should be.
Which brings us to HTML5
The HTML5 spec lays it out quite plainly:
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
In other words (and they also say this):
An HTML5 parser is any parser that follows the parsing rules of HTML5
HTML5 has no grammar. There is no regex, lexer, BNF, or EBNF you can use to parse HTML.
In order to correctly parse HTML to the HTML5 standard, you must implement the (very meticulously detailed) algorithm described in the HTML5 standard.
And if your parser doesn't handle invalid HTML: then that's the fault of your parser.
I've tested a custom tag <oles-tag> ... </oles-tag> in Chrome and IE9.
I use HTML5 doctype <!DOCTYPE html>.
HTML5 does NOT support custom tags. The code won't validate, but browsers parse it anyway. I can even target it with CSS...
Why do browsers parse custom tags when they're not standardized, valid code?
And why shouldn't I just use custom tags for the sake of semantic code?
Why do browsers parse custom tags when they're not standardized, valid code?
To make it forward-compatible. Just imagine if it was impossible for you to style <article> elements in old browsers because <article> didn't exist when those old browsers were written. That'd be terrible, wouldn't it? glares at IE
And why shouldn't I just use custom tags for the sake of semantic code?
Because no one else (programs) recognizes those tags, therefore they're not semantic. The reason why elements like <article> are considered semantic is because they have an established use. When you use a custom element that doesn't have an established use, it could be interpreted in a number of ways, leading to inconsistency among programs. glares at <b> and <i>
Browsers tend to be lenient with markup. This is partly rooted, historically, in the difficulty of adopting the complex SGML syntax.
There was a movement toward strictness in the late 90s, resulting in the creation of XHTML, where every mistake results in a catastrophic failure. If you prefer strictness, there seems to be a version of XHTML adapted for HTML5.
XHTML has another interesting feature: you can define and use custom tags all you want. In fact, this was one of the two major reasons for its development.
I have tried to find an answer to this in the W3C HTML specifications, but haven't had any luck so far.
For example, if I have the following HTML code:
<body>
<p>
<foo>bar</foo>
</p>
</body>
Does the W3C specify how a user agent should handle this? E.g., should the "foo" element be completely ignored? Should the "foo" element be ignored but the content "bar" parsed?
Also, is it even "legal" to do this?
Edit: Some excellent answers from all of you! I totally agree that it would be bad practice to embed generic XML unless, possibly, if you have complete control over which browser your users will use. I was mostly curious about what actually would or should happen if such markup were to be produced :-)
The HTML spec doesn't say much about it, other than:
The HTMLUnknownElement interface must be used for HTML elements that are not defined by this specification (or other applicable specifications).
This can be verified in conforming browsers using the following JavaScript code in the console:
Object.prototype.toString.call(document.createElement("foo"));
//-> "[object HTMLUnknownElement]"
However, some browsers don't follow the specification here yet. For instance, Chrome 13 gives [object HTMLElement] and IE 8 gives [object HTMLGenericElement] (IE 9 is correct).
As far as I'm aware, all browsers will parse <foo> as an element, but default styling and behaviour is not guaranteed to be the same. Where HTMLUnknownElement is implemented and the spec is followed, it should inherit directly from HTMLElement and, therefore, have many of the default properties found on other elements.
Please note that your HTML will not validate when you have non-standard elements in your markup. It's also worth mentioning that search engine crawlers, screen readers and other software will not be able to extract semantic meaning from these elements.
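For the question's snippet, the "parsed as an element, content kept" behaviour can be sketched with Python's lenient html.parser. This is only a tokenizer from the standard library, not a browser, so it says nothing about styling or DOM interfaces such as HTMLUnknownElement; the EventLog class and parse_events helper are names invented for this example.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record start tags, end tags and non-blank text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def parse_events(markup):
    log = EventLog()
    log.feed(markup)
    log.close()
    return log.events


for event in parse_events("<body><p><foo>bar</foo></p></body>"):
    print(event)
```

The unknown <foo> shows up as an ordinary start/end pair and its text content "bar" is still delivered, rather than the element or its content being dropped.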
Further reading:
Why generic XML on the web is a bad idea and 386: Generic Elements; Still a Bad Idea - Anne van Kesteren's blog (2005, 2010)
Some excellent advice from @Andy E. These are just a few additions to that.
The HTML5 draft does define how to parse unknown elements, however, it is distinctly non-trivial. To see the rules, see http://dev.w3.org/html5/spec/tree-construction.html
Note that the first version of Firefox to use these rules is Firefox 4, and the first version of IE to use them is IE 10. Older versions have a number of different and often very strange behaviours.
HTML has no notion of "legality", only validity or conformance to a standard. You are free to decide whether you want your pages to conform to any particular standard or not. There is no W3C standard of HTML where use of arbitrarily named elements is conforming.
It is generally advisable to make your HTML conforming to avoid unpredictable errors in browsers and other HTML consumers that you haven't tested against.
"bar" should definitely be rendered. For example, in the HTML5 video element, the contents of the element contain fallback content to be displayed in older browsers for exactly this reason. It's also why people traditionally put comments around style declarations:
<style><!--
(styling goes here)
--></style>
to hide the styling information from pre-HTML 4 browsers. (I think the comments aren't considered good practice any more.)
Since IE won't render XHTML as XHTML, but treats it as HTML instead, when can this actually cause problems in IE?
I know of one case:
<div style="clear:both" />
In browsers that support XHTML, the div is closed. But IE will treat the div as still open, so the layout can have unexpected results later.
Internet Explorer will have trouble distinguishing XHTML documents from XML documents if the MIME type is not specified as text/html. However, because it fully supports HTML 4.01, the majority of problems arise from inconsistent and non-standard implementations of positioning, layout, and CSS properties. To avoid any problems it is best to write valid XHTML and specify a DOCTYPE.
A list of all known Internet Explorer Bugs
Self-closing syntax won't work (it will appear to work only on elements that are always empty in HTML). XML serializers might generate <textarea/>, <script/> and similar, which break pages in various ways (triggering complicated error recovery, sometimes involving re-parsing of remainder of the page).
Explicitly closed HTML "empty" elements might behave oddly (</br> inserts a break in IE).
<![CDATA[ outside HTML's hardcoded CDATA elements will be recognized as a tag. It won't affect escaping and might make some content disappear.
In HTML's CDATA elements (namely <script>), entities won't be recognized. XHTML requires <script> if (1 &lt; 2) … which is going to be a syntax error in IE.
Background of <body> will be applied differently in IE.
There will be no cross-browser syntax for namespace-aware selectors in CSS.
You'll get all implied HTML elements (e.g. <tbody> in all tables) and implicitly closed elements (it's usually not a problem when the document is valid, but other browsers won't warn you as long as the markup is well-formed).
Elements and attributes with prefixes won't be namespaced and will get different tagName in IE (which is also illegal in XML). They won't get appropriate default styling and behaviour either (<xhtml:a> can't be a link).
You won't be able to use namespace-aware methods like createElementNS (they don't exist in IE). Also, .tagName will be uppercase in IE, but not in all cases.
Elements and attributes with prefixes won't be namespaced and will get different local name in IE (which is also illegal in XML).
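The point in the list above about entities inside <script> can be sketched with Python's standard library, using xml.etree.ElementTree as a stand-in for XML parsing and html.parser as a stand-in for text/html parsing. Neither is IE or any other browser, and the helper names (script_text_as_xml, script_text_as_html) and the run() call in the sample script are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ScriptText(HTMLParser):
    """Collect the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def script_text_as_html(markup):
    parser = ScriptText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def script_text_as_xml(markup):
    return ET.fromstring(markup).text


markup = "<script>if (1 &lt; 2) { run(); }</script>"
print(script_text_as_xml(markup))   # entity decoded: valid JavaScript
print(script_text_as_html(markup))  # entity left as-is: not valid JavaScript
```

Parsed as XML, the script body contains a real < character; parsed by HTML rules, the script engine would receive the literal text &lt;, which is where the syntax error comes from.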
These are only the problems of switching from a working XML document to HTML. There are just as many surprises when you go from HTML (i.e. what everyone expects and takes as normal behavior) to real XML; e.g. document.write doesn't work, rendering most of Google's scripts useless.
These all apply to any browser treating XHTML as text/html rather than specifically IE, but you should read Appendix C of the XHTML 1.0 spec here: http://www.w3.org/TR/xhtml1/#guidelines