HTML with entity (can't get rid of ]>)

A very simple HTML file. I deliberately included all the required attributes, even though that may be overkill. (Actually, &eacute; is recognised by practically all browsers without an explicit declaration, but this is just an example to highlight the problem):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY eacute "&#233;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Test HTML with an entity</title>
</head>
<body lang="en">
<h1>R&eacute;sum&eacute;</h1>
</body>
</html>
When I open it in a browser (I tried Firefox, Chrome, IE and Android WebView), it always comes up as
]>
Résumé
and I can't see a reason why ]> appears. Of course, if I remove ]> from the DOCTYPE, everything appears all right,
but in this case my html is not a valid xml file, so it gives an error when opened in DOM.
Any suggestions?

What you are doing is correct as per XML rules, and it actually works in browsers that support XML, when served as XML; cf. How do I define HTML entity references inside a valid XML document?
The problem is that if the document is opened as legacy HTML in a browser, it will be processed by legacy HTML principles. This means, among other things, that an internal DTD subset (the thing you have in brackets in the DOCTYPE declaration) is not parsed by the book; instead, when processing a DOCTYPE string, browsers end with the first > character, and the rest will be consumed as character data.
So the problem isn’t just the ]>. The construct does not work at all, i.e. no entity is defined. In the example, the “é” character is displayed, but only because &eacute; is predefined in HTML. If you tried defining <!ENTITY foo "é"> and using &foo;, you would see &foo; literally.
If your document will be processed as legacy HTML, you cannot define entity references. Apparently it currently is, since the example document does not display at all when processed as XML (it is not well-formed, so only a syntax error message appears).
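The difference between the two processing modes can be sketched in Python. This is only an illustration: legacy_doctype_split is a hypothetical toy model of the "doctype ends at the first >" behaviour described above, not a real HTML tokenizer, and the shortened document bodies are assumptions. The second half uses the standard library's xml.etree.ElementTree as a stand-in for a browser's XML mode.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# The question's DOCTYPE, followed by a shortened body (illustrative only).
LEGACY_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [\n'
    '<!ENTITY eacute "&#233;">\n'
    ']>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>'
)


def legacy_doctype_split(markup):
    """Toy model of legacy HTML: the doctype ends at the first '>'."""
    end = markup.index(">") + 1
    return markup[:end], markup[end:]


doctype, leftover = legacy_doctype_split(LEGACY_DOC)
# Whatever follows the first '>' is treated as ordinary content,
# which is where the stray "]>" on the page comes from.
print(repr(leftover[:4]))

# In XML mode the internal subset is honoured, so a custom entity works.
XML_DOC = (
    '<!DOCTYPE html [\n'
    '<!ENTITY foo "&#233;">\n'
    ']>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><h1>R&foo;sum&foo;</h1></body></html>'
)
heading = ET.fromstring(XML_DOC).find(f".//{XHTML_NS}h1")
print(heading.text)
```

The XML document here deliberately omits the external DTD reference so that the sketch depends only on the internal subset.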

Related

DOCTYPE HTML in html file

Why is <!DOCTYPE html ... > used in html file?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
The DOCTYPE declaration (document type declaration) does a couple of things:
When performing HTML validation testing on a web page it tells the HTML (HyperText Markup Language) validator which version of (X)HTML standard the web page coding is supposed to comply with. When you validate your web page the HTML validator checks the coding against the applicable standard then reports which portions of the coding do not pass HTML validation (are not compliant).
It tells the browser how to render the page in standards compliant mode.
For more information refer to this "<!DOCTYPE html>" What does it mean?
It tells the browser that the following code is to be treated as a particular version of html code.
The browser then knows to look for the opening HTML tag <html> and treats everything as HTML until it reaches the closing tag </html>.
<!DOCTYPE html> is all that's needed now.
The DOCTYPE tells the browser which type of HTML is used on a web page. Here is a link to the official page, which explains what this is and why it is used:
<!DOCTYPE html>
A doctype defines which version of HTML/XHTML your document uses. You would want to use a doctype so that when you run your code through validators, the validators know which version of HTML/XHTML to check against.
The <!DOCTYPE> declaration is not an HTML tag; it is an instruction to the web browser about what version of HTML the page is written in.
In HTML 4.01, the <!DOCTYPE> declaration refers to a DTD, because HTML 4.01 was based on SGML. The DTD specifies the rules for the markup language, so that the browsers render the content correctly.
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
Tip: Always add the <!DOCTYPE> declaration to your HTML documents, so that the browser knows what type of document to expect.
The <!DOCTYPE html> declaration is used to inform a website visitor's browser that the document being rendered is an HTML document. While not actually an HTML element itself, every HTML document should begin with a DOCTYPE declaration to be compliant with HTML standards.
For HTML5 documents (which nearly all new web documents should be), the DOCTYPE declaration should be:
<!DOCTYPE html>
It shows the browser that the file is HTML5.
It is followed by the language attribute, in line with HTML5 good practice.
<!doctype html>
<html lang="es">
Here the second line tells the browser which language the file is written in, Spanish in this example: <html lang="es">
The doctype is important when building an HTML document. It is not an HTML tag itself but an instruction to the web browser about what version of HTML the page is written in.

HTML5 Doctype with strict

I want a strict but fully compatible html5 alternative to:
<!doctype html>
Basically I want to ensure the use of closing tags just to keep everything well readable, consistent and highlighted clearly in editors.
What I am looking for is to HTML5 what XHTML 1.0 Strict is to HTML 4.
Thanks in advance.
There is no doctype for "strict" XHTML5 validation. For XHTML5 the doctype is even optional, as the doctype only serves to stop the browser from switching to quirks mode, and there is no quirks mode for XHTML. It is recommended to use the HTML5 doctype (with capitalised DOCTYPE) if you are planning to use it as a polyglot document. In that case you would use the doctype:
<!DOCTYPE html>
However, if you want to validate as if the document is using XHTML style syntax, you can achieve that using the advanced options of the validator.
Go to http://validator.nu
Switch to "text field" in the select box (or point it to your online document, but make sure it is served as XHTML, not text/html).
If using the text field, paste in your document. In my case I used the following:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta charset="UTF-8" />
</head>
<body>
<p>test
</body>
</html>
Select XHTML5 + SVG 1.1 + MathML 3.0 from the Preset field. This will pre-fill the schemas field with http://s.validator.nu/xhtml5.rnc http://s.validator.nu/html5/assertions.sch http://c.validator.nu/all/
Click Validate
Using my document it will warn about the missing close </p>.
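As a rough local approximation of the missing-close-tag check, any XML parser will reject the test document above, because an unclosed <p> is a well-formedness error. The sketch below uses Python's xml.etree.ElementTree. The helper name well_formedness_error is hypothetical, and this only checks well-formedness; it does not replace the schema validation that validator.nu performs.

```python
import xml.etree.ElementTree as ET

# The test document from the steps above, with the unclosed <p>.
DOCUMENT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta charset="UTF-8" />
</head>
<body>
<p>test
</body>
</html>"""


def well_formedness_error(markup):
    """Return the parser's error message, or None if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The unclosed <p> makes the parser complain about a mismatched tag.
print(well_formedness_error(DOCUMENT))

# Closing the paragraph makes the same document parse cleanly.
FIXED = DOCUMENT.replace("<p>test", "<p>test</p>")
print(well_formedness_error(FIXED))
```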

HTML5 is not based on SGML, and therefore does not require a reference to a DTD

From: http://www.w3schools.com/tags/tag_doctype.asp
The <!DOCTYPE> declaration is not an HTML tag; it is an instruction to the web browser about what version of HTML the page is written in.
In HTML 4.01, the <!DOCTYPE> declaration refers to a DTD, because HTML 4.01 was based on SGML. The DTD specifies the rules for the markup language, so that the browsers render the content correctly.
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
Tip: Always add the <!DOCTYPE> declaration to your HTML documents, so that the browser knows what type of document to expect.
Does the bold statement ("HTML5 is not based on SGML, and therefore does not require a reference to a DTD") mean that when we are using HTML 5 we don't need to specify <!DOCTYPE html>?
What does that statement exactly mean?
I am currently using <!DOCTYPE html> in my html file with the browser Firefox 4. I removed that declaration but did not see any difference in the rendered output. Does it mean that the problem may occur in old browsers and not in new ones?
The terminology is confusing, but a DTD (document type definition) is only one part of a document type declaration (usually shortened to "doctype"). You should always include a doctype declaration (<!DOCTYPE html> if you use HTML5), but a document type definition identifier is no longer necessary.
To provide a concrete example, this is what a HTML4.01 document type declaration ("doctype") might have looked like:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
The document type definition ("DTD") identifier in the above declaration is this part:
"-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
That's the part you can leave off for HTML5. "PUBLIC" specifies the DTD's availability, so that should also not be included if there is no DTD.
Does the bold statement mean that when we are using HTML 5 we don't need to specify <!DOCTYPE html>?
It means that you can't specify a DTD.
The HTML 5 Doctype has no public or system identifier in it.
I am currently using <!DOCTYPE html> in my html file
That is required. Keep doing that.
with the browser Firefox 4.
The current stable version of Firefox is version 20. You should probably upgrade.
I removed that declaration but did not see any difference in the rendered output. Does it mean that the problem may occur in old browsers and not in new ones?
No, it just means that you don't have any code that is impacted by being in Quirks mode (or that you do but didn't spot the changes).
Let's take a look at the W3C HTML5 definition; they have a convenient page about the differences HTML5 brings:
http://www.w3.org/TR/html5-diff/#doctype
2.2 The Doctype
The HTML syntax of HTML5 requires a doctype to be specified to ensure
that the browser renders the page in standards mode. The doctype has
no other purpose. [DOCTYPE]
The doctype declaration for the HTML syntax is <!DOCTYPE html> and is
case-insensitive. Doctypes from earlier versions of HTML were longer
because the HTML language was SGML-based and therefore required a
reference to a DTD. With HTML5 this is no longer the case and the
doctype is only needed to enable standards mode for documents written
using the HTML syntax. Browsers already do this for <!DOCTYPE html>.
To support legacy markup generators that cannot generate the preferred
short doctype, the doctype <!DOCTYPE html SYSTEM "about:legacy-compat">
is allowed in the HTML syntax.
The strict doctypes for HTML 4.0, HTML 4.01, XHTML 1.0 as well as
XHTML 1.1 are also allowed (but are discouraged) in the HTML syntax.
In the XML syntax, any doctype declaration may be used, or it may be
omitted altogether. Documents with an XML media type are always
handled in standards mode.
On that page, chapter 1 (Introduction) says more about HTML versus XML syntax:
The HTML5 draft (..) defines a single language called HTML which can be written in HTML syntax and in XML syntax.
So, if your HTML5 is written in the XML syntax, I conclude from the last paragraph that in this case you may omit the doctype line.
See chapter 2 for the difference in syntax:
HTML5 HTML syntax:
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>
HTML5 XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>
There are some subtle differences in syntax.

Multiple Doctypes in a single HTML Document

If a HTML document has two doctypes, how will the doctypes affect the rendering of the page and which doctype would the browser pick? Is having two (or more) doctypes in a single document valid or confusing?
Example:
<!DOCTYPE html>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
</html>
Only a single doctype declaration is permitted. This follows rather directly from the HTML specifications as well as the HTML5 drafts, and it can also be checked using a validator.
Thus, there is no specification of what should happen. The natural expectation is that since browsers process the doctype declaration only in “doctype sniffing” when deciding on the browser mode (Quirks Mode vs. Standards Mode), only the first doctype declaration takes effect and the other is ignored.
This can be tested e.g. as follows (using an HTML 3.2 doctype, which triggers Quirks Mode on all doctype-sniffer browsers):
<!DOCTYPE HTML>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<title>Testing duplicate doctype effect</title>
<script>
document.write(document.compatMode);
</script>
</html>
This displays “CSS1Compat” (= Standards Mode), whereas swapping the doctype declarations causes “BackCompat” (= Quirks Mode).
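The "first doctype wins" outcome of this test can be mimicked with a small Python sketch. Both helpers are hypothetical, and toy_compat_mode is deliberately incomplete: it models only two quirks triggers (no doctype at all, and an HTML 3.2 public identifier), whereas real browsers check a much longer list of identifiers. It is meant to illustrate the ordering effect, not to predict browser behaviour in general.

```python
import re

# Matches a doctype declaration up to its first '>' (case-insensitive).
DOCTYPE_RE = re.compile(r"<!DOCTYPE\b[^>]*>", re.IGNORECASE)


def first_doctype(markup):
    """Return the first doctype declaration in the markup, or None."""
    match = DOCTYPE_RE.search(markup)
    return match.group(0) if match else None


def toy_compat_mode(markup):
    """Very rough stand-in for doctype sniffing (hypothetical helper).

    Only two quirks triggers are modelled: a missing doctype and an
    HTML 3.2 public identifier. Everything else is assumed to be
    standards mode, which is NOT true of real browsers.
    """
    doctype = first_doctype(markup)
    if doctype is None or "-//W3C//DTD HTML 3.2" in doctype:
        return "BackCompat"
    return "CSS1Compat"


standards_first = (
    '<!DOCTYPE HTML>\n'
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    '<html></html>'
)
quirks_first = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
    '<!DOCTYPE HTML>\n'
    '<html></html>'
)

# Only the first declaration is consulted; the second is never looked at.
print(toy_compat_mode(standards_first))
print(toy_compat_mode(quirks_first))
```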
I believe the very first DOCTYPE is used by the browser and it's against the specification to have more than one in a document.
I think (not sure) that the only situation when multiple DOCTYPE-s may be valid is when using IE conditional comments. Browsers other than IE won't see those, of course.
I remember reading a blog entry (can't find it now, so I may be wrong in this) but some (most?) browsers even ignore the DOCTYPE if it's not the first thing they encounter. (This may have been a bug that got fixed since.)
Here's W3School's reference page about DOCTYPE:
http://www.w3schools.com/tags/tag_doctype.asp
If you have multiple DOCTYPEs in your HTML page, the browser will use the first one, because it parses the document line by line. Once the browser finds a DOCTYPE, it stops looking for other doctypes and moves on to the HTML tag.
In the question above the HTML5 DOCTYPE comes first and the HTML 4 one second, so the browser will render the page according to the HTML5 doctype.
It is best to try it once at http://www.w3schools.com/ ... Try using the 'code', 'kbd', 'dfn', 'samp' or 'strong' tag with both doctypes in each order.

Why is the CDATA section in my HTML not rendering?

I am writing a report about XML injection attacks in HTML. Thus I am going to have (mangled) HTML content as the content of my HTML. As such I am trying to wrap my HTML content in CDATA blocks, but it does not seem to be rendering properly.
I have the (validated by W3C) document:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>report</title>
</head>
<body>
<div><![CDATA[AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">]]></div>
</body>
</html>
From my understanding of the Wikipedia article this means that the content should be "marked for the parser to interpret as only character data, not markup". So the output should be
AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">
However, in both Chrome 21.0.1180.60 m and Firefox 14.0.1 all that renders is
]]>
What is going on? Shouldn't everything from the <![CDATA[ to the first ]]> appear on screen as if every character had been escaped?
CDATA sections are recognized by browsers only in XML parsing mode. In legacy HTML mode, strange things happen, as you have seen.
So you would need to configure the server to send the document with an XHTML Content-Type. This in turn would prevent old versions of IE (up to IE 8) from rendering the document at all.
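That a CDATA section is plain character data to an XML parser can be checked outside the browser. The following is a minimal sketch using Python's xml.etree.ElementTree as a stand-in for XML parsing mode; the shortened document wrapper is an assumption, while the payload is the one from the question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# The markup the report wants to show literally (taken from the question).
payload = '''AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">'''

xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<div><![CDATA[" + payload + "]]></div>"
    "</body></html>"
)

div = ET.fromstring(xhtml).find(f".//{XHTML_NS}div")

# An XML parser hands the CDATA content over as text, with no child elements.
print(div.text == payload)
print(len(div))
```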
The practical ways of displaying HTML tags as content of an HTML document are:
a) Present each < as &lt; and each & as &amp;. Works in legacy HTML and in XHTML.
b) Wrap the data in an xmp element. Works in legacy HTML only, so not with an XML Content-Type; merely declaring an XHTML doctype doesn't matter, since it gets ignored anyway. Example:
<xmp><![CDATA[AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">]]></xmp>
The xmp markup also implies a monospace font and pre-like rendering where whitespace is honored. But these can be overridden with CSS. The xmp element was dropped from HTML specs long ago but is supported by browsers quite well.
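Option a) is easy to automate when the report is generated by a program. A minimal sketch, assuming the report is built in Python, using the standard library's html.escape (which also escapes >, harmless here):

```python
import html

# The markup the report wants to show literally (taken from the question).
payload = '''AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">'''

# quote=False leaves " and ' alone, which is fine for element content;
# keep the default quote=True when escaping attribute values.
escaped = html.escape(payload, quote=False)

print("<div>" + escaped + "</div>")
```

The escaped string is safe to place inside the div in both legacy HTML and XHTML, since it no longer contains any literal < or & characters.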