Why is the CDATA section in my HTML not rendering? - html

I am writing a report about XML injection attacks in HTML. Thus I am going to have (mangled) HTML content as the content of my HTML. As such I am trying to wrap my HTML content in CDATA blocks, but it does not seem to be rendering properly.
I have the (validated by W3C) document:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>report</title>
</head>
<body>
<div><![CDATA[AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">]]></div>
</body>
</html>
From my understanding of the Wikipedia article this means that the content should be "marked for the parser to interpret as only character data, not markup". So the output should be
AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">
However, in both Chrome 21.0.1180.60 m and Firefox 14.0.1 all that renders is
]]>
What is going on? Shouldn't everything from the <![CDATA[ to the first ]]> appear on screen as if every character had been escaped?

CDATA sections are recognized by browsers only in XML parsing mode. In legacy HTML mode, strange things happen, as you have seen.
So you would need to configure the server to send the document with an XHTML Content-Type (application/xhtml+xml). This in turn would prevent old versions of IE (up to IE 8) from rendering the document at all.
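To illustrate what XML parsing mode does with the CDATA section, here is a minimal Python sketch. Python and the variable names are chosen only for illustration, and xml.etree.ElementTree is an XML parser, not a browser, but it applies the same XML rule: everything between <![CDATA[ and ]]> arrives as character data. The document is a trimmed version of the one in the question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
PAYLOAD = """AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">"""

# Trimmed version of the report document, with the payload in a CDATA section.
document = (
    '<html xmlns="' + XHTML_NS + '">'
    "<head><title>report</title></head>"
    "<body><div><![CDATA[" + PAYLOAD + "]]></div></body>"
    "</html>"
)

root = ET.fromstring(document)
div = root.find(".//{%s}div" % XHTML_NS)

# The XML parser creates no META or SCRIPT elements; the div holds only text.
cdata_text = div.text
child_count = len(div)
print(cdata_text)
```

A legacy HTML parser gives no such guarantee, which is why the same bytes behave differently when served as text/html.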
The practical ways of displaying HTML tags as content of an HTML document are:
a) Present each < as &lt; and each & as &amp;. Works in legacy HTML and in XHTML.
b) Wrap the data in an xmp element. Works in legacy HTML (only - so no XML Content-Type, but just declaring an XHTML doctype doesn't matter, it gets ignored anyway). Example:
<xmp>AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>"></xmp>
The xmp markup also implies a monospace font and pre-like rendering where whitespace is honored. But these can be overridden with CSS. The xmp element was dropped from HTML specs long ago but is supported by browsers quite well.
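Option a) can be automated when the report is generated by a program. The following is a minimal sketch, assuming the report is assembled in Python; html.escape is a standard library function, while the variable names are made up for the example.

```python
from html import escape

PAYLOAD = """AuthType=<META HTTP-EQUIV="Set-Cookie" Content="USERID=<SCRIPT>alert('XSS')</SCRIPT>">"""

# quote=False leaves " and ' alone; only &, < and > are replaced, which is
# enough for text placed in element content (not in attribute values).
escaped = escape(PAYLOAD, quote=False)

# The escaped text can be dropped into the report as ordinary element content.
fragment = "<div>" + escaped + "</div>"
print(fragment)
```

The resulting fragment contains no markup characters from the payload, so it displays the same way in legacy HTML and in XHTML.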

Related

DOCTYPE HTML in html file

Why is <!DOCTYPE html ... > used in html file?
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
The DOCTYPE Declaration (DTD or Document Type Declaration) does a couple of things:
When performing HTML validation testing on a web page it tells the HTML (HyperText Markup Language) validator which version of (X)HTML standard the web page coding is supposed to comply with. When you validate your web page the HTML validator checks the coding against the applicable standard then reports which portions of the coding do not pass HTML validation (are not compliant).
It tells the browser how to render the page in standards compliant mode.
For more information refer to this "<!DOCTYPE html>" What does it mean?
It tells the browser that the following code is to be treated as a particular version of html code.
The browser knows then to look for an open HTML tag <html> and treats everything like html until it reaches the close HTML tag </html>
<!DOCTYPE html> is all that's needed now.
The DOCTYPE tells the browser which type of HTML is used on a webpage. Here is a link to the official page that explains why it is used and what it is:
<!DOCTYPE html>
A doctype defines which version of HTML/XHTML your document uses. You would want to use a doctype so that when you run your code through validators, the validators know which version of HTML/XHTML to check against
The <!DOCTYPE> declaration is not an HTML tag; it is an instruction to the web browser about what version of HTML the page is written in.
In HTML 4.01, the <!DOCTYPE> declaration refers to a DTD, because HTML 4.01 was based on SGML. The DTD specifies the rules for the markup language, so that the browsers render the content correctly.
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
Tip: Always add the <!DOCTYPE> declaration to your HTML documents, so that the browser knows what type of document to expect.
The <!DOCTYPE html> declaration is used to inform a website visitor's browser that the document being rendered is an HTML document. While not actually an HTML element itself, every HTML document should begin with a DOCTYPE declaration to be compliant with HTML standards.
For HTML5 documents (which nearly all new web documents should be), the DOCTYPE declaration should be:
<!DOCTYPE html>
It shows the browser that the file is HTML5.
It is followed by the language tag, in line with HTML5 good practice.
<!doctype html>
<html lang="es">
In this case the second line tells the browser which language the file is in, Spanish in this example: <html lang="es">
The DOCTYPE is important when building HTML documents: it is not an HTML tag itself, but an instruction to the web browser about what version of HTML the page is written in.

HTML with entity (can't get rid of ]>)

A very simple HTML file. I deliberately placed all required attributes even though it may be overkill. (Actually, &eacute; is recognised by practically all browsers without explicit specification, but this is just an example to highlight the problem):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY eacute "&#233;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Test HTML with an entity</title>
</head>
<body lang="en">
<h1>R&eacute;sum&eacute;</h1>
</body>
</html>
When I open it in a browser (I tried Firefox, Chrome, IE and Android WebView), it always comes up as
]>
Résumé
and I can't see a reason why ]> appears. Of course, if I remove ]> in DOCTYPE, everything appears all right,
but in this case my html is not a valid xml file, so it gives an error when opened in DOM.
Any suggestions?
What you are doing is correct as per XML rules, and it actually works in browsers that support XML, when served as XML; cf. How do I define HTML entity references inside a valid XML document?
The problem is that if the document is opened as legacy HTML in a browser, it will be processed by legacy HTML principles. This means, among other things, that an internal DTD subset (the thing you have in brackets in the DOCTYPE declaration) is not parsed by the book; instead, when processing a DOCTYPE string, browsers end with the first > character, and the rest will be consumed as character data.
So the problem isn’t just the ]>. The construct does not work at all, i.e. no entity is defined. In the example, the “é” character is displayed, but only because &eacute; is predefined in HTML. If you tried defining <!ENTITY foo "&#233;"> and using &foo;, you would see &foo; literally.
If your document will be processed as legacy HTML, you cannot define entity references. Apparently it currently is, since the example document does not display at all when processed as XML (it is not well-formed, so only a syntax error message appears).
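The difference between the two processing modes can be reproduced outside a browser. The sketch below is only an approximation: Python's html.parser is not a browser and not a conforming HTML5 parser, but it also ends the doctype at the first > character, so it shows the same symptoms. The test document, the entity name foo and the class name TextCollector are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical test document: an internal DTD subset defining &foo;.
DOCUMENT = """<!DOCTYPE html [
<!ENTITY foo "bar">
]>
<html><body><p>&foo;</p></body></html>"""


class TextCollector(HTMLParser):
    """Records the doctype declaration and all character data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doctype = None
        self.chunks = []

    def handle_decl(self, decl):
        self.doctype = decl

    def handle_data(self, data):
        self.chunks.append(data)


# Legacy-HTML-style processing: the doctype stops at the first ">",
# so "]>" leaks into the text and &foo; is never defined.
collector = TextCollector()
collector.feed(DOCUMENT)
collector.close()
html_text = "".join(collector.chunks).split()

# XML processing: the internal subset is parsed and &foo; is expanded.
xml_text = ET.fromstring(DOCUMENT).find("body/p").text

print(html_text)
print(xml_text)
```

In the HTML-style pass the collected text contains the stray ]> followed by the literal &foo;, while the XML pass yields the replacement text of the entity.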

HTML5 is not based on SGML, and therefore does not require a reference to a DTD

From: http://www.w3schools.com/tags/tag_doctype.asp
The < !DOCTYPE > declaration is not an HTML tag; it is an instruction to
the web browser about what version of HTML the page is written in.
In HTML 4.01, the < !DOCTYPE > declaration refers to a DTD, because HTML
4.01 was based on SGML. The DTD specifies the rules for the markup language, so that the browsers render the content correctly.
HTML5 is not based on SGML, and therefore does not require a reference
to a DTD.
Tip: Always add the < !DOCTYPE > declaration to your HTML documents, so that the browser knows what type of document to expect.
Does the bold statement mean that when we are using HTML 5 we don't need to specify < !DOCTYPE html >?
What does that statement exactly mean?
I am currently using < !DOCTYPE html > in my html file with the browser Firefox 4. I removed that declaration but did not see any difference in the rendered output. Does it mean that the problem may occur in old browsers and not in new ones?
The terminology is confusing, but a DTD (document type definition) is only one part of a document type declaration (usually shortened to "doctype"). You should always include a doctype declaration (<!DOCTYPE html> if you use HTML5), but a document type definition identifier is no longer necessary.
To provide a concrete example, this is what a HTML4.01 document type declaration ("doctype") might have looked like:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
The document type definition ("DTD") identifier in the above declaration is this part:
"-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
That's the part you can leave off for HTML5. "PUBLIC" specifies the DTD's availability, so that should also not be included if there is no DTD.
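To make the parts of a doctype declaration concrete, here is a small Python sketch that splits a declaration into its name, keyword and identifiers. It relies on html.parser and shlex from the standard library; the class DoctypeReader and the function doctype_parts are hypothetical helpers written for this illustration, not part of any specification.

```python
import shlex
from html.parser import HTMLParser


class DoctypeReader(HTMLParser):
    """Keeps the text of the first declaration the parser reports."""

    def __init__(self):
        super().__init__()
        self.declaration = None

    def handle_decl(self, decl):
        if self.declaration is None:
            self.declaration = decl


def doctype_parts(markup):
    """Split a doctype into name, keyword, public and system identifiers."""
    reader = DoctypeReader()
    reader.feed(markup)
    if reader.declaration is None:
        raise ValueError("no doctype declaration found")
    tokens = shlex.split(reader.declaration)  # honours the quoted identifiers
    # tokens[0] is the literal word DOCTYPE; pad so short doctypes unpack too.
    name, keyword, public_id, system_id = (tokens[1:] + [None] * 4)[:4]
    return {"name": name, "keyword": keyword,
            "public_id": public_id, "system_id": system_id}


html4 = doctype_parts('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                      '"http://www.w3.org/TR/html4/strict.dtd">')
html5 = doctype_parts("<!DOCTYPE html>")
print(html4)
print(html5)
```

For the HTML 4.01 declaration all four fields are filled in; for <!DOCTYPE html> only the name remains, which is the point made above.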
Does the bold statement mean that when we are using HTML 5 we don't need to specify <!DOCTYPE html>?
It means that you can't specify.
The HTML 5 Doctype has no public or system identifier in it.
I am currently using <!DOCTYPE html> in my html file
That is required. Keep doing that.
with the browser Firefox 4.
The current stable version of Firefox is version 20. You should probably upgrade.
I removed that declaration but did not see any difference in the rendered output. Does it mean that the problem may occur in old browsers and not in new ones?
No, it just means that you don't have any code that is impacted by being in Quirks mode (or that you do but didn't spot the changes).
Let's take a look at the W3C HTML5 definition; they have a convenient page about the differences HTML5 brings:
http://www.w3.org/TR/html5-diff/#doctype
2.2 The Doctype
The HTML syntax of HTML5 requires a doctype to be specified to ensure
that the browser renders the page in standards mode. The doctype has
no other purpose. [DOCTYPE]
The doctype declaration for the HTML syntax is <!DOCTYPE html> and is
case-insensitive. Doctypes from earlier versions of HTML were longer
because the HTML language was SGML-based and therefore required a
reference to a DTD. With HTML5 this is no longer the case and the
doctype is only needed to enable standards mode for documents written
using the HTML syntax. Browsers already do this for <!DOCTYPE html>.
To support legacy markup generators that cannot generate the preferred
short doctype, the doctype <!DOCTYPE html SYSTEM "about:legacy-compat"> is allowed in the HTML syntax.
The strict doctypes for HTML 4.0, HTML 4.01, XHTML 1.0 as well as
XHTML 1.1 are also allowed (but are discouraged) in the HTML syntax.
In the XML syntax, any doctype declaration may be used, or it may be
omitted altogether. Documents with an XML media type are always
handled in standards mode.
On that page, chapter 1 (Introduction) says more about HTML versus XML syntax:
The HTML5 draft (..) defines a single language called HTML which can be written in HTML syntax and in XML syntax.
So, if your HTML5 is written in strict XML syntax, I conclude from the last paragraph that in this case you do not need to prefix a doctype line.
See chapter 2 for the difference in syntax:
HTML5 HTML syntax:
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>
HTML5 XML syntax:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>
There are some subtle differences in syntax.
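One way to see those differences is to hand both examples to an XML parser. This is a rough Python sketch using xml.etree.ElementTree; the helper name is_well_formed_xml is made up for the illustration. The HTML syntax example is rejected for at least two reasons: XML only accepts the uppercase keyword DOCTYPE, and the meta element is never closed.

```python
import xml.etree.ElementTree as ET

HTML_SYNTAX = b"""<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>"""

XML_SYNTAX = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example document</title>
</head>
<body>
<p>Example paragraph</p>
</body>
</html>"""


def is_well_formed_xml(document):
    """Return True if an XML parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


html_ok = is_well_formed_xml(HTML_SYNTAX)
xml_ok = is_well_formed_xml(XML_SYNTAX)
print(html_ok, xml_ok)
```

The check says nothing about how a browser renders either file; it only shows that the HTML syntax example cannot be served as XML unchanged.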

Multiple Doctypes in a single HTML Document

If a HTML document has two doctypes, how will the doctypes affect the rendering of the page and which doctype would the browser pick? Is having two (or more) doctypes in a single document valid or confusing?
Example:
<!DOCTYPE html>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
</html>
Only a single doctype declaration is permitted. This follows rather directly from the HTML specifications as well as the HTML5 drafts, and it can also be checked using a validator.
Thus, there is no specification of what should happen. The natural expectation is that since browsers process the doctype declaration only in “doctype sniffing” when deciding on the browser mode (Quirks Mode vs. Standards Mode), only the first doctype declaration takes effect and the other is ignored.
This can be tested e.g. as follows (using an HTML 3.2 doctype, which triggers Quirks Mode on all doctype-sniffer browsers):
<!DOCTYPE HTML>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<title>Testing duplicate doctype effect</title>
<script>
document.write(document.compatMode);
</script>
</html>
This displays “CSS1Compat” (= Standards Mode), whereas swapping the doctype declarations causes “BackCompat” (= Quirks Mode).
I believe the very first DOCTYPE is used by the browser and it's against the specification to have more than one in a document.
I think (not sure) that the only situation when multiple DOCTYPE-s may be valid is when using IE conditional comments. Browsers other than IE won't see those, of course.
I remember reading a blog entry (can't find it now, so I may be wrong in this) but some (most?) browsers even ignore the DOCTYPE if it's not the first thing they encounter. (This may have been a bug that got fixed since.)
Here's W3School's reference page about DOCTYPE:
http://www.w3schools.com/tags/tag_doctype.asp
If you have multiple DOCTYPEs in your HTML page, the browser will use the first one, because it parses the document from top to bottom. Once the browser has found a DOCTYPE, it stops looking for other doctypes and moves on to the HTML tag.
In the question above the HTML5 DOCTYPE comes first and the HTML 4 one second, so the browser will render the page according to the HTML5 doctype.
It is best to try it once at http://www.w3schools.com/ ... Try using the 'code', 'kbd', 'dfn', 'samp' or 'strong' tag while listing both doctypes in either order.

Do modern browsers care about the DOCTYPE?

If you use deprecated attributes or tags <center>, <font color="red">, or <td valign="top"> etc. in XHTML 1.0 Strict (no depr. attributes), modern browsers (I will use Chrome as an example) still take notice of and use them.
If you use HTML5 <video> on an XHTML 1.0 Strict DOCTYPE Chrome will still recognize it - it's not as if they'd program it to not. I tested the worst deprecated, capitalized, and unquoted attribute code I could write, along with HTML5 audio, with the XHTML 1.0 Strict DOCTYPE in Chrome and it rendered flawlessly.
Here's the code I tested, working flawlessly in Chrome (red bg, centered table, audio playing):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Do browsers care about the DOCTYPE?</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
</head>
<body bgcolor=#ff0000>
<CENTER>
<table cellpadding="0" cellspacing=0>
<tr><td valign=top>test</td></tr>
</table>
</CENTER>
and some HTML5 audio..
<audio autoplay>
<source src="http://www.duncannz.com/resources/sound/alarm.mp3" type="audio/mp3">fallback text</audio>
</body>
</html>
So my question: Do modern browsers (translation: browsers other than IE) pay any attention at all, or do anything differently, because of the DOCTYPE? Do they even bother to read and interpret it?
Browsers do care about the DOCTYPE - otherwise there wouldn't be any point in having it!
You are right in saying that many browsers interpret old/deprecated commands in the correct way, but this is largely a matter of backwards compatibility. There is such a huge amount of content on the web that it is next to impossible to keep everything up-to-date and standards-compliant. The web browsers continue to support these outdated pages because if they didn't, much of the content on the web would look slightly off. Most users don't know the difference between HTML4 and 5, so the blame could fall on the browser, which could be devastating - especially if a page looks bad on Firefox and nice on IE!
The DOCTYPE is mainly used in validation and to determine whether a browser runs in this "quirks mode" (where many of these older rules still work) or "standards mode". Many professional web designers use the W3C validation tools to make sure their web pages are valid HTML, and the tools provided by their website look at the DOCTYPE to choose the correct set of rules with which to validate. Furthermore, XHTML strict does not allow empty tags or other blatant syntactic errors.
Hope this helps!
Try this in Chrome:
<!DOCTYPE html>
<title>Test case</title>
<p hidden>My text
<table><tr><td>Hello World</table>
against this
<title>Test case</title>
<p hidden>My text
<table><tr><td>Hello World</table>
Only in the former case will the text "Hello World" be visible.
In most modern browsers, you're not going to notice much difference (depending on the page) when using different doctypes. The biggest difference you'll notice is not in your markup, but in your use of CSS and the layout/positioning of elements. The doctype is used when validating your pages and in determining the mode in which the browser renders the page. So the doctype you use determines whether the page is rendered in Standards mode, Quirks mode, etc. In IE and older browsers, you'll notice much more of a difference.
For more in-depth information on the subject, check out this link: http://hsivonen.iki.fi/doctype/
Yes, they do. It means the difference between Quirks or Standard mode, and can affect how IE handles box containers.
Have a look here:
http://www.quirksmode.org/css/quirksmode.html
And also here:
http://www.webmasterworld.com/forum21/7975.htm They discuss this topic in detail.
maybe the paragraph called "How DOCTYPES Affect Rendering" could help you?
http://www.upsdell.com/BrowserNews/res_doctype.htm
At the current date it is still possible to use DTD entities as variables in Chrome/Firefox/Opera/IE in .xml, .xhtml, .svg and other XML-based files (it breaks in .html files, I imagine because they go through an HTML renderer instead of an XML renderer) without having to resort to JavaScript or PHP/other server-side preprocessor magic:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY theword "bird">
<!ENTITY thexmlns "http://www.w3.org/1999/xhtml">
]>
<html xmlns="&thexmlns;">
<head>
<title>The word is &theword;</title>
</head>
<body>
<p>This document uses the word &theword; multiple times.</p>
<p>This document's word can be changed from &theword; by altering the entity.</p>
</body>
</html>
This seems like a useful test to see if doctypes still work (save the example above as example.xml or example.xhtml and see if it works).
So far I have only found a realistic use for it in Android project XML files: using
entities inside attributes to keep lines from having too much text on one line,
or to avoid repeating long text in multiple attributes when a short entity could stand in for it.