Is valid XML also valid HTML - html

I'm trying to convert some XML to HTML. The XML contains only a few known elements that map to HTML tags. Do I need to html encode text nodes?
Is valid XML also valid HTML assuming we are only using HTML tags?

Is valid XML also valid HTML assuming we are only using HTML tags?
No. Here's a simple example.
<div>
<span/>
</div>
This is well-formed and valid XML. It is not valid HTML (except when processed as XHTML) in any version of HTML.
That's not to say that an HTML parser won't process it, but that's not a good test. An HTML parser will process any byte sequence, valid or not.

Is valid* XML also valid HTML assuming we are only using HTML tags?
*Note that "valid" is not the same as "well-formed". Validity is a property that requires well-formedness and successful comparison against a DTD or schema. Well-formedness only means syntactical correctness, which is what you mean here.
Yes. HTML uses a few conventions that are not present in XML (prominently unclosed tags, unencoded tag bodies like <script>, namespaces are unsupported, incorrect tag nesting is glossed over) but all things considered well-formed, vanilla (!) XML that only uses HTML tag names will be understood by an HTML parser.
Vanilla means in this case: No custom DTDs, no custom named character entities.
Do I need to html encode text nodes?
No. All characters valid in a certain encoding (say, UTF-8) will be acceptable in both XML and HTML, as long as the encoding is correctly declared. Character escaping schemes are compatible, so e.g. &#160; (or &#xA0;) will represent a non-breaking space in both XML and HTML. Writing that non-breaking space verbatim (i.e. as the character U+00A0, encoded in the document's encoding) into the text will work as well. Named character entities besides &lt;, &gt;, &amp;, &quot; and &apos; are unsupported in XML, whereas all numeric character references XML could use will work in HTML. That means you will not encounter a problem there.
XML that does not declare an encoding will default to UTF-8. You should not have a problem with leaving all text nodes and attribute values as they are as long as you use the same encoding for your HTML.
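As a rough illustration of the point about text nodes, the sketch below uses Python's standard xml.etree.ElementTree (my choice of tool, not something the question specifies) to parse a small XML fragment and write it back out with the HTML output method. The fragment itself is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: well-formed XML that only uses HTML tag names.
xml_source = "<div><span/>a &amp; b &lt; c<br/></div>"

root = ET.fromstring(xml_source)

# The parser has already decoded the references in the text node.
decoded_tail = root[0].tail  # the text that follows the empty <span/>

# method="html" writes <span></span> instead of <span/>, leaves the void
# element <br> without an end tag, and re-escapes & and < in text.
html_output = ET.tostring(root, encoding="unicode", method="html")

print(decoded_tail)
print(html_output)
```

If the conversion goes through a parser and a serializer like this, the text nodes are re-escaped for you; escaping by hand only matters when the output is built by string concatenation.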

Related

What is the difference between HTML and XHTML?

Note: this is supposed to be the canonical post for this question. A number of answers exist already, but descriptions of the various differences are scattered all over the place, and more often than not, they also offer opinions to "which one should I use", which I will refrain from in here.
If you have more questions to ask, or you know of more differences, feel free to edit.
What is the difference between XHTML and HTML? Isn't XHTML merely a more strict version of HTML? And why are there different versions of XHTML if they all act the same?
What is the difference between HTML and XHTML?
There are many differences. The main one is that XHTML is HTML in an XML document, and XML has different syntax rules:
XML has a different namespace by default, so you'll have to use the HTML namespace, xmlns="http://www.w3.org/1999/xhtml" explicitly in an XHTML document
XML is case sensitive and you'll have to use lowercase for tag names and attributes and even the x in hexadecimal character references
XML doesn't have optional start and end tags, so you'll have to write out all of them in full
Likewise, XML doesn't have void tags, so you'll have to close every void element yourself with a slash.
Non-void elements that have no content can be written as a single empty element tag in XML.
XML can contain CDATA sections, sections of plain text delimited with <![CDATA[ .. ]]>; HTML cannot
On the other hand, there are no CDATA or PCDATA elements or attributes in XML, so you'll have to escape your < signs everywhere (except in CDATA sections)
Quotes around attribute values are not optional in XML, and there is no attribute minimization (name-only attributes)
And the XML parser is not as forgiving of errors as the HTML parser.
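A few of these rules can be checked with any XML parser. The sketch below uses Python's xml.etree.ElementTree as a stand-in for a browser's XML parser (an assumption; browsers use their own), and the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The namespace has to be declared explicitly; the parser then reports
# every element name qualified with it.
doc = ET.fromstring(f'<html xmlns="{XHTML_NS}"><body><br/></body></html>')
root_tag = doc.tag

# A CDATA section may contain < and & without escaping.
cdata_text = ET.fromstring("<p><![CDATA[1 < 2 & 3]]></p>").text

checks = {
    "mismatched case": is_well_formed("<P>text</p>"),
    "minimized attribute": is_well_formed("<input disabled/>"),
    "unescaped <": is_well_formed("<p>1 < 2</p>"),
    "empty-element tag": is_well_formed("<div><span/></div>"),
}
print(root_tag, cdata_text, checks)
```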
Then there are a couple of not XML-related differences:
XHTML documents are always rendered in standards mode, never in quirks mode
XHTML does not look at meta commands in the head to determine the encoding. In fact, the W3C validator flags <meta http-equiv="content-type" ... as an error in XHTML5 files, but not in HTML5 files.
Earlier on, mismatches between the DTDs for XHTML 1.0 Strict and HTML 4.01 Strict led to validation issues. The definition for XHTML 1.0 was missing the name attribute on <img> and <form>. This was an error though, fixed in XHTML 1.1.
Note that XHTML documents should be served up with the correct file type, i.e. a .xhtml file extension or an application/xhtml+xml MIME type. You can't really have XHTML in an HTML document, because browsers don't differentiate between the two syntaxes by looking at the content, only by file type.
In other words, if you have an HTML file, its contents are HTML, no matter if it has valid XML in it or not.
One point about the syntax rules worth mentioning is the casing of tag names. Although HTML documents are case-insensitive, the tag names are actually exposed as uppercase by the DOM. That means that under HTML, a JavaScript command like console.log(document.body.tagName); would output "BODY", whereas the same command under XHTML would output "body".
Isn't XHTML merely a stricter version of HTML?
No; XML has different rules than HTML, but it's not necessarily stricter. If anything, XML has fewer rules!
In HTML, many features are optional. You can choose to put quotes around attribute values or not; in XML you don't have that choice. And in HTML, you have to remember when you have the choice and when you don't: are quotes optional in <a href=http://my-website.com/?login=true>? In XML, you don't have to think about that. XML is easier.
In HTML, some elements are defined as raw text elements, that is, elements that contain plain text rather than markup.
And some other elements are escapable raw text elements, in which references like &eacute; will be parsed, but things like <b>bold</b> and <!-- comment --> will be treated as plain text. If you can remember which elements those are, you don't have to escape < signs (you optionally can though). XML doesn't have that, so there's nothing to remember and all elements have the same content type.
XML has processing instructions, the most well known of which is the xml declaration in the prolog, <?xml version="1.0" encoding="windows-1252"?>. This tells the browser which version of XML is used (1.0 is the only version that works, by the way) and which character set.
And XML parses comments in a different way. For example, HTML comments can't start with <!--> (with a > as the first character inside); XHTML comments can.
Speaking of comments, with XHTML you can comment out blocks of code inside <script> and <style> elements using <!-- comment -->. Don't try that in HTML. (It's not recommended in XHTML either, because of compatibility issues, but you can.)
Why are there different versions of XHTML if they all act the same?
They don't! For instance, in XHTML 1.1 you can refer to character entities like &eacute; and &nbsp;, because those entities are defined in the DTD. The current version of XHTML (formerly known as XHTML5) does not have a DTD, so you will have to use numerical references, in this case &#233; and &#160; (or define those entities yourself in the DOCTYPE declaration; the X means eXtensible, after all).
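The difference between named and numeric references is easy to reproduce with a stand-alone XML parser. This is a minimal sketch using Python's xml.etree.ElementTree, which reads no external DTD; a browser's behaviour may differ in detail.

```python
import html
import xml.etree.ElementTree as ET

# A numeric reference needs no DTD, so a plain XML parser accepts it.
numeric = ET.fromstring("<p>caf&#233;</p>").text

# A named entity that is not declared anywhere is rejected.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    named_ok = True
except ET.ParseError:
    named_ok = False

# Declaring the entity yourself in the DOCTYPE's internal subset works.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
).text

# An HTML-aware decoder knows the named entity without any DTD.
html_decoded = html.unescape("caf&eacute;")

print(numeric, named_ok, declared, html_decoded)
```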

Does HTML conform to the XML specification?

HTML and XML are syntactically very similar, so what I want to know is if valid HTML code will always conform to the XML specification.
No, it won't.
HTML 2 through 4.x were SGML applications, not XML applications. (HTML+ might also have been an SGML application, it isn't clear from a brief skim of the specification)
HTML 5 has its own parse rules.
(XHTML and the XML serialisation of HTML 5 will be XML though)
Does HTML conform to the XML specification?
No, it does not. HTML supports:
unclosed tags (e.g. <img> instead of <img />)
wrongly nested tags (e.g. <b><i>bla</b></i> instead of <b><i>bla</i></b>)
unquoted attributes (e.g. <a name=foo>...</a>)
content that is not properly encoded (e.g. <em>this & that</em> instead of <em>this &amp; that</em>)
tags that explicitly must contain unencoded content (i.e. <script>)
named entities (e.g. &copy; instead of &#169;)
The standard does not explicitly allow all of these notions, but all HTML parsers understand and support them.
None of them is legal in XML.
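To see the list above in action, the sketch below feeds each kind of markup to a lenient HTML tokenizer and to an XML parser. Python's html.parser.HTMLParser is only a tokenizer, not a browser-grade HTML parser, so "accepted" here merely means it did not raise; the snippet names and helper functions are made up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SNIPPETS = {
    "unclosed tag": "<p><img></p>",
    "wrong nesting": "<b><i>bla</b></i>",
    "unquoted attribute": "<a name=foo>...</a>",
    "bare ampersand": "<em>this & that</em>",
    "named entity": "<p>&copy;</p>",
}


def accepted_by_html_tokenizer(markup: str) -> bool:
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()
    return True  # reaching this line means no exception was raised


def accepted_by_xml_parser(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Maps each snippet name to (HTML tokenizer result, XML parser result).
results = {
    name: (accepted_by_html_tokenizer(m), accepted_by_xml_parser(m))
    for name, m in SNIPPETS.items()
}
print(results)
```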
HTML is more lenient. For example,
<!DOCTYPE html>
<title>foo</title>
bar
is a valid HTML5 document, but it's obviously not valid XML, since XML requires a top-level element that encompasses the whole document.
However, you can use one of the XHTML languages, which are applications of XML with the same semantics as the corresponding HTML standards.

Necessary to encode characters in HTML links?

Should I be encoding characters contained within a url?
Example:
Some link using &
or
Some link using &amp;
Yes.
In HTML (including XHTML and HTML5, as far as I know), all attribute values and tag content should be encoded:
Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
There are two different kinds of encoding which are needed for different purposes in web programming, and it is easy to get confused.
Special characters in text which is to be displayed as HTML need to be encoded as HTML entities. This is particularly characters such as '<' which are part of HTML markup, but it may also be useful for other special characters if there is any doubt about the character encoding to be used.
Special characters in a URL need to be URL-encoded (replaced by %nn codes).
There is no harm in putting an HTML entity into a URL if it is going to be treated as HTML text by whatever receives it; but if it is part of an instruction to a program (such as the & used to separate arguments in a CGI query string) you should not encode it.
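The two kinds of encoding are applied in sequence when a link is generated: first URL-encode the data, then HTML-encode the finished URL for the attribute. A minimal sketch in Python follows; the path "products" and the parameter names are hypothetical.

```python
import html
from urllib.parse import urlencode

# Hypothetical query parameters; "fish & chips" contains a literal ampersand.
params = {"type": 1, "meal": "fish & chips"}

# Step 1: URL-encode the data. The & inside the value becomes %26, while
# the & that separates the two arguments is left alone.
query = urlencode(params)

# Step 2: HTML-encode the whole URL for use in an attribute value, which
# turns the separator into &amp;.
href = html.escape("products?" + query, quote=True)
link = f'<a href="{href}">{html.escape("Fish & Chips")}</a>'

print(query)
print(link)
```

The browser undoes step 2 when it parses the attribute, and the server undoes step 1 when it parses the query string, so the program on the other end sees the original values.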
Depends how your files are being served up and identified.
For XHTML, yes and it's required.
For HTML, no and it's incorrect to do it.

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
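The rules for text content and attribute values could be turned into two small escaping functions. This is a sketch built on Python's html.escape, with function names of my own choosing; it always escapes > as well, in the spirit of Canonical XML, and it assumes attribute values are always quoted.

```python
import html


def escape_text(value: str) -> str:
    """Escape text content: &, < and (to cover "]]>" in XHTML) also >."""
    return html.escape(value, quote=False)


def escape_attr(value: str) -> str:
    """Escape a quoted attribute value.

    html.escape with quote=True handles &, <, > and both quote characters.
    Tab, LF and CR are written as character references so that attribute
    value normalisation does not turn them into plain spaces.
    """
    escaped = html.escape(value, quote=True)
    return (
        escaped.replace("\t", "&#9;")
        .replace("\n", "&#10;")
        .replace("\r", "&#13;")
    )


print(escape_text("1 < 2 & ]]>"))
print('<input value="' + escape_attr('say "hi"\n') + '">')
```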
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi ... ?> processing instructions in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
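The CDATA splitting trick can be sketched in a few lines of Python. The function name to_cdata and the sample payload are made up for the example, and xml.etree.ElementTree is used only to confirm the round trip.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical payload that happens to contain the "]]>" sequence.
payload = "if (a[b[0]]> 1) {}"
serialized = "<code>" + to_cdata(payload) + "</code>"

# The parser joins the adjacent CDATA sections back into one text node.
round_tripped = ET.fromstring(serialized).text

print(serialized)
print(round_tripped)
```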
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (e.g. by turning document.write('</p>'); into document.write('<\/p>');). (You see a lot of more complicated, silly strategies to get around this one, like calling unescape on a JS-%-encoded string; often even '</scr'+'ipt>', which is still quite invalid.)
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish &amp; Chips</h1>
<img alt="Awesome picture of Meat Pie &amp; Chips" />
Fish &amp; Chips
The last example includes some URL Encoding for the ampersand too (&) and it's at this point things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, it's HTML encoding: &amp;, &gt; etc. Other times, when trying to pass these characters via a URL, use URL encoding: %20, %26 etc.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
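When the script content is generated by a program, the "\u003C" advice can be applied mechanically. The sketch below assumes the value is emitted as a JavaScript string literal via Python's json.dumps; the function name js_string_literal is hypothetical.

```python
import json


def js_string_literal(value: str) -> str:
    """Return a JavaScript string literal with no "<" in it."""
    literal = json.dumps(value)
    # json.dumps leaves "<" as it is, so replace it with its \u escape.
    # This removes "</", "<!--" and "<script" from the output in one go.
    return literal.replace("<", "\\u003C")


literal = js_string_literal("</script><!--<script>")
script_block = "<script>var s = " + literal + ";</script>"

print(script_block)
```

This only covers string literals; a < used as an operator in the script's own code is a separate matter.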
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.

Why are HTML character entities necessary?

Why are HTML character entities necessary? What good are they? I don't see the point.
Two main things.
They let you use characters that are not defined in the current charset. E.g., you can legally use ASCII as the charset, and still include arbitrary Unicode characters through entities.
They let you quote characters that HTML gives special meaning to, as Simon noted.
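The first point can be demonstrated with Python's built-in "xmlcharrefreplace" error handler, which writes numeric character references for anything the target charset cannot hold. The sample text is made up, and html.unescape stands in for the browser here.

```python
import html

text = "caf\u00e9 \u2603"  # "café" followed by a snowman, U+2603

# Encode for a page declared as ASCII: anything outside the charset is
# replaced by a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# Decoding the references gives the original characters back.
decoded = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)
print(decoded)
```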
"1 &lt; 2" lets you put "1 < 2" in your page.
Long answer:
Since HTML uses '<' to open tags, you can't just type '<' if you want that as text. Therefore, you have to have a way to say "I want the text < in my page". Whoever designed HTML (or, actually SGML, HTML's predecessor) decided to use '&something;', so you can also put things like a non-breaking space: '&nbsp;' (a space that is not collapsed and does not allow a line break). Of course, now you need to have a way to say '&', so you get '&amp;'...
They aren't, apart from &amp;, &lt;, &gt;, &quot; and probably &nbsp;. For all other characters, just use UTF-8.
In SGML and XML they aren't just for characters. They are a generic inclusion mechanism, and their use for special characters is just one of many cases.
<!ENTITY signature "<hr/><p>Regards, <i>&myname;</i></p>">
<!ENTITY myname "John Doe">
This kind of entity is not useful for web sites, because it works only in XML mode, and you can't use an external DTD file without enabling "validating" parsing mode in the browser configuration.
Entities can be expanded recursively. This allows XML to be used for a denial-of-service attack known as the "Billion Laughs" attack.
Firefox uses entities internally (in XUL and such) for internationalization and brand-independent messages (to make life easier for Flock and IceWeasel):
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide - &brandShortName;">
In HTML you just need &lt;, &amp; and &quot; to avoid ambiguities between text and markup.
All other entities are basically obsoleted by Unicode encodings and remain only as a convenience (but a good text editor should have macros/snippets that can replace them).
In XHTML all entities except the basic few are problematic, because they won't work with stand-alone XML parsers (e.g. &nbsp; won't work).
To parse all XHTML entities you need a validating XML parser (the option is usually called "resolve externals"), which is slower and needs a DTD Catalog set up. If you ignore or screw up your DTD Catalog, you'll be participating in a DDoS of the W3C servers.
Character entities are used to represent characters that are reserved for writing HTML, e.g. <, >, / and &.
If you want to represent these characters in your content you should use character entities; this helps the parser distinguish between content and markup.
You use entities to help the parser distinguish when a character should be represented as HTML, and what you really want to show the user, as HTML will reserve a special set of characters for itself.
Typing this literally in HTML
I don't mean it like that </sarcasm>
will cause the "</sarcasm>" tag to disappear,
e.g.
I don't mean it like that
as HTML does not have a tag defined as such. In this case, using entities will allow the text to display properly.
e.g.
No, really! &lt;/sarcasm&gt;
gives
No, really! </sarcasm>
as desired.
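The same behaviour can be observed outside a browser. The sketch below uses Python's html.parser.HTMLParser, a simple tokenizer rather than a full browser parser, to show which part of each input ends up as text; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Record what the parser treats as text and what it treats as end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.end_tags = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def tokenize(markup: str):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.text_parts), collector.end_tags


# Typed literally, </sarcasm> is consumed as markup and drops out of the text.
raw = tokenize("I don't mean it like that </sarcasm>")

# Written with entities, it stays part of the text.
escaped = tokenize("No, really! &lt;/sarcasm&gt;")

print(raw)
print(escaped)
```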