HTML: Encoding special characters: name vs code [duplicate]

So, I know that I can represent an ampersand as &amp; or &#38;.
I have found that at least one method of parsing XML does not allow for the abbreviation-based style - only numeric. Is there a best-practice? I want to instruct my team to use the numeric versions because of my experience, but one instance hardly seems like enough reason to convince them.
Which method should we favor?

XML only has a small set of these symbolic entities, for amp, quot, gt and lt.
The symbolic names we're familiar with, such as &copy;, exist because they are declared in the HTML DTD, here: http://www.w3.org/TR/html4/sgml/entities.html (although I think most browsers have them baked in).
Therefore, if you are using (X)HTML, get your doctype right, and then follow the links on w3.org to XHTML to see the entities available.
As far as best practices, most people find the symbolic names easier to understand and will use them when available. I would recommend that.
The only reason not to is that there used to be cases in very old browsers where named entities wouldn't work, but I don't believe this is the case any more.

If you mean other HTML entities: with pure XML, only the entities amp, lt, gt, quot, and apos are predefined (apos is not defined in HTML 4, but amp is).
However, all other HTML entities (such as nbsp) will not be available unless defined in the DOCTYPE, so in such a case, using numeric entities may indeed be preferable.
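The difference between the predefined XML entities and HTML-only names can be checked with a quick experiment. This is a minimal sketch using Python's standard xml.etree.ElementTree as a stand-in for "an XML parser with no DTD"; the helper name parses_as_xml is made up for illustration, and other XML parsers may report the error differently.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed for a DTD-less XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The five entities XML predefines are always accepted.
predefined = parses_as_xml("<p>&amp; &lt; &gt; &quot; &apos;</p>")

# An HTML-only name such as nbsp is undefined without a DTD.
named_html = parses_as_xml("<p>a&nbsp;b</p>")

# Numeric references (decimal or hex) need no declaration at all.
numeric = parses_as_xml("<p>a&#160;b &#xA9;</p>")

print(predefined, named_html, numeric)
```

If the same markup has to survive both an HTML parser and an XML parser, the numeric form is the one that works in both.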

Related

Entity codes and the lang attribute: should I use both?

I am writing a markup document in Finnish.
I'm using the lang="fi-fi" attribute. Am I supposed to use the markup entities (&auml; for ä etc.) in conjunction with the language attribute, or is using the language attribute alone sufficient? How do the entities and language attribute affect each other?
The "problem" comes from the fact that the markup is written without entities and I have a script that's supposed to replace the scandic letters with entities by using regular expressions -- after defining the lang attribute the script doesn't appear to work anymore (which it supposedly did before adding the lang attribute).
My main concern is that the markup renders correctly regardless of the browser, although a "modern" browser can be assumed.
The lang attribute and entities do completely different jobs.
The lang attribute tells the parser what human language the document is written in. This allows, for example, search engines to tell if it is a good document to present to Finnish speakers and screen reader software to select the correct pronunciation library.
Entities just let you represent characters that you couldn't otherwise represent. e.g.
Because you can't type the character on your keyboard
Because the character encoding the document is saved in (e.g. ASCII) doesn't include the character. This century you should be using UTF-8 just about everywhere and shouldn't need to worry about that.
Because the character would otherwise have special meaning in HTML (e.g. <).
Always use a lang attribute if you know what language the text of the document will be written in
Always use entities for characters with special meaning in HTML
Use literal characters if you can be reasonably certain the character encoding won't be mangled (which you can be most of the time) as they use fewer bytes and are easier to read in source code.
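The recommendations above can be sketched in code. This is only an illustration, assuming Python's standard html module; the function name to_ascii_markup and the sample string are invented here. It always escapes the characters that are special in HTML, and falls back to numeric references for everything outside ASCII, which is what you would want only if the encoding cannot be trusted.

```python
import html


def to_ascii_markup(text: str) -> str:
    """Escape HTML-special characters, then turn non-ASCII into numeric references."""
    escaped = html.escape(text, quote=True)  # handles & < > " '
    # The xmlcharrefreplace error handler writes &#NNN; for unencodable characters.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "Hyvää päivää <b> & co"
ascii_safe = to_ascii_markup(sample)

# Decoding the references gives back the original text.
round_trip = html.unescape(ascii_safe)

print(ascii_safe)
```

With a reliable UTF-8 pipeline only the html.escape step is needed; the literal ä can stay as it is.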
The root of my problem was actually character encoding. Although all of the documents were defined with UTF-8, the script somehow didn't recognize it. By telling the script that the input files (that were supposed to be fixed with entities) are UTF-8 encoded, the script functions correctly again.
As an answer to the question in the heading: to be absolutely sure that the documents are compatible with the server -- yes, I am supposed to use entity encoding (though I understand that assuming the server allows UTF-8 is a pretty safe assumption in general, as implied by Quentin). Due to other reasons (related to automatic content generation), I'm also supposed to use the lang attribute.

HTML5: which is better - using a character entity vs using a character directly?

I've recently noticed a lot of high profile sites using characters directly in their source, eg:
<q>“Hi there”</q>
Rather than:
<q>&ldquo;Hi there&rdquo;</q>
Which of these is preferred? I've always used entities in the past, but using the character directly seems more readable, and would seem to be OK in a Unicode document.
If the encoding is UTF-8, the normal characters will work fine, and there is no reason not to use them. Browsers that don't support UTF-8 will have lots of other issues while displaying a modern webpage, so don't worry about that.
So it is easier and more readable to use the characters and I would prefer to do so.
It also saves a couple of bytes which is good, although there is much more to gain by using compression and minification.
The main advantage I can see with encoding characters is that they'll look right, even if the page is interpreted as ASCII.
For example, if your page is just a raw HTML file, the default settings on some servers would be to serve it as text/html; charset=ISO-8859-1 (the default in HTTP 1.1). Even if you set the meta tag for content-type, the HTTP header has higher priority.
Whether this matters depends on how likely the page is to be served by a misconfigured server.
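To see what a wrong charset header does to literal characters, here is a small sketch. It assumes Python's iso-8859-1 codec is a fair stand-in for a client that trusts a misconfigured header (real browsers typically treat that label as windows-1252, so the exact garbage differs); the variable names are illustrative only.

```python
import html

literal = "<p>“Hi there”</p>"
with_entities = "<p>&ldquo;Hi there&rdquo;</p>"

# Saved as UTF-8 but decoded as ISO-8859-1: each curly quote becomes three junk characters.
misread = literal.encode("utf-8").decode("iso-8859-1")

# The entity version is pure ASCII, so the wrong charset label cannot damage it.
entity_bytes = with_entities.encode("ascii")
survives = html.unescape(entity_bytes.decode("iso-8859-1"))

print(repr(misread))
print(survives)
```

The entity form only protects against a mislabelled encoding; it does nothing for a page that is otherwise served correctly.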
It is better to use characters directly. They make for easier-to-read code.
Google's HTML style guide advocates the same; see the Google HTML/CSS Style Guide.
Using characters directly. They are easier to read in the source (which is important as people do have to edit them!) and require less bandwidth.
The example given is definitely wrong, in theory as well as in practice, in HTML5 and in HTML 4. For example, the HTML5 discussions of q markup says:
“Quotation punctuation (such as quotation marks) that is quoting the contents of the element must not appear immediately before, after, or inside q elements; they will be inserted into the rendering by the user agent.”
That is, use either q markup or punctuation marks, not both. The latter is better on all practical accounts.
Regarding the issue of characters vs. entity references, the former are preferable for readability, but then you need to know how to save the data as UTF-8 and declare the encoding properly. It’s not rocket science, and usually better. But if your authoring environment is UTF-8 hostile, you need not be ashamed of using entity references.

Is HTML an application of XML?

So, yeah, is HTML a particular application of XML? Like, instead of user-customizable tags, "hard coded" fixed tags decided by the W3C and interpreted by browsers? Or are they totally different things?
Also, in which case is XML better than a database to transfer information inside a Web application? (I was thinking, saving users information or things like that may do better with XML documents than with a database).
Here's a history of HTML
...The HTML that Tim invented was strongly based on SGML (Standard Generalized Mark-up Language), an internationally agreed upon method for marking up text into structural units such as paragraphs, headings, list items and so on. SGML could be implemented on any machine. The idea was that the language was independent of the formatter (the browser or other viewing software) which actually displayed the text on the screen. The use of pairs of tags such as <TITLE> and </TITLE> is taken directly from SGML, which does exactly the same. The SGML elements used in Tim's HTML included P (paragraph); H1 through H6 (heading level 1 through heading level 6); OL (ordered lists); UL (unordered lists); LI (list items) and various others. What SGML does not include, of course, are hypertext links: the idea of using the anchor element with the HREF attribute was purely Tim's invention, as was the now-famous `www.name.name' format for addressing machines on the Web....
And in no case is XML "better" than a database (are cakes better than ovens?). XML isn't for storing data, it's for transferring it. Unless the data is absolutely minimal, you have to find some other way to store it. Opening static XML files on the file system over and over as you save and read data is a terrible way to go about it.
So, yeah, is HTML a particular application of XML?
No.
HTML 4 is an application of SGML, but most parsers for it do not treat it as such.
XHTML is an application of XML, but it is usually served as text/html instead of application/xhtml+xml and so is treated like HTML.
HTML 5 is not an application of either SGML or XML (except in its XML serialisation) and has its own parsing rules.
Also, in which case is XML better than a database to transfer information inside a Web application?
XML is a good basis for a data exchange format. It is not a good basis for storing data in order to search it (which is what happens "inside" most web applications)
HTML and XML both come from SGML, hence their similarities. But XML is a strict grammar (no predefined tag names), while HTML is both a not very strict grammar and a vocabulary (tag names). There is an HTML variant which strictly complies with XML rules : XHTML.
As for using XML as a database, it is possible under certain circumstances. But it really depends on your architecture, language, volumetry and lots of other considerations. I suggest you open a new question with more details for this.
XHTML is a reformulation of HTML as XML app.
You can invent your own tags. I don't think HTML5 has a doctype for that though. You can create them with JavaScript and initialize/style them with CSS like any other element.
instead of using XML, spit out JSON, seriously, do this.
if you are worried about using your db, think about switching to couchdb or nosql. they're ripe for JSON.
don't get me wrong, your thought process isn't wrong, you can do that. i've seen it done rather well. but most people don't get it right. and seriously, JSON is your friend.
For the differences between HTML & XML see:
http://www.w3schools.com/xml/xml_whatis.asp
XML is primarily used for transferring data, not storing it. A database will generally give you much more flexibility in querying the data.
HTML allows things that XML doesn't allow, like omitting end tags, omitting the quotes around attribute values, and using upper-case and lower-case interchangeably. So HTML is not just another XML vocabulary.
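A short experiment makes the point concrete. This sketch assumes Python's standard html.parser (a lenient tokenizer, not a full browser-grade HTML parser) and xml.etree.ElementTree; the class name TagCollector and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


# Upper-case names, an unquoted attribute value, and no end tags at all.
loose = "<P CLASS=intro>Hello<BR>world"

collector = TagCollector()
collector.feed(loose)
collector.close()

try:
    ET.fromstring(loose)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print(collector.start_tags, xml_ok)
```

The HTML side normalizes names to lower case and accepts the markup; the XML side rejects it outright.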
XHTML, however, was an attempt to reformulate HTML as an XML vocabulary.

Is HTML a context-free language?

Reading some related questions made me think about the theoretical nature of HTML.
I'm not talking about XHTML-like code here. I'm talking about stuff like this crazy piece of markup, which is perfectly valid HTML(!)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html<head>
<title//
<p ltr<span id=p></span</p>
</>
So given the enormous complexity that SGML injects here, is HTML a context-free language? Is it a formal language anyway? With a grammar?
What about HTML5?
I'm new to the concept of formal languages, so please bear with me. And yes, I have read the wikipedia article ;)
Context Free is a concept from language theory that has important implications in parser implementation. A Context Free Language can be described by a Context Free Grammar, which is one in which all rules have a single non-terminal symbol at the left of the arrow:
X→δ
That simple restriction allows X to be substituted by the right-hand side of the rules in which it appears on the left without regard to what came before or after. For example, if while deriving or parsing one arrives at:
αXλ
one is sure that
αδλ
is also valid. Examples of non-context-free rules would be:
XY→δ
Xa→δ
aX→δ
Those would require knowing what could be derived around X to determine if a rule applies, and that leads to non-determinism (what's around X would also like to know what it derives to), which is a no-no in parsing, and in any case we want a language to be well-defined.
The only way to prove that a language is context-free is by proving that there's a context-free grammar for it, which is not an easy task. Most programming languages one comes across are already described by CFGs, so the job is done. But there are other languages, including programming languages, that are described using logic or plain English, so work is required to find out whether they are context-free.
For HTML, the answer about its context-freedom is yes. SGML is a well defined Context Free Language, and HTML defined on top of it is also a CFL. Parsers and grammars for both languages abound on the Web. At any rate, that there exist LL(k) grammars for valid HTML is enough proof that the language is context-free, because LL is a proven subset of CF.
But the way HTML evolved over the life of the Web forced browsers to treat it as not that well defined. Modern Web browsers will go out of their way to try to render something sensible out of almost anything they find. The grammars they use are not CFGs, and the parsers are far more complex than the ones required for SGML/HTML.
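To make the notion of a context-free grammar for nested markup concrete, here is a toy recognizer. It is a sketch for an invented mini-language with a fixed two-tag vocabulary, not a grammar for HTML or SGML; the names TAGS, tokenize, and is_valid are hypothetical. The grammar is Content -> (TEXT | Element)* and Element -> "<t>" Content "</t>" for each t in TAGS, and each function handles one non-terminal without looking at its surroundings.

```python
import re

TAGS = ("b", "i")  # hypothetical fixed vocabulary; a finite set keeps the grammar context-free


def tokenize(source):
    """Split the source into tag tokens and text tokens."""
    return re.findall(r"</?[a-z]+>|[^<]+", source)


def parse_content(tokens, pos):
    """Content -> (TEXT | Element)*. Returns the position after the content."""
    while pos < len(tokens):
        token = tokens[pos]
        if token.startswith("</"):
            break  # an end tag belongs to the enclosing Element rule
        if token.startswith("<"):
            pos = parse_element(tokens, pos)
        else:
            pos += 1  # plain text
    return pos


def parse_element(tokens, pos):
    """Element -> '<t>' Content '</t>' for a known tag name t."""
    name = tokens[pos][1:-1]
    if name not in TAGS:
        raise ValueError(f"unknown tag: {name}")
    pos = parse_content(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != f"</{name}>":
        raise ValueError(f"missing end tag for: {name}")
    return pos + 1


def is_valid(source):
    tokens = tokenize(source)
    if "".join(tokens) != source:
        return False  # a stray '<' that is not part of a tag
    try:
        return parse_content(tokens, 0) == len(tokens)
    except ValueError:
        return False


print(is_valid("<b>x<i>y</i></b>"), is_valid("<b>x<i>y</b></i>"))
```

Unlike a browser, this recognizer rejects anything outside the grammar instead of repairing it, which is exactly the gap between the formal language and HTML as found on the Web.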
HTML is defined at several levels.
At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
At the next level is XML, which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
At the next level are the rules about which attributes are valid for which tags, languages that can be embedded in HTML like CSS and JavaScript.
Finally, you have the semantic rules about what a given HTML document means.
The syntactic part is defined well enough that it can be verified. The semantic part is much larger than the syntactic one, and is defined in terms of browser actions regarding HTTP, and the Document Object Model (DOM), and how a model should be rendered to the screen.
In the end:
Parsing correct HTML is extremely easy (it's context-free and LL/LR).
Parsing the HTML that actually exists over the Web is difficult.
Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.
Valid HTML is not a context-free language.
First of all, HTML being an application of SGML is fiction for all practical purposes, so analyzing SGML to answer the question is useless. (However, the SGML fiction probably isn't context-free, either.)
It's more useful to look at the actually defined HTML parsing algorithm. It works on two levels: tokenization and tree building. What HTML calls tokenization is a higher-level operation than what is usually called tokenization when talking about parsers. In the case of HTML, tokenization splits a stream of characters into units like start tags, end tags, comments and text. The tokenizer expands character references. Usually, when talking about parsers, you'd probably treat stuff like the less-than sign as "tokens" and would consider character references to consist of tokens instead of being resolved by the tokenizer.
If you consider the process of splitting the input stream into tokens, that level of the HTML language is regular (except for feedback from the tree builder).
However, there are three complications. The first is that splitting the input stream into tokens is just the first stage; the tree builder then actually cares about the identifiers in the tokens. The second is that the tree builder feeds back into the tokenizer, so that some state transitions made by the tokenizer depend on the state of the tree builder! The third is that valid documents in the language are defined by rules that apply to the output of the tree builder stage, and those rules are complex enough that they can't be fully defined using tree automata (as evidenced by RELAX NG not being expressive enough to describe all the validity constraints).
This isn't an actual proof, but you can probably develop real proofs by working from complications #2 and #3.
Note that the case of invalid documents is not particularly interesting as a question of whether the language is context-free in the sense of there being a context-free grammar that generates all the possible strings with no regard to the parse tree having some intelligible interpretation in terms of the tree that an HTML parser generates. The HTML parser will successfully consume all possible strings, so in that sense, all possible strings are in the "invalid HTML" language.
Edit: Interesting questions left as exercise to the reader:
Is HTML without parse errors but ignoring validity a context-free language?
Is HTML without parse errors and ignoring general validity but with only valid element names allowed a context-free language?
(Complication #2 applies in both cases.)
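The HTML sense of "tokenization" described in this answer (start tags, end tags, comments, and text, with character references already expanded) can be observed with Python's standard html.parser. This is a rough sketch: html.parser is not the HTML5 tokenizer from the specification and has no tree builder, and the class name TokenLogger is made up for illustration.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Log the coarse-grained tokens the parser emits."""

    def __init__(self):
        # convert_charrefs=True makes the tokenizer resolve character references itself.
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))

    def handle_data(self, data):
        self.tokens.append(("text", data))


logger = TokenLogger()
logger.feed("<p>a &amp; b<!--note--></p>")
logger.close()

print(logger.tokens)
```

Note that the text token already contains a plain ampersand: the reference was resolved during tokenization rather than being passed on as separate lower-level tokens.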
NO
See Edit Below
It depends.
If you are talking about the subset consisting of only theoretical HTML, then yes.
If you also include real life, working HTML that is accessed and used successfully by millions of people daily on many of the top sites on the internet then NO.
That is what gives HTML flexibility. The parsing engine adds tags, closes tags, and takes care of stuff that a theoretical CFG can't do. If you took automata you might remember that a production rule in a formal grammar cannot be empty (aka epsilon/lambda) on the lhs (left-hand side). Since the parsing engine is basically using knowledge that a formal grammar and automata couldn't have, it isn't restricted by that and the 'grammar' would have epsilon/lambda -> result where the specific epsilon/lambda rule is chosen based on information not available in the grammar.
Since I don't think empty lhs are allowed in any formal grammars, HTML cannot be defined by a formal grammar and is not a formal language at all.
Sure, HTML5 might try to move towards a 'more formal' language description but the likelihood that it becomes a context free language in reality (i.e. strings not matched by the grammar are rejected) is about the likelihood XHTML 2.0 takes the world by storm and replaces HTML altogether (XHTML is the attempt they made to make HTML a formal language...it was rejected en masse due to its fragility).
Noteworthy is the fact that HTML 5 is the FIRST HTML standard to be defined before being implemented! That's right, HTML 1-4 consist of random ideas someone just implemented in a browser, and were collected into standards after the fact based on which features were popularly used and widely implemented. Then they tried XHTML, which totally failed to be adopted. Even 'xhtml' on the web is automatically parsed as HTML under almost every circumstance to prevent stuff from just breaking with a cryptic syntax error. Now you can see how we got here and why it is unlikely to be formalized any time soon.
Lesson: "In theory, there is no difference between theory and practice. In practice, there is." - Yogi Berra
EDIT:
Actually, after reading through the documents it turns out that HTML, even according to the HTML 4.01 specification, doesn't actually conform to SGML. To see for yourself, view the HTML 4.01 Strict document type definition (doctype) at http://www.w3.org/TR/html4/strict.dtd and note the following lines:
The HTML 4.01 specification includes additional
syntactic constraints that cannot be expressed within
the DTDs.
So I would say that it is probably not a CFL due to those features (although technically this doesn't disprove the hypothesis that there is some possible PDA that accepts HTML 4.01, it does rule out the argument that SGML is a CFL and therefore HTML is a CFL).
HTML5 flip-flops, abandoning any implied conformance to SGML, but is presumably describable by a CFG. However it will still provide best-effort parsing not based on a cfg, so IMO the current situation (i.e. language specification is defined formally, with invalid strings still being accepted, parsed and rendered in a best effort fashion) in this regard is unlikely to change drastically for a long, long, long time.
HTML5 is different from previous HTML versions in that it strictly defines the parsing behaviour of code that isn't completely correct. Pre-HTML5 parsers vary and each do their best to 'guess' the intention of the code author.

Should I write Polyglot HTML5 documents?

I've been considering converting my current HTML5 documents to polyglot HTML5 ones. I figure that even if they only ever get served as text/html, the extra checks of writing it XML would help to keep my coding habits tidy and valid.
Is there anything particularly thrilling in the HTML5-only space that would make this an unwise choice?
Secondly, the specs are a bit hazy on how to validate a polyglot document. I assume the basics are:
No errors when run through the W3C Validator as HTML5
No errors when run through an XML parser
But are there any other rules I'm missing?
Thirdly, seeing as it is a polyglot, does anyone know any caveats to serving it as application/xhtml+xml to supporting browsers and text/html to non-supporting ones?
Edit: After a small bit of experimenting I found that entities like &nbsp; break in XHTML5 (no DTD). That XML parser is a bit of a double-edged sword; I guess I've answered my third question already.
Work on defining how to create HTML5 polyglot documents is currently on-going, but see http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html for an early draft. It's certainly possible to do, but it does require a good deal of coding discipline, and you will need to decide whether it's worth the effort. Although I create HTML4.01/XHTML1.0 polyglot documents, I create them using an XML tool chain which guarantees XML well-formedness and have specialized code to ensure compatibility with HTML non-void elements and valid XML characters. Direct hand coding would be very difficult.
One known current issue in HTML5 is the srcdoc attribute on the iframe element. Because the value of the attribute contains markup, certain characters need to be escaped. The HTML5 draft spec describes how to do this for the HTML serialization, but not (the last time I looked) how to do it in the XHTML serialization.
I'm late to the party but after 5 years the question is still relevant.
On one hand closing all my tags strongly appeals to me. For people reading it, for easier editing, for Great Justice. OTOH, looking at the gory details of the polyglot spec — http://www.sitepoint.com/have-you-considered-polyglot-markup/ has a convenient summary at the end — it's clear to me I can't get it all right by hand.
https://developer.mozilla.org/en/docs/Writing_JavaScript_for_XHTML also sheds interesting light on why XHTML failed: the very choice to use XML mime type has various side effects at run time. By now it should be routine for good JS code to handle these (e.g. always lowercase tag names before comparing) but I don't want all that. There are enough cross-browser issues to test for as-is, thank you.
So I think there is a useful middle way:
For now serve only as text/html. Stop worrying that it will actually parse as exactly the same DOM with same runtime behavior in both HTML and XML modes.
Only strive that it parses as some well-formed XML. It helps readers, it helps editors, it lets me use XML parser on my own documents.
Unfortunately, polyglot tools are rare to non-existent — it's hard to even serialize back XML in a way that also passes the HTML requirements...
No brainer: always self close void tags (<hr/>) and separately close non-void tags (<script ...></script>).
No brainers: use lowercase tags and attr (except some SVG but foreign content uses XML rules anyway), always quote attribute values, always provide attribute values (selected="selected" is more verbose than standalone selected but I can live with that).
Inline <script> and <style> are most annoying. I can't use & or < inside without breaking XML parsing. I need:
<script>/*<![CDATA[*/
foo < bar && bar < baz;
/*]]>*/</script>
...and that's about it! Not caring about XML namespaces or matching HTML's implied DOM for tables drops about half the rules :-)
Await some future when I can directly go to authoring XHTML, skipping polyglotness. The benefits are I'll be able to forget the tag-closing limitations, will be able to directly consume and produce it with XML tools. Sure, neglecting xml namespaces and other things now will make the switch harder, but I think I'll create more new documents in this future than convert existing ones.
Actually I'm not entirely sure what's stopping me from living in that future right now. Is it only IE 8? I'm also a tiny bit concerned about the all-or-nothing error handling. I'm slightly hoping a future HTML spec will find a way to shrink the HTML vs XML gaps, e.g. make browsers accept <hr></hr> and <script .../> in HTML— while still retaining HTML error handling.
Also, tools. Having libraries in many languages that can serialize to polyglot markup would make it feasible for programs to generate it. Having tools to validate and convert HTML5 <-> polyglot <-> XHTML5 would help. Otherwise, it's pretty much doomed.
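The "parses as some well-formed XML" goal from this answer is easy to check mechanically. Below is a minimal sketch using Python's standard xml.etree.ElementTree; the helper well_formed and the sample strings are illustrative, and this checks XML well-formedness only, not the full set of polyglot rules.

```python
import xml.etree.ElementTree as ET


def well_formed(markup):
    """Return True if the markup parses as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Bare < and & inside an inline script break XML parsing.
plain_script = "<div><script>foo < bar && bar < baz;</script></div>"

# The commented-out CDATA wrapper hides them from the XML parser.
cdata_script = (
    "<div><script>/*<![CDATA[*/\n"
    "foo < bar && bar < baz;\n"
    "/*]]>*/</script></div>"
)

# Void elements must be self-closed to be well-formed.
void_html_style = "<div><hr></div>"
void_self_closed = "<div><hr/></div>"

script_text = ET.fromstring(cdata_script).find("script").text

print(well_formed(plain_script), well_formed(cdata_script))
print(well_formed(void_html_style), well_formed(void_self_closed))
```

Running a check like this in a build step covers the hand-maintained rules listed above without committing to serving the pages as application/xhtml+xml.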
Given that the W3C's documentation on the differences between HTML and XHTML isn't even finished, it's probably not worth your time to try to do polyglot. Not yet anyways.... give it another couple of years.
In any event, only in the extremely narrow circumstances where you are actively planning on parsing your HTML as XML for some specific purpose, should you invest the extra time in XML-compliance. There are no benefits of doing it purely for consumption by web browsers -- only drawbacks.
Should you? Yes. But first some clarification on a couple points.
Sending the Content-Type: application/xhtml+xml header only means it should go through an XML parser, it still has all the benefits of HTML5 as far as I can tell.
About &nbsp;: that isn't defined in XML. The only character entity references XML defines are lt, gt, apos, quot, and amp; you will need to use numeric character references for anything else. The code for nbsp is &#160; or &#xA0;. I personally prefer hex because Unicode code points are represented that way (U+00A0).
Sending the header is useful for testing because you can quickly find problems with your markup such as unclosed tags, stray end tags, text that could be interpreted as a tag, etc, basically stuff that can break the look or even functionality of your site.
Most significantly in my opinion, is if you are allowing user input and it fails to parse, that generally means you didn't escape their data and are leaving yourself open to a vulnerability. Parsed as HTML, you might not ever notice a problem until someone starts injecting scripts to harass your users or steal data.
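The escaping point can be illustrated with a small test harness. This is a sketch under assumptions: render_comment is a hypothetical template function, html.escape stands in for whatever escaping your framework provides, and xml.etree.ElementTree plays the role of the strict XML parser. A parse failure signals missing escaping, but passing the check does not prove the output is safe, since unescaped input that happens to be well-formed would still get through.

```python
import html
import xml.etree.ElementTree as ET


def render_comment(comment: str, escape: bool) -> str:
    """Hypothetical template: wrap user text in a paragraph."""
    body = html.escape(comment) if escape else comment
    return f"<p>{body}</p>"


def well_formed(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


user_input = "Tom & Jerry <3"

unescaped_ok = well_formed(render_comment(user_input, escape=False))
escaped_ok = well_formed(render_comment(user_input, escape=True))

# After parsing, the escaped version yields the user's text unchanged.
recovered = ET.fromstring(render_comment(user_input, escape=True)).text

print(unescaped_ok, escaped_ok, recovered)
```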
This page is pretty good about explaining what polyglot markup is: https://blog.whatwg.org/xhtml5-in-a-nutshell
This sounds like a very difficult thing to do. One of the downfalls of XHTML was that it wasn't possible to steer successfully between the competing demands of XML and vintage HTML.
I think if you write HTML5 and validate it successfully, you will have as tidy and valid a document as anyone would need.
This wiki has some information not present in the W3C document: http://wiki.whatwg.org/wiki/HTML_vs._XHTML