Why are HTML character entities necessary? - html

Why are HTML character entities necessary? What good are they? I don't see the point.

Two main things.
They let you use characters that are not defined in the current charset. E.g., you can legally use ASCII as the charset and still include arbitrary Unicode characters through entities.
They let you quote characters that HTML gives special meaning to, as Simon noted.

"1 &lt; 2" lets you put "1 < 2" in your page.
Long answer:
Since HTML uses '<' to open tags, you can't just type '<' if you want that as text. Therefore, you have to have a way to say "I want the text < in my page". Whoever designed HTML (or, actually, SGML, HTML's predecessor) decided to use '&something;', so you can also put things like a non-breaking space: '&nbsp;' (a space that is not collapsed and does not allow a line break). Of course, now you need to have a way to say '&', so you get '&amp;'...
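As a rough illustration of the round trip described above, here is a minimal Python sketch using the standard library's html module. The sample string is made up for the example, and quote=False is chosen only because the text is meant for element content, not an attribute.

```python
import html

# Text we want the reader to see literally on the page.
raw = "1 < 2 & 3 > 2"

# Escape the characters HTML treats as markup (quotes left alone here).
escaped = html.escape(raw, quote=False)
print(escaped)  # 1 &lt; 2 &amp; 3 &gt; 2

# A parser reverses the process when it builds the text node.
restored = html.unescape(escaped)

# Named references also cover characters that are hard to type,
# such as the non-breaking space (U+00A0).
nbsp = html.unescape("&nbsp;")
```

The escaped form is what goes into the HTML source; the restored form is what the browser ends up displaying.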

They aren't, apart from &amp;, &lt;, &gt;, &quot; and probably &nbsp;. For all other characters, just use UTF-8.

In SGML and XML they aren't just for characters. They are a generic inclusion mechanism, and their use for special characters is just one of many cases.
<!ENTITY signature "<hr/><p>Regards, <i>&myname;</i></p>">
<!ENTITY myname "John Doe">
This kind of entity is not useful for web sites, because it works only in XML mode, and you can't use an external DTD file without enabling "validating" parsing mode in the browser configuration.
Entities can be expanded recursively. This allows XML to be used for a Denial of Service attack called the "Billion Laughs Attack".
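To get a feel for why recursive expansion is dangerous, the following sketch only does the arithmetic; it does not build or parse any XML. The shape it assumes (a base string "lol" and nine nested entities, each referencing the previous one ten times) is the commonly cited form of the attack and is an assumption here, as is the function name.

```python
def expanded_length(levels: int = 9, fanout: int = 10, base: str = "lol") -> int:
    """Number of characters produced by fully expanding the top entity.

    Level 0 is the literal base string; every higher level references
    the level below it `fanout` times.
    """
    length = len(base)
    for _ in range(levels):
        length *= fanout
    return length


print(expanded_length())  # 3000000000
```

A document of well under a kilobyte would therefore expand to roughly three billion characters in a parser that resolves entities eagerly.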
Firefox uses entities internally (in XUL and such) for internationalization and brand-independent messages (to make life easier for Flock and IceWeasel):
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide - &brandShortName;">
In HTML you just need &lt;, &amp; and &quot; to avoid ambiguities between text and markup.
All other entities are basically obsoleted by Unicode encodings and remain only as a convenience (but a good text editor should have macros/snippets that can replace them).
In XHTML all entities except the basic few are problematic, because they won't work with stand-alone XML parsers (e.g. &nbsp; won't work).
To parse all XHTML entities you need a validating XML parser (the option's usually called "resolve externals"), which is slower and needs a DTD Catalog set up. If you ignore or screw up your DTD Catalog, you'll be participating in a DDoS of the W3C servers.

Character entities are used to represent characters that are reserved for writing HTML itself, e.g. <, >, & etc.
If you want these characters to appear in your content, you should use character entities; this helps the parser distinguish between content and markup.

You use entities to help the parser distinguish when a character should be represented as HTML, and what you really want to show the user, as HTML will reserve a special set of characters for itself.
Typing this literally in HTML
I don't mean it like that </sarcasm>
will cause the "</sarcasm>" tag to disappear,
e.g.
I don't mean it like that
as HTML does not have a tag defined as such. In this case, using entities will allow the text to display properly.
e.g.
No, really! &lt;/sarcasm&gt;
gives
No, really! </sarcasm>
as desired.

Related

Entity codes and the lang attribute: should I use both?

I am writing a markup document in Finnish.
I'm using the lang="fi-fi" attribute. Am I supposed to use the markup entities (&auml; for ä etc.) in conjunction with the language attribute, or is using the language attribute alone sufficient? How do the entities and language attribute affect each other?
The "problem" comes from the fact that the markup is written without entities and I have a script that's supposed to replace the scandic letters with entities by using regular expressions -- after defining the lang attribute the script doesn't appear to work anymore (which it supposedly did before adding the lang attribute).
My main concern is that the markup renders correctly regardless of the browser, although a "modern" browser can be assumed.
The lang attribute and entities do completely different jobs.
The lang attribute tells the parser what human language the document is written in. This allows, for example, search engines to tell if it is a good document to present to Finnish speakers and screen reader software to select the correct pronunciation library.
Entities just let you represent characters that you couldn't otherwise represent. e.g.
Because you can't type the character on your keyboard
Because the character encoding the document is saved in (e.g. ASCII) doesn't include the character. This century you should be using UTF-8 just about everywhere and shouldn't need to worry about that.
Because the character would otherwise have special meaning in HTML (e.g. <).
Always use a lang attribute if you know what language the text of the document will be written in
Always use entities for characters with special meaning in HTML
Use literal characters if you can be reasonably certain the character encoding won't be mangled (which you can be most of the time) as they use fewer bytes and are easier to read in source code.
The root of my problem was actually character encoding. Although all of the documents were defined as UTF-8, the script somehow didn't recognize it. After telling the script that the input files (the ones that were supposed to be fixed with entities) are UTF-8 encoded, the script functions correctly again.
As an answer to the question in the heading: to be absolutely sure that the documents are compatible with the server -- yes, I am supposed to use entity encoding (though I understand that assuming the server allows UTF-8 is a pretty safe assumption in general, as implied by Quentin). For other reasons (related to automatic content generation), I'm also supposed to use the lang attribute.
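The fix described above (decode the input as UTF-8 first, then replace non-ASCII letters) can be sketched in Python without regular expressions. This is not the asker's script; the function name is hypothetical, and the sketch emits numeric character references (&#228;) rather than named ones (&auml;), which browsers treat the same way.

```python
def to_ascii_with_references(utf8_bytes: bytes) -> str:
    """Decode UTF-8 input and rewrite every non-ASCII character
    as a numeric character reference."""
    text = utf8_bytes.decode("utf-8")  # declaring the input encoding is the key step
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Hyvää päivää".encode("utf-8")
print(to_ascii_with_references(sample))  # Hyv&#228;&#228; p&#228;iv&#228;&#228;
```

If the bytes were decoded with the wrong encoding instead, each ä would turn into two unrelated characters before any replacement happened, which matches the kind of failure described in the question.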

Replacing HTML markup characters with the corresponding character entity reference

Recently, when I was developing a simple web site, I came across a problem with replacing HTML markup characters (e.g. &, '', "", % etc.).
Most of the time we have to replace these characters with the corresponding character entity reference, but what I can't figure out is that in some instances I didn't have to replace a character with its corresponding character entity reference (e.g. $ -> &).
Can anyone please explain this?
You don't need to escape any of the characters, provided you can type them on your keyboard and the tools used to edit and display the HTML files don't destroy them (because they aren't Unicode-compatible, for example). However, with many characters it's easier to, say, type &mdash; than to work out how to type the character itself.
The special characters in reference SGML (hence HTML) are (as far as you need to know) >, < and &. If you start with real-world text that you want to include in your mark-up, you have to replace precisely all of those by their entity or character references (&lt; etc.) and you will be fine. (Exceptions are if you are inside a CDATA marked section or an element with content type CDATA, but let us just assume that that's never the case.)

Necessary to encode characters in HTML links?

Should I be encoding characters contained within a url?
Example:
Some link using &
or
Some link using &amp;
Yes.
In HTML (including XHTML and HTML5, as far as I know), all attribute values and tag content should be encoded:
Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
There are two different kinds of encoding which are needed for different purposes in web programming, and it is easy to get confused.
Special characters in text which is to be displayed as HTML need to be encoded as HTML entities. This is particularly characters such as '<' which are part of HTML markup, but it may also be useful for other special characters if there is any doubt about the character encoding to be used.
Special characters in a URL need to be URL-encoded (replaced by %nn codes).
There is no harm in putting an HTML entity into a URL if it is going to be treated as HTML text by whatever receives it; but if it is part of an instruction to a program (such as the & used to separate arguments in a CGI query string) you should not encode it.
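The two layers of encoding can be kept apart by applying them in order: URL-encode the data first, then HTML-escape the finished URL when it is written into an attribute. The sketch below uses Python's standard library; the path /search and the parameter names are invented for illustration.

```python
import html
from urllib.parse import urlencode

# Data values may themselves contain '&', so they are URL-encoded.
params = {"q": "fish & chips", "page": "2"}
query = urlencode(params)  # q=fish+%26+chips&page=2

# The '&' separating arguments stays a real '&' in the URL...
url = "/search?" + query

# ...but is written as &amp; once the URL sits inside an HTML attribute.
href = html.escape(url, quote=True)
link = f'<a href="{href}">{html.escape("Fish & Chips")}</a>'
print(link)
# <a href="/search?q=fish+%26+chips&amp;page=2">Fish &amp; Chips</a>
```

The browser undoes the HTML layer when it parses the attribute, so the server still receives a plain & between the two arguments and %26 inside the value.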
Depends how your files are being served up and identified.
For XHTML, yes and it's required.
For HTML, no and it's incorrect to do it.

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
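A minimal sketch of the two escaping functions these rules suggest, written in Python. The function names are hypothetical, the attribute version assumes the value will always be wrapped in double quotes, and neither function deals with the XML 1.1 restricted characters mentioned above.

```python
def escape_text(s: str) -> str:
    """Escape text content. '>' is escaped too, so that ']]>' can
    never appear, following the Canonical XML habit."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s: str) -> str:
    """Escape a value destined for a double-quoted attribute."""
    out = s.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    # Keep tab, LF and CR from being normalised into plain spaces.
    for ch, ref in (("\t", "&#9;"), ("\n", "&#10;"), ("\r", "&#13;")):
        out = out.replace(ch, ref)
    return out


print(escape_text("a < b && c ]]> d"))   # a &lt; b &amp;&amp; c ]]&gt; d
print(escape_attr('say "hi" & <go>\n'))  # say &quot;hi&quot; &amp; &lt;go>&#10;
```

The ampersand is replaced first in both functions; doing it later would re-escape the ampersands introduced by the other replacements.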
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi​s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
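The splitting trick for CDATA sections can be sketched as a single string replacement. The function name is hypothetical; the check at the end uses Python's xml.etree.ElementTree only to confirm that a parser reassembles the original text.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two sections
    so the terminator never appears inside the data."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


wrapped = to_cdata("a]]>b")
print(wrapped)  # <![CDATA[a]]]]><![CDATA[>b]]>

# A parser joins the adjacent sections back into one text node.
parsed = ET.fromstring("<r>" + wrapped + "</r>").text
```

The first section ends with the two brackets and the second begins with the >, so no single section ever contains the full terminator.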
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence, as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around, e.g. by turning document.write('</p>') into document.write('<\/p>'). (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>', which is still quite invalid.)
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish &amp; Chips</h1>
<img alt="Awesome picture of Meat Pie &amp; Chips" />
Fish &amp; Chips
The last example includes some URL encoding for the ampersand too (%26), and it's at this point that things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, it's HTML encoding: &amp; or &gt; etc. Other times, when trying to pass these characters via a URL, use URL encoding: %20, %26 etc.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
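One way to follow that advice when a server generates the script is to serialise the string as JSON and then rewrite every < as a \u003C escape. The sketch below is in Python and the function name is hypothetical; it covers string literals only, not regular expressions or other JavaScript source.

```python
import json


def js_string_for_script(value: str) -> str:
    """Return a JavaScript string literal that contains no '<',
    so it cannot close or confuse a surrounding <script> element."""
    literal = json.dumps(value)  # adds the quotes and escapes ", \ and control characters
    return literal.replace("<", "\\u003C")


risky = "</script><!--<script>"
safe = js_string_for_script(risky)
print(safe)  # "\u003C/script>\u003C!--\u003Cscript>"
```

Because \u003C is an ordinary JavaScript string escape, the script still sees the original text at run time.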
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.

When should one use HTML entities?

This has been confusing me for some time. With the advent of UTF-8 as the de-facto standard in web development I'm not sure in which situations I'm supposed to use the HTML entities and for which ones should I just use the UTF-8 character. For example,
em dash (—, &mdash;)
ampersand (&, &amp;)
3/4 fraction (¾, &frac34;)
Please do shed light on this issue. It will be appreciated.
Based on the comments I have received, I looked into this a little further. It seems that currently the best practice is to forgo using HTML entities and use the actual UTF-8 character instead. The reasons listed are as follows:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
As long as your page's encoding is properly set to UTF-8, you should use the actual character instead of an HTML entity. I read several documents about this topic, but the most helpful were:
UTF-8: The Secret of Character Encoding
Wikipedia Special Characters Help
From the UTF-8: The Secret of Character Encoding article:
Wikipedia is a great case study for an
application that originally used
ISO-8859-1 but switched to UTF-8 when
it became far too cumbersome to support
foreign languages. Bots will now
actually go through articles and
convert character entities to their
corresponding real characters for the
sake of user-friendliness and
searchability.
That article also gives a nice example involving Chinese encoding. Here is the abbreviated example for the sake of laziness:
UTF-8:
這兩個字是甚麼意思
HTML Entities:
&#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;
The UTF-8 and HTML entity encodings are both meaningless to me, but at least the UTF-8 encoding is recognizable as a foreign language, and it will render properly in an edit box. The article goes on to say the following about the HTML entity-encoded version:
Extremely inconvenient for those of us
who actually know what character
entities are, totally unintelligible
to poor users who don't! Even the
slightly more user-friendly,
"intelligible" character entities like
&theta; will leave users who are
uninterested in learning HTML
scratching their heads. On the other
hand, if they see θ in an edit box,
they'll know that it's a special
character, and treat it accordingly,
even if they don't know how to write
that character themselves.
As others have noted, you still have to use HTML entities for reserved XML characters (ampersand, less-than, greater-than).
You don't generally need to use HTML character entities if your editor supports Unicode. Entities can be useful when:
Your keyboard does not support the character you need to type. For example, many keyboards do not have em-dash or the copyright symbol.
Your editor does not support Unicode (very common some years ago, but probably not today).
You want to make it explicit in the source what is happening. For example, the code &nbsp; is clearer than the corresponding white space character.
You need to escape HTML special characters like <, &, or ".
Entities may buy you some compatibility with brain-dead clients that don't understand encodings correctly. I don't believe that includes any current browsers, but you never know what other kinds of programs might be hitting you up.
More useful, though, is that HTML entities protect you from your own errors: if you misconfigure something on the server and you end up serving a page with an HTTP header that says it's ISO-8859-1 and a META tag that says it's UTF-8, at least your &mdash;es will always work.
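The failure mode described above can be imitated in a few lines of Python. The sample text is invented, and Python's iso-8859-1 codec is used only as a stand-in for a client that trusts the wrong label; real browsers differ in the details.

```python
import html

expected = "caf\u00e9 \u2014 menu"  # "café", an em dash, "menu"

# The same page saved as UTF-8, once with literal characters, once with entities.
literal_page = expected.encode("utf-8")
entity_page = "caf&eacute; &mdash; menu".encode("utf-8")

# A client that believes the ISO-8859-1 label decodes the bytes like this.
misread_literal = literal_page.decode("iso-8859-1")
misread_entity = entity_page.decode("iso-8859-1")

print(misread_literal == expected)               # False
print(html.unescape(misread_entity) == expected)  # True
```

The entity version survives because it contains only ASCII bytes, which mean the same thing in both encodings.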
I would not use literal UTF-8 characters for characters that are easily confused visually. For example, it is difficult to distinguish an em dash from a minus, or especially a non-breaking space from a space. For these characters, definitely use entities.
For characters that are easily understood visually (such as the chinese examples above), go ahead and use UTF-8 if you like.
Personally, I have done everything in UTF-8 for a long time. However, in an HTML page you always need to convert ampersand (&), greater-than (>) and less-than (<) characters to their equivalent entities: &amp;, &gt; and &lt;.
Also, if you intend on doing some programming using utf-8 text, there are a few thing to watch for.
XML needs some extra lines to validate when using entities.
Some libraries do not play along nice with utf-8. For instance, PHP in some Linux distributions dropped full support for utf-8 in their regular expression libraries.
It is harder to limit the number of characters in a text that uses html entities, because a single entity uses many characters. Also there's always the risk of cutting the entity in half.
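The truncation problem in the last point can be avoided by counting characters on the decoded text and escaping again afterwards. This is a minimal Python sketch with a hypothetical function name; note that it rewrites named entities such as &eacute; as literal characters, which assumes the page encoding can represent them.

```python
import html


def truncate_escaped(escaped: str, limit: int) -> str:
    """Cut entity-encoded text to `limit` visible characters
    without splitting an entity in half."""
    plain = html.unescape(escaped)
    return html.escape(plain[:limit], quote=False)


source = "Fish &amp; Chips"
print(source[:8])                   # Fish &am   (naive cut, broken entity)
print(truncate_escaped(source, 6))  # Fish &amp;
```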
HTML entities are useful when you want to generate content that is going to be included (dynamically) into pages with (several) different encodings. For example, we have white label content that is included both into ISO-8859-1 and UTF-8 encoded web pages...
If character set conversion from/to UTF-8 wasn't such a big unreliable mess (you always stumble over some characters and some tools that don't convert properly), standardizing on UTF-8 would be the way to go.
If your pages are correctly encoded in utf-8 you should have no need for html entities, just use the characters you want directly.
All of the previous answers make sense to me.
In addition: it mostly depends on the editor you intend to use and the document language. The minimum requirement for the editor is that it supports the document language. That means that if your text is in Japanese, beware of using an editor which cannot display the characters (i.e. use no entities for the document text itself). If it's English, you can even use an old vim-like editor and use entities only for the relatively rare &copy; and friends.
Of course: &gt; for > and other HTML specials still need escapes.
But even with the other Latin-1 languages (German, French etc.), writing &auml; is a pain in you know where...
In addition, I personally write entities for invisible characters and those which are looking similar to standard-ascii and are therefore easily confused. For example, there is u1173 (looking like a dash in some charsets) or u1175, which looks like the vertical bar. I'd use entities for those in any case.