What are all the HTML escaping contexts? - html

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.

<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
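As a rough illustration of the minimal rules above for text content and double-quoted attribute values, here is a Python sketch. The function names are made up for this example, and it assumes the document is served as UTF-8, so no out-of-charset escaping is attempted.

```python
def escape_text(s: str) -> str:
    """Escape text content: & first, then <, and > as Canonical XML does."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s: str) -> str:
    """Escape a value destined for a double-quoted attribute."""
    s = s.replace("&", "&amp;").replace('"', "&quot;").replace("<", "&lt;")
    # Keep tab, LF and CR from being normalised to plain spaces.
    return s.replace("\t", "&#9;").replace("\n", "&#10;").replace("\r", "&#13;")


print("<div>" + escape_text("a < b && c ]]> d") + "</div>")
print('<input value="' + escape_attr('say "hi" & <bye>\n') + '">')
```

The ampersand is replaced first in both functions so that the references introduced by the later replacements are not escaped a second time.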
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi?> processing instructions in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
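The CDATA splitting trick can be sketched in Python as follows. The helper name cdata is hypothetical, and ElementTree is used only to check that a parser joins the two sections back together.

```python
import xml.etree.ElementTree as ET


def cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any ]]> across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


xml = "<r>" + cdata("if (a[b[0]]>1) x();") + "</r>"
print(xml)
# An XML parser joins the adjacent sections back into the original text.
print(ET.fromstring(xml).text)
```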
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (e.g. by turning document.write('</p>') into document.write('<\/p>')). (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>' which is still quite invalid.)
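If the script content is generated server-side, the <\/ workaround can be applied mechanically. A minimal Python sketch, assuming the value ends up in a JavaScript string literal (js_string_literal is a made-up helper name):

```python
import json


def js_string_literal(value: str) -> str:
    """Return a JS string literal that is safe inside an HTML <script> element."""
    # json.dumps yields a valid JavaScript string literal; "\/" is a legal
    # escape for "/" in both JSON and JavaScript string literals.
    return json.dumps(value).replace("</", "<\\/")


print("<script>document.write(" + js_string_literal("</p>") + ");</script>")
```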
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)

The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish &amp; Chips</h1>
<img alt="Awesome picture of Meat Pie &amp; Chips" />
Fish & Chips
The last example includes some URL encoding for the ampersand too (%26), and it's at this point that things get hairy (the ampersand is being sent as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, it's HTML encoding: &amp; or &gt; etc. Other times, when trying to pass these characters via a URL, use URL encoding: %20, %26 etc.
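To make the two layers concrete, here is a small Python sketch using only the standard library. The page name products.aspx and the parameter names are invented for the example.

```python
from html import escape
from urllib.parse import urlencode

# URL-encode the data first: the "&" inside the value becomes %26.
query = urlencode({"type": 1, "meal": "fish & chips", "page": 1})
# Then HTML-encode the whole URL for the attribute: separators become &amp;.
href = escape("products.aspx?" + query)

print('<a href="' + href + '">' + escape("Fish & Chips") + "</a>")
```

The order matters: URL encoding protects the data inside the query string, and HTML encoding then protects the finished URL inside the markup.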
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)

If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
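One way to apply the \u003C suggestion when embedding server-side data in a script block is sketched below in Python. The helper name script_safe_json is hypothetical, and the sketch assumes the data is serialised as JSON.

```python
import json


def script_safe_json(value) -> str:
    """Serialise value as JSON with every "<" written as \\u003C."""
    # "<" can only occur inside string literals in json.dumps output,
    # so a plain replace cannot break the JSON structure.
    return json.dumps(value).replace("<", "\\u003C")


payload = script_safe_json({"html": "<!-- <script>alert(1)</script> -->"})
print("<script>var data = " + payload + ";</script>")
```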

If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.
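As one possible illustration of the DOM approach, here is a Python sketch using the standard library's ElementTree (any environment with a DOM or tree API would work along the same lines):

```python
import xml.etree.ElementTree as ET

# Build the element tree; values are stored as plain, unescaped strings.
div = ET.Element("div", {"title": 'say "hi"'})
div.text = "1 < 2 & 3"

# Escaping happens once, in the serialiser.
markup = ET.tostring(div, encoding="unicode")
print(markup)
```

Because the serialiser knows whether it is writing text or an attribute value, the caller never has to pick an escaping function.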

Related

Uses for the '&quot;' entity in HTML

I am revising some XHTML files authored by another party. As part of this effort, I am doing some bulk editing via Linq to XML.
I've just noticed that some of the original source XHTML files contain the &quot; HTML entity in text nodes within those files. For instance:
<p>Greeting: &quot;Hello, World!&quot;</p>
And that when recovering the XHTML text via XElement.ToString(), the &quot; entities are being replaced by plain double-quotes:
<p>Greeting: "Hello, World!"</p>
Question: Can anyone tell me what the motivation might have been for the original author to use the &quot; entities instead of plain double-quotes? Did those entities serve a purpose which I don't fully appreciate? Or, were they truly unnecessary as I suspect?
I do understand that &quot; would be necessary in certain contexts, such as when there is a need to place a double-quote within an HTML attribute. For instance:
<a href="/images/hello_world.jpg" alt="Greeting: &quot;Hello, World!&quot;">
Greeting</a>
It is impossible, and unnecessary, to know the motivation for using &quot; in element content, but possible motives include: misunderstanding of HTML rules; use of software that generates such code (probably because its author thought it was “safer”); and misunderstanding of the meaning of &quot;: many people seem to think it produces “smart quotes” (they apparently never looked at the actual results).
Anyway, there is never any need to use &quot; in element content in HTML (XHTML or any other HTML version). There is nothing in any HTML specification that would assign any special meaning to the plain character " there.
As the question says, it has its role in attribute values, but even in them, it is mostly simpler to just use single quotes as delimiters if the value contains a double quote, e.g. alt='Greeting: "Hello, World!"' or, if you are allowed to correct errors in natural language texts, to use proper quotation marks, e.g. alt="Greeting: “Hello, World!”"
Reason #1
There was a point where buggy/lazy implementations of HTML/XHTML renderers were more common than those that got it right. Many years ago, I regularly encountered rendering problems in mainstream browsers resulting from the use of unencoded quote chars in regular text content of HTML/XHTML documents. Though the HTML spec has never disallowed use of these chars in text content, it became fairly standard practice to encode them anyway, so that non-spec-compliant browsers and other processors would handle them more gracefully. As a result, many "old-timers" may still do this reflexively. It is not incorrect, though it is now probably unnecessary, unless you're targeting some very archaic platforms.
Reason #2
When HTML content is generated dynamically, for example, by populating an HTML template with simple string values from a database, it's necessary to encode each value before embedding it in the generated content. Some common server-side languages provided a single function for this purpose, which simply encoded all chars that might be invalid in some context within an HTML document. Notably, PHP's htmlspecialchars() function is one such example. Though there are optional arguments to htmlspecialchars() that will cause it to ignore quotes, those arguments were (and are) rarely used by authors of basic template-driven systems. The result is that all "special chars" are encoded everywhere they occur in the generated HTML, without regard for the context in which they occur. Again, this is not incorrect, it's simply unnecessary.
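Python's standard library has a comparable single function, html.escape, whose quote parameter plays a similar role to the optional arguments mentioned above. The sketch below uses Python rather than PHP purely for illustration:

```python
from html import escape

value = 'Greeting: "Hello, World!" & <welcome>'

# Text content: quotes can stay as they are.
print("<p>" + escape(value, quote=False) + "</p>")
# Attribute value: the default quote=True also escapes " and '.
print('<img alt="' + escape(value) + '">')
```

Calling escape(value) everywhere, as many template systems effectively do, produces the &quot; entities in element content that the question asks about.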
In my experience it may be the result of auto-generation by string-based tools, where the author did not understand the rules of HTML.
When some developers generate HTML without the use of special XML-oriented tools, they may try to be sure the resulting HTML is valid by taking the approach that everything must be escaped.
Referring to your example, the reason why every occurrence of " is represented by &quot; could be that, using that approach, you can safely use such "special" characters in both attributes and values.
Another motivation I've seen is where people believe, "We must explicitly show that our symbols are not part of the syntax." Whereas, valid HTML can be created by using the proper string-manipulation tools, see the previous paragraph again.
Here is some pseudo-code loosely based on C#, although it is preferred to use valid methods and tools:
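public class HtmlAndXmlWriter
{
    // Naive "escape everything" helper. & must be replaced first so that the
    // ampersands introduced by the later replacements are not escaped again.
    private string Escape(string badString)
    {
        return badString.Replace("&", "&amp;").Replace("\"", "&quot;").Replace("'", "&apos;").Replace(">", "&gt;").Replace("<", "&lt;");
    }

    // The same helper is used for the attribute value and for the text content.
    // "dynamic" is used here only so that obj.Type and obj.Value resolve at run time.
    public string GetHtmlFromOutObject(dynamic obj)
    {
        return "<div class='type_" + Escape(obj.Type) + "'>" + Escape(obj.Value) + "</div>";
    }
}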
It's really very common to see such approaches taken to generate HTML.
As other answers pointed out, it is most likely generated by some tool.
But if I were the original author of the file, my answer would be: Consistency.
If I am not allowed to put double quotes in my attributes, why put them in the element's content ? Why do these specs always have these exceptional cases ..
If I had to write the HTML spec, I would say All double quotes need to be encoded. Done.
Today it is like In attribute values we need to encode double quotes, except when the attribute value itself is defined by single quotes. In the content of elements, double quotes can be, but are not required to be, encoded. (And I am surely forgetting some cases here).
Double quotes are a keyword of the spec, encode them.
Lesser/greater than are a keyword of the spec, encode them. etc..
It is likely because they used a single function for escaping attributes and text nodes. &quot; doesn't do any harm so why complicate your code and make it more error-prone by having two escaping functions and having to pick between them?

HTML character codes in alt tag [duplicate]

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &amp;.)
Do I really need to do &amp;?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and everything would be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
Keep this point in mind: if you're not escaping & to &amp;, that is bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters, which is a huge problem for user-submitted data and could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Write some content which includes ampersand characters.
Encode them all.
Methodology 2
(with a grain of salt, please ;) )
Write some content which includes ampersand characters.
On a case-by-case basis, look at each ampersand. Determine if:
It is isolated, and as such unambiguously an ampersand, e.g. volt & amp. In that case, don't bother encoding it.
It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve, e.g. amp&volt. In that case, don't bother encoding it.
It is not isolated, and ambiguous, e.g. volt&amp. Encode it.
??
HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "&copy=2" is still a problem, for example, since &copy; is the copyright symbol.
However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
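The case-by-case behaviour can be observed with Python's html.unescape, which decodes references following the HTML5 rules for text content (attribute values have an extra exception in real parsers, so treat this as a sketch of the text-content case only):

```python
from html import unescape

# Isolated ampersand: nothing to decode.
print(unescape("volt & amp"))
# No reference is named "volt", so the text is left alone.
print(unescape("amp&volt"))
# "&amp" is recognised even without the semicolon.
print(unescape("volt&amp"))
# "&copy" is recognised too, which mangles the query string.
print(unescape("?a=1&copy=2"))
```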
I think this has turned into more of a question of "why follow the spec when browsers don't care." Here is my generalized answer:
Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.
Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.
Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?
If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.
Could you show us what your title actually is? When I submit
<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>
to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...
In HTML, a & marks the beginning of a reference, either of a character reference or of an entity reference. From that point on, the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.
But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain, & can be omitted:
<p title="&amp;">foo &amp; bar</p>
<p title="&amp">foo &amp bar</p>
<p title="&">foo & bar</p>
Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
But the specification recommends to always use a reference like the character reference &#38; or the entity reference &amp; to avoid confusion:
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Update (March 2020): The W3C validator no longer complains about escaping URLs.
I was checking why image URLs need escaping and hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed since URLs need &. Can anyone clarify?]
<img alt="" src="foo?bar=qut&qux=fop" />
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs". Entity references start with an ampersand (&) and end with a
semicolon (;). If you want to use a literal ampersand in your document
you must encode it as "&amp;" (even inside URLs!). Be careful to end
entity references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and &aelig;
are different characters. If this error appears in some markup
generated by PHP's session handling code, this article has
explanations and solutions to your problem.
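On the question of whether the escaped URL is unescaped again when consumed: an HTML parser decodes character references in attribute values before the value is used. A small Python sketch with the standard library's HTMLParser illustrates this (the class name SrcCollector is made up for the example):

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect the decoded src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(v for k, v in attrs if k == "src")


parser = SrcCollector()
parser.feed('<img alt="" src="foo?bar=qut&amp;qux=fop" />')
print(parser.sources)
```

The value the parser hands over contains a plain & again, which is what the URL itself needs.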
It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.
For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.
For your own static HTML content, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.
If the user passes it to you, or it will wind up in a URL, you need to escape it.
If it appears in static text on a page? All browsers will get this one right either way, and you don't worry much about it, since it will work.
Yes, you should try to serve valid code if possible.
Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.
Some examples where browsers are likely to react differently is if you put elements inside a table but outside the table cells, or if you nest links inside each other.
For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.
So, you should correct errors like this in the code, if not for anything else so to keep the error list in the validator short, so that you can spot more serious problems.
A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like
<div style="..." ... style="...">
When faced with a repeated style attribute, Internet Explorer combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to
<div style="...; ..." ...>
and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)
If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g., an alert('This & that'); or document.href, you don't need to escape it.
If you're using document.write then you should escape it, e.g. document.write('<p>this &amp; that</p>').
If you're really talking about the static text
<title>Foo & Bar</title>
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a &amp; or a &nbsp; or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are susceptible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
The link has a fairly good example of when and why you may need to escape & to &amp;
https://jsfiddle.net/vh2h7usk/1/
Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in & and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)

HTML - Importance of Double Quotes in HTML Tags

I have been working with HTML for some time, and recently a question popped into my mind: whenever I write tags such as
<input type = "text" name="c-id" />
<input type =text name=c-id />
with or without quotes there is no difference, when working with V-Studio, this IDE doesn't put quotes automatically, but while working with the Dreamweaver if i remember correctly it does put the quotes automatically...
What I want to know is at what point(s) the absence of quotes makes a difference or creates a problem, and what the best practice is: always quoted or quote-less...
p.s. there is very good chance this question has been asked before but didn't show up in the search
It makes a difference when the attribute value contains a space, linefeed, formfeed, or tab character, because the attribute value will end at any one of those characters if the value is not wrapped in quotes.
From an IDE point of view, it's little more than a matter of preference whether to add the quotes or not when the attribute value does not contain one of those whitespace characters.
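The effect of whitespace in an unquoted value can be seen by feeding both forms to a parser. Here is a sketch with Python's standard HTMLParser (the class name AttrDump is made up for the example):

```python
from html.parser import HTMLParser


class AttrDump(HTMLParser):
    """Record the attribute list of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(attrs)


dump = AttrDump()
dump.feed('<p class=note warning>')
dump.feed('<p class="note warning">')
print(dump.seen)
```

In the unquoted form the value stops at the space, and the second word is parsed as a separate, valueless attribute.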
In XHTML syntax, the quotation marks are always required (but you can alternatively use single quotes, i.e. 'apostrophes'). This does not matter if the page is served with text/html content type, as it almost always is: browsers will parse it as sloppy HTML, not as real XHTML. But if the page is served with an XML content type, or if it is opened in a program that expects XHTML or other XML, then lack of quotation marks causes Draconian error processing: only an error message is shown, not the content at all.
In the example case, XHTML syntax is used otherwise: the “/” before closing “>” belongs to XHTML, not HTML. It makes little point to write XHTML in that respect but not consistently.
In HTML syntax, the formal rules depend on HTML version. HTML5 is much more permissive than HTML 4.01. For example, <a href=/foo/bar/ title=What???> is valid HTML5 but not valid HTML 4.01. This mostly matters in validation and depends on which version of HTML you wish to validate against (i.e., which one you mostly try to comply with). In this issue, HTML5 reflects browser practices: browsers have long been more permissive than HTML 4.01.
By HTML5 rules, the quotes (though always allowed) are needed only when the attribute value contains any of the following characters: space, tab, line feed, form feed, carriage return, quotation mark ("), apostrophe ('), equals sign (=), less-than sign (<), greater-than sign (>), or grave (`), or is empty.
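A generator that wants to omit quotes where possible could encode that rule as a simple check. A Python sketch, with needs_quotes as a made-up helper name:

```python
import re

# Characters that force quoting, per the list above.
_NEEDS_QUOTES = re.compile(r"[ \t\n\f\r\"'=<>`]")


def needs_quotes(value: str) -> bool:
    """True if an attribute value cannot be written unquoted in HTML5."""
    return value == "" or _NEEDS_QUOTES.search(value) is not None


for v in ["text", "c-id", "/foo/bar/", "What???", "two words", "a=b", ""]:
    print(repr(v), needs_quotes(v))
```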
There are also opinions as well as coding style guides and other recommendations on the matter. Most people who think about such issues apparently favor the “safe” way of always putting quotes around attribute values. Some people think the quotes improve readability of code; others think they reduce it.
Not putting your values into quotes may cause some problems, ie you can't provide multiple css classes for your elements. Browser will only understand the first value, and the page won't pass validation.
You should always put your values into quotes. The markup is easier to read and you won't run into some unexpected browser behaviour.
For me, any practice that improves code/markup (re-)readability is worth making a habit, and having quotes on the attribute values visibly separates one property from another (readability again).
On the technical side, some attribute values have spaces and the whole tag might not work without the quotes on these. Those that come to mind: img's alt, multiple class values, title (for tooltips), an href with a url that has the equal sign (without quotes, the first equal in href= is doomed), etc.
And here is the almost same question in SO.
--
My thanks to Alohci for enlightening me. I can't up vote yet.

Necessary to encode characters in HTML links?

Should I be encoding characters contained within a url?
Example:
Some link using &
or
Some link using &amp;
Yes.
In HTML (including XHTML and HTML5, as far as I know), all attribute values and tag content should be encoded:
Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
There are two different kinds of encoding which are needed for different purposes in web programming, and it is easy to get confused.
Special characters in text which is to be displayed as HTML need to be encoded as HTML entities. This is particularly characters such as '<' which are part of HTML markup, but it may also be useful for other special characters if there is any doubt about the character encoding to be used.
Special characters in a URL need to be URL-encoded (replaced by %nn codes).
There is no harm in putting an HTML entity into a URL if it is going to be treated as HTML text by whatever receives it; but if it is part of an instruction to a program (such as the & used to separate arguments in a CGI query string) you should not encode it.
Depends how your files are being served up and identified.
For XHTML, yes and it's required.
For HTML, no and it's incorrect to do it.

Why are HTML character entities necessary?

Why are HTML character entities necessary? What good are they? I don't see the point.
Two main things.
They let you use characters that are not defined in a current charset. E.g., you can legally use ASCII as the charset, and still include arbitrary Unicode characters through entities.
They let you quote characters that HTML gives special meaning to, as Simon noted.
"1 &lt; 2" lets you put "1 < 2" in your page.
Long answer:
Since HTML uses '<' to open tags, you can't just type '<' if you want that as text. Therefore, you have to have a way to say "I want the text < in my page". Whoever designed HTML (or, actually SGML, HTML's predecessor) decided to use '&something;', so you can also put things like non-breaking space: '&nbsp;' (spaces that are not collapsed or allow a line break). Of course, now you need to have a way to say '&', so you get '&amp;'...
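Both uses (quoting markup characters, and reaching characters outside the document's charset) can be seen together in a short Python sketch. It assumes the page is to be served as plain ASCII:

```python
from html import escape

text = "Café < 3 €"

# First quote the characters HTML treats specially ...
escaped = escape(text, quote=False)
# ... then turn anything outside the target charset into numeric references.
ascii_bytes = escaped.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)
```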
They aren't, apart from &amp;, &lt;, &gt;, &quot; and probably &nbsp;. For all other characters, just use UTF-8.
In SGML and XML they aren't just for characters. They are generic inclusion mechanism, and their use for special characters is just one of many cases.
<!ENTITY signature "<hr/><p>Regards, <i>&myname;</i></p>">
<!ENTITY myname "John Doe">
This kind of entities is not useful for web sites, because they work only in XML mode, and you can't use external DTD file without enabling "validating" parsing mode in browser configuration.
Entities can be expanded recursively. This allows use of XML for a Denial of Service attack called the "Billion Laughs Attack".
Firefox uses entities internally (in XUL and such) for internationalization and brand-independent messages (to make life easier for Flock and IceWeasel):
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide - &brandShortName;">
In HTML you just need &lt;, &amp; and &quot; to avoid ambiguities between text and markup.
All other entities are basically obsoleted by Unicode encodings and remain only as a convenience (but a good text editor should have macros/snippets that can replace them).
In XHTML all entities except the basic few are problematic, because they won't work with stand-alone XML parsers (e.g. a named entity such as &nbsp; won't work).
To parse all XHTML entities you need validating XML parser (option's usually called "resolve externals") which is slower and needs DTD Catalog set up. If you ignore or screw up your DTD Catalog, you'll be participating in DDoS of W3C servers.
Character entities are used to represent characters which are reserved for writing HTML, e.g.
<, >, /, & etc. If you want to represent these characters in your content you should use character entities; this will help the parser to distinguish between the content and markup.
You use entities to help the parser distinguish when a character should be represented as HTML, and what you really want to show the user, as HTML will reserve a special set of characters for itself.
Typing this literally in HTML
I don't mean it like that </sarcasm>
will cause the "</sarcasm>" tag to disappear,
e.g.
I don't mean it like that
as HTML does not have a tag defined as such. In this case, using entities will allow the text to display properly.
e.g.
No, really! &lt;/sarcasm&gt;
gives
No, really! </sarcasm>
as desired.