escaping inside html tag attribute value - html

I am having trouble understanding how escaping works inside html tag attribute values that are javascript.
I was led to believe that you should always escape & ' " < > . So for JavaScript as an attribute value I tried:
It doesn't work. However:
and
does work in all browsers!
Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or are &apos; and ASCII 39 technically different characters, such that JavaScript requires ASCII 39 but not &apos;?

There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.
As far as HTML is concerned, the rules within an attribute value are the same as elsewhere plus one additional rule:
The less-than character < should be escaped. Usually &lt; is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
The ampersand & should be escaped. Usually &amp; is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
The character that is used as the delimiter around the attribute value must be escaped inside it. If you use the ASCII quotation mark " as the delimiter, it is customary to escape its occurrences using &quot;, whereas for the ASCII apostrophe the entity reference &apos; is defined in some HTML versions only, so it is safest to use the numeric reference &#39; (or &#x27;).
You can escape > (or any other data character) if you like, but it is never needed.
On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code, alert('Hello');, in each case. The browser has “unescaped” &apos; or &#39; to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the ASCII apostrophe in HTML (escaping is only needed within attribute values and only if you use the ASCII apostrophe as its delimiter), and when there is, you can use the &#39; reference.
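The order of the two unescaping passes can be sketched in Python. This is only an illustration: `html.unescape` from the standard library stands in for the browser's attribute decoding (it follows HTML5 rules), and the `attribute_sources` values are made-up examples, not the markup from the question.

```python
import html

# Hypothetical attribute values as they might appear in the HTML source.
attribute_sources = [
    "alert('Hello');",            # literal apostrophes
    "alert(&#39;Hello&#39;);",    # decimal character reference
    "alert(&#x27;Hello&#x27;);",  # hexadecimal character reference
    "alert(&apos;Hello&apos;);",  # named reference (HTML5 and XML, not HTML 4)
]

# The HTML layer decodes character references first; only the decoded
# string is handed to the JavaScript interpreter.
decoded = [html.unescape(source) for source in attribute_sources]

for source, script in zip(attribute_sources, decoded):
    print(f"{source!r} -> {script!r}")
```

Under HTML5 rules all four spellings decode to the same script text, so any difference in behaviour comes from a parser that does not recognise &apos;.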

&apos; is not a valid HTML 4 entity reference (it is defined in XML and in HTML5). To be safe, escape using &#39;

Related

What characters must be escaped in HTML 5?

HTML 4 states pretty clearly which characters should be escaped:
Four character entity references deserve special mention since they
are frequently used to escape special characters:
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing
to put the "<" character in text should use "&lt;" (ASCII decimal 60)
to avoid possible confusion with the beginning of a tag (start tag
open delimiter). Similarly, authors should use "&gt;" (ASCII decimal
62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close
delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&amp;" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.
Could someone point to the official source on this matter?
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
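The JSON advice can be sketched as a small helper. The name `json_for_script` is hypothetical, and the sketch swaps in \u003c for the \x3c mentioned above, on the assumption that the result should stay valid JSON as well as valid JavaScript (\x3c is a JavaScript-only escape).

```python
import json

def json_for_script(value):
    """Serialize value as JSON that is safer to embed inside a <script> element."""
    # ensure_ascii=False keeps non-ASCII text readable, which is exactly the
    # case where U+2028 and U+2029 would otherwise be emitted raw.
    text = json.dumps(value, ensure_ascii=False)
    return (
        text.replace("<", "\\u003c")         # cannot form "</script>" or "<!--"
            .replace("\u2028", "\\u2028")    # LINE SEPARATOR
            .replace("\u2029", "\\u2029")    # PARAGRAPH SEPARATOR
    )

payload = {"html": "</script><b>hi</b>", "note": "line\u2028break"}
embedded = json_for_script(payload)
print(embedded)
```

Because the replacements are applied after serialization and only introduce standard \uXXXX escapes, parsing the output as JSON gives back the original value.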
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
The & character should be replaced by &amp;
Non-breaking spaces should be escaped as &nbsp; (surprise!...)
Within attributes, " should be escaped as &quot;
Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;
I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
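The quoted steps translate almost line for line into code. The following is a minimal sketch of the algorithm as quoted in this answer, not of any browser's implementation; the function name `escape_html` and its `attribute_mode` parameter are made up for illustration.

```python
def escape_html(text: str, attribute_mode: bool = False) -> str:
    """Escape text following the serialization steps quoted above."""
    # The ampersand must be handled first, otherwise the ampersands
    # introduced by the later replacements would be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    else:
        text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text

print(escape_html("a < b & c"))
print(escape_html('say "hi" & wave', attribute_mode=True))
```

Note that attribute mode assumes the value will be wrapped in double quotes, since the quoted steps do not touch the apostrophe.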
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML5 is a language specification
it could be serialized either as HTML or as XML
Case 1 : HTML serialization
(the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."
So, in that case editable && copy (notice the spaces around &&) is valid HTML5 serialized as HTML construction as none of the ampersands is followed by a letter.
As a counter example: editable&&copy is not safe (even if this might work) as the last sequence &copy might be interpreted as the entity reference for ©
Case 2 : XML serialization
(the less common)
Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.
In that case && (with or without spaces) is invalid XML. You should write &amp;&amp;
Tricky, isn't it ?
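Both cases can be checked with Python's standard library, as a rough sketch. `html.unescape` applies HTML5 character-reference rules (including the legacy names that work without a semicolon), and `xml.etree.ElementTree` is used as a stand-in XML parser; the helper `is_well_formed_xml` is a made-up name.

```python
import html
import xml.etree.ElementTree as ET

# HTML serialization: ampersands surrounded by spaces are left alone,
# but "&copy" is recognised even without the closing semicolon.
spaced = html.unescape("editable && copy")
unspaced = html.unescape("editable&&copy")

def is_well_formed_xml(fragment: str) -> bool:
    """Return True if the fragment parses as XML element content."""
    try:
        ET.fromstring(f"<p>{fragment}</p>")
        return True
    except ET.ParseError:
        return False

print(repr(spaced), repr(unspaced))
print(is_well_formed_xml("a && b"), is_well_formed_xml("a &amp;&amp; b"))
```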

Why is `'` escaped in html libs?

With HTML I notice some libraries escape '. My question is why? The first time, I thought maybe they did it just because, but I have seen more than one do it, though not all. I can't remember off the top of my head what I looked at, but the other characters I remember being escaped were &, <, >, ".
I know & is used for escape sequences (such as &amp;, which produces &). < and > are escaped so they are not confused with start/end tags, and " is escaped so you can put " in tag attributes if you need to for some reason. But why '? Also, am I missing any other characters that should be escaped?
Because under HTML, the single-quote character ' can be used to delimit element attributes instead of the double-quote, like so:
<p class='something'></p>
However the character does not need to be escaped normally, but it's best to be safe.
In HTML, " and ' are interchangeable. Both can be used for setting the attribute for an element as well as used for denoting a string in JavaScript:
<img src="bob.png" />
<img src='bob.png' />
The single-quote mark is escaped because it can only be used as-is in certain contexts. When writing an escaping function, it is easier and faster to just always escape it, so you don't have to take into account the context.
For example, if you use double quotes " to denote an attribute value, you can use a single quote ' within it safely. However, if you use single quotes to denote the attribute value, you cannot.
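Python's own `html.escape` is one example of a library that takes the "always escape it" route: with `quote=True` (the default) it escapes both " and '. A small sketch, where `single_quoted_attr` is a hypothetical helper:

```python
import html

def single_quoted_attr(name: str, value: str) -> str:
    """Render name='value' with the value escaped for any quoting style."""
    return f"{name}='{html.escape(value, quote=True)}'"

# Without escaping, the apostrophe would end the attribute value early.
unsafe = "title='it's broken'"
safe = single_quoted_attr("title", "it's fine")
print(unsafe)
print(safe)
```

Escaping both quote characters means the same function works regardless of which delimiter the surrounding template uses.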

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually knows this stuff, clarify which characters are actually reserved in (X)HTML?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&amp;" and
"&lt;" respectively. The right angle bracket (>) may be represented
using the string "&gt;", and MUST, for compatibility, be escaped
using either "&gt;" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. &uuml; for umlaut u. You can as well state that the document is encoded in, for example, UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
When writing the element parts (colloquially, "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules about which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done in the background.
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
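As a sketch of how a library can apply this rule, Python's `xml.sax.saxutils.quoteattr` escapes & and < and then picks whichever quote character avoids extra escaping. The sample strings are made up for illustration.

```python
from xml.sax.saxutils import quoteattr

# Contains only a double quote: single quotes are chosen as the delimiter.
with_double = quoteattr('here you can use " safely')

# Contains only an apostrophe: double quotes are chosen as the delimiter.
with_single = quoteattr("it's fine")

# Contains both: double quotes delimit, and the inner " becomes &quot;.
with_both = quoteattr("both \" and '")

# Ampersands are always escaped, whatever the delimiter.
with_amp = quoteattr("a & b")

for rendered in (with_double, with_single, with_both, with_amp):
    print(f"<a href='#' data-note={rendered}></a>")
```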
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B would be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter and simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (&amp; is used when you actually want to write an ampersand, NOT &). Also, when you write a URL in your XML document, you should use &amp;, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&amp;b=2 - is good!
XHTML is based on XML, so what I have written applies to XHTML.
&copy; &deg; &pound; - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHTML you can also simply write ©, or use the entity &copy;, or the numeric reference &#x00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

Why shouldn't `&apos;` be used to escape single quotes?

As stated in, When did single quotes in HTML become so popular? and Jquery embedded quote in attribute, the Wikipedia entry on HTML says the following:
The single-quote character ('), when used to quote an attribute value, must also be escaped as &#39; or &#x27; (should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself.
Why shouldn't &apos; be used? Also, is &quot; safe to be used instead of &#34;?
&quot; is on the official list of valid HTML 4 entities, but &apos; is not.
From C.16. The Named Character Reference &apos;:
The named character reference &apos;
(the apostrophe, U+0027) was
introduced in XML 1.0 but does not
appear in HTML. Authors should
therefore use &#39; instead of
&apos; to work as expected in HTML 4
user agents.
&quot; is valid in both HTML5 and HTML4.
&apos; is valid in HTML5, but not HTML4. However, most browsers support &apos; for HTML4 anyway.
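The HTML 4 versus HTML5 difference happens to be visible in Python's standard library, which ships both entity tables: `html.entities.name2codepoint` holds the HTML 4.01 names and `html.entities.html5` the HTML5 names. A quick check, offered only as an illustration of the two lists:

```python
import html
from html.entities import html5, name2codepoint

names = ("quot", "apos")

# HTML 4.01 entity names are stored without the trailing semicolon.
in_html4 = {name: name in name2codepoint for name in names}

# HTML5 named references are stored with the trailing semicolon.
in_html5 = {name: f"{name};" in html5 for name in names}

# html.unescape follows HTML5 rules, so all three spellings decode alike.
decoded = [html.unescape(ref) for ref in ("&apos;", "&#39;", "&#x27;")]

print(in_html4, in_html5, decoded)
```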
&apos; is not part of the HTML 4 standard.
&quot; is, though, so is fine to use.
If you need to write semantically correct mark-up, even in HTML5, you must not use &apos; to escape single quotes. Although, I can imagine you actually meant apostrophe rather than single quote.
single quotes and apostrophes are not the same, semantically, although they might look the same.
Here's one apostrophe.
Use &#39; to insert it if you need HTML4 support. (edited)
In British English, single quotes are used like this:
"He told me to 'give it a try'", I said.
Quotes come in pairs. You can use:
<p><q>He told me to <q>give it a try</q></q>, I said.</p>
to have nested quotes in a semantically correct way, deferring the substitution of the actual characters to the rendering engine. This substitution can then be affected by CSS rules, like:
q {
quotes: '"' '"' '<' '>';
}
An old but seemingly still relevant article about semantically correct mark-up: The Trouble With EM ’n EN (and Other Shady Characters).
(edited) This used to be:
Use &#8217; to insert it if you need HTML4 support.
But, as #James_pic pointed out, that is not the straight single quote, but the "Single curved quote, right".
If you really need single quotes, apostrophes, you can use
html    | numeric | hex
&lsquo; | &#8216; | &#x2018; // for the left/beginning single-quote and
&rsquo; | &#8217; | &#x2019; // for the right/ending single-quote

HTML code for an apostrophe

Seemingly simple, but I cannot find anything relevant on the web.
What is the correct HTML code for an apostrophe? Is it &#8217;?
If you are looking for the straight apostrophe ' (U+0027), it is
&#39; or &apos; (the latter is HTML 5 only)
If you are looking for the curly apostrophe ’ (U+2019), then yes, it is
&#8217; or &rsquo;
As for which one to use, there are great answers in the Graphic Design community: What’s the right character for an apostrophe?.
A List Apart has a nice reference on characters and typography in HTML. According to that article, the correct HTML entity for the apostrophe is &#8217;. Example use: ’ .
It's &apos;.
As noted by msanders, this is actually XML and XHTML but not defined in HTML4, so I guess use &#39; in that case. I stand corrected.
A standard-compliant, easy-to-remember set of html quotes, starting with the right single-quote which is normally used as an apostrophe:
right single-quote — &rsquo; — ’
left single-quote — &lsquo; — ‘
right double-quote — &rdquo; — ”
left double-quote — &ldquo; — “
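The four names in this list can be tied to their code points with a short check. This sketch uses `html.unescape` from Python's standard library as the decoder; the `QUOTE_REFERENCES` table is just the list above restated, with the Unicode code points filled in.

```python
import html

# Named reference -> the character it stands for.
QUOTE_REFERENCES = {
    "&rsquo;": "\u2019",  # right single quotation mark (also the curly apostrophe)
    "&lsquo;": "\u2018",  # left single quotation mark
    "&rdquo;": "\u201d",  # right double quotation mark
    "&ldquo;": "\u201c",  # left double quotation mark
}

for named, char in QUOTE_REFERENCES.items():
    decimal = f"&#{ord(char)};"      # e.g. &#8217;
    hexadecimal = f"&#x{ord(char):X};"  # e.g. &#x2019;
    # All three spellings decode to the same character.
    assert html.unescape(named) == html.unescape(decimal) == html.unescape(hexadecimal) == char
    print(f"{named:8} {decimal:8} {hexadecimal:9} {char}")
```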
Depends on which apostrophe you are talking about: there’s &apos;, &lsquo;, &rsquo; and probably numerous other ones, depending on the context and the language you’re intending to write. And with a declared character encoding of e.g. UTF-8 you can also write them directly into your HTML: ', ‘, ’.
Firstly, it would appear that &apos; should be avoided -
The curse of &apos;
Secondly, if there is ever any chance that you're going to generate markup to be returned via AJAX calls, you should avoid the entity names (As not all of the HTML entities are valid in XML) and use the &#XXXX; syntax instead.
Failure to do so may result in the markup being considered as invalid XML.
The entity that is most likely to be affected by this is &nbsp;, which should be replaced by &#160;
Here is a great reference for HTML Ascii codes:
http://www.ascii.cl/htmlcodes.htm
The code you are looking for is: &#39;
Note that &apos; IS defined in HTML5, so for modern websites, I would advise using &apos; as it is much more readable than &#39;
Check: http://www.w3.org/TR/html5/syntax.html#named-character-references
Even though &apos; reads nicer than &#39; and it's a shame not to use it, as a fail-safe, use &#39;.
&apos; is a valid HTML 5 entity, however it is not a valid HTML 4 entity.
Unless <!DOCTYPE html> is at the top of your HTML document, use &#39;
Sorry if this offends anyone, but there is a reasonable article on Ted Clancy's blog that argues against the Unicode committee's recommendation to use ’ (RIGHT SINGLE QUOTATION MARK) and proposes using U+02BC (MODIFIER LETTER APOSTROPHE) (aka &#x02BC; or &#700;) instead.
In a nutshell, the article argues that:
A punctuation mark (such as a quotation mark) normally separates words and phrases, while the sides of a contraction really can't be separated and still make sense.
Using a modifier allows one to select a contraction with the regular expression \w+
It's easier to parse quotes embedded in text if there aren't quotation marks also appearing in contractions
&#39; in decimal.
%27 in hex (that is the URL percent-encoded form; the hexadecimal HTML reference is &#x27;).
Although the &apos; entity may be supported in HTML5, it looks like a typewriter apostrophe. It looks nothing like a real curly apostrophe—which looks identical to an ending quotation mark: ’.
Just look when I write them after each other:
1: right single quotation mark entity, 2: apostrophe entity: ’ &apos;.
I tried to find a proper entity or alt command specifically for a normal looking apostrophe (which again, looks ‘identical’ to a closing right single quotation mark), but I haven’t found one. I always need to insert a right single quotation mark in order to get the visually correct apostrophe.
If you use just ’ (ALT + 0146) or autoformat typewriter apostrophes and quotation marks as curly in a word processor like Word 2013, do use <meta charset="UTF-8">.
I've found FileFormat.info's Unicode Character Search to be most helpful in finding exact character codes.
Entering simply ' (the character to the left of the return key on my US Mac keyboard) into their search yields several results of various curls and languages.
I would presume the original question was asking for the typographically correct U+02BC ʼ, rather than the typewriter facsimile U+0027 '.
The W3C recommends hex codes for HTML entities (see below). For U+02BC that would be &#x02BC;, rather than &#x0027; for U+0027.
http://www.w3.org/International/questions/qa-escapes
Using character escapes in markup and CSS
Hex vs. decimal. Typically when the Unicode Standard refers to or lists characters it does so using a hexadecimal value. … Given the prevalence of this convention, it is often useful, though not required, to use hexadecimal numeric values in escapes rather than decimal values…
http://www.w3.org/TR/html4/charset.html
5 HTML Document Representation … 5.4 Undisplayable characters
…If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form, since this is the form used in character set standards.
Just a one more link with a nicely maintained collection Html Entities (archived), and its current (2023-01-22) status Named Character References.
As far as I know it is &#39; but it seems yours works as well
See http://w3schools.com/tags/ref_ascii.asp
Use &apos; for a straight apostrophe. This tends to be more readable than the numeric &#39; (if others are ever likely to read the HTML directly).
Edit: msanders points out that &apos; isn't valid HTML4, which I didn't know, so follow most other answers and use &#39;.
You can try &#x0027; as seen in http://unicodinator.com/#0027