How to encode quotes in HTML body? - html

Should I encode quotes (such as " and ' -> &rdquo; and &rsquo;) in my HTML body (e.g. convert <p>Matt's Stuff</p> to <p>Matt&rsquo;s Stuff</p>)? I was under the impression I should, but a co-worker said that it was no big deal. I'm dubious, but I can't find anything that says it is forbidden. Am I mistaken? Is it a best practice to encode? Or is it simply useless?

Encoding quotation marks (") is in practice only needed if they're inside an attribute; however, for the HTML code to be correct (passing HTML validation), you should always encode quotation marks as &quot;.
Apostrophes (') don't need escaping in HTML. In XHTML they should be encoded as &apos;.

If you want your markup to be parsable as XML, you'll want to encode the following:
& => &amp;
< => &lt;
> => &gt;
" => &quot;
' => &apos;
Definitely do this in attributes whether you're trying to make your code XML compliant or not.
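As a minimal sketch of the mapping above (& to &amp;, < to &lt;, > to &gt;, " to &quot;, ' to &apos;), the replacement can be done in a single pass. The names XML_ESCAPES and escapeXml are made up for this illustration and do not come from any library. For pre-HTML5 documents, where &apos; is not defined, swapping in the numeric reference &#39; would be the safer assumption.

```javascript
// Hypothetical helper: replaces the five characters from the list above
// with their named references. All five are handled in one pass, so an
// ampersand produced by an earlier replacement is never escaped again.
const XML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;',
};

function escapeXml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => XML_ESCAPES[ch]);
}

console.log(escapeXml(`<p title="Matt's">Q&A</p>`));
// &lt;p title=&quot;Matt&apos;s&quot;&gt;Q&amp;A&lt;/p&gt;
```

Note that the function is not idempotent: running it twice turns &amp; into &amp;amp;, so it should be applied exactly once, at output time.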

Typically this isn't necessary unless you're placing such values into a tag's attribute (or other places where having quote marks would throw off parsing). In regular body text, un-encoded quotes will work fine.
<img src="..." alt="A &quot;quote mark&quot; in an alt attribute" />

No, you only need to use character references for quotes (single or double) if you want to use them inside an attribute value declaration that uses the same quotes for the value declaration:
title="The sign says &quot;Matt's Stuff&quot;"
title='The sign says "Matt&#39;s Stuff"'
Both title values are The sign says "Matt's Stuff".
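The rule in this answer (only the quote character that matches the delimiter needs a reference) can be sketched as below. quoteAttributeValue is a hypothetical name, and escaping & as well is an extra precaution added here, not something this answer asks for.

```javascript
// Hypothetical helper: wraps a value in the chosen delimiter and replaces
// only the quote character that matches that delimiter.
function quoteAttributeValue(value, delimiter = '"') {
  if (delimiter !== '"' && delimiter !== "'") {
    throw new Error('delimiter must be a double or single quote');
  }
  const quoteRef = delimiter === '"' ? '&quot;' : '&#39;';
  const escaped = String(value)
    .replace(/&/g, '&amp;') // assumption: also protect ampersands
    .split(delimiter)
    .join(quoteRef);
  return delimiter + escaped + delimiter;
}

const text = 'The sign says "Matt\'s Stuff"';
console.log('title=' + quoteAttributeValue(text, '"'));
// title="The sign says &quot;Matt's Stuff&quot;"
console.log('title=' + quoteAttributeValue(text, "'"));
// title='The sign says "Matt&#39;s Stuff"'
```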

Related

Why is `'` escaped in html libs?

With HTML I notice some libraries escape '. My question is why? The first time I thought maybe they did it just because, but I've seen more than one do it, though not all. I can't remember what I looked at off the top of my head, but the other characters I remember were &, <, >, ".
I know & is used for escape characters (such as &amp;, which is &). < and > are escaped to not be confused for start/end tags and " is done so you can put " in tag attributes if you need to for some reason. But why '? Also am I missing any other characters that should be escaped?
Because under HTML, the single-quote character ' can be used to delimit element attributes instead of the double-quote, like so:
<p class='something'></p>
However the character does not need to be escaped normally, but it's best to be safe.
In HTML, " and ' are interchangeable. Both can be used for setting the attribute for an element as well as used for denoting a string in JavaScript:
<img src="bob.png" />
<img src='bob.png' />
The single-quote mark is escaped because it can only be used as-is in certain contexts. When writing an escaping function, it is easier and faster to just always escape it, so you don't have to take into account the context.
For example, if you use double quotes " to denote an attribute value, you can use a single quote ' within it safely. However, if you use single quotes to denote the attribute value, you cannot.

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually does know this stuff, clarify which characters are actually reserved in (X)HTML?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&amp;" and
"&lt;" respectively. The right angle bracket (>) may be represented
using the string "&gt;", and MUST, for compatibility, be escaped
using either "&gt;" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. &uuml; for umlaut u. You can as well state that the document is encoded in, for example, UTF-8 and write the byte sequence 0xC3 0xBC instead to get the same umlaut u.
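To check the byte sequence mentioned here, a small sketch using the standard TextEncoder API (available in browsers and in recent Node.js versions) prints the UTF-8 bytes of ü along with the equivalent numeric character reference. The helper names utf8Hex and numericReference are invented for this example.

```javascript
// Hypothetical helpers for inspecting a character.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

function numericReference(ch) {
  // Decimal character reference for the first code point of the string
  return '&#' + ch.codePointAt(0) + ';';
}

console.log(utf8Hex('\u00fc'));          // c3 bc
console.log(numericReference('\u00fc')); // &#252;
```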
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done in the background.
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B would be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (&amp; is used when you actually want to write an ampersand, NOT &). Also, when you write a URL in your XML document, you should use &amp;, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&amp;b=2 - is good!
XHTML is based on XML, so what I have written applies to XHTML.
&copy; &deg; &pound; - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHTML you can also simply write ©, or use the entity &copy;, or the numeric reference &#x00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html
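A rough sketch of a check for the forbidden control characters described above. The exact ranges are an interpretation made for this example: the ASCII C0 controls except TAB (0x09), LF (0x0A), and CR (0x0D), plus DEL (0x7F) and the ISO-8859-1 C1 controls (0x80 to 0x9F). The names are hypothetical.

```javascript
// Assumed ranges: C0 controls minus TAB/LF/CR, DEL, and the C1 block.
const FORBIDDEN_CONTROLS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/;

function hasForbiddenControl(text) {
  return FORBIDDEN_CONTROLS.test(String(text));
}

function stripForbiddenControls(text) {
  // Build a global copy of the same pattern so every match is removed
  return String(text).replace(new RegExp(FORBIDDEN_CONTROLS.source, 'g'), '');
}

console.log(hasForbiddenControl('tab\tand newline\n are fine')); // false
console.log(hasForbiddenControl('bell\u0007'));                  // true
```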

Quotation marks in HTML attribute values?

This may seem like a really basic question but...
How do you use double speech marks in HTML code (alt tags and the such)?
For example..
I'm trying to set a tag in my webpage to Opening Credits for "It's Liverpool" but it's limiting it to Opening Credits for.
You'll want to use the corresponding HTML entity in place of the quotes:
<span alt="Opening Credits for &quot;It's Liverpool&quot;">A span</span>
You can normally avoid the issue by using appropriate language-dependent quotation marks, instead of Ascii quotation marks, which should be confined to use as delimiters in computer code. Example:
alt="Opening Credits for “It’s Liverpool”"
or (in British English)
alt="Opening Credits for ‘It’s Liverpool’"
Should you really need to use Ascii quotation marks inside an attribute value, use Ascii apostrophes as delimiters:
alt='The statement foo = "bar" is an assignment.'
In the extremely rare case where an attribute value really needs to contain both an Ascii quotation mark and an Ascii apostrophe, you need to escape either of them (namely the one you decide to use as attribute value delimiter):
alt="The Ascii characters &quot; and ' should not be used in natural languages."
or
alt='The Ascii characters " and &#39; should not be used in natural languages.'
Note that these considerations are relevant only inside attribute values. In element content, both " and ' can be used freely:
<strong>The Ascii characters " and ' should not be used in natural languages.</strong>
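The delimiter choice described in this answer could be automated roughly as follows: prefer the quote character that does not occur in the value, and fall back to escaping only when both occur. The function name attr is hypothetical, and the ampersand escaping is an added precaution not discussed in this answer.

```javascript
// Hypothetical helper: picks the delimiter that avoids escaping if possible.
function attr(name, value) {
  const text = String(value).replace(/&/g, '&amp;'); // assumption: protect & too
  if (!text.includes('"')) return `${name}="${text}"`;
  if (!text.includes("'")) return `${name}='${text}'`;
  // Both quote characters occur: delimit with " and escape its occurrences
  return `${name}="${text.replace(/"/g, '&quot;')}"`;
}

console.log(attr('alt', 'The statement foo = "bar" is an assignment.'));
// alt='The statement foo = "bar" is an assignment.'
console.log(attr('alt', 'The Ascii characters " and \' should not be used.'));
// alt="The Ascii characters &quot; and ' should not be used."
```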

escaping inside html tag attribute value

I am having trouble understanding how escaping works inside html tag attribute values that are javascript.
I was led to believe that you should always escape & ' " < >. So for JavaScript as an attribute value I tried:
It doesn't work. However:
and
does work in all browsers!
Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or are apos and ASCII 39 technically different characters, such that JavaScript requires ASCII 39, but not apos?
There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.
As far as HTML is considered, the rules within an attribute value are the same as elsewhere plus one additional rule:
The less-than character < should be escaped. Usually &lt; is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
The ampersand & should be escaped. Usually &amp; is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
The character that is used as delimiters around the attribute value must be escaped inside it. If you use the Ascii quotation mark " as delimiter, it is customary to escape its occurrences using &quot; whereas for the Ascii apostrophe, the entity reference &apos; is defined in some HTML versions only, so it is safest to use the numeric reference &#39; (or &#x27;).
You can escape > (or any other data character) if you like, but it is never needed.
On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code alert('Hello');. The browser has “unescaped” &apos; or &#39; to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the Ascii apostrophe in HTML (escaping is only needed within attribute values and only if you use the Ascii apostrophe as its delimiter), and when there is, you can use the &#39; reference.
&apos; is not a valid HTML reference entity. You should escape using &#39;
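The two layers described above can be sketched from the generating side: first build a valid JavaScript string literal, then HTML-escape the whole script for use inside an attribute. Since the browser undoes the HTML layer before the script runs, the JavaScript engine sees the original code. Both function names are hypothetical, and using JSON.stringify to produce the string literal is a common shortcut assumed here, not something the answers prescribe.

```javascript
// Layer 2 helper: HTML escaping for a double- or single-quoted attribute value.
// The ampersand is replaced first so later references are not escaped again.
function escapeHtmlAttribute(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;'); // numeric reference instead of &apos;
}

function onclickAlert(message) {
  // Layer 1 (JavaScript): JSON.stringify yields a double-quoted string literal
  const script = 'alert(' + JSON.stringify(message) + ');';
  // Layer 2 (HTML): escape the finished script for the attribute value
  return 'onclick="' + escapeHtmlAttribute(script) + '"';
}

console.log(onclickAlert('Hello'));
// onclick="alert(&quot;Hello&quot;);"
console.log(onclickAlert("It's"));
// onclick="alert(&quot;It&#39;s&quot;);"
```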

What values can I put in an HTML attribute value?

Do I need to escape quotes inside of an html attribute value? What characters are allowed?
Is this valid?
<span title="This is a 'good' title.">Hi</span>
If your attribute value is quoted (starts and ends with double quotes "), then any characters except for double quotes and ampersands are allowed, which must be quoted as &quot; and &amp; respectively (or the equivalent numeric entity references, &#34; and &#38;)
You can also use single quotes around an attribute value. If you do this, you may use literal double quotes within the attribute: <span title='This is a "good" title.'>...</span>. In order to escape single quotes within such an attribute value, you must use the numeric entity reference &#39; since some browsers don't support the named entity, &apos; (which was not defined in HTML 4.01).
Furthermore, you can also create attributes with no quotes, but that restricts the set of characters you can have within it much further, disallowing the use of spaces, =, ', ", <, >, ` in the attribute.
See the HTML5 spec for more details.
That is valid. However, if you had to put double quotes inside, you would have to escape with &quot; like this:
<span title="This is a &quot;good&quot; title.">Hi</span>
The value can be anything, but you should escape quotes (&quot;, &apos;), tag delimiters (&lt;, &gt;) and ampersands (&amp;).
No, you do not need to escape single quotes inside of double quotes.
This page specifies valid attributes of a span tag:
http://www.w3.org/TR/html401/struct/global.html#edef-SPAN
This page specifies valid characters allowed in the title attribute:
http://www.w3.org/TR/html401/intro/sgmltut.html#attributes
Yes, that's fine. The problem would be when you try to put a double quote inside an attribute, like this:
<span title="This is a "bad" title.">Hi</span>
You can get around this by using HTML entities like so:
<span title="This is a &quot;good&quot; title">Hi</span>
Here is a validation function using a regular expression based on Brian Campbell's answer, for the worst case of an unquoted attribute.
validator: function (val) {
  // Empty values pass; otherwise reject quotes, =, <, >, backticks,
  // and an ampersand followed by whitespace.
  if (!val || val.search(/['"=<>`]+|(&\s)+/) === -1) return true;
  return 'Disallowed characters in HTML attributes: \' " = < > ` &.';
},