What values can I put in an HTML attribute value? - html

Do I need to escape quotes inside of an html attribute value? What characters are allowed?
Is this valid?
<span title="This is a 'good' title.">Hi</span>

If your attribute value is quoted (starts and ends with double quotes "), then any characters except for double quotes and ampersands are allowed; those two must be escaped as &quot; and &amp; respectively (or the equivalent numeric character references, &#34; and &#38;).
You can also use single quotes around an attribute value. If you do this, you may use literal double quotes within the attribute: <span title='This is a "good" title.'>...</span>. In order to escape single quotes within such an attribute value, you must use the numeric character reference &#39;, since some browsers don't support the named entity &apos; (which was not defined in HTML 4.01).
Furthermore, you can also create attributes with no quotes, but that restricts the set of characters you can have within it much further, disallowing the use of spaces, =, ', ", <, >, ` in the attribute.
See the HTML5 spec for more details.
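
The three quoting styles described above can be sketched as a small serializer. This is only an illustration: formatAttribute is a hypothetical helper, not part of any library, and it conservatively treats any ampersand as a reason to quote the value.

```javascript
// Hypothetical helper: serialize one attribute, choosing the lightest quoting that is safe.
function formatAttribute(name, value) {
  if (value === "") return name + '=""';
  // Unquoted syntax: no whitespace, quotes, =, <, > or backtick.
  // Ampersands are also excluded here (an assumption made to keep the sketch simple).
  if (!/[\s"'=<>`&]/.test(value)) {
    return name + "=" + value;
  }
  const escaped = value.replace(/&/g, "&amp;");
  // Single quotes are used only when the value has double quotes but no apostrophes.
  if (value.includes('"') && !value.includes("'")) {
    return name + "='" + escaped + "'";
  }
  // Otherwise use double quotes and escape any double quotes inside.
  return name + '="' + escaped.replace(/"/g, "&quot;") + '"';
}

console.log(formatAttribute("value", "yes"));
console.log(formatAttribute("title", "This is a 'good' title."));
console.log(formatAttribute("title", 'This is a "good" title.'));
```

Apostrophes never need escaping in this sketch because single-quote delimiters are chosen only when the value contains none.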

That is valid. However, if you had to put double quotes inside, you would have to escape them with &quot; like this:
<span title="This is a &quot;good&quot; title.">Hi</span>

The value can be anything, but you should escape quotes (&quot;, &apos;), tag delimiters (&lt;, &gt;) and ampersands (&amp;).

No, you do not need to escape single quotes inside of double quotes.
This page specifies valid attributes of a span tag:
http://www.w3.org/TR/html401/struct/global.html#edef-SPAN
This page specifies valid characters allowed in the title attribute:
http://www.w3.org/TR/html401/intro/sgmltut.html#attributes

Yes, that's fine. The problem arises when you try to put a double quote inside a double-quoted attribute, like this:
<span title="This is a "bad" title.">Hi</span>
You can get around this by using HTML entities like so:
<span title="This is a &quot;good&quot; title">Hi</span>

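Here is a validation function using a regular expression based on Brian Campbell's answer, for the worst case of an unquoted attribute.
validator: function (val) {
  // Reject anything that cannot appear literally in an unquoted attribute value:
  // whitespace, quotes, =, <, > and backtick. Ampersands are rejected as well,
  // so that no ambiguous ampersand can slip through.
  if (!val || val.search(/[\s'"=<>`&]/) === -1) return true;
  return 'Disallowed characters in HTML attributes: whitespace \' " = < > ` &.';
},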

Related

What characters must be escaped in HTML 5?

HTML 4 states pretty clearly which characters should be escaped:
Four character entity references deserve special mention since they
are frequently used to escape special characters:
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing
to put the "<" character in text should use "&lt;" (ASCII decimal 60)
to avoid possible confusion with the beginning of a tag (start tag
open delimiter). Similarly, authors should use "&gt;" (ASCII decimal
62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close
delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&amp;" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.
Could someone point to the official source on this matter?
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
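
The JSON advice above can be sketched as follows. jsonForScript is a hypothetical helper name, and this sketch writes the less-than sign as \u003c rather than \x3c, on the assumption that the output should remain valid JSON as well as valid JavaScript.

```javascript
// Hypothetical helper: serialize data for embedding inside a <script> element.
function jsonForScript(data) {
  return JSON.stringify(data)
    // A literal "<" could form "</script>" or "<!--" and end or confuse the script block.
    .replace(/</g, "\\u003c")
    // U+2028 and U+2029 were line terminators in older JavaScript string literals.
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const payload = { html: "</script><b>hi</b>", text: "line\u2028break" };
const safe = jsonForScript(payload);
console.log(safe);
// The escaped text still parses back to the original data.
console.log(JSON.parse(safe).html === payload.html);
```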
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
The & character should be replaced by &amp;
Non-breaking spaces should be escaped as &nbsp; (surprise!...)
Within attributes, " should be escaped as &quot;
Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;
I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
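
The escaping steps quoted above translate almost line for line into code. The following is a minimal sketch of those steps as quoted here, not the browser's actual implementation; escapeString and its attributeMode parameter are names chosen for illustration, and the current specification text may differ from the draft quoted in this answer.

```javascript
// Sketch of the quoted escaping steps; attributeMode selects the attribute rules.
function escapeString(text, attributeMode) {
  // Ampersands go first, so the references produced below are not escaped again.
  let out = text.replace(/&/g, "&amp;").replace(/\u00A0/g, "&nbsp;");
  if (attributeMode) {
    out = out.replace(/"/g, "&quot;");
  } else {
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  return out;
}

console.log(escapeString('a & "b" <c>', true));
console.log(escapeString('a & "b" <c>', false));
```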
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML5 is a language specification
it could be serialized either as HTML or as XML
Case 1 : HTML serialization
(the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."
So, in that case, editable && copy (notice the spaces around &&) is a valid construction in HTML5 serialized as HTML, as neither ampersand is followed by a letter.
As a counterexample, editable&&copy is not safe (even if it might work), as the final sequence &copy might be interpreted as the entity reference for ©.
Case 2 : XML serialization
(the less common)
Here the classic XML rules apply. For example, each and every ampersand, whether in text or in attributes, should be escaped as &amp;.
In that case && (with or without spaces) is invalid XML. You should write &amp;&amp;.
Tricky, isn't it ?
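
To make the two cases concrete, here is a rough check for ampersands in raw, not yet escaped text that could be read as the start of a character reference. It is a conservative sketch and not the parser's real algorithm: findRiskyAmpersands is a hypothetical name, and it flags every ampersand followed by a letter, digit or #, including ones that a real parser would leave alone.

```javascript
// Hypothetical check: return the positions of ampersands that could start a character reference.
function findRiskyAmpersands(text) {
  const risky = [];
  // The lookahead inspects the next character without consuming it.
  const pattern = /&(?=[A-Za-z0-9#])/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    risky.push(match.index);
  }
  return risky;
}

console.log(findRiskyAmpersands("editable && copy")); // []
console.log(findRiskyAmpersands("editable&&copy"));   // [ 9 ]
```

Under XML serialization the check is unnecessary, since every ampersand has to be escaped anyway.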

Why is `'` escaped in html libs?

With HTML I notice some libraries escape '. My question is why? At first I thought maybe they did it for no particular reason, but I have seen more than one library do it, though not all. I can't remember off the top of my head which ones I looked at, but the other characters I remember being escaped were &, <, >, ".
I know & is used for escape sequences (such as &amp;, which produces &). < and > are escaped so they are not confused with start/end tags, and " is escaped so you can put " in tag attributes if you need to for some reason. But why '? Also, am I missing any other characters that should be escaped?
Because under HTML, the single-quote character ' can be used to delimit element attributes instead of the double-quote, like so:
<p class='something'></p>
However the character does not need to be escaped normally, but it's best to be safe.
In HTML, " and ' are interchangeable. Both can be used for setting the attribute for an element as well as used for denoting a string in JavaScript:
<img src="bob.png" />
<img src='bob.png' />
The single-quote mark is escaped because it can only be used as-is in certain contexts. When writing an escaping function, it is easier and faster to just always escape it, so you don't have to take into account the context.
For example, if you use double quotes " to denote an attribute value, you can use a single quote ' within it safely. However, if you use single quotes to denote the attribute value, you cannot.
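
A short sketch of what goes wrong when a context-unaware escaper skips the apostrophe. Both function names are hypothetical, and userInput is an invented example value.

```javascript
// Hypothetical escaper that handles &, <, > and " but leaves the apostrophe alone.
function escapeWithoutApostrophe(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Same escaper, but the apostrophe is always escaped too, whatever the context.
function escapeWithApostrophe(s) {
  return escapeWithoutApostrophe(s).replace(/'/g, "&#39;");
}

const userInput = "x' onmouseover='alert(1)";
// In a single-quoted attribute, the unescaped apostrophe ends the value early
// and the rest of the input becomes a new attribute.
console.log("<p class='" + escapeWithoutApostrophe(userInput) + "'></p>");
// With the apostrophe escaped, the whole input stays inside the class value.
console.log("<p class='" + escapeWithApostrophe(userInput) + "'></p>");
```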

escaping inside html tag attribute value

I am having trouble understanding how escaping works inside html tag attribute values that are javascript.
I was led to believe that you should always escape & ' " < >. So for JavaScript as an attribute value I tried:
It doesn't work. However:
and
does work in all browsers!
Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or are &apos; and ASCII 39 technically different characters, such that JavaScript requires ASCII 39 but not &apos;?
There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.
As far as HTML is concerned, the rules within an attribute value are the same as elsewhere plus one additional rule:
The less-than character < should be escaped. Usually &lt; is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
The ampersand & should be escaped. Usually &amp; is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
The character that is used as the delimiter around the attribute value must be escaped inside it. If you use the ASCII quotation mark " as the delimiter, it is customary to escape its occurrences using &quot;, whereas for the ASCII apostrophe, the entity reference &apos; is defined in some HTML versions only, so it is safest to use the numeric reference &#39; (or &#x27;).
You can escape > (or any other data character) if you like, but it is never needed.
On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code alert('Hello');. The browser has “unescaped” &apos; or &#39; to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the ASCII apostrophe in HTML (escaping is only needed within attribute values and only if you use the ASCII apostrophe as its delimiter), and when there is, you can use the &#39; reference.
&apos; is not a valid entity reference in HTML 4 (it is defined in XML and in HTML5). You should escape using &#39;
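
The order of the two layers can be illustrated with a deliberately tiny decoder: the HTML layer resolves character references first, and only the result is handed to JavaScript. decodeReferences is a hypothetical helper that knows a handful of named references; it treats &apos; the way a current browser does, which an old HTML 4 browser might not.

```javascript
// Hypothetical, deliberately small table of named references.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

// Resolve numeric references and the few named ones above; leave anything else untouched.
function decodeReferences(attributeValue) {
  return attributeValue.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g, (whole, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const digits = body.slice(isHex ? 2 : 1);
      return String.fromCodePoint(parseInt(digits, isHex ? 16 : 10));
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : whole;
  });
}

// All three spellings reach the JavaScript layer as the same source text.
console.log(decodeReferences("alert(&#39;Hello&#39;);"));
console.log(decodeReferences("alert(&#x27;Hello&#x27;);"));
console.log(decodeReferences("alert(&apos;Hello&apos;);"));
```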

What to use " or ' when coding [duplicate]

Should I use ' or " when coding, for example width='100px' or width="100px". Is it only a matter of taste, or does it matter to the browsers?
The reason why I ask, I have always used "" for everything, so when I code with PHP, I have to escape like this:
echo "<table width=\"100px\">";
But I've found out that I will probably save 2 minutes per day if I do this:
echo "<table width='100px'>";
Fewer keystrokes. Of course I could also do this:
echo '<table width="100px">';
What should the HTML look like; 'option1' or "option2"?
Yes, it's a matter of taste. It makes no difference in HTML. Quoting the W3C on SGML and HTML:
By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa.
...
In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
Note, however, that in HTML 4.01 the width attribute is deprecated only on <hr>, <pre>, <td> and <th>, not on <table>. It is still supported in all major browsers.
HTML5 supports attributes specified in any of these four different ways:
Empty attribute syntax: <input disabled>
Unquoted attribute value syntax: <input value=yes>
Single-quoted attribute value syntax: <input type='checkbox'>
Double-quoted attribute value syntax: <input name="be evil">
For me it is better to use ' than ". PHP parses a string wrapped in " for variables and escape sequences, so using ' is faster because PHP treats it as a plain string with no further parsing.
I agree with Daniel - it's a matter of taste.
Personally, I use double quotes when assigning values to attributes (e.g. width="100%"). Also, if you deal with JSON, strictly speaking, the names and values should be surrounded with double quotes ("").
echo '<table width="100px">';
Other HTML attributes are probably written with double quotes already, so this keeps things consistent, and you don't need to escape any quotes this way.

How to encode quotes in HTML body?

Should I encode quotes (such as " and ' -> &rdquo; and &rsquo;) in my HTML body (e.g. convert <p>Matt's Stuff</p> to <p>Matt&rsquo;s Stuff</p>)? I was under the impression I should, but a co-worker said that it was no big deal. I'm dubious but I can't find anything that says it is forbidden. Am I mistaken? Is it a best practice to encode? Or is it simply useless?
Encoding quotation marks (") is in practice only needed if they're inside an attribute; however, for the HTML code to be correct (passing HTML validation), you should always encode quotation marks as &quot;.
Apostrophes (') don't need escaping in HTML. In XHTML they should be encoded as &apos;.
If you want your markup to be parsable as XML, you'll want to encode the following:
& => &amp;
< => &lt;
> => &gt;
" => &quot;
' => &apos;
Definitely do this in attributes whether you're trying to make your code XML compliant or not.
Typically this isn't necessary unless you're placing such values into a tag's attribute (or other places where having quote marks would throw off parsing). In regular body text, unencoded quote marks will work fine.
<img src="..." alt="A &quot;quote mark&quot; in an alt attribute" />
No, you only need to use character references for quotes (single or double) if you want to use them inside an attribute value that is delimited by the same kind of quote:
title="The sign says &quot;Matt's Stuff&quot;"
title='The sign says "Matt&#39;s Stuff"'
Both title values are The sign says "Matt's Stuff".