Difference in HTML Entity length in JavaScript - html

Why does the entity &nbsp; have length 6 while the entity &darr; has length 1? Is this in the spec somewhere? (Tested in Firefox, Chrome and Safari.)

I agree that this is very weird behavior, but at least it's specified.
The HTML fragment serialization algorithm states that:
Escaping a string (for the purposes of the algorithm above) consists of replacing any occurrences of the "&" character by the string "&amp;", any occurrences of the "<" character by the string "&lt;", any occurrences of the ">" character by the string "&gt;", any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;", and, if the algorithm was invoked in the attribute mode, any occurrences of the """ character by the string "&quot;".
Emphasis by me. If I had to guess, this is to support backwards compatibility with older browsers that did this, and to get consistent behavior when deserializing and serializing strings. If the browser serialized the DOM tree resulting from <div>&nbsp;</div> to <div> </div>, deserializing it to a DOM tree again would result in a single space*. This is pretty much the only way the browser can achieve consistent behavior.
The replacement of &darr; with ↓, on the other hand, is completely safe and makes sense.
If you're actually interested in the length of the string stored inside the text node, use .textContent and you'll get the result you were expecting.
* well, not really since it would still be a U+00A0 - but I could get why people think it might be confusing in the early DOM days

Consider the following HTML snippet:
<div>
<p>foo & bar 𝌆 baz</p>
</div>
Let’s look up innerHTML in the HTML Living Standard to see what happens when we run div.innerHTML in the context of the above HTML document. Ah, it defers to the DOM Parsing spec, which says:
On getting, if the context object’s node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on the context object; […]
The HTML fragment serialization algorithm is defined in the HTML Living Standard. Following the algorithm with the div.innerHTML example in mind, it’s clear that the first time it will descend to the “if current node is an Element” branch under step 3.2. This adds <p> to the output.
Then it calls the algorithm again on the text node within. This time we end up in the “if current node is a Text node” branch. It says:
[…] Otherwise, append the value of current node’s data IDL attribute, escaped as described below.
The data IDL attribute contains the textual contents of the element. The escaping instructions are defined as follows:
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the & character by the string &amp;.
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string &nbsp;.
If the algorithm was invoked in the attribute mode, replace any occurrences of the " character by the string &quot;.
If the algorithm was not invoked in the attribute mode, replace any occurrences of the < character by the string &lt;, and any occurrences of the > character by the string &gt;.
Only the abovementioned symbols are escaped as HTML entities in the result of .innerHTML – other Unicode symbols are just displayed in their raw form, regardless of how they are represented in the HTML source code.
Because of this, "&darr;" in the HTML source code turns into "↓" when reading it back out through innerHTML. But e.g. a raw "&" or "&amp;" both turn into "&amp;", and "&nbsp;" or a raw U+00A0 character both become "&nbsp;".
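The escaping steps quoted above can be sketched as a plain function. This is only an illustration of the text-escaping part, not the browser's implementation; the name escapeForSerialization and its attributeMode flag are made up for this example.

```javascript
// Sketch of the "escaping a string" steps from the fragment serialization
// algorithm. escapeForSerialization is a hypothetical name, not a DOM API.
function escapeForSerialization(text, attributeMode = false) {
  // Ampersands go first so the entities added below are not escaped again.
  let out = text.replace(/&/g, '&amp;');
  // U+00A0 NO-BREAK SPACE becomes the six-character string "&nbsp;".
  out = out.replace(/\u00A0/g, '&nbsp;');
  if (attributeMode) {
    // Attribute mode: only the double quote is escaped in addition.
    out = out.replace(/"/g, '&quot;');
  } else {
    // Text mode: angle brackets are escaped in addition.
    out = out.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  }
  return out;
}

console.log(escapeForSerialization('\u00A0').length); // 6
console.log(escapeForSerialization('\u2193').length); // 1 (the arrow is left as-is)
```

This mirrors why innerHTML.length differs for the two characters while textContent.length is 1 for both.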

Related

Why do some strings contain "&nbsp;" and some " ", when my input is the same (" ")?

My problem occurs when I try to use some data/strings in a p-element.
I start off with data like this:
data: function() {
  return {
    reportText: {
      text1: "This is some subject text",
      text2: "This is the conclusion",
    }
  }
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference: one p element uses &nbsp; and the other uses a regular space, even though I started off with both strings only using regular spaces. I know &nbsp; and a regular space technically represent the same thing, but the problem with the &nbsp; string is that it gets treated as a string with 1 large word instead of multiple separate words. This screws up my layout and I can't solve this by using certain css properties (word-wrap etc.)
Other things I have tried:
Tried sanitizing the strings by using .replace("&nbsp;", " "), but that doesn't do anything. I assume this is because it basically is the same, so there is nothing to really replace. Same reason why I have to use a code block on Stack Overflow to make the distinction between &nbsp; and a regular space.
Logged the data from vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see strings with regular spaces.
So I have the following questions:
Why does this happen? I can't seem to find any logical explanation why it sometimes uses &nbsp;'s and sometimes uses regular spaces. It seems random, but I am sure I am missing something.
Any other things I could try to follow the path my string takes, so I can see where the transformation from a regular space to &nbsp; happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/\u00A0/g, ' ')
Explanation:
JavaScript typically allows the use of unicode characters in two ways: you can input the rendered character directly, or you can use a unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like &nbsp;).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.
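To find out where a no-break space sneaks into a string, it can help to print the code points instead of the rendered characters. The following is a small sketch; codePoints and normalizeSpaces are hypothetical helper names, and the sample string is made up.

```javascript
// Hypothetical helper: list the code points of a string so that invisible
// characters such as U+00A0 become visible in a log.
function codePoints(str) {
  return Array.from(str, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Hypothetical helper: replace every no-break space with an ordinary space.
function normalizeSpaces(str) {
  return str.replace(/\u00A0/g, ' ');
}

const suspicious = 'This\u00A0is'; // looks like "This is" when rendered
console.log(codePoints(suspicious).join(' '));
// U+0054 U+0068 U+0069 U+0073 U+00A0 U+0069 U+0073
console.log(normalizeSpaces(suspicious) === 'This is'); // true
```

Logging code points at each step (raw data, after any transformation, before rendering) shows which step introduces U+00A0.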

What characters must be escaped in HTML 5?

HTML 4 states pretty clearly which characters should be escaped:
Four character entity references deserve special mention since they
are frequently used to escape special characters:
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing
to put the "<" character in text should use "&lt;" (ASCII decimal 60)
to avoid possible confusion with the beginning of a tag (start tag
open delimiter). Similarly, authors should use "&gt;" (ASCII decimal
62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close
delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&amp;" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.
Could someone point to the official source on this matter?
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
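A minimal sketch of the JSON-in-<script> replacements mentioned above. It assumes the output is embedded as a JavaScript expression inside an inline script; since \x3c is a JavaScript string escape and not valid JSON, the result is not meant to be fed to JSON.parse. The name jsonForInlineScript is made up for this example.

```javascript
// Hypothetical helper: serialize a value for embedding in an inline <script>.
function jsonForInlineScript(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\x3c')        // no "</script>" or "<!--" can appear
    .replace(/\u2028/g, '\\u2028') // LINE SEPARATOR
    .replace(/\u2029/g, '\\u2029'); // PARAGRAPH SEPARATOR
}

const payload = { note: '</script><script>alert(1)</script>' };
const literal = jsonForInlineScript(payload);
console.log(literal.includes('<')); // false
```

Evaluated as JavaScript, the literal yields the original value again, because \x3c is read back as "<" inside a string literal.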
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
The & character should be replaced by &amp;
Non-breaking spaces should be escaped as &nbsp; (surprise!...)
Within attributes, " should be escaped as &quot;
Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;
I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML5 is a language specification
it could be serialized either as HTML or as XML
Case 1 : HTML serialization
(the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."
So, in that case editable && copy (notice the spaces around &&) is valid HTML5 serialized as HTML construction as none of the ampersands is followed by a letter.
As a counter example: editable&&copy is not safe (even if this might work) as the last sequence &copy might be interpreted as the entity reference for ©
Case 2 : XML serialization
(the less common)
Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.
In that case && (with or without spaces) is invalid XML. You should write &amp;&amp;
Tricky, isn't it ?

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually does know this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&amp;" and
"&lt;" respectively. The right angle bracket (>) may be represented
using the string "&gt;", and MUST, for compatibility, be escaped
using either "&gt;" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. &uuml; for umlaut u. You can as well state that the document is encoded in, for example, UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
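The UTF-8 byte sequence for the umlaut u can be checked with a few lines of JavaScript. This is just a quick sketch using the standard TextEncoder, which always encodes to UTF-8; the variable names are arbitrary.

```javascript
// Encode U+00FC (u with umlaut) as UTF-8 and print the bytes in hex.
const bytes = new TextEncoder().encode('\u00FC');
const hex = Array.from(bytes, b => b.toString(16).padStart(2, '0')).join(' ');
console.log(hex); // c3 bc
```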
When writing the element parts (colloquially, "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules about which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done in the background.
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B would be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (&amp; is used when you actually want to write an ampersand, NOT &). Also, when you write a url in your XML document, you should use &amp;, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&amp;b=2 - is good!
XHTML is based on XML, so what I have written applies to XHTML.
© ° £ - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHTML you can also simply write ©, or use the entity &copy;, or the numeric entity &#x00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

Encoding rules for URL with the `javascript:` pseudo-protocol?

Is there any authoritative reference about the syntax and encoding of an URL for the pseudo-protocol javascript:? (I know it's not very well considered, but anyway it's useful for bookmarklets).
First, we know that standard URLs follow the syntax:
scheme://username:password@domain:port/path?query_string#anchor
but this format doesn't seem to apply here. Indeed, it seems, it would be more correct to speak of URI instead of URL : here is listed the "unofficial" format javascript:{body}.
Now, then, which are the valid characters for such a URI, (what are the escape/unescape rules) when embedding in a HTML?
Specifically, if I have the code of a javascript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non-alphanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it's clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values), we don't want to decode '+' to spaces, for example.
My findings, so far:
First, there are the rules for writing a valid HTML attribute value: but here the standard only requires (if the attribute value is enclosed in quotes) an arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a &gt; b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But also example (2) is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI ? I'm not sure, but I'd say it's not.
From RFC 2396: a URI is subject to some additional restrictions and, in particular, the escape/unescape via %xx sequences. And some characters are always prohibited:
among them spaces and {}# .
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don't have a 'query string', so the ? can be used as any non-special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the 'body' of a javascript: URI are
a-zA-Z0-9
_|. !~*'();?:@&=+$,/-
%hh : (escape sequence, with two hexadecimal digits)
with the additional restriction that it can't begin with /.
This still leaves out some "important" ASCII characters, for example
{}#[]<>^\
Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
But in other respects, it seems too restrictive. Braces and brackets, especially: I understand that they are normally used unescaped and browsers have no problems.
And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?
I still don't know if there are some standard functions to make this escape/unescape (in mainstream languages) or some sample code.
javascript: URLs are currently part of the HTML spec and are specified at https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-javascript:-url-special-case
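As a rough sketch of one conservative approach: percent-encode the whole script body with encodeURIComponent, on the assumption that the browser percent-decodes the body before running it (as the linked section describes). This escapes more than the strict minimum asked about in the question, but it keeps + intact (as %2B) and never turns it into a space. The helper names toJavascriptUrl and fromJavascriptUrl are made up for this example.

```javascript
// Hypothetical helper: wrap JavaScript source in a javascript: URL.
function toJavascriptUrl(source) {
  // encodeURIComponent leaves A-Z a-z 0-9 - _ . ! ~ * ' ( ) unescaped
  // and percent-encodes everything else, including spaces, braces and "+".
  return 'javascript:' + encodeURIComponent(source);
}

// Hypothetical helper: recover the source, mirroring the percent-decoding step.
function fromJavascriptUrl(url) {
  return decodeURIComponent(url.slice('javascript:'.length));
}

const code = "if (a > b && 1 + 1 < 3) { alert(b ? 'hi' : 'bye'); }";
const url = toJavascriptUrl(code);
console.log(url);
console.log(fromJavascriptUrl(url) === code); // true
```

Note that the apostrophe is left unescaped, so when the URL is placed in an HTML attribute it should go inside double quotes (or be HTML-escaped separately).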

Is it safe to display user input as input values without sanitization?

Say we have a form where the user types in various info. We validate the info, and find that something is wrong. A field is missing, invalid email, et cetera.
When displaying the form to the user again I of course don't want him to have to type in everything again so I want to populate the input fields. Is it safe to do this without sanitization? If not, what is the minimum sanitization that should be done first?
And to clarify: It would of course be sanitized before being for example added to a database or displayed elsewhere on the site.
No it isn't. The user might be directed to the form from a third party site, or simply enter data (innocently) that would break the HTML.
Convert any character with special meaning to its HTML entity.
i.e. & to &amp;, < to &lt;, > to &gt; and " to &quot; (assuming you delimit your attribute values using " and not ').
In Perl use HTML::Entities, in TT use the html filter, in PHP use htmlspecialchars. Otherwise look for something similar in the language you are using.
It is not safe, because, if someone can force the user to submit specific data to your form, you will output it and it will be "executed" by the browser. For instance, if the user is forced to submit '/><meta http-equiv="refresh" content="0;http://verybadsite.org" />, as a result an unwanted redirection will occur.
You cannot insert user-provided data into an HTML document without encoding it first. Your goal is to ensure that the structure of the document cannot be changed and that the data is always treated as data-values and never as HTML markup or Javascript code. Attacks against this mechanism are commonly known as "cross-site scripting", or simply "XSS".
If inserting into an HTML attribute value, then you must ensure that the string cannot cause the attribute value to end prematurely. You must also, of course, ensure that the tag itself cannot be ended. You can achieve this by HTML-encoding any chars that are not guaranteed to be safe.
If you write HTML so that the value of the tag's attribute appears inside a pair of double-quote or single-quote characters then you only need to ensure that you html-encode the quote character you chose to use. If you are not correctly quoting your attributes as described above, then you need to worry about many more characters including whitespace, symbols, punctuation and other ascii control chars. Although, to be honest, it's arguably safest to encode these non-alphanumeric chars anyway.
Remember that an HTML attribute value may appear in 3 different syntactical contexts:
Double-quoted attribute value
<input type="text" value="**insert-here**" />
You only need to encode the double quote character to a suitable HTML-safe value such as &quot;
Single-quoted attribute value
<input type='text' value='**insert-here**' />
You only need to encode the single quote character to a suitable HTML-safe value such as &#39;
Unquoted attribute value
<input type='text' value=**insert-here** />
You shouldn't ever have an html tag attribute value without quotes, but sometimes this is out of your control. In this case, we really need to worry about whitespace, punctuation and other control characters, as these will break us out of the attribute value.
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and | (and more). [para lifted from OWASP]
Please remember that the above rules only apply to control injection when inserting into an HTML attribute value. Within other areas of the page, other rules apply.
Please see the XSS prevention cheat sheet at OWASP for more information
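For the quoted-attribute cases, the encoding step might look like the sketch below. It escapes both quote styles plus &, < and >, which is more than the strict minimum but works for either quoting convention. escapeAttribute and renderTextInput are hypothetical names, and the hostile value is only an illustration; a real application would normally rely on its framework's or template engine's built-in encoder.

```javascript
// Hypothetical helper: encode a value for use inside a quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')   // first, so later entities are not re-escaped
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: re-populate a text input with a previously submitted value.
function renderTextInput(name, value) {
  return '<input type="text" name="' + escapeAttribute(name) +
    '" value="' + escapeAttribute(value) + '" />';
}

const hostile = '"/><meta http-equiv="refresh" content="0;http://verybadsite.org" />';
console.log(renderTextInput('email', hostile));
```

With the quotes and angle brackets encoded, the submitted text stays inside the value attribute instead of closing the tag and injecting new markup.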
Yes, it's safe, provided of course that you encode the value properly.
A value that is placed inside an attribute in an HTML document needs to be HTML encoded. The server side platform that you are using should have methods for this. In ASP.NET for example there is a Server.HtmlEncode method, and the TextBox control will automatically HTML encode the value that you put in the Text property.