Encoding rules for URL with the `javascript:` pseudo-protocol? - html

Is there any authoritative reference about the syntax and encoding of an URL for the pseudo-protocol javascript:? (I know it's not very well considered, but anyway it's useful for bookmarklets).
First, we know that standard URLs follow the syntax:
scheme://username:password@domain:port/path?query_string#anchor
but this format doesn't seem to apply here. Indeed, it seems, it would be more correct to speak of URI instead of URL : here is listed the "unofficial" format javascript:{body}.
Now, then, which are the valid characters for such a URI, (what are the escape/unescape rules) when embedding in a HTML?
Specifically, if I have the code of a javascript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non-alphanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it's clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values), we don't want to decode '+' to spaces, for example.

My findings, so far:
First, there are the rules for writing a valid HTML attribute value: but here the standard only requires (if the attribute value is enclosed in quotes) an arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a &gt; b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But also example (2) is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI ? I'm not sure, but I'd say it's not.
From RFC 2396: a URI is subject to some additional restrictions and, in particular, to escaping/unescaping via %xx sequences. And some characters are always prohibited:
among them spaces and {}# .
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don't have a 'query string', so the ? can be used as any non-special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the 'body' of a javascript: URI are
a-zA-Z0-9
_|. !~*'();?:@&=+$,/-
%hh : (escape sequence, with two hexadecimal digits)
with the additional restriction that it can't begin with /.
This still leaves out some "important" ASCII characters, for example
{}#[]<>^\
Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
But in other respects, it seems too restrictive. Braces and brackets, specially: I understand that they are normally used unescaped and browsers have no problems.
And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?
I still don't know if there are some standard functions to make this escape/unescape (in mainstream languages) or some sample code.
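As a starting point for that last question, here is a minimal Python sketch (not an authoritative implementation). It percent-encodes only the characters outside the RFC 2396 unreserved and reserved sets discussed above, then HTML-escapes the result for use in a quoted attribute. The names `SAFE`, `to_javascript_uri` and `to_href_attribute` are hypothetical, and browsers may well tolerate less escaping than this.

```python
import html
from urllib.parse import quote, unquote, unquote_plus

# Characters kept literal, taken from the RFC 2396 reserved and "mark" sets.
# quote() already leaves ASCII letters, digits and "_.-~" untouched.
SAFE = "!*'();?:@&=+$,/"


def to_javascript_uri(code: str) -> str:
    """Percent-encode everything outside SAFE (spaces, braces, '%', '#', ...)."""
    return "javascript:" + quote(code, safe=SAFE)


def to_href_attribute(code: str) -> str:
    """Second layer: make the URI safe inside a quoted HTML attribute value."""
    return html.escape(to_javascript_uri(code), quote=True)


uri = to_javascript_uri("if (a > b) { x = 1 + 2 }")
print(uri)
# Decode with unquote(), not unquote_plus(), so '+' is not turned into a space.
print(unquote(uri[len("javascript:"):]))
print(to_href_attribute("alert('Hi!')"))
```

The two layers are deliberately separate: the percent-encoding belongs to the URI, while the entity escaping belongs to the HTML attribute that carries it.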

javascript: URLs are currently part of the HTML spec and are specified at https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-javascript:-url-special-case

Related

Are there some valid HTML entities without the semicolon?

Looking at this official entities.json file, some of the entities are defined without an ending semicolon.
For example:
"&Acirc": { "codepoints": [194], "characters": "\u00C2" },
"&Acirc;": { "codepoints": [194], "characters": "\u00C2" },
Where is that documented in HTML5? Or is that a browser thing¹?
¹ thing as in extension for backward compatibility.
The HTML named character list is defined at https://html.spec.whatwg.org/multipage/named-characters.html and yes, some of these don't have a trailing semicolon, e.g. &not
Named HTML entities without a semicolon are not valid, per the HTML spec, but browsers are required to support some of them anyway. (This spec pattern - where something is officially illegal for you to do as a HTML author, but still has a single unambiguously specified behaviour that browsers must implement - is used a lot in the HTML spec.)
There are a few pertinent sections in the spec:
§13.1.4 Character references
Pertinent quote:
Named character references
The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
§13.2 Parsing HTML Documents, especially 13.2.5.73 Named character reference state (if you really want to pick through the horrible hard-to-read implementation details of the parsing algorithm).
The non-normative §1.11.2 Syntax errors, which contains some explanation on why the spec makes references without semicolons errors (though I don't personally find it hugely compelling):
Errors involving fragile syntax constructs
There are syntax constructs that, for historical reasons, are relatively fragile. To help reduce the number of users who accidentally run into such problems, they are made non-conforming.
Example
For example, the parsing of certain named character references in attributes happens even with the closing semicolon being omitted. It is safe to include an ampersand followed by letters that do not form a named character reference, but if the letters are changed to a string that does form a named character reference, they will be interpreted as that character instead.
In this fragment, the attribute's value is "?bill&ted":
<a href="?bill&ted">Bill and Ted</a>
In the following fragment, however, the attribute's value is actually "?art©", not the intended "?art&copy", because even without the final semicolon, "&copy" is handled the same as "&copy;" and thus gets interpreted as "©":
<a href="?art&copy">Art and Copy</a>
To avoid this problem, all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors.
Thus, the correct way to express the above cases is as follows:
<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->
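The same behaviour can be reproduced outside a browser. The following is a small Python sketch using the standard library's `html.unescape`, which follows the HTML5 named character reference table. It is only an approximation of a browser: it does not model the extra attribute-specific rules of the real tokenizer, but it agrees with the spec's examples for these three strings.

```python
import html

# "&ted" is not a named character reference, so it is left alone.
print(html.unescape("?bill&ted"))
# "&copy" is one of the legacy names recognised even without ';'.
print(html.unescape("?art&copy"))
# Escaping the ampersand keeps the text "&copy" literal.
print(html.unescape("?art&amp;copy"))
```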
As a final bit of corroboration that entities like &Acirc are invalid but work anyway, we can use this test document:
<!DOCTYPE html>
<html lang="en">
<title>Test page</title>
<div>&Acirc</div>
</html>
Open it in Chrome, and it works and shows us an A with a circumflex accent:
But paste it into the Nu Html Checker (endorsed by WhatWG), and we get an error stating "Named character reference was not terminated by a semicolon.":
i.e. it works, but it's invalid.
I made a program in python to get some numbers, and I found out that:
Of the 2231 entities in total, 106 (4.75%) are defined without a semicolon at the end.
All those entities:
&AElig, &AMP, &Aacute, &Acirc, &Agrave, &Aring, &Atilde, &Auml, &COPY, &Ccedil, &ETH, &Eacute, &Ecirc, &Egrave, &Euml, &GT, &Iacute, &Icirc, &Igrave, &Iuml, &LT, &Ntilde, &Oacute, &Ocirc, &Ograve, &Oslash, &Otilde, &Ouml, &QUOT, &REG, &THORN, &Uacute, &Ucirc, &Ugrave, &Uuml, &Yacute, &aacute, &acirc, &acute, &aelig, &agrave, &amp, &aring, &atilde, &auml, &brvbar, &ccedil, &cedil, &cent, &copy, &curren, &deg, &divide, &eacute, &ecirc, &egrave, &eth, &euml, &frac12, &frac14, &frac34, &gt, &iacute, &icirc, &iexcl, &igrave, &iquest, &iuml, &laquo, &lt, &macr, &micro, &middot, &nbsp, &not, &ntilde, &oacute, &ocirc, &ograve, &ordf, &ordm, &oslash, &otilde, &ouml, &para, &plusmn, &pound, &quot, &raquo, &reg, &sect, &shy, &sup1, &sup2, &sup3, &szlig, &thorn, &times, &uacute, &ucirc, &ugrave, &uml, &uuml, &yacute, &yen, &yuml
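The original program is not shown, so the following is only a guess at how such a count could be made. It is a sketch that uses Python's built-in `html.entities.html5` table rather than the entities.json file, on the assumption that both hold the same set of names. The variable names are made up for illustration.

```python
import html
from html.entities import html5

# Keys are stored without the leading '&'. Names that require a
# semicolon include it in the key, e.g. "hellip;".
no_semicolon = sorted(name for name in html5 if not name.endswith(";"))

share = 100 * len(no_semicolon) / len(html5)
print(f"{len(no_semicolon)} of {len(html5)} names ({share:.2f}%) lack a semicolon")
print(", ".join("&" + name for name in no_semicolon))

# The semicolon-less form is still decoded, matching the browser test above.
print(html.unescape("&Acirc"))
```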

XSS without HTML tags

Is it possible to do an XSS attack if my input does not allow < and > characters?
Example: I enter <script>alert('this');</script> text
But if I delete < and >, the script is no longer markup, just text:
I enter script alert('this'); script text
Yes, it could still be possible.
e.g. Say your site injects user input into the following location
<img src="http://example.com/img.jpg" alt="USER-INPUT" />
If USER-INPUT is " ONLOAD="alert('xss'), this will render
<img src="http://example.com/img.jpg" alt="" ONLOAD="alert('xss')" />
No angle brackets necessary.
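To make this concrete, here is a short Python sketch that rebuilds the example above. The helpers `strip_angle_brackets` and `render_img` are hypothetical stand-ins for whatever the site does, and `html.escape` is just one possible encoder.

```python
import html


def strip_angle_brackets(value: str) -> str:
    """The filter from the question: remove '<' and '>' only."""
    return value.replace("<", "").replace(">", "")


def render_img(alt: str) -> str:
    """Hypothetical template that drops the value into a quoted attribute."""
    return f'<img src="http://example.com/img.jpg" alt="{alt}" />'


payload = '" ONLOAD="alert(\'xss\')'

# Stripping angle brackets changes nothing: the payload has none.
unsafe = render_img(strip_angle_brackets(payload))
# Encoding the quote characters keeps the payload inside the alt value.
safe = render_img(html.escape(payload, quote=True))

print(unsafe)
print(safe)
```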
Also, check out OWASP XSS Experimental Minimal Encoding Rules.
For HTML body:
HTML Entity encode < &
specify charset in metatag to avoid UTF7 XSS
For XHTML body:
HTML Entity encode < & >
limit input to charset http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
So within the body you can get away with only encoding (or removing) a subset of the characters usually recommended to prevent XSS. However, you cannot do this within attributes - the full XSS (Cross Site Scripting) Prevention Cheat Sheet recommends the following, and they do not have a minimal alternative:
Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH; format, including spaces. (HH = Hex Value)
This is mainly to cover the three ways of specifying the attribute value:
Unquoted
Single quoted
Double quoted
Encoding in such a way will prevent XSS in attribute values in all three cases.
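A minimal sketch of that rule in Python might look like the following. The function name `encode_for_attribute` is hypothetical, and treating only ASCII letters and digits as "alphanumeric" is an assumption on my part, not something the cheat sheet spells out here.

```python
import html


def encode_for_attribute(value: str) -> str:
    """Replace every character except ASCII letters and digits with &#xHH;."""
    return "".join(
        ch if ch.isascii() and ch.isalnum() else f"&#x{ord(ch):X};"
        for ch in value
    )


encoded = encode_for_attribute('" ONLOAD="alert(\'xss\')')
print(encoded)
# An HTML parser turns the references back into the original text,
# but only after the attribute boundaries have been decided.
print(html.unescape(encoded))
```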
Also be wary that UTF-7 attacks do not need angle bracket characters. However, unless the charset is explicitly set to UTF-7, this type of attack isn't possible in modern browsers.
+ADw-script+AD4-alert(document.location)+ADw-/script+AD4-
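To see why that string is dangerous, it helps to decode it. This small Python sketch uses the standard `utf-7` codec to show what a UTF-7-aware consumer would read:

```python
payload = b"+ADw-script+AD4-alert(document.location)+ADw-/script+AD4-"

# No '<' or '>' bytes are present in the raw input...
print(b"<" in payload, b">" in payload)
# ...but they appear once the bytes are interpreted as UTF-7.
print(payload.decode("utf-7"))
```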
Also beware of attributes that allow URLs like href and ensure any user input is a valid web URL. Using a reputable library to validate the URL is highly recommended using an allow-list approach (e.g. if protocol not HTTPS then reject). Attempting to block sequences like javascript: is not sufficient.
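As an illustration of the allow-list idea, here is a minimal Python sketch based on `urllib.parse.urlsplit`. The names `ALLOWED_SCHEMES` and `is_allowed_url` are hypothetical, and a production system would probably want a dedicated validation library, as recommended above.

```python
from urllib.parse import urlsplit

# Allow-list: anything not named here is rejected.
ALLOWED_SCHEMES = {"https"}


def is_allowed_url(url: str) -> bool:
    """Accept only absolute URLs whose scheme is on the allow-list."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False
    # urlsplit() lowercases the scheme, so "JaVaScRiPt:" cannot slip through.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


for candidate in ("https://example.com/page", "javascript:alert(1)", "JaVaScRiPt:alert(1)"):
    print(candidate, is_allowed_url(candidate))
```

Checking what is permitted, rather than searching for `javascript:`, means unknown or oddly spelled schemes are rejected by default.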
If the user-supplied input is printed inside an HTML attribute, you also need to escape quotation marks, or you would be vulnerable to inputs like this:
" onload="javascript-code" foobar="
You should also escape the ampersand character as it generally needs to be encoded inside HTML documents and might otherwise destroy your layout.
So you should take care of the following characters: < > & ' "
You should however not completely strip them but replace them with the correct HTML codes, i.e. &lt; &gt; &amp; &quot; &#39;

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually knows this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&amp;" and
"&lt;" respectively. The right angle bracket (>) may be represented
using the string "&gt;", and MUST, for compatibility, be escaped
using either "&gt;" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. &uuml; for umlaut u. You can as well state that the document is encoded in, for example, UTF-8 and write the byte sequence 0xC3 0xBC instead to get the same umlaut u.
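The equivalence between the entity, the numeric references and the UTF-8 bytes can be checked with a few lines of Python (a quick sketch, using only the standard library):

```python
import html

u_umlaut = "\u00fc"

# The UTF-8 encoding of U+00FC is the two bytes C3 BC.
print(u_umlaut.encode("utf-8").hex())
# The named entity and both numeric forms decode to the same character.
print(html.unescape("&uuml;") == html.unescape("&#252;") == html.unescape("&#xFC;") == u_umlaut)
```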
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done in the background.
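For instance, with Python's standard `xml.etree.ElementTree` (chosen here only as an example of such a library), you hand over raw strings and the serializer does the escaping. The URL and text below are made-up sample values.

```python
import xml.etree.ElementTree as ET

link = ET.Element("a", {"href": "http://www.aaa.com?a=1&b=2"})
link.text = "A < B & C"

# The serializer escapes '&' and '<' in both the attribute and the text.
markup = ET.tostring(link, encoding="unicode")
print(markup)

# Parsing the result gives back the original, unescaped values.
parsed = ET.fromstring(markup)
print(parsed.get("href"), "|", parsed.text)
```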
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B would be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (&amp; is used when you actually want to write an ampersand, NOT &). Also, when you write a URL in your XML document, you should use &amp;, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&amp;b=2 - is good!
XHTML is based on XML, so what I have written applies to XHTML.
&copy; &deg; &pound; - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHTML you can also simply write ©, or use the entity &copy;, or the numeric character reference &#x00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

escaping inside html tag attribute value

I am having trouble understanding how escaping works inside html tag attribute values that are javascript.
I was led to believe that you should always escape & ' " < > . So for javascript as an attribute value I tried:
It doesn't work. However:
and
does work in all browsers!
Now I am totally confused. If all my attribute values are enclosed in double quotes, does this mean I do not have to escape single quotes? Or are apos and ASCII 39 technically different characters, such that javascript requires ASCII 39, but not apos?
There are two types of “escapes” involved here, HTML and JavaScript. When interpreting an HTML document, the HTML escapes are parsed first.
As far as HTML is considered, the rules within an attribute value are the same as elsewhere plus one additional rule:
The less-than character < should be escaped. Usually &lt; is used for this. Technically, depending on HTML version, escaping is not always required, but it has always been good practice.
The ampersand & should be escaped. Usually &amp; is used for this. This, too, is not always obligatory, but it is simpler to do it always than to learn and remember when it is required.
The character that is used as delimiters around the attribute value must be escaped inside it. If you use the Ascii quotation mark " as delimiter, it is customary to escape its occurrences using &quot; whereas for the Ascii apostrophe, the entity reference &apos; is defined in some HTML versions only, so it is safest to use the numeric reference &#39; (or &#x27;).
You can escape > (or any other data character) if you like, but it is never needed.
On the JavaScript side, there are some escape mechanisms (with \) in string literals. But these are a different issue, and not relevant in your case.
In your example, on a browser that conforms to current specifications, the JavaScript interpreter sees exactly the same code alert('Hello');. The browser has “unescaped” &apos; or &#39; to '. I was somewhat surprised to hear that &apos; is not universally supported these days, but it’s not an issue: there is seldom any need to escape the Ascii apostrophe in HTML (escaping is only needed within attribute values and only if you use the Ascii apostrophe as its delimiter), and when there is, you can use the &#39; reference.
&apos; is not a valid HTML reference entity. You should escape using &#39;
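The "HTML escapes are parsed first" step can be imitated with Python's `html.unescape`, which models an HTML5-conforming parser (so it does not reproduce older browsers that ignore &apos;). In this sketch, each string stands for an attribute value as written in the page source, and the set holds what the JavaScript engine would eventually receive:

```python
import html

attribute_values = [
    "alert('Hello');",
    "alert(&#39;Hello&#39;);",
    "alert(&#x27;Hello&#x27;);",
    "alert(&apos;Hello&apos;);",
]

# Layer 1 (HTML): character references are decoded first.
# Layer 2 (JavaScript): only the decoded text is ever seen by the engine.
scripts = {html.unescape(value) for value in attribute_values}
print(scripts)
```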

Is it safe to display user input as input values without sanitization?

Say we have a form where the user types in various info. We validate the info, and find that something is wrong. A field is missing, invalid email, et cetera.
When displaying the form to the user again I of course don't want him to have to type in everything again so I want to populate the input fields. Is it safe to do this without sanitization? If not, what is the minimum sanitization that should be done first?
And to clarify: It would of course be sanitized before being for example added to a database or displayed elsewhere on the site.
No it isn't. The user might be directed to the form from a third party site, or simply enter data (innocently) that would break the HTML.
Convert any character with special meaning to its HTML entity.
i.e. & to &amp;, < to &lt;, > to &gt; and " to &quot; (assuming you delimit your attribute values using " and not ').
In Perl use HTML::Entities, in TT use the html filter, in PHP use htmlspecialchars. Otherwise look for something similar in the language you are using.
It is not safe, because, if someone can force the user to submit specific data to your form, you will output it and it will be "executed" by the browser. For instance, if the user is forced to submit '/><meta http-equiv="refresh" content="0;http://verybadsite.org" />, as a result an unwanted redirection will occur.
You cannot insert user-provided data into an HTML document without encoding it first. Your goal is to ensure that the structure of the document cannot be changed and that the data is always treated as data-values and never as HTML markup or Javascript code. Attacks against this mechanism are commonly known as "cross-site scripting", or simply "XSS".
If inserting into an HTML attribute value, then you must ensure that the string cannot cause the attribute value to end prematurely. You must also, of course, ensure that the tag itself cannot be ended. You can achieve this by HTML-encoding any chars that are not guaranteed to be safe.
If you write HTML so that the value of the tag's attribute appears inside a pair of double-quote or single-quote characters, then you only need to ensure that you HTML-encode the quote character you chose to use. If you are not correctly quoting your attributes as described above, then you need to worry about many more characters, including whitespace, symbols, punctuation and other ASCII control chars. Although, to be honest, it's arguably safest to encode these non-alphanumeric chars anyway.
Remember that an HTML attribute value may appear in 3 different syntactical contexts:
Double-quoted attribute value
<input type="text" value="**insert-here**" />
You only need to encode the double quote character to a suitable HTML-safe value such as &quot;
Single-quoted attribute value
<input type='text' value='**insert-here**' />
You only need to encode the single quote character to a suitable HTML-safe value such as &#39;
Unquoted attribute value
<input type='text' value=**insert-here** />
You shouldn't ever have an html tag attribute value without quotes, but sometimes this is out of your control. In this case, we really need to worry about whitespace, punctuation and other control characters, as these will break us out of the attribute value.
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and | (and more). [para lifted from OWASP]
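The difference between the quoted and unquoted contexts can be demonstrated with Python's standard `html.parser`. This is a sketch only: `AttrCollector`, `parse_attrs` and `hex_encode` are hypothetical helper names, and Python's parser is a stand-in for a browser, not a faithful copy of one.

```python
import html
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attributes of the last start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs = dict(attrs)


def parse_attrs(markup: str) -> dict:
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


def hex_encode(value: str) -> str:
    """Encode everything except ASCII letters and digits as &#xHH;."""
    return "".join(
        c if c.isascii() and c.isalnum() else f"&#x{ord(c):X};" for c in value
    )


payload = "x onfocus=alert(1)"
escaped = html.escape(payload)  # contains no < > & or quotes, so it is unchanged

quoted = parse_attrs(f'<input type="text" value="{escaped}">')
unquoted = parse_attrs(f"<input type=text value={escaped}>")
unquoted_hex = parse_attrs(f"<input type=text value={hex_encode(payload)}>")

print(quoted)        # the payload stays inside value
print(unquoted)      # the space ends the value and an onfocus attribute appears
print(unquoted_hex)  # full hex encoding keeps the payload inside value
```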
Please remember that the above rules only apply to control injection when inserting into an HTML attribute value. Within other areas of the page, other rules apply.
Please see the XSS prevention cheat sheet at OWASP for more information
Yes, it's safe, provided of course that you encode the value properly.
A value that is placed inside an attribute in an HTML needs to be HTML encoded. The server side platform that you are using should have methods for this. In ASP.NET for example there is a Server.HtmlEncode method, and the TextBox control will automatically HTML encode the value that you put in the Text property.