Why are double-quotes urlencoded as %22? - html

As far as I know, URL encoding exists because URLs only support ASCII encoding. But since " is already in the ASCII table, why should it be encoded as %22 in URL encoding?

The " character falls under section 2.2 (URL Character Encoding Issues) of RFC 1738 (Uniform Resource Locators), under the "Unsafe" section. The reason for the inclusion is:
The quote mark (""") is used to delimit URLs in some systems.
One case of this that I can think of is an HTML attribute. For example, if you have an <a> tag with an href attribute, you will likely enclose the URL between double quotes. If a " character inside the URL is not encoded, it ends the attribute value early and the tag becomes invalid:
...
The RFC also proceeds to say:
All unsafe characters must always be encoded within a URL.
Some examples of other unsafe characters:
The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text.
The character "%" is unsafe because it is used for encodings of other characters.
The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.
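As a rough illustration of the RFC text quoted above, the following Python sketch percent-encodes the unsafe characters mentioned there using the standard library's urllib.parse.quote. The example.com URL and the query value are made up for the example; they are not from the RFC or the question.

```python
from urllib.parse import quote

# Characters that RFC 1738 calls out as unsafe in the passages quoted above.
unsafe = ['"', "<", ">", "%", "#"]

# safe="" asks quote() to encode everything except unreserved characters.
encoded = {ch: quote(ch, safe="") for ch in unsafe}
for ch, enc in encoded.items():
    print(f"{ch!r} -> {enc}")

# Hypothetical URL whose query value contains double quotes.
href = "http://example.com/?q=" + quote('say "hi"', safe="")

# Because the quotes are encoded, they cannot terminate the href attribute.
print(f'<a href="{href}">link</a>')
```

With the quote encoded as %22, the attribute value stays intact when the URL is placed between double quotes.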

URLs only support ASCII encoding
That's not quite true. URLs don't allow spaces, for example, and characters such as /, & and ? can't be used as plain data even though they are valid ASCII characters, because they have special meaning in URLs.
Characters that are always valid unencoded in URLs are:
A-Z
a-z
0-9
-
_
.
~
Other characters must be encoded when they are used as data. Some, such as spaces and tabs, are not supported because they have special meaning in protocols that usually carry URLs, such as HTTP. Others, such as ? and &, are not supported because they have special meaning in URL syntax.

Related

What kind of encoding is this html encoding?

I am doing a project that involves searching words in the Arabic script on Wiktionary, and when I do a GET request on certain word pages, I get something like this for example:
title="\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9">\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9</a></li>\n<li><a href="/wiki/%D8%B1%D8%A3%D8%B3%D9%8A"
This corresponds to the following URL: https://en.wiktionary.org/wiki/%D8%B1%D8%A3%D8%B3%D9%8A.
Does anyone know what the \xd8 or %D8 encodings are called? I want to say they are hex codes, but I have already looked up hex codes for the Arabic script and they certainly are not these.
The percent signs you see in the URL are used to substitute characters that aren't allowed in URLs, such as special characters like "/", ":" and "&", and non-ASCII characters. This is called percent-encoding - https://en.m.wikipedia.org/wiki/Percent-encoding
The "\xd8"-style sequences are hexadecimal escapes for individual bytes. Arabic characters fall outside ASCII, so UTF-8 encodes each of them as two bytes, and those bytes are what you are seeing (for example, \xd8\xb1 is the letter ر, code point U+0631). That is also why they don't match the code points you looked up. This assumes the HTML you showed uses UTF-8 encoding.
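To make the relationship between the two notations concrete, here is a small Python sketch. It assumes the page is UTF-8 encoded and uses only the path fragment from the question.

```python
from urllib.parse import quote, unquote

# Percent-encoded path fragment taken from the question.
path = "%D8%B1%D8%A3%D8%B3%D9%8A"

# unquote() turns %XX back into bytes and decodes them as UTF-8 by default.
word = unquote(path)

# Encoding the word again yields the bytes that show up as \xd8\xb1... in the response.
raw = word.encode("utf-8")

print(word)
print(raw)
print([f"U+{ord(ch):04X}" for ch in word])  # the code points behind those bytes
print(quote(word) == path)                   # percent-encoding round-trips
```

The %D8 and \xd8 forms are two spellings of the same UTF-8 bytes; the Unicode code points are a separate layer.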

Proper Way to Escape the | Character Using HTML Entities

To escape the ampersand character in HTML, I use the &amp; HTML entity, for example:
Link
If I have the following code in my HTML, how would I escape the | character?
Link
HTML Tidy is complaining, claiming an illegal character was found in my HTML.
I tried using ¦ and several other HTML entities, but Tidy says "malformed URI reference."
You wouldn't.
The problem (as the message says) is that the character is illegal in URLs. It is perfectly fine in HTML.
You need to apply encoding for URLs which would be %7C.
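A minimal Python sketch of that suggestion, assuming a hypothetical query value and an example.com URL (the original link was not preserved):

```python
from urllib.parse import quote

# Hypothetical query value containing a pipe character.
value = "a|b"

# Percent-encode the value before placing it in the URL.
encoded_value = quote(value, safe="")
href = "http://example.com/page?filter=" + encoded_value

print(href)
```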
I don't know why tidy is complaining about it, but this character is not problematic in HTML nor in URL. | is not a reserved character and can be used in URL as is. You can percent-encode every character, but there is really no need for it.
What I presume Tidy might be complaining about is the =. You have two of them, the second being invalid.
There is no need to encode this character in HTML entities. It has no special meaning in HTML.

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial where a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so:
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that the single byte A0 is not a valid character on its own in UTF-8. The equivalent nbsp character (U+00A0) is represented in UTF-8 by the two bytes C2 A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
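The difference can be reproduced with Python's urllib.parse.unquote, which takes the character encoding as a parameter. This is only a sketch of the decoding step; it says nothing about how the PHP application or MySQL treat the byte.

```python
from urllib.parse import unquote

query = "1'%A0||%A0'1"

# Default: bytes are decoded as UTF-8, and the lone byte A0 is invalid,
# so it is replaced with U+FFFD (the black-box question mark).
as_utf8 = unquote(query)

# Decoding the same bytes as ISO-8859-1 gives a non-breaking space (U+00A0).
as_latin1 = unquote(query, encoding="latin-1")

# The UTF-8 form of nbsp is the two-byte sequence C2 A0.
fixed = unquote("1'%C2%A0||%C2%A0'1")

print(repr(as_utf8))
print(repr(as_latin1))
print(repr(fixed))
```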
Independently from why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a &nbsp;
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
outputs:

url encoding, Form encoding and mailto: encoding

I am a little bit confused about the whole set of encoding issues related to HTML. I am not referring to the charset in the headers or the encoding in the XML prologue. That I get. Let me explain.
When "mailto:" is used in an anchor or with a submit button in a form, white space is encoded as "%20" and "line feed/carriage return/new line/end of line" is encoded as "%0A". But when the enctype attribute is used on a form with a value of "application/x-www-form-urlencoded", white space is encoded as "+", and special characters such as apostrophes, percent signs and other symbols are converted to their ASCII hex equivalents. Is the value "application/x-www-form-urlencoded" a URL encoding? If so, why "%20" for the first one and "+" for the second?
"mailto:someone@someplace.com?cc=carbon@copy.com&bcc=blind@carbobcopy.org&subject=This%20is%20the%20subject&body=This%20is%20the%20body%0AThis%20is%20the%20second%20paragraph"
In the above example white space in the subject is encoded as %20 and new line in the body is encoded as %0A.
<form enctype="application/x-www-form-urlencoded"></form>
And in the above white space will be encoded to "+". Am I missing something?
Thanks in advance.
URIs (like your mailto example) should be encoded according to RFC 3986, which specifies that spaces are to be encoded as %20.
The format of FORM data, on the other hand, is encoded as application/x-www-form-urlencoded according to the rules defined by the HTML specification. (See, for example, section 17.13.3.3 of the HTML 4.01 specification.) This specifies that spaces are to be translated as + signs.
Thus, while percent encoding is similar between URIs and form data, the space character is treated differently.
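For comparison, Python's standard library exposes both conventions: urllib.parse.quote follows the URI style (%20), while urllib.parse.urlencode produces application/x-www-form-urlencoded output (+). The address and field values below are placeholders, not taken from a real form.

```python
from urllib.parse import quote, urlencode

subject = "This is the subject"
body = "First paragraph\nSecond paragraph"

# URI-style encoding, as used in a mailto: link: space -> %20, newline -> %0A.
mailto = (
    "mailto:someone@example.com"
    "?subject=" + quote(subject, safe="")
    + "&body=" + quote(body, safe="")
)

# Form-style encoding (application/x-www-form-urlencoded): space -> +.
form_body = urlencode({"subject": subject, "body": body})

print(mailto)
print(form_body)
```

Apart from the space character, the two outputs use the same percent escapes.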

Do I encode ampersands in <a href...>?

I'm writing code that automatically generates HTML, and I want it to encode things properly.
Say I'm generating a link to the following URL:
http://www.google.com/search?rls=en&q=stack+overflow
I'm assuming that all attribute values should be HTML-encoded. (Please correct me if I'm wrong.) So that means if I'm putting the above URL into an anchor tag, I should encode the ampersand as &amp;, like this:
<a href="http://www.google.com/search?rls=en&amp;q=stack+overflow">
Is that correct?
Yes, it is. HTML entities are parsed inside HTML attributes, and a stray & would create an ambiguity. That's why you should always write &amp; instead of just & inside all HTML attributes.
That said, only & and quotes need to be encoded. If you have special characters like é in your attribute, you don't need to encode those to satisfy the HTML parser.
It used to be the case that URLs needed special treatment for non-ASCII characters like é. Under RFC 1738 you had to encode those using percent-escapes, which in this case gives %C3%A9. However, RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WhatWG based its work to define how browsers should behave, since HTML5, when they see a URL with non-ASCII characters in it. It's therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.
By current official HTML recommendations, the ampersand must be escaped, e.g. as &amp;, in contexts like this. However, browsers do not require it, and the HTML5 CR proposes to make this a rule, so that special rules apply in attribute values. Current HTML5 validators are outdated in this respect (see bug report with comments).
It will remain possible to escape ampersands in attribute values, but apart from validation with current tools, there is no practical need to escape them in href values (and there is a small risk of making mistakes if you start escaping them).
You have two standards concerning URLs in links (<a href).
The first standard is RFC 1866 (HTML 2.0) where in "3.2.1. Data Characters" you can read the characters which need to be escaped when used as the value for an HTML attribute. (Attribute names themselves do not allow special characters at all, e.g. <a hr&ef="http://... is not allowed, nor is <a hr&amp;ef="http://....)
Later this has gone into the HTML 4 standard, the characters you need to escape are:
< to &lt;
> to &gt;
& to &amp;
" to &quot;
' to &apos;
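When generating HTML from code, this escaping is usually delegated to a library function. A minimal sketch in Python, using the standard library's html.escape (note that it writes the apostrophe as the numeric reference &#x27; rather than &apos;):

```python
import html

url = "http://www.google.com/search?rls=en&q=stack+overflow"

# quote=True also escapes " and ', which matters inside attribute values.
attr = html.escape(url, quote=True)
tag = f'<a href="{attr}">search</a>'
print(tag)

# All five characters from the list above.
print(html.escape("<>&\"'", quote=True))

# A browser reverses this step when parsing the attribute.
print(html.unescape(attr) == url)
```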
The other standard is RFC 3986 "Generic URI standard", where URLs are handled (this happens when the browser is about to follow a link because the user clicked on the HTML element).
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It is important to escape those characters so the client knows whether they represent data or a delimiter.
Example unescaped:
https://example.com/?user=test&password&te&st&goto=https://google.com
Example, a fully legitimate URL
https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com
Example fully legitimate URL in the value of an HTML attribute:
https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com
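The two layers can be applied one after the other in code. The following Python sketch rebuilds the example URL: it first percent-encodes the data parts (URI layer), then HTML-escapes the finished URL for use in an attribute (HTML layer). Joining the query parts by hand is just a convenience here, because the example contains parameters without values.

```python
import html
from urllib.parse import quote

# URI layer: encode each piece of data so its delimiters are not misread.
parts = [
    "user=test",
    "password",
    quote("te&st", safe=""),                        # '&' inside data -> %26
    "goto=" + quote("https://google.com", safe=""),
]
url = "https://example.com/?" + "&".join(parts)

# HTML layer: escape the finished URL before writing it into an attribute.
attr = html.escape(url, quote=True)

print(url)
print(f'<a href="{attr}">go</a>')
```

The order matters: percent-encoding protects the URL's own delimiters, and HTML escaping is applied last because it only concerns the document the URL is embedded in.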
Also important scenarios:
JavaScript code as a value:
<img src="..." onclick="window.location.href = &quot;https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com&quot;;">...</a> (Yes, ;; is correct.)
JSON as a value:
...
Escaped things inside escaped things, double encoding, URL inside URL inside parameter, etc,...
http://x.com/?passwordUrl=http%3A%2F%2Fy.com%2F%3Fuser%3Dtest&password=""123
I am posting a new answer because I find zneak's answer does not have enough examples, does not show HTML and URI handling as different aspects and standards and has some minor things missing.
Yes, you should convert & to &amp;.
This HTML validator tool by W3C is helpful for questions like this. It will tell you the errors and warnings for a particular page.