URLencoding in HTTP request for space - html

Why is the space character URL encoded to %20?
I don't see a reason why space is considered to be a reserved character.

Because space is used as a separator in many contexts (program arguments, HTTP commands, etc.), it often has to be escaped: with a \ on a Unix command line, with surrounding " on a Windows command line, with %20 in URLs, and so on.
In the HTTP protocol, when you try to reach http://www.foo.com, your browser opens a connection to the server www.foo.com on port 80 and sends the commands:
GET http://www.foo.com HTTP/1.0
Accept: text/html
The syntax is "METHOD URL HTTPVERSION"
If you tried to request http://www.foo.com/my page.html instead of http://www.foo.com/my%20page.html, the server would think "page.html" is the HTTPVersion you're looking for...

See RFC 3986 Section 2.3:
2.3. Unreserved Characters
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

Because the Request-Line of an HTTP request is defined as:
Method (Space) Request-URI (Space) HTTP-Version CRLF
Naive HTTP servers that strictly adhere to the spec will do something like this:
splitInput = requestLine.Split(' ')
method = splitInput[0]
requestUri = splitInput[1]
httpVersion = splitInput[2]
That will break if you allow spaces in a URL.
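To make the failure concrete, here is a minimal Python sketch of such a naive parser. The function name parse_request_line is made up for illustration, and real servers are usually more forgiving than this.

```python
def parse_request_line(request_line):
    """Naively split an HTTP request line on single spaces."""
    parts = request_line.rstrip("\r\n").split(" ")
    if len(parts) != 3:
        raise ValueError("expected 3 tokens, got %d: %r" % (len(parts), parts))
    method, request_uri, http_version = parts
    return method, request_uri, http_version


# Encoded space: the line splits into exactly three tokens.
print(parse_request_line("GET /my%20page.html HTTP/1.0\r\n"))

# Raw space: the line splits into four tokens, so the parser rejects it.
try:
    parse_request_line("GET /my page.html HTTP/1.0\r\n")
except ValueError as exc:
    print("rejected:", exc)
```

A parser that simply took the first three tokens instead of checking the count would read "page.html" as the HTTP version, which is the confusion described above.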

Related

URL safe UUIDs in the smallest number of characters

Ideally I would want something like example.com/resources/äFg4вNгё5, minimum number of visible characters, never mind that they have to be percent encoded before transmitting them over HTTP.
Can you tell a scheme which encodes 128b UUIDs into the least number of visible characters efficiently, without the results having characters which break URLs?
Base-64 is good for this.
{098ef7bc-a96c-43a9-927a-912fc7471ba2}
could be encoded as
vPeOCWypqUOSepEvx0cbog
The usual equal-signs at the end could be dropped, as they always make the string-length a multiple of 4. And instead of + and /, you could use some safe characters. You can pick two from: - . _ ~
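As a rough Python sketch of the idea (short_id and its dotnet_byte_order flag are names made up here): the sample string above appears to correspond to the byte order in which the first three UUID fields are little-endian, as .NET's Guid.ToByteArray produces, which Python exposes as bytes_le. The URL-safe alphabet from RFC 4648 swaps + and / for - and _.

```python
import base64
import uuid


def short_id(u, dotnet_byte_order=False):
    """Encode a UUID as 22 URL-safe Base64 characters (padding dropped)."""
    # bytes_le stores the first three fields little-endian; bytes is big-endian.
    raw = u.bytes_le if dotnet_byte_order else u.bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


sample = uuid.UUID("{098ef7bc-a96c-43a9-927a-912fc7471ba2}")
print(short_id(sample, dotnet_byte_order=True))  # vPeOCWypqUOSepEvx0cbog
print(len(short_id(sample)))                     # 22
```

Whichever byte order you pick, encoder and decoder have to agree on it, otherwise the decoded UUID will differ from the original.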
More information:
RFC 4648
Storing UUID as base64 String (Java)
guid to base64, for URL (C#)
Short GUID (C#)
I use a url-safe base64 string. The following is some Python code that does this*.
The last line removes the '=' or '==' padding that Base64 encoding puts on the end; it makes putting the characters into a URL more difficult and is only necessary for decoding the information, which does not need to be done here.
import base64
import uuid
# get a UUID - URL safe, Base64
def get_a_Uuid():
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes)
    return r_uuid.replace('=', '')
Above does not work for Python3. This is what I'm doing instead:
def get_a_Uuid():
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes).decode("utf-8")
    return r_uuid.replace('=', '')
* This does follow the standards: base64.urlsafe_b64encode follows RFC 3548 and RFC 4648 (see https://docs.python.org/2/library/base64.html). Stripping == from Base64-encoded data of known length is allowed (see RFC 4648 §3.2). UUIDs/GUIDs are specified in RFC 4122; §4.1 Format states "The UUID format is 16 octets". The base64 function encodes these 16 octets.
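If the short string ever does need to be turned back into a UUID, the padding can be restored from the length alone. A minimal Python 3 sketch, where encode_uuid and decode_uuid are illustrative names, not part of any library:

```python
import base64
import uuid


def encode_uuid(u):
    """UUID -> 22-character URL-safe Base64 string without padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def decode_uuid(text):
    """Inverse of encode_uuid: re-add the '=' padding, then decode."""
    padded = text + "=" * (-len(text) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


original = uuid.uuid4()
token = encode_uuid(original)
assert decode_uuid(token) == original
print(token, "->", decode_uuid(token))
```

Because the input is always 16 octets, the encoded form always has 22 significant characters, so dropping the padding loses no information.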

Encoded URL does not work

I am confused about encoded URLs.
For example, when I type in my browser:
stackoverflow.com/questions
I can successfully view the page.
However, when I write:
stackoverflow.com%2Fquestions
I am unable to view.
Since %2F means "/", I want to understand why this does not work properly.
The reason why I want to find out is that I am getting an encoded URL and I don't know how I can decode that URL right after I receive it in order not to have an error page.
The / is one of the percent-encoding reserved characters. URLs use percent-encoding reserved characters for defining their syntax. These characters need to be encoded only when they are not used in their special role inside a URL.
Percent-encoding reserved characters:
! * ' ( ) ; : @ & = + $ , / ? # [ ]
%21 %2A %27 %28 %29 %3B %3A %40 %26 %3D %2B %24 %2C %2F %3F %23 %5B %5D
%2F is a URL escaped /. It means, treat / as a character, not a directory separator.
In essence, it is looking for a domain stackoverflow.com/questions, not the domain stackoverflow.com with the path questions.
%2F is what you write when you want to include a / in a parameter but don't want the browser to navigate to a different directory/route.
So if you had the file path 'root/subdirectory' passed as a querystring parameter, you would want to encode that like:
http://www.testurl.com/page.php?path=root%2Fsubdirectory
rather than
http://www.testurl.com/page.php?path=root/subdirectory
URL encoding is used e.g. for encoding a string URL parameter coming from an HTML form, which contains special characters, like '/'. Writing "stackoverflow.com%2Fquestions" is wrong, in this case the '/' is part of the URL itself, and must not be encoded.
%2F is an escaped character entity - meaning it would be included in a name, rather than the character /, which denotes directory hierarchy, as specified in RFC 1630, page 8.
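For the practical part of the question (decoding a URL that arrives fully encoded), a small Python sketch using the standard urllib.parse module may help. It assumes the whole string was percent-encoded as a single value, and the https:// prefix is added here only so that urlsplit can recognise the host.

```python
from urllib.parse import quote, unquote, urlsplit

received = "stackoverflow.com%2Fquestions"

# Undo the percent-encoding so that "/" acts as a path separator again.
decoded = unquote(received)
print(decoded)  # stackoverflow.com/questions

# Assumption: the scheme is https; urlsplit needs one to find the host.
parts = urlsplit("https://" + decoded)
print(parts.netloc, parts.path)  # stackoverflow.com /questions

# The opposite direction: encode "/" when it is data inside a parameter.
print(quote("root/subdirectory", safe=""))  # root%2Fsubdirectory
```

Decoding the whole URL like this is only appropriate when the entire URL was encoded as one value; a URL that legitimately carries %2F inside a parameter should be left alone.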

Should data attribute of object tag be percent-encoded?

Suppose my web application renders the following tag:
<object type="application/x-pdf" data="http://example.com/test%2Ctest.pdf">
<param name="showTableOfContents" value="true" />
<param name="hideThumbnails" value="false" />
</object>
Should data attribute be escaped (percent-encoded path) or no? In my example it is. I haven't found any specification.
addendum
Actually, I'm interested in specification on what should browser plugins consuming data attribute expect to see there. For example, Adobe Acrobat plugin takes both escaped and unescaped uri. However, QWebPluginFactory treats data attribute as a human readable URI (unescaped), and that leads to double percent encoding. And I'm wondering whether it is a bug of QWebPluginFactory or not.
The data attribute expects the value to be a URI. So you should provide a value that is a syntactically valid URI.
The current specification of URIs is RFC 3986. To see whether the , in the URI’s path needs to be encoded, take a look at how the path production rule is defined:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
Since we have a URI with authority information, we need to take a look at path-abempty (see URI production rule):
path-abempty = *( "/" segment )
segment is zero or more pchar characters that is defined as follows (I’ve already expanded the production rules):
pchar = ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@"
And as you can see, pchar expands to a literal ,. So you don’t need to encode the , in the path component. But since you are allowed to encode any non-delimiting character using the percent-encoding without changing its meaning, it is fine to use %2C instead of ,.
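A quick way to see that both spellings name the same path is to decode them. This is only a sketch with Python's urllib.parse; the path is the one from the question.

```python
from urllib.parse import quote, unquote

encoded = "/test%2Ctest.pdf"
literal = "/test,test.pdf"

# Both forms decode to the same path.
assert unquote(encoded) == unquote(literal) == "/test,test.pdf"

# quote() percent-encodes the comma by default (only "/" is in safe) ...
print(quote(literal))             # /test%2Ctest.pdf
# ... but the comma can be left as-is by listing it in safe.
print(quote(literal, safe="/,"))  # /test,test.pdf
```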
URLs generally can only contain specific characters. Unfortunately different specifications contain different lists of characters that are considered reserved and thus can't be used.
In your example the encoded character is a comma (,), which is a reserved character in some specifications, so it's not wrong to encode it.
Most web servers should handle unencoded and encoded commas equally; however, some may not, depending on their configuration. Because of that, it is generally a good idea to avoid special characters in filenames (as you have in your example) in the first place.
URL encoding is always needed when you have special characters in GET parameters. For example, a GET parameter that is supposed to take C&A as a value has to be written as:
http://example.com/somescript.php?value=C%26A
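In Python, for instance, urllib.parse.urlencode does this encoding for you. The sketch below also shows what happens when the & is left raw:

```python
from urllib.parse import parse_qs, urlencode

query = urlencode({"value": "C&A"})
print("http://example.com/somescript.php?" + query)
# http://example.com/somescript.php?value=C%26A

# Encoded: the server side sees the full value.
print(parse_qs(query))        # {'value': ['C&A']}

# Unencoded: "&" is treated as a parameter separator and the value is cut off.
print(parse_qs("value=C&A"))  # {'value': ['C']}
```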
EDIT:
Plugins (or even the browser) don't care either way. They don't try to (or need to) decode it or anything like that. They just request the URL as entered from the server.

What characters must be escaped in an HTTP query string?

This question concerns the characters in the query string portion of the URL, which appear after the ? mark character.
Per Wikipedia, certain characters are left as is and others are encoded (usually with a % escape sequence).
I've been trying to track this down to actual specifications, so that I understand the justification behind every bullet point in that Wikipedia page.
Contradiction Example 1:
The HTML specification says to encode space as + and defers the rest to RFC1738. However, this RFC says that ~ is unsafe and furthermore that "[a]ll unsafe characters must always be encoded within the URL". This seems to contradict Wikipedia.
In practice, IE8 encodes ~ in the query strings it generates, while FF3 leaves it as is.
Contradiction Example 2:
Wikipedia states that all characters that it does not mention must be encoded. ! is not mentioned in Wikipedia. But RFC1738 states that ! is a "special" character and "may be used unencoded". This seems to contradict Wikipedia which says that it must be encoded.
In practice, IE8 encodes ! in the query strings it generates, while FF3 leaves it as is.
I understand that the moral of this is probably going to be to encode those characters that are in doubt between Wikipedia and the specifications. Perhaps even going as far as encoding everything that is not [A-Za-z0-9]. I would just like to know the actual standards on this.
Conclusions
The algorithm described on Wikipedia encodes precisely those characters which are not RFC3986 unreserved characters. That is, it encodes all characters other than alphanumerics and -._~. As a special case, space is encoded as + rather than as the %20 that RFC3986 percent-encoding would produce.
Some applications use an older RFC. For comparison, the RFC2396 unreserved characters are alphanumerics and !'()*-._~.
For comparison, the HTML5 working draft algorithm encodes all characters other than alphanumerics and *-._. The special case encoding for space remains +. Notable differences are that * is not encoded and ~ is encoded. (Technically, this handling of * is compatible with RFC3986 even though * is in reserved because it is in the sub-delims which are allowed in the query production.)
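As an illustration of the first conclusion, Python's urllib.parse follows the RFC 3986 unreserved set (this assumes Python 3.7 or later; earlier versions also encoded ~). quote gives plain percent-encoding, while quote_plus adds the + special case for space:

```python
from urllib.parse import quote, quote_plus

sample = "a b~c!d*e"

# Everything except alphanumerics and -._~ is percent-encoded.
print(quote(sample, safe=""))  # a%20b~c%21d%2Ae

# Same rule, but space becomes "+" (form-style encoding).
print(quote_plus(sample))      # a+b~c%21d%2Ae
```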
The answer lies in the RFC 3986 document, specifically Section 3.4.
The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
...
The characters slash ("/") and question mark ("?") may represent data
within the query component.
Technically, RFC 3986-3.4 defines the query component as:
query = *( pchar / "/" / "?" )
This syntax means that query can include all characters from pchar as well as / and ?. pchar refers to another specification of path characters. Helpfully, Appendix A of RFC 3986 lists the relevant ABNF definitions, most notably:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Thus, in addition to all alphanumerics and percent encoded characters, a query can legally include the following unencoded characters:
/ ? : @ - . _ ~ ! $ & ' ( ) * + , ; =
Of course, you may want to keep in mind that '=' and '&' usually have special significance within a query.
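The query rule can be turned into a regular expression fairly directly. The following Python sketch (is_valid_query is a made-up helper name) uses the RFC 3986 definition, in which pchar contributes ":" and "@" on top of the unreserved and sub-delims sets; it checks syntax only, not what = or & mean to the application.

```python
import re

# query = *( pchar / "/" / "?" ), with pchar expanded per RFC 3986 Appendix A.
QUERY_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_valid_query(query):
    """Return True if the string matches the RFC 3986 query production."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("path=root/subdirectory&x=1?"))  # True
print(is_valid_query("value=C%26A"))                  # True
print(is_valid_query("a b"))                          # False (raw space)
print(is_valid_query("x7=3^"))                        # False (raw caret)
print(is_valid_query("rate=100%"))                    # False (stray percent sign)
```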

What does a ^ sign mean in a URL?

What is the meaning of a ^ sign in a URL?
I needed to crawl some link data from a webpage and I was using a simple handwritten PHP crawler for it. The crawler usually works fine; then I came to a URL like this:
http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^
This URL works fine when typed in a browser but my crawler is not able to retrieve this page. I am getting an "HTTP request failed error".
^ characters should be encoded, see RFC 1738 Uniform Resource Locators (URL):
Other characters are unsafe because
gateways and other transport agents
are known to sometimes modify such
characters. These characters are "{",
"}", "|", "\", "^", "~", "[", "]",
and "`".
All unsafe characters must always
be encoded within a URL
You could try URL encoding the ^ character.
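A hedged sketch of how a crawler might do that in Python: keep the URL's delimiters and any existing % escapes, and percent-encode everything else that is not unreserved. The helper name escape_unsafe and the KEEP set are assumptions made for this example, not taken from any library.

```python
from urllib.parse import quote

# RFC 3986 gen-delims and sub-delims, plus "%" so existing escapes survive.
KEEP = ":/?#[]@!$&'()*+,;=%"


def escape_unsafe(url):
    """Percent-encode characters such as ^ without double-encoding %XX."""
    return quote(url, safe=KEEP)


url = ("http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2"
       "%20from%20table%20where%20recordid%3E=20^^^^^")
print(escape_unsafe(url))
```

Each ^ comes out as %5E, while the %20 and %3E sequences already in the URL are left untouched. A literal % that is not part of an escape would also be left alone, so this is only suitable for URLs that are already mostly encoded.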
Based on the context, I'd guess they're a homespun attempt to URL-encode quote-marks.
Caret (^) is not a reserved character in URLs, so it should be acceptable to use as-is. However, if you're having problems, just replace it with its hex encoding %5E.
And yeah, putting raw SQL in the URL is like a big flashing neon sign reading "EXPLOIT ME PLEASE!".
Caret is neither reserved nor "unreserved", which makes it an "unsafe character" in URLs. They should never appear in URLs unencoded. From RFC2396:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
The crawler may be using regular expressions to parse the URL and is therefore falling over because the caret (^) means beginning of line. I think these URLs are really bad practice since they expose the underlying database structure; whoever wrote this might want to consider serious refactoring!
HTH!