How do user agents distinguish domains from file extensions in relative urls? - html

Let's say a browser encounters a link like this:
<a href='stackoverflowhome.html'>home</a>
This is clearly a relative url to an html file in the current directory, but how does the browser know that the .html is a file extension, and not a TLD (top level domain)? Does it have a list of common file extensions, or a list of TLDs? And if so, is it manually updated whenever a new file format becomes commonly used, or when the list of accepted TLDs change, for example with brand tlds?

It's because that is how RFC 3986 specified that URIs should be parsed. If the URI does not have a scheme (a set of characters followed by a colon - e. g. http: or gopher:) then it must be treated as a relative URI. Quoting from the RFC:
A URI-reference is either a URI or a relative reference. If the
URI-reference's prefix does not match the syntax of a scheme followed
by its colon separator, then the URI-reference is a relative
reference.
User-agents are allowed to make their best guess about what the user meant (see section 4.5) especially in cases where the context is ambiguous (such as URL bars in browsers) but the RFC recommends against it where the URI will be around for a long time as the best guess of user-agents will change over time, thus leading to URIs that don't resolve to the same resource depending on the time they are accessed or the user-agent they are accessed with.

Relative URLs are never domain names.
A URL is only parsed as containing a domain name if it has a protocol. (or is protocol-relative).

The URL does not start with a protocol specifier - no http:// or https://, so is interpreted as a relative URL.

Related

What's the difference between the two css link below, and what are the purposes of the two links? [duplicate]

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
Google
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?
Protocol relative URL
You may receive unusual security warnings in some browsers.
See also, Wikipedia Protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.
It is called network-path reference (the part that is missing is called scheme or protocol) defined in RFC3986 Section 4.2
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative- path reference.

Is it standard web-browser behaviour to prepend the current URL to incomplete links?

If a website includes an incomplete link such as the following:
Link
Link
Is it standard, universal behaviour that the link would be interpreted as the current URL with the href value appended to it?
The HTML5 spec defines how [href] attributes behave
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
which links to:
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing whitespace from it, it is a valid URL.
which links to:
A URL is a valid URL if it conforms to the authoring conformance requirements in the URL standard. [URL]
which links to a sizable block of text, but I think the following is important:
Most of the URL-related terms used in the HTML specification (URL, absolute URL, relative URL, relative schemes, scheme component, scheme data, username, password, host, port, path, query, fragment, percent encode, get the base, and UTF-8 percent encode) can be straightforwardly mapped to the terminology of [RFC3986] [RFC3987].
As for the "incomplete link" examples you included in your question. They are examples of a "fragment" and "query" respectively, which have an implicit relative URL of . which represents the current URL (note that it will not merge query strings or document fragment identifiers).

How do browsers determine whether an URL in an href is relative or not when using a scheme?

Suppose I have the following link tag: Phone number.
How exactly does the browser know not to load the relative location ./tel:+15555555 from the current server and instead know that tel is supposed to be interpreted as a scheme?
Detecting host-relative URLs (/…) or protocol-relative URLs (//…) seems to be trivial. I guess HTTP-URLs (http://… or https://…) would be simple to special-case as well. But how does the browser go about parsing an URL with an arbitrary scheme? I know a valid scheme has to start with a lowercase letter and may only contain lowercase letters or the characters +, - and ., which limits the scope somewhat…
Of course I’m aware that the whole issue only pertains to scopes where relative URLs are valid (i.e. mostly the href and src attributes).
I’m looking for the links to some RFC (e.g. which forbids non-URL-encoded colons to be anything but scheme separators) as well as to the source code of various browser’s URL parsing internals.
The href value is parsed as a URI (see RFC 3986). As a result of the parsing, the browser will know that this was an absolute URI, not a relative reference.
As a matter of fact, unescaped ":" is allowed in the path component; it's just that they need to occur after the first "/"; otherwise they could be parsed as scheme delimiter if the preceding characters are all valid scheme name characters.
See http://greenbytes.de/tech/webdav/rfc3986.html#path
The RFC also has the following to say in section 4.2 (titled “Relative Reference”): “A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.” (emphasis added).
See RFC 3966 for the tel URI specification, and RFC 3986 for the more generic URL specification. It's the colon (:) that separates scheme from the "hier part".

Absolute Paths beginning with two slashes

I noticed that Wikipedia links pointing to a path on a different Wikipedia subdomain use a link with the following syntax: //<SERVER_NAME>/<REQUEST_URI>. For example, a link from a file page to the file appears (for example) as //upload.wikimedia.org/wikipedia/en/9/95/Stack_Overflow_website_logo.png. I am familiar with absolute paths (thinking twice about that now) and relative paths and how to use them. However, I have never seen this use. I assume this points to a new server name using the current protocol. Is this correct? And is there an official name (or widely accepted name) for this?
You are absolutely right. A link to //some/path is a protocol relative path.
Namely, if you are currently on http://something.example.com, a link to //google.com would point to http://google.com.
If you are currently on https://something.example.com, a link to //google.com would point to https://google.com.
The most common use of this can be seen in the html5 boilerplate.
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.5.1/jquery.js"></script>
Kindly google provides its javascript cdn over both http and https.
Thereby to avoid security warnings, we load it over https if we are on https, or http if we are on http.
note:
Unfortunately, you can't do the same thing for google analytics.
they use the domains ssl.google-analytics.com and www.google-analytics.com for https and http.
It looks like these //example.com URIs are called "Scheme Relative" or "Protocol Relative", and there is more information about it at this question:
Network-Path Reference URI / Scheme relative URLs
EDIT:
Apparently this might actually be called a "network-path reference" as seen here:
https://www.rfc-editor.org/rfc/rfc3986#section-4.2
Quote:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.

Protocol-less absolute URIs, with host, in HTML?

I have seen some pages that refer to what appear to be absolute URIs, with a host, but without a protocol. For example:
<script src="//mc.yandex.ru/metrika/watch.js" type="text/javascript"></script>
My assumption is that this means 'use the same protocol as what we are on now', so the parent page will request https://mc.yandex.ru/metrika/watch.js if its own protocol is https.
Is this syntax even correct? Part of a standard? What does it mean?
It's called a "network path reference". The documentation for this can be found in RFC 3986. Specifically, see section 4.2:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used.
And section 5.4:
Within a representation with a well defined base URI of
http://a/b/c/d;p?q
a relative reference is transformed to its target URI as follows...
"g:h" = "g:h"
...
"//g" = "http://g"
...
So a URI starting with a double slash is transformed to match the base URI. One use of this that I know of (in fact, the only use I've ever seen) is when using a CDN (for example, when including jQuery via the Google CDN). Google hosts a version on the http protocol, and another on the https protocol, and using this URI format will cause the correct version to be loaded, no matter which protocol you are using.
Update (having just found and read this article)
It appears that using this URI format throughout a page can prevent the "This Page Contains Both Secure and Non-Secure Items" error in IE. However, it's worth noting that this format causes files included via a link element, or an #import directive cause the included file to be requested twice. All other resources (such as images and anchors) should work as expected.