Absolute Paths beginning with two slashes - html

I noticed that Wikipedia links pointing to a path on a different Wikipedia subdomain use a link with the following syntax: //<SERVER_NAME>/<REQUEST_URI>. For example, a link from a file page to the file appears (for example) as //upload.wikimedia.org/wikipedia/en/9/95/Stack_Overflow_website_logo.png. I am familiar with absolute paths (thinking twice about that now) and relative paths and how to use them. However, I have never seen this use. I assume this points to a new server name using the current protocol. Is this correct? And is there an official name (or widely accepted name) for this?

You are absolutely right. A link to //some/path is a protocol relative path.
Namely, if you are currently on http://something.example.com, a link to //google.com would point to http://google.com.
If you are currently on https://something.example.com, a link to //google.com would point to https://google.com.
The most common use of this can be seen in the html5 boilerplate.
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.5.1/jquery.js"></script>
Kindly google provides its javascript cdn over both http and https.
Thereby to avoid security warnings, we load it over https if we are on https, or http if we are on http.
note:
Unfortunately, you can't do the same thing for google analytics.
they use the domains ssl.google-analytics.com and www.google-analytics.com for https and http.

It looks like these //example.com URIs are called "Scheme Relative" or "Protocol Relative", and there is more information about it at this question:
Network-Path Reference URI / Scheme relative URLs
EDIT:
Apparently this might actually be called a "network-path reference" as seen here:
https://www.rfc-editor.org/rfc/rfc3986#section-4.2
Quote:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.

Related

What's the difference between the two css link below, and what are the purposes of the two links? [duplicate]

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
Google
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?
Protocol relative URL
You may receive unusual security warnings in some browsers.
See also, Wikipedia Protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.
It is called network-path reference (the part that is missing is called scheme or protocol) defined in RFC3986 Section 4.2
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative- path reference.

Images with protocol-relative source URLs don't load in Firefox

I use the following syntax in my application so that images are loaded over https or http depending upon how the page was loaded.
<img src="//path_to_image.jpg">
This works fine in Chrome but firefox does not display any images.
What can I do to fix this ?
The // use to support multiple protocol (i.e. http or https), these type of URL is known as "protocol relative URLs" and use with complete domain name. Mostly CDN url are used with //.
If you are planning to use // make sure you use full domain url (i.e. //xyz.com/images/path_to_image.jpg). If you just want to use relative path from the root then use single slash (i.e. /)
Following like help you to understand // usage
Two forward slashes in a url/src/href attribute
Using // will only work with a full URL ('//yourhost.tld/directory/path_to_image.jpg'). In your case one slash /, should be enough!

URL: following rules for relative path with trailing slash

I've found lots of answers on the server side for the relative-path-with-trailing-slash question, but none on the client side. Help me out here.
I'm writing a web crawler to take statistics on a set of websites, and am running into a problem. One website I'm working with has a navbar with relative paths with trailing slashes, and intends those paths to be treated as absolute, like so:
on page http://www.example.com/foo/bar
navbar links addresses -> foo/, baz/, quox/
intended absolute urls -> http://www.example.com/foo/, http://www.example.com/baz/, http://www.example.com/quox/
The problem is, as far as I can tell, this is nonstandard behavior - and yet Firefox and Chrome both handle those paths as absolute. According to RFC 1808, and RFC 2396, these should be handled like relative paths, like this:
spec-correct absolute urls -> http://www.example.com/foo/foo/, http://www.example.com/foo/baz/, http://www.example.com/foo/quox/
In particular at section 5.1 in RFC 1808 and C.1 in RFC 2396, the 4th example shows this case specifically being treated as a relative path. In Ruby, which I'm writing the crawler in, the Addressable gem handles these according to spec.
What's worse is the server in question is happy to return 200 OK for these paths, and all of them have this navbar: so I end up crawling http://www.example.com/foo/ which is the same page as http://www.example.com/foo/foo/, http://www.example.com/foo/foo/foo/ and so on, combinatorially to bizarre URLs like http://www.example.com/foo/baz/quox/foo/
So here's the question: Am I missing something that allows Chrome and Firefox to both interpret these urls as absolute paths? Is there any way to disambiguate the case where the spec is correct and the absolute path is what is intended?
There must have been a <base> tag defined inside of <head> element which is used to specify base URL for relative paths in a page.
RFC-1808

How do user agents distinguish domains from file extensions in relative urls?

Let's say a browser encounters a link like this:
<a href='stackoverflowhome.html'>home</a>
This is clearly a relative url to an html file in the current directory, but how does the browser know that the .html is a file extension, and not a TLD (top level domain)? Does it have a list of common file extensions, or a list of TLDs? And if so, is it manually updated whenever a new file format becomes commonly used, or when the list of accepted TLDs change, for example with brand tlds?
It's because that is how RFC 3986 specified that URIs should be parsed. If the URI does not have a scheme (a set of characters followed by a colon - e. g. http: or gopher:) then it must be treated as a relative URI. Quoting from the RFC:
A URI-reference is either a URI or a relative reference. If the
URI-reference's prefix does not match the syntax of a scheme followed
by its colon separator, then the URI-reference is a relative
reference.
User-agents are allowed to make their best guess about what the user meant (see section 4.5) especially in cases where the context is ambiguous (such as URL bars in browsers) but the RFC recommends against it where the URI will be around for a long time as the best guess of user-agents will change over time, thus leading to URIs that don't resolve to the same resource depending on the time they are accessed or the user-agent they are accessed with.
Relative URLs are never domain names.
A URL is only parsed as containing a domain name if it has a protocol. (or is protocol-relative).
The URL does not start with a protocol specifier - no http:// or https://, so is interpreted as a relative URL.

Protocol-less absolute URIs, with host, in HTML?

I have seen some pages that refer to what appear to be absolute URIs, with a host, but without a protocol. For example:
<script src="//mc.yandex.ru/metrika/watch.js" type="text/javascript"></script>
My assumption is that this means 'use the same protocol as what we are on now', so the parent page will request https://mc.yandex.ru/metrika/watch.js if its own protocol is https.
Is this syntax even correct? Part of a standard? What does it mean?
It's called a "network path reference". The documentation for this can be found in RFC 3986. Specifically, see section 4.2:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used.
And section 5.4:
Within a representation with a well defined base URI of
http://a/b/c/d;p?q
a relative reference is transformed to its target URI as follows...
"g:h" = "g:h"
...
"//g" = "http://g"
...
So a URI starting with a double slash is transformed to match the base URI. One use of this that I know of (in fact, the only use I've ever seen) is when using a CDN (for example, when including jQuery via the Google CDN). Google hosts a version on the http protocol, and another on the https protocol, and using this URI format will cause the correct version to be loaded, no matter which protocol you are using.
Update (having just found and read this article)
It appears that using this URI format throughout a page can prevent the "This Page Contains Both Secure and Non-Secure Items" error in IE. However, it's worth noting that this format causes files included via a link element, or an #import directive cause the included file to be requested twice. All other resources (such as images and anchors) should work as expected.