URL: following rules for relative path with trailing slash - html

I've found lots of answers on the server side for the relative-path-with-trailing-slash question, but none on the client side. Help me out here.
I'm writing a web crawler to take statistics on a set of websites, and am running into a problem. One website I'm working with has a navbar with relative paths with trailing slashes, and intends those paths to be treated as absolute, like so:
on page http://www.example.com/foo/bar
navbar link addresses -> foo/, baz/, quox/
intended absolute urls -> http://www.example.com/foo/, http://www.example.com/baz/, http://www.example.com/quox/
The problem is that, as far as I can tell, this is nonstandard behavior, and yet Firefox and Chrome both handle those paths as absolute. According to RFC 1808 and RFC 2396, these should be handled as relative paths, like this:
spec-correct absolute urls -> http://www.example.com/foo/foo/, http://www.example.com/foo/baz/, http://www.example.com/foo/quox/
In particular at section 5.1 in RFC 1808 and C.1 in RFC 2396, the 4th example shows this case specifically being treated as a relative path. In Ruby, which I'm writing the crawler in, the Addressable gem handles these according to spec.
What's worse, the server in question happily returns 200 OK for these paths, and every one of those pages has the same navbar. So I end up crawling http://www.example.com/foo/, which is the same page as http://www.example.com/foo/foo/, http://www.example.com/foo/foo/foo/ and so on, expanding combinatorially into bizarre URLs like http://www.example.com/foo/baz/quox/foo/
So here's the question: Am I missing something that allows Chrome and Firefox to both interpret these URLs as absolute paths? Is there any way to tell the case where the spec-correct resolution is right apart from the case where the absolute path is what is intended?

The page most likely has a <base> tag inside its <head> element, which sets the base URL that relative paths on the page are resolved against.
RFC-1808
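The two resolutions from the question are easy to reproduce in a crawler. Below is a minimal Python sketch (the question uses Ruby; Python's standard urllib.parse.urljoin is used here only for illustration). It assumes the site declares <base href="http://www.example.com/">, which is a guess, since the actual markup was not posted.

```python
from urllib.parse import urljoin

page_url = "http://www.example.com/foo/bar"
navbar_links = ["foo/", "baz/", "quox/"]

# No <base> element: the last path segment of the page URL ("bar") is dropped
# and each link is appended to what remains ("/foo/").
spec_resolved = [urljoin(page_url, link) for link in navbar_links]

# Assumed <base href="http://www.example.com/"> (hypothetical, not from the post):
# the same links are now resolved against the site root.
assumed_base = "http://www.example.com/"
base_resolved = [urljoin(assumed_base, link) for link in navbar_links]

print(spec_resolved)
# ['http://www.example.com/foo/foo/', 'http://www.example.com/foo/baz/', 'http://www.example.com/foo/quox/']
print(base_resolved)
# ['http://www.example.com/foo/', 'http://www.example.com/baz/', 'http://www.example.com/quox/']
```

If the browser's result matches the second list, the resolver is not at fault; the crawler is simply resolving against the wrong base.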

Related

Why Does this relative URL work?

So I've been playing with writing some web crawlers and testing them on different sites. But I've come across some sites that seem like their relative urls should not work, or at least I think they should point to someplace other than where the browser resolves them to.
Given the URL of the current page: "http://www.examplesite.com/a/page.htm"
And a link of: "a/page2.htm"
The browser correctly resolves this as: "http://www.examplesite.com/a/page2.htm"
My problem/feeling (obviously wrong, but I'm wondering why) is that this should resolve to "http://www.examplesite.com/a/a/page2.htm". The relative url does not begin with a /, so why does it become base relative?
Interestingly, Java's URL class appears to agree with me, as the following code outputs "http://www.examplesite.com/a/a/page2.htm":
import java.net.URL;
URL baseUrl = new URL("http://www.examplesite.com/a/page.htm");
URL absoluteUrl = new URL(baseUrl, "a/page2.htm"); // resolve the link against the page URL
System.out.println(absoluteUrl);
Why does this link resolve the way it does, and what is the formal rule for resolving a relative link like this?
EDIT:
I just noticed that in the <head> portion of the webpage there is a field like so:
<base href="http://examplesite.com/">
I'm assuming that this makes relative links use that as their base URL instead of the page's actual URL. Is this a correct assumption? Is that even valid HTML markup?
You are correct in that it is the base tag, and yes it is valid.
In HTML, links and references to external images, applets,
form-processing programs, style sheets, etc. are always specified by a
URI. Relative URIs are resolved according to a base URI, which may
come from a variety of sources. The BASE element allows authors to
specify a document's base URI explicitly.
When present, the BASE element must appear in the HEAD section of an
HTML document, before any element that refers to an external source.
The path information specified by the BASE element only affects URIs
in the document where the element appears.
Sources: W3C Wiki and W3C Markup
The site is likely using a <base> tag to specify the parent as the prefix for all relative URLs on the site.
You can find out more on the base tag here. If this is not the case, then please provide the source URL as this defies normal behavior.
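For a crawler, this means the base has to be read from the page before any link is resolved. The following is a rough Python sketch of that step, not a full implementation: BaseHrefFinder and effective_base are hypothetical helper names, and the sample HTML is made up around the <base> element quoted in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base(page_url, html):
    """Return the URL that relative links in this page should be resolved against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url
    # The base href may itself be relative, so resolve it against the page URL.
    return urljoin(page_url, finder.base_href)


page_url = "http://www.examplesite.com/a/page.htm"
# Made-up page source containing the <base> element from the question.
html = (
    '<html><head><base href="http://examplesite.com/"></head>'
    '<body><a href="a/page2.htm">next</a></body></html>'
)

with_base = urljoin(effective_base(page_url, html), "a/page2.htm")
without_base = urljoin(effective_base(page_url, "<html></html>"), "a/page2.htm")

print(with_base)     # http://examplesite.com/a/page2.htm
print(without_base)  # http://www.examplesite.com/a/a/page2.htm
```

The second result is what Java's URL class produced in the question, because it was given the page URL rather than the declared base.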

Is it OK to use empty href on links with <base> tag

I set the base tag as such:
<base href="http://mnapoli.github.com/PHP-DI/">
Then I would like to create a link to http://mnapoli.github.com/PHP-DI/ in relative path.
I tried:
<a href="">link</a>
and it works in Chrome, but is this method standard and supposed to work on every browser?
Although href="./" as suggested in Mike’s answer is better (easier to understand to anyone who reads the code), the answer to the question posed is that using an empty URL is standard and supposed to work on all browsers. According to STD 66, a relative URL with no characters (path-empty) is allowed, and by rules on relative URLs, it is resolved as the base URL.
This has nothing to do with folders or files; URLs are strings, and whether they get mapped to folders or files on a server is at the discretion of the server.
I would do something like this:
<base href="http://mnapoli.github.com/PHP-DI/">
<a href="./">Home</a>
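The claim that both an empty reference and ./ resolve to the base URL can be checked with any RFC 3986 resolver. Here is a small Python sketch using the standard urljoin. The second base, without the trailing slash, is a hypothetical variation added to show that the result depends only on the URL string.

```python
from urllib.parse import urljoin

base = "http://mnapoli.github.com/PHP-DI/"

empty_target = urljoin(base, "")   # empty reference
dot_target = urljoin(base, "./")   # "current directory" reference

print(empty_target)  # http://mnapoli.github.com/PHP-DI/
print(dot_target)    # http://mnapoli.github.com/PHP-DI/

# Hypothetical variation: the same base without its trailing slash.
# "PHP-DI" is now the last path segment, so "./" points at its parent.
base_no_slash = "http://mnapoli.github.com/PHP-DI"
print(urljoin(base_no_slash, ""))    # http://mnapoli.github.com/PHP-DI
print(urljoin(base_no_slash, "./"))  # http://mnapoli.github.com/
```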

How do user agents distinguish domains from file extensions in relative urls?

Let's say a browser encounters a link like this:
<a href='stackoverflowhome.html'>home</a>
This is clearly a relative url to an html file in the current directory, but how does the browser know that the .html is a file extension, and not a TLD (top level domain)? Does it have a list of common file extensions, or a list of TLDs? And if so, is it manually updated whenever a new file format becomes commonly used, or when the list of accepted TLDs change, for example with brand tlds?
It's because that is how RFC 3986 specifies that URIs should be parsed. If the URI does not have a scheme (a set of characters followed by a colon, e.g. http: or gopher:) then it must be treated as a relative URI. Quoting from the RFC:
A URI-reference is either a URI or a relative reference. If the
URI-reference's prefix does not match the syntax of a scheme followed
by its colon separator, then the URI-reference is a relative
reference.
User agents are allowed to make a best guess about what the user meant (see section 4.5), especially where the context is ambiguous, such as a browser's URL bar. The RFC recommends against relying on this for URIs that will be around for a long time: the heuristics change over time, so the same string may resolve to different resources depending on when it is accessed and which user agent accesses it.
Relative URLs are never domain names.
A URL is only parsed as containing a domain name if it has a protocol (or is protocol-relative).
The URL does not start with a protocol specifier - no http:// or https://, so is interpreted as a relative URL.
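A quick way to see this rule in action is to feed a few references to a standards-based parser. This is a minimal Python sketch; the base URL and the helper name is_relative_reference are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def is_relative_reference(ref):
    """True when ref has no scheme, i.e. it is a relative reference."""
    return urlsplit(ref).scheme == ""


base = "http://www.example.com/docs/index.html"  # hypothetical current page

for ref in ["stackoverflowhome.html", "example.com", "http://example.com/"]:
    print(ref, is_relative_reference(ref), urljoin(base, ref))

# stackoverflowhome.html True http://www.example.com/docs/stackoverflowhome.html
# example.com True http://www.example.com/docs/example.com
# http://example.com/ False http://example.com/
```

Note that "example.com" is treated exactly like "stackoverflowhome.html": no list of TLDs or file extensions is consulted, only the presence of a scheme.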

Absolute Paths beginning with two slashes

I noticed that Wikipedia links pointing to a path on a different Wikipedia subdomain use a link with the following syntax: //<SERVER_NAME>/<REQUEST_URI>. For example, a link from a file page to the file appears (for example) as //upload.wikimedia.org/wikipedia/en/9/95/Stack_Overflow_website_logo.png. I am familiar with absolute paths (thinking twice about that now) and relative paths and how to use them. However, I have never seen this use. I assume this points to a new server name using the current protocol. Is this correct? And is there an official name (or widely accepted name) for this?
You are absolutely right. A link to //some/path is a protocol relative path.
Namely, if you are currently on http://something.example.com, a link to //google.com would point to http://google.com.
If you are currently on https://something.example.com, a link to //google.com would point to https://google.com.
The most common use of this can be seen in HTML5 Boilerplate.
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.5.1/jquery.js"></script>
Google kindly provides its JavaScript CDN over both http and https.
So, to avoid security warnings, the script is loaded over https if the page is on https, or over http if the page is on http.
Note:
Unfortunately, you can't do the same thing for Google Analytics.
It uses the domains ssl.google-analytics.com and www.google-analytics.com for https and http respectively.
It looks like these //example.com URIs are called "Scheme Relative" or "Protocol Relative", and there is more information about it at this question:
Network-Path Reference URI / Scheme relative URLs
EDIT:
Apparently this might actually be called a "network-path reference" as seen here:
https://www.rfc-editor.org/rfc/rfc3986#section-4.2
Quote:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used. A
relative reference that begins with a single slash character is
termed an absolute-path reference. A relative reference that does
not begin with a slash character is termed a relative-path reference.
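The three terms in that quote depend only on how the reference starts, so they can be told apart with a few string checks. A rough Python sketch follows; classify_reference is a hypothetical helper, and the sample references other than the Wikipedia one are made up.

```python
from urllib.parse import urlsplit


def classify_reference(ref):
    """Name a URI reference using the terms from RFC 3986 section 4.2."""
    if urlsplit(ref).scheme:
        return "URI (has a scheme)"
    if ref.startswith("//"):
        return "network-path reference"
    if ref.startswith("/"):
        return "absolute-path reference"
    return "relative-path reference"


samples = [
    "//upload.wikimedia.org/wikipedia/en/9/95/Stack_Overflow_website_logo.png",
    "/wiki/Main_Page",
    "foo/",
    "https://www.rfc-editor.org/rfc/rfc3986",
]
for ref in samples:
    print(classify_reference(ref), "<-", ref)
```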

Protocol-less absolute URIs, with host, in HTML?

I have seen some pages that refer to what appear to be absolute URIs, with a host, but without a protocol. For example:
<script src="//mc.yandex.ru/metrika/watch.js" type="text/javascript"></script>
My assumption is that this means 'use the same protocol as what we are on now', so the parent page will request https://mc.yandex.ru/metrika/watch.js if its own protocol is https.
Is this syntax even correct? Part of a standard? What does it mean?
It's called a "network path reference". The documentation for this can be found in RFC 3986. Specifically, see section 4.2:
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used.
And section 5.4:
Within a representation with a well defined base URI of
http://a/b/c/d;p?q
a relative reference is transformed to its target URI as follows...
"g:h" = "g:h"
...
"//g" = "http://g"
...
So a URI starting with a double slash takes its scheme from the base URI and supplies everything else itself. One use of this that I know of (in fact, the only use I've ever seen) is when using a CDN (for example, when including jQuery via the Google CDN). Google hosts a version on the http protocol, and another on the https protocol, and using this URI format will cause the correct version to be loaded, no matter which protocol you are using.
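The section 5.4 examples quoted above, and the script URL from the question, can be reproduced with Python's standard urljoin. This is only an illustrative sketch, and the two page URLs are made up.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986 section 5.4.
rfc_base = "http://a/b/c/d;p?q"
print(urljoin(rfc_base, "g:h"))  # g:h       (has its own scheme, left untouched)
print(urljoin(rfc_base, "//g"))  # http://g  (scheme taken from the base)

# The same reference resolved from an http page and from an https page.
ref = "//mc.yandex.ru/metrika/watch.js"
from_http = urljoin("http://something.example.com/page", ref)    # hypothetical page
from_https = urljoin("https://something.example.com/page", ref)  # hypothetical page
print(from_http)   # http://mc.yandex.ru/metrika/watch.js
print(from_https)  # https://mc.yandex.ru/metrika/watch.js
```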
Update (having just found and read this article)
It appears that using this URI format throughout a page can prevent the "This Page Contains Both Secure and Non-Secure Items" error in IE. However, it's worth noting that when a file is included this way via a link element or an @import directive, it is requested twice. All other resources (such as images and anchors) should work as expected.