Why Does this relative URL work? - html

So I've been playing with writing some web crawlers and testing them on different sites. But I've come across some sites that seem like their relative urls should not work, or at least I think they should point to someplace other than where the browser resolves them to.
Given a url of a current page : "http://www.examplesite.com/a/page.htm"
And a link of: "a/page2.htm"
The browser correctly resolves this as: "http://www.examplesite.com/a/page2.htm"
My problem/feeling (obviously wrong, but I'm wondering why) is that this should resolve to "http://www.examplesite.com/a/a/page2.htm". The relative url does not begin with a /, so why does it become base relative?
Interestingly, Java's URL class appears to agree with me, as the following code will output : "http://www.examplesite.com/a/a/page2.htm"
URL baseUrl = new URL("http://www.examplesite.com/a/page.htm");
URL absoluteUrl = new URL(baseUrl, "a/page2.htm");
System.out.println(absoluteUrl);
Why does this link resolve the way it does, and what is the formal rule for resolving a relative link like this?
EDIT:
I just noticed that in the <head> portion of the webpage there is an element like so:
<base href="http://examplesite.com/">
I'm assuming that this makes every relative link use that as its base URL instead of the page's actual URL. Is this a correct assumption? Is that even valid HTML markup?

You are correct in that it is the base tag, and yes it is valid.
In HTML, links and references to external images, applets,
form-processing programs, style sheets, etc. are always specified by a
URI. Relative URIs are resolved according to a base URI, which may
come from a variety of sources. The BASE element allows authors to
specify a document's base URI explicitly.
When present, the BASE element must appear in the HEAD section of an
HTML document, before any element that refers to an external source.
The path information specified by the BASE element only affects URIs
in the document where the element appears.
Sources: W3C Wiki and W3C Markup

The site is likely using a <base> tag to specify the site root as the prefix for all relative URLs on the page.
You can find out more on the base tag here. If this is not the case, then please provide the source URL as this defies normal behavior.
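To see both resolutions side by side, here is a minimal sketch using Python's urllib.parse.urljoin, which follows the standard relative-reference rules. The URLs are the ones from the question; the variable names are only illustrative.

```python
from urllib.parse import urljoin

page_url = "http://www.examplesite.com/a/page.htm"
base_href = "http://examplesite.com/"  # value of the <base href> found in <head>
link = "a/page2.htm"

# Without <base>: the last path segment ("page.htm") is dropped and the
# relative link is appended to the remaining directory ("/a/").
without_base = urljoin(page_url, link)

# With <base>: the link is resolved against the base href instead of the page URL.
with_base = urljoin(base_href, link)

print(without_base)  # http://www.examplesite.com/a/a/page2.htm
print(with_base)     # http://examplesite.com/a/page2.htm
```

The first result matches what Java's URL class printed; the second matches what the browser does once the <base> element is taken into account.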

Related

What is a Document Base URL and Fallback Base URL?

I would like to ask the community to help me understand what a Document Base URL and a Fallback Base URL are, as defined in the HTML5 specification. I would prefer an explanation that is easier to follow than the definitions in the specification itself. However, individual interpretations are also welcome.
Link for Document Base URL's definition.
Link for Fallback Base URL's definition.
To me, the definitions of these two terms in the HTML5 specification look like a circular reference.
You obviously need to understand recursion to understand recursion... ;) – These kinds of specifications are often self-referential. In the end they're very specific steps described in excruciating detail; they're essentially pseudo-code written in English as the programming language. You just need to follow them step by step to get to the result. One part referring to another term is like calling a function in code; they may even call each other, as long as there's no infinite loop in the end that's fine.
In this case it's not really that bad. The fallback base URL describes how to resolve a document's URL assuming cases where it may be a child of another document like an iframe, in which case it falls back to the parent's URL.
The document base URL resolves a document's URL taking <base> elements into account.
In summary, the specification is:
1. The document base URL is either the document's URL or the appropriately resolved <base> URL.
2. If the document is an iframe, take the parent's document base URL (see above).
3. Otherwise if it's about:blank but has a parent, take the parent's document base URL (see above). (This is a real niche case, but required for completeness.)
4. Otherwise it's the document's address.
If the specification is henceforth talking about the document base URL, it just means step 1., the document's own URL, possibly resolved against <base>. If the specification is talking about the fallback base URL, follow all these steps.
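The mutual reference between the two terms can be sketched as two small functions that call each other. This is a simplified model under stated assumptions, not the specification's exact algorithm: the Document class, its fields, and the is_iframe_srcdoc flag are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class Document:
    """Hypothetical stand-in for a browser document (illustration only)."""
    url: str
    base_href: Optional[str] = None        # href of the first <base>, if any
    parent: Optional["Document"] = None    # containing document, if nested
    is_iframe_srcdoc: bool = False         # assumed flag for the iframe case


def fallback_base_url(doc: Document) -> str:
    # iframe case: fall back to the parent's document base URL.
    if doc.is_iframe_srcdoc and doc.parent is not None:
        return document_base_url(doc.parent)
    # about:blank with a parent: also use the parent's document base URL.
    if doc.url == "about:blank" and doc.parent is not None:
        return document_base_url(doc.parent)
    # Otherwise it is simply the document's own address.
    return doc.url


def document_base_url(doc: Document) -> str:
    fallback = fallback_base_url(doc)
    if doc.base_href is None:
        return fallback
    # A <base href> is itself resolved against the fallback base URL.
    return urljoin(fallback, doc.base_href)


parent = Document(url="http://example.com/dir/page.html", base_href="/assets/")
child = Document(url="about:blank", parent=parent)
print(document_base_url(parent))  # http://example.com/assets/
print(document_base_url(child))   # http://example.com/assets/
```

The recursion terminates because each call moves one level up the parent chain, which is why the apparent circularity in the definitions is harmless.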

Is it correct to set base href to /

I'm updating a site to use HTML5 history for bookmarking/link sharing etc.
The URLs will be this shape. http://weathersupermarket.co.uk/forecasts/1,10,4,3,7
All requests were previously served from http://weathersupermarket.co.uk/ and all my references were relative, e.g. img/loading.gif, js/main.min.js etc.
Therefore I'll need to update all my CSS, JS, image references to be either
relative e.g. ../img/loading.gif
absolute e.g. /img/loading.gif
However I've found that if I set the base tag to /, i.e. the absolute root then I can leave my references as relative, e.g. img/loading.gif
Is this correct?
<base href="/" />
I've found lots of similar questions on the base tag, but nothing that covers this specific case.
The W3C spec states that the href attribute is simply a URL. The RFC for URIs appears to allow /.
<base href="/"> is completely valid in HTML5. According to the HTML5 specification, the base URL is determined based on the fallback base URL (for example, on http://example.com/base/com/a, it's http://example.com/base/com).
Because of that, / will always refer to the actual root of the website.
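A quick way to check this outside the browser is to resolve the references with Python's urllib.parse.urljoin. This is only a sketch using the URLs from the question; the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "http://weathersupermarket.co.uk/forecasts/1,10,4,3,7"

# <base href="/"> is resolved against the page URL first, giving the site root.
base = urljoin(page_url, "/")

# With the base in place, the old relative references keep working.
img_with_base = urljoin(base, "img/loading.gif")

# Without it, the same reference would land under /forecasts/.
img_without_base = urljoin(page_url, "img/loading.gif")

print(base)              # http://weathersupermarket.co.uk/
print(img_with_base)     # http://weathersupermarket.co.uk/img/loading.gif
print(img_without_base)  # http://weathersupermarket.co.uk/forecasts/img/loading.gif
```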

URL: following rules for relative path with trailing slash

I've found lots of answers on the server side for the relative-path-with-trailing-slash question, but none on the client side. Help me out here.
I'm writing a web crawler to take statistics on a set of websites, and am running into a problem. One website I'm working with has a navbar with relative paths with trailing slashes, and intends those paths to be treated as absolute, like so:
on page http://www.example.com/foo/bar
navbar links addresses -> foo/, baz/, quox/
intended absolute urls -> http://www.example.com/foo/, http://www.example.com/baz/, http://www.example.com/quox/
The problem is, as far as I can tell, this is nonstandard behavior - and yet Firefox and Chrome both handle those paths as absolute. According to RFC 1808, and RFC 2396, these should be handled like relative paths, like this:
spec-correct absolute urls -> http://www.example.com/foo/foo/, http://www.example.com/foo/baz/, http://www.example.com/foo/quox/
In particular at section 5.1 in RFC 1808 and C.1 in RFC 2396, the 4th example shows this case specifically being treated as a relative path. In Ruby, which I'm writing the crawler in, the Addressable gem handles these according to spec.
What's worse is the server in question is happy to return 200 OK for these paths, and all of them have this navbar: so I end up crawling http://www.example.com/foo/ which is the same page as http://www.example.com/foo/foo/, http://www.example.com/foo/foo/foo/ and so on, combinatorially to bizarre URLs like http://www.example.com/foo/baz/quox/foo/
So here's the question: Am I missing something that allows Chrome and Firefox to both interpret these urls as absolute paths? Is there any way to disambiguate the case where the spec is correct and the absolute path is what is intended?
There must be a <base> tag defined inside the <head> element, which is used to specify the base URL for relative paths in a page.
RFC-1808
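For a crawler, this means checking for a <base> element before resolving any link. The question's crawler is written in Ruby; the sketch below uses Python's standard library just to show the idea. The sample HTML and its base href="/" are assumptions made for the example (the question does not show the page source), and BaseHrefFinder and effective_base are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.base_href = href


def effective_base(page_url, html):
    """Return the URL that relative links on the page should resolve against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url
    # The base href may itself be relative, so resolve it against the page URL.
    return urljoin(page_url, finder.base_href)


# Assumed page source for illustration only.
html = '<html><head><base href="/"></head><body><a href="foo/">foo</a></body></html>'
page_url = "http://www.example.com/foo/bar"

base = effective_base(page_url, html)
resolved = [urljoin(base, link) for link in ("foo/", "baz/", "quox/")]
print(resolved)
```

With the base taken into account, the three navbar links resolve to the intended root-level URLs instead of nesting under /foo/, which also stops the endless /foo/foo/foo/ crawl.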

Is it OK to use empty href on links with <base> tag

I set the base tag as such:
<base href="http://mnapoli.github.com/PHP-DI/">
Then I would like to create a link to http://mnapoli.github.com/PHP-DI/ in relative path.
I tried:
<a href="">link</a>
and it works in Chrome, but is this method standard and supposed to work on every browser?
Although href="./" as suggested in Mike’s answer is better (easier to understand to anyone who reads the code), the answer to the question posed is that using an empty URL is standard and supposed to work on all browsers. According to STD 66, a relative URL with no characters (path-empty) is allowed, and by rules on relative URLs, it is resolved as the base URL.
This has nothing to do with folders or files; URLs are strings, and whether they get mapped to folders or files on a server is at the discretion of the server.
I would do something like this:
<base href="http://mnapoli.github.com/PHP-DI/">
<a href="./">Home</a>
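Both forms can be checked with a resolver that follows the standard rules. A minimal sketch with Python's urllib.parse.urljoin, using the base URL from the question:

```python
from urllib.parse import urljoin

base = "http://mnapoli.github.com/PHP-DI/"

empty = urljoin(base, "")        # empty relative reference
dot_slash = urljoin(base, "./")  # explicit "current directory" reference

print(empty)      # http://mnapoli.github.com/PHP-DI/
print(dot_slash)  # http://mnapoli.github.com/PHP-DI/
```

Both resolve to the base URL itself, so the choice between them is about readability rather than behavior.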

Difference between SRC and HREF

The SRC and HREF attributes are used to include external entities such as an image, a CSS file, an HTML file, another web page, or a JavaScript file.
Is there a clear differentiation between SRC and HREF? Where or when to use SRC or HREF? I think they can't be used interchangeably.
Below are a few examples where these attributes are used:
To refer to a CSS file: href="cssfile.css" inside the link tag.
To refer to a JS file: src="myscript.js" inside the script tag.
To refer to an image file: src="mypic.jpg" inside an image tag.
To refer to another webpage: href="http://www.webpage.com" inside an anchor tag.
NOTE: @John-Yin's answer is more appropriate considering the changes in the specs.
Yes. There is a differentiation between src and href and they can't be used interchangeably. We use src for replaced elements, while href is used for establishing a relationship between the referencing document and an external resource.
href (Hypertext Reference) attribute specifies the location of a Web resource thus defining a link or relationship between the current element (in case of anchor a) or current document (in case of link) and the destination anchor or resource defined by this attribute. When we write:
<link href="style.css" rel="stylesheet" />
The browser understands that this resource is a stylesheet and the parsing of the page is not paused (rendering might be paused since the browser needs the style rules to paint and render the page). It is not similar to dumping the contents of the CSS file inside the style tag. (Hence it is advisable to use link rather than @import for attaching stylesheets to your HTML document.)
src (Source) attribute just embeds the resource in the current document at the location of the element's definition. For example, when the browser finds
<script src="script.js"></script>
The loading and processing of the page is paused until the browser fetches, compiles and executes the file. It is similar to dumping the contents of the JS file inside the script tag. Similar is the case with the img tag. It is an empty tag and the content that should come inside it is defined by the src attribute. The browser pauses the loading until it fetches and loads the image. [so is the case with iframe]
This is the reason why it is advisable to load all JavaScript files at the bottom (before the </body> tag)
Update: Refer to @John-Yin's answer for more info on how it is implemented as per HTML 5 specs.
apnerve's answer was correct before HTML 5 came out, now it's a little more complicated.
For example, the script element, according to the HTML 5 specification, has two boolean attributes which change how the src attribute functions: async and defer. These change how the script (embedded inline or imported from external file) should be executed.
This means there are three possible modes that can be selected using these attributes:
When the async attribute is present, then the script will be executed asynchronously, as soon as it is available.
When the async attribute is not present but the defer attribute is present, then the script is executed when the page has finished parsing.
When neither attribute is present, then the script is fetched and executed immediately, before the user agent continues parsing the page.
For details please see HTML 5 recommendation
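The three cases above amount to a small decision rule. A minimal sketch, mirroring only the list in this answer; the function name and the returned labels are made up for illustration:

```python
def script_execution_mode(has_async: bool, has_defer: bool) -> str:
    """Pick the execution mode from the presence of async and defer."""
    if has_async:
        # async wins even if defer is also present.
        return "async"      # executed as soon as the script is available
    if has_defer:
        return "defer"      # executed when the page has finished parsing
    return "blocking"       # fetched and executed before parsing continues


print(script_execution_mode(True, True))    # async
print(script_execution_mode(False, True))   # defer
print(script_execution_mode(False, False))  # blocking
```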
I just wanted to update with a new answer for whoever occasionally visits this topic. Some of the answers should be checked and archived by stackoverflow and every one of us.
I think <src> adds some resources to the page and <href> is just for providing a link to a resource(without adding the resource itself to the page).
HREF: Is a REFerence to information for the current page, i.e. CSS info for the page style or a link to another page. Page parsing is not stopped.
SRC: Is a reSOURCE to be added/loaded to the page as in images or javascript. Page Parsing may stop depending on the coded attribute. That is why it's better to add script just before the ending body tag so that page rendering is not held up.
Simple Definition
SRC : (Source). To specify the origin of (a communication); document:
HREF : (Hypertext Reference). A reference or link to another page, document...
SRC(Source) -- I want to load up this resource for myself.
For example:
Absolute URL with script element: <script src="http://googleapi.com/jquery/script.js"></script>
Relative URL with img element : <img src="mypic.jpg">
HREF(Hypertext REFerence) -- I want to refer to this resource for someone else.
For example:
Absolute URL with anchor element: Click here
Relative URL with link element: <link href="mystylesheet.css" type="text/css">
Courtesy
A simple definition
SRC: If a resource can be placed inside the body tag (for image, script, iframe, frame)
HREF: If a resource cannot be placed inside the body tag and can only be linked (for html, css)
You just need to remember when to use each one, and that is it:
the href is used with links
<link rel="stylesheet" href="style.css" />
the src is used with scripts and images
<img src="the_image_link" />
<script type="text/javascript" src=""></script>
the url is used generally in CSS to include something, for example to add a background image
selector { background-image: url('the_image_link'); }
After going through the HTML 5.1 documentation (1 November 2016):
part 4 (The elements of HTML)
chapter 2 (Document metadata)
section 4 (The link element) states that:
The destination of the link(s) is given by the href attribute, which must be present and must contain a valid non-empty URL potentially surrounded by spaces. If the href attribute is absent, then the element does not define a link.
does not contain the src attribute ...
which is logical because it is a link.
chapter 12 (Scripting)
section 1 (The script element) states that:
Classic scripts may either be embedded inline or may be imported from an external file using the src attribute, which if specified gives the URL of the external script resource to use. If src is specified, it must be a valid non-empty URL potentially surrounded by spaces. The contents of inline script elements, or the external script resource, must conform with the requirements of the JavaScript specification’s Script production for classic scripts.
it doesn't even mention the href attribute ...
this indicates that while using script tags always use the src attribute !!!
chapter 7 (Embedded content)
section 5 (The img element)
The image given by the src and srcset attributes, and any previous sibling source element's srcset attributes if the parent is a picture element, is the embedded content.
also doesn't mention the href attribute ...
this indicates that when using img tags the src attribute should be used as well ...
Reference link to the W3C Recommendation
If you're talking HTML4, its list of attributes might help you with the subtleties. They're not interchangeable.
They are not interchangeable - each is defined on different elements, as can be seen here.
They do indeed have similar meanings, so this is an inconsistency. I would assume mostly due to the different tags being implemented by different vendors to begin with, then subsumed into the spec as is to avoid breaking backwards compatibility.
They don't have similar meanings. 'src' indicates a resource the browser should fetch as part of the current page. 'href' indicates a resource to be fetched if the user requests it.
From W3:
When the A element's href attribute is
set, the element defines a source
anchor for a link that may be
activated by the user to retrieve a
Web resource. The source anchor is the
location of the A instance and the
destination anchor is the Web
resource.
Source: http://www.w3.org/TR/html401/struct/links.html
This attribute specifies the location
of the image resource. Examples of
widely recognized image formats
include GIF, JPEG, and PNG.
Source: http://www.w3.org/TR/REC-html40/struct/objects.html
I agree with what apnerve says about the distinction, but in the case of CSS it looks odd. The CSS is also downloaded to the client by the browser; it is not like an anchor tag, which merely points to a specific resource. So using href there seems odd to me. Even if it is not loaded with the page, the page cannot look complete without it, so it is not just a relationship but a resource, which in turn refers to many other resources such as images.
src is used to add that resource to the page, whereas href is used to link to a particular resource from that page.
When you use a link element in your webpage, the browser sees that it's a style sheet and hence continues with the page rendering while the style sheet is downloaded in parallel.
When you use a script element with src in your webpage, it tells the browser to insert the resource at that location. So now the browser has to fetch the JS file and then load it. Until the browser finishes the loading process, the page rendering process is halted. That is the reason why YUI recommends loading your JS files at the very bottom of your web page.
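For anyone who wants to see the split in practice, here is a small sketch that walks a page and separates the two kinds of reference. It uses Python's standard html.parser; the ResourceCollector class and the sample markup are assumptions for illustration, built from the file names used in the question.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Separates src references (embedded) from href references (linked)."""

    def __init__(self):
        super().__init__()
        self.embedded = []  # (tag, url) pairs taken from src attributes
        self.linked = []    # (tag, url) pairs taken from href attributes

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "src":
                self.embedded.append((tag, value))
            elif name == "href":
                self.linked.append((tag, value))


# Sample markup assembled from the examples in the question.
sample = (
    '<link rel="stylesheet" href="cssfile.css">'
    '<script src="myscript.js"></script>'
    '<img src="mypic.jpg">'
    '<a href="http://www.webpage.com">Another page</a>'
)

collector = ResourceCollector()
collector.feed(sample)
print(collector.embedded)  # [('script', 'myscript.js'), ('img', 'mypic.jpg')]
print(collector.linked)    # [('link', 'cssfile.css'), ('a', 'http://www.webpage.com')]
```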