Having links relative to path (i.e. http://domain/path/) - html

Are there commonly accepted ways to have all links and references to images, scripts, stylesheets be relative to some path regardless of current document's URL?
Let's start from the very beginning. I am developing a custom content managing system in PHP. I am using mod_rewrite to redirect all requests like http://domain.com/path/artist/edit/25 to http://domain.com/path/index.php?url=/artist/edit/25. So the part of the URL following http://domain.com/path/ is actually virtual.
I would like all links to be in the format like ... and references to images, scripts, etc. in the format like <link href="ui/css/style.css"...>.
Well, it seems to be possible with:
...
<base href="http://domain.com/path/" />
...
This way I can link to scripts and stylesheets in a way like below:
...
<!-- Custom page style CSS -->
<link href="ui/css/style.css" rel="stylesheet" type='text/css'>
<!-- Support for CSS3 media query in IE8 -->
<script type="text/javascript" src="ui/js/respond.js"></script>
<!-- MooTools 1.6.0 -->
<script type="text/javascript" src="ui/js/MooTools-Core-1.6.0.js"></script>
...
However, AFAIK the <base href=...> should match the current page request (which is http://domain.com/path/artist/edit/25). And it ruins the whole concept.
That's why I need you to clarify:
Is it a commonly accepted practice to have <base href=...> pointing to a directory and not to the current document URL?
Does this practice comply with the requirements for the usage of HTML <base> element?
Will it in any way affect crawlers like Googlebot? Do they require the <base href=...> to match each particular document URL?
I would also like to know how you solve the problem of relative links and references to resources when part of the URL is virtual. I have discovered that projects like WordPress tend to avoid relative links completely and go the "absolute links" way.

The whole point of the base element is to specify an arbitrary base URL to be used to resolve relative links instead of the current-document URL. Otherwise the element would not make sense since current-document URL is used as the base url by default anyway.
Major crawlers support both absolute and relative URLs as well as the base element. Some shake-and-bake crawlers don’t understand relative URLs and/or don’t support the base element (thus resulting in multiple 404 lines in your server logs, though this is a minor thing).
I would recommend not using the base element. Relative links tend to be error-prone, resolving to wrong URLs without providing any serious benefit. It’s generally more reasonable and easier to always use absolute URLs.
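To make the resolution rules described above concrete, here is a minimal sketch using Python's standard urllib.parse.urljoin, which follows the same relative-reference rules browsers apply. The URLs are the ones from the question; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# URLs taken from the question.
document_url = "http://domain.com/path/artist/edit/25"
base_href = "http://domain.com/path/"
relative_ref = "ui/css/style.css"

# Without <base>, the reference resolves against the (virtual) document URL,
# so the stylesheet is looked up under /path/artist/edit/.
without_base = urljoin(document_url, relative_ref)

# With <base href="http://domain.com/path/">, it resolves against that directory.
with_base = urljoin(base_href, relative_ref)

print(without_base)  # http://domain.com/path/artist/edit/ui/css/style.css
print(with_base)     # http://domain.com/path/ui/css/style.css
```

The first result is the kind of wrongly resolved URL that shows up as a 404 when the virtual part of the path is mistaken for a real directory.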

Is it a commonly accepted practice to have <base href=...> pointing to a directory and not to the current document URL?
No, it's not common. In fact I'd say it's very uncommon, because there are better ways to create a logical information architecture for your site without it.
Will it in any way affect crawlers like Googlebot? Do they require the <base href=...> to match every particular document URL?
It's hard to get the base tag right, and there are better ways to do what you want that are transparent to Googlebot and other crawlers.
Note that absolute links are what you see in the source, but that does not mean the links physically map to directories and files. Using tools like mod_rewrite on Apache you can structure your site in as many ways as you please on top of practically any physical filesystem. This is also what I'd recommend, because as things change you're not tied to a particular solution. It is also why most PHP apps send everything through an index.php script: the application then controls the information architecture, not the filesystem.

"base href" can be used without problems, but it is not always the best solution. It is fine if your server has to answer requests under different server names and paths (e.g. "http://www.example.com/companysection/especificservice" and "http://service.internalnetwork.dev/").
IMHO it's not the best solution for your case.
In the URL "http://example.com/path/index.php?url=/artist/edit/25" you want to turn part of the query into a path (base example.com/path/index.php ?url= )... and this can be a big problem. How are you going to handle requests whose virtual path also carries a query string (a search term or a form GET, for example)?
Apache mod_rewrite would be a better option, as Harry's answer suggests (or nginx rewrite rules). With it you can easily "transform" a request like http://example.com/path/artist/edit/25?search=something&order=ASC into http://example.com/path/index.php?url=artist/edit/25&search=something&order=ASC
This will give you fewer problems in the long term.
Check the last example in https://wiki.apache.org/httpd/RewriteQueryString , it comes really close to fulfilling all your rewriting needs
(you will just need to ensure you handle the rest of query properly)
Take a URL of the form http://example.com/path/var/val and transform
it into a var=val query http://example.com/path?var=val. Essentially
the reverse of the above recipe. This example will work for any valid
three level URL. http://example.com/path/var/val will be transformed
into http://example.com/path?var=val.
RewriteRule ^/path/([^/]+)/([^/]+) /path?$1=$2
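As a rough illustration of the transformation described in this answer (virtual path moved into a url parameter, rest of the query string preserved), here is a small Python sketch. It only models the string rewriting and is not how mod_rewrite works internally; the function name rewrite_to_front_controller and the mount default are made up for the example. In Apache itself, the QSA flag on a RewriteRule is the usual way to keep the original query string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_to_front_controller(url, mount="/path/"):
    """Hypothetical helper: move the virtual path into ?url= and keep the other parameters."""
    parts = urlsplit(url)
    if not parts.path.startswith(mount):
        return url  # outside the application, leave untouched
    virtual_path = parts.path[len(mount):]
    # The virtual path goes first, followed by the original query parameters.
    query = [("url", virtual_path)] + parse_qsl(parts.query)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        mount + "index.php",
        urlencode(query, safe="/"),  # keep slashes readable inside the url parameter
        "",
    ))

rewritten = rewrite_to_front_controller(
    "http://example.com/path/artist/edit/25?search=something&order=ASC"
)
print(rewritten)
# http://example.com/path/index.php?url=artist/edit/25&search=something&order=ASC
```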

Related

Link: Response Header VS HTML

I am currently working on a function to assist in preparing Link: HTTP header or a set of <link> tags and while reading different materials on this, I still am not able to find an answer to simple question: when to use Link: header and when to use <link>.
So far I can only say that if you want to use HTTP/2 server push, it is recommended to use the header. On the other hand, even if I push a stylesheet, it will not be applied unless there is a corresponding tag in the HTML output.
Since I am preparing the library in order to help with some standardization and sanitization, I would like to catch, at least, some "weird" cases like this, if it's possible, but for that I need some set of recommendations or best practices in that regard. Sadly I am unable to find any thus far, so am turning to more knowledgeable people: what best practices or weird cases should I consider catching or should I just allow whatever to be sent regardless of whether it's a header or a tag?
If anyone is interested, the code is present in https://github.com/Simbiat/HTTP20/blob/main/src/Headers.php (links function).
They are supposed to be equivalent, as @Evert states, so in theory you can use either. However, there are some considerations:
Headers are usually set in web server config (at least for static pages) which may not be as easy to update for developers.
However it has the added advantage that you can set these for multiple pages all at once (e.g. preload your core fonts on every .html file, rather than having to remember to set this on all pages, or all page templates if using a CMS).
On the other side with the HTML version it’s often easier to configure it per page (or page template), if you have different needs (e.g. different fonts are used in different pages).
Some also say there are slight performance benefits to doing it in the header but honestly, as long as it’s high enough in the <head> element, I really think you’d struggle to notice this.
Perhaps of more importance is whether it’s passed on hop to hop if your web server is hidden behind other infrastructure (e.g. a CDN or other proxy). In theory it should be, for simple headers, but for things like HTTP/2 push that’s not so easy. If it’s in the HTML you don’t need to worry about this (assuming intermediaries are not changing the markup of course!).
You mentioned the HTTP/2 push use case, and that definitely needs the header (though this is not a defined standard method of setting push, and some servers or CDNs use other methods, but many use this). However, given HTTP/2 push’s complexities and concerns, it can cause more problems than it solves, so this is maybe a reason to recommend the HTML method to ensure it’s never pushed.
All in all I recommend setting this in the HTML. It’s just easier.
This is not the case however with other, similar things, which can be set in HTML and HTTP headers. CSP for example is limited in the HTML version, lacking some features of the HTTP Header version, and is also not recommended as it could be altered with JavaScript whereas the HTTP header cannot. But for simple Link headers these are less of a concern.
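To show what "equivalent" means in practice, here is a minimal Python sketch that renders the same link either as a Link: header or as a <link> tag. The helper names as_header and as_tag are made up for illustration and are unrelated to the library in the question; the header layout follows the usual Link: <target>; param="value" form, and the header variant does no sanitization of parameter values.

```python
from html import escape

def as_header(target, params):
    """Render a link as an HTTP Link header line (hypothetical helper)."""
    parts = [f"<{target}>"]
    parts += [f'{name}="{value}"' for name, value in params.items()]
    return "Link: " + "; ".join(parts)

def as_tag(target, params):
    """Render the same link as an HTML <link> element (hypothetical helper)."""
    attrs = [("href", target)] + list(params.items())
    rendered = " ".join(
        f'{name}="{escape(str(value), quote=True)}"' for name, value in attrs
    )
    return f"<link {rendered}>"

preload = {"rel": "preload", "as": "style"}
print(as_header("/ui/css/style.css", preload))
# Link: </ui/css/style.css>; rel="preload"; as="style"
print(as_tag("/ui/css/style.css", preload))
# <link href="/ui/css/style.css" rel="preload" as="style">
```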

Is it safe to allow to embed an arbitrary external stylesheet into my web-page?

I have a dynamic web-page which I want other people to embed into their web-pages, with an iframe (not necessarily with any kind of more advanced techniques like JavaScript).
Instead of providing all sorts of designs and styles myself, I'm thinking about allowing them to provide their own stylesheet for my page through an HTTP GET parameter, and embed such external stylesheet through a URL w/ <link type="text/css" rel="stylesheet" href… on my page.
Is this safe? Will it violate the security paradigm of my web-site? I'm aware that extra text could be inserted with CSS alone, and indeed elements could be removed (which is the whole point of me providing such functionality for my users), but anything else I should be aware of?
Could malicious people insert links onto my site through such a CSS, to benefit from my http referer and potentially violate some checks, or is CSS insertion limited to text?
In the general case, no, allowing third-party CSS is not safe. Some implementations allow JavaScript in CSS, which means that allowing users to modify your CSS allows them to execute arbitrary JavaScript in the context of your page.
However, if this is meant to be sort of a "white-label" page, where it appears to be part of the site it's embedded in and the fact that it's really your page is just an implementation detail, this doesn't seem like a major concern. The person specifying the "third-party" CSS is the site owner, so it's not really third-party at that point — they're not going to XSS themselves!
But nobody else should ever be putting CSS on a page that's meant to be under your control, because it's really under the control of whoever is controlling the CSS.
CSS cannot insert linkable content. It can only style, position and hide what's already there. Sure, people can mess up your page with :before and :after text and perhaps make things look a little confusing or change labels on existing links, but not the URLs themselves.

Is it a bad practice to use relative urls with an explicit base?

Is it a bad practice to use relative urls with explicit base for dynamic site?
For example, like this one:
<base href="http://my-site.com/mount-point-of-site">
...
<img src='/my-page/my-image.jpg'></img>
I need it because mount point of site can be changed over time, and I need to preserve referential integrity of wiki-like content produced by users (links to relative pages, relative image paths, ...).
But I never saw such technique in use for dynamic web applications, usually it's handled on the server-side.
Are there any specific disadvantages of such a technique that may bite me later? SEO, cross-browser / mobile compatibility, some other aspects?
I get what you're saying about applications not using absolute urls. You'll typically set the base url in a config file, not as a meta tag in that instance.
Best practice? Always use absolute URLs in case anyone links to your stuff or scrapes your links; things will still point to your site instead of their site.
SEO folks will agree with the absolute url rule.
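One thing worth checking with the exact markup from the question is how the references actually resolve. Here is a small sketch with Python's urllib.parse.urljoin, which applies the standard relative-reference rules; the trailing-slash variant of the base is an assumption added for comparison.

```python
from urllib.parse import urljoin

base_no_slash = "http://my-site.com/mount-point-of-site"      # as written in the question
base_with_slash = "http://my-site.com/mount-point-of-site/"   # assumed variant for comparison

# A reference starting with "/" replaces the whole path of the base,
# so the mount point is ignored.
root_relative = urljoin(base_with_slash, "/my-page/my-image.jpg")

# Without a trailing slash, the last segment of the base is treated
# as a file name and dropped.
dropped_segment = urljoin(base_no_slash, "my-page/my-image.jpg")

# Only a slash-terminated base combined with a slash-less reference
# stays under the mount point.
under_mount = urljoin(base_with_slash, "my-page/my-image.jpg")

print(root_relative)    # http://my-site.com/my-page/my-image.jpg
print(dropped_segment)  # http://my-site.com/my-page/my-image.jpg
print(under_mount)      # http://my-site.com/mount-point-of-site/my-page/my-image.jpg
```

So for the mount point to take effect, the base would need a trailing slash and the user content would need references without a leading slash.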

Do canonical links require a full domain?

I want to add canonical links to my pages, but do I need to specify the domain, or will a relative URL do?
In other words, is:
<link rel="canonical" href="/item/1">
good enough, or do I need to use:
<link rel="canonical" href="http://mydomain.com/item/1">
Directly from Google:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
Can the link be relative or absolute?
rel="canonical" can be used with relative or absolute links, but we recommend using absolute links to minimize potential confusion or difficulties. If your document specifies a base link, any relative links will be relative to that base link.
Again, Google says this:
https://support.google.com/webmasters/answer/139066?hl=en
Avoid errors: use absolute paths rather than relative paths with the rel="canonical" link element.
Use this structure: https://www.example.com/dresses/green/greendresss.html
Not this structure: /dresses/green/greendress.html
For example’s sake, these are their URLs:
http://example.com/wordpress/seo-plugin/
http://example.com/wordpress/seo/seo-plugin/
This is what rel=canonical was invented for. Especially in a lot of e-commerce systems, this (unfortunately) happens fairly often, where a product has several different URLs depending on how you got there. You would apply rel=canonical in the following method:
You pick one of your two pages as the canonical version. It should be the version you think is the most important one. If you don’t care, pick the one with the most links or visitors. If all of that’s equal: flip a coin. You need to choose.
Add a rel=canonical link from the non-canonical page to the canonical one. So if we picked the shortest URL as our canonical URL, the other URL would link to the shortest URL like so in the <head> section of the page:
<link rel="canonical" href="http://example.com/wordpress/seo-plugin/">
That’s it. Nothing more, nothing less.
All href attributes are hypertext references - that's what it stands for. As such, they are always URI-References, not URIs, and can be relative.
In this case though, there's a benefit in putting in the full URI if you can, because it will survive anything that migrates it onto another domain in the future (assuming you will still want the domain listed to be the canonical one), and can even survive some of the cruder automated plagiarisms :)
That benefit is pretty slight if you aren't actively using non-canonical versions on other domains though, so I wouldn't expend much effort on it.
There is nothing special about canonical. It’s a standard link type, for use with standard ways to provide links (e.g., the link element), so you can specify any kind of URL reference (absolute, relative, protocol-relative, in combination with the base element, empty, …).
RFC 6596 (The Canonical Link Relation) explicitly says:
The target (canonical) IRI MAY:
Specify a relative IRI (see [RFC3986], Section 4.2).
One of the examples:
[…] or as a relative IRI:
<link rel="canonical" href="page.php?item=purse">
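For illustration, here is roughly how a consumer could turn a canonical href into a full URL, sketched with Python's urllib.parse.urljoin. The function name resolve_canonical and the document URL are assumptions made up for the example; real crawlers may apply additional rules of their own.

```python
from urllib.parse import urljoin

def resolve_canonical(document_url, href, base_href=None):
    """Hypothetical helper: resolve a canonical href the way any other link is resolved."""
    # A <base href> may itself be relative, so resolve it against the document first.
    effective_base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(effective_base, href)

document_url = "http://www.example.com/shop/page.php?item=purse&sort=asc"  # assumed URL

print(resolve_canonical(document_url, "page.php?item=purse"))
# http://www.example.com/shop/page.php?item=purse
print(resolve_canonical(document_url, "/item/1"))
# http://www.example.com/item/1
print(resolve_canonical(document_url, "http://mydomain.com/item/1"))
# http://mydomain.com/item/1
print(resolve_canonical(document_url, "item/1", base_href="http://www.example.com/"))
# http://www.example.com/item/1
```

The second call shows why a relative canonical follows the page to whatever host serves it, while the absolute form always names the same domain.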
Update on canonical best practices: rel="canonical" has cross-domain support. Google's source: https://webmasters.googleblog.com/2009/12/handling-legitimate-cross-domain.html
Moreover, the introduction of structured data makes the use of canonicals even stricter, as Google will not pick up the JSON markup from non-canonical sources (a mistake I happen to have made!).
Relative canonical paths are accepted. This one works best:
<link rel="canonical" href="#"/>
It points to the current document's URL – including queries – sans the hash part.
If you only have one domain for that website, it is OK to use the absolute path:
<link rel="canonical" href="/item/1">

Where to place the humans.txt file if I cannot put it on the site root?

Background
I'm building a web application for a client.
This app will be accessible to the world and will be placed in a directory (e.g., /my-app) in web-root. A subdomain isn't an option as they don't want to cough up the dough for another SSL cert.
/my-app is the only directory that I'm allowed to touch (unreasonable IT guys).
I'm using an icon set which requires attribution.
I've contacted the original author of the icon set and have gotten permission to link back to his work in the THANKS section of a humans.txt file.
I also feel like I should mention some other people's work. This information combined with the above will probably take up a good 20 lines, so a separate file like humans.txt seems like an ideal place to put this considering that I'll be serving minified markup, CSS, and script files.
Questions
Since I'm not allowed to place a humans.txt file in web-root, (and even if I was, it wouldn't really make much sense to put it there as it only applies to the /my-app portion of the site) is it acceptable to do the following:
Create: /my-app/humans.txt
Place: <link rel="author" href="//example.com/my-app/humans.txt"> in my markup
I'll be serving strict HTML 4.01 and the author value for the rel attribute doesn't seem to be a recognized link type in that specification. Do I need to do anything extra to define the author link type, or is the act of using it enough?
I don't even know if there are any non-spider tools that actually use this file at the moment, but I'd like to minimize the chance of this not working in the future when something does come along.
I think it is ok to put the file in the applications own directory, since it clarifies that it is specific to the content inside the directory and not all the other stuff you might find in the root directory.
Of course it would be nice to have 0 errors in HTML strict mode. However, this is one situation where you have to decide whether you want to
keep to the standard and not insert the link tag (maybe put it in a comment or as a real link on a credits page)
ignore the standard, because the standard is nice but not the holy grail (there are far worse errors you can make than that)
choose another Doctype which allows you to use the link tag you want, but then test again that all browsers render the new Doctype correctly
However I can not make this decision for you ;)