Distinguish HTML documents by URL suffix - html

What a browser receives as an HTML file can have many different filename extensions in the URL path: .html, .htm, /, .php, .asp, .stm, .cgi, etc.
Is there a way to distinguish, from only the request URL, whether it points to an HTML document or some additional data (e.g. .png, .css, .js, ...)? This should be determined at the time of the request, so waiting for Content-Type is not an option.
HTML URLs
google.com/, stackoverflow.com, https://en.wikipedia.org/wiki/Uniform_Resource_Locator, https://www.google.de/search?q=content-length, http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html
non-HTML URLs
http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon#2.png?v=73d79a89bded, http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js, http://cdn.sstatic.net/stackoverflow/all.css?v=aaf07438bdbd
Maybe filtering out the non-HTML parts (for example, by js, css, png, jpg, ...) would work. An alternative would be to filter by the extensions listed in "What are common file extensions for web programming languages?" and include directories and domains.
It need not be perfect; close enough would be good.

Is there a way to distinguish, from only the request URL, whether it
points to an HTML document or some additional data (e.g. .png, .css,
.js, ...)? This should be determined at the time of the request, so
waiting for Content-Type is not an option.
No, this is not possible.
The webserver can do anything it wants in response to a request.
Some responses are static, i.e. files on disk (and even then the extension is no guarantee of the file's real contents). Others are entirely dynamic, and only the server decides what kind of data to return. It could return a JPEG image in response to a .html request, and the opposite happens a lot in the real world: a .jpg URL that returns an HTML page with a download link for that image.
Many URLs don't have an extension at all, so checking the extension is not a general solution.
The earliest reliable point is the Content-Type response header (assuming it matches the data).
If the client doesn't want to download the full response just to check the Content-Type, it can make a HEAD request, which returns only the HTTP headers.
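As a rough illustration of the HEAD approach, here is a minimal Python sketch using only the standard library. The function names are made up for this example, and treating text/html and application/xhtml+xml as "HTML" is an assumption you may want to adjust.

```python
from urllib.request import Request, urlopen


def is_html_content_type(header_value):
    """Return True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def head_is_html(url, timeout=10):
    """Send a HEAD request and classify the resource by its Content-Type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return is_html_content_type(response.headers.get("Content-Type"))
```

Calling head_is_html(url) still costs one round trip, and some servers answer HEAD differently from GET, so treat the result as a strong hint rather than a guarantee.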

No.
URLs are, once you hit the path segment, entirely arbitrary.
Sometimes the URL will include something which happens to match a filename on the HTTP server's hard disk. Sometimes that filename will give a clue about what kind of data is in it. Often it will give a clue about how the server will execute a program which will generate content of any kind.
The authoritative description of what an HTTP resource contains is the Content-Type response header (and sometimes servers give wrong information there anyway).

No, that's not possible (assuming you're looking for something reliable).
In general, the format of a URI is independent of the media type of the resource it identifies. That's how the web works.

The answer below is outdated. In Python, the mimetypes module in the standard library guesses a media type from a filename or URL, which is exactly the kind of heuristic asked for.
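A small sketch of how that could look. The helper name guess_from_url is hypothetical; the query string is stripped first so that only the path's extension is considered. This is purely an extension-based guess and says nothing about what the server will actually send.

```python
import mimetypes
from urllib.parse import urlsplit


def guess_from_url(url):
    """Guess a media type from the extension of the URL path, or None."""
    # Only the path matters here; the query string and fragment are ignored.
    path = urlsplit(url).path
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type
```

A result of None (no recognisable extension, as with /search or /wiki/Uniform_Resource_Locator) can be treated as "probably HTML" if close enough is good enough.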
Old answer
As a bit of reasoning: URLs containing file extensions like .html are implementation specifics. When you change from CGI to something else, you would be forced to either abandon the URL, breaking links, or keep a now-incorrect extension around. See also the Semantic URL wiki page and Cool URIs don't change.

Related

Could we pass GET data to css?

I just came across a website's page source and saw this in the head:
<link href="../css/style.css?V1" rel="stylesheet" type="text/css" />
Could we actually pass GET data to CSS? I tried searching but found no results apart from using PHP. Could anyone help me make sense of the ?V1 after the .css?
I know this forum is for asking about programming problems, but I decided to ask since my searches turned up nothing.
First of all, no, you can't pass GET parameters to CSS itself. Sorry. That would have been great though.
As for the example URL: it can be a CSS page generated by any web server (it doesn't have to be PHP). In this case the server can serve different pages, or different versions of the same page, which might explain V1 as meaning Version 1. The server can also generate the page dynamically with a server-side template. This is an example from the Jade documentation:
http://cssdeck.com/labs/learning-the-jade-templating-engine-syntax
It can also just be used as a cache buster, for versioning purposes. When you request a URL, the browser fetches it only if it doesn't already have a cached copy for that exact URL. If you have changed your content (in this instance the CSS file) and you want the browser to use it instead of the cached version, you can change the URL and trick the browser into thinking it's a new, uncached resource, so it fetches the new content from the server. V1 can then have a semantic meaning to the developer, serving as a note (i.e. I've changed this file once, twice, etc.), without actually doing anything except break the cache. This question addresses cache busting.
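To make the cache-busting idea concrete, here is a minimal sketch that derives the version token from the file's contents instead of a hand-maintained V1. The function name and the choice of a truncated SHA-256 digest are assumptions for illustration, not something the site in question is known to do.

```python
import hashlib


def cache_busted_url(url, content):
    """Append a short content hash so the URL changes whenever the file does."""
    digest = hashlib.sha256(content).hexdigest()[:10]
    # Use "&" if the URL already carries a query string.
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}v={digest}"
```

The same bytes always yield the same URL, so the browser keeps using its cache until the stylesheet really changes.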
There are different concepts.
At first, it is only a link. It has a name and it might have an extension, but that is just a convention for humans and nothing more than a resource identifier for the server. Once the browser requests it, it becomes a server request for a resource. The server then decides how to handle this request. It might be a simple file that just has to be returned, it might be a server-side script that has to be executed by an interpreter, or basically anything else you can imagine.
Again, do not trick yourself into thinking "this is a CSS file" just because it has a css extension or is called style.
Whatever runs on the server and actually answers the request will return something, and that something is then given a meaning. It might be CSS, it might be HTML, it might be JavaScript, an image, or just a binary download. To help the browser understand what it is, the server returns a Content-Type header.
If no content type is given, the browser has to guess what it is. Or the web author was nice enough to hint at what to expect in the response; in this case they gave the hint text/css (the type attribute). Again, this says how the returned content should be interpreted by the client/browser, not how that content is supposed to be created on the server side.
And about the ?V1: this could mean different things. Maybe the user can configure a style (theme) for the website and this method is used to dispatch different styles. Or it can be used for something called "cache busting" (look it up).
You can pass whatever you want; the server decides what to do with the data.
After all, PHP isn't your only option for creating a server. If I wrote a server in Node.js, set up a route for /css/style.css and made it return different things depending on the query given, neither the server nor the browser would bat an eyelid.
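Here is a minimal sketch of that idea, written in Python's standard http.server rather than Node.js. The theme parameter, the THEMES table and the handler name are invented for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

# Hypothetical stylesheets keyed by a "theme" query parameter.
THEMES = {
    "light": "body { background: #fff; color: #111; }",
    "dark": "body { background: #111; color: #eee; }",
}


def css_for_query(query):
    """Pick a stylesheet from the query string, falling back to 'light'."""
    theme = parse_qs(query).get("theme", ["light"])[0]
    return THEMES.get(theme, THEMES["light"])


class StyleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parts = urlsplit(self.path)
        if parts.path != "/css/style.css":
            self.send_error(404)
            return
        body = css_for_query(parts.query).encode("utf-8")
        self.send_response(200)
        # The Content-Type header, not the ".css" in the URL, tells the browser what this is.
        self.send_header("Content-Type", "text/css; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server until interrupted."""
    HTTPServer(("127.0.0.1", port), StyleHandler).serve_forever()
```

After calling serve(), a request for /css/style.css?theme=dark returns the dark rules, while a bare ?V1 is simply ignored and yields the default.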

Treat no extension files as html?

So I'm recreating a website from web.archive.org. I've downloaded it and it has many pages. The problem is that the old site was a PHP forum script, and I obviously can't recreate that now. For the time being I will be satisfied with static HTML until I build something else.
So the problem now is that there are a lot of files generated from the query urls like this:
index.php#lang=fr
index.php#lang=fr&section=4
index.php#lang=fr&section=5
index.php#section=15&fonc=imp&lang=fr
etc...
And when I upload these files to my server, the browser treats these no-extension files as text instead of HTML, despite the HTML content inside.
Can anyone tell me why this is happening and whether there is an easy way to solve it?
EDIT: Apparently the download software I used replaced the ? in the original URLs with #. But if I just bulk rename all files from # to ?, they still won't open. So how about the ultimate solution below: how can I do that painlessly and fast?
Ultimately I would like to place each of the old files in one folder, rename them to .html, and then create htaccess rules mapping the original URLs to the respective files in that folder. However, doing this manually would take forever. Can anyone suggest a simpler solution?
This happens because your default content type is likely configured to be text/plain (which is the default in Apache). With HTTP, a resource type is not indicated by a file name extension, it is indicated by the Content-Type response header.
I think that you will have to set the default Content-Type header with this directive in your configuration:
DefaultType text/html
See also: http://httpd.apache.org/docs/2.2/mod/core.html#defaulttype
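To illustrate what a default content type does (this is not Apache's implementation, just a sketch of the same fallback logic in Python), consider a server that guesses the type from the filename and falls back to text/html when it cannot. The function name is hypothetical.

```python
import mimetypes


def content_type_for(filename, default="text/html"):
    """Guess a Content-Type from the filename, falling back to a default."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Names like "index.php#lang=fr" have no recognisable extension,
    # so the default decides how the browser will treat them.
    return guessed or default
```

With a text/plain default the archived pages show up as source text; with text/html they render.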

HTML5 read files from path

Using the HTML5 File API we can read files with the help of an input of type file. What about reading files by path, like
/images/myimage.png
etc.?
Any kind of help is appreciated.
Yes, if it is Chrome! Play with the FileSystem API and you will be able to do that.
The simple answer is: no. Once your HTML/CSS/images/JavaScript has been downloaded to the client, it is cut loose from the server.
Simplistic Flowchart
User requests URL in Browser (for example; www.mydomain.com/index.html)
Server reads and fetches the required file (www.mydomain.com/index.html)
index.html and its linked resources will be downloaded to the user's browser
The user's Browser will render the HTML page
The user's Browser will only fetch the files that came with the request (images/someimages.png and stuff like scripts/jquery.js)
Explanation
The problem you are facing is that once the HTML is rendered locally it has no link to the server anymore, so the page cannot ask what files /images/ contains: that directory resides on the server.
Work-around
What you can do, although it sidesteps the point of the question, is write a server-side script in JSP/PHP/ASP/etc. This script then traverses the directory you want. In PHP you can do this with opendir() (http://php.net/opendir).
With an XHR/AJAX call you can request the PHP page, which returns the directory listing. The easiest way to do this is with jQuery's $.post() function in combination with JSON.
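The server-side half of that work-around can be sketched as follows, here in Python instead of PHP. The function name and the extension whitelist are assumptions; hooking it up to a URL depends on your server.

```python
import json
import os

# Only expose file types you actually intend to publish.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def list_images_json(directory, extensions=IMAGE_EXTENSIONS):
    """Return a JSON array of the image file names found in a directory."""
    names = sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.lower().endswith(extensions)
    )
    return json.dumps(names)
```

The client-side script would then request this listing and build its image URLs from the returned names.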
Caution!
Keep in mind that if you use the work-around, you expose a URL that lets anyone see what is in the directory you list. For example, http://www.mydomain.com/my_image_dirlist.php would then return a stringified list of everything (or less, based on rules in the server-side script) inside http://www.mydomain.com/images/.
Notes
http://www.html5rocks.com/en/tutorials/file/filesystem/ (seems to work only in Chrome, but would still not be exactly what you want)
If you don't need all files from a folder, but only those that have been downloaded to your browser's cache for the current page, you could search online for ways to access the browser cache (downloaded files) of the currently loaded page. Or build something like a DOM walker and CSS reader (regex?) to find all referenced files.

HTML - Images with wrong extension

If an image file name does not reflect its correct file type (e.g. it is stored with a .pdf extension), is it safe to use it in HTML? Will the browser decide the correct type of the image? Will mobile browsers be able to deduce the correct file type?
I have tested it with Google Chrome and it works, but is it guaranteed to work in all reasonable browsers?
UPDATE: I can't rename them to correct extensions, since they will be uploaded by users and then shown again.
Will mobile browsers be able to deduce correct file type?
Browsers don't usually deduce file types (there are exceptions, notably in IE—resulting in text files discussing HTML being treated as HTML and IIS servers sending text/plain content-types for HTML documents without their owners noticing—but they shouldn't be the primary concern).
Instead, browsers determine the type of data by examining the HTTP Content-Type Response header. By default, most servers will set this based on the file extension of the file they are reading from the filesystem to serve to the client.
You can override this, but doing so is fiddly and could cause problems if people save a file before opening it from their local file system (because it will have the wrong extension and their OS will associate it with the wrong application).
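If you do decide to override the extension-based default for user uploads, one option is to pick the Content-Type from the file's leading bytes. The sketch below checks a few well-known image signatures (PNG, JPEG, GIF); the function name is hypothetical and the list is deliberately incomplete.

```python
# Leading "magic" bytes of some common image formats.
SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_image_type(data):
    """Return the media type matching the file's first bytes, or None."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return None
```

A server could read the first few bytes of the stored upload, call sniff_image_type(), and send the result as Content-Type regardless of the file name.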

Is it better practice to add the file extension to an "href" value?

If I have a very simple http directory:
default.html
info.html
contact.html
etc...
When default.html links to the other files in the directory, I've always been able to simply use an anchor tag like this:
<a href="contact">Contact me!</a>
Will this always work, assuming that there is only one file in the directory with a name matching this extension-less href value?
It depends on the server and how it is set up.
Remember that there's no innate mapping between URIs and files on a webserver; the webserver always follows some sort of rule to decide what file to send. The simplest rule takes the path part of the URI and maps it directly to a file path local to the webserver, but it could be doing just about anything else. A common case is using the file extension for content negotiation: if you have contact.html, contact.atom and so on in the same local directory corresponding to the path, the server picks the one that best matches the Accept header sent by the user agent.
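A simplified sketch of that kind of content negotiation, in Python. It only understands media ranges and q-values from the Accept header, and the variants mapping is a made-up stand-in for the files the server finds next to the requested path; real servers apply more elaborate rules.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((parts[0].lower(), q))
    return ranges


def quality_for(media_type, ranges):
    """Return the q-value of the most specific range matching media_type."""
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for pattern, q in ranges:
        if pattern == media_type:
            specificity = 2
        elif pattern == main_type + "/*":
            specificity = 1
        elif pattern == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose_variant(accept_header, variants):
    """Pick the file whose media type the client prefers, or None."""
    ranges = parse_accept(accept_header or "*/*")
    best = max(variants, key=lambda media_type: quality_for(media_type, ranges))
    return variants[best] if quality_for(best, ranges) > 0 else None
```

With variants = {"text/html": "contact.html", "application/atom+xml": "contact.atom"}, a feed reader that prefers Atom gets contact.atom from the same /contact URI that gives a browser contact.html.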
Putting file extensions (whether of "static" files or handlers like .php, .aspx, etc.) in URIs is rather pointless since there is no such thing as a file on the web (there are files on the server, and the client can save the stream to a file, but on the web itself there are octet streams that may or may not correspond to a file). It is also less than ideal: presumably contact.html has something to do with contact details, and while "contact" expresses this idea well, ".html" has nothing to do with contact details and doesn't belong there.
Hence the more sensible URI would not have ".html" in it, unless this was in some way expressing something useful (such as explicitly asking for a HTML version and bypassing content-negotiation, or if the page was actually about HTML).
On the other hand, just mapping directly to file names is a quick and easy way to do things, so while I certainly frown on such arbitrary cruft in URIs I won't jump through too many hoops not to use it, especially in secondary URIs used for stylesheets, images etc. rather than those which are expected to regularly appear in the address bar of a browser.
On the third hand, once you remove such cruft, adding more sophisticated handling later, if required, becomes a much easier transition.
There is a content negotiation feature in Apache2 which does that, but personally, I do not like to rely on it.
If I need nice URLs, I'd rather use mod_rewrite and implement a completely custom URL scheme which is easy to modify and customize without limits.
http://httpd.apache.org/docs/2.0/content-negotiation.html
No, the server does not automatically append .html, as it cannot know which file extension to use. Say you have both a contact.html and a contact.php: which one should it use?
However, you can do all this using rewrite rules (e.g. in a .htaccess file). Just search for some examples here on SO or on the web.
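The ambiguity goes away once you state a priority order, which is essentially what such a rule has to encode. A tiny Python sketch of the decision follows; the function name and the default order are assumptions, and this is not how any particular server implements it.

```python
def resolve_extensionless(name, existing_files, preferred=(".html", ".php")):
    """Map an extension-less name to a file, trying extensions in order."""
    for extension in preferred:
        candidate = name + extension
        if candidate in existing_files:
            return candidate
    return None
```

Given both contact.html and contact.php, the order of preferred decides which one /contact resolves to.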