Syntax of HTTP-status headers - language-agnostic

There're many ways to write an HTTP-status header:
HTTP/1.1 404 Not Found
Status: 404
Status: 404 Not Found
but which is the semantically-correct and spec-compliant way?
Edit: By status headers I mean this, using a function such as PHP's header().

Adding some information some time later, since I came across this question whilst researching something related.
I believe the Status header field was originally invented as part of the CGI specification, RFC 3875:
https://www.rfc-editor.org/rfc/rfc3875#section-6.3.3
To quote:
The Status header field contains a 3-digit integer result code that
indicates the level of success of the script's attempt to handle the
request.
Status = "Status:" status-code SP reason-phrase NL
status-code = "200" | "302" | "400" | "501" | extension-code
extension-code = 3digit
reason-phrase = *TEXT
It allows a CGI script to return a status code to the web server that overrides the default seen in the HTTP status line. Usually the server buffers the result from the script and emits a new header for the client. This one is a valid HTTP header which starts with an amended HTTP status line and omits the scripts "Status:" header field (plus some other transformations mandated by the RFC).
So all of your examples are valid from a CGI script, but only the first is really valid in a HTTP header. The latter two are only valid coming from a CGI script (or perhaps a FastCGI application).
A CGI script can also operate in "non-parsed header" (NPH) mode, when it generates a complete and valid HTTP header which the web server passes to the client verbatim. As such this shouldn't include a Status: header field.
Note, what I am interested in is what which status should win if an NPH script gets it a bit wrong and emits the Status: header field, possibly in addition to the HTTP status line. I can't find any clear indication so and I suspect it is left to the implementation of whatever is parsing the output, either the client or the server.

Since https://www.rfc-editor.org/rfc/rfc2616#section-6 and more specifically https://www.rfc-editor.org/rfc/rfc2616#section-6.1 does not mention use of "Status:" when indicating a status code, and since the official list of headers at http://www.iana.org/assignments/message-headers/message-headers.xml does not mention "Status", I'd be inclined to believe it should not be served with it as a header.

The closest thing I've found to an answer is the Fast CGI spec, which states to set status codes through Status and Location headers.

A lot of them are pretty much arbitrary strings, but there here is the w3c's spec for the commonly used ones
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Related

Error in open.connection(x,"rb") : HTTP error 406

I am trying to read the contents of a website using read_htmlin R. However, for some websites like http://benchmarkrealestate.com/, I get this error. Error in open.connection(x,"rb") : HTTP error 406
What does this error mean? This only happens in some websites. I tried to look it up online, but wasn't able to find the exact reason why I get this error.
How do I fix this?
406 Not Acceptable
The requested resource is capable of generating only content not
acceptable according to the Accept headers sent in the request.
The sentence above is lifted right off of Wikipedia.
Basically, whenever a Web crawler makes a request to a website, it often identifies itself, its application type and other information by submitting a characteristic identification string to its operating peer, i.e. the web server. In this case, this identification is transmitted in a header field called User-Agent.
One way to have the content of the web page returned to your console is to set your user-agent information to something identifiable with the help of the curl package:
library(xml2)
library(rvest)
library(curl)
web_content <- read_html(curl('http://benchmarkrealestate.com/', handle = new_handle("useragent" = "Mozilla/5.0")))
You may also want to read up on header fields.

HTTP Status code for content unavailable without a redirect

I'm trying to find the correct HTTP status code for a page where the content is temporary unavailable however there is no redirect, instead a message is displayed on the page informing the user the content is temporarily unavailable.
307 Temporary Redirect isn't applicable as there is no redirect.
404 Not Found might possibly be applicable, however I'm not sure if this is the correct response to give as the content is found, just not available.
410 Gone isn't applicable as the content will be available again some time in the future.
None of the other codes seemed even remotely applicable. Does anyone know the correct code to use and can explain why?
It sounds like the 4XX series of responses are appropriate here. From the RFC:
The 4xx class of status code is intended for cases in which the
client seems to have erred. Except when responding to a HEAD request,
the server SHOULD include an entity containing an explanation of the
error situation, and whether it is a temporary or permanent
condition.
With this in mind, I think 403 forbidden is the most appropriate:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated.
If the request method was not HEAD and the server wishes to make
public why the request has not been fulfilled, it SHOULD describe the
reason for the refusal in the entity. If the server does not wish to
make this information available to the client, the status code 404
(Not Found) can be used instead.
I suggest this for three reasons:
It's not an exotic code, so it will work fine in the browser. This, to me, is the most important reason - you will be able to serve a page that explains why the content isn't available, and you can be fairly certain it will be displayed correctly.
It's appropriate for the server to say "I understand your request, but I won't serve you that content at this time", and that's exactly what the first two lines of the description say.
It doesn't explicity say "forget you ever knew about this content" to any robots (or for that matter, people).
For completeness, here's why I ruled out the other response code categories:
2XX Success: This class of status code indicates that the client's request was
successfully received, understood, and accepted.
But, we're not accepting the request in this case. I don't think 2XX is right.
3XX Redirection: This class of status code indicates that further action needs to be
taken by the user agent in order to fulfill the request.
I suppose that you could argue "further action" to mean "please wait until the content is available before trying again", but reading the other 3XX codes, "further action" usually means "immediate redirect", which as you've already pointed out, isn't appropriate.
5XX Server error: Response status codes beginning with the digit "5" indicate cases in
which the server is aware that it has erred or is incapable of
performing the request.
Nothing has gone wrong on the server, you just don't want to serve the content right now.
HTTP STATUS CODE 204
i.e. NO CONTENT
Read More about it here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
section 10.2.5
i.e.
The server successfully processed the request, but is not returning any content.

Is there any action I could do with POST, but not with GET?

I know differences and advantages of each command, my question is could I replace POST requests with GET everywhere? And which these commands calls by default while sending request from html form?
could I replace POST requests with GET everywhere
No (and it would be a terrible idea to try).
Things that a form can do with POST that you can't do with GET include:
Sending lots of data
Sending files
There are other things that would simply be stupid to do with GET.
From http://www.w3.org/TR/html5/forms.html#attr-fs-method :
The method and formmethod content attributes are enumerated attributes
with the following keywords and states:
The keyword get, mapping to the state GET, indicating the HTTP GET
method. The keyword post, mapping to the state POST, indicating the
HTTP POST method. The invalid value default for these attributes is
the GET state. (There is no missing value default.)
When using GET to transfer data from the client to the server, the data is added to the URL, there is not BODY of the request. There is usually a limit on how long a URL can be, in the old days this was 1024 characters but that really depends on the server software and server middleware and even the browser.
This means if you want to transfer loads or data or upload a file to the server, you can not do it with GET.

Why should JSON have a status property

I stumbled over a practice that I found to be quite widespread. I even found a web page that gave this a name, but I forgot the name and am not able to find that page on google anymore.
The practice is that every JSON response from a REST service should have the following structure:
{
"status": "ok",
"data": { ... }
}
or in an error case:
{
"status": "error",
"message": "Something went wrong"
}
My question: What is the point why such a "status" property should be required in the JSON? In my opinion that is what HTTP status codes were made for.
REST uses the HTTP means of communication between client and server, for example the "DELETE" verb should be used for deleting. In the same way, 404 should be used if a resource is not found, etc. So inline with that thinking, any error cases should be encoded properly in the HTTP status.
Are there specific reasons to return a HTTP 200 status code in an error case and have the error in the JSON instead? It just seems to make the javascript conditional branches more complex when processing the response.
I found some cases where status could be "redirect" to tell the application to redirect to a certain URL. But if the proper HTTP status code was used, the browser would perform the redirection "for free", maintaining the browsing history properly.
I picture mainly two possible answers from you:
Either there are two quarreling communities with their favorite approach each (use HTTP status always vs. use HTTP status never)
or I am missing an important point and you'll tell me that although the HTTP status should be used for some cases, there are specific cases where a HTTP status does not fit and the "status" JSON property comes into play.
You are right. I think what you are seeing is a side-effect of people not doing REST correctly. Or just not doing REST at all. Using REST is not a pre-requisite for a well-designed application; there is no rule that webapps have to be REST-ful.
On the other hand, for the error condition, sometimes apps want to return a 200 code but an error to represent a business logic failure. The HTTP error codes don't always match the semantics of application business errors.
You are mixing two different Layers here:
HTTP is for establishing (high-level) connections and transferring data. The HTTP status codes thus informs you if and how the connection was established or why it was not. On a successful connection the body of the HTTP request could then contain anything (e.g. XML, JSON, etc.), thus these status code have to define a general meaning. It does not inform you about the correctness or type (e.g. error message or data) of the response.
When using JSON for interchanging data you could certainly omit the status property, however it is easier for you to parse the JSON, if you know if it includes the object you were requesting or an error message by just reading one property.
So, yes, it is perfectly normal to return a 200 status code and have a "status": "error" property in your JSON.
HTTP status codes can be caused by a lot of things, including load balancers, proxies, caches, firewalls, etc. None of these are going to modify your JSON output, unless they completely break it, which can also be treated as an error.
Bottom line: it's more reliable to do it via JSON.

How to know when to send a 304 Not Modified response

I'm writing a resource handling method where I control access to various files, and I'd like to be able to make use of the browser's cache. My question is two-fold:
Which are the definitive HTTP headers that I need to check in order to know for sure whether I should send a 304 response, and what am I looking for when I do check them?
Additionally, are there any headers that I need to send when I initially send the file (like 'Last-Modified') as a 200 response?
Some psuedo-code would probably be the most useful answer.
What about the cache-control header? Can the various possible values of that affect what you send to the client (namely max-age) or should only if-modified-since be obeyed?
Here's how I implemented it. The code has been working for a bit more than a year and with multiple browsers, so I think it's pretty reliable. This is based on RFC 2616 and by observing what and when the various browsers were sending.
Here's the pseudocode:
server_etag = gen_etag_for_this_file(myfile)
etag_from_browser = get_header("Etag")
if etag_from_browser does not exist:
etag_from_browser = get_header("If-None-Match")
if the browser has quoted the etag:
strip the quotes (e.g. "foo" --> foo)
set server_etag into http header
if etag_from_browser matches server_etag
send 304 return code to browser
Here's a snippet of my server logic that handles this.
/* the client should set either Etag or If-None-Match */
/* some clients quote the parm, strip quotes if so */
mketag(etag, &sb);
etagin = apr_table_get(r->headers_in, "Etag");
if (etagin == NULL)
etagin = apr_table_get(r->headers_in, "If-None-Match");
if (etag != NULL && etag[0] == '"') {
int sl;
sl = strlen(etag);
memmove(etag, etag+1, sl+1);
etag[sl-2] = 0;
logit(2,"etag=:%s:",etag);
}
...
apr_table_add(r->headers_out, "ETag", etag);
...
if (etagin != NULL && strcmp(etagin, etag) == 0) {
/* if the etag matches, we return a 304 */
rc = HTTP_NOT_MODIFIED;
}
If you want some help with etag generation post another question and I'll dig out some code that does that as well. HTH!
A 304 Not Modified response can result from a GET or HEAD request with either an If-Modified-Since ("IMS") or an If-Not-Match ("INM") header.
In order to decide what to do when you receive these headers, imagine that you are handling the GET request without these conditional headers. Determine what the values of your ETag and Last-Modified headers would be in that response and use them to make the decision. Hopefully you have built your system such that determining this is less costly than constructing the complete response.
If there is an INM and the value of that header is the same as the value you would place in the ETag, then respond with 304.
If there is an IMS and the date value in that header is later than the one you would place in the Last-Modified, then respond with 304.
Else, proceed as though the request did not contain those headers.
For a least-effort approach to part 2 of your question, figure out which of the (Expires, ETag, and Last-Modified) headers you can easily and correctly produce in your Web application.
For suggested reading material:
http://www.w3.org/Protocols/rfc2616/rfc2616.html
http://www.mnot.net/cache_docs/
You should send a 304 if the client has explicitly stated that it may already have the page in its cache. This is called a conditional GET, which should include the if-modified-since header in the request.
Basically, this request header contains a date from which the client claims to have a cached copy. You should check if content has changed after this date and send a 304 if it hasn't.
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25 for the related section in the RFC.
We are also handling cached, but secured, resources.  If you send / generate an ETAg header (which RFC 2616 section 13.3 recommends you SHOULD), then the client MUST use it in a conditional request (typically in an If-None-Match - HTTP_IF_NONE_MATCH - header).  If you send a Last-Modified header (again you SHOULD), then you should check the If-Modified-Since - HTTP_IF_MODIFIED_SINCE - header.  If you send both, then the client SHOULD send both, but it MUST send the ETag.  Also note that validtion is just defined as checking the conditional headers for strict equality against the ones you would send out.  Also, only a strong validator (such as an ETag) will be used for ranged requests (where only part of a resource is requested).
In practice, since the resources we are protecting are fairly static, and a one second lag time is acceptable, we are doing the following:
 Check to see if the user is authorized to access the requested resource
     If they are not, Redirect them or send a 4xx response as appropriate.  We will generate 404 responses to requests that look like hack attempts or blatant tries to perform a security end run.
 Compare the If-Modified-Since header to the Last-Modified header we would send (see below) for strict equality
     If they match, send a 304 Not Modified response and exit page processing
 Create a Last-Modified header using the modification time of the requested resource
    Look up the HTTP Date format in RFC 2616
 Send out the header and resource content along with an appropriate Content-Type
We decided to eschew the ETag header since it is overkill for our purposes.  I suppose we could also just use the date time stamp as an ETag.  If we move to a true ETag system, we would probably store computed hashes for the resources and use those as ETags.
If your resources are dynamically generated, from say database content, then ETags may be better for your needs, since they are just text to be populated as you see fit.
regarding cache-control:
You shouldn't have to worry about the cache-control when serving out, other than setting it to a reasonable value. It's basically telling the browser and other downstream entities (such as a proxy) the maximum time that should elapse before timing out the cache.