I'm writing a resource handling method where I control access to various files, and I'd like to be able to make use of the browser's cache. My question is two-fold:
Which are the definitive HTTP headers that I need to check in order to know for sure whether I should send a 304 response, and what am I looking for when I do check them?
Additionally, are there any headers that I need to send when I initially send the file (like 'Last-Modified') as a 200 response?
Some pseudo-code would probably be the most useful answer.
What about the cache-control header? Can the various possible values of that affect what you send to the client (namely max-age) or should only if-modified-since be obeyed?
Here's how I implemented it. The code has been working for a bit more than a year and with multiple browsers, so I think it's pretty reliable. This is based on RFC 2616 and by observing what and when the various browsers were sending.
Here's the pseudocode:
server_etag = gen_etag_for_this_file(myfile)
etag_from_browser = get_header("Etag")
if etag_from_browser does not exist:
    etag_from_browser = get_header("If-None-Match")
if the browser has quoted the etag:
    strip the quotes (e.g. "foo" --> foo)
set server_etag into http header
if etag_from_browser matches server_etag:
    send 304 return code to browser
Here's a snippet of my server logic that handles this.
/* the client should set either Etag or If-None-Match */
/* some clients quote the parm, strip quotes if so */
mketag(etag, &sb);
etagin = apr_table_get(r->headers_in, "Etag");
if (etagin == NULL)
    etagin = apr_table_get(r->headers_in, "If-None-Match");
if (etagin != NULL && etagin[0] == '"') {
    /* etagin points into the request's header table, so copy instead of editing in place */
    size_t sl = strlen(etagin);
    if (sl >= 2 && etagin[sl-1] == '"')
        etagin = apr_pstrndup(r->pool, etagin+1, sl-2);
    logit(2,"etag=:%s:",etagin);
}
...
apr_table_add(r->headers_out, "ETag", etag);
...
if (etagin != NULL && strcmp(etagin, etag) == 0) {
    /* if the etag matches, we return a 304 */
    rc = HTTP_NOT_MODIFIED;
}
If you want some help with etag generation post another question and I'll dig out some code that does that as well. HTH!
A 304 Not Modified response can result from a GET or HEAD request with either an If-Modified-Since ("IMS") or an If-None-Match ("INM") header.
In order to decide what to do when you receive these headers, imagine that you are handling the GET request without these conditional headers. Determine what the values of your ETag and Last-Modified headers would be in that response and use them to make the decision. Hopefully you have built your system such that determining this is less costly than constructing the complete response.
If there is an INM and the value of that header is the same as the value you would place in the ETag, then respond with 304.
If there is an IMS and the date value in that header is the same as or later than the one you would place in the Last-Modified, then respond with 304.
Else, proceed as though the request did not contain those headers.
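The decision above can be sketched in Python roughly as follows. This is only an illustration, not a complete implementation: the function name `is_not_modified` and the plain-dict header lookup (which assumes canonical header names) are my own assumptions, the comma split ignores the rare case of commas inside an entity tag, and giving If-None-Match precedence when both headers are present follows the newer RFC 7232 rather than anything stated above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def _strip_weak(tag: str) -> str:
    """Drop a leading W/ so that weak and strong tags compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers: Mapping[str, str],
                    etag: Optional[str],
                    last_modified: Optional[datetime]) -> bool:
    """Return True when a 304 is appropriate for a GET or HEAD request.

    etag is the quoted value the 200 response would carry, e.g. '"abc"';
    last_modified is a timezone-aware UTC datetime.
    """
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # Assumption: If-None-Match takes precedence over If-Modified-Since.
        if inm.strip() == "*":
            return etag is not None
        if etag is None:
            return False
        candidates = [_strip_weak(t) for t in inm.split(",")]
        return _strip_weak(etag) in candidates

    ims = request_headers.get("If-Modified-Since")
    if ims is not None and last_modified is not None:
        try:
            since = parsedate_to_datetime(ims)
        except (TypeError, ValueError):
            return False  # unparseable date: behave as if the header were absent
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates only have one-second resolution.
        return last_modified.replace(microsecond=0) <= since

    return False
```

If it returns True, send the 304 with the same ETag and Last-Modified headers and no body; otherwise build the normal 200 response.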
For a least-effort approach to part 2 of your question, figure out which of the (Expires, ETag, and Last-Modified) headers you can easily and correctly produce in your Web application.
For suggested reading material:
http://www.w3.org/Protocols/rfc2616/rfc2616.html
http://www.mnot.net/cache_docs/
You should send a 304 if the client has explicitly stated that it may already have the page in its cache. This is called a conditional GET, which should include the if-modified-since header in the request.
Basically, this request header contains a date from which the client claims to have a cached copy. You should check if content has changed after this date and send a 304 if it hasn't.
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25 for the related section in the RFC.
We are also handling cached, but secured, resources. If you send / generate an ETag header (which RFC 2616 section 13.3 recommends you SHOULD), then the client MUST use it in a conditional request (typically in an If-None-Match - HTTP_IF_NONE_MATCH - header). If you send a Last-Modified header (again you SHOULD), then you should check the If-Modified-Since - HTTP_IF_MODIFIED_SINCE - header. If you send both, then the client SHOULD send both, but it MUST send the ETag. Also note that validation is just defined as checking the conditional headers for strict equality against the ones you would send out. Also, only a strong validator (such as an ETag) will be used for ranged requests (where only part of a resource is requested).
In practice, since the resources we are protecting are fairly static, and a one second lag time is acceptable, we are doing the following:
Check to see if the user is authorized to access the requested resource
If they are not, Redirect them or send a 4xx response as appropriate. We will generate 404 responses to requests that look like hack attempts or blatant tries to perform a security end run.
Compare the If-Modified-Since header to the Last-Modified header we would send (see below) for strict equality
If they match, send a 304 Not Modified response and exit page processing
Create a Last-Modified header using the modification time of the requested resource
Look up the HTTP Date format in RFC 2616
Send out the header and resource content along with an appropriate Content-Type
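As a rough Python sketch of the steps above (everything after the authorization check): the function names and the `(status, headers, body)` tuple are hypothetical, and the Content-Type is a placeholder you would replace with the real type of the resource.

```python
import os
from email.utils import formatdate


def last_modified_header(path: str) -> str:
    """Format the file's mtime as an HTTP date (RFC 1123 form, always GMT)."""
    return formatdate(os.path.getmtime(path), usegmt=True)


def respond(path: str, request_headers: dict) -> tuple:
    """Return (status, headers, body) for an already-authorized request."""
    last_modified = last_modified_header(path)
    # Strict string equality, as described above.
    if request_headers.get("If-Modified-Since") == last_modified:
        return 304, {"Last-Modified": last_modified}, b""
    with open(path, "rb") as f:
        body = f.read()
    headers = {
        "Last-Modified": last_modified,
        "Content-Type": "application/octet-stream",  # placeholder type
    }
    return 200, headers, body
```

Because the HTTP date format only has one-second resolution, two modifications within the same second produce the same header, which is the one second lag mentioned above.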
We decided to eschew the ETag header since it is overkill for our purposes. I suppose we could also just use the date time stamp as an ETag. If we move to a true ETag system, we would probably store computed hashes for the resources and use those as ETags.
If your resources are dynamically generated, from say database content, then ETags may be better for your needs, since they are just text to be populated as you see fit.
regarding cache-control:
You shouldn't have to worry about Cache-Control when serving the file, other than setting it to a reasonable value. Its max-age directive tells the browser and other downstream caches (such as a proxy) how long they may reuse the response before they have to revalidate it with the server. It does not change how you answer a conditional request once one arrives.
Related
I'm using Apache HttpClient 4.3.1 and I'm trying to integrate etag validation cache.
I've tried to "drop in" httpclient-cache CachingHttpClientBuilder instead of my usual HttpClientBuilder using instructions in here, but that didn't seem to do any good. While tracing the execution, it seems like a response that has "etag" header (weak etag) isn't considered cache-able - and so isn't retained for the next cycle.
Has anyone managed to use etag validation based cache with Apache HttpClient? I'm also open for alternative implementations.
Notes:
The server returns the first request with a weak etag header (W/"1234"). If the second request to the same URL has "If-None-Match=1234", the server returns 304. This is checked and working.
The server does not send any other cache header (expires, etc).
The whole setup works wonderfully when using a modern browser.
Whether a response is considered as cacheable or not is decided in
ResponseCachingPolicy#isResponseCacheable(org.apache.http.HttpRequest, org.apache.http.HttpResponse)
which checks for some headers using
ResponseCachingPolicy#isExplicitlyCacheable
When the 'Expires' header is set, or the 'Cache-Control' header has one of the values "max-age", "s-maxage", "must-revalidate", "proxy-revalidate" or "public", the response is considered cacheable.
For us, it worked to add "Cache-Control: must-revalidate" to the response on the server, along with the 'ETag' header.
With these settings, the Apache HTTP client:
stores the response of the first request in the cache
on the second request, sends a request to the server and if this responds with a HttpStatus 304 (Not Modified) returns a HttpStatus 200 (ok) and the original content to the caller
That is how it should be.
We are using release 4.5.2 of apache http client cache.
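To make the flow concrete, here is a small Python sketch of the same idea. It is not HttpClient's code: `RevalidatingCache`, `is_explicitly_cacheable` and the `send` callable are hypothetical names, and the cacheability rule is just the one described in this answer.

```python
from typing import Callable, Dict, Mapping, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status, headers, body)

CACHEABLE_DIRECTIVES = {"max-age", "s-maxage", "must-revalidate",
                        "proxy-revalidate", "public"}


def is_explicitly_cacheable(headers: Mapping[str, str]) -> bool:
    """Expires present, or Cache-Control carries one of the listed directives."""
    if "Expires" in headers:
        return True
    directives = headers.get("Cache-Control", "").split(",")
    names = {d.split("=")[0].strip().lower() for d in directives}
    return bool(names & CACHEABLE_DIRECTIVES)


class RevalidatingCache:
    def __init__(self, send: Callable[[str, Dict[str, str]], Response]) -> None:
        self._send = send  # hypothetical transport: (url, headers) -> Response
        self._store: Dict[str, Response] = {}

    def get(self, url: str) -> Response:
        cached = self._store.get(url)
        request_headers: Dict[str, str] = {}
        if cached is not None and "ETag" in cached[1]:
            # Echo the stored validator back to the server.
            request_headers["If-None-Match"] = cached[1]["ETag"]
        status, headers, body = self._send(url, request_headers)
        if status == 304 and cached is not None:
            # Server confirmed the copy is current: hand back the stored 200.
            return cached
        if status == 200 and is_explicitly_cacheable(headers):
            self._store[url] = (status, headers, body)
        return status, headers, body
```

With only an ETag and no Cache-Control or Expires, `is_explicitly_cacheable` returns False and nothing is stored, which matches the behaviour described in the question.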
Why does Chrome send a HEAD request? Example in logs:
2013-03-04 07:43:51 W3SVC7 NS1 GET /page.html 80 - *.*.*.* HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.22+(KHTML,+like+Gecko)+Chrome/25.0.1364.97+Safari/537.22
2013-03-04 07:43:51 W3SVC7 NS1 HEAD / - 80 - *.*.*.* HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.22+(KHTML,+like+Gecko)+Chrome/25.0.1364.97+Safari/537.22
I have a ban system, and this HEAD request is really annoying; it happens in exactly the same second as the GET request.
What is the nature of it? Any help appreciated.
p.s: I noticed that the head requests are all only to my homepage.
RFC 2616 states:
9.4 HEAD
The HEAD method is identical to GET except that the server MUST NOT
return a message-body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request SHOULD be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the entity implied by the
request without transferring the entity-body itself. This method is
often used for testing hypertext links for validity, accessibility,
and recent modification.
The response to a HEAD request MAY be cacheable in the sense that the
information contained in the response MAY be used to update a
previously cached entity from that resource. If the new field values
indicate that the cached entity differs from the current entity (as
would be indicated by a change in Content-Length, Content-MD5, ETag
or Last-Modified), then the cache MUST treat the cache entry as
stale.
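In code, the quoted rule amounts to building the GET response and dropping the body. A minimal Python sketch, with a hypothetical `build_response` helper and `(status, headers, body)` tuple:

```python
from typing import Dict, Tuple


def build_response(method: str, resource: bytes,
                   content_type: str = "text/html") -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body); HEAD gets GET's headers and no body."""
    if method not in ("GET", "HEAD"):
        return 405, {"Allow": "GET, HEAD"}, b""
    headers = {
        "Content-Type": content_type,
        "Content-Length": str(len(resource)),  # length the GET body would have
    }
    body = resource if method == "GET" else b""
    return 200, headers, body
```

For something like a ban system, this suggests counting a HEAD the same way as a GET to the same URL rather than treating it as a separate kind of hit.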
Most likely it is trying to verify the client's cookie/session is valid with the server.
I'd like to know if the POST method on HTTP sends data as a QueryString, or if it use a special structure to pass the data to the server.
In fact, when I analyze the communication with POST method from client to server (with Fiddler for example), I don't see any QueryString, but a Form Body context with the name/value pairs.
The best way to visualize this is to use a packet analyzer like Wireshark and follow the TCP stream. HTTP simply uses TCP to send a stream of data starting with a few lines of HTTP headers. Often this data is easy to read because it consists of HTML, CSS, or XML, but it can be any type of data that gets transferred over the internet (Executables, Images, Video, etc).
For a GET request, your computer requests a specific URL and the web server usually responds with a 200 status code, and the content of the webpage is sent directly after the HTTP response headers. This content is the same content you would see if you viewed the source of the webpage in your browser. The query string you mentioned is just part of the URL and gets included in the HTTP GET request header that your computer sends to the web server. Below is an example of an HTTP GET request to http://accel91.citrix.com:8000/OA_HTML/OALogout.jsp?menu=Y, followed by a 302 redirect response from the server. Some of the HTTP Headers are wrapped due to the size of the viewing window (these really only take one line each), and the 302 redirect includes a simple HTML webpage with a link to the redirected webpage (Most browsers will automatically redirect any 302 response to the URL listed in the Location header instead of displaying the HTML response):
For a POST request, you may still have a query string, but this is uncommon and does not have anything to do with the data that you are POSTing. Instead, the data is included directly after the HTTP headers that your browser sends to the server, similar to the 200 response that the web server uses to respond to a GET request. In the case of POSTing a simple web form this data is encoded using the same URL encoding that a query string uses, but if you are using a SOAP web service it could also be encoded using a multi-part MIME format and XML data.
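To see the layout for a simple web form, here is a small Python sketch that serializes such a request by hand. The helper name, host, path and field values are made up for illustration; a real client would add more headers.

```python
from urllib.parse import urlencode


def build_form_post(host: str, path: str, fields: dict) -> bytes:
    """Serialize an HTTP/1.1 POST whose body is application/x-www-form-urlencoded."""
    # Same encoding a query string uses: name=value pairs joined by '&'.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"  # blank line separates the headers from the body
    ).encode("ascii")
    return head + body
```

Printing `build_form_post("example.test", "/login", {"user": "alice"})` shows the name/value pairs sitting after the blank line, not in the request line.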
For example here is what an HTTP POST to an XML based SOAP web service located at http://192.168.24.23:8090/msh looks like in Wireshark Follow TCP Stream:
Post uses the message body to send the information back to the server, as opposed to Get, which uses the query string (everything after the question mark). It is possible to send both a Get query string and a Post message body in the same request, but that can get a bit confusing so is best avoided.
Generally, best practice dictates that you use Get when you want to retrieve data, and Post when you want to alter it. (These rules aren't set in stone, the specs don't forbid altering data with Get, but it's generally avoided on the grounds that you don't want people making changes just by clicking a link or typing a URL)
Conversely, you can use Post to retrieve data without changing it, but using Get means you can bookmark the page, or share the URL with other people, things you couldn't do if you'd used Post.
http://en.wikipedia.org/wiki/GET_%28HTTP%29
http://en.wikipedia.org/wiki/POST_%28HTTP%29
As for the actual format of the data sent in the message body, that's entirely up to the sender and is specified with the Content-Type header. If not specified, the default content-type for HTML forms is application/x-www-form-urlencoded, which means the server will expect the post body to be a string encoded in a similar manner to a GET query string. However this can't be depended on in all cases. RFC2616 says the following on the Content-Type header:
Any HTTP/1.1 message containing an entity-body SHOULD include a
Content-Type header field defining the media type of that body. If
and only if the media type is not given by a Content-Type field, the
recipient MAY attempt to guess the media type via inspection of its
content and/or the name extension(s) of the URI used to identify the
resource. If the media type remains unknown, the recipient SHOULD
treat it as type "application/octet-stream".
A POST request can include a query string, however normally it doesn't - a standard HTML form with a POST action will not normally include a query string for example.
GET will send the data as a querystring, but POST will not. Rather it will send it in the body of the request.
If your POST is sent to a URL like the following
mypage.php?id=1
you will have the POST data but also GET data.
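A quick Python sketch of how a server might pull the two apart, assuming the body is form-urlencoded (the function name is hypothetical):

```python
from urllib.parse import parse_qs, urlsplit


def split_request_data(target: str, body: bytes) -> tuple:
    """Return (query_params, form_params) for a urlencoded POST."""
    # "GET data": whatever follows the '?' in the request target.
    query_params = parse_qs(urlsplit(target).query)
    # "POST data": the name/value pairs carried in the message body.
    form_params = parse_qs(body.decode("ascii"))
    return query_params, form_params
```

For the URL above, `split_request_data("mypage.php?id=1", b"name=bob")` keeps `id` in the first dict and `name` in the second.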
How should I structure a RESTful API where the same data may be requested in different formats? For example:
GET /person/<id> //get the details of resource <id>
Now, depending on the client (browser) requirement, the data may be sent as HTML (say, normal rendering) or JSON (say, an AJAX call). So my doubts are:
Can I keep the same URL for both requests, or should I keep them separate?
How do I detect at the server whether the request is for HTML or JSON? The request type is the same (GET), so which parameter should I consider?
How do I detect the difference in data type at the client (HTML/JSON)?
thanks,
bsr.
Similar question: REST Content-Type: Should it be based on extension or Accept header?
The accepted answer has great points.
Can I keep the same URL for both requests, or should I keep them separate?
Yes, keep them the same. It's the same resource, you're just asking for different representations of it.
How do I detect at the server whether the request is for HTML or JSON? The request type is the same (GET), so which parameter should I consider?
The client can use the Accept header to say which content type it wants returned, and the server inspects that header to choose the representation.
How do I detect the difference in data type at the client (HTML/JSON)?
You would look at the "Content-Type" header.
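A rough Python sketch of the server side, assuming only two representations are offered. `choose_media_type`, `render_person` and the `SUPPORTED` tuple are hypothetical names, and the q-value parsing is deliberately simplified.

```python
import html
import json
from typing import Dict, Tuple

SUPPORTED = ("text/html", "application/json")


def choose_media_type(accept: str, default: str = "text/html") -> str:
    """Pick the supported type with the highest q-value in the Accept header."""
    best, best_q = None, 0.0
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        for candidate in SUPPORTED:
            wildcard = candidate.split("/")[0] + "/*"
            if media in (candidate, "*/*", wildcard) and q > best_q:
                best, best_q = candidate, q
    return best or default


def render_person(person: Dict[str, str], accept: str) -> Tuple[str, str]:
    """Return (content_type, body) for GET /person/<id>."""
    media = choose_media_type(accept)
    if media == "application/json":
        return media, json.dumps(person)
    return media, "<h1>{}</h1>".format(html.escape(person["name"]))
```

The first element of the returned tuple goes out as the Content-Type header, which is what the client then checks. If responses are cached, sending `Vary: Accept` as well keeps a cache from handing the HTML version to a JSON client.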
What about adding a variable for output type?
There're many ways to write an HTTP-status header:
HTTP/1.1 404 Not Found
Status: 404
Status: 404 Not Found
but which is the semantically-correct and spec-compliant way?
Edit: By status headers I mean this, using a function such as PHP's header().
Adding some information some time later, since I came across this question whilst researching something related.
I believe the Status header field was originally invented as part of the CGI specification, RFC 3875:
https://www.rfc-editor.org/rfc/rfc3875#section-6.3.3
To quote:
The Status header field contains a 3-digit integer result code that
indicates the level of success of the script's attempt to handle the
request.
Status = "Status:" status-code SP reason-phrase NL
status-code = "200" | "302" | "400" | "501" | extension-code
extension-code = 3digit
reason-phrase = *TEXT
It allows a CGI script to return a status code to the web server that overrides the default seen in the HTTP status line. Usually the server buffers the result from the script and emits a new header for the client. This one is a valid HTTP header which starts with an amended HTTP status line and omits the script's "Status:" header field (plus some other transformations mandated by the RFC).
So all of your examples are valid from a CGI script, but only the first is really valid in a HTTP header. The latter two are only valid coming from a CGI script (or perhaps a FastCGI application).
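To illustrate the transformation the server performs, here is a simplified Python sketch. The function name is hypothetical, and it ignores the other rewrites the RFC mandates (for example, how a Location field without a Status is handled).

```python
from typing import List, Tuple


def cgi_to_http_head(cgi_headers: List[Tuple[str, str]]) -> str:
    """Turn CGI response header fields into an HTTP/1.1 response head."""
    status = "200 OK"  # assumed when the script sends no Status field
    kept = []
    for name, value in cgi_headers:
        if name.lower() == "status":
            # The Status field becomes the status line and is not forwarded.
            status = value.strip()
        else:
            kept.append(f"{name}: {value}")
    lines = [f"HTTP/1.1 {status}"] + kept
    return "\r\n".join(lines) + "\r\n\r\n"
```

So a script that prints "Status: 404 Not Found" ends up producing "HTTP/1.1 404 Not Found" on the wire, and the client never sees a Status header.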
A CGI script can also operate in "non-parsed header" (NPH) mode, when it generates a complete and valid HTTP header which the web server passes to the client verbatim. As such this shouldn't include a Status: header field.
Note, what I am interested in is which status should win if an NPH script gets it a bit wrong and emits the Status: header field, possibly in addition to the HTTP status line. I can't find any clear indication, so I suspect it is left to the implementation of whatever is parsing the output, either the client or the server.
Since https://www.rfc-editor.org/rfc/rfc2616#section-6 and more specifically https://www.rfc-editor.org/rfc/rfc2616#section-6.1 does not mention use of "Status:" when indicating a status code, and since the official list of headers at http://www.iana.org/assignments/message-headers/message-headers.xml does not mention "Status", I'd be inclined to believe it should not be served with it as a header.
The closest thing I've found to an answer is the Fast CGI spec, which states to set status codes through Status and Location headers.
A lot of them are pretty much arbitrary strings, but here is the W3C's spec for the commonly used ones
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html