I have been considering caching my JSON on Amazon CloudFront.
The issue is that it can take up to 15 minutes to manually clear that cache when the JSON is updated.
Is there a way to store a simple JSON value in a CDN-like HTTP cache that:
- does not touch an application server (Heroku) after initial generation
- allows me to instantly expire the cache
Update
In response to AdamKG's point:
If it's being "updated", it's not static :D Write a new version and
tell your servers to use the new URL.
My actual idea is to cache a new CloudFront URL every time an HTML page changes. That was my original focus.
The reason I want the JSON is to store the version number of the latest CloudFront URL. That way I can make an AJAX call to discover which version to load, then a second AJAX call to actually load the content. This way I never need to expire CloudFront content; I just redirect the AJAX call that loads it.
But then I have the issue of the JSON itself needing to be cached. I don't want people hitting the Heroku dynos every time they want to see the single JSON version number. I know Memcached and Rack can help me speed that up, but it's a problem I just don't want to have.
Some ideas I've had:
Maybe there is a third-party service, similar to a Memcached database, that allows me to expose a value at a JSON URL? That way my dynos are never touched.
Maybe there is an alternative to CloudFront that allows for quicker manual expiration? I know that kind of defeats the nature of caching, but maybe there are intermediary services, like a Varnish layer, that could help.
One method is to use asset expiration similar to the way that Rails static assets are expired. Rails adds a hash signature to filenames, so something like application.js becomes application-abcdef1234567890.js. Then, each time a user requests your page, if application.js has been updated, the script tag has the new address.
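The fingerprinting scheme described above can be sketched in a few lines of plain Ruby (a simplified illustration, not Rails' actual Sprockets implementation): hash the file's contents and embed the digest in the filename, so any content change yields a new URL that caches can store forever.

```ruby
require 'digest'

# Build a fingerprinted filename from a file's contents, in the style of
# Rails asset digests: application.js -> application-<md5>.js
def fingerprinted_name(basename, ext, contents)
  digest = Digest::MD5.hexdigest(contents)
  "#{basename}-#{digest}.#{ext}"
end
```

The same contents always produce the same name, so the URL only changes when the file actually changes, e.g. `fingerprinted_name("application", "js", js_source)`.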
Here is how I envision you doing this:
User → CloudFront (CDN) → Your App (Origin)
User requests http://www.example.com/. The page has meta tag
<meta content="1231231230" name="data-timestamp" />
based on the last time you updated the JSON resource. This could be generated from something like <%= Widget.order(updated_at: :desc).pluck(:updated_at).first.to_i %> if you are using Rails.
Then, in your application's JavaScript, grab the timestamp and use it for your JSON url.
var timestamp = $('meta[name=data-timestamp]').attr('content');
$.get('http://cdn.example.com/data-' + timestamp + '.json', function (data, textStatus, jqXHR) {
  blah(data);
});
The first request to CloudFront will hit your origin server at /data-1231231230.json, which can be generated once and cached forever. Each time your JSON is updated, the user gets a new URL to query on the CDN.
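On the origin side, the endpoint might look like the following Rack-style sketch (the route and payload are hypothetical): because each timestamp maps to a distinct URL, the response can carry a far-future Cache-Control header and never needs invalidating.

```ruby
require 'json'

# Rack-style origin endpoint for /data-<timestamp>.json.
# Each versioned URL is immutable, so it is safe to cache forever:
# an updated resource simply gets a new timestamp and a new URL.
VERSIONED_JSON = lambda do |env|
  if env['PATH_INFO'] =~ %r{\A/data-(\d+)\.json\z}
    body = JSON.generate(version: $1.to_i)
    [200,
     { 'Content-Type'  => 'application/json',
       'Cache-Control' => 'public, max-age=31536000, immutable' },
     [body]]
  else
    [404, { 'Content-Type' => 'text/plain' }, ['not found']]
  end
end
```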
Update
Since you mention that the actual page is what you want to cache heavily, you are left with a couple options. If you really want CloudFront in front of your server, your only real option would be to send an invalidation request every time your homepage updates. You can invalidate 1,000 times per month for free, and $5 per 1,000 after that. In addition, CloudFront invalidations are not fast, and you will still have a delay before the page is updated.
The other option is to cache your content in Memcached and serve it from your dynos. I will assume that you are using Ruby on Rails or another Ruby framework based on your asking history (but please clarify if you are not). This entails getting Rack::Cache installed. The instructions on Heroku are for caching assets, but this will work for dynamic content as well. Next, you would use Rack::Cache's invalidate method each time the page is updated. Yes, your dynos will handle some of the load, but it will be a simple Memcached lookup and response.
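This is not Rack::Cache's actual API, but the pattern it implements boils down to something like the following sketch: serve the rendered page out of a Memcached-style store, and drop the entry whenever the page changes so the next request re-renders it.

```ruby
# Minimal cache-with-invalidation sketch (a stand-in for Rack::Cache +
# Memcached, not their real interfaces).
class PageCache
  def initialize
    @store = {} # stand-in for Memcached
  end

  # Return the cached body, rendering (and storing) it on a miss.
  def fetch(key)
    @store[key] ||= yield
  end

  # Call this each time the underlying page is updated.
  def invalidate(key)
    @store.delete(key)
  end
end
```

Renders only happen on a miss; after `invalidate`, the next `fetch` rebuilds the entry.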
Your server layout would look like:
User → CloudFront (CDN) → Rack::Cache → Your App (Origin) on cdn.example.com
User → Rack::Cache → Your App (Origin) on www.example.com
When you serve static assets like your images, CSS, and JavaScript, use the cdn.example.com domain. This will route requests through CloudFront and they will be cached for long periods of time. Requests to your app will go directly to your Heroku dyno, and the cacheable parts will be stored and retrieved by Rack::Cache.
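In a Rails app, routing assets through the CDN hostname is a one-line configuration (hostname here is a placeholder for your own CloudFront distribution):

```ruby
# config/environments/production.rb (hypothetical hostname):
# asset helpers like image_tag and javascript_include_tag will now emit
# cdn.example.com URLs, while regular page requests still hit the app.
Rails.application.configure do
  config.action_controller.asset_host = 'cdn.example.com'
end
```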
Related
I have a page /data.txt, which is cached in the client's browser. Based on data which might be known only to the server, I now know that this page is out of date and should be refreshed. However, since it is cached, they will not re-request it for a long time (until the cache expires).
The client is now requesting a different page /foo.html. How can I make the client's browser re-request /data.txt and update its cache?
This should be done using HTTP or HTML (not all clients have JS).
(I specifically want to avoid the "cache-busting" pattern of appending version numbers to the /data.txt URL, like /data.txt?v=2. This fills the cache with useless entries rather than replacing expired ones.)
Edit for clarity: I specifically want to cache /data.txt for a long time, so telling the client not to cache it is unfortunately not what I'm looking for (for this question). I want /data.txt to be cached forever until the server chooses to invalidate it. But since the user never re-requests /data.txt, I need to invalidate it as a side effect of another request (for /foo.html).
To expand on my comment:
You can use If-Modified-Since and ETag, and to invalidate a resource that has already been downloaded you can look at the different approaches suggested in "Clear the cache in JavaScript" and "fetch(), how do you make a non-cached request?". Most of the suggestions there involve fetching the resource from JavaScript with caching disabled, e.g. fetch(url, {cache: "no-store"}).
Or you can try sending a Clear-Site-Data header, if your clients' browsers support it.
Or maybe give up, this time only, and use the cache-busting solution. If possible, rename the file to something else rather than adding a query string, as suggested in "Revving Filenames: don't use querystring".
Update after clarification:
If you are not maintaining a legacy implementation with users that already have /data.txt cached, the use of ETag and If-Modified-Since headers should help.
For the users with cached versions, you may redirect from /foo.html to /newFile.txt or /data.txt?v=1. The new requests will then receive the newly added headers.
The first step is to fix your cache headers on the data.txt resource so it uses your desired cache policy (perhaps Cache-Control: no-cache in conjunction with an ETag for conditional validation). Otherwise you're just going to have this problem over and over again.
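That policy can be sketched as a Rack-style handler (hypothetical, server-agnostic): `Cache-Control: no-cache` lets the browser keep the file but forces it to revalidate before each use, and the ETag lets revalidation answer 304 with no body when nothing has changed.

```ruby
require 'digest'

# Rack-style handler for /data.txt: always revalidate, answer 304 when
# the client's cached copy (sent via If-None-Match) is still current.
DATA_TXT = lambda do |env|
  body = 'current contents of data.txt'
  etag = %("#{Digest::MD5.hexdigest(body)}")
  headers = { 'Content-Type'  => 'text/plain',
              'Cache-Control' => 'no-cache',
              'ETag'          => etag }
  if env['HTTP_IF_NONE_MATCH'] == etag
    [304, headers, []]        # cached copy is still valid, send no body
  else
    [200, headers, [body]]
  end
end
```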
The next step is to get clients who have it in their cache already to re-request it. In general there's no automatic way to achieve this, but if you know they're accessing foo.html then it should be possible. On that page you can make an AJAX request to data.txt with the Cache-Control: no-cache request header. That should force the browser to bypass the cache and get a fresh version, and the cache should then be repopulated with the new version.
(At least, that's how it's supposed to work. I've never tried this, and I've seen reports here that browsers don't handle Cache-Control request headers properly.)
I'm trying to understand the Google CDN behavior in the following scenario:
Let's assume I have a backend service serving chunked HTTP data. For the sake of the explanation, let's assume that serving a single request takes up to 10s.
Let's imagine the case where a file is requested through the CDN by a client A, and that this file is not currently cached in the CDN. The request will go to the backend service, which starts serving the file. Client A will immediately start receiving HTTP chunks.
After 5s, another client B requests the same file. I can envision 3 possible behaviors, but I can't figure out how to control this through the CDN configuration:
Option a: the CDN simply passes the request to the backend service, ignoring that half of the file has already been served and could already be cached. Not desirable, as the backend service is reached twice and serves the same data twice.
Option b: the CDN puts the second request on hold, waiting for the first one to finish before serving client B from its cache (in that case, request B never reaches the backend service). OK, but still not great, as client B will wait 5s before getting any HTTP data.
Option c: the CDN immediately serves the first half of the HTTP chunks and then serves the remaining chunks at the same pace as request A. Ideal!
Any ideas on the current behavior? And what could we do to get option C, which is by far our preferred one?
Tnx, have a great day!
Jeannot
It is important to note that GFE historically cached only complete responses and stored each response as a single unit. As a result, the current behavior will follow option A. You can take a look at this help center article for more details.
However, with the introduction of Chunk caching, which is currently in Beta, large response bodies are treated as a sequence of chunks that can each be cached independently. Response bodies less than or equal to 1 MB in size can be cached as a unit, without using chunk caching. Response bodies larger than 1 MB are never cached as a unit. Such resources will either be cached using chunk caching or not cached at all.
Only resources for which byte range serving is supported are eligible for chunk caching. GFE caches only chunk data received in response to byte range requests it initiated, and GFE initiates byte range requests only after receiving a response that indicates the origin server supports byte range serving for that resource.
To be clear, once chunk caching is in GA, you will be able to achieve your preferred option C.
Regarding your recent query, unfortunately, only resources for which byte range serving is supported are eligible for chunk caching at the moment. You can definitely create a feature request for your use case at Google Issue Trackers.
The good news is that chunk caching with Cloud CDN is now in GA and you can check the functionality anytime you wish.
I'm working on a simple Ruby script with a CLI that will allow me to browse certain statistics inside the terminal.
I'm using API from the following website: https://worldcup.sfg.io/matches
require 'httparty'
url = "https://worldcup.sfg.io/matches"
response = HTTParty.get(url)
I have two goals in mind. First is to somehow save the JSON response (I'm not using a database) so I can avoid unnecessary requests. Second is to check if new data is available and, if it is, to overwrite the previously saved response.
What's the best way to go about this?
... with a CLI ...
So caching in memory is likely not available to you, since the process exits after each run. In this case you can save the response to a file on disk.
Second is to check if the new data is available, and if it is, to override the previously saved response.
The thing is, how can you check whether new data is available without making a request for the data? It's not possible (given the information you provided). So you can simply keep fetching the data every 5 minutes or so and updating your local file.
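One way to do that (a sketch; the file path and TTL are arbitrary choices): persist the last response to a file and only re-fetch once it is older than the TTL. The fetcher is passed in as a block, so it works equally with HTTParty, Net::HTTP, or anything else.

```ruby
# Return the cached body if the file is younger than ttl seconds;
# otherwise call the block to fetch fresh data and overwrite the file.
def cached_fetch(path, ttl)
  fresh = File.exist?(path) && (Time.now - File.mtime(path)) < ttl
  return File.read(path) if fresh
  body = yield                 # e.g. HTTParty.get(url).body
  File.write(path, body)       # overwrite the previously saved response
  body
end
```

Usage with the question's API would look like `cached_fetch('matches.json', 300) { HTTParty.get('https://worldcup.sfg.io/matches').body }`.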
I have an application that downloads data via NSURLConnection in the form of a JSON object; it then displays the data to the user. As new data may be created on the server at any point, what is the best way to 'realise' this and download this data?
At the moment I am planning on having the application download all the data every 30-40 seconds, and then check the downloaded data against the current data: if it is the same, do nothing; if it is different, proceed with the alterations. However, this seems a bit wasteful, especially as the data may not change for a while. Is there a more efficient way of updating the application data when new server data is created?
Use ETag if the server supports it.
Wikipedia ETag
"If the resource content at that URL ever changes, a new and different ETag is assigned."
You could send an HTTP HEAD request to the server with the If-Modified-Since header set to the time you received the last version. If the server handles this correctly, it should return 304 (Not Modified) while the file is unchanged; as soon as it doesn't return that, you GET the file and proceed as usual.
See HTTP/1.1: Header Field Definitions
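The mechanics look like this in Ruby's stdlib (the question is about NSURLConnection, but the HTTP exchange is identical). Note this sketch uses a conditional GET rather than a separate HEAD: the server answers 304 with no body when nothing changed, and sends the new body otherwise, so one request covers both steps.

```ruby
require 'net/http'
require 'time'
require 'uri'

# Build a GET request that asks the server to answer 304 (Not Modified)
# unless the resource changed after last_fetched_at.
def conditional_request(uri, last_fetched_at)
  req = Net::HTTP::Get.new(uri)
  req['If-Modified-Since'] = last_fetched_at.httpdate
  req
end

# Returns the new body, or nil if the server reports nothing changed.
def fetch_if_modified(uri, last_fetched_at)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(conditional_request(uri, last_fetched_at))
  end
  res.is_a?(Net::HTTPNotModified) ? nil : res.body
end
```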
I am creating a dashboard application in which I show information about servers. I have a servlet called "poller.java" that collects information from the servers and sends it back to a client.jsp file. In client.jsp, I make AJAX calls every 2 minutes to the poller.java servlet in order to get information about the servers.
The client.jsp file shows information in the form of a table like
server1 info
server 2 info
Now I want to add one more piece of functionality: when the user clicks on server1, I should show a separate page (call it server1.jsp) containing the timestamps at which the AJAX calls were made by client.jsp and the server information that was retrieved. This information is available in my client.jsp page. But how do I show it on the next page?
Initially, I thought of writing it to a file and then retrieving it in my server1.jsp file. But I don't think that is a good approach. I am sure I am missing a much simpler way to do this. Can someone help me?
You should name your servlet Poller.java, not poller.java; class names should always start with an uppercase letter. You can implement your servlet to forward to a different page: for example, if somebody clicks on server1, the servlet forwards to server1.jsp. Have a look at RequestDispatcher for this. Passing information between requests should be done with request attributes. If you need to retain the information over several requests, you could think about using the session.
In the .NET world, we use SessionState to maintain data that must persist between requests. Surely there's something similar for JSP? (The session object, perhaps.)
If you can't use session state in a servlet, you're going to have to fall back on a physical backing store. I'd use a database, or a known standard file format (like XML). Avoid home-brew file formats that require you to write your own parser.