How to update webpages of a site hosted on AWS S3?

After hosting a website on S3, how can we change the text in its webpages? I deleted the older HTML files from the bucket and uploaded new files with the same names and updated text in the code, but no changes were reflected after refreshing those webpages.
Is there any other way to update the webpages of a website already hosted on S3? If so, would somebody please post the steps to make those updates? TIA.

I notice you have CloudFront in your tags, so that is most likely the issue. When you upload a new version of an existing file to S3, CloudFront won't know about it right away. Instead, it caches objects for a default of 24 hours before checking your origin (in this case your S3 bucket) to see if any changes have been made and whether it needs to update the cache. There are a few ways to make it pick up the new files:
Using versioned file names and updating the links that point to them. The downside is that you have to make more changes than normal to get this to work.
Invalidating the cache (see the sketch after this list). This is not what Amazon recommends, but it is nonetheless a quick way to make the cache pick up new changes right away. Note that there can be charges if you do a lot of invalidations:
No additional charge for the first 1,000 paths requested for invalidation each month. Thereafter, $0.005 per path requested for invalidation
Using Behaviors:
Here is where you can assign a path (an individual file, a folder, etc.) and adjust certain properties. One of them is the TTL (Time To Live) of the path in question. If you make the TTL a smaller value, CloudFront will pick up changes more quickly. However, since you have an S3 origin, note that a lower TTL also means more requests to S3. Also, CloudFront will need some time to distribute these changes to all the edge servers.
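For reference, a minimal sketch of an invalidation using the AWS SDK for JavaScript (v2); the distribution ID is a placeholder, and your credentials must allow cloudfront:CreateInvalidation:

var AWS = require('aws-sdk');
var cloudfront = new AWS.CloudFront();

cloudfront.createInvalidation({
  DistributionId: 'E1XXXXXXXXXXXX',         // placeholder distribution ID
  InvalidationBatch: {
    CallerReference: Date.now().toString(), // must be unique per request
    Paths: {
      Quantity: 1,
      Items: ['/*']  // invalidate everything; each path counts toward the 1,000 free paths
    }
  }
}, function (err, data) {
  if (err) console.error(err);
  else console.log('Invalidation started:', data.Invalidation.Id);
});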
Hope this helps.

In case you are not using CloudFront, but just a normal static S3 website: check whether your browser is caching the pages.
Chrome, at least, does. So updating the pages in S3 might not be visible until you clear the browser cache.
In Chrome you can clear the cache as follows:
Open settings
Search for 'cache'
Clear 'Cached images and files'.
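To avoid this for future updates, one option is to upload pages with a Cache-Control header that makes browsers revalidate. A minimal sketch with the AWS SDK for JavaScript (v2); the bucket and file names are placeholders:

var AWS = require('aws-sdk');
var fs = require('fs');
var s3 = new AWS.S3();

s3.putObject({
  Bucket: 'my-site-bucket',              // placeholder bucket name
  Key: 'index.html',
  Body: fs.readFileSync('index.html'),
  ContentType: 'text/html',
  CacheControl: 'no-cache'               // browsers must revalidate before reusing the page
}, function (err) {
  if (err) console.error(err);
  else console.log('Uploaded with Cache-Control set');
});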

Related

How can I show the S3 files for download with index.html?

I have set up an S3 bucket to host static files.
When using the website endpoint (http://.s3-website-us-east-1.amazonaws.com/): it forces me to set an index file, and when a file isn't found, it throws an error instead of listing the directory contents.
When using the S3 endpoint (.s3.amazonaws.com): I get an XML listing of the files, but I need an HTML listing where users can click a link to each file.
I have tried setting the permissions of all files and the bucket itself to "List" for "Everyone" in the AWS Console, but still no luck.
I have also tried some of the JavaScript alternatives, but they either don't work under the website URL (which redirects to the index file) or just don't work at all. As a last resort, a collapsible JavaScript listing would be better than nothing, but I haven't found a good one.
Is this possible? If so, do I need to change permissions, ACL or something else?
I've created a simple bit of JS that creates a directory index in the HTML style you are looking for: https://github.com/rgrp/s3-bucket-listing
The README has specific instructions for handling Amazon S3 "website" buckets: https://github.com/rgrp/s3-bucket-listing#website-buckets
You can see a live example of the script in action on this s3 bucket (in website mode): http://data.openspending.org/
There is also this solution: https://github.com/caussourd/aws-s3-bucket-listing
It is similar to https://github.com/rgrp/s3-bucket-listing, but I couldn't make that one work with Internet Explorer. https://github.com/caussourd/aws-s3-bucket-listing works with IE and also adds the ability to order the files by name, size and date. On the downside, it doesn't follow folders: only the files at one level are displayed.
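For the curious, here is a minimal sketch of what these listing scripts do under the hood: fetch the bucket's REST endpoint (the XML listing mentioned in the question) and render it as HTML links. It assumes the bucket allows public listing and, if the page is served from another origin, a CORS rule permitting GET; the bucket URL is a placeholder:

var bucketUrl = 'https://my-bucket.s3.amazonaws.com/';   // placeholder bucket URL

fetch(bucketUrl + '?list-type=2')                        // ListObjectsV2 REST call
  .then(function (res) { return res.text(); })
  .then(function (xmlText) {
    var xml = new DOMParser().parseFromString(xmlText, 'application/xml');
    var keys = xml.getElementsByTagName('Key');
    var list = document.createElement('ul');
    for (var i = 0; i < keys.length; i++) {
      var li = document.createElement('li');
      var a = document.createElement('a');
      a.href = bucketUrl + keys[i].textContent;  // keys with special characters may need encoding
      a.textContent = keys[i].textContent;
      li.appendChild(a);
      list.appendChild(li);
    }
    document.body.appendChild(list);
  });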
This might solve your problem. Security settings for Everyone group:
(you need the bucketexplorer.com software for this)
If you are sharing files over HTTP, you may or may not want people to be able to list the contents of a bucket (folder). If you want the bucket contents to be listed when someone enters the bucket name (http://s3.amazonaws.com/bucket_name/), then edit the Access Control List and give the Everyone group Read access (and do likewise with the contents of the bucket). If you don't want the bucket contents listable but do want to share the files within it, disable Read access on the bucket itself for the Everyone group, and then enable Read access for the individual files within the bucket.
I created a much simpler solution. Just place the index.html file in the root of your folder and it will do the job. No configuration required. https://github.com/prabhatsharma/s3-directorylisting
I had a similar problem and created a JavaScript-and-iframe solution that works pretty well for listing directories in S3 website files. You just have to drop a couple of .html files into the directory you want to list. You can find it here:
https://github.com/adam-p/s3-file-list-page
I found s3browser, which allowed me to set up a directory on the main web site that allowed browsing of the s3 bucket. It worked very well and was very easy to set up.
Another approach is based on pure JavaScript and the AWS SDK for JavaScript API. There is no need for PHP or another server-side engine, just a plain website (Apache or even IIS).
https://github.com/juvs/s3-bucket-browser
It is not intended to be deployed on your own bucket (to me, that makes no sense).
Using IAM users from AWS, you can provide more specific and secure access to your buckets. There is no need to publish your bucket as a website and make everything public.
If you want to secure the access, you can use the conventional methods to authenticate users for your current website.
Hope this helps too!
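As a rough illustration of that approach (not the library's actual code), a sketch using the AWS SDK for JavaScript (v2) loaded on the page; all names and keys are placeholders, and in practice you would obtain scoped credentials after authenticating the user rather than embedding them:

AWS.config.update({
  region: 'us-east-1',   // assumption: the bucket's region
  credentials: new AWS.Credentials('AKIA_PLACEHOLDER', 'SECRET_PLACEHOLDER')
});

var s3 = new AWS.S3();
s3.listObjectsV2({ Bucket: 'my-private-bucket' }, function (err, data) {
  if (err) { console.error(err); return; }
  data.Contents.forEach(function (obj) {
    console.log(obj.Key, obj.Size, obj.LastModified);  // render as links instead of logging
  });
});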

Cloudfront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, CSS and JS files for my website, using the custom origin option with subdomains CNAMEd to my account. It works pretty well.
Main site: www.mainsite.com
static1.mainsite.com
static2.mainsite.com
Sample page: www.mainsite.com/summary/page1.htm
This page calls an image from static1.mainsite.com/images/image1.jpg
If CloudFront has not already cached the image, it gets the image from www.mainsite.com/images/image1.jpg
This all works fine.
The problem is that Google Alerts has reported the page as being found at both:
http://www.mainsite.com/summary/page1.htm
http://static1.mainsite.com/summary/page1.htm
The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.
I have tried to put a mod_rewrite in the .htaccess file, and I have also tried to put an exit() in the main script file.
But when Cloudfront does not find the static1 version of the file in its cache, it calls it from the main site and then caches it.
Questions then are:
1. What am I missing here?
2. How do I prevent my site from serving pages, instead of just static components, to CloudFront?
3. How do I delete the pages from CloudFront? Just let them expire?
Thanks for your help.
Joe
[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.
In CloudFront you can control the hostname with which CloudFront accesses your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect a request to that hostname and serve a robots.txt which disallows everything (unlike your regular website's robots.txt; a minimal example follows).
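For reference, a minimal robots.txt that disallows all crawling, to be served only on the CloudFront/static hostname:

User-agent: *
Disallow: /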

How do I specify a wildcard in the HTML5 cache manifest to load all images in a directory?

I have a lot of images in a folder that are used in the application. When using the cache manifest, it would be easier maintenance-wise if I could specify a wildcard to load all the images or files in a certain directory to be cached.
E.g.
CACHE MANIFEST
# 2011-11-3-v0.1.8
#--------------------------------
# Pages
#--------------------------------
../index.html
../edit.html
#--------------------------------
# JavaScript
#--------------------------------
../js/jquery.js
../js/main.js
#--------------------------------
# Images
#--------------------------------
../img/*.png
Can this be done? I have tried it in a few browsers with ../img/* as well, but it doesn't seem to work.
It would be easier, but how's it going to work? The manifest file is something which is parsed and acted upon in the browser, which has no special knowledge of files on your server other than what you've told it. If the browser sees this:
../img/*.png
What is the first image the browser should request from the server? Let's start with these:
../img/1.png
../img/2.png
../img/3.png
../img/4.png
...
../img/2147483647.png
That's all the images that might exist with a numeric name, stopping semi-arbitrarily at 2^31 - 1. How many of those 2 billion files exist in your img directory? Do you really want a browser making all those requests only to get 2 billion 404s? For completeness the browser would probably also want to request all the zero-filled equivalents:
../img/01.png
../img/02.png
../img/03.png
../img/04.png
...
../img/001.png
../img/002.png
../img/003.png
../img/004.png
...
../img/0001.png
../img/0002.png
../img/0003.png
../img/0004.png
...
Now the browser's made more than 4 billion HTTP requests for files which mostly aren't there, and it's not yet even got on to letters or punctuation in constructing the possible filenames which might exist on the server. This is not a feasible way for the manifest file to work. The server is where the files in the img directory are known, so it's on the server that the list of files has to be constructed.
I don't think it works that way. You'll have to specify all of the images one by one, or have a simple PHP script loop through the directory and output the file list (with the correct text/cache-manifest header, of course).
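A sketch of the same idea in Node.js rather than PHP (the directory layout, file names, and port are assumptions): the server expands the wildcard and serves the result as a manifest. The version comment is derived from the newest file's modification time, so the manifest only changes when the images do:

var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  var images = fs.readdirSync(__dirname + '/img')
    .filter(function (f) { return f.slice(-4) === '.png'; });
  var newest = Math.max.apply(null, images.map(function (f) {
    return fs.statSync(__dirname + '/img/' + f).mtime.getTime();
  }));

  res.writeHead(200, { 'Content-Type': 'text/cache-manifest' });
  res.end(
    'CACHE MANIFEST\n' +
    '# images version ' + newest + '\n' +   // changes only when an image changes
    '../index.html\n../edit.html\n' +
    '../js/jquery.js\n../js/main.js\n' +
    images.map(function (f) { return '../img/' + f; }).join('\n') + '\n'
  );
}).listen(8080);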
It would be a big security issue if browsers could request folder listings - that's why Tomcat turns that capability off by default now.
But the browser could, in principle, locate all matches to the wildcards referenced by the pages it caches. This approach would still be problematic (what about images not initially used but set dynamically by JavaScript, for example?), and it would require that all cached items be not only downloaded but parsed as well.
If you are trying to automate this process instead of doing it manually, use a script, or do as I do and use manifestR. It will output your manifest/appcache file, and all you have to do is copy and paste. I've used it successfully and usually only have to make a few changes.
Also, I recommend using the NETWORK header with the wildcard:
NETWORK:
*
This allows assets from other linked domains (fetched via JSON, for instance) to be requested over the network even though they are not in the cache. I believe this is the only section where you can specify a wildcard. Like the others have said here, it's for security reasons.
The cache manifest is now deprecated and you should use HTML headers to control caching.
For example:
<meta http-equiv="Cache-control" content="public">
Public - may be cached in public shared caches.
Private - may only be cached in a private (per-user) cache.
No-Cache - may be cached, but must be revalidated with the server before reuse.
No-Store - may not be stored at all.

Cache Manifest messes up my app when online

Gurus of SO
I am trying to play with CACHE MANIFEST/HTML5. My app is JS-heavy and built on jQuery/jQuery Mobile.
This is an excerpt of what my Manifest looks like
CACHE MANIFEST
FALLBACK:
/
NETWORK:
*
CACHE:
/css/style.css
/js/jquery.js
But somehow, the app doesn't load the files the first time itself and the entire app breaks down.
Is my format wrong?
Should I never load JS into the Cache?
How should I set this up so that the browser always checks the network first, and only loads things from the cache when the network isn't available?
Thank you.
I tried a simple page with your cache manifest and it worked fine for me, so I'm not really sure what the problem is. But,
Yes, there is something wrong with the format. The entries in the FALLBACK section need two parts: a pattern and a URL. This says "if any page matching the pattern is not available offline, display the URL instead (which will be cached)." The main example of this (as shown here) is "/ /offline.html", which means "for all pages, if we are offline and they are not cached, display /offline.html instead." A corrected version of your manifest appears after point 3 below. However, I don't think this is the source of your problem, since I tested it with your exact manifest and it still worked.
There is nothing special about JS files. It should be fine to load them into the cache.
I don't understand the third question. There are possibly two goals here: a) how do you check to see if there is a newer version of the file available online first, before going back to the cache, and b) how do you check the network to see if there is a file that is not cached, and if we are offline, fall back to an error page. The answer to (a) is that once you have turned on the cache manifest, things work very differently. It will never check for new versions of the files unless there is a new version of the manifest also. So you must always update the manifest whenever you change any files. The answer to (b) is the FALLBACK section.
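For reference, here is the manifest from the question with a two-part FALLBACK entry (offline.html is a hypothetical offline page you would also need to create; the version comment is there so you remember to bump it on every change):

CACHE MANIFEST
# v2 - bump this comment whenever any cached file changes
FALLBACK:
/ /offline.html
NETWORK:
*
CACHE:
/css/style.css
/js/jquery.js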
See Dive Into HTML5's excellent chapter on this, particularly the section "The fine art of debugging, a.k.a. “Kill me! Kill me now!”" which explains how the manifest updates.
Also I don't think we've gotten to the meat of your question, because it's unclear what you mean by "the app doesn't load the files the first time itself". Which files don't load? Do they load properly after a refresh? Etc.
The only way I got a cache refresh to work was to rename the manifest file with a commit number or timestamp and change the cache declaration to
<html manifest='mymanifest382330.manifest'>
I made this part of my build.

HTML5 Cache -- Is it possible to have several distinct caches for a single URL?

Every URL can be linked to a single cache manifest. But I want several cache manifests linked to a same URL. Here is the reason:
Some files I want to be cached are rarely updated and large.
So every time the cache gets updated, these large files get re-downloaded even though they may not have changed.
So I want to split up the cache. One cache for theses rarely updated large files and another cache for the often updated light files.
Do you guys have any idea how to split up an HTML5 cache?
The most efficient way is:
a) Use a far-future expiration date (max-age) on all resources mentioned in the manifest's CACHE section, and add a timestamp suffix to each file in the CACHE section, e.g.:
CACHE:
menu_1355817388000.js
toolbar_1355817389100.js
b) When any of the above files change on the server, regenerate/update the manifest to change the timestamp (see the build-script sketch below). Only the file with the modified timestamp will get downloaded next time. Mission accomplished.
Note: Reload the page twice in the browser, as on the first refresh the browser downloads just the manifest and uses the old cached resources to paint the page. This is done to speed up displaying the page (there are tricks to handle this double-refresh issue, but they are outside the scope of your question).
See more info in this long article, the best I have seen on appcache.
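A sketch of step (b) as a Node.js build script, assuming the timestamp suffix is each file's modification time; the file names are hypothetical, and your HTML would need to reference the stamped names:

var fs = require('fs');

var files = ['menu.js', 'toolbar.js'];  // hypothetical source files
var entries = files.map(function (f) {
  var stamp = fs.statSync(f).mtime.getTime();           // e.g. 1355817388000
  var stamped = f.replace('.js', '_' + stamp + '.js');  // menu_1355817388000.js
  fs.copyFileSync(f, stamped);                          // publish the stamped copy
  return stamped;
});

// Rewrite the manifest; it changes only when some file's mtime changed.
fs.writeFileSync('app.appcache',
  'CACHE MANIFEST\nCACHE:\n' + entries.join('\n') + '\n');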
Use an iframe
Your page's cache manifest would include the light files and the cache manifest of an iframe loaded by this page would include the large files
On Chrome the iframe's application cache will also be used for the page. I haven't tested this method on other browsers yet.
see a live example at http://www.timer-tab.com and if you are using chrome see its split up cache at chrome://appcache-internals/
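A sketch of the layout (file names hypothetical): the visible page carries the light manifest, and a hidden iframe carries the heavy one:

<!-- page.html: cached under the light, often-updated manifest -->
<html manifest="light.appcache">
<body>
  ...
  <iframe src="media.html" style="display:none"></iframe>
</body>
</html>

<!-- media.html: near-empty page whose only job is to pull in the heavy manifest -->
<html manifest="media.appcache">
<body></body>
</html>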
When the manifest file is changed and the files of the application cache are downloaded again, the normal HTTP caching rules still apply. This means that if you set the correct HTTP caching headers for these large files, you'll get a 304 so these files are not downloaded again. So it's not necessary to split the application cache.
This may not be a full answer, but I'd like to shed some light on my findings as I troubleshoot my own webapp.
I've discovered that I can use two iframes (manifest_framework and manifest_media) to load the manifests, but I'm still not exactly clear on how they are targeted; I've had limited success.
manifest_framework:
CACHE MANIFEST
CACHE:
appdata.ini
dialog.png
jquery.min.js
login.htm
login.js
manifest.appcache.js
NETWORK:
*
FALLBACK:
manifest_media:
CACHE MANIFEST
CACHE:
manifest_fwk.php
od/audio_track_1_1.m4a
od/audio_track_1_2.m4a
od/audio_track_1_3.m4a
od/audio_track_1_4.m4a
od/video_1.mp4
od/video_2.mp4
od/video_3.mp4
NETWORK:
*
FALLBACK:
./ webapp.php
./index.php is the 'landing page', which itself isn't cached but falls back to webapp.php when offline.
What I don't understand is how these link to the webapp.php page.
I am finding I can only get access to one or the other manifests cache.
The above works in Mobile Safari: the media and images would be cached, but not necessarily the JS or images in the framework manifest.
Anyone have more examples where multiple manifests are referenced from the one URL/page?
The W3C working group has abandoned the FileSystem API, so it SHOULD NOT BE USED anymore.
We'll likely see it dropped from an upcoming version of Chrome.
http://www.w3.org/TR/file-system-api/
CACHE MANIFEST
# This is a comment.
# Cache manifest version 0.0.1
# If you change the version number in this comment,
# the cache manifest is no longer byte-for-byte
# identical.
demoimages/mypic.jpg
demoimages/yourpic.jpg
demoimages/ourpic.jpg
sr/scroll.js
NETWORK:
# All URLs that start with the following lines
# are whitelisted.
# whitelisted items are needed to help the site function, you could put regularly
# changing items here
http://example.com/examplepath/
http://www.example.org/otherexamplepath/
CACHE:
# Additional items to cache.
demoimages/allpics.jpg
FALLBACK:
demoimages/currentImg.jpg images/stockImage.jpg
If the iframe trick does not work, use the HTML5 FileSystem API.
See http://updates.html5rocks.com/2012/04/Taking-an-Entire-Page-Offline-using-the-HTML5-FileSystem-API