CloudFront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, css and js files for my website using the custom origin option with subdomains CNAMEd to my account. It works pretty well.
Main site: www.mainsite.com
static1.mainsite.com
static2.mainsite.com
Sample page: www.mainsite.com/summary/page1.htm
This page calls an image from static1.mainsite.com/images/image1.jpg
If CloudFront has not already cached the image, it fetches it from www.mainsite.com/images/image1.jpg
This all works fine.
The problem is that Google Alerts has reported the page as being found at both:
http://www.mainsite.com/summary/page1.htm
http://static1.mainsite.com/summary/page1.htm
The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.
I have tried adding a mod_rewrite rule to the .htaccess file, and I have also tried putting an exit() in the main script file.
But when CloudFront does not find the static1 version of the file in its cache, it requests it from the main site and then caches it.
Questions then are:
1. What am I missing here?
2. How do I prevent my site from serving pages to CloudFront, rather than just static components?
3. How do I delete the pages that are already in CloudFront? Just let them expire?
Thanks for your help.
Joe

[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
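If you prefer to script steps 1 and 5 above, a rough boto3 sketch could look like the following; the bucket name and distribution ID are placeholders for your own:

import time
import boto3

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# A disallow-all robots.txt that will only be served for the CloudFront domain.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Step 1: upload robots.txt to the dedicated bucket (hypothetical name).
s3.put_object(
    Bucket="my-cdn-robots-bucket",
    Key="robots.txt",
    Body=ROBOTS_TXT.encode("utf-8"),
    ContentType="text/plain",
)

# Step 5: tell CloudFront to drop its cached copy (hypothetical distribution ID).
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/robots.txt"]},
        "CallerReference": str(time.time()),
    },
)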

You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.
In CloudFront you can control the hostname that CloudFront uses to access your origin server. I suggest giving CloudFront a dedicated hostname that is different from your regular website hostname. That way you can detect requests to that hostname and serve a robots.txt which disallows everything (unlike your regular website's robots.txt).
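For example, if (hypothetically) your origin were a Python/Flask app and you gave CloudFront a dedicated origin hostname such as cdn-origin.mainsite.com, the check could look like this sketch (the hostname and robots policies are illustrative only):

from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical hostname given to CloudFront as its custom origin.
CDN_ORIGIN_HOST = "cdn-origin.mainsite.com"

@app.route("/robots.txt")
def robots_txt():
    if request.host == CDN_ORIGIN_HOST:
        # Requests arriving via the CDN origin hostname: disallow everything.
        body = "User-agent: *\nDisallow: /\n"
    else:
        # Requests to the regular website hostname: use your normal policy.
        body = "User-agent: *\nAllow: /\n"
    return Response(body, mimetype="text/plain")

CloudFront then caches that disallow-all file and serves it on the static1/static2 hostnames, while the www site keeps its own robots.txt.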


IPFS X-Ipfs-Path on static images referenced on a dynamic non-IPFS https page forces localhost gateway to load over https

I'm trying to use IPFS to load static content, such as images and JavaScript libraries, on a dynamic, non-IPFS site served over HTTPS.
For example https://www.example.com/ is a normal web 2.0 page, with an image reference here https://www.example.com/images/myimage.jpg
When the request is made on myimage.jpg, the following header is served
x-ipfs-path: /ipfs/QmXXXXXXXXXXXXXXXXX/images/myimage.jpg
Which then gets translated by the IPFS Companion browser plugin as:
https://127.0.0.1:8081/ipfs/QmXXXXXXXXXXXXXXXXX/images/myimage.jpg
The problem is that it has redirected to an HTTPS URL on the local IP, which won't load due to a protocol error (changing the above from https to http works).
Now, if I were to request https://www.example.com/images/myimage.jpg directly from the address bar, it loads the following:
http://localhost:8081/ipfs/QmYcJvDhjQJrMRFLsuWAJRDJigP38fiz2GiHoFrUQ53eNi/images/myimage.jpg
And then a 301 to:
http://(some other hash).ipfs.localhost:8081/images/myimage.jpg
Resulting in the image loading successfully.
I'm assuming that because the initial page is served over SSL, it wants to serve the static content over SSL as well, and that this is also why it uses the local IP over https rather than localhost as in the other route.
My question is, how do I get this to work?
Is there a header that tells IPFS Companion to load it over http? If so, I assume this would cause browser security warnings due to mixed content. I have tried adding this header without luck: X-Forwarded-Proto: http
Do I need to do something to enable SSL on 127.0.0.1 and connect that up with my local node? If so, this doesn't seem to be the default setup for clients, and I worry that all the content will show as broken images for anyone who does not follow some extra steps.
Is it even possible to serve static content over IPFS from non-IPFS pages?
Any hints appreciated!
Edit: This appears to affect both the Chrome engine and Firefox.
Looks like a configuration error on your end.
Using IPFS Companion with default settings on a clean browser profile works as expected.
Opening: https://example.com/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi redirects fine to http://localhost:8080/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi which then redirects to unique Origin based on the root CID: http://bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi.ipfs.localhost:8080/
You use a custom port (8081), which means you changed the Gateway address in ipfs-companion Preferences at some point.
Potential fix: go there and make sure your "Local gateway" is set to http://localhost:8081 (instead of https://).
If you already have http:// there, then see whether some other extension or browser setting is forcing https:// (check whether this behavior occurs on a clean browser profile, then add extensions/settings one by one to identify the source of the problem).
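If you want to double-check what the local gateway actually speaks, a quick Python sketch (assuming a local daemon with its gateway on port 8081, and reusing the example CID above) will show that it answers over plain HTTP while an https:// request to the same port fails:

import requests

CID = "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"

# Plain HTTP should work (typically replying with a redirect to the subdomain gateway).
resp = requests.get(f"http://localhost:8081/ipfs/{CID}", allow_redirects=False)
print("http:", resp.status_code, resp.headers.get("Location"))

# HTTPS to the same port should fail, because the local gateway does not terminate TLS.
try:
    requests.get(f"https://localhost:8081/ipfs/{CID}", timeout=5)
except requests.exceptions.RequestException as err:
    print("https failed as expected:", err)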

Specifying index.html in browser to load home page

If I want to load the homepage of https://medium.com/ by typing the exact index.html file address into my browser, how would I do that? Or is it not possible?
https://medium.com/index.html gives me a 404 error. I'm also curious how I would do this more broadly with any webpage for which my browser is displaying a URL that does not end in .html.
Static websites hosted simply as files usually have an index.html document that can be requested directly, and that is also served when no particular document is specified, so https://example.com/ and https://example.com/index.html both work.
But this is not how most websites work. Pages can be generated dynamically on the server side: you send a request to the server, and if the path matches some server operation, it creates a response for you. Unless https://example.com/ serves documents from a directory using something classic like the Apache Web Server configured to serve static files, requesting /index.html won't work.
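As a small illustration (a hypothetical Python/Flask app, not how medium.com is actually built), a dynamic site only responds to the routes its code defines, so / works while /index.html returns 404 unless someone explicitly adds it:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    # The homepage is generated by code; there is no index.html file on disk.
    return "<h1>Homepage</h1>"

# Nothing maps /index.html, so GET /index.html returns 404, which is the same
# behaviour the question describes for medium.com.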
There is no general way to know what, if any, URLs for a given website resolve to duplicates of the homepage (or any other page).
Dynamically generated sites, in particular, tend not to have alternative URLs for pages.

How to update webpages of a site hosted on AWS S3?

After hosting a website on S3, how can we change the text in its webpages? I deleted the older HTML files from the bucket and uploaded new files with the same names and updated text in the code, but no changes were reflected after refreshing those webpages.
Is there any other way to update the webpages of a website already hosted on S3? If so, would somebody please post the steps to make those updates? TIA.
I notice you have CloudFront in your tags, so that is most likely the issue. When you upload a file to S3, CloudFront won't know about it right away if it replaces an existing file. Instead, by default CloudFront waits 24 hours before checking your origin (in this case your S3 bucket) to see whether any changes have been made and whether it needs to update the cache. There are a few ways to make it pick up the new files:
Using files with versions in their names, and updating links. The downside is that you have to make more changes than normal to get this to work.
Invalidating the cache. This is not what Amazon recommends, but it is nonetheless a quick way to make the cache pick up new changes right away. Note that there can be charges if you do a lot of invalidations:
No additional charge for the first 1,000 paths requested for invalidation each month. Thereafter, $0.005 per path requested for invalidation
Using Behaviors:
Here you can assign a path (an individual file, a folder, etc.) and adjust certain properties. One of them is the TTL (Time To Live) of the path in question. If you make the TTL smaller, CloudFront will pick up changes more quickly. However, since you have an S3 origin, note that a shorter TTL means more requests back to the bucket. CloudFront will also need some time to distribute these changes to all the edge servers.
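A related option, if you would rather control this per object than per behavior: a rough boto3 sketch (the bucket name is a placeholder) that re-uploads a page with a short Cache-Control header, which CloudFront respects within the behavior's minimum/maximum TTL:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and file; substitute your own.
with open("index.html", "rb") as f:
    s3.put_object(
        Bucket="my-website-bucket",
        Key="index.html",
        Body=f,
        ContentType="text/html",
        # Respected by CloudFront (within the behavior's min/max TTL) and by browsers.
        CacheControl="max-age=300",
    )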
Hope this helps.
In case you are not using CloudFront, but just a normal static S3 website: check whether your browser is caching the pages.
Chrome, at least, does. So updates to the pages in S3 might not be visible until you clear the browser cache.
In Chrome you can clear the cache as follows:
Open Settings
Search for 'cache'
Remove cached images and files

get around cross-origin resource sharing on Amazon Aws

I am creating a Virtual reality 360-degree video website using the krpano html5 player.
This was going great until I tested on Safari and realised it didn't work. The reason is that Safari does not support CORS (cross-origin resource sharing) for videos going through WebGL.
To clarify, if my videos were on the same server as my application files it would work, but because I have my files hosted on Amazon S3, they are cross-origin. Now I'm unsure what to do, because I have built my application on DigitalOcean, which connects to my Amazon S3 bucket, but I cannot afford to upgrade my droplet just to get the storage I need (which is around 100GB to start and will grow to terabytes as my video collection gets bigger).
So does anyone know a way I can get around this and make it seem like the video is not coming from a different origin, or alternatively anything else I can do to get past this obstacle?
Is there any way I could set up Amazon S3 and Amazon EC2 so that they don't see each other as different origins?
EDIT:
I load my videos like this:
<script>
function showVideo() {
    embedpano({
        swf: "/krpano/krpano.swf",
        xml: "/krpano/videopano.xml",
        target: "pano",
        html5: "only",
    });
}
</script>
This then calls my xml file which calls the video file:
<krpano>
    <!-- add the video sources and play the video -->
    <action name="add_video_sources">
        videointerface_addsource('medium', 'https://s3-eu-west-1.amazonaws.com/myamazonbucket/Shoots/2016/06/the-first-video/videos/high.mp4|https://s3-eu-west-1.amazonaws.com/myama…ideos/high.webm');
        videointerface_play('medium');
    </action>
</krpano>
I don't know exactly how the krpano core works; I assume the JavaScript gets the URLs from the XML file and then makes a request to pull them in.
#datasage mentions in comments that CloudFront is a common solution. I don't know if this is what he was thinking of but it certainly will work.
I described using this solution to solve a different problem, in detail, on Server Fault. In that case, the question was about integrating the main site and "/blog/*" from a different server under a single domain name, making a unified web site.
This is exactly the same thing you need, for a different reason.
Create a CloudFront distribution, setting the alternate domain name to your site's name.
Create two (or more) origin servers pointing to your dynamic and static content origin servers.
Use one of them as default, initially handling all possible path patterns (*, the default cache behavior) and then carve out appropriate paths to point to the other origin (e.g. /asset/* might point to the bucket, while the default behavior points to the application itself).
In this case, CloudFront is being used for something other than its primary purpose as a CDN; instead, we're leveraging a secondary capability, using it as a reverse proxy that can selectively route requests to multiple back-ends based on the path of the request, without the browser being aware that there are in fact multiple origins, because everything sits behind the single hostname that points to CloudFront (which, obviously, you'll need to point at CloudFront in DNS).
The caching features can be disabled if you don't yet want, need, or fully understand them. For requests to the application itself, disabling caching is easily done by selecting the option to forward all request headers to the origin in any cache behavior that sends requests to the application. For your objects in S3, be sure you've set appropriate Cache-Control headers on the objects when you uploaded them, or you can add them after uploading, using the S3 console.
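If you would rather script that than use the console, a rough boto3 sketch (reusing the bucket and key from the question's URLs purely as an illustration) rewrites an object's headers in place; S3 only lets you change metadata by copying the object over itself:

import boto3

s3 = boto3.client("s3")

BUCKET = "myamazonbucket"
KEY = "Shoots/2016/06/the-first-video/videos/high.mp4"

# Copy the object over itself with REPLACE to rewrite its stored headers.
s3.copy_object(
    Bucket=BUCKET,
    Key=KEY,
    CopySource={"Bucket": BUCKET, "Key": KEY},
    MetadataDirective="REPLACE",
    ContentType="video/mp4",
    CacheControl="public, max-age=86400",
)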
Side bonus: using CloudFront allows you to easily enable SSL for the entire site, with a free SSL certificate from AWS Certificate Manager (ACM). The certificate needs to be created in the us-east-1 region of ACM, regardless of where your bucket is, because that is the region CloudFront uses when fetching the cert from ACM. This is a provisioning detail only and has no performance implications if your bucket is in another region.
You need to allow your host in the CORS configuration of your AWS S3 bucket.
Refer to Add CORS Configuration in Editing Bucket Permissions.
After that, every request you make to the S3 bucket files will have the CORS headers set.
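If you would rather apply the CORS configuration programmatically than through the console, a minimal boto3 sketch (the bucket name is taken from the question and the allowed origin is a placeholder for your own site) looks roughly like this:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="myamazonbucket",
    CORSConfiguration={
        "CORSRules": [
            {
                # Allow your site (the DigitalOcean-hosted application) to read the videos.
                "AllowedOrigins": ["https://www.yoursite.com"],
                "AllowedMethods": ["GET", "HEAD"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)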
In case you need to serve the content via the AWS CDN, CloudFront, then follow these steps; ignore them if you serve content directly via S3:
Go to AWS CloudFront Console.
Select your CloudFront Distribution.
Go to Behaviors Tab.
Create a Behavior (for the files which need to be served with CORS headers).
Enter the Path Pattern, and select the Protocol & Methods.
Select All in the Forward Headers option.
Save the behavior.
If needed, invalidate the CloudFront edge caches by running an invalidation request for the files you just allowed for CORS.

Is there a way to whitelist the master in HTML5 application cache?

Here is my manifest:
CACHE MANIFEST
CACHE:
//code.jquery.com/jquery-2.0.3.min.js
NETWORK:
*
My index.html, which references that manifest (<html manifest="app.manifest">), is always stored as a "master" entry, even with the NETWORK wildcard section in my manifest.
The problem is that my master index.html is stored in the cache... and won't be refreshed when it changes on the server side unless the manifest file is also updated.
I've seen multiple not-so-beautiful solutions to that problem (like the iframe solution), so my question is: is there a clean HTML5 way to do this?
The clean way to do it is to have only static content in your index.html file and then load the data dynamically (e.g. via AJAX) to build the page the user sees. An alternative would be to have a big link which says 'Enable Offline Support' that points to a page containing the manifest link.
Other than that, the iframe solution is the cleanest way. You're hacking around the intended use of AppCache, so why do you expect it to be 'clean'? What application scenario do you have where jquery-2.0.3.min.js needs to be available offline but not the index page of the app which accesses it?