How is "Is offline for maintenance" page implemented? - language-agnostic

Occasionally when I try to open a site I will see a page saying something like "This site is offline for maintenance", followed by a note about how long the maintenance is expected to take. Stack Overflow does that too.
How does it work? I mean, if the site is shut down, who replies to my HTTP request and serves this page?

There is a trick in ASP.NET where you place a file called App_Offline.htm in the root of the application.
All requests will be served that file until it is deleted.
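A minimal App_Offline.htm might look like the sketch below (the wording and markup are just an illustration; one known quirk is that some older browsers replace very short error responses with their own "friendly" pages, so the file is often padded past 512 bytes):
<!-- App_Offline.htm: while this file exists in the application root,
     ASP.NET serves it for every request to the application. -->
<!DOCTYPE html>
<html>
<head><title>Offline for maintenance</title></head>
<body>
<h1>We'll be right back</h1>
<p>This site is temporarily offline for maintenance. Please check back shortly.</p>
</body>
</html>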
For other environments you can often just change where the server points, or use a similar approach.
-- Edit
A server-agnostic approach can be achieved with load balancing.
Behind the load balancer you can direct requests to a particular internal server. You might point all requests to server 'a', which is configured to show the 'downtime' page, make your changes to server 'b', confirm they are successful, and then point all requests to 'b'. Finally you update 'a' and let requests go to both again.
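As a sketch of that rotation, assuming an nginx load balancer in front of two hypothetical backends 'a' and 'b' (the addresses are made up):
# Hypothetical nginx load-balancer config. During maintenance, only
# server 'a' (which shows the downtime page) receives traffic;
# server 'b' is commented out while it is being updated.
upstream backend {
    server 10.0.0.1;      # server 'a'
    # server 10.0.0.2;    # server 'b', re-enable after the update
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}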

In ASP.NET (and ASP.NET MVC, as Stack Overflow uses) this is provided by the app_offline.htm feature. It works simply by forwarding all ASP.NET requests to the app_offline.htm file.
Incidentally, the Copy Web Site tool in ASP.NET performs this process: it places the file in the root of the web app, copies the web site files, and then deletes it.
Strategies for other technologies are discussed here.

In Apache you may use a .htaccess file with this content:
Order deny,allow
Allow from 192.168.1.151
Deny from all
# the error page itself must stay readable, or the 403 handler loops
<Files "404.html">
Order allow,deny
Allow from all
</Files>
ErrorDocument 403 /404.html
ErrorDocument 404 /404.html
ErrorDocument 500 /404.html
This will deny access to everyone except one IP address and serve a static 404.html file to everyone else.
This works in the case where you only have one server, without load balancing and the like, though it should work with load balancing too.

An Apache reverse proxy can be configured to send that response as well, if one is being used as part of the architecture.
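A minimal sketch of that setup, assuming Apache with mod_proxy enabled and a hypothetical backend hostname:
# Hypothetical reverse-proxy vhost (Apache with mod_proxy enabled).
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/maintenance
    # Serve the maintenance page locally; proxy everything else.
    ProxyPass /maintenance.html !
    ProxyPass / http://app-backend.internal/
    ProxyPassReverse / http://app-backend.internal/
    # If the backend is unreachable, mod_proxy produces a 503;
    # show the static maintenance page instead of the default error.
    ErrorDocument 503 /maintenance.html
</VirtualHost>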

Related

AMP: why are files with .amp.html extensions not displayed on linux hosting?

I recently converted all the web pages of my website to AMP. I renamed them all to .amp.html and took care to test each page with the AMP tester: https://ampbyexample.com/playground/
I also bought a domain name that points to HTTPS, with Linux hosting at GoDaddy. But when I upload the files with the .amp.html extension, nothing is displayed on the domain name. On the other hand, when I simply rename all the files to .html, the website is displayed. My question is: why are files with the .amp.html extension not displayed?
The problem comes down to web server configuration, and likely involves two issues.
The first is that you're probably expecting a default document to appear when you don't request a specific one. For example, in http://example.com/ the path is just /, but a web server will commonly load index.html from disk in that case. Chances are, your web server is not configured to load index.amp.html from disk.
The second issue may come down to a bad MIME type configuration. It's important that text/html; charset=utf-8 be sent as the Content-Type response header value for your HTML files.
If you have control over your webserver, you can reconfigure it yourself. You didn't tell us what server you're using, so we can't tell you specifically how to do that. If you don't have control over your webserver, you'll have to take it up with your hosting provider... GoDaddy. Or, just name things .html and you'll be fine!
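If the server turns out to be Apache and .htaccess overrides are allowed (an assumption, since the hosting details aren't known), both fixes could be sketched like this:
# Hypothetical .htaccess:
# serve index.amp.html as the default document, falling back to index.html
DirectoryIndex index.amp.html index.html
# make sure .html files (including .amp.html) are sent as UTF-8 HTML
AddType "text/html; charset=utf-8" .html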

Which robots.txt for forwarded subdomain?

In theory I have two subdomains set up in my hosting:
subdomain1.mydomain.com
subdomain2.mydomain.com
subdomain2 has a CNAME record pointing to an external service.
mydomain.com has a robots.txt that allows indexing everything.
subdomain2.mydomain.com has a robots.txt that allows indexing nothing due to the CNAME record.
If I set up a forward from subdomain1.mydomain.com to subdomain2.mydomain.com, which robots.txt would be used if accessing a link to subdomain1.mydomain.com? Does the domain forward work in the same way as a CNAME record when it comes to robots.txt?
This depends on your server setup.
Take the following config, for example:
server {
    server_name subdomainA.example.com;
    listen 80;
    return 302 http://subdomainB.example.com$request_uri;
}
In this case, we're redirecting everything from subdomainA.example.com to subdomainB.example.com. This will include your robots.txt file.
However, if your configuration is set up to only redirect certain parts, your robots.txt file will only be redirected if it's on your list. This would be the case if you were redirecting only, say, /someFolder.
Note that if you don't return a 302 but just use a different root (e.g. subdomainA and subdomainB are different subdomains but serve the same content), your robots.txt content will be determined by the root directory.
So, if I'm understanding your config correctly, subdomain1 will use the robots.txt from subdomain2.
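For the no-redirect case mentioned above, a sketch with two server blocks and separate roots (hostnames and paths are placeholders); each host then answers /robots.txt from its own directory:
# Same site published under two names, each with its own root:
server {
    server_name subdomainA.example.com;
    listen 80;
    root /var/www/siteA;    # serves /var/www/siteA/robots.txt
}
server {
    server_name subdomainB.example.com;
    listen 80;
    root /var/www/siteB;    # serves /var/www/siteB/robots.txt
}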
The challenge you're running into is that you're looking at things from the standpoint of whatever software you're trying to configure, but search engines and other robots only see the document they load from a URL (just like any other user with a web browser would). That is, search engines will try to load http://subdomain1.mydomain.com/robots.txt and http://subdomain2.mydomain.com/robots.txt, and it's up to you (through configuring whatever software your server is running) to ensure that those URLs serve what you want.
A CNAME is just an alias at the DNS level: a robot follows it when resolving the name to find the "real" IP address to connect to, but it has no further bearing on what the GET /robots.txt request returns once the robot connects to the server.
In terms of "forwarding", that term can mean different things, so you'd need to know what a browser or robot would receive when it requested the page. If it's doing a 301 or 302 redirection to send the client to another URL, you'll probably get different results from different search engines on how they may honor that, particularly if it's being redirected to an entirely different domain. I probably would try to avoid it, just because a lot of robots are poorly written. Some search engines have tools to help you determine how their crawlers are reading your robots.txt URLs, such as Google's tool.

Cloudfront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, css and js files for my website using the custom origin option with subdomains CNAMEd to my account. It works pretty well.
Main site: www.mainsite.com
static1.mainsite.com
static2.mainsite.com
Sample page: www.mainsite.com/summary/page1.htm
This page calls an image from static1.mainsite.com/images/image1.jpg
If CloudFront has not already cached the image, it gets the image from www.mainsite.com/images/image1.jpg
This all works fine.
The problem is that Google Alerts has reported the page as being found at both:
http://www.mainsite.com/summary/page1.htm
http://static1.mainsite.com/summary/page1.htm
The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.
I have tried to put a mod_rewrite rule in the .htaccess file, and I have also tried putting an exit() in the main script file.
But when CloudFront does not find the static1 version of the file in its cache, it calls it from the main site and then caches it.
Questions then are:
1. What am I missing here?
2. How do I prevent my site from serving pages instead of just static components to cloudfront?
3. How do I delete the pages from cloudfront? just let them expire?
Thanks for your help.
Joe
[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
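For reference, if the goal is to block all crawling of the CloudFront domain, the robots.txt in that bucket can be as small as:
# robots.txt served only for the CloudFront hostname
User-agent: *
Disallow: /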
You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.
In CloudFront you can control the hostname with which CloudFront will access your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect a request to that hostname and serve a robots.txt which disallows everything (unlike your regular website robots.txt).
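A sketch of that detection, assuming Apache and a made-up cdn-origin.mainsite.com hostname configured as the CloudFront origin:
# Hypothetical .htaccess: when the request comes in on the hostname
# given to CloudFront, serve a deny-all robots file instead.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^cdn-origin\.mainsite\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-cdn.txt [L]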

difference between http and www

Pardon me for asking a very basic question.
I have hosted a page on the site collinfo.annauniv.edu.
The page opens fine when I enter the address as http://collinfo.annauniv.edu.
But when I enter www.collinfo.annauniv.edu my browser shows a 404 error.
What difference does http make here in place of www?
The www. before your domain is actually a subdomain. It's essentially the same thing as help.microsoft.com or orders.amazon.com.
With that in mind, there are a few things that could be happening:
1) Your DNS records do not include the appropriate A Record for the www subdomain.
In this case, you'll need to set up an A record that points to your web site's IP address. If you don't know how to do this, your web host should be able to help.
2) Your server is not configured to handle the www subdomain.
If you're using the Apache web server, it needs to be configured to show your web site when the user enters www before your domain. Again, your web host can set this up for you; both fixes are sketched below.
It all comes down to a misconfiguration issue. If you don't have experience administering web servers, you may want to give your web host a holler.
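Here are both fixes sketched with placeholder values (a BIND-style zone record and an Apache virtual host; the IP address and paths are made up):
; DNS zone: give the www name an address (a CNAME to the bare domain also works)
www.collinfo.annauniv.edu.   IN   A   203.0.113.10

# Apache: let the existing site answer for the www name as well
<VirtualHost *:80>
    ServerName collinfo.annauniv.edu
    ServerAlias www.collinfo.annauniv.edu
    DocumentRoot /var/www/collinfo
</VirtualHost>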
www comes from the (rather) old days when a domain offered several services, of which the web was not always the main one. For instance:
www.domain.tld for web
mail.domain.tld for mail
ftp.domain.tld for ftp
domain.tld for web
but this is only a convention - any subdomain may point to anything.
This is more a question of DNS declaration and/or web server configuration; in this case it is probably that the web server configuration does not serve the same pages for www.domain and domain (since you get a 404).
The author/administrator of collinfo.annauniv.edu either forgot to create a DNS entry for www.collinfo.annauniv.edu or did not create a virtual host (on the web server side) for it that points to the same pages as collinfo.annauniv.edu.
HTTP is a protocol.
http://collinfo.annauniv.edu
Is the address of a resource which can be retrieved using HTTP.
annauniv.edu is the domain in your case.
collinfo is the subdomain.
www.collinfo is also considered a subdomain, but it does not exist. That's why you get HTTP 404 Not Found.
A subdomain can be anything; www is commonly used because it stands for World Wide Web.
WWW is a subdomain
HTTP is a protocol (language)
Whether you specify HTTP in the browser or not, the browser will always assume the request is of "http" type and will usually add http:// for you.
WWW however is just an alternative subdivision of the domain name, the same as in:
www.domain.com
site.domain.com
sub1.domain.com
sub2.domain.com
.....
etc.domain.com
In most cases the WWW subdomain will point to the same "page" as the main domain. This is usually called the "index" page, such as index.html or index.php, and in most cases it is hidden in the browser's address bar unless you specifically type it in, such as http://www.yahoo.com/index.html. But understand that if you have full control of your web server you can change all of this: WWW doesn't have to point to the same page, and you can call your main page "home.html" instead of "index.html" and instruct your web server to point browsers to that page by default.
But things like HTTP are not easily changed, since HTTP is the main language of the web and most browsers use it as the primary means of accessing web servers.
Peace!

Identify Webserver & Script of a website

I have got two simple questions
How can I tell what server a website is on? I remember I used to read the HTTP response headers to identify the type of server. Is there any tool to do it?
2a. A lot of websites have pages with the .html extension, and you just know they are not plain HTML. How can I tell what programming language is behind them?
2b. For ASPX, I think IIS can map the extension, so it will show .html instead of .aspx, right?
Cheers
1.
Yes, you can check the "Server" HTTP response header. Example responses:
- Microsoft-IIS/6.0
- GFE/1.3
- Apache/2.2.11 (Ubuntu) PHP/5.2.6-3ubuntu4.2 with Suhosin-Patch
You can also check the "X-Powered-By" header on some servers, for example:
- PHP/5.2.6-3ubuntu4.2
- ASP.NET
You can do this in Firefox with Firebug, for example: go to the Net tab, pick a request, select Headers, and look under the response headers. You could do this in Fiddler too, or any other HTTP sniffer.
2a)
See my first answer
2b)
Yes, you can map .html (or anything else) as an "ASP.NET" extension, meaning that the extension will be handled by the web application. A common approach is an HTTP handler that catches that extension, registered in web.config.
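A sketch of that web.config mapping, assuming IIS 7+ in integrated pipeline mode (the handler name is made up):
<!-- Hypothetical web.config fragment: route *.html requests
     through the ASP.NET page handler -->
<configuration>
  <system.webServer>
    <handlers>
      <add name="HtmlViaAspNet" path="*.html" verb="*"
           type="System.Web.UI.PageHandlerFactory"
           resourceType="Unspecified" />
    </handlers>
  </system.webServer>
</configuration>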
I'm not sure what the end goal of these questions is, or rather to what purpose; maybe we could answer better if we knew.
Look at the HTTP headers. This works as long as the server admin hasn't disabled them (which they usually haven't).
Try http://kalender-365.de/ip/get-http-header.php
2a. This actually works with all servers and all extensions. Some interpreters - such as PHP - send a special created-by HTTP header (which can be disabled, however).