How to keep the search engines away from some pages on my domain - html

I've built an admin control panel for my website. I don't want the control panel app to end up in a search engine, since there's really no need for it. I did some research and found that by using the following tag, I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up by using the wrong method, and I do want search engines to index my site, just not the control panel...
Thanks

This is true, but on top of that, for extra safety, you can set this in an .htaccess file (place it in the control panel's directory so it doesn't affect the rest of your site):
Header set X-Robots-Tag "noindex, nofollow"
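You can also create a new file named robots.txt in the root of your domain. Note that the rule below blocks the whole site; to keep crawlers out of the control panel only, replace / with the panel's path: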
User-agent: *
Disallow: /
With that in place, well-behaved crawlers should not index your content ;)

Google will honor the meta tag by completely dropping the page from its index (source). Other crawlers, however, might simply decide to ignore it.
In that sense the meta tag is more reliable with Google: if you rely on robots.txt alone, any external site that links to your admin page (for whatever reason) can still make the page appear in Google's index, though without any content, which will probably result in some SERP leeching.
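To sanity-check the robots.txt side before deploying, here is a minimal sketch using Python's standard urllib.robotparser. The /admin/ path and the example.com URLs are assumptions made for illustration (the question does not name the control panel's path), so substitute your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block only the control panel, leave the rest crawlable.
RULES = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)
# Placeholder URLs; only the path part matters for the check.
admin_ok = parser.can_fetch("Googlebot", "https://example.com/admin/login")
home_ok = parser.can_fetch("Googlebot", "https://example.com/")
print("admin crawlable:", admin_ok)
print("home crawlable:", home_ok)
```

This only shows what a compliant crawler is allowed to fetch. It says nothing about whether a URL can still be listed in search results through external links, which is the point made above.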

Related

How to stop bots from crawling or indexing an Angular app

I want to publish an Angular app for testing purposes, but I want to make sure that the site does not get crawled or indexed by bots.
I assume (might be way off!) I would add my <meta> tags simply on my index.html page, and for good measure add a robots.txt file in my root?
These are my meta tags:
<meta name="robots" content="noindex,nofollow">
<meta name="googlebot" content="noindex" />
This is the content of my robots.txt file:
User-agent: *
Disallow: /
Thank you in advance!
Using the robots.txt file you specified will be enough to keep your site from being crawled by the bots that follow the robots exclusion standard. With this robots.txt you don't need to specify the meta tags, because a bot reads robots.txt first and won't parse the HTML of the website to read them.
The meta tags are for cases where your robots.txt file would normally allow a page to be crawled, but you want to exclude it at the page level, which allows more granular selection.
Note that some uncommon crawlers may not respect the exclusion standard. If you really want to restrict access to your test site, you should consider making it accessible only after authentication or allowing access only to certain IP addresses.
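As a rough illustration of the IP restriction idea, here is a minimal sketch of an allow-list check using Python's standard ipaddress module. The networks listed are reserved documentation ranges used as placeholders, and is_allowed is a hypothetical helper name; in practice this check would usually live in the web server or firewall configuration rather than in application code.

```python
import ipaddress

# Hypothetical allow-list; these are reserved documentation ranges, not real ones.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address: deny by default
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("203.0.113.7"))
print(is_allowed("198.51.100.1"))
```

Denying by default on malformed input keeps the check conservative, which seems the safer choice for a test site that should stay private.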

Prevent google from indexing ajax loaded content

On our site we load identical content via Ajax calls (when the users click on the menu, just to prevent reloading the entire page again, so as to improve user experience).
This works well, but the Ajax-loaded content is a copy of the original content.
Can I prevent Google from indexing this content?
http://dinox-h.hu/en/gallery.php
In the left menu you can see the links:
For example:
http://dinox-h.hu/puffer_tartalyok_galeria.php?ajax=1
Try adding the following on your Ajax-delivered pages:
<meta name="robots" content="noindex,nofollow" />
This tells crawlers not to index the page or follow its links. You could also add the pages in robots.txt, like this:
User-agent: *
Disallow: /*?ajax=1
That would block any URL with ?ajax=1 from being crawled (provided a robot honours your robots.txt). A better solution would also involve creating a sitemap and telling various search engines about it.
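To see which URLs a rule like Disallow: /*?ajax=1 would cover, here is a minimal sketch of wildcard matching as the major crawlers describe it (* matches any run of characters, a trailing $ anchors the end). The helper names rule_to_regex and is_blocked are hypothetical, and this is an approximation for experimenting, not any crawler's actual matcher. As far as I know, Python's own urllib.robotparser does plain prefix matching and does not expand wildcards, which is why a hand-rolled matcher is used here.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule containing * and $ into a regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape literal pieces and join them with ".*" where the rule had "*".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule: str, path_and_query: str) -> bool:
    """Rules match from the start of the path (including the query string)."""
    return rule_to_regex(rule).match(path_and_query) is not None


rule = "/*?ajax=1"
print(is_blocked(rule, "/puffer_tartalyok_galeria.php?ajax=1"))
print(is_blocked(rule, "/en/gallery.php"))
print(is_blocked(rule, "/page.php?lang=en&ajax=1"))
```

Note the last case: because the rule contains a literal ?ajax=1, a URL where ajax=1 is not the first query parameter is not covered by it.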
Edit
A better way of delivering Ajax content IMO would be to send the following header when requesting your pages via Ajax:
X-Requested-With: XMLHttpRequest
jQuery will do this by default, so provided you can check for it on the server side, you could deliver your usual content e.g. without the template. You could then very easily deliver different content from the same URL depending on what the type of request is. This should also solve your crawling issue as I doubt a crawler would stumble across it.
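A minimal sketch of that server-side check, written in Python for illustration (the site in the question uses PHP, so treat this as a description of the logic rather than drop-in code). The render function, the page title, and the plain dict of headers are all assumptions.

```python
def render(headers: dict, body: str) -> str:
    """Return a bare fragment for Ajax requests and a full page otherwise."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    requested_with = normalised.get("x-requested-with", "")
    if requested_with.lower() == "xmlhttprequest":
        return body  # Ajax call: content only, no surrounding template
    # Regular request (including crawlers): wrap the content in the template.
    return f"<html><head><title>Gallery</title></head><body>{body}</body></html>"


fragment = render({"X-Requested-With": "XMLHttpRequest"}, "<p>gallery</p>")
full_page = render({}, "<p>gallery</p>")
print(fragment)
print(full_page)
```

With this approach the ?ajax=1 parameter is no longer needed, so there is no second URL for a crawler to discover.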

How to add a redirect to a web page where you have limited user privileges

The company I work for has replaced our previously very flexible website with a much more restrictive "website in a box" technology. I have my web pages hosted on Google Sites and would like to redirect people to those pages. When I attempt to do this via JavaScript, it gets stripped from the page when it's saved. I do not have access to the <head> section to attempt the deprecated method of redirecting.
Is there another method available to automatically redirect a customer other than just posting a link in a restricted environment like this?
If you're limited to using HTML to do the redirect, you can use a meta redirect:
<meta http-equiv="refresh" content="0; url=http://example.com/">
Though note that its use is deprecated because it may be disorienting to the user. In addition to the <meta> tag, you can add <link rel="canonical" href="http://example.com/"> to let search engines know that the targeted page is the canonical one.
Edit: if Google Sites won't allow you to change the <head> HTML, the Javascript, or the PHP, then it's time to go searching for solutions within Google Sites itself. One solution that pops up pretty frequently in searches seems to be using a URL Redirect Gadget.
On the page you want to redirect from, click the Edit Page button, then Insert Menu, then More Gadgets. Once there, search for "redirect gadgets" and some widgets that should help will show up.
These instructions are based on advice given in the Google Products forums. I don't have a Google Site myself, so I can't verify that they work.

No Access to Top Directory, Want to Stop Certain Robots

I have an essay I want to release under an open licence so that others can use it, but I don't want it to be read by turnitin (google if you don't know.)
I want to host it in my university's public_html directory, so I don't have access to the top directory's robots.txt.
An answer to this problem will show how to stop Turnitin from reading the page while still allowing humans and search engine spiders to find, read and index it.
The TurnitinBot general information page at:
https://turnitin.com/robot/crawlerinfo.html
describes how their plagiarism prevention service crawls Internet content.
The section:
https://turnitin.com/robot/crawlerinfo.html#access
describes how robots.txt can be configured to prevent TurnitinBot crawling by adding a line for their user agent:
User-agent: TurnitinBot
Disallow: ...your document...
Because you don't have access to the robots.txt file, if you can expose your essay in HTML format, you could try including a meta tag in the document like:
<meta name="TurnitinBot" content="noindex" />
(If you don't expose in HTML and it's important enough, could you?)
Their crawlerinfo page above says this about "good crawling etiquette":
It should also obey META exclusion tags within pages.
and hopefully they follow the good etiquette they provide on their own page.
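To double-check that the published page really carries the tag, here is a minimal sketch that parses the HTML with Python's standard html.parser and reports whether a given bot name is told not to index it. RobotsMetaParser and blocked_for are hypothetical names, and whether TurnitinBot actually honours a bot-specific meta name is an assumption that the crawler page does not spell out.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name=... content=...> tags by name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            self.directives[name] = [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def blocked_for(html: str, bot: str) -> bool:
    """True if a bot-specific or generic robots meta tag says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    rules = parser.directives.get(bot.lower(), []) + parser.directives.get("robots", [])
    return "noindex" in rules or "none" in rules


page = '<html><head><meta name="TurnitinBot" content="noindex" /></head></html>'
print(blocked_for(page, "TurnitinBot"))
print(blocked_for(page, "Googlebot"))
```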

Preventing Site from Being Indexed by Search Engines

How can I prevent Google and other search engines from indexing my website?
I realize this is a very old question, but I wanted to highlight the comment made by @Julien as an actual answer.
According to Joost de Valk, robots.txt will indeed prevent your site from being crawled by search engines, but links to your site may still appear in search results if other sites have links that point to your site.
The solution is either adding a robots meta tag to the header of your pages:
<meta name="robots" content="noindex,nofollow"/>
Or, a simpler option is to add the following to your .htaccess file:
Header set X-Robots-Tag "noindex, nofollow"
Obviously your web host has to allow .htaccess rules and have the mod_headers module installed for that to work.
Both of these keep search engines from following the links on your pages AND from displaying your pages in search results. Win-Win, baby.
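If your host does not allow .htaccess rules, the same header can be set by the application itself. Here is a minimal sketch for a Python WSGI app; add_x_robots_tag and demo_app are hypothetical names, and the choice of WSGI is an assumption, since nothing above says what the site runs on.

```python
def add_x_robots_tag(app, value="noindex, nofollow"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"admin area"]


protected_app = add_x_robots_tag(demo_app)
```

Unlike the meta tag, a response header also covers non-HTML resources such as PDFs or images.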
Create a robots.txt file in your site root with the following content:
# robots.txt for yoursite
User-agent: *
Disallow: /
Search engines (and most robots in general) will respect the contents of this file. You can put any number of Disallow: /path lines for robots to ignore. More details at robotstxt.org.