Can anyone help me add a disallow rule to my robots.txt file that will stop crawlers from indexing any link containing %2C, which is the URL (percent) encoding for a comma (,)?
I think what I'm looking for is the wildcard character, if one exists in the robots.txt file.
So far I have this:
Disallow: %2C
But I cannot seem to get it working.
Any suggestions?
Cheers
The best thing when testing robots.txt against the search engines is to utilize the tools they provide to you. Google Webmaster Tools has a robots.txt tester under "Health > Blocked URLs". If you use
User-agent: *
Disallow: *,*
this will block any requests for http://example.com/url%2Cpath/. I tried Disallow: *%2C* but apparently that doesn't block Googlebot from crawling the percent-encoded path. My guess is that Googlebot normalizes the URL encoding in the queuing process.
As for Bing, they apparently removed their robots.txt validation tool. So really the only sure way of testing it is to deploy a robots.txt on a test site, and then use Bing Webmaster Tools to fetch a page with the ','. It'll tell you at that point if it's blocked by robots.txt or not.
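To make the wildcard behaviour above easier to reason about, here is a minimal Python sketch of the pattern matching that Google and Bing describe for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end). The function names and the `normalize` flag are my own; the flag only models the guess above that the crawler compares decoded URLs, it is not confirmed crawler behaviour.

```python
import re
from urllib.parse import unquote


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters and a trailing '$' anchors the end;
    everything else is literal. Rules match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str, normalize: bool = False) -> bool:
    """Return True if `path` matches the Disallow `pattern`.

    With normalize=True both sides are percent-decoded first. That is
    only an assumption about what a crawler might do internally.
    """
    if normalize:
        pattern, path = unquote(pattern), unquote(path)
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("*,*", "/url%2Cpath/"))                  # literal comparison
print(is_blocked("*,*", "/url%2Cpath/", normalize=True))  # decoded comparison
```

Compared literally, `*,*` does not match `/url%2Cpath/`; it only matches once both sides are decoded, which is consistent with the tester result described above. Whatever this sketch says, the search engine's own testing tool remains the authority.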
Remember that robots.txt doesn't prevent the search engines from displaying the URL in the search results. It only prevents them from crawling the URL. If you simply don't want those types of URLs in the search results, but don't mind the pages being crawled (meaning you must not block those URLs with robots.txt, or the crawler will never see the directive), you can add a robots meta tag or an X-Robots-Tag HTTP header with a value of NOINDEX to keep them out of the search results.
Regarding one of the other comments about using the "nofollow" standard: nofollow doesn't actually prevent the search engines from crawling those URLs. It's better understood as a way of disavowing any endorsement of the link's destination. Google and Bing have suggested using nofollow to indicate sponsored links or untrusted UGC links.
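If the site happens to be served by a Python WSGI application, the header variant could be sketched roughly like this. Everything here (`noindex_commas`, `demo_app`, `response_headers`) is a hypothetical name for illustration, and it assumes the server hands the application a percent-decoded `PATH_INFO`, as WSGI servers normally do.

```python
from wsgiref.util import setup_testing_defaults


def noindex_commas(app):
    """WSGI middleware: add 'X-Robots-Tag: noindex' to comma URLs."""
    def wrapped(environ, start_response):
        # PATH_INFO is normally already percent-decoded, so a request
        # for /url%2Cpath/ is expected to show up here as /url,path/.
        has_comma = "," in environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if has_comma:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>hello</p>"]


def response_headers(app, path):
    """Call `app` for `path` in-process and return its headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    list(app(environ, start_response))  # consume the body
    return captured["headers"]


print(response_headers(noindex_commas(demo_app), "/url,path/"))
```

Pages served this way stay crawlable, which is what lets the crawler see the noindex directive in the first place.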
Related
I want to publish an Angular app for testing purposes, but I want to make sure that the site does not get crawled or indexed by bots.
I assume (might be way off!) I would add my <meta> tags simply on my index.html page, and for good measure add a robots.txt file in my root?
These are my meta tags:
<meta name="robots" content="noindex,nofollow">
<meta name="googlebot" content="noindex" />
This is the content of my robots.txt file:
User-agent: *
Disallow: /
Thank you in advance!
Using the robots.txt file you specified will be enough to prevent your site from being indexed by the bots that follow the robots exclusion standard. With this robots.txt you don't need to specify the meta tags, because a bot reads the robots.txt first and won't parse the HTML of the website to read the meta tags.
The meta tags are useful when your robots.txt file would normally allow a page to be indexed, but you want to exclude it at the page level, which allows more granular selection.
Note that some uncommon crawlers may not respect the exclusion standard. If you really want to restrict access to your test site, you should consider making it accessible only after authentication or allowing access only to certain IP addresses.
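As a quick sanity check of the robots.txt from the question, Python's standard `urllib.robotparser` can be used to see how a compliant bot would interpret it. This is only a sketch: `may_crawl` is a helper name of my own and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Every path is disallowed for every user agent.
print(may_crawl(ROBOTS_TXT, "Googlebot", "http://example.com/"))
print(may_crawl(ROBOTS_TXT, "SomeOtherBot", "http://example.com/app/page"))
```

This only tells you what a well-behaved crawler would do; it says nothing about bots that ignore the standard, which is why authentication or IP restrictions are the stronger option.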
I've been looking through different sites' robots.txt files and stumbled upon something I didn't expect in MediaWiki's robots.txt. From what I've read so far, you can write rules like the ones below in a robots.txt file:
Disallow: foo
Noindex: bar
I then wonder if:
Disallow: /wiki/Category:Noindexed_pages
is a correct structure in a robots.txt file, or at least in MediaWiki's case? I also want to know whether Noindexed_pages can be anything or whether it is static.
The last rule was taken from a Wikipedia article on MediaWiki's robots.txt.
/wiki/Category:Noindexed_pages is a MediaWiki category page that links to pages that should not be indexed. The rule is probably there to prevent the category itself from popping up in search results when the search term is part of the title of a noindexed page.
I've built an admin control panel for my website. I don't want the control panel app to end up in a search engine, since there's really no need for it. I did some research and I've found that by using the following tag, I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up by using the wrong method, and I do want search engines to index my site, just not the control panel...
Thanks
This is true, but on top of doing that, for even more security, in your .htaccess file, you should set this:
Header set X-Robots-Tag "noindex, nofollow"
And you should create a new file in the root of your domain, named robots.txt, with this content:
User-agent: *
Disallow: /
And you can be sure that they won't index your content ;)
Google will honor the meta tag by completely dropping the page from their index (source); however, other crawlers might simply decide to ignore it.
In that particular sense, meta tags are more reliable with Google: if you rely on robots.txt alone, any external source that explicitly links to your admin page (for whatever reason) can make your page appear in Google's index (though without any content, which will probably result in some SERP leeching).
I have an essay I want to release under an open licence so that others can use it, but I don't want it to be read by turnitin (google if you don't know.)
I want to host it in my university's public_html directory, so I don't have access to the top directory's robots.txt.
An answer to this problem will explain how to stop turnitin from reading the page while still allowing humans and search engine spiders to find, read and index it.
The TurnitinBot general information page at:
https://turnitin.com/robot/crawlerinfo.html
describes how their plagiarism prevention service crawls Internet content.
The section:
https://turnitin.com/robot/crawlerinfo.html#access
describes how robots.txt can be configured to prevent TurnitinBot crawling by adding a line for their user agent:
User-agent: TurnitinBot
Disallow: ...your document...
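To illustrate how such a per-bot rule is read, here is a small sketch using Python's `urllib.robotparser`. The host and the path `/students/jdoe/essay.html` are made up for the demo; substitute whatever your real document path is.

```python
from urllib.robotparser import RobotFileParser

# '/students/jdoe/essay.html' is a made-up path used only for this demo.
ROBOTS_TXT = """\
User-agent: TurnitinBot
Disallow: /students/jdoe/essay.html
"""
ESSAY_URL = "https://www.example.edu/students/jdoe/essay.html"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The rule names only TurnitinBot, so other crawlers are unaffected.
turnitin_allowed = parser.can_fetch("TurnitinBot", ESSAY_URL)
googlebot_allowed = parser.can_fetch("Googlebot", ESSAY_URL)
print(turnitin_allowed, googlebot_allowed)
```

This shows how a standards-following parser interprets the rule; it cannot confirm how TurnitinBot itself behaves.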
Because you don't have access to the robots.txt file, if you can expose your essay in HTML format, you could try including a meta tag in the document like:
<meta name="TurnitinBot" content="noindex" />
(If you don't expose in HTML and it's important enough, could you?)
Their crawlerinfo page above says this about "good crawling etiquette":
It should also obey META exclusion tags within pages.
and hopefully they follow the good etiquette they provide on their own page.
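If you go the meta tag route, a short script can at least confirm that the tag is present and parseable in the published page. The sketch below uses Python's standard `html.parser`; the class and function names are my own, and it only checks what the page declares, not whether TurnitinBot honors a bot-specific meta name.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of every named <meta> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lower-cased meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def is_noindexed_for(html: str, bot: str) -> bool:
    """True if the page carries 'noindex' for `bot` or for all robots."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    combined = found.get(bot.lower(), set()) | found.get("robots", set())
    return "noindex" in combined


page = '<html><head><meta name="TurnitinBot" content="noindex" /></head></html>'
print(is_noindexed_for(page, "TurnitinBot"), is_noindexed_for(page, "Googlebot"))
```

With the tag from above, the page is marked noindex for TurnitinBot only, so search engines that look at the generic `robots` name or their own name are not told to skip it.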
How can I prevent Google and other search engines from indexing my website?
I realize this is a very old question, but I wanted to highlight the comment made by @Julien as an actual answer.
According to Joost de Valk, robots.txt will indeed prevent your site from being crawled by search engines, but links to your site may still appear in search results if other sites have links that point to your site.
The solution is either adding a robots meta tag to the header of your pages:
<meta name="robots" content="noindex,nofollow"/>
Or, a simpler option is to add the following to your .htaccess file:
Header set X-Robots-Tag "noindex, nofollow"
Obviously your web host has to allow .htaccess rules and have the mod_headers module installed for that to work.
Both of these directives keep search engines from following the links on your pages AND from displaying your pages in search results. Win-Win, baby.
Create a robots.txt file in your site root with the following content:
# robots.txt for yoursite
User-agent: *
Disallow: /
Search engines (and most robots in general) will respect the contents of this file. You can put any number of Disallow: /path lines for robots to ignore. More details at robotstxt.org.