How do you create a robots.txt file that blocks all but the root - html

How do you create a valid robots.txt file that blocks all crawler requests except those for the root (the landing page), e.g. http://www.mysite.com?

Assuming your default page for the root is named index.htm, I believe this will accomplish what you're looking for.
User-agent: *
Allow: /index.htm
Disallow: /
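Note that Allow: /index.htm matches only that path. A request for the bare root URL (/) is still caught by Disallow: /, so for crawlers that support the $ end-of-URL anchor (Google does), adding Allow: /$ covers the bare root as well.
Google's Webmaster Tools has good help for writing a robots.txt file, and if you use it you also get a robots.txt builder/tester.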
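As a rough local check of the three rules above, Python's standard urllib.robotparser can parse them and report which paths a crawler may fetch. This is only a sketch: the helper name allowed is made up for illustration, and Python's parser applies the first matching rule while Google applies the most specific one, although for these particular rules the outcome should be the same.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, verbatim.
RULES = """\
User-agent: *
Allow: /index.htm
Disallow: /
"""


def allowed(path, rules=RULES, agent="*"):
    """Return True if `agent` may fetch `path` under `rules` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # The host comes from the question; only the path matters for matching.
    return parser.can_fetch(agent, "http://www.mysite.com" + path)


if __name__ == "__main__":
    for path in ("/index.htm", "/", "/about.htm"):
        print(path, allowed(path))
```

With this parser, /index.htm is allowed while both / and /about.htm are blocked, which is why the bare root needs its own rule if it should stay crawlable.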

Related

Hosting RTD documentation in a subfolder of existing domain

I currently have documentation built by readthedocs.org on docs.mycompany.com and my regular company website on mycompany.com. For SEO purposes I would prefer to have the documentation on mycompany.com/docs, so my documentation content contributes to the company website ranking.
From the documentation on custom domains I can only find subdomain examples. Am I correct in thinking the above is not possible with readthedocs.org and would require me to start self-hosting the documentation?
Read the Docs cannot control your root domain (mycompany.com) if you have something else hosted there. You could put a reverse proxy listening at /docs on your mycompany.com domain pointing to your <project-slug>.readthedocs.io URL.

Google showing "A description for this result is not available because of this site's robots.txt"

I created a web application and host the site using WordPress. When I search for its name in Google, it shows:
A description for this result is not available because of this site's robots.txt
Why is this happening? Is there a problem with the meta tags?
Your site’s robots.txt file is disallowing crawling of the page you found in Google Search. This means Google’s bot won’t visit this page to read its content.
The robots.txt file exists at the URL /robots.txt, e.g., http://example.com/robots.txt.
You’ll want to look for and check any lines starting with Disallow:
Disallow: (with no value) allows everything
Disallow: / blocks everything
Disallow: /a blocks every URL whose path starts with /a, e.g. http://example.com/a, http://example.com/abc, http://example.com/a/b.html etc.
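The prefix behaviour of Disallow can be tried out locally. The following is a minimal sketch using Python's standard urllib.robotparser; the helper can_crawl and the three rule strings are illustrative names, not part of any tool mentioned here, and Python's parser is only an approximation of how a given search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Three tiny robots.txt bodies, one per case described above.
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
BLOCK_PREFIX_A = "User-agent: *\nDisallow: /a\n"


def can_crawl(rules, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given rules (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/a",
                "http://example.com/abc",
                "http://example.com/a/b.html",
                "http://example.com/b"):
        print(url, can_crawl(BLOCK_PREFIX_A, url))
```

Only http://example.com/b is reported as crawlable under the /a rule, because its path does not start with /a.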
If you have the Yoast SEO plugin installed, you can find and edit the robots.txt file at
SEO > Tools > File Editor
If you use Google Search Console (highly recommended), you can test your robots.txt file under Crawl > robots.txt tester.
GSC: https://www.google.com/webmasters/tools/home?hl=en

Disallow google robot from robots.txt and list sitemap instead

I am using a paid geolocation script to direct users to specific sites based on country.
However, I am getting charged a lot because robots keep crawling every page of my large site.
If I were to disallow Google in robots.txt and list a sitemap there, would Google still index my pages without crawling them?
Example
User-agent: *
Disallow: /
Sitemap: sitemap.xml
Google only indexes page content by crawling it.
The best option for you is to disable the geolocation script when you detect a Google robot (or another crawler).
You can recognize them in various ways: the HTTP_USER_AGENT or HTTP_FROM values, or the IP address.
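A minimal sketch of the User-Agent approach follows, assuming a CGI/WSGI-style environ dictionary that carries HTTP_USER_AGENT. The function names and the token list are illustrative assumptions, the list is not exhaustive, and the header can be forged, so this is a heuristic rather than proof that a request comes from a real crawler.

```python
# Substrings that well-known crawlers put in their User-Agent header.
# Illustrative only; extend or replace to suit your traffic.
CRAWLER_TOKENS = ("googlebot", "bingbot", "ia_archiver")


def is_known_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def should_run_geolocation(environ):
    """Skip the paid geolocation lookup when the request looks like a crawler."""
    return not is_known_crawler(environ.get("HTTP_USER_AGENT", ""))


if __name__ == "__main__":
    bot = {"HTTP_USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    person = {"HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    print(should_run_geolocation(bot), should_run_geolocation(person))
```

Requests without a User-Agent header are treated as ordinary visitors here; whether that default is right depends on how costly a stray lookup is.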

Disable crawling of unwanted subdomains

How do I stop subdomain.domain.com from being crawled and listed by Alexa and other crawlers? In particular, cpanel.domain.com and webmail.domain.com are listed on my Alexa information page, which is annoying.
From this article: https://alexa.zendesk.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers
The Alexa web crawler (robot) identifies itself as “ia_archiver” in the HTTP “User-agent” header field. The Alexa Internet ia_archiver crawler strictly adheres to robots.txt rules.
To prevent ia_archiver from visiting any part of your site, your robots.txt file should look like this:
User-agent: ia_archiver
Disallow: /
You can also restrict crawling of specific directories. For example, to prevent ia_archiver from visiting the images directory (and its subdirectories):
User-agent: ia_archiver
Disallow: /images/
If you can, place a robots.txt in the root of each subdomain you do not want crawled. If those pages are outside your control, the hosting service should (or at least could) have put these or similar restrictions in place.
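To see that rules under a named User-agent apply only to that crawler, the two ia_archiver examples quoted above can be fed to Python's standard urllib.robotparser. This is a sketch under assumptions: may_fetch is a made-up helper, the URLs reuse the placeholder host from the question, and Python's parser stands in for the real crawler.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets quoted from the Alexa article above.
BLOCK_ALL = "User-agent: ia_archiver\nDisallow: /\n"
BLOCK_IMAGES = "User-agent: ia_archiver\nDisallow: /images/\n"


def may_fetch(rules, agent, url):
    """Return True if `agent` may fetch `url` under `rules` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    home = "http://cpanel.domain.com/"
    print(may_fetch(BLOCK_ALL, "ia_archiver", home))   # the named crawler
    print(may_fetch(BLOCK_ALL, "Googlebot", home))     # a crawler with no matching group
```

Because the file contains no group for other agents, crawlers other than ia_archiver remain unrestricted; use User-agent: * if every crawler should be kept out.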

Disallow subdomain url using robots.txt

I would like to ask a question.
I have the domains kiosban.com and store.kiosban.com,
and I want to disallow
store.kiosban.com/template/*
I have this in my store.kiosban.com/robots.txt,
but when I look at Google Webmaster Tools, under the Health menu >> Blocked URLs, I get
robots.txt file Blocked URLs Downloaded Status
http://www.store.kiosban.com/robots.txt - Never
Did I do something wrong?
www.store.kiosban.com and store.kiosban.com are different hosts. You should provide a robots.txt on both hosts, or even better, 301-redirect one form to the other one.
But that doesn’t seem to be the issue in your case. It looks like Google just needs some time to crawl your site and its robots.txt.
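Since crawlers look for robots.txt at the root of whichever host they are visiting, the file that applies to a page can be derived from the page URL. The sketch below uses only Python's urllib.parse; the helper name robots_txt_url and the /template/page.html path are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host exactly as given; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://store.kiosban.com/template/page.html"))
    print(robots_txt_url("http://www.store.kiosban.com/template/page.html"))
```

The two calls yield different robots.txt URLs, one per host, which is why each host needs its own file unless one redirects to the other.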