How do I stop subdomains (subdomain.domain.com) from being crawled and listed by Alexa and other crawlers? In particular, cpanel.domain.com and webmail.domain.com are listed on my Alexa information page, which is annoying.
From this article: https://alexa.zendesk.com/hc/en-us/articles/200450194-Alexa-s-Web-and-Site-Audit-Crawlers
The Alexa web crawler (robot) identifies itself as “ia_archiver” in the HTTP “User-agent” header field. The Alexa Internet ia_archiver crawler strictly adheres to robots.txt rules.
To prevent ia_archiver from visiting any part of your site, your robots.txt file should look like this:
User-agent: ia_archiver
Disallow: /
You can also restrict crawling of specific directories. For example, to prevent ia_archiver from visiting the images directory (and its subdirectories):
User-agent: ia_archiver
Disallow: /images/
If you can, place a robots.txt file in the root of each subdomain you do not want crawled. If those subdomains are outside your control, the hosting service should (or at least could) apply these or similar restrictions.
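Before deploying such a file to a subdomain, the rules can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser. It shows how Python's parser reads the rules, not necessarily how Alexa's crawler does. cpanel.domain.com is the placeholder host from the question, and "SomeOtherBot" is a made-up user agent used only for contrast.

```python
from urllib.robotparser import RobotFileParser

# Rules quoted from the Alexa article: block ia_archiver from the whole host.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Placeholder URL on the subdomain that should stay out of Alexa.
url = "http://cpanel.domain.com/"

alexa_allowed = parser.can_fetch("ia_archiver", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)  # hypothetical crawler name

print("ia_archiver allowed:", alexa_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

In this sketch only ia_archiver is blocked. Other crawlers are unaffected unless the file also has a group for them or a "User-agent: *" group.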
I created a web application and host the site using WordPress. When I search for its name in Google, the result shows:
A description for this result is not available because of this site's robots.txt
Why is this happening? Is there a problem with a meta tag?
Your site’s robots.txt file is disallowing crawling of the page you found in Google Search. This means Google’s bot won’t visit this page to read its content.
The robots.txt file exists at the URL /robots.txt, e.g., http://example.com/robots.txt.
Look for and check any lines starting with Disallow:
Disallow: (with an empty value) allows everything
Disallow: / blocks everything
Disallow: /a blocks every URL whose path starts with /a,
e.g. http://example.com/a, http://example.com/abc, http://example.com/a/b.html etc.
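The prefix matching described above can be tried out locally. This is a small sketch using Python's standard urllib.robotparser. The helper name blocked_paths and the extra path /b are assumptions added for illustration, and real crawlers may differ in details from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(disallow_value: str, paths: list[str]) -> list[str]:
    """Return the paths that a 'User-agent: *' group with one Disallow rule blocks."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return [p for p in paths if not parser.can_fetch("*", f"http://example.com{p}")]


# /b is a hypothetical path added to show what stays crawlable.
paths = ["/", "/a", "/abc", "/a/b.html", "/b"]

empty_rule = blocked_paths("", paths)     # "Disallow:" with no value
slash_rule = blocked_paths("/", paths)    # "Disallow: /"
prefix_rule = blocked_paths("/a", paths)  # "Disallow: /a"

print("Disallow:    blocks", empty_rule)
print("Disallow: /  blocks", slash_rule)
print("Disallow: /a blocks", prefix_rule)
```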
If you have the Yoast SEO plugin installed, you can find and edit the robots.txt file at
SEO > Tools > File Editor
If you use Google Search Console (highly recommended) then you can test your robots.txt file
Go to Crawl > robots.txt tester
GSC - https://www.google.com/webmasters/tools/home?hl=en
I am using a paid geolocation script to direct users to specific sites based on country.
However, I am getting charged a lot because robots keep crawling every page of my large site.
If I were to disallow Google in robots.txt but list a sitemap there, would Google still index my pages without crawling them?
Example
User-agent: *
Disallow: /
Sitemap: sitemap.xml
Google only indexes a page's content by crawling it.
The best option for you is to disable the geolocation script when you detect a Google robot (or another crawler).
You can recognize them in various ways: HTTP_USER_AGENT, HTTP_FROM, or IP address.
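As a rough sketch of the User-Agent approach, the check below skips the paid geolocation lookup when the header looks like a known crawler. The token list and the function names are assumptions for illustration, not an exhaustive or authoritative list. The User-Agent header can be spoofed, so the IP-based check mentioned above is the stricter option.

```python
from typing import Optional

# Substrings that some well-known crawlers include in their User-Agent header.
# Illustrative only; extend this to match the bots you actually see in your logs.
CRAWLER_TOKENS = ("googlebot", "bingbot", "ia_archiver")


def looks_like_crawler(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header contains a known crawler token."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def should_run_geolocation(user_agent: Optional[str]) -> bool:
    """Only pay for a geolocation lookup when the visitor is not a known crawler."""
    return not looks_like_crawler(user_agent)


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0"

print(should_run_geolocation(googlebot_ua))
print(should_run_geolocation(browser_ua))
```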
I would like to ask a question.
I have the domain kiosban.com and the subdomain store.kiosban.com,
and I want to disallow
store.kiosban.com/template/*
I have this in my store.kiosban.com/robots.txt,
but when I look at Google Webmaster Tools, under the Health menu >> Blocked URLs, I get:
robots.txt file Blocked URLs Downloaded Status
http://www.store.kiosban.com/robots.txt - Never
Did I do something wrong?
www.store.kiosban.com and store.kiosban.com are different hosts. You should provide a robots.txt on both hosts, or even better, 301-redirect one form to the other one.
But that doesn’t seem to be the issue in your case. It looks like Google just needs some time to crawl your site and its robots.txt.
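To illustrate why the two hosts are treated separately: a crawler looks for robots.txt at the root of the exact host it is visiting. The sketch below derives that location from a page URL. The function name and the page path /template/page.html are made up for the example.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the exact scheme and host of page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical page path, used on both host names.
with_www = robots_txt_url("http://www.store.kiosban.com/template/page.html")
without_www = robots_txt_url("http://store.kiosban.com/template/page.html")

print(with_www)
print(without_www)
```

The two results differ, which is why each host needs its own file unless one redirects to the other.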
Are there any security concerns when putting an HTTPS iframe in an HTTPS page? Or, security-wise, is it essentially the same as having a single HTTPS page?
(The iframe content is coming from a different domain if that makes a difference)
There are no on-the-wire privacy implications associated with this compared with a normal HTTPS page, but bear in mind that you're doubling the number of servers and probably companies involved.
Browser exploits, popups and adware can all be served over HTTPS. Embedding an HTTPS site that isn't under your control can also expose your users to privacy violations if the URL you pass to it reveals personal information about them. For example, if you serve https://www.example.com/redir.php?url= + CURRENT_URL and you've logged in a user using a GET postback with the username and password in the URL, you could be exposing those credentials to third-party sites.
Other than that, there are no issues associated with embedding third-party HTTPS sites into your own HTTPS pages.
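One way to reduce the URL leak described above is to drop the query string and fragment before handing the current URL to the third party. This is only a sketch: the function names and the mysite.example login URL are hypothetical, and redir.php is the example endpoint from the answer. Not putting credentials in URLs in the first place is the better fix.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, which may carry personal data."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def build_redirect(current_url: str) -> str:
    """Build the third-party URL from a sanitized copy of the current URL."""
    safe_url = strip_query(current_url)
    return "https://www.example.com/redir.php?url=" + quote(safe_url, safe="")


# Hypothetical page URL that carries credentials in its query string.
current = "https://mysite.example/login?user=alice&password=secret"
redirect = build_redirect(current)

print(redirect)
```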
How do you create a valid robots.txt file that blocks all crawler requests except for the root (the landing page), e.g. http://www.mysite.com?
Assuming your default page for the root is named index.htm, I believe this will accomplish what you're looking for.
User-agent: *
Allow: /index.htm
Disallow: /
Google's Webmaster Tools has some great help for formulating a robots.txt and if you use the Webmaster Tools, you also get a robots.txt builder/tester.
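The rules above can be checked locally before relying on them. The sketch below uses Python's standard urllib.robotparser, whose matching may differ from any particular crawler's. The path /about.htm is a made-up example of a page that should stay blocked.

```python
from urllib.robotparser import RobotFileParser

# Rules from the answer above.
ROBOTS_TXT = """\
User-agent: *
Allow: /index.htm
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Check whether any crawler ('*') may fetch the given path on www.mysite.com."""
    return parser.can_fetch("*", "http://www.mysite.com" + path)


# /about.htm is a hypothetical non-root page.
results = {path: allowed(path) for path in ("/index.htm", "/", "/about.htm")}

for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

In this sketch /index.htm is allowed, but the bare root URL / is still matched by Disallow: /. For crawlers that support the $ end-of-URL anchor, as Google documents, a rule such as Allow: /$ is one way to target the root URL itself. Whether other crawlers honor it is an assumption to verify.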