Correct robots.txt structure? (MediaWiki)

I've been checking around in different sites' robots.txt files and stumbled upon something I didn't expect in MediaWiki's robots.txt. From what I've read so far, you can write rules in a robots.txt file like the ones below:
Disallow: foo
Noindex: bar
I then wonder if:
Disallow: /wiki/Category:Noindexed_pages
is correct structure for a robots.txt file, or at least for MediaWiki's part? I'd also like to know whether Noindexed_pages can be anything or whether it is a fixed name.
The last rule was taken from MediaWiki's robots.txt, as quoted in a Wikipedia article.

/wiki/Category:Noindexed_pages is a MediaWiki category page that links to pages that should not be indexed. The rule is probably there to prevent the category page itself from turning up in search results when the search term is part of the title of a noindexed page.
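For reference, a minimal robots.txt using that rule could look like the snippet below. The value after Disallow: is just a literal URL path prefix as far as robots.txt is concerned, so Noindexed_pages is not a special keyword; it could be any category (or any other path) you want to block:
User-agent: *
Disallow: /wiki/Category:Noindexed_pages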

Related

How to keep the search engines away from some pages on my domain

I've built an admin control panel for my website. I don't want the control panel to end up in a search engine, since there's really no need for it. I did some research and found that by using the following tag I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up by using the wrong method, and I do want search engines to index my site, just not the control panel...
Thanks
This is true, but on top of that, for extra protection, you should set this in the .htaccess file that covers the control panel:
Header set X-Robots-Tag "noindex, nofollow"
You should also create a file named robots.txt in the root of your domain that disallows the control panel's path (here /admin/ stands in for wherever your panel actually lives):
User-agent: *
Disallow: /admin/
With that in place, well-behaved crawlers won't crawl the control panel ;)
Google will honor the meta tag by completely dropping the page from their index (source); however, other crawlers might simply decide to ignore it.
In that particular sense the meta tag is more reliable with Google, because with robots.txt alone, any external site that explicitly links to your admin page (for whatever reason) can still make the URL appear in Google's index, though without any content, which will probably just result in a content-less entry cluttering the SERPs.

Disallow Google from indexing links containing "%2C" (comma) with robots.txt

Can anyone help me add a disallow rule to my robots.txt file that will stop crawlers indexing any URL containing %2C, which is the URL encoding for a comma (,)?
I think what I'm looking for is the wildcard character, if one exists for robots.txt files.
So far I have this:
Disallow: %2C
But I can't seem to get it working.
Any suggestions?
Cheers
The best thing when testing robots.txt against the search engines is to use the tools they provide. Google Webmaster Tools has a robots.txt tester under "Health > Blocked URLs". If you use
User-agent: *
Disallow: *,*
this will block any requests for http://example.com/url%2Cpath/. I tried Disallow: *%2C* but apparently that doesn't block Googlebot from crawling the URL-escaped path. My guess is that Googlebot decodes %2C to a literal comma somewhere in its queuing process.
As for Bing, they apparently removed their robots.txt validation tool. So really the only sure way of testing is to deploy a robots.txt on a test site and then use Bing Webmaster Tools to fetch a page containing the ','. It will tell you at that point whether the page is blocked by robots.txt or not.
Remember that robots.txt doesn't prevent the search engines from displaying a URL in the search results; it only prevents them from crawling it. If you simply don't want those types of URLs in the search results, but don't mind the engines crawling the pages (meaning you must not block those URLs with robots.txt), you can add a robots meta tag or an X-Robots-Tag HTTP header with a value of noindex to prevent them from being added to the search results.
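As a sketch of the header approach on Apache (assuming mod_setenvif and mod_headers are available; the environment variable name here is arbitrary), you could send the header only for URLs containing a comma, whether literal or escaped:
SetEnvIf Request_URI "(,|%2[Cc])" URL_HAS_COMMA
Header set X-Robots-Tag "noindex" env=URL_HAS_COMMA
That way the rest of the site is unaffected and only the comma URLs are marked noindex.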
Regarding one of the other comments about using the "nofollow" standard: nofollow doesn't actually prevent the search engines from crawling those URLs. It's recognized more as a way of disavowing any endorsement of the linked destination. Google and Bing have suggested using nofollow to indicate sponsored links or untrusted UGC links.
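An illustrative link carrying that hint (hypothetical URL) would look like:
<a href="https://example.com/some-page" rel="nofollow">some page</a>
Google has since also introduced rel="sponsored" and rel="ugc" as more specific variants for sponsored and user-generated links.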

No Access to Top Directory, Want to Stop Certain Robots

I have an essay I want to release under an open licence so that others can use it, but I don't want it to be read by Turnitin (Google it if you don't know what that is).
I want to host it in my university's public_html directory, so I don't have access to the top directory's robots.txt.
An answer to this problem should explain how to stop Turnitin from reading the page while still allowing humans and search engine spiders to find, read, and index it.
The TurnitinBot general information page at:
https://turnitin.com/robot/crawlerinfo.html
describes how their plagiarism prevention service crawls Internet content
The section:
https://turnitin.com/robot/crawlerinfo.html#access
describes how robots.txt can be configured to prevent TurnitinBot crawling by adding a line for their user agent:
User-agent: TurnitinBot
Disallow: ...your document...
Because you don't have access to the robots.txt file, if you can expose your essay in HTML format, you could try including a meta tag in the document like:
<meta name="TurnitinBot" content="noindex" />
(If you don't currently expose it as HTML and this matters enough, could you?)
Their crawlerinfo page above says this about "good crawling etiquette":
It should also obey META exclusion tags within pages.
and hopefully they follow the good etiquette they describe on their own page.

Preventing Site from Being Indexed by Search Engines

How can I prevent Google and other search engines from indexing my website?
I realize this is a very old question, but I wanted to highlight the comment made by @Julien as an actual answer.
According to Joost de Valk, robots.txt will indeed prevent your site from being crawled by search engines, but links to your site may still appear in search results if other sites have links that point to your site.
The solution is either adding a robots meta tag to the header of your pages:
<meta name="robots" content="noindex,nofollow"/>
Or, a simpler option is to add the following to your .htaccess file:
Header set X-Robots-Tag "noindex, nofollow"
Obviously your web host has to allow .htaccess rules and have the mod_headers module installed for that to work.
Both of these keep search engines from displaying your pages in search results AND from following the links on those pages. Win-win, baby.
Create a robots.txt file in your site root with the following content:
# robots.txt for yoursite
User-agent: *
Disallow: /
Search engines (and most robots in general) will respect the contents of this file. You can put any number of Disallow: /path lines for paths you want robots to ignore. More details at robotstxt.org.
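For instance, if you only wanted to keep robots out of a couple of areas rather than the whole site, it could look like this (the paths are just placeholders):
# robots.txt for yoursite
User-agent: *
Disallow: /private/
Disallow: /drafts/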

How should I handle autolinking in wiki page content?

What I mean by autolinking is the process by which wiki links inlined in page content are generated into either a hyperlink to the page (if it does exist) or a create link (if the page doesn't exist).
With the parser I am using, this is a two step process - first, the page content is parsed and all of the links to wiki pages from the source markup are extracted. Then, I feed an array of the existing pages back to the parser, before the final HTML markup is generated.
What is the best way to handle this process? It seems as if I need to keep a cached list of every single page on the site, rather than having to extract the index of page titles each time. Or is it better to check each link separately to see if it exists? This might result in a lot of database lookups if the list wasn't cached. Would this still be viable for a larger wiki site with thousands of pages?
In my own wiki I check all the links (without caching), but my wiki is only used by a few people internally. You should benchmark stuff like this.
In my own wiki system the caching is pretty simple: when a page is updated it checks its links to make sure they are valid and applies the correct formatting/location for those that aren't. The cached page is saved as an HTML file in my cache root.
Pages that are marked as 'not created' during the page update are inserted into a database table that holds the page name along with a CSV list of the pages that link to it.
When someone creates that page it initiates a scan to look through each linking page and re-caches the linking page with the correct link and formatting.
If you weren't interested in highlighting non-created pages, however, you could just check whether the page exists when someone attempts to access it and, if not, redirect them to the creation page (see the sketch below). Then just link to pages as normal in other articles.
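A minimal sketch of that "check on access" idea, using Python and Flask purely for illustration (the route names and the in-memory page store are assumptions, not part of the answer above):
from flask import Flask, redirect

app = Flask(__name__)
pages = {"Home": "Welcome!"}  # stand-in for whatever store actually holds your pages

@app.route("/wiki/<title>")
def show(title):
    # If the page hasn't been created yet, send the visitor to the creation form.
    if title not in pages:
        return redirect(f"/create/{title}")
    return pages[title]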
I tried to do this once and it was a nightmare! My solution was a nasty loop in a SQL procedure, and I don't recommend it.
One thing that gave me trouble was deciding what link to use on a multi-word phrase. Say you had some text saying "I am using Stack Overflow" and your wiki had 3 pages called "stack", "overflow" and "stack overflow"....which part of your phrase gets linked to where? It will happen!
My idea would be to query the titles with something like SELECT title FROM articles and simply check whether each wikilink is in that array of strings. If it is, you link to the page; if not, you link to the create page.
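A rough sketch of that approach in Python (the articles table, the [[...]] link syntax, and the URL paths are assumptions for illustration):
import re
import sqlite3

WIKI_LINK = re.compile(r"\[\[(.+?)\]\]")  # matches links like [[Stack Overflow]]

def render_links(markup, db_path="wiki.db"):
    # Pull every existing title once, then do cheap in-memory membership tests.
    with sqlite3.connect(db_path) as conn:
        existing = {row[0] for row in conn.execute("SELECT title FROM articles")}

    def replace(match):
        title = match.group(1)
        if title in existing:
            return f'<a href="/wiki/{title}">{title}</a>'
        # Not created yet: point at the creation page instead.
        return f'<a class="new" href="/create/{title}">{title}</a>'

    return WIKI_LINK.sub(replace, markup)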
In a personal project I made with Sinatra, after I run the content through Markdown I do a gsub to replace wiki words and other patterns (like [[Here is my link]] and whatnot) with proper links, checking for each one whether the page exists and linking to the create or view page accordingly.
It's not the best, but I didn't build this app with caching/speed in mind. It's a low resource simple wiki.
If speed were more important, you could wrap the app in something to cache it. For example, Sinatra can be wrapped with Rack caching middleware.
Based on my experience developing Juli, an offline personal wiki with autolinking, a static-HTML generation approach may fix your issue.
As you suspect, it takes a long time to generate an autolinked wiki page. However, when generating static HTML, an autolinked page only needs to be regenerated when a wiki page is newly added or deleted (in other words, not when a page is merely updated), and the regeneration can be done in the background, so it usually doesn't matter how long it takes. The user only ever sees the generated static HTML.