Using robots.txt to hide a specific parameter [duplicate] - html

Simple question. I want to add:
Disallow: */*details-print/
Basically, a blocking rule for URLs of the form /foo/bar/dynamic-details-print, where foo and bar in this example can also be totally dynamic.
I thought this would be simple, but then on www.robotstxt.org there is this message:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
So we can't do that? Do search engines abide by it? But then, there's Quora.com's robots.txt file:
Disallow: /ajax/
Disallow: /*/log
Disallow: /*/rss
Disallow: /*_POST
So, who is right? Or am I misunderstanding the text on robotstxt.org?
Thanks!

The answer is, "it depends". The robots.txt "standard" as defined at robotstxt.org is the minimum that bots are expected to support. Googlebot, MSNbot, and Yahoo Slurp support some common extensions, and there's really no telling what other bots support. Some say what they support and others don't.
In general, you can expect the major search engine bots to support the wildcards that you've written, and the one you have there looks like it will work. Your best bet would be to run it past one or more robots.txt validators, or use Google's Webmaster Tools to check it.
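For illustration, a minimal sketch of what such a rule could look like for crawlers that do support the * wildcard (Googlebot and Bingbot document it); the path is just the example from the question:
User-agent: *
# Blocks e.g. /foo/bar/dynamic-details-print for any values of foo and bar
Disallow: /*details-print
A bot that only implements the original robotstxt.org standard treats the * as a literal character, so for it this rule simply matches nothing.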

Related

Disallow: /*? in robots.txt explanation needed

Could someone please tell me whether the following rules in my robots.txt would allow Google to crawl the following example links or not?
Allow: /search/
Disallow: /*?
It's an e-commerce site, and I would like to understand whether dynamic links like these, generated when searching in the product search bar, would be crawled or not:
https://www.whateverwebsite.de/search/?q=hello
https://www.whateverwebsite.de/category.html?form_new=18658
https://www.whateverwebsite.de/search/?q=grand&productFilter=motive_design%3AGeometric
Those links are generic examples, but I would really like to know whether Disallow: /*? blocks these kinds of links from being crawled or not, since there is nothing between the "/" and the "?".
Thanks a lot in advance, and I look forward to any answers that help me keep learning :)
Your question is answered here: A Deeper Look At Robots.txt
Disallow: /*? # block any URL that includes a ?
But what you should also consider is that conflicting rules are resolved by giving precedence to the longer (more specific) rule.
So take care that the rule you want actually gets precedence.
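To make the precedence point concrete, here is a sketch of how Google's documented longest-match handling would play out for the example URLs above (other crawlers may resolve conflicts differently):
Allow: /search/     # rule path is 8 characters long
Disallow: /*?       # rule path is 3 characters long
# /search/?q=hello                   -> both rules match; Allow: /search/ is longer, so it wins and the URL may be crawled
# /category.html?form_new=18658      -> only Disallow: /*? matches, so the URL is blocked
# /search/?q=grand&productFilter=... -> same as the first case; allowed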

How to prevent search engines from indexing a span of text?

From the information I have been able to find so far, <noindex> is supposed to achieve this, making a single section of a page hidden from search engine spiders. But it also seems this is not obeyed by many crawlers, so if that is the case, what markup should be used instead of, or in addition to, it?
Yahoo uses a built-in class: <span class="robots-nocontent">
Googlebot has no equivalent(?)
Yandex uses <noindex>
Others?
There is no way to force crawlers not to index something; it's up to each crawler's author to decide what it does. The rule-obeying ones, like Yahoo Slurp, Googlebot, etc., each have their own markup, as you've already discovered, but it's still up to them whether to obey it completely or not. Say you set robots-nocontent: that part may still be fetched and stored somewhere else, perhaps for checks for spam, illegal material, malware, and so on.
And that's just the "good" ones; there's no telling what the bad ones will do. So think of all the noindex mechanisms as a set of guidelines, not a set of strict rules.
The only thing that works for sure: if you have sensitive data, or you simply don't want something indexed, don't make it publicly available.
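For completeness, a sketch of what the engine-specific markup mentioned above looks like inside a page (whether a crawler actually honours it is, as said, entirely up to that crawler; the sentences are made up):
<!-- Yahoo Slurp: class-based hint -->
<p>Indexed text. <span class="robots-nocontent">Text Yahoo is asked not to use.</span></p>
<!-- Yandex: proprietary element (a comment form also exists to keep the markup valid) -->
<p>Indexed text. <noindex>Text Yandex is asked to ignore.</noindex></p>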

HTML5: which is better - using a character entity vs using a character directly?

I've recently noticed a lot of high profile sites using characters directly in their source, eg:
<q>“Hi there”</q>
Rather than:
<q>&ldquo;Hi there&rdquo;</q>
Which of these is preferred? I've always used entities in the past, but using the character directly seems more readable, and would seem to be OK in a Unicode document.
If the encoding is UTF-8, the normal characters will work fine, and there is no reason not to use them. Browsers that don't support UTF-8 will have lots of other issues while displaying a modern webpage, so don't worry about that.
So it is easier and more readable to use the characters and I would prefer to do so.
It also saves a couple of bytes which is good, although there is much more to gain by using compression and minification.
The main advantage I can see with encoding characters is that they'll look right, even if the page is interpreted as ASCII.
For example, if your page is just a raw HTML file, the default settings on some servers would be to serve it as text/html; charset=ISO-8859-1 (the default in HTTP 1.1). Even if you set the meta tag for content-type, the HTTP header has higher priority.
Whether this matters depends on how likely the page is to be served by a misconfigured server.
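As a sketch of that trade-off (the page content is invented): declare the encoding and use characters directly when you control how the file is saved and served, and fall back to entities when you don't:
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Only reliable if the file really is saved as UTF-8 and the server
       does not send a conflicting Content-Type header -->
  <meta charset="utf-8">
  <title>Curly quotes</title>
</head>
<body>
  <p>“Direct characters” are readable in the source.</p>
  <p>&ldquo;Entities&rdquo; survive even if the page is misinterpreted as ISO-8859-1.</p>
</body>
</html>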
It is better to use characters directly; they make for easier-to-read code.
Google's HTML/CSS Style Guide advocates the same.
Use characters directly. They are easier to read in the source (which matters, since people do have to edit it!) and require less bandwidth.
The example given is definitely wrong, in theory as well as in practice, in HTML5 and in HTML 4. For example, the HTML5 discussion of q markup says:
“Quotation punctuation (such as quotation marks) that is quoting the contents of the element must not appear immediately before, after, or inside q elements; they will be inserted into the rendering by the user agent.”
That is, use either q markup or punctuation marks, not both. The latter is better on all practical accounts.
Regarding the issue of characters vs. entity references, the former are preferable for readability, but then you need to know how to save the data as UTF-8 and declare the encoding properly. It’s not rocket science, and usually better. But if your authoring environment is UTF-8 hostile, you need not be ashamed of using entity references.
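As a small sketch of the distinction being drawn there (the sentence is invented):
<!-- Let the browser supply the quotation marks: -->
<p>She said <q>Hi there</q> and walked off.</p>
<!-- Or write the punctuation yourself and drop the q element: -->
<p>She said “Hi there” and walked off.</p>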

URL hash format, what's allowed and what's not?

I'm using hash-based navigation in my rich web app. I also found I needed to create permalinks that would point to single instances of resources, but since I cannot cause the page to refresh, and the main page is loaded from the single path '/', I cannot use real URLs. Instead I thought about using hashes. Let me give you an example, because I know the explanation above sucks.
So, instead of having http://example.com/path/to/resource/1, I would have http://example.com/#path/to/resource/1
This seems to work OK, and the browser treats '#path/to/resource/1' as a hash (slashes permitted, I think), but I was wondering what characters are allowed in a URL hash. Is there a specification or an RFC that I could read to find out what the standard behavior of browsers is when it comes to hashes?
EDIT: Ok, so silly me. Didn't actually check if slashes worked in all browsers. Chrome obviously doesn't like them. Only works in FF.
Look at: http://www.w3.org/Addressing/rfc1630.txt or http://www.w3.org/Addressing/URL/4_2_Fragments.html
Basically you can use anything that can be encoded in a URL.
Note: There might be browser inconsistencies. If you fear them, you might use a serialization mechanism, like converting the string to hex or something (it will be twice as long, though), or use an id of some sort.
This document should help. Slashes are allowed, but the lexical analysis might differ between browsers.
I think you might find that useful: RFC3986
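For reference, RFC 3986 (section 3.5) gives the fragment grammar below, which is why "/" and "?" may appear literally while characters such as "#", spaces, or anything non-ASCII must be percent-encoded:
fragment    = *( pchar / "/" / "?" )
pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Browsers are historically more lenient than this grammar, which is where the inconsistencies mentioned above come from.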
If you use PHP to generate your page paths, you could also use urlencode(), which generates a valid URL.

How to request certain page elements not be indexed

Essentially I would like to specify an element to be ignored by search engines. If I reference pornography from an academic standpoint, I don't want Google to list my site under porn searches, for instance, but would like it to index the rest of the page.
Is this possible? I'm sure I have come across a method of including meta data into one's html to achieve this.
I have tried to find this on the web, but have been unsuccessful.
I can't make sense of this page, since I don't know whether, being a draft specification, it is even recognised by crawl bots.
Use a robots.txt file in the root directory of your website:
User-agent: *
Disallow: /myreference_dir/
Disallow: /myreference_dir/myarticle.html
For more details, see the Wikipedia article on robots.txt.