How to request certain page elements not be indexed - html

Essentially, I would like to specify an element to be ignored by search engines. For instance, if I reference pornography from an academic standpoint, I don't want Google to list my site under porn searches, but I would still like it to index the rest of the page.
Is this possible? I'm sure I have come across a method of including meta data into one's html to achieve this.
I have tried to find this on the web, but have been unsuccessful.
I can't make sense of the page I did find, since I don't know whether, being only a draft specification, it is even recognised by crawl bots.

Use a robots.txt file in the root directory of your website:
User-agent: *
Disallow: /myreference_dir/
Disallow: /myreference_dir/myarticle.html
(Source: the Wikipedia article on robots.txt.)
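If page-level (rather than element-level) control is enough, the meta data the question half-remembers is most likely the robots meta tag; a minimal sketch for keeping one page out of the index entirely (as far as I know there is no standard way to exclude a single element while indexing the rest of the page):

<meta name="robots" content="noindex">

This goes in the <head> of the page you want excluded; the robots.txt rules above do the same job at the URL level.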

Related

How to make mediawiki sitemap URLs match the canonical URLs?

From my homepage, links look like "/index.php?title=My_Page_Name". I turned on $wgEnableCanonicalServerLink, so my pages contain canonical meta data, and the URL is the same. So far so good!
Unfortunately, generateSitemap.php is making entries that look like "/index.php/My_Page_Name", i.e. without the "title=".
Google's indexing is mad about this discrepancy. What's the magic incantation to make them all contain "title="?
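One setting worth checking is the article path in LocalSettings.php: as far as I understand, generateSitemap.php builds its URLs from the configured article path rather than from the links on your pages, so forcing the query-string form there may make the sitemap match the canonical URLs. A hedged sketch, not a confirmed fix:

# LocalSettings.php (sketch; $wgScript usually points at /index.php)
$wgArticlePath = "$wgScript?title=$1";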

Disallow: /*? in robots.txt explanation needed

Could someone please tell me whether the following rules in my robots.txt would or would not allow Google to crawl the example links below?
Allow: /search/
Disallow: /*?
It's an e-commerce site, and I would like to understand whether dynamic links like these, produced when searching in the product search bar, would be crawled or not:
https://www.whateverwebsite.de/search/?q=hello
https://www.whateverwebsite.de/category.html?form_new=18658
https://www.whateverwebsite.de/search/?q=grand&productFilter=motive_design%3AGeometric
Those links are generic examples, but I would really like to know whether Disallow: /*? blocks these kinds of links from being crawled, since there is nothing between the "/" and the "?".
Thanks a lot in advance; I look forward to some answers to keep learning :)
Your question is answered here: A Deeper Look At Robots.txt
Disallow: /*? # block any URL that includes a ?
But what you should also consider is that conflicting rules are resolved by giving precedence to the longer rule.
So take care that the rule you want is the one that gets precedence.
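Applying that to the URLs in the question, here is my reading of how a wildcard-aware bot such as Googlebot would resolve those two rules under the longest-rule precedence described above (a sketch, not an official verdict):

User-agent: *
Allow: /search/
Disallow: /*?

# /search/?q=hello matches both rules; "Allow: /search/" (8 characters) is longer than
# "Disallow: /*?" (3 characters), so the Allow wins and the URL may be crawled.
# The third URL (/search/?q=grand&...) behaves the same way.
# /category.html?form_new=18658 only matches "Disallow: /*?", so it is blocked.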

Using robots.txt to hide a specific parameter [duplicate]

Simple question. I want to add:
Disallow: */*details-print/
Basically, I want to block URLs of the form /foo/bar/dynamic-details-print, where foo and bar in this example can also be totally dynamic.
I thought this would be simple, but then on www.robotstxt.org there is this message:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
So we can't do that? Do search engines abide by it? But then, there's Quora.com's robots.txt file:
Disallow: /ajax/
Disallow: /*/log
Disallow: /*/rss
Disallow: /*_POST
So, who is right? Or am I misunderstanding the text on robotstxt.org?
Thanks!
The answer is, "it depends". The robots.txt "standard" as defined at robotstxt.org is the minimum that bots are expected to support. Googlebot, MSNbot, and Yahoo Slurp support some common extensions, and there's really no telling what other bots support. Some say what they support and others don't.
In general, you can expect the major search engine bots to support the wildcards you've written, and the rule you have there looks like it will work. Your best bet would be to run it past one of the online robots.txt validators, or to use Google's Webmaster Tools to check it.
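As a concrete illustration of the "it depends", here is how the two camps would read a wildcard rule like the one in the question, slightly adjusted so that it also matches the example URL, which has no trailing slash (a sketch of my understanding, not a guarantee for every bot):

User-agent: *
Disallow: /*details-print

# Wildcard-aware bots (Googlebot, MSNbot, Yahoo Slurp) treat * as "any sequence of characters",
# so this blocks e.g. /foo/bar/dynamic-details-print regardless of what foo and bar are.
# Bots that only implement the original robotstxt.org spec treat the value as a literal path
# prefix, so "/*details-print" matches nothing and the rule is effectively ignored by them.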

Where to place the humans.txt file if I cannot put it on the site root?

Background
I'm building a web application for a client.
This app will be accessible to the world and will be placed in a directory (e.g., /my-app) in web-root. A subdomain isn't an option as they don't want to cough up the dough for another SSL cert.
/my-app is the only directory that I'm allowed to touch (unreasonable IT guys).
I'm using an icon set which requires attribution.
I've contacted the original author of the icon set and have gotten permission to link back to his work in the THANKS section of a humans.txt file.
I also feel like I should mention some other people's work. This information combined with the above will probably take up a good 20 lines, so a separate file like humans.txt seems like an ideal place to put this considering that I'll be serving minified markup, CSS, and script files.
Questions
Since I'm not allowed to place a humans.txt file in web-root (and even if I were, it wouldn't make much sense to put it there, as it only applies to the /my-app portion of the site), is it acceptable to do the following:
Create: /my-app/humans.txt
Place: <link rel="author" href="//example.com/my-app/humans.txt"> in my markup
I'll be serving strict HTML 4.01 and the author value for the rel attribute doesn't seem to be a recognized link type in that specification. Do I need to do anything extra to define the author link type, or is the act of using it enough?
I don't even know if there are any non-spider tools that actually use this file at the moment, but I'd like to minimize the chance of this not working in the future when something does come along.
I think it is OK to put the file in the application's own directory, since that makes clear it is specific to the content inside the directory and not to all the other stuff you might find in the root directory.
Of course it would be nice if there were zero errors in HTML strict mode. However, this is one situation where you have to decide whether you want to
keep to the standard and not insert the link element (maybe put it in a comment or as a real link on a credits page)
ignore the standard, because the standard is nice but not the holy grail (there are far worse mistakes you can make than that)
choose another doctype that allows the link type you want, but then test again whether all browsers render the new doctype correctly
However, I cannot make this decision for you ;)
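For what it's worth, a minimal sketch of what the two pieces could look like; the section headings follow the humanstxt.org convention, and the names and URLs below are placeholders rather than the actual credits:

In the page's <head>:
<link rel="author" href="//example.com/my-app/humans.txt">

In /my-app/humans.txt:
/* THANKS */
Icon set by Jane Doe (used with permission): https://example.com/icons
Other libraries and contributors credited here

/* SITE */
Standards: HTML 4.01 Strict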

Embed sandboxed HTML on a webpage + SEO

I want to embed some HTML on my website... I would like that:
SEO: that content can be crawled and indexed
Integration: it renders nicely (does not break my DOM trees for instance, or does not inherit my styles)
Security: it remains safe for our users (JavaScript disabled)
Flexibility: the HTML can be completely free (don't want any BBCode or MarkDown or even TinyMCE, it's our users that are writing the HTML code...)
I saw that I might be able to use an iframe for that, but I am not sure it is a very good solution with respect to my SEO constraint.
Any answer would be greatly appreciated!!! Thanks.
For your requirements (rendering and security, primarily), an iframe seems to be your only option, especially since no restrictions are placed on the HTML content other than disabling JavaScript. Even some CSS plus an 'a' tag can pose a serious security risk, such as overlaying outgoing links on your standard interface.
For the SEO part, you can use sitemaps to show search engines the relation between the content and the container, and also use HTML elements like <link> to make the connection.
To make sure the user's HTML is safe, you should use HTMLPurifier. As for the rest of the question, you should split it up into multiple questions.
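On the security point, a minimal sketch of the iframe approach, assuming the user-supplied HTML is stored and served from its own URL (the path below is hypothetical); an empty sandbox attribute applies every restriction, including disabling scripts:

<!-- /user-content/123.html is a hypothetical URL serving the stored (and purified) HTML -->
<iframe src="/user-content/123.html" sandbox title="User-submitted content"></iframe>

Keep in mind that search engines generally treat the framed document as a separate page, which is why the answer above suggests sitemaps and link elements to tie it back to the container page.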