Block certain HTML elements from getting indexed by search engines - html

For styling purposes I want to insert some dummy text on the page, but it shouldn't be associated with the actual content. Is there a way to block it from search engines, or do I have to use good old images for that?
Or would it be possible to load it dynamically via JavaScript? I heard that Google will read a certain amount of JavaScript.

Can you show the content in a borderless iframe, and block the iframe's src (a completely separate "page") from the search engines?
Alternatively, add the content with JavaScript, storing the JavaScript in a .js file that you block from the engines?
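A rough sketch of the iframe variant, assuming the dummy text lives at a made-up path /decoration/decoration.html:
<!-- in the main page: pull the decoration in from a separate document -->
<iframe src="/decoration/decoration.html" title="decorative text"
        style="border: none; overflow: hidden;"></iframe>
Then, in robots.txt, block crawlers from that path (and from the .js file, if you go the JavaScript route instead):
User-agent: *
Disallow: /decoration/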

If you load that text via AJAX it probably won't be indexed - last time I checked, GoogleBot doesn't actually execute JS (nor do the other spiders, though some spambots apparently can and do).
Caveat: the AJAX response should probably contain an X-Robots-Tag: noindex header, in case its URL is actually linked somewhere.
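A minimal sketch of that caveat, assuming a small Node.js server and a made-up fragment URL:
// serve the AJAX fragment with a noindex header so the fragment's own URL
// is not indexed even if something links to it
const http = require('http');
http.createServer(function (req, res) {
  if (req.url === '/fragments/dummy-text.html') {
    res.setHeader('X-Robots-Tag', 'noindex');
    res.setHeader('Content-Type', 'text/html');
    res.end('<p>Decorative dummy text</p>');
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(8080);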

I'd be extremely careful with whatever trick you decide on. Odds are just as likely that Google will think you're trying to display different content to the user than to it.
I've always believed that Google actually works by rendering the page (possibly using some server-side version of the Chrome rendering engine) and then reading the result back with OCR software to confirm that the text in the source matches what the user would see with JS and frames enabled. Google has always openly warned webmasters not to serve robots different content from what users get, and OCR would be the perfect way to find out (especially if its 'verifier' used IE's user-agent string and crawled from IP ranges not registered to Google).
Short answer, then: serve the decoration as one of the following:
an iframe
an object
an SVG image
Since you're clearly linking the document into your page, Google will probably consider it a separate resource and rate things accordingly, especially if the same text appears on every page. Which brings me to:
Are you going to use the same text decor on all/most pages? If so, Google will almost certainly treat it as "window dressing" and ignore it (it apparently does this with menus and such).
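For illustration, hedged sketches of the object and SVG variants (the file names are made up):
<!-- as an embedded document -->
<object data="/decoration/decoration.html" type="text/html" width="300" height="80"></object>
<!-- or as an SVG image, so the text is just part of the graphic -->
<img src="/decoration/decoration.svg" alt="">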

I'd guess that loading in the content after the page has finished loading (when the document.ready event fires, for example) would be a fairly safe way to do what you're talking about. Not 100% sure about this, though.
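A rough sketch of that idea, assuming jQuery is already on the page and /dummy-text.html is a placeholder URL:
<div id="decor"></div>
<script>
$(document).ready(function () {
  // fetch the decorative text only after the initial HTML has been parsed
  $('#decor').load('/dummy-text.html');
});
</script>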

Related

Is importing text via CSS really permitted?

There is an HTML page that I have no control over; ditto for any JavaScript.
I can however style it, and one of the things I've done is inject a slab of text via content in a CSS pseudo-element.
However, this slab spans multiple lines, and since CSS strings are single-line only, that leads to a cumbersome property full of \0a escapes.
I was wondering whether I could use the url(blah/blah) syntax with content in place of a string; and the docs say yes!
However, when I try it (the slab now unencoded and hosted in its own file), the content doesn't show.
Looking in the networking tab of the devtools shows it is requested, but it looks like the browser is ignoring it.
At first I thought it was a headers issue (I was working just out of the filesystem), so I built a tiny server to apply text/plain (I also tried text/html) on localhost.
It appears the browsers are only accepting images for content, with the following Accept header seen on the request in Chrome's devtools: Accept: image/webp,image/*,*/*;q=0.8.
This issue occurs in Firefox too, so why does MDN specifically use a .html example in the syntax?
Is there any way to get something like what I'm attempting up and running, or am I left to deal with the long CSS statement?
The docs say "an external resource (such as an image)" so they don't explicitly rule out such use of plain text, but they don't explicitly allow it either. It seems likely that "such as an image" is intended to allow further media types such as video or interactive SVG but deliberately vague so as to not second-guess future technologies.
As such, it would seem that it's "permitted" as in you haven't done anything invalid, but not supported as in there is no reason why you should expect it to actually do anything useful.
This issue occurs in firefox too, so why does the mdn specifically use a .html example in the syntax?
I'd guess that it was simply an RFC 6761- and RFC 2606-compliant URI of the kind often used in docs as an example. (Of course, to nitpick, there's no reason why a URI ending in .html should be assumed to always return HTML either, though it's a bit perverse to do otherwise.)
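If url() in content really is image-only, a pragmatic fallback is to keep the string but make the line breaks legible; a small sketch, assuming a hypothetical .slab target:
.slab::before {
  white-space: pre-line;  /* honour the \A line breaks */
  content: "First line of the slab\A Second line\A Third line";
}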

Hiding noscript tags from search engines

I was looking for information on hiding noscript content from search engines, so that help text about enabling JavaScript wouldn't end up adding to the keyword density...
I'm thinking of using a different approach and wanted to get thoughts on it.
Looking at logs I noticed the user agents for both Google and Bing have the word "bot" in them, so what about using an if statement (in ASP, say) that looks at the user agent for the word "bot", and only writes out the noscript tag if it isn't there?
This should eliminate it from both google and bing search results and would leave it available for other users who might have JavaScript disabled...
I don't think using noscript is a good idea. I've heard that it is ineffective when the client is behind a JavaScript-blocking firewall: if the client's browser has JavaScript enabled, the noscript tag won't activate, because, as far as the browser is concerned, JavaScript is fully operable within the document...
A better method IMO, is to have all would-be 'noscript' content hidden by JavaScript.
Here's a very basic example:
<body>
<script>
// flag as early as possible that JavaScript is running
document.body.className += ' js-enabled';
</script>
<div id="noscript">
some content
</div>
</body>
And within your StyleSheet:
body.js-enabled #noscript { display: none; }
More info:
Replacing noscript with accessible, unobtrusive DOM/JavaScript
Reasons to avoid NOSCRIPT
There is no need to hide noscript tags from search engines. As far as we know, they ignore the tags themselves: they process the content of noscript elements as if the tags were not there. I presume you actually want to hide that content.
There is no reasonable way to do that. Trying to send different content to search engines than to browsers is one of the main tricks that search engines try to detect, and they may punish for it.
The way to avoid having nonsensical content indexed as part of the page is to avoid including nonsensical content. There is normally no point in telling users to enable JavaScript. If the page is really an application that won’t work at all without JavaScript, then saying this inside noscript is OK, and when formulated properly, it does not pollute search engine indexes but provides useful additional content.
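For example (the application and wording are purely illustrative), a noscript message along these lines adds useful content instead of boilerplate about browser settings:
<noscript>
  <p>This interactive sales chart requires JavaScript.
     A static summary of the same figures is available on the <a href="/reports/">reports page</a>.</p>
</noscript>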

Embed sandboxed HTML on a webpage + SEO

I want to embed some HTML on my website... I would like that:
SEO: that content can be crawled and indexed
Integration: it renders nicely (does not break my DOM trees for instance, or does not inherit my styles)
Security: it remains safe for our user (javascript disabled)
Flexibility: the HTML can be completely free (don't want any BBCode or MarkDown or even TinyMCE, it's our users that are writing the HTML code...)
I saw that I might be able to use an iframe for that, but I am not sure it is a very good solution with regard to my SEO constraint.
Any answer would be greatly appreciated!!! Thanks.
For your requirements (rendering and security, primarily), an iframe seems to be your only option, especially since no rules are specified for the HTML content other than disabling JS. Even some CSS plus an 'a' tag can pose a serious security risk, like overlaying outgoing links on your standard interface.
For the SEO part, you can use sitemaps to show the search engines the relation between the content and the container, and also use HTML tags like link to make the connection.
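For the security part, the sandbox attribute covers the "JavaScript disabled" requirement; a rough sketch, with /user-content.html standing in for wherever the user's HTML is served from:
<iframe src="/user-content.html" sandbox
        style="border: none; width: 100%; height: 400px;"
        title="user submitted content"></iframe>
With a bare sandbox attribute, scripts, forms, and plugins inside the frame are disabled and the document is treated as a unique origin, so it cannot touch your DOM or inherit your styles.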
To make sure the user's HTML is safe you should use HTML Purifier. As for the rest of the question, you should split it up into multiple questions.

Browser HTML filters?

When writing filters for the Firefox add-on Adblock Plus you can write rules to completely remove certain HTML elements from the page, but the filtering criteria are in fact limited to a handful of things, like class and id names and attribute values.
What I was hoping for is a Firefox add-on that would pass the HTML for a page to some arbitrary process you specify, where this process could reconstitute the HTML for the entire page in any arbitrary way, and then have the browser display the result. Is there a Firefox add-on that allows this, or is this sort of operation commonly accomplished by some entirely different but well-known means (perhaps not even browser-specific)?
Wouldn't this allow you to augment pages coming from some website to your browser with arbitrary new features, maybe from an entirely different website?
You are looking for Greasemonkey.
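For illustration, a minimal Greasemonkey-style user script that rewrites part of a page after it loads (the @match pattern and the rewrite itself are just placeholders):
// ==UserScript==
// @name   Rewrite example
// @match  https://example.com/*
// @grant  none
// ==/UserScript==
// Prefix every h2 heading on matching pages.
document.querySelectorAll('h2').forEach(function (h) {
  h.textContent = '[filtered] ' + h.textContent;
});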

Is there a way to make search bots ignore certain text? [closed]

I have my blog (you can see it if you want, from my profile), and it's fresh, and so are the results of Google's robots parsing it.
The results were alarming to me. Apparently the two most common words on my site are "rss" and "feed", because I use text like "Comments RSS" and "Post Feed" for links. These two words will be present in every post, while other words are rarer.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google dating back to 2007 (I think in three years many things could have changed, hopefully this too).
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming those parts in such a way that they are visible to humans but invisible to robots.
There is a simple way to tell Google not to index parts of your documents: use googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index — content surrounded by "googleoff: index" will not be indexed by Google
anchor — anchor text for any links within a "googleoff: anchor" area will not be associated with the target page
snippet — content surrounded by "googleoff: snippet" will not be used to create snippets for search results
all — content surrounded by "googleoff: all" is treated with all of the above
Source
Google excludes content inside HTML tags that carry the data-nosnippet attribute from search result snippets:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 Google rankings for thousands of school names in the US, and we do a lot of work to protect our SEO. There are three main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or JavaScript to place it where you want readers to see it (see the sketch after this list). This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
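A rough sketch of the first option (the class names are made up): the links come last in the markup but are positioned near the top visually.
<div class="content">Main article text...</div>
<div class="meta-links">
  <a href="/feed">Post Feed</a> <a href="/comments/feed">Comments RSS</a>
</div>
.meta-links {
  position: absolute;  /* pulled out of the normal flow */
  top: 10px;
  right: 10px;
}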
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding; perhaps this should be the focus of your attention. If the rest of your content is rich I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, though I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe AJAX it in after load). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it; one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with a :before or :after pseudo-element to add your content. I'm not sure whether bots index words in CSS content values, let alone relate that value to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't give it much, if any, weight. For example, the CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE conditional comments if you care about that.
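The matching markup might look something like this (the href is only illustrative); the visible link text lives in the stylesheet rather than the HTML:
<a class="add-text" href="/feed"></a>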
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web search at all, so please refrain from relying on them; I think this should not be marked as the correct answer, as it might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
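If you go that route, the excluded file can also carry its own noindex directive, so the fragment is not indexed on its own if something links to it (excluded.html is the file name used above):
<!-- excluded.html -->
<html>
  <head><meta name="robots" content="noindex"></head>
  <body><p>Content you do not want indexed as part of the host page.</p></body>
</html>
And in the host page:
<iframe src="/excluded.html" style="border: none;"></iframe>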
The only control that you have over the indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You basically can prohibit certain links and URLs, but not necessarily keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans see what is sensible on a page, and they will spend time on a blog that has good content that is rare and unique.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in much the same way. Your page ranking also increases as daily visits increase and as the site content gets better and is updated regularly.
This page has the word "Answer" repeated multiple times. That doesn't mean it won't get indexed; what matters is how useful it is to everyone.
I hope this gives you some ideas.
You would have to manually detect "Googlebot" from the request's user agent and feed it slightly different content than you normally serve to your users.
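Taken literally, that means branching on the User-Agent header; a rough sketch in Node.js (mind the cloaking caveats raised in the answers above):
const http = require('http');
http.createServer(function (req, res) {
  const ua = (req.headers['user-agent'] || '').toLowerCase();
  const isBot = ua.indexOf('bot') !== -1;  // matches "Googlebot", "bingbot", ...
  res.setHeader('Content-Type', 'text/html');
  // serve the technical links only to clients that do not look like crawlers
  res.end(isBot
    ? '<p>Post content...</p>'
    : '<p>Post content...</p><p><a href="/feed">Post Feed</a></p>');
}).listen(8080);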