Making a div non-indexable? - html

I have a div with some sentences that I don't want to be indexed by search engines.
Is it possible to somehow hide this from Google in a way?
I thought about using frames, and having the site within the frame being blocked by robots.txt, but I've never liked the idea of using frames.
Are there other options?

Technically, you could use iframe and put <meta name=robots content=noindex> into the iframed document. Using suitable attributes and CSS, you can make the iframed document appear as part of the page, mostly, though you would still need to reserve some fixed area for it.
Or you could generate the div with JavaScript, thought then it would not be seen when JavaScript is disabled. Note that search engine bots may execute JavaScript code and might thus “see” the generated content, though I would not expect that to happen now or in the near future.
If the content is text, without internal markup or images etc., you could have an empty div with a CSS rule that adds content using the :before pseudoelement and content property. This would fail for users with CSS disabled or with an aggressive user style sheet, and search engine bots might some day start interpretign CSS.
There might be trickier methods, too, but as a whole, there is no good way I think. It’s probably more useful to consider why you would want to prevent from finding the page on the basis of its content. As a tool for hiding information, it would be inefficient.

You could create images from the sentences, then the text wouldn't be indexed.

Related

Is it better to not render HTML at all, or add display:none?

As far as I understand, not rendering the HTML for an element at all, or adding display:none, seem to have exactly the same behavior: both make the element disappear and not interact with the HTML.
I am trying to disable and hide a checkbox. So the total amount of HTML is small; I can't imagine performance could be an issue.
As far as writing server code goes, the coding work is about the same.
Given these two options, is one better practice than the other? Or does it not matter which I use at all?
As far as I understand, not rendering the HTML for an element at all, or adding display:none, seem to have exactly the same behavior: both make the element disappear and not interact with the HTML.
No, these two options don’t have "exactly the same behavior".
If you hide an element with CSS (display:none), it will still be rendered for
user agents that don’t support CSS (e.g., text browsers), and
user agents that overwrite your CSS (e.g., user style sheets).
So if you don’t need it, don’t include it.
If, for whatever reason, you have to include the element, but it’s not relevant for your document/users (no matter in which presentation), then use the hidden attribute. By using this attribute, you give the information on the HTML level, hence CSS support is not needed/relevant.
You might want to use display:none in addition (this is what many CSS supporting user agents do anyway, but it’s useful for CSS-capable user agents that don’t support the hidden attribute).
You could also use the aria-hidden state in addition, which could be useful for user agents that support WAI-ARIA but not the hidden attribute.
I mean do you need that checkbox? If not then .hide() is just brushing things under the carpet. You are making your HTML cluttered as well as your CSS. However, if it needs to be there then sure, but if you can do without the checkbox then I would not have it in the HTML.
Keep it simple and readable.
The only positive thing I see in hiding it is in the case where you might want to add it back in later as a result of a button being clicked or something else activating it in the page. Otherwise it is just making your code needlessly longer.
For such a tiny scenario the result would be practically the same. But hiding the controls with CSS is IMO not something that you want to make a habit of.
It is always a good idea to make both the code and its output efficient to the point that is practical. So if it's easy for you to not include some controls in the output by adding a little condition everything can be managed tidily, try to do so. Of course this would not extend to the part of your code that receives input, because there you should always be ready to handle any arbitrary data (at least for a public app).
On the other hand, in some cases the code that produces the output is hard to modify; in particular, giving it the capability to determine what to do could involve doing damage in the form of following bad practices: perhaps add a global variable, or else modify/override several functions so that the condition can be transferred through. It's not unreasonable in that case to just add a little CSS in order to again, achieve the solution in a short and localized manner.
It's also interesting to note that in some cases the decision can turn out to be based on hard external factors. For example, a pretty basic mechanism of detecting spambots is to include a field that appears no different in HTML than the others but is made invisible with CSS. In this situation a spambot might fill in the invisible field and thus give itself away.
The confusion point here is this: Why would you ever use display: none instead of simply not render something?
To which the answer is: because you're doing it client side!
"display: none" is better practice when you're doing client side manipulations where the element might need to disappear or reappear without an additional trip to the server. In that case, it is still part of the logical structure of the page and easier to access and manipulate it than remove (and then store in memory in Javascript) and insert it.
However if you're using a server-side heavy framework and always have the liberty of not rendering it, yes, display:none is rather pointless.
Go with "display:none" if the client has to do the work, and manage its relation to the DOM
Go with not rendering it if every time the rendered/not rendered decision changes, the server is generating fresh (and fairly immutable) HTML each time.
I'm not a fan of adding markup to your HTML that cannot be seen and serves no purpose. You didn't provide a single benefit of doing that in your question and so the simple answer is: If you don't need a checkbox to be part of the page, then don't include it in your markup.
I suspect that a hidden checkbox will not add any noticeable time to the download or work by the server. So I agree it's not really a consideration. However, many pages do have extra content (comments, viewstate, etc.) and it can all add up. So anyone with the attitude that they will go ahead and add content that is not needed and never seen by the user, I would expect them to create pages that are noticeably slower overall.
Now, you haven't provided any information about why you might want to include markup that is not needed. Although you said nothing about client script, the one case where I might leave elements in a page that are hidden is when I'm writing client script to remove them. In this case, I may hide() it and leave in the markup. One reason for that is that I could easily show it again if needed.
That's my answer, but I think you'd get a much better answer if you described what considerations you had for including markup on the page that no one will see. Surely, it must offer some benefit that you haven't disclosed or you would have no reason to do it.

Optimize CSS Delivery - a suggestion by Google

Google suggests to use very important CSS inline in head and other CSS inside <noscript><link rel="stylesheet" href="small.css"></noscript>.
This raises few questions in my mind:
How to prioritize CSS in two files. Everything for that page looks important. Display, font etc. If I move it to bottom then how it helps page render. Wont it cause repaint, etc?
Is that CSS is required after Document ready event? Got it from here.
How 'CSS can' go inside <noscript></noscript>, which is for script? Will it work when JavaScript is enabled? Is it browsers compatible?
Reference
Based on my reading of the link given in the question:
Choose which CSS declarations are inlined based on eliminating the Flash-of-Unstyled-Content effect. So, ensure that all page elements are the correct size and colour. (Of course, this will be impossible if you use web-fonts.)
Since the CSS which is not inlined is deferrable, you can load it whenever makes sense. Loading it on DOMContentReady, in my opinion, goes against the point of this optimisation: launching new HTTP requests before the document is completely loaded will potentially slow the rest of the page load. Also, see my next point:
The example shows the CSS in a noscript tag as a fallback. Below the example, the page states
The original small.css is loaded after onload of the page.
i.e. using javascript.
If I could add my own personal opinion to this piece:
this optimisation seems especially harmful to code readability: style sheets don't belong in noscript tags and, as pointed out in the comments, it doesn't pass validation.
It will break any potential future enhancements to HTTP (or other protocol) requests, since the network transaction is hard-coded through javascript.
Finally, under what circumstances would you get a performance gain? Perhaps if your page loads a lot of initially-hidden content; however I would hope that the browser itself is able to optimise the page load better than this hack can.
Take this with a grain of salt, however. I would hesitate to say that Google doesn't know what they're doing.
Edit: note on flash-of-unstyled-content (abbreviated FOUC)
Say you a block of text spanning multiple lines, and includes some text with custom styling, say <span class="my-class">. Now, say that your CSS will set .my-class { font-weight:bold }. If that CSS is not part of the inline style sheet, .my-class will suddenly become bold after the deferred loading has finished. The text block may reflow, and might also change size if it requires an extra line.
So, rather than a flash of totally-unstyled content, you have a flash of partly-styled content.
For this reason you should be careful when considering what CSS is deferred. A safe approach would be to only defer CSS which is used to display content which is itself deferred, for example hidden elements which are displayed after user interaction.

Is it safe to allow to embed an arbitrary external stylesheet into my web-page?

I have a dynamic web-page which I want other people to embed into their web-pages, with an iframe (not necessarily with any kind of more advanced techniques like JavaScript).
Instead of providing all sorts of designs and styles myself, I'm thinking about allowing them to provide their own stylesheet for my page through an HTTP GET parameter, and embed such external stylesheet through a URL w/ <link type="text/css" rel="stylesheet" href… on my page.
Is this safe? Will it violate the security paradigm of my web-site? I'm aware that extra text could be inserted with CSS alone, and indeed elements could be removed (which is the whole point of me providing such functionality for my users), but anything else I should be aware of?
Could malicious people insert links onto my site through such a CSS, to benefit from my http referer and potentially violate some checks, or is CSS insertion limited to text?
In the general case, no, allowing third-party CSS is not safe. Some implementations allow JavaScript in CSS, which means that allowing users to modify your CSS allows them to execute arbitrary JavaScript in the context of your page.
However, if this is meant to be sort of a "white-label" page, where it appears to be part of the site it's embedded in and the fact that it's really your page is just an implementation detail, this doesn't seem like a major concern. The person specifying the "third-party" CSS is the site owner, so it's not really third-party at that point — they're not going to XSS themselves!
But nobody else should ever be putting CSS on a page that's meant to be under your control, because it's really under the control of whoever is controlling the CSS.
CSS cannot insert linkable content. It can only style, position and hide what's already there. Sure, people can mess up your page with :before and :after text an perhaps make things look a little confusing or change labels on existing links, but not the URLs themselves.

Is there a way to make search bots ignore certain text? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 months ago.
Improve this question
I have my blog (you can see it if you want, from my profile), and it's fresh, as well as google robots parsing results are.
The results were alarming to me. Apparently the most common 2 words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These 2 words will be present in every post, while other words will be more rare.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed, hopefully this too)
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell google to not index parts of your documents, that is using googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index — content surrounded by “googleoff: index” will not be indexed
by Google
anchor — anchor text for any links within a “googleoff: anchor” area
will not be associated with the target page
snippet — content surrounded by “googleoff: snippet” will not be used
to create snippets for search results
all — content surrounded by “googleoff: all” are treated with all
source
Google ignores HTML tags which have data-nosnippet:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly think about the issue. If Google think "RSS" is the main keyword that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention.If the rest of your content is rich I wouldn't worry about the issue as a search engine should know what the page is about from title and headings. Just make sure RSS etc is not in a heading or bold or strong tag.
Secondly as you rightly mention, you probably don't want use images as they are not assessable to screen readers without alt text and if they have alt text or supporting text then you add the keyword back in. However aria live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe ajax it in after load). Search engines like Google can execute JavaScript but I would guess it wont value any JS written content very highly.
Re-word the content or remove duplicates of it, one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the css content attribute with pseudo :before or :after to add your content. I'm not sure if bots will index words in content attributes in CSS and know that contents value in relation to each page but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing not an HTML thing, therefore even if engines to index it they wont add much/any value to it. For example, the HTML and CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE version comments if you care about that.
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web-search at all. So please refrain from doing that and I think that should not be marked as a correct answer as this might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
The only control that you have over the indexing robots, is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You basically can prohibit certain links and URL's but not necessarily keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google crawler are smart but someone that program them are smartest. Human always sees what is sensible in the page, they will spend time on blog that have some nice content and most rare and unique.
It is all about common sense, how people visit your blog and how much time they spend. Google measure the search result in the same way. Your page ranking also increase as daily visits increase and site content get better and update every day.
This page has "Answer" words repeated multiple times. It doesn't mean that it will not get indexed. It is how much useful is to every one.
I hope it will give you some idea
you have to manually detect the "Google Bot" from request's user agent and feed them little different content than you normally serve to your user.

Apart from <script> tags, what should I strip to make sure user-entered HTML is safe?

I have an app that reprocesses HTML in order to do nice typography. Now, I want to put it up on the web to let users type in their text. So here's the question: I'm pretty sure that I want to remove the SCRIPT tag, plus closing tags like </form>. But what else should I remove to make it totally safe?
Oh good lord you're screwed.
Take a look at this
Basically, there are so many things you want to strip out. Plus, there's stuff that's valid, but could be used in malicious ways. What if the user wants to set their font size smaller on a footnote? Do you care if that get applied to your entire page? How about setting colors? Now all the words on your page are white on a white background.
I would look into the requirements phase again.
Is a markdown-like alternative possible?
Can you restrict access to the final content, reducing risk of exposure? (meaning, can you set it up so the user only screws themselves, and can't harm other people?)
You should take the white-list rather than the black-list approach: Decide which features are desired, rather than try to block any unwanted feature.
Make a list of desired typographic features that match your application. Note that there is probably no one-size-fits-all list: It depends both on the nature of the site (programming questions? teenagers' blog?) and the nature of the text box (are you leaving a comment or writing an article?). You can take a look at some good and useful text boxes in open source CMSs.
Now you have to chose between your own markup language and HTML. I would chose a markup language. The pros are better security, the cons are incapability to add unexpected internet contents, like youtube videos. A good idea to prevent users' rage is adding an "HTML to my-site" feature that translates the corresponding HTML tags to your markup language, and delete all other tags.
The pros for HTML are consistency with standards, extendability to new contents types and simplicity. The big con is code injection security issues. Should you pick HTML tags, try to adopt some working system for filtering HTML (I think Drupal is doing quite a good job in this case).
Instead of blacklisting some tags, it's always safer to whitelist. See what stackoverflow does: What HTML tags are allowed on Stack Overflow?
There are just too many ways to embed scripts in the markup. javascript: URLs (encoded of course)? CSS behaviors? I don't think you want to go there.
There are plenty of ways that code could be sneaked in - especially watch for situations like <img src="http://nasty/exploit/here.php"> that can feed a <script> tag to your clients, I've seen <script> blocked on sites before, but the tag got right through, which resulted in 30-40 passwords stolen.
<iframe>
<style>
<form>
<object>
<embed>
<bgsound>
Is what I can think of. But to be sure, use a whitelist instead - things like <a>, <img>† that are (mostly) harmless.
† Just make sure that any javascript:... / on*=... are filtered out too... as you can see, it can get quite complicated.
I disagree with person-b. You're forgetting about javascript attributes, like this:
<img src="xyz.jpg" onload="javascript:alert('evil');"/>
Attackers will always be more creative than you when it comes to this. Definitely go with the whitelist approach.
MediaWiki is more permissive than this site; yes, it accepts setting colors (even white on white), margins, indents and absolute positioning (including those that would put the text completely out of screen), null, clippings and "display;none", font sizes (even if they are ridiculously small or excessively large) and font-names (even if this is a legacy non-Unicode Symbol font name that will not render text successfully), as opposed to this site which strips out almost everything.
But MediaWiki successifully strips out the dangerous active scripts from CSS (i.e. the behaviors, the onEvent handlers, the active filters or javascript link targets) without filtering completely the style attribute, and bans a few other active elements like object, embed, bgsound.
Both sits are banning marquees as well (not standard HTML, and needlessly distracting).
But MediaWiki sites are patrolled by lots of users and there are policy rules to ban those users that are abusing repeatedly.
It offers support for animated iamges, and provides support for active extensions, such as to render TeX maths expressions, or other active extensions that have been approved (like timeline), or to create or customize a few forms.