Is there a way to make search bots ignore certain text?

I have a blog (you can find it from my profile if you want). It's fresh, and so are Google's crawl results for it.
The results were alarming to me. Apparently the two most common words on my site are "rss" and "feed", because I use link text like "Comments RSS" and "Post Feed". These two words appear in every post, while other words are much rarer.
Is there a way to make these links disappear from Google's parsing? I don't want technical links to get indexed; I only want the content, titles, and descriptions indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in three years many things could have changed, hopefully this too).
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of a page, or transforming those parts in such a way that they are visible to humans but invisible to robots.

There is a simple way to tell Google not to index parts of your documents: use googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Note the "index" parameter, which may be set to any of the following:
index — content surrounded by "googleoff: index" will not be indexed by Google
anchor — anchor text for any links within a "googleoff: anchor" area will not be associated with the target page
snippet — content surrounded by "googleoff: snippet" will not be used to create snippets for search results
all — content surrounded by "googleoff: all" is treated with all of the above (not indexed, anchor text not associated, not used for snippets)

Google omits content marked with the data-nosnippet attribute from search result snippets:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives

I work on a site with top-3 Google rankings for thousands of school names in the US, and we do a lot of work to protect our SEO. There are three main things you could do (all of which are probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or JavaScript to place it where you want readers to see it (see the sketch after this list). This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
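A minimal sketch of the first option, with hypothetical class names: the feed links come last in the source order, and CSS repositions them visually near the top of the post.
<div class="post">
  <article class="post-body">
    Main article text that should carry the keywords goes here.
  </article>
  <aside class="feed-links">
    <a href="/comments/feed">Comments RSS</a> <a href="/feed">Post Feed</a>
  </aside>
</div>
<style>
  /* the post is the positioning context; leave room at the top */
  .post       { position: relative; padding-top: 2em; }
  /* visually lift the late-in-source links to the top-right corner */
  .feed-links { position: absolute; top: 0; right: 0; }
</style>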
That said, crawlers are smart, and you're not the only site filled with permalink and RSS links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.

Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding; perhaps this should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure "RSS" etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe AJAX it in after load; see the sketch after this list). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it; one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-element to add your content. I'm not sure whether bots will index words in CSS content values and relate them to each page, but it seems unlikely. Putting words like "RSS" in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't give it much, if any, weight. For example, the HTML and CSS could be:
.add-text:after { content:'View my RSS feed'; }
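The matching HTML element, with the class name simply paired to the selector above and the href only an example, would then be empty in the markup:
<a class="add-text" href="/feed"></a>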
Note that the above will not work in older versions of IE, so you may need some IE conditional comments if you care about that.
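A minimal sketch of the first option, assuming a hypothetical placeholder element with id "feed-links": the link text only exists after the page has loaded, rather than sitting in the delivered HTML.
// insert the feed link after the page has loaded
document.addEventListener('DOMContentLoaded', function () {
  var placeholder = document.getElementById('feed-links');
  if (placeholder) {
    var link = document.createElement('a');
    link.href = '/feed';            // example URL
    link.textContent = 'Post Feed';
    placeholder.appendChild(link);
  }
});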

"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web-search at all. So please refrain from doing that and I think that should not be marked as a correct answer as this might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
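A minimal sketch of that approach, with the file name and dimensions only as examples:
<!-- the feed links live in excluded.html; the host page only embeds it -->
<iframe src="/excluded.html" title="Feed links" width="300" height="40"></iframe>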

The only control that you have over indexing robots is the robots.txt file. See this documentation, linked by Google on its page explaining the usage of the file.
You can basically prohibit certain links and URLs, but not necessarily keywords.
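For illustration, a minimal robots.txt that blocks whole URLs (the paths are only examples); note that it cannot exclude individual words inside a page:
User-agent: *
Disallow: /feed/
Disallow: /comments/feed/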

Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )

Google's crawlers are smart, but the people who program them are smarter. Humans always see what is sensible on a page; they will spend time on a blog that has good content that is rare and unique.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in much the same way. Your page ranking also increases as daily visits increase and as the site's content gets better and is updated every day.
This page has the word "Answer" repeated multiple times. That doesn't mean it will not get indexed. What matters is how useful it is to everyone.
I hope this gives you some ideas.

You would have to manually detect Googlebot from the request's user-agent header and feed it slightly different content than you normally serve to your users.
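A minimal sketch of that idea using Node and Express; the framework, route, and markup are just assumptions, and keep in mind that serving crawlers substantially different content risks being treated as cloaking.
// serve the page without the feed links when the user agent looks like Googlebot
const express = require('express');
const app = express();

app.get('/post/:slug', (req, res) => {
  const ua = req.get('User-Agent') || '';
  const isBot = /Googlebot/i.test(ua);
  const feedLinks = isBot ? '' : '<a href="/feed">Post Feed</a>';
  res.send(`<article>Post body for ${req.params.slug} goes here.</article>${feedLinks}`);
});

app.listen(3000);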

Related

<nav> vs <article> for SEO

In terms of SEO, if I want to group relevant page content together to maximize search engine readability, should I use the <nav> tag or the <article> tag?
1) It's not there yet.
2) If it were, and you were wrapping menus as article, or wrapping affiliate link farms as article, Google would slap you (keep that in mind in three or four years).
3) If you have lots of legitimate content, and each piece of content is self-contained (ie: suitable for article), then not only should you wrap it in an article tag, but you should also learn how to use Google's "Rich Snippet Tool", which was recently renamed "Structured Data Tool".
If you learn how to mark things up, both in an html5-friendly way, and in a Google-friendly microformat, then GoogleBot will grab all of the content it knows how, and it will be displayed in search results and elsewhere, when relevant.
Like I said... ...that's if you've got content which is worthy of doing this, because otherwise, Google will slap you, eventually, if you try to use it for evil.
The article tag: this tag lets you mark up separate entries in an online publication, such as a blog or a magazine. It is expected that marking entries with the tag will make the HTML cleaner, because it reduces the need for generic wrapper tags. Search engines will also probably put more weight on the text inside the <article> tag than on the contents of other parts of the page.
The nav tag: navigation is one of the important factors for SEO, and everything that eases navigation is welcome. The <nav> tag can be used to identify a collection of links to other pages.
So both tags have their own function and can be used according to need.
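A minimal sketch of how the two elements are typically combined (the links and content are only placeholders):
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/archive">Archive</a></li>
  </ul>
</nav>
<article>
  <h1>Post title</h1>
  <p>Self-contained post content goes here.</p>
</article>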

allowing users to add html formatted notes

We want to allow the users of our web application to leave notes formatted with HTML.
On the client side we are providing them with CKEditor [http://ckeditor.com/], which is a WYSIWYG editor that generates HTML, which is then submitted to the server via a form.
We then want to display the notes created by the users with exactly the same formatting as they submitted them.
My concerns are:
Putting attacks and bad intentions aside, how can I encapsulate the notes when displayed on the site, so that
a. they don't inherit the design from the rest of the page
b. they don't influence the rest of the page, for example by accidentally opening and not closing a tag, or closing one without opening it
Malicious code injection attacks
At the moment, the first concern is much more important, as it's an in-house product for our clients and is not open to the wider public. But security comments are very welcome as well.
Possible solutions that I consider are:
Ideally, I am looking for a way to encapsulate these pieces of user HTML, as in: inside this area I show what you submitted (rendered, not source); you cannot influence and are not influenced by the code on other parts of the page.
Specifically, we thought of displaying the notes inside iframes.
Another natural direction is parsing the inserted content and stripping things out.
Any inputs are welcome, and mainly:
How can I "encapsulate" the inserted contents, if I can?
Any comments on the iframe direction
Do I have to parse the contents anyway? What do I absolutely have to strip out?
How can I "encapsulate" the inserted contents, if I can?
The truth is, unless you 'fix' their code (via some kind of check) you will get issues (think broken divs, etc.). I don't see how you can encapsulate HTML from HTML. I would, however, only let them put in content like bold, italics, centering, and so on.
Any comments on the iframe direction
Personally I wouldn't go that route; it opens a new can of worms for security and is not a 'clean' way of doing this.
Do I have to parse the contents anyway? What do I absolutely have to strip out?
Yes, don't be lazy. Some devs always say "well, I don't need it, it's internal", and then it becomes an external thing, and at that point it's so big that only a full rewrite will set it right; it keeps chugging along until something is broken, then the shit hits the fan and the big boss asks why this wasn't done. Long story short:
Yes, you have to parse / validate / check all your input, whether internal or external. Anything other than that is just lazy.
In closing, I would do it by using an editor like the one here on SO, which only allows some types of selective formatting. After all, a broken <b> will not kill your whole layout; a broken <div> will...
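A minimal sketch of that whitelist idea, assuming the DOMPurify library on the client and a hypothetical note container; only a handful of formatting tags survive, so a stray <div> or <script> cannot break or attack the host page.
// userHtml: the HTML string submitted from the editor
const clean = DOMPurify.sanitize(userHtml, {
  ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'p', 'ul', 'ol', 'li', 'a'],
  ALLOWED_ATTR: ['href']
});
document.getElementById('note-container').innerHTML = clean;
Any server-side sanitizer with a tag whitelist would serve the same purpose; the point is to validate input rather than trust it.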
Markdown formatting
You could use exactly the same kind of intermediary solution that this site (Stack Overflow) uses for its user-generated content (questions, answers, comments).
It's not a complete replacement for WYSIWYG solutions like CKEditor, but it's just what typical user-generated content would require. It even allows you to include images.
For a complete guide:
https://www.markdownguide.org/cheat-sheet

Do #anchors after urls affect ranking in seo?

While programming a new tabs system in JavaScript, where each tab is clickable and has its own anchor text, it would be easier to make these anchors numbers: #1, #2, #3, etc.
On the other hand, it's more difficult to make #text anchors, but if that has meaning for on-page SEO, then I will consider re-programming my tabs system.
Do anchors in URLs affect ranking in SEO when the on-page hyperlinks are, for example:
http://website.com/cotton#1
http://website.com/cotton#2
http://website.com/cotton#3
http://website.com/cotton#trousers
http://website.com/cotton#hats
http://website.com/cotton#socks
What do you reckon in this case? Go with the more complex programming, or stick with the easier, auto-generated numeric tab anchors?
You should use an exclamation mark in your anchor links if you want them to be indexed by Google.
So instead of http://website.com/cotton#trousers you should use http://website.com/cotton#!trousers to have Google index them. It will be processed as if it were a separate page.
As far as I know, the so-called "fragments" of the URI (everything after the #) are important for SEO.
They usually point to a specific spot on a page, which is especially useful on sites that put a lot of information on one page, where you would otherwise need to scroll down to find it.
Then again, if those links can't be followed and only do something when JavaScript is enabled, search engines won't care, because they can't actually follow the links.
Always remember, search engines don't parse JavaScript; for them those are plain URIs that lead nowhere (or to the same page they're already on, which would be bad for SEO).
You can parse those fragments with PHP, though, and show the right tab when it is clicked, so that if JavaScript is turned off, the user as well as the search engine still gets the right content.

making websites accessible to visually impaired people?

Can anyone give me some tips or hook me up with some good links on this?
I'm having trouble finding much more than 'add alt text to the images', and I'm not sure how current the info is...
I get the whole semantic markup thing, but could probably do with a bit more guidance on that too.
I'm also not sure how things would work across different browsers.
1) Use HTML's heading tags for each and every section of content on your pages. The heading tags are: h1, h2, h3, h4, h5, h6
2) Ensure the aforementioned heading tags appear in the proper hierarchical sequence; for instance, h1 headings are more important than h2 headings (see the sketch after this list). Screen readers use these heading tags to navigate the content of the page. If they are not present, or are improperly ordered, a visually impaired user cannot navigate the page's content.
3) Don't use JavaScript to dynamically change content on the screen without first prompting the user that the text will change. If JavaScript changes text on the screen before a screen reader can read it, there is no way for a visually impaired user to know that the content changed.
4) Don't serve the user a thousand images. If an image does not convey relevant content, make it a CSS background image.
5) Be generous with the title attribute, especially on anchor tags. This can tell the user where they are about to go.
6) Don't put text in an image if it cannot be conveyed as alternate content. The visually impaired cannot read images.
7) Ensure all your metadata is relevant. If you change any of your content, be sure not to forget the extra bits of descriptive data.
8) AJAX can defeat accessibility. Be careful with your use of AJAX.
9) The visually impaired, and in fact almost all users, do not care how pretty your pages are. They are there to get information, shop, or for whatever other specific purpose. Make your data easy to understand and quick to retrieve. If users cannot get in, get what they wanted, and get out in record time, they won't ever come back.
10) Do not use any presentational tags or attributes in your HTML; use a stylesheet. If your HTML contains presentational conventions, it is probably not accessible.
11) If your content appears in a different order visually than it is written in the HTML (top to bottom), it likely fails accessibility. Keep things orderly and consistent: users expect content to flow from top to bottom, and tab indexing should follow the flow of content.
12) Do usability testing with screen reader software. It is not possible to know how accessible a page is just by looking at it.
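As a minimal sketch of points 1 and 2 (the section names are only placeholders), headings should step down one level at a time so screen-reader users can navigate by structure:
<h1>Blog title</h1>
<h2>Post title</h2>
<h3>First section of the post</h3>
<h3>Second section of the post</h3>
<h2>Comments</h2>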
I am totally blind myself, and you'd be amazed how much stuff still doesn't have alt attributes on it after all these years... Be careful: there are still a lot of myths out there, such as "no graphics allowed" (wrong), "tables are bad" (wrong), and "frames are bad" (wrong, though I realize frames are bad for other reasons). Ideally you should have someone who is blind test your site; if you need further help on this, feel free to email me at westbchris#gmail.com. One other thing: try to make controls that actually do things buttons and/or links. Clickable divs aren't cool, because it is not obvious that they do anything, and depending on which assistive technology you are using, you may not even be able to click on them.
Check out this explanation from Alertbox:
Disabled Users and the Web (the article is from 1996, but the issues still hold true, if not more so today)
...then follow the link at the bottom to the 148-page report with design guidelines (the document is copyrighted 2001, so it must have been updated since the original).
The term for this is Accessibility. Take a look at the W3C's WAI Website. I've always found Juicy Studio to be an invaluable resource for articles discussing accessibility.
There are in-depth definitions that are difficult to master and implement. Examples include Web Content Accessibility Guidelines (WCAG) and Section 508.
A less than official suggestion is to make your site easy to navigate with a text browser. Don't rely on colors or structure to convey content. Don't rely on widgets for important functionality.
EDIT: Thought I would add that you shouldn't bother testing your site with JAWS or another screen reader. Your inability to navigate a site would be more related to your inexperience with the screen reader than to the inaccessibility of the site. That said, having a sample of your target audience test your site for usability is highly beneficial.
EDIT #2: As discussed in comments, I intended to convey that you shouldn't make judgements on a site's usability based on your experiences with a screen reader. That said, I would recommend that anyone in Web development have exposure to the browsers/equipment used to view web sites including screen readers. It was poor wording in the original edit.
Well, it looks like no one has mentioned WAI-ARIA, which stands for Accessible Rich Internet Applications, i.e. making things like Gmail accessible. A decent search term for finding more on this is the A List Apart article on WAI-ARIA. It is already pretty well supported.
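As a minimal example of what ARIA adds (the id and message are only placeholders): a live region tells screen readers to announce content that JavaScript updates, which addresses the dynamic-content problem mentioned above.
<!-- screen readers announce updates to this region without the user having to find it -->
<div id="save-status" role="status" aria-live="polite">Saved.</div>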

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with well-formed markup, and terrible markup. Whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to the same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (included templates and such)
compare DOM trees for all documents on the same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, yet still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be improved further by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that these nodes are prioritized for other pages.
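A much-simplified sketch of that idea in JavaScript (text blocks instead of full DOM-tree diffs): split each fetched page into text blocks and keep the blocks that occur on only one page of the site, since navigation, footers, and ads tend to repeat everywhere. The pages map and the length threshold are assumptions.
// pages: a plain object mapping URL -> raw HTML, fetched elsewhere
function uniqueBlocks(pages) {
  const counts = new Map();       // block text -> number of pages containing it
  const blocksByUrl = new Map();  // URL -> list of block texts

  for (const [url, html] of Object.entries(pages)) {
    const text = html
      .replace(/<script[\s\S]*?<\/script>/gi, '')
      .replace(/<style[\s\S]*?<\/style>/gi, '')
      .replace(/<[^>]+>/g, '\n');                      // crude tag stripping
    const blocks = text.split('\n').map(s => s.trim()).filter(s => s.length > 40);
    blocksByUrl.set(url, blocks);
    for (const b of new Set(blocks)) counts.set(b, (counts.get(b) || 0) + 1);
  }

  const result = {};
  for (const [url, blocks] of blocksByUrl) {
    result[url] = blocks.filter(b => counts.get(b) === 1);  // candidate main content
  }
  return result;
}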
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images; the more the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the dom tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
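A rough in-browser sketch of that heuristic: score every <div> by how much of its text is not link text and pick the best-scoring one. The weighting is arbitrary, and in practice you would also want to exclude ancestor wrappers that merely contain the winner.
function findMainContent() {
  let best = null;
  let bestScore = 0;
  for (const div of document.querySelectorAll('div')) {
    const textLength = div.textContent.trim().length;
    let linkTextLength = 0;
    for (const a of div.querySelectorAll('a')) {
      linkTextLength += a.textContent.trim().length;
    }
    const score = textLength - 2 * linkTextLength;  // penalize link-heavy blocks
    if (score > bestScore) {
      bestScore = score;
      best = div;
    }
  }
  return best;
}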
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (ie h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last elements containing sentences with punctuation, and take everything in between. Headers are a special case, since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
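A small sketch of that test (the rules are deliberately rough): a block counts as "sentence-like" if it contains a comma and ends with sentence punctuation.
function looksLikeProse(block) {
  const t = block.trim();
  return /,/.test(t) && /[.!?]["')\]]?\s*$/.test(t);
}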
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are built on a blogging platform.
So I would create a set of rules by which to search for the content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could turn to the other solutions, such as identifying the biggest chunk of text and so on (see the sketch below).
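A small sketch of the rule-based idea, using the selectors quoted above and falling back to a generic heuristic (for example the "most text, few links" scoring sketched earlier) when no known class matches:
// try platform-specific containers first, then fall back
const post = document.querySelector('div.entry, div.post-body');
const content = post || findMainContent();  // findMainContent(): see the earlier heuristic sketch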
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code showing how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or, if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.