How to remove page element from search results / keywords? - html

Here is the problem: I have a simple page with a form and not much other text content. In the form I have, for example:
<h1>It should be my search result</h1>
<form>
<select name="Months">
<option>January</option>
<option>February</option>
...
<option>December</option>
</select>
</form>
In the search results/page keywords I get more month names than other important content, whereas the keyword should only be IT SHOULD BE MY SEARCH RESULT. How can I exclude the select's content from the results? Is that possible?

Repositioning the code (as mentioned in the other answer) does work when you're trying to prioritize content on the markup level. However, it doesn't take care of the keyword density of the month names you want to suppress.
If your page is littered with multiple dropdowns containing the same 12 months, you could use JavaScript to add the dropdowns after the page has finished loading (see the sketch below). That hides them from Google's crawlers while leaving them in place for site visitors. Of course, too much JavaScript is a performance hit.
(Note: on an SEO level, Google doesn't like it when you show Google a different page than what a site visitor would see, but I don't think this would hurt you.)
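As a rough sketch of that approach (purely illustrative; the "month-slot" placeholder id is made up and not part of the original markup), the dropdown could be built after the page loads:
<div id="month-slot"></div>
<script>
// Build the month <select> on the client, so the month names are not in
// the initial HTML that a crawler without JavaScript execution would index.
document.addEventListener('DOMContentLoaded', function () {
  var months = ['January', 'February', 'March', 'April', 'May', 'June',
                'July', 'August', 'September', 'October', 'November', 'December'];
  var select = document.createElement('select');
  select.name = 'Months';
  months.forEach(function (month) {
    var option = document.createElement('option');
    option.textContent = month;
    select.appendChild(option);
  });
  document.getElementById('month-slot').appendChild(select);
});
</script>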
If your month names are literally inside the search result content, you could try using server-side code (PHP, ASP.NET, etc.) to filter out the month names entirely or replace them with ellipses, though that may be confusing.
It really depends on what your results page would look like. More information would be needed to get a better idea on the best way to tackle it.

Here is an article on 4 ways to hide content from Google
http://www.searchenginejournal.com/4-ways-to-hide-content-from-google-and-googles-reaction/6782/
I would use the option of moving the navigation to the bottom of your code and then repositioning it with CSS so that it appears in the location you want. Then from the user's perspective it will appear at the top, and from Google's perspective it will appear at the bottom.
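A minimal sketch of that idea, with made-up class names: the navigation sits last in the HTML source, and CSS pulls it back up to the top visually.
/* .page is the wrapper; .nav is the navigation placed at the bottom of the markup */
.page { position: relative; padding-top: 60px; }
.nav  { position: absolute; top: 0; left: 0; width: 100%; height: 60px; }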

No guarantee that this would work (it might be viewed as deceiving the bot), but you could try splitting the month names into meaningless parts with <wbr>.
<select name="Months">
<option>Ja<wbr>nuary</option>
<option>Fe<wbr>bru<wbr>ary</option>
...
<option>Dece<wbr>mber</option>
</select>

Related

How can I add an icon to select box choices?

Basically I have a text box with a modifier dropdown. I would like an icon of the current choice to display when chosen. The problem with the current set-up (using Unicode) is that the icons do not always display, such as on Google Chrome (unless the specific fonts have been installed).
For example see the Fiddle:
https://jsfiddle.net/q5eLLu2h/
<select class="squareDropdown" id="da" name="da">
<option value="AND">➕ AND</option>
<option value="EXCEPT">➖ EXCEPT</option>
<option value="OR"><b>O</b> OR</option></select>
<input class="textArea" id="d" name="d" placeholder="d" type="text" />
<br>
<br>
<br>
<select class="squareDropdown" id="ca" name="ca">
<option value="OR"><b>O</b> OR</option>
<option value="AND">➕ AND</option>
<option value="EXCEPT">➖ EXCEPT</option>
</select>
<input class="textArea" id="c" name="c" placeholder="c" type="text" />
Does anyone have any advice on what I could do? Either a way to get the Unicode icons to display (like forcing the person to download the fonts?), or a simple way to get pictures or images into a select box, or anything really. jQuery UI is installed; is there anything I can do with it?
Thank you
Options are hard to style.
The basic reason is that they are reaching into the operating system for generation rather than being generated solely by the browser like most other website HTML elements.
This is why file upload form field 'submit' buttons don't follow the same rules as any other submit buttons on a form.
Here is some more detail about limited and inconsistent possibilities, and a question that proves <option> can't have any children (this in turn stops the pseudo-elements from working, unfortunately, as pseudo-elements are rendered like DOM nodes).
All this means that using an icon font won't work either, as you'd need to target just the bit with the icon in it, which you could only do with extra child elements.
Using JQuery
There are many ways of mimicking select option form fields with various JavaScript plugins that will give you more control; however, they come with an important caveat. Standard select elements bring with them extra usability and accessibility features (such as keyboard focus and operation) that JavaScript plugin writers don't necessarily think about, so use them with care.
It's also worth remembering that when you replace built-in functionality with jQuery you take on quite a lot of testing and development overhead - select elements just work in all browsers, without their functionality needing testing. That overhead grows further if you do pay attention to all the accessibility points.
The MDN article does, however, judge that this is the best way and lists some good plugins for a solution, which I would use, as MDN can be trusted to have considered all the important points I mention above.
To guard against link rot, here are the two of the three they recommend that still link to a product:
UniForm
FormalizeMe
(please note, I haven't tried these, I'm just trusting Mozilla!)
Unicode
You are trying to use Unicode characters in your option tags. This should work, without forcing a download on the user, if you:
Use a font that has the unicode characters you want to use defined.
Ensure that you have the charset in your docs set to utf-8
However, you'll only ever get characters, never styleable icons and you might be stuck with websafe or proprietary fonts.
This jsfiddle demonstrates it with the web-safe font-family: sans-serif, using characters taken from this link, which has a big list of those that are commonly supported.
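A minimal sketch of the two points above (the plus and minus characters here are just examples; pick yours from such a list of commonly supported characters):
<!-- declare the encoding so the characters are read correctly -->
<meta charset="utf-8">
<!-- use a web-safe font stack that is likely to contain the glyphs -->
<select style="font-family: sans-serif;">
  <option value="AND">&#43; AND</option>         <!-- plus sign -->
  <option value="EXCEPT">&#8722; EXCEPT</option> <!-- minus sign, U+2212 -->
</select>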
And finally
For small option sets radio buttons are probably better. The user can immediately see the available options rather than having to open them each time (better learnability, faster cognition); plus, they can make their desired selection, or change their desired selection with just one click each time. The guys over at UX.SE discuss what to use further, raising other points.
All your problems might just disappear if you think about whether <select> is the best option.

Making a div non-indexable?

I have a div with some sentences that I don't want to be indexed by search engines.
Is it possible to somehow hide this from Google?
I thought about using frames, and having the site within the frame being blocked by robots.txt, but I've never liked the idea of using frames.
Are there other options?
Technically, you could use an iframe and put <meta name="robots" content="noindex"> into the iframed document. Using suitable attributes and CSS, you can make the iframed document appear as part of the page, mostly, though you would still need to reserve some fixed area for it.
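A minimal sketch of the iframe approach (the file name hidden-text.html is made up):
<!-- in the host page -->
<iframe src="hidden-text.html" style="border: 0; width: 100%; height: 120px;"></iframe>
<!-- hidden-text.html -->
<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex">
</head>
<body>
  <div>The sentences you do not want indexed go here.</div>
</body>
</html>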
Or you could generate the div with JavaScript, though then it would not be seen when JavaScript is disabled. Note that search engine bots may execute JavaScript code and might thus “see” the generated content, though I would not expect that to happen now or in the near future.
If the content is text, without internal markup or images etc., you could have an empty div with a CSS rule that adds content using the :before pseudo-element and the content property. This would fail for users with CSS disabled or with an aggressive user style sheet, and search engine bots might some day start interpreting CSS.
There might be trickier methods, too, but as a whole, there is no good way, I think. It’s probably more useful to consider why you would want to prevent people from finding the page on the basis of its content. As a tool for hiding information, it would be inefficient.
You could create images from the sentences, then the text wouldn't be indexed.

Is there a way to make search bots ignore certain text? [closed]

I have my blog (you can see it if you want, from my profile), and it's fresh, and so are Google's parsing results for it.
The results were alarming to me. Apparently the most common 2 words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These 2 words will be present in every post, while other words will be more rare.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed; hopefully this has too).
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell Google not to index parts of your documents: using googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index — content surrounded by “googleoff: index” will not be indexed by Google
anchor — anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
snippet — content surrounded by “googleoff: snippet” will not be used to create snippets for search results
all — content surrounded by “googleoff: all” is treated with all of the above attributes (index, anchor, and snippet)
Source
Google omits content in HTML tags that carry the data-nosnippet attribute from search snippets:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding; perhaps that should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe AJAX it in after load; see the sketch after this list). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it, one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-element to add your content. I'm not sure whether bots will index words in CSS content values and relate them to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't add much, if any, value to it. For example, the CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need IE conditional comments if you care about that.
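And here is a rough sketch of the first option, AJAXing the links in after load (the /rss-links.html fragment URL and the container id are made up for illustration):
<div id="rss-links"></div>
<script>
// Load the RSS/feed links only after the page has finished loading,
// so the link text is not part of the initial HTML.
window.addEventListener('load', function () {
  fetch('/rss-links.html')
    .then(function (response) { return response.text(); })
    .then(function (html) {
      document.getElementById('rss-links').innerHTML = html;
    });
});
</script>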
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web search at all. So please refrain from doing that; I don't think that answer should be marked as correct, as it might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
The only control that you have over the indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You can basically prohibit certain links and URLs, but not necessarily keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans see what makes sense on a page; they will spend time on a blog that has good content that is rare and unique.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in the same way. Your page ranking also increases as daily visits increase and the site content gets better and is updated every day.
This page has the word "Answer" repeated multiple times; that doesn't mean it will not get indexed. What matters is how useful it is to everyone.
I hope this gives you some idea.
You would have to manually detect "Googlebot" from the request's user agent and serve it slightly different content than you normally serve to your users.
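A rough sketch of what that could look like in plain Node.js (purely illustrative; note the earlier caveat that serving bots a substantially different page can be treated as cloaking):
const http = require('http');

http.createServer(function (req, res) {
  const userAgent = req.headers['user-agent'] || '';
  const isGooglebot = /Googlebot/i.test(userAgent);

  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
  if (isGooglebot) {
    // Version without the "Comments RSS" / "Post Feed" link text.
    res.end('<html><body><article>Post content...</article></body></html>');
  } else {
    res.end('<html><body><article>Post content...</article>' +
            '<a href="/feed">Post Feed</a></body></html>');
  }
}).listen(8080);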

Any tool to show you all elements/pages in a site that are affected by a particular CSS rule?

Of course we can use tools like Firebug to highlight portions of HTML and see what all CSS is being applied.
But what about the reverse? Is there some kind of tool which would allow you to highlight a particular CSS rule and show you all the pages on a site (either static HTML pages or their dynamic templates) that it applies to?
Example: I've come to work on a new site, very large, and I need to edit the CSS on a particular page; but in doing so, I have no idea how many other pages on the site might also use these class names and hence be affected. Of course I can try to search the whole site for the class name(s), but this can be time-consuming or tricky. This site has a class named "ba", for example. Guess how many irrelevant pages will turn up if I just search for "ba"? So how about including a double quote ("ba)? Well, it could be in the middle of a few other classes (class="hz ba top"), at the end (class="hz ba"), etc. What's more, it could be dynamically plugged in via server-side code, making it even harder to find. A tool that could somehow spider your site and identify all the places your CSS change will affect would be great.
Not exactly that, but there is a Firebug plugin that does this for any loaded page:
http://robertnyman.com/firefinder/
You could use regular expressions.
For example, in Dreamweaver, in the search dialog box:
select 'Find in: Entire current local site.."
select 'Search: Source code'
check 'Use regular expression'
in the find textarea type class\s*?=\s*?".*?content.*?"
click 'find all'
The same regular expression could be used with other software that can search inside files using regular expressions.
for example : http://www.sowsoft.com/search.htm (not affiliated with them, just found it for here..)
Keep in mind, though, that all the suggestions here do not take into account the case where the class is added by script.
If you use a Mac, there's an excellent shareware app called "CSSEdit" by an indie Mac shop in Belgium. A single-user license is 30 euros. I've used it for approximately three years and I can recommend it highly. It's a mature, stable app (though continuously updated/improved), widely used among Mac web designers and those of us who are not but need all the help we can get, which CSSEdit certainly seems to provide.
To show elements on the html page styled by a given selector:
(i) open both the style sheet and the markup page (markup page must have a link to the style sheet);
(ii) click the X-Ray button off (must read 'Not Active' below the button);
(iii) in the style editor, click any selector (i click it so that my cursor is at the left margin, e.g., in front of the '#', etc.);
(iv) now click 'inspector' on the mark-up page (next to 'X-ray').
Now, look at your mark-up page--it will have a blue outline around the elements affected by the style you clicked on in step (iii) above.
For this kind of thing I just use grep or - even better - ack.
If you're concerned about false positives when searching for short patterns, you can do a double filter: grep all lines containing class= and feed its output into another grep that narrows the result down to the lines containing both class= and your search pattern (which can also be specified more precisely with a regexp using word boundaries, like \bba\b, to avoid matching bar or abba).
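If you'd rather script that than run it by hand, here's a small Node.js sketch along the same lines (the file extensions and the class name "ba" are just examples):
const fs = require('fs');
const path = require('path');

const className = 'ba';
// Match the class name as a whole word inside a class="..." attribute.
const pattern = new RegExp('class\\s*=\\s*"[^"]*\\b' + className + '\\b[^"]*"');

function scan(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      scan(full);
    } else if (/\.(html|php|tpl)$/.test(entry.name)) {
      if (pattern.test(fs.readFileSync(full, 'utf8'))) console.log(full);
    }
  }
}

scan('.');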
How about putting an ID on the body of each page, and using that to restrict the CSS to the pages clearly stated in the CSS?
Like this:
#mypage .description,
#myotherpage .description {
}
Cons:
Must put a body ID on each page / template.
Must specify each page the CSS should apply to. More CSS code to manage.
Makes the CSS less easy to scan through with your eyes (since the line starts with the page ID and not the CSS style). This is a bigger problem if some of the CSS styles are used on several pages (since the css spec for each style would be long).
Pros:
Avoids unintended CSS change propagation, i.e. changing one page affecting other, unknown pages.
See what pages a CSS change would affect, when you're editing the CSS style itself. The information is right there; no need to search/grep for it.
Forces developers to specify what pages the CSS would affect. If you'd just included this information as comments in your CSS, some person would inevitably forget to update the comment when the CSS is used on a new page.
I agree with this statement, made by a friend:
"Minimize CSS that is used several places. It's not like programming; it's better with a little duplicate CSS, than unmanageable CSS. (Pages like apple.com, has own stylesheets for
each page, and some global CSS.)"
- Olav Bjørkøy, original creator of the Blueprint CSS framework
I'd love your input on this, or if any of you have found a better way.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with well-formed markup and with terrible markup - whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, consider each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images - the more the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
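As a sketch of the feature-gathering step in browser JavaScript (the training of the SVM or Bayesian model itself is out of scope here; the features follow the suggestion above):
// Turn every <div> on the page into a small feature vector that a
// classifier (SVM, naive Bayes, ...) could be trained on.
function featureVector(el) {
  var text = el.textContent || '';
  return {
    words: text.split(/\s+/).filter(Boolean).length,
    links: el.getElementsByTagName('a').length,
    images: el.getElementsByTagName('img').length
  };
}

var vectors = Array.prototype.map.call(
  document.getElementsByTagName('div'), featureVector);
console.log(vectors);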
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
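A quick, naive sketch of that heuristic in browser JavaScript (the penalty factor is arbitrary, and nested divs share their text, so a real implementation would need to break ties more carefully):
// Score each <div> by its text length, penalising text that sits inside links.
function linkTextLength(el) {
  return Array.prototype.reduce.call(
    el.getElementsByTagName('a'),
    function (sum, a) { return sum + a.textContent.length; }, 0);
}

var best = null, bestScore = -Infinity;
Array.prototype.forEach.call(document.getElementsByTagName('div'), function (div) {
  var score = div.textContent.length - 3 * linkTextLength(div);
  if (score > bestScore) { bestScore = score; best = div; }
});
console.log(best);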
The content area is almost always the area with the greatest width on the page.
I would probably start with the title and anything else in a head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.); beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites use a blogging platform.
So I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back to the other solutions: identifying the biggest chunk of text, and so on.
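A tiny sketch of that rule set in JavaScript, using only the two markers mentioned above:
// Try the known blogging-platform markers first, then fall back to heuristics.
var post =
  document.querySelector('div.entry') ||     // WordPress
  document.querySelector('div.post-body');   // Blogspot

if (post) {
  console.log(post.textContent.trim());
} else {
  // fall back to e.g. the "largest text block" heuristics discussed above
}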
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor, Mercury, a web service.
If you're interested in some code showing how this can be done and prefer JavaScript, there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.