Rel canonical without a primary URL

Background: We have a CMS where the customer can select the places on a website where a piece of content is published. On a municipality website, an article describing a playground could be published in both the "For families" section and the "Parks" section. On a government site with instructions for companies, divided into sections by company type, instructions that are identical for all companies are published in every company-type section. There is often no definite primary location that is more correct than the others.
The CMS renders top, bottom and side content relevant to the part of the site you are in, so only the main content is identical between locations.
Questions:
Do I need rel canonical for URLs inside the same site, or is it only for external links?
If I need them, can I somehow specify that they are all "primary", or did I already do that by not having a canonical tag at all?
Do search engines generally show pages that have the canonical tag?

If you want to merge internal pages, then yes, a canonical is required for those pages; it is not only for external links.
By setting a canonical, you tell Google to preferentially display the target URL. There is no way to declare several URLs all "primary"; leaving the tag out simply leaves the choice of which duplicate to show up to Google.
No: they generally display the page that the canonical tag links to, not the page that carries it.
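For reference, the canonical hint itself is a single link element in the head of every duplicate page, pointing at the one URL you want shown; a minimal sketch, with the domain and paths as placeholders:
<!-- in the <head> of each duplicate location of the article -->
<link rel="canonical" href="https://example.com/for-families/playground-article" />
The element accepts exactly one target URL, which is why there is no way to mark several locations as equally primary.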

Related

Can I denote the language of each page of my website so that Google won't show content in a foreign language to a user?

I have a website that contains both Chinese and English content (and they are different, NOT translations of each other). When English-speaking users search for the name of my site on Google, some Chinese content also appears in the search results. Can I avoid this? Is there any HTML markup I can use to indicate the content language of each page? Thanks a lot!
Google ignores HTML code (such as lang attributes) when determining language. Instead it determines the page's language from the content itself.
They suggest having different subdomains or URL indicators for differing content (e.g. en.mypage.com and cn.mypage.com, or mypage.com/cn/content) and making sure the boilerplate differs too: your site headers, navigation, etc. should be localized, not just the content body.
More information is available on Webmaster Tools.
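One possible sketch of that suggestion (the paths and strings below are invented, following the mypage.com/cn/content pattern above): each language version lives under its own path and localizes the boilerplate, not just the body:
<!-- mypage.com/en/about : English boilerplate throughout -->
<nav><a href="/en/">Home</a> <a href="/en/about">About us</a></nav>
<!-- mypage.com/cn/about : Chinese boilerplate throughout -->
<nav><a href="/cn/">首页</a> <a href="/cn/about">关于我们</a></nav>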

Semantically, must text which visually looks like a heading use h1-h6 tags?

I have a page which contains a list of items as its content.
When no items exist, the design which I am to implement has a rather large heading reading something like:
'No results for this topic'
Now initially when I saw the design I instinctively wrapped the 'No results' text in a <h2> tag.
Afterwards I noticed that although I included meta content for title and description, Google displayed the 'no results' text as the title in search results - clearly not the desired result.
Now on one hand I want to stick to semantic markup, but on the other I don't want it to mess up my SEO.
So my question is: Do I really need to use a <h2> element here for semantic markup?
True, the designer decided to display the text to look like a heading - but does that mean it is semantically a heading?
Just for fun, I checked what Google does when you enter a search phrase with no results:
Result:
The 'No results' isn't displayed like a heading and (hence) isn't within an h1-h6 tag.
Disclaimer: I tried searching for an answer at W3C here and here, but that didn't really help.
Edit: I meant the 'No results' to be an example. Actually, I had similar cases where Google picked up other pieces of not-so-relevant text (which I had wrapped in a <h2> because of the design) as the title - even when the page contained many items.
I think such a message shouldn't appear in an h2 tag. But there are also other factors that determine what Google will display. The title, description and keywords should all vary between pages, but even that doesn't guarantee Google will use them.
In fact Google wants to be smarter than we are. On the English version of one of my main pages, Google used the logo's alt text as the page title even though the title is unique, so in Google it's now displayed as "mainpage - logo" instead of the normal title.
If I were you I would change "no results" from h2 to regular text, for example a p; a sketch of the change follows below. You should also consider whether you really need those pages indexed at all.
Google's "guidelines" change very often, and it can even penalize you if you have many subpages with effectively no content.
-- after editing question --
You should first check that your meta tags are unique on each page, including search pages, pagination pages and so on (if they are indexed). As I wrote just before, there is no guarantee that Google uses them at all: Google can take any part of your site and display it in the search result as the title or description.
A sitemap has no impact on what Google (or other search engines) indexes. It only helps search engines index pages faster, for example pages deep in the structure. For sub-pages you don't want indexed you need to put this in the HTML head:
<meta name="Robots" content="noindex,nofollow" />
to stop search engines that respect this rule from indexing the page (of course many crawlers / spam spiders don't respect it). After the change it takes some time for Google to deindex the page, depending on your site's size and how often the Google spider visits your website.

What is <link rel="image_src">

Today I came across a <link rel="image_src"> tag. I didn't know about it, so I googled it. Google tells me this tag is similar to og:image, so I went to the Open Graph site, http://ogp.me/, to read about it, but I found nothing about link rel="image_src". So is this tag a replacement for meta property="og:image", or is it a special tag from another specification? How is this tag used, and what is it for?
The rel attribute specifies the type of the link, i.e. the kind of the relationship between the document and the linked resource. Usually just a few keywords, like stylesheet and icon, are used. Although many other keywords have been proposed and registered, most of them are write-only: they are meant to express something, but nobody cares (no software uses the information).
The extension mechanisms of HTML5 include, in the description of link types, a somewhat obscure mechanism that allows, in theory, anyone to register his favorite keyword in the existing rel values wiki to make documents using it as a rel value "conforming".
And image_src has indeed been registered there, with the information that it is used to "specify a Webpage Icon for use by Facebook, Yahoo, Digg, etc.". No specification has been identified, but an article about it is linked to, and it is "probably redundant with rel=icon".
You can use this tag to choose the image used as the thumbnail when a link is shared.
When someone posts a link to your site on social media, such as Facebook, the image displayed with your link is usually the first one in your code. That may not be the image that best represents your site, and it may not fit well in the small box Facebook renders. The link rel="image_src" tag lets you control which image (or images; you can offer more than one by stacking separate link elements) is displayed alongside your link.
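A typical use looks like this, placed in the head of the page (the image URLs are placeholders):
<link rel="image_src" href="https://example.com/images/share-thumb.png" />
<!-- stack additional candidates to offer more than one image -->
<link rel="image_src" href="https://example.com/images/share-thumb-2.png" />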

Is there a way to make search bots ignore certain text? [closed]

I have my blog (you can see it from my profile if you want), and it's fresh, and so are the results of Google's robots parsing it.
The results were alarming to me. Apparently the two most common words on my site are "rss" and "feed", because I use link text like "Comments RSS" and "Post Feed". These two words appear in every post, while other words are rarer.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed, hopefully this too).
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell Google not to index parts of your documents: use the googleon and googleoff comments:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index — content surrounded by "googleoff: index" will not be indexed by Google
anchor — anchor text for any links within a "googleoff: anchor" area will not be associated with the target page
snippet — content surrounded by "googleoff: snippet" will not be used to create snippets for search results
all — content surrounded by "googleoff: all" is treated with all of the above
Google excludes content inside tags that have the data-nosnippet attribute from search-result snippets:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 Google rankings for thousands of school names in the US, and we do a lot of work to protect our SEO. There are three main things you could do (which are all probably a waste of time; keep reading):
Move the stuff you want to downplay to the bottom of your HTML and use CSS to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
Replace those links with images (you say you don't want to do that, but don't explain why not)
Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding; perhaps that should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe ajax it in after load); a minimal sketch follows this list. Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it; one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-element to add your content. I'm not sure whether bots will index words inside CSS content values, or relate them to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't give it much, if any, weight. For example, the HTML and CSS could be:
<span class="add-text"></span>
.add-text:after { content: 'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE conditional comments if you care about that.
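For the first option, a minimal sketch of writing the links in after load (the element id and feed URL are invented):
<div id="feed-links"></div>
<script>
// Inject the RSS link once the page has loaded, so it is absent from
// the initial HTML payload that crawlers fetch.
window.onload = function () {
  document.getElementById('feed-links').innerHTML =
    '<a href="/feed">Comments RSS</a>';
};
</script>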
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web-search at all. So please refrain from doing that and I think that should not be marked as a correct answer as this might create ambiguity.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
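A sketch of that approach, with invented file and link names:
<!-- host page: the excluded links are not part of this document at all -->
<iframe src="excluded.html" title="feed links"></iframe>
<!-- excluded.html, optionally marked noindex so it isn't indexed on its own -->
<meta name="robots" content="noindex" />
<a href="/feed">Comments RSS</a>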
The only control you have over indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You can basically prohibit certain paths and URLs, but not particular keywords.
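For example, a robots.txt can block whole paths, but it has no directive for individual words on a page (the paths below are invented):
User-agent: *
Disallow: /feed/
Disallow: /tag/rss/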
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans can tell what is worthwhile on a page; they will spend time on a blog that has good, rare and unique content.
It is all about common sense: how people find your blog and how much time they spend there. Google measures search results in much the same way. Your ranking also improves as daily visits increase and as the site's content gets better and is updated regularly.
This page has the word "Answer" repeated many times; that doesn't mean it won't get indexed. What matters is how useful it is to everyone.
I hope this gives you some ideas.
You would have to manually detect "Googlebot" from the request's user agent and feed it slightly different content than you normally serve to your users.

<cite> as part of semantic markup

One of the sites I develop has lots of information linked together: we have companies, and we have products for those companies. The company page links to the page listing the products for that company, and vice versa.
From the HTML spec:
CITE:
Contains a citation or a reference to other sources.
Does this imply that I could (semantically) use a <cite> for a company link? What about on the company page to a product?
If not, could someone tell me what might be the "correct" semantic tag for this?
If you're just linking to other pages then semantically you should just use <a href=...>. If you're quoting a small piece of information, like the information from the HTML spec in your question, and providing a link to the original source, you might use <cite>. Think of it as a citation in a book or research paper.
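To illustrate the distinction (names and URLs invented):
<!-- a plain navigational link between your own pages: just an anchor -->
<a href="/companies/acme/products">Products by Acme</a>
<!-- a citation: referencing a work, as in a paper or book -->
<p>As the <cite><a href="https://www.w3.org/TR/html/">HTML specification</a></cite> puts it, ...</p>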
I'm not sure that cite is intended to mark up links - you may be looking at something akin to a more professional (less inter-personal) XFN using the rel attribute of the link.
Cite is more for marking up titles of articles or other created work.
XFN is specifically for marking up the relationship you (or your company) have with the person or company you are linking to. What I'm not sure of is what xfn values there are (if any) for company links.
http://reference.sitepoint.com/html/xfn
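For what it's worth, an XFN relationship is expressed in the rel attribute of the link itself; XFN defines values such as colleague and co-worker for people, and it is unclear which, if any, fit company-to-company links (URL invented):
<a href="https://example.com/people/jane" rel="colleague">Jane Doe</a>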
What you might consider is in what detail the information will be used. Semantic markup, although a noble direction to head in, is not yet utilised to its full extent when a resource is looked at (by a human) or parsed (by a program).