<nav> vs <article> for SEO - html

In terms of SEO, if I want to group relevant page content together to maximize search engine readability, should I use the <nav> tag or the <article> tag?

1) It's not there yet.
2) If it was, and you were wrapping menus as article, or wrapping affiliate link-farms as article, Google would slap you (keep that in mind in three or four years).
3) If you have lots of legitimate content, and each piece of content is self-contained (i.e. suitable for article), then not only should you wrap it in an article tag, but you should also learn how to use Google's "Rich Snippets Testing Tool", which was recently renamed the "Structured Data Testing Tool".
If you learn how to mark things up, both in an html5-friendly way, and in a Google-friendly microformat, then GoogleBot will grab all of the content it knows how, and it will be displayed in search results and elsewhere, when relevant.
As I said, that only applies if your content is worth marking up this way; otherwise Google will eventually slap you if you try to use it for evil.

article tag:-
The <article> tag allows you to mark separate entries in an online publication, such as a blog or a magazine. It is expected that marking articles with the <article> tag will make the HTML code cleaner, because it reduces the need for <div> tags. Search engines will also probably put more weight on the text inside the <article> tag than on the content in other parts of the page.
nav tag:-
Navigation is one of the important factors for SEO, and everything that eases navigation is welcome. The new <nav> tag can be used to identify a collection of links to other pages.
So both tags have their own function, and each can be used according to need.
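As a rough illustration of the distinction described above, here is a minimal sketch of a blog page. The URLs and post titles are hypothetical placeholders, and nothing here guarantees any particular ranking effect.

```html
<!-- Collections of links to other pages go in <nav>. -->
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/archive">Archive</a></li>
  </ul>
</nav>

<!-- Each self-contained entry (a blog post, a magazine piece) goes in its own <article>. -->
<main>
  <article>
    <h2>First post title</h2>
    <p>Self-contained content of the first post.</p>
  </article>
  <article>
    <h2>Second post title</h2>
    <p>Self-contained content of the second post.</p>
  </article>
</main>
```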

Related

What is the purpose of the <figure> element?

I have been trying to understand the <figure> element; take a look at this from w3.org:
Self-contained in this context does not necessarily mean independent. For example, each sentence in a paragraph is self-contained; an image that is part of a sentence would be inappropriate for figure, but an entire sentence made of images would be fitting.
How can an image be part of a sentence? What is this talking about? I have read many explanations, yet I still don't understand why I would want to use this element. What is the purpose of this tag?
According to MDN:
Usually a <figure> is an image, illustration, diagram, code snippet, etc., that is referenced in the main flow of a document, but that can be moved to another part of the document or to an appendix without affecting the main flow.
An example of this (based on one from that same MDN link) might be a code snippet that prints the parts of the browser's navigator in an article about that attribute. You don't need to know the exact code to print that information to understand what is in the navigator, but it can aid the reader's understanding.
Additionally, the <figure> tag allows use of the <figcaption> tag as a child, which is a convenient and accessible way to caption images.
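A minimal sketch of the kind of captioned code snippet described above. It is loosely modelled on the MDN idea rather than copied from it, and the caption wording is my own.

```html
<!-- A code snippet marked up as a figure; it could be moved to an appendix
     without breaking the flow of the surrounding article. -->
<figure>
  <figcaption>Printing the browser's user agent string with <code>navigator</code></figcaption>
  <pre><code>console.log(navigator.userAgent);</code></pre>
</figure>
```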
HTML5 semantic tags are cryptic for everyone. Since the concepts are kept abstract and scientific in purpose, I will use a bunch of over-simplification to make it understandable, in spite of HTML5 gurus telling me how wrong I am.
Remember that semantic tags are created to be read by computers, not humans. They exist so that Google's scripts (and everyone's scripts, but mainly Google and hacker bots) can quickly search for figures and index them. To understand semantic tags, think like a script, not like a human.
"If I was a bot programmed by a hacker, how would I mine figures when crawling my website?"
If we code a script that looks for IMG tags, it will end up gathering a bunch of garbage (e.g. icons). What we're interested in is real content, and that's what "self-contained content" means: I can cut and paste the element from your website into my gallery of mined images, and it is still useful.
Garbage: icons, decorative images, images created by javascript plugins to prettify things, smileys, etc.
Good finds: charts, photos, diagrams, maps, drawings, etc.
The "Good finds" make sense even if you steal them from your webpage with no context whatsoever, except things like the accompanying "caption" tag. These tags allow crawlers to associate your images with text, making categorization easier, so you don't want to miss this content.
So we're interested in the figure caption, title, subtitle, whatever, and the figure should wrap all of this. The figure tag is not limited to images; it can contain text, video, audio, code blocks, anything as long as it's part of that "entity" in your document. So the following is a single figure:
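The original illustration is not reproduced here. As a stand-in, this is a minimal sketch of a single figure that wraps an image together with its caption; the file name and caption are hypothetical.

```html
<!-- One "entity": the image and its caption travel together. -->
<figure>
  <img src="monthly-sales.png" alt="Bar chart of monthly sales">
  <figcaption>Figure 1: Monthly sales</figcaption>
</figure>
```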
Now, in documents like scientific papers you will often find stuff like this:
Although each image makes sense by itself, they make more sense as a group, as they are connected by some kind of relationship; sometimes even the order matters (e.g. a series of steps, or photos of the states something goes through). That's a "sentence of figures" and you want to mine them together or you'll lose valuable information.
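A "sentence of figures" like this can be sketched with nested figure elements, which HTML allows. The file names and captions below are made up for illustration.

```html
<!-- The outer figure groups related images so they can be mined together. -->
<figure>
  <figcaption>Figure 2: Three stages of the process</figcaption>

  <figure>
    <img src="stage-1.png" alt="Photo of the first stage">
    <figcaption>(a) First stage</figcaption>
  </figure>

  <figure>
    <img src="stage-2.png" alt="Photo of the second stage">
    <figcaption>(b) Second stage</figcaption>
  </figure>

  <figure>
    <img src="stage-3.png" alt="Photo of the third stage">
    <figcaption>(c) Third stage</figcaption>
  </figure>
</figure>
```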
Our mining algorithm would have to understand your website's content with some kind of natural-language-processing AI, or it could just gather the FIGURE tags you provide for it. The latter is good semantics.
Summarizing, imagine you're the Google AI algorithm mining figures in your code. Write your figure tags to make this script's job easy.

Is it better not to change HTML markup for the same content on different pages?

I am using the section tag to group topics and replies on the forum page. When I need to load a topic and its replies on another article page, I use a div tag for the same block and change the topic title from h1 to h2. This is valid, but will it make navigation confusing for assistive technology?
Assuming that the assistive technology you are talking about concerns mainly screenreaders, the best way for you to know how accessible your pages are is by downloading one yourself and testing it out. A free screenreader that I have used to do this is called NVDA but there are more out there.
In general, screenreaders work best when a page has a logical structure behind it. If you are displaying several articles, make sure that each article sits at a similar hierarchical location on the page and that each article resembles the others in its structure. Using HTML5 semantic tags like article, aside and the like can be helpful but is not necessary. Screenreaders and other assistive technologies have made do for a lot longer than these tags have been around. They are certainly good to use when possible, but there are other, more important ways to make your page accessible to as wide an audience as possible.
Another good thing to do is to use heading tags (h1 to h6) for titles, and to use them in order. Screenreaders often give users the option to skip from heading to heading in order to get a summary of what is on the page. You can also include visually hidden links (for example, positioned far off the edge of the page using CSS) at the top of the page, or in sections where a visible heading may not be appropriate. These will be read in context by screenreaders without your sighted users seeing them.
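A minimal sketch of the off-screen link technique together with ordered headings. The class name skip-link and the id main-content are my own placeholders, and the off-screen offset is just one common way of hiding the link visually.

```html
<style>
  /* Moves the link off-screen for sighted users; screenreaders still read it. */
  .skip-link {
    position: absolute;
    left: -9999px;
  }
  /* Brings the link back on-screen when it receives keyboard focus. */
  .skip-link:focus {
    left: 0;
  }
</style>

<a class="skip-link" href="#main-content">Skip to main content</a>

<main id="main-content">
  <!-- Headings used in order, so users can skip from heading to heading. -->
  <h1>Page title</h1>
  <h2>First section</h2>
  <h2>Second section</h2>
</main>
```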
If you are concerned about accessibility, a good way to get a clearer picture of how accessible your pages are is by following the WCAG (Web Content Accessibility Guidelines) standard recommendations. WCAG is managed by the W3C, and has various levels of accessibility that you can consider respecting when developing your content. The W3C has a list of validators that can be found here.
To answer your question from comments:
How does it sound when a topic title is read as h2, the user clicks it, arrives at the forum page, and the topic title becomes h1?
This shouldn't confuse most people, especially if you do it consistently. I am assuming that you are making a news-like site.
Above, Levi mentioned article tags. I would recommend using them if you have multiple stories per page. The div tag is roughly the garbage can of the HTML world; you should only use it when nothing else is suitable. Article tags give your code better semantic value, and they also have another feature, called a role. Roles allow a person using a screen reader to jump around a page, as they can with heading tags.
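To make the h1/h2 situation concrete, here is a rough sketch of the same topic on two pages, keeping the article element in both places and changing only the heading level. The titles and URL are hypothetical.

```html
<!-- Listing page: several topics, each an article with an h2 title. -->
<h1>Latest topics</h1>
<article>
  <h2><a href="/topics/example-topic">Example topic</a></h2>
  <p>Short excerpt of the topic.</p>
</article>

<!-- Topic page: the same block, now the main subject, so its title is the h1. -->
<article>
  <h1>Example topic</h1>
  <p>Full text of the topic, followed by its replies.</p>
</article>
```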

html4 header tag position

In the XHTML source code of all my websites, navigation and breadcrumbs appear below the content of the page, yet visually they appear above it. I do this because I believe that search engines then find the content more relevant.
In all the HTML5 examples I've seen, the order is classical:
header, body section, footer.
From an SEO point of view, when working on an HTML5 page, is it better to use the classical tag order or the one I have used until now in XHTML?
Unfortunately, this is more or less outdated advice.
Both Google and Bing have for many years now had the ability to render the DOM of the page and determine the actual layout of the page regardless of how the code is structured.
The old theory behind this technique was that search engines would only index the first 100kb or so of a page and typically that could be taken up by templated boilerplate code in some instances. This isn't a restriction that really exists anymore and to be honest if your pages are reaching that kind of filesize you probably have other things that you want to consider.
I think it is better when the content with keywords appears earlier in the source code. For the general link structure, it doesn't matter where the main navigation links are placed.
But maybe search engines can weight structure differently when you use standard semantic ids like navigation, breadcrumbs, content and footer? In that case the position would not matter. Isn't the semantic side one of the big advantages of HTML5?

HTML5 for marking up functionality - what semantic tags should I use?

When it comes to writing blog markup, I absolutely understand the use of article and section tags. But my masthead sections have two widgets. One has a search engine embedded and the other is marketing copy leading to an FAQ page.
What would be the correct HTML5 markup in this case? How do I mark up widget functionality?
my masthead sections have two widgets. One has a search engine embedded...
A search engine embedded? Do you mean a search field, i.e. a text field into which you can type search terms? For that, you want <input type="search">.
...and the other is marketing copy leading to an FAQ page.
Does this really qualify as a “widget”? If it’s marketing copy “leading” to an FAQ page, that just sounds like a link to me, which has been semantically represented in HTML since version 1 with the <a> element.
HTML is pretty simple, you really don’t want to over-think it. You don’t need specific tags for everything people could possibly give a name to. (What exactly is a “widget”? Isn’t it just a section of the page?) For most things, <section> is fine.
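A minimal sketch of how the two masthead "widgets" might be marked up along these lines. The /search and /faq URLs and the field name q are hypothetical.

```html
<header>
  <!-- Widget 1: a search field is just a form with an input of type "search". -->
  <form role="search" action="/search" method="get">
    <label for="site-search">Search this site</label>
    <input type="search" id="site-search" name="q">
    <button type="submit">Search</button>
  </form>

  <!-- Widget 2: marketing copy that leads to the FAQ is a section containing a link. -->
  <section>
    <h2>Have questions?</h2>
    <p>Marketing copy goes here. <a href="/faq">Read the FAQ</a>.</p>
  </section>
</header>
```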
While HTML5 is a big improvement, there's one thing it doesn't fix: The subjectivity of what is considered proper semantics for every situation.
And, I doubt HTML will ever fix that.
If you're already using HTML5 containers for other, more obvious parts of the page, I wouldn't sweat these two elements much. You could put the marketing stuff in an aside. Search could be considered a form of nav. But... I don't think bad karma will come your way if you just stick them in a couple of divs, either. ;)

Is there a way to make search bots ignore certain text? [closed]

I have a blog (you can see it from my profile if you want). It is new, and so are Google's crawl results for it.
The results were alarming to me. Apparently the most common 2 words on my site are "rss" and "feed", because I use text for links like "Comments RSS", "Post Feed", etc. These 2 words will be present in every post, while other words will be more rare.
Is there a way to make these links disappear from Google's parsing? I don't want technical links getting indexed. I only want content, titles, descriptions to get indexed. I am looking for something other than replacing this text with images.
I found some old discussions on Google, back from 2007 (I think in 3 years many things could have changed, hopefully this too)
This question is not about robots.txt and how to make Google ignore pages. It is about making it ignore small parts of the page, or transforming the parts in such a way that it will be seen by humans and invisible to robots.
There is a simple way to tell Google not to index parts of your documents: use googleon and googleoff:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
In this example, the second paragraph will not be indexed by Google. Notice the “index” parameter, which may be set to any of the following:
index: content surrounded by “googleoff: index” will not be indexed by Google
anchor: anchor text for any links within a “googleoff: anchor” area will not be associated with the target page
snippet: content surrounded by “googleoff: snippet” will not be used to create snippets for search results
all: content surrounded by “googleoff: all” is treated with all of the above
source
Google excludes from search result snippets the content of HTML elements that have the data-nosnippet attribute:
<p>
This text can be included in a snippet
<span data-nosnippet>and this part would not be shown</span>.
</p>
Source: Special tags that Google understands - Inline directives
I work on a site with top-3 google ranking for thousands of school names in the US, and we do a lot of work to protect our SEO. There are 3 main things you could do (which are all probably a waste of time, keep reading):
1) Move the stuff you want to downplay to the bottom of your HTML and use CSS and/or JavaScript to place it where you want readers to see it. This won't hide it from crawlers, but they'll value it lower.
2) Replace those links with images (you say you don't want to do that, but don't explain why not).
3) Serve a different page to crawlers, with those links stripped. There's nothing black hat about this, as long as the content is fundamentally the same as what a browser sees. Search engines will ding you if you serve up a page that's significantly different from what users see, but if you stripped RSS links from the version of the page crawlers index, you would not have a problem.
That said, crawlers are smart, and you're not the only site filled with permalink and rss links. They care about context, and look for terms and phrases in your headings and body text. They know how to determine that your blog is about technology and not RSS. I highly doubt those links have any negative effect on your SEO. What problem are you actually trying to solve?
If you want to build SEO, figure out what value you provide to readers and write about that. Say interesting things that will lead others to link to your blog, and crawlers will understand that you're an information source that people value. Think more about what your readers see and understand, and less about what you think a crawler sees.
Firstly, think about the issue. If Google thinks "RSS" is the main keyword, that may suggest the rest of your content is a bit shallow and needs expanding. Perhaps this should be the focus of your attention. If the rest of your content is rich, I wouldn't worry about the issue, as a search engine should know what the page is about from the title and headings. Just make sure RSS etc. is not in a heading or in a bold or strong tag.
Secondly, as you rightly mention, you probably don't want to use images, as they are not accessible to screen readers without alt text, and if they have alt text or supporting text then you add the keyword back in. However, aria-live may help you get around this issue, but I'm not an expert on accessibility.
Options:
Use JavaScript to write that bit of content (maybe ajax it in after load). Search engines like Google can execute JavaScript, but I would guess they won't value any JS-written content very highly.
Re-word the content or remove duplicates of it, one prominent RSS feed link may be better than several smaller ones dotted around the page.
Use the CSS content property with the :before or :after pseudo-elements to add your content. I'm not sure whether bots index words in CSS content values or work out their value in relation to each page, but it seems unlikely. Putting words like RSS in the CSS basically says it's a style thing, not an HTML thing, so even if engines do index it they won't give it much, if any, value. For example, the CSS could be:
.add-text:after { content:'View my RSS feed'; }
Note the above will not work in older versions of IE, so you may need some IE version comments if you care about that.
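Put together with some markup, the idea might look like the sketch below. The add-text class comes from the CSS above; the /feed.xml URL is a hypothetical placeholder. Note that the link's visible text then exists only in CSS, so it disappears if styles fail to load.

```html
<style>
  /* The visible link text is generated by CSS rather than written in the HTML. */
  .add-text:after { content: 'View my RSS feed'; }
</style>

<!-- The anchor has no text of its own; what readers see comes from the rule above. -->
<a class="add-text" href="/feed.xml"></a>
```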
"googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own internal website).
They are not supported by Google's web search at all, so please refrain from using them. I also think that answer should not be marked as correct, as it might cause confusion.
Now, to get Google to exclude part of a page, you will need to place that content in a separate file, such as excluded.html, and use an iframe to display that content in the host page.
The iframe tag grabs content from another file and inserts it into the host page. I think there is no other available method so far.
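A minimal sketch of the iframe approach, using the excluded.html file name from above. Whether the framed content really stays out of the index is an assumption: it probably also requires keeping excluded.html itself away from crawlers (for example through robots.txt), which this snippet does not show.

```html
<!-- Host page: this content lives in the page itself and can be indexed. -->
<p>Main content that should be indexed.</p>

<!-- The links to downplay live in a separate file and are only displayed here. -->
<iframe src="excluded.html" title="Feed links"></iframe>
```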
The only control that you have over the indexing robots is the robots.txt file. See this documentation, linked by Google on their page explaining the usage of the file.
You can basically prohibit certain links and URLs, but not keywords.
Other than black-hat server-side methods, there is nothing you can do. You may want to look at why you have those words so often and remove some of them from the site.
It used to be that you could use JS to "hide" things from googlebot, but you can't now that it parses JS. ( http://www.webmasterworld.com/google/4159807.htm )
Google's crawlers are smart, but the people who program them are smarter. Humans see what is sensible on a page; they will spend time on a blog that has good, rare and unique content.
It is all about common sense: how people visit your blog and how much time they spend there. Google measures search results in the same way. Your page ranking also increases as daily visits increase and as the site content gets better and is updated every day.
This page has the word "Answer" repeated multiple times. That doesn't mean it will not get indexed. What matters is how useful it is to everyone.
I hope this gives you some idea.
You have to manually detect "Google Bot" from the request's user agent and serve it slightly different content than you normally serve to your users.