I hope my question will not be too vague.
I am starting to dive into the awesomeness of microdata and Schema.org. What remains a bit of a mystery to me is the exact specification of the itemtype Blog.
Does it work as a general container for articles of all kind or is it appropriate for "regular" posts only?
To clarify, here is my example: I am building my online web design portfolio. I have two <section> elements: one for portfolio items, one for my regular blog (consisting of Twitter updates, videos and other microblogging formats). Should I mark both of them as "blogs" and their content as "articles", or would you recommend a completely different approach?
I've found quite a lot of discussion about the role of the itemtype Blog, but most of it concentrates on the usage of itemtypes in "regular blog situations".
https://webmasters.stackexchange.com/questions/46680/using-schema-for-blogging-article-vs-blogposting
What microdata should I use for a blog?
Blog Posts Optimized by Schema
Some commercial portfolio WP themes I was going through use the "blog" itemtype for portfolio items; some don't bother to mark up the list at all.
What do you suggest?
schema.org doesn’t define or explain the term blog (it only says: "A blog"). So in the end it’s up to you and your understanding of what constitutes a blog.
If your posts are http://schema.org/BlogPosting, you have a blog. If your posts are http://schema.org/Article, you don’t have a blog. Now the question is: When is a post a blog post?
A http://schema.org/BlogPosting is a more specific http://schema.org/Article. But they consist of absolutely the same properties, so again we have to base the decision on our understanding of the terms article and blog posting.
How to define blog or blog post? For me, content-wise, a blog is a (reverse chronological) collection of self-contained posts (… and so on). But opinions may differ.
So I’d propose a simple rule of thumb:
Imagine a specialized blog search engine, making use of http://schema.org/Blog and http://schema.org/BlogPosting. Would it be useful for the searchers if your posts are indexed there? If not, don’t use these types.
I agree with unor about the difference between Blog, BlogPosting and Article. Just my two cents, to be a bit more specific about your particular case.
For the blog section I'd use Blog and BlogPosting exactly as written by Eric here.
I don't think that Blog should be used for portfolio items. Instead I'd use more specific types from schema.org (e.g., http://schema.org/ImageObject). They can be wrapped up in some "container" type like http://schema.org/ImageGallery or http://schema.org/ItemList.
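As a minimal sketch of that suggestion (the headings, file names and placeholder text are made up for illustration, and this is only one possible reading of the vocabulary), the two sections could be marked up with microdata along these lines:

```html
<!-- Blog section: a Blog that contains BlogPosting items -->
<section itemscope itemtype="http://schema.org/Blog">
  <h2 itemprop="name">Blog</h2>
  <article itemprop="blogPost" itemscope itemtype="http://schema.org/BlogPosting">
    <h3 itemprop="headline">Example post title</h3>
    <div itemprop="articleBody">Example post text...</div>
  </article>
</section>

<!-- Portfolio section: an ItemList wrapping ImageObject items (hypothetical images) -->
<section itemscope itemtype="http://schema.org/ItemList">
  <h2 itemprop="name">Portfolio</h2>
  <figure itemprop="itemListElement" itemscope itemtype="http://schema.org/ImageObject">
    <img itemprop="contentUrl" src="project-one.png" alt="Screenshot of project one">
    <figcaption itemprop="name">Project one</figcaption>
  </figure>
</section>
```

http://schema.org/ImageGallery could replace ItemList as the container if the portfolio is purely a gallery of images.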
Hope this helps.
Teasers on the front page of a blog are surely not the right place to add itemtype="http://schema.org/BlogPosting", because each of them is not a full blog posting, just one or two paragraphs with a "Continue reading" link.
But since they are part of a blog, is there any blog-related Microdata for them or not?
A person is still a person, even if you don’t provide the name. A place is still a place, even if you don’t provide the address. A blog posting is still a blog posting, even if you don’t provide the full content.
So using BlogPosting for a teaser is perfectly fine. If you don’t show the full content, just omit the articleBody property.
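A teaser marked up this way might look like the following sketch. The URL and text are placeholders, and using description for the excerpt is an assumption about what fits best, not something schema.org requires:

```html
<!-- Front-page teaser: still a BlogPosting, just without articleBody -->
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h2 itemprop="headline">
    <a itemprop="url" href="/posts/example-post">Example post title</a>
  </h2>
  <p itemprop="description">The first paragraph or two of the post...</p>
  <a href="/posts/example-post">Continue reading</a>
</article>
```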
Reading an article on the <article> tag on HTML5, I really think my biggest confusion is in the first question of this section:
Using <article> gives more semantic meaning to the content. By contrast <section> is only a block of related content, and <div> is only a block of content... To decide which of these three elements is appropriate, choose the first suitable option:
Would the content make sense on its own in a feed reader? If so, use <article>.
Is the content related? If so, use <section>.
Finally, if there’s no semantic relationship, use <div>.
So I guess my question is really: What types of content belong in a feed reader?
The spec answers this quite clearly:
The article element represents a self-contained composition in a
document, page, application, or site and that is, in principle,
independently distributable or reusable, e.g. in syndication. This
could be a forum post, a magazine or newspaper article, a blog entry,
a user-submitted comment, an interactive widget or gadget, or any
other independent item of content.
see: http://dev.w3.org/html5/spec/Overview.html#the-article-element
The W3C spec leaves a lot open to interpretation and it ultimately comes down to the author's opinion. Here is a short and simple answer in the form of a question:
What are the primary significant pieces of content you want to share on the page?
Here are a few examples:
On this very page, each answer could be an article.
On Flickr each photo displayed in the photostream could be considered an article.
On Dribbble each shot displayed on the page could be an article.
On Google each search result listed could be an article.
On a blog each article.. well each article could be an article.
On a blog page with an article and a series of comments you could have two major sections. One with an article and another for comments in which each comment could be considered an article.
It's the author's discretion as to how far they want to go. Most blog authors have an RSS feed for their articles, but others may also provide feeds for comments, and shared links.
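The blog-page case from the examples above could be sketched like this (element contents are placeholders; this follows the two-section layout described here, which is one interpretation among several):

```html
<!-- Section 1: the post itself -->
<section>
  <article>
    <h1>Post title</h1>
    <p>Post body...</p>
  </article>
</section>

<!-- Section 2: comments, each one a self-contained article -->
<section>
  <h2>Comments</h2>
  <article>
    <p>First comment...</p>
  </article>
  <article>
    <p>Second comment...</p>
  </article>
</section>
```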
A lot of people have written on this subject. For further information I highly recommend reading:
http://html5doctor.com/the-article-element/ (you've already shared this)
http://www.impressivewebs.com/html5-section/
http://www.iandevlin.com/blog/2011/04/html5/html5-section-or-article
You've brought up a good argument and yes the spec does rather clearly define <article> as a syndication-worthy collection of content. The way I see it, your article would be the composed blog post – what you as the content writer of the site produce. While comments on that section are related to the article, they are not, in fact, part of the article, and should be relegated to another block in the <section>, either a non-semantic <div> or simply <p>s with display:block set. This is a decision that's left to the designer, depending on how they semantically evaluate the worth of the commentary.
Remember too that you have the <aside> tag, which is almost tailor-made for commentary, whether from the author or from the reader.
Most feed readers can handle many types of content: copy, images, videos, etc. The feed for your site will include whatever content on your site is repeated or comes in multiple versions. A question and answer site will have a feed of new questions. A video sharing site will have a feed of new videos. A software review site will have a feed of new software or new reviews.
I'd recommend considering what the typical consumer of your content would want to find easily in their feed reader. You get to define what types of content belong in a feed reader.
A feed reader, in general, should contain a list of stories. Look at http://google.com/elections/ - it's a good example of the sort of thing a feed reader might contain. The important part is that all the stories are self-contained, and in theory do not need to be related at all.
The markup for that document could look like the following:
<body>
    <header>...</header>
    <nav>...</nav>
    <article>
        <section>
            ...
        </section>
    </article>
    <aside>...</aside>
    <footer>...</footer>
</body>
You may find more information in this article on A List Apart.
The new HTML5 article tag all seems very great and wonderful and there has been much discussion here and elsewhere about its uses.
Unfortunately, all this discussion seems to be in the context of blog or news sites where the content is all just that, content.
In an ecommerce site, the biggest question to be asking is, how do I now mark up a product?
Taking the spec for guidance, it seems that a saleable item is indeed something distinct that could be syndicated (and often is). The article tag seems like a good match, yet I see no mention of its use in this way.
Is it appropriate here, and are all the examples blogs etc. only because they fit a bit more intuitively with the name of the tag? Or am I stretching the spec too hard?
Any guidance would be much appreciated.
I don't think <article> is suitable for product data. Although not using semantic elements, you may wish to look at the Product schema from schema.org.
EDIT :
See the following quote from the W3C spec. Perhaps article is suited after all, as a product can be considered an "independent item of content."
The article element represents a component of a page that consists of
a self-contained composition in a document, page, application, or site
and that is intended to be independently distributable or reusable,
e.g. in syndication. This could be a forum post, a magazine or
newspaper article, a blog entry, a user-submitted comment, an
interactive widget or gadget, or any other independent item of
content.
You should take a look at this article
Looks like <article> is not that bad an idea. I am using pointers from here and http://schema.org/Product for an e-commerce site project.
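Combining the two might look roughly like this. It is a sketch only: the product name, image, price and currency are invented, and the set of properties shown is an assumption about what a typical listing needs:

```html
<!-- A product as a self-contained <article>, described with schema.org/Product -->
<article itemscope itemtype="http://schema.org/Product">
  <h2 itemprop="name">Example widget</h2>
  <img itemprop="image" src="widget.jpg" alt="Example widget, front view">
  <p itemprop="description">A short description of the widget.</p>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <meta itemprop="priceCurrency" content="USD">
    $<span itemprop="price">19.99</span>
    <link itemprop="availability" href="http://schema.org/InStock">
  </div>
</article>
```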
Having custom tags bothers IE a lot, and we cannot ignore Internet Explorer yet.
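Assuming the problem meant here is that IE 8 and older do not apply CSS to elements they do not recognize, a commonly used workaround is to create each new element once via script before it is used. The list of element names below is just an example:

```html
<!--[if lt IE 9]>
<script>
  // Creating each element once lets old IE style it with CSS
  var names = ['article', 'section', 'aside', 'header', 'footer', 'nav'];
  for (var i = 0; i < names.length; i++) {
    document.createElement(names[i]);
  }
</script>
<![endif]-->
<style>
  /* Unknown elements default to inline, so make them block-level */
  article, section, aside, header, footer, nav { display: block; }
</style>
```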
I know that Google’s search algorithm is mainly based on pagerank. However, it also does analysis and uses the structure of the document H1, H2, title and other HTML tags to enhance the search results.
What is the name of this technique "using the document structure to enhance the search results"?
And are there any academic papers to help me study this area?
The fact that Google is taking the HTML structure into account is well covered in SEO articles however I could not find it in the academic papers.
I think it's called "Semantic Markup"
[...] semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information. http://www.digital-web.com/articles/writing_semantic_markup/
A more practical article here
http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/
SEO has become almost a religion to some people where they obsess about minutiae. Frankly, I'm not convinced that all this effort is justified.
My advice? Ignore what so-called pundits say and just follow Google's guidelines.
You might be looking for an academic answer but honestly, this isn't an academic question beyond the very basics of how Web indexing works. The reality of a modern page indexing and ranking algorithm is far more complex.
You may want to look at one of the earlier works on search engines. Note the authors' names. You may also want to read Google Patent application 20050071741.
These general principles aside, Google's search algorithm is constantly tweaked based on actual and desired results. The exact workings are a closely guarded secret just to make it harder for people to game the system. Much of the "advice" or descriptions on how Google's search algorithm works is pure supposition.
So, apart from having a title and having well-formed and valid HTML, I don't think you're going to find what you're looking for.
Google very deliberately doesn't give away too much information about its search algorithm, so it's unlikely you will find a definitive answer or academic paper that confirms this. If you're interested from an SEO point of view, just write your pages so they are good for humans and the robots will like them too.
To make a page good for humans, you SHOULD use tags such as h1, h2 and so on to create a hierarchical page outline, a bit like this:
h1 "Contact Us"
...h2 "Contact Details"
......h3 "Telephone Numbers"
......h3 "Email Addresses"
...h2 "How To Find Us"
......h3 "By Car"
......h3 "By Train"
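Written as actual markup, the outline above could look like this sketch (the paragraph text is a placeholder, and the indentation is only there to make the hierarchy visible):

```html
<h1>Contact Us</h1>
  <h2>Contact Details</h2>
    <h3>Telephone Numbers</h3>
    <p>...</p>
    <h3>Email Addresses</h3>
    <p>...</p>
  <h2>How To Find Us</h2>
    <h3>By Car</h3>
    <p>...</p>
    <h3>By Train</h3>
    <p>...</p>
```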
The difficulty with your question is that if you put something in your h1 tag hoping that it would increase your position in Google, but it didn't match up with other content on your page, you could look like you are spamming. Similarly, if your page is made up of too many headings and not enough actual content, you could look like you are spamming. It's not as simple as add a h1 and h2 tag and you'll go up! That's why you need to write websites for humans, not robots.
I have found this paper:
A New Study on Using HTML Structures to Improve Retrieval
However, it is an old paper (1999), and I am
still looking for more recent papers.
Check out
http://jcmc.indiana.edu/vol12/issue3/pan.html
http://www.springerlink.com/content/l22811484243r261/
Some time spent on scholar.google.com might help you find what you are looking for
You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.
It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)
As cletus said, follow Google's guidelines.
I did a few tests and came to the conclusion that the title, image alt text and h tags are the most important. Also worth mentioning is Google AdSense: I had the feeling that if you implement it, the rank of your site increases.
I believe what you are interested in is called structural fingerprinting, which is often used to determine the similarity of two structures. In Google's case, this would mean assigning a weight to different tags and feeding them into a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply rooted in information theory. If you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
I would also suggest looking at Microformats and RDF. Both are used to enhance searching. These are mostly search-engine agnostic, but there are some engine-specific things as well. For Google-specific guidelines for HTML content, read this link.
In short: very carefully. In long:
Quote from The Anatomy of a Large-Scale Hypertextual Web Search Engine:
[...] This gives us some limited
phrase searching as long as there are
not that many anchors for a particular
word. We expect to update the way that
anchor hits are stored to allow for
greater resolution in the position and
docIDhash fields. We use font size
relative to the rest of the document
because when searching, you do not
want to rank otherwise identical
documents differently just because one
of the documents is in a larger
font. [...]
It goes on:
[...] Another big difference between
the web and traditional well controlled collections is that there
is virtually no control over what
people can put on the web. Couple
this flexibility to publish anything
with the enormous influence of search
engines to route traffic and companies
which deliberately manipulating search
engines for profit become a serious
problem. This problem that has not
been addressed in traditional closed
information retrieval systems. Also,
it is interesting to note that
metadata efforts have largely failed
with web search engines, because any
text on the page which is not directly
represented to the user is abused to
manipulate search engines. [...]
The Challenges in a web search engine addresses these issues in a more modern fashion:
[...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]
And adds:
[...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the <H1> tag means that any text in an H1 context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.
There are a number of issues a modern search engine needs to combat, for example web spam and blackhat SEO schemes.
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess because no two sites have identical structures. There are maps, games, video, photos (Flickr) and lots and lots of user-generated content. In other words, the web is still very unpredictable.
Resources
Hypertext and the web:
Extracting knowledge from the World Wide Web
Rich media and web 2.0
Thresher: automating the unwrapping of semantic content from the World Wide Web
Information retrieval
Webspam papers
Combating webspam with trustrank
Webspam taxonomy
Detecting spam web pages through content analysis
To keep it painfully simple: make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember that the search engine algorithms were written by people trying to interpret language.
The Basic Process is:
Write well structured HTML - using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information. Lists for lists, headers for major topics.
Supply relevant alt text and names for any visual elements, and then use simple CSS to arrange these elements.
If the site works well for users and contains relevant information, you don't risk becoming a black listed spammer, and search engine algorithms will favor your page.
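The process above could be sketched as follows. The topic, file name, class name and text are all invented for illustration:

```html
<!-- Headings mark the major topics, lists are real lists, images carry alt text -->
<h1>Hiking boot buying guide</h1>
<p>A short introduction to the guide.</p>

<h2>What to look for</h2>
<ul>
  <li>Ankle support</li>
  <li>Waterproofing</li>
  <li>Sole grip</li>
</ul>

<h2>Our pick</h2>
<img class="product-photo" src="boot.jpg" alt="Brown leather hiking boot, side view">

<style>
  /* Presentation stays in simple CSS, separate from the structure */
  .product-photo { float: right; margin: 0 0 1em 1em; }
</style>
```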
I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.
I suggest trying Google Scholar as one of your avenues when looking for academic articles:
semantic search
I found it interesting that, with no meta keywords or description provided, in a scenario like this:
<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>
the "text for section one" is always what is shown on the search result page.
Google also supports a newer link relation called canonical (rel="canonical" on a <link> element), which can be used as well.
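For reference, the canonical hint is a <link> element in the page head. The URL below is a placeholder:

```html
<head>
  <title>Example widget</title>
  <!-- Tells search engines which URL is the preferred version of this page -->
  <link rel="canonical" href="https://www.example.com/products/widget">
</head>
```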
One of the sites I develop has lots of information linked between each other; we have companies, we have products for those companies. The company page links to the page listing the products for that company, and vice versa.
From the HTML spec:
CITE:
Contains a citation or a reference to other sources.
Does this imply that I could (semantically) use a <cite> for a company link? What about on the company page to a product?
If not, could someone tell me what might be the "correct" semantic tag for this?
If you're just linking to other pages then semantically you should just use <a href=...>. If you're quoting a small piece of information, like the information from the HTML spec in your question, and providing a link to the original source, you might use <cite>. Think of it as a citation in a book or research paper.
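A small sketch of the difference, with invented company names and paths. The specification URL is my assumption about where the quoted definition comes from:

```html
<!-- Plain navigation between related pages: just links -->
<p>Made by <a href="/companies/acme">Acme Ltd</a></p>
<p><a href="/companies/acme/products">All products by Acme Ltd</a></p>

<!-- Quoting a source and naming it: this is where <cite> fits -->
<blockquote cite="https://www.w3.org/TR/html401/struct/text.html">
  <p>Contains a citation or a reference to other sources.</p>
</blockquote>
<p>Source: <cite>HTML 4.01 Specification</cite></p>
```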
I'm not sure that cite is intended to mark up links - you may be looking at something akin to a more professional (less inter-personal) XFN using the rel attribute of the link.
Cite is more for marking up titles of articles or other created work.
XFN is specifically for marking up the relationship you (or your company) have with the person or company you are linking to. What I'm not sure of is which XFN values there are (if any) for company links.
http://reference.sitepoint.com/html/xfn
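For illustration, XFN values go in the rel attribute of a link. The values I know of (friend, colleague, co-worker, met and so on) describe relationships between people, so the example below is a personal link with a placeholder URL rather than a company one:

```html
<!-- XFN: relationship values are space-separated in rel -->
<a href="https://example.com/jane" rel="colleague met">Jane Doe</a>
```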
What you might consider is how much of this detail will actually be used. Semantic markup, although a noble direction to head in, is not yet utilised to its full extent when a resource is looked at (by a human) or parsed (by a program).