Microdata & Vocabulary - html

I'm working on forum software, and looking to use HTML5 and Microdata(new to microdata). I was considering adding vocabulary to the software itself, instead of linking to schema or data-vocabulary, or whatever.
Then again, I wondered about the impact this may have server performance, being hit by all those spiders crawling the vocabulary.
What are your thoughts on this?

You should only create a new vocabulary if there is no (popular) existing one that could be used.
Dublin Core, FOAF and SIOC are popular ones that almost every forum could use. However, I'm not sure if these can be used with microdata (I guess it should be possible, but I don't know microdata very well). But they work with RDFa, which is very similar to microdata and a W3C recommendation (like HTML). RDFa can be used in HTML5, too. If new to RDF, you'd probably want to use RDFa Lite first (it has prefixes for DC, FOAF and schema.org vocabularies pre-defined).
I wondered about the impact this may have server performance, being hit by all those spiders crawling the vocabulary.
I don't think there are many (if any at all) microdata crawlers that would try to visit the vocabulary URIs. In most cases they wouldn't find any content there they could make use of, because the vocabulary URIs work as identifier only most of the time. Even if such crawlers arise, you'd hardly notice any performance impact because they'd probably cache it anyway.

Related

Keeping both microdata and microformats

I read that I should prefer putting in html code Microdata instead of Microformats, and actually I integrated both in my website.
Is it a bad practice keeping both? Should I have to remove the Microformats code?
Thanks
No need to remove it.
Some consumers only look for Microformats, some consumers only look for Microdata, and those that look for both should¹ have no reason to get confused. The syntaxes don’t share any mechanism, so there is no potential for conflicts.
¹ In the early days, Google stated that authors shouldn’t use both because "it may confuse the parser", but they revoked this shortly after.

Practically speaking, why semantic markup?

Does Google really care if I use an <h5> as a <b> tag?
What are some real-world, practical reasons I should care about semantic markup?
A few examples
Many visually impaired people rely on speech browsers to read pages back to them. These programs cannot interpret pages very well unless they are clearly explained. In other words semantic code aids accessibility
Search engines need to understand what your content is about in order to rank you properly on search engines.
Semantic code tends to improve your placement on search engines, as it is easier for the "search engine spiders" to understand.
However, semantic code has other benefits too:
As you can see from the example above, semantic code is shorter and so downloads faster.
Semantic code makes site updates easier because you can apply design style to headings across an entire site instead of on a per page basis.
Semantic code is easier for people to understand too so if a new web designer picks up the code they can learn it much faster.
Because semantic code does not contain design elements it is possible to change the look and feel of your site without recoding all of the HTML.
Once again, because design is held separately from your content, semantic code allows anybody to add or edit pages without having to have a good eye for design.
You simply describe the content and the cascading style sheet defines what that content looks like.
Source: boagworld
Semantics and the Web
Semantics are the implied meaning of a subject, like a word or sentence. It aids how humans (and these days, machines) interpret subject matter. On the web, HTML serves both humans and machines, suggesting the purpose of the content enclosed within an HTML tag. Since the dawn of HTML, elements have been revised and adapted based on actual usage on the web, ideally so that authors can navigate markup with ease and create carefully structured documents, and so that machines can infer the context of the wonderful collection of data we humans can read.
Until — and perhaps even after — machines can understand language and all its nuances at the same level as a human, we need HTML to help machines understand what we mean. A computer doesn’t care if you had pizza for dinner. It likely just wants to know what on earth it should do with that information.
HTML semantics are a nuanced subject, widely debated and easily open to interpretation. Not everyone agrees on the same thing right away, and this is where problems arise.
Allow me to paint a picture:
You are busy creating a website.
You have a thought, “Oh, now I have to add an element.”
Then another thought, “I feel so guilty adding a div. Div-itis is terrible, I hear.”
Then, “I should use something else. The aside element might be appropriate.”
Three searches and five articles later, you’re fairly confident that aside is not semantically correct.
You decide on article, because at least it’s not a div.
You’ve wasted 40 minutes, with no tangible benefit to show for it.
— Divya Manian
This generated a storm of responses, both positive and negative. In Pursuing Semantic Value By Jeremy Keith argued that being semantically correct is not fruitless, and he even gave an example of how <section> can be used to adjust a document’s outline. He concludes:
But if you can get past the blustery tone and get to the kernel of the article, it’s a fairly straightforward message: don’t get too hung up on semantics to the detriment of other important facets of web development.
— Jeremy Keith
Naming Things
Of all the possible new element names in HTML5, the spec is pretty set on things like <nav> and <footer>. If you’ve used either of those as a class or id in your own markup, it’s no coincidence. Studies of the web from the likes of Google and Opera (amongst others) looked at which names people were using to hint at the purpose of a part of their HTML documents. The authors of the HTML5 spec recognised that developers needed more semantic elements and looked at what classes and IDs were already being used to convey such meaning.
Of course, it isn’t possible to use all of the names researched, and of the millions of words in the English language that could have been used, it’s better to focus on a small subset that meets the demands of the web. Yet some people feel that the spec isn’t yet doing so.
Source: html5doctor (This goes on for quite a while so I've only put a few examples here.)
Hope this helps!

Microdata vs RDFa

I have a quick question about RDFa and Microdata.
My current understanding is that RDFa is RDF implemented into HTML but is complicated for new developers like myself, Microdata seems really easy and quick to implement.
What are the other advantages and disadvantages around these two semantic formats ?
Differences between Microdata and RDFa
While there are many (technical, smaller) differences, here’s a selection of those I consider important (used my answer on Webmasters as a base).
Specifications
As W3C’s HTML WG found no volunteer to edit the Microdata specification, it is now merely a W3C Group Note (see history), which means that there are no plans for any further work on it.
So the Microdata section in WHATWG’s "HTML Living Standard" is the only place where Microdata may evolve. Depending on what gets changed, it may happen that their Microdata becomes incompatible to W3C’s HTML5.
Update: In 2017, work started again, with the aim to publish Microdata as W3C Recommendation.
RDFa is published as W3C Recommendation.
Applicability
Microdata can only be used in (X)HTML5 (resp. HTML as defined by the WHATWG).
RDFa can be used in various host languages, i.e. several (X)HTML variants and XML (thus also in SVG, MathML, Atom etc.).
And new host languages can be supported, as RDFa Core "is a specification for attributes to express structured data in any markup language".
Use of multiple vocabularies
In Microdata, it’s harder, and sometimes impossible, to use several vocabularies for the same content.
Thanks to its use of prefixes, RDFa allows to mix vocabularies.
Use of reverse properties
Microdata doesn’t provide a way to use reverse properties. You need this for vocabularies that don’t define inverse properties (e.g., they only define parent instead of parent & child). The popular Schema.org is such a vocabulary (with only a few older exceptions).
While the W3C Note Microdata to RDF defines the experimental itemprop-reverse, this attribute is not part of W3C’s nor WHATWG’s Microdata.
RDFa supports the use of reverse properties (with the rev attribute).
Semantic Web
By using Microdata, you are not directly playing part in the Semantic Web (and AFAIK Microdata doesn’t intend to), mostly because it’s not defined as RDF serialization (although there are ways to extract RDF from Microdata).
RDFa is an RDF serialization, and RDF is the foundation of W3C’s Semantic Web.
The specifications RDFa Core and HTML+RDFa may be more complex than HTML Microdata, but it’s not a "fair" comparison because they offer more features.
Similar to Microdata would be RDFa Lite (which "does work for most day-to-day needs"), and this spec, at least in my opinion, is way less complex than Microdata.
What to do?
If you want to support specific consumers (for example, a search engine and a browser add-on), you should check their documentation about supported syntaxes.
If you want to learn only one syntax and have no specific consumers in mind, (attention, subjective opinion!) go with RDFa. Why?
RDFa matured over the years and is a W3C Rec, while Microdata is a relatively new invention and not standardized by the W3C.
RDFa can be used in many languages, not only HTML5.
RDFa allows mixed use of vocabularies for the same content, and it natively supports the use of reverse properties.
Can’t decide? Use both.
Note that you can also use several syntaxes for the same content, so you could have Microdata and RDFa (and Microformats, and JSON-LD, and …) for maximum compatibility.
Here’s a simple Microdata snippet:
<p itemscope itemtype="http://schema.org/Person">
<span itemprop="name">John Doe</span> is his name.
</p>
Here’s the same snippet using RDFa (Lite):
<p typeof="schema:Person">
<span property="schema:name">John Doe</span> is his name.
</p>
And here both syntaxes are used together:
<p itemscope itemtype="http://schema.org/Person" typeof="schema:Person">
<span itemprop="name" property="schema:name">John Doe</span> is his name.
</p>
But it’s typically not necessary/recommended to go down this route.
The main advantage you get from any semantic format is the ability for consumers to reuse your data.
For example, search engines like Google are consumers that reuse your data to display Rich Snippets, such as this one:
In order to decide which format is best, you need to know which consumers you want to target. For example, Google says in their FAQ that they will only process microdata (though the testing tool does now work with RDFa, so it is possible that they accept RDFa).
Unless you know that your target consumer only accepts RDFa, you are probably best going with microdata. While many RDFa-consuming services (such as the semantic search engine Sindice) also accept microdata, microdata-consuming services are less likely to accept RDFa.
I'm not certain if unor's suggestion to use both Microdata and RDFa is a good idea. If you use Google's Structured Data Testing Tool (or other similar tools) on his example it shows duplicate data which seems to imply that the Google bot would pick up two people named John Doe on the webpage instead of one which was the original intention.
I'm assuming therefore that using one syntax for a given item is a better idea (you should still be able to mix syntaxes as long as they describe separate entities).
Though I would be happy to be proven wrong on this.
I would say it largely depends on the use case: For Scientific use cases, RDF is common and used in different aspects.
For enriching Websites, JSON-LD is now recommended, for example by Google.
A JavaScript notation embedded in a tag in the page head or
body. The markup is not interleaved with the user-visible text, which
makes nested data items easier to express, such as the Country of a
PostalAddress of a MusicVenue of an Event. Also, Google can read
JSON-LD data when it is dynamically injected into the page's contents,
such as by JavaScript code or embedded widgets in your content
management system.

Is the role attribute supported by search engines?

HTML5 introduced 22 new markup tags. W3C still recommends we stick to the old tags, because IE exists. I think adding JavaScript for this purpose is over the top. HTML5 also features the less known role, comparable with the ARIA role of XHTML 2. The great advantage of the markup tags is that search engines like Google know which is which. Do search engines also support these?
Hard to tell, but I would say Google understands them in the same way as it understands HTML5 tags. And even support for HTML5 will take some time, since new sites could take too much vantage from old HTML4 sites.
For example a <div role="main">...</div> would be like the main <article>...</article> of your page.
Please keep in mind that it is important to make your document at least ARIA valid. For example, use more than one role="main" in your site is invalid and could be considered as blackhat SEO.
Worth using role where appropriate. Remember that it improves the accessibility of your site.

Is semantic markup too open-ended? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I am taking a peek at Dive Into HTML5. It seems nice and interesting, but I am puzzled.
In the 1990s, at the time when Netscape was the browser and HTML was HTML2 or HTML3, there were a lot of tags: address, cite, code... Most of them are unused as of today, probably even obsolete.
HTML5 introduces tags to express "semantic meaning" to the tag itself. This is all fun and games, but I see something very strange in this approach. Technically, the semantics can be very open ended. HTML5 has tags for article, time, navigation bars, footer. Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to (I'm confident <rant> and <nsfw> would be very important tags): ? I thought XML was the strategy to assign semantics to stuff. Nothing forbids you to put an XML chunk under a XHTML div element, and assign a stylesheet to it so to style it properly, or to delegate to the proper viewer the handling of that namespace (for example, when handling RSS or SVG).
In conclusion, I don't understand the reason behind this extensions focused towards semantics, when it's clear that semantic is a very broad topic, which is guaranteed to require a potentially infinite amount of semantic tags. Since I am pretty sure there are clever people at W3C, I think I'm wrong, but I'd like to know why.
Why are tags for article, time, navigation bars, footer useful?
Because they facilitate parsing for text processing tools like Google.
It's nothing about semantics (at least in 'broad' meaning). Instead they just say: here is the body of page (most important text part) and there is the navigation bar full of links. With such an approach you can easily extract just what you need.
I too hate the way that W3C is going with their specs. There are many things that I don't like, and this "semantics" fad is one of them. (Others include taking forever to complete their specs and leaving too many important details for the browsers to implement as they choose)
Most of all I don't like it because it makes my work as a web developer more difficult. I often have to make a choice whether to make the webpage "semantically correct" or "visually/aesthetically pleasing". The latter wins of course, because that is what the users want, but as a result validations start failing and the whole thing gets quite non-semantic (tables for layout and other things).
Another issue at which I frown is that they have officialy declared that the "class" attribute is for semantics, but then they used it for visual presentation selectors in CSS.
Bottom line - DON'T MIX SEMANTICS AND VISUAL REPRESENTATION. If you use some mechanism for describing semantics (like tag names, attribute values, or what not else), then don't use it for funcional/visual purposes and vice versa.
If I would design HTML, I would simply add an attribute "semantic" which could (like the "class" attribute) be added to any tag. Then there would be a number of predefined values like all those headers/footers/articles/quotes/etc.
Tags would define functionality. Basically you could reduce HTML tags to just a handful, like "div", "table/tr/td", "a", "img", "form", "input" and "select". I probably missed a few but this is the bulk. Visual styling would be accomplished through CSS.
This way the three areas - semantics, visual representation, and functionality - would be completely independent and wouldn't clash in real life solutions.
Of course, I don't think W3C is interested in practical solutions...
There is already a lot of semantics in HTML markup in the forms of classes and IDs, of which there is a (near) infinite amount of possibilities of, And everyone has their own way of handling these semantics. One of the goals of HTML5 is to try to bring some structure to this. you will still be able to extend the semantics of tags with classes and ids. It will also most likely make things easier for search engines.
Look at it from the angle of trying to make statements either about the page, or about objects referenced from the page. If you see a <footer> tag, all you can say is "stuff in here is a footer" and pass it by. As such, adding custom tags is not as generic a solution as adding attributes and allowing people to use their own choice of URIs to specify predicates and optionally values - RDFa wins hands-down because you can express any triple-statement you like from RDF in a page, one way or another.
I just want to address one part of your question. You say:
In the nineties, at the time when
Netscape was the browser and html was
HTML2 or HTML3, there were a lot of
tags: address, cite, code... Most of
them are unused as of today, probably
even obsolete.
There are a great deal of tags to choose from in html, but the lack of usage does not imply that they are obsolete. In particular the header tags <h1>, etc, and <ul>, <ol> are used to join items into lists in a way I consider semantic. Many people may not use tags semantically, but the effort to create microformats is an ongoing continuation of the idea you consider an artifact of the 1990s. Efforts to make the semantic web be a winner keeps going, despite full-text search and link analysis (in the form of Google) being the winner as far as how to find and understand the web.
It would be great to see an updated version of Google's Web Stats which show "html as she is spoke." But you are right that many tags are underused.
Whether html5 will be successful is an open and interesting question, but the tags you describe as obsolete didn't go anywhere, they were there in HTML 4.01 and xhtml. HTML5 seems to be an effort to solidify what is useful in tags. In the end if html5 gets support in browsers and makes the job of web developers easier, it will succeed. xhtml2 failed because it roundly failed to gain adoption in browsers and did nothing to make the job of web page makers easier. The forces working on html5 seem keenly aware of the failure of xhtml2, and I think are avoiding having html5 suffer a similar fate.
"Why shouldn't it contain tags for post icon, author's place, name and surname, or whatever else you want to assign specific semantics to (I'm confident and would be very important tags): ?"
You use <dialog> to describe conversations or comments. Rant and NSFW are subjective terms therefore it makes sense not to use them.
From what I understand a bunch of experienced web developers did research and looked for what most websites have in common in html. They noticed that most websitse have id="header", id="footer", id="section" and id="nav" tags so they decided that we need HTML tags to replace those id's. So in other words, don't expect them to give you a HUGE amount of HTML vocabulary. Just keep it simple as possible as you can while addressing the MOST common needed HTML tags.
NAV tag is VERY important for providing accessibility as well. You want them to know where the navigation is rather than to force them to find whether links are for navigation or not.
I disagree with adding extra tags. If detailed vocabulary were actually import then there could be a different tag name for every word in the dictionary. Additional tags names are not helpful as they may communicate additional meaning to humans, but do nothing to facilitate machine parsing of the language. This is why I don't like the "semantic" tags for HTML5 as I believe this to be slippery slope to providing a vocabulary too complex while only providing a weak solution to a problem not fully addressed.
In my opinion markup language structure data as much as describe it in a tree diagram form. Through parsing of the structure and proper use of semantic conventions, such as RDFa, context can be leveraged to provide specific meaning to otherwise generic tag names. In such as case excessive vocabulary need not exist and structurally redundant tag names, such as footer and aside, could be eliminated. The final objective is to make content faster and more accurate to interpret by both humans and machines simultaneously while using as little code as possible to achieve that result. How that solution is lesser important, except to HTML5.
I thought XML was the strategy to assign semantics to stuff.
As far as I know, no it wasn’t. XML allows new languages to be defined which are all parsed in the same way, because they all use the XML syntax.
It doesn’t, of itself, provide any way to add meaning (“semantic” just means “meaningful”) to those languages. And until computers get artificial intelligence, they don’t actually understand meaning, so meaning is just what is agreed between human beings. HTML is the most commonly-used language with agreed meaning of its tags.
As HTML is so common, it’s helpful to add a few meaningful tags to it that are quite general in their application. The new HTML5 tags are aimed at that. The HTML5 spec’s authors could indeed carry on down this route, creating tags for every specific bit of meaning possible, but as they’re not robots, they probably won’t.
<section> is useful, and general enough to be meaningfully applicable in lots of documents. <author-last-name> isn’t. Distinguishing between the two is a judgment call, which is why humans, and not computers, write the spec.
For custom semantics that are too specific to be added to HTML as tags, HTML5 defines microdata.
I've been reading Andy Clark's book Transcending CSS (page 33).
...,it is now widely accepted that presentational names such as header, left, or red that describe an element's look or position are poor choices.
After reading these lines I asked myself: hey, aren't there elements in HTML5 spec such as header, footer?? Why is footer more semantic ? Andy in his book advocates to use site-info for the ID of the footer div and this makes more sense IMHO. Footer is a presentational name (describes the element's position).
In a word, AJAX. The new tags are meant to support what real-world developers are doing by replacing some of the <div class="sidebar-wrap"><div class="styling-hook"><div><ul class="nav"> type of divitis many websites suffer from. The only <div> left in the HTML5 is the styling hook.
The semantics that get promoted to tags from classes are those that developers have freely adopted en-masse as best practices, given an extended xhtml/css adoption period. Check out the WHATWG developer's edition of the spec's sections pagehere. The document itself is a pleasure, but I won't spoil it if you haven't seen it yet.
One of the less obvious reasons for some decisions made by the W3C is the importance of Webkit. If you look, you can see that they were better than some at taking the current work of the HTML5 Working Group and implementing ideas. They have historically been way out ahead in compliance (see here). The W3C placed a high priority on their (i.e. Android, iPhone, the Googlebot, Chrome, Safari, Dreamweaver, etc.,). Google, framework users, Wordpress/Moveable Type/Joomla! type users and others wanted self contained building blocks, so this is the style we get.
Facebook is modular. Responsive design's grids are modular. Wordpress is modular. Ajax works best with modular page structures. Widgets are modules. Plug-ins are modules. It would seem that we should be trying to figure out stuff like how to apply these tags to make it easier to hook the appropriate elements and activate them in our document/application/info-network hybrid Web 2.0.
In closing, HTML5 is meant to be written as xml (again, see the spec) in order to ensure that tools and machines making ajax requests for a portion of a document will get a well-formed useful response. How awesome in combination with things like media queries for devices like feed readers, braille printers, annotators, etc.,. I see a (near)future where anything with good semantic content is it's own newsfeed automagically! This only happens if developers adopt and write compliant documents.