HTML5: how to mark up proper names and common names?

<p>...the favourite color of Purple is purple...</p>
The first "Purple" is the name of a company, the second one is a color name.
How should I mark this up according to the HTML5 spec?
Thank you in advance.

You have a number of options:
Leave it as is. HTML isn't really concerned with semantics that aren't about describing document structure (paragraphs, headings, lists etc.). If you do want to express more detailed document or application semantics, look at WAI-ARIA.
If it's important for you to distinguish between the two uses of the word purple as part of your website or app, then use the class attribute or data-* attributes.
If the words have canonical machine readable forms and you want the values to be parsed by a computer somehow, use the data element.
If distinguishing between the two uses is important to users or systems consuming your site content, use the semantic extensibility feature of HTML5: Microdata. (See also RDFa, which started out in the XML dialect of HTML but can be used in HTML5 documents as well.)
Combine any of the above approaches according to your immediate needs.
To decide between the approaches you should ask yourself:
For what purpose do I need to extend the semantic vocabulary of HTML?
Is it for my own uses, or am I trying to publish information to be used by others?
If I'm publishing for others, what shared vocabulary am I going to use?
Code examples:
Class attributes
Class attributes are for adding additional information to your markup. Remember that the class attribute is defined in the HTML spec, not the CSS spec:
<p>...the favourite color of <span class="company">Purple</span>
is <span class="color">purple</span>...</p>
Having said that, the obvious thing to do once you have things marked up in this way is to provide in-page tools for things like 'highlight all companies'. People have also used the class attribute as the basis for a general-purpose semantic extension mechanism; for this approach taken to the extreme, see microformats.
Data attributes
The data-* attributes are to allow you to add custom attributes to your markup for processing with scripts in a way which guarantees you won't accidentally use a custom attribute which then gets used in a future version of HTML:
<p>...the favourite color of <span data-typeofthing="company">Purple</span>
is <span data-typeofthing="color">purple</span>...</p>
It's up to scripts on your page to do something useful with the data-* attributes; browsers and other web clients will ignore them.
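As a rough sketch of what such a script might look like, the page below toggles a highlight on every element marked as a company. The button, its id, the highlight class name and the inline script are all assumptions made for illustration; only the data-typeofthing attribute comes from the example above.
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>data-* attribute sketch</title>
  <style>
    /* Hypothetical class name, used only for this demo */
    .highlight { background: yellow; }
  </style>
</head>
<body>
  <p>...the favourite color of <span data-typeofthing="company">Purple</span>
  is <span data-typeofthing="color">purple</span>...</p>
  <button type="button" id="toggle-companies">Highlight all companies</button>
  <script>
    // Select elements by their custom attribute value and toggle a class on them.
    document.getElementById('toggle-companies').addEventListener('click', function () {
      document.querySelectorAll('[data-typeofthing="company"]').forEach(function (el) {
        el.classList.toggle('highlight');
      });
    });
  </script>
</body>
</html>
```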
Data elements
Data elements are for things that have an imprecise natural language expression but also a precise machine readable expression. Assuming that the company can be uniquely identified by a ticker symbol and RGB will do for the colour:
<p>...the favourite color of <data value="purp">Purple</data>
is <data value="rgb(128,0,128)">purple</data>...</p>
Browsers probably won't do anything special with the data element. It's most likely you'll use data elements in concert with microformats, RDFa or Microdata.
Microdata
Using the Organization schema:
<p>...the favourite color of
<span itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Purple</span>
</span>
is purple...</p>
There isn't anything for colours that I'm aware of, but you could always publish your own schema for that if it's important to you. This approach only really benefits anyone if there is a shared vocabulary of some kind.

Which element?
The first task would be to decide which element should be used to enclose the "entities" (company name and color name). Most probably you want to use span here. If in doubt, use span. There are some cases (depends on content) where other elements could be used:
the b element might be used if the entity is some kind of keyword ("a span of text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood")
if the entity is the title of a work (book, film, song etc.), the cite element should be used.
the dfn element might be used if the entities are defined in the same paragraph or in the nearest ancestor sectioning element
in some cases the i element could be appropriate
If the term is an abbreviation/acronym, use the abbr element (instead of span, or in addition to b/dfn/i).
A good idea might be to use the a element to link the entity name to a relevant webpage. The rel attribute might give additional metadata (you can use the rel values listed in the HTML5 spec, or the registered rel values in the microformats wiki), depending on the content/context.
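To make the element choices above more concrete, here is a small sketch. The sentences, the work title and the URL are placeholders chosen for illustration, not part of the original question.
```html
<!-- Default choice: a plain span -->
<p>The favourite color of <span>Purple</span> is purple.</p>

<!-- Title of a work: cite -->
<p>She reviewed <cite>The Color Purple</cite> last week.</p>

<!-- Defining instance of a term that is also an abbreviation: abbr inside dfn -->
<p><dfn><abbr title="Portable Network Graphics">PNG</abbr></dfn> is a raster image format.</p>

<!-- Linking the entity name to a relevant page (placeholder URL) -->
<p>The favourite color of <a href="https://purple.example/">Purple</a> is purple.</p>
```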
Which attributes?
Have a look at the global attributes.
The class attribute would be used if you'd like to use microformats. You could of course use other class names, but they would only be useful for yourself (documentation, CSS, JS) or other people that read your markup (documentation, scraping).
For entity names like person/company names you probably want to use the translate attribute with the no keyword, because such names should not be translated.
The title attribute might give additional information (note that it has special semantics for dfn/abbr), but don't rely on it for important information.
Use lang if the entity names are in a foreign language.
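A minimal sketch of the translate and lang attributes in use; the second sentence and its company name are assumptions added only to show lang on a foreign-language name.
```html
<!-- translate="no": the company name should be left alone by translation tools -->
<p>The favourite color of <span translate="no">Purple</span> is purple.</p>

<!-- lang: the entity name is in a different language than the surrounding text -->
<p>He works for <span lang="de" translate="no">Deutsche Bahn</span>.</p>
```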
How to annotate the content with meaning?
There are three popular choices, which can also be used together (see this answer (the "third step") for a short summary of the differences):
microformats
RDFa
microdata
You may need additional elements; if so, use span.
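For comparison, the company name could be annotated in each of the three syntaxes roughly as follows. This is a sketch: it assumes the Schema.org Organization type for Microdata and RDFa and the microformats2 h-card for the third variant, and it leaves the color unannotated.
```html
<!-- Microdata -->
<p>The favourite color of
  <span itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">Purple</span>
  </span> is purple.</p>

<!-- RDFa (Lite) -->
<p>The favourite color of
  <span vocab="http://schema.org/" typeof="Organization">
    <span property="name">Purple</span>
  </span> is purple.</p>

<!-- microformats2 h-card; name and organization are the same string -->
<p>The favourite color of
  <span class="h-card"><span class="p-name p-org">Purple</span></span>
  is purple.</p>
```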

Related

Should aria attributes also be translated if the page is using translations?

What's the best practice regarding aria attribute translations? If a page is available in several languages, should aria attributes be too?
I am using react-i18next useTranslation hook to translate a site to a couple different languages, and I was wondering if I should also use the hook to translate my aria attributes.
I would limit the question a bit because not all ARIA attributes should be translated. In fact, most of them should not. For example, you don't want "true"/"false" (aria-hidden) translated to another language. You don't want "polite"/"assertive" (aria-live) translated to another language.
So you're really only talking about aria-label and aria-roledescription (the latter is not used very much). In ARIA 1.3, there will also be aria-description so it would be translated too.
But yes, if you translate the whole page (and set the lang attribute appropriately), you should also translate the aria attributes associated with string labels.
That's why I like using aria-labelledby instead of aria-label: the element referenced by the ID will be translated along with the rest of the page (assuming it's a regular text HTML element), so your label gets translated for free.
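A small sketch of the difference; the ids, labels and form fields are made up for illustration.
```html
<!-- aria-label: the string lives in an attribute and has to be translated explicitly -->
<button type="button" aria-label="Close dialog">×</button>

<!-- aria-labelledby: the accessible name comes from ordinary page text,
     which is translated together with the rest of the content -->
<h2 id="newsletter-heading">Subscribe to our newsletter</h2>
<form aria-labelledby="newsletter-heading">
  <label>Email <input type="email" name="email"></label>
  <button type="submit">Subscribe</button>
</form>
```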
Like all text that is potentially presented to the user, of course they have to be translated.
All ARIA attributes containing plain text, such as aria-label, have to be translated, with the same care as if they were normally displayed on screen.
This is also true for alt attribute of images, sr_only text, etc.
As a screen reader user who speaks several languages and lives in a country with several official languages, and who therefore very often faces multilingual websites, I can confirm that it's a common and annoying oversight.
Of course, everything that is purely technical, such as IDs, true/false values, etc., doesn't need to be translated, as no user will ever see it.
It’s a common misconception that all UI text is visible or in text nodes. Often there is plenty of text that will only be seen under certain conditions (states), or that will be read by machines, including search engines and assistive technology, before being presented to the user.
ARIA attributes that include human readable text are part of it and need to be translated.
I read articles about always trying to put text in a text node, for example by preferring aria-labelledby over aria-label. In my personal opinion, there are too many use cases where this is simply not possible. So we need a better approach anyway.
In HTML, examples of attributes that need localization are:
Head Metadata including OpenGraph for sharing like <meta property="og:description">
alt attributes and title, also for <iframe> or <style> elements
label for video tracks
placeholder and pattern attributes if the format needs localization
value of inputs if not user-defined
Human-readable ARIA attributes aria-description, aria-label, aria-placeholder, aria-roledescription, aria-valuetext, aria-braillelabel
And also text
visually hidden by CSS, often via .sr-only or .visually-hidden classes
added by CSS’s pseudo-elements like ::before
Text parts with the lang attribute should not get translated and keep that attribute.
Relatedly, there is the translate attribute …
[…] that is used to specify whether an element's translatable attribute values and its Text node children should be translated when the page is localized, or whether to leave them unchanged.
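As an illustration of the list above, the following sketch marks which strings in a small page would need localization and which values should stay untouched. The page content, file name and class name are assumptions made for the example.
```html
<!DOCTYPE html>
<html lang="en"> <!-- update lang for each localized version -->
<head>
  <meta charset="utf-8">
  <title>Order status</title> <!-- translate -->
  <meta property="og:description" content="Track your order"> <!-- translate content -->
</head>
<body>
  <img src="logo.svg" alt="Purple company logo"> <!-- translate alt -->
  <!-- translate placeholder and aria-label -->
  <input type="search" placeholder="Search orders" aria-label="Search orders">
  <!-- role and aria-live hold keywords: do not translate them -->
  <div role="status" aria-live="polite">Your order has shipped.</div>
  <!-- visually hidden text is still read out: translate it -->
  <span class="visually-hidden">Opens in a new window</span>
  <!-- translate="no": the company name stays as it is -->
  <p>Made by <span translate="no">Purple</span></p>
</body>
</html>
```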

What do the HTML attributes data-number-to-fixed and data-number-stepfactor mean? [duplicate]

I've been seeing these attributes around on more modern websites like GitHub and such, and they always seemed to coincide with a customized popover like the title attribute.
I read some documents about data- attributes on HTML5 Doctor, and I'm not quite sure of the point.
Is there some SEO or accessibility benefit to using them? And what is the plugin (hopefully jQuery) commonly used to create the popovers in this specific case?
Simply, the specification for custom data attributes states that any attribute that starts with “data-” will be treated as a storage area for private data (private in the sense that the end user can’t see it – it doesn’t affect layout or presentation).
This allows you to write valid HTML markup (passing an HTML 5 validator) while, simultaneously, embedding data within your page. A quick example:
<li class="user" data-name="John Resig" data-city="Boston"
data-lang="js" data-food="Bacon">
<b>John says:</b> <span>Hello, how are you?</span>
</li>
Source: ejohn.org
The HTML5 data-* attributes are used for storing data on an element.
To work with these attributes from a script you can use the jQuery.data() or .data() methods. (Note that .data() reads the initial data-* values; changes made through it are kept in jQuery's internal store and are not written back to the attributes.)
The main point is that data- attributes will not clash with attributes that may be added to HTML later or with browser-specific attributes. The idea is to give authors a playground, a namespace where they can use attributes for private purposes without fear of having them ever interpreted as standard or vendor-defined attributes with some different meaning.
According to this idea, search engines and assistive software ignore such attributes, as they have no interoperable meaning.
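Without jQuery, the same attributes can be read through the standard dataset property. A minimal sketch, reusing the list item from the example above (the surrounding ul and the inline script are added for illustration):
```html
<ul>
  <li class="user" data-name="John Resig" data-city="Boston"
      data-lang="js" data-food="Bacon">
    <b>John says:</b> <span>Hello, how are you?</span>
  </li>
</ul>
<script>
  // dataset exposes each data-* attribute as a property (data-city -> dataset.city)
  var user = document.querySelector('li.user');
  console.log(user.dataset.name); // "John Resig"
  console.log(user.dataset.city); // "Boston"
  // getAttribute works as well and uses the full attribute name
  console.log(user.getAttribute('data-food')); // "Bacon"
</script>
```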

In semantic HTML does the class attribute mean anything in the absence of CSS or Javascript?

For example, does the class film_review mean anything in <article class="film_review"> (example from MDN) if there's no CSS or Javascript interacting with the page, or does it provide semantic information?
It doesn't provide any information that contemporary browsers would interpret or use without CSS or Javascript per se.
However, it can carry semantic information; see e.g. microformats. For example, you could put an hCard
<div id="hcard-John-Doe" class="vcard">
<span class="fn">John Doe</span>
<div class="org">Cool Institute, Inc.</div>
<div class="adr"><span class="locality">Prague</span></div>
</div>
on your page and it carries semantic information. A search engine like Google could infer that "John Doe" is the name of a person located in "Prague". There are other microformats that can represent geo information, calendar events, etc.
Anyone can write their own processor of HTML documents that would interpret class attribute values, so the answer is yes, it provides semantic information.
Quoting from hcard microformat example:
Per the HTML4.01 specification, authors should be using the <address> element to indicate the "contact information for a document or a major part of a document." E.g.
<address>
Tantek Çelik</address>
By adding hCard to such existing semantic XHTML, you can explicitly indicate the name of the person, their URL, etc.:
<address class="vcard">
<a class="fn url" href="http://tantek.com/">Tantek Çelik</a>
</address>
It provides semantics purely in the sense that it semantically connects that element with other elements of the same class.
There's no rule which states that anything (specifically CSS and/or JavaScript in this case) must use that class. The class itself is simply part of the markup and is coincidentally being ignored by the current styling rules.
You might have other elements with the film_review class, and they are "semantically" connected in the sense that they represent "film reviews" in the markup. That's really all semantic information is... context about the thing being represented in the code. Well-named classes can provide such additional context.
But there's nothing special that the browser is going to do with this information. It's just there in case anybody (styling, code, or even just somebody looking at the markup) wants to know that this article belongs to a named class of elements.
Semantics in HTML5 are more oriented toward standardizing the most used elements around the web. As described in HTML Semantic Elements:
With HTML4, developers used their own favorite attribute names to style page elements:
header, top, bottom, footer, menu, navigation, main, container, content, article, sidebar, topnav, ...
This made it impossible for search engines to identify the correct web page content.
With HTML5 elements like: <header> <footer> <nav> <section> <article>, this will become easier.
So something as specific as a "Film Review" would not provide that much semantic information at the HTML5 level.
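A sketch of how the two layers fit together: the HTML5 elements carry the standardized structure, while the class only adds author-defined context. The headings and link targets are placeholders.
```html
<body>
  <header>
    <h1>Film reviews</h1>
    <nav>
      <a href="/">Home</a>
      <a href="/reviews/">Reviews</a>
    </nav>
  </header>
  <main>
    <!-- article is the standardized semantics; film_review is author-defined -->
    <article class="film_review">
      <h2>Title of the reviewed film</h2>
      <p>...</p>
    </article>
  </main>
  <footer>...</footer>
</body>
```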
That depends. Who and what else is processing your HTML?
For example, microformats sometimes use classes to add semantic information to elements which don't naturally possess rich semantics. In that case, neither ECMAScript nor CSS process that information, but a microformats parser might. film_review doesn't belong to any well-known microformat, however.
Everything on the page gets parsed (read) by a search engine, so the answer is YES, it does provide semantic information. However, different HTML tokens (elements, attribute names, attribute values) carry different weights.
How much weight an HTML token gets depends on the type of document you declare (HTML4 or HTML5). The <!DOCTYPE> declaration at the top of your page tells the search-engine bot/parser what type of document it is, which in turn controls the bot's parsing schema (behavior) when reading your document.
The entire purpose of HTML5 was to provide "semantics", allowing you to use different tags to mark up and define your document, giving content more importance and allowing search engines to understand it better. This gives the search engine a much better way to supply the end user who is searching for something with more relevant content for their search term. If you're not using HTML5 and are using HTML4 instead, the bots rely mostly on HTML attributes to define the content within tags such as a <div>, which provides no semantic meaning for the content inside it.

Why use Schema.org microdata to mark up web page elements?

I understand why and how to use Schema.org to add microdata to your site, this is not a question about that. The question is why Schema.org has support for certain things that can be marked up with simple HTML5. Among these are
Types
WebPage and WebSite
I can see why WebPage and WebSite would be needed, for example, to reference the page/site of a certain organization in a link, but there's no need to mark up your own page with this—the <html> tag does this.
SiteNavigationElement
Why not just use <nav>?
Table
Just use <table>.
Properties
WebPage/mainContentOfPage
<main> element
WebPage/relatedLink
<link> element inside <head>
This answer is primarily about the WebPageElement types (like SiteNavigationElement).
For WebPage, see my answer to the question Implicity of web page structure in Schema.org (tl;dr: it can be useful to provide WebPage, even for the current page).
For WebSite, similar reasons from the answer above apply. HTML doesn’t allow you to state something about the whole site (and, by the way, a Google rich result makes use of this type).
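As a sketch of a site-level statement that plain HTML has no element for, a WebSite item could be provided like this. JSON-LD is used here as one possible syntax; the name and URL are placeholders.
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Example Site",
  "url": "https://example.com/"
}
</script>
```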
Schema.org is not restricted to HTML5.
Schema.org is a vocabulary which can be used with various syntaxes (like JSON-LD, Microdata, RDFa, Turtle, …), stand-alone or in various host languages (like HTML 4.01, XHTML 1.0/1.1, (X)HTML5, XML, SVG, …). So having other ways to specify that something is (or: is about; or: represents) a site-wide navigation, a table etc. is the exception rather than the rule.
But there can be reasons to use these types even in HTML5 documents, for example:
The HTML5 markup and the annotations from Microdata/RDFa are two "different worlds": a Microdata/RDFa parser is only interested in the annotations, and after successfully parsing a document, the underlying markup is of no relevance anymore (e.g., the information that something was specified in a table element is lost in the Microdata/RDFa layer).
By using types like WebPageElement, you can specify metadata that is not possible to specify in plain HTML5. For example, the author/license/etc. of a table.
You can use these types to specify data about something which does not exist on the current document, e.g., you could say on your personal website that you are the author of a table in Wikipedia.
That said, these are not typical use cases relevant for a broad range of authors. Unless you have a specific reason for using them, you might want to omit them. They are not useful for typical websites. Using them can even be problematic in some cases.
See also my Schema.org issue The purpose of WebPageElement and mainContentOfPage, where I suggested deprecating WebPageElement and the mainContentOfPage property.
Just use <table>.
You seem to be reading the title of the pages and no further. The <table> tag doesn't have the dozens of special properties listed on that page like isFamilyFriendly or license or timeRequired.
Schema.org microdata is intended to build a standard set of additional, semantic metadata that can be used by automated systems - search engine spiders, parser robots, etc. - to better understand the nature and features of the content.
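To illustrate, here is a sketch of a table that carries metadata the table element alone cannot express. It assumes the Schema.org Table type with its inherited name, license and author properties; the caption, author name, license and cell values are placeholder data.
```html
<table itemscope itemtype="https://schema.org/Table">
  <caption>
    <span itemprop="name">Quarterly sales</span>,
    compiled by
    <span itemprop="author" itemscope itemtype="https://schema.org/Person">
      <span itemprop="name">Jane Doe</span>
    </span>
    (<a itemprop="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>)
  </caption>
  <tr><th>Quarter</th><th>Units</th></tr>
  <tr><td>Q1</td><td>120</td></tr>
</table>
```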
