More semantic tags in HTML? - html

HTML5 introduced a variety of new semantic tags (such as <header>, <footer> and <article>) to replace divs and similar generic tags where appropriate. This seems like a good thing.
I've been experimenting with Polymer, and the Web Components spec seems like the future, albeit a little ahead of its time. In particular, its custom tag names and ability to separate data from layout makes pages of HTML extremely semantic and readable.
The components have templates that form the shadow DOM, and the templates are populated using content tags that select data elements based on their tag names. Putting HTML5's semantic tags and <content> tags together seems like a no-brainer, distributing uniquely-named elements rather than trying to insert three <div>s in three different places and relying on declaration order.
In particular, it would be great if we could use custom semantic tags for data, and use those semantic names for selection. For example, a name card component could have <content> tags that select data elements such as <name> and <description> rather than using attributes.
However, <name> and <description> are not valid tags in HTML5. Browsers treat all unrecognised tags as <span> elements, so there would be no implementation problems with allowing custom tags in the DOM. They would carry additional semantic information for readers of the code, and potentially search engine crawlers as well. Code might be less consistent, but arguably 'better'.
If this is too vague a question, I'll make it explicit: is there a definitive reason why tags such as <name> are not valid HTML5?
EDIT: I just tested the <content> selection, and verified that you can in fact use selectors other than just tag name in the select attribute. I hadn't seen this in any tutorials, but it mostly satisfies my practical desire for semantic tags. For my example, you could instead use <content select="p.name"> and <content select="p.description">.
I'm still interested from a theoretical perspective, however.

The HTML5 spec gives a reason, in section 3.2.1 Semantics:
Authors must not use elements, attributes, or attribute values that are not permitted by this specification or other applicable specifications, as doing so makes it significantly harder for the language to be extended in the future.

Related

Is there a list of all html tags that do not need a solidus or a closing tag?

I built a parser for html, but I worked under the assumption that it would follow the rule that there were only two forms:
<foo> </foo>
<foo/>
Obviously that is wrong. Tags such as base,meta, and link do not need this.
I kind of wish this wasn't the case, because I have found things like this in a script:
for(var d=b.length,e=b[a];a<d>>1;)
Oh look, the mythical <d> tag.
So I need to make myself a whitelist of tags to ignore. Is there a comprehensive list for tags that do not require a solidus or closing tag? If not, I'll have to rewrite my parser.
Thanks
You can extract a list from the WHATWG HTML Living Standard. Or, if you prefer, the W3C's HTML 5 Specification or the subsequent draft. According to Wikipedia, the conflict has somewhat recently been resolved in favour of WHATWG, so you probably want to go with the first one.
In any case, pay particular attention to the subheading "Tag omission in text/html" in each element description. But you need to read the document carefully to understand the ins and outs of HTML parsing.
Note: It's not just that end tags can be omitted. There are also elements whose open tag can be omitted. (The classic example is <tbody>, which is hardly ever physically present in an HTML document, but there are lots of others. <head>, for example.) The mere fact that an element's open tag has been omitted does not force the omission of the element's close tag, although it's pretty commonly the case. So you can't do it with just a list of omitable tags; you need to take element containment rules into account, too.
Also, even though the full parsing algorithm is suprisingly complicated even for valid documents, the standard algorithm and real-world HTML parsers are even more complicated, because they try to deal gracefully with web pages which don't conform to the standard.

In semantic HTML does the class attribute mean anything in the absence of CSS or Javascript?

For example, does the class film_review mean anything in <article class="film_review"> (example from MDN) if there's no CSS or Javascript interacting with the page, or does it provide semantic information?
It doesn't provide an information that contemporary browsers would interpret or use without CSS or Javascript per se.
However it can carry semantic information - see e.g. microformats. For example, you could put an hcard
<div id="hcard-John-Doe" class="vcard">
<span class="fn">John Doe</span>
<div class="org">Cool Institute, Inc.</div>
<div class="adr"><span class="locality">Prague</span></div>
</div>
on your page and it carries a semantic information. A search engine like Google could infer that "John Doe" is a name of a person located in "Prague". There are other microformats that can represent geo information, calendar events, etc.
Anyone can write their own processor of HTML documents that would interpret class attribute values, so the answer is yes, it provides semantic information.
Quoting from hcard microformat example:
Per the HTML4.01 specification, authors should be using the element to indicate the "contact information for a document or a major part of a document." E.g.
<address>
Tantek Çelik</address>
By adding hCard to such existing semantic XHTML, you can explicitly indicate the name of the person, their URL, etc.:
<address class="vcard">
<a class="fn url" href="http://tantek.com/">Tantek Çelik</a>
</address>
It provides semantics purely in the sense that it semantically connects that element with other elements of the same class.
There's no rule which states that anything (specifically CSS and/or JavaScript in this case) must use that class. The class itself is simply part of the markup and is coincidentally being ignored by the current styling rules.
You might have other elements with the film_review class, and they are "semantically" connected in the sense that they represent "film reviews" in the markup. That's really all semantic information is... context about the thing being represented in the code. Well-named classes can provide such additional context.
But there's nothing special that the browser is going to do with this information. It's just there in case anybody (styling, code, or even just somebody looking at the markup) wants to know that this article belongs to a named class of elements.
Semantics on HTML5 are more oriented on standarizing the most used elements around the web. As described on HTML Semantic Elements:
With HTML4, developers used their own favorite attribute names to style page elements:
header, top, bottom, footer, menu, navigation, main, container, content, article, sidebar, topnav, ...
This made it impossible for search engines to identify the correct web page content.
With HTML5 elements like: <header> <footer> <nav> <section> <article>, this will become easier.
So an element so specific as a "Film Review" would not provide that much semantic information at HTML5 level.
That depends. Who and what else is processing your HTML?
For example, microformats sometimes use classes to add semantic information to elements which don't naturally possess rich semantics. In that case, neither ECMAScript nor CSS process that information, but a microformats parser might. film_review doesn't belong to any well-known microformat, however.
Everything on the page gets parsed (read) by a search-engine, so your answer is, YES, it does provide semantic information, however there are different weighted value associated with different HTML tokens (elements, attribute-names, attribute-values).
However what really defines how much weight a HTML token gets, is really dependent on the type of document that you declare it is (HTML4/HTML5), the <!DOCTYPE> tag at the top of your page declares that to the search-engine bot/parser what type of document it is, which in turn controls the search-engine bot's parsing-schema (behavior) on how to read your document.
The entire purpose of HTML5 was to provide "semantics", allowing you to use different tags so you can markup/define your document giving content more importance allowing search-engines to understand it better. This allows the search-engine a much better way to then supply the end-user, whom is searching for something with more relevant content associated with their search term... if your not using HTML5 and using HTML4 then the bots are relying mostly on HTML attributes to define the content within tags such as a <div> which provides no semantic meaning to the content inside it.

Why use Schema.org microdata to mark up web page elements?

I understand why and how to use Schema.org to add microdata to your site, this is not a question about that. The question is why Schema.org has support for certain things that can be marked up with simple HTML5. Among these are
Types
WebPage and WebSite
I can see why WebPage and WebSite would be needed, for example, to reference the page/site of a certain organization in a link, but there's no need to mark up your own page with this—the <html> tag does this.
SiteNavigationElement
Why not just use <nav>?
Table
Just use <table>.
properties
WebPage/mainContentOfPage
<main> element
WebPage/relatedLink
<link> element inside <head>
This answer is primarily about the WebPageElement types (like SiteNavigationElement).
For WebPage, see my answer to the question Implicity of web page structure in Schema.org (tl;dr: it can be useful to provide WebPage, even for the current page).
For WebSite, similar reasons from the answer above apply. HTML doesn’t allow you to state something about the whole site (and, by the way, a Google rich result makes use of this type).
Schema.org is not restricted to HTML5.
Schema.org is a vocabulary which can be used with various syntaxes (like JSON-LD, Microdata, RDFa, Turtle, …), stand-alone or in various host languages (like HTML 4.01, XHTML 1.0/1.1, (X)HTML5, XML, SVG, …). So having other ways to specify that something is (or: is about; or: represents) a site-wide navigation, a table etc. is the exception rather than the rule.
But there can be reasons to use these types even in HTML5 documents, for example:
The HTML5 markup and the annotations from Microdata/RDFa are two "different worlds": a Microdata/RDFa parser is only interested in the annotations, and after successfully parsing a document, the underlying markup is of no relevance anymore (e.g., the information that something was specified in a table element is lost in the Microdata/RDFa layer).
By using types like WebPageElement, you can specify metadata that is not possible to specify in plain HTML5. For example, the author/license/etc. of a table.
You can use these types to specify data about something which does not exist on the current document, e.g., you could say on your personal website that you are the author of a table in Wikipedia.
That said, these are not typical use cases relevant for a broad range of authors. Unless you have a specific reason for using them, you might want to omit them. They are not useful for typical websites. Using them can even be problematic in some cases.
See also my Schema.org issue The purpose of WebPageElement and mainContentOfPage, where I suggested to deprecate WebPageElement and the mainContentOfPage property.
Just use <table>.
You seem to be reading the title of the pages and no further. The <table> tag doesn't have the dozens of special properties listed on that page like isFamilyFriendly or license or timeRequired.
Schema.org microdata is intended to build a standard set of additional, semantic metadata that can be used by automated systems - search engine spiders, parser robots, etc. - to better understand the nature and features of the content.

HTML5: Why not use my own arbitrary tag names?

As far as I know, it is perfectly legal to make up tag names in HTML5, and they work like normal with CSS styling and nesting and all.
Of course my arbitrary tag names will make no difference to the browser, which does not understand them, but it goes very far in making my code more readable, which makes it easier to maintain.
So why should I NOT use arbitrary tag names on my pages? Will it affect SEO? Will it break anything?
IMPORTANT EDIT: Older browsers DO NOT choke on unsupported tags when used with http://ejohn.org/blog/html5-shiv/
UPDATE
The original answer is quite old and was created before web components existed (although they had been discussed since 2011). The rules on custom elements have changed quite a bit. The W3C now has a whole standard for custom elements
Custom elements provide a way for authors to build their own fully-featured DOM elements. Although authors could always use non-standard elements in their documents, with application-specific behaviour added after the fact by scripting or similar, such elements have historically been non-conforming and not very functional. By defining a custom element, authors can inform the parser how to properly construct an element and how elements of that class should react to changes.
In other words, the W3C is now fine with you using custom elements (or arbitrary tag names as the question calls them). However, there are quite a few rules surrounding them and their use so you should refer to the linked documentation.
Original answer
Older browsers choke on tags they don't recognize. There is nothing that guarantees that your custom tags will work as you expect in any browser
There is no semantic meaning applied to your tags since this meaning only exists in the agreed upon tags in the spec. Whether or not this affects SEO is unknown, but anything that parses the HTML may have trouble if it runs into a tag it doesn't recognize including screen readers or web scrapers
You may not get the help you need from the browser if your tags are improperly nested
In the same vein, some browsers may screw up rendering and be different from one another. For example, <p><ul></ul></p> is parsed in Chrome as <p></p><ul></ul>. There is no guarantee how your tags will be treated since they have no agreed upon content model
The W3C doesn't want you to do it.
Authors must not use elements, attributes, or attribute values that are not permitted by this specification or other applicable specifications, as doing so makes it significantly harder for the language to be extended in the future.
If you really need to use custom tags for this purpose, you can use a template language of some sort such as XSLT and create a stylesheet that transforms it into valid HTML

html5: how to markup proper names and common names?

<p>...the favourite color of Purple is purple...</p>
the first "Purple" is a name of a company, the second one is a color name,
how should I markup this according to html5 spec?
thank you in advance
You have a number of options:
Leave it as is, HTML isn't really concerned with semantics which aren't about describing document structure (paragraphs, headings, lists etc.). If you do want to express more detailed document or application semantics look at WAI-ARIA.
If it's important for you to distinguish between the two uses of the word purple as part of your website or app then use the class attribute or data-* attributes
If the words have canonical machine readable forms and you want the values to be parsed by a computer somehow, use the data element.
If distinguishing between the two uses is important to users or systems consuming your site content, use the semantic extensibility feature of HTML5: Microdata. (If you're using the XML dialect of HTML, see also: RDFa)
Combine any of the above approaches according to your immediate needs.
To decide between the approaches you should ask yourself:
For what purpose do I need to extend the semantic vocabulary of HTML?
Is it for my own uses, or am I trying to publish information to be used by others?
If I'm publishing for others, what shared vocabulary am I going to use?
Code examples:
Class attributes
What they're for is adding additional information to your markup, remember the class attribute is in the HTML spec, not the CSS spec:
<p>...the favourite color of <span class="company">Purple</span>
is <span class="color">purple</span>...</p>
Having said that, of course, the obvious thing to do once you have things marked up in this way is provide in page tools to do things like 'highlight all companies'. People have used the class attribute as the basis for a general purpose semantic extension mechanism however, for this approach taken to the extreme see microformats.
Data attributes
The data-* attributes are to allow you to add custom attributes to your markup for processing with scripts in a way which guarantees you won't accidentally use a custom attribute which then gets used in a future version of HTML:
<p>...the favourite color of <span data-typeofthing="company">Purple</span>
is <span data-typeofthing="color">purple</span>...</p>
It's up to scripts on your page to do something useful with the data-* attributes, browsers and other web clients will ignore them.
Custom data elements
Data elements are for things that have an imprecise natural language expression but also a precise machine readable expression. Assuming that the company can be uniquely identified by a ticker symbol and RGB will do for the colour:
<p>...the favourite color of <data value="purp">Purple</data>
is <data value="rgb(128,0,128)">purple</data>...</p>
Browsers probably won't do anything special with the data element. It's most likely you'll use data elements in concert with microformats, RDFa or Microdata.
Microdata
Using the Organization schema:
<p>...the favourite color of
<span itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Purple</span>
</span>
is purple...</p>
There isn't anything for colours that I'm aware of, but you could always publish your own schema for that if it's important to you. This approach only really benefits anyone if there is a shared vocabulary of some kind.
Which element?
The first task would be to decide which element should be used to enclose the "entities" (company name and color name). Most probably you want to use span here. If in doubt, use span. There are some cases (depends on content) where other elements could be used:
the b element might be used if the entity is some kind of keyword ("a span of text to which attention is being drawn for utilitarian purposes without conveying any extra importance and with no implication of an alternate voice or mood")
if the entity is the title of a work (book, film, song etc.), the cite element should be used.
the dfn element might be used if the entities are defined in the same paragraph or in the nearest ancestor sectioning element
in some cases the i element could be appropriate
If the term is an abbreviation/acronym, use the abbr element (instead of span resp. in addition to b/dfn/i).
A good idea might be to use the a element to link the entity name to a relevant webpage. The rel attribute might give additional metadata (you can use the rel values listed in the HTML5 spec, or the registered rel values in the microformats wiki), depending on the content/context.
Which attributes?
Have a look at the global attributes.
The class attribute would be used if you'd like to use microformats. You could of course use other class names, but they would only be useful for yourself (documentation, CSS, JS) or other people that read your markup (documentation, scraping).
For entity names like person/company names you probably want to use the translate attribute with the no keyword, because such names should not be translated.
The title might give additional information (note that it has special semantics for dfn/abbr), but don't rely on it for important information.
Use lang if the entity names are in a foreign language.
How to annotate the content with meaning?
There are three popular choices, which can also be used together (see this answer (the "third step") for a short summary of the differences):
microformats
RDFa
microdata
You may need additional elements; if so, use span.