What does dcterms.date denote? - html

I was reading James Donnelly's answer to "Is there a standardized (Meta?) Tag for the Date of a Website?". At the end he writes:
I don't believe Hangy's answer of dc.date (now dcterms.date) would be relevant here as, as far as I'm lead to believe, the date of this is the date associated with the resource. For example, if the resource was a discussion about the Battle of Hastings in 1066, the dcterms.date could be set to 1066. The same could also be said for icas.datetime.
The definition of dcterms.date is "A point or period of time associated with an event in the lifecycle of the resource." I think the question is whether this "event in the lifecycle of the resource" means an event discussed within the resource, or an event pertaining to the resource itself.
Looking around, I found an example of the use of dcterms:date:
ex:myManuscript dcterms:date "1633"^^dcterms:W3CDTF .
The use of 1633 in the example leads me to believe Donnelly's interpretation is right (especially since the other examples on the same page use dates in the 2000s).
However, reading this post, I also discovered that Dublin Core has a dumb-down principle. Quoting from the post:
The solution to the paucity of Dublin Core elements was this thing called “qualified Dublin Core” (although that term doesn’t seem to be used much any more), in which the fifteen core elements are qualified to make them more specific — for example, dateAccepted, dateAvailable and dateCopyrighted are refinements of the core element date. According to the Dublin Core’s own dumb down principle, “a client should be able to ignore any qualifier and use the value as if it were unqualified […] Qualification is therefore supposed only to refine, not extend the semantic scope of an Element.”
This leads me to believe that Donnelly's interpretation is incorrect.
So my question is: What is the correct interpretation of Dublin Core's definition of dcterms.date?

There are two ways the DCMI Metadata Term date can be used in HTML5 documents:
in meta-name elements (in the head element), because it’s registered as MetaExtension:
dcterms.date
in URI-based structured data syntaxes (typically RDF serializations like RDFa or JSON-LD, but possibly also Microdata):
http://purl.org/dc/terms/date (with the RDFa Initial Context: dc:date or dcterms:date)
In the latter case, you can differentiate if you are talking about the document or about the thing the document represents. You just have to give the thing a URI (see more details in my answer).
In the former case, HTML5 doesn’t allow this differentiation. The HTML5 specification defines that a meta element with the name attribute represents "document-level metadata"; "it sets document metadata". So unless it’s defined otherwise for the keyword dcterms.date (which doesn’t seem to be the case), the date should be associated with the document, not the thing.
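As a rough illustration of the meta-name case, here is a minimal Python sketch that reads the dcterms.date keyword from a document using the standard library's html.parser. The sample markup, the date 2013-05-14, and the names MetaDateParser and document_dates are made up for illustration. The point is only that a value found this way describes the document itself, not the 1066 battle it discusses.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collects the content of every <meta name="dcterms.date"> element."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Metadata names are compared case-insensitively here.
        if (attr_map.get("name") or "").lower() == "dcterms.date":
            self.dates.append(attr_map.get("content"))


# Hypothetical page about the Battle of Hastings, written in 2013.
SAMPLE = """<!DOCTYPE html>
<html><head>
<title>The Battle of Hastings</title>
<meta name="dcterms.date" content="2013-05-14">
</head><body><p>Fought in 1066.</p></body></html>"""


def document_dates(html_text):
    """Return all document-level dcterms.date values found in html_text."""
    parser = MetaDateParser()
    parser.feed(html_text)
    parser.close()
    return parser.dates


print(document_dates(SAMPLE))
```

To say something about the battle rather than the page, you would need one of the URI-based syntaxes mentioned above, which this sketch does not cover.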

Related

Are attribute values like contents, glossary for `rel` attribute deprecated?

I've been reading this tutorial. When I cross-checked it with the MDN page on link types, I found that some values like contents, glossary and copyright aren't mentioned on the MDN page. For copyright there seems to be an alternative of license value.
Am I reading an outdated tutorial? Are the values contents, glossary and copyright deprecated?
For current info on this, see the existing rel values page in the Microformats Wiki.
That page is what the HTML spec itself references as the official list of rel values that are valid in addition to the ones defined in the HTML spec itself:
Extensions to the predefined set of link types may be registered in the microformats wiki existing-rel-values page.
So if you look there, you’ll see contents, glossary and copyright are all listed as valid rel values.
For copyright there seems to be an alternative of license value.
Yes, they’re basically synonyms, where rel=license is the latest and rel=copyright is old—though not formally deprecated. But given that rel=license is among the link types actually defined in the HTML spec itself, it’s recommended to instead use rel=license these days —but even that’s not formally mandated/required. (You can still safely use rel=copyright if you want.)
2016-03-06 update
So, the (now-deleted/struck-through) part I said above about rel=copyright not being formally deprecated is actually wrong. In fact the HTML standard says it “must not be used in documents”.
If you look at the Link types section of the spec and scroll just past the table there, you’ll see the following sentence [which I’m planning to have moved to make it harder to miss]:
Some of the types described below list synonyms for these values.
These are to be handled as specified by user agents, but must not be
used in documents.
And then if you look at the end of the section for rel=license, you’ll see that it says:
Synonyms: For historical reasons, user agents must also treat the keyword "copyright" like the license keyword.
So that means the spec says that rel=copyright must not be used in documents.
So I’ll also soon be changing the HTML Checker behavior to emit an error for rel=copyright.
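The two spec passages quoted above can be sketched as a small check: a user agent treats the synonym like the real keyword, while a conformance checker flags it. This is only a minimal Python sketch, not how the HTML Checker is actually implemented. The names SYNONYMS and normalize_rel are hypothetical, and the table holds only the copyright/license pair taken from the quoted text.

```python
# Synonym keywords that user agents must handle but documents must not use.
# Only the pair quoted from the spec is listed; this is not a complete table.
SYNONYMS = {"copyright": "license"}


def normalize_rel(rel_value):
    """Return (effective_keywords, errors) for a rel attribute value.

    effective_keywords is what a user agent would act on;
    errors is what a checker would report to the author.
    """
    effective = []
    errors = []
    for token in rel_value.split():
        keyword = token.lower()  # link type keywords are case-insensitive
        canonical = SYNONYMS.get(keyword)
        if canonical is not None:
            errors.append(
                f'rel="{keyword}" must not be used in documents; '
                f'use rel="{canonical}"'
            )
            keyword = canonical
        effective.append(keyword)
    return effective, errors


print(normalize_rel("copyright"))
print(normalize_rel("license"))
```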

Which more extensions of HTML5 are "default" and not specified in the HTML5 spec?

There are many questions here, like this one, asking about attributes that are not defined in the HTML5 spec. All the HTML+RDFa attributes, like vocab, typeof and property, are "valid by default", without the need for a namespace mechanism.
So, the problem: if it is a "valid HTML5" attribute (or element), and it is not in the HTML5 spec, how can I (or my algorithm) know that it is valid?
Is there another W3C spec saying "hello, this is a list of current specifications that are affected by other current specifications" (mutually affected specs)?
NOTES
Perhaps W3C uses some principle like "ignorantia legis neminem excusat" (Latin for "ignorance of the law excuses no one") from civil law countries (?)... So, in this case, W3C has an obligation to show the "list of mutually affected specs" above.
The context here has no specific Stack Overflow tags. It is something like "interoperating standards" or "agreement between W3C specifications of the same group"... or "inter-spec recommendations".
This question goes to the heart of what it means for a document to be "valid". Although we, in common parlance talk of validity, the HTML5 spec does not actually use the term "valid" but "conformance". That is, it says that an HTML document conforms or does not conform to the specific requirements laid out in the specification. It also says something about extensibility which is very illuminating:
When vendor-neutral extensions to this specification are needed,
either this specification can be updated accordingly, or an extension
specification can be written that overrides the requirements in this
specification. When someone applying this specification to their
activities decides that they will recognise the requirements of such
an extension specification, it becomes an applicable specification for
the purposes of conformance requirements in this specification.
Note: Someone could write a specification that defines any arbitrary byte
stream as conforming, and then claim that their random junk is
conforming. However, that does not mean that their random junk
actually is conforming for everyone's purposes: if someone else
decides that that specification does not apply to their work, then
they can quite legitimately say that the aforementioned random junk is
just that, junk, and not conforming at all. As far as conformance
goes, what matters in a particular community is what that community
agrees is applicable.
What that means is that whether an element or attribute is valid or not is not absolute but depends on the community that wishes to apply specific rules or not. So it is with the RDFa attributes: they're valid if you want them to be, not if you don't. Within the wider community, what elements are considered valid can change over time. If RDFa falls out of use, then they will be effectively invalid. If RDFa grows in popularity, then those attributes become valid to a wider community.
So, it's effectively meaningless to talk of a document that defines which current specs form a full set of validity requirements. The set necessarily depends on any extant specs that are accepted as defining validity for each community.

Can an HTML element have the same attribute twice?

I'm considering writing code which produces an HTML tag that could have duplicate attributes, like this:
<div data-foo="bar" class="some-class" data-foo="baz">
Is this legal HTML? Does one of the data-foo values take precedence over the other? Can I count on semi-modern browsers (IE >= 9) to parse it without choking?
Or am I about to do something really stupid here?
It is not valid to have the same attribute name twice in an element. The authoritative references for this are somewhat complicated, as old HTML versions were nominally based on SGML and the restriction is implied by a normative reference to the SGML standard. In HTML5 PR, section 8.1.2.3 Attributes explicitly says: “There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other.”
What happens in practice is that the latter attribute is ignored. Well, future browsers might do otherwise. In the DOM, attributes appear as properties of the element node as well as in the attributes object, so there would be no natural way to store two values.
It's not technically valid, but every browser will ignore duplicate attributes in HTML documents and use the first value (data-foo="bar" in your case).
Using the same attribute name twice in a tag is considered an internal parse error. It would cause your document to fail validation, if that's something you're worried about. However, it's important to understand that HTML 5 defines an expected result even for cases where you have a "parse error". The parser is allowed to stop when it encounters an error, but if it chooses not to stop it must produce a specific result described in the specification. In practice, no browsers choose to stop when encountering errors in HTML documents (XML/XHTML is a different matter), so all modern browsers will handle this case successfully and consistently.
The WHATWG HTML specification describes this case in section 12.2.4.33 "Attribute name state":
When the user agent leaves the attribute name state (and before emitting the tag token, if appropriate), the complete attribute's name must be compared to the other attributes on the same token; if there is already an attribute on the token with the exact same name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated with it (if any).
See also its description of "parse error" from the opening of section 12.2 "Parsing HTML documents":
Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
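The rule quoted from the "Attribute name state" section can be sketched in a few lines. This is a minimal Python illustration, not a real tokenizer: it assumes the attributes have already been split into (name, value) pairs, and the function name collect_attributes is hypothetical.

```python
def collect_attributes(pairs):
    """Apply the 'first attribute wins' rule to (name, value) pairs.

    Returns (attributes, parse_errors): the attributes that survive, in
    source order, and the names that were dropped as duplicates.
    """
    attributes = {}
    parse_errors = []
    for name, value in pairs:
        name = name.lower()  # the HTML tokenizer lowercases attribute names
        if name in attributes:
            # Duplicate: record a parse error and drop the new attribute
            # together with its value.
            parse_errors.append(name)
            continue
        attributes[name] = value
    return attributes, parse_errors


# The start tag from the question:
# <div data-foo="bar" class="some-class" data-foo="baz">
attrs, errors = collect_attributes(
    [("data-foo", "bar"), ("class", "some-class"), ("data-foo", "baz")]
)
print(attrs)
print(errors)
```

For the tag in the question, data-foo keeps the value "bar" and the second data-foo is reported as a parse error, which matches the behaviour described above.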
I wanted to add a comment to the excellent accepted answer, but my reputation is not high enough.
I wanted to add it is important to consider how your code gets compiled.
For example, Angular removes prior duplicate (non-angular) class attributes and only keeps the last one.
Note: Angular also modifies the value of the class attribute with ngClass and any [class.class-name] attributes.
This is also something you can use a linter for.
See htmlhint (attr-no-duplication) or htmllint (attr-no-dup).

Validation error "Bad value apple-touch-icon-precomposed for attribute rel on element link: Keyword apple-touch-icon-precomposed is not registered."

I'm getting this error in the W3C HTML5 validator:
Line 9, Column 101: Bad value
apple-touch-icon-precomposed for
attribute rel on element link: Keyword
apple-touch-icon-precomposed is not
registered. …-icon-precomposed"
sizes="72x72"
href="images/sl/touch/m/apple-touch-icon.png">
Syntax of link type valid for <link>:
A whitespace-separated list of link types listed as allowed on <link> in the HTML specification or listed as an allowed on <link> on the Microformats wiki
How to fix this error?
Ignore it.
If that's the only error you have, then your document is valid HTML5.
Here's what the official (in development) spec states about the <meta> tag: "Extensions to the predefined set of metadata names may be registered." I can't find the area in the spec that talks about the rel attribute values, but the validator treats them similarly (one for links, one for strings), and points us to the extension Wiki. You 'may' register them, but don't have to. In RFC terminology this is a SHOULD not a MUST.
The spec doesn't seem to mandate a fixed list, or use of the Wiki. Doing so would seem odd, as these fields have often evolved with time. It does state: Conformance checkers must use the information given on the WHATWG Wiki MetaExtensions page to establish if a value is allowed or not: values defined in this specification or marked as "proposed" or "ratified" must be accepted. This is an interesting line, as it is a requirement on HTML validators, not on HTML5 itself, and doesn't itself make the markup invalid.
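The checker requirement quoted above amounts to a simple lookup. Here is a minimal Python sketch of it, under loud assumptions: the REGISTRY entries are invented placeholders rather than real wiki content, the function name is_accepted is hypothetical, and SPEC_DEFINED lists only the standard metadata names from the HTML spec.

```python
# Metadata names defined in the HTML specification itself.
SPEC_DEFINED = {"application-name", "author", "description", "generator", "keywords"}

# Placeholder registry: keyword -> status. These entries are invented for
# illustration and are NOT copied from the MetaExtensions wiki page.
REGISTRY = {
    "example-proposed": "proposed",
    "example-ratified": "ratified",
    "example-discontinued": "discontinued",
}

# Statuses a conformance checker must accept, per the quoted sentence.
ACCEPTED_STATUSES = {"proposed", "ratified"}


def is_accepted(keyword):
    """Return True if a checker following the quoted rule accepts keyword."""
    keyword = keyword.lower()
    if keyword in SPEC_DEFINED:
        return True
    # Unregistered keywords yield None, which is never an accepted status.
    return REGISTRY.get(keyword) in ACCEPTED_STATUSES


for name in ("description", "example-proposed", "example-discontinued", "unregistered"):
    print(name, is_accepted(name))
```

A keyword that works fine in browsers can still fail this lookup, which is why a validator message of this kind says nothing about whether the markup will function.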
In fact, many of these "extensions" are already in the wiki (including your one), they just haven't been accepted. Same with many meta tags, even very common ones. It seems many won't be accepted either.
I think it's very nice of the W3C to create a standardised list of these. It helps developers know what they should be using now and in the future (and can hopefully clean up some things, like reducing the number of ways you can specify a creation date from 5+ to 1).
Unfortunately we are dealing with third parties here (e.g. Apple). Unless you want to contact every third party who has created one of these informal specifications, tell them to formalize a spec, and submit it to the W3C's list (which may or may not get accepted), what are you to do? At the end of the day you still need to support it.
Anyway, isn't the very point of having these HTML elements to support extensions, so vendors don't break the spec by adding new elements to do what they need?
If you move the touch icons into your web root and follow the Apple documentation for naming conventions, you won't actually need to insert the link tags in your HTML and will avoid those validation errors.
The iOS devices will look for the icons in the web root automatically, using the predefined naming conventions and the correct resolution, as also outlined here. Good luck.
Delete the element from your source.
You probably don't want to do that though. Remember that validation is a tool, not a competition.
You might want to edit the wiki of supported link types and then wait for the validator to catch up.

Is HTML a context-free language?

Reading some related questions made me think about the theoretical nature of HTML.
I'm not talking about XHTML-like code here. I'm talking about stuff like this crazy piece of markup, which is perfectly valid HTML(!)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html<head>
<title//
<p ltr<span id=p></span</p>
</>
So given the enormous complexity that SGML injects here, is HTML a context-free language? Is it a formal language anyway? With a grammar?
What about HTML5?
I'm new to the concept of formal languages, so please bear with me. And yes, I have read the wikipedia article ;)
Context Free is a concept from language theory that has important implications in parser implementation. A Context Free Language can be described by a Context Free Grammar, which is one in which all rules have a single non-terminal symbol at the left of the arrow:
X→δ
That simple restriction allows X to be substituted by the right-hand side of any rule in which it appears on the left, without regard to what came before or after. For example, if while deriving or parsing one arrives at:
αXλ
one is sure that
αδλ
is also valid. Examples of non-context-free rules would be:
XY→δ
Xa→δ
aX→δ
Those would require knowing what could be derived around X to determine if a rule applies, and that leads to non-determinism (what's around X would also like to know what it derives to), which is a no-no in parsing, and in any case we want a language to be well-defined.
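The substitution property described above can be shown with a toy rewriting step. This is a minimal Python sketch under stated assumptions: the grammar in RULES (X → a X b | c) and the function name rewrite are invented for illustration and have nothing to do with HTML's actual grammar.

```python
# Toy context-free grammar: every rule has a single non-terminal on the left.
# X -> a X b | c
RULES = {"X": [["a", "X", "b"], ["c"]]}


def rewrite(form, position, production):
    """Replace the non-terminal at form[position] with production.

    Only the symbol itself is inspected; its neighbours are never
    consulted, which is exactly what makes the rule context-free.
    """
    symbol = form[position]
    if production not in RULES.get(symbol, []):
        raise ValueError(f"no rule {symbol} -> {' '.join(production)}")
    return form[:position] + production + form[position + 1:]


# The same rule applies whatever surrounds X.
print(rewrite(["p", "X", "q"], 1, ["c"]))
print(rewrite(["X"], 0, ["a", "X", "b"]))
```

A rule such as aX→δ could not be written this way, because rewrite would have to look at form[position - 1] before deciding whether the rule applies.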
The only way to prove that a language is context-free is by proving that there's a context-free grammar for it, which is not an easy task. Most programming languages one comes across are already described by CFGs, so the job is done. But there are other languages, including programming languages, that are described using logic or plain English, so work is required to find if they are context-free.
For HTML, the answer about its context-freedom is yes. SGML is a well defined Context Free Language, and HTML defined on top of it is also a CFL. Parsers and grammars for both languages abound on the Web. At any rate, that there exist LL(k) grammars for valid HTML is enough proof that the language is context-free, because LL is a proven subset of CF.
But the way HTML evolved over the life of the Web forced browsers to treat it as not that well defined. Modern Web browsers will go out of their way to try to render something sensible out of almost anything they find. The grammars they use are not CFGs, and the parsers are far more complex than the ones required for SGML/HTML.
HTML is defined at several levels.
At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
At the next level is XML, which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
At the next level are the rules about which attributes are valid for which tags, and the languages that can be embedded in HTML, like CSS and JavaScript.
Finally, you have the semantic rules about what a given HTML document means.
The syntactic part is defined well enough that it can be verified. The semantic part is much larger than the syntactic one, and is defined in terms of browser actions regarding HTTP, and the Document Object Model (DOM), and how a model should be rendered to the screen.
In the end:
Parsing correct HTML is extremely easy (it's context-free and LL/LR).
Parsing the HTML that actually exists over the Web is difficult.
Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.
Valid HTML is not a context-free language.
First of all, HTML being an application of SGML is fiction for all practical purposes, so analyzing SGML to answer the question is useless. (However, the SGML fiction probably isn't context-free, either.)
It's more useful to look at the actually defined HTML parsing algorithm. It works on two levels: tokenization and tree building. What HTML calls tokenization is a higher-level operation than what is usually called tokenization when talking about parsers. In the case of HTML, tokenization splits a stream of characters into units like start tags, end tags, comments and text. The tokenizer expands character references. Usually, when talking about parsers, you'd probably treat stuff like the less-than sign as "tokens" and would consider character references to consist of tokens instead of being resolved by the tokenizer.
If you consider the process of splitting the input stream into tokens, that level of the HTML language is regular (except for feedback from the tree builder).
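To make the "regular at the token level" claim concrete, here is a minimal Python sketch that splits markup into start tags, end tags, comments and text with a single regular expression. It is a deliberate simplification and an assumption-laden one: it ignores character references, doctypes, ">" inside quoted attribute values, and above all the tree-builder feedback mentioned above, so it is not the tokenizer from the HTML specification. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# One alternative per token kind; order matters because comments and
# end tags also begin with "<".
TOKEN_RE = re.compile(
    r"""
      (?P<comment><!--.*?-->)
    | (?P<end_tag></[A-Za-z][^>]*>)
    | (?P<start_tag><[A-Za-z][^>]*>)
    | (?P<text>[^<]+|<)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(source):
    """Split source into (kind, lexeme) pairs using only a regular expression."""
    return [(match.lastgroup, match.group()) for match in TOKEN_RE.finditer(source)]


for token in tokenize("<p>Hi<!-- note --></p>"):
    print(token)
```

Everything this sketch does fits in a finite automaton, since no token kind needs to remember how deeply anything is nested. The difficulty described in the next paragraph starts once the tree builder gets involved.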
However, there are three complications: The first one is that splitting the input stream into tokens is just the first step, and then there's the tree builder's side that actually cares about the identifiers in the tokens. The second one is that the tree builder feeds back into the tokenizer so that some state transitions made by the tokenizer depend on the state of the tree builder! The third one is that valid documents in the language are defined by rules that apply to the output of the tree builder stage and those rules are complex enough that they can't be fully defined using tree automata (as evidenced by RELAX NG not being expressive enough to describe all the validity constraints).
This isn't an actual proof, but you can probably develop real proofs by working from complications #2 and #3.
Note that the case of invalid documents is not particularly interesting as a question of whether the language is context-free in the sense of there being a context-free grammar that generates all the possible strings with no regard to the parse tree having some intelligible interpretation in terms of the tree that an HTML parser generates. The HTML parser will successfully consume all possible strings, so in that sense, all possible strings are in the "invalid HTML" language.
Edit: Interesting questions left as exercise to the reader:
Is HTML without parse errors but ignoring validity a context-free language?
Is HTML without parse errors and ignoring general validity but with only valid element names allowed a context-free language?
(Complication #2 applies in both cases.)
NO
See Edit Below
It depends.
If you are talking about the subset consisting of only theoretical HTML, then yes.
If you also include real life, working HTML that is accessed and used successfully by millions of people daily on many of the top sites on the internet then NO.
That is what gives HTML flexibility. The parsing engine adds tags, closes tags, and takes care of stuff that a theoretical CFG can't do. If you took automata you might remember that a production rule in a formal grammar cannot be empty (aka epsilon/lambda) on the lhs (left-hand side). Since the parsing engine is basically using knowledge that a formal grammar and automata couldn't have, it isn't restricted by that and the 'grammar' would have epsilon/lambda -> result where the specific epsilon/lambda rule is chosen based on information not available in the grammar.
Since I don't think empty lhs are allowed in any formal grammars, HTML cannot be defined by a formal grammar and is not a formal language at all.
Sure, HTML5 might try to move towards a 'more formal' language description but the likelihood that it becomes a context free language in reality (i.e. strings not matched by the grammar are rejected) is about the likelihood XHTML 2.0 takes the world by storm and replaces HTML altogether (XHTML is the attempt they made to make HTML a formal language...it was rejected en masse due to its fragility).
Noteworthy is the fact that HTML 5 is the FIRST HTML standard to be defined before being implemented! That's right, HTML 1-4 consist of random ideas someone just implemented in a browser, and were collected into standards after the fact based on which features were popularly used and widely implemented. Then they tried XHTML, which totally failed to be adopted. Even 'xhtml' on the web is automatically parsed as HTML under almost every circumstance to prevent stuff from just breaking with a cryptic syntax error. Now you can see how we got here and why it is unlikely to be formalized any time soon.
Lesson: "In theory, there is no difference between theory and practice. In practice, there is." - Yogi Berra
EDIT:
Actually, after reading through the documents it turns out that HTML, even according to the HTML 4.01 specification, doesn't actually conform to SGML. To see for yourself, view the HTML 4.01 Strict document type definition (doctype) at http://www.w3.org/TR/html4/strict.dtd and note the following lines:
The HTML 4.01 specification includes additional
syntactic constraints that cannot be expressed within
the DTDs.
So I would say that it is probably not a CFL due to those features (although technically it doesn't disprove the hypothesis that there is some possible PDA that accepts HTML 4.01, it does prevent the argument that SGML is a CFL therefore HTML is a CFL).
HTML5 flip-flops, abandoning any implied conformance to SGML, but is presumably describable by a CFG. However, it will still provide best-effort parsing not based on a CFG, so IMO the current situation (i.e. the language specification is defined formally, with invalid strings still being accepted, parsed and rendered in a best-effort fashion) is unlikely to change drastically for a long, long, long time.
HTML5 is different from previous HTML versions in that it strictly defines the parsing behaviour of code that isn't completely correct. Pre-HTML5 parsers vary and each do their best to 'guess' the intention of the code author.