I have small sections of text on my site that are in French (Canadian). When I look at the W3Schools language code reference, I find only a list of languages, with no way to indicate a region. Is writing <span lang="fr-CA"> valid? Does adding the region to the language code have any impact on screen readers?
Yes, <span lang="fr-CA"> is valid, but you should not expect it to provide any benefits over <span lang="fr"> in most cases. In fact, I have used a screen reader that recognized a few values for the lang attribute, but only when simple codes like en and fr were used.
According to HTML5 PR, the lang attribute value is a language code, or “language tag” as they call it, as defined in BCP 47, which is a concatenation of two RFCs: Tags for Identifying Languages (RFC 5646) and Matching of Language Tags (RFC 4647). It defines a rather complicated system of codes. Most of the possibilities have no use in HTML in practice, now or in the foreseeable future. The code system has been defined to meet many different needs, including bibliographic information and text databases.
Using fr-CA is possible in accordance with RFC 5646, clause 2.2.4. Region Subtag:
Region subtags are used to indicate linguistic variations
associated with or appropriate to a specific country, territory, or
region. Typically, a region subtag is used to indicate variations
such as regional dialects or usage, or region-specific spelling
conventions. It can also be used to indicate that content is
expressed in a way that is appropriate for use throughout a region,
for instance, Spanish content tailored to be useful throughout
Latin America.
The subtag CA is the ISO 3166-1 country code for Canada, so fr-CA denotes French as spoken and written in Canada, i.e. Canadian French. It is possible that a screen reader that can speak French in different variants and recognizes lang attributes will use Canadian pronunciation for an element that has lang="fr-CA". However, this is probably very theoretical and would be of little practical impact if it were actually implemented.
More realistically, such an attribute may have other effects. If you open an HTML document in MS Word, it will recognize lang attributes. Whether this has any practical impact depends on whether and how Word treats Canadian French differently from French in general, e.g. in spell checking.
The next reference on W3schools describes valid country codes and how to use them with language codes.
Ex: <html lang="en-US">
With regard to screen readers, it depends on how the reader is implemented. Generally they would take the language from <html>, but some implementations may allow the language to change over the course of the document.
The lang attribute can be used on any HTML element, as described in the HTML specifications, so <span lang="fr-CA"> would be valid.
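As a minimal sketch of how this might look in a complete page, assuming English is the page's main language (the sample sentence and title are made up for illustration):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Language of parts (illustrative)</title>
</head>
<body>
  <!-- The page default is English; only the span is marked as Canadian French. -->
  <p>The sign on the door said <span lang="fr-CA">Bienvenue au dépanneur</span>.</p>
</body>
</html>
```

Software that does not distinguish regional variants may simply treat the span as plain fr, but that is up to each implementation.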
Related
In HTML, it's good to have a lang attribute in <html>, e.g. <html lang="en">.
How is this useful?
If this is used for translation: even if the language is set to English and all the text in the document is Chinese, Google Translate detects it as Chinese, not English (which suggests Google ignores the lang attribute).
I am quoting this from W3C:
Declaring language in HTML
Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.
Use language tags from the IANA Language Subtag Registry.
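The advice quoted above about using both attributes together could look roughly like this in an XHTML 1.x or polyglot document. This is only a sketch; the title and sentence are placeholders:

```html
<!-- Illustrative XHTML 1.x / polyglot markup: lang and xml:lang carry the same value. -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Example</title></head>
  <body>
    <!-- Content in another language gets both attributes on the surrounding element. -->
    <p>The French word <span lang="fr" xml:lang="fr">vache</span> means "cow".</p>
  </body>
</html>
```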
Also a good read is Why use the language attribute?.
You asked "how is this useful".
"The <lang=> attribute can be used to declare the language of a Web
page or a portion of a Web page. This is meant to assist search engine
spiders, page formatting and screen reader technology"
Source: http://symbolcodes.tlt.psu.edu/web/tips/langtag.html (Wayback Machine link)
No mention of translation - but often a search engine spider will not want to parse through a document "in the wrong language" - its index file will grow (lots of new words), and the results will not be useful to the user (who cannot read the language, and who is using the wrong search terms).
The advent of smart translation technology (like Google's, referred to above) means that some search engines can see a page in one language, translate it, and figure out that someone searching for "cow" may be interested in this page that mentions "vache" and is marked with lang="fr".
The lang attribute is needed by screen readers to let them pronounce words correctly, and also (perhaps surprisingly) sometimes needed to allow text to be rendered correctly by the browser.
lang needed for speech synthesis
Some blind or visually impaired people use speech-synthesizing screen readers that speak the words on the screen. Since two words from different languages that are spelt identically may be pronounced differently, such speech synthesis cannot be done without knowing the language of the text. For instance, the word "pain" in English is pronounced completely differently to the word "pain" in French, so a screen reader that doesn't know whether it's reading English or French won't know how to pronounce "pain".
Using the lang attribute indicates to a screen reader what language some text is in and thus allows it to pronounce the word correctly.
I recorded a demonstration of this using Narrator, the built-in screen reader for Windows. (If you'd like to reproduce this, do note that you'll need to have both the English and French voice packages installed via the Speech settings page in the Windows Settings app, and have English as your default voice.) The demo uses an HTML page with the following content:
<h5>No lang specified:</h5>
<p>J'aime le pain</p>
<h5>French:</h5>
<p lang="fr">J'aime le pain</p>
As you can hear in the recording I uploaded at https://www.youtube.com/watch?v=7J1I65sn1CQ, Microsoft George (the default English voice) butchers the pronunciation of the French phrase (pronouncing it "Jay aim le payne"), whereas Microsoft Hortense (the default French voice) pronounces it correctly.
lang needed for text rendering
Perhaps surprisingly, the benefits of the lang attribute are not limited to disabled people using speech-synthesizing assistive tech. Setting lang can also affect text rendering, since the correct way to render some text can be language-dependent.
There are a couple of different mechanisms by which the lang you set can affect how text gets rendered:
different fonts being selected based on the lang attribute, either:
based on the browser's default font selection rules, or
because you've explicitly set up language-specific fonts using :lang selectors in your CSS
or
fonts having language-specific rules included in them, such as language-specific alternative glyphs or language-specific rules about which sequences of characters to substitute with a ligature
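The explicit :lang route from the list above might be sketched as follows. The font names are assumptions (they happen to ship with recent versions of Windows); substitute whatever fonts are available on your system:

```html
<style>
  /* Font names are assumptions; each rule applies to any element whose
     language is, or inherits as, the given one. */
  :lang(zh) { font-family: "Microsoft YaHei", sans-serif; }
  :lang(ja) { font-family: "Yu Gothic", sans-serif; }
</style>
<p>Chinese: <span lang="zh">飴</span></p>
<p>Japanese: <span lang="ja">飴</span></p>
```

With rules like these, the font choice no longer depends on the browser's own per-language defaults.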
Below I will present a couple of interesting examples I could discover of such language-specific rendering happening.
Language-dependent forms of Han characters
There exist many Han (Chinese) characters that have been adopted in other east-Asian languages, such as Japanese (where such characters are called "Kanji"). The proper way to draw these characters sometimes differs between Chinese and the other languages that have assimilated them, yet, due to Unicode's Han unification, there only exists a single Unicode code point to represent the character, rather than a distinct code point for each language-specific variant of it. Several examples are listed in the Examples of language-dependent glyphs section of the Wikipedia article linked above.
When rendering such a character, in order to know which glyph to display (for instance, whether to display the Japanese Kanji or the Chinese hanzi), the browser needs to know the language of the text in which the character appears.
To try to see your browser considering text's language in this way, save the following HTML to a file and open it in your browser:
Chinese: <span lang="zh">飴</span>
<br>
Japanese: <span lang="ja">飴</span>
Note that the same character, 飴, is used in both spans. But they display differently in the browser, at least in Chrome on my Windows PC:
As you can see, the Kanji rendered in the span marked as Japanese is different in several ways from the hanzi rendered in the span marked as Chinese. By inspecting each span in the Chrome dev tools and looking at the "Rendered Fonts" section, I can see that this is because Chrome has used different fonts for the two spans - namely Microsoft YaHei for the Chinese span and Yu Gothic for the Japanese one.
fi ligatures getting disabled for Turkish text
As described at https://en.wikipedia.org/wiki/Ligature_(writing)#Stylistic_ligatures, a stylistic ligature is used in many fonts that merges together the letters fi into a single combined glyph, where the top-right corner of the f merges with the dot above the i. In most languages, like English, this looks pretty and doesn't make the text any less readable.
However, such a ligature is problematic in Turkish or other languages where the dotted and dotless I both exist and are distinct characters, because it makes it impossible to tell whether it represents fi (an f followed by a dotted i) or fı (an f followed by a dotless ı).
For that reason, fonts that include a substitution of fi with such a ligature will hopefully have that substitution only occur in languages for which it's appropriate. As I understand it, in OpenType, such rules are implemented by making "features" in the font specific to particular "language systems" via the Language System Table.
To see this in action, I downloaded a font with such a fi ligature - specifically Okta Neue - and created the following demo page:
<style>
@font-face {
font-family: oktaneue;
src: url("Groteskly Yours - Okta Neue UltraLight.otf");
}
* {
font-family: oktaneue;
}
</style>
<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>
Note that this time - unlike in the earlier example with hanzi and Kanji - both spans are using the same font. But, because the font itself contains language-specific features, the spans nonetheless render differently:
As you can see, the fi ligature gets used for the span labelled as English, but not for the one labelled as Turkish - which is what we wanted!
The lang attribute tells the client what language the document (or part of document) is written in. This is useful for any software which cares about language.
A key use is for accessibility. It is mentioned in WCAG:
This Success Criterion helps:
people who use screen readers or other technologies that convert text into synthetic speech;
people who find it difficult to read written material with fluency and accuracy, such as recognizing characters and alphabets or decoding
words;
people with certain cognitive, language and learning disabilities who use text-to-speech software
people who rely on captions for synchronized media.
Adrian Roselli describes some benefits:
Hyphens
By using lang, you get the benefits of hyphen support in your (modern)
browser that you otherwise would not get (assuming you use hyphens:
auto in your CSS).
Accessibility
At the very least, lang is a benefit for screen reader users,
particularly when your users don’t have the same primary language as
your site. It allows proper pronunciation and inflection when the page
is spoken.
… as well as referencing WCAG and pointing at this document from the W3C which lists benefits such as being able to write CSS which styles elements based on the language they are written in (so different fonts can be used for different languages), automatically selecting fonts with the right version of a glyph for a language, to aid search engines, spell checks and translation tools, to help speech synthesizers and Braille translators, and for custom scripting.
As far as I can tell, for a single-language website (hopefully you can surmise its utility for multi-language websites), the only real, actual, tangible use is that it makes the 'hyphens' CSS property work as expected... which is not much, but more than nothing. (I'm afraid I haven't actually tested this in a browser, however, which is something we should all do to know things for sure.)
Via: http://blog.adrianroselli.com/2015/01/on-use-of-lang-attribute.html (which is also full of irrelevant "reasons" to use it, save that mentioned).
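A small test page for trying the hyphenation claim yourself could look like the sketch below. It assumes the file has no lang on any ancestor of the last paragraph, and the sample words are arbitrary long words chosen for illustration; whether a given browser hyphenates depends on it having a hyphenation dictionary for the declared language:

```html
<style>
  /* A narrow column forces line breaks, so hyphenation is easy to spot. */
  p {
    width: 8em;
    hyphens: auto;
  }
</style>
<p lang="en">Internationalization considerations are extraordinarily important.</p>
<p lang="de">Donaudampfschifffahrtsgesellschaft und Rindfleischetikettierung.</p>
<!-- No language declared: the browser may decline to hyphenate automatically. -->
<p>Internationalization considerations are extraordinarily important.</p>
```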
The difference between lang and custom attributes is that lang is inherited, so even a child element of an element with attribute lang=en can be selected with the selector div:lang(en){}.
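To illustrate the difference, here is a small sketch comparing an attribute selector with the :lang() pseudo-class. Non-inherited properties (outline and border) are used on purpose so that CSS property inheritance does not blur the result:

```html
<style>
  /* Matches only divs that carry the lang attribute themselves. */
  div[lang="en"] { outline: 2px dashed red; }
  /* Matches any div whose language is English, including a language
     inherited from an ancestor (and subtags such as en-US). */
  div:lang(en) { border: 1px solid green; margin: 4px; }
</style>
<div lang="en">
  Outer div: has the attribute, so both rules match.
  <div>Inner div: no lang attribute of its own, so only the :lang(en) rule matches.</div>
</div>
```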
What HTML markup and tags should I use if I write something like the following in an article?
This `foreign word` is translated from the foreign language as `the word in the reader's native language`.
Use the most appropriate markup (using a generic element if nothing better presents itself) with a lang attribute.
<body lang="en">
<!-- etc -->
<p><span lang="de">unbekanntes Flugobjekt</span> is German for UFO.</p>
This won't generally provide automatic translation, but the option exists for browsers / browser extensions to provide such a mechanism. Translation tools such as Google Translate may use it as a hint to identify the "from" language. Text to speech software may use it to select a pronunciation guide. And so on.
There is no HTML markup specifically for such purposes. It really depends on the conventions of the human language used on the page, as well as presentation style. Typically, either quotation marks or italics are used when mentioning words or expressions, as opposed to using them normally. For these, there are different options in HTML. Quotation marks are best written as such, using proper characters as per language rules, though some people still think that q markup is useful. For italics, you can use i markup or CSS font-style: italic.
In any case, if it is relevant to your purposes somehow that translations are marked up, e.g. in order to style them uniformly later, the best shot is to use classes.
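One way the class-based approach might look is sketched below. The class names term and translation are hypothetical, chosen only for this example, and the styling is arbitrary:

```html
<style>
  /* "term" and "translation" are hypothetical class names for this sketch. */
  .term { font-style: italic; }
  .translation { font-weight: bold; }
</style>
<p>
  <span class="term" lang="de">unbekanntes Flugobjekt</span> is German for
  <span class="translation">UFO</span>.
</p>
```

Because the styling hangs off the classes, every term and translation on the site can be restyled later from a single place in the stylesheet.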
The use of lang markup is recommendable in principle, and it is gaining some practical importance (e.g., for automatic hyphenation). In the following example, the span markup is used only to indicate the language (because you need an element for that):
The French word “<span lang=fr>cheval</span>” means “horse”.
Some languages have an accepted transliteration to Latin characters, such as Hindi, Russian or Japanese. For example, the Hindi for 'The man is eating' written in Devanagari script is 'आदमी खा रहा है।'. Transliterated, it would be 'Aadmi kha raha hai.' (or something similar; this approach is often used online, especially if people don't have access to a Hindi keyboard.)
In this case, we're using the Latin script but still writing Hindi, so it would be acceptable to mark up either variation using the lang attribute:
<span lang="hi">आदमी खा रहा है।</span> or <span lang="hi">Aadmi kha raha hai.</span>
My question then is about languages that are normally written in the Latin alphabet themselves, but might have phonetic guides for non-speakers/learners — either IPA or ad hoc pronunciation — is there any best practice in terms of giving it semantic meaning?
For example, in Irish if I were to say "The man is eating", I would say "Tá an fear ag ithe." I can mark this up as:
<span lang="ga">Tá an fear ag ithe.</span>
If I were to give a pronunciation guide for non-speakers, I might say "Taw on far eg ih-he". The sentence isn't meaningless (like 'lorem ipsum' text), but it isn't in either English or Irish.
What is the correct use of language related attributes in HTML in this case, or is this use case just not covered currently by the specification?
Short version: if you want to specifically say it's written in the Latin alphabet, go for "hi-Latn" or "ga-Latn" for the examples you gave.
Long version:
The W3C spec for the lang attribute doesn't specifically mention this - it suggests some uses of this that depend on orthography (such as using it in order to render high-quality versions of the characters used), but some that don't (such as for search engines).
RFC 1766, which specifies the format for the language tags, suggests that specialisations of tags may be used to represent "script variations, such as az-arabic and az-cyrillic". There's more about the script subtag in this article on the W3C site, and a bit extra in the later RFC 5646. That one points to an ISO standard list of script names, and in that list the script you'd want is "Latn" as they're romanised forms of other scripts.
(This doesn't cover things like specifying how you did the transliteration, though, for languages which may have more than one standard e.g. Chinese in Latin script using Wade-Giles versus pinyin.)
For most practical purposes, it does not matter, since browsers, search engines, and other relevant programs generally ignore lang attributes. The attributes may affect the choice of font, but only when the page itself does not suggest fonts (which is rare). Some speech browsers recognize a few values for lang and adapt their functionality accordingly. And if you open an HTML document in MS Word, it recognizes the lang markup and applies language-specific spelling tools. But all this is rather limited and rarely matters much. Moreover, in these cases, only the simplest types of language codes are recognized.
In principle, it is possible to indicate the writing system (“script”), such as Latin vs. Devanagari, and the transliteration or transcription system that has been used. This has been described in BCP 47. But for the most part, these are guidelines for implementors, not something you could use here and now.
For example, you can write <span lang="hi-Latn">Aadmi kha raha hai.</span> to indicate that the content is in Hindi but written in Latin letters. And there is, in principle at least, a way to indicate which of the competing romanization systems has been used. I don’t think any web-related software recognizes lang="hi-Latn"; programs might fail to recognize it even if they recognize lang="hi".
So you can use detailed values for lang, but it’s not of much use. Using simple markup like lang="hi" for any major fragment in another language (say, a sentence or more) is good practice, though not much more. Before spending too much time on it, consider what practical benefits you could expect. For example, if you consider using a client-side hyphenator like hyphenate.js, then lang markup becomes essential; but then you need to check out the expectations of that software, rather than just general specifications.
A word of warning: I have seen odd results when using lang="ru" for Russian written in Latin letters. The reason is that browsers may switch to their idea of “font for Russian”, causing a mix of fonts. But the simple remedy is to make some consistent font settings for all of your texts, overriding browser defaults, in cases like this.
Strings like “Taw on far eg ih-he” cannot be meaningfully classified as being in some language. If you use language markup, use lang="" (with empty string as value), since this is the defined way of explicitly indicating that the language is not indicated!
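Applied to the Irish example from the question, that suggestion might look like the following sketch (the surrounding English sentence is made up for illustration):

```html
<p lang="en">
  In Irish, "The man is eating" is
  <span lang="ga">Tá an fear ag ithe.</span>
  <!-- The empty value explicitly opts the ad hoc respelling out of the inherited language. -->
  (roughly pronounced <span lang="">Taw on far eg ih-he</span>).
</p>
```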
You might want to look into marking it up as <ruby>.
For example:
<ruby lang="hi">आदमी<rt>Aadmi</rt> खा<rt>kha</rt> रहा<rt>raha</rt> है।<rt>hai</rt></ruby>
What is the difference between <html lang="en"> and <html lang="en-US">? What other values can follow the dash?
According to w3.org "Any two-letter subcode is understood to be a [ISO3166] country code." so does that mean any value listed under the alpha-2 code is an accepted value?
<html lang="en">
<html lang="en-US">
The first lang value specifies only a language code. The second specifies a language code followed by a country code.
What other values can follow the dash? According to w3.org "Any
two-letter subcode is understood to be a [ISO3166] country code." so
does that mean any value listed under the alpha-2 code is an accepted
value?
Yes, however the value may or may not have any real meaning.
<html lang="en-US"> essentially means "this page is in the US style of English." In a similar way, <html lang="en-GB"> would mean "this page is in the United Kingdom style of English."
If you really wanted to specify an unusual combination, you could. It wouldn't mean much, but <html lang="en-ES"> is valid according to the specification, as I understand it. However, that language/country combination won't do much, since English isn't commonly spoken in Spain.
I mean does this somehow further help the browser to display the page?
It doesn't help the browser to display the page, but it is useful for search engines, screen readers, and other things that might read and try to interpret the page, besides human beings.
This should help:
http://www.w3.org/International/articles/language-tags/
The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.
language-extlang-script-region-variant-extension-privateuse
You can use any country code, yes, but that doesn't mean a browser or other software will recognize it or do anything differently because of it. For example, a screen reader might deal with "en-US" and "en-GB" the same if they only support an American accent in English. Another piece of software that has two distinct voices, though, could adjust according to the country code.
RFC 3066 gives the details of the allowed values (emphasis and links added):
All 2-letter subtags are interpreted as ISO 3166 alpha-2 country codes
from [ISO 3166], or subsequently assigned by the ISO 3166 maintenance
agency or governing standardization bodies, denoting the area to which
this language variant relates.
I interpret that as meaning any valid (according to ISO 3166) 2-letter code is valid as a subtag. The RFC goes on to state:
Tags with second subtags of 3 to 8 letters may be registered with
IANA, according to the rules in chapter 5 of this document.
By the way, that looks like a typo, since chapter 3 seems to relate to the registration process, not chapter 5.
A quick search for the IANA registry reveals a very long list of all the available language subtags. Here's one example from the list (which would be used as en-scouse):
Type: variant
Subtag: scouse
Description: Scouse
Added: 2006-09-18
Prefix: en
Comments: English Liverpudlian dialect known as 'Scouse'
There are all sorts of subtags available; a quick scroll has already revealed fr-1694acad (17th century French).
The usefulness of some of these (I would say the vast majority of these) tags, when it comes to documents designed for display in the browser, is limited. The W3C Internationalization specification simply states:
Browsers and other applications can use information about the language
of content to deliver to users the most appropriate information, or to
present information to users in the most appropriate way. The more
content is tagged and tagged correctly, the more useful and pervasive
such applications will become.
I'm struggling to find detailed information on how browsers behave when encountering different language tags, but they are most likely going to offer some benefit to those users who use a screen reader, which can use the tag to determine the language/dialect/accent in which to present the content.
XML Schema requires that the xml namespace be declared and imported before using xml:lang (and other xml namespace values)
RELAX NG predeclares the xml namespace, as in XML, so no additional declaration is needed.
Well, the first question is easy. There are many kinds of en (English), but (mostly) only one US English. One would guess there are also en-CN, en-GB and en-AU. There might even be an Austrian English, but that is more a case of "yes, you can" than "yes, there is".