Can I use custom language / country code combinations in hreflang?

I'm internationalizing a large website in 14 languages.
I have found that some of the language/country combinations we want to target do not have lang-cc entries in various lists, e.g. the .NET culture list and language-codes-and-iso-country-codes-for-html5.
As an example, Danish is widely spoken in Greenland.
We are translating our site into Danish for users in Denmark.
It therefore makes sense to offer the translated content to Danish speakers in Greenland as well (especially as Google Greenland exists: google.gl). However, the lang/country code for this is not listed in the resources we have found.
So, can we safely use da-gl in hreflang and as a subdirectory to target Danish speakers in Greenland even though that combination is not listed in the various resources we've found?
(Please note that we can't simply redirect users from Greenland to the Danish version targeted at Denmark as there are differences in currency and shipping prices, and we are trying to avoid any IP based redirection / content customisation.)

There is no list of "valid" combinations, as nobody can (nor should!) define which languages are spoken in which regions, or which linguistic variations exist.
HTML5 defines which content the hreflang attribute can have:
hreflang for a/area
hreflang for link
On Webmasters SE, I explained what this means (for the lang attribute, but it’s the same for hreflang): my answer to "Where do I get a list of attribute 'lang' values - what standard covers this, for SEO optimization?"
As you see, you only have to follow the rules from BCP 47 and choose tags from the Language Subtag Registry.
Thus da-GL is a valid value for hreflang/lang:
da is the subtag for the language Danish
- is the subtag separator
GL is the subtag for the region Greenland
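The breakdown above can be checked mechanically. Here is a minimal sketch in Python, assuming a simplified tag shape of language, optional script, and optional region; the function name parse_language_tag is made up for illustration. It only checks the shape of the tag. It does not look anything up in the Language Subtag Registry, so a well-formed tag with unregistered subtags would still pass.

```python
import re

# Simplified shape of a BCP 47 tag: language[-Script][-REGION].
# Assumption: variants, extensions, and private-use subtags are ignored here.
_TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_language_tag(tag):
    """Split a tag into subtags, or return None if the shape is wrong."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Conventional casing: language lower, script Title, region UPPER.
    # Tags are case-insensitive, so "da-gl" and "da-GL" mean the same thing.
    parts["language"] = parts["language"].lower()
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["region"]:
        parts["region"] = parts["region"].upper()
    return parts


print(parse_language_tag("da-GL"))  # well-formed: language + region
print(parse_language_tag("da_GL"))  # underscore is not the separator
```

Since matching is case-insensitive, da-gl as a directory name and da-GL in the attribute refer to the same tag.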

Related

hreflang tags for Europe

I'd always understood that, because there is no en-EU IETF tag, to target Europe I'd have to either use en or list out an hreflang for each country within Europe.
Looking at the source of Think With Google, I noticed the following:
<link rel="alternate" href="//thinkwithgoogle.com/intl/es-419/" hreflang="es-419" />
Which appears to use a UN M.49 code to indicate Spanish for the Latin America and Caribbean region.
The UN M.49 Wikipedia page also lists a code for Europe - 150.
Does it follow that I could have the following hreflang tag to indicate English for the European market?
<link rel="alternate" href="//example.com/intl/en-150/" hreflang="en-150" />
In the Google docs it states the hreflang format consists of XX-YY where:
XX is ISO 639-1
YY is ISO 3166-1
By this it is conceivable that "en-eu" is actually correct, as the ISO 3166-1 entry for EU is noted as being "reserved on request of ISO 4217/MA for the European monetary unit Euro", a reservation which dates back to 1998. It specifically lists the "eu" code as relating to the ".eu" top-level domain.
Most of the talk on the net around "en-eu" seems to be opinion rather than evidence based or direct from Google.
It would be worth running some experiments on "en-eu". I have also asked for more details on this in the Google doc as they specifically show examples for GBP and USD but ignore Euro.
Based on the above, "en-eu" makes the most sense if reserved codes are considered valid by Googlebot.
If you are not going to specify languages for each European country, just set hreflang="en". If you are going to specify English for each European country, be aware that only IE (Ireland), MT (Malta) and GB (United Kingdom) have English as an official language.
As per W3, you can use the region subtags listed in the UN M.49 region codes. So the tag hreflang="en-150" is valid.
The document also states
Only one region subtag can appear in a language tag, and it must appear after the language subtag
So you can't combine, for example, 009 (Oceania) and 150 (Europe) in a single tag.
And since the IANA Language Subtag Registry lists 150, you can use it as any other region tag.
Reference: https://www.w3.org/International/articles/language-tags/#region
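For illustration, here is a small Python sketch that emits link elements in the same form as the examples quoted in the question. The function name hreflang_links and the example.com URLs are assumptions made up for this sketch. It shows only how the markup would be written; whether a given crawler acts on a tag such as en-150 is a separate question that the language tag rules do not settle.

```python
from html import escape


def hreflang_links(alternates):
    """Return one <link rel="alternate"> line per (hreflang, URL) pair."""
    lines = []
    for tag, url in alternates.items():
        lines.append(
            '<link rel="alternate" href="{}" hreflang="{}" />'.format(
                escape(url, quote=True), escape(tag, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical URLs on example.com, for illustration only.
print(hreflang_links({
    "en": "https://example.com/",
    "en-150": "https://example.com/intl/en-150/",
    "es-419": "https://example.com/intl/es-419/",
}))
```

Each tag carries exactly one region subtag, so Europe (150) and Latin America (419) get separate entries.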

HTML language code

I have small sections of text on my site that are in French (Canadian). When I look up on the W3schools language code reference, I find there is only a list of specific languages but not location. Is writing <span lang="fr-CA"> valid? Does adding the location-base of the language have any impact at all on screen-readers?
Yes, <span lang="fr-CA"> is valid, but you should not expect it to provide any benefits over <span lang="fr"> in most cases. In fact I have used a screen reader that recognized a few values for the lang attribute, but only when simple codes like en and fr were used.
According to HTML5 PR, the lang attribute value is a language code, or “language tag” as they call it, as defined in BCP 47, which is a concatenation of RFCs Tags for Identifying Languages and Matching of Language Tags. It defines a rather complicated system of codes. Most of the possibilities have no use in HTML in practice, now or in the foreseeable future. The code system has been defined to meet many different needs, including bibliographic information and text databases.
Using fr-CA is possible in accordance with RFC 5646, clause 2.2.4. Region Subtag:
Region subtags are used to indicate linguistic variations
associated with or appropriate to a specific country, territory, or
region. Typically, a region subtag is used to indicate variations
such as regional dialects or usage, or region-specific spelling
conventions. It can also be used to indicate that content is
expressed in a way that is appropriate for use throughout a region,
for instance, Spanish content tailored to be useful throughout
Latin America.
The subtag CA is the ISO 3166-1 country code for Canada, so fr-CA denotes French as spoken and written in Canada, i.e. Canadian French. It is possible that a screen reader that can speak French in different variants and recognizes lang attributes will use Canadian pronunciation for an element that has lang="fr-CA". However, this is probably very theoretical and would be of little practical impact if it were actually implemented.
More realistically, such an attribute may have other effects. If you open an HTML document in MS Word, it will recognize lang attributes. Whether this has any practical impact depends on whether and how Word treats Canadian French differently from French in general, e.g. in spelling checks.
The next reference on W3schools describes valid country codes and how to use them with language codes.
Ex: <html lang="en-US">
In regards to screen readers, it would depend on how the reader was implemented. Generally they would get the language from <html> but it is possible that some implementations may allow for the language to change over the course of the document.
The lang attribute can be used in any html element, as described in the HTML specifications, so <span lang="fr-CA"> would be valid.
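Software that only knows simple codes could still make use of lang="fr-CA" by falling back to a shorter tag. The Python sketch below follows the truncation idea of the "lookup" scheme in RFC 4647 (Matching of Language Tags). The function name lookup and the list of supported voices are assumptions for illustration; this is not a description of how any particular screen reader behaves.

```python
def lookup(requested, supported):
    """Return the best supported tag for `requested`, or None.

    Follows the truncation idea of the RFC 4647 "lookup" scheme:
    drop the last subtag and retry until something matches.
    """
    # Tags are compared case-insensitively; keep the original spelling.
    available = {tag.lower(): tag for tag in supported}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # A single-character subtag (e.g. "x") cannot end a tag; drop it too.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None


# Hypothetical set of voices a speech engine might ship with.
print(lookup("fr-CA", ["en", "fr"]))
```

With only en and fr available, fr-CA falls back to fr, which is why the more specific tag does no harm even where it brings no benefit.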

What to do if "webpage is missing meta language information"?

A webmaster-tools utility complains that my page is missing meta language information.
What should I do if my page contains mixture of text in multiple languages (all Western European languages)?
Most “webmaster-tools” complaints are best ignored (or not read at all). To get help with specific messages, please identify the tool, the message, and your URL.
A mixture of languages is a problem in itself, and best avoided by putting different language versions into different pages. If you need a mix of languages, use the lang attribute to specify the language of each part.
There is very little useful that you can do with meta tags, regarding a mixture of languages.
Have you looked at Multi-regional and multilingual sites article? It provides some good suggestions.

What is the correct way to use the lang attribute with phonetic pronunciations (if at all)?

Some languages have an accepted transliteration to Latin characters, such as Hindi, Russian or Japanese. For example, the Hindi for 'The man is eating' written in Devanagari script is 'आदमी खा रहा है।'. Transliterated, it would be 'Aadmi kha raha hai.' (or something similar; this approach is often used online, especially if people don't have access to a Hindi keyboard.)
In this case, we're using the Latin script but still writing Hindi, so it would be acceptable to mark up either variation using the lang attribute:
<span lang="hi">आदमी खा रहा है।</span> or <span lang="hi">Aadmi kha raha hai.</span>
My question then is about languages that are normally written in the Latin alphabet themselves, but might have phonetic guides for non-speakers/learners — either IPA or ad hoc pronunciation — is there any best practice in terms of giving it semantic meaning?
For example, in Irish if I were to say "The man is eating", I would say "Tá an fear ag ithe." I can mark this up as:
<span lang="ga">Tá an fear ag ithe.</span>
If I were to give a pronunciation guide for non-speakers, I might say "Taw on far eg ih-he". The sentence isn't meaningless, (like 'lorem ipsum' text) but neither is the sentence in either English or Irish.
What is the correct use of language related attributes in HTML in this case, or is this use case just not covered currently by the specification?
Short version: if you want to specifically say it's written in the Latin alphabet, go for "hi-Latn" or "ga-Latn" for the examples you gave.
Long version:
The W3C spec for the lang attribute doesn't specifically mention this - it suggests some uses of this that depend on orthography (such as using it in order to render high-quality versions of the characters used), but some that don't (such as for search engines).
RFC1766, which specifies the format for the language tags, suggests that specialisations of tags may be used to represent "script variations, such as az-arabic and az-cyrillic". There's more about the script subtag in this article on the W3C site, and a bit extra in the later RFC5646. That one points to an ISO standard list of script names, and in that list the script you'd want is "Latn" as they're romanised forms of other scripts.
(This doesn't cover things like specifying how you did the transliteration, though, for languages which may have more than one standard e.g. Chinese in Latin script using Wade-Giles versus pinyin.)
For most practical purposes, it does not matter, since browsers, search engines, and other relevant programs generally ignore lang attributes. The attributes may affect the choice of font, but only when the page itself does not suggest fonts (which is rare). Some speech browsers recognize a few values for lang and adapt their functionality accordingly. And if you open an HTML document in MS Word, it recognizes the lang markup and applies language-specific spelling tools. But all this is rather limited and rarely matters much. Moreover, in these cases, only the simplest types of language codes are recognized.
In principle, it is possible to indicate the writing system (“script”), such as Latin vs. Devanagari, and the transliteration or transcription system that has been used. This has been described in BCP 47. But for the most part, it’s guidelines for implementors, not something you could use here and now.
For example, you can write <span lang="hi-Latn">Aadmi kha raha hai.</span> to indicate that the content is in Hindi but written in Latin letters. And there is, in principle at least, a way to indicate which of the competing romanization systems has been used. I don’t think any web-related software recognizes lang="hi-Latn"; programs might even fail to recognize it even if they recognize lang="hi".
So you can use detailed values for lang, but it’s not of much use. Using simple markup like lang="hi" for any major fragment in another language (say, a sentence or more) is good practice, though not much more. Before spending too much time on it, consider what practical benefits you could expect. For example, if you consider using a client-side hyphenator like hyphenate.js, then lang markup becomes essential; but then you need to check out the expectations of that software, rather than just general specifications.
A word of warning: I have seen odd results when using lang="ru" for Russian written in Latin letters. The reason is that browsers may switch to their idea of “font for Russian”, causing a mix of fonts. But the simple remedy is to make some consistent font settings for all of your texts, overriding browser defaults, in cases like this.
Strings like “Taw on far eg ih-he” cannot be meaningfully classified as being in some language. If you use language markup, use lang="" (with empty string as value), since this is the defined way of explicitly indicating that the language is not indicated!
You might want to look into marking it up as <ruby>.
For example:
<ruby lang="hi">आदमी<rt>Aadmi</rt> खा<rt>kha</rt> रहा<rt>raha</rt> है।<rt>hai</rt></ruby>
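If the word/pronunciation pairs come from data rather than hand-written markup, the same structure can be generated. This is a minimal Python sketch; the function name ruby_markup is made up for illustration, and it assumes one rt annotation per base word, as in the example above.

```python
from html import escape


def ruby_markup(pairs, lang=None):
    """Build a <ruby> element from (base text, annotation) pairs."""
    # The lang attribute is optional and describes the base text.
    attr = ' lang="{}"'.format(escape(lang, quote=True)) if lang else ""
    body = " ".join(
        "{}<rt>{}</rt>".format(escape(base), escape(reading))
        for base, reading in pairs
    )
    return "<ruby{}>{}</ruby>".format(attr, body)


print(ruby_markup(
    [("आदमी", "Aadmi"), ("खा", "kha"), ("रहा", "raha"), ("है।", "hai")],
    lang="hi",
))
```

The same call works for the Irish sentence, e.g. ruby_markup([("Tá", "Taw"), ("an", "on")], lang="ga").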

What are situations with western languages where you'd use HTML 5's Ruby element?

HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of Ruby annotations in use for Chinese and Japanese text. I'm wondering though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?
id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
As a linguist, I can see the benefits in using <ruby> for marking up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest Webkit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left for LaTeX and friends, and if on the web, generally a non-accessible image.
As I understand it, ruby annotations are not really relevant in Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters which don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but those are rarely used and I don't know if Ruby annotations would be appropriate for them.
My list:
theoretical notational conventions (miketaylr's answer) http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer) id-ee-oh-SINK-ruh-sees foo idiosyncrasies bar - made with ascii 'nbsp' art
abbreviation, acronym, initialism (possibly - why hover?)
learning technical terms of English origin accidentally translated to your non-English native language
I'm often forced to do the latter in uni. While the translated terminology is often consistent, very often it's not at all self-explanatory, or less so than the original English term.
Also the same term may have been translated using several translation systems by different authors/groups.
Another problem group is when, for example, queue, row, series (and sometimes tuple) are translated to the very same word in your language.
Given a Western language with fewer speakers, and the low percentage of technical people in the population, it is actually much easier to learn the topic directly from English and then learn the translations in a second step.
Ruby could be a tool to transform this into a one-step process, providing either the translations or the original as a "Furigana".