HTML lang Attributes For Multilingual Site

I have a website that uses mainly English, but also incorporates a fair amount of Japanese.
What are best practices for multilingual sites? Declaring two languages in the html tag isn't exactly accurate, and inline lang markup seems redundant and heavy.
How should I approach this?

Use the lang attribute on the html element and/or on any element containing a sentence or more in a different language. You could count e.g. a book title or other longish phrase as a sentence. But using, say, lang=ja for any Japanese name occurring in English text is more or less pointless, unless you have a tangible practical reason to do so.
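A minimal sketch of that advice, assuming a mainly English page with one Japanese sentence (the sample sentences and the title are made up for illustration):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Travel notes</title>
</head>
<body>
  <!-- Inherits lang="en" from the html element; the name is left unmarked -->
  <p>I visited Yokohama last spring.</p>

  <!-- A full sentence in Japanese gets its own lang attribute -->
  <p lang="ja">横浜に行きました。</p>
</body>
</html>
```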
Language markup has several potential uses, some of which are slowly materializing. Search engines are not known to make use of it, but e.g. automatic hyphenation (when invoked) uses it, and some speech synthesis software has used it. If you open an HTML document in Microsoft Word, it recognizes the lang markup and can run spell checking accordingly (for supported languages).
Such utilization of lang markup is most important regarding the document as a whole or major parts thereof. The more fine-grained the markup would be, the smaller its potential usefulness becomes, as a rule, in addition to being tedious to generate. It is not even clear whether e.g. a German proper name should be treated as German text in all respects when appearing inside English text.
Special caveat: The use of lang markup may affect the choice of font when left to the browser. For example, if you write I visited <span lang=ja>Yokohama</span>, which is logically sound (though I visited <span lang=ja-Latn>Yokohama</span> would be more accurate), you may well get the word “Yokohama” in a font different from the surrounding text on Firefox. The reason is that the browser uses different default fonts for different languages when declared in markup. But this is of course of no concern if you set the overall font family of text in CSS, as most authors do.
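One way to sidestep the font switch, sketched under the assumption that a single font stack suits all the text on the page (the font names here are placeholders, not a recommendation):

```html
<style>
  /* An explicit font stack on body is inherited by every element,
     so the browser's per-language default fonts only come into play
     if none of the named fonts is available. */
  body {
    font-family: Georgia, "Times New Roman", serif;
  }
</style>

<p>I visited <span lang="ja-Latn">Yokohama</span> last spring.</p>
```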

inline lang markup seems redundant and heavy
Well, this is the best way to do it. You can dig into the spec if you want.

Related

Does lang="en" refer to the file or the webpage? [duplicate]

In HTML, it's good to have a lang attribute in <html>, e.g. <html lang="en">.
How is this useful?
If it is meant for translation: even when the language is set to English and all the text in the document is Chinese, Google Translate detects it as Chinese, not English (this means Google ignores the lang attribute).
I am quoting this from W3C:
Declaring language in HTML
Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.
Use language tags from the IANA Language Subtag Registry.
Also a good read is Why use the language attribute?.
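For the XHTML 1.x and polyglot case in the W3C advice quoted above, a minimal sketch with both attributes set to the same value might look like this (the page content is invented for illustration):

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Example</title>
  </head>
  <body>
    <!-- Content in another language repeats the pair on the wrapping element -->
    <p>Text in English, then <span lang="fr" xml:lang="fr">une phrase en français</span>.</p>
  </body>
</html>
```

For a page served only as ordinary HTML, the lang attribute alone is enough.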
You asked "how is this useful".
"The <lang=> attribute can be used to declare the language of a Web
page or a portion of a Web page. This is meant to assist search engine
spiders, page formatting and screen reader technology"
Source: http://symbolcodes.tlt.psu.edu/web/tips/langtag.html (Wayback Machine link)
No mention of translation - but often a search engine spider will not want to parse through a document "in the wrong language" - its index file will grow (lots of new words), and the results will not be useful to the user (who cannot read the language, and who is using the wrong search terms).
The advent of smart translation technology (like Google's, referred to above) means that some search engines can see a page in one language, translate it, and figure out that someone searching for "cow" may be interested in this page that mentions "vache" and is marked with lang="fr".
The lang attribute is needed by screen readers to let them pronounce words correctly, and also (perhaps surprisingly) sometimes needed to allow text to be rendered correctly by the browser.
lang needed for speech synthesis
Some blind or visually impaired people use speech-synthesizing screen readers that speak the words on the screen. Since two words from different languages that are spelt identically may be pronounced differently, such speech synthesis cannot be done without knowing the language of the text. For instance, the word "pain" in English is pronounced completely differently to the word "pain" in French, so a screen reader that doesn't know whether it's reading English or French won't know how to pronounce "pain".
Using the lang attribute indicates to a screen reader what language some text is in and thus allows it to pronounce the word correctly.
I recorded a demonstration of this using Narrator, the built-in screen reader for Windows. (If you'd like to reproduce this, do note that you'll need to have both the English and French voice packages installed via the Speech settings page in the Windows Settings app, and have English as your default voice.) The demo uses an HTML page with the following content:
<h5>No lang specified:</h5>
<p>J'aime le pain</p>
<h5>French:</h5>
<p lang="fr">J'aime le pain</p>
As you can hear in the recording I uploaded at https://www.youtube.com/watch?v=7J1I65sn1CQ, Microsoft George (the default English voice) butchers the pronunciation of the French phrase (pronouncing it "Jay aim le payne"), whereas Microsoft Hortense (the default French voice) pronounces it correctly.
lang needed for text rendering
Perhaps surprisingly, the benefits of the lang attribute are not limited to disabled people using speech-synthesizing assistive tech. Setting lang can also affect text rendering, since the correct way to render some text can be language-dependent.
There are a couple of different mechanisms by which the lang you set can affect how text gets rendered:
different fonts being selected based on the lang attribute, either:
based on the browser's default font selection rules, or
because you've explicitly set up language-specific fonts using :lang selectors in your CSS
or
fonts having language-specific rules included in them, such as language-specific alternative glyphs or language-specific rules about which sequences of characters to substitute with a ligature
Below I will present a couple of interesting examples I could discover of such language-specific rendering happening.
Language-dependent forms of Han characters
There exist many Han (Chinese) characters that have been adopted in other east-Asian languages, such as Japanese (where such characters are called "Kanji"). The proper way to draw these characters sometimes differs between Chinese and the other languages that have assimilated them, yet, due to Unicode's Han unification, there only exists a single Unicode code point to represent the character, rather than a distinct code point for each language-specific variant of it. Several examples are listed in the Examples of language-dependent glyphs section of the Wikipedia article linked above.
When rendering such a character, in order to know which glyph to display (for instance, whether to display the Japanese Kanji or the Chinese hanzi), the browser needs to know the language of the text in which the character appears.
To try to see your browser considering text's language in this way, save the following HTML to a file and open it in your browser:
Chinese: <span lang="zh">飴</span>
<br>
Japanese: <span lang="ja">飴</span>
Note that the same character, 飴, is used in both spans. But they display differently in the browser, at least in Chrome on my Windows PC:
As you can see, the Kanji rendered in the span marked as Japanese is different in several ways from the hanzi rendered in the span marked as Chinese. By inspecting each span in the Chrome dev tools and looking at the "Rendered Fonts" section, I can see that this is because Chrome has used different fonts for the two spans - namely Microsoft YaHei for the Chinese span and Yu Gothic for the Japanese one.
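Rather than relying on the browser's defaults, the same font choice can be made explicit with :lang selectors. This is only a sketch: it assumes the two fonts named above are installed, which is typical on Windows but not elsewhere, so other platforms would need different names in the stack.

```html
<style>
  /* Pick fonts per language explicitly instead of relying on browser defaults */
  :lang(zh) { font-family: "Microsoft YaHei", sans-serif; }
  :lang(ja) { font-family: "Yu Gothic", sans-serif; }
</style>

Chinese: <span lang="zh">飴</span>
<br>
Japanese: <span lang="ja">飴</span>
```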
fi ligatures getting disabled for Turkish text
As described at https://en.wikipedia.org/wiki/Ligature_(writing)#Stylistic_ligatures, a stylistic ligature is used in many fonts that merges together the letters fi into a single combined glyph, where the top-right corner of the f merges with the dot above the i. In most languages, like English, this looks pretty and doesn't make the text any less readable.
However, such a ligature is problematic in Turkish or other languages where the dotted and dotless I both exist and are distinct characters, because it makes it impossible to tell whether it represents fi (an f followed by a dotted i) or fı (an f followed by a dotless ı).
For that reason, fonts that include a substitution of fi with such a ligature will hopefully have that substitution only occur in languages for which it's appropriate. As I understand it, in OpenType, such rules are implemented by making "features" in the font specific to particular "language systems" via the Language System Table.
To see this in action, I downloaded a font with such a fi ligature - specifically Okta Neue - and created the following demo page:
<style>
@font-face {
font-family: oktaneue;
src: url("Groteskly Yours - Okta Neue UltraLight.otf");
}
* {
font-family: oktaneue;
}
</style>
<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>
Note that this time - unlike in the earlier example with hanzi and Kanji - both spans are using the same font. But, because the font itself contains language-specific features, the spans nonetheless render differently:
As you can see, the fi ligature gets used for the span labelled as English, but not for the one labelled as Turkish - which is what we wanted!
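If a font applies its fi ligature regardless of language, a possible fallback is to switch the ligature off in CSS for Turkish text. This is a sketch, assuming the font exposes fi as a common ligature (the usual case) and that the text is marked with lang="tr":

```html
<style>
  /* Fallback for fonts without Turkish-specific rules:
     disable common ligatures (which normally include fi) in Turkish text */
  :lang(tr) {
    font-variant-ligatures: no-common-ligatures;
  }
</style>

<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>
```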
The lang attribute tells the client what language the document (or part of document) is written in. This is useful for any software which cares about language.
A key use is for accessibility. It is mentioned in WCAG:
This Success Criterion helps:
people who use screen readers or other technologies that convert text into synthetic speech;
people who find it difficult to read written material with fluency and accuracy, such as recognizing characters and alphabets or decoding
words;
people with certain cognitive, language and learning disabilities who use text-to-speech software
people who rely on captions for synchronized media.
Adrian Roselli describes some benefits:
Hyphens
By using lang, you get the benefits of hyphen support in your (modern)
browser that you otherwise would not get (assuming you use hyphens:
auto in your CSS).
Accessibility
At the very least, lang is a benefit for screen reader users,
particularly when your users don’t have the same primary language as
your site. It allows proper pronunciation and inflection when the page
is spoken.
… as well as referencing WCAG and pointing at this document from the W3C which lists benefits such as being able to write CSS which styles elements based on the language they are written in (so different fonts can be used for different languages), automatically selecting fonts with the right version of a glyph for a language, to aid search engines, spell checks and translation tools, to help speech synthesizers and Braille translators, and for custom scripting.
As far as I can tell, for a single-language website (hopefully you can surmise its utility for multi-language websites), the only real, actual, tangible use is that it makes the 'hyphens' CSS property work as expected... which is not much, but more than nothing. (I'm afraid I haven't actually tested this in a browser, however, which is something we should all do to know things for sure.)
Via: http://blog.adrianroselli.com/2015/01/on-use-of-lang-attribute.html (which is also full of irrelevant "reasons" to use it, save that mentioned).
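A small page for testing the hyphenation claim yourself could look like the following sketch. The column width and sample words are arbitrary choices, and whether anything gets hyphenated depends on the browser shipping a hyphenation dictionary for the declared language.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Hyphenation demo</title>
  <style>
    /* Narrow column so that long words have to break */
    p {
      width: 8em;
      hyphens: auto;
    }
  </style>
</head>
<body>
  <!-- Hyphenated by English rules (inherited from html) -->
  <p>Internationalization considerations notwithstanding</p>

  <!-- Hyphenated by German rules -->
  <p lang="de">Donaudampfschifffahrtsgesellschaft</p>
</body>
</html>
```

Removing the lang attributes and reloading should show whether the browser still hyphenates without a declared language.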
The difference between lang and custom attributes is that lang is inherited, so even a child element of an element with attribute lang=en can be selected with the selector div:lang(en){}.
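To illustrate the inheritance point, here is a sketch contrasting the :lang() pseudo-class with a plain attribute selector (the colours and border are arbitrary):

```html
<style>
  /* Matches the paragraph: its language is inherited from the div */
  p:lang(en) { color: green; }

  /* Matches only elements that carry the attribute themselves, i.e. the div */
  [lang="en"] { border: 1px solid gray; }
</style>

<div lang="en">
  <p>This paragraph has no lang attribute of its own.</p>
</div>
```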

HTML tags for translation

What HTML markup and tags should I use if I write the following in an article?
This `foreign word` is translated from the foreign language as `the word in the reader's native language`.
Use the most appropriate markup (using a generic element if nothing better presents itself) with a lang attribute.
<body lang="en">
<!-- etc -->
<p><span lang="de">unbekanntes Flugobjekt</span> is German for UFO.</p>
This won't generally provide automatic translation, but the option exists for browsers / browser extensions to provide such a mechanism. Translation tools such as Google Translate may use it as a hint to identify the "from" language. Text to speech software may use it to select a pronunciation guide. And so on.
There is no HTML markup specifically for such purposes. It really depends on the conventions of the human language used on the page, as well as presentation style. Typically, either quotation marks or italic is used when mentioning words or expressions, rather than using them in normal use. For these, there are different options in HTML. Quotation marks are best written as such, using proper characters as per language rules, though some people still think that q markup is useful. For italic, you can use i markup or CSS font-style: italic.
In any case, if it is relevant to your purposes somehow that translations are marked up, e.g. in order to style them uniformly later, the best shot is to use classes.
The use of lang markup is recommendable in principle, and it is gaining some practical importance (e.g., for automatic hyphenation). In the following example, the span markup is used only to indicate the language (because you need an element for that):
The French word “<span lang=fr>cheval</span>” means “horse”.
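The options above can also be combined. In this sketch, i markup carries both the italic presentation and the language, and a class marks the translation so it can be styled uniformly later (the class name gloss is made up for this example):

```html
<style>
  /* "gloss" is a hypothetical class name for translations */
  .gloss { color: #555; }
</style>

<p>The French word <i lang="fr">cheval</i> means <span class="gloss">“horse”</span>.</p>
```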

What is the correct way to use the lang attribute with phonetic pronunciations (if at all)?

Some languages have an accepted transliteration to Latin characters, such as Hindi, Russian or Japanese. For example, the Hindi for 'The man is eating' written in Devanagari script is 'आदमी खा रहा है।'. Transliterated, it would be 'Aadmi kha raha hai.' (or something similar; this approach is often used online, especially if people don't have access to a Hindi keyboard.)
In this case, we're using the Latin script but still writing Hindi, so it would be acceptable to mark up either variation using the lang attribute:
<span lang="hi">आदमी खा रहा है।</span> or <span lang="hi">Aadmi kha raha hai.</span>
My question then is about languages that are normally written in the Latin alphabet themselves, but might have phonetic guides for non-speakers/learners — either IPA or ad hoc pronunciation — is there any best practice in terms of giving it semantic meaning?
For example, in Irish if I were to say "The man is eating", I would say "Tá an fear ag ithe." I can mark this up as:
<span lang="ga">Tá an fear ag ithe.</span>
If I were to give a pronunciation guide for non-speakers, I might say "Taw on far eg ih-he". The sentence isn't meaningless, (like 'lorem ipsum' text) but neither is the sentence in either English or Irish.
What is the correct use of language related attributes in HTML in this case, or is this use case just not covered currently by the specification?
Short version: if you want to specifically say it's written in the Latin alphabet, go for "hi-Latn" or "ga-Latn" for the examples you gave.
Long version:
The W3C spec for the lang attribute doesn't specifically mention this - it suggests some uses of this that depend on orthography (such as using it in order to render high-quality versions of the characters used), but some that don't (such as for search engines).
RFC 1766, which specifies the format for the language tags, suggests that specialisations of tags may be used to represent "script variations, such as az-arabic and az-cyrillic". There's more about the script subtag in this article on the W3C site, and a bit extra in the later RFC 5646. That one points to an ISO standard list of script names, and in that list the script you'd want is "Latn" as they're romanised forms of other scripts.
(This doesn't cover things like specifying how you did the transliteration, though, for languages which may have more than one standard e.g. Chinese in Latin script using Wade-Giles versus pinyin.)
For most practical purposes, it does not matter, since browsers, search engines, and other relevant programs generally ignore lang attributes. The attributes may affect the choice of font, but only when the page itself does not suggest fonts (which is rare). Some speech browsers recognize a few values for lang and adapt their functionality accordingly. And if you open an HTML document in MS Word, it recognizes the lang markup and applies language-specific spelling tools. But all this is rather limited and rarely matters much. Moreover, in these cases, only the simplest types of language codes are recognized.
In principle, it is possible to indicate the writing system (“script”), such as Latin vs. Devanagari, and the transliteration or transcription system that has been used. This has been described in BCP 47. But for the most part, it consists of guidelines for implementors, not something you could use here and now.
For example, you can write <span lang="hi-Latn">Aadmi kha raha hai.</span> to indicate that the content is in Hindi but written in Latin letters. And there is, in principle at least, a way to indicate which of the competing romanization systems has been used. I don’t think any web-related software recognizes lang="hi-Latn"; programs might fail to recognize it even if they recognize lang="hi".
So you can use detailed values for lang, but it’s not of much use. Using simple markup like lang="hi" for any major fragment in another language (say, a sentence or more) is good practice, though not much more. Before spending too much time on it, consider what practical benefits you could expect. For example, if you consider using a client-side hyphenator like hyphenate.js, then lang markup becomes essential; but then you need to check out the expectations of that software, rather than just general specifications.
A word of warning: I have seen odd results when using lang="ru" for Russian written in Latin letters. The reason is that browsers may switch to their idea of “font for Russian”, causing a mix of fonts. But the simple remedy is to make some consistent font settings for all of your texts, overriding browser defaults, in cases like this.
Strings like “Taw on far eg ih-he” cannot be meaningfully classified as being in some language. If you use language markup, use lang="" (with empty string as value), since this is the defined way of explicitly indicating that the language is not indicated!
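Applied to the Irish example from the question, that could look like the following sketch, which assumes the surrounding page is declared as English:

```html
<p>
  <span lang="ga">Tá an fear ag ithe.</span>
  <!-- Empty value: the language of the respelling is explicitly left unspecified -->
  (pronounced <span lang="">Taw on far eg ih-he</span>)
</p>
```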
You might want to look into marking it up as <ruby>.
For example:
<ruby lang="hi">आदमी<rt>Aadmi</rt> खा<rt>kha</rt> रहा<rt>raha</rt> है।<rt>hai</rt></ruby>
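The same idea for the Irish sentence from the question, as a sketch: the rp elements supply parentheses for browsers without ruby support, and putting lang="" on each rt is an assumption borrowed from the previous answer rather than an established convention.

```html
<ruby lang="ga">
  Tá<rp>(</rp><rt lang="">Taw</rt><rp>)</rp>
  an<rp>(</rp><rt lang="">on</rt><rp>)</rp>
  fear<rp>(</rp><rt lang="">far</rt><rp>)</rp>
  ag<rp>(</rp><rt lang="">eg</rt><rp>)</rp>
  ithe<rp>(</rp><rt lang="">ih-he</rt><rp>)</rp>
</ruby>
```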

What is a correct approach to using strong/em tags when localising strings?

I know some languages emphasise words differently to English, e.g. via changing word endings rather than stressing words with inflection of the voice.
If you are localising a site, would you trust that <strong> and <em> tags (and their placement) will have the same meaning in other languages — would you maintain this emphasis, check with your translator or leave them out?
What I'm wondering is how this translates (excuse the pun) into the semantics of the web? — Strong and em tags carry semantic meaning that is used within SEO, screen-readers etc. So should they be left in place so this isn't lost, or dropped to better conform with the target language?
Mark-up is there to convey meaning as a whole, and so long as the meaning is conveyed, you have succeeded in your mark-up. So in a language where stress emphasis is conveyed in the text, using tags to signal the emphasis is redundant and optional.
Inline level markup, much more so than block level markup, may need to be radically different in different languages. In a good translation, the text should be marked up from scratch in each language.
For individual words, I would leave the markup tags in the translation strings for the reasons outlined in comments above. If emphasising blocks of text, whole sentences, numerals, etc, I'd put it in the template if possible as it's not really something that would need to be messed with by the translation.
A good idea might be to flag in the template that you have done this using comments. You will also need some reliable process for getting all the translation files changed if you ever decide to alter the emphasis (which you inevitably will). This is a pain, so I tend to avoid adding emphasis to individual words wherever possible :)
Interesting question! The only thing I can add is that strong and em tags are only useful for SEO if the search engines connect those tags with the content on your site.
I'd recommend using these tags only if there's an actual reason (for emphasis, say) rather than hoping to gain SEO benefit from bolding or italicizing keywords.
For screen readers, it comes down to what language you are talking about. With JAWS, for example, you can download voice files. If the language isn't listed, then users have to choose another language or find alternative means. The key thing is for you to set the lang attribute correctly.