What are situations with western languages where you'd use HTML 5's Ruby element? - html

HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of ruby annotations in use for Chinese and Japanese text. I'm wondering, though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?

id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
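For instance (a minimal sketch, not from the original answer), the pronunciation guide above could be marked up as:

```html
<ruby>idiosyncrasies<rt>id-ee-oh-SINK-ruh-sees</rt></ruby>
```

In supporting browsers, the respelled pronunciation renders in small type above the word itself.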

As a linguist, I can see the benefits of using <ruby> for marking up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest WebKit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left to LaTeX and friends, and if it appears on the web, it's generally a non-accessible image.

As I understand it, ruby annotations are not really relevant in Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters which don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but those are rarely used and I don't know if Ruby annotations would be appropriate for them.

My list:
theoretical notational conventions (miketaylr's answer): http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer): id-ee-oh-SINK-ruh-sees over idiosyncrasies - made with ASCII 'nbsp' art
abbreviation, acronym, initialism (possibly - why hover?)
learning technical terms of English origin that have been translated into your non-English native language
I'm often forced to do the latter at university. While the translated terminology is usually consistent, very often it's not at all self-explanatory, or at least not as much as the original English term.
Also, the same term may have been translated using several translation systems by different authors/groups.
Another problem arises when, for example, queue, row, and series (and sometimes tuple) are all translated to the very same word in your language.
Given a Western language with fewer speakers, and the low percentage of technical people in the population, this actually makes it much easier to learn the topic directly in English and then learn the translations in a second step.
Ruby could be a tool to turn this into a one-step process, providing either the translations or the original terms as a kind of "furigana".
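For example (a hypothetical sketch; the word choice is mine, using the German term for 'queue'), the original English term could be annotated over the translated one:

```html
<ruby lang="de">Warteschlange<rt lang="en">queue</rt></ruby>
```

The lang attributes mark the base text and the annotation as being in different languages.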

Related

creating a common embedding for two languages

My task deals with multiple languages (English and Hindi). For that I need a common embedding to represent both languages.
I know there are methods for learning multilingual embeddings, like MUSE, but these represent the two embeddings in a common vector space - obviously the results are similar, but not the same.
So I want to know if there is any method or approach that can learn to represent both languages in the form of a single embedding.
Any lead is greatly appreciated!
I think a good lead would be to look at past work that has been done in the field. A good overview to start with is Sebastian Ruder's talk, which gives you a multitude of approaches, depending on the level of information you have about your source/target language. This is basically what MUSE is doing, and I'm relatively sure that it is considered state-of-the-art.
The basic idea in most approaches is to map one embedding space onto the other such that you minimize some (usually Euclidean) distance between the two (see p. 16 of the link). This obviously works best if you have a known dictionary and can precisely map the different translations, and works even better if the two languages have similar linguistic properties (not so sure about Hindi and English, to be honest).
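This mapping idea can be sketched with the classic orthogonal Procrustes solution (an illustrative toy with synthetic data, not the actual MUSE code): given embeddings of known translation pairs stacked as rows of X and Y, find the orthogonal map W minimizing ||XW - Y||.

```python
import numpy as np

def procrustes_align(X, Y):
    # Orthogonal Procrustes: the W minimizing ||X @ W - Y||_F over
    # orthogonal matrices is U @ Vt, where X^T Y = U @ diag(s) @ Vt.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # "source-language" vectors (dictionary rows)
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden rotation between the two spaces
Y = X @ R                                     # "target-language" vectors of the same words
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))                  # the learned map recovers the rotation
```

MUSE's supervised mode refines essentially this kind of linear map; the unsupervised mode bootstraps the dictionary adversarially first.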
Another recent approach is the one by Multilingual-BERT (mBERT), or similarly, XLM-RoBERTa, but those learn embeddings based on a shared vocabulary. This might again be less desirable if you have morphologically dissimilar languages, and also has the drawback that they incorporate a bunch of other, unrelated, languages.
Otherwise, I'm unclear on what exactly you are expecting from a "common embedding", but happy to extend the answer once clarified.

Using BERT in order to detect language of a given word

I have words in Hebrew. Some of them are originally English, and some are 'Hebrew English', meaning words that come from English but are written with Hebrew characters.
For example: 'insulin' in Hebrew is "אינסולין" (same phonetic sound).
I have a simple binary dataset.
X: words (Written with Hebrew characters)
y: label 1 if the word is originally in English and is written with Hebrew characters, else 0
I've tried using a classifier, but its input is full text, and my input is just single words.
I don't want any MASKING to happen, I just want simple classification.
Is it possible to use BERT for this task? Thanks
BERT is intended to work with words in context. Without context, a BERT-like model is equivalent to a simple word2vec lookup (there is fancy tokenization, but I don't know how well it works with Hebrew - probably not very efficiently). So if you really want to use distributional features in your classifier, you can take a pretrained word2vec model instead - it's simpler than BERT, and no less powerful.
But I'm not sure it will work anyway. Word2vec and its equivalents (like BERT without context) don't know much about the inner structure of a word - only about the contexts it is used in. In your problem, however, word structure is more important than possible contexts. For example, the words בלוטת (gland), דם (blood), and סוכר (sugar) often occur in the same contexts as insulin, but בלוטת and דם are Hebrew, whereas סוכר is English (okay, originally Arabic, but we are probably not interested in such ancient origins). You just cannot predict that from context alone.
So why not start with some simple model (e.g. logistic regression or even naive bayes) over simple features (e.g. character n-grams)? Distributional features (I mean w2v) may be added as well, because they tell about topic, and topics may be informative (e.g. in medicine, and technology in general, there are probably relatively more English words than in other domains).
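A minimal sketch of that last suggestion - character n-grams fed to logistic regression - using scikit-learn. The words and labels here are invented toy data for illustration, not a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: Hebrew-script words; label 1 = English loanword, 0 = native word.
words = ["אינסולין", "טלפון", "אינטרנט", "דם", "מים", "ספר", "בלוטת", "לחם"]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

clf = make_pipeline(
    # Character n-grams capture word-internal spelling patterns,
    # which is exactly the signal context-based embeddings miss.
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(words, labels)
print(clf.predict(["אמבולנס"]))  # classify an unseen word
```

With a real training set of a few thousand labeled words, this kind of baseline is usually hard to beat for loanword detection; naive Bayes over the same features is a reasonable alternative.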

HTML language code

I have small sections of text on my site that are in French (Canadian). When I look at the W3Schools language code reference, I find only a list of specific languages, not locations. Is writing <span lang="fr-CA"> valid? Does adding the location part of the language code have any impact at all on screen readers?
Yes, <span lang="fr-CA"> is valid, but you should not expect it to provide any benefit over <span lang="fr"> in most cases. In fact, I have used a screen reader that recognized a few values for the lang attribute, but only when simple codes like en and fr were used.
According to the HTML5 PR, the lang attribute value is a language code, or "language tag" as they call it, as defined in BCP 47, which is a combination of the RFCs Tags for Identifying Languages and Matching of Language Tags. It defines a rather complicated system of codes. Most of the possibilities have no use in HTML in practice, now or in the foreseeable future. The code system has been designed to meet many different needs, including bibliographic information and text databases.
Using fr-CA is possible in accordance with RFC 5646, clause 2.2.4. Region Subtag:
Region subtags are used to indicate linguistic variations
associated with or appropriate to a specific country, territory, or
region. Typically, a region subtag is used to indicate variations
such as regional dialects or usage, or region-specific spelling
conventions. It can also be used to indicate that content is
expressed in a way that is appropriate for use throughout a region,
for instance, Spanish content tailored to be useful throughout
Latin America.
The subtag CA is the ISO 3166-1 country code for Canada, so fr-CA denotes French as spoken and written in Canada, i.e. Canadian French. It is possible that a screen reader that can speak French in different variants and recognizes lang attributes will use Canadian pronunciation for an element that has lang="fr-CA". However, this is probably very theoretical and would be of little practical impact if it were actually implemented.
More realistically, such an attribute may have other effects. If you open an HTML document in MS Word, it will recognize lang attributes. Whether this has any practical impact depends on whether and how Word treats Canadian French differently from French in general, e.g. in spelling checks.
The next reference on W3schools describes valid country codes and how to use them with language codes.
Ex: <html lang="en-US">
In regards to screen readers, it would depend on how the reader was implemented. Generally they would get the language from <html> but it is possible that some implementations may allow for the language to change over the course of the document.
The lang attribute can be used in any html element, as described in the HTML specifications, so <span lang="fr-CA"> would be valid.

Pitfalls when performing internationalization / localization with numbers?

When developing an application that will need to work with a variety of localizations, particularly with "right to left" text, is there a possibility of a case where numbers would need to be converted to "right to left" as well?
I'm no language scholar, but I know the RTL languages I am familiar with present their numbers in LTR.
For instance (using google translate):
I have 345 apples.
In Arabic:
لدي 345 التفاح.
So, I have two questions:
Is it possible to run into a language that uses RTL numbers?
How should internationalizing be handled in such cases?
or,
Is the "accepted norm" to just do numbers using Western Arabic characters, read from left to right?
In the big right-to-left scripts - Arabic, Hebrew, and Thaana - numbers always run left to right. (When I say "Arabic", I refer to all the languages written in the Arabic script - Arabic, Farsi, Urdu, Pashto, and many others.)
Hebrew and Thaana always use European digits, the same 0-9 set as English. There's nothing much to do there, because Unicode automatically takes care of ordering the numbers correctly. But see the comments about isolation below.
It's possible to use European digits in Arabic, too; for example, the Arabic Wikipedia uses them. However, Arabic texts very frequently use a different set of digits - https://en.wikipedia.org/wiki/Eastern_Arabic_numerals . It depends on your users' preferences. Notice also that in Persian the digits are slightly different. From the point of view of right-to-left layout they behave pretty much the same way as European digits, although there are slight differences in the behavior of mathematical signs - for example, the minus sign can go on the other side. There are some subtleties here, but they are mostly edge cases.
In both Hebrew and Arabic you may run into a problem with bidi-isolation. For example, if you have a Hebrew paragraph in which you have an English word, and after the word you have numbers, the numbers will appear to the right of the word, although you may have wanted them to appear on the left. That's how the Unicode bidi algorithm works by default. To resolve such things you can use the Unicode control characters RLM and LRM. If you are using HTML5, you can also use the <bdi> tag for this, as well as the CSS rule "unicode-bidi: isolate". These CSS and HTML5 solutions are quite powerful and elegant, but aren't supported in all browsers yet.
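A minimal sketch of the isolation fix described above (the Hebrew sentence is invented for illustration; it reads "A new version of Windows 10 was released today"):

```html
<!-- Without isolation, the "10" can end up on the wrong side of "Windows". -->
<p dir="rtl">גרסה חדשה של <bdi>Windows 10</bdi> שוחררה היום.</p>

<!-- Equivalent CSS approach for browsers without <bdi> support: -->
<p dir="rtl">גרסה חדשה של <span style="unicode-bidi: isolate;">Windows 10</span> שוחררה היום.</p>
```

Both variants tell the bidi algorithm to lay out "Windows 10" as an isolated left-to-right run inside the right-to-left paragraph.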
I am aware of one script in which the digits run right-to-left: N'Ko, which is used for some languages of Africa. I actually saw websites written in it, but it is far less common than Hebrew and Arabic.
Finally, if you're using JavaScript, you can use the free jquery.i18n library for automatic number conversion. See https://github.com/wikimedia/jquery.i18n . (Disclaimer: I am one of this library's developers.)
Numbers will generally translate as you have them. Even in languages that read in different directions the Western Arabic numbers are typically recognized by the user.

What is the correct way to use the lang attribute with phonetic pronunciations (if at all)?

Some languages have an accepted transliteration to Latin characters, such as Hindi, Russian or Japanese. For example, the Hindi for 'The man is eating' written in Devanagari script is 'आदमी खा रहा है।'. Transliterated, it would be 'Aadmi kha raha hai.' (or something similar; this approach is often used online, especially if people don't have access to a Hindi keyboard.)
In this case, we're using the Latin script but still writing Hindi, so it would be acceptable to mark up either variation using the lang attribute:
<span lang="hi">आदमी खा रहा है।</span> or <span lang="hi">Aadmi kha raha hai.</span>
My question then is about languages that are normally written in the Latin alphabet themselves, but might have phonetic guides for non-speakers/learners — either IPA or ad hoc pronunciation — is there any best practice in terms of giving it semantic meaning?
For example, in Irish if I were to say "The man is eating", I would say "Tá an fear ag ithe." I can mark this up as:
<span lang="ga">Tá an fear ag ithe.</span>
If I were to give a pronunciation guide for non-speakers, I might say "Taw on far eg ih-he". The sentence isn't meaningless (like 'lorem ipsum' text), but neither is it a sentence in English or Irish.
What is the correct use of language related attributes in HTML in this case, or is this use case just not covered currently by the specification?
Short version: if you want to specifically say it's written in the Latin alphabet, go for "hi-Latn" or "ga-Latn" for the examples you gave.
Long version:
The W3C spec for the lang attribute doesn't specifically mention this - it suggests some uses of this that depend on orthography (such as using it in order to render high-quality versions of the characters used), but some that don't (such as for search engines).
RFC 1766, which specified the format for language tags, suggests that specialisations of tags may be used to represent "script variations, such as az-arabic and az-cyrillic". There's more about the script subtag in this article on the W3C site, and a bit extra in the later RFC 5646. That one points to an ISO standard list of script names, and in that list the script you'd want is "Latn", as these are romanised forms of other scripts.
(This doesn't cover things like specifying how you did the transliteration, though, for languages which may have more than one standard e.g. Chinese in Latin script using Wade-Giles versus pinyin.)
For most practical purposes, it does not matter, since browsers, search engines, and other relevant programs generally ignore lang attributes. The attributes may affect the choice of font, but only when the page itself does not suggest fonts (which is rare). Some speech browsers recognize a few values for lang and adapt their functionality accordingly. And if you open an HTML document in MS Word, it recognizes the lang markup and applies language-specific spelling tools. But all this is rather limited and rarely matters much. Moreover, in these cases, only the simplest types of language codes are recognized.
In principle, it is possible to indicate the writing system (“script”), such as Latin vs. Devanagari, and the transliteration or transcription system that has been used. This has been described in BCP 47. But for the most of it, it’s guidelines for implementors, not something you could use here and now.
For example, you can write <span lang="hi-Latn">Aadmi kha raha hai.</span> to indicate that the content is in Hindi but written in Latin letters. And there is, in principle at least, a way to indicate which of the competing romanization systems has been used. I don’t think any web-related software recognizes lang="hi-Latn"; programs might even fail to recognize it even if they recognize lang="hi".
So you can use detailed values for lang, but they're not of much use. Using simple markup like lang="hi" for any major fragment in another language (say, a sentence or more) is good practice, though not much more. Before spending too much time on it, consider what practical benefits you could expect. For example, if you consider using a client-side hyphenator like hyphenate.js, then lang markup becomes essential; but then you need to check the expectations of that software, rather than just general specifications.
A word of warning: I have seen odd results when using lang="ru" for Russian written in Latin letters. The reason is that browsers may switch to their idea of “font for Russian”, causing a mix of fonts. But the simple remedy is to make some consistent font settings for all of your texts, overriding browser defaults, in cases like this.
Strings like “Taw on far eg ih-he” cannot be meaningfully classified as being in some language. If you use language markup, use lang="" (with empty string as value), since this is the defined way of explicitly indicating that the language is not indicated!
You might want to look into marking it up as <ruby>.
For example:
<ruby lang="hi">आदमी<rt>Aadmi</rt> खा<rt>kha</rt> रहा<rt>raha</rt> है।<rt>hai</rt></ruby>