Why does dir="rtl" change order, but only sometimes? - html

Why does
<input type="text" dir="rtl" value="08/15 word">
render into word 08/15 and not into 08/15 word?
Why does
<input type="text" dir="rtl" value="one word">
render into one word?
Why is the order switched in the first case but not in the second one?
https://jsfiddle.net/powtac/4aLn71mb/

dir=rtl gives the base direction of the text. Within that text, words with Latin letters will still be written left-to-right and multiple words with Latin letters (i.e. English sentences) will be ordered left-to-right as well!
The same is true for e.g. Arabic in dir=ltr (default) base direction: it will be written right-to-left anyway, and multiple Arabic words will be ordered form right-to-left, even though the environment tells otherwise.
For that to work, the browser (or text renderer in general) uses the Unicode bidi algorithm. The algorithm by itself needs to know the directionality of the characters, which is part of Unicode as well. Latin and Arabic letters have a strong directionality, while the “normal” numbers have a weak left-to-right direction.
This distinction between weak and strong is made, because numbers are used in both ltr and rtl languages. The number itself is always written left-to-right, but it will be word-ordered right-to-left if surrounded by strong rtl words, and it will be word-ordered left-to-right if surrounded by strong ltr words.
If the number is not surrounded by strong typed words, the base direction is used.
Your ltr-“08/15” is not surrounded by ltr, so it is put rtl (base direction) from ltr-word “word”.
“one word” are two strong directioned words, so they are layed out ltr themselves and the words layed out ltr as well. No matter what the base direction is.
Try “first 08/15 second” and the numbers will be considered part of a bigger ltr sentence and laid out ltr.

dir=RTL Characteristics
The order of digits in numbers (such as phone numbers) doesn’t
differ from left-to-right writing.
When combining symbols that can be used in both RTL and LTR
languages (such as periods, commas, or other punctuation marks),
their displayed positions will depend on the direction of the text.
This is because the data format starts from the beginning, but a
browser is still processing an RTL word in the RTL direction and
punctuation is converted towards the direction that has been
specified.
For more information, see this article

Related

how does the html property "dir" work, specifically when assigned "auto"?

How does <div dir=auto>bla bla שלום bla</div> work?
By majority of words? First character?
I'm looking for a solution that looks at the majority of characters in a span in order to know its direction (an angularjs directive would also be good, but if this is already built in HTML I'd like to know).
When an element has its dir set to "auto", the direction of the
element is determined based on its first strong directionality
character, or default to the directionality of its parent element.
https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement.dir
Characters with the left-to-right, right-to-left, and right-to-left
Arabic types are called “strong directional”, or just “strong”
characters. Numbers are a special case; their reading direction is
always left to right, but they do not affect the reading direction of
neighboring characters. Even numbers that are displayed with
Arabic-Indic digits have a left-to-right character direction.
In the example you posted, it'll be ltr.

What character should I use to maintain height of an empty (zero width) string?

I have a string that can potentially be empty, and in that case, I want to substitute it with a special character to maintain the ordinary text height while having zero width. In TeX, this would be called \strut. What is the counterpart for that in HTML? I came up with two candidates: ⁠ and . Should I use one of these?
On modern browsers, any zero-width character will do the job, provided that the browser either knows that the character is zero-width or uses a font that contains an empty glyph for it. But some characters may have effects, depending on the context and on software used to process the HTML file.
U+2060 WORD JOINER has the effect of preventing line break.
U+FEFF ZERO WIDTH NO-BREAK SPACE has the same effect. It is formally deprecated for any use except as Byte Order Mark, but in reality it works more often than WORD JOINER (though there are exceptions).
U+200B ZERO WIDTH SPACE has the effect of allowing a line break even when it would otherwise not be permitted; it’s like SPACE, but with zero width.
Usually the worst-case scenario for characters like this is an old version of IE. Checking in IE 6 shows that U+FEFF and U+200B are OK, but U+2060 shows as a small rectangle (i.e., the browser tries to render the character but finds no glyph for it).
So I’d use  or ​ depending on whether I’d like to prevent or allow line break at that point. If it does not matter, ​ is more logical to use.
I would suggest  or if zero width is not essential or if it is essential you could try the Unicode character ⁠ which is a zero width non-breaking space.

Why is a trailing punctuation mark rendered at the start with direction:rtl?

This is more a sort of curiosity. While working on a multilingual web application I noticed that certain characters like punctuation marks (!?.;,) at the end of a block element are rendered as if they were placed at the beginning instead when the writing direction is right-to-left (as it is the case for certain Asian languages I do not speak).
In other words, The string
Hello, World!
is rendered as
!Hello, World
when placed in a div block with direction: rtl
This becomes even more evident if the text is split in two parts and given different colors: a contiguous chunk of text at the end is rendered in two separated regions:
http://jsfiddle.net/22Qk9/
What's the point of this behavior? I guess this must be a peculiarity of (all?) right-to-left languages which is automatically handled by the browser, so I don't need to care about it, or should I?
If you want to fix this behavior add the LRM character ‎ in the end. It's a non=printing character.
Source : http://dotancohen.com/howto/rtl_right_to_left.html
Example : http://jsfiddle.net/yobjj6ed/
The reason is that the exclamation mark “!” has the BiDi class O.N. ('Other Neutrals'), which means effectively that it adapts to the directionality of the surrounding text. In the example case, it is therefore placed to the left of the text before it. This is quite correct for languages written right to left: the terminating punctuation mark appears at the end, i.e. on the left.
Normally, you use the CSS code direction: rtl or, preferably, the HTML attribute dir=rtl for texts in a language that is written right to left, and only for them. For them, this behavior is a solution, not a problem.
If you instead use direction: rtl or dir=rtl just for special effects, like making table columns laid out right to left, then you need to consider the implications. For example, in the table case, you would need to set direction to ltr for each cell of the table (unless you want them to be rendered as primarily right to left text).
If you have, say, an English sentence quoted inside a block of Arabic text, then you need to set the directionality of an element containing the English text to ltr, e.g.
<blockquote dir=ltr>Hello, World!</blockquote>
A similar case (just with Arabic inside English text) is discussed as use case 6 in the W3C document What you need to know about the bidi algorithm and inline markup (which has a few oddities, though, like using cite markup for quoted text, against W3C recommendations).
The accepted answer https://stackoverflow.com/a/20799360/477420 works if you can control markup/CSS of the value, if you have no control over HTML following approach could work.
If you don't know if page will be rendered RTL or LTR but some text is definitely LTR (i.e. English-only) you can wrap the value with LRE/PDF marks to signify that is LTR region. Text will be rendered LTR irrespective of page's LTR or RTL direction.
This works when you have some code that tries to render text without ability to change markup of how exactly it will show up on the page. I.e. you rendering value for "song tile" or "company name" field in some nested child component (or server side) without ability to control surrounding HTML elements.
One drawback of this and similar approaches (like LRM proposal in this question) with adding marks to text is copy-paste of such value from the resulting HTML page will generally preserve the marks but they are not visible/zero width. While for most cases it is fine consider if that is a problem for you.
Approximate sample code (some companies have "Inc." at the end which will end up with dot at the beginning when rendered as-is on RTL page):
// comanyName = "Alphabet Inc." - really likes dot at the end including RTL
if(stringIsDefinitelyAscii(companyName))
{
companyName = "\u202A" + companyName + "\u202C"
}
return companyName;
Details on LRE/PDF symbols can be found in https://unicode.org/reports/tr9/#Explicit_Directional_Embeddings:
LRE U+202A LEFT-TO-RIGHT EMBEDDING
Treat the following text as embedded left-to-right.
PDF U+202C POP DIRECTIONAL FORMATTING End the scope of the last LRE, RLE, RLO, or LRO.
Some approaches to figure out if string has RTL characters can be found in How to detect whether a character belongs to a Right To Left language?, JavaScript: how to check if character is RTL?, How to detect if a string contains any Right-to-Left character?.

what is the principle of displaying different languages(Arabic and English) together?

Such as this sentence:
عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC
Arabic is RTL and English is LTR. Sometimes after copy and paste the text goes disorder. When I move the cursor inside the sentence between English and Arabic characters it jumps in a very strange way. And I am also confused with how this stored in the memory. Can anyone help to explain this?
In memory this is all stored as a sequence of Unicode code points (hopefully; there were very werid things before that, but let's not go there) – that's the text itself, how it is represented in the computer. The text is independent from writing direction at first, it's just a sequence of characters.
This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and thus can shape the text into glyphs to display at a particular position. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that a is a LTR character while א is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still ( in the text, even though you see )); and several characters can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; those are then graphemes which is essentially what we perceive as a “letter”.
Cursor movement is easy to do then, because the cursor can only be betweeen two graphemes (it gets more complicated at the start of a LTR or RTL segment, but let's leave it at that for now) and → moved it forwards through them while ← moves backwards. In RTL forwards means left, of course; it follows the text direction. What order the two graphemes have relative to each other doesn't really matter in positioning the cursor.
I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.
Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common problem is application or layout engine support for the respective script. If the layout engine does not know how to layout Arabic text all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction applied. For example, browsers have quite good support by now for this kind of thing, but if I take the Arabic text and paste it into Word it will look wrong (was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same, it's just the display that's wrong.
Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial what layout is concerned. This is a recollection of how I think it might work and might not be actual fact.
The letters are stored in logical order; meaning that a sentence such as "Hello! Salaam!" is in fact stored with the letters in precisely that order.
In addition to that, however, certain unicode flags are also added to the text that inform the text layout engine that the "Salaam" part of the sentence should be reversed when displayed; so the final text layout becomes "Hello! maalaS!", as well it should be.
These flags are either set through natural BIDI classification; e.g. غ; or through use of the Unicode RTL and LTR markers, U+200E and U+200F.
If you pay attention, the cursor doesn't in fact jump strangely, it always follows logical character order.

How does Zalgo text work?

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11 of Combining Characters in the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with given height. Combining marks may be rendered above, below, or inside a base character
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
Zalgo text works because of combining characters. These are special characters that allow to modify character that comes before.
OR
y + ̆ = y̆ which actually is
y + ̆ = y̆
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
To produce a list of combining diacritical marks you can use the following script (since links keep on dying)
for(var i=768; i<879; i++){console.log(new DOMParser().parseFromString("&#"+i+";", "text/html").documentElement.textContent +" "+"&#"+i+";");}
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾