Such as this sentence:
عفواً يبدو أن النظام لا يستطيع تحديد أنك من عملاء STC أم لا، فإذا كنت عميل STC الرجاء الضغط على زر "إعادة المحاولة"، وإذا لم تكن من عملاء STC الرجاء الضغط على زر " لست عميل STC
Arabic is RTL and English is LTR. Sometimes after copy and paste the text gets out of order. When I move the cursor inside the sentence between English and Arabic characters, it jumps in a very strange way. I am also confused about how this is stored in memory. Can anyone help explain this?
In memory this is all stored as a sequence of Unicode code points (hopefully; there were very weird things before that, but let's not go there). That's the text itself, how it is represented in the computer. The text is independent of writing direction at first; it's just a sequence of characters.
This sequence goes through a rendering engine that knows the Unicode Bidi algorithm and thus can shape the text into glyphs to display at a particular position. Every character in Unicode has a Bidi property that controls how it behaves in such contexts. This specifies that a is a LTR character while א is an RTL character; it controls that parentheses are correctly mirrored in RTL contexts (an opening parenthesis is still ( in the text, even though you see )); and several characters can appear in both contexts. This is all very simplified, and there are quite a few things at work there. Finally, multiple glyphs can overlay each other (e.g. diacritics) or form ligatures; those are then graphemes which is essentially what we perceive as a “letter”.
Cursor movement is easy to do then, because the cursor can only be between two graphemes (it gets more complicated at the start of an LTR or RTL segment, but let's leave it at that for now), and → moves it forwards through them while ← moves it backwards. In RTL, forwards means left, of course; it follows the text direction. What order the two graphemes have relative to each other doesn't really matter in positioning the cursor.
I admit though, that it can be confusing to see mixed RTL and LTR text, but I guess people in Arabic- or Hebrew-speaking countries are quite used to it.
Regarding the problem that the correct text layout is sometimes lost when you copy-paste text, I guess the most common cause is missing application or layout-engine support for the respective script. If the layout engine does not know how to lay out Arabic text, all you get are the characters in their logical order from left to right. No ligatures are formed, no text direction is applied. For example, browsers have quite good support by now for this kind of thing, but if I take the Arabic text and paste it into Word it will look wrong (this was the case for Word 2007; PowerPoint did it fine, though). There is sadly no easy fix for that, but generally the text you copied is exactly the same; it's just the display that's wrong.
Disclaimer: I have lurked for a long time on the Unicode mailing list, but I'm by no means an expert on these things. I speak two languages and both are trivial as far as layout is concerned. This is a recollection of how I think it might work and might not be actual fact.
The letters are stored in logical order; meaning that a sentence such as "Hello! Salaam!" is in fact stored with the letters in precisely that order.
In addition to that, however, certain Unicode flags are also added to the text that inform the text layout engine that the "Salaam" part of the sentence should be reversed when displayed; so the final text layout becomes "Hello! maalaS!", as well it should be.
These flags are either set through natural BIDI classification, e.g. غ, or through use of the Unicode LTR and RTL marks, U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK).
If you pay attention, the cursor doesn't in fact jump strangely, it always follows logical character order.
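To see the logical order for yourself, a minimal sketch along these lines may help. The helper name `codePoints` and the sample string are made up for illustration; the point is only that the stored sequence has no display reordering applied, and that a mark such as U+200E is just one more code point in that sequence.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK

// List the code points of a string in stored (logical) order.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// "Hi " followed by four Arabic letters and "!" (hypothetical sample text).
const mixed = "Hi \u0633\u0644\u0627\u0645!";
console.log(codePoints(mixed));
// The Arabic letters come out in the order they are read,
// regardless of where they end up on screen.
console.log(codePoints(mixed + LRM));
```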
For a recent project I've been working on a simple word processor, and because I need fine-grained control I have had to implement a lot of the text shaping myself. Most of this is fairly straightforward and described in detail in places like here and here.
It's less obvious how to handle pressing down or up on the keyboard when dealing with non-monospaced text split across many lines. In monospaced text the algorithm is simple: move the text caret down one line and the same number of characters to the right that it was. But what about variable-width fonts? I've tried an algorithm like this (in pseudocode):
; Return text offset into next line after navigating down
function moveCaretDown():
    Move text caret to start of next line
    targetPixelOffset := previous pixel offset of caret in line above
    textOffsetIntoLine := 0
    pixelOffsetIntoLine := 0
    prevDelta := Infinity
    for each char in text of new line:
        delta := abs(pixelOffsetIntoLine - targetPixelOffset)
        ; We are now further from the desired caret offset than before, so the
        ; previous slot was closest to the caret's previous horizontal offset.
        if (delta > prevDelta):
            return textOffsetIntoLine - 1
        prevDelta := delta
        pixelOffsetIntoLine += measureWidth(char)
        textOffsetIntoLine++
    ; Else return the offset of the last character in the line
    return length of new line - 1
But I've found its behavior differs from text inputs in major web browsers and/or text editors (I can come up with some specific examples if needed). Is there some standard algorithm for this used by GUI toolkits or text shaping libraries? I was surprised I couldn't find a W3C standard on it, for example, considering this is behavior needed in every web browser.
* Inserting line breaks into a string at the correct places, handling ragged or fully-justified text, etc.
I don't think there's a standard other than to follow the Principle of Least Astonishment. Nowadays, that typically means seeing what the major applications do, since that will likely be familiar behavior to the user.
On the current line, you know the current horizontal offset. Let's call it x. I'm talking about the pixel position, not the number of characters or glyphs since the beginning of that line.
On the destination line, there is a set of horizontal offsets the caret can be placed (e.g., between glyphs). So you want to pick the one of those that's as similar to your current x as possible.
Furthermore, if the user moves the caret vertically several times in a row, you probably want to find the nearest to the original x. The caret may wiggle horizontally a bit as the user moves up and down, but you don't want it to drift. Once the user does something that intentionally changes the horizontal offset (e.g., inserts a character, uses a horizontal arrow, clicks the mouse, etc.) that's the best time to update x.
If you already have code to find the closest caret position to a mouse click, you might be able to re-use it as though the user had clicked the point exactly one line above or below the current x.
I've also seen some editors (including monospace text editors) that treat the end of the line as a special case. So if you move up or down when you're at the end of a line, you move to the end of the preceding or succeeding line. That seems a nice way to handle ragged right text and short lines at the end of a paragraph.
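The "remember the original x" idea could be sketched roughly as follows. This is not taken from any toolkit: the `Caret` class, `nearestCaretIndex`, and the per-line arrays of caret x positions are all hypothetical, and each line is assumed to offer at least one caret position.

```javascript
// caretXs: pixel x positions where the caret may sit on one line
// (hypothetical layout data, e.g. the boundaries between glyphs).
function nearestCaretIndex(caretXs, targetX) {
  let best = 0;
  for (let i = 1; i < caretXs.length; i++) {
    if (Math.abs(caretXs[i] - targetX) < Math.abs(caretXs[best] - targetX)) {
      best = i;
    }
  }
  return best;
}

class Caret {
  constructor(lines) {
    this.lines = lines;  // one caretXs array per line
    this.line = 0;
    this.index = 0;
    this.stickyX = null; // x remembered across consecutive vertical moves
  }

  moveVertically(delta) {
    const target = this.line + delta;
    if (target < 0 || target >= this.lines.length) return;
    // Capture x only on the first vertical move of a run, so it cannot drift.
    if (this.stickyX === null) this.stickyX = this.lines[this.line][this.index];
    this.line = target;
    this.index = nearestCaretIndex(this.lines[target], this.stickyX);
  }

  moveHorizontally(delta) {
    const max = this.lines[this.line].length - 1;
    this.index = Math.min(max, Math.max(0, this.index + delta));
    this.stickyX = null; // an intentional horizontal change resets x
  }
}
```

With a short middle line, the caret snaps to that line's end but returns to the original column on the line after it, instead of inheriting the shorter position.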
How does <div dir=auto>bla bla שלום bla</div> work?
By majority of words? First character?
I'm looking for a solution that looks at the majority of characters in a span in order to know its direction (an angularjs directive would also be good, but if this is already built in HTML I'd like to know).
When an element has its dir set to "auto", the direction of the
element is determined based on its first strong directionality
character, or default to the directionality of its parent element.
https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement.dir
Characters with the left-to-right, right-to-left, and right-to-left
Arabic types are called “strong directional”, or just “strong”
characters. Numbers are a special case; their reading direction is
always left to right, but they do not affect the reading direction of
neighboring characters. Even numbers that are displayed with
Arabic-Indic digits have a left-to-right character direction.
In the example you posted, it'll be ltr.
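If you do want the majority rule rather than the built-in first-strong rule, something like the sketch below might be a starting point. It is only an approximation: JavaScript regular expressions do not expose the Unicode Bidi_Class property, so letters from a handful of right-to-left scripts stand in for strong RTL characters, and every other letter counts as strong LTR. The function names are invented for this example.

```javascript
// Assumption: these scripts approximate the strong right-to-left classes.
const RTL_SCRIPT = /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}]/u;
const LETTER = /\p{L}/u;

// Returns "rtl", "ltr", or null for characters treated as not strong
// (digits, spaces, punctuation).
function strongDirection(ch) {
  if (!LETTER.test(ch)) return null;
  return RTL_SCRIPT.test(ch) ? "rtl" : "ltr";
}

// Roughly what dir=auto does: the first strong character decides.
function firstStrongDirection(text, fallback = "ltr") {
  for (const ch of text) {
    const dir = strongDirection(ch);
    if (dir !== null) return dir;
  }
  return fallback;
}

// What the question asks for: the majority of strong characters decides.
function majorityDirection(text, fallback = "ltr") {
  let rtl = 0;
  let ltr = 0;
  for (const ch of text) {
    const dir = strongDirection(ch);
    if (dir === "rtl") rtl++;
    else if (dir === "ltr") ltr++;
  }
  if (rtl === ltr) return fallback;
  return rtl > ltr ? "rtl" : "ltr";
}
```

The result of `majorityDirection` could then be written into the element's `dir` attribute by whatever framework you use.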
My HTML page contains tables with many negative numbers, like –0.25, written as &#8211;0.25 (8211 is the en dash). Because my document is supposed to become epub2 eventually, JavaScript is not allowed, only XHTML+CSS.
Unfortunately, both ebook readers and the print function in Chrome think that it is a reasonable idea to line-break a negative number between the en-dash and the zero, even when there is a space before and/or after, e.g., in a table.
I need a "non-breaking" en-dash? there are non-breakable spaces, after all, too. Or is there a way to instruct css never to break such negative numbers anywhere throughout the entire document? (I doubt this one, but just had to ask.)
of course, I can wrap each negative number into a span to prevent breaking, but this is quite painful. literally, by the time I am all done, my number --0.25 would have to become <span class="nobreak">–0.25</span>. (joke: it's almost like a DOS 10x amplification attack, with 4 chars becoming 40 characters, all because I want to have negative numbers.)
advice appreciated.
/iaw
You can prevent negative numbers from breaking by using the proper MINUS SIGN “−” (U+2212). In text rendering, browsers, ebook readers, and other software often treat EN DASH as well as HYPHEN-MINUS (the common ASCII hyphen) as allowing a line break after it, even when immediately followed by a digit. No such behavior has been observed for MINUS SIGN.
In HTML, you can write MINUS SIGN as &minus; if you have difficulties in typing the character or if you wish to make it clear to anyone reading the HTML source that MINUS SIGN is used.
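Since scripting is not available inside the ebook itself, the replacement would have to happen in a build step before packaging. A rough sketch of such a step is below; the function name `useMinusSign` is made up, and it assumes it is handed plain text content (table cell text, say) and not raw markup, because attribute values containing hyphens should not be touched.

```javascript
const MINUS_SIGN = "\u2212";

// Replace an EN DASH or HYPHEN-MINUS that directly precedes a digit with
// MINUS SIGN, but only at the start of the text or after a character that is
// neither a letter nor a digit, so ranges like "10–20" are left alone.
function useMinusSign(text) {
  return text.replace(/(^|[^\p{L}\p{N}])(?:\u2013|-)(?=\d)/gu, "$1" + MINUS_SIGN);
}

console.log(useMinusSign("\u20130.25")); // the en dash becomes U+2212
```

Whether the rule for ranges is right for your tables is a judgment call; adjust the lookbehind condition if your data uses dashes differently.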
This is more a sort of curiosity. While working on a multilingual web application I noticed that certain characters like punctuation marks (!?.;,) at the end of a block element are rendered as if they were placed at the beginning instead when the writing direction is right-to-left (as it is the case for certain Asian languages I do not speak).
In other words, the string
Hello, World!
is rendered as
!Hello, World
when placed in a div block with direction: rtl
This becomes even more evident if the text is split in two parts and given different colors: a contiguous chunk of text at the end is rendered in two separated regions:
http://jsfiddle.net/22Qk9/
What's the point of this behavior? I guess this must be a peculiarity of (all?) right-to-left languages which is automatically handled by the browser, so I don't need to care about it, or should I?
If you want to fix this behavior, add the LRM character (U+200E LEFT-TO-RIGHT MARK) at the end. It's a non-printing character.
Source : http://dotancohen.com/howto/rtl_right_to_left.html
Example : http://jsfiddle.net/yobjj6ed/
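In code, the LRM trick might look like the sketch below. The helper name and the list of punctuation it checks are assumptions for illustration; the mark is appended only when the string ends in one of those neutral characters, since otherwise it is not needed.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK, invisible and zero-width

// Hypothetical, deliberately small set of trailing neutral punctuation.
const ENDS_WITH_NEUTRAL = /[!?.,;:]$/;

// Append LRM so trailing punctuation stays attached to the left-to-right
// text before it, even inside a right-to-left block.
function appendLrmIfNeeded(text) {
  return ENDS_WITH_NEUTRAL.test(text) ? text + LRM : text;
}

console.log(JSON.stringify(appendLrmIfNeeded("Hello, World!")));
```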
The reason is that the exclamation mark “!” has the BiDi class O.N. ('Other Neutrals'), which means effectively that it adapts to the directionality of the surrounding text. In the example case, it is therefore placed to the left of the text before it. This is quite correct for languages written right to left: the terminating punctuation mark appears at the end, i.e. on the left.
Normally, you use the CSS code direction: rtl or, preferably, the HTML attribute dir=rtl for texts in a language that is written right to left, and only for them. For them, this behavior is a solution, not a problem.
If you instead use direction: rtl or dir=rtl just for special effects, like making table columns laid out right to left, then you need to consider the implications. For example, in the table case, you would need to set direction to ltr for each cell of the table (unless you want them to be rendered as primarily right to left text).
If you have, say, an English sentence quoted inside a block of Arabic text, then you need to set the directionality of an element containing the English text to ltr, e.g.
<blockquote dir=ltr>Hello, World!</blockquote>
A similar case (just with Arabic inside English text) is discussed as use case 6 in the W3C document What you need to know about the bidi algorithm and inline markup (which has a few oddities, though, like using cite markup for quoted text, against W3C recommendations).
The accepted answer https://stackoverflow.com/a/20799360/477420 works if you can control the markup/CSS of the value. If you have no control over the HTML, the following approach could work.
If you don't know whether the page will be rendered RTL or LTR, but some text is definitely LTR (i.e. English-only), you can wrap the value in LRE/PDF marks to signify that it is an LTR region. The text will be rendered LTR irrespective of the page's direction.
This works when you have some code that renders text without being able to change the markup around where it shows up on the page, e.g. you are rendering the value of a "song title" or "company name" field in some nested child component (or server side) with no control over the surrounding HTML elements.
One drawback of this and similar approaches that add marks to the text (like the LRM proposal in this question) is that copy-pasting such a value from the resulting HTML page will generally preserve the marks, even though they are invisible and zero-width. In most cases that is fine, but consider whether it is a problem for you.
Approximate sample code (some companies have "Inc." at the end which will end up with dot at the beginning when rendered as-is on RTL page):
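// companyName = "Alphabet Inc." - the dot must stay at the end, even on an RTL page
function stringIsDefinitelyAscii(s) {
    return /^[\x00-\x7F]*$/.test(s);
}

function wrapAsLtr(companyName) {
    if (stringIsDefinitelyAscii(companyName)) {
        // U+202A = LRE (start of LTR embedding), U+202C = PDF (end of embedding)
        companyName = "\u202A" + companyName + "\u202C";
    }
    return companyName;
}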
Details on LRE/PDF symbols can be found in https://unicode.org/reports/tr9/#Explicit_Directional_Embeddings:
LRE U+202A LEFT-TO-RIGHT EMBEDDING
Treat the following text as embedded left-to-right.
PDF U+202C POP DIRECTIONAL FORMATTING
End the scope of the last LRE, RLE, RLO, or LRO.
Some approaches to figure out if string has RTL characters can be found in How to detect whether a character belongs to a Right To Left language?, JavaScript: how to check if character is RTL?, How to detect if a string contains any Right-to-Left character?.
I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, in the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with given height. Combining marks may be rendered above, below, or inside a base character.
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
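You can check this kind of breakdown yourself. The sketch below groups a string into base characters and counts the combining marks that follow each one, using the general category Mark (`\p{M}`); the function names are invented for this example. Note that `stripCombiningMarks` also removes legitimate accents when they are stored as separate combining characters, so it is a blunt tool.

```javascript
const COMBINING_MARK = /\p{M}/u;

// Group a string into base characters, counting the marks stacked on each.
function groupByBase(text) {
  const groups = [];
  for (const ch of text) {
    if (COMBINING_MARK.test(ch) && groups.length > 0) {
      groups[groups.length - 1].marks += 1;
    } else {
      groups.push({ base: ch, marks: 0 });
    }
  }
  return groups;
}

// Drop every combining mark, leaving only the base characters.
function stripCombiningMarks(text) {
  return text.replace(/\p{M}/gu, "");
}

// y with two breves above and one tilde below, then " o"
console.log(groupByBase("y\u0306\u0306\u0330 o"));
```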
Zalgo text works because of combining characters. These are special characters that modify the character that comes before them:
y + ̆ = y̆ which actually is
y + &#x0306; = y&#x0306;
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is a single y followed by one &#x0306; (combining breve) per stacked mark.
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is still a single y, followed by one combining mark per stacked element, some placed below and some above.
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
To produce a list of combining diacritical marks you can use the following script (since links keep on dying)
for(var i=768; i<=879; i++){console.log(new DOMParser().parseFromString("&#"+i+";", "text/html").documentElement.textContent +" "+"&#"+i+";");} // 768..879 = U+0300..U+036F; DOMParser is browser-only
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾