Thai line breaking: how to break Thai text effectively


Situation
The situation with Thai text on a client site is that we can't control exactly where particular words or sentences will break across lines (that is, how the web browser will handle it). Local reviewers often flag the resulting appearance as incorrect.
Workaround
The workaround is for the copywriter to deliver Thai content with breaking and non-breaking zero-width space characters already included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆที่ออนไลน์อยู่
The above is just an example; I don't really know exactly where the breakpoints are allowed.
In fact, non-breaking zero-width spaces alone would do the trick too ... it's just stricter and more correct to use breaking ones as well for better accuracy.
And while this is definitely doable, it is also time-consuming and not a very effective way to manage content on a large site. Simply put, the effort doesn't match the result.
Research
Research so far has led only to the workaround mentioned above, and I'm still looking for a better way to handle this. Even the W3C doesn't have a solution yet and is just discussing whether it should be part of the CSS3 specification.
Thai language utilizes spaces very rarely, mostly to distinguish between sentences etc. Therefore, common appearance of a Thai sentence is one looong string.
Where to break such a string when it wraps over several lines is determined by identifying the individual words. Word identification relies on local dictionaries, which are most probably part of the operating system or the web browser; I'm not entirely sure which.
Apparently, the more web browsers / operating systems you check on the more results you get! Moreover, there's not much you can do about this as it's system driven and there are no "where to break Thai" settings available.
Using <wbr/> or a similar break-opportunity character to indicate where the breakpoints really are won't stop the web browser from thinking (even though wrongly) that breaks are also possible in places where you haven't defined them, e.g. in the middle of a word, which might be grammatically incorrect.
If such a word lands at the end of a line (this depends on screen resolution, copy length, and the CSS rules defined) and the browser applies its wrong line-breaking rule to it, you end up with a Thai line-breaking issue, no matter what other breakpoints you have defined before, after, or elsewhere in the word. The browser will always use the breakpoint that it thinks is closest to the end of the line, not just the ones you have gently suggested by inserting one of the mentioned characters in your markup.
That's why you actually need to focus on where not to break your text (non-breaking zero-width space), not where breaking is allowed. And that's what leads us back to the ugly and long markup example in the "Workaround" section above. That way a line break can only occur where you have allowed it, but it's messy.
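To make the "forbid everything, then allow selected breaks" idea concrete, here is a minimal sketch in Python. It assumes the copywriter supplies the text already split into words; the function names are hypothetical. It uses U+2060 WORD JOINER as the non-breaking character, which Unicode recommends over the older U+FEFF "zero width no-break space", and U+200B ZERO WIDTH SPACE as the breaking one. Whether a given browser honours these inside a Thai run still has to be tested.

```python
import unicodedata
from typing import List

ZWSP = "\u200b"         # ZERO WIDTH SPACE: permits a line break here
WORD_JOINER = "\u2060"  # WORD JOINER: forbids a line break here


def glue_word(word: str) -> str:
    """Insert a word joiner between the characters of one word.

    Combining marks (Unicode category M*, e.g. Thai upper vowels and
    tone marks) are left attached to their base character.
    """
    out = []
    for i, ch in enumerate(word):
        if i > 0 and not unicodedata.category(ch).startswith("M"):
            out.append(WORD_JOINER)
        out.append(ch)
    return "".join(out)


def mark_breaks(words: List[str]) -> str:
    """Allow breaks only between the given words (assumed pre-segmented)."""
    return ZWSP.join(glue_word(w) for w in words)


if __name__ == "__main__":
    marked = mark_breaks(["ตา", "กลม"])
    # Show the invisible characters as escapes
    print(marked.encode("unicode_escape").decode("ascii"))
```

The output is exactly as unreadable as the markup above, which is why this is better done by a build step than by hand.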
Any other solution
Any other solution that handles this more effectively would be appreciated ... and who knows, it might even help the W3C with their implementation?
THANK YOU!

I know this thread is quite old, but I have something to say as a native Thai speaker. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers is perfectly acceptable.
As far as I know, Google Chrome uses ICU4C, Internet Explorer uses the Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For the Thai people I know, how these web browsers handle line breaks in Thai is perfectly acceptable. (We did have this problem with very early versions of Firefox (1.x), but that is resolved now.)
Thai line breaking and word breaking, unlike in Western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that can perfectly break a sentence into Thai words. The ICU Boundary Analysis page contains some analysis of this problem.
Often the right break depends on the context. For example, the phrase "ตากลม" can be correctly broken into "ตา","กลม" or "ตาก","ลม". Each reading says a totally different thing, but Thai readers can still perfectly understand the intended meaning, given the context.
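The ambiguity can be reproduced with a few lines of Python. This is only a sketch: the four-word dictionary is a toy chosen for this example, and real segmenters use far larger dictionaries plus heuristics or statistics to pick one reading.

```python
from typing import List, Set


def all_segmentations(text: str, dictionary: Set[str]) -> List[List[str]]:
    """Return every way to split text into words found in dictionary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Try each prefix that is a word, then segment what is left
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}  # hypothetical, for illustration
    for words in all_segmentations("ตากลม", toy_dictionary):
        print(" | ".join(words))
```

Both splits come back as valid, and nothing in the dictionary alone says which one the writer meant.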
Given that your local reviewers are already familiar with reading Thai websites, I think they may be pushing you too hard to resolve this problem. It is a common, unsolved problem for all Thai websites, web browsers, and even Microsoft Word.
It is best to wait (or contribute to ICU) until Thai sentence-breaking implementations get better. Let the web browsers handle this. I don't think trying to work around this problem is worth your valuable time. As far as I know, even Thai website publishers here just don't bother to get this one right.
Should you need to publish a document with perfect line and word breaking, you may consider another medium, such as a PDF document, in which you have more control over the line breaks.
Hope this helps :)

The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.
Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.
See ICU Boundary Analysis for more information. These libraries are available for C, C++, and Java.
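As a rough picture of what such a build-time utility would do, here is a Python sketch that uses a greedy longest-match segmenter as a stand-in for ICU's break iterator. The stand-in, the toy dictionary, and the function names are all assumptions made for illustration; ICU's actual algorithm and API are different and more capable.

```python
from typing import List, Set

ZWSP = "\u200b"  # ZERO WIDTH SPACE: an invisible break opportunity


def segment_longest_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy stand-in for a dictionary-based word break iterator."""
    longest = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        limit = min(longest, len(text) - i)
        # Longest dictionary word starting at i, else a single character
        match = next(
            (text[i:i + n] for n in range(limit, 0, -1)
             if text[i:i + n] in dictionary),
            text[i],
        )
        words.append(match)
        i += len(match)
    return words


def inject_zwsp(text: str, dictionary: Set[str]) -> str:
    """Join the detected words with zero-width spaces."""
    return ZWSP.join(segment_longest_match(text, dictionary))


if __name__ == "__main__":
    toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}  # hypothetical
    print(segment_longest_match("ตากลม", toy_dictionary))
```

Note that this sketch emits unknown characters one at a time, which would allow breaks inside words it does not know; a real utility should delegate segmentation to ICU instead.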

There is a W3C working group working exactly on this (for Thai and other Southeast Asian languages). Their layout requirement draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope this information can feed into the discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

Related

Using the Egyptian Hieroglyphic control characters

The Unicode standard now contains control characters for formatting Egyptian hieroglyphs (nine characters: 13430-8). My understanding is that, in principle, they are intended to work like this:
𓏏𓐰𓈖
in which code 13430 (EGYPTIAN HIEROGLYPH VERTICAL JOINER) indicates that one sign should be placed above the other.
Unfortunately, there seems to be no support for these characters at this point.
In particular, I want to implement something for the web, i.e., I would like to display correctly formatted hieroglyphs without having to generate an image on the backend. Ideally, something like <span class="hieroglyphic-unicode">𓏏𓐰𓈖</span> would display correctly.
Any help/advice would be greatly appreciated.

Peter Constable's comments are correct, rendering Egyptian Hieroglyphs in fonts is very complex and takes time. There is a working group which combines the talents of Egyptologists and Unicode experts to ensure that the desired rendering capabilities are developed. We expect fonts to start becoming available during 2023 after platforms have had time to update to Unicode 15. As Peter pointed out, github.com/microsoft/font-tools is a good place to watch for progress.

The approach taken in your HTML looks correct, but I can't find any support for those Egyptian hieroglyphics control characters, even though they were added in Unicode 12.0 back in March, 2019.
One reason for that may have been the ongoing requests to add further control characters to provide richer support for combining hieroglyphics, so there was a reluctance to commit to doing anything while those proposed enhancements were still being discussed. However, those changes have now been approved (see Item 4 "Egyptian Hieroglyphs", pages 6 through 10).
Hopefully those changes will be included in Unicode 15 which is due to be released this September, but I couldn't find any documentation confirming that. If not, you may face a long wait! In the meantime I think you are stuck with using images, although that approach may not be practicable, depending on your requirements.
See this thread, Support Egyptian Hieroglyph Format Controls #1469, which confirms that there is no current support for the existing (Unicode 12) control characters with Google's Noto Sans Egyptian Hieroglyphs font. A comment near the end of the thread discussing support for Egyptian hieroglyph Format controls once Unicode 15 is released states "we will consider it along with all of the other Unicode changes, bug fixes, etc...".

If it is just for the web, there has been support (i.e. a JavaScript implementation) for the first 9 control characters (Unicode 12) since the beginning (2018):
https://mjn.host.cs.st-andrews.ac.uk/egyptian/res/js/
This uses HTML canvas, which is admittedly not ideal. I'm working on a better solution, which will also cover the new 29 control characters from Unicode 15.
It is the implementation in OpenType that is really difficult (it is being worked on).

Auto-insert soft hyphen in Jekyll Markdown, assuming a specific language?

I'd like to use justified text for my Jekyll-based website, but there are currently some unpleasant gaps in the text at various widths. By inserting the soft hyphen character, &shy;, into my Markdown, I managed to make it look much better.
However, it is an unpleasant process, and it defeats most of the point of using Markdown. Software packages such as Microsoft Word are capable of inserting hyphens at logical points, presumably based on a dictionary of acceptable break points.
Where can I get a dictionary of that sort?
How can I make Jekyll automatically perform this process when processing Markdown text?
I only care about supporting English at this point.

"However, it is an unpleasant process, and it defeats most of the point of using Markdown."
This does not have anything to do with Markdown. This is an HTML issue.
HTML does not hyphenate unless you use CSS hyphenation. This has poor browser support and is language dependent (use hyphenate-resources for this).
I would accept the spotty browser support of this CSS feature (accept its graceful degradation/progressive enhancement). If that is unacceptable for you, you can use a JavaScript library, like hyphenator.js, to insert soft hyphen characters into your content.

Hyphenating arbitrary text automatically

What kinds of challenges are there facing automatic hyphenation? It seems that you could just draw word by word, breaking when the length of the line exceeds the length of the viewport (or whatever we're wrapping our text in), placing hyphens after as many characters as can fit (provided at least two characters fit and the word is at least four characters), skipping words that already contain a hyphen (there's no requirement that words have to be hyphenated).
But I note how Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens. This seems to imply that there are further issues regarding where we can place hyphens.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
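For reference, the naive length-based approach described above can be sketched in Python as follows. The function name and the interpretation of the rules (at least two characters before the hyphen, words of at least four characters, no existing hyphen) are my reading of the description, not an established algorithm.

```python
from typing import List


def naive_wrap(text: str, width: int) -> List[str]:
    """Wrap text, hyphenating purely by how many characters still fit."""
    lines = []
    current = ""
    for word in text.split():
        while True:
            sep = " " if current else ""
            if len(current) + len(sep) + len(word) <= width:
                current += sep + word
                break
            # Characters of the word that fit, leaving one column for "-"
            room = width - len(current) - len(sep) - 1
            if len(word) >= 4 and "-" not in word and room >= 2:
                lines.append(current + sep + word[:room] + "-")
                current = ""
                word = word[room:]  # the remainder goes to the next line
            elif current:
                lines.append(current)  # no split possible: start a new line
                current = ""
            else:
                lines.append(word)  # too long for any line: emit it as is
                break
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    for line in naive_wrap("this is wrong", 11):
        print(line)
```

With a width of 11 this splits "wrong" after "wr", because the rule only counts characters and knows nothing about the word itself.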

You have these issues in all languages. You can only place a hyphen where meaningful tokens result from the split, as has already been pointed out. You don't want to, for example, split a word like "wr-ong".
This may or may not be a syllable, while in most languages (including English) it is. But the main point is that you cannot pin it down as easily just with some simple rules. You would need to consider a lot of phonology to get a highly accurate result, and these rules vary from language to language.
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research in English phonology of syllables, or the so-called syllabification. You might want to start with this article on Wikipedia:
Wikipedia - Syllabification

Why shouldn't I use weird Characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots that indicate which slide you're on and let you change slides easily.
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases, they are style-able and scale-able, and screen readers would be able to make sense of them.
But I don't see anyone doing this, and I've got a feeling it is a bad idea; I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters I'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○　 △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ￥ 〒 ￠ ￡ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?

There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However, since CSS can control font use, you probably will not run into this problem. As long as you use a web-safe font and know that the character is available in that font, you should be okay.
You can also use an embedded font, though be sure to fall back on a web-safe font that contains the character you need, as many browsers will not support embedded fonts.
However, some devices do not have multiple fonts to choose from. If the one available font does not support your character, you will run into problems. Depending on what your site does and the audience you are targeting, this may not matter to you, especially since such devices are very old and uncommon.
All in all, it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important, however, to point out that you should never hard-code those characters; use HTML entities instead. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code, and Word had used smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes; I got some weird symbol.
I could have either replaced them with normal quotes (") or used the HTML entities &ldquo; and &rdquo; to keep the style (“ and ”).
Any Unicode character can be inserted this way (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
As far as design goes, they are really limiting. Finding one that fits the style of your page can be a hassle and may mean that you need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finally, you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text; they are for telling someone something. So ask the question "What am I doing?" and then use what was designed for that task. If you are telling, use them; if you are showing, use an image or CSS.
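For point 1 above, replacing literal special characters with character references does not have to be done by hand. A minimal Python sketch, assuming the page text is available as a string (the helper name is hypothetical): the standard "xmlcharrefreplace" error handler turns every non-ASCII character into a numeric reference, and html.unescape goes the other way.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    ASCII characters, including & and <, are left untouched.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_char_refs("“quoted” ★"))
    print(html.unescape("&ldquo;quoted&rdquo;"))
```

The result is pure ASCII, so it survives being saved or served with the wrong encoding.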

I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with @font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "font awesome" (only the upsides you mention like scalabilty and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention but none that I know of.

The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.

There are a number of problems:
Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (e.g. use UTF-16 or at least UTF-8 encoded files). Most languages allow you to specify characters in strings using some form of escape notation (e.g. "\u1234" in C#), which avoids the problem but loses some of the advantages.
Font-dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.

The question is very broad; it could be split to literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about, not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8859-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against using them, but there are loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use it on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (♨), it would be unrealistic to expect most people to see anything but a symbol for unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.

I've run into problems using unusual characters: the tools (editor, compiler, interpreter, etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!

What are these characters and why are they rendered this way?

I want to understand what is happening when these characters are displayed and why they are rendered the way they are.
I saw them on social media (FB and Twitter) and can't work out what is technically happening.
Edit: If they are characters from a character set I don't have installed, I still don't get why they tend not to stay on one line and instead overlap other space, even outside their own line.
!̸̶͚͖͖̩̻̩̗͍̮̙̈͊͛̈͒̍̐ͣͩ̋ͨ̓̊̌̈̊́̚͝͠ͅ ̷̧̢̛͖̤̟̺̫̗͚̗͖ͪ̏̔̔̒́ͥ̓ͫ̀ͤ̇ͥ͝ ̡̊͛̇ ͫ̉ͦ̊̀̔ͧͮ͆̽ͦͩ͋̌͗̚̚҉̵͖̟͙̮͈̼̹̞͝ͅ

It's the magic of Unicode.
Unicode handles all extant writing systems of the world, and that includes the ones with symbols instead of letters, the ones that are written right-to-left instead of left-to-right, and the ones which are written top-to-bottom. It also contains provisions for how to render glyphs that are technically combinations of base and modifier glyphs (even 16 bit isn't enough for all possible accented, composited, or context-adapted characters in all languages). (Trivia: The Unicode standard is so complex and contains so much code that security issues have actually been found in it.)
Any software that claims to support Unicode fully has to be able to follow all these rules, and that includes stacking characters on top of each other, overlaying them etc. etc. This means that any person with an internet connection who connects can have their native language rendered correctly - but I dare say that on English-language boards the predominant use of all those features is to render cool pseudo-graphics, as in your example.
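To see the stacking for yourself, here is a small Python sketch (the function name is hypothetical) that separates the base characters from the combining marks piled onto them, using the general category from the standard unicodedata module. Categories beginning with "M" are combining marks.

```python
import unicodedata
from typing import Tuple


def split_marks(text: str) -> Tuple[str, int]:
    """Return the text without combining marks, and how many were removed."""
    base = []
    marks = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            marks += 1  # Mn, Mc or Me: drawn on or around the previous character
        else:
            base.append(ch)
    return "".join(base), marks


if __name__ == "__main__":
    # An exclamation mark followed by three combining marks
    sample = "!\u0338\u0336\u035a"
    print(split_marks(sample))
```

Pasting the text from the question into split_marks should leave only a handful of ordinary characters; everything else is marks that the renderer is required to stack.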
