What are these characters and why are they rendered this way? - language-agnostic

I want to understand what is happening when these characters are displayed that they are displayed the way they are displayed.
I saw it on social media (FB and Twitter) and can't seem to understand what's technically happening.
Edit: If they characters from a character set I don't have installed I still don't get why they tend to not be displayed in a line and overlap other space even outside their line?
!̸̶͚͖͖̩̻̩̗͍̮̙̈͊͛̈͒̍̐ͣͩ̋ͨ̓̊̌̈̊́̚͝͠ͅ ̷̧̢̛͖̤̟̺̫̗͚̗͖ͪ̏̔̔̒́ͥ̓ͫ̀ͤ̇ͥ͝ ̡̊͛̇ ͫ̉ͦ̊̀̔ͧͮ͆̽ͦͩ͋̌͗̚̚҉̵͖̟͙̮͈̼̹̞͝ͅ

It's the magic of Unicode.
Unicode handles all extant writing systems of the world, and that includes the ones with symbols instead of letters, the ones that are written right-to-left instead of left-to-right, and the ones which are written top-to-bottom. It also contains provisions for how to render glyphs that are technically combinations of base and modifier glyphs (even 16 bit isn't enough for all possible accented, composited, or context-adapted characters in all languages). (Trivia: The Unicode standard is so complex and contains so much code that security issues have actually been found in it.)
Any software that claims to support Unicode fully has to be able to follow all these rules, and that includes stacking characters on top of each other, overlaying them etc. etc. This means that any person with an internet connection who connects can have their native language rendered correctly - but I dare say that on English-language boards the predominant use of all those features is to render cool pseudo-graphics, as in your example.

Related

Hyphenating arbitrary text automatically

What kinds of challenges are there facing automatic hyphenation? It seems that you could just draw word by word, breaking when the length of the line exceeds the length of the viewport (or whatever we're wrapping our text in), placing hyphens after as many characters as can fit (provided at least two characters fit and the word is at least four characters), skipping words that already contain a hyphen (there's no requirement that words have to be hyphenated).
But I note how Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens. This seems to imply that there are further issues regarding where we can place hyphens.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
You have these issues in all languages. You can only place a hyphen where meaningful tokens result from the split, as has already been pointed out. You don't want to, for example, split a word like "wr-ong".
This may or may not be a syllable, while in most languages (including English) it is. But the main point is that you cannot pin it down as easily just with some simple rules. You would need to consider a lot of phonology to get a highly accurate result, and these rules vary from language to language.
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research in English phonology of syllables, or the so-called syllabification. You might want to start with this article on Wikipedia:
Wikipedia - Syllabification

Pitfalls when performing internationalization / localization with numbers?

When developing an application that will need to work with a variety of localizations, particularly with "right to left" text, is there a possibility of a case where numbers would need to be converted to "right to left" as well?
I'm no language scholar, but I know the RTL languages I am familiar with present their numbers in LTR.
For instance (using google translate):
I have 345 apples.
In Arabic:
لدي 345 التفاح.
So, I have two questions:
Is it possible to run into a language that uses RTL numbers?
How should internationalizing be handled in such cases?
or,
Is the "accepted norm" to just do numbers using Western Arabic characters, read from left to right?
In the big right-to-left scripts - Arabic, Hebrew and Thaana - numbers always run left to right. (When I say "Arabic", I refer to all the languages that are written in the Arabic script - Arabic, Farsi, Urdu, Pasto and many others.)
Hebrew and Thaana always use European digits, the same 0-9 set as English. There's nothing much to do there, because Unicode automatically takes care of ordering the numbers correctly. But see the comments about isolation below.
It's possible to use European digits in Arabic, too; for example, the Arabic Wikipedia uses them. However, very frequently Arabic texts use a different set of digits - https://en.wikipedia.org/wiki/Eastern_Arabic_numerals . It depends on your users' preferences. Notice also, that in the Persian language the digits are slightly different. From the point of view of right-to-left layout they behave pretty much the same way as European digits, although there are slight differences in the behavior of mathematical signs - for example, the minus can go on the other side. There are some subtleties here, but they are mostly edge cases.
In both Hebrew and Arabic you may run into a problem with bidi-isolation. For example, if you have a Hebrew paragraph in which you have an English word, and after the word you have numbers, the numbers will appear to the right of the word, although you may have wanted them to appear on the left. That's how the Unicode bidi algorithm works by default. To resolve such things you can use the Unicode control characters RLM and LRM. If you are using HTML5, you can also use the <bdi> tag for this, as well as the CSS rule "unicode-bidi: isolate". These CSS and HTML5 solutions are quite powerful and elegant, but aren't supported in all browsers yet.
I am aware of one script in which the digits run right-to-left: N'Ko, which is used for some languages of Africa. I actually saw websites written in it, but it is far less common than Hebrew and Arabic.
Finally, if you're using JavaScript, you can use the free jquery.i18n library for automatic number conversion. See https://github.com/wikimedia/jquery.i18n . (Disclaimer: I am one of this library's developers.)
Numbers will generally translate as you have them. Even in languages that read in different directions the Western Arabic numbers are typically recognized by the user.

Why shouldn't I use weird Characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots to indicate which slide you're on and to change slides easily:
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases, they are style-able and scale-able, and screen readers would be able to make sense of them.
But, I don't see anyone doing this, and I've got a feeling this is a bad idea, I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters i'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○  △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ¥ 〒 ¢ £ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?
There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However since CSS can control font use you probably will not run into this problem. As long as you use a web safe font, and know that that character is available in that font, you should probably be okay.
You can also use an embedded font, though be sure to fall back on a web safe font that contains the character you need as many browser will not support embedded fonts.
However sometimes certain devices will not have multiple fonts to choose from. If that font does not support your character you will run into problems. However depending on what your site does and the audience you are targeting this may not be a problem for you. Not to mention that devices like that are very old, and uncommon.
All in all it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important however to point out that you should never hard code those characters, instead use HTML entities. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code, Word used smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes, I got some weird symbol.
I could have either replaced them with normal quotes " or with HTML entities to keep the style “ and ” (“ and ”).
Any Unicode character can be inserted this way (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
As far as design goes they are really limiting, finding one that fits with the style of your page can be a hassle, and may mean that you will definitely need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finaly you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text, they are for telling someone something. So ask the question: "What am I doing?" and then use what was designed for that task. If you are telling use them, if you are showing use Image, or CSS.
I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with #font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "font awesome" (only the upsides you mention like scalabilty and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention but none that I know of.
The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.
There are a number of problems:
Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (eg. use UTF16 or at least UTF-8 encoded files). Most languages allow you to specify strings in characters using some form of escape notation (eg. "\u1234" in C#), which will avoid the problem, but loses some of the advantages.
Font-dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.
The question is very broad; it could be split to literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about—not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8895-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against not using them, but loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use in on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (♨), it would be unrealistic to expect most people to see anything but a symbol for unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.
I've run into problems using unusual characters: the tools editor, compiler, interpreter etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!

Thai line breaking: how to break Thai text effectively

Situation
with Thai text on a client site is that we can't control where exactly particular words/sentences are going to break between the lines (how web browser will handle it). Often, content appearance is indicated as incorrect by local reviewers.
Workaround
to this is that copywriter needs to deliver Thai content with breaking ​ and non-breaking  zero-width-space chars included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ​ที่​ออนไลน์อยู่
The above is just an example, I don't really know where exactly the breakpoints are allowed.
In fact, non-breaking zero spaces alone would do the trick either ... it's just more strict and correct to use breaking ones as well for better accuracy.
And while it definitely is doable like this, it also is a time consuming and not very effective solution for a large site content management. Simply said, the effort put into it doesn't match the effect needed.
Research
so far has lead to the workaround mentioned, looking for a better way how to handle this. Even W3C doesn't have a solution yet and is just discussing whether it should be part of CSS3 specification.
Thai language utilizes spaces very rarely, mostly to distinguish between sentences etc. Therefore, common appearance of a Thai sentence is one looong string.
Where to break such a string when more lines of text are put together is determined by particular words identification. For words identification local dictionaries are used which are most probably part of operating system or web browser, I'm not entirely sure about these.
Apparently, the more web browsers / operating systems you check on the more results you get! Moreover, there's not much you can do about this as it's system driven and there are no "where to break Thai" settings available.
Using <wbr/>, ​ or ­ to indicate where the breakpoints really are won't prevent web browser thinking (even though wrong) that some breaks are also possible in places, where you haven't defined them e.g. in the middle of a word which might be grammatically incorrect.
If such a word is placed at the end of a line (depends on screen resolution, copy length, CSS rules defined) and the browser applies his wrong line breaking rule on it then you would end up with a Thai line breaking issue, no matter that you have defined another breakpoints before, after or somewhere else in the word - browser will always use a breakpoint that he thinks is closest to EOL, not just the ones you have gently suggested by inserting one of the mentioned chars in your markup.
That's why you actually need to focus on where not to break your text (non-breaking zero-width-space), not where it's allowed. And that's what lead us back to the ugly and long markup example in the "Workaround" section above. That way a line break can strictly only occur where you have allowed it to be, but it's messy.
Any other solution
how to handle this more effectively would be appreciated ... and who knows, it might even help W3C in their implementation?
THANK YOU!
I know this thread was quite some time but I have something to say as a native Thai. I read lots of Thai web pages everyday and I feel the quality of Thai line breaking by the modern web browsers nowadays is perfectly acceptable.
As I know, Google Chrome browser uses ICU4C, Internet Explorer uses Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For Thai people I know, how these web browsers handle line breaks in Thai is perfectly acceptable for them. (actually we used to have this problem with very early version of Firefox (1.x) but that is resolved now.)
Thai line breaking and word breaking, unlike western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that could perfectly break a sentence to Thai words. IBM ICU Boundary Analysis page contains some analysis on this problem.
Many times, it has something to do with the context. For example, the phrase "ตากลม" can be correctly broken to "ตา","กลม" or "ตาก","ลม". Each way says totally different thing but Thai readers can still perfectly understand the intended meaning, given the context.
Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are too pushy on you to resolve this problem. This is common unsolvable problem for all Thai websites, web browsers, and even Microsoft Word.
It is best to wait (or contribute to IBM ICU) until Thai sentence breaking implementation gets better. Let the web browsers handle this. I don't think trying to workaround this problem worth your valuable time. As as I know, even Thai website publishers here just don't care to get this one right.
Should you need to publish a document with a perfect line/word breaking, you may consider other medium, such as PDF document in which you should have more control over the line breaks.
Hope this helps :)
The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.
Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.
see ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.
There is a W3C working group working exactly on this (for Thai and other Southeast Asian languages). Their layout requirement draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope these info can feed into the fruitful discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

Unicode support in Web standard fonts

I need to decide whether to render geometric symbols in a web GUI (e.g. arrows and triangles for buttons, menus, etc.) as Unicode symbols (MUCH easier and color-independent) or GIF/PNG files (lots of hassle I would like to avoid).
However, I have seen clients that have trouble displaying even advanced punctuation symbols declared as unicode characters (Example).
Does anybody know from which version on, OSs / Service Packs / Applications ship with Unicode versions of the standard fonts? There is, for example, Microsoft's Arial unicode that ships with Office since 1999, however I do not have office installed and still my Arial has at least some of the Unicode range.
Also, what is the situation with Mac OS and Linux?
Could somebody point me towards some comprehensive resources on this - reports, lists, overviews?
There's not really such a thing as a “Unicode version” of a font(*). “Arial Unicode” is a misleading name: it's not materially different to normal “Arial”, it just has some more characters in it. It does not contain usable glyphs for every single one of the tens of thousands of characters defined so far, and indeed there is no one OS standard font that does.
The significant question is merely whether the characters you want to use have glyphs in the default fonts of commonly-deployed operating systems. You need to look to look at font support for particular characters you wish to use on an individual basis.
The character U+0360 Combining Double Tilde you mentioned is not really ‘advanced punctuation’, it's an curious and rarely-used diacritical mark for phonetics work. So it's not really surprising that font support for it is poor. On the other hand, Stack Overflow can get away with using U+25CF Black Circle (●) because lots of fonts have it. Some of the other characters from the Geometric Shapes block such as U+25B2 Black Up-pointing Triangle (▲) are also pretty common.
fileformat.info has a list of common fonts that support each character, so you can check there to get a feel of how widely supported a symbol is, and whether the default OS fonts you recognise are present, before using it as a replacement for an image. For example U+25CF is in many fonts, but U+0360 isn't that well-supported: none of the default Windows install fonts are there, and the ‘Libertine’ font renders it badly wrong.
(*: OK, there is sort of such a thing as a Unicode font, in that a font's internal character lookup tables may be denominated in Unicode or some other character set. However this makes no practical difference as the application will always be addressing it as Unicode; the OS will do the conversion on lookup transparently.)
This question may be a duplicate of Unicode and fonts where I posted a list of unicode font links:
http://en.wikipedia.org/wiki/Category%3AFree%5Fsoftware%5FUnicode%5Ftypefaces
http://en.wikipedia.org/wiki/Unicode%5Ftypefaces
http://unifoundry.com/unifont.html
http://www.fileformat.info/info/unicode/font/index.htm
http://www.alanwood.net/unicode/fontsbyrange.html
http://www.alanwood.net/unicode/fonts.html
http://www.unifont.org/fontguide/
http://www.wazu.jp/index.html
However, I am not sure how the Unicode standard defines how to render a stand-alone COMBINING DOUBLE character, as it is supposed to combine other characters.
Some random observations:
On OS X the Unicode support is perfect, at least for your needs.
On Windows the situation seems to depend on the browser. I don’t use many arcane characters, but the few I do (mostly punctuation) seem to display just fine in Firefox. The only problem is in Internet Explorer, as usual.
If you have some control over your clients you could distribute some good free fonts?
Even web fonts could work.
One drawback to Unicode charactes is that they are often quite ugly. Too big, too small, have wrong position, etc.