Using the Egyptian Hieroglyphic control characters - html

The Unicode standard now contains control characters for the formatting of Egyptian hieroglyphs (nine characters: U+13430 through U+13438). My understanding is that, in principle, they are intended to work like so:
𓏏𓐰𓈖
in which code 13430 (EGYPTIAN HIEROGLYPH VERTICAL JOINER) indicates that one sign should be put above the other.
Unfortunately, there seems to be no support for these characters at this point.
In particular, I want to implement something for the web, i.e., I would like to display correctly formatted hieroglyphs without having to generate an image on the backend. Ideally, something like <span class="hieroglyphic-unicode">𓏏𓐰𓈖</span> would display correctly.
Any help/advice would be greatly appreciated.

Peter Constable's comments are correct: rendering Egyptian hieroglyphs in fonts is very complex and takes time. There is a working group which combines the talents of Egyptologists and Unicode experts to ensure that the desired rendering capabilities are developed. We expect fonts to start becoming available during 2023 after platforms have had time to update to Unicode 15. As Peter pointed out, github.com/microsoft/font-tools is a good place to watch for progress.

The approach taken in your HTML looks correct, but I can't find any support for those Egyptian hieroglyphics control characters, even though they were added in Unicode 12.0 back in March, 2019.
One reason for that may have been the ongoing requests to add further control characters to provide richer support for combining hieroglyphics, so there was a reluctance to commit to doing anything while those proposed enhancements were still being discussed. However, those changes have now been approved (see Item 4 "Egyptian Hieroglyphs", pages 6 through 10).
Hopefully those changes will be included in Unicode 15 which is due to be released this September, but I couldn't find any documentation confirming that. If not, you may face a long wait! In the meantime I think you are stuck with using images, although that approach may not be practicable, depending on your requirements.
See this thread, Support Egyptian Hieroglyph Format Controls #1469, which confirms that there is no current support for the existing (Unicode 12) control characters with Google's Noto Sans Egyptian Hieroglyphs font. A comment near the end of the thread discussing support for Egyptian hieroglyph Format controls once Unicode 15 is released states "we will consider it along with all of the other Unicode changes, bug fixes, etc...".

If it is just for the web, there has been support (i.e. a JavaScript implementation) for the first 9 control characters (Unicode 12) since the beginning (2018):
https://mjn.host.cs.st-andrews.ac.uk/egyptian/res/js/
This uses HTML canvas, which is admittedly not ideal. I'm working on a better solution, which will also cover the new 29 control characters from Unicode 15.
It is the implementation in OpenType that is really difficult (it is being worked on).
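Whatever rendering route is taken, a script first has to find the control characters in the text. The following is only a minimal sketch of that first step, assuming the Unicode 12 range U+13430 through U+13438 given in the question. The function names are made up for illustration and are not taken from the St Andrews implementation.

```javascript
// Unicode 12 format controls, per the question: U+13430 through U+13438.
const FIRST_CONTROL = 0x13430;
const LAST_CONTROL = 0x13438;

function isHieroglyphControl(codePoint) {
  return codePoint >= FIRST_CONTROL && codePoint <= LAST_CONTROL;
}

// Hypothetical helper: list every code point and flag the controls.
function describeHieroglyphs(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return {
      char: ch,
      hex: "U+" + codePoint.toString(16).toUpperCase(),
      isControl: isHieroglyphControl(codePoint),
    };
  });
}

// Two signs joined by U+13430 (VERTICAL JOINER).
const sample = String.fromCodePoint(0x133cf, 0x13430, 0x13216);
console.log(describeHieroglyphs(sample).map((entry) => entry.hex));
```

A renderer could walk this list and decide how to stack the neighbouring signs. That layout step is the hard part and is not attempted here.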

Related

Is it safe to use UTF-8 characters (like ✖) on my web-page?

I would like to use the UTF-8 character ✖ on my site but I am not sure if this will be supported cross browser.
I am worried that:
a) Users will not have access to a font containing that character
b) IE will not find the character even if the user has a font that could display it. I am worried about this because of this info:
By the specifications, browsers should display a character if there is any font in the system that contains it. If the fonts specified by the author (in CSS font-family settings or, rarely these days, using font markup in HTML) do not contain the character, browsers are supposed to use fallback fonts. The same applies if no fonts are specified by the author; browsers should use primarily their default fonts, using alternate fonts for any character not covered by the primary font.
In practice, things don’t always work that way. Especially IE is notorious for its failures in this respect. It often fails to display a character, even though it could do that if it used all the fonts in the system. If a browser cannot render a character, it may show a small rectangle, possibly containing a question mark, ?, or some similar indicator. Here’s a quick test (character U+0840, which is probably not supported by any font on your computer): ࡀ.
Source.
c) Other issues that I have not thought of.
There is a resource called Unify that shows which devices a character is supported on, but it currently (Sept 14, 2015) only supports 107 characters.
So to summarize, the question is: How can I determine if it is safe to use a UTF-8 special character on my site? Is it safe to use ✖ specifically on my site?
It's always safe - your users' computers won't suddenly burst into flame.
From a technical perspective, your best bet is to use a web font that has support for every Unicode character you want to use. That is not a catch-all (the user might have web fonts disabled or is using a command line browser, etc...), but it should support the vast majority of computers.
From there I would apply common sense. If the displaying of a character is absolutely crucial and lives depend on it, try to not use Unicode. Otherwise I'd say 'go ahead'.
This is as much a UX question as it is a technical one, so I will mention both.
As a comparison, the character looks different in my IE11 browser than in my Firefox 31.8. A good user experience is generally associated with consistency, and this approach is not very portable. So from a UX perspective, this is not a great solution.
I would say using a tiny *.gif or *.bmp, or even *.png if you need transparency, is a better solution. Even better yet, go with *.svg so scaling will not be an issue. From a technical aspect, the overhead of something that small is generally insignificant.
The only problem you can face is that exotic symbols are not implemented in many fonts, so the user can see a dummy character (e.g. square) instead of this. I personally like to use svg symbols for this purpose.
An alternative solution would be to use a web font with those icons in it (although probably a subset version of it, so that it's less than 1 KB and doesn't weigh down your pages).

Support Maldivian language

I'm building a quiz that supports 20 languages.
One is Maldivian.
How do I support this? Right now I'm getting a bunch of squares.
I want to know:
- What font should I use?
- Is there an online translator for English-Maldivian? (Google Translate does not support this)
Maldivian uses the Thaana script, which is not very widely supported in fonts. There are two basic strategies: specify a font-family rule that lists fonts known to contain Thaana letters, hoping that the user has at least one of them installed, or use a downloadable font with @font-face. The latter sounds more realistic in this case. For it, you would need a font that you can legally use that way.
Free fonts that support Thaana include MPH 2B Damase and TITUS Cyberbit Basic.
For generalities, see my Guide to using special characters in HTML.
I would be very surprised at seeing an automatic translator for a small language like Maldivian, and I would also be surprised at seeing an automatic translator that produces decent results when translating a web site.

Why shouldn't I use weird Characters in code/HTML documents?

I'm wondering if it's a bad idea to use weird characters in my code. I recently tried using them to create little dots to indicate which slide you're on and to change slides easily.
There are tons of these types of characters, and it seems like they could be used in place of icons/images in many cases, they are style-able and scale-able, and screen readers would be able to make sense of them.
But I don't see anyone doing this, and I've got a feeling this is a bad idea; I just can't decide why. I guess it seems too easy to be true. Could someone tell me why this is or isn't okay? Here are some more examples of the characters I'm talking about:
↖ ↗ ↙ ↘ ㊣ ◎ ○ ● ⊕ ⊙ ○  △ ▲ ☆ ★ ◇ ◆ ■ □ ▽ ▼ § ¥ 〒 ¢ £ ※ ♀ ♂ &⁂ ℡ ↂ░ ▣ ▤ ▥ ▦ ▧ ✐✌✍✡✓✔✕✖ ♂ ♀ ♥ ♡ ☜ ☞ ☎ ☏ ⊙ ◎ ☺ ☻ ► ◄ ▧ ▨ ♨ ◐ ◑ ↔ ↕ ♥ ♡ ▪ ▫ ☼ ♦ ▀ ▄ █ ▌ ▐ ░ ▒ ▬ ♦ ◊
PS: I would also welcome general information about these characters, what they're called and stuff (ASCII, Unicode)?
There are three things to deal with:
1. As characters in a sentence/text:
The problem is that some fonts simply do not have them. However since CSS can control font use you probably will not run into this problem. As long as you use a web safe font, and know that that character is available in that font, you should probably be okay.
You can also use an embedded font, though be sure to fall back on a web safe font that contains the character you need, as many browsers will not support embedded fonts.
However sometimes certain devices will not have multiple fonts to choose from. If that font does not support your character you will run into problems. However depending on what your site does and the audience you are targeting this may not be a problem for you. Not to mention that devices like that are very old, and uncommon.
All in all it was probably not a good idea a handful of years ago, but now you are not likely to have problems as long as you cover all your bases.
It is important however to point out that you should never hard code those characters, instead use HTML entities. Just inserting those characters into your code can lead to unpredictable results. I recently copied some text from Word directly into my code, Word used smart quotes (quote marks that curve inwards properly). They showed up fine in Notepad++, but when I viewed the page I did not get quotes, I got some weird symbol.
I could have either replaced them with normal quotes (") or used HTML entities to keep the style: &ldquo; and &rdquo; (“ and ”).
Any Unicode character can be inserted this way (even those without special names).
Wikipedia has a good reference:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
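The remark that any Unicode character can be written as a reference can be sketched as a small conversion step. This is only an illustration: the function name is made up, and it assumes hexadecimal numeric character references (`&#x...;`) are acceptable in place of named entities.

```javascript
// Hypothetical helper: replace every non-ASCII character with a
// hexadecimal numeric character reference such as &#x2716;
function toNumericReferences(text) {
  // Array.from iterates by code point, so characters outside the BMP work too.
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0x7f
      ? "&#x" + codePoint.toString(16).toUpperCase() + ";"
      : ch;
  }).join("");
}

console.log(toNumericReferences("“ok” ✖")); // &#x201C;ok&#x201D; &#x2716;
```

The output is plain ASCII, so it survives being saved or served with the wrong encoding, at the cost of being harder to read in the source.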
2. As UI elements:
While it may be safe to use them in many cases, it is still better to use HTML elements where possible. You could simply style some div elements to be round and filled/not filled for your example.
As far as design goes they are really limiting, finding one that fits with the style of your page can be a hassle, and may mean that you will definitely need to embed a font, which is still only supported by the latest browsers.
Plus many devices do not support heavy font manipulation, and will often display them poorly. It works in the flow of your text, but as a vital part of the UI there can be major problems. Any possible issue one of those characters can bring will be multiplied by the fact that it is part of your UI.
From an artistic stand point they simply limit your abilities too much.
3. What are you doing?
Finally you need to consider this:
Text is for telling
Image is for showing
HTML is for organizing
CSS is for making things look good while you show them
JavaScript is for functionality
Those characters are text; they are for telling someone something. So ask the question: "What am I doing?" and then use what was designed for that task. If you are telling, use them; if you are showing, use an image or CSS.
I've seen this done before (the stars) and I think it's an awesome idea! It's also becoming quite popular to use a font (with @font-face) full of icons, like this one: http://fortawesome.github.com/Font-Awesome/
I can't see any downside to using a font like "font awesome" (only the upsides you mention, like scalability and the ability to change color with CSS). Perhaps there's a downside to using the special characters you mention, but none that I know of.
The problem with using those characters is that not all of them are available in all fonts used by all users, which means your application may look strange, or in the worst case be unusable. That said, it is becoming more common to assume the characters available in certain common fonts (Apple/Microsoft's Arial, Bitstream Vera). You can't even assume that you can download a font, as some users may capture content for offline reading with a service like Instapaper or Read It Later.
There are a number of problems:
Portability: using anything other than the 7-bit ASCII characters in code can make your code less portable, as recipients may use the wrong encoding. You can do a lot to mitigate this (e.g. use UTF-16 or at least UTF-8 encoded files). Most languages allow you to specify characters in strings using some form of escape notation (e.g. "\u1234" in C#), which will avoid the problem, but loses some of the advantages.
Font-dependency: user interface elements that depend on special characters being available in a font may be harder to internationalize, since those glyphs might not be in the font that you want/need to use for a particular audience.
No color, limited choice of art: while font glyphs might seem useful to a coder, they probably look pretty poor to a UI designer.
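The portability point can be made concrete: the same bytes read under the wrong encoding turn into different characters. A minimal sketch, assuming Node.js (`Buffer` is a Node API and is not available in browsers). The helper name is made up for illustration.

```javascript
// Hypothetical helper: encode text as UTF-8, then decode the same bytes as
// Latin-1, which is effectively what a recipient using the wrong encoding does.
function misreadAsLatin1(text) {
  const utf8Bytes = Buffer.from(text, "utf8");
  return {
    hex: utf8Bytes.toString("hex"),
    misread: utf8Bytes.toString("latin1"),
  };
}

const result = misreadAsLatin1("✖");
console.log(result.hex);            // e29c96
console.log(result.misread.length); // 3

// An escape names the code point in pure ASCII, so the file encoding
// no longer matters for this character.
console.log("\u2716".codePointAt(0).toString(16)); // 2716
```

One character becomes three unrelated ones when the bytes are misread, which is the kind of "weird symbol" described for the smart quotes above.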
The question is very broad; it could be split to literally thousands of questions of the type “why shouldn’t I use character ... in HTML documents?” This seems to be what the question is about; it is not really about code. And it’s about characters, seen as “weird” or “uncommon” or “special” from some perspective, not about character encodings. (None of the characters mentioned are encoded in ASCII. Some are encoded in ISO-8859-1. All are encoded in Unicode.)
The characters are used in HTML documents. There is no general reason against using them, but there are loads of specific reasons why some specific characters might not be the best approach in a specific situation.
For example, the “little dots” you mention in your example (probably not dots at all but circles or bullets), when used as control elements as you describe, would mean poor usability and poor accessibility. Making them significantly larger would improve the situation, but this more or less proves that such text characters are not suitable for controls.
Screen readers could make sense of special characters if they used a database of various properties of characters. Well, they don’t, and they often fail to read properly even the most common special characters. Just reading the Unicode name of a character can be cryptic or outright misleading. The proper reading would generally depend on meaning and context.
The main issue, however, is that people do not generally recognize characters in the meanings that you would assign to them. How many people know what the circled plus symbol “⊕” stands for? Maybe 1 out of 1,000, optimistically thinking. It might be all right to use it on a page about advanced mathematics or physics, especially if the notation is defined there. But used in general text, it would be just… a weird character, and people would read different meanings into it, or just get puzzled.
So using special characters just because they look cool isn’t a good idea. Even when there is time and place for a special character, there are technical issues with them. How many fonts do you expect to contain “⊕”? How many of those fonts do you expect Joe Q. Public to have in his computer? In this specific case, you would find the font coverage reasonably good, but you would still have to analyze it and write a longish list of font names in your CSS code to cover most platforms. In the pile of poo case (♨), it would be unrealistic to expect most people to see anything but a symbol for unrepresentable character. Regarding the methods of finding out such things, check out my Guide to using special characters in HTML.
I've run into problems using unusual characters: the tools (editor, compiler, interpreter, etc.) often complain and report errors. In the end, it wasn't worth the hassle. Darn western hegemony, or homogeneity, or, well, something!

Is it safe to assume users can see unicode characters U+2716 and U+2714 in CSS content?

I want to use the characters ✖ (U+2716) and ✔ (U+2714) in my CSS for form validation purposes. Basically, if a field is valid/invalid, I use the :after pseudo-element to insert the corresponding symbol after the field.
For example:
.field:after {
  content: "\2716"; /* U+2716 HEAVY MULTIPLICATION X */
}
This is working great on my Mac, but when I switch to my Windows XP VMWare instance, I don't get the characters, no matter what font I choose (even Arial).
My suspicion is that perhaps my Windows VM isn't configured properly, but that causes me to be wary of using these characters at all.
Does anyone know if there are "safe" characters or ranges in Unicode that you can reliably assume will be viewable by most people?
UPDATE:
Here is a list of unicode characters I was hoping to possibly be able to use as icons. Specifically the dingbats section.
http://en.wikipedia.org/wiki/List_of_Unicode_characters#Dingbats
If you don't see these characters on your machine, definitely let me know in the comments.
In addition to the problems of using CSS for presenting essential information (see CSS Caveats), there’s the problem that the characters mentioned are often not available on people’s computers. The fonts supporting them do not include any font that is shipped with a Windows system, for example. Support exists in Arial Unicode MS, which is shipped with Microsoft Office, but not everyone is using Office.
Besides, the symbols are not universal. A symbol like “✔” meant wrong when I was at school.
Using “OK” and “error” might be best, unless you need to use some other language.
What browser are you using in your XP VM? IE6 and 7 don't support the :after selector, so that might be the issue.
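For reference, the escape used in the question's rule is just the character's code point in hexadecimal after a backslash. A small sketch of producing such escapes, assuming a trailing space is wanted as a terminator (CSS consumes one whitespace character after a hex escape, so it does not show up in the output). The function name is made up for illustration.

```javascript
// Hypothetical helper: build a CSS escape for the first character of a string.
function toCssEscape(char) {
  const codePoint = char.codePointAt(0);
  // The trailing space ends the escape, so a following hex digit in the
  // string is not absorbed into the code point.
  return "\\" + codePoint.toString(16).toUpperCase() + " ";
}

console.log(`.field:after { content: "${toCssEscape("✖")}"; }`);
```

This only changes how the character is written in the stylesheet. Whether it is displayed still depends on the fonts available on the user's machine.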

Thai line breaking: how to break Thai text effectively

Situation
with Thai text on a client site is that we can't control where exactly particular words/sentences are going to break between the lines (how web browser will handle it). Often, content appearance is indicated as incorrect by local reviewers.
Workaround
to this is that the copywriter needs to deliver Thai content with breaking (&#8203;) and non-breaking zero-width-space chars included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ​ที่​ออนไลน์อยู่
The above is just an example, I don't really know where exactly the breakpoints are allowed.
In fact, non-breaking zero-width spaces alone would do the trick as well ... it's just more strict and correct to use breaking ones too for better accuracy.
And while it definitely is doable like this, it also is a time consuming and not very effective solution for a large site content management. Simply said, the effort put into it doesn't match the effect needed.
Research
so far has led to the workaround mentioned; I am looking for a better way to handle this. Even W3C doesn't have a solution yet and is just discussing whether it should be part of the CSS3 specification.
Thai language utilizes spaces very rarely, mostly to distinguish between sentences etc. Therefore, common appearance of a Thai sentence is one looong string.
Where to break such a string when more lines of text are put together is determined by particular words identification. For words identification local dictionaries are used which are most probably part of operating system or web browser, I'm not entirely sure about these.
Apparently, the more web browsers / operating systems you check on the more results you get! Moreover, there's not much you can do about this as it's system driven and there are no "where to break Thai" settings available.
Using <wbr/>, &#8203; or &shy; to indicate where the breakpoints really are won't prevent the web browser from thinking (even though wrongly) that some breaks are also possible in places where you haven't defined them, e.g. in the middle of a word, which might be grammatically incorrect.
If such a word is placed at the end of a line (depending on screen resolution, copy length, and the CSS rules defined) and the browser applies its wrong line-breaking rule to it, then you end up with a Thai line-breaking issue, no matter that you have defined other breakpoints before, after, or somewhere else in the word. The browser will always use the breakpoint that it thinks is closest to EOL, not just the ones you have gently suggested by inserting one of the mentioned chars in your markup.
That's why you actually need to focus on where not to break your text (non-breaking zero-width space), not where it's allowed. And that's what leads us back to the ugly and long markup example in the "Workaround" section above. That way a line break can strictly only occur where you have allowed it to be, but it's messy.
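The workaround described above could be generated rather than typed by hand. The following is only a sketch under several assumptions: the copywriter delivers the text already split at the allowed breakpoints, U+2060 WORD JOINER is used as the non-breaking zero-width character, and the runtime provides `Intl.Segmenter`. The function name is made up for illustration.

```javascript
const ZWSP = "\u200B"; // ZERO WIDTH SPACE: a line break is allowed here
const WJ = "\u2060";   // WORD JOINER: assumed non-breaking zero-width character

// Hypothetical helper: forbid breaks inside each chunk, allow them between chunks.
function lockBreaks(chunks) {
  // Split by grapheme cluster so a joiner never lands between a base
  // letter and its combining vowel or tone marks.
  const graphemes = new Intl.Segmenter("th", { granularity: "grapheme" });
  return chunks
    .map((chunk) =>
      Array.from(graphemes.segment(chunk), (part) => part.segment).join(WJ)
    )
    .join(ZWSP);
}

// ASCII placeholder input, since the real Thai breakpoints are not known here.
console.log(JSON.stringify(lockBreaks(["ab", "c"])));
```

The generated markup is as ugly as the manual version, but it no longer has to be maintained by hand.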
Any other solution
how to handle this more effectively would be appreciated ... and who knows, it might even help W3C in their implementation?
THANK YOU!
I know this thread is quite old, but I have something to say as a native Thai. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers is perfectly acceptable.
As I know, Google Chrome browser uses ICU4C, Internet Explorer uses Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For Thai people I know, how these web browsers handle line breaks in Thai is perfectly acceptable for them. (actually we used to have this problem with very early version of Firefox (1.x) but that is resolved now.)
Thai line breaking and word breaking, unlike western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that could perfectly break a sentence to Thai words. IBM ICU Boundary Analysis page contains some analysis on this problem.
Many times, it has something to do with the context. For example, the phrase "ตากลม" can be correctly broken to "ตา","กลม" or "ตาก","ลม". Each way says totally different thing but Thai readers can still perfectly understand the intended meaning, given the context.
Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are too pushy on you to resolve this problem. This is common unsolvable problem for all Thai websites, web browsers, and even Microsoft Word.
It is best to wait (or contribute to IBM ICU) until Thai sentence breaking implementations get better. Let the web browsers handle this. I don't think trying to work around this problem is worth your valuable time. As far as I know, even Thai website publishers here just don't care to get this one right.
Should you need to publish a document with a perfect line/word breaking, you may consider other medium, such as PDF document in which you should have more control over the line breaks.
Hope this helps :)
The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.
Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.
see ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.
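The answer names the C, C++ and Java libraries. As a rough sketch of the same idea in JavaScript, the built-in `Intl.Segmenter` can be used instead. In V8-based runtimes such as Node.js it is backed by ICU, though the quality of the Thai word boundaries depends on the ICU data shipped with the runtime. The function name is made up for illustration.

```javascript
const ZWSP = "\u200B"; // ZERO WIDTH SPACE

// Hypothetical helper: insert a breaking zero-width space between adjacent words.
function insertThaiBreaks(text, locale = "th") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let output = "";
  let previousWasWord = false;
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Only mark a boundary where two words touch; spaces and punctuation
    // already give the browser a place to break.
    if (previousWasWord && isWordLike) output += ZWSP;
    output += segment;
    previousWasWord = Boolean(isWordLike);
  }
  return output;
}

// Show the inserted boundaries with a visible marker instead of U+200B.
const marked = insertThaiBreaks("ของเพื่อนๆ ที่ออนไลน์อยู่");
console.log(marked.split(ZWSP).join("|"));
```

This could run at build time or when translations are delivered, as suggested above. It only adds break opportunities and does not stop a browser from breaking elsewhere.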
There is a W3C working group working exactly on this (for Thai and other Southeast Asian languages). Their layout requirement draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope this information can feed into the fruitful discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq