Where can I find all right-to-left characters in UTF-8? - html

I know that Hebrew and Arabic characters run from right to left, but I want to see all of them.

Have a look at the bidirectional character type (Unicode character property Bidi_Class). You're looking for characters of type R (Right-to-Left) or AL (Right-to-Left Arabic). The file DerivedBidiClass.txt in the Unicode database contains a list of all code points with these classes.
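As a rough sketch of the same lookup without downloading the file, Python's standard unicodedata module exposes the Bidi_Class of each character. The helper name rtl_code_points is made up for illustration. The result reflects whichever Unicode version ships with the interpreter, and unlike DerivedBidiClass.txt it does not report the default classes of unassigned code points.

```python
import sys
import unicodedata

def rtl_code_points():
    """Yield assigned code points whose Bidi_Class is R or AL."""
    for cp in range(sys.maxunicode + 1):
        if unicodedata.bidirectional(chr(cp)) in ("R", "AL"):
            yield cp

rtl = set(rtl_code_points())
print(len(rtl), "code points with Bidi_Class R or AL")

# Show the first few, with their names, as a sanity check.
for cp in sorted(rtl)[:5]:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp), "<unnamed>"))
```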

Quoting i18nguy:
Languages don't have a direction. Scripts have a writing direction,
and so languages written in a particular script, will be written with
the direction of that script.
Here are some scripts using RTL: Arabic, Hebrew, N'ko, Syriac, Thaana/Thâna, Tifinar, Urdu.
You can just look for unicode range of a given script. Like for example Tifinar: U+2D30 – U+2D7F.
Not sure what you want to achieve by looking at all those characters but I think that is the only way of actually finding them.
You can refer to the original page here:
http://www.i18nguy.com/temp/rtl.html
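A block range by itself does not say which direction its characters take, and a block can mix several bidi classes (letters, combining marks, punctuation). A minimal sketch for checking a range, assuming Python's bundled Unicode data is recent enough; bidi_classes is a hypothetical helper name:

```python
from collections import Counter
import unicodedata

def bidi_classes(first, last):
    """Count Bidi_Class values of the assigned characters in first..last."""
    counts = Counter()
    for cp in range(first, last + 1):
        cls = unicodedata.bidirectional(chr(cp))
        if cls:  # an empty string means the code point is unassigned here
            counts[cls] += 1
    return counts

print("Hebrew  ", dict(bidi_classes(0x0590, 0x05FF)))
print("N'Ko    ", dict(bidi_classes(0x07C0, 0x07FF)))
print("Tifinagh", dict(bidi_classes(0x2D30, 0x2D7F)))
```

Running this on the ranges you care about shows whether the characters are actually classed as R or AL before you rely on a script list.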

Related

Is there a need to use HTML entities when using Unicode?

I am building a website for a German client, so the text on the website will regularly contain characters like:
ä
ö
ü
ß
Is it necessary to convert all those characters to their HTML entities when the website uses UTF-8 character encoding everywhere?
Or maybe there's no relation between the two areas?
When (if at all) should I convert those to their HTML Entities, then?
You should convert to HTML entity or character references when:
a. you are stuck with some editor or processing component that doesn't support Unicode properly;
b. you have manually-edited markup with confusable characters. For example, if you have a non-breaking space that is important for the layout, you might want to write it as &nbsp; or &#160; so that it's obvious and doesn't get replaced with a normal space when someone edits the file.
Other than that, no, just go with the raw versions.
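To make the three cases concrete, here is a small sketch using only Python's standard library. The sample strings are invented for illustration; the point is that on a UTF-8 page only the markup-significant characters need escaping, while numeric references are a fallback for ASCII-only tooling.

```python
import html

text = "Größe & Maße <müssen> passen"

# Normal case: escape only markup-significant characters, keep umlauts raw.
escaped = html.escape(text, quote=False)
print(escaped)

# Case (a): a processing step that only handles ASCII gets numeric references.
ascii_safe = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_safe)

# Case (b): make a non-breaking space visible in hand-edited markup.
distance = "100\u00a0km"
visible = distance.replace("\u00a0", "&nbsp;")
print(visible)
```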

HTML pattern Arabic letters

I want to insert the Arabic letters in the pattern just like the English letters
pattern="[a-zA-Z0-9-_. ]{1,30}"
I have no idea how to accomplish the action.
The ranges for Arabic and Persian are shared, so this code can be used for Arabic too.
[أ-يa-zA-Z]
This is the reference for finding the character range of Unicode languages:
preg_replace and preg_match arabic characters
http://unicode.org/charts/
The HTML5 pattern attribute follows JavaScript regular expression syntax, which makes things rather awkward. You cannot test character properties, for example. Instead, you need to list down the allowed characters or ranges of characters.
Using the normative Scripts.txt file (by the Unicode Consortium), which defines the script (writing system) of all characters, I constructed the following:
pattern=
"[a-zA-Z0-9-_. \
\u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5\
\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F\
\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\
\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC]{1,30}"
Starting from the set of all characters with script defined to be Arabic, I picked up those that are declared as letters (General Category Lo or Lm), and then omitted those beyond BMP, the Basic Multilingual Plane.
Characters outside BMP are used very rarely, and to represent them in JavaScript syntax, you would need to either include the characters themselves or use two \u notations per character (one for each component of a surrogate pair). This does not sound realistic.
This is of course a “hardwired” solution: it may need updates if new Arabic letters are added to Unicode or the script of a character is changed from or to Arabic (which is highly unlikely). But I don’t expect to see new Arabic letters added to BMP during my lifetime.
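Since the pattern attribute is only checked in the browser, you may want the same rule on the server. A possible sketch in Python, reusing the ranges above; the names ARABIC_LETTERS, NAME_RE and is_valid are hypothetical, and the hyphen is escaped here so it cannot be read as a range operator:

```python
import re

# Same BMP ranges as in the pattern above; Python's re understands \uXXXX.
ARABIC_LETTERS = (
    r"\u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5"
    r"\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F"
    r"\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F"
    r"\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC"
)
NAME_RE = re.compile(r"[a-zA-Z0-9\-_. " + ARABIC_LETTERS + r"]{1,30}")

def is_valid(value):
    """Mimic the pattern attribute, which must match the whole value."""
    return NAME_RE.fullmatch(value) is not None

print(is_valid("Dubai \u062f\u0628\u064a 2024"))  # Latin, Arabic letters, digits
print(is_valid("\u0663\u0664\u0665"))             # Arabic-Indic digits only
```

Note that the class covers letters only, so Arabic-Indic digits and Arabic punctuation are rejected unless you add their ranges.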

Pitfalls when performing internationalization / localization with numbers?

When developing an application that will need to work with a variety of localizations, particularly with "right to left" text, is there a possibility of a case where numbers would need to be converted to "right to left" as well?
I'm no language scholar, but I know the RTL languages I am familiar with present their numbers in LTR.
For instance (using google translate):
I have 345 apples.
In Arabic:
لدي 345 التفاح.
So, I have two questions:
Is it possible to run into a language that uses RTL numbers?
How should internationalizing be handled in such cases?
or,
Is the "accepted norm" to just do numbers using Western Arabic characters, read from left to right?
In the big right-to-left scripts - Arabic, Hebrew and Thaana - numbers always run left to right. (When I say "Arabic", I refer to all the languages that are written in the Arabic script - Arabic, Farsi, Urdu, Pashto and many others.)
Hebrew and Thaana always use European digits, the same 0-9 set as English. There's nothing much to do there, because Unicode automatically takes care of ordering the numbers correctly. But see the comments about isolation below.
It's possible to use European digits in Arabic, too; for example, the Arabic Wikipedia uses them. However, very frequently Arabic texts use a different set of digits - https://en.wikipedia.org/wiki/Eastern_Arabic_numerals . It depends on your users' preferences. Notice also, that in the Persian language the digits are slightly different. From the point of view of right-to-left layout they behave pretty much the same way as European digits, although there are slight differences in the behavior of mathematical signs - for example, the minus can go on the other side. There are some subtleties here, but they are mostly edge cases.
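If you need to switch digit sets yourself, a translation table is one simple way to do it. This is only a sketch: the names below are invented for illustration, and real applications usually leave this to a locale-aware formatting library.

```python
WESTERN = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + d) for d in range(10))  # U+0660..U+0669
PERSIAN = "".join(chr(0x06F0 + d) for d in range(10))       # U+06F0..U+06F9

TO_ARABIC_INDIC = str.maketrans(WESTERN, ARABIC_INDIC)
TO_PERSIAN = str.maketrans(WESTERN, PERSIAN)

def localize_digits(text, table):
    """Replace European digits in text using the given translation table."""
    return text.translate(table)

print(localize_digits("345", TO_ARABIC_INDIC))
print(localize_digits("345", TO_PERSIAN))

# Going back is easy: int() accepts any Unicode decimal digit.
print(int(localize_digits("345", TO_ARABIC_INDIC)))
```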
In both Hebrew and Arabic you may run into a problem with bidi-isolation. For example, if you have a Hebrew paragraph in which you have an English word, and after the word you have numbers, the numbers will appear to the right of the word, although you may have wanted them to appear on the left. That's how the Unicode bidi algorithm works by default. To resolve such things you can use the Unicode control characters RLM and LRM. If you are using HTML5, you can also use the <bdi> tag for this, as well as the CSS rule "unicode-bidi: isolate". These CSS and HTML5 solutions are quite powerful and elegant, but aren't supported in all browsers yet.
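When the text is generated on the server, both approaches can be sketched roughly like this. The function names are hypothetical, and whether RLM or LRM is the right mark depends on the surrounding text, so treat it as a starting point rather than a rule.

```python
import html

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

def isolate_html(fragment):
    """Wrap inserted text in <bdi> so it cannot reorder its neighbours."""
    return "<bdi>" + html.escape(fragment) + "</bdi>"

def separate_with_rlm(word, number):
    """Put an RLM between an LTR word and a number inside RTL text,
    so the number no longer attaches to the preceding Latin word."""
    return f"{word}{RLM} {number}"

print(isolate_html("user <name> 42"))
print(repr(separate_with_rlm("Python", 3)))
```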
I am aware of one script in which the digits run right-to-left: N'Ko, which is used for some languages of Africa. I actually saw websites written in it, but it is far less common than Hebrew and Arabic.
Finally, if you're using JavaScript, you can use the free jquery.i18n library for automatic number conversion. See https://github.com/wikimedia/jquery.i18n . (Disclaimer: I am one of this library's developers.)
Numbers will generally translate as you have them. Even in languages that read in different directions the Western Arabic numbers are typically recognized by the user.

flash cs5.5 as3 - get unicode character of Arabic Presentation forms A and B

I have a string like 'دبي' and I want to get the correct Unicode code point of each character. Currently, I am using str.charCodeAt(index), but for Arabic characters it gives values between 0600 and 06FF. However, I want Arabic Presentation Forms A and B - whichever is actually written.
Can anyone suggest how to do this?
The string you posted consists of three normal Arabic letters in the 0600...06FF range, so what you are getting is the correct Unicode characters. If you mean that you would like to determine the contextual glyph forms used, then that’s outside the character level and cannot be determined from the string. (It can be determined, by applying rules of Arabic writing, which forms should be used, but that’s different from knowing which forms are actually used by the rendering software.)
Arabic Presentation Forms are legacy characters not meant for normal use. Normal rendering is not supposed to convert normal character to such forms but to select glyphs contextually.
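The relationship between the two blocks can be checked with a few lines of Python (used here only for illustration; the question is about AS3). The sketch shows that the string holds ordinary letters, and that a presentation-form character carries a compatibility mapping back to its base letter, not the other way round.

```python
import unicodedata

word = "\u062f\u0628\u064a"  # the string from the question: dal, beh, yeh
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# A presentation form decomposes to the base letter plus a form tag.
initial_beh = "\ufe91"
print(unicodedata.name(initial_beh))
print(unicodedata.decomposition(initial_beh))

# NFKC normalization folds the legacy form back into the normal letter.
folded = unicodedata.normalize("NFKC", initial_beh)
print(f"U+{ord(folded):04X}")
```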

Is it possible to print DOS characters on a website?

I would like to print some kind of ASCII "art" on a web page in pre-tags. These graphics use DOS characters to show a map like old maze games did. I didn't find anything in the HTML special character reference. Is there a way to use these characters in HTML ?
Thanks in advance.
With the right Unicode characters, the old character encodings shouldn't make much odds. The tricky bit may be converting existing ASCII art into Unicode - at which point you need to know the original encoding.
The relevant code charts will be listed on the Unicode "symbols" charts page. In particular, I suspect you'll find the box drawing and block elements charts useful.
You'll need to make sure that your page uses a font which contains the right characters, of course...
As an example, you can render this:
┌┐
└┘
With:
<pre>┌┐
└┘</pre>
Not quite a proper box, but getting there...
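For the conversion step, a minimal sketch in Python, assuming the original art was stored in DOS code page 437 (the byte values below are that code page's box-drawing corners):

```python
import html

# Bytes as a DOS program would have stored a 2x2 box in code page 437.
dos_bytes = b"\xda\xbf\n\xc0\xd9"

# Decoding needs the original encoding; a different code page gives different glyphs.
art = dos_bytes.decode("cp437")

# Escape for HTML and wrap in <pre>; serve the page as UTF-8.
page = "<pre>" + html.escape(art) + "</pre>"
print(page)
```

Bytes below 0x20 decode to control characters here, not to the symbols DOS displayed for them, so art that uses those glyphs needs an extra mapping.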
You can send them in the <pre> tags, although in XHTML you'll need to encapsulate it in <![CDATA[ ... ]]> I think. Be careful though, not all encodings render this correctly. For example, a lot of ASCII art designed for DOS code page 437 (US) fails over here in the UK (850). Eastern Europe suffers especially.
I think the best approach here would be to render images.
EDIT: Oh. You could try , but I'm not sure if that would work.