How can I use special characters in IE 6?
For example: ș (LATIN SMALL LETTER S WITH COMMA BELOW).
I used ș, but it did not work.
Thank you in advance
The fonts shipped with the old versions of Windows that run IE 6 apparently do not contain the character LATIN SMALL LETTER S WITH COMMA BELOW. Since it would be unrealistic to expect users to install new fonts just to see some special characters, the best option is to use U+015F LATIN SMALL LETTER S WITH CEDILLA instead, either as the character “ş” itself or as the character reference &#x15f; or &#351;.
From the Unicode perspective, s with comma below and s with cedilla are glyph variants, but they have been defined as separate characters, as requested by the Romanian standards institute, to allow the distinction to be made at the character level. However, even in Romania this distinction is not made consistently, and s with cedilla has much better font coverage.
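As a rough illustration of that fallback, here is a minimal Python sketch. The helper names `cedilla_fallback` and `char_ref` are made up for this example, and only the two s letters discussed above are mapped; it simply swaps the comma-below letters for their cedilla counterparts and prints the matching character reference.

```python
# Hypothetical helpers, not part of any library: replace S/s WITH COMMA BELOW
# by the better-supported S/s WITH CEDILLA.
COMMA_TO_CEDILLA = {
    0x0218: 0x015E,  # LATIN CAPITAL LETTER S WITH COMMA BELOW -> S WITH CEDILLA
    0x0219: 0x015F,  # LATIN SMALL LETTER S WITH COMMA BELOW -> s WITH CEDILLA
}


def cedilla_fallback(text):
    """Return text with comma-below s letters replaced by cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)


def char_ref(ch):
    """Return the hexadecimal HTML character reference for one character."""
    return "&#x{:x};".format(ord(ch))


if __name__ == "__main__":
    fallback = cedilla_fallback("\u0219")
    print(char_ref(fallback))  # reference for U+015F
```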
On Windows 7, Chrome uses the "Microsoft YaHei" font and displays the character U+2F804 (你) as U+4F60 (你),
but this font has no glyph for U+2F804.
The results found using fontCreator are shown below
On Windows 10 the result is correct, because the Yu Gothic font is available.
What puzzles me is why Windows 7 shows U+4F60 (你) instead.
The code is at: http://yanglikun.github.io/encoding/code.html
I think that when Microsoft YaHei has no glyph for the character, the browser should display a question mark, 口, or some other placeholder, not the wrong character U+4F60 (你).
A note: Unicode code points and font glyphs are not directly related. The actual glyph depends on context, ligatures, combining characters, language, and possibly other factors (see the Unicode Standard).
Unicode defines U+2F804 as decomposable to U+4F60. Software often normalizes Unicode text, either by decomposing it (for example, splitting base letters and accents in Latin-script languages) or by composing it. The Unicode Standard describes these algorithms. So U+4F60 is considered semantically fully equal to U+2F804, and it is the preferred form. A decomposition that keeps the same number of code points is uncommon, but not unheard of. It is also fairly common for a decomposition to work in one direction only, with no mapping back in the other direction.
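The mapping can be checked with Python's standard `unicodedata` module. This is only a small sketch of the normalization behaviour described above; it says nothing about what Chrome or the Windows font stack actually does internally.

```python
import unicodedata

compat = "\U0002F804"  # CJK COMPATIBILITY IDEOGRAPH-2F804
unified = "\u4F60"     # the ordinary unified ideograph

# The decomposition mapping recorded in the Unicode Character Database,
# returned as a string of hexadecimal code points.
decomp = unicodedata.decomposition(compat)

# Every normalization form replaces the compatibility ideograph.
forms = {
    form: unicodedata.normalize(form, compat)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# There is no mapping in the other direction: U+4F60 stays as it is.
reverse = unicodedata.normalize("NFC", unified)

if __name__ == "__main__":
    print(decomp)
    for form, result in forms.items():
        print(form, "U+{:04X}".format(ord(result)))
```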
This character is in the CJK Compatibility Ideographs Supplement block, so the important part is compatibility; the Wikipedia article confirms this (https://en.wikipedia.org/wiki/CJK_Compatibility_Ideographs_Supplement).
Compatibility code points were introduced to ease the adoption of Unicode by allowing lossless round-trip conversion from other encodings. That way, Unicode could be implemented in one layer at a time, transparently, without requiring changes to the other layers (or worse, changing the whole stack in one step).
I have a web application with embedded fonts, and there is a small problem. The application is in Persian and English, but all numbers on the page are shown with Persian digits, even the numbers in the English content. This is the screenshot of the web application.
Is there any way to show numbers like Microsoft Word (use Persian numbers in Persian text and English numbers in English text)?
Technically, you could put both common European digits 0, 1, 2, … and Arabic digits ٠, ١, ٢, … as alternative glyphs for the characters U+0030 DIGIT ZERO, U+0031 DIGIT ONE, U+0032 DIGIT TWO, etc., into the same font, using OpenType features, and you could use CSS tools for selecting between them (though this is not yet supported by all browsers). But then you would need to be a font designer, or at least know how to edit a font.
The normal way, however, is to treat European digits and Arabic digits as distinct characters, i.e. make the difference at the character level. So the code that generates the calendar should take care of the issue. And then you just need a font that has both sets of digits, properly assigned to the separate characters.
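A minimal sketch of making the difference at the character level, in Python. The function name `localize_digits` and the `"fa"` language code check are assumptions for illustration; the calendar code itself is not shown in the question. Persian digits are U+06F0 to U+06F9.

```python
# Persian (Extended Arabic-Indic) digits, U+06F0..U+06F9.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
TO_PERSIAN = str.maketrans("0123456789", PERSIAN_DIGITS)


def localize_digits(text, lang):
    """Return text with European digits replaced for Persian content only."""
    if lang == "fa":
        return text.translate(TO_PERSIAN)
    return text  # English (and anything else) keeps European digits


if __name__ == "__main__":
    print(localize_digits("2024", "fa"))
    print(localize_digits("2024", "en"))
```

The generating code then emits the right characters for each language, and the font only has to contain glyphs for both sets of digits.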
It's quite easy if you use the lang attribute on the parent div of your calendar:
<div lang="en" style="font-family:Tahoma">
<!--Calendar code here -->
</div>
I want to insert the Arabic letters in the pattern just like the English letters
pattern="[a-zA-Z0-9-_. ]{1,30}"
I have no idea how to accomplish the action.
Arabic and Persian share the same character range, so this code can be used for Arabic too.
[أ-يa-zA-Z]
This is the reference for finding the character range of Unicode languages:
preg_replace and preg_match arabic characters
http://unicode.org/charts/
The HTML5 pattern attribute follows JavaScript regular expression syntax, which makes things rather awkward. You cannot test character properties, for example. Instead, you need to list the allowed characters or ranges of characters.
Using the normative Scripts.txt file (by the Unicode Consortium), which defines the script (writing system) of all characters, I constructed the following. The trailing backslashes only mark line breaks added here for readability; the actual attribute value must be written on a single line.
pattern=
"[a-zA-Z0-9-_. \
\u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5\
\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F\
\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\
\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC]{1,30}"
Starting from the set of all characters with script defined to be Arabic, I picked up those that are declared as letters (General Category Lo or Lm), and then omitted those beyond BMP, the Basic Multilingual Plane.
Characters outside BMP are used very rarely, and to represent them in JavaScript syntax, you would need to either include the characters themselves or use two \u notations per character (one for each component of a surrogate pair). This does not sound realistic.
This is of course a “hardwired” solution: it may need updates if new Arabic letters are added to Unicode or the script of a character is changed from or to Arabic (which is highly unlikely). But I don’t expect to see new Arabic letters added to BMP during my lifetime.
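To try the character ranges outside a browser, here is a rough Python equivalent. The names `ARABIC_LETTERS`, `NAME_PATTERN` and `is_valid` are made up for this sketch; `re.fullmatch` is used because the pattern attribute has to match the whole value. The hyphen is escaped here to keep it unambiguously literal, which is a choice of this example rather than part of the pattern above.

```python
import re

# Same BMP Arabic letter ranges as in the pattern attribute above.
ARABIC_LETTERS = (
    r"\u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5"
    r"\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F"
    r"\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F"
    r"\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC"
)

# ASCII letters, digits, hyphen, underscore, full stop, space, Arabic letters.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9\-_. " + ARABIC_LETTERS + r"]{1,30}")


def is_valid(value):
    """Return True if the whole value consists of 1 to 30 allowed characters."""
    return NAME_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_valid("\u0633\u0644\u0627\u0645"))  # an Arabic word
    print(is_valid("hello!"))                    # "!" is not allowed
```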
When developing an application that will need to work with a variety of localizations, particularly with "right to left" text, is there a possibility of a case where numbers would need to be converted to "right to left" as well?
I'm no language scholar, but I know the RTL languages I am familiar with present their numbers in LTR.
For instance (using google translate):
I have 345 apples.
In Arabic:
لدي 345 التفاح.
So, I have two questions:
Is it possible to run into a language that uses RTL numbers?
How should internationalizing be handled in such cases?
or,
Is the "accepted norm" to just do numbers using Western Arabic characters, read from left to right?
In the big right-to-left scripts (Arabic, Hebrew and Thaana), numbers always run left to right. (When I say "Arabic", I refer to all the languages that are written in the Arabic script: Arabic, Farsi, Urdu, Pashto and many others.)
Hebrew and Thaana always use European digits, the same 0-9 set as English. There's nothing much to do there, because Unicode automatically takes care of ordering the numbers correctly. But see the comments about isolation below.
It's possible to use European digits in Arabic, too; for example, the Arabic Wikipedia uses them. However, Arabic texts very frequently use a different set of digits: https://en.wikipedia.org/wiki/Eastern_Arabic_numerals . It depends on your users' preferences. Note also that the Persian digits are slightly different. From the point of view of right-to-left layout they behave pretty much the same way as European digits, although there are slight differences in the behavior of mathematical signs; for example, the minus can go on the other side. There are some subtleties here, but they are mostly edge cases.
In both Hebrew and Arabic you may run into a problem with bidi-isolation. For example, if you have a Hebrew paragraph in which you have an English word, and after the word you have numbers, the numbers will appear to the right of the word, although you may have wanted them to appear on the left. That's how the Unicode bidi algorithm works by default. To resolve such things you can use the Unicode control characters RLM and LRM. If you are using HTML5, you can also use the <bdi> tag for this, as well as the CSS rule "unicode-bidi: isolate". These CSS and HTML5 solutions are quite powerful and elegant, but aren't supported in all browsers yet.
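Here is a small Python sketch of the two character-level approaches. The function names are invented for this example. `hebrew_sentence` puts an RLM after the English word so that the following number no longer joins its left-to-right run. `isolate` wraps a string in FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069), which I am using here as a plain-text counterpart of bidi isolation; that is my substitution, not something <bdi> itself inserts. The code only builds the strings; how they are displayed still depends on the renderer.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def hebrew_sentence(hebrew, english_word, number):
    """Join the parts, with an RLM between the English word and the number."""
    return hebrew + " " + english_word + RLM + " " + str(number)


def isolate(text):
    """Wrap text so its direction does not affect the surrounding text."""
    return FSI + text + PDI


if __name__ == "__main__":
    # "\u05e9\u05dc\u05d5\u05dd" is a Hebrew word used as sample text.
    marked = hebrew_sentence("\u05e9\u05dc\u05d5\u05dd", "Unicode", 345)
    wrapped = "\u05e9\u05dc\u05d5\u05dd " + isolate("Unicode") + " 345"
    print(repr(marked))
    print(repr(wrapped))
```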
I am aware of one script in which the digits run right-to-left: N'Ko, which is used for some languages of Africa. I actually saw websites written in it, but it is far less common than Hebrew and Arabic.
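The difference is visible in the Unicode character properties, which Python's standard `unicodedata` module exposes. This sketch just prints the bidirectional class and numeric value of the digit three in four digit sets; the labels in `SAMPLES` are my own.

```python
import unicodedata

# The digit "three" in several digit sets.
SAMPLES = {
    "European": "3",
    "Arabic-Indic": "\u0663",
    "Persian": "\u06F3",
    "N'Ko": "\u07C3",
}


def describe(ch):
    """Return (bidirectional class, digit value) for one digit character."""
    return unicodedata.bidirectional(ch), unicodedata.digit(ch)


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        bidi_class, value = describe(ch)
        print(label, bidi_class, value)
```

The European and Persian digits have class EN (European number) and the Arabic-Indic digits have class AN (Arabic number); both classes are laid out left to right within a number. The N'Ko digits have the strong right-to-left class R, like letters of a right-to-left script.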
Finally, if you're using JavaScript, you can use the free jquery.i18n library for automatic number conversion. See https://github.com/wikimedia/jquery.i18n . (Disclaimer: I am one of this library's developers.)
Numbers will generally translate as you have them. Even in languages that read in different directions the Western Arabic numbers are typically recognized by the user.
When I copy/paste text from most sites and pdfs, the following characters are almost always in the unicode equivalent:
double quote: " is “ and ” (&ldquo; and &rdquo;)
single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
ellipsis: ... is … (&hellip;)
I understand ones that can't be represented without unicode like © and ¢, but even for those, I wonder.
When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
When should you use these unicode equivalents? Are they more semantic than not using them?
Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.
I think there's a simple explanation: MS Word converts these characters and sequences automatically as you type, and a lot of text on the internet has been copied from this word processor.
Most of the articles I get for my site from other authors are sent as .doc file and I have to convert it. Usually, it contains these characters you've mentioned.
I'd also add one more: many different types of dashes instead of the hyphen. And also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are unicode). It's just important to remember it when playing around with regex etc (especially the dashes can be tricky and hard to spot).
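If you do need plain ASCII punctuation somewhere (for example before running a regex that only expects the hyphen), a translation table is one simple option. This is a sketch in Python; the table `TYPOGRAPHIC_TO_ASCII` and the function `asciify` are hypothetical names, and the table covers only the characters mentioned in this thread.

```python
# Map typographic punctuation to plain ASCII approximations.
TYPOGRAPHIC_TO_ASCII = {
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x201E: '"',    # double low-9 quotation mark (low opening quote)
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x2026: "...",  # horizontal ellipsis
    0x2010: "-",    # hyphen
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
}


def asciify(text):
    """Return text with typographic quotes, dashes and ellipsis replaced."""
    return text.translate(TYPOGRAPHIC_TO_ASCII)


if __name__ == "__main__":
    print(asciify("\u201cHi\u201d \u2026 it\u2019s pages 1\u20132"))
```

This loses typographic information, so it only makes sense for text that is headed into code or a plain-text tool, not for the pages themselves.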
HTML entities serve a triple purpose:
Being able to use characters that do not belong to the document character set, e.g., insert a euro symbol in an ISO-8859-1 document.
Escape characters that have a special meaning in HTML, such as angle brackets.
Make it easier to type characters that are not in your keyboard or are not supported by your editor, e.g. a copyright symbol.
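The three purposes above can be tried out with Python's standard library, as a quick sketch (the variable names are mine): `xmlcharrefreplace` turns characters outside the target encoding into numeric references, `html.escape` handles the special characters, and `html.unescape` shows what a named entity stands for.

```python
import html

# 1. A euro sign cannot be encoded in ISO-8859-1, so it becomes a reference.
latin1_bytes = "price: 5 \u20ac".encode("iso-8859-1", "xmlcharrefreplace")

# 2. Characters with special meaning in HTML are escaped.
escaped = html.escape("<b> & </b>")

# 3. Named entities are a typing convenience for characters such as (c).
copyright_sign = html.unescape("&copy;")

if __name__ == "__main__":
    print(latin1_bytes)
    print(escaped)
    print(copyright_sign)
```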
Update:
My info is correct but I suspect I've answered the wrong question...
On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, where programmers don't care and just use regular old quotes ".
The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which uses windows-1252 encoding. When you serve this content up via AJAX it usually breaks, because AJAX uses UTF-8 encoding by default.
Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
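The mismatch is easy to reproduce. The following Python sketch assumes the pasted text was stored as windows-1252 (`cp1252` in Python), which is an assumption about the setup; the helper `decodes_as_utf8` is a made-up name.

```python
def decodes_as_utf8(data):
    """Return True if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "\u201cquoted\u201d"             # text with curly double quotes

cp1252_bytes = text.encode("cp1252")    # one byte per curly quote
utf8_bytes = text.encode("utf-8")       # three bytes per curly quote

# Reading UTF-8 bytes as windows-1252 gives the familiar mojibake.
mojibake = "\u201c".encode("utf-8").decode("cp1252")

if __name__ == "__main__":
    print(cp1252_bytes, decodes_as_utf8(cp1252_bytes))
    print(utf8_bytes, decodes_as_utf8(utf8_bytes))
    print(mojibake)
```

Bytes written in one encoding and read in the other either fail to decode or turn into garbage, which is why using UTF-8 end-to-end avoids the problem.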
When you copy-pasta text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.