The title is pretty self explanatory.
I've never encountered this problem before. Only when I've tried to upload text with special characters on a database, but this is not the case.
I have HTML files that contain special characters like - āšķī etc. All of them are changed to some ?arab? letters after I upload the files on server.
What could be the solution?
Unicode
Unicode text files can store text in any language known to humanity. Modern globalized applications often use UTF-8 or UTF-16 to save text files.
UTF-8
UTF-16 little endian
UTF-16 big endian
UTF-32 little endian
UTF-32 big endian
Related
I'm trying to encrypt some content that I have in a JSON file.
I have this content localized in several languages such as Spanish, German, Japanese or Chinese (traditional and simplified), and others.
The content can be encrypted, but cannot be unencrypted because of some character are not encrypted correctly. I have checked the problematic characters are the Japanese or Chinese ones. I have the same problems with some German
or Russian characters. It crashes when I try to parse the content (that is plain text):
JSON.parse(decrypted_plain_text)
Then, I get the error.
Does this algorithm support characters such as Japanese or Chinese characters?
I've tried to change the encoding from UTF-8 to UTF-8 w/o BOM but it doesn't work, either.
The algorithm is A256GCM and the CEK is A128KW.
I have a test page that displays two images. One called hello.bmp and another called 徘吐驴欸觰.bmp (this is a random collection of Chinese characters - apologies if it means something weird). For the latter image, I use an encoded format in the page's HTML.
The html is pretty straight forward:
<img src="%E5%BE%98%E5%90%90%E9%A9%B4%E6%AC%B8%E8%A7%B0.bmp" />
<img src="hello.bmp" />
In Internet explore 7, the encoded filepath does not display (Red x). All other browsers display it.
Does anyone know what would cause this? Can it be avoided?
Character encoding of file:/// URLs works differently across browsers on Windows.
Windows filenames are natively Unicode-based, so when you use a URL, which is byte-based, it has to convert that sequence of bytes to Unicode characters using an encoding. What encoding? There is no standard to say, but there are two obvious possibilities:
UTF-8, since it covers everything and is a popular default encoding, also used by the IRI standard for putting Unicode in URIs;
the (misleadingly-named) “ANSI” code page, which is an arbitrary default that varies from system to system. On a Western European Windows install it will be code page 1252 (which is similar to ISO-8859-1); on a Chinese Windows install it will be code page 936 (similar to GB2312).
The ANSI code page is a pain because you never know what it's going to be, it's never UTF-8, and if your filename contains characters that don't exist in ANSI—which will certainly be the case if you have the filename 徘吐驴欸觰.bmp on a Western Windows install—you can't access the file at all.
So which do the browsers use?
IE: ANSI code page
Safari/Opera: UTF-8
Chrome/Firefox: UTF-8, unless the bytes are not a valid UTF-8 sequence, in which case the ANSI code page is used instead.
So in conclusion, you can't reliably use non-ASCII characters in file:/// URLs at all.
This is in contrast to HTTP. The IIS web server, for example, has the same UTF-8-with-fallback-to-ANSI behaviour as Chrome and Firefox. Non-ASCII characters via IRI and a suitably-configured server are fine, but not the local filesystem.
(On non-Windows platforms filenames are natively bytes, usually representing UTF-8-encoded characters, but still bytes. Oo there is no ambiguity between the filesystem names and the byte-based URL %-sequences.)
die ANSI code page die. Why won't Microsoft kill you? You have long outstayed your welcome. You ruin everything.
We are currently working on a I18N project. I am wondering what are the complications of having the non-ascii characters in the URL. If its not advisable, what are the alternatives to deal with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to specific country and I need not worry about the world wide public accessing this site. I understand that from usability point of view, It is really annoying. What are the other technical problem associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as
octets according to the UTF-8
character encoding; then only those
octets that do not correspond to
characters in the unreserved set
should be percent-encoded. (...) For
example, the character A would be
represented as "A", the character
LATIN CAPITAL LETTER A WITH GRAVE
would be represented as "%C3%80", and
the character KATAKANA LETTER A would
be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
I would say no. The reason is simple -> if you rely on world wide public, then it would be a big problem for people to type your url. I live in "cyrillic" world, it is possible to create cyrillic urls, but no one succeed with that, because even we are pretty lazy to change language and get used to type latin...
Update:
I can't say about alternatives, but sometimes some languages have informal or formal letter substitute, e.g. in German you can write Ö but in url you could see OE instead. Also you can consider english words, or words with similar sounds (so people from your country can remeber that writing, and other "countries" won't harm
depends on the target users... for example Nürnberg.de also looks at nuernberg.de for sake to make it easily accessible for native German user(as German keyboard is default and has all 4 extra key-symbols (öäüß) avaible to all German speakers), and do not forget that one of the goal I18N is to provide native language feel to the end user. Mac and Linux user have even more initiative way, like by clicking Alt+u on Mac will induce umlaut in characters to deal with I18N inputing.
I was just wondering what are the
complications of having the non-ascii
characters in the URL.
but the way you laid your question, it seems that your question is more around URI, rather then URL... and you are trying to fuse URN with non-ascii characters inside URI. there are no complications in it, if you know where and how to parse the your URN at server ( for example: in case of Django based server, the URN can be parsed and handled using regex inside url.py ).. all you need to keep in mind is that with web2.0( Ajax javascript based) evolution, everything mainly runs in utf-8, as Javascript specification demands utf-8 encoding. And thus utf-8 has evolving into a sort of standard. stick with utf-8 encoding specs, and you will hardly be facing any complications in URI parsing and working around it.
for example. check the URI http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी .. irrespective of the encoding you write it in addressbar, browser will translate it to UTF-8, and send it to server.
NOTE : beside UTF-8, there are some symbols that are encoded using percentage encoding.. more about it can be located here...
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ascii characters in an url, but it's ugly because spécial caracters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp
Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel .xls files, and storing it in ESRI shapefiles .shp. For versions of Excel > 5.0, text in excel files is stored as Unicode. However, Unicode (and specifically UTF-8 support for shapefiles is inconsistent and thus I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
I can definitely use more than just the system codepage. Because .shp files use the .dbf files to store their attribute data, at least all the codepages specified by the .dbf format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows
In addition, some applications support the use of an *.cpg file which specifies additional codepages to use (although I understand support for utf-8, and I suspect many other codepages, is limited).
Because I am trying to develop a general purpose tool, I can't assume anything about the content of the Unicode in the .xls files.
What is the best practice in a
situation where you must convert a
Unicode string to a string in an
unknown but specific codepage?
Depends on the file format. If it supports Unicode "escape sequences" like XML's € or JSON's \u20AC, then use those, and you won't lose any information. If not, a different approach is required.
I would assume, therefore, that I must
somehow estimate the "best" codepage
to use,
Generally, on a non-Unicode system, you'd convert characters into whatever the default encoding is, not an arbitrary code page.
Edit: So you do get a choice of code pages:
01h DOS USA code page 437
6Ah Greek MS-DOS (437G) code page 737
02h DOS Multilingual code page 850
64h EE MS-DOS code page 852
6Bh Turkish MS-DOS code page 857
67h Icelandic MS-DOS code page 861
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866
C8h Windows EE code page 1250
C9h Russian Windows code page 1251
03h Windows ANSI code page 1252
CBh Greek Windows code page 1253
CAh Turkish Windows code page 1254
04h Standard Macintosh code page 10000
98h Greek Macintosh code page 10006
96h Russian Macintosh code page 10007
68h Kamenicky (Czech) MS-DOS
69h Mazovia (Polish) MS-DOS
97h Eastern European Macintosh
To choose a code page, I would recommend:
Check if your data is plain ASCII. If so, it doesn't matter which code page you choose.
If not, try to find a code page that can exactly represent your data (or if you can't, one that minimizes the unrepresentable characters). Try code page 1252 first, then the other 125x code pages. Don't bother with the DOS code pages unless you have box-drawing characters.
and then convert all non-supported
characters into their closest
approximation in that codepage (or the
dreaded ?). Is this the usual
approach?
It's the approach we take at work when we need to convert a UTF-8 file into windows-1252 or into EBCDIC. I used Unidecode to help generate the "closest approximations".
We do, however, only replace letters and digits, not punctuation. Replacing “” with "" would break a few file formats.
What language is your text in? If the characters are mostly ASCII, it's probably best to write the original UTF-8 encoded text as such. A non-UTF-8-aware program will still read ASCII text correctly and display garbled ASCII for unknown characters.
In case of publishing any text online as a HTML page – I face the problem of the correct reflection of symbols of several languages which require extended Latin character encoding. In this case I’m searching the Entity (hex) from the list on this site http://theorem.ca/~mvcorks/code/charsets/auto.html . I wonder If it’s possible to save my time via definition of any meta-tags and their attributes.
Any advice would be much appreciated.
Thanks.
Vitaly Repin
I recommend you to use the Unicode charset and encode the characters with UTF-8.
Unicode contains probably all characters you’ll need and UTF-8 is the most efficient encoding for the Unicode charset concerning the code word lengths. If you’re using UTF-8, you don’t need the HTML character references as you can use the character they represent themselves.
Just write your text with the plain characters, tell your editor to save it using UTF-8 as character encoding, and tell your web server to serve the document with UTF-8.