Choosing between UTF-8 and ISO-8859-1? - html

UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet
I don't fully understand either of these: where should I use each one, and in which cases on my pages?
In particular, which one is suitable for login pages?

Both encoding schemes are widely used on the internet.
In summary, UTF-8 supports many more characters (all characters in the Unicode character set).
There are many languages that ISO-8859-1 cannot represent, Chinese for example.
Read about the languages ISO-8859-1 supports.
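As a quick illustration (a minimal Python sketch, with sample strings of my own), ISO-8859-1 can only encode its 256 Latin-1 characters, while UTF-8 can encode any Unicode character:

# ISO-8859-1 handles Western European text, but nothing outside Latin-1.
latin_text = "café"
chinese_text = "中文"

print(latin_text.encode("iso-8859-1"))   # b'caf\xe9'  (one byte per character)
print(latin_text.encode("utf-8"))        # b'caf\xc3\xa9'  (é takes two bytes)
print(chinese_text.encode("utf-8"))      # b'\xe4\xb8\xad\xe6\x96\x87'

try:
    chinese_text.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("ISO-8859-1 cannot represent it:", exc)

For a login page, as for any other page, UTF-8 is the safer default: user names and passwords in any script survive the round trip unchanged.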

Related

Some characters are not correctly encrypted in A256GCM

I'm trying to encrypt some content that I have in a JSON file.
I have this content localized in several languages such as Spanish, German, Japanese or Chinese (traditional and simplified), and others.
The content can be encrypted, but cannot be decrypted, because some characters are not encrypted correctly. I have checked that the problematic characters are the Japanese or Chinese ones. I have the same problem with some German or Russian characters. It crashes when I try to parse the decrypted content (which is plain text):
JSON.parse(decrypted_plain_text)
Then, I get the error.
Does this algorithm support characters such as Japanese or Chinese characters?
I've tried to change the encoding from UTF-8 to UTF-8 w/o BOM but it doesn't work, either.
The content-encryption algorithm is A256GCM and the CEK is wrapped with A128KW.
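For what it is worth, AES-GCM operates on bytes and has no notion of characters, so CJK text survives encryption as long as the JSON is encoded to UTF-8 bytes before encrypting and decoded from UTF-8 after decrypting. Below is a minimal round-trip sketch using the Python cryptography package rather than the asker's JOSE library, with illustrative key and nonce handling (a real JWE stack wraps the CEK with A128KW for you):

import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

cek = AESGCM.generate_key(bit_length=256)   # 256-bit content-encryption key
nonce = os.urandom(12)                      # 96-bit nonce; never reuse with the same key

payload = {"ja": "こんにちは", "zh": "你好", "de": "Grüße", "ru": "привет"}

# Encode to UTF-8 bytes BEFORE encrypting...
plaintext = json.dumps(payload, ensure_ascii=False).encode("utf-8")
ciphertext = AESGCM(cek).encrypt(nonce, plaintext, None)

# ...and decode from UTF-8 AFTER decrypting, before JSON.parse / json.loads.
decrypted_text = AESGCM(cek).decrypt(nonce, ciphertext, None).decode("utf-8")
assert json.loads(decrypted_text) == payload

If a round trip like this fails, the usual culprit is decoding the decrypted bytes with the wrong charset (Latin-1, for example) rather than the cipher itself.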

Uploading HTML files containing special characters on server destroys them

The title is pretty self explanatory.
I've never encountered this problem before, only when I tried to upload text with special characters into a database, but that is not the case here.
I have HTML files that contain special characters like āšķī etc. All of them are changed to what look like Arabic letters after I upload the files to the server.
What could be the solution?
Unicode
Unicode text files can store text in any language known to humanity. Modern globalized applications often use UTF-8 or UTF-16 to save text files.
UTF-8
UTF-16 little endian
UTF-16 big endian
UTF-32 little endian
UTF-32 big endian
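As a rough illustration (a Python sketch; the sample string is arbitrary), the same text produces different byte sequences under each of these encodings, and the endianness-neutral UTF-16/UTF-32 forms prepend a byte-order mark (BOM):

text = "Aē中"   # one ASCII, one Latin-Extended, one CJK character

for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    data = text.encode(encoding)
    print(f"{encoding:10} {len(data):2} bytes  {data.hex(' ')}")

# Plain "utf-16" (no explicit endianness) writes a BOM first, e.g. ff fe on
# little-endian machines, so readers can tell the byte order apart.
print(text.encode("utf-16").hex(" "))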

How to store unicode data in a format that doesn't support utf-8

Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel .xls files and storing it in ESRI shapefiles (.shp). For versions of Excel > 5.0, text in Excel files is stored as Unicode. However, Unicode (and specifically UTF-8) support for shapefiles is inconsistent, and thus I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
I can definitely use more than just the system codepage. Because .shp files use the .dbf files to store their attribute data, at least all the codepages specified by the .dbf format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows
In addition, some applications support the use of a *.cpg file which specifies additional codepages to use (although I understand support for UTF-8, and I suspect many other codepages, is limited).
Because I am trying to develop a general purpose tool, I can't assume anything about the content of the Unicode in the .xls files.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
Depends on the file format. If it supports Unicode "escape sequences" like XML's &#x20AC; or JSON's \u20AC, then use those, and you won't lose any information. If not, a different approach is required.
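For example (a Python sketch, reusing the euro sign from the answer above), a JSON serializer can be told to escape everything outside ASCII, and an XML numeric character reference can be built from the code point:

import json

text = "price: 20 €"

# JSON: ensure_ascii (the default) escapes the euro sign as \u20ac,
# so the output stays pure ASCII and no information is lost.
print(json.dumps(text))    # "price: 20 \u20ac"

# XML/HTML: a numeric character reference does the same job.
print("".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text))
# price: 20 &#x20AC;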
I would assume, therefore, that I must somehow estimate the "best" codepage to use,
Generally, on a non-Unicode system, you'd convert characters into whatever the default encoding is, not an arbitrary code page.
Edit: So you do get a choice of code pages:
01h DOS USA code page 437
6Ah Greek MS-DOS (437G) code page 737
02h DOS Multilingual code page 850
64h EE MS-DOS code page 852
6Bh Turkish MS-DOS code page 857
67h Icelandic MS-DOS code page 861
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866
C8h Windows EE code page 1250
C9h Russian Windows code page 1251
03h Windows ANSI code page 1252
CBh Greek Windows code page 1253
CAh Turkish Windows code page 1254
04h Standard Macintosh code page 10000
98h Greek Macintosh code page 10006
96h Russian Macintosh code page 10007
68h Kamenicky (Czech) MS-DOS
69h Mazovia (Polish) MS-DOS
97h Eastern European Macintosh
To choose a code page, I would recommend:
Check if your data is plain ASCII. If so, it doesn't matter which code page you choose.
If not, try to find a code page that can exactly represent your data (or if you can't, one that minimizes the unrepresentable characters). Try code page 1252 first, then the other 125x code pages. Don't bother with the DOS code pages unless you have box-drawing characters.
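A rough way to automate that recommendation (a Python sketch; the candidate list is just a subset of the dBASE code pages above, under their Python codec names):

# Pick the code page with the fewest unencodable characters; plain ASCII
# data will encode cleanly under every candidate.
CANDIDATES = ["cp1252", "cp1250", "cp1251", "cp1253", "cp1254"]   # try 1252 first

def best_codepage(text, candidates=CANDIDATES):
    best, best_misses = None, None
    for codepage in candidates:
        misses = sum(1 for ch in text
                     if ch != "?" and ch.encode(codepage, errors="replace") == b"?")
        if misses == 0:
            return codepage, 0          # exact fit, stop looking
        if best_misses is None or misses < best_misses:
            best, best_misses = codepage, misses
    return best, best_misses

print(best_codepage("Łódź and Karlovy Vary"))   # ('cp1250', 0)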
and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
It's the approach we take at work when we need to convert a UTF-8 file into windows-1252 or into EBCDIC. I used Unidecode to help generate the "closest approximations".
We do, however, only replace letters and digits, not punctuation. Replacing “” with "" would break a few file formats.
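A minimal sketch of that approach in Python (Unidecode is the PyPI package mentioned above; the letters-and-digits-only rule is approximated here with isalnum, which is my guess at the detail, not their exact code):

from unidecode import unidecode   # pip install Unidecode

def downgrade(text, codepage="cp1252"):
    """Transliterate only the letters/digits the target code page rejects."""
    out = []
    for ch in text:
        try:
            ch.encode(codepage)
            out.append(ch)                        # representable as-is
        except UnicodeEncodeError:
            if ch.isalnum():
                out.append(unidecode(ch) or "?")  # closest ASCII approximation
            else:
                out.append("?")                   # unencodable punctuation becomes ?, not a rewrite
    return "".join(out)

print(downgrade("Dvořák, Łódź, Kraków"))   # Dvorák, Lódz, Kraków (ř, Ł, ź not in cp1252)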
What language is your text in? If the characters are mostly ASCII, it's probably best to write the original UTF-8 encoded text as such. A non-UTF-8-aware program will still read ASCII text correctly and display garbled ASCII for unknown characters.
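That works because ASCII is a strict subset of UTF-8: for pure-ASCII text the bytes are identical under both, as this small Python check shows:

ascii_text = "Hello, shapefile"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii") == ascii_text.encode("cp1252")

mixed_text = "Héllo"
print(mixed_text.encode("utf-8"))   # b'H\xc3\xa9llo' - only the é differs; a
                                    # non-UTF-8-aware viewer shows it as two odd bytes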

How to deal with HTML-entities for publishing multilingual content

When publishing text online as an HTML page, I face the problem of correctly rendering symbols from several languages that require extended Latin character encoding. In those cases I look up the entity (hex) in the list on this site http://theorem.ca/~mvcorks/code/charsets/auto.html . I wonder if it's possible to save that time by defining some meta tags and their attributes.
Any advice would be much appreciated.
Thanks.
Vitaly Repin
I recommend using the Unicode character set and encoding the characters with UTF-8.
Unicode contains probably all the characters you'll need, and UTF-8 is the most efficient encoding of the Unicode character set in terms of code word length. If you're using UTF-8, you don't need HTML character references, as you can use the characters they represent directly.
Just write your text with the plain characters, tell your editor to save it using UTF-8 as character encoding, and tell your web server to serve the document with UTF-8.
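As a minimal sketch of the "tell your web server" part, here is a toy server built on Python's standard-library http.server, used purely for illustration (a real site would set the charset in its web-server or framework configuration instead):

from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = ('<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        '<body>Plain characters, no entities: ā š ķ ī € 中文</body></html>')

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")   # the bytes on the wire are UTF-8
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()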

How do I sanitize user input for proper content-encoding before I save it?

I've got an application where users input text into forms.
The data is saved into a MySQL database (collation: utf8_general_ci) and then output as XML (encoding: UTF-8).
The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.
This input text often has characters which are incorrect for the output encoding, things like "smart quotes" that come from a document in Windows-1252 encoding.
This causes problems, obviously, when transforming or otherwise working on the XML because the characters are illegal.
So, how to sanitise the input?
Previously, I've used some fairly brute-force methods, things like the "de-moronize" script which consists of a long list of search-and-replace operations.
Is this still the best way to do it? Is there any other way?
Can I just set the accept-charset attribute on the form and have the browser do it for me?
If so, which browsers will do that and are there likely to be any problems?
Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?
As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...
TIA
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.
Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.
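To make that concrete (a Python sketch; the byte values are the cp1252 positions mentioned above), the cp1252 bytes 147/148 decode to the ordinary Unicode characters U+201C/U+201D, which then encode to perfectly legal UTF-8:

word_bytes = b"\x93quoted\x94"              # what a cp1252 document actually contains

text = word_bytes.decode("cp1252")          # '“quoted”'  (U+201C ... U+201D)
print([hex(ord(c)) for c in (text[0], text[-1])])   # ['0x201c', '0x201d']

utf8_bytes = text.encode("utf-8")           # what the browser submits / you store
print(utf8_bytes)                           # b'\xe2\x80\x9cquoted\xe2\x80\x9d'
assert utf8_bytes.decode("utf-8") == text   # round-trips with no loss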
If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:
Content-Type: text/html;charset=utf-8
HTTP header and/or the:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element in <head>.
Can I just set the accept-charset attribute on the form and have the browser do it for me?
No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!
how come my database is accepting these characters, which are reserved/control characters in UTF-8?
In MySQL UTF-8 is just a collation, used for comparison and ordering. It's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because overlong sequences, invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
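In Python, a strict decode is enough to catch this, since the UTF-8 codec rejects overlong forms such as a two-byte encoding of '<' (a small sketch, assuming you receive the raw request body as bytes):

def checked_utf8(raw: bytes) -> str:
    """Reject anything that is not well-formed, shortest-form UTF-8."""
    try:
        return raw.decode("utf-8")     # strict by default
    except UnicodeDecodeError:
        raise ValueError("input is not valid UTF-8")

print(checked_utf8("résumé, ok".encode("utf-8")))   # valid text passes through

overlong_lt = b"\xc0\xbc"              # overlong 2-byte encoding of '<' (0x3C)
try:
    checked_utf8(overlong_lt)
except ValueError as exc:
    print("rejected:", exc)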
ETA:
So, I entered a string containing byte 146
No, you entered the Unicode character U+2019 (byte 146 in cp1252 is the right single quotation mark). The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.
(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, they'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)
It went into the database as 146
Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?
it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser
Then it did not come out as a single 146-byte. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)
I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.
You might try the Perl Encode module. It supports conversion between a number of character sets, including UTF-8 of course. I just checked my install of Perl and it also supports "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one-liner:
perl -MEncode -e 'print map {"$_\n"} Encode->encodings(":all");'
"Can I just set the accept-charset attribute on the form and have the browser do it for me?"
Only if you're prepared to trust "the browser" - that might be suitable in some applications, but in general it's leaving yourself wide open to mischief (or worse).
(Also see bobince's warnings about IE...)
Iain