Which character encoding to use - MySQL

If I go through my Twitter feed I can see tweets in different languages (some are English, some are Chinese, some have European characters, some even have emojis). I would also like to support multiple languages and emojis inside my app. If I have a MySQL database, for example, with a column called 'message_content' which stores message content, how can I ensure the data in this column can support all languages plus emojis?
I am not sure if it is as simple as choosing a character encoding and that's it, or if it is more complicated than that.

utf8mb4 is a good choice for this; unlike MySQL's older utf8 (utf8mb3) character set, it can store the full Unicode range, including emoji.
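As a minimal sketch of what that looks like (the table is hypothetical, reusing the column name from the question), declare utf8mb4 on the table and on the connection:

    -- Hypothetical table using the column name from the question.
    CREATE TABLE messages (
        id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        message_content TEXT
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- The client connection must also use utf8mb4, otherwise
    -- 4-byte characters such as emoji are still mangled.
    SET NAMES utf8mb4;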

Related

How to reliably transform ISO-8859 encoded characters into HTML entities with NodeJS?

The Expedia Hotel Database is providing some of its data using the ISO-8859 encoding:
Files with ONLY English content are ISO-8859.
However:
ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.
So it is a series of different encodings with notable differences, rather than a single one. My problem:
How can I convert their "ONLY English content" data using NodeJS into a safer form to store in my database, one that I can reliably deliver to the user's browser without worrying that the data gets corrupted at the user's end?
I am thinking of trying to convert all data from ISO/IEC 8859-X (for each X = 1,...,16) into HTML entities first, and then check for the presence of non-ASCII characters, which would mean the encoding was not correct and I have to try the next X. If none of the X works, that means this data entry is corrupted and should be discarded, I suppose, as it is unlikely to be displayed correctly. The whole task feels somewhat cumbersome, so I am wondering if there are simpler ways.
Note that even though the content is declared "ONLY English content", many data entries do actually contain accented characters that might get corrupted in a wrong encoding.

What collation must I use (utf8_general_ci, utf8_unicode_ci, or any other) for all world languages?

We are developing an Android app. The app accepts text from users and uploads it to a server (MySQL). This text is then read by other users.
While testing I found that Hindi text gets inserted into the column as '?????'. Then, after an SO search, I changed the collation to utf8_general_ci.
I am new to collations. I want to let users input text in any language in the world and let others access it. What shall I do? Accuracy is a must.
But I saw a comment where someone says, "You should never, ever use utf8_general_ci. It simply doesn’t work. It’s a throwback to the bad old days of ASCII stooopeeedity from fifty years ago. Unicode case-insensitive matching cannot be done without the foldcase map from the UCD. For example, “Σίσυφος” has three different sigmas in it; or how the lowercase of “TSCHüẞ” is “tschüß”, but the uppercase of “tschüß” is “TSCHÜSS”. You can be right, or you can be fast. Therefore you must use utf8_unicode_ci, because if you don’t care about correctness, then it’s trivial to make it infinitely fast."
Your question title is asking about collations, but in the body you say:
I want to let user input text in any language in the world and others get the access.
So, I'm assuming that is what you're specifically after. To clarify, collations affect how MySQL compares strings with each other, but they are not what ultimately opens up the possibility of storing Unicode characters.
For storage you need to ensure that the character set is defined correctly. MySQL allows you to specify character set and collation values at the column level, but it also allows you to specify defaults at the table and database level. In general I'd advise setting defaults at the database and table level, and letting MySQL handle the rest when defining columns. Note that if columns already exist with a different character set, you'll need to investigate changing it. Depending on what you're using to communicate with MySQL, you may also need to specify a character encoding for the connection.
Note that utf8mb4 is an absolute must for the character set; do not use plain utf8, as you won't be able to store Unicode characters that take 4 bytes in UTF-8, such as emoji.
As for the collation to use, I don't have a firm recommendation, as it depends on what you're aiming for: speed or accuracy. There is a fair amount of information around which covers the topic in other answers.
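As a rough sketch of what that setup looks like in practice (the database and table names here are made up):

    -- Default character set and collation for new tables in this database.
    ALTER DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- Convert an existing table and its text columns in place.
    ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- Make sure the connection itself uses utf8mb4.
    SET NAMES utf8mb4;

Run a conversion like this against a backup first, since ALTER TABLE ... CONVERT TO rewrites the column data.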

Working with English, Arabic, and Chinese in MySQL

A few questions regarding character sets:
When storing English, Arabic, and Chinese in a MySQL database, is there any character set that supports all of these languages?
Will Chinese and Arabic numbers still be stored in integer and decimal type fields?
Are there any other limitations I am not thinking about?
Any guidance is appreciated!
You can use a Unicode encoding (UTF-8) to store the text; see http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
For numbers, the database stores the actual number, in binary, and not any particular representation of that number. What you need to do is display the numbers in the client program using the correct locale. For PHP see PHP: Locale aware number format (although I suspect that may only do things like choose the correct symbol for the decimal point, and not change from 0-9 for digits).
For more background see https://en.wikipedia.org/wiki/Internationalization_and_localization
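To make the first two points concrete, here is a minimal sketch (the table and column names are made up): text in any script goes into a utf8mb4 column, while numbers go into ordinary numeric columns regardless of how they will later be displayed.

    CREATE TABLE listings (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),      -- English, Arabic, or Chinese text
        price DECIMAL(10, 2)     -- stored as a number, not as digits in any particular script
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    INSERT INTO listings (title, price) VALUES
        ('Grand Hotel', 199.00),
        ('فندق كبير', 199.00),
        ('大酒店', 199.00);      -- the same numeric value in every row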

Finding special characters in MySQL database

Does anyone know of a quick and easy way to locate special characters that didn't get converted correctly when data was imported into MySQL?
I think this is an issue due to data encoding (e.g. Latin-1 vs. UTF-8). Regardless of where the issue first occurred, I'm stuck with junk in my data that I need to remove.
There's unlikely to be an easy function for this because, for example, a broken UTF-8 special character will consist of two valid ISO-8859-1 characters. So while there are patterns of what those broken characters look like, there is no sure-fire way of identifying them.
You could build a search+replace function to replace the most common occurrences in your language (e.g. Ãœ for Ü if imported from UTF-8 into ISO-8859-1).
That said, it would be best to restart the import with the correct settings, if at all possible.
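If a full re-import isn't possible, a manual clean-up along those lines might look like this (the table and column names are hypothetical, the pattern list would need to be extended for your data, and the table should be backed up first):

    -- Find rows containing typical UTF-8-read-as-Latin-1 debris.
    SELECT id, body FROM articles
    WHERE body LIKE '%Ã%' OR body LIKE '%â€%';

    -- Replace one known-broken sequence at a time.
    UPDATE articles SET body = REPLACE(body, 'Ãœ', 'Ü') WHERE body LIKE '%Ãœ%';
    UPDATE articles SET body = REPLACE(body, 'Ã¤', 'ä') WHERE body LIKE '%Ã¤%';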

What character encoding should I use for a web page containing mostly Arabic text? Is utf-8 okay?

What character encoding should I use for a web page containing mostly Arabic text?
Is utf-8 okay?
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
However, if you were wondering what encoding would be most efficient:
All Arabic characters can be encoded using a single UTF-16 code unit (2 bytes), but they may take either 2 or 3 UTF-8 code units (1 byte each), so if you were just encoding Arabic, UTF-16 would be the more space-efficient option.
However, you're not just encoding Arabic: you're also encoding a significant number of characters that can be stored in a single byte in UTF-8 but take two bytes in UTF-16, namely all the HTML markup characters (<, &, >, =) and all the HTML element names.
It's a trade-off and, unless you're dealing with huge documents, it doesn't matter.
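If you want to check the UTF-8 side of that comparison directly in MySQL (assuming a utf8mb4 connection), LENGTH() counts bytes while CHAR_LENGTH() counts characters:

    -- 'مرحبا' is 5 Arabic letters; each takes 2 bytes in UTF-8 here.
    SELECT CHAR_LENGTH('مرحبا') AS char_count,  -- 5
           LENGTH('مرحبا') AS byte_count;       -- 10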
I develop mostly Arabic websites, and these are the two encodings I use:
1. Windows-1256
This is the most common encoding Arabic websites use. It works in most cases (90%) for Arabic users.
Here is one of the biggest Arabic web-development forums: http://traidnt.net/vb/. You can see that they are using this encoding.
The problem with this encoding is that if you are developing a website for international use, this encoding won't work with every user and they will see gibberish instead of the content.
2. UTF-8
This encoding solves the previous problem and also works in URLs. I mean, if you want to have Arabic words in your URL, you need them to be in UTF-8 or it won't work.
The downside of this encoding is that if you are going to save Arabic content to a database (e.g. MySQL) using this encoding (so the database will also be encoded with UTF-8), its size is going to be roughly double what it would have been if it were encoded with windows-1256 (in which case the database would use a single-byte encoding such as latin-1).
I suggest going with utf-8 if you can afford the size increase.
UTF-8 is fine, yes. It can encode any code point in the Unicode standard.
Edited to add
To make the answer more complete, your realistic choices are:
UTF-8
UTF-16
UTF-32
Each comes with tradeoffs and advantages.
UTF-8
As Joe Gauterin points out, UTF-8 is very efficient for European texts but gets increasingly inefficient the "farther" from the Latin alphabet you go. If your text is all Arabic, it will actually be larger than the equivalent text in UTF-16. In practice, however, this is rarely a problem in these days of cheap and plentiful RAM, unless you have a lot of text to deal with. More of a problem is that the variable length of the encoding makes some string operations difficult and slow. For example, you can't easily get the fifth Arabic character in a string, because some characters might be 1 byte long (punctuation, say) while others are two or three bytes. This makes actual processing of strings slow and error-prone.
On the other hand, UTF-8 is likely your best choice if you're doing a lot of mixed European/Arabic text. The more European text in your documents, the better the UTF-8 choice will be.
UTF-16
UTF-16 will give you better space efficiency than UTF-8 if you're using predominantly Arabic text. I don't know about the Arabic code points, however, so I don't know if you risk having variable-length encodings here. (My guess is that this is not an issue, however.) If you do, in fact, have variable-length encodings, all the string processing problems of UTF-8 apply here as well. If not, no problems.
On the other hand, if you have mixed European and Arabic texts, UTF-16 will be less space-efficient. Also, if you find yourself expanding your text forms to other scripts like, say, Chinese, you definitely go back to variable-length forms and the associated problems.
UTF-32
UTF-32 will basically double your space requirements. On the other hand it's constant sized for all known (and, likely, unknown;) script forms. For raw string processing it's your fastest, best option without the problems that variable-length encoding will cause you. (This presupposes you have a string library that knows about 32-bit characters, naturally.)
Recommendation
My own recommendation is that you use UTF-8 as your external format (because everybody supports it) for storage, transmission, etc. unless you really see a benefit size-wise with UTF-16. So any time you read a string from the outside world it would be UTF-8 and any time you put one to the outside world it, too, would be UTF-8. Within your software, though, unless you're in the habit of manipulating massive strings (in which case I'd recommend different data structures anyway!) I'd recommend using UTF-16 or UTF-32 instead (depending on if there's any variable-length encoding issues in your UTF-16 data) for the speed efficiency and simplicity of code.
UTF-8 is the simplest way to go since it will work with almost everything:
UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance, Chinese and Arabic can be in the same text without special codes inserted to switch the encoding.
(via Wikipedia)
Of course keep in mind that:
UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.
... but in most cases it's not a big issue. It would only become one if you start handling huge documents.