What COLLATE should I set to support all possible languages? - mysql

I have a column called username, and I want users to be able to insert text in Japanese, Roman, Arabic, Korean, and every other possible script, including special characters [https://en.wiktionary.org/wiki/Index:All_languages]. What COLLATE should I set on my database and tables?
I'm using utf8_general_ci, but I'm new, so I don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid SQL errors, because I will not use preg_replace or any other function to replace special characters; I will only use prepared statements to avoid SQL injection and protect my database.

First choice (MySQL 8.0): utf8mb4_0900_ai_ci
Second choice (as of 5.6): utf8mb4_unicode_520_ci
Third choice (5.5+): utf8mb4_unicode_ci
Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci
The numbers refer to Unicode standards 9.0, 5.2.0, and (no number) 4.0.
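As a rough illustration of the naming scheme described above, here is a minimal Python sketch that pulls the pieces out of a collation name. The helper `describe_collation` and the `UCA_VERSIONS` table are hypothetical names made up for this example; they are not part of MySQL or any client library, and the version mapping only covers the names mentioned in this answer.

```python
# Hypothetical lookup table: only the version tags mentioned above.
UCA_VERSIONS = {"0900": "9.0.0", "520": "5.2.0"}


def describe_collation(name):
    """Split a MySQL collation name into its charset and naming flags."""
    charset, *parts = name.split("_")
    # A numeric tag names the Unicode Collation Algorithm version.
    version = next((UCA_VERSIONS[p] for p in parts if p in UCA_VERSIONS), None)
    # "unicode" without a number means the 4.0 rules.
    if version is None and "unicode" in parts:
        version = "4.0.0"
    return {
        "charset": charset,
        "uca_version": version,
        "case_insensitive": "ci" in parts,
        # Only 8.0-style names spell the accent flag out (ai / as).
        "explicit_accent_flag": next((p for p in parts if p in ("ai", "as")), None),
    }


for name in ("utf8mb4_0900_ai_ci", "utf8mb4_unicode_520_ci", "utf8mb4_unicode_ci"):
    print(name, describe_collation(name))
```

The sketch only reads the name; it says nothing about how the server actually sorts.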
No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.
utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)
If Portuguese is the focus:
See https://pt.stackoverflow.com/ and MySQL collation for Portugese .
Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.
(Else go with the 'choices' listed above.)

If you are using MySQL 5.5.3 or higher, I would recommend the utf8mb4 character set with the utf8mb4_unicode_ci collation. AFAIK it supports most, if not all, languages, and it implements the Unicode standard for sorting and comparison. As a second choice, have a look at utf8mb4_general_ci, which may be faster but is also less accurate.
See this excellent SO post for (many) more details, or check out the official MySQL doc.
Below 5.5.3, utf8_unicode_ci is your friend.

COLLATION refers to ordering (as in comparisons in WHERE and ORDER BY); you should really ask about CHARACTER SET:
Pre-5.5.3: utf8 (aka utf8mb3) handles all languages, except for a few Chinese characters and Emoji.
5.5.3 forward: utf8mb4 - Handles everything. Outside of MySQL, it is spelled "UTF-8".
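To see why the 3-byte limit matters, here is a small Python sketch (not MySQL code) that measures how many bytes each character needs in standard UTF-8. The helper names `utf8_len` and `fits_utf8mb3` are made up for this illustration; the only assumption is that MySQL's utf8 (utf8mb3) stores at most 3 bytes per character, while utf8mb4 allows 4.

```python
def utf8_len(ch):
    """Number of bytes one character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text):
    """True if every character needs at most 3 bytes (the utf8mb3 limit)."""
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "ASCII letter": "a",
    "accented Latin": "\u00e9",       # é
    "common CJK": "\u65e5",           # 日
    "rare CJK (Ext. B)": "\U00020000",
    "emoji": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s), fits utf8mb3: {fits_utf8mb3(ch)}")
```

Anything that reports 4 bytes here is the kind of character that needs utf8mb4.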

Related

Which collation to use so that `ş` and `s` are treated as unique values?

The issue is that ş and s are interpreted by MySQL as identical values.
I'm new to MySQL, so I have no idea which collations would view them as unique.
The collations that I've tried using which don't work are:
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
Does anybody know which collation to use?
P.S. I also really need the collation to interpret emojis and other non-Latin characters, and, to my knowledge of MySQL and collations, the only collation able to do this is unicode?
utf8_turkish_ci and utf8_romanian_ci -- as shown in http://mysql.rjweb.org/utf8_collations.html
(Plus, of course, utf8_bin.)
For your added question: You are looking for a "character set" (not a "collation") that can represent Emoji and other non-Latin characters -- UTF-8 is the one to use. In MySQL, it is utf8mb4. The "collations" that are associated with that are named utf8mb4_.... Collations control ordering and equality, as indicated in the first part of your question about s and ş.
MySQL's CHARACTER SET utf8 is a subset of utf8mb4. Either can handle all the "letters" in the world. But only utf8mb4 can handle Emoji and some Chinese characters.
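The difference between "equal under a collation" and "equal byte for byte" can be sketched in Python. This is only an analogy, not MySQL's algorithm: `accent_insensitive_key` and `binary_key` are hypothetical helpers that roughly mimic an accent- and case-insensitive collation and a _bin collation, respectively.

```python
import unicodedata


def accent_insensitive_key(text):
    """Rough stand-in for an accent- and case-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (such as the cedilla), then fold case.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def binary_key(text):
    """Rough stand-in for a _bin collation: compare the raw encoded bytes."""
    return text.encode("utf-8")


s_cedilla = "\u015f"  # ş
print(accent_insensitive_key(s_cedilla) == accent_insensitive_key("s"))
print(binary_key(s_cedilla) == binary_key("s"))
```

Under the first key the two letters collapse to the same value; under the second they stay distinct, which is the behaviour the question asks for.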

What MySQL collation is best for accepting all unicode characters?

Our column is currently collated to latin1_swedish_ci and special unicode characters are, obviously, getting stripped out. We want to be able to accept chars such as U+272A ✪, U+2764 ❤, (see this wikipedia article) etc. I'm leaning towards utf8_unicode_ci, would this collation handle these and other characters? I don't care about speed as this column isn't an index.
MySQL Version: 5.5.28-1
The collation is the least of your worries; what you need to think about is the character set for the column/table/database. The collation (the rules governing how data is compared and sorted) is just a corollary of that.
MySQL supports several Unicode character sets, utf8 and utf8mb4 being the most interesting. utf8 supports Unicode characters in the BMP, i.e. a subset of all of Unicode. utf8mb4, available since MySQL 5.5.3, supports all of Unicode.
The collation to be used with any of the Unicode encodings is most likely xxx_general_ci or xxx_unicode_ci. The former is a general, language-independent sorting and comparison algorithm. The latter is a more complete language-independent algorithm that supports more Unicode features (e.g. treating "ß" and "ss" as equivalent), but it is therefore also slower.
See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.
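As a quick check of the two symbols from the question, the following Python sketch tests whether they can be encoded in a single-byte Latin encoding and whether they lie inside the BMP. It uses Python's cp1252 codec as a stand-in for MySQL's latin1, which is an assumption of this example; `can_store` and `in_bmp` are hypothetical helper names.

```python
def can_store(text, codec):
    """True if every character in text is representable in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def in_bmp(text):
    """True if every code point is in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


symbols = "\u272a\u2764"  # U+272A ✪ and U+2764 ❤

print("latin1-like codec:", can_store(symbols, "cp1252"))
print("UTF-8:", can_store(symbols, "utf-8"))
print("inside the BMP:", in_bmp(symbols))
```

Both symbols sit inside the BMP, so a BMP-only Unicode character set can represent them, while a single-byte Latin one cannot.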

Why is MySQL's default collation latin1_swedish_ci?

What is the reasoning behind setting latin1_swedish_ci as the compiled default when other options seem much more reasonable, like latin1_general_ci or utf8_general_ci?
The bloke who wrote it was co-head of a Swedish company.
Possibly for similar reasons, Microsoft SQL Server's default language is us_english.
latin1_swedish_ci is a collation for a single-byte character set (latin1), unlike utf8_general_ci.
Compared to latin1_general_ci, it has support for a variety of extra characters used in European languages. So it's a good choice if you don't know what language you will be using and you are constrained to single-byte character sets.

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8) but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface?
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
I would be most grateful for any help in this matter.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
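The "expansion" idea from the manual excerpt can be imitated in Python. This is only a sketch of the concept, not MySQL's implementation: `one_to_one_equal` compares strictly character by character, while `expansion_equal` relies on Python's `str.casefold()`, which happens to expand "ß" to "ss". Both helper names are made up for this example.

```python
def one_to_one_equal(a, b):
    """Case-insensitive comparison that can only match character to character."""
    return len(a) == len(b) and all(x.lower() == y.lower() for x, y in zip(a, b))


def expansion_equal(a, b):
    """Case-insensitive comparison that allows one character to expand to several."""
    return a.casefold() == b.casefold()


print(one_to_one_equal("Stra\u00dfe", "STRASSE"))  # "Straße" vs "STRASSE"
print(expansion_equal("Stra\u00dfe", "STRASSE"))
```

The one-to-one version cannot match the two spellings because their lengths differ, which is the limitation the manual attributes to the _general_ci collations.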
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

In MySQL, which collation should I choose?

When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default character set is normally latin1, with latin1_swedish_ci as its default collation.
"Collation" is not actually the default; phpMyAdmin is just giving you the default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (i.e.-Is 'Big' == 'big'? With a CI, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySQL & MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly, with minimal or unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci