In MySQL, which collation should I choose?

When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g. default, armscii8, ascii, ... and UTF-8). The only one I know is UTF-8, since I always see it in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?

A collation tells the database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in Unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default character set is normally latin1, with latin1_swedish_ci as its default collation.

The first entry in the list is not a collation in its own right; phpMyAdmin is just giving you the default collation as the first choice.
What we're talking about is collation, which is tied to the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (e.g. is 'Big' == 'big'? With a CI collation, it is). Check out the MySQL list for all the options.
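
As a rough illustration of what case-insensitive (ci) and accent-insensitive (ai) comparison mean, here is a minimal Python sketch. The helper names strip_accents, equal_ci and equal_ai_ci are made up for this example, and Python's case folding only approximates what a database collation does; it is not MySQL's algorithm.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci(a, b):
    """Case-insensitive, accent-sensitive comparison."""
    return a.casefold() == b.casefold()


def equal_ai_ci(a, b):
    """Case-insensitive and accent-insensitive comparison."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print("Big" == "big")                              # exact comparison
print(equal_ci("Big", "big"))                      # ignores case
print(equal_ci("resume", "r\u00e9sum\u00e9"))      # accents still matter
print(equal_ai_ci("resume", "R\u00e9sum\u00e9"))   # ignores case and accents
```

Which of these behaviours you get in the database depends on the suffix of the collation name (for example _ci, _ai_ci or _bin).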

Short answer: always use utf8mb4 (specifically the utf8mb4_unicode_ci collation) in MySQL and MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly, with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci
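
To see why a 3-byte limit is not enough for full Unicode, the following Python sketch counts how many bytes a few sample characters need in real UTF-8. The function names utf8_length and fits_in_three_bytes are invented for this illustration; the point is only that characters such as emoji need 4 bytes.

```python
def utf8_length(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text):
    """True if every character needs at most 3 bytes (hypothetical check)."""
    return all(utf8_length(ch) <= 3 for ch in text)


# A (ASCII), e-acute, a CJK ideograph, and an emoji outside the BMP.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
for ch in samples:
    print(repr(ch), utf8_length(ch), "byte(s)")

print(fits_in_three_bytes("\u65e5\u672c\u8a9e"))  # plain Japanese text
print(fits_in_three_bytes("hi \U0001F600"))       # text containing an emoji
```

Any string for which the second function returns False contains characters that a 3-byte encoding cannot represent.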

Related

What COLLATE should I set to support all possible languages?

I have a column called username, and I want users to be able to insert text in Japanese, Roman, Arabic, Korean, and everything else that is possible, including special chars [https://en.wiktionary.org/wiki/Index:All_languages]. What COLLATE should I set on my database and tables?
I'm using utf8_general_ci. I'm new, so I don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid SQL errors, because I will not use preg_replace or a function to replace special chars; I will only use prepared statements to avoid SQL injection and protect my database.

First choice (MySQL 8.0): utf8mb4_0900_ai_ci
Second choice (as of 5.6): utf8mb4_unicode_520_ci
Third choice (5.5+): utf8mb4_unicode_ci
Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci
The numbers refer to Unicode standards 9.0, 5.2.0, and (no number) 4.0.
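
The version-based choices above can be written down as a small lookup. This is only a sketch of the list in this answer: the function name recommended_collation and the tuple representation of the version are assumptions made for the example, and it covers MySQL version numbers only.

```python
def recommended_collation(version):
    """Map a MySQL version tuple, e.g. (8, 0, 36), to the choice listed above."""
    if version >= (8, 0):
        return "utf8mb4_0900_ai_ci"       # first choice, MySQL 8.0
    if version >= (5, 6):
        return "utf8mb4_unicode_520_ci"   # second choice, as of 5.6
    if version >= (5, 5):
        return "utf8mb4_unicode_ci"       # third choice, 5.5+
    return "utf8_unicode_ci"              # before 5.5: no utf8mb4 available


for version in [(8, 0, 36), (5, 7, 44), (5, 5, 62), (5, 1, 73)]:
    print(version, "->", recommended_collation(version))
```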
No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.
utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)
If Portuguese is the focus:
See https://pt.stackoverflow.com/ and MySQL collation for Portuguese.
Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.
(Else go with the 'choices' listed above.)

If you are using MySQL 5.5.3 or higher, I would recommend the utf8mb4 character set with the utf8mb4_unicode_ci collation. AFAIK it supports most, if not all, languages and implements the Unicode standard for sorting and comparison. As a second choice, have a look at utf8mb4_general_ci, which may be faster but is also less accurate.
See this excellent SO post for (many) more details, or check out the official MySQL doc.
Below 5.5.3, utf8_unicode_ci is your friend.

COLLATION refers to ordering (as in comparisons in WHERE and ORDER BY); you should really ask about CHARACTER SET:
Pre-5.5.3: utf8 (aka utf8mb3) handles all languages, except for a few Chinese characters and Emoji.
5.5.3 forward: utf8mb4 - Handles everything. Outside of MySQL, it is spelled "UTF-8".

Why set a collation for UTF8 text data in MySQL?

I work with human-generated text which I download from different online datasets like GitHub Torrent, the Twitter API, web-scraped HTML pages, Google BigQuery for GitHub etc., which means I have tens to hundreds of millions of text records in the database.
In which scenarios should I be setting a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, or can I simply use "CHARACTER SET UTF8"?
What are the differences between utf8 - default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?

Every textual column has a collation. It may be set explicitly in the table definition, or it may simply be set from the table's default, the database's default, or the server-wide default. But it has a collation.
The collations you mention are all case-insensitive. That is, they ignore the difference between upper- and lower-case letters. If you want a case-sensitive collation, use utf8_bin.
You probably want to use utf8_unicode_ci in a modern server. Read this for background. What's the difference between utf8_general_ci and utf8_unicode_ci
utf8_general_mysql500_ci is a collation specifically for backward compatibility to older versions of MySQL. http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-21.html

Why does MySQL use latin1_swedish_ci as the default?

Does anyone know why latin1_swedish_ci is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem that's what they did.

As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation, it comes with some specific sorting quirks (ÄÖÜ come after Z I think), but they are nowhere near an international standard.

latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
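
The difference in the 0x80 to 0x9f range can be seen with Python's codecs, used here as stand-ins: "cp1252" for MySQL's latin1 and "latin-1" for ISO 8859-1. That mapping is an assumption of this sketch. Python's "latin-1" codec decodes those bytes to control characters rather than treating them as undefined, and its "cp1252" codec leaves a few bytes in that range unmapped, so it may not match MySQL byte for byte.

```python
raw = b"\x80"  # a byte inside the disputed 0x80..0x9f range

as_cp1252 = raw.decode("cp1252")   # Windows-1252 assigns a printable character
as_iso = raw.decode("latin-1")     # ISO 8859-1 codec yields a control character

print(repr(as_cp1252), hex(ord(as_cp1252)))
print(repr(as_iso), hex(ord(as_iso)))

# Outside that range the two encodings agree, e.g. for 0xE9 (e-acute).
print(b"\xe9".decode("cp1252") == b"\xe9".decode("latin-1"))
```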

Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to its length in characters. With a multi-byte encoding, it is not intuitively clear whether functions like SUBSTRING count characters or bytes. For the same reason, supporting multi-byte encodings requires quite a big change to the internal code.
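
The bytes-versus-characters ambiguity is easy to reproduce outside the database. The Python sketch below is only an illustration (the helper is_valid_utf8 is a made-up name): it compares the length of one word in characters and in bytes, and shows that cutting a UTF-8 byte string at an arbitrary position can split a character.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "na\u00efve"                    # five characters, one of them non-ASCII

utf8_bytes = word.encode("utf-8")      # multi-byte encoding
latin1_bytes = word.encode("latin-1")  # single-byte encoding

print(len(word), len(latin1_bytes), len(utf8_bytes))

first_three_chars = word[:3]           # cut by characters: always safe
first_three_bytes = utf8_bytes[:3]     # cut by bytes: lands inside a character

print(repr(first_three_chars))
print(first_three_bytes, is_valid_utf8(first_three_bytes))
```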

Most strange features of this kind are historic. They did it like that a long time ago, and now they can't change it without breaking some app that depends on that behavior.
Perhaps UTF8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.

To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 was in common use. As explained above, this is likely why it is not the default (backwards compatibility, and the expectations of third-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that stores at most 3 bytes per character. Now that it exists, they have chosen neither to widen it to 4 bytes nor to remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real UTF-8 with up to 4 bytes per character.
It's important that anyone migrating a MySQL database to utf8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8), but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface.
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
I would be most grateful for any help in this matter.

From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
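
The idea of an expansion can be sketched in Python by contrasting a strictly one-to-one character mapping with full Unicode case folding, which maps "ß" to "ss". This is only an analogy: simple_fold and full_fold are names invented here, and neither function reproduces the actual rules of utf8_general_ci or utf8_unicode_ci.

```python
def simple_fold(text):
    """One-to-one folding: every character maps to exactly one character."""
    out = []
    for ch in text:
        lowered = ch.lower()
        # Keep the original character if lowering would change the length.
        out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


def full_fold(text):
    """Full case folding, which allows one character to expand to several."""
    return text.casefold()


a, b = "Stra\u00dfe", "STRASSE"

print(simple_fold(a) == simple_fold(b))  # one-to-one mapping cannot equate them
print(full_fold(a) == full_fold(b))      # the expansion makes them compare equal
```

A comparison that can only look at one character at a time has no way to treat a single letter as equal to a two-letter sequence, which is the limitation the manual describes.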

The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

Should I migrate a MySQL database with a latin1_swedish_ci collation to utf-8 and, if so, how?

The MySQL database used by my Rails application currently has the default collation of latin1_swedish_ci. Since the default charset of Rails applications (including mine) is UTF-8, it seems sensible to me to use the utf8_general_ci collation in the database.
Is my thinking correct?
Assuming it is, what would be the best approach to migrate the collation and all the data in the database to the new encoding?

UTF-8, as well as any other Unicode encoding scheme, can store characters in any language, so it is an excellent choice of codepage for your database.
The collation setting, on the other hand, is a completely separate issue from the encoding scheme. It involves sort orders, upper/lowercase conversions, string equality comparisons, and things like that which are language-specific. The collation setting should match the language that is used in the database.
The UTF-8 general collation is (I am assuming here—I'm not familiar with MySQL in particular) used for situations where the language is unknown and some simple default ordering is needed. It probably corresponds to the Unicode code point ordering, which is almost certainly not what you want if you're storing Swedish.

Convert to UTF-8 as the charset.
Collation settings are only used for sorting and stuff like that. Choose the collation that most of your users would expect.

Provided your existing data in the database is CORRECTLY encoded in latin1, converting the tables to utf8 (using ALTER TABLE, as described in the docs) should just work.
Then all your application needs to do is continue doing whatever it did before. If your application wants to use unicode characters, it should set its connection encoding to utf8 and use utf8, but that's its own problem.
The problem is that a large number of crap web apps have historically sent utf8 data to mysql and told it to treat it as latin1. MySQL will honour this perfectly and save junk into the tables, as instructed.
Converting the tables from latin1 to utf8 will NOT repair this mistake, as you genuinely do have total rubbish in there. Repairing them is nontrivial, particularly if during the lifetime of the app it's been talking different types of rubbish to the database.
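
The failure mode described above can be reproduced in a few lines of Python. In this sketch the "cp1252" codec stands in for MySQL's latin1, and the variable names are illustrative only. It shows UTF-8 bytes being interpreted as latin1, what a later conversion to UTF-8 does with the result, and a repair that works only if the mistake was made consistently and every byte is reversible.

```python
original = "caf\u00e9"

# The application sends UTF-8 bytes...
wire_bytes = original.encode("utf-8")

# ...but the database is told they are latin1, so each byte becomes a character.
stored = wire_bytes.decode("cp1252")
print(repr(stored))

# Converting the table to UTF-8 faithfully re-encodes the wrong characters.
converted = stored.encode("utf-8")
print(converted, converted.decode("utf-8") == original)

# Repair: undo the wrong interpretation, then decode the bytes as UTF-8.
repaired = stored.encode("cp1252").decode("utf-8")
print(repr(repaired), repaired == original)
```

If different clients wrote data with different wrong assumptions over the years, no single round trip like this will fix every row, which is why the repair is nontrivial.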

Use the MySQL query below to convert your column:
ALTER TABLE users MODIFY description VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
To see the full details of your table's columns:
SHOW FULL COLUMNS FROM users;
