Why setting a collation for UTF8 text data in MySQL? - mysql

I work with human-generated text which I download from different online datasets like GitHub Torrent, Twitter API, web-scraped HTML pages, Google BigQuery for GitHub etc., which means I have tens to hundreds of millions of text records in the database.
In which scenarios should I set a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, or can I simply use "CHARACTER SET UTF8"?
What are the differences between utf8 - default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?

Every textual column has a collation. It may be set explicitly in the table definition, or it may simply be set from the table's default, the database's default, or the server-wide default. But it has a collation.
The collations you mention are all case-insensitive. That is, they ignore the difference between upper- and lowercase letters. If you want a case-sensitive collation, use utf8_bin.
You probably want to use utf8_unicode_ci in a modern server. For background, read "What's the difference between utf8_general_ci and utf8_unicode_ci".
utf8_general_mysql500_ci is a collation specifically for backward compatibility to older versions of MySQL. http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-21.html
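As a minimal sketch of the points above, the statements below show where a column's collation comes from and how a case-insensitive collation differs from a binary one. The database `demo_db` and the table `notes` are hypothetical names, and utf8mb4 is used instead of utf8 on the assumption that the server is MySQL 5.5.3 or later.

```sql
-- Hypothetical database; its charset and collation become the default for new tables.
CREATE DATABASE demo_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE demo_db;

-- `title` inherits the database default; `tag` overrides it with a binary collation.
CREATE TABLE notes (
  id    INT PRIMARY KEY,
  title VARCHAR(100),
  tag   VARCHAR(50) COLLATE utf8mb4_bin
);

-- Shows the collation that each column actually ended up with.
SHOW FULL COLUMNS FROM notes;

-- Case-insensitive vs. binary comparison of the same two strings.
SELECT _utf8mb4'abc' = _utf8mb4'ABC' COLLATE utf8mb4_unicode_ci AS ci_equal,   -- 1
       _utf8mb4'abc' = _utf8mb4'ABC' COLLATE utf8mb4_bin        AS bin_equal;  -- 0
```

Even when nothing is declared on a column, SHOW FULL COLUMNS reports a collation for every textual column.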

Related

Why is table CHARSET set to utf8mb4 and COLLATION to utf8mb4_unicode_520_ci

I've recently noticed that, whenever I start a new WordPress project, my tables' collation automatically changes from utf8_unicode_ci (which I select when I create a new DB from phpMyAdmin) to utf8mb4_unicode_520_ci.
Also, I've noticed in phpMyAdmin under “General Settings” that server connection Collation defaults to utf8mb4_unicode_520_ci.
I'm running MySQL Server 5.7.17 and phpMyAdmin 4.6.6 on Ubuntu 17.04.
My questions are the following:
1) Why is this happening?
2) If possible, how do I prevent this? Because of utf8mb4 I've experienced problems when migrating WP sites to an older MySQL server which does not support it.
3) Is point 2 advisable? Are there any benefits in using charset utf8mb4 over utf8, and collation utf8mb4_unicode_520_ci over utf8_unicode_ci?
In the past, there was only utf8 (aka utf8mb3); as of MySQL 8.0, utf8mb4 is the default character set.
In the past, _general_ci was the default collation; then _unicode_ci (Unicode 4.0) was better, then _unicode_520_ci (Unicode 5.2.0). In MySQL 8.0, the default is _0900_ai_ci (Unicode 9.0).
Meanwhile, the road is full of potholes generated by MySQL's past mistakes. And WP designers are driving in a big tank that does not notice the potholes.
MySQL 5.6 was a big pothole that swallowed up many a WP user because of a 767-byte limit on index keys together with WP indexes on the overly-long VARCHAR(255) and the possibility of using utf8mb4. You are well past it by having 5.7.17. (Your future move to 8.0 will be less bumpy.)
That is, newly created databases/tables/columns on 5.7.7+ should not experience the 767 problem, but things migrated from older versions (5.5.3+) may have issues, especially if something causes you to change to utf8mb4.
What to do? I'll probably run out of space trying to spell out all the options. So provide the history of the data, the upgrade path (if any), the current settings, the ROW_FORMAT of the tables, the CHARACTER SET and COLLATION of the columns, the output of SHOW VARIABLES LIKE 'char%';
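The information requested above can be collected with a few standard statements. This is only a sketch: `wordpress` is a hypothetical schema name, and SHOW TABLE STATUS is assumed to be run with that schema selected.

```sql
-- Character set and collation settings of the server and the current connection.
SHOW VARIABLES LIKE 'char%';
SHOW VARIABLES LIKE 'collation%';

-- Row_format and default Collation for every table in the current database.
SHOW TABLE STATUS;

-- Character set and collation of every text column (`wordpress` is a hypothetical schema name).
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'wordpress'
  AND CHARACTER_SET_NAME IS NOT NULL;
```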
Where should you be? For 5.7.7+, utf8mb4 and utf8mb4_unicode_520_ci wherever practical. That charset gives you Emoji and all of Chinese (utf8 does not). That collation is the best available, although you might be hard pressed to notice where it matters.
Note: the first part of the collation name is the only character set that it works with. That is, utf8_unicode_ci does not work with utf8mb4.
For MySQL 8.0, there is a better collation than the one mentioned in the title. In general, simply use the default collation for the chosen charset (unless you have some compatibility issue or language-specific need).

Relationship between database's charset, table's charset and columns' charset? Do different charsets lead to any performance issues?

I am developing a website using ASP.NET, and my DB is MySQL. Users can submit articles there. The site will be international, so I don't want to restrict the language to English.
So I have decided a few things. Please guide me if I have made the wrong choice.
1) I chose utf8mb4 as the database charset, because it is an improved version of UTF8 that can store more characters. Did I make the right choice? I have only a few tables that need utf8mb4, so should I use latin1 as the database charset instead?
2) I don't know which collation to use for the above charset. I decided to use utf8mb4_swedish_ci. Or should I use utf8mb4_general_ci or something else?
3) Most of my tables do not need the utf8mb4 charset; latin1 with latin1_swedish_ci will do the work. Can I keep selected tables under a specific charset and collation even if the DB uses another charset and collation?
4) Can I use the utf8mb4 charset for a specific column in a table which has latin1 (latin1_swedish_ci) as its charset?
If those are possible, what is the relationship between the database charset, table charset and column charsets?
Do different charsets lead to any performance issues?
Thank you very much.
The database charset is inherited by the table, unless you override it. (I recommend being specific at the table level.)
The table charset is inherited by the columns in the table. Since one usually has only one charset, this inheritance is fine. Also, it is pretty clear when you do SHOW CREATE TABLE what each column is set to -- without having to look at the database or system.
Go international -- use utf8 or utf8mb4. I agree that utf8mb4 is a better choice, especially for Chinese and some emoticons.
character_set_% -- Only _client, _connection, and _results are important. And these are the three that are set by SET NAMES utf8mb4. Leave the rest alone.
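For illustration, this is roughly what SET NAMES does on a connection. The three explicit assignments are shown only to make the equivalence visible; in practice one would issue just the SET NAMES statement (or configure the client library to do so).

```sql
-- One statement sets the three connection-related variables...
SET NAMES utf8mb4;

-- ...which has the same effect as these three assignments:
SET character_set_client     = utf8mb4;
SET character_set_results    = utf8mb4;
SET character_set_connection = utf8mb4;

-- Optionally pick the connection collation at the same time.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check the result.
SHOW VARIABLES LIKE 'character_set_%';
```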
The default collation for utf8mb4 is utf8mb4_general_ci (before MySQL 8.0), which is possibly a good choice if you have multiple languages. The other choice is utf8mb4_unicode_ci. I talk more about "combining diacriticals" in http://mysql.rjweb.org/doc.php/charcoll#combining_diacriticals . This section gives examples of where those two collations differ: http://mysql.rjweb.org/doc.php/charcoll#utf8_collations_examples
See also the "Best Practice" section.
latin1 is smaller than utf8 for Western European text. MySQL will do the proper conversions when needed, so that is not a problem. But I prefer not to confuse the programmer by mixing character sets. Keep in mind that converting an existing table column from latin1 to utf8 takes some effort, possible downtime, and maybe risk.
4) Can I use the utf8mb4 charset for a specific column in a table which has latin1 (latin1_swedish_ci) as its charset?
Yes. Each column (but not each row) can have a different character set and/or collation.
The existence of different charsets is not a performance problem, per se. What could bite you is WHERE col1 = col2 (and other cases) when the two columns have a different character set and/or collation. MySQL will abandon an otherwise perfectly good index if it sees a difference that is not easy to handle.
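A small sketch of both points: a per-column charset inside a latin1 table, and a join across mismatched charsets. The tables `customers` and `orders` and their columns are hypothetical, and what EXPLAIN reports will depend on the server version and the data.

```sql
-- Hypothetical tables whose join columns use different character sets.
CREATE TABLE customers (
  code VARCHAR(20) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (code)
) DEFAULT CHARSET = latin1;

CREATE TABLE orders (
  id            INT PRIMARY KEY,
  customer_code VARCHAR(20) CHARACTER SET utf8mb4 NOT NULL,  -- per-column override
  note          VARCHAR(200)                                  -- inherits latin1 from the table
) DEFAULT CHARSET = latin1;

-- Inspect the plan; with mismatched charsets the index on customers.code may go unused.
EXPLAIN
SELECT o.id
FROM orders AS o
JOIN customers AS c ON c.code = o.customer_code;

-- Aligning the column definitions removes the mismatch.
ALTER TABLE customers
  MODIFY code VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
ALTER TABLE orders
  MODIFY customer_code VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
```

Running the EXPLAIN again after the two ALTER statements is a simple way to see whether the plan changed.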

What is the best MySQL collation for German language

I am building a web site in German, so I will be using characters like ä, ü, ß etc. What are your recommendations?
This answer is outdated. For full emoji support, see this answer.
As the character set, if you can, definitely UTF-8.
As the collation - that's a bit nasty for languages with special characters. There are various types of collations. They can all store all Umlauts and other characters, but they differ in how they treat Umlauts in comparisons, i.e. whether
u = ü
is true or false; and in sorting (where in the alphabets the Umlauts are located in the sorting order).
To make a long story short, your best bet is either
utf8_unicode_ci
It allows case-insensitive searches; it treats ß as ss and uses DIN-1 sorting. Sadly, like all non-binary Unicode collations, it treats u = ü, which is a terrible nuisance because a search for "Muller" will also return "Müller". You will have to work around that by setting an umlaut-aware collation at query time.
or utf8_bin
This collation does not have the u = ü problem but only case sensitive searches are possible.
I'm not entirely sure whether there are any other side effects to using the binary collation; I asked a question about that here.
This MySQL manual page gives a good overview of the various collations and the consequences they bring in everyday use.
Here is a general overview of the available collations in MySQL.
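The query-time workaround for the "Muller" / "Müller" problem could look like the sketch below. It assumes a utf8mb4 column and a client that sends UTF-8; the table `people` is hypothetical, and utf8mb4_german2_ci (phone-book rules) is one possible umlaut-aware collation, not necessarily the one the answer had in mind.

```sql
SET NAMES utf8mb4;

-- Hypothetical table using the general-purpose Unicode collation.
CREATE TABLE people (
  name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
INSERT INTO people (name) VALUES ('Muller'), ('Müller'), ('Mueller');

-- The column collation applies: u = ü, so both 'Muller' and 'Müller' should match.
SELECT name FROM people WHERE name = 'Muller';

-- Override the collation for this one comparison (phone-book rules: ü = ue),
-- so 'Müller' should no longer match 'Muller'.
SELECT name FROM people WHERE name COLLATE utf8mb4_german2_ci = 'Muller';
```

Applying COLLATE to the column may prevent an index on `name` from being used for that query, so this suits occasional exact-match searches better than hot paths.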
To support the complete UTF-8 standard you have to use the charset utf8mb4 and the collation utf8mb4_unicode_ci in MySQL!
Note: MySQL only supports 1- to 3-byte characters when using its so called utf8 charset! This is why the modern Emojis are not supported as they use 4 Bytes!
The only way to fully support the UTF-8 standard is to change the charset and collation of ALL tables and of the database itself to utf8mb4 and utf8mb4_unicode_ci. Furthermore, the database connection needs to use utf8mb4 as well.
The mysql server must use utf8mb4 as default charset which can be manually configured in /etc/mysql/conf.d/mysql.cnf
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
# character-set-client-handshake = FALSE ## better not set this!
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
Existing tables can be migrated to utf8mb4 using the following SQL statement:
ALTER TABLE <table-name> CONVERT TO
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Note:
To make sure JOINs between table columns are not slowed down by charset conversions, ALL tables have to be changed!
As the length of an index key is limited in MySQL, the number of indexed characters multiplied by 4 bytes must not exceed 3072 bytes:
When the innodb_large_prefix configuration option is enabled, this
length limit is raised to 3072 bytes, for InnoDB tables that use the
DYNAMIC and COMPRESSED row formats.
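The arithmetic behind that note can be sketched as follows, assuming the 767-byte limit that applies when the larger prefix is not available. The table `articles` is hypothetical; a prefix index of 191 characters is one common way to stay under the smaller limit.

```sql
-- Maximum number of 4-byte characters that fit in each index-length limit.
SELECT 767 DIV 4  AS chars_in_767_bytes,   -- 191
       3072 DIV 4 AS chars_in_3072_bytes;  -- 768

-- Hypothetical table: index only the first 191 characters of a VARCHAR(255) column.
CREATE TABLE articles (
  id    INT PRIMARY KEY,
  title VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  KEY idx_title (title(191))
) ENGINE = InnoDB;
```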
To change the charset and default collation of the database, run this command:
ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur.
utf8_general_ci or utf8_unicode_ci.
To know the difference, see:
UTF-8: General? Bin? Unicode?
The above comments aren't really addressing the specific problem with German umlauts, which is often described as: dictionary order or phone-book order? The Unicode default is okay for the former but if (e.g.) you want 'Ü' = 'UE' then you could consider utf8mb4_de_pb_0900_ai_ci or utf8mb4_german2_ci, assuming character set is utf8mb4.

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8), but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface.
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
I would be most grateful for any help in this matter.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

In MySQL, which collation should I choose?

When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells the database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in Unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default collation is normally latin1_swedish_ci (the default for the latin1 character set).
Collation is not actually the default, it's giving you the default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (i.e.-Is 'Big' == 'big'? With a CI, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySQL & MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci