I am building a website in German, so I will be using characters like ä, ü, ß, etc. What are your recommendations?
This answer is outdated. For full emoji support, see this answer.
As the character set, if you can, definitely UTF-8.
As the collation - that's a bit nasty for languages with special characters. There are various types of collations. They can all store all Umlauts and other characters, but they differ in how they treat Umlauts in comparisons, i.e. whether
u = ü
is true or false; and in sorting (where in the alphabet the umlauts are placed).
To make a long story short, your best bet is either
utf8_unicode_ci
It allows case-insensitive searches, treats ß as ss, and uses DIN-1 sorting. Sadly, like all non-binary Unicode collations, it treats u = ü, which is a terrible nuisance because a search for "Muller" will also return "Müller". You will have to work around that by setting an umlaut-aware collation at query time.
or utf8_bin
This collation does not have the u = ü problem but only case sensitive searches are possible.
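To make the trade-off between the two choices concrete, here is a rough Python sketch. It is not MySQL's actual collation algorithm; the helper names are made up for illustration, and the accent stripping via Unicode decomposition only approximates what an accent- and case-insensitive collation does.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (ü becomes u)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci_ai(a: str, b: str) -> bool:
    """Case- and accent-insensitive comparison (hypothetical stand-in for a _ci collation)."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


def equal_bin(a: str, b: str) -> bool:
    """Byte-wise comparison (hypothetical stand-in for a _bin collation)."""
    return a.encode("utf-8") == b.encode("utf-8")


print(equal_ci_ai("Müller", "Muller"))   # the u = ü nuisance
print(equal_ci_ai("Straße", "STRASSE"))  # ß compared as ss
print(equal_bin("Müller", "Muller"))     # binary comparison keeps them apart
print(equal_bin("Müller", "müller"))     # but is also case sensitive
```

The first two comparisons succeed and the last two fail, which mirrors the behaviour described for utf8_unicode_ci and utf8_bin respectively.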
I'm not entirely sure whether there are any other side effects to using the binary collation; I asked a question about that here.
This mySQL manual page gives a good overview over the various collations and the consequences they bring in everyday use.
Here is a general overview on available collations in mySQL.
To support the complete UTF-8 standard you have to use the charset utf8mb4 and the collation utf8mb4_unicode_ci in MySQL!
Note: MySQL's so-called utf8 charset only supports characters of 1 to 3 bytes! This is why modern emojis are not supported: they need 4 bytes.
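The byte counts are easy to check outside the database. The following Python sketch (the function names are illustrative, not part of any MySQL API) encodes a few characters as UTF-8 and tests whether a string would fit into a 3-byte-per-character charset.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if no character needs more than 3 bytes, i.e. it fits MySQL's utf8."""
    return all(utf8_length(ch) <= 3 for ch in text)


for ch in ("a", "ä", "ß", "€", "😀"):
    print(ch, utf8_length(ch))  # 1, 2, 2, 3 and 4 bytes respectively

print(fits_three_byte_utf8("Grüße aus Köln"))     # True
print(fits_three_byte_utf8("Grüße aus Köln 😀"))  # False, the emoji needs 4 bytes
```

German text alone fits in 3 bytes per character; it is emoji and other characters outside the Basic Multilingual Plane that need the fourth byte.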
The only way to fully support the UTF-8 standard is to change the charset and collation of ALL tables and of the database itself to utf8mb4 and utf8mb4_unicode_ci. Furthermore, the database connection needs to use utf8mb4 as well.
The mysql server must use utf8mb4 as default charset which can be manually configured in /etc/mysql/conf.d/mysql.cnf
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
# character-set-client-handshake = FALSE ## better not set this!
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
Existing tables can be migrated to utf8mb4 using the following SQL statement:
ALTER TABLE <table-name> CONVERT TO
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
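If there are many tables, the statement can be generated rather than typed by hand. A minimal Python sketch, assuming you already have the list of table names (the names below are placeholders); it only builds the SQL text and does not talk to a server.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(tables, charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    return [
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table in tables
    ]


# Hypothetical table names, for illustration only.
for statement in conversion_statements(["users", "orders"]):
    print(statement)
```

Review the generated statements and take a backup before running them, since converting a large table rewrites it.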
Note:
To make sure JOINs between table columns are not slowed down by charset conversions, ALL tables have to be changed!
As the length of an index is limited in MySQL, the total number of characters per index row multiplied by 4 bytes must not exceed that limit, which is 3072 bytes in the following case:
When the innodb_large_prefix configuration option is enabled, this
length limit is raised to 3072 bytes, for InnoDB tables that use the
DYNAMIC and COMPRESSED row formats.
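The arithmetic behind that limit, as a small Python sketch. The 3072-byte figure comes from the quote above; the 767-byte figure is an assumption not stated in this answer (it is the commonly cited InnoDB limit when large prefixes are not available), and the function names are made up.

```python
# Maximum bytes per character for a few MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_index_chars(limit_bytes: int, charset: str) -> int:
    """How many characters of the given charset fit into an index of limit_bytes."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


def index_fits(column_lengths, charset: str, limit_bytes: int = 3072) -> bool:
    """True if the indexed columns, at worst-case width, stay within the limit."""
    return sum(column_lengths) * MAX_BYTES_PER_CHAR[charset] <= limit_bytes


print(max_index_chars(3072, "utf8mb4"))  # 768
print(max_index_chars(767, "utf8mb4"))   # 191 (767 bytes is an assumed limit, see above)
print(max_index_chars(767, "utf8"))      # 255
print(index_fits([255], "utf8mb4", limit_bytes=767))  # False
```

This is why an indexed VARCHAR(255) column that worked under utf8 can fail after conversion to utf8mb4 when only the smaller limit applies; shortening the column or indexing a prefix of 191 characters is the usual way out.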
To change the charset and default collation of the database, run this command:
ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur.
utf8_general_ci or utf8_unicode_ci.
To know the difference :
UTF-8: General? Bin? Unicode?
The above comments aren't really addressing the specific problem with German umlauts, which is often described as: dictionary order or phone-book order? The Unicode default is okay for the former but if (e.g.) you want 'Ü' = 'UE' then you could consider utf8mb4_de_pb_0900_ai_ci or utf8mb4_german2_ci, assuming character set is utf8mb4.
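The difference between the two orders can be sketched in Python with simple sort keys. This is only an illustration of the idea, not how the MySQL collations are implemented, and the key functions are hypothetical.

```python
# Phone-book order: umlauts expand to vowel + e (Ü = UE).
PHONE_BOOK = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "AE", "Ö": "OE", "Ü": "UE", "ß": "ss"}
)
# Dictionary order: umlauts sort like their base vowel (Ü = U).
DICTIONARY = str.maketrans(
    {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"}
)


def phone_book_key(text: str) -> str:
    return text.translate(PHONE_BOOK).casefold()


def dictionary_key(text: str) -> str:
    return text.translate(DICTIONARY).casefold()


names = ["Muller", "Mueller", "Müller", "Mufti"]
print(sorted(names, key=dictionary_key))  # ['Mueller', 'Mufti', 'Muller', 'Müller']
print(sorted(names, key=phone_book_key))  # ['Mueller', 'Müller', 'Mufti', 'Muller']
```

In dictionary order "Müller" lands next to "Muller"; in phone-book order it lands next to "Mueller", and a search for "Mueller" would match it.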
The issue is that ş and s are interpreted by MySQL as identical values.
I'm new to MySQL, so I have no idea which collations would view them as unique.
The collations that I've tried using which don't work are:
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
Does anybody know which collation to use?
P.S. I also really need the collation to interpret emojis and other non-Latin characters, and, to my knowledge of MySQL and collations, the only collation able to do this is unicode?
utf8_turkish_ci and utf8_romanian_ci -- as shown in http://mysql.rjweb.org/utf8_collations.html
(Plus, of course, utf8_bin.)
For your added question: You are looking for a "character set" (not a "collation") that can represent Emoji and other non-Latin characters -- UTF-8 is the one to use. In MySQL, it is utf8mb4. The "collations" that are associated with that are named utf8mb4_.... Collations control ordering and equality, as indicated in the first part of your question about s and ş.
MySQL's CHARACTER SET utf8 is a subset of utf8mb4. Either can handle all the "letters" in the world. But only utf8mb4 can handle Emoji and some Chinese characters.
I use MAMP 3.4. I have a small database with 3 tables. When I upload the database file to the server I get this error: #1115 - Unknown character set: 'utf8mb4'
I have gone back to MAMP and check: Operations > Collation > utf8_unicode_ci
I have that in each table and in the general database
To export I select the database > Export > Custom > Save Output to a file. In the rest of things I leave the default.
Where is the problem? What is that mb4? Is utf8_unicode_ci the right one? How do I export from MAMP and import on my server?
Let's get one thing straight: a character set is not the same as a collation. The two concepts are closely related, but that is all.
Character sets tell the programs processing text how to interpret the byte stream that makes up the text and what character to display on the screen.
Collations tell the programs processing text how to order characters for comparison and sorting purposes. So, if you do an order by on a text field in an RDBMS, then the RDBMS can figure out using the collation the order of the records.
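A small Python sketch may help keep the two apart. The first half shows the character-set side (the same text as different bytes), the second half the collation side (the same words in different orders). The collation_key function is a made-up approximation, not a MySQL algorithm.

```python
import unicodedata

# Character set: how text maps to bytes.
word = "Äpfel"
print(word.encode("utf-8"))    # b'\xc3\x84pfel'
print(word.encode("latin-1"))  # b'\xc4pfel'


# Collation: how text is compared and ordered.
def collation_key(text: str) -> str:
    """Hypothetical case- and accent-insensitive sort key."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["Zebra", "apfel", "Äpfel", "Apfel"]
print(sorted(words))                     # ['Apfel', 'Zebra', 'apfel', 'Äpfel']
print(sorted(words, key=collation_key))  # ['apfel', 'Äpfel', 'Apfel', 'Zebra']
```

The plain sort orders by code point, much like a binary collation; the keyed sort groups the three spellings of "apfel" together, which is what an ORDER BY under a case-insensitive collation aims for.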
utf8mb4 is a character set that MySQL uses. MySQL's implementation of utf8 can represent a character in up to 3 bytes, while utf8mb4 can represent a character in up to 4 bytes. The UTF-8 standard uses the up-to-4-bytes definition (utf8, wikipedia), so strictly speaking, utf8mb4 is the true UTF-8 implementation in MySQL.
However, utf8mb4 was added only relatively recently (v5.5.3), so its existence is still not that widely known in the MySQL community (MySql utf8mb4).
If you try to import data using this character set to a database that does not support it, then you get the error message in your question.
Collation should match the encoding, so if you have utf8mb4 character set, then use an utf8mb4 collation as well. You need to convert your data to a character set that is supported by your target system and you need to align the collation with your encoding.
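A quick way to tell whether the target server can know utf8mb4 at all is to compare its version with 5.5.3. A minimal Python sketch, assuming you obtained the version string yourself (for example from SELECT VERSION()); the function name is made up.

```python
import re


def supports_utf8mb4(version: str) -> bool:
    """True if a MySQL-style version string is at least 5.5.3."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= (5, 5, 3)


print(supports_utf8mb4("5.1.73-log"))  # False, predates utf8mb4
print(supports_utf8mb4("5.5.3"))       # True
print(supports_utf8mb4("5.7.26"))      # True
```

If the check fails, the dump has to be produced in a character set the old server understands, as described above.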
I am trying to store ISO-8859-15 strings in MySQL fields defined with CHARACTER SET latin1 COLLATE latin1_general_ci.
It seems that the two are not fully compatible: I am not able to save a correct € sign.
Can anybody tell me the correct CHARACTER SET for ISO-8859-15?
According to Wikipedia, there are 8 differences between ISO-8859-1 and ISO-8859-15. The €-Sign is one of them. I see on my copy of 5.6 a latin1 (ISO-8859-1) CHARACTER SET, but no latin9 (ISO-8859-15).
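The eight differences can be listed with Python's built-in codecs, which is a handy way to see exactly which bytes are affected. This sketch only inspects the two ISO encodings; it says nothing about how a particular MySQL server is configured.

```python
def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte value in the given encoding."""
    return bytes([value]).decode(encoding)


# Collect every byte value that the two encodings map to different characters.
differences = [
    (value, decode_byte(value, "iso8859-1"), decode_byte(value, "iso8859-15"))
    for value in range(256)
    if decode_byte(value, "iso8859-1") != decode_byte(value, "iso8859-15")
]

for value, in_latin1, in_latin9 in differences:
    print(f"0x{value:02X}: {in_latin1!r} in ISO-8859-1, {in_latin9!r} in ISO-8859-15")

print(len(differences))         # 8
print("€".encode("iso8859-15"))  # b'\xa4'
print("€".encode("cp1252"))      # b'\x80'
```

As a side note, cp1252 (which MySQL's documentation names as the basis of its latin1) places € at 0x80 rather than 0xA4, so the same byte means different things depending on which encoding the client claims to be sending.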
It is possible to add your own character set and collation to MySQL, but that may be more than you want to tackle. There is a Worklog for adding it, but they need nudging to ever get it done.
Sorry. Can you live with latin1 or latin2 (which has at least the Euro)? Or, better yet, switch to utf8, which has all the characters, and a zillion more.
I work with human-generated text which I download from different online datasets like GitHub Torrent, Twitter API, web-scraped HTML pages, Google BigQuery for GitHub etc., which means I have tens and hundreds of millions of texts in the database.
In which scenarios should I be setting a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, or can I simply use "CHARACTER SET UTF8"?
What are the differences between utf8 - default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?
Every textual column has a collation. It may be set explicitly in the table definition, or it may simply be set from the table's default, the database's default, or the server-wide default. But it has a collation.
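The fallback chain can be pictured as a simple lookup. The Python sketch below is a simplification with made-up names: in MySQL the default is copied into the column when the column is created, not looked up at query time, but the order of precedence is the same.

```python
from typing import Optional


def effective_collation(
    column: Optional[str] = None,
    table: Optional[str] = None,
    database: Optional[str] = None,
    server: str = "utf8mb4_unicode_ci",  # assumed server default, for illustration
) -> str:
    """Return the most specific collation that was set explicitly."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    return server


print(effective_collation())                                          # server default
print(effective_collation(database="utf8_general_ci"))                # database default
print(effective_collation(column="utf8_bin", table="utf8_unicode_ci"))  # column wins
```

So "CHARACTER SET UTF8" on its own still leaves every column with a collation; it is just the default one for that character set.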
The collations you mention are all case-insensitive. That is, they ignore the difference between upper- and lower-case letters. If you want a case-sensitive collation, use utf8_bin.
You probably want to use utf8_unicode_ci in a modern server. Read this for background. What's the difference between utf8_general_ci and utf8_unicode_ci
utf8_general_mysql500_ci is a collation specifically for backward compatibility to older versions of MySQL. http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-21.html
When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells the database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in Unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default collation is normally a latin1 one (latin1_swedish_ci in MySQL versions before 8.0).
Collation is not actually the default, it's giving you the default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (i.e.-Is 'Big' == 'big'? With a CI, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySql & MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci