I've got a MySQL database that stores Swedish characters (not part of the PK, though) and does selects on those characters.
I don't have a ton of experience with this kind of stuff, but I had previously set the collation to "utf16_swedish_ci", which seems to have worked fine for a long time and was able to differentiate the similar characters (like ä vs a and é vs e) in select statements.
Lately, though, I noticed that using that collation seems to always consider é and e the same (though it seems to distinguish all of the other similar Swedish characters fine).
Did something change regarding that in newer versions of MySQL? Or should that have always been the case and I just didn't notice it until now? What collation should I be using to uniquely identify all the Swedish characters that won't have any weird side effects?
Thanks in advance!
å, ä and ö are part of the native Swedish alphabet, and don't need any special treatment. However, é is not native, and relies on accent rules for collation.
As far as I am aware, to get accent-sensitive collation in MySQL, you need to use one of the binary collations (e.g. utf16_bin), which unfortunately is also case-sensitive.
What version of MySQL are you using, and have you recently updated to a newer version? If you have, then rolling back to a previous version could solve your collation issues. I know there were some changes to the collations included in version 8.x.x, so maybe that is what you are experiencing.
In most (including swedish_ci) utf8 or utf8mb4 collations E=é. Exceptions: _bin and _icelandic_ci. See http://mysql.rjweb.org/utf8_collations.html and http://mysql.rjweb.org/utf8mb4_collations.html
Note that most collations end with _ci, which implies both case folding and (mostly) ignoring accents.
Do not use utf16 or utf32; use only utf8/utf8mb4.
Before 8.0, MySQL had no collations that treat case and accent differently; 8.0 added some, such as utf8mb4_0900_as_ci (accent-sensitive, case-insensitive).
The only incompatible change in collations has been in 5.0 with the German ß. It was a fiasco; MySQL will never change a collation again -- though it may add new collations.
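To illustrate why E=é while å, ä and ö stay distinct, here is a minimal Python sketch. It is not MySQL's collation algorithm; the `fold` helper and the `SWEDISH_LETTERS` set are made up for this example and only mimic the idea that a Swedish _ci collation ignores case and foreign accents but keeps the native letters apart.

```python
import unicodedata

# Hypothetical set for this sketch: letters Swedish treats as separate from a/o.
SWEDISH_LETTERS = frozenset("åäöÅÄÖ")

def fold(text, keep=frozenset()):
    """Rough stand-in for a _ci collation: drop accents and case, except for `keep`."""
    out = []
    for ch in text:
        if ch not in keep:
            # Decompose (é -> e + combining acute) and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ch.casefold())
    return "".join(out)

def swedish_ci_equal(a, b):
    """Compare two strings the way this sketch imagines a Swedish _ci collation would."""
    return fold(a, SWEDISH_LETTERS) == fold(b, SWEDISH_LETTERS)

print(swedish_ci_equal("E", "é"))  # True: the accent on é is ignored
print(swedish_ci_equal("a", "ä"))  # False: ä is its own letter
print(swedish_ci_equal("ä", "Ä"))  # True: case is still ignored
```

Real collations work from weight tables rather than Unicode decomposition, so treat this only as a picture of the behavior, not as a way to predict MySQL's results.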
In my CodeIgniter project I am using MySQL as the database. Its collation is 'latin1_swedish_ci'. Now I need to scale my website to store Polish, German, French, Ukrainian and Dutch in addition to English, but I don't know which collation to use. I found different answers for different languages on the web, but I need a general one. Please help me find a solution.
(Alvaro's answer is good; I am adding some notes.)
If you are using MySQL 5.5 or 5.6 and have VARCHAR(255), see this for some issues you might run into.
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;
(for each table) is probably the simplest way to convert to UTF-8. Caution: test it separately from production, and test that the Western European text does not get mangled. If you get gibberish or question marks, see this
In converting to CHARACTER SET utf8mb4, the preferred COLLATION is utf8mb4_unicode_520_ci. (With MySQL 8.0, there is a better one.)
utf8mb4 will let you handle all the languages of the world, so this should be the last 'conversion' necessary.
Before caring about collation, you need to migrate to a Unicode-compatible encoding first. As the name suggests, Latin-1* is designed for Latin script and cannot encode all Polish characters and, of course, none of the Cyrillic script. The obvious choice in 2019 is UTF-8, which corresponds to utf8mb4 in MySQL terminology.
Beware, though, that this may not be trivial. If your application assumes single-byte encodings, any text manipulation feature may need to be reviewed and possibly fixed. For instance, the € symbol takes 1 byte in Windows-1252 but 3 bytes in UTF-8. Let's say you have code that strips it from a string like '29.92€'. If your application removes the last byte, code that worked flawlessly in a single-byte encoding will break in a multi-byte encoding, because one byte isn't one character any more. Even in MySQL itself, something as simple as regular expressions wasn't multibyte-safe until MySQL 8.0.4.
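The '29.92€' case can be reproduced in a few lines of Python. This is only a sketch of the failure mode; the variable names are made up for the example, and Python's `cp1252` codec stands in for what MySQL calls latin1.

```python
price = "29.92€"

cp1252_bytes = price.encode("cp1252")
utf8_bytes = price.encode("utf-8")

print(len(cp1252_bytes))  # 6: every character is one byte
print(len(utf8_bytes))    # 8: the € alone takes three bytes

# Dropping the last byte happens to work in a single-byte encoding...
stripped_cp1252 = cp1252_bytes[:-1].decode("cp1252")
print(stripped_cp1252)    # 29.92

# ...but in UTF-8 it leaves two dangling bytes of the € behind.
try:
    utf8_bytes[:-1].decode("utf-8")
    utf8_strip_broke = False
except UnicodeDecodeError:
    utf8_strip_broke = True
print(utf8_strip_broke)   # True

# Working on characters instead of bytes is safe in either encoding.
stripped_chars = price[:-1]
print(stripped_chars)     # 29.92
```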
Once you address this, you need to pick a proper collation. Since you're mixing languages you need a general purpose Unicode one. Here's a good overview.
(*) MySQL is actually lying to you. When it says Latin-1 it actually means Windows-1252.
I migrated a Microsoft SQL database to MySQL and I had some collation problems in the rows in MySQL. I tried to change the collation, but the errors are still there. The data is going to be used in WordPress, so I tried the Database Collation Fix plugin, but it doesn't work.
The affected table is wp_posts, in the post_title and post_content columns. All the characters that contain an accent, or the Spanish 'ñ', are replaced by a random character.
I already tried with utf8_spanish_ci and utf8mb4_spanish_ci.
Any suggestions?
Microsoft SQL database collation: Modern_Spanish_CI_AI
MySQL database collation: UTF8 Default Collation
Thanks
I don't know if this helps you, but the collating orders in MySQL's Modern Spanish utf8_spanish_ci and/or utf8mb4_spanish_ci collations are different from those in utf8_unicode_ci and/or utf8mb4_unicode_ci.
Modern Spanish collation handles N and Ñ as separate characters, with Ñ coming directly after N. Generic latin-language collation treats them as variants of the same character. So, if you want Spanish collation -- that is, if you're dealing with lots of proper names and so forth -- you'll need to use the Spanish collation for this data.
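The difference in where Ñ lands can be pictured with a small Python sort. This is a sketch, not how MySQL implements utf8_spanish_ci; `SPANISH_ALPHABET`, `spanish_key` and the word list are invented for the illustration, and accented vowels are not handled.

```python
# Hypothetical alphabet for this sketch, with ñ placed directly after n.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Sort key that ranks letters by their position in the Spanish alphabet."""
    # Characters outside the alphabet sort after everything else.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

words = ["oso", "ñame", "nube", "nada"]

by_code_point = sorted(words)
by_spanish_rules = sorted(words, key=spanish_key)

print(by_code_point)     # ['nada', 'nube', 'oso', 'ñame']
print(by_spanish_rules)  # ['nada', 'nube', 'ñame', 'oso']
```

A plain code point sort pushes ñ past z, while the Spanish-style key keeps it between n and o, which is what matters for lists of proper names.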
If ñ turned into ?, you have one type of problem.
If ñ turned into Ã±, you have "Mojibake".
If ñ turned into �, it's yet another problem.
Please be more specific, since the solutions are quite different.
Trouble with utf8 characters; what I see is not what I stored provides information on the common issues.
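The three symptoms can be reproduced outside the database, which helps when working out which one you have. The Python sketch below is only an illustration of the encoding mismatches involved; the variable names are made up, and it says nothing about where in a particular setup the mismatch occurs.

```python
original = "ñ"

# 1. Question mark: the target character set cannot represent the character.
as_question_mark = original.encode("ascii", errors="replace").decode("ascii")
print(as_question_mark)    # ?

# 2. Mojibake: UTF-8 bytes are read as if they were cp1252 (MySQL's latin1).
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)            # Ã±

# 3. Replacement character: a latin1 byte is read as if it were UTF-8.
black_diamond = original.encode("latin-1").decode("utf-8", errors="replace")
print(black_diamond == "\ufffd")  # True

# Mojibake is reversible as long as no bytes were lost along the way.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired)            # ñ
```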
The "Collation" is not relevant to ñ being replaced by a 'random character'. Only the CHARACTER SET is relevant.
When you get into comparing or sorting strings, then the COLLATION becomes relevant. I think the only difference between ..._spanish_ci and ..._spanish2_ci is the handling of ch and ll.
Our column is currently collated to latin1_swedish_ci and special unicode characters are, obviously, getting stripped out. We want to be able to accept chars such as U+272A ✪, U+2764 ❤, (see this wikipedia article) etc. I'm leaning towards utf8_unicode_ci, would this collation handle these and other characters? I don't care about speed as this column isn't an index.
MySQL Version: 5.5.28-1
The collation is the least of your worries; what you need to think about is the character set for the column/table/database. The collation (the rules governing how data is compared and sorted) is just a corollary of that.
MySQL supports several Unicode character sets, utf8 and utf8mb4 being the most interesting. utf8 supports Unicode characters in the BMP, i.e. a subset of all of Unicode. utf8mb4, available since MySQL 5.5.3, supports all of Unicode.
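A quick way to tell whether a given character fits in the 3-byte utf8 or needs utf8mb4 is to check whether its code point lies beyond the BMP (above U+FFFF). The Python sketch below does that for the two characters from the question; the helper name `needs_utf8mb4` and the emoji U+1F600 are additions for the example, not something from the question.

```python
def needs_utf8mb4(text):
    """True if any character lies outside the BMP, i.e. needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

samples = {
    "U+272A": "\u272a",       # ✪ from the question
    "U+2764": "\u2764",       # ❤ from the question
    "U+1F600": "\U0001F600",  # an emoji, added here as a non-BMP example
}

for name, ch in samples.items():
    print(name, len(ch.encode("utf-8")), needs_utf8mb4(ch))
# U+272A 3 False
# U+2764 3 False
# U+1F600 4 True
```

So the two characters named in the question are inside the BMP, but anything like the emoji would not be; that is the case utf8mb4 covers.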
The collation to be used with any of the Unicode encodings is most likely xxx_general_ci or xxx_unicode_ci. The former is a general sorting and comparison algorithm independent of language; the latter is a more complete language-independent algorithm supporting more Unicode features (e.g. treating "ß" and "ss" as equivalent), and is therefore also slower.
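The "ß" versus "ss" point is about expansions: one character comparing equal to two. Python's own string methods show the same distinction, so here is a small sketch. Neither function is MySQL's algorithm; `simple_fold` and `full_fold` are names invented here to contrast a per-character fold with one that allows expansions.

```python
def simple_fold(text):
    """Per-character lowercasing: ß stays a single ß."""
    return text.lower()

def full_fold(text):
    """Unicode case folding, which expands ß to ss."""
    return text.casefold()

a, b = "Straße", "STRASSE"

print(simple_fold(a) == simple_fold(b))  # False
print(full_fold(a) == full_fold(b))      # True
print(len(a), len(full_fold(a)))         # 6 7: the folded form is longer
```

Handling expansions like this is part of the extra work a more complete Unicode comparison has to do.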
See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.
Does anyone know why latin1_swedish is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem that's what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation: it comes with some specific sorting quirks (ÄÖÜ come after Z, I think), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
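The 0x80 to 0x9f difference described in that quote is easy to see from Python, whose `cp1252` and `latin-1` codecs can stand in for the two definitions. This is only a sketch; note that Python's `latin-1` codec maps those bytes to C1 control code points rather than rejecting them.

```python
import unicodedata

byte = b"\x80"

as_cp1252 = byte.decode("cp1252")   # what MySQL's latin1 stores for 0x80
as_iso = byte.decode("latin-1")     # what ISO 8859-1 makes of the same byte

print(as_cp1252)                     # €
print(unicodedata.category(as_iso))  # Cc: a control code, not a printable character

# Outside 0x80-0x9f the two agree, e.g. on the Swedish letters.
swedish = b"\xe5\xe4\xf6"
print(swedish.decode("cp1252") == swedish.decode("latin-1"))  # True
print(swedish.decode("cp1252"))      # åäö
```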
Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to its length in characters. With a multi-byte encoding, it is not intuitively clear whether functions like SUBSTRING count characters or bytes. Also, for the same reasons, supporting multi-byte encodings requires quite a big change to the internal code.
Most strange features of this kind are historic. They did it like that a long time ago, and now they can't change it without breaking some app that depends on that behavior.
Perhaps UTF-8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.
To expand on why not utf8, and to explain something not mentioned elsewhere in this thread: be aware there is a gotcha with MySQL's utf8. It's not really UTF-8! MySQL has been around for a long time, since before UTF-8 was in common use. As explained above, this is likely why it is not the default (backwards compatibility, and expectations of third-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that, incorrectly, stores at most 3 bytes per character. Now that it exists, they have chosen neither to widen it to 4 bytes nor to remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real 4-byte UTF-8.
It's important that anyone migrating a MySQL database to UTF-8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default character set is normally latin1, with latin1_swedish_ci as its collation.
Collation is not actually the default, it's giving you the default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (e.g., is 'Big' == 'big'? With a CI collation, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySQL and MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly, with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci