I am trying to store ISO-8859-15 strings in MySQL fields declared with CHARACTER SET latin1 COLLATION latin1_general_ci.
It seems the two are not fully compatible: I am not able to store the €-sign correctly.
Can anybody tell me the correct CHARACTER SET for ISO-8859-15?
According to Wikipedia, there are 8 differences between ISO-8859-1 and ISO-8859-15. The €-Sign is one of them. I see on my copy of 5.6 a latin1 (ISO-8859-1) CHARACTER SET, but no latin9 (ISO-8859-15).
It is possible to add your own character set and collation to MySQL, but that may be more than you want to tackle. There is a Worklog for adding it, but they need nudging to ever get it done.
Sorry. Can you live with latin1 or latin2 (which has at least the Euro)? Or, better yet, switch to utf8, which has all the characters, and a zillion more.
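The € placement can be checked outside MySQL. A quick Python sketch (the codec names are Python's, not MySQL's) shows why strict ISO-8859-1 cannot hold the sign, while ISO-8859-15 and Windows cp1252 (which is what MySQL's latin1 actually implements, as noted further down this thread) can:

```python
# Strict ISO-8859-1 has no code point for the Euro sign:
try:
    "€".encode("iso8859_1")
    euro_in_strict_latin1 = True
except UnicodeEncodeError:
    euro_in_strict_latin1 = False

print(euro_in_strict_latin1)       # False
print("€".encode("iso8859_15"))    # b'\xa4' -- ISO-8859-15 places € at 0xA4
print("€".encode("cp1252"))        # b'\x80' -- cp1252 (MySQL's latin1) at 0x80
```

So the two encodings store the Euro at different byte values, which is exactly the kind of mismatch that corrupts the sign when charsets are mixed.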
Related
The issue is that ş and s are interpreted by MySQL as identical values.
I'm new to MySQL, so I have no idea which collations would view them as unique.
The collations that I've tried using which don't work are:
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
Does anybody know which collation to use?
P.S. I also really need the collation to handle emoji and other non-Latin characters; as far as I understand MySQL collations, only the Unicode-based ones can do this?
utf8_turkish_ci and utf8_romanian_ci -- as shown in http://mysql.rjweb.org/utf8_collations.html
(Plus, of course, utf8_bin.)
For your added question: You are looking for a "character set" (not a "collation") that can represent Emoji and other non-Latin characters -- UTF-8 is the one to use. In MySQL, it is utf8mb4. The "collations" that are associated with that are named utf8mb4_.... Collations control ordering and equality, as indicated in the first part of your question about s and ş.
MySQL's CHARACTER SET utf8 is a subset of utf8mb4. Either can handle all the "letters" in the world. But only utf8mb4 can handle Emoji and some Chinese characters.
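The equality behaviour can be mimicked outside MySQL. As an illustration (this uses Python's Unicode machinery, not MySQL's actual collation code), accent-insensitive collations such as utf8_general_ci effectively strip combining marks before comparing, which is why ş collapses to s:

```python
import unicodedata

def strip_marks(s):
    # Decompose to NFD, then drop combining marks -- roughly what an
    # accent-insensitive (_ci) collation does before comparing.
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

print(strip_marks("ş") == "s")   # True: the accent-insensitive view
print("ş" == "s")                # False: distinct code points (what _bin sees)
```

A language-specific collation like utf8_turkish_ci keeps ş as a separate letter instead of folding it away, which is why it solves the problem.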
I recently had to change MySQL from latin1 to utf8 to handle Russian characters. They were originally showing up as ?????.
I also had to change a couple of tables in my database to utf8mb4. I originally had these set to utf8 but this did not have enough bits to handle certain characters.
I have to make a change to a production database and want to ensure that I do not have any issues a few months down the line with a particular encoding type.
So my question is: when do I use which encoding on a table?
You have multiple questions.
The "???" probably came from converting from latin1 to utf8 incorrectly. The data is now lost, since only '?' remains. Use SELECT HEX(col) ... to confirm that all you get is 3F ('?') where you should get something useful.
See "question marks" in Trouble with utf8 characters; what I see is not what I stored.
utf8mb4 and utf8 handle Cyrillic (Russian) identically, so the CHARACTER SET is not the issue with respect to the "???".
If you have an original copy of the data, then probably you want the 3rd item in here -- "CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset". That is what I call the two-step ALTER.
As for avoiding future issues... See "Best Practice" in my first link. If all you need is European (including Russian), either utf8 or utf8mb4 will suffice. But if you want Emoji or all of Chinese, then go with utf8mb4.
Also, note that you must specify what charset the client is using; this is a common omission, and was probably part of what got you in trouble in the first place.
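The difference between the unrecoverable "???" case and the recoverable "utf8 bytes in a latin1 column" case (the two-step ALTER scenario) can be sketched in Python (an illustration of the byte-level situation, not MySQL code):

```python
text = "Привет"

# Case 1: a bad conversion replaced the bytes with '?' -- the data is gone.
lost = "?" * len(text)   # nothing links '??????' back to 'Привет'

# Case 2: UTF-8 bytes were stored in a latin1 column. Reading them back
# as latin1 shows mojibake, but every original byte is still there:
mojibake = text.encode("utf-8").decode("latin-1")
recovered = mojibake.encode("latin-1").decode("utf-8")
print(mojibake)            # looks like garbage, e.g. 'ÐÑÐ¸Ð²ÐµÑ'
print(recovered == text)   # True -- fully recoverable
```

That is why the fix depends on SELECT HEX(...): 3F bytes mean the data is lost, while plausible multi-byte sequences mean a charset-only repair can still rescue it.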
I am building a web site in German, so I will be using characters like ä, ü, ß, etc. What are your recommendations?
This answer is outdated. For full emoji support, see this answer.
As the character set, if you can, definitely UTF-8.
As the collation - that's a bit nasty for languages with special characters. There are various types of collations. They can all store all Umlauts and other characters, but they differ in how they treat Umlauts in comparisons, i.e. whether
u = ü
is true or false; and in sorting (where in the alphabets the Umlauts are located in the sorting order).
To make a long story short, your best bet is either
utf8_unicode_ci
It allows case-insensitive searches, treats ß as ss, and uses DIN-1 sorting. Sadly, like all non-binary Unicode collations, it treats u = ü, which is a terrible nuisance because a search for "Muller" will also return "Müller". You will have to work around that by setting an Umlaut-aware collation at query time.
or utf8_bin
This collation does not have the u = ü problem, but only case-sensitive searches are possible.
I'm not entirely sure whether there are any other side effects to using the binary collation; I asked a question about that here.
This MySQL manual page gives a good overview of the various collations and the consequences they bring in everyday use.
Here is a general overview of the available collations in MySQL.
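One of the comparison rules above (ß treated as ss) comes straight from Unicode case folding, which Python exposes directly; this sketch illustrates the rule itself, not MySQL's implementation of it:

```python
# Unicode full case folding maps German ß to "ss" -- the same rule
# utf8_unicode_ci applies when comparing strings case-insensitively.
print("ß".casefold())                               # 'ss'
print("Straße".casefold() == "strasse".casefold())  # True

# u and ü remain distinct code points; it is the collation's choice,
# not Unicode itself, to treat them as equal in comparisons.
print("u" == "ü")                                   # False
```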
To support the complete UTF-8 standard you have to use the charset utf8mb4 and the collation utf8mb4_unicode_ci in MySQL!
Note: MySQL only supports 1- to 3-byte characters in its so-called utf8 charset! This is why modern emoji are not supported: they need 4 bytes.
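The 3-byte limit is easy to verify with any UTF-8 encoder; a small Python check (illustrative, independent of MySQL):

```python
# BMP characters need at most 3 bytes in UTF-8 -- MySQL's old utf8 suffices:
print(len("€".encode("utf-8")))    # 3

# Emoji live outside the BMP and need 4 bytes -- only utf8mb4 can store them:
print(len("😀".encode("utf-8")))   # 4
```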
The only way to fully support the UTF-8 standard is to change the charset and collation of ALL tables and of the database itself to utf8mb4 and utf8mb4_unicode_ci. Furthermore, the database connection needs to use utf8mb4 as well.
The MySQL server must use utf8mb4 as its default charset, which can be configured manually in /etc/mysql/conf.d/mysql.cnf:
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
# character-set-client-handshake = FALSE ## better not set this!
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
Existing tables can be migrated to utf8mb4 using the following SQL statement:
ALTER TABLE <table-name> CONVERT TO
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Note:
To make sure that JOINs between table columns are not slowed down by charset conversions, ALL tables have to be changed!
Since index length is limited in MySQL, the number of characters per index row, multiplied by 4 bytes each, must not exceed 3072 bytes (3072 / 4 = 768 characters for utf8mb4):
When the innodb_large_prefix configuration option is enabled, this
length limit is raised to 3072 bytes, for InnoDB tables that use the
DYNAMIC and COMPRESSED row formats.
To change the charset and default collation of the database, run this command:
ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur.
utf8_general_ci or utf8_unicode_ci.
To see the difference:
UTF-8: General? Bin? Unicode?
The above comments aren't really addressing the specific problem with German umlauts, which is often described as: dictionary order or phone-book order? The Unicode default is okay for the former but if (e.g.) you want 'Ü' = 'UE' then you could consider utf8mb4_de_pb_0900_ai_ci or utf8mb4_german2_ci, assuming character set is utf8mb4.
Does anyone know why latin1_swedish is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case that does not seem to be what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation, it comes with some specific sorting quirks (ÄÖÜ come after Z I think), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as "undefined," whereas cp1252, and therefore MySQL's latin1, assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes equals its length in characters. So if you use functions like SUBSTRING, it is not intuitively clear whether you mean characters or bytes. Also, for the same reasons, it requires quite a big change to the internal code to support multi-byte encodings.
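The bytes-versus-characters ambiguity is easy to demonstrate (Python here, purely as an illustration of the encoding sizes involved):

```python
s = "Müller"

# In a single-byte encoding, characters and bytes line up one-to-one:
print(len(s), len(s.encode("latin-1")))   # 6 6

# In UTF-8 they diverge, so "the first 2 bytes" != "the first 2 characters":
utf8 = s.encode("utf-8")
print(len(s), len(utf8))                  # 6 7
print(utf8[:2])                           # b'M\xc3' -- cuts the ü in half
```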
Most strange features of this kind are historic. They did it like that long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF-8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character.
To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: be aware that MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 existed. As explained above, this is likely why it is not the default (backwards compatibility, and the expectations of third-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic UTF-8 support, incorrectly limiting it to 3 bytes of storage per character. Now that it exists, they have chosen not to extend it to 4 bytes or remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real 4-byte UTF-8.
It's important that anyone migrating a MySQL database to UTF-8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g. default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see it in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells the database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default collation is normally latin1.
"Default" is not actually a collation; it just gives you the server's default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (i.e. is 'Big' == 'big'? With a CI collation, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySQL and MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly, with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci
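The sorting difference is visible in any plain code-point sort; Python's default string ordering behaves like a _bin collation (an illustration only, not MySQL code):

```python
words = ["Zebra", "Äpfel", "Apfel"]

# Code-point order (what a _bin collation gives): 'Ä' (U+00C4) sorts
# after 'Z' (U+005A), so 'Äpfel' lands at the end of the list.
print(sorted(words))   # ['Apfel', 'Zebra', 'Äpfel']

# A linguistic collation such as utf8mb4_unicode_ci would instead
# file 'Äpfel' next to 'Apfel', matching dictionary expectations.
```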