MySQL Collation For English, Polish and German - mysql

In my CodeIgniter project I am using MySQL as the database. Its collation is 'latin1_swedish_ci'. Now I need to scale my website to store 'Polish', 'German', 'French', 'Ukrainian' and 'Dutch' in addition to 'English', but I don't know which collation to use. I found different answers for different languages on the web, but I need a general one. Please help me find a solution.

(Alvaro's answer is good; I am adding some notes.)
If you are using MySQL 5.5 or 5.6 and have VARCHAR(255), see this for some issues you might run into.
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;
(for each table) is probably the simplest way to convert to UTF-8. Caution: test it separately from production, and test that the Western European text does not get mangled. If you get gibberish or question marks, see this.
When converting to CHARACTER SET utf8mb4, the preferred COLLATION is utf8mb4_unicode_520_ci. (With MySQL 8.0, there is a better one: utf8mb4_0900_ai_ci.)
utf8mb4 will let you handle all the languages of the world, so this should be the last 'conversion' necessary.
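Since the ALTER has to be run once per table, it can help to generate the statements rather than type them. Here is a minimal sketch in Python; the function name and the table names are hypothetical, and it only builds the SQL text without connecting to any server.

```python
def build_convert_statements(tables, charset="utf8mb4",
                             collation="utf8mb4_unicode_520_ci"):
    """Return one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for name in tables:
        # Quote the identifier with backticks; double any embedded backtick.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


# Hypothetical table names, for illustration only.
for statement in build_convert_statements(["users", "products"]):
    print(statement)
```

Review the generated statements and run them against a test copy first, as cautioned above.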

Before caring about collation, you need to migrate to a Unicode-compatible encoding first. As the name suggests, Latin-1* is designed for Latin script and cannot encode all Polish characters and, of course, none of the Cyrillic script. The obvious choice in 2019 is UTF-8, which corresponds to utf8mb4 in MySQL terminology.
Beware, though, that this may not be trivial. If your application assumes single-byte encodings, any text manipulation feature may need to be reviewed and maybe fixed. For instance, the € symbol is 1 byte long in Windows-1252 but 3 bytes long in UTF-8. Let's say you have code that strips it from a string like '29.92€'. If your application removes the last byte, code that worked flawlessly in a single-byte encoding will no longer be valid in a multi-byte encoding, because one byte isn't one character any more. Or, even in MySQL itself, something as simple as regular expressions wasn't multibyte-safe until MySQL 8.0.4.
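To make the '29.92€' case concrete, here is a small Python sketch (the function names are made up for illustration). It compares dropping the last byte with dropping the last character, using Python's cp1252 codec as a stand-in for a single-byte encoding.

```python
def strip_last_byte(text, encoding):
    """Naive approach: encode, drop the final byte, decode again."""
    return text.encode(encoding)[:-1].decode(encoding)


def strip_last_char(text):
    """Character-based approach: independent of the byte encoding."""
    return text[:-1]


price = "29.92€"

# Single-byte encoding: the euro sign is one byte, so this happens to work.
print(strip_last_byte(price, "cp1252"))

# UTF-8: the euro sign is three bytes, so cutting one byte leaves an
# incomplete sequence that can no longer be decoded.
try:
    strip_last_byte(price, "utf-8")
except UnicodeDecodeError as error:
    print("broken:", error.reason)

# Working on characters is safe in both cases.
print(strip_last_char(price))
```

The same reasoning applies to any code that indexes, truncates or pads strings by byte count.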
Once you address this, you need to pick a proper collation. Since you're mixing languages you need a general purpose Unicode one. Here's a good overview.
(*) MySQL is actually lying to you. When it says Latin-1 it actually means Windows-1252.
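As a rough check of which languages fit, the following Python sketch tries to encode one sample word per language. It assumes Python's cp1252 codec is a close enough approximation of MySQL's latin1; the helper name and the sample words are only illustrative.

```python
def fits(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample words, one per language.
samples = {
    "German": "Straße",
    "French": "cœur",
    "Dutch": "coördinatie",
    "Polish": "żółć",
    "Ukrainian": "їжак",
}

for language, word in samples.items():
    print(language, "cp1252:", fits(word, "cp1252"),
          "utf-8:", fits(word, "utf-8"))

# The footnote in practice: the euro sign exists in Windows-1252
# but not in true ISO 8859-1.
print(fits("€", "cp1252"), fits("€", "iso-8859-1"))
```

The Western European samples fit in cp1252, while the Polish and Ukrainian ones do not; all of them fit in UTF-8.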

Related

MySQL to postgres migration issue

I want to migrate my project from MySQL to PostgreSQL. One of my MySQL tables has utf8mb4 set for a particular column. What is the alternative in PostgreSQL for setting the encoding of a column?
utf8mb4 is MySQL's way to represent UTF-8 characters of up to 4 bytes; however, as the documentation clarifies:
Requires a maximum of four bytes per multibyte character.
So, actually, not all characters are stored in four bytes: each character uses only as many bytes as it needs, from one up to four. You should therefore be able to migrate your utf8mb4 characters into a UTF-8 encoded target field (MySQL to PostgreSQL) without problems, at least in theory.
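The "maximum of four bytes" wording can be checked with a few characters. This is a small Python sketch with a hypothetical helper name; it uses Python's own UTF-8 encoder, which produces the same byte sequences any conforming UTF-8 implementation would.

```python
def utf8_width(character):
    """Number of bytes one character occupies in UTF-8."""
    return len(character.encode("utf-8"))


examples = {
    "a": "plain ASCII letter",
    "é": "accented Latin letter",
    "€": "euro sign",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

for character, description in examples.items():
    print(f"{description}: {utf8_width(character)} byte(s)")
```

Only the last kind of character needs all four bytes, which is why the stored size depends on the text rather than being a flat four bytes per character.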
But you never know whether practice matches theory, so it is advisable to first back up your MySQL database (so you need not be afraid of changing it if the source turns out to need adjusting). Then export the database and modify the table/column definitions so they no longer specify utf8mb4: either leave the encoding unspecified (if you can rely on your PostgreSQL database using UTF-8) or make sure the target database is created with UTF-8 encoding explicitly, and run the inserts. Note that PostgreSQL sets the encoding per database, not per column. Take a few samples of data from the original database and compare them with what PostgreSQL returns. If it works out of the box, then theory fit practice. If not, you will need to research the cause of the problem.

MySQL Collation in Kamailio tables

For some reason I don't know, some of my Kamailio tables have the utf32_general_ci collation (location, for example) and others have latin1_swedish_ci.
I tried to find an answer in the GitHub repo (https://github.com/kamailio/kamailio/blob/4.4/scripts/mysql/my_create.sql) but the collation is not specified there.
Is that right? Does it matter? Which one is the right choice?
My Kamailio version is 4.4.3.
What is Kamailio? What does it store in a database?
In general, all software today should be using the full UTF-8 character set encoding. In MySQL, the syntax is CHARACTER SET utf8mb4.
You mentioned two COLLATIONs; they have to do with sort order, not character encoding. The two mentioned correspond to these character sets:
utf32 -- 4 bytes per character; quite wasteful of space; virtually no one uses it; there is no good reason (that I can think of) to use it today.
latin1 -- 1 byte per character; handles western Europe, but none of Asia. It is an old default in MySQL (hence your tables having it).
I do not know the ramifications of trying to change Kamailio.
One thing to note in how MySQL deals with character set differences: You must specify the encoding in the client when connecting to MySQL. Then, when INSERTing and SELECTing, MySQL will convert (if necessary, and if possible) between the client encoding and column encoding. Because of this automatic conversion, you will currently see nothing wrong for western European characters. But Greek, Chinese, etc will be mangled/lost when attempting to store into latin1.
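The loss described above can be imitated in Python. This is only an approximation under stated assumptions: Python's cp1252 codec stands in for MySQL's latin1, the function name is hypothetical, and a real server may raise an error instead of substituting, depending on its SQL mode.

```python
def store_as_latin1(text):
    """Imitate a lossy conversion into a latin1 column.

    Characters with no latin1 (cp1252) equivalent are replaced by '?'.
    """
    return text.encode("cp1252", errors="replace").decode("cp1252")


print(store_as_latin1("Grüße"))   # Western European text survives
print(store_as_latin1("Ω"))       # Greek does not
print(store_as_latin1("中文"))     # neither does Chinese
```

Once the question marks are stored, the original characters cannot be recovered from the column.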

What will happen if I changed the collation of my database?

I am trying to increase performance in MySQL. One of the things I learned is that using the latin1 charset is faster than using UTF8 because latin1 uses fewer bytes. But I am wondering what will happen to the data if I change the collation. In my application today most of the content is in American English, but I can't guarantee that no other languages will be stored as well. If someone stores data in a language other than English, I don't really care about that data much.
My question:
1) If I change the collation of my databases to latin1, what will happen to data that was not written in American English?
2) Which latin1_? should I use: latin1_bin, latin1_general_ci, or latin1_general_cs? And, if possible, what is the difference?
3) when changing the collation of the database do I need to also change the collation of each table separately?
Thanks
UTF-8 only uses extra bytes for characters outside plain ASCII. So really, you should NOT change your character set; it won't help. UTF-16 uses at least 16 bits for every character, so if you were using utf16 and mostly had standard Latin characters, I would suggest switching to utf8. UTF-8 is the compromise: the leading bits of a character's first byte signal "more bytes coming", and the decoder groups the following bytes together with it. But if all you have is unaccented Latin characters, the number of bytes will be exactly the same as with latin1.
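The byte counts are easy to compare. Here is a small Python sketch (hypothetical helper name, with Python's cp1252 codec standing in for latin1) that returns the size of the same text in both encodings.

```python
def byte_sizes(text):
    """Return (latin1_bytes, utf8_bytes) for the same text."""
    return len(text.encode("cp1252")), len(text.encode("utf-8"))


# Plain ASCII text: identical size in both encodings.
print(byte_sizes("Hello, world"))

# Accented letters cost one extra byte each in UTF-8.
print(byte_sizes("Grüße"))
```

For mostly American English data the difference is therefore negligible, which supports the point that the performance problem lies elsewhere.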
To answer specifically: you can set the default collation for new tables, but yes, you have to change each of the existing ones separately. You could do it with an SQL statement that lists the tables and then runs the ALTER statement on each of them (change one and note the SQL statement). But again, don't do it. utf8 is the standard for a reason. Your performance issues are elsewhere.

Why does MySQL use latin1_swedish_ci as the default?

Does anyone know why latin1_swedish is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem that's what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation. It comes with some specific sorting quirks (ÄÖÜ come after Z, I think), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to its length in characters. With a multi-byte encoding, it is not intuitively clear whether functions like SUBSTRING mean characters or bytes. Also, for the same reasons, it requires quite a big change to the internal code to support multi-byte encodings.
Most strange features of this kind are historic. They did it like that a long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.
To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 was widely adopted. As explained above, this is likely why it is not the default (backwards compatibility, and the expectations of third-party software).
At a time when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that stores at most 3 bytes per character, which is not enough for every Unicode character. Now that it exists, they have chosen not to extend it to 4 bytes or remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real UTF-8 with up to 4 bytes per character.
It's important that anyone migrating a MySQL database to utf8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
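To see which text is affected by the 3-byte limit, here is a minimal Python sketch. The function name is hypothetical, and the check simply tests whether every character encodes to at most 3 UTF-8 bytes, which is the limit described above.

```python
def fits_in_three_byte_utf8(text):
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(len(character.encode("utf-8")) <= 3 for character in text)


# European and most Asian text stays within 3 bytes per character.
print(fits_in_three_byte_utf8("żółć € 中文"))

# Emoji and other characters beyond U+FFFF need 4 bytes.
print(fits_in_three_byte_utf8("thumbs up \U0001F44D"))
```

Text for which this returns False needs a utf8mb4 column to be stored intact.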

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents but it fails if the character itself is used (copied from either character map or Word I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø, and we can't seem to do a replace on it (in ASP at least) as the character comes through to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
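A mismatch of this kind can be reproduced outside the database. The Python sketch below uses a hypothetical helper that writes text in one encoding and reads it back in another; it illustrates the general effect, not the exact code path in ASP or MySQL.

```python
def misread(text, written_as, read_as):
    """Encode text one way, then decode the bytes as if they were another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# Latin-1 bytes interpreted as UTF-8: the byte is invalid and gets replaced.
print(repr(misread("ø", "latin-1", "utf-8")))

# UTF-8 bytes interpreted as Windows-1252: the classic two-character gibberish.
print(repr(misread("ø", "utf-8", "cp1252")))

# Forcing the character into an encoding that lacks it yields a literal "?".
print("ø".encode("ascii", errors="replace"))
```

Whichever variant you see tells you in which direction the two sides disagree about the encoding.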
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: @markokocic: Collation only dictates the sorting order. Although this should of course match your character set, it does not affect the range of characters that can be stored in a field.
Have you tried setting the collation for the table to utf-8 or something other than latin1/ascii?