MySQL Collation in Kamailio tables - mysql

For some reason I don't know, some of my Kamailio tables have the utf32_general_ci collation (location, for example) and others have latin1_swedish_ci.
I tried to find an answer in the GitHub repo (https://github.com/kamailio/kamailio/blob/4.4/scripts/mysql/my_create.sql), but no collation is specified there.
Is that right? Does it matter? Which one is the right choice?
My Kamailio version is 4.4.3.

What is Kamailio? What does it store in a database?
In general, all software today should be using the full UTF-8 character set encoding. In MySQL, the syntax is CHARACTER SET utf8mb4.
You mentioned two COLLATIONs; a collation governs sort order and comparison, not character encoding, but each one belongs to a character set. The two you mentioned correspond to these character sets:
utf32 -- 4 bytes per character; quite wasteful of space; virtually no one uses it; there is no good reason (that I can think of) to use it today.
latin1 -- 1 byte per character; handles Western Europe, but none of Asia. It is an old default in MySQL (hence your tables having it).
I do not know the ramifications of trying to change Kamailio.
One thing to note in how MySQL deals with character set differences: You must specify the encoding in the client when connecting to MySQL. Then, when INSERTing and SELECTing, MySQL will convert (if necessary, and if possible) between the client encoding and column encoding. Because of this automatic conversion, you will currently see nothing wrong for western European characters. But Greek, Chinese, etc will be mangled/lost when attempting to store into latin1.
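To make the "mangled/lost" point concrete, here is a rough Python sketch. It is only an analogy: Python's cp1252 codec stands in for a latin1 column, and the helper name store_as_latin1 is made up for illustration. As far as I know, a non-strict MySQL server behaves similarly and stores '?' for characters the column cannot hold, while a strict one rejects the insert.

```python
# Sketch only: Python codecs stand in for MySQL's client-to-column conversion.
# cp1252 is used because MySQL's latin1 is effectively Windows-1252.

def store_as_latin1(text: str) -> str:
    """Encode into a single-byte 'column', replacing what does not fit, then read back."""
    stored = text.encode("cp1252", errors="replace")  # lossy step
    return stored.decode("cp1252")

western = store_as_latin1("café")   # Western European text survives
greek = store_as_latin1("αβγ")      # every character is replaced
chinese = store_as_latin1("中文")    # same here

print(western, greek, chinese)
```

The replaced characters cannot be recovered later, which is why the column encoding matters even if everything looks fine with Western European data today.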

Related

MySQL to postgres migration issue

I want to migrate my project from MySQL to PostgreSQL. One table in MySQL has utf8mb4 set for a particular column. What is the alternative in PostgreSQL for setting the encoding of that column?
utf8mb4 is MySQL's name for full UTF-8, including the characters that need 4 bytes. However, as the documentation clarifies:
Requires a maximum of four bytes per multibyte character.
So not every character is stored in four bytes; each one uses only as many bytes as it needs. You should therefore be able to migrate your utf8mb4 data into a UTF-8 encoded target (MySQL to PostgreSQL) without problems, at least in theory.
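A quick way to see the "maximum of four bytes" behaviour is to measure a few characters in plain UTF-8. This is just a Python illustration of the encoding itself, not of either database; utf8_width is a name I made up.

```python
# UTF-8 is variable width: one to four bytes per character.

def utf8_width(ch: str) -> int:
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

samples = ["A", "é", "€", "😀"]  # ASCII, Latin accent, BMP symbol, emoji
widths = [utf8_width(ch) for ch in samples]

print(dict(zip(samples, widths)))
```

Only the emoji (a character outside the Basic Multilingual Plane) needs all four bytes.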
But you never know whether this holds in practice, so first create a backup of your MySQL database (so that you are not afraid of changing it if the original database turns out to need changes). Then export your database and modify the table/column definitions so that they no longer mention utf8mb4. PostgreSQL sets the encoding per database rather than per column, so either rely on the target database already using UTF-8 or create it with UTF-8 encoding explicitly, and then run the inserts. Take a few samples of data from the original database and compare them with what PostgreSQL returns for them. If it works out of the box, then the theory matched the practice. If not, you will need to research the cause of the problem you see.

MySQL Collation For English, Polish and German

In my CodeIgniter project I am using MySQL as the database. Its collation is 'latin1_swedish_ci'. Now I need to scale my website to store Polish, German, French, Ukrainian and Dutch in addition to English, but I don't know which collation to use. I found different answers for different languages on the web, but I need a general one. Please help me find a solution.
(Alvaro's answer is good; I am adding some notes.)
If you are using MySQL 5.5 or 5.6 and have VARCHAR(255), see this for some issues you might run into.
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;
(for each table) is probably the simplest way to convert to UTF-8. Caution: test it separately from production, and test that the Western European text does not get mangled. If you get gibberish or question marks, see this
In converting to CHARACTER SET utf8mb4, the preferred COLLATION is utf8mb4_unicode_520_ci. (With MySQL 8.0, there is a better one.)
utf8mb4 will let you handle all the languages of the world, so this should be the last 'conversion' necessary.
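Since the ALTER has to be run for each table, it can help to generate the statements rather than type them. A minimal Python sketch, assuming you already have the list of table names (the names below are hypothetical); it only builds strings and does not connect to MySQL, so review the output and run it against a test copy first.

```python
# Sketch: build one "ALTER TABLE ... CONVERT TO" statement per table.
# The charset and collation defaults are the ones suggested in this answer.

def alter_statements(tables, charset="utf8mb4", collation="utf8mb4_unicode_520_ci"):
    """Return a list of ALTER TABLE statements, one per table name."""
    statements = []
    for name in tables:
        quoted = "`" + name.replace("`", "``") + "`"  # escape backticks in identifiers
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements

# Hypothetical table names, for illustration only.
for stmt in alter_statements(["users", "orders"]):
    print(stmt)
```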
Before caring about collation, you need to migrate to a Unicode-compatible encoding first. As the name suggests, Latin-1* is designed for Latin script and cannot encode all Polish characters and, of course, none of the Cyrillic script. The obvious choice in 2019 is UTF-8, which corresponds to utf8mb4 in MySQL terminology.
Beware though that this may not be trivial. If your application assumes single-byte encodings, any text manipulation feature may need to be reviewed and maybe fixed. For instance, the € symbol is 1 byte long in Windows-1252 but 3 bytes long in UTF-8. Let's say you have code that strips it from a string like '29.92€'. If your application removes the last byte, code that worked flawlessly in a single-byte encoding will no longer be valid in a multi-byte encoding, because one byte isn't one character any more. Or, even in MySQL itself, something as simple as regular expressions wasn't multibyte-safe until MySQL 8.0.4.
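The '29.92€' case is easy to reproduce. The following Python sketch shows the byte-wise strip working under Windows-1252, breaking under UTF-8, and the character-wise version that is safe in both; the variable names are mine.

```python
price = "29.92€"

cp1252_bytes = price.encode("cp1252")  # € is the single byte 0x80
utf8_bytes = price.encode("utf-8")     # € is the three bytes E2 82 AC

# Removing the last byte is fine in the single-byte encoding...
ok = cp1252_bytes[:-1].decode("cp1252")

# ...but in UTF-8 it leaves two dangling bytes of the € sequence behind.
try:
    utf8_bytes[:-1].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Working on characters instead of bytes is safe in either encoding.
safe = price[:-1]

print(ok, decode_failed, safe)
```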
Once you address this, you need to pick a proper collation. Since you're mixing languages you need a general purpose Unicode one. Here's a good overview.
(*) MySQL is actually lying to you. When it says Latin-1 it actually means Windows-1252.

What will happen if I changed the collation of my database?

I am trying to increase performance in MySQL. One of the things I learned is that using the latin1 charset is faster than using UTF8 because latin1 uses fewer bytes. But I am wondering what will happen to the data if I change the collation. In my application today most things are in American English, but I can't guarantee that no other languages will be stored as well. If someone stores data in a language other than English, I don't really care much about that data.
My question:
1) If I change the collation of my databases to latin1, what will happen to data that was not written in American English?
2) Which latin1_? Do I use latin1_bin, latin1_general_ci, or latin1_general_cs? And, if possible, what is the difference?
3) When changing the collation of the database, do I also need to change the collation of each table separately?
Thanks
UTF-8 uses extra bytes only for characters outside plain ASCII, so really, you should NOT change your collation; it won't help. UTF-16 was developed to hold the characters of all languages, and yes, it uses at least 16 bits per character, so if you were using utf16 and mostly had standard Latin characters I would suggest utf8. UTF-8 is the compromise: the leading bits of a byte signal "more bytes coming", and when the decoder sees that, it groups the following bytes together into one character. But if all you have is unaccented English text, the number of bytes will be exactly the same as with latin1.
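You can check the storage claim yourself outside the database. A small Python comparison, using cp1252 as a stand-in for MySQL's latin1 (byte_sizes is a made-up helper name):

```python
# Compare how many bytes the same text needs in latin1 (cp1252) and in UTF-8.

def byte_sizes(text: str):
    """Return (bytes in cp1252, bytes in UTF-8) for the given text."""
    return len(text.encode("cp1252")), len(text.encode("utf-8"))

english = byte_sizes("Hello, world")  # plain ASCII: identical in both
german = byte_sizes("Grüße")          # ü and ß cost one extra byte each in UTF-8

print(english, german)
```

For mostly English data the difference is negligible, which is why the performance gain is unlikely to come from the charset.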
To answer specifically: you can set the default collation for new tables, but yes, you have to change each of the existing ones separately. You could do it with an SQL statement that lists the tables and then runs the change on each one (change one and note the SQL statement used). But again, don't do it. utf8 is the standard for a reason. Your performance issues are elsewhere.

Why does MySQL use latin1_swedish_ci as the default?

Does anyone know why latin1_swedish is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem that's what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation; it comes with some specific sorting quirks (ÄÖÜ come after Z, I think), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
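The difference between the two "latin1" definitions quoted above can be seen with Python's codecs, where "latin-1" means ISO 8859-1 and "cp1252" is the Windows code page. This only illustrates the two character sets, not MySQL itself.

```python
# Byte 0x80 lies in the 0x80-0x9f range where the two definitions disagree.
raw = bytes([0x80])
as_cp1252 = raw.decode("cp1252")   # a printable character: the euro sign
as_iso = raw.decode("latin-1")     # U+0080, a non-printing control code

# Outside that range the two agree, e.g. 0xE9 is 'é' in both.
same_above = bytes([0xE9]).decode("cp1252") == bytes([0xE9]).decode("latin-1")

print(repr(as_cp1252), repr(as_iso), same_above)
```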
Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to its length in characters. With a multi-byte encoding, if you use functions like SUBSTRING it is not intuitively clear whether you mean characters or bytes. For the same reasons, supporting multi-byte encodings requires quite a big change to the internal code.
Most strange features of this kind are historic. They did it like that long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.
To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not really UTF-8! MySQL has been around for a long time, since before UTF-8 was in common use. As explained above, this is likely why it is not the default (backwards compatibility, and expectations of 3rd-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that stores at most 3 bytes per character, which cannot hold every UTF-8 character. Now that it exists, they have chosen not to increase it to 4 bytes or remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real 4-byte UTF-8.
It's important that anyone migrating a MySQL database to utf8 or building a new one knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
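To get a feel for which text is affected, here is a small Python check. It assumes the 3-byte limit described above, which corresponds to code points up to U+FFFF; the function name fits_in_utf8mb3 is hypothetical and this is not a MySQL API.

```python
# Sketch: would this text fit in MySQL's 3-byte "utf8"?
# Characters above U+FFFF need 4 bytes in UTF-8 and therefore require utf8mb4.

def fits_in_utf8mb3(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_in_utf8mb3("naïve 中文"))  # accents and CJK are fine
print(fits_in_utf8mb3("thumbs up 👍"))  # emoji need the fourth byte
```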

Should I migrate a MySQL database with a latin1_swedish_ci collation to utf-8 and, if so, how?

The MySQL database used by my Rails application currently has the default collation of latin1_swedish_ci. Since the default charset of Rails applications (including mine) is UTF-8, it seems sensible to me to use the utf8_general_ci collation in the database.
Is my thinking correct?
Assuming it is, what would be the best approach to migrate the collation and all the data in the database to the new encoding?
UTF-8, as well as any other Unicode encoding scheme, can store characters in any language, so it is an excellent choice of codepage for your database.
The collation setting, on the other hand, is a completely separate issue from the encoding scheme. It involves sort orders, upper/lowercase conversions, string equality comparisons, and things like that which are language-specific. The collation setting should match the language that is used in the database.
The UTF-8 general collation is (I am assuming here—I'm not familiar with MySQL in particular) used for situations where the language is unknown and some simple default ordering is needed. It probably corresponds to the Unicode code point ordering, which is almost certainly not what you want if you're storing Swedish.
Convert to UTF-8 as the charset.
Collation settings are only used for sorting and stuff like that. Choose the collation that most of your users would expect.
Provided your existing data in the database is CORRECTLY encoded in latin1, converting the tables to utf8 (using ALTER TABLE, as described in the docs) should just work.
Then all your application needs to do is continue doing whatever it did before. If your application wants to use unicode characters, it should set its connection encoding to utf8 and use utf8, but that's its own problem.
The problem is that a large number of crap web apps have historically sent utf8 data to mysql and told it to treat it as latin1. MySQL will honour this perfectly and save junk into the tables, as instructed.
Converting the tables from latin1 to utf8 will NOT repair this mistake, as you genuinely do have total rubbish in there. Repairing them is nontrivial, particularly if during the lifetime of the app it's been talking different types of rubbish to the database.
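The failure mode described above can be imitated in Python, with cp1252 standing in for MySQL's latin1. This is only a sketch of the simplest case (one consistent layer of mis-labelling); data that has been through several different kinds of mangling will not come back this easily.

```python
original = "café"

# The app sends UTF-8 bytes but tells the server they are latin1,
# so each byte is stored as its own latin1 character.
stored = original.encode("utf-8").decode("cp1252")

# Converting the column to utf8 faithfully converts those wrong characters,
# so the text still reads as junk afterwards.
after_convert = stored.encode("utf-8").decode("utf-8")

# The repair is to undo the mislabelling: back to the raw bytes, then read them as UTF-8.
repaired = stored.encode("cp1252").decode("utf-8")

print(stored, after_convert, repaired)
```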
Use the MySQL query below to convert your column:
ALTER TABLE users MODIFY description VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
To see full details about your table:
SHOW FULL COLUMNS FROM users;