Proper configuration to mix Collations in SQL? - mysql

I'm a little confused by collations. I'm not sure whether the DB converts a column's collation to the table's collation on a SELECT, or whether a collation is just a rule set applied when comparing.
So what should I put as CHARSET and COLLATE? (10.4.11-MariaDB)
Here are some examples of what I have:
Case #1: I only SELECT the utf8_bin column and never compare it, but I do query the ascii column with WHERE bot = ?
CREATE TABLE `bots_trace` (
`id` int(10) UNSIGNED NOT NULL,
`bot` varchar(20) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`info` varchar(2000) COLLATE utf8_bin DEFAULT NULL,
`seen` enum('yes','no') CHARACTER SET ascii COLLATE ascii_bin NOT NULL DEFAULT 'no'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
I almost never ask the DB to do a utf8mb4_bin comparison or similar; I just SELECT.
So what collations should I use in those cases? What should I use as DEFAULT CHARSET and as COLLATE?
Case #2: The only time I ask the DB to do something with a utf8mb4 column is to check an email address.
CREATE TABLE `changed_email` (
`id` int(10) UNSIGNED NOT NULL,
`old_mail` varchar(256) COLLATE utf8mb4_bin NOT NULL,
`ctime` int(10) UNSIGNED NOT NULL,
`ip` varchar(94) CHARACTER SET ascii COLLATE ascii_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
SELECT id FROM changed_email WHERE old_mail = ? LIMIT 1
What to do in this case? Because the only comparison I do is a utf8mb4_bin one, I'm assuming that is the correct CHARSET & COLLATE.
Also, I use PHP and set mysqli_set_charset($link, 'utf8mb4'), which I needed in order to retrieve the data correctly. If I change some table's COLLATION to ascii, could I have trouble retrieving utf8mb4 data columns?

The ascii encoding is a subset of utf8, which is a subset of utf8mb4. But that is probably irrelevant here.
mysqli_set_charset() announces the CHARACTER SET of the data in the client.
During INSERT, MySQL will convert the bytes from the encoding indicated by mysqli_set_charset() to the encoding specified for the column in the table. Similarly, SELECT converts in the other direction.
If you are only dealing with ascii characters, there is effectively no conversion, and no possibility of problems. If, on the other hand, you have accented letters or Emoji, there will be problems (ascii cannot hold them).
The above talks about CHARACTER SET, which is the "encoding" of letters. The COLLATION is a different matter; this term refers to ordering, including case folding and accent stripping. For example, should 'a' = 'A' or not? For COLLATION ascii_general_ci or utf8mb4_..._ci, they are "equal". For any ..._bin collation they are "not equal", and one of them will consistently sort (think ORDER BY) before the other.
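For example, this is easy to see directly in a query (a quick sketch; assumes a utf8mb4 connection, no table needed):
SELECT 'a' = 'A' COLLATE utf8mb4_general_ci,  -- 1: the _ci collation case-folds
       'a' = 'A' COLLATE utf8mb4_bin;         -- 0: the _bin collation compares bytes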
In some, but not all, situations, MySQL will allow mixing character sets or collations and "do the right thing". For example, when a character from one CHARACTER SET is stored into another, either it can be converted or it gets mangled: 'A' is available in perhaps every character set, but an accented 'A', for example, is not available in ascii.
In the case of COLLATION, when there is a conflict of collations, there may be a rule that says which collation to use, but often MySQL gives up and complains about an "illegal mix of collations".
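A sketch of that failure and the usual workaround (table and column names here are hypothetical):
-- Suppose a.x is utf8mb4_unicode_ci and b.y is utf8mb4_general_ci.
-- Comparing them directly may fail with
-- "Illegal mix of collations ... for operation '='":
SELECT a.id FROM a JOIN b ON a.x = b.y;
-- An explicit COLLATE clause picks one collation for the comparison:
SELECT a.id FROM a JOIN b ON a.x = b.y COLLATE utf8mb4_unicode_ci;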
Keep in mind that all of this comes from multiple places:
The column definition
The connection parameters (between client and MySQL server)
The bytes in the client.
A common example: latin1 accented letters cannot be interpreted as utf8 bytes, but they can be converted to utf8. This rears its ugly head when the connection specification disagrees with the bytes actually in the client.
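A sketch of that distinction, using hex so the bytes are visible:
SELECT HEX(CONVERT(_latin1 0xE1 USING utf8));  -- 'C3A1': latin1 'á' re-encoded as utf8
SELECT _utf8 0xE1;  -- fails or warns ("Invalid utf8 character string"): 0xE1 alone is not valid utf8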

Related

Is it normal for a MySQL create table statement to include redundant collation declarations for every char, varchar, and text column?

When running SHOW CREATE TABLE `my_table`;, I notice that COLLATE utf8mb4_unicode_ci is shown for every char, varchar, and text column in the table. This seems a bit redundant since the collation is already declared in the table_option portion of the create statement.
mysql> SHOW CREATE TABLE `my_table`;
| Table | Create Table
| my_table | CREATE TABLE `my_table` (
...
`char_col_1` char(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`varchar_col_1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`varchar_col_2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`varchar_col_3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`text_col_1` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
...
) ENGINE=InnoDB AUTO_INCREMENT=1816178 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
This behavior is noticeable in both MySQL 5.7 and MySQL 8.0 and therefore most likely in other versions as well.
Is this behavior normal and acceptable, or is it a symptom of something that is misconfigured either with the table, database, or MySQL instance?
On the other hand, since collation can be individually set for any specific column, perhaps it is better to explicitly display the collation for every column to avoid any ambiguity or assumptions, even in cases where the collation of the column matches the collation of the table?
You have touched only the tip of the iceberg.
I think the settings on the table are just defaults for columns that are defined without charset or collate.
Ditto for ALTER TABLE ADD COLUMN -- will inherit from the table defaults.
I think that the column settings are recorded in the information_schema.COLUMNS table, and they won't change out from under you with an ALTER TABLE .. MODIFY COLUMN ..
Similarly, the table charset and collation are inherited from the database definition, and are frozen when the table is defined.
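One way to see what is actually frozen into each column is to query information_schema directly (the schema and table names below are placeholders):
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_db' AND TABLE_NAME = 'my_table';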
About defaults:
The old default charset was latin1
The current default is utf8mb4; this is unlikely to ever change in the future.
Every collation applies to exactly one charset, and the charset name is the beginning of the collation name.
Each charset has exactly one "default" collation: latin1_swedish_ci, utf8_general_ci, utf8mb4_0900_ai_ci, etc.
That default collation (for a given charset) has rarely, if ever, changed. Perhaps the only change has been for utf8mb4 between 5.7 and 8.0??
(The more I experiment, the less certain I am about all this.)
Best practice: Always explicitly set CHARSET and COLLATE for each string column.
Secondary considerations:
Use utf8mb4, if available, for most string columns (VARCHAR / TEXT).
Use the latest available collation (Unicode keeps improving it); currently utf8mb4_0900_ai_ci.
Use ascii for things that are clearly ascii-only -- country_code, postal_code, hex strings, etc. Mostly these can use CHAR(..).
Use ascii_general_ci or ascii_bin, depending on whether you need case folding.
Yes, it is redundant to have the same CHARACTER SET and COLLATION in both a table definition and a column definition.
Having explicit column definitions means that if anyone changes the table-level CHARACTER SET or COLLATION, the columns will remain unchanged.
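Putting that best practice together, a sketch of a table with every string column spelled out (the table is hypothetical; utf8mb4_0900_ai_ci assumes MySQL 8.0 -- on 5.7 or MariaDB, utf8mb4_unicode_ci is the nearest equivalent):
CREATE TABLE `user_profile` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`display_name` VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
`country_code` CHAR(2) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`api_token` CHAR(32) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,  -- hex string, so ascii
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;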

MySQL mixing Charset & Collations

I read different articles and topics on this forum to help me set up the charset & collation for my database, but I'm not sure about the choices I made. I would appreciate any comments or advice.
I'm using MySQL 5.5.
The database (used with PHP) will hold data in several languages (Chinese, French, Dutch, English, Spanish, Arabic, etc.).
I will mainly insert data and fetch rows by table IDs. I won't need to do full-text searches or compare text.
So here is what I've done to create my database: I decided to use CHARSET utf8mb4 and COLLATION utf8mb4_unicode_ci.
ALTER DATABASE testDB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
When I create the table:
CREATE TABLE IF NOT EXISTS sector (
idSector INT(5) NOT NULL AUTO_INCREMENT,
sectoreName VARCHAR(45) NOT NULL DEFAULT '',
PRIMARY KEY (idSector)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=0;
For some tables, I thought it was better to use utf8_bin.
Ex: timezone (contains 168,047 rows)
CREATE TABLE timezone (
zone_id int(10) NOT NULL,
abbreviation varchar(6) COLLATE utf8_bin NOT NULL,
time_start decimal(11,0) NOT NULL,
gmt_offset int(11) NOT NULL,
dst char(1) COLLATE utf8_bin NOT NULL,
KEY idx_zone_id (zone_id),
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
So basically I would like to know if I'm on the right track or if I'm doing something that could lead to problems.
Different columns can have different character sets and/or collations, but...
If you compare columns of different charsets or collations (WHERE a.x = b.y), indexes cannot be used; see the sketch after this list.
utf8 does not handle all of Chinese, nor does it handle some Emoji. For those, you need utf8mb4.
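A sketch of that index problem (hypothetical tables, one column utf8 and the other utf8mb4):
-- b.y is utf8; a.x is utf8mb4. The comparison forces a charset conversion,
-- so an index on b.y cannot be used:
SELECT a.id FROM a JOIN b ON b.y = a.x;
-- The fix is to give both columns the same charset and collation:
ALTER TABLE b MODIFY y VARCHAR(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;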
On other issues...
In INT(5), the (5) means nothing. Check out SMALLINT UNSIGNED with a range of 0..65535.
time_start decimal(11,0) is strange for a time. If it is a unix timestamp, either TIMESTAMP or INT UNSIGNED should work ok. See also TIME.
dst char(1) COLLATE utf8_bin -- this takes 3 bytes, because of utf8. Perhaps you want CHARACTER SET ascii so it will be only 1 byte?
InnoDB tables really should be given an explicit PRIMARY KEY. (Probably zone_id?)
You are making a good choice for your sectoreName column. Notice one thing: utf8mb4_unicode_ci is a good collation for most languages. But, for Spanish, it gets the alphabet wrong: in that language, N and Ñ are considered different letters, and Ñ appears immediately after N in the collating sequence. In other European languages they are considered the same letter. So your Spanish-language users, when they ask for Niña, will get back both Niña and Nina. That may appear to them as a mistake. (But they're probably used to getting this sort of thing from pan-European software applications.)
You should use utf8mb4 as your character set throughout any new application. So, use that instead of utf8 in your timezone table. Using the _bin collation for your abbreviation column is fine.
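Pulling those suggestions together, the timezone table might look something like this (a sketch; the assumptions are flagged in comments -- verify your data fits before changing any types):
CREATE TABLE timezone (
zone_id SMALLINT UNSIGNED NOT NULL,  -- assumes ids fit in 0..65535
abbreviation VARCHAR(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
time_start INT UNSIGNED NOT NULL,  -- assumes a non-negative unix timestamp
gmt_offset INT NOT NULL,
dst CHAR(1) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,  -- 1 byte instead of 3
PRIMARY KEY (zone_id, time_start),  -- assumes each zone has many transition rows
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;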

Understanding character sets in column and data and how they affect select queries

After reading articles about how to properly store data inside a MySQL database, I have a reasonably good understanding of how character sets work. Yet there are still some cases I find hard to understand. Let me give a quick example: say I have a table with a varchar column using the latin1 charset (for example's sake):
CREATE TABLE `foobar` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`foo` varchar(100) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
and I force-insert a latin1 character using its hex value (á, for instance, which would be E1):
SET NAMES 'latin1';
INSERT INTO foobar SET foo = UNHEX('E1')
Why does a SELECT query against this table under a latin1 connection fail to give me the value I expect? I get the replacement character instead. I would have expected it to just give the value back as-is, since the column is latin1, the connection itself is latin1, and the stored value is a valid latin1 character. Is MySQL trying to interpret E1 as a UTF8 value? Or am I missing something here?
Using a UTF8 connection (SET NAMES 'utf8') before selecting gives me the expected value of á, which I guess is correct: from what I understand, if the connection charset differs from the charset of the column being selected, MySQL reads the data as whatever the column charset says it is (so E1 gets parsed as the latin1 character á).
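The HEX() diagnostic suggested further down this page is useful here, because it shows the stored bytes without any connection-charset interpretation:
SELECT foo, HEX(foo) FROM foobar;
-- If HEX(foo) is 'E1', the column really holds the single latin1 byte,
-- and whatever appears in the foo column is a conversion/display issue,
-- not a storage issue.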

Convert from utf8_general_ci to utf8_unicode_ci

I have a utf8_general_ci database that I'm interested in converting to utf8_unicode_ci.
I've tried the following commands
ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; (for every single table)
But that seems to change the collation only for future data; it doesn't convert the actual existing data from utf8_general_ci to utf8_unicode_ci.
Is there any way to convert the existing data to utf8_unicode_ci?
Run SHOW CREATE TABLE to see whether the ALTERs really set the CHARACTER SET and COLLATION on the columns, not just the defaults.
What was the CHARACTER SET before the ALTERs?
Do SELECT col, HEX(col) ... for some field that should contain utf8 text. This will help determine whether you really have utf8 in the table. The byte encoding of characters differs by CHARACTER SET; the HEX output helps discover which one is actually stored.
The ordering (WHERE, ORDER BY, etc) is controlled by COLLATION. The indexes probably had to be rebuilt based on your ALTER TABLE. Did big tables with indexes take a 'long' time to convert?
To actually see the difference between utf8_general_ci and utf8_unicode_ci, you need a "combining accent" or, more simply, the German ß versus ss:
mysql> SELECT 'ß' = 'ss' COLLATE utf8_general_ci,
'ß' = 'ss' COLLATE utf8_unicode_ci;
+-------------------------------------+-------------------------------------+
| 'ß' = 'ss' COLLATE utf8_general_ci | 'ß' = 'ss' COLLATE utf8_unicode_ci |
+-------------------------------------+-------------------------------------+
| 0 | 1 |
+-------------------------------------+-------------------------------------+
However, to test that in your tables, you would need to store those values and use WHERE or GROUP_CONCAT or something else to determine the equality.
What 'proof' do you have that the ALTERs failed to achieve the collation change?
(Addressing other comments: REPAIR should be irrelevant. CONVERT TO tells the ALTER to actually modify the data, so it should have done the desired action.)
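A minimal in-table test along those lines (hypothetical table name; assumes a utf8 connection):
CREATE TABLE collation_test (
s VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);
INSERT INTO collation_test (s) VALUES ('ß'), ('ss');
-- utf8_unicode_ci treats 'ß' and 'ss' as equal, so this collapses to one group;
-- under utf8_general_ci it would return two rows:
SELECT s, COUNT(*) FROM collation_test GROUP BY s;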
You have to change the collation of every field in every table. As you say, the collation of the table is only the default value for fields created later, and the collation of the database is only the default value for tables created later.
As Lorenz Meyer said, the collation of the table is only the default value for fields created later, so you need to set the collation for each existing column explicitly too.
Such a change looks like:
ALTER TABLE mytable CHANGE mycolumn mycolumn varchar(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
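Since that has to be repeated for every table, one common approach is to let information_schema generate the statements (a sketch; 'dbname' is a placeholder, and the emitted ALTERs still have to be executed afterwards):
SELECT CONCAT('ALTER TABLE `', TABLE_NAME,
'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'dbname' AND TABLE_TYPE = 'BASE TABLE';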

Converting mysql tables from latin1 to utf8

I'm trying to convert some mysql tables from latin1 to utf8. I'm using the following command, which seems to mostly work.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
However, on one table I get an error about a duplicate key entry. This is caused by a unique index on a "name" field. It seems that when converting to utf8, any "special" characters are indexed as their plain English equivalents. For example, there is already a record with a name field value of "Dru". When converting to utf8, a record with "Drü" is considered a duplicate. The same goes for "Patrick" and "Påtrìçk".
Here is how to reproduce the issue:
CREATE TABLE `example` ( `name` char(20) CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (`name`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO example (name) VALUES ('Drü'),('Dru'),('Patrick'),('Påtrìçk');
ALTER TABLE example convert to character set utf8 collate utf8_general_ci;
ERROR 1062 (23000): Duplicate entry 'Dru' for key 1
The reason why the strings 'Drü' and 'Dru' evaluate as the same is that in the utf8_general_ci collation, they count as "the same". The purpose of a collation for a character set is to provide a set of rules as to when strings are the same, when one sorts before the other, and so on.
If you want a different set of comparison rules, you need to choose a different collation. You can see the available collations for the utf8 character set by issuing SHOW COLLATION LIKE 'utf8%'. There are a bunch of collations intended for text that is mostly in a specific language; there is also the utf8_bin collation which compares all strings as binary strings (i.e. compares them as sequences of 0s and 1s).
utf8_general_ci is accent-insensitive.
Use utf8_bin or a language-specific collation.
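Applied to the repro above, a binary collation lets the conversion succeed, because the accented and unaccented names no longer compare equal (a sketch):
ALTER TABLE example CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
-- Succeeds: under utf8_bin, 'Drü' and 'Dru' are distinct key values,
-- as are 'Patrick' and 'Påtrìçk'.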