I'm trying to convert some MySQL tables from latin1 to utf8. I'm using the following command, which seems to mostly work.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
However, on one table I get an error about a duplicate key entry. This is caused by a unique index on a "name" field. It seems that when converting to utf8, any "special" characters are indexed as their plain English equivalents. For example, there is already a record with a name field value of "Dru". When converting to utf8, a record with "Drü" is considered a duplicate. The same goes for "Patrick" and "Påtrìçk".
Here is how to reproduce the issue:
CREATE TABLE `example` ( `name` char(20) CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (`name`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO example (name) VALUES ('Drü'),('Dru'),('Patrick'),('Påtrìçk');
ALTER TABLE example convert to character set utf8 collate utf8_general_ci;
ERROR 1062 (23000): Duplicate entry 'Dru' for key 1
The reason why the strings 'Drü' and 'Dru' evaluate as the same is that in the utf8_general_ci collation, they count as "the same". The purpose of a collation for a character set is to provide a set of rules as to when strings are the same, when one sorts before the other, and so on.
If you want a different set of comparison rules, you need to choose a different collation. You can see the available collations for the utf8 character set by issuing SHOW COLLATION LIKE 'utf8%'. There are a bunch of collations intended for text that is mostly in a specific language; there is also the utf8_bin collation which compares all strings as binary strings (i.e. compares them as sequences of 0s and 1s).
utf8_general_ci is accent insensitive.
Use utf8_bin or a language-specific collation.
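For example, applying this to the reproduction case above, converting with utf8_bin succeeds because 'Dru' and 'Drü' are no longer considered equal (a sketch; pick whichever collation matches the comparisons you actually need):
ALTER TABLE example CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;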
Related
I have a MySQL database with the charset utf8mb4 and the collation utf8mb4_unicode_ci.
Now I noticed that this influences my search queries where I use LIKE '%grün%'.
This would also match 'Grund'.
I found that this behavior is caused by the charset and collation of my tables/columns.
Now I want to switch the tables to the collation utf8mb4_de_pb_0900_ai_ci to avoid this wrong matching of German umlauts.
So first I changed the default settings for my database, which is accepted:
ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_de_pb_0900_ai_ci;
Setting the default setting for my first table is also accepted
ALTER TABLE tablename CHARACTER SET utf8mb4 COLLATE utf8mb4_de_pb_0900_ai_ci;
But when I want to convert the existing data to the new settings, I get an error:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_de_pb_0900_ai_ci;
Referencing column 'column1' and referenced column 'id' in foreign key constraint 'contraintname_fkey' are incompatible.
I can try this with every table and always get the error that the constraint is incompatible, because the referenced table has not been converted yet.
I found clever queries to generate all the ALTER statements, but I cannot execute them because of the error described above.
Is there an easy way to do this?
You can disable checking of foreign keys while you are altering your tables.
SET FOREIGN_KEY_CHECKS=0;
...Your ALTER TABLE queries...
SET FOREIGN_KEY_CHECKS=1;
Remember that AI in the collation means Accent Insensitive, meaning accents are not taken into account when comparing text. For a collation that is sensitive to accents use a collation with _AS_ in its name.
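If you would rather generate the ALTER statements than type them by hand, a query along these lines can produce them (a sketch; 'your_database' is a placeholder for your schema name, and the collation is the one from the question):
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_de_pb_0900_ai_ci;') AS stmt
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_TYPE = 'BASE TABLE';
Run the generated statements between the two SET FOREIGN_KEY_CHECKS statements shown above.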
When running SHOW CREATE TABLE `my_table`;, I notice that COLLATE utf8mb4_unicode_ci is shown for every char, varchar, and text column in the table. This seems a bit redundant since the collation is already declared in the table_option portion of the create statement.
mysql> SHOW CREATE TABLE `my_table`;
| Table | Create Table
| my_table | CREATE TABLE `my_table` (
...
`char_col_1` char(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`varchar_col_1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`varchar_col_2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`varchar_col_3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`text_col_1` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
...
) ENGINE=InnoDB AUTO_INCREMENT=1816178 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
This behavior is noticeable in both MySQL 5.7 and MySQL 8.0 and therefore most likely in other versions as well.
Is this behavior normal and acceptable, or is it a symptom of something that is misconfigured either with the table, database, or MySQL instance?
On the other hand, since collation can be individually set for any specific column, perhaps it is better to explicitly display the collation for every column to avoid any ambiguity or assumptions, even in cases where the collation of the column matches the collation of the table?
You have touched only the tip of the iceberg.
I think the settings on the table are just defaults for columns that are defined without charset or collate.
Ditto for ALTER TABLE ADD COLUMN -- will inherit from the table defaults.
I think the column settings are stored in the information_schema.COLUMNS table and won't change without an explicit ALTER TABLE .. MODIFY COLUMN ..
Similarly, the table charset and collation inherit from the database definition, and are frozen when the table is defined.
About defaults:
The old default charset was latin1
The current default is utf8mb4; this is unlikely to ever change in the future.
Every collation applies to exactly one charset, and the charset name is the beginning of the collation name.
Each charset has exactly one "default" collation: latin1_swedish_ci, utf8_general_ci, utf8mb4_0900_ai_ci, etc.
That default collation (for a given charset) has rarely, if ever, changed. Perhaps the only change has been for utf8mb4 between 5.7 and 8.0??
(The more I experiment, the less certain I am about all this.)
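To see what your own server reports, you can check directly (these work on both 5.7 and 8.0):
SHOW CHARACTER SET LIKE 'utf8%';          -- each charset with its default collation and max bytes per character
SHOW COLLATION WHERE `Default` = 'Yes';   -- the one default collation per charset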
Best practice: Always explicitly set CHARACTER SET and COLLATE for each string column (a sketch follows the list below).
Secondary considerations:
Use utf8mb4, if available, for most string columns (VARCHAR / TEXT).
Use the latest available collation (Unicode keeps improving it); currently utf8mb4_0900_ai_ci.
Use ascii for things that are clearly only ascii -- country-code, postal_code, hex, etc. Mostly these can use CHAR(..)
Use ascii_general_ci or ascii_bin, depending on whether you need case folding.
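A minimal sketch of that best practice (hypothetical table and column names; utf8mb4_0900_ai_ci requires 8.0, so use utf8mb4_unicode_ci on 5.7):
CREATE TABLE person (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
    country_code CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;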
Yes, it is redundant to have CHARACTER SET and COLLATION the same in a table definition and a column definition.
Having explicit column definitions means that if anyone changes the table-level CHARACTER SET or COLLATION, the columns will remain unchanged.
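To illustrate the difference with a hypothetical table my_table: changing only the table default leaves existing columns alone, while CONVERT TO rewrites their data:
ALTER TABLE my_table DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;   -- defaults only; existing columns keep their charset/collation
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; -- converts every string column and its data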
We have a database where the default character set for tables and columns is set to utf8.
But with the utf8 character set, we are unable to save emojis.
To support saving of emojis,
a) We had to change the character set of table and columns to utf8mb4
b) We had to change the collation of table and columns to utf8mb4_unicode_ci
c) Update our JDBC driver so that it supports the unicode encoding
With the above changes we are able to save the emoji in our columns.
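A quick smoke test of such a setup might look like this (table and column names are hypothetical, and the connection charset must also be utf8mb4):
INSERT INTO messages (body) VALUES ('😀');
SELECT body, HEX(body) FROM messages;  -- the emoji should come back intact, stored as F09F9880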
Question:
1) Do I need to drop the existing indexes (on VARCHAR columns) and recreate them, given that with utf8 each character used to take up to 3 bytes and with utf8mb4 it can occupy up to 4 bytes?
An index is an ordered list of pointers to the rows of the table. The ordering is based on both the CHARACTER SET and COLLATION of the column(s) of the index. If you change either, the index must be rebuilt. A "pointer" (in this context) is a copy of the PRIMARY KEY.
You should do one or the other of
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATE ...;
which converts all the text columns in the table. Or, if you need to leave some with their current charset/collation, then change each column:
ALTER TABLE tbl MODIFY col_name ... CHARACTER SET utf8mb4 COLLATE ...;
where the first '...' is the rest of the column definition (VARCHAR, NOT NULL, whatever).
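For example, with a hypothetical column name defined as VARCHAR(100) NOT NULL, the filled-in statement would be:
ALTER TABLE tbl MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;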
Any indexes that involve the columns being changed will be rebuilt. In particular, note that a VARCHAR PRIMARY KEY is effectively in each secondary index.
The collation utf8mb4_unicode_ci is rather old; you might prefer utf8mb4_unicode_520_ci, especially since it handles Emoji as being distinct rather than lumped together (IIRC).
The fact that utf8 is a subset of utf8mb4 is not relevant; MySQL sees it as a change and fails to take any short cuts.
I have a utf8_general_ci database that I'm interested in converting to utf8_unicode_ci.
I've tried the following commands
ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; (for every single table)
But that only seems to set the defaults for future data and doesn't convert the actual existing data from utf8_general_ci to utf8_unicode_ci.
Is there any way to convert the existing data to utf8_unicode_ci?
Run SHOW CREATE TABLE to see whether it really set the CHARACTER SET and COLLATION on the columns, not just the defaults.
What was the CHARACTER SET before the ALTERs?
Do SELECT col, HEX(col) ... for some field that should have utf8 in it. This will help us determine whether you really have utf8 in the table. The byte encoding of characters differs based on the CHARACTER SET; the HEX output helps reveal which one is actually stored.
The ordering (WHERE, ORDER BY, etc) is controlled by COLLATION. The indexes probably had to be rebuilt based on your ALTER TABLE. Did big tables with indexes take a 'long' time to convert?
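As an illustration of the HEX check, using the example table shown earlier in this thread (expected values shown as comments):
SELECT name, HEX(name) FROM example;
-- 'Drü' stored as latin1: 4472FC    (ü = FC)
-- 'Drü' stored as utf8:   4472C3BC  (ü = C3BC)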
To actually see the difference between utf8_general_ci and utf8_unicode_ci, you need a "combining accent" or, more simply, the German ß versus ss:
mysql> SELECT 'ß' = 'ss' COLLATE utf8_general_ci,
'ß' = 'ss' COLLATE utf8_unicode_ci;
+-------------------------------------+-------------------------------------+
| 'ß' = 'ss' COLLATE utf8_general_ci | 'ß' = 'ss' COLLATE utf8_unicode_ci |
+-------------------------------------+-------------------------------------+
|                                   0 |                                   1 |
+-------------------------------------+-------------------------------------+
However, to test that in your tables, you would need to store those values and use WHERE or GROUP_CONCAT or something else to determine the equality.
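A sketch of such a test (hypothetical table name coll_test):
CREATE TABLE coll_test (s VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
INSERT INTO coll_test VALUES ('ß'), ('ss');
SELECT COUNT(DISTINCT s) FROM coll_test;  -- returns 1 under utf8_unicode_ci, 2 under utf8_general_ci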
What 'proof' do you have that the ALTERs failed to achieve the collation change?
(Addressing other comments: REPAIR should be irrelevant. CONVERT TO tells the ALTER to actually modify the data, so it should have done the desired action.)
You have to change the collation of every field in every table. As you say, the collation of the table is only the default value for fields created later, and the collation of the database is only the default value for tables created later.
As Lorenz Meyer said, the collation of the table is only the default value for fields created later, so you need to set the collation on the existing columns explicitly too.
Such a change looks like:
ALTER TABLE mytable CHANGE mycolumn mycolumn varchar(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
Recently I changed a bunch of columns to utf8_general_ci (the default UTF-8 collation) but when attempting to change a particular column, I received the MySQL error:
Column 'node_content' cannot be part of FULLTEXT index
In looking through docs, it appears that MySQL has a problem with FULLTEXT indexes on some multi-byte charsets such as UCS-2, but that it should work on UTF-8.
I'm on the latest stable MySQL 5.0.x release (5.0.77 I believe).
Oops, so I have found the answer to my problem:
All columns of a FULLTEXT index must have not only the same character set but also the same collation.
My FULLTEXT index had utf8_unicode_ci on one of its columns, and utf8_general_ci on its other columns.
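To spot such a mismatch, you can list the collation of every text column in the table (schema and table names are placeholders):
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_NAME   = 'your_table'
  AND COLLATION_NAME IS NOT NULL;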
Just to add to Thomas's good advice: to sort things out in phpMyAdmin you have to change the character set for all columns AT THE SAME TIME.
I just wasted half a day trying again and again to change the columns one at a time, continually getting the error message about the FULLTEXT index.
For DBeaver/database tool users.
When you use the interface to modify more than one column, the tool generates commands like this:
ALTER TABLE databaseName.tableName MODIFY COLUMN columnName1 text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
ALTER TABLE databaseName.tableName MODIFY COLUMN columnName2 varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
This does not work, because you must modify the charsets of the columns at the same time.
So you have to change them manually, in one statement:
ALTER TABLE databaseName.tableName
MODIFY COLUMN columnName1 text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL,
MODIFY COLUMN columnName2 text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
utf8 or utf8mb4? (See the discussion of utf8 vs utf8mb4 in the emoji question above.)