Our previous programmer set the wrong collation in a table (Mysql). He set it up with Latin collation, when it should be UTF8, and now I have issues. Every record with Chinese and Japan character turn to ??? character.
Is possible to change collation and get back the detail of character?
change database collation:
ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
change table collation:
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
change column collation:
ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
What do the parts of utf8mb4_0900_ai_ci mean?
3 bytes -- utf8
4 bytes -- utf8mb4 (new)
v4.0 -- _unicode_
v5.20 -- _unicode_520_
v9.0 -- _0900_ (new)
_bin -- just compare the bits; don't consider case folding, accents, etc
_ci -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
_bin -- simple, fast
_general_ci -- fails to compare multiletters; eg ss=ß, somewhat fast
... -- slower
_0900_ -- (8.0) much faster because of a rewrite
More info:
What are the differences between utf8_general_ci and utf8_unicode_ci?
What's the difference between utf8_general_ci and utf8_unicode_ci?
How to change collation of database, table, column?
What's the difference between utf8_general_ci and utf8_unicode_ci?
Here's how to change all databases/tables/columns. Run these queries and they will output all of the subsequent queries necessary to convert your entire database to character encoding utf8mb4 and collations to the MySQL 8 default of utf8mb4_0900_ai_ci. Hope this helps!
-- Change DATABASE Default Collation
SELECT
CONCAT('ALTER DATABASE `', SCHEMA_NAME,'` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND SCHEMA_NAME LIKE 'database_name';
Note that changing the schema default changes the default for new tables (and their columns). It does not modify existing columns of existing tables.
-- Change TABLE Collation / Char Set
SELECT
CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND TABLE_TYPE = 'BASE TABLE'
AND TABLE_SCHEMA LIKE 'database_name';
Note that changing the table default changes the default for new columns. It does not modify existing columns on existing tables.
-- Change COLUMN Collation / Char Set
SELECT
CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` MODIFY COLUMN `', COLUMN_NAME, '` ', COLUMN_TYPE,
IF(COLUMN_DEFAULT IS NOT NULL, CONCAT(' DEFAULT \'', COLUMN_DEFAULT, '\''), ''),
IF(IS_NULLABLE = 'YES', ' NULL ', ' NOT NULL '),
' COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND COLLATION_NAME IS NOT NULL
AND TABLE_SCHEMA LIKE 'database_name'
AND COLLATION_NAME = 'old_collation_name';
This changes the actual columns and the database behaviour on queries. It does not, however, convert the data if the data is not in a compatible collation/character set. See https://dev.mysql.com/blog-archive/mysql-8-0-collations-migrating-from-older-collations/ for details on migrating from older collations. We also assume here that your default values do not include a single quote - it would need to be escaped - and we ensure that COLLATION_NAME is not NULL to exclude columns with integers, timestamps etc.
We filter out the built-in system schemas such as sys and mysql in all three cases, as these should likely not be modified unless you have an explicit reason to do so.
Beware that in Mysql, the utf8 character set is only a subset of the real UTF8 character set. In order to save one byte of storage, the Mysql team decided to store only three bytes of a UTF8 characters instead of the full four-bytes. That means that some east asian language and emoji aren't fully supported. To make sure you can store all UTF8 characters, use the utf8mb4 data type, and utf8mb4_bin or utf8mb4_general_ci in Mysql.
Adding to what David Whittaker posted, I have created a query that generates the complete table and columns alter statement that will convert each table. It may be a good idea to run
SET SESSION group_concat_max_len = 100000;
first to make sure your group concat doesn't go over the very small limit as seen here.
SELECT a.table_name, concat('ALTER TABLE ', a.table_schema, '.', a.table_name, ' DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci, ',
group_concat(distinct(concat(' MODIFY ', column_name, ' ', column_type, ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ', if (is_nullable = 'NO', ' NOT', ''), ' NULL ',
if (COLUMN_DEFAULT is not null, CONCAT(' DEFAULT \'', COLUMN_DEFAULT, '\''), ''), if (EXTRA != '', CONCAT(' ', EXTRA), '')))), ';') as alter_statement
FROM information_schema.columns a
INNER JOIN INFORMATION_SCHEMA.TABLES b ON a.TABLE_CATALOG = b.TABLE_CATALOG
AND a.TABLE_SCHEMA = b.TABLE_SCHEMA
AND a.TABLE_NAME = b.TABLE_NAME
AND b.table_type != 'view'
WHERE a.table_schema = ? and (collation_name = 'latin1_swedish_ci' or collation_name = 'utf8mb4_general_ci')
GROUP BY table_name;
A difference here between the previous answer is it was using utf8 instead of ut8mb4 and using t1.data_type with t1.CHARACTER_MAXIMUM_LENGTH didn't work for enums. Also, my query excludes views since those will have to altered separately.
I simply used a Perl script to return all these alters as an array and iterated over them, fixed the columns that were too long (generally they were varchar(256) when the data generally only had 20 characters in them so that was an easy fix).
I found some data was corrupted when altering from latin1 -> utf8mb4. It appeared to be utf8 encoded latin1 characters in columns would get goofed in the conversion. I simply held data from the columns I knew was going to be an issue in memory from before and after the alter and compared them and generated update statements to fix the data.
here describes the process well. However, some of the characters that didn't fit in latin space are gone forever. UTF-8 is a SUPERSET of latin1. Not the reverse. Most will fit in single byte space, but any undefined ones will not (check a list of latin1 - not all 256 characters are defined, depending on mysql's latin1 definition)
Related
I have a MySQL database, full of data, with collation = latin1 - default collation.
If I change that collation to utf8 - default collation, a superset of the collation above,
would there be any corruption in the existing data?
Your data will not be corrupted, but you must take care to update the collation of all tables and columns and not just change the default collation of the database as this would only apply to new tables. If you do not alter all existing tables you might encounter some odd behaviour when running queries where data are compared.
I would recommend that you run the following queries as taken from David Whittaker's answer
Heres how to change all databases/tables/columns. Run these queries and they will output all of the subsequent queries necessary to convert your entire schema to utf8. Hope this helps!
-- Change DATABASE Default Collation
SELECT DISTINCT concat('ALTER DATABASE `', TABLE_SCHEMA, '` CHARACTER SET utf8 COLLATE utf8_unicode_ci;')
from information_schema.tables
where TABLE_SCHEMA like 'database_name';
-- Change TABLE Collation / Char Set
SELECT concat('ALTER TABLE `', TABLE_SCHEMA, '`.`', table_name, '` CHARACTER SET utf8 COLLATE utf8_unicode_ci;')
from information_schema.tables
where TABLE_SCHEMA like 'database_name';
-- Change COLUMN Collation / Char Set
SELECT concat('ALTER TABLE `', t1.TABLE_SCHEMA, '`.`', t1.table_name, '` MODIFY `', t1.column_name, '` ', t1.data_type , '(' , t1.CHARACTER_MAXIMUM_LENGTH , ')' , ' CHARACTER SET utf8 COLLATE utf8_unicode_ci;')
from information_schema.columns t1
where t1.TABLE_SCHEMA like 'database_name' and t1.COLLATION_NAME = 'old_charset_name';
How could I do this SQL query?
Name Database: database_1
Column: Collation
Value: latin1_swedish_ci
I need the list of ALL "database_1" tables that have columns whose value contains the text %latin1%.
My final goal: I need to change these values in all the tables that contain this data, for example:
CHARACTER SET latin1 COLLATE latin1_swedish_ci
Change to:
CHARACTER SET utf8 COLLATE utf8_unicode_ci
I already have the consult ready for replacement:
ALTER TABLE `blog_sitemapconf` CHANGE `value` `value` VARCHAR(100) CHARACTER SET latin1 COLLATE latin1_english_ci NOT NULL DEFAULT '';
But I need to get the name of all tables that contain columns with
CHARACTER SET latin1 COLLATE latin1_swedish_ci
You can query this information from the information_schema.columns table:
SELECT table_name, column_name
FROM information_schema.columns
WHERE collation_name LIKE '%latin1%'
First, determine how you will be doing the charset conversion. If the table is correctly encoded in latin1, then this is the desired conversion:
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
If, instead, you have utf8 bytes stored in latin1 columns, then the 'fix' is more complex.
However, that changes all CHAR/TEXT columns in the table. Is that OK?
If that is OK, then this will generate all the ALTERs.
SELECT DISTINCT
CONCAT("ALTER TABLE ", table_schema, ".", table_name,
" CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;")
FROM information_schema.columns
WHERE character_set = 'latin1'
AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
You can then copy/paste them into the mysql commandline tool.
If you need to change only individual columns, then the fix is difficult because the rest of the attributes of each column must be repeated. This is not easy to do in a query.
I am developing with mariadb and Spring, JdbcTemplate.
At first, we made DB charset as utf8, but now we have to change it into utf8mb4 because of emojis.
Till now I update individual charset with query something like below.
ALTER TABLE WT_WORKS CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE WT_WORKS CHANGE WORKS_TITLE WORKS_TITLE VARCHAR(300) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE WT_WORKS CHANGE WORKS_DESC WORKS_DESC VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
However It is now efficient because of table relations. For example when I insert into WT_WORKS, also need to into WT_WRITERS. It looks impossible to find every tables and columns.
So I want to know change these at once.(Including Procedures and Functions). -- something like (v_name VARCHAR(10)) into (v_name VARCHAR(10) CHARSET utf8mb4).
Thanks for answer.
FYI. my my.cnf got follow settings
[client]
default-character-set=utf8mb4
[mysqld]
collation-server = utf8mb4_unicode_ci
character-set-server = utf8mb4
[client]
default-character-set = utf8mb4
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME = 'utf8'
will list all the tables that still have some column set to utf8. When them, do ALTER TABLE .. CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
This should generate all the ALTERs you need:
SELECT DISTINCT
CONCAT(
"ALTER TABLE ", TABLE_SCHEMA, ".", TABLE_NAME,
" CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;"
)
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME = 'utf8'
Then copy and paste them into the mysql command line tool.
Caveat: If you have some columns with charsets other than utf8, they will be blindly converted to utf8mb4. This would be bad for hex, ascii, etc., columns, such as country_code, uuid, md5, etc.
You could do something similar to change individual columns instead.
You do not need to do both.
This question is about existing tables. Changing the default for a database is well covered.
Most questions here on Stack Overflow seems to assume that you want to change both the character set and the collation - say from latin1_general_ci to utf8_spanish_ci. The answers typically recommends:
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8 COLLATE utf8_spanish_ci;
This seems like a waste when the character set is unchanged - only the collation should be updated - but maybe MySQL is clever enough so it discovers that the character set is the same and no resources are wasted?
The question becomes: If only the collation is changing, what is the recommended procedure to update it for existing tables (and their columns)?
I think the answer is:
For each table, do
ALTER TABLE <table_name> COLLATE utf8_spanish_ci;
For each column, do
ALTER TABLE <table_name> MODIFY <column_name> <column_type> COLLATE utf8_spanish_ci;
You can use the information_schema database to generate the SQL to run - for just the needed tables
SELECT concat('ALTER TABLE `', TABLE_SCHEMA, '`.`', table_name, '` COLLATE utf8_spanish_ci;')
from information_schema.tables
where TABLE_SCHEMA like 'database_name' and TABLE_COLLATION = 'utf8_general_ci';
and columns
SELECT concat('ALTER TABLE `', TABLE_SCHEMA, '`.`', table_name, '` MODIFY `', column_name, '` ', COLUMN_TYPE, ' COLLATE utf8_spanish_ci;')
from information_schema.columns
where TABLE_SCHEMA like 'database_name' and COLLATION_NAME = 'utf8_general_ci';
I have a utf8_general_ci database that I'm interested in converting to utf8_unicode_ci.
I've tried the following commands
ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; (for every single table)
But that seems to change the charset for future data but doesn't convert the actual existing data from utf8_general_ci to utf8_unicode_ci.
Is there any way to convert the existing data to utf8_unicode_ci?
SHOW CREATE TABLE to see if it really set the CHARACTER SET and COLLATION on the columns, not just the defaults.
What was the CHARACTER SET before the ALTERs?
Do SELECT col, HEX(col) ... for some field that should have utf8 in it. This will help us determine if you really have utf8 in the table. The encoding for characters is different based on CHARACTER SET; the HEX helps discover such.
The ordering (WHERE, ORDER BY, etc) is controlled by COLLATION. The indexes probably had to be rebuilt based on your ALTER TABLE. Did big tables with indexes take a 'long' time to convert?
To actually see the difference between utf8_general_ci and utf8_unicode_ci, you need a "combining accent" or, more simply, the German ß versus ss:
mysql> SELECT 'ß' = 'ss' COLLATE utf8_general_ci,
'ß' = 'ss' COLLATE utf8_unicode_ci;
+-------------------------------------+-------------------------------------+
| 'ß' = 'ss' COLLATE utf8_general_ci | 'ß' = 'ss' COLLATE utf8_unicode_ci |
+-------------------------------------+-------------------------------------+
| 0 | 1 |
+-------------------------------------+-------------------------------------+
However, to test that in your tables, you would need to store those values and use WHERE or GROUP_CONCAT or something else to determine the equality.
What 'proof' do you have that the ALTERs failed to achieve the collation change?
(Addressing other comments: REPAIR should be irrelevant. CONVERT TO tells the ALTER to actually modify the data, so it should have done the desired action.)
You have to change the collation of every field in every table. As you say, the collation of the table is only the default value for fields created later, and the collation of the database is only the default value for tables created later.
As Lorenz Meyer said, the collation of the table is only the default value for fields created later and you need to set the defaults for the columns explicitly too.
Such a change looks like:
ALTER TABLE mytable CHANGE mycolumn mycolumn varchar(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci