Collation change utf8mb4_unicode_ci to utf8mb4_general_ci - mysql

For my databases, I used utf8mb4_unicode_ci with utf8mb4 character set as a default. This was a mistake and the folks who are using the databases I created are complaining about the collation. I need to convert it to utf8mb4_general_ci. Am I able to get away with just changing the DB using an alter statement such as:
ALTER DATABASE `#{database}` CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Or will I need to change each individual table and deal with columns, even though the charset is the same between the two collations? I can't seem to find definitive answers on this. I'm using MySQL 5.7.2x.

utf8mb4_unicode_520_ci is better than either of the collations you mentioned.
Why are they complaining? Perhaps JOINs are failing to use indexes. I would argue with them that the old tables should be changed.
ALTER DATABASE only sets up defaults for future tables. It will not do what you need.
ALTER TABLE ... CONVERT TO ... for each table is needed. See http://mysql.rjweb.org/doc.php/limits#767_limit_in_innodb_indexes for a similar ALTER. It provides a way to automatically generate all the ALTERs.
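As a rough sketch of that kind of generator (not necessarily the one on the linked page), a query against information_schema.TABLES can emit one ALTER per table. The schema name `mydb` is a placeholder assumed for illustration, and the target collation is the utf8mb4_general_ci asked about in the question.

```sql
-- Assumption: 'mydb' is a hypothetical schema name; replace it with your own.
-- Produces one ALTER TABLE ... CONVERT TO statement per base table whose
-- default collation is not already the target.
SELECT CONCAT(
         'ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
         '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;'
       ) AS alter_stmt
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_TYPE = 'BASE TABLE'
  AND TABLE_COLLATION <> 'utf8mb4_general_ci';
```

The filter looks only at each table's default collation, so a table whose default already matches but whose columns differ would be skipped. Review the generated statements before running them, and take a backup first.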

Related

Database conversion from latin1 to utf8mb4, what about indexes?

I noticed that my MODX database still uses latin1 character set in the database and in its tables. I would like to convert them to utf8mb4 and update collations accordingly.
Not totally sure how I should do this. Is this correct?
I alter every table to use utf8mb4 and utf8mb4_unicode_ci?
I update the default character set and collation of the database.
Are indexes updated automatically? Is there something else I should be aware of?
A bonus question: what would be the most suitable latest utf8_unicode collation? Western languages should work.
Changing the default character sets of a table or a schema does not change the data in the column itself, it only changes the default to apply the next time you add a table or add a column to a table.
To convert current data, alter one table at a time:
ALTER TABLE <name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
The collation utf8mb4_0900_ai_ci is faster than earlier collations (at least according to the documentation), and it's the most current and accurate. This collation requires MySQL 8.0.
The most current collation in MySQL 5.7 is utf8mb4_unicode_520_ci.
A table-conversion like this rebuilds the indexes, so there's nothing else you need to do.
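To make the difference between changing a default and converting data concrete, here is a small throwaway sketch. The table name `t_demo` is hypothetical, and the collation used is one that is available in MySQL 5.7.

```sql
-- Hypothetical scratch table, created as latin1 on purpose.
CREATE TABLE t_demo (c VARCHAR(20)) DEFAULT CHARACTER SET latin1;

-- Changes only the table default; existing columns are left alone.
ALTER TABLE t_demo DEFAULT CHARACTER SET utf8mb4;
SHOW FULL COLUMNS FROM t_demo;   -- c should still report a latin1 collation

-- Converts the existing column data and rebuilds the table and its indexes.
ALTER TABLE t_demo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
SHOW FULL COLUMNS FROM t_demo;   -- c should now report utf8mb4_unicode_520_ci

DROP TABLE t_demo;
```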

tables sum collation is latin1_swedish_ci, but all individual tables show utf8_general_ci

This MySQL db was set up in GoDaddy by installing WordPress. Things on the site were acting glitchy, and swapping out the theme and deactivating plugins didn't help, so I decided to take a look at the tables using phpMyAdmin.
I've never seen this before - all tables use utf8_general_ci (there are 13 tables), but at the bottom is the summary of everything and the collation shows latin1_swedish_ci, and not the utf8...
Seems like this shouldn't be. What can I do to make it all uniform, using the utf8... and not the latin1_swedish?
To help explain, SHOW CREATE TABLE gives you something like this:
CREATE TABLE `h2u` (
`c` varchar(9) CHARACTER SET utf8 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This says that the column c is utf8 (and some utf8 collation, probably utf8_general_ci), but the default for any other columns you might add is latin1 (and some latin1 collation, probably latin1_swedish_ci).
The CHARACTER SET and COLLATION on the column are what matter.
(It is sad that 3rd party software, while trying to be helpful, sometimes obscures useful info.)
Edit
I would guess from the picture that the default for the database is latin1 / latin1_swedish_ci and the default for each table is utf8 / utf8_general_ci. But the important thing is what the setting is for each column; the image does not show that.
The character set and collation for each column is important if you are using anything other than English text. Are you?
Yes, you have observed something strange. I am trying to say that it is probably not important, and very unlikely to cause any glitches other than with non-English text.
The following would have created output similar to the image:
CREATE DATABASE wp DEFAULT CHARACTER SET latin1;
CREATE TABLE wp_users (...) DEFAULT CHARACTER SET utf8; -- overriding the 'latin1'
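Since the screenshot summary hides the per-column settings, one way to see all three levels side by side is to query information_schema directly. This is a sketch that assumes the schema is named `wp`, as in the example above; substitute the real database name.

```sql
-- Database-level default (likely what the phpMyAdmin summary row reflects).
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'wp';

-- Table-level defaults.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'wp';

-- Column-level settings, which are the ones that actually govern stored text.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'wp'
  AND COLLATION_NAME IS NOT NULL;
```

If only the database default differs, changing it with ALTER DATABASE affects future tables only and leaves existing data untouched.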

Is it possible to enable emojis for specific columns of specific tables?

First, I would like to assure you that I have done my "homework" and have read several related posts. Also, one of my former questions is closely related to this one, but in that question I was dealing with flourishlib's compatibility issues with utf8mb4. This question goes a level deeper.
Let's suppose that I have several tables and I want to modify just a few columns to use the utf8mb4 encoding, in order to limit the cost in storage space and performance. If I changed the whole database to utf8mb4, its size would increase by 33%, which would hurt its performance as well. So, we have selected four columns from three different tables to support emojis. These are:
users.bio (tinytext, utf8_general_ci)
questions.question (longtext, utf8_general_ci)
questions.answer (longtext, utf8_general_ci)
comments.comment (tinytext, utf8_general_ci)
As a result, my action plan is as follows:
Create a backup of the database
Run these commands:
alter table comments change comment comment tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table users change bio bio tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change question question longtext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change answer answer longtext character set utf8mb4 collate utf8mb4_unicode_ci;
Expectations:
this should make the specified columns use utf8mb4 instead of utf8
existent data will be correctly converted to utf8mb4, that is, previous texts will be preserved and the users will be able to correctly read their content
other columns will not be changed
queries involving the affected tables will be slower
Are my expectations accurate? Do I need to change the connection? Thanks
You need utf8mb4 in any columns that are storing Chinese.
In VARCHAR(...) utf8mb4, each "character" takes 1-4 bytes. No 33% increase.
On the other hand, CHAR(10) utf8mb4 is always allocated 40 bytes.
You do need to establish that your client is talking utf8mb4, not just utf8. That comes in some parameter in the connection or with SET NAMES utf8mb4.
If you need to automate the ALTERs, it is pretty easy to generate them via a SELECT against information_schema.
Addenda
Expectations 1-3: Yes.
Expectation 4 (queries involving the affected tables will be slower) -- processing will be essentially the same speed.
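The connection check mentioned above can be sketched as follows. Many client libraries also accept the character set as a connection parameter, which is usually preferable to issuing SET NAMES by hand; the exact parameter name depends on the library and is not shown here.

```sql
-- Tell the server that this session sends and expects utf8mb4.
SET NAMES utf8mb4;

-- After SET NAMES, character_set_client, character_set_connection and
-- character_set_results should all read utf8mb4.
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_connection';
```

If any of those three variables still shows utf8 or latin1, 4-byte characters such as emoji can be mangled or rejected on the way in, even when the column itself is utf8mb4.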

Purpose of the COLLATE clause in CREATE TABLE SQL

I understand (a little) what the COLLATE clause does. Admittedly, I have not tested whether it is possible to have tables with different collations (or even different character sets) inside one database.
But I found that (at least in phpMyAdmin) when I create a database, I set its charset and collation, and if I omit COLLATE in CREATE TABLE ..., the table automatically gets the collation that was set when the database was created.
So, my question is: what is the point of the COLLATE clause in CREATE TABLE ... if it can be omitted? Is it recommended to include COLLATE in CREATE TABLE ..., or is it irrelevant?
In SQL Server, if you don't specify COLLATE, it defaults to whatever the database is set to. Thus there is no danger in not specifying it.
In MySQL the behavior is the same:
The table character set and collation are used as default values for
column definitions if the column character set and collation are not
specified in individual column definitions. (MySQL Reference)
COLLATE is only needed when you want to specify a non-default value. If all you are storing is English text, then you have nothing to worry about. If you store data from multiple languages, then you have to specify a suitable character set and collation to ensure that characters are stored and compared correctly.
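A minimal sketch of the inheritance described above, using a hypothetical table `collate_demo`: one column takes the table default, the other overrides it with an explicit COLLATE.

```sql
-- Hypothetical table: the table-level clause sets the default,
-- and name_bin overrides it for a single column.
CREATE TABLE collate_demo (
  name_ci  VARCHAR(50),                      -- inherits utf8mb4_unicode_ci
  name_bin VARCHAR(50) COLLATE utf8mb4_bin   -- explicit, case-sensitive comparison
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- The Collation column of the output shows which collation each column got.
SHOW FULL COLUMNS FROM collate_demo;
```

Leaving COLLATE out of CREATE TABLE entirely would make both the table and its columns fall back to the database default, which is the behavior observed in the question.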

How to change the default collation of a table?

create table check2(f1 varchar(20),f2 varchar(20));
creates a table with the default collation latin1_general_ci;
alter table check2 collate latin1_general_cs;
show full columns from check2;
shows the individual collation of the columns as 'latin1_general_ci'.
Then what is the effect of the alter table command?
To change the default character set and collation of a table including those of existing columns (note the convert to clause):
alter table <some_table> convert to character set utf8mb4 collate utf8mb4_unicode_ci;
Edited the answer, thanks to the prompting of some comments:
Should avoid recommending utf8. It's almost never what you want, and often leads to unexpected messes. The utf8 character set is not fully compatible with UTF-8. The utf8mb4 character set is what you want if you want UTF-8. – Rich Remer Mar 28 '18 at 23:41
and
That seems quite important, glad I read the comments, and thanks @RichRemer. Nikki, I think you should edit that into your answer considering how many views this gets. See here https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html and here What is the difference between utf8mb4 and utf8 charsets in MySQL? – Paulpro Mar 12 at 17:46
MySQL has 4 levels of collation: server, database, table, column.
If you change the collation of the server, database or table, you don't change the setting for each column, but you change the default collations.
E.g., if you change the default collation of a database, each new table you create in that database will use that collation, and if you change the default collation of a table, each new column you create in that table will get that collation.
It sets the default collation for the table; if you create a new column, that should be collated with latin1_general_cs, I think. Existing columns keep the collation they already have. Try specifying the collation for the individual column and see if that works. MySQL has some really bizarre behavior with regard to the way it handles this.
You may need to change the schema default as well, not only the table:
ALTER SCHEMA `<database name>` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci ;
As Rich said, use utf8mb4.
(MariaDB 10)