What is the difference between schema, table, and column CHARSET in MySQL?

What is the difference between the schema CHARSET, the table CHARSET, and the column CHARSET in MySQL?
When I change my table's charset to utf8, can I use utf8mb4 charset in my column?
Thanks.

Specifying a character set on database level is in fact defining the default character set for tables.
Doing the same for tables defines the default character set for columns.
Columns are the lowest level, so a character set specified on a column is the one actually used for everything you store in that column.
When you don't specify a character set on column level, the character set of the table is used. And if that is not specified the character set of the database is used.

When creating a table, the fallback for charset and collation is the schema's settings.
Once you have created the table, it now has a default charset and collation. (This is subtly different than what fancyPants said.)
Similarly, when creating a column (either as part of creating the table, or with ALTER .. ADD COLUMN), you can be explicit about charset and collation, or it can inherit from the defaults given for the table. Again, the column's definition is now frozen.
SHOW CREATE TABLE shows a column's charset and collation only where they override the table default, and otherwise leaves the inheritance implicit. SELECT .. FROM information_schema.columns .. makes it clearer that every column has a charset and collation.
That is, there is no "dynamic" inheritance at "run time". The inheritance is only when the table or column is created.
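To make the "frozen at creation" behaviour concrete, here is a small toy model in Python. It is only a sketch of the rule described above, not how MySQL is implemented; the `Table` class and `create_table` helper are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Table:
    """Toy stand-in for a table: a default charset plus per-column charsets."""
    default_charset: str
    columns: Dict[str, str] = field(default_factory=dict)

    def add_column(self, name: str, charset: Optional[str] = None) -> None:
        # The table default is copied once, when the column is created,
        # and from then on the column carries its own charset.
        self.columns[name] = charset or self.default_charset


def create_table(schema_default: str, charset: Optional[str] = None) -> Table:
    # A table created without an explicit charset copies the schema default.
    return Table(default_charset=charset or schema_default)


table = create_table(schema_default="latin1")
table.add_column("a")                     # inherits latin1 from the table
table.add_column("b", charset="utf8mb4")  # explicit override on the column

table.default_charset = "utf8mb4"         # like ALTER TABLE ... DEFAULT CHARSET
table.add_column("c")                     # only new columns see the new default

print(table.columns)
```

In this model, column `a` stays latin1 after the table default changes, and column `b` was utf8mb4 all along even though the table default was latin1, which is the situation the question asks about.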
Note that each charset has a default collation. And each collation belongs to a specific charset (see the first part of the collation name). So, specifying either the charset or collation implicitly specifies the other.

Related

Incorrect string value error for unconventional characters

So I'm using a wrapper to fetch user data from Instagram. I want to select the display names for users and store them in a MySQL database. I'm having issues inserting some of the display names, specifically an incorrect string value error.
Now, I've dealt with this issue before with accent marks, letters with umlauts, etc. The solution would be to change the collation to utf8_general_ci under the utf8 charset.
So as you can see, some of the display names I'm pulling have very unusual characters that I'm not sure MySQL can recognize at all, e.g.:
ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®
So I receive:
Error Code: 1366. Incorrect string value: '\xF0\x9D\x99\x87\xF0\x9D...' for column 'dummy' at row 1
Here's my sql code
CREATE TABLE test_table(
id INT AUTO_INCREMENT,
dummy VARCHAR(255),
PRIMARY KEY(id)
);
INSERT INTO test_table (dummy)
VALUES ('ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂®');
Any thoughts on a proper charset + collation pair that can handle characters like this? Not sure where to look for a solution, so I come here to see if anyone dealt with this.
P.S., I've tried utf8mb4 charset with utf8mb4_unicode_ci and utf8mb4_bin collations as well.
The characters you show require the column use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.
The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:
Note
Historically, MySQL has used utf8 as an alias for utf8mb3;
beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output
of SHOW statements and in Information Schema tables when this
character set is meant.
At some point in the future utf8 is expected to become a reference to
utf8mb4. To avoid ambiguity about the meaning of utf8, consider
specifying utf8mb4 explicitly for character set references instead of
utf8.
You should also be aware that the utf8mb3 character set is deprecated
and you should expect it to be removed in a future MySQL release.
Please use utf8mb4 instead.
You may have tried to change your table in the following way:
ALTER TABLE test_table CHARSET=utf8mb4;
But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:
ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;
Or to convert all string or TEXT columns in a table in one statement:
ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
That would be 𝙇: MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L
It requires the utf8mb4 Character set to even represent it. "F0" is the clue; it is the first of 4 bytes in a 4-byte UTF-8 character. It cannot be represented in MySQL's "utf8". Collation is (mostly) irrelevant.
Most, not all, of the characters in ᛘ𝕰𝖆𝖗𝖙𝖍 𝕾𝖕𝖎𝖗𝖎𝖙𝖚𝖘𐂂® also need utf8mb4. They are "MATHEMATICAL BOLD FRAKTUR" letters.
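The "F0" clue can be checked outside MySQL. The Python sketch below encodes the character from the error message (U+1D647, matching the bytes \xF0\x9D\x99\x87) and tests whether a string contains any code point above U+FFFF, which is the range that needs four UTF-8 bytes and therefore utf8mb4. The helper name `needs_utf8mb4` is made up for this illustration.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the Basic Multilingual Plane.

    Such characters take 4 bytes in UTF-8, which MySQL's 3-byte
    utf8 (utf8mb3) character set cannot store.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


sample = "\U0001D647"             # MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L
encoded = sample.encode("utf-8")

print(encoded)                    # the same bytes as in the error message
print(len(encoded))               # number of UTF-8 bytes for this character
print(needs_utf8mb4(sample))      # supplementary-plane character
print(needs_utf8mb4("\u00ae"))    # the registered sign fits in 2 bytes
```

A check like this could be run on incoming display names before insertion to see which rows would fail on a utf8mb3 column.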
(Meanwhile, Bill gives you more of an answer.)

Database conversion from latin1 to utf8mb4, what about indexes?

I noticed that my MODX database still uses latin1 character set in the database and in its tables. I would like to convert them to utf8mb4 and update collations accordingly.
Not totally sure how I should do this. Is this correct?
I alter every table to use utf8mb4 and utf8_unicode_ci?
I update the default character set and collation of the database.
Are indexes updated automatically? Is there something else I should be aware of?
A bonus question: what would be the most suitable latest utf8_unicode collation? Western languages should work.
Changing the default character set of a table or a schema does not change the data in the columns themselves; it only changes the default to apply the next time you add a table or add a column to a table.
To convert current data, alter one table at a time:
ALTER TABLE <name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
The collation utf8mb4_0900_ai_ci is faster than earlier collations (at least according to the documentation), and it's the most current and accurate. This collation requires MySQL 8.0.
The most current collation in MySQL 5.7 is utf8mb4_unicode_520_ci.
A table-conversion like this rebuilds the indexes, so there's nothing else you need to do.
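Since the conversion is done one table at a time, the statements can be generated rather than typed by hand. The following Python sketch only builds the SQL strings. The table names are hypothetical placeholders, and the default collation assumes MySQL 8.0 as noted above.

```python
from typing import Iterable, List


def quote_identifier(name: str) -> str:
    # MySQL quotes identifiers with backticks; an embedded backtick is doubled.
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(
    tables: Iterable[str],
    charset: str = "utf8mb4",
    collation: str = "utf8mb4_0900_ai_ci",  # assumes MySQL 8.0
) -> List[str]:
    """Build one ALTER TABLE ... CONVERT TO statement per table."""
    return [
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table in tables
    ]


# Hypothetical table names, for illustration only.
statements = conversion_statements(["site_content", "users"])
for statement in statements:
    print(statement)
```

On MySQL 5.7 the same helper could be called with `collation="utf8mb4_unicode_520_ci"`. It is worth reviewing the generated statements and taking a backup before running them.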

MySQL character encoding change. Is data integrity preserved?

I will have to convert the database encoding from latin-1 to utf-8.
I'm aware that converting the database is done with the command
ALTER DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
(Source.) Converting an existing table is done with the command
ALTER TABLE tbl_name
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]
Source.
However, the database already exists and there is sensitive information involved. My question is whether the data I already have will be changed. I am asking because I have to give an estimate before I make the change.
Every (character string-type) column has its own character set and collation metadata.
If, when the column's data type was specified (i.e. when it was last created or altered), no character set/collation was explicitly given, then the table's default character set and collation would be used for the column.
If, when the table was specified, no default character set/collation was explicitly given, then the database's default character set and collation would be used for the table's default.
The commands that you quote in your question merely alter such default character sets/collations for the database and table respectively. In other words, they will only affect tables and columns that are created thereafter; they will not affect existing columns (or data).
To update existing data, you should first read the Changing the Character Set section of the manual page on ALTER TABLE:
Changing the Character Set
To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
The statement also changes the collation of all character columns. If you specify no COLLATE clause to indicate which collation to use, the statement uses default collation for the character set. If this collation is inappropriate for the intended table use (for example, if it would change from a case-sensitive collation to a case-insensitive collation), specify a collation explicitly.
For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET changes the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length does not fit in a TEXT column's length bytes, so MySQL converts the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.
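The arithmetic in that paragraph can be reproduced with a few lines of Python. This is only a sketch of the sizing reasoning, not MySQL's actual code; it assumes the usual capacities of the four TEXT types (1, 2, 3, and 4 length bytes), and the function name is invented for the example.

```python
# Maximum byte length each TEXT type can record in its length bytes.
TEXT_TYPES = [
    ("TINYTEXT", 2**8 - 1),     # 1 length byte
    ("TEXT", 2**16 - 1),        # 2 length bytes: 65,535
    ("MEDIUMTEXT", 2**24 - 1),  # 3 length bytes
    ("LONGTEXT", 2**32 - 1),    # 4 length bytes
]


def smallest_text_type(max_chars: int, bytes_per_char: int):
    """Return the smallest TEXT type able to hold max_chars characters."""
    needed = max_chars * bytes_per_char
    for name, capacity in TEXT_TYPES:
        if needed <= capacity:
            return name, needed
    raise ValueError("value does not fit in any TEXT type")


print(smallest_text_type(65535, 1))  # latin1: one byte per character
print(smallest_text_type(65535, 3))  # 3-byte utf8: the case from the manual
print(smallest_text_type(65535, 4))  # utf8mb4: up to four bytes per character
```

The same reasoning applies to utf8mb4, where the worst case is 4 × 65,535 = 262,140 bytes, which also lands in MEDIUMTEXT.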
To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns. For example:
ALTER TABLE t MODIFY latin1_text_col TEXT CHARACTER SET utf8;
ALTER TABLE t MODIFY latin1_varchar_col VARCHAR(M) CHARACTER SET utf8;
If you specify CONVERT TO CHARACTER SET binary, the CHAR, VARCHAR, and TEXT columns are converted to their corresponding binary string types (BINARY, VARBINARY, BLOB). This means that the columns no longer will have a character set attribute and a subsequent CONVERT TO operation will not apply to them.
If charset_name is DEFAULT in a CONVERT TO CHARACTER SET operation, the character set named by the character_set_database system variable is used.
Warning
The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
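The difference between converting and merely relabelling the bytes can be simulated in Python. In this sketch the `latin-1` codec stands in for MySQL's latin1 (an assumption; the two are not byte-for-byte identical in every position), and the single character "é" stands in for a stored value.

```python
original = "é"

# The application wrote UTF-8 bytes into a column declared as latin1.
stored_bytes = original.encode("utf-8")

# A charset conversion trusts the declared charset: it reads the bytes
# as latin1 and re-encodes the resulting characters as UTF-8.
misread = stored_bytes.decode("latin-1")
converted = misread.encode("utf-8")

# The BLOB route leaves the bytes untouched and only changes the label,
# so they are finally read as the UTF-8 they always were.
relabelled = stored_bytes.decode("utf-8")

print(stored_bytes)  # two bytes for one character
print(misread)       # two latin1 characters instead of one
print(converted)     # four bytes: the value is now double-encoded
print(relabelled)    # the original character is recovered
```

The conversion path turns each original character into garbled text ("mojibake"), while the BLOB path preserves it, which is why the manual recommends the two-step CHANGE for mislabelled columns.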
To change only the default character set for a table, use this statement:
ALTER TABLE tbl_name DEFAULT CHARACTER SET charset_name;
The word DEFAULT is optional. The default character set is the character set that is used if you do not specify the character set for columns that you add to a table later (for example, with ALTER TABLE ... ADD column).
When the foreign_key_checks system variable is enabled, which is the default setting, character set conversion is not permitted on tables that include a character string column used in a foreign key constraint. The workaround is to disable foreign_key_checks before performing the character set conversion. You must perform the conversion on both tables involved in the foreign key constraint before re-enabling foreign_key_checks. If you re-enable foreign_key_checks after converting only one of the tables, an ON DELETE CASCADE or ON UPDATE CASCADE operation could corrupt data in the referencing table due to implicit conversion that occurs during these operations (Bug #45290, Bug #74816).

Can mysql charset for table and column be different?

Does it make sense to have different charsets for a table and for a single column in that table? Or will it cause problems, especially for the example below?
For example,
Table charset - latin1
Column C1 charset - utf8mb4
Tables don't have a charset anyway; the only thing they have is a default charset. The only things that have an actual "physical" charset are columns, because they're the only things that actually store data. The way it works is that if you don't set an explicit charset for a column, the table's default is used. And if the table doesn't have a default, the database's default is used. And if that doesn't have a default, the server's default is used.

Purpose of the COLLATE clause in CREATE TABLE SQL

I understand (a little) what the COLLATE clause does. Admittedly, I have not tested whether it is possible to have tables with different collations (or even different charsets) inside one database.
But I found that (at least in phpMyAdmin) when I create a database, I set its charset and collation, and if I omit COLLATE in CREATE TABLE ..., the collation set when the database was created is applied automatically.
So, my question is: what is the point of COLLATE in CREATE TABLE ... if it can be omitted? Is it recommended to include COLLATE in CREATE TABLE ..., or is it irrelevant?
In SQL Server, if you don't specify COLLATE, it defaults to whatever the database is set to. Thus there is no danger in not specifying it.
In MySQL behavior is the same:
The table character set and collation are used as default values for
column definitions if the column character set and collation are not
specified in individual column definitions. MySQL Reference
COLLATE is only needed when you want to specify a non-default value. If all you are storing is English text, then you have nothing to worry about. If you store data from multiple languages, then you have to specify a suitable character set and collation to ensure that the characters are stored and compared correctly.