How to find out mysql field level charset? - mysql

I need to convert latin1 charset of a table to utf8.
Quoting from mysql docs:
The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8mb4). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8mb4;
This answer shows how to find out charset at DB level, table level, and column level. But I need to find out the charset of the actual stored values. How can I do that?
Since my connector/j jdbc connection string doesn't specify any characterEncoding or connectionCollation properties, it is possible that it used utf8 by default to store the values, in which case I don't need any conversion, just change the table metadata.
mysql-connector-java version: 8.0.22
mysql database version: 5.6
spring boot version: 2.5.x

The character set of the string in a given column should be the same as the column definition.
There have been cases where people accidentally store the bytes of the wrong encoding in a column. For example, they store bytes of a latin1 encoding in a utf8 field. This is a terrible idea, because queries can't tell the difference. Those bytes may not be valid values of the column's defined encoding, and this results in garbage data. Cleaning up a table where some of the strings are stored in the wrong encoding is an unpleasant chore.
So I strongly urge you to store only strings encoded in a compatible way according to the column's definition, and to assume that all strings are stored this way.

To answer the title:
SHOW CREATE TABLE tablename shows the default charset for the table and any overrides for individual columns.
Don't blindly use CONVERT TO, and especially not the 2-step ALTER you are showing. First look at what is in the table now: run SELECT col, HEX(col) ... on a row that contains accented text.
See Trouble with UTF-8 characters; what I see is not what I stored for the main 4 types of problems.
This gives several cases and how to fix them. http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
One case involves using CONVERT TO; two other cases involve using BLOB or VARBINARY.
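To make the HEX() check concrete, here is a minimal Python sketch that guesses how a hex string copied from SELECT HEX(col) is encoded. The function name classify_hex and the four labels are hypothetical, and Python's cp1252 codec is used as an approximation of MySQL's latin1. Treat the result as a hint, not a diagnosis.

```python
def classify_hex(hex_value: str) -> str:
    """Guess the encoding of bytes copied from SELECT HEX(col)."""
    raw = bytes.fromhex(hex_value)
    if all(b < 0x80 for b in raw):
        # Pure ASCII looks the same in latin1 and utf8, so it proves nothing.
        return "ascii-only"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: probably a single-byte encoding such as latin1.
        return "not-utf8"
    try:
        # If the decoded text maps back to bytes that are again valid UTF-8,
        # the value was most likely encoded twice.
        text.encode("cp1252").decode("utf-8")
        return "double-encoded-utf8"
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"


print(classify_hex("E9"))        # 'é' stored as latin1
print(classify_hex("C3A9"))      # 'é' stored as utf8
print(classify_hex("C383C2A9"))  # 'é' encoded to utf8 twice
```

Pick a row with accented text for the check; an ASCII-only value cannot distinguish the cases.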

Related

Migrating from SQL server to MySQL using pentaho unicode issue

I have a problem migrating data from SQL server to MySQL. I have nvarchar columns in SQL server and am exporting them to a Unicode textfile. But when I am importing the column into an utf-8 table of MySQL I get an error for duplicate value: Mysql sees no difference between 'Kaneko, Shûsuke' and 'Kaneko, Shusuke'. I am trying to get these values into a unique column.
What's wrong?
Must I use another charset in MySQL?
I also tried converting the textfile to utf8 before importing into MySQL, but still getting the same error.
The problem seems to be in how your MySQL table was created. First run SHOW CREATE TABLE at the mysql prompt and look at the table structure. Have you used the right charset and collation? You can read about them in the MySQL docs.
Many collations are not only case insensitive but also partly accent insensitive, so á = a. (I originally wrote that ñ = n as well; as Joni Salonen points out, that is incorrect.)
So we can use a binary collation, but it has its own drawback. A binary collation compares your strings exactly as strcmp() in C would: two strings are different if any character differs, even if only in case or diacritics. The downside is that the sort order is not natural.
An example of an unnatural (binary) sort order: A, B, a, b. A natural sort order would be, for example: A, a, B, b (lowercase and uppercase variants of the same letter are sorted next to each other).
The practical advantage of a binary collation is its speed, as string comparison is very simple and fast. In the general case, indexes with a binary collation might not produce the expected sort order, but they can be useful for exact matches.
Use a binary collation for the specific column (possibly your best bet).
For example:
DROP TABLE IF EXISTS cc;
CREATE TABLE cc ( c CHAR(100) PRIMARY KEY ) DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
-- With utf8_bin the two names compare as different, so both inserts succeed.
INSERT INTO cc VALUES ( 'Kaneko, Shûsuke' );
INSERT INTO cc VALUES ( 'Kaneko, Shusuke' );
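The difference between an accent-insensitive and a binary comparison can be sketched in Python. This is only an approximation: the helper fold is hypothetical and strips accents via Unicode decomposition, which is not how MySQL implements its collations.

```python
import unicodedata


def fold(text: str) -> str:
    """Rough stand-in for an accent- and case-insensitive collation key."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


a, b = "Kaneko, Sh\u00fbsuke", "Kaneko, Shusuke"

# Accent-insensitive comparison: the two names collide (duplicate key).
same_when_folded = fold(a) == fold(b)
# Binary comparison: the underlying bytes differ, so both rows can coexist.
same_bytes = a.encode("utf-8") == b.encode("utf-8")

letters = ["b", "A", "a", "B"]
binary_order = sorted(letters)  # plain code point order
natural_order = sorted(letters, key=lambda s: (s.casefold(), s))

print(same_when_folded, same_bytes)
print(binary_order, natural_order)
```

The two sort results correspond to the A, B, a, b versus A, a, B, b orders described above.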

What should be the correct MySql collation store in this case?

I'm storing strings on a Mysql database.
Some of the strings have single quotes which then get stored like this:
People’s
Is this the proper way to store these strings or should I set a different mysql collation?
I have tried the following without luck....
utf8_general_ci
latin1_swedish_ci
Where are you setting the collation? You should be using UTF-8 in three places:
as the collation (and therefore character set) of each column that contains character data. You can set the default collation for the table or database so that new columns pick it up, but if you already have a table, ALTERing its default collation doesn't change the collation of the existing columns.
as the encoding of the connection between your application and MySQL. This can be set manually using the SET NAMES statement, or, better, with the suitable API call for your environment (for example mysql_set_charset() in PHP, or the charset argument to connect() in Python MySQLdb).
in your output. For example if producing a web page, by using the Content-Type: text/html;charset=utf-8 header/meta.
You can store the string "People’s" as UTF-8-hidden-in-Latin-1 "People’s" by using Latin-1 throughout, since you'll still get the same bytes out as you put in. But that way you won't get sensible results from ordering or case-insensitive comparisons of non-ASCII characters.
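The garbled form can be reproduced outside MySQL. The following sketch assumes that Python's cp1252 codec is a close enough match for MySQL's latin1; the variable names are only for illustration.

```python
original = "People\u2019s"  # right single quotation mark, U+2019

# The application sends UTF-8 bytes...
utf8_bytes = original.encode("utf-8")
# ...but something along the way interprets them as Latin-1 (cp1252).
misread = utf8_bytes.decode("cp1252")

# Reversing the same misinterpretation gives the original string back,
# which is why "Latin-1 throughout" still returns the bytes you put in.
round_trip = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex())
print(misread)
print(round_trip == original)
```

The three bytes of the curly apostrophe (e2 80 99) show up as three separate Latin-1 characters, which is the pattern in the question.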

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and documents on the Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want mysql to convert any data, you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8
This is the recommended way, and it is explicitly documented in the ALTER TABLE syntax documentation.
The values use a different encoding
The default encoding was Latin1 for several years on some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table meta-data, using the BLOB type trick
Convert the values using CONVERT TO.
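Why the first two cases need different fixes can be modelled on raw bytes in Python. This is a simplified sketch, not MySQL's implementation: convert_to_utf8 and relabel_as_utf8 are hypothetical helpers standing in for CONVERT TO and for the BLOB round trip.

```python
def convert_to_utf8(stored: bytes, declared: str) -> bytes:
    """Model of CONVERT TO: decode with the declared charset, re-encode."""
    return stored.decode(declared).encode("utf-8")


def relabel_as_utf8(stored: bytes) -> bytes:
    """Model of the BLOB round trip: bytes untouched, only the label changes."""
    return stored


word = "caf\u00e9"
case1 = word.encode("latin-1")  # column says latin1, bytes really are latin1
case2 = word.encode("utf-8")    # column says latin1, bytes are already utf8

# Case 1: converting is correct.
right1 = convert_to_utf8(case1, "latin-1").decode("utf-8")
# Case 2: converting encodes the text a second time and garbles it...
wrong2 = convert_to_utf8(case2, "latin-1").decode("utf-8")
# ...while relabelling leaves the already-correct bytes alone.
right2 = relabel_as_utf8(case2).decode("utf-8")

print(right1, wrong2, right2)
```

Applying the wrong technique to a column is what produces double-encoded text, so check the stored bytes before choosing.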
A straightforward conversion will potentially break any strings that contain non-ASCII (non-7-bit) characters.
If you don't have any of those (i.e. all of your text is English), you'll usually be fine.
If you do have any of those, however, you need to convert all char/varchar/text fields to blob in an initial run, and then convert them to utf8 in a subsequent run.
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding swedish to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and leaves the table unchanged. Therefore, I'd deem the ALTER TABLE statement to be working.

What will happen to existing data if I change the collation of a column in MySQL?

I am running a production application with a MySQL database server. I forgot to set a column's collation from latin to utf8_unicode, which results in strange data when saving multi-language data to the column.
My question is, what will happen with my existing data if I change my collation to utf8_unicode now? Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
I will change with phpMyAdmin web client.
The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.
Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.
A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....
A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two German sort orders latin1_german1_ci or latin1_german2_ci. Depending on your choice, umlauts such as ö will sort either as o or as oe.
When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then move the old table into a hidden location, move the hidden table into place, and drop the old data, freeing up disk space. For some time in between, you will need twice the storage.
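The effect of the two German sort orders can be imitated with two sort keys in Python. This is a rough sketch under the assumption that german1 treats ö as o and german2 treats ö as oe; the key functions and the word list are made up for illustration and do not reproduce the full MySQL collations.

```python
# Hypothetical translation tables imitating the two German sort orders.
GERMAN1 = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "s"})
GERMAN2 = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def german1_key(word: str) -> str:
    """Dictionary-style order: an umlaut sorts like its base vowel."""
    return word.lower().translate(GERMAN1)


def german2_key(word: str) -> str:
    """Phone-book-style order: an umlaut sorts like vowel + e."""
    return word.lower().translate(GERMAN2)


words = ["Ofen", "Öl", "Oboe", "Odem"]
as_german1 = sorted(words, key=german1_key)
as_german2 = sorted(words, key=german2_key)

print(as_german1)
print(as_german2)
```

The stored bytes of the words are identical in both runs; only the ordering rule changes, which is the distinction between a character set and a collation.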
When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.
When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.
MySQL tries to preserve data doing this: As long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if there is data truncation going on, and data which cannot be represented in the target character set will be replaced by ?
Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin I inserted some non-latin chars
INSERT INTO Test VALUES ('英國華僑');
I select them and get rubbish (as expected).
SELECT text from Test;
gives
text
????
I then changed the collation of the column to utf8_unicode and re-ran the SELECT and it shows the same result
text
????
This is what I would expect. The column keeps the data, and the data remains rubbish: when it was inserted, the column could not represent the non-latin characters and stored a ? for each of them, so there is no way for ???? to become 英國華僑 again.
Your data will stay in place but it won't be fixed.
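The loss can be reproduced without a database. The sketch below uses Python's errors="replace" as a stand-in for the substitution MySQL performed in this test; it is an analogy for the behaviour shown above, not MySQL's code path.

```python
text = "英國華僑"

# None of these characters exist in latin1, so each one becomes '?'.
stored = text.encode("latin-1", errors="replace")

# Whatever charset the column is later relabelled to, only '?' bytes remain.
recovered = stored.decode("latin-1")

print(stored.hex())
print(recovered)
```

Every character was replaced by the byte 0x3F, so seeing 3F in HEX() output where a non-ASCII character should be means the original text is gone.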
Valid data will be properly converted:
When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.
http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
... and more specifically:
To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion to occur, one of the following conditions must apply: [...] If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
So your problem is invalid data, e.g., data encoded in a different character set. I've tried the tip suggested by the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte chars had been inserted as 0x3F (i.e., the ? symbol in Latin1). My MySQL stack had been smart enough to detect that input data was not Latin1 and convert it into something "compatible". And once data is gone, you can't get it back.
To sum up:
Use HEX() to find out if you still have your data.
Make your tests in a copy of your table.
My question is, what will happen with my existing data if I change my collation to utf8_unicode now?
Answer: If you change to utf8_unicode_ci, nothing will happen to your existing data (which is already corrupt and will remain corrupt until you modify it).
Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
Answer: After you change to utf8_unicode_ci, existing data will not be destroyed. It will remain the same as before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.
I will change with phpMyAdmin web client.
Answer: Sure, you can change collation with phpMyAdmin by going to Operations > Table options
CAUTION! Some problems are solved via
ALTER TABLE ... CONVERT TO ...
Some are solved via a 2-step process
ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...
If you do the wrong one, you will have a worse mess!
Do SELECT HEX(col), col ... to see what you really have.
Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" besides the encoding. Encoding determines how the characters map to bytes; collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive"); then your two strings would match.
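A short Python sketch shows why the two spellings stop matching once the UTF-8 bytes are read as Latin-1. It assumes cp1252 as a stand-in for MySQL's latin1 and str.casefold() as a stand-in for a case-insensitive collation; misread_as_cp1252 is a hypothetical helper.

```python
def misread_as_cp1252(text: str) -> str:
    """Simulate UTF-8 bytes being interpreted as Latin-1 (cp1252)."""
    return text.encode("utf-8").decode("cp1252")


lower, upper = "Acc\u00eedent", "Acc\u00cedent"  # î versus Î

# With the right encoding, a case-insensitive comparison matches.
clean_match = lower.casefold() == upper.casefold()

# After the misread, î and Î each turn into two unrelated characters.
garbled_lower = misread_as_cp1252(lower)
garbled_upper = misread_as_cp1252(upper)
garbled_match = garbled_lower.casefold() == garbled_upper.casefold()

print(garbled_lower, garbled_upper)
print(clean_match, garbled_match)
```

The second byte of î becomes ® and the second byte of Î becomes Ž, and those two are not case variants of each other, so no case-insensitive rule can treat the garbled strings as equal.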