What will happen to existing data if I change the collation of a column in MySQL?

I am running a production application against a MySQL database server. I forgot to change a column's collation from latin1 to utf8_unicode_ci, which results in garbled data when multi-language text is saved to that column.
My question is: what will happen to my existing data if I change the collation to utf8_unicode_ci now? Will it destroy or corrupt the existing data, or will the old data remain while new data is saved as UTF-8, as it should be?
I will make the change with the phpMyAdmin web client.

The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.
Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.
A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....
A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two german sort orders latin1_german1_ci or latin1_german2_ci. Depending on your choice, Umlauts such as ö will either sort as oe or as o.
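The character-set-decides-the-bytes point can be sketched outside MySQL entirely; this plain-Python snippet (an illustration, not MySQL itself) mirrors what SELECT HEX(str) would show for the same character stored under two different character sets:

```python
# The character set decides the bytes on disk; the collation never does.
# This mirrors MySQL's HEX() output for an umlaut stored two ways.
s = "ö"

print(s.encode("latin1").hex())  # f6    -- one byte under latin1
print(s.encode("utf-8").hex())   # c3b6  -- two bytes under utf8
```

Either encoding can hold this particular character; problems only start when the text falls outside the target set, e.g. CJK text in a latin1 column.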
When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then move the old table into a hidden location, move the hidden table into place, and drop the old data, freeing up disk space. For some time in between, you will need twice the storage for that table.
When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.
When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.
MySQL tries to preserve data when doing this: as long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if data truncation occurs, and data which cannot be represented in the target character set will be replaced by ?.
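What "replaced by ?" means can be sketched in plain Python (no MySQL involved): encoding text into a character set that cannot represent it substitutes one ? per character, and nothing of the original survives.

```python
# Converting CJK text into latin1, which cannot represent it, replaces
# each character with '?', just like a lossy MySQL conversion or insert.
s = "英國華僑"
lossy = s.encode("latin1", errors="replace")

print(lossy)                   # b'????'
print(lossy.decode("latin1"))  # ???? -- the original text is gone for good
```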

Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin, I inserted some non-Latin characters:
INSERT INTO Test VALUES ('英國華僑');
I selected them and got rubbish (as expected).
SELECT text from Test;
gives
text
????
I then changed the collation of the column to utf8_unicode_ci, re-ran the SELECT, and it showed the same result:
text
????
This is what I would expect: the data will be kept, but it will remain rubbish. When the data was inserted, the column discarded the extra character information and stored a ? for each non-Latin character, and there is no way for ???? to become 英國華僑 again.
Your data will stay in place, but it won't be fixed.

Valid data will be properly converted. Quoting the MySQL manual:
When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.
http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
... and more specifically:
To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion to occur, one of the following conditions must apply: [...] If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
So your problem is invalid data, i.e., data encoded in a different character set. I tried the tip suggested by the documentation, and it basically ruined my data, but only because my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte characters had been inserted as 0x3F (i.e., the ? symbol in latin1). My MySQL stack had been smart enough to detect that the input data was not latin1 and converted it into something "compatible". And once the data is gone, you can't get it back.
To sum up:
Use HEX() to find out if you still have your data.
Make your tests in a copy of your table.
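The HEX() check in the summary above can be sketched in plain Python, using a hypothetical hex dump as input: if the positions where multibyte characters should be contain only 0x3F bytes, the data was destroyed at insert time.

```python
# Interpreting a hypothetical HEX(column) dump: 0x3F is the byte for '?',
# so a run of 3f where non-Latin text should be means the data is lost.
hex_dump = "3f3f3f3f"                 # hypothetical HEX() output
raw = bytes.fromhex(hex_dump)

if raw == b"?" * len(raw):
    print("data was lost at insert time")
else:
    print(raw.decode("utf-8"))        # bytes still carry the original text
```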

My question is, what will happen with my existing data if I change my collation to utf8_unicode now?
Answer: If you change to utf8_unicode_ci, nothing will happen to your existing data (it is already corrupt and will remain corrupt until you fix it).
Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
Answer: After you change to utf8_unicode_ci, the existing data will not be destroyed. It will remain the same as before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.
I will change with phpMyAdmin web client.
Answer: Sure, you can change the collation with phpMyAdmin by going to Operations > Table options.

CAUTION! Some problems are solved via
ALTER TABLE ... CONVERT TO ...
Some are solved via a 2-step process
ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...
If you do the wrong one, you will have a worse mess!
Do SELECT HEX(col), col ... to see what you really have.
Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
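The case the 2-step VARBINARY fix addresses can be sketched in plain Python, assuming (hypothetically) that utf8 bytes ended up in a latin1 column: the bytes on disk are fine, only the label is wrong, so the fix relabels the bytes instead of re-encoding the characters.

```python
# What SELECT shows when utf8 bytes sit in a latin1 column: mojibake.
stored = "Ã©"                   # the column displays this instead of 'é'
raw = stored.encode("latin1")   # step 1: recover the raw bytes (0xC3 0xA9)
fixed = raw.decode("utf-8")     # step 2: reinterpret those bytes as utf8

print(fixed)  # é
```

CONVERT TO would instead re-encode the characters as displayed, baking the mojibake in permanently; that is why picking the wrong fix makes a worse mess.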

Related

How to find out mysql field level charset?

I need to convert latin1 charset of a table to utf8.
Quoting from mysql docs:
The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8mb4). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8mb4;
This answer shows how to find out charset at DB level, table level, and column level. But I need to find out the charset of the actual stored values. How can I do that?
Since my connector/j jdbc connection string doesn't specify any characterEncoding or connectionCollation properties, it is possible that it used utf8 by default to store the values, in which case I don't need any conversion, just change the table metadata.
mysql-connector-java version: 8.0.22
mysql database version: 5.6
spring boot version: 2.5.x
The character set of the string in a given column should be the same as the column definition.
There have been cases where people accidentally store the bytes of the wrong encoding in a column. For example, they store bytes of a latin1 encoding in a utf8 field. This is a terrible idea, because queries can't tell the difference. Those bytes may not be valid values of the column's defined encoding, and this results in garbage data. Cleaning up a table where some of the strings are stored in the wrong encoding is an unpleasant chore.
So I strongly urge you to store only strings encoded in a compatible way according to the column's definition, and to assume that all strings are stored this way.
To answer the title:
SHOW CREATE TABLE tablename shows the default charset for the table and any overrides for individual columns.
Don't blindly use CONVERT TO, and especially not the 2-step ALTER you are showing. First, let's see what is in the table now (SELECT col, HEX(col) ...) for something with accented text.
See Trouble with UTF-8 characters; what I see is not what I stored for the main 4 types of problems.
This gives several cases and how to fix them. http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
One case involves using CONVERT TO; two other cases involve using BLOB or VARBINARY.
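A rough way to classify the bytes actually stored (e.g. taken from HEX(col)) can be sketched in plain Python. This is a heuristic, not a MySQL feature: bytes that pass a strict UTF-8 decode are almost certainly utf8, while any byte sequence at all "decodes" as latin1, so latin1 can only ever be a fallback guess.

```python
# Heuristic: real utf8 data passes a strict UTF-8 decode, while every
# possible byte string decodes as latin1, so latin1 proves nothing.
def guess_charset(raw: bytes) -> str:
    try:
        raw.decode("utf-8")
        return "utf8 (or plain ASCII)"
    except UnicodeDecodeError:
        return "not utf8, likely latin1 or similar"

print(guess_charset(bytes.fromhex("c3a9")))  # utf8 bytes for 'é'
print(guess_charset(bytes.fromhex("e9")))    # latin1 'é', invalid as utf8
```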

Why does imported MySQL dump show some column value as BLOB?

I imported a MySQL table dump from my server to my local system. I am using phpMyAdmin to view my local MySQL databases. The column that held emails shows as BLOB instead of the actual email addresses.
When I click "Edit" it shows the correct emails, but not in the listing.
I am totally confused as to why this happens. Can anyone suggest a solution?
The BLOB data type is meant to store arbitrary binary data (Binary Large OBject = BLOB), for example an image or another document. It does not make sense to show the value as-is; it would look the same as viewing an image in a text editor.
You have several options, depending on the version of phpMyAdmin you are using - which I unfortunately do not know.
Make phpMyAdmin show BLOB values by default.
Show BLOB values for a complete result set.
These two possibilities are covered by an already asked question.
But basically, this is just fighting the symptoms instead of curing the disease. The question is: why did you choose a BLOB for the email field? A VARCHAR is enough. I do not know the MySQL version you are running, but since MySQL 5.0.3, a VARCHAR can be as large as 65,535 bytes.
ALTER TABLE `table`
  CHANGE `email` `email` VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
The query above changes the email field to a VARCHAR(255). Pay attention to the length you would like to use, and to the character set and collation, but UTF-8 should be fine in this case.
phpMyAdmin shows the text "BLOB" as a placeholder because the BLOB data type (the data type your emails are stored in) varies vastly in size. For large BLOB values, displaying the data inline could take hundreds of screens.

MySQL CharacterSet Conversion of a column will impact the existing values or not

I use MySQL 5.1.
I have a MySQL table 'server_info' which has 2 columns, 'server_id' and 'server_details'. The character set of the 'server_details' column was 'latin1', and I have 100 rows of data in the table. I got an error while updating the table with some specific string values:
'Incorrect string value for column server_details....'
I realized this was due to the different character set.
So I decided to change the character set to 'utf8' to support the new strings as well.
If I change the character set, will the existing data be affected? What will happen to the existing data?
What should I be careful about before doing this conversion?
Is this conversion the right fix for my problem?
When I checked in MySQL 5.6, the conversion gave me the message '100 rows affected'. Does that mean the existing data also got converted to the new character set? If yes, can I expect the same behaviour in MySQL 5.1 as well?
If you change the character set, the existing values will not be fixed, because at insertion time the data already lost (or was never aware of) the extra character-set information.
To fix this problem you have to convert the text manually.
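The root cause of the 'Incorrect string value' error above can be sketched in plain Python: the rejected text simply has no representation in latin1, while utf8 stores it without loss.

```python
# A strict encode into latin1 fails for CJK text, which is the same
# condition MySQL reports as "Incorrect string value" in strict mode.
s = "日本語"
try:
    s.encode("latin1")
except UnicodeEncodeError:
    print("latin1 cannot represent this string")

print(len(s.encode("utf-8")))  # 9 -- utf8 stores each of these in 3 bytes
```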

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" beside encoding. Encoding determines how the characters map to bytes, collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive") - then your two string would match.
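The mismatch described above can be reproduced in plain Python (a sketch; the exact garbage depends on whether the server assumes latin1 or cp1252): UTF-8 bytes sent over a connection that MySQL assumes is latin1 come back double-encoded.

```python
# UTF-8 bytes misread as latin1 turn each accented character into two.
original = "Accîdent"
seen = original.encode("utf-8").decode("latin1")  # what the table shows

print(seen)  # AccÃ®dent
# After SET NAMES utf8 the wrong assumption goes away; until then the
# damage is reversible, because the underlying bytes are still intact:
assert seen.encode("latin1").decode("utf-8") == original
```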

SSIS - Data from MySql to MsSql some characters are?

I just transferred some data in a text field from MySQL to MS SQL (2005). Some of my characters, such as apostrophes, are now ? (question marks). To me this indicates some sort of collation or character set error, right?
To be honest, I don't know which one I should be using.
The MySQL column's current collation is utf8_general_ci, and in MS SQL it is SQL_Latin1_General_CP1_CI_AS.
I have tried changing the collation of the MySQL table to latin1_swedish_ci, but this doesn't help.
Thanks for the input.
Have you tried changing the target (SQL Server) column data type to NVARCHAR?
The utf8_general_ci collation on the MySQL column indicates a Unicode data type. If the source is Unicode, so should be the target - for the easiest transition.
Collations themselves play a minor role here. They just affect comparison and sorting.
You might also need to check the SSIS type of the columns in your dataflow. Remember, the data type and character set are set at the connection manager on the source (and that may involve a conversion from the original native character set). Also, any operations like derived columns or conversions have a character set, which can be altered and will persist down that column's lineage in the data flow. At the end, when the data reaches the destination, there could be additional character set coercion/conversion.