Migrating from SQL server to MySQL using pentaho unicode issue - mysql

I have a problem migrating data from SQL server to MySQL. I have nvarchar columns in SQL server and am exporting them to a Unicode textfile. But when I am importing the column into an utf-8 table of MySQL I get an error for duplicate value: Mysql sees no difference between 'Kaneko, Shûsuke' and 'Kaneko, Shusuke'. I am trying to get these values into a unique column.
What's wrong?
must I use another charset in MySQL?
I also tried converting the textfile to utf8 before importing into MySQL, but still getting the same error.

It seems the problem in your Mysql Table creation. First use SHOW CREATE TABLE on mysql prompt and see its table structure. Have you used right charset and collate. You can read here mysql docs
Many times collation is indeed not only case insensitive, but also partly accent insensitive, so ñ = n. (as Joni Salonen points out, this is incorrect!) but á = a.
So we can use binary collation but its have own drawback.Binary collation compares your string exactly as strcmp() in C would do, if characters are different (be it just case or diacritics difference). The downside of it that the sort order is not natural.
An example of unnatural sort order (as in "binary" is) : A,B,a,b Natural sort order would be in this case e.g : A,a,B,b (small and capital variations of the sme letter are sorted next to each other)
The practical advantage of binary collation is its speed, as string comparison is very simple/fast. In general case, indexes with binary might not produce expected results for sort, however for exact matches they can be useful.
Use a binary collation for the specific column (possibly your best bet)
For ex-
drop table cc;
CREATE TABLE cc ( c CHAR(100) primary key ) DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
insert into cc values ( 'Kaneko, Shûsuke' );
insert into cc values ( 'Kaneko, Shusuke' );

Related

How to find out mysql field level charset?

I need to convert latin1 charset of a table to utf8.
Quoting from mysql docs:
The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8mb4). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8mb4;
This answer shows how to find out charset at DB level, table level, and column level. But I need to find out the charset of the actual stored values. How can I do that?
Since my connector/j jdbc connection string doesn't specify any characterEncoding or connectionCollation properties, it is possible that it used utf8 by default to store the values, in which case I don't need any conversion, just change the table metadata.
mysql-connector-java version: 8.0.22
mysql database version: 5.6
spring boot version: 2.5.x
The character set of the string in a given column should be the same as the column definition.
There have been cases where people accidentally store the bytes of the wrong encoding in a column. For example, they store bytes of a latin1 encoding in a utf8 field. This is a terrible idea, because queries can't tell the difference. Those bytes may not be valid values of the column's defined encoding, and this results in garbage data. Cleaning up a table where some of the strings are stored in the wrong encoding is an unpleasant chore.
So I strongly urge you to store only strings encoded in a compatible way according to the column's definition, and to assume that all strings are stored this way.
To answer the title:
SHOW CREATE TABLE tablename shows the detault charset for the table and any overrides for individual columns.
Don't blindly use CONVERT TO, especially the 2-step ALTER you are showing. Let's see what is in the table now (SELECT col, HEX(col) ... for something with accented text.
See Trouble with UTF-8 characters; what I see is not what I stored for the main 4 types of problems.
This gives several cases and how to fix them. http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
One case involves using CONVERT TO; two other cases involve using BLOB or VARBINARY.

How to avoid invalid errors when comparing integers with explicit collation in MySQL 5.6

Okay, not the clearest title ever; feel free to improve.
I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.
I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.
I figured this should be a fairly easy task; just do:
SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin
But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.
No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:
For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.
Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.
Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?
SET NAMES utf8mb4;
CREATE TABLE x(s VARCHAR(11) COLLATE utf8mb4_unicode_520_ci NOT NULL);
INSERT INTO x (s)
VALUES ('aha'), ('áha̱'), ('ā̧́ḫà̀'),
('i'), ('i̯');
SELECT s FROM x GROUP BY s HAVING COUNT(*) > 1;
Comes back with
aha
i
without any complaints about numeric stuff.
I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.
What I am doing differently than your case?
When adding an explicate COLLATE clause, put it on the component of the query that needs it. (COLLATE does not apply to the query as a whole; different parts can be collated differently.)

MySQL Handling of à i utf8 group by primary key

I have a utf8 table in mysql 5.5.19.
When I select data from it, I get data such as RoÃdorf and Rosdorf on different rows.
When I group by on the data however, or select distinct, I get only RoÃdorf.
My basic problem is that the data is exported to hive (which gets the correct group by with both different alternatives) and then imports this to mysql again. The import will then fail because mysql will treat RoÃdorf and Rosdorf as the same key (and PK constraint will brake). I guess this is for the same reason as the above distinction, so an answer to that is very helpful.
Use utf8_bin as collation, it will differentiate accented and non-accented characters
See MySQL collation charts

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATION utf8_bin
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and document on Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want mysql to convert any data, you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8
This is the recommended way, and it is explicitely documented in Alter Table Syntax Documentation.
The values use in a different encoding
The default encoding was Latin1 for several years on a some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table meta-data, using the BLOB type trick
Convert the values using CONVERT TO.
A straightforward conversion will potentially break any strings with non-utf7 characters.
If you don't have any of those (i.e. all of your text is english), you'll usually be fine.
If you've any of those, however, you need to convert all char/varchar/text fields to blob in an initial run, and to convert them to utf8 in a subsequent run.
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding swedish to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and remains in the unchanged state. Therefor, I'd deem the ALTER TABLE statement working.

What will happen to existing data if I change the collation of a column in MySQL?

I am running a production application with MySQL database server. I forget to set column's collation from latin to utf8_unicode, which results in strange data when saving to the column with multi-language data.
My question is, what will happen with my existing data if I change my collation to utf8_unicode now? Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
I will change with phpMyAdmin web client.
The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.
Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.
A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....
A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two german sort orders latin1_german1_ci or latin1_german2_ci. Depending on your choice, Umlauts such as ö will either sort as oe or as o.
When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then moves the old table into a hidden location, moves the hidden table into place and then drops the old data, freeing up disk space. For some time inbetween, you will need two times the storage for that.
When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.
When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.
MySQL tries to preserve data doing this: As long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if there is data truncation going on, and data which cannot be represented in the target character set will be replaced by ?
Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin I inserted some non-latin chars
INSERT INTO Test VALUES ('英國華僑');
I select them and get rubbish (as expected).
SELECT text from Test;
gives
text
????
I then changed the collation of the column to utf8_unicode and re-ran the SELECT and it shows the same result
text
????
This is what I would expect - It will keep the data and the data will remain rubbish, because when the data was inserted the column lost the extra character information and just inserted a ? for each non-latin character and there is no way for the ???? to again become 英國華僑.
Your data will stay in place but it won't be fixed.
Valid data will be properly converted:
When you change a data type using
CHANGE or MODIFY, MySQL tries to
convert existing column values to the
new type as well as possible. Warning:
This conversion may result in
alteration of data.
http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
... and more specifically:
To convert a binary or nonbinary
string column to use a particular
character set, use ALTER TABLE. For
successful conversion to occur, one of
the following conditions must
apply:[...] If the column has a
nonbinary data type (CHAR, VARCHAR,
TEXT), its contents should be encoded
in the column character set, not some
other character set. If the contents
are encoded in a different character
set, you can convert the column to use
a binary data type first, and then to
a nonbinary column with the desired
character set.
http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
So your problem is invalid data, e.g., data encoded in a different character set. I've tried the tip suggested by the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte chars had been inserted as 0x3F (i.e., the ? symbol in Latin1). My MySQL stack had been smart enough to detect that input data was not Latin1 and convert it into something "compatible". And once data is gone, you can't get it back.
To sum up:
Use HEX() to find out if you still have your data.
Make your tests in a copy of your table.
My question is, what will happen with my existing data if I change my
collation to utf8_unicode now?
Answer: If you change to utf8_unicode_ci, nonthing will happen to your existing data (which is already corrupt and remain corrupt till you modify it).
Will it destroy or corrupt the existing data or will the data remain,
but the new data will be saved as utf8 as it should?
Answer: After you change to utf8_unicode_ci, existing data will not be destroyed. It will remain the same like before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.
I will change with phpMyAdmin web client.
Answer: Sure, you can change collation with phpMyAdmin by going to Operations > Table options
CAUTION! Some problems are solved via
ALTER TABLE ... CONVERT TO ...
Some are solved via a 2-step process
ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...
If you do the wrong one, you will have a worse mess!
Do SELECT HEX(col), col ... to see what you really have.
Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases