While testing some code I stumbled on the following MySQL error:
Error Code: 1267. Illegal mix of collations (utf8_general_ci,IMPLICIT) and ( utf8mb4_general_ci,COERCIBLE) for operation '='
I was using a WHERE clause on a column with a standard MySQL UTF-8 collation, and the value contained a character that takes 4 bytes. Unless I misunderstood what I read, I found the following information:
MySQL's original UTF-8 implementation was incomplete (supporting maximum 3 bytes)
The way to solve this is a new character set called utf8mb4, which is by no means a new encoding; MySQL only uses it to patch their original mistake.
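To make the 3-byte limit concrete, here is a minimal Python sketch (standard library only). The sample characters and the helper name `utf8_width` are my own illustrative choices. It counts how many bytes UTF-8 needs for each character; anything that needs 4 bytes is what the original MySQL utf8 charset cannot store.

```python
# Count the UTF-8 bytes needed per character. MySQL's original "utf8"
# stores at most 3 bytes per character; utf8mb4 allows 4.

def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))

# Illustrative samples: a, à, €, 🐼
samples = ["a", "\u00e0", "\u20ac", "\U0001F43C"]
widths = [utf8_width(ch) for ch in samples]

for ch, width in zip(samples, widths):
    verdict = "fits in 3-byte utf8" if width <= 3 else "needs utf8mb4"
    print(f"U+{ord(ch):04X}: {width} byte(s), {verdict}")
```

Only the last sample, the panda emoji, crosses the 3-byte boundary, which matches the symptom described above.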
On my end I see no reason to use the original MySQL UTF-8 implementation, since it's incomplete. So I made a few server-side configuration changes to make sure all defaults point to utf8mb4. Everything now seems fine in my application: I can use 🐼 characters in my form without having to worry about MySQL.
My problem now remains that when I connect with MySQL Workbench, it seems that the encoding is being forced to UTF-8. So even if my application works correctly, if I want to run tests directly in MySQL Workbench, I get the "Illegal mix of collation" error unless I run this fix (in Workbench) after starting the application:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
I found this old question (MySQL Workbench charset) where it seemed impossible to overwrite the setting but even after I spent too much time searching for the config, I cannot believe this is still the case??
For now, I'm afraid, you will have to live with that. There's a WL for MySQL to rename that encoding to utf8 (throwing out the existing 3 byte variant). So it makes sense to keep utf8 in MySQL Workbench or we have to use different settings for different servers, which makes things more complicated.
Final Update
I was able to easily migrate the data with Talend. No errors, and it worked perfectly the first time with no special settings. This shows what an utter piece of garbage the MySQL Workbench Migration tool is. While the learning curve of Talend is rough (it's not intuitive at all), it appears to be one of the best data migration solutions out there. I recommend using it. Note I never figured out why the migration failed (as seen below). I'm just walking away from the utter garbage Oracle has pushed on the community. Oh, and Talend migrated the data to utf8mb4/utf8_general_ci without a hitch.
Please note there are updates at the bottom.
We have to migrate an export from TrackerRMS (which luckily doesn't have FK constraints, but the data is a total mess) to MySQL. Restoring the backup of the TrackerRMS data to SQL Server is cake; no issues. The problem is copying the data from SQL Server to MySQL.
MySQL Workbench Migration can handle all but 4 of the tables; but those 4 tables are the key problem. They have crazy content in their fields which causes the migration tool to choke. I attempted to export the data as .sql from HeidiSQL and it chokes as well.
The source table problem fields are NVARCHAR(MAX) and SQL_Latin1_General_CP1_CI_AS collation.
Note I've tried changing the collation of the source SQL Server table columns to Latin1_General_100_BIN2_UTF8 and Latin1_General_100_CI_AI_SC_UTF8 and there is no effect.
The errors are:
ERROR: `Backup_EmpowerAssociates`.`BACKUP_documents`:Inserting Data: Incorrect string value: '\xF0\x9F\x93\x8A x...' for column 'filepath' at row 13
ERROR: `Backup_EmpowerAssociates`.`BACKUP_activities`:Inserting Data: Incorrect string value: '\xF0\x9F\x91\x80' for column 'subject' at row 42
ERROR: `Backup_EmpowerAssociates`.`BACKUP_resourcehistory`:Inserting Data: Incorrect string value: '\xF0\x9D\x91\x82(\xF0...' for column 'jobdescription' at row 80
This tells me the source data has 4-byte character details (which is beyond the standard utf8). Note the destination database in MySQL is utf8mb4 and utf8mb4_unicode_ci collated, and has the default settings as such. No connection settings override this.
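As a sanity check on that reading of the errors, here is a small Python sketch that decodes the byte sequences quoted above. The helper `supplementary_chars` is a hypothetical name of mine, not part of any migration tool. It confirms that each sequence is a single character above U+FFFF, which needs 4 bytes in UTF-8.

```python
# Decode the byte sequences quoted in the error messages and report
# their code points. Everything above U+FFFF needs 4 bytes in UTF-8.

error_bytes = [
    b"\xF0\x9F\x93\x8A",  # from the 'filepath' error
    b"\xF0\x9F\x91\x80",  # from the 'subject' error
    b"\xF0\x9D\x91\x82",  # from the 'jobdescription' error
]

def supplementary_chars(text):
    """Return the characters in text that lie above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

code_points = [ord(raw.decode("utf-8")) for raw in error_bytes]
for cp in code_points:
    print(f"U+{cp:04X}")
```

The same helper could be run over exported rows to list which values contain such characters before deciding whether to scrub them.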
When migrating I use Microsoft SQL Server and ODBC (native) for localhost (SQL Server) with default options. I've also tried turning ANSI off, but it has no impact. Note the ODBC configuration for SQL Server has no charset or collation settings or options. For target, I use the localhost stored connection which I use for general access.
Note the MySQL Workbench migration tool defines the receiving table columns (for the above problem columns) as LONGTEXT CHARACTER SET 'utf8mb4'.
Could the issue be the migration proxy (ODBC?) is somehow converting it to utf8 (even though I don't have that selected)? But if that was the case, wouldn't the incoming data not be erroring out in the migration process as a UTF8MB4 solution (4-byte vs less)?
Note I tried creating and adjusting the destination MySQL table (by adjusting the SQL in the migration tool) as CHARSET latin1 and latin1_general_ci collation. Same issue.
Migration simply does not want to work (this is with SQL Server source being SQL_Latin1_General_CP1_CI_AS). And I've tried it with UTF8 both on and off for driver. No effect.
Does anyone with migration experience recognize this issue, or have recommendations on how to resolve the problem? I'm fine with scrubbing the source data in SQL Server before I migrate - I just don't know the best method to do that (or if it's necessary).
Thanks!
===
UPDATE 1
This is very strange; using the below technique to show values that won't convert, this is the result:
SELECT filepath, CONVERT(varchar,filepath) FROM BACKUP_documents WHERE filepath <> CONVERT(varchar, Filepath);
Why on earth is the data being truncated upon convert with a simple filename at the "c" in documents?
Here's a capture that might also help resolve this issue.
But the strange part is MSSQL is showing normal text (without special characters) as being non-ASCII. I'm wondering if the folks at TrackerRMS are running code written in another country/language and it's messing up the data, but it's something that's not visible?
UPDATE 2
So to make things clear, here's what one of the characters that is messing up the data looks like.
I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export a SQL dump using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
Since the server's my.ini file has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now I start HeidiSQL, it shows utf8mb4 references instead of utf8 for the sessions parameters. I don't know why:
Now, I import my dumped file, and I see that even if everything is apparently configured in utf8, it looks like I have some encoding problems.
On the server, I see:
Locally, in HeidiSQL, I see:
Special characters like à are not displayed correctly on the local database.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the sql file it does not fix the issue, and also values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze comment, I could fix the issue.
In HeidiSQL, when I choose the sql file to execute, there's actually an "Encoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set configured correctly. HeidiSQL probably displays a different character set because clients set a character set of their own.
For example, your mysql server might use "Character set A" by default. If a client connects and says they want "Character set B", the server will convert this on the fly.
utf8mb4 is a superset of (and superior to) utf8. It's better to have your server default to utf8mb4. The popular use case for utf8mb4 is emoji.
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8859-1/latin1 instead".
The server happily complies and will convert the client's ISO-8859-1 strings to UTF-8 on the fly.
Despite saying it wants to use ISO-8859-1, the client actually sends UTF-8.
The server thinks the data is ISO-8859-1, treats it as such, and runs the UTF-8 bytes through an ISO-8859-1 to UTF-8 conversion. It's effectively a double-encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
If this is correct, this process is reversible.
You really just need the opposite operation. For example, if you had a PHP string $data, which is 'double-encoded' as UTF-8, the process would simply be to call this:
$output = utf8_decode($data);
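For anyone who wants to try the idea outside PHP, here is a rough Python sketch of the same round trip. It assumes the wrong charset really is ISO-8859-1 (Python's "latin-1" codec); the function names and the sample string are mine and purely illustrative.

```python
# Reproduce the suspected double-encoding and then undo it.
# Assumption: the mislabelled charset is ISO-8859-1 ("latin-1").

def double_encode(text: str) -> str:
    """Misread UTF-8 bytes as ISO-8859-1, as in the scenario above."""
    return text.encode("utf-8").decode("latin-1")

def repair(text: str) -> str:
    """Undo the misreading: recover the raw bytes, decode as UTF-8."""
    return text.encode("latin-1").decode("utf-8")

original = "d\u00e9j\u00e0 vu"  # "déjà vu"
broken = double_encode(original)

print(repr(broken))
print(repair(broken) == original)
```

One useful side effect: calling `repair` on this sample in its correct, never-double-encoded form raises `UnicodeDecodeError`, which gives a cheap way to spot values that should be left alone.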
It's also possible to fix this in MySQL. See this stack overflow question.
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
Sidenote: This problem is extremely common. You are somewhat lucky that you're French, because it highlights the problem. Many English systems I've seen have this issue, but it largely goes unnoticed for a long time because most of the text stays within the common ASCII range.
You have "Mojibake". à turns into Ã  (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.
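To show what the HEX output is telling you, here is a minimal Python sketch of the two byte patterns for à. It assumes the column is utf8 and that the misread charset is latin1; the helper name `hex_utf8` is mine.

```python
# Compare the hex of a correctly stored "à" with a double-encoded one,
# mirroring what SELECT col, HEX(col) would show for a utf8 column.

def hex_utf8(text: str) -> str:
    """Return the UTF-8 bytes of text as an uppercase hex string."""
    return text.encode("utf-8").hex().upper()

correct = "\u00e0"  # à
# UTF-8 bytes misread as latin-1, then stored as UTF-8 again
double_encoded = correct.encode("utf-8").decode("latin-1")

print(hex_utf8(correct))         # C3A0
print(hex_utf8(double_encoded))  # C383C2A0
```

Two bytes (C3A0) means the stored data is fine and the problem is on the display side; four bytes (C383C2A0) means the stored data itself is double-encoded.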
I have a mysql database with a charset utf8 of all the tables.
I am using SQLDeveloper to access and query the database with the latest JConnector JDBC driver.
When executing a simple query such as SELECT 'Варна'; (equivalent to SELECT 'Варна' from DUAL;), which contains Bulgarian text, SQLDeveloper returns '?????'. As a result, selects against data stored in Bulgarian return NULL, because their where clauses (containing Bulgarian text) do not match the utf8 Bulgarian characters in the database. (When the select doesn't use Bulgarian text at all, SQLDeveloper returns completely correct values and correctly displays any Bulgarian text in the query result.)
The Preferences -> Environment -> Encoding in SQLDeveloper is set currently to UTF-8, but I have tried virtually every applicable encoding listed in there and even the simplest query SELECT 'Варна' from DUAL; still does not return back the correct value Варна.
I have looked into setting the variable NLS_LANG, thinking this may be the cause but to no avail. (Perhaps it is the key after all but I am unable to actually configure it properly).
Edit: In order to reproduce the problem and visualise it (as I realise I may have explained it poorly) just go in SQLDeveloper and connect to a mysql database and execute the query SELECT 'Варна' from DUAL;.
Edit2: Clarifications.
Edit3: As shown by the comment left by #tenhouse it appears that this may be a bug.
Edit4: As stated below as a comment, the above query SELECT 'Варна' from DUAL; works perfectly fine without any modifications and/or settings fiddling on MySQL Workbench.
Edit5: Please, feel free to correct the title and/or tags if you feel that something can be improved as there is still no answer to the problem.
Edit6: By now can I assume that it really is a bug? Could anyone advise me where exactly to report it - is it a JConnector or SQLDeveloper related bug. I would think that I have to report it as a SQLDeveloper bug but I'd rather get a confirmation before possibly wasting their time.
Edit7: Tried to clarify it even further in my hopes for an answer.
Edit8: (Important!) My current database is hosted on linux (Ubuntu 12.04, MySQL 5.5.28) server. If, however, I install MySQL on a fresh Windows machine and create a utf8 db there, querying through SQLDeveloper works as it is supposed to, SELECT 'Варна' from DUAL; actually returns Варна. Could someone please confirm this?
So I didn't know this myself until I hit this problem a few months back, but MySQL lets clients, databases, and connections each use a different encoding. MySQL will convert requests from and responses to a client between encodings, as specified by the client or in its config file. So even though the database is storing data as utf8, if the client is set to latin1, you're going to see latin1 as your result encoding. The easiest way to check this is to open a connection to MySQL and run the following query:
SHOW VARIABLES LIKE "%char%";
You should see a whole bunch of encodings for different connections/sources. From your description, I imagine most of these will not be utf8. Here's mysql's doc on what each of these means. You can test whether this is in fact the problem by doing a SET NAMES 'utf8'; or charset utf8; (can't remember which one) and running your queries again to see if that fixes the problem.
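As a rough illustration of why a wrong connection encoding can turn Cyrillic into question marks, here is a small Python sketch. It is only an analogy: "latin-1" stands in for a latin1 connection, and I'm not claiming this is exactly what the driver does internally.

```python
# What happens when Cyrillic text is forced through a charset that
# cannot represent it. "latin-1" stands in for a latin1 connection.

query_literal = "Варна"

# Characters with no latin-1 equivalent are replaced by "?"
as_latin1 = query_literal.encode("latin-1", errors="replace")
print(as_latin1)  # b'?????'

# UTF-8 can represent every character, so the round trip is lossless
as_utf8 = query_literal.encode("utf-8")
print(as_utf8.decode("utf-8") == query_literal)  # True
```

Once the text has been reduced to '?????', no WHERE clause can match the stored value, which fits the symptom in the question.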
A summary of what each of these guys does (since the docs leave some stuff out):
character_set_client: specifies how data is encoded when sending from client to server. Anything connecting through MySQL's API is not a client (ex. php's mysqli, most C/C++ wrapper libs)
character_set_database: specifies the encoding for data stored in the database
character_set_filesystem: not really sure, but I believe how data is written to disk?
character_set_results: the encoding that MySQL returns query results
character_set_server: server's default set (not really sure cases where this is used)
character_set_system: not sure on this one either
character_sets_dir: where your collation/encoding definitions are located
Most of these guys can be specified by editing your my.cnf and sticking your defaults in there.
I'm not exactly sure how JConnector works, but I imagine it uses MySQL's C API, in which case you'll need to do something like the following somewhere in the code. Maybe JConnector has a way for you to set this through it. I'm not sure, but here's the syntax for the MySQL API:
mysql_options( myLink, MYSQL_SET_CHARSET_NAME, "utf8" );
EDIT: For MySQL 5.5
You can try a command like this ::
ALTER DATABASE CHARACTER SET WE8ISO8859P5;
Please restart the DB after changing the characterset.
For more details, refer to this link, which explains the encoding required for different languages:
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67789/ch3.htm
after you connect with a mysql_connect:
$dbcnx = mysql_connect($dbhost, $dbuser, $dbpass);
you do this query:
mysql_query("SET
character_set_results = 'utf8',
character_set_client = 'utf8',
character_set_connection = 'utf8',
character_set_database = 'utf8',
character_set_server = 'utf8'",
$dbcnx);
This sets the encoding for what is returned and for what happens on the server, so all of it uses the same encoding.
In your subsequent queries, specify that this connection should be used.
Export
Add [?characterEncoding=utf8]
<StringRefAddr addrType="customUrl">
<Contents>jdbc:mysql://instance_host_name:3306/database_name?characterEncoding=utf8</Contents>
</StringRefAddr>
Import
Quick question as I have never run into this before.
On a webhost I am running the query:
SET NAMES 'utf8'
This is returning the following error:
Error: Unknown system variable 'NAMES'
I haven't run across this before. I get similar errors when trying to specify CURRENT_TIMESTAMP as a default column value as well as setting the collation of a table.
The MySQL queries I am running have worked on literally hundreds of hosting accounts before this one. On contacting the host I was fobbed off saying it was probably my code.
Is it likely that this is a dodgy MySQL install? The host says they are running MySQL 5.
SET NAMES is available since MySQL 4.1, which brought large scale changes to character set handling and full UTF-8 support. Quite sure you have a MySQL version <4.1 in front of you. Try a
SELECT VERSION();
as a1ex07 has recommended.
Older versions of MySQL can only handle 8-bit character data. They can still store UTF-8 data as byte sequences, but they are not aware of it. There are several drawbacks to storing UTF-8 in MySQL <4.1. For example, string lengths can exceed given column limits even though the number of characters should fit. Also, the modern string comparison functions do not exist (they correctly compare special characters and different ways of writing them, e.g. "ß" vs. "ss" in German).
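The column-limit point is easy to see with a quick Python sketch. The word and the limit of 6 are just illustrative values I picked.

```python
# Character count versus byte count for UTF-8 text stored as raw bytes.

name = "Straße"
char_count = len(name)
byte_count = len(name.encode("utf-8"))
print(char_count, byte_count)  # 6 7

column_limit = 6  # hypothetical byte-based column width
print(char_count <= column_limit)  # True
print(byte_count <= column_limit)  # False
```

A server that only counts bytes sees 7 units and will not fit the value into a 6-unit column, even though a human counts 6 characters.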
My MySQL environment is as follows:
character_set_database="big5"
And when I send a SQL statement that contains traditional Chinese (such as "select * from a where name = '中'")
from jdbc to mysql database, it will throw the following exception:
Illegal mix of collations (big5_chinese_ci,IMPLICIT), (latin1_swedish_ci,COERCIBLE), (latin1_swedish_ci,COERCIBLE) for operation ' IN ''
How can I solve this?
But we need to do this between Oracle and MySQL. When my program gets the data from Oracle (its encoding is ISO-8859-1) and passes it into the SQL statement in JDBC, this problem occurs, and I can't change the collation on the Oracle side. How can I solve this? Why can't JSP handle this automatically?
I have tried to convert, but Chinese characters cannot be saved in the Latin1 character set.
Might this be causing the problem?
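The Latin1 limitation can be checked outside the database with a few lines of Python. This is only a sketch using Python's own "big5" and "latin-1" codecs, which I'm assuming are close enough to MySQL's big5 and latin1 for this one character.

```python
# Check which charsets can represent the character from the query.

ch = "中"

# Big5 can encode it (as a 2-byte sequence) and decode it back
big5_bytes = ch.encode("big5")
print(len(big5_bytes))  # 2

# latin-1 has no slot for it, so encoding fails
try:
    ch.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

So any step in the chain that passes the value through Latin1 loses the character before the comparison ever happens.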
Check your collations.
The database itself can have one collation and the tables another one totally different.
If you mix collations from two tables, you get this error.
Also, the swedish collation seems to be the default for databases (have no idea of why).
What's your encoding in your java project side? Did you make sure that it's big-5 too?
JSP can't solve the problem. How should JSP know what you want?
Do you do any encoding transformation in your JSP, or do you just put the Oracle data straight into the MySQL database?
In general, it is important to have the same encoding in your script, in your tables and, very importantly, in the connection to the database.