MySQL utf8_bin collation equivalent for Azure SQL database

I am trying to migrate a MySQL application to Azure.
The pricing for Azure's MySQL database seems to be quite a bit higher than the "SQL Database" option, so I decided to go with "SQL Database".
The last step for the resource set-up is to choose a collation.
In MySQL I use utf8_bin, but that collation does not seem to be valid for "SQL Database".
Is there an equivalent collation?
I need to store UTF-8 characters, with case-sensitive and accent-sensitive comparison, and I almost never sort strings.
I did some research on the internet but couldn't find any information about Azure's collations.
Edit:
After additional research I've come across 'Latin1_General_BIN2', which should do the job. I'm not sure that 'Latin' can handle all UTF-8 characters (e.g. ʖ, ޖ, etc.), and I haven't yet fully grasped the difference between BIN and BIN2 collations.

That collation is not UTF-8 capable. As of this moment, existing collations in SQL Server and Azure SQL DB are non-Unicode; Unicode (UTF-16) is enabled through the NCHAR and NVARCHAR (and SQL_VARIANT) data types.
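In practice, that means pairing an NVARCHAR column with a BIN2 collation, which gives binary (code-point) comparison much like MySQL's utf8_bin; a minimal sketch, with made-up table and column names:
CREATE TABLE dbo.Samples (
    Label NVARCHAR(100) COLLATE Latin1_General_BIN2  -- UTF-16 storage, binary code-point comparison
);
INSERT INTO dbo.Samples (Label) VALUES (N'Café'), (N'cafe'), (N'ʖ');
-- Case and accents are significant, so only the exact match comes back:
SELECT Label FROM dbo.Samples WHERE Label = N'Café';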
That being said, we are now running a private preview of UTF8 support in SQL Server and Azure SQL DB, so I'd like to further discuss with you.
Will you be at Ignite? If so, please look for me at the SQL Server booth. If not, can you please send me an email at utf8team@microsoft.com?
Thank you!

Migrating from Latin1 SQL Server to utf8mb4 MySQL Incorrect String Error Problems

Final Update
I was able to easily migrate the data with Talend. No errors, and it worked perfectly the first time with no special settings. This shows what an utter piece of garbage the MySQL Workbench Migration tool is. While the learning curve of Talend is rough (it's not intuitive at all), it appears to be one of the best data migration solutions out there. I recommend using it. Note I never figured out why the migration failed (as seen below). I'm just walking away from the utter garbage Oracle has pushed on the community. Oh, and Talend migrated the data to utf8mb4/utf8_general_ci without a hitch.
Please note there are updates at the bottom.
We have to migrate an export from TrackerRMS (which luckily doesn't have FK constraints, but the data is a total mess) to MySQL. Restoring the backup of the TrackerRMS data to SQL Server is cake; no issues. The problem is copying the data from SQL Server to MySQL.
MySQL Workbench Migration can handle all but 4 of the tables, but those 4 are the key problem. They have crazy content in their fields which causes the migration tool to choke. I attempted to export the data as .sql from HeidiSQL, and it chokes as well.
The problem fields in the source tables are NVARCHAR(MAX) with SQL_Latin1_General_CP1_CI_AS collation.
Note I've tried changing the collation of the source SQL Server table columns to Latin1_General_100_BIN2_UTF8 and Latin1_General_100_CI_AI_SC_UTF8 and there is no effect.
The errors are:
ERROR: `Backup_EmpowerAssociates`.`BACKUP_documents`:Inserting Data: Incorrect string value: '\xF0\x9F\x93\x8A x...' for column 'filepath' at row 13
ERROR: `Backup_EmpowerAssociates`.`BACKUP_activities`:Inserting Data: Incorrect string value: '\xF0\x9F\x91\x80' for column 'subject' at row 42
ERROR: `Backup_EmpowerAssociates`.`BACKUP_resourcehistory`:Inserting Data: Incorrect string value: '\xF0\x9D\x91\x82(\xF0...' for column 'jobdescription' at row 80
This tells me the source data has 4-byte characters (beyond MySQL's legacy 3-byte utf8). Note the destination database in MySQL is utf8mb4 with utf8mb4_unicode_ci collation, and has the default settings as such. No connection settings override this.
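As a sanity check, one rough way to find the rows holding supplementary characters on the SQL Server side is to search for UTF-16 surrogate code units; a sketch, reusing the table and column from the first error above:
SELECT filepath
FROM BACKUP_documents
-- NCHAR(55296) through NCHAR(57343) is U+D800-U+DFFF, the UTF-16 surrogate range;
-- any match means the value needs 4 bytes in MySQL's utf8mb4
WHERE filepath LIKE N'%[' + NCHAR(55296) + N'-' + NCHAR(57343) + N']%' COLLATE Latin1_General_BIN;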
When migrating, I use Microsoft SQL Server and ODBC (native) for localhost (SQL Server) with default options. I've also tried turning ANSI off, but it has no impact. Note the ODBC configuration for SQL Server has no charset or collation settings or options. For the target, I use the localhost stored connection which I use for general access.
Note the MySQL Workbench migration tool defines the receiving table columns (for the above problem columns) as LONGTEXT CHARACTER SET 'utf8mb4'.
Could the issue be that the migration proxy (ODBC?) is somehow converting the data to utf8 (even though I don't have that selected)? But if that were the case, wouldn't a utf8mb4 destination accept the incoming data without erroring, since utf8mb4 is a superset of the 3-byte utf8?
Note I tried creating and adjusting the destination MySQL table (by adjusting the SQL in the migration tool) with CHARSET latin1 and latin1_general_ci collation. Same issue.
Migration simply does not want to work (this is with the SQL Server source being SQL_Latin1_General_CP1_CI_AS), and I've tried it with UTF8 both on and off for the driver. No effect.
Does anyone with migration experience recognize this issue, or have recommendations on how to resolve the problem? I'm fine with scrubbing the source data in SQL Server before I migrate - I just don't know the best method to do that (or if it's necessary).
Thanks!
===
UPDATE 1
This is very strange; using the below technique to show values that won't convert, this is the result:
SELECT filepath, CONVERT(varchar, filepath) FROM BACKUP_documents WHERE filepath <> CONVERT(varchar, filepath);
Why on earth is the data being truncated upon convert with a simple filename at the "c" in documents?
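One thing worth ruling out: in T-SQL, CONVERT(varchar, ...) with no length specified defaults to varchar(30), which truncates on its own regardless of the data. An explicit length removes that variable:
SELECT filepath, CONVERT(varchar(max), filepath)
FROM BACKUP_documents
WHERE filepath <> CONVERT(varchar(max), filepath);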
Here's a capture that might also help resolve this issue.
But the strange part is MSSQL is showing normal text (without special characters) as being non-ASCII. I'm wondering if the folks at TrackerRMS are running code written in another country/language and it's corrupting the data in a way that isn't visible?
UPDATE 2
So to make things clear, here's what one of the characters that is messing up the data looks like.

MySQL Workbench Connection Encoding

While testing some code I stumbled on the following MySQL error:
Error Code: 1267. Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='
I was using a WHERE clause on a standard MySQL utf8 column that contained a character encoded with 4 bytes. Unless I misunderstood, this is what I found while reading:
MySQL's original UTF-8 implementation was incomplete (supporting a maximum of 3 bytes)
The way to solve this is a new character set called utf8mb4, which is by no means a new encoding, only MySQL's way of patching its original mistake.
On my end I see no reason to use the original MySQL UTF-8 implementation, since it's incomplete. So I made a few server-side configuration changes to make sure all defaults pointed to utf8mb4. Everything seems fine, and now in my application I can use 🐼 characters in my forms without having to worry about MySQL.
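For reference, whether the defaults actually point to utf8mb4 can be double-checked from any client session; a quick sketch, where the server and database entries should read utf8mb4:
SHOW VARIABLES LIKE 'character\_set\_%';  -- expect utf8mb4, except character_set_filesystem (binary) and character_set_system (utf8)
SHOW VARIABLES LIKE 'collation\_%';       -- expect utf8mb4_unicode_ci for server and database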
My problem now remains that when I connect with MySQL Workbench, it seems that the encoding is being forced to UTF-8. So even if my application works correctly, if I want to run tests directly in MySQL Workbench, I get the "Illegal mix of collation" error unless I run this fix (in Workbench) after starting the application:
SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'
I found this old question (MySQL Workbench charset) where it seemed impossible to override the setting, but even after spending far too much time searching for the config, I cannot believe this is still the case?
For now, I'm afraid, you will have to live with that. There's a WL (worklog) for MySQL to rename that encoding to utf8 (throwing out the existing 3-byte variant), so it makes sense to keep utf8 in MySQL Workbench; otherwise we would have to use different settings for different servers, which makes things more complicated.

Convert database font from MS SQL to mysql utf8?

I have an old database on a Windows dedicated server, and now I have bought a new Linux dedicated server with PHP and MySQL.
I plan to use PHP to pull the data out of the MS SQL Server database row by row and put it into the MySQL database.
But the problem is that MySQL uses utf8_unicode_ci, and I don't know which charset the MS SQL Server uses.
Thanks for the help.
Have you tried just running your code? Odds are it'll "Just work".
Caveats below:
You may run into issues in your data (although this is highly unlikely), because the character set you're referring to is actually a collation. That is, it defines whether the string "ABCDEFGH" is treated as equal to "abcdefgh". The "_ci" part of utf8_unicode_ci means it's case-insensitive.
Some quick googling suggests that MySQL defaults to a case-insensitive, accent-sensitive collation; that's good, because SQL Server does the same. You should check the collation of the SQL Server database; if it's "SQL_Latin1_General_CP1_CI_AS" you should be good.
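Checking is a one-liner on the SQL Server side; a small sketch, where the database and table names are placeholders:
SELECT SERVERPROPERTY('Collation');                      -- server-wide default
SELECT DATABASEPROPERTYEX('YourDatabase', 'Collation');  -- a specific database
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.YourTable');            -- per-column overrides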
SQL Server stores character-based data in extended ASCII (i.e., depending on the code pages/encodings installed and used on the Windows server machine) for the non-Unicode types (char, varchar, text, etc.) and in Unicode for the nchar, nvarchar, ntext, etc. types. I believe the internet has plenty of material on this FAQ topic.

SQL Server localization

Does SQL Server inherit the localization from the server it's installed on? Or can you define the locale for each instance/database?
Which setting determines whether a comma or a period is used when a double is saved to the database?
SQL Server has a server collation, and each database can either use the server collation or be set to a different collation.
The formatting of the data type will be taken from the database collation, provided that a collation has not been explicitly set for the column.
SQL Server Collations
Remember that if you use different collations for columns that you are trying to compare, you will need to use COLLATE, and that will make the argument a "non-searchable argument", i.e. indexes will not be used to satisfy that statement.
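For example, with hypothetical tables whose Name columns have different collations, the workaround and its cost look like this:
SELECT c.Name
FROM dbo.Customers AS c
JOIN dbo.ImportedCustomers AS i
  ON c.Name = i.Name COLLATE SQL_Latin1_General_CP1_CI_AS;
-- The explicit COLLATE makes the mismatched comparison legal, but the
-- predicate becomes non-searchable: an index on i.Name cannot be used to
-- satisfy it.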

character set problem in mysql

MySQL's environment is the following:
character_set_database="big5"
And when I send a SQL statement containing traditional Chinese (such as select * from a where name = '中') from JDBC to the MySQL database, it throws the following exception:
Illegal mix of collations (big5_chinese_ci,IMPLICIT), (latin1_swedish_ci,COERCIBLE), (latin1_swedish_ci,COERCIBLE) for operation ' IN ''
How can I solve this?
But we need to do that between Oracle and MySQL, and when my program gets the data from Oracle (its encoding is ISO-8859-1) and passes it into the SQL statement via JDBC, it has this problem, but I can't change the collation of Oracle. How do I solve this? Why can't JSP handle this automatically?
I have tried to convert, but Chinese characters cannot be saved into the Latin1 character set.
Might this cause the problem?
Check your collations.
The database itself can have one collation and the tables a totally different one.
If you mix collations from two tables, you get this error.
Also, the Swedish collation seems to be the default for databases (I have no idea why).
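A quick way to see which collation applies at each level (the table name a is the one from the question):
SELECT @@collation_server, @@collation_database, @@collation_connection;
SHOW FULL COLUMNS FROM a;  -- the Collation column shows each column's own setting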
What's the encoding on your Java project side? Did you make sure that it's Big5 too?
JSP can't solve the problem. How should JSP know what you want?
Do you make any encoding transformation in your JSP, or do you just put the Oracle data into the MySQL database?
In general, it is important to have the same encoding in your script, in your tables, and, very importantly, in the connection to the database.