How to detect UTF-8 characters in a Latin1 encoded column (MySQL)

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my MySQL client encoding to UTF-8.

Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters, as these are either Latin-1 accented characters or symbols, or the first byte of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');
This is made unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80, which makes it necessary to use UNHEX() instead.
This produces results like this:
latin1        utf8
----------------------------------------
Björn        Björn

Since your question is not completely clear, let's assume some scenarios:
Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This will turn the incorrectly stored data into correctly stored UTF-8 in the dump file, since that is the way it has worked so far (read the article linked above for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case, forget it: the data is gone. Any non-latin1 character will have been replaced by a ?.
Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
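For that last scenario, converting a whole table can be done in one statement. A minimal sketch, assuming a hypothetical table clients (utf8mb4 is used here because it covers all of Unicode; plain utf8 works too if three-byte characters are enough):
ALTER TABLE clients CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- CONVERT TO re-encodes the existing latin1 data and changes the column definitions in one step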

There is a script on GitHub to help with this sort of thing.

I would create a dump of the database and grep for all valid UTF-8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF-8; you can basically just reverse the logic.
Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the vast majority (replace ö with Latin-1 ö, etc.).
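As a pure-SQL starting point, this sketch flags rows in a latin1 column that contain at least one valid two-byte UTF-8 sequence; the table and column names are hypothetical, and three- and four-byte sequences would need extra ranges:
SELECT name
FROM clients
WHERE CONVERT(name USING BINARY)
      RLIKE CONCAT('[', UNHEX('C2'), '-', UNHEX('DF'), '][', UNHEX('80'), '-', UNHEX('BF'), ']');
-- C2-DF are the lead bytes of two-byte UTF-8 sequences; 80-BF is a valid continuation byte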

Related

How to debug invalid data in MySQL utf8mb4 column in Etherpad Lite database

We're running Etherpad Lite and we're trying to migrate the database from MySQL to PostgreSQL.
The 'value' column in the MySQL database uses the utf8mb4 character set. However, around 10% of all rows contain a value that is in fact encoded in Windows-1252 or ISO-8859-15 instead of UTF-8. How is this possible? Doesn't MySQL validate the UTF-8 before entering it into the column?
PostgreSQL cannot accept the invalid values during migration because it does validate the data, and it hits e.g. raw byte 0xE4 (ISO-8859-15: ä), which should be encoded as the byte sequence 0xC3 0xA4 in UTF-8.
Is this a known "feature" of MySQL? Is there any way to always get real UTF-8 from a utf8mb4 column?
If
you say the client is using latin1 (etc), and
you say the column is utf8 (or utf8mb4), and
you provide hex E4
Then all is well. The E4 will be converted to C3A4 during the INSERT, and that is what is stored. Do SELECT HEX(...) ... to verify.
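For instance, a quick check along those lines (table and column names are hypothetical):
SELECT name, HEX(name) FROM clients WHERE id = 1;
-- a correctly stored UTF-8 'ä' shows C3A4 here; E4 would mean latin1 bytes ended up in the column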
If
you say the client is using utf8 (or utf8mb4), and
you say the column is utf8 (or utf8mb4), and
you provide hex C3A4
Again, all is well. The C3A4 goes directly into the table.
Here's a messy case:
If
you say the client is using latin1, and
you say the column is utf8 (or utf8mb4), and
but you provide hex C3A4
Then MySQL is obligated to convert what it sees as two latin1 characters (Ã = C3 and ¤ = A4) into utf8, yielding C383C2A4. I call this "double encoding".
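The double encoding can be reproduced directly; in this sketch, the literal 0xC3A4 stands in for the bytes the client sent:
SELECT HEX(CONVERT(CONVERT(0xC3A4 USING latin1) USING utf8mb4));
-- the bytes are reinterpreted as the latin1 characters 'Ã' and '¤', then re-encoded: result C383C2A4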
Follow the Best Practice in "Trouble with UTF-8 characters; what I see is not what I stored" and use its suggested way to test the data. Then come back with more details.
Probably the only way for 10% of the data to be misinterpreted is for 10% of the data to be encoded differently. So please provide hex for a 10% example and for a 90% example, and provide the hex both in the client before inserting and in the table after it is inserted.
No solution is known. This is probably a bug in MySQL, which should disallow storing non-UTF-8 data when the client connection and the column type are both utf8mb4.
I no longer use MySQL for anything, so I haven't bothered to investigate this bug any further. Nowadays I'm using PostgreSQL for everything instead.

When to use what encoding on tables

I recently had to change MySQL from latin1 to utf8 to handle Russian characters. They were originally showing up as ?????.
I also had to change a couple of tables in my database to utf8mb4. I originally had these set to utf8, but that could not represent certain characters (MySQL's utf8 stores at most three bytes per character).
I have to make a change to a production database and want to ensure that I do not have any issues a few months down the line with a particular encoding type.
So my question is: when do I use which encoding on a table?
You have multiple questions.
The "???" probably came from converting from latin1 to utf8 incorrectly. The data is now lost, since only '?' remains. SELECT HEX(...) ... to confirm that all you get is 3F (?) where you should get something useful.
See "question marks" in Trouble with utf8 characters; what I see is not what I stored .
utf8mb4 and utf8 handle Cyrillic (Russian) identically, so the CHARACTER SET is not the issue with respect to the "???".
If you have an original copy of the data, then probably you want the 3rd item in here -- "CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset". That is what I call the two-step ALTER.
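A sketch of that two-step ALTER, assuming a hypothetical table t with a VARCHAR(100) column name:
ALTER TABLE t MODIFY name VARBINARY(100);                      -- step 1: drop the charset label, keep the bytes
ALTER TABLE t MODIFY name VARCHAR(100) CHARACTER SET utf8mb4;  -- step 2: relabel the same bytes as utf8mb4
The point is that neither step converts the data: the bytes already are UTF-8, so only the label on the column needs to change.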
As for avoiding future issues... See "Best Practice" in my first link. If all you need is European (including Russian), either utf8 or utf8mb4 will suffice. But if you want Emoji or all of Chinese, then go with utf8mb4.
Also, note that you must specify what charset the client is using; this is a common omission, and was probably part of what got you in trouble in the first place.

What should be the correct MySQL collation in this case?

I'm storing strings in a MySQL database.
Some of the strings have single quotes which then get stored like this:
People’s
Is this the proper way to store these strings or should I set a different mysql collation?
I have tried the following without luck....
utf8_general_ci
latin1_swedish_ci
Where are you setting the collation? You should be using UTF-8 in three places:
as the collation of each column that contains character data. You can set the default collation for the table or database so that new columns pick it up, but if you already have a table, ALTERing its default collation doesn't change the collation of the existing columns.
as the encoding of the connection between your application and MySQL. This can be set manually using the SET NAMES statement, or, better, with the suitable API call for your environment (for example mysql_set_charset() in PHP, or the charset argument to connect() in Python MySQLdb).
in your output. For example if producing a web page, by using the Content-Type: text/html;charset=utf-8 header/meta.
You can store the string "People’s" as UTF-8-hidden-in-Latin-1 "People’s" by using Latin-1 throughout, since you'll still get the same bytes out as you put in. But that way you won't get sensible results from ordering or case-insensitive comparisons of non-ASCII characters.
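Incidentally, the garbling in the question can be reproduced: the right single quote (U+2019) is encoded in UTF-8 as the three bytes E2 80 99, and read back as latin1 (which in MySQL is really cp1252) those bytes display as ’. A sketch:
SELECT CONVERT(0xE28099 USING latin1);
-- yields ’ : the UTF-8 bytes of U+2019 reinterpreted as latin1/cp1252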

mySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci" but most of the tables and columns inside are using CHARSET=latin1. Will I run into any problems with this?
The reason I ask is because we have been running into a lot of problems syncing data between two databases.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database, and a table have no character sets of their own; they just have defaults that are inherited downwards (server to schema to table). Columns of a CHAR, VARCHAR, or any TEXT type have character sets, on a per-column basis. If no specific character set is defined for them, they inherit from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem people have with it is lying to the server, that is, setting the character set of the connection to something different from what the client is actually sending or using. This can be done at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
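That conversion is easy to observe; this sketch assumes the connection character set is declared correctly, so the literal arrives intact:
SELECT HEX(CONVERT('ö' USING latin1)) AS latin1_bytes,
       HEX(CONVERT('ö' USING utf8mb4)) AS utf8_bytes;
-- F6 and C3B6 respectively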
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning, if you for example try to store Japanese hiragana in such a latin1 column.
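A sketch of that truncation, using a hypothetical throwaway table:
CREATE TABLE t_latin1 (s VARCHAR(10) CHARACTER SET latin1);
SET NAMES utf8mb4;
INSERT INTO t_latin1 VALUES ('ひらがな');
-- in strict SQL mode this fails with error 1366 (Incorrect string value);
-- otherwise the characters are stored as '?' and a warning is raised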
My experience with messing up MySQL charsets was sorting of strings that did not work 100% correctly. You would be better off having everything in UTF-8, to be on the safe side.
I think it depends on what you actually store in those columns. If you store UTF-8 multi-byte characters in a column with the latin1 charset, you might run into sorting troubles. But as long as there are only plain ASCII characters, you should be OK.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin1. That could be a problem. It depends on your data, as mentioned above.

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different from AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL, Accîdent comes back as Accîdent once inserted.
Also, AccÎdent comes back as AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
The data inserted by subsequent SQL statements will then use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table with a different encoding. MySQL will try to cope, but the side effect is what you see: data in the database looks "weird". When creating a table, you need to specify the character encoding, preferably UTF-8. For existing tables, you'll need to convert the data.
Second, tables have a "collation" besides the encoding. The encoding determines how characters map to bytes; the collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case-insensitive"); then your two strings would match.
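A sketch of that comparison (the _utf8 introducer makes the literals' character set explicit):
SELECT _utf8 'Accîdent' = _utf8 'AccÎdent' COLLATE utf8_general_ci AS same;
-- returns 1: utf8_general_ci compares case-insensitively, so î and Î are equal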