Rails, MySQL, Unicode data and latin1 tables - Where to go from here?

I'm not 100% sure on the particulars, so I'd love for someone to straighten me out, but I'll forge ahead with what I think is going on...
When I first set up my database, I used the default character encoding of the system without even thinking, and it was latin1. I never even thought about i18n/l10n. It just didn't occur to me. I just accepted the defaults and went with it.
Anyways, I've been using the database exclusively for a Rails app, and we've now got several GB of data, 100,000s of rows, and many international users. I've noticed that many of our foreign users are inserting data that seems to be Unicode / non-latin1. Here is an example:
What about crazy Unicode stuff? ☢ ☠ ☭
database.yml
Here is our database.yml file.
development:
adapter: mysql
database: XXX
username: YYY
password: ZZZ
host: localhost
encoding: utf8
As you can see, we're setting our character encoding to utf8. However, all our tables have a default character set of latin1. I'm sure of this.
Update: After looking closely, our production database.yml does not specify an encoding, while my local copy was specifying utf8. This was causing problems when I would dump the production database and import it locally. It now seems that the import was working fine, but Rails was reading it incorrectly.
mysql CLI tool
When I view the data via the mysql CLI tool, it displays all the Unicode characters correctly. However, the 'show create table' statement clearly shows that the tables are default charset latin1. This leads me to believe that MySQL is somehow smart enough to store non-latin1 data.
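One way to check which side is doing the conversion is to compare the table definition with what the connection negotiated. A quick diagnostic sketch (the table name here is hypothetical):
SHOW CREATE TABLE my_table; -- shows the declared charset, e.g. DEFAULT CHARSET=latin1
SHOW SESSION VARIABLES LIKE 'character_set%'; -- what the CLI negotiated for client/connection/results
If the CLI connects with a latin1 connection charset, the bytes in latin1 columns pass through untranslated, so UTF-8 that went in as raw bytes comes back out as raw bytes and renders correctly in a UTF-8 terminal.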
HTTP header
Our HTTP Content-Type header is set to utf-8, like so:
Content-Type: text/html; charset=utf-8
Conversion Attempts
I've played a little with converting our tables to utf-8 encoding, all with no success. Mainly I tried dumping the database, running iconv to convert, then re-importing with the tables set to utf-8. MySQL had no errors, but all the Unicode data was garbled.
What to do?
I'm kind of stuck as to what to do (if anything). I'm a strong believer in not fixing what isn't broken, but this whole situation worries me. We've never had any complaints from users about not being able to store their data, and everything seems to be working fine. I'd just like to know what exactly is going on, who/what is doing the conversion (MySQL? Ruby? Rails? MySQL connection?), and any tips on how to proceed.

Most likely the data stored in your tables is valid UTF-8, but MySQL thinks it's Latin-1 (because that's the datatype the column was declared with). It is also valid Latin-1, of course, since AFAIK any arbitrary sequence of bytes is valid Latin-1.
What happens when you convert to UTF-8 is that MySQL sees valid Latin-1 encoded data and converts that to the equivalent valid UTF-8. This means that you get data that's double-UTF-8-encoded, which is why it is garbled.
The way to get around this is to convert the column to a binary string and then to UTF-8 from there. MySQL does not convert the string when you do this (because you're converting it via a format that just says, "treat this string as a series of 0s and 1s").
-- Run as two separate statements: the trip through binary reinterprets
-- the existing bytes instead of converting them.
ALTER TABLE MyTable MODIFY MyColumn CHAR(100) CHARACTER SET binary;
ALTER TABLE MyTable MODIFY MyColumn CHAR(100) CHARACTER SET utf8;
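For TEXT columns the same round trip goes through BLOB, since CHAR doesn't apply; a hedged sketch, assuming a hypothetical posts.body column:
ALTER TABLE posts MODIFY body BLOB; -- relabel the stored bytes as raw binary
ALTER TABLE posts MODIFY body TEXT CHARACTER SET utf8; -- relabel them as utf8 without conversion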

What worked for me (and others) was to use the mysql2 adapter.
In your Gemfile:
gem "mysql2"
In config/database.yml:
adapter: mysql2
And remember to set your database character set to UTF-8 as well, though as I understand it, you've already done that :-)
Hope this helps!

Related

UTF-8 encoding problem while importing a sql file

I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export an SQL dump using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
As the server's my.ini file has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now when I start HeidiSQL, it shows utf8mb4 instead of utf8 for the session parameters; I don't know why.
Now, I import my dumped file, and even though everything is apparently configured as utf8, it looks like I have some encoding problems: on the server the data displays correctly, but locally, in HeidiSQL, special characters like à are not displayed correctly.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the sql file it does not fix the issue, and also values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze's comment, I was able to fix the issue.
In HeidiSQL, when I choose the SQL file to execute, there's actually an "Encoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set set correctly. The fact that HeidiSQL displays a different character set is probably because clients themselves set a character set.
For example, your mysql server might use "Character set A" by default. If a client connects and says they want "Character set B", the server will convert this on the fly.
utf8mb4 is a superset of (and superior to) utf8. It's better to have your server default to utf8mb4. The most popular use case for utf8mb4 is emoji.
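If you do decide to move a table up to utf8mb4, the conversion itself is a one-liner; a sketch (the table name is hypothetical):
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; -- converts existing rows and changes the default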
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8859-1/latin1 instead".
The server happily complies and will convert the client's ISO-8859-1 strings to UTF-8 on the fly.
Despite the client claiming to use ISO-8859-1, it actually sends UTF-8.
The server thinks the data is ISO-8859-1 and treats it as such, running the UTF-8 bytes through an ISO-8859-1-to-UTF-8 conversion. It's effectively a double encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
If this is correct, the process is reversible.
You really just need the opposite operation. For example, if you had a PHP string $data, which is 'double-encoded' as UTF-8, the process would simply be to call this:
$output = utf8_decode($data);
It's also possible to fix this in MySQL. See this stack overflow question.
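A common MySQL-side version of that reverse operation (a sketch, not verified against your data; the table and column names are hypothetical, so back up first) forces the double-encoded text back to raw bytes via latin1 before relabeling it:
UPDATE my_table
SET my_column = CONVERT(BINARY(CONVERT(my_column USING latin1)) USING utf8); -- undoes one layer of encoding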
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
Sidenote: This problem is extremely common. You are somewhat lucky that you're French, because it highlights the problem. Many English systems I've seen have this issue, but it largely goes unnoticed for a long time because a lot of text doesn't go outside the common ASCII range.
You have "Mojibake". à turns into à (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See the Mojibake section of "Trouble with UTF-8 characters; what I see is not what I stored" for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.
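To illustrate what that HEX check can tell you, here is a sketch (table and column names hypothetical) with the byte patterns for à:
SELECT col, HEX(col) FROM my_table LIMIT 10;
-- C3A0     = correctly stored UTF-8 'à' (plain Mojibake happens only on display)
-- C383C2A0 = double encoding: the UTF-8 bytes were themselves re-encoded as UTF-8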

Should I alter my non utf-8 database (mysql) (default?) to utf-8 on a running Drupal 7 site?

I've got a running Drupal 7 site on MySQL. I just noticed that my database encoding (and collation) is something other than UTF-8, while the Drupal 7 docs say it needs UTF-8 (with utf8_general_ci collation). At the same time, all my tables are the required utf8 with utf8_general_ci. (I guess the Drupal setup did it this way.)
My questions are:
Should I leave the whole system as it is, or should I just alter my database to the required encoding? Is it necessary to convert anything after I alter the database to utf-8?
Would leaving the whole system as it is cause me any trouble in the future?
Is this setting for the database just a default that doesn't matter for me at all, since all my tables are set to the proper utf-8?
Thanks
You may or may not already be in trouble.
When you connect, do you establish the connection as being utf8? (SET NAMES utf8, or whatever Drupal does to achieve that.)
Is the data in your client encoded utf8? (For English, this does not matter.)
The CHARACTER SET of every column that currently exists is already set in stone. Even if some columns are latin1 and some are utf8, the client will see them as whatever SET NAMES has established. The inconsistency does not 'hurt'. At least not until you try to store Arabic in a latin1 column.
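Before deciding, it may help to inventory what you actually have; a sketch using information_schema (the schema name is hypothetical):
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'drupal_db' -- replace with your schema name
AND character_set_name IS NOT NULL; -- only text columns carry a charset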
I would not blindly try to change anything without understanding how to correctly change it. Some attempts could make things worse.

Detailed instructions on converting a MYSQL DB and its data from latin to UTF-8. Too much diff info out there

Can someone please provide the best way to convert not only a MySQL database and all its tables from latin1_swedish_ci to UTF-8, but their contents as well? I have been researching all over Stack Overflow as well as elsewhere, and the suggestions are always different.
Some people suggest just using these commands on the tables and databases:
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Others say that this just changes the database and tables, but not the contents.
Some suggest dumping the db, creating a new database with the right charset and collation, and importing the old dump into that. Does this actually convert the data as well?
mysqldump --skip-opt --set-charset --skip-set-charset
Others suggest running iconv against the dumped DB before importing. Is this really needed, or would importing into a UTF-8 db do the conversion?
Finally, others suggest altering the database, converting char/blob columns to binary, and then converting back.
There are so many different methods that it has become very confusing.
Can someone please provide concise step-by-step instructions, or point me to some, on how I can go about converting my latin1 DBs and their content to UTF-8? Even better if there is a script that automates this process against a database.
Thanks in advance.
There are two different problems which are often conflated:
change the specification of a table or column on how it should store data internally
convert garbled mojibake data to its intended characters
Each text column in MySQL has an associated charset attribute, which specifies what encoding text stored in this column should be stored as internally. This only really influences what characters can be stored in this column and how efficient the data storage is. For example, if you're storing a ton of Japanese text, sjis as an encoding may be a lot more efficient than utf8 and save you a bit of disk space.
The column encoding does not in any way influence in what encoding data is input and output to/from the database. This is a separate setting, the connection encoding, which is established for every individual client every time you connect to the database. MySQL will convert data on the fly between the connection encoding and the column/table charset as needed. You can connect to the database with a utf8 connection, send it Japanese text destined for an sjis column, and MySQL will convert from utf8 to sjis on the fly (and back in reverse on the way out).
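For instance, this is all it takes for a client to pick its side of the conversation (a minimal sketch):
SET NAMES utf8; -- "I will send UTF-8 and want results back in UTF-8"
SELECT @@character_set_client, @@character_set_results; -- both now report utf8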
Now, if you've screwed up the connection encoding (as happens way too often) and you've inserted text in a different encoding than your connection encoding specified (e.g. your connection encoding was latin1 but you actually sent UTF-8 encoded data), then you're storing garbage in your database and you need to recover that. If that's your issue, see How to convert wrongly encoded data to UTF-8?.
However, if all your data is peachy and all you want to do is tell MySQL to store data in a different encoding from now on, you only need this:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
MySQL will convert the current data from its current charset to the new charset and store future data in the new charset. That's all.
Here is an example from the Moodle community:
https://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8
(Scroll down to "Explained".)
The author first does an SQL dump, which produces a big SQL file. Then he copies the file. After that, he makes encoding corrections with sed on the copy. Finally, he imports the copied and corrected SQL dump file back into the database.
I can recommend this because with these simple steps it is easy to check that each one has been done right. If something goes wrong, just go back to the last step and try it another way.
Use the MySQL Workbench to handle this. http://dev.mysql.com/doc/workbench/en/index.html
Run the migration wizard to produce a script that will create the database schema.
Edit that script to alter the collation and character set (notepad++ search-and-replace is just fine for this) and the schema name so you don't overwrite the existing database.
Run the script to create the copy under a new name.
Use the migration wizard to bulk transfer the data to the new schema. It will handle all the conversion for you and ensure that your data is still good.

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ASCII characters were displayed incorrectly.
Our first attempt to solve it was to change the MySQL column encoding to UTF-8 (as described, for example, here):
alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't help.
Finally, we solved the problem by importing the data from the .sql script with an additional command-line flag, which I believe forced the mysql client to treat the data in the .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped, but then we realized that this time we had forgotten to change the column encoding to utf8; it was still set to latin1 even though utf-8 encoded data was flowing through the database (from SQL script to application).
So if data obtained from the database is displayed correctly even if the database character set is set incorrectly, then why the heck should I bother setting the correct database encoding?
Especially I would like to know:
What parts of the database rely on the column encoding setting? When does this setting have any real meaning?
On what occasions is an implicit conversion of the column encoding done?
How does the trick of converting a column to binary format and then to the destination encoding work (see the SQL snippet above)? I still don't get it.
Hope someone can help me clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way too often that you need to check data in the database, and if you cannot properly input UTF-8 strings coming from the web page into your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database because the web is the only “correct” client, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited in the number of possible tools, given that only the web app produces the correct output;
if you need to deliver a partial database extract to a partner or external company for analysis, and the extract is messed up, it's a pity.
Now to your questions:
When you ask the database to ORDER BY some column of a string data type, the sorting rules take into account the encoding of your column, as some internal transformations are applicable in case you have different encodings for different columns. The same applies if you're trying to compare strings; the encoding information is essential here. Encoding comes together with collation, although most people don't use this feature very often.
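For example (hypothetical table, assuming a utf8 connection), the collation decides both ordering and equality:
SELECT name FROM people ORDER BY name COLLATE utf8_unicode_ci; -- force accent-aware ordering rules
SELECT 'é' = 'e' COLLATE utf8_general_ci; -- returns 1: this collation treats them as equal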
As mentioned, if you have a set of columns in different encodings, the database will implicitly convert values to a common encoding, which is UTF-8 nowadays. Implicit re-encoding of strings might also be done in the client frameworks/libraries, depending on the client's environment encoding. Typically, data is recoded into the database's encoding when sent to the server and back into the client's encoding when results are delivered.
Binary data has no notion of encoding; it's just a set of bytes. So when you convert to binary, you're telling the database to “forget” the encoding, although you keep the data unchanged. Later, you convert back to a string, enforcing the right encoding. This trick helps if you're sure that the data physically is in UTF-8 while, by some accident, a different encoding was specified.
Given that you managed to load the data into the database only by using --default-character-set=utf8, something was off in your environment; I suspect it was not set up for UTF-8.
I think the best practice today would be to:
have all your environments UTF-8 ready, including shells;
have all your databases default to UTF-8 encoding.
This way you'll have less room for error.

Ruby - mysql2 driver changing encoding / various utf-8 issues

I have an API running on Sinatra. It queries a MySQL database and returns data in JSON or XML format. I'm having a problem with Unicode data. If I query the production database from the console, I'll get the data correctly:
persönlichen
However, in my API results (or if I were to query the database in irb using the mysql2 gem), I get this:
persÃ¶nlichen
Everything works swimmingly on my development box, which is confounding my efforts to solve the problem.
I have done everything I can to make sure that the database is utf-8 only (encodings, collations, client and server character sets are all utf-8). I'm using the mysql2 driver, which supposedly forces everything to utf-8. I'm setting :encoding => 'UTF8' on my active record connection.
What am I missing?
I was able to nail the problem down: the data wasn't encoded correctly in the database. I was populating my database using an SQL dump file; I added this to the top, and everything worked great:
set names utf8;
create database if not exists `my_db_name` CHARACTER SET utf8 COLLATE utf8_general_ci;