Databases: column encoding, when is it important? - mysql

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is being displayed on page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ascii characters was displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;`
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't helped.
Finally we solved this problem by importing data from .sql script with additional command line flag which as I believe forced mysql client to treat data from .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone help me to clear things up...

The biggest reason, in my view, is that it breaks your DB consistency.
it happens way to often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page to your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited by the number of possible choices, given only web is producing your the correct output;
if you need to deliver a partial database extract to your partner or external company for analysis, and extract is messed up — it's a pity.
Now to your questions:
When you ask database to ORDER BY some column of string data type, then sorting rules takes into account the encoding of your column, as some internal trasformation are applicable in case you have different encodings for different columns. Same applies if you're trying to compare strings, encoding information is essential here. Encoding comes together with collation, although most people don't use this feature so often.
As mentioned, if you have any set of columns in different encodings, database will choose to implicitly convert values to a common encoding, which is UTF8 nowadays. Strings' implicit encoding might be done in the client frameworks/libraries, depending on the client's environment encoding. Typically data is recoded into the database's encoding when sent to the server and back into client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
Given that you've managed to load in data into the database by using --default-character-set=utf8 then there was something to do with your environment, I suggest it was not UTF8 setup.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less field for errors.

Related

Detailed instructions on converting a MYSQL DB and its data from latin to UTF-8. Too much diff info out there

Can you someone please provide the best way to convert not only a mysql database and all its tables from latin1_swedish_ci to UTF-8, with their contents? I have been researching all over Stackoverflow as well as elsewhere and the suggestions are always different.
Some people suggest just using these commands on the tables and databases:
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Others say that this just changes the database and tables, but not the contents.
Some suggest dumping the db, create a new table with the right char set and collation, and importing the old db into that. Does this actually convert the data as well?
mysqldump --skip-opt --set-charset --skip-set-charset
Others suggest running iconv against the dumped DB before importing? Is this really needed or would the import into a UTF-8 db do the conversion?
Finally, other suggest altering the database, convert char/blog tables to binary, and the converting back.
There are so many different methods that it has become very confusing.
Can someone please provide a concise step-by-step instruction, or point me to one, on how I can go about convert my latin DBs and their content to UTF-8? Even better if there is a script that automates this process against a database.
Thanks in advance.
The are two different problems which are often conflated:
change the specification of a table or column on how it should store data internally
convert garbled mojibake data to its intended characters
Each text column in MySQL has an associated charset attribute, which specifies what encoding text stored in this column should be stored as internally. This only really influences what characters can be stored in this column and how efficient the data storage is. For example, if you're storing a ton of Japanese text, sjis as an encoding may be a lot more efficient than utf8 and save you a bit of disk space.
The column encoding does not in any way influence in what encoding data is input and output to/from the database. This is a separate setting, the connection encoding, which is established for every individual client every time you connect to the database. MySQL will convert data on the fly between the connection encoding and the column/table charset as needed. You can connect to the database with a utf8 connection, send it Japanese text destined for an sjis column, and MySQL will convert from utf8 to sjis on the fly (and back in reverse on the way out).
Now, if you've screwed up the connection encoding (as happens way too often) and you've inserted text in a different encoding than your connection encoding specified (e.g. your connection encoding was latin1 but you actually sent UTF-8 encoded data), then you're storing garbage in your database and you need to recover that. If that's your issue, see How to convert wrongly encoded data to UTF-8?.
However, if all your data is peachy and all you want to do is tell MySQL to store data in a different encoding from now on, you only need this:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
MySQL will convert the current data from its current charset to the new charset and store future data in the new charset. That's all.
Here is an example from the Moodle community:
https://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8
(Scroll down to "Explained".)
The author does first an SQL dump, which is a big SQL file. Then he copies the file. After, he makes coding corrections with sed on the copied file. Finally he imports the copied and corrected SQL dump file back into the database.
I can recommend this because with this single steps it is easy to inspect if they have been done right. If something goes wrong, just go back to the last step and try it another way.
Use the MySQL Workbench to handle this. http://dev.mysql.com/doc/workbench/en/index.html
Run the migration wizard to produce a script that will create the database schema.
Edit that script to alter the collation and character set (notepad++ search replace is just fine for this) and the shema name so you don't overwrite the existing database.
Run the script to create the copy under a new name.
Use the migration wizard to bulk transfer the data to the new schema. It will handle all the conversion for you and ensure that your data is still good.

Storing swedish characters in mysql database

I'm having problems storing Swedish characters in my MySQL database. I want to store them in my table called users with the collation utf8-bin. Even though I'm using utf8, the characters å ä ö gets stored as Ã¥ ä ö and I don't know why. Retrieving the data and echoing it gives me the same output, with the weird characters instead of å ä ö. Any help is appreciated.
Call
mysql_set_charset("utf8");
After connecting and before making any queries.
Your database charset is just for storage, not for transmission between app and database.
There are several places, where you have to pay attention to the encoding.
Database: you already use an utf8 collation, so that's fine
Database connection: use mysqli_set_charset to set the charset of the connection, if you're using mysqli. Other database drivers have similar functions.
Output encoding of the page: You can use HTTP headers or meta tags. If you want to be on the safe side, specify both.
You should make sure that the database connection uses the Swedish encoding and the encoding of the page output is correct as well. Different encoding causes many of these problems. Read more about character encodings here.
There are several parameters that you have to consider here. For this to work well now and in the future. ALL different interactions with the text has to be in same encoding. Even within db (for joins to work well etc).
The encoding of the data beeing inserted (set in header of page and / or utf_8 encoding when inserted).
The encoding in db tables, (i would recommend utf8_swedish for all)
The encoding of the page viewing results from db (set this in header)
The encoding of the page beeing edited. Its possible to open documents in different encoding. This is a big issue if you are not familiar with it. Open and save documents in right encoding.
There use to be a problem concerning the connection encoding to, set this correct, but today it is a smaller problem than a couple of years ago, because of changes.
A couple of notes. Are you sure your data is stored like that, or just presented wrong, via for instance phpmyademin? Try to print with print utf8_encode($text)
Or, utf8_decode() function, that gives you some insight...

migrating mysql DBs from one host to another, encoding issues

I am migrating a very large number of mysql DBs from a few shared web hosts to one shared web host.
The majority of these are Portuguese, so there's quite a few special characters. Some of the DBs which I am migrating are in latin1, some are cp1251, some are utf8.
Of course, simply dumping the DBs, and then restoring the dumps onto the new host completely botches the encoding and "?" characters and other nonsense shows up in the actual websites associated with the databases.
On a small scale, it would be acceptable to muck about with the html charset tags, to know what to dump/restore as, but the problem is that we're dealing with thousands databases and websites, and the migrations are all done automatically via several scripts.
I'm looking for suggestions on the best way of dumping/restoring these DBs assuming that the script doing the work will not know the encoding which is specified in the HTML tags.
So far, I have tried using the actual mysqldump tool, as well as mimicking it with a php script, and dumping to and from memory instead of to and from a text file, neither of these seem to replicate the data perfectly from one to the other without encoding issues.
Should I be using UTF8 to encode the dump, then restoring as is regardless of the html codepage?
Dumping and restoring both in UTF8 regardless of HTML codepage?
Dumping and restoring in the default charset found in each create table statement?
My understanding of the implications and effects of these different scenarios is limited, but what I need to know is basically if there is a way to perfectly replicate data without encoding issues between 2 database servers without knowing the codepage used by the HTML of the script which is accessing the data.
Encodings are a very difficult problem to tackle, especially when moving databases. Try first to do a structural import, and then compare exactly the new structure with the old one, taking special care in database character set, table default character set and columns character sets. You can get these informations very easily from the information_schema database.
Once those are absolutelly mirorred, you can begin the import. However, beware of the fact that you can hold characters in differend encoding types in differend encoded columns (it is quite common to have utf8 valid characters in a latin1 column, latin 1 is a 1 byte character set, while utf8 can have characters of up to 3 bytes).
You can try various methods after this to convert the dumps but as far as i know so far there is not a 100% valid method to convert this type of cases of mixed encoding types in same column. Eventually you might need to do some manual cleanup. But hopefully the first approach will suffice, and everything will be fine.

Accent insensitive search on a problematic database

I have a database that contains data in different languages. Some languages use accents (like áéíóú) and I need to search in this data as the accents doesn't exist (search for 'campeon' should return 'campeón' as a valir result).
The problem is that the tables in my database (utf8_unicode_ci) are not storing utf8 characteres. If you see the data through phpmyadmin the words with accents looks like this: campeón
After some researching, I've found (in a StackOverflow question) that the problem is related to the inexistence of a SET NAMES [charset]. In fact, I've made some testings and if I set names to utf8, everything works as expected.
Well, I have the solution, what's the problem? The problem is that the database is in production, so there are thousands of strings in the database. If I change the character set the client will use, all already existing string will become invalid. The question is: is there any way to:
perform accent-insensitive searches in a database that uses a wrong charset like mine?
transform safely the data in the tables to the appropriate charset?
continue working with mixed charsets (latin1 and utf8) in the database, assuming that latin1 data will not be accent-insensitive?
If anybody has experience in any of the solutions I propose or has a new one, I'll be very thankful if share.
The problem being that the data was inserted using the wrong connection encoding, you can fix it by
Exporting the data using the wrong connection encoding, just like you have used it thus far, followed by
Importing the data using the correct utf8 connection encoding.
That will fix the encoding problem, after which search will work as expected.
What if you create a copy of the table at the beginning of your session, alter the copy's charset, perform all your queries from that, and then drop the table at the end of your session? I don't know how practical this would be - depends on how often you need to perform these queries and how big the table is.

which database supports to insert vietnamese characters?

I tried inserting Vietnamese characters into MySQL database through my java program. It is getting inserted but certain characters are being inserted as junk. And while trying to retrieve, i'm getting the same junk values in place of some characters. Can anyone tel me what should be done? Is there a problem in MySQL or is there any DB that supports these characters?
Example of ‘junk’, and code?
In general you need to make sure:
your tables are created with UTF-8 collation on all text columns. This can be done at several levels: config default-character-set=utf8, db CREATE DATABASE ... DEFAULT CHARACTER SET utf8, table CREATE TABLE ... DEFAULT CHARACTER SET utf8, and column column VARCHAR(255) CHARACTER SET utf8. After the initial creation you can only do it by ALTER on the columns; changing the default character sets won't change the column.
that your connection to the database is in UTF-8 encoding, by specifying useUnicode=true and characterEncoding=UTF-8 properties in your connection string or properties. Ensure you have an up-to-date MySQL Connector as there have been grievous bugs here in the past.
that nothing else in your processing stream is mangling the characters before they get to the database connection, or on the way back out. Ensure you aren't using the default encoding anywhere because it is probably wrong. Setting the flag -Dfile.encoding=UTF-8 may help with that as a temporary workaround, but you don't want to rely on it.
(And if part of your testing involves printing to the terminal, be aware that the Windows command prompt won't be able to do anything with UTF-8 so you will definitely see junk there.)
Hi there no problem to store vietnamese characters, but check mysql FAQ first:
http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html