MySQL Collation Issue

In my company, the tables in the database were poorly created. Each table has a different collation and charset.
This is very bad, sure, but it makes queries lose a lot of performance, to the point that the server crashes (and it isn't even a big database...).
I would like to know if there are any good MySQL tools, commands or procedures for converting table collation and charset.
Just executing ALTER TABLE with CONVERT breaks special characters. Is this normal, or am I doing something wrong?
EDIT:
As an example: I have a table finance with a utf8 collation and a table expense with latin1 (Swedish collation). Each table has between 1000 and 5000 rows. The following query takes about 15 seconds to execute:
select ex.* from expense ex
inner join finance fin on fin.ex_id = ex.id
Much more complex queries on bigger tables run much faster when the tables have the same collation.
EDIT 2:
Another error in the database: row IDs are all varchar(15), not int.

I know the fun of inheriting legacy schemas created by folks who think 'collation' is some form of illness.
The best option is to export the table with its data to a SQL dump file using good ole' mysqldump. Then modify the CREATE statements manually in the dump file to set the character set and collation. I'm a big fan of 'utf8'. If the dump file is huge, use command-line tools like sed to edit the file efficiently without having to open it in an editor.
Then drop the existing table and re-import the modified dump.
Any other way you do this in my experience can be a roll of the dice.
This might be a good time to convert them all to the same storage engine as well or upgrade your MySQL server to 5.5.
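For illustration: after the edit, a CREATE statement in the dump should end up looking something like this (table and columns here are placeholders, not from a real dump):
CREATE TABLE `finance` (
`id` varchar(15) NOT NULL,
`ex_id` varchar(15) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;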

I don't recommend using a "tool" to fix this.
BEFORE YOU DO ANYTHING DUMP YOUR DB TO HAVE A BACKUP IN CASE YOU MESS IT UP ;)
You can streamline your character sets and collations in two ways:
Method 1: Move your data
Create a completely new database with correct character sets and collations configured in all tables
Fill your new tables with INSERT SELECT statements
e.g.
INSERT INTO newdatabase.table SELECT * FROM olddatabase.table
MySQL will automatically convert your data into the correct character set
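A slightly fuller sketch (using a made-up expense table): create the target table with the desired charset first, then fill it, and the conversion happens on the way in:
CREATE TABLE newdatabase.expense (
`id` varchar(15) NOT NULL,
`description` varchar(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO newdatabase.expense SELECT * FROM olddatabase.expense;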
Method 2: Alter your tables
If you change the character set of an existing table, all existing contents will be converted as well.
e.g.
old table
CREATE TABLE `myWrongCharsetTable` (
`name` varchar(255) COLLATE latin1_german1_ci NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
put some data in for demo
INSERT INTO `myWrongCharsetTable` (`name`) VALUES ( 'I am a latino string' );
INSERT INTO `myWrongCharsetTable` (`name`) VALUES ( 'Mein Name ist Müller' );
INSERT INTO `myWrongCharsetTable` (`name`) VALUES ( 'Mein Name ist Möller' );
SELECT * FROM myWrongCharsetTable INTO outfile '/tmp/mylatinotable.csv';
On a UTF-8 console I do this
# cat /tmp/mylatinotable.csv
I am a latino string
Mein Name ist M▒ller
Mein Name ist M▒ller
right, strange charset.. this is latin1 displayed on a UTF-8 console
# cat /tmp/mylatinotable.csv | iconv -f latin1 -t utf-8
I am a latino string
Mein Name ist Müller
Mein Name ist Möller
Yep, all good
So how do I fix this now??
ALTER TABLE myWrongCharsetTable
MODIFY name varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
DEFAULT CHARSET = utf8 COLLATE utf8_unicode_ci;
That's it :)
Writing the outfile again
mysql> SELECT * FROM myWrongCharsetTable INTO outfile '/tmp/latinoutf8.csv';
Query OK, 3 rows affected (0.01 sec)
mysql> exit
Bye
dbmaster-001 ~ # cat /tmp/latinoutf8.csv
I am a latino string
Mein Name ist Müller
Mein Name ist Möller
Worked, all fine and we're happy
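Side note: instead of listing every column in a MODIFY, MySQL can also convert all character columns of a table in one statement (this converts the stored data too, just like the MODIFY above):
ALTER TABLE myWrongCharsetTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;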
EDIT:
There's actually another method
Method 3: Dump, modify and reload your data
If you're good with sed and awk you can automate this, or edit the file manually
# dump the structure, possibly routines and triggers
mysqldump -h yourhost -p -u youruser --no-data --triggers --skip-comments --routines yourdatabase > database_structure_routines.sql
# dump the data
mysqldump -h yourhost -p -u youruser --no-create-info --skip-triggers --skip-routines yourdatabase > database_data.sql
Now open the database_structure_routines.sql in an editor of your choice and modify the tables to your needs
I recommend dropping all the conditional comments like /*!40101 SET character_set_client = utf8 */ in your dump file, because these could override your new table defaults
When you're done, create a new database and structure
mysql > CREATE DATABASE `newDatabase` DEFAULT CHARSET utf8 COLLATE utf8_unicode_ci;
mysql > use `newDatabase`
mysql > SOURCE database_structure_routines.sql;
Don't forget to recheck your tables
mysql > SHOW CREATE TABLE `table`;
If that's all right, you can re-import your data; the charset conversion will again be done automatically
mysql -h yourhost -p -u youruser newDatabase < database_data.sql
Hope this helps

You could try using CONVERT or CAST to change the charset: create a new column and use CONVERT/CAST to fill the new column with the corrected charset.
http://dev.mysql.com/doc/refman/5.0/en/charset-convert.html
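A rough sketch of that approach (table and column names are made up; CONVERT(... USING ...) re-encodes a value into the target charset):
ALTER TABLE expense ADD COLUMN description_utf8 VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
UPDATE expense SET description_utf8 = CONVERT(description USING utf8);
-- check the results, then drop the old column and rename the new one
ALTER TABLE expense DROP COLUMN description;
ALTER TABLE expense CHANGE description_utf8 description VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;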

Related

How does mysqldump write binary data into files for MySQL logical backup?

I am using mysqldump to back up a table. The schema is as follows:
CREATE TABLE `student` (
`ID` bigint(20) unsigned DEFAULT NULL,
`DATA` varbinary(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I can use the following command to back up the data in the table:
mysqldump -uroot -p123456 tdb > dump.sql
Now I want to write my own code using the MySQL C interface to generate a file similar to dump.sql.
So I just:
read the data and store it in a char *p (using mysql_fetch_row);
write the data into the file using fprintf(f,"%s",p);
However, when I check the table fields written into the file, I find that the files generated by mysqldump and by my own program are different. For example,
one data field in the file generated by mysqldump
'[[ \\^X\í^G\ÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ'
one data field in the file generated by my program
[[ \^Xí^GÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ
So, my question is: why is writing the data with fprintf(f,"%s",xx) not a correct way to back it up? Is it enough to just add ' at the front and end of the string? If so, what if the data in that field happens to contain ' itself?
Also, I wonder what it means to write unprintable characters into a text file.
Also, I read stackoverflow.com/questions/16559086 and tried the --hex-blob option. Is it OK if I transform every byte of the binary data into hex form and then write plain text strings into dump.sql?
Then, instead of getting
'[[ \\^X\í^G\ÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ'
I got something like
0x5B5B095C18ED07D1586009433BB751F95E44F4378ABC217B9661D3B98C0948C0614872111EBD6EF70BFE98495A9F33FE
All the characters are printable now!
However, if I choose this method, I wonder whether I could run into problems with encoding schemes other than latin1.
Also, the above are all my own ideas; I also wonder if there are other ways to back up data using the C interface.
Thank you for your help!
latin1, utf8, etc are CHARACTER SETs. They apply to TEXT and VARCHAR columns, not BLOB and VARBINARY columns.
Using --hex-blob is a good idea.
If you have "unprintable characters" in TEXT or CHAR, then either you have been trying to put a BLOB into such a column -- naughty -- or the print mechanism is not set for the appropriate charset.
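To illustrate with the student table from the question: HEX() renders the VARBINARY column as printable text, and UNHEX() (or a 0x literal) turns it back into the exact original bytes, which is essentially what --hex-blob does in a dump:
SELECT ID, HEX(DATA) FROM student; -- printable hex instead of raw bytes
INSERT INTO student (ID, DATA) VALUES (1, UNHEX('DEADBEEF')); -- restores the exact bytes
INSERT INTO student (ID, DATA) VALUES (2, 0xDEADBEEF); -- a hex literal works too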

Store bullet point unicode characters in Mysql and UTF8

I am reading data from a CSV text file using ColdFusion and inserting it into a table. The database is UTF-8, and the table is UTF-8.
This string •Detroit Diesel series-60 engine keeps getting stored in the Description field as
â€¢Detroit Diesel series-60 engine. (This is what I get straight from the database, not what is displayed in the browser.)
I can manually insert the string into a new record from the command line, and the characters are correctly preserved, so UTF-8 must support the bullet character. What can I be doing wrong?
Datasource connection string:
this.datasources["blabla"] = {
class: 'org.gjt.mm.mysql.Driver'
, connectionString: 'jdbc:mysql://localhost:3306/blabla?useUnicode=true&characterEncoding=UTF-8&jdbcCompliantTruncation=true&allowMultiQueries=false&useLegacyDatetimeCode=true'
, username: 'nottellingyou'
, password: "encrypted:zzzzzzz"
};
CREATE TABLE output, minus several columns
CREATE TABLE `autos` (
`VIN` varchar(30) NOT NULL,
`Description` text,
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
In addition, I've run
ALTER TABLE blabla.autos
MODIFY description TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Full code of import file here: https://gist.github.com/mborn319/c40573d6a58f88ec6bf373efbbf92f29
CSV file here. See line 7: http://pastebin.com/fM7fFtXD
In my CFML script, I tried dumping the data per the suggestion from @Leigh and @Rick James. I then saw that the characters were garbled BEFORE insertion into MySQL. Based on this, I realized I needed to specify the charset when reading the file.
<cffile
action="read"
file="#settings.csvfile#"
variable="autodata"
charset="utf-8">
Result: •Detroit Diesel series-60 engine. This can now insert correctly into the database.

\u00a0 becomes Â in MySQL database [duplicate]

I have my database properly set to UTF-8 and am dealing with a database containing Japanese characters. If I do SELECT *... from the mysql command line, I properly see the Japanese characters. When pulling data out of the database and displaying it on a webpage, I see it properly.
However, when viewing the table data in phpMyAdmin, I just see garbage text, i.e.
ç§ã¯æ—¥æœ¬æ–™ç†ãŒå¥½ãã§ã™ã€‚日本料ç†ã‚...
How can I get phpMyAdmin to display the characters in Japanese?
The character encoding on the HTML page is set to UTF-8.
Edit:
I have tried an export of my database and opened up the .sql file in geany. The characters are still garbled even though the encoding is set to UTF-8. (However, doing a mysqldump of the database also shows garbled characters).
The character set is set correctly for the database and all tables ('latin' is not found anywhere in the file)
CREATE DATABASE `japanese` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
I have added the lines to my.cnf and restarted mysql but there is no change. I am using Zend Framework to insert data into the database.
I am going to open a bounty for this question as I really want to figure this out.
Unfortunately, phpMyAdmin is one of the first PHP applications that talks to MySQL about charsets correctly. Your problem is most likely due to the fact that the database does not store correct UTF-8 strings in the first place.
In order to display the characters correctly in phpMyAdmin, the data must be correctly stored in the database. However, converting the database to the correct charset often breaks web apps that are not aware of the charset-related features provided by MySQL.
May I ask: is MySQL > version 4.1? What web app is the database for? phpBB? Was the database migrated from an older version of the web app, or an older version of MySQL?
My suggestion is not to bother if the web app you are using is too old and unsupported. Only convert the database to real UTF-8 if you are sure the web app can read it correctly.
Edit:
Your MySQL is > 4.1, which means it's charset-aware. What are the charset and collation settings for your database? I am pretty sure you are using latin1 (MySQL's name for its Latin-1/cp1252 single-byte charset) to store the UTF-8 text as raw bytes in the database.
For charset-insensitive clients (i.e. mysql-cli and php-mod-mysql), characters get displayed correctly since they are transferred to/from the database as bytes. In phpMyAdmin, the bytes get read and displayed as single-byte characters; that's the garbage text you see.
Countless hours were spent on this years ago (2005?) in many parts of Asia, when MySQL 4.0 went obsolete. There is a standard way to deal with your problem and garbled data:
Back up your database as .sql
Open it up in UTF-8 capable text editor, make sure they look correct.
Look for the charset collation latin1_general_ci and replace latin1 with utf8.
Save as a new sql file; do not overwrite your backup.
Import the new file; it will now look correct in phpMyAdmin, and the Japanese on your web app will become question marks. That's normal.
For your PHP web app that relies on php-mod-mysql, insert mysql_query("SET NAMES UTF8"); right after mysql_connect(); the question marks will then be gone (see the note after this list on what SET NAMES does).
Add the following configuration to my.ini for mysql-cli:
# CLIENT SECTION
[mysql]
default-character-set=utf8
# SERVER SECTION
[mysqld]
default-character-set=utf8
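As promised above, a note on SET NAMES: it is plain SQL, essentially shorthand for setting the three session charset variables at once, roughly:
SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;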
For more information about charset on MySQL, please refer to manual:
http://dev.mysql.com/doc/refman/5.0/en/charset-server.html
Note that I assume your web app is using php-mod-mysql to connect to the database (hence the mysql_connect() function), since php-mod-mysql is the only extension I can think of that still triggers this problem TO THIS DAY.
phpMyAdmin uses php-mod-mysqli to connect to MySQL. I never learned how to use it because I switched to frameworks to develop my PHP projects. I strongly encourage you to do that too.
Many frameworks, e.g. CodeIgniter and Zend, use mysqli or PDO to connect to databases. The mod-mysql functions are considered obsolete because of performance and scalability issues. Also, you do not want to tie your project to a specific type of database.
If you're using PDO don't forget to initiate it with UTF8:
$con = new PDO('mysql:host=' . $server . ';dbname=' . $db . ';charset=UTF8', $user, $pass, array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));
(just spent 5 hours to figure this out, hope it will save someone precious time...)
I did a little more googling and came across this page
The command doesn't seem to make sense but I tried it anyway:
In the file /usr/share/phpmyadmin/libraries/dbi/mysqli.dbi.lib.php at the end of function PMA_DBI_connect() just before the return statement I added:
mysqli_query($link, "SET SESSION CHARACTER_SET_RESULTS =latin1;");
mysqli_query($link, "SET SESSION CHARACTER_SET_CLIENT =latin1;");
And it works! I now see Japanese characters in phpMyAdmin. WTF? Why does this work?
I had the same problem.
Set all text/varchar collations in phpMyAdmin to utf8, and in your PHP files add this:
mysql_set_charset("utf8", $your_connection_name);
This solved it for me.
the solution for this can be as easy as:
find the phpMyAdmin connection function/method
add this after the database is connected: $db_connect->set_charset('utf8');
phpMyAdmin doesn't follow the MySQL connection settings, because it defines its own collation in the phpMyAdmin config file.
So if we don't want to, or can't, change the server parameters, we should just force it to send results in a different format (encoding) compatible with the client, i.e. phpMyAdmin.
For example, if both the MySQL connection collation and the MySQL charset are utf8 but phpMyAdmin is ISO, we should just add this before any SELECT query sent to MySQL via phpMyAdmin:
SET SESSION CHARACTER_SET_RESULTS =latin1;
Here is my way to restore the data from latin1 to utf8 without loss:
/**
* Fixes the data in the database that was inserted into latin1 table using utf8 encoding.
*
* DO NOT execute "SET NAMES UTF8" after mysql_connect.
* Your encoding should be the same as when you firstly inserted the data.
* In my case I inserted all my utf8 data into LATIN1 tables.
* The data in tables was like ДЕТСКИÐ.
* But my page presented the data correctly, without "SET NAMES UTF8" query.
* But phpmyadmin did not present it correctly.
* So this is hack how to convert your data to the correct UTF8 format.
* Execute this code just ONCE!
* Don't forget to make backup first!
*/
public function fixIncorrectUtf8DataInsertedByLatinEncoding() {
// mysql_query("SET NAMES LATIN1") or die(mysql_error()); #uncomment this if you already set UTF8 names somewhere
// get all tables in the database
$tables = array();
$query = mysql_query("SHOW TABLES");
while ($t = mysql_fetch_row($query)) {
$tables[] = $t[0];
}
// you need to set explicit tables if not all tables in your database are latin1 charset
// $tables = array('mytable1', 'mytable2', 'mytable3'); # uncomment this if you want to set explicit tables
// duplicate tables, and copy all data from the original tables to the new tables with correct encoding
// the hack: the data is retrieved in its raw form using latin1 names and then re-inserted as utf8
foreach ($tables as $table) {
$temptable = $table . '_temp';
mysql_query("CREATE TABLE $temptable LIKE $table") or die(mysql_error());
mysql_query("ALTER TABLE $temptable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci") or die(mysql_error());
$query = mysql_query("SELECT * FROM `$table`") or die(mysql_error());
mysql_query("SET NAMES UTF8") or die(mysql_error());
while ($row = mysql_fetch_row($query)) {
$values = implode("', '", array_map('mysql_real_escape_string', $row)); // escape quotes/backslashes in the data; note NULLs become empty strings
mysql_query("INSERT INTO `$temptable` VALUES('$values')") or die(mysql_error());
}
mysql_query("SET NAMES LATIN1") or die(mysql_error());
}
// drop old tables and rename temporary tables
// this actually should work, but if it does not,
// comment out these lines and try to rename the tables manually with phpmyadmin
foreach ($tables as $table) {
$temptable = $table . '_temp';
mysql_query("DROP TABLE `$table`") or die(mysql_error());
mysql_query("ALTER TABLE `$temptable` RENAME `$table`") or die(mysql_error());
}
// now your data should be correct
// change the database character set
mysql_query("ALTER DATABASE DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci") or die(mysql_error());
// now you can use "SET NAMES UTF8" in your project and mysql will use corrected data
}
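As an alternative to the copy loop above, the same repair can often be done in SQL alone: converting the column to BLOB first makes MySQL drop the (wrong) latin1 label without touching the bytes, and converting it back with a utf8 label reinterprets those bytes correctly. A sketch with made-up names (back up first, and mind column lengths):
ALTER TABLE mytable MODIFY name BLOB;
ALTER TABLE mytable MODIFY name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;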
Change latin1_swedish_ci to utf8_general_ci in phpmyadmin->table_name->field_name
First, from the client do
mysql> SHOW VARIABLES LIKE 'character_set%';
This will give you something like
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
where you can inspect the general settings for the client, connection, database
Then you should also inspect the columns from which you are retrieving data with
SHOW CREATE TABLE TableName
and inspecting the charset and collation of the CHAR fields (though usually people do not set them explicitly, it is possible to give CHAR[(length)] [CHARACTER SET charset_name] [COLLATE collation_name] in CREATE TABLE or in ALTER TABLE foo ADD COLUMN foo CHAR ...), for example:
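CREATE TABLE foo (
`name` CHAR(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);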
I believe that I have listed all the relevant settings on the MySQL side.
If you're still getting lost, read the fine docs, and perhaps this question, which might shed some light (especially on how I thought I had it right by looking only at the mysql client on the first go).
1- Open file:
C:\wamp\bin\mysql\mysql5.5.24\my.ini
2- Look for [mysqld] entry and append:
character-set-server = utf8
skip-character-set-client-handshake
The whole view should look like:
[mysqld]
port=3306
character-set-server = utf8
skip-character-set-client-handshake
3- Restart MySQL service!
It's really simple to add multilanguage support in phpMyAdmin if you get garbled data showing there: just go to phpMyAdmin, click your database, go to the Operations tab, and in the Collation section set it to utf8_general_ci; after that all your garbled data will show correctly. A simple and easy trick.
The function and file names above don't match those in newer versions of phpMyAdmin. Here is how to fix it in the newer phpMyAdmins:
Find file:
phpmyadmin/libraries/DatabaseInterface.php
In function: public function query
Right after the opening { add this:
if($link != null){
mysqli_query($link, "SET SESSION CHARACTER_SET_RESULTS =latin1;");
mysqli_query($link, "SET SESSION CHARACTER_SET_CLIENT =latin1;");
}
That's it. Works like a charm.
I had exactly the same problem. The database charset is utf8 and the collation is utf8_unicode_ci. I was able to see Unicode text in my web app, but the phpMyAdmin and mysqldump results were garbled.
It turned out that the problem was in the way my web application was connecting to MySQL. I was missing the encoding flag.
After I fixed it, I was able to see Greek characters correctly in both phpMyAdmin and sqldump but lost all my previous entries.
just comment out these lines in libraries/database_interface.lib.php (they are shown here already commented out):
if (! empty($GLOBALS['collation_connection'])) {
// PMA_DBI_query("SET CHARACTER SET 'utf8';", $link, PMA_DBI_QUERY_STORE);
//PMA_DBI_query("SET collation_connection = '" .
//PMA_sqlAddslashes($GLOBALS['collation_connection']) . "';", $link, PMA_DBI_QUERY_STORE);
} else {
//PMA_DBI_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci';", $link, PMA_DBI_QUERY_STORE);
}
If you store data in utf8 without storing the charset, you do not need phpMyAdmin to re-convert the connection. This will work.
Easier solution for wamp is:
go to phpMyAdmin,
click localhost,
select latin1_bin for Server connection collation,
then start to create database and table
Add:
mysql_query("SET NAMES UTF8");
below:
mysql_select_db(/*your_database_name*/);
It works for me,
mysqli_query($con, "SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'");
ALTER TABLE table_name CONVERT to CHARACTER SET utf8;
*IMPORTANT: Back up first, execute after

Does mysqldump handle binary data reliably?

I have some tables in MySQL 5.6 that contain large binary data in some fields. I want to know if I can trust dumps created by mysqldump and be sure that those binary fields will not be corrupted when transferring the dump files through systems like FTP, SCP and such. Also, should I force such systems to treat the dump files as binary transfers instead of ASCII?
Thanks in advance for any comments!
No, it is not always reliable when you have binary blobs. In that case you MUST use the "--hex-blob" flag to get correct results.
Caveat from comment below:
If you combine the --hex-blob with the -T flag (file per table) then the hex-blob flag will be ignored, silently
I have a case where these calls fail (importing on a different server but both running Centos6/MariaDB 10):
mysqldump --single-transaction --routines --databases myalarm -uroot -p"PASSWORD" | gzip > /FILENAME.sql.gz
gunzip < FILENAME.sql.gz | mysql -p"PASSWORD" -uroot --comments
It produces a file that silently fails to import. Adding "--skip-extended-insert" gives me a file that's much easier to debug, and I find that this line is generated but can't be read (but no error is reported either exporting or importing):
INSERT INTO `panels` VALUES (1003,1,257126,141,6562,1,88891,'??\\\?ŖeV???,NULL);
Note that the terminating quote on the binary data is missing in the original.
select hex(packet_key) from panels where id=1003;
--> DE77CF5C075CE002C596176556AAF9ED
The column is binary data:
CREATE TABLE `panels` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`enabled` tinyint(1) NOT NULL DEFAULT '1',
`serial_number` int(10) unsigned NOT NULL,
`panel_types_id` int(11) NOT NULL,
`all_panels_id` int(11) NOT NULL,
`installers_id` int(11) DEFAULT NULL,
`users_id` int(11) DEFAULT NULL,
`packet_key` binary(16) NOT NULL,
`user_deleted` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
...
So no, not only can you not necessarily trust mysqldump, you can't even rely on it to report an error when one occurs.
An ugly workaround I used was to mysqldump excluding the two afflicted tables by adding options like this to the dump:
--ignore-table=myalarm.panels
Then this BASH script hack. Basically run a SELECT that produces INSERT values where the NULL columns are handled and the binary column gets turned into an UNHEX() call like so:
(123,45678,UNHEX("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"),"2014-03-17 00:00:00",NULL),
Paste it into your editor of choice to play with it if you need to.
echo "SET UNIQUE_CHECKS=0;SET FOREIGN_KEY_CHECKS=0;DELETE FROM panels;INSERT INTO panels VALUES " > all.sql
mysql -uroot -p"PASSWORD" databasename -e "SELECT CONCAT('(',id,',', enabled,',', serial_number,',', panel_types_id,',', all_panels_id,',', IFNULL(CONVERT(installers_id,CHAR(20)),'NULL'),',', IFNULL(CONVERT(users_id,CHAR(20)),'NULL'), ',UNHEX(\"',HEX(packet_key),'\"),', IF(ISNULL(user_deleted),'NULL',CONCAT('\"', user_deleted,'\"')),'),') FROM panels" >> all.sql
echo "SET UNIQUE_CHECKS=1;SET FOREIGN_KEY_CHECKS=1;" > all.sql
That gives me a file called "all.sql" that needs the final comma in the INSERT turned into a semicolon, then it can be run as above. I needed the "large import buffer" tweaks set in both the interactive mysql shell and the command line to process that file because it's large.
mysql ... --max_allowed_packet=1GB
When I reported the bug I was eventually pointed at the "--hex-blob" flag, which does the same as my workaround, but trivially on my side. Add that option, blobs get dumped as hex, the end.
The dumps generated from mysqldump can be trusted.
To avoid problems with encodings, binary transfers, etc., use the --hex-blob option, so it translates each byte into a hex number (for example, 'abc' becomes 0x616263). It will make the dump bigger, but it is the most compatible and safest way to transfer the info, since it will be pure text, with no strange misinterpretations due to special symbols from the binary data appearing in a text file.
You can ensure the integrity (and speed up the transfer) of the dump files by packing them into a rar or zip archive. That way you can easily detect whether the file got corrupted in transfer.
When you try to load it on your server, check you have assigned on your my.cnf server config file
[mysqld]
max_allowed_packet=600M
or more if needed.
BTW, right now I just did a migration and dumped lots of binary data with mysqldump, and it worked perfectly.
Yes, you can trust dumps generated by mysqldump.
Yes, you should use binary transfer in order to avoid any encoding conversion during transfer. mysqldump adds control commands to the dump so that the server interprets the file with a specific encoding when re-importing. You do not want to change this encoding.
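Those control commands are the conditional comments near the top of the dump; it typically begins with lines like these, which the server executes on import and which pin the encoding the dump was written in:
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET NAMES utf8 */;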

The dreaded MySQL import encoding issue - revisited

I'm having the standard MySQL import encoding issue, but I can't seem to solve it.
My client has had a WordPress installation running for some time. I've dumped the database to a file, and imported it locally. The resulting pages have a splattering of � characters throughout.
I've inspected the database properties on both sides:
production: show create database wordpress;
CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */
local: show create database wordpress;
CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */
production: show create table wp_posts;
CREATE TABLE `wp_posts` (
`ID` bigint(20) unsigned NOT NULL auto_increment,
...
KEY `post_date_gmt` (`post_date_gmt`)
) ENGINE=MyISAM AUTO_INCREMENT=7932 DEFAULT CHARSET=utf8
local: show create table wp_posts;
CREATE TABLE `wp_posts` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
...
KEY `post_date_gmt` (`post_date_gmt`)
) ENGINE=MyISAM AUTO_INCREMENT=7918 DEFAULT CHARSET=utf8
I've spent hours reading forums on how to squash the �, but I can't get anything to work. 99% of the answers say to match the character set between the databases. What I think should work is the following:
mysqldump --opt --compress --default-character-set=latin1 -uusername -ppassword wordpress | ssh username@anotherserver.net mysql --default-character-set=latin1 -uusername -ppassword wordpress
I've done it using the utf8 charset as well. Still the �'s.
I've tried modifying the SQL dump directly, putting utf8 or latin1 in the "SET NAMES" line. Still the �'s.
Strange Symptoms
I'd expect these � characters to appear in place of special characters in the content, like ñ or ö, but I've seen it where there would normally be just a space. I've also seen it in place of apostrophes (but not all apostrophes), double quotes, and trademark symbols.
The � marks are pretty rare. They appear on average three to four times per page.
I don't see any �'s when viewing the database through Sequel Pro (locally or live). I don't see any �'s in the SQL when viewing through Textmate.
What am I missing?
EDIT
More info:
I've tried to determine what the live database thinks the encoding is. I ran show table status, and it seems that the collations are a mix of utf8_general_ci, utf8_bin and latin1_swedish_ci. Why are they different? Does it matter?
I also ran: show variables like "character_set_database" and got latin1;
This is how I ended up solving my problem:
First mysqldump -uusername -ppassword --default-character-set=latin1 database -r dump.sql
Then run this script:
$search = array('/latin1/');
$replace = array('utf8');
foreach (range(128, 255) as $dec) {
$search[] = "/\x".dechex($dec)."/";
$replace[] = "&#$dec;";
}
$input = fopen('dump.sql', 'r');
$output = fopen('result.sql', 'w');
while (!feof($input)) {
$line = fgets($input);
$line = preg_replace($search, $replace, $line);
fwrite($output, $line);
}
fclose($input);
fclose($output);
The script finds all the bytes above 127 and encodes them as their HTML entities.
Then mysql -uusername -ppassword database < result.sql
A common problem with older WordPress databases and even newer ones is that the database tables get set as latin-1 but the contents are actually encoded as UTF-8. If you try to export as UTF-8 MySQL will attempt to convert the (supposedly) Latin-1 data to UTF-8 resulting in double encoded characters since the data was already UTF-8.
The solution is to export the tables as latin-1. Since MySQL thinks they are already latin-1 it will do a straight export.
In the dump file, change the declared character set from 'latin1' to 'utf8'.
Since the dumped data was not converted during the export process, it's actually UTF-8 encoded data.
Create your new table as UTF-8. If your CREATE TABLE command is in your SQL dump file, change the character set from 'latin1' to 'utf8'.
Import your data normally. Since you've got UTF-8 encoded data in your dump file, the declared character set in the dump file is now UTF-8, and the table you're importing into is UTF-8, everything will go smoothly.
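Concretely, the edit amounts to changing the declared charsets in the dump while leaving the data bytes untouched; an illustrative before/after (table and column are made up):
/*!40101 SET NAMES latin1 */; -- becomes: /*!40101 SET NAMES utf8 */;
CREATE TABLE `example` (
`content` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1; -- becomes: DEFAULT CHARSET=utf8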
I was able to resolve this issue by modifying my wp-config.php as follows:
/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');
/** The Database Collate type. Don't change this if in doubt. */
define( 'DB_COLLATE', 'utf8_general_ci' );
I think you can fix this issue this way:
$link = mysql_connect('localhost', 'mysql_user', 'mysql_password');
$db = mysql_select_db('mysql_db', $link);
mysql_query('set names utf8', $link);