import utf-8 mysqldump to latin1 database - mysql

I have a dump file of a phpnuke site somehow in utf8. I'm trying to reopen this site on a new server. But nuke uses latin1.
I need a way to create a latin1 database using this utf-8 dump file.
I tried everything I could think of. iconv, mysql replace, php replace...

Add the SET NAMES 'utf8'; statement at the beginning of your dump.
This will indicate to MySQL that the commands it is about to receive are in UTF8.
It will store the data in whatever character set your tables are currently set in; in this case if your database is in latin1, data will be stored in latin1.
From http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html:
SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)
One last thing is that latin1 has much less characters available than utf8. For anything other than the western European languages, you will lose data.
For instance, assuming column test is latin1. The first entry will appear correctly (the French accent is within latin1); however the second entry in Korean will show as question marks.
SET NAMES 'utf8';
INSERT INTO TESTME(test) VALUES ('Bienvenue sur Wikipédia');
INSERT INTO TESTME(test) VALUES ('한국어 위키백과에 오신 것을 환영합니다!');

Related

What happens when SET NAMES does not match a table's character set?

I have a database with tables set as CHARACTER SET=utf8mb4, however I've discovered that the mysql connection (SET NAMES) has been latin1.
My questions are:
What happens to that data? Has MySQL been converting it to utf8mb4 to store it, or has it stored the data as latin1?
If I need to convert it, should I connect as SET NAMES latin1 to take the data out, convert it, and put it back in, or connect as utf8mb4 to pull the data out?
Data is stored in the character set defined in the table, so in your case all the data will be stored in utf8mb4.
Every connection to the database uses a character set for sending and receiving data. That character set might be different, so mysql converts between the storage character set and the connection character set.
So, to answer your questions:
1) mysql converts it.
2) you don't need to convert it. But you need to make sure that if you use set names, you actually are sending data using the character set you claim.
More here: http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATION utf8_bin
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and document on Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want mysql to convert any data, you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8
This is the recommended way, and it is explicitely documented in Alter Table Syntax Documentation.
The values use in a different encoding
The default encoding was Latin1 for several years on a some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table meta-data, using the BLOB type trick
Convert the values using CONVERT TO.
A straightforward conversion will potentially break any strings with non-utf7 characters.
If you don't have any of those (i.e. all of your text is english), you'll usually be fine.
If you've any of those, however, you need to convert all char/varchar/text fields to blob in an initial run, and to convert them to utf8 in a subsequent run.
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding swedish to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and remains in the unchanged state. Therefor, I'd deem the ALTER TABLE statement working.

mySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci" but most of the tables and columns inside are using CHARSET=latin1. Will I run into any problems with this?
The reason I ask is because we have been running into a lot of problems syncing data between two database.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database and a table have no character sets, they have just defaults that are inherited downwards (server to schema to table). Columns that are of a CHAR, VARCHAR or any TEXT type have character sets, and do so on a per column basis. If no specific character set is defined for them, they inherit from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem PEOPLE have with it is lying to the server, that is, setting the character set of a connection to something different from what the client is actually sending or using. This can be done at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning if you for example try to store japanese hiragana in such a latin1 column.
My experience with messign up MySQL charset was not 100% functional sorting of strings. You would be better with having everything in UTF-8 to be on the safe side.
I think it depends on what you actually store in that columns. If you store UTF-8 multi-byte characters in a column with latin-1 charset you might run into the sorting troubles. But as longs as there are only EN/US characters you should be ok.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you 're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin-1. That could be a problem. Depends on your data, as mentioned above.

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" beside encoding. Encoding determines how the characters map to bytes, collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive") - then your two string would match.

Mysql turns ' into ’?

How can I stop mysql from converting ' into ’ when I do an insert?
i believe it has something to do with charset or something?
I am using php to do the mysql_insert.
The single quotation mark you posted is called an 'acute accent', which is often converted from the generic single quotation mark by some web applications. It's a UTF8 character, which when inserted into a Latin-1 database translates to '’'. This means that you need to change MySQL's charset to UTF8, or alternatively change your website's charset to Latin-1. The former would be preferred:
ALTER DATABASE YourDatabase CHARACTER SET utf8;
ALTER TABLE YourTableOne CONVERT TO CHARACTER SET utf8;
ALTER TABLE YourTableTwo CONVERT TO CHARACTER SET utf8;
...
ALTER TABLE YourTableN CONVERT TO CHARACTER SET utf8;
Maybe someone will know the answer immediately, but I don't. However here are a few suggestions on what to examine (and possibly expand the question on)
When dealing with encodings and escaping you should include the full history of data
how was it created
what happened to it before the problem (did it have to go through backup, e-mail, was it created on a different server, OS, etc..; if it was transferred then was it as text file?)
The above is because anything that writes to a text file (browser, mysql client, web server, php application, to name a few layers that could have done it) can mess up character coding.
To troubleshoot, you can start eliminating, and thus the first step (in my book), is to
connect to mysql server using mysql command line client.
check the output of SHOW VARIABLES LIKE 'character_set%'
(so even in this simple environment you have 7 values that can influence how the data is parsed, stored and/or displayed
inspect SHOW CREATE TABLE TableName, and look for charset and collation info, both default for the table and explicit definition on columns
Having said all of the above, I don't think any western script would transcode a single quote character. So you might need to look at your escaping and other data processing.
EDIT
Most of the above from answer and discussion here
This is what I've done, and it worked for me:
First make sure that column containing ' is utf8_general_ci
Then add the mysql_set_charset to your code
$db=mysql_connect("localhost", $your_username, $your_password);
mysql_set_charset('utf8',$db);
mysql_select_db($your_db_name, $db);