I've had my self-hosted WordPress blog for a long time. I just realized that my DB is not UTF-8 and certain plugins won't work correctly.
My question is this: how does a very novice MySQL user go about converting my database? As you can imagine, I'm very hesitant to do this on my own, as I have 5 years' worth of posts I don't want to jack up.
Can anyone point me in the right direction, or even better, step me through the process of converting everything to UTF-8?
After backing up your database as Konerak said, run this for every table:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8;
(you may want to check with SHOW FULL COLUMNS FROM tablename whether all (text) columns are now indeed correct)
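If you have a lot of tables, you can let MySQL write the statements for you. A sketch, assuming your database is named wordpress (adjust the name to yours):
-- Generate one ALTER TABLE statement per table in the database;
-- 'wordpress' is a placeholder for your actual database name.
SELECT CONCAT('ALTER TABLE `', table_name, '` CONVERT TO CHARACTER SET utf8;')
FROM information_schema.tables
WHERE table_schema = 'wordpress'
  AND table_type = 'BASE TABLE';
-- Copy the generated statements and run them (after the backup!).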
And right after you connect to MySQL, run the query:
SET NAMES utf8;
Now, to tell your audience you are using UTF-8: you could craft a custom header in every page or in an always-included file, but I find it easier to put this in an .htaccess file for Apache in the root:
php_value default_charset "UTF-8"
If you have non-ASCII content in flat files as well as in the database, you'll have to convert them too. Your favorite editor may have a batch-convert tool, or you can use iconv.
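For example, a rough iconv invocation could look like this (a sketch assuming the files are currently latin1; adjust the source encoding and file names to your situation):
# Convert a latin1-encoded file to UTF-8 (file names are placeholders).
iconv -f LATIN1 -t UTF-8 page.html > page.utf8.html
mv page.utf8.html page.html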
Can someone please provide the best way to convert not just a MySQL database and all its tables from latin1_swedish_ci to UTF-8, but their contents as well? I have been researching all over Stack Overflow as well as elsewhere, and the suggestions are always different.
Some people suggest just using these commands on the tables and databases:
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Others say that this just changes the database and tables, but not the contents.
Some suggest dumping the DB, creating a new database with the right charset and collation, and importing the old dump into it. Does this actually convert the data as well?
mysqldump --skip-opt --set-charset --skip-set-charset
Others suggest running iconv against the dumped DB before importing. Is this really needed, or would the import into a UTF-8 DB do the conversion?
Finally, others suggest altering the database, converting char/blob columns to binary, and then converting back.
There are so many different methods that it has become very confusing.
Can someone please provide concise step-by-step instructions, or point me to some, on how I can go about converting my latin1 DBs and their content to UTF-8? Even better if there is a script that automates this process against a database.
Thanks in advance.
There are two different problems which are often conflated:
change the specification of a table or column on how it should store data internally
convert garbled mojibake data to its intended characters
Each text column in MySQL has an associated charset attribute, which specifies the encoding used to store the column's text internally. This really only influences which characters can be stored in the column and how efficient the storage is. For example, if you're storing a ton of Japanese text, sjis as an encoding may be a lot more efficient than utf8 and save you a bit of disk space.
The column encoding does not in any way influence the encoding in which data is input to and output from the database. That is a separate setting, the connection encoding, which is established for every individual client every time you connect to the database. MySQL converts data on the fly between the connection encoding and the column/table charset as needed. You can connect to the database with a utf8 connection, send it Japanese text destined for an sjis column, and MySQL will convert from utf8 to sjis on the fly (and back in reverse on the way out).
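A small illustration of the two settings working together (the table and column names here are made up):
-- Column charset: how MySQL stores the text internally.
CREATE TABLE jp_demo (
    body TEXT CHARACTER SET sjis
);
-- Connection charset: how this particular client talks to the server.
SET NAMES utf8;
-- The client sends UTF-8; MySQL converts it to sjis for storage
-- and converts it back to utf8 when the row is read.
INSERT INTO jp_demo (body) VALUES ('こんにちは');
SELECT body FROM jp_demo;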
Now, if you've screwed up the connection encoding (as happens way too often) and you've inserted text in a different encoding than your connection encoding specified (e.g. your connection encoding was latin1 but you actually sent UTF-8 encoded data), then you're storing garbage in your database and you need to recover that. If that's your issue, see How to convert wrongly encoded data to UTF-8?.
However, if all your data is peachy and all you want to do is tell MySQL to store data in a different encoding from now on, you only need this:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
MySQL will convert the current data from its current charset to the new charset and store future data in the new charset. That's all.
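To double-check that the conversion took effect, you can inspect the table afterwards (tablename being whatever table you just converted):
-- Both the table default and each text column should now report utf8.
SHOW CREATE TABLE tablename;
SHOW FULL COLUMNS FROM tablename;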
Here is an example from the Moodle community:
https://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8
(Scroll down to "Explained".)
The author first makes an SQL dump, which is one big SQL file, then copies that file, makes encoding corrections with sed on the copy, and finally imports the corrected dump back into the database.
I can recommend this because these simple steps make it easy to check that each one has been done right. If something goes wrong, just go back to the last step and try it another way.
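In shell terms, the workflow he describes is roughly this (a sketch; the file names and the blanket latin1-to-utf8 substitution are my assumptions, so check the Moodle page for the exact sed expressions):
# 1. Dump the database to one big SQL file.
mysqldump -u root -p mydb > mydb.sql
# 2. Work on a copy, never on the original dump.
cp mydb.sql mydb_utf8.sql
# 3. Fix the charset declarations in the copy.
sed -i 's/latin1/utf8/g' mydb_utf8.sql
# 4. Import the corrected dump back into the database.
mysql -u root -p mydb < mydb_utf8.sql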
Use the MySQL Workbench to handle this. http://dev.mysql.com/doc/workbench/en/index.html
Run the migration wizard to produce a script that will create the database schema.
Edit that script to alter the collation and character set (Notepad++ search and replace is just fine for this) and the schema name so you don't overwrite the existing database.
Run the script to create the copy under a new name.
Use the migration wizard to bulk transfer the data to the new schema. It will handle all the conversion for you and ensure that your data is still good.
So I am currently working on a project where I need to create a database whose records will hold both English and Arabic names.
I am creating this using phpMyAdmin, where it works perfectly fine for English names; however, all the Arabic names appear as "?????".
To solve this issue I tried to use "set name 'utf8' ", but it didn't work. Googling this problem, I came away thinking that phpMyAdmin does not support Arabic or special characters.
I am not sure if there is any workaround for this issue. Do you have any suggestions to solve it?
Thanks in advance
First, is your database capable of storing Unicode? SHOW CREATE TABLE table_name; will hopefully show your character set as utf8. If not, this should fix it:
ALTER TABLE table_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Also make sure your phpMyAdmin settings contain this:
$cfg['DefaultCharset'] = 'utf-8';
$cfg['DefaultConnectionCollation'] = 'utf8_general_ci';
phpMyAdmin has no problem with UTF-8, which as far as I can tell means Arabic has been fully supported for some time. phpMyAdmin just shows (accurately) what is stored in your database; if you're not seeing what you expect, it's almost always because your application is misbehaving. Perhaps your Google search turned up quite old information; I'm not sure how long phpMyAdmin has supported these special characters, but looking at the development log file it seems it's been at least since 2008, and almost certainly even prior to that. Anyway:
The phpMyAdmin wiki has considerable detail on the matter and is a good place to start. There's also a quite comprehensive guide here at Stack Overflow, as well as this link to another very similar question. You can read more about properly setting the application charset here, and I'll refer you once again to the phpMyAdmin wiki for information on recovering/repairing the situation.
To summarize: the problem is almost certainly in how your application stores data, not how phpMyAdmin displays it. Make sure your database and tables use a utf8 character set. In your application code, make sure you set your connection charset properly. Recovery is rather painless and can be achieved by switching the column charset first to binary and then to whatever utf8 variant makes the most sense for you.
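A sketch of that recovery for a single column; the table and column names are placeholders, and you should take a backup first:
-- Step 1: switch to binary, dropping the charset interpretation
-- while keeping the raw bytes untouched.
ALTER TABLE customers MODIFY name VARBINARY(255);
-- Step 2: reinterpret those bytes as utf8.
ALTER TABLE customers MODIFY name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;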
Add these two lines at the bottom of the my.ini file, then restart the WAMP server:
character-set-server=utf8
collation-server=utf8_general_ci
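After the restart you can verify that the server picked up the settings:
-- Should now report utf8 and utf8_general_ci respectively.
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';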
First of all, navigate to the following link: http://localhost/phpmyadmin/index.php
Make the server connection collation utf8_unicode_ci.
All the Arabic data fields will then be displayed correctly in the phpMyAdmin databases.
I just did a database restore from MySQL Workbench and found out that Liferay does not display UTF-8 special characters, e.g. ÅÄÖ; these letters are displayed as question marks instead.
I wonder if anyone knows the solution to this issue. Do I have to specify a charset while importing the SQL files? And if so, how do I do that in MySQL Workbench?
To be honest, I have no idea if the MySQL restore has a direct effect on what happened; I'm just describing what I did before the issue occurred.
If you do a restore to a new database, make sure that this database's default character set is UTF-8:
create database lportal character set utf8;
Then import your data into that database.
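When importing, it can also help to tell the mysql client explicitly that the dump is UTF-8 (dump.sql is a placeholder for your dump file):
mysql --default-character-set=utf8 lportal < dump.sql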
Let me also use this opportunity to link my favourite site for generating great UTF-8 test data: http://www.fliptitle.com - great if you need test data for people who know only ASCII languages but still need immediate feedback on correct encoding with data they're able to interpret. You don't seem to be one of them, but I guess others in this group might stumble upon this later.
We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ASCII characters were displayed incorrectly.
Our first attempt to solve it was to change the MySQL columns' encoding to UTF-8 (as described, for example, here):
alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't help.
Finally we solved this problem by importing the data from the .sql script with an additional command-line flag which, I believe, forced the mysql client to treat the data from the .sql script as UTF-8:
mysql ... --default-character-set=utf8 database_name < script.sql
It helped, but then we realized that this time we had forgotten to change the column encoding to utf8 - it was still set to latin1 even though UTF-8 encoded data was flowing through the database (from the SQL script to the application).
So if data obtained from the database is displayed correctly even when the database character set is set incorrectly, why the heck should I bother setting the correct database encoding?
In particular, I would like to know:
What parts of the database rely on the column encoding setting? When does this setting have any real meaning?
On what occasions is an implicit conversion of the column encoding done?
How does the trick of converting a column to the binary format and then to the destination encoding work (see the SQL snippet above)? I still don't get it.
Hope someone can help me clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way too often that you need to check data in the database, and if you cannot properly input UTF-8 strings coming from the web page into your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database over the web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited in your choice of tools, given that only the web app produces the correct output;
if you need to deliver a partial database extract to your partner or an external company for analysis, and the extract is messed up, it's a pity.
Now to your questions:
When you ask the database to ORDER BY some column of a string data type, the sorting rules take into account the encoding of the column, as some internal transformations are applied in case you have different encodings for different columns. The same applies if you're trying to compare strings: encoding information is essential here. Encoding comes together with collation, although most people don't use this feature very often.
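A quick way to see collation at work from the client (assuming a utf8 connection):
SET NAMES utf8;
-- Accent-insensitive collation: these compare equal (returns 1).
SELECT 'résumé' = 'resume' COLLATE utf8_general_ci;
-- Binary collation: the byte sequences differ (returns 0).
SELECT 'résumé' = 'resume' COLLATE utf8_bin;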
As mentioned, if you have any set of columns in different encodings, the database will choose to implicitly convert values to a common encoding, which is UTF-8 nowadays. Implicit recoding of strings might also be done by client frameworks/libraries, depending on the client environment's encoding. Typically data is recoded into the database's encoding when sent to the server, and back into the client's encoding when results are delivered.
Binary data has no notion of encoding; it's just a set of bytes. So when you convert to binary, you're telling the database to "forget" the encoding, although you keep the data unchanged. Later, you convert back to a string type, enforcing the right encoding. This trick helps if you're sure the data physically is in UTF-8 while, by some accident, a different encoding was specified.
Given that you managed to load the data into the database by using --default-character-set=utf8, there was something off with your environment; I suggest it was not set up for UTF-8.
I think the best practice today would be to:
have all your environments be UTF-8 ready, including shells;
have all your databases default to the UTF-8 encoding.
This way you'll have less room for error.
Wonder if anyone can help.
I recently had an issue with UTF-8 in the database and pages of a bespoke CMS I inherited. Going forward that's all sorted now; the code and DB have been changed to cater for it and properly convert. However, existing entries in the DB are obviously sat there in the old character format, and I need to convert all those.
E.g. Ķ, ī
I was going to run a replace in the MySQL DB to replace all of these, but what I could do with is knowing what all these weird characters translate to, e.g. ó.
Can anyone recommend a good table/reference to look at? I have been searching but can't seem to come up with the right thing.
If I understand right, these are two-byte UTF-8 characters.
Thanks
Try running these values through utf8_decode.
It looks like they were valid, then utf8_encode'd.
If that's the case, try running a loop and updating those rows.
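If you'd rather do the repair in SQL than in PHP, the classic fix for double-encoded UTF-8 is a round trip through binary; a sketch, assuming a hypothetical table posts with the affected column content (back up first):
-- Reinterpret the stored utf8 characters as their latin1 bytes,
-- then read those raw bytes back as genuine utf8.
UPDATE posts
SET content = CONVERT(CAST(CONVERT(content USING latin1) AS BINARY) USING utf8);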