Easiest way to repair charset and collation on a mysql database? - mysql

Because MySQL's default settings are not very Unicode-friendly, it is easy to end up with a database whose charset is broken.
Usually you just want to reconfigure it to use the utf8 character set and the utf8_unicode_ci collation.
What is the easiest command to do this for a given database?
Warning: do not post links to untested scripts. I tested at least five of them (written in bash/perl/php/python) and they all failed to repair a database where the collation was set correctly at database and table level but not at column level.

I managed to write a solution myself and published it at:
https://gist.github.com/1068021
Notes:
mysqldump is broken: even if you tell it not to include CHARSET, it will still include it when it is set at column level.
This solution does not assume a default charset at the MySQL server level, so it sets the charset at database level and resets it to the default at table and column level.
Feel free to post bugs or patches; I will try to fix them quickly.
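As a rough illustration of the kind of statements such a repair needs, here is a minimal Python sketch that only builds the SQL text. It is not the script from the gist; the function name and the example database and table names are hypothetical. It relies on the fact that `ALTER TABLE ... CONVERT TO CHARACTER SET` also rewrites column-level settings, and it assumes the stored data is correctly encoded to begin with.

```python
def build_repair_statements(database, tables, charset="utf8", collation="utf8_unicode_ci"):
    """Return SQL statements that set the database default and convert each table."""

    def quote(name):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    statements = [
        f"ALTER DATABASE {quote(database)} CHARACTER SET {charset} COLLATE {collation};"
    ]
    for table in tables:
        # CONVERT TO changes the table default and every character column.
        statements.append(
            f"ALTER TABLE {quote(database)}.{quote(table)} "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements


# Hypothetical names, for illustration only.
for stmt in build_repair_statements("mydb", ["users", "posts"]):
    print(stmt)
```

The output can be reviewed by hand and then piped into the mysql client. Take a backup first, because the conversion rewrites the tables.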

Related

How to convert a big MySQL Database from utf8 to utf8mb4?

I have a MySQL server (MariaDB Server version 10.5 running on Debian bullseye) with a 600 GB database. Due to compatibility issues I have to switch from UTF-8 to UTF8MB4.
I've found a few things about it, but I'm still unsure of the best way to do it. Since this is a production system and a rollback is only possible on the testing system, I'm concerned about data integrity and fear that difficulties may arise afterwards, for example with regard to performance.
What is the best and safest way to convert the database?
Is there anything special to consider?
Thanks for suggestions.
This article covers this process well. https://mathiasbynens.be/notes/mysql-utf8mb4
TL;DR:
The biggest concern in this whole process is data integrity: you do not want to lose data. So the first step is to create a backup of the database; if something goes wrong, you can always fall back on that copy. Safety first! Let's break down the process.
Create a backup of the database.
Upgrade the MySQL server to at least v5.5.3, the first version that supports utf8mb4.
Modify the database, tables and columns. You can do this as follows:
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (The above line is just an example for a `VARCHAR` column.)
Because utf8mb4 is backwards compatible with utf8, this conversion should not cause any weird behaviour such as data loss.
Check the maximum length of columns and index keys. When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. A TINYTEXT column, for example, can store at most 255 bytes, and because a character can now take up to 4 bytes instead of 3, the same column holds fewer characters. Check all your columns, but especially the TINY* ones, since they are the most likely to run out of bytes.
Modify connection, client, and server character sets. Wherever you have utf8, you should in general replace it with utf8mb4.
Run mysqlcheck -u root -p --auto-repair --optimize --all-databases to repair and optimize all tables. This avoids some weird bugs where UPDATE statements had no effect even though no errors were thrown.
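The byte arithmetic behind the length check can be sketched in a few lines of Python. This is only an illustration: the 767-byte index key limit is an assumption that applies to older InnoDB row formats (check your own server), and the helper names are made up. Python's "utf-8" codec corresponds to MySQL's utf8mb4, so it can be used to count bytes.

```python
def max_indexable_chars(key_limit_bytes, max_bytes_per_char):
    """How many characters fit in an index key when every character may use the maximum width."""
    return key_limit_bytes // max_bytes_per_char


def fits_in_bytes(text, limit_bytes):
    """True if the text, encoded as utf8mb4, fits into a column limited to limit_bytes."""
    return len(text.encode("utf-8")) <= limit_bytes


INDEX_KEY_LIMIT = 767  # bytes; assumed limit of older InnoDB row formats

print(max_indexable_chars(INDEX_KEY_LIMIT, 3))  # utf8:    255 characters
print(max_indexable_chars(INDEX_KEY_LIMIT, 4))  # utf8mb4: 191 characters

# Bytes per character: ASCII, accented Latin, CJK, emoji.
for ch in ["a", "é", "中", "😀"]:
    print(repr(ch), len(ch.encode("utf-8")))

# A TINYTEXT column holds at most 255 bytes.
print(fits_in_bytes("😀" * 63, 255))  # True  (252 bytes)
print(fits_in_bytes("😀" * 64, 255))  # False (256 bytes)
```

This is where the VARCHAR(191) in the example above comes from: 191 four-byte characters is the most that fits under a 767-byte key limit.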

Issue when deploying mysql db (utf8mb4_unicode_520_ci -> utf8mb4_unicode_ci)

I started working on a WordPress site on my dev machine. The MySQL version is 5.6 and WordPress is 4.7, so it already uses the utf8mb4_unicode_520_ci encoding if it detects that this is possible.
My problem is that on my hosting (MySQL 5.5) utf8mb4_unicode_520_ci is not recognized as a valid encoding. So I'm trying to target the utf8mb4_unicode_ci encoding, as my hosting knows about this one and, if I understand correctly, this would (as opposed to going to utf8) allow me to keep the 4 bytes.
I tried several different combinations of encoding and collation for the db, but without success (from here: How to convert an entire MySQL database characterset and collation to UTF-8?).
I tried several combinations of encoding and collation in wp-config, but nothing worked.
Everything that comes from the database (like post titles and post contents) displays badly encoded characters for all diacritics; anything else is displayed correctly.
Menu labels from the database display incorrectly, whereas the hardcoded/translated labels display correctly.
I think I need to convert the actual content of the database; changing the charset and collation does not seem to be enough.
I found this but it does not address my problem directly, or if it does I missed it.
Any help would be appreciated
————————————————————————————————
UPDATE :
here is the precise procedure I went through:
Initial situation:
I installed WordPress (4.6.1) locally (on my dev machine, MySQL 5.6.28).
I worked on the theme and plugin locally.
(At this moment I have, locally, a database that is utf8_general_ci and tables that are utf8mb4_unicode_520_ci.)
Problem:
I want to deploy my wordpress on my hosting (mysql: 5.5 - db collation seems to be utf8mb4_unicode_ci).
I mysqldump the db locally, then try to import it through my hosting's phpMyAdmin.
This gives the error:
Unknown collation: 'utf8mb4_unicode_520_ci'
Solution 1: change the tables' collation to utf8mb4_unicode_ci:
On my hosting's SQL server, utf8mb4_unicode_520_ci is not available and I can't get a more recent version of MySQL.
utf8mb4_unicode_ci seems like the closest match and is available on my hosting's SQL server.
From various SO questions, I adapted a bash script to change the charset and collation of my tables:
for tbl in wp_sij2017_commentmeta wp_sij2017_comments wp_sij2017_cwa wp_sij2017_links wp_sij2017_options wp_sij2017_postmeta wp_sij2017_posts wp_sij2017_term_relationships wp_sij2017_term_taxonomy wp_sij2017_termmeta wp_sij2017_terms wp_sij2017_usermeta wp_sij2017_users wp_sij2017_woocommerce_api_keys wp_sij2017_woocommerce_attribute_taxonomies wp_sij2017_woocommerce_downloadable_product_permissions wp_sij2017_woocommerce_order_itemmeta wp_sij2017_woocommerce_order_items wp_sij2017_woocommerce_payment_tokenmeta wp_sij2017_woocommerce_payment_tokens wp_sij2017_woocommerce_sessions wp_sij2017_woocommerce_shipping_zone_locations wp_sij2017_woocommerce_shipping_zone_methods wp_sij2017_woocommerce_shipping_zones wp_sij2017_woocommerce_tax_rate_locations wp_sij2017_woocommerce_tax_rates; do
mysql --execute="ALTER TABLE wp_sij_2017_original_copy.${tbl} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
done
I run this script on the local db
I now have all my tables set to collation utf8mb4_unicode_ci
My db collation is still utf8
I mysqldump the db, then import it to my hosting and...
Import is successful.
I search and replace siteurl in the db.
I then visit the online website, and SOME diacritics render as a "question mark" character.
Any text coming from the db has a decoding issue AT SOME POINT.
The source/HTML markup also has those "question mark" characters.
I have no idea where to look or what to do next
Clarification: CHARACTER SETs utf8 and utf8mb4 specify how characters are encoded into bytes. COLLATIONs *_unicode_*, etc., specify how those characters compare.
The encoding for utf8mb4_unicode_ci and utf8mb4_unicode_520_ci is the same because both belong to the character set utf8mb4.
"database that is utf8_general_ci and tables that are utf8mb4_unicode_520_ci" -- that probably means that new tables in that database, unless specifically stated otherwise, will be CHARACTER SET utf8 COLLATE utf8_general_ci. That is, the database setting is just a default for CREATE TABLE. Since your tables are already CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci, the database default is not relevant to them.
As long as the CHARACTER SET stays utf8mb4, no Emoji, Chinese, etc will be lost or otherwise mangled.
Do not use mysql40; it did not know about any CHARACTER SETs. Do not use CONVERT or CAST. Etc.
I assume the 520 is coming from the output of mysqldump? Do you have an editor that can handle a file that big? If so, simply edit it to change utf8mb4_unicode_520_ci to utf8mb4_unicode_ci throughout. Then load the dump. Problem solved?
Your fix
You did ALTER ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci on your local machine. That is probably an even better way -- since it will put your dev and prod machine in line with each other. That should have worked. Don't worry about what the "database" claims.
I find 'utf8mb4_unicode_520_ci' and replace it with 'utf8mb4_unicode_ci' in the .sql file.
It's the simplest way to solve this.
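For a dump too large to open in an editor, the same find-and-replace can be streamed line by line. The following Python sketch is one possible way to do it, not a tested migration tool; the function name and the sample table are hypothetical. Note that it replaces every occurrence of the string, including any that happen to appear inside row data.

```python
import io

OLD_COLLATION = "utf8mb4_unicode_520_ci"
NEW_COLLATION = "utf8mb4_unicode_ci"


def rewrite_collation(src, dst, old=OLD_COLLATION, new=NEW_COLLATION):
    """Copy src to dst line by line, replacing old with new. Returns the number of replacements."""
    replaced = 0
    for line in src:
        replaced += line.count(old)
        dst.write(line.replace(old, new))
    return replaced


# Demo on a small in-memory "dump" (hypothetical table).
sample = io.StringIO(
    "CREATE TABLE `wp_posts` (\n"
    "  `post_title` text COLLATE utf8mb4_unicode_520_ci NOT NULL\n"
    ") DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;\n"
)
fixed = io.StringIO()
print(rewrite_collation(sample, fixed))  # 2
print(fixed.getvalue())
```

For a real dump, pass two file objects instead, for example `open("dump.sql", encoding="utf-8", newline="")` and `open("fixed.sql", "w", encoding="utf-8", newline="")`. This assumes the dump is valid UTF-8 text.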

Should I alter my non utf-8 database (mysql) (default?) to utf-8 on a running Drupal 7 site?

I've got a running Drupal 7 site on MySQL. I just realized that my database encoding (and collation) is something other than UTF-8, while the Drupal 7 docs say it needs UTF-8 (with utf8_general_ci collation). At the same time, all my tables use the required utf-8 and utf8_general_ci. (I guess the Drupal setup did it this way.)
My questions are:
Should I leave the whole system as it is, or should I alter my database to the required encoding? And is it necessary to convert anything after altering the database to utf-8?
Would leaving the whole system as it is cause me any trouble in the future?
Is this setting for the database just a default and it doesn't matter at all for me as all my tables are set to the proper utf-8?
Thanks
You may or may not already be in trouble.
When you connect, do you establish the connection as being utf8? (SET NAMES utf8, or whatever Drupal does to achieve that.)
Is the data in your client encoded utf8? (For English, this does not matter.)
The CHARACTER SET of every column that currently exists is already set in stone. Even if some columns are latin1 and some are utf8, the client will see them as whatever SET NAMES has established. The inconsistency does not 'hurt'. At least not until you try to store Arabic in a latin1 column.
I would not blindly try to change anything without understanding how to correctly change it. Some attempts could make things worse.
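Why Arabic in a latin1 column goes wrong can be shown with Python's codecs. This is only an analogy for what the server does: Python's "latin-1" codec stands in for MySQL's latin1 (the two are not byte-for-byte identical), and the helper name is made up.

```python
def store_as(text, charset):
    """Encode text the way a column of that charset would, turning unencodable characters into '?'."""
    return text.encode(charset, errors="replace")


arabic = "\u0645\u0631\u062d\u0628\u0627"  # the five-letter Arabic word for "hello"

print(store_as("café", "latin-1"))  # b'caf\xe9'  (é exists in latin1)
print(store_as(arabic, "latin-1"))  # b'?????'    (Arabic does not, so the data is lost)
print(store_as(arabic, "utf-8"))    # 10 bytes, two per letter
```

Once the question marks have been stored, the original characters cannot be recovered from the column.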

Arabic characters don't show in phpMyAdmin

So I am currently working on a certain project where I need to create a database in which its records will hold both English and Arabic names.
I am creating this using phpMyAdmin, where it works perfectly fine for English names; however, all the Arabic names appear as "?????".
To solve this issue I tried to use "set name 'utf8' ", but it didn't work. Googling this problem, I concluded that phpMyAdmin does not support Arabic or special characters.
I am not sure if there is any workaround for this issue. Do you have any suggestions for solving it?
Thanks in advance
First, is your database capable of storing Unicode? SHOW CREATE TABLE table_name; will hopefully show your character set as utf8. If not this should fix it:
ALTER TABLE table_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Also make sure your PHPMyAdmin settings contain this:
$cfg['DefaultCharset'] = 'utf-8';
$cfg['DefaultConnectionCollation'] = 'utf8_general_ci';
phpMyAdmin has no problem with UTF8, which as far as I can tell means Arabic has been fully supported for some time. phpMyAdmin just shows (accurately) what is stored in your database; if you're not seeing what you expect, it's almost always because your application is misbehaving. Perhaps your Google search turned up quite old information; I'm not sure how long phpMyAdmin has supported these special characters, but looking at the development log file it seems that it's been at least since 2008, and almost certainly even prior to that. Anyway:
The phpMyAdmin wiki has considerable detail on the matter and is a good place to start. There's also a quite comprehensive guide here at Stackoverflow, or this link to another very similar question. You can read more about properly setting the application charset here, and I'll refer you once again to the phpMyAdmin wiki for information on recovering/repairing the situation.
To summarize: the problem is almost certainly in how your application stores data, not how phpMyAdmin displays it. Make sure your database and tables are using a utf character set. In your application code, make sure you set your connection charset properly. Recovery is rather painless and can be achieved by switching the column charset first to binary then whatever utf8-variant makes the most sense for you.
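The "binary first, then utf8" recovery works because the stored bytes are already valid UTF-8 and only their label is wrong. The Python sketch below imitates that situation under stated assumptions: Python's "latin-1" codec stands in for a mislabelled latin1 column (it is used here because it maps all 256 byte values), and the function names are hypothetical.

```python
def mangle(text):
    """UTF-8 bytes read back as if they were latin1: the classic mojibake."""
    return text.encode("utf-8").decode("latin-1")


def repair(mangled):
    """Go back to the raw bytes, then relabel them as UTF-8 (the column -> binary -> utf8 idea)."""
    return mangled.encode("latin-1").decode("utf-8")


print(mangle("é"))          # Ã©
print(repair(mangle("é")))  # é

arabic = "\u0645\u0631\u062d\u0628\u0627"  # the five-letter Arabic word for "hello"
print(repair(mangle(arabic)) == arabic)    # True
```

This only helps when the bytes are intact. If literal "?" characters were stored, the original text is gone and has to be re-entered.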
Add these two lines at the bottom of the my.ini file, then restart the WAMP server:
character-set-server=utf8
collation-server=utf8_general_ci
First of all, navigate to the following link: http://localhost/phpmyadmin/index.php
Set the "Server connection collation" to utf8_unicode_ci.
All the Arabic data fields will then be displayed correctly in the phpMyAdmin databases.

default database collation not respected while importing

In my database, the collation was originally utf8_general_ci. However, I noticed that utf8_unicode_ci is necessary because of better sorting accuracy.
So I exported the whole database using phpMyAdmin and checked that the word "COLLATION" does not appear in the exported SQL file (except twice, in one table where it is set to binary). So this script should be collation-agnostic: it should not imply any specific collation when importing, but use the database default.
After dropping all tables, the database collation was changed to utf8_unicode_ci and then the import script was run from phpmyadmin. But as a result, all tables and all columns are shown again with utf8_general_ci collation (and sorting is incorrect). Why?? And what to do to change it?
P.S. The export/import script contains this commented line at the beginning:
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
I don't know if it has any impact while importing, but after opening the MySQL console, the command show variables like 'collation_connection' shows COLLATION_CONNECTION as cp852_general_ci.
However, in phpmyadmin->variables the variable 'collation_connection' is set to utf8_general_ci. But there is no way to change it.
That happens because the database export is setting the character set on every table, and such a clause comes with a default collation that depends on the character set, not on the collation of your connection. utf8_general_ci is the default collation for utf8.
You'll have to convert your tables with something like ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; or edit your database export if this is affordable.
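If editing the export is the route you take, one option is to add an explicit collation to every table definition that only names the character set. The Python sketch below shows the idea under some assumptions: it expects table options written as `DEFAULT CHARSET=utf8`, the function name is made up, and it does not touch column-level `CHARACTER SET` clauses.

```python
import re


def add_collation(sql, charset="utf8", collation="utf8_unicode_ci"):
    """Append COLLATE=<collation> to table options that name the charset without a collation."""
    pattern = re.compile(
        rf"\bDEFAULT CHARSET={re.escape(charset)}\b(?!\s+COLLATE)"
    )
    return pattern.sub(f"DEFAULT CHARSET={charset} COLLATE={collation}", sql)


print(add_collation(") ENGINE=InnoDB DEFAULT CHARSET=utf8;"))
# ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

# Already has a collation, or is a different charset: left unchanged.
print(add_collation(") DEFAULT CHARSET=utf8 COLLATE=utf8_bin;"))
print(add_collation(") DEFAULT CHARSET=utf8mb4;"))
```

With the collation spelled out in the dump, the import no longer falls back to the character set's default collation.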
As for the MySQL console: the command-line client is pretty much broken on Windows. It'll never support, display or read Unicode, and you're getting a per-connection collation for that client that matches the Windows so-called OEM character set for your locale. This is a Windows misfeature that's difficult to work around in portable software. phpMyAdmin uses a web server and doesn't suffer from this problem. I advise you to use a UNIX-like operating system like GNU/Linux for any serious work in any case, not just for this reason. As an added benefit, MySQL, Apache and your whole application stack perform better on Linux.