MySQL character set with issue - mysql

I have an issue with the character set for my MySQL database. I am trying to support all languages so I have set the collation to utf8_general_ci and the character set variables to utf8.
The columns of the tables (and the tables themselves) are also set to utf8...
This seems to work fine on my local database...
...but not on the live database...
The live database is hosted on AWS RDS and I have set the character set and collation parameters to utf8/utf8_general_ci as above and rebooted.
This is both an issue in Workbench as well as when the data is queried from code. The value has not been saved properly to the database like it has been locally.
Is there something I'm missing?

Something is set up differently. See "best practice" and "question mark" in Trouble with UTF-8 characters; what I see is not what I stored
If you need "all" languages, you should use CHARACTER SET utf8mb4 so that you can handle the 4-byte characters in Chinese.

Related

Issue when deploying mysql db (utf8mb4_unicode_520_ci -> utf8mb4_unicode_ci)

I started working on a wordpress on my dev machine. mysql version is 5.6, and worpdress is 4.7 so its already using the utf8mb4_unicode_520_ci encoding if it detects its possible.
My problem is that on my hosting (mysql 5.5) utf8mb4_unicode_520_ci is not recognized as a valid encoding. So I'm trying to target utf8mb4_unicode_ci encoding as my hosting knows about this one, and if I understand correctly, this would - in opposition to going to utf8 - allow me to keep the 4 bytes.
I tried several different combinaison of encoding and collation set up for the db, but nothing successful (from here How to convert an entire MySQL database characterset and collation to UTF-8?).
I tried several combination of encoding and collation in the wp-config, but nothing.
Everything that is coming from the database (like post titles and post contents displays badly encoded char for all diatrics, anything else is displayed appropriately )
menu label from the database display incorrectly, where the hardcoded/translated label display correctly
I think I need to convert the actual content of the database, changing charset and collation does not seems to be enough.
I found this but it does not address my problem directly, or if it does I missed it.
Any help would be appreciated
————————————————————————————————
UPDATE :
here is the precise procedure I went through:
Initial situation:
I installed a wordpress (4.6.1) locally (on my dev machine, mysql 5.6.28).
I worked on the theme and plugin locally
(at this moment I have, locally, a database that is utf8_general_ci and tables that are utf8mb4_unicode_520_ci
Problem:
I want to deploy my wordpress on my hosting (mysql: 5.5 - db collation seems to be utf8mb4_unicode_ci).
I mysqldump the db locally, then try to import it on my hostings' phpmyadmin.
This gives error :
Unknown collation: 'utf8mb4_unicode_520_ci'
solution 1 change the tables charset to utf8mb4_unicode_ci:
On my hosting sql server, utf8mb4_unicode_520_ci is not available and I can't get a more recent version of mysql.
utf8mb4_unicode_ci seems like the closest and is available on my hosting sql server.
from various so question, I adapt a bash script to change charset and collation of my tables
for tbl in wp_sij2017_commentmeta wp_sij2017_comments wp_sij2017_cwa wp_sij2017_links wp_sij2017_options wp_sij2017_postmeta wp_sij2017_posts wp_sij2017_term_relationships wp_sij2017_term_taxonomy wp_sij2017_termmeta wp_sij2017_terms wp_sij2017_usermeta wp_sij2017_users wp_sij2017_woocommerce_api_keys wp_sij2017_woocommerce_attribute_taxonomies wp_sij2017_woocommerce_downloadable_product_permissions wp_sij2017_woocommerce_order_itemmeta wp_sij2017_woocommerce_order_items wp_sij2017_woocommerce_payment_tokenmeta wp_sij2017_woocommerce_payment_tokens wp_sij2017_woocommerce_sessions wp_sij2017_woocommerce_shipping_zone_locations wp_sij2017_woocommerce_shipping_zone_methods wp_sij2017_woocommerce_shipping_zones wp_sij2017_woocommerce_tax_rate_locations wp_sij2017_woocommerce_tax_rates; do
mysql --execute="ALTER TABLE wp_sij_2017_original_copy.${tbl} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
done
I run this script on the local db
I now have all my tables set to collation utf8mb4_unicode_ci
My db collation is still utf8
I mysqldump the db, then import it to my hosting and...
Import is successful.
I search and replace siteurl in the db.
I then visit the online website, I got SOME diatrics that renders a "question mark char"
Any text coming from the db has decoding issue AT SOME POINT
The source/html markup also has those "question mark char"
I have no idea where to look or what to do next
Clarification: CHARACTER SETs utf8 and utf8mb4 specify how characters are encoded into bytes. COLLATIONs *_unicode_*, etc, specify how those character compare.
The encoding for utf8mb4_unicode_ci and utf8mb4_unicode_520_ci are the same because they are encoded in the character set utf8mb4.
"database that is utf8_general_ci and tables that are utf8mb4_unicode_520_ci" -- that probably means that new tables in that database, unless specifically stated, will be CHARACTER SET utf8 COLLATION utf8_general_ci. That is the database setting is just a default for CREATE TABLE. Since your tables are already CHARACTER SET utf8mb4 COLLATION utf8mb4_unicode_520_ci, the database default is not relevant to them.
As long as the CHARACTER SET stays utf8mb4, no Emoji, Chinese, etc will be lost or otherwise mangled.
Do not use mysql40; it did not know about any CHARACTER SETs. Do not use CONVERT or CAST. Etc.
I assume the 520 is coming from the output of mysqldump? Do you have an editor that can handle a file that big? If so, simply edit it to change utf8mb4_unicode_520_ci to utf8mb4_unicode_ci throughout. Then load the dump. Problem solved?
Your fix
You did ALTER ... CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci on your local machine. That is probably an even better way -- since it will put your dev and prod machine in line with each other. That should have worked. Don't worry about what the "database" claims.
I'm find 'utf8mb4_unicode_520_ci' and replace with 'utf8mb4_unicode_ci' in .sql file.
Its simplest why to solve this.

How to convert mysql latin1 to utf8

I inherited a web system that I need to develop further.
The system seems to be created by someone who read two chapters of a PHP tutorial and thought he could code...
So... the webpage itself is in UTF8 and displays and inputs everything in it. The database tables have been created with UTF8 character set. But, in the config, there is "SET NAMES LATIN1". In other words, UTF8 encoded strings are populated into the database with forced latin1 coding.
Is there a way to convert this mess to actually store in utf8 and get rid of latin1?
I tried this, but since the database table is set to utf8, this does not work. Also tried this one without success.
I maybe able to do this by reading all tables in PHP with latin1 encoding then write them back to a new database in utf8, but I want to avoid it if possible.
I managed to solve it by running updates on text fields like this:
UPDATE table SET title = CONVERT(CONVERT(CONVERT(title USING latin1) USING binary) USING UTF8)
The situation isn't as bad as you think it is, unless you already have lots of non-Roman characters (that is, characters that aren't representable in Latin-1) in your database already. Latin-1 is a proper subset of utf8. Your web app works in utf8 and your tables' contents are in utf8 as well. So there's no need to convert the tables.
So, try changing the SET NAMES latin1 to SET NAMES utf8. It will probably solve your problem, by allowing your php program's connection to work with the same character set as the code on either end of the connection.
Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html
Change column data type
to VARBINARY
and it's automatically convert the latin1 datas
Thank you guys.
I hope it's helpful.

Should I alter my non utf-8 database (mysql) (default?) to utf-8 on a running Drupal 7 site?

I've got running Drupal 7 site on MySQL. Just recognised that I have the Database Encoding (and collation) other than UTF-8 while Drupal 7 docs say it needs UTF-8 (with utf8_general_ci collation). At the same time all my tables are the necessary utf-8 and utf8_general_ci. (I guess drupal setup did it this way).
My questions are:
should I leave the whole system as it is or should I just alter my database to required encoding or is it necessary to convert anything after I altered the database to utf-8?
Would leaving the whole system as it is cause me any trouble in the future?
Is this setting for the database just a default and it doesn't matter at all for me as all my tables are set to the proper utf-8?
Thanks
You may or may not already be in trouble.
When you connect, do you establish the connection as being utf8? (SET NAMES utf8, or whatever Drupal does to achieve that.)
Is the data in your client encoded utf8? (For English, this does not matter.)
The CHARACTER SET of every column that currently exists is already set in stone. Even if some columns are latin1 and some are utf8, the client will see them as whatever SET NAMES has established. The inconsistency does not 'hurt'. At least not until you try to store Arabic in a latin1 column.
I would not blindly try to change anything without understanding how to correctly change it. Some attempts could make things worse.

mySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci" but most of the tables and columns inside are using CHARSET=latin1. Will I run into any problems with this?
The reason I ask is because we have been running into a lot of problems syncing data between two database.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database and a table have no character sets, they have just defaults that are inherited downwards (server to schema to table). Columns that are of a CHAR, VARCHAR or any TEXT type have character sets, and do so on a per column basis. If no specific character set is defined for them, they inherit from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem PEOPLE have with it is lying to the server, that is, setting the character set of a connection to something different from what the client is actually sending or using. This can be done at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning if you for example try to store japanese hiragana in such a latin1 column.
My experience with messign up MySQL charset was not 100% functional sorting of strings. You would be better with having everything in UTF-8 to be on the safe side.
I think it depends on what you actually store in that columns. If you store UTF-8 multi-byte characters in a column with latin-1 charset you might run into the sorting troubles. But as longs as there are only EN/US characters you should be ok.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you 're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin-1. That could be a problem. Depends on your data, as mentioned above.

Should I migrate a MySQL database with a latin1_swedish_ci collation to utf-8 and, if so, how?

The MySQL database used by my Rails application currently has the default collation of latin1_swedish_ci. Since the default charset of Rails applications (including mine) is UTF-8, it seems sensible to me to use the utf8_general_ci collation in the database.
Is my thinking correct?
Assuming it is, what would be the best approach to migrate the collation and all the data in the database to the new encoding?
UTF-8, as well as any other Unicode encoding scheme, can store characters in any language, so it is an excellent choice of codepage for your database.
The collation setting, on the other hand, is a completely separate issue from the encoding scheme. It involves sort orders, upper/lowercase conversions, string equality comparisons, and things like that which are language-specific. The collation setting should match the language that is used in the database.
The UTF-8 general collation is (I am assuming here—I'm not familiar with MySQL in particular) used for situations where the language is unknown and some simple default ordering is needed. It probably corresponds to the Unicode code point ordering, which is almost certainly not what you want if you're storing Swedish.
Convert to UTF-8 as the charset.
Collation settings are only used for sorting and stuff like that. Choose the collation that most of your users would expect.
Providing your existing data in the database is CORRECTLY encoded in latin1, converting the tables to utf8 (using ALTER TABLE, as described in the docs) should just work.
Then all your application needs to do is continue doing whatever it did before. If your application wants to use unicode characters, it should set its connection encoding to utf8 and use utf8, but that's its own problem.
The problem is that a large number of crap web apps have historically sent utf8 data to mysql and told it to treat it as latin1. MySQL will honour this perfectly and save junk into the tables, as instructed.
Converting the tables from latin1 to utf8 will NOT repair this mistake, as you genuinely do have total rubbish in there. Repairing them is nontrivial, particularly if during the lifetime of the app it's been talking different types of rubbish to the database.
Use below mysql query to convert your column :
ALTER TABLE users MODIFY description VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
To see full details about your table :
SHOW FULL COLUMNS FROM users;