Mysql- unicode characters are fine but not accents - mysql

I came across a puzzling problem with MySQL encoding today and would appreciate ideas on how to debug it further.
I had to update an old Perl application, using MySQL 5.6, which was originally just in English and to which I had to add some Unicode support (for Khmer script).
I figured it would be best to do a test install. I took a dump of the prod db, imported it into a test db, and changed the charset of the tables that needed support to utf8 collate utf8_unicode_ci.
All worked well so went to apply to production. Ran the sql migration scripts to change charsets, deployed the new code and ... khmer characters do store/show fine but legacy è characters show as question mark with black square.
What really puzzles me is that
test and prod run on the same (windows) box, same mysql server instance
both test and prod databases have the same charsets and collation
for the table in question, test and prod show create table statements are identical
the same code connected to test works fine but connected to prod doesn't
I thought maybe the original data got mangled in the process, so I deleted it and reinserted it through the app interface. Still worked on test but not prod.
Same code works on test so code is probably not the issue.
Both on same server instance so probably not server config issue.
Khmer script works fine so probably not a utf "configuration" issue.
New data is wrongly handled so probably not a data migration/conversion issue.
So 2 questions:
is the question mark with black square a sign of double encoding or just wrong encoding
how can I debug this further? Anyway to see "raw" mysql stored data for example so I could compare?
Any input greatly appreciated.

When trying to use utf8/utf8mb4, if you see Black Diamonds with question marks,
one of these cases exists:
Case 1 (original bytes were not utf8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT were not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were utf8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>
Not relevant, but since you brought it up:
When trying to use utf8/utf8mb4, if you see Mojibake, check the following.
This discussion applies to Double Encoding, which is not necessarily visible.
The bytes to be stored need to be utf8-encoded.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4).
HTML should start with <meta charset=UTF-8>.
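To make the two cases concrete, here is a minimal Python sketch (Python is not what the question uses, and the helper name `render_as_utf8` is made up for illustration). It mimics what a UTF-8 page does with the bytes it receives: a latin1-encoded è is not valid UTF-8, so it comes out as U+FFFD (the black diamond), while correctly encoded è and Khmer text both survive.

```python
# Hypothetical helper: decode bytes the way a UTF-8 page would,
# substituting U+FFFD (the "black diamond") for invalid sequences.
def render_as_utf8(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")

accent_latin1 = "è".encode("latin-1")  # one byte: E8
accent_utf8 = "è".encode("utf-8")      # two bytes: C3 A8
khmer_utf8 = "ក".encode("utf-8")       # three bytes: E1 9E 80

for label, raw in [
    ("è as latin1", accent_latin1),
    ("è as utf8", accent_utf8),
    ("Khmer as utf8", khmer_utf8),
]:
    # Print the raw bytes in hex next to what a UTF-8 reader would show.
    print(label, raw.hex().upper(), repr(render_as_utf8(raw)))
```

This would explain why Khmer can look fine while è does not: Khmer has no latin1 form, so it can only ever arrive as UTF-8, whereas è may still be travelling as a single latin1 byte somewhere along the way.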

Related

UTF-8 encoding problem while importing a sql file

I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export a SQL dump using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
As the server's SQL file my.ini has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now I start HeidiSQL, it shows utf8mb4 references instead of utf8 for the sessions parameters. I don't know why:
Now, I import my dumped file, and I see that even if everything is apparently configured in utf8, it looks like I have some encoding problems.
On the server, I see:
Locally, in HeidiSQL, I see:
Special characters like à are not displayed correctly on the local database.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the sql file it does not fix the issue, and also values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze comment, I could fix the issue.
In HeidiSQL, when I choose the sql file to execute, there's actually an "Encoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set set correctly. The fact that HeidiSQL displays a different character set, is probably because clients themselves set a character set.
For example, your mysql server might use "Character set A" by default. If a client connects and says they want "Character set B", the server will convert this on the fly.
utf8mb4 is a superset of (and superior to) utf8. It's better to have your server default to utf8mb4. The popular use case for utf8mb4 is emoji.
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8859-1/latin1 instead".
The server happily complies and will convert the client's ISO-8859-1 strings to UTF-8 on the fly.
Despite the client wanting to use ISO-8859-1, it actually sends UTF-8.
The server thinks the data is ISO-8859-1 and treats it as such, so it runs the UTF-8 bytes through an ISO-8859-1 to UTF-8 conversion. It's effectively a double-encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
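The guessed sequence above can be sketched in a few lines of Python (not the asker's stack; the variable names are only illustrative, and Python's "latin-1" codec stands in for the server's conversion):

```python
original = "à"

# The client really sends UTF-8 bytes: C3 A0.
sent_by_client = original.encode("utf-8")

# The server was told the connection is ISO-8859-1, so it reads each
# byte as one character: "Ã" followed by a no-break space.
misread_by_server = sent_by_client.decode("latin-1")

# The server then converts those two characters to UTF-8 for the
# utf8 column: C3 83 C2 A0, i.e. double-encoded data.
stored = misread_by_server.encode("utf-8")

print(sent_by_client.hex().upper())  # bytes the client sent
print(stored.hex().upper())          # bytes that end up in the column
```

One character went in as two bytes and was stored as four, even though every setting on the column says utf8.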
If this is correct, this process is reversible.
You really just need the opposite operation. For example, if you had a PHP string $data, which is 'double-encoded' as UTF-8, the process would simply be to call this:
$output = utf8_decode($data);
It's also possible to fix this in MySQL. See this stack overflow question.
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
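As a rough illustration of the "make sure this is actually the case" advice, here is a guarded version of the reverse operation in Python. The function name `undo_double_encoding` is hypothetical, and the sketch assumes the mangling went through ISO-8859-1; if Windows-1252 was involved, characters such as curly quotes would need "cp1252" instead.

```python
def undo_double_encoding(text: str) -> str:
    """Return the repaired text, or the input unchanged if it does not
    look double-encoded (assumes ISO-8859-1 was the wrong charset)."""
    try:
        # Map each character back to one byte, then read the bytes as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character has no latin-1 byte, or the bytes are not
        # valid UTF-8: the text was not double-encoded this way.
        return text

print(undo_double_encoding("Ã\u00a0"))  # double-encoded "à"
print(undo_double_encoding("déjà"))     # already clean, left alone
print(undo_double_encoding("hello"))    # plain ASCII, left alone
```

Leaving clean rows untouched matters if the table already holds a mixture of good and bad rows, because blindly applying the conversion to good data would damage it.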
Sidenote: This problem is extremely common. You are somewhat lucky that you're French because it highlights the problem. Many English systems I've seen have this issue but it largely goes unnoticed for a long time because a lot of text doesn't go outside the common ASCII range.
You have "Mojibake". à turns into Ã  (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.
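When reading the output of SELECT col, HEX(col), it helps to know what hex to expect for a given character. A small Python helper (the name `expected_hex` is made up, and it only works for characters that exist in latin1) can print the candidates to compare against:

```python
def expected_hex(ch: str) -> dict:
    """Hex a single character would have under three storage scenarios."""
    utf8 = ch.encode("utf-8")
    return {
        # Stored correctly in a utf8 column.
        "utf8": utf8.hex().upper(),
        # Stored as a single latin1 byte.
        "latin1": ch.encode("latin-1").hex().upper(),
        # UTF-8 bytes misread as latin1 and converted to UTF-8 again.
        "double_encoded": utf8.decode("latin-1").encode("utf-8").hex().upper(),
    }

for scenario, hex_value in expected_hex("à").items():
    print(scenario, hex_value)
```

If HEX(col) shows the "utf8" value, the stored data is fine and the problem is on the way in or out; if it shows the "double_encoded" value, the data itself needs repairing.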

which database supports to insert vietnamese characters?

I tried inserting Vietnamese characters into a MySQL database through my Java program. They are getting inserted, but certain characters are being inserted as junk. And when trying to retrieve them, I'm getting the same junk values in place of some characters. Can anyone tell me what should be done? Is there a problem in MySQL, or is there any DB that supports these characters?
Example of ‘junk’, and code?
In general you need to make sure:
your tables are created with UTF-8 collation on all text columns. This can be done at several levels: config default-character-set=utf8, db CREATE DATABASE ... DEFAULT CHARACTER SET utf8, table CREATE TABLE ... DEFAULT CHARACTER SET utf8, and column column VARCHAR(255) CHARACTER SET utf8. After the initial creation you can only do it by ALTER on the columns; changing the default character sets won't change the column.
that your connection to the database is in UTF-8 encoding, by specifying useUnicode=true and characterEncoding=UTF-8 properties in your connection string or properties. Ensure you have an up-to-date MySQL Connector as there have been grievous bugs here in the past.
that nothing else in your processing stream is mangling the characters before they get to the database connection, or on the way back out. Ensure you aren't using the default encoding anywhere because it is probably wrong. Setting the flag -Dfile.encoding=UTF-8 may help with that as a temporary workaround, but you don't want to rely on it.
(And if part of your testing involves printing to the terminal, be aware that the Windows command prompt won't be able to do anything with UTF-8 so you will definitely see junk there.)
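To see why only certain characters turn into junk, here is a short Python sketch (the question uses Java; this only illustrates the encoding behaviour, and the sample phrase is made up). Any step that squeezes text through latin1 keeps the characters latin1 happens to contain and replaces the rest:

```python
import unicodedata

# Normalise to precomposed (NFC) form so each accented letter is one character.
text = unicodedata.normalize("NFC", "Tiếng Việt là")

# A latin1 step somewhere in the chain: characters without a latin1
# byte are replaced by "?", the others pass through.
through_latin1 = text.encode("latin-1", errors="replace").decode("latin-1")

# A UTF-8 step is lossless for the same text.
through_utf8 = text.encode("utf-8").decode("utf-8")

print(through_latin1)
print(through_utf8)
```

Here "à" survives because latin1 has it, while "ế" and "ệ" do not, which matches the symptom of only some characters being stored as junk.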
Hi there no problem to store vietnamese characters, but check mysql FAQ first:
http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html

WordPress encoding problem

I'm having what seems to be a problem related to WordPress, though it could be something else.
Here's what's happening:
I have a blog with posts using utf-8 characters (simple ones like ’). The characters all display correctly currently, however I'm moving my site to another server and seeing problems with all the utf-8 chars (’ becomes â€™).
I first thought the problem was with MySQL, but after looking into it it seems not to be the case. I created the new database by doing a synch with Navicat, and have confirmed that both db's and all tables are utf-8. When viewing the data in either db in any SQL program I've tried (Sequel Pro, Navicat) the chars show up unencoded (’). I've tried various synching methods, including ones that others have said solved encoding problems, but they did not work for me.
What confirmed it for me was setting up a test PHP script which pulled a single post_content field from each database. In the test script the chars show up encoded (â€™) regardless of which db they come from.
I checked the apache config file and found that HTTP_ACCEPT_CHARSET is set to the same (ISO-8859-1,utf-8;q=0.7,*;q=0.7) on both systems.
Soooo, I'm left thinking that it's a WordPress issue, though of course I could be wrong.
Any help would be truly appreciated, I’ve been banging my head on this for awhile now ;)
Thanks.
What you are seeing is UTF-8 data being interpreted as if it were ISO-8859-1 (or Win-1252, or another single-byte encoding). Problems like this are almost always a mismatch between the headers being sent to the browser and the actual encoding. Something is telling the browser that the stream is ISO-8859-1 while actually sending UTF-8.
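The effect is easy to reproduce. In this Python sketch (an illustration only, using Windows-1252 as the single-byte encoding), the three UTF-8 bytes of the curly apostrophe are read as three separate characters:

```python
stored = "’"  # U+2019, right single quotation mark

# UTF-8 encodes it as three bytes: E2 80 99.
utf8_bytes = stored.encode("utf-8")

# A reader that assumes Windows-1252 shows one character per byte.
shown = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex().upper())
print(shown)
```

The bytes never changed; only the declared encoding did, which is why fixing the header or connection charset can be enough without touching the data.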
So, I've finally ended up using a plug-in to solve the problem. Here are the steps I took:
Migrate the structure and content of the old database to the new database using Navicat for MySQL (though I think any method of copying will work).
Change the encoding of the columns in the wp_posts table to utf8 using ALTER TABLE `wp_posts` CHANGE COLUMN `post_content` `post_content` longtext CHARACTER SET utf8 NOT NULL AFTER `post_date_gmt`;
Use the ISO to UTF content plug-in to convert any non-encoded chars in the table to utf chars.

Is it OK to fix a character encoding error using SQL REPLACE?

I have a (WordPress) blog and some of my older posts have a character encoding problem where £ displays as Â£ (i.e. a pound sign prepended with a capital 'A' with a hat on).
The problem is at the DB level, so I was going to run the following SQL statement:
update wp_posts set post_content = replace(post_content, 'Â£', '£');
Would this be foolish?
Background info (not required to read):
How did this problem happen? I don't know. The blog has been through various updates (including from WordPress Version 2.1.3 when the default table CHARSET changed from latin1 to utf8) and been migrated to and from various machines and I guess at some point WordPress must have written UTF-8 encoded characters into the Database that had a CHARSET of latin1, or vice-versa. I know I should have been more careful (yes I have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
How have I ensured that this doesn't happen again? I have made sure my encodings are consistent. All MySQL tables use CHARSET utf8 and the HEAD section of blog pages sets <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
It should be ok. The best thing is the following:
Make a dump of your blog db
Load it to another db
Perform the replace on the temporary db
Check!
If all goes well, perform it on the production db as well.
Well, I would say that it would probably be the best "solution" to the problem.
As the data has been stored using the wrong encoding somewhere along the line, the original data is lost and there is no real solution. You just have to try to salvage what you can from the corrupt data that you have.
If it's only isolated to a single character, you are lucky. There may be byte codes that didn't translate into any available character, so if that happened anywhere you wouldn't have a character combination that is possible to identify, you would just have a character replaced by another or a missing character. It would only be possible to spot that manually.
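Whether a given character can still be identified after this kind of mangling can be checked per character. The Python sketch below is only an approximation: the function name is made up, and it relies on Python's cp1252 codec, which leaves five byte values (81, 8D, 8F, 90, 9D) undefined; other software may handle those bytes differently.

```python
def survives_cp1252_misread(ch: str) -> bool:
    """True if the character's UTF-8 bytes can be misread as cp1252
    and then converted back without loss."""
    try:
        mangled = ch.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        # One of the bytes has no cp1252 character, so a strict
        # conversion would have dropped or replaced it.
        return False
    return mangled.encode("cp1252").decode("utf-8") == ch

print("£", survives_cp1252_misread("£"))  # bytes C2 A3
print("Á", survives_cp1252_misread("Á"))  # bytes C3 81, and 81 is undefined
```

Characters in the first group leave a recognisable pair such as Â£ that a replace can target; characters in the second group are the ones that may have been damaged beyond automatic repair.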
Sure you have data in one encoding and the table with another one. You can fix this within mysql.
Check here
Don't do that!
Use a trigger on update/insert if you really need to.
EDIT: hmm, after reading your situation, I would suggest making a backup copy of the DB and trying what you said. I think it would work, as long as you're not planning to ever do it again (which seems to be the case)

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents but it fails if the character itself is used (copied from either character map or Word I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø - and we can't seem to do a replace on it (in ASP at least) as the character comes through to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: @markokocic: Collation only dictates the sorting order. Although this should of course match your character set, it does not affect the range of characters that can be stored in a field.
Have you tried setting the collation for the table to utf-8 or something other than latin1/ascii?