MySQL encoding weird characters

Like many people before me, I have a problem in MySQL with the encoding of my data.
More specifically, the collation of the table seems to be utf8_general_ci. The data is inserted fine, but when a select is done, some characters come back garbled:
Marie-Thérèse becomes Marie-ThÃ©rÃ¨se.
Is it possible to do a select that translates these characters back to their original value, or is that impossible? Changing the original table is harder in my case, so I'd rather solve it in my select query.

When using phpMyAdmin (or the like) and looking at those entries, are those entries okay?
Update: if not, the inserts are probably already flawed, and the connection settings of the insertion script must be adapted.
If so, then it's not technically MySQL's fault but the software connecting to it. See for example: UTF-8 all the way through. You have to set some parameters on/after opening the connection.
By the way: the collation should be irrelevant here. http://dev.mysql.com/doc/refman/5.7/en/charset-general.html
The gist is: a collation tells you how to order/compare strings, which mainly matters for special characters like äöü in German or àéô in French, because a local/regional collation may say that, for ordering purposes, ä is exactly like a (for example), while in another collation ä could sort distinctly after a, or even after z.
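To illustrate: a collation only changes how strings compare and sort, never which bytes are stored. A quick sketch, assuming a utf8mb4 connection:
SELECT 'ä' = 'a' COLLATE utf8mb4_general_ci;  -- 1: this collation treats ä like a
SELECT 'ä' = 'a' COLLATE utf8mb4_bin;         -- 0: a binary collation does not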

In the end, it seems the problem was with running it all through a cronjob.
We run a script through a cronjob that generates the insert statements. Apparently, when running the script manually everything goes well, but when running the same script through a cronjob, the data gets mangled. We solved it with the help of this article: http://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/
We had to add a LANG variable to the /etc/environment file.
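For reference, a minimal sketch of the fix from that article, assuming a UTF-8 locale is installed on the machine (the exact value varies per system):
# /etc/environment -- give cron jobs a UTF-8 locale instead of the
# POSIX/C default, so the script behaves as it does when run manually
LANG=en_US.UTF-8
On many Linux systems cron picks this file up via pam_env; alternatively, LANG can be set at the top of the crontab itself.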

Related

run mysql without collation (utf-8 only)

I run a sqlite3 database with utf8-strings from many languages. For various reasons I want to move to mysql, but I constantly run into trouble because of the mysql-collation feature.
One problem is that I am not even able to reliably know what is in my database. (For example, I get "?" for non-Latin characters and "�" for Latin-based characters like öé, but I have absolutely no idea whether the problem lies in the import from sqlite3 to MySQL or in reading from the MySQL database.)
Is there a way to get rid of this "feature" and let mysql do what I tell it without trying to be smart? I use UTF-8 everywhere and I never need any mangling of strings: Input is always UTF-8 and output should be always UTF-8. Also I really would like to know what really is stored in the database - i.e. without a collation-feature corrupting the data during readout.
You could use the MySQL VARBINARY column type, which stores a sequence of arbitrary bytes without interpreting them in any particular charset (or maybe VARCHAR BINARY, which is subtly different).
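A quick sketch of that idea, with hypothetical names; HEX() is also useful here, because it shows exactly which bytes are stored, regardless of any charset interpretation:
CREATE TABLE strings_raw (
  id INT PRIMARY KEY,
  txt VARBINARY(255)   -- raw bytes, no charset attached
);
SELECT id, HEX(txt) FROM strings_raw;   -- see the stored bytes as-is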
MySQL uses latin1_swedish_ci unless you specify something different explicitly. That's the opposite of smart. You have to be smart and change that default. This can be done with e.g. the --character-set-server and --collation-server command line options. See Specifying Character Sets and Collations for other means and further options.
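For instance, that default could be set server-wide in the MySQL configuration file; a sketch (the file location, and whether you prefer utf8 or utf8mb4, depend on your installation and server version):
[mysqld]
# make new databases/tables default to UTF-8 instead of latin1_swedish_ci
character-set-server = utf8
collation-server = utf8_general_ci
The same pair can also be given as --character-set-server/--collation-server on the mysqld command line.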

Accent insensitive search on a problematic database

I have a database that contains data in different languages. Some languages use accents (like áéíóú) and I need to search this data as if the accents didn't exist (a search for 'campeon' should return 'campeón' as a valid result).
The problem is that the tables in my database (utf8_unicode_ci) are not storing UTF-8 characters correctly. If you look at the data through phpMyAdmin, the words with accents look like this: campeÃ³n
After some research, I've found (in a StackOverflow question) that the problem is related to the absence of a SET NAMES [charset]. In fact, I've done some testing, and if I set names to utf8, everything works as expected.
Well, I have the solution, so what's the problem? The problem is that the database is in production, so there are thousands of strings in it. If I change the character set the client uses, all the already existing strings will become invalid. The question is: is there any way to:
perform accent-insensitive searches in a database that uses a wrong charset like mine?
safely transform the data in the tables to the appropriate charset?
continue working with mixed charsets (latin1 and utf8) in the database, accepting that the latin1 data will not be accent-insensitive?
If anybody has experience with any of the solutions I propose, or has a new one, I'll be very thankful if you share it.
The problem being that the data was inserted using the wrong connection encoding, you can fix it by
Exporting the data using the wrong connection encoding, just like you have used it thus far, followed by
Importing the data using the correct utf8 connection encoding.
That will fix the encoding problem, after which search will work as expected.
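A sketch of those two steps with mysqldump, using placeholder names (try it on a copy first): the dump is taken over the same wrong latin1 connection, so the file ends up containing the original UTF-8 bytes, and is then reloaded over a proper utf8 connection.
# 1. export with the wrong (latin1) connection encoding, as used so far;
#    --skip-set-charset keeps mysqldump from writing SET NAMES into the file
mysqldump --default-character-set=latin1 --skip-set-charset mydb > dump.sql
# 2. import with the correct utf8 connection encoding
mysql --default-character-set=utf8 mydb < dump.sql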
What if you create a copy of the table at the beginning of your session, alter the copy's charset, perform all your queries from that, and then drop the table at the end of your session? I don't know how practical this would be - depends on how often you need to perform these queries and how big the table is.
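That idea could look roughly like this (placeholder table name; whether CONVERT TO yields correct text depends on how exactly the data was mis-encoded, so verify on a sample first):
CREATE TABLE products_tmp LIKE products;              -- copy structure
INSERT INTO products_tmp SELECT * FROM products;      -- copy data
ALTER TABLE products_tmp CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
-- ... run the session's accent-insensitive queries against products_tmp ...
DROP TABLE products_tmp;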

Erlang Emysql encoding difference between prepared and regular Query

I asked a question about Emysql encoding which got a correct answer here.
That answer pinpointed another question...
I'm trying to store iPhone emojis in a database.
When I do:
Query = io_lib:format("UPDATE Users SET c=\"~s\" WHERE id=~B", [C, Id]),
emysql:execute(mydb, Query).
Everything works fine...
But with:
emysql:prepare(update_c, <<"UPDATE Users SET c=? WHERE id=?">>),
emysql:execute(mydb, update_c, [C, Id]).
I get back mojibake.
I'm connecting with:
emysql:add_pool(mydb, 3, "login", "password", "db.mydomain.com", 3306, "MyTable", latin1)
Unfortunately, I cannot use utf8, because the previous software that used this database stored emojis that way; if I do use utf8, it will work with the new system, but not with rows inserted by the old one.
EDIT:
I would really like to use prepared statements, since they effectively prevent SQL injection.
Edit: this should be fixed in 253b7f94f9b04526e6868d7b693e6e9ee41de374. Thanks for the feedback.
https://github.com/Eonblast/Emysql/commit/253b7f94f9b04526e6868d7b693e6e9ee41de374
I believe it's an error in Emysql and I think I fixed it. Still working out the unit tests so it all makes sense. I'll let you know when it's posted to github.
I opened an issue for this: https://github.com/Eonblast/Emysql/issues/24
Essentially, you are tricking the driver and the database because you open the connection with latin-1 but the database is utf-8. Then you trip over the automatic conversion.
Still, I think you are right that the driver should respect that you set the connection to latin-1 and not do the magic of automatic conversion to utf-8. If you read issue #14 at Eonblast/Emysql at github you'll find I always suspected automatic conversion was a bad idea.
However, just from the fact that the unit tests for the conversions have now blown up by a factor of four (and pose some rather uninteresting but mind-boggling fringe issues I can't get my head around), I think tricking the database the way you do is likewise a bad idea. If you can, you should clean this up rather than rely on the mechanics in between to hold. There are multiple levels in MySQL where conversions occur: as you know, you can set the connection, the database, and also the table to a character set. It's a great way to produce bugs. Can you describe why you couldn't clean it up? Is it because you have no control over the data and must act blind to the encoding? I'd like to know if there is a real case where you can't live without this hack.
Regardless, your complaint about the setting of the connection to latin-1 probably showed the way to eliminate all or most of the guessing in the character conversions in Emysql. That's very much appreciated and I hope I'll have a solution for you later today.
Henning
Just convert your table to UTF-8:
ALTER TABLE Users CONVERT TO CHARACTER SET utf8;
Then you can use utf-8 with new data, and the old data will have been converted to UTF-8 as well.
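To verify the conversion, something like this helps (HEX() shows the raw stored bytes, so you can confirm the old rows really were converted):
SHOW CREATE TABLE Users;                   -- table charset should now be utf8
SELECT id, c, HEX(c) FROM Users LIMIT 10;  -- inspect raw bytes of old rows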

Is it OK to fix a character encoding error using SQL REPLACE?

I have a (Wordpress) blog, and some of my older posts have a character encoding problem where £ displays as Â£ (i.e. a pound sign prepended with a capital 'A' with a hat on).
The problem is at the DB level, so I was going to run the following SQL statement:
update wp_posts set post_content = replace(post_content, 'Â£', '£');
Would this be foolish?
Background info (not required to read):
How did this problem happen? I don't know. The blog has been through various updates (including from Wordpress Version 2.1.3, when the default table CHARSET changed from latin1 to utf8) and been migrated to and from various machines, and I guess at some point Wordpress must have written UTF-8 encoded characters into a database that had a CHARSET of latin1, or vice versa. I know I should have been more careful (yes, I have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
How have I ensured that this doesn't happen again? I have made sure my encodings are consistent. All MySQL tables use CHARSET utf-8 and the HEAD section of blog pages set <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
It should be OK. The best approach is the following:
Make a dump of your blog DB
Load it into another DB
Perform the replace on the temporary DB
Check!
If all goes well, perform it on the production DB as well.
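Spelled out with hypothetical names, that checklist could look like this:
# 1. dump the production blog DB
mysqldump wordpress > blog_dump.sql
# 2. load it into a scratch DB
mysql -e "CREATE DATABASE wp_test"
mysql wp_test < blog_dump.sql
# 3. run the replace there and inspect the result
mysql wp_test -e "UPDATE wp_posts SET post_content = REPLACE(post_content, 'Â£', '£')"
# 4. only after checking, run the same UPDATE on production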
Well, I would say that it would probably be the best "solution" to the problem.
As the data has been stored using the wrong encoding somewhere along the line, the original data is lost and there is no real solution. You just have to try to salvage what you can from the corrupt data that you have.
If it's only isolated to a single character, you are lucky. There may be byte codes that didn't translate into any available character; if that happened anywhere, there would be no identifiable character combination, just one character replaced by another, or a missing character. That can only be spotted manually.
Sure: you have data in one encoding and the table in another. You can fix this within MySQL.
Check here
Don't do that!
Use a trigger on update/insert if you really need to.
EDIT: hmm, after reading your situation, I would suggest making a backup copy of the DB and trying what you said. I think it would work, as long as you're not planning to ever do it again (which seems to be the case)
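If one did go the trigger route, a minimal sketch (the wp_posts names come from the question; fixing only this one mojibake pair is an assumption):
-- rewrite the known-bad sequence on the way in, before it is stored
CREATE TRIGGER fix_pound_sign BEFORE INSERT ON wp_posts
FOR EACH ROW SET NEW.post_content = REPLACE(NEW.post_content, 'Â£', '£');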

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents but it fails if the character itself is used (copied from either character map or Word I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø, and we can't seem to do a replace on it (in ASP at least) as the character comes through to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: #markokocic: Collation only dictates the sorting order. Although this should of course match your character set, it does not affect the range of characters that can be stored in a field.
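As a sketch of that suggestion, the connection charset can be declared right after connecting and the effective settings inspected (latin1 is this answer's assumption; use utf8 if that's what the site actually sends):
SET NAMES latin1;                        -- what the client really sends
SHOW VARIABLES LIKE 'character_set_%';   -- verify client/connection/results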
Have you tried setting the collation for the table to utf-8, or something non-latin1/ASCII?