Reference for converting UTF-8 in a MySQL DB

Wonder if anyone can help.
I recently had an issue with UTF-8 in the database and pages of a bespoke CMS I inherited. Going forward that's all sorted now: the code and DB have been changed to handle and properly convert UTF-8. However, the existing entries in the DB are obviously still sitting there in the old character format, and I need to convert all of those.
E.g. Ķ, ī
I was going to run a replace in the MySQL DB to fix all of these, but what I could really use is a reference showing what each of these weird sequences translates to, e.g. what ó stands for.
Can anyone recommend a good table/reference to look at? I have been searching but can't seem to come up with the right thing.
If I understand right, these are two-byte UTF-8 characters: the two bytes of one UTF-8 character are being displayed as two separate latin1 characters (ó, for instance, is the byte pair 0xC3 0xB3, which reads as ó in latin1).
Thanks

Try running these values through utf8_decode.
It looks like they were valid UTF-8 and were then utf8_encode'd a second time, i.e. double-encoded.
If that's the case, try running a loop over the affected rows and updating each one with the decoded value.
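If the data really is double-encoded like that, the same repair can also be done directly in MySQL without a PHP loop. A minimal sketch, assuming a hypothetical table articles with a utf8 text column body (substitute your own names, and test on a copy first):
-- Read the utf8 text back as latin1 to recover the original raw bytes,
-- then reinterpret those bytes as the UTF-8 they were all along.
UPDATE articles
SET body = CONVERT(BINARY CONVERT(body USING latin1) USING utf8);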

Related

MySQL encoding weird characters

Like many people before me, I have a problem in MySQL with the encoding of my data.
More specifically, the collation of the table seems to be utf8_general_ci. The data is inserted fine, but when a select is done, some characters come back mangled:
Marie-Thérèse becomes Marie-ThÃ©rÃ¨se.
Is it possible to do a select that translates these characters back to their original values, or is that impossible? Changing the original table is difficult in my case, so I'd rather solve it in my select query.
When using phpMyAdmin (or the like) and looking at those entries, do they look okay?
Update: if not, the inserts are probably already flawed, and the connection used by the insertion script must be adapted.
If so, then it's not technically MySQL's fault but the software connecting to it. See for example: UTF-8 all the way through. You have to set some parameters on/after opening the connection.
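For example, the parameter in question is the connection character set; as raw SQL it is a single statement issued right after the connection is opened (most client libraries offer a dedicated call that does the same thing):
-- tell MySQL that this client sends and expects utf8 on this connection
SET NAMES utf8;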
By the way, the collation should be irrelevant: http://dev.mysql.com/doc/refman/5.7/en/charset-general.html
The gist is: a collation tells you how to order/compare strings, which mainly matters for special characters like äöü in German or àéô in French, because a local/regional collation may say that ä is, for ordering purposes, exactly like a, while in another collation ä could sort distinctly after a or even after z.
In the end it seems the problem was with running it all through a cron job.
We run a script through a cronjob that generates the insert statements. Apparently, when running the script manually everything goes well, but when running the same script through the cronjob the data gets messed up. We solved it with this article: http://www.logikdev.com/2010/02/02/locale-settings-for-your-cron-job/
We had to add a LANG variable to the /etc/environment file.
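For reference, the fix from that article boils down to one line in /etc/environment so that cron jobs run under a UTF-8 locale; the exact locale value below is an assumption, so pick one that actually exists on your system:
LANG=en_US.UTF-8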

Liferay does not display UTF8 characters anymore

I just did a database restore from MySQL Workbench and found out that Liferay no longer displays UTF-8 special characters, e.g. ÅÄÖ; these letters are displayed as question marks instead.
I wonder if anyone knows the solution to this issue? Do I have to specify a charset while importing the SQL files? And if so, how do I do that in MySQL Workbench?
To be honest, I have no idea if the MySQL restore has a direct effect on what happened; I'm just describing what I did before the issue occurred.
If you restore to a new database, make sure that the database's default character set is UTF-8:
create database lportal character set utf8;
Then import your data into that database.
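You can verify what the restored database actually defaults to with a quick check (lportal being the database name from the statement above):
-- shows the default charset/collation the restore ended up with
SELECT default_character_set_name, default_collation_name
FROM information_schema.schemata
WHERE schema_name = 'lportal';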
Let me also use this opportunity to link my favourite site for generating great UTF-8 test data: http://www.fliptitle.com - great if you need test data for people who know only ASCII languages but still need immediate feedback on correct encoding, with data they're able to interpret. You don't seem to be one of them, but others in this group might stumble upon this answer later.

Accent insensitive search on a problematic database

I have a database that contains data in different languages. Some languages use accents (like áéíóú), and I need to search this data as if the accents didn't exist (a search for 'campeon' should return 'campeón' as a valid result).
The problem is that the tables in my database (utf8_unicode_ci) are not storing proper UTF-8 characters. If you view the data through phpMyAdmin, the words with accents look like this: campeón
After some research, I found (in a Stack Overflow question) that the problem is related to the absence of a SET NAMES [charset] statement. In fact, I've done some testing, and if I set names to utf8, everything works as expected.
Well, if I have the solution, what's the problem? The problem is that the database is in production, so there are thousands of strings in it. If I change the character set the client uses, all the already existing strings will become invalid. The question is: is there any way to:
perform accent-insensitive searches in a database that uses a wrong charset like mine?
safely transform the data in the tables to the appropriate charset?
continue working with mixed charsets (latin1 and utf8) in the database, accepting that the latin1 data will not be accent-insensitive?
If anybody has experience with any of the solutions I propose, or has a new one, I'll be very thankful if you share it.
Since the problem is that the data was inserted using the wrong connection encoding, you can fix it by:
Exporting the data using the wrong connection encoding, just as you have used it so far, followed by
Importing the data using the correct utf8 connection encoding.
That will fix the encoding problem, after which search will work as expected.
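A hedged sketch of that round trip using the stock command-line tools (user, db, and dump.sql are placeholders; take a backup first). Exporting over a latin1 connection writes the stored mojibake out as real UTF-8 bytes, and re-importing over a utf8 connection reads them back correctly:
# export with the 'wrong' (latin1) connection encoding, without embedding SET NAMES in the dump
mysqldump --default-character-set=latin1 --skip-set-charset -u user -p db > dump.sql
# re-import with the correct utf8 connection encoding
mysql --default-character-set=utf8 -u user -p db < dump.sql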
What if you create a copy of the table at the beginning of your session, fix the copy's charset, run all your queries against the copy, and then drop it at the end of the session (see the sketch below)? I don't know how practical this would be; it depends on how often you need to run these queries and how big the table is.
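A sketch of that idea in SQL, assuming a hypothetical table words with a single VARCHAR(255) utf8 column w holding the mis-stored text (adjust names and lengths to your schema):
-- throwaway copy for this session
CREATE TABLE words_tmp LIKE words;
INSERT INTO words_tmp SELECT * FROM words;
-- three steps: recover the raw bytes, drop the wrong charset label, relabel as utf8
ALTER TABLE words_tmp MODIFY w VARCHAR(255) CHARACTER SET latin1;
ALTER TABLE words_tmp MODIFY w VARBINARY(255);
ALTER TABLE words_tmp MODIFY w VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
-- utf8_unicode_ci compares accent-insensitively, so this matches 'campeón'
SELECT * FROM words_tmp WHERE w = 'campeon';
DROP TABLE words_tmp;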

Erlang Emysql encoding difference between prepared and regular Query

I previously wrote a question about Emysql encoding, which got a correct answer here.
That answer pinpointed another question...
I'm trying to store iPhone emojis in a database...
When I do:
Query = io_lib:format("UPDATE Users SET c=\"~s\" WHERE id=~B", [C, Id]),
emysql:execute(mydb, Query).
Everything works fine...
But with:
emysql:prepare(update_c, <<"UPDATE Users SET c=? WHERE id=?">>),
emysql:execute(mydb, update_c, [C, Id]).
I'm getting back mojibake. (Edited to use the correct term.)
I'm connecting with:
emysql:add_pool(mydb, 3, "login", "password", "db.mydomain.com", 3306, "MyTable", latin1)
Unfortunately, I cannot simply use utf8, because the previous software that used this database stored emojis over the latin1 connection. If I switch to utf8, it works with rows inserted by the new system, but not with rows inserted by the old one.
EDIT:
I would really like to use prepared statements, since they effectively prevent SQL injection.
Edit: this should be fixed in 253b7f94f9b04526e6868d7b693e6e9ee41de374. Thanks for the feedback.
https://github.com/Eonblast/Emysql/commit/253b7f94f9b04526e6868d7b693e6e9ee41de374
I believe it's an error in Emysql, and I think I've fixed it. I'm still working out the unit tests so it all makes sense. I'll let you know when it's posted to GitHub.
I opened an issue for this: https://github.com/Eonblast/Emysql/issues/24
Essentially, you are tricking the driver and the database, because you open the connection as latin-1 while the database is utf-8, and then you trip over the automatic conversion.
Still, I think you are right that the driver should respect that you set the connection to latin-1 and not do the magic automatic conversion to utf-8. If you read issue #14 at Eonblast/Emysql on GitHub, you'll find I always suspected automatic conversion was a bad idea.
However, just from the fact that the unit tests for the conversions are now blowing up by a factor of four (and pose some rather uninteresting but mind-boggling fringe issues I can't get my head around), I think tricking the database the way you do is likewise a bad idea. If you can, you should clean this up rather than rely on the mechanics in between to hold. There are multiple levels in MySQL where conversions occur: as you know, you can set a character set on the connection, on the database, and also on the table. It's a great way to produce bugs. Can you describe why you couldn't clean it up? Because you have no control over the old data and must act blind to its encoding? I'd like to know if there is a real case where you can't live without this hack.
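As an aside, MySQL will show you all of those levels at once, which makes such mismatches easy to spot; for example:
-- connection-level settings: character_set_client / _connection / _results
SHOW VARIABLES LIKE 'character_set%';
-- storage-level settings: table and column charsets
SHOW CREATE TABLE Users;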
Regardless, your complaint about setting the connection to latin-1 probably showed the way to eliminating all or most of the guessing in Emysql's character conversions. That's very much appreciated, and I hope I'll have a solution for you later today.
Henning
Just convert your table to UTF-8:
ALTER TABLE Users CONVERT TO CHARACTER SET utf8;
Then you can use utf-8 with new data, and the old data will have been converted to UTF-8 as well.

WordPress encoding problem

I'm having what seems to be a problem related to WordPress, though it could be something else.
Here's what's happening:
I have a blog with posts using UTF-8 characters (simple ones like ’). The characters all display correctly at the moment, but I'm moving my site to another server and am seeing problems with all the UTF-8 chars (’ becomes ’).
I first thought the problem was with MySQL, but after looking into it, that seems not to be the case. I created the new database by doing a sync with Navicat, and I have confirmed that both DBs and all tables are utf-8. When viewing the data in either DB in any SQL program I've tried (Sequel Pro, Navicat), the chars show up correctly (’). I've tried various syncing methods, including ones that others said solved their encoding problems, but they did not work for me.
What confirmed it for me was setting up a test PHP script that pulls a single post_content field from each database. In the test script the chars show up mangled (’) regardless of which DB they come from.
I checked the Apache config file and found that HTTP_ACCEPT_CHARSET is set to the same value (ISO-8859-1,utf-8;q=0.7,*;q=0.7) on both systems.
Soooo, I'm left thinking it's a WordPress issue, though of course I could be wrong.
Any help would be truly appreciated; I've been banging my head on this for a while now ;)
Thanks.
What you are seeing is UTF-8 data being interpreted as if it were ISO-8859-1 (or Windows-1252, or another single-byte encoding). Problems like this are almost always a mismatch between the headers being sent to the browser and the actual encoding: something is telling the browser that the stream is ISO-8859-1 while actually sending UTF-8.
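A quick way to check is to look at the Content-Type response header the new server sends; for a UTF-8 page it should read:
Content-Type: text/html; charset=UTF-8
If Apache is forcing a different charset, look for an AddDefaultCharset directive in its configuration and set it to UTF-8 (or remove it and let the application declare the charset itself).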
So, I finally ended up using a plugin to solve the problem. Here are the steps I took:
Migrate the structure and content of the old database to the new database using Navicat for MySQL (though I think any method of copying will work).
Change the encoding of the columns in the wp_posts table to utf8 using ALTER TABLE `wp_posts` CHANGE COLUMN `post_content` `post_content` LONGTEXT CHARACTER SET utf8 NOT NULL AFTER `post_date_gmt`;
Use the ISO to UTF content plugin to convert any remaining mis-encoded chars in the table to UTF-8 chars.