WordPress encoding problem - mysql

I'm having what seems to be a problem related to WordPress, though it could be something else.
Here's what's happening:
I have a blog with posts using utf-8 characters (simple ones like ’). The characters all display correctly currently, however I'm moving my site to another server and seeing problems with all the utf-8 chars (’ becomes ’).
I first thought the problem was with MySQL, but after looking into it it seems not to be the case. I created the new database by doing a synch with Navicat, and have confirmed that both db's and all tables are utf-8. When viewing the data in either db in any SQL program I've tried (Sequel Pro, Navicat) the chars show up unencoded (’). I've tried various synching methods, including ones that others have said solved encoding problems, but they did not work for me.
What confirmed it for me, was setting up a test php script which pulled a single post_content field from each database. In the test script the chars show up encoded (’) regardless of which db they come from.
I checked the apache config file and found that HTTP_ACCEPT_CHARSET is set to the same (ISO-8859-1,utf-8;q=0.7,*;q=0.7) on both systems.
Soooo, I'm left thinking that it's a WordPress issue, though of course I could be wrong.
Any help would be truly appreciated, I’ve been banging my head on this for awhile now ;)
Thanks.

What you are seeing is UTF-8 data being interpreted as if it were ISO-8859-1 (or Win-1252, or another single-byte encoding). Problems like this are almost always a mismatch between the headers being sent to the browser and the actual encoding. Something is telling the browser that the stream is ISO-8859-1 while actually sending UTF-8.

So, I've finally ended up using a plug-in to solve the problem. Here are the steps I took:
Migrate the structure and content of the old database to the new database using Navicat for MySQL (though I think any method of copying will work).
Change the encoding of the columns in the wp_posts table to utf8 using ALTER TABLE 'wp_posts' CHANGE COLUMN 'post_content' 'post_content' longtext CHARACTER SET utf8 NOT NULL after 'post_date_gmt';
Use the ISO to UTF content plug in to convert any non-encoded chars innthe table to utf chars.

Related

Storing swedish characters in mysql database

I'm having problems storing Swedish characters in my MySQL database. I want to store them in my table called users with the collation utf8-bin. Even though I'm using utf8, the characters å ä ö gets stored as Ã¥ ä ö and I don't know why. Retrieving the data and echoing it gives me the same output, with the weird characters instead of å ä ö. Any help is appreciated.
Call
mysql_set_charset("utf8");
After connecting and before making any queries.
Your database charset is just for storage, not for transmission between app and database.
There are several places, where you have to pay attention to the encoding.
Database: you already use an utf8 collation, so that's fine
Database connection: use mysqli_set_charset to set the charset of the connection, if you're using mysqli. Other database drivers have similar functions.
Output encoding of the page: You can use HTTP headers or meta tags. If you want to be on the safe side, specify both.
You should make sure that the database connection uses the Swedish encoding and the encoding of the page output is correct as well. Different encoding causes many of these problems. Read more about character encodings here.
There are several parameters that you have to consider here. For this to work well now and in the future. ALL different interactions with the text has to be in same encoding. Even within db (for joins to work well etc).
The encoding of the data beeing inserted (set in header of page and / or utf_8 encoding when inserted).
The encoding in db tables, (i would recommend utf8_swedish for all)
The encoding of the page viewing results from db (set this in header)
The encoding of the page beeing edited. Its possible to open documents in different encoding. This is a big issue if you are not familiar with it. Open and save documents in right encoding.
There use to be a problem concerning the connection encoding to, set this correct, but today it is a smaller problem than a couple of years ago, because of changes.
A couple of notes. Are you sure your data is stored like that, or just presented wrong, via for instance phpmyademin? Try to print with print utf8_encode($text)
Or, utf8_decode() function, that gives you some insight...

Databases: column encoding, when is it important?

We are importing data from .sql script containing UTF-8 encoded data to MySQL database:
mysql ... database_name < script.sql
Later this data is being displayed on page in our web application (connected to that database), again in UTF-8. But somewhere in the process something went wrong, because non-ascii characters was displayed incorrectly.
Our first attempt to solve it was to change mysql columns encoding to UTF-8 (as described for example here):
alter table wp_posts change post_content post_content LONGBLOB;`
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;
But it didn't helped.
Finally we solved this problem by importing data from .sql script with additional command line flag which as I believe forced mysql client to treat data from .sql script as UTF-8.
mysql ... --default-character-set=utf8 database_name < script.sql
It helped but then we realized that this time we forgot to change column encoding to utf8 - it was set to latin1 even if utf-8 encoded data was flowing through database (from sql script to application).
So if data obtained from database is displayed correctly even if database character set is set incorrectly, then why the heck should I bother setting correct database encoding?
Especially I would like to know:
What parts of database rely on column encoding setting? When this setting has any real meaning?
On what occasions implicit conversion of column encoding is done?
How does trick with converting column to binary format and then to the destination encoding work (see: sql code snippet above)? I still don't get it.
Hope someone help me to clear things up...
The biggest reason, in my view, is that it breaks your DB consistency.
it happens way to often that you need to check data in the database. And if you cannot properly input UTF-8 strings coming from the web page to your MySQL CLI client, it's a pity;
if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
if you need to build a report on your data, then you're greatly limited by the number of possible choices, given only web is producing your the correct output;
if you need to deliver a partial database extract to your partner or external company for analysis, and extract is messed up — it's a pity.
Now to your questions:
When you ask database to ORDER BY some column of string data type, then sorting rules takes into account the encoding of your column, as some internal trasformation are applicable in case you have different encodings for different columns. Same applies if you're trying to compare strings, encoding information is essential here. Encoding comes together with collation, although most people don't use this feature so often.
As mentioned, if you have any set of columns in different encodings, database will choose to implicitly convert values to a common encoding, which is UTF8 nowadays. Strings' implicit encoding might be done in the client frameworks/libraries, depending on the client's environment encoding. Typically data is recoded into the database's encoding when sent to the server and back into client's encoding when results are delivered.
Binary data has no notion of encoding, it's just a set of bytes. So when you convert to binary, you're telling database to “forget” encoding, although you keep data without changes. Later, you convert to the string enforcing the right encoding. This trick helps if you're sure that data physically is in UTF-8, while by some accident a different encoding was specified.
Given that you've managed to load in data into the database by using --default-character-set=utf8 then there was something to do with your environment, I suggest it was not UTF8 setup.
I think the best practice today would be to:
have all your environments being UTF8 ready, including shells;
have all your databases defaulting to UTF8 encoding.
This way you'll have less field for errors.

Reference for converting UTF8 in a MySql DB

Wonder if anyone can help.
I recently had an issue with UTF8 in the Database and pages of a bespoke CMS I inherited. Going forward that's all sorted now, the code and DB has been changed to cater and properly convert, however I have an issue in that existing entries in the DB are obvioulsy sat there in the old character format and I need to convert all those.
Eg Ķ, ī
I was going to run an replace in the mysql DB to replace all these, but what I could do with is knowing what all these weird characters translate to eg ó.
Can anyone recommend a good table/reference to look at ? I have been searching but can't seem to come up with the right thing.
If I understand right these are two byte UTF8 characters.
Thanks
Try running these values in utf8_decode.
It looks like they've been valid, then utf8_encode'd.
If that's the case, try running a loop and update these rows.

Is it OK to fix a character encoding error using SQL REPLACE?

I have a (Wordpress) blog and some of my older posts have a character encoding problem where £ displays as £ (i.e. a pound sign prepended with a capital 'A' with a hat on).
The problem is at the DB level, so I was going to run the following SQL statement:
update wp_posts set post_content = replace(post_content, ‘£’, ‘£’);
Would this be foolish?
Background info (not required to read):
How did this problem happen? I don't know. The blog has been though various updates (including from Wordpress Version 2.1.3 when the default table CHARSET changed from latin1 to utf8) and been migrated to and from various machines and I guess at some point Wordpress must have written UTF-8 encoded characters into the Database that had a CHARSET of latin1, or vice-versa. I know I should have been more careful (yes I have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)).
How have I ensured that this doesn't happen again? I have made sure my encodings are consistent. All MySQL tables use CHARSET utf-8 and the HEAD section of blog pages set <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
It should be ok. The best thing is the following:
Make a dump of your blog db
Load it to another db
Perform the replace on the temporary db
Check!
If all goes well, perform it on the production db as well.
Well, I would say that it would probably be the best "solution" to the problem.
As the data has been stored using the wrong encoding somewhere along the line, the original data is lost and there is no real solution. You just have to try to salvage what you can from the corrupt data that you have.
If it's only isolated to a single character, you are lucky. There may be byte codes that didn't translate into any available character, so if that happened anywhere you wouldn't have a character combination that is possible to identify, you would just have a character replaced by another or a missing character. It would only be possible to spot that manually.
Sure you have data in one encoding and the table with another one. You can fix this within mysql.
Check here
Don't do that!
Use a trigger on update/insert if you really need to.
EDIT: hmm, after reading your situation, I would suggest making a backup copy of the DB and trying what you said. I think it would work, as long as you're not planning to ever do it again (which seems to be the case)

Croatian diacritic signs in MySQL db (utf-8)

Diacritic signs http://img98.imageshack.us/img98/3383/dijakritickiznakovi.gif
So, symbols belows display title should be displayed that way.
UTF-8 entities are listed below HTML (utf-8) title (here is list: LINK)
And last line shows what is stored in my database.
Collation of db table is utf8_unicode_ci.
I suppose that symbols in db shouldn't be as they are in my case?
They are displaying correctly on page when loaded from database, but they all of them are not displayed by utf-8 table from given link. Even if I see them correctly maybe someone other won't?
Setting the MySQL table charset is not enough - you should also take care to set the correct charset for the client, the connection and the results, which defaults may differ from server to server making your database less than portable: the same database content might be displayed differently moving to another server.
I've been storing slovenian text into MySQL for some time now and this is what works for me:
the first thing you do after connecting should be to issue a "SET NAMES utf8" query
make sure that the strings you're storing are utf-8 to start with: if you're taking them from a web page form make sure the page is UTF-8
be careful what tools do you use to browse/edit the database contents online: PhpMysqlAdmin is definitely unsafe.
Hope this helps.
You appear to be trying to store HTML-encoded strings in your database. Don't do that, it will only break your ability to do string operations like searching reliably. You should be able to store raw UTF-8 encoded characters as bytes in your database.
You don't say what environment you're using to read the database or how you get the ‘incorrect’ string at the bottom (which is UTF-8 bytes read using ISO-8859-1 encoding). If they appear in your web page (and you're specifying UTF-8 in the headers and/or <meta> tag), you're presumably pretty much there.