Charset Problem while migrating database - mysql

I have a custom made CMS that I must migrate to work on Wordpress. Everything worked fine except the charset module.
Since this is for a Romanian blog, the content uses some special characters (ă, î, ș, â, Ț). When I insert this content into wp_posts, Wordpress displays them as "?".
I've tried all kinds of things, like changing the charset from utf8 to latin1, latin2, and so on, but with no result.
Even worse, when I try to replace those special characters with plain ones (e.g. ă to a, î to i), nothing happens; the content stays the same (a few characters do get changed, but not all).
What am I doing wrong, and what should I do to get this right?
Thanks!

Character sets are a complete nightmare. What I'd do is use mysqldump to dump your database to an SQL file, then check whether the special characters still look right.
Then, using find and replace in a text editor, replace every special character with the correct HTML entity, e.g. Ă becomes &#258;.
http://meta.wikimedia.org/wiki/Help:Romanian_characters
Then delete your database, set all conceivable settings to utf-8, and import your dump.
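A sketch of the dump-and-reload step, assuming the data really is UTF-8 and using blog_db as a placeholder database name; forcing the character set on both commands avoids an implicit latin1 round trip:
mysqldump --default-character-set=utf8 blog_db > dump.sql
mysql --default-character-set=utf8 blog_db < dump.sql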
Wordpress also has an extensive article about character encodings.
Good luck!

Related

Is there a special code for a half symbol?

Because my database is about postage stamps, there are a lot of half symbols in the descriptions. They type into MySQL with no problem, but when I get the results of a query, all the half symbols have been replaced by a question mark.
Is there a special code I should be using when I input the descriptions, instead of the half symbol itself? Otherwise, is there another solution, like changing the character set? I'm using utf-8 at the moment.
It seems like you are facing a character encoding issue. You should use UTF-8 everywhere:
Make sure the column that contains the text data uses a utf8 character set (i.e. a utf8_... collation).
Check that when you get information from the database, you keep this encoding. You can force it by sending SET NAMES utf8; before any other statement to MySQL.
Check that when you display this information the encoding is UTF-8 (in a webpage, that means <meta charset='utf-8'> in <head>).
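A minimal SQL sketch of the first two points (stamps and description are placeholder names; ½ is U+00BD, which utf8 handles fine):
ALTER TABLE stamps MODIFY description TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
SET NAMES utf8;
INSERT INTO stamps (description) VALUES ('½ penny red, 1870');
SELECT description FROM stamps;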

mysql - How to save ñ

Whenever I try to save ñ it becomes ? in the MySQL database. After some reading, it is suggested that I have to change my JSP charset to UTF-8. For some reasons I have to stick to ISO-8859-1. My database table encoding is latin1. How can I fix this? Please help.
Go to your database administration tool, MySQL Workbench for example, set the engine to InnoDB and the collation to utf8 / utf8_general_ci.
You state in your question that you require an ISO-8859-1 backend (latin1) and a Unicode (UTF-8) frontend. This setup is problematic, because the character set on the frontend is much larger than the one allowed in the database. The sanest thing would be to use the same encoding throughout the software stack, but using Unicode for storage only would also make sense.
As you should know, a String is a human concept for a sequence of characters. In computer programs, a String is not that: it can be viewed as a sequence of characters, but it's really a pair data structure: a stream of bytes and an encoding.
Once you understand that passing a String is really passing bytes and a scheme, let's see who sends what:
Browser to HTTP server: usually the same encoding as the page containing the form, so UTF-8. The encoding is specified via the Content-Type header; if it is missing, the server will pick one based on its own strategy, for example defaulting to ISO-8859-1 or to a configuration parameter.
HTTP server to Java program: it's Java to Java, so the encoding doesn't matter, since we pass String objects.
Java client to MySQL server: the Connector/J documentation is quite convoluted; it uses the character_set_server system variable, possibly overridden by the characterEncoding connection parameter.
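As a hedged illustration (host, port and database name are assumptions), those Connector/J settings usually end up on the JDBC URL:
jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8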
To understand where the problem lies, first make sure the column is really stored as latin1:
SELECT character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = :DATABASE
AND table_name = :TABLE
AND column_name = :COLUMN;
Then write the Java string you get from the request to a log file:
logger.info(request.getParameter("word"));
And finally see what actually is in the column:
SELECT HEX(:column) FROM :table
At this point you'll have enough information to understand the problem. If it's really a question mark (and not a replacement character) likely it's MySQL trying to transcode a character from a larger set (let's say Unicode) to a narrower one which doesn't contain it. The strange thing here is that ñ belongs to both ISO-8859-1 (0xF1, decimal 241) and Unicode (U+00F1), so it'd seem like there's a third charset (maybe a codepage?) involved in the round trip.
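For reference, a quick check of what the HEX() diagnostic should return for ñ under each encoding (assuming the connection charset is configured so the literal arrives intact):
SELECT HEX(CONVERT('ñ' USING latin1)) AS latin1_bytes,  -- returns F1
       HEX(CONVERT('ñ' USING utf8)) AS utf8_bytes;      -- returns C3B1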
More information may help (operating system, HTTP server, MySQL version)
Change your DB table content encoding to UTF-8.
Here's the command for converting the whole database:
ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
And this one converts a single table:
ALTER TABLE db_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Change your table collation to utf8_spanish_ci, in which ñ is not equal to n. If you want both characters to be treated as equal, use utf8_general_ci instead.
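A small demonstration of the difference, run over a utf8 connection:
SET NAMES utf8;
SELECT 'ñ' = 'n' COLLATE utf8_general_ci;   -- 1: treated as equal
SELECT 'ñ' = 'n' COLLATE utf8_spanish_ci;   -- 0: treated as distinct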
I tried several combinations, but this works for me:
VARCHAR(255) BINARY CHARACTER SET utf8 COLLATE utf8_bin
When the data is retrieved in dbForge Express, it shows as:
NIÃ‘A
but in the application it shows as:
NIÑA
I had the same problem. I found out that it is not an issue with UTF-8 encoding or any particular charset. I imported my data from Windows ANSI, and all my Ñ and ñ were put into the database perfectly, as they should be. For example, last names showed in the database as last_name = "MUÑOZ". I was able to select normally from the database with the query SELECT * FROM database WHERE last_name LIKE "%muñoz%", and phpMyAdmin showed the results fine: it selected all "MUÑOZ" and "MUNOZ" without a problem. So phpMyAdmin shows all my Ñ and ñ without any issues.
The problem was the program itself. All the characters mentioned showed up, as you describe, with the funky "MU�OZ" question mark. I had followed all the advice everywhere: set my headers correctly and tried every charset available. I even used Google Fonts and whatever other fonts were available to display those last names correctly, but with no success.
Then I remembered an old program that was able to do the trick back and forth transparently, and I peeked into its code to figure it out: the database content itself was the problem. Remember, I uploaded using Windows ANSI encoding, and phpMyAdmin did as expected and uploaded everything as instructed.
The old program fixed this problem by translating the Ñ to its Unicode HTML entity, &#209; (see the chart here: https://www.compart.com/en/unicode/U+00D1), a process done back and forth between MySQL and the app.
So you just need to change your database strings containing the letters Ñ and ñ to their corresponding entities so they reflect correctly in your browser with a UTF charset.
In my case, I solved my issues by replacing all Ñ and ñ with their corresponding entities in all the last names in my database.
UPDATE table_name
SET last_name = REPLACE(last_name, 'MUÑOZ', 'MU&#209;OZ');
Now I'm able to display, browse, and even search all my last names correctly, with the accents/tildes proper to the Spanish language. I hope this helps. It was a pain to figure out, but an old program solved the problem. Best regards and happy coding!

Removing strange characters from MySQL data

Somewhere along the way, between all the imports and exports I have done, a lot of the text on a blog I run is full of weird accented A characters.
When I export the data using mysqldump and load it into a text editor with the intention of using search-and-replace to clear out the bad characters, searching just matches every "a" character.
Does anyone know any way I can successfully hunt down these characters and get rid of them, either directly in MySQL or by using mysqldump and then reimporting the content?
This is an encoding problem; the Â is a UTF-8 encoded non-breaking space (HTML entity &nbsp;) being displayed as Latin1.
You might try something like this... first we check to make sure the matching is working:
SELECT * FROM some_table WHERE some_field LIKE BINARY '%Â%'
This should return any rows in some_table where some_field has a bad character. Assuming that works properly and you find the rows you're looking for, try this:
UPDATE some_table SET some_field = REPLACE( some_field, BINARY 'Â', '' )
And that should remove those characters (based on the page you linked, you don't really want an nbsp there, as you would end up with three spaces in a row between sentences, etc.; you should only have one).
If it doesn't work then you'll need to look at the encoding and collation being used.
EDIT: Just added BINARY to the strings; this should hopefully make it work regardless of encoding.
The accepted answer did not work for me.
From here http://nicj.net/mysql-converting-an-incorrect-latin1-column-to-utf8/ I found that the binary code for the Â character is c2a0 (by converting the column to VARBINARY and looking at what it turns into).
Then here http://www.oneminuteinfo.com/2013/11/mysql-replace-non-ascii-characters.html I found the actual solution to remove (replace) it:
update entry set english_translation = unhex(replace(hex(english_translation),'C2A0','20')) where entry_id = 4008;
The query above replaces it with a space (hex 20); a normal TRIM() can then be applied, or you can simply replace it with '' instead.
I have had this problem and it is annoying, but solvable. As well as Â, you may find you have a whole load of characters showing up in your data like these:
“
This is connected to encoding changes in the database, but so long as you do not have any of these characters in your database that you want to keep (e.g. if you are actually using a Euro symbol) then you can strip them out with a few MySQL commands as previously suggested.
In my case I had this problem with a Wordpress database that I had inherited, and I found a useful set of pre-formed queries that work for Wordpress here http://digwp.com/2011/07/clean-up-weird-characters-in-database/
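For illustration only, the queries described there follow this pattern; wp_posts and post_content are the standard WordPress names, but which mojibake sequences you need to target depends on your own data:
UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€œ', '"');
UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€™', "'");
UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€“', '-');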
It's also worth noting that one of the causes of the problem in the first place is opening a database dump in a text editor, which might change the encoding in some way. So if you can manipulate the database using MySQL only, and not a text editor, this will reduce the risk of causing further trouble.

Saving special characters to MySQL database

I am currently having an issue saving special characters (® & ©) to a MySQL database.
On a local stack (local to my development machine) the operation runs smoothly and what is entered in the front-end gets saved as what is expected.
With the same front-end code on the centralised server, when saving to the DB the character is preceded by another character: it is saved as Â®.
This isn't an issue when viewing the record on the front-end, as it reverses what it does and displays correctly.
The issue comes from another, separate system which utilises the same database and doesn't do any alteration, so it appears as Â® instead of simply ®.
The reason for my confusion is that it seems to work as required on my local stack but not on the centralised server.
I am looking for ideas as to what to look for which might be causing this on the servers configuration compared to my local.
Thanks in advance
Mark
@Pranav Hosangadi (thanks) covers three areas to check for consistency of encoding. The following solution adds to that. It may also be worth considering (a variation of) @Soaice Mircea's answer (thanks also) for scenarios where this answer doesn't fix the problem, although that wasn't needed when I reproduced and solved your problem. @Pranav's line of thinking seems to be the successful one here, as it's about consistently using one character set everywhere rather than any particular one.
Five things to do:
Ensure the database and its tables use the same charset throughout. Inspect this in phpMyAdmin, for example, and note that charset for use below. (The examples assume latin1, whose default collation is latin1_swedish_ci; substitute utf8 if that is what your database uses. Note that the places below take a charset name, not a collation name.)
Use the PHP header() function with the database charset, e.g.:
header('Content-Type: text/html; charset=ISO-8859-1');
Insert a meta tag in the HTML header, e.g.:
<meta http-equiv="content-type"
content="text/html;charset=ISO-8859-1">
Add accept-charset to the form tag:
<form action="testsubmit.php" method="post" accept-charset="ISO-8859-1">
Set the charset of the MySQL connection, e.g.:
$con = mysql_connect("localhost","test","test");
mysql_set_charset("latin1", $con);
After you select the DB in your PHP code, add this:
mysql_query("set names 'utf8'");
Also check that the column where you store the text uses utf8 encoding.
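To check the column encoding from SQL (your_table is a placeholder), the Collation column of this output shows the character set each text column actually uses:
SHOW FULL COLUMNS FROM your_table;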

Where did I go wrong with this unicode field in MySQL?

I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have Unicode characters such as \xae and \u2122 (the registered symbol and the trademark symbol, respectively).
For example, suppose I have a row with a field containing this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"Bing® Blang™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place but both are preceded by the unicode character \xae which is an A with a circumflex on top.
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has the characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text actually has U+99 as the character after "Blang": when you set MySQL to output UTF8, you see "Â™", which is the UTF-8 sequence for U+99 displayed on a terminal that is interpreting the byte stream as Windows-1252.
U+99 probably isn't what you wanted: in Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that on the command line, as you've found, or by executing the statement SET NAMES utf8; on your database handle before doing anything else. Some database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but it sets all the required character set variables (there are three!) at once.
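You can see exactly what that statement changes by inspecting the session variables afterwards:
SET NAMES utf8;
SHOW VARIABLES LIKE 'character_set_%';
-- character_set_client, character_set_connection and character_set_results should all now report utf8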
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or a similar library to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming it is UTF-16; that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know whether it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading data back from the database with the user's character set in effect will lose information if other rows hold data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another, but it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. It is probably just defaulting to Latin1, which, as noted above, will really be Windows-1252.
Nonetheless, even if you fix the display, note that the database has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data by reading it all and replacing any characters in the range U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally, then this data is, alas, just junk.
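A hedged sketch of that cleanup for the trademark case specifically, reusing the HEX/UNHEX trick shown earlier on this page (some_table and some_field are placeholders; C299 is the UTF-8 encoding of U+0099 and E284A2 is the UTF-8 encoding of ™). As with the C2A0 example above, this matches against hex text, so review the affected rows before running it:
UPDATE some_table
SET some_field = UNHEX(REPLACE(HEX(some_field), 'C299', 'E284A2'));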
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column was "defaulted", nor does it keep it in sync with the table's default.)
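Side by side, for reference (foo is the placeholder table name from above):
ALTER TABLE foo CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;  -- converts existing columns and their data
ALTER TABLE foo CHARACTER SET utf8 COLLATE utf8_general_ci;             -- only changes the table default for future columns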
I think it has to do with the settings of the MySQL connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on which MySQL connection library you are using.
In the case of MySQLdb it should be something like this:
import new  # Python 2 module used to build the bound method
def character_set_name(*args, **kwargs):
    return 'utf-8'
conn.character_set_name = new.instancemethod(character_set_name, conn, conn.__class__)
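As a hedged alternative, MySQLdb can usually be told the charset directly when connecting, which avoids patching the connection object (the host, user, password and database name below are placeholders):
conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb", charset="utf8", use_unicode=True)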
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci