How to clean data with special characters in MySQL - mysql

How can one clean data that looks like this RÃ©ation, lâ€™Oreal to look like this R'action and L'Oreal respectively in MySQL?

That looks like an example of "double encoding": the right hand was talking utf8, but the left hand was listening for latin1. See "Trouble with UTF-8 characters; what I see is not what I stored", and also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .
RÃ©ation -> Réation after undoing the double-encoding.
Yet you say R'action -- I wonder if you were typing é as e' or 'e??
I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)
First, we need to verify that it is actually an ordinary double-encoding.
SELECT col, HEX(col) FROM ... WHERE ...
should give you this for the hex for Réation:
52 E9 6174696F6E -- latin1 encoding
52 C3A9 6174696F6E -- utf8 encoding
52 C383 C2A9 6174696F6E -- double encoding
(Ignore the spacing.)
If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.
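For reference, here is a small Python sketch (not part of the MySQL fix) that reproduces the three hex strings above for "Réation". It assumes Python's latin-1 codec is a close enough stand-in for MySQL's latin1, which holds for é but not for every byte.

```python
# Illustration only: reproduce the three hex forms of "Réation".
text = "Réation"

# One byte per character in latin1.
latin1_hex = text.encode("latin-1").hex().upper()

# Two bytes for é in utf8.
utf8_hex = text.encode("utf-8").hex().upper()

# Double encoding: the utf8 bytes are misread as latin1 characters,
# and those characters are then encoded as utf8 a second time.
double_hex = text.encode("utf-8").decode("latin-1").encode("utf-8").hex().upper()

print(latin1_hex)  # 52E96174696F6E
print(utf8_hex)    # 52C3A96174696F6E
print(double_hex)  # 52C383C2A96174696F6E
```

If the HEX(col) output matches the last value, the data is an ordinary double encoding.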
Now, see if the double-encoding fix will fix it (before fixing it):
SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
BINARY(CONVERT(col USING latin1)) USING utf8mb4)
USING latin1)) USING utf8mb4)
FROM tbl;
You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!
Your case is: CHARACTER SET latin1, but with utf8/utf8mb4 bytes in it; leave the bytes alone while fixing the charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
Do that for each column in each table with the problem.
(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)
From a comment:
CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'Î²' (double-encoded Greek beta, 'β')
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'ÃŽÂ²'
My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess goes beyond simple "double encoding".
If possible, start over, making sure that some test data gets stored correctly before adding more data. If the data is bad, do not try to fix it; back off and start over again. See the "best practice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.

Related

Why is 'João' coming out as 'Jo\u00e3o'?

I have this column in my database table:
`data` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci DEFAULT NULL
Names like 'João' are inserted. But they're showing up as Jo\u00e3o. E.g.:
{"4":"jo\u00e3o da silva"}
I tried changing the character set and the collation, but it didn't seem to help. What can I do in order to fix it?
My database "character set" settings:
First of all, \u00e3 is not generated by MySQL. It is, however, optionally generated by PHP's json_encode(). Be sure to use JSON_UNESCAPED_UNICODE in the second argument to that function.
Meanwhile, those codes are properly interpreted by web browsers, so you won't notice issues there. And reading and writing from and to a database won't change them. But note that any backslash needs to be escaped when INSERTing into a database table.
For use in MySQL tables, I prefer to have the connection and server settings consistently set at utf8mb4 so that Unicode stuff simply comes and goes without conversion.
I agree with "never trust your screen". About the only way to see what is actually stored in the database is to use SELECT HEX(col)... For ã:
UTF-8 (utf8mb4): Hex: C3A3
latin1: Hex: E3
But, for \u00e3, the hex would be 5C7530306533
In PHP, there is bin2hex().
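To see the same effect outside PHP, here is a rough Python analogue. Python's json.dumps() escapes non-ASCII by default, and ensure_ascii=False plays a role similar to JSON_UNESCAPED_UNICODE. The record below is a made-up sample modelled on the question.

```python
import json

# Sample record modelled on the one in the question.
record = {"4": "joão da silva"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
# With ensure_ascii=False the character is kept as-is.
unescaped = json.dumps(record, ensure_ascii=False)

print(escaped)    # {"4": "jo\u00e3o da silva"}
print(unescaped)  # {"4": "joão da silva"}

# The three hex forms discussed above.
utf8_hex = "ã".encode("utf-8").hex().upper()          # C3A3
latin1_hex = "ã".encode("latin-1").hex().upper()      # E3
escape_hex = "\\u00e3".encode("ascii").hex().upper()  # 5C7530306533 (six ASCII characters)
print(utf8_hex, latin1_hex, escape_hex)
```

Either way, the escape is produced by the JSON encoder before the string ever reaches the database.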

Losing data on converting MySQL latin1_swedish_ci to utf8_unicode_ci

When I try to convert data from latin1_swedish_ci to utf8_unicode_ci, I lose data! The TEXT column is cut off at the first special character.
For example:
Becomes:
Yet I tried many ways to convert my column and all solutions end up deleting data at the first special character!
I tried by phpMyAdmin or with this SQL request:
UPDATE `page` SET page_text = CONVERT(cast(CONVERT(page_text USING latin1) AS BINARY) USING utf8);
I also tried this PHP script:
https://github.com/nicjansma/mysql-convert-latin1-to-utf8/blob/master/mysql-convert-latin1-to-utf8.php
Always with the same result: data is lost at the first special character!
What should I do?
UPDATE
I could change the data to utf8 with
ALTER TABLE page CONVERT TO CHARACTER SET utf8mb4;
or
ALTER TABLE page CONVERT TO CHARACTER SET utf8;
without losing data, but special characters are not displayed properly.
Using the PHP function utf8_encode($myvar) does display special characters correctly.
To convert a table, use
ALTER TABLE ... CONVERT TO ...
Or, to change individual columns, use
ALTER TABLE ... MODIFY COLUMN ...
Instead, you seem to have done something different. For further analysis, please provide SELECT col, HEX(col) ... before and after the conversion, plus the conversion used.
See "truncated" in this . The proper fix is found here, but depends on what you see from the HEX.

MySQL Database text encoding errors

I'm trying to clean up a table I inherited. There's a text column with text in languages other than English, and oftentimes the text will look like this: PhÃ©nix
I know that it's supposed to be the French word: phénix
So I guess the Ã© would be a failed encoding for the letter é
Does anyone know why this would happen, and is there any way to fix it? The same encoding errors keep on popping up, so is there something like an alphabet equivalent for these encoding errors that I could use to match up against the correct characters?
thanks
CONVERT(BINARY(CONVERT(CONVERT(BINARY(CONVERT('é' USING latin1)) USING utf8) USING latin1)) USING utf8)
--> 'é'
You have Double-Encoding.
Here's what probably happened.
The client had characters encoded as utf8 (good); and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8 (good).
Let's walk through what happens to e-acute: é.
The hex for that, in utf8 is 2 bytes: C3A9.
SET NAMES latin1 saw it as 2 latin1-encoded characters: Ã and © (hex: C3 and A9).
Since the target was CHARACTER SET utf8, those 2 characters needed to be converted.
Ã was converted to utf8 (hex C383), and © to utf8 (hex C2A9).
So, 4 bytes were stored (hex C383C2A9 for é)
When reading it back out, the reverse steps were performed,
and the end user possibly noticed nothing wrong. What is wrong:
The data stored is 2 times as big as it should be (3x for Asian languages).
Comparisons for equal, greater than, etc. may not work as expected.
ORDER BY may not work as expected.
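The walk-through above can be replayed byte by byte. This is only a Python sketch of the same steps (MySQL itself is not involved), using Python's latin-1 codec as a stand-in for the latin1 connection setting.

```python
original = "é"

# 1. The client sends utf8 bytes.
client_bytes = original.encode("utf-8")   # hex C3A9

# 2. The connection claims latin1, so the two bytes are read as two characters.
misread = client_bytes.decode("latin-1")  # 'Ã©'

# 3. The column is utf8, so each of those characters is encoded again.
stored = misread.encode("utf-8")          # hex C383C2A9

print(client_bytes.hex().upper())         # C3A9
print(misread)                            # Ã©
print(stored.hex().upper())               # C383C2A9
print(len(stored) / len(client_bytes))    # 2.0

# Reading back through the same wrong setting reverses the steps,
# which is why the end user may notice nothing.
read_back = stored.decode("utf-8").encode("latin-1").decode("utf-8")
print(read_back)                          # é
```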
The fix (2 parts):
Be sure to do SET NAMES utf8; (or equivalent, such as mysqli_set_charset($link, 'utf8') in PHP, where $link is your connection).
Something like this will repair your data:
-- Use the column directly; UNHEX() is only needed when testing with a hex literal such as 'C383C2A9'.
UPDATE ... SET col = CONVERT(BINARY(CONVERT(
col USING latin1)) USING utf8);

MYSQL 5.1.61 sorting for Central European languages in utf8

I have a problem sorting a MySQL result.
SELECT * FROM table WHERE something ORDER BY column ASC
The column's collation is utf8_unicode_ci.
As a result, I first get the rows where the column starts with Bosnian letters, and then the others after that:
šablabl
šeblabla
čeblabla
aaaa
bbaa
bbb
ccc
MYSQL version is 5.1.61
Bgi is right. You need to use an appropriate collation. Unfortunately, MySQL doesn't have a Central European Unicode collation yet. MariaDB, the MySQL fork maintained by MySQL's creators, does.
So you can convert your text from utf8 to latin2 and then order with a Central European collating sequence. For example:
SELECT *
FROM tab
ORDER BY CONVERT(text USING latin2) COLLATE latin2_croatian_ci
See this fiddle: http://sqlfiddle.com/#!2/c8dd4/1/0
It is because of the way Unicode is laid out. All the "normal" Latin characters kept the same numerical values they had in ASCII, and characters from other cultures were added after them. That means that if your alphabet has characters other than the 26 regular ASCII ones, they won't appear in the correct order when sorted by Unicode code point.
I think you should try to change the collation on your column (maybe you'll have to change the charset also, but maybe not).
Use a Central European collation.
Good luck !!
If that's really what you see you have found a bug: utf8_unicode_ci is supposed to consider š equivalent to s and č equivalent to c!
In any case it's true that MySQL does not have great support of utf8 collations for Central European languages: you get only Czech, Slovak, and Slovenian. If none of those work for you, I guess you'll have to create your own utf8 collation, or use a non-Unicode character set and use the collations available there.
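To make the difference between the two claims above concrete, here is a rough Python sketch. It compares plain code-point order with a key that drops diacritics, so that š sorts with s and č with c. This only approximates what utf8_unicode_ci is described as doing; it is not MySQL's actual collation algorithm, and the rows "dddd" and "tttt" are added purely for illustration.

```python
import unicodedata

def accent_insensitive_key(text):
    """Sort key that ignores diacritics and case (a crude stand-in for a *_ci collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the caron in š and č.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Rows from the question, plus two illustrative extras ("dddd", "tttt").
rows = ["šablabl", "šeblabla", "čeblabla", "aaaa", "bbaa", "bbb", "ccc", "dddd", "tttt"]
rows = [unicodedata.normalize("NFC", r) for r in rows]

by_code_point = sorted(rows)
by_base_letter = sorted(rows, key=accent_insensitive_key)

print(by_code_point)
# ['aaaa', 'bbaa', 'bbb', 'ccc', 'dddd', 'tttt', 'čeblabla', 'šablabl', 'šeblabla']
print(by_base_letter)
# ['aaaa', 'bbaa', 'bbb', 'ccc', 'čeblabla', 'dddd', 'šablabl', 'šeblabla', 'tttt']
```

Neither order matches the one reported in the question (accented rows first), which is consistent with the suspicion that something other than the collation is going on there.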
Older question and plenty of answers.
Maybe the way I deal with problems will help someone.
I use PDO. My DB is utf-8.
First - my db singleton code (relevant part of it). I set 'SET NAMES' to 'utf8' for all connections.
$attrib_array = array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8');
if (DB_HANDLER)
$attrib_array[PDO::ATTR_ERRMODE] = PDO::ERRMODE_EXCEPTION;
self::$instance = new PDO(DB_TYPE.':host='.DB_HOST.';dbname='.DB_NAME, DB_USER, DB_PASS, $attrib_array);
Second - my sorting looks something like this - collation depends on language (sample shows Polish):
ORDER BY some_column COLLATE utf8_polish_ci DESC
To make things more streamlined I use a constant, which I define in lang translation file, so when file is pulled, proper collation constant is set. Of course I have 'utf8_general_ci' as default. Example:
define('MY_LOCALIZED_COLLATE', 'COLLATE utf8_polish_ci');
Now, my (relevant part of) query looks like this:
" ... ORDER BY some_column " . MY_LOCALIZED_COLLATE . " DESC" ;
Above works in most cases.
If you are missing collation set, you may try to add one yourself.
More detailed info about creating such set - see here: http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
EDIT:
Just one more thing I noticed:
if you have list to sort in e.g. Polish
and you have to force proper collation for sorting (as described above)
and you use e.g. INT column as sorting vector
... then you better have collation set (e.g. to UTF8), or you will get SQL errors, e.g.:
"Syntax error or access violation: 1253 COLLATION 'utf8_polish_ci' is not valid for CHARACTER SET 'latin1'"
... strange, but true

Weird coding type convert to utf8

I have over 1k records in my database with values that look very weird:
Lưu Bích vỠViệt Nam làm liveshow
However, when I view them as UTF-8 they look fine and readable. How do I convert all of these to utf8 at once inside MySQL, so that they look like this:
Lưu Bích về Việt Nam làm liveshow
Any kind of help is greatly appreciated. Thank you!
I'm going to assume the column encoding is utf8. If it's not, change it because latin1 does not have the characters needed for Việt.
At this point what you have in the column is doubly UTF-8 encoded text. If all text is mangled in this same way you can solve this problem by changing the column type first to latin1 text, then to blob, and then to utf8 text. But if some of the data in the column is singly encoded you need to detect the broken values and update only those. This update statement tries to do that:
update mytable set mycolumn = @txt where char_length(mycolumn) =
length(@txt := convert(binary convert(mycolumn using latin1) using utf8));
Alternatively you can define a function that does a "safe" utf-8 conversion, detecting when the original data is OK and returning a converted version only if it's not, and then do the update with that.
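As a rough idea of what such a "safe" conversion could look like, here is a Python sketch rather than a MySQL stored function. It uses a different test from the UPDATE above: it simply tries the reverse conversion and keeps the original value if that fails. All function names are hypothetical. It assumes that MySQL's latin1 behaves like cp1252 with the five undefined bytes mapped to the same-numbered code points; verify that against your server before relying on it.

```python
# Bytes that cp1252 leaves undefined. Assumption: MySQL's latin1 maps them
# to the code points with the same number.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def to_latin1_bytes(text):
    """Encode text the way a latin1 column would; raises UnicodeEncodeError if it cannot."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return bytes(out)

def double_encode(text):
    """Simulate the damage: utf8 bytes misread as latin1 characters (for testing only)."""
    chars = []
    for b in text.encode("utf-8"):
        if b in UNDEFINED_IN_CP1252:
            chars.append(chr(b))
        else:
            chars.append(bytes([b]).decode("cp1252"))
    return "".join(chars)

def safe_repair(text):
    """Undo one layer of double encoding; return the input unchanged if it does not apply."""
    try:
        return to_latin1_bytes(text).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

good = "Lưu Bích về Việt Nam làm liveshow"
print(safe_repair(double_encode(good)) == good)  # True: mangled value is repaired
print(safe_repair(good) == good)                 # True: correct value is left alone
```

One caveat: a correctly stored value that happens to consist only of characters forming valid utf8 when reinterpreted (for example a literal "Ã©") would also be "repaired", so test on a copy first.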