I'm trying to clean up a table I inherited. It has a text column containing text in languages other than English, and the text often looks like this: Phénix
I know that it's supposed to be the French word: phénix
So I guess the é would be a failed encoding for the letter é
Does anyone know why this would happen, and is there any way to fix it? The same encoding errors keep on popping up, so is there something like an alphabet equivalent for these encoding errors that I could use to match up against the correct characters?
thanks
CONVERT(BINARY(CONVERT(CONVERT(BINARY(CONVERT('é' USING latin1)) USING utf8) USING latin1)) USING utf8)
--> 'é'
You have Double-Encoding.
Here's what probably happened.
The client had characters encoded as utf8 (good); and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8 (good).
Let's walk through what happens to e-acute: é.
The hex for that, in utf8 is 2 bytes: C3A9.
SET NAMES latin1 saw it as 2 latin1-encoded characters à and © (hex: C3 and A9)
Since the target was CHARACTER SET utf8, those 2 characters needed to be converted.
à was converted to utf8 (hex C383) and © (hex C2A9)
So, 4 bytes were stored (hex C383C2A9 for é)
When reading it back out, the reverse steps were performed,
and the end user possibly noticed nothing wrong. What is wrong:
The data stored is 2 times as big as it should be (3x for Asian languages).
Comparisons for equal, greater than, etc. may not work as expected.
ORDER BY may not work as expected.
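The é walk-through above can be reproduced outside MySQL. This is a minimal Python sketch, not what the server literally executes: it uses Python's latin-1 codec as a stand-in for MySQL's latin1 (enough for é), and the variable names are made up for illustration.

```python
# Reproduce the double-encoding of 'é' step by step.
original = "\u00e9"  # é

# Step 1: the client really sends UTF-8 (2 bytes: C3 A9).
client_bytes = original.encode("utf-8")

# Step 2: the server was told "latin1", so it sees two characters: Ã and ©.
misread = client_bytes.decode("latin-1")

# Step 3: the column is utf8, so each of those characters is converted to UTF-8.
stored_bytes = misread.encode("utf-8")

print(client_bytes.hex().upper())             # C3A9
print(stored_bytes.hex().upper())             # C383C2A9
print(len(stored_bytes) / len(client_bytes))  # 2.0
```

The last line shows the "2 times as big" effect for a 2-byte character.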
The fix (2 parts):
Be sure to do SET NAMES utf8; (or the equivalent in your client library, such as mysqli_set_charset($link, 'utf8') in PHP, where $link stands for your connection).
Something like this will repair your data:
UPDATE ... SET col = CONVERT(BINARY(
CONVERT(col USING latin1)
) USING utf8);
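The byte-level repair that statement aims for can be sketched in Python, which is handy for checking a few exported strings before touching the table. The function name is hypothetical, and Python's strict latin-1 codec is an assumption: it covers é, but it raises on characters that MySQL's latin1 takes from the 0x80-0x9F range (curly quotes, for example).

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes misread as latin-1'."""
    # Turn each character back into the single byte it was misread from,
    # then decode the resulting byte string as the UTF-8 it really is.
    return text.encode("latin-1").decode("utf-8")


# "Ph" + Ã + © + "nix", written with escapes so the source file stays unambiguous
print(undo_double_encoding("Ph\u00c3\u00a9nix"))  # Phénix
```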
Related
How can one clean data that looks like this Réation, l’Oreal to look like this R'action and L'Oreal respectively in MySQL?
That looks like an example of "double encoding": the right hand was talking utf8, but the left hand was listening for latin1. See "Trouble with UTF-8 characters; what I see is not what I stored", and also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .
Réation -> Réation after undoing the double-encoding.
Yet you say R'action -- I wonder if you were typing é as e' or 'e??
I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)
First, we need to verify that it is actually an ordinary double-encoding.
SELECT col, HEX(col) FROM ... WHERE ...
should give you this for the hex for Réation:
52 E9 6174696F6E -- latin1 encoding
52 C3A9 6174696F6E -- utf8 encoding
52 C383 C2A9 6174696F6E -- double encoding
(Ignore the spacing.)
If you got the third of those, proceed with my Answer. If you got anything else, STOP! The problem is more complex than I thought.
Now preview what the double-encoding fix would produce, before actually changing any data:
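If you paste the HEX() output somewhere outside MySQL, the three cases can be told apart with a small heuristic. This is only a sketch under assumptions: classify_hex is a hypothetical helper, it uses Python's latin-1 codec rather than MySQL's latin1, and genuine utf8 text that happens to contain a sequence like Ã followed by © would be misreported as double encoding.

```python
def classify_hex(hex_string: str) -> str:
    """Guess how the bytes shown by HEX(col) were encoded."""
    raw = bytes.fromhex(hex_string)  # spaces in the hex are ignored
    if raw.isascii():
        return "ascii only"          # nothing to tell apart
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "latin1"              # not valid UTF-8 at all
    try:
        # If the decoded characters are themselves UTF-8 bytes in disguise,
        # the text went through UTF-8 encoding twice.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"
    return "double encoding"


print(classify_hex("52 E9 6174696F6E"))         # latin1
print(classify_hex("52 C3A9 6174696F6E"))       # utf8
print(classify_hex("52 C383 C2A9 6174696F6E"))  # double encoding
```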
SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
BINARY(CONVERT(col USING latin1)) USING utf8mb4)
USING latin1)) USING utf8mb4)
FROM tbl;
You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!
Your case is: CHARACTER SET latin1, but have utf8/utf8mb4 bytes in it; leave bytes alone while fixing charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
Do that for each column in each table with the problem.
(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)
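The reason for going through VARBINARY is that only the label on the bytes should change, not the bytes themselves. A rough Python analogy, with made-up variable names and Python's latin-1 codec standing in for MySQL's latin1:

```python
# Bytes that are really UTF-8 but sit in a column labelled latin1.
stored = "R\u00e9ation".encode("utf-8")   # R C3 A9 a t i o n

# Same bytes, two labels: only the interpretation differs.
read_as_latin1 = stored.decode("latin-1")  # 'RÃ©ation' (what the latin1 label shows)
read_as_utf8 = stored.decode("utf-8")      # 'Réation'  (what the bytes really say)

# Converting the characters instead of relabelling the bytes
# would bake the misreading in: the result is the double-encoded form.
transcoded = read_as_latin1.encode("utf-8")

print(stored.hex().upper())      # 52C3A96174696F6E
print(transcoded.hex().upper())  # 52C383C2A96174696F6E
```

The first hex string is what the two-step ALTER is meant to preserve; the second is what a character-by-character conversion would produce.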
from Comment
CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'β' (double-encoded Greek beta, β)
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'ÎÂ²'
My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess goes well beyond simple "double encoding".
If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad, do not try to fix the data; back off and start over again. See the "Best practice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.
I have strings (English words + foreign words + emojis) stored in a MySQL database.
The data is loaded with
charset = 'latin1'
Then I preprocess the data with
str = str.encode('latin-1').decode('utf-8')
After doing so everything looks good except for the Unicode symbols that look like \u'******'
I would appreciate any help.
Don't use encode/decode, it only adds to your woes.
Your description is not clear on the path taken for the Emoji. Were they correctly encoded in UTF-8, but then mangled when stored into a latin1 column in the table?
Or was it something else?
See "Best practice" in Trouble with UTF-8 characters; what I see is not what I stored
If they were erroneously stored into a latin1 column, see "CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset" in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
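One way the encode/decode round trip from the question can go wrong is sketched below. It is diagnostic only, not a recommended fix, and it rests on an assumption the question does not confirm: that the bytes were interpreted as cp1252 on the way in (MySQL's latin1 closely follows cp1252). The variable names are illustrative.

```python
emoji = "\U0001F600"  # grinning face; 4 bytes in UTF-8 (F0 9F 98 80)

# Assumption: the UTF-8 bytes were read as cp1252, giving four odd characters.
garbled = emoji.encode("utf-8").decode("cp1252")

# ISO 8859-1 ('latin-1' in Python) has no slot for characters such as
# U+0178 or U+20AC, so the round trip from the question fails here.
try:
    garbled.encode("latin-1")
    latin1_round_trip_ok = True
except UnicodeEncodeError:
    latin1_round_trip_ok = False

# Reversing with the same codec that did the damage does work.
recovered = garbled.encode("cp1252").decode("utf-8")

print(latin1_round_trip_ok)  # False
print(recovered == emoji)    # True
```

This is why getting the connection and column charsets right is more reliable than patching strings after the fact.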
How to csv-import cyrillic text into latin1-swedish-ci encoded table in PhpMyAdmin?
My problem:
Everything works fine on the front end. But now I need to import ~1000 rows into a table. I'd prefer UTF-8, but the table is in latin1-swedish-ci. Even when I prepare my CSV in latin1-swedish-ci, there is no such option in the PhpMyAdmin import settings.
First, keep in mind that the encoding of the file and the CHARACTER SET of the column/table do not need to be the same.
Second, latin1 cannot represent Cyrillic. Perhaps cp1251 has what you need. Caution: I think it can handle only English and Russian, not the rest of Europe.
UTF-8 (MySQL's utf8mb4) is the way to go. This will involve 2 bytes per Cyrillic character.
Before mangling things worse, find out what the encoding of your data is. The HEX for Д (Capital DE):
C4 -- in cp1251 (Windows-1251; ISO/IEC 8859-5 is a different encoding, where Д is B4)
D094 -- in utf8 or utf8mb4
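Those hex values, and the point that latin1 has no slot for Cyrillic, can be checked quickly in Python. A small sketch; Python's latin-1 codec is used here as a stand-in for MySQL's latin1, and the variable names are illustrative.

```python
letter = "\u0414"  # Д, Cyrillic capital letter DE

cp1251_hex = letter.encode("cp1251").hex().upper()
utf8_hex = letter.encode("utf-8").hex().upper()

# latin-1 covers Western European letters only, so this encode fails.
try:
    letter.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(cp1251_hex)      # C4
print(utf8_hex)        # D094 (2 bytes per Cyrillic character)
print(fits_in_latin1)  # False
```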
For a long time I've been using a database/connection with the wrong encoding, which makes the Hebrew characters in the database display as characters from an unknown language, as the example below shows:
I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*
For the record, when I display this data with PHP, it shows as Hebrew. When I access it from the phpMyAdmin panel, it shows as gibberish (these unknown characters).
Is there any way to fix it although there is some data already inserted in the database?
That feels like "double-encoded" Hebrew strings.
This partially recovers the text:
UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1)))
--> '� �תה מסכי�,��
I do not know what leads to the � symbols.
Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
Please provide the output from that SELECT, then I will work on how to recover the data.
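Both expected hex strings can be derived in Python. This sketch assumes that MySQL's latin1 behaves like cp1252 with the five bytes cp1252 leaves undefined (such as 9D) passed through as the same code point; the helper names are hypothetical.

```python
def decode_like_mysql_latin1(raw: bytes) -> str:
    """Decode bytes as cp1252, passing undefined bytes through unchanged."""
    chars = []
    for byte in raw:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))  # e.g. 0x9D becomes U+009D
    return "".join(chars)


def double_encode(text: str) -> bytes:
    """UTF-8 bytes misread as latin1, then converted to UTF-8 again."""
    return decode_like_mysql_latin1(text.encode("utf-8")).encode("utf-8")


shalom = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word used above
print(shalom.encode("utf-8").hex().upper())  # D7A9D79CD795D79D
print(double_encode(shalom).hex().upper())   # C397C2A9C397C593C397E280A2C397C29D
```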
Edit
Here's what I think happened.
The client had characters encoded as utf8; and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8.
Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123
So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;
If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)
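In the spirit of "try it on a few rows first", here is a cautious Python sketch of the same repair that leaves a string alone when it does not look double-encoded. The helper names are hypothetical, and the cp1252-with-pass-through behaviour is an assumption about how MySQL's latin1 maps bytes.

```python
def encode_like_mysql_latin1(text: str) -> bytes:
    """Encode as cp1252, passing U+0080..U+00FF through when cp1252 lacks them."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise            # genuinely not a latin1 character
            out.append(ord(ch))  # e.g. U+009D goes back to byte 9D
    return bytes(out)


def repair_or_keep(text: str) -> str:
    """Undo one round of double encoding, or return the text unchanged."""
    try:
        return encode_like_mysql_latin1(text).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = bytes.fromhex("C397C2A9C397C593C397E280A2C397C29D").decode("utf-8")
print(repair_or_keep(garbled) == "\u05e9\u05dc\u05d5\u05dd")  # True
print(repair_or_keep("plain ascii"))                          # plain ascii
```

Text that is already correct Hebrew fails the latin1 encoding step and is returned as-is, so running the helper twice does no further damage.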
I am not sure that you can do anything about the data that has already been stored in the database. However, you can import Hebrew data properly by making sure you have the correct character set and collation:
The database collation has to be utf8_general_ci.
The collation of the table holding the Hebrew text has to be utf8_general_ci.
For example:
CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `col`.`hebrew` (
`id` INT NOT NULL AUTO_INCREMENT,
`heb` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`)
) CHARACTER SET utf8
COLLATE utf8_general_ci;
INSERT INTO hebrew(heb) values ('שלום');
I've got a table with lots of texts in different languages. The table and all the fields are iso-8859-1. The data that is in the fields was encoded in different charsets before it was inserted. It was not converted to latin1, so it often shows up as gibberish. It can be one of these:
iso-8859-1
iso-8859-7
windows-1250
windows-1251
windows-1254
windows-1257
I know which column maps to which kind of encoding.
What I would like to do is convert them to utf-8 when I select them, in a way that they will not show up as gibberish in the final output.
My idea was to use CONVERT(), but the docs are only talking about converting a string to another encoding. The string's encoding seems to be taken from the field's encoding. But that's broken for me.
Here's an example of what data looks like if I look at it directly in the DB.
Ðàä³àëüíà øèíà ïðèçíà÷åíà äëÿ áåçäîð³ææÿ.Ñïåö³àëüíî äëÿ Land Rover
This is supposed to be Ukrainian and was encoded in cp1251 before being put into the latin1 field. If I simply SELECT foo FROM bar without any conversion and send the result to a web browser, telling it to use cp1251, it shows up correctly as Cyrillic text.
What I think I need to do now is ignore the fact that the field is latin1 and convert from cp1251 to utf8 (or utf8mb4).
However, this is not doing what I want:
select convert(foo_ua using 'cp1251') from mytable;
It comes out as this, so the latin1 is obviously still in there.
????????? ???? ?????????? ??? ??????????.?????????? ??? Land Rover
I tried CAST() as well, but to the same result. How do I tell it to ignore the latin1 and just convert to cp1251, and how do I go to utf8 from there?
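The difference between converting characters and reinterpreting bytes can be seen in a small Python sketch. It is an illustration rather than a MySQL answer: the function name is hypothetical, and Python's latin-1 codec is assumed to be close enough to MySQL's latin1 for this sample (it would differ for bytes in the 0x80-0x9F range).

```python
def reinterpret_as_cp1251(displayed: str) -> str:
    """Recover the raw bytes behind the latin1 display, then read them as cp1251."""
    raw = displayed.encode("latin-1")  # back to the bytes actually stored
    return raw.decode("cp1251")        # interpret those bytes as cp1251


sample = "Ðàä³àëüíà øèíà"  # first two words of the example above
print(reinterpret_as_cp1251(sample))  # Радіальна шина

# Converting the latin1 *characters* to cp1251 instead loses everything,
# because cp1251 has no Ð, à, ä and so on: this mirrors the '?' output.
lossy = sample.encode("cp1251", errors="replace")
print(lossy)  # b'????????? ????'
```

Once the text is a proper Unicode string, encoding it as UTF-8 is the easy part.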