Weird encoding type - convert to utf8 in MySQL

I have over 1k records in my database with values that look very weird:
Lưu Bích vỠViệt Nam làm liveshow
However, when I view them as UTF-8 they look fine and readable. How do I convert all of these at once to utf8 inside MySQL so that they look like this:
Lưu Bích về Việt Nam làm liveshow
Any kind of help is greatly appreciated. Thank you!

I'm going to assume the column encoding is utf8. If it's not, change it, because latin1 does not have the characters needed for Vietnamese.
At this point what you have in the column is doubly UTF-8 encoded text. If all text is mangled in this same way, you can solve this problem by changing the column type first to latin1 text, then to blob, and then to utf8 text. But if some of the data in the column is singly encoded, you need to detect the broken values and update only those. This update statement tries to do that:
update mytable set mycolumn = @txt where char_length(mycolumn) =
length(@txt := convert(binary convert(mycolumn using latin1) using utf8));
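For the whole-column route mentioned above (only appropriate when every row is mangled the same way), a rough sketch of the three steps might look like this, assuming mycolumn is a TEXT column; adjust the type and the column's other attributes to match your table:
ALTER TABLE mytable MODIFY mycolumn TEXT CHARACTER SET latin1;
ALTER TABLE mytable MODIFY mycolumn BLOB;
ALTER TABLE mytable MODIFY mycolumn TEXT CHARACTER SET utf8;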
Alternatively you can define a function that does a "safe" utf-8 conversion, detecting when the original data is OK and returning a converted version only if it's not, and then do the update with that.
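A rough sketch of what such a function could look like (fix_double_utf8 is a made-up name, and the length check mirrors the heuristic in the UPDATE above; adjust types and charsets to your schema):
DELIMITER //
CREATE FUNCTION fix_double_utf8(txt TEXT CHARACTER SET utf8) RETURNS TEXT CHARACTER SET utf8
DETERMINISTIC
BEGIN
  DECLARE fixed TEXT CHARACTER SET utf8;
  SET fixed = CONVERT(BINARY CONVERT(txt USING latin1) USING utf8);
  -- A double-encoded value has as many characters as its repaired form has bytes.
  IF fixed IS NOT NULL AND CHAR_LENGTH(txt) = LENGTH(fixed) THEN
    RETURN fixed;
  END IF;
  RETURN txt;
END//
DELIMITER ;
-- then: UPDATE mytable SET mycolumn = fix_double_utf8(mycolumn);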

Related

Latin-1 to UTF-8 conversion with mixed content

Originally, we ran a database with UTF-8 encoding and we had to migrate to a different server which was using latin-1 by mistake. The problem is that most names contain special foreign characters and they get rendered weird without the proper encoding:
For example:
Nezihe ÅžÃ¼kran AkkaÅŸ
LÃœTFÄ° Ã‡OBAN
Eren KaragÃ¶zlÃ¼
I was able to convert it back to UTF using the following query:
SELECT convert(cast(convert(name using latin1) as binary) using UTF8) AS name FROM users;
The above names now appeared correctly:
Nezihe Şükran Akkaş
LÜTFİ ÇOBAN
Eren Karagözlü
However, all the data that was previously encoded as proper UTF-8 now appears as (NULL)
My question is: how do I convert only the rows with broken encoding and leave the properly encoded ones untouched? Right now it's "either/or". The problem is they are mixed in terms of order, so I can't separate them by ID.
Any clue would help. Thanks!
Can you not just select coalesce( convert(...), name )?
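If the conversion really does return NULL only for rows that are already proper UTF-8 (as described above), a sketch of that suggestion could look like this (untested; it reuses the table and column names from the question):
SELECT COALESCE(convert(cast(convert(name using latin1) as binary) using UTF8), name) AS name FROM users;
-- and, once verified, the in-place repair:
UPDATE users SET name = COALESCE(convert(cast(convert(name using latin1) as binary) using UTF8), name);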

Losing data on converting MySQL latin1_swedish_ci to utf8_unicode_ci

When I try to convert data from latin1_swedish_ci to utf8_unicode_ci I lose data! The TEXT column is cut at the first special character.
For example:
Becomes:
I tried many ways to convert my column, and every solution ends up deleting data at the first special character!
I tried by phpMyAdmin or with this SQL request:
UPDATE `page` SET page_text = CONVERT(cast(CONVERT(page_text USING latin1) AS BINARY) USING utf8);
I also tried this PHP script:
https://github.com/nicjansma/mysql-convert-latin1-to-utf8/blob/master/mysql-convert-latin1-to-utf8.php
Every time the result is the same: data is lost at the first special character!
What should I do?
UPDATE
I could change the data to utf8 with
ALTER TABLE page CONVERT TO CHARACTER SET utf8mb4;
or
ALTER TABLE page CONVERT TO CHARACTER SET utf8;
without losing data, but special characters still do not display properly.
Using the PHP function utf8_encode($myvar) does display the special characters correctly.
To convert a table, use
ALTER TABLE ... CONVERT TO ...
Or, to change individual columns, use
ALTER TABLE ... MODIFY COLUMN ...
Instead, you seem to have done something different. For further analysis, please provide SELECT col, HEX(col) ... before and after the conversion, plus the conversion used.
See "truncated" in this . The proper fix is found here, but depends on what you see from the HEX.

How to clean data with special characters in MySQL

How can one clean data that looks like this: RÃ©ation, l’Oreal, so that it looks like this: R'action and L'Oreal, respectively, in MySQL?
That looks like an example of "double encoding". It is where the right hand was talking utf8, but the left hand was listening for latin1. See Trouble with UTF-8 characters; what I see is not what I stored, and see also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .
RÃ©ation -> Réation after undoing the double-encoding.
Yet you say R'action -- I wonder if you were typing é as e' or 'e??
I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)
First, we need to verify that it is actually an ordinary double-encoding.
SELECT col, HEX(col) FROM ... WHERE ...
should give you one of these for the hex of Réation:
52 E9 6174696F6E -- latin1 encoding
52 C3A9 6174696F6E -- utf8 encoding
52 C383 C2A9 6174696F6E -- double encoding
(Ignore the spacing.)
If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.
Now, see if the double-encoding fix will fix it (before fixing it):
SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
BINARY(CONVERT(col USING latin1)) USING utf8mb4)
USING latin1)) USING utf8mb4)
FROM tbl;
You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!
Your case is: the column is CHARACTER SET latin1 but has utf8/utf8mb4 bytes in it; leave the bytes alone while fixing the charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
Do that for each column in each table with the problem.
(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)
from Comment
CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'β' (Greek beta)
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'β'
My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess goes beyond simple "double encoding".
If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad, do not try to fix it; back off and start over again. See the "best practice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.

How can I convert BLOB to text in MySQL?

I have a BLOB field in one of my tables and I used the following command to convert it to text:
ALTER TABLE mytable
ADD COLUMN field1_new TEXT;
update mytable set
field1_new = CONVERT(field1 USING utf8);
This did not work and gave me some random characters, like:
9x
This result is returned as the content of the message, which does not make sense. I changed the character set to latin1; this gave me a longer sequence of characters, but still nonsense. For example:
¢xœ}T]k1|/ô?¬Á/‡ZJpMK“–<$„Ô¥ôqO§»ÑI®¤³¹ß...
Is there any way to figure out what character set the BLOB field is using so that I can convert it to text properly?
Any help with this problem will be much appreciated. Thanks
Edit: I should also mention that I used the CAST command and it returned:
�x�}T]k1|/�?��/��ZJpMK��<$�ԥ�qO���I������������$:���̬�4�...
Try using CAST:
CAST(field1 AS CHAR(10000) CHARACTER SET utf8)
You can also see this post for more: How do I convert from BLOB to TEXT in MySQL?
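Applied to the field1_new column added in the question, a sketch would be the following (note: if the BLOB actually holds binary data rather than text in some character set, no conversion will make it readable):
UPDATE mytable SET field1_new = CAST(field1 AS CHAR(10000) CHARACTER SET utf8);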

How do I convert strings to utf-8 in a MySQL SELECT query when they are in a latin-1 field but were originally in another charset?

I've got a table with a lot of text in different languages. The table and all the fields are iso-8859-1. The data in the fields was encoded in different charsets before it was inserted. It was not converted to latin1, so it often shows up as gibberish. It can be any of these:
iso-8859-1
iso-8859-7
windows-1250
windows-1251
windows-1254
windows-1257
I know which column maps to which kind of encoding.
What I would like to do is convert them to utf-8 when I select them, in a way that they will not show up as gibberish in the final output.
My idea was to use CONVERT(), but the docs only talk about converting a string to another encoding. The string's source encoding seems to be taken from the field's encoding, and that is exactly what is broken in my case.
Here's an example of what data looks like if I look at it directly in the DB.
Ðàä³àëüíà øèíà ïðèçíà÷åíà äëÿ áåçäîð³ææÿ.Ñïåö³àëüíî äëÿ Land Rover
This is supposed to be Ukrainian and was encoded in cp1251 before being put into the latin1 field. If I simply SELECT foo FROM bar without any conversion and send it to a web browser, telling the browser to use cp1251, it shows up correctly as Cyrillic text.
What I think I need to do now is ignore the fact that the field is latin1 and convert from cp1251 to utf8 (or utf8mb4).
However, this is not doing what I want:
select convert(foo_ua using 'cp1251') from mytable;
It comes out as this, so the latin1 is obviously still in there.
????????? ???? ?????????? ??? ??????????.?????????? ??? Land Rover
I tried CAST() as well, but with the same result. How do I tell it to ignore the latin1 and just convert to cp1251, and how do I go to utf8 from there?
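One approach that is sometimes used for this situation (a sketch, not verified against this table): go through binary first so the latin1 label is dropped without touching the stored bytes, then reinterpret those bytes as cp1251; the final conversion to utf8mb4 can be explicit, as here, or left to the connection charset:
SELECT CONVERT(CONVERT(CONVERT(foo_ua USING binary) USING cp1251) USING utf8mb4) AS foo_ua_utf8 FROM mytable;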