Latin-1 to UTF-8 conversion with mixed content - mysql

Originally, we ran a database with UTF-8 encoding, and we had to migrate to a different server that was using latin-1 by mistake. The problem is that most names contain special foreign characters, and without the proper encoding they get rendered garbled:
For example:
Nezihe ÅžÃ¼kran AkkaÅŸ
LÃœTFÄ° Ã‡OBAN
Eren KaragÃ¶zlÃ¼
I was able to convert it back to UTF-8 using the following query:
SELECT convert(cast(convert(name using latin1) as binary) using UTF8) AS name FROM users;
The above names now appeared correctly:
Nezihe Şükran Akkaş
LÜTFİ ÇOBAN
Eren Karagözlü
However, all the data that was previously encoded as proper UTF-8 now appears as (NULL)
My question is: how do I convert only the rows with broken encoding and leave the properly encoded ones untouched? Right now, it's "either or". The problem is they are mixed in terms of order, so I can't separate them by ID.
Any clue would help. Thanks!

Can you not just select coalesce( convert(...), name )?
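Spelled out, that comment's suggestion would look something like this (a sketch built from the query already shown above; as you observed, the conversion comes back NULL for rows that were already valid UTF-8, so COALESCE falls back to the stored value):
SELECT COALESCE(
         CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8),
         name
       ) AS name
FROM users;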


Special characters show as 'BLOB' when typing SELECT CHAR(128,129,130,131,132,133,134,135,136,137);

I'm using MySQL 8.0.31 and learning using the Sakila dataset. I tried typing
SELECT CHAR(128,129,130,131,132,133,134,135,136,137); but the result shows
[screenshot: the result is displayed as a BLOB instead of readable characters]
I also checked the default character set and it is 'utf8mb4'
I don't see a lot of answers and I'm a beginner. Please help
Edit:
I am expecting this result:
[screenshot from the book: Çüéâäàåçêë]
This is taken from the Learning SQL book by Alan B.
From the Book:
the following examples show the location of the accented characters along with other special characters, such as currency symbols:
mysql> SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
result: Çüéâäàåçêë
"BLOB" is a datatype used in databases to contain binary data (that is, not representable as text).
The string of characters you built is not representable in your default character set (utf8mb4), so MySQL does not know how to print it out and just says it is binary content.
The example in the book you are reading assumes a different default character set, one in which code points 128-137 are printable; the output shown in the book (Çüéâäàåçêë) matches the old DOS code page, cp850. Since your default is utf8mb4, you must specify the character set explicitly:
SELECT CHAR(128,129,130,131,132,133,134,135,136,137 USING cp850);
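As a quick sanity check (a sketch; the exact values depend on your server configuration), you can confirm which character sets are actually in effect and try the explicit conversion on a few values:
SHOW VARIABLES LIKE 'character_set%';
SELECT CHAR(128,129,130 USING cp850) AS first_three;  -- expected: Çüé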

How do I convert strings to utf-8 in a MySQL SELECT query when they are stored in a latin-1 field but were originally in another charset?

I've got a table with lots of texts in different languages. The table and all the fields are iso-8859-1. The data that is in the fields was encoded in different charsets before it was inserted. It was not converted to latin1, so it often shows up as gibberish. It can be one of these:
iso-8859-1
iso-8859-7
windows-1250
windows-1251
windows-1254
windows-1257
I know which column maps to which kind of encoding.
What I would like to do is convert them to utf-8 when I select them, in a way that they will not show up as gibberish in the final output.
My idea was to use CONVERT(), but the docs only talk about converting a string to another encoding. The string's source encoding seems to be taken from the field's encoding, and that's exactly what's broken in my case.
Here's an example of what data looks like if I look at it directly in the DB.
Ðàä³àëüíà øèíà ïðèçíà÷åíà äëÿ áåçäîð³ææÿ.Ñïåö³àëüíî äëÿ Land Rover
This is supposed to be Ukrainian and was encoded in cp1251 before being put into the latin1 field. If I simply SELECT foo FROM bar without any conversion and display it in a web browser, telling the browser to use cp1251, it shows up correctly as Cyrillic text.
What I think I need to do now is ignore the fact that the field is latin1 and convert from cp1251 to utf8 (or utf8mb4).
However, this is not doing what I want:
select convert(foo_ua using 'cp1251') from mytable;
It comes out as this, so the latin1 is obviously still in there.
????????? ???? ?????????? ??? ??????????.?????????? ??? Land Rover
I tried CAST() as well, but with the same result. How do I tell it to ignore the latin1 and just convert to cp1251, and how do I go to utf8 from there?
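For what it's worth, the cast-through-binary trick from the first question above can target cp1251 instead of utf8; here is a sketch using the question's own foo_ua / mytable names (assuming the column really is declared latin1):
SELECT CONVERT(CONVERT(CAST(foo_ua AS BINARY) USING cp1251) USING utf8mb4) AS foo_ua_utf8
FROM mytable;
The inner CONVERT reinterprets the raw bytes as cp1251; the outer CONVERT then transcodes that cp1251 string to UTF-8.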

mysql - How to save ñ

Whenever I try to save ñ it becomes ? in the MySQL database. After some reading, it is suggested that I have to change my JSP charset to UTF-8. For some reasons I have to stick to ISO-8859-1. My database table encoding is latin1. How can I fix this? Please help.
Go to your database administration tool (MySQL Workbench, for example), set the engine to InnoDB and the collation to utf8 / utf8_general_ci.
You state in your question that you require an ISO-8859-1 backend (latin1) and a Unicode (UTF-8) frontend. This setup is crazy, because the character set on the frontend is much larger than the one allowed in the database. The sanest thing would be to use the same encoding throughout the software stack, but using Unicode only for storage would also make sense.
As you should know, a String is a human concept for a sequence of characters. In computer programs, a String is not that: it can be viewed as a sequence of characters, but it's really a pair data structure: a stream of bytes and an encoding.
Once you understand that passing a String is really passing bytes and a scheme, let's see who sends what:
Browser to HTTP server (usually same encoding as the form page, so UTF-8. The scheme is specified via Content-Type. If missing, the server will pick one based on its own strategy, for example default to ISO-8859-1 or a configuration parameter)
HTTP Server to Java program (it's Java to Java, so the encoding doesn't matter since we pass String objects)
Java client to MySQL server (the Connector/J documentation is quite convoluted - it uses the character_set_server system variable, possibly overridden by the characterEncoding connection parameter)
To understand where the problem lies, first assure that the column is really stored as latin1:
SELECT character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = :DATABASE
AND table_name = :TABLE
AND column_name = :COLUMN;
Then write the Java string you get from the request to a log file:
logger.info(request.getParameter("word"));
And finally see what actually is in the column:
SELECT HEX(:column) FROM :table
At this point you'll have enough information to understand the problem. If it's really a question mark (and not a replacement character) likely it's MySQL trying to transcode a character from a larger set (let's say Unicode) to a narrower one which doesn't contain it. The strange thing here is that ñ belongs to both ISO-8859-1 (0xF1, decimal 241) and Unicode (U+00F1), so it'd seem like there's a third charset (maybe a codepage?) involved in the round trip.
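As a quick byte-level illustration of that difference (ñ is 0xF1 in latin1 and 0xC3 0xB1 in UTF-8), you can run something like:
SELECT HEX(CONVERT('ñ' USING latin1)) AS latin1_bytes,  -- F1
       HEX(CONVERT('ñ' USING utf8mb4)) AS utf8_bytes;   -- C3B1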
More information may help (operating system, HTTP server, MySQL version)
Change your db table content encoding to UTF-8
Here's the command for whole DB conversion
ALTER DATABASE db_name DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
And this is for single tables conversion
ALTER TABLE db_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Change your table collation to utf8_spanish_ci, in which ñ is not equal to n. If you want the two characters to be treated as equal, use utf8_general_ci instead.
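A minimal sketch of that change, assuming a table named users (the table name is just an example):
ALTER TABLE users CONVERT TO CHARACTER SET utf8 COLLATE utf8_spanish_ci;
-- or, to have ñ and n compare as equal:
ALTER TABLE users CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;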
I tried several combinations, but this works for me:
VARCHAR(255) BINARY CHARACTER SET utf8 COLLATE utf8_bin
When the data is retrieved in dbForge Express, it shows as:
NIÃ‘A
but in the application it shows as:
NIÑA
I had the same problem. I found out that it is not an issue about UTF-8 encoding or whatever charset. I imported my data from Windows ANSI and all my Ñ and ñ were put in the database perfectly, as they should be. For example, last names showed in the database as last_name = "MUÑOZ". I was able to select normally from the database with the query Select * from database where last_name LIKE "%muñoz%" and phpMyAdmin showed me the results fine. It selected all "MUÑOZ" and "MUNOZ" without a problem. So phpMyAdmin does show all my Ñ and ñ without any problems.
The problem was the program itself. All the characters mentioned showed up as you describe, with the funky "MU�OZ" question mark. I had followed all the advice everywhere: set my headers correctly and tried every charset available. I even used Google Fonts and whatever other fonts were available to display those last names correctly, but with no success.
Then I remembered an old program that was able to do the trick back and forth transparently, and I peeked into its code to figure it out: the database itself, showing all my special characters, was the problem. Remember, I uploaded using Windows ANSI encoding, and phpMyAdmin did as expected and uploaded everything as instructed.
The old program fixed this problem by translating the Ñ to its HTML entity, &Ntilde; (see the chart here: https://www.compart.com/en/unicode/U+00D1), a process done back and forth between MySQL and the app.
So you just need to change your database strings containing the letters Ñ and ñ to their corresponding entities so they render correctly in your browser with the UTF charset.
In my case, I solved my issue by replacing all Ñ and ñ with their corresponding entities in all the last names in my database.
UPDATE database_name
SET
last_name = REPLACE(last_name,
'MUÑOZ',
'MU&Ntilde;OZ');
Now I'm able to display, browse, and even search all my last names correctly, with the accents/tildes proper to the Spanish language. I hope this helps. It was a pain to figure out, but an old program solved the problem. Best regards and happy coding!

Weird coding type convert to utf8

I have over 1k records in my database with values that look very weird:
Lưu Bích vỠViệt Nam làm liveshow
However, when I view them as UTF-8 they look fine and readable. How do I convert all of these at once inside MySQL so that they look like this:
Lưu Bích về Việt Nam làm liveshow
Any kind of help is greatly appreciated. Thank you!
I'm going to assume the column encoding is utf8. If it's not, change it because latin1 does not have the characters needed for Việt.
At this point what you have in the column is doubly UTF-8 encoded text. If all text is mangled in this same way you can solve this problem by changing the column type first to latin1 text, then to blob, and then to utf8 text. But if some of the data in the column is singly encoded you need to detect the broken values and update only those. This update statement tries to do that:
update mytable set mycolumn = @txt where char_length(mycolumn) =
length(@txt := convert(binary convert(mycolumn using latin1) using utf8));
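The column-type route mentioned above would look roughly like this (a sketch, assuming mycolumn is a utf8 TEXT column in mytable and every row is doubly encoded):
ALTER TABLE mytable MODIFY mycolumn TEXT CHARACTER SET latin1;  -- utf8 -> latin1: each mis-decoded character becomes the single byte it originally was
ALTER TABLE mytable MODIFY mycolumn BLOB;                       -- keep the bytes, drop the character set label
ALTER TABLE mytable MODIFY mycolumn TEXT CHARACTER SET utf8;    -- relabel those bytes as the UTF-8 they really are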
Alternatively you can define a function that does a "safe" utf-8 conversion, detecting when the original data is OK and returning a converted version only if it's not, and then do the update with that.

MySQL Query to Identify bad characters?

We have some tables that were set up with the Latin character set instead of UTF-8, and it allowed bad characters to be entered into the tables. The usual culprit is people copy/pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertable characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
You're probably noticing something like this 'bug'. The 'bad characters' are most likely UTF-8 control characters (e.g. 0x80). You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
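A sketch of that layout (the table and column names are just examples):
CREATE TABLE windows_files (
    id INT AUTO_INCREMENT PRIMARY KEY,
    body BLOB,                    -- raw bytes exactly as received, no server-side transcoding
    body_charset VARCHAR(20)      -- the encoding recorded by the application, e.g. 'cp1252'
);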
Take a look at this Q/A (it's all about your client encoding, aka SET NAMES).
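For reference, the client encoding is set per connection, for example (utf8mb4 here is just the common choice; use whatever your application actually sends):
SET NAMES utf8mb4;  -- sets character_set_client, character_set_connection and character_set_results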