Losing data on converting MySQL latin1_swedish_ci to utf8_unicode_ci - mysql

When I try to convert data from latin1_swedish_ci to utf8_unicode_ci I loose data ! The TEXT column is cut at the first special character.
For example:
Becomes:
Yet I tried many ways to convert my column and all solutions end up deleting data at the first special character!
I tried by phpMyAdmin or with this SQL request:
UPDATE `page` SET page_text = CONVERT(cast(CONVERT(page_text USING latin1) AS BINARY) USING utf8);
I also tried the php script :
https://github.com/nicjansma/mysql-convert-latin1-to-utf8/blob/master/mysql-convert-latin1-to-utf8.php
With all the time the same result, data are lost at first special character!
What should I do?
UPDATE
I could change the data to utf8 with
ALTER TABLE page CONVERT TO CHARACTER SET utf8mb4;
or
ALTER TABLE page CONVERT TO CHARACTER SET utf8;
without loosing data but it does not display properly special characters.
Using the php function utf8_encode($myvar); does display correctly special characters.

To convert a table, use
ALTER TABLE ... CONVERT TO ...
Or, to change individually columns, use
ALTER TABLE ... MODIFY COLUMN ...
Instead, you seem to have done something different. For further analysis, please provide SELECT col, HEX(col) ... before and after the conversion, plus the conversion used.
See "truncated" in this . The proper fix is found here, but depends on what you see from the HEX.

Related

How to clean data with special characters in MySQL

How can one clean data that looks like this Réation, l’Oreal to look like this R'action and L'Oreal respectively in MySQL?
That looks like an example of "double encoding". It is where the right hand was talking utf8, but the left hand was listening for latin1. See Trouble with UTF-8 characters; what I see is not what I stored and See also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .
Réation -> Réation after undoing the double-encoding.
Yet you say R'action -- I wonder if you were typing é as e' or 'e??
I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)
First, we need to verify that it is actually an ordinary double-encoding.
SELECT col, HEX(col) FROM ... WHERE ...
should give you this for the hex for Réation:
52 E9 6174696F6E -- latin1 encoding
52 C3A9 6174696F6E -- utf8 encoding
52 C383 C2A9 6174696F6E -- double encoding
(Ignore the spacing.)
If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.
Now, see if the double-encoding fix will fix it (before fixing it):
SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
BINARY(CONVERT(col USING latin1)) USING utf8mb4)
USING latin1)) USING utf8mb4)
FROM tbl;
You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!
Your case is: CHARACTER SET latin1, but have utf8/utf8mb4 bytes in it; leave bytes alone while fixing charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then to convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
Do that for each column in each table with the problem.
(In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)
from Comment
CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'β' (Greek beta)
CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'β'
My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess is on beyond simply "double encoding".
If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad not try to fix the data; back off and start over again. See the "best bractice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.

Implement search in codeigniter

Everyone know how to implement search in page with codeiginter.
One issue is that my keyword, for example "Șaweqă" has diacritisc. How to search into database table with this keyword.
Simple $this->db->like($keyword); return database error.
My database character encoding is set to utf8-general-ci and the table where i do search has rows with utf8_romanian_ci encoding.
The output of SHOW CREATE TABLE should look one of these ways:
the_column ... CHARACTER SET utf8, -- or utf8mb4
or, if not specified on the_column, then
CREATE TABLE ... (...) DEFAULT CHARACTER SET utf8; -- or utf8mb4
To check that Șaweqă was stored (via INSERT, etc) correctly as utf8, do something like SELECT HEX(col).... You should get C898 61 77 65 71 C483 for Șaweqă. (I added spaces to distinguish the characters.)
To check that it is fetched (via SELECT, etc) correctly, look at the output.
Try allowing special character in the CI config file :
$config['permitted_uri_chars'] = 'a-z 0-9~%.:#_\-ä';
And if you are using GET method, try to urldecode() the keyword :
$keyword = urldecode($keyword);

Converting non-utf8 database to utf-8

I've been using for a long time a database/connection with the wrong encoding, resulting the hebrew language characters in the database to display as unknown-language characters, as the example shows below:
I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*
For the record, when I display this unknown characters sql data with php - it shows as hebrew. when I'm trying to access it from the phpMyAdmin Panel - it shows as jibrish (these unknown characters).
Is there any way to fix it although there is some data already inserted in the database?
That feels like "double-encoded" Hebrew strings.
This partially recovers the text:
UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1)))
--> '� �תה מסכי�,��
I do not know what leads to the � symbols.
Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
Please provide the output from that SELECT, then I will work on how to recover the data.
Edit
Here's what I think happened.
The client had characters encoded as utf8; and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8.
Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123
So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.
UPDATE table SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;
If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)
I am not sure that you can do anything about the data that has already been stored in the database. However, you can import hebrew data properly by making sure you have the correct character set and collation.
the db collation has to be utf8_general_ci
the collation of the table with hebrew has to be utf8_general_ci
for example:
CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `col`.`hebrew` (
`id` INT NOT NULL AUTO_INCREMENT,
`heb` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`)
) CHARACTER SET utf8
COLLATE utf8_general_ci;
INSERT INTO hebrew(heb) values ('שלום');

MySQL does not identify character '?' on select

I got a table in MySQL with the following columns:
id name email address borningDate
I have a form in a HTML page that submits this data to a servlet, responsible for saving it at the database. Due to charset issues (already fixed), I saved a row like this, when trying to store letters with accents:
19 ? ? ? 2015-03-01
and now I want to delete this row.
Yeah, doing this:
DELETE FROM table WHERE id=19;
works nice. My didatic question is: why, if I try something like this:
DELETE FROM table WHERE name='?';
it returns 0 rows affected, like if it can't see ? as a valid character?
Try doing
SELECT id, HEX(name), HEX(email), HEX(address), borningDate FROM table
This will tell you what's actually in the database. It probably isn't actually ASCII question marks. The question marks are probably substitution characters applied when MySQL tries to convert the column's character set to the connection's character set.
To manage this more specifically, do SHOW CREATE TABLE table and look for the character set being used for the text columns. This probably shows up at the end of the table definition as something like DEFAULT CHARSET utf8 or some such thing. But it might be specified in the column definition.
Once you know the character set, issue the command SET NAMES charset, for example, SET NAMES utf8. Then reissue your commands and see if you get better results than the ? substitution character. That assumes, of course, that the client program you are using can handle the character set mentioned.

How can I convert BLOB to text in MySQL?

I have a BLOB field in one of tables and I used the following command to convert it to text:
ALTER TABLE mytable
ADD COLUMN field1_new TEXT;
update mytable set
field1_new = CONVERT(field1 USING utf8);
This did not work and gave me some random characters. Like:
9x
This result is returned as a content of message which does not make sense. I changed the character set to 'latin1'. This one gave me a larger sequence of characters yet still something non-sense. For example:
¢xœ}T]k1|/ô?¬Á/‡ZJpMK“–<$„Ô¥ôqO§»ÑI®¤³¹ß...
Is there anyway to figure out what character set the BLOB field is using so that I can convert it to text properly?
Any help with this problem will be much appreciated. Thanks
Edited: I have to also mention that I used CAST command and it returned:
�x�}T]k1|/�?��/��ZJpMK��<$�ԥ�qO���I������������$:���̬�4�...
try using cast:
CAST(field1 AS CHAR(10000) CHARACTER SET utf8)
you can see this post also for more:How do I convert from BLOB to TEXT in MySQL?