Converting non-utf8 database to utf-8 - mysql

I've been using for a long time a database/connection with the wrong encoding, resulting the hebrew language characters in the database to display as unknown-language characters, as the example shows below:
I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*
For the record, when I display this unknown characters sql data with php - it shows as hebrew. when I'm trying to access it from the phpMyAdmin Panel - it shows as jibrish (these unknown characters).
Is there any way to fix it although there is some data already inserted in the database?

That feels like "double-encoded" Hebrew strings.
This partially recovers the text:
UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1)))
--> '� �תה מסכי�,��
I do not know what leads to the � symbols.
Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
Please provide the output from that SELECT, then I will work on how to recover the data.
Edit
Here's what I think happened.
The client had characters encoded as utf8; and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8.
Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123
So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.
UPDATE table SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;
If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)

I am not sure that you can do anything about the data that has already been stored in the database. However, you can import hebrew data properly by making sure you have the correct character set and collation.
the db collation has to be utf8_general_ci
the collation of the table with hebrew has to be utf8_general_ci
for example:
CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `col`.`hebrew` (
`id` INT NOT NULL AUTO_INCREMENT,
`heb` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`)
) CHARACTER SET utf8
COLLATE utf8_general_ci;
INSERT INTO hebrew(heb) values ('שלום');

Related

MySQL - utf8 characters no displaying correctly on web frontend

I have a database which has the latin1 default characterset - info obtained by running the following statement:
SELECT default_character_set_name FROM information_schema.SCHEMATA
WHERE schema_name = "schemaname";
The default character set for each table and column in this database is set to utf8.
When I look at the data in the tables I can see data is stored as utf8 e.g the currency symbol € is stored in the table as €. Similarly apostraphes are stored as ’ etc.
On the web frontend I have the following meta tag and so the characters render correctly.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
However I'm also seeing a lot of � symbols on the webpage which I don't see inside the database?
When I change the database connection to include the charset utf8 as follows: mysql:host=myhost;dbname=mydatabase;charset=utf8, the diamond symbols disappear but then all the other utf8
characters revert to exactly how they are saved in the database e.g. the € symbol renders as € on the webpage?
Why is this happening?
How do I fix this and also change character set to utf8mb4?
Any help appreciated.
* UPDATE *
Tried the following steps:
for the database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Not sure if Step 3 is necessary since when I do SHOW CREATE TABLE after step 2, whilst the definition doesn't display the column charset it does display the default charset for the table as utf8mb4. As a sanity check I did run step 3 on one of the tables columns but it makes no difference - € is being rendered on the page as € with db connection as follows:
`mysql:host=myhost;dbname=mydatabase;charset=utf8mb4`
I had to run the following on each column I wanted converting which seems to fix some issues
UPDATE tbl_profiles SET profile =
convert(cast(convert(profile using latin1) as binary) using UTF8MB4);
but still seeing characters such as Iâm and «Âand ⢠rendered on the webpage
Any ideas?
* UPDATE 2 *
After running steps 1 and 2 above I have a table column as follows:
`job_salary` VARCHAR(150) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
The following query on this column returns the following result:
SELECT job_salary FROM tbl_jobs WHERE job_id = 2235;
€30,000 plus excellent benefits
I execute the following statement on this column:
UPDATE tbl_jobs SET job_salary = CONVERT(BINARY(CONVERT(job_salary USING latin1)) USING utf8mb4);
But I get the following error which means some other record has a invalid utf8mb4
Invalid utf8mb4 character string: '\x8010000 to \x8020000 Per: annum'
First, let's discuss the Mojibake of the Euro sign. All of this applies to both utf8 and utf8mb4, since the Euro is encoded the same way and there is.
It is very likely that the data was initially stored incorrectly. If you can get back to the INSERT program, let's check for:
The bytes to be stored need to be UTF-8-encoded. What was the client programming language? Where did the data come from?
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Do you have the connection parameters?
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). It sounds like this was always correct.
HTML should start with .
What is currently in the table?
SELECT col, HEX(col) FROM ... WHERE ...
A correctly stored Euro sign (€) should have hex E282AC. (Interpreting that as latin1 yields €.
If instead, you see hex C3A2E2809AC2AC, you have "double encoding", and the display is probably €.
I have identified several possible fixes, but have not yet determined which applies in your case. The likely candidate is
CHARACTER SET utf8mb4 with double-encoding:
To verify it (before fixing it), please do something like:
SELECT col,
CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4),
HEX(
CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4)
)
FROM ...
WHERE ...
Do not apply a fix on top of another fix. I have struggled for a long time to decipher how character set problems occur and what to do to 'fix' a single problem. But when the wrong fix is applied, I am at a loss to unravel the mess.

How to enter the Indian Rupee symbol in MySQL Server 5.1?

I am unable to find the exact solution for MySQL
The thing is the column supports by default UTF-8 encoding which consists of 3 bytes. The Indian Rupee Symbol, since it is new has a 4 byte encoding. So we have to change the character encoding to utf8_general_ci by,
ALTER TABLE test_tb MODIFY COLUMN col VARCHAR(255)
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
After executing the above query simply execute the following query to insert the symbol,
insert into test_tb values("₹");
Ta-Da!!!
You are talking Oracle, yet it is tagged MySQL. Which do you want? And what language and/or client tool are you using?
Copy and paste it. Which Rupee do you like? ৲ ৳ ૱ ௹ ₨ ꠸
Probably you want this one:
UNHEX('E282A8') = '₨'
which is U+20A8 or 8360 in non-MySQL contexts
You need to have CHARACTER SET utf8 on the table/column.
You need to have done SET NAMES utf8 (or equivalent) when connecting.
Simplest way to do it is, utf8mb4 stores all the symbols
ALTER TABLE AsinBuyBox CONVERT TO CHARACTER SET utf8mb4;

Can I convert MySQL database character set from latin1 to utf8 without losing data?

I want to convert my database to store unicode symbols.
Currently the tables have:
latin_swedish_ci collation and latin1 character set
OR
utf8_general_ci collation and utf8 character set
I am not sure how the existing data is encoded, but I suppose it is utf-8 encoded, as I am using Django which I think encodes the data in utf-8 before sending to the database.
My question is:
Can I convert the tables to utf8_unicode_ci collation and utf-8 character set using the following queries without messing up the existing data? (as sugested in this post)
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Considering latin1 is subset of utf-8, I think it sould work. What do you guys think?
Thank you in advance.
P.S: The version of MySQL is: 5.1
Latin1 is not a subset of UTF-8 - ASCII is. Latin1, however, is represented in Unicode.
CONVERT TO should work, as long as the data was stored in the correct encoding in the first place. Django may have used UTF-8 on the database connection, but the database should have re-encoded on the fly.
To check the actual encoding used - Use the mysql command-line tool to execute an SQL query that selects a row that you know contains non-ASCII characters. Then use the mysql HEX() function to check the bytes used. If you see bytes greater than > 0x7f, check that they don't correspond to valid characters in https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
If you have c396 sitting in a latin1 column, and you want it to mean Ö, then you are half way to "double encoding". Do not use CONVERT TO; that will really get you into "double encoding".
Instead, you need the 2-step ALTER.
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
If you have already messed it up further, and now the Ö is hex C383E28093, then you need to fix double encoding.
This gets you the latin1 byte in 2 steps:
CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1) --> 'Ö' (C396)
HEX(CONVERT(CONVERT(UNHEX('C396') USING utf8) USING latin1)) --> 'Ö' in latin1 (D6)
This gets you the 2-byte utf8 encoding:
CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1)) USING utf8)
Do you want the column to be latin1? Or utf8?

MySQL can't restore table from backup - #1366 - Incorrect string value

A site I'm working on recently had an issue with the database, apparently it got corrupted when they restored the tables any text field with strange symbols (eg half symbol and degree symbol) the text field stopped at the character before that symbol). I've got a copy of the table and distilled it down to the code below:
CREATE TABLE `products2` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into products2 values
(25, 0x5468652044504D203931322069732061206C617267652033BD204469676974204C434420566F6C746D657465722E20546865207369676E616C206265696E67206D6561737572656420697320616C736F207573656420746F20706F77657220746865206D657465722C20696E636C7564696E6720746865206261636B6C696768742E20546865206D657465722066656174757265732061203320746F20363056206D6561737572656D656E742072616E67652C20776974682061207265736F6C7574696F6E206F662031306D56206265747765656E20332E303020616E642031392E39395620616E64203130306D56206265747765656E2032302E3020616E642036302E30562E205768656E2074686520766F6C746167652064726F70732062656C6F772033562C204C4F20697320646973706C617965642028646F776E20746F20322E38562C207768656E2074686520646973706C61792077696C6C207475726E206F6666292E209148499220697320646973706C61796564207768656E2074686520766F6C7461676520676F65732061626F7665203630562E0D0A0D0A5363726577207465726D696E616C7320616C6C6F7720666F7220717569636B20616E64206561737920636F6E6E656374696F6E2E20546865206D6574657220697320686F7573656420696E206120726F6275737420636172726965722077686963682063616E20626520626F6C74656420696E20706C616365206F722070616E656C206D6F756E746564207573696E6720746865206C6F772070726F6669206C652062657A656C20616E6420636C6970732070726F76696465642E20416E2049503637202F204E454D412034582062657A656C20697320616C736F20617661696C61626C6520666F722070726F74656374696F6E20616761696E7374206475737420616E64206D6F6973747572652E0D0A0D0A417320746869732069732061206E65772064657369676E2077652073756767657374207468617420796F7520636F6E74616374204C617363617220666F7220757020746F2064617465206C6561642D74696D6520696E666F726D6174696F6E206265666F7265206F72646572696E67206F6E6C696E652E0D0A)
This throws an error:
#1366 - Incorrect string value: '\xBD Digi...' for column 'description' at row 1
Looking into this problem on stackoverflow and around the web it seems to be an issue with the encoding, I've tried changing the collation to utf_unicode_ci on the description field and the collation of the table to utf_bin (and all combinations of those) all to no avail.
I can't redo the dump as it's a backup. I don't understand how the system can output the dump but not accept it back - presumably the backup is via the command line (not certain) and I am using PHPMyAdmin to restore it I don't know if that makes a difference.
If it's not possible to import the data I'd be grateful if someone could tell me how to read the encoded data into text that I can then manually cut and paste.
Decoding the first 32 bytes as ASCII, we have (where ? is the 0xBD byte about which MySQL is complaining):
The DPM 912 is a large 3? Digit
A little bit of Googling for "DPM 912" suggests to me that character should be the vulgar one-half fraction, &half;.
A number of character sets encode that character with the byte 0xBD, but one in particular jumps out: windows-1252—which was not only the default codepage in the (pre-Unicode) Windows world, but is also MySQL's default encoding. It'd be a good guess that your data is encoded in windows-1252.
As explained in the MySQL manual, you can specify the encoding of a string literal by prefixing it with the encoding name:
A character string literal may have an optional character set introducer and COLLATE clause:
[_charset_name]'string' [COLLATE collation_name]
It goes on to say:
An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), or before bit-field literal notation (b'literal' and 0bnnnn).
Therefore (and because MySQL refers to windows-1252 as latin1), you could change your INSERT command to:
INSERT INTO products2 VALUES (25, _latin1 0x5468652044504D203931322069...);
The documentation also states:
For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
That is, if such an introducer is omitted (as it was in your original INSERT statement), the character set is assumed to be that defined by the character_set_connection system variable.
As mentioned here, there are number of ways of setting that variable (including by specifying it when your client connects which, in phpMyAdmin, is set with the [DefaultCharset] configuration option, of which the default was latin1 prior to v3.4, but has been utf8 since - perhaps this change is the origin of your problems; one can also specify the character set of import files with [Import][charset]). If one doesn't specify the desired character set upon connecting, issuing any of these commands after connecting but before your INSERT command will fix it (you could, for example, add one of them to the top of your dump file):
SET NAMES 'latin1';
SET CHARACTER SET latin1;
SET character_set_connection = latin1;
My recommendation, which makes the dumpfile as portable as possible, would be to add SET NAMES 'latin1' to the top of it.

Select MySQL rows with Japanese characters

Would anyone know of a reliable method (with mySQL or otherwise) to select rows in a database that contain Japanese characters? I have a lot of rows in my database, some of which only have alphanumeric characters, some of which have Japanese characters.
Rules when you have problem with character sets:
While creating database use utf8 encoding:
CREATE DATABASE _test DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Make sure all text fields (varchar and text) are using UTF-8:
CREATE TABLE _test.test (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE = MyISAM;
When you make a connection do this before you query/update the database:
SET NAMES utf8;
With phpMyAdmin - Choose UTF-8 when you login.
set web page encoding to utf-8 to make sure all post/get data will be in UTF-8 (or you'll have to since converting is painful..). PHP code (first line in the php file or at least before any output):
header('Content-Type: text/html; charset=UTF-8');
Make sure all your queries are written in UTF8 encoding. If using PHP:
6.1. If PHP supports code in UTF-8 - just write your files in UTF-8.
6.2. If php is compiled without UTF-8 support - convert your strings to UTF-8 like this:
$str = mb_convert_encoding($str, 'UTF-8', '<put your file encoding here');
$query = 'SELECT * FROM test WHERE name = "' . $str . '"';
That should make it work.
Following on to the helpful answer NickSoft, i had to set the encoding on the db connection to get it to work.
&characterEncoding=UTF8
Then the SET NAMES utf8; seemed to be redundant
As teneff stated, just use SELECT.
When installing MySQL, use UTF-8 as charset. Then, choosing utf8_general_ci as collation should do the work.
As Frosty stated, just use SELECT.
Look up the lowest and highest valued Japanese characters in the Unicode charts at http://www.unicode.org/roadmaps/bmp/ and use REGEXP. It may use several different regions of characters to get the whole Japanese character set. As long as you use the UTF-8 charset and utf8_general_ci collation, you should be able to use a REGEXP '[a-gk-nt-z]' where a-g represents one range of Unicode characters from the charts, k-n represents another range, etc.
There is limited number of japanese characters. You can search for these using
SELECT ... LIKE '%カ%'
Alternatively you can try their hexadecimal denomination -
SELECT ...LIKE CONCAT('%',CHAR(0x30ab),'%')
You may find useful this UTF-8 Japanese subset
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=12448
Supposing you're using UTF-8 character set for fields, queries, results...