Hi, I am trying to insert some Chinese characters into my MySQL database. Some characters insert without any issue, but the second character of the second insert produces the error shown below. Has anyone faced this kind of issue before? Am I choosing the wrong character set or collation?
Reference
https://dev.mysql.com/doc/refman/5.7/en/faqs-cjk.html#faq-cjk-why-cjk-fail-searches
CREATE TABLE testing (test VARCHAR(500) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
INSERT INTO testing VALUES('薛');
INSERT INTO testing VALUES('薛𦆱萍'); -- ERROR - MySQL Database Error: Incorrect string value: '\xF0\xA6\x86\xB1\xE8\x90...' for column 'test' at row 1
select * from testing;
According to Unicode character inspector the UTF encoding for 薛𦆱萍 is:
薛 = E8 96 9B
𦆱 = F0 A6 86 B1
萍 = E8 90 8D
MySQL complains about this:
\xF0\xA6\x86\xB1\xE8\x90
So everything is apparently correct, save for an implementation detail: the column uses MySQL's utf8 character set (utf8_unicode_ci is one of its collations), which is an incomplete UTF-8 implementation that only accepts characters of up to three bytes. Thus 𦆱, which needs four bytes, cannot be stored in a utf8 column.
You need to switch to the utf8mb4 character set, with one of the utf8mb4_... collations.
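To check other strings before inserting them, here is a minimal Python sketch (the helper names utf8_hex and chars_needing_utf8mb4 are made up for illustration) that prints the UTF-8 bytes of each character and lists the ones that need four bytes, i.e. the ones a three-byte utf8 column will reject:

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as spaced upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def chars_needing_utf8mb4(text):
    """Return the characters whose UTF-8 form is four bytes long."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


name = "薛𦆱萍"
for ch in name:
    print(ch, utf8_hex(ch))

# Anything listed here cannot be stored in a MySQL utf8 (three-byte) column.
print(chars_needing_utf8mb4(name))
```

An empty list means the text fits in utf8; anything else needs utf8mb4 on both the column and the connection.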
In addition to having the column set to utf8mb4, you must also tell MySQL that the client is speaking utf8mb4 (not just utf8). For some clients, this is best done when making the connection, for some there is a secondary command, such as SET NAMES utf8mb4 which should be performed right after connecting.
Debugging Q&A: Trouble with UTF-8 characters; what I see is not what I stored
That seems to be a rather unusual (or new?) Chinese character; even http://unicode.scarfboy.com/?s=f0a686b1 fails to show the graphic.
I was trying to determine an error in a java program that loads MySQL tables every night.
Error in the log was java.sql.SQLException: Incorrect string value:
'\xEF\xBF\xBD\xEF\xBF\xBD...' for column 'manager' at row 1.
Finally determined there was a new name in the data (loading from a flat file) - FRANÇOIS - and it was the cedilla that was giving the error. Program still loaded everything, just left that field blank.
When I ran SHOW FULL COLUMNS FROM tablename, the collation was latin1_swedish_ci. I know very little about collations and charsets.
What should I change the collation to in order for it to accept this?
(Too long for a comment)
Need to see more details.
Don't use latin1; use utf8.
Connect with ?useUnicode=yes&characterEncoding=UTF-8 in the getConnection() call
Use CHARACTER SET utf8 in the table and/or column definition. Please provide SHOW CREATE TABLE for confirmation.
EFBFBD is the "replacement" character, implying that you had garbage coming in.
Loading a flat file -- Can you get the hex of Ç from the file? If it is C7 it is latin1 and you should specify latin1 on the load. Is it LOAD DATA? Or something else?
If it is C387 then it is utf8; good.
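If it helps, the hex check can be done outside MySQL too. A rough Python sketch follows; guess_c_cedilla_encoding is a hypothetical helper, and it is only a crude heuristic for this one character, not a general encoding detector:

```python
# UTF-8 bytes of U+FFFD, the "replacement" character seen in the error message.
REPLACEMENT = "\ufffd".encode("utf-8")


def guess_c_cedilla_encoding(raw: bytes) -> str:
    """Guess how a capital C-cedilla is encoded in raw bytes from the file."""
    if b"\xc3\x87" in raw:   # C3 87: the utf8 encoding of the character
        return "utf8"
    if b"\xc7" in raw:       # C7: the latin1 encoding of the character
        return "latin1"
    return "unknown"


print(REPLACEMENT.hex().upper())
print(guess_c_cedilla_encoding("FRANÇOIS".encode("latin-1")))
print(guess_c_cedilla_encoding("FRANÇOIS".encode("utf-8")))
```

In practice you would read the offending line from the flat file in binary mode (open(path, "rb")) and pass those bytes in.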
More discussion, debugging, best practice, etc: Trouble with utf8 characters; what I see is not what I stored
Terminology: "Collation" (eg, latin1_swedish_ci) refers to sort order. Your problem is with "Character set" (eg, latin1 or utf8).
I currently have an address table in MySQL, with its character set set to 'utf8' and its collation to 'utf8_unicode_ci'. It has a column named Address, and I am trying to store the city name Łódź in that column. I tried to key it in directly into the table in SQLyog Community 64, as well as using the tool MySQL for Excel, but it keeps showing the error 'Incorrect string value'.
I have tried setting the character set to 'utf8mb4' and the collation to 'utf8mb4_unicode_ci', and it still gives me the same error.
Any help on how I should set the character set and collation in order to store Łódź? This city name is just one of many examples, and moving forward I may come across other similar characters as well. What can I use for a universal character set?
(utf8 and utf8mb4 work equally for Polish characters.)
You have not provided enough details about the flow of the characters, but the following should provide debugging for MySQL:
Trouble with utf8 characters; what I see is not what I stored
When stored correctly, the utf8 (or utf8mb4) encoding for Łódź is hex C581 C3B3 64 C5BA.
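To work out the expected hex for other city names and compare it with SELECT HEX(Address), something like this small Python sketch would do (expected_utf8_hex is just an illustrative name):

```python
def expected_utf8_hex(text: str) -> str:
    """Return the upper-case hex that HEX(col) should show for correctly stored text."""
    return text.encode("utf-8").hex().upper()


# Written with escapes so the source does not depend on the editor's encoding:
# U+0141 (L with stroke), U+00F3 (o acute), d, U+017A (z acute).
city = "\u0141\u00f3d\u017a"
print(city, expected_utf8_hex(city))
```

If the hex in the table differs from this, the characters were mangled somewhere between the client and the column.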
For a long time I have been using a database connection with the wrong encoding, so the Hebrew characters in the database display as characters from an unknown language, as the example below shows:
I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*
For the record, when I display this data with PHP, it shows as Hebrew. When I access it from the phpMyAdmin panel, it shows as gibberish (these unknown characters).
Is there any way to fix it although there is some data already inserted in the database?
That feels like "double-encoded" Hebrew strings.
This partially recovers the text:
UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1)))
--> '� �תה מסכי�,��
I do not know what leads to the � symbols.
Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
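Here is a Python sketch of how I arrive at those two hex strings. It assumes the usual double-encoding path (utf8 bytes misread as latin1, then converted to utf8) and relies on the documented fact that MySQL's latin1 is cp1252 with the five undefined bytes mapped to U+0081, U+008D, U+008F, U+0090 and U+009D. The function names are made up:

```python
# Bytes that cp1252 leaves undefined; MySQL's latin1 maps them to the
# Unicode code point with the same number.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def mysql_latin1_decode(raw: bytes) -> str:
    """Decode bytes the way MySQL's latin1 does (cp1252 plus five pass-through bytes)."""
    return "".join(
        chr(b) if b in UNDEFINED_IN_CP1252 else bytes([b]).decode("cp1252")
        for b in raw
    )


def double_encode(text: str) -> bytes:
    """Simulate utf8 bytes being misread as latin1 and stored in a utf8 column."""
    return mysql_latin1_decode(text.encode("utf-8")).encode("utf-8")


shalom = "\u05e9\u05dc\u05d5\u05dd"  # the four-letter Hebrew word used above
print(shalom.encode("utf-8").hex().upper())   # correctly stored
print(double_encode(shalom).hex().upper())    # double-encoded
```

Python's own cp1252 codec refuses those five bytes, which is why the sketch handles them separately.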
Please provide the output from that SELECT, then I will work on how to recover the data.
Edit
Here's what I think happened.
The client had characters encoded as utf8; and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8.
Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123
So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;
If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)
I am not sure that you can do anything about the data that has already been stored in the database. However, you can import hebrew data properly by making sure you have the correct character set and collation.
the db collation has to be utf8_general_ci
the collation of the table with hebrew has to be utf8_general_ci
for example:
CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `col`.`hebrew` (
`id` INT NOT NULL AUTO_INCREMENT,
`heb` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`)
) CHARACTER SET utf8
COLLATE utf8_general_ci;
INSERT INTO hebrew(heb) values ('שלום');
I have a MySQL database which contains some bad data.
I start with this Unicode string:
u'TECNOLOGÍA Y EDUCACIÓN'
Encoding to UTF-8 for the database yields:
'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'
When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'
Double-encoding aside, the result I'd expect here is:
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.
Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.
the goal now is to figure out the exact process of corruption so it can be reversed
As documented under ALTER TABLE Syntax:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
In your case:
change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
then drop the encoding information (by modifying the column so that it becomes a binary string):
ALTER TABLE my_table MODIFY my_column BLOB;
then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
If you cannot modify the database just yet, simply fetching data whilst the connection character set is latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():
SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
FROM my_table
Is mysql combining latin1 and cp1252 when it handles conversions?
As documented under West European Character Sets:
MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
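As for replicating this in Python: the stock cp1252 codec rejects exactly those five "undefined" positions, so no single built-in encoding matches MySQL's latin1. A minimal sketch of the reverse conversion, assuming the corruption path described in the question (the helper names are hypothetical):

```python
# Code points MySQL's latin1 passes straight through because cp1252
# has no character assigned to the corresponding byte.
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def mysql_latin1_encode(text: str) -> bytes:
    """Encode text the way MySQL's latin1 does (cp1252 plus five pass-through bytes)."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return bytes(out)


def undo_double_encoding(stored: bytes) -> bytes:
    """Recover the original UTF-8 bytes from double-encoded column bytes."""
    return mysql_latin1_encode(stored.decode("utf-8"))


corrupted = b"TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN"
fixed = undo_double_encoding(corrupted)
print(fixed)
print(fixed.decode("utf-8"))
```

Text that was never double-encoded will usually raise UnicodeEncodeError or UnicodeDecodeError here, so try it on a sample before running it over everything.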
A site I'm working on recently had an issue with its database. Apparently it got corrupted: when the tables were restored, any text field containing unusual symbols (e.g. the one-half symbol or the degree symbol) was cut off at the character before that symbol. I've got a copy of the table and distilled it down to the code below:
CREATE TABLE `products2` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into products2 values
(25, 0x5468652044504D203931322069732061206C617267652033BD204469676974204C434420566F6C746D657465722E20546865207369676E616C206265696E67206D6561737572656420697320616C736F207573656420746F20706F77657220746865206D657465722C20696E636C7564696E6720746865206261636B6C696768742E20546865206D657465722066656174757265732061203320746F20363056206D6561737572656D656E742072616E67652C20776974682061207265736F6C7574696F6E206F662031306D56206265747765656E20332E303020616E642031392E39395620616E64203130306D56206265747765656E2032302E3020616E642036302E30562E205768656E2074686520766F6C746167652064726F70732062656C6F772033562C204C4F20697320646973706C617965642028646F776E20746F20322E38562C207768656E2074686520646973706C61792077696C6C207475726E206F6666292E209148499220697320646973706C61796564207768656E2074686520766F6C7461676520676F65732061626F7665203630562E0D0A0D0A5363726577207465726D696E616C7320616C6C6F7720666F7220717569636B20616E64206561737920636F6E6E656374696F6E2E20546865206D6574657220697320686F7573656420696E206120726F6275737420636172726965722077686963682063616E20626520626F6C74656420696E20706C616365206F722070616E656C206D6F756E746564207573696E6720746865206C6F772070726F6669206C652062657A656C20616E6420636C6970732070726F76696465642E20416E2049503637202F204E454D412034582062657A656C20697320616C736F20617661696C61626C6520666F722070726F74656374696F6E20616761696E7374206475737420616E64206D6F6973747572652E0D0A0D0A417320746869732069732061206E65772064657369676E2077652073756767657374207468617420796F7520636F6E74616374204C617363617220666F7220757020746F2064617465206C6561642D74696D6520696E666F726D6174696F6E206265666F7265206F72646572696E67206F6E6C696E652E0D0A)
This throws an error:
#1366 - Incorrect string value: '\xBD Digi...' for column 'description' at row 1
Looking into this problem on Stack Overflow and around the web, it seems to be an issue with the encoding. I've tried changing the collation of the description field to utf8_unicode_ci and the collation of the table to utf8_bin (and all combinations of those), all to no avail.
I can't redo the dump as it's a backup. I don't understand how the system can output the dump but not accept it back. Presumably the backup was made via the command line (not certain), and I am using phpMyAdmin to restore it; I don't know if that makes a difference.
If it's not possible to import the data I'd be grateful if someone could tell me how to read the encoded data into text that I can then manually cut and paste.
Decoding the first 32 bytes as ASCII, we have (where ? is the 0xBD byte about which MySQL is complaining):
The DPM 912 is a large 3? Digit
A little bit of Googling for "DPM 912" suggests to me that character should be the vulgar one-half fraction, ½.
A number of character sets encode that character with the byte 0xBD, but one in particular jumps out: windows-1252, which was not only the default codepage in the (pre-Unicode) Windows world, but (under the name latin1) was also MySQL's default encoding before MySQL 8.0. It'd be a good guess that your data is encoded in windows-1252.
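That guess is easy to sanity-check, and the same approach answers the fallback question of how to read the dump as text. A short Python sketch using the opening bytes of the hex literal from the question, plus the four bytes 91 48 49 92 that appear later in it:

```python
# Opening bytes of the hex literal in the INSERT statement.
prefix = bytes.fromhex(
    "5468652044504D203931322069732061206C617267652033BD204469676974"
)

# 0xBD on its own is not valid UTF-8, which is what MySQL is objecting to.
try:
    prefix.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

as_cp1252 = prefix.decode("cp1252")

# Bytes 0x91 and 0x92 are curly single quotes in windows-1252 but have no
# printable meaning in ISO 8859-1, which supports the windows-1252 guess.
quoted = bytes.fromhex("91484992").decode("cp1252")

print(valid_utf8)
print(as_cp1252)
print(quoted)
```

Decoding the whole literal the same way gives readable text that can be cut and pasted if the import cannot be fixed.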
As explained in the MySQL manual, you can specify the encoding of a string literal by prefixing it with the encoding name:
A character string literal may have an optional character set introducer and COLLATE clause:
[_charset_name]'string' [COLLATE collation_name]
It goes on to say:
An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), or before bit-field literal notation (b'literal' and 0bnnnn).
Therefore (and because MySQL refers to windows-1252 as latin1), you could change your INSERT command to:
INSERT INTO products2 VALUES (25, _latin1 0x5468652044504D203931322069...);
The documentation also states:
For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
That is, if such an introducer is omitted (as it was in your original INSERT statement), the character set is assumed to be that defined by the character_set_connection system variable.
As mentioned here, there are a number of ways of setting that variable. One is to specify it when your client connects; in phpMyAdmin this is set with the [DefaultCharset] configuration option, whose default was latin1 prior to v3.4 but has been utf8 since (perhaps this change is the origin of your problems). One can also specify the character set of import files with [Import][charset]. If one doesn't specify the desired character set upon connecting, issuing any of these commands after connecting but before your INSERT command will fix it (you could, for example, add one of them to the top of your dump file):
SET NAMES 'latin1';
SET CHARACTER SET latin1;
SET character_set_connection = latin1;
My recommendation, which makes the dumpfile as portable as possible, would be to add SET NAMES 'latin1' to the top of it.