MySQL: Inserting Traditional & Simplified Chinese in the same 'cell'

newbie here!
I have source data that contains both simplified and traditional Chinese in the same 'cell' (sorry, newbie using Excel speak here!), which I'm trying to load into MySQL using LOAD DATA INFILE.
The offending text is "到达广州新冶酒吧!一杯芝華士 嘈雜的音樂 行行色色的男女". It's got both simplified Chinese ("广") and traditional Chinese ("華").
When I load it into MySQL, I get the following error:
Error Code: 1366. Incorrect string value: '\xF0\xA3\x8E\xB4\xE8\x83...' for column 'Description' at row 2
The collation of the database is UTF-8 default collation, and the input file is also UTF-8 encoded.
Is there any way I can either:
a) Make MySQL accept this row of data (ideal), or
b) Get MySQL to skip inserting this line of data?
Thanks! Do let me know if you need further detail.
Kevin

If 😼 was tripping it up, that's because 😼 is not in the Basic Multilingual Plane of Unicode; it's in the Supplementary Multilingual Plane, which is above U+FFFF and takes up 4 bytes in UTF-8 instead of 3. Fully conformant Unicode implementations treat such characters no differently, but MySQL's utf8 character set doesn't accept characters above U+FFFF. If you have a recent version of MySQL (5.5.3 or later), you can ALTER TABLE to use utf8mb4, which handles all Unicode characters properly. There are some catches to the change, as MySQL allocates 4 bytes per character instead of 3; see http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html for the details.
This issue is a duplicate of Inserting UTF-8 encoded string into UTF-8 encoded mysql table fails with "Incorrect string value".
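As a sanity check (a small Python 3 sketch, not part of the original thread), the first four bytes in the error message decode to a single character above U+FFFF, which is exactly what 3-byte utf8 rejects:

```python
raw = b"\xF0\xA3\x8E\xB4"       # first bytes from the error message
ch = raw.decode("utf-8")        # valid UTF-8: a single CJK Extension B character
print(hex(ord(ch)))             # U+233B4, above U+FFFF
print(len(ch.encode("utf-8")))  # 4 bytes, too wide for MySQL's 3-byte utf8
```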

Related

detect with Python if the string will lead to "Incorrect string value" error in MySQL

I have a table in MySQL (5.7) database, which has collation utf8_unicode_ci,
and where I'm inserting some data with Python (3.6).
With some of the strings (for example, '\xCE\xA6') I get "Incorrect string value" error. On the DB side, I can mute this error by turning off the strict mode in MySQL, or changing the field's collation to utf8mb4.
However, such strings are "anomalies", and it is not desirable to change a collation or the sql_mode.
How can I detect in Python 3 that a given string will lead to an "Incorrect string value" error in MySQL, before inserting it into a table?
Where do you get the error message? What operation is being performed?
CEA6 is the UTF-8 (cf MySQL's utf8 or utf8mb4) hex for Φ; does it seem likely that that was the desired character?
To handle utf8 (or utf8mb4), you need to determine the client's encoding. It sounds like UTF-8. So, when connecting to MySQL, tell it that; use these in the connect call:
charset="utf8", use_unicode=True
If the character is in the Python source (and you are on Python 2), you need
# -*- coding: utf-8 -*-
at the beginning of the source file; Python 3 source is UTF-8 by default.
Also the column you are inserting into needs to be CHARACTER SET utf8 (or utf8mb4).
utf8mb4 is needed for Emoji and some of Chinese; otherwise it is 'equivalent' to utf8.
Do not use decode() or any other conversion functions; that will just make things harder to fix. In this arena, two wrongs do not make a right; they make a worse mess.
If you have other symptoms of garbled characters, see Trouble with UTF-8 characters; what I see is not what I stored
To discuss further, please provide the connection call, the SQL statement involved, SHOW CREATE TABLE, and anything else involved.
CEA6 is the valid utf8/utf8mb4 encoding of Φ, and could be interpreted as valid, though unlikely, latin1 (the two characters Î¦). But it is invalid for CHARACTER SET ascii. (I don't know how the error message occurred unless the connection or column said ascii or some obscure charset.)
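If changing the column to utf8mb4 really is not an option, one way to pre-screen strings in Python 3 is to reject anything outside the Basic Multilingual Plane, since that is precisely what utf8mb3 cannot store. (The function name is my own; note this does not catch rejections caused by a non-UTF-8 column such as latin1 or ascii.)

```python
def needs_utf8mb4(s: str) -> bool:
    """True if s contains a character above U+FFFF, i.e. one that
    MySQL's 3-byte utf8 (utf8mb3) cannot store."""
    return any(ord(c) > 0xFFFF for c in s)

print(needs_utf8mb4("Φ"))            # BMP character: fits in utf8mb3
print(needs_utf8mb4("\U0001f3b8"))   # emoji: needs utf8mb4
```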

How to resolve an incorrect string value error (MySQL)

I have some documents, formatted in XML. I want to store their contents (raw text, formatting preserved) in cells in an SQL table, as LONGTEXT, so that I can simply grab the value of a cell and load it in a webpage later. I am doing this via MySQL Workbench.
However, when I try to apply the additions to my table, I get error 1366: Incorrect string value: \xE2\x80\xAF1, ...
I tried changing the character set to utf8_general_ci and cp1251, but I keep getting the same errors.
Also, I searched the XML file for the string \xE2\x80\xAF1, but it's not even in the file.
Does anybody know what this string is?
The XML file is only 219KB so I think it should (very) easily fit in a LONGTEXT entry.
Does XML make use of any characters that could cause this error?
Am I missing another cause of the error?
That is not literal text: \xE2\x80\xAF is the UTF-8 encoding of U+202F NARROW NO-BREAK SPACE (the trailing 1 is just an ordinary digit).
In UTF-8, ASCII characters are coded as one byte; other characters need two, three, or even four bytes.
Multi-byte characters like this one tend to trigger such errors when the column's character set cannot represent them.
Find a related question here: freebcp: "Unicode data is odd byte size for column. Should be even byte size"
You need to:
specify utf8 (or utf8mb4) for the connection established from your client (Workbench), and
declare the column in question to be CHARACTER SET utf8 (or utf8mb4).
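To see for yourself what the bytes in the error actually are, a short Python 3 check (illustrative, not part of the original answer):

```python
import unicodedata

raw = b"\xE2\x80\xAF"    # the bytes from error 1366
ch = raw.decode("utf-8")
# Prints the codepoint and its official Unicode name.
print(hex(ord(ch)), unicodedata.name(ch))   # 0x202f NARROW NO-BREAK SPACE
```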

How can I let MySQL InnoDB support newer Unicode characters, such as '\U0001f3b8'?

The character \U0001f3b8 is GUITAR, generated by the iPhone's input method.
The problem is that it cannot be saved in the MySQL DB.
The column type is VARCHAR; I tried changing it to TEXT, but it still doesn't work.
The Exception is :
Warning: Incorrect string value: '\xF0\x9F\x8E\xB8' for column
'message' at row 1
BTW : My working environment is python + Django
I assume that "new unicode character set" means full UTF-8 support. You need to change your table/column character set and collation to ones based on the utf8mb4 encoding. Good old utf8 is an incomplete implementation that only supports up to three bytes per character. You need MySQL 5.5.3 or later for utf8mb4.
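As a quick check (Python 3), the bytes in the warning are exactly the UTF-8 encoding of the guitar emoji, which sits outside the Basic Multilingual Plane:

```python
ch = "\U0001F3B8"  # GUITAR
# These are the 4 bytes quoted in the warning: \xF0\x9F\x8E\xB8.
print(ch.encode("utf-8"))
# Codepoints above U+FFFF need 4 bytes in UTF-8, which MySQL's utf8 cannot store.
print(ord(ch) > 0xFFFF)
```

On MySQL 5.5.3+ something like `ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4` (with the connection charset set to utf8mb4 as well) should make the insert succeed; the table name here is a placeholder.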

mysql 5.5 utf-8 collation utf8_unicode_ci, pymysql

I have a problem with MySQL 5.5 on OS X. I'm working on a multilanguage project
and using MyISAM tables. The default character set is utf-8 and the default collation utf8_unicode_ci.
Italian and German are fine, but Spanish is not. I'm using Python for manipulating the data,
with the pymysql driver, charset option set to utf-8 and unicode true.
Practically all the Spanish-specific letters are a mess.
from python shell:
>>> r
['Blas P\xc3\xa9rez Gonz\xc3\xa1lez, 4']
>>> print r[0]
Blas Pérez González, 4
after saving it to database and fetching it again:
>>> r
(u'Blas P\xc3\xa9rez Gonz\xc3\xa1lez, 4')
>>> print r[0]
Blas Pérez González, 4
I'm really confused, it clearly seems to be the same unicode string!
Thanks.
Better to use Java-style Unicode escapes, like
u'\\u0e4f\\u032f\\u0361\\u0e4f'.decode('unicode-escape')
See similar question.
This ensures that you have unicode in the string.
Then the actual problem: try DESCRIBE the_table in MySQL. The character set can also be set per column in the column definition; check that to see if your table is okay.
For testing: Store u'Blas P\\u00e9rez Gonz\\u00e1lez'.decode('unicode-escape') in the database.
Then you know that the correct unicode string is stored.
If the database has correct db/table/field definitions, only the retrieval, not storing, may be at fault.
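For what it's worth, the fetched value looks like classic double encoding: UTF-8 bytes that were re-decoded as latin1 somewhere along the way. A Python 3 sketch of the round trip (the proper fix is the connection/column charset, not patching strings after the fact):

```python
# What came back from the DB: the UTF-8 bytes of 'é' (C3 A9) stored as
# two separate codepoints, i.e. mojibake.
fetched = "Blas P\xc3\xa9rez Gonz\xc3\xa1lez, 4"
# Reverse the damage: re-encode as latin1 to recover the raw bytes,
# then decode them as the UTF-8 they always were.
print(fetched.encode("latin-1").decode("utf-8"))
```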

MySQL Error #1366 -- Chinese Characters Fail with big5_chinese encoding

The idea: I'm just trying to save some Chinese characters to a MySQL database.
The issue: apparently, some save while others don't. I've tried to just put them in via phpMyAdmin, but when I try to save them, they turn out as question marks ("?").
The query: UPDATE a9286500_chinese.chinese SET chinese = '贵' WHERE chinese.id =23 LIMIT 1 ;
The error: Warning: #1366 Incorrect string value: '\xE8\xB4\xB5' for column 'chinese' at row 1
The collation of the table is big5_chinese_ci.
Characters like 我 (wo) and 你 (ni) work, whereas characters like 贵 (gui) don't.
Thoughts?
That character (贵) is not encodable in Big5, which covers Traditional Chinese only. If you need to handle both Simplified and Traditional Chinese, you should use a Unicode encoding, like UTF-8.
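This is easy to verify in Python 3, since the standard library ships a big5 codec (the helper function name is my own):

```python
def fits_big5(s: str) -> bool:
    """True if every character in s is encodable in Big5."""
    try:
        s.encode("big5")
        return True
    except UnicodeEncodeError:
        return False

# 我 and 你 are in Big5; simplified 贵 is not (Big5 has only traditional 貴).
print(fits_big5("我"), fits_big5("你"), fits_big5("贵"))
```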