Can't store certain chinese characters in MySQL - mysql

I'm importing csv files to MySQL using LOAD DATA with CHARACTER SET UTF8MB4. This workes well most often, but from time to time I still get errors like this:
ERROR 1300 (HY000): Invalid utf8mb4 character string: '楽天市場をみ'
It seems like if there are still some chinese characters that don't work and I have no idea why. Are these characters outside of utf8mb4? How can this be handled?
Edit: When opening the csv with notepad++ there seem to be an "invisible" part after the chinese letters, not sure if this is the reason or the chinese letters before: 楽天市場をみxE3x82

Is anything in the data-flow limiting that column to 20 bytes? E38292 is を; E382 seems to be a truncated UTF-8 character. I interpret 楽天市場をみxE3x82 as 6 well-formed 3-byte characters, plus 2 more bytes, hence 20.
I think the problem (and possible 20-byte limit) happened before creating the CSV file.

Related

MYSQL Workbench keeps importing illegal characters for the 'id' column

I keep on trying to import a CSV file, about 4000 chars long, into my MYSQL DB, via WorkBench...
Every time the ID column has an illegal character in it.
Why would MYSQL Workbench do this?
Right now it says \ufeffid ... so there is some character \ufeff before 'id'.
I exported my XLSX file as a CSV. It shouldnt have these characters.
That is the Unicode BOM character (code point/glyph), a zero-width space used to mark Unicode files as first character in the file. It is redundant (bad practice as we see), but so Windows Notepad discovers UTF-8 instead of the local charset.
With manual copying the first line, this BOM character may be copied to several lines in a file.
Somehow, somewhere, you need to get rid of them.
About the BOM, Byte-Order-Mark:
Unicode numbers all possible glyphs, code points, chars.
The conversion to binary data is in the form of UTF-8 (multi-byte), UTF-16LE (little endian), UTF-16BE (big endian), and UTF-32 LE/BE.
By a BOM character the encoding can be detected. For that it is U+FEFF, two different byte values.

How can I find out which character a certain string is encoding in my database?

Recently I exported parts of my mySQL database, and noticed that the text had several strange characters in it. For example, the string ’ often appeared.
When trying to find out what this meant, I found the stackoverflow question: Character Encoding and the ’ Issue. From that question I now know that the string ’ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter  often appears in my database as well, and is actually causing me a problem now on a certain page, and to solve the problem, I would like to know what that character means.
I've looked at several tables showing character encoding, but haven't been able to figure out how to use these tables to see why ’ means ', or, more importantly for me, what  stands for. I'd be very grateful if someone could point me in the right direction.
The latin1 encoding for ’ is (in hex) E28099.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually ’), and converted each one to utf8, hex C3A2 E282AC E284A2.
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown ’
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
As for Â... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, © is utf8 hex C2A9, which is misinterpreted ©.

Character encoding issues with migrating from MSSQL to MySQL

We have an application called JIRA running on Windows using MSSQL and I need to migrate it to Linux/MySQL. The character encoding in the existing MSSQL db is latin1 but I need to use UTF-8 in MySQL.
I take an xml dump of the MSSQL data using a backup mechanism provided by the application. Run it through python filter to convert the encoding from latin1 to UTF-8. Here is the python code that was provided to me by my colleague.
#!/usr/bin/python
import codecs, re
try:
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
#fin = codecs.open('unicodestuff.txt', encoding='utf-8', errors='replace')
fin = codecs.open('entities.xml', encoding='latin1')
fout = codecs.open('stripped.xml', encoding='utf-8', mode='w', errors='replace')
for line in fin:
line = highpoints.sub(u'', line)
fout.write(line)
fin.close()
fout.close()
I take the filtered xml dump and using a "restore" mechanism in the application, I restore the data. However, after restoring the data, I spot checked few records on the MySQL side and I see some weird characters and I am assuming these are related to character encoding. For example,
On the MSSQL side, the text string is
““Number of debits exceeds maximum of 0”
“2-Restrict All Credits”
Default ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง
Branch : 724 มาบุญครอง
whereas on the MYSQL side, the corresponding text appears as
â??â??Number of debits exceeds maximum of 0â?
â??2-Restrict All Creditsâ?
Default à¸à¸­à¸à¸à¸£à¸°à¹à¸ à¸à¸à¸±à¸à¸à¸µà¸à¸¹à¸à¸à¹à¸­à¸ à¹à¸à¹à¹à¸¥à¸à¸à¸±à¸à¸à¸µà¹à¸¡à¹à¸à¸¹à¸à¸à¹à¸­à¸
Branch : 724 มาà¸à¸¸à¸à¸à¸£à¸­à¸
Can you please provide me some ideas to fix these character encoding issues? Kindly let me know if additional information is required.
Thanks
Sam
Clearly your XML file does not actually use the Latin-1 character set. You've shown that text such as "ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง" is present in it. The Latin-1 character set does what it says on the label: it represents letters from Latin alphabets. Those letters do not exist in it. If the headers in your XML file claim that it's in Latin-1, then those headers are untrue and the XML is, strictly speaking, not valid. But it might still be usable.
Now the problem is, what character encoding is that XML file actually using? To find out, you may have to examine the XML file in hexadecimal. There are three main possibilities: (1) it's using an old codepage such as 874 which contains these characters; (2) it's using UTF-16; (3) it's using UTF-8.
If you examine in hexadecimal a section of the XML which contains some of this non-latin text, and some of the latin letters nearby, here's what you might see. If it's in a codepage such as 874, each latin letter will be one byte with a value from 32 to 7F, and each nonlatin letter will be one (or possibly two?) bytes with values of 80 to FF. If it's in UTF-16, each latin letter will be two bytes, one from 32 to 7F and the other being always 00, and the nonlatin letters will be two bytes with neither being 00. If it's in UTF-8, the latin letters will be one byte from 32 to 7F, and the nonlatin letters will be (probably) three bytes, all being from 80 to FF.
There may be an alternative to examining hexadecimal. Some text editor programs can save text files in your choice of encoding formats. TextPad 7, for instance, can save as ANSI, DOS, UTF-8, Unicode, or Unicode (big-endian). The latter two options are actually UTF-16. Try loading the XML into such a program, and saving copies of it as UTF-8 and as Unicode. One of these copies should be the same size as the original (plus or minus two or three bytes), and the other will be a different size. Whichever matches the size is probably the correct format. If both differ, then you've got something weird.
Anyway, if you save a version as UTF-8 and then are able to open it and see your data intact, you should then be able to import that without using a Python translator.

Excel CSV String Not Fully Uploading To Excel

I have this string in Excel (I've UTF encoded It) when I save as CSV and import to MySql I get only the below, I know it's probably a charset issue but could you explain why as I'm having difficulty understanding it.
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is structured to be utf8_general_ci encoded and VARCHAR(10000)
Mysql does not support full unicode utf8. There are some 4 byte characters that cannot be processed and, I guess, stored properly in regular utf8. I am assuming that upon import it is truncating the value after SPECIAL since mysql does not know how to process or store the character in the string that comes after that.
In order to handle full utf8 with 4 byte characters you will have to switch over to the utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters...
You can read more here #dev.mysql
Also, Here is a great detailed explanation on reg-utf8 issues in mysql and how to switch to utf8mb4.

Are there hidden encoding errors that I need to fix in Latin 1 --> UTF-8?

Do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?
I'm swapping forum software, and the old forum database used Latin1 encoding. The new forum database uses UTF8 encoding for tables.
It looks like the importer script did a straight copy from one table to another without trying to fix any encoding issues.
I've been manually fixing the visible errors using a find-and-replace based on the conversion info listed here: http://www.i18nqa.com/debug/utf8-debug.html
The rest of the text looks fine and is completely readable.
My limited understanding is that UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated.
So do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?
I'd rather not because I've changed some of the BB Code tags on a number of the fields after they were stored in UTF 8, so concerned that those updates would have stuck UTF8 characters in the middle of the Latin1 characters, and trying to do a full conversion on mixed character sets will just muck things up further.
Any characters from ISO 8859-1 (Latin 1) in the range 0x80..0xFF need to be recoded as 2 bytes in UTF-8. The first byte is 0xC2 for 0x80..0xBF; the first byte is 0xC3 for 0xC0..0xFF. The second byte is derived from the original value from Latin 1 by setting the two most significant bits to 1 and 0. For the characters 0x80..0xBF, the value of the second byte is unchanged from Latin 1. If you were using 8859-15, you may have a few more complex conversions (the Euro symbol is encoded differently from other Latin 1 characters).
There are tools aplenty to assist. iconv is one such.