MySQL's UTF-8 character support - mysql

I am getting this exception when trying to insert a value into a MySQL table:
java.sql.SQLException: Incorrect string value: '\xC2\x99 Adm...' for column
I found that \xC2\x99 maps to U+0099 (or \u0099), which is a 2-byte character. According to the documentation, characters of 3 bytes or fewer are supported by MySQL's utf8. I have also read about utf8mb4, but since this character is only 2 bytes and still triggers the error, the issue might be something else. Please suggest.

Looks like you are using the default latin1 character set (with its default collation latin1_swedish_ci), whereas you should be using utf8 with utf8_general_ci, since you mean to store UTF-8 data in the column. See the MySQL documentation on Character Sets and Collations in MySQL.
You can use an ALTER TABLE statement to change the table's character set and collation:
ALTER TABLE your_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
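If you want to confirm what the table and its columns currently use before altering anything, the following should show it (your_table is again just a placeholder):
SHOW CREATE TABLE your_table;        -- table definition, including DEFAULT CHARSET
SHOW FULL COLUMNS FROM your_table;   -- per-column Collation values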

Character U+0099 (aka '<control>') can indeed be encoded in Latin-1 aka ISO-8859-1 (more specifically, it's 0x99) and your connection appears to be properly configured to use UTF-8.
I suspect the problem is a MySQL peculiarity: latin1 does not mean ISO-8859-1:
mysql> SHOW CHARACTER SET LIKE 'latin1';
+---------+----------------------+-------------------+--------+
| Charset | Description          | Default collation | Maxlen |
+---------+----------------------+-------------------+--------+
| latin1  | cp1252 West European | latin1_swedish_ci |      1 |
+---------+----------------------+-------------------+--------+
1 row in set (0.00 sec)
And Windows-1252 does not have a position for U+0099:
ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also
called CP1252) except for the code points 128-159 (0x80-0x9F).
ISO-8859-1 assigns several control codes in this range. Windows-1252
has several characters, punctuation, arithmetic and business symbols
assigned to these code points.
From West European Character Sets in the MySQL manual:
latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions. For example, 0x80 is the Euro
sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to
Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and
0x9d to 0x009d.
In short: you cannot use a latin1 column to store such a character. Since you're already using UTF-8 in your application, you should consider upgrading your database to utf8 or, even better, utf8mb4.
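As a minimal sketch, assuming the data currently stored in the latin1 columns is valid latin1 (placeholder names; remember that utf8mb4 needs up to 4 bytes per character, so indexed VARCHAR lengths may need reviewing):
-- make utf8mb4 the default for new tables in this schema
ALTER DATABASE your_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- convert an existing table and its character columns
ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;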

You can add support for the UTF-8 character set when creating the MySQL database schema, since schema creation does not always use UTF-8 by default:
CREATE DATABASE dbName
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
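If you also expect characters outside the BMP (emoji, for example), a utf8mb4 variant of the same statement would be (same placeholder database name):
CREATE DATABASE dbName
DEFAULT CHARACTER SET utf8mb4
DEFAULT COLLATE utf8mb4_unicode_ci;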

Related

Can I convert MySQL database character set from latin1 to utf8 without losing data?

I want to convert my database to store unicode symbols.
Currently the tables have:
latin1_swedish_ci collation and latin1 character set
OR
utf8_general_ci collation and utf8 character set
I am not sure how the existing data is encoded, but I suppose it is utf-8 encoded, as I am using Django which I think encodes the data in utf-8 before sending to the database.
My question is:
Can I convert the tables to utf8_unicode_ci collation and utf-8 character set using the following queries without messing up the existing data? (as suggested in this post)
ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Considering latin1 is a subset of utf-8, I think it should work. What do you guys think?
Thank you in advance.
P.S: The version of MySQL is: 5.1
Latin1 is not a subset of UTF-8 - ASCII is. Latin1, however, is represented in Unicode.
CONVERT TO should work, as long as the data was stored in the correct encoding in the first place. Django may have used UTF-8 on the database connection, but the database should have re-encoded on the fly.
To check the actual encoding used, use the mysql command-line tool to execute an SQL query that selects a row you know contains non-ASCII characters, then use the MySQL HEX() function to check the bytes used. If you see bytes greater than 0x7f, check that they don't correspond to valid characters in https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
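For example, something along these lines (table, column and id value are placeholders):
SELECT my_column, HEX(my_column)   -- show the raw bytes of the stored value
FROM my_table
WHERE id = 123;
If the hex output shows sequences like C3A9 for é, the data is already stored as UTF-8; a single byte such as E9 would indicate latin1.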
If you have c396 sitting in a latin1 column, and you want it to mean Ö, then you are half way to "double encoding". Do not use CONVERT TO; that will really get you into "double encoding".
Instead, you need the 2-step ALTER.
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
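A concrete sketch of that two-step ALTER, assuming a hypothetical latin1 column col VARCHAR(100) in table Tbl that actually holds UTF-8 bytes (adjust the length and any NOT NULL/DEFAULT attributes to match the real definition):
-- step 1: drop the character-set label without touching the stored bytes
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(100);
-- step 2: re-label the same bytes as utf8
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(100) CHARACTER SET utf8;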
If you have already messed it up further, and now the Ö is hex C383E28093, then you need to fix double encoding.
This gets you the latin1 byte in 2 steps:
CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1) --> 'Ö' (C396)
HEX(CONVERT(CONVERT(UNHEX('C396') USING utf8) USING latin1)) --> 'Ö' in latin1 (D6)
This gets you the 2-byte utf8 encoding:
CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1)) USING utf8)
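If the column is declared utf8 and every row is double encoded, a hedged sketch of applying that conversion across the whole column would be the following (hypothetical names; try it on a copy first, since any character that cannot be represented in latin1 would be mangled):
UPDATE Tbl
SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8);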
Do you want the column to be latin1? Or utf8?

Unicode characters emoticons in MySQL with 4 bytes

I have to insert into MySQL strings that may contain characters like '😂'. I tried this:
ALTER TABLE `table_name`
DEFAULT CHARACTER SET utf8mb4,
MODIFY `colname` VARCHAR(200)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
and when I insert '😂':
INSERT INTO `table_name` (`col_name`) VALUES ('😂');
I get the following:
SELECT * FROM `table_name`;
????
How can I get the correct value back in my SELECT statements?
Thanks a lot.
You will need to set the connection encoding to utf8mb4 as well. It depends on how you connect to MySQL how to do this. SET NAMES utf8mb4 is the API-independent SQL query to do so.
What MySQL calls utf8 is a dumbed-down subset of actual UTF-8, covering only the BMP (code points U+0000 through U+FFFF). utf8mb4 is actual UTF-8, which can encode all Unicode code points. If your connection encoding is utf8, then all data is squeezed through this subset of UTF-8, and you cannot send or receive characters above the BMP to or from MySQL.
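A minimal end-to-end sketch, assuming the column has already been changed to utf8mb4 as in the ALTER above (same placeholder names):
SET NAMES utf8mb4;                                      -- connection charset
INSERT INTO `table_name` (`col_name`) VALUES ('😂');
SELECT `col_name`, HEX(`col_name`) FROM `table_name`;   -- expect F09F9882 for 😂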

mysql charsets, can I perform the conversion in python?

I have a MySQL database which contains some bad data.
I start with this Unicode string:
u'TECNOLOGÍA Y EDUCACIÓN'
Encoding to UTF-8 for the database yields:
'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'
When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'
Double-encoding aside, the result I'd expect here is:
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.
Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.
the goal now is to figure out the exact process of corruption so it can be reversed
As documented under ALTER TABLE Syntax:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
In your case:
change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
then drop the encoding information (by modifying the column so that it becomes a binary string):
ALTER TABLE my_table MODIFY my_column BLOB;
then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
If you cannot modify the database just yet, simply fetching data whilst the connection character set is latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():
SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
FROM my_table
Is mysql combining latin1 and cp1252 when it handles conversions?
As documented under West European Character Sets:
MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
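That cp1252 behaviour is easy to demonstrate directly in MySQL: interpreting the single byte 0x93 as latin1 and converting it to utf8 yields the UTF-8 encoding of U+201C (left double quotation mark), which is exactly the \xe2\x80\x9c you observed:
SELECT HEX(CONVERT(CONVERT(X'93' USING latin1) USING utf8));   -- E2809C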

utf8 and utf8_general_ci

I have a problem inserting rows into my DB.
When a row contains characters like 'è', 'ò', '€', '²', '³', etc., it returns an error like this (charset set to utf8):
Incorrect string value: '\xE8 pass...' for column 'descrizione' at row 1 - INSERT INTO materiali.listino (codice,costruttore,descrizione,famiglia) VALUES ('E 251-230','Abb','Relè passo passo','Relè');
But if I set the charset to latin1 or utf8_general_ci, it works fine and no errors are found.
Can somebody explain to me why this happens? I always thought that utf8 was "larger" than latin1.
EDIT: I also tried to use mysql_real_escape_string, but the error was always the same!!!!
mysql_real_escape_string() is not relevant, as it merely escapes string termination quotes that would otherwise enable an attacker to inject SQL.
utf8 is indeed "larger" than latin1 insofar as it is capable of representing a superset of the latter's characters. However, not every byte sequence represents valid utf8 characters, whereas every possible byte sequence does represent valid latin1 characters.
Therefore, if MySQL receives a byte sequence it expects to be utf8 (but which isn't), some characters could well trigger this "incorrect string value" error; whereas if it expects the bytes to be latin1 (even if they're not), they will be accepted - but incorrect data may be stored in the table.
Your problem is almost certainly that your connection character set does not match the encoding in which your application is sending its strings. Use the SET NAMES statement to change the current connection's character set, e.g. SET NAMES 'utf8' if your application is sending strings encoded as UTF-8.
Read about connection character sets for more information.
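To see what the current session is actually using, and to switch it, something like this can be run on the connection (the SET NAMES value should match whatever encoding your application really sends):
SHOW VARIABLES LIKE 'character_set%';   -- character_set_client/connection/results, etc.
SET NAMES 'utf8';                       -- if the application sends UTF-8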
As an aside, utf8_general_ci is not a character set: it's a collation for the utf8 character set. The manual explains:
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
According to the doc for UTF-8, the default collation is utf8_general_ci.
If you want a specific order in your alphabet that is not the general_ci one, you should pick one of the utf8_* collations provided for the utf8 charset, whichever matches your requirements in terms of ordering.
Both your table and your connection to the DB should be encoded in utf8, preferably with the same collation; read more about setting the connection collation.
To be completely safe, you should check your table collation, make sure it's utf8_*, and make sure your connection's collation is too, using the complete syntax of SET NAMES:
SET NAMES 'utf8' COLLATE 'utf8_general_ci'
You can find information about the different collations here.
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");
Eureka, the above did it :-)

MySQL can't restore table from backup - #1366 - Incorrect string value

A site I'm working on recently had an issue with its database. Apparently it got corrupted: when they restored the tables, any text field containing strange symbols (e.g. the half symbol or the degree symbol) was truncated at the character before that symbol. I've got a copy of the table and distilled the problem down to the code below:
CREATE TABLE `products2` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into products2 values
(25, 0x5468652044504D203931322069732061206C617267652033BD204469676974204C434420566F6C746D657465722E20546865207369676E616C206265696E67206D6561737572656420697320616C736F207573656420746F20706F77657220746865206D657465722C20696E636C7564696E6720746865206261636B6C696768742E20546865206D657465722066656174757265732061203320746F20363056206D6561737572656D656E742072616E67652C20776974682061207265736F6C7574696F6E206F662031306D56206265747765656E20332E303020616E642031392E39395620616E64203130306D56206265747765656E2032302E3020616E642036302E30562E205768656E2074686520766F6C746167652064726F70732062656C6F772033562C204C4F20697320646973706C617965642028646F776E20746F20322E38562C207768656E2074686520646973706C61792077696C6C207475726E206F6666292E209148499220697320646973706C61796564207768656E2074686520766F6C7461676520676F65732061626F7665203630562E0D0A0D0A5363726577207465726D696E616C7320616C6C6F7720666F7220717569636B20616E64206561737920636F6E6E656374696F6E2E20546865206D6574657220697320686F7573656420696E206120726F6275737420636172726965722077686963682063616E20626520626F6C74656420696E20706C616365206F722070616E656C206D6F756E746564207573696E6720746865206C6F772070726F6669206C652062657A656C20616E6420636C6970732070726F76696465642E20416E2049503637202F204E454D412034582062657A656C20697320616C736F20617661696C61626C6520666F722070726F74656374696F6E20616761696E7374206475737420616E64206D6F6973747572652E0D0A0D0A417320746869732069732061206E65772064657369676E2077652073756767657374207468617420796F7520636F6E74616374204C617363617220666F7220757020746F2064617465206C6561642D74696D6520696E666F726D6174696F6E206265666F7265206F72646572696E67206F6E6C696E652E0D0A)
This throws an error:
#1366 - Incorrect string value: '\xBD Digi...' for column 'description' at row 1
Looking into this problem on Stack Overflow and around the web, it seems to be an issue with the encoding. I've tried changing the collation of the description field to utf8_unicode_ci and the collation of the table to utf8_bin (and all combinations of those), all to no avail.
I can't redo the dump as it's a backup. I don't understand how the system can output the dump but not accept it back. Presumably the backup was made via the command line (I'm not certain), and I am using phpMyAdmin to restore it; I don't know if that makes a difference.
If it's not possible to import the data I'd be grateful if someone could tell me how to read the encoded data into text that I can then manually cut and paste.
Decoding the first 32 bytes as ASCII, we have (where ? is the 0xBD byte about which MySQL is complaining):
The DPM 912 is a large 3? Digit
A little bit of Googling for "DPM 912" suggests to me that the character should be the vulgar one-half fraction, ½.
A number of character sets encode that character with the byte 0xBD, but one in particular jumps out: windows-1252—which was not only the default codepage in the (pre-Unicode) Windows world, but is also MySQL's default encoding. It'd be a good guess that your data is encoded in windows-1252.
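If you want to verify that guess against MySQL's own latin1 (i.e. cp1252) conversion tables, a quick check is the following, where the expected results are ½ and its UTF-8 encoding C2BD:
SELECT CONVERT(X'BD' USING latin1) AS ch,
       HEX(CONVERT(CONVERT(X'BD' USING latin1) USING utf8)) AS utf8_hex;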
As explained in the MySQL manual, you can specify the encoding of a string literal by prefixing it with the encoding name:
A character string literal may have an optional character set introducer and COLLATE clause:
[_charset_name]'string' [COLLATE collation_name]
It goes on to say:
An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), or before bit-field literal notation (b'literal' and 0bnnnn).
Therefore (and because MySQL refers to windows-1252 as latin1), you could change your INSERT command to:
INSERT INTO products2 VALUES (25, _latin1 0x5468652044504D203931322069...);
The documentation also states:
For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
That is, if such an introducer is omitted (as it was in your original INSERT statement), the character set is assumed to be that defined by the character_set_connection system variable.
As mentioned here, there are a number of ways of setting that variable (including by specifying it when your client connects which, in phpMyAdmin, is set with the [DefaultCharset] configuration option, whose default was latin1 prior to v3.4 but has been utf8 since - perhaps this change is the origin of your problems; one can also specify the character set of import files with [Import][charset]). If one doesn't specify the desired character set upon connecting, issuing any of these commands after connecting but before your INSERT command will fix it (you could, for example, add one of them to the top of your dump file):
SET NAMES 'latin1';
SET CHARACTER SET latin1;
SET character_set_connection = latin1;
My recommendation, which makes the dumpfile as portable as possible, would be to add SET NAMES 'latin1' to the top of it.