MySQL Incorrect string value error - mysql

In a Django application with a MySQL back end, users try to insert notes containing smileys, hearts and similar Unicode characters. MySQL refuses the operation with an error:
(1366, "Incorrect string value: '\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F' for column 'note' at row 1")
(The column in question has longtext type. The Unicode characters in this case are valid: a heart and a modifier, https://codepoints.net/U+2764 and https://codepoints.net/U+FE0F, so they are not 4-byte UTF-8 characters. I made sure that MySQL's default character set is utf-8.)
What is interesting is that I cannot fully reproduce this error in my local development environment. One particular difference is that there MySQL only emits a warning for this anomaly.
Update1:
This is still bothering me:
mysql> SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name="sblive";
+----------------------------+
| default_character_set_name |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
I converted the specific table's charset to utf-8:
mysql> alter table uploads_uploads convert to character set utf8 COLLATE utf8_general_ci;
Query OK, 1209036 rows affected (1 min 10.31 sec)
Records: 1209036 Duplicates: 0 Warnings: 0
mysql> SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "sblive" AND table_name = "uploads_uploads" AND column_name = "note";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
+--------------------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE '%colla%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

You are asking for ❤️ followed by a "non-spacing" "VARIATION SELECTOR-16".
Your bytes are utf8 -- good
Your connection needs to specify utf8 -- does it?
Your TEXT column needs to be declared CHARACTER SET utf8 -- is it? Use SHOW CREATE TABLE to verify.
If you are using HTML, it needs to say charset=UTF-8 -- does it?
I suggest switching to utf8mb4 if the users are likely to enter more emoticons -- most Emoji are 4-byte characters and will need it.
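For the Django case in the question, the connection character set is commonly set through the OPTIONS dict of the database settings. The fragment below is a minimal sketch, assuming the usual MySQL driver setup; NAME is taken from the question, while USER and HOST are placeholders that do not come from the original post.

```python
# settings.py fragment (sketch). NAME comes from the question; USER and HOST
# are placeholders, not values from the original post.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "sblive",
        "USER": "app_user",   # placeholder
        "HOST": "127.0.0.1",  # placeholder
        "OPTIONS": {
            # Handed to the driver when it connects, so the session is
            # opened with this character set.
            "charset": "utf8mb4",
        },
    }
}
```

This only covers the connection. The column itself still has to be declared with a character set that can hold the characters.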
Addenda
Let's check the data... Please run this
SELECT col, HEX(col) FROM ...
Those two characters should deliver hex E29DA4 and EFB88F. If you see C3A2C29DC2A4C3AFC2B8C28F, you have "double encoding", which is a messier problem. 2764FE0F would indicate utf16, I think.
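The three hex strings mentioned above can be reproduced outside MySQL. This is a small sketch in Python; it uses Python's "latin-1" codec (ISO-8859-1) as a stand-in for a latin1 misreading, which is an assumption that happens to agree with the hex quoted above for these particular bytes.

```python
# HEAVY BLACK HEART followed by VARIATION SELECTOR-16
heart = "\u2764\ufe0f"


def hex_upper(data: bytes) -> str:
    """Format bytes the way MySQL's HEX() prints them."""
    return data.hex().upper()


# Correct storage: plain UTF-8, three bytes per character here.
utf8_bytes = heart.encode("utf-8")

# Double encoding: the UTF-8 bytes are misread as latin1 characters
# and then encoded to UTF-8 a second time.
double_encoded = utf8_bytes.decode("latin-1").encode("utf-8")

# UTF-16 (big endian, no BOM) for comparison.
utf16_bytes = heart.encode("utf-16-be")

print(hex_upper(utf8_bytes))
print(hex_upper(double_encoded))
print(hex_upper(utf16_bytes))
```

Comparing the output of HEX(col) against these three values indicates which of the cases applies to the stored data.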

Related

Inserting 4-byte unicode characters into MySQL/MariaDB

When attempting to insert 💩 (for example, which is a 4-byte unicode char), both MySQL (5.7) and MariaDB (10.2/10.3/10.4) give the same error:
Incorrect string value: '\xF0\x9F\x92\xA9'
The statement:
mysql> insert into bob (test) values ('💩');
Here's my database's charset/collation:
mysql> select @@collation_database;
+----------------------+
| @@collation_database |
+----------------------+
| utf8mb4_unicode_ci |
+----------------------+
1 row in set (0.00 sec)
mysql> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8mb4 |
+--------------------------+
1 row in set (0.00 sec)
The server's character set:
mysql> show global variables like '%character_set_server%'\G;
*************************** 1. row ***************************
Variable_name: character_set_server
Value: utf8mb4
The table:
create table bob ( `test` TEXT NOT NULL );
mysql> SHOW FULL COLUMNS FROM bob;
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| test | text | utf8mb4_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.00 sec)
Can anyone point me in the right direction?
Yes, as you commented, you need to use SET NAMES utf8mb4.
Your 4-byte character must pass from your client through the database connection and into a table. All of those must support utf8mb4. If any one of them does not support utf8mb4, then 4-byte characters will not be able to get through.
SET NAMES utf8mb4 makes the database session expect clients to send strings using that encoding. The default for character_set_client on MySQL 5.7 is utf8, so you need to set it to utf8mb4.
In MySQL 8.0.1 and later, the default character_set_client is utf8mb4 already, so you won't need to change it.
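To see which characters are affected, it helps to check how many bytes each one takes in UTF-8. The helper below is an illustrative sketch (the name needs_utf8mb4 is made up for this example): every code point above U+FFFF is encoded as 4 bytes and therefore cannot pass through a 3-byte utf8 client, connection or column.

```python
def needs_utf8mb4(text: str) -> list:
    """Return the characters of `text` that take 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


pile_of_poo = "\U0001F4A9"  # the character from the question
heart = "\u2764\ufe0f"      # two characters, 3 bytes each

# These are the same four bytes that appear in the error message.
print(pile_of_poo.encode("utf-8"))

print(needs_utf8mb4("hi " + pile_of_poo))
print(needs_utf8mb4(heart))
```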

MySQL SUBSTRING() with non-utf8 encoding

I've got a MySQL database with latin1 encoding, and I'm struggling with the SUBSTRING() function, which appears to count bytes rather than characters, as shown by the following scenario:
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| ééé |
+-------------------------------+
Everything normal up to now, let's switch the connection to latin1 encoding.
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| é� |
+-------------------------------+
The only workaround I have found so far is to convert the string to utf-8 before calling SUBSTRING() and convert it back to latin1 afterwards, which is very ugly...
MySQL [hozana]> select convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1);
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1) |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| ééé |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
My question is: what is the right configuration to make SUBSTRING() work in latin1?
Note
Here is the configuration before and after set names:
MySQL [hozana]> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.5.54 |
+-----------+
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
User error.
When you say SET NAMES latin1, you are announcing to MySQL that the bytes coming from the client (you) are encoded in latin1. But they weren't. They were still in utf8.
When you typed ééééé, the bytes generated were these 10 bytes: C3A9C3A9C3A9C3A9C3A9. Those were sent to MySQL as 10 latin1 characters, namely Ã©Ã©Ã©Ã©Ã©. SUBSTRING, as requested, carved off the first 3 characters (but they were latin1 characters: Ã©Ã, hex C3A9C3) and delivered them back to your UTF-8 client, which proceeded to interpret C3A9 as é, then gagged on the invalid UTF-8, hex C3, and puked on your terminal with its black diamond � (the "REPLACEMENT CHARACTER").
So, always be sure to establish the encoding of the client, either via something in the connection mechanism or with SET NAMES. All sorts of nasties can occur if you specify it incorrectly. Alas, this does not address your problem directly; but it addresses a lot of other things that can happen.
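The sequence of events described above can be replayed step by step. This is a sketch in Python, not MySQL itself; it assumes Python's cp1252 codec as a stand-in for MySQL's latin1, and models SUBSTRING(..., 1, 3) as a slice of the first three characters.

```python
typed = "\u00e9" * 5  # what the UTF-8 terminal sends: five e-acute letters

# The client really sends UTF-8: 10 bytes, C3A9 repeated five times.
sent = typed.encode("utf-8")

# After SET NAMES latin1 the server takes each byte as one character.
seen_by_server = sent.decode("cp1252")  # 10 characters

# SUBSTRING(..., 1, 3) keeps the first three of those characters.
cut = seen_by_server[:3]

# The result goes back as latin1 bytes: C3 A9 C3.
returned = cut.encode("cp1252")

# The UTF-8 terminal decodes C3A9 fine, then hits a dangling C3.
shown = returned.decode("utf-8", errors="replace")

print(len(sent), len(seen_by_server))
print(returned.hex().upper())
print(shown)
```

The last line prints one é followed by the replacement character, which matches the output shown in the question.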
Oh, another thing. You say you have "a MySQL database with latin1 encoding". That is OK. You must still specify the client to be encoded in (apparently) utf8 or utf8mb4. MySQL will convert to the encoding of the column when you do an INSERT, and convert back the other way when you do a SELECT. Since é exists in latin1 as well as utf8, (and ditto for all other Western European accented letters), all should be well.
Perhaps you crafted the Question with a literal. Well, that does not necessarily reflect SELECTing from a table. So, I crafted a table with both a latin1 column and a utf8 column, each containing ééééé, verified that the HEX and LENGTH were different. Then testing SELECT SUBSTRING(col, 1, 3) correctly produced ééé in both cases.
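The two-column test can be imitated in a few lines. This is only an analogy in Python (the substring helper is hypothetical and mimics SQL's 1-based SUBSTRING): the stored bytes and their length differ between the two encodings, yet a character-based substring gives the same answer once each value is decoded with its own character set.

```python
text = "\u00e9" * 5  # five e-acute letters

latin1_col = text.encode("latin-1")  # 1 byte per character
utf8_col = text.encode("utf-8")      # 2 bytes per character


def substring(stored: bytes, charset: str, pos: int, length: int) -> str:
    """Mimic SUBSTRING(col, pos, length): count characters, not bytes."""
    chars = stored.decode(charset)
    return chars[pos - 1 : pos - 1 + length]


# HEX and LENGTH differ between the two "columns"...
print(latin1_col.hex().upper(), len(latin1_col))
print(utf8_col.hex().upper(), len(utf8_col))

# ...but the character-based substring is the same.
print(substring(latin1_col, "latin-1", 1, 3))
print(substring(utf8_col, "utf-8", 1, 3))
```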

Mysql 5.5 how to set up everything to utf8?

I just want everything to default to utf8. I've checked this question, but nothing helped.
Currently, my /etc/my.cnf is:
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
But when I restart the server and create a new database, it is still latin1 (character_set_database and character_set_server):
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
When I create a database, it is latin1:
mysql> create database d1;
Query OK, 1 row affected (0.00 sec)
mysql> use d1;
Database changed
mysql> show variables like "character_set_database";
+------------------------+--------+
| Variable_name | Value |
+------------------------+--------+
| character_set_database | latin1 |
+------------------------+--------+
1 row in set (0.00 sec)
When I create a table in this database, it can't recognize the valid utf8 character 啊:
mysql> create table t1(name varchar(1) default '啊');
ERROR 1067 (42000): Invalid default value for 'name'
I know alter database d1 character set utf8; will fix this. But I just want everything to default to utf8. Is that possible?
This is tricky.
The character set and collation for the default database can be
determined from the values of the character_set_database and
collation_database system variables. The server sets these variables
whenever the default database changes. If there is no default
database, the variables have the same value as the corresponding
server-level system variables, character_set_server and
collation_server.
So one would assume the default for the collation-database is the same as the default for the collation-server variable.
Please check the following:
Is there any other config that would override your my.cnf, like /etc/mysql/mysql.cnf or ~/.my.cnf ?
The client (not server!) is setting its own collation upon startup, so you could set a client collation/encoding through [mysql] (not mysqld) or look if this is already set somewhere.
SHOW VARIABLES ... queries session-level variables; try querying the global settings explicitly with SHOW GLOBAL VARIABLES ...
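The first check, whether another option file overrides /etc/my.cnf, can be approximated with a short script. This is a rough sketch that leans on Python's configparser, which only handles the simple key = value subset of MySQL's option file syntax (no !include directives, no special quoting). The second file's contents are invented for illustration, and the rule that a later occurrence of an option wins is the assumption being modelled.

```python
import configparser


def last_value(option_files, section, option):
    """Return the value that wins when the files are read in order.

    A later occurrence of an option replaces an earlier one. Dashes and
    underscores in option names are treated as equivalent.
    """
    wanted = option.replace("_", "-")
    winner = None
    for text in option_files:
        parser = configparser.ConfigParser(
            allow_no_value=True, strict=False, interpolation=None
        )
        parser.read_string(text)
        if not parser.has_section(section):
            continue
        for key, value in parser.items(section):
            if key.replace("_", "-") == wanted:
                winner = value
    return winner


# The file from the question.
etc_my_cnf = """
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
"""

# Hypothetical file that is read later and silently overrides the first.
later_file = """
[mysqld]
character_set_server = latin1
"""

print(last_value([etc_my_cnf, later_file], "mysqld", "character-set-server"))
print(last_value([etc_my_cnf], "mysqld", "character-set-server"))
```

With real files, the contents would be read from disk in the order the server reads them; the paths named in the answer above are the obvious candidates.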

MySQL UTF8 Issue

Okay, I have tried to import a "CSV" file into MySQL for the past 24 hours but have failed miserably.
I have tried set names and set charset, and there is nothing left that I have not set to UTF8, not just for the DB and tables but for the server as well. Still no use.
I am importing directly into MySQL, so it is not a PHP issue. I will be grateful if anyone can highlight where I am going wrong.
mysql> SHOW CREATE DATABASE `dict_2`;
+----------+-----------------------------------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------------------------------+
| dict_2 | CREATE DATABASE `dict_2` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci */ |
+----------+-----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> show variables like "%character%"; show variables like "%collation%";
+--------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | utf8 |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | C:\xampp\mysql\share\charsets\ |
+--------------------------+--------------------------------+
8 rows in set (0.00 sec)
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
In its current form, this question is impossible to answer.
We're left guessing...
That you're using a MySQL LOAD DATA statement.
You've verified that the characterset encoding of the .csv file is not ucs2.
You've verified that the characterset encoding of the .csv file is utf8 (i.e. matches the character_set_database system variable), or that you've specified the appropriate characterset in the CHARACTER SET clause of the LOAD DATA statement.
Beyond that, there's a whole slew of other things that might be wrong, but we're still just guessing.
Very frequently when something in MySQL "fails miserably", there's some sort of indication, like an error message, or some other behavior that we can observe and describe.
In the question, the description of the failure mode is beyond vague, it's entirely non-existent.
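The two file checks guessed at above (not ucs2, actually utf8) can be done before running any import. The function below is a rough sketch; the name sniff_encoding and its return labels are made up for this example. It only looks for a byte order mark and then tries a strict UTF-8 decode, so a UTF-16 file without a BOM is not detected.

```python
import codecs


def sniff_encoding(raw: bytes) -> str:
    """Make a coarse guess about how the bytes of a CSV file are encoded."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16/ucs2"
    try:
        raw.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


sample = "mot,d\u00e9finition\n"  # made-up CSV line with one accented letter

print(sniff_encoding(sample.encode("utf-8")))
print(sniff_encoding(sample.encode("latin-1")))
print(sniff_encoding(codecs.BOM_UTF16_LE + sample.encode("utf-16-le")))
```

For a real file, the bytes would come from open(path, "rb").read(); the result tells you which name belongs in the CHARACTER SET clause, or that the file needs converting first.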

MySQL: character encoding used by SELECT INTO?

I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.
I will focus on one character, the left smartquote: “
When I use SELECT from the console, it is printed without issue:
mysql> SELECT text FROM posts;
+-------+
| text |
+-------+
| “foo” |
+-------+
This means the data are being sent to my terminal as utf-8[0] (which is correct).
However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:
$ cat /tmp/x.csv
â€œfooâ€
Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.
What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?
Also, some miscellaneous facts:
SELECT @@character_set_database returns latin1
The text column is a VARCHAR(42):
mysql> DESCRIBE posts;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| text | varchar(42) | NO | MUL | | |
+-------+-------------+------+-----+---------+-------+
“ encoded as utf-8 yields \xe2\x80\x9c
\xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6
[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.
Newer versions of MySQL have an option to set the character set in the outfile clause:
SELECT col1,col2,col3
FROM table1
INTO OUTFILE '/tmp/out.txt'
CHARACTER SET utf8
FIELDS TERMINATED BY ','
Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) comes from in the middle.
When I try this, it works properly (but note how I put data in, and the variables set on the db server):
mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
mysql> create table sq (c varchar(10)) character set utf8;
mysql> show create table sq\G
*************************** 1. row ***************************
Table: sq
Create Table: CREATE TABLE `sq` (
`c` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.19 sec)
mysql> insert into sq values (unhex('E2809C'));
Query OK, 1 row affected (0.00 sec)
mysql> select hex(c), c from sq;
+--------+------+
| hex(c) | c |
+--------+------+
| E2809C | “ |
+--------+------+
1 row in set (0.00 sec)
mysql> select * from sq into outfile '/tmp/x.csv';
Query OK, 1 row affected (0.02 sec)
mysql> show variables like "%char%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
And from the shell:
/tmp$ hexdump -C x.csv
00000000 e2 80 9c 0a |....|
00000004
Hopefully there's a useful tidbit in there…
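The cp1252 explanation above accounts exactly for the byte sequences reported in the question. This sketch uses Python's cp1252 codec as a stand-in for what MySQL calls latin1; the function names are made up for the example.

```python
def misread_as_cp1252(text: str) -> bytes:
    """UTF-8 bytes wrongly taken as cp1252, then encoded to UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


def repair(data: bytes) -> str:
    """Undo the round trip above, when every byte maps back cleanly."""
    return data.decode("utf-8").encode("cp1252").decode("utf-8")


left_quote = "\u201c"  # left smart quote from the question
ellipsis = "\u2026"    # the second data point from the question

mangled_quote = misread_as_cp1252(left_quote)
mangled_ellipsis = misread_as_cp1252(ellipsis)

print(mangled_quote, len(mangled_quote))
print(mangled_ellipsis)
print(repair(mangled_quote) == left_quote)
```

The middle byte 0x80 becomes the Euro sign U+20AC, which is where the three bytes E2 82 AC in the output come from. Python's cp1252 codec rejects the few bytes that cp1252 leaves undefined (0x9D, for instance), so the right smart quote cannot be pushed through this particular sketch.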
I've found that this works well.
SELECT convert(col_name USING latin1) FROM posts INTO OUTFILE '/tmp/x.csv' …;
To specifically address your question "What is this?", you have answered it yourself:
I suspect this is because “Column values are dumped using the binary character set. In effect, there is no character set conversion.” - dev.mysql.com/doc/refman/5.0/en/select-into.html
That is the way MySQL stores utf8 encoded data internally. It's a terribly inefficient variation of Unicode storage, apparently using a full three bytes for most characters, and not supporting four byte UTF-8 sequences.
As for how to convert it to real UTF-8 using INTO OUTFILE... I don't know. Using other mysqldump methods will do it though.
As you can see, my MySQL database uses latin1 and the system is utf-8.
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.00 sec)
Every time I tried to export a table I got a strangely encoded CSV file.
So, I put:
mysql_query("SET NAMES CP1252");
header('Content-Type: text/csv; charset=cp1252');
header('Content-Disposition: attachment;filename=output.csv');
in my export script.
Then I have pure UTF-8 output.
Try SET CHARACTER SET <blah> before your select, <blah>=utf8 or latin1 etc...
See: http://dev.mysql.com/doc/refman/5.6/en/charset-connection.html
Or SET NAMES utf8; might work...
You can execute MySQL queries using the CLI tool (I believe even with an output format so it prints out CSV) and redirect to a file. Should do charset conversion and still give you access to do joins, etc.
You need to issue charset utf8 at the MySQL prompt before running the SELECT. This tells the server which character set to use for the results.