MySQL: character encoding used by SELECT INTO?

I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.
I will focus on one character, the left smartquote: “
When I use SELECT from the console, it is printed without issue:
mysql> SELECT text FROM posts;
+-------+
| text |
+-------+
| “foo” |
+-------+
This means the data are being sent to my terminal as utf-8[0] (which is correct).
However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:
$ cat /tmp/x.csv
â€œfooâ€
Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.
What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?
Also, some miscellaneous facts:
SELECT @@character_set_database returns latin1
The text column is a VARCHAR(42):
mysql> DESCRIBE posts;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| text | varchar(42) | NO | MUL | | |
+-------+-------------+------+-----+---------+-------+
“ encoded as utf-8 yields \xe2\x80\x9c
\xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6
[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.

Newer versions of MySQL have an option to set the character set in the outfile clause:
SELECT col1,col2,col3
FROM table1
INTO OUTFILE '/tmp/out.txt'
CHARACTER SET utf8
FIELDS TERMINATED BY ','

Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) comes from in the middle.
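You can reproduce the 7-byte sequence at the prompt by forcing exactly that mis-round-trip (a sketch: the inner CONVERT reinterprets the raw UTF-8 bytes of “ as latin1/cp1252, the outer one re-encodes the result as UTF-8):
mysql> SELECT HEX(CONVERT(CONVERT(UNHEX('E2809C') USING latin1) USING utf8));
This should return C3A2E282ACC593, i.e. the â, € and œ described above.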
When I try the scenario myself, it works properly (but note how I put the data in, and the variables set on the DB server):
mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
mysql> create table sq (c varchar(10)) character set utf8;
mysql> show create table sq\G
*************************** 1. row ***************************
Table: sq
Create Table: CREATE TABLE `sq` (
`c` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.19 sec)
mysql> insert into sq values (unhex('E2809C'));
Query OK, 1 row affected (0.00 sec)
mysql> select hex(c), c from sq;
+--------+------+
| hex(c) | c |
+--------+------+
| E2809C | “ |
+--------+------+
1 row in set (0.00 sec)
mysql> select * from sq into outfile '/tmp/x.csv';
Query OK, 1 row affected (0.02 sec)
mysql> show variables like "%char%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
And from the shell:
/tmp$ hexdump -C x.csv
00000000 e2 80 9c 0a |....|
00000004
Hopefully there's a useful tidbit in there…

I've found that this works well.
SELECT convert(col_name USING latin1) FROM posts INTO OUTFILE '/tmp/x.csv' …;

To specifically address your question "What is this?", you have answered it yourself:
I suspect this is because “Column values are dumped using the binary character set. In effect, there is no character set conversion.” - dev.mysql.com/doc/refman/5.0/en/select-into.html
That is the way MySQL stores utf8-encoded data internally. It's a terribly inefficient variation of Unicode storage, using up to three bytes per character, and it does not support four-byte UTF-8 sequences.
As for how to convert it to real UTF-8 using INTO OUTFILE... I don't know. Other export methods, such as mysqldump, will do it though.
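For example, a mysqldump invocation along these lines (the database name here is hypothetical) should write the dump re-encoded as UTF-8:
$ mysqldump --default-character-set=utf8 mydb posts > posts.sql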

As you can see, my MySQL database uses latin1 and the system character set is utf-8.
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.00 sec)
Every time I tried to export a table, I got a strangely encoded CSV file.
So, I put:
mysql_query("SET NAMES CP1252");
header('Content-Type: text/csv; charset=cp1252');
header('Content-Disposition: attachment;filename=output.csv');
in my export script.
Then I have pure UTF-8 output.

Try SET CHARACTER SET <blah> before your select, <blah>=utf8 or latin1 etc...
See: http://dev.mysql.com/doc/refman/5.6/en/charset-connection.html
Or SET NAMES utf8; might work...

You can execute MySQL queries using the CLI tool (I believe even with an output format that prints something CSV-like) and redirect the output to a file. It should do the charset conversion and still give you access to joins, etc.
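For example, something along these lines (user and database names are hypothetical; the output is tab-separated rather than true CSV):
$ mysql -u user --default-character-set=utf8 -e "SELECT text FROM posts" mydb > /tmp/posts.tsv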

You need to issue charset utf8 at the MySQL prompt before running the SELECT. This tells the server what character set to use for the results.
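At the prompt that would look like this (a sketch; the client's charset command should issue the equivalent of SET NAMES for you):
mysql> charset utf8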

Related

Cyrillic encoding in MySQL

In my.ini I've changed properties from latin1 to cp1251 (then restarted the server)
[mysql]
default-character-set=cp1251
............................
[mysqld]
default-character-set=cp1251
I create the database:
CREATE DATABASE library DEFAULT CHARSET=cp1251;
Then I make a request to check the encoding:
SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| cp1251 | cp1251_general_ci |
+--------------------------+----------------------+
show variables like "char%";
+--------------------------+---------------------------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------------------------+
| character_set_client | cp1251 |
| character_set_connection | cp1251 |
| character_set_database | cp1251 |
| character_set_filesystem | binary |
| character_set_results | cp1251 |
| character_set_server | cp1251 |
| character_set_system | utf8 |
| character_sets_dir | C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\ |
+--------------------------+---------------------------------------------------------+
Create a table
CREATE TABLE genres (g_id INT, g_name VARCHAR(150)) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
When I try to insert Cyrillic data, the command-line window gets stuck:
mysql> INSERT INTO genres (g_id, g_name) VALUES (1, 'Поэзия');
'>
'>
'>
'>
Latin strings get inserted ok:
mysql> INSERT INTO genres (g_id, g_name) VALUES (1, 'Poetry');
Query OK, 1 row affected (0.06 sec)
Yesterday, after a whole day of trying and testing, I got it working well. I created some more tables and inserted some Cyrillic strings. But the next morning, and for the whole day since, I couldn't get it working again: the previously inserted data wouldn't display. After firing
set names utf8
the Cyrillic words appeared, but the numeric columns didn't show right. What have I missed?
It's not just one change.
character_set_client/connection/results (but not the other two that you changed) specify the encoding used by the client.
The column definitions in the database tables need to have a character set that can handle Cyrillic. One way is to do this to each table:
ALTER TABLE t CONVERT TO CHARACTER SET cp1251;
Have you already stored Cyrillic in latin1 columns?
Check by doing SELECT HEX(col) .... You may need the 2-step Alter as discussed in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
It would be best to switch to utf8mb4; that way you could handle all character sets throughout the world.
See also Trouble with UTF-8 characters; what I see is not what I stored
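To do that HEX check concretely (a sketch against the genres table; the expected hex is what correctly stored cp1251 should look like):
SELECT g_name, HEX(g_name) FROM genres;
-- correctly stored cp1251 'Поэзия' should show CFEEFDE7E8FF;
-- a D09FD0BE... pattern would suggest UTF-8 bytes ended up in the column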
I have found a workaround. After starting cmd
C:\Users\nikol>chcp 866
Active code page: 866
Then after starting mysql
mysql> set names cp866;
Query OK, 0 rows affected (0.00 sec)
But when I select the data, there are multiple trailing spaces
mysql> SELECT * FROM genres;
+------+------------------+
| g_id | g_name |
+------+------------------+
| 1 | Поэзия |
| 2 | Программирование |
| 3 | Психология |
| 4 | Наука |
| 5 | Классика |
| 6 | Фантастика |
+------+------------------+
6 rows in set (0.00 sec)
I guess I'll have to TRIM

MySQL SUBSTRING() with non-utf8 encoding

I've got a MySQL database with latin1 encoding, and I'm struggling with the SUBSTRING() function, which is obviously counting bytes and not characters, as shown by the following scenario:
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| ééé |
+-------------------------------+
Everything is normal up to now; let's switch the connection to latin1 encoding.
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| é� |
+-------------------------------+
The only way I have found so far is to convert the string to utf-8 before calling SUBSTRING() and to convert it back to latin1 afterwards, which is very ugly...
MySQL [hozana]> select convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1);
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1) |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| ééé |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
My question is: what is the right configuration to make SUBSTRING() work correctly over a latin1 connection?
Note
Here is the configuration before and after set names:
MySQL [hozana]> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.5.54 |
+-----------+
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
User error.
When you say SET NAMES latin1, you are announcing to MySQL that the bytes coming from the client (you) are encoded in latin1. But they weren't. They were still in utf8.
When you typed ééééé, the bytes generated were these 10 bytes: C3A9C3A9C3A9C3A9C3A9. Those were sent to MySQL as 10 latin1 characters, namely Ã©Ã©Ã©Ã©Ã©. SUBSTRING, as requested, carved off the first 3 characters (but they were latin1 characters: Ã©Ã, hex C3A9C3) and delivered them back to your UTF-8 client, which proceeded to interpret C3A9 as é, then gagged on the invalid UTF-8 byte, hex C3, and puked onto your terminal its black diamond � (the "REPLACEMENT CHARACTER").
So, always be sure to establish the encoding of the client, either via something in the connection mechanism or with SET NAMES. All sorts of nasties can occur if you specify it incorrectly. Alas, this does not address your problem directly; but it addresses a lot of other things that can happen.
Oh, another thing. You say you have "a MySQL database with latin1 encoding". That is OK. You must still specify the client to be encoded in (apparently) utf8 or utf8mb4. MySQL will convert to the encoding of the column when you do an INSERT, and convert back the other way when you do a SELECT. Since é exists in latin1 as well as utf8, (and ditto for all other Western European accented letters), all should be well.
Perhaps you crafted the Question with a literal. Well, that does not necessarily reflect SELECTing from a table. So I crafted a table with both a latin1 column and a utf8 column, each containing ééééé, and verified that the HEX and LENGTH were different. Testing SELECT SUBSTRING(col, 1, 3) then correctly produced ééé in both cases.
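For reference, a sketch of that kind of test (table and column names are made up):
SET NAMES utf8;
CREATE TABLE sub_test (
  l1 VARCHAR(10) CHARACTER SET latin1,
  u8 VARCHAR(10) CHARACTER SET utf8
);
INSERT INTO sub_test VALUES ('ééééé', 'ééééé');
SELECT HEX(l1), LENGTH(l1), HEX(u8), LENGTH(u8) FROM sub_test;
-- expect E9E9E9E9E9 / 5 for the latin1 column and C3A9C3A9C3A9C3A9C3A9 / 10 for the utf8 column
SELECT SUBSTRING(l1, 1, 3), SUBSTRING(u8, 1, 3) FROM sub_test;
-- expect ééé from both columns, since SUBSTRING counts characters, not bytes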

MySQL command-line table column width with utf8

Why does the mysql command line output utf8 columns twice as wide as non-utf8 columns? Example:
$ mysql -u user --default-character-set=utf8
mysql> select "αβγαβγαβγαβγαβγαβγαβγ";
+--------------------------------------------+
| αβγαβγαβγαβγαβγαβγαβγ |
+--------------------------------------------+
| αβγαβγαβγαβγαβγαβγαβγ |
+--------------------------------------------+
1 row in set (0.00 sec)
mysql> select "abcabcabcabcabcabcabc";
+-----------------------+
| abcabcabcabcabcabcabc |
+-----------------------+
| abcabcabcabcabcabcabc |
+-----------------------+
1 row in set (0.00 sec)
As you can see, the column in the first table is twice as wide as in the second table, and this often breaks the formatting when lines get more than half a screen wide.
I tried this on MySQL 14.14 and MariaDB 15.1.
Is there a way to output utf8 columns with the same width as non-utf?
edit:
MariaDB [(none)]> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
In the source code for mysql.cc (the source for the mysql client) there is an explanation in the comment block for function get_field_disp_length() which is used in the formatting of result set output.
Return the length of a field after it would be rendered into
text.
This doesn't know or care about multibyte characters. Assume we're
using such a charset. We can't know that all of the upcoming rows
for this column will have bytes that each render into some fraction
of a character. It's at least possible that a row has bytes that
all render into one character each, and so the maximum length is
still the number of bytes. (Assumption 1: This can't be better
because we can never know the number of characters that the DB is
going to send -- only the number of bytes. 2: Chars <= Bytes.)
In other words, since utf8 can store characters that are one byte each (like Latin characters), and the client can't know what the data will look like before it fetches it for display, it must assume that every character could be a single byte, so the maximum display width it reserves is the column's maximum length in bytes.
The story might be different if you used a character set that uses a constant 2 bytes per character, like UCS-2. But I have never heard of anyone using UCS-2, since MySQL supports variable-length Unicode encodings.

MySQL Incorrect string value error

In a Django application with a MySQL back-end, users try to insert notes which contain smileys, hearts, and other Unicode characters. MySQL refuses the operation with an error:
(1366, "Incorrect string value: '\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F' for column 'note' at row 1")
(The column in question has the longtext type. The Unicode characters in this case are valid: a heart and a modifier, https://codepoints.net/U+2764 and https://codepoints.net/U+FE0F, so it's not that they would be 4-byte UTF-8 characters. I made sure that MySQL's default character set is utf-8.)
What is interesting is that I cannot fully reproduce this error on my local development environment. One particular difference is that it only emits a warning for this anomaly.
Update1:
This is still bothering me:
mysql> SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name="sblive";
+----------------------------+
| default_character_set_name |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
I converted the specific table's charset to utf-8:
mysql> alter table uploads_uploads convert to character set utf8 COLLATE utf8_general_ci;
Query OK, 1209036 rows affected (1 min 10.31 sec)
Records: 1209036 Duplicates: 0 Warnings: 0
mysql> SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "sblive" AND table_name = "uploads_uploads" AND column_name = "note";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
+--------------------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE '%colla%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
You are asking for ❤️ followed by a "non-spacing" "VARIATION SELECTOR-16".
Your bytes are utf8 -- good
Your connection needs to specify utf8 -- does it?
Your TEXT column needs to be declared CHARACTER SET utf8 -- is it? Use SHOW CREATE TABLE to verify.
If you are using HTML, it needs to say charset=UTF-8 -- does it?
Suggest you switch to utf8mb4 if the 'back-end users' are likely to enter more emoticons -- the 'Emoji' will need it.
Addenda
Let's check the data... Please run this
SELECT col, HEX(col) FROM ...
Those two characters should deliver hex E29DA4 and EFB88F. If you see C3A2C29DC2A4C3AFC2B8C28F, you have "double encoding", which is a messier problem. 2764FE0F would indicate utf16, I think.
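If the hex looks right and more emoji are expected, the utf8mb4 switch suggested above would look something like this (a sketch against the table from the question; utf8mb4_unicode_ci is just one reasonable collation choice):
ALTER TABLE uploads_uploads CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = 'sblive' AND table_name = 'uploads_uploads' AND column_name = 'note';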

Why, after executing set names utf8mb4, does the column name change to a question mark?

Why, after executing set names utf8mb4, does the column name change to a question mark? See below:
mysql> show variables like 'character%' ;
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> select '\U+1F600';
+------+
| 😀 |
+------+
| 😀 |
+------+
mysql> set names utf8mb4;
mysql> select '\U+1F600';
+------+
| ? |
+------+
| 😀 |
+------+
In my opinion, utf8mb4 is designed to support these emoji characters. Why, after changing to utf8mb4, does the column name change to a question mark?
In addition: I copied the emoji character from a website (http://getemoji.com/) and pasted it into the terminal. If I instead just type '\U+1F600' manually, I get this. See below:
mysql> select '\U+1F600' ;
+---------+
| U+1F600 |
+---------+
| U+1F600 |
+---------+
So I guess something happened implicitly when I pasted it into the terminal, and this implicit conversion (😀 --> '\U+1F600') might explain this phenomenon.
This would appear to be expected behaviour according to the MySQL documentation, where metadata is declared to be stored as utf8 (the non-4-byte version).
It is returned to the client as character_set_results (utf8mb4); however, most likely your virtual column name is being stored as utf8 to be compatible and comparable with all other metadata, and thus the 4-byte character is lost even though it is not in a real table.
See here:
https://dev.mysql.com/doc/refman/5.6/en/charset-metadata.html
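A possible workaround, not from the original answer: give the expression an ASCII alias so the result metadata never has to carry the 4-byte character, for example:
mysql> set names utf8mb4;
mysql> select '😀' AS smiley;
The column header is then plain ASCII and displays fine, while the value itself still comes back as the emoji.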
I found more info by using Wireshark, comparing the column-definition response packet before and after executing set names utf8mb4 (the packet capture screenshots are not reproduced here).
In this case the server can't find a charset number, so the column name becomes a question mark. And it seems that which charset number it is does not matter, as long as it is not Unknown. If I execute set names latin1, the response packet shows a recognized charset number again.