Why does the mysql command line render utf8 columns twice as wide as non-utf8 columns? Example:
$ mysql -u user --default-character-set=utf8
mysql> select "αβγαβγαβγαβγαβγαβγαβγ";
+--------------------------------------------+
| αβγαβγαβγαβγαβγαβγαβγ |
+--------------------------------------------+
| αβγαβγαβγαβγαβγαβγαβγ |
+--------------------------------------------+
1 row in set (0.00 sec)
mysql> select "abcabcabcabcabcabcabc";
+-----------------------+
| abcabcabcabcabcabcabc |
+-----------------------+
| abcabcabcabcabcabcabc |
+-----------------------+
1 row in set (0.00 sec)
As you can see, the first table's column is twice as wide as the second's, which often breaks the formatting once lines grow to more than half the screen width.
I tried this on MySQL 14.14 and MariaDB 15.1.
Is there a way to output utf8 columns with the same width as non-utf?
edit:
MariaDB [(none)]> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
In the source code for mysql.cc (the source for the mysql client), there is an explanation in the comment block for the function get_field_disp_length(), which is used when formatting result set output.
Return the length of a field after it would be rendered into
text.
This doesn't know or care about multibyte characters. Assume we're
using such a charset. We can't know that all of the upcoming rows
for this column will have bytes that each render into some fraction
of a character. It's at least possible that a row has bytes that
all render into one character each, and so the maximum length is
still the number of bytes. (Assumption 1: This can't be better
because we can never know the number of characters that the DB is
going to send -- only the number of bytes. 2: Chars <= Bytes.)
In other words, since UTF8 can store characters that are one byte each (like Latin characters), and the client can't know what the data looks like before it fetches it for display, it must assume that every character may take only one byte, and therefore reserves one display column per byte.
The story might be different if you used a character set that uses a constant 2 bytes per character, like UCS-2. But I have never heard of anyone using UCS-2, since MySQL supports variable-length Unicode encodings.
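The byte-versus-character arithmetic behind this can be sketched in Python (a standalone illustration of the reasoning, not the client's actual code):

```python
# The client knows only the byte length of the column up front; since
# characters <= bytes, it reserves one terminal column per byte.
s = "αβγαβγαβγαβγαβγαβγαβγ"
print(len(s))                  # 21 characters actually drawn on screen
print(len(s.encode("utf-8")))  # 42 bytes - the width the client reserves
```

Each Greek letter occupies one terminal column but two UTF-8 bytes, which is exactly the factor-of-two gap between the two tables above.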
Related
I've got a MySQL database with latin1 encoding, and I'm struggling with the SUBSTRING() function, which is apparently counting bytes rather than characters, as the following scenario shows:
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| ééé |
+-------------------------------+
Everything normal up to now, let's switch the connection to latin1 encoding.
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3) |
+-------------------------------+
| é� |
+-------------------------------+
The only way I have found so far is to convert the string to utf8 before calling SUBSTRING() and convert it back to latin1 afterwards. Which is very ugly...
MySQL [hozana]> select convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1);
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| convert(cast(convert(substring(convert(cast(convert('éééé' using latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1) |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| ééé |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
My question is, which is the right configuration to make in order to have SUBSTRING() working in latin1?
Note
Here is the configuration before and after set names:
MySQL [hozana]> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.5.54    |
+-----------+
MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)
MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
User error.
When you say SET NAMES latin1, you are announcing to MySQL that the bytes coming from the client (you) are encoded in latin1. But they weren't. They were still in utf8.
When you typed ééééé, the bytes generated were these 10 bytes: C3 A9 C3 A9 C3 A9 C3 A9 C3 A9. Those were sent to mysql as 10 latin1 characters, namely Ã©Ã©Ã©Ã©Ã©. SUBSTRING, as requested, carved off the first 3 characters (but they were latin1 characters: Ã©Ã, hex C3 A9 C3) and delivered them back to your UTF-8 client, which interpreted C3 A9 as é, then gagged on the lone C3, which is invalid UTF-8 by itself, and puked the black diamond � (the "REPLACEMENT CHARACTER") onto your terminal.
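The whole round trip can be reproduced outside MySQL; here is a Python sketch of the same chain (an illustration of the byte handling, not what the server literally executes):

```python
s = "ééééé"
wire = s.encode("utf-8")                 # the 10 bytes the client sends
assert wire == b"\xc3\xa9" * 5           # C3 A9, five times
as_latin1 = wire.decode("latin-1")       # server treats them as 10 latin1 chars
sub = as_latin1[:3]                      # SUBSTRING(..., 1, 3) -> 'Ã©Ã'
back = sub.encode("latin-1")             # bytes C3 A9 C3 go back to the client
print(back.decode("utf-8", errors="replace"))  # the trailing C3 is invalid UTF-8
```

The last line yields é followed by the replacement character, matching the é� seen in the console.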
So, always be sure to establish the encoding of the client, either via something in the connection mechanism or with SET NAMES. All sorts of nasties can occur if you specify it incorrectly. Alas, this does not address your problem directly; but it addresses a lot of other things that can happen.
Oh, another thing. You say you have "a MySQL database with latin1 encoding". That is OK. You must still specify the client to be encoded in (apparently) utf8 or utf8mb4. MySQL will convert to the encoding of the column when you do an INSERT, and convert back the other way when you do a SELECT. Since é exists in latin1 as well as utf8, (and ditto for all other Western European accented letters), all should be well.
Perhaps you crafted the question with a literal; that does not necessarily reflect SELECTing from a table. So I created a table with both a latin1 column and a utf8 column, each containing ééééé, and verified that the HEX and LENGTH values differed. Testing SELECT SUBSTRING(col, 1, 3) then correctly produced ééé in both cases.
I'm working with two mysql servers, trying to understand why they behave differently.
I've created identical tables on each:
+----------------+------------+-------------------+
| Field          | Type       | Collation         |
+----------------+------------+-------------------+
| some_chars     | char(45)   | latin1_swedish_ci |
| some_text      | text       | latin1_swedish_ci |
+----------------+------------+-------------------+
and I've set identical character set variables:
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
When I insert UTF-8 characters into the database on one server, I get an error:
DatabaseError: 1366 (HY000): Incorrect string value: '\xE7\xBE\x8E\xE5\x9B\xBD...'
The same insertion in the other server throws no error. The table just silently accepts the utf-8 insertion and renders a bunch of ? marks where the utf-8 characters should be.
Why is the behavior of the two servers different?
What command were you executing when you got the error?
Your data is obviously utf8 (good).
Your connection apparently is utf8 (good).
Your table/column is declared CHARACTER SET latin1? It should be utf8.
That is 美 - Chinese, correct? Some Chinese characters need 4-byte utf8. So you should use utf8mb4 instead of utf8 in all 3 cases listed above.
Other notes:
There is no substantive difference in this area in 5.6 versus 5.7.
SQL_MODE is not relevant.
VARCHAR is usually advisable over CHAR.
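For what it's worth, the bytes in the error message are ordinary 3-byte UTF-8; decoding them back (assuming the truncated dump continued normally) recovers the Chinese text:

```python
raw = b"\xe7\xbe\x8e\xe5\x9b\xbd"    # the start of the rejected value
text = raw.decode("utf-8")
print(text)                           # 美国
# These particular characters fit in 3-byte utf8 (utf8mb3); the 1366 error
# here comes from the latin1 column, not from needing 4-byte sequences.
print([len(c.encode("utf-8")) for c in text])  # [3, 3]
```

utf8mb4 is still the safer declaration, since it also covers the rarer characters that do need 4 bytes.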
Why, after executing set names utf8mb4, does the column name change to a question mark? See below:
mysql> show variables like 'character%' ;
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | utf8                                  |
| character_set_connection | utf8                                  |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | utf8                                  |
| character_set_server     | latin1                                |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> select '\U+1F600';
+------+
| 😀 |
+------+
| 😀 |
+------+
mysql> set names utf8mb4;
mysql> select '\U+1F600';
+------+
| ? |
+------+
| 😀 |
+------+
In my opinion, utf8mb4 is designed to support emoji characters like this one. So why, after changing to utf8mb4, did the column name change to a question mark?
In addition, I copied the emoji character from a website (http://getemoji.com/) and pasted it into the terminal. If I just type '\U+1F600' manually instead, I see this:
mysql> select '\U+1F600' ;
+---------+
| U+1F600 |
+---------+
| U+1F600 |
+---------+
So I guess something happened implicitly when I pasted it into the terminal, and that implicit conversion (😀 --> '\U+1F600') may explain this phenomenon.
This would appear to be expected behaviour according to the MySQL documentation, where metadata is declared to be stored as utf8 (the non-4-byte version).
It is returned to the client as character_set_results (utf8mb4), but most likely your virtual column name is stored as utf8 so that it stays compatible and comparable with all other metadata, and thus the 4-byte part of the character is lost even though no real table is involved.
See here:
https://dev.mysql.com/doc/refman/5.6/en/charset-metadata.html
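The size arithmetic is easy to check in Python; MySQL's old utf8 (a.k.a. utf8mb3) stops at 3-byte sequences, i.e. the Basic Multilingual Plane:

```python
e = "😀"                       # U+1F600, as pasted from the terminal
assert ord(e) == 0x1F600
assert ord(e) > 0xFFFF          # outside the BMP, so not representable in utf8mb3
print(len(e.encode("utf-8")))   # 4 bytes - needs utf8mb4 end to end
```

A column name (or value) holding this character therefore cannot survive a pass through utf8mb3 metadata, which is consistent with the ? seen above.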
I found more info by using Wireshark. See below:
Before executing set names utf8mb4
After executing set names utf8mb4
In this case the server can't find a charset number, so the column name becomes a question mark. It seems that the particular charset number does not matter, as long as it is not Unknown. If I execute set names latin1, the response packet info is:
I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.
I will focus on one character, the left smartquote: “
When I use SELECT from the console, it is printed without issue:
mysql> SELECT text FROM posts;
+-------+
| text |
+-------+
| “foo” |
+-------+
This means the data are being sent to my terminal as utf-8[0] (which is correct).
However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:
$ cat /tmp/x.csv
“fooâ€
Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.
What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?
Also, some miscellaneous facts:
SELECT @@character_set_database returns latin1
The text column is a VARCHAR(42):
mysql> DESCRIBE posts;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| text  | varchar(42) | NO   | MUL |         |       |
+-------+-------------+------+-----+---------+-------+
“ encoded as utf-8 yields \xe2\x80\x9c
\xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6
[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.
Newer versions of MySQL have an option to set the character set in the outfile clause:
SELECT col1,col2,col3
FROM table1
INTO OUTFILE '/tmp/out.txt'
CHARACTER SET utf8
FIELDS TERMINATED BY ','
Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) comes from in the middle.
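Both of the observed byte sequences fall out of this. A Python sketch of the double encoding (cp1252 versus strict iso-8859-1 explains the 7-byte versus 6-byte results):

```python
b = "“".encode("utf-8")                        # b'\xe2\x80\x9c'
seven = b.decode("cp1252").encode("utf-8")     # bytes read as cp1252 (a, euro, oe ligature)
print(seven)                                   # 7 bytes, as seen in the outfile
six = b.decode("latin-1").encode("utf-8")      # strict iso-8859-1 keeps 0x80 as a control char
print(six)                                     # 6 bytes, as computed in the question
```

The same arithmetic reproduces the … data point: its UTF-8 bytes E2 80 A6 read as cp1252 become â, €, ¦.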
When I try this, it works properly (but note how I put data in, and the variables set on the db server):
mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
mysql> create table sq (c varchar(10)) character set utf8;
mysql> show create table sq\G
*************************** 1. row ***************************
Table: sq
Create Table: CREATE TABLE `sq` (
`c` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.19 sec)
mysql> insert into sq values (unhex('E2809C'));
Query OK, 1 row affected (0.00 sec)
mysql> select hex(c), c from sq;
+--------+------+
| hex(c) | c    |
+--------+------+
| E2809C | “    |
+--------+------+
1 row in set (0.00 sec)
mysql> select * from sq into outfile '/tmp/x.csv';
Query OK, 1 row affected (0.02 sec)
mysql> show variables like "%char%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
And from the shell:
/tmp$ hexdump -C x.csv
00000000 e2 80 9c 0a |....|
00000004
Hopefully there's a useful tidbit in there…
I've found that this works well.
SELECT convert(col_name USING latin1) FROM posts INTO OUTFILE '/tmp/x.csv' …;
To specifically address your question "What is this?", you have answered it yourself:
I suspect this is because “Column values are dumped using the binary character set. In effect, there is no character set conversion.” - dev.mysql.com/doc/refman/5.0/en/select-into.html
That is the way MySQL stores utf8 encoded data internally. It's a terribly inefficient variation of Unicode storage, apparently using a full three bytes for most characters, and not supporting four byte UTF-8 sequences.
As for how to convert it to real UTF-8 using INTO OUTFILE... I don't know. Using other mysqldump methods will do it though.
As you can see, my MySQL database uses latin1 and the system is utf-8.
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
7 rows in set (0.00 sec)
Every time I tried to export a table I got a strangely encoded CSV file.
So, I put:
mysql_query("SET NAMES CP1252");
header('Content-Type: text/csv; charset=cp1252');
header('Content-Disposition: attachment;filename=output.csv');
at the top of my export script.
Then I have pure UTF-8 output.
Try SET CHARACTER SET <blah> before your select, <blah>=utf8 or latin1 etc...
See: http://dev.mysql.com/doc/refman/5.6/en/charset-connection.html
Or SET NAMES utf8; might work...
You can execute MySQL queries using the CLI tool (I believe even with an output format so it prints out CSV) and redirect to a file. Should do charset conversion and still give you access to do joins, etc.
You need to issue charset utf8 at the MySQL prompt before running the SELECT. This tells the server what to output the results as.
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | utf8                                                   |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | utf8                                                   |
| character_set_system     | utf8                                                   |
| character_sets_dir       | /usr/local/mysql-5.1.41-osx10.5-x86_64/share/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.1.41    |
+-----------+
1 row in set (0.00 sec)
mysql> select char(0x00FC);
+--------------+
| char(0x00FC) |
+--------------+
| ?            |
+--------------+
1 row in set (0.00 sec)
I was expecting the actual utf8 character " ü " instead of " ? ". I tried char(0x00FC using utf8) as well, but no go.
Using mysql version 5.1.41
I've been all over Google and cannot find anything on this. The MySQL docs simply say that multibyte output is expected for values greater than 255, as of MySQL version 5.0.14.
Thanks
You are confusing UTF-8 with Unicode.
0x00FC is the Unicode code point for ü:
mysql> select char(0x00FC using ucs2);
+-------------------------+
| char(0x00FC using ucs2) |
+-------------------------+
| ü                       |
+-------------------------+
In UTF-8 encoding, 0x00FC is represented by two bytes:
mysql> select char(0xC3BC using utf8);
+-------------------------+
| char(0xC3BC using utf8) |
+-------------------------+
| ü                       |
+-------------------------+
UTF-8 is merely a way of encoding Unicode characters in binary form. It is meant to be space efficient, which is why ASCII characters only take a single byte, and iso-8859-1 characters such as ü only take two bytes. Some other characters take three or four bytes, but they are much less common.
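The code-point-versus-encoding distinction is easy to verify in Python:

```python
assert ord("ü") == 0xFC                        # Unicode code point U+00FC
assert "ü".encode("utf-8") == b"\xc3\xbc"      # UTF-8 needs two bytes
assert "ü".encode("utf-16-be") == b"\x00\xfc"  # ucs2/utf16: BMP bytes == code point
```

This is why char(0x00FC using ucs2) works: for BMP characters, the ucs2 byte sequence and the code point value coincide.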
Adding to Martin's answer:
You can use an "introducer" instead of the CHAR() function. To do this, you specify the encoding, prefixed with an underscore, before the code point:
_utf16 0xFC
or:
_utf16 0x00FC
If the goal is to specify the code point instead of the encoded byte sequence, then you need to use an encoding in which the code point value happens to be the encoded byte sequence. For example, as shown in Martin's answer, 0x00FC is both the code point value for ü and the encoded byte sequence for ucs2 / utf16 (they are effectively the same encoding for BMP characters, though I prefer "utf16" as it is consistent with "utf8" and "utf32").
But utf16 only works for specifying code point values of BMP characters (code points U+0000 through U+FFFF). If you want a supplementary character, you will need to use the utf32 encoding. Not only does _utf32 0xFC return ü, but:
_utf32 0x1F47E
returns: 👾
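The BMP-versus-supplementary boundary can be checked in Python, with utf-32-be standing in for MySQL's utf32:

```python
assert chr(0xFC) == "ü"                                 # BMP: fits in utf16/ucs2
assert chr(0x1F47E) == "👾"                             # supplementary character
assert "👾".encode("utf-32-be") == b"\x00\x01\xf4\x7e"  # utf32 bytes == code point
assert "👾".encode("utf-16-be") == b"\xd8\x3d\xdc\x7e"  # utf16 needs a surrogate pair
```

Since the utf16 encoding of a supplementary character no longer equals its code point, only the utf32 introducer lets you write the code point directly.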
For more details on these options, plus Unicode escape sequences for other languages and platforms, please see my post:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)