How to change encoding on the fly in a SELECT statement? - mysql

I have a table with a column that uses the cp1251_general_ci collation. I don't want to change the column's collation, but I want to get the data in utf8 encoding.
Is there a way to select the data so that it looks as if it came from a column with the utf8_general_ci collation?
I.e. I need something like this:
SELECT CONVERT_TO_UTF8(weirdColumn) FROM weirdTable

Here's a demo table using the cp1251 encoding. I'll insert some Cyrillic characters into it.
mysql> CREATE TABLE weirdTable (weirdColumn text) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
mysql> insert into weirdTable values ('ЂЃЉЌ');
mysql> select * from weirdTable;
+-------------+
| weirdColumn |
+-------------+
| ЂЃЉЌ |
+-------------+
Use MySQL's CONVERT() function to force the characters to a different encoding:
mysql> select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
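At the byte level, this conversion amounts to decoding the stored cp1251 bytes and re-encoding the same characters as UTF-8. The Python sketch below only illustrates that idea and is not MySQL's implementation. It assumes Python's cp1251 codec agrees with MySQL's cp1251 character set for these four letters, and the helper name cp1251_to_utf8 is made up for the example.

```python
def cp1251_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1251 bytes as UTF-8 (hypothetical helper for illustration)."""
    return raw.decode("cp1251").encode("utf-8")


text = "ЂЃЉЌ"
stored = text.encode("cp1251")      # cp1251 uses one byte per character
converted = cp1251_to_utf8(stored)  # UTF-8 uses two bytes per Cyrillic letter

print(stored.hex())     # 80818a8d
print(converted.hex())  # d082d083d089d08c
```

The characters are unchanged; only their byte representation differs, which is why both SELECT outputs look identical in the terminal.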
Here's proof that the result has been converted to utf8. I create a table using metadata from the query result:
mysql> create table w2
as select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> show create table w2\G
*************************** 1. row ***************************
Table: w2
Create Table: CREATE TABLE `w2` (
`weirdColumnUtf8` longtext CHARACTER SET utf8
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select * from w2;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
On my MySQL instance, utf8mb4 is the default character encoding. That's okay; it's a superset of utf8, and the utf8 encoding is enough to store these characters. As a general recommendation, though: if you use utf8, there's no reason not to use utf8mb4 instead.
If you change the character encoding, you cannot keep the cp1251 collation. Collations are specific to encodings. But you can use one of the collations associated with utf8 or utf8mb4. You can see the available collations for a given character encoding:
mysql> SHOW COLLATION WHERE Charset = 'utf8';
+--------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------------+---------+-----+---------+----------+---------+---------------+
...
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | PAD SPACE |
| utf8_general_mysql500_ci | utf8 | 223 | | Yes | 1 | PAD SPACE |
...

Related

Cyrillic encoding in MySQL

In my.ini I've changed properties from latin1 to cp1251 (then restarted the server)
[mysql]
default-character-set=cp1251
............................
[mysqld]
default-character-set=cp1251
I create database
CREATE DATABASE library DEFAULT CHARSET=cp1251;
Make request to check out the encoding:
SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| cp1251 | cp1251_general_ci |
+--------------------------+----------------------+
show variables like "char%";
+--------------------------+---------------------------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------------------------+
| character_set_client | cp1251 |
| character_set_connection | cp1251 |
| character_set_database | cp1251 |
| character_set_filesystem | binary |
| character_set_results | cp1251 |
| character_set_server | cp1251 |
| character_set_system | utf8 |
| character_sets_dir | C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\ |
+--------------------------+---------------------------------------------------------+
Create a table
CREATE TABLE genres (g_id INT, g_name VARCHAR(150)) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
As I try to insert cyrillic data, the Command Line window gets stuck:
mysql> INSERT INTO genres (g_id, g_name) VALUES (1, 'Поэзия');
'>
'>
'>
'>
Latin strings get inserted ok:
mysql> INSERT INTO genres (g_id, g_name) VALUES (1, 'Poetry');
Query OK, 1 row affected (0.06 sec)
Yesterday, after a whole day of trying and testing, I got it working well. I created some more tables and inserted some Cyrillic strings. But the next morning, and for the whole day since, I haven't been able to get it working again. The previously inserted data wouldn't display. After running
set names utf8
the Cyrillic words appeared, but numeric columns didn't display correctly. What have I missed?
It's not just one change.
character_set_client, character_set_connection, and character_set_results (but not the other two that you changed) specify the encoding of the client.
The column definitions in the database tables need to have a character set that can handle Cyrillic. One way is to do this to each table:
ALTER TABLE t CONVERT TO CHARACTER SET cp1251;
Have you already stored Cyrillic in latin1 columns?
Check by doing SELECT HEX(col) .... You may need the 2-step Alter as discussed in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
It would be best to switch to utf8mb4; that way you could handle all character sets throughout the world.
See also Trouble with UTF-8 characters; what I see is not what I stored
I have found a workaround. After starting cmd
C:\Users\nikol>chcp 866
Active code page: 866
Then after starting mysql
mysql> set names cp866;
Query OK, 0 rows affected (0.00 sec)
But when I select the data, there are multiple trailing spaces
mysql> SELECT * FROM genres;
+------+------------------+
| g_id | g_name |
+------+------------------+
| 1 | Поэзия |
| 2 | Программирование |
| 3 | Психология |
| 4 | Наука |
| 5 | Классика |
| 6 | Фантастика |
+------+------------------+
6 rows in set (0.00 sec)
I guess I'll have to TRIM

Inserting 4-byte unicode characters into MySQL/MariaDB

When attempting to insert 💩 (for example, which is a 4-byte unicode char), both MySQL (5.7) and MariaDB (10.2/10.3/10.4) give the same error:
Incorrect string value: '\xF0\x9F\x92\xA9'
The statement:
mysql> insert into bob (test) values ('💩');
Here's my database's charset/collation:
mysql> select @@collation_database;
+----------------------+
| @@collation_database |
+----------------------+
| utf8mb4_unicode_ci |
+----------------------+
1 row in set (0.00 sec)
mysql> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8mb4 |
+--------------------------+
1 row in set (0.00 sec)
The server's character set:
mysql> show global variables like '%character_set_server%'\G;
*************************** 1. row ***************************
Variable_name: character_set_server
Value: utf8mb4
The table:
create table bob ( `test` TEXT NOT NULL );
mysql> SHOW FULL COLUMNS FROM bob;
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| test | text | utf8mb4_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.00 sec)
Can anyone point me in the right direction?
Yes, as you commented, you need to use SET NAMES utf8mb4.
Your 4-byte character must pass from your client through the database connection and into a table. All of those must support utf8mb4. If any one of them does not support utf8mb4, then 4-byte characters will not be able to get through.
SET NAMES utf8mb4 makes the database session expect clients to send strings using that encoding. The default for character_set_client on MySQL 5.7 is utf8, so you need to set it to utf8mb4.
In MySQL 8.0.1 and later, the default character_set_client is utf8mb4 already, so you won't need to change it.
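To see why this particular character needs utf8mb4, it helps to look at its UTF-8 bytes. The sketch below is a rough illustration in Python; needs_utf8mb4 is a hypothetical helper, and it assumes the only question is whether a character lies outside the Basic Multilingual Plane (those are the characters that take four bytes in UTF-8, which MySQL's 3-byte utf8 cannot represent).

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane,
    i.e. takes four bytes in UTF-8 (hypothetical helper for illustration)."""
    return any(ord(ch) > 0xFFFF for ch in text)


poo = "\U0001F4A9"  # the character from the failing INSERT

print(poo.encode("utf-8"))      # b'\xf0\x9f\x92\xa9'
print(needs_utf8mb4(poo))       # True
print(needs_utf8mb4("Поэзия"))  # False
```

The four bytes printed are the same ones quoted in the "Incorrect string value" error message.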

MySQL Incorrect string value error

In a Django application with MySQL DB back-end users try to insert notes which contain some smileys and hearts and stuff which are Unicode characters. MySQL refuses the operations with an error:
(1366, "Incorrect string value: '\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F' for column 'note' at row 1")
(The column in question has the longtext type. The Unicode characters in this case are valid: a heart and a modifier, https://codepoints.net/U+2764 and https://codepoints.net/U+FE0F, so they are not 4-byte UTF-8 characters. I made sure that MySQL's default character set is utf-8.)
What is interesting is that I cannot fully reproduce this error on my local developer environment. One particular difference is that it only emits a warning for that anomaly.
Update1:
This is still bothering me:
mysql> SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name="sblive";
+----------------------------+
| default_character_set_name |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
I converted the specific table's charset to utf-8:
mysql> alter table uploads_uploads convert to character set utf8 COLLATE utf8_general_ci;
Query OK, 1209036 rows affected (1 min 10.31 sec)
Records: 1209036 Duplicates: 0 Warnings: 0
mysql> SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "sblive" AND table_name = "uploads_uploads" AND column_name = "note";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
+--------------------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE '%colla%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
You are asking for ❤️ followed by a "non-spacing" "VARIATION SELECTOR-16".
Your bytes are utf8 -- good
Your connection needs to specify utf8 -- does it?
Your TEXT column needs to be declared CHARACTER SET utf8 -- is it? Use SHOW CREATE TABLE to verify.
If you are using HTML, it needs to say charset=UTF-8 -- does it?
Suggest you switch to utf8mb4 if the 'back-end users' are likely to enter more emoticons -- the 'Emoji' will need it.
Addenda
Let's check the data... Please run this
SELECT col, HEX(col) FROM ...
Those two characters should deliver hex E29DA4 and EFB88F. If you see C3A2C29DC2A4C3AFC2B8C28F, you have "double encoding", which is a messier problem. 2764FE0F would indicate utf16, I think.
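The three hex strings can be reproduced outside MySQL. This Python sketch assumes "double encoding" means the UTF-8 bytes were misread as latin1 and then encoded as UTF-8 a second time, and it uses Python's latin-1 codec as a stand-in for MySQL's latin1 (the two agree for the bytes involved here).

```python
heart = "\u2764\ufe0f"  # HEAVY BLACK HEART + VARIATION SELECTOR-16

# Correct storage: plain UTF-8.
utf8_hex = heart.encode("utf-8").hex().upper()

# Double encoding: UTF-8 bytes misread as latin1, then encoded as UTF-8 again.
double_hex = heart.encode("utf-8").decode("latin-1").encode("utf-8").hex().upper()

# Big-endian UTF-16, for comparison.
utf16_hex = heart.encode("utf-16-be").hex().upper()

print(utf8_hex)    # E29DA4EFB88F
print(double_hex)  # C3A2C29DC2A4C3AFC2B8C28F
print(utf16_hex)   # 2764FE0F
```

Comparing the output of SELECT HEX(col) against these three values tells you which case applies.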

Does mysql latin1 also support emoji character?

The behavior below makes me feel I don't understand character sets at all. I thought only utf8mb4 supported emoji characters, e.g. 😀.
The documentation says:
As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplemental characters
But I accidentally found the following:
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> show create table t4\G
*************************** 1. row ***************************
Table: t4
Create Table: CREATE TABLE `t4` (
`data` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> insert into t4 select '\U+1F600';
mysql> select * from t4;
+------+
| data |
+------+
| 😀 |
+------+
Now I'm very confused: it seems latin1 can also store emoji characters. I know it must be an illusion, but I don't know how to explain it.
You cannot store anything other than ISO-8859-1 characters in a latin1 field without first converting it to, e.g., base64.
It might appear to work, but it will fail at some point, especially with multibyte characters such as emoticons.
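One plausible explanation for the illusion: the terminal sends the emoji as four UTF-8 bytes, the latin1 connection treats each byte as a separate one-byte character, and the same four bytes come back unchanged on SELECT, where the terminal renders them as UTF-8 again. The Python sketch below imitates that round trip. It assumes no conversion happens between a latin1 client and a latin1 column, and it uses Python's latin-1 codec as a stand-in for MySQL's latin1.

```python
emoji_bytes = "\U0001F600".encode("utf-8")  # what a UTF-8 terminal sends

# A latin1 connection believes it received four separate characters.
as_latin1 = emoji_bytes.decode("latin-1")
print(len(as_latin1))  # 4

# Reading the value back yields the same bytes, which the terminal
# displays as a single emoji again.
round_trip = as_latin1.encode("latin-1")
print(round_trip == emoji_bytes)  # True
```

Under this explanation the server never sees an emoji at all, so functions such as CHAR_LENGTH(), sorting, and any later character set conversion operate on four unrelated latin1 characters.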

MySQL: character encoding used by SELECT INTO?

I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.
I will focus on one character, the left smartquote: “
When I use SELECT from the console, it is printed without issue:
mysql> SELECT text FROM posts;
+-------+
| text |
+-------+
| “foo” |
+-------+
This means the data are being sent to my terminal as utf-8[0] (which is correct).
However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:
$ cat /tmp/x.csv
“fooâ€
Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.
What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?
Also, some miscellaneous facts:
SELECT @@character_set_database returns latin1
The text column is a VARCHAR(42):
mysql> DESCRIBE posts;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| text | varchar(42) | NO | MUL | | |
+-------+-------------+------+-----+---------+-------+
“ encoded as utf-8 yields \xe2\x80\x9c
\xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6
[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.
Newer versions of MySQL have an option to set the character set in the outfile clause:
SELECT col1,col2,col3
FROM table1
INTO OUTFILE '/tmp/out.txt'
CHARACTER SET utf8
FIELDS TERMINATED BY ','
Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) comes from in the middle.
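The seven bytes from the question can be reproduced by following exactly that reading. This Python sketch uses Python's cp1252 codec as a stand-in for MySQL's latin1; the opening quote and the ellipsis only involve bytes that cp1252 defines, so the stand-in is safe for these two characters.

```python
quote = "\u201c"              # left smart quote
utf8 = quote.encode("utf-8")  # e2 80 9c

# UTF-8 bytes misread as cp1252, then encoded as UTF-8 again: 7 bytes.
via_cp1252 = utf8.decode("cp1252").encode("utf-8")
print(via_cp1252.hex())  # c3a2e282acc593

# The same mistake with strict ISO-8859-1 gives the 6-byte form instead.
via_iso = utf8.decode("latin-1").encode("utf-8")
print(via_iso.hex())  # c3a2c280c29c

# The ellipsis data point from the question follows the same pattern.
ellipsis = "\u2026".encode("utf-8").decode("cp1252").encode("utf-8")
print(ellipsis.hex())  # c3a2e282acc2a6

# Undoing the mistake recovers the original character.
repaired = via_cp1252.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == quote)  # True
```

The middle three bytes e2 82 ac are the UTF-8 form of the Euro sign, which is what byte 0x80 becomes when it is read as cp1252.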
When I try this, it works properly (but note how I put data in, and the variables set on the db server):
mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
mysql> create table sq (c varchar(10)) character set utf8;
mysql> show create table sq\G
*************************** 1. row ***************************
Table: sq
Create Table: CREATE TABLE `sq` (
`c` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.19 sec)
mysql> insert into sq values (unhex('E2809C'));
Query OK, 1 row affected (0.00 sec)
mysql> select hex(c), c from sq;
+--------+------+
| hex(c) | c |
+--------+------+
| E2809C | “ |
+--------+------+
1 row in set (0.00 sec)
mysql> select * from sq into outfile '/tmp/x.csv';
Query OK, 1 row affected (0.02 sec)
mysql> show variables like "%char%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
And from the shell:
/tmp$ hexdump -C x.csv
00000000 e2 80 9c 0a |....|
00000004
Hopefully there's a useful tidbit in there…
I've found that this works well.
SELECT convert(col_name USING latin1) FROM posts INTO OUTFILE '/tmp/x.csv' …;
To specifically address your question "What is this?", you have answered it yourself:
I suspect this is because “Column values are dumped using the binary character set. In effect, there is no character set conversion.” - dev.mysql.com/doc/refman/5.0/en/select-into.html
That is the way MySQL stores utf8 encoded data internally. It's a terribly inefficient variation of Unicode storage, apparently using a full three bytes for most characters, and not supporting four byte UTF-8 sequences.
As for how to convert it to real UTF-8 using INTO OUTFILE... I don't know. Using other mysqldump methods will do it though.
As you can see, my MySQL database uses latin1 and the system character set is utf-8.
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.00 sec)
Every time I tried to export a table, I got a strangely encoded CSV file.
So, I put:
mysql_query("SET NAMES CP1252");
header('Content-Type: text/csv; charset=cp1252');
header('Content-Disposition: attachment;filename=output.csv');
in my export script.
Then I have pure UTF-8 output.
Try SET CHARACTER SET <blah> before your select, <blah>=utf8 or latin1 etc...
See: http://dev.mysql.com/doc/refman/5.6/en/charset-connection.html
Or SET NAMES utf8; might work...
You can execute MySQL queries using the CLI tool (I believe even with an output format so it prints out CSV) and redirect to a file. Should do charset conversion and still give you access to do joins, etc.
You need to issue charset utf8 at the MySQL prompt before running the SELECT. This tells the server what to output the results as.