What collation does MySQL use by default for ORDER BY? - mysql

These queries both give the result I expect:
SELECT sex
FROM ponies
ORDER BY sex COLLATE latin1_swedish_ci ASC
SELECT sex
FROM ponies
ORDER BY CONVERT(sex USING utf8) COLLATE utf8_general_ci ASC
| f |
| f |
| m |
| m |
+---+
But this query gives a different result:
SELECT sex FROM ponies ORDER BY sex ASC
| m |
| m |
| f |
| f |
+---+
Here's the configuration:
SHOW VARIABLES LIKE 'collation\_%'
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
The table collation is latin1_swedish_ci.
MySQL server is 5.5.16.

Table Collations
Collation defaults are stored on a table-by-table basis. There is a server-set default, but that is applied to the table at the time it is created.
To find the collation for a specific table, run this query:
SHOW TABLE STATUS LIKE 'ponies'\G
You should see output like this:
*************************** 1. row ***************************
Name: ponies
Engine: MyISAM
Version: 10
Row_format: Fixed
Rows: 8
Avg_row_length: 20
Data_length: 160
Max_data_length: 5629499534213119
Index_length: 1024
Data_free: 0
Auto_increment: NULL
Create_time: 2012-02-27 10:16:25
Update_time: 2012-02-27 10:17:40
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
And you can see the Collation setting in that result.
Column collations
You can also override collation settings on particular columns within a table. A create table statement like this would create a latin1_swedish_ci table, with a utf8_polish_ci column:
CREATE TABLE ponies (
sex CHAR(1) COLLATE utf8_polish_ci
) CHARACTER SET latin1 COLLATE latin1_swedish_ci;
The best way to view the results of this is like this:
SHOW FULL COLUMNS FROM ponies;
Output:
+-------+---------+----------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+---------+----------------+------+-----+---------+-------+---------------------------------+---------+
| sex | char(1) | utf8_polish_ci | YES | | NULL | | select,insert,update,references | |
+-------+---------+----------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.00 sec)

The documentation says it uses a case insensitive character comparison by default. I don't see why you are not getting that result though.
The documentation also suggests using the binary qualifier for case sensitive comparison. I wonder if that would affect your result?:
SELECT sex FROM ponies ORDER BY BINARY sex ASC

This behaviour can be observed when sex is an ENUM in which case it is usually sorted by the numerical position in the ENUM definition. Only when a collation is explicitly given an it is sorted in alphabetical order.

Related

How to change encoding on fly in SELECT statement?

I have a table with a column, which has cp1251_general_ci collation. I don't want to change column collation, but I want to get data in utf8 encoding.
Is there a way to select any data somehow in a way that it looks just like a data with utf8_general_ci collation?
I.e. I need something like this
SELECT CONVERT_TO_UTF8(weirdColumn) FROM weirdTable
Here's a demo table using the cp1251 encoding. I'll insert some Cyrillic characters into it.
mysql> CREATE TABLE weirdTable (weirdColumn text) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
mysql> insert into weirdTable values ('ЂЃЉЌ');
mysql> select * from weirdTable;
+-------------+
| weirdColumn |
+-------------+
| ЂЃЉЌ |
+-------------+
Use MySQL's CONVERT() function to force the characters to a different encoding:
mysql> select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
Here's proof that the result has been converted to utf8. I create a table using metadata from the query result:
mysql> create table w2
as select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> show create table w2\G
*************************** 1. row ***************************
Table: w2
Create Table: CREATE TABLE `w2` (
`weirdColumnUtf8` longtext CHARACTER SET utf8
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select * from w2;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
On my MySQL instance, utf8mb4 is the default character encoding. That's okay; it's a superset of utf8, and the utf8 encoding is enough to store these characters. However, I generally recommend if you use utf8, there's no reason not to use utf8mb4.
If you change the character encoding, you cannot keep the cp1251 collation. Collations are specific to encodings. But you can use one of the collations associated with utf8 or utf8mb4. You can see the available collations for a given character encoding:
mysql> SHOW COLLATION WHERE Charset = 'utf8';
+--------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------------+---------+-----+---------+----------+---------+---------------+
...
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | PAD SPACE |
| utf8_general_mysql500_ci | utf8 | 223 | | Yes | 1 | PAD SPACE |
...

Inserting 4-byte unicode characters into MySQL/MariaDB

When attempting to insert 💩 (for example, which is a 4-byte unicode char), both MySQL (5.7) and MariaDB (10.2/10.3/10.4) give the same error:
Incorrect string value: '\xF0\x9F\x92\xA9'
The statement:
mysql> insert into bob (test) values ('💩');
Here's my database's charset/collation:
mysql> select ##collation_database; +----------------------+
| ##collation_database |
+----------------------+
| utf8mb4_unicode_ci |
+----------------------+
1 row in set (0.00 sec)
mysql> SELECT ##character_set_database; +--------------------------+
| ##character_set_database |
+--------------------------+
| utf8mb4 |
+--------------------------+
1 row in set (0.00 sec)
The server's character set:
mysql> show global variables like '%character_set_server%'\G; *************************** 1. row ***************************
Variable_name: character_set_server
Value: utf8mb4
The table:
create table bob ( `test` TEXT NOT NULL );
mysql> SHOW FULL COLUMNS FROM bob;
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| test | text | utf8mb4_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.00 sec)
Can anyone point me in the right direction?
Yes, as you commented, you need to use SET NAMES utf8mb4.
Your 4-byte character must pass from your client through the database connection and into a table. All of those must support utf8mb4. If any one of them does not support utf8mb4, then 4-byte characters will not be able to get through.
SET NAMES utf8mb4 makes the database session expect clients to send string using that encoding. The default for character_set_client on MySQL 5.7 is utf8, so you need to set it to utf8mb4.
In MySQL 8.0.1 and later, the default character_set_client is utf8mb4 already, so you won't need to change it.

Using case sensitive match in Case when fucntion

The command
case when ltrim(rtrim(City_old)) = ltrim(rtrim(City_New)) then 'Y'
doesn't consider case sensitive differences.
Can someone please help me in using a case-sensitive match in case when function? Thanks in advance
Case sensitivity or insensitivity is based on the string collation defined for your columns. MySQL defaults to a case-insensitive collation, so all comparisons will ignore case by default.
mysql> select case when 'city' = 'City' then 'Y' else 'N' end as matches;
+---------+
| matches |
+---------+
| Y |
+---------+
You can make a comparison in a case-sensitive manner by overriding the collation:
mysql> select case when 'city' collate utf8mb4_bin = 'City' then 'Y' else 'N' end as matches;
+---------+
| matches |
+---------+
| N |
+---------+
You must choose a collation that is compatible with the character set of the string you are comparing. You can check which compatible collations are supported by your current MySQL instance:
mysql> SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE character_set_name='utf8mb4';
+------------------------+--------------------+
| COLLATION_NAME | CHARACTER_SET_NAME |
+------------------------+--------------------+
| utf8mb4_general_ci | utf8mb4 |
| utf8mb4_bin | utf8mb4 |
| utf8mb4_unicode_ci | utf8mb4 |
| utf8mb4_icelandic_ci | utf8mb4 |
. . .
All the collations ending with _ci are case-insensitive. The only case-sensitive option above is utf8mb4_bin.
Likewise for utf8:
mysql> SELECT * FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE character_set_name='utf8';
+--------------------------+--------------------+
| COLLATION_NAME | CHARACTER_SET_NAME |
+--------------------------+--------------------+
| utf8_general_ci | utf8 |
| utf8_bin | utf8 |
| utf8_unicode_ci | utf8 |
| utf8_icelandic_ci | utf8 |
. . .
The choice also depends on the MySQL version you use. They keep introducing new character sets and collations, trying to make MySQL support standards better. For example, in MySQL 8.0, you can use collation utf8mb4_0900_as_cs
Read https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.html for more details.
Case sensitivity in MySQL is often achieved by using the binary operator:
(case when binary ltrim(rtrim(City_old)) = binary ltrim(rtrim(City_New))
then 'Y' else 'N'
end) as is_same
This assumes that the original collation of the two strings is the same (which seems reasonable for two columns in the same table).

Mysql convert table, collation not changing

I'm using mariadb (" 10.1.20-MariaDB-1~trusty") with utf8mb4. Now I'm in the process of converting all tables to "row_format = dynamic" and table collation "utf8mb4_unicode_ci". I've noticed that there are some rogue tables in my database that still have "utf8mb4_general_ci" as collation, like this one:
use database;
SHOW TABLE STATUS WHERE COLLATION != "utf8mb4_unicode_ci";
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+--------------------+---------+
| table | InnoDB | 10 | Dynamic | 5 | 3276 | 16384 | 0 | 32768 | 0 | NULL | 2016-12-21 21:12:18 | NULL | NULL | utf8mb4_general_ci | NULL | row_format=DYNAMIC |
Then of course i would run something like this:
ALTER TABLE table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Which would finish without error. Checking Table Status again afterwards, still reads
Collation = utf8mb4_general_ci
for that table.
Dumping and importing that same database into my local 5.6.32-78.0 Percona Server and doing the same there will result in the table collation being converted to utf8mb4_unicode_ci as desired.
Does anyone have an idea what might be the cause for that?
Most likely there are no columns in the table to convert, so the operation is skipped. Try to run
ALTER TABLE table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, FORCE;
or
ALTER TABLE table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, ALGORITHM=COPY;
A bug report has been created based on this question:
https://jira.mariadb.org/browse/MDEV-11637

MySQL convert to UTF8 without structure change

I have a rather large database that I am trying to convert from charset and collation latin1/latin1_swedish_ci to utf8mb4/utf8mb4_unicode_ci. I am hoping to setup replication to a slave, run the conversion, and then promote the slave when finished as to avoid downtime.
I noticed that when running the query...
ALTER TABLE `sometable` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
...MySQL automatically converts text to mediumtext or mediumtext to longtext, etc.
Is there a way to turn this feature off? It is nice that MySQL has this feature, but the problem is that it breaks replication because the structure of the tables on the slave is different from master.
As documented under ALTER TABLE Syntax:
For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column's length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.
To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns. For example:
ALTER TABLE t MODIFY latin1_text_col TEXT CHARACTER SET utf8;
ALTER TABLE t MODIFY latin1_varchar_col VARCHAR(M) CHARACTER SET utf8;
(Not really an answer, but some illustrative examples.)
Case 1: Text is correctly stored as latin1 in latin1 column; use CONVERT TO
mysql> CREATE TABLE alters (
-> c VARCHAR(11) CHARACTER SET latin1 NOT NULL
-> );
mysql> INSERT INTO alters (c) VALUES ('aabc'), (UNHEX('61e06263')), (UNHEX('61e16263'));
mysql> SELECT c, HEX(c) from alters;
+-------+----------+
| c | HEX(c) |
+-------+----------+
| aabc | 61616263 |
| aàbc | 61E06263 |
| aábc | 61E16263 |
+-------+----------+
mysql> ALTER TABLE alters CONVERT TO CHARACTER SET utf8;
mysql> SELECT c, HEX(c) from alters;
+-------+------------+
| c | HEX(c) |
+-------+------------+
| aabc | 61616263 |
| aàbc | 61C3A06263 |
| aábc | 61C3A16263 |
+-------+------------+
mysql> -- Observation: text was correctly converted to utf8.
mysql> SHOW CREATE TABLE alters\G
Create Table: CREATE TABLE `alters` (
`c` varchar(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Case 2: Text is correctly stored as latin1 in latin1 column; use "Double ALTER"
mysql> CREATE TABLE alters (
-> c VARCHAR(11) CHARACTER SET latin1 NOT NULL
-> );
mysql> INSERT INTO alters (c) VALUES ('aabc'), (UNHEX('61e06263')), (UNHEX('61e16263'));
mysql> ALTER TABLE alters MODIFY c VARBINARY(11) NOT NULL;
mysql> ALTER TABLE alters MODIFY c VARCHAR(11) CHARACTER SET utf8 NOT NULL;
Query OK, 3 rows affected, 2 warnings (0.10 sec)
Records: 3 Duplicates: 0 Warnings: 2
mysql> SHOW WARNINGS;
+---------+------+----------------------------------------------------------+
| Level | Code | Message |
+---------+------+----------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xE0bc' for column 'c' at row 2 |
| Warning | 1366 | Incorrect string value: '\xE1bc' for column 'c' at row 3 |
+---------+------+----------------------------------------------------------+
mysql> SELECT c, HEX(c) from alters;
+------+----------+
| c | HEX(c) |
+------+----------+
| aabc | 61616263 |
| a | 61 |
| a | 61 |
+------+----------+
mysql> -- Observation: text was truncated ! BAD
mysql> SHOW CREATE TABLE alters\G
Create Table: CREATE TABLE `alters` (
`c` varchar(11) CHARACTER SET utf8 NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Case 3: Text was incorrectly stored as utf8 in a latin1 column; use the "Double ALTER to fix it
mysql> CREATE TABLE alters (
-> c VARCHAR(11) CHARACTER SET latin1 NOT NULL
-> );
mysql> INSERT INTO alters (c) VALUES ('aabc'), (UNHEX('61c3a06263')), (UNHEX('61c3a16263'));
mysql> ALTER TABLE alters MODIFY c VARBINARY(11) NOT NULL;
mysql> ALTER TABLE alters MODIFY c VARCHAR(11) CHARACTER SET utf8 NOT NULL;
mysql> SELECT c, HEX(c) from alters;
+-------+------------+
| c | HEX(c) |
+-------+------------+
| aabc | 61616263 |
| aàbc | 61C3A06263 |
| aábc | 61C3A16263 |
+-------+------------+
mysql> SHOW CREATE TABLE alters\G
Create Table: CREATE TABLE `alters` (
`c` varchar(11) CHARACTER SET utf8 NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Case 4: Using ALTER ... MODIFY; note the LENGTH and CHAR_LENGTH
mysql> CREATE TABLE alters (
-> c VARCHAR(9) CHARACTER SET latin1 NOT NULL
-> );
mysql> INSERT INTO alters (c) VALUES ('aabc'), (UNHEX('61e06263')),
-> (UNHEX('61e16263')),
-> (UNHEX('61e162633536373839'));
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) from alters;
+------------+--------------------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------------+--------------------+-----------+----------------+
| aabc | 61616263 | 4 | 4 |
| aàbc | 61E06263 | 4 | 4 |
| aábc | 61E16263 | 4 | 4 |
| aábc56789 | 61E162633536373839 | 9 | 9 |
+------------+--------------------+-----------+----------------+
mysql> ALTER TABLE alters MODIFY c VARCHAR(9) CHARACTER SET utf8 NOT NULL;
mysql> SELECT c, HEX(c), LENGTH(c), CHAR_LENGTH(c) from alters;
+------------+----------------------+-----------+----------------+
| c | HEX(c) | LENGTH(c) | CHAR_LENGTH(c) |
+------------+----------------------+-----------+----------------+
| aabc | 61616263 | 4 | 4 |
| aàbc | 61C3A06263 | 5 | 4 |
| aábc | 61C3A16263 | 5 | 4 |
| aábc56789 | 61C3A162633536373839 | 10 | 9 |
+------------+----------------------+-----------+----------------+
mysql> SHOW CREATE TABLE alters\G
Create Table: CREATE TABLE `alters` (
`c` varchar(9) CHARACTER SET utf8 NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Notes:
No Warnings except for the one case where I did SHOW.
Default table CHARSET was not changed, but that is not a problem.