How to set schema collation in MySQL for Japanese - mysql

I am having a problem with collation. I want to set collation to support the Japanese language. For example, when table.firstname has 'あ', a query with 'ぁ' should return the record. Thanks in advance.

That's like "uppercase" and "lowercase", correct?
mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_general_ci;
+---------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_general_ci |
+---------------------------------------+
| 0 |
+---------------------------------------+
mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_unicode_ci;
+---------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_unicode_ci |
+---------------------------------------+
| 1 |
+---------------------------------------+
mysql> SELECT 'あ' = 'ぁ' COLLATE utf8_unicode_520_ci;
+-------------------------------------------+
| 'あ' = 'ぁ' COLLATE utf8_unicode_520_ci |
+-------------------------------------------+
| 1 |
+-------------------------------------------+
I recommend changing your column to be COLLATION utf8_unicode_520_ci (or utf8mb4_unicode_520_ci).
If you expect to be including Chinese, then be sure to use utf8mb4. (Perhaps this advice applies to Kanji, too.)

Related

MySQL distinct select for emoji always return 1 emoji out of 2

have these emojis stored in a reaction column. 😏, 😇
If I run this query select a distinct reaction from chat_reactions where chat_id = 593 it only select the first one. My column's collation is utf8mb4_unicode_ci but this issue doesn't happen for ❤️ (red love emoji)
How can I solve this?
Solve this by using a collation that treats the emojis as distinct.
Collation defines which characters are equal, less than, or greater than other characters.
Demo: I created a table like yours and filled it with the three emojis:
mysql> select reaction from chat_reactions;
+----------+
| reaction |
+----------+
| 😏 |
| 😇 |
| ❤️ |
+----------+
In utf8mb4_unicode_ci collation, those face emojis are considered the same, so they are collapsed into one row:
mysql> select distinct reaction collate utf8mb4_unicode_ci from chat_reactions;
+-------------------------------------+
| reaction collate utf8mb4_unicode_ci |
+-------------------------------------+
| 😏 |
| ❤️ |
+-------------------------------------+
But the more modern collations treat the face emojis as distinct:
mysql> select distinct reaction collate utf8mb4_unicode_520_ci from chat_reactions;
+-----------------------------------------+
| reaction collate utf8mb4_unicode_520_ci |
+-----------------------------------------+
| 😏 |
| 😇 |
| ❤️ |
+-----------------------------------------+
mysql> select distinct reaction collate utf8mb4_0900_ai_ci from chat_reactions;
+-------------------------------------+
| reaction collate utf8mb4_0900_ai_ci |
+-------------------------------------+
| 😏 |
| 😇 |
| ❤️ |
+-------------------------------------+

MYSQL: best utf setting for ACCENT sensitive but CASE insensitive

For example, ἐν or Ἐν are the same, but should be distinguished from ἕν/Ἓν. I've tried utf8_bin which seems to be the closest, but is also case sensitive.
mysql> select 'ἐν' = 'Ἐν' collate utf8mb4_0900_as_ci;
+----------------------------------------------+
| 'ἐν' = 'Ἐν' collate utf8mb4_0900_as_ci |
+----------------------------------------------+
| 1 |
+----------------------------------------------+
mysql> select 'ἐν' = 'ἕν' collate utf8mb4_0900_as_ci;
+----------------------------------------------+
| 'ἐν' = 'ἕν' collate utf8mb4_0900_as_ci |
+----------------------------------------------+
| 0 |
+----------------------------------------------+

Why the result of "SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;" is 1?

I am a Chinese. I can not understand why the result of "SELECT 'ä' = 'ae' COLLATE latin1_german2_ci", which is the example on MySQL document, is 1?
Besides, I have read another article. It set the charset as latin1 first, and then the result of "SELECT 'ä' = 'ae' COLLATE latin1_german2_ci" becomes 0. Why is the two results of the same sql is different? Is it because the charset difference?
On MySQL document.
mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+-----------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+-----------------------------------------+
| 0 |
+-----------------------------------------+
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+--------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+--------------------------------------+
| 1 |
+--------------------------------------+
On the another article.
mysql> set charset latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+---------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+---------------------------------------+
| 0 |
+---------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+------------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+------------------------------------------+
| 0 |
+------------------------------------------+
1 row in set (0.00 sec)

MySQL Incorrect string value error

In a Django application with MySQL DB back-end users try to insert notes which contain some smileys and hearts and stuff which are Unicode characters. MySQL refuses the operations with an error:
(1366, "Incorrect string value: '\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F' for column 'note' at row 1")
(The column in question has longtext type. The Unicode characters in this case valid, it's a heart and a modifier https://codepoints.net/U+2764 https://codepoints.net/U+FE0F, so it's not that they would be 4 byte long UTF-8 characters. I made sure that MySQL's default character set is utf-8.)
What is interesting is that I cannot fully reproduce this error on my local developer environment. One particular difference is that it only emits a warning for that anomaly.
Update1:
This is still bothering to me:
mysql> SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name="sblive";
+----------------------------+
| default_character_set_name |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
I converted the specific table's charset to utf-8:
mysql> alter table uploads_uploads convert to character set utf8 COLLATE utf8_general_ci;
Query OK, 1209036 rows affected (1 min 10.31 sec)
Records: 1209036 Duplicates: 0 Warnings: 0
mysql> SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "sblive" AND table_name = "uploads_uploads" AND column_name = "note";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
+--------------------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE '%colla%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
You are asking for ❤️ followed by a "non-spacing" "VARIATION SELECTOR-16".
Your bytes are utf8 -- good
Your connection needs to specify utf8 -- does it?
Your TEXT column need to be declared CHARACTER SET utf8 -- is it? Use SHOW CREATE TABLE to verify.
If you are using HTML, it needs to say charset=UTF-8 -- does it?
Suggest you switch to utf8mb4 if the 'back-end users' are likely to enter more emoticons -- the 'Emoji' will need it.
Addenda
Let's check the data... Please run this
SELECT col, HEX(col) FROM ...
Those two character should deliver hex E29DA4 and EFB88F. If you see C3A2C29DC2A4C3AFC2B8C28F, you have "double encoding", which is a messier problem. 2764FE0F would indicate utf16, I think.

COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

I am trying to fix a character encoding issue - previously we had the collation set for this column utf8_general_ci which caused issues because it is accent insensitive..
I'm trying to find all the entries in the database that could have been affected.
set names utf8;
select * from table1 t1 join table2 t2 on (t1.pid=t2.pid and t1.id != t2.id) collate utf8_general_ci;
However, this generates the error:
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'
The database is now defined with DEFAULT CHARACTER SET utf8
The table is defined with CHARSET=utf8
The "pid" column is defined with: CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
The server version is Server version: 5.5.37-MariaDB-0ubuntu0.14.04.1 (Ubuntu)
Question: Why am I getting an error about latin1 when latin1 doesn't seem to be present anywhere in the table / schema definition?
MariaDB [(none)]> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
MariaDB [(none)]> SHOW VARIABLES LIKE '%collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
First, run this query:
SHOW VARIABLES LIKE '%char%';
You have character_set_server='latin1' shown in your post ...
So, go into your my.cnf and add or uncomment these lines:
character-set-server = utf8
collation-server = utf8_unicode_ci
Restart the server.
The same error is produced in MariaDB (10.1.36-MariaDB) by using the combination of parenthesis and the COLLATE statement. My SQL was different, the error was the same, I had:
SELECT *
FROM table1
WHERE (field = 'STRING') COLLATE utf8_bin;
Omitting the parenthesis was solving it for me.
SELECT *
FROM table1
WHERE field = 'STRING' COLLATE utf8_bin;
In my case I created a database and gave the collation 'utf8_general_ci' but the required collation was 'latin1'. After changing my collation type to latin1_bin the error was gone.