AWS MariaDB Statement could not be executed - mysql

I seem to have a character encoding issue with MariaDB on AWS that I can't resolve:
Statement could not be executed (22007 - 1366 - Incorrect string value: '\xA320 Of...'
Initially I presumed this was because the table was set to latin1 but I've since changed the table and the column to utf8mb4_unicode_ci and the error persists.

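Since your column is defined as utf8mb4, the data you insert must also be valid UTF-8.
The byte \xA3 is not a valid UTF-8 sequence on its own (in latin1 it is the pound sign, £, so the client is probably sending latin1 data):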
mysql> select convert(X'A320' using utf8mb4);
+--------------------------------+
| convert(X'A320' using utf8mb4) |
+--------------------------------+
| ? |
+--------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+-------------------------------------------+
| Level | Code | Message |
+---------+------+-------------------------------------------+
| Warning | 1300 | Invalid utf8mb4 character string: '\xA3 ' |
+---------+------+-------------------------------------------+
1 row in set (0.00 sec)
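
The same check can be reproduced outside the database. The following is a minimal Python sketch, assuming the rejected bytes are exactly the ones shown in the error message and that the client produced them as latin1; the helper name is_valid_utf8 is made up for illustration.

```python
# Bytes taken from the error message: 0xA3 followed by ASCII "20 Of".
raw = b"\xa320 Of"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumption: the client actually produced latin1 text.
as_latin1 = raw.decode("latin-1")
# Re-encoding as UTF-8 gives bytes a utf8mb4 column can accept.
fixed = as_latin1.encode("utf-8")

print(is_valid_utf8(raw))   # False
print(as_latin1)            # £20 Of
print(fixed.hex())          # c2a33230204f66
```

If the application cannot be changed to send UTF-8, declaring the real encoding of the connection (for example with SET NAMES latin1) lets the server do this conversion itself.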

Related

Adding a UCA Collation to a Unicode Character Set: why doesn't it work?

In Unicode Locale Data Markup Language (LDML), the <rules> element and its sub-elements have been deprecated since version 24, but the MySQL example still uses the deprecated element.
When I added a collation to MySQL using the latest CLDR collation definition, which is marked up with the <cr> element, the collation did not take effect.
I want to add a collation for utf8mb4 to MySQL using the stroke collation in <zh.xml>.
MySQL Path: mysql-8.0.28-winx64\share\charsets\index.xml
MySQL version: 8.0.28
Stroke collation in: https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml
http://www.unicode.org/reports/tr35/#Element_rules
https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/ldml-rules.html
How to repeat
Step 1. Edit mysql-8.0.28-winx64\share\charsets\index.xml
Add a collation element (content copied from the CLDR collation file zh.xml), for example:
<charset name="utf8mb4">
<family>Unicode</family>
<description>UTF-8 Unicode</description>
<collation name="utf8mb4_stroke_ci" id="1030" type='stroke'>
<cr><![CDATA[
[import zh-u-co-private-pinyin]
...more data...
]]></cr>
</collation>
</charset>
Step 2. Restart mysql server
Step 3. Check that the collation was added successfully
mysql> SHOW COLLATION WHERE Collation = 'utf8mb4_stroke_ci';
+----------------+---------+------+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------+---------+------+---------+----------+---------+---------------+
| utf8mb4_stroke_ci | utf8 | 1030 | | | 8 | PAD SPACE |
+----------------+---------+------+---------+----------+---------+---------------+
1 row in set (0.00 sec)
Step 4. Create a database and table then insert some data
mysql> create database collation_test;
Query OK, 1 row affected (0.02 sec)
mysql> use collation_test;
Database changed
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_stroke_ci;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE member_stroke (
-> name VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_stroke_ci
-> );
Query OK, 0 rows affected (0.05 sec)
mysql> insert into member_stroke values('一'); -- character '一' means '1', stroke 1.
Query OK, 1 row affected (0.01 sec)
mysql> insert into member_stroke values('二'); -- character '二' means '2', stroke 2.
Query OK, 1 row affected (0.01 sec)
mysql> insert into member_stroke values('三'); -- character '三' means '3', stroke 3.
Query OK, 1 row affected (0.01 sec)
Step 5. Select data and order by name
mysql> select * from member_stroke order by name;
+------+
| name |
+------+
| 一 |
| 三 |
| 二 |
+------+
3 rows in set (0.00 sec)
Expected result
+------+
| name |
+------+
| 一 |
| 二 |
| 三 |
+------+
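
For comparison, the order actually returned in Step 5 is the same as plain Unicode code point order. The short Python sketch below only illustrates that observation; it does not show what MySQL does internally, and it assumes nothing beyond the three characters inserted above.

```python
# The three test characters, deliberately listed out of order.
chars = ["三", "二", "一"]

# Python compares strings by code point, so this is a plain code point sort
# with no stroke-count tailoring applied.
by_code_point = sorted(chars)

for ch in by_code_point:
    print(ch, hex(ord(ch)))
# 一 0x4e00
# 三 0x4e09
# 二 0x4e8c
```

This matches the unexpected result, which is consistent with the stroke rules in the collation definition not being applied.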
Additional information
When I use the <rules> element to define the collation, it works! But it has been deprecated in LDML since version 24 (2013-09-18).
<charset name="utf8mb4">
<family>Unicode</family>
<description>UTF-8 Unicode</description>
<collation name="utf8mb4_stroke_ci" id="1030" type='stroke' alt='short'>
<rules>
<!-- START AUTOGENERATED STROKE SHORT -->
<reset><last_non_ignorable /></reset>
<p>﷐⠁</p><!-- INDEX 1 -->
<pc>一</pc><!-- 1 -->
<p>﷐⠁</p><!-- INDEX 2 -->
<pc>二</pc><!-- 2 -->
<p>﷐⠁</p><!-- INDEX 3 -->
<pc>三</pc><!-- 3 -->
</rules>
</collation>
</charset>
mysql> select * from member_stroke order by name;
+------+
| name |
+------+
| 一 |
| 二 |
| 三 |
+------+
3 rows in set (0.00 sec)

How to change encoding on the fly in a SELECT statement?

I have a table with a column, which has cp1251_general_ci collation. I don't want to change column collation, but I want to get data in utf8 encoding.
Is there a way to select any data somehow in a way that it looks just like a data with utf8_general_ci collation?
I.e. I need something like this
SELECT CONVERT_TO_UTF8(weirdColumn) FROM weirdTable
Here's a demo table using the cp1251 encoding. I'll insert some Cyrillic characters into it.
mysql> CREATE TABLE weirdTable (weirdColumn text) ENGINE=InnoDB DEFAULT CHARSET=cp1251;
mysql> insert into weirdTable values ('ЂЃЉЌ');
mysql> select * from weirdTable;
+-------------+
| weirdColumn |
+-------------+
| ЂЃЉЌ |
+-------------+
Use MySQL's CONVERT() function to force the characters to a different encoding:
mysql> select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
Here's proof that the result has been converted to utf8. I create a table using metadata from the query result:
mysql> create table w2
as select convert(weirdColumn using utf8) as weirdColumnUtf8 from weirdTable;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> show create table w2\G
*************************** 1. row ***************************
Table: w2
Create Table: CREATE TABLE `w2` (
`weirdColumnUtf8` longtext CHARACTER SET utf8
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> select * from w2;
+-----------------+
| weirdColumnUtf8 |
+-----------------+
| ЂЃЉЌ |
+-----------------+
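
To see what the conversion does at the byte level, here is a small Python sketch. It assumes Python's cp1251 and utf-8 codecs correspond to MySQL's cp1251 and utf8 character sets for these four characters; the variable names are illustrative only.

```python
text = "ЂЃЉЌ"

# cp1251 is a single-byte encoding: one byte per Cyrillic character.
cp1251_bytes = text.encode("cp1251")

# Converting means decoding the stored bytes and re-encoding them as UTF-8,
# where each of these characters takes two bytes.
utf8_bytes = cp1251_bytes.decode("cp1251").encode("utf-8")

print(len(cp1251_bytes))          # 4
print(len(utf8_bytes))            # 8
print(utf8_bytes.decode("utf-8"))  # ЂЃЉЌ
```

The characters are unchanged; only their byte representation differs, which is why the two SELECT results above look identical.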
On my MySQL instance, utf8mb4 is the default character encoding. That's okay; it's a superset of utf8, and the utf8 encoding is enough to store these characters. However, if you use utf8 at all, I generally recommend using utf8mb4 instead; there's no reason not to.
If you change the character encoding, you cannot keep the cp1251 collation. Collations are specific to encodings. But you can use one of the collations associated with utf8 or utf8mb4. You can see the available collations for a given character encoding:
mysql> SHOW COLLATION WHERE Charset = 'utf8';
+--------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+--------------------------+---------+-----+---------+----------+---------+---------------+
...
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | PAD SPACE |
| utf8_general_mysql500_ci | utf8 | 223 | | Yes | 1 | PAD SPACE |
...

Inserting 4-byte unicode characters into MySQL/MariaDB

When attempting to insert 💩 (for example, which is a 4-byte unicode char), both MySQL (5.7) and MariaDB (10.2/10.3/10.4) give the same error:
Incorrect string value: '\xF0\x9F\x92\xA9'
The statement:
mysql> insert into bob (test) values ('💩');
Here's my database's charset/collation:
mysql> select @@collation_database;
+----------------------+
| @@collation_database |
+----------------------+
| utf8mb4_unicode_ci |
+----------------------+
1 row in set (0.00 sec)
mysql> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8mb4 |
+--------------------------+
1 row in set (0.00 sec)
The server's character set:
mysql> show global variables like '%character_set_server%'\G;
*************************** 1. row ***************************
Variable_name: character_set_server
Value: utf8mb4
The table:
create table bob ( `test` TEXT NOT NULL );
mysql> SHOW FULL COLUMNS FROM bob;
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| test | text | utf8mb4_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
1 row in set (0.00 sec)
Can anyone point me in the right direction?
Yes, as you commented, you need to use SET NAMES utf8mb4.
Your 4-byte character must pass from your client through the database connection and into a table. All of those must support utf8mb4. If any one of them does not support utf8mb4, then 4-byte characters will not be able to get through.
SET NAMES utf8mb4 makes the database session expect clients to send strings using that encoding. The default for character_set_client on MySQL 5.7 is utf8, so you need to set it to utf8mb4.
In MySQL 8.0.1 and later, the default character_set_client is utf8mb4 already, so you won't need to change it.
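
As a quick illustration of why utf8 is not enough here, this Python sketch counts the UTF-8 bytes per character. The helper names utf8_len and fits_utf8mb3 are hypothetical, and the 3-byte limit is the documented maximum for MySQL's utf8 (utf8mb3) character set.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(utf8_len(ch) <= 3 for ch in text)


poo = "\U0001F4A9"  # the character from the question

print(poo.encode("utf-8").hex())     # f09f92a9
print(utf8_len(poo))                 # 4
print(utf8_len("é"), utf8_len("€"))  # 2 3
print(fits_utf8mb3("héllo €"))       # True
print(fits_utf8mb3("hi " + poo))     # False
```

The four bytes printed first are the same ones quoted in the error message, so the data itself is valid; it is the session's declared client character set that cannot carry them.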

MySQL Incorrect string value error

In a Django application with MySQL DB back-end users try to insert notes which contain some smileys and hearts and stuff which are Unicode characters. MySQL refuses the operations with an error:
(1366, "Incorrect string value: '\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F' for column 'note' at row 1")
(The column in question has the longtext type. The Unicode characters in this case are valid: a heart and a modifier, https://codepoints.net/U+2764 and https://codepoints.net/U+FE0F, so neither of them is a 4-byte UTF-8 character. I made sure that MySQL's default character set is utf-8.)
What is interesting is that I cannot fully reproduce this error on my local developer environment. One particular difference is that it only emits a warning for that anomaly.
Update1:
This is still bothering me:
mysql> SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name="sblive";
+----------------------------+
| default_character_set_name |
+----------------------------+
| latin1 |
+----------------------------+
1 row in set (0.00 sec)
I converted the specific table's charset to utf-8:
mysql> alter table uploads_uploads convert to character set utf8 COLLATE utf8_general_ci;
Query OK, 1209036 rows affected (1 min 10.31 sec)
Records: 1209036 Duplicates: 0 Warnings: 0
mysql> SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "sblive" AND table_name = "uploads_uploads" AND column_name = "note";
+--------------------+
| character_set_name |
+--------------------+
| utf8 |
+--------------------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE '%colla%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
You are asking for ❤️ followed by a "non-spacing" "VARIATION SELECTOR-16".
Your bytes are utf8 -- good
Your connection needs to specify utf8 -- does it?
Your TEXT column needs to be declared CHARACTER SET utf8 -- is it? Use SHOW CREATE TABLE to verify.
If you are using HTML, it needs to say charset=UTF-8 -- does it?
Suggest you switch to utf8mb4 if the 'back-end users' are likely to enter more emoticons -- the 'Emoji' will need it.
Addenda
Let's check the data... Please run this
SELECT col, HEX(col) FROM ...
Those two characters should deliver hex E29DA4 and EFB88F. If you see C3A2C29DC2A4C3AFC2B8C28F, you have "double encoding", which is a messier problem. 2764FE0F would indicate utf16, I think.
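
The three hex strings can be reproduced with a short Python sketch. It assumes "double encoding" means UTF-8 bytes that were read as latin1 and then encoded to UTF-8 a second time; Python's latin-1 codec stands in for MySQL's latin1 here.

```python
heart = "\u2764\ufe0f"  # HEAVY BLACK HEART + VARIATION SELECTOR-16

# Correctly stored UTF-8.
correct = heart.encode("utf-8").hex().upper()

# Double encoding: the UTF-8 bytes are misread as latin1 characters,
# and those characters are then encoded as UTF-8 again.
double = heart.encode("utf-8").decode("latin-1").encode("utf-8").hex().upper()

# UTF-16 (big-endian, no byte order mark).
utf16 = heart.encode("utf-16-be").hex().upper()

print(correct)  # E29DA4EFB88F
print(double)   # C3A2C29DC2A4C3AFC2B8C28F
print(utf16)    # 2764FE0F
```

Comparing the output of HEX(col) against these three values tells you which of the cases you are in.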

MySQL UTF8 varchar column size

The MySQL documentation says that since 5.0, VARCHAR lengths are measured in characters, not bytes. However, I recently got data-truncation warnings when inserting values that should have fit into their VARCHAR column.
I replicated this issue with a simple table in v5.1:
mysql> show create table test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`string` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
I then inserted several 10-character values containing varying numbers of multi-byte UTF-8 characters:
mysql> insert into test (string) values
-> ('abcdefghij'),
-> ('ãáéíçãáéíç'),
-> ('ãáéíç67890'),
-> ('éíç4567890'),
-> ('íç34567890');
Query OK, 5 rows affected, 4 warnings (0.06 sec)
Records: 5 Duplicates: 0 Warnings: 4
mysql> show warnings;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'string' at row 2 |
| Warning | 1265 | Data truncated for column 'string' at row 3 |
| Warning | 1265 | Data truncated for column 'string' at row 4 |
| Warning | 1265 | Data truncated for column 'string' at row 5 |
+---------+------+---------------------------------------------+
mysql> select * from test;
+------------+
| string |
+------------+
| abcdefghij |
| ãáéíç |
| ãáéíç |
| éíç4567 |
| íç345678 |
+------------+
5 rows in set (0.00 sec)
I think that this shows that the varchar size is still defined in bytes or at least, is not accurate in character units.
The question is, am I understanding the documentation correctly and is this a bug? Or am I misinterpreting the documentation?
It's true that VARCHAR and CHAR sizes are considered in characters, not bytes.
I was able to recreate your issue when I set my connection character set to latin1 (single byte).
Ensure that you set your connection character set to UTF8 prior to running the insertion query with the following command:
SET NAMES utf8
If you don't do this, a two-byte UTF8 character will get sent as two single-byte characters.
You might consider changing your default client character set.
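
The truncated rows in the question can be reproduced with a rough Python simulation of that effect. This is a sketch, not MySQL's actual code path: it assumes the client sends UTF-8 bytes over a connection declared as latin1, so each byte is counted as one character against the VARCHAR(10) limit. The function name simulate_latin1_connection is hypothetical.

```python
import unicodedata


def simulate_latin1_connection(text: str, max_chars: int) -> str:
    """Show what survives when UTF-8 bytes are counted as latin1 characters."""
    # Use precomposed characters so each accented letter is one code point.
    utf8_bytes = unicodedata.normalize("NFC", text).encode("utf-8")
    # The server sees one latin1 character per byte.
    as_latin1_chars = utf8_bytes.decode("latin-1")
    # The column keeps only the first max_chars of those characters.
    truncated = as_latin1_chars[:max_chars]
    # The client reads the bytes back and displays them as UTF-8 again.
    return truncated.encode("latin-1").decode("utf-8", errors="replace")


rows = ["abcdefghij", "ãáéíçãáéíç", "ãáéíç67890", "éíç4567890", "íç34567890"]
for row in rows:
    print(simulate_latin1_connection(row, 10))
```

Each two-byte character uses up two of the ten available positions, which is why the values with more accented letters lose more of their tail.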