Is "VARCHAR(255) CHARACTER SET utf8" 255 bytes or 255 characters - mysql

I've declared a field in my INNODB/MySQL table as
VARCHAR(255) CHARACTER SET utf8 NOT NULL
however, when inserting, my data is truncated at 255 bytes, not characters. This
might chop the trailing two-byte code point in two, leaving an invalid character.
Any ideas what I might be doing wrong?
EDIT:
A sample session is like this
mysql> update channel set comment="ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬x" where id = 1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 0 Warnings: 1
mysql> select id, channelName, comment from channel;
+----+-------------+------------------------------------------------------------------------------------------
| id | channelName | comment |
+----+-------------+-----------------------------------------------------------------------------------------
| 1 | foo | ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩ�� |
+----+-------------+-----------------------------------------------------------------------------------------
1 row in set (0.00 sec)
Via mysql-admin I can see that the comment field is indeed VARCHAR(255) and uses "UTF-8 Unicode".
From the command
show full columns from channel
I get
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| channelName | varchar(255) | utf8_general_ci | NO | | NULL | | select,insert,update,references | |
| comment | varchar(255) | utf8_general_ci | NO | | NULL | | select,insert,update,references | |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
mysql> SHOW VARIABLES LIKE 'character_set%'
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

According to the manual, you should be fine:
MySQL interprets length specifications in character column definitions in character units. (Before MySQL 4.1, column lengths were interpreted in bytes.) This applies to CHAR, VARCHAR, and the TEXT types.
Do you happen to be using a pre-4.1 version of MySQL?

This is a stab in the dark, but are you using UTF-8 as the connection and client character sets? Issue SHOW VARIABLES LIKE 'character_set%' and see whether it tells you UTF-8 or latin-1.
Perhaps if you are using the wrong connection/client character sets, the UTF-8 bytes are reinterpreted as single-byte characters and stored that way in the database.
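
The reinterpretation described above can be reproduced outside MySQL. The following is a minimal sketch, assuming Python's latin-1 codec as a stand-in for MySQL's latin1 connection charset (the two differ for a few byte values) and a hypothetical helper named store_via_latin1. It only models how characters get counted, not the server's actual conversion code.

```python
def store_via_latin1(text: str, limit: int = 255) -> bytes:
    """Model a UTF-8 terminal talking over a latin1 connection (hypothetical helper).

    Returns the original UTF-8 bytes that survive a VARCHAR(limit) truncation.
    """
    raw = text.encode("utf-8")        # bytes the terminal actually sends
    as_chars = raw.decode("latin-1")  # the server sees one character per byte
    kept = as_chars[:limit]           # VARCHAR(limit) counts those characters
    return kept.encode("latin-1")     # map back to the original byte values


# Ten runes, one ASCII space, then a hundred more runes (3 bytes per rune).
sample = "\u16a0" * 10 + " " + "\u16a0" * 100
stored = store_via_latin1(sample)

print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes sent")
print(len(stored), "bytes kept")
try:
    stored.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cut in the middle of a code point:", exc.reason)
```

Under these assumptions the limit of 255 "characters" is really a limit of 255 bytes, and the single-byte space shifts the cut so that it lands inside a three-byte rune, which matches the broken trailing character in the question.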

Related

Saving a set of strings using stylized fonts to MariaDB with strange results

I currently have MariaDB version 10.4.18 on CentOS 8.0. When I'm trying to save a string with stylized fonts like below,
𝘈𝘴𝘵𝘳𝘪 𝘈𝘯𝘢𝘯𝘵𝘢
MariaDB saved them as "??? ????"
The statement
mysql> insert into testings(test) values ('𝘈𝘴𝘵𝘳𝘪 𝘈𝘯𝘢𝘯𝘵𝘢');
Here is my database's charset and collation
mysql> select @@collation_database;
+----------------------+
| @@collation_database |
+----------------------+
| utf8mb4_unicode_ci |
+----------------------+
1 row in set (0.00 sec)
mysql> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8mb4 |
+--------------------------+
The table
mysql> SHOW FULL COLUMNS FROM testings;
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| test | text | utf8mb4_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------+------+--------------------+------+-----+---------+-------+---------------------------------+---------+
Can anyone point me in the right direction?
As answered by @Akina, I edited my database config with the parameters below:
SET collation_connection = 'utf8mb4_unicode_ci';
SET character_set_client = 'utf8mb4';
SET character_set_results = 'utf8mb4';
SET character_set_system = 'utf8mb4';
Now it works!
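
To see why the connection settings matter for this particular string, it helps to check which characters fall outside the Basic Multilingual Plane. This is a small sketch with a hypothetical helper name, supplementary_chars; it assumes nothing about MariaDB itself and only inspects the text.

```python
from typing import List


def supplementary_chars(text: str) -> List[str]:
    """Return the characters that lie outside the Basic Multilingual Plane."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


styled_a = "\U0001D608"  # MATHEMATICAL SANS-SERIF ITALIC CAPITAL A
for ch in supplementary_chars("A" + styled_a):
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes in UTF-8")
```

Every character this helper reports needs four bytes in UTF-8, so each charset between client and column has to be utf8mb4 for it to arrive intact.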

Inserting UTF8 data into SJIS DB (MySQL)

I am working on a web app (JSP) that inserts data from a web form into a MySQL database; the data is sent to the servlet as parameters encoded in UTF-8. The application works perfectly with normal letters, and with symbols to a certain extent. But if I try to insert any 4-byte character, it is replaced by a question mark (?).
I am pretty sure the problem has something to do with MySQL's weird way of limiting utf8 to 3 bytes, but this time the collation is SJIS.
I must be overlooking something, so I would appreciate any help; I have been banging my head against the wall over this for a day.
As for collation information, I have tried multiple different settings and the result is always the same: everything works fine except the 4-byte characters.
This is the default collation:
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%'
OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
I have also tried the following:
+--------------------------+------------------+
| Variable_name | Value |
+--------------------------+------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | sjis |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | sjis_japanese_ci |
+--------------------------+------------------+
Example of a table I am inserting into (Z column):
show FULL COLUMNS FROM XYZ;
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
| X | int(10) unsigned | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| Y | date | NULL | YES | | NULL | | select,insert,update,references | |
| Z | varchar(255) | sjis_japanese_ci | YES | | NULL | | select,insert,update,references | |
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
Inside the Java class, the encoding is set as follows:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("SHIFT_JIS");
I know the DB can hold these characters, because previously imported (LOAD DATA INFILE) data contains them and they are visible in the DB (not as question marks).
So, friends, I ask for your help with this. It is probably something very easy (or impossible); if you need more information I can get it from the DB/source.
An example of a 4-byte UTF-8 character (might not be visible in your browser):
𠜎
or :) https://codepoints.net/U+1F4A9
Thank you very much!
I tried absolutely everything to make this work with SJIS but didn't succeed. I fixed the situation by altering all the tables to utf8mb4:
ALTER TABLE xxx CONVERT TO CHARACTER SET utf8mb4;
and changing the encoding to UTF-8 all the way through:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
Stay away from SJIS if possible.
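
The question-mark substitution can be illustrated with a short sketch. It assumes Python's shift_jis codec as a rough stand-in for MySQL's sjis charset (their coverage is not identical) and uses a hypothetical helper, encode_like_column; a character with no mapping in the target charset comes out as "?".

```python
def encode_like_column(text: str, charset: str) -> bytes:
    """Encode text for a target charset, writing '?' for unmappable characters."""
    return text.encode(charset, errors="replace")


pile = "\U0001F4A9"  # the code point from the codepoints.net link in the question
print(encode_like_column("abc" + pile, "shift_jis"))
print(encode_like_column("abc" + pile, "utf-8").hex())
```

With a target charset that can represent the character, such as UTF-8 here or utf8mb4 in MySQL, all four bytes survive instead of being replaced.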

MySQL can select a literal emoji but won't store emoji in a table

I am trying to insert some emoji characters into a table in MySQL, but values are stored as question marks (????).
I made sure to create the database with the proper utf8mb4 encoding:
mysql> describe users;
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(191) | YES | | NULL | |
+-------+--------------+------+-----+---------+----------------+
2 rows in set (0.01 sec)
Then I checked whether MySQL understands emoji at all, so I did this:
mysql> select '🌰';
+------+
| 🌰 |
+------+
| 🌰 |
+------+
1 row in set (0.00 sec)
Then I did this:
mysql> insert into users (name) values ('🌰');
Query OK, 1 row affected, 1 warning (0.05 sec)
mysql> select * from users;
+----+------------+
| id | name |
+----+------------+
| 21 | فاضل |
| 30 | سلاحف |
| 46 | ???? |
| 47 | ???? |
| 48 | ???? |
| 49 | ???? |
+----+------------+
6 rows in set (0.01 sec)
I don't know what to do to fix that.
** EDIT ** : as requested in the comments, I ran the following command:
mysql> SHOW VARIABLES LIKE 'character%';
+--------------------------+-------------------------+
| Variable_name | Value |
+--------------------------+-------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /static/share/charsets/ |
+--------------------------+-------------------------+
8 rows in set (0.00 sec)
Your connection is set up for utf8; it needs to be set up for utf8mb4.
How did you set it? Change to whichever of these applies.
SET NAMES utf8mb4
PDO(... charset=utf8mb4)
mysqli::set_charset('utf8mb4')
etc
Emoji are 4-byte UTF-8 sequences, hence the four question marks.
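
The byte counts behind that remark can be checked directly. This is a minimal sketch with a hypothetical helper, utf8_byte_lengths; it relies on the fact that MySQL's utf8 charset stores at most three bytes per character, which is why the Arabic names in the table survive while the emoji does not.

```python
from typing import List


def utf8_byte_lengths(text: str) -> List[int]:
    """Return the UTF-8 encoded length, in bytes, of each character."""
    return [len(ch.encode("utf-8")) for ch in text]


arabic_fa = "\u0641"     # ARABIC LETTER FEH, as in the names already stored
chestnut = "\U0001F330"  # the chestnut emoji from the question
print(utf8_byte_lengths(arabic_fa + chestnut + "a"))
```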

Some confusing phenomena when inserting emoji characters into a MySQL table

When inserting emoji characters in the mysql interactive client, I found some behavior very confusing. I hope someone can clear it up. See below:
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
CREATE TABLE `t` (
`data` varchar(100) CHARACTER SET utf8mb4 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> insert into t select '\U+1F600';
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1
mysql> set names utf8mb4;
mysql> insert into t select '\U+1F600';
Query OK, 1 row affected (0.00 sec)
mysql> select * from t;
+------+
| data |
+------+
| 😀 |
+------+
mysql> select data, hex(data) from t;
+------+-----------+
| data | hex(data) |
+------+-----------+
| 😀 | F09F9880 |
+------+-----------+
Why do I need to execute set names utf8mb4 explicitly? From the error message, it seems the data content was resolved to four bytes (f0 9f 98 80) successfully. Why does the insert still fail?
Below is another puzzle for me.
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> insert into t select '\U+1F600';
Query OK, 1 row affected (0.01 sec)
mysql> select data,hex(data) from t;
+------+--------------------+
| data | hex(data) |
+------+--------------------+
| 😀 | C3B0C5B8CB9CE282AC |
+------+--------------------+
I have to say I am a little shocked by this. As I understood it, only utf8mb4 supports emoji characters, but now latin1 appears to support them too.
Can anybody clear this up for me? Thanks!
You can insert UTF-8 data into a latin1 table, but MySQL won't treat the byte stream as UTF-8 characters, so you won't be able to query against it reliably, for example. If your application understands the UTF-8 byte stream, it will look like it's working OK. But the table charset really needs to be utf8 (or utf8mb4) if MySQL is to understand those bytes as Unicode characters.
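
The hex value C3B0C5B8CB9CE282AC from the second experiment can be reproduced by hand. The sketch below assumes Python's cp1252 codec is close enough to MySQL's latin1 for these four bytes, and the helper names double_encode and undo_double_encode are hypothetical. Each of the emoji's four bytes is read as a separate latin1 character and then stored as UTF-8.

```python
def double_encode(utf8_bytes: bytes) -> bytes:
    """Treat each UTF-8 byte as a cp1252 character, then store it as UTF-8."""
    return utf8_bytes.decode("cp1252").encode("utf-8")


def undo_double_encode(stored: bytes) -> str:
    """Reverse the mix-up: what a latin1 client gets back and renders as UTF-8."""
    return stored.decode("utf-8").encode("cp1252").decode("utf-8")


emoji_bytes = bytes.fromhex("F09F9880")  # the bytes shown in the error message
stored = double_encode(emoji_bytes)
print(stored.hex().upper())
print(undo_double_encode(stored) == "\U0001F600")
```

So the column holds four unrelated characters rather than one emoji. It only looks right because the same latin1 connection reverses the conversion on the way out and the terminal renders the resulting bytes as UTF-8.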

ERROR 1062 (23000): Duplicate entry '?' for key 'PRIMARY' with two different entries

I am trying to import a table from SQLite into MySQL that contains a lot of Japanese kanji characters.
The table I am trying to insert the data into looks like this:
+--------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+-------+
| literal | char(10) | NO | PRI | NULL | |
| grade | int(11) | YES | | NULL | |
| stroke_count | int(11) | YES | | NULL | |
| freq | int(11) | YES | | NULL | |
| jlpt | int(11) | YES | | NULL | |
+--------------+----------+------+-----+---------+-------+
When I try
INSERT INTO main VALUES('𠂉',NULL,2,NULL,NULL);
I get the following error:
mysql>
ERROR 1062 (23000): Duplicate entry '?' for key 'PRIMARY'
And if I try to look up that entry, I get:
select * from main where literal = '𠂉';
+---------+-------+--------------+------+------+
| literal | grade | stroke_count | freq | jlpt |
+---------+-------+--------------+------+------+
| 𠀋 | NULL | 4 | NULL | NULL |
+---------+-------+--------------+------+------+
1 row in set (0.00 sec)
Why does looking up '𠂉' return '𠀋'?
I thought it might be related to the UTF-8 encoding, so I reconfigured the whole DB and all tables to utf8mb4, following the instructions at this link.
Here is the MySQL configuration:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
After that nothing changes... Any ideas?
Thanks
best regards
Depending on the collation, those two characters might be treated as equivalent.
You could try another collation, utf8mb4_bin, but then you have to take care of lower-casing all the values in your application code to keep the primary key case-insensitive.
Alternatively, you could look up the characters from your example in this database (I cannot post more than 2 links, sorry):
http://codepoints.net/
Their Unicode code points are:
U+20089
U+2000B
Check here for the standard collation maps: http://www.unicode.org/charts/uca/
I could not find those two characters in any Unicode collation maps, but there are many cases with Latin characters with diacritics (e.g. 'Ç' and 'C'), which are defined as equivalents in the utf8 case-insensitive collation maps.
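
The code points quoted above can be confirmed with a short sketch; the helper name code_points is hypothetical. Both characters lie outside the Basic Multilingual Plane. As far as I recall, the MySQL documentation says its older Unicode collations such as utf8mb4_unicode_ci give all such supplementary characters the same weight, which would explain the duplicate key, but treat that as an assumption to verify against the manual for your version.

```python
from typing import List


def code_points(text: str) -> List[str]:
    """Return the U+XXXX notation for each character in the text."""
    return [f"U+{ord(ch):04X}" for ch in text]


inserted = "\U00020089"  # the character in the INSERT statement
returned = "\U0002000B"  # the character the SELECT returned
print(code_points(inserted + returned))
print(all(ord(ch) > 0xFFFF for ch in inserted + returned))
```

A binary collation such as utf8mb4_bin compares the encoded values themselves, so the two characters would be distinct primary keys there.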