Inserting UTF-8 data into an SJIS DB (MySQL)

I am working on a web app (JSP) that inserts data from a web form into a MySQL database. The data is sent to the servlet as parameters encoded in UTF-8. The application works perfectly with ordinary letters and, up to a point, with symbols. But if I try to insert any 4-byte character, it is replaced by a question mark (?).
I am pretty sure the problem has something to do with MySQL's odd treatment of utf8 as a 3-byte-only encoding, but in this case the collation is SJIS.
I must be overlooking something, so I would appreciate any help; I have been banging my head against the wall over this for a day.
As for collation, I have tried several different settings and the result is always the same: everything works fine except the 4-byte characters.
These are the default settings:
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%'
OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name | Value |
+--------------------------+-------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | latin1_swedish_ci |
+--------------------------+-------------------+
I have also tried the following:
+--------------------------+------------------+
| Variable_name | Value |
+--------------------------+------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | sjis |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | sjis |
| character_set_system | utf8 |
| collation_connection | utf8_general_ci |
| collation_database | sjis_japanese_ci |
| collation_server | sjis_japanese_ci |
+--------------------------+------------------+
An example of a table I am inserting into (column Z):
SHOW FULL COLUMNS FROM XYZ;
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
| X | int(10) unsigned | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| Y | date | NULL | YES | | NULL | | select,insert,update,references | |
| Z | varchar(255) | sjis_japanese_ci | YES | | NULL | | select,insert,update,references | |
+--------+------------------+------------------+------+-----+---------+----------------+---------------------------------+---------+
Inside the Java class, the encodings are set as follows:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("SHIFT_JIS");
I know the DB can hold these characters, because previously imported data (LOAD DATA INFILE) contains them and they are visible in the DB (not as question marks).
So, friends, I ask for your help with this. It is probably something very easy (or impossible). If you need more information, I can get it from the DB or the source.
An example of a 4-byte UTF-8 character (it might not be visible in your browser):
𠜎
or :) https://codepoints.net/U+1F4A9
Thank you very much!

I tried absolutely everything to make this work with SJIS but did not succeed. I fixed the situation by converting all the tables to utf8mb4:
ALTER TABLE xxx CONVERT TO CHARACTER SET utf8mb4;
and by changing the encoding to UTF-8 all the way through:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
Stay away from SJIS if possible.
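To see in advance which form values are affected, a minimal Java sketch could flag strings that contain supplementary code points (the ones that take 4 bytes in UTF-8) and ask a charset whether it can encode them. The class and method names are hypothetical and not part of the application described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
class FourByteCheck {

    // True if the string contains a code point above U+FFFF,
    // i.e. one that UTF-8 encodes as 4 bytes.
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the named charset can represent the whole string.
    static boolean canEncode(String charsetName, String s) {
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // U+1F4A9, the second example character from the question.
        String sample = new String(Character.toChars(0x1F4A9));
        System.out.println(hasSupplementary(sample));
        System.out.println(utf8Length(sample));
        // Assumption: this JVM provides the Shift_JIS charset.
        System.out.println(canEncode("Shift_JIS", sample));
    }
}
```

Java's Shift_JIS table is not guaranteed to match MySQL's sjis character set exactly, so the result of canEncode is a hint rather than proof of what the server will accept.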

Related

MySQL query shows something like 0x8081 instead of special characters

When I run SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
I get 0x80818283848586878889 instead of Çüéâäàåçêë.
Does anybody know why?
I'm using charset utf8mb4.
When I run show variables like '%char%'; I get
+--------------------------------------+--------------------------------+
| Variable_name | Value |
+--------------------------------------+--------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/share/mysql-8.0/charsets/ |
| validate_password.special_char_count | 1 |
+--------------------------------------+--------------------------------+
I'm using MySQL version 8.0.26
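The characters you expect come from the "DOS West European" character set (cp850). CHAR() returns a binary string unless a USING clause is given, and the mysql client can display binary strings in hexadecimal.
This query returns the text in cp850:
SELECT CHAR(128,129,130,131,132,133,134,135,136,137 USING cp850);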
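The same byte-to-character mapping can be checked outside MySQL. A small Java sketch, assuming the JVM provides the IBM850 charset (Java's name for cp850), decodes the ten byte values from the question. The class name is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper; decodes raw byte values using a named charset.
class Cp850Demo {

    static String decode(int[] values, String charsetName) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i]; // keep the low 8 bits of each value
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        int[] values = {128, 129, 130, 131, 132, 133, 134, 135, 136, 137};
        // Assumption: IBM850 is available in this JVM.
        System.out.println(decode(values, "IBM850"));
    }
}
```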

French characters not showing properly

I have a weird issue. I have an embedded system running Linux, Qt 4.8 (for the touch screen), and MySQL 5.7.8. My database has some entries that contain French accented characters such as "é", "à", "À", etc.
I start the mysqld process first, then start my Qt application, which opens the database.
This gives me corrupted characters on the screen. For example, "É" becomes "Ã%".
If I restart mysqld, all the French characters are displayed properly on my screen.
So it seems my Qt application can handle French characters, but I need to restart mysqld to make that happen!
I wanted to start my Qt app first and mysqld afterwards, but the app needs access to the database, and because the database is not started yet, the app gives an error.
Any clue why?
UPDATE 20170524
Here is my database info:
mysql> show variables like "collation_database";
+--------------------+-------------------+
| Variable_name | Value |
+--------------------+-------------------+
| collation_database | latin1_swedish_ci |
+--------------------+-------------------+
mysql> show variables like "%character%"; show variables like "%collation%";
+--------------------------+-------------------------------------------------+
| Variable_name | Value |
+--------------------------+-------------------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /mnt/data/part1/usr/local/mysql/share/charsets/ |
+--------------------------+-------------------------------------------------+
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
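One common way to get output like this is for UTF-8 bytes to be decoded with a single-byte charset somewhere between the server and the screen. Whether that is what happens on this system is an assumption. A minimal Java sketch of the effect, with a hypothetical class name:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 text being misread as a single-byte charset.
class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another charset.
    static String misread(String text, String wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // "\u00C9" is É; its UTF-8 form is the two bytes C3 89.
        System.out.println(misread("\u00C9", "windows-1252"));
    }
}
```

If the two-character result resembles what the screen shows, comparing the character_set_client, character_set_connection, and character_set_results values before and after the mysqld restart would be a reasonable next step.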

ERROR 1062 (23000): Duplicate entry '?' for key 'PRIMARY' with two different entries

I am trying to import a table from SQLite into MySQL that contains a lot of Japanese kanji characters.
The table I am trying to insert the data into looks like this:
+--------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+-------+
| literal | char(10) | NO | PRI | NULL | |
| grade | int(11) | YES | | NULL | |
| stroke_count | int(11) | YES | | NULL | |
| freq | int(11) | YES | | NULL | |
| jlpt | int(11) | YES | | NULL | |
+--------------+----------+------+-----+---------+-------+
When I try
INSERT INTO main VALUES('𠂉',NULL,2,NULL,NULL);
I get the following error:
mysql>
ERROR 1062 (23000): Duplicate entry '?' for key 'PRIMARY'
And if I try to look up that entry, I get:
select * from main where literal = '𠂉';
+---------+-------+--------------+------+------+
| literal | grade | stroke_count | freq | jlpt |
+---------+-------+--------------+------+------+
| 𠀋 | NULL | 4 | NULL | NULL |
+---------+-------+--------------+------+------+
1 row in set (0.00 sec)
Why does looking up '𠂉' return '𠀋'?
I thought it might be related to the UTF-8 encoding, so I reconfigured the whole DB and all tables to utf8mb4, following the instructions in this link.
Here is the MySQL configuration:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
After that, nothing changes... any ideas?
Thanks
best regards
Depending on the collation, those two characters might be treated as equivalent.
You could try another collation, utf8mb4_bin, but then you have to take care of lower-casing all the values in your application code to keep the primary key case-insensitive.
Alternatively, you could look up the characters from your example in this database (I cannot post more than 2 links, sorry):
http://codepoints.net/
Their Unicode code points are:
U+20089
U+2000B
Check here for the standard collation maps: http://www.unicode.org/charts/uca/
I could not find those two characters in any Unicode collation maps, but there are many cases with Latin characters with diacritics (e.g. 'Ç' and 'C'), which are defined as equivalents in the utf8 case-insensitive collation maps.
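To confirm on the application side that the two literals really are different characters, it can help to print their code points before the values reach MySQL. A minimal Java sketch with a hypothetical class name, using the two code points listed above:

```java
import java.util.stream.Collectors;

// Hypothetical helper that shows which code points a string contains.
class CodePointDump {

    // Formats every code point of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String a = new String(Character.toChars(0x20089)); // value being inserted
        String b = new String(Character.toChars(0x2000B)); // value already in the table
        System.out.println(dump(a));
        System.out.println(dump(b));
        // An exact comparison treats them as different strings.
        System.out.println(a.equals(b));
    }
}
```

If the exact comparison reports a difference while MySQL reports a duplicate key, the collation is the likely place where the two values are being merged.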

MySQL database not rendering German characters correctly

Update:
It turns out this is not directly related to the DB server itself but to the client's encoding. If the client uses utf8, the German characters are rendered incorrectly; if the client uses cp850, they are rendered correctly. But I need to use utf8, since the app may have to deal with other classes of characters. What should I do?
Original:
I have two database servers, viewed from the same mysql client. server1 renders the German characters correctly; server2 does not. The differences are shown below. I'm baffled, since server2 uses utf8 in more places. What could be the cause of this?
server1's encodings:
mysql> SHOW VARIABLES LIKE "character\_set\_database";
+------------------------+--------+
| Variable_name | Value |
+------------------------+--------+
| character_set_database | latin1 |
+------------------------+--------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | cp850 |
| character_set_connection | cp850 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | cp850 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
server2's encodings:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.01 sec)
mysql> SHOW VARIABLES LIKE "character\_set\_database";
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| character_set_database | utf8 |
+------------------------+-------+
1 row in set (0.00 sec)

Is "VARCHAR(255) CHARACTER SET utf8" 255 bytes or 255 characters

I've declared a field in my INNODB/MySQL table as
VARCHAR(255) CHARACTER SET utf8 NOT NULL
However, when inserting, my data is truncated at 255 bytes, not characters. This
might chop a trailing two-byte code point in two, leaving an invalid character.
Any ideas what I might be doing wrong?
EDIT:
A sample session looks like this:
mysql> update channel set comment="ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬x" where id = 1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 0 Warnings: 1
mysql> select id, channelName, comment from channel;
+----+-------------+------------------------------------------------------------------------------------------
| id | channelName | comment |
+----+-------------+-----------------------------------------------------------------------------------------
| 1 | foo | ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩ�� |
+----+-------------+-----------------------------------------------------------------------------------------
1 row in set (0.00 sec)
Via mysql-admin I looked at the comment field and saw that it is indeed VARCHAR(255) and uses "UTF-8 Unicode".
From the command
show full columns from channel
I get
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| channelName | varchar(255) | utf8_general_ci | NO | | NULL | | select,insert,update,references | |
| comment | varchar(255) | utf8_general_ci | NO | | NULL | | select,insert,update,references | |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
mysql> SHOW VARIABLES LIKE 'character_set%'
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
According to the manual, you should be fine:
MySQL interprets length specifications in character column definitions in character units. (Before MySQL 4.1, column lengths were interpreted in bytes.) This applies to CHAR, VARCHAR, and the TEXT types.
Do you happen to be using a pre-4.1 version of MySQL?
This is a stab in the dark, but are you using UTF-8 as the connection and client character sets? Issue SHOW VARIABLES LIKE 'character_set%' and see whether it reports utf8 or latin1.
If you are using the wrong connection/client character sets, the UTF-8 bytes may be reinterpreted as single-byte characters and stored that way in the database.
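The arithmetic behind that guess can be checked on the client side. Each runic letter in the sample takes 3 bytes in UTF-8, so if every byte is counted as one latin1 character, a 255-character limit is reached after 255 bytes. A minimal Java sketch, with hypothetical names:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing character counts with UTF-8 byte counts.
class LengthCheck {

    // Number of Unicode code points in the string.
    static int charLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if cutting the UTF-8 form after `limit` bytes would split a character.
    static boolean cutSplitsCharacter(String s, int limit) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        // A byte of the form 10xxxxxx is a continuation byte, so a cut
        // just before it lands inside a multi-byte sequence.
        return limit < b.length && (b[limit] & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(" "); // one ASCII space, then runes
        for (int i = 0; i < 100; i++) {
            sb.append('\u16A0'); // runic letter, 3 bytes in UTF-8
        }
        String text = sb.toString();
        System.out.println(charLength(text));
        System.out.println(utf8Bytes(text));
        System.out.println(cutSplitsCharacter(text, 255));
    }
}
```

A text whose character count is well under 255 but whose UTF-8 byte count is over 255 would be consistent with a latin1 connection truncating by bytes.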