I have a CSV file containing some characters that lie outside the Unicode BMP, for example the character 🀀 (U+1F000). They are SMP characters, so they need to be stored with the utf8mb4 character set and utf8mb4_general_ci collation in MySQL, instead of the utf8 character set and utf8_general_ci collation.
So here are my SQL queries.
MariaDB [tweets]> set names 'utf8mb4';
Query OK, 0 rows affected (0.01 sec)
MariaDB [tweets]> create table test (a text) collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.06 sec)
MariaDB [tweets]> insert into test (a) values ('🀀');
Query OK, 1 row affected (0.03 sec)
MariaDB [tweets]> select * from test;
+------+
| a |
+------+
| 🀀 |
+------+
1 row in set (0.00 sec)
No warnings. Everything is right. Now I want to load that CSV file. For the test, the file has only one line.
MariaDB [tweets]> load data local infile 't.csv' into table wzyboy character set utf8mb4 fields terminated by ',' enclosed by '"' lines terminated by '\n\n' (tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,timestamp,source,text,expanded_urls);
Query OK, 1 row affected, 7 warnings (0.01 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 7
The warning message is:
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x80\x80' for column 'text' at row 1 |
All my working environments (OS, terminal, etc.) use UTF-8. I have specified utf8mb4 everywhere I could think of, and a manual INSERT INTO works just fine. However, when I use LOAD DATA INFILE [...] CHARACTER SET utf8mb4 [...] it fails with "Incorrect string value" warnings.
Problem solved. It was my mistake. During the experiment I only ran TRUNCATE TABLE and never re-created the table, so while the database and the table defaults were both utf8mb4, the columns were still utf8...
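The failure mode is easy to reproduce outside MySQL. The byte sequence in warning 1366 is exactly the UTF-8 encoding of U+1F000, and it is 4 bytes long, one more than the legacy utf8 charset allows per character. A quick check in Python:

```python
# U+1F000 MAHJONG TILE EAST WIND -- the character from warning 1366
ch = "\U0001F000"

utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)       # b'\xf0\x9f\x80\x80' -- matches the bytes in the warning
print(len(utf8_bytes))  # 4 -- MySQL's legacy utf8 stores at most 3 bytes per character
```

Any column still declared utf8 will therefore reject this value no matter what SET NAMES or CHARACTER SET clauses say, which is why re-creating the table with utf8mb4 columns fixes it.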
Related
I'm trying to update a mariadb table column, to a string that contains a literal backslash.
I want the resulting string in the table to be
4.4 \(blah blah\)
I've tried
UPDATE table SET string = '4.4 \\(blah blah\\)' WHERE string = '4.4 (blah blah)';
This works when I run it in Sequel Pro, but when I run it as part of a ruby/rails migration, the result is that the column remains unchanged, ie. 4.4 (blah blah).
I've tried every combination of single quotes, double quotes, single backslash, double backslash. I also tried a triple backslash.
This is caused by the NO_BACKSLASH_ESCAPES sql_mode. From the MySQL documentation:
Enabling this mode disables the use of the backslash character (\) as an escape character within strings and identifiers. With this mode enabled, backslash becomes an ordinary character like any other, and the default escape sequence for LIKE expressions is changed so that no escape character is used.
mysql> create table my_table (
-> string varchar(255) );
Query OK, 0 rows affected (0.34 sec)
mysql>
mysql> insert into my_table values
-> ('4.4 (blah blah)');
Query OK, 1 row affected (0.07 sec)
mysql> select @@sql_mode;
+------------------------+
| @@sql_mode |
+------------------------+
| NO_ENGINE_SUBSTITUTION |
+------------------------+
1 row in set (1.318 sec)
mysql> set session sql_mode='NO_BACKSLASH_ESCAPES,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> UPDATE my_table SET string = '4.4 \(blah blah\)' WHERE string = '4.4 (blah blah)';
Query OK, 1 row affected (0.08 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from my_table;
+-------------------+
| string |
+-------------------+
| 4.4 \(blah blah\) |
+-------------------+
1 row in set (0.02 sec)
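The two modes can be contrasted outside MySQL with a simplified Python model of how the server decodes a single-quoted string literal. This is a sketch covering only the common escapes, not the server's full escape table:

```python
def decode_mysql_literal(body: str, no_backslash_escapes: bool = False) -> str:
    """Roughly mimic MySQL's decoding of a single-quoted literal body.

    With NO_BACKSLASH_ESCAPES enabled, backslash is an ordinary character.
    Otherwise '\\\\' collapses to one backslash, and for an unrecognized
    escape like '\\(' the backslash is simply dropped.
    """
    if no_backslash_escapes:
        return body
    escapes = {"n": "\n", "t": "\t", "r": "\r", "0": "\0",
               "\\": "\\", "'": "'", '"': '"'}
    out, i = [], 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(escapes.get(body[i + 1], body[i + 1]))
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)

# Default mode: doubled backslashes are needed to store a literal one.
print(decode_mysql_literal(r"4.4 \\(blah blah\\)"))      # 4.4 \(blah blah\)
# Default mode with single backslashes: they silently disappear.
print(decode_mysql_literal(r"4.4 \(blah blah\)"))        # 4.4 (blah blah)
# NO_BACKSLASH_ESCAPES: single backslashes pass through verbatim.
print(decode_mysql_literal(r"4.4 \(blah blah\)", True))  # 4.4 \(blah blah\)
```

The middle case is the trap: in the default mode a single `\(` loses its backslash, so the string arrives without backslashes and the column appears "unchanged".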
I'm trying to move to MySQL 8.0 but I'm having some problems. I have installed both MySQL 5.7 and 8.0, and they behave differently with CHAR columns.
For MySQL 5.7:
mysql> create table test (id integer, c5 char(5));
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test values(0, 'a');
Query OK, 1 row affected (0.00 sec)
mysql> select * from test where c5 = 'a ';
+------+------+
| id | c5 |
+------+------+
| 0 | a |
+------+------+
1 row in set (0.00 sec)
mysql>
For MySQL 8.0:
mysql> create table test (id integer, c5 char(5));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into test values(0, 'a');
Query OK, 1 row affected (0.01 sec)
mysql> select * from test where c5 = 'a ';
Empty set (0.00 sec)
mysql>
Both servers have the same configuration.
MySQL 5.7:
[mysqld]
port=3357
datadir=/opt/mysql_57/data
sql_mode="STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION"
default_storage_engine=innodb
character_set_server=utf8mb4
socket=/opt/mysql_57/mysql57.sock
max_allowed_packet=4194304
server_id=1
lower_case_table_names=0
MySQL 8.0:
[mysqld]
port=3380
datadir=/opt/mysql_80/data
sql_mode="STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION"
default_storage_engine=innodb
character_set_server=utf8mb4
socket=/opt/mysql_80/mysql80.sock
max_allowed_packet=4194304
server_id=1
lower_case_table_names=0
A brief look through the MySQL 8.0 changelog didn't give me any information. Where is this behavior change described?
Best regards.
How MySQL handles trailing spaces depends on the collation being used. See https://dev.mysql.com/doc/refman/8.0/en/charset-binary-collations.html for details.
What has changed between 5.7 and 8.0 is that the default character set is now utf8mb4, whose default collations are NO PAD.
If you want the other behavior, you should change the character set/collation for your column/table/database. Check the INFORMATION_SCHEMA COLLATIONS table for the available PAD SPACE collations. (One warning: the older PAD SPACE collations may be less efficient. Quite some work has been done to improve the performance of the new Unicode collations, which are based on UCA 9.0.0.)
See also PAD_CHAR_TO_FULL_LENGTH in the MySQL documentation.
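The PAD SPACE vs. NO PAD difference can be sketched in a few lines of Python; this toy comparison function (case folding omitted for brevity) mirrors why 5.7 returns the row and 8.0 returns an empty set:

```python
def chars_equal(a: str, b: str, pad_space: bool = True) -> bool:
    """Toy model of MySQL string comparison for trailing spaces.

    PAD SPACE collations (the 5.7 utf8 defaults) compare as if both
    operands were padded to equal length, i.e. trailing spaces are
    ignored. NO PAD collations (the 8.0 utf8mb4 defaults) compare
    trailing spaces like any other character.
    """
    if pad_space:
        a, b = a.rstrip(" "), b.rstrip(" ")
    return a == b

print(chars_equal("a", "a "))         # True  -- 5.7: WHERE c5 = 'a ' matches
print(chars_equal("a", "a ", False))  # False -- 8.0: Empty set
```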
I have to port some DBs to a standalone MySQL (version 5.0.18) running on Windows 7 64-bit, and I ran into a problem I am stuck with. If I try to insert any national/Unicode character into a VARCHAR I get this error:
ERROR 1406 (22001): Data too long for column 'nam' at row 1
Here MCVE SQL script:
SET NAMES utf8;
DROP TABLE IF EXISTS `tab`;
CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(1024) default '' ) DEFAULT CHARSET=utf8;
INSERT INTO `tab` VALUES (1,'motorček');
INSERT INTO `tab` VALUES (2,'motorcek');
SELECT * FROM `tab`;
And here output:
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> DROP TABLE IF EXISTS `tab`;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(1024) default '' ) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO `tab` VALUES (1,'motorček');
ERROR 1406 (22001): Data too long for column 'nam' at row 1
mysql> INSERT INTO `tab` VALUES (2,'motorcek');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `tab`;
+------+----------+
| ix | nam |
+------+----------+
| 2 | motorcek |
+------+----------+
1 row in set (0.00 sec)
As you can see, the entry with the national character č (E8h) is missing.
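The byte E8h is the crux: it is a real character in single-byte Windows code pages, but it cannot form a valid sequence at that position in UTF-8, so a connection declared as utf8 rejects the string. A quick Python check (cp1250, the Central European Windows code page, is an assumption about the terminal):

```python
raw = b"motor\xe8ek"  # the raw bytes a legacy-code-page terminal sends

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    # 0xE8 begins a 3-byte UTF-8 sequence, but 'e' (0x65) cannot continue it
    print(e)

# The same byte is a legitimate character in single-byte code pages:
print(b"\xe8".decode("cp1250"))   # 'č' -- Central European Windows code page
print(b"\xe8".decode("latin-1"))  # 'è' -- what MySQL's latin1 calls this byte
```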
I am aware of these QAs:
How to make MySQL handle UTF-8 properly
"Data too long for column" - why?
Error Code: 1406. Data too long for column - MySQL
but they do not address this problem (none of the solutions from those work here).
The problem is present even for single-character strings, no matter the size of the VARCHAR. So the only solution for now would be to change the national characters to ASCII, but that would lose information, which I would rather avoid.
I tried using various character sets utf8, ucs2, latin1 without any effect.
I tried dropping STRICT_TRANS_TABLES as some of the other answers suggest, but that has no effect here either (and the string size is many times bigger than needed anyway).
Does anyone have any clues? Maybe it has something to do with the fact that this MySQL server is standalone (it is not installed); it is started with this cmd:
@echo off
bin\mysqld --defaults-file=bin\my.ini --standalone --console --wait_timeout=2147483 --interactive_timeout=2147483
if errorlevel 1 goto error
goto finish
:error
echo.
echo MySQL could not be started
pause
:finish
and queries are done inside console started like this cmd:
@echo off
bin\mysql.exe -uroot -h127.0.0.1 -P3306
rem bin\mysql.exe -uroot -proot -h127.0.0.1 -P3306
Well, looking at the char č, code E8h (while writing this question), it does not look like UTF-8 but rather like extended ASCII (a code above 7Fh), which finally pointed me to try this MySQL script:
SET NAMES latin1;
DROP TABLE IF EXISTS `tab`;
CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(1024) default '' );
INSERT INTO `tab` VALUES (1,'motorček');
INSERT INTO `tab` VALUES (2,'motorcek');
SELECT * FROM `tab`;
Which finally works (silly me, I thought I had already tried it before without the correct result). So my error was forcing Unicode (which was set as the default) onto non-Unicode strings (which I thought should work). Here is the result:
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> DROP TABLE IF EXISTS `tab`;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(1024) default '' );
Query OK, 0 rows affected (0.02 sec)
mysql> INSERT INTO `tab` VALUES (1,'motorček');
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO `tab` VALUES (2,'motorcek');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `tab`;
+------+----------+
| ix | nam |
+------+----------+
| 1 | motorček |
| 2 | motorcek |
+------+----------+
2 rows in set (0.00 sec)
But as you can see there is some discrepancy in the table formatting; that does not matter much, as the presentation will be done in C++ anyway.
Without writing this question I would probably have gone in circles for hours or even days. Hopefully this helps others too.
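Why SET NAMES latin1 "works" can also be modeled in a few lines: latin1 is a single-byte charset, so the server stores the client's bytes essentially unchanged and sends them back unchanged, and the terminal's own code page makes them display correctly, even though the server believes it is holding different letters. A sketch, assuming a cp1250 terminal:

```python
# What 'SET NAMES latin1' effectively does with a legacy-code-page client:
terminal_bytes = "motorček".encode("cp1250")  # b'motor\xe8ek' sent over the wire
stored_and_returned = terminal_bytes          # latin1 column: kept byte-for-byte

print(stored_and_returned.decode("cp1250"))   # motorček -- looks right on screen
print(stored_and_returned.decode("latin-1"))  # motorèek -- what MySQL thinks it holds
```

This round-trips only as long as every client uses the same code page; dumping or reading the data as real latin1 (or converting the column to utf8) would expose the mismatch.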
[Edit1]
Now I have another problem, caused by Windows. If I paste the script from the clipboard or type it myself, all is OK, but if I use a source file the national characters come out wrong (and the -e option does not help either). As I need to use files, I am still looking for a solution. But as this is a different problem, I decided to ask a new question:
Using source command corrupts non Unicode text encoding
I have a table with data already in it. I would like to change the character encoding for one of the columns. Currently the column seems to have two encodings. Even after changing it, I see the same results.
Current Encoding
mysql> SELECT character_set_name FROM information_schema.`COLUMNS`
-> WHERE table_name = "mytable"
-> AND column_name = "my_col";
+--------------------+
| character_set_name |
+--------------------+
| latin1 |
| utf8 |
+--------------------+
2 rows in set (0.02 sec)
Changing the encoding (0 rows are affected)
mysql> ALTER TABLE mytable MODIFY my_col LONGTEXT CHARACTER SET utf8;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
You probably get 2 rows because the query matches two different tables in two different databases. Do SELECT * ... instead of SELECT character_set_name ... (or add a table_schema filter) to see which database each row comes from.
ALTER TABLE mytable MODIFY my_col LONGTEXT CHARACTER SET utf8; is safe only if there are no values in mytable.my_col yet.
A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
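At the byte level, that conversion amounts to decoding the stored bytes with the old charset and re-encoding them with the new one. A small Python illustration (the sample value 'café' is hypothetical):

```python
# What a latin1 -> utf8 charset conversion does to each stored value:
stored = b"caf\xe9"  # 'café' as latin1 bytes (0xE9 = é)

converted = stored.decode("latin-1").encode("utf-8")
print(converted)                  # b'caf\xc3\xa9' -- é is now two bytes
print(converted.decode("utf-8"))  # café -- same text, new byte representation
```

This is also why declared charset and actual bytes must agree before converting: decoding the wrong bytes with the declared charset produces mojibake that the conversion then faithfully preserves.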
I am creating my first serious project in PHP and I want to make sure I have my database set up correctly. It is utf8_general_ci, and, for example, the maximum length I want usernames to be is 20 characters, so the username field in the database would be a varchar(20)? Sorry if this is a stupid question; it is just that I read something somewhere that made me question myself.
Yes, you're right: VARCHAR(20) means 20 characters, not 20 bytes, regardless of the character set:
CREATE DATABASE my_test_db
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)
USE my_test_db;
Database changed
CREATE TABLE users (username varchar(20));
Query OK, 0 rows affected (0.04 sec)
INSERT INTO users VALUES ('abcdefghijklmnopqrstuvwxyz');
Query OK, 1 row affected, 1 warning (0.00 sec)
SELECT * FROM users;
+----------------------+
| username |
+----------------------+
| abcdefghijklmnopqrst |
+----------------------+
1 row in set (0.00 sec)
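The truncation above is by characters, not bytes: VARCHAR(20) always holds up to 20 characters, even though in utf8 those characters may occupy up to 60 bytes (80 in utf8mb4). A quick Python illustration:

```python
# VARCHAR(20) keeps the first 20 *characters* of an over-long value:
name = "abcdefghijklmnopqrstuvwxyz"
truncated = name[:20]
print(truncated)       # abcdefghijklmnopqrst -- matches the SELECT output above
print(len(truncated))  # 20 characters

# 20 characters can still mean more than 20 bytes of storage:
multi = "ž" * 20                       # each ž is 2 bytes in UTF-8
print(len(multi.encode("utf-8")))      # 40 bytes for 20 characters
```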