How to change character encoding for column in mysql table - mysql

I have a table with data already in it. I would like to change the character encoding for one of the columns. Currently the column seems to have two encodings. Even after changing it, I see the same results.
Current Encoding
mysql> SELECT character_set_name FROM information_schema.`COLUMNS`
-> WHERE table_name = "mytable"
-> AND column_name = "my_col";
+--------------------+
| character_set_name |
+--------------------+
| latin1 |
| utf8 |
+--------------------+
2 rows in set (0.02 sec)
Changing the encoding (0 rows are affected)
mysql> ALTER TABLE mytable MODIFY my_col LONGTEXT CHARACTER SET utf8;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0

You probably get two rows because there are two tables named mytable, in two different databases.
Run SELECT * ... instead of SELECT character_set_name ... so that the table_schema column shows which database each row belongs to.
ALTER TABLE mytable MODIFY my_col LONGTEXT CHARACTER SET utf8; is safe only if there are no values in mytable.my_col yet.
A table declared to be latin1, and containing latin1 bytes can be converted to utf8 via
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8;
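To make "converted" concrete, here is a minimal Python sketch (an illustration outside MySQL; the sample string and helper names are assumptions) of what happens to the stored bytes when latin1 data is converted to utf8, and why latin1 bytes that are merely treated as utf8 without conversion are invalid. Python's "latin-1" codec is ISO-8859-1, which agrees with MySQL's latin1 for the character used here.

```python
def convert_latin1_to_utf8(stored: bytes) -> bytes:
    """Decode the bytes as latin1, then re-encode them as UTF-8 (a real conversion)."""
    return stored.decode("latin-1").encode("utf-8")


def is_valid_utf8(stored: bytes) -> bool:
    """Return True if the bytes can be read as UTF-8 without error."""
    try:
        stored.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical column value, stored as latin1 bytes.
latin1_bytes = "M\u00fcnchen".encode("latin-1")
converted = convert_latin1_to_utf8(latin1_bytes)

print(latin1_bytes)                 # b'M\xfcnchen'
print(converted)                    # b'M\xc3\xbcnchen'
print(is_valid_utf8(latin1_bytes))  # False: 0xFC alone is not valid UTF-8
print(is_valid_utf8(converted))     # True
```

The single latin1 byte 0xFC becomes the two UTF-8 bytes 0xC3 0xBC, so a conversion has to rewrite the data, not just change the column's declared character set.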

Related

Mariadb query utf8 escaped string

I am using 5.5.65-MariaDB MariaDB Server.
I have a table with a column of type MEDIUMTEXT, named "remoteData", where I store a JSON string.
String values in this JSON string are stored as \uXXXX Unicode escape sequences, for example
"patientFirstName":"\u0395\u039b\u0395\u03a5\u0398\u0395\u03a1\u0399\u039f\u03a3"
The above value is the Greek Name "ΕΛΕΥΘΕΡΙΟΣ".
I am trying to search this column using the query
Select * from sync_details where remoteData like "%ΛΕΥΘΕΡ%"
but I get an empty set.
I assume this is because of the values being escaped but I don't know what to do.
EDIT: The query will run through php so we can use a solution that includes php functions.
Thank you in advance.
Christoforos
With a database defined to use CHARACTER SET utf8 and a utf8_general_ci collation, it should just work like this:
CREATE DATABASE IF NOT EXISTS `test` CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `test`.`sync_details` (`remoteData` MEDIUMTEXT);
INSERT INTO `test`.`sync_details` (`remoteData`) VALUES ('{"patientFirstName":"\\u0395\\u039b\\u0395\\u03a5\\u0398\\u0395\\u03a1\\u0399\\u039f\\u03a3"}');
SELECT `remoteData` FROM `test`.`sync_details` WHERE `remoteData` LIKE '%ΛΕΥΘΕΡ%';
+----------------------------------------------+
| remoteData |
+----------------------------------------------+
| {"patientFirstName": "ΕΛΕΥΘΕΡΙΟΣ"} |
+----------------------------------------------+
1 row in set (0,00 sec)
You could also try JSON_EXTRACT to get structured data from the stored JSON object. I just tested it like this:
SELECT JSON_EXTRACT(`remoteData`, "$.patientFirstName")
FROM `test`.`sync_details`
WHERE JSON_EXTRACT(`remoteData`, "$.patientFirstName")
LIKE '%ΛΕΥΘΕΡ%';
+--------------------------------------------------+
| JSON_EXTRACT(`remoteData`, "$.patientFirstName") |
+--------------------------------------------------+
| "ΕΛΕΥΘΕΡΙΟΣ" |
+--------------------------------------------------+
1 row in set (0,00 sec)
To index data in the JSON object, you could add a "Generated Column" to your table using the GENERATED ALWAYS syntax:
ALTER TABLE `test`.`sync_details` ADD COLUMN `firstName` VARCHAR(100) GENERATED ALWAYS AS (`remoteData` ->> '$.patientFirstName');
CREATE INDEX `firstnames_idx` ON `test`.`sync_details`(`firstName`);
SELECT `firstName` FROM `test`.`sync_details` WHERE `firstName` LIKE '%ΛΕΥΘΕΡ%';
+----------------------+
| firstName |
+----------------------+
| ΕΛΕΥΘΕΡΙΟΣ |
+----------------------+
1 row in set (0,00 sec)
This will only work with MariaDB >= 10.2 and with a utf8 encoded db and a utf8_general_ci collation.
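For a server older than that (the question mentions 5.5.65), one possible alternative is to escape the search term the same way the stored JSON is escaped and search for that escaped text. The Python sketch below only illustrates the idea; the helper names are hypothetical, it assumes the stored escapes use lowercase hex digits as in the question, and it assumes LIKE uses its default escape character (backslash) with the pattern passed as a bound parameter.

```python
import json


def json_escape(term: str) -> str:
    """Return the term as it appears inside an ASCII-only JSON string, without quotes."""
    return json.dumps(term, ensure_ascii=True)[1:-1]


def like_contains_pattern(fragment: str) -> str:
    """Build a LIKE pattern value that matches rows containing fragment literally."""
    escaped = (
        fragment.replace("\\", "\\\\")  # a literal backslash must be doubled for LIKE
        .replace("%", "\\%")            # literal percent sign
        .replace("_", "\\_")            # literal underscore
    )
    return "%" + escaped + "%"


# The stored value from the question (raw string: backslashes are kept as-is).
stored = r'{"patientFirstName":"\u0395\u039b\u0395\u03a5\u0398\u0395\u03a1\u0399\u039f\u03a3"}'

term = "\u039b\u0395\u03a5\u0398\u0395\u03a1"  # the Greek search term from the question
fragment = json_escape(term)

print(fragment)                         # \u039b\u0395\u03a5\u0398\u0395\u03a1
print(fragment in stored)               # True
print(like_contains_pattern(fragment))  # %\\u039b\\u0395\\u03a5\\u0398\\u0395\\u03a1%
```

In PHP, json_encode without JSON_UNESCAPED_UNICODE should produce the same kind of escaped text, so the same two steps would apply there before binding the pattern to the query.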

utf8mb4 characters not surviving "LOAD DATA INFILE"

I have a csv file containing some characters that lie outside Unicode BMP, for example the character 🀀. They are SMP characters, so they need to be stored in utf8mb4 charset and utf8mb4_general_ci collation in MySQL instead of utf8 charset and utf8_general_ci collation.
So here are my SQL queries.
MariaDB [tweets]> set names 'utf8mb4';
Query OK, 0 rows affected (0.01 sec)
MariaDB [tweets]> create table test (a text) collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.06 sec)
MariaDB [tweets]> insert into test (a) values ('🀀');
Query OK, 1 row affected (0.03 sec)
MariaDB [tweets]> select * from test;
+------+
| a |
+------+
| 🀀 |
+------+
1 row in set (0.00 sec)
No warnings; everything works. Now I want to load the CSV file. For testing, the file has only one line.
MariaDB [tweets]> load data local infile 't.csv' into table wzyboy character set utf8mb4 fields terminated by ',' enclosed by '"' lines terminated by '\n\n' (tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,timestamp,source,text,expanded_urls);
Query OK, 1 row affected, 7 warnings (0.01 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 7
The warning message is:
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x80\x80' for column 'text' at row 1 |
All my working environments (OS, terminal, etc.) use UTF-8. I have specified utf8mb4 in every place I could think of, and if I manually INSERT INTO it works just fine. However, when I use LOAD DATA INFILE [...] CHARACTER SET utf8mb4 [...] it fails with the warning "Incorrect string value".
Problem solved.
It was my mistake. During the experiment I only ran TRUNCATE TABLE and never re-created the table, so the database and the table were both utf8mb4, but the columns were still utf8.
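The bytes in the warning are the reason a utf8 column rejects this character: MySQL's utf8 stores at most three bytes per character, while the tile needs four. A small Python sketch (the helper name is hypothetical) that finds such characters in a string:

```python
def chars_needing_utf8mb4(text: str) -> list:
    """Return the characters whose UTF-8 encoding takes four bytes (outside the BMP)."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


tile = "\U0001F000"  # U+1F000, whose UTF-8 bytes match the warning message

print(tile.encode("utf-8"))                           # b'\xf0\x9f\x80\x80'
print(len(chars_needing_utf8mb4("M\u00fcnchen")))     # 0: fits in 3-byte utf8
print(len(chars_needing_utf8mb4("a" + tile + "b")))   # 1: needs utf8mb4
```

Running a check like this over the CSV before loading would show which rows depend on the column itself (not only the table or database) being utf8mb4.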

COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'?

mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
How do I get rid of this error?
What I already tried (copy&paste):
$ mysql -u admin -p $DATABASE
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.69 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SELECT LOCATE(_utf8"n", _utf8"München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SHOW VARIABLES LIKE "character_set_database";
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| character_set_database | utf8 |
+------------------------+-------+
1 row in set (0.00 sec)
Possibly the server has been compiled with a default character set of binary, so that string literals are being interpreted as such, or the client is set to use a binary mode when communicating with the server. You can change the client and connection character set by calling SET NAMES utf8 (though this is not recommended if your SQL statements are being issued from PHP, for example, as PHP will have its own commands for setting the connection character set). See Connection Character Sets and Collations in the MySQL reference manual.
Alternatively you can use "introducers" to specify explicitly the charset used for the string literals in your LOCATE function, for instance:
LOCATE(_utf8"n", _utf8"München")
See the reference manual page Character String Literal Character Set and Collation for more details.
The COLLATE in my example sets the collation of the return value of
LOCATE, the result of which is of type binary.
To set the collation of the arguments:
mysql> SELECT LOCATE(_utf8"n" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 3 |
+--------+
1 row in set (0.00 sec)
My motivation actually was finding out whether MySQL takes the collation
into account when searching for the substring. Unfortunately it does
not. See the result of the second command:
mysql> SELECT LOCATE(_utf8"ü" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 2 |
+--------+
1 row in set (0.00 sec)
mysql> SELECT LOCATE(_utf8"u" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 0 |
+--------+
1 row in set (0.00 sec)
Test with a temporary table (collation taken into account in WHERE clause, but not in
LOCATE):
mysql> CREATE TEMPORARY TABLE test
(text VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO test VALUES("München");
Query OK, 1 row affected (0.00 sec)
mysql> SELECT text FROM test WHERE text LIKE "%u%";
+---------+
| text |
+---------+
| München |
+---------+
1 row in set (0.00 sec)
mysql> SELECT LOCATE("u", text) AS locate FROM test WHERE text LIKE "%u%";
+--------+
| locate |
+--------+
| 0 |
+--------+
1 row in set (0.01 sec)
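To make concrete what a collation-aware search would return here, the following Python sketch folds accents and case before searching. It is an illustration outside MySQL, the helper names fold and locate_folded are hypothetical, and the folding only roughly approximates how utf8_general_ci treats ü and u as equal.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip combining accents and fold case (a rough stand-in for a _ci collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def locate_folded(needle: str, haystack: str) -> int:
    """1-based position like LOCATE, or 0 if absent.

    Positions are only meaningful when folding keeps the string length
    unchanged (it does not for characters such as the German sharp s).
    """
    return fold(haystack).find(fold(needle)) + 1


city = "M\u00fcnchen"
print(locate_folded("n", city))       # 3
print(locate_folded("\u00fc", city))  # 2
print(locate_folded("u", city))       # 2, where the session above returned 0
print(locate_folded("x", city))       # 0
```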
I know this is late, but I hope it helps someone. I kept getting the same error and I knew my charsets and collations were fine.
Check for '@' symbols in your statement that don't belong. I was testing my stored procedure out as a select statement with variables, then when creating the stored proc forgot to remove the '@' symbols. Needless to say, I felt very silly.
I also know this doesn't seem to be the case in this question but this is my first SO post and I don't have enough rep to do much else, so I apologize.

MySQL LIKE operator with wildcard and backslash

I'm frustrated with the pattern escaping MySQL uses in the LIKE operator.
root#dev> create table foo(name varchar(255));
Query OK, 0 rows affected (0.02 sec)
root#dev> insert into foo values('with\\slash');
Query OK, 1 row affected (0.00 sec)
root#dev> insert into foo values('\\slash');
Query OK, 1 row affected (0.00 sec)
root#dev> select * from foo where name like '%\\\\%';
Empty set (0.01 sec)
root#dev> select * from foo;
+------------+
| name |
+------------+
| with\slash |
| \slash |
+------------+
2 rows in set (0.00 sec)
root#dev> select * from foo where name like '%\\\\%';
Empty set (0.00 sec)
root#dev> select * from foo where name like binary '%\\\\%';
+------------+
| name |
+------------+
| with\slash |
| \slash |
+------------+
2 rows in set (0.00 sec)
According to MySQL docs: http://dev.mysql.com/doc/refman/5.5/en/string-comparison-functions.html#operator_like
'%\\\\%' should be the correct right operand, so why does it yield no result?
EDIT:
The database I'm testing that in has character_set_database set to utf8. To further my investigation, I created the same setup in a database that has character_set_database set to latin1, and guess what, '%\\\\%' works!
EDIT:
The problem can be reproduced, and it is a column collation problem. Details: http://bugs.mysql.com/bug.php?id=63829
In MySQL 5.6.10, with the text field collation utf8mb4_unicode_520_ci this can be achieved by using 5 backslash characters instead of 4, i.e:
select * from foo where name like binary '%\\\\\%';
Somehow, against all expectations, this properly finds all rows with backslashes.
At least this should work until the MySQL field collation bug above is fixed. Considering it has been more than 5 years since the bug was discovered, any app designed with this may outlive its usefulness before MySQL is fixed, so this should be a pretty reliable workaround.
With MySQL 5.0.12 dev on Windows 10 I got the following results when I changed the query from
SELECT * FROM `foo` WHERE `name` LIKE '%http:\/\/%'
to
SELECT * FROM `foo` WHERE `name` LIKE '%http:\\\\\\\%'
The second query works, even though the first string, with forward slashes, is the original field content. MySQL seems to have interpreted the forward slashes as backslashes.
It seems it has some relation to that MySQL bug: http://bugs.mysql.com/bug.php?id=46659
I think you are connecting to MySQL without specifying the correct --character-set-server option (which defaults to latin1 with collation latin1_swedish_ci) while the console's current charset is UTF-8. That causes incorrect character conversions and comparisons when you work with data that is supposed to be converted to utf8 from the charset given by --character-set-server.
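The count of four backslashes in the question follows from two layers of unescaping: the SQL string-literal parser turns each pair of backslashes into one, and LIKE then treats the backslash as its own escape character. The Python sketch below builds both layers for a search for one literal backslash. The helper names are hypothetical, it assumes the default settings (NO_BACKSLASH_ESCAPES off, no ESCAPE clause), and it does not model the collation bug discussed above.

```python
def like_contains_pattern(fragment: str) -> str:
    """LIKE layer: escape backslash, percent and underscore, then wrap in wildcards."""
    escaped = (
        fragment.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return "%" + escaped + "%"


def sql_string_literal(value: str) -> str:
    """Literal layer: quote a value as a MySQL string literal.

    For illustration only; real code should use bound parameters instead.
    """
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


pattern = like_contains_pattern("\\")  # what LIKE must receive: %\\%
literal = sql_string_literal(pattern)  # what is typed in the query: '%\\\\%'

print(pattern)
print(literal)
print(literal.count("\\"))  # 4
```

With a bound parameter only the first layer applies, so the value sent to the server contains two backslashes, not four.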

double checking my mysql field lengths

I am creating my first serious project in PHP and I want to make sure I have my database set up correctly. It uses utf8_general_ci, and the maximum length I want for usernames is 20 characters, so should the username field in the database be a varchar(20)? Sorry if this is a stupid question; I read something somewhere that is making me doubt myself.
Yes you're right:
CREATE DATABASE my_test_db
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)
USE my_test_db;
Database changed
CREATE TABLE users (username varchar(20));
Query OK, 0 rows affected (0.04 sec)
INSERT INTO users VALUES ('abcdefghijklmnopqrstuvwxyz');
Query OK, 1 row affected, 1 warning (0.00 sec)
SELECT * FROM users;
+----------------------+
| username |
+----------------------+
| abcdefghijklmnopqrst |
+----------------------+
1 row in set (0.00 sec)
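The length in VARCHAR(20) counts characters, not bytes, which is why 20 is enough even for non-ASCII usernames. A short Python sketch of the same arithmetic (helper names are hypothetical; Python's len counts code points, which matches the character count for the strings used here):

```python
def fits_varchar(value: str, max_chars: int) -> bool:
    """True if the value has at most max_chars characters."""
    return len(value) <= max_chars


def truncate_like_varchar(value: str, max_chars: int) -> str:
    """Mimic the truncation shown in the session above (non-strict SQL mode)."""
    return value[:max_chars]


name = "abcdefghijklmnopqrstuvwxyz"
print(truncate_like_varchar(name, 20))  # abcdefghijklmnopqrst

umlauts = "\u00fc" * 20                 # twenty two-byte characters
print(len(umlauts))                     # 20 characters
print(len(umlauts.encode("utf-8")))     # 40 bytes
print(fits_varchar(umlauts, 20))        # True
```

If the server runs in strict SQL mode, the over-long insert is rejected with an error instead of being truncated with a warning, so validating the length in the application first avoids both outcomes.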