How to store emojis in a MySQL table? Tried everything

I have a CSV file containing tweets with emojis (eg. "Cool! 💕") and I need to import them into a MySQL table in such a way they will be saved/displayed correctly...
What do I have to set up and how for a correct import (I mean collation, etc.)?
More details:
In the CSV file, the emojis are visible
The encoding of the CSV file is UTF-8
I am on Windows 11
I already tried:
To set the character set to utf8mb4 and collation to utf8mb4_unicode_ci in the table
To add "SET NAMES 'utf8mb4';" (also tried with latin1) before the LOAD query

The table must encode text in character set utf8mb4 to store emojis.
Demo:
mysql> create table no ( t text ) character set=utf8;
mysql> load data local infile 'm.csv' into table no;
mysql> select * from no;
+---------+
| t       |
+---------+
| Cool! ? |
+---------+
So utf8 does not support emojis.
mysql> create table yes ( t text ) character set=utf8mb4;
mysql> load data local infile 'm.csv' into table yes;
mysql> select * from yes;
+------------+
| t          |
+------------+
| Cool! 💕 |
+------------+
But utf8mb4 does support emojis. The difference is that utf8mb4 supports 4-byte encodings, but utf8 doesn't. This is an unfortunate part of MySQL's history: its original utf8 was never implemented to support Unicode's supplementary characters (those outside the Basic Multilingual Plane), which include emojis.
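The 4-byte point is easy to verify outside MySQL; as an illustration, a quick check in Python:

```python
# utf8mb3 can store at most 3 bytes per character, so any character that
# encodes to 4 bytes in UTF-8 (such as most emojis) cannot be stored in it.
heart = "\U0001F495"  # 💕, a supplementary character outside the BMP
print(len(heart.encode("utf-8")))  # 4 bytes -> needs utf8mb4
print(len("C".encode("utf-8")))    # 1 byte  -> fits in either character set
```

Any character whose UTF-8 encoding is 4 bytes long triggers exactly the `Cool! ?` truncation shown above when inserted into a utf8mb3 column.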
Let's see if altering the first table helps.
mysql> alter table no character set utf8mb4;
mysql> load data local infile 'm.csv' into table no;
mysql> select * from no;
+---------+
| t       |
+---------+
| Cool! ? |
| Cool! ? |
+---------+
Why didn't this work? Because alter table ... character set does not convert existing columns. It only changes the table's default character set, which will not be used until the next time we add a column to that table.
We can see that the existing column is still using the old character set:
mysql> show create table no\G
*************************** 1. row ***************************
Table: no
Create Table: CREATE TABLE `no` (
`t` text CHARACTER SET utf8mb3
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
utf8mb3 is the character set that utf8 is an alias for in MySQL 8.0.
To convert existing columns, use:
mysql> alter table no convert to character set utf8mb4;
mysql> show create table no\G
*************************** 1. row ***************************
Table: no
Create Table: CREATE TABLE `no` (
`t` mediumtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
(Note that the column type changed from text to mediumtext: CONVERT TO widens the type so the column can still hold the same number of characters at up to 4 bytes each.)
Now try the load again:
mysql> load data local infile 'm.csv' into table no;
mysql> select * from no;
+------------+
| t          |
+------------+
| Cool! ? |
| Cool! ? |
| Cool! 💕 |
+------------+
Note that someday MySQL may change the 'utf8' alias to mean utf8mb4. Warnings to that effect appear on many of the above uses of 'utf8':
'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.

Related

Column declared as NVARCHAR gets created as VARCHAR in MySQL. Both VARCHAR and NVARCHAR declarations can store non-Latin characters

I am unable to create NVARCHAR data type in MySQL.
I have the following query -
CREATE TABLE table1 ( column1 NVARCHAR(10) );
This is supposed to create column1 that stores data type NVARCHAR(10). But the query -
DESCRIBE table1;
gives me the output -
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| column1 | varchar(10) | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
Thus instead of column1 that can store NVARCHAR(10) data type, column1 that can store VARCHAR(10) data type gets created.
Now, only the NVARCHAR data type is supposed to store non-Latin characters.
But the query -
INSERT INTO table1 VALUES ("भारत");
Runs successfully without any error. Here "भारत" is a Hindi word in Devanagari script which in English sounds "Bharat" and translates to "India".
The query -
SELECT * FROM table1;
gives display as expected -
+--------------+
| column1      |
+--------------+
| भारत         |
+--------------+
I guess maybe MySQL treats VARCHAR internally as NVARCHAR, but I can't find any documentation stating so.
The following is a link from MySQL developers website -
https://dev.mysql.com/doc/refman/8.0/en/charset-national.html
Here it says that NVARCHAR is fully supported.
To find out whether non-Latin characters can be stored in a column defined as VARCHAR, I ran the following queries -
CREATE TABLE table2 ( column2 VARCHAR(10) );
DESCRIBE table2;
This gives me the output -
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| column2 | varchar(10) | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
Here column2 that can store VARCHAR(10) data type gets created as expected.
Running the query -
INSERT INTO table2 VALUES ("भारत");
runs without any error.
and the query -
SELECT * FROM table2;
gives expected output -
+--------------+
| column2      |
+--------------+
| भारत         |
+--------------+
Thus even if I declare column2 as VARCHAR(10), I can successfully store non-Latin characters (here Devanagari characters of the Hindi language).
The most logical conclusion is that regardless of declaring a column as VARCHAR or NVARCHAR, MySQL always internally stores it as NVARCHAR. But I can't find any documentation regarding the same.
The following stackoverflow question gets closest to my question -
Issue Converting varchar to nvarchar mysql
But there is no answer provided to the question.
I am using operating system Ubuntu 20.04 and MySQL version - 8.0.26
Which characters you can store is determined by the character set and collation. As the default here is utf8mb4, both columns can store Hindi, Chinese, or Kiswahili text, at up to 4 bytes per character.
But
CREATE TABLE table1 ( column1 NVARCHAR(10), column2 VARCHAR(10) );
is actually treated slightly differently:
CREATE TABLE `table1` (
`column1` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`column2` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
In the sample database the default is
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
but NATIONAL VARCHAR uses what the standard defines:
CHARACTER SET utf8 COLLATE utf8_general_ci
For your Hindi word "भारत" it makes no difference, but for some characters there can be "problems".
Get in the habit of using SHOW CREATE TABLE instead of DESCRIBE. It would have answered your question.
mysql> CREATE TABLE nv ( column1 NVARCHAR(10) );
Query OK, 0 rows affected, 1 warning (0.05 sec)
mysql> show warnings;
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 3720 | NATIONAL/NCHAR/NVARCHAR implies the character set UTF8MB3, which will be replaced by UTF8MB4 in a future release. Please consider using CHAR(x) CHARACTER SET UTF8MB4 in order to be unambiguous. |
+---------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW CREATE TABLE nv\G
*************************** 1. row ***************************
Table: nv
Create Table: CREATE TABLE `nv` (
`column1` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
1 row in set (0.00 sec)
The Warning gives you a hint of an important problem, should you ever try to store Chinese or Emoji in the column. utf8mb4 is needed.
So, you should say
CREATE TABLE nv ( column1 VARCHAR(10) CHARACTER SET utf8mb4 );
That is, don't use NVARCHAR, use VARCHAR and specify the appropriate character set.
utf8 happens to be OK for DEVANAGARI, as in your example.
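That last point is easy to check directly: every character of the Devanagari word encodes to at most 3 bytes in UTF-8, so it fits in utf8mb3. A quick illustration in Python:

```python
# Devanagari characters sit in the Basic Multilingual Plane, so each one
# needs at most 3 bytes in UTF-8 and therefore fits in MySQL's utf8mb3.
word = "\u092d\u093e\u0930\u0924"  # "भारत"
print([len(ch.encode("utf-8")) for ch in word])  # [3, 3, 3, 3]
```

This is why the INSERT above succeeded despite the column using utf8mb3; it would fail (or mangle the data) only for 4-byte characters such as emojis.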

MySQL LIKE is case sensitive but I don't want it to be [duplicate]

This question already has answers here:
MySQL: is a SELECT statement case sensitive?
(14 answers)
Closed 1 year ago.
As I understand it, MySQL LIKE is supposed to be case-insensitive. Everywhere I've looked provides instructions on how to make it case-sensitive if needed. Mine seems to be case-sensitive, but I don't want it to be.
This is causing an issue with my authentication server which needs to be case insensitive when authenticating users. Please let me know how to fix this, or how I can figure out why LIKE is case sensitive here.
Case sensitivity is based on the collation of the column you are searching, defined in your CREATE TABLE, or else the collation of the session, which determines the character set and collation of string literals.
Example:
CREATE TABLE `users_user` (
`username` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
insert into users_user set username='DEMO1-0048';
Here we see the default collation of utf8mb4_general_ci is case-insensitive:
mysql> select * from users_user where username like 'DeMO1-0048';
+------------+
| username   |
+------------+
| DEMO1-0048 |
+------------+
But if I force the column to use a case-sensitive collation:
mysql> select * from users_user where username collate utf8mb4_bin like 'DeMO1-0048';
Empty set (0.00 sec)
Or if I force the string literal to use a case-sensitive collation:
mysql> select * from users_user where username like 'DeMO1-0048' collate utf8mb4_bin;
Empty set (0.00 sec)
Or if I define the table with a case-sensitive collation:
CREATE TABLE `users_user` (
`username` text COLLATE utf8mb4_bin
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
insert into users_user set username='DEMO1-0048';
mysql> select * from users_user where username like 'DeMO1-0048';
Empty set (0.00 sec)
So I would infer that your table is defined with a case-sensitive collation. You can check this:
mysql> select character_set_name, collation_name from information_schema.columns where table_name='users_user' and column_name='username';
+--------------------+----------------+
| character_set_name | collation_name |
+--------------------+----------------+
| utf8mb4            | utf8mb4_bin    |
+--------------------+----------------+
You can force a string comparison to be case-insensitive, even if the default collation defined for the table/column is case-sensitive.
mysql> select * from users_user where username like 'DeMO1-0048' collate utf8mb4_general_ci;
+------------+
| username   |
+------------+
| DEMO1-0048 |
+------------+
This works if you use the collate option on the column too:
mysql> select * from users_user where username collate utf8mb4_general_ci like 'DeMO1-0048';
+------------+
| username   |
+------------+
| DEMO1-0048 |
+------------+
Did you try this?
SELECT users_user.username FROM users_user WHERE users_user.username LIKE '%DEMO1-0048%'
You could make your query convert both sides to upper case so that they match:
SELECT * FROM users_user WHERE UPPER(username) LIKE UPPER('%DeMO1-0048%');
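The UPPER() trick is the same normalize-both-sides idea you would use in application code; a minimal Python sketch with a hypothetical matches() helper:

```python
# Case-insensitive substring match by folding both sides, analogous to
# WHERE UPPER(username) LIKE UPPER('%DeMO1-0048%').
def matches(stored: str, needle: str) -> bool:
    return needle.casefold() in stored.casefold()

print(matches("DEMO1-0048", "DeMO1-0048"))  # True
```

Note that, on the MySQL side, wrapping the column in UPPER() generally prevents an index on username from being used, so fixing the column's collation is usually the better long-term solution.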

Setting default collation method in MySQL Workbench doesn't change the collation method

I am using MySQL workbench to design my MySQL schema and I need my database to be case-sensitive. I have set the default collation method to latin1_general_cs, latin1_bin, and utf8_bin to no avail. When I check the collation version in MySQL by using the command SELECT collation(version()), it returns
mysql> select collation(version());
+----------------------+
| collation(version()) |
+----------------------+
| utf8_general_ci      |
+----------------------+
1 row in set (0.00 sec)
This happens regardless of which default collation I have chosen.
When I do the following search:
mysql> select * from table_name where t_name = "search_request";
I get back SEARCH_REQUEST, as well as search_request.
However, if I use the command
mysql> select * from table_name where t_name = "search_request" collate utf8_bin;
I get the anticipated result, search_request.
By the way, in my .sql file, I see the following:
CREATE SCHEMA IF NOT EXISTS `database_name` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin ;
USE `database_name` ;
which makes me incredibly confused why I am not seeing the correct results or the correct default collation method. Any ideas?
I found that the problem was that I wasn't dropping my database schema before reinitializing it, so the changes weren't propagated to the current database.
Simply adding this command to the start of my .sql script:
DROP SCHEMA IF EXISTS `database_name` ;
Fixed my problem and I was able to make case-sensitive queries!

How to convert an old MySQL database from latin1 to utf8

I have a database in latin1 format; all the utf8 characters stored are shown as ????
+------+---------+-------+---------+--------------------+----------+--------------+---------------------+---------------------+-----------+
| id   | user_id | fname | lname   | designation        | location | email        | created_at          | updated_at          | country   |
+------+---------+-------+---------+--------------------+----------+--------------+---------------------+---------------------+-----------+
| 6035 |    6035 | ????? | ??????? | ???????? ????????? |          | ccc#rddd.net | 2011-04-11 06:05:54 | 2011-04-10 06:13:04 | xxxxxxxxx |
+------+---------+-------+---------+--------------------+----------+--------------+---------------------+---------------------+-----------+
Now I use these commands to change the format of the database and the table to utf8:
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8;
ALTER DATABASE <database_name> CHARACTER SET utf8;
I have read that latin1 uses 1 byte per character but utf8 uses up to 3 bytes per character. My question is: if I alter my table (already containing lots of data) from latin1 to utf8, will the old character data consume 3 bytes or 1 byte? If I use ALTER and convert the data, will I have problems with the old data? I am sure that new data will be in utf8.
First, you should try:
SET NAMES 'utf8'
SET CHARACTER SET utf8
and SELECT your row #6035 in order to verify that the recorded data is not corrupted and is encoded in UTF-8 format.
UTF-8 (unlike UTF-16) is backward compatible with ASCII: it uses 1 byte for ASCII characters and up to 4 bytes for other characters (see the Unicode FAQ).
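That variable-length property is easy to demonstrate; for example, in Python:

```python
# UTF-8 is variable-length: ASCII stays at 1 byte, and other characters
# take 2 to 4 bytes depending on their code point.
for ch in ["A", "\u00e9", "\u092d", "\U0001F495"]:  # A, é, भ, 💕
    print(ch, len(ch.encode("utf-8")))  # prints 1, 2, 3, 4 bytes respectively
```

So converting a latin1 table whose contents are pure ASCII to utf8 does not grow the data at all; only non-ASCII characters take extra bytes.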
You should not convert your data if they are already stored in UTF8 format.
Warning
Try your ALTER TABLE on a backup first.
ALTER TABLE locks the table while it runs.

Old entries containing UTF8 characters saved incorrectly in UTF8 database

Ok, so I've ensured that my MySQL (5.1.61) database is UTF8, the table is UTF8, the field is UTF8, and the MySQL client's charset is set to UTF8. I can store and retrieve UTF8 entries successfully. I've also ensured my terminal's encoding is set to UTF8.
CREATE TABLE `cities` (
`name` varchar(255) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The issue comes with the 200,000 entries that already exist in the database. It appears the people we inherited the project from messed up a lot of the encoding, actually saving a string like Hörby as HÃ¶rby, where Ã and ¶ are valid UTF8 characters. That is, MySQL is receiving a UTF8 string of HÃ¶rby and is storing it as such. Here is an example where the first entry is one of the old entries, and the second is us inserting "Hörby" into the database with everything set to UTF8:
mysql> INSERT INTO cities SET name = 'Hörby';
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM cities;
+----------+
| name     |
+----------+
| HÃ¶rby   | <--- old entry
| Hörby    | <--- new entry
+----------+
What can we do to convert the squished characters into what they once were? We're pretty much ready to do anything at this point, but re-typing all 200,000 records is not feasible.
It looks like you had previously stored utf8 encoded strings in a latin1 column, then converted that column to utf8. To fix that:
Convert the data back to latin1:
ALTER TABLE cities MODIFY name varchar(255) CHARACTER SET latin1;
Change the column type to UTF-8 without altering the data (going via binary):
ALTER TABLE cities MODIFY name varchar(255) CHARACTER SET binary;
ALTER TABLE cities MODIFY name varchar(255) CHARACTER SET utf8;
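If you want to sanity-check the repair before running it on the table, the same round trip can be sketched in Python (assuming the old rows really are UTF-8 bytes that were misread as latin1):

```python
# Mojibake repair: re-encode the garbled text as latin1 to recover the
# original UTF-8 byte sequence, then decode those bytes as UTF-8.
garbled = "H\u00c3\u00b6rby"  # "HÃ¶rby", as stored in the old rows
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # Hörby
```

If this round trip raises a UnicodeDecodeError for some rows, those rows were corrupted differently and the blanket conversion above would mangle them too, so test on a backup first.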
You could use the REPLACE function in MySQL.
Something like:
UPDATE cities
SET name = REPLACE(name, 'Ã¶', 'ö');
repeated for each garbled sequence. Note this only fixes the sequences you enumerate, so the character-set conversion above is the more general fix.