Proclaimer: YES, I've done my search on Stackoverflow and NO it couldn't find an answer for this case.
I'm migrating data from an forum which has some legacy in it's MySQL database. One of the issues is the storage of Emoji's.
Donor database:
-- Server: 5.5.41-MariaDB
CREATE TABLE `forumtopicresponse` (
`id` int(10) UNSIGNED NOT NULL,
`topicid` int(10) UNSIGNED NOT NULL DEFAULT '0',
`userid` int(10) UNSIGNED NOT NULL DEFAULT '0',
`message` text NOT NULL,
`created` int(10) UNSIGNED NOT NULL DEFAULT '0',
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
In the message column I've got a message like this: Success!ðŸ‘ðŸ‘, which displays as "Success!👍👍"
Laravel target database:
-- Server: MySQL 5.7.x
CREATE TABLE `answers` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`topic_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`body` text CHARACTER SET utf8mb4,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
...keys & indexes
) ENGINE=InnoDB AUTO_INCREMENT=1254419 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
In HTML the document has a <meta charset="utf-8"> and to display the field, I'm using
{!! nl2br(e($answer->body)) !!}
And with this it just displays as Success!ðŸ‘👠and not the Emoji's.
Question
How can I migrate this data CLEAN and UTF-8 valid into my fresh database? I think I need some utf encoding, but can't figure out which.
UPDATE! THE SOLUTION
Got it fixed. The only solution was to alter the table in the Donor database.
ALTER TABLE forumtopicresponse CHANGE message message LONGTEXT CHARACTER SET latin1;
ALTER TABLE forumtopicresponse CHANGE message message LONGBLOB;
Do NOT change the LONGBLOB to LONGTEXT anymore: I lost data this way.
When I migrate the LONGBLOB data to the Laravel target database everything get's migrated correctly: all special chars and emoji's are fixed and in UTF-8.
The Emoji 👍 is hex F09F918D. That is, it is a 4-byte string.
MySQL's CHARACTER SET = utf8 does not handle 4-byte UTF-8 strings, only 3-byte ones, thereby excluding many of the Emoji and some of Chinese.
When interpreted as latin1, those hex digits are 👠(plus a 4th, but unprintable, character). Showing gibberish like that is called "Mojibake".
So, you have 2 problems:
Need to change the storage to utf8mb4 so you can store the Emoji.
Need to announce to MySQL that your client is speaking UTF-8, not latin1.
See "Best Practice" in Trouble with UTF-8 characters; what I see is not what I stored
And also see UTF-8 all the way through
Here's my list of fixes, but you must first correctly identify which case you have. Applying the wrong fix makes things worse.
There may be a 3rd mistake -- in moving the data from 5.5 to 5.7. Please provide those details.
Related
I use MySQL 5.7, but I do not know how to config it to display Vietnamese correctly.
I have set
CREATE DATABASE brt
DEFAULT CHARACTER SET utf8 COLLATE utf8_vietnamese_ci;
After that I used "LOAD DATA LOCAL INFILE" to load data written by Vietnamese into the database.
But I often get a result with error in Vietnamese character display.
For the detailed codes and files, please check via my GitHub as the following link
https://github.com/fivermori/mysql
Please show me how to solve this. Thanks.
As #ysth suggests, using utf8mb4 will save you a world of trouble going forward. If you change your create statements to look like this, you should be good:
CREATE DATABASE `brt` DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
USE `brt`;
DROP TABLE IF EXISTS `fixedAssets`;
CREATE TABLE IF NOT EXISTS `fixedAssets` (
`id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`code` varchar(250) UNIQUE NOT NULL DEFAULT '',
`name` varchar(250) NOT NULL DEFAULT '',
`type` varchar(250) NOT NULL DEFAULT '',
`createdDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
CREATE INDEX `idx_fa_main` ON `fixedAssets` (`code`);
I've tested this using the data that you provided and get the expected query results:
name
----------------------------------------------------------------
Mould Terminal box cover BN90/112 612536030 39 tháng
Mould W2206-045-9911-VN #3 ( 43 tháng)
Mould Flange BN90/B5 614260271 ( 43 tháng)
Mould 151*1237PH04pC11 ( 10 năm)
Transfer 24221 - 2112 ( sửa chữa nhà xưởng Space T 07-2016 ) BR2
Using the utf8mb4 character set and utf8mb4_unicode_ci collation is usually one of the simpler ways to ensure that your database can correctly display everything from plain ASCII to modern emoji and everything in between.
When trying to insert a unicode emoji character (😎) to a MYSQL table, the insert fails due to the error;
Incorrect string value: '\\xF0\\x9F\\x98\\x8E\\xF0\\x9F...' for column 'Title' at row 1
From what I've red about this issue, it's apparently caused by the tables default character set, and possible the columns default character set, being set incorrectly. This post suggests to use utf8mb4, which I've tried, but the insert is still failing.
Here's my table configuration;
CREATE TABLE `TestTable` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`InsertDate` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`Title` text,
`Description` text,
`Info` varchar(250) CHARACTER SET utf8 DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `xId_TestTablePK` (`Id`)
) ENGINE=InnoDB AUTO_INCREMENT=2191 DEFAULT CHARSET=utf8mb4;
Note that the Title and Text columns dont have an explicitly stated character set. Initially I had no default table character set, and had these two columns were setup with DEFAULT CHARSET=utf8mb4. However, when I altered the table's default charset to the same, they were removed (presumably because the columns inherit the type from the table?)
Can anyone please help me understand how I can store these unicode values in my table?
Its worth noting that I'm on Windows, trying to perform this insert on the MYSQL Workbench. I have also tried using C# to insert into the database, specifying the character set with CHARSET=utf8mb4, however this returned the same error.
EDIT
To try and insert this data, I am executing the following;
INSERT INTO TestTable (Title) SELECT '😎😎';
Edit
Not sure if this is relevant or not, but my database is also set up with the same default character set;
CREATE DATABASE `TestDB` /*!40100 DEFAULT CHARACTER SET utf8mb4 */;
The connection needs to establish that the client is talking utf8mb4, not just utf8. This involves changing the parameters used at connection time. Or executing SET NAMES utf8mb4 just after connecting.
I have a MySQL Database (myDB; ~2GB in size) with 4 Tables (tab1, tab2, tab3, tab4). Currently, the data that is stored in the tables was added using the charset ISO-8859-1 (i.e. Latin-1).
I'd like to convert the data in all tables to UTF-8 and use UTF-8 as default charset of the tables/database/columns.
On https://blogs.harvard.edu/djcp/2010/01/convert-mysql-database-from-latin1-to-utf8-the-right-way/ I found an interesting approach:
mysqldump myDB | sed -i 's/CHARSET=latin1/CHARSET=utf8/g' | iconv -f latin1 -t utf8 | mysql myDB2
I haven't tried it yet, but are there any caveats?
Is there a way to do it directly in the MySQL shell?
[EDIT:]
Result of SHOW CREATE TABLE messages; after running ALTER TABLE messages CONVERT TO CHARACTER SET utf8mb4;
CREATE TABLE `messages` (
`number` int(11) NOT NULL AUTO_INCREMENT,
`status` enum('0','1','2') NOT NULL DEFAULT '1',
`user` varchar(30) NOT NULL DEFAULT '',
`comment` varchar(250) NOT NULL DEFAULT '',
`text` mediumtext NOT NULL,
`date` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`number`),
KEY `index_user_status_date` (`user`,`status`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=3285217 DEFAULT CHARSET=utf8mb4
It is possible to convert the tables. But then you need to convert the application, too.
ALTER TABLE tab1 CONVERT TO utf8mb4;
etc.
To check, do SHOW CREATE TABLE tab1; it should show you CHARACTER SET utf8mb4.
Note: There are 3 things going on:
Convert the encoding of the data in any VARCHAR and TEXT columns.
Change the CHARACTER SET for such columns.
Change the DEFAULT CHARACTER SET for the table -- this comes into play if you add any new columns without specifying a charset.
The application...
When you connect from a client to MySQL, you need to tell it, in a app-specific way or via SET NAMES, the encoding of the bytes in the client. This does not have to be the same as the column declarations; conversion will occur during INSERT and SELECT, if necessary.
I recommend you take a backup and/or test a copy of one of the tables. Be sure to go all the way through -- insert, select, display, etc.
I read different articles and topics on this forum to help me setting up the charset & collation for my database. Not sure about the choices I made. I would appreciate any comments or advice.
I'm using MySQL 5.5.
The database (used with PHP) will have some datas from different languages (chinese, french, dutch, Us, spanish, arabic etc..)
I will mainly insert datas and get information from table ID'S. I won't need to full search and compare text.
So here is what I've done to create my database, I decided to use CHARSET utf8mb4 and COLLATION utf8mb4_unicode_ci
ALTER DATABASE testDB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
When I create the table:
CREATE TABLE IF NOT EXISTS sector (
idSector INT(5) NOT NULL AUTO_INCREMENT,
sectoreName VARCHAR(45) NOT NULL DEFAULT '',
PRIMARY KEY (idSector)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=0;
For some tables, I thought it was better to use utf8_bin
Ex: timezone (contain 168 047 rows)
CREATE TABLE timezone (
zone_id int(10) NOT NULL,
abbreviation varchar(6) COLLATE utf8_bin NOT NULL,
time_start decimal(11,0) NOT NULL,
gmt_offset int(11) NOT NULL,
dst char(1) COLLATE utf8_bin NOT NULL,
KEY idx_zone_id (zone_id),
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
So basically I would like to know if I'm on the right or if I'm doing something that could lead to problems.
Different columns can have different character sets and/or collations, but...
If you compare columns of different charset or collation (WHERE a.x = b.y), indexes cannot be used.
utf8 does not handle all of Chinese, nor does it handle some Emoji. For those, you need utf8mb4.
On other issues...
In INT(5), the (5) means nothing. Check out SMALLINT UNSIGNED with a range of 0..65535.
time_start decimal(11,0) is strange for a time. If it is a unix timestamp, either TIMESTAMP or INT UNSIGNED should work ok. See also TIME.
dst char(1) COLLATE utf8_bin -- this takes 3 bytes, because of utf8. Perhaps you want CHARACTER SET ascii so it will be only 1 byte?
InnoDB tables really should be given an explicit PRIMARY KEY. (Probably zone_id?)
You are making good a good choice for your sectoreName column. Notice one thing: utf8mb4_unicode_ci is a good collation for most language. But, for Spanish, it gets the alphabet wrong: in that language N and Ñ are considered different letters. Ñ appears immediately after N in the collating sequence. But in other European language they are considered the same letter. So, your Spanish-language users will, when they ask for Niña, get back Niña and Nina. That may appear to them as a mistake. (But, they're probably used to getting this sort of thing from pan-European software applications.)
You should use utf8mb4 as your character set throughout any new application. So, use that instead of utf8 in your timezone table. Using the _bin collation for your abbreviation column is fine.
I'm storing single emojis in a CHAR column in a MySQL database. The column's encoding is utf8mb4.
When I run this aggregate query, MySQL won't group by the emoji characters. It instead returns a single row with a single emoji and the count of all the rows in the database.
SELECT emoji, count(emoji) FROM emoji_counts GROUP BY emoji
Here's my table definition:
CREATE TABLE `emoji_counts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`emoji` char(1) DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Is there some special Unicode behavior I'll have to account for?
Turns out I needed to specify an expanded collation in the query, namely utf8mb4_unicode_520_ci.
This worked:
SELECT emoji, count(emoji) FROM emoji_counts group by emoji collate utf8mb4_unicode_520_ci;
EDIT: That collation isn't available on some server configs (including ClearDB's)... utf8mb4_bin also appears to work.