Hi, I recently changed the hosting provider for my website. As part of the move I exported the MySQL database from phpMyAdmin in my previous cPanel. It used CHARACTER SET latin1 and COLLATE latin1_swedish_ci. After importing it into my new phpMyAdmin I saw a problem with Czech characters such as ě ř č ů, which appeared as question marks or weird symbols. At first I also wasn't able to insert these letters, but after changing the table CHARSET to utf8 I can. How do I export the data from my old database and import it into the new one without corrupting it? Here's what the database looks like:
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
--
-- Database: `sambajiu_samba`
--
-- --------------------------------------------------------
CREATE TABLE `bookings` (
`id` int(11) NOT NULL,
`fname` varchar(100) NOT NULL,
`surname` varchar(100) DEFAULT NULL,
`email` varchar(255) NOT NULL,
`telephone` varchar(100) NOT NULL,
`age_group` varchar(100) DEFAULT NULL,
`hear` varchar(100) DEFAULT NULL,
`experience` text,
`subscriber` tinyint(1) DEFAULT NULL,
`booking_date` varchar(255) DEFAULT NULL,
`lesson_time` varchar(255) NOT NULL,
`booked_on` datetime DEFAULT CURRENT_TIMESTAMP
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `bookings` ADD PRIMARY KEY (`id`);
ALTER TABLE `bookings` MODIFY `id` int(11) NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=345;
Czech is not handled by latin1. It would be better to use utf8mb4 (which can handle virtually everything in the world). Outside of MySQL, it is called "UTF-8".
How did you do the "export" and "import"? What is in the file? Can you get the hex of a small portion of the exported file -- we need to check what encoding was used for the Czech characters.
As for "as question mark or weird symbols", see "question marks" and "Mojibake" in "Trouble with UTF-8 characters; what I see is not what I stored".
Your hex probably intended to say
Rezervovat trénink zda
In the middle of the hex is
C383 C2A9
Which is UTF-8 for Ã©, that is, é encoded as UTF-8 twice. When you display the data, you might see that, or you might see the desired é. In the latter case, the browser is probably "helping" you by decoding the data twice. For further discussion on this, see "double encoding" in the link above.
"Fixing the data" is quite messy:
CONVERT(BINARY(CONVERT(CONVERT(
UNHEX('52657A6572766F766174207472C383C2A96E696E6B207A6461')
USING utf8mb4) USING latin1)) USING utf8mb4)
==> 'Rezervovat trénink zda'
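The same repair can be checked outside MySQL. The following is only a sketch in Python, using the hex string quoted above; the variable names are made up for illustration, and Python's "latin-1" codec stands in for MySQL's latin1 (the two agree for the bytes involved here).

```python
# Hex taken from the answer above: a double-encoded UTF-8 string.
raw = bytes.fromhex("52657A6572766F766174207472C383C2A96E696E6B207A6461")

# What a UTF-8 reader sees before any repair (the mojibake form).
as_stored = raw.decode("utf-8")

# Undo one layer: map each character back to its latin1 byte,
# then read those bytes as the UTF-8 they originally were.
repaired = as_stored.encode("latin-1").decode("utf-8")

print(as_stored)
print(repaired)
```

If the second line printed is the intended text, the column holds double-encoded data and the CONVERT chain above is the matching fix.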
But I don't think we are finished. That acute-e is a valid character in latin1. You mentioned 4 Czech accented letters that, I think, are not in latin1. They do exist in latin2 (ISO 8859-2) and cp1250, so one of those may be relevant.
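A quick way to see which single-byte encodings contain the four Czech letters is to try encoding them. This is a rough sketch in Python; the codec names are Python's, and treating "cp1252" as the counterpart of MySQL's latin1 and "iso8859_2" as the counterpart of MySQL's latin2 is an assumption made for the comparison.

```python
CZECH = "\u011b\u0159\u010d\u016f"  # the letters ě ř č ů from the question

def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for codec in ("latin-1", "cp1252", "iso8859_2", "cp1250", "utf-8"):
    print(codec, can_encode(CZECH, codec))
```

A character that cannot be represented in the column's character set is typically what ends up stored as a question mark.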
I have dumped my small MySQL table (manually reduced to localize the problem) to show it here:
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
CREATE TABLE `symb` (
`smb` varchar(200) NOT NULL,
`trtmnt` varchar(200) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `symb` (`smb`, `trtmnt`) VALUES
('і', 'ty'),
('ї', 'hr');
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
If you create the MySQL table above and run this query
select * from symb where smb = 'ї';
or this one (queries are different - please see the symbols 'ї' vs 'і')
select * from symb where smb = 'і';
then you may see you get two rows selected instead of one as I would expect.
To reemphasize, these two select queries above are different - the symbol 'ї' is different from 'і' (both are cyrillic symbols, 'і' is NOT latin here).
Collation chosen was utf8_general_ci
Any reasons why 'і' and 'ї' are treated as the same symbols and what's the proper way to make it different? I need to select the exact row, not two.
Queries above were tested in phpMyAdmin and HeidiSQL which means that's MySQL (collation?) issue, not the program used to run queries.
Each different symbol should be treated as a different symbol and the table should be case sensitive. What's wrong with the table above? As a result I'm unable to set a unique key on this column.
Thank you.
Just added based on comments:
What does SHOW TABLE STATUS LIKE 'symb' show you?
It shows me:
Name symb
Engine InnoDB
Version 10
Row_format Compact
Rows 2
Avg_row_length 8192
Data_length 16384
Max_data_length 0
Index_length 0
Data_free 0
Auto_increment NULL
Create_time 22.05.16 12:11
Update_time NULL
Check_time NULL
Collation utf8_general_ci
Checksum NULL
Create_options
Comment
That is how the collation you chose works: utf8_general_ci treats these two letters as equal. You can look here for more information: https://stackoverflow.com/a/1036459/4099089
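One way to picture why an accent-insensitive collation can merge the two letters: in Unicode, ї (U+0457) decomposes into і (U+0456) plus a combining diaeresis. The sketch below is Python, not MySQL; strip_marks is a hypothetical helper, and MySQL's collations use their own weight tables rather than this exact procedure.

```python
import unicodedata

YI = "\u0457"  # CYRILLIC SMALL LETTER YI
I = "\u0456"   # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I

def strip_marks(text: str) -> str:
    """Decompose text (NFD) and drop combining marks such as the diaeresis."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(YI == I)                            # exact comparison, like a _bin collation
print(strip_marks(YI) == strip_marks(I))  # mark-insensitive comparison
```

An exact comparison keeps the two letters apart, while a comparison that ignores combining marks does not, which matches the two-row result in the question.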
Because your SELECT statement is returning both records, it appears that your data has already been encoded wrongly into UTF-8. So merely changing the encoding of the smb column from Latin1 to UTF-8 won't work. One option for you would be to dump the database to binary, and then reimport it as UTF-8:
mysqldump --add-drop-table your_database | replace CHARSET=latin1 CHARSET=utf8 |
iconv -f latin1 -t utf8 | mysql your_database
Read here and here for more information.
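Before running a pipeline like this, it may help to see what the iconv step does to each kind of input. The sketch below is a Python stand-in for `iconv -f latin1 -t utf8`; latin1_to_utf8 is a hypothetical helper name, not part of any tool mentioned above.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret latin1 bytes as text and re-encode them as UTF-8."""
    return data.decode("latin-1").encode("utf-8")

genuine_latin1 = "\u00e9".encode("latin-1")  # e-acute as the single byte E9
already_utf8 = "\u00e9".encode("utf-8")      # e-acute as the two bytes C3 A9

print(latin1_to_utf8(genuine_latin1).hex())
print(latin1_to_utf8(already_utf8).hex())
```

Genuine latin1 input comes out as correct UTF-8, but input that is already UTF-8 comes out double-encoded, so it is worth checking the hex of the dump first.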
Which do you want?
D197 1111=x0457 [ї] L CYRILLIC SMALL LETTER YI
C3AF 239=x00EF [ï] L LATIN SMALL LETTER I WITH DIAERESIS
If you do SELECT col, HEX(col) ... you should get either D197 or C3AF for a correctly stored YI or i-umlaut. That is the best way to tell if it was stored correctly as utf8 (or utf8mb4).
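For comparison with the HEX(col) output, the expected bytes can be computed independently. A small Python sketch follows; utf8_hex is a made-up helper for illustration.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as upper-case hex, like MySQL's HEX()."""
    return ch.encode("utf-8").hex().upper()

for ch in ("\u0457", "\u00ef", "\u0456"):
    # Cyrillic yi, Latin i with diaeresis, Cyrillic i
    print(f"U+{ord(ch):04X}", utf8_hex(ch))
```

If HEX(col) shows anything longer than these two-byte values for a single letter, the data was probably encoded more than once.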
They look the same, but they are treated differently. All the utf8/utf8mb4 collations sort all Cyrillic letters after all Latin letters.
The "best" "general" collation is utf8mb4_unicode_520_ci. (utf8, instead of utf8mb4, is ok if you don't need Chinese or Emoji.)
Here is my rundown of how Western European characters compare in various utf8/utf8mb4 collations. utf8_spanish2_ci, for example, is the only one to treat ll as a 'separate character', after all other l values. utf8_latvian_ci handles Ķ and Ļ as separate letters. Etc.
SHOW TABLE STATUS shows the default for the table; you need to look at SHOW CREATE TABLE to see if any column overrides that default.
I've solved* this issue in the following way:
1) Change table collation to utf8mb4_unicode_520_ci
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
This allows you to insert all letters in Ukrainian alphabet except for ґ.
This also allows you to sort letters the way they are supposed to.
2) Change column collation to utf8mb4_bin
ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
This allows you to insert ґ character.
*The only drawback of this approach is that when sorting you have to use
SELECT * FROM table_name ORDER BY column_name COLLATE utf8mb4_unicode_520_ci ASC
But still, it won't sort DESC
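The sorting drawback follows from how a _bin collation compares values: by code point, not by alphabet position. A minimal Python sketch of that ordering, using three Ukrainian letters chosen for illustration:

```python
# Alphabet order in Ukrainian is: г (U+0433), ґ (U+0491), д (U+0434).
letters = ["\u0433", "\u0491", "\u0434"]

# Sorting by encoded bytes mimics a binary comparison of UTF-8 strings.
by_bytes = sorted(letters, key=lambda s: s.encode("utf-8"))

print([f"U+{ord(ch):04X}" for ch in by_bytes])
```

Here ґ lands after д because its code point is higher, which is why an explicit COLLATE is needed in ORDER BY once the column itself is utf8mb4_bin.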
Here is a question about adding a comment to a column in MySQL. Can this comment be UTF-8? Also, what encoding does MySQL use for these columns by default?
Default character set and collation is set when the database is created
CREATE DATABASE mydb
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
You can modify character set on a specific column like this
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;
I am trying to insert emojis into MySQL but they turn into question marks. I have changed the MySQL connection/server collation, database collation, table collation and column collation. I used these statements to change them:
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
I have done all of this, but emojis in MySQL still show as question marks. What should I do to make MySQL show the emojis? Thanks in advance.
Little late to answer the question. But I hope it will be useful for others...
The configuration above makes the database tables store UTF-8 encoded data. But the database connection (JDBC) must also be able to transfer that data to and from the client. For that, the JDBC connection parameter charset should be set to utf8mb4.
The default encoding for inbound connections isn't set properly. The table's DEFAULT CHARSET will report utf8mb4, but character_set_server will be something different.
So, set default-character-set=utf8mb4.
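The reason every layer has to be utf8mb4 rather than utf8 is that emoji take four bytes in UTF-8, while MySQL's utf8 stores at most three bytes per character. A short Python sketch of that check; needs_utf8mb4 is a hypothetical helper name.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

samples = ["hello", "\u011b\u0159\u010d\u016f", "hi \U0001F600"]
for s in samples:
    longest = max(len(ch.encode("utf-8")) for ch in s)
    print(repr(s), "max bytes per char:", longest, "needs utf8mb4:", needs_utf8mb4(s))
```

If any one hop (column, connection, or client) is limited to three-byte UTF-8, those characters are the ones that come back as question marks.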
I'm setting a Rails environment up for one of my colleagues, who's using a Mac (in case that's relevant). I've pulled the data down from our live MySQL database and made a local development database with that data. If I open the mysql console and look at the data for a record which has extended charset characters in its name field, then it looks fine. However, in the Rails console (and in a Rails-generated web page) the encoding is broken: an en dash is replaced by "â€“", for example.
The only rails config options i know about that are relevant to this is in config/database.yml. I currently have this set:
encoding: utf8
collation: utf8_general_ci
which makes it work fine on my machine for example. But like i say it's not working on my colleague's machine. Any ideas anyone?
EDIT 1: on the live server, where i copied the data FROM, the charset info looks like this:
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
EDIT 2: in response to @eggyal's comment I've done a couple of mysqldumps, which have been quite revealing. Here's the first dump:
$ mysqldump -u root -h127.0.0.1 dbname lessons --where="id=79510"
-- MySQL dump 10.11
--
-- Host: 127.0.0.1 Database: e_learning_resource_v3
-- ------------------------------------------------------
-- Server version 5.0.32-Debian_7etch4-log
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `lessons`
--
DROP TABLE IF EXISTS `lessons`;
CREATE TABLE `lessons` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) default NULL,
`description` text,
`user_id` int(11) default NULL,
`created_at` datetime default NULL,
`privacy` int(11) default '1',
`is_official` tinyint(1) default '0',
`is_readonly` tinyint(1) default NULL,
`comments_allowed` tinyint(1) default NULL,
`hours` int(11) default NULL,
`sessions` int(11) default NULL,
`updated_at` datetime default NULL,
`custom_menu_swf` varchar(255) default NULL,
`pupil_liked_at` datetime default NULL,
`user_liked_at` datetime default NULL,
`pupil_favorite_count` int(11) default '0',
`user_favorite_count` int(11) default '0',
`teacher_notes` text,
`pupil_notes` text,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `lessons`
--
-- WHERE: id=79510
LOCK TABLES `lessons` WRITE;
/*!40000 ALTER TABLE `lessons` DISABLE KEYS */;
INSERT INTO `lessons` VALUES (79510,'Jazzâ€“Man',NULL,NULL,'2014-04-03 12:08:05',1,0,NULL,NULL,NULL,NULL,'2014-04-03 12:08:05',NULL,NULL,NULL,0,0,NULL,NULL);
/*!40000 ALTER TABLE `lessons` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2014-04-03 11:16:42
So, this was just a straight mysqldump and it's got the broken characters in it (Jazzâ€“Man) in the "INSERT INTO lessons" line.
I do it again with some extra options, and the data looks ok in the dump file:
$ mysqldump -u root -h127.0.0.1 dbname lessons --extended-insert --single-transaction --default-character-set=latin1 --skip-set-charset --where="id=79510"
-- MySQL dump 10.11
--
-- Host: 127.0.0.1 Database: e_learning_resource_v3
-- ------------------------------------------------------
-- Server version 5.0.32-Debian_7etch4-log
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `lessons`
--
DROP TABLE IF EXISTS `lessons`;
CREATE TABLE `lessons` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) default NULL,
`description` text,
`user_id` int(11) default NULL,
`created_at` datetime default NULL,
`privacy` int(11) default '1',
`is_official` tinyint(1) default '0',
`is_readonly` tinyint(1) default NULL,
`comments_allowed` tinyint(1) default NULL,
`hours` int(11) default NULL,
`sessions` int(11) default NULL,
`updated_at` datetime default NULL,
`custom_menu_swf` varchar(255) default NULL,
`pupil_liked_at` datetime default NULL,
`user_liked_at` datetime default NULL,
`pupil_favorite_count` int(11) default '0',
`user_favorite_count` int(11) default '0',
`teacher_notes` text,
`pupil_notes` text,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Dumping data for table `lessons`
--
-- WHERE: id=79510
LOCK TABLES `lessons` WRITE;
/*!40000 ALTER TABLE `lessons` DISABLE KEYS */;
INSERT INTO `lessons` VALUES (79510,'Jazz–Man',NULL,NULL,'2014-04-03 12:08:05',1,0,NULL,NULL,NULL,NULL,'2014-04-03 12:08:05',NULL,NULL,NULL,0,0,NULL,NULL);
/*!40000 ALTER TABLE `lessons` ENABLE KEYS */;
UNLOCK TABLES;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2014-04-03 11:18:20
So, it looks like the extra options did the trick:
--extended-insert --single-transaction --default-character-set=latin1 --skip-set-charset
When a MySQL client interacts with the server:
1) the server receives any text merely as a string of bytes; the client will have previously told it how such text would be encoded.
2) if the server then has to store that text in a table, it must transcode it to the encoding of the relevant column (if different).
3) if the client subsequently wants to retrieve such text, the server must transcode it to the encoding expected by the client.
If the encodings used by the client in steps 1 and 3 are the same (which is usually the case, especially when the client in both cases is the same application), then it often goes unnoticed if the client is using an encoding other than the one it said it would. For example, suppose the client tells MySQL that it will use latin1, but actually sends data in utf8:
1) The string 'Jazz–Man' is sent to the server in UTF-8 as 0x4a617a7ae280934d616e.
2) MySQL, decoding those bytes in Windows-1252, understands them to represent the string 'Jazzâ€“Man'.
3) To store in a utf8 column, MySQL transcodes the string to its UTF-8 encoding 0x4a617a7ac3a2e282ace2809c4d616e. This can be verified by using SELECT HEX(name) FROM lessons WHERE id=79510.
4) When the client retrieves the value, MySQL thinks that it wants it in latin1 and so transcodes to the Windows-1252 encoding 0x4a617a7ae280934d616e.
5) When the client receives those bytes, it decodes them as UTF-8 and therefore understands the string to be 'Jazz–Man'.
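The steps above can be traced byte for byte outside MySQL. This is a sketch in Python, assuming Python's "cp1252" codec matches what MySQL calls latin1 for these bytes; the variable names are illustrative only.

```python
original = "Jazz\u2013Man"  # contains an en dash, U+2013

# The client sends UTF-8 bytes while claiming the connection is latin1.
sent = original.encode("utf-8")

# The server decodes those bytes as Windows-1252, producing mojibake text.
misread = sent.decode("cp1252")

# The server stores that text in a utf8 column (what HEX(name) would show).
stored = misread.encode("utf-8")

# On retrieval the server transcodes back to Windows-1252 for the client.
returned = stored.decode("utf-8").encode("cp1252")

# The client decodes what it receives as UTF-8 and sees the original text.
seen_by_client = returned.decode("utf-8")

print(sent.hex())
print(stored.hex())
print(seen_by_client == original)
```

The two mistakes cancel out for that one client, which is why the stored hex is the only reliable place to spot the problem.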
Conclusion: the client doesn't realise anything is wrong. Problems are only detected when a different client (one that does not misstate its UTF-8 connection as latin1) tries to use the table. In your case, this occurred when mysqldump obtained an export of the data; using the --default-character-set=latin1 --skip-set-charset options effectively forced mysqldump to behave in the same broken way as your application, so it ended up with correctly encoded data.
To fix your issue going forward, you must:
1) Configure your application so that it correctly sets its MySQL connection character set (e.g. set encoding: utf8 in config/database.yml for Rails);
2) Recode the data in your database, e.g. UPDATE lessons SET name = BINARY CONVERT(name USING latin1) (note that this must be done for every misencoded text column).
Also note that you will probably want to perform these two actions atomically, which may require some thought.
I managed to fix this by semi-accident. When I was trying to import the SQL data that had been dumped with the extra options relating to latin1 (see Edit 3 on my OP) I was getting an error message about the LC_CTYPE variable (I didn't make a note of the exact error, unfortunately). A bit of googling led me to set the following variables in his ~/.bash_profile file:
export LC_CTYPE=en_GB.UTF-8
export LANG=en_GB.UTF-8
After setting this, and opening a new console tab, i was able to import the data. But, it still looked wrong (though in a different way to before: ie some other messed up characters replacing the endash for example.) I scratched my head then did something else for a while.
Now, after he has restarted his laptop several times (because it's been a couple of weeks), it is all magically working. So, i think that a reboot fixed it. So, the answer is, i think, this:
Set these options in your rails config/database.yml file
encoding: utf8
collation: utf8_general_ci
Add these environment variables to ~/.bash_profile
export LC_CTYPE=en_GB.UTF-8
export LANG=en_GB.UTF-8
Add (or change if they are there already) these options to your mysql config (in this case, /Applications/MAMP PRO/MAMP PRO.app/Contents/Resources/my.cnf but a more common location would be /etc/mysql/my.cnf or /etc/my.cnf - look for it with locate my.cnf)
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
Now reboot your machine.
Then, when you do mysqldump, make sure you do it with these options (in addition to whatever other options you have)
--extended-insert --single-transaction --default-character-set=latin1 --skip-set-charset
Some of this may not be necessary, but i think it was all necessary for me!
Thanks to everyone who commented for your help.