I use a Python program to write text containing Unicode characters to a MySQL
database. As an example, two of the characters are
u'\u2640' a symbol for Venus or female
u'\u2642' a symbol for Mars or male
I use utf8mb4 for virtually all character sets involved with MySQL. Here is
an excerpt from /etc/mysql/my.cnf
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
default-character-set=utf8mb4
character-set-server =utf8mb4
character_set_system =utf8mb4
In addition, all tables are created with these parameters:
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
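One way to confirm which character-set variables actually took effect is to query
them from Python (a minimal sketch using pymysql; the connection parameters are
placeholders):

import pymysql

# Hypothetical credentials -- adjust for your installation.
conn = pymysql.connect(host='localhost', user='me', password='secret',
                       database='mydb', charset='utf8mb4')
with conn.cursor() as cur:
    # Lists character_set_client, character_set_connection,
    # character_set_results, character_set_server, etc.
    cur.execute("SHOW VARIABLES LIKE 'character_set_%'")
    for name, value in cur.fetchall():
        print(name, '=', value)
conn.close()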
In all respects except one, the treatment of Unicode works just fine. I can
write Unicode to database tables, read it, display it, etc., with no
problems. The exception is mysql, the MySQL Command-Line Tool. When I
execute a SELECT statement to see rows in a table containing the Venus and
Mars Unicode characters, here is what I see on the screen:
| Venus | â™€ |
| Mars | â™‚ |
What I should see in the right column are the standard glyphs for Venus and
Mars.
Any ideas about how to get the MySQL Command-Line Tool to display Unicode
properly?
Edit:
I have done a fair amount of research into the various MySQL system
variables, etc., and I now realize that the my.cnf settings shown above have
some serious issues. In fact, the server, mysqld, would not launch with the
settings shown. To correct things, remove these from [mysqld]:
default-character-set=utf8mb4
character_set_system=utf8mb4
I'm not sure that the [client] option does anything, but it doesn't seem to
hurt.
In Python u'\u2640' represents a single Unicode character, namely "♀". Encoded
as UTF-8, it becomes the three bytes E2 99 80 (hex E29980). I am having no
problems at all encoding and decoding Unicode. The correct values are
being stored in a MySQL table; they are correctly read from the table, and
when displayed by a Python program they show up like this:
♀ Venus
♂ Mars
The program output can be redirected to a file, processed by a text editor,
etc., and in all cases the correct Unicode symbol is displayed.
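The byte-level claim above is easy to check at a Python prompt (Python 3 shown;
in Python 2 write u'\u2640'):

>>> '\u2640'.encode('utf-8')
b'\xe2\x99\x80'
>>> '\u2640'.encode('utf-8').hex().upper()
'E29980'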
There is only one place where the correct Unicode symbol is not displayed,
and that is when I am using the MySQL Command Line Tool. When I issue a
SELECT statement on the table containing the Unicode symbols I get the junk
shown above. This is not a Windows specific issue. I have exactly the same
problem with the MySQL Command Line Tool when I run it on Windows, Mac OS X,
and Ubuntu.
Windows cmd and utf8. If you are talking about Windows, then chcp 65001, plus picking the right font, is sufficient. See details.
Mojibake. If, on the other hand, you are complaining about "Mojibake" such as â™€ instead of ♀, then see the discussion of Mojibake. The hex for Venus (aka Female Sign), when correctly stored in utf8, is E29980. If you see C3A2 E284A2 E282AC, you have "double encoding", not simply Mojibake.
Do not use u'\u2640' anywhere in MySQL.
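Both failure modes can be reproduced in Python, which makes the byte patterns above concrete (a sketch; note that MySQL's latin1 is effectively cp1252):

venus = '\u2640'                    # the character ♀
good = venus.encode('utf-8')        # b'\xe2\x99\x80' -> hex E29980, correct storage
# Mojibake: correct UTF-8 bytes *displayed* as if they were latin1/cp1252
print(good.decode('cp1252'))        # â™€
# Double encoding: the Mojibake characters re-encoded as UTF-8 and stored
double = good.decode('cp1252').encode('utf-8')
print(double.hex().upper())         # C3A2E284A2E282AC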
I had the same issue querying WordPress with the mysql command-line program. You can see characters properly output to the terminal by using the --default-character-set=utf8mb4 option.
For example
$ mysql --default-character-set=utf8mb4 -D my_database -e "SELECT option_value FROM wp_options WHERE option_name = 'myoption'"
If you want to configure this for the user, simply edit ~/.my.cnf and add the following to the [client] section:
[client]
default-character-set = utf8mb4
Everything works for us this way.
I am somewhat embarrassed to report that there never was a problem with the
MySQL command-line tool displaying Unicode characters. Why did I think
there was?
I wrote a number of Python 2 programs using MySQLdb to communicate with
MySQL. My data involved Unicode characters such as the symbols for Mars and
Venus. I was able to write these Unicode characters to the database, read
them back, and, in general, operate on them just like any other characters.
There was one annoyance: Using the MySQL command-line tool, when I SELECTed
rows from tables containing symbols like Mars and Venus, I only saw junk.
That is what led me to my original post asking how I could get Unicode to
display properly. I never got a satisfactory answer.
Recently I began converting the Python 2 programs to Python 3, using pymysql
to communicate with MySQL. Immediately, I ran into problems. The Unicode
characters I was reading from the database seemed all wrong. Investigations
showed that, in fact, the bytes stored in the database (created with Python
2) did NOT form the correct utf8 sequences for the Unicode characters I was
using.
I converted the Python 2 program which created the tables to Python 3,
recreated the tables, and, presto change-o, everything worked. In other
words, the characters in the database had been wrong from day one, but when
read by a Python 2 program, the original Unicode characters were recreated
properly.
And, of course, suddenly, the MySQL command-line tool began displaying
Unicode characters just fine. The problem had been that the bytes in the
database created by Python 2 and MySQLdb were not proper utf8
representations of the characters I was storing. I do not know exactly what
the bytes were, and I have been dealing with this issue too long to spend time
trying to find out.
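For reference, here is the shape of a correct Python 3 round trip (a sketch
only; the table, column names, and credentials are hypothetical):

import pymysql

conn = pymysql.connect(host='localhost', user='me', password='secret',
                       database='mydb', charset='utf8mb4')
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS symbols (name VARCHAR(16), sym CHAR(1)) "
                "ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci")
    cur.execute("INSERT INTO symbols VALUES (%s, %s)", ('Venus', '\u2640'))
    cur.execute("SELECT sym, HEX(sym) FROM symbols WHERE name = %s", ('Venus',))
    print(cur.fetchone())           # ('♀', 'E29980') when everything is configured right
conn.commit()
conn.close()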
For anyone working with Unicode in MySQL, I recommend this article.
It shows all the MySQL parameters which must be set up for Unicode, and it
shows how you can view the parameters on your own MySQL installation.
Related
I'm having a problem with the encoding when dumping a database using mysqldump.
The issue is that the file being generated is breaking non-ASCII characters (for ex. german and spanish characters). The data in the DB is right, but it is exported wrong.
I have tried the following:
using --default-character-set with utf8, utf8mb4, and latin1 (the last because, although the tables use the utf8_general_ci collation, the database itself is set to latin1; I don't know why). Oddly enough, the output differs in file size, but the content (especially the problematic characters) shows the same issue in all three cases, as if the option were ignored.
importing the dumped file into a new MySQL instance; but since the characters are broken in the file, the import is broken too. For example, the dump made with the utf8mb4 option is imported into a fresh database with character set utf8mb4, but since the source file is wrongly encoded, it is not "transcoded back" to the right form.
Initially I thought it could be an issue with the MySQL server versions being different (5.7 on the source, 8.0 on the destination server), but since the file itself already seems broken, I now think this might not be the root cause. Still lost, so I prefer to mention it just in case it helps.
An example of the sentence I'm running:
mysqldump --default-character-set=utf8mb4 --no-tablespaces -u database_user -p database_name > /home/username/database_name-utf8mb4-20220712.sql
No errors appear during either the export or the import into the new server. Everything seems to run smoothly, but the character encoding is messed up, so something isn't OK.
Any support is much appreciated. Thank you!
but the character encoding is messed up
Give us an example. Include a hex dump of a small portion of the file where garbage shows up.
It is likely that the original data was either in character set utf8 or latin1, but the dumping and/or reloading specified the wrong character set. Please provide more details of the dump and load.
Also see: Trouble with UTF-8 characters; what I see is not what I stored
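A quick way to produce such a hex dump in Python (a sketch; the file name is taken from the command above, and the character to search for is whatever looks broken in your data):

# Show the raw bytes around a suspect character in the dump file
with open('/home/username/database_name-utf8mb4-20220712.sql', 'rb') as f:
    data = f.read()
pos = data.find('ü'.encode('utf-8'))   # try both the correct character and its Mojibake form
if pos != -1:
    print(data[max(0, pos - 16):pos + 16].hex(' '))   # .hex(sep) needs Python 3.8+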
I have a server hosting MySQL, PHPMyAdmin reports:
Server version: 5.1.56-community
MySQL charset: UTF-8 Unicode (utf8)
I export a SQL dump using either mysqldump -uroot -p database > file.dump or mysqldump -uroot -p database -r file.dump (both generated files are identical anyway).
Locally, I installed MySQL 5.5 and HeidiSQL 9.5.
As the server's my.ini file has:
default-character-set=utf8
I changed the local my.ini file to have
default-character-set=utf8
But also:
character-set-server=utf8
They were both set to latin1. Dunno why I have character-set-server set here while the server does not. Anyway.
Now when I start HeidiSQL, it shows utf8mb4 instead of utf8 for the session parameters. I don't know why.
Now, I import my dumped file, and I see that even if everything is apparently configured in utf8, it looks like I have some encoding problems.
On the server, the à displays correctly. Locally, in HeidiSQL, special characters like à are not displayed correctly.
Am I doing something wrong?
Note that if I install HeidiSQL on the server, the variable tab shows the same values for the Session and Global parameters, and the à is shown correctly.
So this may be the root cause of the problem, but I don't know how to fix it. If I change the Session values before importing the SQL file, it does not fix the issue, and the values are back to utf8mb4 when I start HeidiSQL again.
Thanks to deceze comment, I could fix the issue.
In HeidiSQL, when I choose the sql file to execute, there's actually an "Encoding" option I did not notice originally ;-)
If I keep "auto-detect", the import generates bad content (with mojibake characters)
If I force "UTF-8", the import is perfect
Dunno why HeidiSQL fails to auto-detect the encoding...
A few thoughts:
It looks like you have the character set configured correctly. The fact that HeidiSQL displays a different character set is probably because clients themselves set a character set.
For example, your MySQL server might use "character set A" by default. If a client connects and says it wants "character set B", the server will convert this on the fly.
utf8mb4 is a superset of (and superior to) utf8. It's better to have your server default to utf8mb4. The most popular use case for utf8mb4 is emoji.
Anyway, the reason you are getting mojibake is probably unrelated to having these character sets set correctly.
What I think may have happened is as follows (this is a guess).
Your tables/columns were set as UTF-8.
A client connects and tells the server "I want to use ISO-8859-1/latin1 instead".
The server happily complies and will convert the client's ISO-8859-1 strings to UTF-8 on the fly.
Despite asking for ISO-8859-1, the client actually sends UTF-8.
The server thinks the data is ISO-8859-1 and treats it as such, converting the UTF-8 bytes to UTF-8 a second time. It's effectively a double encoding.
If I'm right, it means that you can have all your columns, connections and tables set to UTF-8, but your data is simply bad.
If this is correct, the process is reversible.
You really just need the opposite operation. For example, if you had a PHP string $data which is 'double-encoded' as UTF-8, the fix would simply be to call this:
$output = utf8_decode($data);
It's also possible to fix this in MySQL. See this stack overflow question.
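The same reversal is possible in Python, for illustration (a sketch; note that MySQL's latin1 is really cp1252, which is why cp1252 appears here):

bad = 'Ã\xa0'                                  # how a double-encoded 'à' reads back (Ã plus a non-breaking space)
fixed = bad.encode('cp1252').decode('utf-8')   # undo one level of double encoding
print(fixed)                                   # à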
A few things to be aware of:
Make sure this is actually the case. Are you getting the correct output after this operation?
Make backups, obviously.
Also make absolutely sure that whatever was writing double-encoded UTF-8 to your database is now fixed. The last thing you want is a table that's a mixture of different encodings.
Sidenote: This problem is extremely common. You are somewhat lucky that you're French, because it highlights the problem. Many English systems I've seen have this issue, but it largely goes unnoticed for a long time because a lot of text doesn't go outside the common ASCII range.
You have "Mojibake". à turns into à (there are two characters, the second is a space).
This is caused when latin1 is involved somewhere in the process. The SESSION and GLOBAL settings are not at fault. Let's see SHOW CREATE TABLE.
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored for the likely causes. It may involve "Double Encoding"; let's see SELECT col, HEX(col) ....
As for fixing the data -- It depends on whether you have simply Mojibake or Double Encoding. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases for both.
I'm using the Encoding UTF-8 Unicode (utf8mb4) and Collation utf8mb4_unicode_520_ci for both tables and fields in my MySQL database.
When I export the database from Sequel Pro and open the exported .sql file in a text editor my test character 𝌆 appears correctly, but when I import the file back into Sequel Pro it appears as ???? both in Sequel Pro and in my PHP/MySQL app.
In the import window I've tried Autodetect and Unicode (UTF-8) without success. Any ideas?
Also, is there any newer encoding out there that I should use instead, and is there any benefit of using utf8mb4_unicode_520_ci instead of just utf8mb4_unicode_ci?
Edit / Here's a picture of what I'm trying to do. It seems like my "odd" character is on track all the way until I try to import the .sql file back into Sequel Pro.
The COLLATION does not matter except for ordering. The CHARACTER SET does matter, since this is a 4-byte code.
Somehow CHARACTER SET utf8 got involved, in spite of what you say. See "question marks" in Trouble with utf8 characters; what I see is not what I stored for the likely causes.
Do SELECT HEX(...) ... to verify that that character was actually stored as hex F09D8C86.
Provide SHOW CREATE TABLE so we can verify that the column is utf8mb4.
And, let's see the connection parameters.
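You can check the client side of that first with plain Python (the expected UTF-8 bytes for U+1D306):

>>> '\U0001d306'.encode('utf-8').hex().upper()
'F09D8C86'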
I came across a mind-puzzling problem with MySQL encoding today and would appreciate ideas on how to debug it further.
I had to update an old Perl application, using MySQL 5.6, which originally handled only English and to which I had to add some Unicode support (for Khmer script).
I figured it would be best to do a test install. I took a dump of the prod DB, imported it into a test DB, and changed the charset of the tables that needed support to utf8 with collation utf8_unicode_ci.
All worked well, so I went to apply the change to production. I ran the SQL migration scripts to change the charsets, deployed the new code, and ... Khmer characters do store/show fine, but legacy è characters show as a question mark in a black square.
What really puzzles me is that
test and prod run on the same (windows) box, same mysql server instance
both test and prod databases have the same charsets and collations
for the table in question, test and prod show create table statements are identical
the same code connected to test works fine but connected to prod doesn't
I thought maybe the original data got mangled in the process, so I deleted it and reinserted it through the app interface. It still worked on test but not on prod.
Same code works on test so code is probably not the issue.
Both on same server instance so probably not server config issue.
Khmer script works fine so probably not a utf "configuration" issue.
New data is wrongly handled, so it's probably not a data migration/conversion issue.
So 2 questions:
is the question mark in a black square a sign of double encoding or just wrong encoding?
how can I debug this further? Is there any way to see the "raw" MySQL stored data, for example, so I could compare?
Any input greatly appreciated.
When trying to use utf8/utf8mb4, if you see Black Diamonds with question marks,
one of these cases exists:
Case 1 (original bytes were not utf8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT were not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were utf8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>
Not relevant, but since you brought it up:
When trying to use utf8/utf8mb4, if you see Mojibake, check the following.
This discussion applies to Double Encoding, which is not necessarily visible.
The bytes to be stored need to be utf8-encoded.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4).
HTML should start with <meta charset=UTF-8>.
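As for seeing the "raw" stored data: HEX() shows exactly what bytes are stored, so you can compare test and prod directly (a pymysql sketch; the database, table, and column names are placeholders):

import pymysql

for db in ('test_db', 'prod_db'):
    conn = pymysql.connect(host='localhost', user='me', password='secret',
                           database=db, charset='utf8mb4')
    with conn.cursor() as cur:
        cur.execute("SELECT col, HEX(col) FROM mytable WHERE id = %s", (1,))
        print(db, cur.fetchone())   # 'è' should be C3A8 in utf8, E8 in latin1
    conn.close()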
I want to transfer a 3.23.49 MySQL database to a 5.0.51 MySQL database. Now I have exported the SQL file and I'm ready for import. I looked in the SQL file and Notepad++ shows me that the file is encoded in ANSI. I looked at the values and some of them are in ANSI and some of them are in UTF-8. What is the best way to proceed?
Should I change the encoding within Notepad++?
Should I use ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8;?
Should I use iconv?
Do I have to look through each table and make the necessary changes?
What are the settings for the import? MYSQL323 compatibility mode and encoding latin1?
Do I have to be aware of anything if the PHP scripts are using another encoding?
Thank you for your hints!
If the problem is to import a utf8-encoded mysql dump, the solution is usually to add --default-character-set=utf8 to mysql options:
mysql --default-character-set=utf8 -Ddbname -uuser -p < dump.sql
UPD1: In case the dump file is corrupted, I would try to export the database once again table by table so that the dump would result in a correct utf8 encoded file.
I have converted a MySQL 4.0 database (which also had no notion of character encoding yet) to MySQL 5.0 four years ago, so BTDT.
But first of all, there is no "ANSI" character encoding; that is a misconception and a misnomer that has caught on from the early versions of Windows (there are ANSI escape sequences, but they have nothing to do with character encoding). You are most certainly looking at Windows‑1252-encoded text. You should convert that text to UTF‑8 as then you have the best chance of keeping all used characters intact (UTF‑8 is a Unicode encoding, and Unicode contains all characters that can be encoded with Windows-125x, but at different code points).
I had used both the iconv and recode programs (on the Debian GNU/Linux system that the MySQL server ran on) to convert Windows‑1252-encoded text of a MySQL export (created by phpMyAdmin) to UTF‑8. Use whatever program or combination of programs works best for you.
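If you prefer Python to iconv or recode, the equivalent conversion is short (a sketch; the file names are placeholders):

# Read the dump as Windows-1252 and rewrite it as UTF-8
with open('dump.sql', 'r', encoding='cp1252') as src:
    text = src.read()
with open('dump-utf8.sql', 'w', encoding='utf-8') as dst:
    dst.write(text)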
As to your questions:
You can try, but it might not work. In particular, you might have trouble opening a large database dump with Notepad++ or another text editor.
Depends. ALTER TABLE … CONVERT TO … does more than just converting encodings.
See the paragraph above.
Yes. You should set the character encoding of every table and every text field that you are importing data into, to utf8 (use whatever utf8_… collation fits your purpose or data best). ALTER TABLE … CONVERT TO … does that. (But see 2.)
I don't think MYSQL323 matters here, as your export would contain only CREATE, INSERT and ALTER statements. But check the manual first (the "?" icon next to the setting in phpMyAdmin). latin1 means "Windows-1252" in MySQL 5.0, so that might work and you must skip the manual conversion of the import then.
I don't think so; PHP is not yet Unicode-aware. What matters is how the data is processed by the PHP script. Usually the Content-Type header field for your generated text resources using that data should end with ; charset=UTF-8.
On an additional note, you should not be using MySQL 5.0.x anymore. The current stable version is MySQL 5.5.18. "Per the MySQL Support Lifecycle policy, active support for MySQL 5.0 ended on December 31, 2009. MySQL 5.0 is now in the Extended support phase." MySQL 5.0.0 Alpha having been released on 2003-12-22, Extended Support is expected to end 8 full years after that, on 2011‑12‑31 (this year).