When to use what encoding on tables - mysql

I recently had to change MySQL from latin1 to utf8 to handle Russian characters; they were originally showing up as ?????.
I also had to change a couple of tables in my database to utf8mb4. I originally had these set to utf8, but MySQL's utf8 stores at most 3 bytes per character and could not handle certain characters.
I have to make a change to a production database and want to ensure that I do not have any issues a few months down the line with a particular encoding type.
So my question is: when do I use which encoding on a table?

You have multiple questions.
The "???" probably came from converting from latin1 to utf8 incorrectly. The data is now lost, since only '?' remains. SELECT HEX(...) ... to confirm that all you get is 3F (?) where you should get something useful.
See "question marks" in Trouble with utf8 characters; what I see is not what I stored .
utf8mb4 and utf8 handle Cyrillic (Russian) identically, so the CHARACTER SET is not the issue with respect to the "???".
If you have an original copy of the data, then probably you want the 3rd item in here -- "CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset". That is what I call the two-step ALTER.
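A sketch of that two-step ALTER, assuming a hypothetical table t with a VARCHAR(100) column col that is labeled latin1 but holds utf8 bytes:

-- Step 1: drop the charset label without touching the bytes.
ALTER TABLE t MODIFY col VARBINARY(100);
-- Step 2: re-label the same bytes as utf8mb4 (or utf8); again no conversion.
ALTER TABLE t MODIFY col VARCHAR(100) CHARACTER SET utf8mb4;

Going through VARBINARY is what prevents MySQL from transcoding the data; a plain ALTER ... CONVERT TO would convert the bytes and make things worse. (Column lengths may need adjusting to fit your schema.)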
As for avoiding future issues... See "Best Practice" in my first link. If all you need is European (including Russian), either utf8 or utf8mb4 will suffice. But if you want Emoji or all of Chinese, then go with utf8mb4.
Also, note that you must specify what charset the client is using; this is a common omission, and was probably part of what got you in trouble in the first place.
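The usual way to announce the client's charset, right after connecting (or via the equivalent connector option):

SET NAMES utf8mb4;
-- Sets character_set_client, character_set_connection and
-- character_set_results in one statement.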

Related

How can I determine utf8 data encoding error and correct it in MySql?

I have a website form written in Perl that saves user input in multiple languages to a MySQL database. While it has worked perfectly saving and displaying all characters without problems, in PHPMyAdmin the characters always displayed with errors. However I ignored this since the website was displaying characters OK.
Now I've just recently moved the website to a VPS and the database has seemingly enforced utf8mb4 encoding on the data, so it is now displaying character errors on the site. I'm not an expert and find the whole encoding area quite confusing. My question is, how can I:
a) determine how my data is actually encoded in my table?
b) convert it correctly to utf8mb4 so it displays correctly in PHPMyAdmin and my website?
All HTML pages use the charset=utf8 declaration. MySQL connection uses mysql_enable_utf8 => 1. The table in my original database was set to utf8_general_ci collation. The original database collation (I just noticed) was set to latin1_swedish_ci. The new database AND table collation is utf8mb4_general_ci. Thanks in advance.
SHOW CREATE TABLE will tell you the default CHARACTER SET for the table. For any column(s) that overrode the default, the column will specify what it is set to.
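For example (hypothetical table name):

SHOW CREATE TABLE users;
-- The output ends with something like:
--   ) ENGINE=InnoDB DEFAULT CHARSET=latin1
-- and an overriding column looks like:
--   `name` varchar(100) CHARACTER SET utf8mb4 ...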
However, there could be garbage in the column. Many users have encountered this problem when they stored utf8 bytes into a latin1 column. This leads to "Mojibake" or "double encoding".
The only way to tell what is actually stored there is to SELECT HEX(col). Western European accented characters will be:
one byte for a latin1 character stored in a latin1 column;
two bytes for a utf8 character stored in a utf8 column (or, when mis-stored, in 2 latin1 characters);
several bytes per character for "double encoding", where the text was converted twice.
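As a concrete sketch, for the letter é (hypothetical table and column names):

SELECT name, HEX(name) FROM users LIMIT 10;
-- E9        one latin1 byte: correct latin1 storage
-- C3A9      two bytes: the UTF-8 encoding of é
-- C383C2A9  four bytes: é double-encoded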
More discussion: Trouble with UTF-8 characters; what I see is not what I stored

Can't insert Chinese characters into MySQL when I use Microsoft's cmd (or PowerShell)

(Screenshots in the original question showed the connection status, the CREATE TABLE for the message table, the mysql.ini settings, and the failing INSERT.)
I set all places to utf8, but it still doesn't work.
But when I use a graphical interface to insert Chinese characters into MySQL, I succeed.
Could you give me some suggestions to fix it?
Some Chinese characters require utf8mb4; utf8 is not sufficient.
Use InnoDB, not MyISAM.
But the real problem is the encoding:
These two CHARACTER SETs: gb2312 and gbk use hex d6d0c3c4 for '中媚'. However, I see that the first character matches, but not the second. I can investigate further if you can paste the Chinese character(s) into your question; I cannot work with images.
(Big5 does not work for either character.)
Meanwhile, try this; it may work better, though not perfectly: after connecting, do SET NAMES gbk; to announce the encoding used by the client.
Note: The encoding for the column does not need to be identical to the encoding for the client. In the long run, it is best to use utf8mb4 for columns, regardless of what is used in the client.
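A sketch of that setup, reusing the question's message table (the column name is hypothetical):

CREATE TABLE message (
    txt VARCHAR(255) CHARACTER SET utf8mb4
) ENGINE=InnoDB;

-- If the Windows console sends GBK bytes, announce that:
SET NAMES gbk;
INSERT INTO message (txt) VALUES ('中文');
-- MySQL transcodes from the client's gbk to the column's utf8mb4.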

latin1_general_ci and charset ISO-8859-15

I try to store strings of charset ISO-8859-15 in MySQL fields with CHARACTER SET latin1 COLLATION latin1_general_ci.
It seems that the two are not fully compatible: I am not able to save a correct € sign.
Can anybody tell me the correct CHARACTER SET for ISO-8859-15?
According to Wikipedia, there are 8 differences between ISO-8859-1 and ISO-8859-15, and the € sign is one of them. On my copy of MySQL 5.6 I see a latin1 (ISO-8859-1) CHARACTER SET, but no latin9 (ISO-8859-15).
It is possible to add your own character set and collation to MySQL, but that may be more than you want to tackle. There is a Worklog for adding it, but they need nudging to ever get it done.
Sorry. Can you live with MySQL's latin1? It is really cp1252, which does include the Euro sign, though at 0x80 rather than ISO-8859-15's 0xA4. Or, better yet, switch to utf8, which has all the characters, and a zillion more.
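To verify, a minimal sketch:

SELECT HEX(CONVERT('€' USING latin1));  -- 80     (cp1252 position of €)
SELECT HEX(CONVERT('€' USING utf8mb4)); -- E282AC (the UTF-8 encoding)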

How to detect UTF-8 characters in a Latin1 encoded column - MySQL

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check if I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my MySQL client encoding to UTF-8.
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:
-- Round-trip the raw bytes through both charsets and compare by eye
-- (rows containing any byte >= 0x80 are the only interesting ones):
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');
This is made unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80, which makes it necessary to use UNHEX() instead.
This produces results like this:
latin1            utf8
----------------  --------
Björn            Björn
Since your question is not completely clear, let's assume some scenarios:
Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: dump the database contents to a file through a latin1 connection. This will translate the incorrectly stored data into correctly stored UTF-8, the way it has worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any non-latin1 character should be replaced by a ?.
Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
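For the third scenario, the conversion is a one-liner; a sketch with hypothetical names:

-- Let MySQL transcode correctly stored latin1 data:
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4;
-- or per column:
ALTER TABLE users MODIFY name VARCHAR(100) CHARACTER SET utf8mb4;

Note this is the opposite of the first scenario, where the bytes must not be converted.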
There is a script on GitHub to help with this sort of thing.
I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.
Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected; if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace UTF-8 "Ã¶" with Latin-1 "ö", etc.).

Why does MySQL use latin1_swedish_ci as the default?

Does anyone know why latin1_swedish_ci is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case that does not seem to be what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation; it comes with some specific sorting quirks (Å, Ä and Ö sort after Z), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as "undefined," whereas cp1252, and therefore MySQL's latin1, assign characters for those positions.
From http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html, which might help you understand why.
Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes equals its length in characters. With a multi-byte encoding, functions like SUBSTRING become ambiguous: do you mean characters or bytes? For the same reasons, supporting multi-byte encodings required quite a big change to the internal code.
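The byte/character distinction is easy to demonstrate; a minimal example:

SELECT LENGTH(_utf8mb4'Björn'), CHAR_LENGTH(_utf8mb4'Björn');
-- Returns 6 and 5: 'ö' is two bytes in UTF-8 but one character.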
Most strange features of this kind are historic. They did it like that long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF-8 wasn't popular then. Or perhaps MySQL didn't yet support charsets where multiple bytes encode one character.
To expand on why the default is not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 existed. As explained above, this is likely why utf8 is not the default (backward compatibility, and the expectations of 3rd-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support but capped it at 3 bytes of storage per character. Now that it exists, they have chosen not to extend it to 4 bytes or remove it; instead they added utf8mb4 ("multi-byte 4"), which is real 4-byte UTF-8.
It's important that anyone migrating a MySQL database to utf8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
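A sketch of the difference (hypothetical table name; utf8mb3 is the explicit name for MySQL's 3-byte utf8):

CREATE TABLE demo (
    three VARCHAR(10) CHARACTER SET utf8mb3,
    four  VARCHAR(10) CHARACTER SET utf8mb4
);
INSERT INTO demo (four)  VALUES ('😀');  -- works: emoji need 4 bytes
INSERT INTO demo (three) VALUES ('😀');  -- fails with "Incorrect string value"
                                         -- (or mangles the data in non-strict mode)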