Is it safe to update tables from utf8 to utf8mb4 in MySQL? - mysql

I am aware that similar questions have been asked before, but we need a more definitive answer.
Is it safe to update MySQL tables encoded in utf8 to utf8mb4 in all cases? More specifically, is it safe even for varchar fields holding strings generated with, for example (in Java):
new BigInteger(130, random).toString(32)
From our understanding utf8mb4 is a superset of utf8 so our assumption would be that everything should be fine, but we would love some input from more MySQL superusers.

How the data was originally inserted into MySQL is irrelevant. Let's suppose you used the entire character set of utf8, i.e. the BMP characters.
utf8mb4 is a superset of utf8mb3 (alias utf8), as documented here:
10.9.7 Converting Between 3-Byte and 4-Byte Unicode Character Sets
One advantage of converting from utf8mb3 to utf8mb4 is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements.
In terms of table content, conversion from utf8mb3 to utf8mb4 presents no problems:
For a BMP character, utf8mb4 and utf8mb3 have identical storage
characteristics: same code values, same encoding, same length.
For a supplementary character, utf8mb4 requires four bytes to store
it, whereas utf8mb3 cannot store the character at all. When
converting utf8mb3 columns to utf8mb4, you need not worry about
converting supplementary characters because there will be none.
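To make the point about table content concrete for the tokens in the question, here is a rough Python analogue of the Java snippet. It is only a sketch: `to_base32` and `max_utf8_bytes_per_char` are hypothetical helper names, and the digit alphabet is assumed to match what `BigInteger.toString(32)` produces (0-9 followed by a-v). Such tokens are pure ASCII, so they take one byte per character in both utf8mb3 and utf8mb4.

```python
import secrets

# Digit alphabet assumed to match Java's radix-32 output: 0-9 then a-v.
DIGITS = "0123456789abcdefghijklmnopqrstuv"

def to_base32(n):
    """Render a non-negative integer in radix 32."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, rem = divmod(n, 32)
        out.append(DIGITS[rem])
    return "".join(reversed(out))

def max_utf8_bytes_per_char(text):
    """Largest number of UTF-8 bytes needed by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)

token = to_base32(secrets.randbits(130))
print(token, token.isascii(), max_utf8_bytes_per_char(token))

# A BMP character needs at most 3 bytes; a supplementary one needs 4.
print(max_utf8_bytes_per_char("\u4e2d"), max_utf8_bytes_per_char("\U00020000"))
```

Anything that reports 3 or less here is stored identically under both character sets; only a value of 4 requires utf8mb4.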
In terms of table structure, these are the primary potential incompatibilities:
For the variable-length character data types (VARCHAR and the TEXT types), the maximum permitted length in characters is less for utf8mb4 columns than for utf8mb3 columns.
For all character data types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is less for utf8mb4 columns than for utf8mb3 columns.
Consequently, to convert tables from utf8mb3 to utf8mb4, it may be necessary to change some column or index definitions.
Personally, I had some issues with indexes on relatively long text columns, where the maximum index size was reached. It was a search index, not a unique index, so the workaround was to index fewer characters. See also this answer
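The index problem is plain arithmetic: the engine budgets the worst-case byte width for every indexed character. The sketch below assumes two byte limits for illustration: 767 bytes (the InnoDB limit with older row formats, which is not mentioned in this thread) and the 1000-byte MyISAM limit quoted in an error message further down this page. `max_indexable_chars` is a hypothetical helper, not a MySQL function.

```python
def max_indexable_chars(limit_bytes, bytes_per_char):
    """How many characters fit in an index key of limit_bytes bytes."""
    return limit_bytes // bytes_per_char

# Assumed limits: 767 (older InnoDB row formats) and 1000 (MyISAM).
for limit in (767, 1000):
    utf8mb3_chars = max_indexable_chars(limit, 3)
    utf8mb4_chars = max_indexable_chars(limit, 4)
    print(limit, utf8mb3_chars, utf8mb4_chars)
```

With a 767-byte limit this gives 255 characters for utf8mb3 but only 191 for utf8mb4, which is why an index that was fine before the conversion can suddenly be too long.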
Of course, I assume that you will keep the equivalent collation (for example, utf8_general_ci becoming utf8mb4_general_ci). If you change the collation, other issues apply.

Related

Chinese names and Unicode Basic Multilingual Plane (BMP)

I am building an application using MySQL, where Chinese names need to be stored in the database. I'm trying to decide whether the basic utf8 encoding (which only covers the Basic Multilingual Plane and stores a maximum of 3 bytes per character) is enough, or whether I need the utf8mb4 encoding, which permits characters from higher planes to be encoded/stored.
Is the Unicode Basic Multilingual Plane (BMP) sufficient to store all Chinese proper names?
MySQL's CHARACTER SET utf8 only handles UTF-8 codes of up to 3 bytes (the BMP). Instead, use CHARACTER SET utf8mb4, which also handles the 4-byte codes. Yes, that includes all of currently defined Unicode: Chinese, Emoji, etc.
Use version 5.7, if practical.
TL;DR it doesn't matter, stick with utf8mb4 encoding, especially for new applications.
Long-form answer: the key difference between the two encodings is that utf8, long supported by MySQL, supports UTF-8-encoded characters up to three bytes in length. As of 5.5.3, as noted by @rick-james, a new encoding, utf8mb4, relaxes this restriction and otherwise has no disadvantages.
According to the MySQL documentation, the newer utf8mb4 encoding lifts this arbitrary three-byte restriction, and there are few, if any, disadvantages:
For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it. Because utf8 cannot store the character at all, you have no supplementary characters in utf8 columns and need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
Thus, my original question was misconceived: the maximum number of bytes to encode each character of a Chinese name shouldn't matter so long as the encoding you use actually supports encoding all Unicode code points.
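One way to check whether a particular name would survive in a 3-byte utf8 column is to look for code points above U+FFFF. This is a minimal sketch; `supplementary_chars` and `fits_in_utf8mb3` are hypothetical helpers, and the sample strings are illustrative only (the second contains U+20BB7, a CJK Extension B code point).

```python
def supplementary_chars(text):
    """Return the characters of text that lie outside the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

def fits_in_utf8mb3(text):
    """True if every character can be stored in MySQL's 3-byte utf8."""
    return not supplementary_chars(text)

bmp_name = "\u738b\u5c0f\u660e"       # three common BMP ideographs
rare_name = "\U00020bb7\u91ce\u5bb6"  # starts with a supplementary ideograph

print(fits_in_utf8mb3(bmp_name))
print(fits_in_utf8mb3(rare_name))
print([hex(ord(ch)) for ch in supplementary_chars(rare_name)])
```

A single `False` in real data is enough to rule out the 3-byte character set for that column.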

Does MySQL UTF8 collation fit japanese and korean characters?

I've set all collations and character sets to UTF8 in PHP and MySQL. There is no problem. But as seen on http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html, the standard utf8_general_ci collation uses up to three bytes per character. That should be enough to store all BMP characters. But I've still found no hint as to whether all Korean and Japanese characters are included in the BMP, or whether there are characters that need four bytes to be stored. I simply want to know if utf8_general_ci and utf8_bin are really enough to store all Korean/Japanese characters, or if I have to use utf8mb4_general_ci and utf8mb4_bin.
The most frequently used characters are in the BMP. The characters in higher planes are mostly rare and historic, but some of them may be in use in personal names for example. If you can use utf8mb4 you probably should.

utf-8 vs latin1

What are the advantages/disadvantages between using utf8 as a charset against using latin1?
If utf can support more chars and is used consistently wouldn't it always be the better choice? Is there any reason to choose latin1?
UTF8 Advantages:
Supports most languages, including RTL languages such as Hebrew.
No translation needed when importing/exporting data to UTF8 aware components (JavaScript, Java, etc).
UTF8 Disadvantages:
Non-ASCII characters will take more time to encode and decode, due to their more complex encoding scheme.
Non-ASCII characters will take more space, as they may be stored using more than 1 byte (this applies to every character outside the 128 characters of the ASCII character set). A CHAR(10) or VARCHAR(10) field may need up to 30 bytes to store some UTF8 characters.
Collations other than utf8_bin will be slower (as the sort order will not directly map to the character encoding order), and will require translation in some stored procedures (as variables default to utf8_general_ci collation).
If you need to JOIN UTF8 and non-UTF8 fields, MySQL will impose a SEVERE performance hit. What would be sub-second queries could potentially take minutes if the fields joined are different character sets/collations.
Bottom line:
If you don't need to support non-Latin1 languages, want to achieve maximum performance, or already have tables using latin1, choose latin1.
Otherwise, choose UTF8.
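The storage point above can be checked quickly. In this sketch Python's `latin-1` codec stands in for MySQL's latin1 (an assumption; MySQL's latin1 is closer to cp1252, but both are single-byte), and `encoded_sizes` and `worst_case_bytes` are hypothetical helpers.

```python
def encoded_sizes(text):
    """Bytes needed for the same text in a single-byte charset and in UTF-8."""
    return {
        "latin1": len(text.encode("latin-1")),
        "utf8": len(text.encode("utf-8")),
    }

def worst_case_bytes(declared_chars, max_bytes_per_char):
    """Upper bound on the bytes a CHAR(n)/VARCHAR(n) value may need."""
    return declared_chars * max_bytes_per_char

print(encoded_sizes("hello"))      # ASCII only: same size in both
print(encoded_sizes("caf\u00e9"))  # one accented letter: UTF-8 is one byte longer
print(worst_case_bytes(10, 1), worst_case_bytes(10, 3), worst_case_bytes(10, 4))
```

Pure ASCII data costs nothing extra in UTF-8; the overhead only appears for characters beyond ASCII, and the 30-byte figure for a 10-character field is the 3-byte worst case.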
latin1 has the advantage that it is a single-byte encoding, therefore it can store more characters in the same amount of storage space, because the length of string data types in MySQL is dependent on the encoding. The manual states that:
To calculate the number of bytes used to store a particular CHAR,
VARCHAR, or TEXT column value, you must take into account the
character set used for that column and whether the value contains
multibyte characters. In particular, when using a utf8 Unicode
character set, you must keep in mind that not all characters use the
same number of bytes. utf8mb3 and utf8mb4 character sets can require
up to three and four bytes per character, respectively. For a
breakdown of the storage used for different categories of utf8mb3 or
utf8mb4 characters, see Section 10.9, “Unicode Support”.
Furthermore lots of string operations (such as taking substrings and collation-dependent compares) are faster with single-byte encodings.
In any case, latin1 is not a serious contender if you care about internationalization at all. It can be an appropriate choice when you will be storing known safe values (such as percent-encoded URLs).
@Ross Smith II, point 4 is worth gold: inconsistency between columns can be dangerous.
To add value to the already good answers, here is a small performance test about the difference between charsets:
A modern 2013 server, real use table with 20000 rows, no index on concerned column.
SELECT 4 FROM subscribers WHERE 1 ORDER BY time_utc_str; (4 is cache buster)
varchar(20) CHARACTER SET latin1 COLLATE latin1_bin: 15ms
varbinary(20): 17ms
utf8_bin: 20ms
utf8_general_ci: 23ms
For simple strings like numerical dates, my decision would be, when performance is concerned, using utf8_bin (CHARACTER SET utf8 COLLATE utf8_bin). This would prevent any adverse effects with other code that expects database charsets to be utf8 while still being sort of binary.
Fixed-length encodings such as latin-1 are always more efficient in terms of CPU consumption.
If the set of tokens in some fixed-length character set is known to be sufficient for your purpose at hand, and your purpose involves heavy and intensive string processing, with lots of LENGTH() and SUBSTR() stuff, then that could be a good reason for not using encodings such as UTF-8.
Oh, and BTW: do not confuse a character set with an encoding of it, as you seem to do. A character set is some defined set of writeable glyphs. The same character set can have multiple distinct encodings. The various versions of the Unicode standard each constitute a character set. Each of them can be encoded as UTF-8, UTF-16, or UTF-32 (which uses a full four bytes for every character), and the latter two can each come in a high-order-byte-first (big-endian) or high-order-byte-last (little-endian) flavour.
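The distinction is easy to see by encoding one and the same code point several ways. A minimal sketch using Python's built-in codecs; `encode_all` is a hypothetical helper and the euro sign is just a convenient sample character.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

def encode_all(ch):
    """Map each encoding name to the hex bytes it produces for ch."""
    return {name: ch.encode(name).hex() for name in ENCODINGS}

euro = "\u20ac"  # EURO SIGN, code point U+20AC
for name, hex_bytes in encode_all(euro).items():
    print(f"{name:10} {hex_bytes}")
```

The code point stays U+20AC throughout; only the byte sequence changes (three bytes in UTF-8, two in UTF-16, four in UTF-32, with the byte order flipped between the two flavours).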

CHARSET on column level in Mysql 5

My app has a table in which two columns need utf8 and the others are latin. The latin ones do not contain non-latin characters by definition, and the utf8 ones may or may not contain non-latin characters. One utf8 column is indexed and the other is not.
I have three questions:
Is mixing charsets on a column level a good practice?
If a row (on this table) contains only latin chars and no utf8 chars how are data storage and index size affected? Put another way, is a utf8 column data/index size same as latin without storing any utf8 text.
Quantitatively, how are data and index storage affected on utf8 columns with respect to latin?
Thanks
UTF-8 is a variable length encoding. Characters inside the ASCII set will be encoded with one byte as in latin1; characters beyond that will be encoded using up to four bytes. A string consisting of ASCII characters will have the same length in UTF8 and latin1.
Is mixing charsets on a column level a good practice?
I have never done this, and would tend to say no, as it complicates the database schema unnecessarily. While the database engine should be able to deal with it fine, I would not use mixed charsets out of storage considerations. The savings will be minimal at best.
The only valid reason to mix charsets that I can think of is the use of different collations for a specific sort order and/or case/accent sensitive/insensitive searching.

When to use utf-8 and when to use latin1 in MySQL?

I know that MySQL has default of latin1 encoding and apparently it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?
I am working on a site that I hope will be used globally. Do I absolutely need to have utf-8? Or will I be able to get away with using latin1?
Also, I tried to change some tables from latin1 to utf8 but I got this error:
Specified key was too long; max key length is 1000 bytes
Does anyone know the solution to this? And should I really solve that or may latin1 be enough?
Thanks,
Alex
it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?
It takes 1 byte to store a latin1 character and 1 to 3 bytes to store a UTF8 character.
If you only use basic Latin characters and punctuation in your strings (0 to 127 in Unicode), both charsets will occupy the same length.
Also, I tried to change some tables from latin1 to utf8 but I got this error: "Specified key was too long; max key length is 1000 bytes" Does anyone know the solution to this? And should I really solve that or may latin1 be enough?
If you have a utf8 column of VARCHAR(334) or longer, MyISAM won't let you create an index on it, since there is a remote possibility of the column occupying more than 1000 bytes (334 × 3 = 1002).
Note that keys of such length are rarely useful. You can create a prefixed index which will be almost as selective for any real-world data.
At a bare minimum I would suggest using UTF-8. Your data will be compatible with every other database out there nowadays since 90%+ of them are UTF-8.
If you go with LATIN1/ISO-8859-1, you risk the data not being properly stored because it doesn't support international characters, so you might end up with garbled text.
If you go with UTF-8, you don't need to deal with these headaches.
Regarding your error, it sounds like you need to optimize your database. Consider this: http://bugs.mysql.com/bug.php?id=4541#c284415
It would help if you gave specifics on your table schema and column for that issue.
If you allow users to post in their own languages, and if you want users from all countries to participate, you have to switch at least the tables containing those posts to UTF-8 - Latin1 covers only ASCII and western European characters. The same is true if you intend to use multiple languages for your UI. See this post for how to handle migration.
In my experience, if you plan to support Arabic, Russian, Asian languages or others, the investment in UTF-8 support upfront will pay off down the line. However, depending on your circumstances you may be able to get away with English for a while.
As for the error, you probably have a key or index field with more than 333 characters, the maximum allowed in MySQL with UTF-8 encoding. See this bug report.
We built an application using Latin because it was the default. But later on we had to change everything to UTF because of Spanish characters. It was not incredibly difficult, but there is no point in having to change things unnecessarily.
So short answer is just go with UTF-8 from the beginning, it will save you trouble later on.
Since the max length of a key is 1000 BYTES, if you use utf8, then this will limit you to 333 characters.
However, MySQL is different from Oracle for charsets. In Oracle you can't have a different character set per column, whereas in MySQL you can, so maybe you can set the key column to latin1 and the other columns to utf8.
Finally, I believe only the defunct version 6.0alpha (ditched when Sun bought MySQL) could accommodate Unicode characters beyond the BMP (Basic Multilingual Plane). So basically, even with UTF-8, you won't have the whole Unicode character set. In practice this is only a problem for rare Chinese characters, if that really matters to you.
I am not an expert, but I always understood that UTF-8 is actually a 4-byte wide encoding set, not 3. And as I understand it, the MySQL implementation of utf8_unicode_ci only handles a 3-byte wide encoding set...
If you want the full UTF-8 4-byte character encoding, you need to use the utf8mb4 character set (for example with the utf8mb4_unicode_ci collation) for your MySQL database/tables.
Current best practice is to never use MySQL's utf8 character set. Use utf8mb4 instead, which is a proper implementation of the standard.
See Adam Hooper's Explanation for more detail.
Note that in utf8mb4, characters have a variable number of bytes. As the name implies, characters are up to four bytes. Characters in the basic Latin (ASCII) range still occupy only one byte when encoded as utf8mb4. Other characters, including accented letters, Kanji, and emoji, require two, three, or four bytes to store.
The "Specified key was too long; max key length is 1000 bytes" error occurs when an index on utf8mb4 columns could exceed this limit, because every indexed character is budgeted at four bytes. You'll need to shorten some character columns, or index only a prefix of each column using this syntax, so that the total stays under the limit.
ALTER TABLE.. ADD INDEX `myIndex` ( column1(15), column2(200) );
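To pick prefix lengths that fit, add up the worst-case bytes of every indexed prefix. The sketch below is a rough estimate only: `prefix_index_bytes` is a hypothetical helper, and it ignores the few extra bytes MySQL may add per key part for length and NULL information.

```python
MAX_KEY_BYTES = 1000  # the MyISAM limit from the error message above

def prefix_index_bytes(prefix_lengths, bytes_per_char):
    """Worst-case key size for a composite index over column prefixes."""
    return sum(length * bytes_per_char for length in prefix_lengths)

# The prefix lengths used in the ALTER TABLE example: column1(15), column2(200).
prefixes = [15, 200]
for charset, width in (("utf8mb3", 3), ("utf8mb4", 4)):
    total = prefix_index_bytes(prefixes, width)
    print(charset, total, total <= MAX_KEY_BYTES)
```

Under these assumptions the example index needs at most 860 bytes in utf8mb4, so it stays below the 1000-byte limit, whereas indexing 250 or more characters in total would not leave room for any overhead.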