MySQL: adding support for Asian characters to an existing database

I'm looking for a best-practices approach to adding support for Asian character sets to an existing database. We have existing tables that are in the latin1 charset:
show create table books
CREATE TABLE `books` (
`id` varchar(255) NOT NULL,
`category` varchar(255) default NULL,
`contactEmail` varchar(255) default NULL,
`description` text,
`price` varchar(255) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Currently when we enter UTF8 chars for the description field, we get back '?' chars for Asian chars on the round-trip. Latin1 chars work just fine.
Can I simply convert this table with something like this?
ALTER TABLE books CONVERT TO CHARACTER SET utf8
I understand that this won't magically fix data already present in the table. I just want it to work properly for new data going forward.
Do I need to worry about collation? I have no idea how that would work for non-latin characters.
Would it make sense to make utf8 the default for the database? Are there any caveats to that?
Thanks

I don't have much experience with how MySQL handles character sets, but I have experience with character sets in general.
Currently when we enter UTF8 chars for the description field, we get back '?' chars for Asian chars on the round-trip. Latin1 chars work just fine.
Because your table uses latin1 for its encoding, it can only store characters that are present in the latin1 character set. Latin1 is shorthand for ISO-8859-1; if you look at its character table you'll see it contains no Asian characters, which is why they won't store. I'm a little surprised MySQL doesn't raise an error on such input.
Would it make sense to make utf8 the default for the database? Are there any caveats to that?
UTF-8 would be a good choice if you need to store characters from multiple languages. UTF-8, as a Unicode encoding, will let you store any Unicode character (there are over a hundred thousand of them), from many languages. You could store the string "Dog café θλφ 你好" using UTF-8. UTF-8 is widely used, and is able to encode just about anything; I highly recommend it.
I would peruse the Internet to find literature on converting MySQL tables, to make sure there aren't any gotchas. If this is production data, test on an offline dataset — a development table or a QA table.
Last, you seem to indicate that there are half-stored Asian characters somehow in your DB. I'd figure out exactly what is stored: if it's the UTF-8 byte sequence for the Asian character, but the database thinks it's latin1 (a classic case of mojibake), some recovery may be possible. I would worry that the conversion may attempt to transform the UTF-8 code units as if they were latin1, resulting in very interesting output. Test, test, test.
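If it does turn out that UTF-8 bytes are sitting in the latin1 column, one commonly cited recovery path is to convert the column to a binary type first (which changes no bytes) and then to the target character set (which reinterprets them). A minimal sketch against the books table from the question; back up and test on a copy first:
ALTER TABLE books MODIFY description BLOB;           -- no byte conversion, just reinterpretation as binary
ALTER TABLE books MODIFY description TEXT CHARACTER SET utf8;  -- bytes now read as utf8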

The fact that you're getting back '?' is a good sign, as it would suggest that the characters not present in Latin-1 have been properly converted to the replacement character. Before embarking on a project to convert the data, make sure that everything in there is sane. This is especially important if you have more than one application and programming language writing to the database.
One of the simplest ways to do a rough and ready sanity check is to check the character length against the byte length.
SELECT length(foo), char_length(foo) FROM bar
The first returned value is the length of the string in bytes, the second is the length of the string in characters. If there are any multi-byte characters in there somehow, these two values will differ.
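For example, to flag suspect rows in the books table from the first question (a minimal sketch; note that on a pure latin1 column the two lengths always match, so this check is most telling on utf8 columns or after conversion):
SELECT id, description FROM books WHERE LENGTH(description) <> CHAR_LENGTH(description);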
There are a great many guides to converting available on the internet, and of those I've found one in particular to be incredibly useful.

Related

How can I determine utf8 data encoding error and correct it in MySql?

I have a website form written in Perl that saves user input in multiple languages to a MySQL database. While it has worked perfectly saving and displaying all characters without problems, in PHPMyAdmin the characters always displayed with errors. However I ignored this since the website was displaying characters OK.
Now I've just recently moved the website to a VPS and the database has seemingly enforced utf8mb4 encoding on the data, so it is now displaying character errors on the site. I'm not an expert and find the whole encoding area quite confusing. My question is, how can I:
a) determine how my data is actually encoded in my table?
b) convert it correctly to utf8mb4 so it displays correctly in PHPMyAdmin and my website?
All HTML pages use the charset=utf8 declaration. MySQL connection uses mysql_enable_utf8 => 1. The table in my original database was set to utf8_general_ci collation. The original database collation (I just noticed) was set to latin1_swedish_ci. The new database AND table collation is utf8mb4_general_ci. Thanks in advance.
SHOW CREATE TABLE will tell you the default CHARACTER SET for the table. For any column(s) that overrode the default, the column will specify what it is set to.
However, there could be garbage in the column. Many users have encountered this problem when they stored utf8 bytes into a latin1 column. This leads to "mojibake" or "double encoding".
The only way to tell what is actually stored there is to SELECT HEX(col). Western European accented characters will be:
one byte for a latin1 character stored in a latin1 column,
two bytes for a utf8 character stored in a utf8 column (or the same two bytes mis-stored as two latin1 characters),
several bytes for "double encoding" when converted twice.
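To see this in practice, inspect a few rows (a hypothetical sketch; substitute your own table and column names):
SELECT col, HEX(col) FROM tbl LIMIT 10;
-- 'é' shows as E9 if stored as latin1, as C3A9 if stored as utf8,
-- and as C383C2A9 if it has been double-encoded.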
More discussion: Trouble with UTF-8 characters; what I see is not what I stored

What will happen if I changed the collation of my database?

I am trying to increase performance in MySQL. One of the things I learned is that using the latin1 charset is faster than using utf8, because latin1 uses fewer bytes. But I am wondering what will happen to the data if I change the collation? In my application today most of the content is in American English, but I can't guarantee that no other languages will be stored as well. If someone stores data other than English, I don't really care about that data much.
My question:
1) If I change the collation in my databases to latin1, what will happen to data that was not written in American English?
2) Which latin1_* do I use: latin1_bin, latin1_general_ci, or latin1_general_cs? And if possible, what is the difference?
3) When changing the collation of the database, do I need to also change the collation of each table separately?
Thanks
UTF-8 only uses extra bytes for characters outside ASCII. So really, you should NOT change your character set; it won't help. UTF-16 was developed to hold all characters of all languages, and it uses at least 16 bits per character, so if you were using utf16 and mostly had standard Latin characters, I would suggest moving to utf8. UTF-8 is the compromise: it has special lead bytes that mean "more bytes coming", and when it sees one it groups the following bytes together into one character. But if all you have is plain ASCII characters, the byte count will be exactly the same as with latin1.
To answer specifically: you can set the default collation for new tables, but yes, you have to convert each of the existing tables separately. You could do it with an SQL statement that lists the tables and then runs the conversion statement on each (convert one table by hand first and note the SQL statement it takes). But again, don't do it: utf8 is the standard for a reason, and your performance issues are elsewhere.
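That said, if you do decide to convert everything, a common way to script it is to generate the ALTER statements from information_schema (a hypothetical sketch; 'mydb' and the target charset/collation are placeholders, and each generated statement should be reviewed before being run):
SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name,
              '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;')
FROM information_schema.tables
WHERE table_schema = 'mydb' AND table_type = 'BASE TABLE';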

utf-8 vs latin1

What are the advantages/disadvantages between using utf8 as a charset against using latin1?
If utf8 can support more chars and is used consistently, wouldn't it always be the better choice? Is there any reason to choose latin1?
UTF8 Advantages:
Supports most languages, including RTL languages such as Hebrew.
No translation needed when importing/exporting data to UTF8 aware components (JavaScript, Java, etc).
UTF8 Disadvantages:
Non-ASCII characters will take more time to encode and decode, due to their more complex encoding scheme.
Non-ASCII characters will take more space as they may be stored using more than 1 byte (characters not in the first 127 characters of the ASCII characters set). A CHAR(10) or VARCHAR(10) field may need up to 30 bytes to store some UTF8 characters.
Collations other than utf8_bin will be slower, as the sort order does not directly map to the character encoding order, and will require translation in some stored procedures (as variables default to the utf8_general_ci collation).
If you need to JOIN UTF8 and non-UTF8 fields, MySQL will impose a SEVERE performance hit. What would be sub-second queries could potentially take minutes if the joined fields are in different character sets/collations (see the sketch below).
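To illustrate that last point, here is a hypothetical sketch (the orders/products tables and the code column are invented for the example):
-- joining columns whose character sets differ forces a per-row conversion
-- and can prevent MySQL from using the index on one side of the join
EXPLAIN SELECT o.id FROM orders o JOIN products p ON o.code = p.code;
-- the usual fix is to align the column character sets:
ALTER TABLE orders MODIFY code VARCHAR(32) CHARACTER SET utf8;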
Bottom line:
If you don't need to support non-Latin1 languages, want to achieve maximum performance, or already have tables using latin1, choose latin1.
Otherwise, choose UTF8.
latin1 has the advantage that it is a single-byte encoding, therefore it can store more characters in the same amount of storage space, because the length of string data types in MySQL is dependent on the encoding. The manual states that
To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multibyte characters. In particular, when using a utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes. utf8mb3 and utf8mb4 character sets can require up to three and four bytes per character, respectively. For a breakdown of the storage used for different categories of utf8mb3 or utf8mb4 characters, see Section 10.9, "Unicode Support".
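A quick way to see this in action (a minimal sketch; it assumes the connection character set transmits 'é' intact; LENGTH counts bytes, CHAR_LENGTH counts characters):
SELECT LENGTH(CONVERT('héllo' USING latin1)), CHAR_LENGTH(CONVERT('héllo' USING latin1));  -- 5 bytes, 5 characters
SELECT LENGTH(CONVERT('héllo' USING utf8)), CHAR_LENGTH(CONVERT('héllo' USING utf8));      -- 6 bytes, 5 characters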
Furthermore lots of string operations (such as taking substrings and collation-dependent compares) are faster with single-byte encodings.
In any case, latin1 is not a serious contender if you care about internationalization at all. It can be an appropriate choice when you will be storing known safe values (such as percent-encoded URLs).
@Ross Smith II, point 4 is worth gold, meaning inconsistency between columns can be dangerous.
To add value to the already good answers, here is a small performance test about the difference between charsets:
A modern 2013 server, a table in real use with 20000 rows, no index on the column concerned.
SELECT 4 FROM subscribers WHERE 1 ORDER BY time_utc_str; (the 4 is a cache buster)
varchar(20) CHARACTER SET latin1 COLLATE latin1_bin: 15ms
varbinary(20): 17ms
utf8_bin: 20ms
utf8_general_ci: 23ms
For simple strings like numerical dates, when performance is a concern, my choice would be utf8_bin (CHARACTER SET utf8 COLLATE utf8_bin). This prevents any adverse effects with other code that expects database charsets to be utf8, while still being sort-of binary.
Fixed-length encodings such as latin-1 are always more efficient in terms of CPU consumption.
If the set of tokens in some fixed-length character set is known to be sufficient for your purpose at hand, and your purpose involves heavy and intensive string processing, with lots of LENGTH() and SUBSTR() stuff, then that could be a good reason for not using encodings such as UTF-8.
Oh, and BTW: do not confuse, as you seem to, a character set with an encoding thereof. A character set is some defined set of writeable glyphs. The same character set can have multiple distinct encodings. The various versions of the Unicode standard each constitute a character set. Each of them can be subjected to UTF-8, UTF-16, or "UTF-32" encoding (the last is not an official name, but it refers to the idea of using a full four bytes for any character), and the latter two can each come in a HOB-first or HOB-last (high-order-byte) flavour.

When to use utf-8 and when to use latin1 in MySQL?

I know that MySQL has default of latin1 encoding and apparently it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?
I am working on a site that I hope will be used globally. Do I absolutely need to have utf-8? Or will I be able to get away with using latin1?
Also, I tried to change some tables from latin1 to utf8 but I got this error:
Specified key was too long; max key length is 1000 bytes
Does anyone know the solution to this? And should I really solve that or may latin1 be enough?
Thanks,
Alex
it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?
It takes 1 byte to store a latin1 character and 1 to 3 bytes to store a UTF8 character.
If you only use basic Latin characters and punctuation in your strings (code points 0 to 127 in Unicode), both charsets will occupy the same length.
Also, I tried to change some tables from latin1 to utf8 but I got this error: "Specified key was too long; max key length is 1000 bytes". Does anyone know the solution to this? And should I really solve that, or may latin1 be enough?
If you have a column of VARCHAR(334) or longer, MyISAM won't let you create an index on it, since there is a remote possibility that the column will occupy more than 1000 bytes.
Note that keys of such length are rarely useful. You can create a prefix index, which will be almost as selective for any real-world data.
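For example (a hypothetical sketch; the articles table, title column, and 100-character prefix are placeholders chosen to illustrate the syntax):
-- index only the first 100 characters: 100 chars x 3 bytes = 300 bytes, well under the 1000-byte limit
CREATE INDEX idx_title_prefix ON articles (title(100));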
At a bare minimum I would suggest using UTF-8. Your data will be compatible with every other database out there nowadays since 90%+ of them are UTF-8.
If you go with LATIN1/ISO-8859-1 you risk the data not being stored properly, because those charsets don't support international characters, so you might end up with '?' or garbled text in place of what the user entered.
If you go with UTF-8, you don't need to deal with these headaches.
Regarding your error, it sounds like you need to optimize your database. Consider this: http://bugs.mysql.com/bug.php?id=4541#c284415
It would help if you gave specifics on your table schema and column for that issue.
If you allow users to post in their own languages, and if you want users from all countries to participate, you have to switch at least the tables containing those posts to UTF-8 - Latin1 covers only ASCII and western European characters. The same is true if you intend to use multiple languages for your UI. See this post for how to handle migration.
In my experience, if you plan to support Arabic, Russian, Asian languages or others, the investment in UTF-8 support upfront will pay off down the line. However, depending on your circumstances you may be able to get away with English for a while.
As for the error, you probably have a key or index field with more than 333 characters, the maximum allowed in MySQL with UTF-8 encoding. See this bug report.
We did an application using latin1 because it was the default. But later on we had to change everything to UTF-8 because of Spanish characters. Not incredibly difficult, but there's no point in having to change things unnecessarily.
So short answer is just go with UTF-8 from the beginning, it will save you trouble later on.
Since the max length of a key is 1000 bytes, using utf8 will limit you to 333 characters.
However, MySQL differs from Oracle regarding charsets. In Oracle you can't have a different character set per column, whereas in MySQL you can, so maybe you can set the key to latin1 and the other columns to utf8.
Finally, I believe only the defunct version 6.0alpha (ditched when Sun bought MySQL) could accommodate Unicode characters beyond the BMP (Basic Multilingual Plane). So basically, even with UTF-8, you won't have the whole Unicode character set. In practice this is only a problem for rare Chinese characters, if that really matters to you.
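A sketch of that per-column approach (hypothetical table and column names; the key stays latin1 so its index stays small, while the free-text column holds international text):
CREATE TABLE posts (
  code VARCHAR(255) CHARACTER SET latin1 NOT NULL PRIMARY KEY,  -- ASCII-safe key, 1 byte per character
  body TEXT CHARACTER SET utf8                                  -- international text, up to 3 bytes per character
);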
I am not an expert, but I always understood that UTF-8 is actually an encoding up to 4 bytes wide, not 3. And as I understand it, the MySQL implementation of utf8_unicode_ci only handles a 3-byte-wide encoding...
If you want the full UTF-8 4-byte character encoding, you need to use utf8mb4_unicode_ci encoding for your MySQL database/tables.
Current best practice is to never use MySQL's utf8 character set. Use utf8mb4 instead, which is a proper implementation of the standard.
See Adam Hooper's Explanation for more detail.
Note that in utf8mb4, characters occupy a variable number of bytes. As the name implies, characters are up to four bytes wide. Characters in the basic Latin character set, encoded as utf8mb4, still occupy only one byte. Other characters, including those with accents, kanji, and emojis, require two, three, or four bytes to store.
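A quick check of those byte counts (a minimal sketch; it assumes the connection character set is utf8mb4 so the literals arrive intact):
SELECT LENGTH('a'), LENGTH('é'), LENGTH('中'), LENGTH('😀');  -- 1, 2, 3, 4 bytes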
The Specified key was too long; max key length is 1000 bytes error occurs when an index contains columns in utf8mb4 because the index may be over this limit. You'll need to shorten the column length of some character columns or shorten the length of the index on the columns using this syntax to ensure that it is shorter than the limit.
ALTER TABLE ... ADD INDEX `myIndex` (column1(15), column2(200));

Does MySQL handle a single utf-8 character key as well as an integer?

I'm working on a Chinese/Japanese learning web app where many tables are indexed by the characters (the "glyphs") of those languages.
I'm wondering if the integer codepoint value of the glyph would be better for performance than using a single utf8 character (for primary key and indexes)?
Using a single utf8 character would be very useful because I can see the unicode characters fine in the shell I'm using, and this makes debugging the SQL queries of this app easier.
In theory MySQL would treat a single utf8 character as a unique integer value similarly to a mediumint (3 bytes)... but I suspect MySQL will handle the column as a string instead.
Would there be performance issues due to MySQL treating my single utf8 char as a string?
Would you recommend sticking to the integer codepoint for indexes and primary keys, and perhaps using CONVERT() or another operator to get the utf8 character in the results?
MySQL will store and index a UTF-8 character as a multi-byte string, yes. So I would expect integer to be a faster key, though the difference in performance is unlikely to be significant.
Another possible issue is that until MySQL 6.0, the utf8 character set doesn't support characters outside the Basic Multilingual Plane (i.e. it's limited to three bytes per character). If you want to use some of the really obscure kanji in the Supplementary Ideographic Plane, that'd be no good.
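As an illustration of the two designs being weighed (a hypothetical sketch; the table and column names are invented, and utf8mb4, available from MySQL 5.5.3, is used for the character variant so glyphs outside the BMP also fit):
-- variant 1: the glyph itself as the key (readable in the shell, but a multi-byte string key)
CREATE TABLE glyph_by_char (
  glyph CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL PRIMARY KEY,
  meaning TEXT
);
-- variant 2: the integer codepoint as the key (a compact fixed-width integer key)
CREATE TABLE glyph_by_codepoint (
  codepoint INT UNSIGNED NOT NULL PRIMARY KEY,  -- e.g. 0x4E2D for 中
  meaning TEXT
);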