Why does MySQL use latin1_swedish_ci as the default? - mysql

Does anyone know why latin1_swedish_ci is the default for MySQL? It would seem to me that UTF-8 would be more compatible, right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem that's what they did.

As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation: it comes with some specific sorting quirks (Å, Ä and Ö come after Z, I think), but they are nowhere near an international standard.

latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
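To make the 0x80 to 0x9f difference concrete, here is a small Python sketch. It is only an illustration: Python's "latin-1" codec (which maps those bytes to C1 control characters rather than leaving them undefined) stands in for ISO 8859-1, and Python's "cp1252" codec stands in for MySQL's latin1. The sample bytes are my own choice.

```python
# Sketch only: Python codecs stand in for the two definitions quoted above.
samples = bytes([0x80, 0x99, 0x9F])

as_iso_8859_1 = samples.decode("latin-1")  # invisible C1 control characters
as_cp1252 = samples.decode("cp1252")       # printable characters

for byte, iso_char, win_char in zip(samples, as_iso_8859_1, as_cp1252):
    print(f"0x{byte:02X}: ISO 8859-1 -> U+{ord(iso_char):04X}, cp1252 -> {win_char!r}")

# Outside the 0x80-0x9F range the two agree, e.g. 0xE9 is 'é' in both.
same_in_both = bytes([0xE9]).decode("latin-1") == bytes([0xE9]).decode("cp1252")
```

So the euro sign, for example, has a slot in cp1252 but not in ISO 8859-1.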

Using a single-byte encoding has some advantages over multi-byte encodings, e.g. the length of a string in bytes is equal to the length of that string in characters. With a multi-byte encoding, it is not intuitively clear whether functions like SUBSTRING count characters or bytes. Also, for the same reasons, it requires quite a big change to the internal code to support multi-byte encodings.
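A quick way to see the byte/character distinction, sketched in Python rather than SQL (the helper name and the sample string are just for illustration):

```python
def char_and_byte_lengths(text, encoding):
    """Return (number of characters, number of bytes) for text in the given encoding."""
    return len(text), len(text.encode(encoding))

name = "Müller"
latin1_lengths = char_and_byte_lengths(name, "latin-1")  # single-byte encoding
utf8_lengths = char_and_byte_lengths(name, "utf-8")      # multi-byte encoding

# Slicing UTF-8 at a byte offset can cut a character in half, so a substring
# function has to know whether it is counting bytes or characters.
first_two_bytes = name.encode("utf-8")[:2]  # ends in the middle of 'ü'
first_two_chars = name[:2]                  # 'Mü'

print(latin1_lengths, utf8_lengths, first_two_bytes, first_two_chars)
```

In latin1 the two counts match; in UTF-8 the 'ü' costs an extra byte, and the byte-based slice is no longer valid text on its own.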

Most strange features of this kind are historic. They did it like that a long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF-8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode one character then.

To expand on why not utf8, and to explain a gotcha not mentioned elsewhere in this thread: MySQL's utf8 is not real UTF-8! MySQL has been around for a long time, since before UTF-8 was in widespread use. As explained above, this is likely why it is not the default (backwards compatibility, and the expectations of third-party software).
Back when UTF-8 was new and not commonly used, it seems the MySQL devs added basic utf8 support that stores at most 3 bytes per character, which is not enough for all of Unicode. Now that it exists, they have chosen neither to widen it to 4 bytes nor to remove it. Instead they added utf8mb4 ("multi-byte 4"), which is real UTF-8 with up to 4 bytes per character.
It's important that anyone migrating a MySQL database to UTF-8, or building a new one, knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
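To check which characters fall outside a 3-byte limit, a rough Python sketch like the following can help. It uses Python's standard UTF-8 encoder, not MySQL, and the function names and sample characters are made up for the example.

```python
def utf8_byte_length(char):
    """Number of bytes a single character needs in standard UTF-8."""
    return len(char.encode("utf-8"))

def fits_in_three_byte_utf8(text):
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)

examples = ["A", "é", "✪", "😀"]
byte_lengths = {ch: utf8_byte_length(ch) for ch in examples}
print(byte_lengths)

print(fits_in_three_byte_utf8("I ✪ MySQL"))   # only 1- to 3-byte characters
print(fits_in_three_byte_utf8("I 😀 MySQL"))  # contains a 4-byte character
```

Anything for which the check returns False (most emoji, for instance) needs utf8mb4.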

Related

MySQL Collation in Kamailio tables

For some reason I don't know, I have some Kamailio tables with utf32_general_ci collation (location, for example) and others with latin1_swedish_ci.
I tried to find an answer in the GitHub repo (https://github.com/kamailio/kamailio/blob/4.4/scripts/mysql/my_create.sql), but the collation is not specified there.
Is that right? Does it matter? Which one is the right choice?
My Kamailio version is 4.4.3.
What is Kamailio? What does it store in a database?
In general, all software today should be using full UTF-8 character set encoding. In MySQL, the syntax is CHARACTER SET utf8mb4.
You mentioned two COLLATIONs; they have to do with sort order, not character encoding. The two mentioned correspond to these character sets:
utf32 -- 4 bytes per character; quite wasteful of space; virtually no one uses it; there is no good reason (that I can think of) to use it today.
latin1 -- 1 byte per character; handles western Europe, but none of Asia. It is an old default in MySQL (hence your tables having it).
I do not know the ramifications of trying to change Kamailio.
One thing to note in how MySQL deals with character set differences: You must specify the encoding in the client when connecting to MySQL. Then, when INSERTing and SELECTing, MySQL will convert (if necessary, and if possible) between the client encoding and column encoding. Because of this automatic conversion, you will currently see nothing wrong for western European characters. But Greek, Chinese, etc will be mangled/lost when attempting to store into latin1.
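The effect of squeezing text into latin1 can be imitated in Python. This is only an approximation under assumptions: cp1252 stands in for MySQL's latin1, and the '?' substitution mimics a lossy conversion; what a real server does (replace, warn, or reject) depends on its settings. The helper names are hypothetical.

```python
def store_as_latin1(text):
    """Encode to cp1252, substituting '?' for anything with no latin1 form."""
    return text.encode("cp1252", errors="replace")

def read_back(stored):
    """Decode the stored bytes again, as a client reading the column would."""
    return stored.decode("cp1252")

western = read_back(store_as_latin1("café"))
greek = read_back(store_as_latin1("Ελλάδα"))
chinese = read_back(store_as_latin1("中文"))

print(western, greek, chinese)
```

The western European text survives the round trip; the Greek and Chinese text comes back as question marks, and the original characters cannot be recovered.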

What MySQL collation is best for accepting all unicode characters?

Our column is currently collated to latin1_swedish_ci and special unicode characters are, obviously, getting stripped out. We want to be able to accept chars such as U+272A ✪, U+2764 ❤, (see this wikipedia article) etc. I'm leaning towards utf8_unicode_ci, would this collation handle these and other characters? I don't care about speed as this column isn't an index.
MySQL Version: 5.5.28-1
The collation is the least of your worries; what you need to think about is the character set for the column/table/database. The collation (rules governing how data is compared and sorted) is just a corollary of that.
MySQL supports several Unicode character sets, utf8 and utf8mb4 being the most interesting. utf8 supports Unicode characters in the BMP, i.e. a subset of all of Unicode. utf8mb4, available since MySQL 5.5.3, supports all of Unicode.
The collation to be used with any of the Unicode encodings is most likely xxx_general_ci or xxx_unicode_ci. The former is a general sorting and comparison algorithm independent of language; the latter is a more complete language-independent algorithm supporting more Unicode features (e.g. treating "ß" and "ss" as equivalent), but is therefore also slower.
See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.

utf-8 vs latin1

What are the advantages/disadvantages of using utf8 as a charset versus using latin1?
If utf8 can support more chars and is used consistently, wouldn't it always be the better choice? Is there any reason to choose latin1?
UTF8 Advantages:
Supports most languages, including RTL languages such as Hebrew.
No translation needed when importing/exporting data to UTF8 aware components (JavaScript, Java, etc).
UTF8 Disadvantages:
Non-ASCII characters will take more time to encode and decode, due to their more complex encoding scheme.
Non-ASCII characters will take more space as they may be stored using more than 1 byte (characters not in the first 128 characters of the ASCII character set). A CHAR(10) or VARCHAR(10) field may need up to 30 bytes to store some UTF8 characters.
Collations other than utf8_bin will be slower (as the sort order will not directly map to the character encoding order), and will require translation in some stored procedures (as variables default to utf8_general_ci collation).
If you need to JOIN UTF8 and non-UTF8 fields, MySQL will impose a SEVERE performance hit. What would be sub-second queries could potentially take minutes if the fields joined are different character sets/collations.
Bottom line:
If you don't need to support non-Latin1 languages, want to achieve maximum performance, or already have tables using latin1, choose latin1.
Otherwise, choose UTF8.
latin1 has the advantage that it is a single-byte encoding. It can therefore store more characters in the same amount of storage space, because the length of string data types in MySQL depends on the encoding. The manual states that
To calculate the number of bytes used to store a particular CHAR,
VARCHAR, or TEXT column value, you must take into account the
character set used for that column and whether the value contains
multibyte characters. In particular, when using a utf8 Unicode
character set, you must keep in mind that not all characters use the
same number of bytes. utf8mb3 and utf8mb4 character sets can require
up to three and four bytes per character, respectively. For a
breakdown of the storage used for different categories of utf8mb3 or
utf8mb4 characters, see Section 10.9, “Unicode Support”.
Furthermore lots of string operations (such as taking substrings and collation-dependent compares) are faster with single-byte encodings.
In any case, latin1 is not a serious contender if you care about internationalization at all. It can be an appropriate choice when you will be storing known safe values (such as percent-encoded URLs).
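The storage arithmetic behind the "CHAR(10) may need up to 30 bytes" figure can be sketched as follows. The per-character maxima come from the manual excerpt above; the dictionary and function names are made up, Python's "utf-8" codec stands in for utf8mb3, and length prefixes or other row overhead are ignored.

```python
# Maximum bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}

def max_bytes(declared_chars, charset):
    """Worst-case bytes for a column declared as CHAR/VARCHAR(declared_chars)."""
    return declared_chars * MAX_BYTES_PER_CHAR[charset]

worst_case = {cs: max_bytes(10, cs) for cs in MAX_BYTES_PER_CHAR}
print(worst_case)

# Actual usage depends on the content: ten euro signs hit the 3-byte worst
# case, while ten ASCII letters take one byte each.
euro_bytes = len(("€" * 10).encode("utf-8"))
ascii_bytes = len(("a" * 10).encode("utf-8"))
print(euro_bytes, ascii_bytes)
```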
@Ross Smith II, Point 4 is worth gold, meaning inconsistency between columns can be dangerous.
To add value to the already good answers, here is a small performance test about the difference between charsets:
A modern 2013 server, real use table with 20000 rows, no index on concerned column.
SELECT 4 FROM subscribers WHERE 1 ORDER BY time_utc_str; (4 is cache buster)
varchar(20) CHARACTER SET latin1 COLLATE latin1_bin: 15ms
varbinary(20): 17ms
utf8_bin: 20ms
utf8_general_ci: 23ms
For simple strings like numerical dates, my decision would be, when performance is concerned, using utf8_bin (CHARACTER SET utf8 COLLATE utf8_bin). This would prevent any adverse effects with other code that expects database charsets to be utf8 while still being sort of binary.
Fixed-length encodings such as latin-1 are generally more efficient in terms of CPU consumption.
If the set of tokens in some fixed-length character set is known to be sufficient for your purpose at hand, and your purpose involves heavy and intensive string processing, with lots of LENGTH() and SUBSTR() stuff, then that could be a good reason for not using encodings such as UTF-8.
Oh, and BTW: do not confuse, as you seem to do, a character set with an encoding thereof. A character set is some defined set of writeable glyphs. The same character set can have multiple distinct encodings. The various versions of the Unicode standard each constitute a character set. Each of them can be encoded as UTF-8, UTF-16, or UTF-32 (which uses a full four bytes for every character), and the latter two can each come in a high-order-byte-first (big-endian) or high-order-byte-last (little-endian) flavour.
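To illustrate the point about one character set having several encodings, here is a short Python sketch that encodes the same character five ways. The choice of 'é' and the variable names are arbitrary.

```python
char = "é"  # U+00E9, one character in the Unicode character set

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
encoded = {name: char.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(f"{name:10} {len(raw)} bytes  {raw.hex()}")

# Every byte sequence decodes back to the same character.
round_trips = all(raw.decode(name) == char for name, raw in encoded.items())
```

The "-be" variants put the high-order byte first and the "-le" variants put it last; the character itself never changes.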

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8), but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface.
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
I'd be most grateful for any help in this matter.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
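The idea of an "expansion" can be shown with a rough Python analogy. This is not MySQL's collation algorithm: Python's str.casefold() (full Unicode case folding, which turns "ß" into "ss") merely stands in for a comparison that supports expansions, and a per-character comparison stands in for one that does not. Both function names are invented for the example.

```python
def equal_one_to_one(a, b):
    """Case-insensitive comparison that maps each character to exactly one character."""
    if len(a) != len(b):
        return False
    return all(x.lower() == y.lower() for x, y in zip(a, b))

def equal_with_expansions(a, b):
    """Case-insensitive comparison using full case folding, where 'ß' expands to 'ss'."""
    return a.casefold() == b.casefold()

print(equal_one_to_one("Big", "big"))
print(equal_one_to_one("Straße", "STRASSE"))
print(equal_with_expansions("Straße", "STRASSE"))
```

A one-to-one comparison can never match a 6-character string against a 7-character one, which is the limitation the manual describes.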
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

In MySQL, which collation should I choose?

When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default collation is normally latin1_swedish_ci.
"Collation" is not actually the default; it's giving you the default collation as the first choice.
What we're talking about is collation, which goes hand in hand with the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (e.g., is 'Big' == 'big'? With a _ci collation, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySQL & MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb4 collation to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci