Why is MySQL's default collation latin1_swedish_ci? - mysql

What is the reasoning behind setting latin1_swedish_ci as the compiled default when other options seem much more reasonable, like latin1_general_ci or utf8_general_ci?

The bloke who wrote it was co-head of a Swedish company.
Possibly for similar reasons, Microsoft SQL Server's default language us_english.

latin1_swedish_ci is a single byte character set, unlike utf8_general_ci.
Compared to latin1_general_ci it has support for a variety of extra characters used in European languages. So it’s a best choice if you don’t know what language you will be using, if you are constrained to use only single byte character sets.

Related

What COLLATE should i set to use all kind of possible languages?

I have a column called username, i want the user to be able to insert text in japanese, roman, arabic, korean, and everything that is possible, including special chars [https://en.wiktionary.org/wiki/Index:All_languages], what COLLATE should i set on my database and tables?
I'm using utf_general_ci, i'm new so i don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid sql error, because i will not use preg_replace or a function to replace special chars, i will only use prepared statement to avoid SLQ injection and protect by database.
First choice (MySQL 8.0): utf8mb4_0900_ai_ci
Second choice (as of 5.6): utf8mb4_unicode_520_ci
Third choice (5.5+): utf8mb4_unicode_ci
Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci
The numbers refer to Unicode standards 9.0, 5.20, and (no number) 4.0.
No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.
utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)
If Portuguese is the focus:
See https://pt.stackoverflow.com/ and MySQL collation for Portugese .
Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.
(Else go with the 'choices' listed above.)
If you are using MySQL 5.5.3 or higher, I would recommend UTF-8 character encoding utf8mb4_unicode_ci . AFAIK it supports most, if not all languages, and implements the Unicode standard for sorting and comparison. As a second choice, have a look at utf8mb4_general_ci, which may be faster but also less accurate.
See this excellent SO post for (many) more details, or check out the official MySQL doc.
Below 5.5.3, utf8_unicode_ci is your friend.
COLLATION refers to ordering (as in comparisons in WHERE and ORDER BY); you should really ask about CHARACTER SET:
Pre-5.5.3: utf8 (aka utf8mb3) handles all languages, except for a few Chinese characters and Emoji.
5.5.3 forward: utf8mb4 - Handles everything. Outside of MySQL, it is spelled "UTF-8".

MySQL Collation: latin1_swedish_ci Vs utf8_general_ci

What should I set for Collation when creating tables in MySQL:
latin1_swedish_ci or utf8_general_ci
What is Collation anyway?
I have been using latin1_swedish_ci, would it cause any problems?
Whatever you do, don't try to use the default swedish_ci collation with utf8 (instead of latin) in mysql, or you'll get an error. Collations must be paired with the right charset to work. This SQL will fail because of the mismatch in charset and collation:
CREATE TABLE IF NOT EXISTS `db`.`events_user_preference` (
`user_id` INT(10) UNSIGNED NOT NULL ,
`email` VARCHAR(40) NULL DEFAULT NULL ,
PRIMARY KEY (`user_id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = latin1_swedish_ci
And #Blaisorblade pointed out that the way to fix this is to use the character set that goes with the swedish collation:
DEFAULT CHARACTER SET = utf8_swedish_ci
The SQL for the cal (calendar) module for the Yii php framework had something similar to the above erroneous code. Hopefully they've fixed it by now.
You can read about character sets and collations as of MySQL 5.5 here:
Character Sets and Collations in General
Character Sets and Collations in MySQL
The collations support is necessary to support all the many written languages of the world. For instance in my language (Danish) we have a special character 'æ'. It sounds like Swedish, German, Hungarian (and more) 'ä' . That character also appears in Danish with words imported form one of those languages. Due to collations' support we can have both printed correctly and and the same sorted (ORDER BY ...) as being identical. Without collations support that was not possible.
Swedish collations is the MySQL default for latin charsets. It works fine with English. English is so easy - it works with everything, because it has no special characters, accents etc. But if you have another language that you use often (for instance Spanish) you could change collation to a Spanish one, so sorting of Spanish Strings would be correct according to Spanish language rules.
A very special example of a collation is one of the German ones. It was created to allowed for sorting like in German phone books. German phone books don't follow general rules of german language!
You can create your own collation if you like. Collations can be compiled or text-format.
In Wamp Server 2.5 you can change the collation by going into PHPAdmin, selecting the database you need to change. This will give you another set of tabs. Select the Tab called Operations. In that tab will be section called collation, pick the one you want in the drop-down and select go.
Try these:
<?php
echo htmlspecialchars($string);
echo htmlentities($string);
?>
You can see more info from http://php.net/manual/en/function.htmlspecialchars.php. :D
Worked for me! No more diamonds :)

Why does MySQL use latin1_swedish_ci as the default?

Does anyone know why latin1_swedish is the default for MySQL. It would seem to me that UTF-8 would be more compatible right?
Defaults are usually chosen because they are the best universal choice, but in this case it does not seem thats what they did.
As far as I can see, latin1 was the default character set in pre-multibyte times and it looks like that's been continued, probably for reasons of downward compatibility (e.g. for older CREATE statements that didn't specify a collation).
From here:
What 4.0 Did
MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.
As to why Swedish, I can only guess that it's because MySQL AB is/was Swedish. I can't see any other reason for choosing this collation, it comes with some specific sorting quirks (ÄÖÜ come after Z I think), but they are nowhere near an international standard.
latin1 is the default character set. MySQL's latin1 is the same as the
Windows cp1252 character set. This means it is the same as the
official ISO 8859-1 or IANA (Internet Assigned Numbers Authority)
latin1, except that IANA latin1 treats the code points between 0x80
and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions.
from
http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Might help you understand why.
Using a single-byte encoding has some advantages over multi-byte encondings, e.g. length of a string in bytes is equal to length of that string in characters. So if you use functions like SUBSTRING it is not intuitively clear if you mean characters or bytes. Also, for the same reasons, it requires quite a big change to the internal code to support multi-byte encodings.
Most strange features of this kind are historic. They did it like that long time ago, and now they can't change it without breaking some app depending on that behavior.
Perhaps UTF8 wasn't popular then. Or perhaps MySQL didn't support charsets where multiple bytes encode on character then.
To expand on why not utf8, and explain a gotcha not mentioned elsewhere in this thread be aware there is a gotcha with mysql utf8. It's not utf8! Mysql has been around for a long time, since before utf8 existed. As explained above this is likely why it is not the default (backwards comparability, and expectations of 3rd party software).
In the time when utf8 was new and not commonly used, it seems mysql devs added basic utf8 support, incorrectly using 3 bytes of storage. Now that it exists, they have chosen not to increase it to 4 bytes or remove it. Instead they added utf8mb4 "multi byte 4" which is real 4 byte utf8.
Its important that anyone migrating a mysql database to utf8 or building a new one knows to use utf8mb4. For more information see https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8) but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface?
The two I've isolate are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
Any help would be most grateful in this matter.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.

In MySQL, which collation should I choose?

When I create a new MySQL database through phpMyAdmin, I have the option to choose the collation (e.g.-default, armscii8, ascii, ... and UTF-8). The one I know is UTF-8, since I always see this in HTML source code. But what is the default collation? What are the differences between these choices, and which one should I use?
Collation tells database how to perform string matching and sorting. It should match your charset.
If you use UTF-8, the collation should be utf8_general_ci. This will sort in unicode order (case-insensitive) and it works for most languages. It also preserves ASCII and Latin1 order.
The default collation is normally latin1.
Collation is not actually the default, it's giving you the default collation as the first choice.
What we're talking about is collation, or the character set that your database will use in its text types. Your default option is usually based on regional settings, so unless you're planning to globalize, that's usually peachy-keen.
Collations also determine case and accent sensitivity (i.e.-Is 'Big' == 'big'? With a CI, it is). Check out the MySQL list for all the options.
Short answer: always use utf8mb4 (specifically utf8mb4_unicode_ci) when dealing with collation in MySql & MariaDB.
Long answer:
MySQL’s utf8 encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.
Luckily, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 which maps to proper UTF-8 and thus fully supports Unicode.
Read the full text here: https://mathiasbynens.be/notes/mysql-utf8mb4
As to which specific utf8mb to choose, go with utf8mb4_unicode_ci so that sorting is always handled properly with minimal/unnoticeable performance drawbacks. See more details here: What's the difference between utf8_general_ci and utf8_unicode_ci