What is MySQL collation, and how do I use it in practice?

Let's say I want to build a search engine that has to handle four languages:
English
Swedish
Hebrew
Arabic
How would I set the collations in MySQL ?

A collation is tied to:
A character set, which determines how the characters are stored (UTF8, ISO8859, etc.)
A set of rules for comparing and sorting strings in that character set
If you want to have different languages (where they cannot be sanely represented in the same collation, as you mention) you can have columns with different collations.
Of course you can set the collation at the database and table levels too, and even apply a collation to a string literal.
If you can find a single collation that handles all the languages you're interested in, that's best.

The collation determines how MySQL compares strings.
A list of all character sets and collations can be found with:
SHOW CHARACTER SET;
SHOW COLLATION;
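To change the default character set and collation for a table, use the statement below. It only affects columns added later; to convert existing columns as well, use ALTER TABLE ... CONVERT TO CHARACTER SET instead.
ALTER TABLE `my_table` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;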
http://dev.mysql.com/doc/refman/5.0/en/charset.html

Related

Is it more efficient to set more restrictive collations in MySQL columns that do not require utf8 characters?

I'm creating a new table in which I want a few VARCHAR columns to be indexed, but only one of them really needs the utf8_unicode_ci collation.
Would it be any more efficient when searching the table to only set that single column to utf8_unicode_ci collation?
Or is it the same if I collate the entire table with utf8_unicode_ci collation?
If you use a binary collation, string comparisons are a little bit faster.
When you use a collation like utf8_unicode_ci, MySQL must do string comparisons character by character, so it can tell if each character is equivalent according to the rules in the collation you use.
But with a binary collation, MySQL is optimized to use the C library function memcmp() to compare the whole string as literal bytes. This means there are no character equivalency rules, so the comparison is case-sensitive.
However, the performance advantage of this is minor, and it's much more of an advantage to use an index.
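The two comparison styles can be sketched outside MySQL. The Python snippet below is only an analogy, not MySQL's implementation: comparing the encoded bytes stands in for a binary collation, and comparing case-folded strings stands in for a case-insensitive one. The function names are made up for illustration.

```python
def binary_equal(a, b):
    """Compare the raw UTF-8 bytes, roughly what a binary collation does."""
    return a.encode("utf-8") == b.encode("utf-8")


def case_insensitive_equal(a, b):
    """Compare after Unicode case folding, a rough stand-in for a _ci collation."""
    return a.casefold() == b.casefold()


print(binary_equal("MySQL", "mysql"))            # False: the bytes differ
print(case_insensitive_equal("MySQL", "mysql"))  # True: case is ignored
```

The byte comparison needs no per-character rules, which is where the small speed advantage comes from.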

Which collation to use so that `ş` and `s` are treated as unique values?

The issue is that ş and s are interpreted by MySQL as identical values.
I'm new to MySQL, so I have no idea which collations would view them as unique.
The collations that I've tried using which don't work are:
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci
Does anybody know which collation to use?
P.S. I also really need the collation to interpret emojis and other non-Latin characters, and, to my knowledge of MySQL and collations, the only collation able to do this is unicode?
utf8_turkish_ci and utf8_romanian_ci -- as shown in http://mysql.rjweb.org/utf8_collations.html
(Plus, of course, utf8_bin.)
For your added question: You are looking for a "character set" (not a "collation") that can represent Emoji and other non-Latin characters -- UTF-8 is the one to use. In MySQL, it is utf8mb4. The "collations" that are associated with that are named utf8mb4_.... Collations control ordering and equality, as indicated in the first part of your question about s and ş.
MySQL's CHARACTER SET utf8 is a subset of utf8mb4. Either can handle all the "letters" in the world. But only utf8mb4 can handle Emoji and some Chinese characters.
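The reason comes down to byte length: MySQL's utf8 stores at most 3 bytes per character, while utf8mb4 allows 4. A small Python sketch shows which characters fit; the helper names are hypothetical, and the check only looks at encoded length, it does not talk to MySQL.

```python
def utf8_length(ch):
    """Number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_mysql_utf8(text):
    """True if every character needs at most 3 bytes (MySQL's 3-byte utf8 limit)."""
    return all(utf8_length(ch) <= 3 for ch in text)


samples = {
    "s": "s",
    "s with cedilla": "\u015f",
    "Hebrew shin": "\u05e9",
    "grinning face emoji": "\U0001F600",
}
for name, ch in samples.items():
    print(name, utf8_length(ch), fits_in_mysql_utf8(ch))
```

Only the emoji needs 4 bytes, so it is the only sample that requires utf8mb4.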

Why setting a collation for UTF8 text data in MySQL?

I work with human-generated text which I download from different online datasets like GitHub Torrent, Twitter API, web-scraped HTML pages, Google BigQuery for GitHub etc., which means I have tens to hundreds of millions of text rows in the database.
In which scenarios should I set a collation for UTF8 fields and UTF8 tables in MySQL databases? Is it necessary at all, or can I simply use "CHARACTER SET UTF8"?
What are the differences between utf8 - default collation, utf8_unicode_ci, utf8_general_ci and utf8_general_mysql500_ci?
Every textual column has a collation. It may be set explicitly in the table definition, or it may simply be set from the table's default, the database's default, or the server-wide default. But it has a collation.
The collations you mention are all case-insensitive. That is, they ignore the difference between upper- and lowercase letters. If you want a case-sensitive collation, use utf8_bin.
You probably want to use utf8_unicode_ci in a modern server. For background, read the question "What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?"
utf8_general_mysql500_ci is a collation specifically for backward compatibility to older versions of MySQL. http://dev.mysql.com/doc/relnotes/mysql/5.5/en/news-5-5-21.html

MySQL Collation: latin1_swedish_ci Vs utf8_general_ci

What should I set for Collation when creating tables in MySQL:
latin1_swedish_ci or utf8_general_ci
What is Collation anyway?
I have been using latin1_swedish_ci, would it cause any problems?
Whatever you do, don't try to pair the default latin1_swedish_ci collation with the utf8 character set (instead of latin1) in MySQL, or you'll get an error. A collation must be paired with its own character set to work. This SQL will fail because of the mismatch between character set and collation:
CREATE TABLE IF NOT EXISTS `db`.`events_user_preference` (
`user_id` INT(10) UNSIGNED NOT NULL ,
`email` VARCHAR(40) NULL DEFAULT NULL ,
PRIMARY KEY (`user_id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = latin1_swedish_ci
And @Blaisorblade pointed out that the way to fix this is to use the Swedish collation that goes with the utf8 character set:
COLLATE = utf8_swedish_ci
The SQL for the cal (calendar) module for the Yii PHP framework had something similar to the above erroneous code. Hopefully they've fixed it by now.
You can read about character sets and collations as of MySQL 5.5 here:
Character Sets and Collations in General
Character Sets and Collations in MySQL
Collation support is necessary to handle the many written languages of the world. For instance, in my language (Danish) we have a special character 'æ'. It sounds like the 'ä' of Swedish, German, Hungarian (and more). That character also appears in Danish in words imported from one of those languages. Thanks to collation support, we can have both printed correctly and at the same time sorted (ORDER BY ...) as identical. Without collation support that would not be possible.
The Swedish collation latin1_swedish_ci is the MySQL default for the latin1 character set. It works fine with English. English is easy: it works with everything, because it has no special characters, accents etc. But if you often use another language (for instance Spanish), you could change the collation to a Spanish one, so that Spanish strings are sorted according to Spanish language rules.
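To make "sorting according to language rules" concrete, here is a toy Python sketch. It assumes a hand-written Swedish alphabet and a made-up swedish_key function; it is not how MySQL implements collations. It only shows that plain code point order and language order can disagree.

```python
# Hand-written Swedish alphabet: å, ä and ö come after z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: index for index, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Sort key that ranks letters by their position in the Swedish alphabet.

    Characters outside the alphabet sort after all known letters.
    """
    return [RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]


words = ["ö", "z", "ä", "a", "å"]
print(sorted(words))                   # code point order: a, z, ä, å, ö
print(sorted(words, key=swedish_key))  # Swedish order:    a, z, å, ä, ö
```

Code point order puts ä before å, which is wrong for Swedish; a language-specific collation exists to encode exactly this kind of rule.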
A very special example of a collation is one of the German ones. It was created to allow for sorting like in German phone books. German phone books don't follow the general rules of the German language!
You can create your own collation if you like. Collations can be compiled or text-format.
In WampServer 2.5 you can change the collation by going into phpMyAdmin and selecting the database you need to change. This will give you another set of tabs. Select the tab called Operations. In that tab there is a section called Collation; pick the one you want in the drop-down and select Go.
Try these:
<?php
echo htmlspecialchars($string);
echo htmlentities($string);
?>
You can see more info from http://php.net/manual/en/function.htmlspecialchars.php. :D
Worked for me! No more diamonds :)

What's the difference between utf8_general_ci and utf8_unicode_ci in MySQL?

For a while now, I've used phpMyAdmin to manage my local MySQL databases. One thing I'm starting to pick up is the correct character sets for my database. I've decided UTF-8 is the best for compatibility (as my XHTML templates are served as UTF-8) but one thing that confuses me is the varied options for UTF-8 I'm presented with in the phpMyAdmin interface?
The two I've isolated are:
utf8_general_ci
utf8_unicode_ci
So my question is this: what is the difference between the general and unicode variants of utf8 in MySQL? (I've come to learn that ci is shorthand for case-insensitive)
Any help would be much appreciated.
From the MySQL manual on Unicode Character Sets:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
See the referenced page for further information and examples.
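The idea of an expansion can be illustrated with a short Python sketch. This is an analogy only: it uses Python's Unicode case folding, not MySQL's collation tables, and the function names are invented for the example. Simple lowercasing maps each character to one character, while full case folding expands "ß" to "ss".

```python
def one_to_one_equal(a, b):
    """Compare after simple lowercasing, which never expands a character."""
    return a.lower() == b.lower()


def expansion_aware_equal(a, b):
    """Compare after full Unicode case folding, which expands 'ß' to 'ss'."""
    return a.casefold() == b.casefold()


print(one_to_one_equal("Straße", "STRASSE"))       # False
print(expansion_aware_equal("Straße", "STRASSE"))  # True
```

Handling such expansions costs extra work per comparison, which fits the manual's remark that the more correct collation is also the slower one.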
The ##%!ing manual discusses this... :)
One of the issues is speed and accuracy of certain operations.