I have just exported a MySQL database (in latin1), converted it to UTF-8 in the process, and imported it onto a newer system.
It seemed to go OK, but I hit a few cases where a UNIQUE key threw an error because two entries differed only in an international character, e.g.
"åle" was not considered distinct from "ale".
I did not find anything in the documentation on UNIQUE keys that mentioned character sets or encodings at all.
How can I configure the database to ensure that it considers these letters unique?
This depends on the "COLLATION" setting for the column in question. You can see the current collation with "SHOW FULL COLUMNS IN yourtablename".
For example, "utf8_general_ci" considers "ale", "åle" and "ALE" the same. Depending on your use case, something like "utf8_swedish_ci" or "utf8_bin" might be more appropriate. Note that changing collation will also change what ".. where yourcolumn=value" matches, and the ordering of "...order by yourcolumn".
You can change the collation with "ALTER TABLE" (for a single column), or database-wide. More information in the manual: http://dev.mysql.com/doc/refman/5.5/en/globalization.html
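As a rough sketch of the steps above, assuming a hypothetical table `words` with a `VARCHAR(100)` column `name` (neither name comes from the question, and the column type is a guess):

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. Inspect the current collation of each column.
SHOW FULL COLUMNS FROM words;

-- 2. Switch one column to a binary collation so that 'ale', 'åle' and 'ALE'
--    compare as three different values. MODIFY needs the full column
--    definition restated (type, NULL-ability, and so on).
ALTER TABLE words
  MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

Keep in mind that utf8_bin also makes comparisons case-sensitive. If you only need "å" to differ from "a" while staying case-insensitive, a language-specific collation such as utf8_swedish_ci may be the better fit.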
Okay, not the clearest title ever; feel free to improve.
I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.
I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.
I figured this should be a fairly easy task; just do:
SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin
But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.
No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:
For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.
Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.
Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?
SET NAMES utf8mb4;
CREATE TABLE x(s VARCHAR(11) COLLATE utf8mb4_unicode_520_ci NOT NULL);
INSERT INTO x (s)
VALUES ('aha'), ('áha̱'), ('ā̧́ḫà̀'),
('i'), ('i̯');
SELECT s FROM x GROUP BY s HAVING COUNT(*) > 1;
Comes back with
aha
i
without any complaints about numeric stuff.
I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.
What am I doing differently from your case?
When adding an explicit COLLATE clause, put it on the component of the query that needs it. (COLLATE does not apply to the query as a whole; different parts can be collated differently.)
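A minimal sketch of that advice, reusing the table `x` created above. The aliases `s_bin` and `n` are made up for this example. The COLLATE clause is attached to the string column being grouped, not to the numeric COUNT expression, so the latin1 cast described in the bug report should not come into play:

```sql
SET NAMES utf8mb4;

-- Group on the binary-collated value, so forms that differ only in
-- diacritics or case land in different groups.
SELECT s COLLATE utf8mb4_bin AS s_bin,
       COUNT(*)              AS n
FROM x
GROUP BY s_bin
HAVING n > 1;
```

With the five sample rows inserted above, every value is byte-for-byte distinct, so this should return no rows. Only exact duplicates would be listed.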
As per this link, I would like to understand the meaning of
Case-sensitivity in string comparison depends on the collating sequence used
Even with the following additional information from MSDN, I could not understand what a collation means:
A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects that have different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page.
So, can you help me understand the meaning and significance of a collating sequence in a database, with an example?
Note: I am currently part of database intro course.
When creating a database you may need to store data in different languages. Different languages have different sets of characters and different sort orders, so you need a way to sort and compare strings accordingly; that is what a collation is for. A collation controls how string values are sorted and compared. In T-SQL you can specify it with the COLLATE clause, as explained here: http://msdn.microsoft.com/en-us/library/ms184391.aspx
You can go through the documentation to find out which collations are supported. If you don't specify a collation when creating a database, it takes the default collation of the current SQL Server instance. You can also set a collation at the database level or on an individual column, and you can apply one in a SELECT statement, where it affects comparisons and sort order.
Here is a related question that will help you understand further: What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do?
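To make this concrete, here is a small T-SQL sketch that compares and sorts the same strings under different collations. The column aliases and the inline `VALUES` list are made up for the example. The collation names are standard Windows collations, where `CI`/`CS` mean case-insensitive/case-sensitive and `AS` means accent-sensitive:

```sql
-- T-SQL (SQL Server): the same comparison under two collations.
SELECT
  CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CI_AS
       THEN 'equal' ELSE 'different' END AS case_insensitive,  -- 'equal'
  CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CS_AS
       THEN 'equal' ELSE 'different' END AS case_sensitive;    -- 'different'

-- The collation also decides the sort order. A binary collation sorts by
-- character code, so uppercase ASCII letters come before lowercase ones.
SELECT name
FROM (VALUES ('b'), ('A'), ('a'), ('B')) AS t(name)
ORDER BY name COLLATE Latin1_General_BIN;   -- A, B, a, b
```

This is what "case-sensitivity depends on the collating sequence" means in practice: the data is the same, but whether two strings are equal, and which comes first, is decided by the collation in effect.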
Simple question: when creating database tables I already specify the collation for the entire table, however, individual columns also have a collation field. Should this be left blank or filled with the same collation as the table (for non-numeric types that is)?
Could anyone shed some light on what this field is for exactly? My suspicion is that it is used for when the column's collation differs from the table's collation, however, I would like a confirmation.
Leave it blank and it will use the table's collation. (See here in the manual.)
As for what all this is for, see the manual section Character Set Support and, in particular, Specifying Character Sets and Collations.
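A short sketch of how the two levels interact, using a hypothetical table `articles` (the table and column names are invented for this example): the table-level setting is only a default, and a column inherits it unless it declares its own collation.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE articles (
  title VARCHAR(200),                  -- no collation given: inherits the table default
  slug  VARCHAR(200) COLLATE utf8_bin  -- explicit override for this column only
) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- The Collation column of the output shows what each column ended up with.
SHOW FULL COLUMNS FROM articles;
```

On recent MySQL 8.0 releases the `utf8` names may be reported as `utf8mb3`, but the inheritance rule is the same.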
We have issues with utf8 string comparisons in MySQL 5, regarding case and accents:
From what I gathered, MySQL implements collations by treating groups of characters as equal.
For example, in the utf8_unicode_ci collation, all the letters "EÉÈÊeéèê" are in the same box (together with other variants of "e").
So if you have a table containing ["video", "vidéo", "vidÉo", "vidÊo", "vidêo", "vidÈo", "vidèo", "vidEo"] (in a varchar column declared with the utf8_general_ci collation):
when asking MySQL to sort the rows according to this column, the sorting is random (MySQL does not enforce a sorting rule between "é" and "É" for example),
when asking MySQL to add a Unique Key on this column, it raises an error because it considers all the values are equal.
What setting can we fiddle with to fix these two points ?
PS : on a related note, I do not see any case sensitive collation for the utf8 charset. Did I miss something ?
[edit] I think my initial question still holds some interest, and I will leave it as is (and maybe one day get a positive answer).
It turned out, however, that our problems with string comparisons regarding accents were not linked to the collation of our text columns. They were linked to a configuration problem with the character_set_client parameter when talking to MySQL, which defaulted to latin1.
Here is the article that explained it all to us, and allowed us to fix the problem :
Getting out of MySQL character set hell
It is lengthy, but trust me, you need this length to explain both the problem and the fix.
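For readers who want to check the same thing on their own connection, a minimal sketch (this is a generic check, not the procedure from the linked article):

```sql
-- Show the character sets and collations in effect for this session
-- and for the server.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Tell the server that this client sends and expects UTF-8. This sets
-- character_set_client, character_set_connection and character_set_results.
SET NAMES utf8;
```

`SET NAMES` only affects the current session, so it has to be issued on every new connection, or the equivalent charset option has to be set in the client library or configuration.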
Use a collation that considers these characters to be distinct. Maybe utf8_bin (it is case-sensitive, since it does a binary comparison of characters).
http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
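A sketch of what that could look like for the two points in the question, using a hypothetical table `videos` and key name `uq_name` (both invented here). The second query is an optional extra, not part of the answer above: it keeps a linguistic sort order but uses the binary column value as a tie-breaker, so the order is no longer arbitrary.

```sql
SET NAMES utf8;

-- Hypothetical table: with utf8_bin every accent and case variant is a
-- distinct value, so the UNIQUE key accepts all of them.
CREATE TABLE videos (
  name VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY uq_name (name)
);

INSERT INTO videos (name)
VALUES ('video'), ('vidéo'), ('vidÉo'), ('vidÊo'),
       ('vidêo'), ('vidÈo'), ('vidèo'), ('vidEo');

-- Deterministic, but ordered by character code rather than alphabetically.
SELECT name FROM videos ORDER BY name;

-- Linguistic order first, binary value as tie-breaker between variants
-- that utf8_unicode_ci treats as equal.
SELECT name FROM videos ORDER BY name COLLATE utf8_unicode_ci, name;
```

The trade-off is that `WHERE name = 'video'` on this column now matches only the exact spelling. A case- and accent-insensitive lookup needs an explicit `COLLATE` in the query.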
I set up a MyISAM table to do FULLTEXT searching.
I do not want searches to be case-sensitive.
My searches are along the lines of:
SELECT * FROM search WHERE MATCH (keywords) AGAINST ('+diversity +kitten' IN BOOLEAN MODE);
Let's say the keywords field I'm looking for has the value "my Diversity kitten".
I noticed the searches were case-sensitive.
I double-checked my collation on the search table, it was set to utf8_bin. D'oh!
I changed it to utf8_general_ci.
But my query is still case-sensitive!
Why?
Is there a server setting I need to change, too?
Is there something I need to do besides change the collation?
I did a "REPAIR TABLE search QUICK" to rebuild the FULLTEXT index, but that didn't do it either...
My searches are still case-sensitive. =(
Aha, figured it out for reals this time.
I believe my issue was using NaviCat to update the collation. I have an older version of NaviCat, maybe it was a bug or something.
Doing:
ALTER TABLE search CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
fixed it correctly.
Always use command line, kids! =)
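One possible explanation, which is an assumption and not something confirmed in this thread, is that the tool changed only the table's default collation. That leaves existing columns on their old collation. The difference can be checked like this:

```sql
-- Changes only the default for columns added later; existing columns
-- such as keywords keep whatever collation they already had.
ALTER TABLE search DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Check the Collation column: keywords may still show utf8_bin here.
SHOW FULL COLUMNS FROM search;

-- Converts the existing character columns as well as the default.
ALTER TABLE search CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

Running `SHOW FULL COLUMNS` before and after a collation change is a quick way to confirm that the column used in the FULLTEXT index actually changed.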
Hmm - that behavior doesn't match the manual:
By default, the search is performed in case-insensitive fashion. However, you can perform a case-sensitive full-text search by using a binary collation for the indexed columns. For example, a column that uses the latin1 character set of can be assigned a collation of latin1_bin to make it case sensitive for full-text searches.
Which version of MySQL do you use? Can you provide some data that would allow replicating the problem on another machine?