MySQL ORDER BY on a column with Unicode characters

I am running a select query on a MySQL table and trying to order the results by the "name" column.
The name column contains both plain English names and names with accented Latin characters such as â.
I am running into the following problem.
The query returns the results in an order I do not want.
For example, if name contains "archer", "aaakash", "â hayden", "bourne", "jason",
the results returned by the query are ordered as follows:
"aaakash", "archer", "â hayden", "bourne", "jason"
However, I want to order them by Unicode code point, like this:
"aaakash", "archer", "bourne", "jason", "â hayden"
(Note the difference in the position of â hayden between the two orders.)
What can I do to order the results by each character's position in the Unicode character set?

However I want to order it based on unicode code points (like below)
To sort by Unicode code point, you probably need to use the utf8_bin collation.
More precisely, the _bin suffix indicates that strings are sorted by the binary representation of each character.
To override the default collation while ordering, use ORDER BY ... COLLATE.
To paraphrase the documentation:
SELECT k
FROM t1
ORDER BY k COLLATE utf8_bin;
If your text column does not use utf8 encoding, you will have to CONVERT it:
SELECT k
FROM t1
ORDER BY CONVERT(k USING utf8) COLLATE utf8_bin;
Please note that I used utf8 as an example here because it is the most common Unicode encoding. Your MySQL server probably supports other Unicode encodings as well, such as ucs2 (UCS-2, a fixed-width two-byte encoding that is related to, but not the same as, UTF-16).
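To see what code point ordering means for the names in the question, here is a small sketch in Python (used purely as an illustration; it does not talk to MySQL). Python compares strings by code point, and sorting the UTF-8 encoded bytes gives the same order, which is the behaviour a binary collation on a utf8 column is expected to have.

```python
# Names taken from the question.
names = ["archer", "aaakash", "â hayden", "bourne", "jason"]

# Python compares str values code point by code point.
by_code_point = sorted(names)

# Sorting the UTF-8 encoded bytes yields the same order, because UTF-8
# preserves code point order when compared byte by byte.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_code_point)
print(by_utf8_bytes == by_code_point)
```

"â hayden" ends up last because â (U+00E2) has a higher code point than every unaccented ASCII letter.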

Related

MySQL doesn't distinguish between characters "c" and "ç" in UNIQUE index

These two queries give me the exact same result:
select * from topics where name='Harligt';
select * from topics where name='Härligt';
How is this possible? It seems like MySQL translates åäö to aao when it searches. Is there some way to turn this off?
I use UTF-8 encoding everywhere as far as I know. The same problem occurs both from the terminal and from PHP.
Yes, this is standard behaviour in the non-language-specific unicode collations.
9.1.13.1. Unicode Character Sets
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
See also Examples of the effect of collation
You need to either:
use a collation that doesn't have this "feature" (namely utf8_bin, but that has other consequences), or
use a different collation for the query only. This should work:
select * from topics where name='Harligt' COLLATE utf8_bin;
It becomes more difficult if you want to do a case-insensitive LIKE but not have the Ä = A umlaut conversion. I know of no MySQL collation that is case insensitive and does not do this kind of implicit umlaut conversion. If anybody knows one, I'd be interested to hear about it.
Related:
Looking for case insensitive MySQL collation where “a” != “ä”
MYSQL case sensitive search for utf8_bin field
Since you are in Sweden I'd recommend using the Swedish collation. Here's an example showing the difference it makes:
CREATE TABLE topics (name varchar(100) not null) CHARACTER SET utf8;
INSERT INTO topics (name) VALUES ('Härligt');
-- With the table's default utf8 collation, both spellings match:
select * from topics where name='Harligt';
-- returns 'Härligt'
select * from topics where name='Härligt';
-- returns 'Härligt'
-- Switch the column to the Swedish collation:
ALTER TABLE topics MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci;
select * from topics where name='Harligt';
-- returns no rows
select * from topics where name='Härligt';
-- returns 'Härligt'
Note that in this example I only changed the one column to Swedish collation, but you should probably do it for your entire database, all tables, all varchar columns.
While collations are one way of solving this, the much more straightforward way seems to me to be the BINARY keyword:
SELECT 'a' = 'ä', BINARY 'a' = 'ä'
will return 1|0
In your case:
SELECT * FROM topics WHERE BINARY name='Härligt';
See also https://www.w3schools.com/sql/func_mysql_binary.asp
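As a rough illustration of why the two comparisons differ, here is a Python sketch. The `fold` helper is hypothetical and only approximates what an accent- and case-insensitive collation does (it is not MySQL's actual weight table); the byte comparison mirrors what BINARY or utf8_bin does.

```python
import unicodedata


def fold(text):
    """Drop combining accents and lowercase; an approximation only."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


# Collation-like comparison: accents and case are ignored.
accent_insensitive = fold("Harligt") == fold("Härligt")

# Binary comparison: the encoded bytes must be identical.
binary = "Harligt".encode("utf-8") == "Härligt".encode("utf-8")

print(accent_insensitive, binary)
```

Note that the binary comparison is also case sensitive, which is one of the "other consequences" mentioned above.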
You want to check your collation settings; collation is the property that determines which characters are treated as identical.
These two pages should help you:
http://dev.mysql.com/doc/refman/5.1/en/charset-general.html
http://dev.mysql.com/doc/refman/5.1/en/charset-mysql.html
Here you can see some collation charts: http://collation-charts.org/mysql60/. I'm not sure which one corresponds to utf8_general_ci, though.
Here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html

Get same MySQL search results using foreign and English characters

We have a MySQL database containing a table of authors. Some of the authors' names have non-English characters in them (for example, LÜTTGE).
Our client wants users to be able to find such records even if they don't enter the non-English character. So in the above example "LUTTGE" should also find that result. At the moment it only works if the user searches for the name using the non-English character, so "LÜTTGE" works but "LUTTGE" returns nothing.
The frontend to this is a web application written in CakePHP 2
Does anyone have any ideas on how to do this as I'm at a loss? Ideally we want to be able to do this within CakePHP/MySQL, and not use third party search systems.
The above is just one example in a table of thousands of records. So it's not just a case of substituting "U" with "Ü" - there are many other variants.
This can be handled by using the MySQL collation system.
For example, the following query returns a true (1) value:
SELECT 'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'
Accordingly, if you set the column's character set to utf8 and its collation to utf8_general_ci you will get the result you mention with umlaut characters.
The default collation in MySQL reflects its Swedish origin and has historically been latin1_swedish_ci. In Swedish, Ü and U are not the same letter. You probably have used the default collation for your columns.
The utf8_general_ci collation handles matching 'Eßen' to 'Esen' but not to 'Essen'. It handles matching 'LÜTTGE' to 'LUTTGE' but not to 'Luettge', unfortunately.
On the other hand, the utf8_german2_ci collation matches 'Eßen' to 'Essen' and 'LÜTTGE' to 'LUETTGE'. If your users are accustomed to using ASCII transliterations of German characters you may wish to explore your choices here. One of them is to use a query with OR
SELECT whatever
FROM authors  -- hypothetical table name; the bare word "table" is reserved in MySQL
WHERE ( namen COLLATE utf8_general_ci = 'LUTTGE'
OR namen COLLATE utf8_german2_ci = 'LUTTGE' );
It can get more complex if you need to handle Spanish, because Ñ is considered a different letter from N. You may need to do some explaining for your users.
Marcus suggested using the utf8_unicode_ci collation. That will handle things partially too. Here are the cases:
case                     type           utf8_general_ci  utf8_german2_ci  utf8_unicode_ci  utf8_spanish_ci
'Eßen' to 'Esen'         substitute     match            no match         no match         no match
'Eßen' to 'Essen'        transliterate  no match         match            match            match
'LÜTTGE' to 'LUTTGE'     substitute     match            no match         match            match
'LÜTTGE' to 'LUETTGE'    transliterate  no match         match            no match         no match
'Niño' to 'Nino'         transliterate  match            match            match            no match
So you still need some extra work to handle transliterations.
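One possible shape for that extra work, sketched in Python rather than in CakePHP or SQL: compute both a "substitute" key and a "transliterate" key and accept a match on either, which is the same idea as the OR query above. The function names and the mapping table are hypothetical, and the keys only approximate utf8_general_ci and utf8_german2_ci; they are not MySQL's real collation weights.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def substitute_key(text):
    """general_ci-like key: ß becomes s, accents are dropped."""
    return strip_accents(text.replace("ß", "s")).lower()


# Assumed German transliteration table (illustrative only).
GERMAN2 = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate_key(text):
    """german2_ci-like key: umlauts expand to vowel + e, ß to ss."""
    lowered = unicodedata.normalize("NFC", text).lower()
    for src, dst in GERMAN2.items():
        lowered = lowered.replace(src, dst)
    return strip_accents(lowered)


def matches(stored, typed):
    """True if the typed name matches the stored one under either key."""
    return (
        substitute_key(stored) == substitute_key(typed)
        or transliterate_key(stored) == transliterate_key(typed)
    )


print(matches("LÜTTGE", "LUTTGE"), matches("LÜTTGE", "Luettge"))
```

In a real application the two keys could be stored in extra indexed columns, so the search does not have to recompute them per row; that is a design suggestion, not something the original answer prescribes.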

mysql collate latin1_german1_ci not working with order by

I have a MySQL database where I need to perform a search on a varchar column. All data is encoded in latin1. Sometimes these columns have western accented characters in them (for me, almost always French). Using the default collation (latin1_swedish_ci) has always worked fine for me. But now I have a problem with some data containing umlauts. If I search for "nusserhof" I want MySQL to return "nüsserhof", but it does not. Changing the collation to latin1_german1_ci solves the problem in the simplest sense; for instance, this query works, returning all rows containing the word "nüsserhof":
select * from mytable where mycolumn like '%nusserhof%' collate latin1_german1_ci;
But if I add an order by clause it no longer works. This doesn't return any rows containing the word "nüsserhof":
select * from mytable where mycolumn like '%nusserhof%' order by mycolumn collate latin1_german1_ci;
Surprisingly, I can't find anything here or through Google about this. Is this expected behavior? As a workaround I'm just dropping the order by and sorting after the select in PHP. But it seems like I should be able to get it to work.
Is this expected behavior?
Yes, it is.
In Swedish, the glyph ü represents the letter tyskt y ("German Y") and thus under latin1_swedish_ci it is a variation of the letter y rather than u. If, applying that collation, you were to search where mycolumn like '%nysserhof%', your record containing nüsserhof would be returned.
In German, the glyph ü represents an accented variation (specifically an umlaut) of the base glyph and thus under latin1_german1_ci it is a variation of the letter u as expected. Thus you obtain the desired results when running your search under this collation.
It is because of local differences of this sort that we must choose appropriate collations for our data: no single collation can always be appropriate in the general case.
The problem that you observe when applying ORDER BY results from a misunderstanding of the COLLATE keyword: it is not part of the SELECT command (such that it instructs MySQL to use that collation for all comparisons within the command); rather, it is part of the immediately preceding string (such that it instructs MySQL to use that explicit collation for the immediately preceding string only).
That is, in your first case, the explicit latin1_german1_ci collation is applied to the '%nusserhof%' string literal with a coercibility of 0; the collation of mycolumn (which is presumably latin1_swedish_ci) has a coercibility of 2. Since the former has a lower value, it is used when evaluating the expression.
In your second case, the explicit latin1_german1_ci collation is applied to mycolumn within the ORDER BY clause: thus the sorted results will place 'nüsserhof' between 'nu' and 'nv' instead of between 'ny' and 'nz'. However the explicit collation no longer applies to the filter expression within the WHERE clause, and so the column's default collation will apply.
If the data in mycolumn is all in the German language, you can simply change its default collation and no longer worry about specifying explicit collations within your SQL commands:
ALTER TABLE mytable MODIFY mycolumn <type> COLLATE latin1_german1_ci
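The point that COLLATE binds to a single expression, not to the whole statement, can be modelled loosely in Python: the filter and the sort each take their own key function, and changing one does not affect the other. The two fold functions below are hypothetical stand-ins for latin1_swedish_ci (ü treated like y) and latin1_german1_ci (ü treated like u), handling only the one character relevant here.

```python
def fold_swedish_like(text):
    """Stand-in for latin1_swedish_ci: ü is a variant of y."""
    return text.lower().replace("ü", "y")


def fold_german_like(text):
    """Stand-in for latin1_german1_ci: ü is a variant of u."""
    return text.lower().replace("ü", "u")


def search(rows, needle, where_fold, order_fold):
    """Filter with one 'collation' and sort with another, independently."""
    hits = [r for r in rows if where_fold(needle) in where_fold(r)]
    return sorted(hits, key=order_fold)


rows = ["weingut nüsserhof", "nyborg", "nussbaum"]

# German collation in the filter (like the first query): the row is found.
found = search(rows, "nusserhof", fold_german_like, fold_swedish_like)

# German collation only in the sort (like the second query): the filter
# still uses the Swedish-like default, so nothing matches.
missing = search(rows, "nusserhof", fold_swedish_like, fold_german_like)

print(found, missing)
```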

MySQL WHERE `character` = 'a' is matching a, A, Ã, etc. Why?

I have the following query in MySQL:
SELECT id FROM unicode WHERE `character` = 'a'
The table unicode contains each Unicode character along with an ID (its integer code point value). Since the collation of the table is set to utf8_unicode_ci, I would have expected the above query to return only 97 (the letter 'a'). Instead, it returns 119 rows containing the IDs of many 'a'-like letters:
a A Ã ...
It seems to be ignoring both case and the multi-byte nature of the characters.
Any ideas?
As documented under Unicode Character Sets:
MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The full collation chart makes clear that, in this collation, most variations of a base letter are equivalent irrespective of their lettercase or accent/decoration.
If you want to only match exact letters, you should use a binary collation such as utf8_bin.
The collation of the table is part of the issue; MySQL with a _ci collation is treating all of those 'a's as variants of the same character.
Switching to a _cs collation will force the engine to distinguish 'a' from 'A', and 'á' from 'Á', but it may still treat 'a' and 'á' as the same character.
If you need exact comparison semantics, completely disregarding the equivalency of similar characters, you can use the BINARY comparison operators
SELECT id FROM unicode WHERE BINARY `character` = 'a'
The ci in the collation means case-insensitive. Switch to a case-sensitive collation (cs) to get the results you're looking for.
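To get a feel for why so many rows come back, here is a Python sketch that scans a small range of code points and compares a folded match with an exact byte match. The `fold` helper is a hypothetical approximation of an accent- and case-insensitive comparison; it does not reproduce the UCA weights that utf8_unicode_ci uses, so the exact set of matches will differ from MySQL's.

```python
import unicodedata


def fold(ch):
    """Drop combining accents and lowercase; an approximation only."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return base.lower()


# Basic Latin through Latin Extended-B, skipping control characters.
candidates = [chr(cp) for cp in range(0x20, 0x250)]

# Collation-like match: every character whose base letter is 'a'.
folded_matches = [ch for ch in candidates if fold(ch) == "a"]

# Binary match: only the character whose UTF-8 bytes equal b"a".
exact_matches = [ord(ch) for ch in candidates if ch.encode("utf-8") == b"a"]

print(folded_matches)
print(exact_matches)
```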

How to interpret a column as having a different character set per query?

I need to interface with a database for which I cannot change the collation and charset.
However, I would like to pick some binary data from it, interpret it as if it were UTF8 and then do an UPPER on it (since just doing UPPER() on binary returns the raw value).
I would assume that this works:
SELECT UPPER(Filename.Name) COLLATE utf8_general_ci FROM Filename;
but it doesn't and complains that
COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
which is fair enough, I need some incantation to cast the binary field as being utf-8. How do I do a select which gives me a computed column with the right character set assigned to it?
Ok figured it out. For modern MySQL versions you can use CAST, and for older ones CONVERT (which is actually standard SQL).
SELECT UPPER(CONVERT(BINARY(Filename.Name) USING utf8)) FROM Filename;
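The same idea, shown as a Python analogy: raw bytes have no character set, so case mapping cannot reach the non-ASCII letters until the bytes are decoded as UTF-8. The analogy is not exact, since Python's bytes.upper() still uppercases ASCII letters, whereas the question reports that UPPER() on a binary column returns the raw value. The filename is a made-up example.

```python
# Hypothetical filename stored as raw UTF-8 bytes.
raw = "härligt.txt".encode("utf-8")

# Case mapping on bytes only touches ASCII letters; the two bytes
# that encode "ä" are left as they are.
as_bytes = raw.upper()

# Decoding first (the analogue of CONVERT ... USING utf8) lets the
# case mapping see "ä" as a character.
as_text = raw.decode("utf-8").upper()

print(as_bytes)
print(as_text)
```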
I think you're looking for:
SELECT UPPER(Filename.Name COLLATE utf8_general_ci) FROM Filename;
But I'm not sure because I don't have a broken database to test with.