Get same MySQL search results using foreign and English characters - mysql

We have a MySQL database containing a table of authors. Some of the authors' names contain non-English characters (for example, LÜTTGE).
Our client wants users to be able to find such records even if they don't enter the non-English character. So in the above example "LUTTGE" should also find that result. At the moment it only works if the user searches for the name using the non-English character, so "LÜTTGE" works but "LUTTGE" returns nothing.
The frontend to this is a web application written in CakePHP 2.
Does anyone have any ideas on how to do this as I'm at a loss? Ideally we want to be able to do this within CakePHP/MySQL, and not use third party search systems.
The above is just one example in a table of thousands of records. So it's not just a case of substituting "U" with "Ü" - there are many other variants.

This can be handled by using the MySQL collation system.
For example, the following query returns a true (1) value:
SELECT 'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'
Accordingly, if you set the column's character set to utf8 and its collation to utf8_general_ci you will get the result you mention with umlaut characters.
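MySQL's long-standing server default, the latin1 character set with the latin1_swedish_ci collation, reflects its Swedish origin (the default collation for the utf8 character set itself is utf8_general_ci). In the Swedish collations, such as utf8_swedish_ci, Ü and U are not the same letter. Your columns probably use one of those collations.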
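To see which collation a column currently uses, and to change it, something along these lines should work. This is a sketch only: the table name authors, the column name name and the VARCHAR(255) type are assumptions, not taken from the question. MODIFY replaces the whole column definition, so repeat the real type and any NOT NULL or DEFAULT attributes.

```sql
-- Show the character set and collation of every column
-- ("authors" is a hypothetical table name).
SHOW FULL COLUMNS FROM authors;

-- Switch one column to utf8 / utf8_general_ci.
-- The column name and VARCHAR(255) are placeholders for the real definition.
ALTER TABLE authors
  MODIFY name VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;
```

Back up the table before running the ALTER, because changing a column's character set rewrites its stored values.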
The utf8_general_ci collation handles matching 'Eßen' to 'Esen' but not to 'Essen'. It handles matching 'LÜTTGE' to 'LUTTGE' but not to 'Luettge', unfortunately.
On the other hand, the utf8_german2_ci collation matches 'Eßen' to 'Essen' and 'LÜTTGE' to 'LUETTGE'. If your users are accustomed to using ASCII transliterations of German characters, you may wish to explore your choices here. One of them is to use a query with OR:
SELECT whatever
FROM table
WHERE ( namen COLLATE utf8_general_ci = 'LUTTGE'
OR namen COLLATE utf8_german2_ci = 'LUTTGE' )
It can get more complex if you need to handle Spanish, because Ñ is considered a different letter from N. You may need to do some explaining for your users.
Marcus suggested using the utf8_unicode_ci collation. That will handle things partially too. Here are the cases:
comparison              type           utf8_general_ci  utf8_german2_ci  utf8_unicode_ci  utf8_spanish_ci
'Eßen' to 'Esen'        substitute     match            no match         no match         no match
'Eßen' to 'Essen'       transliterate  no match         match            match            match
'LÜTTGE' to 'LUTTGE'    substitute     match            no match         match            match
'LÜTTGE' to 'LUETTGE'   transliterate  no match         match            no match         no match
'Niño' to 'Nino'        transliterate  match            match            match            no match
So you still need some extra work to handle transliterations.
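You can check individual cases from the list above on your own server before deciding on a collation. The following is a minimal sketch, assuming the connection character set is utf8 so that the utf8_* collations are valid for the string literals. The column aliases are made up for readability.

```sql
-- Make the connection use utf8 so the literals below are utf8 strings.
SET NAMES utf8;

-- Each expression returns 1 for a match and 0 for no match.
SELECT
  'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'  AS general_substitute,
  'LÜTTGE' COLLATE utf8_german2_ci = 'LUETTGE' AS german2_transliterate,
  'LÜTTGE' COLLATE utf8_unicode_ci = 'LUTTGE'  AS unicode_substitute,
  'Niño'   COLLATE utf8_spanish_ci = 'Nino'    AS spanish_substitute;
```

Going by the cases listed above, the first three columns should come back as 1 and the last as 0. If your server disagrees, trust the server.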

Related

MySQL doesn't distinguish between characters "c" and "ç" in UNIQUE index [duplicate]

These two queries give me exactly the same result:
select * from topics where name='Harligt';
select * from topics where name='Härligt';
How is this possible? It seems that MySQL translates åäö to aao when it searches. Is there some way to turn this off?
I use UTF-8 encoding everywhere as far as I know. The same problem occurs both from the terminal and from PHP.
Yes, this is standard behaviour in the non-language-specific unicode collations.
9.1.13.1. Unicode Character Sets
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
See also Examples of the effect of collation
You need to either
use a collation that doesn't have this "feature" (namely utf8_bin, but that has other consequences, such as case-sensitive comparisons), or
use a different collation for the query only. This should work:
select * from topics where name='Harligt' COLLATE utf8_bin;
It becomes more difficult if you want to do a case-insensitive LIKE but not have the Ä = A umlaut conversion. I know of no MySQL collation that is case insensitive and does not do this kind of implicit umlaut conversion. If anybody knows one, I'd be interested to hear about it.
Related:
Looking for case insensitive MySQL collation where “a” != “ä”
MYSQL case sensitive search for utf8_bin field
Since you are in Sweden I'd recommend using the Swedish collation. Here's an example showing the difference it makes:
CREATE TABLE topics (name varchar(100) not null) CHARACTER SET utf8;
INSERT topics (name) VALUES ('Härligt');
select * from topics where name='Harligt';
'Härligt'
select * from topics where name='Härligt';
'Härligt'
ALTER TABLE topics MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci;
select * from topics where name='Harligt';
<no results>
select * from topics where name='Härligt';
'Härligt'
Note that in this example I only changed the one column to Swedish collation, but you should probably do it for your entire database, all tables, all varchar columns.
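As a rough sketch of what changing it more widely could look like: the first statement converts every character column of one table, and the second changes only the default for tables created later. The database name mydb is a placeholder, not something from the question.

```sql
-- Convert all character columns of one table and set the table default.
-- This rewrites the table, so take a backup first.
ALTER TABLE topics CONVERT TO CHARACTER SET utf8 COLLATE utf8_swedish_ci;

-- Change the database default ("mydb" is a hypothetical name).
-- Existing tables keep their current collation; only new tables are affected.
ALTER DATABASE mydb CHARACTER SET utf8 COLLATE utf8_swedish_ci;
```

Every existing table needs its own CONVERT TO statement, because the database-level default does not touch tables that already exist.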
While collations are one way of solving this, the much more straightforward way seems to me to be the BINARY keyword:
SELECT 'a' = 'ä', BINARY 'a' = 'ä'
will return 1|0
In your case:
SELECT * FROM topics WHERE BINARY name='Härligt';
See also https://www.w3schools.com/sql/func_mysql_binary.asp
You want to check your collation settings; the collation is the property that determines which characters are treated as identical.
These two pages should help you:
http://dev.mysql.com/doc/refman/5.1/en/charset-general.html
http://dev.mysql.com/doc/refman/5.1/en/charset-mysql.html
Here you can see some collation charts: http://collation-charts.org/mysql60/. I'm not sure which chart corresponds to utf8_general_ci, though.
Here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html

mysql collation: case-preserving, case-insensitive but accent-sensitive [duplicate]

How can I perform accent-sensitive but case-insensitive utf8 search in mysql? Utf8_bin is case sensitive, and utf8_general_ci is accent insensitive.
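If you want to distinguish "café" from "cafe", you may use:
Select word from table_words WHERE Hex(word) LIKE Hex("café");
This way it will return only 'café'.
Otherwise, if you use:
Select word from table_words WHERE Hex(word) LIKE Hex("cafe");
it will return only 'cafe'. Note that comparing HEX values is also case sensitive, because the bytes of upper-case and lower-case letters differ.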
I'm using the latin1_german2_ci Collation.
There doesn't seem to be one because case sensitivity is tough to do in Unicode.
There is a utf8_general_cs collation but it seems to be experimental, and according to this bug report, doesn't do what it's expected to when using LIKE.
If your data consists of western umlauts only (i.e. umlauts that are included in ISO-8859-1), you might be able to collate your search operation to latin1_german2_ci or create a separate search column with it (that specific collation is accent sensitive according to this page; latin1_general_ci might be as well, I don't know and can't test right now).
You can use HEX to make the search accent-sensitive, then add LCASE to make it case insensitive again. That gives:
SELECT name FROM people WHERE HEX(LCASE(name)) = HEX(LCASE("René"))
Note that this prevents MySQL from using an index on the column. If you want to avoid a full table scan and you have an index on "name", also search for the same value without HEX and LCASE:
SELECT name FROM people WHERE name = "René" and HEX(LCASE(name)) = HEX(LCASE("René"))
This way the index on "name" is used first to narrow the candidates to, for example, only the rows "René" and "Rene", and the HEX comparison then runs on just those rows instead of on the whole table.
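To see why the HEX comparison tells the two spellings apart, it can help to look at the byte values directly. This is a small check, assuming a utf8 connection; in latin1 the accented letter would be a single byte, so the hex strings would differ from the ones in the comments.

```sql
-- Assumes utf8 literals; in utf8 "é" is stored as the two bytes C3 A9.
SET NAMES utf8;

SELECT
  HEX(LCASE('René')) AS accented,    -- 72656EC3A9
  HEX(LCASE('RENE')) AS unaccented;  -- 72656E65
```

LCASE removes the case difference before the bytes are compared, while the accent survives as a different byte sequence. That combination is what makes the comparison case insensitive but accent sensitive.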

mysql collate latin1_german1_ci not working with order by

I have a MySQL database where I need to perform a search on a varchar column. All data is encoded in latin1. Sometimes these columns contain western accented characters (for me, almost always French). Using the default collation (latin1_swedish_ci) has always worked fine for me, but now I have a problem with some data containing umlauts. If I search for "nusserhof" I want MySQL to return "nüsserhof", but it does not. Changing the collation to latin1_german1_ci solves the problem in the simplest sense; for instance, this query works, returning all rows containing the word "nüsserhof":
select * from mytable where mycolumn like '%nusserhof%' collate latin1_german1_ci;
But if I add an order by clause it no longer works. This doesn't return any rows containing the word "nüsserhof":
select * from mytable where mycolumn like '%nusserhof%' order by mycolumn collate latin1_german1_ci;
Surprisingly, I can't find anything here or through google about this. Is this expected behavior? As a work around I'm just dropping the order by, and sorting after the select in PHP. But it seems like I should be able to get it to work.
Is this expected behavior?
Yes, it is.
In Swedish, the glyph ü represents the letter tyskt y ("German Y") and thus under latin1_swedish_ci it is a variation of the letter y rather than u. If, applying that collation, you were to search where mycolumn like '%nysserhof%', your record containing nüsserhof would be returned.
In German, the glyph ü represents an accented variation (specifically an umlaut) of the base glyph and thus under latin1_german1_ci it is a variation of the letter u as expected. Thus you obtain the desired results when running your search under this collation.
It is because of local differences of this sort that we must choose appropriate collations for our data: no single collation can always be appropriate in the general case.
The problem that you observe when applying ORDER BY results from a misunderstanding of the COLLATE keyword: it is not part of the SELECT command (such that it instructs MySQL to use that collation for all comparisons within the command); rather, it is part of the immediately preceding string (such that it instructs MySQL to use that explicit collation for the immediately preceding string only).
That is, in your first case, the explicit latin1_german1_ci collation is applied to the '%nusserhof%' string literal with a coercibility of 0; the collation of mycolumn (which is presumably latin1_swedish_ci) has a coercibility of 2. Since the former has a lower value, it is used when evaluating the expression.
In your second case, the explicit latin1_german1_ci collation is applied to mycolumn within the ORDER BY clause: thus the sorted results will place 'nüsserhof' between 'nu' and 'nv' instead of between 'ny' and 'nz'. However the explicit collation no longer applies to the filter expression within the WHERE clause, and so the column's default collation will apply.
If the data in mycolumn is all in the German language, you can simply change its default collation and no longer worry about specifying explicit collations within your SQL commands:
ALTER TABLE mytable MODIFY mycolumn <type> COLLATE latin1_german1_ci
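If changing the column's default collation is not an option, the explanation above suggests stating the collation explicitly in each clause where it matters. Here is a minimal sketch using the table and column names from the question; placing COLLATE on the column rather than on the literal is my choice here, not something the question requires.

```sql
SELECT *
FROM mytable
-- Filter under German rules, so that 'ü' is compared as a variant of 'u'.
WHERE mycolumn COLLATE latin1_german1_ci LIKE '%nusserhof%'
-- Sort under German rules as well; this COLLATE affects only the ordering.
ORDER BY mycolumn COLLATE latin1_german1_ci;
```

Each COLLATE applies only to the expression it follows, which is why it has to appear in both the WHERE and the ORDER BY clause.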

Mysql order by on column with unicode characters

I am running a select query on mysql table and trying to order it by the "name" column in the table.
The name column contains both plain English names and names with accented Latin characters such as â.
I am running into the following problem. For example, if the name column contains "archer", "aaakash", "â hayden", "bourne" and "jason", the query returns the results in this order:
"aaakash", "archer", "â hayden", "bourne", "jason"
However I want to order it based on unicode code points (like below)
"aaakash", "archer", "bourne", "jason", "â hayden"
(See the difference in the position of â hayden in the orders)
What can I do to order the results based on the character's position in unicode character set?
However I want to order it based on unicode code points (like below)
To sort by Unicode code point, you probably need to use the utf8_bin collation. More precisely, the _bin suffix indicates that strings are compared by the binary value of each character.
To override the default collation while ordering, use ORDER BY ... COLLATE. To paraphrase the documentation:
SELECT k
FROM t1
ORDER BY k COLLATE utf8_bin;
If your text column does not use utf8 encoding, you will have to CONVERT it:
SELECT k
FROM t1
ORDER BY CONVERT(k USING utf8) COLLATE utf8_bin;
Note that I used utf8 here as an example because it is the most common Unicode encoding. Your MySQL server probably supports other Unicode character sets as well, such as ucs2 or utf16.
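Applied to the sample names from the question, the idea looks roughly like this. The table name names_demo is made up for the illustration, and the sketch assumes a utf8 connection.

```sql
-- Hypothetical demo table holding the sample names from the question.
CREATE TABLE names_demo (name VARCHAR(100) NOT NULL) CHARACTER SET utf8;

INSERT INTO names_demo (name)
VALUES ('archer'), ('aaakash'), ('â hayden'), ('bourne'), ('jason');

-- Binary comparison: 'â' is encoded as C3 A2 in utf8, which sorts after
-- every ASCII letter, so the accented name comes last.
SELECT name FROM names_demo ORDER BY name COLLATE utf8_bin;
-- Expected order: aaakash, archer, bourne, jason, â hayden
```

Because utf8_bin compares raw values, upper-case ASCII letters also sort before all lower-case ones, which may or may not be what you want for mixed-case names.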
