utf8_unicode_ci: How to support expansions with "Like"? - mysql

It states in the docs:
utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”.
I tried this out with an utf8mb4_unicode_ci collated table, with a row that has the name Heß.
select * from users where name = "Hess"
return the row with Heß. However, when I do the same with the like syntax
select * from users where name LIKE "%Hess%"
it returns no match.
Why does this happen and how does one fix it?

Related

MySQL collation query results

does anyone has explaination why :
SELECT * FROM MY_TABLE WHERE 1 = 1 AND libelle COLLATE latin1_general_ci LIKE '%dég%'
returns 1 record (only the record with é) while
SELECT * FROM MY_TABLE WHERE 1 = 1 AND libelle COLLATE latin1_swedish_ci LIKE '%dég%'
returns 4 record (of course including the one above) ?
According to MySQL doc latin1_general_ci is "Multilingual (Western European) case insensitive" so should not it manage accents like latin1_swedish_ci ?
Thanks
Nicolas
I suspect you're misunderstanding what collation is.
Collation is a set of rules used in natural language (Swedish, English, Russian, Japanese...) to determine the dictionary order of words. In relational databases, this is used to sort data (e.g. ORDER BY clauses) and to compare data (e.g. WHERE clauses or unique indexes). A couple of examples:
If you need to order by country name in English you get this:
Canada
China
Colombia
However, in traditional Spanish ch used to be an independent letter so correct order was this:
Canada
Colombia
China
In Swedish, å is a separate letter so you can have a login name like ångström even if you already have angström. In other languages they'd be duplicates and would not be allowed.
Collation is not something you use to display emojis and other Unicode characters. That's just encoding (ISO-8859-1, UTF-8, UTF-16... whatever).

Find accented and non-accented variations of same word

I have a database table which represent people and the records have people's names in them. Some of the names have accented characters in them. Some do not. Some are non-accented duplicates of the accented version.
I need to generate a report of all of the potential duplicates by finding names that are the same (first, middle, last) except for the accents so that someone else can go through this list and verify which are true duplicates, and which are actually different people (I'm assuming they have some other way of knowing).
For example: Jose DISTINCT-LAST-NAME and José DISTINCT-LAST-NAME should be picked up as potential duplicates because they have the same characters, but one has an accented character.
How can this type of query by written in MySQL?
This question: How to remove accents in MySQL? is not the same. It is asking about de-accenting strings in-place and the poster already has a second column of data that has been de-accented. Also, the accepted answer to that question is to set the character set and collation. I have already set the character set and collation.
I am trying to generate a report that finds strings in different records that are the same except for their accents.
I found your question very interesting.
According to this article Accents in text searches, using "like" condition with some character collation adjustments will solve your problem. I have not tested this solution, so if it helps you, please come back and tell us.
Here is a similar question: Accent insensitive search query in MySQL,
according to that, you can use something like:
where 'José' like 'Jose' collate utf8_general_ci
Well, I found something that seems to work (the real query involves a few more other fields, but the same basic idea):
select distinct p1.person_id, p1.first_name, p1.last_name, p2.last_name
from people as p1, people as p2
where binary p1.last_name <> binary p2.last_name
and p1.last_name = p2.last_name
and p1.first_name = p2.first_name
order by p1.last_name, p1.first_name, p2.last_name, p2.first_name;
The results look like this:
12345 Bob Jose José
56789 Bob José Jose
...
This makes sense as there are 2 records for Bob José and I know that in this case, it is the same person but one record is missing the accent.
The trick is to do a binary and non-binary compare on the "last_name" field as well as matching on all other fields. This way we can find everything that is "equal" and also not binary-equal. This works because with the current character-set/collation (utf8/utf8_general_ci), Jose and José are equal but are not binary-equal. you can try it out like this:
select 'Jose' = 'José', 'Jose' like 'José', binary 'Jose' = binary 'José';
The Bane of Character Encodings
There are a wide variety of character-sets and encodings that may be used in MySQL, and when dealing with encoding it is important to learn what you can about them. In particular, take a close look at the differences between:
utf8_unicode_ci
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_general_ci
Some character sets are built to include as many printable characters as possible, to support a wider range of uses, while others are built with the intent of portability and compatibility between systems. In particular, utf8_unicode_ci maps most accented characters to non-accented equivalents. Alternatively, you could use uft8_ascii_ci which is even more restrictive.
Take a look at the utf8_unicode_ci collation chart, and What's the difference between utf8_general_ci and utf8_unicode_ci .
The best answer is from a similar question, "How to remove accents in MySQL?"
If you set an appropriate collation for the column then the value
within the field will compare equal to its unaccented equivalent
naturally.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
| 1 |
+------------+
1 row in set (0.05 sec)
How to apply this to your situation?
SELECT id, last-name
FROM people
WHERE last-name COLLATE utf8_unicode_ci IN
(
SELECT last-name
FROM people
GROUP BY last-name COLLATE utf8_unicode_ci
HAVING COUNT(last-name)>1
)

MySQL Select with alpha wildcards not working

I have a simple table of a single column with rows of char(12) like:
DRF4482
DRF4497
DRF451
DRF4515
EHF452
FJF453
GKF4573
I want to select all of the rows that are between D and F, and have 4 numbers at the end. Like DRF4482, DRF4497, DRF4515, etc. I've tried a number of different wildcard combinations but I get no rows. I'm using:
SELECT * FROM `expired` WHERE id like '%[D-F][A-Z][A-Z]____';
I've even tried to broaden it to:
SELECT * FROM `expired` WHERE id like '%[D-F]%';
and that returns nothing as well.
I've even tried COLLATE latin1_bin based on some other posts but that didn't work either. My table is utf8, but I've created a second table as latin1 and tried a few different collations with the same results - no rows.
Where is my error?
You need to use REGEXP instead of LIKE. Notice that the syntax is a little different; it doesn't do anything with the SQLish % wildcard characters.
So, you want
id REGEXP '[D-F][A-Z][A-Z][0-9]{4}'
for this app. Hopefully you don't have multibyte characters in these strings, because MySQL's regexp doesn't work correctly in those circumstances.

sql : select uppercase columns in database

I need to select columns that end in an uppercase extension.
for example, look at the following table.
id - picture
1 - abc.JPG
2 - def.jpg
3 - 123.jpg
4 - xyz.JPG
the results should give me the rows 1 and 4 because JPG are in uppercase.
can anyone help?
I'm far from an expert, but case sensitivity has hung me up before. Are you able to modify the structure of your table? One thing that may help is changing the collation of the table as follows (SQLFiddle here):
CREATE TABLE pics (id INT, picture VARCHAR(200))
CHARACTER SET latin1 COLLATE latin1_general_cs;
INSERT INTO pics VALUES
(1, 'abc.JPG'),
(2, 'def.jpg'),
(3, '123.jpg'),
(4, 'xyz.JPG')
The _cs stands for case sensitive, and I believe the default is case insensitive, which makes case-based comparisons a bit trickier. You can then use the following query to get your rows:
SELECT *
FROM pics
WHERE picture REGEXP '\.[[:upper:]+]$'
If you do not have access to your underlying table, you could try the following, which casts the column in a different character set (latin1), and then changes the collation to support case-insensitive comparisons (SQLFiddle here):
SELECT *
FROM pics
WHERE CAST(picture AS CHAR CHARACTER SET latin1)
COLLATE latin1_general_cs REGEXP '\.[[:upper:]+]$'
Some REGEXPs:
'[.][[:upper:]]+$' -- just uppercase letters
'[.].*[[:upper:]]' -- at least one uppercase letter
'[.].*[[:lower:]]' -- at least one lowercase letter; AND together with previous to get upper and lower, etc.
If there could be two "." in the filename, then consider using
SUBSTRING_INDEX(picture, '.', -1)
to isolate the 'extension'.
Most SQL languages have a UCASE or UPPER function to convert text to uppercase. I'm also taking advantage of the RIGHT function which isn't in all SQL dialects. If your SQL doesn't have a RIGHT function you'll have to futz with SUBSTRING and LENGTH to get the right three characters in picture.
Select id, picture
from table
where UPPER(RIGHT(TRIM(picture),3)) = RIGHT(TRIM(picture),3)
If the same text converted to uppercase is the same as the unconverted text then it is uppercase in the database and will be selected.
As it can not always be assumed that the file extension will be 3 letters you may use the following to get the first chart after the period and compare it to see if it is uppercase:
select * from table where SUBSTRING(picture,CHARINDEX('.',picture) + 1,1)
= upper(SUBSTRING(picture,CHARINDEX('.',picture) + 1,1)) collate SQL_Latin1_General_CP1_CS_AS

MySQL DB selects records with and without umlauts. e.g: '.. where something = FÖÖ'

My Table collation is "utf8_general_ci". If i run a query like:
SELECT * FROM mytable WHERE myfield = "FÖÖ"
i get results where:
... myfield = "FÖÖ"
... myfield = "FOO"
is this the default for "utf8_general_ci"?
What collation should i use to only get records where myfield = "FÖÖ"?
SELECT * FROM table WHERE some_field LIKE ('%ö%' COLLATE utf8_bin)
A list of the collations offered by MySQL for Unicode character sets can be found here:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
If you want to go all-out and require strings to be absolutely identical in order to test as equal, you can use utf8_bin (the binary collation). Otherwise, you may need to do some experimentation with the different collations on offer.
For scandinavian letters you can use utf8_swedish_ci fir example.
Here is the character grouping for utf8_swedish_ci. It shows which characters are interpreted as the same.
http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html
Here's the directory listing for other collations. I'm no sure which is the used utf8_general_ci though. http://collation-charts.org/mysql60/