Character class is not working for Arabic text column - mysql

By definition, the MySQL character class [...] matches any character within the brackets, so I used it for Arabic characters, but it gives me an empty set every time.
Here is my query:
select hadith_raw_ar from view_hadith_in_book where hadith_raw_ar like '%[بل]ت';

With older versions, you cannot use character classes with LIKE or RLIKE and non-latin1 character sets. (At least not and expect to get the right results.)
REGEXP is lame. It looks only at bytes; 6 bytes in your character class, some of which are duplicated. Here's the hex: D8 AA D8 A8 D9 84.
Sometimes you will happen to get the 'right' answer from REGEXP, and sometimes a false match. For example, SELECT '٪' REGEXP '[تبل]'; returns true. Note that I am testing for an Arabic percent sign, hex D9AA. I picked it because both of its bytes, D9 and AA, occur among the bytes of the characters in the class. MariaDB has a decent REGEXP.
The MySQL 8.0 manual implies that REGEXP might work correctly for Arabic. (But not for Emoji and some Chinese characters.) MariaDB has had PCRE built-in since 10.0.5.
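The false match described above can be reproduced outside the database. The following is a minimal sketch in Python, not MySQL itself: it assumes that a byte-wise regex engine treats the class as a set of individual UTF-8 bytes, and it uses Python's `re` module on `bytes` to imitate that. The variable names are illustrative only.

```python
import re

# The character class from the answer, and the Arabic percent sign it tests.
char_class = "تبل"
percent_sign = "٪"  # U+066A, UTF-8 bytes D9 AA

# Byte-wise matching: the class is built from the six UTF-8 bytes
# D8 AA D8 A8 D9 84, so any string containing one of those bytes matches.
byte_pattern = re.compile(b"[" + char_class.encode("utf-8") + b"]")
byte_match = byte_pattern.search(percent_sign.encode("utf-8")) is not None

# Character-wise matching: the class holds three code points,
# and the percent sign is none of them.
char_match = re.search("[" + char_class + "]", percent_sign) is not None

print(char_class.encode("utf-8").hex().upper())
print("byte-wise:", byte_match, "character-wise:", char_match)
```

The byte-wise search reports a match even though the percent sign is not one of the three letters, which is the kind of wrong answer the old REGEXP can give.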

By definition MySQL character class [...] matches any character within the brackets.
Umm, that's inaccurate. The character class is actually part of Regex, not MySQL. However, you can still use Regex with MySQL, of course, but you need to use the keyword REGEXP instead of LIKE.
Now, if you're trying to match anything that starts with any character represented in your character class, you should be using a regex pattern that looks something like ^[...] where you replace the ... with the characters you want.
So, in your case, you'd need something like this:
SELECT hadith_raw_ar FROM view_hadith_in_book WHERE hadith_raw_ar REGEXP '^[تبل]';
Which is equivalent to:
SELECT hadith_raw_ar
FROM view_hadith_in_book
WHERE hadith_raw_ar LIKE 'ت%' OR
hadith_raw_ar LIKE 'ب%' OR
hadith_raw_ar LIKE 'ل%';
..when not using Regex.
References:
Regex: Character Classes or Character Sets.
Using Regular Expressions with MySQL.

Use the utf8_general_ci collation to insert characters of any language.

Related

Regex pattern equivalent of %word% in mysql

I need 2 case-insensitive regex patterns. One of them is the equivalent of SQL's %, so %word%. My attempt at this was '^[a-zA-Z]*word[a-zA-Z]*$'.
Question 1: This seems to work, but I am wondering if this is the equivalent of %word%.
Finally, the last pattern is similar to %, but it must not match when there are 3 or more characters either before or after the word. So for example if the target word was word:
words = matched because it doesn't have 3 or more characters either before or after it.
swordfish = not matched because it has 3 or more characters after word
theword = not matched because it has 3 or more characters before it
mywordly = matched because it doesn't contain 3 or more characters before or after word.
miswordeds = not matched because it has 3 characters before it. (It also has 3 characters after it, but it had already met the criteria.)
Question 2: For the second regex, I am not very sure how to start this. I will be using the regex in a MySQL query using the REGEXP function for example:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]*word[a-zA-Z]*$'
First Question:
According to https://dev.mysql.com/doc/refman/8.0/en/string-comparison-functions.html#operator_like
With LIKE you can use the following two wildcard characters in the pattern:
% matches any number of characters, even zero characters.
_ matches exactly one character.
This means the regex '^[a-zA-Z]*word[a-zA-Z]*$' is not equivalent to %word%: % matches any characters, whereas [a-zA-Z]* matches only letters.
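To see where the two differ, here is a small sketch in Python rather than SQL. The helper `like_contains` is a hypothetical stand-in for LIKE '%word%' (a plain substring test that ignores collation rules), and the sample strings are made up for illustration.

```python
import re

# The pattern from the question.
letters_only = re.compile(r"^[a-zA-Z]*word[a-zA-Z]*$")

def like_contains(value, needle="word"):
    # Rough stand-in for LIKE '%word%': any characters may surround the needle.
    return needle in value

samples = ["swordfish", "pass-word 1", "my word!"]
comparison = {
    s: (like_contains(s), letters_only.search(s) is not None) for s in samples
}

for text, (like_result, regex_result) in comparison.items():
    print(text, "LIKE-style:", like_result, "regex:", regex_result)
```

Strings containing digits, spaces or punctuation around the word pass the substring test but fail the letters-only pattern.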
Second Question:
Change * to {0,2} to indicate you want to match at maximum 2 characters either before or after it:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]{0,2}word[a-zA-Z]{0,2}$'
And to make case insensitive:
SELECT 1 WHERE LOWER('SWORDFISH') REGEXP '^[a-z]{0,2}word[a-z]{0,2}$'
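The {0,2} version can be checked against the five examples from the question. This is a sketch in Python rather than MySQL; lowercasing the input mirrors the LOWER() call above, and the expected values are taken from the question's own list.

```python
import re

# At most two letters before and after the word, as in the query above.
bounded = re.compile(r"^[a-z]{0,2}word[a-z]{0,2}$")

# Expected outcomes as stated in the question.
expected = {
    "words": True,
    "swordfish": False,
    "theword": False,
    "mywordly": True,
    "miswordeds": False,
}

results = {w: bounded.search(w.lower()) is not None for w in expected}

for w, matched in results.items():
    print(w, "matched" if matched else "not matched")
```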
Assuming
The test string (or column) has only letters. (Hence, I can use . instead of [a-z]).
Case folding and accents are not an issue (presumably handled by a suitable COLLATION).
Either way:
WHERE x LIKE '%word%' -- found the word
AND x NOT LIKE '%___word%' -- but fail if too many leading chars
AND x NOT LIKE '%word___%' -- or trailing
WHERE x RLIKE '^.{0,2}word.{0,2}$'
I vote for RLIKE being slightly faster than LIKE -- only because there are fewer ANDs.
(MySQL 8.0 introduced incompatible regexp syntax; I think the syntax above works in all versions of MySQL/MariaDB. Note that I avoided word boundaries and character class shortcuts like \\w.)

How to perform a multi-byte safe SQL REGEXP query?

I have the following SQL query to find the dictionary words that contain specific letters.
It's working fine in the English dictionary:
SELECT word
FROM english_dictionary
WHERE word REGEXP '[abcdef]'
But running the same query on the Slovak dictionary, which includes special accented UTF-8 letters, doesn't work.
SELECT word
FROM slocak_dictionary
WHERE word REGEXP '[áäčďéóú]'
I've searched everywhere, can't find the answer to this issue. If I use LIKE, it's working, but the query is getting very ugly:
SELECT word
FROM slocak_dictionary
WHERE
word LIKE '%á%'
AND word LIKE '%ä%'
AND word LIKE '%č%'
AND word LIKE '%ď%'
AND word LIKE '%é%'
AND word LIKE '%ó%'
AND word LIKE '%ú%'
Because I deal with many letters that need to be excluded or included in the query, breaking it down like this is not very elegant.
Is there any way to perform a multi-byte safe SQL REGEXP query on MySQL?
MariaDB has better support of REGEXP.
In MySQL, this will test for word having any of those accented characters:
HEX(word) REGEXP '^(..)*(C3A1|C3A4|C48D|C48F|C3A9|C3B3|C3BA)'
The ^(..)* is to make sure the subsequent test is byte (2 hex chars) aligned.
You can see those utf8 encodings by doing something like
SELECT HEX('áäčďéóú');
(Your attempt with LIKE should have said OR instead of AND.)
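The reason for the ^(..)* prefix is easier to see with a concrete string. Below is a sketch in Python that imitates HEX(word) with `bytes.hex()`; the sample string 'L;0' is a contrived example chosen because its hex, 4C3B30, contains C3B3 starting in the middle of a byte.

```python
import re

ACCENTED_HEX = "C3A1|C3A4|C48D|C48F|C3A9|C3B3|C3BA"

# Same shape as the REGEXP above: only whole bytes (hex pairs) may be skipped.
aligned = re.compile(r"^(..)*(" + ACCENTED_HEX + r")")
# Without the prefix, a hit may straddle two bytes.
unaligned = re.compile(ACCENTED_HEX)

def to_hex(word):
    # Stand-in for MySQL's HEX() on a utf8 string.
    return word.encode("utf-8").hex().upper()

def check(word):
    h = to_hex(word)
    return aligned.search(h) is not None, unaligned.search(h) is not None

for sample in ["čaj", "dom", "L;0"]:
    print(sample, to_hex(sample), check(sample))
```

For 'L;0' only the unaligned pattern reports a hit, so dropping the prefix would return rows that contain none of the accented letters.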

Regex returning inexplicable results (to me)

I want to return entries from a table that match the format:
prefix + optional spaces + Thai digit
Testing using ยก as the prefix I use the following SQL
SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"
This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.
ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล
It doesn't return
ยก ๑
ยก ๒
which I know are in the database and are the entries I want.
I'm very new to all this. What am I doing wrong?
FWIW, this is against a MySQL database and everything is in Unicode.
Thanks
As quoted from the MySQL docs:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.
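A sketch of what probably happens to the range, written in Python rather than MySQL and assuming the engine reads the class one byte at a time: the UTF-8 bytes of [๐-๙] are E0 B9 90, a hyphen, and E0 B9 99, so the only range a byte-wise parser sees is 90 through E0, which covers many bytes of ordinary Thai letters.

```python
import re

digit_range = "๐-๙"  # Thai digits U+0E50 to U+0E59

# Byte-wise reading: literal bytes E0, B9, 99 plus the byte range 90-E0.
byte_class = re.compile(b"^[" + digit_range.encode("utf-8") + b"]+$")
# Character-wise reading: ten code points.
char_class = re.compile("^[" + digit_range + "]+$")

letters = "นะ"  # the tail of ยกนะ, no digits in it
digit = "๑"

byte_hits = {s: byte_class.search(s.encode("utf-8")) is not None
             for s in (letters, digit)}
char_hits = {s: char_class.search(s) is not None for s in (letters, digit)}

print("byte-wise:", byte_hits)
print("character-wise:", char_hits)
```

Under this assumption the letters นะ pass the byte-wise class, which is consistent with ยกนะ showing up in the results.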
I used utf8_general_ci and tried it. The pattern matched
ยกนะ
with "^ยก[\s]*[๐-๙]+$" but didn't match ยก ๑. So I changed the regexp to
"^ยก[ ]*[๐-๙]+$"
and it matches
ยกนะ
ยก ๑
Maybe the problem is character encoding.

MySQL REGEXP query - accent insensitive search

I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, and so similar wines may be entered with or without accents)
The basic query looks like this:
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'
which will return entries with 'Faugères' in the title, but not 'Faugeres'
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'
does the opposite.
I had thought something like:
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'
might do the trick, but this only returns the results without the accents.
The field is collated as utf8_unicode_ci, which from what I've read is how it should be.
Any suggestions?!
You're out of luck:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multi-byte safe and may produce unexpected results with multi-byte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can achieve with the LIKE operator is something on this line:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.
You could also use full-text searches (although it isn't the same), but before MySQL 5.6 you can't define full-text indexes on InnoDB tables.
You're certainly out of luck :)
Addendum: this has changed as of MySQL 8.0:
MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.)
Because REGEXP and RLIKE are byte oriented, have you tried:
SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';
This says one of these has to be in the expression. Notice that I haven't used the plus(+) because that means ONE OR MORE. Since you only want one you should not use the plus.
utf8_general_ci sees no difference between accented and unaccented characters when sorting. Maybe this is true for searches as well.
Also, change REGEXP to LIKE. REGEXP makes binary comparison.
To solve this problem, I tried different things, including using the binary keyword or the latin1 character set but to no avail.
Finally, considering that it is a MySQL bug, I ended up replacing the é and è chars, like this:
SELECT *
FROM `table`
WHERE replace(replace(wine_name, 'é', 'e'), 'è', 'e') REGEXP '[[:<:]]Faugeres[[:>:]]'
I had the same problem trying to find every record matching one of the following patterns: 'copropriété', 'copropriete', 'COPROPRIÉTÉ', 'Copropri?t?'
REGEXP 'copropri.{1,2}t.{1,2}' worked for me.
Basically, .{1,2} should work in every case, whether the character is encoded in 1 or 2 bytes.
Explanation: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
I have this problem, and went for Álvaro's suggestion above. But in my case, it misses those instances where the search term is the middle word in the string. I went for the equivalent of:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
OR wine_name LIKE '% Faugères %'
Ok I just stumbled on this question while searching for something else.
This returns true.
SELECT 'Faugères' REGEXP 'Faug[eèêéë]+r[eèêéë]+s';
Hope it helps.
Adding the '+' Tells the regexp to look for one or more occurrences of the characters.

Match beginning of words in Mysql for UTF8 strings

I'm trying to match the beginning of words in a MySQL column that stores strings as varchar. Unfortunately, REGEXP does not seem to work for UTF-8 strings as mentioned here
So,
select * from names where name REGEXP '[[:<:]]Aandre';
does not work if I have name like Foobar Aándreas
However,
select * from names where name like '%andre%'
matches the row I need but does not guarantee beginning of words matches.
Is it better to do the like and filter it out on the application side ? Any other solutions?
A citation from the page you mentioned:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
select * from names where name like 'andre%'
select * from names where name like 'andre%' is not a solution for, e.g.,
name = 'richard andrew', because the string begins with richa... and not with andre...
For the moment, my temporary solution for finding words (words != string) that start with a given string is:
select * from names where name REGEXP '[[:<:]]andre';
But it does not match accented words, e.g. ándrew.
Is there any other solution using regular expressions in MySQL to search accented words?
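If filtering on the application side is acceptable, one possible approach is to fetch candidates with LIKE and then test word beginnings after stripping accents. The following is only a sketch in Python; `fold_accents` and `word_starts_with` are hypothetical helper names, and the accent folding (NFD decomposition, then dropping combining marks) is an assumption about what "accent insensitive" should mean here.

```python
import re
import unicodedata

def fold_accents(text):
    # Decompose characters (NFD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_starts_with(name, prefix):
    # \b anchors the prefix at the beginning of a word in the folded name.
    pattern = r"\b" + re.escape(fold_accents(prefix))
    return re.search(pattern, fold_accents(name), re.IGNORECASE) is not None

names = ["Foobar Aándreas", "richard ándrew", "alexandre"]
hits = [n for n in names if word_starts_with(n, "andre")]
print(hits)
```

With the prefix 'andre' this keeps 'richard ándrew' and rejects 'alexandre'; 'Foobar Aándreas' is matched by the prefix 'Aandre' from the question's own query.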