How to perform a multi-byte safe SQL REGEXP query? - mysql

I have the following SQL query to find the dictionary words that contain specific letters.
It's working fine in the English dictionary:
SELECT word
FROM english_dictionary
WHERE word REGEXP '[abcdef]'
But running the same query on a Slovak dictionary, which includes UTF-8 accented letters, doesn't work:
SELECT word
FROM slocak_dictionary
WHERE word REGEXP '[áäčďéóú]'
I've searched everywhere but can't find an answer to this issue. If I use LIKE, it works, but the query gets very ugly:
SELECT word
FROM slocak_dictionary
WHERE
word LIKE '%á%'
AND word LIKE '%ä%'
AND word LIKE '%č%'
AND word LIKE '%ď%'
AND word LIKE '%é%'
AND word LIKE '%ó%'
AND word LIKE '%ú%'
Because I deal with many letters that need to be included in or excluded from the query, breaking it down like this is not very elegant.
Is there any way to perform a multi-byte safe SQL REGEXP query on MySQL?

MariaDB has better support of REGEXP.
In MySQL, this will test for word having any of those accented characters:
HEX(word) REGEXP '^(..)*(C3A1|C3A4|C48D|C48F|C3A9|C3B3|C3BA)'
The ^(..)* makes sure the subsequent test is byte-aligned (each byte is 2 hex characters).
You can see those utf8 encodings by doing something like
SELECT HEX('áäčďéóú');
(Your attempt with LIKE should have said OR instead of AND.)
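To see why the ^(..)* prefix matters, here is a minimal Python sketch (not MySQL) that mimics the HEX(word) REGEXP test. The helper name has_accent is made up for illustration, and the hex codes are computed from the letters instead of typed by hand.

```python
import re

letters = "áäčďéóú"
# Hex of each letter's UTF-8 encoding, as SELECT HEX('á') would show it.
codes = [ch.encode("utf-8").hex().upper() for ch in letters]
print(codes)  # ['C3A1', 'C3A4', 'C48D', 'C48F', 'C3A9', 'C3B3', 'C3BA']

# Same shape as the pattern above: skip whole bytes, then match one code.
pattern = re.compile("^(..)*(" + "|".join(codes) + ")")

def has_accent(word):
    """Hypothetical helper: True if the hex dump contains a byte-aligned code."""
    return pattern.search(word.encode("utf-8").hex().upper()) is not None

# Bytes 4C 3A 11 give the hex string "4C3A11", which contains "C3A1"
# only across a byte boundary.
tricky = "L:\x11"
unaligned_hit = re.search("C3A1", tricky.encode("utf-8").hex().upper()) is not None
aligned_hit = has_accent(tricky)
print(unaligned_hit, aligned_hit)  # True False
```

Without the alignment prefix, the second hex digit of one byte and the first digit of the next can combine into a false match, which is what the last two lines show.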

Related

Character class is not working for Arabic text column

By definition MySQL character class [...] matches any character within the brackets. So I used it for Arabic characters. And it is giving me empty set every time.
Here is my query:
select hadith_raw_ar from view_hadith_in_book where hadith_raw_ar like '%[بل]ت';
With older versions, you cannot use character classes with LIKE or RLIKE and non-latin1 character sets. (At least not and expect to get the right results.)
REGEXP is lame. It looks only at bytes; 6 bytes in your character class, some of which are duplicated. Here's the hex: D8 AA D8 A8 D9 84.
Sometimes you will happen to get the 'right' answer from REGEXP, and sometimes not. For example, with the byte-wise REGEXP, SELECT '٪' REGEXP '[تبل]'; returns true even though the percent sign is not in the class. Note that I am testing for an Arabic percent sign, hex D9AA. I picked it because D9 occurs in one of the Arabic characters in the class and AA in another. MariaDB has a decent REGEXP.
The MySQL 8.0 manual implies that REGEXP might work correctly for Arabic. (But not for Emoji and some Chinese characters.) MariaDB has had PCRE built-in since 10.0.5.
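The false positive with the Arabic percent sign can be reproduced outside MySQL. The following Python sketch is only an approximation of a byte-wise engine: it runs the same character class once over characters and once over raw UTF-8 bytes.

```python
import re

char_class = "[تبل]"
percent_sign = "٪"  # U+066A ARABIC PERCENT SIGN, UTF-8 bytes D9 AA

# The bytes that end up inside the class when it is read byte by byte.
class_bytes = "تبل".encode("utf-8").hex(" ").upper()
print(class_bytes)  # D8 AA D8 A8 D9 84

# Character-aware matching: the percent sign is not one of the three letters.
by_character = re.search(char_class, percent_sign) is not None

# Byte-wise matching: the class is just the byte set {D8, AA, A8, D9, 84},
# and the first byte of the percent sign (D9) is in that set.
by_byte = re.search(char_class.encode("utf-8"), percent_sign.encode("utf-8")) is not None

print(by_character, by_byte)  # False True
```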
By definition MySQL character class [...] matches any character within
the brackets.
Umm, that's inaccurate. The character class is actually part of regex, not MySQL. However, you can still use regex with MySQL, of course, but you need to use the keyword REGEXP instead of LIKE.
Now, if you're trying to match anything that starts with any character represented in your character class, you should be using a regex pattern that looks something like ^[...] where you replace the ... with the characters you want.
So, in your case, you'd need something like this:
SELECT hadith_raw_ar FROM view_hadith_in_book WHERE hadith_raw_ar REGEXP '^[تبل]';
Which is equivalent to:
SELECT hadith_raw_ar
FROM view_hadith_in_book
WHERE hadith_raw_ar LIKE 'ت%' OR
hadith_raw_ar LIKE 'ب%' OR
hadith_raw_ar LIKE 'ل%';
..when not using Regex.
References:
Regex: Character Classes or Character Sets.
Using Regular Expressions with MySQL.
Use the utf8_general_ci collation to insert characters from any language.

how to separate "/" in SQL REGEX

I want to get the rows whose URLs contain one or more Chinese characters. I wrote a SQL query with a regexp to do it, but it fails because "/" matches the regexp.
The regexp is
SELECT "/" REGEXP '.*[^\x0f-\xff].*'
and the Sequel Pro returns 1
However, when I tried the same regexp on a regex-testing website, it returned 0.
Why does the same regexp behave differently on that website and in Sequel Pro? If the website applies some optimization, how can I get the same result in Sequel Pro?
SELECT ...
WHERE HEX(str) REGEXP '^(..)*E[3456789ABCD]';
will check for a variety of CJK characters. (This assumes str is CHARACTER SET utf8 or utf8mb4.) This may include Japanese and Korean characters, too.
I'm digging around for the 'extension' characters; seems like they begin with F0.
EDIT
Well, it turns out that Chinese is all over the place:
REGEXP
'^(..)*(E2B[AB]|E380|E387|E38[89AB]|E38[CDEF]|E[34][9AB][0-9A-F]|E[456789]B[89ABCDEF]|EFA[456789AB]|EFB[89]|F0A[0123456789A][89][0-9A-F]|F0A[AB]9C|F0AB[9A][DEF0]|F0A[BC][AB][0-9A-F]|F0AFA[012345678])'
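As a rough illustration of the simpler lead-byte check from earlier in this answer, here is a Python sketch that applies the same kind of pattern to a hex dump of the UTF-8 bytes. The function name may_contain_cjk is hypothetical, and the check is deliberately coarse: lead bytes E3 through ED cover many CJK characters but also some other 3-byte characters.

```python
import re

# Same shape as HEX(str) REGEXP '^(..)*E[3456789ABCD]'.
CJK_LEAD = re.compile(r"^(..)*E[3-9A-D]")

def may_contain_cjk(text):
    """Hypothetical helper: True if a byte-aligned lead byte E3..ED is present."""
    return CJK_LEAD.search(text.encode("utf-8").hex().upper()) is not None

print(may_contain_cjk("/news/中文"))  # True, the first character encodes as E4 B8 AD
print(may_contain_cjk("/"))           # False, a slash is the single byte 2F
print(may_contain_cjk("/café"))       # False, 'é' encodes as C3 A9
```

UTF-8 continuation bytes always fall in the range 80 to BF, so a byte-aligned E3..ED can only be the first byte of a 3-byte character.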

MYSQL select all rows with asian characters

On a database with customer information and in a table where names and addresses are mixed in latin and asian characters I'd like to select all that do (or don't) contain any asian characters. The data is UTF-8 encoded. Is that possible with MYSQL itself or do I need to write a custom script using PHP / Perl?
You might be able to do this with regular expressions. The idea is to list all the simple characters that might be in the string and negate the set with ^. So, to find unexpected (i.e. "Asian") characters:
where col regexp '.*[^0-9a-zA-Z.,:()& ].*'
The .* at the beginning and end are not necessary, but I like to have them so the patterns are similar to LIKE patterns.
The page linked to in amdixon's comment had a working answer. Here it is so that we have it on SO:
To select all rows with non latin characters on col:
SELECT *
FROM table
WHERE col != CONVERT(col USING latin1)
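The round-trip idea behind that query can be sketched in Python. This is only an analogy, assuming that MySQL's latin1 behaves like Windows cp1252 and that unrepresentable characters become '?'. The function name is made up for illustration.

```python
def differs_after_latin1_round_trip(text):
    """Hypothetical helper mirroring col != CONVERT(col USING latin1)."""
    # Characters the target charset cannot represent are replaced by '?',
    # so the round-tripped string no longer equals the original.
    round_tripped = text.encode("cp1252", errors="replace").decode("cp1252")
    return round_tripped != text

print(differs_after_latin1_round_trip("Müller, Straße 5"))  # False
print(differs_after_latin1_round_trip("田中 太郎"))          # True
```

Accented Western European letters survive the round trip, so this flags only text outside that charset.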

MySQL REGEXP query - accent insensitive search

I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, and so similar wines may be entered with or without accents)
The basic query looks like this:
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'
which will return entries with 'Faugères' in the title, but not 'Faugeres'
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'
does the opposite.
I had thought something like:
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'
might do the trick, but this only returns the results without the accents.
The field is collated as utf8_unicode_ci, which from what I've read is how it should be.
Any suggestions?!
You're out of luck:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multi-byte safe and may produce unexpected results with multi-byte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can achieve with the LIKE operator is something on this line:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.
You could also use full text searches (although it isn't the same) but you can't define full text indexes in InnoDB tables (yet).
You're certainly out of luck :)
Addendum: this has changed as of MySQL 8.0:
MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.)
Because REGEXP and RLIKE are byte oriented, have you tried:
SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';
This says one of these has to be in the expression. Notice that I haven't used the plus(+) because that means ONE OR MORE. Since you only want one you should not use the plus.
utf8_general_ci sees no difference between accented and unaccented characters when sorting. Maybe this is true for searches as well.
Also, change REGEXP to LIKE. REGEXP makes binary comparison.
To solve this problem, I tried different things, including using the binary keyword or the latin1 character set but to no avail.
Finally, considering that it is a MySQL bug, I ended up replacing the é and è chars, like this:
SELECT *
FROM `table`
WHERE replace(replace(wine_name, 'é', 'e'), 'è', 'e') REGEXP '[[:<:]]Faugeres[[:>:]]'
I had the same problem trying to find every record matching one of the following patterns: 'copropriété', 'copropriete', 'COPROPRIÉTÉ', 'Copropri?t?'
REGEXP 'copropri.{1,2}t.{1,2}' worked for me.
Basically, .{1,2} should work in every case, whether the character is encoded in 1 or 2 bytes.
Explanation: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
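A small Python sketch of the .{1,2} trick, run over raw UTF-8 bytes to imitate a byte-wise engine. Case-insensitive matching is an assumption here, to mirror how MySQL usually compares nonbinary strings.

```python
import re

# Over bytes, '.' matches one byte, so .{1,2} covers 'e' (1 byte) and 'é' (2 bytes).
pattern = re.compile(b"copropri.{1,2}t.{1,2}", re.IGNORECASE)

samples = ["copropriété", "copropriete", "COPROPRIÉTÉ", "Copropri?t?"]
results = {s: pattern.search(s.encode("utf-8")) is not None for s in samples}
print(all(results.values()))  # True

# The price of the wildcard: unrelated strings of the right shape also match.
loose_hit = pattern.search("coproprioxtx".encode("utf-8")) is not None
print(loose_hit)  # True
```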
I have this problem, and went for Álvaro's suggestion above. But in my case, it misses those instances where the search term is the middle word in the string. I went for the equivalent of:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
OR wine_name LIKE '% Faugères %'
Ok I just stumbled on this question while searching for something else.
This returns true.
SELECT 'Faugères' REGEXP 'Faug[eèêéë]+r[eèêéë]+s';
Hope it helps.
Adding the '+' tells the regexp to look for one or more occurrences of the characters.
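The byte-wise behaviour quoted in the warning would explain why the plain character class fails while both the alternation and the '+' version succeed. The Python sketch below imitates it by compiling each pattern over UTF-8 bytes. It is an approximation of the old MySQL engine, not the engine itself, and the helper name bytewise_match is made up.

```python
import re

def bytewise_match(pattern, text):
    """Hypothetical helper: match over UTF-8 bytes instead of characters."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None

word = "Faugères"  # 'è' is the two bytes C3 A8

# The class holds single bytes, so it consumes C3 and then 'r' fails against A8.
print(bytewise_match("Faug[eèêéë]r[eèêéë]s", word))              # False
# Each alternative is a whole byte sequence, so 'è' matches as C3 A8.
print(bytewise_match("Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s", word))      # True
# With '+', the class can consume C3 and A8 one after the other.
print(bytewise_match("Faug[eèêéë]+r[eèêéë]+s", word))            # True
# Side effect of '+': repeated letters are accepted as well.
print(bytewise_match("Faug[eèêéë]+r[eèêéë]+s", "Faugeeeres"))    # True
```

Of the two working variants, the alternation is the stricter one, since it accepts exactly one letter in each position.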

Match beginning of words in Mysql for UTF8 strings

I'm trying to match the beginning of words in a MySQL column that stores strings as varchar. Unfortunately, REGEXP does not seem to work for UTF-8 strings, as mentioned here
So,
select * from names where name REGEXP '[[:<:]]Aandre';
does not work if I have a name like Foobar Aándreas
However,
select * from names where name like '%andre%'
matches the row I need but does not guarantee a match at the beginning of a word.
Is it better to do the LIKE and filter the results on the application side? Any other solutions?
A citation from the page you mentioned:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
select * from names where name like 'andre%'
select * from names where name like 'andre%' is not a solution for, e.g.,
name = 'richard andrew', because the string begins with richa... and not with andre...
For the moment, a temporary solution for finding words (words != string) that start with a given string is:
select * from names where name REGEXP '[[:<:]]andre';
But it does not match accented words, e.g. ándrew.
Is there any other solution, with regular expressions (MySQL), to search in accented words?
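If the filtering ends up on the application side, as the question suggests, one possible approach is to strip accents before testing for a word start. This is a minimal Python sketch under that assumption; strip_accents and word_starts_with are hypothetical helper names, and the rows are assumed to have been fetched already with a broad LIKE query.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters and drop the combining marks (á becomes a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_starts_with(name, prefix):
    """True if some word in name starts with prefix, ignoring accents and case."""
    folded_name = strip_accents(name).casefold()
    folded_prefix = strip_accents(prefix).casefold()
    return re.search(r"\b" + re.escape(folded_prefix), folded_name) is not None

print(word_starts_with("Foobar Aándreas", "Aandre"))  # True
print(word_starts_with("richard ándrew", "andre"))    # True
print(word_starts_with("alexandre", "andre"))         # False
```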