Regex returning inexplicable results (to me) - mysql

I want to return entries from a table that match the format:
prefix + optional spaces + Thai digit
Testing using ยก as the prefix I use the following SQL
SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"
This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.
ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล
It doesn't return
ยก ๑
ยก ๒
which I know are in the database and are the entries I want.
I'm very new to all this. What am I doing wrong?
FWIW, this is against a MySQL database and everything is in Unicode.
Thanks

As quoted from the MySQL docs:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.

I use utf8_general_ci and try.I matched
ยกนะ
with "^ยก[\s]*[๐-๙]+$" but did't matched ยก ๑.So I change the regexp to
"^ยก[ ]*[๐-๙]+$"
,and it can match
ยกนะ
ยก ๑
Maybe the problem is character encoding.

Related

same queries using 'regexp' gives different result in mysql

Basically, what I want, is to understand why
select 'aa' regexp '[h]' returns 0 and
select 'აა' regexp '[ჰ]' returns 1 ?
check FIDDLE
I think MqSQL regex does not support utf-8 yet. See bug 30241 and 12.5.2 Regular Expressions.
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
You could match the byte sequence without character class: SELECT 'აა' REGEXP 'ჰ' returns 0.

MySQL REGEXP not producing expected results (not multi byte safe?). Is there a work around?

I'm trying to write a MySQL query to identify first name fields that actually contain initials. The problem is that the query is picking up records that should not match.
I have tested against the POSIX ERE regex implementation in RegEx Buddy to confirm my regex string is correct, but when running in a MySQL query, the results differ.
For example, the query should identify strings such as:
'A.J.D' or 'A J D'.
But it is also matching strings like 'Ralph' or 'Terrance'.
The query:
SELECT *, firstname REGEXP '^[a-zA-z]{1}(([[:space:]]|\.)+[a-zA-z]{1})+([[:space:]]|\.)?$' FROM test_table
The 'firstname' field here is VARCHAR 255 if that's relevant.
I get the same result when running with a string literal rather than table data:
SELECT 'Ralph' REGEXP '^[a-zA-z]{1}(([[:space:]]|\.)+[a-zA-z]{1})+([[:space:]]|\.)?$'
The MySQL documentation warns about potential issues with REGEXP, I'm unsure if this is related to the problem I'm seeing:
Warning The REGEXP and RLIKE operators work in byte-wise fashion, so
they are not multi-byte safe and may produce unexpected results with
multi-byte character sets. In addition, these operators compare
characters by their byte values and accented characters may not
compare as equal even if a given collation treats them as equal.
Thanks in advance.
If you're testing this in the mysql client, you need to escape the backslashes. Each occurence of \. must turn into \\. This is necessary because your input is first processed by the mysql client, which turns \. into .. So you need to make it keep the backslashes by escaping them.

MySQL REGEXP query - accent insensitive search

I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, and so similar wines may be entered with or without accents)
The basic query looks like this:
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'
which will return entries with 'Faugères' in the title, but not 'Faugeres'
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'
does the opposite.
I had thought something like:
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'
might do the trick, but this only returns the results without the accents.
The field is collated as utf8_unicode_ci, which from what I've read is how it should be.
Any suggestions?!
You're out of luck:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multi-byte safe and may produce unexpected results with multi-byte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can achieve with the LIKE operator is something on this line:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.
You could also use full text searches (although it isn't the same) but you can't define full text indexes in InnoDB tables (yet).
You're certainly out of luck :)
Addendum: this has changed as of MySQL 8.0:
MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.
Because REGEXP and RLIKE are byte oriented, have you tried:
SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';
This says one of these has to be in the expression. Notice that I haven't used the plus(+) because that means ONE OR MORE. Since you only want one you should not use the plus.
utf8_general_ci see no difference between accent/no accent when sorting. Maybe this true for searches as well.
Also, change REGEXP to LIKE. REGEXP makes binary comparison.
To solve this problem, I tried different things, including using the binary keyword or the latin1 character set but to no avail.
Finally, considering that it is a MySql bug, I ended up replacing the é and è chars,
Like this :
SELECT *
FROM `table`
WHERE replace(replace(wine_name, 'é', 'e'), 'è', 'e') REGEXP '[[:<:]]Faugeres[[:>:]]'
I had the same problem trying to find every record matching one of the following patterns: 'copropriété', 'copropriete', 'COPROPRIÉTÉ', 'Copropri?t?'
REGEXP 'copropri.{1,2}t.{1,2} worked for me.
Basically, .{1,2} will should work in every case wether the character is 1 or 2 byte encoded.
Explanation: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
I have this problem, and went for Álvaro's suggestion above. But in my case, it misses those instances where the search term is the middle word in the string. I went for the equivalent of:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
OR wine_name LIKE '% Faugères %'
Ok I just stumbled on this question while searching for something else.
This returns true.
SELECT 'Faugères' REGEXP 'Faug[eèêéë]+r[eèêéë]+s';
Hope it helps.
Adding the '+' Tells the regexp to look for one or more occurrences of the characters.

How to detect rows with chinese characters in MySQL?

How can I detect and delete rows with Chinese characters in MySQL?
Here is the Table "Chinese_Test" Contains the Chinese Character on my PhpMyAdmin
Data:
Structure
notice my type of Collation is utf8, thus let's take a look at the Chinese Characters in utf8 table.
http://www.ansell-uebersetzungen.com/gbuni.html
Notice the Chinese Character is from E4 to E9, hence we use the code
select number
from Chinese_Test
where HEX(contents) REGEXP '^(..)*(E[4-9])';
and here is the result:
If all the other rows have alphanumeric values try the following:
DELETE FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
Do check the results before deletion, using the following:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
I don't have an answer, but to provide you with a starting point: Chinese characters will occupy certain blocks in the UTF-8 character set. Example
You would have to query for rows that contain characters between the first and the last point of that block. I can't think of a way to automate this though (i.e. to query for characters inside a certain range without naming each character explicitly).
Another untested idea that comes to mind is using iconv() to convert the string to a specifically Chinese encoding, using //IGNORE, and seeing whether any data is left. If anything is left, the string may contain chinese characters.... although this would probably be disrupted by any numbers inside the string,
It's an interesting problem.

Match beginning of words in Mysql for UTF8 strings

I m trying to match beignning of words in a mysql column that stores strings as varchar. Unfortunately, REGEXP does not seem to work for UTF-8 strings as mentioned here
So,
select * from names where name REGEXP '[[:<:]]Aandre';
does not work if I have name like Foobar Aándreas
However,
select * from names where name like '%andre%'
matches the row I need but does not guarantee beginning of words matches.
Is it better to do the like and filter it out on the application side ? Any other solutions?
A citation from tha page you mentioned:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
select * from names where name like 'andre%'
select * from names where name like 'andre%' is not solution for eg:
name = 'richard andrew', because the string begining with richa... and not with andre...
for the moment, the temporaly solution, for search words (words != string) starting with a string
select * from names where name REGEXP '[[:<:]]andre';
But it no matching with accented words, eg: ándrew.
Any other solution, with regular expressions (mysql) to search in accented words?