Mysql RLIKE / PREG_MATCH bug - mysql

Cold someone could explain why this returns true:
SELECT BINARY 'â' RLIKE '[™]';
SELECT BINARY 'é' RLIKE '[©]';
What could be the fix ? Is it some misconfiguration on my part?
UPDATE:
found that using (™|©) instead of [™©] would work as a first workaround

From the documentation:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multi-byte safe and may produce unexpected results with multi-byte
character sets.

Related

same queries using 'regexp' gives different result in mysql

Basically, what I want, is to understand why
select 'aa' regexp '[h]' returns 0 and
select 'აა' regexp '[ჰ]' returns 1 ?
check FIDDLE
I think MqSQL regex does not support utf-8 yet. See bug 30241 and 12.5.2 Regular Expressions.
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
You could match the byte sequence without character class: SELECT 'აა' REGEXP 'ჰ' returns 0.

How can I use case sensitive in mysql except the first character of the string?

How can I use case sensitive in mysql except the first character of the string?
Examples:
String: ABC12345
aBC12345: ok
abc12345: no
ABc12345: no
You can use REGEXP:
select * from mytable where binary myfield REGEXP '^[aA]BC12345$';
Make note of:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
View a sqlfiddle.
you can use the binary option as such :
select * from mytable where binary myfield like '%BC12345'

Regex returning inexplicable results (to me)

I want to return entries from a table that match the format:
prefix + optional spaces + Thai digit
Testing using ยก as the prefix I use the following SQL
SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"
This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.
ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล
It doesn't return
ยก ๑
ยก ๒
which I know are in the database and are the entries I want.
I'm very new to all this. What am I doing wrong?
FWIW, this is against a MySQL database and everything is in Unicode.
Thanks
As quoted from the MySQL docs:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.
I use utf8_general_ci and try.I matched
ยกนะ
with "^ยก[\s]*[๐-๙]+$" but did't matched ยก ๑.So I change the regexp to
"^ยก[ ]*[๐-๙]+$"
,and it can match
ยกนะ
ยก ๑
Maybe the problem is character encoding.

MySQL REGEXP query - accent insensitive search

I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, and so similar wines may be entered with or without accents)
The basic query looks like this:
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'
which will return entries with 'Faugères' in the title, but not 'Faugeres'
SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'
does the opposite.
I had thought something like:
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'
might do the trick, but this only returns the results without the accents.
The field is collated as utf8_unicode_ci, which from what I've read is how it should be.
Any suggestions?!
You're out of luck:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multi-byte safe and may produce unexpected results with multi-byte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can achieve with the LIKE operator is something on this line:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.
You could also use full text searches (although it isn't the same) but you can't define full text indexes in InnoDB tables (yet).
You're certainly out of luck :)
Addendum: this has changed as of MySQL 8.0:
MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.
Because REGEXP and RLIKE are byte oriented, have you tried:
SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';
This says one of these has to be in the expression. Notice that I haven't used the plus(+) because that means ONE OR MORE. Since you only want one you should not use the plus.
utf8_general_ci see no difference between accent/no accent when sorting. Maybe this true for searches as well.
Also, change REGEXP to LIKE. REGEXP makes binary comparison.
To solve this problem, I tried different things, including using the binary keyword or the latin1 character set but to no avail.
Finally, considering that it is a MySql bug, I ended up replacing the é and è chars,
Like this :
SELECT *
FROM `table`
WHERE replace(replace(wine_name, 'é', 'e'), 'è', 'e') REGEXP '[[:<:]]Faugeres[[:>:]]'
I had the same problem trying to find every record matching one of the following patterns: 'copropriété', 'copropriete', 'COPROPRIÉTÉ', 'Copropri?t?'
REGEXP 'copropri.{1,2}t.{1,2} worked for me.
Basically, .{1,2} will should work in every case wether the character is 1 or 2 byte encoded.
Explanation: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
I have this problem, and went for Álvaro's suggestion above. But in my case, it misses those instances where the search term is the middle word in the string. I went for the equivalent of:
SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
OR wine_name LIKE 'Faugères %'
OR wine_name LIKE '% Faugères'
OR wine_name LIKE '% Faugères %'
Ok I just stumbled on this question while searching for something else.
This returns true.
SELECT 'Faugères' REGEXP 'Faug[eèêéë]+r[eèêéë]+s';
Hope it helps.
Adding the '+' Tells the regexp to look for one or more occurrences of the characters.

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike #Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
Try Using this query for searching special character records
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
#zende's answer was the only one that covered columns with a mix of ascii and non ascii characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle we can use below.
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
for this question we can also use this method :
Question from sql zoo:
Find all details of the prize won by PETER GRÜNBERG
Non-ASCII characters
ans: select*from nobel where winner like'P% GR%_%berg';