Regex in MySQL and MariaDB gives different results - mysql

I am migrating my database from MariaDB to MySQL and have come across differences.
I have a very simple query that returns 197 rows in MariaDB but zero in MySQL. Can anyone help?
SELECT DISTINCT title FROM films where title REGEXP 'The \\w{4}[^\\s]*\\b'
The database is exactly the same (exported from MariaDB into MySQL with no issues).

In MySQL 5.7, you have to use the POSIX-like regex syntax:
SELECT DISTINCT title FROM films where title REGEXP 'The [[:alnum:]_]{4}[^[:space:]]*[[:>:]]'
Also note that regex matching here is case insensitive. If you need The to match only The and not THE, add the BINARY keyword after REGEXP.
Here,
[[:alnum:]_]{4} - \w{4} - four word chars, letters, digits or underscores
[^[:space:]]* - \S* - zero or more non-whitespace chars
[[:>:]] - \b(?!\w) - a right-hand (trailing) word boundary
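The mapping above can be sketched as a small translation helper. This is only an illustration: to_posix and its REPLACEMENTS table are hypothetical names, the helper handles just the tokens listed, and it assumes \b is used only as a trailing word boundary.

```python
# Hypothetical helper: rewrites a few PCRE shorthands into the POSIX-style
# syntax that MySQL 5.7 expects. Only the tokens listed here are handled,
# and shorthands nested inside other bracket expressions are not supported.
REPLACEMENTS = [
    (r"[^\s]", "[^[:space:]]"),  # negated whitespace class, handled first
    (r"\S", "[^[:space:]]"),
    (r"\s", "[[:space:]]"),
    (r"\w", "[[:alnum:]_]"),
    (r"\d", "[0-9]"),
    (r"\b", "[[:>:]]"),  # assumption: \b is a trailing word boundary
]


def to_posix(pattern: str) -> str:
    """Apply the literal replacements in order and return the new pattern."""
    for shorthand, posix in REPLACEMENTS:
        pattern = pattern.replace(shorthand, posix)
    return pattern


converted = to_posix(r"The \w{4}[^\s]*\b")
print(converted)
```

The result is the pattern text itself. Inside a MySQL string literal any remaining backslashes would still need doubling.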

Related

Regex pattern equivalent of %word% in mysql

I need two case-insensitive regex patterns. One of them should be the equivalent of SQL's % wildcard, as in %word%. My attempt at this was '^[a-zA-Z]*word[a-zA-Z]*$'.
Question 1: This seems to work, but I am wondering if this is the equivalent of %word%.
Finally, the second pattern is similar to %word%, but it must not match when there are 3 or more characters either before or after the word. For example, if the target word is word:
words = matched because it doesn't have 3 or more characters either before or after it.
swordfish = not matched because it has 3 or more characters after word
theword = not matched because it has 3 or more characters before it
mywordly = matched because it doesn't contain 3 or more characters before or after word.
miswordeds = not matched because it has 3 characters before it (it also has 3 characters after it, but it had already failed the criteria).
Question 2: For the second regex, I am not very sure how to start this. I will be using the regex in a MySQL query using the REGEXP function for example:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]*word[a-zA-Z]*$'
First Question:
According to https://dev.mysql.com/doc/refman/8.0/en/string-comparison-functions.html#operator_like
With LIKE you can use the following two wildcard characters in the pattern:
% matches any number of characters, even zero characters.
_ matches exactly one character.
This means the regex '^[a-zA-Z]*word[a-zA-Z]*$' is not equivalent to %word%: % matches any characters at all, while [a-zA-Z]* matches only letters.
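A quick way to see the difference is to compare the two on a few strings. The sketch below uses Python's re module rather than MySQL, and assumes a case-insensitive collation on the LIKE side. The sample strings are made up for illustration.

```python
import re

# The asker's pattern: letters only around "word"
letters_only = re.compile(r"^[a-zA-Z]*word[a-zA-Z]*$", re.IGNORECASE)


def like_contains_word(text: str) -> bool:
    """Stand-in for LIKE '%word%': true when 'word' occurs anywhere."""
    return "word" in text.lower()


samples = ["swordfish", "my word!", "word2vec"]
# Each value is (LIKE result, regex result)
results = {
    s: (like_contains_word(s), bool(letters_only.search(s))) for s in samples
}
print(results)
```

Any string with a space, digit or punctuation mark next to the word passes the LIKE test but fails the letters-only regex.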
Second Question:
Change * to {0,2} to match at most 2 characters before and after the word:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]{0,2}word[a-zA-Z]{0,2}$'
And to make it case insensitive:
SELECT 1 WHERE LOWER('SWORDFISH') REGEXP '^[a-z]{0,2}word[a-z]{0,2}$'
Assuming
The test string (or column) has only letters. (Hence, I can use . instead of [a-z]).
Case folding and accents are not an issue (presumably handled by a suitable COLLATION).
Either way:
WHERE x LIKE '%word%' -- found the word
AND x NOT LIKE '%___word%' -- but fail if too many leading chars
AND x NOT LIKE '%word___%' -- or trailing
WHERE x RLIKE '^.{0,2}word.{0,2}$'
I vote for RLIKE being slightly faster than LIKE -- only because there are fewer ANDs.
(MySQL 8.0 introduced incompatible regexp syntax; I think the syntax above works in all versions of MySQL/MariaDB. Note that I avoided word boundaries and character class shortcuts like \\w.)
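To check that the three LIKE conditions and the single RLIKE agree on the question's examples, here is a rough emulation in Python. The like helper is a hypothetical translation of LIKE wildcards into a Python regex and assumes a case-insensitive collation. It is not how MySQL evaluates LIKE internally.

```python
import re


def like(text: str, pattern: str) -> bool:
    """Emulate SQL LIKE: % is any run of characters, _ is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None


def by_like(x: str) -> bool:
    return (
        like(x, "%word%")
        and not like(x, "%___word%")
        and not like(x, "%word___%")
    )


def by_rlike(x: str) -> bool:
    return re.search(r"^.{0,2}word.{0,2}$", x, re.IGNORECASE) is not None


cases = ["words", "swordfish", "theword", "mywordly", "miswordeds"]
accepted = [w for w in cases if by_rlike(w)]
print(accepted)
```

On these five inputs both forms accept the same strings. Inputs where the word occurs more than once may behave differently.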

Querying records that start/end with a string, within word boundaries using REGEXP (MySql)

In the query below, I'd like to find records containing a word that starts with engineer, e.g. I'd like to pull back records with the description engineering.
SELECT * FROM app.desc_test t
WHERE lower(t.desc) REGEXP '[[:<:]]engineer[[:>:]]';
The word boundaries are properly handling all special characters (i.e. commas, spaces, special characters, etc that are before and after), but I'm not sure how to write the Regex so that it starts with engineer.
Also, how would I make this say starts with OR ends with engineer.
Somewhat similar issue, but in .NET
Similar issue, but looking for double quotes in MySQL
MySQL 5.7 regex docs
CREATE TABLE desc_test (
id int(11) NOT NULL AUTO_INCREMENT,
`desc` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL, -- desc is a reserved word, so it is quoted here
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Edit
The value will be unknown/dynamic, so hardcoding any "ing" expression isn't the solution.
If you only want to match the beginning of the word, you can just remove [[:>:]] from the regexp.
SELECT * FROM app.desc_test t
WHERE lower(t.desc) REGEXP '[[:<:]]engineer';
Note: Full Text Search as referenced by Bill Karwin is preferred
because using REGEXP is thousands of times slower than an indexed solution
But...
To use your current REGEXP implementation, your MySQL should look like this:
SELECT * FROM app.desc_test t WHERE lower(t.desc)
REGEXP '[[:<:]]engineer[a-z]*[[:>:]]';
The Regex looks like this:
[[:<:]]engineer[a-z]*[[:>:]]
Meaning:
[[:<:]] - Start of word boundary
engineer - The string given by the search (dynamic)
[a-z] - Any character from a to z.
* - Repeats the preceding character class zero or more times.
[[:>:]] - End of word boundary
The above should do what you need. You can also customise it, for instance to include digits ([a-z0-9]), or whatever you wish.
Revisions to this answer:
One:
Revised, Improved: use [[:alpha:]] so:
[[:<:]]engineer[[:alpha:]]*[[:>:]]
Two:
As Barmar correctly pointed out, there is little need for the extra REGEXP parts. Your word boundaries, or lack thereof, do the work for you.
Therefore, to select any word beginning with engineer or ending with engineer, you simply combine the two cases with | (alternation):
SELECT * FROM app.desc_test t WHERE lower(t.desc)
REGEXP '([[:<:]]engineer)|(engineer[[:>:]])'
This means:
Return true if:
The term engineer comes at the start of a word, regardless of what comes after it.
OR the term engineer comes at the end of a word, regardless of what comes before it.
This should fit exactly what you're looking for. This has been tested on MySQL 5.7.
Sources :
MYSQL 5.7 Manual
MySQL REGEXP word boundaries [[:<:]] [[:>:]] and double quotes
Example cases:
Engineer - Match
Engineering - Match
Engineers - Match
Engineer! - Match
Also, how would I make this say starts with OR ends with engineer.
Simply flip around the REGEXP and set it as an OR statement:
SELECT * FROM app.desc_test t WHERE lower(t.desc)
REGEXP '[[:<:]](engineer[[:alpha:]]*)|([[:alpha:]]*engineer)[[:>:]]';
Which tells the REGEXP to: "look for engineer at the beginning of the word followed by any a-z values or look for any a-z values followed by engineer at the end of the word".
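Since the search term is dynamic, the pattern has to be built at run time. Below is a minimal sketch of such a builder. word_edge_pattern is a hypothetical helper, and it deliberately accepts only letters and digits so that the term cannot introduce regex metacharacters. The resulting string would then be passed to the query as a bound parameter.

```python
def word_edge_pattern(term: str) -> str:
    """Build a MySQL 5.7 style pattern for words starting or ending with term."""
    # Conservative assumption: reject anything that is not purely letters
    # and digits instead of trying to escape regex metacharacters.
    if not term.isalnum():
        raise ValueError("term must contain only letters and digits")
    term = term.lower()
    return f"[[:<:]]{term}|{term}[[:>:]]"


pattern = word_edge_pattern("Engineer")
print(pattern)
```

Rejecting unusual input is a design choice made here for simplicity. Terms containing punctuation would need proper escaping instead.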
For "desc starts with":
REGEXP: '^engineer...'
LIKE: 'engineer%...'
Case folding:
If the collation of the column is ..._ci, then do not waste time with LOWER().
So, this is optimal for finding desc that starts with "engineer" or "engineering" or "Engineer", etc:
WHERE t.desc LIKE 'engineer%'
If you really meant "where desc contains 'engineer' or ...", then
WHERE t.desc REGEXP '[[:<:]]engineer'
But a better way would be to use FULLTEXT(desc) and use this; it allows the word to be anywhere in desc and desc can be TEXT.
WHERE MATCH(desc) AGAINST('+engineer*' IN BOOLEAN MODE)
You must pick among the choices based on the actual requirements. Meanwhile, here is the relative performance of them:
LOWER(desc) ... -- poor, regardless of the rest of the clause
LIKE 'engineer%' -- excellent if you have INDEX(desc)
LIKE 'engineer%' -- poor with no index, or with prefixing: INDEX(desc(100))
MATCH... -- excellent due to FULLTEXT index.
REGEXP ... -- poor; will check every record
For "there is a word that starts or ends with":
You need to list positive and negative test cases:
engineering blah
The engineer.
MechanicalEngineering -- neither starts nor ends at word boundary??
engineer
If all of those are valid, then this is the only viable answer:
WHERE t.desc LIKE '%engineer%'
The equivalent REGEXP 'engineer' is slower (but has the same effect).
For other situations, I would look at something close to
WHERE t.desc REGEXP '[[:<:]]engineer|engineer[[:>:]]'
which looks for a "word" that starts or ends with 'engineer'. Note that this does not include 'MechanicalEngineering'.
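The test cases listed above can be checked against the same idea in Python, where \b plays the role of both [[:<:]] and [[:>:]]. This is only an approximation of the MySQL behaviour, with re.IGNORECASE standing in for a _ci collation.

```python
import re

# \bengineer : a word that starts with "engineer"
# engineer\b : a word that ends with "engineer"
starts_or_ends = re.compile(r"\bengineer|engineer\b", re.IGNORECASE)

cases = [
    "engineering blah",
    "The engineer.",
    "MechanicalEngineering",
    "engineer",
]
hits = {text: bool(starts_or_ends.search(text)) for text in cases}
print(hits)
```

MechanicalEngineering is the only case rejected, because engineer sits in the middle of a word there.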

MySQL REGEXP failing to limit # of occurrences (?!)

I have a table with a lot of individual words in it (Column name 'qWord') with contents including 'Utility', 'Utter', 'Unicorn' and 'Utile'
I'm trying to do a SELECT to find qWord strings which have at most one instance of the letter 't'.
Using REGEXP I thought it would be a trivial statement like:
SELECT *
FROM entries.qentries
WHERE (qWord REGEXP 'T{0,1}')
but I'm still getting 'Utter' and 'Utility' in the output -- along with 'Utile' and 'Unicorn'.
So what am I missing here?
(FWIW: MySQL 8.0.11, Community edition running on a Windows 8.1 machine)
Here's the full REGEXP and my apologies for not posting it initially. I'm looking for words composed only of specific letters and that part works fine.
But I also want words with a limited number of a given letter, say t:
SELECT * FROM entries.entries WHERE
(qWord NOT REGEXP 'C|F|G|I|J|K|P|Q|S|V|W|X|Y|Z|-')
AND (qWord REGEXP 'A|B|D|E|H|L|M|N|O|R|T|U')
AND (qWord REGEXP 't{0,1}') ;
I've also tried (qWord REGEXP 't{0}|t{1}') as well as (qWord REGEXP '(?<=[^t]|^)(t{0}|t{1})(?:[^t]|$)' )
without success, so I remain stuck
You can use the following regex:
SELECT *
FROM entries.qentries
WHERE (qWord REGEXP '^[^tT]*[tT]?[^tT]*$')
Explanations:
^, $ starting and ending anchors (this is needed to avoid word partial match)
[^tT]* any character that is not a t or a T 0 or more times
[tT]? at most one occurrence of t or T (? is equivalent to {0,1})
[^tT]* any character that is not a t or a T 0 or more times
Additional Notes:
[^tT] this character class accepts anything that is not a t or a T (spaces, ., \n and other characters are also accepted). If you want to accept only letters other than t and T, you can use [a-su-zA-SU-Z]. To allow other characters as well, add them at the end of the class: [a-su-zA-SU-Z -] also accepts words containing spaces and -.
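The difference between the unanchored T{0,1} and the anchored pattern can be reproduced with Python's re module (used here only as a stand-in for MySQL's engine). An unanchored T{0,1} is satisfied by the empty string, so it matches every word.

```python
import re

words = ["Utility", "Utter", "Unicorn", "Utile"]

# Unanchored: zero occurrences of T matches at position 0 of any string
loose = re.compile(r"T{0,1}", re.IGNORECASE)
# Anchored: the whole word may contain at most one t or T
strict = re.compile(r"^[^tT]*[tT]?[^tT]*$")

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]
print(loose_hits)
print(strict_hits)
```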

MySQL query to find matching string using REGEXP not working

I am using MySQL 5.5.
I have a table named nutritions, having a column serving_data with text datatype.
Some of the values in serving_data column are like:
[{"label":"1 3\/4 cups","unit":"3\/4 cups"},{"label":"1 cups","unit":"3\/4 cups"},{"label":"1 container (7 cups ea.)","unit":"3\/4 cups"}]
Now, I want to find records containing serving_data like 1 3\/4 cups .
For that I've made a query,
SELECT id,`name`,`nutrition_data`,`serving_data`
FROM `nutritions` WHERE serving_data REGEXP '(\d\s\\\D\d\scup)+';
But it does not seem to work.
Also I've tried
SELECT id,`name`,`nutrition_data`,`serving_data`
FROM `nutritions` WHERE serving_data REGEXP '/(\d\s\\\D\d\scup)+/g';
If I use the same pattern in http://regexr.com/ then it seems matching.
Can anyone help me?
Note that in the regex flavor of this MySQL version you cannot use shorthand classes like \d, \D or \s. Replace them with [0-9], [^0-9] and [[:space:]] respectively.
You may use
REGEXP '[0-9]+[[:space:]][0-9]+\\\\/[0-9]+[[:space:]]+cup'
See the regex demo (note that in general, regex101.com does not support MySQL regex flavor, but the PCRE option supports the POSIX character classes like [:digit:], [:space:], so it is only used for a demo here, not as a proof it works with MySQL REGEXP).
Pattern details:
[0-9]+ - 1 or more digits
[[:space:]] - a whitespace
[0-9]+ - 1 or more digits
\\\\/ - a literal \/ char sequence
[0-9]+[[:space:]]+cup - 1 or more digits, 1 or more whitespaces, cup.
Note that you can make the match on cup more precise with a word boundary: add a [[:>:]] pattern after it to match cup as a whole word.
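As a rough check of the pattern logic outside MySQL, the same structure can be run in Python. Python's re module has no [[:space:]], so \s is used instead, and the raw string below already holds the pattern as the regex engine sees it (one escaped backslash followed by /), without the extra doubling needed in a MySQL string literal.

```python
import re

# Sample value from the question; raw string keeps the backslashes as data
serving_data = (
    r'[{"label":"1 3\/4 cups","unit":"3\/4 cups"},'
    r'{"label":"1 cups","unit":"3\/4 cups"}]'
)

# digits, whitespace, digits, literal backslash + slash, digits, whitespace, cup
pattern = re.compile(r"[0-9]+\s[0-9]+\\/[0-9]+\s+cup")

match = pattern.search(serving_data)
no_match = pattern.search(r'{"label":"1 cups","unit":"3\/4 cups"}')
print(match.group(0) if match else None)
```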

MySQL REGEXP not producing expected results (not multi byte safe?). Is there a work around?

I'm trying to write a MySQL query to identify first name fields that actually contain initials. The problem is that the query is picking up records that should not match.
I have tested against the POSIX ERE regex implementation in RegEx Buddy to confirm my regex string is correct, but when running in a MySQL query, the results differ.
For example, the query should identify strings such as:
'A.J.D' or 'A J D'.
But it is also matching strings like 'Ralph' or 'Terrance'.
The query:
SELECT *, firstname REGEXP '^[a-zA-z]{1}(([[:space:]]|\.)+[a-zA-z]{1})+([[:space:]]|\.)?$' FROM test_table
The 'firstname' field here is VARCHAR 255 if that's relevant.
I get the same result when running with a string literal rather than table data:
SELECT 'Ralph' REGEXP '^[a-zA-z]{1}(([[:space:]]|\.)+[a-zA-z]{1})+([[:space:]]|\.)?$'
The MySQL documentation warns about potential issues with REGEXP, I'm unsure if this is related to the problem I'm seeing:
Warning The REGEXP and RLIKE operators work in byte-wise fashion, so
they are not multi-byte safe and may produce unexpected results with
multi-byte character sets. In addition, these operators compare
characters by their byte values and accented characters may not
compare as equal even if a given collation treats them as equal.
Thanks in advance.
If you're testing this in the mysql client, you need to escape the backslashes. Each occurrence of \. must become \\. This is necessary because the string literal is processed before the regex engine sees it, which turns \. into a bare . and a bare . matches any character, so names like 'Ralph' match. Escaping the backslash keeps it in the pattern.
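The effect of the lost backslash can be reproduced in Python by comparing the pattern with a bare . against the one with \. intact. This is a sketch under two assumptions: \s replaces [[:space:]], which Python's re does not support, and [a-zA-Z] replaces the question's [a-zA-z].

```python
import re

# What the regex engine receives when the backslash is dropped
unescaped = re.compile(r"^[a-zA-Z]((\s|.)+[a-zA-Z])+(\s|.)?$")
# What it receives when the backslash survives
escaped = re.compile(r"^[a-zA-Z]((\s|\.)+[a-zA-Z])+(\s|\.)?$")

names = ["A.J.D", "A J D", "Ralph", "Terrance"]
unescaped_hits = [n for n in names if unescaped.search(n)]
escaped_hits = [n for n in names if escaped.search(n)]
print(unescaped_hits)
print(escaped_hits)
```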