Cannot find a way to negate a regexp inside the pattern - mysql

I can't find a way to use a regexp to find all strings that contain a first word and a second word, where a specific word must not appear between the two.
select '<p>66-155</p><p>en application</p>' regexp '66-155.*[<,p,>]en application'
should return 0
select '<p>66-155 en application</p>' regexp '66-155.*[<,p,>]en application'
should return 1

In MySQL versions before 8.0, you cannot use a single regex for that since it is a POSIX based regex that does not support lookarounds.
With earlier versions of MySQL, you could use something like
LIKE '%66-155%en application%' AND NOT LIKE '%66-155%<p>%en application%'
If, instead of literal substrings, you have regex patterns, then it would look like
REGEXP '66-155.*en application' AND NOT REGEXP '66-155.*<p>.*en application'
In MySQL 8.x, with the ICU regex engine, you may use a lookaround based regex:
REGEXP '66-155(?:(?!<p>).)*?en application'
The (?:(?!<p>).)*? part is a tempered greedy token. It matches any char with . (other than a line break char; to match line breaks too, add (?s) at the pattern start), as few times as possible (due to the *? quantifier), as long as that char does not start a <p> char sequence.
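As a rough illustration outside MySQL, the two approaches above can be compared with Python's re module, which accepts the same lookahead syntax for this pattern. This is only a sketch: Python's engine is not ICU, the helper names match_tempered and match_two_step are hypothetical, and the first two sample strings are the ones from the question.

```python
import re

# Tempered greedy token: every consumed char must not begin "<p>".
TEMPERED = re.compile(r"66-155(?:(?!<p>).)*?en application")

# Two-pattern fallback mirroring "REGEXP ... AND NOT REGEXP ...".
WANTED = re.compile(r"66-155.*en application")
FORBIDDEN = re.compile(r"66-155.*<p>.*en application")


def match_tempered(text: str) -> bool:
    """Single-pattern check (lookaround-capable engines only)."""
    return TEMPERED.search(text) is not None


def match_two_step(text: str) -> bool:
    """Two-condition check that also works without lookarounds."""
    return WANTED.search(text) is not None and FORBIDDEN.search(text) is None


samples = [
    "<p>66-155</p><p>en application</p>",
    "<p>66-155 en application</p>",
    "66-155 en application<p>en application",
]
for s in samples:
    print(s, match_tempered(s), match_two_step(s))
```

The third sample shows that the two approaches are not strictly equivalent: the tempered pattern accepts it because the first "en application" is reached before any <p>, while the two-step check rejects it because a later "en application" does follow a <p>.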

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') part matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase starts/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
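The behavior of these adaptive boundaries can be checked quickly outside the database. The sketch below uses Python's re module as a stand-in for the ICU engine, with re.escape playing the role of the REGEXP_REPLACE escaping step; the function name keyword_in_text is hypothetical, and the sample text is the paragraph from the question.

```python
import re


def keyword_in_text(keyword: str, text: str) -> bool:
    """Search for keyword as a whole word, even if it starts or
    ends with a non-word char such as '+', '#' or '.'."""
    # (?!\B\w): a leading boundary is required only if the keyword starts with a word char.
    # (?<!\w\B): a trailing boundary is required only if the keyword ends with a word char.
    pattern = r"(?!\B\w)" + re.escape(keyword) + r"(?<!\w\B)"
    return re.search(pattern, text) is not None


text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")

for kw in ["C++", ".NET", "JavaScript", "Java", "C#"]:
    print(f"{kw!r}: {keyword_in_text(kw, text)}")
```

Here "Java" is rejected because it is immediately followed by another word char inside "JavaScript", whereas "C++" and ".NET" are accepted even though a plain \b would fail next to their non-word chars.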
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.
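One trade-off of a plain substring search is that it has no notion of word boundaries. A minimal sketch of that effect, using Python's in operator as a stand-in for LOCATE() (an assumption for illustration; unlike LOCATE() under a case-insensitive collation, in is always case-sensitive):

```python
text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")

# Literal substring search: no metacharacters to escape.
found_cpp = "C++" in text
found_dotnet = ".NET" in text

# But a keyword is also found inside a longer word.
found_java = "Java" in text  # hit comes from "JavaScript"

print(found_cpp, found_dotnet, found_java)
```

So the substring approach handles special characters without any escaping, but it will also report "Java" for a text that only mentions "JavaScript".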

Regex to match exact number in string without lookahead or lookbehind

Let's say I'm looking for a specific number in a string with newlines:
1\n2\n4\n5\n7\n8\n9\n12\n13
Lookahead and lookbehind work perfectly with something like:
(?<![0-9])12(?![0-9])
See Regex 101 demo.
However, I need a workaround for the MySQL REGEXP, as it does not support patterns with lookarounds.
If (?<![0-9])12(?![0-9]) works for you in an online regex tester, the following equivalent should work in a POSIX based REGEXP pattern:
(^|[^0-9])12($|[^0-9])
Or, you may use word boundaries:
[[:<:]]12[[:>:]]
See the regex demo.
Details:
(^|[^0-9]) - start of string or a non-digit char
[[:<:]] - leading word boundary
12 - a specific numeric value, 12 here
[[:>:]] - trailing word boundary
($|[^0-9]) - end of string or a non-digit.
For MySQL 8.0,
"\\b12\\b"
or
"\\s12\\s"
The former checks for "word boundaries" and might be better. The latter requires a whitespace character, such as \n, on both sides, so it will not match when 12 is the first or last value in the string.
The doubled backslashes may be overkill, depending on the client.
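To compare the variants above side by side, here is a small sketch using Python's re module as a stand-in for both MySQL engines. The helper name has and the extra test strings are made up for illustration; Python has no [[:<:]] / [[:>:]], so \b is used for the word-boundary variant.

```python
import re

data = "1\n2\n4\n5\n7\n8\n9\n12\n13"

lookaround = re.compile(r"(?<![0-9])12(?![0-9])")      # original pattern
posix_style = re.compile(r"(^|[^0-9])12($|[^0-9])")    # no lookarounds needed
word_boundary = re.compile(r"\b12\b")                  # MySQL 8.0 style
whitespace = re.compile(r"\s12\s")                     # needs whitespace on both sides


def has(pattern, text: str) -> bool:
    """True if pattern matches anywhere in text."""
    return pattern.search(text) is not None


for text in [data, "1\n112\n123", "12\n13"]:
    print(repr(text),
          has(lookaround, text),
          has(posix_style, text),
          has(word_boundary, text),
          has(whitespace, text))
```

The first three patterns agree on all three strings. The whitespace variant differs on "12\n13", where 12 sits at the very start of the string and has no whitespace before it.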

REGEX alternative

Can anyone suggest an alternative way to write the two regexes below without the question mark (?)?
^(?:2131|1800|35\d{3})\d{11}$
^4[0-9]{12}(?:[0-9]{3})?$
Or, can you suggest how to write a query that searches for VISA and JCB card patterns in SQL?
I just want to write a query that searches for card patterns in my database, and I am trying to use regular expressions to do this.
Unfortunately, POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers the way PCRE (Perl Compatible Regular Expressions) does. This means you can't use +? and *?.
In MySQL versions before 8.0, you need to use POSIX ERE-like regex syntax, that is:
You can't use non-capturing groups
You can't use \d shorthand character class for digits, you need to use [[:digit:]] or [0-9]
You won't be able to use lazy quantifiers, but your patterns do not contain them. In some cases, they can be replaced with negated bracket expressions (e.g. a.*?b is better written as a[^ab]*b).
In your case, you need to replace (?: with ( and replace \d with [0-9]
^(2131|1800|35[0-9]{3})[0-9]{11}$
^4[0-9]{12}([0-9]{3})?$
Dropping the question mark in (?: turns it into a normal (capturing) group.
If you need to avoid the question mark entirely, use ){0,1} instead of )? for the optional group.
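A quick way to gain confidence in such a rewrite is to run both versions over a few sample values and confirm they give the same verdict. The sketch below does this with Python's re module as a stand-in for MySQL; the sample numbers are made-up digit strings of the right and wrong lengths, not real card data, and same_verdict is a hypothetical helper.

```python
import re

# Original PCRE-style patterns from the question.
JCB_PCRE = re.compile(r"^(?:2131|1800|35\d{3})\d{11}$")
VISA_PCRE = re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$")

# Rewritten without "?:", "\d" or any question mark at all.
JCB_POSIX = re.compile(r"^(2131|1800|35[0-9]{3})[0-9]{11}$")
VISA_POSIX = re.compile(r"^4[0-9]{12}([0-9]{3}){0,1}$")

samples = [
    "4" + "1" * 15,            # 16 digits, VISA-shaped
    "4" + "1" * 12,            # 13 digits, VISA-shaped
    "4" + "1" * 13,            # 14 digits, should be rejected
    "35" + "123" + "0" * 11,   # 16 digits, JCB-shaped
    "2131" + "0" * 11,         # 15 digits, JCB-shaped
    "1234",                    # too short for either
]


def same_verdict(value: str) -> bool:
    """True if the rewritten patterns agree with the originals."""
    jcb_ok = bool(JCB_PCRE.search(value)) == bool(JCB_POSIX.search(value))
    visa_ok = bool(VISA_PCRE.search(value)) == bool(VISA_POSIX.search(value))
    return jcb_ok and visa_ok


for value in samples:
    print(value, bool(VISA_POSIX.search(value)), bool(JCB_POSIX.search(value)),
          same_verdict(value))
```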

Regular expressions convert PCRE to POSIX to use regex in MySQL query

I have this PCRE regex: (this regex with example on RegExr http://regexr.com/39lbb )
/(?=.*?ben)(?=.*?john).*/ig
And I have this PCRE regex: (this regex with example on RegExr http://regexr.com/39lbh )
/.*?\b(john|ben)\b.*/ig
How can I convert these PCRE regexes to POSIX, or how can I create equivalent POSIX regexes? I want to use them in my MySQL query ( REGEXP http://dev.mysql.com/doc/refman/5.1/en/regexp.html )
Thanks
Just drop the look-ahead and non-greedy matching, you don't need any of it.
[[:<:]](john|ben)[[:>:]]
Note that [[:<:]] and [[:>:]] are start-of-word and end-of-word boundaries, respectively (\b is both).
I also suspect you want to find strings that contain the words 'ben' or 'john', not match their contents, so I assume the .* is superfluous as well.
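The answer above covers the second pattern (either name as a whole word). The first pattern in the question requires both "ben" and "john" to appear somewhere in the string, and without lookaheads that can be expressed as two separate conditions joined with AND (in MySQL, presumably something like col REGEXP 'ben' AND col REGEXP 'john', where col is a placeholder column name). A minimal sketch of both checks, using Python's re module as a stand-in with \b in place of [[:<:]] / [[:>:]]; the function names are hypothetical:

```python
import re

EITHER_WORD = re.compile(r"\b(john|ben)\b", re.IGNORECASE)


def contains_either_word(text: str) -> bool:
    """Second pattern: 'john' or 'ben' as a whole word."""
    return EITHER_WORD.search(text) is not None


def contains_both(text: str) -> bool:
    """First pattern: both substrings, in any order, as two conditions."""
    has_ben = re.search(r"ben", text, re.IGNORECASE) is not None
    has_john = re.search(r"john", text, re.IGNORECASE) is not None
    return has_ben and has_john


for text in ["Ben met John", "benjamin", "john only", "johnny and benny"]:
    print(repr(text), contains_either_word(text), contains_both(text))
```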

__* in Mysql regular expression

I am reading some open source code, and in it I found an SQL query with this kind of filter:
select sometext from table1,table2 where table1.sometext LIKE
CONCAT('% ',table2.test_keyword,' %') AND table2.test_keyword NOT
REGEXP '__*';
What is that __* in this sql?
__* matches one _ followed by zero or more _s.
__*
^^^
||\__ (zero or more)   ^
|\___ underscore       |
\____ underscore, then |
_+ would have done the same job.
_+
^^
|\__ (one or more) ^
\___ underscore    |
It's simply one or more underscore characters.
The pattern is best read as:
'_', exactly one underscore,
'_*', followed by zero or more underscores.
Keep in mind that, without a start marker, that will match the pattern at any location in the string, so it basically means any string with an underscore in it (or, more accurately, since you're using NOT, a string without an underscore).
It's also needlessly complex, since you could achieve the same effect with AND table2.test_keyword NOT REGEXP '_'.
See here for the latest MySQL documentation on regexes (5.6 at the time of this answer).
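The claim that all three spellings behave the same in an unanchored search is easy to check. A small sketch using Python's re module as a stand-in for MySQL REGEXP (the sample strings and the verdicts helper are made up for illustration):

```python
import re

patterns = ["__*", "_+", "_"]
samples = ["", "abc", "_", "a_b", "a__b", "___"]


def verdicts(text: str) -> list:
    """Result of an unanchored search for each pattern."""
    return [re.search(p, text) is not None for p in patterns]


for s in samples:
    print(repr(s), verdicts(s))
```

For every sample, the three patterns either all match or all fail, which is why NOT REGEXP '_' is enough to select keywords without an underscore.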