Regex to match exact number in string without lookahead or lookbehind - mysql

Let's say I'm looking for a specific number in string with newlines:
1\n2\n4\n5\n7\n8\n9\n12\n13
Lookahead and lookbehind work perfectly with something like:
(?<![0-9])12(?![0-9])
See Regex 101 demo.
However, I need a workaround for the MySQL REGEXP, as it does not support patterns with lookarounds.

If (?<![0-9])12(?![0-9]) works for you in an online regex tester, the following equivalent should work in a POSIX-based REGEXP pattern:
(^|[^0-9])12($|[^0-9])
Or, you may use word boundaries (note that these treat letters and underscores, not just digits, as word characters):
[[:<:]]12[[:>:]]
See the regex demo.
Details:
(^|[^0-9]) - start of string or a non-digit char
[[:<:]] - leading word boundary
12 - a specific numeric value, 12 here
[[:>:]] - trailing word boundary
($|[^0-9]) - end of string or a non-digit.
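To sanity-check that the lookaround form and the POSIX-style form accept the same inputs, here is a minimal sketch in Python. Python's re module is only a stand-in (an assumption, it is not the MySQL engine), and every sample string other than the one from the question is made up for illustration.

```python
import re

# The newline-separated string from the question.
text = "1\n2\n4\n5\n7\n8\n9\n12\n13"

# Lookaround version (not available in POSIX-based REGEXP).
lookaround = re.compile(r"(?<![0-9])12(?![0-9])")
# POSIX-friendly version: start/end of string or a non-digit on each side.
posix_style = re.compile(r"(^|[^0-9])12($|[^0-9])")

def has_number(pattern, s):
    """Return True when the pattern finds a standalone 12 in s."""
    return pattern.search(s) is not None

# Hypothetical extra samples: 12 embedded in longer numbers, next to letters, alone.
for sample in (text, "112", "120", "a12b", "12"):
    print(repr(sample), has_number(lookaround, sample), has_number(posix_style, sample))
```

The two patterns only agree on whether a match exists. The POSIX-style one also consumes the neighbouring non-digit, which matters if you ever extract or replace the match instead of just testing for it.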

For MySQL 8.0,
"\\b12\\b"
or
"\\s12\\s"
The former checks for "word boundaries" and is usually the better choice. The latter requires a whitespace character, such as \n, on both sides, so it misses a 12 at the very start or end of the string.
The doubled backslashes may be overkill, depending on the client.
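The difference between the two patterns shows up at the edges of the string and next to letters. A small sketch, again using Python's re module as a stand-in for the MySQL 8.0 engine (an assumption), with made-up sample strings:

```python
import re

word_boundary = re.compile(r"\b12\b")
whitespace = re.compile(r"\s12\s")

# Hypothetical samples: 12 in the middle, at the end, at the start,
# glued to a letter, and glued to an underscore.
samples = ["9\n12\n13", "9\n12", "12\n13", "a12", "x_12"]

for s in samples:
    print(repr(s), bool(word_boundary.search(s)), bool(whitespace.search(s)))
```

The \s version needs a real whitespace character on each side, so it fails when 12 is the first or last item. The \b version handles those cases but rejects a12 and x_12, because letters and underscores are word characters too. If only digits should block a match, the (^|[^0-9])12($|[^0-9]) form is the closer fit.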

Related

MySQL - Regexp for word boundary excluding underscore (connector punctuation)

I'm using the regex word boundary \b to match a word in the following sentence, but the result is not what I need: connector punctuation characters (such as the underscore) are not treated as word boundaries.
Sentence: ab﹎cd_de_gf|ij|kl|mn|op_
Regexp: \\bkl\\b
However, de is not getting matched.
I tried updating the regexp to use unicode connector punctuation (it's product requirement as we support CJK languages as well) but that isn't working.
Regexp: (?<=\\b|[\p{Pc}])de(?=\\b|[\p{Pc}])
What am I missing here?
Note: (?<=\\b|_)de(?=\\b|_) seems to work for underscores, but I need the regex to work for all connector punctuation characters.
Thanks in advance !!
Based on the use case you have described you can simplify your regex to:
(?<![[:alnum:]])de(?![[:alnum:]])
instead of trying to match word boundaries, unicode punctuation characters etc.
This will match de if it is not preceded or followed by any alphanumeric character.
To match any connector punctuation characters you need \p{Pc}:
(?<=\\b|\\p{Pc})de(?=\\b|\\p{Pc})
NOTE: \p{Pc} can also be written as [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F], which matches all 10 of these chars.
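A rough way to try the "no alphanumeric neighbour" idea outside MySQL is sketched below in Python. Two things are assumptions here: Python's re has neither [[:alnum:]] nor \p{Pc}, so [^\W_] stands in for the alphanumeric class, and the explicit character class from the note is checked against Python's own Unicode tables with unicodedata.

```python
import re
import unicodedata

# The explicit class quoted in the note as an equivalent of \p{Pc}.
PC_CLASS = re.compile("[_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F]")

# Python spelling of "alphanumeric": word characters minus the underscore.
ALNUM = r"[^\W_]"

def find_term(term, text):
    """Match term only when no alphanumeric char touches it on either side."""
    pattern = rf"(?<!{ALNUM}){re.escape(term)}(?!{ALNUM})"
    return re.search(pattern, text)

# The sentence from the question; \uFE4E is the connector between "ab" and "cd".
sentence = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_"

print(bool(find_term("de", sentence)))  # surrounded by underscores
print(bool(find_term("cd", sentence)))  # preceded by U+FE4E
print(bool(find_term("d", sentence)))   # always glued to another letter

# Every BMP character accepted by the explicit class, with its Unicode category.
pc_chars = [chr(c) for c in range(0x10000) if PC_CLASS.fullmatch(chr(c))]
print([(hex(ord(c)), unicodedata.category(c)) for c in pc_chars])
```

Because the lookarounds only exclude letters and digits, any connector punctuation, pipe, or other symbol counts as a valid separator without having to list it.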

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
    sk.ID
FROM
    sit_keyword sk
WHERE
    var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call finds the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require a word boundary only when the search phrase starts/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
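The behaviour of these adaptive boundaries can be tried out with a short Python sketch. This is not the MySQL query itself: re.escape plays the role of the REGEXP_REPLACE escaping step, Python's re stands in for the ICU engine (an assumption), and the helper name keyword_pattern is made up for illustration.

```python
import re

def keyword_pattern(keyword):
    """Escape the keyword and wrap it in adaptive word boundaries.

    (?!\\B\\w)  : if the keyword starts with a word char, require a boundary before it.
    (?<!\\w\\B) : if the keyword ends with a word char, require a boundary after it.
    """
    return re.compile(r"(?!\B\w)" + re.escape(keyword) + r"(?<!\w\B)")

# The paragraph from the question.
text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")

# Keywords from the question plus "Java" to show that partial words are rejected.
keywords = ["C++", ".NET", "C#", "A+", "Unit Test", "Java", "JavaScript"]

matched = [k for k in keywords if keyword_pattern(k).search(text)]
print(matched)
```

A plain \b on both sides would reject C++ and .NET, because there is no word boundary between a space and a "." or between "+" and a space. The adaptive form simply skips the boundary check on whichever side ends in a non-word character.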
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

Cannot find a way to negate a regexp inside the pattern

I can't find a way to write a regexp that matches all values containing the first word and the second word, but only when a specific word does not appear between the two.
select '<p>66-155</p><p>en application</p>' regexp '66-155.*[<,p,>]en application'
should return 0
select '<p>66-155 en application</p>' regexp '66-155.*[<,p,>]en application'
should return 1
In MySQL versions before 8.0, you cannot use a single regex for that since it is a POSIX based regex that does not support lookarounds.
With earlier versions of MySQL, you could use something like
LIKE '%66-155%en application%' AND NOT LIKE '%66-155%<p>%en application%'
If, instead of literal substrings, you have regex patterns, then it would look like
REGEXP '66-155.*en application' AND NOT REGEXP '66-155.*<p>.*en application'
In MySQL 8.x, with the ICU regex engine, you may use a lookaround based regex:
REGEXP '66-155(?:(?!<p>).)*?en application'
The (?:(?!<p>).)*? is a tempered greedy token: it matches any char with . (other than a line break char; to match line breaks too, add (?s) at the pattern start), as few times as possible (due to the *? quantifier), as long as that char does not start a <p> char sequence.
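Both approaches can be compared on the two strings from the question with a small Python sketch. Python's re module is used as a stand-in for the MySQL engines (an assumption), and the function names are made up for illustration.

```python
import re

# Single pattern with a tempered greedy token (needs lookahead support).
tempered = re.compile(r"66-155(?:(?!<p>).)*?en application")

# Two-condition variant for engines without lookarounds.
first_then_last = re.compile(r"66-155.*en application")
p_in_between = re.compile(r"66-155.*<p>.*en application")

def single_regex(s):
    return tempered.search(s) is not None

def two_conditions(s):
    # Equivalent of: REGEXP '...' AND NOT REGEXP '...<p>...'
    return first_then_last.search(s) is not None and p_in_between.search(s) is None

rows = [
    "<p>66-155</p><p>en application</p>",  # expected by the question: 0
    "<p>66-155 en application</p>",        # expected by the question: 1
]

for row in rows:
    print(row, single_regex(row), two_conditions(row))
```

The two variants agree on these inputs, but they are not strictly equivalent: when a value contains several occurrences of the first or second word, the two-condition form can reject a row that the single pattern accepts.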

To fetch hash tags from a string in mysql

Friends,
I want to fetch hashtags from a field.
select PREG_RLIKE("/[[:<:]]abcd[[:>:]]/","okok got it #abcd");
//output 1
BUT
select PREG_RLIKE("/[[:<:]]#abcd[[:>:]]/","okok got it #abcd");
//output 0
I don't understand why the # is not being taken into account.
Please help
The pattern matches:
[[:<:]] - a leading word boundary
#abcd - a literal string
[[:>:]] - a trailing word boundary.
Since a leading word boundary is a location between a non-word and a word char (or start of a string and a word char), you can't expect it to be matched between a space (non-word char) and a hash symbol (#).
Since you are using a PCRE based UDF function, use lookarounds:
select PREG_RLIKE("/(?<!\\w)#abcd(?!\\w)/","okok got it #abcd");
The (?<!\w) negative lookbehind acts like a leading word boundary failing the match if the search term is preceded with a word char, and (?!\w) negative lookahead fails the match if the search term is followed with a word char.
See the regex demo.
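The same effect can be reproduced in Python, whose re module is used here only as a stand-in for the PCRE-based UDF (an assumption). The second sample string is made up to show what the lookbehind rejects.

```python
import re

text = "okok got it #abcd"

plain_word = re.compile(r"\babcd\b")          # boundary between "#" and "a": matches
hash_with_b = re.compile(r"\b#abcd\b")        # no boundary between " " and "#": fails
hash_lookaround = re.compile(r"(?<!\w)#abcd(?!\w)")

print(bool(plain_word.search(text)))
print(bool(hash_with_b.search(text)))
print(bool(hash_lookaround.search(text)))

# Hypothetical counter-example: the hashtag glued to a preceding word char.
glued = "okx#abcd"
print(bool(hash_with_b.search(glued)), bool(hash_lookaround.search(glued)))
```

Note the inversion on the glued sample: \b#abcd\b does match there, because a word boundary exists between "x" and "#", which is the opposite of what a hashtag search usually wants.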

Regex to SQL: repetition-operator operand invalid

I'm trying to use a regex to detect URLs in all the rows of my table, here's the regex
\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))
However, I invariably get the "repetition-operator operand invalid" error, which, after hours of search on the internet, still remains obscure.
Where have I gone wrong? What can I do to fix this? And alternatively, is there a better way to detect URLs in messages in SQL other than a regex?
Thank you.
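The error comes from the non-capturing group (?:...): the MySQL regex syntax here is POSIX-based, so the ? right after ( is read as a quantifier with nothing to repeat. Use a plain group instead. The ? quantifier after an ordinary character is valid in POSIX extended syntax; in the pattern below I replace \/? with \/*, which matches zero or more slashes. Also, \b in MySQL regex should be replaced with [[:<:]] (since this matches at the beginning of a word).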
Thus, I suggest using
[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))
I am expanding \w to [a-zA-Z0-9_] as it is exactly what \w is. Instead of \s, I am using a literal space. Instead of \d, I am using [0-9]. This is done for readability and better compatibility. If \w, \d and \s work for you, you can use them, but I do not see them among the supported entities in POSIX specs.
Also, instead of literal space, you could use [:space:], it matches space, tab, newline, and carriage return. Instead of [a-zA-Z] you can use [:alpha:], and instead of [0-9], you can use [:digit:]. Please also check this:
[[:<:]](([[:alpha:][:digit:]-]+:\/\/*|www[.])[^[:space:]()<>]+(\([[:alpha:][:digit:]_]+\)|([^[:space:][:punct:]]|\/)))
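To get a feel for what the rewritten pattern accepts, here is a rough Python translation. Everything in it is an assumption made for illustration: Python's re has no [[:<:]] or [:punct:], so \b and an escaped string.punctuation are used as approximations, and the sample messages and the find_url helper are made up.

```python
import re
import string

# ASCII approximation of the POSIX [:punct:] class, escaped for use inside [...].
PUNCT = re.escape(string.punctuation)

URL = re.compile(
    r"\b(([a-zA-Z0-9-]+://*|www[.])"   # scheme followed by :/ and more slashes, or www.
    r"[^\s()<>]+"                      # body: no whitespace, parentheses or angle brackets
    r"(\([a-zA-Z0-9_]+\)|([^\s" + PUNCT + r"]|/)))"  # must not end on punctuation
)

def find_url(message):
    """Return the first URL-like substring of message, or None."""
    m = URL.search(message)
    return m.group(0) if m else None

# Hypothetical sample messages.
for msg in ("see http://example.com/page now",
            "visit www.example.com.",
            "no links here"):
    print(repr(msg), find_url(msg))
```

The last group is what keeps a sentence-ending period or comma out of the match: the URL has to end in a parenthesised token, a slash, or a character that is neither whitespace nor punctuation.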