__* in Mysql regular expression - mysql

I am referring to some open source code, where I found an SQL query with this kind of filter:
select sometext from table1,table2 where table1.sometext LIKE
CONCAT('% ',table2.test_keyword,' %') AND table2.test_keyword NOT
REGEXP '__*';
What is that __* in this sql?

__* matches one _ followed by zero or more _s.
__*
^^^
||\__ (zero or more of the preceding underscore)
|\___ underscore
\____ underscore, then
_+ would have done the same job.
_+
^^
|\__ (one or more of the preceding underscore)
\___ underscore

It's simply one or more underscore characters.
The pattern is best read as:
'_', exactly one underscore,
'_*', followed by zero or more underscores.
Keep in mind that, without a start marker, that will match the pattern at any location in the string, so it basically means any string with an underscore in it (or, more accurately, since you're using NOT, a string without an underscore).
It's also needlessly complex, since you could achieve the same effect with AND table2.test_keyword NOT REGEXP '_'.
See here for the latest MySQL documentation on regexes (5.6 at the time of this answer).
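To check the equivalence claimed above, here is a small sketch. It uses Python's re module rather than MySQL's regex engine, which is an assumption made purely for illustration; these three patterns use only syntax that behaves the same way in both. The sample strings are made up.

```python
import re

# Hypothetical sample keywords; not taken from the original tables.
samples = ["plain", "snake_case", "__init__", "_", ""]

# The three patterns discussed above. Unanchored, each one succeeds
# exactly when the string contains at least one underscore.
patterns = ["__*", "_+", "_"]

def contains_match(pattern, text):
    # Unanchored search, similar in spirit to REGEXP without ^ or $.
    return re.search(pattern, text) is not None

results = {
    text: [contains_match(p, text) for p in patterns]
    for text in samples
}

for text, flags in results.items():
    print(repr(text), flags)
```

Every row prints three identical booleans, which is why NOT REGEXP '_' is enough for the filter.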

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there are a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require a word boundary only when the search phrase starts/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
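As a rough illustration of the difference between plain \b and the adaptive boundaries, here is a sketch using Python's re module. That is an assumption for demonstration only: MySQL 8.0 uses a different regex engine, and re.escape stands in for the REGEXP_REPLACE escaping step. The function names are hypothetical; the sample text is the paragraph from the question.

```python
import re

text = ("Senior software engineer and C++ developer with Unit Test "
        "and JavaScript experience. I also have .NET experience!")

def plain_boundary_match(keyword, haystack):
    # Escaped keyword wrapped in \b ... \b, as in the question.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, haystack) is not None

def adaptive_boundary_match(keyword, haystack):
    # (?!\B\w)  : only demands a boundary if the keyword starts with a word char
    # (?<!\w\B) : only demands a boundary if the keyword ends with a word char
    pattern = r"(?!\B\w)" + re.escape(keyword) + r"(?<!\w\B)"
    return re.search(pattern, haystack) is not None

for kw in ["C++", ".NET", "JavaScript", "Unit Test", "Java"]:
    print(kw, plain_boundary_match(kw, text), adaptive_boundary_match(kw, text))
```

In this sketch the plain \b version misses C++ and .NET, while the adaptive version finds them and still refuses to match Java inside JavaScript.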
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.
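One thing to keep in mind with a plain substring search is that it has no notion of word boundaries. The sketch below mimics the shape of LOCATE() with Python's str.find, which is an assumption for illustration (the helper name locate is hypothetical, and Python's search is case-sensitive whereas MySQL's behaviour depends on the collation).

```python
text = ("Senior software engineer and C++ developer with Unit Test "
        "and JavaScript experience. I also have .NET experience!")

def locate(substr, s):
    # Mimics the return shape of LOCATE(substr, str):
    # 1-based position of the first occurrence, 0 if absent.
    return s.find(substr) + 1

print(locate("C++", text) > 0)   # special characters need no escaping
print(locate("Java", text) > 0)  # also true: found inside "JavaScript"
print(locate("Rust", text) > 0)
```

So the special characters stop being a problem, but a keyword like Java will also be found inside JavaScript.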

MySQL RegEx to match two consecutive digits that are the same

I am using the following RegEx in MySQL to match two consecutive digits that are the same anywhere in a string:
^.*([[:digit:]])\1+.*$
It matches correctly the following strings:
8831
5011
9931
but it also matches
9318
and it doesn't match
3449
Is the problem around .* or is it something else?
There's no way to check for the same thing twice directly; instead you would need to check for all possibilities. Luckily, since you are only looking at 10 digits, it's relatively easy:
(11|22|33|44|55|66|77|88|99|00)
I don't think MySQL regular expressions have back references. You can do the more verbose:
where col regexp '00|11|22|33|44|55|66|77|88|99'
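A small sketch of the alternation approach, run against the strings from the question. It uses Python's re module, which is an assumption for illustration only. It also shows a likely explanation for the odd results in the question: in an ordinary MySQL string literal an unrecognised escape such as \1 is read as a plain 1, so the engine would have received "a digit followed by one or more 1s". That reading assumes the pattern was typed as a normal quoted literal without doubled backslashes.

```python
import re

samples = ["8831", "5011", "9931", "9318", "3449"]

# Build '00|11|22|...|99' instead of typing it out.
doubled = "|".join(d * 2 for d in "0123456789")

def found(pattern, s):
    return re.search(pattern, s) is not None

alternation = {s: found(doubled, s) for s in samples}

# Python's engine does support backreferences; shown only for comparison.
backref = {s: found(r"([0-9])\1", s) for s in samples}

# What the question's pattern becomes if \1 is read as a literal 1.
as_mysql_saw_it = {s: found(r"^.*([0-9])1+.*$", s) for s in samples}

print(alternation)
print(as_mysql_saw_it)
```

The last dictionary reproduces the behaviour described in the question: 9318 matches and 3449 does not.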

MySQL matching this regex while it shouldn't

I'm trying to recognize quoting (citing) of somebody else's sentence in markdown text, which I have in my local copy of the MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
It matches some unwanted data which, according to https://regex101.com/, it should not match with this particular regular expression.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL Workbench didn't show the carriage return and newline characters until I copy-pasted the text here.
Can I normalize the text (remove \r and \n) with an update query?
Is MySQL's regex implementation different from POSIX standard regex?
Do you by any chance have a clean solution for recognizing quoting in markdown text?
Thanks!
You've got an awful lot of parens in there. Try this as functionally what you have above:
select * from github_discussions where body rlike '.*[[:blank:]]+>[[:blank:]]+.+'
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[[:blank:]]*>[[:blank:]]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)
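For what it's worth, here is a sketch of why the original query and regex101 probably disagreed, plus a line-anchored variant. It uses Python's re module, which is an assumption for illustration. In an ordinary MySQL string literal, \s is read as a plain s, so the engine would have received (s){1,} rather than a whitespace class; this assumes the query was typed exactly as shown in the question. The quoted reply is abbreviated from the question, and Python's re.MULTILINE flag stands in for whatever line-mode option your MySQL version offers.

```python
import re

def matches(pattern, s):
    return re.search(pattern, s) is not None

# The pattern as the question intended it, and as MySQL likely received it.
intended = r"(.)*(\s){1,}(>)(\s){1,}(.)+"
as_mysql_saw_it = r"(.)*(s){1,}(>)(s){1,}(.)+"

false_positive = "`Params` is plural -> contain<s>s</s>"
quoted_reply = ("Yes, I believe so.\r\n\r\nK\r\n\r\n"
                "> On 19-Jul-2014, at 17:33, someone wrote:\r\n> \r\n"
                "> This is the standard 3-clause BSD license, right?")

print(matches(as_mysql_saw_it, false_positive))  # "s>s" inside <s>s</s>
print(matches(as_mysql_saw_it, quoted_reply))
print(matches(intended, false_positive))
print(matches(intended, quoted_reply))

# Line-anchored variant: '>' at the start of any line,
# optionally preceded by spaces or tabs.
quote_line = re.compile(r"^[ \t]*>[ \t]+", re.MULTILINE)
print(quote_line.search(quoted_reply) is not None)
print(quote_line.search("this is before > and after") is not None)
```

With the literal-s reading, the unwanted row matches and the real quote does not, which is the behaviour reported in the question.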

MySQL Regular Expression [a-z]\.[a-z] but not a.m. or p.m

Evening,
I want to search some columns in a MySQL table for any instances of [a-z]\.[a-z], for example:
John.than, Ame.ica, Llan.antffraid etc.
but I don't want this to include the strings 'a.m.' OR 'p.m.'. I have tried using (?!a.m.|p.m.) but this does not work. It returns the error: "Got error 'repetition-operator operand invalid' from regexp".
I have the following regular expression:
REGEXP BINARY '[a-z]\\\.[a-z]'
N.B. If a column includes a.m. OR p.m. but also contains a string like bro.ken, it needs to be returned.
Build your regex step by step:
You want everything except a "standalone" a.m or p.m:
[b-oq-z]{1}\.[a-ln-z]{1} matches everything of the format x.y that is not a.# or p.# or #.m
However, that also misses a.a, a.b, a.c and so on, so add those cases:
a\.[^m] (same for the p-cases: p\.[^m])
a.m is valid when there are characters in front of the a: kra.m, tra.m. The same applies to p.m: erp.m
[a-z]{1}[ap]\.m covers this condition.
Now we are missing strings where the second part is longer: a.mod, p.markt:
[ap]\.m[a-z]+ covers that one.
Finally, only the ones ending with .m but having a different prefix are missing:
[b-oq-z]{1}\.m
This should now cover all possible use cases. Simply combine the patterns with OR (|) and you are done:
([b-oq-z]{1}\.[a-ln-z]{1}|a\.[^m]|p\.[^m]|[a-z]{1}[ap]\.m|[ap]\.m[a-z]+|[b-oq-z]{1}\.m)
Note: This will NOT give you the exact match groups. But since you use it in an SQL query, all that matters is whether there is a match. (ark.m will be matched by k.m, but it fulfills your specification.)
Keep in mind: when creating a regular expression, there is no single right solution, only working ones and non-working ones. a\.[^m]|p\.[^m] is equal to [ap]\.[^m], which will reduce the pattern by one OR.
You have found the perfect regex pattern when 2 conditions are met:
It works!
You can understand it, when looking at it in 4 months!
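A quick way to sanity-check the combined pattern is to run it over a few strings. The sketch below does that with Python's re module, which is an assumption for illustration (the pattern uses only syntax common to both engines); the test strings are made up apart from the examples mentioned above.

```python
import re

# The combined pattern from above, split over two lines for readability.
combined = re.compile(
    r"([b-oq-z]{1}\.[a-ln-z]{1}|a\.[^m]|p\.[^m]"
    r"|[a-z]{1}[ap]\.m|[ap]\.m[a-z]+|[b-oq-z]{1}\.m)"
)

cases = [
    "Ame.ica wakes up at 8 a.m.",   # contains e.i
    "America wakes up at 8 a.m.",   # only a standalone a.m.
    "meet at 9 p.m. sharp",         # only a standalone p.m.
    "bro.ken clock, 9 p.m.",        # contains o.k
    "tra.m",                        # a.m with a letter in front
    "a.mod",                        # a.m followed by more letters
]

results = {s: combined.search(s) is not None for s in cases}
for s, hit in results.items():
    print(hit, s)
```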
If you can use assertions, this might work, but not sure about backtracking.
# (?=^.*(?:(?!a\.m|p\.m)[a-z]\.[a-z]|(?:a\.m|p\.m).*(?!a\.m|p\.m)[a-z]\.[a-z]))
(?=
^
.*
(?:
(?! a\.m | p\.m )
[a-z] \. [a-z]
|
(?: a\.m | p\.m )
.*
(?! a\.m | p\.m )
[a-z] \. [a-z]
)
)
I would do it like this:
SELECT 'Ame.ica wakes up at 8 a.m.' REGEXP
'[b-oq-z]\\.[a-ln-z]|[ap]\\.[^m]|[^ap]\\.m|[[:alpha:]][ap]\\.m|[ap]\\.m[[:alpha:]]' findme,
'America wakes up at 8 a.m.' REGEXP
'[b-oq-z]\\.[a-ln-z]|[ap]\\.[^m]|[^ap]\\.m|[[:alpha:]][ap]\\.m|[ap]\\.m[[:alpha:]]' dontfindme
It's a shorter and therefore slightly faster version of dognose's answer. Also, it's tailored to MySQL, which has the slightly odd [[:alpha:]] class.

Regex for start with three alpha and four digits

I have written an SQL statement to retrieve data from a MySQL db, and I want to select rows where myId starts with three letters followed by 4 digits, for example: ABC1234K1D2
myId REGEXP '^[A-Z]{3}/d{4}'
but it gives me an empty result (the data is available in the DB). Could someone point me to the correct way?
In most regex variants the answer would be: /d matches a / followed by a d; I think you want \d which matches a digit.
However MySQL has a somewhat limited regex implementation (see documentation).
There is no shortcut to character sets like \d for any digit.
You need to either use a named character set ([[:digit:]]), or just use [0-9].
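To see the effect of /d versus an explicit digit class, here is a minimal sketch using Python's re module. That is an assumption for illustration only; the point carries over because /d is not an escape in either engine. The extra ids besides ABC1234K1D2 are made up. Python's matching is case-sensitive here, whereas MySQL's REGEXP depends on the column's collation.

```python
import re

my_id = "ABC1234K1D2"

# '/d{4}' means a slash followed by four 'd' characters, so this fails.
with_slash_d = re.search(r"^[A-Z]{3}/d{4}", my_id) is not None

# An explicit character class works in both engines.
with_class = re.search(r"^[A-Z]{3}[0-9]{4}", my_id) is not None

print(with_slash_d, with_class)

# Hypothetical ids that should be rejected by the anchored pattern.
rejected = [s for s in ["AB12345", "ABCD123", "1ABC1234"]
            if re.search(r"^[A-Z]{3}[0-9]{4}", s) is None]
print(rejected)
```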
Try this out :
[A-Z]{3}[0-9]{4}
If you want characters to be case insensitive. Try this :
[a-zA-Z]{3}[0-9]{4}
First, in regular regular expressions, to match a digit, you have to use \d instead of /d (which makes you match / followed by d).
Then, I had never noticed, but I think \d (and the others like \w, etc.) don't seem to be available in MySQL. The doc lists the accepted special chars, and those generic classes don't appear. You could use [[:digit:]] instead, even if [0-9] is quite a bit shorter ;)
You are doing fine, just replace /d with \d. Final regex: ^[A-Z]{3}\d{4}
You could use the following pattern :
^[a-zA-Z]{3}\d{4}