MySQL 8.0.30 Regular Expression Word Matching with Special Characters - mysql

While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
    sk.ID
FROM
    sit_keyword sk
WHERE
    var_text REGEXP CONCAT('\\b', sk.keyword, '\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.

You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\\\$1') matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require a word boundary only when the search phrase starts/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
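To see why the adaptive boundaries behave differently from plain \b, here is a minimal sketch in Python rather than MySQL. It assumes Python's re module treats (?!\B\w) and (?<!\w\B) the same way for this input, and it uses re.escape as a stand-in for the REGEXP_REPLACE escaping step. The keyword_pattern helper and the keyword list are made up for illustration.

```python
import re

def keyword_pattern(keyword: str) -> re.Pattern:
    """Build a regex that matches `keyword` literally, as a whole word."""
    # re.escape stands in for the REGEXP_REPLACE escaping step.
    escaped = re.escape(keyword)
    # A boundary is only required on a side where the keyword has a word char.
    return re.compile(r"(?!\B\w)" + escaped + r"(?<!\w\B)")

text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")
keywords = ["C++", ".NET", "Java", "JavaScript", "C#", "Unit Test"]

# Adaptive boundaries: keywords with special characters are found.
matched = [k for k in keywords if keyword_pattern(k).search(text)]

# Plain \b on both sides: C++ and .NET are missed.
plain_b = [k for k in keywords
           if re.search(r"\b" + re.escape(k) + r"\b", text)]

print(matched)  # ['C++', '.NET', 'JavaScript', 'Unit Test']
print(plain_b)  # ['JavaScript', 'Unit Test']
```

Note that "Java" is not reported in either list, because the only occurrence is inside "JavaScript" and the keyword ends with a word char.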

Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.
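The trade-off of a plain substring search can be sketched in Python, using the in operator as a stand-in for LOCATE(...) > 0. This assumes a case-sensitive comparison, which in MySQL depends on the collation. The keyword list is invented for the example.

```python
text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")
keywords = ["C++", ".NET", "Java", "JavaScript", "C#"]

# Literal substring test: no metacharacters to escape.
found = [k for k in keywords if k in text]

print(found)  # ['C++', '.NET', 'Java', 'JavaScript']
```

Special characters are no problem here, but "Java" is also reported because it is a substring of "JavaScript". So this only fits when whole-word matching is not required.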

Related

Equivalent MySQL regex for the following Python regex

python pattern => ^(?=.\bABDUL\b)(?=.\bHAI\b.)(?=.\bMANSOOR\b).*$
I need the equivalent MySQL pattern.
Can you please help me out?
The regex in question is quite a strange way to match simple words. It is not clear what the expected input is. Maybe the input justifies this approach.
^(?=.\bABDUL\b)(?=.\bHAI\b.)(?=.\bMANSOOR\b).*$
Which means: At the beginning there must be any character which is not a part of a word, then ABDUL, a non word character, HAI, a non word character, MANSOOR, a non word character or the end of the string.
^[^[:alnum:]]ABDUL[^[:alnum:]]HAI[^[:alnum:]]MANSOOR([^[:alnum:]]?.*)?$
Which is: At the beginning, not a number or alphabet character (alphanumerical), ABDUL, one non-alphanumerical, HAI, one non-alphanumerical, MANSOOR one non-alphanumerical or the end of the string.
I did not test it and did not intend to make it 100% the same as the first one, but it should be close enough.
For anyone who would like to copy it to their code:
Matching the first character is not very common and can be a bug in the original regexp.
(?=...) is a "lookahead assertion", which does not consume any characters. The POSIX version does not have it, but for simple string searching that may not matter.
Both versions should match strings like !ABDUL$HAI)MANSOOR - make sure that this is what you want.
For someone who would like to understand the regular expressions I used
https://dev.mysql.com/doc/refman/8.0/en/regexp.html for MySQL (POSIX syntax before 8.0.4, ICU syntax from 8.0.4 on) and https://docs.python.org/3/library/re.html for Python (Perl-style syntax)

Parsing wide spanning HTML table with regex [duplicate]

What is this?
This is a collection of common Q&A. This is also a Community Wiki, so everyone is invited to participate in maintaining it.
Why is this?
regex is suffering from "give me ze code" type questions and poor answers with no explanation. This reference is meant to provide links to quality Q&A.
What's the scope?
This reference is meant for the following languages: php, perl, javascript, python, ruby, java, .net.
This might be too broad, but these languages share the same syntax. For specific features there's the tag of the language behind it, example:
What are regular expression Balancing Groups? .net
The Stack Overflow Regular Expressions FAQ
See also a lot of general hints and useful links at the regex tag details page.
Online tutorials
RegexOne ↪
Regular Expressions Info ↪
Quantifiers
Zero-or-more: *:greedy, *?:reluctant, *+:possessive
One-or-more: +:greedy, +?:reluctant, ++:possessive
?:optional (zero-or-one)
Min/max ranges (all inclusive): {n,m}:between n & m, {n,}:n-or-more, {n}:exactly n
Differences between greedy, reluctant (a.k.a. "lazy", "ungreedy") and possessive quantifier:
Greedy vs. Reluctant vs. Possessive Quantifiers
In-depth discussion on the differences between greedy versus non-greedy
What's the difference between {n} and {n}?
Can someone explain Possessive Quantifiers to me? php, perl, java, ruby
Emulating possessive quantifiers .net
Non-Stack Overflow references: From Oracle, regular-expressions.info
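A small illustration of greedy versus reluctant quantifiers and of {n,m} ranges, sketched in Python (one of the flavors this reference covers). The sample strings are invented.

```python
import re

html = "<a><b>"

# Greedy: .+ grabs as much as possible, then backtracks to the last '>'.
greedy = re.findall(r"<.+>", html)

# Reluctant: .+? grabs as little as possible, stopping at the first '>'.
reluctant = re.findall(r"<.+?>", html)

# Range quantifier: between 2 and 3 digits, inclusive.
ranged = re.findall(r"\d{2,3}", "1 12 123 1234")

print(greedy)     # ['<a><b>']
print(reluctant)  # ['<a>', '<b>']
print(ranged)     # ['12', '123', '123']
```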
Character Classes
What is the difference between square brackets and parentheses?
[...]: any one character, [^...]: negated/any character but
[^] matches any one character including newlines javascript
[\w-[\d]] / [a-z-[qz]]: set subtraction .net, xml-schema, xpath, JGSoft
[\w&&[^\d]]: set intersection java, ruby 1.9+
[[:alpha:]]:POSIX character classes
[[:<:]] and [[:>:]] Word boundaries
Why do [^\\D2], [^[^0-9]2], [^2[^0-9]] get different results in Java? java
Shorthand:
Digit: \d:digit, \D:non-digit
Word character (Letter, digit, underscore): \w:word character, \W:non-word character
Whitespace: \s:whitespace, \S:non-whitespace
Unicode categories (\p{L}, \P{L}, etc.)
Escape Sequences
Horizontal whitespace: \h:space-or-tab, \t:tab
Newlines:
\r, \n:carriage return and line feed
\R:generic newline php java-8
Negated whitespace sequences: \H:Non horizontal whitespace character, \V:Non vertical whitespace character, \N:Non line feed character pcre php5 java-8
Other: \v:vertical tab, \e:the escape character
Anchors
anchor | matches | flavors
^ | Start of string | Common*
^ | Start of line | Common (m)
$ | End of line | Common (m)
$ | End of text | Common* except javascript
$ | Very end of string | javascript*, php (D)
\A | Start of string | Common except javascript
\Z | End of text | Common except javascript python
\Z | Very end of string | python
\z | Very end of string | Common except javascript python
\b | Word boundary | Common
\B | Not a word boundary | Common
\G | End of previous match | Common except javascript, python

Term | Definition
Start of string | At the very start of the string.
Start of line | At the very start of the string, and after a non-terminal line terminator.
Very end of string | At the very end of the string.
End of text | At the very end of the string, and at a terminal line terminator.
End of line | At the very end of the string, and at a line terminator.
Word boundary | At a word character not preceded by a word character, and at a non-word character not preceded by a non-word character.
End of previous match | At a previously set position, usually where a previous match ended. At the very start of the string if no position was set.

"Common" refers to the following: icu java javascript .net objective-c pcre perl php python swift ruby
* Default
(m) Multi-line mode.
(D) Dollar end only mode.
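The anchor definitions above can be checked for one flavor with a short Python sketch. It shows Python's behavior only, and the sample text is invented.

```python
import re

text = "one\ntwo\n"  # note the terminal line terminator

# ^ by default: start of string only. With re.M: start of every line.
start_default = re.findall(r"^\w+", text)
start_multiline = re.findall(r"^\w+", text, re.M)

# $ by default is "end of text": it also matches before a terminal newline.
end_of_text = re.search(r"two$", text) is not None

# \Z in Python is "very end of string": the terminal newline blocks it.
very_end = re.search(r"two\Z", text) is not None

# $ with re.M: end of every line.
end_multiline = re.findall(r"\w+$", text, re.M)

print(start_default)    # ['one']
print(start_multiline)  # ['one', 'two']
print(end_of_text)      # True
print(very_end)         # False
print(end_multiline)    # ['one', 'two']
```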
Groups
(...):capture group, (?:):non-capture group
Why is my repeating capturing group only capturing the last match?
\1:backreference and capture-group reference, $1:capture group reference
What's the meaning of a number after a backslash in a regular expression?
\g<1>123:How to follow a numbered capture group, such as \1, with a number?: python
What does a subpattern (?i:regex) mean?
What does the 'P' in (?P<group_name>regexp) mean?
(?>):atomic group or independent group, (?|):branch reset
Equivalent of branch reset in .NET/C# .net
Named capture groups:
General named capturing group reference at regular-expressions.info
java: (?<groupname>regex): Overview and naming rules (Non-Stack Overflow links)
Other languages: (?P<groupname>regex) python, (?<groupname>regex) .net, (?<groupname>regex) perl, (?P<groupname>regex) and (?<groupname>regex) php
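A few of the group features listed above, sketched in Python. The syntax shown, (?P<name>...) and \g<1>, is Python's; other flavors spell these differently, as noted in the list. The sample strings are invented.

```python
import re

# Named capture groups, Python style.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
year = m.group("year")

# Backreference inside the pattern: find a doubled word.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test")

# Group references in the replacement string.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# \g<1> lets a literal digit follow a numbered group reference.
followed = re.sub(r"(\d+)-(\d+)", r"\g<1>0", "10-20")

# A repeated capturing group keeps only its last iteration.
last_only = re.match(r"(?:(\d),?)+", "1,2,3").group(1)

print(year)       # 2024
print(doubled)    # ['is']
print(swapped)    # 20-10
print(followed)   # 100
print(last_only)  # 3
```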
Lookarounds
Lookaheads: (?=...):positive, (?!...):negative
Lookbehinds: (?<=...):positive, (?<!...):negative
Lookbehind limits in:
Lookbehinds need to be constant-length php, perl, python, ruby
Lookarounds of limited length {0,n} java
Variable length lookbehinds are allowed .net
Lookbehind alternatives:
Using \K php, perl (Flavors that support \K)
Alternative regex module for Python python
The hacky way
JavaScript negative lookbehind equivalents External link
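A short Python sketch of the lookarounds above, including the fixed-length lookbehind restriction that applies to Python's built-in re module. The sample text is invented.

```python
import re

text = "cost $30 and 40"

# Positive lookbehind: digits directly preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digits not preceded by '$' or by another digit.
not_after_dollar = re.findall(r"(?<![$\d])\d+", text)

# Positive lookahead: digits followed by " and" (which is not consumed).
before_and = re.findall(r"\d+(?= and)", text)

# Variable-length lookbehind is rejected by Python's re module.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(after_dollar)      # ['30']
print(not_after_dollar)  # ['40']
print(before_and)        # ['30']
print(variable_ok)       # False
```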
Modifiers
flag | modifier | flavors
a | ASCII | python
c | current position | perl
e | expression | php perl
g | global | most
i | case-insensitive | most
m | multiline | php perl python javascript .net java
m | (non)multiline | ruby
o | once | perl ruby
r | non-destructive | perl
S | study | php
s | single line | ruby
U | ungreedy | php r
u | unicode | most
x | whitespace-extended | most
y | sticky ↪ | javascript
How to convert preg_replace e to preg_replace_callback?
What are inline modifiers?
What is '?-mix' in a Ruby Regular Expression
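How a few of these modifiers look in practice, sketched in Python, where they are passed as flags (re.I, re.S, re.X) or written inline. Other flavors attach them differently, for example after the closing delimiter. The sample strings are invented.

```python
import re

# i: case-insensitive, as a flag and as an inline modifier.
flag_i = re.findall(r"abc", "ABC abc", re.I)
inline_i = re.findall(r"(?i)abc", "ABC abc")

# Scoped inline modifier: only the 'b' is case-insensitive.
scoped_i = re.findall(r"a(?i:b)c", "aBc abc ABC")

# s: "single line" / dot-all, so '.' also matches a newline.
without_s = re.findall(r"a.b", "a\nb")
with_s = re.findall(r"a.b", "a\nb", re.S)

# x: whitespace-extended, which allows layout and comments in the pattern.
phone = re.compile(r"""
    \d{3}   # prefix
    -       # separator
    \d{4}   # line number
""", re.X)
numbers = phone.findall("call 555-1234")

print(flag_i, inline_i)   # ['ABC', 'abc'] ['ABC', 'abc']
print(scoped_i)           # ['aBc', 'abc']
print(without_s, with_s)  # [] ['a\nb']
print(numbers)            # ['555-1234']
```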
Other:
|:alternation (OR) operator, .:any character, [.]:literal dot character
What special characters must be escaped?
Control verbs (php and perl): (*PRUNE), (*SKIP), (*FAIL) and (*F)
php only: (*BSR_ANYCRLF)
Recursion (php and perl): (?R), (?0) and (?1), (?-1), (?&groupname)
Common Tasks
Get a string between two curly braces: {...}
Match (or replace) a pattern except in situations s1, s2, s3...
How do I find all YouTube video ids in a string using a regex?
Validation:
Internet: email addresses, URLs (host/port: regex and non-regex alternatives), passwords
Numeric: a number, min-max ranges (such as 1-31), phone numbers, date
Parsing HTML with regex: See "General Information > When not to use Regex"
Advanced Regex-Fu
Strings and numbers:
Regular expression to match a line that doesn't contain a word
How does this PCRE pattern detect palindromes?
Match strings whose length is a fourth power
How does this regex find triangular numbers?
How to determine if a number is a prime with regex?
How to match the middle character in a string with regex?
Other:
How can we match a^n b^n?
Match nested brackets
Using a recursive pattern php, perl
Using balancing groups .net
“Vertical” regex matching in an ASCII “image”
List of highly up-voted regex questions on Code Golf
How to make two quantifiers repeat the same number of times?
An impossible-to-match regular expression: (?!a)a
Match/delete/replace this except in contexts A, B and C
Match nested brackets with regex without using recursion or balancing groups?
Flavor-Specific Information
(Except for those marked with *, this section contains non-Stack Overflow links.)
Java
Official documentation: Pattern Javadoc ↪, Oracle's regular expressions tutorial ↪
The differences between functions in java.util.regex.Matcher:
matches(): The match must be anchored to both input-start and -end
find(): A match may be anywhere in the input string (substrings)
lookingAt(): The match must be anchored to input-start only
(For anchors in general, see the section "Anchors")
The only java.lang.String functions that accept regular expressions: matches(s), replaceAll(s,s), replaceFirst(s,s), split(s), split(s,i)
*An (opinionated and) detailed discussion of the disadvantages of and missing features in java.util.regex
.NET
How to read a .NET regex with look-ahead, look-behind, capturing groups and back-references mixed together?
Official documentation:
Boost regex engine: General syntax, Perl syntax (used by TextPad, Sublime Text, UltraEdit, ...???)
JavaScript general info and RegExp object
.NET MySQL Oracle Perl5 version 18.2
PHP: pattern syntax, preg_match
Python: Regular expression operations, search vs match, how-to
Rust: crate regex, struct regex::Regex
Splunk: regex terminology and syntax and regex command
Tcl: regex syntax, manpage, regexp command
Visual Studio Find and Replace
General information
(Links marked with * are non-Stack Overflow links.)
Other general documentation resources: Learning Regular Expressions, *Regular-expressions.info, *Wikipedia entry, *RexEgg, Open-Directory Project
DFA versus NFA
Generating Strings matching regex
Books: Jeffrey Friedl's Mastering Regular Expressions
When to not use regular expressions:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (blog post written by Stack Overflow's founder)*
Do not use regex to parse HTML:
Don't. Please, just don't
Well, maybe...if you're really determined (other answers in this question are also good)
Examples of regex that can cause regex engine to fail
Why does this regular expression kill the Java regex engine?
Tools: Testers and Explainers
(This section contains non-Stack Overflow links.)
Online (* includes replacement tester, + includes split tester):
Debuggex (Also has a repository of useful regexes) javascript, python, pcre
*Regular Expressions 101 php, pcre, python, javascript, java
Regex Pal, regular-expressions.info javascript
Rubular ruby RegExr Regex Hero dotnet
*+ regexstorm.net .net
*RegexPlanet: Java java, Go go, Haskell haskell, JavaScript javascript, .NET dotnet, Perl perl php PCRE php, Python python, Ruby ruby, XRegExp xregexp
freeformatter.com xregexp
*+regex.larsolavtorvik.com php PCRE and POSIX, javascript
Offline:
Microsoft Windows: RegexBuddy (analysis), RegexMagic (creation), Expresso (analysis, creation, free)
MySQL 8.0: Various syntax changes were made. Note especially the doubling of backslashes in some contexts. (This Answer needs further editing to reflect the differences.)

Regex to SQL: repetition-operator operand invalid

I'm trying to use a regex to detect URLs in all the rows of my table, here's the regex
\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))
However, I invariably get the "repetition-operator operand invalid" error, which, after hours of search on the internet, still remains obscure.
Where have I gone wrong? What can I do to fix this? And alternatively, is there a better way to detect URLs in messages in SQL other than a regex?
Thank you.
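The MySQL regex syntax here is POSIX-based, so the (?:...) non-capturing group is not supported: the ? right after ( has nothing to repeat, which is what raises "repetition-operator operand invalid". Use a plain (...) group instead. I also replace the optional \/? with \/*, which matches 0 or more slashes. Finally, \b in MySQL regex should be replaced with [[:<:]] (since this matches at the beginning of a word).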
Thus, I suggest using
[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))
I am expanding \w to [a-zA-Z0-9_] as it is exactly what \w is. Instead of \s, I am using a literal space. Instead of \d, I am using [0-9]. This is done for readability and better compatibility. If \w, \d and \s work for you, you can use them, but I do not see them among the supported entities in POSIX specs.
Also, instead of literal space, you could use [:space:], it matches space, tab, newline, and carriage return. Instead of [a-zA-Z] you can use [:alpha:], and instead of [0-9], you can use [:digit:]. Please also check this:
[[:<:]](([[:alpha:][:digit:]-]+:\/\/*|www[.])[^[:space:]()<>]+(\([[:alpha:][:digit:]_]+\)|([^[:space:][:punct:]]|\/)))
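The same expansion idea can be tried outside MySQL. Below is a rough Python sketch of the first suggested pattern, with two assumptions: \b is used in place of [[:<:]], and string.punctuation (ASCII only) stands in for [:punct:], since Python's re module has no POSIX classes. The sample messages are invented.

```python
import re
import string

# ASCII punctuation, escaped so it is safe inside a character class.
PUNCT = re.escape(string.punctuation)

URL = re.compile(
    r"\b(([a-zA-Z0-9-]+://*|www[.])[^ ()<>]+"
    r"(\([a-zA-Z0-9_]+\)|([^ " + PUNCT + r"]|/)))"
)

messages = [
    "see http://example.com/page for info",
    "visit www.example.org.",
    "no links here",
]

# group(0) is the whole match; the inner groups are only for grouping.
urls = [m.group(0) for msg in messages for m in URL.finditer(msg)]

print(urls)  # ['http://example.com/page', 'www.example.org']
```

The trailing full stop after www.example.org is left out because the last character of a match must be a slash or a non-punctuation character.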

MySQL matching this regex while it shouldn't

I'm trying to recognize quoting (citing) somebody else's sentence in a markdown text, which I have in my local copy of the MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
it matches some unwanted data, which according to https://regex101.com/, it should not with this particular regular expression.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL Workbench didn't show those carriage return and newline symbols until I copy-pasted the text here.
Can I normalize (remove \r and \n) with some update query?
Is the MySQL regex implementation different from POSIX standard regex?
Do you by any chance have a maximally clean solution for recognizing quoting in a markdown text?
Thanks!
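You've got an awful lot of parens in there. Try this, which is functionally close to what you have above (note that a POSIX class such as [:blank:] only works inside a bracket expression, hence the double brackets):
select * from github_discussions where body rlike '.*[[:blank:]]+>[[:blank:]]+.+'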
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[[:blank:]]*>[[:blank:]]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)
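For bodies with embedded newlines, the anchored idea can be sketched in Python, where re.MULTILINE makes ^ match after every \n. This is not MySQL: [ \t] stands in for [[:blank:]], has_quote is a made-up helper, and the email sample is shortened from the one in the question.

```python
import re

# '>' at the start of any line, optionally indented, followed by blanks.
QUOTE = re.compile(r"^[ \t]*>[ \t]+", re.MULTILINE)

def has_quote(body: str) -> bool:
    """Return True if any line of `body` looks like a markdown quote."""
    return QUOTE.search(body) is not None

arrow = "`Params` is plural -> contain<s>s</s>"
inline = "this is before > and after"
email = ("Yes, I believe so.\r\n\r\nK\r\n\r\n"
         "> On 19-Jul-2014, at 17:33, someone wrote:\r\n> \r\n"
         "> This is the standard 3-clause BSD license, right?")

print(has_quote(arrow))   # False
print(has_quote(inline))  # False
print(has_quote(email))   # True
```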

In OpenGrok how do you do a full search for special non-alphanumeric characters

I am trying to search my codebase for code that calls a function named "foo" so I am searching for "foo(" but the results I'm getting includes everything with the word foo in it which includes css, comments and strings that don't even have the trailing open parenthesis.
Anyone know how to do a search for strings that include special characters like ),"'?
When searching for special characters, try using escape character before the character, i.e. \, e.g. "foo\(".
Additionally, I found a reply for a similar question (see http://marc.info/?l=opensolaris-opengrok-discuss&m=115776447032671). It seems that frequently occurring special characters are not indexed because of performance issues, therefore it might not be possible to effectively search for such pattern.
OpenGrok supports escaping special characters that are part of the query syntax. Current special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these characters, use a \ before the character. For example, to search for (1+1):2 use the query \(1\+1\)\:2
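If queries are built programmatically, the escaping rule can be sketched as a small Python helper. The escape_query function is hypothetical and not part of OpenGrok. It escapes every character from the list above individually, so & and | are escaped even when they appear alone, which is a simplification.

```python
# Characters taken from the special-character list above.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape_query(term: str) -> str:
    """Put a backslash before every query-syntax special character."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in term)

print(escape_query("(1+1):2"))  # \(1\+1\)\:2
print(escape_query("foo("))     # foo\(
```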