I need a regular expression in MySQL that matches:
a sentence that has a word
that begins with e
ends with k
can contain [deki] (doesn't have to include all)
is 3-4 characters long
For example, the following would match:
"My name is eik"
"the edik of doom"
"herp derp eidk derp"
MySQL's extended regular expressions are pretty powerful; see http://dev.mysql.com/doc/refman/5.6/en/regexp.html. You can write:
[[:<:]]e[deki]{1,2}k[[:>:]]
(boundary-at-start-of-word, plus e, plus one or two characters from [deki], plus k, plus boundary-at-end-of-word).
This assumes that your words are separated by punctuation or spaces or whatnot, and it doesn't require that your field contain multiple words.
(Hat-tip to Bohemian for giving an answer that used \b. MySQL doesn't support \b for word-boundaries, but it reminded me of things my original answer missed.)
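To sanity-check the pattern logic outside the database, here is a minimal Python sketch. It assumes Python's `re` module as a stand-in and swaps `[[:<:]]` / `[[:>:]]` for `\b`, which `re` understands; the `has_match` helper name is hypothetical, and MySQL's own engine may differ in edge cases.

```python
import re

# Python stand-in for MySQL's [[:<:]]e[deki]{1,2}k[[:>:]].
# \b replaces the [[:<:]] / [[:>:]] markers (an assumption for illustration).
WORD_PATTERN = re.compile(r"\be[deki]{1,2}k\b")


def has_match(sentence: str) -> bool:
    """Return True if the sentence holds a 3-4 letter word e...k with [deki] in between."""
    return WORD_PATTERN.search(sentence) is not None


if __name__ == "__main__":
    samples = [
        "My name is eik",
        "the edik of doom",
        "herp derp eidk derp",
        "a week ago",  # "week" contains "eek" but not at a word start
        "ek",          # too short: needs at least one char between e and k
    ]
    for s in samples:
        print(repr(s), "->", has_match(s))
```

A single-word value such as "eek" also matches, which mirrors the remark that the field need not contain multiple words.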
Try this:
WHERE column RLIKE '[[:<:]]e[deki]{1,2}k[[:>:]]'
Edited:
MySQL doesn't support \b; instead it uses [[:<:]] and [[:>:]] for start and end of word respectively. Regex updated accordingly.
I'm using regex word boundary \b, and I'm trying to match a word in the following sentence but the result is not what I need. Connector Punctuations (such as underscore) are not being considered as a word boundary
Sentence: ab﹎cd_de_gf|ij|kl|mn|op_
Regexp: \\bkl\\b
However, de is not getting matched.
I tried updating the regexp to use unicode connector punctuation (it's product requirement as we support CJK languages as well) but that isn't working.
Regexp: (?<=\\b|[\p{Pc}])de(?=\\b|[\p{Pc}])
What am I missing here?
Note: (?<=\\b|_)de(?=\\b|_) seems to work for underscores but I need the regex to work for all the connector punctuations.
Thanks in advance !!
Based on the use case you have described you can simplify your regex to:
(?<![[:alnum:]])de(?![[:alnum:]])
instead of trying to match word boundaries, unicode punctuation characters etc.
This will match de if it is not preceded or followed by any alphanumeric character.
To match any connector punctuation characters you need \p{Pc}:
(?<=\\b|\\p{Pc})de(?=\\b|\\p{Pc})
NOTE: \p{Pc} can also be written as [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F] that matches all these 10 chars.
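The difference between the approaches can be checked with a small Python sketch. This is only an approximation under stated assumptions: Python's `re` has neither `[[:alnum:]]` nor `\p{Pc}`, so `[^\W_]` stands in for "alphanumeric" and the explicit 10-character list stands in for `\p{Pc}`. Python also rejects the variable-width lookbehind `(?<=\b|_)`, so the `\b` alternative is moved outside the lookaround. All variable names are hypothetical.

```python
import re

# The sample sentence; \uFE4E is the connector punctuation char after "ab".
SENTENCE = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_"

# Plain word boundaries: "_" counts as a word character, so "_de_" has no boundary.
plain_boundary = re.compile(r"\bde\b")

# Stand-in for (?<![[:alnum:]])de(?![[:alnum:]]).
# [^\W_] means "word character except underscore", i.e. letters and digits.
not_alnum_around = re.compile(r"(?<![^\W_])de(?![^\W_])")

# Stand-in for \p{Pc}: the explicit list of the 10 connector punctuation chars.
PC_CHARS = "_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F"
connector_aware = re.compile(
    rf"(?:\b|(?<=[{PC_CHARS}]))de(?:\b|(?=[{PC_CHARS}]))"
)

if __name__ == "__main__":
    for name, pat in [
        ("plain_boundary", plain_boundary),
        ("not_alnum_around", not_alnum_around),
        ("connector_aware", connector_aware),
    ]:
        print(name, "->", bool(pat.search(SENTENCE)))
```

The negative-lookaround form is the simpler of the two because it never has to enumerate the punctuation at all.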
While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call finds the . ^ $ * + - ? ( ) [ ] { } \ | characters and puts a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require a word boundary only when the search phrase starts or ends with a word character.
More details on adaptive dynamic word boundaries and demo in my YT video.
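The same idea can be tried outside MySQL. Below is a minimal Python sketch, assuming `re.escape` as a substitute for the REGEXP_REPLACE escaping step; the `keyword_pattern` helper is a hypothetical name, and the sample text is the paragraph from the question.

```python
import re

TEXT = (
    "Senior software engineer and C++ developer with Unit Test and "
    "JavaScript experience. I also have .NET experience!"
)


def keyword_pattern(keyword: str) -> re.Pattern:
    # (?!\B\w)  : if the keyword starts with a word char, a word boundary must precede it
    # (?<!\w\B) : if the keyword ends with a word char, a word boundary must follow it
    # re.escape stands in for the REGEXP_REPLACE escaping used in the SQL version.
    return re.compile(r"(?!\B\w)" + re.escape(keyword) + r"(?<!\w\B)")


if __name__ == "__main__":
    for kw in ["C++", ".NET", "JavaScript", "Unit Test", "Java", "C#"]:
        print(kw, "->", bool(keyword_pattern(kw).search(TEXT)))
```

Note that "Java" is rejected because it ends with a word character and is followed by "S", while "C++" and ".NET" are accepted because no boundary is demanded on the side where they start or end with punctuation.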
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.
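One consequence of plain substring search is worth seeing concretely. The Python sketch below imitates LOCATE() with `str.find`; the `locate` helper is hypothetical, and unlike MySQL with a case-insensitive collation, `str.find` is case sensitive.

```python
TEXT = (
    "Senior software engineer and C++ developer with Unit Test and "
    "JavaScript experience. I also have .NET experience!"
)


def locate(substr: str, text: str) -> int:
    """Mimic LOCATE(substr, text): 1-based position, or 0 when absent."""
    return text.find(substr) + 1


if __name__ == "__main__":
    for kw in ["C++", ".NET", "C#", "Java"]:
        print(kw, "->", locate(kw, TEXT) > 0)
```

Keywords with special characters are found without any escaping, but "Java" is also reported as present because it is a substring of "JavaScript", so this approach gives up whole-word matching.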
Let's say I'm looking for a specific number in string with newlines:
1\n2\n4\n5\n7\n8\n9\n12\n13
Lookahead and lookbehind works perfectly with something like:
(?<![0-9])12(?![0-9])
See Regex 101 demo.
However, I need a workaround for the MySQL REGEXP, as it does not support patterns with lookarounds.
If (?<![0-9])12(?![0-9]) works for you in an online regex engine, the following POSIX-style pattern should work with MySQL REGEXP:
(^|[^0-9])12($|[^0-9])
Or, you may use word boundaries:
[[:<:]]12[[:>:]]
See the regex demo.
Details:
(^|[^0-9]) - start of string or a non-digit char
[[:<:]] - leading word boundary
12 - a specific numeric value, 12 here
[[:>:]] - trailing word boundary
($|[^0-9]) - end of string or a non-digit.
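To compare the lookaround pattern with its lookaround-free rewrite, here is a small Python sketch. Python's `re` is only a stand-in for illustration (it does not support `[[:<:]]` / `[[:>:]]`, so that variant is left out), and the variable names are hypothetical.

```python
import re

DATA = "1\n2\n4\n5\n7\n8\n9\n12\n13"

# Original pattern: zero-width checks, the match is just "12".
lookaround = re.compile(r"(?<![0-9])12(?![0-9])")

# Rewrite without lookarounds: the neighbouring non-digits are consumed.
consuming = re.compile(r"(^|[^0-9])12($|[^0-9])")

if __name__ == "__main__":
    print(repr(lookaround.search(DATA).group(0)))
    print(repr(consuming.search(DATA).group(0)))
    # Neither pattern accepts 12 embedded in a longer number.
    print(lookaround.search("121\n312"), consuming.search("121\n312"))
```

Both patterns agree on whether the value is present, which is all a REGEXP filter needs. They differ when counting or extracting: the rewrite swallows the separator, so in "12\n12" it finds one match where the lookaround version finds two.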
For MySQL 8.0,
"\\b12\\b"
or
"\\s12\\s"
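The former checks for word boundaries and is usually the better choice. The latter requires a whitespace character, such as \n, on both sides, so it will not match 12 at the very start or end of the string.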
The doubled backslashes may be overkill, depending on the client.
I require a Regular Expression to match single words of [a-zA-Z] only with 6 or more characters.
The list of words:
a
27
it was good because
it
cat went to
four
eight is bigger
peter
pepper
objects
83
harrison99
adjectives
atom-bomb
The following would be matched
pepper
objects
adjectives
It isn't too bad if the following are matched too as I can strip out the duplicates after
it was good because
eight is bigger
atom-bomb
So far I have this
/[a-zA-z]{6,}/g
This matches words of 6 or more characters according to RegEx 101, but it does not give the same results when I use it with REGEXP in my MySQL database to filter a table.
Currently it includes words containing punctuation and other non a-zA-Z characters. I don't want it to include results with numbers or punctuation.
In an ideal world I don't want it to include a row that contains more than one word as usually the other words are already duplicates.
Actual MySQL results
If a row contains a number or punctuation and/or contains more than one word, I don't want it included.
I want only results that are single, whole words of 6+ characters
Drop the delimiters and add anchors for start/end:
SELECT 'abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 1
SELECT 'a abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 0
Also I changed [a-zA-z] in the character class to [a-zA-Z].
Instead of [a-zA-Z] you can also use [[:alpha:]] POSIX style bracket extension. So it becomes:
^[[:alpha:]]{6,}$
(As a side note: before version 8.0.4, MySQL used Henry Spencer's implementation of regular expressions. If word boundaries are needed there, use [[:<:]]word[[:>:]]. See the manual for further syntax differences.)
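The effect of the anchors and of the [a-zA-z] typo can be reproduced with a short Python sketch. Python's `re` is used purely as a stand-in here, the names are hypothetical, and the word list is the one from the question (with "adjectives" spelled as in the expected output).

```python
import re

WORDS = [
    "a", "27", "it was good because", "it", "cat went to", "four",
    "eight is bigger", "peter", "pepper", "objects", "83",
    "harrison99", "adjectives", "atom-bomb",
]

# Anchored: the whole value must be 6 or more ASCII letters.
SINGLE_WORD = re.compile(r"^[a-zA-Z]{6,}$")

# The typo'd class: the range A-z also covers [ \ ] ^ _ and the backtick.
TYPO_CLASS = re.compile(r"^[a-zA-z]{6,}$")

matched = [w for w in WORDS if SINGLE_WORD.search(w)]

if __name__ == "__main__":
    print(matched)
    print(bool(TYPO_CLASS.search("atom_bomb")), bool(SINGLE_WORD.search("atom_bomb")))
```

Without the ^ and $ anchors the pattern only needs to find six letters somewhere in the value, which is why rows with digits, punctuation, or several words were getting through.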
I have written an SQL statement to retrieve data from a MySQL database, and I want to select rows where myId starts with three letters followed by four digits, for example: ABC1234K1D2
myId REGEXP '^[A-Z]{3}/d{4}'
but it gives me an empty result (matching data is available in the DB). Could someone point me to the correct way?
In most regex variants the answer would be: /d matches a / followed by a d; I think you want \d which matches a digit.
However MySQL has a somewhat limited regex implementation (see documentation).
Before MySQL 8.0.4, there is no shorthand like \d for any digit.
You need to either use a named character set ([[:digit:]]), or just use [0-9].
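A quick Python sketch shows why the original pattern returns nothing and why the [0-9] form works. Python's `re` is only a stand-in for checking the pattern logic, and the variable names are hypothetical.

```python
import re

MY_ID = "ABC1234K1D2"

# The pattern from the question: "/d{4}" means a slash followed by four "d" chars.
slash_d = re.compile(r"^[A-Z]{3}/d{4}")

# Portable fix: an explicit digit range instead of a shorthand class.
digits = re.compile(r"^[A-Z]{3}[0-9]{4}")

if __name__ == "__main__":
    print(slash_d.search(MY_ID))           # no match for the real id
    print(slash_d.search("ABC/dddd"))      # what the pattern actually matches
    print(digits.search(MY_ID).group(0))   # the three letters and four digits
```

Because the pattern is only anchored at the start, the trailing K1D2 in the id is simply ignored.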
Try this out :
[A-Z]{3}[0-9]{4}
If you want characters to be case insensitive. Try this :
[a-zA-Z]{3}[0-9]{4}
First, in ordinary regular expressions, to match a digit you have to use \d instead of /d (which matches a / followed by a d).
Then, I had never noticed, but \d (and the others like \w, etc.) don't seem to be available in MySQL. The doc lists the accepted special chars, and those generic classes don't appear. You could use [[:digit:]] instead, even if [0-9] is quite a bit shorter ;)
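You are doing fine; just replace /d with \d. Final regex: ^[A-Z]{3}\d{4} (This needs MySQL 8.0.4 or later, and the backslash must be doubled inside the SQL string literal: '\\d'. On older versions use [0-9].)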
You could use the following pattern :
^[a-zA-Z]{3}\d{4}