MySQL REGEXP failing to limit # of occurrences (?!)

I have a table with a lot of individual words in it (Column name 'qWord') with contents including 'Utility', 'Utter', 'Unicorn' and 'Utile'
I'm trying to do a SELECT to find qWord strings which have at most one instance of the letter 't'.
Using REGEXP I thought it would be a trivial statement like:
SELECT *
FROM entries.qentries
WHERE (qWord REGEXP 'T{0,1}')
but I'm still getting 'Utter' and 'Utility' in the output -- along with 'Utile' and 'Unicorn'.
So what am I missing here?
(FWIW: MySQL 8.0.11, Community edition running on a Windows 8.1 machine)
Here's the full REGEXP and my apologies for not posting it initially. I'm looking for words composed only of specific letters and that part works fine.
But I also want words with a limited number of a given letter, say t:
SELECT * FROM entries.entries WHERE
(qWord NOT REGEXP 'C|F|G|I|J|K|P|Q|S|V|W|X|Y|Z|-')
AND (qWord REGEXP 'A|B|D|E|H|L|M|N|O|R|T|U')
AND (qWord REGEXP 't{0,1}') ;
I've also tried (qWord REGEXP 't{0}|t{1}') as well as (qWord REGEXP '(?<=[^t]|^)(t{0}|t{1})(?:[^t]|$)' )
without success, so I remain stuck

You can use the following regex:
SELECT *
FROM entries.qentries
WHERE (qWord REGEXP '^[^tT]*[tT]?[^tT]*$')
Explanations:
^, $ starting and ending anchors (this is needed to avoid word partial match)
[^tT]* any character that is not a t or a T 0 or more times
[tT]? at most one occurrence of t or T (? is equivalent to {0,1})
[^tT]* any character that is not a t or a T 0 or more times
Additional Notes:
[^tT] this negated character class accepts anything that is not a t or a T (spaces, ., \n and other characters are accepted too). To accept only letters other than t and T, use [a-su-zA-SU-Z] instead. To allow further characters, add them at the end of the class: [a-su-zA-SU-Z -] also accepts spaces and -.
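To see why the unanchored pattern from the question lets every word through, here is a minimal sketch. It uses Python's re module as a stand-in for MySQL's regex engine (an assumption; the two agree on these simple constructs), and the helper name at_most_one_t is hypothetical.

```python
import re

words = ["Utility", "Utter", "Unicorn", "Utile"]

# Unanchored: t{0,1} may match zero characters, so a match is found
# somewhere in every string, whatever it contains.
unanchored = [w for w in words if re.search(r"t{0,1}", w, re.IGNORECASE)]

# Anchored: the whole word must consist of non-t characters
# wrapped around at most one t.
AT_MOST_ONE_T = re.compile(r"^[^tT]*[tT]?[^tT]*$")

def at_most_one_t(word):
    """Hypothetical helper: True if word contains zero or one t/T."""
    return AT_MOST_ONE_T.search(word) is not None

anchored = [w for w in words if at_most_one_t(w)]

print(unanchored)  # all four words
print(anchored)    # only the words with at most one t
```

The anchors are what turn "a match exists somewhere" into "the entire value has this shape".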

Related

Regex pattern equivalent of %word% in mysql

I need 2 case-insensitive regex patterns. One of them should be the equivalent of SQL's % wildcard, as in %word%. My attempt at this was '^[a-zA-Z]*word[a-zA-Z]*$'.
Question 1: This seems to work, but I am wondering if this is the equivalent of %word%.
Finally, the second pattern is similar to %word%, but must not match when there are 3 or more characters either before or after the word. For example, if the target word is word:
words = matched because it doesn't have 3 or more characters either before or after it.
swordfish = not matched because it has 3 or more characters after word
theword = not matched because it has 3 or more characters before it
mywordly = matched because it doesn't contain 3 or more characters before or after word.
miswordeds = not matched because it has 3 characters before it. (It also has 3 characters after it, but it had already met the criteria.)
Question 2: For the second regex, I am not very sure how to start this. I will be using the regex in a MySQL query using the REGEXP function for example:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]*word[a-zA-Z]*$'
First Question:
According to https://dev.mysql.com/doc/refman/8.0/en/string-comparison-functions.html#operator_like
With LIKE you can use the following two wildcard characters in the pattern:
% matches any number of characters, even zero characters.
_ matches exactly one character.
This means the regex '^[a-zA-Z]*word[a-zA-Z]*$' is not equivalent to %word%: % matches any characters, while [a-zA-Z] matches only letters.
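The difference can be checked with a small sketch in Python, used here only as a stand-in for MySQL. The helper like_contains is hypothetical and emulates LIKE '%word%' under a case-insensitive collation, which is an assumption about the column.

```python
import re

# The letters-only pattern from the question, made case insensitive.
LETTERS_ONLY = re.compile(r"^[a-zA-Z]*word[a-zA-Z]*$", re.IGNORECASE)

def like_contains(value, needle):
    """Hypothetical emulation of: value LIKE '%needle%' (case insensitive)."""
    return needle.lower() in value.lower()

samples = ["SWORDFISH", "pass-word 123", "the word."]
for s in samples:
    # Second column: LIKE-style result; third column: letters-only regex.
    print(s, like_contains(s, "word"), LETTERS_ONLY.search(s) is not None)
```

The two only agree on values made purely of letters. As soon as a digit, space, or punctuation mark appears, the regex rejects a value that LIKE accepts.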
Second Question:
Change * to {0,2} to indicate you want to match at maximum 2 characters either before or after it:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]{0,2}word[a-zA-Z]{0,2}$'
And to make case insensitive:
SELECT 1 WHERE LOWER('SWORDFISH') REGEXP '^[a-z]{0,2}word[a-z]{0,2}$'
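As a quick check of the bounded pattern against the examples in the question, here is a sketch in Python (a stand-in for MySQL; the function name matches_bounded is hypothetical). Lowercasing the input plays the role of LOWER() in the query above.

```python
import re

# At most 2 letters before and at most 2 letters after "word".
BOUNDED = re.compile(r"^[a-z]{0,2}word[a-z]{0,2}$")

def matches_bounded(value):
    """Hypothetical helper mirroring LOWER(value) REGEXP '...'."""
    return BOUNDED.search(value.lower()) is not None

examples = ["words", "swordfish", "theword", "mywordly", "miswordeds"]
results = {w: matches_bounded(w) for w in examples}

for word, ok in results.items():
    print(word, ok)
```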
Assuming
The test string (or column) has only letters. (Hence, I can use . instead of [a-z]).
Case folding and accents are not an issue (presumably handled by a suitable COLLATION).
Either way:
WHERE x LIKE '%word%' -- found the word
AND x NOT LIKE '%___word%' -- but fail if too many leading chars
AND x NOT LIKE '%word___%' -- or trailing
WHERE x RLIKE '^.{0,2}word.{0,2}$'
I vote for RLIKE being slightly faster than LIKE -- only because there are fewer ANDs.
(MySQL 8.0 introduced incompatible regexp syntax; I think the syntax above works in all versions of MySQL/MariaDB. Note that I avoided word boundaries and character class shortcuts like \\w.)

SQL Regex last character search not working

I'm using a regex for a specific search, but the last separator is being ignored.
It should search for |49213[A-Z]| but it searches for |49213[A-Z]
SELECT * FROM table WHERE (data REGEXP '/\|49213[A-Z]+\|/')
Why are you using / in the pattern? Why the +?
SELECT * FROM table WHERE (data REGEXP '\|49213[A-Z]\|')
If you want multiple:
SELECT * FROM table WHERE (data REGEXP '\|49213[A-Z]+\|')
or:
SELECT * FROM table WHERE (data REGEXP '[|]49213[A-Z][|]')
Aha. That is rather subtle.
\ escapes certain characters that have special meaning.
But it does not seem to do so for | ("or") or . ("any byte"), etc.
So, \| is the same as |.
But the regexp parser does not like having either side of "or" being empty. (I suspect this is a "bug"). Hence the error message.
https://dev.mysql.com/doc/refman/5.7/en/regexp.html says
To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
mysql> SELECT '1+2' REGEXP '1+2'; -> 0
mysql> SELECT '1+2' REGEXP '1\+2'; -> 0
mysql> SELECT '1+2' REGEXP '1\\+2'; -> 1
The best fix seems to be [|] or \\| instead of \| when you want the pipe character.
Someday, the REGEXP parser in MySQL will be upgraded to PCRE as in MariaDB. Then a lot more features will come, and this 'bug' may go away.
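The sketch below shows what the regex engine needs to receive for a literal pipe. It is written in Python as a stand-in, where raw strings skip the extra string-literal unescaping step that MySQL performs; the sample values are made up for illustration.

```python
import re

row = "x|49213A|y"     # one letter between the pipes
other = "x|49213AB|y"  # two letters between the pipes

# Two ways to reach a literal pipe at the regex level.
escaped = re.compile(r"\|49213[A-Z]\|")
bracketed = re.compile(r"[|]49213[A-Z][|]")
one_or_more = re.compile(r"[|]49213[A-Z]+[|]")

print(escaped.search(row) is not None, bracketed.search(row) is not None)
print(bracketed.search(other) is not None, one_or_more.search(other) is not None)

# If the backslash is consumed before the engine sees it, the pattern
# becomes an alternation with empty branches. Python accepts that and
# matches the empty string; the answer above reports an error in MySQL.
bare = re.search(r"|49213[A-Z]|", "no pipes here")
print(repr(bare.group()))
```

The bracketed form [|] has the advantage that it needs no backslashes at all, so it reads the same before and after string-literal processing.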

MySQL regex matching rows containing two or more spaces in a row

I am trying to write a MySQL statement which finds and returns the book registrations that contain 2 or more spaces in a row.
The statement below is wrong.
SELECT * FROM book WHERE titles REGEXP '[:space]{2,}';
Since the 2 spaces already meet your condition, you really do not need to check if there are more than 2. Moreover, if you need to match a regular ASCII space (decimal code 32), you do not need a REGEXP operator, you can safely use
SELECT * FROM book WHERE titles LIKE '%  %';
LIKE is preferred in all cases where you can use it instead of REGEXP (see MySQL | REGEXP VS Like)
When you need to match numerous whitespace symbols, you can use WHERE titles REGEXP '[[:space:]]{2}' (it will match [ \t\r\n\v\f]), and if you only plan to match tabs and spaces, use WHERE titles REGEXP '[[:blank:]]{2}'. For more details, see POSIX Bracket Expressions.
Note that [:class_name:] should only be used inside a character class (i.e. inside another pair of [...]); otherwise it is not recognized.
Your POSIX class must be written as:
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2,}';
There is no need for the comma:
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2}';
You may also use [[:blank:]]
SELECT * FROM book WHERE titles REGEXP '[[:blank:]]{2}';
If you mean just the space character: REGEXP '  '. Or you could use LIKE "%  %", which would be faster. (Note: there are 2 blanks in those.)
Otherwise, see http://dev.mysql.com/doc/refman/5.6/en/regexp.html for blank and space.
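The three variants discussed above differ in which characters count as whitespace. A rough sketch in Python follows; note that Python's re module has no POSIX bracket classes, so [ \t] and \s are used as approximations of [[:blank:]] and [[:space:]] (an assumption), and the sample titles are made up.

```python
import re

DOUBLE_SPACE = " " * 2  # two ASCII spaces, spelled out to avoid ambiguity

titles = [
    "Dune",
    "War" + DOUBLE_SPACE + "and Peace",  # two spaces in a row
    "Tab\t\tSeparated",                   # two tabs in a row
    "Line\n\nBreak",                      # two newlines in a row
]

# Plain substring test, like: titles LIKE '%  %' (two spaces).
two_spaces_like = [t for t in titles if DOUBLE_SPACE in t]

# Approximation of [[:blank:]]{2}: spaces and tabs only.
two_blank = [t for t in titles if re.search(r"[ \t]{2}", t)]

# Approximation of [[:space:]]{2}: any whitespace, newlines included.
two_space_class = [t for t in titles if re.search(r"\s{2}", t)]

print(len(two_spaces_like), len(two_blank), len(two_space_class))
```

Each step widens the set of matches, so the narrowest test that fits the data is usually the one to pick.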

How do you find words with hyphens in a MySQL REGEXP query using word boundaries?

I have a MYSQL query to try to find words with hyphens. I am using the MYSQL word boundary.
SELECT COUNT(id)
AS count
FROM table
WHERE (name REGEXP '^[[:<:]]some-words-with-hyphens[[:>:]]/')
This seems to work, although the following does not (see the - after the word "hyphens"):
SELECT COUNT(id)
AS count
FROM table
WHERE (words REGEXP '^[[:<:]]some-words-with-hyphens-[[:>:]]/')
I tried to escape the -'s with \- but that did not seem to change the result. I also tried to put the - in brackets like [-], but that did not seem to change the result.
What would be the proper way to write this query with the understanding that hyphens will be within and possibly at the end of the "word"?
As documented under Regular Expressions:
A regular expression for the REGEXP operator may use any of the following special characters and constructs:
[ deletia ]
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).
mysql> SELECT 'a word a' REGEXP '[[:<:]]word[[:>:]]'; -> 1
mysql> SELECT 'a xword a' REGEXP '[[:<:]]word[[:>:]]'; -> 0
Since - and / are both non-word characters, the [[:>:]] construct does not match the point between them.
It's not clear why you're using these constructs at all, as the following ought to do the trick:
words REGEXP '^some-words-with-hyphens-/'
Indeed, it's not clear why you're even using regular expressions in this case, as simple pattern matching can achieve the same:
words LIKE 'some-words-with-hyphens-/%'
Assuming that some-words-with-hyphens is actually a regex and not some verbatim text, you could simply add an optional - at the end of the regex in order to match a trailing dash if it's present:
WHERE (name REGEXP '^[[:<:]]some-words-with-hyphens[[:>:]]-?/')
@eggyal has already explained why the word boundary matches before that hyphen.
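The boundary behaviour can be reproduced in Python, used here as a stand-in. Python's \b matches at either edge of a word, whereas [[:<:]] and [[:>:]] are start-only and end-only markers, so this is an approximation; the sample values and the helper name matches are made up.

```python
import re

def matches(pattern, value):
    """Hypothetical helper: True if pattern is found in value."""
    return re.search(pattern, value) is not None

value_plain = "some-words-with-hyphens/rest"
value_dash = "some-words-with-hyphens-/rest"

# Boundary between "s" (word character) and "/" (non-word): it exists.
print(matches(r"^\bsome-words-with-hyphens\b/", value_plain))

# Boundary requested between "-" and "/": both are non-word
# characters, so there is no boundary and the match fails.
print(matches(r"^\bsome-words-with-hyphens-\b/", value_dash))

# Optional hyphen placed after the boundary: both forms match.
print(matches(r"^\bsome-words-with-hyphens\b-?/", value_dash))
print(matches(r"^\bsome-words-with-hyphens\b-?/", value_plain))

# Dropping the boundary markers altogether also works here.
print(matches(r"^some-words-with-hyphens-/", value_dash))
```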

Querying a mysql database fetching words with a regexp

I'm using a regexp to fetch the set of words that match the following pattern:
SELECT * FROM words WHERE word REGEXP '^[dcqaahii]{5}$'
At first it looked right, until I realized that some letters appeared in the results more times than they appear in the regexp.
I want all words (here, of 5 letters) that can be formed with the letters within the brackets. So if I have two 'a's, the resulting words can contain no 'a', one 'a' or two 'a's, but no more.
What should I add to my regexp to enforce this?
Thanks in advance.
It would probably be better to retrieve all candidates first and post-process, as others have suggested:
SELECT * FROM words WHERE word REGEXP '^[dcqahi]{5}$'
However, nothing is stopping you from doing multiple REGEXPs. You can select 0, 1, or 2 incidences of the letter 'a' with this grungy expression:
'^[^a]*a?[^a]*a?[^a]*$'
So do the pre-filter first and then combine additional REGEXP requirements with AND:
SELECT * FROM words
WHERE word REGEXP '^[dcqahi]{5}$'
AND word REGEXP '^[^a]*a?[^a]*a?[^a]*$'
AND word REGEXP '^[^i]*i?[^i]*i?[^i]*$'
[edit] As an afterthought, I have inferred that for the non-vowels you also want to restrict to 0 or 1 occurrence. So if that's the case, you'd keep going...
AND word REGEXP '^[^d]*d?[^d]*$'
AND word REGEXP '^[^c]*c?[^c]*$'
AND word REGEXP '^[^q]*q?[^q]*$'
AND word REGEXP '^[^h]*h?[^h]*$'
Yuck.
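Since the per-letter patterns all follow the same shape, they can be generated rather than typed. The sketch below is in Python as a stand-in for the SQL; the function at_most, the limits table and the sample words are hypothetical, and the generator assumes each letter is a plain lowercase character with no special meaning inside a character class.

```python
import re

def at_most(letter, n):
    """Build the 'at most n occurrences' pattern for one letter.

    at_most("a", 2) gives ^[^a]*a?[^a]*a?[^a]*$
    """
    return "^[^{0}]*".format(letter) + "{0}?[^{0}]*".format(letter) * n + "$"

# Maximum occurrences per letter, read off the letters "dcqaahii".
limits = {"a": 2, "i": 2, "d": 1, "c": 1, "q": 1, "h": 1}

# Pre-filter (allowed letters, length 5) plus one pattern per letter.
patterns = [r"^[dcqahi]{5}$"] + [at_most(l, n) for l, n in limits.items()]

words = ["acida", "aaaid", "chiad", "iiiii", "hello"]
ok = [w for w in words if all(re.search(p, w) for p in patterns)]

print(at_most("a", 2))
print(ok)
```

Each generated pattern corresponds to one AND clause in the query above.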
The only solution I can think of is to use your existing SQL to get an initial filtered set of data, then loop through it and filter further with server-side code (PHP etc.), which is better suited to that kind of logic.
In regular expressions, square brackets [] are merely a character class, like a list of allowed characters. Specifying the same letter twice within the brackets is therefore redundant.
For example, the pattern ^[sed]+$ will match both sed and seed, because e is one of the allowed characters. A count in braces {} after the class only fixes the total number of characters drawn from the class.
The pattern ^[sed]{3}$ will therefore match sed but not seed.
I would recommend moving the logic for testing the validity of words from SQL into your program.
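One way to do that check in application code is to compare letter counts. This is a minimal sketch in Python rather than the asker's own language; the function name can_form and the candidate words are made up, and the candidates are assumed to come from the pre-filter query.

```python
from collections import Counter

def can_form(word, letters):
    """True if word can be spelled using each available letter at most once.

    Counter subtraction drops zero and negative counts, so an empty
    result means no letter of word is used more often than it is available.
    """
    return not (Counter(word) - Counter(letters))

letters = "dcqaahii"
candidates = ["acida", "aaaid", "chiad", "iiiii"]

formable = [w for w in candidates if len(w) == 5 and can_form(w, letters)]
print(formable)
```

This handles any number of repeated letters without building a separate regular expression for each one.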