Querying a mysql database fetching words with a regexp - mysql

I'm using a regexp to fetch the set of words that match the following pattern:
SELECT * FROM words WHERE word REGEXP '^[dcqaahii]{5}$'
At first it looked fine, until I realized that some letters were used more times in the results than they appear in the regexp.
What I want is all words (here, of 5 letters) that can be formed with the letters within the brackets. So if I have two 'a', the resulting words can contain no 'a', one 'a', or two 'a', but no more.
What should I add to my regexp to enforce this?
Thanks in advance.

It would probably be better to retrieve all candidates first and post-process, as others have suggested:
SELECT * FROM words WHERE word REGEXP '^[dcqahi]{5}$'
However, nothing is stopping you from doing multiple REGEXPs. You can select 0, 1, or 2 occurrences of the letter 'a' with this grungy expression:
'^[^a]*a?[^a]*a?[^a]*$'
So do the pre-filter first and then combine additional REGEXP requirements with AND:
SELECT * FROM words
WHERE word REGEXP '^[dcqahi]{5}$'
AND word REGEXP '^[^a]*a?[^a]*a?[^a]*$'
AND word REGEXP '^[^i]*i?[^i]*i?[^i]*$'
[edit] As an afterthought, I have inferred that for the non-vowels you also want to restrict to 0 or 1 occurrence. So if that's the case, you'd keep going...
AND word REGEXP '^[^d]*d?[^d]*$'
AND word REGEXP '^[^c]*c?[^c]*$'
AND word REGEXP '^[^q]*q?[^q]*$'
AND word REGEXP '^[^h]*h?[^h]*$'
Yuck.
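Since writing one clause per letter by hand gets tedious, the clauses above could be generated. Here is a minimal sketch in Python; build_query is a hypothetical helper (not part of MySQL or any library), and it assumes the letters are plain a-z with no regex or SQL metacharacters, so it does no escaping:

```python
from collections import Counter

def build_query(letters: str, length: int,
                table: str = "words", column: str = "word") -> str:
    """Build the pre-filter plus one 'at most n occurrences' clause per letter.

    Assumes `letters` contains only a-z, so nothing needs escaping.
    """
    counts = Counter(letters)
    charset = "".join(sorted(counts))
    # Pre-filter: exactly `length` characters, all from the allowed set.
    clauses = [f"{column} REGEXP '^[{charset}]{{{length}}}$'"]
    for ch, n in sorted(counts.items()):
        # n repetitions of "ch?[^ch]*" allow at most n occurrences of ch.
        body = f"[^{ch}]*" + f"{ch}?[^{ch}]*" * n
        clauses.append(f"{column} REGEXP '^{body}$'")
    return f"SELECT * FROM {table}\nWHERE " + "\nAND ".join(clauses)

query = build_query("dcqaahii", 5)
print(query)
```

For the letters in the question this produces the same kind of clauses as above, with two optional slots for 'a' and 'i' and one for every other letter.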

The only solution I can think of is to use your existing SQL to get an initial filtered set of data, then loop through it and filter further with some server-side code (PHP etc.), which is better suited to that kind of logic.

In regular expressions, square brackets [] define a character class: a set of allowed characters. Listing the same letter twice inside the brackets is therefore redundant.
For example, the anchored pattern ^[sed]+$ matches both sed and seed, because e is one of the allowed characters however many times it appears. A count in braces {} after the class only fixes how many characters from the class must appear in total; it does not limit how often each one is used.
The pattern ^[sed]{3}$ therefore matches sed but not seed (and it also matches see and ddd).
I would recommend moving the logic for testing the validity of words from SQL into your program.
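As a rough sketch of what that program-side check might look like, here is one way to do it in Python by comparing letter counts. The names can_form and filter_words are made up for this example, and the candidate strings are invented test data, not real rows from the table:

```python
from collections import Counter

def can_form(word: str, letters: str) -> bool:
    """True if `word` uses each letter no more often than `letters` provides it."""
    have = Counter(letters)
    need = Counter(word)
    return all(have[ch] >= n for ch, n in need.items())

def filter_words(candidates, letters: str, length: int = 5):
    """Keep candidates of the requested length that can be built from `letters`."""
    return [w for w in candidates if len(w) == length and can_form(w, letters)]

# Candidates as they might come back from the ^[dcqahi]{5}$ pre-filter.
candidates = ["aaaid", "caadi", "hiqad"]
print(filter_words(candidates, "dcqaahii"))
```

Here "aaaid" is dropped because it needs three 'a' and only two are available, while the other two strings pass.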

Related

Regex pattern equivalent of %word% in mysql

I need two case-insensitive regex patterns. The first should be equivalent to SQL's % wildcard, as in %word%. My attempt at this was '^[a-zA-Z]*word[a-zA-Z]*$'.
Question 1: This seems to work, but I am wondering if this is the equivalent of %word%.
The second pattern is similar to %word%, but it must reject strings that have 3 or more characters either before or after the word. For example, if the target word is word:
words = matched because it doesn't have 3 or more characters either before or after it.
swordfish = not matched because it has 3 or more characters after word
theword = not matched because it has 3 or more characters before it
mywordly = matched because it doesn't contain 3 or more characters before or after word.
miswordeds = not matched because it has 3 characters before it (it also has 3 characters after it, but the first condition already excludes it).
Question 2: For the second regex, I am not very sure how to start this. I will be using the regex in a MySQL query using the REGEXP function for example:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]*word[a-zA-Z]*$'
First Question:
According to https://dev.mysql.com/doc/refman/8.0/en/string-comparison-functions.html#operator_like
With LIKE you can use the following two wildcard characters in the pattern:
% matches any number of characters, even zero characters.
_ matches exactly one character.
This means the regex '^[a-zA-Z]*word[a-zA-Z]*$' is not equivalent to %word%: % matches any characters at all, whereas [a-zA-Z]* matches only letters.
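To make the difference concrete, here is a small Python sketch. It uses Python's re module rather than MySQL, so it only illustrates the idea; like_contains is a hypothetical stand-in for LIKE '%word%' that assumes a case-insensitive comparison:

```python
import re

# The pattern from the question: letters only on both sides of "word".
LETTERS_ONLY = re.compile(r"^[a-zA-Z]*word[a-zA-Z]*$")

def like_contains(value: str, needle: str = "word") -> bool:
    """Stand-in for LIKE '%word%': any characters may surround the needle."""
    return needle.lower() in value.lower()

for value in ["swordfish", "my-word 2"]:
    print(value, like_contains(value), LETTERS_ONLY.search(value) is not None)
```

Both checks accept "swordfish", but only the LIKE-style check accepts "my-word 2", because the hyphen, space and digit are not letters.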
Second Question:
Change * to {0,2} to allow at most 2 characters before the word and at most 2 after it:
SELECT 1
WHERE 'SWORDFISH' REGEXP '^[a-zA-Z]{0,2}word[a-zA-Z]{0,2}$'
And to make case insensitive:
SELECT 1 WHERE LOWER('SWORDFISH') REGEXP '^[a-z]{0,2}word[a-z]{0,2}$'
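As a quick sanity check of the {0,2} pattern against the examples from the question, here is a sketch using Python's re module (not MySQL itself, but the quantifier behaves the same way for this pattern). The helper name near_word is made up:

```python
import re

# At most two letters before "word" and at most two after it.
NEAR_WORD = re.compile(r"^[a-z]{0,2}word[a-z]{0,2}$")

def near_word(value: str) -> bool:
    """Lower-case the input first, mirroring LOWER(...) in the query."""
    return NEAR_WORD.search(value.lower()) is not None

samples = ["words", "swordfish", "theword", "mywordly", "miswordeds"]
results = {s: near_word(s) for s in samples}
print(results)
```

Only words and mywordly pass, which agrees with the matched / not matched list in the question.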
Assuming
The test string (or column) has only letters. (Hence, I can use . instead of [a-z]).
Case folding and accents are not an issue (presumably handled by a suitable COLLATION).
Either way:
WHERE x LIKE '%word%' -- found the word
AND x NOT LIKE '%___word%' -- but fail if too many leading chars
AND x NOT LIKE '%word___%' -- or trailing
WHERE x RLIKE '^.{0,2}word.{0,2}$'
I vote for RLIKE being slightly faster than LIKE -- only because there are fewer ANDs.
(MySQL 8.0 introduced incompatible regexp syntax; I think the syntax above works in all versions of MySQL/MariaDB. Note that I avoided word boundaries and character class shortcuts like \\w.)

MySQL NOT REGEXP not working when excluding results

I only want rows which do not contain any of the words matched in the REGEXP but somehow it's not working properly.
I found this as an alternative not to have 20 lines of
AND user_agent NOT LIKE 'word'
But my REGEXP seems broken, this is the line:
AND user_agent NOT REGEXP '/(ligatus|googlebot|appengine|Mediapartners-Google|semrushbot|ipad|iphone|android|admantx|MJ12bot|CCBot|bingbot|HybridBot|crawler)/gmi'
As the comments have pointed out, you have a slight syntax problem in your REGEXP expression: MySQL patterns are plain strings, so the /.../ delimiters and the gmi flags are treated as literal characters to match. In addition to this, you should surround the alternation with word boundaries, because you want to match (or exclude) entire words, not words that merely appear as substrings.
WHERE user_agent NOT REGEXP '[[:<:]](ligatus|googlebot|appengine)[[:>:]]'
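I only included the first three terms so that it would fit on a single line, but you may use the full alternation. Note that [[:<:]] and [[:>:]] are the word-boundary markers of MySQL versions before 8.0.4; the ICU-based regex engine in later versions does not support them, so use \\b there instead.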
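To see what the word boundaries buy you, here is a rough Python analogue. It uses Python's \b, which is not the same syntax as MySQL's markers, and the user-agent strings are just sample values; BLOCKED and is_allowed are names invented for this sketch:

```python
import re

# Same first three terms as in the query above.
BLOCKED = ["ligatus", "googlebot", "appengine"]

# \b on both sides so a term only counts when it stands as a whole word.
BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKED) + r")\b",
    re.IGNORECASE,
)

def is_allowed(user_agent: str) -> bool:
    """True when no blocked term appears as a whole word (like NOT REGEXP)."""
    return BLOCKED_RE.search(user_agent) is None

print(is_allowed("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(is_allowed("Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"))
print(is_allowed("notgooglebotty/1.0"))
```

The first agent is rejected, while the last one is kept because googlebot only appears there inside a longer word.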

MySql Specific Search - Replace String

I need to search words that contain multiple number prefixes.
Example:
0119
0129
0139
0149
But there are other prefixes, such as 0155859 and 0128889.
If I search for 0%9, I get results I don't want, including 0155859 and 0128889.
I need to list ONLY the ones of the form 0119, etc.
How do I do it?
What I want is 0XX9, where XX is any two characters, so 0119, 0129, etc. % matches every character up to a 9, and I don't want that.
English is not my first language, so correct me if I didn't express myself clearly!
In a LIKE pattern, the _ character matches any single character. So you can do:
WHERE word LIKE '0__9%'
This matches a word that begins with 0, then any two characters, then 9, then anything after that.
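If it helps to see how the two wildcards behave, here is a minimal Python sketch that translates a LIKE pattern into a regular expression. like_to_regex and like are hypothetical helpers written for this illustration; they ignore LIKE's escape character and collation rules:

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate % to '.*' and _ to '.'; everything else is matched literally."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any number of characters, even zero
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value: str, pattern: str) -> bool:
    """LIKE compares against the whole value, hence fullmatch."""
    return like_to_regex(pattern).fullmatch(value) is not None

for value in ["0119", "0149", "0155859", "0128889"]:
    print(value, like(value, "0%9"), like(value, "0__9%"))
```

With 0%9 all four values match, while 0__9% keeps only 0119 and 0149.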
My gut feeling at seeing your question was to consider using REGEXP, which is MySQL's regex matching operator. Try the following query:
SELECT *
FROM yourTable
WHERE word REGEXP '0[0-9][0-9]9'
The pattern used would match any word containing a zero, followed by any two digits, followed by a 9. To require that sequence at the start of the word, anchor it with ^, as in '^0[0-9][0-9]9'.

MySQL regex matching rows containing two or more spaces in a row

I am trying to write a MySQL statement which finds and returns the book registrations that contain 2 or more spaces in a row.
The statement below is wrong.
SELECT * FROM book WHERE titles REGEXP '[:space]{2,}';
Since the 2 spaces already meet your condition, you really do not need to check if there are more than 2. Moreover, if you need to match a regular ASCII space (decimal code 32), you do not need a REGEXP operator, you can safely use
SELECT * FROM book WHERE titles LIKE '%  %';
LIKE is preferred in all cases where you can use it instead of REGEXP (see MySQL | REGEXP VS Like)
When you need to match numerous whitespace symbols, you can use WHERE titles REGEXP '[[:space:]]{2}' (it will match [ \t\r\n\v\f]), and if you only plan to match tabs and spaces, use WHERE titles REGEXP '[[:blank:]]{2}'. For more details, see POSIX Bracket Expressions.
Note that [:class_name:] should only be used inside a character class (i.e. inside another pair of [...]); otherwise, it is not recognized.
Your POSIX class must be written like this:
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2,}';
There is no need for the comma:
SELECT * FROM book WHERE titles REGEXP '[[:space:]]{2}';
You may also use [[:blank:]]
SELECT * FROM book WHERE titles REGEXP '[[:blank:]]{2}';
If you mean just the space character: REGEXP '  '. Or you could use LIKE "%  %", which would be faster. (Note: there are 2 blanks in those.)
Otherwise, see http://dev.mysql.com/doc/refman/5.6/en/regexp.html for blank and space.
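To compare the three variants discussed in these answers side by side, here is a small Python sketch. Python's re module has no [[:space:]] or [[:blank:]] classes, so the sets are spelled out explicitly as an approximation, and the titles are invented sample data:

```python
import re

# Explicit equivalents of the POSIX classes, as listed in the answer above.
TWO_SPACE_CLASS = re.compile(r"[ \t\r\n\v\f]{2}")   # like [[:space:]]{2}
TWO_BLANK_CLASS = re.compile(r"[ \t]{2}")           # like [[:blank:]]{2}

def has_double_space(title: str) -> bool:
    """Plain substring test, the counterpart of LIKE with two spaces."""
    return "  " in title

titles = ["Moby Dick", "Moby  Dick", "Moby \nDick", "Moby\t\tDick"]
for t in titles:
    print(repr(t),
          has_double_space(t),
          TWO_BLANK_CLASS.search(t) is not None,
          TWO_SPACE_CLASS.search(t) is not None)
```

The plain check only catches two literal spaces, the blank class also catches tabs, and the space class additionally catches a space followed by a newline.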

Using REGEXP within MySQL to find a certain number within a comma separated list

I have a list of numbers in some fields in a table, for example something like this:
2033,1869,1914,1913,19120,1911,1910,1909,1908,1907,1866,1921,1922,1923
Now, I'm trying to write a query that checks whether a number is present in the row. I can't use LIKE because it may return false positives: a search for 1912 in the field above would match because of the number 19120, which we obviously don't want. We also can't search with a comma prepended or appended, because the first and last numbers don't have them.
So, onto using REGEXP I go... I tried this, but it doesn't work (it returns a result):
SELECT * FROM cat_listing WHERE cats REGEXP '[^0-9]*1912[^0-9]*';
I imagine it still finds something because of the * quantifier; it found [^0-9] 0 times AFTER 1912, so it considers it a match.
I'm not sure how to modify it to do what I want.
In your case, it seems word boundaries are necessary:
SELECT * FROM cat_listing WHERE cats REGEXP '[[:<:]]1912[[:>:]]';
[[:<:]] is the beginning of a word and [[:>:]] is the end. See reference:
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).
You have another option called find_in_set()
SELECT * FROM cat_listing WHERE find_in_set('1912', cats) <> 0;
Returns 0 if str is not in strlist or if strlist is the empty string. Returns NULL if either argument is NULL. This function does not work properly if the first argument contains a comma (“,”) character.
No need to use a regex just because the column value has no comma at either end:
SELECT
cats
FROM cat_listing
WHERE INSTR(CONCAT(',', cats, ','), ',1912,')
;
See it in action: SQL Fiddle.
Please comment if adjustment / further detail is required.
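For anyone who wants to check the logic of the last two approaches outside the database, here is a rough Python emulation. find_in_set and instr_wrapped are hypothetical stand-ins that mimic the behaviour described above; they are not MySQL's own implementation:

```python
def find_in_set(needle, strlist):
    """1-based position of needle in a comma-separated list, 0 if absent."""
    if needle is None or strlist is None:
        return None
    if strlist == "":
        return 0
    items = strlist.split(",")
    return items.index(needle) + 1 if needle in items else 0

def instr_wrapped(needle: str, strlist: str) -> int:
    """Wrap both sides in commas so only whole entries can match (0 = not found)."""
    return ("," + strlist + ",").find("," + needle + ",") + 1

cats = "2033,1869,1914,1913,19120,1911,1910,1909,1908,1907,1866,1921,1922,1923"

print("1912" in cats)                 # naive substring test: false positive
print(find_in_set("1912", cats))      # 0: no entry is exactly 1912
print(find_in_set("19120", cats))     # position of the entry in the list
print(instr_wrapped("1912", cats))    # 0 as well
```

Both helpers treat 1912 as absent while still finding the first and last entries, which have no comma on one side in the stored value.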