MySQL REGEXP word boundaries [[:<:]] [[:>:]] and double quotes - mysql

I'm trying to match some whole-word expressions with the MySQL REGEXP function. There is a problem when double quotes are involved.
The MySQL documentation says: "To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters."
But these queries all return 0:
SELECT '"word"' REGEXP '[[:<:]]"word"[[:>:]]'; -> 0
SELECT '"word"' REGEXP '[[:<:]]\"word\"[[:>:]]'; -> 0
SELECT '"word"' REGEXP '[[:<:]]\\"word\\"[[:>:]]'; -> 0
SELECT '"word"' REGEXP '[[:<:]] word [[:>:]]'; -> 0
SELECT '"word"' REGEXP '[[:<:]][[.".]]word[[.".]][[:>:]]'; -> 0
What else can I try to get a 1? Or is this impossible?

Let me quote the documentation first:
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and
end of words, respectively. A word is a sequence of word characters
that is not preceded by or followed by word characters. A word
character is an alphanumeric character in the alnum class or an
underscore (_).
From the documentation we can see the reason behind your problem, and it has nothing to do with escaping. You are trying to match the word boundary [[:<:]] at the very beginning of the string. [[:<:]] only matches where a word starts, that is, immediately before a word character that is not preceded by another word character. In your case the first character is a ", which is not a word character, so no word starts there. The same goes for the last " and [[:>:]].
In order for this to work, you need to change your expression a bit to this one:
"[[:<:]]word[[:>:]]"
 ^^^^^^^    ^^^^^^^
Notice how each word boundary now sits between a non-word character and a word character: between " and w at the beginning, and between d and " at the end of the string.
EDIT: If you always want to anchor the expression at both ends without knowing whether there will be an actual word boundary, then you might use the following expression:
([[:<:]]|^)"word"([[:>:]]|$)
This matches either a word boundary or the start-of-string ^ at the beginning, and either a word boundary or the end-of-string $ at the end. I really advise you to study the data you are trying to match, look for common patterns, and not use regular expressions if they are not the right tool for the job.
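The difference between these patterns can be checked outside MySQL with a rough Python emulation. This is only a sketch: the lookaround translations of [[:<:]] and [[:>:]], the ASCII-only notion of a word character, and the helper names spencer_to_python and regexp are assumptions made for illustration, not something MySQL provides.

```python
import re

# Assumed equivalents of the Spencer markers, expressed as lookarounds:
# a word starts where a word character is not preceded by one,
# and ends where a word character is not followed by one.
WORD_START = r"(?<!\w)(?=\w)"
WORD_END = r"(?<=\w)(?!\w)"


def spencer_to_python(pattern):
    """Replace [[:<:]] and [[:>:]] with lookaround equivalents (nothing else is translated)."""
    return pattern.replace("[[:<:]]", WORD_START).replace("[[:>:]]", WORD_END)


def regexp(subject, pattern):
    """Mimic SELECT subject REGEXP pattern, returning 1 or 0."""
    found = re.search(spencer_to_python(pattern), subject, re.ASCII)
    return 1 if found else 0


subject = '"word"'
print(regexp(subject, '[[:<:]]"word"[[:>:]]'))          # boundaries outside the quotes
print(regexp(subject, '"[[:<:]]word[[:>:]]"'))          # boundaries inside the quotes
print(regexp(subject, '([[:<:]]|^)"word"([[:>:]]|$)'))  # boundary or string edge
```

Under these assumptions the first call yields 0 and the other two yield 1, which mirrors the behaviour described above.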

In MySQL 8.0.4 and later, use: \\bword\\b
ref. https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-compatibility

In MySQL 8 and above
Adding to Oleksiy Muzalyev's answer
https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-compatibility
In MySQL 8.0.4 and above, you have to use:
\bword\b
Here \b is the ICU notation for a word boundary. The Spencer library used by earlier versions writes word boundaries as [[:<:]] and [[:>:]].
When actually using this as part of a query, I've had to escape the escape character \ so my query actually looked like
SELECT * FROM table WHERE field RLIKE '\\bterm\\b'
When querying from PHP, use SINGLE quotes to do the same thing
$sql = 'SELECT * FROM table WHERE field RLIKE ?';
$args = ['\\bterm\\b'];
...

You need to be a little more sophisticated:
SELECT '"word"' REGEXP '"word"'; --> 1
SELECT '"This is" what I need' REGEXP '"This is" what I need[[:>:]]'; --> 1
That is,
If the test string begins/ends with a 'letter', then precede/follow the string with [[:<:]]/[[:>:]].
This is as opposed to blindly tacking those onto the string. After all, you are already inspecting the search string for special regexp characters to escape them. This is just another task in that vein. The definition of 'letter' should match whatever the word-boundary tokens look for.
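A minimal sketch of that rule in Python, assuming the search term is non-empty, has already been escaped for regexp use, and that 'letter' means an ASCII letter, digit, or underscore. The function names are hypothetical.

```python
def is_word_char(ch):
    """True for characters assumed to count as word characters:
    ASCII letters, digits, and the underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def with_word_boundaries(term):
    """Add [[:<:]] / [[:>:]] only on the sides where the term starts or ends with a word character.

    Assumes term is non-empty and already escaped by the caller.
    """
    prefix = "[[:<:]]" if is_word_char(term[0]) else ""
    suffix = "[[:>:]]" if is_word_char(term[-1]) else ""
    return prefix + term + suffix


print(with_word_boundaries('"word"'))
print(with_word_boundaries('"This is" what I need'))
print(with_word_boundaries('word'))
```

The first two calls reproduce the two patterns shown above; the third adds both markers because the term starts and ends with a letter.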

Related

Mysql query – Word begins with within the content

I'm attempting to replicate the following regex pattern in a MySql query: http://regexr.com/3gt57
I'm unable to use a like as I need to match words that begin with the submitted term but don't necessarily contain the term.
I can't seem to use the pattern in a REGEXP query:
SELECT * FROM serialised_post
WHERE post_content REGEXP '\bSugg\S*';
Any help would be appreciated
In MySQL's POSIX regexp syntax, \S is written as [^[:space:]] (a negated bracket expression matching any char other than a whitespace char; [:space:] is a POSIX character class matching any whitespace) and \b (here, a leading word boundary) is written as [[:<:]].
Use
WHERE post_content REGEXP '[[:<:]]Sugg[^[:space:]]*';
See more details about MySQL regex syntax here.
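If several such patterns need converting, the substitution can be sketched in Python. This is an illustrative helper only: the name to_posix and the mapping table are assumptions, \b is handled only at the very start or very end of the pattern, and escaped backslashes (\\) are not treated specially.

```python
import re

# Perl-style shorthand and the POSIX bracket expression assumed to replace it.
SHORTHAND = {
    r"\s": "[[:space:]]",
    r"\S": "[^[:space:]]",
    r"\d": "[[:digit:]]",
    r"\D": "[^[:digit:]]",
    r"\w": "[[:alnum:]_]",
    r"\W": "[^[:alnum:]_]",
}


def to_posix(pattern):
    """Rewrite a small subset of Perl-style shorthand into POSIX syntax."""
    # A leading \b becomes a start-of-word marker, a trailing \b an end-of-word marker.
    if pattern.startswith(r"\b"):
        pattern = "[[:<:]]" + pattern[2:]
    if pattern.endswith(r"\b"):
        pattern = pattern[:-2] + "[[:>:]]"
    # Swap each remaining shorthand class for its bracket expression.
    return re.sub(r"\\[sSdDwW]", lambda m: SHORTHAND[m.group(0)], pattern)


print(to_posix(r"\bSugg\S*"))
```

For the pattern in the question this prints the same expression as the WHERE clause above.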

Using REGEXP within MySQL to find a certain number within a comma separated list

I have a list of numbers in some fields in a table, for example something like this:
2033,1869,1914,1913,19120,1911,1910,1909,1908,1907,1866,1921,1922,1923
Now I'm trying to write a query that checks whether a number is found in the row. I can't use LIKE because it may return false positives: a search for 1912 in the field above would return a result because of the number 19120, which obviously we don't want. We also can't append or prepend a comma to the search term, because the first and last numbers don't have them.
So, onto using REGEXP I go... I tried this, but it doesn't work (it returns a result):
SELECT * FROM cat_listing WHERE cats REGEXP '[^0-9]*1912[^0-9]*';
I imagine it still finds something because of the * quantifier: it found [^0-9] 0 times AFTER 1912, so it considers that a match.
I'm not sure how to modify it to do what I want.
In your case, it seems word boundaries are necessary:
SELECT * FROM cat_listing WHERE cats REGEXP '[[:<:]]1912[[:>:]]';
[[:<:]] is the beginning of a word and [[:>:]] is the end. See reference:
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).
You have another option called find_in_set()
SELECT * FROM cat_listing WHERE find_in_set('1912', cats) <> 0;
Returns 0 if str is not in strlist or if strlist is the empty string. Returns NULL if either argument is NULL. This function does not work properly if the first argument contains a comma (“,”) character.
No need to use a regex just because the column value has no comma at either end:
SELECT
cats
FROM cat_listing
WHERE INSTR(CONCAT(',', cats, ','), ',1912,')
;
See it in action: SQL Fiddle.
Please comment if adjustment / further detail is required.
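The behaviour of the original pattern and of the comma-wrapping trick can be reproduced with a short Python sketch. The function names are hypothetical, and Python's re module stands in for MySQL's regexp engine, which is an assumption that holds for this simple pattern.

```python
import re

cats = "2033,1869,1914,1913,19120,1911,1910,1909,1908,1907,1866,1921,1922,1923"


def star_pattern_matches(needle, csv):
    """The pattern from the question: [^0-9]* may match zero characters,
    so it places no constraint on what surrounds the number."""
    return re.search("[^0-9]*" + re.escape(needle) + "[^0-9]*", csv) is not None


def wrapped_contains(needle, csv):
    """Same idea as INSTR(CONCAT(',', cats, ','), ',1912,'):
    wrap both sides in commas so every item is delimited."""
    return "," + needle + "," in "," + csv + ","


print(star_pattern_matches("1912", cats))  # false positive via 19120
print(wrapped_contains("1912", cats))      # 1912 is not an item
print(wrapped_contains("19120", cats))     # 19120 is an item
print(wrapped_contains("2033", cats))      # first item, no leading comma in the data
```

Wrapping both the haystack and the needle in commas is what lets the first and last items match without special cases.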

Escape percent sign in SQL query regex?

This query works:
SELECT * FROM table WHERE column REGEXP "[[:<:]]100[[:>:]]"
But this doesn't. It returns nothing, but I have a value "100%" on my table.
SELECT * FROM table WHERE column REGEXP "[[:<:]]100%[[:>:]]"
How can I make the query with the percent signal?
It depends. You have to replace the word-boundary marker [[:>:]] with something else that matches your need, because the percent sign is not a word character; see the REGEXP documentation:
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and
end of words, respectively. A word is a sequence of word characters
that is not preceded by or followed by word characters. A word
character is an alphanumeric character in the alnum class or an
underscore (_).
If it could be followed by, e.g., whitespace, punctuation, or the end of the string, then you could use
SELECT * FROM example WHERE content REGEXP '[[:<:]]100%([[:blank:]]|[[:punct:]]|$)';
You can see in this demo that you don't have to escape the percent sign (although you could).
The percent sign means the word-end boundary is no longer applicable. You have to find a combination of other markers that works for you.
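As a rough check of that replacement pattern, here is a Python approximation. It assumes ASCII data, uses a lookaround in place of [[:<:]], [ \t] in place of [[:blank:]], and string.punctuation in place of [[:punct:]]; the name matches_100_percent is made up for the example.

```python
import re
import string

# Assumed stand-in for [[:<:]]: a word character not preceded by a word character.
WORD_START = r"(?<![A-Za-z0-9_])(?=[A-Za-z0-9_])"
# Stand-in for ([[:blank:]]|[[:punct:]]|$).
FOLLOWED_BY = "(?:[ \\t]|[" + re.escape(string.punctuation) + "]|$)"

PATTERN = re.compile(WORD_START + "100%" + FOLLOWED_BY)


def matches_100_percent(text):
    """True when text contains 100% as a standalone token."""
    return PATTERN.search(text) is not None


for sample in ["100%", "save 100% today", "100%.", "1100%", "100%x"]:
    print(sample, matches_100_percent(sample))
```

The first three samples match; 1100% fails the start-of-word test and 100%x fails the trailing alternation.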
As far as I know, % is not a special character in regular expressions, so it needs no escaping.
Since % is not a word character, though, the trailing [[:>:]] has to go. You should be able to just fire off:
SELECT * FROM table WHERE column REGEXP '[[:<:]]100%'

Whole word matching with dot characters in MySQL

In MySQL, when searching for a keyword in a text field where only "whole word match" is desired, one could use REGEXP and the [[:<:]] and [[:>:]] word-boundary markers:
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]word[[:>:]]"
For example, when we want to find all text fields containing "europe", using
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]europe[[:>:]]"
would return "europe map", but not "european union".
However, when the target matching words contains "dot characters", like "u.s.", how should I submit a proper query? I tried the following queries but none of them look correct.
1.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u.s.[[:>:]]"
2.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"
3.
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\.s\.[[:>:]]"
When using a double backslash to escape special characters, as suggested by d'alar'cop, it returns an empty result, even though there is something like "u.s. congress" in the table
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.[[:>:]]"
Any suggestion is appreciated!
This regex does what you want:
SELECT name
FROM tbl_name
WHERE name REGEXP '([[:blank:][:punct:]]|^)u[.]s[.]([[:punct:][:blank:]]|$)'
This matches u.s. when preceded by:
a blank (space, tab etc)
punctuation (comma, bracket etc)
nothing (ie at start of line)
and followed by:
a blank (space, tab etc)
punctuation (comma, bracket etc)
nothing (ie at end of line)
See an SQLFiddle with edge cases covering above points.
The fundamental issue with your predicates is that . is a non-word character, and any non-word character will cause the word boundary test to fail if they follow a start test or precede an end test. You can see the behavior here.
To further complicate the issue, the flavor of regular expressions used by MySQL is very limited. According to Regular-Expressions.info, MySQL uses POSIX ERE, which, judging by the chart at the bottom of Regular Expression Flavor Comparisons, has very few capabilities when compared to other flavors.
To solve your problem you must create a new regular expression that replaces the functionality of the word boundary so that it allows non-word characters to be part of the boundary. I came up with the following regular expression:
(^|[^[:alnum:]_])YOUR_TEXT_HERE($|[^[:alnum:]_])
This is equivalent to the standard regular expression below:
(^|[^a-zA-Z0-9_])YOUR_TEXT_HERE($|[^a-zA-Z0-9_])
The regex looks for non-word characters or string boundaries at the start and end of the text. (^|[^[:alnum:]_]) matches either the start of the string or a single character that is neither alphanumeric nor an underscore. The ending pattern is similar, except that it matches the end of the string instead of the start.
The pattern was designed to best match the definition of word boundaries from Regular Expressions in the MySQL manual:
[Boundaries] match the beginning and end of words, respectively. A
word is a sequence of word characters that is not preceded by or
followed by word characters. A word character is an alphanumeric
character in the alnum class or an underscore.
Test Results
Using the regex above, I came up with a scenario where I test a string that contains non-word characters at the start and end - .u.s.. I tried to come up with a reasonable set of test items. You can see the results at
SQLFiddle.
Test Data
test string not present: 'no match'
missing .'s: 'no us match'
missing last .: 'no u.s match'
missing first .: 'no us. match'
test start boundary word character: 'no.u.s.match'
test end boundary word character: 'no .u.s.match'
test boundaries word character: 'no.u.s.match'
test basic success case: 'yes .u.s. match'
test start boundary non-word character: 'yes !.u.s. match'
test end boundary non-word character: 'yes .u.s.! match'
test boundaries non-word character: 'yes !.u.s.! match'
test start of line: '.u.s.! yes match'
test end of line: 'yes match .u.s.'
Query
SELECT *
FROM TestRegex
WHERE name REGEXP '(^|[^[:alnum:]_])[.]u[.]s[.]($|[^[:alnum:]_])';
Conclusion
All the positive cases were returned and none of the negative ones => All test cases succeeded.
You can use [.] for the period character instead of \\. which I find to be somewhat more readable in the context of a SQL expression.
You can adjust the sets used to define the boundary to be more or less restrictive depending on your desires. For example you can restrict some non-word characters as well: [^a-zA-Z_0-9.!?#$].
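The test table can also be replayed without a database. The sketch below uses Python's re module with the ASCII form of the pattern given earlier, (^|[^a-zA-Z0-9_])...($|[^a-zA-Z0-9_]). Treating it as equivalent to MySQL's evaluation is an assumption that should hold for this ASCII-only test data.

```python
import re

# ASCII rendering of (^|[^[:alnum:]_])[.]u[.]s[.]($|[^[:alnum:]_])
PATTERN = re.compile(r"(^|[^a-zA-Z0-9_])[.]u[.]s[.]($|[^a-zA-Z0-9_])")

# Test strings from the list above, mapped to whether they should match.
EXPECTED = {
    "no match": False,
    "no us match": False,
    "no u.s match": False,
    "no us. match": False,
    "no.u.s.match": False,
    "no .u.s.match": False,
    "yes .u.s. match": True,
    "yes !.u.s. match": True,
    "yes .u.s.! match": True,
    "yes !.u.s.! match": True,
    ".u.s.! yes match": True,
    "yes match .u.s.": True,
}


def matches(text):
    """True when .u.s. appears bounded by non-word characters or string edges."""
    return PATTERN.search(text) is not None


results = {text: matches(text) for text in EXPECTED}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Every "yes" string matches and every "no" string does not, in line with the conclusion above.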
Working example here: http://www.sqlfiddle.com/#!2/5aa90d/9/0
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.([^[:alnum:]]|$)"
Basically saying that u.s. must be followed by anything that isn't an alphanumeric character, or the end of the string.
You could change [:alnum:] to [:alpha:] to include results like This is u.s.5 if that's desirable.
Just use this query:
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u\\.s\\.([[:blank:]]|$)"
No need to use end-of-word [[:>:]] on RHS since you already have a dot after s.
The MySQL regexp manual has a table of special characters and how to escape them.
Doing your query like
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[.]s[.][[:>:]]"
or
SELECT name FROM tbl_name WHERE name REGEXP "[[:<:]]u[[.period.]]s[[.period.]][[:>:]]"
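escapes the dots correctly, but the trailing [[:>:]] still cannot match right after a dot, because . is not a word character.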

How do you find words with hyphens in a MySQL REGEXP query using word boundaries?

I have a MYSQL query to try to find words with hyphens. I am using the MYSQL word boundary.
SELECT COUNT(id)
AS count
FROM table
WHERE (name REGEXP '^[[:<:]]some-words-with-hyphens[[:>:]]/')
This seems to work, although the following does not (see the - after the word "hyphens"):
SELECT COUNT(id)
AS count
FROM table
WHERE (words REGEXP '^[[:<:]]some-words-with-hyphens-[[:>:]]/')
I tried to escape the -'s with \- but that did not seem to change the result. I also tried to put the - in brackets like [-], but that did not seem to change the result.
What would be the proper way to write this query with the understanding that hyphens will be within and possibly at the end of the "word"?
As documented under Regular Expressions:
A regular expression for the REGEXP operator may use any of the following special characters and constructs:
[ deletia ]
[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).
mysql> SELECT 'a word a' REGEXP '[[:<:]]word[[:>:]]'; -> 1
mysql> SELECT 'a xword a' REGEXP '[[:<:]]word[[:>:]]'; -> 0
Since - and / are both non-word characters, the [[:>:]] construct does not match the point between them.
It's not clear why you're using these constructs at all, as the following ought to do the trick:
words REGEXP '^some-words-with-hyphens-/'
Indeed, it's not clear why you're even using regular expressions in this case, as simple pattern matching can achieve the same:
words LIKE 'some-words-with-hyphens-/%'
Assuming that some-words-with-hyphens is actually a regex and not some verbatim text, you could simply add an optional - at the end of the regex in order to match a trailing dash if it's present:
WHERE (name REGEXP '^[[:<:]]some-words-with-hyphens[[:>:]]-?/')
@eggyal has already explained why the word boundary matches before that hyphen.