Regexp for whole words, punctuation included

Regexp for whole words, punctuation included - mysql

I'm completely new about MySQL regexp and, after 3 days searching the web I've got to give up!
I need to find words into a database, exact words. But I need to take care about punctuation too (would be nice I provide my list of punctuation symbols). These words might be at the beginning of the database record, at the end or in the middle. They also can the the only data into the record. I don't care about uppercase or lowercase words.
Thus, I need to retrieve words in such cases:
Lion - Lion. - Lion, - Lion; - Lion... (dots) - Lion… (hellip) etc. (there can be many other punctuation symbols)
or
'Lion' - "Lion" - (Lion) - <Lion> - /Lion/ etc.
but those are incorrect:
Lion.tiger - Lions - superlion - <Lion" - (Lion> - Lion.....
I've tried dozens of regexp provided on several websites but none could solve that accurately.
Cheers.

You should be able to use word boundaries. Otherwise you'll have to declare all of the characters that you're willing to accept.
select id,myRecord from my_records where myRecord REGEXP '[[:<:]]lion[[:>:]]';
Here's an SQL Fiddle using some of the examples that you've provided for matching.
Note that because the . character in Lion.tiger is typically considered a word barrier (end of sentence declaration) in English, you might need to make a special case where you're not matching it.
select id,myRecord from my_records where
myRecord REGEXP '[[:<:]]lion[[:>:]]' and
myRecord NOT REGEXP '[[:<:]]lion\.[[:alpha:]]';
Example fiddle.

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there's a told of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.

You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') matches . ^ $ * + - ? ( ) [ ] { } \ | chars (adds a \ before them) and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase start/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.

Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

MySQL matching this regex while it shouldn't

I'm trying to recognize quoting (citing) somebody's else sentence in a markdown text, which I have in my local copy of MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
it matches some unwanted data, which according to https://regex101.com/, it should not with this particular regular expression.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL workbench didn't show those return carriage and new line symbols unless copy-pasted here.
Can I normalize (remove \r and \n) with some update query ?
Is MySQL regex implementation different from POSIX standard regex ?
Do you have by any chances maximally clean solution for recognizing quoting in a markdown text ?
Thanks!

You've got an awful lot of parens in there. Try this as functionally what you have above:
select * from github_discussions where body rlike '.*[:blank:]+>[:blank:]+.+'
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[:blank:]*>[:blank:]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)

Regular Expression to match single, whole words from MySQL Database

I require a Regular Expression to match single words of [a-zA-Z] only with 6 or more characters.
The list of words:
a
27
it was good because
it
cat went to
four
eight is bigger
peter
pepper
objects
83
harrison99
adjetives
atom-bomb
The following would be matched
pepper
objects
adjectives
It isn't too bad if the following are matched too as I can strip out the duplicates after
it was good because
eight is bigger
atom-bomb
So far I have this
/[a-zA-z]{6,}/g
Which matches 6 character or more words according to RegEx 101 but when I use it in my MySQL database to filter a table on matches RegExp.
Currently it includes words containing punctuation and other non a-zA-Z characters. I don't want it to include results with numbers or punctuation.
In an ideal world I don't want it to include a row that contains more than one word as usually the other words are already duplicates.
Actual MySQL results
If a row contains a number or punctuation and/or contain one or more words I don't want it included.
I want only results that are single, whole words of 6+ characters

Drop the delimiters and add anchors for start/end:
SELECT 'abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 1
SELECT 'a abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 0
Also I changed [a-zA-z] in the character class to [a-zA-Z].
Instead of [a-zA-Z] you can also use [[:alpha:]] POSIX style bracket extension. So it becomes:
^[[:alpha:]]{6,}$
(As a side note: MySql uses Henry Spencer's implementation of regular expressions. if word boundaries needed, use [[:<:]]word[[:>:]] See manual for further syntax differences)

Regex for start with three alpha and four digits

I have writen an sql statement to retrieve data from Mysql db and I wanted to select data where myId start with three alpha and 4 digits example : ABC1234K1D2
myId REGEXP '^[A-Z]{3}/d{4}'
but it gives me empty result(data is available in DB). Could someone point me to correct way.

In most regex variants the answer would be: /d matches a / followed by a d; I think you want \d which matches a digit.
However MySQL has a somewhat limited regex implementation (see documentation).
There is no shortcut to character sets like \d for any digit.
You need to either use a named character set ([[:digit:]]), or just use [0-9].

Try this out :
[A-Z]{3}[0-9]{4}
If you want characters to be case insensitive. Try this :
[a-zA-Z]{3}[0-9]{4}

First, in regular regular expressions, to match a digit, you have to use \d instead of /d (which makes you match / followed by d).
Then, I had never noticed, but I think \d (and the others like \w, etc.) don't seem to be available in MySQL. The doc lists the accepted spacial chars, and those generic classes don't appear. You could use [:digit:] instead, even if [0-9] is quite shorter ;)

You are doing fine, just replace /d with \d.Final regex: ^[A-Z]{3}\d{4}

You could use the following pattern :
^[a-zA-Z]{3}\d{4}

How can I query for text containing Asian-language characters in MySQL?

I have a MySQL table using the UTF-8 character set with a single column called WORDS of type longtext. Values in this column are typed in by users and are a few thousand characters long.
There are two types of rows in this table:
In some rows, the WORDS value has been composed by English speakers and contains only characters used in ordinary English writing. (Not all are necessarily ASCII, e.g. the euro symbol may appear in some cases.)
Other rows have WORDS values written by speakers of Asian languages (Korean, Chinese, Japanese, and possibly others), which include a mix of English words and words in the Asian languages using their native logographic characters (and not, for example, Japanese romaji).
How can I write a query that will return all the rows of type 2 and no rows of type 1? Alternatively, if that's hard, is there a way to query most such rows (here it's OK if I miss a few rows of type 2, or include a few false positives of type 1)?
Update: Comments below suggest I might do better to avoid the MySQL query engine altogether, as its regex support for unicode doesn't sound too good. If that's true, I could extract the data into a file (using mysql -B -e "some SQL here" > extract.txt) and then use perl or similar on the file. An answer using this method would be OK (but not as good as a native MySQL one!)

In theory you could do this:
Find the unicode ranges that you want to test for.
Manually encode the start and end into UTF-8.
Use the first byte of each of the encoded start and end as a range for a REGEXP.
I believe that the CJK range is far enough removed from things like the euro symbol that the false positives and false negatives would be few or none.
Edit: We've now put theory into practice!
Step 1: Choose the character range. I suggest \u3000-\u9fff; easy to test for, and should give us near-perfect results.
Step 2: Encode into bytes. (Wikipedia utf-8 page)
For our chosen range, utf-8 encoded values will always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.
Thus, we want to mach bytes in the range 11100011 to 11101001, or 0xe3 to 0xe9.
Step 3: Make our regexp using the very handy (and just now discovered by me) UNHEX function.
SELECT * FROM `mydata`
WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
Just tried it out. Works like a charm. :)

You can also use the HEX value of the character. SELECT * FROM table WHERE <hex code>
Try it out with SELECT HEX(column) FROM table
This might help as well http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Regexp for whole words, punctuation included - mysql

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

MySQL matching this regex while it shouldn't

Regular Expression to match single, whole words from MySQL Database

Regex for start with three alpha and four digits

How can I query for text containing Asian-language characters in MySQL?

Categories

Resources