Regular Expression to match single, whole words from MySQL Database - mysql

I require a Regular Expression to match single words of [a-zA-Z] only with 6 or more characters.
The list of words:
a
27
it was good because
it
cat went to
four
eight is bigger
peter
pepper
objects
83
harrison99
adjectives
atom-bomb
The following would be matched
pepper
objects
adjectives
It isn't too bad if the following are matched too as I can strip out the duplicates after
it was good because
eight is bigger
atom-bomb
So far I have this
/[a-zA-z]{6,}/g
This matches words of 6 or more characters according to RegEx 101, but when I use it with REGEXP in MySQL to filter a table, the results include words containing punctuation and other non-a-zA-Z characters. I don't want it to include results with numbers or punctuation.
In an ideal world I don't want it to include a row that contains more than one word as usually the other words are already duplicates.
Actual MySQL results
If a row contains a number or punctuation, or contains more than one word, I don't want it included.
I want only results that are single, whole words of 6+ characters

Drop the delimiters and add anchors for start/end:
SELECT 'abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 1
SELECT 'a abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 0
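I also changed [a-zA-z] in the character class to [a-zA-Z], because the range A-z also covers the punctuation characters that sit between Z and a in ASCII.
Instead of [a-zA-Z] you can also use the POSIX-style bracket expression [[:alpha:]]. So it becomes: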
^[[:alpha:]]{6,}$
(As a side note: MySQL versions before 8.0.4 use Henry Spencer's implementation of regular expressions. If word boundaries are needed, use [[:<:]]word[[:>:]]. See the manual for further syntax differences.)
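The effect of the anchors can be checked outside the database. The sketch below uses Python's re module as a stand-in for MySQL's REGEXP (an assumption made for illustration; the two engines differ in other respects) and runs the anchored pattern over the word list from the question. The names WORDS, ANCHORED, TYPO_CLASS and single_words are hypothetical.

```python
import re

# Word list from the question (with "adjectives" spelled as in the expected output).
WORDS = [
    "a", "27", "it was good because", "it", "cat went to", "four",
    "eight is bigger", "peter", "pepper", "objects", "83",
    "harrison99", "adjectives", "atom-bomb",
]

# Anchored pattern from the answer: the whole value must be 6+ ASCII letters.
ANCHORED = re.compile(r"^[a-zA-Z]{6,}$")

# The class from the question: the range A-z also spans [ \ ] ^ _ and `.
TYPO_CLASS = re.compile(r"^[a-zA-z]{6,}$")


def single_words(values):
    """Return the values that consist of exactly one word of 6+ letters."""
    return [v for v in values if ANCHORED.search(v)]


print(single_words(WORDS))
print(bool(TYPO_CLASS.search("abc_def")), bool(ANCHORED.search("abc_def")))
```

Without the anchors, a value such as "it was good because" passes, since "because" alone satisfies {6,}; with them, the whole value has to fit the pattern.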

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call finds the . ^ $ * + - ? ( ) [ ] { } \ | characters and adds a \ before each of them, while (?!\\B\\w) and (?<!\\w\\B) require a word boundary only when the search phrase starts or ends with a word character.
More details on adaptive dynamic word boundaries and demo in my YT video.
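To see how the two lookarounds behave, here is a minimal sketch in Python rather than MySQL (an assumption for illustration: Python's re accepts the same lookaround syntax, and re.escape plays the role of the REGEXP_REPLACE escaping step). The names TEXT, keyword_pattern and found are hypothetical; the sample text is the paragraph from the question.

```python
import re

TEXT = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")


def keyword_pattern(keyword):
    """Build a regex that matches `keyword` as a whole term.

    re.escape() neutralizes metacharacters such as + and . in the keyword.
    The lookarounds demand a word boundary only on a side where the
    keyword itself begins or ends with a word character.
    """
    return re.compile(r"(?!\B\w)" + re.escape(keyword) + r"(?<!\w\B)")


def found(keyword, text=TEXT):
    """True if `keyword` occurs in `text` as a whole term."""
    return keyword_pattern(keyword).search(text) is not None


for kw in ["C++", ".NET", "JavaScript", "Java", "C#"]:
    print(kw, found(kw))
```

Here "Java" is rejected because it is followed by another word character inside "JavaScript", while "C++" and ".NET" are accepted even though a plain \b would fail next to their punctuation.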
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

How to select data with REGEXP without containing Extended ASCII

I just want to display data that:
contains a-z
contains A-Z
contains the numbers 0-9
contains printable ASCII characters, as listed at the following link: http://www.theasciicode.com.ar/extended-ascii-code/latin-diphthong-ae-uppercase-ascii-code-146.html
I'm stuck with code like this; please help with point 4 above:
select * from Delin where alamat REGEXP '^ [A-Za-z0-9]'
with sample raw data below:
and I want output like this (where the images show only a-Z and printable symbols):
Your items 1–3 (a–z, A–Z, and 0–9) are all subsets of item 4 (printable ASCII characters), so you need only concern yourself with the latter. The following query satisfies that criterion:
SELECT * FROM Delin
WHERE alamat REGEXP '^[ -~]+$';
The character class [ -~] indicates the ASCII characters from space to tilde inclusive, which happens to be all of the printable ASCII characters and no others.
You can see it in an SQL Fiddle here: http://sqlfiddle.com/#!9/6c7b8/1
Terminology note: There is no such thing as "Extended ASCII." The ASCII character set corresponds to the numbers 0–127 inclusive. Any character corresponding to a number greater than 127 is not ASCII. The term "Extended ASCII" is often mistakenly applied to various non-ASCII encodings, none of which is an "extension" of ASCII in any official sense.
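The [ -~] character class can be tried quickly outside the database. The following is a rough sketch in Python's re module (used here as a stand-in for MySQL's REGEXP, which is an assumption for illustration); the sample values and the names PRINTABLE_ASCII and is_printable_ascii are made up for the example.

```python
import re

# Space (0x20) through tilde (0x7E): the printable ASCII characters.
PRINTABLE_ASCII = re.compile(r"[ -~]+")


def is_printable_ascii(value):
    """True if `value` is non-empty and every character is printable ASCII."""
    return PRINTABLE_ASCII.fullmatch(value) is not None


# Hypothetical sample values, not taken from the asker's table.
samples = ["Jl. Merdeka No. 10", "Caf\u00e9 Street 5", "tab\there", ""]
for s in samples:
    print(repr(s), is_printable_ascii(s))
```

fullmatch is used in place of the ^ and $ anchors so that the whole value must fit the class, which mirrors '^[ -~]+$' in the query.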

MySQL Regex for matching exactly 3 same chars but not 4 same chars within a larger string

I am trying to write a regex for mysql in PHP to find (at least one occurrence of) exactly 3 of the same characters in a row, but not 4 (or more) of the same.
Eg for "000" I want to find:
0//////0/00/ LS///////000
000////0/00/ LS//////////
0//////0/00/ LS////000///
0//////000// LS//////000/
0//////000// LS//00000000
but not:
0//////0000/ LS//////////
0//////0000/ LS//////////
0/////00000/ LS//////////
I have tried the code below which I thought would match 3 zeros preceded and followed by zero or more chars which are not 0, but this resulted in some rows with single 0's and some 000000's
REGEXP '[^0]*[0{3}][^0]*'
Many thanks.
If you plan to use a regex in MySQL, you cannot use lookarounds. Thus, you can use alternation with negated character class and anchors:
(^|[^0])0{3}([^0]|$)
See the regex demo
Explanation:
(^|[^0]) - a group matching either the start of string (^) or a character other than 0
0{3} - exactly 3 zeros
([^0]|$) - a group matching either a character other than 0 or the end of string ($).
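As a quick check of the alternation approach, the sketch below runs the same pattern over the sample rows from the question using Python's re module (a stand-in for MySQL's REGEXP, assumed here only for illustration). The names EXACTLY_THREE, SHOULD_MATCH, SHOULD_NOT_MATCH and has_exact_run are hypothetical.

```python
import re

# Three zeros bounded on each side by a non-zero character or a string edge.
EXACTLY_THREE = re.compile(r"(^|[^0])0{3}([^0]|$)")

# Rows from the question that should be found.
SHOULD_MATCH = [
    "0//////0/00/ LS///////000",
    "000////0/00/ LS//////////",
    "0//////0/00/ LS////000///",
    "0//////000// LS//////000/",
    "0//////000// LS//00000000",
]

# Rows from the question that should be skipped (runs of 4 or 5 zeros only).
SHOULD_NOT_MATCH = [
    "0//////0000/ LS//////////",
    "0/////00000/ LS//////////",
]


def has_exact_run(row):
    """True if the row contains at least one run of exactly three zeros."""
    return EXACTLY_THREE.search(row) is not None


for row in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(row, has_exact_run(row))
```

The last row in SHOULD_MATCH passes because of its first run of three zeros; the run of eight zeros later in the row does not disqualify it.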

MySQL: Regular expression that matches these criteria?

I need a regular expression in MySQL that matches:
a sentence that has a word
that begins with e
ends with k
can contain [deki] (doesn't have to include all)
is 3-4 characters long
For example, the following would match:
"My name is eik"
"the edik of doom"
"herp derp eidk derp"
MySQL's extended regular expressions are pretty powerful; see http://dev.mysql.com/doc/refman/5.6/en/regexp.html. You can write:
[[:<:]]e[deki]{1,2}k[[:>:]]
(boundary-at-start-of-word, plus e, plus one or two characters from [deki], plus k, plus boundary-at-end-of-word).
This assumes that your words are separated by punctuation or spaces or whatnot, and it doesn't require that your field contain multiple words.
(Hat-tip to Bohemian for giving an answer that used \b. MySQL doesn't support \b for word-boundaries, but it reminded me of things my original answer missed.)
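The pattern's logic can be tried against the example sentences in a regex engine that does support \b. The sketch below uses Python's re module, where \b stands in for both [[:<:]] and [[:>:]] (an assumption for illustration, not MySQL syntax); the names WORD and has_word are hypothetical.

```python
import re

# \b is Python's word boundary; it plays the role of [[:<:]] and [[:>:]].
WORD = re.compile(r"\be[deki]{1,2}k\b")


def has_word(sentence):
    """True if the sentence contains a 3-4 letter word e...k built from [deki]."""
    return WORD.search(sentence) is not None


for s in ["My name is eik", "the edik of doom", "herp derp eidk derp",
          "one week later", "just ek here"]:
    print(s, has_word(s))
```

"week" is rejected because neither e starts a word, and "ek" is rejected because {1,2} requires at least one letter between e and k.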
Try this:
WHERE column RLIKE '[[:<:]]e[deki]{1,2}k[[:>:]]'
Edited:
MySQL doesn't support \b; instead it uses [[:<:]] and [[:>:]] for start and end of word respectively. Regex updated accordingly.

How can I query for text containing Asian-language characters in MySQL?

I have a MySQL table using the UTF-8 character set with a single column called WORDS of type longtext. Values in this column are typed in by users and are a few thousand characters long.
There are two types of rows in this table:
In some rows, the WORDS value has been composed by English speakers and contains only characters used in ordinary English writing. (Not all are necessarily ASCII, e.g. the euro symbol may appear in some cases.)
Other rows have WORDS values written by speakers of Asian languages (Korean, Chinese, Japanese, and possibly others), which include a mix of English words and words in the Asian languages using their native logographic characters (and not, for example, Japanese romaji).
How can I write a query that will return all the rows of type 2 and no rows of type 1? Alternatively, if that's hard, is there a way to query most such rows (here it's OK if I miss a few rows of type 2, or include a few false positives of type 1)?
Update: Comments below suggest I might do better to avoid the MySQL query engine altogether, as its regex support for unicode doesn't sound too good. If that's true, I could extract the data into a file (using mysql -B -e "some SQL here" > extract.txt) and then use perl or similar on the file. An answer using this method would be OK (but not as good as a native MySQL one!)
In theory you could do this:
Find the unicode ranges that you want to test for.
Manually encode the start and end into UTF-8.
Use the first byte of each of the encoded start and end as a range for a REGEXP.
I believe that the CJK range is far enough removed from things like the euro symbol that the false positives and false negatives would be few or none.
Edit: We've now put theory into practice!
Step 1: Choose the character range. I suggest \u3000-\u9fff; easy to test for, and should give us near-perfect results.
Step 2: Encode into bytes. (Wikipedia utf-8 page)
For our chosen range, utf-8 encoded values will always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.
Thus, we want to match bytes in the range 11100011 to 11101001, or 0xe3 to 0xe9.
Step 3: Make our regexp using the very handy (and just now discovered by me) UNHEX function.
SELECT * FROM `mydata`
WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
Just tried it out. Works like a charm. :)
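The byte arithmetic in steps 2 and 3 can be verified with a short sketch. It uses Python instead of MySQL (an assumption for illustration), encoding the range endpoints to UTF-8 and then searching the encoded bytes for a lead byte between 0xe3 and 0xe9. The names first_byte, CJK_LEAD_BYTE and looks_cjk are hypothetical.

```python
import re


def first_byte(ch):
    """Return the first byte of the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")[0]


# Byte-level class matching any UTF-8 lead byte from 0xe3 to 0xe9.
CJK_LEAD_BYTE = re.compile(b"[\xe3-\xe9]")


def looks_cjk(text):
    """True if the UTF-8 bytes of `text` contain a lead byte in 0xe3-0xe9."""
    return CJK_LEAD_BYTE.search(text.encode("utf-8")) is not None


print(hex(first_byte("\u3000")), hex(first_byte("\u9fff")))  # range endpoints
print(hex(first_byte("\u20ac")))                             # euro sign
print(looks_cjk("hello \u65e5\u672c\u8a9e"), looks_cjk("price: \u20ac5"))
```

The euro sign encodes with lead byte 0xe2, just below the range, so it is not a false positive. One limitation of the chosen range: Hangul syllables (U+AC00 to U+D7A3) encode with lead bytes 0xea to 0xed, so text written only in Hangul would need a wider range.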
You can also use the HEX value of the character. SELECT * FROM table WHERE <hex code>
Try it out with SELECT HEX(column) FROM table
This might help as well http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html