How to correctly specify a quantifier for a group of words? - mysql

Table has field containing the list of IDs separated by "-".
Example: 559-3319-3537-4345-29923
I need to check rows that use at least 4 of the specified identifiers using regex
Example: before inserting to the db, I need to check the value 559-3319-3537-29923-30762 for this condition.
I've build a pattern that only works in the specified order, but if the IDs are swapped, it doesn't work.
Template: ^.*\b(-*(559|3319|3537|29923|30762)-*){4,}\b.*$
Initially, I thought that a simple (559|3319|3537|29923|30762){4,} should be enough, but in this case it also doesn't work, although it sees all 4 values without a quantifier.
Please tell me how to write such an expression correctly.

For ease of reading/testing, I've simplified the Ids being searched for to single digit integers 1-5. The following pattern will match strings with at least 4 out of the 5 ids:
(\b(1|2|3|4|5)\b.*){4,}
(Play with this here)
OR MySQL's regex dialect:
([[:<:]](1|2|3|4|5)[[:>:]].*){4,}
(Play with MySQL version here)
Here are some examples:
#
Example
Is Match?
Description
1
1-2-3-4-5
YES
All the Ids
2
1-2-3-9-5
YES
Enough Ids
3
1-1-9-1-1
YES
Enough Ids, but there are repeats
4
9-8-7-6-0
NO
None of the Ids
5
1-2-3-9-9
NO
Some, but not enough of the Ids
If the repeated Ids as shown in example 3 are an issue, then regex is probably not a good fit for this problem.

Edit:
^.*\b((559|3319|3537|29923|30762)-?([0-9]*)?-?){4,}\b.*$
The reasoning behind this is that each group is not just one of the 5 numbers, but it can include some extra characters. So the matched groups in your example are:
(559-)
(3319-)
(3537-4345-)
(29923)
Original answer:
This would be one way to do it (not sure if there are other ways to do it):
^.*\b(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)\b.*$

Related

jq: groupby and nested json arrays

Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are/are not non-empty. ie want to get: [3,2]
The number of nested lists which contain or not contain number 3. ie want to get: [1,4]
The number of nested lists for which the sum of the elements is/isn't less than 4. ie want to get: [3,2]
ie basic examples of nested data partition.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists
"which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
reduce stream as $s ({}; .[$s] += 1);
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
QED
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.

Capture a value from a repeating group on every iteration (as opposed to just last occurrence)

How does one capture a value recursively with regex, where value is a part of a group that repeats?
I have a serialized array in mysql database
These are 3 examples of a serialized array
a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}
a:1:{i:0;s:2:"CA";}
a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}
a:1 stands for array:{number of elements}
then in between {} i:0 means element 0, i:1 means element 1 etc.
then the actual value s:2:"CA" means string with length of 2
so I have 2 elements in first array, 1 element in the second and 4 elements in the last
I have this data in mysql database and I DO NOT HAVE an option to parse this with back-end code - this has to be done in mysql (10.0.23-MariaDB-log)
the repeating pattern is inside of the curly braces
the number of repeats is variable (as in 3 examples each has a different number of repeating patterns),
the number of repeating patterns is defined by the number at 3rd position (if that helps)
for the first example it's a:2:
and so there are 2 repeating blocks:
i:0;s:2:"OR";
i:1;s:2:"WA";
I only care to extract the values in bold
So I came up with this regex
^a:(?:\d+):\{(?:i:(?:\d+);s:(?:\d+):\"(\w\w)\";)+}$
it captures the values I want all right but problem is it only captures the last one in each repeating group
so going back to the example what would be captured is
WA
CA
WA
What I would want is
OR|WA
CA
CA|ID|OR|WA
these are the language specific regex functions available to me:
https://mariadb.com/kb/en/library/regular-expressions-functions/
I don't care which one is used to solve the problem
Ultimately I need this in as sensible form that can be presented to the client e.g. CA,ID,OR or CA|ID|OR
Current thoughts are perhaps this isn't possible in a one liner, and I have to write a multi-step function where
extract the repeating portion between the curly braces
then somehow iterate over each repeating portion
then use the regex on each
then return the results as one string with separated elements
I doubt if such a capture is possible. However, this would probably do the job for your specific purpose.
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(str1, '^a:\\d+:\{', ''),
'i:\\d+;s:\\d+:\"(\\w\\w)\";',
'\\1,'
),
'\,?\}$',
''
)
Basically, this works with the input string (or column) str1 like
remove the first part
replace every cell with the string you want
remove the last 2 characters, ,}
and voila! You get a string CA,ID,OR.
Aftenote
It may or may not work well when the original array before serialised is empty (it depends how it is serialised).

How can I go about querying a database for non-similar, but almost matching items

How can I go about querying a database for items that are not only exactly similar to a sample, but also those that are almost similar? Almost as search engines work, but only for a small project, preferably in Java. For example:
String sample = "Sample";
I would like to retrieve all the following whenever I query sample:
String exactMatch = "Sample";
String nonExactMatch = "S amp le";
String nonExactMatch_2 = "ampls";
You need to define what similar means in terms that your database can understand.
Some possibilities include Levenshtein distance, for example.
In your example, sample matches...
..."Sample", if you search without case sensitivity.
..."S amp le", if you remove a set of ignored characters (here space only) from both the query string and the target string. You can store the new value in the database:
ActualValue SearchFor
John Q. Smith johnqsmith%
When someone searches for "John Q. Smith, Esq." you can boil it down to johnqsmithesq and run
WHERE 'johnqsmithesq' LIKE SearchFor
"ampls" is more tricky. Why is it that 'ampls' is matched by 'sample'? A common substring? A number of shared letters? Does their order count (i.e. are anagrams valid)? Many approaches are possible, but it is you who must decide. You might use Levenshtein distance, or maybe store a string such as "100020010003..." where every digit encodes the number of letters you have, up to 9 (so 3 C's and 2 B's but no A's would give "023...") and then run the Levenshtein distance between this syndrome and the one from each term in the DB:
ActualValue Search1 Rhymes abcdefghij_Contains anagramOf
John Q. Smith johnqsmith% ith 0000000211011... hhijmnoqst
...and so on.
One approach is to ask oneself, how must I transform both searched value and value searched for, so that they match?, and then proceed and implement that in code.
You can use match_against in myisam full text indexes columns.

How to compile a complete list of MySQL "Words"

Really getting into MySQL and one thought I've had on mastering one aspect of it is to gather a complete listing of MySQL words. One example of this might be the Reserved Words list, though it appears that's not a complete list; example: CONCAT, CRC32, etc.
Bizarre as it may seem, I was thinking that such a list might exist, or that there might even be a query that would yield it, and/or a way to extract it from the source code of MySQL.
It is a non-scientific method, but what I would do is:
extract all strings from Native_func_registry func_array. Lookup for it sql/item_create.cc , e.g in
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/item_create.cc
Those should cover builtin functions.
extract strings from 'symbols' and 'functions' in lexer :
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/lex.h
extract symbols from bison input http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/sql_yacc.yy from lines
%token SOMETOKEN
except when tokens have _SYM suffix (they are covered by sql/lex.h)
Combine all of those, and the resulting set might come near :)

MySQL regexp for matching list of numbers in a string

I have several rows with strings similar to this: 1,19|11|14,2
The info I want to look at is: 19|11|14 (which is a list of the numbers, 19 11 and 14)
This info should be matched to see if any of the numbers are in the range between 8 and 13
What's the most effective way to accomplish this? I tried using a regexp like: [^0-9]*(8|9|10|11|12|13)[^0-9]*
But this would also match the number 9 which is actually 19.
Other methods for parsing the string is also welcomed, only functions available in MySQL 5.0 can be used.
From what I remember MySQLs Regex support is very simplistic I'm not sure how possible this will actually be. I don't believe it supports word boundaries or look around assertions. How about this...
(^|[^0-9])(8|9|10|11|12|13)([^0-9]|$)