Capture a value from a repeating group on every iteration (as opposed to just last occurrence) - mysql

How does one capture a value recursively with regex, where value is a part of a group that repeats?
I have a serialized array in mysql database
These are 3 examples of a serialized array
a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}
a:1:{i:0;s:2:"CA";}
a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}
a:1 stands for array:{number of elements}
then in between {} i:0 means element 0, i:1 means element 1 etc.
then the actual value s:2:"CA" means string with length of 2
so I have 2 elements in first array, 1 element in the second and 4 elements in the last
I have this data in mysql database and I DO NOT HAVE an option to parse this with back-end code - this has to be done in mysql (10.0.23-MariaDB-log)
the repeating pattern is inside of the curly braces
the number of repeats is variable (as in 3 examples each has a different number of repeating patterns),
the number of repeating patterns is defined by the number at 3rd position (if that helps)
for the first example it's a:2:
and so there are 2 repeating blocks:
i:0;s:2:"OR";
i:1;s:2:"WA";
I only care to extract the two-letter values inside the quotes (OR and WA in this example)
So I came up with this regex
^a:(?:\d+):\{(?:i:(?:\d+);s:(?:\d+):\"(\w\w)\";)+}$
it captures the values I want all right but problem is it only captures the last one in each repeating group
so going back to the example what would be captured is
WA
CA
WA
What I would want is
OR|WA
CA
CA|ID|OR|WA
these are the language specific regex functions available to me:
https://mariadb.com/kb/en/library/regular-expressions-functions/
I don't care which one is used to solve the problem
Ultimately I need this in a sensible form that can be presented to the client, e.g. CA,ID,OR or CA|ID|OR
Current thoughts are perhaps this isn't possible in a one liner, and I have to write a multi-step function where
extract the repeating portion between the curly braces
then somehow iterate over each repeating portion
then use the regex on each
then return the results as one string with separated elements

I doubt if such a capture is possible. However, this would probably do the job for your specific purpose.
REGEXP_REPLACE(
  REGEXP_REPLACE(
    REGEXP_REPLACE(str1, '^a:\\d+:\\{', ''),
    'i:\\d+;s:\\d+:"(\\w\\w)";',
    '\\1,'
  ),
  ',?\\}$',
  ''
)
Basically, this works on the input string (or column) str1 as follows:
remove the first part
replace every cell with the string you want
remove the last 2 characters, ,}
and voila! You get a string CA,ID,OR.
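For readers who want to try the three-step idea outside the database, here is a minimal Python sketch of the same replacements. It assumes Python's re module behaves like MariaDB's PCRE for these simple patterns; the function name state_list is made up for illustration and is not part of the original answer.

```python
import re

def state_list(serialized: str) -> str:
    """Mimic the three nested REGEXP_REPLACE calls on one serialized array."""
    # Step 1: remove the leading 'a:<count>:{' part.
    body = re.sub(r'^a:\d+:\{', '', serialized)
    # Step 2: replace every 'i:<n>;s:<len>:"XX";' cell with 'XX,'.
    body = re.sub(r'i:\d+;s:\d+:"(\w\w)";', r'\1,', body)
    # Step 3: remove the trailing ',}' (the comma is optional).
    return re.sub(r',?\}$', '', body)

for sample in ('a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}',
               'a:1:{i:0;s:2:"CA";}',
               'a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}'):
    print(state_list(sample))
```

Because the second replacement is global, every repeated cell is rewritten, which sidesteps the "only the last capture survives" limitation described in the question.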
Afternote
It may or may not work well when the original array was empty before it was serialised (it depends on how the empty array is serialised).

Related

How to correctly specify a quantifier for a group of words?

Table has field containing the list of IDs separated by "-".
Example: 559-3319-3537-4345-29923
I need to check rows that use at least 4 of the specified identifiers using regex
Example: before inserting to the db, I need to check the value 559-3319-3537-29923-30762 for this condition.
I've built a pattern that only works in the specified order, but if the IDs are swapped, it doesn't work.
Template: ^.*\b(-*(559|3319|3537|29923|30762)-*){4,}\b.*$
Initially, I thought that a simple (559|3319|3537|29923|30762){4,} should be enough, but in this case it also doesn't work, although it sees all 4 values without a quantifier.
Please tell me how to write such an expression correctly.
For ease of reading/testing, I've simplified the Ids being searched for to single digit integers 1-5. The following pattern will match strings with at least 4 out of the 5 ids:
(\b(1|2|3|4|5)\b.*){4,}
Or, in MySQL's regex dialect:
([[:<:]](1|2|3|4|5)[[:>:]].*){4,}
Here are some examples:
#  Example    Is Match?  Description
1  1-2-3-4-5  YES        All the Ids
2  1-2-3-9-5  YES        Enough Ids
3  1-1-9-1-1  YES        Enough Ids, but there are repeats
4  9-8-7-6-0  NO         None of the Ids
5  1-2-3-9-9  NO         Some, but not enough of the Ids
If the repeated Ids as shown in example 3 are an issue, then regex is probably not a good fit for this problem.
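To check the examples above, and to show one non-regex way of handling repeats, here is a small Python sketch. It assumes Python's re treats \b the same way as the first pattern; the helper names has_four_ids and has_four_distinct_ids are hypothetical and the set-based check is my own suggestion, not part of the original answer.

```python
import re

# Same pattern as above, with the Ids simplified to 1-5.
ID_PATTERN = re.compile(r'(\b(1|2|3|4|5)\b.*){4,}')

def has_four_ids(value: str) -> bool:
    """True if at least four of the wanted Ids occur (repeats are counted)."""
    return ID_PATTERN.search(value) is not None

def has_four_distinct_ids(value: str,
                          wanted=frozenset({"1", "2", "3", "4", "5"})) -> bool:
    """True if at least four *different* wanted Ids occur."""
    return len(wanted & set(value.split("-"))) >= 4

for example in ("1-2-3-4-5", "1-2-3-9-5", "1-1-9-1-1", "9-8-7-6-0", "1-2-3-9-9"):
    print(example, has_four_ids(example), has_four_distinct_ids(example))
```

The two functions only disagree on example 3, where the same Id is repeated four times.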
Edit:
^.*\b((559|3319|3537|29923|30762)-?([0-9]*)?-?){4,}\b.*$
The reasoning behind this is that each group is not just one of the 5 numbers, but it can include some extra characters. So the matched groups in your example are:
(559-)
(3319-)
(3537-4345-)
(29923)
Original answer:
This would be one way to do it (not sure if there are other ways to do it):
^.*\b(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)[0-9-]*(559|3319|3537|29923|30762)\b.*$

NetSuite Saved Search: REGEXP_SUBSTR Pattern troubles

I am trying to break down a string that looks like this:
|5~13~3.750~159.75~66.563~P20~~~~Bundle A~~|
Here is a second example for reference:
|106~10~0~120~1060.000~~~~~~~|
Here is a third example of a static sized item:
|3~~~~~~~~~~~5:12|
Example 4:
|3~23~5~281~70.250~upper r~~~~~~|
|8~22~6~270~180.000~center~~~~~~|
|16~22~1~265~353.333~center~~~~~~|
Sometimes there are multiple lines in the same string.
I am not super familiar with setting up patterns for regexp_substr and would love some assistance with this!
The string will always have '|' at the beginning and end and 11 '~'s used to separate the numeric/text values which I am hoping to obtain. Also some of the numeric characters have decimals while others do not. If it helps the values are separated like so:
|Quantity~ Feet~ Inch~ Unit inches~ Total feet~ Piece mark~ Punch Pattern~ Notch~ Punch~ Bundling~ Radius~ Pitch|
As you can see, if a value isn't specified it shows as blank, though it may be present in another string. It's rare for all of the values to have data.
For this specific case I believe regexp_substr will be my best option but if someone has another suggestion I'd be happy to give it a shot!
This is the formula(Text) I was able to come up with so far:
REGEXP_SUBSTR({custbody_msm_cut_list},'[[:alnum:]. ]+|$',1,1)
This allows me to pull all the matches held in the strings, but if some fields are excluded it makes presenting the correct data difficult.
TRIM(REGEXP_SUBSTR({custbody_msm_cut_list}, '^\|(([^~]*)~){1}',1,1,'i',2))
From the start of the string, match the pipe character |, then match anything except a tilde ~, then match the tilde. Repeat N times {1}. Return the last of these repeats.
You can control how many tildes are processed by the integer in the braces {1}
EG:
TRIM(REGEXP_SUBSTR('|Quantity~ Feet~ Inch~ Unit inches~ Total feet~ Piece mark~ Punch Pattern~ Notch~ Punch~ Bundling~ Radius~ Pitch|', '^\|(([^~]*)~){1}',1,1,'i',2))
returns "Quantity"
TRIM(REGEXP_SUBSTR('|Quantity~ Feet~ Inch~~~ Piece mark~ Punch Pattern~ Notch~ Punch~ Bundling~ Radius~ Pitch|', '^\|(([^~]*)~){7}',1,1,'i',2))
returns "Punch Pattern"
The final value Pitch is a slightly special case as it is not followed by a tilde:
TRIM(REGEXP_SUBSTR('|~~~~~~~~~~ Radius~ Pitch|', '^\|(([^~]*)~){11}([^\|]*)',1,1,'i',3))
Adapted and improved from https://stackoverflow.com/a/70264782/7885772
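The "repeat N times and return the last repeat" idea can be tried outside NetSuite with a short Python sketch. This assumes Python's re keeps the last iteration of a repeated group, as the Oracle-style REGEXP_SUBSTR above does; the function name cut_list_field is made up for illustration.

```python
import re

def cut_list_field(line: str, n: int) -> str:
    """Return field n (1-based, 1..12) of a '|a~b~...~l|' record, trimmed."""
    if 1 <= n <= 11:
        # Repeat '<field>~' n times; group 2 holds the last repeat.
        match = re.match(r'^\|(([^~]*)~){%d}' % n, line)
        group = 2
    elif n == 12:
        # The final value is followed by '|' rather than '~'.
        match = re.match(r'^\|(([^~]*)~){11}([^|]*)', line)
        group = 3
    else:
        raise ValueError("n must be between 1 and 12")
    return match.group(group).strip() if match else ""

header = ('|Quantity~ Feet~ Inch~ Unit inches~ Total feet~ Piece mark~ '
          'Punch Pattern~ Notch~ Punch~ Bundling~ Radius~ Pitch|')
print(cut_list_field(header, 1))
print(cut_list_field(header, 7))
print(cut_list_field(header, 12))
```

Empty fields come back as empty strings, so positions stay aligned even when values are missing. The sketch only looks at the first record of a multi-line string.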

Extract tokens from grammar

I have been working through the Advent of Code problems in Perl6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 # 1,3: 4x4 and this grammar that I created:
grammar Claim {
token TOP {
'#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
}
token digits {
<digit>+
}
token id {
<digits>
}
token coordinates {
<digits> ',' <digits>
}
token dimensions {
<digits> 'x' <digits>
}
}
say Claim.parse('#1 # 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
'#' $<id>=(\d+) ' '
'# ' $<x>=(\d+) ',' $<y>=(\d+)
': ' $<w>=(\d+) x $<d>=(\d+)
$
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
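For comparison only, the same "one readable regex with named top-level captures" idea can be sketched in Python, which is not the language of this question. This is a rough analogue, not Perl 6 semantics; the names CLAIM and parse_claim are made up for illustration.

```python
import re

# Verbose mode ignores layout whitespace, so '[ ]' is used for a literal
# space and '\#' for a literal hash sign.
CLAIM = re.compile(r"""
    ^
    \# (?P<id>\d+) [ ]
    \# [ ] (?P<x>\d+) , (?P<y>\d+)
    : [ ] (?P<w>\d+) x (?P<d>\d+)
    $
""", re.VERBOSE)

def parse_claim(line: str):
    """Return the five named captures as ints, or None if the line doesn't match."""
    match = CLAIM.match(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}

print(parse_claim('#1 # 1,3: 4x4'))
```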
With that out the way, your question remains perfectly cromulent as it is because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
token TOP {
'#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
{ $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
}
...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. This can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability you typically should not insert code inline at the end of rules; instead, use Action objects.
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int) or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int
to cast them to Ints
For the id field it's even easier you can just cast the <id> match to an Int :
say $match.<id>.Int

How to remove string that is no longer needed in MySQL database?

This seemed like it should be very simple to do yet I've not been able to find an answer after weeks of looking.
I'm trying to remove strings that are no longer needed. Regex_replace sounds perfect but is not available in MySQL.
In MySQL how would I accomplish changing this:
[quote=ABC;xxxxxx]
to this:
[quote=ABC]
The issues are:
- this can appear anywhere in a text blob
- the xxxxxx can only be numeric but may be 6, 7 or 8 characters long
- not adding/removing any rows, just rewriting the contents of one column on one row at a time.
Thanks.
I don't think you really need REGEX_Replace (though it would make things easier of course).
Assuming that the example you presented is a real reflection of what you have:
Your starting point is with the string [quote=<something>;, meaning that you can start searching for [quote=,
Once you found it, you need to search for ; and after that for ],
Once you found them both, you know what to extract and where to start the next search (if the pattern you mentioned can appear more than once within a single blob).
Did I get you correctly?
EDIT
This paradigm is aimed to convert all instances of [quote=ABC;xxxxxx] to [quote=ABC] under the following assumptions:
The pattern can appear any number of times within the input string,
The length of xxxxxx is not fixed,
The resulting string (after removing all the appearances of ;xxxxxx) should replace the value in the table,
Performance is not an issue since either this is going to be a one-time job (through the whole table) or it will run every time on a single string (e.g. before INSERTing a new record).
Some MySQL functions that will be used:
INSTR: Searches within a string for the first appearance of a sub-string and returns the position (offset) where the sub-string was found; the related LOCATE(substr, str, pos) does the same but starts the search from a given position,
SUBSTR: Returns a substring from a string (several ways to use it),
CONCAT: Concatenates two or more strings.
The guidelines presented here apply for the manipulation of a single INPUT string. If this needs to be used over, say, a whole table, simply get the strings into a CURSOR and loop.
Here are the steps:
Declare five INT local variables to serve as indices and the total input string length, say L_Start, L_UpTo, l_Total_Length, l_temp_1 and l_temp_2, setting the initial values L_Start = 1 and l_Total_Length = LENGTH(INPUT_String),
Declare a string variable into which you will copy the "cleaned" result and initialize it as '', say l_Output_str; also declare a temporary string to hold the value of 'ABC', say l_Quote,
Start an infinite loop (you will set the exit conditions within it; see below),
Exit the loop if L_Start > l_Total_Length (here is one of the two exit points from the loop),
Find the first location of '[quote=' within the input string starting from L_Start and save the value in L_UpTo,
If the returned value is 0 (i.e. substring not found), concatenate the current contents of l_Output_str with whatever remains of the input string from position L_Start (e.g. SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, L_Start));) and exit the loop (second exit point),
Search the input string for the ; symbol starting from L_UpTo + 7 (7 being the length of [quote=) and save the value in l_temp_1,
Search the input string for the ] symbol starting from l_temp_1 and save the value in l_temp_2,
Save the quoted name as SET l_Quote = SUBSTR(INPUT_String, L_UpTo + 7, l_temp_1 - L_UpTo - 7); then add the text preceding the tag and the cleaned tag to the output string as SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, L_Start, L_UpTo - L_Start), '[quote=', l_Quote, ']');,
Set L_Start = l_temp_2 + 1;
End of loop.
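The find-and-slice loop described in these steps can be prototyped in Python before porting it to a stored procedure. This is only a sketch under a few assumptions of mine: offsets are 0-based here (MySQL's are 1-based), everything between the ; and the closing ] is dropped without checking that it is numeric, and a tag without a ; is copied unchanged. The name strip_quote_ids is hypothetical.

```python
def strip_quote_ids(text: str) -> str:
    """Rewrite every '[quote=NAME;digits]' as '[quote=NAME]' using find and slices."""
    tag = "[quote="
    out = []
    start = 0
    while start < len(text):
        tag_pos = text.find(tag, start)
        if tag_pos == -1:
            # No more tags: copy the remainder and stop.
            out.append(text[start:])
            break
        close_pos = text.find("]", tag_pos)
        if close_pos == -1:
            # Unterminated tag: copy the remainder unchanged and stop.
            out.append(text[start:])
            break
        # Look for ';' only inside this tag.
        semi_pos = text.find(";", tag_pos + len(tag), close_pos)
        name_end = semi_pos if semi_pos != -1 else close_pos
        # Copy everything up to the end of NAME, then close the tag.
        out.append(text[start:name_end])
        out.append("]")
        start = close_pos + 1
    return "".join(out)

print(strip_quote_ids("ab[quote=ABC;123456]cd"))
```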
Notes:
As I neither wrote the code nor tested it, it is possible that I'm not setting the indices correctly; you will need to perform detailed tests to get it working as needed;
The above IS the method I suggested;
If the input string is very long (many MBs), you might observe poor performance (i.e. it might take a few seconds to complete) because of the concatenations. There are some steps that can be taken to improve performance, but let's have this working first and then, if needed, tackle the performance issues.
Hope that the above is clear and comprehensive.

Get Unique String from a longer string when the unique string is at 2 different locations

In a web application that I am creating tests for, there are 2 sets of strings from which I wish to get a substring (which is unique) to use for identifying that element on the Web Page:
Parent Form:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-bms_9999999_3512-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109-bMs-textField-bMs-ABNylGGXXu8IPwjI4jMM5y1K
SubForm:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVXJ-bMs-bms_FK_9999999_406_ID-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_177-bMs-searchLookupField-bMs-ABNylGGXXu8IPwjI4jMM5y1K-bMs-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-PRIMARY9999999_480-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109
I wish to get the substring from both of these using a single function, so that I don't have to create a different function for each type I encounter:
Substring in the above 2 provided strings is:
ABNylGGXXu8IPwjI4jMM5y1K
This substring can change for each element on the web page, but is unique for each element of the page and so useful to identify.
I cannot use the full string, as it changes for each environment or if I generate a new environment to host the web pages (the complete string depends on the Meta Data).
We tried doing it for the Parent Form, by using the "-" as the delimiter and identifying the last -bMs- and then taking the string, but that does not work for the SubForm.
So, my main question is, is there some RegEx that can be created to extract only that string (composed of upper- and lower-case letters and numbers) from the full string? Or some other simpler way to identify that string?
You could try a combination of positive Lookbehind, [A-Z] and [a-z]. Try this code:
(?<=-bMs-)[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/1
It seems to work without even the positive Lookbehind
[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/2
If you're happy to base the selection of the elements on the previous one, then this might work for you:
(?<=searchLookupField-bMs-|textField-bMs-)\w+
And if you wanted to be extra certain, you could append a lookahead to the end:
(?<=searchLookupField-bMs-|textField-bMs-)\w+(?=-bMs-|$)
If these don't work, or if the whole string varies greatly, then some more examples would help us narrow it down and come up with a great answer!
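As a quick way to try the last pattern, here is a Python sketch. Python's re only allows fixed-width lookbehinds, so the single lookbehind with two alternatives is split into two separate lookbehinds; that rewrite is my adaptation, not part of the answer. The sample strings are abbreviated versions of the Parent Form and SubForm values, and extract_key is a made-up name.

```python
import re

# Two fixed-width lookbehinds stand in for
# (?<=searchLookupField-bMs-|textField-bMs-), which Python's re rejects.
KEY_PATTERN = re.compile(
    r'(?:(?<=searchLookupField-bMs-)|(?<=textField-bMs-))\w+(?=-bMs-|$)'
)

def extract_key(element_id: str):
    """Return the unique token following the field-type marker, or None."""
    match = KEY_PATTERN.search(element_id)
    return match.group(0) if match else None

# Abbreviated sample ids (assumed shapes, shortened from the question).
parent = "InputText-bMs-bms_9999999_109-bMs-textField-bMs-ABNylGGXXu8IPwjI4jMM5y1K"
subform = ("InputText-bMs-bms_9999999_177-bMs-searchLookupField-bMs-"
           "ABNylGGXXu8IPwjI4jMM5y1K-bMs-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-PRIMARY9999999_480")
print(extract_key(parent))
print(extract_key(subform))
```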