How can I determine which regular expressions from a list possibly overlap - mysql

I have a table of regular expressions that are in an MySQL table that I match text against.
Is there a way, using MySQL or any other language (preferably Perl) that I can take this list of expressions and determine which of them MAY overlap. This should be independent of whatever text may be supplied to the expressions.
All of the expression have anchors.
Here is an example of what I am trying to get:
Expressions:
^a$
^b$
^ab
^b.*c
^batch
^catch
Result:
'^b.*c' and '^batch' MAY overlap
Thoughts?
Thanks,
Scott
Further explanation:
I have a list of user-created regexes and an imported list of strings that are to be matched against the regexes. In this case the strings are "clean" data (ie they are not user-created but imported from another source - they must not change).
When a user adds to the list of regexes I do not want any collisions on either the existing list of strings nor any future strings (which can not be guessed ahead of time - the only constraints being they are ASCII printable characters no longer than 255 characters).
A brute-force method would be to create a "rainbow" table of all of the permutations of strings and each time a regex is added run all of the regexes against the rainbow table. However I'd like to avoid this (I'm not even sure of the cost) and so was wondering aloud as to the possibility of an algorithm that would AT LEAST show which regexes in a list MAY collide.

I will punt on full REs. Even limiting to BREs and/or MySQL-pre-8.0 will be challenging. Here are some thoughts.
If end-anchored and no + or *, the calculate the length. The fixed-length can be used as a discriminator. Also, it could be used for toning back the "brute force" by perhaps an order of magnitude.
Anything followed by + or * gets turned into .* for simplicity. (Re the "may collide" rule.)
Any RE with explicit characters (including those followed by +) becomes a discriminator in some situations. Eg, ^a.*b$ vs ^a.*c$.
For those anchored at the end, reverse the pattern and test it that way. (I don't know how difficult reversing is.)
If you can say that a particular character must be at any position, then use it as a discriminator: ^a.b.*c$ -- a in pos 1; b in pos 3; c at end. Perhaps this can be extended to character classes: ^\w may match, but ^\d and ^a.*\d$ can't.

Related

Finding the set of wildcards for representing a binary interval

In a switch flow table there is a field called match field, where a list of matching condition is maintained. In the match fields, binary sequences with wildcards (*, a bit of wildcard character means that it could be either 1 or 0 in this bit) are used to represent some matching conditions.
For example, we can use the following binary sequences with wildcards to represent the matching condition '38798 <= Port <= 56637':
100101111000111*
100101111**1****
100101111*1*****
1001011111******
10011***********
101*************
110111010011110*
1101110100***0**
1101110100**0***
1101110100*0****
11011101000*****
11011100********
110**0**********
110*0***********
1100************
Does anyone know (or suggests) a way to obtain such sequence? So far, I used a brute force strategy (getting all the possible combinations for the intervals), but it is not computationally feasible (memory explosion) and has a problem of redundancy (wildcards representing the same interval). So now I'm trying to use range splitting and grid-of-tries to get a solution, but without success yet.

Pattern matching with associative and commutative operators

Pattern matching (as found in e.g. Prolog, the ML family languages and various expert system shells) normally operates by matching a query against data element by element in strict order.
In domains like automated theorem proving, however, there is a requirement to take into account that some operators are associative and commutative. Suppose we have data
A or B or C
and query
C or $X
Going by surface syntax this doesn't match, but logically it should match with $X bound to A or B because or is associative and commutative.
Is there any existing system, in any language, that does this sort of thing?
Associative-Commutative pattern matching has been around since 1981 and earlier, and is still a hot topic today.
There are lots of systems that implement this idea and make it useful; it means you can avoid write complicated pattern matches when associtivity or commutativity could be used to make the pattern match. Yes, it can be expensive; better the pattern matcher do this automatically, than you do it badly by hand.
You can see an example in a rewrite system for algebra and simple calculus implemented using our program transformation system. In this example, the symbolic language to be processed is defined by grammar rules, and those rules that have A-C properties are marked. Rewrites on trees produced by parsing the symbolic language are automatically extended to match.
The maude term rewriter implements associative and commutative pattern matching.
http://maude.cs.uiuc.edu/
I've never encountered such a thing, and I just had a more detailed look.
There is a sound computational reason for not implementing this by default - one has to essentially generate all combinations of the input before pattern matching, or you have to generate the full cross-product worth of match clauses.
I suspect that the usual way to implement this would be to simply write both patterns (in the binary case), i.e., have patterns for both C or $X and $X or C.
Depending on the underlying organisation of data (it's usually tuples), this pattern matching would involve rearranging the order of tuple elements, which would be weird (particularly in a strongly typed environment!). If it's lists instead, then you're on even shakier ground.
Incidentally, I suspect that the operation you fundamentally want is disjoint union patterns on sets, e.g.:
foo (Or ({C} disjointUnion {X})) = ...
The only programming environment I've seen that deals with sets in any detail would be Isabelle/HOL, and I'm still not sure that you can construct pattern matches over them.
EDIT: It looks like Isabelle's function functionality (rather than fun) will let you define complex non-constructor patterns, except then you have to prove that they are used consistently, and you can't use the code generator anymore.
EDIT 2: The way I implemented similar functionality over n commutative, associative and transitive operators was this:
My terms were of the form A | B | C | D, while queries were of the form B | C | $X, where $X was permitted to match zero or more things. I pre-sorted these using lexographic ordering, so that variables always occurred in the last position.
First, you construct all pairwise matches, ignoring variables for now, and recording those that match according to your rules.
{ (B,B), (C,C) }
If you treat this as a bipartite graph, then you are essentially doing a perfect marriage problem. There exist fast algorithms for finding these.
Assuming you find one, then you gather up everything that does not appear on the left-hand side of your relation (in this example, A and D), and you stuff them into the variable $X, and your match is complete. Obviously you can fail at any stage here, but this will mostly happen if there is no variable free on the RHS, or if there exists a constructor on the LHS that is not matched by anything (preventing you from finding a perfect match).
Sorry if this is a bit muddled. It's been a while since I wrote this code, but I hope this helps you, even a little bit!
For the record, this might not be a good approach in all cases. I had very complex notions of 'match' on subterms (i.e., not simple equality), and so building sets or anything would not have worked. Maybe that'll work in your case though and you can compute disjoint unions directly.

On-the-fly parser/pre-generation space/time tradeoff considerations

Do the space-related benefits of using an on-the-fly parser outweigh the time-related benefits of a pre-generated lookup table?
Long version:
I am authoring a chemistry reference tool, and am including a feature that will automatically name formulae conforming to a specific pattern; e.g. C[n]H[2n+2] => [n]ane; where [n] is an integer for the LHS; and an index into an array of names on the RHS. (meth, eth, …)
As far as I can see, this can be implemented in one of two ways:
I pre-generate a key/value dual lookup dictionary of formula <=> name pairs; either when the application starts (slower startup), or a static list which is published with the application (slower download).
Formulae are evaluated on the fly by a custom-built parser.
In approach 1. name => formula lookup becomes simpler by an order of magnitude; but the generator will, unless I want to ship dozens of megabytes of data with the application, have to have a preset, and fairly low, value for n.
Compounding this is the fact that formulae can have several terms; such as C[n]H[2n+1]OC[n']H[2n'+1]; and for each of these, the number of possible matches increases geometrically with n. Additionally, using this approach would eat RAM like nobody's business.
Approach 2. lets me support fairly large values of n using a fairly small lookup table, but makes name => formula lookup somewhat more complex. Compared to the pre-generation to file for shipping with the application, it also lets me correct errors in the generation logic without having to ship new data files.
This also requires that each formula be matched against a cursory test for several rules, determining if it could fit; which, if there are a lot of rules, takes time that might lead to noticeable slowdowns in the interface.
The question then, is:
Are there any considerations in the tradeoff I have failed to account for, or approaches that I haven't considered?
Do the benefits of using an on-the-fly parser justify the increased implementation complexity?
You should go with the second approach.
One possible solution is a greedy algorithm. Define your set of transformations as a regular expression (used to test the pattern) and a function which is given the regexp match object and returns the transformed string.
Regular expressions aren't quite powerful enough to handle what you want directly. Instead you'll have to do something like:
m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
C_count, H_count = int(m.group(1)), int(m.group(2))
match_size = len(m.group(0))
if C_count*2+2 == H_count:
replacement = alkane_lookup[C_count]
elif C_count*2 == H_count:
replacement = alkene_lookup[C_count]
...
else:
replacement = m.group(0) # no replacement available
(plus a lot more for the other possibilities)
then embed that in a loop which looks like:
formula = "...."
new_formula = ""
while formula:
match_size, replacement = find_replacement(formula)
new_formula += replacement
formula = formula[match_size:]
(You'll need to handle the case where nothing matches. One possible way is to include a list of all possible elements at the end of find_replacement(), which only returns the next element and counts.)
This is a greedy algorithm, which doesn't guarantee the smallest solution. That's more complicated, but since chemists themselves have different ideas of the right form, I wouldn't worry so much about it.

What is a 'value' in the context of programming?

Can you suggest a precise definition for a 'value' within the context of programming without reference to specific encoding techniques or particular languages or architectures?
[Previous question text, for discussion reference: "What is value in programming? How to define this word precisely?"]
I just happened to be glancing through Pierce's "Types and Programming Languages" - he slips a reasonably precise definition of "value" in a programming context into the text:
[...] defines a subset of terms, called values, that are possible final results of evaluation
This seems like a rather tidy definition - i.e., we take the set of all possible terms, and the ones that can possibly be left over after all evaluation has taken place are values.
Based on the ongoing comments about "bits" being an unacceptable definition, I think this one is a little better (although possibly still flawed):
A value is anything representable on a piece of possibly-infinite Turing machine tape.
Edit: I'm refining this some more.
A value is a member of the set of possible interpretations of any possibly-infinite sequence of symbols.
That is equivalent to the earlier definition based on Turing machine tape, but it actually generalises better.
Here, I'll take a shot: A value is a piece of stored information (in the information-theoretical sense) that can be manipulated by the computer.
(I won't say that a value has meaning; a random number in a register may have no meaning, but it's still a value.)
In short, a value is some assigned meaning to a variable (the object containing the value)
For example type=boolean; name=help; variable=a storage location; value=what is stored in that location;
Further break down:
X = 2; where X is a variable while 2 is the value stored in X.
Have you checked the article in wikipedia?
In computer science, a value is a sequence of bits that is interpreted according to some data type. It is possible for the same sequence of bits to have different values, depending on the type used to interpret its meaning. For instance, the value could be an integer or floating point value, or a string.
Read the Wiki
Value = Value is what we call the "contents" that was stored in the variable
Variables = containers for storing data values
Example: Think of a folder named "Movies"(Variables) and inside of it are it contents which are namely; Pirates of the Carribean, Fantastic Beast, and Lala land, (this in turn is what we now call it's Values )

How forgiving should form inputs be?

I went to my bank website the other day and entered my account number with a trailing space. An error message popped that said, "Account number must consist of numeric values only." I thought to myself, "Seriously?! You couldn't have just stripped the space for me?". If I were any less of a computer geek, I may even have thought, "What? There are only numbers in there!" (not being able to see space).
The Calculator that comes with Ubuntu on the other hand merrily accepts spaces and commas, but oddly doesn't like trailing dots (without any ensuing digits).
So, that begs the question. Exactly how forgiving should web forms be? I don't think trimming whitespace is too much to ask, but what about other integer fields?
Should they allow +/- signs?
How many spaces should be allowed between the sign and the number?
What about commas for thousands separators?
What about in other parts of the world where use dots instead?
What if they're in between every 4 digits instead of every 3?
What about hexidecimal and octal representations?
Scientific notation?
What if I accidentally hit the quote button when I'm trying to hit enter, should that be stripped too?
It would be very easy for me to strip out all non-digit characters, and that would be extremely forgiving, but what if the user made an actual mistake that affects the input and should have been caught, but now I've just stripped it out?
What about things like phone numbers (which have a huge variety of formats), postal codes, zip codes, credit card numbers, usernames, emails, URLs (should I assume http? What about .com while I'm at it?)?
Where do you draw the line?
For something as important as banking, I don't mind it complaining about my input, especially if the other option is mistakenly transferring a bucketload of money into some stranger's account instead of my wife's (because of a missing or incorrect digit for example).
A classic example is one of my banks which disallows monetary values unless they have ".99" at the end (where 9 can be any digit of course). The vast majority of things I do are for exact dollar amounts and it's a tad annoying to have to always enter 500.00 instead of just 500.
But I'll be happier about that the first time I avoid accidentally paying somebody $5072 instead of $50.72 just because I forgot the decimal point. Actually, that's pretty unlikely since it also asks for confirmation and I'm pretty anal in controlling my money :-)
Having said that, the general rule I try to follow is "be liberal in what you accept, be strict in what you produce".
This allows other software using my output to expect a limited range of possibilities (making their lives easier). But it makes my software more useful if it can handle simple misteaks.
You draw the line at the point where the computer is guessing at what the correct input should be.
For example, a license key input box I wrote once accepts spaces and dashes and both upper and lower case, even though internally the keys were without said spaces, dashes and were all upper case. I could do that, since I knew that none of the keys actually had spaces or dashes.
Your example about URLs is another good one. I've noticed that modern browsers (I'm using Chrome), when something like 'flowers' is typed into the address bar, it knows it should search for it since it's not a valid URL. If instead, I type 'st' it auto corrects (or auto-suggests) 'stackoverflow.com' since it's a bookmark.
A well-written input system will complain when it would otherwise be forced to guess what the correct input should be.
Numeric input:
Stripping non-digits seems reasonable to me, but the problem is conflicting decimal notation. Some regions expect , (comma) to denote the decimal separator, while others use . (period). Unless the input would likely be in other bases, I would only assume base 10. If it's reasonable to assume non-base 10 input (base-16 for color input, for example), I would go with standard conventions for denoting the bases: leading 0 means base 8, leading 0x means base 16.
String input:
This gets a lot more complicated. It mostly depends on what the input is actually meant to represent. A username should exclude characters that will cause trouble, but the meaning of 'cause trouble' will vary depending on the use of the application and the system itself. URLs have a concrete definition of what qualifies, but that definition is rather broad. Fortunately, many languages come with tools to discern URLs, without you having to code your own parsing (whether the language does it perfectly or not is another question).
In the end, it's really a case-by-case basis. I do like paxadiablo's general rule, though: Accept as much as you can, output only what you must.
It totally depends on how the data is going to be used.
If the input is a monetary amount, for a transaction for example, then the inputted variable should be normalised to a set of standards for sure.
If it's simply a case of a phone number, then it is unlikely the stored data will provide any functional sort of use so you can be more forgiving.
There is nothing wrong with forcing correct format to make displayed look nicer, but you have to balance user irritation with micro benefits.
Once you start collecting data you can scan through it and see what sort of patterns emerge, and you can auto strip off inputted format.
Where do you draw the line?
When the consequences of accepting "invalid" data outweigh the irritation of not accepting it.
Should they allow +/- signs?
If negative values are valid, then of course they should.
If not, then don't just silently strip minus signs, as it totally changes the meaning of the data. Stripping pluses is less of a problem.
What if [thousands separators are] in between every 4 digits instead of every 3?
In countries that use three-digit grouping, "1,0000" can be assumed to be a typo. But is it a typo for "10,000" or for "1,000"? I wouldn't dare guess, as a wrong guess could cost the user $9,000.
What about hexidecimal and octal
representations?
Unless you're running the search feature for unicode.org, I can't imagine why anyone would use hexidecimal in web form.
And "01234" is almost certainly intended to be 1234 instead of 668.
What about things like...credit card numbers
Please allow spaces or hyphens in credit card numbers. It's really annoying when I have to type an undelimited 16-digit number.
I think you're over reacting a little bit. If there's anything in the field that shouldn't be there, strip it. otherwise try to force the input into whatever format you want, and if it doesn't fit, reject it.
I would say "Accept anything but process only valid data".
Expect your users to behave like a computer noob. Validate the input data using regular expressions and other validators.
Search for standard regular expressions for urls, emails and stuff.
Throw in a regular exp like this "/(?:([a-zA-Z0-9][\s,]+))([a-zA-Z0-9]+)$/" for comma or space separated values. With minor tweaking this exp will work for any number of comma separated values.
The one that irritates me as a user is credit card numbers, conventionally these appear as groups of 4 digits with spaces separating them but the odd webform will only accept a single string of digits with no spaces and no indication that this is the format it's seeking. Similarly telephone numbers, humans often use spaces to improve clarity, webforms sometimes accept the spaces and sometimes don't.