Regex for finding possible words (scrabble-style) - mysql

I'm trying to create a "scrabble-solver" to run stress-tests on a scrabble-like game I'm developing. I have a database containing ~200.000 words and I'm now looking for a way to match the scrabble tiles given with the words in the database.
Example:
Given tiles: A, P, E, F, O, L, M
Result: APE, POLE, PALE, MOLE, PAL...
Is this possible by using a simple SELECT-statement with REGEXP? If possible I would also like to add letters on specific positions and be able to determine max/min length.
I hope this question made sense :)
I've been googling my eyes out but I can't seem to find what I'm looking for. Anyone got an idea?
Thanks! :)

It doesn't sound like a regex problem. I think you'll be better off simply creating all possible combinations of letters from the existing tiles and then running your SELECT statement with the IN clause. For example, with tiles:
A, P, E
your SELECT clause will be
SELECT word FROM words WHERE word IN ('APE', 'AEP', 'PAE' ,'PEA', 'EPA', 'EAP');
You'll get the list of valid words from your table.

A regex would not help you much in this case. You need to construct the possible words by yourself.
The problem is that you have a limited number of each possible letter and a regex cannot encode that information. If you had infinite supply of each letter, then you could use a regex like [APEFOI]*.
You will have to enumerate all the possible words yourself. The implementation would depend on the language your using, but your best bet might be a next_permutation function or better a function that enumerates all permutations. A simple (and slightly inefficient) implementation (in Python-like pseudocode) would be:
words = []
for permutation in permutations(letters): # enumerate all character orders
for i in range(1, len(permutation)): # enumerate all lengths of words
words.append(letters[:i]) # append to candidate set
At that point words will contain all the candidate words you would then use in a SELECT ... IN statement.
That isn't the most efficient approach, but should be practical enough to get you started.

Related

Is there a function to compare two strings using a custom homoglyphs list

I need a function that would compare two strings and outputs an edit distance like Levenshtein, but only if the characters are homoglyphs in cursives. I have a list of those homoglyphs so I could feed a custom list to this function.
Example
homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive they look-alike
compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1
compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2
Tory should give a false result since there's no way someone misread an M as a T. An A could be misread as an O so it can count as 1.
The scoring could be different, I just need to know that Mory is probably Mary not Tory and Morio is a little more likely to be Mario than Maria.
Do something like this exists?
The key to your problem can be thought of like an IQ word association question.
Sound Glyph
--------- = ----------
Homophone Homoglyphs
Now if you know that there is a way to find similar sounding words (homophone) then the same can be applied but instead of sounds change to glyphs (homoglyph).
The way to find similar sounding words is via Soundex (Sound Index).
So just do what Soundex does but instead of having a mapping from similar homophones use similar homoglyphs.
Once you convert each word (glyphs) input into a Glyphdex (Glyph Index) then you can compute the Levenshtein distance for the two Glyphdex.
Make sense?
If you are into cellular biology then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one 3 letter codon.
Note: Since the word glyhdex has been used prior to me writing this I can not say I coined that word, however the usage I currently find via Google (search) for the word are not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar sequence of glyphs I will take credit.

How can I determine which regular expressions from a list possibly overlap

I have a table of regular expressions that are in an MySQL table that I match text against.
Is there a way, using MySQL or any other language (preferably Perl) that I can take this list of expressions and determine which of them MAY overlap. This should be independent of whatever text may be supplied to the expressions.
All of the expression have anchors.
Here is an example of what I am trying to get:
Expressions:
^a$
^b$
^ab
^b.*c
^batch
^catch
Result:
'^b.*c' and '^batch' MAY overlap
Thoughts?
Thanks,
Scott
Further explanation:
I have a list of user-created regexes and an imported list of strings that are to be matched against the regexes. In this case the strings are "clean" data (ie they are not user-created but imported from another source - they must not change).
When a user adds to the list of regexes I do not want any collisions on either the existing list of strings nor any future strings (which can not be guessed ahead of time - the only constraints being they are ASCII printable characters no longer than 255 characters).
A brute-force method would be to create a "rainbow" table of all of the permutations of strings and each time a regex is added run all of the regexes against the rainbow table. However I'd like to avoid this (I'm not even sure of the cost) and so was wondering aloud as to the possibility of an algorithm that would AT LEAST show which regexes in a list MAY collide.
I will punt on full REs. Even limiting to BREs and/or MySQL-pre-8.0 will be challenging. Here are some thoughts.
If end-anchored and no + or *, the calculate the length. The fixed-length can be used as a discriminator. Also, it could be used for toning back the "brute force" by perhaps an order of magnitude.
Anything followed by + or * gets turned into .* for simplicity. (Re the "may collide" rule.)
Any RE with explicit characters (including those followed by +) becomes a discriminator in some situations. Eg, ^a.*b$ vs ^a.*c$.
For those anchored at the end, reverse the pattern and test it that way. (I don't know how difficult reversing is.)
If you can say that a particular character must be at any position, then use it as a discriminator: ^a.b.*c$ -- a in pos 1; b in pos 3; c at end. Perhaps this can be extended to character classes: ^\w may match, but ^\d and ^a.*\d$ can't.

AS3 function producing combinations of array, no duplicates

This sounds like a duplicate question, as there are several questions similar to this, but they don't specifically ask this (or I just haven't found it! :) )
I have an array, this one has two distinct elements, "a" and "b", and a length of four total elements:
var list:Array = ["a","a","b","b"];
I'm looking for all combinations, using all elements, no duplicates.
This should yield:
aabb
abab
abba
bbaa
baba
baab
Searching for a solution for this has given me results similar to these:
a,b,ab,ba,aab,abb,aba, etc
or
a a b b, a a b b, a a b b, etc
Mind you, the application that would ultimately use this function would have two distinct elements, "a" and "b", and a length of 50 total elements:
var list:Array = ["a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a","a","a","a","a","a",
"a","a","a","a","a",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b"]
...so a brute force solution like I used with aabb wouldn't be feasible.
Any help, especially using AS3 code, would be appreciated, even if it is simply pointing me to the right google search :)
Here is a JavaScript answer that might get you started: Permutations in JavaScript? (they're both EcmaScript implementations so converting to ActionScript should only require minor changes)
It doesn't handle the uniqueness requirement, but it might point you in the right direction.
However, there are a few things you might need to consider first. I don't think it will be feasible to pre-compute all unique permutations upfront.
Based on this answer about unique permutations it looks like there are 50! / 25! * 25! = 126,410,606,437,752 unique permutations for 25 a's and 25 b's.
To give an idea how large that number is: if each combination was 1 byte in memory (in practice it will be more than this) then that would be: 126410606437752 bytes = 126,410.6 gigabytes in memory.
Plus, the algorithm for generating the permutations has complexity O(n!) - so it might take far too long, separate to memory constraints, to generate the list of permutations.

Word Jumble Algorithm

Given a word jumble (i.e. ofbaor), what would be an approach to unscramble the letters to create a real word (i.e. foobar)? I could see this having a couple of approaches, and I think I know how I'd do it in .NET, but I curious to see what some other solutions look like (always happy to see if my solution is optimal or not).
This isn't homework or anything like that, I just saw a word jumble in the local comics section of the paper (yes, good ol' fashioned newsprint), and the engineer in me started thinking.
edit:
please post some pseudo code or real code if you can; it's always nice to try and expand language knowledge by seeing examples like this.
Have a dictionary that's keyed by the letters of each word in sorted order. Then take you jumble an sort the letters - look up all the words in the dictionary by that sorted-letter string.
So, as an example, the words 'bear' and 'bare' would be in the dictionary as follows:
key word
----- ------
aber bear
aber bare
And if you're given the jumble, 'earb', you'd sort the letters to 'aber' and be able to look up both possible words in the dictionary.
Create all the permutations of the string, and look them up in a dictionary.
You can optimize by looking up shorter strings that begin words, and if there are no words of suitable length in the dictionary that start with those strings, eliminating permutations starting with those letters from further consideration.
CodeProject has a couple of articles here and here. The second uses recursion. Wikipedia also outlines a couple of algorithms here. The Wikipedia article also mentions a program called Jumbo that uses a more heuristic approach that solves the problem like a human would. There seem to be a few approaches to the problem.
Depending on the length of the string WhirlWind's approach could be faster, but an alternative way of doing it which has more or less O(n) complexity is instead of creating all the permutations of the string and looking them up, you go through all the words in the dictionary and see if each can be built from the input string.
A really smart algorithm that knows the number of words in the dictionary in advance could do something like this:
sort letters in jumbled string
if factorial(length of jumbled string) > count of words in dictionary:
for each word in dictionary:
sort letters in dictionary word
if sorted letters in dictionary word == sorted letters in jumbled string:
append original dictionary word to acceptable word list
else:
create permutation list of jumbled letters
for each permutation of letters:
search in dictionary for permutation
if permutation in dictionary:
add word to acceptable word list
http://www.codeproject.com/KB/game/Anagrams2.aspx
One approach is to split your dictionary into sorted sub-dictionaries with specific lengths, like 1-letter words, 2-letter words,...
When you search for words of a certain jumble, compare the number of possible permutations with the number of the words in the corresponding dictionary. If the former is larger, then compare words in the dictionary to the jumble, if the latter is, then create permutations then search for those in your dictionary.
You can also optimize it further by dividing the dictionaries into smaller subsets based on their first letters, and how frequently they appear, and then divide further based on the second letter. More division might significantly complicate the database and slow down searching, however.

How do you make mathematical equations readable and maintainable?

Given maths is not my strongest point I'm implementing a bezier curve for 3D animation.
The formula is shown here, and as you can see it is quite nasty. In my programming I use descriptive names, and like to break complex lines down to smaller manageable ones.
How is the best way to handle a scenario like this?
Is it to ignore programming best practices and stick with variable names such as x, y, and t?
In my opinion when you have a predefined mathematical equation it is perfectly acceptable to use short variable names: x, y, t, P_0 etc. which correspond to the equation. Make sure to reference the formula clearly though.
if the formulas is extrated to its own function i'd certainly use the canonical maths representation, and maybe add the wiki page url in a comment
if its imbedded in code with a specific usage of the function then keeping the domain names from your code might be better
it depends
Seeing as only the mathematician in you is actually going to understand the formula, my advice would be to go with a style that a mathematician would be most comfortable with (so letters as variables etc...)
I would also definitely put a comment in there somewhere that clearly states what the formula is, and what it does, for example "This method returns a series of points along a quadratic Bezier curve". That way whenever the programmer in you revisits the code you can safely ignore the mathematical complexity with the assumption that your inner mathematician has already checked to make sure its all ok.
I'd encourage you to use mathematic's best practices and denote variables with letters. Just provide explanation for the variables above the formula. And if you can split the formula to smaller subformulas, even better.
Don't bother. Just reference the documentation (the wikipedia page in this case or even better your own documentation) and make sure the variable names match your documentation. Code comments are just not well suited (nor need them to) describe mathematical formulation.
Sometimes a reference is better than 40 lines of comments or even suggestive variable names.
Make the formula in C# (or other language of preference) resemble the mathematical formula as closely as possible, and include a reference to the formula, including a description of the variables. The idea in coding is to be readable, and if you're dealing with mathematical formulae the most readable representation is the one that looks most like mathematics.
You could key the formula into wolfram alpha ... it will try to simplify for you.
It'll also output in a mathematica friendly style ... funnily enough ;)
Kindness,
Dan
I tend to break an equation down into its root parts.
def sum(array)
array.inject(0) { |result, item| result + item }
end
def average(array)
sum(array) / array.length
end
def sum_squared_error(array)
avg = average(array)
array.inject(0) { |result, item| result + (item - avg) ** 2 }
end
def variance(array)
sum_squared_error(array) / (array.length - 1)
end
def standard_deviation(array)
Math.sqrt(variance(array))
end
You might consider using a domain-specific language to handle this. Mathematica would allow you to write out the equation just as it appears in mathematical notion.
The more your final form resembles the original equation, the more maintainable it will be in the long run (otherwise you have to interpret the code every time you see it).