Actionscript regex false negative - actionscript-3

The list of words is very long, so I cannot paste the actual code that bugs out here.
The regex whitelist has approximately 4500 words in it, separated by a |.
Both regexes, whitelist and whitelist2, include the word hello, but the test for each returns a different result, and I have no idea why; testing the same thing in JavaScript gives correct results.
Here is the actionscript for testing.
The line for whitelist might not be entirely visible; try copying and pasting the code from the link below into your text/code editor.
http://wonderfl.net/c/jTmb/
Edit1: The problem I'm facing is that sometimes the words are not an exact match.
For example, saturdays needs to match saturday.
That's why I was using regex.
About the string length:
I tried checking the length of the string, and it's being reported correctly.
http://wonderfl.net/c/a9yp/
Edit2:
Test showing it works in javascript
http://tinyurl.com/m74hmdj

Actual answer...
This question led me into finding some interesting AS3 limitations for the first time...
Your regex fails at the length it has by the word "metabrushite". As far as I can tell from various tests, this is where it hits the longest supported length of a regex in AS3: 31391 characters. Any regex longer than that seems to always return false on a call to test(). Note that "hello" appears in the list before "metabrushite", so it's not a matter of truncation - the regex simply silently fails to work at all - e.g. a regex that should always return true for all words, still returns false if it's that long.
The limit seems a rather arbitrary number, so it's hard to tell exactly what makes this limit.
Again, you should really not be using regex for a task like this, but if you feel you have to, you'll need to split it up into several regexes, each of which doesn't exceed the maximum length.
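A minimal sketch of that splitting, written in JavaScript because its regex syntax is close to AS3's. buildChunkedRegexes and matchesAny are hypothetical helper names, the words are assumed to contain no regex metacharacters, and the 31391 figure is only the limit observed above, not a documented constant.

```javascript
// Split a word list into several alternation regexes whose pattern source
// stays at or under maxLen characters (a single over-long word still gets
// its own regex).
function buildChunkedRegexes(words, maxLen) {
  const regexes = [];
  let current = [];
  let currentLen = 0; // length of current.join("|")
  for (const word of words) {
    const sep = current.length > 0 ? 1 : 0;
    if (current.length > 0 && currentLen + sep + word.length > maxLen) {
      regexes.push(new RegExp(current.join("|"), "i"));
      current = [word];
      currentLen = word.length;
    } else {
      current.push(word);
      currentLen += sep + word.length;
    }
  }
  if (current.length > 0) {
    regexes.push(new RegExp(current.join("|"), "i"));
  }
  return regexes;
}

// True if any chunk matches the input.
function matchesAny(regexes, input) {
  return regexes.some(function (re) { return re.test(input); });
}

const regexes = buildChunkedRegexes(["abasement", "hello", "metabrushite"], 31391);
console.log(matchesAny(regexes, "Hello")); // true
```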
Side note:
Another interesting thing, which I haven't examined more closely, is that creating the RegExp from a single-statement concatenated string, i.e.:
trace("You'll never see this traced if too many words are added below.");
var s:String = "firstword|" +
"secondword|" +
... +
"lastword";
... will fail for even shorter resulting strings. This seems to be due to a max length imposed on the length of a single statement, and has nothing to do with regex. It doesn't freeze; it doesn't output an error or even the first trace. The script is simply silently excluded from the swf and hence never executed.

I'm thinking @tsiki is right about the max length of an AS3 regex.
This is really a comment, but since I'd like to include a bit of code, I'm putting it as an answer:
Since you're not using the regex for anything other than a list of words separated by |, consider using an array instead. Another advantage of this approach is that it will be quite a bit faster.
// This is just a way of reusing your list,
// rather than manually transforming it to an array:
var whitelist:Array = "abasement|abastardize|abastardize|..."
.split("|");
// Simply use .toLowerCase() on the input string to make it case insensitive,
// assuming all your whitelist words are lower case.
trace(whitelist.indexOf("hello") >= 0);
ETA: Performance
Here are some performance comparisons.
_array is pre-initialized to a lower case array of strings, split by |.
_regex is pre-initialized to your regex.
_search is pre-initialized to a given word to search for.
I'm using your words up to (and including) words starting with L - to get around the max regex length limitation:
The code for each test:
regex.test:
_regex.test(_search);
array.indexOf:
_array.indexOf(_search.toLowerCase()) >= 0;
loop over array:
for (var j:int = 0; j < _array.length; j++)
{
    if (_array[j] == _search)
    {
        break;
    }
}
Update: loop, indexOf (check if an item in the whitelist is a substring of the search string):
for (var j:int = 0; j < _array.length; j++)
{
    if (_search.indexOf(_array[j]) !== -1)
    {
        break;
    }
}
The AS3 compiler doesn't do any unfair optimization of this simple code (such as skipping executions due to not using the result - it's not all that clever).
10 runs, 1000 iterations each, FP 11.4.402.278 - release version:
Method            Search for     Avg.        Min       Max       Iter.
---------------------------------------------------------------------------
array.indexOf     "abasement"      0.0 ms      0 ms      0 ms    0 ms
regex.test        "abasement"     18.4 ms     14 ms     22 ms    0.0184 ms
loop over array   "abasement"      0.0 ms      0 ms      0 ms    0 ms
loop, indexOf     "abasement"      0.0 ms      0 ms      0 ms    0 ms
array.indexOf     "hello"         31.1 ms     25 ms     42 ms    0.0311 ms
regex.test        "hello"        326.8 ms    309 ms    347 ms    0.3268 ms
loop over array   "hello"         59.4 ms     50 ms     69 ms    0.0594 ms
loop, indexOf     "hello"         97.4 ms     92 ms    105 ms    0.0974 ms
Avg. = average time for the 1000 iterations in each run
Min = Minimum time for the 1000 iterations in each run
Max = Maximum time for the 1000 iterations in each run
Iter. = Calculated time for a single iteration on average
It's quite clear that looping over the array and comparing each value is faster than using a regex. You could do a fair bit of comparison before it would catch up to the time the regex comparison spends. And in any event, we're dealing with fractions of milliseconds for a single lookup - it's really premature optimization, unless you're doing hundreds of lookups in a short period of time. If we were talking optimization, a Vector.<String> might speed up things slightly more, compared to Array.
The main point of this whole thing is that, except for relatively complex scenarios, a regex is unlikely to be more efficient than a tailored parser/comparer/lookup - that goes for all languages. It's designed to be a general purpose tool, not to do things the smartest way in every case (or pretty much any case for that matter).
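As one example of such a tailored lookup, here is a small JavaScript sketch using a hash-based set; in AS3 a plain Object or Dictionary keyed by word could play the same role. The three-word list and the isWhitelisted helper are hypothetical, and the trailing "s" fallback is only an assumption about which inexact forms (such as saturdays) need to match.

```javascript
// Hypothetical whitelist; the real list has thousands of entries.
const whitelistSet = new Set("abasement|hello|saturday".split("|"));

// Exact match first, then a naive plural fallback that strips one trailing "s".
function isWhitelisted(word) {
  const w = word.toLowerCase();
  if (whitelistSet.has(w)) {
    return true;
  }
  return w.endsWith("s") && whitelistSet.has(w.slice(0, -1));
}

console.log(isWhitelisted("Saturdays")); // true
```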

Related

What's the proper use of output property in Octave?

I am not sure what the output value is for when using fminunc.
>>options = optimset('GradObj','on','MaxIter','1');
>>initialTheta=zeros(2,1);
>>[optTheta, functionVal, exitFlag, output, grad, hessian] = fminunc(@CostFunc, initialTheta, options);
>> output
output =
scalar structure containing the fields:
iterations = 11
successful = 10
funcCount = 21
Even when I set the maximum number of iterations to 1, it still reports iterations = 11.
Could anyone please explain why this is happening?
Please also help me with the grad and hessian outputs, i.e. what they are used for.
Given we don't have the full code, I think the easiest thing for you to do to understand exactly what is happening is to just set a breakpoint in fminunc.m itself, and follow the logic of the code. This is one of the nice things about working with Octave, since the source code is provided and you can check it freely (there's often useful information in octave source code in fact, such as references to papers which they relied on for the implementation, etc).
From a quick look, it doesn't seem like fminunc expects a maxiter of 1. Have a look at line 211:
211 while (niter < maxiter && nfev < maxfev && ! info)
Since niter is initialised just before (at line 176) with the value of 1, in theory this loop will never be entered if your maxiter is 1, which defeats the whole point of the optimization.
There are other interesting things happening in there too, e.g. the inner while loop starting at line 272:
272 while (! suc && niter <= maxiter && nfev < maxfev && ! info)
This uses short-circuit evaluation: it first checks whether the previous iteration was "unsuccessful", and only then checks whether the number of iterations is still within "maxiter".
In other words, if the previous iteration was successful, you don't get to run the inner loop at all, and you never get to increment niter.
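To make the evaluation order concrete, here is a toy JavaScript sketch. It is not fminunc's code: withinMaxIter and shouldEnterInnerLoop are hypothetical names that only mimic the first two terms of the inner loop condition quoted above.

```javascript
// Counts how often the iteration limit is actually consulted.
let conditionChecks = 0;

function withinMaxIter(niter, maxiter) {
  conditionChecks++;
  return niter <= maxiter;
}

// Mirrors "! suc && niter <= maxiter": when suc is true, the right-hand
// side is never evaluated, so the limit is not even looked at.
function shouldEnterInnerLoop(suc, niter, maxiter) {
  return !suc && withinMaxIter(niter, maxiter);
}

console.log(shouldEnterInnerLoop(true, 1, 5));  // false
console.log(shouldEnterInnerLoop(false, 1, 5)); // true
```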
What flags an iteration as "successful" seems to be defined by the ratio of "actual vs predicted reduction", as per the following (non-consecutive) lines:
286 actred = (fval - fval1) / (abs (fval1) + abs (fval));
...
295 prered = -t/(abs (fval) + abs (fval + t));
296 ratio = actred / prered;
...
321 if (ratio >= 1e-4)
322 ## Successful iteration.
...
326 nsuciter += 1;
...
328 endif
329
330 niter += 1;
In other words, it seems like fminunc will respect your maxiters ignoring whether these have been "successful" or "unsuccessful", with the exception that it does not like to "end" the algorithm at a "successful" turn (since the success condition needs to be fulfilled first before the maxiters condition is checked).
Obviously this is an academic point, since you shouldn't even be entering this inner loop when you couldn't even make it past the outer loop in the first place.
I cannot really know exactly what is going on without knowing your specific code, but you should be able to follow easily if you run your code with a breakpoint at fminunc. The maths behind that implementation may be complex, but the code itself seems fairly simple and straightforward enough to follow.
Good luck!

Big unicode problems - AS3

I made a program where people can type in 4 letters, and it gives you the corresponding Unicode character, which it inserts into a textflow element. I had a lot of problems with this, but in the end I succeeded with some help. The problem came when I typed "dddd" or "ddd1" as a test.
I got the error
- "An unpaired Unicode surrogate was encountered in the input."
I spent about 2 days testing for that, and there was absolutely no event triggered that would let me test for the error before it occurred.
The code:
str = "dddd"
num = parseInt(str,16)
res = String.fromCharCode(num)
Actually, when the error occurs, res is equal to "?" in the console ... but if you test for it with if(res == "?") it returns false.
MY QUESTION:
I searched and searched and found absolutely no description of this error in Adobe's AS3 reference, but after 2 days I found this page for JavaScript: http://scripts.sil.org/cms/scripts/page.php?item_id=IWS-Chapter04a
It says that
- The code units in the range 0xD800–0xDFFF, serve a special purpose, however. These code units, known as surrogate code units
So now i test with:
if ((num > 0 && num < uint(0xD800)) || (num > uint(0xDFFF) && num < uint(0xFFFF))) {
    // get unicode character
}
My question is simply whether I understood this correctly, i.e. whether this will actually prevent the error from occurring. I'm no Unicode specialist and don't really know how to test for it, since there are tens of thousands of characters; I might have missed one, and that would mean users could accidentally trigger the error and risk crashing the application.
You are correct. A code point ("high surrogate") between 0xD800-0xDBFF must be paired with a code point ("low surrogate") between 0xDC00-0xDFFF. Those are reserved for use in UTF-16 - when needing to address the higher planes that don't fit in 16 bits - and hence those code points can't appear on their own. For example:
0xD802 DC01 corresponds to (I'll leave out the 0x hex markers):
10000 + (high - D800) * 0400 + (low - DC00)
10000 + (D802 - D800) * 0400 + (DC01 - DC00)
= 10000 + 0002 * 0400 + 0001
= 10801, i.e. code point U+10801, which UTF-16 expresses as the pair D802 DC01
... just adding that bit of info in case you later need to support it.
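The same arithmetic as a small JavaScript sketch. combineSurrogates is a hypothetical helper name; the formula is the one worked through above.

```javascript
// Combine a UTF-16 surrogate pair into the code point it encodes.
function combineSurrogates(high, low) {
  if (high < 0xD800 || high > 0xDBFF) {
    throw new RangeError("not a high surrogate");
  }
  if (low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a low surrogate");
  }
  return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
}

console.log(combineSurrogates(0xD802, 0xDC01).toString(16)); // "10801"
```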
I haven't tested the AS3 functionality for the following, but you may want to also test the input below - you won't get the surrogate error for these, but might get another error message:
0xFFFE and 0xFFFF (when using higher planes, also any code point "ending" with those bits, e.g. 0x1FFFE and 0x1FFFF; 0x2FFFE and 0x2FFFF etc.) Those are "non-characters".
The same goes for 0xFDD0-0xFDEF - also "non-characters".
AS3 actually uses UTF-16 to store its strings, but even if it didn't, the surrogate code points would still have no meaning outside pairs - the code points are reserved and can't be used in other Unicode encodings either (e.g. UTF-8 or UTF-32)
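Putting the checks together, here is a hedged JavaScript sketch of validating the 4-character input before converting it. isSafeBmpCodeUnit and charFromHex are hypothetical names, the non-character block is taken to be U+FDD0 to U+FDEF as Unicode defines it, and whether AS3 raises an error for non-characters has not been tested here.

```javascript
// True if num is a 16-bit value that is neither a surrogate nor a non-character.
function isSafeBmpCodeUnit(num) {
  if (!Number.isInteger(num) || num <= 0 || num > 0xFFFF) return false;
  if (num >= 0xD800 && num <= 0xDFFF) return false; // lone surrogates
  if (num >= 0xFDD0 && num <= 0xFDEF) return false; // non-characters
  if (num === 0xFFFE || num === 0xFFFF) return false; // non-characters
  return true;
}

// Returns the character for a 4-digit hex string, or null if it should be rejected.
function charFromHex(str) {
  if (!/^[0-9a-fA-F]{4}$/.test(str)) return null;
  const num = parseInt(str, 16);
  return isSafeBmpCodeUnit(num) ? String.fromCharCode(num) : null;
}

console.log(charFromHex("0041")); // "A"
console.log(charFromHex("dddd")); // null
```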

How can I get better randomization in my sql query?

I am attempting to get a random bearing, from 0 to 359.9.
SET bearing = FLOOR((RAND() * 359.9));
I may call the procedure that runs this request within the same while loop, immediately one after the next. Unfortunately, the randomization seems to be anything but unique. e.g.
Results
358.07
359.15
357.85
I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another.
In any other situation, I would wait a few milliseconds in between calls or reinit my Random object (such as in C#), which would greatly vary my randomness. However, I don't want to wait in this situation.
How can I increase randomness without waiting?
"I understand how randomization works, and I know because of my quick calls to the same function, the ticks used to generate the random number are very close to one another."
That's not quite right. Where folks get into trouble is when they re-seed a random number generator repeatedly with the current time, and because they do it very quickly the time is the same and they end up re-seeding the RNG with the same seed. This results in the RNG spitting out the same sequence of numbers each time it is re-seeded.
Importantly, by "the same" I mean exactly the same. An RNG is either going to return an identical sequence or a completely different one. A "close" seed won't result in a "similar" sequence. You will either get an identical sequence or a totally different one.
The correct solution to this is not to stagger your re-seeds, but actually to stop re-seeding the RNG. You only need to seed an RNG once.
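A small JavaScript sketch of that behaviour. JavaScript's Math.random cannot be seeded, so this uses a tiny seeded generator of the kind commonly known as mulberry32; it is an illustration only and has nothing to do with MySQL's generator. makeRng and take are hypothetical names.

```javascript
// Small seeded PRNG (mulberry32-style): same seed, same sequence.
function makeRng(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// First n values produced after seeding with the given seed.
function take(seed, n) {
  const rng = makeRng(seed);
  return Array.from({ length: n }, function () { return rng(); });
}

// Re-seeding with the same seed replays the identical sequence.
console.log(take(42, 3).join() === take(42, 3).join()); // true
// Seeding once and continuing to draw keeps producing new values.
const rng = makeRng(42);
console.log(rng(), rng(), rng());
```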
Anyways, that is neither here nor there. MySQL's RAND() function does not require explicit seeding. When you call RAND() without arguments the seeding is taken care of for you meaning you can call it repeatedly without issue. There's no time-based limitation with how often you can call it.
Actually your SQL looks fine as is. There's something missing from your post, in fact. Since you're calling FLOOR() the result you get should always be an integer. There's no way you'll get a fractional result from that assignment. You should see integral results like this:
187
274
89
345
That's what I got from running SELECT FLOOR(RAND() * 359.9) repeatedly.
Also, for what it's worth, RAND() will never return 1.0. Its range is 0 ≤ RAND() < 1.0. You are safe using 360 vs. 359.9:
SET bearing = FLOOR(RAND() * 360);
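The same range reasoning, sketched in JavaScript, whose Math.random() also returns values in [0, 1). randomBearing and randomBearingTenths are hypothetical names; the second variant is only an assumption about what "0 to 359.9" was meant to allow.

```javascript
// Whole-degree bearing: 0, 1, ..., 359.
function randomBearing() {
  return Math.floor(Math.random() * 360);
}

// Bearing in tenths of a degree: 0.0, 0.1, ..., 359.9.
function randomBearingTenths() {
  return Math.floor(Math.random() * 3600) / 10;
}

console.log(randomBearing(), randomBearingTenths());
```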

Generating unique codes that are different in two digits

I want to generate unique code numbers (composed of 7 digits exactly). The code number is generated randomly and saved in MySQL table.
I have another requirement. All generated codes should differ in at least two digits. This is useful to prevent errors while typing the user code. Hopefully, it will prevent referring to another user code while doing some operations as it is more unlikely to miss two digits and match another existing user code.
The generation algorithm works simply like this:
1. Retrieve all previous codes, if any, from the MySQL table.
2. Generate one code at a time.
3. Subtract the generated code from each of the previous codes.
4. Check the number of non-zero digits in the subtraction result.
5. If it is > 1, accept the generated code and add it to the previous codes.
6. Otherwise, jump to 2.
7. Repeat steps 2 to 6 for the number of requested codes.
8. Save the generated codes in the DB table.
The algorithm works fine, but the problem is performance. It takes a very long time to finish when asked to generate a large number of codes, such as 10,000.
The question: Is there any way to improve the performance of this algorithm?
I am using perl + MySQL on Ubuntu server if that matters.
Have you considered a variant of the Luhn algorithm? Luhn is used to generate a check digit for strings of numbers in lots of applications, including credit card account numbers. It's part of the ISO/IEC 7812-1 standard for generating identifiers. It will catch any number that is entered with one incorrect digit, which implies any two valid numbers differ in at least two digits.
Check out Algorithm::LUHN in CPAN for a perl implementation.
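For illustration, here is a JavaScript sketch of the check-digit idea: 6 random digits plus a Luhn check digit give a 7-digit code. This is not the Algorithm::LUHN API; luhnCheckDigit, makeCode, isValidCode and randomCode are hypothetical names, and the 6-digit payloads still need to be kept unique (for example with a unique column).

```javascript
// Luhn check digit for a string of decimal digits (the payload).
function luhnCheckDigit(payload) {
  let sum = 0;
  let double = true; // the rightmost payload digit is doubled
  for (let i = payload.length - 1; i >= 0; i--) {
    let d = payload.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return (10 - (sum % 10)) % 10;
}

function makeCode(payload) {
  return payload + luhnCheckDigit(payload);
}

function isValidCode(code) {
  return /^\d{2,}$/.test(code) &&
    luhnCheckDigit(code.slice(0, -1)) === Number(code.slice(-1));
}

// 6 random digits plus the check digit.
function randomCode() {
  const payload = String(Math.floor(Math.random() * 1e6)).padStart(6, "0");
  return makeCode(payload);
}

console.log(makeCode("123456")); // "1234566"
```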
Don't retrieve the existing codes, just generate a potential new code and see if there are any conflicting ones in the database:
SELECT code FROM table WHERE abs(code-?) regexp '^[1-9]?0*$';
(where the placeholder is the newly generated code).
Ah, I missed the generating lots of codes at once part. Do it like this (completely untested):
use strict;
use warnings;

my @codes = existing_codes();
# Each index maps the first half of a code (or of its reversal)
# to the list of codes that share that half.
my $frontwards_index = {};
my $backwards_index = {};
for my $code (@codes) {
    index_code($code, $frontwards_index);
    index_code(scalar reverse($code), $backwards_index);
}
my @new_codes = map generate_code($frontwards_index, $backwards_index), 1..10000;

sub index_code {
    my ($code, $index) = @_;
    push @{ $index->{ substr($code, 0, length($code)/2) } }, $code;
    return;
}

sub check_index {
    my ($code, $index) = @_;
    # String XOR leaves "\0" wherever two digits agree; count the positions that differ.
    my $found = grep { ($_ ^ $code) =~ y/\0//c <= 1 }
        @{ $index->{ substr($code, 0, length($code)/2) } || [] };
    return $found;
}

sub generate_code {
    my ($frontwards_index, $backwards_index) = @_;
    my $new_code;
    do {
        $new_code = sprintf("%07d", rand(10000000));
    } while check_index($new_code, $frontwards_index)
         || check_index(scalar reverse($new_code), $backwards_index);
    index_code($new_code, $frontwards_index);
    index_code(scalar reverse($new_code), $backwards_index);
    return $new_code;
}
Put the numbers 0 through 9,999,999 in an augmented binary search tree. The augmentation is to keep track of the number of sub-nodes to the left and to the right. So for example when your algorithm begins, the top node should have value 5,000,000, and it should know that it has 5,000,000 nodes to the left, and 4,999,999 nodes to the right. Now create a hashtable. For each value you've used already, remove its node from the augmented binary search tree and add the value to the hashtable. Make sure to maintain the augmentation.
To get a single value, follow these steps:
1. Use the top node to determine how many nodes are left in the tree. Let's say you have n nodes left. Pick a random index k with 0 <= k < n. Using the augmentation, you can find the kth node in your tree in log(n) time.
2. Once you've found that node, compute all the values that would make the value at that node invalid. Let's say your node has value 1,111,111. If you already have 2,111,111 or 3,111,111 or... then you can't use 1,111,111. Since there are 9 other options per digit and 7 digits, you only need to check 63 possible values. Check to see if any of those values are in your hashtable. If you haven't used any of those values yet, you can use your random node. If you have used any of them, then you can't.
3. Remove your node from the augmented tree. Make sure that you maintain the augmented information.
4. If you can't use that value, return to step 1.
5. If you can use that value, you have a new random code. Add it to the hashtable.
Now, checking to see if a value is available takes O(1) time instead of O(n) time. Also, finding another available random value to check takes O(log n) time instead of... ah... I'm not sure how to analyze your algorithm.
Long story short, if you start from scratch and use this algorithm, you will generate a complete list of valid codes in O(n log n). Since n is 10,000,000, it will take a few seconds or something.
Did I do the math right there everybody? Let me know if that doesn't check out or if I need to clarify anything.
Use a hash.
After generating a successful code (not conflicting with any existing code), put that code in the hash table, and also put the 63 other codes that differ by exactly one digit into the hash.
To see if a randomly generated code will conflict with an existing code, just check if that code exists in the hash.
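A minimal JavaScript sketch of this hash approach, using a Set as the hash. neighbours, generateCodes and digitDistance are hypothetical names. At most 10^6 seven-digit codes can be pairwise two digits apart, so the requested count is assumed to be far below that, otherwise the loop may run for a very long time.

```javascript
// All codes that differ from `code` in exactly one digit (63 for 7 digits).
function neighbours(code) {
  const out = [];
  for (let pos = 0; pos < code.length; pos++) {
    for (let d = 0; d <= 9; d++) {
      const ch = String(d);
      if (ch !== code[pos]) {
        out.push(code.slice(0, pos) + ch + code.slice(pos + 1));
      }
    }
  }
  return out;
}

// Generate `count` new 7-digit codes, none within one digit of another
// new code or of any code in `existing`.
function generateCodes(count, existing) {
  const blocked = new Set();
  const block = function (code) {
    blocked.add(code);
    for (const n of neighbours(code)) blocked.add(n);
  };
  (existing || []).forEach(block);
  const fresh = [];
  while (fresh.length < count) {
    const candidate = String(Math.floor(Math.random() * 1e7)).padStart(7, "0");
    if (blocked.has(candidate)) continue; // one O(1) lookup per candidate
    block(candidate);
    fresh.push(candidate);
  }
  return fresh;
}

// Number of positions at which two equal-length codes differ.
function digitDistance(a, b) {
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}

console.log(generateCodes(5, ["1234567"]));
```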
How about:
Generate a 6 digit code by autoincrementing the previous one.
Generate a 1 digit code by incrementing the previous one mod 10.
Concatenate the two.
Presto, guaranteed to differ in two digits. :D
(Yes, being slightly facetious. I'm assuming that 'random' or at least quasi-random is necessary. In which case, generate a 6 digit random key, repeat until it's not a duplicate (i.e. make the column unique, repeat until the insert doesn't fail the constraint), then generate a check digit, as someone already said.)

Algorithm to generate all possible letter combinations of given string down to 2 letters

Trying to create an Anagram solver in AS3, such as this one found here:
http://homepage.ntlworld.com/adam.bozon/anagramsolver.htm
I'm having a problem wrapping my brain around generating all possible letter combinations for the various lengths of strings. If I was only generating permutations for a fixed length, it wouldn't be such a problem for me... but I'm looking to reduce the length of the string and obtain all the possible permutations from the original set of letters for a string with a max length smaller than the original string. For example, say I want a string length of 2, yet I have a 3 letter string of “abc”, the output would be: ab ac ba bc ca cb.
Ideally the algorithm would produce a complete list of possible combinations starting with the original string length, down to the smallest string length of 2. I have a feeling there is probably a small recursive algorithm to do this, but can't wrap my brain around it. I'm working in AS3.
Thanks!
For the purpose of writing an anagram solver the kind of which you linked, the algorithm that you are requesting is not necessary. It is also VERY expensive.
Let's look at a 6-letter word like MONKEY, for example. All 6 letters of the word are different, so you would create:
6*5*4*3*2*1 different 6-letter words
6*5*4*3*2 different 5-letter words
6*5*4*3 different 4-letter words
6*5*4 different 3-letter words
6*5 different 2-letter words
For a total of 1950 words
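The total can be checked with a few lines of JavaScript. countArrangements is a hypothetical helper that assumes all n letters are distinct.

```javascript
// Sum of n * (n-1) * ... (k factors) for every length k from minLen to n.
function countArrangements(n, minLen) {
  let total = 0;
  let product = 1;
  for (let k = 1; k <= n; k++) {
    product *= n - k + 1; // number of k-letter arrangements
    if (k >= minLen) total += product;
  }
  return total;
}

console.log(countArrangements(6, 2)); // 1950
```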
Now, presumably you're not trying to spit out all 1950 words (e.g. 'OEYKMN') as anagrams (which they are, but most of them are also gibberish). I'm guessing you have a dictionary of legal English words, and you just want to check if any of those words are anagrams of the query word, with the option of not using all letters.
If that is the case, then the problem is simple.
To determine if 2 words are anagrams of each other, all you need to do is count how many times each letter is used, and compare these numbers!
Let's restrict ourselves to only 26 letters A-Z, case insensitive. What you need to do is write a function countLetters that takes a word and returns an array of 26 numbers. The first number in the array corresponds to the count of the letter A in the word, the second number corresponds to the count of B, etc.
Then, two words W1 and W2 are exact anagrams if countLetters(W1)[i] == countLetters(W2)[i] for every i! That is, each word uses each letter the exact same number of times!
For what I'd call sub-anagrams (MONEY is a sub-anagram of MONKEY), W1 is a sub-anagram of W2 if countLetters(W1)[i] <= countLetters(W2)[i] for every i! That is, the sub-anagram may use fewer of certain letters, but not more!
(note: MONKEY is also a sub-anagram of MONKEY).
This should give you a fast enough algorithm, where given a query string, all you need to do is read through the dictionary once, comparing the letter count array of each word against the letter count array of the query word. You can do some minor optimizations, but this should be good enough.
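A short JavaScript sketch of this approach, which should port to AS3 with little change. countLetters is the function described above; isAnagram, isSubAnagram and findSubAnagrams are hypothetical helper names, and the word list in the usage line is made up.

```javascript
// Array of 26 counts, index 0 = A ... index 25 = Z; other characters are ignored.
function countLetters(word) {
  const counts = new Array(26).fill(0);
  for (const ch of word.toUpperCase()) {
    const i = ch.charCodeAt(0) - 65;
    if (i >= 0 && i < 26) counts[i]++;
  }
  return counts;
}

function isAnagram(w1, w2) {
  const a = countLetters(w1);
  const b = countLetters(w2);
  return a.every(function (n, i) { return n === b[i]; });
}

// True if w1 uses no letter more often than w2 does.
function isSubAnagram(w1, w2) {
  const a = countLetters(w1);
  const b = countLetters(w2);
  return a.every(function (n, i) { return n <= b[i]; });
}

// One pass over the dictionary, keeping words of at least 2 letters.
function findSubAnagrams(query, dictionary) {
  const q = countLetters(query);
  return dictionary.filter(function (word) {
    return word.length >= 2 &&
      countLetters(word).every(function (n, i) { return n <= q[i]; });
  });
}

console.log(findSubAnagrams("monkey", ["money", "key", "donkey", "on"]));
// [ 'money', 'key', 'on' ]
```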
Alternatively, if you want utmost performance, you can preprocess the dictionary (which is known in advance) and create a directed acyclic graph of sub-anagram relationship.
Here's a portion of such a graph for illustration:
D=1,G=1,O=1 ----------> D=1,O=1
 {dog,god}   \           {do,od}
              \
               \-------> G=1,O=1
                          {go}
Basically each node is a bucket for all words that have the same letter count array (i.e. they're exact anagrams). Then there's an edge from N1 to N2 if N2's array is <= (as defined above) N1's array (you can perform transitive reduction to store the least amount of edges).
Then to list all sub-anagrams of a word, all you have to do is find the node corresponding to its letter count array, and recursively explore all nodes reachable from that node. All their buckets would contain the sub-anagrams.
The following js code will find all possible "words" in an n letter word. Of course this doesn't mean that they are real words but does give you all the combinations. On my machine it takes about 0.4 seconds for a 7 letter word and 15 secs for a 9 letter word (up to almost a million possibilities if no repeated letters). However those times include looking in a dictionary and finding which are real words.
var getWordsNew = function (masterword) {
    var result = {};
    var a, i, l;
    function nextLetter(a, l, key, used) {
        var i;
        if (key.length == l) {
            return;
        }
        for (i = 0; i < l; i++) {
            if (used.indexOf("" + i) < 0) {
                result[key + a[i]] = "";
                nextLetter(a, l, key + a[i], used + i);
            }
        }
    }
    a = masterword.split("");
    l = a.length;
    for (i = 0; i < a.length; i++) {
        result[a[i]] = "";
        nextLetter(a, l, a[i], "" + i);
    }
    return result;
};
Complete code at
Code for finding words in words
You want a sort of arrangements. If you're familiar with the permutation algorithm then you know you have a check to see when you've generated enough numbers. Just change that limit:
I don't know AS3, but here's some pseudocode:
st = an array
Arrangements(LettersInYourWord, MinimumLettersInArrangement, k = 1)
    if ( k > MinimumLettersInArrangement )
    {
        print st;
    }
    if ( k > LettersInYourWord )
        return;
    for ( each position i in your word that hasn't been used before )
    {
        st[k] = YourWord[i];
        Arrangements(<same>, <same>, k + 1);
    }
for "abc" and Arrangements(3, 2, 1); this will print:
ab
abc
ac
acb
...
If you want those with three first, and then those with two, consider this:
st = an array
Arrangements(LettersInYourWord, DesiredLettersInArrangement, k = 1)
    if ( k > DesiredLettersInArrangement )
    {
        print st;
        return;
    }
    for ( each position i in your word that hasn't been used before )
    {
        st[k] = YourWord[i];
        Arrangements(<same>, <same>, k + 1);
    }
Then for "abc" call Arrangements(3, 3, 1); and then Arrangements(3, 2, 1);
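A runnable JavaScript take on the first pseudocode variant, as a sketch only. arrangements is a hypothetical name; it tracks used positions rather than letters, so repeated letters in the word produce duplicate results.

```javascript
// All arrangements of the letters of `word` with at least minLen letters.
function arrangements(word, minLen) {
  const results = [];
  const used = new Array(word.length).fill(false);

  function extend(prefix) {
    if (prefix.length >= minLen) {
      results.push(prefix);
    }
    for (let i = 0; i < word.length; i++) {
      if (used[i]) continue; // each position may be used once
      used[i] = true;
      extend(prefix + word[i]);
      used[i] = false;
    }
  }

  extend("");
  return results;
}

console.log(arrangements("abc", 2).join(" "));
// ab abc ac acb ba bac bc bca ca cab cb cba
```

To list the longest arrangements first, the result can be sorted by length afterwards.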
You can generate all words in an alphabet by finding all paths in a complete graph of the letters. You can find all paths in that graph by doing a depth first search from each letter and returning the current path at each point.
There is a simple O(N) approach, where N is the size of the vocabulary.
Just sort the letters in each word in the vocabulary or, better, create a binary mask of them, and then compare with the letters you have.
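Both ideas, sketched in JavaScript with hypothetical helper names. Note the assumption that the mask is used only as a quick pre-filter: it records which letters occur, not how often, so a word that passes still needs a proper count comparison.

```javascript
// Exact anagrams share the same sorted-letter key.
function sortedKey(word) {
  return word.toLowerCase().split("").sort().join("");
}

// 26-bit mask with bit i set if the i-th letter of a-z occurs in the word.
function letterMask(word) {
  let mask = 0;
  for (const ch of word.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) mask |= 1 << i;
  }
  return mask;
}

// Necessary (not sufficient) condition: the candidate uses no letter
// that is absent from the query.
function mayBeSubAnagram(candidate, query) {
  return (letterMask(candidate) & ~letterMask(query)) === 0;
}

console.log(sortedKey("dog") === sortedKey("god")); // true
console.log(mayBeSubAnagram("donkey", "monkey"));   // false
```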