Contains Test for a Constant Set - language-agnostic

The problem statement:
Given a set of integers that is known in advance, generate code to test if a single integer is in the set. The domain of the testing function is the integers in some consecutive range.
Nothing in particular is known about the range or the set to be tested. The range could be small or huge (a solution may reject problems that are too big, though higher limits are better). Very few of the values in the allowed range might be in the set, or most of them, or anything in between. The set may be uniformly distributed or clustered. There may be large sections of only contained/not-contained values, or there may be at least a few of each type of value in most swaths (sort of like the assumptions made about items to be sorted when analyzing sorting algorithms).
The objective is a procedure for generating effective code for running the test.
Partial solutions that come to mind include
perfect hash function (costly for large sets)
range tests: foreach(b in ranges) if(b.l <= v && v <= b.h) return true;
trees/indexes (more costly than others in some cases)
table lookup (costly for large sets)
the inverse of any of these (kudos to Jason S)
It seems that an ideal solution would be able to pick whichever option is best, or, if none works well on its own, use a tree to break the full range into sections and then switch to whichever option suits each subsection best.
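To make that hybrid idea concrete, here is a hand-written sketch (in C++) of the kind of code a generator might emit for a small made-up set; the set, the domain, and the function name are all hypothetical, and a real generator would pick each subsection's strategy from the statistics of the actual set:

#include <cstdint>

// Hypothetical generated test for the set {2, 3, 5, 7, 11, 13} plus every
// value in [200, 300], over the domain [0, 1000]: a 64-bit mask for the
// sparse low cluster, a range test for the dense run, false everywhere else.
bool contains(int v) {
    if (v < 0 || v > 1000) return false;              // outside the domain
    if (v < 64) {                                     // sparse cluster: bitmask
        const std::uint64_t mask =
            (1ULL << 2) | (1ULL << 3) | (1ULL << 5) |
            (1ULL << 7) | (1ULL << 11) | (1ULL << 13);
        return (mask >> v) & 1ULL;
    }
    return 200 <= v && v <= 300;                      // dense run: range test
}

The point is only that different subranges can get different strategies; which strategy wins for a given subrange would have to come from measuring its density and clustering.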
Topic(s) that might be useful include:
Huffman coding
Note: this is not homework. If it were issued as homework below the doctoral level, the prof should be shot with a Nerf gun (if you don't get that, re-read the problem; it is very much non-trivial).
Note: This is a problem that occurred to me a few days ago and I've been puzzling over it off and on. I have no direct use for this but thought it would be a cool problem to attack. The reason I want to generate the code is that generated code will be no slower than general code (it can be the same thing if needed) and might be faster in some/many cases.
I'm posting this question as much to clarify my thoughts as anything. If I can come up with any reasonable or cool solutions I plan on implementing them as a template metaprogram (the other reason for generated code).
Some people have noted that the problem is very general. That is the point. I'm hoping to generate a system that would work on a very general domain: sets of integers in some range.

A previous question on dictionary/spellchecking had a number of responses that mentioned Bloom filters; maybe that would help.
I would think that testing for large sets is going to be expensive no matter what.
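To make the Bloom filter suggestion concrete, here is a minimal, untuned sketch of one (the class name and the double-hashing scheme are just illustrative choices). Note that it can only answer "definitely not in the set" or "possibly in the set", so for an exact contains test it would act as a cheap pre-filter in front of an exact structure rather than replace it:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal Bloom filter sketch: no false negatives, but false positives are
// possible, so a "true" answer still needs confirmation by an exact test.
class BloomFilter {
public:
    BloomFilter(std::size_t bitCount, std::size_t hashCount)
        : bits_(bitCount, false), k_(hashCount) {}

    void add(std::int64_t value) {
        for (std::size_t i = 0; i < k_; ++i)
            bits_[index(value, i)] = true;
    }

    bool maybeContains(std::int64_t value) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[index(value, i)])
                return false;                          // definitely absent
        return true;                                   // possibly present
    }

private:
    std::size_t index(std::int64_t value, std::size_t i) const {
        // Derive k hash functions from one via double hashing: h1 + i*h2.
        std::size_t h1 = std::hash<std::int64_t>{}(value);
        std::size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;   // cheap second hash
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};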

Let's pretend, for a moment, that this is a real question:
there are no limits on the size of the base set or the input set
this makes the "problem" unrealistic, underspecified, and unsolvable in any practical sense
If someone wants to posit a solution, here are some unit test cases:
unit test 1:
the base set is all integers between -1,000,000,000,000 and +1,000,000,000,000 except for 100,000,000,000 randomly-removed values
the input set is 100,000,000,000 randomly-generated integers in the same range
unit test 2:
the base set is the Fibonacci series
the input set is 1T randomly-generated integers in the range 0..infinity

There's also boost::dynamic_bitset; I'm not sure how it scales in time, or in space with respect to the distribution of the original numbers (e.g. if the bits are stored in chunks of 8/16/32/64, then sparse bitsets are inefficient).
Or perhaps this (compressed bit set) or this (bit vector) webpage (I googled for "large sparse bit sets" and "compressed bit sets").
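To illustrate the point about sparse bitsets, here is a rough sketch of a chunked bitset that only pays for 64-bit chunks that actually contain members; it is a simplified stand-in for the compressed-bitset libraries mentioned above, not a replacement for boost::dynamic_bitset:

#include <cstdint>
#include <unordered_map>

// "Sparse bitset": 64-bit chunks stored in a hash map keyed by chunk index,
// so empty regions of the range cost nothing.
class SparseBitset {
public:
    void insert(std::int64_t v) {
        // v >> 6 is floor division by 64 (arithmetic shift on the usual
        // two's-complement targets); v & 63 is the bit within the chunk.
        chunks_[v >> 6] |= 1ULL << (v & 63);
    }
    bool contains(std::int64_t v) const {
        auto it = chunks_.find(v >> 6);
        return it != chunks_.end() && ((it->second >> (v & 63)) & 1ULL);
    }
private:
    std::unordered_map<std::int64_t, std::uint64_t> chunks_;
};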

Related

Logic or lookup table: Best practices

Suppose you have a function/method that uses two metrics to return a value — essentially a 2D matrix of possible values. Is it better to use logic (nested if/switch statements) to choose the right value, or just build that matrix (as an Array/Hash/Dictionary/whatever), so that the return value becomes simply a matter of performing a lookup?
My gut feeling says that for an M×N matrix, relatively small values of both M and N (say ≤ 3) would be OK to handle with logic, but for larger values it would be more efficient to just build the matrix.
What are general best practices for this? What about for an N-dimensional matrix?
The decision depends on multiple factors, including:
Which option makes the code more readable and hence easier to maintain
Which option performs faster, especially if the lookup happens squillions of times
How often do the values in the matrix change? If the answer is "often" then it is probably better to externalise the values out of the code and put them in a matrix stored in a way that can be edited simply.
Not only how big is the matrix but how sparse is it?
What I say is that about nine conditions is the limit for an if...else ladder or a switch. So if you have a 2D cell you can reasonably hard-code the up, down, diagonals, and so on. If you go to three dimensions you have 27 cases and it's too much, but it's OK if you're restricted to the six cube faces.
Once you've got a lot of conditions, start coding via look-up tables.
But there's no real answer. For example Windows message loops need to deal with a lot of different messages, and you can't sensibly encode the handling code in look-up tables.
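As an entirely made-up illustration of the table approach for the 2D case, here is a small matrix (a hypothetical "price by size and zone") indexed directly, so the whole decision collapses to one lookup instead of a nested if/switch ladder over both dimensions:

#include <array>

// Hypothetical 3x4 value matrix; the enums and numbers are illustrative only.
enum Size { Small, Medium, Large };          // M = 3
enum Zone { ZoneA, ZoneB, ZoneC, ZoneD };    // N = 4

constexpr std::array<std::array<int, 4>, 3> kPrice = {{
    { 5,  7,  9, 12},    // Small
    { 8, 10, 13, 16},    // Medium
    {11, 14, 18, 22},    // Large
}};

int price(Size size, Zone zone) {
    return kPrice[size][zone];               // the "logic" is one lookup
}

Moving the numbers into such a table also bears on the "how often do the values change" point above: the data can be edited, or even loaded from a file, without touching the selection code.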

Hashing Function Vs Loop search

I have an array of structures, ~100 unique elements, and the structure is not large. Due to legacy code, to find an element in this array I use a hash function to find a likely starting point to loop from until I find the element I want.
My question is this: is the hash function (and resulting hash table) overkill?
I know that for large tables hashing is essential for good response time, but for a table this size?
More succinctly, is there a table size below which writing a hash function is unnecessary ?
Language agnostic answers please.
Thanks,
A hash lookup trades a bigger up-front computation cost for better scalability. There is no inherent cutoff table size, as it depends on the cost of your hash function. Roughly speaking, if calculating your hash function costs about the same as one hundred equality comparisons, then you could only theoretically benefit from the hash map at some point above one hundred items. The only way to get specific answers for your case is to measure the performance.
My guess, though, is that a hash map for 100 items is overkill for performance purposes.
The standard, obvious answer would be/is to write the simplest code that can do the job. Ensure that your interface to that code is as clean as possible so you can replace it when/if needed. Later, if you find that code takes an unacceptable amount of time, replace it with something that improves performance.
On a theoretical basis, however, it's impossible to guess at the upper limit on the number of items for which a linear search will provide acceptable performance for your task. It's also impossible to guess at the number of items for which a hash table will provide better performance than a linear search.
The main point, however, is that it's rarely necessary to try to figure out (especially on a poorly-defined theoretical basis) what data structure would be best for a given situation. In most cases, you just need to make an acceptable decision, and implement it so you can change your mind later if it turns out to be unacceptable after all.
When creating the array (or after it's created), sort your "array of unique elements" by key value. Then use binary search rather than a hash or a linear search. You get a simple implementation, no extra memory usage, and good performance.
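A sketch of that suggestion (the field and function names are made up): sort the ~100 elements by key once, then look them up with std::lower_bound in O(log n) with no extra memory:

#include <algorithm>
#include <vector>

struct Element {
    int key;
    // ... other small fields ...
};

// Assumes the vector is already sorted by key.
const Element* find(const std::vector<Element>& sortedByKey, int key) {
    auto it = std::lower_bound(sortedByKey.begin(), sortedByKey.end(), key,
                               [](const Element& e, int k) { return e.key < k; });
    return (it != sortedByKey.end() && it->key == key) ? &*it : nullptr;
}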

Extracting initial seed value of a PRNG?

I recently read that you can predict the outcomes of a PRNG if you:
Know what algorithm is being used.
Have consecutive data points.
Is it possible to figure out the seed used for a PRNG from only data points?
I managed to find a paper by Kelsey et al which details the different types of attack and also summarises some real-world examples. It seems most attacks rely on similar techniques to those against cryptosystems, and in most cases actually taking advantage of the fact that the PRNG is used in a cryptosystem.
With "enough" data points that are the absolute first data points generated by the PRNG with no gaps, sure. Most PRNG functions are invertible, so just work backwards and you should get the seed.
For example, the typical return seed = (seed*A + B) % N can be stepped backwards with return seed = ((seed - B) * Ainv) % N, where Ainv is the multiplicative inverse of A modulo N (it exists whenever gcd(A, N) = 1); plain integer division by A does not work under the modulus.
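Here is a small worked example (in C++) of stepping such a generator backwards; the constants are tiny and purely illustrative, nothing like a real library's LCG:

#include <cstdint>
#include <iostream>

// One step of a toy linear congruential generator: seed' = (seed*A + B) mod N.
constexpr std::int64_t A = 17, B = 43, N = 1000003;   // N prime, gcd(A, N) = 1

std::int64_t step(std::int64_t seed) { return (seed * A + B) % N; }

// Modular inverse via the extended Euclidean algorithm (valid since gcd(a, n) = 1).
std::int64_t modInverse(std::int64_t a, std::int64_t n) {
    std::int64_t t = 0, newT = 1, r = n, newR = a % n;
    while (newR != 0) {
        std::int64_t q = r / newR;
        std::int64_t tmpT = t - q * newT; t = newT; newT = tmpT;
        std::int64_t tmpR = r - q * newR; r = newR; newR = tmpR;
    }
    return (t % n + n) % n;
}

std::int64_t unstep(std::int64_t next) {               // recover the previous seed
    return ((next - B + N) % N) * modInverse(A, N) % N;
}

int main() {
    std::int64_t seed = 123456;
    std::int64_t next = step(seed);
    std::cout << unstep(next) << '\n';                 // prints 123456
}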
It's always theoretically possible, if you're "allowed" to brute force all possible values for the seed, and if you have enough data points that there's only one seed that could have produced that output. If the PRNG was seeded with the time, and you know roughly when that happened, then this might be very fast since there aren't many plausible values to try. If the PRNG was seeded with data from a truly random source having 64 bits of entropy, then this approach is computationally infeasible.
Whether there are other techniques depends on the algorithm. For example doing this for Blum Blum Shub is equivalent to integer factorization, which is generally believed to be a hard computational problem. Other, faster PRNGs might be less "secure" in this sense. Any PRNG used for crypto purposes, for example in a stream cipher, pretty much needs there to be no known feasible way of doing it.

How to improve maintainability of functions

I will expand here on a comment I made to When a method has too many parameters? where the OP was having minor problems with someone else's function which had 97 parameters.
I am a great believer in writing maintainable code (and code is often easier to write than to read, hence Steve McConnell's (praise be upon his name) phrase "write-only code").
Since statistics show that most car accidents happen at junctions, and my experience (YMMV) shows that most "anomalies" occur at interfaces, I will list some things that I do to attempt to avoid misunderstandings at interfaces and invite your comments if I am going badly wrong.
But, more importantly, I invite your suggestions for making things even more prophylactic (see, there is a question after all - how to improve things?).
Adequate documentation, in the form of (up-to-date) Doxygen-format comments describing the nature and purpose of each parameter.
Absolutely NO back-door shenanigans with global variables as hidden parameters.
Try to limit parameters to six or eight. If more, pass related parameters as a structure (see the sketch just after this question); if they are not related then seriously reconsider the function. If it needs so much information, is it too complex to maintain? Can it be broken down into several smaller functions?
Use const as often as it is possible and meaningful.
A coding standard that says that input parameters come first, then output-only parameters, and finally input/output parameters, which are modified by the function.
I also #define some empty macros to make declarations even easier to read:
#define INPUT
#define OUTPUT
#define MODIFY
bool DoSomething(INPUT int howOften, MODIFY Widget *myWidget, OUTPUT WidgetPtr * const nextWidget)
Just a few ideas. How can I improve on these? Thanks.
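As a minimal illustration of the "pass related parameters as a structure" suggestion above (all names here are made up):

struct Widget;                      // whatever the real widget type is

struct WidgetPlacement {            // groups parameters that travel together
    int x = 0;
    int y = 0;
    int layer = 0;
    bool visible = true;
};

// instead of: bool PlaceWidget(int x, int y, int layer, bool visible, Widget *w);
bool PlaceWidget(const WidgetPlacement &placement, Widget *widget);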
Addressing your points in order:
Well-designed types usually render Doxygen format comments a waste of time.
While true as stated ("shenanigans" are bad by definition), not all use of globals is really as bad as many people imply. If you have to pass a parameter more than about four times before it's really used, chances are that a global will be less error prone.
Eight or even six parameters is usually excessive. Any more than two or three starts to indicate that the function is doing more than one thing. One obvious exception is a constructor that aggregates a number of other items into an object (e.g. an address object that takes a street name, number, city, country, postal code, etc., as inputs).
Better stated as "write const-correct code."
Given C++'s default parameter capability, it's generally best to sort in ascending order of likelihood to use a default value.
Don't. Just don't! If it's not obvious which parameters are inputs and which are outputs, that pretty much proves that the basic design is fatally flawed.
As for ideas I think are actually good:
As implied in the first point, concentrate on types. Once you get them right, most of the other problems just disappear.
Use a few (even just one) central theme(s). For Lisp, everything is a list. For Unix, everything is a file (and files are all simple streams of bytes). Emulate this simplicity.
Edit: replying to comments:
While you do have something of a point, my experience still indicates that documentation produced with Doxygen (and similar such as javadoc) is almost universally useless. In theory the tool doesn't prevent decent documentation, but in fact it's rare at best.
Globals certainly can cause problems -- but I'm old enough to have used Fortran back before it provided much alternative, and with some care it really wasn't nearly as bad as many people imply. A lot of the stories seem to be at least third hand, with a bit of extra "spice" added each time they're re-told. I've seen one story that sounds a lot like an exaggerated version of one I told a couple decades ago or so...
Hm...Markdown formatting doesn't seem to approve of my skipping numbers.
And again...
My comment was specific to C++, but quite a few other languages also support default parameters and/or overloading, and it can apply about as well to most of them. Even without it, a call like f(param1, param2, 0, 0, 0); is pretty easy to see as having default parameters. To an extent, ordering by usage is handy, but when you do, the order you pick doesn't matter nearly as much as simply being consistent.
True, a void * parameter doesn't tell you much -- but a MODIFY void * is little better. A real type and consistent use of const provides far more information and gets checked by the compiler. Other languages may not have/use const, but they probably don't have macros either. OTOH, some directly support what you want -- e.g., Ada has in, out and inout specifiers.
I am not sure we will end at a single point of agreement about this; everyone will come up with different ideas (good or bad in each other's perspective). Having said that, I find Code Complete to be a good place to go when I am stuck with this sort of problem.
A big peeve of mine is control coupling between functions. (Control coupling is when one module controls the execution flow of another, by passing flags telling the called function what to do.)
For example (cut & paste from code I just had to work on):
void UartEnable(bool enable, int baud);
as opposed to:
void UartEnable(int baud);
void UartDisable(void);
Put another way -- parameters are for passing "data", not "control".
I'd use the 'rule' put forward by Uncle Bob in his book Clean Code.
These are the ones I think I remember:
Two parameters are OK, three are bad, more need refactoring.
Comments are a sign of bad names, so there should be none; the purpose of the function and the parameters should be clear from the names.
Make the method short. Aim for fewer than 10 lines of code.

What statistics can be maintained for a set of numerical data without iterating?

Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
Count
Sum
Mean
Max*
Min*
Median**
Less Obvious
Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
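As a sketch of how the "less obvious" entries above (variance, and standard deviation with it) can be maintained under both additions and removals, carry the running sum and sum of squares alongside the count. The class and method names are just illustrative. This is exact in real arithmetic; in floating point the sum-of-squares form can lose precision when the variance is tiny relative to the mean, which is why add-only code usually prefers Welford's algorithm:

#include <cmath>
#include <cstddef>

// Count, sum and sum of squares make mean, variance and standard deviation
// O(1) reads, with O(1) updates on every add/remove.
class RunningStats {
public:
    void add(double x)    { ++n_; sum_ += x; sumSq_ += x * x; }
    void remove(double x) { --n_; sum_ -= x; sumSq_ -= x * x; }

    // Callers must ensure the collection is non-empty.
    double mean() const     { return sum_ / n_; }
    double variance() const {                         // population variance
        double m = mean();
        return sumSq_ / n_ - m * m;
    }
    double stddev() const   { return std::sqrt(variance()); }

private:
    std::size_t n_ = 0;
    double sum_ = 0.0, sumSq_ = 0.0;
};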
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
void Add(double value) {
values.Add(value);
sum += value;
}
void Remove(double value) {
values.Remove(value);
sum -= value;
}
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.
First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that median and mode can not be calculated online.
To consistently maintain the high/low, you store your data in sorted order. There are algorithms for maintaining data structures which preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
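A sketch of that frequency-table idea (names made up): additions keep the mode current in O(1), but removing the current mode can force a rescan of the table, which is the "fairly complex data structure" caveat from the question's footnote:

#include <unordered_map>

// Frequency-table mode tracker: add() is O(1); remove() is O(1) unless the
// removed value was the current mode, in which case the table is rescanned.
class ModeTracker {
public:
    void add(int x) {
        int c = ++counts_[x];
        if (c > modeCount_) { mode_ = x; modeCount_ = c; }
    }

    void remove(int x) {
        auto it = counts_.find(x);
        if (it == counts_.end()) return;              // value not present
        if (--it->second == 0) counts_.erase(it);
        if (x == mode_) recomputeMode();              // O(distinct values)
    }

    int mode() const { return mode_; }                // meaningless if empty

private:
    void recomputeMode() {
        mode_ = 0;
        modeCount_ = 0;
        for (const auto& kv : counts_)
            if (kv.second > modeCount_) { mode_ = kv.first; modeCount_ = kv.second; }
    }

    std::unordered_map<int, int> counts_;
    int mode_ = 0;
    int modeCount_ = 0;
};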
The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.
As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.
Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?
For high value for example:
public void Add(double value) {
values.Add(value);
if (value > highValue)
highValue = value;
}
public void Remove(double value) {
values.Remove(value);
if (value.WithinTolerance(highValue))
highValue = RecalculateHighValueByIteration();
}
It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
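For completeness, a sketch of the balanced-tree variant using std::multiset; a std::priority_queue is cheaper if you only ever pop the extreme value, but it doesn't support removing an arbitrary element, which this question needs:

#include <set>

// Balanced search tree (std::multiset): add/remove in O(log n), with the
// current minimum and maximum read off the ends of the ordering.
class MinMaxTracker {
public:
    void add(double x)    { values_.insert(x); }
    void remove(double x) {
        auto it = values_.find(x);
        if (it != values_.end()) values_.erase(it);   // erase one copy only
    }
    // Callers must ensure the collection is non-empty.
    double min() const { return *values_.begin(); }
    double max() const { return *values_.rbegin(); }

private:
    std::multiset<double> values_;
};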
If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into the techniques used in Streaming Algorithms.
It does sound (even after your 2nd edit) like you are describing on-line algorithms, with the additional requirement that you want to allow "delete" operations. An example of this is the "sketch algorithms" used for finding frequent items in a stream.