Optimal DB query for prefix search - mysql

I have a dataset which is a list of prefix ranges, and the prefixes aren't all the same size. Here are a few examples:
low: 54661601 high: 54661679 "bin": a
low: 526219100 high: 526219199 "bin": b
low: 4305870404 high: 4305870404 "bin": c
I want to look up which "bin" a particular value falls into based on its prefix. For example, the value 5466160179125211 would correspond to "bin" a. In the case of overlaps (of which there are few), we could return either the longest prefix or all matching prefixes.
The optimal algorithm is clearly some sort of tree into which the bin objects could be inserted, where each successive level of the tree represents more and more of the prefix.
The question is: how do we implement this (in one query) in a database? It is permissible to alter/add to the data set. What would be the best data & query design for this? An answer using mongo or MySQL would be best.

If you make a mild assumption about the number of overlaps in your prefix ranges, it is possible to do what you want optimally using either MongoDB or MySQL. In my answer below, I'll illustrate with MongoDB, but it should be easy enough to port this answer to MySQL.
First, let's rephrase the problem a bit. When you talk about matching a "prefix range", I believe what you're actually talking about is finding the correct range under a lexicographic ordering (intuitively, this is just the natural alphabetic ordering of strings). For instance, the set of numbers whose prefix matches 54661601 to 54661679 is exactly the set of numbers which, when written as strings, are lexicographically greater than or equal to "54661601", but lexicographically less than "54661680". So the first thing you should do is add 1 to all your high bounds, so that you can express your queries this way. In mongo, your documents would look something like
{low: "54661601", high: "54661680", bin: "a"}
{low: "526219100", high: "526219200", bin: "b"}
{low: "4305870404", high: "4305870405", bin: "c"}
Now the problem becomes: given a set of one-dimensional intervals of the form [low, high), how can we quickly find which interval(s) contain a given point? The easiest way to do this is with an index on either the low or high field. Let's use the high field. In the mongo shell:
db.coll.ensureIndex({high : 1})
For now, let's assume that the intervals don't overlap at all. If this is the case, then for a given query point "x", the only possible interval containing "x" is the one with the smallest high value greater than "x". So we can query for that document and check if its low value is also less than "x". For instance, this will print out the matching interval, if there is one:
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(1).forEach(
    function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
Suppose now that instead of assuming the intervals don't overlap at all, you assume that every interval overlaps with less than k neighboring intervals (I don't know what value of k would make this true for you, but hopefully it's a small one). In that case, you can just replace 1 with k in the "limit" above, i.e.
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(k).forEach(
    function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
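For completeness, a rough Python equivalent of the same query using pymongo (a sketch; the connection, database and collection names, the query value, and the bound k are all placeholders):
from pymongo import MongoClient

client = MongoClient()              # connection details are assumed
coll = client.test.coll             # database/collection names are placeholders

value = "5466160179125211"
k = 3                               # assumed bound on overlapping intervals

# Same plan as the shell query: scan the k smallest `high` values above the
# query point and keep the documents whose `low` is also below it.
matches = [doc for doc in coll.find({"high": {"$gt": value}}).sort("high", 1).limit(k)
           if doc["low"] <= value]
print(matches)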
What's the running time of this algorithm? The indexes are stored using B-trees, so if there are n intervals in your data set, it takes O(log n) time to lookup the first matching document by high value, then O(k) time to iterate over the next k documents, for a total of O(log n + k) time. If k is constant, or in fact anything less than O(log n), then this is asymptotically optimal (this is in the standard model of computation; I'm not counting number of external memory transfers or anything fancy).
The only case where this breaks down is when k is large, for instance if some large interval contains nearly all the other intervals. In this case, the running time is O(n). If your data is structured like this, then you'll probably want to use a different method. One approach is to use mongo's "2d" indexing, with your low and high values encoding x and y coordinates. Then your queries would correspond to querying for points in a given region of the x-y plane. This might do well in practice, although with the current implementation of 2d indexing, the worst case is still O(n).
There are a number of theoretical results that achieve O(log n) performance for all values of k. They go by names such as priority search trees, segment trees, interval trees, etc. However, these are special-purpose data structures that you would have to implement yourself. As far as I know, no popular database currently implements them.

"Optimal" can mean different things to different people. It seems that you could do something like save your low and high values as varchars. Then all you have to do is
select bin from datatable where '5466160179125211' between low and high
Or if you had some reason to keep the values as integers in the table, you could do the CASTing in the query.
I have no idea whether this would give you terrible performance with a large dataset. And I hope I understand what you want to do.

With MySQL you may have to use a stored procedure, which you call to map a value to its bin. That procedure would query the list of buckets for each row and do arithmetic or string operations to find the matching bucket. You could improve this design by using fixed-length prefixes arranged in a fixed number of layers: assign a fixed depth to your tree and give each layer its own table. You won't get tree-like performance with either of these approaches.
If you want to do something more sophisticated, I suspect you have to use a different platform.
SQL Server has a hierarchyid data type:
http://technet.microsoft.com/en-us/library/bb677173.aspx
PostgreSQL has a cidr data type. I'm not familiar with the level of query support it has, but in theory you could build a routing table inside of your db and use that to assign buckets:
http://www.postgresql.org/docs/7.4/static/datatype-net-types.html#DATATYPE-CIDR

Peyton! :)
If you need to keep everything as integers, and want it to work with a single query, this should work:
select bin from datatable where 5466160179125211 between
    low*pow(10, floor(log10(5466160179125211))-floor(log10(low)))
    and ((high+1)*pow(10, floor(log10(5466160179125211))-floor(log10(high)))-1);
In this case, it would search between the numbers 5466160100000000 (the lowest number with the low prefix & the same number of digits as the number to find) and 5466167999999999 (the highest number with the high prefix & the same number of digits as the number to find). This should still work in cases where the high prefix has more digits than the low prefix. It should also work (I think) in cases where the number is shorter than the length of the prefixes, where the varchar code in the previous solution can give incorrect results.
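To see the arithmetic, here is the same scaling worked through in Python (a sketch mirroring the SQL above, using the example values):
from math import floor, log10

value, low, high = 5466160179125211, 54661601, 54661679

# Scale the prefixes up to the number of digits of the value being looked up.
lo_bound = low * 10 ** (floor(log10(value)) - floor(log10(low)))
hi_bound = (high + 1) * 10 ** (floor(log10(value)) - floor(log10(high))) - 1

print(lo_bound)                       # 5466160100000000
print(hi_bound)                       # 5466167999999999
print(lo_bound <= value <= hi_bound)  # True -> bin "a"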
You'll want to experiment to compare the performance of having a lot of inline math in the query (as in this solution) vs. the performance of using varchars.
Edit: Performance seems to be really good either way even on big tables with no indexes; if you can use varchars then you might be able to further boost performance by indexing the low and high columns. Note that you'd definitely want to use varchars if any of the prefixes have initial zeroes. Here's a fix to allow for the case where the number is shorter than the prefix when using varchars:
select * from datatable2 where '5466' between low and high
and length('5466') >= length(high);

Related

Data structure/Algorithm to manage non-overlapping ranges of values?

I'm working on a system that has tens of thousands of flags for a user. The flags are all sequential in number, 0 through X, whatever X ends up being. X is expected to grow over time, and we're expecting to have lots and lots of users as well.
Our primary concerns are:
Being able to quickly test whether the user has set any given flag.
Being able to quickly set a flag.
Being able to optimize the data storage to as small a size as possible.
With 10k flags, we're looking at around 1k of data per user, in memory, if we use a bit vector. Which might be too much. And to make matters worse, this is in Javascript, being stored in a document database serialized as JSON, which means that we have several storage options, none of which I particularly like.
Store flags as the JSON output of the Uint32Array object. Looks like: "{"0":10,"1":4294967295}". Unfortunately needs an average of 17 bytes per 4 bytes stored as the flags approach their filled state, which is over 4x the memory, and leads to about 5k of memory when serialized. This is not ideal.
Perform our own serialization of the JSON, using base64 to avoid the bloated sizes of the numbers-as-strings approach. Unfortunately that adds an extra processing step to the JSON input/output phases which complicates things because now we have to modify our data during the process and will slow everything down.
So... putting aside the bitvector idea for a bit. I was wondering if there's possibly a better approach. I considered using an "array of ranges", something like:
[{"m":0,"x":100},{"m":102},{"m":108,"x":204}]
We can make a few assumptions about the data in this system, which is what led me to this approach:
Flags are never un-set. Once it's set, it will remain set.
Flags are generally clustered. If flag X is set, there's a huge probability that X-1 and X+1 will be set as well.
Flags will generally be set at increasing index values. If Flag X is being set, then X-1 is more likely to be set than X+1, and X+1 is likely to be set fairly soon afterwards.
So because of these conditions, I think storing an array of range objects might be the optimal solution. That way, over time, the user's flags eventually condense down into one large range entry. The optimal case is of course:
[{"m":0,"x":10000}]
The worst case scenario, of course, is if they somehow manage to find themselves in a state where they set every other flag.
[{"m":0},{"m":2},{"m":4},{"m":6},{"m":8},{"m":10}...{"m":10000}]
That would be bad. Far worse than the bitvector solution, I think. But we're pretty confident that won't happen.
So, as to the ability to quickly decide if a flag is set: that's simply an O(log n) binary search (since the array will be sorted); just find the range object closest to your number, check to see if your number is in that range, and return.
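For illustration, the lookup might look something like this in Python with bisect (a sketch, not the actual JavaScript; ranges holds (m, x) tuples, with single flags stored as (m, m)):
import bisect

def is_set(ranges, flag):
    # Rightmost range whose m is <= flag, then check the upper bound.
    i = bisect.bisect_right(ranges, (flag, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= flag <= ranges[i][1]

ranges = [(0, 100), (102, 102), (108, 204)]
print(is_set(ranges, 55), is_set(ranges, 101), is_set(ranges, 150))  # True False True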
Insertions are trickier. It'll still be a binary search, but now we're modifying the array; there are three cases (a sketch covering them follows the list).
One adjacent sibling insert: the optimal scenario. We find a range where the min or max is one off from the number we're inserting, and simply decrement or increment the bound on that range. O(1)
No adjacent siblings insert: simply insert a new node with the min set. O(n), because we'll be moving everything after it in the array downwards.
Two adjacent siblings insert: Change the max to the max value of the right sibling range, delete the right sibling range from the array and shift everything after it to the left. O(n).
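A sketch of the insertion covering the three cases above (again Python rather than JavaScript; set_flag is a made-up name, and ranges holds mutable [m, x] pairs):
import bisect

def set_flag(ranges, flag):
    i = bisect.bisect_right(ranges, [flag, float("inf")])
    left = ranges[i - 1] if i > 0 else None
    right = ranges[i] if i < len(ranges) else None

    if left and left[0] <= flag <= left[1]:
        return                                    # already set
    touches_left = left is not None and flag == left[1] + 1
    touches_right = right is not None and flag == right[0] - 1

    if touches_left and touches_right:            # case 3: bridge two ranges, O(n)
        left[1] = right[1]
        ranges.pop(i)
    elif touches_left:                            # case 1: extend a neighbour, O(1)
        left[1] = flag
    elif touches_right:                           # case 1: extend a neighbour, O(1)
        right[0] = flag
    else:                                         # case 2: brand-new range, O(n)
        ranges.insert(i, [flag, flag])

ranges = [[0, 100], [102, 102], [108, 204]]
set_flag(ranges, 101)                             # bridges the first two ranges
print(ranges)                                     # [[0, 102], [108, 204]]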
So cases 2+3 have me wondering if I shouldn't try to use some sort of balanced binary search tree for this. A Red-Black tree, for example.
Is that worth the bother? Am I overthinking this?

Which DB to choose for finding best matching records?

I'm storing an object in a database described by a lot of integer attributes. The real object is a little bit more complex, but for now let's assume that I'm storing cars in my database. Each car has a lot of integer attributes to describe it (i.e. maximum speed, wheelbase, maximum power, etc.) and these are searchable by the user. The user defines a preferred range for each of the attributes, and since there are a lot of attributes there most likely won't be any car matching all the attribute ranges. Therefore the query has to return a number of cars sorted by the best match.
At the moment I implemented this in MySQL using the following query:
SELECT *, SQRT( POW((a < min_a)*(min_a - a) + (a > max_a)*(a - max_a), 2) +
                POW((b < min_b)*(min_b - b) + (b > max_b)*(b - max_b), 2) +
                ... ) AS `match`
WHERE a > (min_a - max_allowable_deviation) AND a < (max_a + max_allowable_deviation) AND ...
ORDER BY `match` ASC
where a and b are attributes of the object and min_a, max_a, min_b and max_b are user defined values. Basically the match is the square root of the sum of the squared differences between the desired range and the real value of the attribute. A value of 0 meaning a perfect match.
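For reference, the same scoring written out in Python (a sketch of the distance formula above; the attribute names and values are illustrative):
from math import sqrt

def match_score(car, prefs):
    # prefs maps attribute -> (min, max); the distance is 0 inside the range,
    # otherwise the shortfall/overshoot, combined Euclidean-style as in the query.
    total = 0.0
    for attr, (lo, hi) in prefs.items():
        v = car[attr]
        d = (lo - v) if v < lo else (v - hi) if v > hi else 0.0
        total += d * d
    return sqrt(total)

car = {"max_speed": 210, "wheelbase": 2700, "max_power": 150}
prefs = {"max_speed": (200, 250), "wheelbase": (2750, 2900), "max_power": (140, 180)}
print(match_score(car, prefs))   # 50.0 -- only the wheelbase misses its range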
The table contains a couple of million records and the WHERE clause is only introduced to limit the number of records the calculation is performed on. An index is placed on all of the queryable columns, and the query takes about 500 ms. I'd like to improve that number and I'm looking into ways to improve this query.
Furthermore I am wondering whether there would be a different database better suited to perform this job. Moreover I'd very much like to change to a NoSQL database, because of its more flexible data scheme options. I've been looking into MongoDB, but couldn't find a way to solve this problem efficiently (fast).
Is there any database better suited for this job than MySQL?
Take a look at R-trees. (The pages on specific variants go into a lot more detail and present pseudocode.) These data structures allow you to query by a bounding rectangle, which is exactly what your problem of searching by ranges on each attribute is.
Consider your cars as points in n-dimensional space, where n is the number of attributes that describe your car. Then, given n ranges, each describing an attribute, the problem is to find all the points contained in that n-dimensional hyperrectangle. R-trees support this query efficiently. MySQL implements R-trees for their spatial data types, but MySQL only supports two-dimensional space, which is insufficient for you. I'm not aware of any common databases that support n-dimensional R-trees off the shelf, but you can take some database with good support for user-defined tree data structures and implement R-trees yourself on top of that. For example, you can define a structure for an R-tree node in MongoDB, with child pointers. You will then implement the R-tree algorithms in your own code while letting MongoDB take care of storing the data.
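For reference, the query an R-tree accelerates is just hyperrectangle containment; here is a brute-force Python sketch of the predicate (illustrative data, no index involved):
def in_hyperrectangle(point, lows, highs):
    # point, lows and highs are equal-length sequences of attribute values.
    return all(lo <= p <= hi for p, lo, hi in zip(point, lows, highs))

cars = [(210, 2700, 150), (240, 2800, 170)]
lows, highs = (200, 2750, 140), (250, 2900, 180)
print([c for c in cars if in_hyperrectangle(c, lows, highs)])  # [(240, 2800, 170)]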
Also, there's this C++ header file implementing an R-tree, but currently it's only an in-memory structure. Though if your data set is only a few million rows, it seems feasible to just load this in-memory structure upon startup and update it whenever new cars are added (which I assume is infrequent).
Text search engines, such as Lucene, meet your requirements very well. They allow you to "boost" hits depending on how they were matched, e.g. you can define engine size to be considered a "better match" than wheelbase. Using Lucene is really easy and, above all, it's SUPER FAST. Way faster than MySQL.
MySQL offers a plugin to provide text-based searching, but I prefer to use Lucene separately; that way it's easily scalable (being read-only, you can have multiple Lucene engines) and easily manageable.
Also check out Solr, which sits on top of Lucene and allows you to store, retrieve and search for simple Java objects (lists, arrays, etc.).
Likely, your indexes aren't helping much, and I can't think of another database technology that's going to be significantly better. A few things to try with MySQL....
I'd try putting a copy of the data in a memory table. At least the table scans will be in memory....
http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html
If that doesn't work for you or help much, you could also try a User Defined Function to optimize the calculation of the matching. Basically, this means executing the range testing in a C library you provide:
http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html

Easiest way to find the correct kademlia bucket

In the Kademlia protocol node IDs are 160 bit numbers. Nodes are stored in buckets, bucket 0 stores all the nodes which have the same ID as this node except for the very last bit, bucket 1 stores all the nodes which have the same ID as this node except for the last 2 bits, and so on for all 160 buckets.
What's the fastest way to find which bucket I should put a new node into?
I have my buckets simply stored in an array, and need a method like so:
Bucket[] buckets; //array with 160 items

public Bucket GetBucket(Int160 myId, Int160 otherId)
{
    //some stuff goes here
}
The obvious approach is to work down from the most significant bit, comparing bit by bit until I find a difference, I'm hoping there is a better approach based around clever bit twiddling.
Practical note:
My Int160 is stored in a byte array with 20 items, solutions which work well with that kind of structure will be preferred.
Would you be willing to consider an array of 5 32-bit integers? (or 3 64-bit integers)? Working with whole words may give you better performance than working with bytes, but the method should work in any case.
XOR the corresponding words of the two node IDs, starting with the most significant. If the XOR result is zero, move on to the next most significant word.
Otherwise, find the most significant bit that is set in this XOR result using the constant-time method from Hacker's Delight. This algorithm returns 32 (or 64) if the most significant bit is set, 1 if the least significant bit is set, and so on. This index, combined with the index of the current word, will tell you which bit is different.
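The same idea in Python, where arbitrary-precision integers let you skip the explicit word-by-word loop (a sketch; in Java/C# you would do the per-word XOR and leading-zero count described above, and bucket_index is a made-up name):
def bucket_index(my_id: bytes, other_id: bytes) -> int:
    # XOR the two 160-bit IDs and take the position of the most significant
    # differing bit: bit 0 -> bucket 0, bit 1 -> bucket 1, ..., bit 159 -> bucket 159.
    x = int.from_bytes(my_id, "big") ^ int.from_bytes(other_id, "big")
    if x == 0:
        raise ValueError("identical IDs")
    return x.bit_length() - 1

a = bytes(19) + bytes([0b00000101])   # differs from b only in the low bits
b = bytes(20)
print(bucket_index(a, b))             # 2 -> highest differing bit is bit 2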
For starters you could compare byte-by-byte (or word-by-word), and when you find a difference search within that byte (or word) for the first bit of difference.
It seems vaguely implausible to me that adding a node to an array of buckets will be so fast that it matters whether you do clever bit-twiddling to find the first bit of difference within a byte (or word), or just churn in a loop up to CHAR_BIT (or something). Possible, though.
Also, if IDs are essentially random with uniform distribution, then you will find a difference in the first 8 bits about 255/256 of the time. If all you care about is average-case behaviour, not worst-case, then just do the stupid thing: it's very unlikely that your loop will run for long.
For reference, though, the first bit of difference between numbers x and y is the first bit set in x ^ y. If you were programming in GNU C, __builtin_clz might be your friend. Or possibly __builtin_ctz, I'm kind of sleepy...
Your code looks like Java, though, so I guess the bitfoo you're looking for is integer log.

Why should hash functions use a prime number modulus?

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".
What do you expect from a $1.25 book?
Anyway, I've had years to think about the nature of math, and still can't figure it out.
Is the distribution of numbers truly more even when there are a prime number of buckets?
Or is this an old programmer's tale that everyone accepts because everybody else accepts it?
Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), and multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:
(first char) + k * (second char) + k^2 * (third char) + ...
Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k, at least until the integer type overflows.
[As an example, Java's string hashCode is eerily similar to this - it processes the characters in reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]
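To see those "relationships modulo 31" concretely, here is a small Python check of a Java-style polynomial hash (hash31 is a made-up name for the sketch):
def hash31(s):
    # Java-style: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    h = 0
    for ch in s:
        h = h * 31 + ord(ch)
    return h

# Strings that end with the same character agree modulo 31, so a table whose
# bucket count shares a factor with 31 lumps them into the same buckets.
for word in ("format", "caveat", "habitat"):
    print(word, hash31(word) % 31)    # same value for all three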
A hashtable works by taking the modulus of the hash over the number of buckets.
It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.
Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.
It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.
But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.
In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and prime bucket count could help further.
The principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.
[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is provided it isn't 0.]
The first thing you do when inserting into or retrieving from a hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 'statements' that you most probably have read somewhere:
If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
But if you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.
And here is the proof.
Suppose your hashCode function results in the following hashCodes, among others: {x, 2x, 3x, 4x, 5x, 6x...}. Then all of these are going to be clustered in just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:
Make sure that you don't generate too many hashCodes that are multiples of another hashCode, like in {x, 2x, 3x, 4x, 5x, 6x...}. But this may be kind of difficult if your hashTable is supposed to have millions of entries.
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.
From - http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html
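A quick numerical check of that clustering claim (a Python sketch with arbitrary numbers):
from math import gcd

table_length = 12
x = 8                                          # every hashCode is a multiple of x
buckets = {(x * n) % table_length for n in range(1, 1000)}

print(sorted(buckets))                                        # [0, 4, 8]
print(len(buckets), table_length // gcd(table_length, x))     # 3 3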
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
Pretty clear explanation, with pictures too.
Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example given a string, multiplying each letter value with the prime number and then adding those all up will give you its hash value.
A better question would be, why exactly the number 31?
Just to put down some thoughts gathered from the answers.
Hashing uses modulus so any value can fit into a given range
We want to randomize collisions
Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value
To randomize collisions, avoid using the base (10 in decimal, 16 in hex) as the modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1 shows a clear pattern in the hash value distribution: values with the same last digit will collide
Avoid using powers of the base (10^2, 10^3, 10^n) as the modulus, because that also creates a pattern: values with the same last n digits will collide
Actually, avoid using anything that has factors other than itself and 1 as the modulus, because it creates a pattern: multiples of a factor will be hashed into a restricted set of values
For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6
12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9
This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high
So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread
tl;dr
index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.
Primes are used because you have good chances of obtaining a unique value for a typical hash-function which uses polynomials modulo P.
Say, you use such hash-function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field => prime number). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash-table of strings which have length 5-10, and is small enough to use for calculations.
Just to provide an alternate viewpoint there's this site:
http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth
Which contends that you should use the largest number of buckets possible as opposed to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument for this.
It depends on the choice of hash function.
Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).
You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table such a common factor is highly unlikely.
On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).
Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.
I believe that it just has to do with the fact that computers work in base 2. Just think about how the same thing works for base 10:
8 % 10 = 8
18 % 10 = 8
87865378 % 10 = 8
It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.
Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.
"The nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, just taken to the modulus. This means every multiplication maps to a different integer modulo the prime, so does every addition.
Prime moduli are advantageous because:
They give the most freedom when choosing the secondary multiplier in secondary hashing, all multipliers except 0 will end up visiting all elements exactly once
If all hashes are less than the modulus there will be no collisions at all
Random primes mix better than power of two moduli and compress the information of all the bits not just a subset
They however have a big downside: they require an integer division, which takes many (~15-40) cycles, even on a modern CPU. With around half the computation one can make sure the hash is mixed up very well. Two multiplications and xorshift operations will mix better than a prime modulus. Then we can use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power-of-2 table sizes and around 9 operations for arbitrary sizes.
I recently looked at many of the fastest hash table implementations and most of them don't use prime moduli.
The distribution of the hash table indices is mainly dependent on the hash function in use. A prime modulus can't fix a bad hash function and a good hash function does not benefit from a prime modulus. There are cases where they can be advantageous however. It can mend a half-bad hash function, for example.
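As an illustration of the "two multiplications and xorshift" alternative mentioned above, here is one well-known mixer (the splitmix64 finalizer) followed by a power-of-two reduction, sketched in Python (the constants are splitmix64's; the function names are made up):
MASK64 = (1 << 64) - 1

def mix64(z):
    # Two multiply + xor-shift rounds; mixes all input bits into the low bits.
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def bucket(key_hash, table_size):
    # table_size is a power of two, so the reduction is a single AND.
    return mix64(key_hash) & (table_size - 1)

print(bucket(12345, 1024))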
Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
Given a string "Samuel", you can generate a unique hash by multiplying each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.
However, using primes is an old technique. The key here is to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic: http://www.azillionmonkeys.com/qed/hash.html
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
Suppose your table size (or the number you take the modulo by) is T = B*C. Now if the hash for your input is of the form N*A*B, where N can be any integer, then your output won't be well distributed, because every time N becomes C, 2C, 3C, etc., your output will start repeating; i.e. your output will be distributed over only C positions. Note that C here is (T / HCF(table-size, hash)).
This problem can be eliminated by making HCF 1. Prime numbers are very good for that.
Another interesting case is when T is 2^N. Then the output is exactly the lower N bits of the input hash. Since every number can be represented as a sum of powers of 2, taking any number modulo T throws away all the powers of 2 at positions >= N, so the output always follows a specific pattern that depends only on the low-order bits of the input. This is also a bad choice.
Similarly, T = 10^N is bad as well, for similar reasons (the pattern shows up in the decimal notation of the numbers instead of the binary).
So, prime numbers tend to give a better distributed results, hence are good choice for table size.
I would say the first answer at this link is the clearest answer I found regarding this question.
Consider the set of keys K = {0,1,...,100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:
Keys {0,12,24,36,...} will be hashed to bucket 0.
Keys {3,15,27,39,...} will be hashed to bucket 3.
Keys {6,18,30,42,...} will be hashed to bucket 6.
Keys {9,21,33,45,...} will be hashed to bucket 9.
If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But, what happens if K is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).
This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.
In general:
Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.
From the answer by Mario.
I'd like to add something to Steve Jessop's answer (I can't comment on it since I don't have enough reputation), but I found some helpful material. His answer is very helpful but he made a mistake: the bucket size should not be a power of 2. I'll just quote from the book "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al. on page 263:
When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
Hope it helps.
This question was merged with the more appropriate question, why hash tables should use prime sized arrays, and not power of 2.
For hash functions itself there are plenty of good answers here, but for the related question, why some security-critical hash tables, like glibc, use prime-sized arrays, there's none yet.
Generally power-of-2 tables are much faster: the expensive h % n becomes h & bitmask, where the bitmask can be calculated via clz ("count leading zeros") of the size n. A modulo function needs to do an integer division, which is about 50x slower than a logical AND. There are some tricks to avoid a modulo, like using Lemire's https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/, but generally fast hash tables use powers of 2, and secure hash tables use primes.
Why so?
Security in this case is defined by attacks on the collision resolution strategy, which is with most hash tables just linear search in a linked list of collisions. Or with the faster open-addressing tables linear search in the table directly. So with power of 2 tables and some internal knowledge of the table, e.g. the size or the order of the list of keys provided by some JSON interface, you get the number of right bits used. The number of ones on the bitmask. This is typically lower than 10 bits. And for 5-10 bits it's trivial to brute force collisions even with the strongest and slowest hash functions. You don't get the full security of your 32bit or 64 bit hash functions anymore. And the point is to use fast small hash functions, not monsters such as murmur or even siphash.
So if you provide an external interface to your hash table, like a DNS resolver, a programming language, ..., you want to guard against abuse from folks who like to DoS such services. It's normally easier for such folks to shut down your public service with much simpler methods, but it did happen. So people did care.
So the best options to prevent from such collision attacks is either
1) to use prime tables, because then
all 32 or 64 bits are relevant to find the bucket, not just a few.
the hash table resize function is more natural than just doubling. The best growth function is the Fibonacci sequence, and primes come closer to that than doubling.
2) use better measures against the actual attack, together with fast power of 2 sizes.
count the collisions and abort or sleep on detected attacks, which is collision numbers with a probability of <1%. Like 100 with 32bit hash tables. This is what e.g. djb's dns resolver does.
convert the linked list of collisions to trees with O(log n) search instead of O(n) when a collision attack is detected. This is what e.g. Java does.
There's a widespread myth that more secure hash functions help to prevent such attacks, which is wrong as I explained. There's no security with low bits only. This would only work with prime-sized tables, but it would use a combination of the two slowest methods: slow hash plus slow prime modulo.
Hash functions for hash tables primarily need to be small (to be inlinable) and fast. Security can come only from preventing linear search in the collisions. And not to use trivially bad hash functions, like ones insensitive to some values (like \0 when using multiplication).
Using random seeds is also a good option, people started with that first, but with enough information of the table even a random seed does not help much, and dynamic languages typically make it trivial to get the seed via other methods, as it's stored in known memory locations.
For a hash function it's not only important to minimize collisions generally, but also to make it impossible to stay with the same hash while changing a few bytes.
Say you have an equation:
(x + y*z) % key = x with 0<x<key and 0<z<key.
If key is a prime number, n*y=key is true for every n in N and false for every other number.
An example where key isn't a prime:
x=1, z=2 and key=8
Because key/z=4 is still a natural number, 4 becomes a solution for our equation, and in this case (n/2)*y = key is true for every n in N. The number of solutions for the equation has practically doubled because 8 isn't a prime.
If our attacker already knows that 8 is a possible solution for the equation, he can change the file from producing 8 to 4 and still get the same hash.
I've read the popular WordPress post linked in some of the popular answers above. From what I've understood, I'd like to share a simple observation I made.
You can find all the details in the article here, but assume the following holds true:
Using a prime number gives us the "best chance" of an unique value
A general hashmap implementation wants 2 things to be unique.
Unique hash code for the key
Unique index to store the actual value
How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, a prime is involved because it possesses this unique trait of producing unique numbers, which we end up using to ID objects and to find indexes inside the internal container.
Example:
key = "key"
value = "value"
uniqueId = "k" * 31 ^ 2 +
"e" * 31 ^ 1` +
"y"
maps to unique id
Now we want a unique location for our value - so we
uniqueId % internalContainerSize == uniqueLocationForValue, assuming internalContainerSize is also a prime.
I know this is simplified, but I'm hoping to get the general idea through.

What statistics can be maintained for a set of numerical data without iterating?

Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
Count
Sum
Mean
Max*
Min*
Median**
Less Obvious
Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    values.Remove(value);
    sum -= value;
}
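The same maintain-on-update pattern extends to the mean and variance via running sums. A Python sketch rather than the C# above (RunningStats is a made-up name; note the sum-of-squares form can lose precision for very large values):
class RunningStats:
    """Keeps count, sum and sum of squares so mean/variance stay O(1) per update."""
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def remove(self, value):
        self.n -= 1
        self.total -= value
        self.total_sq -= value * value

    def mean(self):
        return self.total / self.n

    def variance(self):                 # population variance
        m = self.mean()
        return self.total_sq / self.n - m * m

stats = RunningStats()
for v in (11, 16, 13, 12):
    stats.add(v)
print(stats.mean())                     # 13.0
stats.remove(16)
print(stats.mean())                     # 12.0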
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and them summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.
First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that median and mode can not be calculated online.
To consistently maintain the high/low you store your data in sorted order. There are algorithms for maintaining data structures which preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.
As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.
Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules so that you only have to recalculate by iteration part of the time, and can cache the correct value the rest of the time. Is this possibly good enough for you?
For high value for example:
public void Add(double value) {
    values.Add(value);
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    values.Remove(value);
    if (value.WithinTolerance(highValue))
        highValue = RecalculateHighValueByIteration();
}
It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
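A sketch of the priority-queue approach in Python, using a heap with lazy deletion so removals don't have to search the heap (MaxTracker is a made-up name; the deferred removals are amortized across later max() calls):
import heapq
from collections import Counter

class MaxTracker:
    def __init__(self):
        self.heap = []                  # negated values: heapq is a min-heap
        self.removed = Counter()

    def add(self, value):
        heapq.heappush(self.heap, -value)

    def remove(self, value):
        self.removed[value] += 1        # defer the real removal

    def max(self):
        # Discard entries that were logically removed before reporting the top.
        while self.heap and self.removed[-self.heap[0]]:
            self.removed[-self.heap[0]] -= 1
            heapq.heappop(self.heap)
        return -self.heap[0]

t = MaxTracker()
for v in (5, 15, 19, 20):
    t.add(v)
t.remove(20)
print(t.max())                          # 19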
If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into techniques used in Streaming Algorithms.
It does sound (even after your 2nd edit) like you are describing online algorithms, with the additional requirement that you want to allow "delete" operations. An example of this is the "sketch algorithms" used for finding frequent items in a stream.