Why isn't randomized probing more popular in hash table implementations? - language-agnostic

According to various sources, such as Wikipedia and various .edu websites found by Google, the most common ways for a hash table to resolve collisions are linear or quadratic probing and chaining. Randomized probing is briefly mentioned but not given much attention. I've implemented a hash table that uses randomized probing to resolve collisions. Assuming there is a collision, resolution works as follows:
The full (32-bit) hash of an object is used to seed a linear congruential random number generator.
The generator generates 32-bit numbers and the modulus is taken to determine where in the hash table to probe next.
This has the very nice property that, regardless of how many hash collisions there are in modulus space, lookup and insertion times are expected to be O(1) as long as there are few collisions in full 32-bit hash space. Because the probe sequence is pseudo-random, no clustering behavior results from modulus space collisions, unlike with linear probing. Because the entire system is open-addressed and doesn't use linked lists anywhere, you don't need to perform a memory allocation on each insertion, unlike chaining.
Furthermore, because the size of the hash is usually the size of the address space (32 bits on 32-bit machines), it is simply impossible to fit enough items in address space to cause large numbers of hash collisions in full 32-bit hash space under a good hashing scheme.
Why, then, is randomized probing such an unpopular collision resolution strategy?
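For concreteness, the scheme described above might look roughly like the following Python sketch. The class and function names are made up for illustration, the LCG constants are the widely published "Numerical Recipes" parameters (the question does not say which generator was used), and resizing and deletion are left out. A prime table size is assumed because the low bits of a power-of-two LCG repeat with a short period.

```python
MASK32 = 0xFFFFFFFF
LCG_A = 1664525        # assumed LCG multiplier (illustrative choice)
LCG_C = 1013904223     # assumed LCG increment (illustrative choice)


def probe_sequence(full_hash, table_size):
    """Yield slot indices: hash % size first, then LCG outputs % size."""
    state = full_hash & MASK32
    yield state % table_size
    while True:
        # The full 32-bit hash is the seed, so keys that collide in
        # modulus space still follow different probe sequences.
        state = (LCG_A * state + LCG_C) & MASK32
        yield state % table_size


class RandomProbeTable:
    """Hypothetical open-addressed table using randomized probing."""

    def __init__(self, size=61):
        self.slots = [None] * size   # each slot: (full_hash, key, value)
        self.count = 0

    def _find(self, key):
        h = hash(key) & MASK32
        for idx in probe_sequence(h, len(self.slots)):
            entry = self.slots[idx]
            # Stop at an empty slot or at the slot already holding this key.
            if entry is None or (entry[0] == h and entry[1] == key):
                return idx, h

    def put(self, key, value):
        if self.count * 2 >= len(self.slots):
            raise RuntimeError("sketch keeps load below 1/2; resizing omitted")
        idx, h = self._find(key)
        if self.slots[idx] is None:
            self.count += 1
        self.slots[idx] = (h, key, value)

    def get(self, key, default=None):
        idx, _ = self._find(key)
        entry = self.slots[idx]
        return default if entry is None else entry[2]


table = RandomProbeTable()
for k in (0, 61, 122):   # all three keys map to slot 0 first
    table.put(k, str(k))
```

Storing the full hash in each slot lets a lookup reject most non-matching entries without comparing keys.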

One of the reasons for using linear lookup (such as double hashing) is cache locality.
By making the second (rehash) function an addition of a small integer, chances are good that you'll hit the same cache line. This is very significant for large hash tables.
Chain hashing is probably used due to its simplicity.

Python's dictionary implementation does this. A very nice comment in dictobject.c says:
...
The first half of collision resolution is to visit table indices via this
recurrence:
j = ((5*j) + 1) mod 2**i
For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...
Sure looks like a linear congruential RNG to me!
Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can't meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.
This is some of the finest code I've ever read. It's featured in chapter 18 of Beautiful Code. So I'd say you're on to something!
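The recurrence quoted from the comment is easy to try out. The sketch below is written in the spirit of that comment rather than copied from dictobject.c: the function names are invented, hash values are assumed non-negative, and the exact point at which perturb is shifted has differed between CPython versions.

```python
PERTURB_SHIFT = 5


def lcg_order(start, i):
    """Slots visited by j = (5*j + 1) mod 2**i, starting from `start`."""
    size = 1 << i
    j = start % size
    order = []
    for _ in range(size):
        order.append(j)
        j = (5 * j + 1) % size
    return order


def perturbed_probes(hash_value, i, count):
    """First `count` slots probed when higher hash bits are mixed in."""
    mask = (1 << i) - 1
    perturb = hash_value          # assumed non-negative in this sketch
    j = hash_value & mask
    probes = [j]
    while len(probes) < count:
        perturb >>= PERTURB_SHIFT
        # Once perturb reaches 0 this is the plain recurrence again,
        # so every slot is eventually visited.
        j = (5 * j + perturb + 1) & mask
        probes.append(j)
    return probes


# Every slot of a 16-entry table is visited exactly once.
assert sorted(lcg_order(3, 4)) == list(range(16))
```

Two hashes that agree in their low bits start at the same slot, but the higher bits pulled in through perturb send them to different slots afterwards.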

Possible reasons are that linear or quadratic probing
have the same worst-case time complexity (O(size of the table))
have the same best-case time complexity (O(1))
are easier to implement
are faster than a good RNG (since speed is a major selling point for hashtables)
But I'm not sure. Did you implement your own hashtable with another collision resolution and compare the two under different circumstances? That would be very enlightening.

Wouldn't you have the problem that for insertions into a non-sparsely populated table there's no guarantee that you'll hit all the elements of the hash table before starting to iterate over duplicate elements?
As a result insertion time wouldn't be well defined.

I think the reason random hashing isn't used much is that hash collisions when a small hash value is computed from a 32-bit hash are apt to be rare unless there's something "wrong" with the hash function, and in that case there's a fair likelihood that all 32 bits of the hash function will match (e.g. because only part of the key was used in computing the hash). If hash functions are decent, and load factors are reasonably low, linear and quadratic probing offer good cache locality (remember that the majority of hash collisions will be resolved by looking at only one extra item, which will with both linear and quadratic probes be the one that follows the first guess). Linear probe offers somewhat better performance in the case where all keys map to the same value, and sometimes even if they map to a small number of values. Chain-bucket hashing allows easy item removal.

Related

Analysis of open addressing

I am currently learning about hash tables from "Introduction to Algorithms", 3rd edition, and got quite confused while trying to understand open addressing from a statistical point of view. Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1. In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m. The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand. Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured. I have never used hash table in practice, maybe I dive too much into the details. But I also have such doubt in hash table's practical usage:
Q: In reality, if the load factor is less than 1, why would we bother open addressing ? Why not project each key to an integer and arrange them in an array ?
Q: In reality, if the load factor is less than 1, why would we bother open addressing? Why not project each key to an integer and arrange them in an array ?
Because in many situations when hash tables are used, there's no good O(1) way to "project each key to an [distinct, not-absurdly-sparse] integer" array index.
A simple thought experiment illustrates this: say you expect the user to type four three-uppercase-letter keys, and you want to store them somewhere in an array with dimension 10. You have 26^3 possible inputs, so no matter what your logic is, on average 26^3/10 of them will "project... to an integer" indicating the same array position. When you realise the "project[ion]" can't avoid potential "collisions", and that projection is logically identical to "hashing" and modding to a "bucket", some collision-handling logic will be needed, and your proposed "alternative" morphs back into a hash table....
Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1.
They are very confusing statements. The "number of hash values" is not arbitrarily limited - you could use a 32 bit hash generating any of ~4 billion hash values, a 512-bit hash, or whatever other size you feel like. Given the structure of your statement is "a > b, i.e. load factor n/m < 1", and "n/m < 1" can be rewritten as "n < m" or "m > n", you imply "a" and "m" are meant to be the same thing, as are "b" and "n":
you're referring to m - which "load factor n/m" requires be the number of buckets in the hash table - as "the possible key value number": it's not, and what could that even mean?
you're referring to n - which "load factor n/m" requires be the number of keys stored in the hash table - as "the number of hash values": it's not, except in the trivial sense of that many (not necessarily distinct) hash values being generated when the keys are hashed
In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m.
Again, that's a very poorly defined statement. The hashing of n keys can identify at most n distinct buckets from which collision-handling would kick in, but those n could begin pretty much anywhere within the m buckets, given the hash function's job is to spray them around. And, so what?
The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand.
Understand what?
Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured.
For sure. "Repeatable randomness" of hashing is a very convenient and tangible benchmark against which specific implementations can be compared.
I have never used hash table in practice, maybe I dive too much into the details. But I also have such doubt in hash table's practical usage:

Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are used in code, like 32, 64, 128 and so on (a very well-known example is Minecraft).
I want to ask: does using binary numbers in high-level languages such as Java or C++ help with anything?
I know assembly, and there you would always rather use these, because in a low-level language things get overcomplicated if you go above the register limit.
Will programs run any faster or use less memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc. will use cache line sizes, page sizes, etc. that are powers of two, so you'd do well to have nice power-of-two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
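A minimal sketch of the bitmask idea, with made-up option names modelled on the ones mentioned above: each option gets its own power of two, so any combination of options fits in a single integer.

```python
# Hypothetical option flags; each one occupies a distinct bit.
RESIZABLE = 1 << 0   # 1
REMOVABLE = 1 << 1   # 2
EDITABLE = 1 << 2    # 4


def has_option(options, flag):
    """True if `flag` is set in the combined `options` value."""
    return (options & flag) != 0


options = RESIZABLE | EDITABLE          # turn two options on
without_edit = options & ~EDITABLE      # turn one option off again

assert has_option(options, EDITABLE)
assert not has_option(options, REMOVABLE)
assert not has_option(without_edit, EDITABLE)
```

This only works because the values are powers of two. With values such as 3 or 5 the bits would overlap and options could not be told apart.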
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round number in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change with its attendant ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but usually memory is allocated by power of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will run the same way, no matter which number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024, ..., they are normally common because of space optimization. For example, an 8-bit pointer can point to 256 (a power of 2) addresses, so if you use fewer than 256 you are wasting your pointer; so normally you allocate a 256-entry buffer. The same works for all other powers of 2.
In most cases the answer is no: there is no noticeable performance difference.
However, there are certain (very few) cases where NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases where you're looping over a structure that fills the cache in such a way that you get cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be pre-optimized for unless your code performs much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

Why should hash functions use a prime number modulus?

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".
What do you expect from a $1.25 book?
Anyway, I've had years to think about the nature of math, and still can't figure it out.
Is the distribution of numbers truly more even when there are a prime number of buckets?
Or is this an old programmer's tale that everyone accepts because everybody else accepts it?
Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), and multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:
(first char) + k * (second char) + k^2 * (third char) + ...
Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k, at least until the integer type overflows.
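To see that claim in action, here is a small Python sketch of the hash written above. The function name is made up, and Python integers never overflow, so the relationship holds for strings of any length here.

```python
def poly_hash(text, k):
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    total = 0
    power = 1
    for ch in text:
        total += ord(ch) * power
        power *= k
    return total


k = 31
words = ["cat", "cow", "cup", "candle"]
# Every term except the first is a multiple of k, so only the first
# character survives when the hash is reduced modulo k.
residues = {poly_hash(w, k) % k for w in words}
assert residues == {ord("c") % k}
```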
[As an example, Java's string hashCode is eerily similar to this - it does the characters reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]
A hashtable works by taking the modulus of the hash over the number of buckets.
It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.
Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.
It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
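A rough way to check this is to count how many buckets a family of related keys can reach. The sketch below assumes the simple polynomial hash from earlier with an illustrative constant k = 6, and uses all three-letter lowercase keys that start with "a". The names are invented for the example.

```python
from itertools import product
from string import ascii_lowercase


def poly_hash(text, k):
    """(first char) + k * (second char) + k^2 * (third char) + ..."""
    return sum(ord(ch) * k**pos for pos, ch in enumerate(text))


def buckets_used(k, bucket_count):
    """Buckets reached by the 676 keys 'aaa', 'aab', ..., 'azz'."""
    keys = ("a" + "".join(rest) for rest in product(ascii_lowercase, repeat=2))
    return {poly_hash(key, k) % bucket_count for key in keys}


# k = 6 and 12 buckets share the factor 6: only 12 / 6 = 2 buckets are used.
assert buckets_used(6, 12) == {1, 7}
# k = 6 and 13 buckets are coprime: every bucket is reachable.
assert len(buckets_used(6, 13)) == 13
```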
Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.
But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.
In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and prime bucket count could help further.
Putting about the principle that "everything has to be prime" is as far as I know a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.
[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is provided it isn't 0.]
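The stride effect from the edit above can be checked directly. This is only a sketch with a made-up helper name: it walks the table with a fixed stride and records the slots visited before the walk returns to a slot it has already seen.

```python
def slots_visited(start, stride, bucket_count):
    """Distinct slots reached by repeatedly adding `stride` modulo the size."""
    seen = []
    idx = start % bucket_count
    while idx not in seen:
        seen.append(idx)
        idx = (idx + stride) % bucket_count
    return seen


# Stride 4 divides 12, so only 12 / 4 = 3 slots are ever probed.
assert slots_visited(0, 4, 12) == [0, 4, 8]
# With a prime bucket count, every nonzero stride reaches all slots.
assert all(len(slots_visited(0, s, 13)) == 13 for s in range(1, 13))
```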
The first thing you do when inserting into or retrieving from a hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hash table by doing hashCode % table_length. Here are 2 'statements' that you have most probably read somewhere:
If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
But if you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.
And here is the proof.
If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering
Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}. But this may be kind of difficult if your hashTable is supposed to have millions of entries.
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.
From - http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html
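The formula m = table_length / GreatestCommonFactor(table_length, x) is simple to verify numerically. The helper below is a hypothetical illustration, not code from the linked post:

```python
from math import gcd


def buckets_hit(x, table_length, multiples=1000):
    """Buckets occupied by the hash codes x, 2x, 3x, ..."""
    return {(i * x) % table_length for i in range(1, multiples + 1)}


# x = 6 with 16 buckets: gcd is 2, so only 16 / 2 = 8 buckets are used.
assert len(buckets_hit(6, 16)) == 16 // gcd(16, 6) == 8
# x = 6 with 17 buckets: 17 is prime, gcd is 1, all 17 buckets are used.
assert len(buckets_hit(6, 17)) == 17
```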
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
Pretty clear explanation, with pictures too.
Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example given a string, multiplying each letter value with the prime number and then adding those all up will give you its hash value.
A better question would be, why exactly the number 31?
Just to put down some thoughts gathered from the answers.
Hashing uses modulus so any value can fit into a given range
We want to randomize collisions
Randomizing collisions means there is no pattern in how collisions happen, or that changing a small part of the input results in a completely different hash value
To randomize collisions, avoid using the base (10 in decimal, 16 in hex) as the modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1; this shows a clear pattern in the hash value distribution: values with the same last digit will collide
Avoid using powers of the base (10^2, 10^3, 10^n) as the modulus because it also creates a pattern: values with the same last n digits will collide
Actually, avoid using anything that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values
For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6
12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9
This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high
So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread
tl;dr
index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.
Primes are used because you have good chances of obtaining a unique value for a typical hash-function which uses polynomials modulo P.
Say you use such a hash function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field => prime number). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash table of strings which have length 5-10, and is small enough to use for calculations.
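The "no more than N roots" claim really does depend on the modulus being prime. A brute-force check, with a helper name chosen just for this example, shows x^2 - 1 having two roots modulo 7 but four roots modulo 8:

```python
def roots(coeffs, modulus):
    """Roots of sum(coeffs[i] * x**i) modulo `modulus`, found by brute force."""
    return [
        x
        for x in range(modulus)
        if sum(c * x**i for i, c in enumerate(coeffs)) % modulus == 0
    ]


x_squared_minus_one = [-1, 0, 1]   # coefficients of x^0, x^1, x^2

# Prime modulus: a degree-2 polynomial has at most 2 roots.
assert roots(x_squared_minus_one, 7) == [1, 6]
# Composite modulus: the bound fails, 4 roots for a degree-2 polynomial.
assert roots(x_squared_minus_one, 8) == [1, 3, 5, 7]
```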
Just to provide an alternate viewpoint there's this site:
http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth
Which contends that you should use the largest number of buckets possible as opposed to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument of this.
It depends on the choice of hash function.
Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).
You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table such a common factor is highly unlikely.
On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).
Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.
I believe that it just has to do with the fact that computers work in base 2. Just think about how the same thing works for base 10:
8 % 10 = 8
18 % 10 = 8
87865378 % 10 = 8
It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.
Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.
"The nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, just taken to the modulus. This means every multiplication maps to a different integer modulo the prime, so does every addition.
Prime moduli are advantageous because:
They give the most freedom when choosing the secondary multiplier in secondary hashing, all multipliers except 0 will end up visiting all elements exactly once
If all hashes are less than the modulus there will be no collisions at all
Random primes mix better than power of two moduli and compress the information of all the bits not just a subset
They however have a big downside: they require an integer division, which takes many (~15-40) cycles, even on a modern CPU. With around half the computation one can make sure the hash is mixed up very well. Two multiplications and xorshift operations will mix better than a prime modulus. Then we can use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power of 2 table sizes and around 9 operations for arbitrary sizes.
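As an illustration of "two multiplications and xorshift operations", here is a Python sketch of one well-known mixer of that shape, the 64-bit finalizer from MurmurHash3. Picking this particular mixer is an assumption on my part, since the answer does not name one, and the function names are made up.

```python
MASK64 = (1 << 64) - 1


def mix64(h):
    """Xorshift / multiply / xorshift / multiply / xorshift on 64 bits."""
    h &= MASK64
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h


def bucket_index(h, table_bits):
    """Mix first, then reduce with a mask for a table of 2**table_bits slots."""
    return mix64(h) & ((1 << table_bits) - 1)


# Each step is invertible, so distinct inputs never collide before masking.
assert len({mix64(k) for k in range(1000)}) == 1000
```

The mask then only has to pick bits from an already well-mixed value, which is why the table size no longer needs to be prime.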
I recently looked at many of the fastest hash table implementations and most of them don't use prime moduli.
The distribution of the hash table indices are mainly dependent on the hash function in use. A prime modulus can't fix a bad hash function and a good hash function does not benefit from a prime modulus. There are cases where they can be advantageous however. It can mend a half-bad hash function for example.
Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
Given a string “Samuel”, you can generate a unique hash by multiplying each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.
However, using primes is an old technique. The key here is to understand that as long as you can generate a sufficiently unique key, you can move to other hashing techniques too. Go here for more on this topic:
http://www.azillionmonkeys.com/qed/hash.html
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
Suppose your table size (or the number for modulo) is T = (B*C). Now if the hash for your input is like (N*A*B) where N can be any integer, then your output won't be well distributed, because every time N becomes C, 2C, 3C etc., your output will start repeating, i.e. your output will be distributed over only C positions. Note that C here is (T / HCF(table-size, hash)).
This problem can be eliminated by making HCF 1. Prime numbers are very good for that.
Another interesting case is when T is 2^N. This gives an output exactly equal to the lower N bits of the input hash. As every number can be represented as a sum of powers of 2, taking any number modulo T subtracts all the powers of 2 in the number with exponent >= N, hence always giving a number with a specific pattern, dependent on the input. This is also a bad choice.
Similarly, T as 10^N is bad as well for similar reasons (a pattern in the decimal notation of numbers instead of the binary one).
So, prime numbers tend to give better-distributed results, and hence are a good choice for the table size.
I would say the first answer at this link is the clearest answer I found regarding this question.
Consider the set of keys K = {0,1,...,100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:
Keys {0,12,24,36,...} will be hashed to bucket 0.
Keys {3,15,27,39,...} will be hashed to bucket 3.
Keys {6,18,30,42,...} will be hashed to bucket 6.
Keys {9,21,33,45,...} will be hashed to bucket 9.
If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But, what happens if K is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).
This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.
In general:
Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.
FROM THE ANSWER BY Mario.
I'd like to add something to Steve Jessop's answer (I can't comment on it since I don't have enough reputation), as I found some helpful material. His answer is very helpful, but he made a mistake: the number of buckets should not be a power of 2. I'll just quote from the book "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al., page 263:
When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
Hope it helps.
This question was merged with the more appropriate question, why hash tables should use prime sized arrays, and not power of 2.
For hash functions themselves there are plenty of good answers here, but for the related question, why some security-critical hash tables, like glibc's, use prime-sized arrays, there's none yet.
Generally, power-of-2 tables are much faster. There, the expensive h % n becomes h & bitmask, where the bitmask can be calculated via clz ("count leading zeros") of the size n. A modulo needs an integer division, which is about 50x slower than a logical AND. There are some tricks to avoid a modulo, like Lemire's fast range reduction (https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/), but generally fast hash tables use powers of 2, and secure hash tables use primes.
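As a small illustration of the h % n versus h & bitmask equivalence, here is a sketch that assumes the bitmask is simply n - 1 for a power-of-2 size n. The function names are made up for this example.

```python
def bucket_mod(h, n):
    """Bucket index via integer modulo (works for any table size n)."""
    return h % n

def bucket_mask(h, n):
    """Bucket index via bitmask; only valid when n is a power of 2."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    return h & (n - 1)

# Both give the same bucket for a power-of-2 table size.
h = 0xDEADBEEF
print(bucket_mod(h, 1024) == bucket_mask(h, 1024))
```

Note that the mask keeps only the low 10 bits of h here, which is exactly the property the rest of this answer is concerned with.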
Why so?
Security in this case is defined by attacks on the collision resolution strategy, which for most hash tables is just a linear search in a linked list of collisions, or, for the faster open-addressing tables, a linear search in the table directly. So with power-of-2 tables and some internal knowledge of the table, e.g. the size or the order of the list of keys provided by some JSON interface, you get the number of low bits used, i.e. the number of ones in the bitmask. This is typically lower than 10 bits. And for 5-10 bits it's trivial to brute-force collisions even with the strongest and slowest hash functions. You don't get the full security of your 32-bit or 64-bit hash function anymore. And the point is to use fast, small hash functions, not monsters such as murmur or even siphash.
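To make the brute-force claim concrete, here is a minimal sketch. It assumes an unseeded table that takes the low 10 bits of a SHA-256 digest as the bucket index; the key format and the helper names bucket and find_colliding_keys are invented for the illustration.

```python
import hashlib

def bucket(key, bits):
    """Low `bits` bits of a (deliberately strong) hash, as a power-of-2 table would use."""
    digest = hashlib.sha256(key.encode()).digest()
    h = int.from_bytes(digest[:8], "little")
    return h & ((1 << bits) - 1)

def find_colliding_keys(bits, count, target=0):
    """Try key0, key1, ... and keep those that land in bucket `target`."""
    found = []
    i = 0
    while len(found) < count:
        key = f"key{i}"
        if bucket(key, bits) == target:
            found.append(key)
        i += 1
    return found

# About count * 2**bits attempts are needed on average, i.e. roughly 20,000 here.
keys = find_colliding_keys(bits=10, count=20)
print(keys)
```

Even though SHA-256 is far stronger than any hash used in a real table, all 20 keys end up in the same bucket, because only 10 bits of the hash decide the bucket.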
So if you provide an external interface to your hash table, like a DNS resolver, a programming language, ... you want to care about abusive folks who like to DoS such services. It's normally easier for such folks to shut down your public service with much simpler methods, but it did happen. So people did care.
So the best options to prevent such collision attacks are either
1) to use prime tables, because then
all 32 or 64 bits are relevant to find the bucket, not just a few.
the hash table resize function is more natural than just doubling. The best growth function is the Fibonacci sequence, and primes come closer to that than doubling.
2) use better measures against the actual attack, together with fast power of 2 sizes.
count the collisions and abort or sleep on detected attacks, i.e. collision counts with a probability of <1%, like 100 with 32-bit hash tables. This is what e.g. djb's DNS resolver does.
convert the linked list of collisions to a tree with O(log n) search, not O(n), when a collision attack is detected. This is what e.g. Java does.
There's a wide-spread myth that more secure hash functions help to prevent such attacks, which is wrong as I explained. There's no security with low bits only. This would only work with prime-sized tables, but this would use a combination of the two slowest methods, slow hash plus slow prime modulo.
Hash functions for hash tables primarily need to be small (to be inlinable) and fast. Security can come only from preventing linear search in the collisions. And not to use trivially bad hash functions, like ones insensitive to some values (like \0 when using multiplication).
Using random seeds is also a good option, people started with that first, but with enough information of the table even a random seed does not help much, and dynamic languages typically make it trivial to get the seed via other methods, as it's stored in known memory locations.
For a hash function it's not only important to minimize collisions in general, but also to make it impossible to keep the same hash while changing a few bytes.
Say you have an equation:
(x + y*z) % key = x with 0<x<key and 0<z<key.
If key is a prime number, n*y=key is true for every n in N and false for every other number.
An example where key isn't a prime:
x=1, z=2 and key=8
Because key/z=4 is still a natural number, 4 becomes a solution for our equation, and in this case (n/2)*y = key is true for every n in N. The number of solutions for the equation has practically doubled because 8 isn't a prime.
If our attacker already knows that 8 is a possible solution for the equation, he can change the file from producing 8 to 4 and still get the same hash.
I've read the popular wordpress website linked in some of the above popular answers at the top. From what I've understood, I'd like to share a simple observation I made.
You can find all the details in the article here, but assume the following holds true:
Using a prime number gives us the "best chance" of a unique value
A general hashmap implementation wants 2 things to be unique.
Unique hash code for the key
Unique index to store the actual value
How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, prime is involved because it possesses this unique trait of producing unique numbers which we end up using to ID objects and finding indexes inside the internal container.
Example:
key = "key"
value = "value"
uniqueId = "k" * 31 ^ 2 +
"e" * 31 ^ 1 +
"y"
maps to unique id
Now we want a unique location for our value - so we
uniqueId % internalContainerSize == uniqueLocationForValue , assuming internalContainerSize is also a prime.
I know this is simplified, but I'm hoping to get the general idea through.
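The example above can be made concrete with a short sketch. It assumes the formula is the familiar base-31 polynomial string hash (with ^ meaning exponentiation) and a hypothetical prime table size of 17; the names poly_hash and bucket_index are illustrative only. The results are not guaranteed to be unique: distinct strings can share a hash, and distinct hashes can share an index.

```python
def poly_hash(s, base=31):
    """Base-31 polynomial hash: s[0]*31**(n-1) + ... + s[n-1]."""
    h = 0
    for ch in s:
        h = h * base + ord(ch)
    return h

def bucket_index(s, table_size=17):
    """Reduce the hash to a slot in a table of (assumed prime) size."""
    return poly_hash(s) % table_size

print(poly_hash("key"))                    # 107*31**2 + 101*31 + 121 = 106079
print(bucket_index("key"))                 # 106079 % 17 = 16
print(poly_hash("Aa") == poly_hash("BB"))  # True: two strings, one hash
```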

Handling close-to-impossible collisions on should-be-unique values

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.
A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.
Other implementations however (seem to) depend on the uniqueness. My example (that's what led me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states
The odds of the changeset hash
colliding by accident in your first
billion commits is basically zero. But
we will notice if it happens. And
you'll get to be famous as the guy who
broke SHA1 by accident.
But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.
Rather, I'd like to know (maybe by means of examples from your own work):
Are there any best practices as to covering these improbable cases anyway?
Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?
Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?
Or should even these cases get handled gracefully?
For me, especially the following are interesting, although they are somewhat touchy-feely:
If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?
The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.
However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.
Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you control, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.
Or should even these cases get handled gracefully?
The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).
Given a good 128-bit hash, the probability of colliding with a specific hash value given a random input is:
1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.
The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.
p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)
where ! denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:
Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png
Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.
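The figures above can be checked without evaluating the factorials directly. The sketch below assumes the standard birthday-problem approximation 1 - exp(-n(n-1) / (2 * 2**128)) for the collision probability (the complement of p); the function name collision_probability is illustrative.

```python
import math

def collision_probability(n, bits=128):
    """Approximate chance of at least one collision among n random `bits`-bit hashes."""
    x = n * (n - 1) / (2 * 2**bits)
    # expm1 keeps precision when x is tiny, where 1 - exp(-x) would round to 0.
    return -math.expm1(-x)

for exponent in (17, 18, 19):
    n = 10**exponent
    print(f"10**{exponent} hashes: {collision_probability(n):.3%}")
```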
Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or through someone attacking your system, and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision is vanishingly small (that is to say, the probability of bugs or malice is much higher than that of an accidental collision).

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
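For reference, the multiply-and-convert scheme just described can be sketched in a few lines. The question asks about PHP; Python is used here only for illustration, and the names encode, decode and to_base36 are made up for the sketch. It relies on the three-argument pow with a negative exponent (Python 3.8+) to derive the inverse multiplier.

```python
MODULUS = 65536      # 16-bit id space
MULTIPLIER = 1001    # odd, so it has an inverse modulo 2**16
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse; equals 1113

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Render a non-negative integer in base 36."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))

def encode(internal_id):
    """Internal id -> public base-36 string."""
    return to_base36(internal_id * MULTIPLIER % MODULUS)

def decode(public_id):
    """Public base-36 string -> internal id."""
    return int(public_id, 36) * INVERSE % MODULUS

print([encode(i) for i in (1, 2, 3)])  # ['rt', '1jm', '2bf']
print(decode("2bf"))                   # 3
```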
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Furthermore, XOR encryption doesn't increase the length of the source string: if your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as easy to guess as rot13, but not as heavyweight as DES/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLs I then map that number to a couple of English words (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDs internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
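A minimal sketch of the simplified XOR-then-shuffle idea, for 32-bit numbers. Deriving the bit permutation from the passkey through a seeded random.Random is an assumption made for this example (the answer only says the reordering is possibly based on the passkey), and all names are hypothetical. In a real system the permutation would be stored as a fixed table rather than regenerated, since shuffle results are not guaranteed to stay identical across Python versions.

```python
import random

BITS = 32
MASK = (1 << BITS) - 1

def make_permutation(passkey):
    """Derive a fixed bit reordering from the passkey (assumed approach)."""
    perm = list(range(BITS))
    random.Random(passkey).shuffle(perm)
    return perm

def obfuscate(n, passkey, perm):
    """XOR with the passkey, then move bit i to position perm[i]."""
    x = (n ^ passkey) & MASK
    out = 0
    for src, dst in enumerate(perm):
        if (x >> src) & 1:
            out |= 1 << dst
    return out

def deobfuscate(m, passkey, perm):
    """Undo the bit shuffle, then undo the XOR."""
    x = 0
    for src, dst in enumerate(perm):
        if (m >> dst) & 1:
            x |= 1 << src
    return (x ^ passkey) & MASK

passkey = 0xC0FFEE42  # hypothetical 32-bit secret
perm = make_permutation(passkey)
public_id = obfuscate(1234, passkey, perm)
print(public_id, deobfuscate(public_id, passkey, perm))
```

Both steps are bijections on 32-bit values, which is why no collisions can occur and nothing extra needs to be stored in the database.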
If you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed radix number, using the maximum range's prime factors as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC and the current date/time as inputs. As long as MAC addresses are unique and the clock never runs backwards, this makes collisions practically impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever)
it uses Python
def to_base(q, alphabet):
    # Convert a non-negative integer q to a string using the given alphabet.
    if q < 0: raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED: As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
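Here is a rough sketch of the unique-column variant, using Python's built-in sqlite3 module purely for illustration. The table layout, the function name create_order and the retry limit are assumptions. The UNIQUE constraint replaces the check-then-update step, so the race condition is handled by the database itself.

```python
import random
import sqlite3

def create_order(conn, rng=random, max_tries=10):
    """Insert a new order with a random, unique 10-digit order_number."""
    for _ in range(max_tries):
        number = f"{rng.randint(1, 9_999_999_999):010d}"  # zero-padded
        try:
            with conn:  # commits on success, rolls back on error
                cur = conn.execute(
                    "INSERT INTO orders (order_number) VALUES (?)", (number,)
                )
            return cur.lastrowid, number
        except sqlite3.IntegrityError:
            continue  # duplicate order_number: pick another at random
    raise RuntimeError("could not find a free order number")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "order_number CHAR(10) NOT NULL UNIQUE)"
)
print(create_order(conn))
```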
I've found a much simpler way. Say you want to map N values pseudorandomly to N values. You find the next highest prime after N, and you make your function
prandmap(x) return x * nextPrime(N) % N
This will produce a function that repeats (has a period) every N; no number is produced twice for x = 0..N-1. It always starts at 0, but is pseudorandom thereafter.
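A runnable sketch of this mapping, with a naive trial-division next_prime helper that is an assumption of the example. One thing to keep in mind: the effective multiplier is nextPrime(N) % N, which can be very small. For N = 100 the next prime is 101, so the map is the identity; for N = 1000 it is 1009, so each step just adds 9.

```python
def is_prime(k):
    """Naive trial division; fine for small illustrative values."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    """Smallest prime strictly greater than n."""
    k = n + 1
    while not is_prime(k):
        k += 1
    return k

def prandmap(x, n):
    """Map x in 0..n-1 to 0..n-1; a bijection because the prime is coprime to n."""
    return x * next_prime(n) % n

n = 1000
print([prandmap(x, n) for x in range(5)])
```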
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some JavaScript hacks to send data by POST (keep accessibility in mind for users with JavaScript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".