Alternatives to the Big-O notation?

Alternatives to the Big-O notation? - language-agnostic

Good afternoon all,
We say that a hashtable has O(1) lookup (provided that we have the key), whereas a linked list has O(1) lookup for the next node (provided that we have a reference to the current node).
However, due to how the Big-O notation works, it is not very useful in expressing (or differentiating) the cost of an algorithm x, vs the cost of an algorithm x + m.
For example, even though we label both the hashtable's lookup and the linked list's lookup as O(1), these two O(1)s boil down to a very different number of steps indeed,
The linked list's lookup is fixed at x number of steps. However, the hashtable's lookup is variable. The cost of the hashtable's lookup depends on the cost of the hashing function, so the number of steps required for the hashtable's lookup is: x &plus; m,
where x is a fixed number
and m is an unknown variable value
In other words, even though we call both operations O(1), the cost of the hashtable's lookup is a magnitude higher than the cost of the linked list's lookup.
The Big-O notation is specifically about the size of the input data collection. This does have its advantages, but it has its disadvantages as well, as can be seen when we collapse and normalize all non-n variables into 1. We cannot see the m variable (the hashing function) inside it anymore.
Besides the Big-O notation, Is there another (established) notation we can use for expressing the fixed-cost O(1) which means x operations and the variable-cost O(1) which means x + m (m, the hashing function) number of operations?

literal O(1) which means exactly 1 operation
Except it doesn't. The big O-Notation concerns relative comparision of complexity in relation to an input. If the algorithm does take a constant amount of steps, completely independent of the size of your input, than the exact amount of steps doesn't matter.
Take a look at the (informal) definition of O(n):
It means: There is a certain k so that for each n the function f is smaller than the function g.
In the case above, the hashtable lookup and linked list lookup would be f, and g would be g(n) = 1. For each case, you are able to find a k that f(n) <= g(n) * k.
Now, this k doesn't need to fixed, it can vary depending on platform, implementation, specific hardware. The only interesting point is that it exists. That's why both hashtable lookup and linked list node lookup are O(1): Both have a constant complexity, regardless of input. And when evaluating algorithms, that's what interesting, not the physical steps.
Specifically concerning the Hashtable lookup
Yes, the hash function does take a variable amount of operations (depending on implementation). However, it doesn't take a variable amount of operation depending on the size of the input. Big O-Nation is specifically about the size of the input data collection. A hash function takes a single element. For the evaluation of an algorithm it doesn't matter wether a certain function takes 10, 20, 50 or 100 operations, if the number of operations doesn't increase with the input size, it is O(1). There is no way to distinguish this in big O-Notation, as this isn't what big O-Notation is about.

"~" includes the constant factor - see the family of bachmann functions

The issue is that the "number of operations" is highly context dependent. In fact, that's why big-O notation was invented -- it seems to work rather well in modelling a broad number of computers.
Besides, what a programmer things the number of "ops" is doesn't mean how much time it actually does take (e.g. is it already in cache?) or how many steps hardware actually takes (what does your processor do -exactly-? Does it have micro-ops?) or even how many operations are dictated to the processor (what is your compiler doing for you?). And those are all concerns, even when you try to define a precise concept that's abstract enough to be useful.
So. For now, it's Big-O vs. "operations" -- whatever "operations" means to you and your colleagues at the time.

Related

Analysis of open addressing

I am currently learning hash tables from "An introduction of algorithms 3th". Get quite confused while trying to understand open addressing from statistical point of view. Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1. In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m. The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand. Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured. I have never used hash table in practice, maybe I dive too much into the details. But I also have such doubt in hash table's practical usage:
Q: In reality, if the load factor is less than 1, why would we bother open addressing ? Why not project each key to an integer and arrange them in an array ?

Q: In reality, if the load factor is less than 1, why would we bother open addressing? Why not project each key to an integer and arrange them in an array ?
Because in many situations when hash tables are used, there's no good O(1) way to "project each key to an [distinct, not-absurdly-sparse] integer" array index.
A simple thought experiment illustrates this: say you expect the user to type four three-uppercase-letter keys, and you want to store them somewhere in an array with dimension 10. You have 264 possible inputs, so no matter what your logic is, on average 264/10 of them will "project... to an integer" indicating the same array position. When you realise the "project[ion]" can't avoid potential "collisions", and that projection is a logically identical operation to "hashing" and modding to a "bucket", then some collision-handling logic will be needed, your proposed "alternative" morphs back into a hash table....
Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1.
They are very confusing statements. The "number of hash values" is not arbitrarily limited - you could use a 32 bit hash generating any of ~4 billion hash values, a 512-bit hash, or whatever other size you feel like. Given the structure of your statement is "a > b, i.e. load factor n/m < 1", and "n/m < 1" can be rewritten as "n < m" or "m > n", you imply "a" and "m" are meant to be the same thing, as are "b" and "n":
you're referring to m - which "load factor n/m" requires be the number of buckets in the hash table - as "the possible key value number": it's not, and what could that even mean?
you're referring to n - which "load factor n/m" requires be the number of keys stored in the hash table - as "the number of hash values": it's not, except in the trivial sense of that many (not necessarily distinct) hash values being generated when the keys are hashed
In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m.
Again, that's a very poorly defined statement. The hashing of n keys can identify at most n distinct buckets from which collision-handling would kick in, but those n could begin pretty much anywhere within the m buckets, given the hash function's job is to spray them around. And, so what?
The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand.
Understand what?
Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured.
For sure. "Repeatable randomness" of hashing is a very convenient and tangible benchmark against which specific implementations can be compare.
I have never used hash table in practice, maybe I dive too much into the details. But I also have such doubt in hash table's practical usage:

CUDA: How to find index of extrema in sub matrices?

I have a large rectangular matrix NxM in GPU memory, stored as 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n and same with M and m. Let us say, the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.

I can hardly imagine to be so self-confident (or arrogant?) to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
In contrast to that, in the second approach, each kernel (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum). So the kernel would be launched for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.

Randomly sorting an array

Does there exist an algorithm which, given an ordered list of symbols {a1, a2, a3, ..., ak}, produces in O(n) time a new list of the same symbols in a random order without bias? "Without bias" means the probability that any symbol s will end up in some position p in the list is 1/k.
Assume it is possible to generate a non-biased integer from 1-k inclusive in O(1) time. Also assume that O(1) element access/mutation is possible, and that it is possible to create a new list of size k in O(k) time.
In particular, I would be interested in a 'generative' algorithm. That is, I would be interested in an algorithm that has O(1) initial overhead, and then produces a new element for each slot in the list, taking O(1) time per slot.
If no solution exists to the problem as described, I would still like to know about solutions that do not meet my constraints in one or more of the following ways (and/or in other ways if necessary):
the time complexity is worse than O(n).
the algorithm is biased with regards to the final positions of the symbols.
the algorithm is not generative.
I should add that this problem appears to be the same as the problem of randomly sorting the integers from 1-k, since we can sort the list of integers from 1-k and then for each integer i in the new list, we can produce the symbol ai.

Yes - the Knuth Shuffle.

The Fisher-Yates Shuffle (Knuth Shuffle) is what you are looking for.

Hashtable and list side by side?

I need a data structure that is ordered but also gives fast random access and inserts and removes. Linkedlists are ordered and fast in inserts and removes but they give slow random access. Hashtables give fast random access but are not ordered.
So, it seems to nice to use both of them together. In my current solution, my Hashtable includes iterators of the list and the List contains the actual items. Nice and effective. Okay, it requires double the memory but that's not an issue.
I have heard that some tree structures could do this also, but are they as fast as this solution?

The most efficient tree structure I know is Red Black Tree, and it's not as fast as your solution as it has O(log n) for all operations while your solution has O(1) for some, if not all, operations.
If memory is not an issue and you sure your solution is O(1) meaning the time required to add/delete/find item in the structure is not related to the amount of items you have, go for it.

You should consider a Skip List, which is an ordered linked-list with O(log n) access times. In other words, you can enumerate it O(n) and index/insert/delete is O(log n).

Trees are made for this. The most appropriate are self-balancing trees like AVL tree or Red Black tree. If you deal with a very big data amounts, it also may be useful to create B-tree (they are used for filesystems, for example).
Concerning your implementation: it may be more or less efficient then trees depending on data amount you work with and HashTable implementation. E.g. some hash tables with a very dense data may give access not in O(1) but in O(log n) or even O(n). Also remember that computing hash from data takes some time too, so for a quit small data amounts absolute time for computing hash may be more then for searching it in a tree.

What you did is pretty much the right choice.
The cool thing about this is that adding ordering to an existing map implementation by using a double-ended doubly-linked list doesn't actually change its asymptotic complexity, because all the relevant list operations (appending and deleting) have a worst-case step complexity of Θ(1). (Yes, deletion is Θ(1), too. The reason it is usually Θ(n) is because you have to find the element to delete first, which is Θ(n), but the actual deletion itself is Θ(1). In this particular case, you let the map do the finding, which is something like Θ(1) amortized worst-case step complexity or Θ(logb n) worst-case step complexity, depending on the type of map implementation used.)
The Hash class in Ruby 1.9, for example, is an ordered map, and it is implemented at least in YARV and Rubinius as a hash table embedded into a linked list.
Trees generally have a worst-case step complexity of Θ(logb n) for random access, whereas hash tables may be worse in the worst case (Θ(n)), but usually amortize to Θ(1), provided you don't screw up the hash function or the resize function.
[Note: I'm deliberately only talking about asymptotic behavior here, aka "infinitely large" collections. If your collections are small, then just choose the one with the smallest constant factors.]

Java actually contains a LinkedHashTable, which is similar to the data-structure you're describing. It can be surprisingly useful at times.
Tree structures could work as well, seeing they can perform random access (and most other operations) in (O log n) time. Not as fast as Hashtables (O 1), but still fast unless your database is very large.
The only real advantage of trees is that you don't need to decide on the capacity beforehand. Some HashTable implementations can grow their capacity as needed, but simply do so by copying all items into a new, larger hashtable when they've exceeded their capacity, which is very slow. (O n)

Why should hash functions use a prime number modulus?

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".
What do you expect from a $1.25 book?
Anyway, I've had years to think about the nature of math, and still can't figure it out.
Is the distribution of numbers truly more even when there are a prime number of buckets?
Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), and multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:
(first char) + k * (second char) + k^2 * (third char) + ...
Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k, at least until the integer type overflows.
[As an example, Java's string hashCode is eerily similar to this - it does the characters reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]
A hashtable works by taking the modulus of the hash over the number of buckets.
It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.
Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.
It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.
But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.
In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and prime bucket count could help further.
Putting about the principle that "everything has to be prime" is as far as I know a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.
[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket_count prime and not care what the stride is provided it isn't 0.]

The first thing you do when inserting/retreiving from hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 'statements' that you most probably have read somewhere
If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
But if you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.
And here is the proof.
If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering
Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}.But this may be kind of difficult if your hashTable is supposed to have millions of entries.
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.
From - http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
Pretty clear explanation, with pictures too.
Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example given a string, multiplying each letter value with the prime number and then adding those all up will give you its hash value.
A better question would be, why exactly the number 31?

Just to put down some thoughts gathered from the answers.
Hashing uses modulus so any value can fit into a given range
We want to randomize collisions
Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value
To randomize collision, avoid using the base (10 in decimal, 16 in hex) as modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1, it shows a clear pattern of hash value distribution: value with same last digits will collide
Avoid using powers of base (10^2, 10^3, 10^n) as modulus because it also creates a pattern: value with same last n digits matters will collide
Actually, avoid using any thing that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values
For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6
12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9
This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high
So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread

tl;dr
index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.

Primes are used because you have good chances of obtaining a unique value for a typical hash-function which uses polynomials modulo P.
Say, you use such hash-function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is here the nature of math shows itself, since this claim is only true for a polynomial over a field => prime number). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash-table of strings which have length 5-10, and is small enough to use for calculations.

Just to provide an alternate viewpoint there's this site:
http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth
Which contends that you should use the largest number of buckets possible as opposed to to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument of this.

It depends on the choice of hash function.
Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).
You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table such a common factor is highly unlikely.
On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).

Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.
I believe that it just has to do with the fact that computers work with in base 2. Just think at how the same thing works for base 10:
8 % 10 = 8
18 % 10 = 8
87865378 % 10 = 8
It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.
Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.

"The nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, just taken to the modulus. This means every multiplication maps to a different integer modulo the prime, so does every addition.
Prime moduli are advantageous because:
They give the most freedom when choosing the secondary multiplier in secondary hashing, all multipliers except 0 will end up visiting all elements exactly once
If all hashes are less than the modulus there will be no collisions at all
Random primes mix better than power of two moduli and compress the information of all the bits not just a subset
They however have a big downside, they require an integer division, which takes many (~ 15-40) cycles, even on a modern CPU. With around half the computation one can make sure the hash is mixed up very well. Two multiplications and xorshift operations will mix better than a prime moudulus. Then we can use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power of 2 table sizes and around 9 operations for arbitrary sizes.
I recently looked at many of the fastest hash table implementations and most of them don't use prime moduli.
The distribution of the hash table indices are mainly dependent on the hash function in use. A prime modulus can't fix a bad hash function and a good hash function does not benefit from a prime modulus. There are cases where they can be advantageous however. It can mend a half-bad hash function for example.

Primes are unique numbers. They are
unique in that, the product of a prime
with any other number has the best
chance of being unique (not as unique
as the prime itself of-course) due to
the fact that a prime is used to
compose it. This property is used in
hashing functions.
Given a string “Samuel”, you can
generate a unique hash by multiply
each of the constituent digits or
letters with a prime number and adding
them up. This is why primes are used.
However using primes is an old
technique. The key here to understand
that as long as you can generate a
sufficiently unique key you can move
to other hashing techniques too. Go
here for more on this topic about
http://www.azillionmonkeys.com/qed/hash.html
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Suppose your table-size (or the number for modulo) is T = (B*C). Now if hash for your input is like (N*A*B) where N can be any integer, then your output won't be well distributed. Because every time n becomes C, 2C, 3C etc., your output will start repeating. i.e. your output will be distributed only in C positions. Note that C here is (T / HCF(table-size, hash)).
This problem can be eliminated by making HCF 1. Prime numbers are very good for that.
Another interesting thing is when T is 2^N. These will give output exactly same as all the lower N bits of input-hash. As every number can be represented powers of 2, when we will take modulo of any number with T, we will subtract all powers of 2 form number, which are >= N, hence always giving off number of specific pattern, dependent on the input. This is also a bad choice.
Similarly, T as 10^N is bad as well because of similar reasons (pattern in decimal notation of numbers instead of binary).
So, prime numbers tend to give a better distributed results, hence are good choice for table size.

I would say the first answer at this link is the clearest answer I found regarding this question.
Consider the set of keys K = {0,1,...,100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:
Keys {0,12,24,36,...} will be hashed to bucket 0.
Keys {3,15,27,39,...} will be hashed to bucket 3.
Keys {6,18,30,42,...} will be hashed to bucket 6.
Keys {9,21,33,45,...} will be hashed to bucket 9.
If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But, what happens if K is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).
This situation is more common that it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say that choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.
In general:
Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.
Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.
FROM THE ANSWER BY Mario.

I'd like to add something for Steve Jessop's answer(I can't comment on it since I don't have enough reputation). But I found some helpful material. His answer is very help but he made a mistake: the bucket size should not be a power of 2. I'll just quote from the book "Introduction to Algorithm" by Thomas Cormen, Charles Leisersen, et al on page263:
When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
Hope it helps.

This question was merged with the more appropriate question, why hash tables should use prime sized arrays, and not power of 2.
For hash functions itself there are plenty of good answers here, but for the related question, why some security-critical hash tables, like glibc, use prime-sized arrays, there's none yet.
Generally power of 2 tables are much faster. There the expensive h % n => h & bitmask, where the bitmask can be calculated via clz ("count leading zeros") of the size n. A modulo function needs to do integer division which is about 50x slower than a logical and. There are some tricks to avoid a modulo, like using Lemire's https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/, but generally fast hash tables use power of 2, and secure hash tables use primes.
Why so?
Security in this case is defined by attacks on the collision resolution strategy, which is with most hash tables just linear search in a linked list of collisions. Or with the faster open-addressing tables linear search in the table directly. So with power of 2 tables and some internal knowledge of the table, e.g. the size or the order of the list of keys provided by some JSON interface, you get the number of right bits used. The number of ones on the bitmask. This is typically lower than 10 bits. And for 5-10 bits it's trivial to brute force collisions even with the strongest and slowest hash functions. You don't get the full security of your 32bit or 64 bit hash functions anymore. And the point is to use fast small hash functions, not monsters such as murmur or even siphash.
So if you provide an external interface to your hash table, like a DNS resolver, a programming language, ... you want to care about abuse folks who like to DOS such services. It's normally easier for such folks to shut down your public service with much easier methods, but it did happen. So people did care.
So the best options to prevent from such collision attacks is either
1) to use prime tables, because then
all 32 or 64 bits are relevant to find the bucket, not just a few.
the hash table resize function is more natural than just double. The best growth function is the fibonacci sequence and primes come closer to that than doubling.
2) use better measures against the actual attack, together with fast power of 2 sizes.
count the collisions and abort or sleep on detected attacks, which is collision numbers with a probability of <1%. Like 100 with 32bit hash tables. This is what e.g. djb's dns resolver does.
convert the linked list of collisions to tree's with O(log n) search not O(n) when an collision attack is detected. This is what e.g. java does.
There's a wide-spread myth that more secure hash functions help to prevent such attacks, which is wrong as I explained. There's no security with low bits only. This would only work with prime-sized tables, but this would use a combination of the two slowest methods, slow hash plus slow prime modulo.
Hash functions for hash tables primarily need to be small (to be inlinable) and fast. Security can come only from preventing linear search in the collisions. And not to use trivially bad hash functions, like ones insensitive to some values (like \0 when using multiplication).
Using random seeds is also a good option, people started with that first, but with enough information of the table even a random seed does not help much, and dynamic languages typically make it trivial to get the seed via other methods, as it's stored in known memory locations.

For a hash function it's not only important to minimize colisions generally but to make it impossible to stay with the same hash while chaning a few bytes.
Say you have an equation:
(x + y*z) % key = x with 0<x<key and 0<z<key.
If key is a primenumber n*y=key is true for every n in N and false for every other number.
An example where key isn't a prime example:
x=1, z=2 and key=8
Because key/z=4 is still a natural number, 4 becomes a solution for our equation and in this case (n/2)*y = key is true for every n in N. The amount of solutions for the equation have practially doubled because 8 isn't a prime.
If our attacker already knows that 8 is possible solution for the equation he can change the file from producing 8 to 4 and still gets the same hash.

I've read the popular wordpress website linked in some of the above popular answers at the top. From what I've understood, I'd like to share a simple observation I made.
You can find all the details in the article here, but assume the following holds true:
Using a prime number gives us the "best chance" of an unique value
A general hashmap implementation wants 2 things to be unique.
Unique hash code for the key
Unique index to store the actual value
How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, prime is involved because it possesses this unique trait of producing unique numbers which we end up using to ID objects and finding indexes inside the internal container.
Example:
key = "key"
value = "value"
uniqueId = "k" * 31 ^ 2 +
"e" * 31 ^ 1` +
"y"
maps to unique id
Now we want a unique location for our value - so we
uniqueId % internalContainerSize == uniqueLocationForValue , assuming internalContainerSize is also a prime.
I know this is simplified, but I'm hoping to get the general idea through.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008