I do not quite understand how universal hashing works. For example, when I insert an item into my hash table, I have to choose a random function from my universal family of hash functions. Now I want to retrieve said item. How will my hash table know which function it has to use to calculate the hash?
Because you'll use the same hash function for all the items in the table.
The choice of hash function is random only in the sense that an adversary cannot predict it; once chosen, the hash value is a deterministic function of the key, so the same key always maps to the same slot. There is a nice write-up at
http://www.cs.ucsb.edu/~suri/cs130a/Hashing.txt
The matrix method is easier to understand than other methods...
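As a hedged illustration (this is not from the write-up above, and the class and field names are made up), here is a minimal Java sketch of the common multiply-add family h(x) = ((a*x + b) mod p) mod m. The point is that a and b are drawn at random once, when the table is created, and every later insert and lookup reuses them, which is why retrieval finds the item again.

import java.util.Random;

// Minimal sketch: the table draws ONE function h(x) = ((a*x + b) mod p) mod m
// at random when it is created, then reuses that same function for every
// insert and every lookup.
class UniversalHash {
    private static final long P = 2_147_483_647L; // a prime larger than any key (assumption)
    private final long a;
    private final long b;
    private final int m;                          // number of buckets

    UniversalHash(int buckets, Random rng) {
        this.m = buckets;
        this.a = 1 + rng.nextInt((int) (P - 1));  // a drawn from [1, p-1]
        this.b = rng.nextInt((int) P);            // b drawn from [0, p-1]
    }

    int bucketFor(int key) {
        long x = Math.floorMod((long) key, P);    // map the key into [0, p-1]
        return (int) (((a * x + b) % P) % m);
    }
}

A table built on this calls bucketFor with the same a and b for both put and get, so the same key always lands in the same bucket.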
I've got the following question about choosing hash functions for Bloom filters:
Which functions to use?
In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed.
I know what is meant by this (independent and uniformly distributed), but I'm having trouble finding an argument or discussion about which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read suggestions to use the FNV or Murmur hash functions, but not why (or at least not with a proof) they are suitable.
Thanks in advance!
I asked myself the same question when building a Java Bloom filter library. See the Github readme for a detailed treatment of my analysis of hash functions for Bloom filters.
I looked at the problem from two perspectives:
How fast is the computation?
How uniform is the output distribution?
Speed can easily be measured by benchmarks on random input. Uniformity is a bit harder and requires some statistics. Using Chi-Square goodness of fit tests I measured how similar the distribution of hash values is to a uniform distribution.
The result is:
Use Murmur3 for the best trade-off between speed and uniformity. Do not use Murmur2 as it is not uniform for inputs that change in small increments.
Use a cryptographic hash function like SHA-256 for the best uniformity.
Apply the Kirsch-Mitzenmacher optimization to compute only 2 hash functions instead of k (hash_i = hash1 + i * hash2); see the sketch after this list.
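As a hedged illustration of that last point (the method and parameter names below are made up, not from any particular library), a minimal Java sketch might look like this:

// Sketch of the Kirsch-Mitzenmacher optimization: derive all k bit positions
// from just two base hash values instead of computing k independent hashes.
// hash1 and hash2 are assumed to come from a real hash function (e.g. Murmur3).
static int[] bloomIndexes(long hash1, long hash2, int k, int numBits) {
    int[] positions = new int[k];
    for (int i = 0; i < k; i++) {
        long combined = hash1 + (long) i * hash2;                     // hash_i = hash1 + i * hash2
        positions[i] = (int) Math.floorMod(combined, (long) numBits); // map into the bit array
    }
    return positions;
}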
If your implementation is in Java, I would recommend using our Bloom filter hash library. It is well documented and thoroughly tested. For the details, including benchmark results for different hash functions and their uniformity according to the Chi-Square test, see the Github readme of the repo.
Hash Functions should provide you with graphical proof of why FNV would be a bad choice, and why Murmur2 or one of Bob Jenkins' Hashes would be a good choice.
I think a reasonable option would be multiple CRC hashes. I'm assuming that, if you want multiple n-bit hash values, then among polynomials with Boolean field coefficients there are multiple prime polynomials of degree n+1. But I don't know of a process for finding these polynomials.
Another possibility would be to use multiple modulo hashes. The size of the Bloom filter bit array would have to be the maximum modulo value. But I think, for it to work well, the modulus values would have to be products of primes greater than 10 and relatively prime to each other. And the range from the minimum to the maximum modulus value would have to be as small as possible. I don't know of a way to find such values. I have written some open-source C++ code for quick calculation of remainders: https://github.com/wkaras/C-plus-plus-intrusive-container-templates/blob/master/modulus_hash.h
I was recently given a homework question that asked whether, given a list of keys, it would be possible to make a hash function that doesn't have any collisions. Doing some research, I found out that, given a predetermined list of keys, perfect hash functions are possible.
However, I'm not quite sure what to say beyond that. Could anyone give me some advice on how perfect hash functions are made, or what exactly giving a predefined list does to a hash function creator that allows for a perfect function?
Thanks for any help.
The only way to have no collisions is to have a 1-to-1 relationship between the key and the hash value. The range of hash values must be at least as large as the number of keys, and the mapping function must transform each key to a unique value. Much more info here: http://en.wikipedia.org/wiki/Perfect_hash
In the CLRS book, section 11.5 "Perfect hashing", we find how, given a fixed set of n input keys, we can build a hash table with no collisions. Outline:
if we can afford table size m = n*n, then based on Theorem 11.9 (quoted below) in that section, we know that we can easily find a hash function from a universal class of hash functions which gives no collisions.
otherwise, "secondary hash tables" can be kept for any slot with more than 1 key. Such a table can itself be built on the idea of Theorem 11.9, because the number of keys n_j in that slot is now small, and so is n_j*n_j.
Theorem 11.9, quoted:
"If we store n keys in a hash table of size m=n*n using a hash function h randomly chosen from a universal class of hash functions, then the probability of there being any collisions is less than 1/2."
GUIDs are typically used for uniquely identifying all kinds of entities - requests from external systems, files, whatever. They work like magic - you call a "GiveMeGuid()" function (UuidCreate() on Windows) - and a fresh new GUID is at your service.
Given that my code really calls that "GiveMeGuid()" function each time I need a new GUID, is there any not-so-obvious way to misuse it?
Just found an answer to an old question: How deterministic Are .Net GUIDs?. Requoting it:
It's not a complete answer, but I can tell you that the 13th hex digit is always 4 because it denotes the version of the algorithm used to generate the GUID (id est, v4); also, and I quote Wikipedia:
Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given the initial state one can predict up to the next 250 000 GUIDs returned by the function UuidCreate. This is why GUIDs should not be used in cryptography, e.g., as random keys.
So, if you get lucky and hit the same seed, you'll break 250k mirrors in sequence. To quote another Wikipedia piece:
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128, or about 3.4×10^38) is so large that the probability of the same number being generated twice is extremely small.
Bottom line: one form of misuse is to assume a GUID is always unique.
It depends. Some implementations of GUID generation are time-dependent, so calling CreateGuid in quick succession MAY create clashing GUIDs.
edit: I now remember the problem. I was once working on some php code where the GUID generating function was reseeding the RNG with the system time each call. Don't do this.
The only way I can see of misusing a Guid is trying to interpret the value in some logical manner. Not that it really invites you to do so, which is one of the characteristics of Guids that I really like.
Some GUIDs include an identifier of the machine they were generated on, so they can be used in client/server environments, but some don't. If yours doesn't, be careful not to use them in, for instance, a database that multiple clients access.
Maybe the entropy could be manipulated by playing with some parameters used to generate the GUIDs in the first place (e.g. interface identifiers).
What's the most natural way to model a group of objects that form a set? For example, you might have a bunch of user objects who are all subscribers to a mailing list.
Obviously you could model this as an array, but then you have to order the elements and whoever is using your interface might be confused as to why you're encoding arbitrary ordering data.
You can use a hash where the members are keys that map to "1" or "true", but in most languages there are restrictions on what data types a hash key can be.
What's the standard way to do this in modern languages (PHP, Perl, Ruby, Python, etc)?
In Python, you would use the set datatype. A set supports containing any hashable object, so if you have a custom class you need to store in a set and the default hashable behaviour is not appropriate, you can implement __hash__ to implement the behaviour you want.
C# has the HashSet<T> generic collection.
public class EmailAddress // probably needs to override GetHashCode()
{
...
}
var addresses = new HashSet<EmailAddress>();
Most modern languages are going to have some form of Set data structure. Java has HashSet, which implements the Set interface.
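For example, a minimal sketch (Subscriber is a made-up class, not a standard type); as with the C# snippet above, a custom element type should override equals and hashCode so the set can recognize duplicates:

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Subscriber overrides equals/hashCode so HashSet can detect duplicates.
class Subscriber {
    final String email;
    Subscriber(String email) { this.email = email; }

    @Override public boolean equals(Object o) {
        return o instanceof Subscriber && ((Subscriber) o).email.equals(email);
    }
    @Override public int hashCode() { return Objects.hash(email); }
}

Set<Subscriber> mailingList = new HashSet<>();
mailingList.add(new Subscriber("a@example.com"));
mailingList.add(new Subscriber("a@example.com")); // no effect: already in the set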
In PHP you can use an array to store your data. Either search the array before you add a new element, or use array_unique to remove duplicates after inserting all elements.
In C, as a stand-in for understanding the machine directly:
For small, discrete and well-defined ranges: use a bit array to indicate the presence of each possible item (a set bit for present, an unset bit for absent).
Use a hash-table for all other cases.
Write functions to implement adding and removing items, testing for presence or absence, testing for sub-sets, etc as needed.
As the other answers note, however, if you just want the functionality, use a language feature or third-party library that is already well debugged.
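For instance, in Java the bit-array idea above maps directly onto java.util.BitSet (shown here only as a sketch of the concept, not the hand-rolled C version this answer has in mind):

import java.util.BitSet;

// One bit per possible element for a small, well-defined range (here 0..9999).
BitSet present = new BitSet(10_000);
present.set(42);                  // add 42 to the set
boolean member = present.get(42); // membership test -> true
present.clear(42);                // remove 42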
A lot of the time hash-based sets are the correct thing to use, but if you don't need to do key-based lookups and don't worry about enforcing unique values, a vector or list is fine. There is overhead to a hash table, after all.
You seem to be concerned that people will think that the order in the vector is important, but I think that it is a common enough usage that, with documentation, you shouldn't confuse people.
It really depends on how you want to access and use the data.
An array is usually the simplest way to store data when there are no other requirements. Other data types are usually chosen for specific reasons (you want to append data, you want to search data in constant time, you need quick set union/intersection, etc.). If your only concern is the abstraction, you could wrap the array in some kind of unordered facade.
In Perl I would use a hash, definitely. In other languages I would lament the lack of a hash.
I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually used.
Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry becomes more and more expensive as you add data, since you basically have to search from one end toward the other until you find the one you want.
If you want faster access to data, you typically resort to sorting the array and using a binary search. This, however, while increasing the speed of looking up an existing value, makes inserting new values slow, as you need to move existing elements around whenever you need to insert an element in the middle.
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length; the array starts out with a size, but all of its elements contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, when you want to find a particular entry that you know the key for, you run the key through the same function, get its hash-key, and go to that particular place in the hashtable to retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of elements available for storage, so the hash-key is typically reduced further to an index that fits within the size of the array. One way to do this is to simply take the hash-key modulo the size of the array. For an array of size 10, hash-keys 0-9 map directly to an index, hash-keys 10-19 map back down to 0-9, and so on.
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (e.g. normal string comparison). If there is a complete match, you either disregard the new data (it already exists), overwrite it (you replace the old data for that key), or add it (multi-valued hashtable). If there is no match, which means that though the hash-keys were identical the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
The function that reduces the key to a hash-key does not produce values in key order (i.e. "AAA" does not become 1 and "AAB" 2), so the hashtable is not sorted by any typical value.
There is a good Wikipedia article on the subject as well: https://en.wikipedia.org/wiki/Hash_table
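To make the steps above concrete, here is a deliberately tiny Java sketch (not production code: it never grows, and it handles collisions by simply walking to the next free slot, the simplest resolution mentioned above):

// Tiny illustration of the steps described above: hash the key, reduce it
// modulo the array size, and on a collision probe the next slot.
class TinyHashTable {
    private final String[] keys = new String[16];
    private final String[] values = new String[16];

    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), keys.length); // reduce the hash-key to an index
    }

    void put(String key, String value) {
        int i = indexFor(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;   // collision: linear probing to the next slot
        }
        keys[i] = key;                   // store both the key ...
        values[i] = value;               // ... and its associated data
    }

    String get(String key) {
        int i = indexFor(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i]; // compare the actual keys
            i = (i + 1) % keys.length;
        }
        return null;                     // not present
    }
}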
lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table: let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing this cache for is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
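A minimal sketch of those two uses (Student is a placeholder class, not a real API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// ID => student cache: the operation being optimized is fast lookup by key.
Map<Integer, Student> studentsById = new HashMap<>();
Student s = studentsById.get(41234);

// Relating students to their classes without changing the Student class itself.
Map<Student, List<String>> studentToClassRegistration = new HashMap<>();
studentToClassRegistration.put(s, List.of("CS101", "MATH200"));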
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
If you are talking in terms of Java, both are collections which allow adding, deleting and updating objects and use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that Hashtable is inherently synchronized and hence thread-safe, while HashMap is not a thread-safe collection.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.
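For what it's worth, a small sketch of that difference in Java (a HashMap plus a concurrent map is the usual modern alternative to the legacy Hashtable when thread safety is needed):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Legacy class: every operation is synchronized, so it is thread-safe but slower.
Map<String, Integer> legacy = new Hashtable<>();

// Not synchronized: faster, but not safe to mutate from multiple threads.
Map<String, Integer> plain = new HashMap<>();

// The usual modern choice when thread safety is actually required.
Map<String, Integer> concurrent = new ConcurrentHashMap<>();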
Hashtables/hashmaps associate a value (called a 'key' for disambiguation purposes) with another value. You can think of them as kind of a dictionary (word: definition) or a database record (key: data).