Acceptable types to use as keys in a HashTable - language-agnostic

I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?
In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?

It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hash values never change), they are fine for indexing a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.
By the way, if a data type is simple enough to always be passed by value (like scalars), then it will of course be OK.
But now imagine that you use a mutable type: if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I have no way to be notified, and the object may now reside in the wrong bucket (if its hash value is different).
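A minimal Java sketch of that failure mode (the MutableKey class and its list-backed state are purely illustrative):

import java.util.*;

// A mutable key whose hash changes after insertion "disappears" from the map.
class MutableKey {
    List<String> parts = new ArrayList<>();
    @Override public int hashCode() { return parts.hashCode(); }
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && parts.equals(((MutableKey) o).parts);
    }
}

public class MutableKeyDemo {
    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey();
        map.put(key, "value");            // hashed into a bucket now
        key.parts.add("oops");            // the hash value changes...
        System.out.println(map.get(key)); // ...likely prints null: wrong bucket
    }
}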
Hope this helps.

Most string implementations, while they might appear as reference types in managed environments, are typically immutable.
What a hash function does is map a very large space of states onto a smaller one.
That is why hashing is good for testing string equality: you can map a value to an index of an array and look up information about that value very quickly, without comparing every character against every character of every other string. And you can say much the same about anything; it's all about reducing, or fingerprinting, an arbitrary number of bytes in some useful manner.
This is where the discussion about the type of key you use in a hash table becomes moot, because it's the mapping of that value into a smaller state space, and how that mapping is used internally, that makes it useful. An integer is typically hardware-friendly, but 32 bits isn't really a large space, and collisions are likely within it for arbitrary inputs.
In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming that your hash table contains hundreds of items).
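A minimal Java illustration of the fingerprinting idea (the method name is ours): compare the cheap 32-bit hashes first and only fall back to the full comparison on a match.

public class FingerprintDemo {
    static boolean sameString(String a, String b) {
        if (a.hashCode() != b.hashCode()) return false; // unequal fingerprints: definitely different
        return a.equals(b); // equal fingerprints: verify, since collisions are possible
    }
    public static void main(String[] args) {
        System.out.println(sameString("hash", "table")); // false, no full scan needed
        System.out.println(sameString("hash", "hash"));  // true
    }
}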

As long as a suitable hash function is provided, any type will do as a key. Remember that, after all, a hash table is just a linear array. The hash function takes a key of a certain type and computes an index into the hash table array (called a bucket) where the value gets stored (there are some issues with collisions, though).
So the real tricky part is finding a hash function. It should have certain properties, like being simple to compute, chaotic (nearly identical keys should be mapped to completely different buckets), deterministic (the same key always means the same bucket), uniform (all possible keys are mapped evenly across the buckets), and surjective (all buckets of the hash table should be used).
It seems it is easier to define such a function for simple types like integers.
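To make the mechanics concrete, here is a minimal Java sketch of how a key is reduced to a bucket index; the bucket count and the polynomial string hash (the same scheme java.lang.String happens to use) are illustrative.

public class BucketDemo {
    static int hash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i); // chaotic: small key changes scatter widely
        }
        return h;
    }
    public static void main(String[] args) {
        int buckets = 16;
        String key = "hello";
        // Mask out the sign bit before reducing the hash to a bucket index.
        int index = (hash(key) & 0x7fffffff) % buckets;
        System.out.println("key \"" + key + "\" -> bucket " + index);
    }
}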

The best hash keys are those that
Have good (as in low-collision) hashes (see Object.GetHashCode for .NET, Object.hashCode for Java)
Have quick comparisons (for when there are hash collisions).
All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.
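A minimal Java sketch of a key type with those two properties (the Point class is illustrative): immutable fields, and a hashCode/equals pair that agree, per the Object.hashCode contract.

import java.util.Objects;

final class Point {
    private final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y; // quick comparison for collision cases
    }

    // Equal objects must return equal hash codes; cheap to compute.
    @Override public int hashCode() { return Objects.hash(x, y); }
}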

If you were to use a complex type as a key then:
It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
The hash table may need to have intimate knowledge of the type in order to pick a bucket.
There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.
Integers are commonly used because they are easy to split into ranges that correspond to buckets, they are value types (and therefore immutable), and they are fairly easy to generate.

Related

Alternate field for autoincrement PK

I use an auto-increment PK on tables where I store, for example, posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what the best (most common) type to use for this field would be.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key for this instead, for performance.
What is your take on the best approach for this situation?
MySQL has a function that generates a 128-bit version 1 UUID as described in RFC 4122, and returns it as a hex string with dashes, following the customary UUID formatting.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits seldom change and the lower bits simply increment monotonically, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
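For illustration, a minimal JDBC sketch of that approach; the posts table, its public_id column, the connection details, and the sample id are all hypothetical (and you'd need the MySQL JDBC driver on the classpath):

import java.sql.*;

public class AltKeyDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/demo", "user", "password")) {
            // Let MySQL generate the alternate key next to the auto-increment PK.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO posts (public_id, title) VALUES (UUID_SHORT(), ?)")) {
                insert.setString(1, "Hello world");
                insert.executeUpdate();
            }
            // Clients look posts up by public_id; the internal PK is never exposed.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT title FROM posts WHERE public_id = ?")) {
                query.setLong(1, 92395783831158784L); // example UUID_SHORT() value
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString("title"));
                }
            }
        }
    }
}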

Scala immutable Map-like data structure with constant/effectively constant lookup

In designing a JSON AST for Scala, we realised we hit a problem (which is described in greater detail here: https://github.com/json4s/json4s-ast/issues/8), where ideally we would like to represent a JObject (JSON object) with a Map structure that either preserves ordering for its keys OR is sorted by key using a default Ordering, and that has either constant or effectively constant lookup time.
The reason we need something that either preserves ordering or guarantees the keys are sorted is that when someone serializes the JValue, we need to make sure it always outputs the same JSON for the same JValue (the most obvious case being caching JValue serializations).
The issue is that the Scala stdlib doesn't appear to have an immutable Map-like data structure that preserves/orders by key with O(C) or O(eC) lookup time (reference: http://docs.scala-lang.org/overviews/collections/performance-characteristics.html).
Does anyone know if there is an implementation of such a data structure in Scala somewhere that follows the Scala collections library (or even if such a data structure exists in general)?
You can't sort in constant time in general, so using an Ordering is right out.
And there aren't actually any immutable data structures I'm aware of that let you add and delete in arbitrary spots with constant performance (best you can get is O(log N) though some people like to pretend that if the base on the logarithm is large enough it's "effectively constant").
So what you're asking for isn't possible.
But as a fallback, TreeMap isn't bad for sorted keys. It's pretty efficient for an O(log N) solution.
For keys in the correct order, you in general need to maintain three maps: index to key, key to index, and key to value. Index to key should be a TreeMap so you can walk in order. The others can be whatever. The essential idea is that when you want to add a key-value pair, you increment the index and add (index -> key) to the first map and (key -> index) to the second. Then when you walk in order, you walk along the index-to-key map and look up in the key-value map. When you delete by key, you find the index with the key-to-index map so you can delete it from the index-to-key map also.
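A minimal sketch of that three-map idea, written in mutable Java for brevity (the question asks about Scala; an immutable version would use scala.collection.immutable.TreeMap and HashMap in the same roles, and the class and method names here are illustrative):

import java.util.*;

class OrderedMap<K, V> {
    private final TreeMap<Long, K> indexToKey = new TreeMap<>(); // walkable in order
    private final Map<K, Long> keyToIndex = new HashMap<>();
    private final Map<K, V> keyToValue = new HashMap<>();
    private long nextIndex = 0;

    void put(K key, V value) {
        // Only assign a new index the first time we see the key,
        // so updates keep the original insertion position.
        if (!keyToIndex.containsKey(key)) {
            keyToIndex.put(key, nextIndex);
            indexToKey.put(nextIndex, key);
            nextIndex++;
        }
        keyToValue.put(key, value);
    }

    V get(K key) { return keyToValue.get(key); } // O(1) expected

    void remove(K key) {
        Long index = keyToIndex.remove(key);
        if (index != null) {
            indexToKey.remove(index); // O(log n)
            keyToValue.remove(key);
        }
    }

    // Walk entries in insertion order via the sorted index-to-key map.
    void forEachInOrder(java.util.function.BiConsumer<K, V> action) {
        for (K key : indexToKey.values()) {
            action.accept(key, keyToValue.get(key));
        }
    }
}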
After much deliberation with other people, it seems we will maintain two data structures (a vector and a map, i.e. https://github.com/json4s/json4s-ast/issues/8#issuecomment-125025604), which is essentially what @Aivean mentioned. Performance is a bigger concern than potential memory usage.
If you need ordering, a TreeMap is indeed the fastest structure, and it has O(log(n)) lookup; without ordering you have your standard immutable Map, which has effectively constant lookup (Haskell also has Data.Map, which works similarly).
Thanks for the answers!

MySQL: primary key is an 8-byte string. Is it better to use BIGINT or BINARY(8)?

We need to store many rows in a MySQL (InnoDB) table, all of them having an 8-byte binary string as primary key.
I was wondering whether it was best to use the BIGINT column type (which holds 64-bit, thus 8-byte, integers) or BINARY(8), which is fixed length.
Since we're using those ids as strings in our application, and not numbers, storing them as binary strings sounds more coherent to me. However, I wonder if there are performance issues with this. Does it make any difference?
If that matters, we are reading/storing these ids using hex notation (like page_id = 0x1122334455667788).
We wouldn't use integers in queries anyway, since we're writing a PHP application and, as you surely know, there isn't an "unsigned long long int" type there, so all integers are of machine-dependent size.
I'd use BINARY(8) if that matches your design.
Otherwise you'll always have a conversion overhead in performance or complexity somewhere. There won't be much (if any) difference between the types at the RDBMS level.
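A minimal Java sketch of what that conversion looks like, going from the hex notation in the question to either representation (the class and variable names are illustrative; HexFormat requires Java 17+):

import java.math.BigInteger;
import java.util.HexFormat;

public class IdRepresentations {
    public static void main(String[] args) {
        String hexId = "1122334455667788";

        // As BINARY(8): eight raw bytes, e.g. for PreparedStatement.setBytes(...).
        byte[] asBinary = HexFormat.of().parseHex(hexId);

        // As BIGINT: one 64-bit value. This example fits in a signed long,
        // but ids with the top bit set would need unsigned handling.
        long asBigint = new BigInteger(hexId, 16).longValue();

        System.out.println(asBinary.length + " bytes, 0x" + Long.toHexString(asBigint));
    }
}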

Is creating a "binding map" a good idea for allowing O(1) random lookup to a Map?

There are plenty of situations where you have to choose a random entry of a Map containing thousands or millions of elements: for example, displaying a random quote, definition, or piece of customer feedback. For the sake of clarity, I will assume in the rest of this post that I am dealing with a dictionary map, where the key String is the word and the value String its definition.
The brute-force approach for this kind of problem would be something like this (in Java):
// Brute-force approach: advance an iterator a random number of steps
int x = random.nextInt(dictionary.size());
Iterator<Map.Entry<String, String>> it = dictionary.entrySet().iterator();
for (int i = 0; i < x; i++) it.next();
return it.next();
I feel that using a brute-force approach is not only slow and time-consuming, but stupid.
Why would I have to iterate over thousands of elements to find something I am not particularly looking for?
The only way of avoiding brute force and having O(1) random lookup is to have an integer as the key of the map, because of two points:
The only random object we can get is an integer
The only way of having O(1) lookup in a map is to know the key.
But as you can only have one key, if you put an integer as your key, then you can't have O(1) lookup for the definition of a given word, which was the point of using a Map in the first place.
The way I found of doing this is to declare a second map, the "binding" map, that just binds each key of the dictionary to an integer.
So basically, you end up having two Maps:
The dictionary where you have words as keys and definitions as values
The bindingMap where you have integers as keys, and words as values
So if you want to retrieve the definition of a given word, you do:
dictionary.get("hello");
And if you want to retrieve a random word you just do:
dictionary.get(bindingMap.get(myRandomNumber));
Pros:
It allows O(1) lookup AND random access
Cons:
Space complexity is O(2n) instead of O(n)... Which is still O(n)...
What do you think of this method? Do you see any better way of doing this?
If you're willing to settle for O(lg n) lookup rather than O(1) lookup, you might want to consider storing your elements in an order statistic tree, a modified balanced binary search tree that supports lookup, insertion, and deletion in O(lg n) as well as the ability to query the element at position k for any arbitrary k in O(lg n) time.
If you're using strings as keys, you could also consider modifying the construction so that you have an "order statistic trie," where you would store the strings in a trie augmented with the number of elements down each branch. This would give very fast lookup of elements while still supporting quick random queries.
And finally, if you want to go with the two-maps approach, consider replacing the last map with a standard array or dynamic array; it's more space efficient.
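A minimal Java sketch of that last suggestion, keeping a dynamic array of words next to the dictionary map (the class and method names are illustrative; Map.entry requires Java 9+):

import java.util.*;

class RandomAccessDictionary {
    private final Map<String, String> definitions = new HashMap<>(); // word -> definition, O(1)
    private final List<String> words = new ArrayList<>();            // index -> word, O(1)
    private final Random random = new Random();

    void add(String word, String definition) {
        if (definitions.put(word, definition) == null) {
            words.add(word); // only track each word once
        }
    }

    String lookup(String word) { return definitions.get(word); }

    Map.Entry<String, String> randomEntry() {
        String word = words.get(random.nextInt(words.size()));
        return Map.entry(word, definitions.get(word));
    }
}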
Hope this helps!

Storing a binary array in MySQL

I have an array of values called A, B, ... X, Y, Z. Fun though it would be to have 26 columns in the table, I can't help but feel there is a better way. I have considered creating a second table with the id of the row from the first table, the id of the item in the array, and then the boolean value, but it seems clunky and confusing.
Is there a better way?
Short answer, no. Long answer, it depends.
You can store binary data in a bunch of ways: abusing a number, using BINARY or VARBINARY, using BLOB or TINYBLOB, etc. BINARY types will generally be faster than BLOB types, provided your data is a known size.
However, relational databases aren't designed for doing anything intelligent with binary data. On a project I used to work on, there was a table where each record had a specific binary pattern, stored as some sort of integer, and searching required a lot of ANDs, ORs, XORs and NOTs. It never really worked very well, performance sucked, and it held the whole project back. Looking back, I would have taken a completely different approach.
So if you just want to drop the data in and pull it out again, great. If you want to use it for anything intelligent, tough.
The situation may be different on other database vendors. In fact, have you considered using something else in place of the database? Some sort of object persistence?
Are your possible array values static?
If so, try using MySQL's SET data type.
You can try storing it as a TINYBLOB, or even an UNSIGNED INT, but you'll have to do bit masking in your code.
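A minimal Java sketch of that bit-masking approach, packing the 26 flags into a single int that fits in an UNSIGNED INT column (the class and method names are illustrative):

public class FlagBits {
    static int set(int flags, char letter) {
        return flags | (1 << (letter - 'A')); // one bit per flag A..Z
    }
    static boolean isSet(int flags, char letter) {
        return (flags & (1 << (letter - 'A'))) != 0;
    }
    public static void main(String[] args) {
        int flags = 0;
        flags = set(flags, 'C');
        flags = set(flags, 'Z');
        System.out.println(isSet(flags, 'C')); // true
        System.out.println(isSet(flags, 'B')); // false
        // The resulting int is what you would store in the column.
    }
}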
You can store it as a string and use text manipulation functions to (re)create your array.