Can somebody explain to me the difference between the following structures:
Hash Map,
Hash Table,
Hash Set, and
Hash Dictionary?
HashMap, HashTable, and HashDictionary all mean the same thing: a dictionary, mapping unique, unordered keys to corresponding values, implemented using hash codes.
HashSet means a set of unique, unordered elements, also implemented using hash codes.
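For example, in Scala (the names here are illustrative), a hash map associates values with keys while a hash set only records membership:

```scala
import scala.collection.immutable.{HashMap, HashSet}

val ages = HashMap("alice" -> 30, "bob" -> 25) // unique keys -> values
val seen = HashSet("alice", "bob")             // unique keys only

ages.get("alice")      // Some(30): a map answers "what is bound to this key?"
seen.contains("alice") // true:     a set only answers "is this key present?"
```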
I have a large number of items stored in a Redis Sorted Set (on the order of 100,000) that get updated fairly frequently. These items are objects encoded as JSON strings, and the rank for sorting in the set is derived (on insert, by my code) from a date/time property on the object.
Each item in the set has an Id property (which is a Guid encoded as a string) which uniquely identifies the item within the system.
When these items are updated, I need to either update the item within the sorted set, or delete and reinsert the item. The problem I have is how to find that item to perform the operation.
What I'm currently doing is loading the entire contents of the sorted set into memory, operating on that collection in my code and then writing the complete collection back to Redis. Whilst this works, it's not particularly efficient and won't scale well if the lists start to grow very large.
Would anybody have any suggestions as to how to do this in a more efficient manner? The only unique identifier I have for the items is the Id property as encoded in the item.
Many Thanks,
Richard.
Your problem is probably the result of a bad design choice.
You shouldn't store JSON strings in sorted sets: store only identifiers there, and keep the whole serialized JSON objects in a hash.
This way, when you need to update an object, you update its field in the hash using HSET, and you can locate the whole object by its unique identifier.
On the other hand, every field in the hash must also be present in your sorted set: when you add an object to the sorted set, you're adding its unique identifier.
When you need to list your objects in a particular order, you do the following operations (sketched in code below):
You get a page of identifiers from the sorted set (for example, using ZRANGE).
You get all objects for that page by passing their identifiers to an HMGET command.
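A minimal sketch of this layout, using the Jedis Java client from Scala; the key names `objects` and `by-date` and the helper names are illustrative, not part of any API:

```scala
import redis.clients.jedis.Jedis

object RedisLayout {
  val jedis = new Jedis("localhost", 6379)

  // Store the full JSON in a hash keyed by id, and only the id in the sorted set.
  def upsert(id: String, json: String, dateScore: Double): Unit = {
    jedis.hset("objects", id, json)      // whole object, addressable by its Guid
    jedis.zadd("by-date", dateScore, id) // just the id, ranked by the date score
  }

  // Deleting is two targeted commands instead of a full rewrite.
  def delete(id: String): Unit = {
    jedis.hdel("objects", id)
    jedis.zrem("by-date", id)
  }

  // Page in date order: ids from the sorted set, then bodies from the hash.
  def page(start: Long, stop: Long) = {
    val ids = jedis.zrange("by-date", start, stop)
    jedis.hmget("objects", ids.toArray(Array.empty[String]): _*)
  }
}
```

An update then becomes a single HSET (plus a ZREM/ZADD pair if the date changed) instead of rewriting the whole collection.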
In designing a JSON AST for Scala, we realised we had hit a problem (described in greater detail here: https://github.com/json4s/json4s-ast/issues/8). Ideally we would like to represent a JObject (JSON object) with a Map structure that either preserves the ordering of its keys OR is sorted by key using a default Ordering, with either constant or effectively constant lookup time.
The reason we need something that either preserves ordering or guarantees the keys are sorted is that when someone serializes the JValue, we need to make sure it always outputs the same JSON for the same JValue (the most obvious case is caching JValue serializations).
The issue is that the Scala stdlib doesn't appear to have an immutable Map-like data structure that preserves insertion order or orders by key with O(C) or O(eC) lookup time (reference: http://docs.scala-lang.org/overviews/collections/performance-characteristics.html).
Does anyone know if there is an implementation of such a data structure in Scala somewhere that follows the Scala collections library (or even if such a data structure exists in general)?
You can't sort in constant time in general, so using an Ordering is right out.
And there aren't actually any immutable data structures I'm aware of that let you add and delete in arbitrary spots with constant performance (the best you can get is O(log N), though some people like to pretend that if the base of the logarithm is large enough it's "effectively constant").
So what you're asking for isn't possible.
But as a fallback, TreeMap isn't bad for sorted keys. It's pretty efficient for an O(log N) solution.
For keys in insertion order, you in general need to maintain three maps: index-to-key, key-to-index, and key-to-value. Index-to-key should be a TreeMap so you can walk it in order; the others can be whatever. The essential idea is that when you want to add a key-value pair, you increment the index and add (index -> key) to the first map and (key -> index) to the second. Then when you walk in order, you walk along the index-to-key map and look each key up in the key-to-value map. When you delete by key, you find the index with the key-to-index map so you can delete the entry from the index-to-key map as well. A sketch of this follows.
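A minimal sketch of that idea in Scala; `OrderedMap` and its field names are made up for illustration:

```scala
import scala.collection.immutable.{HashMap, TreeMap}

// Insertion order lives in a TreeMap keyed by a monotonically increasing
// index; lookups by key go through plain hash maps and stay effectively O(1).
final case class OrderedMap[K, V](
    nextIndex: Long = 0L,
    indexToKey: TreeMap[Long, K] = TreeMap.empty[Long, K],
    keyToIndex: HashMap[K, Long] = HashMap.empty[K, Long],
    keyToValue: HashMap[K, V] = HashMap.empty[K, V]) {

  def updated(key: K, value: V): OrderedMap[K, V] = keyToIndex.get(key) match {
    case Some(_) => // an existing key keeps its position; only the value changes
      copy(keyToValue = keyToValue.updated(key, value))
    case None =>
      copy(
        nextIndex  = nextIndex + 1,
        indexToKey = indexToKey.updated(nextIndex, key),
        keyToIndex = keyToIndex.updated(key, nextIndex),
        keyToValue = keyToValue.updated(key, value))
  }

  def get(key: K): Option[V] = keyToValue.get(key) // O(eC), no tree walk

  def removed(key: K): OrderedMap[K, V] = keyToIndex.get(key) match {
    case None    => this
    case Some(i) => copy(indexToKey = indexToKey - i,
                         keyToIndex = keyToIndex - key,
                         keyToValue = keyToValue - key)
  }

  // Walk the index-to-key map in order, looking each key up in key-to-value.
  def iterator: Iterator[(K, V)] =
    indexToKey.valuesIterator.map(k => k -> keyToValue(k))
}
```

Lookups stay at the hash maps' effective constant time; only ordered iteration and deletion touch the O(log N) TreeMap.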
After much deliberation with other people, it seems we will maintain 2 data structures (a vector and a map, i.e. https://github.com/json4s/json4s-ast/issues/8#issuecomment-125025604) which is essentially what #Aivean mentioned. Performance is a bigger concern than potential memory usage.
If you need ordering, a TreeMap is indeed the fastest structure, at O(log(n)) (Haskell's Data.Map works similarly); without ordering you have your standard immutable Map, which has effectively constant lookup.
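For example, the sorted behaviour looks like this:

```scala
import scala.collection.immutable.TreeMap

// Keys come back in sorted order regardless of insertion order,
// so serializing the same map always produces the same output.
val fields = TreeMap("b" -> 2, "c" -> 3, "a" -> 1)
fields.keys.mkString(", ") // "a, b, c"
```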
Thanks for the answers!
I have many records that start with the pattern as shown below:
user:8d6120be2e7247e49545502092c389fd and
user:000935dc3bb16bd2e0de50988751acfd
Though each hash represents a user object, one hash may have more keys than another. Say, if a user is a Manager, then he may have a few additional keys like Reportees, Benefits, etc. Without actually looking into all the records, is there a way to know the maximum number of keys in any hash? I am in the process of converting the Redis structure into a relational schema, and this gives me an idea of what columns should be present.
Just use HLEN if your user:<hash> key is a hash (created with HSET). Most data structures in Redis have a way to get their length; a sketch that scans for the widest user:* hash follows the list:
LLEN for a LIST
SCARD for a SET (SMEMBERS returns the members themselves, not the count)
ZCARD for a SORTED SET
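A sketch of that scan in Scala with the Jedis client; the `user:*` pattern comes from the question, everything else is illustrative:

```scala
import redis.clients.jedis.{Jedis, ScanParams}
import scala.jdk.CollectionConverters._

val jedis  = new Jedis("localhost", 6379)
val params = new ScanParams().`match`("user:*") // only the user hashes

// Walk the keyspace incrementally with SCAN and take the maximum HLEN.
var cursor    = ScanParams.SCAN_POINTER_START // "0"
var maxFields = 0L
var done      = false
while (!done) {
  val result = jedis.scan(cursor, params)
  for (key <- result.getResult.asScala) {
    val n: Long = jedis.hlen(key) // number of fields in this hash
    if (n > maxFields) maxFields = n
  }
  cursor = result.getCursor
  done = cursor == ScanParams.SCAN_POINTER_START // SCAN finishes back at "0"
}
println(s"widest user hash has $maxFields fields")
```

Note that this does visit every matching key, but SCAN does it incrementally without blocking the server the way KEYS would.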
How do I define an integer array as a field when creating a new table in MySQL?
There is currently no way to store an array of integers in MySQL, so you have to implement it yourself. You could choose one of a few approaches, including these two (both sketched after the list):
serialise the data with a separator (e.g. LONGTEXT: 123|4|65|864)
pack the integers into a blob (e.g. LONGBLOB: 0x0000007b000000040000004100000360)
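A sketch of both encodings in Scala; the helper names are illustrative:

```scala
import java.nio.ByteBuffer

// Option 1: separator-delimited text for a LONGTEXT column, e.g. "123|4|65|864".
def toDelimited(xs: Array[Int]): String   = xs.mkString("|")
def fromDelimited(s: String): Array[Int]  = s.split('|').map(_.toInt)

// Option 2: fixed-width binary packing for a LONGBLOB column (4 bytes per int).
def toBlob(xs: Array[Int]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 * xs.length)
  xs.foreach(buf.putInt)
  buf.array()
}
def fromBlob(bytes: Array[Byte]): Array[Int] = {
  val buf = ByteBuffer.wrap(bytes)
  Array.fill(bytes.length / 4)(buf.getInt)
}
```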
You can't. You could do something like convert your array to a comma-separated string and store that. Or you could define a normalised table structure and have each integer in its own row. Each row for a given array would also contain some kind of array key as a separate field. This also has the advantage that you can easily query for individual array elements.
Edit: In my view the first option is not very elegant. Unless you define your field as TEXT you're going to have issues with varying string lengths, and defining your field as VARCHAR(10000) or whatever is not very efficient. Certainly if your arrays are long you should consider a normalised solution.
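A sketch of the normalised layout, driven from Scala over JDBC; the table and column names are illustrative, and the MySQL connector is assumed to be on the classpath:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass")
val stmt = conn.createStatement()

// One row per array element; the element's index is its own column.
stmt.execute(
  """CREATE TABLE item_elements (
    |  item_id INT NOT NULL,  -- which array this element belongs to
    |  idx     INT NOT NULL,  -- the element's position in the array
    |  element INT NOT NULL,  -- the integer value itself
    |  PRIMARY KEY (item_id, idx)
    |)""".stripMargin)

// Reading array number 1 back in order is a plain query:
val rs = stmt.executeQuery(
  "SELECT element FROM item_elements WHERE item_id = 1 ORDER BY idx")
while (rs.next()) println(rs.getInt("element"))
```

Querying for individual elements is then just as easy, e.g. `WHERE element = 65` finds every array containing that value.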
I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?
In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?
It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hash values never change), they are fine for indexing a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.
By the way, if a data type is simple enough to always be passed by value (like scalars), then it will of course be OK.
But now imagine that you use a mutable type; if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I will have no way to be notified; and the object may now reside in the wrong bucket (if its hash value is different).
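A sketch of that failure mode in Scala; `MutableKey` is a deliberately bad key type invented for the demonstration:

```scala
import scala.collection.mutable

// A deliberately bad key type: its hash code depends on mutable state.
final class MutableKey(var name: String) {
  override def hashCode: Int = name.hashCode
  override def equals(other: Any): Boolean = other match {
    case that: MutableKey => that.name == name
    case _                => false
  }
}

val key   = new MutableKey("alice")
val table = mutable.HashMap(key -> 1)

table.get(key)   // Some(1): found in the bucket chosen by "alice".hashCode
key.name = "bob" // mutating the key silently changes its hash code...
table.get(key)   // None: the entry is still filed under the old bucket
```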
Hope this helps.
Most string implementations, while they might appear as reference types in managed environments, are typically implemented as immutable types.
What a hash function does is map a very large number of states onto a much smaller number of states.
That is why hashing is good for testing string equality: you can map the value to an index of an array and look up some information about that value very quickly, without comparing every character against every character in every other string. And you can say much the same about anything: it's all about reducing, or fingerprinting, an arbitrary number of bytes in some useful manner.
This is where the discussion about the type of key you use in a hash table becomes less important, because it's the mapping of that value into a smaller state space, and how that mapping is used internally, that makes it useful. An integer is typically hardware-friendly, but 32 bits isn't really a large space, and collisions are likely within it for arbitrary inputs.
In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming that your hash table contains hundreds of items).
As long as a suitable hash function is provided, any type will do as a key. Remember, after all, a hash table is just a linear array. The hash function takes a key of a certain type and computes an index into the hash table array (called a bucket) where the value gets stored (there are some issues with collisions, though).
So the really tricky part is finding a good hash function. Of course it should have certain properties: it should be simple to compute, chaotic (nearly identical keys should map to completely different buckets), deterministic (the same key always means the same bucket), uniform (all possible keys should map evenly to the buckets), and surjective (all buckets of the hash table should be used).
It seems it is easier to define such a function for simple types like integers.
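Strings are nearly as easy, though. Here is a sketch of the classic polynomial string hash (the same scheme `java.lang.String.hashCode` uses), reduced to a bucket index; `stringHash` and `bucket` are illustrative names:

```scala
// The classic polynomial string hash: deterministic and cheap to compute.
def stringHash(s: String): Int =
  s.foldLeft(0)((h, c) => 31 * h + c) // same scheme as java.lang.String.hashCode

// Reduce the hash to a valid, non-negative bucket index.
def bucket(s: String, numBuckets: Int): Int =
  math.floorMod(stringHash(s), numBuckets)

bucket("hello", 16) // the same key always lands in the same bucket
```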
The best hash keys are those that:
Have good hashes, as in low collisions (see Object.GetHashCode for .NET, Object.hashCode for Java).
Have quick comparisons (for when there are hash collisions).
All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.
If you were to use a complex type as a key then:
It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
The hash table may need to have intimate knowledge of the type in order to pick a bucket.
There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.
Integers are commonly used because they are easy to split into ranges that correspond to the buckets, they are value types and therefore immutable, and they are fairly easy to generate.
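For instance, in Scala a case class gives you an immutable key with structural `equals`/`hashCode` for free; `Point` here is invented for illustration:

```scala
// An immutable value type is a safe key: its hash can never change after insertion.
final case class Point(x: Int, y: Int)

val grid = Map(Point(0, 0) -> "origin", Point(1, 2) -> "somewhere")
grid(Point(1, 2)) // "somewhere": structural equality plus a stable hash
```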