What are hashtables and hashmaps and their typical use cases? - language-agnostic

I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually used.

Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry gets to be a more and more expensive operation as you fill it with data, since you basically have to start searching from one end toward the other, until you find the one you want.
If you want to get faster access to data, you typically resort to sorting the array and using a binary search. This, however, while increasing the speed of looking up an existing value, makes inserting new values slow, as you need to move existing elements around when you need to insert an element in the middle.
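To make that trade-off concrete, here is a minimal sketch using Python's standard bisect module. The list contents and the helper name contains are made up for illustration.

```python
import bisect

# A list that is kept sorted at all times.
values = [3, 8, 15, 23, 42]

def contains(sorted_values, target):
    """Binary search: needs only O(log n) comparisons."""
    i = bisect.bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target

# Inserting keeps the order, but every later element has to shift
# one position to the right, which is O(n) work.
bisect.insort(values, 10)
```

Lookups stay cheap, but each insertion into the middle pays for moving the tail of the list.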
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length; the array starts out with a size, but all the elements in the array contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, you want to find a particular entry that you know the key for, so you run the key through the same function, get its hash-key, go to that particular place in the hashtable, and retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of elements available for storage, so the number is typically reduced further down to an index which fits into the size of the array. One way to do this is to simply take the number modulo the size of the array. For an array with a size of 10, the numbers 0-9 map directly to an index, 10-19 map down to 0-9 again, and so on.
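As a small sketch of that reduction step, the hypothetical helper below maps any hashable key to a slot in an array of a given size, using Python's built-in hash() as the hash function.

```python
def bucket_index(key, capacity):
    """Reduce a key to an array index in the range 0..capacity-1."""
    # hash() produces a (possibly huge or negative) integer; the modulo
    # folds it into the valid index range of the backing array.
    return hash(key) % capacity

# Small non-negative integers hash to themselves in CPython, so with a
# capacity of 10 the keys 3, 13 and 23 all land in the same slot.
same_slot = [bucket_index(k, 10) for k in (3, 13, 23)]
```

Keys that end up on the same index are exactly the collisions that have to be resolved separately.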
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (e.g. normal string comparison). If there is a complete match, you either disregard the new data (it already exists), or you overwrite (you replace the old data for that key), or you add it (multi-valued hashtable). If there is no match, which means that though the hash keys were identical, the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
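Putting the pieces together, the following is a toy sketch of such a table in Python. It assumes the simplest choices mentioned above: the index is the hash modulo the array size, collisions go to the next slot (linear probing), an exact key match overwrites, and the array doubles once it is about two thirds full. The class name ProbingTable and that growth threshold are illustrative choices, and deletion is left out on purpose.

```python
class ProbingTable:
    """Toy open-addressing hash table with linear probing (no deletion)."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity   # each slot: None or (key, value)
        self._count = 0

    def _find_slot(self, key):
        """Index holding `key`, or the empty slot where it would be stored."""
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)   # collision: try the next slot
        return i

    def put(self, key, value):
        # Grow before the array gets too full, so probing always finds a gap.
        if (self._count + 1) * 3 > len(self._slots) * 2:
            self._resize(len(self._slots) * 2)
        i = self._find_slot(key)
        if self._slots[i] is None:
            self._count += 1
        self._slots[i] = (key, value)        # exact key match: overwrite

    def get(self, key, default=None):
        entry = self._slots[self._find_slot(key)]
        return default if entry is None else entry[1]

    def _resize(self, new_capacity):
        # Every entry is re-indexed, because the index depends on the size.
        old_entries = [e for e in self._slots if e is not None]
        self._slots = [None] * new_capacity
        for entry in old_entries:
            self._slots[self._find_slot(entry[0])] = entry


table = ProbingTable(capacity=4)
for key, value in [(1, "a"), (5, "b"), (9, "c")]:   # all collide modulo 4
    table.put(key, value)
table.put("90210", "Beverly Hills")
```

Real implementations differ in many details (probing strategy, growth policy, deletion markers), so treat this only as an illustration of the mechanism.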
The function that reduces the key to a number does not preserve the order of the keys (it is not the case that "AAA" becomes 1, then "AAB" becomes 2, and so on), so the hashtable is not sorted by any typical value.
There is a good wikipedia article available on the subject as well, here.

lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
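In Python, the built-in dict is a hash table, so the three points can be sketched as follows. The student data and the helper find_key_by_value are invented for the example.

```python
students = {}

students[1001] = "Ada"         # put(key, value)
students[1002] = "Grace"
name = students.get(1001)      # get(key)
present = 1002 in students     # contains(key)
del students[1002]             # remove(key)

def find_key_by_value(table, wanted):
    """Anything not keyed, such as searching by value, scans every entry."""
    for key, value in table.items():
        if value == wanted:
            return key
    return None
```

The first four operations take roughly constant time on average; the value search is linear in the number of entries.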
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table. Let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing for with this cache is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
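A minimal sketch of such a side table in Python, assuming students are identified by an integer ID and courses by a string code (both made up here):

```python
from collections import defaultdict

# Student ID -> set of course codes. The student objects themselves
# are never touched and know nothing about this mapping.
student_to_class_registration = defaultdict(set)

def register(student_id, course):
    student_to_class_registration[student_id].add(course)

register(1001, "CS101")
register(1001, "MATH200")
register(1002, "CS101")
```

When registration is finished, the mapping can simply be thrown away.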
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
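The "two hash tables" advice can be sketched like this, assuming the relationship is one-to-one and that values are hashable. The class name TwoWayMap is hypothetical.

```python
class TwoWayMap:
    """Keeps a key -> value table and a value -> key table in step."""

    def __init__(self):
        self._by_key = {}
        self._by_value = {}

    def put(self, key, value):
        # Remove stale pairings so both tables stay consistent.
        if key in self._by_key:
            del self._by_value[self._by_key[key]]
        if value in self._by_value:
            del self._by_key[self._by_value[value]]
        self._by_key[key] = value
        self._by_value[value] = key

    def value_for(self, key):
        return self._by_key.get(key)

    def key_for(self, value):
        return self._by_value.get(value)


ids = TwoWayMap()
ids.put(1001, "ada@example.edu")
ids.put(1002, "grace@example.edu")
```

Both directions are then single-key lookups, at the cost of storing every pair twice.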

If you are talking in terms of Java, both are collections which allow adding, deleting and updating objects, and both use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that hashtables are inherently synchronized and hence thread safe, while hash maps are not a thread-safe collection.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both the cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.

Hashtables/hashmaps associate a value (called 'key' for disambiguation purposes) with another value. You can think of them as a kind of dictionary (word: definition) or a database record (key: data).


Will pre-padding Tcl dicts with empty values speed up runtime?

While solving one of the Advent of Code 2021 puzzles in Tcl, I wanted to speed up the runtime of my script.
My script uses a dictionary with keys as {x y} coordinates and a 0 or 1 as the value. The x-y area of interest for the puzzle increases for each iteration of a loop. As a result, additional key-value pairs are added to the dict with each iteration of the loop.
I think I once learned that Tcl dicts may become re-structured in memory if necessary, possibly due to adding more and more keys. If so, does this cause a runtime hit?
To speed up runtime, would it be a good idea to pre-pad a dict with keys set to empty strings matching the expected final size of the dict?
At the implementation level, yes, rebuilding the hash table has a cost that is linear in the number of entries; after all, each entry has to be placed in a new bucket of the enlarged hash table array. However, the entries themselves do not need to be reallocated; the only memory management changes are for the hash table arrays themselves (allocate new, dispose old) so the cost isn't crazy high. The rebuild triggers whenever the number of entries in the hash table exceeds a fixed multiplier of the size of the hash table; that loading factor is a compile time constant. (Dicts are wrappers around hash tables with Tcl_Obj keys, mostly to add value semantics and ensure that the iteration order is consistent; those aren't things that matter for the rebuild semantics.) There's no notion of pre-sizing a hash table; the implementation doesn't expose that in a useful way. It also doesn't shrink the array; once it has grown, it stays grown (and most of the time that's not a problem at all).
The complexities of rebuild semantics are part of why Tcl's associative arrays are said to have a random order of enumeration: it's not actually random, but the deterministic algorithm is sensitive to a lot of factors that people normally ignore. You don't need to care about that when working with dicts, where the order of iteration is exactly knowable from the way that the value was built, irrespective of the details of how the hashing is done.
If you're doing lookups using compact integer keys from 0 up, a list will be substantially faster, as hashing is currently always performed on string representations. Compound integer keys may become nested lists.
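To illustrate only the shape of that last suggestion, here is a sketch in Python rather than Tcl, so the relative costs will differ from Tcl's. It contrasts a table keyed by an (x, y) pair with nested lists indexed by y and then x. The grid size is an arbitrary assumption.

```python
width, height = 4, 3

# Hash-table version: the compound key (x, y) is hashed on every access.
cells_by_key = {(x, y): 0 for x in range(width) for y in range(height)}
cells_by_key[(2, 1)] = 1

# Nested-list version: two plain index operations, no hashing at all.
# This only works when coordinates are compact integers starting at 0.
grid = [[0] * width for _ in range(height)]
grid[1][2] = 1
```

If the area of interest grows in the negative direction as well, the list version needs an offset added to each coordinate.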

Why are hash table based data structures not the default when implementing adjacency lists?

I looked at some existing implementations of adjacency lists online, and most if not all of them have been implemented using dynamic arrays. But wouldn't hashtable based data structures be more suitable? (set and map)
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable based data structure, lookups will be O(1) whether the nodes are indexed or otherwise.
So why are maps and sets not the default data structures used when implementing adjacency lists?
Choosing the right container is not easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous)
an array (with consecutive storage)
an associative array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / removal element is known) but a look-up has worst case time complexity of O(N).
The look-up in an array has a complexity of O(1) in worst case if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in the best case, but the worst case might be O(N) (even if that is unlikely to happen often unless the hash table is very badly implemented).
An associative array (when implemented as a balanced search tree) has a look-up time complexity of O(lg N) in the worst case.
So the choice always depends on the expected use cases: find the best compromise, where the advantages pay off most while the disadvantages don't hurt too much.
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
For the configuration (according to a specific use case), there is a dedicated page: BGL – adjacency_list.
However, the defaults for vertex (node) list and edge list are in fact vectors (aka. dynamic arrays). Assuming that the average use case is a non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst case of O(1) for look-up in arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add this as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped into the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might in fact be an associative array or hash table.
Considering that a request like e.g. find the shortest route from A to B has to map A and B once to the corresponding internal IDs but may need many look-ups of nodes and edges to compute the answer, the choice of the array for storage of nodes and edges makes a lot of sense.
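A sketch of that arrangement in Python (not the Boost Graph Library): a hash table is used once at the boundary to translate public node names into internal IDs, and everything inside the algorithm works on arrays. The function names and the choice of an unweighted, undirected graph with breadth-first search are assumptions made for the example.

```python
from collections import deque

def build_graph(edges):
    ids = {}         # public name -> internal ID (hash table at the boundary)
    names = []       # internal ID -> public name
    adjacency = []   # internal ID -> list of neighbour IDs

    def intern(name):
        if name not in ids:
            ids[name] = len(names)
            names.append(name)
            adjacency.append([])
        return ids[name]

    for a, b in edges:
        u, v = intern(a), intern(b)
        adjacency[u].append(v)
        adjacency[v].append(u)   # undirected graph
    return ids, names, adjacency

def shortest_path(ids, names, adjacency, start, goal):
    src, dst = ids[start], ids[goal]      # the only two hash look-ups
    parent = [-1] * len(adjacency)        # -1 means "not visited yet"
    parent[src] = src
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adjacency[u]:            # pure array indexing from here on
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    if parent[dst] == -1:
        return None                       # goal not reachable
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    return [names[i] for i in reversed(path)]


graph = build_graph([("A", "C"), ("C", "B"), ("A", "D"), ("D", "E"), ("E", "B")])
route = shortest_path(*graph, "A", "B")
```

The result is mapped back to public names only at the very end.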
There are very limited scenarios where we would access graph nodes by index.
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, or so on. For almost every algorithm like this, you will find that the only operation you need to use the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.
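For that second case, a hash-based adjacency structure could look like the following sketch, where each node maps to a set of neighbours. The example graph and the helper has_edge are made up.

```python
# Node -> set of neighbour nodes.
neighbours = {
    0: {1, 2},
    1: {0},
    2: {0},
}

def has_edge(u, v):
    """Average O(1) membership test instead of scanning a neighbour list."""
    return v in neighbours.get(u, ())

# Iterating over neighbours still works, just with more overhead per
# element than walking a dynamic array.
degree_of_0 = sum(1 for _ in neighbours[0])
```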

Efficient way to Store Huge Number of Large Arrays of Integers in Database

I need to store an array of integers of length about 1000 against an integer ID and string name. The number of such tuples is almost 160000.
I will pick one array and calculate the root mean square deviation (RMSD) elementwise with all others and store an (ID1,ID2,RMSD) tuple in another table.
Could you please suggest the best way to do this? I am currently using MySQL for other datatables in the same project but if necessary I will switch.
One possibility would be to store the arrays in a BINARY or a BLOB type column. Given that the base type of your arrays is an integer, you could step through four bytes at a time to extract values at each index.
If I understand the context correctly, the arrays must all be of the same fixed length, so a BINARY type column would be the most efficient, if it offers sufficient space to hold your arrays. You don't have to worry about database normalisation here, because your array is an atomic unit in this context (again, assuming I'm understanding the problem correctly).
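As a sketch of the packing step, assuming the values fit in signed 32-bit integers and that little-endian byte order is chosen as the convention, Python's struct module can convert between a list of integers and the bytes stored in such a column. The helper names are hypothetical.

```python
import struct

def pack_ints(values):
    """Encode integers as consecutive 4-byte little-endian signed values."""
    return struct.pack(f"<{len(values)}i", *values)

def unpack_ints(blob):
    """Decode a blob produced by pack_ints back into a list of integers."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}i", blob))

blob = pack_ints([7, -3, 100000])
```

An array of length 1000 would then occupy exactly 4000 bytes per row.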
If you did have a requirement to access only part of each array, then this may not be the most practical way to store the data.
The secondary consideration is whether to compute the RMSD value in the database itself, or in some external language on the server. As you've mentioned in your comments, this will be most efficient to do in the database. It sounds like queries are going to be fairly expensive anyway, though, and the execution time may not be a primary concern: simplicity of coding in another language may be more desirable. Also depending on the cost of computing the RMSD value relative to the cost of round-tripping a query to the database, it may not even make that much of a difference?
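If the computation is done outside the database, the element-wise RMSD itself is short. This sketch assumes the usual definition, the square root of the mean squared difference between corresponding elements, and requires arrays of equal length.

```python
import math

def rmsd(a, b):
    """Root mean square deviation between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared / len(a))

example = rmsd([0, 0], [3, 4])   # sqrt((9 + 16) / 2)
```

For 160000 arrays of length 1000, a vectorised library would likely be much faster than this plain loop, but the arithmetic is the same.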
Alternatively, as you've alluded to in your question, using Postgres could be worth considering, because of its more expressive PL/pgSQL language.
Incidentally, if you want to search around for more information on good approaches, searching for database and time series would probably be fruitful. Your data is not necessarily time series data, but many of the same considerations would apply.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
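A quick check of those sizes with Python's hashlib, using an arbitrary byte string as the input:

```python
import hashlib

digest = hashlib.sha1(b"example record").digest()   # raw bytes
hex_id = digest.hex()                                # human-readable form
restored = bytes.fromhex(hex_id)                     # back to raw bytes

# Storage needed for 500,000 raw digests, in MiB.
raw_mib = 500_000 * len(digest) / 2**20
```

The raw digest is 20 bytes and its hex form is 40 characters, so 500,000 raw IDs come to roughly 9.5 MiB.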
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
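Here is a sketch of that first option, using Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL. The column types are therefore SQLite's (INTEGER and BLOB) rather than the MySQL types named above, and the helper function names are invented.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE saved_searches (
        search_id INTEGER NOT NULL,
        item_id   BLOB    NOT NULL,      -- raw 20-byte SHA1, not hex text
        PRIMARY KEY (search_id, item_id)
    ) WITHOUT ROWID                      -- rows are stored in key order
""")

def save_search(conn, search_id, item_ids):
    # Insert in sorted key order, as suggested above.
    rows = [(search_id, item_id) for item_id in sorted(item_ids)]
    conn.executemany("INSERT INTO saved_searches VALUES (?, ?)", rows)

def load_search(conn, search_id):
    cursor = conn.execute(
        "SELECT item_id FROM saved_searches WHERE search_id = ? ORDER BY item_id",
        (search_id,),
    )
    return [row[0] for row in cursor]

def expire_search(conn, search_id):
    conn.execute("DELETE FROM saved_searches WHERE search_id = ?", (search_id,))


item_ids = [hashlib.sha1(str(n).encode()).digest() for n in range(5)]
save_search(conn, 1, item_ids)
```

Expiring a dataset is then a single DELETE on its search_id, which a periodic clean-up job could issue.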
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Which datastructure is appropriate for this situation?

I'm trying to decide which datastructure to use to store key-value pairs when the only features needed are
insertion
lookup
Specifically, I don't need to be able to delete pairs, or iterate through keys/values/pairs.
The keys are integer tuples, the values are pointers (references, whatever). I'm only storing a couple million pairs spread out over (many) objects.
Currently I'm considering using either
a hash table
a kd-tree
a b-tree
I'm leaning toward the hash table (for the O(1) insertion/lookup time), but I wanted to confirm my leanings.
Which structure (of those above or other) would you recommend and why? If you would recommend a hash table, should I create a separate table for each object, or just create a single table and use the object's id as part of the key tuple?
A hashtable will be the best choice here as all the operations that matter to you are O(1) (and as such you shouldn't need to worry about creating multiple hashtables).
I'm a big fan of hash tables, since they are easy and there are implementations available for pretty much every major language out there. The O(1) insertion/lookup is an especially good feature.
You should probably use a single table, to save on memory. Hash tables are notoriously inefficient memory-wise, and using a single table would help to minimize that.
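The single-table variant can be sketched in Python by putting the object's id in front of the integer tuple to form one compound key. The names pairs, insert and lookup are illustrative.

```python
# One shared table for all objects: (object_id, *key_tuple) -> value.
pairs = {}

def insert(object_id, key, value):
    pairs[(object_id, *key)] = value

def lookup(object_id, key):
    return pairs.get((object_id, *key))

insert(1, (3, 4), "first")
insert(2, (3, 4), "second")   # same tuple, different object: no clash
```

Only one backing array is allocated and resized, instead of one per object.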
Hash tables would be useful here and I see no reason to have more than one table.
Most trees have an O(log n) lookup time, but hashtables have an O(1) lookup time on average, so that's the one you want to use. It's also very common, and often the implementation is highly-optimised to boot.