A searchable heap structure - language-agnostic

The problem is to access a series of values by two different methods. First, by priority; that is implemented simply enough with a heap. Additionally, it must be possible to "tag" each value with one or more symbols through which a list of items may be accessed.
This would be easy enough to implement efficiently by referencing the same data in two different structures. However, these must form a cohesive queue: items removed through one structure must also be removed from the other, an operation for which a heap is not particularly well suited.
Is there a data structure which is able to provide efficient ordering by one value (ideally optimized for pushing/popping), without wholly degrading the performance of finding/deleting nodes at arbitrary locations?

You can delete any element from a binary heap in O(log(n)) time if you know which one to delete: replace it with the last element of the heap, then sift that element up or down to restore the heap property, just as the usual delete-max (or delete-min) operation does for the root.
The only problem is: how do you know which one to delete? I think I have a solution for that. Use a wrapper for the stored class that has a reference to its heap node, and delete that node from the wrapper's destructor. When you want to remove an element from the collection you can just delete the wrapper through any of the indexes and it will take care of the rest. When you insert something into the collection you need to create a wrapper object and give it a reference to its heap node.
Here is some C++ code (a sketch; HeapNode, Foo and Bar stand in for the node types of the three index structures):
template <class T>
class Wrapper {
    T data;
    HeapNode* index_heap;   // this item's node in the heap
    Foo*      index_tags;   // this item's entry in the tag index
    Bar*      index_queue;  // this item's entry in the queue
public:
    ~Wrapper() {
        // Destroying the wrapper removes the item from every index at once.
        index_heap->delete_max();    // remove this item's node from the heap
        index_tags->delete_foo();    // remove it from the tag index
        index_queue->delete_bar();   // remove it from the queue
    }
};
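For reference, here is a minimal sketch (my own, assuming an array-backed max-heap of ints; the names are illustrative) of what the O(log n) delete-at-a-known-position itself looks like:
#include <vector>
#include <algorithm>
#include <cstddef>

// Minimal max-heap supporting erase-by-index in O(log n).
class MaxHeap {
    std::vector<int> a;
    void sift_up(std::size_t i) {
        while (i > 0 && a[(i - 1) / 2] < a[i]) {
            std::swap(a[(i - 1) / 2], a[i]);
            i = (i - 1) / 2;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
            if (l < a.size() && a[m] < a[l]) m = l;
            if (r < a.size() && a[m] < a[r]) m = r;
            if (m == i) return;
            std::swap(a[i], a[m]);
            i = m;
        }
    }
public:
    void push(int x) { a.push_back(x); sift_up(a.size() - 1); }
    // Remove the element at position i: overwrite it with the last element,
    // shrink, then restore the heap property in both directions.
    void erase_at(std::size_t i) {
        a[i] = a.back();
        a.pop_back();
        if (i < a.size()) { sift_down(i); sift_up(i); }
    }
};
In the wrapper scheme above, the heap would additionally have to keep each wrapper's stored position up to date whenever a swap moves elements around.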


Why are hash table based data structures not the default when implementing adjacency lists?

I looked at some existing implementations of adjacency lists online, and most if not all of them have been implemented using dynamic arrays. But wouldn't hashtable based data structures be more suitable? (set and map)
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable based data structure, lookups will be O(1) whether the nodes are indexed or otherwise.
So why are maps and sets not the default data structures used when implementing adjacency lists?
Choosing the right container is not that easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous)
an array (with consecutive storage)
an associative array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / removal element is known) but a look-up has worst case time complexity of O(N).
The look-up in an array has a complexity of O(1) in worst case if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in the best case, but the worst case might be O(N) (even if that is unlikely to happen often unless the hash table is badly implemented).
An associative array has a worst-case time complexity of O(lg N).
So the choice always depends on the expected use cases, to find the best compromise where the advantages pay off most while the disadvantages don't hurt too much.
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
For configuring it according to a specific use case, an extra page is devoted to the topic: BGL – adjacency_list.
However, the defaults for the vertex (node) list and edge list are in fact vectors (i.e. dynamic arrays). Assuming that the average use case is a non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst-case O(1) look-up of arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add this as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped to the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might in fact be an associative array or hash table.
Considering that a request like, e.g., find the shortest route from A to B has to map A and B once to the corresponding internal IDs but may need many look-ups of nodes and edges to compute the answer, the choice of an array for the storage of nodes and edges makes a lot of sense.
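As a minimal illustration of that layout (my own sketch, not BGL code; it assumes string labels as the "public" node references):
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: nodes and edges stored in vectors, addressed by internal IDs;
// public labels are mapped to IDs once, at the boundary of the API.
struct Graph {
    std::vector<std::string> labels;              // internal ID -> public label
    std::vector<std::vector<int>> adjacency;      // internal ID -> neighbour IDs
    std::unordered_map<std::string, int> id_of;   // public label -> internal ID

    int add_node(const std::string& label) {
        auto it = id_of.find(label);
        if (it != id_of.end()) return it->second;
        int id = static_cast<int>(labels.size());
        labels.push_back(label);
        adjacency.emplace_back();
        id_of.emplace(label, id);
        return id;
    }
    void add_edge(const std::string& from, const std::string& to) {
        int u = add_node(from);
        int v = add_node(to);
        adjacency[u].push_back(v);
    }
};
A request such as the shortest route from A to B then pays for two hash look-ups up front and works purely on array indices afterwards.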
There are very limited scenarios where we would access graph nodes by index.
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, or so on. For almost every algorithm like this, you will find that the only operation you need to use the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.
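For comparison, a hedged sketch of that hash-based variant (the struct and its names are my own, for illustration only):
#include <unordered_set>
#include <vector>

// Sketch: one hash set of neighbours per node gives an O(1) average-case
// edge-existence test, at the cost of extra memory and slower iteration.
struct HashAdjacency {
    std::vector<std::unordered_set<int>> neighbours;

    explicit HashAdjacency(int n) : neighbours(n) {}
    void add_edge(int u, int v) { neighbours[u].insert(v); }
    bool has_edge(int u, int v) const {
        return neighbours[u].count(v) != 0;
    }
};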

Is there an implementation of a binary search tree annotated with sub-tree size

I have been researching the tree data structure described at this link (near the bottom):
http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/
It is mentioned that this data structure could be a finger tree. However, after more research around finger trees, I've found that this lacks the "fingers" that makes finger trees finger trees. Instead, it seems this is just an annotated binary tree (annotated with subtree size).
Do you know of an existing implementation (in any language) of this data-structure that I could use as a reference for my own implementation (though, preferably not an implementation in a functional programming language)?
Or, what would be the most optimal way of retrofitting the subtree size annotations into an existing tree data-structure?
Thanks!
Simon Tatham's Counted B-Trees are similar. If the node count is replaced with a buffer width, as in tweak, they provide rope-like operations.
In fact, from reading the page you reference, I see that it was being used like a piece table or line table for an editor.
In the paper Positional Delta Trees to Reconcile Updates with Read-Optimized Data Storage, the authors present a tree whose behaviour, in regard to the invariants it maintains between the nodes of the tree, bears a striking resemblance to Xanadu's enfilades, to which the Counted B-tree is also similar.
I've got a project on github called Boost.Intrusive Annotated Trees that aims to provide generic support for annotations like subtree count in Boost.Intrusive. Subtree count was my original use case for it.
Currently it requires C++11 variadic templates and only supports the rbtree, but it works, and I hope to remove both of those restrictions in time.
Update: it now builds with C++03. It still only supports the rbtree.
When used with a subtree count annotation it's similar to what jordan describes in the answer above - it calculates (left + right + 1) at each node. The implementation is quite different, though - it works with any node and/or value traits, and the annotation updates are integrated into the rbtree algorithms instead, which keeps the number of recalculations to a minimum.
I've implemented something similar, based on a question I asked the other day. I added annotations to the boost::intrusive::rbtree/avltree nodes to calculate the size of each subtree (for each node, count = node->left->count + node->right->count + 1). I perform this update on insertion/deletion/rebalance of the tree by using the Boost value_traits hooks for set_parent, set_left, and set_right. Pretty much, as stated on the site you referenced, after each node update you update the current node's size and then traverse up the tree until you hit the root, updating each node's size as you go.
The problem comes when you want to insert into the tree at a specific position. Pretty much the moment you do this, you'll invalidate the key-ordering invariant of the tree. This means you won't be able to perform efficient O(log n) look-ups by key. But if you wanted that, you probably wouldn't need the size annotations anyway.
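For reference, a minimal stand-alone sketch of that bookkeeping (a plain unbalanced BST rather than Boost.Intrusive; the names are mine):
#include <cstddef>

// Each node carries the size of the subtree rooted at it.
struct Node {
    int key;
    std::size_t size = 1;
    Node *left = nullptr, *right = nullptr, *parent = nullptr;
};

std::size_t size_of(const Node* n) { return n ? n->size : 0; }

// Call this on the parent of any changed node after every insert,
// erase or rotation: walk up to the root refreshing the sizes.
void update_upwards(Node* n) {
    for (; n; n = n->parent)
        n->size = 1 + size_of(n->left) + size_of(n->right);
}

// With the sizes in place, the k-th element (0-based, in key order)
// can be found in O(height) time.
Node* select(Node* n, std::size_t k) {
    while (n) {
        std::size_t l = size_of(n->left);
        if (k < l)       n = n->left;
        else if (k == l) return n;
        else { k -= l + 1; n = n->right; }
    }
    return nullptr;
}
The select function is what the size annotation buys you: the k-th item in order without walking the whole tree.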

Why can you only prepend to lists in functional languages?

I have only used 3 functional languages -- Scala, Erlang, and Haskell -- but in all 3 of these, the correct way to build lists is to prepend the new data to the front and then reverse it, instead of just appending to the end. Of course, you could append to the lists, but that results in an entirely new list being constructed.
Why is this? I could imagine it's because lists are implemented internally as linked lists, but why couldn't they just be implemented as doubly linked lists so you could append to the end with no penalty? Is there some reason all functional languages have this limitation?
Lists in functional languages are immutable / persistent.
Appending to the front of an immutable list is cheap because you only have to allocate a single node with the next pointer pointing to the head of the previous list. There is no need to change the original list since it's only a singly linked list and pointers to the previous head cannot see the update.
Adding a node to the end of the list necessitates modifying the last node to point to the newly created node. But this is not possible because the node is immutable. The only option is to create a new node which has the same value as the last node and points to the newly created tail. This process must repeat itself all the way up to the front of the list, resulting in a brand-new list that is a copy of the original plus the new tail node. Hence it is more expensive.
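To make the cost difference concrete, here is a small C++ sketch of a persistent (immutable, structure-sharing) singly linked list; the functional languages in question do this natively, and the Cons/List/prepend/append names are mine, chosen only for illustration:
#include <memory>

// Immutable cons cell; lists share structure via shared_ptr.
struct Cons {
    int head;
    std::shared_ptr<const Cons> tail;
};
using List = std::shared_ptr<const Cons>;

// O(1): allocate one node; the old list is shared, not copied.
List prepend(int x, List xs) {
    return std::make_shared<Cons>(Cons{x, std::move(xs)});
}

// O(n): every node must be copied so that the copy can point at a new tail.
List append(List xs, int x) {
    if (!xs) return prepend(x, nullptr);
    return std::make_shared<Cons>(Cons{xs->head, append(xs->tail, x)});
}
Prepending touches a single node no matter how long the list is; appending has to copy every node, which is exactly why the idiom is to build by prepending and reverse once at the end.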
Because there is no way to append to a list in O(1) without modifying the original (which you don't do in functional languages)
Because it's faster
They certainly could support appending, but it's so much faster to prepend that they limit the API. It's also kind of non-functional to append, as you must then modify the last element or create a whole new list. Prepend works in an immutable, functional, style by its nature.
That is the way in which lists are defined. A list is defined as a linked list terminated by a nil; this is not just an implementation detail. This, coupled with the fact that these languages have immutable data (at least Erlang and Haskell do), means that you cannot implement them as doubly linked lists. Adding an element would then modify the list, which is illegal.
By restricting list construction to prepending, it means that anybody else that is holding a reference to some part of the list further down, will not see it unexpectedly change behind their back. This allows for efficient list construction while retaining the property of immutable data.

The data structure for the "order" property of an object in a list

I have a list of ordered objects stored in the database and accessed through an ORM (specifically, Django's ORM). The list is sorted arbitrarily by the user and I need some way to keep track of it. To that end, each object has an "order" property that specifies its order in relation to the other objects.
Any data structure I use to sort on this order field will have to be recreated on each request, so creation has to be cheap. I will often compare multiple objects. And insertions can't require me to update every single row in the database.
What data structure should I use?
Heap
As implemented by std::set (C++), with comparison on the value (depending on what is stored), e.g. integers by value, strings by string ordering, and so on. Good for ordering and access, but it is not clear when you need to do reordering: during insertion, constantly, and so on.
You're not specifying what other constraints you have. If the number of elements is known, and it is reasonable or possible to keep a fixed array in memory, then each position would be specified by an index. This assumes a fixed number of elements.
You also did not specify which operations you want to perform often.
If you load the values once and then access them without modification, you can use one algorithm for creation and then either build an index or transform the storage to allow fast access ...
Linked list (doubly)?
I don't think you can get constant time for insertions (but maybe amortized constant time?).
I would use a binary tree; the bit sequence describing the path to a node can then be used as your "order property".
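One hedged way to flesh that idea out (my own sketch, not part of the answer above): store each row's order property as the path bits followed by a '1' sentinel, so that plain lexicographic string comparison - e.g. an ORDER BY on a VARCHAR column - reproduces the tree's in-order sequence, and a key between any two neighbours can always be created without touching any other row:
#include <string>

// Order keys are paths in an implicit binary tree, written as '0'/'1'
// characters with a trailing '1' sentinel; ordinary lexicographic string
// comparison of the keys matches the in-order position in that tree.
const std::string kFirstKey = "1";               // the root: key for the very first item

// Key for a new item placed between neighbours a and b (a < b, adjacent).
std::string key_between(const std::string& a, const std::string& b) {
    if (b.compare(0, a.size(), a) == 0) {
        // b lies in a's right subtree: take b's left child.
        return b.substr(0, b.size() - 1) + "01";
    }
    // Otherwise a's right child is free and sorts between a and b.
    return a + "1";
}

// Keys before the current first / after the current last element.
std::string key_before(const std::string& first) {
    return first.substr(0, first.size() - 1) + "01";  // left child of first
}
std::string key_after(const std::string& last) {
    return last + "1";                                // right child of last
}
Keys do grow longer when insertions keep hitting the same spot, which is the usual trade-off of such schemes.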

What are hashtables and hashmaps and their typical use cases?

I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually used.
Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry gets to be a more and more expensive operation as you fill it with data, since you basically have to start searching from one end toward the other, until you find the one you want.
If you want faster access to data, you typically resort to sorting the array and using a binary search. This, however, while increasing the speed of looking up an existing value, makes inserting new values slow, as you need to move existing elements around when you need to insert an element in the middle.
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length; the array starts out with a size, but all the elements in the array contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, when you want to find a particular entry that you know the key for, you run the key through the same function, get its hash-key, and go to that particular place in the hashtable and retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of slots available for storage, so the number is typically reduced further to an index which fits the size of the array. One way to do this is to simply take the hash value modulo the size of the array. For an array of size 10, hash values 0-9 map directly to an index, hash values 10-19 map down to 0-9 again, and so on.
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (i.e. normal string comparison, for instance). If there is a complete match, you either disregard the new data (it already exists), or you overwrite it (you replace the old data for that key), or you add it (multi-valued hashtable). If there is no match, which means that though the hash keys were identical, the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
The function that reduces the key to a number does not produce a linear value (i.e. "AAA" does not become 1 and "AAB" 2), so the hashtable is not sorted by any typical value.
There is a good Wikipedia article on the subject as well.
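To make the mechanics above concrete, here is a minimal, hedged C++ sketch of a hashtable that uses the modulus trick, the "next empty slot" collision strategy, and growth by rehashing described above (a toy, not how any standard library implements it):
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Toy hashtable: open addressing with linear probing and doubling growth.
class HashTable {
    struct Slot { std::string key; std::string value; bool used = false; };
    std::vector<Slot> slots = std::vector<Slot>(16);  // "empty" but not zero-length
    std::size_t count = 0;

    std::size_t index_for(const std::string& key) const {
        return std::hash<std::string>{}(key) % slots.size();  // reduce the hash to an index
    }
    void grow() {
        std::vector<Slot> old = std::move(slots);
        slots = std::vector<Slot>(old.size() * 2);
        count = 0;
        for (const Slot& s : old)                     // recalculate every index
            if (s.used) put(s.key, s.value);
    }
public:
    void put(const std::string& key, const std::string& value) {
        if ((count + 1) * 2 > slots.size()) grow();   // keep the table at most half full
        std::size_t i = index_for(key);
        while (slots[i].used && slots[i].key != key)  // collision: try the next slot
            i = (i + 1) % slots.size();
        if (!slots[i].used) ++count;
        slots[i] = {key, value, true};                // insert or overwrite
    }
    std::optional<std::string> get(const std::string& key) const {
        std::size_t i = index_for(key);
        while (slots[i].used) {                       // stop at the first empty slot
            if (slots[i].key == key) return slots[i].value;
            i = (i + 1) % slots.size();
        }
        return std::nullopt;                          // not found
    }
};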
lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table: let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing for with this cache is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
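A small hedged example of both uses in C++ (the types and values are made up for illustration; studentToClassRegistration is the name from the paragraph above):
#include <string>
#include <unordered_map>
#include <vector>

struct Student { int id; std::string name; };

int main() {
    // Use case 1: a cache keyed by the thing you look records up by.
    std::unordered_map<int, Student> studentsById;
    studentsById[4711] = {4711, "Ada"};

    // Use case 2: a relationship kept outside the objects themselves.
    std::unordered_map<int, std::vector<std::string>> studentToClassRegistration;
    studentToClassRegistration[4711].push_back("Data Structures 101");

    // Both look-ups are, on average, constant time.
    const Student& ada = studentsById.at(4711);
    const std::vector<std::string>& adasClasses = studentToClassRegistration.at(ada.id);
    (void)adasClasses;
    return 0;
}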
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
If you are talking in terms of Java, both are collections which allow addition, deletion and updating of objects, and both use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that Hashtables are inherently synchronized and hence thread-safe, while hash maps are not thread-safe collections.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both the cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.
Hashtables/hashmaps associate a value (called a 'key' for disambiguation purposes) with another value. You can think of them as a kind of dictionary (word: definition) or a database record (key: data).