Scala immutable Map-like data structure that has constant/effectively constant lookup - json

In designing a JSON AST for Scala, we realised we had hit a problem (described in greater detail here: https://github.com/json4s/json4s-ast/issues/8). Ideally we would like to represent a JObject (JSON object) with a Map structure that either preserves insertion order for its keys OR is sorted by key using a default Ordering, while still providing constant or effectively constant lookup time.
The reason we need something that either preserves ordering or guarantees the keys are sorted is that when someone serializes the JValue, we need to make sure it always outputs the same JSON for the same JValue (the most obvious case is caching JValue serializations).
The issue is that the Scala stdlib doesn't appear to have an immutable Map-like data structure that preserves insertion order, or orders by key, with O(C) or O(eC) lookup time (reference: http://docs.scala-lang.org/overviews/collections/performance-characteristics.html).
Does anyone know if there is an implementation of such a data structure in Scala somewhere that follows the Scala collections library (or even whether such a data structure exists in general)?

You can't sort in constant time in general, so using an Ordering is right out.
And there aren't actually any immutable data structures I'm aware of that let you add and delete in arbitrary spots with constant performance (best you can get is O(log N) though some people like to pretend that if the base on the logarithm is large enough it's "effectively constant").
So what you're asking for isn't possible.
But as a fallback, TreeMap isn't bad for sorted keys. It's pretty efficient for an O(log N) solution.
To keep keys in insertion order, you in general need to maintain three maps: index to key, key to index, and key to value. Index to key should be a TreeMap so you can walk it in order; the others can be whatever. The essential idea is that when you add a key-value pair, you increment the index and add (index -> key) to the first map and (key -> index) to the second. When you walk in order, you walk along the index-to-key map and look up values in the key-to-value map. When you delete by key, you find the index with the key-to-index map so you can also delete the entry from the index-to-key map.
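A minimal sketch of that idea in Scala (illustrative names, not part of json4s-ast; it assumes a Long counter for the index):

import scala.collection.immutable.TreeMap

// Insertion-ordered immutable map built from the three maps described above.
final case class OrderedMap[K, V](
    nextIndex: Long,
    indexToKey: TreeMap[Long, K],  // walked in order for stable iteration
    keyToIndex: Map[K, Long],
    keyToValue: Map[K, V]
) {
  def get(key: K): Option[V] = keyToValue.get(key)  // effectively constant

  def updated(key: K, value: V): OrderedMap[K, V] =
    keyToIndex.get(key) match {
      case Some(_) =>  // an existing key keeps its position
        copy(keyToValue = keyToValue.updated(key, value))
      case None =>
        OrderedMap(
          nextIndex + 1,
          indexToKey.updated(nextIndex, key),
          keyToIndex.updated(key, nextIndex),
          keyToValue.updated(key, value)
        )
    }

  def removed(key: K): OrderedMap[K, V] =
    keyToIndex.get(key) match {
      case Some(idx) => OrderedMap(nextIndex, indexToKey - idx, keyToIndex - key, keyToValue - key)
      case None      => this
    }

  def iterator: Iterator[(K, V)] =
    indexToKey.valuesIterator.map(k => k -> keyToValue(k))
}

object OrderedMap {
  def empty[K, V]: OrderedMap[K, V] =
    OrderedMap(0L, TreeMap.empty[Long, K], Map.empty, Map.empty)
}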

After much deliberation with other people, it seems we will maintain two data structures (a vector and a map, i.e. https://github.com/json4s/json4s-ast/issues/8#issuecomment-125025604), which is essentially what @Aivean mentioned. Performance is a bigger concern than potential memory usage.
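Roughly, the representation looks like this (illustrative only, not the final json4s-ast definitions):

sealed trait JValue
final case class JString(value: String) extends JValue
final case class JObject(fields: Vector[(String, JValue)]) extends JValue {
  // Field order is the Vector's order; lookup goes through a Map built from it.
  lazy val byName: Map[String, JValue] = fields.toMap
  def get(name: String): Option[JValue] = byName.get(name)
  // Serializers walk `fields`, so the same JObject always prints the same JSON.
  def fieldIterator: Iterator[(String, JValue)] = fields.iterator
}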
If you need ordering, a TreeMap is indeed the fastest structure, though it does have O(log(n)) lookup; without ordering you have your standard immutable Map, which has effectively constant (eC) lookup. (Haskell also has Data.Map, which works similarly.)
Thanks for the answers!

Related

IndexedDB - IDBKeyRange on simple index with arbitrary key list

I have an object store in an IDB that has a simple (non-compound) index on a field X. This index is not unique (many items may have the same value for X).
I'd like to query the IDB to return all items that have an X value of either "foo", "bar", or "bat".
According to the documentation, index getAll takes either a key (in my case a string) or an IDBKeyRange. However, it's not obvious to me how to construct an IDBKeyRange with an arbitrary set of keys, and get the union of all results based on those keys.
You cannot do this in a single request. indexedDB does not currently support "OR" style queries.
An alternative solution is to do one request per value: for each value, use getAll on the index, then concatenate all of the resulting arrays into a single array (possibly merging duplicates). Since you are using getAll, you don't actually make that many round trips against the DB. In setting up this index, you basically want a store of, say, "things", where each "thing" has a property such as "tags", and tags is an array of string values. The index you create on the "tags" property should be flagged as a multi-entry index.
There are, of course, creative hacky solutions. Here is one. Keep in mind it is completely useless if you have things with different tag sets but still want to match the ones that share some tags; this only works if you do not care whether any one thing has extra tags.
Consider each distinct set of values, ignoring order. Let's call them groups. E.g. foo is 1, bar is 2, bat is 3, foo-bar is 4, foo-bat is 5, bar-bat is 6, etc. You can give each group a key, like the numerical counter value I just used in the example. Then you can store the group key as a property in the object. Each time you go to store the object, calculate its group key. You can precalculate all group keys, or develop a hash-style function that generates a particular key given a set of arbitrary string values.
Sure, you pay a tiny bit more upfront at time of storage and when building the request query, but you save a ton of processing because indexedDB does all the work after that. So you want a simple, fast hash. And sure, this adds complexity. But maybe it will work. Just find a simple JS hash and modify it so that you lexicographically sort the value set prior to use (so that a difference in value order does not cause a difference in hash value).
So, to explain more concretely: for the things object store, each thing object has a property called "tags-hash". You create a basic index on this (not unique, not multi-entry). Each time you put a thing in the store, you calculate the value of tags-hash and set the property's value before calling put. Then, each time you want to query, you calculate the hash of the array of tags by which you wish to query and call getAll(calculated-hash-value), and it will give you all things that have exactly those tags.
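A sketch of such an order-insensitive hash (written here in Scala purely to illustrate the idea; the real thing would be a few equivalent lines of JS):

// Sort the tags so that ["foo", "bar"] and ["bar", "foo"] hash to the same key,
// then hash the joined string. Any simple, fast string hash would do here.
def tagsHash(tags: Seq[String]): String = {
  val canonical = tags.distinct.sorted.mkString("\u0000")  // separator unlikely to appear in a tag
  java.security.MessageDigest
    .getInstance("SHA-1")
    .digest(canonical.getBytes("UTF-8"))
    .map(b => f"${b & 0xff}%02x")
    .mkString
}

// tagsHash(Seq("bar", "foo")) == tagsHash(Seq("foo", "bar"))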

Fastest database engine to store huge string list

I have a huge unique string list (1.000.000.000+ lines).
I need to know if a string does exist in this list or not.
What is the fastest way to do it ?
I guess I need a very simple database engine with a Btree index which lets me do fast lookup ... and MySQL may be too slow and complex for this.
If this is all you need to do, you should take a long look at tries and related data structures specialized for strings (e.g. suffix array). With this many strings, you are guaranteed to have a lot of overlap, and these data structures can eliminate such overlap (saving not only memory but also processing time).
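A minimal in-memory trie sketch (Scala) to show the shape of the lookup; at a billion strings this would of course need to be a compressed or on-disk structure (or a database index), but the idea is the same:

import scala.collection.mutable

final class TrieNode {
  val children: mutable.Map[Char, TrieNode] = mutable.Map.empty
  var isWord: Boolean = false
}

final class Trie {
  private val root = new TrieNode

  def add(word: String): Unit = {
    var node = root
    word.foreach(c => node = node.children.getOrElseUpdate(c, new TrieNode))
    node.isWord = true
  }

  // Membership test costs O(length of the word), independent of how many strings are stored.
  def contains(word: String): Boolean = {
    @annotation.tailrec
    def go(node: TrieNode, i: Int): Boolean =
      if (i == word.length) node.isWord
      else node.children.get(word.charAt(i)) match {
        case Some(next) => go(next, i + 1)
        case None       => false
      }
    go(root, 0)
  }
}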

Best approach for having unique row IDs in the whole database rather than just in one table?

I'm designing a database for a project of mine, and in the project I have many different kinds of objects.
Every object might have comments on it - which it pulls from the same comments table.
I noticed I might run into problems when two different kind of objects have the same id, and when pulling from the comments table they will pull each other comments.
I could just solve it by adding an object_type column, but it will be harder to maintain when querying, etc.
What is the best approach to have unique row IDs across my whole database?
I noticed Facebook numbers their objects with really, really large numeric IDs, and probably determines an object's type by id mod a trillion or some other really big number.
Though that might work, are there any more options to achieve the same thing, or should relying on big enough number ranges be fine?
Thanks!
You could use something like what Twitter uses for their unique IDs.
http://engineering.twitter.com/2010/06/announcing-snowflake.html
For every object you create, you will have to make some sort of API call to this service, though.
Why not tweak your concept of object_type by integrating it into the id column? For example, an ID could be a concatenation of the object type, a separator, and an ID that is unique within that type (e.g. something like comment-42).
This approach might scale better, as a unique ID generator for the whole database might lead to a performance bottleneck.
If you only have one database instance, you can create a new table to allocate IDs:
CREATE TABLE id_gen (
id BIGINT PRIMARY KEY AUTO_INCREMENT NOT NULL
);
Now you can easily generate new unique IDs and use them to store your rows:
INSERT INTO id_gen () VALUES ();
INSERT INTO foo (id, x) VALUES (LAST_INSERT_ID(), 42);
Of course, the moment you have to shard this, you're in a bit of trouble. You could set aside a single database instance that manages this table, but then you have a single point of failure for all writes and a significant I/O bottleneck (that only grows worse if you ever have to deal with geographically disparate datacenters).
Instagram has a wonderful blog post on their ID generation scheme, which leverages PostgreSQL's awesomeness and some knowledge about their particular application to generate unique IDs across shards.
Another approach is to use UUIDs, which are extremely unlikely to exhibit collisions. You get global uniqueness for "free", with some tradeoffs:
slightly larger size: a BIGINT is 8 bytes, while a UUID is 16 bytes;
indexing pains: INSERT is slower for unsorted keys. (UUIDs are actually preferable to hashes, as they contain a timestamp-ordered segment.)
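For example (Scala shown here; any language's standard library has an equivalent):

import java.util.UUID

// Generated application-side, no central ID service or coordination required.
val commentId: UUID = UUID.randomUUID()
// Typically stored as a native uuid column, BINARY(16), or CHAR(36), depending on the database.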
Yet another approach (which was mentioned previously) is to use a scalable ID generation service such as Snowflake. (Of course, this involves installing, integrating, and maintaining said service; the feasibility of doing that is highly project-specific.)
I use tables as object classes, rows as objects, and columns as object parameters. Everything starts with the class techname, in which every object has its identifier, which is unique across the whole database. The object classes are themselves registered as objects in the object-classes table, and the parameters for each object class are linked to it.

Is creating a "binding map" a good idea for allowing O(1) random lookup to a Map?

There are plenty of situations where you have to choose a random entry from a Map containing thousands or millions of elements, for example displaying a random quote, definition, or piece of customer feedback. For the sake of clarity, I will assume in the rest of this post that I am dealing with a dictionary map, where the key String is the word and the value String is its definition.
The bruteforce approach for this kind of problem would be something like this.
// Brute-force approach: pick a random index, then walk the map's iterator to it
val x = scala.util.Random.nextInt(dictionary.size)
val (word, definition) = dictionary.iterator.drop(x).next()
I feel that using a bruteforce approach is not only slow and time consuming, but stupid.
Why would I have to iterate over thousands of elements to find something I am not particularly looking for?
The only way of avoiding bruteforce and having O(1) random lookup is to have an integer as the key of the map, because of two points:
The only random object we can get is an integer
The only way of having O(1) lookup in a map, is to know the key.
But as you can only have one key, if you put an integer as your key, then you can't have O(1) lookup for the definition of a given word, which was the point of using a Map in the first place.
The way I found of doing this, is to declare a second map, the "binding" map, that just binds each key of the dictionary, to an integer.
So basically, you end up having two Maps:
The dictionary where you have words as keys and definitions as values
The bindingMap where you have integers as keys, and words as values
So if you want to retrieve the definition of a given word, you do:
dictionary.get("hello");
And if you want to retrieve a random word you just do:
dictionary.get(bindingMap.get(myRandomNumber));
Pros:
It allows O(1) lookup AND random access
Cons:
Space complexity is O(2n) instead of O(n)... Which is still O(n)...
What do you think of this method? Do you see any better way of doing this?
If you're willing to settle for O(lg n) lookup rather than O(1) lookup, you might want to consider storing your elements in an order statistic tree, a modified balanced binary search tree that supports lookup, insertion, and deletion in O(lg n) as well as the ability to query the element at position k for any arbitrary k in O(lg n) time.
If you're using strings as keys, you could also consider modifying the construction so that you have an "order statistic trie," where you would store the strings in a trie augmented with the number of elements down each branch. This would give very fast lookup of elements while still supporting quick random queries.
And finally, if you want to go with the two-maps approach, consider replacing the last map with a standard array or dynamic array; it's more space efficient.
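A sketch of that combination (Scala; deletion is omitted, since removing from the middle of the key sequence is where it gets more involved):

import scala.util.Random

final class RandomAccessDictionary[K, V] {
  private var byKey = Map.empty[K, V]   // word -> definition, effectively constant lookup
  private var keys  = Vector.empty[K]   // position -> word, effectively constant random access

  def put(key: K, value: V): Unit = {
    if (!byKey.contains(key)) keys = keys :+ key
    byKey = byKey.updated(key, value)
  }

  def get(key: K): Option[V] = byKey.get(key)

  def randomEntry(rng: Random = new Random): Option[(K, V)] =
    if (keys.isEmpty) None
    else {
      val word = keys(rng.nextInt(keys.size))
      Some(word -> byKey(word))
    }
}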
Hope this helps!

Acceptable types to use as keys in a HashTable

I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?
In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?
It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hash value never changes) they are fine to use for indexing a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.
By the way, if a data type is simple enough to always be passed by value (like scalars), then it will of course be OK.
But now imagine that you use a mutable type: if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I have no way to be notified, and the object may now reside in the wrong bucket (if its hash value has changed).
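A tiny Scala sketch of that failure mode, assuming a key type whose hashCode depends on a mutable field:

import scala.collection.mutable

final case class Box(var n: Int)  // case-class hashCode and equals are derived from n

val table = mutable.HashMap.empty[Box, String]
val key   = Box(1)
table(key) = "hello"

key.n = 2  // mutate the key after it has been stored

table.get(key)     // very likely None: the new hash points at a different bucket
table.get(Box(1))  // None: right bucket, but the stored key no longer equals Box(1)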
Hope this helps.
Most string implementations, while they might appear as reference types in managed environments, are typically implemented as immutable types.
What the hash function does is that it maps a very large number of states onto a smaller number of states.
That is why string hashing is good for testing string equality. You can map the value to an index of an array and look up some information about that value very quickly, without needing to compare every character with every character of every other string. And you can say much the same thing about anything: it's all about reducing, or fingerprinting, an arbitrary number of bytes in some manner which is useful.
This is where the discussion about the type of key you use in a hash table becomes moot, because it's the mapping of that value into a smaller state space, and how that mapping is used internally, that makes it useful. An integer is typically hardware-friendly, but 32 bits isn't really a large space, and collisions are likely within it for arbitrary inputs.
In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming your hash table contains hundreds of items).
As long as a suitable hash function is provided, all types will do as keys. Remember, after all, that a hash table is just backed by a linear array: the hash function takes a key of a certain type and computes an index into that array (called a bucket) where the value gets stored (there are some issues with collisions, though).
So the real tricky part is finding a hash function. It should of course have certain properties: it should be simple to compute, chaotic (nearly identical keys should be mapped to completely different buckets), deterministic (the same key always maps to the same bucket), uniform (all possible keys are mapped evenly over the buckets), and surjective (all buckets of the hash table should be used).
It seems it is easier to define such a function for simple types like integers.
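A toy illustration for a string key (a sketch of the idea, not how any particular library computes it):

// Classic polynomial string hash, then a deterministic mapping onto the buckets.
def bucketFor(key: String, numBuckets: Int): Int = {
  val h = key.foldLeft(0)((acc, c) => 31 * acc + c)
  Math.floorMod(h, numBuckets)  // same key -> same bucket; different keys spread across buckets
}

// bucketFor("hello", 16) always returns the same index (determinism); nearly identical
// keys such as "hello" and "hellp" generally land in different buckets.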
The best hash keys are those that
Have good (as in low-collision) hashes (see Object.GetHashCode for .NET, Object.hashCode for Java)
Have quick comparisons (for when there are hash collisions).
All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.
If you were to use a complex type as a key then:
It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
The hash table may need to have intimate knowledge of the type in order to pick a bucket.
There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.
Integers are commonly used because they are easy to split into ranges that correspond to the buckets, they are value types and therefore immutable, and they are fairly easy to generate.