Is creating a "binding map" a good idea for allowing O(1) random lookup to a Map? - language-agnostic

There are plenty of situations where you have to choose a random entry from a Map containing thousands or millions of elements: for example, displaying a random quote, definition, or piece of customer feedback. For the sake of clarity, I will assume in the rest of this post that I am dealing with a dictionary map, where the key String is the word and the value String is its definition.
The brute-force approach for this kind of problem would be something like this:
// Brute-force approach
x = random integer in [0, number of entries in the map)
iterate over the map, advancing x entries
return the entry you land on
I feel that using a brute-force approach is not only slow and time-consuming, but stupid.
Why would I have to iterate over thousands of elements to find something I am not particularly looking for?
The only way of avoiding brute force and having O(1) random lookup is to have an integer as the key of the map, for two reasons:
The only value we can easily generate at random is an integer
The only way of having O(1) lookup in a map is to know the key.
But as a map has only one key per entry, if you put an integer as your key, then you can't have O(1) lookup for the definition of a given word, which was the point of using a Map in the first place.
The way I found to do this is to declare a second map, the "binding" map, that just binds an integer to each key of the dictionary.
So basically, you end up having two Maps:
The dictionary where you have words as keys and definitions as values
The bindingMap where you have integers as keys, and words as values
So if you want to retrieve the definition of a given word, you do:
dictionary.get("hello");
And if you want to retrieve a random word you just do:
dictionary.get(bindingMap.get(myRandomNumber));
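For concreteness, here is a minimal Java sketch of the two-map idea (the class and method names are just illustrative, and it only handles insertions; deleting a word would leave a gap in the indices):
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative sketch of the question's two-map approach (insertion only).
class RandomAccessDictionary {
    private final Map<String, String> dictionary = new HashMap<>();  // word -> definition
    private final Map<Integer, String> bindingMap = new HashMap<>(); // dense index -> word
    private final Random rng = new Random();

    // Each new word gets the next free index, so indices stay dense: 0..n-1.
    void put(String word, String definition) {
        if (!dictionary.containsKey(word)) {
            bindingMap.put(bindingMap.size(), word);
        }
        dictionary.put(word, definition);
    }

    // O(1) lookup by word, exactly as with a plain map.
    String define(String word) {
        return dictionary.get(word);
    }

    // O(1) random entry: random index -> word -> definition.
    String randomDefinition() {
        String word = bindingMap.get(rng.nextInt(bindingMap.size()));
        return dictionary.get(word);
    }
}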
Pros:
It allows O(1) lookup AND random access
Cons:
Space complexity is O(2n) instead of O(n)... Which is still O(n)...
What do you think of this method? Do you see any better way of doing this?

If you're willing to settle for O(lg n) lookup rather than O(1) lookup, you might want to consider storing your elements in an order statistic tree, a modified balanced binary search tree that supports lookup, insertion, and deletion in O(lg n) as well as the ability to query the element at position k for any arbitrary k in O(lg n) time.
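As an illustration of the order-statistic idea only, here is a heavily simplified Java sketch: an unbalanced, insert-only BST whose nodes track subtree sizes, so the element at rank k can be found in time proportional to the tree height. A real order statistic tree would add rebalancing (e.g. on top of a red-black tree) to guarantee O(lg n).
// Simplified, unbalanced sketch of a size-augmented BST ("order statistic" select).
// Not the full structure described above: no rebalancing, no deletion.
class OrderStatNode {
    String key, value;
    OrderStatNode left, right;
    int size = 1;                       // number of nodes in this subtree

    OrderStatNode(String key, String value) { this.key = key; this.value = value; }

    static int size(OrderStatNode n) { return n == null ? 0 : n.size; }

    static OrderStatNode insert(OrderStatNode n, String key, String value) {
        if (n == null) return new OrderStatNode(key, value);
        int cmp = key.compareTo(n.key);
        if (cmp < 0)      n.left  = insert(n.left,  key, value);
        else if (cmp > 0) n.right = insert(n.right, key, value);
        else              n.value = value;            // overwrite existing key
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // Element at rank k (0-based, in key order), in O(height) time.
    static OrderStatNode select(OrderStatNode n, int k) {
        if (n == null) return null;
        int leftSize = size(n.left);
        if (k < leftSize)      return select(n.left, k);
        else if (k > leftSize) return select(n.right, k - leftSize - 1);
        else                   return n;
    }
}
Picking a random element is then just select(root, rng.nextInt(size(root))).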
If you're using strings as keys, you could also consider modifying the construction so that you have an "order statistic trie," where you would store the strings in a trie augmented with the number of elements down each branch. This would give very fast lookup of elements while still supporting quick random queries.
And finally, if you want to go with the two-maps approach, consider replacing the last map with a standard array or dynamic array; it's more space efficient.
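A rough sketch of that variant (again Java, illustrative names, insert-only): the dictionary map keeps word -> definition, and a plain growable list of the words stands in for the bindingMap.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Same idea as the two maps, but the Integer-keyed map becomes a plain list.
class ArrayBackedRandomDictionary {
    private final Map<String, String> dictionary = new HashMap<>(); // word -> definition
    private final List<String> words = new ArrayList<>();           // position -> word
    private final Random rng = new Random();

    void put(String word, String definition) {
        if (!dictionary.containsKey(word)) {
            words.add(word);
        }
        dictionary.put(word, definition);
    }

    String randomDefinition() {
        return dictionary.get(words.get(rng.nextInt(words.size())));
    }
}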
Hope this helps!


Store and query array or group of words in MYSQL and PHP

I am working on a project that uses PHP/MySQL as the backend for an iOS app that makes a lot of use of dictionaries and arrays containing text or strings.
I need to store this text in MySQL (coming from arrays of strings on the phone) and then query to see if the text contains (case-insensitively) a given word or phrase.
For example, if the array consists of {Ford, Chevy, Toyota, BMW, Buick}, I might want to query it to see if it contains Saab.
I know storing arrays in a field is not MySQL-friendly as it prevents optimization. However, it would be way too complicated to create individual tables for these collections of words, which are created by users.
So I'm looking for a reasonable way to store them, perhaps delimited with spaces or commas, that makes reasonably efficient searches possible.
If they are stored separated by spaces, I gather you can do something with regex like:
SELECT *
FROM `wordgroups`
WHERE wordgroup REGEXP '(^|[[:space:]])BLA([[:space:]]|$)';
But this seems funky.
Is there a better way to do this? Thanks for any insights
Consider using a FULLTEXT index, and use MATCH(...) AGAINST(... IN NATURAL LANGUAGE MODE).
FULLTEXT is very fast for "words", and NATURAL LANGUAGE MODE may solve your Saab example.
Using regexp can achieve what you want; however, your query will be inefficient, since it cannot rely on any indexes.
If you want to store a list of words and their position within the array does not matter, then you may consider storing them in a single field, space delimited. But instead of using a regexp, use fulltext indexing and searching. This method has a clear advantage over searching with regexp: it uses an index. It has some drawbacks as well: there is a stopword list (these words are excluded from searching) and there is a minimum word length. The good news is that these parameters are configurable. Also, you get all the drawbacks of storing data in a delimited field, as detailed in the question "Is storing a delimited list in a database column really that bad?" here on SO.
However, if you want to use dictionaries (key-value pairs), or if the position within the list may be important, then the above data structure will not do.
In this case, I would consider whether MySQL is the right choice for storing my data in the first place. If you have multi-dimensional lists, or lists containing lists, then I would definitely choose a NoSQL solution instead.
If you only need simple, two-dimensional lists / dictionaries, then you can store all of them in a single table with a similar structure as below:
list_id - unique identifier of the list, primary key
user_id - id of the user the list belongs to
key - for dictionaries this is the lookup field (indexed), for other lists it may store the position of the element. String data type.
value - the field holding the value (indexed). Data type should be string, so that it could hold different data types as well.
A search to determine whether a list holds a certain value would be a fast and efficient lookup, using the index on either the key or the value field.

IndexedDB - IDBKeyRange on simple index with arbitrary key list

I have an object store in an IDB that has a simple (non-compound) index on a field X. This index is not unique (many items may have the same value for X).
I'd like to query the IDB to return all items that have an X value of either "foo", "bar", or "bat".
According to the documentation, index getAll takes either a key (in my case a string) or an IDBKeyRange. However, it's not obvious to me how to construct an IDBKeyRange with an arbitrary set of keys, and get the union of all results based on those keys.
You cannot do this in a single request. indexedDB does not currently support "OR" style queries.
An alternative solution is to do one request per value: for each value, use getAll on the index for that value, then concatenate all of the resulting arrays into a single array (possibly merging duplicates). You don't actually have that many round trips against the DB since you are using getAll. In setting up this index, you basically want a store of, let's say, "things", where each "thing" has a property such as "tags", where tags is an array of string values. The index you create on the "tags" property should be flagged as a multi-entry index.
There are, of course, creative hacky solutions. Here is one. Keep in mind it is completely useless if your things have different tag sets and you still want to match the ones that share some tags; it only works if you do not care about whether any one thing has extra tags.

Consider each distinct set of values, ignoring order. Let's call them groups. E.g. foo is 1, bar is 2, bat is 3, foo-bar is 4, foo-bat is 5, bar-bat is 6, etc. You can give each group a key, like the numerical counter value I just used in the example. Then you store the group key as a property in the object. Each time you go to store the object, calculate its group key. You can precalculate all group keys, or develop a hash-style function that generates a particular key given a set of arbitrary string values. You pay a tiny bit more upfront at time of storage and when building the request query, but you save a ton of processing because indexedDB does all the work after that. So you want a simple, fast hash. And sure, this adds complexity, but maybe it will work. Just find a simple JS hash and modify it so that you lexicographically sort the value set prior to hashing (so that a difference in value order does not cause a difference in hash value).

To explain more concretely: for the things object store, each thing object has a property called "tags-hash". You create a basic index on this (not unique, not multi-entry). Each time you put a thing in the store, you calculate the value of tags-hash and set the property's value before calling put. Then, each time you want to query, you calculate the hash of the array of tags by which you wish to query and call getAll(calculated-hash-value); it will give you all things that have exactly those tags.

Scala immutable Map like datastructure that has constant/effective constant lookup

In designing a JSON AST for Scala, we realised we hit a problem (which is described in greater detail here: https://github.com/json4s/json4s-ast/issues/8). Ideally we would like to represent a JObject (JSON object) with a Map structure that either preserves ordering for its keys OR is sorted by key using a default Ordering, and that has either constant or effectively constant lookup time.
The reason why we need something that either preserves ordering or guarantees the keys are sorted is that when someone serializes the JValue, we need to make sure that it always outputs the same JSON for the same JValue (the most obvious case is caching JValue serializations).
The issue is that the Scala stdlib doesn't appear to have an immutable Map-like data structure that preserves/orders by key with O(c) or O(eC) lookup time (reference http://docs.scala-lang.org/overviews/collections/performance-characteristics.html).
Does anyone know if there is an implementation of such a data structure in Scala somewhere that follows the Scala collections library (or even if such a data structure exists in general)?
You can't sort in constant time in general, so using an Ordering is right out.
And there aren't actually any immutable data structures I'm aware of that let you add and delete in arbitrary spots with constant performance (best you can get is O(log N) though some people like to pretend that if the base on the logarithm is large enough it's "effectively constant").
So what you're asking for isn't possible.
But as a fallback, TreeMap isn't bad for sorted keys. It's pretty efficient for an O(log N) solution.
For keys in the correct order, you in general need to maintain three maps: index to key, key to index, and key to value. Index to key should be a TreeMap so you can walk in order. The others can be whatever. The essential idea is that when you want to add a key-value pair, you increment the index and add (index -> key) to the first map and (key -> index) to the second. Then when you walk in order, you walk along the index-to-key map and look up in the key-value map. When you delete by key, you find the index with the key-to-index map so you can delete it from the index-to-key map also.
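To make that bookkeeping concrete, here is a rough mutable sketch in Java (the thread is about immutable Scala collections, so treat this purely as an illustration of the three-map idea, not a drop-in answer; all names are made up):
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: insertion-ordered key/value store built from three maps.
class InsertionOrderedMap {
    private final TreeMap<Integer, String> indexToKey = new TreeMap<>(); // walk this in order
    private final Map<String, Integer> keyToIndex = new HashMap<>();
    private final Map<String, String> keyToValue = new HashMap<>();
    private int nextIndex = 0;

    void put(String key, String value) {
        if (!keyToIndex.containsKey(key)) {
            indexToKey.put(nextIndex, key);   // remember where this key sits in the order
            keyToIndex.put(key, nextIndex);
            nextIndex++;
        }
        keyToValue.put(key, value);
    }

    String get(String key) {
        return keyToValue.get(key);
    }

    void remove(String key) {
        Integer index = keyToIndex.remove(key);
        if (index != null) {
            indexToKey.remove(index);
            keyToValue.remove(key);
        }
    }

    // Visit entries in insertion order by walking the index-to-key TreeMap.
    void forEachInOrder(java.util.function.BiConsumer<String, String> action) {
        for (String key : indexToKey.values()) {
            action.accept(key, keyToValue.get(key));
        }
    }
}
In mutable Java terms this is roughly what LinkedHashMap does internally for insertion order; the point here is just the shape of the three maps.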
After much deliberation with other people, it seems we will maintain 2 data structures (a vector and a map, i.e. https://github.com/json4s/json4s-ast/issues/8#issuecomment-125025604) which is essentially what #Aivean mentioned. Performance is a bigger concern than potential memory usage.
If you need ordering, a TreeMap is indeed the fastest structure, which does have O(log(n)) lookup; without ordering you have your standard immutable Map, which is O(eC) (effectively constant). (Haskell also has Data.Map, which works similarly.)
Thanks for the answers!

Acceptable types to use as keys in a HashTable

I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?
In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?
It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hash value never changes) they are fine for indexing a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.
By the way, if a data type is simple enough to be always passed by value (like scalars), then it will of course be OK.
But now imagine that you use a mutable type; if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I will have no way to be notified; and the object may now reside in the wrong bucket (if its hash value is now different).
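A small Java illustration of that failure mode (MutableKey is a made-up class for the example):
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical mutable key: its hashCode changes when the field changes.
class MutableKey {
    String name;
    MutableKey(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && Objects.equals(name, ((MutableKey) o).name);
    }
    @Override public int hashCode() { return Objects.hashCode(name); }
}

class MutableKeyDemo {
    public static void main(String[] args) {
        Map<MutableKey, String> table = new HashMap<>();
        MutableKey key = new MutableKey("hello");
        table.put(key, "a greeting");

        key.name = "goodbye";                // mutate the key after insertion
        System.out.println(table.get(key));  // usually null: the entry is filed under the old hash
    }
}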
Hope this helps.
Most string implementations, while they might appear as reference types in managed environments, are typically implemented as immutable types.
What the hash function does is that it maps a very large number of states onto a smaller number of states.
That is why string hashing is good for testing string equality. You can map the value to an index of an array, and look up some information about that value very quickly. You don't need to compare every character with every other character in every other string. And you can say just about the same thing about anything. It's all about reducing, or fingerprinting an arbitrary number of bytes in some manner which is useful.
This is where the discussion about the type of key you use in a hash table becomes invalid, because it's the mapping of that value into a smaller state space and how that's utilized internally which makes it useful. An integer is typically hardware friendly, but 32-bits isn't really a large space and collisions are likely within that space for arbitrary inputs.
In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming that your hash table contains hundreds of items).
As long as a suitable hash function is provided, all types will do as keys. Remember, after all, a hash table is just a linear array. The hash function takes a key of a certain type and computes an index into the hash table array (called a bucket) where the value gets stored (there are some issues with collisions, though).
So the real tricky part is finding a hash function. Of course it should have certain properties, like being simple to compute, chaotic (nearly identical keys should be mapped to completely different hash table buckets), deterministic (the same key always means the same hash table bucket), uniform (all possible keys are mapped evenly to the buckets), and surjective (all buckets of the hash table should be used).
It seems it is easier to define such a function for simple types like integers.
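As a small illustration, here is the classic polynomial string hash (the same recurrence java.lang.String.hashCode uses), reduced to a bucket index; the bucket count of 16 is arbitrary:
class HashDemo {
    // Polynomial rolling hash over the characters, then reduced to a bucket index.
    static int bucketFor(String key, int bucketCount) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);       // same recurrence as String.hashCode
        }
        return Math.floorMod(h, bucketCount); // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("hello", 16));  // some bucket in [0, 16)
        System.out.println(bucketFor("hellp", 16));  // a nearly identical key, usually a different bucket
    }
}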
The best hash keys are those that
Have good (as in low-collision) hashes (see Object.GetHashCode for .NET, Object.hashCode for Java)
Have quick comparisons (for when there are hash collisions).
All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.
If you were to use a complex type as a key then:
It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
The hash table may need to have intimate knowledge of the type in order to pick a bucket.
There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.
Integers are commonly used because they are easy to split into ranges that correspond to the buckets, they are value types and therefore immutable, and they are fairly easy to generate.

Optimal Way to Store/Retrieve Array in Table

I currently have a table in MySQL that stores values normally, but I want to add a field to that table that stores an array of values, such as cities. Should I simply store that array as a CSV? Each row will need its own array, so I feel uneasy about making a new table and inserting 2-5 rows for each row inserted in the previous table.
I feel like this situation should have a name, I just can't think of it :)
Edit
Number of elements: 2-5 (a selection from a dynamic list of cities; the array references the list, which is a table).
This field would not need to be searchable, simply retrieved alongside other data.
The "right" way would be to have another table that holds each value but since you don't want to go that route a delimited list should work. Just make sure that you pick a delimiter that won't show up in the data. You can also store the data as XML depending on how you plan on interacting with the data this may be a better route.
I would go with the idea of a field containing your comma (or other logical delimiter) separated values. Just make sure that your field is going to be big enough to hold your maximum array size. Then when you pull the field out, it should be easy to perform an explode() on the long string using your delimiter, which will then immediately populate your array in the code.
Maybe the word you're looking for is "normalize". As in, move the array to a separate table, linked to the first by means of a key. This offers several advantages:
The array size can grow almost indefinitely
Efficient storage
Ability to search for values in the array without having to use "like"
Of course, the decision of whether to normalize this data depends on many factors that you haven't mentioned, like the number of elements, whether or not the number is fixed, whether the elements need to be searchable, etc.
Is your application PHP? It might be worth investigating the functions serialize and unserialize.
These two functions allow you to easily store an array in the database, then recreate that array at a later time.
As others have mentioned, another table is the proper way to go.
But if you really don't want to do that(?), assuming you're using PHP with MySQL, why not use serialize() and store the serialized value?