The data structure for the "order" property of an object in a list - language-agnostic

I have a list of ordered objects stored in the database and accessed through an ORM (specifically, Django's ORM). The list is sorted arbitrarily by the user and I need some way to keep track of it. To that end, each object has an "order" property that specifies its order in relation to the other objects.
Any data structure I use to sort on this order field will have to be recreated on each request, so creation has to be cheap. I will often use comparisons between multiple objects. And insertions can't require me to update every single row in the database.
What data structure should I use?

Heap
As implemented by std::set in C++ (strictly an ordered set, usually a red-black tree, rather than a heap), with comparison on the stored value: integers by value, strings by string ordering, and so on. This is good for ordering and access, but it's not clear when you need to do the reordering: during insertion, constantly, and so on.
You aren't specifying what other constraints you have. If the number of elements is known, and it is reasonable or possible to keep a fixed array in memory, then each position could simply be specified by its index. This assumes a fixed number of elements.
You also did not specify which operations you want to perform often.
If you load the values up once and then access them without modification, you can use one algorithm for creation and then either build an index or transform the storage for fast access.

Linked list (doubly)?

I don't think you can get constant time for insertions (but maybe amortized constant time?).
I would use a binary tree; the bit sequence describing the path to a node can then be used as your "order" property.
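Here's a rough Python sketch of that idea (the function names are mine, nothing standard). Appending a sentinel '1' bit turns each path into a binary fraction, so plain string comparison on the augmented key reproduces the tree's in-order ordering, and inserting between two neighbours mints a new path without touching any other row:
def sort_key(path):
    # The sentinel bit makes the path a binary fraction:
    # "" -> 0.5, "0" -> 0.25, "1" -> 0.75, "10" -> 0.625, and so on.
    return path + "1"

def key_between(left, right):
    # Return a path that sorts strictly between two neighbours
    # (None stands for the open end of the list).
    if left is None and right is None:
        return ""                       # the root: very first item
    if left is None:
        return right + "0"              # left child sorts just before
    if right is None:
        return left + "1"               # right child sorts just after
    candidate = left + "1"              # try the left neighbour's right child
    if sort_key(candidate) < sort_key(right):
        return candidate
    return right + "0"                  # else the right neighbour's left child

first = key_between(None, None)         # ""
last = key_between(first, None)         # "1"
middle = key_between(first, last)       # "10", sorts between the two
Stored in a text column and sorted with ORDER BY on the augmented key, this makes insertion a one-row write. The caveat is that paths grow longer if you keep inserting into the same gap, so an occasional renumbering pass may still be worth scheduling.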

Related

Rails - JSON column vs separate table

I'm currently working on a Ruby on Rails project in which I have objects with an association to instructions; that is, each object can have zero or more instruction objects holding some basic data, such as a title, data (a string), and a position (for ordering them in the UI). I tried looking for an answer on Google but found nothing relevant. The instructions are specific to each object and shouldn't be used for lookup or search of any kind, so I figured I should store them as JSON within the object's own table instead of making a join table. My reasoning is that the join table would grow very large once there are many objects, and because of that, querying for each object's instructions would get slower over time. Is that a reasonable concern for storing this data as JSON instead of a has_many association?
Think of using JSON in an RDBMS as a form of denormalization. There are legitimate reasons to use denormalization, but you must keep in mind that it always optimizes for one type of query at the expense of other types of queries.
For example, in this case you could query your object and it would include the JSON document containing all instructions. But if you wanted to search for a specific instruction, it would be quite complex to find the rows whose JSON document contains that instruction. Have you thought about how you would query that?
Using normalized database design, i.e. the join table you mention, allows for more flexibility in queries. You can query the object table, or you can query the instruction table. Either way, you then simply join to the other table to get the corresponding rows.
The way to make this more optimized is to use indexes on the columns you want to search. See my presentation How to Design Indexes, Really or the video.
Using JSON creates a lot of complexity that you probably haven't considered. See my presentation How to Use JSON in MySQL Wrong.
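To make the tradeoff concrete, here is a hedged sketch (SQLite for brevity, invented table names, and it assumes a build with the JSON1 functions; the Rails equivalents would be the analogous MySQL/Postgres queries):
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE objects_json (id INTEGER PRIMARY KEY, instructions TEXT);
CREATE TABLE objects (id INTEGER PRIMARY KEY);
CREATE TABLE instructions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER REFERENCES objects(id),
    title TEXT, data TEXT, position INTEGER);
CREATE INDEX idx_instr_object ON instructions(object_id, position);
CREATE INDEX idx_instr_title ON instructions(title);
""")

# Fetching one object's instructions is cheap either way:
db.execute("SELECT instructions FROM objects_json WHERE id = ?", (1,))
db.execute("""SELECT title, data FROM instructions
              WHERE object_id = ? ORDER BY position""", (1,))

# But "which objects have an instruction titled X?" must unpack the
# JSON of every row in the denormalized design...
db.execute("""SELECT o.id
              FROM objects_json AS o, json_each(o.instructions) AS je
              WHERE json_extract(je.value, '$.title') = ?""", ("X",))
# ...while the join table answers it with an ordinary indexed lookup:
db.execute("SELECT DISTINCT object_id FROM instructions WHERE title = ?",
           ("X",))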

Efficient way to Store Huge Number of Large Arrays of Integers in Database

I need to store an array of integers of length about 1000 against an integer ID and string name. The number of such tuples is almost 160000.
I will pick one array and calculate the root mean square deviation (RMSD) elementwise with all others and store an (ID1,ID2,RMSD) tuple in another table.
Could you please suggest the best way to do this? I am currently using MySQL for other data tables in the same project, but if necessary I will switch.
One possibility would be to store the arrays in a BINARY or a BLOB type column. Given that the base type of your arrays is an integer, you could step through four bytes at a time to extract values at each index.
If I understand the context correctly, the arrays must all be of the same fixed length, so a BINARY type column would be the most efficient, if it offers sufficient space to hold your arrays. You don't have to worry about database normalisation here, because your array is an atomic unit in this context (again, assuming I'm understanding the problem correctly).
If you did have a requirement to access only part of each array, then this may not be the most practical way to store the data.
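A minimal Python sketch of the packing side, assuming signed 32-bit little-endian integers (the struct format and the column width would change for another base type); the rmsd helper is the "external language" option weighed in the next paragraph:
import math
import struct

N = 1000                  # fixed array length, as in the question

def pack_array(values):
    # 1000 ints * 4 bytes -> 4000 bytes, e.g. for a BINARY(4000) column
    return struct.pack("<%di" % N, *values)

def unpack_array(blob):
    # step through the blob four bytes at a time, as described above
    return struct.unpack("<%di" % N, blob)

def rmsd(a, b):
    # element-wise root mean square deviation of two equal-length arrays
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / N)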
The secondary consideration is whether to compute the RMSD value in the database itself, or in some external language on the server. As you've mentioned in your comments, this would be most efficient to do in the database. It sounds like queries are going to be fairly expensive anyway, though, and execution time may not be a primary concern: simplicity of coding in another language may be more desirable. And depending on the cost of computing the RMSD value relative to the cost of round-tripping a query to the database, it may not even make that much of a difference.
Alternatively, as you've alluded to in your question, using Postgres could be worth considering, because of its more expressive PL/pgSQL language.
Incidentally, if you want to search around for more information on good approaches, searching for database and time series would probably be fruitful. Your data is not necessarily time series data, but many of the same considerations would apply.

MySQL key/value store problem

I'm trying to implement a key/value store with MySQL.
I have a user table that has 2 columns, one for the global ID and one for the serialized data.
Now the problem is that every time any bit of the user's data changes, I have to retrieve the serialized data from the db, alter the data, then reserialize it and throw it back into the db. I have to repeat these steps even for a very, very small change to any of the user's data (since there's no way to update that cell within the db itself).
Basically, I'm looking for the solutions people normally use when faced with this problem.
Maybe you should preprocess your JSON data and insert data as a proper MySQL row separated into fields.
Since your input is JSON, you have various alternatives for converting data:
You mentioned many small changes happen in your case. Where do they occur? Do they happen in a member of a list? A top-level attribute?
If updates occur mainly in list members in a part of your JSON data, then perhaps every member should in fact be represented in a different table as separate rows.
If updates occur in an attribute, then represent it as a field.
I think cost of preprocessing won't hurt in your case.
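A hedged sketch of the difference that preprocessing makes (SQLite and invented names, standing in for your MySQL schema):
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_kv (id INTEGER PRIMARY KEY, data TEXT)")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
db.execute("INSERT INTO users_kv VALUES (1, ?)",
           ('{"name": "Ada", "email": "old@example.com"}',))
db.execute("INSERT INTO users VALUES (1, 'Ada', 'old@example.com')")

# Serialized blob: every small change is a full read-modify-write cycle.
doc = json.loads(db.execute("SELECT data FROM users_kv WHERE id = 1")
                 .fetchone()[0])
doc["email"] = "new@example.com"
db.execute("UPDATE users_kv SET data = ? WHERE id = 1", (json.dumps(doc),))

# Preprocessed into fields: one statement touches only what changed.
db.execute("UPDATE users SET email = ? WHERE id = 1", ("new@example.com",))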
When this is a problem, people do not use key/value stores, they design a normalized relational database schema to store the data in separate, single-valued columns which can be updated.
To be honest, your solution is using a database as a glorified file system - I would not recommend this approach for application data that is core to your application.
The best way to use a relational database, in my opinion, is to store relational data - tables, columns, primary and foreign keys, data types. There are situations where this doesn't work - for instance, if your data is really a document, or when the data structures aren't known in advance. For those situations, you can either extend the relational model, or migrate to a document or object database.
In your case, I'd see firstly if the serialized data could be modeled as relational data, and whether you even need a database. If so, move to a relational model. If you need a database but can't model the data as a relational set, you could go for a key/value model where you extract your serialized data into individual key/value pairs; this at least means that you can update/add the individual data field, rather than modify the entire document. Key/value is not a natural fit for RDBMSes, but it may be a smaller jump from your current architecture.
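If the data really can't be modeled relationally, the key/value extraction could look something like this (an entity-attribute-value style table; all names here are invented for the example):
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE user_attrs (
                  user_id INTEGER, attr TEXT, value TEXT,
                  PRIMARY KEY (user_id, attr))""")
db.execute("INSERT INTO user_attrs VALUES (1, 'email', 'old@example.com')")
# One field of the old serialized document is now one small row:
db.execute("""UPDATE user_attrs SET value = ?
              WHERE user_id = 1 AND attr = 'email'""", ("new@example.com",))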
When you have a key/value store, and assuming your serialized data is JSON, it is really only effective with memcached alongside it: you don't update the database on the fly every time, but instead update the memcache and push that to your database in the background. So yes, you have to update the entire value rather than an individual field in your JSON data (like the address alone) in the database. You can update and retrieve data quickly from memcached, and since there are no complex relations in the database, pushing and pulling data between the database and memcache is fast.
I would continue with what you are doing and create separate tables for the indexable data. This allows you to treat your database as a single data store which is managed easily through most operation groups including updates, backups, restores, clustering, etc.
The only thing you may want to consider is adding ElasticSearch to the mix if you need to perform anything like a LIKE query, for improved search performance.
If space is not an issue for you, I would even make it an insert-only database, so any change adds a new record; that way you can keep the history. Of course you may want to remove the older records, but you can have a background job that deletes the superseded records in batches (see the sketch after this answer). Mind you, what I described is basically Kafka.
There are many alternatives out there now that beat RDBMSes in terms of performance. However, they all add extra operational overhead, in that each is yet another piece of middleware to maintain.
The way around that, if you have a microservices architecture, is to keep the middleware as part of your microservice stack. However, you then have to deal with transmitting the data across the microservices, so you'd still end up with something like Kafka underneath it all.
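For what the insert-only idea looks like in practice, here is a minimal sketch (invented schema): every change appends a row, and reads take the newest row per key.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE user_events (
                  user_id INTEGER, seq INTEGER, data TEXT,
                  PRIMARY KEY (user_id, seq))""")
db.execute("INSERT INTO user_events VALUES (1, 1, ?)",
           ('{"email": "old@example.com"}',))
db.execute("INSERT INTO user_events VALUES (1, 2, ?)",
           ('{"email": "new@example.com"}',))
# The current state is simply the newest row for the key; a background
# job can DELETE the superseded rows in batches later.
current = db.execute("""SELECT data FROM user_events
                        WHERE user_id = 1
                        ORDER BY seq DESC LIMIT 1""").fetchone()[0]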

Quickest way to represent array in mysql for retrieval

I have an array of PHP objects that I want to store in a MySQL database table. The only way I can think of is to have one table representing the object with a unique id, and a separate table storing the array (with an array_id column and an object_id column), but retrieval would then require a join, which I believe could get expensive. Is there a better way? I don't care about storage space or insertion time as much as retrieval time.
I don't necessarily need this to work for associative arrays but if the solution could, that would be preferred.
Building a tree structure (read: an array) in MySQL can be tricky, but it is done all of the time. Almost any forum with nested threads has some mechanism to store a tree structure. As another poster said, they do not have to be expensive.
The real question is how you want to use the data. If you need to be able to add/remove data fields from individual nodes in the tree then you can use one of two models
1) Adjacency List Model
2) Modified Preorder Tree Traversal Algorithm
(They sound scary, but it's not that bad I promise.)
The first one listed is probably the more common you will encounter and the second is the one I have begun to use more frequently and has some nice benefits once you wrap your head around it. Take a look at this page--it has an EXCELLENT writeup about both.
http://articles.sitepoint.com/article/hierarchical-data-database
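For a taste of the first model, here is a minimal sketch (SQLite and invented column names): each row stores its parent's id, and you walk the tree by following parent_id.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE nodes (
                  id INTEGER PRIMARY KEY,
                  parent_id INTEGER REFERENCES nodes(id),
                  label TEXT)""")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
               [(1, None, "root"), (2, 1, "child a"), (3, 1, "child b")])
children = db.execute("SELECT id, label FROM nodes WHERE parent_id = ?",
                      (1,)).fetchall()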
As another poster said though, if you don't need to change the data with queries or search inside the text then use a PHP function to store it in a single field.
$array = array('something' => 'fun', 'nothing' => 'to do');
$storage_array = serialize($array);   // a string safe to store in one column
// INSERT INTO DB
// DRAW OUT OF DB
$array = unserialize($row['stored_array']);
Presto-changeo, that one is easy.
If you are comfortable with not being able to search through the data within the array with SQL, you could add a single column to the table and serialize the array into it. You would have to deserialize it on retrieval.
You could use JSON / PHP serialization or whatever is more appropriate for the language you're developing in.
Joins don't have to be so expensive - you can define an index.

What are hashtables and hashmaps and their typical use cases?

I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually implemented.
Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry gets to be a more and more expensive operation as you fill it with data, since you basically have to start searching from one end toward the other, until you find the one you want.
If you want faster access to data, you typically resort to sorting the array and using a binary search. This, however, while increasing the speed of looking up an existing value, makes inserting new values slow, as you need to move existing elements around whenever you insert an element in the middle.
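In miniature (Python here; bisect does the binary search):
import bisect

data = [3, 14, 15, 9, 2, 6]
print(14 in data)                  # unsorted: a linear scan from one end
data.sort()
i = bisect.bisect_left(data, 14)   # sorted: binary search, O(log n) lookup
print(data[i] == 14)
bisect.insort(data, 7)             # ...but inserting still shifts elements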
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length; the array starts out with a size, but all the elements in the array contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, when you want to find a particular entry that you know the key for, you run the key through the same function, get its hash-key, and go to that particular place in the hashtable to retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of elements available for storage, so the hash value is typically reduced further, down to an index which fits the size of the array. One way to do this is simply to take the hash modulo the size of the array. For an array with a size of 10, hash values 0-9 map directly to an index, values 10-19 map down to 0-9 again, and so on.
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (i.e. normal string comparison, for instance). If there is a complete match, you either disregard the new data (it already exists), or you overwrite (you replace the old data for that key), or you add it (multi-valued hashtable). If there is no match, which means that though the hash keys were identical the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
The function that reduces the key to a number does not produce values in key order (i.e. it is not the case that "AAA" becomes 1 and "AAB" becomes 2), so the hashtable is not sorted by any typical ordering.
There is a good Wikipedia article available on the subject as well.
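A toy Python version of the scheme just described, with a fixed-size array, a modulus to fit the index, linear probing for collisions, and growth by rehashing (illustrative only, not production code):
class ToyHashTable:
    def __init__(self, size=8):
        self.slots = [None] * size          # each slot holds (key, value)
        self.count = 0

    def _index(self, key):
        i = hash(key) % len(self.slots)     # reduce the hash to an index
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # collision: probe the next slot
        return i

    def put(self, key, value):
        if self.count + 1 > len(self.slots) * 3 // 4:
            self._grow()                    # keep the array sparse enough
        i = self._index(key)
        if self.slots[i] is None:
            self.count += 1                 # a new key
        self.slots[i] = (key, value)        # insert, or overwrite a match

    def get(self, key):
        slot = self.slots[self._index(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

    def _grow(self):
        entries = [s for s in self.slots if s is not None]
        self.slots = [None] * (len(self.slots) * 2)
        self.count = 0
        for k, v in entries:                # recompute every index
            self.put(k, v)

t = ToyHashTable()
t.put("90210", "Beverly Hills")             # the zip-code example above
t.get("90210")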
lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table: let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing for with this cache is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
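Both use cases, using Python's built-in dict as the hash map (all names invented):
students_by_id = {4501: "Ada", 4502: "Alan"}   # the ID => student cache
students_by_id[4501]                           # the fast lookup we optimized for

student_to_classes = {}                        # relationship kept off the objects
student_to_classes["Ada"] = ["Math 101", "CS 200"]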
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
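For that last point, the two-table trick is just a forward map and a reverse map kept in step, e.g. in Python:
name_by_id = {1: "ada", 2: "alan"}
id_by_name = {name: i for i, name in name_by_id.items()}
id_by_name["ada"]                 # value -> key without scanning the first map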
If you are talking in terms of Java, both are collections which allow adding, removing, and updating objects, and both use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that Hashtable is inherently synchronized and hence thread-safe, while HashMap is not a thread-safe collection.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.
Hashtables/hashmaps associate a value (called a 'key' for disambiguation purposes) with another value. You can think of them as a kind of dictionary (word: definition) or a database record (key: data).