I have a web app that I've built and now want to make properly. It contains a large tree that I want to persist in a database. The tree will be about 50 nodes wide and 30 nodes deep. It will have frequent reads and writes, mostly to single nodes, but copying/pasting subtrees is also possible.
I've already implemented it using Nested Intervals. But the implementation I used (node key encoding) produces very large integers once the tree gets deep.
My question is: what is the most efficient representation of hierarchical data that we know of today?
Thanks,
Marco.
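(For a sense of scale of the problem: below is a minimal sketch, assuming a rational/Farey-style nested-interval scheme where the materialized path is interpreted as a continued fraction. That may differ from the exact key encoding used here, but it shows how fast the key integers grow at this width and depth.)

from fractions import Fraction

def path_to_key(path):
    # Hypothetical helper for illustration only: encode a materialized path
    # (1-based sibling indices, root first) as the continued fraction that
    # Farey/nested-interval key encodings boil down to.
    key = Fraction(path[-1])
    for index in reversed(path[:-1]):
        key = index + 1 / key
    return key

# A node that is the 50th sibling at every level, 30 levels deep:
key = path_to_key([50] * 30)
print(key.numerator.bit_length(), key.denominator.bit_length())
# Both are well over 150 bits, far beyond a 64-bit integer column.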
Related
In simple words: Is
{
"diary":{
"number":100,
"year":2006
},
"case":{
"number":12345,
"year":2006
}
}
or
{
"diary_number":100,
"diary_year":2006,
"case_number":12345,
"case_year":2006
}
better when using Elasticsearch?
In my case the total number of keys is small (10-15). Which is better performance-wise?
The use case is displaying data from a NoSQL database (mostly DynamoDB), and also feeding it into Elasticsearch.
My rule of thumb: if you need to query or update nested fields, use the flat structure.
If you use the nested structure, Elasticsearch will make it flat internally but then has the overhead of managing those relations. Performance-wise, flat is always better, since Elasticsearch doesn't need to relate and find nested documents.
Here's an excerpt from Managing Relations Inside Elasticsearch which lists some disadvantages you might want to consider.
Elasticsearch is still fundamentally flat, but it manages the nested relation internally to give the appearance of nested hierarchy. When you create a nested document, Elasticsearch actually indexes two separate documents (root object and nested object), then relates the two internally. Both docs are stored in the same Lucene block on the same Shard, so read performance is still very fast.
This arrangement does come with some disadvantages. Most obvious, you can only access these nested documents using a special nested query. Another big disadvantage comes when you need to update the document, either the root or any of the objects.
Since the docs are all stored in the same Lucene block, and Lucene never allows random write access to its segments, updating one field in the nested doc will force a reindex of the entire document. This includes the root and any other nested objects, even if they were not modified. Internally, ES will mark the old document as deleted, update the field and then reindex everything into a new Lucene block. If your data changes often, nested documents can have a non-negligible overhead associated with reindexing.
Lastly, it is not possible to "cross reference" between nested documents. One nested doc cannot "see" another nested doc's properties. For example, you are not able to filter on "A.name" but facet on "B.age". You can get around this by using include_in_root, which effectively copies the nested docs into the root, but this gets you back to the problems of inner objects.
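To make that "special nested query" concrete, here is a rough Python sketch (assuming the official elasticsearch-py 8.x client, a local cluster, and a hypothetical index name cases) comparing a query on a flat field with one on a nested field:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Flat structure: an ordinary match query is enough.
es.search(index="cases", query={"match": {"diary_year": 2006}})

# Nested structure: the nested field can only be reached through a nested query.
es.search(
    index="cases",
    query={
        "nested": {
            "path": "diary",
            "query": {"match": {"diary.year": 2006}},
        }
    },
)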
Nested data is quite good. Unless you explicitly declare diary and case as nested fields, they will be indexed as object fields. So Elasticsearch will itself convert them to
{
"diary.number":100,
"diary.year":2006,
"case.number":12345,
"case.year":2006
}
Consider also that every field value in Elasticsearch can be an array. You need the nested datatype only if you have many diaries in a single document and need to "maintain the independence of each object in the array".
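A minimal sketch of the mapping difference (again assuming the 8.x Python client, a local cluster, and a hypothetical index name cases); without the explicit nested type, diary is just an object field and gets flattened to the dotted form shown above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

es.indices.create(
    index="cases",
    mappings={
        "properties": {
            "diary": {"type": "nested"},  # keeps objects in an array independent
            "case": {"type": "object"},   # the default: flattened to case.number, case.year
        }
    },
)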
The answer is a clear it-depends. JSON is famous for its nested structures. However, there are some tools which can only deal with key-value structures and flat JSON, and I feel Elastic is more fun with flat JSON, in particular if you use Logstash; see e.g. https://discuss.elastic.co/t/what-is-the-best-way-of-getting-mongodb-data-into-elasticsearch/40840/5
I am happy to be proven wrong.
I've read some of the MongoDB documentation but I wasn't able to find an answer to my question.
I'm developing an application where I want to store JSON documents. I've read about indexes and so on, but one question remains for me.
The data I want to store contains information that does not need to be loaded by the client as a whole. So I planned to normalize the data, split my big JSON into smaller documents, and expose them via separate REST endpoints.
Now I was thinking about creating a different collection for each group of documents.
The reason for that is that I want to reduce the search space compared to storing everything in one collection.
So each user will have 5 collections and I expect 1 million users.
Is this a good solution in terms of performance and scaling?
Is querying multiple collections more expensive than querying one?
Recently, while working on a project, my team and I faced this situation: we had a huge data set that is expected to grow rapidly in the future.
We had MongoDB in place, and as the data grew the performance started to degrade. The reason was mainly the multiple collections: we had to perform lookups to join the collections and get the data.
Interestingly, the way we mapped the two collections played a very important role in the performance.
Our initial structure was:
Collection A {
"_id" : ...,
"info" : [
// list of object id of other collection
]
}
The info field was used to map to the "_id" of documents in Collection B.
Since MongoDB has _id as a unique identifier, no matter what indexes we had, it would scan all the documents of Collection B, and if B is GBs or TBs in size, it takes very long to get even one matching document.
So the change we made was:
We removed the array of object ids from Collection A and added a new field in Collection B that holds the _id of the corresponding document in Collection A.
Long story short, we reversed the mapping we had.
We then applied an index on the Collection B fields used in the query. This improved the performance a lot.
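A rough pymongo sketch of that reversed mapping (the collection and field names here are made up for illustration): instead of Collection A holding an array of B's ids, each B document points back at its A document, and that field is indexed:

from pymongo import MongoClient

db = MongoClient()["mydb"]  # assumed local MongoDB and database name

# Reversed mapping: each document in B carries the _id of its parent in A.
a_id = db.collection_a.insert_one({"name": "parent"}).inserted_id
db.collection_b.insert_one({"a_id": a_id, "payload": "child data"})

# Index the field used in the query so lookups from A to B avoid a full scan.
db.collection_b.create_index("a_id")
children = list(db.collection_b.find({"a_id": a_id}))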
So it's not a bad idea to have multiple collections; with proper mapping between collections, MongoDB can provide excellent performance. You can also use sharding to enhance it further.
As mentioned in the following article: http://www.couchbase.com/why-nosql/nosql-database
When looking up data, the desired information needs to be collected from many tables (often hundreds in today’s enterprise applications) and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed on many tables.
and the given example of data in JSON format says
ease of efficiently distributing the resulting documents and read and write performance improvements make it an easy trade-off for web-based applications
But what if I capture all my data in a single table in MySQL, as is done in MongoDB [in the link given]? Would the performance then be equivalent to MongoDB [meaning extracting data from MySQL without JOINs]?
It all depends on the structure you require. The main point of splitting data into tables is being able to index pieces of data, accelerating retrieval.
Another point is that the normalization a relational database offers ties you to a rigid structure. You can, of course, store JSON in MySQL, but the JSON document won't have its pieces indexed. If you want fast retrieval of a JSON document by its pieces, then you are looking at splitting it into parts.
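A small sketch of that point, using Python's built-in sqlite3 rather than MySQL for brevity (this assumes a SQLite build with the JSON1 functions; MySQL has analogous JSON functions and generated columns). Querying into the JSON blob works, but only a piece split out into its own column can use an ordinary index:

import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.execute(
    "INSERT INTO docs (body) VALUES (?)",
    (json.dumps({"diary": {"number": 100, "year": 2006}}),),
)

# Reaching into the JSON works, but this predicate cannot use a normal index.
con.execute("SELECT id FROM docs WHERE json_extract(body, '$.diary.year') = 2006")

# Splitting the piece into its own column makes it indexable.
con.execute("ALTER TABLE docs ADD COLUMN diary_year INTEGER")
con.execute("UPDATE docs SET diary_year = json_extract(body, '$.diary.year')")
con.execute("CREATE INDEX idx_diary_year ON docs (diary_year)")
con.execute("SELECT id FROM docs WHERE diary_year = 2006")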
If your data can change, meaning it doesn't require a fixed schema, then use Mongo.
If your data structure doesn't change, then I'd go with MySQL.
I'm building a multi level commenting system and need a solution for quick reads and writes.
I've looked into adjacency list and nested set and it seems to me that for my particular scenario neither is the right method to use, so I'm looking into non RDBMS solutions as well.
What I would like to achieve:
Multi-level parent/child relationships
Lots of reads and lots of writes
Adding/editing any child at any level
Sorting the entire tree by datetime (old/new) or voting score
I feel like the best solution for an RDBMS is the adjacency list, where you have recursive reads. But this is very inefficient because there will be thousands of reads per minute. Nested sets are great for reads, but I will have a lot of writes too, which will make them really slow and inefficient.
Do you know any other techniques that I could use here? Maybe other types of databases?
Most comment threads are very small in size, less than a few KB. So rather than storing each comment as its own record in the database, you can store the entire comment graph as a single object. This will make it very easy to read and write the comment tree quickly.
This method lends itself very well to a shared cache a la Redis or Memcached.
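A rough sketch of that idea with redis-py (the key naming and tree shape are made up for illustration): the whole comment tree for a post is serialized as one JSON value, so a read or a write touches a single key:

import json
import redis

r = redis.Redis()  # assumed local Redis

def save_comment_tree(post_id, tree):
    # Store the whole comment graph for a post as one serialized object.
    r.set(f"comments:{post_id}", json.dumps(tree))

def load_comment_tree(post_id):
    raw = r.get(f"comments:{post_id}")
    return json.loads(raw) if raw else None

tree = {
    "id": 1, "text": "root comment", "score": 5, "created": "2013-01-01",
    "children": [
        {"id": 2, "text": "reply", "score": 2, "created": "2013-01-02", "children": []},
    ],
}
save_comment_tree(42, tree)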
I've got a list of all countries -> states -> cities (-> subcities/villages etc.) in an XML file, and retrieving, for example, all of a state's cities is really quick with XML (using an XML parser).
I wonder, if I put all this information in MySQL, is retrieving all of a state's cities as fast as with XML? Because XML is designed to store hierarchical data, while relational databases like MySQL are not.
The list contains around 500,000 entities, so I wonder if it would be as fast as XML using either of:
Adjacency list model
Nested Set model
And which one should I use? Because (theoretically) there could be unlimited levels under a state (I've heard that adjacency lists aren't good for unlimited child levels). And which is fastest for this huge dataset?
Thanks!
In this article Quassnoi creates a table with 2,441,405 rows in a hierarchical structure and tests the performance of highly optimized queries for nested sets and adjacency lists. He runs a variety of different tests, for example fetching ancestors or descendants, and times the results (read the article for more details of exactly what was tested):
Query                                   Nested Sets   Adjacency Lists
All descendants                         300ms         7000ms
All ancestors                           15ms          600ms
All descendants up to a certain level   5000ms        600ms
His conclusion is that for MySQL nested sets are faster to query, but have the drawback that they are much slower to update. If you have infrequent updates, use nested sets. Otherwise prefer adjacency lists.
You might also wish to consider if using another database that supports recursive CTEs is an option for you.
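For reference, here is a short sketch of what a recursive CTE gives you for the adjacency list model, run through Python's sqlite3 for brevity (SQLite supports recursive CTEs, as does MySQL 8.0+; the place table layout is made up for illustration):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE place (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO place VALUES (?, ?, ?)",
    [(1, None, "Sweden"), (2, 1, "Stockholm County"), (3, 2, "Stockholm"), (4, 2, "Solna")],
)

# All descendants of node 1 in a single query, with no application-side recursion.
rows = con.execute(
    """
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM place WHERE id = ?
        UNION ALL
        SELECT p.id, p.name FROM place p JOIN subtree s ON p.parent_id = s.id
    )
    SELECT id, name FROM subtree
    """,
    (1,),
).fetchall()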
I would imagine that an XML file of this size would take a reasonably long time to parse, but if you can cache the parsed structure in memory rather than reading it from disk each time then queries against it will be very fast.
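A minimal sketch of that parse-once-and-cache approach using the standard library (the file, element, and attribute names here are hypothetical): parse the file once at startup, build a lookup dict, and answer queries from memory:

import xml.etree.ElementTree as ET

# Parse once at startup (the expensive part) and keep the lookup in memory.
tree = ET.parse("places.xml")  # hypothetical file of countries/states/cities
cities_by_state = {
    state.get("name"): [city.get("name") for city in state.findall("city")]
    for state in tree.getroot().iter("state")
}

def cities_of(state_name):
    # Answered from the in-memory dict; no re-parsing per request.
    return cities_by_state.get(state_name, [])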
Note that the main drawback of using MySQL for storing hierarchical data is that it requires some very complex queries. While you can just copy the code from the article I linked to, if you ever need to modify it slightly you will have to understand how it works. If you prefer to keep things simple then XML definitely has an advantage, as it was designed for this type of data, and so you should easily be able to create the queries you need.