I am having problems understanding some of the Microsoft PST file format specification.
My understanding:
In the NDB layer, our entry point is an NID. Given an NID, we can find the leaf node in the BTree. From there, we have bidData and bidSub.
bidData either points to an external data node, or a data tree.
bidSub points to a subnode tree.
My questions:
Can we have a subnode tree without a data tree?
What circumstances would we have a subnode tree?
Is the result of the subnode tree to be concatenated with the result of the data tree?
You have 5 questions:

1. What is the relationship between nodes, subnodes, and blocks?
2. Can we have a subnode tree without a data tree?
3. What circumstances would we have a subnode tree?
4. Is the result of the subnode tree to be concatenated with the result of the data tree?
5. Should I iterate through the data blocks/tree first, then the subnode tree, or is there some other way to organize the data I read?
The answer to Question 1, per the standard located here, is:
A node is an abstraction that consists of a stream of bytes and a collection of subnodes. It is implemented by the NDB layer as a data block (section 2.2.2.8.3.1) and a subnode BTree (section 2.2.2.8.3.3). The NBTENTRY structures in the Node BTree (section 2.2.2.7.7.4) exist to define which blocks combine to form nodes.
As per the diagram here, you can see that a node has the NID, bidData and bidSub, as per your understanding.
In summary, a node is made up of a data block (or a data BTree, which can point to data blocks) and a subnode BTree.
A subnode BTree contains SIBLOCK and SLBLOCK structures, which contain SIENTRY and SLENTRY structures.
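To make the relationship from Question 1 concrete, here is a rough Python sketch of the pieces involved; the field names mirror the spec's NID/BID terminology, but the classes themselves are illustrative only, not an actual parser.

from dataclasses import dataclass

@dataclass
class NBTEntry:
    # Leaf entry of the Node BTree: one per node (illustrative subset of fields).
    nid: int       # the node's NID, our entry point into the NDB layer
    bid_data: int  # BID of the node's data block or data tree root; 0 if not yet set
    bid_sub: int   # BID of the node's subnode BTree; 0 if the node has no subnodes

@dataclass
class SLEntry:
    # Leaf entry (SLENTRY) of a subnode BTree: one per subnode.
    nid: int       # NID local to the owning node
    bid_data: int  # the subnode's own data block or data tree
    bid_sub: int   # a subnode may in turn own a nested subnode BTree

# A "node", then, is the byte stream reachable through bid_data plus the collection
# of subnodes reachable through bid_sub; blocks (data blocks, XBLOCKs, SLBLOCKs, ...)
# are the units of storage those BIDs resolve to.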
Answering Question 2: yes, you can have a subnode BTree without a data BTree, because a data tree is only one way of specifying bidData; the other is to specify a data block directly.
More specifically, in order to create a subnode as per 2.6.1.2.2, you are required to have a data block to associate the SLENTRY with. This data block can be specified directly, or bidData can reference a data tree containing one or more external or internal data block references.
If your question is whether we can have a subnode BTree (i.e. bidSub) without the relevant bidData being set, the answer is no, as per the above. If bidData is initialised to 0x0 (as per here) to represent a placeholder node that doesn't yet have a data block, the SLENTRYs won't be associated with the placeholder node until a valid bidData is set.
Answering Question 3: subnodes are used to divide data into logical / hierarchical sections. I don't know your circumstances, so I can't really answer this question beyond the following.
You would have subnode trees when you need to store data in PSTs that lends itself well to logical separation. Examples of existing uses of subnodes are in Message objects (as per here) to store attachments in the Messaging layer, in storing Table Contexts in the LTP layer (as per here), and for additional storage in Property Contexts (as per here).
Answering Question 4: I don't understand what you mean by the term "result". As far as I know, combining information from data trees and subnode trees happens at the LTP or Messaging layers.
Answering Question 5: it really depends on what you are doing. The PST SDK provides a mechanism to iterate over the nodes in a node database, a mechanism to read from a node as a stream rather than directly, and a method to iterate over the first level of subnodes (as per here and here).
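If you end up writing your own reader instead of using the PST SDK, one way to organize the reading (my interpretation, not something mandated by the spec) is to flatten bidData into the node's byte stream first, and then walk bidSub separately, recursing into each subnode. A rough Python sketch, where read_block, is_data_tree, leaf_bids and subnode_entries are hypothetical helpers you would implement against the NDB structures:

def read_node_stream(bid_data):
    # The node's byte stream: a single data block, or the concatenation of the
    # leaf data blocks of a data tree (XBLOCK/XXBLOCK).
    block = read_block(bid_data)                      # hypothetical: load a block by BID
    if is_data_tree(block):                           # hypothetical: XBLOCK/XXBLOCK check
        return b"".join(read_block(bid).data for bid in leaf_bids(block))
    return block.data

def walk_subnodes(bid_sub):
    # Yield (nid, byte stream) for every subnode reachable from bid_sub.
    if bid_sub == 0:                                  # the node has no subnode BTree
        return
    for entry in subnode_entries(bid_sub):            # hypothetical: walk the SI/SL blocks
        yield entry.nid, read_node_stream(entry.bid_data)
        yield from walk_subnodes(entry.bid_sub)       # subnodes can nest further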
References (other than already linked)
NDB Layer Overview
Understanding the Outlook MS-PST Binary File Format
PST SDK documentation
I looked at some existing implementations of adjacency lists online, and most if not all of them were implemented using dynamic arrays. But wouldn't hashtable-based data structures (set and map) be more suitable?
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable based data structure, lookups will be O(1) whether the nodes are indexed or otherwise.
So why are maps and sets not the default data structures used when implementing adjacency lists?
Choosing the right container is not that easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous)
an array (with consecutive storage)
an associative array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / element to remove is known), but a look-up has a worst-case time complexity of O(N).
The look-up in an array has a worst-case complexity of O(1) if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in the best case, but the worst case might be O(N) (even if that is unlikely to happen often unless the hash table is really badly implemented).
An associative array has a worst-case look-up complexity of O(log N).
So the choice always depends on the expected use cases, to find the best compromise where the advantages pay off most while the disadvantages don't hurt too much.
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
For configuring it according to a specific use case, an entire extra page is spent: BGL – adjacency_list.
However, the defaults for the vertex (node) list and edge list are in fact vectors (i.e. dynamic arrays). Assuming that the average use case is a non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst-case O(1) look-up of arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add this as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped to the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might in fact be an associative array or a hash table.
Considering that a request like, e.g., "find the shortest route from A to B" has to map A and B to the corresponding internal IDs only once, but may need many look-ups of nodes and edges to compute the answer, the choice of arrays for the storage of nodes and edges makes a lot of sense.
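As a small illustration of that last point, here is a sketch (the class and its methods are made up for the example): the graph keeps nodes and edges in plain arrays indexed by an internal ID, and a single associative container translates public node references into those IDs.

class Graph:
    def __init__(self):
        self.ids = {}         # public node reference -> internal ID (hash map, used rarely)
        self.names = []       # internal ID -> public node reference
        self.adjacency = []   # internal ID -> list of neighbour IDs (arrays, used constantly)

    def add_node(self, name):
        self.ids[name] = len(self.names)
        self.names.append(name)
        self.adjacency.append([])
        return self.ids[name]

    def add_edge(self, a, b):
        # translate the public references once, then store cheap integer indices
        self.adjacency[self.ids[a]].append(self.ids[b])

A "shortest route from A to B" request then pays the hash look-up cost exactly twice, while every step inside the algorithm is a plain array access.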
There are very limited scenarios where we would access graph nodes by index.
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, and so on. For almost every algorithm like this, you will find that the only operation you need the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.
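To make the trade-off concrete, here is a minimal sketch of both representations (the tiny example graph is made up): neighbour iteration is cheapest and most compact with the array form, while the hash form buys O(1) average edge-existence tests.

# Array form: node IDs are 0..n-1, neighbours stored contiguously per node.
adj_list = [
    [1, 2],  # neighbours of node 0
    [2],     # neighbours of node 1
    [0],     # neighbours of node 2
]
for neighbour in adj_list[0]:      # the hot operation in DFS/BFS/Dijkstra/A*
    print(neighbour)

# Hash form: arbitrary node keys, O(1) average edge-existence test.
adj_sets = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print("c" in adj_sets["a"])        # True, without scanning a neighbour list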
In simple words: Is
{
    "diary": {
        "number": 100,
        "year": 2006
    },
    "case": {
        "number": 12345,
        "year": 2006
    }
}
or
{
    "diary_number": 100,
    "diary_year": 2006,
    "case_number": 12345,
    "case_year": 2006
}
better when using Elasticsearch?
In my case there are only a few keys in total (10-15). Which is better performance-wise?
The use case is displaying data from a NoSQL database (mostly DynamoDB), and also feeding it into Elasticsearch.
My rule of thumb: if you need to query/update nested fields, use the flat structure.
If you use the nested structure, Elasticsearch will make it flat internally but then has the overhead of managing those relations. Performance-wise, flat is always better, since Elasticsearch doesn't need to relate and find nested documents.
Here's an excerpt from Managing Relations Inside Elasticsearch which lists some disadvantages you might want to consider.
Elasticsearch is still fundamentally flat, but it manages the nested relation internally to give the appearance of nested hierarchy. When you create a nested document, Elasticsearch actually indexes two separate documents (root object and nested object), then relates the two internally. Both docs are stored in the same Lucene block on the same Shard, so read performance is still very fast.

This arrangement does come with some disadvantages. Most obvious, you can only access these nested documents using a special nested query. Another big disadvantage comes when you need to update the document, either the root or any of the objects.

Since the docs are all stored in the same Lucene block, and Lucene never allows random write access to its segments, updating one field in the nested doc will force a reindex of the entire document. This includes the root and any other nested objects, even if they were not modified. Internally, ES will mark the old document as deleted, update the field and then reindex everything into a new Lucene block. If your data changes often, nested documents can have a non-negligible overhead associated with reindexing.

Lastly, it is not possible to "cross reference" between nested documents. One nested doc cannot "see" another nested doc's properties. For example, you are not able to filter on "A.name" but facet on "B.age". You can get around this by using include_in_root, which effectively copies the nested docs into the root, but this gets you back to the problems of inner objects.
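For reference, this is roughly what the explicit nested variant from the question would involve (shown here as Python dicts; the field names are the ones from the question, and this is a sketch rather than a complete index definition): the mapping has to declare diary as nested, and searches against it have to go through the special nested query mentioned above.

# Mapping: declare "diary" as a nested field (otherwise it is indexed as a plain object field).
mapping = {
    "mappings": {
        "properties": {
            "diary": {
                "type": "nested",
                "properties": {
                    "number": {"type": "integer"},
                    "year": {"type": "integer"},
                },
            },
        }
    }
}

# Query: searching inside the nested objects requires a nested query.
query = {
    "query": {
        "nested": {
            "path": "diary",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"diary.number": 100}},
                        {"match": {"diary.year": 2006}},
                    ]
                }
            },
        }
    }
}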
Nested data is quite good. Unless you explicitly declare diary and case as nested fields, they will be indexed as object fields. So Elasticsearch will itself convert them to
{
    "diary.number": 100,
    "diary.year": 2006,
    "case.number": 12345,
    "case.year": 2006
}
Consider also that every field value in Elasticsearch can be an array. You need the nested datatype only if you have many diaries in a single document and need to "maintain the independence of each object in the array".
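To illustrate what that independence means (the second diary below is made up for the example):

# One document holding an array of diaries, indexed as a plain object field:
doc = {"diary": [{"number": 100, "year": 2006},
                 {"number": 7,   "year": 2020}]}

# Internally the array is flattened per field:
#   diary.number -> [100, 7]
#   diary.year   -> [2006, 2020]
# A query for diary.number == 100 AND diary.year == 2020 therefore matches this
# document, although no single diary has that combination. Declaring "diary" as
# nested keeps each {number, year} pair together, so that query correctly misses.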
The answer is a clear it-depends. JSON is famous for its nested structures. However, there are some tools which can only deal with key-value structures and flat JSON, and I feel Elastic is more fun with flat JSON, in particular if you use Logstash; see e.g. https://discuss.elastic.co/t/what-is-the-best-way-of-getting-mongodb-data-into-elasticsearch/40840/5
I am happy to be proven wrong.
I've read some mongo documentation but I wasn't able to find an answer to my question.
I'm developing an application where I want to store JSON documents. I've read about indexes and so on, but one question remains for me.
The data I want to store contains information that does not need to be loaded by the client as a whole. So I planned to normalize the data, split my big JSON into smaller ones, and offer them via a separate REST endpoint.
Now I was thinking about creating a different collection for each group of JSON documents.
The reason for that is that I want to reduce the search space compared to the option to store everything in one collection.
So each user will have 5 collections and I expect 1 million users.
Is this a good solution in terms of performance and scaling?
Is querying multiple collections more expensive than querying one?
Recently, while working on a project, my team and I faced this situation: we had a huge data set, and in the future it is expected to grow rapidly.
We had MongoDB in place, and as the data grew the performance started to degrade. The reason was mainly the multiple collections: we had to use a lookup to join the collections and get the data.
Interestingly, the way the two collections are mapped plays a very important role in the performance.
Our initial structure was:
Collection A {
    "_id" : ...,
    "info" : [
        // list of object id of other collection
    ]
}
The info field was used to map to the _id of documents in Collection B.
Since Mongo has _id as a unique identifier, no matter what indexes we have, it will scan all the documents of Collection B, and if B is GBs or TBs in size, it will take very long to get even one matching document.
So the change we made was:
We removed the array of object ids from Collection A and added a new field in Collection B which holds the _id of the corresponding document in Collection A.
Long story short, we reversed the mapping we had.
Then apply an index on the Collection B fields used in the query. This improved the performance a lot.
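A minimal pymongo sketch of that reversed mapping (the database, collection and field names are made up for illustration):

from pymongo import MongoClient

db = MongoClient()["mydb"]                     # hypothetical database

# Collection B documents now carry a reference ("a_id") back to their parent
# document in Collection A, and that reference field is indexed.
db.collection_b.create_index("a_id")

parent = db.collection_a.find_one()                        # some document from Collection A
children = db.collection_b.find({"a_id": parent["_id"]})   # index lookup instead of a full scan of B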
So it's not a bad idea to have multiple collections; with proper mapping between collections, MongoDB can provide excellent performance. You can also use sharding to enhance it further.
I have a web app that I've made that I now want to make WELL. It contains a large tree that I want to persist in a database. The tree will be about 50 nodes wide and 30 nodes deep. It will have frequent reads and writes to mostly single nodes, but copying/pasting subtrees is possible.
I've already implemented it using Nested Intervals. But the implementation of this that I used (Node key encoding) causes very large integers to occur when the tree gets deep.
My question is: what is the most efficient representation of hierarchical data that we know of today?
Thanks,
Marco.
I have to store messages that my web app fetches from Twitter in a local database. The purpose of storing the messages is that I need to display them in a hierarchical order, i.e. certain messages (status updates) that users input through my application are child nodes of others (I have to show them as sub-list items of a parent message). Which data model should I use: the Adjacency List Model or the Nested Set Model? I have to manage four types of messages, and messages in each category could have two child nodes. One more thing I notice is that in both cases the input is controlled manually, i.e. the reference to the parent node in the adjacency list model, or the left and right values in the nested set model, are entered by hand. My app fetches message data from Twitter like this:
foreach ($xml4->entry as $status4) {
    echo '<li>' . $status4->content . '</li>';
}
So it's not manual; any number of messages can be available at any time. How can I build a parent-child relation among the messages it fetches? At the moment, users enter messages in different windows that correspond to the four types of messages; my app adds keywords and fetches them back to display in different windows. All of those messages are currently parent messages. Now, how do I let a user enter a message that is saved in the database as the child of another?
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
If you are going to have more or less deep trees of data (starting from each root node), consider using nested sets, because the adjacency list will be slow.
When you say

depth of tree is 2 nodes, i.e. each parent msg could have two child nodes

I get confused.
If each of the two child nodes can have more children, then you are not talking about depth, but about the width of a branch of a node.
1) depth really = 2
If your max depth is really 2 (in other words, all nodes connect to the root, or zero-level, node in at most 2 steps; or, put yet another way, for each node there is no ancestor other than its parent and grandparent), then you could even use the relational model directly to store the hierarchical data (either through a self join, which is not so bad with such a low maximum depth, or by splitting the data into 3 entities: grandparents, parents and children).
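A tiny sketch of the self-join variant, using sqlite just to show the shape (the table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE message (
                    id        INTEGER PRIMARY KEY,
                    parent_id INTEGER REFERENCES message(id),  -- NULL for a root message
                    body      TEXT)""")
conn.executemany("INSERT INTO message VALUES (?, ?, ?)",
                 [(1, None, "root"), (2, 1, "child"), (3, 2, "grandchild")])

# With a maximum depth of 2, one self join per level is enough.
for child, parent in conn.execute("""SELECT c.body, p.body
                                     FROM message c
                                     JOIN message p ON c.parent_id = p.id"""):
    print(child, "is a child of", parent)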
2) depth >> 2
If the number 2 was actually the width, and the depth is variable and potentially quite deep, then look at nested sets, with two additional possibilities to explore:
using the nested set idea, you could explore a geometric type to store the hierarchical data (the benefits might not be so interesting: few useful operators, a single field, possibly a better indexing strategy)
continued fractions (based on nested sets; Tropashko offered a generalization which seemed interesting, as it promised to improve on some of the problems with nested sets; I didn't implement it, though, so do your own tests).