Storing Unbalanced Tree in Database - mysql

I'm working on a project where I need to store a tree structure in a database. I have dealt with the same scenario in the past, using a particular solution (explained below).
I know that there is no BEST solution, and that usually the best solution is the one that offers the most advantages, but there is undoubtedly a worst one, and I'd like not to use that...
As I was saying, I need to:
- store an unbalanced tree structure
- allow any node to have an "unlimited" number of children
- easily obtain all the children (recursively) of one node
- easily "rebuild" the tree structure
The solution I used in the past consisted of using a VARCHAR(X * Y) primary key where:
X is the "hypothetical" maximum number of levels
Y is the number of digits in the "hypothetical" maximum number of direct children of one node...
i.e.
If I have:
- at maximum 3 levels then X = 3
- at maximum 20 direct children per node, Y = 2 (20 has two characters - it is possible then to store up to 99 children)
The PRIMARY KEY column will be created as VARCHAR(6)
The ID is the concatenation of PARENT_ID + NODE_ID.
NODE_ID is an incremental numerical value, left-padded with zeros.
The nodes in the first level will then be stored as:
[01,02,03,04,...,99]
the nodes in the second level will be stored as:
[0101, 0102, 0103, ..., 0201, 0202, 0203, ... , 9901, 9999]
the nodes in the third level will be stored as:
[010101, 010102, 010103, ..., 020101, 020102, 020301, ... , 990101, 999999]
and so on...
PROs:
It's easy to rebuild the tree
It's super easy to obtain the list of children of one particular node (i.e. select ... where id like '0101%')
Only one column for both the identifier and the parent link.
CONs:
It's mandatory to define a MAX number of CHILDREN/LEVELS
If the X and Y values are large, the id key will be far too long
VARCHAR type as primary key
Changing the tree structure (moving one node from one parent to another) will be difficult (if not impossible) and expensive, because the ids of the node and all its children have to be re-created.
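The last PRO and the last CON above can be illustrated with a minimal sketch. It assumes a hypothetical table `node` with the padded ID as primary key (X = 3, Y = 2), and uses Python's built-in sqlite3 as a stand-in for MySQL; `move_subtree` is an illustrative helper, not part of the original solution.

```python
import sqlite3

def move_subtree(conn, old_prefix, new_prefix):
    """Re-parent a subtree by rewriting the ID prefix of the node and of
    every descendant. Returns the number of rows that had to be rewritten.
    Assumes new_prefix is a free slot under the new parent."""
    cur = conn.execute(
        "UPDATE node SET id = ? || substr(id, ?) WHERE id LIKE ? || '%'",
        (new_prefix, len(old_prefix) + 1, old_prefix),
    )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id VARCHAR(6) PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?)", [
    ("01", "A"), ("02", "B"),
    ("0101", "A1"), ("010101", "A1a"), ("010102", "A1b"),
])

# Prefix match: node 0101 together with everything below it.
descendants = [r[0] for r in conn.execute(
    "SELECT id FROM node WHERE id LIKE '0101%' ORDER BY id")]

# Moving 0101 under 02 (as 0201) rewrites the whole subtree.
rows_touched = move_subtree(conn, "0101", "0201")
all_ids = [r[0] for r in conn.execute("SELECT id FROM node ORDER BY id")]
```

Reading a subtree is a single prefix match, while moving it costs one primary-key rewrite per node in the subtree.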
Preorder Tree Traversal
I did some research, and the best solution I found to my main problems (obtaining all the children of one node, etc.) is the Preorder Tree Traversal solution
(for the sake of brevity I will post a link where the solution is explained: HERE)
Whilst this solution is better in almost every aspect, it has a HUGE downside: any change in the structure (adding, removing, or changing the parent of a node) requires RECREATING the entire left/right indexes, and this operation is time- and resource-consuming.
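To give a rough idea of where that cost comes from, here is a minimal sketch of an insert under the common nested set numbering (leaves have rgt = lft + 1). The table `tree` and the helper `insert_child` are assumptions made for illustration, and sqlite3 stands in for MySQL. Strictly speaking an insert only shifts the values at or after the insertion point, but as the example shows that can be every row in the table.

```python
import sqlite3

def insert_child(conn, parent_id, name):
    """Insert `name` as the last child of `parent_id`.
    Returns how many existing rows had their lft/rgt values rewritten."""
    (parent_rgt,) = conn.execute(
        "SELECT rgt FROM tree WHERE id = ?", (parent_id,)).fetchone()
    touched = conn.execute(
        "SELECT COUNT(*) FROM tree WHERE rgt >= ? OR lft > ?",
        (parent_rgt, parent_rgt)).fetchone()[0]
    # Open a gap of width 2 at the parent's right edge.
    conn.execute("UPDATE tree SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,))
    conn.execute("UPDATE tree SET lft = lft + 2 WHERE lft > ?", (parent_rgt,))
    conn.execute("INSERT INTO tree (name, lft, rgt) VALUES (?, ?, ?)",
                 (name, parent_rgt, parent_rgt + 1))
    return touched

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tree (id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany("INSERT INTO tree (name, lft, rgt) VALUES (?, ?, ?)", [
    ("root", 1, 8), ("A", 2, 3), ("B", 4, 5), ("C", 6, 7),
])

touched = insert_child(conn, 2, "A1")   # new leaf under A (id 2)
rows = conn.execute("SELECT name, lft, rgt FROM tree ORDER BY lft").fetchall()
```

Adding one leaf under the first child rewrites all four existing rows here, which is why write-heavy trees tend to suffer under this model.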
Conclusion
Having said that, any suggestion is very much appreciated.
Which solution do you think best meets the needs explained at the beginning?

Related

Depth of Nested Set

I have a large MySQL table with parent-child relationships stored in the nested set model (left and right values).
It makes it EASY to find all the children of a given item.
Now, how do I find the DEPTH of a certain item?
Example of the row:
Parent_ID, Taxon_ID, Taxon_Name, lft, rgt
For some row (taxon_id) I want to know how far it is from the root node.
It may be important to note that, in the way I have the data structured, each terminal node (a node with no children of its own) has lft = rgt. I know many of the examples posted online have rgt = lft + 1, but we decided not to do that, just for the sake of ease.
Summary:
Nested set model, need to find the depth (number of nodes to get to the root) of a given node.
I figured it out.
Essentially you have to query all the nodes that contain the node you are looking for. So, for example, I was looking at one node that has lft=rgt=7330 and I wanted its depth. I just needed to run:
Select count(*)
from table
where lft<7330
AND rgt>7330
You may want to add 1 to the result before you use it, because it's really telling you the number of generations preceding the node rather than the actual level. But it works and it's fast!
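A minimal runnable sketch of that count, using the asker's convention that leaves have lft = rgt. The sample rows and the helper `depth_of` are made up for illustration, and sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, taxon_name TEXT, "
    "lft INTEGER, rgt INTEGER)")
conn.executemany("INSERT INTO taxon (taxon_name, lft, rgt) VALUES (?, ?, ?)", [
    ("root", 1, 7),
    ("A", 2, 5), ("a1", 3, 3), ("a2", 4, 4),   # leaves: lft = rgt
    ("B", 6, 6),
])

def depth_of(conn, name):
    """Count the ancestors of a node: every row whose interval strictly
    encloses the node's own (lft, rgt) interval."""
    return conn.execute(
        "SELECT COUNT(*) FROM taxon AS node "
        "JOIN taxon AS anc ON anc.lft < node.lft AND anc.rgt > node.rgt "
        "WHERE node.taxon_name = ?", (name,)).fetchone()[0]

depths = {name: depth_of(conn, name) for name in ("root", "A", "a1", "B")}
```

The self-join only saves looking up the node's lft and rgt by hand; the condition is the same as in the query above.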
MySQL does not support recursive queries (this was true before version 8.0, which added WITH RECURSIVE). I believe PostgreSQL offers limited support, but it would be inefficient and messy. There is no reason, however, why you couldn't execute queries in a recursive fashion (i.e., programmatically) to arrive at the desired result.
If "how deep is this node?" is a question you need answered often, you might consider adjusting the schema of your table so that each node stores and maintains its depth. Then you can just read that value instead of calculating it with awkward recursion. (Maintenance of the depth values may become tedious if you're shuffling the table, but assuming you are doing far fewer writes than reads, this is a more efficient approach.)

modeling many to many unary relationship and 1:M unary relationship

I'm getting back into database design and I realize that I have huge gaps in my knowledge.
I have a table that contains categories. Each category can have many subcategories and each subcategory can belong to many super-categories.
I want to create a folder with a category name which will contain all the subcategory folders (a visual object like Windows folders).
So I need to perform quick searches of the subcategories.
I wonder what are the benefits of using 1:M or M:N relationship in this case?
And how to implement each design?
I have created an ERD model which is a 1:M unary relationship (the diagram also contains an expense table which stores all the expense values, but it is irrelevant in this case).
Is this design correct?
Will a many-to-many unary relationship allow for faster searches of super-categories, and is it the best design by default?
I would prefer an answer which contains an ERD.
If I understand you correctly, a single sub-category can have at most one (direct) super-category, in which case you don't need a separate table. Something like this should be enough:
Obviously, you'd need a recursive query to get the sub-categories from all levels, but it should be fairly efficient provided you put an index on PARENT_ID.
Going in the opposite direction (and getting all ancestors) would also require a recursive query. Since this would entail searching on PK (which is automatically indexed), this should be reasonably efficient as well.
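As a sketch of the two recursive queries mentioned above, the following uses a hypothetical `category` table with an indexed `parent_id`. It runs on sqlite3 here; the same WITH RECURSIVE syntax should be accepted by PostgreSQL and by MySQL 8.0 or later. The table, column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES category(category_id), name TEXT)")
conn.execute("CREATE INDEX idx_category_parent ON category(parent_id)")
conn.executemany("INSERT INTO category VALUES (?, ?, ?)", [
    (1, None, "Home"), (2, 1, "Utilities"), (3, 1, "Food"),
    (4, 2, "Electricity"), (5, 4, "Night tariff"),
])

# All sub-categories of a node, at every level (uses the parent_id index).
DESCENDANTS = """
WITH RECURSIVE sub(category_id, name) AS (
    SELECT category_id, name FROM category WHERE parent_id = ?
    UNION ALL
    SELECT c.category_id, c.name
    FROM category AS c JOIN sub ON c.parent_id = sub.category_id
)
SELECT name FROM sub ORDER BY category_id
"""

# All ancestors of a node, nearest first (each step is a primary-key lookup).
ANCESTORS = """
WITH RECURSIVE anc(category_id, parent_id, name, hops) AS (
    SELECT p.category_id, p.parent_id, p.name, 1
    FROM category AS c JOIN category AS p ON p.category_id = c.parent_id
    WHERE c.category_id = ?
    UNION ALL
    SELECT p.category_id, p.parent_id, p.name, anc.hops + 1
    FROM category AS p JOIN anc ON p.category_id = anc.parent_id
)
SELECT name FROM anc ORDER BY hops
"""

below_utilities = [r[0] for r in conn.execute(DESCENDANTS, (2,))]
above_night_tariff = [r[0] for r in conn.execute(ANCESTORS, (5,))]
```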
For some more ideas and different performance tradeoffs, take a look at this slide-show.
In some cases the easiest way to maintain a multilevel hierarchy in a relational database is the Nested Set Model, sometimes also called "modified preorder tree traversal" (MPTT).
Basically the tree nodes store not only the parent id but also the ids of the left-most and right-most leaf:
spending_category
-----------------
parent_id int
left_id int
right_id int
name char
The major benefit from doing this is that now you are able to get an entire subtree of a node with a single query: the ids of subtree nodes are between left_id and right_id. There are many variations; others store the depth of the node in addition to or instead of the parent node id.
A drawback is that left_id and right_id have to be updated when nodes are inserted or deleted, which means this approach is useful only for trees of moderate size.
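A minimal sketch of the single-query subtree lookup, assuming the common numbering in which every node gets a left and a right counter from a depth-first walk (a variation of the layout described above). The sample rows are invented and sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spending_category (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "left_id INTEGER, right_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO spending_category (parent_id, left_id, right_id, name) "
    "VALUES (?, ?, ?, ?)", [
        (None, 1, 10, "All"),
        (1, 2, 7, "Home"), (2, 3, 4, "Rent"), (2, 5, 6, "Power"),
        (1, 8, 9, "Food"),
    ])

def subtree(conn, name):
    """Return the node and all of its descendants with one query:
    their left_id values fall inside the node's [left_id, right_id] range."""
    return [r[0] for r in conn.execute(
        "SELECT child.name FROM spending_category AS node "
        "JOIN spending_category AS child "
        "  ON child.left_id BETWEEN node.left_id AND node.right_id "
        "WHERE node.name = ? ORDER BY child.left_id", (name,))]

home_subtree = subtree(conn, "Home")
```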
The Wikipedia article and the slideshow mentioned by Branko explain the technique better than I can. Also check out this list of resources if you want to know more about different ways of storing hierarchical data in a relational database.

Database Modeling: How to categorize products like Amazon?

Assume I had a number of products (from a few thousand to hundreds of thousands) that needed to be categorized in a hierarchical manner. How would I model such a solution in a database?
Would a simple parent-child table like this work:
product_category
- id
- parent_id
- category_name
Then in my products table, I would just do this:
product
- id
- product_category_id
- name
- description
- price
My concern is that this won't scale. By the way, I'm using MySQL for now.
Of course it will scale. That will work just fine; it is a commonly used structure.
Include a level_no. That will assist in the code, but more important, it is required to exclude duplicates.
If you want a really tight structure, you need something like the Unix concept of inodes.
You may have difficulty getting your head around the code required to produce the hierarchy, say from a product, but that is a separate issue.
And please change:
(product_category) id to product_category_id
(product) id to product_id
parent_id to parent_product_category_id
Responses to Comments
level_no. Have a look at this Data Model, it is for a Directory Tree structure (e.g. the FileManager Explorer window):
Directory Data Model
See if you can make sense of it, that's the Unix inode concept. The FileNames have to be unique within the Node, hence the second Index. That is actually complete, but some developers these days will have a hissy fit writing the code required to navigate the hierarchy, the levels. Those developers need a level_no to identify what level in the hierarchy they are dealing with.
Recommended changes. Yes, it is called Good Naming Conventions. I am rigid about it, and I publish it, so it is a Naming Standard. There are reasons for it, which will become clear to you when you write some SQL with 3 or 4 levels of joins, especially when you go to the same parent two different ways. If you search SO, you will find many questions about this; always the same answer. It will also be highlighted in the next model I write for you.
I used to struggle with the same problem 10 years ago. Here's my personal solution to this problem. But before I start explaining, I would like to mention its pros and cons.
Pros:
You can select subbranches of a given node within any number of desired depths, with the lowest imaginable cost.
The same can be done to select parent nodes.
No RDBMS specific feature is needed. So the same technique can be implemented in any of them.
It is all implemented using a single field.
Cons:
You should be able to define a maximum depth for your tree. You also need to define the maximum number of direct children for the nodes.
Restructuring the tree is more expensive than traversing it, but not as expensive as in the Nested Set Model. Adding a new branch is a matter of finding the right value for the field. And in order to move a branch to a new parent you need to update that node and all its children (direct and indirect). The good news is that deleting a node and its children is as easy as traversing it (which costs next to nothing).
The technique:
Consider the following table as your tree holder:
CREATE TABLE IF NOT EXISTS `product_category` (
`product_category_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(20) NOT NULL,
`category_code` varchar(62) NOT NULL,
PRIMARY KEY (`product_category_id`),
UNIQUE KEY `uni_category_code` (`category_code`)
) DEFAULT CHARSET=utf8 ;
All the magic is done in the category_code field. You need to encode your branch address into a text value as follows:
**node_name -> category_code**
Root -> 01
First child -> 01:01
Second child -> 01:02
First grandchild -> 01:01:01
First child of second child -> 01:02:01
In the above example, each node can have up to 99 direct children (assuming we are thinking in decimal). And since category_code is of type varchar(62), where the root takes 2 characters and every further level takes 3 (a colon plus two digits), we can have up to (62-2)/3 = 20 levels below the root. It's a trade-off between the depth you want, the number of direct children each node can have, and the size of your field. Scientifically speaking, this is an implementation of a complete tree in which unused branches are not actually created but reserved.
The good parts:
Now imagine you want to select nodes under 01:02. You can do this using a single query:
SELECT *
FROM product_category
WHERE
category_code LIKE '01:02:%'
Selecting the direct children of 01:02:
SELECT *
FROM product_category
WHERE
category_code LIKE '01:02:__'
Selecting all the ancestors of 01:02:
SELECT *
FROM product_category
WHERE
'01:02' LIKE CONCAT(category_code, ':%')
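The three queries above can be tried end to end with the following sketch. It uses sqlite3 rather than MySQL, so the `||` operator replaces CONCAT; the extra row `01:02:01:01` and the helper names are assumptions added so that the difference between direct children and all descendants is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_category (product_category_id INTEGER PRIMARY KEY, "
    "name VARCHAR(20) NOT NULL, category_code VARCHAR(62) NOT NULL UNIQUE)")
conn.executemany(
    "INSERT INTO product_category (name, category_code) VALUES (?, ?)", [
        ("Root", "01"),
        ("First child", "01:01"),
        ("Second child", "01:02"),
        ("First grandchild", "01:01:01"),
        ("First child of second child", "01:02:01"),
        ("Great-grandchild", "01:02:01:01"),   # extra row for illustration
    ])

def codes(sql, param):
    return [r[0] for r in conn.execute(sql, (param,))]

# Every node below the given code, at any depth.
descendants = codes(
    "SELECT category_code FROM product_category "
    "WHERE category_code LIKE ? || ':%' ORDER BY category_code", "01:02")

# Only the next level down: '__' matches exactly two characters.
children = codes(
    "SELECT category_code FROM product_category "
    "WHERE category_code LIKE ? || ':__' ORDER BY category_code", "01:02")

# Every node whose code, followed by ':', is a prefix of the given code.
ancestors = codes(
    "SELECT category_code FROM product_category "
    "WHERE ? LIKE category_code || ':%' ORDER BY category_code", "01:02:01")
```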
The bad parts:
Inserting a new node into the tree is a matter of finding the right category_code. This can be done using a stored procedure or even in a programming language like PHP.
Since the tree is limited in the number of direct children and depth, an insert can fail. But I believe in most practical cases we can assume such a limitation.
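One possible way of finding the next free code, sketched in Python rather than as a stored procedure. The function `next_child_code` and its constants are hypothetical; it simply takes the highest existing child number and adds one, and it fails in the two situations described above.

```python
WIDTH = 2      # digits per level, so at most 99 direct children per node
MAX_LEN = 62   # size of the category_code column

def next_child_code(existing_codes, parent_code):
    """Return the category_code for a new child of `parent_code`.
    Raises ValueError when the child or depth limit is exhausted."""
    prefix = parent_code + ":"
    # Direct children are exactly one level (WIDTH digits) longer than the prefix.
    used = [int(code[len(prefix):]) for code in existing_codes
            if code.startswith(prefix) and len(code) == len(prefix) + WIDTH]
    number = max(used, default=0) + 1
    if number > 10 ** WIDTH - 1:
        raise ValueError("parent already has the maximum number of children")
    code = f"{prefix}{number:0{WIDTH}d}"
    if len(code) > MAX_LEN:
        raise ValueError("maximum depth reached")
    return code

existing = ["01", "01:01", "01:02", "01:01:01", "01:02:01"]
new_under_root = next_child_code(existing, "01")
new_under_second = next_child_code(existing, "01:02")
first_under_leaf = next_child_code(existing, "01:01:01")
```

This version never reuses numbers freed by deleted nodes. In a real application the lookup and the insert would also need to happen in one transaction, with the unique key on category_code as the last line of defence against two concurrent inserts picking the same code.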
Cheers.
Your solution uses the adjacency list model of a hierarchy. It's by far the most common. It will scale OK up to thousands of products. The problem is that it takes either a recursive query or product-specific extensions to SQL to deal with an indefinitely deep hierarchy.
There are other models of a hierarchy. In particular, there's the nested set model. The nested set model is good for retrieving the path of any node in a single query. It's also good for retrieving any desired sub tree. It's more work to keep it up to date. A lot more work.
You may want to briefly explore it before you bite off more than you want to chew.
What are you going to do with the hierarchy?
I think your big issue is that this is a deficiency in MySQL (at least before version 8.0, which added WITH RECURSIVE). For most RDBMSs which support WITH and WITH RECURSIVE, you should require only one scan per level. This makes deep hierarchies a bit problematic but usually not too bad.
I think to make this work well you will have to code a fairly extensive stored procedure, or you will have to go to another tree model, or you will have to move to a different RDBMS. For example this is easy to do with PostgreSQL and WITH RECURSIVE and this offers a lot better scalability than many other approaches.

Adjacency List Model OR Nested Set Model, which data model should I use to store my hierarchical data?

I have to store messages that my web app fetches from Twitter in a local database. The purpose of storing the messages is that I need to display them in hierarchical order, i.e. certain messages (status updates) that users enter through my application are child nodes of others (I have to show them as sub-list items of the parent message). Which data model should I use: the Adjacency List Model or the Nested Set Model? I have to manage four types of messages, and the messages in each category can have two child nodes. One more question: from what I can see, in both cases the input is controlled manually, that is, the reference to the parent node in the adjacency model, or the left and right values in the nested set, are given by hand. My app fetches message data from Twitter like this:
foreach ($xml4->entry as $status4) {
echo'<li>'.$status4->content.'</li>';
}
So it's not manual; any number of messages can be available at any time. How could I create a parent-child relation among the messages from it? At the moment, users enter messages in different windows that correspond to the four types of messages; my app adds keywords and fetches those back to display in different windows. All those messages are currently parent messages. Now, how do I let a user enter a message that is saved into the database as a child of another?
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
If you are going to have more or less deep trees of data (starting from each root node) consider using nested set, because AL will be slow.
When you say
depth of tree is 2 nodes. i.e. each
parent msg could have two child nodes.
I get confused.
If each of the two child nodes can have more children, then you are not talking about depth, but about the width of a branch of a node.
1) depth really = 2
If your max depth is really 2 (in other words, all nodes connect to the root, or zero-level nodes, in at most 2 steps; in yet other words, no node has any ancestor other than a parent and a grandparent), then you could even use the relational model directly to store the hierarchical data (either through a self join, which is not so bad with such a low maximum depth, or by splitting the data into 3 entities: grandparents, parents and children).
2) depth >> 2
If the number 2 was the width, and the depth is variable and potentially quite deep, then look at nested sets, with two additional possibilities to explore:
using the nested set idea, you could explore the geom type to store hierarchical data (the benefits might not be so interesting: few useful operators, single field, possibly better indexing strategy)
continued fractions (based on nested sets; Tropashko offered a generalization which seemed interesting, as it promised to improve on some of the problems with nested sets; I didn't implement it though, so... do your own tests).

Can a binary tree or tree be always represented in a Database as 1 table and self-referencing?

I hadn't noticed this rule before, but it seems that a binary tree, or any tree (each node can have many children, but children cannot point back to any parent), can be represented as 1 table in a database, with each row having an ID for itself and a parentID that points back to the parent node.
That is in fact the classical Employee - Manager diagram: one boss can have many people under him... and each person can have n people under him, etc. This is a tree structure and is represented in database books as a common example as a single table Employee.
The answer to your question is 'yes'.
Simon's warning about your trees becoming a cyclic graph is correct too.
All the stuff that has been said about "You have to ensure by hand that this won't happen, i.e. the DBMS won't do that for you automatically, because you will not break any integrity or reference rules." is WRONG.
That remark and the corresponding comments hold true only as long as you consider SQL systems alone.
There exist systems which CAN do this for you in a pure declarative way, that is without you having to write *any* code whatsoever. That system is SIRA_PRISE (http://shark.armchair.mb.ca/~erwin).
Yes, you can represent hierarchical structures by self-referencing the table. Just be aware of situations such as:
Employee Supervisor
1 2
2 1
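On an SQL system, one way of guarding against such a loop is to walk up the supervisor chain before writing the new link. The sketch below works on a plain dictionary that mirrors the table above; the function name and the dictionary layout are assumptions for illustration.

```python
def would_create_cycle(supervisor_of, employee, new_supervisor):
    """Return True if making `new_supervisor` the supervisor of `employee`
    would make `employee` its own (direct or indirect) supervisor.
    `supervisor_of` maps employee id -> supervisor id; top-level employees
    are simply absent from the mapping."""
    seen = set()
    current = new_supervisor
    while current is not None:
        if current == employee:
            return True
        if current in seen:
            # The existing data already contains a loop; refuse the change.
            return True
        seen.add(current)
        current = supervisor_of.get(current)
    return False

supervisor_of = {1: 2}   # employee 1 reports to employee 2

closes_the_loop = would_create_cycle(supervisor_of, 2, 1)   # the "2 1" row above
harmless = would_create_cycle(supervisor_of, 3, 1)          # 3 reports to 1
```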
Yes, that is correct. Here's a good reference
Just be aware that you generally need a loop in order to unroll the tree (e.g. find transitive relationships)