Database Modeling: How to categorize products like Amazon? - mysql

Assume I had a number of products (from a few thousand to hundreds of thousands) that needed to be categorized in a hierarchical manner. How would I model such a solution in a database?
Would a simple parent-child table like this work:
product_category
- id
- parent_id
- category_name
Then in my products table, I would just do this:
product
- id
- product_category_id
- name
- description
- price
My concern is that this won't scale. By the way, I'm using MySQL for now.

Of course it will scale. That will work just fine; it is a commonly used structure.
Include a level_no. That will assist in the code, but more important, it is required to exclude duplicates.
If you want a really tight structure, you need something like the Unix concept of inodes.
You may have difficulty getting your head around the code required to produce the hierarchy, say from a product, but that is a separate issue.
And please change
(product_category) id to product_category_id
(product) id to product_id
parent_id to parent_product_category_id
Responses to Comments
level_no. Have a look at this Data Model, it is for a Directory Tree structure (e.g. the FileManager Explorer window):
Directory Data Model
See if you can make sense of it, that's the Unix inode concept. The FileNames have to be unique within the Node, hence the second Index. That is actually complete, but some developers these days will have a hissy fit writing the code required to navigate the hierarchy, the levels. Those developers need a level_no to identify what level in the hierarchy they are dealing with.
Recommended changes. Yes, it is called Good Naming Conventions. I am rigid about it, and I publish it, so it is a Naming Standard. There are reasons for it, which will become clear to you when you write some SQL with 3 or 4 levels of joins, especially when you reach the same parent two different ways. If you search SO, you will find many questions about this; always the same answer. It will also be highlighted in the next model I write for you.

I used to struggle with the same problem 10 years ago. Here's my personal solution to this problem. But before I start explaining, I would like to mention its pros and cons.
Pros:
You can select the subbranches of a given node to any desired depth at very low cost.
The same can be done to select parent nodes.
No RDBMS-specific feature is needed, so the same technique can be implemented in any of them.
It is all implemented using a single field.
Cons:
You must define a maximum depth for your tree. You also need to define the maximum number of direct children for each node.
Restructuring the tree is more expensive than traversing it, but not as expensive as in the Nested Set Model. Adding a new branch is a matter of finding the right value for the field. To move a branch to a new parent, you need to update that node and all its children (direct and indirect). The good news is that deleting a node and its children is as easy as traversing it (which costs next to nothing).
The technique:
Consider the following table as your tree holder:
CREATE TABLE IF NOT EXISTS `product_category` (
`product_category_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(20) NOT NULL,
`category_code` varchar(62) NOT NULL,
PRIMARY KEY (`product_category_id`),
UNIQUE KEY `uni_category_code` (`category_code`)
) DEFAULT CHARSET=utf8 ;
All the magic is done in the category_code field. You need to encode your branch address into a text value as follows:
**node_name -> category_code**
Root -> 01
First child -> 01:01
Second child -> 01:02
First grandchild -> 01:01:01
First child of second child -> 01:02:01
In the above example, each node can have up to 99 direct children (assuming we are thinking in decimal). And since category_code is of type varchar(62), we can have up to (62-2)/3 = 20 levels below the root. It's a trade-off between the depth you want, the number of direct children each node can have, and the size of your field. Scientifically speaking, this is an implementation of a complete tree in which unused branches are not actually created but reserved.
The good parts:
Now imagine you want to select nodes under 01:02. You can do this using a single query:
SELECT *
FROM product_category
WHERE
category_code LIKE '01:02:%'
Selecting the direct children of 01:02:
SELECT *
FROM product_category
WHERE
category_code LIKE '01:02:__'
Selecting all the ancestors of 01:02:
SELECT *
FROM product_category
WHERE
'01:02' LIKE CONCAT(category_code, ':%')
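The three queries above can be tried end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (so CONCAT() is written as ||), and the sample rows, including the extra "Grandchild of second child", are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the LIKE patterns are the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_category ("
    " product_category_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " category_code TEXT NOT NULL UNIQUE)"
)
# Hypothetical sample rows following the encoding described above.
conn.executemany(
    "INSERT INTO product_category (name, category_code) VALUES (?, ?)",
    [
        ("Root", "01"),
        ("First child", "01:01"),
        ("Second child", "01:02"),
        ("First grandchild", "01:01:01"),
        ("First child of second child", "01:02:01"),
        ("Grandchild of second child", "01:02:01:01"),
    ],
)


def codes(sql, params):
    """Run a query and return the matching category codes, sorted."""
    return sorted(row[0] for row in conn.execute(sql, params))


# All descendants of 01:02, at any depth.
descendants = codes(
    "SELECT category_code FROM product_category WHERE category_code LIKE ?",
    ("01:02:%",),
)
# Direct children only: exactly one more two-character segment.
children = codes(
    "SELECT category_code FROM product_category WHERE category_code LIKE ?",
    ("01:02:__",),
)
# Ancestors: SQLite concatenates with ||, where MySQL would use CONCAT().
ancestors = codes(
    "SELECT category_code FROM product_category"
    " WHERE ? LIKE category_code || ':%'",
    ("01:02",),
)
print(descendants, children, ancestors)
```

Note that the ancestor query puts the column on the right-hand side of LIKE, so it will generally have to examine every row rather than use the unique index.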
The bad parts:
Inserting a new node into the tree is a matter of finding the right category_code. This can be done using a stored procedure or even in a programming language like PHP.
Since the tree is limited in the number of direct children and depth, an insert can fail. But I believe in most practical cases we can assume such a limitation.
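Finding the right category_code for a new node might look roughly like the following. It is a sketch in Python rather than a stored procedure or PHP; the function name next_child_code and its parameters are hypothetical, and it assumes two-digit segments separated by colons as in the examples above. It does not check the maximum depth, and in a real system the unique key on category_code would still be needed to catch two concurrent inserts picking the same code.

```python
def next_child_code(parent_code, existing_codes, max_children=99):
    """Return a free child code under parent_code, e.g. '01:02' -> '01:02:02'.

    existing_codes is any iterable of codes already stored in the table.
    The lowest unused number is chosen, so gaps left by deletions are reused.
    Raises ValueError when the parent already has max_children children.
    """
    prefix = parent_code + ":"
    used = {
        int(code[len(prefix):])
        for code in existing_codes
        # Direct children only: the prefix plus one two-digit segment.
        if code.startswith(prefix) and len(code) == len(prefix) + 2
    }
    for n in range(1, max_children + 1):
        if n not in used:
            return f"{prefix}{n:02d}"
    raise ValueError(f"{parent_code} already has {max_children} children")


existing = ["01", "01:01", "01:02", "01:01:01", "01:02:01"]
print(next_child_code("01", existing))
print(next_child_code("01:02", existing))
```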
Cheers.

Your solution uses the adjacency list model of a hierarchy. It's by far the most common. It will scale OK up to thousands of products. The problem is that it takes either a recursive query or product-specific extensions to SQL to deal with an indefinitely deep hierarchy.
There are other models of a hierarchy. In particular, there's the nested set model. The nested set model is good for retrieving the path of any node in a single query. It's also good for retrieving any desired sub tree. It's more work to keep it up to date. A lot more work.
You may want to briefly explore it before you bite off more than you want to chew.
What are you going to do with the hierarchy?

I think your big issue is that this is a deficiency in MySQL. For most RDBMSs which support WITH and WITH RECURSIVE, you should require only one scan per level. This makes deep hierarchies a bit problematic but usually not too bad.
I think to make this work well you will have to code a fairly extensive stored procedure, or you will have to go to another tree model, or you will have to move to a different RDBMS. For example this is easy to do with PostgreSQL and WITH RECURSIVE and this offers a lot better scalability than many other approaches.
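For reference, a WITH RECURSIVE traversal of the parent-child table from the question could look like the sketch below. It runs against SQLite through Python's sqlite3 module purely so that it is self-contained; PostgreSQL accepts the same syntax, and MySQL has supported recursive common table expressions since version 8.0. The category names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_category ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES product_category(id),"
    " category_name TEXT NOT NULL)"
)
# Hypothetical sample hierarchy.
conn.executemany(
    "INSERT INTO product_category VALUES (?, ?, ?)",
    [
        (1, None, "Electronics"),
        (2, 1, "Computers"),
        (3, 2, "Laptops"),
        (4, 2, "Desktops"),
        (5, 1, "Cameras"),
        (6, None, "Books"),
    ],
)

# The anchor row is the starting category; each recursive step joins the
# children of the rows found so far, one level per iteration.
SUBTREE_SQL = """
WITH RECURSIVE subtree(id, category_name, depth) AS (
    SELECT id, category_name, 0 FROM product_category WHERE id = ?
    UNION ALL
    SELECT c.id, c.category_name, s.depth + 1
    FROM product_category AS c
    JOIN subtree AS s ON c.parent_id = s.id
)
SELECT id, category_name, depth FROM subtree ORDER BY depth, id
"""


def subtree(root_id):
    """Return (id, name, depth) for root_id and all of its descendants."""
    return conn.execute(SUBTREE_SQL, (root_id,)).fetchall()


for row in subtree(1):
    print(row)
```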

Related

Storing Unbalanced Tree in Database

I'm working on a project where I need to store a tree structure in a database. I have dealt with the same scenario in the past and used a particular solution (explained below).
I know that there is no BEST solution, and usually the best solution is the one that gives the most advantages, but there is undoubtedly a worst one, and I'd like to avoid using that...
As I was saying I need to:
store an unbalanced tree structure
any node can have an "unlimited" number of children
have the ability to easily obtain all the children (recursively) of one node
have the ability to easily "rebuild" the tree structure
The solution I used in the past consisted of using a VARCHAR(X * Y) primary key where:
X is the "hypothetical" maximum level possible
Y is the number of characters in the "hypothetical" maximum number of direct children of one node...
i.e.
If I have:
- at maximum 3 levels then X = 3
- at maximum 20 direct children per node, Y = 2 (20 has two characters - it is possible then to store up to 99 children)
The PRIMARY KEY column will be created as VARCHAR(6)
The ID is a composite combination of PARENT ID + NODE_ID
NODE ID is an incremental numerical value padded with zeros on the left side.
the node in the first level will be then stored as:
[01,02,03,04,...,99]
the nodes in the second level will be stored as:
[0101, 0102, 0103, ..., 0201, 0202, 0203, ... , 9901, 9999]
the nodes in the third level will be stored as:
[010101, 010102, 010103, ..., 020101, 020102, 020301, ... , 990101, 999999]
and so on...
PROs:
It's easy to rebuild the tree
It's super easy to obtain the children list of one particular node (i.e. select ... where id like '0101%')
Only one column for both the identifier and the parent link.
CONs:
It's mandatory to define a MAX number of CHILDREN/LEVELS
If the X and Y values are large, the id key will be way too long
VARCHAR type as primary key
Changing the tree structure (moving one node from one parent to another) will be difficult (if not impossible) and time-consuming, because the ids of the node and all its children have to be re-created.
Preorder Tree Traversal
I did some research and the best solution I found to my main problems (obtaining all the children of one node, etc.), is to use the Preorder Tree Traversal solution
(for the sake of brevity I will post a link where the solution is explained: HERE )
Whilst this solution is better in almost every aspect, it has a HUGE downside: any change in the structure (add/remove/change parent of a node) needs to RECREATE the entire left/right indexes, and this operation is time and resource consuming.
Conclusion
Having said so, any suggestion is very much appreciated.
Which is for you the best solution to maximize the needs explained in the beginning?

Recursion the optimal solution in this case?

I'm getting a headache over this. I'm building a system, that can handle a number of projects, groups and file references.
Please take a look at this:
A user should be able to create an infinite number of projects, an infinite number of groups and attach an infinite number of file references - much like an ordinary PC file structure works with drive letters, folders and files.
All of the mentioned elements reside inside a MySQL database. However, I'm not sure if this (see below) is the optimal way of structuring the whole thing:
As you can see, it contains one entity called "Xrefs", containing projects and groups. The rows point back into the same table, probably making it ideal to do a recursive call when retrieving the data.
A different approach could be to create 1 entity for projects, 1 entity for groups and 1 entity for file references... as well as 1 helper entity that ties the three entities together, also containing a "parent" value that (similar to the first solution) refers to the upper-level tuples in order to create a hierarchy.
If you were to build a similar project, what would you do?
You hit one of the best-known restrictions of MySQL: the inability to use what are called recursive queries (PostgreSQL) or CTE queries (Oracle). There are some possible workarounds, but for a project with this kind of requirement you'd probably suffer a lot from many other well-known MySQL limitations. Even SQLite would be more useful (except for the one concurrent user restriction) on this matter.
DBIx::Class has some components to help you circumvent these MySQL limitations; search for Nested Trees, Ordered Trees, WITH RECURSIVE QUERY… DBIx::Class::Tree::NestedSet
You will need support for something like 7.8. WITH Queries (Common Table Expressions), which MySQL does not offer.
Your structure is fine - since you are building a tree, not a general graph, there is no need for a separate table that ties entities together. I would put projects into their own table, because they appear to stand on their own, unless you must support hierarchy among projects as well.
However, given that your RDBMS is MySQL, you would have problems building recursive queries. For example, try thinking of a query that would give you all files related to xfer_id of 1 (i.e. the project). None of the files is tied to that ID, so you need to locate your first-level groups, then your second level groups, and then tie files to them. Since your groups can be nested in any number of levels, your query would have to be recursive as well.
Although you can certainly do it, it is currently not simple, and requires writing stored procedures. A common approach for situations like that is to build the tree in memory, with some assistance from RDBMS. The trick is to store the id of the top project in each group, i.e.
xfer_id  xfer_fk  xfer_top
-------  -------  --------
1        -        1
2        1        1
3        1        1
4        3        1
5        3        1
Now a query with the condition WHERE xfer_top=... will give you all the individual "parts", which can be combined in memory without having to bring the entire table into memory.
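Combining the parts in memory could be sketched as follows, in Python. The rows are the five from the table above, written as they might come back from a query filtered on xfer_top = 1; the table name xrefs in the comment and the helper names build_children and descendants are assumptions made for the example.

```python
from collections import defaultdict

# Rows as (xfer_id, xfer_fk, xfer_top), as they might be returned by e.g.
#   SELECT xfer_id, xfer_fk, xfer_top FROM xrefs WHERE xfer_top = 1
# (the table name xrefs is hypothetical).
rows = [
    (1, None, 1),
    (2, 1, 1),
    (3, 1, 1),
    (4, 3, 1),
    (5, 3, 1),
]


def build_children(rows):
    """Map each parent id to the list of its direct child ids."""
    children = defaultdict(list)
    for xfer_id, xfer_fk, _top in rows:
        if xfer_fk is not None:
            children[xfer_fk].append(xfer_id)
    return children


def descendants(children, node_id):
    """All ids below node_id in depth-first order, with no further queries."""
    found = []
    stack = list(reversed(children.get(node_id, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found


children = build_children(rows)
print(descendants(children, 1))
print(descendants(children, 3))
```

An explicit stack is used instead of recursion so that very deep trees do not hit Python's recursion limit.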

Depth of Nested Set

I have a large MySQL table with parent-child relationships stored in the nested set model (left and right values).
It makes it EASY to find all the children of a given item.
Now, how do I find the DEPTH of a certain item?
Example of the row:
Parent_ID, Taxon_ID, Taxon_Name, lft, rgt
For some row (taxon_id), I want to know how far it is from the root node.
Now, it may be important to note that, the way I have the data structured, each terminal node (a node with no children of its own) has lft = rgt. I know many of the examples posted online have rgt = lft + 1, but we decided not to do that just for the sake of ease.
Summary:
Nested set model, need to find the depth (number of nodes to get to the root) of a given node.
I figured it out.
Essentially you have to query all the nodes that contain the node you are looking for inside. So for example, I was looking at one node that has lft=rgt=7330 and I wanted the depth of it. I just needed to
Select count(*)
from table
where lft<7330
AND rgt>7330
You may want to add 1 to the result before you use it, because it's really telling you the number of generations preceding rather than the actual level. But it works and it's fast!
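The same count can be written so that it works for any node, not only for a leaf with a known lft = rgt value, by comparing against the node's own lft and rgt. The sketch below is an assumption-laden illustration: it uses Python's sqlite3 in place of MySQL, a hypothetical table named taxon, and invented sample rows that follow the lft = rgt convention for leaves described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, taxon_name TEXT,"
    " lft INTEGER, rgt INTEGER)"
)
# Hypothetical data; leaves follow the convention lft = rgt.
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (1, "Animalia", 1, 7),
        (2, "Chordata", 2, 5),
        (3, "Mammalia", 3, 3),
        (4, "Aves", 4, 4),
        (5, "Arthropoda", 6, 6),
    ],
)


def depth(taxon_id):
    """Number of ancestors of the node (0 for the root)."""
    row = conn.execute(
        # a = candidate ancestor, n = the node whose depth is wanted.
        "SELECT COUNT(*) FROM taxon AS a"
        " JOIN taxon AS n ON a.lft < n.lft AND a.rgt > n.rgt"
        " WHERE n.taxon_id = ?",
        (taxon_id,),
    ).fetchone()
    return row[0]


for taxon_id in (1, 2, 3, 5):
    print(taxon_id, depth(taxon_id))
```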
MySQL does not support recursive queries. I believe PostgreSQL offers limited support, but it would be inefficient and messy. There is no reason, however, why you couldn't execute queries in a recursive fashion (i.e., programmatically) to arrive at the desired result.
If "how deep is this node?" is a question you need answered often, you might consider adjusting the schema of your table so that each node stores and maintains its depth. Then you can just read that value instead of calculating it with awkward recursion. (Maintenance of the depth values may become tedious if you're shuffling the table, but assuming you are doing far fewer writes than reads, this is a more efficient approach.)

modeling many to many unary relationship and 1:M unary relationship

I'm getting back into database design and I realize that I have huge gaps in my knowledge.
I have a table that contains categories. Each category can have many subcategories and each subcategory can belong to many super-categories.
I want to create a folder with a category name which will contain all the subcategories' folders (a visual object like Windows folders).
So I need to perform quick searches of the subcategories.
I wonder what are the benefits of using 1:M or M:N relationship in this case?
And how to implement each design?
I have created an ERD model which is a 1:M unary relationship. (The diagram also contains an expense table which stores all the expense values but is irrelevant in this case.)
Is this design correct?
Will a many-to-many unary relationship allow for faster searches of super-categories, and is it the best design by default?
I would prefer an answer which contains an ERD
If I understand you correctly, a single sub-category can have at most one (direct) super-category, in which case you don't need a separate table. Something like this should be enough:
Obviously, you'd need a recursive query to get the sub-categories from all levels, but it should be fairly efficient provided you put an index on PARENT_ID.
Going in the opposite direction (and getting all ancestors) would also require a recursive query. Since this would entail searching on PK (which is automatically indexed), this should be reasonably efficient as well.
For some more ideas and different performance tradeoffs, take a look at this slide-show.
In some cases the easiest way to maintain a multilevel hierarchy in a relational database is the Nested Set Model, sometimes also called "modified preorder tree traversal" (MPTT).
Basically the tree nodes store not only the parent id but also the ids of the left-most and right-most leaf:
spending_category
-----------------
parent_id int
left_id int
right_id int
name char
The major benefit from doing this is that now you are able to get an entire subtree of a node with a single query: the ids of subtree nodes are between left_id and right_id. There are many variations; others store the depth of the node in addition to or instead of the parent node id.
A drawback is that left_id and right_id have to be updated when nodes are inserted or deleted, which means this approach is useful only for trees of moderate size.
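Both the single-query subtree lookup and the update cost on insert can be seen in a small sketch. It assumes the classic nested set numbering, in which left_id and right_id are counters assigned by a depth-first walk so that a node's pair encloses the pairs of all its descendants; the parent_id column is omitted for brevity, the sample categories are invented, and Python's sqlite3 is used only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spending_category (name TEXT PRIMARY KEY,"
    " left_id INTEGER, right_id INTEGER)"
)
# Hypothetical categories numbered by a depth-first walk.
conn.executemany(
    "INSERT INTO spending_category VALUES (?, ?, ?)",
    [
        ("All", 1, 8),
        ("Food", 2, 5),
        ("Groceries", 3, 4),
        ("Transport", 6, 7),
    ],
)


def subtree(name):
    """Names of the node and everything below it, in tree order."""
    rows = conn.execute(
        "SELECT c.name FROM spending_category AS c"
        " JOIN spending_category AS p"
        "   ON c.left_id BETWEEN p.left_id AND p.right_id"
        " WHERE p.name = ? ORDER BY c.left_id",
        (name,),
    )
    return [r[0] for r in rows]


def add_leaf(parent, name):
    """Insert name as the last child of parent, shifting values to make room."""
    (right,) = conn.execute(
        "SELECT right_id FROM spending_category WHERE name = ?", (parent,)
    ).fetchone()
    with conn:  # one transaction for the shifts and the insert
        conn.execute(
            "UPDATE spending_category SET right_id = right_id + 2"
            " WHERE right_id >= ?",
            (right,),
        )
        conn.execute(
            "UPDATE spending_category SET left_id = left_id + 2"
            " WHERE left_id > ?",
            (right,),
        )
        conn.execute(
            "INSERT INTO spending_category VALUES (?, ?, ?)",
            (name, right, right + 1),
        )


print(subtree("Food"))
add_leaf("Food", "Restaurants")
print(subtree("Food"))
```

The two UPDATE statements touch every row to the right of the insertion point, which is the maintenance cost mentioned above.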
The Wikipedia article and the slideshow mentioned by Branko explain the technique better than I can. Also check out this list of resources if you want to know more about different ways of storing hierarchical data in a relational database.

Is it possible to query a tree structure table in MySQL in a single query, to any depth?

I'm thinking the answer is no, but I'd love it if anybody had any insight into how to crawl a tree structure to any depth in SQL (MySQL), but with a single query.
More specifically, given a tree structured table (id, data, data, parent_id), and one row in the table, is it possible to get all descendants (child/grandchild/etc), or for that matter all ancestors (parent/grandparent/etc) without knowing how far down or up it will go, using a single query?
Or is using some kind of recursion required, where I keep querying deeper until there are no new results?
Specifically, I'm using Ruby and Rails, but I'm guessing that's not very relevant.
Yes, this is possible. It's called Modified Preorder Tree Traversal, as best described here
Joe Celko's Trees and Hierarchies in SQL for Smarties
A working example (in PHP) is provided here
http://www.sitepoint.com/article/hierarchical-data-database/2/
Here are several resources:
http://forums.mysql.com/read.php?10,32818,32818#msg-32818
Managing Hierarchical Data in MySQL
http://lists.mysql.com/mysql/201896
Basically, you'll need to do some sort of cursor in a stored procedure or query or build an adjacency table. I'd avoid recursion outside of the db: depending on how deep your tree is, that could get really slow/sketchy.
Daniel Beardsley's answer is not that bad a solution at all when the main questions you are asking are 'what are all my children' and 'what are all my parents'.
In response to Alex Weinstein, this method actually results in fewer updates to nodes on a parent movement than in the Celko technique. In Celko's technique, if a level 2 node on the far left moves to under a level 1 node on the far right, then pretty much every node in the tree needs updating, rather than just the node's children.
What I would say however is that Daniel possibly stores the path back to root the wrong way around.
I would store them so that the query would be
SELECT * FROM table WHERE ancestors LIKE "1,2,6%"
This means that mysql can make use of an index on the 'ancestors' column, which it would not be able to do with a leading %.
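A root-first path column can be sketched like this. One detail is an assumption added for the example rather than something the answer specifies: every id in the path is followed by a comma, so that a prefix pattern for node 6 does not also match a sibling such as node 60. The table name node and the sample rows (including id 60) are made up, and Python's sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, ancestors TEXT NOT NULL)")
# Path from the root down to the node itself; each id is followed by a
# comma (an assumption of this sketch) so prefixes end on an id boundary.
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [
        (1, "1,"),
        (2, "1,2,"),
        (5, "1,2,5,"),
        (6, "1,2,6,"),
        (7, "1,2,6,7,"),
        (11, "1,2,6,11,"),
        (60, "1,2,60,"),
        (3, "1,3,"),
    ],
)


def descendant_ids(node_id):
    """Ids of all nodes whose path starts with the path of node_id."""
    (path,) = conn.execute(
        "SELECT ancestors FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        # No leading wildcard: the pattern is a plain prefix match.
        "SELECT id FROM node WHERE ancestors LIKE ? AND id <> ? ORDER BY id",
        (path + "%", node_id),
    )
    return [r[0] for r in rows]


print(descendant_ids(6))
print(descendant_ids(2))
```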
I came across this problem before and had one wacky idea. You could store a field in each record that is a concatenated string of its direct ancestors' ids all the way back to the root.
Imagine you had records like this (indentation implies hierarchy and the numbers are id, ancestors).
1, "1"
  2, "2,1"
    5, "5,2,1"
    6, "6,2,1"
      7, "7,6,2,1"
      11, "11,6,2,1"
  3, "3,1"
    8, "8,3,1"
    9, "9,3,1"
    10, "10,3,1"
Then to select the descendants of id:6, just do this
SELECT * FROM table WHERE ancestors LIKE "%6,2,1"
Keeping the ancestors column up to date might be more trouble than it's worth to you, but it's a feasible solution in any DB.
Celko's technique (nested sets) is pretty good. I also have used an adjacency table with fields "ancestor" and "descendant" and "distance" (e.g. direct children/parents have a distance of 1, grandchildren/grandparents have a distance of 2, etc).
This needs to be maintained, but is fairly easy to do for inserts: you use a transaction, then put the direct link (parent, child, distance=1) into the table, then INSERT IGNORE a SELECTion of existing parent&children by adding distances (I can pull up the SQL when I have a chance), which wants an index on each of the 3 fields for performance. Where this approach gets ugly is for deletions... you basically have to mark all the items that have been affected and then rebuild them. But an advantage of this is that it can handle arbitrary acyclic graphs, whereas the nested set model can only do straight hierarchies (e.g. each item except the root has one and only one parent).
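The insert step described above might look roughly like the following. It is a sketch, not the poster's actual SQL: the table name tree_path and the function names are hypothetical, SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE, and only the simple case of attaching a new leaf is handled (deletions and moving whole subtrees are not).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tree_path ("
    " ancestor INTEGER NOT NULL,"
    " descendant INTEGER NOT NULL,"
    " distance INTEGER NOT NULL,"
    " PRIMARY KEY (ancestor, descendant))"
)


def add_child(parent, child):
    """Link a new leaf to parent and to every ancestor of parent."""
    with conn:  # both inserts run in one transaction
        # The direct link.
        conn.execute("INSERT INTO tree_path VALUES (?, ?, 1)", (parent, child))
        # One extra row per existing ancestor of parent, one step further away.
        conn.execute(
            "INSERT OR IGNORE INTO tree_path (ancestor, descendant, distance)"
            " SELECT ancestor, ?, distance + 1 FROM tree_path"
            " WHERE descendant = ?",
            (child, parent),
        )


def descendants(node):
    """(descendant, distance) pairs below node, nearest first."""
    rows = conn.execute(
        "SELECT descendant, distance FROM tree_path"
        " WHERE ancestor = ? ORDER BY distance, descendant",
        (node,),
    )
    return rows.fetchall()


add_child(1, 2)
add_child(1, 3)
add_child(2, 4)
add_child(4, 5)
print(descendants(1))
```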
SQL isn't a Turing Complete language, which means you're not going to be able to perform this sort of looping. You can do some very clever things with SQL and tree structures, but I can't think of a way to describe a row which has a certain id "in its hierarchy" for a hierarchy of arbitrary depth.
Your best bet is something along the lines of what @Dan suggested, which is to just work your way through the tree in some other, more capable language. You can actually generate a query string in a general-purpose language using a loop, where the query is just some convoluted series of joins (or sub-queries) which reflects the depth of the hierarchy you are looking for. That would be more efficient than looping and multiple queries.
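Generating such a query in a loop could be sketched as below, assuming a table with id and parent_id columns and a known maximum depth. The function name descendants_sql and the table name node are hypothetical; the table name is interpolated into the SQL text, so it must come from trusted code, while the root id is passed as a bound parameter. Python's sqlite3 is used to run the result.

```python
import sqlite3


def descendants_sql(max_depth, table="node"):
    """Build one UNION ALL query returning descendants down to max_depth levels."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    parts = []
    for depth in range(1, max_depth + 1):
        # One self-join per level between the root (t0) and level `depth`.
        joins = "".join(
            f" JOIN {table} AS t{i} ON t{i}.parent_id = t{i - 1}.id"
            for i in range(1, depth + 1)
        )
        parts.append(
            f"SELECT t{depth}.id AS id, {depth} AS depth"
            f" FROM {table} AS t0{joins} WHERE t0.id = :root"
        )
    return " UNION ALL ".join(parts) + " ORDER BY depth, id"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER)")
# Hypothetical tree: 1 -> (2, 3), 2 -> 4, 4 -> 5.
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 4)],
)
rows = conn.execute(descendants_sql(2), {"root": 1}).fetchall()
print(rows)
```

The query grows with the requested depth, so this only suits hierarchies whose depth is bounded and fairly small.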
This can definitely be done and it isn't that complicated for SQL. I've answered this question and provided a working example using mysql procedural code here:
MySQL: How to find leaves in specific node
Booth: If you are satisfied, you should mark one of the answers as accepted.
I used the "With Emulator" routine described in https://stackoverflow.com/questions/27013093/recursive-query-emulation-in-mysql (provided by https://stackoverflow.com/users/1726419/yossico). So far, I've gotten very good results (performance-wise), but I don't have an abundance of data or a large number of descendants to search through/for.
You're almost definitely going to want to employ some recursion for that. And if you're doing that, then it would be trivial (in fact easier) to get the entire tree rather than bits of it to a fixed depth.
In really rough pseudo-code you'll want something along these lines:
getChildren(parent){
children = query(SELECT * FROM table WHERE parent_id = parent.id)
return children
}
printTree(root){
print root
children = getChildren(root)
for child in children {
printTree(child)
}
}
Although in practice you'd rarely want to do something like this. It will be rather inefficient since it's making one request for every row in the table, so it'll only be sensible for either small tables, or trees that aren't nested too deeply. To be honest, in either case you probably want to limit the depth.
However, given the popularity of these kinds of data structure, there may very well be some MySQL stuff to help you with this, specifically to cut down on the numbers of queries you need to make.
Edit: Having thought about it, it makes very little sense to make all these queries. If you're reading the entire table anyway, then you can just slurp the whole thing into RAM - assuming it's small enough!