Can a binary tree or tree be always represented in a Database as 1 table and self-referencing? - mysql

I didn't feel this rule before, but it seems that a binary tree or any tree (each node can have many children but children cannot point back to any parent), then this data structure can be represented as 1 table in a database, with each row having an ID for itself and a parentID that points back to the parent node.
That is in fact the classical Employee - Manager diagram: one boss can have many people under him... and each person can have n people under him, etc. This is a tree structure and is represented in database books as a common example as a single table Employee.

The answer to your question is 'yes'.
Simon's warning about your trees becoming a cyclic graph is correct too.
All the stuff that has been said about "You have to ensure by hand that this won't happen, i.e. the DBMS won't do that for you automatically, because you will not break any integrity or reference rules.", is WRONG.
This remark and the coresponding comments holds true, as long as you only consider SQL systems.
There exist systems which CAN do this for you in a pure declarative way, that is without you having to write *any* code whatsoever. That system is SIRA_PRISE (http://shark.armchair.mb.ca/~erwin).

Yes, you can represent hierarchical structures by self-referencing the table. Just be aware of such situations:
Employee Supervisor
1 2
2 1

Yes, that is correct. Here's a good reference
Just be aware that you generally need a loop in order to unroll the tree (e.g. find transitive relationships)

Related

database design: X:X to 1:many vs X:X to 0:many

(edited 1/5 10:22hr. added some explanation about my notation. And added some additional information I received)
I am doing a course on database design and currently we're doing ERD's and designing db's in MySQL worksbench. Think 1st, 2nd and 3rd NF, creating schema, tables, constraints, etc.
Most of it is pretty clear to me.
However there's one aspect where things remain unclear: the X:X to 1:many relationship vs the X:X to 0:many relationship (meaning: whatever to 0:many, vs whatever to 1:many, etc).
In some cases it's obvious, in others not so much. Whenever it's unclear to me, it's mostly something like this:
Example :
an artist has 1 to many paintings. A painting has 1-and only 1 artist.
Relationship:
|artist| 1:1 -------- 1:many |painting|
the same in another notation
|artist| ||------------ 1< |painting|
This seems fair, but....Then there's the thought: I could be a new artist, not having produced a painting yet.
Or: I could be entering a new artist into a artist table, not yet having entered his paintings yet (which could lead to a practical issue).
Another example:
A workshop has 1 to many participants. A participant enters 0-to-many workshops.
Relationship:
|workshop| many:0 ------- 1:many |participant|
Okay. However: a workshop could have 0 participants (no one want to participate, probably leading to cancellation).
Or: I could be entering a new workshop into a table, not having added any participants yet.
Another example:
An event is held at 1 only 1 location. A location had 1 to many events.
Relationship: |event| many:1 -------- 1:1 |location|
However, maybe you're entering a new (future) location, and there have not been events there yet.
Long shorty short: I am having a hard time establishing the minimal cardinality in cases like above.
Also, when I'm designing a db and get Workbench to forward engineer the SQL for creating the tables (based on my ERD), there doesn't seem to be any difference between a X to 1/many vs a X to 0/many variant. Which makes me wonder: what's the actual (practical) effect or implication of doing one or the other? Maybe the implications (further down the road) make for an easier choice?
Can anyone explain this matter in a simple (fool-proof) way?
Looking forward to understanding!
Addition 1/5:
I've talked about my question/issue with a teacher. He agreed with me that certain minimum cardinalities could lead to a deadlock:
one table cannot be inserted without there being a occurence in the other, and vice versa.
He explained to me that the ERD diagram is a logical model, not perse a fysical model. In other words, the ERD's minimum cardinality is not neccessarily for technical implementation.
Well, if that is the case, I understand his point. Usually an artist has at least one painting. A workshop normally has at least one participant. A location usually has at least one event. So on a logical level, that seems fine.
On a technical/implementation level, it is another deal. You should be able to enter a artist, workshop or location without there already being occurrences in another table.
My question now is:
is this true? Is a ERD a logical model, not a technical model?
and if that is so, WHAT is the reason for adding the minimum cardinality? It seems of little use.
Let's continue with your artist::paintings relationship, I think that's the clearest. When we say 1 to many, we often mean 0/1 to 0/many, but not all the 4 permutations work or are meaningful.
How does "I can have no artists with no paintings" sound? That's the zero-to-zero permutation. It does not sound wrong, but it's a degenerate case that is of no use to us.
1 artist to 0 paintings is OK, as matbailie described, and presents no problems.
1 artist to many paintings is OK, and is the main use case.
0 artists to many paintings is just not correct and should not be supported in the model. A painting must have an artist.
Because cases 1 & 4 don't work, it is not really correct to say 0/1 to 0/many. It is more correct to characterize the cardinality as 1 to 0/many, which encompasses cases 2 & 3 above.
You might say, 'but there are cases where we don't know the artist so we should leave an opportunity to have paintings with no artists'. This statement feels to me like you are leaving the realm of the ERD and entering into physical design. From a design standpoint you could just as easily say there is an artist we just don't know who, so let's create an UNKNOWN ARTIST record and connect those paintings to it.
Your second example (workshops::participants) looks more like a 0/many to 0/many. If you run through the same 4 permutations, they all look credible, though the 0-to-0 case still seems kind of ludicrous.
Your last example is another 1 to 0/many because events without locations cannot be held. When you get to the physical level you can talk about the best way to handle virtual events.
So, none of your examples seem to show a true zero-to-many (which is more accurately stated 0/1 to 0/many). I'm thinking they are pretty rare, if they exist at all. It would have to be associated with some kind of optional activity where if you did enroll in it there were a constrained set of choices.
But... A participant can be in 0-N workshops. So that needs to be many-to-many.
In general "zero" is a degenerate case of "1" or "N"; it is rarely worth worrying about.
I like to start by identifying the "entities" in the model. Your participant, workshop, painting, event, artist, location are excellent examples of such. Then I look for "relations" between obvious pairs.
I say there are only 3 cases. And I like to remember that the two "entities" are manifested as two "database tables":
1:1 -- At which point I ask why the two entities (database tables) are not merged together. The two tables share the same unique key.
1:many -- Represented by the id of the "1" as a column in the "many" table.
Many:many -- This needs a link table between the two tables. This linking table has two ids.
"Id" means a unique key for a table. It is usually a number, but that is not a requirement.
For "0"...
1:0 or many:0 -- You may need a LEFT JOIN to provide NULLs when the entry on the "0" side is missing
Many:many -- If either id is non-existent (the zero case), then there are no rows for that relationship.
Then comes defining the INDEXes for efficient access. And, optionally, FOREIGN KEYs for integrity. The indexes that represent how two entities are related are prime candidates for FKs. Other INDEXes should be added to optimize WHERE clauses in SQL queries.
In all cases, the id/FK/index may be "composite" -- meaning that it is two or more columns that are used for a single id/FK/index.

Is there a more efficient way to handle multi-valued attributes other than creating a relationship table?

I have three tables, tbl_school, tbl_courses and tbl_branches.
Each course can be taught in one or more branches of a school.
tbl_school has got:
id
school_name
total_branches
...
tbl_courses:
id
school_id
course_title
....
tbl_branches:
id
school_id
city
area
address
When I want to list all the branches of a school, it is a pretty straight forward JOIN.
However, each course will be taught in one or more branches or all the branches of the school and I need to store this information. Since there is a one-to-many relationship between tbl_courses and tbl_branches, I will have to create a new relationship table that maps each course record to it's respective branches.
When my users want to filter a course by city or area, this relationship table will be used.
I would like to know if this is the right approach or is there something better for my problem?
I was planning to store a JSON of branches of courses which would eliminate the relationship table and query would be much easier to find the city or area pattern in JSON string.
I am new to design patterns so kindly bear with me.
Issues
The table description you have given has a few errors, which need to be corrected first, after which my proposal will make more sense.
The use of a table prefix, especially tbl_, is incorrect. All the tables are tbl_s. If you do use a prefix, it is to group tables by Subject Area. Further, SQL allows a table qualifier when referring to any table in the code:
`... WHERE table_name.column_name = "something" ...
If you would like some advice re Naming Convention, please review this Answer.
Use singular, because the table name is supposed to refer to a row (relation), not to the content (we know it contains many rows). Then all the English used re the table_name makes sense. (Eg. refer my Predicates.)
You have some incorrect or extraneous columns. It is easier to give you a Data Model, than to explain each item. A couple of items do need explanation:
school.total_branches is a duplicate, because that value can easily be derived (by COUNT() of the Branches). It breaks Normalisation rules, and introduces an Update Anomaly, which can get "out of synch".
course.school_id is incorrect, given that each Branch may or may not teach a Course. That relation is 1 Course to many Branches, it should be in the new table you are contemplating.
By JSON, if you mean construct an array on the client instead of keeping the relations in the database, then no, definitely not. Data and relationships to data, should be implemented in the database. For many reasons, the most important of which is Integrity. Following that, you may easily drag it into the client, and keep it there for stream-performance purposes.
The table you are thinking about is an Associative Table, an ordinary Relational construct to relate ("map", "link") two parent tables, here Courses to Branches.
Data duplication is not prevented. Refer to the Keys is the Data Model.
ID columns you have do not provide row uniqueness, which the Relational Model demands. If that is not clear to you please read this Answer.
Solution
Here is the model.
Proposed School Data Model
Please review and comment.
I need to ensure that you understand the notation in IDEF1X models, that unlike non-standard diagrams: every little notch, tick and line means something very specific. If not, please got to the IDEF1X Notation link at the bottom right of the model.
Please check the Predicates carefully, they (a) explain the model, and (b) are used to verify it. It is a feedback loop. They have two separate benefits.
If you would like more information on Predicates, why they are relevant, please go to this Answer and read the Predicate section.
If you wish to thoroughly understand Predicates, with a view to understanding Data Modelling, consider that Data Model (latest version is linked at the top of the Answer) against those Predicates. Ie. see if you understand a database that you have never seen before, via the model plus Predicates.
The Relational Keys I have given provide the row uniqueness that is required for Relational databases, duplicate data must be prevented. Note that ID columns are simply not needed. The Relational Keys provide:
Data Integrity
Relational access to data (notice the ease of, and unlimited, joins)
Relational speed
None of which a Record Filing System (characterised by ID columns) has.
Column description:
I have implemented two address_lines. Obviously, that should not include city because that is a separate column.
I presume area means something like borough or county or the area that the school branch operates in. If it is a fixed geographic administrative region (my first two descriptors) then it requires a formal structure. If not (my third descriptor), ie. it is loose, or (eg) it spans counties, then a simple Lookup table is enough.
If you use formal administrative regions, then city must move into that structure.
Your approach with an additional table seems the simplest and most straightforward to me. I would not mix JSON in this.

modeling many to many unary relationship and 1:M unary relationship

Im getting back into database design and i realize that I have huge gaps in my knowledge.
I have a table that contains categories. Each category can have many subcategories and each subcategory can belong to many super-categories.
I want to create a folder with a category name which will contain all the subcategories folders. (visual object like windows folders)
So i need to preform quick searches of the subcategories.
I wonder what are the benefits of using 1:M or M:N relationship in this case?
And how to implement each design?
I have create a ERD model which is a 1:M unary relationship. (the diagram also contains an expense table which stores all the expense values but is irrelevant in this case)
is this design correct?
will many to many unary relationship allow for faster searches of super-categories and is the best design by default?
I would prefer an answer which contains an ERD
If I understand you correctly, a single sub-category can have at most one (direct) super-category, in which case you don't need a separate table. Something like this should be enough:
Obviously, you'd need a recursive query to get the sub-categories from all levels, but it should be fairly efficient provided you put an index on PARENT_ID.
Going in the opposite direction (and getting all ancestors) would also require a recursive query. Since this would entail searching on PK (which is automatically indexed), this should be reasonably efficient as well.
For some more ideas and different performance tradeoffs, take a look at this slide-show.
In some cases the easiest way to maintain a multilevel hierarchy in a relational database is the Nested Set Model, sometimes also called "modified preorder tree traversal" (MPTT).
Basically the tree nodes store not only the parent id but also the ids of the left-most and right-most leaf:
spending_category
-----------------
parent_id int
left_id int
right_id int
name char
The major benefit from doing this is that now you are able to get an entire subtree of a node with a single query: the ids of subtree nodes are between left_id and right_id. There are many variations; others store the depth of the node in addition to or instead of the parent node id.
A drawback is that left_id and right_id have to be updated when nodes are inserted or deleted, which means this approach is useful only for trees of moderate size.
The wikipedia article and the slideshow mentioned by Branko explains the technique better than I can. Also check out this list of resources if you want to know more about different ways of storing hierarchical data in a relational database.

Possible to have a table with referential integrity to itself?

Is it possible to have a table with circular referential integrity keys to itself? In example, if I had a table called Container
ObjectId ParentId
1 1
2 1
3 2
ObjectId 1 references itself. Id's 2 and 3 reference their respective parents, which are also in the same table. It wouldn't be possible to delete 3 without deleteing 2, 2 without deleting 1, and it would be impossible to delete 1.
I know I could accomplish the same thing by having a cross reference table, such as,
ObjectId ContainerId
1 1
2 2
3 3
ContainerId ObjectId
1 1
2 1
3 3
But I'm interested in the first way of accomplishing it more, as it would eliminate a possibly unnecessary table. Is this possible?
Yes, self referencing tables are fine.
They are the classical way to represent deeply nested hierarchies.
Just set a foreign key from the child column to the parent column (so, a value in the child must exist in the parent column).
I have done this many times. But be aware if you really are managing hierachies of data, SQL isn't good at tree-like queries. Some SQL vendors have SQL extensions to help with this that might be usable, but Joe Celko's 'Nested Sets' is the cat's meow for this. You'll get lots of hits in a search.
Currently I use the nested-sets approach with a self-reference 'parentID' as a short-cut for the references:
Who is my parent?
Who are my immediate children?
The rest are nested-sets queries.
The first way works, however if you're trying to store an arbitrarily deep tree, the recursive queries will be slow. You could look into storing an adjacency list or a different method (see http://vadimtropashko.wordpress.com/2008/08/09/one-more-nested-intervals-vs-adjacency-list-comparison/).
One thing we do is to store (in a separate table) each object along with all of its successors as well as having a "parent" indicator in the main table, which we use to build the tree in the application.
The goal, George, is not to eliminate an unnecessary table when using the self-reference nested set approach. Rather, it is to handle a hierarchy whose depth is not known in advance: your boss's boss's boss's boss. Who knows how deep that organizational tree may go? If you do know the depth of the hierarchy in advance, and it is not subject to frequent change, you would be better served with separate tables because writing queries against nested sets is a headache best avoided. Simplicity is better than complexity.

Retrieving data with a hierarchical structure in MySQL

Given the following table
id parentID name image
0 0 default.jpg
1 0 Jason
2 1 Beth b.jpg
3 0 Layla l.jpg
4 2 Hal
5 4 Ben
I am wanting to do the following:
If I search for Ben, I would like to find the image, if there is no image, I would like to find the parent's image, if that does not exist, I would like to go to the grandparent's image... up until we hit the default image.
What is the most efficient way to do this? I know SQL isn't really designed for hierarchical values, but this is what I need to do.
Cheers!
MySQL lacks recursive queries, which are part of standard SQL. Many other brands of database support this feature, including PostgreSQL (see http://www.postgresql.org/docs/8.4/static/queries-with.html).
There are several techniques for handling hierarchical data in MySQL.
Simplest would be to add a column to note the hierarchy that a given photo belongs to. Then you can search for the photos that belong to the same hierarchy, fetch them all back to your application and figure out the ones you need there. This is slightly wasteful in terms of bandwidth, requires you to write more application code, and it's not good if your trees have many nodes.
There are also a few clever techniques to store hierarchical data so you can query them:
Path Enumeration stores the list of ancestors with each node. For instance, photo 5 in your example would store "0-2-4-5". You can search for ancestors by searching for nodes whose path concatenated with "%" is a match for 5's path with a LIKE predicate.
Nested Sets is a complex but clever technique popularized by Joe Celko in his articles and his book "Trees and Hierarchical in SQL for Smarties." There are numerous online blogs and articles about it too. It's easy to query trees, but hard to query immediate children or parents and hard to insert or delete nodes.
Closure Table involves storing every ancestor/descendant relationship in a separate table. It's easy to query trees, easy to insert and delete, and easy to query immediate parents or children if you add a pathlength column.
You can see more information comparing these methods in my presentation Practical Object-Oriented Models in SQL or my upcoming book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Perhaps Managing Hierarchical Data in MySQL helps.