I'm trying to get a better understanding of normalisation so I can use best practices going forward. I've found a question in an old book and I'm a little confused by it. Essentially I'm given this table with the following data:
Name  Sport   Sport Centre
Jim   Tennis  A1
Jim   Golf    A2
Dan   Tennis  A1
Dan   Golf    A3
Ben   Golf    A2
So we're assuming that each sport centre can ONLY host one sport. What I want is to convert this to BCNF. My process (from what I've learned so far) is as follows:
1. I identified all of the functional dependencies here:
Sport Centre -> Sport
(Name, Sport Centre) -> Sport
2. I identified all candidate keys:
(Name, Sport Centre)
But this is where I get stuck. I thought that to be in BCNF the table must have more than 1 candidate key, and I can only see one. I'm unsure how to get this to BCNF. What I have done is the following splitting up of the table:
Name  Sport Centre
Jim   A1
Jim   A2
Dan   A1
Dan   A3
Ben   A2
Sport Centre  Sport
A1            Tennis
A2            Golf
A3            Golf
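In SQL terms, I think my split would look something like this (a sketch with my own table and column names):

    CREATE TABLE sport_centre (
      centre VARCHAR(10) PRIMARY KEY,   -- e.g. 'A1'
      sport  VARCHAR(30) NOT NULL       -- the single sport the centre hosts
    );

    CREATE TABLE plays_at (
      name   VARCHAR(30) NOT NULL,
      centre VARCHAR(10) NOT NULL,
      PRIMARY KEY (name, centre),
      FOREIGN KEY (centre) REFERENCES sport_centre (centre)
    );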
But I also understand that to be in 3NF (before BCNF) every attribute must be dependent on the full primary key, yet my splitting up breaks this rule.
How do I normalize properly here?
1. I identified all of the functional dependencies here:
You have not identified all the FDs (functional dependencies) that hold. First: FDs are between sets of attributes. Although it happens that if we restrict ourselves to FDs from a set of attributes to a set holding a single attribute then we can infer what other FDs hold. So we can restrict what we mean by "all", but you should know what you are saying. Next: You have identified some FDs that hold. But all the ones implied by them via Armstrong's axioms also hold. This always means some trivial FDs, e.g. {Sport Centre} -> {Sport Centre} & {} -> {}. Although it happens that we can infer the trivial FDs just from knowing the attributes. So again we can restrict what we mean by "all", but you should know what you are saying. It happens that you have identified all the non-trivial FDs with one attribute on the RHS. But you have not justified that the ones you found hold, or that you have found all the ones that hold.
You need to learn algorithms & relevant definitions for generating a description of the set of all FDs that hold. Including Armstrong's axioms, the notion of the closure of a set of FDs, & the notion of a canonical cover that concisely characterizes a closure.
2. I identified all candidate keys:
Assuming that { {Sport Centre} -> {Sport} } is a canonical cover, the only CK is {Name, Sport Centre}.
You need to learn algorithms & relevant definitions for finding all CKs.
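For example, a sketch of the attribute-closure computation on your FDs (plain set notation; not a substitute for learning the algorithm):

    {Sport Centre}+       = {Sport Centre, Sport}        (apply {Sport Centre} -> {Sport}; Name is never added)
    {Name}+               = {Name}                       (no FD applies)
    {Name, Sport Centre}+ = {Name, Sport Centre, Sport}  (all attributes)

Only {Name, Sport Centre} determines every attribute, and no proper subset of it does, so it is the sole CK.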
I thought that to be in BCNF the table must have more than 1 candidate key
That's wrong. You seem to be trying to recall something like "3NF & not BCNF implies more than 1 CK" or "3NF & 1 CK implies BCNF", which are true. But these don't give that BCNF implies more than 1 CK, or equivalently, that 1 CK implies not BCNF.
You need to learn a definition of BCNF & other relevant definitions.
I'm unsure how to get this to BCNF.
We can always decompose to a BCNF design. Most definitions of BCNF say it is when there are no FDs of a certain form. It happens that we can get to BCNF by repeatedly losslessly decomposing to eliminate a problem FD. However, that might needlessly not "preserve" FDs. So we typically decompose with preservation to 3NF/EKNF first, which can always preserve FDs. Although then going to BCNF might fail to preserve a FD even though there was a FD-preserving decomposition directly from the original.
You need to learn algorithms & relevant definitions for decomposing to a given NF. Including the notions of lossless decomposition & FD preservation.
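Applied to your example (a sketch): {Sport Centre} -> {Sport} holds and {Sport Centre} is not a superkey, so it violates BCNF. Decomposing on it gives components with attribute sets {Sport Centre, Sport} and {Name, Sport Centre}; both are in BCNF, the decomposition is lossless (the shared attribute Sport Centre is a key of the first component), and here it also preserves the FD. That is exactly your split.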
But I also understand that to be in 3NF (before BCNF) every attribute must be dependent on the full primary key and my splitting up breaks this rule.
To normalize to a given NF it is not necessary to go through lower NFs. In general that can eliminate good final NF designs from arising.
Also "to be in 3NF [...] every attribute must be dependent on the full primary key" is not correct. You need to memorize definitions--necessary & sufficient conditions. And PKs (primary keys) do not matter to normalization, CKs do. Although we can investigate the special case of just one CK, which we could then refer to as the PK. Also "my splitting up breaks this rule" doesn't make sense. A necessary condition for a table to be in some NF is not a rule about how to decompose to it or any other NF.
You need to find a (good) academic textbook and learn its normalization definitions & algorithms. (Dozens of textbooks are free online, also slides & courses.) When you are stuck following it, reference & quote it, show your work following it, and explain about how you are stuck.
I think I might have answered my own question, but I won't mark it unless an expert on the community can confirm.
So my splitting up is valid, but my reasoning was off: the original table still has only one candidate key, (Name, Sport Centre), and that is fine, because BCNF does not require more than one candidate key.
What matters is the keys of the two new tables:
(Name, Sport Centre) is the key of the first table (only trivial FDs hold in it)
(Sport Centre) is the key of the second table (because Sport Centre -> Sport)
Since in each table every determinant is a key of that table, my splitting up is in BCNF and valid. I think this is correct.
(Edited 1/5 10:22: added some explanation of my notation, and some additional information I received.)
I am doing a course on database design, and currently we're doing ERDs and designing databases in MySQL Workbench. Think 1st, 2nd and 3rd NF, creating schemas, tables, constraints, etc.
Most of it is pretty clear to me.
However, there's one aspect that remains unclear to me: the X:X to 1:many relationship vs the X:X to 0:many relationship (meaning: anything to 0:many vs anything to 1:many, etc.).
In some cases it's obvious, in others not so much. Whenever it's unclear to me, it's mostly something like this:
Example:
An artist has 1 to many paintings. A painting has 1 and only 1 artist.
Relationship:
|artist| 1:1 -------- 1:many |painting|
The same in another notation:
|artist| ||------------ 1< |painting|
This seems fair, but... then there's the thought: I could be a new artist, not having produced a painting yet.
Or: I could be entering a new artist into an artist table, not yet having entered his paintings (which could lead to a practical issue).
Another example:
A workshop has 1 to many participants. A participant enters 0-to-many workshops.
Relationship:
|workshop| many:0 ------- 1:many |participant|
Okay. However: a workshop could have 0 participants (no one wants to participate, probably leading to cancellation).
Or: I could be entering a new workshop into a table, not having added any participants yet.
Another example:
An event is held at 1 and only 1 location. A location has 1 to many events.
Relationship: |event| many:1 -------- 1:1 |location|
However, maybe you're entering a new (future) location, and there have not been events there yet.
Long story short: I am having a hard time establishing the minimum cardinality in cases like the above.
Also, when I'm designing a db and get Workbench to forward-engineer the SQL for creating the tables (based on my ERD), there doesn't seem to be any difference between an X to 1/many and an X to 0/many variant. Which makes me wonder: what's the actual (practical) effect or implication of doing one or the other? Maybe the implications (further down the road) make for an easier choice?
Can anyone explain this matter in a simple (fool-proof) way?
Looking forward to understanding!
Addition 1/5:
I've talked about my question/issue with a teacher. He agreed with me that certain minimum cardinalities could lead to a deadlock:
a row cannot be inserted into one table without there being an occurrence in the other, and vice versa.
He explained to me that the ERD is a logical model, not per se a physical model. In other words, the ERD's minimum cardinality is not necessarily meant for technical implementation.
Well, if that is the case, I understand his point. Usually an artist has at least one painting. A workshop normally has at least one participant. A location usually has at least one event. So on a logical level, that seems fine.
On a technical/implementation level, it is another matter. You should be able to enter an artist, workshop or location without there already being occurrences in another table.
My question now is:
Is this true? Is an ERD a logical model, not a technical model?
And if that is so, WHAT is the reason for adding the minimum cardinality? It seems of little use.
Let's continue with your artist::paintings relationship; I think that's the clearest. When we say 1 to many, we often mean 0/1 to 0/many, but not all 4 permutations work or are meaningful.
How does "I can have no artists with no paintings" sound? That's the zero-to-zero permutation. It does not sound wrong, but it's a degenerate case that is of no use to us.
1 artist to 0 paintings is OK, as matbailie described, and presents no problems.
1 artist to many paintings is OK, and is the main use case.
0 artists to many paintings is just not correct and should not be supported in the model. A painting must have an artist.
Because cases 1 & 4 don't work, it is not really correct to say 0/1 to 0/many. It is more correct to characterize the cardinality as 1 to 0/many, which encompasses cases 2 & 3 above.
You might say, 'but there are cases where we don't know the artist so we should leave an opportunity to have paintings with no artists'. This statement feels to me like you are leaving the realm of the ERD and entering into physical design. From a design standpoint you could just as easily say there is an artist we just don't know who, so let's create an UNKNOWN ARTIST record and connect those paintings to it.
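At the physical level, a minimal sketch of how the mandatory "1" side might show up (hypothetical names; not the only way to do it):

    CREATE TABLE artist (
      artist_id INT PRIMARY KEY,
      name      VARCHAR(100) NOT NULL
    );

    CREATE TABLE painting (
      painting_id INT PRIMARY KEY,
      title       VARCHAR(200) NOT NULL,
      artist_id   INT NOT NULL,   -- NOT NULL: every painting has an artist (the '1' side)
      FOREIGN KEY (artist_id) REFERENCES artist (artist_id)
    );

Nothing forces an artist to have a painting row, which is the "0/many" side; making artist_id nullable would instead model 0/1 to 0/many.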
Your second example (workshops::participants) looks more like a 0/many to 0/many. If you run through the same 4 permutations, they all look credible, though the 0-to-0 case still seems kind of ludicrous.
Your last example is another 1 to 0/many because events without locations cannot be held. When you get to the physical level you can talk about the best way to handle virtual events.
So, none of your examples seem to show a true zero-to-many (which is more accurately stated 0/1 to 0/many). I'm thinking they are pretty rare, if they exist at all. It would have to be something like an optional activity where, if you did enroll, there was a constrained set of choices.
But... A participant can be in 0-N workshops. So that needs to be many-to-many.
In general "zero" is a degenerate case of "1" or "N"; it is rarely worth worrying about.
I like to start by identifying the "entities" in the model. Your participant, workshop, painting, event, artist, location are excellent examples of such. Then I look for "relations" between obvious pairs.
I say there are only 3 cases. And I like to remember that the two "entities" are manifested as two "database tables":
1:1 -- At which point I ask why the two entities (database tables) are not merged together. The two tables share the same unique key.
1:many -- Represented by the id of the "1" as a column in the "many" table.
Many:many -- This needs a link table between the two tables. This linking table has two ids.
"Id" means a unique key for a table. It is usually a number, but that is not a requirement.
For "0"...
1:0 or many:0 -- You may need a LEFT JOIN to provide NULLs when the entry on the "0" side is missing.
Many:many -- If either id is non-existent (the zero case), then there are no rows for that relationship.
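A sketch of that LEFT JOIN, reusing the hypothetical tables above:

    -- Workshops with their participants; a workshop with zero
    -- participants still appears, with NULLs on the right side.
    SELECT w.id, p.id AS participant_id
    FROM workshop AS w
    LEFT JOIN participant_workshop AS pw ON pw.workshop_id = w.id
    LEFT JOIN participant AS p ON p.id = pw.participant_id;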
Then comes defining the INDEXes for efficient access. And, optionally, FOREIGN KEYs for integrity. The indexes that represent how two entities are related are prime candidates for FKs. Other INDEXes should be added to optimize WHERE clauses in SQL queries.
In all cases, the id/FK/index may be "composite" -- meaning that it is two or more columns that are used for a single id/FK/index.
Given the following functional dependencies, it is a little bit confusing for me, because third normal form says no non-prime attribute of R may be transitively dependent on the primary key. So I removed the functional dependency C -> DE from the table and placed it in a new relation, but all these attributes can also be determined by the primary key of the relation. I am unsure whether I can remove D and E from this table, because going further to BCNF also does not help in removing these attributes. The question is: when I remove the first functional dependency, should I also remove D and E from the first table?
[Image: the relation and its functional dependencies]
To put a relation into a given NF (normal form) you should follow an algorithm that has been advised for that NF. (Eg given some FDs, there are lots of others that hold, per Armstrong's axioms; you need to deal with them too. Eg there are certain benefits to "preserving" FDs when possible, and a decomposition to 3NF components that preserves FDs is always possible; but if we decompose so that some FD's attributes are split between components, we can fail to preserve FDs.)
Note that these algorithms do not involve first normalizing to lower NFs. (That can stop "good" higher-NF designs from being the final result.)
When you do decompose to get rid of a FD X -> Y from a relation with attributes R, the decomposition will always be non-loss/non-additive if the components have attribute sets X U Y and R - Y. By repeated decompositions all your components will eventually be in the NF you want (if it is BCNF or below). But your overall decomposition won't necessarily be as "nice" as an advised algorithm would give you.
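For example (a sketch; since your image is not reproduced here, assume a relation R = {A, B, C, D, E} with primary key {A, B} and the FD C -> DE): decomposing on C -> DE gives components with attribute sets C U DE = {C, D, E} and R - DE = {A, B, C}. So yes: D and E are removed from the first table, and C stays in both, linking them.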
I started to teach myself the basics of databases, and I am currently working through 1st to 3rd normal forms. What I understand so far is the wish to remove redundancy, to make my databases less prone to inconsistency during data changes, as well as to save space by eliminating as many duplicates as possible.
For example if we have a table with the following columns:
CD_ID
title
artist
year
and change the design to have multiple tables where the first (CD) contains:
CD_ID
title
artist_ID
the second (artist) contains:
artist_ID
artist
year
I see that in the original table the year is transitively dependent on the ID via the artist. So we want to get rid of that and create a table for the artists, so that our new CD table is in third normal form.
But to do so I created another table (the artist table) which, again, is not in third normal form as far as I understand it, since we have the same type of transitive dependency as before, just in another table.
Is this correct, and if yes, should I also normalize the artist table to be in 3NF? When do I stop?
TL;DR You need to follow a published algorithm to decompose to a given normal form.
PS You didn't get Artist from the original CD via normalization, since you introduced a new column. But assume table Artist has the obvious meaning. Why do you think it "again is not in third normal form as far as I understand it"? If artist -> year holds in the original CD then it also holds in Artist. But then {artist} is, along with {artist_id}, a CK (candidate key) of Artist, and Artist is in 3NF (and 5NF).
From your question's original version plus the current one, you have a proposed base table CD with columns cd_id, title, group & year, holding tuples where cd cd_id titled title was made by group group that formed in year year. Column cd_id is unique, hence is a CK. FD {group} -> year also holds.
Normalization does not introduce new column names. It replaces a proposed base table by others, each with a smaller subset of its columns, that always join to what its value would have been. Normalization up to BCNF is based on FDs (functional dependencies), which are also what determine the CKs of a base table. So your question does not contain a decomposition. A possible decomposition reminiscent of your question, which might or might not have any particular properties, would be to tables with column sets {cd_id, title, group} and {group, year}.
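A sketch of that possible decomposition as SQL tables (hypothetical DDL, with group renamed to group_name because GROUP is a reserved word; normalization itself is about relations, not DDL):

    CREATE TABLE group_formed (
      group_name VARCHAR(100) PRIMARY KEY,   -- {group} -> year: group is the key here
      year       INT NOT NULL
    );

    CREATE TABLE cd (
      cd_id      INT PRIMARY KEY,
      title      VARCHAR(200) NOT NULL,
      group_name VARCHAR(100) NOT NULL,
      FOREIGN KEY (group_name) REFERENCES group_formed (group_name)
    );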
Other FDs hold in the original. Some hold because of what the columns are; some hold because of the CK; some hold because {group} -> year holds; in general, certain ones hold because all three of those do. And maybe others hold because of what tuples are supposed to go into the relation and what situations can arise. You need to decide for every possible FD whether it holds.
Of course, you might have been told that the only ones that hold are the ones that have to hold under those circumstances. But you won't have been told that the only FD that holds is {group} -> year, because there are trivial FDs and every superset of a CK functionally determines every set of columns.
One definition of 3NF is that a relation is in 2NF and no non-prime column is transitively functionally dependent on any CK. (Notice each condition involves other definitions.) If you want to use this to find out whether your relation is in 3NF then you next need to find out what all the CKs are. You can do this fastest via an appropriate algorithm, but you can just see which sets of columns functionally determine every column but don't contain a smaller such set, since those are the CKs. Then check the two conditions in the definition.
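Worked through here (a sketch): the only minimal set of columns that functionally determines every column is {cd_id}, so it is the sole CK. Then year is non-prime, {cd_id} -> {group}, {group} -> {year}, {group} is not a superkey, and year is not in {group}; i.e. year is transitively dependent on the CK. The relation is in 2NF (a single-column CK cannot have a partial dependency on it) but not in 3NF.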
If you want to normalize to 3NF then you need to follow an algorithm for decomposing to 3NF. You don't explain what process you think you should follow. But if you aren't following a proven algorithm then whatever components you pick might or might not always join to the original and might or might not each be in any particular higher normal form. Note that examples of decompositions you have seen are not presentations of decomposition algorithms.
The NF (normal form) definitions give conditions that a relation must meet to be in that NF. They don't tell you how to nonloss decompose (preserving FDs when possible) to relations in higher NFs. People have worked out algorithms for producing decompositions to particular NFs. (And decomposing to a given NF doesn't in general involve first decomposition to lower NFs. Going through lower NFs can actually prevent good higher-NF decompositions of the original from being generated when you get to decomposing per a higher NF.)
You may also not realize that when some FDs hold, certain other ones must hold. The latter can be determined via Armstrong's axioms from the former. So just because you decomposed to get rid of a particular FD whose presence violates a particular NF doesn't mean there weren't a bunch of other ones that violated it that you didn't deal with. They can be present in the new components. Or they can be not present in problematic ways, so that you have not "preserved" them when you could have, leading to poor designs.
Learn about specific NF algorithms, and for that matter NFs and normalization itself, in a college/university textbook/course/presentation. Many are online.
I have three tables, tbl_school, tbl_courses and tbl_branches.
Each course can be taught in one or more branches of a school.
tbl_school has got:
id
school_name
total_branches
...
tbl_courses:
id
school_id
course_title
....
tbl_branches:
id
school_id
city
area
address
When I want to list all the branches of a school, it is a pretty straightforward JOIN.
However, each course will be taught in one or more branches, or all the branches of the school, and I need to store this information. Since there is a one-to-many relationship between tbl_courses and tbl_branches, I will have to create a new relationship table that maps each course record to its respective branches.
When my users want to filter a course by city or area, this relationship table will be used.
I would like to know if this is the right approach, or whether there is something better for my problem.
I was planning to store a JSON array of each course's branches, which would eliminate the relationship table, and the query to find a city or area pattern in the JSON string would be much easier.
I am new to design patterns so kindly bear with me.
Issues
The table description you have given has a few errors, which need to be corrected first, after which my proposal will make more sense.
The use of a table prefix, especially tbl_, is incorrect. All the tables are tbl_s. If you do use a prefix, it is to group tables by Subject Area. Further, SQL allows a table qualifier when referring to any table in the code:
... WHERE table_name.column_name = 'something' ...
If you would like some advice re Naming Convention, please review this Answer.
Use singular, because the table name is supposed to refer to a row (relation), not to the content (we know it contains many rows). Then all the English used re the table_name makes sense. (E.g. refer to my Predicates.)
You have some incorrect or extraneous columns. It is easier to give you a Data Model than to explain each item. A couple of items do need explanation:
school.total_branches is a duplicate, because that value can easily be derived (by COUNT() of the Branches). It breaks Normalisation rules, and introduces an Update Anomaly, which can get "out of synch".
course.school_id is incorrect, given that each Branch may or may not teach a Course. That relation is 1 Course to many Branches; it should be in the new table you are contemplating.
By JSON, if you mean constructing an array on the client instead of keeping the relations in the database, then no, definitely not. Data, and relationships to data, should be implemented in the database. For many reasons, the most important of which is Integrity. After that, you may easily drag it into the client, and keep it there for stream-performance purposes.
The table you are thinking about is an Associative Table, an ordinary Relational construct to relate ("map", "link") two parent tables, here Courses to Branches.
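For example, a minimal sketch of such a table (using the question's id-style keys and hypothetical singular names for brevity; the model below replaces these with composite Relational Keys):

    CREATE TABLE course_branch (
      course_id INT NOT NULL,
      branch_id INT NOT NULL,
      PRIMARY KEY (course_id, branch_id),
      FOREIGN KEY (course_id) REFERENCES course (id),
      FOREIGN KEY (branch_id) REFERENCES branch (id)
    );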
Data duplication is not prevented. Refer to the Keys in the Data Model.
The ID columns you have do not provide row uniqueness, which the Relational Model demands. If that is not clear to you, please read this Answer.
Solution
Here is the model.
[Proposed School Data Model (IDEF1X diagram)]
Please review and comment.
I need to ensure that you understand the notation in IDEF1X models: unlike non-standard diagrams, every little notch, tick and line means something very specific. If not, please go to the IDEF1X Notation link at the bottom right of the model.
Please check the Predicates carefully: they (a) explain the model, and (b) are used to verify it. It is a feedback loop. They have two separate benefits.
If you would like more information on Predicates, why they are relevant, please go to this Answer and read the Predicate section.
If you wish to thoroughly understand Predicates, with a view to understanding Data Modelling, consider that Data Model (latest version is linked at the top of the Answer) against those Predicates. Ie. see if you understand a database that you have never seen before, via the model plus Predicates.
The Relational Keys I have given provide the row uniqueness that is required for Relational databases; duplicate data must be prevented. Note that ID columns are simply not needed. The Relational Keys provide:
Data Integrity
Relational access to data (notice the ease of, and unlimited, joins)
Relational speed
None of which a Record Filing System (characterised by ID columns) has.
Column description:
I have implemented two address_lines. Obviously, that should not include city because that is a separate column.
I presume area means something like borough or county or the area that the school branch operates in. If it is a fixed geographic administrative region (my first two descriptors) then it requires a formal structure. If not (my third descriptor), ie. it is loose, or (eg) it spans counties, then a simple Lookup table is enough.
If you use formal administrative regions, then city must move into that structure.
Your approach with an additional table seems the simplest and most straightforward to me. I would not mix JSON in this.
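For the city/area filter, the query over such a table is straightforward (a sketch with hypothetical names for the new table and its columns):

    SELECT c.course_title, b.city, b.area
    FROM tbl_courses AS c
    JOIN tbl_course_branch AS cb ON cb.course_id = c.id
    JOIN tbl_branches AS b ON b.id = cb.branch_id
    WHERE b.city = 'SomeCity';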
In designing an RDBMS schema, I wonder if there is a formal principle about concrete objects: for example, if it is a Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
But what about a table such as Courses (as in a school)? It can have a description, a number of units, whether it is offered only in Autumn (Fall) or Spring, etc., which are the "general properties" of a course.
And then there is the actual CourseSessions table, which has information such as time_from and time_to (such as 10 to 11am), whether it is Mon/Wed or Tue/Thu, and the instructor teaching it, and which points back to the Courses table using a course_id.
So the above 2 tables are both needed.
Are there principles of table design for "concrete" vs "abstract"?
Update: what I mean by "abstract" here is that a course is an abstract idea... there can be multiple instances of it... such as the course Physics 10 from 10-11am, and another at 12-1pm.
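In SQL, roughly what I mean (my own sketch):

    CREATE TABLE Courses (
      course_id   INT PRIMARY KEY,
      description VARCHAR(500),
      units       INT,
      term        VARCHAR(10)      -- e.g. 'Fall' or 'Spring'
    );

    CREATE TABLE CourseSessions (
      session_id INT PRIMARY KEY,
      course_id  INT NOT NULL,     -- points back to the "abstract" course
      days       VARCHAR(20),      -- e.g. 'Mon/Wed' or 'Tue/Thu'
      time_from  TIME,
      time_to    TIME,
      instructor VARCHAR(100),
      FOREIGN KEY (course_id) REFERENCES Courses (course_id)
    );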
for example, if it is a Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
That is the hope, but not the reality of the situation.
Due to immigration or legal death status, it is possible for there to be two (or more) records that represent the same person. Uniquely identifying people is difficult - first, middle and surnames can match but actually reflect different people. SSN/SIN are not reliable, because they can change (immigration, legal death). A name doesn't guarantee gender, and gender can be changed.
Are there principles of table design for "concrete" vs "abstract"
The classification of "concrete" vs "abstract" is arbitrary, subject to interpretation. Does the start and end date really make a Course session "concrete"? I can book numerous things in [Calendaring software of choice] - that doesn't mean the class actually took place, or that the final grades are legitimate values...
Table design is based on business rules, and the logical entities (which can become tables in the physical model) required to support those rules. Normalization helps make these entities more obvious.
The relational data model, based on mathematics, provides a way to design your data model on which certain operations are correct without risk.
Unfortunately, this kind of data model by itself is not a suitable solution for performance issues in a database. How to organize tables for a given business domain requires considering not only the abstract model of objects and database normalization, but also performance planning for your system. Yes, abstractions leak.
For example, there are two design strategies for tree structures: the Adjacency model and the Materialized Path model (The Art of SQL). Which one is better depends on which operations need to be optimized.
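A sketch of the two strategies (hypothetical names):

    -- Adjacency model: each row stores its parent.
    CREATE TABLE category_adj (
      id        INT PRIMARY KEY,
      parent_id INT NULL,              -- NULL for the root
      name      VARCHAR(100) NOT NULL,
      FOREIGN KEY (parent_id) REFERENCES category_adj (id)
    );

    -- Materialized Path model: each row stores its whole ancestry.
    CREATE TABLE category_path (
      id   INT PRIMARY KEY,
      path VARCHAR(255) NOT NULL,      -- e.g. '1/4/9/'
      name VARCHAR(100) NOT NULL
    );

    -- Fetching a whole subtree is a simple prefix match under the
    -- Materialized Path model, but needs recursion under adjacency.
    SELECT * FROM category_path WHERE path LIKE '1/4/%';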
There is a good, classic article I recommend: The Law of Leaky Abstractions.
Abstraction has its price (& it is often higher than expected)
By Keith Cooper
The Art of SQL is, of course, the soul of database design, in my opinion.