Third Normal Form -- transitive dependence between two foreign keys? - relational-database

I am creating a database containing books that I own and have read. I want to track both the book (or "title") that I own and read, and the edition (or "physical bound paper") of that book that I own and read.
Book and Edition are many-to-many. I own multiple editions of the book Democracy in America. I also own an edition called "Hemingway" that contains several books (or "titles"), including one called "For Whom the Bell Tolls".
Thus, I need a bridge between book and edition. My tables are:
Book (book_pk*,title)
Edition (edition_pk*,ISBN,year)
Book_Edition (book_fk,edition_fk)
I believe it is correct to say that the Book_Edition table contains a composite primary key.
Now, I am working on my Read table, which will contain the books that I have read and the date on which I read them. My read table so far contains:
Read (read_pk,date,note)
However, I now need to tie my Read table to my books and editions. It appears to me that book_fk and edition_fk are transitively dependent in this case. So how do I comply with the third normal form?
Option 1:
Modify the Read table to: Read (read_pk,date,note,book_fk,edition_fk)
Option 2:
Modify the Book_Edition table to: Book_Edition (book_edition_pk,book_fk,edition_fk)
Modify the Read table to: Read (read_pk,date,note,book_edition_fk)
Option 3:
???
Any insight would be appreciated. Apologies if this has been treated elsewhere; I saw a couple posts that looked promising but as a relative n00b I was not able to decipher them and apply them to my situation.
EDIT per sqlvogel:
Let me take a stab at identifying dependencies -- that is, I am trying to identify places where if Field A is changed, then Field B must or may change. I think I am finding this difficult because books (both "titles" and "collections of bound paper") are inherently permanent. The only time I would expect to edit the title, ISBN, or year fields would be if there is a data entry error. If the ISBN for a particular edition_pk is entered incorrectly, it's probably slightly more likely that the year for the same edition_pk was also entered incorrectly, but is that a dependency?
With respect to the read table, I believe the situation is similar. Records would be created each time a book is read, and theoretically never edited. I want to identify the book and edition that were read on a particular date. If there is a data entry error, it might affect one or more of the fields. In particular, if the wrong book_fk is entered, it's probably more likely that the wrong edition_fk was entered too. Again, is that a dependency I should be worried about?
Is there anything else I need to consider when thinking about dependencies?

Option 1: Read (read_pk,date,note,book_fk,edition_fk)
Assumptions:
{read_pk}->{date,note,book_fk,edition_fk}
{read_pk} is the primary key of Read.
For the sake of example just suppose that {book_fk,edition_fk}->{date}, meaning that each book is read only once (only a single date per book/edition). If you didn't make {book_fk,edition_fk} a candidate key in Read then {book_fk,edition_fk}->{date} would be an example of a non-key dependency in violation of 3NF because the determinant is not a key. The same would be true even if you substituted {book_edition_fk} in place of {book_fk,edition_fk}. In other words, your Option 2 is apparently the same as Option 1 as far as 3NF is concerned.
Since you haven't specified any dependencies I have just given this as an example. I can't say whether those dependencies would be a correct description of your situation. You yourself need to determine what dependencies actually should be in force here.
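One way to get a feel for a suspected dependency is to test it against sample rows. The sketch below uses a hypothetical helper, fd_holds, and made-up Read rows in the Option 1 layout; neither comes from the question. Sample data can only refute a dependency, never prove it: whether an FD should hold is a decision about the business rules.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical Read rows (Option 1 layout); all values are invented.
reads = [
    {"read_pk": 1, "date": "2020-01-05", "book_fk": 10, "edition_fk": 100},
    {"read_pk": 2, "date": "2021-03-09", "book_fk": 10, "edition_fk": 100},  # a re-read
    {"read_pk": 3, "date": "2021-06-01", "book_fk": 11, "edition_fk": 101},
    {"read_pk": 4, "date": "2022-02-02", "book_fk": 10, "edition_fk": 102},  # same book, other edition
]

pk_determines_all = fd_holds(reads, ["read_pk"], ["date", "book_fk", "edition_fk"])
pair_determines_date = fd_holds(reads, ["book_fk", "edition_fk"], ["date"])
book_determines_edition = fd_holds(reads, ["book_fk"], ["edition_fk"])
```

With these rows only the dependency on read_pk survives: the re-read refutes {book_fk,edition_fk}->{date}, and the second edition of book 10 refutes book_fk->edition_fk.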

Transitive dependencies require the dependent attribute to be a non-key attribute. Since the two attributes you're concerned about are foreign keys, you do not have a transitive dependency problem in your structure.
You do not need to alter the original design.
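A minimal sketch of Option 1, written with Python's built-in sqlite3 module so it can be run as-is. Table and column names come from the question; the sample rows and the composite foreign key from Read to Book_Edition are assumptions added for illustration. The composite key is one way to make sure a Read row can only name a book/edition pair that actually exists in the bridge table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE Book (book_pk INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE Edition (edition_pk INTEGER PRIMARY KEY, ISBN TEXT, year INTEGER);
CREATE TABLE Book_Edition (
    book_fk    INTEGER NOT NULL REFERENCES Book(book_pk),
    edition_fk INTEGER NOT NULL REFERENCES Edition(edition_pk),
    PRIMARY KEY (book_fk, edition_fk)          -- composite primary key
);
CREATE TABLE Read (
    read_pk    INTEGER PRIMARY KEY,
    date       TEXT NOT NULL,
    note       TEXT,
    book_fk    INTEGER NOT NULL,
    edition_fk INTEGER NOT NULL,
    -- assumption: tie the pair to the bridge table, not to Book and Edition separately
    FOREIGN KEY (book_fk, edition_fk) REFERENCES Book_Edition (book_fk, edition_fk)
);
""")

# Made-up sample data: one book, two editions, but only edition 1 contains the book.
conn.execute("INSERT INTO Book VALUES (1, 'For Whom the Bell Tolls')")
conn.execute("INSERT INTO Edition VALUES (1, NULL, NULL)")
conn.execute("INSERT INTO Edition VALUES (2, NULL, NULL)")
conn.execute("INSERT INTO Book_Edition VALUES (1, 1)")
conn.execute(
    "INSERT INTO Read (date, note, book_fk, edition_fk) VALUES ('2020-01-05', NULL, 1, 1)"
)

try:
    # Book 1 is not part of edition 2, so this pair is not in Book_Edition.
    conn.execute(
        "INSERT INTO Read (date, note, book_fk, edition_fk) VALUES ('2020-02-01', NULL, 1, 2)"
    )
    invalid_pair_rejected = False
except sqlite3.IntegrityError:
    invalid_pair_rejected = True

read_count = conn.execute("SELECT COUNT(*) FROM Read").fetchone()[0]
```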

How would transitive functional dependencies occur here?

So I'm reading up on database normalization, and it seems like, for the most part, a lot of us are already following 2NF or even 3NF without realizing it. I wonder why our professor told us 4 years ago that it's a topic for a master's database course because it's "too complicated" lol... it sounds straightforward to me, really.
Anyway, part of this article here talks about 3NF, and to achieve that you need to be in 2NF and have no transitive functional dependencies.
The example given is this image, but I don't understand... how could changing the value of one non-key column change the value of another non-key column? If anything, it sounds to me like a glitch in the system if that were to happen.
Consider the table 1. Changing the non-key column Full Name may change Salutation.
That article is terrible. So clearly some people find it difficult to understand normalisation. (Many people struggle with the difference between 3NF vs Boyce-Codd NF, which the article ducks out of explaining.) The article says
Normalization helps produce database systems that are cost-effective and have better security models.
That's not the chief reason for normalising a design. Indeed in the early days of the Relational Model (when disk was expensive, so for example dates had two-digit years), normalisation (i.e. vertical partitioning) was the opposite of cost-effective, and a lot of attention was paid to trade-offs in (partially) denormalised schemas.
The chief reason for normalisation is to avoid duplicating information and/or duplicates getting out of step, called 'update anomalies'. Specifically:
A transitive functional dependency is when changing a non-key column, might cause any of the other non-key columns to change
That is a terrible way to put it. But it might mean: when you update one column (Full Name) of one row (Membership Id 3) you need to also update other column(s) (Salutation) of the same or other row(s) (Membership Id 2?); or if you don't, you break the consistency of the data.
The article doesn't tell us what FDs are expected to hold. Does Full Name determine Salutation? Is it possible the Membership ID 3 Robert Phil could qualify as a Doctor and therefore change his Salutation without Member 2 also becoming a Doctor? Then there is no FD from Full Name to Salutation, and what looks like duplicated entry is not.
Presumably what the example is trying to show (I'm not sure, because it's wrong) is that there's a dependency between Full Name and Salutation. Introducing a Salutation Id is so ... stupid, I'm very tempted to say "not even wrong". It has not removed the Transitive Functional Dependency at all.
Normalisation would (assuming there is a FD from Full Name to Salutation):
Put Full Name and Salutation in a separate table, keyed by Full Name -- that represents one of the FDs.
Remove Salutation from the Membership table.
Not introduce a Salutation ID field.
You can recover the original Membership table by joining to the Full Name, Salutation table.
The alleged 3NF form has not removed the Transitive Functional Dependency, and so is not in 3NF. All it has done is replace a Transitive FD from Membership ID to Full Name to Salutation with one from Membership ID to Full Name to Salutation ID. So if Member 3 changes their name from Robert Phil to Roberta Phil, under the initial design Salutation would have to change in step from Mr to Ms; under the alleged 3NF design still the Salutation ID has to change from 1 to 2.
There are other reasons to think that alleged 3NF design is not 3NF. I expect a dependency from Person to Full Name and to Address. There's no column Person, with the consequence there are two Mr Robert Phils. Are they the same person? Then what if they flatted together? The article tries to introduce a composite key {Full Name, Address}, but that won't help; and it's quite common for same-named father and son to live at the same address. We'd have two people with same name at same address. (What if one of them then qualified as a Doctor?)
Normalisation would introduce a Person ID, key to a Person table, with columns Full Name, Salutation, Address. The partitioned Membership table would have columns Membership ID (key), Person ID (Foreign Key references Person).
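A rough sketch of that Person/Membership split, using Python's built-in sqlite3 module. The column names, the address and the membership numbers are invented for illustration; the article's actual table is not reproduced here. It shows that two people with the same name and address can coexist, and that one of them can become a Doctor without touching the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    person_id  INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL,
    salutation TEXT NOT NULL,
    address    TEXT NOT NULL
);
CREATE TABLE Membership (
    membership_id INTEGER PRIMARY KEY,
    person_id     INTEGER NOT NULL REFERENCES Person (person_id)
);
-- Hypothetical data: same-named father and son at the same address.
INSERT INTO Person VALUES (1, 'Robert Phil', 'Mr', '1 Example Street');
INSERT INTO Person VALUES (2, 'Robert Phil', 'Mr', '1 Example Street');
INSERT INTO Membership VALUES (2, 1);
INSERT INTO Membership VALUES (3, 2);
""")

# One of them qualifies as a Doctor; only his row changes.
conn.execute("UPDATE Person SET salutation = 'Dr' WHERE person_id = 2")

# The original wide Membership table can be recovered with a join.
members = conn.execute("""
    SELECT m.membership_id, p.salutation, p.full_name
    FROM Membership AS m
    JOIN Person AS p ON p.person_id = m.person_id
    ORDER BY m.membership_id
""").fetchall()
```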

SQL Table Design Issue

So I am building out a set of tables in an existing database at the moment, and have run into a weird problem.
First things first, the tables in question are called Organizations, Applications, and PostOrganizationsApplicants.
Organizations is a pre-existing table that is already populated with lots of data in regards to an organization's information which has been filled out in another form on another portal. EDIT: I cannot edit this table.
Applications is a table that records all information that a user inputs in the application form of the website. It is a new table.
PostOrganizationsApplicants is basically a copy of Organizations. This is also a new table.
The process goes:
1. Go to website and choose between two different web forms, Form A pertains to companies who are in the Organizations table, and Form B pertains to companies who are not in that table.
2a. If Form A is chosen, a lot of the fields in the application will be auto-populated because of their previous submission.
2b. If Form B is chosen, the company has to fill out the entire application from scratch.
3. Any Form B applicants must go into the PostOrganizationsApplicants table.
Now I am extremely new to SQL and Database Management so I may sound pretty stupid, but when I am linking the Organizations and PostOrganizationsApplicants tables to the Applications table, FK's for the OrganizationsID column and PostOrganizationsApplicantsID columns will have lots of empty spaces.
Is this good practice? Is there a better way to structure my tables? I've been racking my brain over this and just can't figure out a better way.
No, it's not necessarily bad practice to allow NULL values for foreign key columns.
If an instance of an entity doesn't have a relationship to an instance of another entity, then storing a NULL in the foreign key column is the normative practice.
From your description of the use case, a "Form A" Applications won't be associated with a row in Organizations or a row in PostOrganizationsApplicants.
Getting the cardinality right is what is important. How many Organizations can a given Applications be related to? Zero? One? More than one? And vice versa.
If the relationship is many-to-many, then the usual pattern is to introduce a third relationship table.
Less frequently, we will also implement a relationship table for very sparse relationships, when a relationship to another entity is an exception, rather than the rule.
I'm assuming that the OrganizationsID column you are referring to is in the PostOrganizationsApplicants table (which would mean that a PostOrganizationsApplicants can be associated with at most one Organizations).
I'm also assuming that PostOrganizationsApplicantsID column is in the Applications table, which means an instance of Applications can be associated with at most one PostOrganizationsApplicants.
Bottom line: having a "zero-or-one to many" relationship is valid, as long as that supports a suitable representation of the business model.
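A minimal sketch of nullable foreign keys, using Python's built-in sqlite3 module. It follows the layout described in the question (both foreign key columns living in Applications); the Name columns and sample rows are made up, and the CHECK constraint is an optional extra that is not part of the original design. It only illustrates that NULLs in FK columns are accepted while a rule such as "exactly one of the two is filled in" can still be enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Organizations (
    OrganizationsID INTEGER PRIMARY KEY,
    Name TEXT
);
CREATE TABLE PostOrganizationsApplicants (
    PostOrganizationsApplicantsID INTEGER PRIMARY KEY,
    Name TEXT
);
CREATE TABLE Applications (
    ApplicationsID INTEGER PRIMARY KEY,
    -- Both foreign keys are nullable: NULL means "no related row".
    OrganizationsID INTEGER REFERENCES Organizations (OrganizationsID),
    PostOrganizationsApplicantsID INTEGER
        REFERENCES PostOrganizationsApplicants (PostOrganizationsApplicantsID),
    -- Assumed rule: exactly one of the two must be present.
    CHECK ((OrganizationsID IS NULL) <> (PostOrganizationsApplicantsID IS NULL))
);
""")

conn.execute("INSERT INTO Organizations VALUES (1, 'Existing Org')")
conn.execute("INSERT INTO PostOrganizationsApplicants VALUES (1, 'New Org')")
conn.execute("INSERT INTO Applications VALUES (1, 1, NULL)")   # Form A
conn.execute("INSERT INTO Applications VALUES (2, NULL, 1)")   # Form B

try:
    conn.execute("INSERT INTO Applications VALUES (3, NULL, NULL)")
    both_null_rejected = False
except sqlite3.IntegrityError:
    both_null_rejected = True

application_count = conn.execute("SELECT COUNT(*) FROM Applications").fetchone()[0]
```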
Why not just add a column to the Organizations table that indicates that the Organization is a "Post" type of organization and set it for the Form B type of applicants? - then, all your orgs are in one table - with a single property to indicate where they came from.
If you can add a new record to Organizations (I hope you can), just create an FK from Organizations as the PK of PostOrganizationsApplicants. So if an Organizations row has a corresponding record in PostOrganizationsApplicants, it's "Post"!
Thanks everybody, I think I found the most efficient way to do it inspired by all of your answers.
My solution below, in case anyone else has a similar problem...
Firstly I will make the PK of PostOrganizationsApplicants the FK of Organizations by making a "link" table.
Then I am going to add a column in PostOrganizationsApplicants which will take in a true/false value on whether they completed the form from the other portal or not. Then I will ask a question in the form whether they have already done the other version of the form or not. If the boolean value is true, then I will point those rows to the Organizations table to auto-populate the forms.
Thanks again!

Is it proper to make a grand-parent key, a primary key, in its grand-child, in a multi-level identifying relationship?

Asked this here a couple of days ago, but haven't gotten many views, let alone a response, so I'm reposting to stackoverflow.
I'm modeling a DB for a conference ticketing system. In this system attendees are members of an attendee group, which belong to a conference. These relationships are identifying, and therefore FKs must be PKs in the respective children.
My current model:
Q: Is it proper to have attendeeGroupConferenceId FK, as a PK, in the attendee table, as MySQL Workbench has automatically set up for me?
On one side one would get a performance boost by keeping it in there for quick association at "check in". However, it is not strictly necessary, since the combination of id, attendeeGroupId, and a corresponding lookup of conferenceId in the respective attendeeGroup table is enough. (It therefore becomes redundant data.)
To me, it feels like it might violate some form of normalization, but I plan on keeping it in for the speed boost as described. I'm just curious about what proper design says about giving it PK status or not.
You definitely don't need the attendeeGroupConferenceId in your attendee table. It's redundant, and notice that the candidate key is the combination (attendeeGroupId, personId), not attendeeGroupConferenceId alone.
The table attendee also seems to violate the Second normal form (2NF) as it is.
My suggestion is to remove the attribute attendeeGroupConferenceId. In any case you can just join the tables in your queries to get extra info rather than keeping an extra attribute.
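A sketch of that join, using Python's built-in sqlite3 module. The original diagram is not available here, so the table layout, the column names and the assumption that attendeeGroup.id is unique on its own are all guesses made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE conference (
    id   INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE attendeeGroup (
    id           INTEGER PRIMARY KEY,
    conferenceId INTEGER NOT NULL REFERENCES conference (id)
);
CREATE TABLE attendee (
    attendeeGroupId INTEGER NOT NULL REFERENCES attendeeGroup (id),
    personId        INTEGER NOT NULL,
    PRIMARY KEY (attendeeGroupId, personId)   -- no conference column here
);
-- Made-up rows.
INSERT INTO conference VALUES (1, 'Example Conf');
INSERT INTO attendeeGroup VALUES (5, 1);
INSERT INTO attendee VALUES (5, 7);
""")

# At check-in, the conference is reached through the group.
conference_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT g.conferenceId
        FROM attendee AS a
        JOIN attendeeGroup AS g ON g.id = a.attendeeGroupId
        WHERE a.personId = ?
        """,
        (7,),
    )
]
```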

Database Design - structure

I'm designing a website with courses and jobs.
I have a jobs table and courses table, and each job or course is offered by a 'body', which is either an institution (offering courses) or a company (offering jobs). I am deciding between these two options:
option1: use a 'Bodies' table, with a body_type column for both institutions and companies.
option2: use separate 'institution' and 'company' tables.
My main problem is that there is also a post table where all adverts for courses and jobs are displayed from. Therefore if I go with the first option, I would just need to put a body_id as a record for each post, whereas if I choose the second option, I would need to have an extra join somewhere when displaying posts.
Which option is best? Or is there an alternative design?
Don't think so much in terms of SQL syntax and "extra joins", think more in terms of models, entities, attributes, and relations.
At the highest level, your model's central entity is a Post. What are the attributes of a post?
Who posted it
When it was posted
Its contents
Some additional metadata for search purposes
(Others?)
Each of these attributes is either unique to that post and therefore should be in the post table directly, or is not and should be in a table which is related; one obvious example is "who posted it" - this should simply be a PostedBy field with an ID which relates another table for poster/body entities. (NB: Your poster entity does not necessarily have to be your body entity ...)
Your poster/body entity has its own attributes that are either unique to each poster/body, or again, should be in some normalized entity of their own.
Are job posts and course posts substantially different? Perhaps you should consider CoursePosts and JobPosts subset tables with job- and course-specific data, and then join these to your Posts table.
The key thing is to get your model in such a state that all of the entity attributes and relationships make sense where they are. Correctly modeling your actual entities will prevent both performance and logic issues down the line.
For your specific question, if your bodies are generally identical in terms of attributes (name, contact info, etc) then you want to put them in the same table. If they are substantially different, then they should probably be in different tables. And if they are substantially different, and your jobs and courses are substantially different, then definitely consider creating two entirely different data models for JobPosts versus CoursePosts and then simply linking them in some superset table of Posts. But as you can tell, from an object-oriented perspective, if your Posts have nothing in common but perhaps a unique key identifier and some administrative metadata, you might even ask why you're mixing these two entities in your application.
When resolving hierarchies there are usually 3 options:
Kill children: Your option 1
Kill parent: Your option 2
Keep both
I get the issue you're talking about when you kill the parent. Basically, you don't know to what table you have to create a foreign key. So unless you also create a post hierarchy where you have a post related to institution and a separate post table relating to company (horrible solution!) that is a no go. You could also solve this outside the design itself by adding metadata in each post stating which table they should join against (not a good option either, as your schema will not be self-documenting and the data will determine how to join tables... which is error prone).
So I would discard killing the parent. Killing the children works well if you don't have too many different fields between the different tables. You should also bear in mind that this approach does not handle the case where a child can be both an institution and a company, but that doesn't seem to be the case here. Killing the children is also the most efficient option.
The third option that you haven't evaluated is the keeping both approach. This way you keep a dummy table containing the shared values between the bodies, and each of the bodies has an FK to this "abstract" table (if you know what I mean). This is usually the least efficient way but most likely the most flexible. This way you can easily handle bodies that are of both types, and also those that are only of type "body" but neither a company nor an institution themselves (if that is even possible or might be possible in the future). You should note that in order to join a post to an institution you should always reference the parent table and then join the parent with the children.
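The "keep both" layout might look roughly like this, sketched with Python's built-in sqlite3 module. All table names, column names and rows are hypothetical. Posts reference only the parent table, and the child tables are joined in from there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE body (                 -- shared attributes
    body_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE institution (          -- institution-only attributes
    body_id       INTEGER PRIMARY KEY REFERENCES body (body_id),
    accreditation TEXT
);
CREATE TABLE company (              -- company-only attributes
    body_id  INTEGER PRIMARY KEY REFERENCES body (body_id),
    industry TEXT
);
CREATE TABLE post (
    post_id INTEGER PRIMARY KEY,
    body_id INTEGER NOT NULL REFERENCES body (body_id),
    title   TEXT NOT NULL
);
-- Made-up rows: body 1 is both an institution and a company.
INSERT INTO body VALUES (1, 'Acme Training'), (2, 'Widget Co');
INSERT INTO institution VALUES (1, NULL);
INSERT INTO company VALUES (1, NULL), (2, NULL);
INSERT INTO post VALUES (1, 1, 'Intro to SQL'), (2, 2, 'DBA wanted');
""")

# Join post -> parent, then parent -> children.
posts = conn.execute("""
    SELECT p.title, b.name,
           i.body_id IS NOT NULL AS is_institution,
           c.body_id IS NOT NULL AS is_company
    FROM post AS p
    JOIN body AS b ON b.body_id = p.body_id
    LEFT JOIN institution AS i ON i.body_id = b.body_id
    LEFT JOIN company AS c ON c.body_id = b.body_id
    ORDER BY p.post_id
""").fetchall()
```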
This question might also be useful for you:
What is the best database schema to support values that are only appropriate to specific rows?

unnecessary normalization

My friend and I are building a website and having a major disagreement. The core of the site is a database of comments about 'people.' Basically people can enter a comment and they can enter the person the comment is about. Then viewers can search the database for words that are in the comment or parts of the person name. It is completely user generated. For example, if someone wants to post a comment on a misspelled version of a person's name, they can, and that's OK. So there may be multiple spellings of different people listed as several different entries (some with middle name, some with nickname, some misspelled, etc.), but this is all OK. We don't care if people make comments about random people or imaginary people.
Anyway, the issue is about how we are structuring the database. Right now it is just one table with the comment ID as the primary key, and then there is a field for the 'person' the comment is about:
comment ID - comment - person
1 - "he is weird" - John Smith
2 - "smelly girl" - Jenny
3 - "gay" - John Smith
4 - "owes me $20" - Jennyyyyyyyyy
Everything is working fine. Using the database, I am able to create pages that list all the 'comments' for a particular 'person.' However, he is obsessed with the idea that the database isn't normalized. I read up on normalization and learned that he was wrong. The table IS currently normalized, because the comment ID is unique and dictates the 'comment' and the 'person.' Now he is insistent that 'person' should have its OWN table because it is a 'thing.' I don't think it is necessary, because even though 'person' really is the bigger container (one 'person' can have many 'comments' about them), the database seems to operate just fine with 'person' being an attribute of the comment ID. I use various PHP calls for different SQL selections to make it magically appear more sophisticated on the output and the different ways the user can search and see results, but in reality, the set-up is quite simple. I am now letting users rank comments with thumbs up and thumbs down, and I keep a 'score' as another field on the same table.
I feel that there is currently no need to have a separate table for just unique 'person' entries because the 'persons' don't have their own 'score' or any of their own attributes. Only the comments do. My friend is so insistent that it is necessary for efficiency. Finally I said, "OK, if you want me to create a separate table and let 'person' be its own field, then what would be the second field? Because if a table has just a single column, it seems pointless. I agree that we may later create a need to give 'person' its own table, but we can deal with that then." He then said that strings can't be primary keys, and that we would convert the 'persons' in the current table to numbers, and the numbers would be the primary key in the new 'person' table. To me this seems unnecessary and it would make the current table harder to read. He also thinks it will be impossible to create the second table later, and that we need to anticipate now that we might need it for something later.
Who is right?
In my opinion your friend is right.
Person should live in a different table and you should try to normalize. Don't overdo it, though.
In the long run you may want to do more things with your site. Say you want to attach multiple files to a person (e.g. pictures): you'll be very thankful then for the normalization.
Creating a new table for person and using the key of that table in place of the person attribute has nothing to do with normalization. It may be a good idea for other reasons but doing so does not make the database "more normalized" than not doing it. So you are right: as far as normalization is concerned, creating another table is unnecessary.
I would vote for your friend. I like to normalize and plan for the future and even if you never need it, this normalization is so easy to do it literally takes no time. You can create a view that you query in order to make your SQL cleaner and eliminate the need for you to join the tables yourself.
If you have already reached all of your capabilities and have no plans for expansion of capabilities I think you leave it as it is.
If you plan to add more, namely allowing people to have accounts, or anything really, I think it might be smart to separate your data into Person and Comments tables. It's not hard and makes expanding your functionality easier.
You're right.
Person may be a thing in general, but not in your model. If you were going to hassle people into properly identifying the person they're talking about, a Person table would be necessary. For example, if the comments were only about persons already registered in the database.
But here it looks like you have unstructured data, without identity, and nothing/nobody is interested in making sure whether "jenny" and "jennyyy" are in fact the same person, not to mention "jenny doe" and "my cousin"...
Well, there are two schools of thought. One says, create your data model in the most normalized way possible, then de-normalize if you need more efficiency. The other is basically "do the minimum work necessary for the job, then change it as your requirements change". Also known as YAGNI (You aren't going to need it).
It all depends on where you see this going. If this is all it will be, then your approach is probably fine. If you intend to improve it with new features over time, then your friend is right.
If you never intend to associate the person column with a user or anything else and data apparently needs no consistency or data integrity checks, just why is this in a relational database at all? Wouldn't this be a use case for a nosql database? Or am I missing something?
Normalization is all about functional dependencies (FDs). You need to identify all of the FDs that exist among the attributes of your data model before it can be fully normalized.
Let's review what you have:
Any given instance of a CommentId functionally determines the Person (FD: CommentId -> Person)
Any given instance of a CommentId functionally determines the Comment (FD: CommentId -> Comment)
Any given instance of a CommentId functionally determines the UserId (FD: CommentId -> UserId)
Any given instance of a CommentId functionally determines the Score (FD: CommentId -> Score)
Everything here is a dependent attribute on CommentId and CommentId alone. This might lead you to the belief that a relation (table) containing all of, or a subset of, the above attributes must be normalized.
First thing to ask yourself is why did you create the CommentId attribute anyway? Strictly speaking, this is a manufactured attribute - it does not relate to anything 'real'. CommentId is commonly referred to as a surrogate key. A surrogate key is just a made up value that stands in for a unique value set corresponding to some other group of attributes. So what group of attributes is CommentId a surrogate for? We can figure that out by asking the following questions and adding new FDs to the model:
1) Does a Comment have to be unique? If so the FD: Comment -> CommentId must be true.
2) Can the same Comment be made multiple times as long as it is about a different Person? If so, then FD: Person + Comment -> CommentId must be true and the FD in 1 above is false.
3) Can the same Comment be made multiple times about the same Person provided it was made by different UserIds? If so, the FDs in 1 and 2 cannot be true but FD: Person + Comment + UserId -> CommentId may be true.
4) Can the same Comment be made multiple times about the same Person by the same UserId but have different Scores? This implies FD: Person + Comment + UserId + Score -> CommentId is true and the others are false.
Exactly one of the 4 FDs above must be true. Whichever it is affects how your data model is normalized.
Suppose FD: Person + Comment + UserId -> CommentId turns out to be true. The logical consequences are that:
Person + Comment + UserId and CommentId serve as equivalent keys with respect to Score.
Score should be put in a relation with one but not both of its keys (to avoid transitive dependencies). The obvious choice would be CommentId since it was specifically created as a surrogate.
A relation comprised of: CommentId, Person, Comment, UserId is needed to tie the Key to its surrogate.
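Those two relations could be sketched as follows, using Python's built-in sqlite3 module. The table names CommentKey and CommentScore and the sample rows are invented for illustration, and the sketch assumes the FD Person + Comment + UserId -> CommentId is the one that holds. The UNIQUE constraint is what declares the natural key alongside its surrogate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Ties the natural key (Person, Comment, UserId) to its surrogate.
CREATE TABLE CommentKey (
    CommentId INTEGER PRIMARY KEY,
    Person    TEXT NOT NULL,
    Comment   TEXT NOT NULL,
    UserId    INTEGER NOT NULL,
    UNIQUE (Person, Comment, UserId)
);
-- Score hangs off the surrogate only.
CREATE TABLE CommentScore (
    CommentId INTEGER PRIMARY KEY REFERENCES CommentKey (CommentId),
    Score     INTEGER NOT NULL DEFAULT 0
);
""")

insert = "INSERT INTO CommentKey (Person, Comment, UserId) VALUES (?, ?, ?)"
conn.execute(insert, ("Jenny", "owes me $20", 1))
conn.execute(insert, ("Jenny", "owes me $20", 2))   # same text, different user: allowed
conn.execute("INSERT INTO CommentScore (CommentId, Score) VALUES (1, 3)")

try:
    conn.execute(insert, ("Jenny", "owes me $20", 1))  # exact repeat of the natural key
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

comment_count = conn.execute("SELECT COUNT(*) FROM CommentKey").fetchone()[0]
```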
From a theoretical point of view, the surrogate key CommentId is not required to make your data model or database work. However, its presence may affect how relations are constructed.
Creation of surrogate keys is a practical issue of some importance. Consider what might happen if you choose to not use a surrogate key but the full attribute set Person + Comment + UserId in its place, especially if it was required on multiple tables as a foreign or primary key:
Comment might add a lot of space overhead to your database because it is repeated in multiple tables. It is probably more than a couple of characters long.
What happens if someone chooses to edit a Comment? That change needs to be propagated to all tables where Comment is part of a key. Not a pretty sight!
Indexing long complex keys can take a lot of space and/or make for slow update performance.
The value assigned to a surrogate key never changes, no matter what you do to the values associated to the attributes that it determines. Updating the dependent attributes is now limited to the one table defining the surrogate key. This is of huge practical significance.
Now back to whether you should be creating a surrogate for Person. Does Person live on the left hand side of many, or any, FDs? If it does, its value will propagate through your database and there is a case for creating a surrogate for it. Whether Person is a text or numeric attribute is irrelevant to the choice of creating a surrogate key.
Based on what you have said, there is at best a weak argument to create a surrogate for Person. This argument is based on the suspicion that its value may become a key or part of a key at some point in the future.
Here's the deal. Whenever you create something, you want to make sure that it has room to grow. You want to try to anticipate future projects and future advancements for your program. In this scenario, you're right in saying that there is no need currently to add a persons table that just holds 1 field (not counting the ID, assuming you have an int ID field and a person name). However, in the future, you may want to have other attributes for such people, like first name, last name, email address, date added, etc.
While over-normalizing is certainly harmful, I personally would create another, larger table to hold the person with additional fields so that I can easily add new features in the future.
Whenever you're dealing with users, there should be a dedicated table. Then you can just join the tables and refer to that user's ID.
user -> id | username | password | email
comment -> id | user_id | content
SQL to join the comments to the users:
SELECT user.username, comment.content FROM user JOIN comment ON user.id = comment.user_id;
It'll make it so much easier in the future when you want to find information about that specific user. The amount of extra effort is negligible.
Concerning the "score" for each comment, that should also be a separate table as well. That way you can connect a user to a "like" or "dislike."
With this database, you might feel that it is okay, but there may be some problem in the future when you want the users to know more from the database. Suppose you want to know the number of comments made on a person with the name='abc'. In this case, you will have to go through the entire table of comments and keep counting. In place of this, you can have an attribute called 'count' for every person and increment it whenever a comment is made on that person.
As far as normalization is concerned, it is always better to have a normalized database because it reduces redundancy and makes the database intuitive to understand. If you expect your database to grow large in the future, then normalization must be present.
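For comparison, the count can also be computed on demand from the single-table layout in the question. The sketch below uses Python's built-in sqlite3 module; the table name, the index name and the comment texts are made up, and only the person values mirror the question's sample rows. An index on person is one common way to let the database avoid reading every row, whereas a stored 'count' column has to be kept in step with every insert and delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment (
    comment_id INTEGER PRIMARY KEY,
    comment    TEXT NOT NULL,
    person     TEXT NOT NULL
);
CREATE INDEX comment_person_idx ON comment (person);
-- Placeholder comment texts; person values follow the question's example.
INSERT INTO comment VALUES (1, 'first comment',  'John Smith');
INSERT INTO comment VALUES (2, 'second comment', 'Jenny');
INSERT INTO comment VALUES (3, 'third comment',  'John Smith');
INSERT INTO comment VALUES (4, 'fourth comment', 'Jennyyyyyyyyy');
""")

# Count comments for one person without storing a separate counter.
john_count = conn.execute(
    "SELECT COUNT(*) FROM comment WHERE person = ?", ("John Smith",)
).fetchone()[0]

# Or count for everyone at once.
counts = dict(
    conn.execute("SELECT person, COUNT(*) FROM comment GROUP BY person")
)
```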