ER diagram - avoiding one-to-one relationship - mysql

I've been working on an ER diagram for a university project about a transport company. The company does jobs for other companies, and each job needs three types of documents; each document has an identifier that is unique among documents of the same kind. So I modelled the document types as separate entities. When I connect them (call them Doc1, Doc2, Doc3) to one entity (call it Job), each document is made for that one job and for no other. Also, a job has only one of each of these documents, so the relationships between the documents and the job look one-to-one. However, when the professor taught us ER models, he told us that we should always avoid drawing one-to-one relationships (that there should be a way to make these documents attributes of Job). So what I want to know is: is it correct to draw the identifiers of these documents as attributes of Job, and then make them foreign keys referencing the corresponding fields in the documents' tables (in the relational model)? Or is there another, more elegant way to connect them that avoids these one-to-one relationships?
Also, if I do it this way, I guess I should make all 3 columns holding the documents' identifiers UNIQUE in the Job table, right? So that two jobs cannot have, for example, the same Doc1?
Thank you!

One-to-one relationships are to be avoided, because they signal that the entities joined by the relationship are actually one. However, in the case specified here, the relationship is not one-to-one. Instead it is "one to zero or one", also known as "one-to-one optional".
An example is the relationship between a Home and a Lot. The Home must be located on a Lot, and only one Home can be located on any given Lot, but the Lot can exist before the Home is built. If you are modelling this relationship, you would have a "one to zero or one" relationship between Lot and Home. It would be shown like this:
In your case you have three separate dependencies, so it would look like:
Physically, these relationships may be represented in two ways:
A nullable foreign key in the "one" row (Lot, in my example above),
or
A non-nullable foreign key in the "zero or one" row (Home, in my example above)
You can choose the approach that is most comfortable and efficient for you, depending on the direction in which your application usually navigates.
You may decide to have the database enforce the uniqueness constraint (the fact that only one Home can be on a Lot). In some databases, a null value participates in uniqueness constraints (in other words, a unique index can only have one Null entry). In such a database, you would be constrained to the second approach. In MySQL, this is not the case; a uniqueness constraint ignores null values, so you can choose either approach. The second approach is more common.
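To make the uniqueness point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example is runnable; like MySQL, SQLite allows many NULLs in a UNIQUE column). The table and column names job, doc1 and doc1_id are hypothetical, taken from the asker's Job/Doc1 naming.

```python
import sqlite3

# SQLite stands in for MySQL here; both allow many NULLs in a UNIQUE column.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE doc1 (doc1_id INTEGER PRIMARY KEY);
CREATE TABLE job (
    job_id  INTEGER PRIMARY KEY,
    -- nullable foreign key; UNIQUE means at most one job per document
    doc1_id INTEGER UNIQUE REFERENCES doc1(doc1_id)
);
INSERT INTO doc1 (doc1_id) VALUES (100);
INSERT INTO job (job_id, doc1_id) VALUES (1, NULL);
INSERT INTO job (job_id, doc1_id) VALUES (2, NULL);  -- a second NULL is accepted
INSERT INTO job (job_id, doc1_id) VALUES (3, 100);
""")

def try_insert(job_id, doc1_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO job (job_id, doc1_id) VALUES (?, ?)", (job_id, doc1_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False

# A second job pointing at document 100 violates the UNIQUE constraint.
duplicate_doc_accepted = try_insert(4, 100)
null_count = conn.execute(
    "SELECT COUNT(*) FROM job WHERE doc1_id IS NULL"
).fetchone()[0]
```

In a database where a unique index admits only one NULL, the second insert with NULL would fail, which is why the answer steers you to the second approach there.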

Related

Multiple foreign key columns vs multiple join tables

This is yet another database normalization question/discussion, but I'm hoping to get some additional perspective on the trade-offs, advantages and disadvantages of different scenarios for multiple foreign key columns vs multiple join/intersection tables, as I can't seem to find any practical information or advice on how MySQL would optimize or fail with the different approaches.
I'm asking for general guidance on how others approach objects that have multiple 1:N relationships and foreign keys, where a majority of those foreign keys will always be null.
As a basic example, let's say I have a Project Management app with an uploads table for storing uploaded file information. For "scale", there are 20 million current uploads, with 1000 added daily.
An upload can have a direct relation to a couple of different objects as its "parent" or "owner": directly to a Project, directly to a Todo, or directly to a Comment. Each upload would only ever have a single relationship at a time, never multiple.
Potential options for structuring I see
Option 1: Single table multiple foreign key columns
uploads
upload_id, filepath, project_id, todo_id, comment_id
foreign keys for project_id, todo_id, comment_id.
Potential Problem: Large amount of null values in foreign keys. Potentially slow writes/locks in high volumes due to fk constraints and the additional index sizes.
Option 2: Multiple Intersection/Join tables
uploads
upload_id, filepath
project_uploads
project_id, upload_id
todo_uploads
todo_id, upload_id
comment_uploads
comment_id, upload_id
foreign keys on all columns for *_uploads tables
Potential Problem: People will confuse for N:N instead of 1:N relationship. "Relative", but more difficult selects to produce in application layer, especially when selecting uploads for projects as you would need to join all tables to get the entire list of project Ids for the uploads since todos and comments both would also belong to a parent.
Option 3: Single Relation/Join table with a type
uploads
upload_id, filepath
objects_uploads
upload_id, object_id, type
foreign key on upload_id, standard indexes on object_id and type.
Potential Problem: more confusing schema, not truly "relational" or normalized
I'd also like to throw out the potential of using JSON fields on individual objects and just always enforcing project_id on the uploads. I have very limited experience with JSON field types or their pitfalls. I'm assuming selections to get uploads specifically parented/uploaded to a todo or comment would be far more difficult as you would need to select the ids out of the json.
Are there any other approaches or considerations I'm overlooking? Are there specific considerations to follow based on different workloads (higher write volumes, high read, etc.)? Thanks for any information, insights or resources.
Edit
To clarify, I understand that the above outlines can represent differences in schemas/relationships of the objects. I'm really only thinking about write and select performance, and the considerations or tradeoffs to make around indexes/constraints and joins. Specifically for this question I'm less concerned about referential integrity or 100% data integrity.
I've modified some language in my original example above. I'm looking for ideal configurations for objects that can be related to many different objects, but never at the same time, leaving most foreign key columns null. Here is a similar question from 3.5 years ago...
https://softwareengineering.stackexchange.com/questions/335284/disadvantages-of-using-a-nullable-foreign-key-instead-of-creating-an-intersectio
Basically trying to find some general advice when to consider or optimize in the different ways, gauge any real impact of large amount of nulls in Foreign keys and potential reasons for when to prefer different approaches.
Option 1 models three many-to-one relationships. That is, a given upload can have at most one reference to a project, at most one reference to a todo, and at most one reference to a comment. This is a simple way to model these as many-to-one relationships. Don't shy away from using NULLs; in InnoDB they take almost no storage space (a NULL column costs a bit in the row header rather than the column's full width).
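A rough sketch of Option 1, using Python's sqlite3 module as a runnable stand-in for MySQL (an assumption; the DDL is close to what MySQL accepts but is not tuned for it). The CHECK constraint that forces exactly one parent is my own addition to reflect "only ever a single relationship at a time"; it is not part of the question, and in MySQL it is only enforced from 8.0.16 onward. The index name idx_uploads_project and the sample file paths are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER PRIMARY KEY);
CREATE TABLE todos    (todo_id    INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY);
CREATE TABLE uploads (
    upload_id  INTEGER PRIMARY KEY,
    filepath   TEXT NOT NULL,
    project_id INTEGER REFERENCES projects(project_id),
    todo_id    INTEGER REFERENCES todos(todo_id),
    comment_id INTEGER REFERENCES comments(comment_id),
    -- Assumption, not in the original question: exactly one parent per upload.
    CHECK ((project_id IS NOT NULL) + (todo_id IS NOT NULL)
           + (comment_id IS NOT NULL) = 1)
);
CREATE INDEX idx_uploads_project ON uploads(project_id);
INSERT INTO projects VALUES (1);
INSERT INTO todos VALUES (10);
INSERT INTO uploads VALUES (1, '/files/a.png', 1, NULL, NULL);
INSERT INTO uploads VALUES (2, '/files/b.png', NULL, 10, NULL);
""")

def rejected(sql):
    """Return True if the statement is refused by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Two parents at once violates the CHECK constraint.
two_parents_rejected = rejected(
    "INSERT INTO uploads VALUES (3, '/files/c.png', 1, 10, NULL)"
)
# Project 99 does not exist, so the foreign key refuses the row.
missing_parent_rejected = rejected(
    "INSERT INTO uploads VALUES (4, '/files/d.png', 99, NULL, NULL)"
)
# Listing a project's uploads needs no join at all.
project_uploads = conn.execute(
    "SELECT filepath FROM uploads WHERE project_id = 1"
).fetchall()
```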
Option 2 models three many-to-many relationships. A given upload may be associated with multiple projects, multiple todos, and multiple comments. I think this is what Akina was commenting about above. If your application needs these to be many-to-many relationships, then you need these three intersection tables to model that data. If you don't need these to be many-to-many relationships, then don't create these tables.
Option 3 is not a relational data model at all. It conflicts with several normal forms.

Polymorphic relationships vs separate tables per type

I am working on a database which has some types (e.g. User, Appointment, Task etc.) which can have zero or more Notes associated with each type.
The possible solutions I have come across for implementing these relationships are:
Polymorphic relationship
Separate table per type
Polymorphic Relationship
Suggested by many as being the easiest solution to implement and seemingly the most common implementation for frameworks that follow the Active Record pattern, I would add a table whose data is morphable:
My notable_type would allow me to distinguish between the type (User, Appointment, Task) the Note relates to, whilst the notable_id would allow me to obtain the individual type record in the related type table.
PROS:
Easy to scale, more models can be easily associated with the polymorphic class
Limits table bloat
Results in one class that can be used by many other classes (DRY)
CONS
More types can make querying more difficult and expensive as the data grows
Cannot have a foreign key
Lack of data consistency
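To illustrate the last two cons, here is a small sketch using Python's sqlite3 module (an assumption, chosen only so the example runs as written). Only notable_type and notable_id come from the description above; the remaining column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enabled, but notes has no FK to enforce
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE notes (
    id           INTEGER PRIMARY KEY,
    notable_type TEXT    NOT NULL,  -- 'User', 'Task', ...
    notable_id   INTEGER NOT NULL,  -- cannot be declared as a foreign key
    body         TEXT    NOT NULL
);
CREATE INDEX idx_notes_notable ON notes(notable_type, notable_id);
INSERT INTO users VALUES (1, 'Ann');
INSERT INTO tasks VALUES (1, 'Ship release');
INSERT INTO notes VALUES (1, 'User', 1, 'Prefers email');
INSERT INTO notes VALUES (2, 'Task', 1, 'Blocked on review');
""")

# Nothing stops the parent row from disappearing under its notes.
conn.execute("DELETE FROM users WHERE id = 1")

# Notes of type 'User' whose user no longer exists.
orphans = conn.execute("""
    SELECT n.id
    FROM notes n
    LEFT JOIN users u ON u.id = n.notable_id
    WHERE n.notable_type = 'User' AND u.id IS NULL
""").fetchall()
```

The delete succeeds and leaves note 1 pointing at nothing; keeping such rows consistent becomes the application's job.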
Separate Table per Type
Alternatively I could create a table for each type which is responsible for the Notes associated with that type only. The type_id foreign key would allow me to quickly obtain the individual type record.
Deemed by many online as a code smell, many articles advocate avoiding the polymorphic relationship in favour of an alternative (here and here for example).
PROS:
Allows us to use foreign keys effectively
Efficient data querying
Maintains data consistency
CONS:
Increases table bloat as each type requires a separate table
Results in multiple classes, each representing the separate type_notes table
Thoughts
The polymorphic relationship is certainly the simpler of the two options to implement, but the lack of foreign key constraints and therefore potential for consistency issues feels wrong.
A table per notes relationship (user_notes, task_notes etc.) with foreign keys seems the correct way (in keeping with design patterns) but could result in a lot of tables (addition of other types that can have notes or addition of types similar to notes [e.g. events]).
It feels like my choice is either simplified table structure but forgo foreign keys and increased query overhead, or increase the number of tables with the same structure but simplify queries and allow for foreign keys.
Given my scenario which of the above would be more appropriate, or is there an alternative I should consider?
What is "table bloat"? Are you concerned about having too many tables? Many real-world databases I've worked on have between 100 and 200 tables, because that's what it takes.
If you're concerned with adding multiple tables, then why do you have separate tables for User, Appointment, and Task? If you had a multi-valued attribute for User, for example for multiple phone numbers per user, would you create a separate table for phones, or would you try to combine them all into the user table somehow? Or have a polymorphic "things that belong to other things" table for user phones, appointment invitees, and task milestones?
Answer: No, you'd create a Phone table, and use it to reference only the User table. If Appointments have invitees, that gets its own table (probably a many-to-many between appointments and users). If tasks have milestones, that gets its own table too.
The correct thing to do is to model your database tables like you would model object types in your application. You might like to read a book like SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date to learn more about how tables are analogous to types.
You already know instinctively that the fact that you can't create a foreign key is a red flag. A foreign key must reference exactly one parent table. This should be a clue that it's not valid relational database design to make a polymorphic foreign key. Once you start thinking of tables and their attributes as concrete types (like described in SQL and Relational Theory), this will become obvious.
If you must create one notes table, you could make it reference one table called "Notable" which is like a superclass of User, Appointment, and Task. Then each of those three tables would also reference a primary key of Notable. This mimics the object-oriented structure of polymorphism, where you can have a class Note have a reference to an object by its superclass type.
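A minimal sketch of that "Notable" supertype layout, again using Python's sqlite3 module as a stand-in for MySQL (an assumption). Only the table name Notable and the User/Task types come from the text; every column name and sample row is hypothetical, and Appointment is left out for brevity since it follows the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Supertype: one row per thing that can carry notes.
CREATE TABLE notable (notable_id INTEGER PRIMARY KEY);
-- Each subtype's primary key is also a foreign key to the supertype.
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    name    TEXT NOT NULL
);
CREATE TABLE tasks (
    task_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    title   TEXT NOT NULL
);
-- Notes reference exactly one parent table, so a real foreign key works.
CREATE TABLE notes (
    note_id    INTEGER PRIMARY KEY,
    notable_id INTEGER NOT NULL REFERENCES notable(notable_id),
    body       TEXT NOT NULL
);
INSERT INTO notable VALUES (1), (2);
INSERT INTO users VALUES (1, 'Ann');
INSERT INTO tasks VALUES (2, 'Ship release');
INSERT INTO notes VALUES (1, 1, 'Prefers email'), (2, 2, 'Blocked on review');
""")

try:
    conn.execute("INSERT INTO notes VALUES (3, 99, 'No such parent')")
    bad_note_rejected = False
except sqlite3.IntegrityError:
    bad_note_rejected = True

user_notes = conn.execute("""
    SELECT u.name, n.body
    FROM users u
    JOIN notes n ON n.notable_id = u.user_id
""").fetchall()
```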
But IMHO, that's more complex than it needs to be. I would just create separate tables for UserNotes, AppointmentNotes, and TaskNotes. I'm not troubled by having three more tables, and it makes your code more clear and maintainable.
I think you should think about these two things, before you can make a decision.
Performance. Do you expect a lot of reads, or a lot of writes? Test which design is better for your workload.
Growth of your model. Can it easily be expanded?

Naming Conventions for Multivariable Dependency Tables MySQL

Conventions for normalized databases say that the best practice for dealing with multivariable dependencies is spinning them off into their own table with two columns. One column is the primary key of the original table (for example, customer name, of which there is one), while the other is the attribute which has multiple values (for example, email or phone; the customer could have several of these). Together these two columns constitute the primary key of the spun-off table.
However, when building normalized databases, I often find naming these spun off tables troublesome. It's hard to come up with a meaningful names for these tables. Is there a standard way of identifying these tables as multivariable dependency tables that are meaningless without the presence of the other table? Some examples I can think of (referencing the example above) are 'customer_phones' or 'customer_has_phones'. I don't think just 'phones' would be good, because that doesn't identify this table as related to and heavily dependent on the customers table.
In real life you end up running into a lot of combinations that vary a lot from each other.
Try to be as clear as possible in case someone else ends up inheriting your design. I personally like to keep names short in the parent tables so that the derived names don't end up being super long whenever the relationship grows or spawns new children.
For instance, if I have "Customer", "Subscriptions", "Product" tables I would end up naming their links like "Customer_Subscriptions" or "Subscriptions_Products" and such.
Most of the time it just gets down to what works better for you in terms of maintainability.
The convention we use is the name of the entity table, followed by the name of the attribute.
In your example, if the entity table is customer, the name of the table for the repeating (multi-valued) attribute would be customer_phone or customer_phone_number. (We almost always name tables in the singular, based on the idea that we are naming what ONE tuple (row) represents; e.g. a row in that table represents one occurrence of a phone number for a customer.)
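For reference, a sketch of such a spun-off table with its two-column primary key, using Python's sqlite3 module as a stand-in for MySQL (an assumption). It uses a surrogate customer_id instead of the customer name, which is my choice for the example; the sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- entity name + attribute name, singular
CREATE TABLE customer_phone (
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    phone       TEXT    NOT NULL,
    PRIMARY KEY (customer_id, phone)  -- both columns together form the key
);
INSERT INTO customer VALUES (1, 'Ann');
INSERT INTO customer_phone VALUES (1, '555-0100');
INSERT INTO customer_phone VALUES (1, '555-0101');
""")

try:
    # Same customer, same phone: refused by the composite primary key.
    conn.execute("INSERT INTO customer_phone VALUES (1, '555-0100')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

phones = conn.execute(
    "SELECT phone FROM customer_phone WHERE customer_id = 1 ORDER BY phone"
).fetchall()
```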

Foreign key column optionally contains NULL or ID. Is there a better design?

I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (e.g. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table, else it holds "null".
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?
I would say that if there is a relatively small amount of NULL values, then using NULLs would be okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and placing a column referencing the "Answers" into a new table alongside another field referencing the "People" table. In other words, create an intermediate table between the Answers table and the People table containing all technical directors as shown below.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one Technical Director per answer then simply make the column referencing the answers table "Unique" to create a One-to-One relationship. If you need more than one technical director, create a One-to-Many relationship as shown. Another advantage to this design is that it simplifies the query if you ever want to extract all the technical directors. I generally use a simple rule of thumb when deciding whether to use NULL values or not. If I see the table contains lots of NULLS, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, the design above might require an Inner or Outer Join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity and your business rules when making these decisions.
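A sketch of that intermediate table, with Python's sqlite3 module standing in for MySQL (an assumption). The names answers, people and answer_tech_director, and the sample rows, are hypothetical; the UNIQUE constraint on answer_id gives the "one technical director per answer" variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE people  (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE answers (answer_id INTEGER PRIMARY KEY, company TEXT NOT NULL);
-- Rows exist only where a technical director exists, so no NULLs are stored.
CREATE TABLE answer_tech_director (
    answer_id INTEGER NOT NULL UNIQUE REFERENCES answers(answer_id),
    person_id INTEGER NOT NULL REFERENCES people(person_id)
);
INSERT INTO people  VALUES (1, 'Dana');
INSERT INTO answers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO answer_tech_director VALUES (1, 1);
""")

# Extracting all technical directors is a plain inner join.
directors = conn.execute("""
    SELECT a.company, p.name
    FROM answer_tech_director d
    JOIN answers a ON a.answer_id = d.answer_id
    JOIN people  p ON p.person_id = d.person_id
""").fetchall()

# Viewing every answer, with or without a director, needs outer joins.
all_answers = conn.execute("""
    SELECT a.company, p.name
    FROM answers a
    LEFT JOIN answer_tech_director d ON d.answer_id = a.answer_id
    LEFT JOIN people p ON p.person_id = d.person_id
    ORDER BY a.answer_id
""").fetchall()
```

Dropping the UNIQUE constraint turns this into the one-to-many variant.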
Logically, if there is no director, there should be NULL.
In the business logic, you would have a reference to a Director object there; if there is no director, that should also be null instead of a reference.
Using a boolean in fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also there are joins that are optimized to do that efficiently in one query, so no additional lookups are necessary.
You could argue that it depends on how many entries have a director: you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed a foreign key that allows NULL, as you proposed in the first option.
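For comparison, a sketch of the nullable foreign key fetched with a single outer join, using Python's sqlite3 module as a stand-in for MySQL (an assumption). techDir is the column name from the question; the other names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE answers (
    answer_id INTEGER PRIMARY KEY,
    company   TEXT NOT NULL,
    techDir   INTEGER REFERENCES people(person_id)  -- NULL means no director
);
INSERT INTO people  VALUES (1, 'Dana');
INSERT INTO answers VALUES (1, 'Acme', 1), (2, 'Globex', NULL);
""")

# One query returns every answer; the director's name is NULL where absent.
rows = conn.execute("""
    SELECT a.company, p.name
    FROM answers a
    LEFT JOIN people p ON p.person_id = a.techDir
    ORDER BY a.answer_id
""").fetchall()

# "Is there a technical director?" needs no separate boolean column.
without_director = conn.execute(
    "SELECT company FROM answers WHERE techDir IS NULL"
).fetchall()
```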
I think the null for that column is ok. As far as I remember from my DB class at uni (long time ago), null is an excellent choice to represent "I don't know" or "it doesn't have".
I think the second design has the following flaw: you didn't mention how to look up the techDir for a specific question; you said that you just tag the person. Another problem might be that if you add another role in the future, the schema won't support it.
NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys: one back to the original table and one forward to the related table. In cases where there is no relationship, just omit the entire row.
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.

Performance gain or loss using an association table even when there is just a one-to-many relationship

I am going to build a PHP web application and have already decided to use Codeigniter + DataMapper (OverZealous edition).
I have discovered that DataMapper (OverZealous edition) requires the use of an extra association table even when there is actually just a one-to-many relationship.
For example, a country can have many players but a player can only belong to one country. Typical database design would be like this:
[countries] country_id(pk), country_name
[players] player_id(pk), player_name, country_id(fk)
However, in DataMapper, it requires the design to be like this:
[countries] country_id(pk), country_name
[players] player_id(pk), player_name
[asso_countries_players] countries_players_id(pk), country_id(fk), player_id(fk)
It's good for maintenance because if later we change our mind that a player can belong to more than one country, it can be done with very little effort.
But what I would like to know is, for such database design, in general, is there any performance gain or loss when compared to the typical design?
"The fastest way to do anything is not to do it at all." -- Cary Millsap, Optimizing Oracle Performance.
"A designer knows he has achieved true elegance not when there is nothing left to add, but when there is nothing left to take away." -- Antoine de Saint-Exupéry
The simpler implementation has two tables and three indexes, two unique.
The more complicated implementation has three tables and five indexes, four unique. The index on asso_countries_players.player_name (which should be a surrogate ID -- what happens if a player's name changes, like if they get married or legally change it, as Chad Ochocinco (nee Johnson) did?) must also be unique to enforce the 0..1 nature of the relationship between players and countries.
If the associative entity isn't required by the data model, then eliminate it. It's generally pretty trivial to transform a 0..1 relationship or 1..n relationship to an n..n relationship:
Add associative entity (and I'd question the need for a surrogate key there unless the relationship itself had attributes, like a start or end date)
Populate associative entity with existing data
Reimplement the foreign key constraints
Remove superseded foreign key column in child table.
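The first steps of that transformation can be sketched as follows, using Python's sqlite3 module as a stand-in for MySQL (an assumption). The table names come from the question; the composite primary key (no surrogate key) follows the remark above, and the sample rows are made up. The last step is left as a comment because the exact statement is engine-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Starting point: the simple one-to-many design.
CREATE TABLE countries (
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT NOT NULL
);
CREATE TABLE players (
    player_id   INTEGER PRIMARY KEY,
    player_name TEXT NOT NULL,
    country_id  INTEGER REFERENCES countries(country_id)
);
INSERT INTO countries VALUES (1, 'Spain'), (2, 'Brazil');
INSERT INTO players   VALUES (1, 'Player A', 1), (2, 'Player B', 2);

-- Add the associative entity, with its foreign key constraints.
CREATE TABLE asso_countries_players (
    country_id INTEGER NOT NULL REFERENCES countries(country_id),
    player_id  INTEGER NOT NULL REFERENCES players(player_id),
    PRIMARY KEY (country_id, player_id)
);

-- Populate it from the existing data.
INSERT INTO asso_countries_players (country_id, player_id)
SELECT country_id, player_id FROM players WHERE country_id IS NOT NULL;

-- Final step, not run here: drop the foreign key constraint on
-- players.country_id and then the column itself.
""")

# A player can now belong to a second country.
conn.execute("INSERT INTO asso_countries_players VALUES (2, 1)")

countries_of_player_1 = conn.execute("""
    SELECT c.country_name
    FROM asso_countries_players a
    JOIN countries c ON c.country_id = a.country_id
    WHERE a.player_id = 1
    ORDER BY c.country_id
""").fetchall()
```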
Selecting and searching data will mean more joins: you'll have to work on 3 tables instead of 2.
Inserting data will mean more insert queries: you'll have to insert into 3 tables instead of 2.
So I'm guessing this means a bit more work, which, in turn, might hurt performance a bit.
Because this is one-to-many I'd personally not use an association table, it's totally unnecessary.
The performance hit from this decision won't be too great. But think about the context of your data too, don't just do it because some program tells you - understand your data.