Multiple foreign key columns vs multiple join tables - mysql

This is yet another database normalization question/discussion, but I'm hoping to get some additional perspective on the trade-offs, advantages and disadvantages of different scenarios for multiple foreign key columns vs multiple join/intersection tables, as I can't seem to find any practical information or advice on how MySQL would optimize or fail with the different approaches.
I'm asking for general guidance on how others approach objects that have multiple 1:N relationships and foreign keys, where the majority of those foreign keys will always be null.
As a basic example, let's say I have a Project Management app with an uploads table for storing uploaded file information. For "scale", there are 20 million current uploads, with 1000 added daily.
Uploads can have a direct relation to a couple of different objects as their "parent" or "owner": directly to a Project, directly to a Todo, or directly to a Comment. Each upload would only ever have a single relationship at a time, never multiple.
Potential options for structuring I see
Option 1: Single table multiple foreign key columns
uploads
upload_id, filepath, project_id, todo_id, comment_id
foreign keys for project_id, todo_id, comment_id.
Potential Problem: Large amount of null values in foreign keys. Potentially slow writes/locks in high volumes due to fk constraints and the additional index sizes.
Option 2: Multiple Intersection/Join tables
uploads
upload_id, filepath
project_uploads
project_id, upload_id
todo_uploads
todo_id, upload_id
comment_uploads
comment_id, upload_id
foreign keys on all columns for *_uploads tables
Potential Problem: People will confuse this for an N:N instead of a 1:N relationship. Selects are also relatively more difficult to produce in the application layer, especially when selecting uploads for projects, as you would need to join all the tables to get the entire list of project IDs for the uploads, since todos and comments both also belong to a parent.
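As a rough sketch of Option 2, the N:N confusion can be reduced by making upload_id the primary key of each join table, so an upload can appear in it at most once. The example uses SQLite through Python's sqlite3 module as a stand-in for MySQL; the todos.project_id column is an assumption (the question only says todos belong to a parent), and comment_uploads is left out for brevity.

```python
import sqlite3

# SQLite stands in for MySQL; names follow Option 2 of the question.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER PRIMARY KEY);
-- Assumption: each todo points at its parent project.
CREATE TABLE todos (
    todo_id    INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(project_id));
CREATE TABLE uploads (upload_id INTEGER PRIMARY KEY, filepath TEXT NOT NULL);
-- upload_id as primary key: an upload appears at most once per join table (1:N).
CREATE TABLE project_uploads (
    upload_id  INTEGER PRIMARY KEY REFERENCES uploads(upload_id),
    project_id INTEGER NOT NULL REFERENCES projects(project_id));
CREATE TABLE todo_uploads (
    upload_id INTEGER PRIMARY KEY REFERENCES uploads(upload_id),
    todo_id   INTEGER NOT NULL REFERENCES todos(todo_id));
INSERT INTO projects VALUES (1);
INSERT INTO todos VALUES (10, 1);
INSERT INTO uploads VALUES (100, 'a.png'), (101, 'b.png');
INSERT INTO project_uploads VALUES (100, 1);
INSERT INTO todo_uploads VALUES (101, 10);
""")

# All uploads of project 1, attached directly or through one of its todos.
rows = conn.execute("""
SELECT u.upload_id, u.filepath
FROM uploads u JOIN project_uploads pu ON pu.upload_id = u.upload_id
WHERE pu.project_id = ?
UNION ALL
SELECT u.upload_id, u.filepath
FROM uploads u
JOIN todo_uploads tu ON tu.upload_id = u.upload_id
JOIN todos t ON t.todo_id = tu.todo_id
WHERE t.project_id = ?
ORDER BY 1
""", (1, 1)).fetchall()

# A second row for the same upload in the same join table is rejected.
try:
    conn.execute("INSERT INTO project_uploads VALUES (100, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Note that nothing here stops the same upload from appearing in both project_uploads and todo_uploads; that rule would still have to live in the application.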
Option 3: Single Relation/Join table with a type
uploads
upload_id, filepath
objects_uploads
upload_id, object_id, type
foreign key on upload_id, standard indexes on object_id and type.
Potential Problem: more confusing schema, not truly "relational" or normalized
I'd also like to throw out the potential of using JSON fields on individual objects and just always enforcing project_id on the uploads. I have very limited experience with JSON field types or their pitfalls. I'm assuming selections to get uploads specifically parented/uploaded to a todo or comment would be far more difficult as you would need to select the ids out of the json.
Are there any other approaches or considerations I'm overlooking? Are there specific considerations to follow based on different workloads, such as higher write volumes or high read volumes? Thanks for any information, insights or resources.
Edit
To clarify, I understand that the above outlines can represent differences in the schemas/relationships of the objects. I'm really only thinking about write and select performance, and about the considerations or trade-offs to make around indexes/constraints and joins. Specifically for this question, I'm less concerned about referential integrity or 100% data integrity.
I've modified some language in my original example above. I'm looking for ideal configurations for objects that can be related to many different objects, but never at the same time, leaving most foreign key columns null. Here is a similar question from 3.5 years ago...
https://softwareengineering.stackexchange.com/questions/335284/disadvantages-of-using-a-nullable-foreign-key-instead-of-creating-an-intersectio
Basically, I'm trying to find general advice on when to consider each approach, to gauge any real impact of a large number of nulls in foreign keys, and to understand the reasons for preferring one approach over another.

Option 1 models three many-to-one relationships. That is, a given upload can have at most one reference to project, at most one reference to todo, and at most one reference to comment. This would be a simple way to model these as many-to-one relationships. Don't shy away from using NULLs; they take essentially no storage space (in InnoDB's default row formats a NULL is recorded as a bit in the row header rather than as column data).
Option 2 models three many-to-many relationships. A given upload may be associated with multiple projects, multiple todos, and multiple comments. I think this is what Akina was commenting about above. If your application needs these to be many-to-many relationships, then you need these three intersection tables to model that data. If you don't need these to be many-to-many relationships, then don't create these tables.
Option 3 is not a relational data model at all. It conflicts with several normal forms.
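A minimal sketch of Option 1 follows, using SQLite through Python's sqlite3 module as a stand-in for MySQL. The CHECK constraint is an assumption added for illustration: it encodes the question's rule that each upload has exactly one parent. MySQL parses CHECK constraints in older versions but, as far as I know, only enforces them from 8.0.16 onwards.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names follow Option 1.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER PRIMARY KEY);
CREATE TABLE todos    (todo_id    INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY);
CREATE TABLE uploads (
    upload_id  INTEGER PRIMARY KEY,
    filepath   TEXT NOT NULL,
    project_id INTEGER REFERENCES projects(project_id),
    todo_id    INTEGER REFERENCES todos(todo_id),
    comment_id INTEGER REFERENCES comments(comment_id),
    -- Assumption: exactly one parent per upload, as the question describes.
    CHECK ((project_id IS NOT NULL) + (todo_id IS NOT NULL)
           + (comment_id IS NOT NULL) = 1)
);
INSERT INTO projects VALUES (1);
INSERT INTO todos VALUES (10);
INSERT INTO uploads (filepath, project_id) VALUES ('a.png', 1);
INSERT INTO uploads (filepath, todo_id) VALUES ('b.png', 10);
""")

def try_insert(sql):
    """Return True if the statement is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Two parents at once, or no parent at all, both violate the CHECK.
two_parents = try_insert(
    "INSERT INTO uploads (filepath, project_id, todo_id) VALUES ('c.png', 1, 10)")
no_parent = try_insert("INSERT INTO uploads (filepath) VALUES ('d.png')")

# Uploads attached directly to project 1 need no join at all.
project_uploads = conn.execute(
    "SELECT filepath FROM uploads WHERE project_id = 1").fetchall()
```

Each nullable foreign key column would normally carry its own index, which is the write-cost trade-off the question raises; measuring that on the real 20-million-row table is the only reliable way to size it.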

Related

Polymorphic relationships vs separate tables per type

I am working on a database which has some types (e.g. User, Appointment, Task etc.) which can have zero or more Notes associated with each type.
The possible solutions I have come across for implementing these relationships are:
Polymorphic relationship
Separate table per type
Polymorphic Relationship
Suggested by many as being the easiest solution to implement and seemingly the most common implementation for frameworks that follow the Active Record pattern, I would add a table whose data is morphable:
My notable_type would allow me to distinguish between the type (User, Appointment, Task) the Note relates to, whilst the notable_id would allow me to obtain the individual type record in the related type table.
PROS:
Easy to scale, more models can be easily associated with the polymorphic class
Limits table bloat
Results in one class that can be used by many other classes (DRY)
CONS
More types can make querying more difficult and expensive as the data grows
Cannot have a foreign key
Lack of data consistency
Separate Table per Type
Alternatively I could create a table for each type which is responsible for the Notes associated with that type only. The type_id foreign key would allow me to quickly obtain the individual type record.
Deemed by many online as a code smell, many articles advocate avoiding the polymorphic relationship in favour of an alternative (here and here for example).
PROS:
Allows us to use foreign keys effectively
Efficient data querying
Maintains data consistency
CONS:
Increases table bloat as each type requires a separate table
Results in multiple classes, each representing the separate type_notes table
Thoughts
The polymorphic relationship is certainly the simpler of the two options to implement, but the lack of foreign key constraints and therefore potential for consistency issues feels wrong.
A table per notes relationship (user_notes, task_notes etc.) with foreign keys seems the correct way (in keeping with design patterns) but could result in a lot of tables (addition of other types that can have notes or addition of types similar to notes [e.g. events]).
It feels like my choice is either simplified table structure but forgo foreign keys and increased query overhead, or increase the number of tables with the same structure but simplify queries and allow for foreign keys.
Given my scenario which of the above would be more appropriate, or is there an alternative I should consider?
What is "table bloat"? Are you concerned about having too many tables? Many real-world databases I've worked on have between 100 and 200 tables, because that's what it takes.
If you're concerned with adding multiple tables, then why do you have separate tables for User, Appointment, and Task? If you had a multi-valued attribute for User, for example for multiple phone numbers per user, would you create a separate table for phones, or would you try to combine them all into the user table somehow? Or have a polymorphic "things that belong to other things" table for user phones, appointment invitees, and task milestones?
Answer: No, you'd create a Phone table, and use it to reference only the User table. If Appointments have invitees, that gets its own table (probably a many-to-many between appointments and users). If tasks have milestones, that gets its own table too.
The correct thing to do is to model your database tables like you would model object types in your application. You might like to read a book like SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date to learn more about how tables are analogous to types.
You already know instinctively that the fact that you can't create a foreign key is a red flag. A foreign key must reference exactly one parent table. This should be a clue that it's not valid relational database design to make a polymorphic foreign key. Once you start thinking of tables and their attributes as concrete types (like described in SQL and Relational Theory), this will become obvious.
If you must create one notes table, you could make it reference one table called "Notable" which is like a superclass of User, Appointment, and Task. Then each of those three tables would also reference a primary key of Notable. This mimics the object-oriented structure of polymorphism, where you can have a class Note have a reference to an object by its superclass type.
But IMHO, that's more complex than it needs to be. I would just create separate tables for UserNotes, AppointmentNotes, and TaskNotes. I'm not troubled by having three more tables, and it makes your code more clear and maintainable.
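For comparison, the "Notable" supertype arrangement described above might look like the following sketch. It uses SQLite through Python's sqlite3 module as a stand-in; the table and column names (notable, notable_id, create_user) are hypothetical and only the User and Task subtypes are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Supertype: one row per thing that can carry notes.
CREATE TABLE notable (notable_id INTEGER PRIMARY KEY);
-- Each subtype shares its primary key with the supertype row.
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    name    TEXT NOT NULL);
CREATE TABLE tasks (
    task_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    title   TEXT NOT NULL);
-- Notes reference exactly one parent table, so a real foreign key is possible.
CREATE TABLE notes (
    note_id    INTEGER PRIMARY KEY,
    notable_id INTEGER NOT NULL REFERENCES notable(notable_id),
    body       TEXT NOT NULL);
""")

def create_user(name):
    """Allocate the supertype row first, then the user row that shares its key."""
    notable_id = conn.execute("INSERT INTO notable DEFAULT VALUES").lastrowid
    conn.execute("INSERT INTO users VALUES (?, ?)", (notable_id, name))
    return notable_id

alice = create_user("Alice")
conn.execute("INSERT INTO notes (notable_id, body) VALUES (?, ?)",
             (alice, "Prefers email"))

# A note pointing at a non-existent parent is rejected by the foreign key.
try:
    conn.execute("INSERT INTO notes (notable_id, body) VALUES (999, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

user_notes = conn.execute("""
SELECT u.name, n.body
FROM notes n JOIN users u ON u.user_id = n.notable_id
""").fetchall()
```

The extra insert into notable on every create is the added complexity the answer refers to; separate UserNotes, AppointmentNotes and TaskNotes tables avoid it.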
I think you should think about these two things, before you can make a decision.
Performance: does the workload have a lot of reads or a lot of writes? Test which design is better for it.
Growth of your model: can it easily be expanded?

ER diagram - avoiding one-to-one relationship

I've been working on an ER diagram for a university project about a transport company. The company does particular jobs for other companies, and each job needs three types of documents; each document has an identifier that is unique among documents of the same kind. So I modelled these document types as separate entities. Now, when I want to connect them (call them Doc1, Doc2, Doc3) to one entity (call it Job), each document is made for that one job and no other. Also, a job has only one of each of these documents, so the relationships between the documents and the job look one-to-one. However, when the professor was teaching us ER models, he told us that we should always avoid drawing one-to-one relationships (that there should be a way to make these documents attributes of the job). So what I want to know is: is it correct to draw the identifiers of these documents as attributes of Job, and then make them foreign keys referencing the corresponding fields in the documents' tables (in the relational model)? Or is there another, more elegant way to connect them that avoids these one-to-one relationships?
Also, if I do it this way, I guess I should make all 3 columns representing the documents' identifiers UNIQUE in the Job table, right? That way I avoid two jobs having, for example, the same Doc1.
Thank you!
One-to-one relationships are to be avoided, because they signal that the entities joined by the relationship are actually one. However, in the case specified here, the relationship is not one-to-one. Instead it is "one to zero or one", also known as "one-to-one optional".
An example is the relationship between a Home and a Lot. The Home must be located on a Lot, and only one Home can be located on any given Lot, but the Lot can exist before the Home is built. If you are modelling this relationship, you would have a "one to zero or one" relationship between Lot and Home. It would be shown like this:
In your case you have three separate dependencies, so it would look like:
Physically, these relationships may be represented in two ways:
A nullable foreign key in the "one" row (Lot, in my example above),
or
A non-nullable foreign key in the "zero or one" row (Home, in my example above)
You can choose the approach that is most comfortable and efficient for you, depending on the direction in which your application usually navigates.
You may decide to have the database enforce the uniqueness constraint (the fact that only one Home can be on a Lot). In some databases, a null value participates in uniqueness constraints (in other words, a unique index can only have one Null entry). In such a database, you would be constrained to the second approach. In MySQL, this is not the case; a uniqueness constraint ignores null values, so you can choose either approach. The second approach is more common.
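Both physical representations can be sketched with the Lot/Home example. SQLite (through Python's sqlite3 module) is used as a stand-in because, like MySQL, it lets a unique index hold several NULLs. The home2/lot2 tables are hypothetical names used only to show the first approach next to the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Second approach: non-nullable, unique foreign key in the "zero or one" row.
CREATE TABLE lot (lot_id INTEGER PRIMARY KEY);
CREATE TABLE home (
    home_id INTEGER PRIMARY KEY,
    lot_id  INTEGER NOT NULL UNIQUE REFERENCES lot(lot_id));

-- First approach: nullable, unique foreign key in the "one" row.
CREATE TABLE home2 (home_id INTEGER PRIMARY KEY);
CREATE TABLE lot2 (
    lot_id  INTEGER PRIMARY KEY,
    home_id INTEGER UNIQUE REFERENCES home2(home_id));

INSERT INTO lot VALUES (1), (2);
INSERT INTO home VALUES (10, 1);
-- Two empty lots: the NULLs do not collide in the unique index.
INSERT INTO lot2 VALUES (1, NULL), (2, NULL);
""")

# A second home on lot 1 violates the unique constraint.
try:
    conn.execute("INSERT INTO home VALUES (11, 1)")
    second_home_rejected = False
except sqlite3.IntegrityError:
    second_home_rejected = True

empty_lots = conn.execute(
    "SELECT COUNT(*) FROM lot2 WHERE home_id IS NULL").fetchone()[0]
```

For the Job/Doc1/Doc2/Doc3 case the same pattern would apply three times, once per document type.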

Not defined database schema

I have a MySQL database with 220 tables. The database is well structured but without any clear relations. I want to find a way to connect the primary key of each table to its corresponding foreign key.
I was thinking to write a script to discover the possible relation between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and get closer to a solution? Also, is there any available tool that does this?
Please advise!
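The two heuristics listed above could be sketched roughly as follows. This is only an illustration in Python against SQLite (standing in for MySQL): the function name candidate_foreign_keys and the sample tables are hypothetical, the caller is assumed to already know each table's primary key column, and the name comparison is deliberately crude.

```python
import sqlite3

def candidate_foreign_keys(conn, primary_keys, columns):
    """Return (child_table, child_column, parent_table, name_match) guesses.

    primary_keys maps table name -> key column; columns lists the
    (table, column) pairs to test. Names come from the schema, not from
    users, so they are interpolated into the SQL directly.
    """
    candidates = []
    for child_table, child_col in columns:
        non_null = conn.execute(
            f'SELECT COUNT(*) FROM "{child_table}" WHERE "{child_col}" IS NOT NULL'
        ).fetchone()[0]
        if non_null == 0:
            continue  # nothing to compare against
        for parent_table, pk in primary_keys.items():
            if parent_table == child_table:
                continue
            # Heuristic 1: every non-null child value occurs in the parent key.
            missing = conn.execute(
                f'SELECT COUNT(*) FROM "{child_table}" c '
                f'WHERE c."{child_col}" IS NOT NULL AND NOT EXISTS '
                f'(SELECT 1 FROM "{parent_table}" p WHERE p."{pk}" = c."{child_col}")'
            ).fetchone()[0]
            if missing:
                continue
            # Heuristic 2 (crude): the column name mentions the parent table.
            name_match = parent_table.rstrip("s").lower() in child_col.lower()
            candidates.append((child_table, child_col, parent_table, name_match))
    # Name matches first; the sort is stable, so ties keep their original order.
    candidates.sort(key=lambda c: not c[3])
    return candidates

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (id INTEGER PRIMARY KEY);
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO suppliers VALUES (1), (2), (3);
INSERT INTO customers VALUES (1), (2), (3);
INSERT INTO invoices VALUES (1, 1), (2, 2), (3, NULL);
""")
guesses = candidate_foreign_keys(
    conn,
    {"suppliers": "id", "customers": "id", "invoices": "id"},
    [("invoices", "customer_id")],
)
```

In this toy data the value check alone matches both customers and suppliers, because surrogate IDs share the same range; only the column name separates them. The output is a list of guesses to review by hand, not a verified set of relations.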
Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, that it out-lives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity
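Where constraints have been declared, reading them back is indeed short. The sketch below uses SQLite through Python's sqlite3 module, where the catalogue is reached with PRAGMA foreign_key_list; the customer and invoice tables are hypothetical. MySQL documents a comparable view, INFORMATION_SCHEMA.KEY_COLUMN_USAGE (rows with a non-null REFERENCED_TABLE_NAME), which again only lists references that were actually declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
CREATE TABLE invoice (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id));
""")

def declared_foreign_keys(conn):
    """List (child_table, child_column, parent_table, parent_column) tuples."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    refs = []
    for table in tables:
        # Row layout: (id, seq, parent_table, from_col, to_col, on_update, on_delete, match)
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            refs.append((table, row[3], row[2], row[4]))
    return refs

refs = declared_foreign_keys(conn)
```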
Pretend SQL Platform
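Platforms such as MySQL, PostgreSQL and Oracle position themselves as "SQL" while deviating from the standard in places, so do not take compliance for granted. MySQL does expose a catalogue through INFORMATION_SCHEMA, but it can only report references that were actually declared as FOREIGN KEY constraints, and the MyISAM engine parses such declarations without storing or enforcing them. I suppose you get what you pay for.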
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
Solution 2
Now you could do all that in SQL, but then the code would be horrendous, and SQL is not designed for that (table comparisons). That is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straightforward. That is squarely within awk's design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see a record number 123456 being repeated in the Invoice file, now what other file does this relate to ? Every other possible file, Supplier, Customer, Part, Address, CreditCard, where it occurs once only, has a record number 123456 !
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table, now what other table does this relate to ? The only table that has IBM occurring once is the Customer table.
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs

Enhancing the Database Design

My CMS application, which lets users post Classifieds, Articles, Events, Directories, Properties etc., has its database designed as follows:
1st Approach:
Each Section (i.e 'classifieds','events' etc) has three tables dedicated to store data relevant to it:
Classified:
classified-post
classified-category
classified-post-category
Event:
events_post.
events_category.
events_post-category.
The same applies for Articles, Properties, Directories etc. each Section has three tables dedicated to its posts, categories.
The problem with this approach is:
Too many database tables (which leads to an increasing number of model and controller files).
Two foreign keys are needed to avoid duplicate entries in associative tables.
For example: let's say the tables comments, ratings and images belong to classified-post, events-posts etc., so the structure of the tables would be:
Image [id, post_id, section]
The second FK section must be stored and associated to avoid duplicate posts.
2nd Approach:
This approach will have a single posts table, with a section column associated with each post as a foreign key, i.e.
post: id, section, title etc ....VALUES ( 1, 'classifieds','abc') (2,'events','asd')
While the second approach is a little bit cumbersome when writing SQL queries, it eases the process when querying related tables, e.g. when the images, ratings and comments tables belong to the posts table.
image [ id, post_id (FK) ]
While this approach seems clean and easy, it will end up with oodles of columns in the posts table: it will have columns related to events, classifieds, directories etc., which will lead to performance issues when querying for rows and columns.
The same applies to categories. It could be either of the two approaches: either save the section column as a second foreign key, or have separate tables for each section (1st approach).
So now my question is: which approach is considered better? Does either of the two approaches have a performance benefit over the other? Or what is the best way to tackle these paradigms?
I would favor the second approach, with some considerations.
A standard database design guideline is that the designer should first create a fully normalized design; selective denormalization can then be performed for performance reasons.
Normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data.
Hint: Programmers building their first database are often primarily concerned with performance. There’s no question that performance is important. A bad design can easily result in database operations that take ten to a hundred times as much time as they should.
A very likely example could be seen here
A draft model following the mentioned approach could be:
Approach 1 has the problem of too many tables
Approach 2 has too many columns
Consider storing your data in a single table as in Approach 2, but moving all the optional, section-specific data into an XML field.
The XML field will only have data that it needs for a particular section. If a new section is added, then you just add that kind of data to the XML
Your table may look like
UserID int FK
ImageID int FK
ArtifactCategory int FK
PostID int FK
ClassifiedID int FK
...
Other shared
...
Secondary xml
Now you have neither too many columns nor too many tables
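A small sketch of that idea, with the shared columns kept relational and the per-section fields serialised into one XML text column. It uses SQLite through Python's sqlite3 module and the standard xml.etree.ElementTree parser; the post table layout, the helper names and the sample field names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE post (
    post_id   INTEGER PRIMARY KEY,
    section   TEXT NOT NULL,
    title     TEXT NOT NULL,
    secondary TEXT  -- XML document holding the section-specific fields
)""")

def add_post(section, title, extra):
    """Store shared columns normally and the section-specific dict as XML."""
    root = ET.Element("secondary")
    for key, value in extra.items():
        ET.SubElement(root, key).text = str(value)
    xml_text = ET.tostring(root, encoding="unicode")
    cursor = conn.execute(
        "INSERT INTO post (section, title, secondary) VALUES (?, ?, ?)",
        (section, title, xml_text))
    return cursor.lastrowid

def section_fields(post_id):
    """Parse the XML column back into a dict of field name -> text."""
    (xml_text,) = conn.execute(
        "SELECT secondary FROM post WHERE post_id = ?", (post_id,)).fetchone()
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

event_id = add_post("events", "Launch party",
                    {"venue": "Town hall", "starts_at": "2024-05-01"})
fields = section_fields(event_id)
```

The trade-off is that values inside the XML cannot carry foreign key constraints or ordinary column indexes, so anything that is filtered or joined on is better kept as a real column.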

Mysql: separate or common relationship tables for different entities

In my database I have different entities like todos, events, discussions, etc. Each of them can have tags, comments, files, and other related items.
Now I have to design the relationships between these tables, and I think I have to choose from the following two possible solutions:
1. Separated relationship tables
So I will create todos_tags, events_tags, discussions_tags, todos_comments, events_comments, discussions_comments, etc. tables.
2. Common relationship tables
I will create only these tables: related_tags, related_comments, related_files, etc. having a structure like this:
related_tags
entity (event|discussion|todo|etc. - as enum or tinyint (1|2|3|etc.))
entity_id
tag_id
Which design should I use?
Probably you will say: it depends on the situation, and I think this is correct.
In my case, most of the time (maybe 70%+) I will query only one of the entities (events, discussions or todos), but in some cases I need them all in the same query (events, discussions and todos having a specified tag, for example). In that case I'll have to do a union on 3+ tables (in my case it can be 5+ tables) if I go with separated relationship tables.
I'll not have more than 1000-2000 rows in each table (events, discussions, todos).
What is the correct way to go? What are some personal experiences about this?
The second schema is more extensible. It lets you extend your application to construct queries involving more than one type. In addition, you can easily add new types in the future, even dynamically. Furthermore, it allows greater aggregation freedom, for example counting how many rows of each type exist, or how many were created during a particular timeframe.
On the other hand, the first design does not really have many advantages other than speed, and MySQL is already good at handling these types of queries fast enough for you. You can create an index on "entity" to make it work smoothly. If in the future you need to partition your tables to increase speed, you can do so at a later stage.
It is a far simpler design to have a single, common relationship table such as related_tags where you specify the entity type in a column rather than having multiple tables. Just be sure you properly index the entity and tag_id fields together to have optimum performance.
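A minimal sketch of the common related_tags table with a composite index, using SQLite through Python's sqlite3 module as a stand-in for MySQL. The tags table, the index name and the helper entities_with_tag are hypothetical, and the index column order (tag first) is a judgment call aimed at the "everything with tag X" query from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE related_tags (
    entity    TEXT NOT NULL CHECK (entity IN ('event', 'discussion', 'todo')),
    entity_id INTEGER NOT NULL,
    tag_id    INTEGER NOT NULL REFERENCES tags(tag_id),
    PRIMARY KEY (entity, entity_id, tag_id));
-- Composite index for lookups by tag, optionally narrowed to one entity type.
CREATE INDEX related_tags_by_tag ON related_tags (tag_id, entity, entity_id);
INSERT INTO tags VALUES (1, 'urgent'), (2, 'later');
INSERT INTO related_tags VALUES
    ('event', 5, 1), ('todo', 7, 1), ('todo', 8, 2), ('discussion', 3, 1);
""")

def entities_with_tag(tag_name, entity=None):
    """Return (entity, entity_id) pairs carrying the tag, across all types or one."""
    sql = ("SELECT rt.entity, rt.entity_id FROM related_tags rt "
           "JOIN tags t ON t.tag_id = rt.tag_id WHERE t.name = ?")
    params = [tag_name]
    if entity is not None:
        sql += " AND rt.entity = ?"
        params.append(entity)
    sql += " ORDER BY rt.entity, rt.entity_id"
    return conn.execute(sql, params).fetchall()

all_urgent = entities_with_tag("urgent")
urgent_todos = entities_with_tag("urgent", "todo")

# The per-type aggregation mentioned above needs no UNION either.
per_type = conn.execute(
    "SELECT entity, COUNT(*) FROM related_tags GROUP BY entity ORDER BY entity"
).fetchall()
```

As with any table of this shape, entity_id cannot be a foreign key, so rows can outlive the event, discussion or todo they point at unless the application cleans them up.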