My CMS application, which lets users post Classifieds, Articles, Events, Directories, Properties etc., has its database designed as follows:
1st Approach:
Each Section (i.e 'classifieds','events' etc) has three tables dedicated to store data relevant to it:
Classified:
classified-post
classified-category
classified-post-category
Event:
events_post.
events_category.
events_post-category.
The same applies for Articles, Properties, Directories etc.: each section has three tables dedicated to its posts and categories.
The problem with this approach is:
Too many database tables (which leads to a growing number of model and controller files).
Two foreign keys are needed to avoid duplicate entries in associative tables.
For example: let's say the tables comments, ratings and images belong to classified-post, events-post etc., so the structure of the tables would be:
Image [id, post_id, section]
The second FK, section, must be stored alongside post_id so that posts with the same id in different sections are not treated as duplicates.
2nd Approach:
This approach has a single posts table with a section column attached to each post as a foreign key, i.e.
post: id, section, title etc ....VALUES ( 1, 'classifieds','abc') (2,'events','asd')
While the second approach is a little cumbersome when writing SQL queries, it simplifies queries against related tables. For example, the tables images, ratings and comments all belong to the posts table:
image [ id, post_id (FK) ]
While this approach seems clean and easy, it will end up with oodles of columns in the posts table, because it will hold columns related to events, classifieds, directories etc., which will lead to performance issues when querying rows and columns.
The same applies to categories. It could be either of the two approaches: either save the section column as a second foreign key, or have separate tables for each section (1st approach).
So now my question is: which approach is considered better? Does either of the two approaches have a performance benefit over the other? Or what is the best way to deal with this kind of design?
I would favor the second approach, with some considerations.
A standard database design guideline is that the designer should first create a fully normalized design; selective denormalization can then be performed for performance reasons.
Normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data.
Hint: Programmers building their first database are often primarily concerned with performance. There’s no question that performance is important. A bad design can easily result in database operations that take ten to a hundred times as much time as they should.
A very likely example could be seen here
A draft model following the mentioned approach could be:
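A minimal sketch of what such a normalized model might look like, not the original draft. It uses Python's built-in sqlite3 only so that it runs anywhere; the table and column names (section, post, event_detail, image, starts_at, path) are assumptions made for illustration. Shared columns stay in post, section-specific columns move into a 1:1 detail table, and child tables such as image need only one foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE section (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE post (
    id         INTEGER PRIMARY KEY,
    section_id INTEGER NOT NULL REFERENCES section(id),
    title      TEXT NOT NULL
);
-- Hypothetical detail table: columns that only events need.
CREATE TABLE event_detail (
    post_id   INTEGER PRIMARY KEY REFERENCES post(id),
    starts_at TEXT NOT NULL
);
-- Shared child table: a single foreign key is enough.
CREATE TABLE image (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),
    path    TEXT NOT NULL
);
INSERT INTO section (id, name) VALUES (1, 'classifieds'), (2, 'events');
INSERT INTO post (id, section_id, title) VALUES (1, 1, 'abc'), (2, 2, 'asd');
INSERT INTO event_detail (post_id, starts_at) VALUES (2, '2024-01-01');
INSERT INTO image (post_id, path) VALUES (2, 'poster.png');
""")

# Event posts together with their event-only columns and any images.
event_rows = conn.execute("""
    SELECT p.title, e.starts_at, i.path
    FROM post p
    JOIN event_detail e ON e.post_id = p.id
    LEFT JOIN image i ON i.post_id = p.id
""").fetchall()
```

The posts table stays narrow because event-only columns never appear on classifieds rows; whether the extra join is acceptable is something to measure on real data.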
Approach 1 has the problem of too many tables
Approach 2 has too many columns
Consider storing your data in a single table as in Approach 2, but putting all the optional, section-specific data into an XML column.
The XML field will only hold the data needed for a particular section. If a new section is added, you just add that kind of data to the XML.
Your table may look like
UserID int FK
ImageID int FK
ArtifactCategory int FK
PostID int FK
ClassifiedID int FK
...
Other shared
...
Secondary xml
Now you have neither too many columns nor too many tables
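As a rough sketch of the idea above, assuming a plain text column named secondary and made-up section fields (venue, starts_at), the XML can be built and read back in application code. The helper names build_secondary and read_secondary are hypothetical, and sqlite3 is used only to keep the example runnable.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE post (
    id        INTEGER PRIMARY KEY,
    section   TEXT NOT NULL,
    title     TEXT NOT NULL,
    secondary TEXT  -- XML document holding section-specific fields
)""")

def build_secondary(fields):
    """Serialize a dict of section-specific fields into a small XML document."""
    root = ET.Element("secondary")
    for name, value in fields.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

def read_secondary(xml_text):
    """Parse the XML document back into a dict of field name -> text."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

conn.execute(
    "INSERT INTO post (section, title, secondary) VALUES (?, ?, ?)",
    ("events", "asd",
     build_secondary({"venue": "Town Hall", "starts_at": "2024-01-01"})),
)
row = conn.execute(
    "SELECT secondary FROM post WHERE section = 'events'"
).fetchone()
event_fields = read_secondary(row[0])
```

The trade-off is that values stored inside a plain text XML column are opaque to the schema: they cannot carry foreign key constraints, and filtering on them means parsing the document first.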
Related
This is yet another database normalization question/discussion, but I'm hoping to get some additional perspective on the trade-offs, advantages and disadvantages of different scenarios for multiple foreign key columns vs. multiple join/intersection tables, as I can't seem to find any practical information or advice on how MySQL would optimize or fail with the different approaches.
I'm asking for general guidance on how others approach objects with multiple 1:N relationships and foreign keys where a majority of them will always be null.
As a basic example, let's say I have a project management app with an uploads table for storing uploaded file information. For "scale", there are 20 million current uploads, with 1000 added daily.
Uploads can have a direct relation to a couple of different objects as their "parent" or "owner": directly to a Project, directly to a Todo, or directly to a Comment. Each upload would only ever have a single relationship at a time, never multiple.
Potential options for structuring I see
Option 1: Single table multiple foreign key columns
uploads
upload_id, filepath, project_id, todo_id, comment_id
foreign keys for project_id, todo_id, comment_id.
Potential Problem: Large amount of null values in foreign keys. Potentially slow writes/locks in high volumes due to fk constraints and the additional index sizes.
Option 2: Multiple Intersection/Join tables
uploads
upload_id, filepath
project_uploads
project_id, upload_id
todo_uploads
todo_id, upload_id
comment_uploads
comment_id, upload_id
foreign keys on all columns for *_uploads tables
Potential Problem: People will confuse for N:N instead of 1:N relationship. "Relative", but more difficult selects to produce in application layer, especially when selecting uploads for projects as you would need to join all tables to get the entire list of project Ids for the uploads since todos and comments both would also belong to a parent.
Option 3: Single Relation/Join table with a type
uploads
upload_id, filepath
objects_uploads
upload_id, object_id, type
foreign key on upload_id, standard indexes on object_id and type.
Potential Problem: more confusing schema, not truly "relational" or normalized
I'd also like to throw out the possibility of using JSON fields on individual objects and just always enforcing project_id on the uploads. I have very limited experience with JSON field types or their pitfalls. I'm assuming selections to get uploads specifically parented/uploaded to a todo or comment would be far more difficult, as you would need to select the ids out of the JSON.
Are there any other approaches or considerations I'm overlooking? Are there specific considerations to follow based on different workloads, higher write volumes, high read volumes, etc.? Thanks for any information, insights or resources.
Edit
To clarify, I understand that the above outlines can represent differences in the schemas/relationships of the objects. I'm really only thinking about write and select performance, and the considerations or trade-offs to make around indexes/constraints and joins. Specifically for this question, I'm less concerned about referential integrity or 100% data integrity.
I've modified some language in my original example above. I'm looking for ideal configurations for objects that can be related to many different objects but never at the same time, leaving most foreign key columns null. Here is a similar question from 3.5 years ago...
https://softwareengineering.stackexchange.com/questions/335284/disadvantages-of-using-a-nullable-foreign-key-instead-of-creating-an-intersectio
Basically, I'm trying to find some general advice on when to consider or optimize in the different ways, gauge any real impact of a large number of nulls in foreign keys, and learn the reasons for preferring one approach over another.
Option 1 models three many-to-one relationships. That is, a given upload can have at most one reference to project, at most one reference to todo, and at most one reference to comment. This would be a simple way to model these as many-to-one relationships. Don't shy away from using NULLs; in InnoDB they take essentially no storage space.
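A small sketch of Option 1, using sqlite3 only so it runs as-is. The CHECK constraint that forces exactly one parent per upload is an addition of this sketch, not part of the question or the answer above, and the helper try_insert is hypothetical. On MySQL, note that CHECK constraints are only enforced in recent versions, so verify that on your server before relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER PRIMARY KEY);
CREATE TABLE todos    (todo_id    INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY);
CREATE TABLE uploads (
    upload_id  INTEGER PRIMARY KEY,
    filepath   TEXT NOT NULL,
    project_id INTEGER REFERENCES projects(project_id),
    todo_id    INTEGER REFERENCES todos(todo_id),
    comment_id INTEGER REFERENCES comments(comment_id),
    -- Assumption of this sketch: exactly one parent column is set.
    CHECK ((project_id IS NOT NULL)
         + (todo_id    IS NOT NULL)
         + (comment_id IS NOT NULL) = 1)
);
INSERT INTO projects VALUES (1);
INSERT INTO todos VALUES (10);
""")
conn.execute("INSERT INTO uploads (filepath, project_id) VALUES ('a.pdf', 1)")

def try_insert(sql):
    """Return True if the statement is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Two parents at once violates the CHECK constraint.
two_parents_ok = try_insert(
    "INSERT INTO uploads (filepath, project_id, todo_id) VALUES ('b.pdf', 1, 10)")
# A todo that does not exist violates the foreign key.
missing_parent_ok = try_insert(
    "INSERT INTO uploads (filepath, todo_id) VALUES ('c.pdf', 99)")
```

Both rejected inserts are cases the database catches for you with this layout, at the cost of one index per foreign key column.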
Option 2 models three many-to-many relationships. A given upload may be associated with multiple projects, multiple todos, and multiple comments. I think this is what Akina was commenting about above. If your application needs these to be many-to-many relationships, then you need these three intersection tables to model that data. If you don't need these to be many-to-many relationships, then don't create these tables.
Option 3 is not a relational data model at all. It conflicts with several normal forms.
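To make the objection to Option 3 concrete, here is a hedged sketch (sqlite3 again, with made-up sample rows and a hypothetical index name). Because object_id can point at different parent tables depending on type, it cannot be declared as a foreign key, so a row pointing at a missing parent is accepted and has to be found after the fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE todos   (todo_id   INTEGER PRIMARY KEY);
CREATE TABLE uploads (upload_id INTEGER PRIMARY KEY, filepath TEXT NOT NULL);
CREATE TABLE objects_uploads (
    upload_id INTEGER NOT NULL REFERENCES uploads(upload_id),
    object_id INTEGER NOT NULL,  -- no FK possible, the parent table varies by row
    type      TEXT NOT NULL
);
CREATE INDEX idx_objects_uploads_object ON objects_uploads (type, object_id);
INSERT INTO todos VALUES (10);
INSERT INTO uploads VALUES (1, 'a.pdf'), (2, 'b.pdf');
INSERT INTO objects_uploads VALUES (1, 10, 'todo');
INSERT INTO objects_uploads VALUES (2, 99, 'todo');  -- todo 99 does not exist
""")

# The database accepted the dangling row, so orphans must be hunted manually.
orphans = conn.execute("""
    SELECT ou.upload_id
    FROM objects_uploads ou
    LEFT JOIN todos t ON t.todo_id = ou.object_id
    WHERE ou.type = 'todo' AND t.todo_id IS NULL
""").fetchall()
```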
I am working on a database which has some types (e.g. User, Appointment, Task etc.) which can have zero or more Notes associated with each type.
The possible solutions I have come across for implementing these relationships are:
Polymorphic relationship
Separate table per type
Polymorphic Relationship
Suggested by many as being the easiest solution to implement and seemingly the most common implementation for frameworks that follow the Active Record pattern, I would add a table whose data is morphable:
My notable_type would allow me to distinguish between the type (User, Appointment, Task) the Note relates to, whilst the notable_id would allow me to obtain the individual type record in the related type table.
PROS:
Easy to scale, more models can be easily associated with the polymorphic class
Limits table bloat
Results in one class that can be used by many other classes (DRY)
CONS
More types can make querying more difficult and expensive as the data grows
Cannot have a foreign key
Lack of data consistency
Separate Table per Type
Alternatively I could create a table for each type which is responsible for the Notes associated with that type only. The type_id foreign key would allow me to quickly obtain the individual type record.
Deemed by many online as a code smell, many articles advocate avoiding the polymorphic relationship in favour of an alternative (here and here for example).
PROS:
Allows us to use foreign keys effectively
Efficient data querying
Maintains data consistency
CONS:
Increases table bloat as each type requires a separate table
Results in multiple classes, each representing the separate type_notes table
Thoughts
The polymorphic relationship is certainly the simpler of the two options to implement, but the lack of foreign key constraints and therefore potential for consistency issues feels wrong.
A table per notes relationship (user_notes, task_notes etc.) with foreign keys seems the correct way (in keeping with design patterns) but could result in a lot of tables (addition of other types that can have notes or addition of types similar to notes [e.g. events]).
It feels like my choice is either simplified table structure but forgo foreign keys and increased query overhead, or increase the number of tables with the same structure but simplify queries and allow for foreign keys.
Given my scenario which of the above would be more appropriate, or is there an alternative I should consider?
What is "table bloat"? Are you concerned about having too many tables? Many real-world databases I've worked on have between 100 and 200 tables, because that's what it takes.
If you're concerned with adding multiple tables, then why do you have separate tables for User, Appointment, and Task? If you had a multi-valued attribute for User, for example for multiple phone numbers per user, would you create a separate table for phones, or would you try to combine them all into the user table somehow? Or have a polymorphic "things that belong to other things" table for user phones, appointment invitees, and task milestones?
Answer: No, you'd create a Phone table, and use it to reference only the User table. If Appointments have invitees, that gets its own table (probably a many-to-many between appointments and users). If tasks have milestones, that gets its own table too.
The correct thing to do is to model your database tables like you would model object types in your application. You might like to read a book like SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date to learn more about how tables are analogous to types.
You already know instinctively that the fact that you can't create a foreign key is a red flag. A foreign key must reference exactly one parent table. This should be a clue that it's not valid relational database design to make a polymorphic foreign key. Once you start thinking of tables and their attributes as concrete types (like described in SQL and Relational Theory), this will become obvious.
If you must create one notes table, you could make it reference one table called "Notable" which is like a superclass of User, Appointment, and Task. Then each of those three tables would also reference a primary key of Notable. This mimics the object-oriented structure of polymorphism, where you can have a class Note have a reference to an object by its superclass type.
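The supertype idea in the previous paragraph could be sketched roughly like this. The table and column names (notable, users, tasks, notes, notable_id) and the helper create_user are assumptions for illustration, and sqlite3 stands in for whatever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- "Superclass" table: one row per thing that can carry notes.
CREATE TABLE notable (notable_id INTEGER PRIMARY KEY);
CREATE TABLE users (
    notable_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    name       TEXT NOT NULL
);
CREATE TABLE tasks (
    notable_id INTEGER PRIMARY KEY REFERENCES notable(notable_id),
    title      TEXT NOT NULL
);
-- A single notes table with a real foreign key.
CREATE TABLE notes (
    note_id    INTEGER PRIMARY KEY,
    notable_id INTEGER NOT NULL REFERENCES notable(notable_id),
    body       TEXT NOT NULL
);
""")

def create_user(name):
    """Insert the supertype row first, then the user row that shares its key."""
    notable_id = conn.execute("INSERT INTO notable DEFAULT VALUES").lastrowid
    conn.execute("INSERT INTO users (notable_id, name) VALUES (?, ?)",
                 (notable_id, name))
    return notable_id

alice = create_user("Alice")
conn.execute("INSERT INTO notes (notable_id, body) VALUES (?, ?)",
             (alice, "Prefers email"))

user_notes = conn.execute("""
    SELECT u.name, n.body
    FROM users u
    JOIN notes n ON n.notable_id = u.notable_id
""").fetchall()

# Unlike a polymorphic notable_type/notable_id pair, a dangling note is rejected.
try:
    conn.execute("INSERT INTO notes (notable_id, body) VALUES (999, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Every insert into users or tasks now takes two statements, which is part of the extra complexity mentioned next.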
But IMHO, that's more complex than it needs to be. I would just create separate tables for UserNotes, AppointmentNotes, and TaskNotes. I'm not troubled by having three more tables, and it makes your code more clear and maintainable.
I think you should think about these two things, before you can make a decision.
Performance: is the workload read-heavy or write-heavy? Test which design performs better.
Growth of your model: can it easily be expanded?
We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
While planning a serious refactoring (performance and standardization), we are considering several approaches for restructuring our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
The current DB schema looks like this:
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
Large number of tables, data are split and there are connections with IDs. No duplicates in the main tables: articles and authors. Not sure how this will impact the performance as we would need to use several joins in order to retrieve data, example:
Data is split among a reasonable number of tables with duplicate entries in the table article_authors (author firstname, lastname and email alternatives) in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we will have 10 entries for the same author in the article_authors table:
The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how to turn strings of text ("JK Rawling") into a specific id. At this point, the question is first about algorithms, and only secondly about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
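A first-cut version of that (provider_id, full_author_name, author_id) table might look like the sketch below. The table names provider_author_names and authors, the sample provider ids, and the function lookup_author_id are made up for illustration; sqlite3 is used only to keep it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (
    author_id      INTEGER PRIMARY KEY,
    canonical_name TEXT NOT NULL
);
-- One row per exact spelling seen from a given provider.
CREATE TABLE provider_author_names (
    provider_id      INTEGER NOT NULL,
    full_author_name TEXT NOT NULL,
    author_id        INTEGER NOT NULL REFERENCES authors(author_id),
    PRIMARY KEY (provider_id, full_author_name)
);
INSERT INTO authors VALUES (1, 'J. K. Rowling');
INSERT INTO provider_author_names VALUES
    (7, 'J. K. Rowling', 1),
    (8, 'JK Rowling', 1);
""")

def lookup_author_id(provider_id, full_author_name):
    """Exact-match lookup; returns None when the spelling is unknown."""
    row = conn.execute(
        "SELECT author_id FROM provider_author_names "
        "WHERE provider_id = ? AND full_author_name = ?",
        (provider_id, full_author_name),
    ).fetchone()
    return row[0] if row else None
```

A None result is the signal to fall back to a slower matching step or to human review, which is where the gray areas described above begin.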
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)
If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling#' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...
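One way the "fetch a batch, then massage in app code" tip could look, as a sketch only: a cheap SQL filter narrows the candidates, and the comparison logic lives in Python. The table author_names, the sample rows, and the normalization rule (lowercase, drop everything except letters and digits) are assumptions, not a recommended matching algorithm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_names (author_id INTEGER NOT NULL, name TEXT NOT NULL);
INSERT INTO author_names VALUES
    (1, 'J. K. Rowling'),
    (2, 'John Smith'),
    (3, 'Jon Rowland');
""")

def normalize(name):
    """Lowercase and strip punctuation/spaces: 'J. K. Rowling' -> 'jkrowling'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def match_author_ids(raw_name):
    """Fetch rows sharing the last word, then compare normalized forms in Python."""
    words = raw_name.split()
    if not words:
        return []
    candidates = conn.execute(
        "SELECT author_id, name FROM author_names WHERE name LIKE ?",
        ("%" + words[-1],),
    ).fetchall()
    wanted = normalize(raw_name)
    return sorted({author_id for author_id, name in candidates
                   if normalize(name) == wanted})
```

More than one id in the result would mean two authors share a normalized name, which is the case that needs human intervention.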
I have three tables with common fields - users, guests and admins.
The last two tables have some of the users fields.
Here's an example:
users
id|username|password|email|city|country|phone|birthday|status
guests
id|city|country|phone|birthday
admins
id|username|password|status
I'm wondering if it's better to:
a)use one table with many NULL values
b)use three tables
The question is less about "one table with many NULLs versus three tables" than about the data structure. The real question is how other tables in your data structure will refer to these entities.
This is a classic situation, where you have "one-of" relationships and need to represent them in SQL. There is a "right" way, and that is to have four tables:
"users" (I can't think of a good name) would encompass everyone and have a unique id that could be referenced by other tables
"normal", "admins", "guests" each of which would have a 1-0/1 relationship with "users"
This allows other tables to refer to any of the three types of users, or to users in general. This is important for maintaining proper relationships.
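A rough sketch of that four-table layout, with two extra tables (logins and admin_roles, both hypothetical) to show the two kinds of reference: one to users in general, one to admins specifically. Column choices are trimmed from the question, and sqlite3 is used only so the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);  -- everyone
CREATE TABLE normal (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    username TEXT NOT NULL,
    email    TEXT NOT NULL
);
CREATE TABLE admins (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    username TEXT NOT NULL,
    status   TEXT NOT NULL
);
CREATE TABLE guests (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    city    TEXT,
    country TEXT
);
-- Hypothetical table that refers to any kind of user.
CREATE TABLE logins (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id)
);
-- Hypothetical table that refers to admins only.
CREATE TABLE admin_roles (
    admin_id INTEGER NOT NULL REFERENCES admins(user_id),
    role     TEXT NOT NULL
);
INSERT INTO users VALUES (1), (2);
INSERT INTO guests VALUES (1, 'Paris', 'France');
INSERT INTO admins VALUES (2, 'root', 'active');
INSERT INTO logins (user_id) VALUES (1), (2);
INSERT INTO admin_roles VALUES (2, 'moderator');
""")

# User 1 is a guest, so giving it an admin role is rejected by the foreign key.
try:
    conn.execute("INSERT INTO admin_roles VALUES (1, 'moderator')")
    guest_role_rejected = False
except sqlite3.IntegrityError:
    guest_role_rejected = True

login_count = conn.execute("SELECT COUNT(*) FROM logins").fetchone()[0]
```

This version reuses the user id as the key of each subtype table; nothing here stops one user from appearing in two subtype tables, which would need an extra rule if it matters.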
You have suggested two shortcuts. One is that there is no information about "normal" users so you dispense with that table. However, this means that you can't refer to "normal" users in another table.
Often, when the data structures are similar, the data is simply denormalized into a single row (as in your solution a).
All three approaches are reasonable in the context of applications that have specific needs. As for performance, the cost of additional NULLable columns is generally minimal when the data types are variable length. If a lot of the additional columns are numeric, then these may occupy real space even when NULL, which can be a factor in designing the best solution.
In short, I wouldn't choose between the different options based on the premature optimization of which might be better. I would choose between them based on the overall data structure needed for the database, and in particular, the relationships that these entities have with other entities.
EDIT:
Then there is the question of the id that you use for the specialized tables. There are two ways of doing this. One is to have a separate id, such as AdminId and GuestId for each of these tables. Another column in each table would be the UserId.
This makes sense when other entities have relationships with these particular entities. For instance, "admins" might have a sub-system that describes rights and roles and privileges that they have, perhaps along with a history of changes. These tables (ahem, entities) would want to refer to an AdminId. And, you should probably oblige by letting them.
If you don't have such tables, then you might still split out the Admins, because the 100 integer columns they need are a waste of space for the zillion other users. In that case, you can get by without a separate id.
I want to emphasize that you have asked a question that doesn't have a "best" answer in general. It does have a "correct" answer by the rules of normalization (that would be 4 tables with 4 separate ids). But the best answer in a given situation depends on the overall data model.
Why not have one parent user table with three foreign-keyed detail tables? That allows a unique user id that can transition between types.
I generally agree with Chriseyre2000, but in your specific example, I don't see a need for the other 2 tables. Everything is contained in users, so why not just add Guest and Admin bit fields? Or even a single UserType field.
Though Chriseyre2000's solution will give you better scalability should you later want to add fields that are specific to guests and admins.
In my database I have different entities like todos, events, discussions, etc. Each of them can have tags, comments, files, and other related items.
Now I have to design the relationships between these tables, and I think I have to choose from the following two possible solutions:
1. Separated relationship tables
So I will create todos_tags, events_tags, discussions_tags, todos_comments, events_comments, discussions_comments, etc. tables.
2. Common relationship tables
I will create only these tables: related_tags, related_comments, related_files, etc. having a structure like this:
related_tags
entity (event|discussion|todo|etc. - as enum or tinyint (1|2|3|etc.))
entity_id
tag_id
Which design should I use?
Probably you will say: it depends on the situation, and I think this is correct.
In my case, most of the time (maybe 70%+) I will have to query only one of the entities (events, discussions or todos), but in some cases I need them all in the same query (events, discussions and todos having a specified tag, for example). In this case I'll have to do a union on 3+ tables (in my case it can be 5+ tables) if I go with separated relationship tables.
I'll not have more than 1000-2000 rows in each table(events, discussions, todos);
What is the correct way to go? What are some personal experiences about this?
The second schema is more extensible. This way you will be able to extend your application to construct queries involving more than one type. In addition, it's possible to easily add new types in the future, even dynamically. Furthermore, it allows greater aggregation freedom, for example allowing you to count how many rows of each type exist, or how many were created during a particular timeframe.
On the other hand, the first design does not really have many advantages other than speed. But MySQL is already good at handling these types of queries fast enough for you. You can create an index on "entity" to make it work smoothly. If in the future you need to partition your tables to increase speed, you can do so at a later stage.
It is a far simpler design to have a single, common relationship table such as related_tags where you specify the entity type in a column rather than having multiple tables. Just be sure you properly index the entity and tag_id fields together to have optimum performance.
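A minimal sketch of the common relationship table with the combined index, assuming a tags table, some sample rows, and a helper entities_with_tag that are not in the question. It runs on sqlite3; in MySQL the entity column could be an ENUM or TINYINT as the question suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE related_tags (
    entity    TEXT NOT NULL CHECK (entity IN ('event', 'discussion', 'todo')),
    entity_id INTEGER NOT NULL,
    tag_id    INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (entity, entity_id, tag_id)
);
-- Composite index on entity and tag_id together, as suggested above.
CREATE INDEX idx_related_tags_entity_tag ON related_tags (entity, tag_id);
INSERT INTO tags VALUES (1, 'urgent'), (2, 'later');
INSERT INTO related_tags VALUES
    ('event', 1, 1),
    ('todo', 5, 1),
    ('discussion', 3, 2);
""")

def entities_with_tag(tag_name, entity=None):
    """All (entity, entity_id) pairs with the tag, optionally for one entity type."""
    sql = """
        SELECT rt.entity, rt.entity_id
        FROM related_tags rt
        JOIN tags t ON t.id = rt.tag_id
        WHERE t.name = ?
    """
    params = [tag_name]
    if entity is not None:
        sql += " AND rt.entity = ?"
        params.append(entity)
    sql += " ORDER BY rt.entity, rt.entity_id"
    return conn.execute(sql, params).fetchall()
```

The cross-entity query needs no UNION here. If queries by tag alone dominate, an index that starts with tag_id may serve them better; that is worth checking against your own query plans.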