In my database I have different entities like todos, events, discussions, etc. Each of them can have tags, comments, files, and other related items.
Now I have to design the relationships between these tables, and I think I have to choose from the following two possible solutions:
1. Separated relationship tables
So I will create todos_tags, events_tags, discussions_tags, todos_comments, events_comments, discussions_comments, etc. tables.
2. Common relationship tables
I will create only these tables: related_tags, related_comments, related_files, etc. having a structure like this:
related_tags
entity (event|discussion|todo|etc. - as enum or tinyint (1|2|3|etc.))
entity_id
tag_id
Which design should I use?
Probably you will say: it depends on the situation, and I think this is correct.
In my case, most of the time (maybe 70%+) I will have to query only one of the entities (events, discussions, or todos), but in some cases I need them all in the same query (for example, all events, discussions, and todos having a specified tag). In that case I'll have to do a union on 3+ tables (in my case it can be 5+ tables) if I go with separated relationship tables.
I'll not have more than 1000-2000 rows in each table (events, discussions, todos).
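To make the union concrete, here is a minimal sketch of option 1 with two of the entity types. It uses Python's built-in sqlite3 module rather than MySQL so that it runs as-is; the column names (todo_id, event_id, title) and the sample rows are assumptions, not part of the original schema.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags   (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE todos  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- One relationship table per entity type (option 1).
CREATE TABLE todos_tags  (todo_id  INTEGER REFERENCES todos(id),
                          tag_id   INTEGER REFERENCES tags(id));
CREATE TABLE events_tags (event_id INTEGER REFERENCES events(id),
                          tag_id   INTEGER REFERENCES tags(id));

INSERT INTO tags   VALUES (1, 'urgent');
INSERT INTO todos  VALUES (1, 'Pay invoice');
INSERT INTO events VALUES (1, 'Release party');
INSERT INTO todos_tags  VALUES (1, 1);
INSERT INTO events_tags VALUES (1, 1);
""")

# "Everything with tag 1" needs one SELECT per entity type, glued with UNION ALL.
rows_union = conn.execute("""
SELECT 'todo' AS entity, t.id, t.title
FROM todos t JOIN todos_tags tt ON tt.todo_id = t.id
WHERE tt.tag_id = ?
UNION ALL
SELECT 'event', e.id, e.title
FROM events e JOIN events_tags et ON et.event_id = e.id
WHERE et.tag_id = ?
ORDER BY entity
""", (1, 1)).fetchall()

print(rows_union)
```

Each additional entity type adds another branch to the UNION ALL, which is the cost being weighed in the question.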
What is the correct way to go? What are some personal experiences about this?
The second schema is more extensible. This way you will be able to extend your application to construct queries involving more than one type. In addition, it's possible to easily add new types in the future, even dynamically. Furthermore, it allows greater aggregation freedom, for example allowing you to count how many rows of each type exist, or how many were created during a particular timeframe.
On the other hand, the first design does not really have many advantages other than speed, and MySQL is already good at handling these types of queries fast enough for you. You can create an index on the entity column to make it work smoothly. If in the future you need to partition your tables to increase speed, you can do so at a later stage.
It is a far simpler design to have a single, common relationship table such as related_tags where you specify the entity type in a column rather than having multiple tables. Just be sure you properly index the entity and tag_id fields together to have optimum performance.
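A rough sketch of the common relationship table with the combined index suggested above. SQLite (via Python's sqlite3) is used here in place of MySQL, the entity column is a TEXT with a CHECK instead of a MySQL ENUM, and the index name and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE related_tags (
    entity    TEXT    NOT NULL CHECK (entity IN ('todo', 'event', 'discussion')),
    entity_id INTEGER NOT NULL,
    tag_id    INTEGER NOT NULL,
    PRIMARY KEY (entity, entity_id, tag_id)
);
-- Combined index on entity and tag_id, as recommended in the answer.
CREATE INDEX idx_related_tags_entity_tag ON related_tags (entity, tag_id);

INSERT INTO related_tags VALUES ('todo', 1, 7), ('event', 4, 7), ('event', 5, 8);
""")

# The common case: one entity type with a given tag.
one_type = conn.execute(
    "SELECT entity_id FROM related_tags WHERE entity = ? AND tag_id = ?",
    ("event", 7),
).fetchall()

# The cross-entity case: no UNION needed, just drop the entity filter.
all_types = conn.execute(
    "SELECT entity, entity_id FROM related_tags WHERE tag_id = ? "
    "ORDER BY entity, entity_id",
    (7,),
).fetchall()

print(one_type, all_types)
```

Note that entity_id cannot be declared as a foreign key here, because it points at a different table depending on the value of entity.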
I am working on a database which has some types (e.g. User, Appointment, Task etc.) which can have zero or more Notes associated with each type.
The possible solutions I have come across for implementing these relationships are:
Polymorphic relationship
Separate table per type
Polymorphic Relationship
Suggested by many as being the easiest solution to implement and seemingly the most common implementation for frameworks that follow the Active Record pattern, this approach would have me add a notes table whose data is morphable, with notable_type and notable_id columns.
My notable_type would allow me to distinguish between the type (User, Appointment, Task) the Note relates to, whilst the notable_id would allow me to obtain the individual type record in the related type table.
PROS:
Easy to scale, more models can be easily associated with the polymorphic class
Limits table bloat
Results in one class that can be used by many other classes (DRY)
CONS
More types can make querying more difficult and expensive as the data grows
Cannot have a foreign key
Lack of data consistency
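The consistency problem listed above can be shown in a few lines. This is only a sketch using Python's sqlite3 in place of the real database; the users and tasks columns and the sample rows are assumptions. Only notable_type and notable_id come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is on, but there is no FK to enforce
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- Polymorphic notes: notable_id cannot be declared as a foreign key,
-- because the table it refers to depends on notable_type.
CREATE TABLE notes (
    id           INTEGER PRIMARY KEY,
    notable_type TEXT    NOT NULL,
    notable_id   INTEGER NOT NULL,
    body         TEXT    NOT NULL
);

INSERT INTO users VALUES (1, 'Ann');
INSERT INTO notes (notable_type, notable_id, body) VALUES ('User', 1, 'Prefers email');
-- Nothing stops a note from pointing at a task that does not exist.
INSERT INTO notes (notable_type, notable_id, body) VALUES ('Task', 99, 'Orphaned note');
""")

# Orphans have to be hunted down by hand, one parent type at a time.
orphans = conn.execute("""
SELECT n.id
FROM notes n
LEFT JOIN tasks t ON t.id = n.notable_id
WHERE n.notable_type = 'Task' AND t.id IS NULL
""").fetchall()

print(orphans)
```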
Separate Table per Type
Alternatively I could create a table for each type which is responsible for the Notes associated with that type only. The type_id foreign key would allow me to quickly obtain the individual type record.
The polymorphic relationship is deemed by many online to be a code smell, and many articles advocate avoiding it in favour of an alternative such as this one (here and here for example).
PROS:
Allows us to use foreign keys effectively
Efficient data querying
Maintains data consistency
CONS:
Increases table bloat as each type requires a separate table
Results in multiple classes, each representing the separate type_notes table
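For comparison, a sketch of the table-per-type variant for one type, again with Python's sqlite3 as a stand-in and with assumed column names (user_id, body). Here the database itself rejects a note for a missing parent row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One notes table per type, with a real foreign key to its parent.
CREATE TABLE user_notes (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    body    TEXT    NOT NULL
);

INSERT INTO users VALUES (1, 'Ann');
INSERT INTO user_notes (user_id, body) VALUES (1, 'Prefers email');
""")

# A note for a non-existent user is refused by the database.
try:
    conn.execute("INSERT INTO user_notes (user_id, body) VALUES (99, 'No such user')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

note_count = conn.execute("SELECT COUNT(*) FROM user_notes").fetchone()[0]
print(rejected, note_count)
```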
Thoughts
The polymorphic relationship is certainly the simpler of the two options to implement, but the lack of foreign key constraints and therefore potential for consistency issues feels wrong.
A table per notes relationship (user_notes, task_notes etc.) with foreign keys seems the correct way (in keeping with design patterns) but could result in a lot of tables (addition of other types that can have notes or addition of types similar to notes [e.g. events]).
It feels like my choice is either a simplified table structure that forgoes foreign keys and increases query overhead, or a larger number of tables with the same structure that simplifies queries and allows for foreign keys.
Given my scenario which of the above would be more appropriate, or is there an alternative I should consider?
What is "table bloat"? Are you concerned about having too many tables? Many real-world databases I've worked on have between 100 and 200 tables, because that's what it takes.
If you're concerned with adding multiple tables, then why do you have separate tables for User, Appointment, and Task? If you had a multi-valued attribute for User, for example for multiple phone numbers per user, would you create a separate table for phones, or would you try to combine them all into the user table somehow? Or have a polymorphic "things that belong to other things" table for user phones, appointment invitees, and task milestones?
Answer: No, you'd create a Phone table, and use it to reference only the User table. If Appointments have invitees, that gets its own table (probably a many-to-many between appointments and users). If tasks have milestones, that gets its own table too.
The correct thing to do is to model your database tables like you would model object types in your application. You might like to read a book like SQL and Relational Theory: How to Write Accurate SQL Code 3rd Edition by C. J. Date to learn more about how tables are analogous to types.
You already know instinctively that the fact that you can't create a foreign key is a red flag. A foreign key must reference exactly one parent table. This should be a clue that it's not valid relational database design to make a polymorphic foreign key. Once you start thinking of tables and their attributes as concrete types (like described in SQL and Relational Theory), this will become obvious.
If you must create one notes table, you could make it reference one table called "Notable" which is like a superclass of User, Appointment, and Task. Then each of those three tables would also reference a primary key of Notable. This mimics the object-oriented structure of polymorphism, where you can have a class Note have a reference to an object by its superclass type.
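A minimal sketch of that "Notable" supertype idea, assuming Python's sqlite3 as a stand-in for the real database. The table and column names (notables, notable_id, body) are illustrative only, and just two of the three subtypes are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- The supertype: every row that can carry notes gets an id here first.
CREATE TABLE notables (id INTEGER PRIMARY KEY);

-- Each subtype's primary key is also a foreign key to the supertype.
CREATE TABLE users (id INTEGER PRIMARY KEY REFERENCES notables(id), name  TEXT NOT NULL);
CREATE TABLE tasks (id INTEGER PRIMARY KEY REFERENCES notables(id), title TEXT NOT NULL);

-- A single notes table with an ordinary, enforceable foreign key.
CREATE TABLE notes (
    id         INTEGER PRIMARY KEY,
    notable_id INTEGER NOT NULL REFERENCES notables(id),
    body       TEXT    NOT NULL
);

INSERT INTO notables VALUES (1), (2);
INSERT INTO users VALUES (1, 'Ann');
INSERT INTO tasks VALUES (2, 'Write report');
INSERT INTO notes (notable_id, body) VALUES (1, 'Prefers email'), (2, 'Due Friday');
""")

# Notes for users are found by joining through the shared key.
user_notes = conn.execute("""
SELECT u.name, n.body
FROM users u JOIN notes n ON n.notable_id = u.id
""").fetchall()

print(user_notes)
```

The price is the extra insert into notables before every user, appointment, or task, which is the added complexity the answer refers to.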
But IMHO, that's more complex than it needs to be. I would just create separate tables for UserNotes, AppointmentNotes, and TaskNotes. I'm not troubled by having three more tables, and it makes your code more clear and maintainable.
I think you should think about these two things, before you can make a decision.
Performance: will there be a lot of reads or a lot of writes? Test which design is better for your workload.
Growth of your model: can it easily be expanded?
Imagine a hypothetical database that stores products. Each product can have 100 attributes, although any given product will only have values set for ~50 of these. I can see three ways to store this data:
A single table with 100 columns,
A single table with very few columns (say the 10 columns that have a value for every product), and another table with columns (product_id, attribute, value), i.e. an EAV data store.
A separate table for every column. So the core products table might have 2 columns, and there would be 98 other tables, each with the two columns (product_id, value).
Setting aside the shades of grey between these extremes, from a pure efficiency standpoint, which is best to use? I assume it depends on the types of queries being run, i.e. if most queries are for several attributes of a product, or the value of a single attribute for several products. How does this affect the efficiency?
Assume this is a MySQL database using InnoDB, and all tables have appropriate foreign keys, and an index on the product_id. Imagine that the attribute names and values are strings, and are not indexed.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
I found a similar question here: Best to have hundreds of columns or split into multiple tables?
The difference is, that question is asking about a specific case, and doesn't really tell me about efficiency in the general case. Other similar questions are all talking about the best way to organize the data, I just want to know how the different organizational systems impact the speed of queries.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
JOIN will be slower.
However, if you usually query only a specific subset of columns, and this subset is "vertically partitioned" into its own separate table, querying such "lean" table is typically quicker than querying the "fat" table with all the columns.
But this is a very specific and fragile situation (easy to break apart as the system evolves), and you should test very carefully before going down that path. Your default starting position should be one table.
In general, the more tables you have, the more normalised, more correct, and hence better (i.e. with less redundancy of data) your design.
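As an illustration of what such a vertical split looks like, here is a small sketch using Python's sqlite3. The table names, columns, and sample row are invented for the example, and it says nothing about whether the split pays off for a given workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "Lean" table: the columns most queries actually touch.
CREATE TABLE products (
    id          INTEGER PRIMARY KEY,
    name        TEXT    NOT NULL,
    price_cents INTEGER NOT NULL
);
-- "Fat" remainder, split off 1:1 and keyed by the same id.
CREATE TABLE product_details (
    product_id  INTEGER PRIMARY KEY REFERENCES products(id),
    description TEXT,
    manual_text TEXT
);

INSERT INTO products VALUES (1, 'Kettle', 2499);
INSERT INTO product_details VALUES (1, '1.7 l electric kettle', 'long manual text');
""")

# The frequent listing query never reads the wide columns.
listing = conn.execute(
    "SELECT name, price_cents FROM products ORDER BY name"
).fetchall()

# The rare detail view pays for one join.
detail = conn.execute("""
SELECT p.name, d.description
FROM products p JOIN product_details d ON d.product_id = p.id
WHERE p.id = ?
""", (1,)).fetchone()

print(listing, detail)
```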
If you later find you have problems with reporting on this data, then that may be the moment to consider creating denormalised values to improve any specific performance issues. Adding denormalised values later is much less painful than normalising an existing badly designed database.
In most cases, EAV is a querying and maintenance nightmare.
An outline design would be to have a table for Products, a table for Attributes, and a ProductAttributes table that contained the ProductID and the AttributeID of the relevant entries.
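One way to sketch that outline, using Python's sqlite3 in place of MySQL. The value column on the linking table and all sample data are assumptions added so that the example has something to return; the answer itself only mentions the two id columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attributes (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE product_attributes (
    product_id   INTEGER NOT NULL REFERENCES products(id),
    attribute_id INTEGER NOT NULL REFERENCES attributes(id),
    value        TEXT    NOT NULL,           -- assumed column, see lead-in
    PRIMARY KEY (product_id, attribute_id)
);

INSERT INTO products   VALUES (1, 'Kettle'), (2, 'Lamp');
INSERT INTO attributes VALUES (1, 'color'), (2, 'voltage');
INSERT INTO product_attributes VALUES (1, 1, 'red'), (1, 2, '230'), (2, 1, 'white');
""")

# Turning attribute rows back into columns takes one CASE expression per attribute.
pivot = conn.execute("""
SELECT p.name,
       MAX(CASE WHEN a.name = 'color'   THEN pa.value END) AS color,
       MAX(CASE WHEN a.name = 'voltage' THEN pa.value END) AS voltage
FROM products p
LEFT JOIN product_attributes pa ON pa.product_id = p.id
LEFT JOIN attributes a          ON a.id = pa.attribute_id
GROUP BY p.id, p.name
ORDER BY p.id
""").fetchall()

print(pivot)
```

Attributes a product does not have simply have no row, so they come back as NULL (None in Python) rather than occupying a column.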
As you mentioned, it strictly depends on the queries that will be executed on this data. Joins are expensive for the database, and I can't imagine making 50-60 joins for a simple read. In my humble opinion it would be madness. :) The best thing you can do is to load test data and check your specific queries with a tool such as the Estimated Execution Plan in Management Studio. MySQL offers a similar facility through its EXPLAIN statement.
I would advise you to avoid creating so many tables; I think it is bound to cause problems in the future. Maybe it is possible to move rarely used data into separate tables, or to use complex types? For string data you can try nonclustered (secondary) indexes.
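Checking a plan can be as simple as the sketch below. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 purely so the example runs anywhere; the table and index names are hypothetical, and the exact wording of the plan text differs between SQLite versions and from MySQL's EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_attributes (
    product_id INTEGER NOT NULL,
    attribute  TEXT    NOT NULL,
    value      TEXT
);
CREATE INDEX idx_pa_product ON product_attributes (product_id);
""")

# Ask the engine how it would run the query instead of guessing.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT attribute, value FROM product_attributes WHERE product_id = ?",
    (1,),
).fetchall()

# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)
```

If the index name does not show up in the plan for a query you expected to be indexed, that is the cue to revisit the index or the query before changing the schema.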
I have three tables with common fields - users, guests and admins.
The last two tables have some of the users fields.
Here's an example:
users
id|username|password|email|city|country|phone|birthday|status
guests
id|city|country|phone|birthday
admins
id|username|password|status
I'm wondering if it's better to:
a)use one table with many NULL values
b)use three tables
The question is less about "one table with many NULLs versus three tables" than about the data structure. The real question is how other tables in your data structure will refer to these entities.
This is a classic situation, where you have "one-of" relationships and need to represent them in SQL. There is a "right" way, and that is to have four tables:
"users" (I can't think of a good name) would encompass everyone and have a unique id that could be referenced by other tables
"normal", "admins", "guests" each of which would have a 1-0/1 relationship with "users"
This allows other tables to refer to any of the three types of users, or to users in general. This is important for maintaining proper relationships.
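A sketch of that layout, using Python's sqlite3 as a stand-in. For brevity it shows only the "admins" and "guests" subtype tables (the "normal" one is dispensed with, as discussed next), and the subtype columns are borrowed loosely from the question. Making user_id the primary key of each subtype table is what keeps the relationship at most one-to-one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Everyone gets a row here; other tables can reference users(id).
CREATE TABLE users (id INTEGER PRIMARY KEY);

-- Subtype tables: primary key = foreign key, so 0 or 1 row per user.
CREATE TABLE admins (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    username TEXT NOT NULL,
    status   TEXT NOT NULL
);
CREATE TABLE guests (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    city    TEXT,
    country TEXT
);

INSERT INTO users (id) VALUES (1), (2), (3);
INSERT INTO admins VALUES (1, 'root', 'active');
INSERT INTO guests VALUES (2, 'Oslo', 'NO');
""")

# The kind of user is derived from which subtype row exists.
kinds = conn.execute("""
SELECT u.id,
       CASE WHEN a.user_id IS NOT NULL THEN 'admin'
            WHEN g.user_id IS NOT NULL THEN 'guest'
            ELSE 'normal' END AS kind
FROM users u
LEFT JOIN admins a ON a.user_id = u.id
LEFT JOIN guests g ON g.user_id = u.id
ORDER BY u.id
""").fetchall()

print(kinds)
```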
You have suggested two shortcuts. One is that there is no information about "normal" users so you dispense with that table. However, this means that you can't refer to "normal" users in another table.
Often, when the data structures are similar, the data is simply denormalized into a single row (as in your solution a).
All three approaches are reasonable in the context of applications that have specific needs. As for performance, the cost of having additional NULLable columns is generally minimal when the data types are variable length. If a lot of the additional columns are numeric, then these occupy real space even when NULL, which can be a factor in designing the best solution.
In short, I wouldn't choose between the different options based on the premature optimization of which might be better. I would choose between them based on the overall data structure needed for the database, and in particular, the relationships that these entities have with other entities.
EDIT:
Then there is the question of the id that you use for the specialized tables. There are two ways of doing this. One is to have a separate id, such as AdminId and GuestId for each of these tables. Another column in each table would be the UserId.
This makes sense when other entities have relationships with these particular entities. For instance, "admins" might have a sub-system that describes rights and roles and privileges that they have, perhaps along with a history of changes. These tables (ahem, entities) would want to refer to an AdminId. And, you should probably oblige by letting them.
If you don't have such tables, then you might still split out the Admins, because the 100 integer columns they need are a waste of space for the zillion other users. In that case, you can get by without a separate id.
I want to emphasize that you have asked a question that doesn't have a "best" answer in general. It does have a "correct" answer by the rules of normalization (that would be 4 tables with 4 separate ids). But the best answer in a given situation depends on the overall data model.
Why not have one parent user table with three foreign-keyed detail tables? That allows a unique user id that can transition between types.
I generally agree with Chriseyre2000, but in your specific example, I don't see a need for the other 2 tables. Everything is contained in users, so why not just add Guest and Admin bit fields? Or even a single UserType field.
Though Chriseyre2000's solution will give you better scalability should you later want to add fields that are specific to guests and admins.
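The single-table variant with a UserType field might look like the sketch below (Python's sqlite3 as a stand-in; the user_type values, the password_hash column name, and the sample rows are assumptions). The CHECK constraint keeps the type column to a known set of values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    user_type     TEXT NOT NULL DEFAULT 'user'
                  CHECK (user_type IN ('user', 'guest', 'admin')),
    username      TEXT,
    password_hash TEXT,
    email         TEXT,
    city          TEXT,
    country       TEXT,
    phone         TEXT,
    birthday      TEXT,
    status        TEXT
);

-- Guests and admins simply leave the columns they do not use as NULL.
INSERT INTO users (user_type, city, country) VALUES ('guest', 'Oslo', 'NO');
INSERT INTO users (user_type, username, password_hash, status)
VALUES ('admin', 'root', 'placeholder-hash', 'active');
""")

admins = conn.execute(
    "SELECT username FROM users WHERE user_type = 'admin'"
).fetchall()

# An unknown type is rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO users (user_type) VALUES ('superuser')")
    bad_type_rejected = False
except sqlite3.IntegrityError:
    bad_type_rejected = True

print(admins, bad_type_rejected)
```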
I need to implement custom fields in a booking software. I need to extend some tables containing, for example, the user groups with dynamic attributes.
But also, a product table where each product can have custom fields (and ideally these fields could be nested).
I already made some searches about EAV but I read many negative comments, so I'm wondering which design to use for this kind of things.
I understand using EAV causes many joins to sort a page of products, but I don't feel like I want to alter the groups/products tables, each time an attribute is created.
Note: I use InnoDB.
The only good solution is pretty much what you don't want to do: alter the groups/products tables each time an attribute is created. It's a pain, yes, but it will guarantee data integrity and better performance.
If you don't want to do that, you can create a table with TableName, FieldName, ID and Value columns, and hold, let's say:
TableName='Customer', FieldName='Address', ID=1 (the customer's ID), Value='customer's address'
But as you said, it will need loads of joins. I don't think it is a good solution, I've seen it but wouldn't really recommend it. Just showing because well, it is one possible solution.
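To show where the joins come from, here is a sketch of that generic table using Python's sqlite3. The snake_case column names, the second field (VatNo), and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Generic custom-field store: TableName, FieldName, ID, Value.
CREATE TABLE custom_values (
    table_name TEXT    NOT NULL,
    field_name TEXT    NOT NULL,
    row_id     INTEGER NOT NULL,
    value      TEXT,
    PRIMARY KEY (table_name, field_name, row_id)
);

INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO custom_values VALUES
    ('Customer', 'Address', 1, '12 Elm St'),
    ('Customer', 'Address', 2, '3 Oak Ave'),
    ('Customer', 'VatNo',   2, 'X123');
""")

# Every custom field that is shown or sorted on costs one more join.
rows = conn.execute("""
SELECT c.name, addr.value AS address, vat.value AS vat_no
FROM customers c
LEFT JOIN custom_values addr
       ON addr.table_name = 'Customer' AND addr.field_name = 'Address'
      AND addr.row_id = c.id
LEFT JOIN custom_values vat
       ON vat.table_name = 'Customer' AND vat.field_name = 'VatNo'
      AND vat.row_id = c.id
ORDER BY address
""").fetchall()

print(rows)
```

Note also that row_id cannot be a foreign key, and every value is stored as text regardless of its real type.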
Another solution would be to add several pre-defined columns to your tables, like column1, column2, column3 and so on, and use them as necessary. It's a solution as bad as the previous one, but I've seen major ERPs that use it.
Mate, based on experience, anything you find in this area will be a huge workaround and won't be worth implementing; the headache of maintaining it will be bigger than that of adding your fields to your table. Keep it simple and correct.
I am working on a project entirely based on EAV. I agree that EAV makes things complex and slow, but it has its own advantages: we don't need to change the database structure or code to add new attributes, and we can have hierarchies among the data in the database tables.
The system can get extremely slow if we are using EAV at all the places.
But EAV is very helpful if used wisely. I would never design my entire DB based on EAV. I would separate out the common and useful attributes and put them in flat tables, while for the additional attributes (which might need to change depending on clients or various requirements) I would use EAV.
This way we get the advantage of EAV, namely the flexibility you want, without much trouble.
This is just my suggestion, there might be a better solution.
You can do this by adding at least 2 more tables.
One table will contain a unique key per attribute (attr_id) plus the attribute's details, such as the attribute name and anything else needed by your business logic.
The second table will serve as a join between, say, your products table and the attributes table, and should have the following fields:
(id, product_id, attr_id)
This way, you can add as many dynamic attributes as you like, and your database schema will be future proof.
The only downside is that queries will now have to join 2 more tables.
When should one use one to one relationships? When should you add new fields and when should you separate them into a new table?
It seems to me that you'd use it whenever you're grouping fields and/or that group tends to be optional. Yes?
I'm trying to create the tables for an object, but grouping/separating everything would require about 20 joins, some of them even 4 levels deep.
Am I doing something wrong? How can I improve?
First, I highly recommend reading about Normal Forms.
A normalized relational database is extremely useful, and doing this properly is the reason tools such as Hibernate exist: to help manage the difference between objects-represented-as-relational-mappings and objects-as-programmatic-entities.
Anything that has a one-to-one mapping should probably be in the same table. A Person has only one first name, one last name. Those should logically be in the same table. Having a reference to a table of names isn't necessary - in particular because little additional data can be stored about a name. Obviously, this isn't always true (an etymology database might want to do exactly that), but for most uses, you don't care about where a name comes from - indeed all you want is the name.
Therefore, think of the objects being represented. A person has some singular data points, and some one-to-many relationships (addresses where they have lived, for instance). One-to-many and many-to-many relationships will almost always require a separate table (or two, for many-to-many). Following those two guidelines, you can get a normalized database pretty fast.
Note that optional fields should be avoided if at all possible. Usually this means having a separate table holding the field, with a reference back to the original table. Try to keep your tables lean. If a field isn't likely to have a value, it probably should be a row in its own table. Many such properties suggest a 'Property' table that can hold arbitrary optional properties of a particular type (i.e. as applied to a 'Person').