For example, I have posts, answers to posts, and comments on answers.
Which is better: a separate table for each entity (posts, answers, comments), or one table with 'post_type' and 'parent_id' columns?
UPD: answers and comments have similar properties.
The best route would be 3 tables:
Posts
Answers
    Posts_ID (parent)
Comments
    Answers_ID (parent)
Posts being the ultimate parent, answers linked to it, and comments linked to answers.
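A minimal sketch of that three-table layout, using Python's built-in sqlite3 module so it can be run as-is. The column names (title, body, post_id, answer_id) and the helper function names are my own assumptions for illustration; only the parent links come from the outline above.

```python
import sqlite3

# Hypothetical column names; only the parent/child links follow the outline above.
THREE_TABLE_SCHEMA = """
CREATE TABLE posts (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE answers (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),     -- parent post
    body    TEXT NOT NULL
);
CREATE TABLE comments (
    id        INTEGER PRIMARY KEY,
    answer_id INTEGER NOT NULL REFERENCES answers(id), -- parent answer
    body      TEXT NOT NULL
);
"""


def make_three_table_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(THREE_TABLE_SCHEMA)
    return conn


def comments_for_post(conn, post_id):
    """Walk post -> answers -> comments with two plain joins (no self-join)."""
    rows = conn.execute(
        """
        SELECT c.body
        FROM posts p
        JOIN answers a  ON a.post_id = p.id
        JOIN comments c ON c.answer_id = a.id
        WHERE p.id = ?
        ORDER BY c.id
        """,
        (post_id,),
    )
    return [row[0] for row in rows]
```

With the foreign keys enforced, a comment pointing at a non-existent answer is rejected by the database rather than by application code.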
Whether you have separate tables for posts and comments depends partly on whether one can be used instead of the other and partly on how similar the processing is for each. Put another way, how much information is unique to each? If there is nothing or almost nothing unique to each, then one table would very probably work.
Under certain circumstances it makes sense to use a single table with a differentiator column (option 2) BUT any columns that belong solely to one type should go into its own table, and that table will have a foreign key mapping back to the main table. This will form a table inheritance hierarchy.
This makes sense if the objects these tables represent share a significant number of columns, and their commonalities allow some processing to be generalised across the different types without having to know the specific subtype.
If the majority of your queries end up having to filter by the differentiator column, what real advantage do you gain from storing them in the same place?
In short: K.I.S.S. applies; if there is a programming advantage to be had by using the single table approach, use that; otherwise keep it simple and use 3 (counterintuitive, I know, but the added cognitive load of doing self-joins all the time should convince you).
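For comparison, here is a rough sketch of the single-table-plus-differentiator variant with one subtype table hanging off it. All names here (entries, kind, post_details, title) are assumptions made up for the example, and I have assumed that only posts carry a title.

```python
import sqlite3

# Hypothetical names: 'entries' is the shared table, 'kind' the differentiator,
# and 'post_details' holds a column that (by assumption) only posts have.
INHERITANCE_SCHEMA = """
CREATE TABLE entries (
    id        INTEGER PRIMARY KEY,
    kind      TEXT NOT NULL CHECK (kind IN ('post', 'answer', 'comment')),
    parent_id INTEGER REFERENCES entries(id),  -- NULL for top-level posts
    body      TEXT NOT NULL
);
CREATE TABLE post_details (
    entry_id INTEGER PRIMARY KEY REFERENCES entries(id),
    title    TEXT NOT NULL
);
"""


def make_inheritance_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(INHERITANCE_SCHEMA)
    return conn


def recent_bodies(conn, limit):
    """Generic processing: works for every subtype without looking at 'kind'."""
    rows = conn.execute(
        "SELECT body FROM entries ORDER BY id DESC LIMIT ?", (limit,)
    )
    return [row[0] for row in rows]


def children_of(conn, entry_id):
    """Direct children of an entry, whatever their subtype."""
    rows = conn.execute(
        "SELECT id, kind FROM entries WHERE parent_id = ? ORDER BY id", (entry_id,)
    )
    return list(rows)


def post_titles(conn):
    """Subtype-specific data needs a join back to the shared table."""
    rows = conn.execute(
        """
        SELECT d.title
        FROM entries e
        JOIN post_details d ON d.entry_id = e.id
        ORDER BY e.id
        """
    )
    return [row[0] for row in rows]
```

recent_bodies is the kind of query where the shared table pays off; if most of your queries look like post_titles or filter on kind, the three separate tables are probably the simpler choice.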
I want to implement a vote system for several different entities/tables (e.g. articles, blog posts, users).
What is the best/more efficient approach?:
Create a table votes to store all the votes of all entities?
votes
vote_id
user_id
type (articles, blogposts or users)
Create a table votes for each entity? votes_articles, votes_blogposts, votes_users
What I see is:
The first option will result in a bigger table, and there's an additional field which I need to include in my queries. On the other hand, it is a more generic table that can easily be extended for more entities if needed, and everything is centralised. (I can use a generic function to retrieve/insert/update the table.)
The second option will result in smaller tables; faster to query? But not necessarily better to maintain.
The second method has many advantages. Presumably, the votes are actually on entities, so you also have an id in each table pointing to the article, blogpost, or whatever that is being voted on. In a standard SQL database, you would like to have foreign key references to other tables, and the one-table-per-entity approach provides that capability.
You could modify the first approach to do this. However, that would require a separate column for each possible entity. And, then, you lose the easy flexibility of adding new entities.
When is the first approach advantageous? First, when maintaining valid foreign key references is not important. Second, when you often want to bring together votes as votes. For example: how many times did a user vote today, regardless of what s/he voted on? How many votes do user A and user B have in common, regardless of what they voted on? You get the idea. If votes starts to behave like its own entity, then it deserves its own table.
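To make the trade-off concrete, here is a small sketch with both layouts side by side in an in-memory SQLite database. The entity_id, article_id and blogpost_id columns are assumptions (the question only lists vote_id, user_id and type), and the function names are made up for the example.

```python
import sqlite3


def make_votes_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce FKs
    conn.executescript(
        """
        CREATE TABLE articles  (id INTEGER PRIMARY KEY);
        CREATE TABLE blogposts (id INTEGER PRIMARY KEY);

        -- Second option: one votes table per entity, each with a real foreign key.
        CREATE TABLE votes_articles (
            vote_id    INTEGER PRIMARY KEY,
            user_id    INTEGER NOT NULL,
            article_id INTEGER NOT NULL REFERENCES articles(id)
        );
        CREATE TABLE votes_blogposts (
            vote_id     INTEGER PRIMARY KEY,
            user_id     INTEGER NOT NULL,
            blogpost_id INTEGER NOT NULL REFERENCES blogposts(id)
        );

        -- First option: one generic table. entity_id (assumed column) cannot
        -- carry a foreign key because its target depends on 'type'.
        CREATE TABLE votes (
            vote_id   INTEGER PRIMARY KEY,
            user_id   INTEGER NOT NULL,
            type      TEXT NOT NULL,
            entity_id INTEGER NOT NULL
        );
        """
    )
    return conn


def votes_by_user_single(conn, user_id):
    """Votes-as-votes query against the generic table: one simple SELECT."""
    row = conn.execute(
        "SELECT COUNT(*) FROM votes WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]


def votes_by_user_split(conn, user_id):
    """Same question against per-entity tables: one UNION ALL branch per entity."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT vote_id FROM votes_articles  WHERE user_id = ?
            UNION ALL
            SELECT vote_id FROM votes_blogposts WHERE user_id = ?
        )
        """,
        (user_id, user_id),
    ).fetchone()
    return row[0]
```

The per-entity tables reject a vote on an article that does not exist, while the generic table happily stores a dangling entity_id. In exchange, the cross-entity count is a one-liner on the generic table and grows by one UNION ALL branch per entity on the split tables.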
I happen to think that your very question highlights a major weakness in SQL and relational databases. This is an example of wanting different entities to "inherit" features from a class (to borrow terminology from the OO world). Wouldn't it be nice if you could just specify that a new entity inherits properties from another entity (such as "Votable")? Oh, never mind, that's not the real world of popular databases. At least not today.
EDIT:
If you care about performance, don't go with the modified first approach -- that is, a separate column for each possible entity. Normally, primary keys are 4-byte integers. These (in most databases at least) will occupy four bytes, regardless of whether the column has a NULL value. So, one table with three entity columns is (to a very rough approximation) three times the size of three tables specialized for each entity. Such wasted space only slows down the query processing.
If you are only going to have two or three entities, maybe this isn't that big a deal. But once you get to more than you can count on one hand, it really is a waste of space, memory, and processing power.
I have a table titled videos. In it there are three columns: media_id, project_id, and video_url. My question is: is it necessary for me to have media_id? I'm not using it in any other tables. I would expect there to be multiple rows with the same project_id but different video_urls.
Having or not having surrogate ID's for something has nothing to do with normalization.
(copyright catcall)
Having or not having surrogate ID's for something depends on whether or not you have a real use for it. You already gave the answer to that yourself. It also depends on whether or not there is a significant likelihood that, even if there is no actual use for it right now, such a use might quickly emerge in the near future.
You could use project_id and video_url together as the key in your model (every other column is functionally dependent on that pair), but at a physical level I would not like to use a URL as part of a key.
By this I mean I prefer an ID or number to avoid typing in long string each time the key is referenced in different tables.
I would consider it necessary. This is purely based on the fact that the media entry is unique and there could be multiple media entries for any one project. This keeps a unique id for the row, a proper project relationship and the valuable URL data for the media resource.
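One way to get both, sketched with sqlite3: keep media_id as a short surrogate key for other tables to reference, and still declare (project_id, video_url) unique so the natural key is enforced. The add_video helper and the example.com URLs are made up for illustration.

```python
import sqlite3


def make_videos_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE videos (
            media_id   INTEGER PRIMARY KEY,   -- surrogate key, short to reference
            project_id INTEGER NOT NULL,
            video_url  TEXT NOT NULL,
            UNIQUE (project_id, video_url)    -- natural key is still enforced
        );
        """
    )
    return conn


def add_video(conn, project_id, video_url):
    """Insert a video row and return the generated media_id."""
    cur = conn.execute(
        "INSERT INTO videos (project_id, video_url) VALUES (?, ?)",
        (project_id, video_url),
    )
    return cur.lastrowid


if __name__ == "__main__":
    db = make_videos_db()
    first = add_video(db, 1, "https://example.com/a.mp4")
    second = add_video(db, 1, "https://example.com/b.mp4")
    print(first, second)  # two different ids for the same project
```

If no other table ever needs to point at a video row, the UNIQUE pair alone would do and media_id could be dropped.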
I am designing a database which holds a lot of information for a user. Currently I need to store 20 different values, but over time I could be adding more and more.
I have looked around StackOverflow for similar questions, but it usually ends up with the asker just not designing his table correctly.
So based on what I have seen around StackOverflow, should I:
Create a table with many null columns and use them when needed (this seems terrible to me)
Create a users table and a information table where information is a key-value pair: [user_id, key, value]
Anything else you can suggest?
Keep in mind this is for a MySQL database, so I understand the dislike for a key-value table in a relational database.
Thanks.
Hmm, I am a bit confused by the question, but it sounds like you want to have lots of attributes for one user, right? And in the future you want to add more?
Well, isn't that just a matter of having a reference table of some sort, say customer_attribute_ref, so that you can add a new attribute by inserting a row into it? Then the customer table has at least three columns: 1. customer ID, 2. customer attribute ID, 3. customer attribute value.
Maybe I missed the point of your question. Can you clarify?
I'd suggest 3. A hybrid of 1 and 2. That is, put your core fields, which are already known, and you know you'll be querying frequently, into the main table. Then add the key-value table for more obscure or expanded properties. I think this approach balances competing objectives of keeping your table width relatively narrow, and minimizing the number of joins needed for basic queries.
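Roughly what I mean by the hybrid, as a runnable sqlite3 sketch (the same idea carries over to MySQL). The core columns email and display_name, the attr_key/attr_value names and the helper functions are all assumptions for the example.

```python
import sqlite3


def make_users_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Core fields that are known up front and queried often (assumed names).
        CREATE TABLE users (
            id           INTEGER PRIMARY KEY,
            email        TEXT NOT NULL,
            display_name TEXT NOT NULL
        );
        -- Key-value table for the obscure or later-added properties.
        CREATE TABLE user_attributes (
            user_id    INTEGER NOT NULL REFERENCES users(id),
            attr_key   TEXT NOT NULL,
            attr_value TEXT,
            PRIMARY KEY (user_id, attr_key)
        );
        """
    )
    return conn


def set_attribute(conn, user_id, key, value):
    """Insert the attribute, or overwrite it if the user already has that key."""
    conn.execute(
        "INSERT OR REPLACE INTO user_attributes (user_id, attr_key, attr_value) "
        "VALUES (?, ?, ?)",
        (user_id, key, value),
    )


def get_profile(conn, user_id):
    """Merge core columns and key-value rows into one dict (None if no such user)."""
    row = conn.execute(
        "SELECT email, display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    profile = {"email": row[0], "display_name": row[1]}
    extras = conn.execute(
        "SELECT attr_key, attr_value FROM user_attributes WHERE user_id = ?",
        (user_id,),
    )
    for key, value in extras:
        profile[key] = value
    return profile
```

Basic queries only touch users; the second table is read only when the extra properties are actually needed.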
Another approach you could consider instead of or in combination with the above is an ETL process of some kind. Maybe you define a key-value table as a convenient way for your applications to add data; then set up replication, triggers, and/or a nightly/hourly stored procedure to transform the data into a form more suitable for querying and reporting purposes.
The exact best approach should be determined by careful planning and consideration of the entire architecture of your application.
I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (i.e. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table; otherwise it holds NULL.
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?
I would say that if there is a relatively small number of NULL values, then using NULLs would be okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and placing a column referencing the "Answers" table into a new table, alongside another field referencing the "People" table. In other words, create an intermediate table between the Answers table and the People table containing all technical directors.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one Technical Director per answer then simply make the column referencing the answers table "Unique" to create a One-to-One relationship. If you need more than one technical director, leave that column non-unique to create a One-to-Many relationship. Another advantage of this design is that it simplifies the query if you ever want to extract all the technical directors. I generally use a simple rule of thumb when deciding whether to use NULL values or not. If I see the table contains lots of NULLs, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, this design might require an Inner or Outer Join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity and your business rules when making these decisions.
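A quick sketch of the intermediate-table idea in sqlite3. The table name tech_directors, the company and name columns, and the helper functions are assumptions for the example; the UNIQUE constraint on answer_id is what gives the one-to-one variant.

```python
import sqlite3


def make_directors_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE people  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE answers (id INTEGER PRIMARY KEY, company TEXT NOT NULL);

        -- One row per technical director; companies without one have no row.
        CREATE TABLE tech_directors (
            answer_id INTEGER NOT NULL UNIQUE REFERENCES answers(id),
            person_id INTEGER NOT NULL REFERENCES people(id)
        );
        """
    )
    return conn


def all_directors(conn):
    """Extracting every technical director needs no NULL filtering."""
    rows = conn.execute(
        """
        SELECT p.name
        FROM tech_directors td
        JOIN people p ON p.id = td.person_id
        ORDER BY p.id
        """
    )
    return [row[0] for row in rows]


def answers_with_directors(conn):
    """Viewing all answers, with or without a director, takes an outer join."""
    rows = conn.execute(
        """
        SELECT a.company, p.name
        FROM answers a
        LEFT JOIN tech_directors td ON td.answer_id = a.id
        LEFT JOIN people p          ON p.id = td.person_id
        ORDER BY a.id
        """
    )
    return list(rows)
```

Dropping the UNIQUE keyword from answer_id would turn this into the one-to-many variant.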
Logically, if there is no director, there should be NULL.
In business logic, you would have a reference to a Director object there; if there is no director, there should also be null instead of the reference.
Using a boolean in fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also there are joins that are optimized to do that efficiently in one query, so no additional lookups are necessary.
You could argue that it depends on how many people have a director, so you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed to make a foreign key that allows NULL, as you proposed in the first option.
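For what it's worth, the nullable foreign key plus a single outer join looks roughly like this in sqlite3. The techDir column and the "people" table come from the question; the answers table name, the company and name columns, and the director_name helper are assumptions.

```python
import sqlite3


def make_nullable_fk_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE answers (
            id      INTEGER PRIMARY KEY,
            company TEXT NOT NULL,
            techDir INTEGER REFERENCES people(id)  -- NULL means "no director"
        );
        """
    )
    return conn


def director_name(conn, answer_id):
    """One query, no second lookup: the LEFT JOIN yields NULL when techDir is NULL."""
    row = conn.execute(
        """
        SELECT p.name
        FROM answers a
        LEFT JOIN people p ON p.id = a.techDir
        WHERE a.id = ?
        """,
        (answer_id,),
    ).fetchone()
    return row[0] if row is not None else None


def has_director(conn, answer_id):
    """The boolean from the second design falls out of the same column."""
    row = conn.execute(
        "SELECT techDir IS NOT NULL FROM answers WHERE id = ?", (answer_id,)
    ).fetchone()
    return bool(row[0]) if row is not None else False
```

A NULL techDir passes the foreign key check, while a non-NULL value must match an existing row in people.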
I think the null for that column is ok. As far as I remember from my DB class at uni (long time ago), null is an excellent choice to represent "I don't know" or "it doesn't have".
I think the second design has the following flaw: you didn't mention how to look up the techDir of a specific question; you said that you just tag the person. Another problem might be that if you add another role in the future, the schema won't support it.
NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys: one back to the original table and one forward to the related table. In cases where there is no relationship, just omit the entire row.
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.
Assuming you were modeling a Q&A database using MySQL, I am aware of two ways to approach the model architecture:
Create a single table for questions and answers with a "typeId"
Create two separate tables; one for questions and one for answers
Can anyone elaborate on the advantages and disadvantages of both approaches, and why you would use one approach over the other?
My own observations:
Approach 2 is more normalized
Approach 2 requires two "comments" tables for Q's and A's, or a single table with composite PKs (Q's & A's may have identical IDs)
Approach 1 can become very complicated with self joins and so on
The specifics of the design would really depend on your requirements, what you want to achieve, and how large your database will be.
1-table approach:
You may be able to use a single table in the case where you only provide/allow one answer per question (à la FAQ), where you would only have id,question,answer fields and questions are not added to DB until answer is given, or update the row when answer is available.
2-table approach:
Use this as soon as there may be more than one answer/comment per question. I would choose a model a little different from #Spredzy's: I would store everything the way emails are stored (message_id, in_reply_to, timestamp, text) for simplicity. This simplicity will not let you tag specific types (answers vs. comments), unless there is only one answer per question and replies to that answer become comments, as on SO. Questions are those with in_reply_to IS NULL.
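Here is a small sqlite3 sketch of that email-style model. The four column names are the ones listed above; the messages table name, the timestamp default and the helper functions are assumptions.

```python
import sqlite3


def make_messages_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE messages (
            message_id  INTEGER PRIMARY KEY,
            in_reply_to INTEGER REFERENCES messages(message_id),  -- NULL = question
            timestamp   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            text        TEXT NOT NULL
        );
        """
    )
    return conn


def questions(conn):
    """Questions are simply the messages that reply to nothing."""
    rows = conn.execute(
        "SELECT message_id, text FROM messages "
        "WHERE in_reply_to IS NULL ORDER BY message_id"
    )
    return list(rows)


def replies(conn, message_id):
    """Direct replies to one message: answers to a question, or comments on an answer."""
    rows = conn.execute(
        "SELECT message_id, text FROM messages "
        "WHERE in_reply_to = ? ORDER BY message_id",
        (message_id,),
    )
    return list(rows)
```

Whether a reply counts as an answer or a comment is only implied by its depth in the thread, which is the limitation mentioned above.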
3/more-table approach:
If you really want performance from a FIXED-ROW length on the main table, and don't need to display excerpts of questions and answers but only want to know numbers, you would separate out the text, any attachments, etc. Or you might simply want to avoid self joins, as suggested by #orangepips ("Finally, self joins suck and present an excellent way to kill performance."), and have separate tables for everything.
Model this as two tables. Questions can have more than one Answer. Create separate Comment tables for Questions and Answers; most likely use case I imagine does not see the comment data intermingling in a single DML statement.
A single table distinguished by a type column might make sense if you were representing an object model's inheritance, but that's not the case here. In addition, the intent of the table is muddied for anyone who reviews the schema because they'd need to know the enumerated possibilities for the type; it could be a lookup table, I suppose, but for two possibilities (and no more) that seems a waste.
Finally, self joins suck and present an excellent way to kill performance.
I would create 2 tables:
One that represents Question, Answer and Comment. If you look carefully, they have the same core data: user_id, text, date, plus a type_id field and all the other fields you might need.
The other table would be a pretty simple table : type
type_id     type_desc
xxx-x-xx    question
xxx-x-xx    answer
xxx-x-xx    comment
By doing that, your model will be highly scalable, faster with no duplication of data (normalization).
Finally, technically speaking, getting all the questions, or all the answers to one question, is just a simple join.
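A rough sqlite3 sketch of those two tables and the join. The type table with type_id and type_desc, and the user_id, text, date and type_id fields, are from the description above; the entry table name, the parent_id column linking an answer to its question, the integer type ids and the answers_to helper are assumptions I added to make the example runnable.

```python
import sqlite3


def make_typed_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE type (
            type_id   INTEGER PRIMARY KEY,
            type_desc TEXT NOT NULL UNIQUE
        );
        INSERT INTO type (type_id, type_desc)
        VALUES (1, 'question'), (2, 'answer'), (3, 'comment');

        -- 'entry' and 'parent_id' are assumed names; the other columns
        -- follow the core data listed above.
        CREATE TABLE entry (
            id        INTEGER PRIMARY KEY,
            type_id   INTEGER NOT NULL REFERENCES type(type_id),
            parent_id INTEGER REFERENCES entry(id),
            user_id   INTEGER NOT NULL,
            text      TEXT NOT NULL,
            date      TEXT NOT NULL
        );
        """
    )
    return conn


def answers_to(conn, question_id):
    """All answers to one question: a single join against the type table."""
    rows = conn.execute(
        """
        SELECT e.text
        FROM entry e
        JOIN type t ON t.type_id = e.type_id
        WHERE e.parent_id = ? AND t.type_desc = 'answer'
        ORDER BY e.id
        """,
        (question_id,),
    )
    return [row[0] for row in rows]
```

Note that every query of this kind has to filter on the type, which is the cost the other answers point out.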
Hope it could help,
One table per type of data. If questions and answers are identical (as if objects in OOP), one table suffices. If not, not.
A single comment table with composite PK's is right because the comments are still of one type of object: Comment. The fact that they can reference both Q's and A's doesn't affect that.