Let's say I have a simple situation. I have a user table for my website, and I need somewhere to store activation codes for email verification.
Which is the better option?
An activation_code column in the user table, which means I'll get NULLs or stale data once the email is verified. On the other hand, I'm guaranteed only one code per user, and it's one table fewer.
A separate table for the codes. No stale data or NULLs, but an additional table for a single column (plus user_id).
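In sketch form (MySQL-style DDL; the names are just illustrative):

-- Option 1: the code lives on the user row, NULL once verified
CREATE TABLE users (
    id              INT PRIMARY KEY AUTO_INCREMENT,
    email           VARCHAR(255) NOT NULL,
    activation_code CHAR(32) NULL
);

-- Option 2: codes in their own table; a row exists only while unverified
CREATE TABLE activation_codes (
    user_id INT PRIMARY KEY,
    code    CHAR(32) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);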
Because I try to avoid NULLs (that's how I was taught, though I'm not sure it was right), I would prefer the second option. But I've seen the first approach in many web apps, which is why I'm asking.
I would advocate for the separate table because in my opinion it models the data better. And it also allows you to generate multiple activation codes for the same user - this can come in handy!
On the other hand, I totally disagree with the practice of avoiding NULLs. I know some people advocate steering clear of them, for reasons that can usually be attributed to laziness, but the reality is that NULLs are very useful in modeling data! They have a purpose, which is to represent missing or unknown values, and they should definitely be used for that purpose!
I'm currently choosing between two different database designs: a complicated one that separates the data better, and a simpler one. The complicated design will require more complex queries, while the simpler one will have a couple of NULL fields.
Consider the examples below:
Complicated:
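(The original schema was posted as an image; reconstructed here as a sketch from the description below:)

users
  id (PK)
  ...columns shared by all users...

normalusers
  id (PK)
  user_id (FK -> users.id)
  username
  password

facebookusers
  id (PK)
  user_id (FK -> users.id)
  facebookUserId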
Simpler:
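(Again reconstructed from the description below:)

users
  id (PK)
  username (NULL for Facebook users)
  password (NULL for Facebook users)
  facebookUserId (NULL for normal users)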
The above examples are for separating regular users and Facebook users (they will eventually access the same data, but log in differently). In the first example, the data is clearly separated. The second example is much simpler, but will have at least one NULL field per row: facebookUserId will be NULL for a normal user, while username and password will be NULL for a Facebook user.
My question is: which is preferred? Pros/cons? Which one is easier to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
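A minimal sketch of shared primary key (table names from the question; the column details are just an illustration):

CREATE TABLE users (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100)
);

CREATE TABLE normalusers (
    id       INT PRIMARY KEY,   -- same value as users.id
    username VARCHAR(50) NOT NULL,
    password CHAR(60)    NOT NULL,
    FOREIGN KEY (id) REFERENCES users(id)
);

CREATE TABLE facebookusers (
    id             INT PRIMARY KEY,   -- same value as users.id
    facebookUserId BIGINT NOT NULL,
    FOREIGN KEY (id) REFERENCES users(id)
);

-- The shared key makes the join trivial:
SELECT u.*, f.facebookUserId
FROM users u JOIN facebookusers f ON f.id = u.id;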
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other Facebook-specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade-off is that it will take more effort to query, more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
The second option is just the reverse of the first. It's easier, but how will you handle future authentication methods, and what if you need to add authentication-method-specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example, Postgres has table inheritance, which would be great for your example; have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now, if you do not have table inheritance, you can still create views to simplify your queries, so the "complicated" example remains a viable choice here.
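For illustration, a rough sketch of both approaches (Postgres syntax for the inheritance; the view assumes the "complicated" two-table design above, and all names are illustrative):

-- With inheritance: the child table gets all the parent's columns
CREATE TABLE users (
    id   SERIAL PRIMARY KEY,
    name TEXT
);
CREATE TABLE facebook_users (
    facebook_user_id BIGINT
) INHERITS (users);

-- Without inheritance: a view can hide the join instead
CREATE VIEW all_users AS
SELECT u.id, u.name, f.facebookUserId
FROM users u
LEFT JOIN facebookusers f ON f.user_id = u.id;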
If you had infinite time, I would go for the first one (for this simple example, and preferably with table inheritance).
However, this makes things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this, it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten them in the implementation, as the schema had become so complex that it was no longer maintainable.
When you flatten the hierarchy, you might consider avoiding NULL values, as they too can make things a lot harder (alternatively, you can use a sentinel value such as -1).
Hope these thoughts help you!
Warning bells are ringing loudly at the presence of the two very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane.
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build too much kung fu into it via the data structure. Leave that for the application, which is far easier to alter than the database!
Let me dare to suggest a third option. You could introduce one (or two) tables that cater for extensibility. I personally try to avoid designs that introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) hold a many-to-one relationship with your users table, to cater for multiple/variable user-related fields.
I'm not sure what your current/short-term needs are, but re-engineering your app to cater for, say, Twitter or LinkedIn users might be painful. You can abstract the content of the facebookUserId column into an attribute table like so:
user_attr{
id PK
user_id FK
login_id
}
Now, the above definition is generic enough to handle your current needs. If done right, the EAV should look more like this:
user_attr{
id PK
user_id FK
login_id
login_id_type FK
login_id_status //simple boolean flag to set the validity of a given login
}
Here login_id_type is a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility, in that a user can have multiple logins via different external services without you having to change much of your existing system.
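For example, fetching a user's valid logins might then look like this (login_types is the lookup table described above; its name and the sample user_id are assumptions):

SELECT ua.login_id, lt.name AS login_type
FROM user_attr ua
JOIN login_types lt ON lt.id = ua.login_id_type
WHERE ua.user_id = 42
  AND ua.login_id_status = 1;   -- only currently valid logins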
Here is a concrete example:
Wordpress stores user information (meta) in a table called wp_usermeta, where you get the meta_key field (e.g. first_name) and the meta_value (John).
However, after only 50 or so users, the table already packs about 1,219 records.
So, my question is: On a large scale, performance wise, would it be better to have a table with all the meta as a field, or a table like WordPress does with all the meta as a row ?
Indexes are properly set in both cases. There is little to no need to add new metas. Keep in mind that a table like wp_usermeta must use a text/longtext field type (large footprint) in order to accommodate any type of data that could be entered.
My assumptions are that the WordPress approach is only good when you don't know what the user might need. Otherwise:
Retrieving all the meta requires more I/O because the fields aren't stored in a single row, and the value column can't be type-optimized.
You can't really have an index on the meta_value field without suffering major drawbacks (indexing a longtext? Unless it's a partial index... but then, how long?).
Soon your database is cluttered with many rows, slowing your searches even for the most precise meta lookup.
Developer-friendliness is absent: you can't really write a single join query that gets everything you need, properly displayed.
I may be missing a point though. I'm not a database engineer, and I know only the basics of SQL.
You're talking about Entity-Attribute-Value.
- Entity = User, in your Wordpress Example
- Attribute = 'First Name', 'Last Name', etc
- Value = 'John', 'Smith', etc
Such a schema is very good at allowing a dynamic number of Attributes for any given Entity. You don't need to change the schema to add an Attribute. Depending on the queries, the new attributes can often be used without changing any SQL at all.
It's also perfectly fast enough at retrieving those attributes values, provided that you know the Entity and the Attribute that you're looking for. It's just a big fancy Key-Value-Pair type of set-up.
It is, however, not so good where you need to search the records based on the Value contents. Such as: get me all users called 'John Smith'. Trivial to ask in English. Trivial to code against a 'normal' table: first_name = 'John' AND last_name = 'Smith'. But non-trivial to write in SQL against EAV, and with awful relative performance (get all the Johns, then all the Smiths, then intersect them to get the Entities that match both).
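To make the difference concrete, a sketch of the 'John Smith' search both ways (wp_usermeta-style names; illustrative only):

-- Against a 'normal' table: trivial
SELECT * FROM users
WHERE first_name = 'John' AND last_name = 'Smith';

-- Against EAV: one extra join per attribute you filter on
SELECT u.*
FROM users u
JOIN wp_usermeta m1 ON m1.user_id = u.ID
  AND m1.meta_key = 'first_name' AND m1.meta_value = 'John'
JOIN wp_usermeta m2 ON m2.user_id = u.ID
  AND m2.meta_key = 'last_name'  AND m2.meta_value = 'Smith';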
There is a lot said about EAV on-line, so I won't go in to massive detail here. But a general rule of thumb is: If you can avoid it, you probably should.
Depends on the number of names packed into wp_usermeta on average.
Text field searches are notoriously slow. Indexes are generally faster.
But some data warehouses index the crap out of every field and Wordpress might be doing the same thing.
I would vote for meta as a field not a row.
Good SQL, good night.
Mike
Examples from two major pieces of software in the GPL arena illustrate how big a difference there is between the two designs:
Wordpress & oScommerce
Both have their flaws and strengths, and both are massively dominant in their respective areas, with a lot of things done with them. But one of the fundamental and biggest differences between them is their approach to database table design. Of course, when comparing them, their code architecture also plays a role in how fast they do searches, but each is hampered by its own drawbacks and boosted by its own advantages, so the comparison is more or less accurate for production environments.
Wordpress uses EAV. The general data (called posts with different post types) is stored as the main entity, and all else is stored in post meta tables. Some fundamental data is stored in the main table, like revisions, post type etc, but almost all the rest is stored in metas.
VERY good for adding, modifying data, and therefore functionality.
But try a search with a complex SQL join which needs to pick up 3-4 different values from the meta table and get the resulting set. It's an iron dog. Searches come out VERY slow, depending on the data you are looking for.
This is one of the reasons why you don't see many Wordpress plugins that need to host complex data, and the ones which actually do tend to create their own tables.
oScommerce, on the other hand, keeps almost all product-related data in the products table, and the majority of oScommerce mods modify this table and add their own fields. There is a products_attribute table, but it is also rather flat, with no meta design; it's just linked to products via product IDs.
As a result, despite being aged spaghetti code from a very long time ago, oScommerce comes up with stunningly fast search results even when you search for complicated and combined product criteria. In fact, most of oScommerce's normal display code (like what it shows on the product details page) comes from quite complicated SQL pulling data from around 2-3 tables in complicated join statements. A comparably much simpler SQL query with even one join can make Wordpress duke it out with the database.
Therefore the conclusion is rather plain: EAV is very good for easy extension and modification of data for extended functionality (as in Wordpress). Flat, big monolithic tables are MUCH better for databases that represent complicated records and have complicated, multi-criteria searches run on them.
It's a question of the application.
From what I've seen, the EAV model doesn't affect performance, unless you need the null values. In that case you should do a join with the table that holds all the type_meta.
I don't agree with the answer of Dems.
If you want to build the full name of the user, you don't ask for every name that matches.
For that you should use fifth or sixth normal form (5NF/6NF).
Or you may even have a table of the user entity where you have:
id
username
password
salt
and there you go. That's the base, and for all the user "extra" data you should have user_meta and user_type_meta entities, joined back to the user.
I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked by UserIds and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) If different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to be multiple values in the future. E.g., credit limit is a single-value field as of now, but tomorrow you may decide to store it as (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation, which may take dozens of times as long as a pure record scan.
Moving all your records into one table will help you get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in other tables, then the increase in the table scan can outweigh the benefits of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance, looking only at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then with a normalized design you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations, etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
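A sketch of that pattern against the tables in the question (MySQL-style DDL; the column names are assumptions):

-- Rebuild a denormalized reporting table on a schedule
DROP TABLE IF EXISTS report_users_flat;
CREATE TABLE report_users_flat AS
SELECT u.UserId, u.username, d.name, d.address,
       a.last_online, s.hits
FROM users u
LEFT JOIN user_details  d ON d.UserId = u.UserId
LEFT JOIN user_activity a ON a.UserId = u.UserId
LEFT JOIN user_stats    s ON s.UserId = u.UserId;

-- Reports then read report_users_flat with no joins at all.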
Why not use the same approach Wordpress does: have a users table with the basic user information that everyone has, and then add a "user_meta" table that can hold basically any key/value pair associated with the user ID. If you need to find all the meta information for the user, you can just add that to your query, and you don't have to pay for the extra query when it isn't needed, e.g. for logging in. This approach also leaves your table open to new user features, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, limited to one association instead of 50.
Wordpress does this specifically to allow features to be added via plugins, making your project more scalable without requiring a complete database overhaul every time you add a new feature.
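A minimal sketch of that design (Wordpress-style column names; all names are assumptions):

CREATE TABLE user_meta (
    umeta_id   INT PRIMARY KEY AUTO_INCREMENT,
    user_id    INT NOT NULL,              -- FK to users.UserId
    meta_key   VARCHAR(255) NOT NULL,
    meta_value TEXT,
    KEY user_meta_lookup (user_id, meta_key)
);

-- All meta for one user, fetched only when you actually need it:
SELECT meta_key, meta_value FROM user_meta WHERE user_id = 42;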
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip User_Details, as it is probably a 1-to-1 relation with Users.
But the rest probably hold a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
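For instance (a sketch; the table and column names are invented for illustration):

-- Hot counters live apart from the big users table
CREATE TABLE user_counters (
    user_id INT PRIMARY KEY,          -- shared key with users.id
    views   INT NOT NULL DEFAULT 0,
    likes   INT NOT NULL DEFAULT 0
);

-- The very frequent +1 touches only the small table:
UPDATE user_counters SET views = views + 1 WHERE user_id = 42;

-- Pull counters in only when needed, turning a missing row into 0:
SELECT u.id, u.name, COALESCE(c.views, 0) AS views
FROM users u
LEFT JOIN user_counters c ON c.user_id = u.id;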
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then have to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB), which means that fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.
Greetings stackers,
I'm trying to come up with the best database schema for an application that lets users create surveys and present them to the public. There are a bunch of "standard" demographic fields that most surveys (but not all) will include, like First Name, Last Name, etc. And of course users can create an unlimited number of "custom" questions.
The first thing I thought of is something like this:
Survey
ID
SurveyName
SurveyQuestions
SurveyID
Question
Responses
SurveyID
SubmitTime
ResponseAnswers
SurveyID
Question
Answer
But that's going to suck every time I want to query data out. And it seems dangerously close to the Inner-Platform Effect.
An improvement would be to include as many fields as I can think of in advance in the responses table:
Responses
SurveyID
SubmitTime
FirstName
LastName
Birthdate
[...]
Then at least queries for data from these common columns are straightforward, and I can query, say, the average age of everyone who ever answered any survey where they gave their birthdate.
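For example, something like this (MySQL-flavored; Responses and Birthdate as sketched above):

SELECT AVG(TIMESTAMPDIFF(YEAR, Birthdate, CURDATE())) AS average_age
FROM Responses
WHERE Birthdate IS NOT NULL;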
But it seems like this will complicate the code a bit. Now to see which questions are asked in a survey I have to check which common response fields are enabled (using, I guess, a bitfield in Survey) AND what's in the SurveyQuestions table. And I have to worry about special cases, like if someone tries to create a "custom" question that duplicates a "common" question in the Responses table.
Is this the best I can do? Am I missing something?
Your first schema is the better choice of the two. At this point, you shouldn't worry about performance problems. Worry about making a good, flexible, extensible design. There are all sorts of tricks you can do later to cache data and make queries faster. Using a less flexible database schema in order to solve a performance problem that may not even materialize is a bad decision.
Besides, many (perhaps most) survey results are only viewed periodically and by a small number of people (event organizers, administrators, etc.), so you won't constantly be querying the database for all of the results. And even if you were, the performance will be fine. You would probably paginate the results somehow anyway.
The first schema is much more flexible. You can, by default, include questions like name and address, but for anonymous surveys, you could simply not create them. If the survey creator wants to view only everyone's answers to three questions out of five hundred, that's a really simple SQL query. You could set up a cascading delete to automatically delete responses and questions when a survey is deleted. Generating statistics will be much easier with this schema too.
Here is a slightly modified version of the schema you provided. I assume you can figure out what data types go where :-)
surveys
survey_id (index)
title
questions
question_id (index, auto increment)
survey_id (link to surveys->survey_id)
question
responses
response_id (index, auto increment)
survey_id (link to surveys->survey_id)
submit_time
answers
answer_id (index, auto increment)
response_id (link to responses->response_id)
question_id (link to questions->question_id)
answer
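With this schema, pulling every answer for one survey is a straightforward set of joins, e.g. (survey_id = 1 is just an example value):

SELECT r.response_id, r.submit_time, q.question, a.answer
FROM responses r
JOIN answers   a ON a.response_id = r.response_id
JOIN questions q ON q.question_id = a.question_id
WHERE r.survey_id = 1
ORDER BY r.response_id, q.question_id;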
I would suggest you always take a normalized approach to your database schema and then later decided if you need to create a solution for performance reasons. Premature optimization can be dangerous. Premature database de-normalization can be disastrous!
I would suggest that you stick with the original schema and later, if necessary, create a reporting table that is a de-normalized version of your normalized schema.
One change that may or may not help simplify things would be to not link ResponseAnswers back to the SurveyID. Rather, create an ID per response and per question, and let your ResponseAnswers table contain the fields ResponseID, QuestionID, Answer. Although this requires keeping unique identifiers for each unit, it helps keep things a little more normalized. The response answers do not need to associate with the survey they answer, just the specific question and the response they belong to.
I created a customer surveys system at my previous job and came up with a schema very similar to what you have. It was used to send out surveys (on paper) and tabulate the responses.
A couple of minor differences:
Surveys were NOT anonymous, and this was made very clear in the printed forms. It also meant that the demographic data in your example was known in advance.
There was a pool of questions which were attached to the surveys, so one question could be used on multiple surveys and analyzed independently of the survey it appeared on.
Handling different types of questions got interesting -- we had a 1-3 scale (e.g., Worse/Same/Better), 1-5 scale (Very Bad, Bad, OK, Good, Very Good), Yes/No, and Comments.
There was special code to handle the comments, but the other question types were handled generically by having a table of question types and another table of valid answers for each type.
To make querying easier, you could probably create a function to return the response based on a survey ID and question ID.
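One way to read that suggestion, as a sketch of a MySQL stored function against the schema above (the function name is invented, and GROUP_CONCAT collapsing all respondents' answers into one string is only one possible interpretation):

DELIMITER //
CREATE FUNCTION survey_answers(p_survey_id INT, p_question_id INT)
RETURNS TEXT
READS SQL DATA
BEGIN
    -- All answers given to one question across every response to the survey
    RETURN (SELECT GROUP_CONCAT(a.answer SEPARATOR '; ')
            FROM answers a
            JOIN responses r ON r.response_id = a.response_id
            WHERE r.survey_id = p_survey_id
              AND a.question_id = p_question_id);
END //
DELIMITER ;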