I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
    id
    email
user_field
    id
    name
user_value
    id
    user_field_id
    user_id
    value
Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.
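To make the pain concrete, here is a minimal sketch (table and column names are hypothetical, based on the schema proposed in the question) of what fetching two ordinary attributes looks like in each design:

-- With ordinary columns: one indexed lookup.
SELECT email, city, signup_source
FROM user
WHERE id = 42;

-- With the user_field / user_value layout: one extra join per attribute,
-- and the optimizer has far less to work with.
SELECT u.email,
       city_v.value   AS city,
       source_v.value AS signup_source
FROM user u
LEFT JOIN user_value city_v
       ON city_v.user_id = u.id
      AND city_v.user_field_id = (SELECT id FROM user_field WHERE name = 'city')
LEFT JOIN user_value source_v
       ON source_v.user_id = u.id
      AND source_v.user_field_id = (SELECT id FROM user_field WHERE name = 'signup_source')
WHERE u.id = 42;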
It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
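As a rough sketch (column names invented for illustration, not taken from your actual schema), that might look like:

CREATE TABLE address (
    id       INT PRIMARY KEY,
    user_id  INT NOT NULL,            -- the FK into User described above
    street   VARCHAR(100),
    city     VARCHAR(50),
    postcode VARCHAR(10),
    FOREIGN KEY (user_id) REFERENCES user(id)
);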
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing constantly, you may want to consider putting some kind of change control in place to vet changes down to only those that are necessary, or move to another technology that better supports schema-less storage, like NoSQL with MongoDB or CouchDB.
This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.
Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.
If the changes and additions are continual then perhaps you need to sit down and do a better job of defining the requirements. Now I can't say if 30 columns is too many because it depends on how wide they are and whether they are something that should be moved to a related table. For instance, if you have fields like phone1, phone2, phone3, you have a mess that needs to be split out into a related table for user_phone. Or if all your columns are wide (and your overall table width is wider than the pages the database stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem though.
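A minimal sketch of that user_phone table (column names are assumptions):

CREATE TABLE user_phone (
    id         INT PRIMARY KEY,
    user_id    INT NOT NULL,
    phone      VARCHAR(20) NOT NULL,
    phone_type VARCHAR(10),           -- e.g. 'home', 'work', 'mobile'
    FOREIGN KEY (user_id) REFERENCES user(id)
);

This way a user can have zero phone numbers or ten without touching the schema again.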
However, of all the possible choices, the EAV model you described is the worst one from both a maintainability and a performance viewpoint. It is very hard to write decent queries against this model.
This really depends on what you're trying to do.
Related
Lately I've been rethinking a database design I made a couple of months ago. The main reason is that last night I read the database schema of vBulletin and saw that they use many, MANY, tables.
The current "idea" I'm using for my schema, for instance my log table, is to keep everything in one table by differentiating the type of log with an integer:
id, type, type_id, action, message
1,  1,    305,     2,      'Explanation for user Ban'
2,  2,    1045,    1,      'Reason for deletion of Article'
Where type 1 = user, type 2 = article, type_id = the ID of the user, article or whatever, and action 2 = ban, action 1 = deletion.
Should I change the design to two tables logBans, logSomething and so on? or is it better to keep the method I'm currently using?
The issue here is subtyping. There are three basic approaches to dealing with subtypes.
Put each record type into a completely separate table;
Put a record in a parent table and then a record in a subtype table; and
Put all the records in one table, having nullable columns for the "optional" data (ie things that don't apply to that type).
Each strategy has its merits.
For example, (3) is particularly applicable if there is little to no difference between different subtypes. In your case, do different log records have extra columns if they're of a particular type? If they don't, or there are few cases when they do, putting them all in one table makes perfect sense.
(2) is commonly used for a Party table. This is a common model in CRMs that involves a parent Party object which has subtypes for Person and Organization (Organization may also have subtypes like Company, Association, etc). Person and Organization have different properties (eg salutation, given names, date of birth, etc for Person) so it makes sense to split this up rather than using nullable columns.
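A bare-bones sketch of that Party model (table and column names are illustrative, not a definitive CRM schema):

CREATE TABLE party (
    id         INT PRIMARY KEY,
    party_type CHAR(1) NOT NULL       -- 'P' = Person, 'O' = Organization
);

CREATE TABLE person (
    party_id      INT PRIMARY KEY,
    salutation    VARCHAR(10),
    given_names   VARCHAR(100),
    date_of_birth DATE,
    FOREIGN KEY (party_id) REFERENCES party(id)
);

CREATE TABLE organization (
    party_id   INT PRIMARY KEY,
    legal_name VARCHAR(200),
    FOREIGN KEY (party_id) REFERENCES party(id)
);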
(2) is potentially more space efficient (although the overhead of NULL columns in modern DBMSs is very low). The bigger issue is that (2) might be more confusing to developers. You will get a situation where someone needs to store an extra field somewhere and will whack it in a column that's empty for that type simply because it's easier doing that than getting approval from the DBAs to add a column (no, I'm not kidding).
(1) is probably the least frequently used scheme of the 3 in my experience.
Lastly, scalability has to be considered and is probably the best case for (1). At a certain point, JOINs don't scale effectively and you'll need to use some kind of partitioning scheme to cut down your table sizes. (1) is one method of doing that (but a crude method).
I wouldn't worry too much about that though. You'll typically need to get to hundreds of millions or billions of records before that becomes an issue (unless your records are really really large, in which case it'll happen sooner).
It depends. If you're going to have 1500000000 entries of type 1 and 1000 entries of type 2 and you'll be doing a LOT of queries on type 2, separate the tables. If not, it's more convenient to keep only one table.
Keep in mind scalability:
How many entries of each type will I have in 1 year?
How many requests on this table will I be doing?
Can you, at some point, clear this log? Can you move it to another table (like archive entries older than X months) ?
The one drawback I see right now is that you cannot enforce foreign key integrity on your type_id since it points to many different tables.
I want to add a small tip. A little off topic, and quite basic, but it's a lot clearer to use enum instead of tinyint for status flags, i.e.
enum('user','type')
If there are only two statuses, tinyint is a little more memory efficient, but less clear. Another disadvantage of enum is that you put a part of the business logic in the data tier: when you need to add or remove statuses, you have to alter the DB. Otherwise enum is much clearer, and I prefer it.
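For illustration, a sketch using a log table like the one in the question (names assumed); note that adding a status to an enum means altering the table:

CREATE TABLE log (
    id      INT PRIMARY KEY,
    type    ENUM('user', 'article') NOT NULL,
    action  TINYINT NOT NULL,
    message VARCHAR(255)
);

-- Adding a third type later requires a schema change:
ALTER TABLE log MODIFY type ENUM('user', 'article', 'comment') NOT NULL;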
I would keep things as specific as possible - in this case I would create two tables.
Each table has a specific purpose so I cannot see why you would combine them.
I wouldn't do what vBulletin does. The problem with older apps like vBulletin is that while they might have started as lean-machines, over the time they collect a lot of entropy and end up being bloated. Since there are plugins, and third-party tools, and developers who've worked on the old code, breaking it is a tough choice.
That's why there is not much refactoring going on there. Don't make them your programming model. Look around, find out what works best and use that. A lot of tables sounds like a bad thing to me, not a good one.
Pardon the elementary question but my newness to the realm of database design leaves me in a bind quite often.
I have a site that keeps growing with regard to families of information. In the beginning I had one sort of item I was describing and all was well. That item occupied one record and had 34 columns (a lot now that I look back) attributed to it of descriptive data. As I get more and more into this stuff, I see that many developers break out data (when practical) into distinct tables.
I've now got additional tables that relate to the original item but are not always needed when describing the original item so I broke them out so they're not queried unnecessarily.
Anyhow, I have a new item I've been trying to organize which is a USER. The user table has typical columns like username, email, last_login, path to associated image, etc. These users have been making comments, which I keep in yet another table that contains columns with IDs that relate to the user and the item on which they are commenting.
Now... I am in the process of adding the obligatory user profile page to the site. Should I create yet another table containing only essential profile data or append the existing user record with profile data in the original user table? I am thinking housekeeping might be a pain if I am to add a "Remove me from site" function as I would have to run something that kills the user record, the user profile record, and any other data associated with that user ID in other tables.
Basically what I am asking is: should I keep going with this "granular" design method - breaking everything out into essential parts - or does it ever serve me to consolidate into larger tables? I see a few instances where, if a user deletes their account, I'll be left with a bunch of non-relevant data. For instance, the original items are restaurants... if I make a table to record "Visits" to restaurants, containing the Restaurant ID and the User ID, then if the user or restaurant gets removed from the site, this "Visits" table will have a bunch of useless records saying either "non-existent restaurant was visited by user 45" or "Restaurant 21 was visited by non-existent user".
I hope I make sense here... I'm just wondering if it's normal to end up with this "junk" data over time.
Thanks much,
Rob
Deleting that "non-relevant" data is a normal, healthy part of an application's life. It's just what happens. You just have to do it, like you brush your teeth or make your bed. Don't let two or three DELETE queries influence how your tables get structured. They're not that expensive, and honestly, if you think that's too much of a pain, you're in the wrong business :)
If you're using InnoDB tables, you can look into foreign key constraints that will take care of some of the cleanup for you.
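For example, a sketch (hypothetical names, loosely matching the restaurants/visits example above) where deleting a user or a restaurant automatically removes the related visit rows:

CREATE TABLE visits (
    id            INT PRIMARY KEY,
    user_id       INT NOT NULL,
    restaurant_id INT NOT NULL,
    FOREIGN KEY (user_id)       REFERENCES users(id)       ON DELETE CASCADE,
    FOREIGN KEY (restaurant_id) REFERENCES restaurants(id) ON DELETE CASCADE
) ENGINE=InnoDB;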
You'll be able to make these decisions much more easily if you learn about normalization.
In general, if data all relates to the same logical entity -- the same "thing" -- then it should go in the same table. Breaking one table into two just to keep the tables smaller is generally not a good idea. Depending on what you are doing, it may or may not make queries faster, and it introduces unnecessary complexity. Let me explain.
Whether it makes queries faster depends on the nature of the data and how you use it. If you have some very large field, like "rambling_comments varchar(5000)" or some such, and it is rarely used, then breaking it into a separate table so that what's left in the "main" table is relatively small could indeed make your queries faster, for the fairly obvious reason that there is now less data to read. But if the size of the fields you are thinking of breaking out are modest, and you often need data from both tables, then queries that only use one table don't gain that much, and queries that use both now need to do a join, which is usually more expensive than reading a somewhat bigger record.
But breaking up your tables will certainly make your programs more complex. Now you have to keep track of which data is in which table. You'll constantly be checking if that field is in the Item_Descriptive_Data table or the Item_Stock_Data table or whatever. You're liable to lose track at some point and accidentally put the same field into two tables. (Or worse, you'll decide this is a good idea and do it deliberately.) Then you have redundant and potentially contradictory data.
You have to do joins every time you need data that crosses tables. You create the possibility that records in one or more of the tables may not exist. Like, if you break your User table into User_Main and User_Profile, and you need data from both tables so you do a join, what happens if there is a record in User_Profile with no corresponding record in User_Main? You're going to have to add code to check for the possibility and deal with it. Oh, and blithely saying "That can never happen, no need to worry about it" is a very dangerous attitude: No matter that it's not SUPPOSED to happen, sooner or later it will, and if you don't handle the error gracefully, you could have a real mess.
In short, breaking up tables for performance reasons is usually a premature optimization. If you find that you have some real performance problem, THEN look at the tables and see if you should denormalize for efficiency. But don't start out trashing your database just to avoid a problem that might possibly happen someday.
Overview (sorry it's vague - I think if I went into more detail it would just overcomplicate things)
I have three tables, table one contains an id, table two contains its own id and table one's id and table three contains its own id and table two's id.
I have spent a lot of time pondering and I think it would be more efficient for table three to also contain the related table one's id.
-It will mean I will not have to join three tables, I can just query table three (for a query that will be used very often)
-It will allow me to implement a reservation system more easily by only locking rows within table three that contain a specific id from table one.
For anyone who wants to know more about the database layout there is more info here
Question
What are the disadvantages of de-normalisation? I have seen some people who are completely against it and others who believe that in the right situation it is a useful tool. The ids will never change, so I do not really see any disadvantage other than having to insert the same data twice and thus the additional space it will consume (which, as it is just ids, will surely be negligible).
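For what it's worth, here is a sketch of the query difference being described (table and column names are placeholders, since the real schema is only linked, not shown):

-- Normalised: filtering table three by table one's id needs two joins.
SELECT t3.*
FROM table_three t3
JOIN table_two t2 ON t2.id = t3.table_two_id
JOIN table_one t1 ON t1.id = t2.table_one_id
WHERE t1.id = 7;

-- Denormalised: table_three carries table_one_id directly.
SELECT *
FROM table_three
WHERE table_one_id = 7;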
My advice is to follow this general rule: Normalise by default, then denormalise if and when you identify a performance problem which it will solve.
I find normalised data, and code dealing with it, easier and more logical to maintain. I don't think there is any problem using denormalisation to improve performance, but I would not speculatively apply any performance optimisation which results in a decrease in maintainability until you are sure they are necessary.
The only time you really want to denormalize is if it's required to get the performance you want.
This was already asked several times. See here
As it's a one (Table 1) to many (Table 2), with another one (Table 2) to many (Table 3), I would keep the same structure, as there seem to be three layers there.
e.g.
Table 1
    Table 2
        Table 3
Also, a lot will depend on what additional fields you are storing within those tables.
Every rule might be broken if there is a good reason for it.
In your case I wonder what the three tables contain. Does Table three really describe Table two or does it describe table one directly?
The disadvantage of having self-id, table-two-id and table-one-id in table three in this case is that it can lead to inconsistency - what if you have table-one-id 1 in table two and table-one-id 15 in table three by mistake?
It depends on the data and the entity relationships of your data. For me, it would be more important to have no inconsistencies, even at the cost of a little bit more time at selection...
EDIT: After reading about your tables I would suggest adding a table-one-id to table three (areas), because table-one-id doesn't change after all, and for that reason it's relatively safe from inconsistency.
Normalization vs efficiency is usually a trade-off; while normalization is generally a good thing, it is not a silver bullet. If you have a clear reason (as it seems you do), denormalization is perfectly acceptable.
Schemas containing less than fully normalized tables suffer from what is called "harmful redundancy". Harmful redundancy can result in storing the same fact in more than one place, or in not having any place to store a fact that needs to be stored. These problems are known as "insert anomalies", "update anomalies", or "delete anomalies".
To make a long story short, if you store a fact in more than one place, then sooner or later you are going to store mutually contradictory facts in the two places, and your database will begin to give contradictory answers, depending on which version of the facts the query found.
If you are forced to "invent a dummy record" in order to have a place to store a needed fact, then sooner or later you are going to write a query that mistakenly treats the dummy record like a real one.
If you are a super programmer, and you never make mistakes, then you don't have to worry about the above. I never met such a programmer, although I've met lots of people who think they never make mistakes.
I would refrain from "denormalizing" as a practice. That's like "driving away from Chicago". You still don't know where you are going. However, there are times when normalization rules should be disregarded, as others have noted. If you are designing a star schema (or a snowflake schema) you are going to have to disregard some of the normalization rules in order to get the best star (or snowflake).
I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to really be multiple values in the future. E.g. credit limit is a single value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as a pure record scan.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in the table scan can outweigh the benefits of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what is a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) that would require no joins, calculations etc. that reports could point to. These were then used in conjunction with SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: have a users table with basic user information that everyone has, and then add a "user_meta" table that can basically hold any key/value pair associated with the user id. So if you need to find all the meta information for the user, you could just add that to your query. You would also not always have to add the extra query when it's not needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata and you will limit it to only one association instead of 50.
WordPress specifically does this to allow for features to be added via plugins, allowing your project to be more scalable; it will not require a complete database overhaul if you need to add a new feature.
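A minimal sketch of that kind of meta table (column names are assumptions, loosely modelled on WordPress's wp_usermeta):

CREATE TABLE user_meta (
    id         INT PRIMARY KEY,
    user_id    INT NOT NULL,
    meta_key   VARCHAR(255) NOT NULL,
    meta_value TEXT,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

-- e.g. pull all metadata for one user only when the page needs it
SELECT meta_key, meta_value FROM user_meta WHERE user_id = 42;

Bear in mind this is the key/value approach other answers here warn against for query-heavy data; it works best for optional, rarely queried attributes.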
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited to your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1 to 1 relation with users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for a 1:1 relationship, for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate out frequently updated counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
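A sketch of that 1:rarely pattern, using the user_stats table from the question as the rarely-present side (the join column and query are invented, assuming user_stats carries a user_id):

SELECT u.id,
       u.username,
       COALESCE(s.hits, 0) AS hits
FROM users u
LEFT JOIN user_stats s ON s.user_id = u.id;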
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then have to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.
In my database I currently have two tables that are almost identical except for one field.
For a quick explanation: with my project, each year businesses submit to me a list of suppliers that they sell to, and also purchase things from. Since this is done on an annual basis, I have a table called sales and one called purchases.
So in the sales table, I would have the fields like: BusinessID, year, PurchaserID, etc. And the complete opposite would be in the purchases table, except that there would be a SellerID.
So basically both tables are exactly the same field-wise except for the PurchaserID/SellerID. I inherited this system, so I did not design the DB this way. I'm debating combining the two tables into one table called suppliers and just adding a type field to distinguish between whether they are selling to, or purchasing from.
Does this sound like a good idea? Is there something I'm missing in regards to why this wouldn't be a good idea?
Do what works for you.
The textbook answer is normalize. If you normalized, you would probably have 2 tables: one with both your buyers and sellers as companies, and a transactions table telling who bought what from whom.
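A rough sketch of what that normalized shape might look like (names are placeholders):

CREATE TABLE company (
    id   INT PRIMARY KEY,
    name VARCHAR(200) NOT NULL
);

CREATE TABLE transactions (
    id        INT PRIMARY KEY,
    year      INT NOT NULL,
    seller_id INT NOT NULL,
    buyer_id  INT NOT NULL,
    FOREIGN KEY (seller_id) REFERENCES company(id),
    FOREIGN KEY (buyer_id)  REFERENCES company(id)
);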
If it ain't broke, don't fix it. Leave them separate.
Since the system is already built, I would only consider this if you find yourself doing a lot of queries across the two tables, like big nasty UNION queries. Combining the two tables into one makes queries like "show me all sellers or purchasers who sold/bought between these dates..." much easier.
But it sounds like these two groups are treated very differently from the business rule perspective, so it's probably not worth the trouble to make application changes at this point. (Every query would have to have a "WHERE Type = 1" or something like that.)
If you'd have asked this during the db design phase, my answer might be different.
Normalization would say "yes".
How many applications are affected by this change? That would affect the decision.
Definitely one table. And I wouldn't call it supplier, since this does not reflect the meaning of the table. Something like business_partner, or something better than that, might be more appropriate. Instead of purchaser_id and seller_id, be more generic, like business_partner_id, and yes, add a field to distinguish.
Not one table. They are different entities that have a similar structure. There's nothing to be gained by consolidating them. (Nothing lost, either, except lucidity; but that's critical IMHO).
"Normalization" doesn't include looking for tables with similar schemas, and merging them.
A database is always a limited model of your business objective. If it doesn't make sense for you business, ignore those who say you should add complexity to your data model by creating a new companies table (though you probably already have something similar). If you really want to get into the "perfect model" game, just start abstracting everything away into an "entities" table and pretty soon you will have a completely unmanageable database.
Normalization would dictate that you NOT combine the two fields, unless the foreign keys actually point to the same table. A key rule to keep in mind is that each column in a table should only mean one thing. Adding a second field that explains what the first field means breaks this rule.
If your queries are getting to be a mess because you are always joining the two tables, you could create a view.
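For instance, a sketch of such a view (using the column names from the question; the view name and type labels are made up):

CREATE VIEW business_transactions AS
    SELECT BusinessID, year, PurchaserID AS PartnerID, 'sale' AS type
    FROM sales
    UNION ALL
    SELECT BusinessID, year, SellerID AS PartnerID, 'purchase' AS type
    FROM purchases;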
Also, the number of records in the table is almost completely irrelevant. Always optimize for performance after you have the system in place. If having all the records in one table is killing your application, set a clustered index on a column that partitions your table in a meaningful way.
You must take into consideration the number of records in both tables. If they are too big, it could have a big impact on queries that have multiple joins to customers and suppliers.
Example: who sold computers to us, and whom did we sell them to?
From a completely different point of view: I tend to consider logic over technology. To me the decision is not whether the data is similar in shape or fields, but whether it makes sense to mix them. That is to say, even if the technical answer might be to normalize, my answer would be: does it make sense to you (business logic) to have both together?
Another answer talks about merging both and changing naming conventions. To me that is a logic decision: you are saying that you don't work with buyers and sellers, but with business partners. If that is your case, then do it.
You might also consider what your use of the tables would be. If they are of one unique logical type (business partner), you will surely have queries that need to access both buyers and sellers. Otherwise, if all your queries are separate, that might be an indication that they are not the same and should not be held together. Pushing them together will imply a lot of extra checks and CPU time spent differentiating what were separate entities.
There is a long-used metaphor about interfaces that might apply here. Just because a gun and a camera both shoot, that does not mean they share an interface, unless you like playing Russian roulette.
From a logical view, there seems to be no difference between the reported transactions, it is just a difference in who reports it to you. It should be a single table with SellerID, BuyerID, and (if you need it) ReporterID(s) (and perhaps additional transaction information).
This is how it should be. Now, how to make the transition? Making a script that uses the two old tables to fill a new table should be an easy exercise, but then you also need to change all the queries that use the information. This is likely a lot of work, and might not be worth the effort.
Since none of the experts reporting in are willing to answer your question, the simple answer is: query1 UNION query2
e.g.
SELECT * FROM table1
UNION
SELECT * FROM table2
(assuming table1 and table2 have the same structure/heading titles)