Overview (Sorry its vague - I think if I went into more detail it would just over complicate things)
I have three tables, table one contains an id, table two contains its own id and table one's id and table three contains its own id and table two's id.
I have spent a lot of time pondering and I think it would be more efficient for table three to also contain the related table ones id.
-It will mean I will not have to join three tables, I can just query table three (for a query that will be used very often)
-It will allow me to implement a reservation system more easily by only locking rows within table three that contain a specific id from table one.
For anyone who wants to know more about the database layout there is more info here
Question
What are the disadvantaged to de-normalisation? I have seen some people who are completely against it and others who believe in the right situation it is a useful tool. The id's will never change so I do not really see any disadvantage other than having to insert the same data twice and thus the additional space it will consume (which as it is just id's will surely be negligible).
My advice is to follow this general rule: Normalise by default, then denormalise if and when you identify a performance problem which it will solve.
I find normalised data, and code dealing with it, easier and more logical to maintain. I don't think there is any problem using denormalisation to improve performance, but I would not speculatively apply any performance optimisation which results in a decrease in maintainability until you are sure they are necessary.
The only time you really want to denormalize is if its required to get the performance you want
This was already asked several times. See here
As its a one (Table 1) to many (Table 2), with another one (table 2) to many (Table 3) I would keep the same structure as their seems to be 3 layers there.
e.g.
Table 1
Table 2
Table 3
Also, a lot will depend on what additional fields you are storing within those tables.
Every rule might be broken if there is a good reason for it.
In your case I wonder what the three tables contain. Does Table three really describe Table two or does it describe table one directly?
The disadvantage to have self-id, table-two-id and table-one-id in table three in this case is, that it can lead to inconsistence - what if you have table-one-id 1 in table two and table-one-id 15 in table three by a mistake?
It depends on the data and the entity relationship of your data. For me, it would be more important to have no inconsistencies and to have a little bit more time at selection...
EDIT: After reading about your Tables I would suggest to add a table-one-id to table three (areas), because table-one-id doesn't change after all and for that reason its relatively save for inconsistency.
Normalization vs efficiency is usually a trade-off, while normalization is generally a good thing, it is not a silver bullet. If you have a clear reason (as it seems you do), denormalization is perfectly acceptable.
Schemas containing less than fully normalized tables suffer from what is called "harmful redundancy". Harmful redundancy can result in storing the same fact in more than one place, or in not having any place to store a fact that needs to be stored. These problems are known as "insert anomalies", "update anomalies", or "delete anomalies".
To make a long story short, if you store a fact in more than one place, then sooner or later you are going to store mutually contradictory facts in the two places, and your database will begin to give contradictory answers, depending on which version of the facts the query found.
If you are forced to "invent a dummy record" in order to have a place to store a needed fact, then sooner or later you are going to write a query that mistakenly treats the dummy record like a real one.
If you are a super programmer, and you never make mistakes, then you don't have to worry about the above. I never met such a programmer, although I've met lots of people who think they never make mistakes.
I would refrain from "denormalizing" as a practice. That's like "driving away from Chicago". You still don't know where you are going. However, there are times when normalization rules should be disregarded, as others have noted. If you are designing a star schema (or a snowflake schema) you are going to have to disregard some of the normalization rules in order to get the best star (or snowflake).
Related
Related but not quite the same thing:which is more effcient? (or at least reading through it didn't help me any)
So I am working on a new site (selling insurance policies) we already have several sites up (its a rails application) that do this so I have a table in my sql database called policies.
As you can imagine it has lots of columns to support all the different options available.
While working on this new site I realized I needed to keep track of 20+ more options.
My concern is that the policies table is already large, but the columns in it right now are almost all used by every application we have. Whereas if I add these they would only be used for the new site and would leave tons of null cells on all the rest of the policies.
So my question is do I add those to the existing table or create a new table just for the policies sold on that site? Also I believe that if I created a new table I could leave out some of the columns (but not very many) from the main policies table because they are not needed for this application.
"[A]lmost all used" suggests that you could, upon considering it, split it more naturally.
Now, much of the efficiency concern here goes down to three things:
A single table can be scanned through more quickly than joins across several.
Large rows have a memory and disk-space cost in themselves.
If a single table represents something that is really a 1-to-many, then it requires more work on insert, delete or update.
Point 2 only really comes in, should there be a lot of cases where you need one particular subset of the data, and another batch where you need another subset, and maybe just a few where you need them all. If you're using most of the columns in most places, then it doesn't gain you anything. In that case, splitting tables is bad.
Point 1 and 3 argue for and against joining into one big table, respectively.
Before any of that though, let's get back to "almost all". If there are several rows with a batch of null fields, why? Often answering that "why?" reveals that really there's a natural split there, that should be broken off into another table as part of normal normalisation*. Repetition of fields, is an even greater suggestion that this is the case.
Do this first.
To denormalise - whether by splitting what is naturally one table, or joining what is naturally several - is a very particular type of optimisation - it makes some things more efficient at the cost of making other things less efficient, and it introduces possibilities of bugs that don't exist otherwise. I would never say you should never denormalise - I do it myself - but you need to be able to say "I am denormalising table X & Y in this manner, because it will help case C which happens enough and I can live with the extra cost to case D". Then you need to check it actually did help case C significantly and case D insignificantly, along with looking for hidden costs.
One of the reasons for normalising in the first place is it gives good average performance over a wide range of cases. It's the balance you want most of the time. Denormalising from the get-go rather than with a normalised database as a starting point is almost always premature.
*Fun trivia fact: The name "normalization" was in part a take on Richard Nixon's "Vietnamisation" policy meaning there was a running joke in some quarters of adding "-isation" onto just about anything. Were it not for the Whitehouse's reaction to the Tet Offensive, we could be using the gernund "normalising," or something completely different instead.
I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't pre-maturely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows, RDBMS-es these days handle table with hundreds of millions of rows, they're designed to be able to do that efficiently.
Beside compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX" and that search is done in the DBMS'es data dictionary which itself is a bunch of tables. So, you are replacing a search within your tables with a search within data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field) you will have to add it to hundreds of tables which will be cumbersome, error prone and a big maintenance headache
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?
I had a 'large' MySQL table that originally contained ~100 columns and I ended up splitting it up into 5 individual tables and then joining them back up with CodeIgniter Active Record...
From a performance point of view is it better to keep the original table with 100 columns or keep it split up.
Each table has around 200 rows.
200 rows? That's nothing.
I would split the table if the new ones combined columns in a way that was meaningful for your problem. I would do it with an eye towards normalization.
You sound like you're splitting them to meet some unstated criteria for "goodness" or because your current performance is unacceptable. Do you have some data that suggests a performance problem that is caused by your schema? If not, I'd recommend rethinking this approach.
No one can say what the impact on performance will be. More JOINs may be slower when you query, but you don't say what your use cases are.
So you've already made the change and now you're asking if we know which version of your schema goes faster?
(if the answer is the split tables, then you're doing something wrong).
Not only should the consolidated table be faster, it should also require less code and therefore less likely to have bugs.
You've not provided any information about the structure of your data.
And with 200 rows in your database, performance is the last thing you need to worry about.
The concept you're referring to is called vertical partitioning and it can have surprising effects on performance. On a Mysql.com Performance Post they discuss this in particular. An excerpt from the article:
Although you have to do vertical
partitioning manually, you can benefit
from the practice in certain
circumstances. For example, let's say
you didn't normally need to reference
or use the VARCHAR column defined in
our previously shown partitioned
table.
Important thing is - you can (and its good style!) move columns containing temporary data into separate table. You can move optional columns into separate table (this depends on logic).
When you are making a database the most important thing is: each table should incapsulate some essence. You should better create more tables, but separate different essences into different tables. The only exclusion is when you have to optimize your software, because 'straight' logical solution works slowly.
If you deal with some very complicated model, you should divide it into few simple blocks with simple relations - this works with database design as well.
As for perfomance - of course one table should give better perfomance since you would not need any kind of joins and keys to access all data. Less relations - less lags.
I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
id
email
user_field
id
name
user_value
id
user_field_id
user_id
value
Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.
It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing very consistently you may want to consider putting some kind of change control in place to vet changes to only those that are necessary, or move to another technology that better supports schema-less storage like NoSQL with MongoDB or CouchDB.
This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.
Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.
If the changes and additons are continual then perhaps you need to sit down and do a better job of defining the requirements. Now I can't say if 30 columns is toomany becasue it depends on how wide they are and whether thay are something that shouldbe moved to a related table. For instnce if you have fields like phone1, phone2, phone 3, youo have a mess that needs to be split out into a related table for user_phone. Or if all your columns are wide (and your overall table width is wider than the pages the databases stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem though.
However, of all the possible choices, the EAV model you described is the worst one both from a maintainabilty and performance viewpoint. It is very hard to write decent queries against this model.
This really depends on what you're trying to do.
In my database I currently have two tables that are almost identical except for one field.
For a quick explanation, with my project, each year businesses submit to me a list of suppliers that they sale to, and also purchase things from. Since this is done on an annual basis, I have a table called sales and one called purchases.
So in the sales table, I would have the fields like: BusinessID, year, PurchaserID, etc. And the complete opposite would be in the purchases table, except that there would be a SellerID.
So basically both tables are exactly the same field wise except for the PurchaserID/SellerID. I inherited this system, so I did not design the DB this way. I'm debating combing the two tables into one table called suppliers and just adding a type field to distinguish between whether they are selling to, or purchasing from.
Does this sound like a good idea? Is there something I'm missing in regards to why this wouldn't be a good idea?
Do what works for you.
The textbook answer is normalize. If you normalized you would probably have 2 tables, one with both your buyers and sellers as companies. And a transactions table telling who bought what from who.
If it ain't broke, don't fix it. Leave them separate.
Since the system is already built, I would only consider this if you find yourself doing a lot of queries across the two tables, like big nasty UNION queries. Joining the two tables in one makes queries like "show me all sellers or purchasers who sold/bought between these dates..." much easier.
But it sounds like these two groups are treated very differently from the business rule perspective, so its probably not worth the trouble to make application changes at this point. (Every query would have to have a "WHERE Type = 1" or something like that).
If you'd have asked this during the db design phase, my answer might be different.
Normalization would say "yes".
How many applications are affected by this change? That would affect the decision.
Definitely one table. And I wouldn't call it supplier since this does not reflect the meaning of the table. Something like busibess_partner or something better than that might be more appropriate. Instead of purchase_id and seller_id, then be more generic like business_partner_id, and yes, add a field to distinguish.
Not one table. They are different entities that have a similar structure. There's nothing to be gained by consolidating them. (Nothing lost, either, except lucidity; but that's critical IMHO).
"Normalization" doesn't include looking for tables with similar schemas, and merging them.
A database is always a limited model of your business objective. If it doesn't make sense for you business, ignore those who say you should add complexity to your data model by creating a new companies table (though you probably already have something similar). If you really want to get into the "perfect model" game, just start abstracting everything away into an "entities" table and pretty soon you will have a completely unmanageable database.
Normalization would dictate that you NOT combine the two fields, unless the foreign keys actually point to the same table. A key rule to keep in mind is that each column in a table should only mean one thing. Adding a second field that explains what the first field means breaks this rule.
If your queries are getting to be a mess because you are always joining the two tables, you could create a view.
Also, the number of records in the table is almost completely irrelevant. Always optimize for performance after you have the system in place. If it killing your application to have all the records in one table, set a clustered index on a column that partitions your table in a meaningful way.
You must take into consideration the number of records on both tables. if they are to big it could have a big inpact on queries that have multiple joins to customers and suppliers.
Example: Who sold computers to us and to whom did we sell them to.
From a completely different point of view. I tend to consider logic over technology. To me the decision is not whether the data is similar in shape or fields, but whether it makes sense mixing them. That is as much to say that whether the technical answer might be normalize, my answer would be: does it make sense to you (business logic) to have both together?
Another answer talks about merging both and changing naming conventions. To me that is a logic decision: you are saying that you don't work with buyers and sellers, but with business partners. If that is your case, then do it.
You might also consider what your use of the tables would be. If they are of one unique logic type (business partner) you will surely have queries that need to access both buyers and sellers. Else, if all your queries are separate, that might be an indication that they are not the same, and should not be held together. Pushing them together will imply a lot of extra checks and cpu time spent differing from what were separate entities.
There is a long used metaphor about interfaces that might apply here. Just because a fire gun and a camera both shoot, that does not mean they share an interface, unless you like playing Russian roulette.
From a logical view, there seems to be no difference between the reported transactions, it is just a difference in who reports it to you. It should be a single table with SellerID, BuyerID, and (if you need it) ReporterID(s) (and perhaps additional transaction information).
This is how it should be. Now, how to make the transition? Making a script that uses the two old tables to fill a new table should be an easy exercise, but then you also need to change all the queries that use the information. This is likely a lot of work, and might not be worth the effort.
Since none of the experts reporting in are willing to answer your question, the simple answer is: query1 UNION query2
EX.
SELECT * FROM table1 UNION SELECT * FROM table2 assuming table1 and table2 have the same structure/heading titles