In my database I currently have two tables that are almost identical except for one field.
For a quick explanation: with my project, each year businesses submit to me a list of the suppliers they sell to and purchase from. Since this is done on an annual basis, I have a table called sales and one called purchases.
So in the sales table, I would have fields like BusinessID, year, PurchaserID, etc. The purchases table would be exactly the same, except with a SellerID instead.
So basically both tables are exactly the same field-wise except for the PurchaserID/SellerID. I inherited this system, so I did not design the DB this way. I'm debating combining the two tables into one table called suppliers and just adding a type field to distinguish between whether they are selling to, or purchasing from.
Does this sound like a good idea? Is there something I'm missing in regards to why this wouldn't be a good idea?
Do what works for you.
The textbook answer is normalize. If you normalized, you would probably have two tables: one with both your buyers and sellers as companies, and a transactions table recording who bought what from whom.
If it ain't broke, don't fix it. Leave them separate.
Since the system is already built, I would only consider this if you find yourself doing a lot of queries across the two tables, like big nasty UNION queries. Combining the two tables into one makes queries like "show me all sellers or purchasers who sold/bought between these dates..." much easier.
But it sounds like these two groups are treated very differently from the business-rule perspective, so it's probably not worth the trouble to make application changes at this point. (Every query would have to have a "WHERE Type = 1" or something like that.)
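To make the trade-off concrete, here is a minimal sketch of the merged approach; the table and column names are hypothetical, not from the original schema:

-- One row per reported relationship; type distinguishes the direction.
CREATE TABLE business_partners (
    BusinessID INT NOT NULL,
    year       INT NOT NULL,
    PartnerID  INT NOT NULL,
    type       CHAR(1) NOT NULL  -- 'S' = sold to partner, 'P' = purchased from partner
);

-- The cross-cutting query becomes a single scan instead of a UNION:
SELECT BusinessID, PartnerID, type FROM business_partners WHERE year = 2009;

-- ...but every type-specific query now needs the discriminator:
SELECT BusinessID, PartnerID FROM business_partners WHERE year = 2009 AND type = 'S';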
If you'd have asked this during the db design phase, my answer might be different.
Normalization would say "yes".
How many applications are affected by this change? That would affect the decision.
Definitely one table. And I wouldn't call it supplier, since that does not reflect the meaning of the table. Something like business_partner, or something better than that, might be more appropriate. Instead of purchaser_id and seller_id, be more generic, like business_partner_id, and yes, add a field to distinguish.
Not one table. They are different entities that have a similar structure. There's nothing to be gained by consolidating them. (Nothing lost, either, except lucidity; but that's critical IMHO).
"Normalization" doesn't include looking for tables with similar schemas, and merging them.
A database is always a limited model of your business objective. If it doesn't make sense for your business, ignore those who say you should add complexity to your data model by creating a new companies table (though you probably already have something similar). If you really want to get into the "perfect model" game, just start abstracting everything away into an "entities" table and pretty soon you will have a completely unmanageable database.
Normalization would dictate that you NOT combine the two fields, unless the foreign keys actually point to the same table. A key rule to keep in mind is that each column in a table should only mean one thing. Adding a second field that explains what the first field means breaks this rule.
If your queries are getting to be a mess because you are always joining the two tables, you could create a view.
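For example, a sketch of such a view, assuming the two tables share the same columns apart from the partner ID:

CREATE VIEW all_transactions AS
    SELECT BusinessID, year, PurchaserID AS PartnerID, 'sale' AS type
    FROM sales
    UNION ALL
    SELECT BusinessID, year, SellerID AS PartnerID, 'purchase' AS type
    FROM purchases;

Queries that span both groups can then read from all_transactions while the base tables stay separate.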
Also, the number of records in the table is almost completely irrelevant. Always optimize for performance after you have the system in place. If it is killing your application to have all the records in one table, set a clustered index on a column that partitions your table in a meaningful way.
You must take into consideration the number of records in both tables. If they are too big, it could have a big impact on queries that have multiple joins to customers and suppliers.
Example: Who sold computers to us, and to whom did we sell them?
From a completely different point of view: I tend to consider logic over technology. To me the decision is not whether the data is similar in shape or fields, but whether it makes sense to mix them. That is to say, while the technical answer might be to normalize, my answer would be: does it make sense to you (business logic) to have both together?
Another answer talks about merging both and changing naming conventions. To me that is a logic decision: you are saying that you don't work with buyers and sellers, but with business partners. If that is your case, then do it.
You might also consider what your use of the tables would be. If they are of one unique logical type (business partner), you will surely have queries that need to access both buyers and sellers. If instead all your queries are separate, that might be an indication that they are not the same and should not be held together. Pushing them together will imply a lot of extra checks and CPU time spent distinguishing what were separate entities.
There is a long-used metaphor about interfaces that might apply here: just because a firearm and a camera both shoot, that does not mean they share an interface, unless you like playing Russian roulette.
From a logical view, there seems to be no difference between the reported transactions, it is just a difference in who reports it to you. It should be a single table with SellerID, BuyerID, and (if you need it) ReporterID(s) (and perhaps additional transaction information).
This is how it should be. Now, how to make the transition? Making a script that uses the two old tables to fill a new table should be an easy exercise, but then you also need to change all the queries that use the information. This is likely a lot of work, and might not be worth the effort.
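Following the SellerID/BuyerID/ReporterID layout suggested above, the fill step could be a single statement; this is a sketch, and the transactions table name is an assumption:

-- In sales, the reporting business is the seller; in purchases, it is the buyer.
INSERT INTO transactions (SellerID, BuyerID, year, ReporterID)
SELECT BusinessID, PurchaserID, year, BusinessID FROM sales
UNION ALL
SELECT SellerID, BusinessID, year, BusinessID FROM purchases;

The remaining (and larger) task is rewriting every query that touched the old tables.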
Since none of the experts reporting in are willing to answer your question, the simple answer is: query1 UNION query2
Ex.:
SELECT * FROM table1 UNION SELECT * FROM table2
(assuming table1 and table2 have the same structure/heading titles)
I'm currently designing a web application using PHP, JavaScript, and MySQL. I'm considering two options for the database design.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on-the-fly for me, and it seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned that with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
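As a minimal sketch of that structure (names and columns here are hypothetical, not from your spec), each entity is one table and each tournament is one row:

CREATE TABLE tournaments (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE divisions (
    id            INT PRIMARY KEY,
    tournament_id INT NOT NULL,
    name          VARCHAR(100),
    FOREIGN KEY (tournament_id) REFERENCES tournaments(id)
);

CREATE TABLE brackets (
    id          INT PRIMARY KEY,
    division_id INT NOT NULL,
    FOREIGN KEY (division_id) REFERENCES divisions(id)
);

CREATE TABLE matches (
    id         INT PRIMARY KEY,
    bracket_id INT NOT NULL,
    played_at  DATETIME,
    FOREIGN KEY (bracket_id) REFERENCES brackets(id)
);

-- One tournament's brackets is a WHERE filter, not a separate table:
SELECT b.*
FROM brackets b
JOIN divisions d ON d.id = b.division_id
WHERE d.tournament_id = 42;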
Don't prematurely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or groups of columns) that uniquely identify a row, so you can use those as the primary key. However, usually it's best just to have a numeric column named ID or something similar. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; RDBMSes these days handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
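For instance, if matches are most often looked up by date (a hypothetical example using the sketch above):

CREATE INDEX idx_matches_played_at ON matches (played_at);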
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which is itself a bunch of tables. So you are replacing a search within your tables with a search within the data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would put additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field), you will have to add it to hundreds of tables, which will be cumbersome, error-prone, and a big maintenance headache.
An RDBMS is built to scale in terms of rows, not tables and their associated elements (indexes, triggers, constraints), so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"? (See the sketch below.)
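With a single matches table, such requests are ordinary queries; a hypothetical MySQL sketch:

-- "List all matches which were played on a Sunday" (DAYOFWEEK() = 1 is Sunday):
SELECT * FROM matches WHERE DAYOFWEEK(played_at) = 1;

With one table per tournament, you would first have to discover every BRACKETS_* table in the data dictionary and then build a dynamic UNION across all of them.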
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?
I'm creating an order system to keep track of orders. There are about 60 products or so, each with its own price. The system isn't very complicated, though; I just need to be able to submit how many of each product the person orders.
My question is: is it more efficient to have an 'orders' table with a column for each product, and numeric values representing how many of each they ordered? Example:
orders
id
product_a
product_b
product_c
etc...
OR
should I break it apart into different tables, with a many-to-many table to join them? Something like this, maybe:
customers
id
name
email
orders
id
customer_id
products
id
product
orders_products
order_id
product_id
I would break it apart like you show in your second sample. This will make your application much more scalable and will still be quite efficient.
Always build with future features and expansion in mind. A shortcut here or there always seems to bite you later, when you have to re-architect and refactor the whole thing. Look up normalization and why you want to separate every independent element in a relational DB.
I am often asked “why make it a separate table, when this way is simpler?” The same people who insist “oh, there is no other thing of this type we will ever use” later ask for a feature that necessitates many-to-many, not realizing they painted you into a corner by not considering future features. People who do not understand data structures tend not to realize this and are pretty bad at specifying system requirements. This usually happens when the DB starts getting big and they realize they want to be able to look at only a subset of the data. A flat DB means adding columns to handle a ton of different desires, while a many-to-many join table can do it with a few lines of code.
I'd also use the second way. If the db is as simple as you say, the difference might not be much in terms of speed and such. But the second way is more efficient and easier to reuse/enhance in case you get new ideas and add to your application.
Should you go for the first case, how will you keep track of the prices and discounts you gave to your customers for each product? Even if you have no plans to track them now, this is quite a common thing, so you might get a request for such a change.
With a normalized schema, all you have to do is add a couple of fields.
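For instance, a sketch of such a change (the column names and types are assumptions):

-- Per-line quantity, price, and discount drop straight into the join table:
ALTER TABLE orders_products
    ADD COLUMN quantity INT NOT NULL DEFAULT 1,
    ADD COLUMN unit_price DECIMAL(10,2),
    ADD COLUMN discount DECIMAL(10,2) NOT NULL DEFAULT 0;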
I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
id
email
user_field
id
name
user_value
id
user_field_id
user_id
value
Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
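To see why, here is a sketch of what fetching just two attributes for one user looks like under the key/value schema above (the attribute names are hypothetical):

SELECT u.email,
       v1.value AS first_name,
       v2.value AS last_name
FROM user u
JOIN user_value v1 ON v1.user_id = u.id
JOIN user_field f1 ON f1.id = v1.user_field_id AND f1.name = 'first_name'
JOIN user_value v2 ON v2.user_id = u.id
JOIN user_field f2 ON f2.id = v2.user_field_id AND f2.name = 'last_name'
WHERE u.id = 1;

Every additional attribute you want back costs another pair of joins.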
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.
It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
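For example, a sketch of that split (the table and column names are hypothetical):

CREATE TABLE address (
    id      INT PRIMARY KEY,
    user_id INT NOT NULL,
    street  VARCHAR(100),
    city    VARCHAR(50),
    zip     VARCHAR(10),
    FOREIGN KEY (user_id) REFERENCES user(id)  -- the FK into User
);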
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing very consistently you may want to consider putting some kind of change control in place to vet changes to only those that are necessary, or move to another technology that better supports schema-less storage like NoSQL with MongoDB or CouchDB.
This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.
Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.
If the changes and additions are continual, then perhaps you need to sit down and do a better job of defining the requirements. Now, I can't say if 30 columns is too many, because it depends on how wide they are and whether they are something that should be moved to a related table. For instance, if you have fields like phone1, phone2, phone3, you have a mess that needs to be split out into a related table such as user_phone (a sketch follows). Or if all your columns are wide (and your overall table width is wider than the pages the database stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem, though.
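A sketch of that split, with hypothetical names:

CREATE TABLE user_phone (
    id         INT PRIMARY KEY,
    user_id    INT NOT NULL,
    phone_type VARCHAR(20),  -- e.g. 'home', 'work', 'mobile'
    number     VARCHAR(20),
    FOREIGN KEY (user_id) REFERENCES user(id)
);

Each user can then have zero or many phone numbers without phone1/phone2/phone3 columns.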
However, of all the possible choices, the EAV model you described is the worst one from both a maintainability and a performance viewpoint. It is very hard to write decent queries against this model.
This really depends on what you're trying to do.
Overview (sorry it's vague - I think if I went into more detail it would just overcomplicate things)
I have three tables, table one contains an id, table two contains its own id and table one's id and table three contains its own id and table two's id.
I have spent a lot of time pondering, and I think it would be more efficient for table three to also contain the related table one's id.
- It will mean I will not have to join three tables; I can just query table three (for a query that will be used very often).
- It will allow me to implement a reservation system more easily, by only locking rows within table three that contain a specific id from table one.
For anyone who wants to know more about the database layout there is more info here
Question
What are the disadvantages of de-normalisation? I have seen some people who are completely against it and others who believe that in the right situation it is a useful tool. The ids will never change, so I do not really see any disadvantage other than having to insert the same data twice and thus the additional space it will consume (which, as it is just ids, will surely be negligible).
My advice is to follow this general rule: Normalise by default, then denormalise if and when you identify a performance problem which it will solve.
I find normalised data, and code dealing with it, easier and more logical to maintain. I don't think there is any problem using denormalisation to improve performance, but I would not speculatively apply any performance optimisation which results in a decrease in maintainability until you are sure they are necessary.
The only time you really want to denormalize is if it's required to get the performance you want.
This was already asked several times. See here
As it's a one (Table 1) to many (Table 2), with another one (Table 2) to many (Table 3), I would keep the same structure, as there seem to be three layers there.
e.g.
Table 1
Table 2
Table 3
Also, a lot will depend on what additional fields you are storing within those tables.
Every rule might be broken if there is a good reason for it.
In your case, I wonder what the three tables contain. Does table three really describe table two, or does it describe table one directly?
The disadvantage of having self-id, table-two-id, and table-one-id in table three in this case is that it can lead to inconsistency - what if you have table-one-id 1 in table two and table-one-id 15 in table three by mistake?
It depends on the data and the entity relationships of your data. For me, it would be more important to have no inconsistencies, and to spend a little more time on selects...
EDIT: After reading about your tables, I would suggest adding a table-one-id to table three (areas), because table-one-id doesn't change after all, and for that reason it's relatively safe from inconsistency.
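A sketch of that suggestion (table_one_id is a placeholder for whatever table one's key is actually called):

-- Table three (areas) carries a redundant copy of table one's id:
ALTER TABLE areas ADD COLUMN table_one_id INT NOT NULL;

-- The frequent query then needs no joins at all:
SELECT * FROM areas WHERE table_one_id = 1;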
Normalization vs efficiency is usually a trade-off, while normalization is generally a good thing, it is not a silver bullet. If you have a clear reason (as it seems you do), denormalization is perfectly acceptable.
Schemas containing less than fully normalized tables suffer from what is called "harmful redundancy". Harmful redundancy can result in storing the same fact in more than one place, or in not having any place to store a fact that needs to be stored. These problems are known as "insert anomalies", "update anomalies", or "delete anomalies".
To make a long story short, if you store a fact in more than one place, then sooner or later you are going to store mutually contradictory facts in the two places, and your database will begin to give contradictory answers, depending on which version of the facts the query found.
If you are forced to "invent a dummy record" in order to have a place to store a needed fact, then sooner or later you are going to write a query that mistakenly treats the dummy record like a real one.
If you are a super programmer, and you never make mistakes, then you don't have to worry about the above. I never met such a programmer, although I've met lots of people who think they never make mistakes.
I would refrain from "denormalizing" as a practice. That's like "driving away from Chicago". You still don't know where you are going. However, there are times when normalization rules should be disregarded, as others have noted. If you are designing a star schema (or a snowflake schema) you are going to have to disregard some of the normalization rules in order to get the best star (or snowflake).
I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked by UserId and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans 100+ columns, when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and granting authorization on them appropriately.)
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to really be multiple values in the future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation, which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in the table scan can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the user's interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what is a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have three rows with the same user data in them, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations, etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: have a users table with basic user information that everyone has, then add a "user_meta" table that can hold basically any key/value pair associated with the user id. So if you need to find all the meta information for the user, you could just add that to your query. You would also not always have to add the extra query if it's not needed, for things like logging in. The benefit of this approach is that it leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, and you will limit it to only one association instead of 50.
Wordpress specifically does this to allow for features to be added via plugins, therefore allowing for your project to be more scalable and will not require a complete database overhaul if you need to add a new feature.
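A sketch of that layout, patterned after WordPress's usermeta table (the names here are assumptions):

CREATE TABLE user_meta (
    id         INT PRIMARY KEY,
    user_id    INT NOT NULL,
    meta_key   VARCHAR(255),
    meta_value TEXT,
    FOREIGN KEY (user_id) REFERENCES users(UserId)
);

-- Any new per-user feature becomes a row, not a schema change:
INSERT INTO user_meta (user_id, meta_key, meta_value)
VALUES (1, 'twitter_handle', '@example');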
I think this is one of those "it depends" situations. Having multiple tables is cleaner and probably theoretically better. But when you have to join six or seven tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited to your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1:1 relation with users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
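A sketch of the "1:rarely" pattern against the tables above (assuming user_stats carries the UserId key):

SELECT u.username,
       COALESCE(s.hits, 0) AS hits  -- NULL becomes 0 when no stats row exists
FROM users u
LEFT JOIN user_stats s ON s.UserId = u.UserId;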
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB), which means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.