MySQL DB design help - mysql

Pardon the elementary question but my newness to the realm of database design leaves me in a bind quite often.
I have a site that keeps growing with regard to families of information. In the beginning I had one sort of item I was describing and all was well. That item occupied one record and had 34 columns (a lot now that I look back) attributed to it of descriptive data. As I get more and more into this stuff, I see that many developers break out data (when practical) into distinct tables.
I've now got additional tables that relate to the original item but are not always needed when describing the original item so I broke them out so they're not queried unnecessarily.
Anyhow, I have a new item I've been trying to organize which is a USER. The user table has typical columns like username, email, last_login, path to associated image, etc. These users have been making comments, which I keep in yet another table that contains columns with IDs that relate to the user and the item on which they are commenting.
Now... I am in the process of adding the obligatory user profile page to the site. Should I create yet another table containing only essential profile data or append the existing user record with profile data in the original user table? I am thinking housekeeping might be a pain if I am to add a "Remove me from site" function as I would have to run something that kills the user record, the user profile record, and any other data associated with that user ID in other tables.
Basically what I am asking is should I keep going with this "granular" design method - breaking everything out into essential parts or does it ever serve me to consolidate into larger tables? I see a few instances where if a user deletes their account, I'll be left with a bunch of non-relevant data. For instance, the original items are restaurants... if I make a table to record "Visits" to restaurants, containing the Restaurant ID and the User ID, and the user or restaurant gets removed from the site, this "Visits" table will have a bunch of useless records saying either "non-existent restaurant was visited by user 45" or "Restaurant 21 was visited by non-existent user".
I hope I make sense here... I'm just wondering if it's normal to end up with this "junk" data over time.
Thanks much,
Rob

Deleting that "non-relevant" data is a normal, healthy part of an application's life. It's just what happens. You just have to do it, like you brush your teeth or make your bed. Don't let two or three DELETE queries influence how your tables get structured. They're not that expensive, and honestly, if you think that's too much of a pain, you're in the wrong business :)
If you're using InnoDB tables, you can look into foreign key constraints that will take care of some of the cleanup for you.
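As a rough sketch of what that cleanup looks like, here is the "Visits" case from the question with ON DELETE CASCADE. It runs on SQLite through Python's sqlite3 module purely so it is self-contained; the DDL is close to what InnoDB accepts, but the table and column names are made up for illustration and are not from the original schema.

```python
import sqlite3

# SQLite stands in for MySQL/InnoDB here (assumption, for a runnable demo).
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked; InnoDB enforces them by default.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Hypothetical visits table: each row dies with its user or restaurant.
CREATE TABLE visits (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    restaurant_id INTEGER NOT NULL REFERENCES restaurants(id) ON DELETE CASCADE
);
INSERT INTO users VALUES (45, 'rob'), (46, 'ann');
INSERT INTO restaurants VALUES (21, 'Luigi');
INSERT INTO visits (user_id, restaurant_id) VALUES (45, 21), (46, 21);
""")

# "Remove me from site": one DELETE, the database removes the dependent rows.
conn.execute("DELETE FROM users WHERE id = 45")
conn.commit()

remaining_visits = conn.execute(
    "SELECT user_id, restaurant_id FROM visits ORDER BY user_id"
).fetchall()
print(remaining_visits)
```

Only the visit belonging to the surviving user is left, so there is no "visited by non-existent user" junk to sweep up by hand.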

You'll be able to make these decisions much more easily if you learn about normalization.

In general, if data all relates to the same logical entity -- the same "thing" -- then it should go in the same table. Breaking one table into two just to keep the tables smaller is generally not a good idea. Depending on what you are doing, it may or may not make queries faster, and it introduces unnecessary complexity. Let me explain.
Whether it makes queries faster depends on the nature of the data and how you use it. If you have some very large field, like "rambling_comments varchar(5000)" or some such, and it is rarely used, then breaking it into a separate table so that what's left in the "main" table is relatively small could indeed make your queries faster, for the fairly obvious reason that there is now less data to read. But if the fields you are thinking of breaking out are of modest size, and you often need data from both tables, then queries that only use one table don't gain that much, and queries that use both now need to do a join, which is usually more expensive than reading a somewhat bigger record.
But breaking up your tables will certainly make your programs more complex. Now you have to keep track of which data is in which table. You'll constantly be checking if that field is in the Item_Descriptive_Data table or the Item_Stock_Data table or whatever. You're liable to lose track at some point and accidentally put the same field into two tables. (Or worse, you'll decide this is a good idea and do it deliberately.) Then you have redundant and potentially contradictory data.
You have to do joins every time you need data that crosses tables. You create the possibility that records in one or more of the tables may not exist. Like, if you break your User table into User_Main and User_Profile, and you need data from both tables so you do a join, what happens if there is a record in User_Profile with no corresponding record in User_Main? You're going to have to add code to check for the possibility and deal with it. Oh, and blithely saying "That can never happen, no need to worry about it" is a very dangerous attitude: No matter that it's not SUPPOSED to happen, sooner or later it will, and if you don't handle the error gracefully, you could have a real mess.
In short, breaking up tables for performance reasons is usually a premature optimization. If you find that you have some real performance problem, THEN look at the tables and see if you should denormalize for efficiency. But don't start out trashing your database just to avoid a problem that might possibly happen someday.
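To make the User_Main / User_Profile problem concrete, here is a small sketch of what a join does when the two halves get out of step, plus the kind of check you'd end up writing. It uses SQLite via Python's sqlite3 only so it can be run as-is; the columns and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_main (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE user_profile (user_id INTEGER PRIMARY KEY, bio TEXT);
INSERT INTO user_main VALUES (1, 'alice'), (2, 'bob');
-- User 2 has no profile row; profile 3 has no user_main row (an orphan).
INSERT INTO user_profile VALUES (1, 'Likes sushi'), (3, 'Left over');
""")

# A plain inner join silently drops both the orphan and the profile-less user.
joined = conn.execute("""
    SELECT m.username, p.bio
    FROM user_main m
    JOIN user_profile p ON p.user_id = m.id
    ORDER BY m.id
""").fetchall()

# Profiles whose user is gone: the "can never happen" case.
orphans = conn.execute("""
    SELECT p.user_id
    FROM user_profile p
    LEFT JOIN user_main m ON m.id = p.user_id
    WHERE m.id IS NULL
""").fetchall()

# Users that never got a profile row.
missing_profiles = conn.execute("""
    SELECT m.username
    FROM user_main m
    LEFT JOIN user_profile p ON p.user_id = m.id
    WHERE p.user_id IS NULL
""").fetchall()

print(joined, orphans, missing_profiles)
```

None of these checks are needed while the data lives in a single table, which is the extra complexity being warned about.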

Related

What should be the best way to store entries that are rarely used?

I'm in the process of designing a database (MySQL) for a security company that wants to keep track of all the security guards it hires. Due to the nature of the industry, a significant number of people are moved into a "terminated" list (mostly people who were fired on bad terms). The company wants to keep track of them since some of them tend to re-apply for work after a year or two. Also, there are times when executives in the company think that putting a certain person on that list was unjust and reinstate them (which is why, to my understanding, a MySQL Archive won't work).
The "center" of the database is a guards table that has many relationships with other tables in the database, and I'm trying to decide what would be the most efficient way to design the "terminated" list. I thought of two options:
Have the guards table be in a one-to-one relationship with a terminatedGuards table. The problem I see in this solution is that any time I want to query the data I would always need to add a clause in my SELECT statement to exclude people that are in the terminatedGuards table.
Make a separate table with columns similar to the guards table, and any time a guard is moved to that table I completely erase their entry from the guards table and just copy it to the terminatedGuards table. The problem I see with this approach is that I would need to follow a lot of relationships that are associated with that entry (and sometimes I would want to re-create them with the copied entry in the terminatedGuards list for reference. For example, I would need to re-link a table that holds the work history of guards at different sites managed by the company with the terminatedGuards table, so I could preserve the work history of that guard, even if he or she was fired).
Which approach should be more efficient?
Thanks.
I really doubt you're going to have a million records in this table. Flag them by status, add an index on that status flag, and you should be fine.
Moving records between tables is always trouble, so it's usually done as a last resort. For example, if you had a billion records in the table you'd want to partition it or shard it in some capacity, but what you're talking about here is trivial amounts of data in comparison. It's unlikely you'll ever have more than a million records in this table, and if you do, obviously you're involved in a project that's of such a massive scale you can afford the hardware to host a database of that size.
Usually you'd architect this to have a guards table, and then some kind of associated records that define when they were hired, fired, or any other event that impacted their employment.
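A minimal sketch of the status-flag approach, assuming a guards table with a status column and a work_history table that points at it (both layouts are guesses for illustration, not the asker's real schema). It runs on SQLite through Python's sqlite3 so it is self-contained; the same statements are close to what you'd write for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guards (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
           CHECK (status IN ('active', 'terminated'))
);
-- Index the flag so "active only" queries stay cheap.
CREATE INDEX idx_guards_status ON guards (status);

CREATE TABLE work_history (
    id       INTEGER PRIMARY KEY,
    guard_id INTEGER NOT NULL REFERENCES guards(id),
    site     TEXT NOT NULL
);
INSERT INTO guards (id, name) VALUES (1, 'Ann'), (2, 'Bo'), (3, 'Cy');
INSERT INTO work_history (guard_id, site) VALUES (2, 'Mall'), (3, 'Bank');
""")

# Terminating a guard is a one-row UPDATE; nothing is copied or re-linked.
conn.execute("UPDATE guards SET status = 'terminated' WHERE id = 2")
active_after_firing = conn.execute(
    "SELECT name FROM guards WHERE status = 'active' ORDER BY id"
).fetchall()

# Work history is still attached to the same guard row.
history_of_bo = conn.execute(
    "SELECT site FROM work_history WHERE guard_id = 2"
).fetchall()

# Reinstating is the reverse UPDATE.
conn.execute("UPDATE guards SET status = 'active' WHERE id = 2")
active_after_reinstating = conn.execute(
    "SELECT name FROM guards WHERE status = 'active' ORDER BY id"
).fetchall()
conn.commit()

print(active_after_firing, history_of_bo, active_after_reinstating)
```

If writing the status filter everywhere is the worry, a view over the active rows is one way to hide it. A separate table of employment events (hired, fired, reinstated) would be the fuller version of the last paragraph above.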

More efficient to have two tables or one table with tons of fields

Related but not quite the same thing: Which is more efficient? (or at least reading through it didn't help me any)
So I am working on a new site (selling insurance policies). We already have several sites up that do this (it's a Rails application), so I have a table in my SQL database called policies.
As you can imagine it has lots of columns to support all the different options available.
While working on this new site I realized I needed to keep track of 20+ more options.
My concern is that the policies table is already large, but the columns in it right now are almost all used by every application we have. Whereas if I add these they would only be used for the new site and would leave tons of null cells on all the rest of the policies.
So my question is do I add those to the existing table or create a new table just for the policies sold on that site? Also I believe that if I created a new table I could leave out some of the columns (but not very many) from the main policies table because they are not needed for this application.
"[A]lmost all used" suggests that you could, upon considering it, split it more naturally.
Now, much of the efficiency concern here goes down to three things:
1. A single table can be scanned through more quickly than joins across several.
2. Large rows have a memory and disk-space cost in themselves.
3. If a single table represents something that is really a 1-to-many, then it requires more work on insert, delete or update.
Point 2 only really comes in, should there be a lot of cases where you need one particular subset of the data, and another batch where you need another subset, and maybe just a few where you need them all. If you're using most of the columns in most places, then it doesn't gain you anything. In that case, splitting tables is bad.
Point 1 and 3 argue for and against joining into one big table, respectively.
Before any of that though, let's get back to "almost all". If there are several rows with a batch of null fields, why? Often answering that "why?" reveals that really there's a natural split there, that should be broken off into another table as part of normal normalisation*. Repetition of fields is an even stronger sign that this is the case.
Do this first.
To denormalise - whether by splitting what is naturally one table, or joining what is naturally several - is a very particular type of optimisation - it makes some things more efficient at the cost of making other things less efficient, and it introduces possibilities of bugs that don't exist otherwise. I would never say you should never denormalise - I do it myself - but you need to be able to say "I am denormalising table X & Y in this manner, because it will help case C which happens enough and I can live with the extra cost to case D". Then you need to check it actually did help case C significantly and case D insignificantly, along with looking for hidden costs.
One of the reasons for normalising in the first place is it gives good average performance over a wide range of cases. It's the balance you want most of the time. Denormalising from the get-go rather than with a normalised database as a starting point is almost always premature.
*Fun trivia fact: The name "normalization" was in part a take on Richard Nixon's "Vietnamisation" policy, meaning there was a running joke in some quarters of adding "-isation" onto just about anything. Were it not for the White House's reaction to the Tet Offensive, we could be using the gerund "normalising," or something completely different instead.

Database Design For Tournament Management Software

I'm currently designing a web application using PHP, JavaScript, and MySQL. I'm considering two options for the database.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't prematurely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows. RDBMSs these days handle tables with hundreds of millions of rows; they're designed to do that efficiently.
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which itself is a bunch of tables. So, you are replacing a search within your tables with a search within data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would put additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field) you will have to apply the change to hundreds of tables, which will be cumbersome, error-prone and a big maintenance headache.
An RDBMS is built to scale in terms of rows, not tables and their associated elements (indexes, triggers, constraints) - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?
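For what it's worth, here is roughly what the "clincher" query looks like when every match lives in one table with a tournament_id. This is a simplified sketch (matches hang directly off tournaments, skipping divisions and brackets), the names are invented, and it runs on SQLite through Python's sqlite3 just to be self-contained. SQLite's strftime('%w', ...) gives '0' for Sunday; in MySQL you would reach for something like DAYOFWEEK() instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tournaments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One matches table for every tournament, linked by a foreign key column.
CREATE TABLE matches (
    id            INTEGER PRIMARY KEY,
    tournament_id INTEGER NOT NULL REFERENCES tournaments(id),
    played_on     TEXT NOT NULL  -- ISO date, e.g. '2024-06-02'
);
CREATE INDEX idx_matches_tournament ON matches (tournament_id);

INSERT INTO tournaments VALUES (1, 'Spring Open'), (2, 'City Cup');
INSERT INTO matches (tournament_id, played_on) VALUES
    (1, '2024-06-02'),  -- Sunday
    (1, '2024-06-05'),  -- Wednesday
    (2, '2024-06-09');  -- Sunday
""")

# "List all matches which were played on a Sunday", across all tournaments.
sunday_matches = conn.execute("""
    SELECT t.name, m.played_on
    FROM matches m
    JOIN tournaments t ON t.id = m.tournament_id
    WHERE strftime('%w', m.played_on) = '0'
    ORDER BY m.played_on
""").fetchall()

# Per-tournament access is just a WHERE clause on the indexed column.
spring_open_count = conn.execute(
    "SELECT COUNT(*) FROM matches WHERE tournament_id = ?", (1,)
).fetchone()[0]

print(sunday_matches, spring_open_count)
```

With a MATCHES_[id] table per tournament, the same report would mean generating a UNION over every table that happens to exist.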

How many database table columns are too many?

I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
    id
    email
user_field
    id
    name
user_value
    id
    user_field_id
    user_id
    value
Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.
It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing very consistently you may want to consider putting some kind of change control in place to vet changes to only those that are necessary, or move to another technology that better supports schema-less storage like NoSQL with MongoDB or CouchDB.
This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.
Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.
If the changes and additions are continual, then perhaps you need to sit down and do a better job of defining the requirements. Now I can't say if 30 columns is too many because it depends on how wide they are and whether they are something that should be moved to a related table. For instance, if you have fields like phone1, phone2, phone3, you have a mess that needs to be split out into a related table for user_phone. Or if all your columns are wide (and your overall table width is wider than the pages the database stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem though.
However, of all the possible choices, the EAV model you described is the worst one from both a maintainability and a performance viewpoint. It is very hard to write decent queries against this model.
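To give a feel for why EAV queries get painful, here is a sketch that asks the same question ("which users are in Paris on the pro plan?") of the user / user_field / user_value layout from the question and of a plain table with real columns. The city and plan attributes, the user_wide table and the sample rows are invented for the example, and SQLite (via Python's sqlite3) stands in for MySQL so it runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- EAV layout, as proposed in the question.
CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE user_field (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE user_value (
    id            INTEGER PRIMARY KEY,
    user_field_id INTEGER NOT NULL REFERENCES user_field(id),
    user_id       INTEGER NOT NULL REFERENCES user(id),
    value         TEXT
);
INSERT INTO user VALUES (1, 'a@example.com'), (2, 'b@example.com');
INSERT INTO user_field VALUES (1, 'city'), (2, 'plan');
INSERT INTO user_value (user_field_id, user_id, value) VALUES
    (1, 1, 'Paris'), (2, 1, 'pro'),
    (1, 2, 'Paris'), (2, 2, 'free');

-- Same data as ordinary columns (hypothetical table for comparison).
CREATE TABLE user_wide (
    id INTEGER PRIMARY KEY, email TEXT NOT NULL, city TEXT, plan TEXT
);
INSERT INTO user_wide VALUES
    (1, 'a@example.com', 'Paris', 'pro'),
    (2, 'b@example.com', 'Paris', 'free');
""")

# EAV: two extra joins per attribute you filter on, and every value is text.
eav_result = conn.execute("""
    SELECT u.email
    FROM user u
    JOIN user_value vc ON vc.user_id = u.id
    JOIN user_field fc ON fc.id = vc.user_field_id AND fc.name = 'city'
    JOIN user_value vp ON vp.user_id = u.id
    JOIN user_field fp ON fp.id = vp.user_field_id AND fp.name = 'plan'
    WHERE vc.value = 'Paris' AND vp.value = 'pro'
""").fetchall()

# Plain columns: one table, and an index on (city, plan) could serve it.
wide_result = conn.execute(
    "SELECT email FROM user_wide WHERE city = 'Paris' AND plan = 'pro'"
).fetchall()

print(eav_result, wide_result)
```

Both return the same row, but the EAV version grows by two joins for each additional attribute in the filter.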
This really depends on what you're trying to do.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to really be multiple values in the future. E.g. credit limit is a single-value field as of now. But tomorrow, you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help some queries (those which make lots of JOINs) run faster, at the expense of creating a maintenance hell.
MySQL was, at the time this was written, capable of using only one JOIN method, namely NESTED LOOPS (hash joins were added later, in MySQL 8.0.18).
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, is accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have three rows with the same user data in them, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then with a normalized design you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) that would require no joins, calculations etc., which reports could point to. These were then used in conjunction with a SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week and so on).
Why not use the same approach WordPress does, by having a users table with basic user information that everyone has and then adding a "user_meta" table that can hold basically any key/value pair associated with the user id? So if you need to find all the meta information for the user you could just add that to your query. You would also not always have to add the extra query if it's not needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs because you have one table that rules all metadata, and you will limit it to only one association instead of 50.
Wordpress specifically does this to allow for features to be added via plugins, therefore allowing for your project to be more scalable and will not require a complete database overhaul if you need to add a new feature.
I think this is one of those "it depends" situations. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited to your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, it would probably be more difficult to maintain.
To me it looks like you could skip User_Details, as it probably has a 1-to-1 relation with Users.
But the rest probably have a lot of rows per user?

Performance considerations on big tables

"Likes" and "views" (etc.) are one of the very few valid cases for a 1:1 relationship for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
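A small sketch of the two patterns above, under made-up names: a counters table kept apart from the main table, related "1:rarely", read back with LEFT JOIN and COALESCE(). It runs on SQLite via Python's sqlite3 only for convenience; the SQL itself is plain enough to carry over to MySQL more or less unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Hypothetical side table: a row exists only once a post has been viewed.
CREATE TABLE post_counters (
    post_id INTEGER PRIMARY KEY REFERENCES posts(id),
    views   INTEGER NOT NULL DEFAULT 0
);
INSERT INTO posts VALUES (1, 'Hello'), (2, 'World');
INSERT INTO post_counters VALUES (1, 10);
""")

# The frequent "+1" touches only the small counters table.
conn.execute("UPDATE post_counters SET views = views + 1 WHERE post_id = 1")
conn.commit()

# LEFT JOIN only when the counter is needed; COALESCE turns the missing row into 0.
view_counts = conn.execute("""
    SELECT p.title, COALESCE(c.views, 0) AS views
    FROM posts p
    LEFT JOIN post_counters c ON c.post_id = p.id
    ORDER BY p.id
""").fetchall()

print(view_counts)
```

Queries that don't care about the counter never touch post_counters at all.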
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on the columns in the other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve an extra disk hit(s).
Bottom line: InnoDB is already taking care of this performance 'problem'.