SQL databases: normalization vs. performance? - mysql

For a project, I was asked to look at an existing SQL database and to see if it could be improved. It was basically a customer database with a bunch of different types of data per customer. This is (basically) how it was organized:
Each customer had a row in the customer table with a customer ID. Then for each type of data, each customer had its own table. So, for instance, there would not be one central table for "jobs", with a customer ID in each row, but for each customer there would be a jobs table called "jobs1234" (1234 being a customer ID).
Now, my first response was confusion as to why you would organize it like that. I've always just learned that it's always better to normalize without really thinking beyond that point. But when I discussed it with people, a few pointed out it may have been for performance reasons. They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.
Something about indexing and the customer ID being the identifier. I'm confused as to why this approach would improve performance and haven't really gotten a very clear answer so far. Can anyone explain to me why that's the case and if it's even true that this approach is better in some cases?

I find this statement rather shocking:
They said that if there were too many rows for "jobs", it would be
better to have them split up per customer than to have them all in one
table.
Databases are designed to have tables that have lots and lots of rows -- millions of rows should be no problem. You don't specify what the volume of data is, but with a name like jobs, I'd be surprised if the total volume exceeds a few million rows in total. For this volume of data, a single table with suitable indexes should be fine.
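A minimal sketch of the "single table with suitable indexes" idea. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table, column, and index names (jobs, customer_id, idx_jobs_customer) and the row counts are assumptions for illustration, not taken from the database in the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " title TEXT)"
)
# The index is what keeps per-customer lookups cheap in a shared table.
conn.execute("CREATE INDEX idx_jobs_customer ON jobs (customer_id)")

# 1,000 hypothetical customers with 20 jobs each, all in the same table.
rows = [(c, f"job {n} for customer {c}") for c in range(1, 1001) for n in range(20)]
conn.executemany("INSERT INTO jobs (customer_id, title) VALUES (?, ?)", rows)

# Ask the engine how it would run a per-customer lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM jobs WHERE customer_id = ?", (500,)
).fetchall()
plan_detail = " ".join(row[3] for row in plan)
print(plan_detail)  # the plan should mention idx_jobs_customer, not a full scan

jobs_for_customer = conn.execute(
    "SELECT COUNT(*) FROM jobs WHERE customer_id = ?", (500,)
).fetchone()[0]
print(jobs_for_customer)
```

In MySQL the equivalent check would be an EXPLAIN on the same SELECT; the point is only that the lookup goes through the index rather than reading every row.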
There are cases where splitting data by customer would make sense. The strongest case is when it is an explicit requirement, typically for security reasons. In other words, the clients are promised that "their data is never mixed with anyone else's data". And, in most databases (MySQL included), it is easier to deal with security at the table level than at the row level.
Another possible reason would be when the tables have different formats, reflecting different data for each customer. In this case, you would really be dealing with separate applications, and each customer should have their own database.
Are there any downsides to splitting the customer data into multiple tables per customer? Yes. Here are some:
You cannot write generic queries/views to access the data. Basically, all queries in the code need to be dynamic, so you can put in the right table name.
Maintaining the data becomes cumbersome. Instead of updating a single table, you have to update multiple tables.
Answering questions such as "How many jobs does each customer have?" or "What is the growth in the number of jobs over time?" becomes so difficult that people probably won't even bother asking them.
Performance is a mixed bag. Although you might save the overhead of storing the customer id in each table, you incur another cost. Having lots of smaller tables means lots of tables with partially filled pages. Depending on the number of jobs per customer and number of overall customers, you might actually be multiplying the amount of space used. In the worst case of one job per customer where a page contains -- say -- 100 jobs, you would be multiplying the required space by about 100.
The last point also applies to the page cache in memory. So, data in one table that would fit into memory might not fit into memory when split among many tables.
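The space argument above can be checked with a little arithmetic. This is a rough sketch under two stated assumptions: a page holds a fixed number of jobs (100, as in the example), and every table occupies at least one page that it cannot share with another table. Real storage engines have extra per-table overhead that is ignored here.

```python
import math

def pages_single_table(total_jobs, jobs_per_page):
    """Pages needed when all jobs share one table (pages fill up completely)."""
    return math.ceil(total_jobs / jobs_per_page)

def pages_table_per_customer(jobs_per_customer, jobs_per_page):
    """Pages needed when each customer has its own table.

    Assumption: a table always occupies at least one page, and a page is
    never shared between tables, so small tables leave pages mostly empty.
    """
    return sum(max(1, math.ceil(n / jobs_per_page)) for n in jobs_per_customer)

# Worst case from the text: one job per customer, 100 jobs fit on a page.
jobs_per_customer = [1] * 10_000   # hypothetical 10,000 customers
single = pages_single_table(sum(jobs_per_customer), 100)
split = pages_table_per_customer(jobs_per_customer, 100)
ratio = split / single
print(single, split, ratio)
```

With these numbers the split layout needs 10,000 pages against 100 for the single table, which is the factor of about 100 mentioned above.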
Partitioning is one way to implement something similar. However, this would work best when the query load is focused on one customer at a time. If all customers are accessing the data at the same time, then partitioning is going to be less of a win, and indexing should be sufficient.
Unless there is a really good reason for splitting the data into separate tables (a requirement, cumbersome security for each client, or custom formats for each client), you simply would not take that approach. Even when there are reasons for doing it, there are often other solutions (such as partitioning) that solve the same problem.
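To make the point about generic queries concrete, here is a small sketch that answers "How many jobs does each customer have?" in both layouts. It uses sqlite3 as a stand-in for MySQL, and the customer IDs, job titles, and table names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
customer_ids = [1, 2, 3]                                  # hypothetical customers
jobs = {1: ["a", "b"], 2: ["c"], 3: ["d", "e", "f"]}      # hypothetical jobs

# Layout 1: one table per customer (jobs1, jobs2, ...).
for cid in customer_ids:
    conn.execute(f"CREATE TABLE jobs{cid} (title TEXT)")
    conn.executemany(
        f"INSERT INTO jobs{cid} (title) VALUES (?)", [(t,) for t in jobs[cid]]
    )

# Layout 2: a single table with a customer_id column.
conn.execute("CREATE TABLE jobs (customer_id INTEGER NOT NULL, title TEXT)")
conn.executemany(
    "INSERT INTO jobs (customer_id, title) VALUES (?, ?)",
    [(cid, t) for cid in customer_ids for t in jobs[cid]],
)

# Per-customer tables: the table name must be spliced into the SQL text,
# and the application has to loop, issuing one query per customer.
counts_split = {}
for cid in customer_ids:
    counts_split[cid] = conn.execute(f"SELECT COUNT(*) FROM jobs{cid}").fetchone()[0]

# Single table: one static query answers the question for everyone.
counts_single = dict(
    conn.execute("SELECT customer_id, COUNT(*) FROM jobs GROUP BY customer_id")
)
print(counts_split, counts_single)
```

Both approaches give the same counts, but the first needs dynamic SQL and as many round trips as there are customers.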

Related

database, a table for each users or a big table?

I have just started learning about databases. In designing a database, I notice that a lot of recommendations, such as in this thread, suggest NOT using one table per user, but keeping all data in one big table and querying it when needed. But I still do NOT understand, because it seems that in a lot of situations, one table per user is much more efficient.
Suppose I have a database for 10,000 customers to track their orders. Each customer will have very few orders, around 10. With one big table, every time a customer logs in, you have to go through the big table to fetch the data for this customer; however, if you keep one table per user, you can directly get what the customer needs.
Another example: a restaurant information system tracks all restaurants' menus (say, as [foodname, price] pairs). Since each restaurant has a different number of dishes, you can't really put each menu in one row; you can only make a huge table with [foodname, price, restaurant] rows. But there are a lot of restaurants, so when a user needs the menu of a certain restaurant, you'll need to go through the data of all restaurants, which is obviously inefficient.
For both these two examples, I can't think of a good way to design a database if I don't want to create each table per user. So my question is this:
If we want to avoid each table per user design, how should we design a database for these kinds of situations?
SQL databases are designed exactly for the types of scenarios you are suggesting. They can handle millions or billions of rows extremely efficiently. The complications of trying to partition every customer into a separate table are vast.
The only thing you need to worry about is that you have indexes on your table so that you do not have to scan through that billion records to find the ones applicable to your customer.
Once the indexes are in place then all of your example scenarios become simple and efficient queries.
Databases are designed to do exactly the kinds of lookups you're describing efficiently, even if all users are in a single table. As long as you create an index on user ID (or have the user ID as the leading part of the primary key), the database keeps that index sorted by user ID (typically as a B-tree), so it can find any particular user with a binary-search-style lookup instead of scanning the whole table.
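The binary search idea can be sketched in a few lines of Python. This is a simplification: real engines usually keep indexes in B-trees rather than flat sorted lists, and the user IDs below are invented, but the principle of "sorted keys, logarithmic lookup" is the same.

```python
from bisect import bisect_left, bisect_right

# Hypothetical "table": the user_id stored in each row, in insertion order.
table_user_ids = [7, 3, 7, 1, 3, 7, 9]

# The "index": (user_id, row position) pairs kept sorted by user_id.
index = sorted((user_id, pos) for pos, user_id in enumerate(table_user_ids))
keys = [user_id for user_id, _ in index]

def rows_for_user(user_id):
    """Return the row positions for one user via binary search on the index."""
    lo = bisect_left(keys, user_id)    # first index entry for this user
    hi = bisect_right(keys, user_id)   # one past the last entry for this user
    return [pos for _, pos in index[lo:hi]]

print(rows_for_user(7))
print(rows_for_user(4))
```

Looking up one user touches only a handful of index entries, no matter how many other users share the table.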
"Tables" don't mean exactly what you think they mean either. Tables are meant to be used to logically group data in ways that are useful for the programmer. In theory, any database you use could just consist of one big table, but it's generally easier to reason about a database if you know that rows of the User table look like this, while rows of the Message table (or whatever) look like that. In fact, many databases only actually have one big underlying "table" in which all the data lives. So, whether two users are in the "same table" or "different tables" often doesn't matter at all from an efficiency standpoint.
Database management software is written based on the assumption that you'll have a relatively small number of tables (dozens, maybe hundreds in extreme cases). So go with whatever your database's documentation recommends.

MySQL - Partitioning vs multiple table suggestion for a use case

We have around 30,000 customers, and each customer has multiple products. We currently store all the products in a single table partitioned by KEY(customerid). I would like your suggestions on whether separate tables for each customer would be more beneficial than partitioning, or whether we should continue to use partitioning with the current (HASH) or a different type.
The number of products per customer varies: a few customers have more than 1M products, while some have as few as a few hundred. This may result in unevenly sized partitions.
If a customer account is deleted, all products of that customer are deleted as well. With separate tables, this would be quite convenient.
All customers are disjointed. So there is no query to access cross-customer products.
The number of customers is quite large (around 30k), and I am not sure it's a good idea to have so many tables.
Is any other partitioning scheme better than what we are currently using?
Thank you for your inputs.
Generally I would go with the single table solution that you already have; it's the simple, straightforward way to go.
You don't mention your motivation for wanting to change your setup.
How many entries do you have in your products table?
Are you experiencing performance issues with your current setup? If not I might be inclined to call this a case of "premature optimization".
If you ARE experiencing performance issues I would start by analyzing those first (profiling) to determine whether they are caused by your single products table design being a bottleneck.
Practical advice I can offer: Make sure you are using InnoDB storage engine and not MyISAM since that will allow for row level locks.
The downside to your proposal of having one table for each customer is maintenance and complexity. If you ever want to change the schema of the product tables, it will be a lot more complicated and error-prone than before. You might have to write a script to batch the changes across all those tables, and what if the script crashes halfway? Then half of your customers have a changed table schema and the other half don't. As I mentioned, if you do not currently have a performance problem, you would be adding this complexity and maintenance without gaining anything.
You state that "All customers are disjointed. So there is no query to access cross-customer products." However, it might not stay that way forever. Imagine that in 2 months you need to extract a list of all customers who own a specific product of type x. That would be a simple SQL query in your current setup; in the multi-table setup you would have to write a script or small program that iterates over all customers and makes a product query for each one. So what was 1 query before is now 30,000 queries.
What you propose is a simple form of sharding. If you decide to go that way, you may want to look into sharding in general, since there are other ways to approach it than the somewhat aggressive approach of giving every customer a dedicated table. E.g. use a hash of each customer id as the sharding key, so every customer is either part of group A or group B. Products owned by A-customers are in ProductTableA, products owned by B-customers are in ProductTableB. (In a real implementation you may want to hash to a value between 0-255 and then keep a reference list saying that 0-127 are table A and 128-255 are table B. That way, if you ever decide to scale up and add one more table, you don't have to recalculate all your hashes; you just update your reference list.)
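The hash-plus-reference-list scheme could look roughly like this. It is a sketch, not a production sharding layer: the choice of SHA-256 and of the first digest byte as the bucket is an assumption, and ProductTableC is a hypothetical third table added only to show the rebalancing step.

```python
import hashlib

def bucket_for(customer_id):
    """Map a customer id to a stable bucket in the range 0-255.

    Assumption: the first byte of a SHA-256 digest is used as the bucket;
    any stable hash would do.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return digest[0]

# Reference list: which bucket range lives in which table.
two_tables = [(range(0, 128), "ProductTableA"), (range(128, 256), "ProductTableB")]

def table_for(customer_id, reference_list):
    """Look up the table that holds this customer's products."""
    bucket = bucket_for(customer_id)
    for bucket_range, table_name in reference_list:
        if bucket in bucket_range:
            return table_name
    raise ValueError(f"no table covers bucket {bucket}")

# Scaling up: only the reference list changes, the buckets stay the same.
three_tables = [
    (range(0, 86), "ProductTableA"),
    (range(86, 171), "ProductTableB"),
    (range(171, 256), "ProductTableC"),   # hypothetical new table
]

for customer_id in (17, 4242, 29999):
    print(customer_id, bucket_for(customer_id),
          table_for(customer_id, two_tables),
          table_for(customer_id, three_tables))
```

When a table is added, only the rows whose bucket moved to a new range have to be migrated; the hash of every customer stays valid.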

Is it necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings come in ... it is in the detail for each record. The store table has names, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data, adding up to far more space consumption than would occur just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, as well as the size of the overall table, adversely impacts the speed of your queries. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly improve the response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
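As a small illustration of the store / transaction / item example, the sketch below keeps each level of detail in its own table and rebuilds the "everything in one row" view with a join only when it is needed. It runs on sqlite3 as a stand-in for MySQL, and all column names and sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (
    store_id INTEGER PRIMARY KEY,
    name TEXT, address TEXT, phone TEXT
);
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    store_id INTEGER NOT NULL REFERENCES stores (store_id),
    sold_at TEXT, amount REAL
);
CREATE TABLE items_sold (
    item_id INTEGER PRIMARY KEY,
    transaction_id INTEGER NOT NULL REFERENCES transactions (transaction_id),
    description TEXT
);
""")

# Store details are written once, not once per item sold.
conn.execute("INSERT INTO stores VALUES (3, 'Store 3', '1 Example Street', '555-0100')")
conn.execute("INSERT INTO transactions VALUES (1, 3, '2024-01-01 09:00', 12.50)")
conn.executemany(
    "INSERT INTO items_sold (transaction_id, description) VALUES (?, ?)",
    [(1, "milk"), (1, "bread"), (1, "eggs")],
)

# The flat view is reconstructed on demand by following the identifiers.
flat = conn.execute("""
    SELECT s.name, s.address, t.sold_at, i.description
    FROM items_sold AS i
    JOIN transactions AS t ON t.transaction_id = i.transaction_id
    JOIN stores AS s ON s.store_id = t.store_id
    ORDER BY i.item_id
""").fetchall()
store_rows = conn.execute("SELECT COUNT(*) FROM stores").fetchone()[0]
print(flat, store_rows)
```

The address appears in every joined row, yet it is stored exactly once; in a single wide table it would be stored once per item.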
Short answer: in a relational database like MySQL, yes. Check this out about referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use relational databases for your project. In fact the trend is to use Non-Relational databases (NoSQL), like MongoDB to achieve same results with better performance. More about RDBMS vs NoSQL http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think that with this example you will understand better:
Let's say we want to create an online store. We have at minimum Users, Payments and Events (events about the pages where the user navigates or other actions). In this scenario we want to link the Users with the Payments in a secure and relational way. We do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the tables Users and Payments and link them with proper Foreign Keys. However, for the events, there are going to be a lot of them per user (maybe millions) and we need to track them quickly without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
To sum up, you can use a hybrid of SQL and NoSQL, but whether you use one, the other, or both kinds of solutions, do it properly.
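What "linked with proper Foreign Keys" buys you can be seen in a short sketch. It uses sqlite3 as a stand-in for MySQL (in SQLite, foreign keys must be switched on per connection with a PRAGMA; in MySQL, InnoDB enforces them by default). Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite-specific switch
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE payments ("
    " payment_id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users (user_id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO payments (user_id, amount) VALUES (1, 9.99)")

def try_insert_payment(user_id, amount):
    """Return True if the database accepted the payment, False if it refused."""
    try:
        conn.execute(
            "INSERT INTO payments (user_id, amount) VALUES (?, ?)", (user_id, amount)
        )
        return True
    except sqlite3.IntegrityError:
        return False

def try_delete_user(user_id):
    """Return True if the user row was deleted, False if the database refused."""
    try:
        conn.execute("DELETE FROM users WHERE user_id = ?", (user_id,))
        return True
    except sqlite3.IntegrityError:
        return False

orphan_accepted = try_insert_payment(42, 5.00)   # user 42 does not exist
user_deleted = try_delete_user(1)                # user 1 still has a payment
print(orphan_accepted, user_deleted)
```

Without the relationship, both statements would succeed silently and it would be up to application code to notice the orphaned payment.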

What should be the best way to store entries that are rarely used?

I'm in the process of designing a database (MySQL) for a security company that wants to keep track of all the security guards it hires. Due to the nature of the industry, a significant number of people are moved into a "terminated" list (mostly people who were fired on bad terms). The company wants to keep track of them since some of them have a tendency to try to re-apply for work after a year or two. Also, there are times when executives in the company think that putting a certain person on that list was unjust and they reinstate them (which is why, to my understanding, a MySQL Archive won't work).
The "center" of the database is guards table that has many relationships with other tables in the database, and I'm trying to decide what would be the most efficient way to design the "terminated" list. I thought of two options:
Have the guards table be in a one-to-one relationship with a terminatedGuards table. The problem I see in this solution is that any time I want to query the data I would always need to add a clause in my SELECT statement to exclude people that are in the terminatedGuards table.
Make a separate table with columns similar to the guards table, and any time a guard is moved to that table I completely erase their entry from the guards table and just copy it to the terminatedGuards table. The problem I see with this approach is that I would need to follow a lot of relationships that are associated with that entry (and sometimes I would want to re-create them with the copied entry in the terminatedGuards list for reference). For example, I would need to re-link a table that holds the work history of guards at different sites managed by the company with the terminatedGuards table, so I could preserve the work history of that guard, even if he or she was fired.
Which approach should be more efficient?
Thanks.
I really doubt you're going to have a million records in this table. Flag them by status, add an index on that status flag, and you should be fine.
Moving records between tables is always trouble, so it's usually done as a last resort. For example, if you had a billion records in the table you'd want to partition it or shard it in some capacity, but what you're talking about here is trivial amounts of data in comparison. It's unlikely you'll ever have more than a million records in this table, and if you do, obviously you're involved in a project that's of such a massive scale you can afford the hardware to host a database of that size.
Usually you'd architect this to have a guards table, and then some kind of associated records that define when they were hired, fired, or any other event that impacted their employment.
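A minimal sketch of the status flag plus event history design, on sqlite3 as a stand-in for MySQL. The status values, the employment_events table, and the sample guards are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE guards (
    guard_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'active'   -- 'active' or 'terminated'
);
CREATE INDEX idx_guards_status ON guards (status);

CREATE TABLE employment_events (
    event_id   INTEGER PRIMARY KEY,
    guard_id   INTEGER NOT NULL REFERENCES guards (guard_id),
    event_type TEXT NOT NULL,                 -- 'hired', 'terminated', 'reinstated'
    event_date TEXT NOT NULL
);
""")

def record_event(guard_id, event_type, event_date, new_status):
    """Flip the status flag and log the event in one transaction."""
    with conn:
        conn.execute(
            "UPDATE guards SET status = ? WHERE guard_id = ?", (new_status, guard_id)
        )
        conn.execute(
            "INSERT INTO employment_events (guard_id, event_type, event_date)"
            " VALUES (?, ?, ?)",
            (guard_id, event_type, event_date),
        )

def active_guards():
    rows = conn.execute(
        "SELECT name FROM guards WHERE status = 'active' ORDER BY guard_id"
    )
    return [row[0] for row in rows]

conn.executemany(
    "INSERT INTO guards (guard_id, name) VALUES (?, ?)", [(1, "Guard A"), (2, "Guard B")]
)
record_event(1, "hired", "2020-01-01", "active")
record_event(2, "hired", "2020-02-01", "active")
record_event(2, "terminated", "2021-03-01", "terminated")
active_before = active_guards()

# Reinstating is a single UPDATE; no rows are copied between tables.
record_event(2, "reinstated", "2022-06-01", "active")
active_after = active_guards()
history = [
    row[0]
    for row in conn.execute(
        "SELECT event_type FROM employment_events WHERE guard_id = 2 ORDER BY event_id"
    )
]
print(active_before, active_after, history)
```

Because the guard row never moves, every other table that references guard_id keeps pointing at the same record through termination and reinstatement.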

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought as a single value data may turn out to be really multiple values in future. e.g. credit limit is a single value field as of now. But tomorrow, you may decide to change the values as (date from, date to, credit value). Split tables might come handy now.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS (hash joins were only added in MySQL 8.0.18).
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation, which may take dozens of times as long as a pure record scan.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in other tables, then the increase in table scan time can outweigh the benefits of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
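The nested loops method described above can be sketched in plain Python. This is only a model of the idea: the rows are made up, and a dictionary stands in for the index on the driven table (a real engine would typically walk a B-tree there).

```python
# Hypothetical rows; users is the driving table, user_stats the driven one.
users = [
    {"user_id": 1, "username": "alice"},
    {"user_id": 2, "username": "bob"},
    {"user_id": 3, "username": "carol"},
]
user_stats = [
    {"user_id": 1, "hits": 10},
    {"user_id": 3, "hits": 7},
]

# Stand-in for an index on user_stats.user_id.
stats_by_user = {row["user_id"]: row for row in user_stats}

def nested_loop_join(driving_rows, driven_index, key):
    """For each driving row, locate the matching driven row and merge them."""
    joined = []
    for row in driving_rows:                  # outer loop: scan the driving table
        match = driven_index.get(row[key])    # the "locate a record" step
        if match is not None:
            joined.append({**row, **match})
    return joined

joined = nested_loop_join(users, stats_by_user, "user_id")
print(joined)
```

Denormalizing removes the lookup inside the loop, at the price of a wider table to scan and to keep consistent.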
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, is accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are
likely to remain empty
If, for example, a user didn't have any interests, then with a normalized design you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) that would require no joins, calculations etc., and that reports could point to. These were then used in conjunction with SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week and so on).
Why not use the same approach WordPress does, by having a users table with the basic user information that everyone has and then adding a "user_meta" table that can hold basically any key/value pair associated with the user id? If you need to find all the meta information for the user, you can just add that to your query. You also don't have to run the extra query when it isn't needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata and you limit it to only one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable and avoids a complete database overhaul if you need to add a new feature.
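A rough sketch of such a key/value metadata table, on sqlite3 as a stand-in for MySQL. The schema below is a simplified illustration and is not WordPress's actual table definition; add_meta and get_meta are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
CREATE TABLE user_meta (
    meta_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users (user_id),
    meta_key   TEXT NOT NULL,
    meta_value TEXT
);
CREATE INDEX idx_user_meta_lookup ON user_meta (user_id, meta_key);
""")

def add_meta(user_id, key, value):
    """Attach one more key/value pair to a user; no schema change needed."""
    conn.execute(
        "INSERT INTO user_meta (user_id, meta_key, meta_value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )

def get_meta(user_id, key):
    """Return all values stored under a key for this user, oldest first."""
    rows = conn.execute(
        "SELECT meta_value FROM user_meta"
        " WHERE user_id = ? AND meta_key = ? ORDER BY meta_id",
        (user_id, key),
    )
    return [row[0] for row in rows]

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
add_meta(1, "twitter_handle", "@example")
add_meta(1, "interest", "cycling")
add_meta(1, "interest", "chess")

print(get_meta(1, "interest"), get_meta(2, "interest"))
```

The trade-off is that every value is stored as text and filtering on several keys at once takes self-joins, so this fits optional, rarely filtered attributes better than core columns.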
I think this is one of those "it depends" situations. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited to your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand, probably more difficult to maintain.
To me it looks like you could skip User_Details, as it probably is a 1 to 1 relation with Users.
But the rest are probably a lot of rows per user?

Performance considerations on big tables

"Likes" and "views" (etc.) are one of the very few valid cases for a 1:1 relationship for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
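Separating the counters might look like the sketch below, written against sqlite3 as a stand-in for MySQL. The posts and post_counters tables and the record_view helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    body    TEXT
);
-- 1:1 companion table that only holds the frequently updated counters.
CREATE TABLE post_counters (
    post_id INTEGER PRIMARY KEY REFERENCES posts (post_id),
    views   INTEGER NOT NULL DEFAULT 0,
    likes   INTEGER NOT NULL DEFAULT 0
);
""")
conn.execute("INSERT INTO posts VALUES (1, 'Hello', 'A long body...')")
conn.execute("INSERT INTO post_counters (post_id) VALUES (1)")

def record_view(post_id):
    """The hot path: a tiny UPDATE that never touches the wide posts row."""
    with conn:
        conn.execute(
            "UPDATE post_counters SET views = views + 1 WHERE post_id = ?", (post_id,)
        )

for _ in range(3):
    record_view(1)

views = conn.execute("SELECT views FROM post_counters WHERE post_id = 1").fetchone()[0]
title = conn.execute("SELECT title FROM posts WHERE post_id = 1").fetchone()[0]
print(views, title)
```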
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
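The "1:rarely" pattern with LEFT JOIN and COALESCE() can be sketched as follows, using sqlite3 as a stand-in for MySQL. The user_discounts table and the sample users are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
-- Only users who actually have a discount get a row here.
CREATE TABLE user_discounts (
    user_id          INTEGER PRIMARY KEY REFERENCES users (user_id),
    discount_percent INTEGER NOT NULL
);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob"), (3, "carol")]
)
conn.execute("INSERT INTO user_discounts VALUES (2, 15)")

# LEFT JOIN keeps users without a discount row; COALESCE turns NULL into 0.
rows = conn.execute("""
    SELECT u.username, COALESCE(d.discount_percent, 0)
    FROM users AS u
    LEFT JOIN user_discounts AS d ON d.user_id = u.user_id
    ORDER BY u.user_id
""").fetchall()
print(rows)
```

Queries that never need the rare columns can skip the join entirely and read only the users table.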
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then have to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve an extra disk hit(s).
Bottom line: InnoDB is already taking care of this performance 'problem'.