Query JOIN or not? (optimization) - mysql

I was wondering whether using 2 tables is better than using 1 single table.
Scenario:
I have a simple user table and a simple user_details table. I can JOIN the tables and select records from both.
But I was wondering whether to merge the 2 tables into 1 single table.
What if I have 2 million user records in both tables?
In terms of speed and execution time, is it better to have a single table when selecting records?

You should easily be able to make either scenario perform well with proper indexing. Two million rows is not that many for any modern RDBMS.
However, one table is a better design if rows in the two tables represent the same logical entity. If the user table has a 1:1 relationship with the user_detail table, you should (probably) combine them.
Edit: A few other answers have mentioned de-normalizing--this assumes the relationship between the tables is 1:n (I read your question to mean the relationship was 1:1). If the relationship is indeed 1:n, you absolutely want to keep them as two tables.
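As a minimal sketch of the "proper indexing" point (column names here are illustrative, not taken from your schema): if user_details keys directly on the user's id, the 1:1 JOIN is a primary-key lookup on both sides and stays cheap at 2 million rows:

    CREATE TABLE user (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(60)  NOT NULL,
        email    VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE user_details (
        user_id INT UNSIGNED NOT NULL PRIMARY KEY,  -- same value as user.id
        address VARCHAR(255),
        phone   VARCHAR(30),
        FOREIGN KEY (user_id) REFERENCES user (id)
    ) ENGINE=InnoDB;

    -- The 1:1 JOIN resolves via the primary keys on both sides.
    SELECT u.username, d.address, d.phone
    FROM user u
    JOIN user_details d ON d.user_id = u.id
    WHERE u.id = 12345;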

Joins themselves are not inherently bad; an RDBMS is designed to perform joins very efficiently, even with millions or hundreds of millions of records. Normalize first before you start to de-normalize, especially if you're new to DB design. You may ultimately end up incurring more overhead maintaining a de-normalized database than you would by using the appropriate joins.
As to your specific question, it's very difficult to advise because we don't really know what's in the tables. I'll throw out some scenarios; if one matches yours, great, otherwise please give us more details.
If there is, and will always be, a one-to-one relationship between user and user_details, then user_details likely contains attributes of the same entity, and you can consider combining them.
If the relationship is 1-to-1 but user_details contains LOTS of data for each user that you don't routinely need when querying, it may be faster to keep that in a separate table. I've often seen this done as an optimization to reduce the cost of table scans (see the query sketch after this list).
If the relationship is 1-to-many, I'd strongly advise against combining them; you'll soon wish you hadn't (as will those who come after you).
If the schema of user_details varies, I've seen that too: a core table plus an additional attribute table with a variable schema. If this is your case, proceed with caution.
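To illustrate the second scenario above, a quick sketch of the query shape (the bulky profile_blob column is hypothetical): routine reads touch only the narrow user table, and the wide detail row is joined in only when it is actually needed:

    -- Routine query: the narrow user table only, no bulky detail columns read.
    SELECT id, username, email FROM user WHERE email = 'a@example.com';

    -- Occasional query: join in the bulky details only when actually needed.
    SELECT u.username, d.profile_blob
    FROM user u
    JOIN user_details d ON d.user_id = u.id
    WHERE u.id = 12345;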

To denormalize or not to denormalize, that is the question...
There is no simple, one-size-fits-all response to this question. It is a case-by-case decision.
In this instance, it appears that there is exactly one user_detail record per record in the user table (or possibly either 1 or 0 detail records per user record), so, caching subtleties aside, there is really little or no penalty for "denormalizing". (Indeed, in the 1:1 cardinality case, this would effectively be a normalization.)
The difficulty in giving a "definitive" recommendation comes from the many factors involved. In particular (format: I list questions/parameters to consider, followed by the general considerations relevant to them):
what is the frequency of UPDATEs/ DELETEs / INSERTs ?
what is the ratio of reads (SELECTs) vs. writes (UPDATEs, DELETEs, INSERTs) ?
Do the SELECT usually get all the rows from all the tables, or do we only get a few rows and [often or not] only select from one table at a given time ?
If there are relatively few writes compared with reads, it would be possible to create many indexes, some of them covering the most common queries, and hence logically re-creating, in a more flexible fashion, the two-table (indeed multi-table) setup (see the covering-index sketch after this list). The downside of too many covering indexes is of course that they occupy more disk space (not a big issue these days) but also that they may, to some extent, crowd the cache. Too many indexes may also put an undue burden on write operations...
what is the size of a user record? what is the size of a user_detail record?
what is the typical filtering done by a given query? Do the most common queries return only a few rows, or do they yield several thousand records (or more), most of the time?
If either record's average size is "unusually" long, say above 400 bytes, a multi-table setup may be appropriate. After all, and somewhat depending on the type of filtering done by the queries, JOIN operations are typically done very efficiently by MySQL, so there is little penalty in keeping separate tables.
is the cardinality effectively 1:1 or 1:[0,1] ?
If that isn't the case, i.e. if we have user records with more than one user_details record, then given the relatively small number of records (2 million; yes, 2M is small, not tiny, but small, in modern DBMS contexts), denormalization would probably be a bad idea. (A possible exception is the case where we query the same 4 or 5 fields several dozen times per second, some from the user table, some from the user_detail table.)
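As a small illustration of the covering-index idea from the read/write-ratio point above, assuming a combined user table with hypothetical username, email and city columns:

    -- A covering index for a frequent lookup pattern: MySQL can answer the
    -- query from the index alone, without reading the full rows.
    CREATE INDEX idx_user_login_cover
        ON user (username, email, city);

    -- This query can be served entirely from idx_user_login_cover.
    SELECT email, city
    FROM user
    WHERE username = 'alice';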
Bottom lines:
2 million records is relatively small ==> favor a schema that is driven by the semantics of the records/sub-records rather than by premature performance concerns. If there already are real performance bottlenecks, they are probably not caused by, nor likely to be greatly helped by, schema changes.
if 1:1 or 1:[0-1] cardinality, re-uniting the data in a single table is probably a neutral choice, performance-wise.
if 1:many cardinality, denormalization ideas are probably premature (again, given the "small" database size)
read about SQL optimization, the pros and cons of indexes of various types, and ways of limiting the size of the data while allowing the same fields/semantics to be recorded.
establish baselines, monitor the performance frequently.

Denormalization will generally use up more space while affording better query performance.
Be careful though: cache also matters, and having more data effectively "shrinks" your cache! This may or may not wipe out the theoretical performance benefit of merging two tables into one. As always, benchmark with representative data.
Of course, the more denormalized your data model is, the harder it will be to enforce data consistency. Performance does not matter if data is incorrect!
So, the answer to your question is: "it depends" ;)

The current trend is to denormalize (i.e. put them in the same table). It usually gives better performance, but it is easier to make the data inconsistent (through programming mistakes, that is).
Plan: determine your workload type.
Benchmark: see whether the performance gain is worth the risk.

Related

SQL databases: normalization vs. performance?

For a project, I was asked to look at an existing SQL database and to see if it could be improved. It was basically a customer database with a bunch of different types of data per customer. This is (basically) how it was organized:
Each customer had a row in the customer table with a customer ID. Then, for each type of data, each customer had its own table. So, for instance, there would not be one central table for "jobs" with a customer ID in each row; instead, for each customer there would be a jobs table called "jobs1234" (1234 being a customer ID).
Now, my first response was confusion as to why you would organize it like that. I've always just learned that it's always better to normalize without really thinking beyond that point. But when I discussed it with people, a few pointed out it may have been for performance reasons. They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.
Something about indexing and the customer ID being the identifier. I'm confused as to why this approach would improve performance and haven't really gotten a very clear answer so far. Can anyone explain to me why that's the case and if it's even true that this approach is better in some cases?
I find this statement rather shocking:
They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.
Databases are designed to have tables that have lots and lots of rows -- millions of rows should be no problem. You don't specify what the volume of data is, but with a name like jobs, I'd be surprised if the total volume exceeds a few million rows in total. For this volume of data, a single table with suitable indexes should be fine.
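A minimal sketch of that single-table layout (column names invented for the example): an index that leads with the customer ID keeps per-customer lookups fast even with millions of rows in one jobs table:

    -- One central jobs table instead of one table per customer.
    CREATE TABLE jobs (
        job_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id INT UNSIGNED    NOT NULL,
        status      VARCHAR(20)     NOT NULL,
        created_at  DATETIME        NOT NULL,
        KEY idx_jobs_customer (customer_id, created_at)
    ) ENGINE=InnoDB;

    -- Per-customer queries stay fast via the composite index.
    SELECT job_id, status, created_at
    FROM jobs
    WHERE customer_id = 1234
    ORDER BY created_at DESC
    LIMIT 50;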
There are cases where splitting data by customer would make sense. The strongest case is when it is an explicit requirement, typically for security reasons. In other words, the clients are promised that "their data is never mixed with anyone else's data". And, in most databases (MySQL included), it is easier to deal with security at the table level than at the row level.
Another possible reason would be when the tables have different formats, reflecting different data for each customer. In that case, you would really be dealing with separate applications, and each customer should have their own database.
Are there any downsides to splitting the customer data into multiple tables per customer? Yes. Here are some:
You cannot write generic queries/views to access the data. Basically, all queries in the code need to be dynamic, so you can put in the right table name.
Maintaining the data becomes cumbersome. Instead of updating a single table, you have to update multiple tables.
Answering questions such as "How many jobs does each customer have?" or "What is the growth in the number of jobs over time?" becomes so difficult that people probably won't even bother asking.
Performance is a mixed bag. Although you might save the overhead of storing the customer id in each table, you incur another cost. Having lots of smaller tables means lots of tables with partially filled pages. Depending on the number of jobs per customer and number of overall customers, you might actually be multiplying the amount of space used. In the worst case of one job per customer where a page contains -- say -- 100 jobs, you would be multiplying the required space by about 100.
The last point also applies to the page cache in memory. So, data in one table that would fit into memory might not fit into memory when split among many tables.
Partitioning is one way to implement something similar. However, this would work best when the query load is focused on one customer at a time. If all customers are accessing the data at the same time, then partitioning is going to be less of a win, and indexing should be sufficient.
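For completeness, here is a rough sketch of what partitioning that single jobs table could look like (same hypothetical columns as above); you keep one logical table while physically grouping rows by customer:

    -- One logical table, physically split by customer_id.
    -- Note: MySQL requires the partitioning column to be part of every
    -- unique key, hence the composite primary key here.
    CREATE TABLE jobs_partitioned (
        job_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        customer_id INT UNSIGNED    NOT NULL,
        status      VARCHAR(20)     NOT NULL,
        created_at  DATETIME        NOT NULL,
        PRIMARY KEY (job_id, customer_id)
    ) ENGINE=InnoDB
    PARTITION BY HASH (customer_id)
    PARTITIONS 16;

    -- Queries that filter on customer_id only touch one partition.
    SELECT COUNT(*) FROM jobs_partitioned WHERE customer_id = 1234;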
Unless there is a really good reason for splitting the data into separate tables (a requirement, cumbersome security for each client, or custom formats for each client), you simply would not take that approach. Even when there are reasons for doing it, there are often other solutions (such as partitioning) that solve the same problem.

Using multiple Tables instead of one BIG table?

Is there an advantage or disadvantage to splitting big tables into multiple smaller tables when using InnoDB & MySQL?
I'm not talking about splitting the actual innoDB file of course, I'm just wondering what happens when I use multiple tables.
Circumstances:
I have a REAL big table with millions of rows (items), they are categorized (column "category").
Now, I'm thinking about using a separate table for each category instead.
I will not need the data across multiple tables under any circumstances guaranteed.
Generally speaking, if your tables have no relevance to each other, they should be separate tables rather than one catch-all table.
However, if they are related, they should really reside in one table. You can manage the performance of large tables in a number of ways. I suggest you have a look at partitioning a table if it grows so large that it starts to cause problems.
However, millions of rows isn't a "REAL big table" as you say; we have many tables with tens of millions of rows, and even a few with hundreds of millions, and they perform just fine thanks to a mixture of clever indexing, partitioning and read replicas.
Edit 1 - In response to comments:
Creating a dynamic table for each of your keys in a key-value pair is, as you rightly say, unusual, ugly and just very wrong: you're defeating the relational part of an RDBMS.
It is impossible for me to be specific in what follows, as your schema and detailed information about what you want to achieve are still missing from this question; however, I feel I grasp enough to edit my original answer.
There is a huge difference between partitioning a table within the same database and creating a new table in another database. You ask about performance; generally speaking, they should perform the same (that is, a new table in a 1000 GB database and one in a 0 GB database), provided you have enough resources, such as memory for indexing and I/O on the underlying data storage, and there is no bottleneck.
I can't really work out why you would want to create dynamic tables ("table_{category}") or store the value/category in a text file. This really sounds like you need a 1-N relationship and a JOIN.

MySQL - 1 large table with 100 columns OR split into 5 tables and JOIN

I had a 'large' MySQL table that originally contained ~100 columns and I ended up splitting it up into 5 individual tables and then joining them back up with CodeIgniter Active Record...
From a performance point of view, is it better to keep the original table with 100 columns, or to keep it split up?
Each table has around 200 rows.
200 rows? That's nothing.
I would split the table if the new ones combined columns in a way that was meaningful for your problem. I would do it with an eye towards normalization.
You sound like you're splitting them to meet some unstated criteria for "goodness" or because your current performance is unacceptable. Do you have some data that suggests a performance problem that is caused by your schema? If not, I'd recommend rethinking this approach.
No one can say what the impact on performance will be. More JOINs may be slower when you query, but you don't say what your use cases are.
So you've already made the change and now you're asking if we know which version of your schema goes faster?
(if the answer is the split tables, then you're doing something wrong).
Not only should the consolidated table be faster, it should also require less code and therefore be less likely to have bugs.
You've not provided any information about the structure of your data.
And with 200 rows in your database, performance is the last thing you need to worry about.
The concept you're referring to is called vertical partitioning and it can have surprising effects on performance. On a Mysql.com Performance Post they discuss this in particular. An excerpt from the article:
Although you have to do vertical partitioning manually, you can benefit from the practice in certain circumstances. For example, let's say you didn't normally need to reference or use the VARCHAR column defined in our previously shown partitioned table.
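A rough sketch of that kind of manual vertical partitioning, using hypothetical product tables rather than the article's exact example: the rarely used, wide column moves into a side table that shares the primary key and is only joined in when needed:

    -- Hot, narrow columns stay in the main table.
    CREATE TABLE product (
        product_id INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(100)  NOT NULL,
        price      DECIMAL(10,2) NOT NULL
    ) ENGINE=InnoDB;

    -- The rarely used, bulky column lives in a 1:1 side table.
    CREATE TABLE product_description (
        product_id  INT UNSIGNED  NOT NULL PRIMARY KEY,
        description VARCHAR(5000) NOT NULL,
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    ) ENGINE=InnoDB;

    -- The wide column is joined in only when actually required.
    SELECT p.name, d.description
    FROM product p
    JOIN product_description d ON d.product_id = p.product_id
    WHERE p.product_id = 42;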
The important thing is: you can (and it's good style!) move columns containing temporary data into a separate table. You can also move optional columns into a separate table (this depends on your logic).
When you are designing a database, the most important thing is that each table should encapsulate one concept. It is better to create more tables and separate different concepts into different tables. The only exception is when you have to optimize your software because the 'straight' logical solution works slowly.
If you deal with a very complicated model, you should divide it into a few simple blocks with simple relations; this works for database design as well.
As for performance: of course one table should give better performance, since you would not need any joins or keys to access all the data. Fewer relations, less lag.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought was single-value data may turn out to really be multiple values in the future. E.g., credit limit is a single-value field as of now, but tomorrow you may decide to change it to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page on database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interests table for that user. If you have everything in one massive table, you will have columns (and apparently a lot of them) that contain just NULLs.
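A minimal sketch of that normalized interests layout (names invented for the example): users without interests simply have no rows, so there are no NULL-filled columns:

    -- One row per (user, interest) pair instead of N interest columns.
    CREATE TABLE user_interests (
        user_id  INT UNSIGNED NOT NULL,
        interest VARCHAR(100) NOT NULL,
        PRIMARY KEY (user_id, interest)
    ) ENGINE=InnoDB;

    -- A user with three interests gets three rows; a user with none gets none.
    INSERT INTO user_interests (user_id, interest)
    VALUES (42, 'cycling'), (42, 'chess'), (42, 'mysql');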
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table), requiring no joins, calculations, etc., that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (e.g. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: have a users table with the basic user information that everyone has, and then add a "user_meta" table that can hold basically any key/value pair associated with the user ID. So if you need to find all the meta information for a user, you can just add that to your query. You also won't always have to run the extra query when it isn't needed, for things like logging in. This approach also leaves your schema open to new user features, such as storing their Twitter handle or each individual interest. And you won't have to deal with a maze of associated IDs, because you have one table that rules all the metadata, limited to one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable and means it won't require a complete database overhaul if you need to add a new feature.
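A rough sketch of that key/value layout; the names echo the WordPress idea but are illustrative, not its actual schema:

    -- Core user data everyone has.
    CREATE TABLE users (
        user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(60)  NOT NULL,
        email    VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    -- Arbitrary per-user attributes as key/value pairs.
    CREATE TABLE user_meta (
        user_id    INT UNSIGNED NOT NULL,
        meta_key   VARCHAR(64)  NOT NULL,
        meta_value TEXT,
        PRIMARY KEY (user_id, meta_key)
    ) ENGINE=InnoDB;

    -- Pull the core row plus any metadata in one query when needed.
    SELECT u.username, m.meta_key, m.meta_value
    FROM users u
    LEFT JOIN user_meta m ON m.user_id = u.user_id
    WHERE u.user_id = 42;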
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip User_Details, as it probably has a 1-to-1 relation with Users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
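A minimal sketch of that counter split (table and column names are hypothetical): the hot counters get their own tiny 1:1 table so the constant increments don't touch, or lock, the busy main rows:

    -- Frequently updated counters kept apart from the wide user row.
    CREATE TABLE user_counters (
        user_id INT UNSIGNED    NOT NULL PRIMARY KEY,
        views   BIGINT UNSIGNED NOT NULL DEFAULT 0,
        likes   BIGINT UNSIGNED NOT NULL DEFAULT 0
    ) ENGINE=InnoDB;

    -- Runs constantly without touching (or locking) the main user table.
    UPDATE user_counters SET views = views + 1 WHERE user_id = 42;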
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
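And a quick sketch of that LEFT JOIN pattern, reusing the hypothetical user_counters table above together with a users table that has a username column:

    -- Users without a counters row still show up, with 0 instead of NULL.
    SELECT u.username,
           COALESCE(c.views, 0) AS views,
           COALESCE(c.likes, 0) AS likes
    FROM users u
    LEFT JOIN user_counters c ON c.user_id = u.user_id;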
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on columns in the other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB), which means that fetching them may involve one or more extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.

What's better? Having 100 tables with 1,000 rows each, or 10 tables with 10,000 rows each?

With 10 Tables, I would have no joins. With 100 Tables, I would have one join per query. Which would show better performance?
I wouldn't make a design decision this way without some measured performance data.
The proper way to model a problem is to create normalized tables with indexes that faithfully model the problem domain.
Once you have that, get some performance data for queries that you'll need to run.
If you find that performance isn't acceptable, denormalize as needed.
Your question is too general to make a black-and-white decision.
I think this depends a lot on your DB schema, but 10k rows is not a lot for a table. If you can put an index on the data, do that. I think fewer tables should make your application much simpler.
Also, to state the obvious, joins are more expensive than not-joins, because to compute a join you need to take the cross-product (or whatever it's called) of two tables and then take rows from that. But again, I don't know what your data looks like.
Joins have performance implications. But also, having redundant data is a bad practice. Updating and inserting data would be very taxing in those cases.
Fewer joins means faster SELECT queries. But if you're doing any inserts or updates, you'll most likely pay for it through data anomalies or much more expensive inserts/updates.
If it's just static data you're only going to query, then denormalization could pay off; otherwise you'll probably shoot yourself in the foot.
To start with the right schema: one table with 100,000 rows, if all you have is one logical entity...
Otherwise, analyze your domain and design your schema first and foremost to mirror the logical domain entities it must represent. Then denormalize only to address those performance issues that actually present themselves in load testing (or that, from past experience, you know will present themselves). This approach (starting from the right normalized schema) will make the tuning process itself easier, it will help guarantee that what you end up with contains an optimum blend of normalization and optimizations, and it will ensure that you understand what compromises to normalization have been made for performance. This last point is a good thing because it allows you to more intelligently add the application-level validations needed for those cases where normalization has been compromised and your database is therefore vulnerable to data duplication or inconsistency.
If all you care about is read performance, though, then again, your best choice is just one table with 100,000 rows - and by the way, don't bother using a relational database, there's no point, just store the data in memory.