Is there an advantage or disadvantage when I split big tables into multiple smaller tables when using InnoDB & MySQL?
I'm not talking about splitting the actual InnoDB file, of course; I'm just wondering what happens when I use multiple tables.
Circumstances:
I have a REAL big table with millions of rows (items), they are categorized (column "category").
Now, I'm thinking about using a separate table for each category instead.
I can guarantee that I will never need the data across multiple tables under any circumstances.
Generally speaking, if your tables have no relevance to each other they should be in separate tables rather than one catch-all table.
However, if they are related they should really reside in one table. You can manage the performance of large tables in a number of ways. I suggest you have a look at partitioning the table if it grows so large that it starts to cause problems.
However, millions of rows isn't a "REAL big table" as you say; we have many tables with tens of millions of rows, and even a few with hundreds of millions, and they perform just fine thanks to a mixture of clever indexing, partitioning and read replicas.
Edit 1 - In response to comments:
Creating dynamic tables on each of your keys in a key-value pair is, as you rightly say, unusual, ugly and just very wrong; you're defeating the relational part of an RDBMS.
It is impossible for me to be specific with the following, as your schema and detailed information about what you want to achieve are still lacking from this question; however, I feel I grasp enough to edit my original answer.
There is a huge difference between partitioning a table within the same database and creating a new table in another database. You ask about performance; generally speaking they should perform the same (that is, a new table in a 1000 GB database and one in an empty database), providing you have enough resources such as memory for indexing and IO on the underlying data storage, and there is no bottleneck.
I can’t really work out why you would want to create dynamic tables
("table_{category}"), or store the value / category in a text file. This really sounds like you need a 1-N relationship and a JOIN.
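To make the 1-N-plus-JOIN suggestion concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for MySQL/InnoDB (the SQL itself carries over with minor syntax changes); the table and column names are hypothetical:

```python
import sqlite3

# One items table with an indexed category column, instead of a
# dynamically created "table_{category}" per category.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_items_category ON items(category)")

conn.executemany("INSERT INTO items (category, name) VALUES (?, ?)",
                 [("books", "SQL Primer"), ("books", "InnoDB Internals"),
                  ("tools", "Wrench")])

# One query serves every category; the index lets the engine touch only
# the matching rows, which is what the per-category tables were after.
rows = conn.execute(
    "SELECT name FROM items WHERE category = ?", ("books",)).fetchall()
print(rows)  # [('SQL Primer',), ('InnoDB Internals',)]
```

With an index on `category`, a lookup for one category does roughly the work a dedicated per-category table would, without any dynamic DDL.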
Related
Say I have lots of time to waste and decide to make a database where information is not stored as entities but in separate inter-related tables representing INT,VARCHAR,DATE,TEXT, etc types.
It would be such a revolution to never have to design a database structure ever again, except that the fact that no-one else has done it probably indicates it's not a good idea :p
So why is this a bad design ? What principles is this going against ? What issues could it cause from a practical point of view with a relational database ?
P.S: This is for the learning exercise.
Why shouldn't you separate out the fields from your tables based on their data types? Well, there are two reasons, one philosophical, and one practical.
Philosophically, you're breaking normalization
A properly normalized database will have different tables for different THINGS, with each table having all fields necessary and unique for that specific "thing." If the only way to find the make, model, color, mileage, manufacture date, and purchase date of a given car in my CarCollectionDatabase is to join meaningless keys on three tables demarcated by data type, then my database has almost zero discoverability and no real cohesion.
If you designed a database like that, you'd find writing queries and debugging statements would be obnoxiously tiresome. Which is kind of the reason you'd use a relational database in the first place.
(And, really, that will make writing queries WAY harder.)
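As a small illustration of just how awkward those queries get, here is a sketch of the tables-per-data-type design using `sqlite3` (all table and field names are hypothetical): even fetching two attributes of one car already needs a join on meaningless keys.

```python
import sqlite3

# The anti-design: every VARCHAR field of every entity in one table,
# every INT field in another, keyed only by entity id and field name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE varchar_vals (entity_id INT, field TEXT, val TEXT)")
conn.execute("CREATE TABLE int_vals (entity_id INT, field TEXT, val INT)")
conn.executemany("INSERT INTO varchar_vals VALUES (?, ?, ?)",
                 [(1, "make", "Ford"), (1, "color", "red")])
conn.execute("INSERT INTO int_vals VALUES (1, 'mileage', 42000)")

# Just make + mileage for car #1 takes a two-table join; every extra
# attribute means another join or self-join on these generic tables.
row = conn.execute("""
    SELECT v.val, i.val
    FROM varchar_vals v
    JOIN int_vals i ON i.entity_id = v.entity_id
    WHERE v.field = 'make' AND i.field = 'mileage' AND v.entity_id = 1
""").fetchone()
print(row)  # ('Ford', 42000)
```

In a normalized schema the same lookup is a single-table `SELECT make, mileage FROM cars WHERE id = 1`.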
Practically, databases don't work that way.
Every database engine or data-storage mechanism I've ever seen is simply not meant to be used with that level of abstraction. Whatever engine you had, I don't know how you'd get around essentially doubling your data design, with a row per field. And with a five-fold increase in row count, you'd have a massive increase in index size, to the point that once you got a few million rows your indexes wouldn't actually help.
If you tried to design a database like that, you'd find that even if you didn't mind the headache, you'd wind up with slower performance. Instead of 1,000,000 rows with 20 fields, you'd have that one table with just as many fields, and some 5-6 extra tables with 1,000,000+ entries each. And even if you optimized that away, your indexes would be larger, and larger indexes run slower.
Of course, those two ONLY apply if you're actually talking about databases. There's no reason, for example, that an application can't serialize to a text file of some sort (JSON, XML, etc.) and never write to a database.
And just because your application needs to store SQL data doesn't mean that you need to store everything, or can't use homogeneous and generic tables. An Access-like application that lets users define their own "tables" might very well keep each field on a distinct row... although in that case your database's THINGS would be those tables and their fields. (And it wouldn't run as fast as a natively written database.)
I will have a table with a few million entries and I have been wondering if it was smarter to create more than just this one table, even though they would all have the same structure? Would it save resources and would it be more efficient in the end?
This is my particular concern, because I plan on creating a small search engine which indexes about 3,000,000 sites, and each site will have approximately 30 words that are being indexed. This is my structure right now:
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
Should I keep this structure? Or should I create tables for A words, B words, C words etc? Same with the appearances table
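The three-table structure above can be sketched end to end; this is a minimal `sqlite3` version (standing in for MySQL, with hypothetical URLs and words), including the composite index that makes per-word lookups cheap without any per-letter tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE appearances (site_id INT, word_id INT, score INT);
    CREATE INDEX idx_app_word ON appearances(word_id, score);
""")
conn.execute("INSERT INTO site VALUES (1, 'example.com'), (2, 'example.org')")
conn.execute("INSERT INTO word VALUES (1, 'mysql')")
conn.execute("INSERT INTO appearances VALUES (1, 1, 10), (2, 1, 7)")

# Top-scoring sites for one word: the (word_id, score) index narrows the
# appearances table to just the matching rows, regardless of total size.
rows = conn.execute("""
    SELECT s.url
    FROM appearances a
    JOIN site s ON s.id = a.site_id
    JOIN word w ON w.id = a.word_id
    WHERE w.word = 'mysql'
    ORDER BY a.score DESC
""").fetchall()
print(rows)  # [('example.com',), ('example.org',)]
```

Splitting `word` into A-words, B-words, etc. would only re-implement, by hand, the first character of an index the engine already maintains.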
Select queries are faster on smaller tables. You want the indexes you sort on to fit into your system's memory for better performance.
More importantly, tables should not be defined in order to hold a certain type of data, but rather a collection of associated data. So if the data you are storing has logical differences, it should perhaps be broken into separate tables.
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
If you aren't concerned with some latency on your database, it should be able to handle this on the order of a few million records without too much trouble.
Here's some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least simulate enough data that's reasonably similar to the genuine data to be structurally the same. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data you're just stabbing in the dark.
I was wondering if using 2 tables is better than using 1 single table.
Scenario:
I have a simple user table and a simple user_details table. I can JOIN the tables and select both records.
But I was wondering whether to merge the 2 tables into 1 single table.
What if I have 2 million user records in both tables?
In terms of speed and execution time, is it better to have a single table when selecting records?
You should easily be able to make either scenario perform well with proper indexing. Two million rows is not that many for any modern RDBMS.
However, one table is a better design if rows in the two tables represent the same logical entity. If the user table has a 1:1 relationship with the user_detail table, you should (probably) combine them.
Edit: A few other answers have mentioned de-normalizing--this assumes the relationship between the tables is 1:n (I read your question to mean the relationship was 1:1). If the relationship is indeed 1:n, you absolutely want to keep them as two tables.
Joins themselves are not inherently bad; RDBMS are designed to perform joins very efficiently—even with millions or hundreds of millions of records. Normalize first before you start to de-normalize, especially if you're new to DB design. You may ultimately end up incurring more overhead maintaining a de-normalized database than you would to use the appropriate joins.
As to your specific question, it's very difficult to advise because we don't really know what's in the tables. I'll throw out some scenarios, and if one matches yours, then great, otherwise, please give us more details.
If there is, and will always be a one-to-one relationship between user and user_details, then user details likely contains attributes of the same entity and you can consider combining them.
If the relationship is 1-to-1 and the user_details contains LOTS of data for each user that you don't routinely need when querying, it may be faster to keep that in a separate table. I've seen this often as an optimization to reduce the cost of table scans.
If the relationship is 1-to-many, I'd strongly advise against combining them; you'll soon wish you hadn't (as will those who come after you).
If the schema of user_details changes, I've seen this too where there is a core table and an additional attribute table with variable schema. If this is the case, proceed with caution.
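The second scenario above (1-to-1, with bulky rarely-needed data kept aside) can be sketched briefly; this uses `sqlite3` in place of MySQL, with hypothetical table and column names:

```python
import sqlite3

# Bulky, rarely-needed columns split into a 1:1 user_details table,
# fetched only when actually required.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE user_details (user_id INTEGER PRIMARY KEY, bio TEXT)")
conn.execute("INSERT INTO user VALUES (1, 'alice'), (2, 'bob')")
conn.execute("INSERT INTO user_details VALUES (1, 'a very long biography ...')")

# Routine queries scan only the narrow table:
names = conn.execute("SELECT name FROM user ORDER BY id").fetchall()

# The heavy data is one LEFT JOIN away when needed; users without a
# details row simply come back with NULL.
row = conn.execute("""
    SELECT u.name, d.bio
    FROM user u
    LEFT JOIN user_details d ON d.user_id = u.id
    WHERE u.id = 2
""").fetchone()
print(names, row)  # [('alice',), ('bob',)] ('bob', None)
```

Keeping the hot path on the narrow table is the whole point of that optimization: most queries never pay for the wide columns.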
To denormalize or not to denormalize, that is the question...
There is no simple, one-size-fits all response to this question. It is a case by case decision.
In this instance, it appears that there is exactly one user_detail record per record in the user table (or possibly either 1 or 0 detail records per user record), so shy of subtle caching concerns, there is really little to no penalty for "denormalizing". (Indeed, in the 1:1 cardinality case, this would effectively be a normalization.)
The difficulty in giving a "definitive" recommendation depends on many factors. In particular (format: I provide a list of questions/parameters to consider and general considerations relevant to these):
what is the frequency of UPDATEs/ DELETEs / INSERTs ?
what is the ratio of reads (SELECTs) vs. writes (UPDATEs, DELETEs, INSERTs) ?
Do the SELECTs usually get all the rows from all the tables, or do we only get a few rows and, more often than not, only select from one table at a given time?
If there is relatively little writing compared with reading, it would be possible to create many indexes, some covering the most common queries, hence logically re-creating, in a more flexible fashion, the two (indeed multiple) table setup. The downside of too many covering indexes could of course be occupying too much disk space (not a big issue these days), but also possibly impeding (to some extent) the cache. Also, too many indexes may put undue burden on write operations...
what is the size of a user record? what is the size of a user_detail record?
what is the typical filtering done by a given query? Do the most common queries return only a few rows, or do they yield several thousand records (or more), most of the time?
If either record's average size is "unusually" long, say above 400 bytes, a multi-table approach may be appropriate. After all, and somewhat depending on the type of filtering done by the queries, JOIN operations are typically performed very efficiently by MySQL, and there is therefore little penalty in keeping separate tables.
is the cardinality effectively 1:1 or 1:[0,1] ?
If that isn't the case, i.e. if we have user records with more than one user_details, then given the relatively small number of records (2 million) (yes, 2M is small, not tiny, but small, in modern DBMS contexts), denormalization would probably be a bad idea. (A possible exception is cases where we query, several dozen times per second, the same 4 or 5 fields, some from the user table, some from the user_detail table.)
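The covering-index idea mentioned above is easy to see in miniature; in this `sqlite3` sketch (hypothetical table and index names, SQLite standing in for MySQL) the query plan confirms the query is answered from the index alone, without touching the table rows:

```python
import sqlite3

# An index containing every column the query needs (filter + output)
# can answer the query by itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_city_email ON user(city, email)")
conn.execute("INSERT INTO user VALUES (1, 'a@x.com', 'Oslo'), (2, 'b@x.com', 'Bergen')")

# The query filters on city and returns email; both live in the index,
# so the plan reports a covering-index search. MySQL's EXPLAIN shows
# the equivalent as "Using index" in the Extra column.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT email FROM user WHERE city = 'Oslo'
""").fetchall()
print(plan[0][-1])
```

This is the sense in which a set of covering indexes logically re-creates the multiple-table setup: each index is, in effect, a narrow sorted copy of the columns one family of queries needs.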
Bottom lines:
2 million records is relatively small ==> favor selecting a schema that is driven by the semantics of the records/sub-records rather than addressing performance concerns prematurely. If real performance bottlenecks do appear, they are probably neither caused by, nor likely to be greatly helped by, schema changes.
if 1:1 or 1:[0-1] cardinality, re-uniting the data in a single table is probably a neutral choice, performance wise.
if 1:many cardinality, denormalization ideas are probably premature (again given the "small" database size)
read about SQL optimization, the pros and cons of indexes of various types, and ways of limiting the size of the data while allowing the same fields/semantics to be recorded.
establish baselines, and monitor the performance frequently.
Denormalization will generally use up more space while affording better query performance.
Be careful though - cache also matters, and having more data effectively "shrinks" your cache! This may or may not wipe out the theoretical performance benefit of merging two tables into one. As always, benchmark with representative data.
Of course, the more denormalized your data model is, the harder it will be to enforce data consistency. Performance does not matter if data is incorrect!
So, the answer to your question is: "it depends" ;)
The current trend is to denormalize (i.e. put them in the same table). It usually gives better performance, but makes it easier to end up inconsistent (through programming mistakes, that is).
Plan: determine your workload type.
Benchmark: See if the performance gain worth the risk.
I had a 'large' MySQL table that originally contained ~100 columns and I ended up splitting it up into 5 individual tables and then joining them back up with CodeIgniter Active Record...
From a performance point of view, is it better to keep the original table with 100 columns or keep it split up?
Each table has around 200 rows.
200 rows? That's nothing.
I would split the table if the new ones combined columns in a way that was meaningful for your problem. I would do it with an eye towards normalization.
You sound like you're splitting them to meet some unstated criteria for "goodness" or because your current performance is unacceptable. Do you have some data that suggests a performance problem that is caused by your schema? If not, I'd recommend rethinking this approach.
No one can say what the impact on performance will be. More JOINs may be slower when you query, but you don't say what your use cases are.
So you've already made the change and now you're asking if we know which version of your schema goes faster?
(if the answer is the split tables, then you're doing something wrong).
Not only should the consolidated table be faster, it should also require less code and therefore be less likely to have bugs.
You've not provided any information about the structure of your data.
And with 200 rows in your database, performance is the last thing you need to worry about.
The concept you're referring to is called vertical partitioning and it can have surprising effects on performance. On a Mysql.com Performance Post they discuss this in particular. An excerpt from the article:
Although you have to do vertical partitioning manually, you can benefit from the practice in certain circumstances. For example, let's say you didn't normally need to reference or use the VARCHAR column defined in our previously shown partitioned table.
The important thing is: you can (and it's good style!) move columns containing temporary data into a separate table. You can move optional columns into a separate table (this depends on the logic).
When you are making a database, the most important thing is that each table should encapsulate a single entity. You should rather create more tables, separating different entities into different tables. The only exception is when you have to optimize your software because the 'straight' logical solution works slowly.
If you deal with some very complicated model, you should divide it into few simple blocks with simple relations - this works with database design as well.
As for performance, of course one table should give better performance, since you would not need any joins or keys to access all the data. Fewer relations, fewer lags.
I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to be really multiple values in future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table-scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine all of them into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page on database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: a users table with the basic user information that everyone has, plus a "user_meta" table that can hold basically any key/value pair associated with the user id. If you need all the meta information for a user, you just add that to your query; and you don't have to pay for the extra query when it isn't needed, e.g. for logging in. This approach also leaves your schema open to adding new features to your users, such as storing their twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, limited to one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, making your project more scalable; adding a new feature will not require a complete database overhaul.
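A minimal sketch of that key/value layout, using `sqlite3` in place of MySQL (the table names echo WordPress's convention, the data is hypothetical):

```python
import sqlite3

# A narrow users table plus one user_meta key/value table, instead of a
# new column (or new table) per feature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("""CREATE TABLE user_meta (
    user_id INT, meta_key TEXT, meta_value TEXT,
    PRIMARY KEY (user_id, meta_key))""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.executemany("INSERT INTO user_meta VALUES (1, ?, ?)",
                 [("twitter", "@alice"), ("interest", "databases")])

# A plugin-added attribute is one INSERT plus one lookup; the users
# table and the login path never change.
meta = dict(conn.execute(
    "SELECT meta_key, meta_value FROM user_meta WHERE user_id = 1"))
print(meta)  # {'twitter': '@alice', 'interest': 'databases'}
```

The trade-off, as other answers here note, is that everything is stored as untyped text and the database can no longer enforce per-attribute constraints, so this fits genuinely dynamic attributes, not core fields.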
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1:1 relation with users.
But the rest probably hold a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc.) are one of the very few valid cases for a 1:1 relationship, for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity, and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
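A brief sketch of that "1:rarely" pattern, using `sqlite3` in place of MySQL with hypothetical table names:

```python
import sqlite3

# Optional counters live in a side table; most posts have no row there.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE post_counters (post_id INTEGER PRIMARY KEY, likes INT)")
conn.execute("INSERT INTO post VALUES (1, 'hello'), (2, 'world')")
conn.execute("INSERT INTO post_counters VALUES (1, 5)")  # post 2 has no row

# LEFT JOIN pulls the counters in only when asked for, and COALESCE()
# turns the NULL of a missing row into 0.
rows = conn.execute("""
    SELECT p.title, COALESCE(c.likes, 0) AS likes
    FROM post p
    LEFT JOIN post_counters c ON c.post_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)  # [('hello', 5), ('world', 0)]
```

Queries that don't mention the counters never touch the side table at all, which is exactly what keeps the hot UPDATE traffic isolated.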
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table but then have to work harder to continue filtering on columns in the other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB), which means the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.