Is this MySQL table a good candidate for partitioning?

I have a table with ~1.9 million rows and growing consistently. I run some fairly complicated queries against this data. The active data is generally clustered toward the end of the table -- that is, only the most recent n% of the records tend to be accessed on a regular basis, although the rest of the data needs to be available in the same table for the less common cases where people look back at older records.
For those with partitioning experience in MySQL, does this table seem like it would be a good candidate for partitioning? Or is it just too small to get much gain?
Thanks,
Jared
p.s. I looked for a question on stackoverflow to answer this question, but didn't find anything that quite fit.

Check out the article below. The author shows significant gains on a table with only 3 columns and 800K records. As long as you're partitioning on an expression that yields either an integer or NULL, you should see some great performance improvements. I have seen excellent speed gains from date-based partitioning on tables with significantly fewer records but more columns.
Improving Database Performance with Partitioning

Logically, yes, if you typically run queries that need only the most recent 2% of the table, this would be a great candidate for partitioning.
The biggest barrier to using MySQL partitioning is that the column you use as the partitioning key must be part of the primary key and of every other unique key. In practice, this makes some tables impossible to partition.
If this blocks you from partitioning the table, the fallback plan is to partition "manually." That is, make two real tables with identical structure. Every week (or whatever schedule you want), run a batch job to migrate the older data to the second table. You can always make a VIEW which is a UNION of the two tables, in case you need to run occasional table-scans.
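A minimal sketch of that manual approach, with assumed table and column names (orders, created_at) purely for illustration:

    -- Archive copy with the same structure as the active table.
    CREATE TABLE orders_archive LIKE orders;

    -- Weekly batch job: move rows older than 90 days into the archive copy.
    INSERT INTO orders_archive
    SELECT * FROM orders WHERE created_at < NOW() - INTERVAL 90 DAY;

    DELETE FROM orders WHERE created_at < NOW() - INTERVAL 90 DAY;

    -- Optional view for the occasional query that needs old and new rows together.
    CREATE VIEW orders_all AS
    SELECT * FROM orders
    UNION ALL
    SELECT * FROM orders_archive;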

As a rough rule, the table should be larger than about 5 GB before partitioning is worth it.
If you do partition, go for RANGE partitioning (monthly or yearly).
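As an illustration only (table and column names here are assumptions, not from the question), yearly RANGE partitioning looks like this:

    -- Yearly RANGE partitions; the partitioning column must be part of the primary key.
    CREATE TABLE sales (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        sale_date DATE NOT NULL,
        amount    DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (id, sale_date)
    )
    PARTITION BY RANGE (YEAR(sale_date)) (
        PARTITION p2021 VALUES LESS THAN (2022),
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );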


How will partitioning affect my current queries in MySQL? When is it time to partition my tables?

I have a table that contains 1.5 million rows, has 39 columns, contains sales data of around 2 years, and grows every day.
I had no problems with it until we moved it to a new server; we probably have less memory now.
Queries are currently taking a very long time. Someone suggested partitioning the large table that is causing most of the performance issues, but I have a few questions.
1) Is it wise to partition the table I described, and is it likely to improve its performance?
2) If I do partition it, will I have to make changes to my current INSERT or SELECT statements, or will they continue working the same way?
3) Does the partitioning take a long time to perform? I worry that with the slow performance, something would happen midway through and I would lose the data.
4) Should I be partitioning it by years or by months? (We usually look at the numbers within the month, but sometimes we take weeks or years.) And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later.)
(I agree with Bill's answer; I will approach the Question in a different way.)
When is it time to partition my tables?
Probably never.
is it likely to improve its performance?
It is more likely to decrease performance a little.
I have a table that contains 1.5 million rows
Not big enough to bother with partitioning.
Queries are currently taking a very long time
Usually that is due to the lack of a good index, probably a 'composite' one. Second is the formulation of the query. Please show us a slow query, together with SHOW CREATE TABLE.
data of around 2 years, and grows every day
Will you eventually purge "old" data? If so, PARTITION BY RANGE(TO_DAYS(..)) is an excellent idea. However, it helps only during the purge, because DROP PARTITION is a lot faster than DELETE.
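A hedged sketch of that pattern, with assumed table and column names (sales, sale_date); note that sale_date must be part of the primary key:

    -- Monthly partitions based on TO_DAYS() of the date column.
    ALTER TABLE sales
    PARTITION BY RANGE (TO_DAYS(sale_date)) (
        PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
        PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Purging a whole month is a fast metadata operation, unlike a bulk DELETE.
    ALTER TABLE sales DROP PARTITION p2023_01;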
we probably have less memory now.
If you are mostly looking at "recent" data, then the size of memory (cf innodb_buffer_pool_size) may not matter. This is due to caching. However, it sounds like you are doing table scans, perhaps unnecessarily.
will I have to make changes to my current INSERT or SELECT
No. But you probably need to change what column(s) are in the PRIMARY KEY and secondary key(s).
Does the partition take a long time to perform?
Slow, yes, because it will copy the entire table over. Note that this requires extra disk space during the copy, and the partitioned table itself will take more disk.
something would happen midway through and I would lose the data.
Do not worry. The new table is created, then a very quick RENAME TABLE swaps it into place.
Should I be partitioning it by years or months?
Rule of thumb: aim for about 50 partitions. With "2 years and growing", a likely choice is "monthly".
we usually look at the numbers within the month, but sometimes we take weeks or years
Smells like a typical "Data Warehouse" dataset? Build and incrementally augment a "Summary table" with daily stats. With that table, you can quickly get weekly/monthly/yearly stats -- possibly 10 times as fast. Ditto for any date range. This also significantly helps with "low memory".
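A minimal sketch of such a summary table, with assumed column names (the real ones would come from SHOW CREATE TABLE):

    -- One row per day (optionally per additional dimension such as product_id).
    CREATE TABLE sales_daily_summary (
        sale_day     DATE NOT NULL,
        product_id   INT UNSIGNED NOT NULL,
        order_count  INT UNSIGNED NOT NULL,
        total_amount DECIMAL(14,2) NOT NULL,
        PRIMARY KEY (sale_day, product_id)
    );

    -- Incremental nightly augmentation: summarize only yesterday's rows.
    INSERT INTO sales_daily_summary (sale_day, product_id, order_count, total_amount)
    SELECT DATE(sale_date), product_id, COUNT(*), SUM(amount)
    FROM sales
    WHERE sale_date >= CURDATE() - INTERVAL 1 DAY
      AND sale_date <  CURDATE()
    GROUP BY DATE(sale_date), product_id;

Weekly, monthly, or yearly numbers then come from summing this much smaller table.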
And should I also partition the columns? (We have some columns that we rarely or never use, but we might want to use them later)
You should 'never' use SELECT *; instead, specify the columns you actually need. "Vertical partitioning" is the term for your suggestion. It is sometimes practical. But we need to see SHOW CREATE TABLE with realistic column names to discuss further.
More on partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
In most circumstances, you're better off using indexes instead of partitioning as your main method of query optimization.
The first thing you should learn about partitioning in MySQL is this rule:
All columns used in the partitioning expression for a partitioned table must be part of every unique key that the table may have.
Read more about this rule here: Partitioning Keys, Primary Keys, and Unique Keys.
This rule makes many tables ineligible for partitioning, because you might want to partition by a column that is not part of the primary or unique key in that table.
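For illustration (a sketch with assumed names, not from the original post): a table partitioned by created_at must carry that column in its primary key, or MySQL rejects the table definition.

    -- The primary key includes the partitioning column, as the rule requires.
    CREATE TABLE events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATETIME NOT NULL,
        payload    TEXT,
        PRIMARY KEY (id, created_at)
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );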
The second thing to know is that partitioning only helps queries whose conditions unambiguously let the optimizer infer which partitions hold the data you're interested in. This is called Partition Pruning. If you run a query that could find data in any or all partitions, MySQL must search all the partitions, and you gain no performance benefit compared to having a regular non-partitioned table.
For example, if you partition by date, but then you run a query for data related to a specific user account, it would have to search all your partitions.
In fact, it might even be a little bit slower to use partitioned tables in such a query, because MySQL has to search each partition serially.
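One way to check whether pruning happens (a sketch with assumed names): EXPLAIN output includes a partitions column listing which partitions will actually be read (older MySQL versions use EXPLAIN PARTITIONS).

    -- Prunes: the optimizer can map the date range to specific partitions.
    EXPLAIN SELECT SUM(amount)
    FROM sales
    WHERE sale_date >= '2023-02-01' AND sale_date < '2023-03-01';

    -- Does not prune: the predicate says nothing about the partitioning column,
    -- so every partition must be searched.
    EXPLAIN SELECT COUNT(*) FROM sales WHERE customer_id = 42;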
You asked how long it would take to partition the table. Converting to a partitioned table requires an ALTER TABLE to restructure the data, so it takes about the same time as any other alteration that copies the data to a new tablespace. This is proportional to the size of the table, but varies a lot depending on your server's performance. You'll just have to test it out; there's no way we can estimate how long it will take on your server.

MySQL: what if there is too much data in a table?

Data is increasing in one table every day, which might lower performance. I was thinking of creating a trigger that moves table A to A1 and creates a new, empty table A periodically, so that inserts and updates on table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, if we insert or update 1000 rows per second in table A, what will the performance be like after 3 years?)
We are designing software for a factory. There are product lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (The 1000 rows per second above was an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
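A hedged sketch of such a "covering" index, with assumed table and column names for the PCB use case: if the index contains every column the query touches, InnoDB can answer the query from the index alone.

    -- The index covers (line_id, test_time, defect_flag), so no row lookups are needed.
    CREATE INDEX idx_line_time_defect
        ON pcb_records (line_id, test_time, defect_flag);

    -- Assumed query: daily defect counts for one production line.
    SELECT DATE(test_time) AS d, SUM(defect_flag) AS defects
    FROM pcb_records
    WHERE line_id = 7
      AND test_time >= '2023-01-01'
    GROUP BY DATE(test_time);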
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
Very large data volumes can decrease server performance. One way to handle this:
1) Create another table to store the archived (old) data using the ARCHIVE storage engine ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html ).
2) Create a MySQL job/scheduler to move older records to the archive table, scheduled for a time slot when the server is mostly idle.
3) After moving the older records to the archive table, re-index the original table.
This will keep performance up; a rough sketch follows.
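A hedged sketch of steps 1) and 2), with assumed table and column names (pcb_records, test_time):

    -- 1) Archive copy; the ARCHIVE engine compresses rows but supports only
    --    INSERT and SELECT (no UPDATE/DELETE, no secondary indexes).
    CREATE TABLE pcb_records_archive (
        id        BIGINT UNSIGNED NOT NULL,
        line_id   INT UNSIGNED NOT NULL,
        test_time DATETIME NOT NULL,
        result    VARCHAR(32)
    ) ENGINE=ARCHIVE;

    -- 2) Run from a nightly job (cron or a MySQL EVENT) during idle hours.
    INSERT INTO pcb_records_archive
    SELECT id, line_id, test_time, result
    FROM pcb_records
    WHERE test_time < NOW() - INTERVAL 1 YEAR;

    DELETE FROM pcb_records
    WHERE test_time < NOW() - INTERVAL 1 YEAR;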
It is unlikely that a table taking 1000 inserts per second performs so poorly that copying it aside every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does add MySQL overhead, especially at row-insert time. Assuming there are more reads than inserts, this is an advantage, because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, and delete old ones if and when you decide they are no longer relevant or you need the disk space. It's even safer that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

Optimizing Innodb table indexes with GUID/UUID keys

I have an InnoDB based schema with roughly 100 tables, most use GUID/UUID's as the primary key. I started this at a point in time where I didn't really understand the implications of a UUID PK with regard to Disk IO and fragmentation, but wanted the benefits of avoiding a single key dispenser when dealing with server clusters. We're not currently dealing with large numbers of rows, but we will be (in the hundreds of millions) and I would like to be prepared for that.
Now that I understand indexing in InnoDB better, specifically the clustered nature of the primary key, I can see that my UUID's are a poor choice for scalability from a DISK IO perspective, but I don't want to stop using them due to the server clustering requirement.
The accepted/recommended solution seems to be a mix of an auto-increment PK (INT or BIGINT) with a UNIQUE-indexed UUID column. My intention is to add a new first column ai_col to each table and assign it as the new PK; I'm taking cues from:
http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html
I would then update/recreate a new "UNIQUE" index on my UUID keys and continue to use them in our application layer.
My expectation is that once this is done, I can essentially ignore ai_col and everything else runs business as usual. InnoDB will have a relatively small integer PK to cluster on, and that PK is what gets appended to the other secondary indexes.
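A hedged sketch of that change for one table, with assumed names (a customers table and a CHAR(36) uuid column):

    -- Swap the clustered key to a small auto-increment column,
    -- keeping the old UUID column unique so the application layer is unchanged.
    ALTER TABLE customers
        DROP PRIMARY KEY,
        ADD COLUMN ai_col BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
        ADD PRIMARY KEY (ai_col),
        ADD UNIQUE KEY uk_customers_uuid (uuid);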
Question 1: Am I correct in assuming that in this new scenario, I can have my cake and eat it too?
The follow up question is with regard to smaller 'associational' tables, i.e. Only two columns, both Foreign Keys to other tables joining them implicitly. In these cases I have typically two indexes, one being a UNIQUE two column index with the more heavily used column first, then a second single index on the other column. I know that this is essentially 2.5x as large as the actual row data, but it seems to really help our more complex queries during optimization, and is on smaller tables so relatively acceptable.
Most of these associational tables will only be a fraction the number of records in the primary tables because they're typically more specific, however, there are a few cases where these have many multiples the number of records as their foreign parents, i.e. potentially billions.
Question 2: Is it a good idea to add the numeric PK's to these tables as well? I'm guessing that the answer will be something along the lines of "Benchtest it" but I'm just looking for helpful nuggets of wisdom.
If I've obviously mis-interpreted anything or you can offer insights that I may not be considering, I'd really appreciate that too!
Many thanks!
EDIT: As promised in the answer, I just wanted to follow up for anyone interested... This solution has worked famously :) Read and write performance increased across the board, and so far it's been tested up to about 6 billion i/o's / month, without breaking a sweat.
Without any other suggestions, confirmations, or otherwise, I've begun testing on our dev server with a number of less-used tables, but ones that would nonetheless be affected if the new AI-based IDs were going to affect our application layer.
So far it's looking good: indexes are performing as expected, and the new table fields haven't required any changes to our application layer; we've basically been able to ignore them.
I haven't run any thorough bench testing though to test the actual Disk IO under heavy load but from the sheer amount of information out there on the subject, I can surmise that we're in good shape for scaling up.
Once this has been in place for a while I'll drop in a follow up in case anyone's in the same boat we were.

How to structure an extremely large table

This is more a conceptual question. It's inspired by using an extremely large table where even a simple, properly indexed query takes a long time. I was wondering if there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records in a table that grows by something like 10,000 rows per day. A table like that would add another 10,000,000 records every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit: the data is now split across multiple conceptual tables, so any query that includes the partition key can automatically ignore any partition the key could not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level: backup, restore, indexing, etc. This helps reduce downtime and allows far faster archiving by simply removing an entire partition at a time.
There are also non-relational storage mechanisms such as NoSQL stores and map-reduce, but ultimately how the data is used, loaded, and archived becomes the driving factor in deciding which structure to use.
10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but MySQL does support partitioning -- see Chapter 17, Partitioning, in its manual.
There is a good scalability approach for tables like this. A union is a workable way, but there is a better one.
If your database engine supports semantic partitioning, you can split one table into partitions, each covering some subrange (say, one partition per year). It does not affect anything in SQL syntax except the DDL, and the engine transparently runs the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is the limit of a union. But you will never need the keyword "union" in your queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view over the archived and current data combined, keeping only the active data in the table that most functions reference. You will have to have a good (automated) archiving strategy, though, or you can lose data or move it inefficiently. This approach is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because they're what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definitions will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
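For instance, a hedged sketch under assumed table and column names: if logins always filter on username and a stored password hash, a composite index on those two columns lets MySQL resolve the lookup directly from the index.

    CREATE TABLE users (
        user_id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username      VARCHAR(64)  NOT NULL,
        password_hash CHAR(64)     NOT NULL
    );

    -- Columns searched together belong in one composite index.
    CREATE INDEX idx_users_username_pwd ON users (username, password_hash);

    -- This lookup is satisfied by the composite index:
    SELECT user_id FROM users
    WHERE username = 'jsmith' AND password_hash = 'hypothetical-hash-value';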
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k - 100k rows, which roughly corresponds to the point where the RDBMS is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional data points: the approximate size of a row, and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying query or the table design.