Can someone tell me the pros and cons of HASH PARTITION vs RANGE PARTITION on a DATETIME column?
Let's say we have a POS table with 20 million records and want to create partitions based on the transaction date's year, like
PARTITION BY HASH(YEAR(TRANSACTION_DATE)) PARTITIONS 4;
or
PARTITION BY RANGE(YEAR(TRANSACTION_DATE)) (
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p4 VALUES LESS THAN MAXVALUE
);
to improve performance of queries with TRANSACTION_DATE BETWEEN '2013-03-01' AND '2013-09-29'
Which one is better, and why?
There are some significant differences. If you have a where clause that refers to a range of years, such as:
where year(transaction_date) between 2009 and 2011
then I don't think the hash partitioning will recognize this as hitting just one, two, or three partitions. The range partitioning should recognize this, reducing the I/O for such a query.
The more important difference has to do with managing the data. With range partitioning, once a partition has been created -- and the year has passed -- presumably the partition will not be touched again. That means that you only have to back up one partition, the current partition. And, next year, you'll only need to back up one partition.
A similar situation arises if you want to move data offline. Dropping a partition containing the oldest year of data is pretty easy, compared to deleting the rows one-by-one.
When the number of partitions is only four, these considerations may not make much of a difference. The key idea is that range partitioning assigns each row to a known partition. Hash partitioning assigns each row to a partition, but you don't know exactly which one.
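To make the "known partition" point concrete, here is a small Python model of the two schemes from the question. It is only a sketch: it assumes HASH partitioning places a row in partition MOD(expr, N), which is how MySQL documents it, and the helper names are made up for illustration.

```python
from bisect import bisect_right

# Exclusive upper bounds copied from the RANGE definition in the question;
# the last partition (p4) is the MAXVALUE catch-all.
RANGE_BOUNDS = [2010, 2012, 2013]
RANGE_NAMES = ["p0", "p1", "p2", "p4"]

def range_partition(year):
    """Name of the RANGE partition that stores rows from the given year."""
    return RANGE_NAMES[bisect_right(RANGE_BOUNDS, year)]

def hash_partition(year, partitions=4):
    """Partition number under HASH(YEAR(...)) PARTITIONS 4, modeled as MOD(year, 4)."""
    return year % partitions

for year in range(2009, 2014):
    print(year, range_partition(year), hash_partition(year))
```

With RANGE, consecutive years land in consecutive partitions, so a range of years maps to a contiguous set of partitions. With HASH, 2009 and 2013 end up in the same partition and neighbouring years are scattered, so there is no simple rule the optimizer could use for a BETWEEN.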
EDIT:
The particular optimization that reduces the reading of partitions is called "partition pruning". MySQL documents this pretty well here. In particular:
For tables that are partitioned by HASH or KEY, partition pruning is
also possible in cases in which the WHERE clause uses a simple =
relation against a column used in the partitioning expression.
It would appear that partition pruning for inequalities (and even in) requires range partitioning.
I'm using MySQL 5.7 Percona.
My current design uses naive day-by-day partitioning, which adds a new partition for the next time period on a regular basis.
CREATE TABLE `foo` (
...
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC
PARTITION BY RANGE (UNIX_TIMESTAMP(`created_at`)) (
PARTITION `foo_1640995200` VALUES LESS THAN (1640995200) ENGINE = InnoDB, # 2022-01-01 00:00:00
PARTITION `foo_1641081600` VALUES LESS THAN (1641081600) ENGINE = InnoDB, # 2022-01-02 00:00:00
PARTITION `foo_1641168000` VALUES LESS THAN (1641168000) ENGINE = InnoDB # 2022-01-03 00:00:00
);
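For reference, the boundaries above can be generated rather than typed by hand. This is a minimal Python sketch, assuming the boundaries are midnight UTC (which matches the values in the comments above); day_partition_clause is a hypothetical helper, not part of any library.

```python
from datetime import date, datetime, timezone

def day_partition_clause(day, table="foo"):
    """Build the PARTITION clause whose exclusive upper bound is midnight UTC of `day`."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    bound = int(midnight.timestamp())
    return f"PARTITION `{table}_{bound}` VALUES LESS THAN ({bound})"

for d in (date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 3)):
    print(day_partition_clause(d))
```

If the server's session time zone is not UTC, UNIX_TIMESTAMP() still yields absolute epoch seconds, but the "day" each partition covers would no longer line up with local midnight.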
The issue with that approach is that my data distribution is uneven. Some partitions have 1M rows, some have 50M.
Which leads to another issue: the number of open tables during long range selects like SELECT * FROM foo WHERE created_at > NOW() - INTERVAL 1 YEAR.
I want to optimize it to simply extend last partition if amount of rows is below some threshold instead of creating partition for next day. Like:
SELECT `table_rows`
FROM `information_schema`.`partitions`
WHERE table_schema = DATABASE()
AND partition_name = 'foo_1641168000';
-- only 1M rows, no need for new partition, extend existing one:
ALTER TABLE `foo` REORGANIZE PARTITION `foo_1641168000` INTO (
PARTITION `foo_1641254400` VALUES LESS THAN (1641254400) ENGINE = InnoDB # 2022-01-04 00:00:00
);
However, this operation, despite being a simple range change, completely rewrites the data in partition foo_1641168000, even though all data from the existing partition fits the new definition.
Which is no-go due to table locks and excessive I/O usage.
Is there any way to achieve this without rewriting data?
BTW: My hacky idea was to add recent data to another table foo_recent and, when it grows to a certain size, install it as a partition in foo using EXCHANGE PARTITION .. WITHOUT VALIDATION. But this is dirty and worse both in terms of performance and syntax - queries must work on a union of the tables or be run on the two tables independently with result merging.
REORGANIZE will read the 'from' partitions and write the 'to' partitions. Costly -- unless the 'froms' are empty.
Have a partition called 'future' that is LESS THAN MAXVALUE and is 'always' empty.
You are stuck with copying over lots of data.
Plan A:
Each night, before midnight, do this if the 'last' partition (before 'future') is getting "big":
REORGANIZE last, future
INTO last, soon, future;
Set the LESS THAN (for 'last') to end at midnight tonight. Set the LESS THAN for 'soon' to, say, a month from now. (This is the only big copy.)
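Spelled out, Plan A's statement might be built like this. It is a sketch only: the table name foo and the epoch-second bounds are taken from the question, the partition names last, soon and future are the placeholders used above, and plan_a_statement is a hypothetical helper.

```python
def plan_a_statement(table, last_bound, soon_bound):
    """Build the Plan A REORGANIZE: split (last, future) into (last, soon, future)."""
    if not last_bound < soon_bound:
        raise ValueError("'soon' must end later than 'last'")
    return (
        f"ALTER TABLE `{table}` REORGANIZE PARTITION `last`, `future` INTO (\n"
        f"  PARTITION `last` VALUES LESS THAN ({last_bound}),\n"
        f"  PARTITION `soon` VALUES LESS THAN ({soon_bound}),\n"
        f"  PARTITION `future` VALUES LESS THAN MAXVALUE\n"
        f");"
    )

# 'last' ends at 2022-01-04 00:00:00 UTC, 'soon' at 2022-02-01 00:00:00 UTC.
print(plan_a_statement("foo", 1641254400, 1643673600))
```

The copy cost comes from 'last': its rows are read and rewritten, while 'future' is empty and costs next to nothing.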
Plan B:
The following may be a viable alternative. (I just thought of it; I have not tried it.) Each night, see if the "last" (before "future") is "big enough". When it is, do these steps (just before each midnight):
Use "transportable tablespaces" to remove the big partition from the table. (Note: a partition is essentially a table, so this action is only touching "meta" information. I'm pretty sure no data is copied.)
Turn right around and again use "transportable tablespaces" to put it back into the partitioned table, but with a different LESS THAN -- set to midnight tonight.
REORGANIZE future INTO soon, future; -- Both of those are empty, so this is quite fast. (The LESS THAN for 'soon' is some time in the future. I hesitate to make it "MAXVALUE", but that might work and be even simpler.)
If you try it and it works, let me know. I would like to add it to my Partition Maintenance blog.
Say you have:
CREATE TABLE demo (
amount ,
year ,
cycle ,
otherStuff ,
PRIMARY KEY ( id , year , cycle )
) ENGINE = INNODB
PARTITION BY RANGE ( year )
SUBPARTITION BY KEY ( cycle )
SUBPARTITIONS 12 (
PARTITION p2020 VALUES LESS THAN (2021) ,
PARTITION p2021 VALUES LESS THAN (2022) ,
PARTITION p2022 VALUES LESS THAN (2023) ,
PARTITION pmax VALUES LESS THAN MAXVALUE
);
What's the best SELECT to run on that table?
A:
SELECT otherStuff FROM demo WHERE amount > 10 AND year = 2022 AND cycle = 1;
B:
SELECT otherStuff FROM demo PARTITION (p2022, p1) WHERE amount > 10;
or
C:
SELECT otherStuff FROM demo PARTITION (p2022, p1) WHERE amount > 10 AND year = 2022 AND cycle = 1;
I'm sure that there is some extra overhead in pruning—some preliminary step for the storage engine to take to figure out which partitions match the WHERE clause. But, where only one partition and subpartition match the WHERE clause and the pruning WHERE clause contains only simple equals comparisons, what I'm trying to figure out is whether the extra overhead is nominal for performance. The reason I want to figure that out is because I want to know if I can get away with pruning, which offers an advantage in design: if I ever wanted to, I could get rid of my partitions and have no queries to change. In other words, explicit partition selection introduces a dependency I'd rather avoid.
Thanks.
None of the above. That is, "A", but without any partitioning.
Get rid of partitioning unless you can show some use for it.
Only in certain applications does PARTITION help with performance. I have never found a performance use for SUBPARTITION.
WHERE amount > 10 AND year = 2022 AND cycle = 1
That is best handled by
INDEX(year, cycle, -- in either order
amount) -- put 'range' after '='
Partitioning would not help this query.
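To see why the '=' columns go first and the 'range' column last, model the index as a sorted list of (year, cycle, amount) tuples. This is only an analogy for a BTree (a sorted list searched with bisect), and the sample rows are made up.

```python
from bisect import bisect_right

# Stand-in for INDEX(year, cycle, amount): entries kept in sorted order.
index = sorted([
    (2021, 1, 50),
    (2022, 1, 5),
    (2022, 1, 10),
    (2022, 1, 11),
    (2022, 1, 40),
    (2022, 2, 99),
    (2023, 1, 12),
])

def matching(index, year, cycle, min_amount):
    """Entries with year = ?, cycle = ? and amount > min_amount, as one contiguous slice."""
    lo = bisect_right(index, (year, cycle, min_amount))
    hi = bisect_right(index, (year, cycle, float("inf")))
    return index[lo:hi]

print(matching(index, 2022, 1, 10))
```

Both ends of the slice are found by a search, and everything in between matches. Had amount come before year or cycle in the index, the matching entries would be scattered and each would need its own check.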
Time Series
A "time series" can be stored in a partitioned table where each partition is a week or month (or other time range). However, the only advantage comes when you get ready to Delete or Archive "old" rows.
DROP PARTITION is much faster and less invasive than the equivalent DELETE. However, it assumes that the oldest "week" can be jettisoned entirely.
Meanwhile, there is no performance benefit to SELECTs. Think of it this way. Partition pruning will pick (perhaps) one partition to look in, then the index takes over. But pruning is not "free". Nor is walking down a BTree. The BTree might be one level shallower because the partitioning serves for one level of "tree". But that just implies that the SELECT is trading off one search mechanism for another -- possibly without any performance change.
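A rough back-of-the-envelope for the "one level shallower" remark, in Python. The fan-out of 100 entries per BTree block and the 20 million row table are assumptions picked purely for illustration; real values depend on row and key sizes.

```python
def btree_levels(rows, fanout=100):
    """Smallest number of levels such that fanout**levels >= rows."""
    levels, capacity = 0, 1
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels

TOTAL_ROWS = 20_000_000  # hypothetical table size

for partitions in (1, 12, 50):
    per_partition = -(-TOTAL_ROWS // partitions)  # ceiling division
    print(partitions, per_partition, btree_levels(per_partition))
```

Under these assumptions, splitting into 12 partitions leaves each BTree just as deep as the unpartitioned one, and 50 partitions saves a single level, which pruning then has to pay for.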
More on time series and how to Partition for such: http://mysql.rjweb.org/doc.php/partitionmaint That also covers how to efficiently create the 'next' partition as time goes on.
If you don't want to DROP the old partition, but want to "archive" it, then partitioning facilitates "transportable tablespaces", where the partition is removed from the main table and turned into a table by itself. Then that can be 'transported' to somewhere else. Again, that only applies to a complete partition, hence the rows being moved must align with the PARTITION BY ... being used.
Other uses for Partitioning
See the above link; I have found only 4 other cases; they are more obscure than Time Series.
Covering indexes
Indexing is too complex to make many general statements. If the covering index has two columns that are both being tested with a range (eg, BETWEEN), the query is destined to be inefficient. Essentially a BTree index can deal with only one range. This leads to a rarely seen use for Partitioning -- use partition pruning for one "range" and an Index for the other.
Finding "nearby" places on a globe can use that two-dimensional lookup with PARTITION BY RANGE(latitude) with longitude in the index.
I don't see this trick being viable beyond 2 ranges.
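Here is a toy Python model of that two-range lookup: latitude bands play the role of partitions (pruning) and a sorted list of longitudes inside each band plays the role of the index. The band height, the sample places and all function names are hypothetical, and the sketch ignores longitude wrap-around at 180 degrees.

```python
from bisect import bisect_left, bisect_right

BAND_DEGREES = 10  # hypothetical height of each latitude "partition"

def band_of(lat):
    """Number of the latitude band (partition) that holds this latitude."""
    return int((lat + 90) // BAND_DEGREES)

def build(places):
    """places: iterable of (name, lat, lon) -> {band: sorted [(lon, lat, name)]}."""
    parts = {}
    for name, lat, lon in places:
        parts.setdefault(band_of(lat), []).append((lon, lat, name))
    for rows in parts.values():
        rows.sort()  # each band keeps its own longitude-ordered "index"
    return parts

def nearby(parts, lat, lon, delta):
    """Names within +/- delta degrees of (lat, lon)."""
    found = []
    # Range 1 (latitude): only the bands overlapping the box are visited.
    for band in range(band_of(lat - delta), band_of(lat + delta) + 1):
        rows = parts.get(band, [])
        # Range 2 (longitude): one contiguous slice of the sorted rows.
        lo = bisect_left(rows, (lon - delta,))
        hi = bisect_right(rows, (lon + delta, float("inf")))
        for r_lon, r_lat, name in rows[lo:hi]:
            if abs(r_lat - lat) <= delta:
                found.append(name)
    return sorted(found)

places = [("A", 48.85, 2.35), ("B", 48.9, 2.4), ("C", 48.8, 30.0), ("D", -33.9, 2.3)]
print(nearby(build(places), 48.86, 2.36, 0.5))
```

C sits in the same band but falls outside the longitude slice, and D sits in a band that is never visited, so neither is examined row by row.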
Back to "covering"... If the WHERE clause using a covering index has multiple ranges, there are still performance issues.
Another thing about "covering" indexes -- sometimes they are unwieldy because of having "too many" columns. I use the Rule of Thumb that says "don't put more than 5 columns in an INDEX". (This is a very soft rule; there is nothing magical about "5".)
Optimal index(es)
We can discuss one query at a time, but that is not sufficient. A table is usually hit by many different Selects. To find the optimal indexes, we need to see all the main queries at once.
If one Select begs for INDEX(a) and another begs for INDEX(a,b), having both indexes is counterproductive. It is better to get rid of the shorter one.
My recommendation above suggests either (year, cycle, amount) or (cycle, year, amount). Possibly another query would pick between them. Or, maybe there is enough variety in the queries to require both variations.
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
In case I have a table partitioned by year; how do I avoid the scanning of all partitions when I have to lookup a row by its ID and can't use partition pruning in the lookup query?
CREATE TABLE part_table (
id bigint NOT NULL auto_increment,
moment datetime NOT NULL,
KEY (id),
KEY (moment)
)-- partitioning information (in years)
PARTITION BY RANGE( YEAR(moment) ) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pFuture VALUES LESS THAN (maxvalue) )
;
With e.g. lookup query:
SELECT * FROM part_table WHERE ID = <nr>
Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
Indexes are partitioned. Each partition is essentially a "table". It has a BTree for the data and PK, and a BTree for each secondary index.
So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
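A small Python model of that lookup, assuming each partition carries its own sorted id index and that KEY(id) is not unique, so the search cannot stop at the first hit. Partition names follow the question; the ids and helper names are made up.

```python
from bisect import bisect_left

def probe(sorted_ids, wanted):
    """Search one partition's id index; True if the id is present."""
    i = bisect_left(sorted_ids, wanted)
    return i < len(sorted_ids) and sorted_ids[i] == wanted

def find_id(partitions, wanted):
    """Return (partitions containing the id, number of per-partition indexes probed)."""
    hits, probes = [], 0
    for name, ids in partitions.items():
        probes += 1  # every partition's index has to be opened and searched
        if probe(ids, wanted):
            hits.append(name)
    return hits, probes

partitions = {
    "p2020": [1, 2, 3],
    "p2021": [4, 5],
    "p2022": [6, 123],
    "pFuture": [],  # pre-created and empty, but still probed
}
print(find_id(partitions, 123))
```

A non-partitioned table would answer the same question with a single index search.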
It is inefficient to pre-create future partitions (other than one).
Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:
Dropping 'old' data is much faster than DELETEing it.
WHERE something-else AND moment BETWEEN ...
Some cases
For this discussion, I am assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc).
WHERE id BETWEEN 111 and 222
Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.
WHERE id BETWEEN 111 and 222
AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`
This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to lookup by id.
Now let me discuss two flavors of an index starting with id (assuming either of the WHERE clauses above):
PRIMARY KEY(id, moment)
The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.
INDEX(id)
is not "clustered". It is a secondary index. Secondary indexes take more steps: (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.
DROP PARTITION p2020
is much faster and less invasive than DELETE .. WHERE moment < '2021-01-01'.
More
It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.
Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries. And what datatypes they are.
In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):
If the WHERE references only one column, the PARTITIONing is probably never beneficial.
If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.
For my Table, I need to partition based on created timestamp field by Month.
I am evaluating the following two approaches:
RANGE
ALTER TABLE my_table
PARTITION BY RANGE ( MONTH(created) ) (
PARTITION p1 VALUES LESS THAN (2),
PARTITION p2 VALUES LESS THAN (3),
PARTITION p3 VALUES LESS THAN (4),
PARTITION p4 VALUES LESS THAN (5),
PARTITION p5 VALUES LESS THAN (6),
PARTITION p6 VALUES LESS THAN (7),
PARTITION p7 VALUES LESS THAN (8),
PARTITION p8 VALUES LESS THAN (9),
PARTITION p9 VALUES LESS THAN (10),
PARTITION p10 VALUES LESS THAN (11),
PARTITION p11 VALUES LESS THAN (12),
PARTITION p12 VALUES LESS THAN (13)
);
HASH
ALTER TABLE my_table
PARTITION BY HASH((YEAR(created) * 100) + MONTH(created))
PARTITIONS 13;
Use case:
My use case is that I want to archive by month, once a month is more than 1 year old. For example, if the current month is July 2020, then the partition corresponding to July 2019 would be archived. The secondary use case is partition pruning to improve performance, as most of the queries include this timestamp column.
Why 13 partitions in the HASH one?
As stated above, I will be archiving the 13th month from current month.
For this use case, which approach would suit better? As far as I understand, when I define it by RANGE, I have direct control over which data goes into which partition, and in the case of HASH, it would be defined by the MySQL HASH function (mod), which would make it difficult to identify the "over the year" partition and archive it specifically.
Or is there any totally different approach for this use case?
PARTITION BY HASH is useless. Period.
PARTITION BY RANGE can be useful if you want to purge "old" data. Details: http://mysql.rjweb.org/doc.php/partitionmaint
What will you do next January?
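To make the question concrete, here is a Python sketch of where each month lands under the two proposed schemes. It assumes HASH places a row in partition MOD(expr, 13); the function names are made up for illustration.

```python
def hash_month_partition(year, month, partitions=13):
    """Partition number for HASH((YEAR(created) * 100) + MONTH(created)) PARTITIONS 13."""
    return (year * 100 + month) % partitions

def range_month_partition(month):
    """Partition name under RANGE(MONTH(created)) as defined in the question."""
    return f"p{month}"

def months_from(year, month, count):
    """Yield `count` consecutive (year, month) pairs starting at the given month."""
    for _ in range(count):
        yield year, month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

window = list(months_from(2019, 7, 13))  # July 2019 through July 2020
hash_slots = [hash_month_partition(y, m) for y, m in window]
print(hash_slots)
print(len(set(hash_slots)), "distinct HASH partitions for", len(window), "months")
print(range_month_partition(7), "holds both July 2019 and July 2020 under RANGE(MONTH)")
```

The 13 months collapse into 10 HASH partitions (October 2019 and January 2020 share one, for instance), and under RANGE(MONTH(created)) every July shares p7. In neither scheme does "the partition for July 2019" exist as something that can be archived on its own.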
Show me your SELECTs and SHOW CREATE TABLE. I'll help you optimize the INDEXes for a non-partitioned version. It will run as fast as, or faster than, your partitioned schema.
More
BY HASH is useless when you have a "range". The Optimizer will always pick all partitions, thereby slowing down the query. (This flaw applies to most partitioning methods.)
If you always use WHERE month=constant, you may as well have the column month early in indexes. MONTH(date_col) = constant is a different matter. (I have not thought through all the implications. Let's see your queries.)
As a general rule, you can build an index on a non-partitioned table that will provide the equivalent functionality as partition pruning. (The link lists only 4 exceptions to the rule. I've spent a decade looking for more use cases.) Corollary: When switching to/from partitioning, all the indexes, including the PRIMARY KEY, should be redesigned.
One of my use cases is to use "transportable tablespaces" to archive one whole partition. You might be able to use that with BY HASH; it's rather clear how to do it with BY RANGE.
The main focus of my blog is to explain DROPping (or 'transporting') the oldest of 13 monthly partitions and using REORGANIZE to get a new "month" (or other time range).
I want to partition a table in MySQL while preserving the table's structure.
I have a column, 'Year', based on which I want to split up the table into different tables for each year respectively. The new tables will have names like 'table_2012', 'table_2013' and so on. The resultant tables need to have all the fields exactly as in the source table.
I have tried the following two pieces of SQL script with no success:
1.
CREATE TABLE all_data_table
( column1 int default NULL,
column2 varchar(30) default NULL,
column3 date default NULL
) ENGINE=InnoDB
PARTITION BY RANGE ((year))
(
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2011) , PARTITION p2 VALUES LESS THAN (2012) ,
PARTITION p3 VALUES LESS THAN (2013), PARTITION p4 VALUES LESS THAN MAXVALUE
);
2.
ALTER TABLE all_data_table PARTITION BY RANGE COLUMNS (`year`) (
PARTITION p0 VALUES LESS THAN (2011),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Any assistance would be appreciated!
This is old, but seeing as it comes up highly ranked in partitioning searches, I figured I'd give some additional details for people who might hit this page. What you are talking about in having a table_2012 and table_2013 is not "MySQL Partitioning" but "Manual Partitioning".
Partitioning means that you have one "logical table" with a single table name, which--behind the scenes--is divided among multiple files. When you have millions to billions of rows, over years, but typically you are only searching a single month, partitioning by Year/Month can have a great performance benefit because MySQL only has to search against the file that contains the Year/Month that you are searching for...so long as you include the partition key in your WHERE.
When you create multiple tables like table_2012 and table_2013, you are MANUALLY partitioning the tables, which you don't do with the MySQL PARTITION configuration. To manually partition the tables, during 2012, you put all data into the 2012 table. When you hit 2013, you start putting all the data into the 2013 table. You have to make sure to create the table before you hit 2013 or it won't have any place to go. Then, when you query across the years (e.g. from Nov 2012 - Jan 2013), you have to do a UNION between table_2012 and table_2013.
SELECT * FROM table_2012 WHERE #...
UNION
SELECT * FROM table_2013 WHERE #...
With partitioning, this manual work is not necessary. You do the initial setup of the partitions, then you treat it as a single table. No unions required, no checking the date before you insert, etc. This makes life much easier. MySQL handles figuring out what tables it needs to query. However, you MUST make sure to query against the Year column or it will have to scan ALL files. E.g. SELECT * FROM all_data_table WHERE Month=12 will scan all partitions for Month=12. To ensure you are only scanning the partition files that you need to scan, you want to make sure to include the partition column in every query that you can.
Possible negatives to partitioning...if you have billions of rows and you do an ALTER TABLE on the table to--say--add a column...it's going to have to update every row taking a VERY long time. At the company I currently work for, the boss doesn't think it's worth the time it takes to update the billion rows historically when we are adding a new column for going forward...so this is one of the reasons we do manual partitioning instead of letting MySQL do it.
DISCLAIMER: I am not an expert at partitioning...so if I'm wrong in any of this, please let me know and I'll fix the incorrect parts.
From what I see you want to create many tables from one big table.
I think you should try to create views instead.
From what I have read about partitioning, it actually partitions the physical storage of the table and stores the pieces separately. But seen from above, they still appear as a single table.