I have a table that is constantly growing.
I want to delete rows that are older than 1 year, periodically (every 12 hours).
At first I thought of using an ordinary DELETE statement, but that is no good: there are many entries and the database would get stuck.
Then I read about another approach: moving the "undeleted" entries to a new table, renaming it, and dropping the old table.
The approach I want to try (and am not sure how to do) is partitioning.
I want to take my created field and divide it into months; then, each month, delete the same month from a year ago.
Example: once a month, on 2016-01-01, delete all entries from January 2015.
I removed the primary key and added it as an index (as I got error 1503).
But I still can't figure out how to do it.
Can you please advise?
This is the table:
CREATE TABLE `myTable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`file_name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Adding - I tried this:
ALTER TABLE myTable
PARTITION BY RANGE( YEAR(created) )
SUBPARTITION BY HASH( MONTH(created) )
SUBPARTITIONS 12 (
PARTITION january VALUES LESS THAN (2),
PARTITION february VALUES LESS THAN (3),
PARTITION march VALUES LESS THAN (4),
PARTITION april VALUES LESS THAN (5),
PARTITION may VALUES LESS THAN (6),
PARTITION june VALUES LESS THAN (7),
PARTITION july VALUES LESS THAN (8),
PARTITION august VALUES LESS THAN (9),
PARTITION september VALUES LESS THAN (10),
PARTITION october VALUES LESS THAN (11),
PARTITION november VALUES LESS THAN (12),
PARTITION december VALUES LESS THAN (13)
);
But I always get the error "Table has no partition for value 2016" when trying to set created to 2016-01-26 15:37:22.
HASH partitioning does not do anything useful.
RANGE partitioning needs specific ranges.
To keep a year's worth of data, but delete in 12-hour chunks, would require 730 partitions; this is impractical.
Instead, I suggest PARTITION BY RANGE with 14 monthly ranges (or 54 weekly ranges) and DROP a whole month (or week) at a time. For example, it is now mid-January, so monthly would have: Jan'15, Feb'15, ..., Jan'16, Future.
Near the end of Jan'16, REORGANIZE Future into Feb'16 and Future.
Early in Feb'16, DROP Jan'15.
Yes, you would have up to a month (or week) of data waiting to be deleted, but that probably is not a big deal. And it would be very efficient.
I would write a daily cron job to do "if it is time to drop, do so" and "if it is time to reorganize, do so".
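A minimal sketch of that layout, assuming TO_DAYS(created) as the partition function and partition names of my own choosing (the question's table allows this, since the PRIMARY KEY was already demoted to a plain index):

ALTER TABLE myTable
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION p201503 VALUES LESS THAN (TO_DAYS('2015-04-01')),
    PARTITION p201504 VALUES LESS THAN (TO_DAYS('2015-05-01')),
    PARTITION p201505 VALUES LESS THAN (TO_DAYS('2015-06-01')),
    PARTITION p201506 VALUES LESS THAN (TO_DAYS('2015-07-01')),
    PARTITION p201507 VALUES LESS THAN (TO_DAYS('2015-08-01')),
    PARTITION p201508 VALUES LESS THAN (TO_DAYS('2015-09-01')),
    PARTITION p201509 VALUES LESS THAN (TO_DAYS('2015-10-01')),
    PARTITION p201510 VALUES LESS THAN (TO_DAYS('2015-11-01')),
    PARTITION p201511 VALUES LESS THAN (TO_DAYS('2015-12-01')),
    PARTITION p201512 VALUES LESS THAN (TO_DAYS('2016-01-01')),
    PARTITION p201601 VALUES LESS THAN (TO_DAYS('2016-02-01')),
    PARTITION pFuture VALUES LESS THAN (MAXVALUE)
);
-- near the end of Jan'16: carve Feb'16 out of the (still empty) future partition
ALTER TABLE myTable REORGANIZE PARTITION pFuture INTO (
    PARTITION p201602 VALUES LESS THAN (TO_DAYS('2016-03-01')),
    PARTITION pFuture VALUES LESS THAN (MAXVALUE)
);
-- early in Feb'16: drop the year-old month; this is nearly instant, unlike DELETE
ALTER TABLE myTable DROP PARTITION p201501;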
I'm trying to figure out how long it will take to partition a large table. I'm about 2 weeks into partitioning this table and don't have a good feeling for how much longer it will take. Is there any way to calculate how long this query might take?
The following is the query in question.
ALTER TABLE pIndexData REORGANIZE PARTITION pMAX INTO (
PARTITION p2022 VALUES LESS THAN (UNIX_TIMESTAMP('2023-01-01 00:00:00 UTC')),
PARTITION pMAX VALUES LESS THAN (MAXVALUE)
)
For context, the pIndexData table has about 6 billion records and the pMAX partition has roughly 2 billion records. This is an Amazon Aurora instance and the server is running MySQL 5.7.12. The DB Engine is InnoDB. The following is the table syntax.
CREATE TABLE `pIndexData` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`DateTime-UNIX` bigint(20) NOT NULL DEFAULT '0',
`pkl_PPLT_00-PIndex` int(11) NOT NULL DEFAULT '0',
`DataValue` decimal(14,4) NOT NULL DEFAULT '0.0000',
PRIMARY KEY (`pkl_PPLT_00-PIndex`,`DateTime-UNIX`),
KEY `id` (`id`),
KEY `DateTime` (`DateTime-UNIX`) USING BTREE,
KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE,
KEY `DataIndex` (`DataValue`),
KEY `pIndex-Data` (`pkl_PPLT_00-PIndex`,`DataValue`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (`DateTime-UNIX`)
(PARTITION p2016 VALUES LESS THAN (1483246800) ENGINE = InnoDB,
PARTITION p2017 VALUES LESS THAN (1514782800) ENGINE = InnoDB,
PARTITION p2018 VALUES LESS THAN (1546318800) ENGINE = InnoDB,
PARTITION p2019 VALUES LESS THAN (1577854800) ENGINE = InnoDB,
PARTITION p2020 VALUES LESS THAN (1609477200) ENGINE = InnoDB,
PARTITION p2021 VALUES LESS THAN (1641013200) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
In researching this question, I found that Performance Schema could provide the answer. However, Performance Schema is not enabled on this server, and enabling it requires a reboot. Rebooting is not an option, because doing so could corrupt the database while this query is processing.
As a means of getting some sense of how long this will take, I recreated the pIndexData table in a separate Aurora instance. I then imported a sample set of data (about 3 million records). The sample set had DateTime values spread over 2021, 2022 and 2023, with the lion's share of the data in 2022. I then ran the same REORGANIZE PARTITION query and clocked the time it took to complete: 2 minutes, 29 seconds. If the query time were linear in the number of records, I estimate the query on the original table should take roughly 18 hours. The scaling is clearly not linear: even with a large margin of error, that estimate is way off. There are factors (perhaps many) I'm missing.
I'm not sure what else to try other than run the sample data test again but with an even larger data sample. Before I do, I'm hoping someone might have some insight how to best calculate how long this might take to finish.
Adding (or removing) partitioning necessarily copies all the data over and rebuilds the table and all of its indexes. So, if your table is large enough to warrant partitioning (over 1M rows), it will take a noticeable amount of time.
In the case of REORGANIZE of one (or a few) partitions (eg, pMAX) INTO new ones, the metric is how many rows are in pMAX.
What you should have done is create the LESS THAN 2022 partition late in 2021, when pMAX was still empty.
I recommend you reorganize pMAX into p2022, p2023, and pMAX now, as sketched below. Again, the time is proportional to the size of pMAX. Then be sure to create p2024 in December 2023, while pMAX is still empty.
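A sketch of that reorganization; the boundary literals are my assumption and must follow the same time-zone convention as the existing boundaries, since UNIX_TIMESTAMP evaluates in the server's time zone:

ALTER TABLE pIndexData REORGANIZE PARTITION pMAX INTO (
    PARTITION p2022 VALUES LESS THAN (UNIX_TIMESTAMP('2023-01-01 00:00:00')),
    PARTITION p2023 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01 00:00:00')),
    PARTITION pMAX VALUES LESS THAN (MAXVALUE)
);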
What is the advantage of partitioning by Year? Will you be purging old data eventually? (That may be the only advantage.)
As for your test -- was there nothing in the other partitions when you measured 2m29s? If so, that test should be about right. There may be a small extra burden in adding the 2021 index rows.
A side note: The following is unnecessary since there are 2 other indexes handling it:
KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE,
However, I don't know if dropping it would be "instant".
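One cautious way to find out is to request the in-place algorithm explicitly; MySQL 5.7 then refuses the statement instead of silently falling back to a full table copy. A sketch (I have not verified this behavior on Aurora):

ALTER TABLE `pIndexData` DROP KEY `pIndex`, ALGORITHM=INPLACE, LOCK=NONE;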
I have created a table in MySQL using the following syntax:
CREATE TABLE `demo` (
`id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'ID',
`date` datetime NOT NULL COMMENT 'date',
`desc` enum('error','audit','info') NOT NULL,
PRIMARY KEY (`id`,`date`)
)
PARTITION BY RANGE (MONTH(`date`))
(
PARTITION JAN VALUES LESS THAN (2),
PARTITION FEB VALUES LESS THAN (3),
PARTITION MAR VALUES LESS THAN (4),
PARTITION APR VALUES LESS THAN (5),
PARTITION MAY VALUES LESS THAN (6),
PARTITION JUN VALUES LESS THAN (7),
PARTITION JUL VALUES LESS THAN (8),
PARTITION AUG VALUES LESS THAN (9),
PARTITION SEP VALUES LESS THAN (10),
PARTITION OCT VALUES LESS THAN (11),
PARTITION NOV VALUES LESS THAN (12),
PARTITION `DEC` VALUES LESS THAN (MAXVALUE)
);
Here id and date form the combined primary key, and I have used date as the partitioning column. I am making the partitions based on the month of the date.
The table was created successfully and the data is getting inserted properly into it, as per the partitions.
What will be the effect on performance if I fire a query that needs to fetch records across multiple partitions?
Consider the following query:
SELECT * FROM `demo` WHERE `date` BETWEEN '2015-02-01 00:00:00' AND '2015-05-31 00:00:00';
The query will need to look at ALL the partitions. The optimizer is not smart enough to understand the basic principles of date ranges when they are "wrapped" by the MONTH() function.
You can see this by doing EXPLAIN PARTITIONS SELECT ...;.
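For example, with the corrected query from the question (the comment shows the outcome to expect, given the MONTH()-wrapped partitioning):

EXPLAIN PARTITIONS
SELECT * FROM `demo`
WHERE `date` BETWEEN '2015-02-01 00:00:00' AND '2015-05-31 00:00:00';
-- the "partitions" column lists JAN,FEB,...,DEC: every partition is scanned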
Even if it were smart enough to touch only 4 partitions, you would gain no performance benefit for that SELECT. You may as well get rid of partitions and add an index on date.
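A sketch of that simplification (the index name idx_date is my choice):

ALTER TABLE `demo` REMOVE PARTITIONING;
ALTER TABLE `demo` ADD INDEX `idx_date` (`date`);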
Since this table is called demo, I suspect it is not the final version. If you would like to talk about whether PARTITIONing is useful for your application, let's see the real schema and the important queries.
I have a 30M-row table and I want to partition it by date.
mysql > SHOW CREATE TABLE `parameters`
CREATE TABLE `parameters` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`add_time` datetime DEFAULT NULL,
...(etc)
) ENGINE=MyISAM AUTO_INCREMENT=28929477 DEFAULT CHARSET=utf8
The table stores data for the last 5 years, and the row count is increasing dramatically. I want to partition it by year (2009, 2010, 2011, 2012, 2013).
ALTER TABLE parameters DROP PRIMARY KEY, ADD INDEX(id);
ALTER TABLE parameters PARTITION BY RANGE (TO_DAYS(id)) (
PARTITION y2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
PARTITION y2010 VALUES LESS THAN (TO_DAYS('2011-01-01')),
PARTITION y2011 VALUES LESS THAN (TO_DAYS('2012-03-01')),
PARTITION y2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
PARTITION y2013 VALUES LESS THAN MAXVALUE
);
Everything works on the dev server, but there is a problem on the production server.
The problem: almost all of the rows moved to the first partition (y2009), even though the data is uniformly distributed across the years. Physically, there is a large y2009.MYD file in the DATA folder, while the other partitions' files are much smaller.
I also tried to reorganize the first partition in order to exclude NULL dates:
alter table raw
reorganize partition y2012 into (
PARTITION y0 VALUES LESS THAN (0),
PARTITION y2012 VALUES LESS THAN (TO_DAYS('2013-01-01'))
);
P.S.: the production and dev servers run the same MySQL version, 5.1.37.
You need to use the date column, not id, in the RANGE expression. TO_DAYS(id) treats each id as if it were a date; for most id values that is not a valid date, so the expression evaluates to NULL, and MySQL places NULL partition values in the lowest partition -- which is exactly why almost everything landed in y2009.
I have changed TO_DAYS(id) to TO_DAYS(add_time).
Try below:
ALTER TABLE parameters PARTITION BY RANGE (TO_DAYS(add_time)) (
PARTITION y0 VALUES LESS THAN (TO_DAYS('2009-01-01')),
PARTITION y2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
PARTITION y2010 VALUES LESS THAN (TO_DAYS('2011-01-01')),
PARTITION y2011 VALUES LESS THAN (TO_DAYS('2012-03-01')),
PARTITION y2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
PARTITION y2013 VALUES LESS THAN MAXVALUE
);
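After the rebuild, a quick sanity check (my addition, not part of the original answer) is to confirm the rows are now spread across the year partitions; for MyISAM the row counts here are exact:

SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'parameters';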
I am having an issue partitioning a table using PARTITION BY RANGE on a datetime column.
A test query still performs a full partition scan.
I saw some posts on the net regarding this issue, but I am not sure if there is any way to fix it or bypass it.
MySQL server: Percona 5.5.24-55.
table:
id bigint(20) unsigned NOT NULL,
time datetime NOT NULL,
....
....
KEY id_time (id,time)
engine=InnoDB
partition statement:
alter table summary_201204
partition by range (day(time))
subpartition by key(id)
subpartitions 5 (
partition p0 values less than (6),
partition p1 values less than (11),
partition p2 values less than (16),
partition p3 values less than (21),
partition p4 values less than (26),
partition p5 values less than (MAXVALUE) );
check:
explain partitions select * from summary_201204 where time < '2012-07-21';
result: p0_p0sp0,p0_p0sp1,p0_p0sp2,p0_p0sp3,p0_p0sp4,p1_p1sp0,p1_p1sp1,p1_p1sp2,p1_p1sp3,p1_p1sp4,p2_p2sp0,p2_p2sp1,p2_p2sp2,p2_p2sp3,p2_p2sp4,p3_p3sp0,p3_p3sp1,p3_p3sp2,p3_p3sp3,p3_p3sp4,p4_p4sp0,p4_p4sp1,p4_p4sp2,p4_p4sp3,p4_p4sp4,p5_p5sp0,p5_p5sp1,p5_p5sp2,p5_p5sp3,p5_p5sp4.
I think I found the answer.
The documentation on the official MySQL site is not clear enough about the data types required for partition pruning. In this case, if the column's type is DATETIME, then we should use TO_SECONDS, whilst if the type is DATE then we can use YEAR.
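Following that advice, a sketch of how this table could be repartitioned so that pruning works; the boundary dates are my assumption, mirroring the original day-of-month splits within April 2012:

ALTER TABLE summary_201204
PARTITION BY RANGE (TO_SECONDS(`time`))
SUBPARTITION BY KEY(id)
SUBPARTITIONS 5 (
    PARTITION p0 VALUES LESS THAN (TO_SECONDS('2012-04-06 00:00:00')),
    PARTITION p1 VALUES LESS THAN (TO_SECONDS('2012-04-11 00:00:00')),
    PARTITION p2 VALUES LESS THAN (TO_SECONDS('2012-04-16 00:00:00')),
    PARTITION p3 VALUES LESS THAN (TO_SECONDS('2012-04-21 00:00:00')),
    PARTITION p4 VALUES LESS THAN (TO_SECONDS('2012-04-26 00:00:00')),
    PARTITION p5 VALUES LESS THAN (MAXVALUE)
);
-- EXPLAIN PARTITIONS on a range over `time` should now list only the matching partitions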
I have a log table that gets processed every night. Processing will be done on data that was logged yesterday. Once the processing is complete I want to delete the data for that day. At the same time, I have new data coming into the table for the current day. I partitioned the table based on day of week. My hope was that I could delete data and insert data at the same time without contention. There could be as many as 3 million rows of data a day being processed. I have searched for information but haven't found anything to confirm my assumption.
I don't want the hassle of writing a job that adds and drops partitions, as I have seen in other examples. I was hoping to implement a solution using seven fixed partitions, e.g.:
CREATE TABLE `professional_scoring_log` (
`professional_id` int(11) NOT NULL,
`score_date` date NOT NULL,
`scoring_category_attribute_id` int(11) NOT NULL,
`displayable_score` decimal(7,3) NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`professional_id`,`score_date`,`scoring_category_attribute_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (DAYOFWEEK(`score_date`))
(PARTITION Sun VALUES LESS THAN (2) ENGINE = InnoDB,
PARTITION Mon VALUES LESS THAN (3) ENGINE = InnoDB,
PARTITION Tue VALUES LESS THAN (4) ENGINE = InnoDB,
PARTITION Wed VALUES LESS THAN (5) ENGINE = InnoDB,
PARTITION Thu VALUES LESS THAN (6) ENGINE = InnoDB,
PARTITION Fri VALUES LESS THAN (7) ENGINE = InnoDB,
PARTITION Sat VALUES LESS THAN (8) ENGINE = InnoDB) */
When my job that processes yesterday's data is complete, it would delete all records where score_date = current_date-1. At any one time, I am likely only going to have data in one or two partitions, depending on time of day.
Are there any holes in my assumptions?
Charlie, I don't see any holes in your logic/assumptions.
I guess my one comment would be: why not drop and re-create the partition instead? It has to be more efficient than DELETE FROM ... WHERE ..., and it's just two calls, no big deal. Store "prototype" statements and substitute for "Sun" and "2" as required for each day of the week; I often use sprintf for doing just that.
ALTER TABLE `professional_scoring_log` DROP PARTITION Sun;
-- Note: ADD PARTITION can only append above the highest existing RANGE, so Sun
-- cannot simply be re-added at the low end; re-create it by splitting the
-- now-lowest partition instead:
ALTER TABLE `professional_scoring_log` REORGANIZE PARTITION Mon INTO (
    PARTITION Sun VALUES LESS THAN (2),
    PARTITION Mon VALUES LESS THAN (3)
);
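Since REORGANIZE copies whatever rows are already sitting in Mon, a simpler alternative for this fixed seven-slot scheme (my suggestion, not part of the original answer; available since MySQL 5.5) is to leave the partition layout alone and just empty the old slot:

ALTER TABLE `professional_scoring_log` TRUNCATE PARTITION Sun;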