I have a table that will grow large over time, but I only need a small amount of recent data, say the last 7 days.
I want to configure it so that each 7 days of data goes into its own partition. That way I would keep only two partitions and archive the others.
I read about MySQL partitions here, but the article creates partitions by specifying all of them when the table is created.
I am not sure this is the best approach, since the partitioning logic would have to be written out a long time ahead.
Any ideas?
Unfortunately, it'll be a fairly manual process. Your best bet is to create the partitions, week by week ahead of time, then have a job that runs periodically to archive the old data into the 'catchall' partition.
e.g. with:
PARTITION BY RANGE ( TO_DAYS(date) ) (
PARTITION pmin VALUES LESS THAN ( TO_DAYS('2016-10-02 00:00:00') ),
PARTITION p1 VALUES LESS THAN ( TO_DAYS('2016-10-09 00:00:00') ),
PARTITION p2 VALUES LESS THAN ( TO_DAYS('2016-10-16 00:00:00') ),
PARTITION p3 VALUES LESS THAN ( TO_DAYS('2016-10-23 00:00:00') ),
PARTITION pmax VALUES LESS THAN (MAXVALUE)
);
There's no real harm in having a few empty partitions with later dates sitting there, then doing a 'shift' once a week. It'll be fast enough as long as, whenever you change the partitioning definition, the data window shifts by exactly one partition size.
Your job would do something like
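ALTER TABLE x REORGANIZE PARTITION pmin, p1 INTO (
PARTITION pmin VALUES LESS THAN ( TO_DAYS('2016-10-09 00:00:00') )
);
-- ADD PARTITION cannot be used while a MAXVALUE partition exists, so split pmax instead
ALTER TABLE x REORGANIZE PARTITION pmax INTO (
PARTITION px VALUES LESS THAN ( TO_DAYS('2016-10-30 00:00:00') ),
PARTITION pmax VALUES LESS THAN (MAXVALUE)
);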
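As a rough sketch of what such a weekly job might generate, the Python below builds the two statements from the current boundaries. The function name and its parameters are hypothetical, and actually executing the statements against MySQL is left out. The second statement splits pmax with REORGANIZE rather than using ADD PARTITION, because MySQL does not allow adding a range partition after a MAXVALUE partition.

```python
from datetime import date, timedelta


def weekly_shift(table, oldest, oldest_bound, newest_bound, new_name):
    """Build the two ALTER TABLE statements for one weekly shift.

    table        -- table name (trusted input; this sketch does no escaping)
    oldest       -- name of the oldest weekly partition, merged into pmin
    oldest_bound -- exclusive upper bound of that partition
    newest_bound -- exclusive upper bound of the newest partition before pmax
    new_name     -- name for the new partition (hypothetical naming)
    """
    next_bound = newest_bound + timedelta(days=7)
    merge = (
        f"ALTER TABLE {table} REORGANIZE PARTITION pmin, {oldest} INTO ("
        f"PARTITION pmin VALUES LESS THAN (TO_DAYS('{oldest_bound.isoformat()}')))"
    )
    split = (
        f"ALTER TABLE {table} REORGANIZE PARTITION pmax INTO ("
        f"PARTITION {new_name} VALUES LESS THAN (TO_DAYS('{next_bound.isoformat()}')), "
        f"PARTITION pmax VALUES LESS THAN (MAXVALUE))"
    )
    return [merge, split]


# Boundaries taken from the example partition list above.
for stmt in weekly_shift("x", "p1", date(2016, 10, 9), date(2016, 10, 23), "px"):
    print(stmt)
```

Both statements touch only the oldest partition and the empty pmax, which is what keeps the shift cheap.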
There is no "automatic" partition management in MySQL. We have to run some specific SQL statements to add and drop partitions from a partitioned table.
We automated the task with a cron job which runs a MySQL PROCEDURE we wrote to drop (swap out) old partitions, and another PROCEDURE to add new partitions. The procedures are specific to a particular table.
Our table is partitioned by RANGE on a TIMESTAMP column. The partition expression is like UNIX_TIMESTAMP(col).
To add a new partition, we reorganize the MAXVALUE partition, which is always (or should always be) empty, so the operation is very quick. We dynamically prepare and execute a statement of the form:
ALTER TABLE ourtable REORGANIZE PARTITION pmax
INTO ( PARTITION pn_name VALUES LESS THAN (UNIX_TIMESTAMP(pn_date))
, PARTITION pmax VALUES LESS THAN MAXVALUE)
To get the date value for the new partition (pn_name), we take the partition_description value from the second-to-last partition (the last partition is the MAXVALUE partition) and add 7 days to it to get the pn_date string to use. We use that same value to generate the pn_name for the new partition. (We name the partitions following a pattern like p20161030, based on the date value in the partition_description, e.g. UNIX_TIMESTAMP('2016-10-30').)
(This information is obtained from a fairly involved query with a couple of references to the information_schema.partitions view.)
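The naming and date arithmetic described above could look roughly like this in Python. This is only a sketch: the function names are made up, and it assumes the session time zone is UTC, since UNIX_TIMESTAMP() of a date string depends on the session time zone.

```python
from datetime import datetime, timedelta, timezone


def next_partition(last_boundary_ts, days=7):
    """Return (partition_name, date_string) for the partition that follows
    the one whose partition_description is last_boundary_ts.

    Assumes boundaries are epoch seconds at midnight UTC (an assumption).
    """
    last = datetime.fromtimestamp(last_boundary_ts, tz=timezone.utc)
    nxt = last + timedelta(days=days)
    return nxt.strftime("p%Y%m%d"), nxt.strftime("%Y-%m-%d")


def add_partition_sql(table, last_boundary_ts):
    """Build the REORGANIZE statement that splits the empty pmax partition."""
    name, day = next_partition(last_boundary_ts)
    return (
        f"ALTER TABLE {table} REORGANIZE PARTITION pmax INTO ("
        f"PARTITION {name} VALUES LESS THAN (UNIX_TIMESTAMP('{day}')), "
        f"PARTITION pmax VALUES LESS THAN MAXVALUE)"
    )


# 1640995200 is 2022-01-01 00:00:00 UTC.
print(add_partition_sql("ourtable", 1640995200))
```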
With the other procedure to drop old partitions, we actually "swap out" the old partition to an archive table. (The archive table is later backed up, and dropped by a different task.)
The procedure basically runs a series of statements like this:
DROP TABLE IF EXISTS `_et` ;
CREATE TABLE `_et` LIKE `ourtable` ;
ALTER TABLE `_et` REMOVE PARTITIONING ;
ALTER TABLE `ourtable` EXCHANGE PARTITION `oldest_partition` WITH TABLE `_et` ;
ALTER TABLE `ourtable` DROP PARTITION `oldest_partition` ;
RENAME TABLE `_et` TO `archive_oldest_partition` ;
(I wish there was a cleaner way to create a new un-partitioned table in a single statement, such as a CREATE TABLE ... LIKE ... WITHOUT PARTITIONING, but absent that, we settled on the two separate statements.)
Just dropping the oldest partition would be a simpler process.
To obtain information about the oldest partition, our query is probably overkill. But it's where most of the "magic" happens. Just to give you an idea of what that query looks like...
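SELECT p1.partition_name -- the select list is not shown in the original; partition_name is an assumption
FROM information_schema.partitions p1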
JOIN information_schema.partitions px
ON px.table_schema = 'ourdatabase'
AND px.table_name = 'ourtable'
AND px.partition_method = 'RANGE'
AND px.partition_expression = 'UNIX_TIMESTAMP(ourcol)'
AND px.partition_description = 'MAXVALUE'
WHERE p1.table_schema = 'ourdatabase'
AND p1.table_name = 'ourtable'
AND p1.partition_method = 'RANGE'
AND p1.partition_expression = 'UNIX_TIMESTAMP(ourcol)'
AND p1.partition_description <> 'MAXVALUE'
AND p1.partition_description + 0 <= UNIX_TIMESTAMP(DATE(NOW()) + INTERVAL -187 DAY)
AND p1.partition_ordinal_position = 1
You could probably get away with a simpler query. (Our query is designed to return the "oldest" partition only if all of the timestamp values in it are at least six months old, and only if there is a MAXVALUE partition defined.)
Each of the procedures uses the current date to see if "it's time" to add or drop a partition. (The amount of time forward and back is hardcoded into the queries in the procedure; the query returns 0 rows if it's not time yet.)
The procedures only need to be executed once per week, and we designed them so that any "extra" runs won't add or drop partitions outside of the specified time ranges.
We have the procedures scheduled to execute every day, and on most days, the procedure runs a query which returns zero rows, and exits. Only when the query returns a row is there any work to do.
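To illustrate why running the job daily is harmless, here is a small Python sketch of the same "is it time yet?" test that the query applies to the oldest partition. The function name is hypothetical, the 187 days come from the INTERVAL in the query above, and boundaries are assumed to be epoch seconds at midnight UTC.

```python
from datetime import date, datetime, timedelta, timezone

RETENTION_DAYS = 187  # same value as INTERVAL -187 DAY in the query


def oldest_partition_expired(boundary_ts, today, retention_days=RETENTION_DAYS):
    """True when the partition's upper bound is at or before the cutoff,
    i.e. every row in it is at least retention_days old.

    Mirrors: partition_description + 0 <= UNIX_TIMESTAMP(DATE(NOW()) + INTERVAL -187 DAY)
    """
    midnight = datetime(today.year, today.month, today.day, tzinfo=timezone.utc)
    cutoff = midnight - timedelta(days=retention_days)
    return boundary_ts <= int(cutoff.timestamp())


# Boundary 1640995200 is 2022-01-01 00:00:00 UTC.
first_due = date(2022, 1, 1) + timedelta(days=RETENTION_DAYS)
print(oldest_partition_expired(1640995200, first_due - timedelta(days=1)))
print(oldest_partition_expired(1640995200, first_due))
```

On every day before the cutoff the check is false and the job exits without touching the table, so extra runs cost next to nothing.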
I'm using MySQL 5.7 Percona.
My current design uses naive day-by-day partitioning, which adds a new partition for the next time period on a regular basis.
CREATE TABLE `foo` (
...
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC
PARTITION BY RANGE (UNIX_TIMESTAMP(`created_at`)) (
PARTITION `foo_1640995200` VALUES LESS THAN (1640995200) ENGINE = InnoDB, # 2022-01-01 00:00:00
PARTITION `foo_1641081600` VALUES LESS THAN (1641081600) ENGINE = InnoDB, # 2022-01-02 00:00:00
PARTITION `foo_1641168000` VALUES LESS THAN (1641168000) ENGINE = InnoDB # 2022-01-03 00:00:00
);
The issue with that approach is that my data distribution is uneven. Some partitions have 1M rows, some have 50M.
This leads to another issue: the number of open tables during long-range selects like SELECT * FROM foo WHERE created_at > NOW() - INTERVAL 1 YEAR.
I want to optimize it by simply extending the last partition if its row count is below some threshold, instead of creating a partition for the next day. Like:
SELECT `table_rows`
FROM `information_schema`.`partitions`
WHERE table_schema = DATABASE()
AND partition_name = 'foo_1641168000';
-- only 1M rows, no need for new partition, extend existing one:
ALTER TABLE `foo` REORGANIZE PARTITION `foo_1641168000` INTO (
PARTITION `foo_1641254400` VALUES LESS THAN (1641254400) ENGINE = InnoDB # 2022-01-04 00:00:00
);
However, this operation, despite being a simple range change, completely rewrites the data in partition foo_1641168000, even though all data from the existing partition fits into the new definition.
That is a no-go due to table locks and excessive I/O usage.
Is there any way to achieve this without rewriting data?
BTW: My hacky idea was to add recent data to another table, foo_recent, and when it grows to a certain size, install it as a partition in foo using EXCHANGE PARTITION .. WITHOUT VALIDATION. But this is dirty and worse in terms of both performance and syntax: queries must work on a union of the tables, or be run on the two tables independently with the results merged.
REORGANIZE will read the 'from' partitions and write the 'to' partitions. Costly -- unless the 'froms' are empty.
Have a partition called 'future' that is LESS THAN MAXVALUE and is 'always' empty.
You are stuck with copying over lots of data.
Plan A:
Each night, before midnight, do this if the 'last' partition (before 'future') is getting "big":
REORGANIZE last, future
INTO last, soon, future;
Set the LESS THAN (for 'last') to end at midnight tonight. Set the LESS THAN for 'soon' to, say, a month from now. (This is the only big copy.)
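A possible shape for the nightly Plan A check, sketched in Python. The function and parameter names are invented for illustration, boundaries are epoch seconds computed in UTC (an assumption), and the row count would come from information_schema.partitions, where it is only an estimate for InnoDB.

```python
from datetime import date, datetime, timedelta, timezone


def plan_a_sql(table, last, soon, today, rows, threshold, horizon_days=30):
    """Return the REORGANIZE statement if partition `last` is big enough,
    otherwise None (nothing to do tonight).

    `last` is cut off at the coming midnight; `soon` covers the following
    horizon_days; `future` stays as the empty MAXVALUE partition.
    """
    if rows < threshold:
        return None
    midnight = datetime(today.year, today.month, today.day, tzinfo=timezone.utc)
    midnight += timedelta(days=1)
    soon_end = midnight + timedelta(days=horizon_days)
    return (
        f"ALTER TABLE {table} REORGANIZE PARTITION {last}, future INTO ("
        f"PARTITION {last} VALUES LESS THAN ({int(midnight.timestamp())}), "
        f"PARTITION {soon} VALUES LESS THAN ({int(soon_end.timestamp())}), "
        f"PARTITION future VALUES LESS THAN (MAXVALUE))"
    )


print(plan_a_sql("foo", "foo_last", "foo_soon", date(2022, 1, 3),
                 rows=50_000_000, threshold=10_000_000))
```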
Plan B:
The following may be a viable alternative. (I just thought of it; I have not tried it.) Each night, see if the "last" (before "future") is "big enough". When it is, do these steps (just before each midnight):
Use "transportable tablespaces" to remove the big partition from the table. (Note: a partition is essentially a table, so this action is only touching "meta" information. I'm pretty sure no data is copied.)
Turn right around and again use "transportable tablespaces" to put it back into the partitioned table, but with a different LESS THAN -- set to midnight tonight.
REORGANIZE future INTO soon, future; -- Both of those are empty, so this is quite fast. (The LESS THAN for 'soon' is some time in the future. I hesitate to make it "MAXVALUE", but that might work and be even simpler.)
If you try it and it works, let me know. I would like to add it to my Partition Maintenance blog.
I have a partitioned table, in which I'm inserting data from a stored procedure,
I have partitioning on the table by a column named year,
The stored procedure is able to insert data into the partitioned table properly.
But now I have a case where inserts might happen, for which partitions may not be present,
I need a solution to find if a particular partition name exists for the table.
Eg. My table name is backups
I have 3 Partitions for now -
2018, 2019 and 2020
But in 2021, when the stored procedure runs,
there may not be a partition for that year.
So I would like my stored procedure to handle checking for the partition and creating it at run time.
Following is my table structure -
Partition creation query -
ALTER TABLE backups
partition by list columns(year)
(partition backup_2018 values IN (2018),
partition backup_2019 values IN (2019),
partition backup_2020 values IN (2020));
Following is my stored procedure -
CREATE DEFINER=`root`@`localhost` PROCEDURE `daily_backup`()
BEGIN
DECLARE backuptime INT;
#Need Partition checking and creation here
SET backuptime = UNIX_TIMESTAMP(CONCAT(DATE_SUB(DATE_FORMAT(NOW(),'%Y-%m-%d'), INTERVAL 1 DAY),' 23:59:59'));
INSERT into backups
(user_id, latest_transaction_id, balance, last_transaction_timestamp, last_transaction_date, snapshot_date, year)
SELECT
T2.user_id,
T2.transaction_id AS latest_transaction_id,
T2.new_balance AS balance,
T2.created_date AS last_transaction_timestamp,
DATE_FORMAT(FROM_UNIXTIME(T2.created_date), '%Y-%m-%d %I:%i:%S') AS last_transaction_date,
DATE_FORMAT(NOW(), '%Y-%m-%d') AS snapshot_date,
DATE_FORMAT(NOW(), '%Y') AS year
FROM
(SELECT
user_id, MAX(transaction_id) maxTransID
FROM
transaction
WHERE
created_date < backuptime
GROUP BY user_id) Tmp
JOIN
transaction T2 ON Tmp.MaxTransID = T2.Transaction_ID;
END
I suggest partitioning using RANGE COLUMNS instead of LIST COLUMNS. That way you have the option of adding a last partition for any years beyond the partitions you have defined so far.
ALTER TABLE backups
partition by range columns(year)
(partition backup_2018 values LESS THAN (2019),
partition backup_2019 values LESS THAN (2020),
partition backup_2020 values LESS THAN (2021),
partition backup_other values LESS THAN (MAXVALUE));
As you get closer to the end of 2020, you'd use ALTER TABLE backups REORGANIZE PARTITION backup_other INTO ( ...new partitions... ) to split the last partition and make new partitions for subsequent years.
See https://dev.mysql.com/doc/refman/5.7/en/partitioning-management-range-list.html for more details.
If you forget, no harm done, your data will just fill up backup_other for a while until you remember to reorganize. It's to your advantage though to do it proactively, because reorganizing an empty partition is quick, and reorganizing a partition with data in it will take more time.
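For the "does the partition exist yet?" part of the question, one option is to read the existing partition names from information_schema.partitions and build the REORGANIZE statement only when the year is missing. The Python below is a sketch under those assumptions: the function name is made up, the names follow the backup_YYYY pattern from the example, and it assumes years are added in order (a skipped year would end up inside the next partition that gets created).

```python
def missing_year_partition_sql(table, existing, year,
                               catchall="backup_other", prefix="backup_"):
    """Return the ALTER TABLE that carves `year` out of the catch-all
    partition, or None when that partition already exists.

    existing -- partition names already defined on the table, e.g. fetched
                from information_schema.partitions
    """
    name = f"{prefix}{year}"
    if name in existing:
        return None
    return (
        f"ALTER TABLE {table} REORGANIZE PARTITION {catchall} INTO ("
        f"PARTITION {name} VALUES LESS THAN ({year + 1}), "
        f"PARTITION {catchall} VALUES LESS THAN (MAXVALUE))"
    )


existing = {"backup_2018", "backup_2019", "backup_2020", "backup_other"}
print(missing_year_partition_sql("backups", existing, 2020))
print(missing_year_partition_sql("backups", existing, 2021))
```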
I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually considering my partitioned by column is a datetime?
As explained by the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is easily possible by hash partitioning of the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this partitions only by month, not by year. Also, with only 6 partitions in this example, each partition holds two calendar months (for example, January and July share a partition).
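To see which months end up together, here is a short Python sketch of the mapping. It relies on plain HASH partitioning placing a row in partition MOD(expr, number_of_partitions) and on the default partition names p0, p1, and so on; the helper name is made up.

```python
def hash_partition(month, partitions=6):
    """Name of the partition a given MONTH() value lands in,
    using MOD(expr, partitions) as plain HASH partitioning does."""
    return f"p{month % partitions}"


# Group the twelve calendar months by the partition they fall into.
layout = {}
for month in range(1, 13):
    layout.setdefault(hash_partition(month), []).append(month)

for name in sorted(layout):
    print(name, layout[name])
```

The same two months from every year share a partition, so a query for a single month still reads rows from the other month and from all other years.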
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (p3); -- takes partition names, not expressions; hash partitions are named p0..p5 by default
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the "lazy" effect of partitioning by hash.
As the docs say:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (p3);
then start the PRIMARY KEY with the date column (called the_date below).
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
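If the month is chosen at run time, the half-open range can be computed in application code. The Python below is a minimal sketch with hypothetical helper names; a real application would pass the two dates as query parameters rather than formatting them into the SQL string.

```python
from datetime import date


def month_bounds(year, month):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def month_query(table, column, year, month):
    """Build a range query that an index starting with `column` can use."""
    start, end = month_bounds(year, month)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {column} >= '{start}' AND {column} < '{end}'"
    )


print(month_query("ti", "the_date", 2019, 3))
```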
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)
I want to partition a table in MySQL while preserving the table's structure.
I have a column, 'Year', based on which I want to split up the table into different tables for each year respectively. The new tables will have names like 'table_2012', 'table_2013' and so on. The resultant tables need to have all the fields exactly as in the source table.
I have tried the following two pieces of SQL script with no success:
1.
CREATE TABLE all_data_table
( column1 int default NULL,
column2 varchar(30) default NULL,
column3 date default NULL
) ENGINE=InnoDB
PARTITION BY RANGE ((year))
(
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2011) , PARTITION p2 VALUES LESS THAN (2012) ,
PARTITION p3 VALUES LESS THAN (2013), PARTITION p4 VALUES LESS THAN MAXVALUE
);
2.
ALTER TABLE all_data_table PARTITION BY RANGE COLUMNS (`year`) (
PARTITION p0 VALUES LESS THAN (2011),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Any assistance would be appreciated!
This is old, but seeing as it comes up highly ranked in partitioning searches, I figured I'd give some additional details for people who might hit this page. What you are talking about in having a table_2012 and table_2013 is not "MySQL Partitioning" but "Manual Partitioning".
Partitioning means that you have one "logical table" with a single table name, which--behind the scenes--is divided among multiple files. When you have millions to billions of rows, over years, but typically you are only searching a single month, partitioning by Year/Month can have a great performance benefit because MySQL only has to search against the file that contains the Year/Month that you are searching for...so long as you include the partition key in your WHERE.
When you create multiple tables like table_2012 and table_2013, you are MANUALLY partitioning the tables, which you don't do with the MySQL PARTITION configuration. To manually partition the tables, during 2012, you put all data into the 2012 table. When you hit 2013, you start putting all the data into the 2013 table. You have to make sure to create the table before you hit 2013 or it won't have any place to go. Then, when you query across the years (e.g. from Nov 2012 - Jan 2013), you have to do a UNION between table_2012 and table_2013.
SELECT * FROM table_2012 WHERE #...
UNION
SELECT * FROM table_2013 WHERE #...
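With manual partitioning, the application has to work out which yearly tables a date range touches. A rough Python sketch, with made-up function and parameter names, might look like this; it uses UNION ALL instead of UNION on the assumption that a row lives in exactly one yearly table, so no duplicates need removing.

```python
from datetime import date


def union_query(prefix, start, end, where):
    """Build one SELECT per yearly table covered by [start, end]
    and glue them together with UNION ALL."""
    selects = [
        f"SELECT * FROM {prefix}_{year} WHERE {where}"
        for year in range(start.year, end.year + 1)
    ]
    return "\nUNION ALL\n".join(selects)


# Nov 2012 to Jan 2013 spans two tables; the column name is hypothetical.
print(union_query("table", date(2012, 11, 1), date(2013, 1, 31),
                  "column3 BETWEEN '2012-11-01' AND '2013-01-31'"))
```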
With partitioning, this manual work is not necessary. You do the initial setup of the partitions, then you treat it as a single table. No unions required, no checking the date before you insert, etc. This makes life much easier. MySQL handles figuring out which partitions it needs to query. However, you MUST make sure to query against the Year column or it will have to scan ALL files. E.g. SELECT * FROM all_data_table WHERE Month=12 will scan all partitions for Month=12. To ensure you are only scanning the partition files that you need to scan, include the partition column in every query that you can.
Possible negatives to partitioning...if you have billions of rows and you do an ALTER TABLE on the table to--say--add a column...it's going to have to update every row taking a VERY long time. At the company I currently work for, the boss doesn't think it's worth the time it takes to update the billion rows historically when we are adding a new column for going forward...so this is one of the reasons we do manual partitioning instead of letting MySQL do it.
DISCLAIMER: I am not an expert at partitioning...so if I'm wrong in any of this, please let me know and I'll fix the incorrect parts.
From what I see, you want to create many tables from one big table.
I think you should try creating views instead.
From what I have read about partitioning, it splits the physical storage of the table and stores the pieces separately, but from the top-level perspective you still see a single table.
I want to keep the last 45 days of log data in a MySQL table for statistical reporting purposes. Each day could be 20-30 million rows. I'm planning on creating a flat file and using load data infile to get the data in there each day. Ideally I'd like to have each day in its own partition without having to write a script to create a partition every day.
Is there a way in MySQL to just say each day gets its own partition automatically?
thanks
I would strongly suggest using Redis or Cassandra rather than MySQL to store high traffic data such as logs. Then you could stream it all day long rather than doing daily imports.
You can read more on those two (and more) in this comparison of "NoSQL" databases.
If you insist on MySQL, I think the easiest would just be to create a new table per day, like logs_2011_01_13 and then load it all in there. It makes dropping older dates very easy and you could also easily move different tables on different servers.
er.., number them in Mod 45 with a composite key and cycle through them...
Seriously 1 table per day was a valid suggestion, and since it is static data I would create packed MyISAM, depending upon my host's ability to sort.
Building queries to union some or all of them would be only moderately challenging.
1 table per day, and partition those to improve load performance.
Yes, you can partition MySQL tables by date:
CREATE TABLE ExampleTable (
id INT AUTO_INCREMENT,
d DATE,
PRIMARY KEY (id, d)
) PARTITION BY RANGE COLUMNS(d) (
PARTITION p1 VALUES LESS THAN ('2014-01-01'),
PARTITION p2 VALUES LESS THAN ('2014-01-02'),
PARTITION pN VALUES LESS THAN (MAXVALUE)
);
Later, when you get close to overflowing into partition pN, you can split it:
ALTER TABLE ExampleTable REORGANIZE PARTITION pN INTO (
PARTITION p3 VALUES LESS THAN ('2014-01-03'),
PARTITION pN VALUES LESS THAN (MAXVALUE)
);
This doesn't automatically partition by date, but you can reorganize when you need to. Best to reorganize before you fill the last partition, so the operation will be quick.
I stumbled on this question while looking for something else and wanted to point out the MERGE storage engine (http://dev.mysql.com/doc/refman/5.7/en/merge-storage-engine.html).
A MERGE table is more or less a simple pointer to multiple tables, and can be redefined in seconds. For cycling logs, it can be very powerful! Here's what I'd do:
Create one table per day and use LOAD DATA, as the OP mentioned, to fill it. Once that is done, drop the MERGE table and recreate it, including the new table while omitting the oldest one. After that, I could delete or archive the old table. This would let me rapidly query a specific day, or all days, since both the original tables and the MERGE table are valid.
CREATE TABLE logs_day_46 LIKE logs_day_45; -- LIKE copies the MyISAM definition; it does not accept table options
DROP TABLE IF EXISTS logs;
CREATE TABLE logs LIKE logs_day_46;
ALTER TABLE logs ENGINE=MERGE UNION=(logs_day_2,[...],logs_day_46);
DROP TABLE logs_day_1;
Note that a MERGE table is not the same as a PARTITIONED one and offers its own advantages and drawbacks. But do remember that if you are trying to aggregate from all tables, it will be slower than if all the data were in a single table (the same is true for partitions, as they are basically different tables under the hood). If you are going to query mostly on specific days, you will need to choose the table yourself, whereas if partitions are defined on the day values, MySQL will automatically pick the correct partition(s), which might come out faster and easier to write.
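The daily rotation of the MERGE member list can be computed rather than typed out. The Python below is a sketch that assumes the logs_day_N naming from the example and a 45-table window; the function name is invented.

```python
def merge_rotation(newest_day, window=45, prefix="logs_day_"):
    """Return (member tables for the MERGE UNION list, table to drop)
    after the table for `newest_day` has been loaded."""
    oldest_kept = newest_day - window + 1
    members = [f"{prefix}{n}" for n in range(oldest_kept, newest_day + 1)]
    dropped = f"{prefix}{oldest_kept - 1}"
    return members, dropped


members, dropped = merge_rotation(46)
print("UNION=(" + ",".join(members) + ")")
print("DROP TABLE " + dropped + ";")
```

For day 46 this yields logs_day_2 through logs_day_46 as members and logs_day_1 as the table to drop, matching the statements above.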