I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually considering my partitioned by column is a datetime?
As explained by the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is easily possible by hash partitioning of the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this only partitions by month and not by year, also there are only 6 partitions (so 6 months) in this example.
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the "lazy" effect doing it partitioning by hash:
As docs says:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
then start the PRIMARY KEY with the_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)
Related
I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually considering my partitioned by column is a datetime?
As explained by the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is easily possible by hash partitioning of the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this only partitions by month and not by year, also there are only 6 partitions (so 6 months) in this example.
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the "lazy" effect doing it partitioning by hash:
As docs says:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
then start the PRIMARY KEY with the_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)
What is good approach to handle 3b rec table where concurrent read/write is very frequent within few days?
Linux server, running MySQL v8.0.15.
I have this table that will log device data history. The table need to retain its data for one year, possibly two years. The growth rate is very high: 8,175,000 rec/day (1mo=245m rec, 1y=2.98b rec). In the case of device number growing, the table is expected to be able to handle it.
The table read is frequent within last few days, more than a week then this frequency drop significantly.
There are multi concurrent connection to read and write on this table, and the target to r/w is quite close to each other, therefore deadlock / table lock happens but has been taken care of (retry, small transaction size).
I am using daily partitioning now, since reading is hardly spanning >1 partition. However there will be too many partition to retain 1 year data. Create or drop partition is on schedule with cron.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Insert are performed in size ~1500/batch:
INSERT INTO table1(group_id, DeviceId, DataTime, first_result)
VALUES(%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=values(first_result);
Select are mostly to get count by DataTime or DeviceId, targeting specific partition.
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the question:
Accord to RickJames blog, it is not a good idea to have >50 partitions in a table, but if partition is put monthly, there are 245m rec in one partition. What is the best partition range in use here? Does RJ's blog still taken place with current mysql version?
Is it a good idea to leave the table not partitioned? (the index is running well atm)
note: I have read this stack question, having multiple table is a pain, therefore if it is not necessary i wish not to break the table. Also, sharding is currently not possible.
First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc) because we need to discuss batching the input rows, even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1(group_id, DeviceId, DataTime, first_result)
ON DUPLICATE KEY UPDATE
last_log=NOW(), last_res=values(first_result)
SELECT group_id, DeviceId, DataTime, first_result
FROM tmp_table;
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
DeviceID
If the DeviceID is alway 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
Just for your reference:
I have a large table right now over 30B rows, grows 11M rows daily.
The table is innodb table and is not partitioned.
Data over 7 years is archived to file and purged from the table.
So if your performance is acceptable, partition is not necessary.
From management perspective, it is easier to manage the table with partitions, you might partition the data by week. It will 52 - 104 partitions if you keep last or 2 years data online
I am quite new in the subject of partitions and the necessity has arisen due to the great accumulation of data.
Well, basically it is an access control system, there are currently 20 departments and each department has approximately 100 users. The system records the date and time of the entries and exits (from_date / to_date) My intention is to divide by departments and then for a month throughout the year.
Plan:
Partition the table by [ dep_id and date (from_date and to_date) ]
Problem
I have the following table.
CREATE TABLE `employee` (
`employee_id` smallint(5) NOT NULL,
`dep_id` int(11) NOT NULL,
`from_date` int(11) NOT NULL,
`to_date` int(11) NOT NULL,
KEY `index1` (`employee_id`,`from_date`,`to_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have the dates (from_date and to_date) in UNIX_TIMESTAMP format (INT 11)
I am looking to divide it during all the months of the year.
it's possible?
Mysql - 5.7
It is possible to use range partitioning on an integer column.
Assuming my_int_col is unix-style integer seconds since 1970-01-01
we could achieve monthly partitions with something like this:
PARTITION BY RANGE (my_int_col)
( PARTITION p20180101 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-01-01 00:00') )
, PARTITION p20180201 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-02-01 00:00') )
, PARTITION p20180301 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-03-01 00:00') )
, PARTITION p20180401 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-04-01 00:00') )
, PARTITION p20180501 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-05-01 00:00') )
, PARTITION p20180601 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-06-01 00:00') )
Be careful of the time_zone setting of the session. Those date literals will be interpreted as values in the current time_zone... e.g. if you want those to be UTC datetime, time_zone should be +00:00.
Or, replace the UNIX_TIMESTAMP() expression with a literal integer value... that's what MySQL is going to do with the UNIX_TIMESTAMP() expressions.
Obviously, you can name the partitions whatever you want.
Note: applying partitioning to an existing table will require MySQL to create an entire copy of the table, holding an exclusive lock on the original table while the operation completes. So you will need sufficient storage (disk) space, and a window of time for the operation to complete.
It's possible to create a new table that is partitioned, and then copy the older data a chunk at a time. But make the chunks reasonably sized, to avoid ballooning the ibdata1 with large transactions. And then do some RENAME TABLE statements to move the old table out, and move the new table in.
Some caveats to note with partitioned tables: there's no foreign key support, and there's no guarantee that partitioned table will give better DML performance than a non-partitioned table.
Strategic indexes and carefully planned queries is the key to performance with "very large" tables. And this is true with partitioned tables as well.
Partitioning isn't a magic bullet for performance problems that some novices would like it to be.
As far as creating subpartitions within partitions, I wouldn't recommend it.
I couldn't find an example like mine, so here's the thing:
I have a big data set that I need to aggregate on top of.
We're talking about ~ %500M rows with a date field ranging from 2y ago until now.
My first instinct was to partition the table by this field (creating a partition on the date field), which leaves roughly 20M rows per partition.
Then I have indexes on the other fields I will aggregate/group by.
Here's my table definition (simplified for brevity sake):
create table t1(
date_field datetime not null,
additional_id int not null,
category_id int not null,
value_field1 double,
value_field2 double,
primary key(additional_id,date_field)
)
ENGINE=InnoDB
PARTITION BY RANGE(YEAR(date_field)*100 + MONTH(date_field)) (
PARTITION p_201411 VALUES LESS THAN (201411),
PARTITION p_201412 VALUES LESS THAN (201412),
#all the partitions until the current month...
PARTITION p_201610 VALUES LESS THAN (201610),
PARTITION p_201611 VALUES LESS THAN (201610),
PARTITION p_catchall VALUES LESS THAN MAXVALUE );
If I execute a query that gets a date directly, only the partition for the month is used, based on the output of explain partitions on top of a query such as the following one:
select value_field1 where additional_id=x and date_field='2014-11-05'
However, if I use a date range (even if inside the same partition), all partitions are scanned
select value_field1 where additional_id=x and date_field> '2014-11-05' and date_field <'2014-11-10'
(Same result if I use between).
What am I missing here? Is this really the right way to partition this table?
Thanks in advance
Short answer: Do not use complex expressions for PARTITION BY RANGE.
Long answer: (Aside from criticizing the implementation of BY RANGE with range queries.)
Instead, do this:
PARTITION BY RANGE (TO_DAYS(date_field)) (
PARTITION p_201411 VALUES LESS THAN (TO_DAYS('2014-11-01')),
...
PARTITION p_catchall VALUES LESS THAN MAXVALUE ); -- unchanged
Newer versions of MySQL have slightly more friendly expressions you can use.
If this is your typical query:
additional_id=x and date_field> '2014-11-05'
and date_field <'2014-11-10'
then partitioning is no faster than the equivalent non-partitioned table. You even have the perfect index for the non-partitioned version.
If, on the other hand, you are DROPping old partitions when they 'expire', the PARTITIONing is excellent.
25 partitions is good.
More discussion .
A side note: additional_id int is limited to 2 billion, so you are 1/4 of the way to overflowing. INT UNSIGNED would get you to 4 billion; you might consider an ALTER. (Of course, I don't know whether additional_id is unique in this table; so maybe it is not an issue.)
(MySQL version: 5.6.15)
I have a huge table (Table_A) with 10M rows, in entity-attribute-value model.
It has a compound unique key [Field_A + Element + DataTime].
CREATE TABLE TABLE_A
(
`Field_A` varchar(5) NOT NULL,
`Element` varchar(5) NOT NULL,
`DataTime` datetime NOT NULL,
`Value` decimal(10,2) DEFAULT NULL,
UNIQUE KEY `A_ELE_TIME` (`Field_A`,`Element`,`DataTime`),
KEY `DATATIME` (`DataTime`),
KEY `ELEID` (`ELEID`),
KEY `ELE_TIME` (`ELEID`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Rows are inserted/updated to the table every minutes, hence the row size of each [DataTime] (i.e. every minute) is regular, around 3K rows.
I have a "select" query from this table, after the above "inserted/updated".
The query selects one specified elements within most recent 25 hours (around 30K rows). This query usually processes within 3 sec.
SELECT
Field_A, Element, DataTime, `Value`
FROM
Table_A
WHERE
Element="XX"
AND DataTime between [time] and [time].
The original housekeeping would be remove any row after 3 days, every 5 minutes.
For better housekeeping, I try to partition the table base on [DataTime], every 6 hours. (00,06,12,18 local time)
PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime))
(PARTITION p2014103112 VALUES LESS THAN (73590212) ENGINE = InnoDB,
...
PARTITION p2014110506 VALUES LESS THAN (73590706) ENGINE = InnoDB,
PARTITION pFuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
My housekeeping script will drop the expired partition then create a new one
ALTER TABLE TABLE_A REORGANIZE PARTITION pFuture INTO (
PARTITION [new_partition_name] VALUES LESS THAN ([bound_value]),
PARTITION pFuture VALUES LESS THAN MAXVALUE
)
The new process seems running smoothly.
However, the SELECT query would slow down suddenly (> 100 sec).
The query is still slow even all process stopped. It won't be fixed until "analyzing partitions" (reads and stores the key distributions for partitions).
It usually happens ones a day.
It does not happen to a non-partitioned table.
Therefore, we think it is caused by corrupted indexing in a partitioned MySQL (huge) table.
Does anyone have any idea on how to solve it?
Many Thanks!!
If you PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime)), when you filter datetime with between [from] and [to] operation, mysql will scan all partitions unless [from] equals [to].
So it's reasonable that your query slow down suddenly.
My suggestion is partition using TO_DAYS(DataTime) without hour, if you query recent 25 hours data, it will scan up to 2 partitions only.
I'm not good at MySql, and I couldn't explain it, wish other smart guys can explain it further. But you could using EXPLAIN PARTITIONS to prove it. And here is the Sql Fiddle Demo.