I'm just experimenting a bit with partitions with some dummy data, and am not having any luck optimizing my queries so far.
I downloaded a dataset from the Internet, which consists of a single table of measurements:
CREATE TABLE `partitioned_measures` (
`measure_timestamp` datetime NOT NULL,
`station_name` varchar(255) DEFAULT NULL,
`wind_mtsperhour` int(11) NOT NULL,
`windgust_mtsperhour` int(11) NOT NULL,
`windangle` int(3) NOT NULL,
`rain_mm` decimal(5,2) DEFAULT NULL,
`temperature_dht11` int(5) DEFAULT NULL,
`humidity_dht11` int(5) DEFAULT NULL,
`barometric_pressure` decimal(10,2) NOT NULL,
`barometric_temperature` decimal(10,0) NOT NULL,
`lux` decimal(7,2) DEFAULT NULL,
`is_plugged` tinyint(1) DEFAULT NULL,
`battery_level` int(3) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(measure_timestamp))
(PARTITION `slow` VALUES LESS THAN (736634) ENGINE = InnoDB,
PARTITION `fast` VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Just as a learning exercise I wanted to try to partition the measurements by measure_timestamp (without help of indexing). Specifically, I thought it would be interesting to try and put the most recent month in a partition by itself. (I understand that it's best to have equally-sized partitions, but I just wanted to experiment)
I used the following command to add the partition (Note that the dataset ends in Dec of 2016, and the vast majority of the datapoints are in prior months):
ALTER TABLE partitioned_measures
PARTITION BY RANGE(TO_DAYS(measure_timestamp)) (
PARTITION slow VALUES LESS THAN(TO_DAYS('2016-12-01')),
PARTITION fast VALUES LESS THAN (MAXVALUE)
);
To query, I'm looking at all entries from the 2nd and onward (just to be sure that I'm only looking in the latest partition):
select SQL_NO_CACHE COUNT(*) FROM partitioned_measures
WHERE measure_timestamp >= '2016-12-02'
AND DAYOFWEEK(measure_timestamp) = 1;
When I add an EXPLAIN to the front of that, I get the following:
+----+-------------+----------------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | partitioned_measures | slow,fast | ALL | NULL | NULL | NULL | NULL | 1835458 | 33.33 | Using where |
+----+-------------+----------------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
But the query time is about the same as it was before the partition (~1.6 seconds). I've never used partitions before so I feel like there's something conceptual that I'm missing.
Tricky, but I found a working solution, or should I say a workaround; it seems to be a MySQL bug?
ALTER TABLE partitioned_measures
PARTITION BY RANGE COLUMNS(measure_timestamp) (
PARTITION slow VALUES LESS THAN('2016-12-01'),
PARTITION fast VALUES LESS THAN(MAXVALUE)
);
See this demo, which does use partition pruning correctly.
I noticed that syntax here.
I still find it weird that partition pruning does not work correctly with
ALTER TABLE partitioned_measures
PARTITION BY RANGE(TO_DAYS(measure_timestamp)) (
PARTITION slow VALUES LESS THAN(TO_DAYS('2016-12-01')),
PARTITION fast VALUES LESS THAN (MAXVALUE)
);
MySQL 5.7 should be able to do partition pruning with TO_DAYS() just fine:
Pruning can also be applied for tables partitioned on a DATE or
DATETIME column when the partitioning expression uses the YEAR() or
TO_DAYS() function. In addition, in MySQL 5.7
source
See this demo, which does not use partition pruning correctly. I've tried a lot to get it working; every method I could think of failed.
The explanation:
It did do the pruning you requested, but it added the first partition. Why? Because that is where bad dates are put.
The workaround is to have a bogus first partition:
/*!50100 PARTITION BY RANGE (TO_DAYS(measure_timestamp))
(PARTITION bogus VALUES LESS THAN (0) ENGINE = InnoDB, -- any small value
PARTITION `slow` VALUES LESS THAN (736634) ENGINE = InnoDB,
PARTITION `fast` VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Reference is buried in https://dev.mysql.com/doc/refman/5.7/en/partitioning-handling-nulls.html
If you had more than a trivial number of partitions, it might have been more obvious that it picked the desired partition, plus always the first.
With rare exceptions, partitioning does not provide better performance than you can get from a non-partitioned table with a suitable index. In this case, INDEX(measure_timestamp). (Or a virtual column with INDEX(dow, measure_timestamp).)
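As a sketch of that suggestion (assuming MySQL 5.7 generated-column syntax on a non-partitioned copy of the table; the column and index names here are mine):

```sql
-- Hypothetical: a virtual day-of-week column plus a composite index,
-- so the DAYOFWEEK() filter becomes an indexable equality test.
ALTER TABLE partitioned_measures
  ADD COLUMN dow TINYINT AS (DAYOFWEEK(measure_timestamp)) VIRTUAL,
  ADD INDEX idx_dow_ts (dow, measure_timestamp);

-- The query can then filter on the indexed virtual column:
SELECT SQL_NO_CACHE COUNT(*) FROM partitioned_measures
WHERE dow = 1
  AND measure_timestamp >= '2016-12-02';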
Related
I have a table that contains a month and a year column.
I have a query which usually looks something like WHERE month=1 AND year=2022
Given how large this table is, I would like to make it more efficient using partitions and subpartitions.
table 1
Querying the data I need took around 2 minutes and 30 seconds.
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
)
Partitioning by "month"
Querying the data I need took around 21 seconds (a big improvement).
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (`month`)
(PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB) */
I would like to see if I can improve the performance even further by partitioning by year and then subpartitioning by month. How can I do that?
I'm not sure the question Partition by year and sub-partition by month mysql is relevant, since it has no accepted answer and looks to be particular to MySQL 5.* and PHP. I'm asking about MySQL 8; are there no changes since then regarding partitioning/subpartitioning/LIST COLUMNS/RANGE COLUMNS etc. that could help me?
The broader query I'm making:
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12 AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4') # only ever 4 types; usually all 4 are present in the query
To answer your question directly, below is example syntax that accomplishes the subpartitioning. Notice the PRIMARY KEY must include all columns used for partitioning or subpartitioning. Read the manual on subpartitioning for more information: https://dev.mysql.com/doc/refman/8.0/en/partitioning-subpartitions.html
Schema (MySQL v8.0)
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`, `year`),
KEY `idx_month_year` (`month`,`year`, `score`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY LIST (`month`)
SUBPARTITION BY HASH(`year`)
SUBPARTITIONS 10 (
PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB
);
Using EXPLAIN on your query reveals that the query references only one subpartition.
Query #1
EXPLAIN
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12
AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4');
+----+-------------+---------+------------+-------+----------------+----------------+---------+------+------+----------+-----------------------+
| id | select_type | table   | partitions | type  | possible_keys  | key            | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+---------+------------+-------+----------------+----------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | table_1 | p12_p12sp2 | range | idx_month_year | idx_month_year | 11      |      | 1    | 100      | Using index condition |
+----+-------------+---------+------------+-------+----------------+----------------+---------+------+------+----------+-----------------------+
The partitions field of the EXPLAIN shows that it accesses only partition p12_p12sp2. The year the query references, 2022, modulo the number of subpartitions, 10, maps to subpartition 2.
In addition to the partitioning by month and year, it is also helpful to use an index. In this case, I added score to the index so it would filter out rows where score <= 0. The note in the EXPLAIN "Using index condition" shows that it is delegating further filtering on entity_type to the storage engine. Though in your example, you said there are only four values for entity type, and all four are selected, so that condition won't filter out any rows anyway.
View on DB Fiddle
Re your questions in comments below:
a little bit confused on SUBPARTITIONS 10, why 10?
It's just an example. You can choose a different number of subpartitions. Whatever you feel is required to reduce the search as much as you want.
To be honest, I've never encountered a situation that required subpartitioning at all, if the search is also optimized with indexes. So I have no guidance on what is an appropriate number of subpartitions.
It's your responsibility to test performance until you are satisfied.
also a bit confused on the partition name p12_p12sp2, how do I know it selected the partition with year 2022 from looking at that?
The query has a condition year = 2022.
There are 10 subpartitions in my example.
Hash partitioning just uses the integer value to be partitioned, modulus the number of partitions.
2022 modulus 10 is 2. Hence the partition ending in ...sp2 is the one used.
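The arithmetic can be checked directly in MySQL (the subpartition count of 10 matches the example schema above):

```sql
-- HASH subpartitioning takes the partitioning expression's integer value
-- modulo the number of subpartitions.
SELECT 2022 MOD 10 AS subpartition;
-- Returns 2, which matches the ...sp2 suffix in p12_p12sp2.
```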
I also came across this anothermysqldba.blogspot.com/2014/12/… do you know how yours differs from what is shown here (bear in mind that blog is from 2014)?
They chose to name the subpartitions. There's no need to do that.
would there be any performance difference in having a single date, e.g. (2022-12-21), instead of separate columns month and year?
That depends on the query, and I'll leave it to you to test. Any predictions I make won't be accurate with your data on your server.
i can also see that you partition by month and subpartition by year, as oppose to partition by year and subpartition by month. can you explain the reasoning?
Subpartitioning works only if the outer partitions are LIST or RANGE partitions, and the subpartitions are HASH or KEY partitions. This is in the manual page I linked to.
There are a finite number of months (12). This makes it easy to partition by LIST as you did. You won't ever need more partitions. If you had partitioned by YEAR as the outer partition, you would have needed to specify year values in the list, and this is a growing set, so you would periodically have to alter the table to extend the list or range to account for new years.
Whereas when partitioning by HASH for the subpartitioning, the new year values are mapped into the finite set of subpartitions, so it's okay that it's not a finite list. You won't have to alter table to repartition (unless you want to change the number of subpartitions).
Splitting a date into columns is usually counterproductive. It is much easier to split during SELECT.
PARTITIONing is usually useless for performance of any SELECT.
When partitioning (or unpartitioning), the indexes usually need changing.
For that query, I recommend a combined date column,
WHERE date >= '2022-01-01'
AND date < '2022-01-01' + INTERVAL 1 MONTH
and some INDEX starting with date.
(You probably have other queries; let's see some of them; they may need a different index.)
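A minimal sketch of that suggestion (the combined column name `date` and the index columns here are illustrative, not prescriptive):

```sql
-- Hypothetical combined date column, plus an index starting with it
ALTER TABLE table_1
  ADD COLUMN `date` DATE NOT NULL,
  ADD INDEX idx_date_type_score (`date`, entity_type, score);

-- The month/year equality test becomes a sargable date range:
SELECT entity_id, entity_type, score
FROM table_1
WHERE `date` >= '2022-12-01'
  AND `date` <  '2022-12-01' + INTERVAL 1 MONTH;
```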
Covering index -- This is an index that contains all the columns found anywhere in the SELECT. It may be better (faster) than having only the columns needed for WHERE or WHERE + GROUP BY + ORDER BY. It depends on a lot of variables.
Order of columns in an index (or PK): The leftmost column(s) have priority. That is the order of the index rows on disk. PK(id, date) is useful if looking up by id (in the WHERE), but not if you are just searching by date.
Sargable -- Hiding a column in a function disables the use of an index. That is, MONTH(date) cannot use INDEX(date).
Blogs -- Index Cookbook and Partition
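The sargable point above can be illustrated like this (assuming a combined `date` column; names are illustrative):

```sql
-- Not sargable: the functions hide the column, so INDEX(`date`) is unusable
SELECT COUNT(*) FROM table_1
WHERE MONTH(`date`) = 12 AND YEAR(`date`) = 2022;

-- Sargable: a range on the bare column can use INDEX(`date`)
SELECT COUNT(*) FROM table_1
WHERE `date` >= '2022-12-01'
  AND `date` <  '2022-12-01' + INTERVAL 1 MONTH;
```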
Test plan
I recommend you time all your queries against a variety of Create Tables.
For the WHERE clause:
The order of ANDs does not matter.
When using IN, a single value is equivalent to = and optimizes better. Multiple values may optimize more poorly. As Bill hints at, when the IN list contains all the options, you should eliminate the clause, since the Optimizer is not smart enough. So be sure to test with 1 and/or many items, so as to be realistic to your app.
For the table
Try Partition BY year + Subpartition by month.
Try Partition by a column that is the combination of year and month.
Try without partitioning.
For indexes
Order of the columns (in a composite index) does matter, so try different orderings.
When partitioning, be sure to tack onto the end of the PK the partition key(s).
A partitioned table needs different indexes than a non-partitioned table. That is, what works well for one may work poorly for the other.
Simply use something like this pattern to test various layouts:
CREATE TABLE (( a new layout with or without partitioning and with indexes ))
INSERT INTO test_table SELECT ... FROM real_table;
Change the "..." to adapt to any extra/missing columns in test_table
SELECT ...
Run various 'real' queries
Run each query twice (caching sometimes messes with the timing)
Report the results -- If you provide sufficient info (CREATE TABLE and SELECT), I may have suggestions on further speeding up the test (whether it is partitioned or not).
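Concretely, the pattern might look like this (one candidate layout shown; table, partition, and boundary values are placeholders to adapt):

```sql
-- 1. Create a candidate layout to test
--    (example: RANGE on a combined year*100+month value)
CREATE TABLE test_table LIKE real_table;
ALTER TABLE test_table
  PARTITION BY RANGE (`year` * 100 + `month`) (
    PARTITION p2022 VALUES LESS THAN (202301),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );

-- 2. Copy the data in (adapt the column list if the layouts differ)
INSERT INTO test_table SELECT * FROM real_table;

-- 3. Run each real query twice, recording the second timing
SELECT SQL_NO_CACHE COUNT(*) FROM test_table
WHERE `month` = 12 AND `year` = 2022;
```

Note that RANGE partitioning on an expression of `year` and `month` requires both columns to be in the PRIMARY KEY, as discussed above.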
I created 8 KEY partitions, but the partitions' row counts are not flat.
The row counts follow a pattern: partitions p0, p2, p4, p6 hold 99.98% of the rows, and p1, p3, p5, p7 hold 0.02%.
I want to fix this, so I wonder how MySQL determines the target partition when executing a SELECT statement.
Or, is there any better solution that can flatten this partition?
The MySQL version is 5.7.
Thanks.
Edit: I know that KEY partitioning works with MD5() and modulo, but I want to know how MySQL ACTUALLY calculates it.
Edit:
Schema
CREATE TABLE `WD` (
`dId` varchar(120) NOT NULL,
`wId` varchar(120) NOT NULL,
`createdAt` datetime NOT NULL,
`updatedAt` datetime NOT NULL,
PRIMARY KEY (`wId`,`dId`),
KEY `idx_WD_w_d` (`wId`,`dId`),
KEY `idx_WD_d_w` (`dId`,`wId`),
KEY `idx_WD_w_u` (`wId`,`updatedAt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (wId)
PARTITIONS 11 */
CREATE TABLE `DA` (
`id` varchar(120) NOT NULL,
`wId` varchar(120) NOT NULL,
`subject` varchar(180) NOT NULL,
`dId` varchar(120) NOT NULL,
`createdAt` datetime NOT NULL,
`updatedAt` datetime NOT NULL,
PRIMARY KEY (`id`,`wId`),
KEY `idx_DA_w_s_d` (`wId`,`subject`,`dId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (wId)
PARTITIONS 11 */
Explain:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE WD p1 ALL PRIMARY,idx_WD_w_d,idx_WD_d_w,idx_WD_w_u NULL NULL NULL 1 100.00 Using where; Using filesort
1 SIMPLE DA p1 ref idx_documentAcl_w_s_d idx_documentAcl_w_s_d 1266 const,const,DocumentService.WD.documentId 1 100.00 Using index
The computation for picking the partition is quite lame, alas. It is a simple modulo the number of partitions.
Key partitioning, in my opinion, is useless. I know of no case where it helps performance, nor anything else.
Please provide the main queries that you will use with this table; I will explain how to make optimal indexes without using partitioning. Or, in the rare case that partitioning is useful, I will explain what to do instead of what you are trying.
A particular query
Reformulating that query this way may help with performance.
SELECT WD.*
FROM WD
JOIN DA ON WD.did = DA.did
WHERE WD.wid = '...'
AND DA.wid = '...'
AND DA.subject = '...'
ORDER BY WD.updatedAt DESC -- (per Comment)
LIMIT 50;
And have these composite indexes, most of which you already have:
WD: INDEX(wid, did)
WD: INDEX(did, wid)
WD: INDEX(wid, updatedAt)
DA: INDEX(wid, subject, did)
Be aware that UUIDs do not scale well.
Meanwhile, I see no performance benefit in Partitioning since the indexes should work quite well.
One more thing. A LIMIT without an ORDER BY gives you some random set of rows. Note that adding an ORDER BY is likely to alter my advice on indexing.
You mentioned UUIDs -- does that mean you expect them to be Unique? If so, do you really need DA.id? (There may be a benefit to changing the PK of DA.)
I want to partition a very large table. As the business is growing, partitioning by date isn't really that good because each year the partitions get bigger and bigger. What I'd really like is a partition for every 10 million records.
The MySQL manual shows this simple example:
CREATE TABLE employees (
id INT NOT NULL,
fname VARCHAR(30),
lname VARCHAR(30),
hired DATE NOT NULL DEFAULT '1970-01-01',
separated DATE NOT NULL DEFAULT '9999-12-31',
job_code INT NOT NULL,
store_id INT NOT NULL
)
PARTITION BY RANGE (store_id) (
PARTITION p0 VALUES LESS THAN (6),
PARTITION p1 VALUES LESS THAN (11),
PARTITION p2 VALUES LESS THAN (16),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
But this means that everything from 16 up to MAXVALUE gets thrown into the last partition. Is there a way to auto-generate a new partition every interval (in my case, every 10 million records) so I won't have to keep modifying an active database? I am running MySQL 5.5.
Thanks!
EDIT: Here is my actual table
CREATE TABLE `my_table` (
`row_id` int(11) NOT NULL AUTO_INCREMENT,
`filename` varchar(50) DEFAULT NULL,
`timestamp` datetime DEFAULT NULL,
`unit_num` int(3) DEFAULT NULL,
`string` int(3) DEFAULT NULL,
`voltage` float(6,4) DEFAULT NULL,
`impedance` float(6,4) DEFAULT NULL,
`amb` float(6,2) DEFAULT NULL,
`ripple_v` float(8,6) DEFAULT NULL,
PRIMARY KEY (`row_id`),
UNIQUE KEY `timestamp` (`timestamp`,`filename`,`string`,`unit_num`),
KEY `index1` (`filename`),
KEY `index2` (`timestamp`),
KEY `index3` (`timestamp`,`filename`,`string`),
KEY `index4` (`filename`,`unit_num`)
) ENGINE=MyISAM AUTO_INCREMENT=690892041 DEFAULT CHARSET=latin1
and an example query for the graph is...
SELECT DATE_FORMAT(timestamp,'%Y/%m/%d %H:%i:%s') as mytime,voltage,impedance,amb,ripple_v,unit_num
FROM my_table WHERE timestamp >= DATE_SUB('2015-07-31 00:05:59', INTERVAL 90 DAY)
AND filename = 'dlrphx10s320upsab3' and unit_num='5' and string='2' ORDER BY timestamp asc;
Here is the explain for the query...
mysql> explain SELECT DATE_FORMAT(timestamp,'%Y/%m/%d %H:%i:%s') as mytime,voltage,impedance,amb,ripple_v,unit_num FROM my_table WHERE timestamp >= DATE_SUB('2015-07-31 00:05:59', INTERVAL 90 DAY) AND filename = 'dlrphx10s320upsab3' and unit_num='5' and string='2'ORDER BY timestamp asc;
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
| 1 | SIMPLE | unit_tarma | ref | timestamp,index3,index4 | index4 | 58 | const,const | 13440 | Using index condition; Using where; Using filesort |
+----+-------------+------------+------+-------------------------+--------+---------+-------------+-------+----------------------------------------------------+
(This answer is directed at the schema and SELECT.)
Since you anticipate millions of rows, first I want to point out some improvements to the schema.
FLOAT(m,n) is usually the 'wrong' thing to do because it leads to two roundings. Either use plain FLOAT (which seems 'right' for metrics like voltage) or use DECIMAL(m,n). FLOAT is 4 bytes; in the cases given, DECIMAL would be 3 or 4 bytes.
When you have both INDEX(a) and INDEX(a,b), the former is unnecessary since the latter can cover for such. You have 3 unnecessary KEYs. This slows down INSERTs.
INT(3) -- Are you saying a "3-digit number"? If so consider TINYINT UNSIGNED (values 0..255) for 1 byte instead of INT for 4 bytes. This will save many MB of disk space, hence speed. (See also SMALLINT, etc, and SIGNED or UNSIGNED.)
If filename is repeated a lot, you may want to "normalize" it. This would save many MB.
Use NOT NULL unless you need NULL for something.
AUTO_INCREMENT=690892041 implies that you are about 1/3 of the way to disaster with row_id, which will top out at about 2 billion. Do you use row_id for anything? Getting rid of the column would avoid the issue; and change the UNIQUE KEY to PRIMARY KEY. (If you do need row_id, let's talk further.)
ENGINE=MyISAM -- Switching has some ramifications, both favorable and unfavorable. The table would become 2-3 times as big. The 'right' choice of PRIMARY KEY would further speed up this SELECT significantly. (And may or may not slow down other SELECTs.)
A note on the SELECT: Since string and unit_num are constants in the query, the last two fields of ORDER BY timestamp asc, string asc, unit_num asc are unnecessary. If they are relevant for reasons not apparent in the SELECT, then my advice may be incomplete.
This
WHERE filename = 'foobar'
AND unit_num='40'
AND string='2'
AND timestamp >= ...
is optimally handled by INDEX(filename, unit_num, string, timestamp). The order of the columns is not important except that timestamp needs to be last. Rearranging the current UNIQUE key will give you the optimal index. (Meanwhile, none of the existing indexes is very good for this SELECT.) Making it the PRIMARY KEY and the table InnoDB would make it even faster.
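As a sketch of building that index on the existing table (the index name is mine):

```sql
-- Equality columns first, the range column (timestamp) last,
-- mirroring the existing UNIQUE key's columns in a better order.
ALTER TABLE my_table
  ADD INDEX idx_file_unit_string_ts (filename, unit_num, string, `timestamp`);
```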
Partitioning? No advantage. Not for performance; not for anything else you have mentioned. A common use for partitioning is for purging 'old'. If you intend to do such, let's talk further.
In huge tables it is best to look at all the important SELECTs simultaneously so that we don't speed up one while demolishing the speed of others. It may even turn out that partitioning helps in this kind of tradeoff.
First, I must ask what benefit Partitioning gives you? Is there some query that runs faster because of it?
There is no auto-partitioning.
Instead, you should have a job that runs every day and it counts the number of rows in the 'last active' partition to see if it is about 10M. If so, add another partition.
I recommend keeping the "last" partition (the one with MAXVALUE) empty. That way you can REORGANIZE PARTITION to split it into two empty partitions with essentially zero overhead. And I recommend that instead of ADD PARTITION because you might slip up and put something in the last partition.
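A sketch of that split, assuming the empty last partition is named `pmax` and the boundary value is illustrative:

```sql
-- Split the empty MAXVALUE partition into a new bounded partition
-- plus a fresh, still-empty MAXVALUE partition. Because pmax holds
-- no rows, no data needs to be moved.
ALTER TABLE my_table
  REORGANIZE PARTITION pmax INTO (
    PARTITION p700m VALUES LESS THAN (700000000),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );
```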
It is unclear what will trigger 10M rows. Are there multiple rows for each store_id? And are there new rows coming in for each store? If so, partitioning on store_id is problematic, since all partitions will be growing all the time.
OK, so store_id was just a lame example from the reference manual. Please provide SHOW CREATE TABLE so we can talk concrete, not hand-waving. There are simply too many ways to take this task.
What is the activity?
If you mostly hit the "recent" partition(s), then an uneven distribution may be warranted -- periodically add a new partition and combine an adjacent pair of old partitions. (I did this successfully in one system.)
If you will be purging "old" data, then clearly you need to use PARTITION BY RANGE(TO_DAYS(...)) and use DROP PARTITION plus REORGANIZE PARTITION.
And there are lots of other scenarios. But I know of only 4 scenarios where Partitioning provides any performance benefit. See my blog.
I want to create a partitioned table which is going to be filled with hundreds of millions of records. Using partitioning how can I have a particular day's records go into one partition, then the next day's in another, etc.. Then after ninety odd days I can delete old data from the oldest partition.
I tried this declaration (the hash function uses a modulo against the number of partitions to calculate which partition gets the data). This should ensure each day uses a different one of the 92 partitions; except it doesn't work.
CREATE TABLE records(
id INT NOT NULL AUTO_INCREMENT,
dt DATETIME,
PRIMARY KEY (id)
)
PARTITION BY HASH((MOD(DAYOFYEAR(dt), 92) + 92))
PARTITIONS 92;
The problem with the above snippet is that the column used in the hash expression has to be a unique key within the table.
How can I fix this so that I have ninety(ish) rotating partitions based on each day's records?
If I simply add the dt column to the primary key, it seems to hit all the partitions when I select a date range, which is not what I want.
Any ideas?
The reason is that to partition on a date field and query by range you must either use YEAR() or TO_DAYS() in the partition expression.
Partitioning like this works as expected:
CREATE TABLE `alert` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`eventId` int(10) unsigned NOT NULL,
`occurred` datetime NOT NULL,
KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
/*!50100 PARTITION BY RANGE (TO_DAYS(occurred))
(PARTITION 28_06 VALUES LESS THAN (735413) ENGINE = InnoDB,
PARTITION 29_06 VALUES LESS THAN (735414) ENGINE = InnoDB,
PARTITION 30_06 VALUES LESS THAN (735415) ENGINE = InnoDB,
PARTITION 01_07 VALUES LESS THAN (735416) ENGINE = InnoDB,
PARTITION 02_07 VALUES LESS THAN (735417) ENGINE = InnoDB,
PARTITION 03_07 VALUES LESS THAN (735418) ENGINE = InnoDB,
PARTITION 04_07 VALUES LESS THAN (735419) ENGINE = InnoDB,
PARTITION 05_07 VALUES LESS THAN (735420) ENGINE = InnoDB,
PARTITION 06_07 VALUES LESS THAN (735421) ENGINE = InnoDB,
PARTITION 07_07 VALUES LESS THAN (735422) ENGINE = InnoDB) */
mysql> explain partitions SELECT * FROM alert WHERE occurred >= '2013-07-02' and occurred <= '2013-07-04';
+----+-------------+-------+-------------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | alert | 02_07,03_07,04_07 | ALL | NULL | NULL | NULL | NULL | 3 | Using where |
+----+-------------+-------+-------------------+------+---------------+------+---------+------+------+-------------+
Then you need to manage dropping and creating of the partition yourself.
Actually, the problem is that you can't define a PRIMARY or UNIQUE key on a partitioned table, if all the columns in the key are not included in the hash function.
One possible "fix" would be to remove the "PRIMARY" keyword from the KEY definition.
The problem is that MySQL has to enforce uniqueness when you declare a key to be UNIQUE or PRIMARY. And in order to enforce that, MySQL needs to be able to check whether the key value already exists. Instead of checking every partition, MySQL uses the partitioning function to determine the partition where a particular key would be found.
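The rule can be seen directly with a pair of definitions adapted from the question's table (a sketch; the exact error text is from MySQL's error 1503):

```sql
-- Fails: PRIMARY KEY (id) does not include dt, so MySQL cannot
-- verify uniqueness by checking only one partition.
CREATE TABLE records (
  id INT NOT NULL AUTO_INCREMENT,
  dt DATETIME,
  PRIMARY KEY (id)
)
PARTITION BY HASH (MOD(DAYOFYEAR(dt), 92))
PARTITIONS 92;
-- ERROR 1503 (HY000): A PRIMARY KEY must include all columns
-- in the table's partitioning function

-- Accepted: a plain (non-unique) KEY gives MySQL no uniqueness
-- to enforce, so the restriction does not apply.
CREATE TABLE records (
  id INT NOT NULL AUTO_INCREMENT,
  dt DATETIME,
  KEY (id)
)
PARTITION BY HASH (MOD(DAYOFYEAR(dt), 92))
PARTITIONS 92;
```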
I need to optimize a MYSQL query doing an order by. No matter what I do, mysql ends up doing a filesort instead of using the index.
Here's my table DDL... (Yes, in this case the DAYSTAMP and TIMESTAMP columns are exactly the same.)
CREATE TABLE DB_PROBE.TBL_PROBE_DAILY (
DAYSTAMP date NOT NULL,
TIMESTAMP date NOT NULL,
SOURCE_ADDR varchar(64) NOT NULL,
SOURCE_PORT int(10) NOT NULL,
DEST_ADDR varchar(64) NOT NULL,
DEST_PORT int(10) NOT NULL,
PACKET_COUNT int(20) NOT NULL,
BYTES int(20) NOT NULL,
UNIQUE KEY IDX_TBL_PROBE_DAILY_05 (DAYSTAMP,SOURCE_ADDR(16),SOURCE_PORT,
DEST_ADDR(16),DEST_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_01 (SOURCE_ADDR(16),TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_02 (DEST_ADDR(16),TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_03 (SOURCE_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_04 (DEST_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_06 (DAYSTAMP,TIMESTAMP,BYTES)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (to_days(DAYSTAMP))
(PARTITION TBL_PROBE_DAILY_P20100303 VALUES LESS THAN (734200) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100304 VALUES LESS THAN (734201) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100305 VALUES LESS THAN (734202) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100306 VALUES LESS THAN (734203) ENGINE = InnoDB) */;
The partitions are daily and I've added IDX_TBL_PROBE_DAILY_06 especially for the query I'm trying to get working, which is:
select SOURCE_ADDR as 'Source_IP',
SOURCE_PORT as 'Source_Port',
DEST_ADDR as 'Destination_IP',
DEST_PORT as 'Destination_Port',
BYTES
from TBL_PROBE_DAILY
where DAYSTAMP >= '2010-03-04' and DAYSTAMP <= '2010-03-04'
and TIMESTAMP >= FROM_UNIXTIME(1267653600) and TIMESTAMP <= FROM_UNIXTIME(1267687228)
order by bytes desc limit 20;
The explain plan as follows:
+----+-------------+-----------------+---------------------------+-------+-----------------------------------------------+------------------------+---------+------+--------+-----------------------------+
| id | select_type | table           | partitions                | type  | possible_keys                                 | key                    | key_len | ref  | rows   | Extra                       |
+----+-------------+-----------------+---------------------------+-------+-----------------------------------------------+------------------------+---------+------+--------+-----------------------------+
|  1 | SIMPLE      | TBL_PROBE_DAILY | TBL_PROBE_DAILY_P20100304 | range | IDX_TBL_PROBE_DAILY_05,IDX_TBL_PROBE_DAILY_06 | IDX_TBL_PROBE_DAILY_05 | 3       | NULL | 216920 | Using where; Using filesort |
+----+-------------+-----------------+---------------------------+-------+-----------------------------------------------+------------------------+---------+------+--------+-----------------------------+
I've also tried to FORCE INDEX (IDX_TBL_PROBE_DAILY_06) , in which case it happily uses IDX_06 to satisfy the where constraints, but still does a filesort :(
I can't imagine index sorting is impossible on partitioned tables? Does InnoDB behave differently from MyISAM in this regard? I would have thought InnoDB's index+data caching would be ideal for index sorting.
Any help will be much appreciated... I've been trying all week to optimize this query in different ways, without much success.
Ok. Looks like swapping the columns in the index did the trick.
I don't really know why... maybe someone else has an explanation?
Either way, if I add an index
create index IDX_TBL_PROBE_DAILY_07 on TBL_PROBE_DAILY(BYTES,DAYSTAMP)
then mysql favors IDX07 (even without the force index) and does an index sort instead of file sort.
I couldn't read the definition. Here it is formatted:
CREATE TABLE DB_PROBE.TBL_PROBE_DAILY (
DAYSTAMP date NOT NULL,
TIMESTAMP date NOT NULL,
SOURCE_ADDR varchar(64) NOT NULL,
SOURCE_PORT int(10) NOT NULL,
DEST_ADDR varchar(64) NOT NULL,
DEST_PORT int(10) NOT NULL,
PACKET_COUNT int(20) NOT NULL,
BYTES int(20) NOT NULL,
UNIQUE KEY IDX_TBL_PROBE_DAILY_05 (DAYSTAMP,SOURCE_ADDR(16),SOURCE_PORT,
DEST_ADDR(16),DEST_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_01 (SOURCE_ADDR(16),TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_02 (DEST_ADDR(16),TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_03 (SOURCE_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_04 (DEST_PORT,TIMESTAMP),
KEY IDX_TBL_PROBE_DAILY_06 (DAYSTAMP,TIMESTAMP,BYTES)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (to_days(DAYSTAMP))
(PARTITION TBL_PROBE_DAILY_P20100303 VALUES LESS THAN (734200) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100304 VALUES LESS THAN (734201) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100305 VALUES LESS THAN (734202) ENGINE = InnoDB,
PARTITION TBL_PROBE_DAILY_P20100306 VALUES LESS THAN (734203) ENGINE = InnoDB) */;
The Query:
select SOURCE_ADDR as 'Source_IP',
SOURCE_PORT as 'Source_Port',
DEST_ADDR as 'Destination_IP',
DEST_PORT as 'Destination_Port',
BYTES
from TBL_PROBE_DAILY
where DAYSTAMP >= '2010-03-04' and DAYSTAMP <= '2010-03-04'
and TIMESTAMP >= FROM_UNIXTIME(1267653600) and TIMESTAMP <= FROM_UNIXTIME(1267687228)
order by bytes desc limit 20;
I suspect the problem is that your query contains two range conditions. In my experience, MySQL cannot optimise beyond the first range condition it encounters, and so as far as it is concerned, any index beginning with DAYSTAMP is equivalent to any other.
The clue in the EXPLAIN is key_len: this shows how much of the index value actually gets used. It is probably the same value (3) even when you force it to use the index you want.
Using an open-ended inequality in WHERE always forces a filesort. Simply put, an open-ended < or > makes MySQL fetch the rows and order them to eliminate the ones not matching your query. If this query can logically be changed into a range (BETWEEN timestamp X AND timestamp Y), then MySQL can use those bookend values to get results directly from the index, and then either filesort if you still want the result sorted, or skip it if you only want to match the values.
Swapping the columns worked because of how MySQL can use an index for sorting:
To sort or group a table if the sorting or grouping is done on a leftmost prefix of a usable key (for example, ORDER BY key_part1, key_part2). If all key parts are followed by DESC, the key is read in reverse order. See Section 8.3.1.11, “ORDER BY Optimization”, and Section 8.3.1.12, “GROUP BY Optimization”.