MySQL range partition with range select

I couldn't find an example like mine, so here's the thing:
I have a big data set that I need to aggregate on top of.
We're talking about ~500M rows with a date field ranging from 2 years ago until now.
My first instinct was to partition the table by this date field, which leaves roughly 20M rows per partition.
Then I have indexes on the other fields I will aggregate/group by.
Here's my table definition (simplified for brevity's sake):
create table t1(
date_field datetime not null,
additional_id int not null,
category_id int not null,
value_field1 double,
value_field2 double,
primary key(additional_id,date_field)
)
ENGINE=InnoDB
PARTITION BY RANGE(YEAR(date_field)*100 + MONTH(date_field)) (
PARTITION p_201411 VALUES LESS THAN (201411),
PARTITION p_201412 VALUES LESS THAN (201412),
#all the partitions until the current month...
PARTITION p_201610 VALUES LESS THAN (201610),
PARTITION p_201611 VALUES LESS THAN (201611),
PARTITION p_catchall VALUES LESS THAN MAXVALUE );
If I execute a query that filters on an exact date, only the partition for that month is used, based on the output of EXPLAIN PARTITIONS for a query such as the following:
select value_field1 from t1 where additional_id=x and date_field='2014-11-05'
However, if I use a date range (even one that falls entirely inside a single partition), all partitions are scanned:
select value_field1 from t1 where additional_id=x and date_field > '2014-11-05' and date_field < '2014-11-10'
(Same result if I use between).
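For reference, pruning can be checked with EXPLAIN PARTITIONS (in 5.7+ a plain EXPLAIN shows the same partitions column); here is a sketch with a hypothetical id in place of x:
EXPLAIN PARTITIONS
SELECT value_field1
FROM t1
WHERE additional_id = 123   -- hypothetical id (the "x" above)
  AND date_field > '2014-11-05'
  AND date_field < '2014-11-10';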
What am I missing here? Is this really the right way to partition this table?
Thanks in advance

Short answer: Do not use complex expressions for PARTITION BY RANGE.
Long answer: (setting aside any criticism of how the BY RANGE implementation handles range queries)
Instead, do this:
PARTITION BY RANGE (TO_DAYS(date_field)) (
PARTITION p_201411 VALUES LESS THAN (TO_DAYS('2014-11-01')),
...
PARTITION p_catchall VALUES LESS THAN MAXVALUE ); -- unchanged
Newer versions of MySQL have slightly more friendly expressions you can use.
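For example, RANGE COLUMNS (available since MySQL 5.5) compares the DATETIME column directly against date literals, so no wrapper expression is needed; a minimal sketch of the same layout:
PARTITION BY RANGE COLUMNS(date_field) (
PARTITION p_201411 VALUES LESS THAN ('2014-11-01'),
-- ...one partition per month...
PARTITION p_catchall VALUES LESS THAN (MAXVALUE) );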
If this is your typical query:
additional_id=x and date_field> '2014-11-05'
and date_field <'2014-11-10'
then partitioning is no faster than the equivalent non-partitioned table. You even have the perfect index for the non-partitioned version.
If, on the other hand, you are DROPping old partitions when they 'expire', the PARTITIONing is excellent.
25 partitions is good.
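A sketch of the monthly maintenance this implies (partition names follow the scheme above; adjust the dates to your own calendar):
-- drop the oldest month once it has 'expired'
ALTER TABLE t1 DROP PARTITION p_201411;
-- carve next month out of the catch-all before its data starts arriving
ALTER TABLE t1 REORGANIZE PARTITION p_catchall INTO (
PARTITION p_201612 VALUES LESS THAN (TO_DAYS('2016-12-01')),
PARTITION p_catchall VALUES LESS THAN MAXVALUE );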
More discussion.
A side note: additional_id int is limited to 2 billion, so you are 1/4 of the way to overflowing. INT UNSIGNED would get you to 4 billion; you might consider an ALTER. (Of course, I don't know whether additional_id is unique in this table; so maybe it is not an issue.)

Related

MySQL : optimize partitioning to speed up requests [duplicate]


mysql select query optimization of partitioned table in non-cluster environment

I have a select query on a partitioned table with 123 million records which is taking more than 10 minutes to fetch data. My query looks like: select * from tableName where column1 = '1.1.1.1' order by timestamp desc;
Table is already indexed on column1.
Any help appreciated.
(From comments)
CREATE TABLE mytable (
column1 varchar(256) NOT NULL,
column2 varchar(100) NOT NULL,
column3 smallint(5) unsigned NOT NULL,
column4 smallint(5) unsigned NOT NULL,
timestamp bigint(20) unsigned NOT NULL,
KEY mytable_idx (column2,timestamp,column3,column4),
KEY ip_addr_index (column1),
KEY ts_idx (timestamp)
) /*!50100 PARTITION BY RANGE ((TIMESTAMP))
(PARTITION p1498800000 VALUES LESS THAN (1498800000) ENGINE = InnoDB,
PARTITION p1500000000 VALUES LESS THAN (1500000000) ENGINE = InnoDB,
PARTITION p1501200000 VALUES LESS THAN (1501200000) ENGINE = InnoDB,
PARTITION p1502400000 VALUES LESS THAN (1502400000) ENGINE = InnoDB,
PARTITION p1503600000 VALUES LESS THAN (1503600000) ENGINE = InnoDB,
PARTITION p1504800000 VALUES LESS THAN (1504800000) ENGINE = InnoDB,
PARTITION p1506000000 VALUES LESS THAN (1506000000) ENGINE = InnoDB
) */
For this query:
select *
from tableName
where column1 = '1.1.1.1'
order by timestamp desc;
You want an index on (column1, timestamp desc). Note: The desc may be ignored in earlier versions of MySQL.
PARTITIONing does not intrinsically provide speed. Please provide SHOW CREATE TABLE so we can discuss whether partitioning actually hurts performance in your case.
INDEX(column1, timestamp) -- In this order
is optimal whether the table is partitioned or not. In particular, that index works just as well for the non-partitioned table. (Gordon's comment about DESC has no impact on performance, in either old or new versions.)
With 123 million rows, you should keep an eye on datatypes. If you have
column1 VARCHAR(15) CHARACTER SET utf8
then that ipv4_address can be improved from up to 17 bytes to exactly 4:
BINARY(4)
with suitable conversions on INSERT and SELECT. Making that change would also allow for CIDR and other range tests, which are not possible with VARCHAR. Will you need to handle IPv6? I discuss that here.
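A minimal sketch of those conversions, assuming MySQL 5.6.3+ for the INET6_* functions and a hypothetical BINARY(4) column named ip_bin (other required columns omitted):
-- store the dotted string as 4 bytes
INSERT INTO mytable (ip_bin) VALUES (INET6_ATON('1.1.1.1'));
-- read it back in dotted form
SELECT INET6_NTOA(ip_bin) FROM mytable;
-- CIDR-style range test for 1.1.1.0/24 (plain binary comparison, so an index on ip_bin can be used)
SELECT * FROM mytable
WHERE ip_bin >= INET6_ATON('1.1.1.0')
AND ip_bin < INET6_ATON('1.1.2.0');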
How many rows match 1.1.1.1? Are there any TEXT columns? What is the PRIMARY KEY? Which Engine? Each of those questions may have an impact on the "10 minutes".
It is important to understand when a "composite" index is better than a single-column index. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
After the CREATE TABLE was added:
Replace this
KEY ip_addr_index (column1)
with
KEY ip_addr_index (column1, timestamp)
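As a single statement, that would be something like this sketch (rebuilding the index takes a while on 123M rows):
ALTER TABLE mytable
DROP KEY ip_addr_index,
ADD KEY ip_addr_index (column1, timestamp);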
Don't create more than one future partition before it is needed. Always have a LESS THAN (MAXVALUE) partition just in case.
IPv4 can live with VARCHAR(15); IPv6 fits in VARCHAR(39), or BINARY(16) after packing.
For that one query, 7 lookups must be done (one per partition), the results combined, and then sorted. Without partitioning, it becomes one lookup with no sort (since the index is already sorted). So (I believe) partitioning slows that query down.
When discussing performance of 123M rows, I need to see all the main queries in one sitting in order to advise. Optimizing for one query is all too likely to de-optimize some other.
There seems to be no reason to use BIGINT for TIMESTAMP. INT UNSIGNED would save 4 bytes per row of data, plus more for the indexes. Perhaps a total savings of 2GB of disk space. That translates into some speedup for some queries.
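A sketch of that change (it rewrites the whole table; and since timestamp is also the partitioning column, test on a copy first that the ALTER is accepted):
ALTER TABLE mytable MODIFY timestamp INT UNSIGNED NOT NULL;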
If timestamp is always used in a "range", then this index (column2,timestamp,column3,column4) is probably in an inefficient order. Please provide the query that benefits from this index so I can further elaborate.

How to partition a MySQL table by the columns "user_id" and "gps_time"?

My table schema:
CREATE TABLE `test_table` (
`his_id` int(11) NOT NULL,
`user_id` varchar(45) NOT NULL,
`gps_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`his_id`,`user_id`)
)
I want to partition this table by user_id and gps_time:
column user_id should be partitioned by its first character (A~Z, a~z, 0~9),
column gps_time should be partitioned into the last 3 months (i.e. 3 partitions).
How do I do that?
Thanks a lot~
With MySQL 5.5, you can use multiple columns with RANGE partitioning.
From your question, it's not entirely clear how many partitions you want; it sounds as if you want a whole boatload of partitions, but I don't believe that's what you really want.
The syntax for RANGE partitioning is in the MySQL Reference Manual, available online here: http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
(Be sure you check the manual for the version of MySQL you are actually running; there have been some significant changes to partitioning in 5.0, 5.1, 5.5, etc.)
With MySQL 5.5.x, if you want separate partitions for the first character of user_id combined with ranges of gps_time values, you could do something like this:
PARTITION BY RANGE COLUMNS(user_id, gps_time)
( PARTITION pA0 VALUES LESS THAN ('B','2014-07-01')
, PARTITION pA1 VALUES LESS THAN ('B','2014-08-01')
, PARTITION pA2 VALUES LESS THAN ('B','2014-09-01')
, PARTITION pA3 VALUES LESS THAN ('B',MAXVALUE)
, PARTITION pB0 VALUES LESS THAN ('C','2014-07-01')
, PARTITION pB1 VALUES LESS THAN ('C','2014-08-01')
, PARTITION pB2 VALUES LESS THAN ('C','2014-09-01')
, PARTITION pB3 VALUES LESS THAN ('C',MAXVALUE)
, ...
, PARTITION pMX VALUES LESS THAN (MAXVALUE,MAXVALUE)
);
But that'd be over 100 partitions. I can't imagine a scenario where that's what you really want. (I'm not sure what the upper limit on the number of partitions for a table is.)
With MySQL 5.1, I don't believe it's possible to partition on multiple columns. You could, however, partition on just the user_id column, and then create subpartitions (within each partition) on the gps_time column... but I've never done that before.

MySQL table partition by month

I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually considering my partitioned by column is a datetime?
As explained in the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is possible with hash partitioning on the month value.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this only partitions by month and not by year; also, there are only 6 partitions (so 6 months) in this example.
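For illustration (a sketch with made-up dates; p0–p5 are the default partition names MySQL generates), HASH here means MOD(MONTH(tr_date), 6), so January and July rows land in the same partition, which EXPLAIN PARTITIONS confirms:
-- MOD(1,6) = 1 and MOD(7,6) = 1, so both of these prune to partition p1
EXPLAIN PARTITIONS SELECT * FROM ti WHERE tr_date = '2019-01-15';
EXPLAIN PARTITIONS SELECT * FROM ti WHERE tr_date = '2019-07-15';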
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of this side effect when partitioning by hash:
As the docs say:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
then start the PRIMARY KEY with the_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
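A minimal sketch of that indexing alternative (column names are assumed here, since the real table definition isn't shown; id is kept in the key only to make it unique):
CREATE TABLE ti (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
the_date DATETIME NOT NULL,
amount DECIMAL(7,2),
PRIMARY KEY (the_date, id), -- clustered: a one-month range is a contiguous slice
KEY (id) -- the AUTO_INCREMENT column still needs its own index
) ENGINE=InnoDB;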
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)

How to decide varchar partition RANGE in MySQL 5.5?

Background
I have a very big table; it looks like this:
CREATE TABLE tb_doc (
did mediumint(8) unsigned NOT NULL auto_increment,
title varchar(80) NOT NULL default '',
...,
PRIMARY KEY (did),
KEY title (title)
)
ENGINE=MyISAM;
The type of title is varchar(80). Most of the time title will be a pure number string like '111111', '2222222', '44444444'; sometimes it will be a utf-8 string like '3a', 'a4', or "中国" (Chinese characters).
I've already used HASH (did) to do the partitioning, but my SELECT statements are always like
SELECT did, title,... FROM tb_doc WHERE title= '1111111';
SELECT did, title,... FROM tb_doc WHERE title= '2222222';
So I want to partition on title, hoping this would be faster. Now comes the question.
Experiment
I used the following statement:
PARTITION BY RANGE COLUMNS (title)(
PARTITION p00 VALUES LESS THAN (1), # not pure number strings
PARTITION p01 VALUES LESS THAN (500000), # pure number strings from 1 to 500k
PARTITION p02 VALUES LESS THAN (1000000), # pure number strings from 500k to 1000k
PARTITION p03 VALUES LESS THAN (1500000), # pure number strings from 1000k to 1500k
.......... # ......
PARTITION pn VALUES LESS THAN (25000000), # the biggest number now
)
;
Similar Questions
I read the following two Q&As:
Partitioning a database table in MySQL
How to Partitioning a table using a LIKE criteria in Mysql
but they are for the English-speaking world and do not work in my situation.
Questions
Using title for the partitioning is better, right?
Can you give me a "utf-8" RANGE example?
I tried '500000', '1000000', ..., but they do not work.
If I use SELECT xxx FROM tb_doc WHERE title='12345', does MySQL fetch data from partition 1 only?
This table is ~50GB, how many partitions are optimum?
Thank you in advance.
May I note that VARCHAR will have problems storing characters from multiple languages properly; better to use NVARCHAR.
HASH partitioning is used to distribute load over partitions evenly. I would say that first you should partition by something meaningful to a human (columns that appear in the WHERE clause often) and then do HASH sub-partitioning to utilise as many cores as possible at the same time. So the number of HASH sub-partitions in this case should be <= the number of cores.
I would suggest creating a clustered index on the title column. This will speed up your queries.
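A minimal sketch of one way to get that, assuming the table is moved to InnoDB (which clusters on the primary key); did stays in the key to keep it unique:
ALTER TABLE tb_doc
ENGINE = InnoDB,
DROP PRIMARY KEY,
ADD PRIMARY KEY (title, did),
ADD KEY (did); -- the AUTO_INCREMENT column still needs its own index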
And in relation to your questions:
Not necessarily. It will speed up queries because of the clustered index, not the partitioning.
Use partitioning to manage the table: eg. delete many rows quickly.
If a good proportion of your queries look for many rows (not just 1), or title is not a UNIQUE column, then you may consider partitioning.
As an example of a UTF-8 partition boundary I would say: less than ('c').
Depending on how you define partitioning it may hit 1, several or all partitions.
There is no penalty for having many partitions, but a table in MySQL 5.5+ can have up to 1024 partitions and sub-partitions.
When you want to do partitioning by string value, use KEY partitioning as described here: 18.2.5. KEY Partitioning.
Example:
CREATE TABLE tm1 (
s1 CHAR(32) PRIMARY KEY
)
PARTITION BY KEY(s1)
PARTITIONS 10;
To begin with, set the number of partitions to the number of letters in your alphabet (or in all the alphabets you anticipate seeing in the table).
Partitioning by title, even if you could do it, will not speed up
SELECT did, title,... FROM tb_doc WHERE title= '1111111';
For a further discussion of the limitations of PARTITIONing, plus the few use cases where it helps, see my blog.