I have a very large table (500 million rows) with the following columns:
id - Bigint - Autoincrementing primary index.
date - Datetime - Approximately 1.5 million rows per date; data older than 1 year is deleted.
uid - VARCHAR(60) - A user ID
sessionNumber - INT
start - INT - epoch of start time.
end - INT - epoch of end time.
More columns not relevant for this query.
The combination of uid and sessionNumber forms a unique index. I also have an index on date.
Due to the sheer size, I'd like to partition the table.
Most of my accesses would be by date, so partitioning by date ranges seems intuitive, but as the date is not part of the unique index, this is not an option.
Option 1: RANGE PARTITION on Date and BEFORE INSERT TRIGGER
I don't really have a regular issue with the uid and sessionNumber uniqueness being violated. The source data is consistent, but sessions that span two days may be inserted on two consecutive days with midnight being the end time of the first and start time of the second.
I'm trying to understand whether I could remove the unique key and instead use a trigger that would:
Check whether there is a session with the same identifiers on the previous day and, if so,
update the end time of that session, and
cancel the actual insert.
However, I am not sure whether I can 1) trigger an update on the same table, or 2) prevent the actual insert.
Option 2: LINEAR HASH PARTITION on UID
My second option is to use a linear hash partition on the UID. However, I cannot find any example that takes a VARCHAR and converts it to an INTEGER for use in HASH partitioning,
nor can I find a permitted way to do that conversion myself. For example
ALTER TABLE mytable
PARTITION BY HASH (CAST(md5(uid) AS UNSIGNED integer))
PARTITIONS 20
returns that the partition function is not allowed.
HASH partitioning must work with a 32-bit integer. But you can't convert an MD5 string to an integer simply with CAST().
Instead of MD5, CRC32() can take an arbitrary string and convert it to a 32-bit integer. But this is also not a valid function for partitioning.
mysql> alter table v partition by hash(crc32(uid));
ERROR 1564 (HY000): This partition function is not allowed
You could partition by the string using KEY partitioning instead of HASH partitioning. KEY partitioning accepts strings. It passes the input string through an internal hashing function based on the same algorithm as MySQL's built-in PASSWORD() function, which is related to SHA1.
However, this leads to another problem with your partitioning strategy:
mysql> alter table v partition by key(uid);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
Your table's primary key id does not include the column uid that you want to partition by. This is a restriction of MySQL's partitioning:
every unique key on the table must use every column in the table's partitioning expression.
Here's the table I'm testing with (it would have been a good idea for you to include this in your question):
CREATE TABLE `v` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`uid` varchar(60) NOT NULL,
`sessionNumber` int(11) NOT NULL,
`start` int(11) NOT NULL,
`end` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`,`sessionNumber`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
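As a minimal sketch of what the restriction means for this test table: KEY partitioning on uid is only accepted once every unique key, including the primary key, contains uid. Widening the primary key to (id, uid) is my own assumption, not something the question asked for, and it means id is no longer guaranteed unique by itself.

```sql
-- Assumption: a composite primary key (id, uid) is acceptable to the application.
ALTER TABLE v
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, uid);   -- id stays the leading column, as AUTO_INCREMENT requires

-- Both unique keys, (id, uid) and (uid, sessionNumber), now contain uid,
-- so the partitioning column appears in every unique key.
ALTER TABLE v
  PARTITION BY KEY (uid)
  PARTITIONS 20;
```

This only shows that the statement becomes legal; whether it helps any real query is a separate matter.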
Before going any further, I have to wonder why you want to use partitioning anyway? "Sheer size" is not a reason to partition a table.
Partitioning, like any optimization, is done for the sake of specific queries you want to optimize for. Any optimization improves one query at the expense of other queries. Optimization has nothing to do with the table. The table is happy to sit there with 5 billion rows, and it doesn't care. Optimization is for the queries.
So you need to know which queries you want to optimize for. Then decide on a strategy. Partitioning might not be the best strategy for the set of queries you need to optimize!
I'll assume your 'uid' is a 128-bit UUID kind of value, which can be stored as a BINARY(16), because that is generally worth the trouble.
Next, stay away from the 'datetime' type, as it is stored like a packed string, and doesn't hold any timezone information. Store date-time-values either as pure numerical values (the number of seconds since the UNIX-epoch), or let MySQL do that for you and use the timestamp(N) type.
Also, don't call a column 'date', not just because that is a MySQL keyword, but also because the value contains time details too.
Next, stay away from using anything other than latin1 as the default CHARSET of (all) your tables. Only ever do UTF-8-ness at the column level. This prevents unnecessarily wide multi-byte columns and indexes from creeping in over time. Adopt this habit and you'll happily look back on it after some years, I promise.
This makes the table look like:
CREATE TABLE `v` (
`uuid` binary(16) NOT NULL,
`mysql_created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`visitor_uuid` BINARY(16) NOT NULL,
`sessionNumber` int NOT NULL,
`start` int NOT NULL,
`end` int NOT NULL,
PRIMARY KEY (`uuid`),
UNIQUE KEY (`visitor_uuid`,`sessionNumber`), -- caution: the unique-key rule quoted earlier would also require `uuid` in this key
KEY (`mysql_created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS (`uuid`)
( PARTITION `p_0` VALUES LESS THAN (X'10')
, PARTITION `p_1` VALUES LESS THAN (X'20')
...
, PARTITION `p_9` VALUES LESS THAN (X'A0')
, PARTITION `p_A` VALUES LESS THAN (X'B0')
...
, PARTITION `p_F` VALUES LESS THAN (MAXVALUE)
);
Making KEY (mysql_created_at) cover only the date part requires a calculated column, which can be added in place; an index on it is then also cheap to add, so I'll leave that as homework.
We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always filter on "serial_number" and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application will at all times access the data based on the device first.
We have checked that there is no query that accesses the data without specifying the device serial number first.
The table for each device will be relatively small (9000 rows per day)
A few challenges that we think we will face are:
We have a lot of devices, which means there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not limit the number of tables in a database. Will this affect performance compared with keeping everything in one table?
How will this affect us later on when we would like to scale MySQL (e.g. using master / slave, etc)?
Are there alternative solutions to this problem?
Update: below is the SHOW CREATE TABLE output for our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this may hurt them. In large tables, it is a juggling task to find the optimal index(es).
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&As on Stack Overflow warning against it.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands. (And the query in question will run about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
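To illustrate the purging use case from the list above, here is a minimal sketch on a hypothetical table (the name device_data_demo, the monthly partition names and the single data1 column are made up for illustration). Note that dt is declared DATETIME here: as far as I know, MySQL rejects TO_DAYS() on a TIMESTAMP column in a partitioning expression because the result depends on the time zone, and UNIX_TIMESTAMP(dt) would be needed instead.

```sql
CREATE TABLE device_data_demo (
  id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
  dt            DATETIME     NOT NULL,
  serial_number VARCHAR(20)  NOT NULL,
  data1         DOUBLE,
  PRIMARY KEY (serial_number, dt, id),   -- contains dt, the partitioning column
  INDEX (id)                             -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p201801 VALUES LESS THAN (TO_DAYS('2018-02-01')),
  PARTITION p201802 VALUES LESS THAN (TO_DAYS('2018-03-01')),
  PARTITION pfuture VALUES LESS THAN MAXVALUE
);

-- Purging all of January 2018 drops one partition instead of running a big DELETE.
ALTER TABLE device_data_demo DROP PARTITION p201801;
```

The primary key keeps serial_number first, so the single-device range query is served the same way as in the non-partitioned design.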
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id),
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefits that make the query with WHERE device_sn = .. AND dt BETWEEN ... efficient.
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lng decimal(10,5) -- Don't need 5 digits to the left of the decimal point; only need 3 or 2. So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
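Put together, the index changes in these notes could be applied to the existing table roughly as follows. This is a sketch only: it rebuilds the whole table, so it should be tried on a copy first, and it leaves the (dt, device_sn) index under its old name device_sn_2 rather than renaming it to dt_sn.

```sql
ALTER TABLE test_udp_new
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (device_sn, dt, id),               -- clusters each device's rows by time
  ADD INDEX (id),                                    -- AUTO_INCREMENT needs id leading some index
  DROP INDEX `dt`,                                   -- redundant: dt already leads (dt, device_sn)
  DROP INDEX test_udp_new_device_sn_dt_index,        -- same leading columns as the new PK
  DROP INDEX test_udp_new_device_sn_data_dt_index;   -- the index dropped in the notes above
```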
I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and time (unix_timestamp)
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously finished in under a second could now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior seemed very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation to this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
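Spelled out against the table in the question, that could look like the sketch below. AVG is just one choice of aggregate (MIN or MAX may suit better), and the alias avg_value is my own.

```sql
SELECT md.monitor_data_time,
       AVG(md.monitor_data_value) AS avg_value   -- one value per timestamp
FROM monitor_data md
WHERE md.monitor_id = 165
  AND md.monitor_data_time >= 1484076760
  AND md.monitor_data_time <= 1487271199
GROUP BY md.monitor_data_time
ORDER BY md.monitor_data_time ASC;
```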
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index (time, id), it would do the scan over the dates fine, but it would have to skip over any entries where id != 165. And it would have to read all those rows to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
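As a rough sketch, such a covering index could be added as follows. The index name is hypothetical, and dropping the existing two-column index (which the wider one makes redundant) is my own assumption.

```sql
ALTER TABLE monitor_data
  ADD INDEX monitor_id_time_value (monitor_id, monitor_data_time, monitor_data_value),
  DROP INDEX monitor_id_data_time;   -- same leading columns as the new, wider index
```

After this, EXPLAIN for the query in the question would be expected to report Using index in the Extra column.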
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
or, equivalently, specify PRIMARY in place of UNIQUE and remove the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation to this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
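One way to check for such duplicates before adding a constraint is a grouped count over the full tuple. This is only a sketch; the alias copies and the LIMIT are arbitrary choices of mine.

```sql
SELECT monitor_id,
       monitor_data_time,
       monitor_data_value,
       COUNT(*) AS copies
FROM monitor_data
GROUP BY monitor_id, monitor_data_time, monitor_data_value
HAVING COUNT(*) > 1      -- keep only tuples stored more than once
ORDER BY copies DESC
LIMIT 20;
```

An empty result on a quiet table would point to one of the other explanations mentioned above.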
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
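Assuming the PRIMARY KEY (monitor_id, monitor_data_time) definition shown earlier is in place, the query from the question reduces to the following sketch:

```sql
-- No DISTINCT: the primary key already guarantees one row per (monitor_id, monitor_data_time).
SELECT md.monitor_data_time,
       md.monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
  AND md.monitor_data_time BETWEEN 1484076760 AND 1487271199
ORDER BY md.monitor_data_time ASC;   -- matches the key order within a single monitor_id
```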
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of the index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)
I've been thinking about keeping a history in the following table structure:
`id` bigint unsigned not null auto_increment,
`userid` bigint unsigned not null,
`date` date not null,
`points_earned` int unsigned not null,
primary key (`id`),
key `userid` (`userid`),
key `date` (`date`)
This will allow me to do something like SO does with its Reputation Graph (where I can see my rep gain since I joined the site).
Here's the problem, though: I just ran a simple calculation:
SELECT SUM(DATEDIFF(`lastclick`,`registered`)) FROM `users`
The result was as near as makes no difference 25,000,000 man-days. If I intend to keep one row per user per day, that's a [expletive]ing large table, and I'm expecting further growth. Even if I exclude days where a user doesn't come online, that's still huge.
Can anyone offer any advice on maintaining such a large amount of data? The only queries that will be run on this table are:
SELECT * FROM `history` WHERE `userid`=?
SELECT SUM(`points_earned`) FROM `history` WHERE `userid`=? AND `date`>?
INSERT INTO `history` VALUES (null,?,?,?)
Would the ARCHIVE engine be of any use here, for instance? Or do I just not need to worry because of the indexes?
Assuming it's MySQL:
For history tables you should consider partitioning. You can choose the partitioning rule that suits you best; looking at your queries, there are 2 choices:
a. partition by date (1 partition = 1 month, for example)
b. partition by user (let's say you have 300 partitions and 1 partition = 100000 users)
This will help you a lot if your queries can use partition pruning (here).
You could use a composite index on user, date (it will be used for the first 2 queries).
Avoid INSERT statements when you have huge amounts of data; use LOAD DATA (this will not work if the table is partitioned).
And most important ... the best engine for huge volumes of data is MyISAM.
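For the composite index mentioned above, a minimal sketch against the history table from the question might be the following. The index name userid_date is made up, and dropping the single-column userid index is my assumption, since the composite index starts with the same column.

```sql
ALTER TABLE `history`
  ADD INDEX `userid_date` (`userid`, `date`),
  DROP INDEX `userid`;   -- assumed redundant once (userid, date) exists

-- Both SELECTs from the question can then use the composite index:
-- SELECT * FROM `history` WHERE `userid`=?
-- SELECT SUM(`points_earned`) FROM `history` WHERE `userid`=? AND `date`>?
```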
We are running a MySQL/MyISAM database with the following table:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`tm_stamp`,`fk_channel`)
);
The tm_stamp-fk_channel combination is required to be unique, hence the compound primary key. Now, for reasons that are irrelevant here, the database will be migrated to the InnoDB engine. After some googling, I found out that the key will dictate the physical ordering of the data on the disk. 90% of the queries currently go as follows:
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Inserts are 99% in order of tm_stamp, it's a storage for dataloggers network. The table has low millions of rows but growing steadily. The questions are
Should the sole change of storage engine result in any significant performance change, better or worse?
Does the order of columns in the index matter with regard to the most popular SELECT? This blog suggests something along those lines.
Thanks to the nature of clustered index, may we perhaps leave out the ORDER BY clause and gain some performance?
Edit 1:
It appears that changing the primary key from
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
always makes sense, for both MyISAM and InnoDB. See http://sqlfiddle.com/#!2/0aa08/1 for proof this is so.
Original answer:
To determine if changing
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
would improve your query's performance, you need to determine which field has the higher cardinality (that is, which field's values are more varied). Running
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
will give you the cardinality of the columns.
So, to answer your question properly we first need to know: What is the typical range of values between B and C? 60? 3,600? 86,400? More?
For example, let's say that
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
returns 32,768 and 256. 32,768 divided by 256 is 128. This tells us that tm_stamp has 128 unique values for every value of fk_channel.
So if the difference between B and C is usually less than 128, then leave tm_stamp as the first field in the primary key. If 128 or greater, then make fk_channel the first field.
Another question: Does fk_channel need to be an INT (4 billion unique values, half of which are negative)? If not, then changing fk_channel to TINYINT UNSIGNED (if you have 256 unique values), or SMALLINT UNSIGNED (65536 unique values) would save a lot of time and space.
For example, let's say you have at most 256 possible fk_channel values and 65,536 possible values in the value column; then you could change your schema via:
create table measurements_new (
tm_stamp INT UNSIGNED NOT NULL DEFAULT '0',
fk_channel TINYINT UNSIGNED NOT NULL DEFAULT '0', -- remove UNSIGNED if values can be negative
value SMALLINT UNSIGNED DEFAULT NULL, -- remove UNSIGNED if values can be negative
PRIMARY KEY (tm_stamp,fk_channel)
) ENGINE=InnoDB
SELECT
tm_stamp,
fk_channel,
value
FROM
measurements
ORDER BY
tm_stamp,
fk_channel;
RENAME TABLE measurements TO measurements_old, measurements_new TO measurements;
This will store the existing data in the new table in PRIMARY KEY order, which will improve performance somewhat.
Staring at the Query
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Your static value is fk_channel and the moving, ordered values are tm_stamp. This addresses your second question, which seems to be at the heart of the query's needs.
You would be way better off with PRIMARY KEY columns reversed
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`)
);
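For an existing table, the same change could be sketched as a single ALTER. Combining it with the engine switch is my own assumption, made so that the table is rebuilt only once.

```sql
ALTER TABLE measurements
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (fk_channel, tm_stamp),   -- equality column first, range column second
  ENGINE=InnoDB;
```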
As for the first question, the storage engine dictates what gets cached.
MyISAM caches index pages only in the Key Cache (sized by key_buffer_size)
InnoDB caches data and indexes in the Buffer Pool (sized by innodb_buffer_pool_size)
I wrote about this in the DBA StackExchange
If you remain with MyISAM, you could change the primary key to include the value column:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`,`value`)
) ENGINE=MyISAM;
That way, your Query's data retrieval is strictly from one file at most, the .MYI of the MyISAM table. The table need not be read at all.
If you switch to InnoDB, fk_channel,tm_stamp gets loaded twice into RAM
Once from InnoDB data page
Once from InnoDB index page
The order of your arguments in the WHERE clause is irrelevant here; the optimizer will pick the best key option (usually a direct comparison on an indexed field over a > or < comparison). With your initial example, the best option was the range comparison on tm_stamp, which is not a direct equality check and is therefore sub-par.
However, the order of the columns in the clustered key does matter. If the exact comparison is always on the fk_channel column, I'd change the PK to be:
PRIMARY KEY (`fk_channel`,`tm_stamp`)
Now you've got an index that will benefit from the fk_channel=A in your where clause.
Also, while the storage engine does play some role, I don't think the issue here is between InnoDB and MyISAM.
Finally, I don't think the ORDER BY clause has much bearing on your issue; that's done post-query. A GROUP BY could affect your performance.
I want to partition a MySQL table by a datetime column, with one partition per day. The CREATE TABLE script looks like this:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY hash (day(ftime)) partitions 31;
But when I select the data for a given day, MySQL cannot locate the partition. The SELECT statement is like this:
explain partitions select * from raw_log_2011_4 where day(ftime) = 30;
When I use another statement, it can locate the partition, but then I cannot select the data for a given day:
explain partitions select * from raw_log_2011_4 where ftime = '2011-03-30';
Can anyone tell me how I can select the data for a given day and still make use of the partitions? Thanks!
Partitioning by HASH is a very bad idea with datetime columns, because it cannot use partition pruning. From the MySQL docs:
Pruning can be used only on integer columns of tables partitioned by
HASH or KEY. For example, this query on table t4 cannot use pruning
because dob is a DATE column:
SELECT * FROM t4 WHERE dob >= '2001-04-14' AND dob <= '2005-10-15';
However, if the table stores year values in an INT column, then a
query having WHERE year_col >= 2001 AND year_col <= 2005 can be
pruned.
So you can store the value of TO_DAYS(ftime) in an extra INTEGER column to use pruning.
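A minimal sketch of that extra-column approach on the table from the question follows. The column name fday is hypothetical, and the application (or a trigger) would have to fill it on every insert.

```sql
-- Extra integer column holding the day number of ftime.
ALTER TABLE raw_log_2011_4
  ADD COLUMN fday INT DEFAULT NULL;

-- Backfill existing rows; rows with a NULL ftime keep a NULL fday.
UPDATE raw_log_2011_4 SET fday = TO_DAYS(ftime);

-- Hash on the integer column instead of on an expression over the DATETIME.
ALTER TABLE raw_log_2011_4
  PARTITION BY HASH (fday) PARTITIONS 31;

-- Equality on the integer column is the kind of condition HASH pruning can use.
EXPLAIN PARTITIONS
SELECT * FROM raw_log_2011_4
WHERE fday = TO_DAYS('2011-03-30');
```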
Another option is to use RANGE partitioning:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY RANGE( TO_DAYS(ftime) ) (
PARTITION p20110401 VALUES LESS THAN (TO_DAYS('2011-04-02')),
PARTITION p20110402 VALUES LESS THAN (TO_DAYS('2011-04-03')),
PARTITION p20110403 VALUES LESS THAN (TO_DAYS('2011-04-04')),
PARTITION p20110404 VALUES LESS THAN (TO_DAYS('2011-04-05')),
...
PARTITION p20110426 VALUES LESS THAN (TO_DAYS('2011-04-27')),
PARTITION p20110427 VALUES LESS THAN (TO_DAYS('2011-04-28')),
PARTITION p20110428 VALUES LESS THAN (TO_DAYS('2011-04-29')),
PARTITION p20110429 VALUES LESS THAN (TO_DAYS('2011-04-30')),
PARTITION future VALUES LESS THAN MAXVALUE
);
Now the following query will only use partition p20110403:
SELECT * FROM raw_log_2011_4 WHERE ftime = '2011-04-03';
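One caveat worth sketching: ftime is a DATETIME, so the equality above only matches rows stamped exactly at midnight. To fetch a whole day while still allowing pruning, a half-open range on the bare column is a common approach (this is my addition, using the table defined above):

```sql
-- EXPLAIN PARTITIONS matches the syntax used in the question;
-- newer MySQL versions report partitions with plain EXPLAIN.
EXPLAIN PARTITIONS
SELECT *
FROM raw_log_2011_4
WHERE ftime >= '2011-04-03 00:00:00'
  AND ftime <  '2011-04-04 00:00:00';
```

The partitions column would be expected to list p20110403; as far as I know, MySQL may also include the first partition for TO_DAYS() ranges, because NULL or invalid dates end up there.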
Hi, the partitioning in your table definition is wrong. The table definition should look like this:
CREATE TABLE raw_log_2011_4 (
id bigint(20) NOT NULL AUTO_INCREMENT,
logid char(16) NOT NULL,
tid char(16) NOT NULL,
reporterip char(46) DEFAULT NULL,
ftime datetime DEFAULT NULL,
KEY id (id)
) ENGINE=InnoDB AUTO_INCREMENT=286802795 DEFAULT CHARSET=utf8
PARTITION BY hash (TO_DAYS(ftime)) partitions 31;
And your select command would be:
explain partitions
select * from raw_log_2011_4 where TO_DAYS(ftime) = TO_DAYS('2011-03-30');
The above command would select all rows for the required date. TO_DAYS returns the day number for a date, for example:
mysql> SELECT TO_DAYS(950501);
-> 728779
mysql> SELECT TO_DAYS('2007-10-07');
-> 733321
Why use TO_DAYS? Because the MySQL optimizer recognizes two date-based functions for partition pruning purposes:
1. TO_DAYS()
2. YEAR()
and this would solve your problem.
I just recently read a MySQL blog post relating to this, at http://dev.mysql.com/tech-resources/articles/mysql_55_partitioning.html.
Versions earlier than 5.5 required special gymnastics in order to do partitioning based on dates. The link above discusses it and shows examples.
Versions 5.5 and later allowed you to do direct partitioning using non-numeric values such as dates and strings.
Don't use CHAR, use VARCHAR. That will save a lot of space, hence decrease I/O, hence speed up queries. (Exception: If the column is really fixed length, then use CHAR. And it will probably be CHARACTER SET ascii.)
reporterip: (46) is unnecessarily big for an IP address, even IPv6. See My blog for further discussion, including how to shrink it to 16 bytes.
Use PARTITION BY RANGE(TO_DAYS(...)) as @Steyx suggested, but don't have more than about 50 partitions. The more partitions you have, the slower queries get, in spite of the "pruning". HASH partitioning is essentially useless.
More discussion of partitioning, especially the type you are looking at. That includes code for a sliding set of partitions over time.
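As a rough idea of what one step of such a sliding window looks like, here is a sketch against the RANGE-partitioned raw_log_2011_4 definition shown earlier in this thread. It assumes the catch-all partition is named future and that p20110401 is the oldest partition.

```sql
-- Split the catch-all partition to make room for the next day.
ALTER TABLE raw_log_2011_4
  REORGANIZE PARTITION future INTO (
    PARTITION p20110430 VALUES LESS THAN (TO_DAYS('2011-05-01')),
    PARTITION future    VALUES LESS THAN MAXVALUE
  );

-- Drop the oldest day instead of deleting its rows one by one.
ALTER TABLE raw_log_2011_4 DROP PARTITION p20110401;
```

Running both statements once per day keeps the partition count constant.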