I was trying to check whether implementing MySQL database partitioning is beneficial for our application or not. I have heard a lot about the benefits of using partitioning for large number of records.
But surprisingly, the response time of the application got reduced by 3 times when doing the load testing after partitioning was implemented. Could someone please help with the reason why this may happen?
Let me explain in detail:
Below is the DDL of the table when partitioning was ‘not’ in place.
CREATE TABLE `myTable` (
`column1` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`column2` char(3) NOT NULL,
`column3` char(3) NOT NULL,
`column4` char(2) NOT NULL,
`column5` smallint(4) unsigned NOT NULL,
`column6` date NOT NULL,
`column7` varchar(2) NOT NULL,
`column8` tinyint(3) unsigned NOT NULL COMMENT 'Seat Count Ranges from 0-9.',
`column9` varchar(2) NOT NULL,
`column10` varchar(4) NOT NULL,
`column11` char(2) NOT NULL,
`column12` datetime NOT NULL,
`column13` datetime DEFAULT NULL,
PRIMARY KEY (`column1`),
KEY `index1` (`column2`,`column3`,`column4`,`column5`,`column7`,`column6`),
KEY `index2` (`column2`,`column3`,`column6`,`column4`)
) ENGINE=InnoDB AUTO_INCREMENT=342024674 DEFAULT CHARSET=latin1;
And below is the DDL of the same table after implementing ‘Range’ partitioning based on a date field.
CREATE TABLE `myTable` (
`column1` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`column2` char(3) NOT NULL,
`column3` char(3) NOT NULL,
`column4` char(2) NOT NULL,
`column5` smallint(4) unsigned NOT NULL,
`column6` date NOT NULL,
`column7` varchar(2) NOT NULL,
`column8` tinyint(3) unsigned NOT NULL COMMENT 'Seat Count Ranges from 0-9.',
`column9` varchar(2) NOT NULL,
`column10` varchar(4) NOT NULL,
`column11` char(2) NOT NULL,
`column12` datetime NOT NULL,
`column13` datetime DEFAULT NULL,
PRIMARY KEY (`column1`,`column6`),
KEY `index1` (`column2`,`column3`,`column4`,`column5`,`column7`,`column6`),
KEY `index2` (`column2`,`column3`,`column6`,`column4`)
) ENGINE=InnoDB AUTO_INCREMENT=342024674 DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS(`column6`)
(PARTITION date_jul_11 VALUES LESS THAN ('2011-08-01') ENGINE = InnoDB,
PARTITION date_aug_11 VALUES LESS THAN ('2011-09-01') ENGINE = InnoDB,
PARTITION date_sep_11 VALUES LESS THAN ('2011-10-01') ENGINE = InnoDB,
PARTITION date_oct_11 VALUES LESS THAN ('2011-11-01') ENGINE = InnoDB,
PARTITION date_nov_11 VALUES LESS THAN ('2011-12-01') ENGINE = InnoDB,
PARTITION date_dec_11 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
PARTITION date_jan_12 VALUES LESS THAN ('2012-02-01') ENGINE = InnoDB,
PARTITION date_feb_12 VALUES LESS THAN ('2012-03-01') ENGINE = InnoDB,
PARTITION date_mar_12 VALUES LESS THAN ('2012-04-01') ENGINE = InnoDB,
PARTITION date_apr_12 VALUES LESS THAN ('2012-05-01') ENGINE = InnoDB,
PARTITION date_may_12 VALUES LESS THAN ('2012-06-01') ENGINE = InnoDB,
PARTITION date_jun_12 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
PARTITION date_jul_12 VALUES LESS THAN ('2012-08-01') ENGINE = InnoDB,
PARTITION date_aug_12 VALUES LESS THAN ('2012-09-01') ENGINE = InnoDB,
PARTITION date_sep_12 VALUES LESS THAN ('2012-10-01') ENGINE = InnoDB,
PARTITION date_oct_12 VALUES LESS THAN ('2012-11-01') ENGINE = InnoDB,
PARTITION date_nov_12 VALUES LESS THAN ('2012-12-01') ENGINE = InnoDB,
PARTITION date_dec_12 VALUES LESS THAN ('2013-01-01') ENGINE = InnoDB,
PARTITION date_jan_13 VALUES LESS THAN ('2013-02-01') ENGINE = InnoDB,
PARTITION date_feb_13 VALUES LESS THAN ('2013-03-01') ENGINE = InnoDB,
PARTITION date_mar_13 VALUES LESS THAN ('2013-04-01') ENGINE = InnoDB,
PARTITION date_apr_13 VALUES LESS THAN ('2013-05-01') ENGINE = InnoDB,
PARTITION date_may_13 VALUES LESS THAN ('2013-06-01') ENGINE = InnoDB,
PARTITION date_jun_13 VALUES LESS THAN ('2013-07-01') ENGINE = InnoDB,
PARTITION date_oth VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB);
Below is a sample query which was used for doing the load testing to test the performance.
SELECT column8, column9
FROM myTable
WHERE column2 = ? AND column3 = ? AND column4 =? AND column5 = ? AND column7 = ? AND column6 = ?
LIMIT 1
The ? above were replaced with real values present in the database for testing.
Please note that the number of records in ‘myTable’ table is around 342 million, and the number of test data used for doing the performance testing is about 2 million.
However, as I said, the performance after implementing partitioning was reduced by a shocking 3 times. Any idea what may have caused this?
Also, please let me know if doing any further change in the table structure or indexing may help resolve this issue.
Remember, the goal of partitioning is to speed up queries where your query limits the number of partitions the result could be found in. I think the issues is the column6 = ? in your test query. I'm guessing that requiring an exact value, rather than a range, for column6 reduces your result set to very few values. Therefore, in the process of narrowing down the partitions, you've already essentially found the result. And since the indexes are split across the multiple partitions, there is a cost to that narrowing process.
The kind of query you would expect to benefit from partitioning on column6 is one that returns a range of values, limited to a small number of partitions. For example, try something like this as a test query:
SELECT column8, column9
FROM myTable
WHERE column6 < ? AND column6 > ? AND column2 = ? AND column3 = ? AND column4 =? AND column5 = ?
where that column6 range spans around 2 partitions, and the total result count is expected to be reasonably large.
This might help: http://dev.mysql.com/tech-resources/articles/partitioning.html
Looking at this, there's several things I would consider.
The first, and most glaring issue is that the big benefit from partitioning comes when you spread your data across different devices (disks) - and there's no evidence of that from the code posted.
Next, your partitioning is hard coded to specific date ranges - hence you're going to have to come up with a better plan when date_oth starts to fill up.
AND column6 = ?
So you only tested the performance of data from single partition? At best this will be no faster than with all the data in one table.
As Nathan points out, you are partitioning by column 6 - but you don't have this at the front of any of your indexes, hence the DBMS must search the index in each partition to find the data - this is ilkely the reason why the performance is so poor. (I disagree that partitioning only helps range queries).
Related
I have a table that contains a month and a year column.
I have a query which usually looks something like WHERE month=1 AND year=2022
Given how large this table is i would like to make it more efficient using partitions and sub partitions.
table 1
Querying the data i need took around 2 minutes and 30 seconds.
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
)
Partitioning by "month"
Querying the data i need took around 21 seconds (big improvement).
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (`month`)
(PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB) */
I would like to see if i can improve the performance even further by partitioning by year and then subpartitioning by month. How can i do that?
I'm not sure the following question Partition by year and sub-partition by month mysql is relevant with no marked answers and that question looks to be particular to mysql 5* and php. Im asking about mysql 8, are there no changes since then regarding partioning/subpartioning/list columns/range columns etc? which could help me.
Broader query im making
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12 AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4') # only ever 4 types usually all 4 are present in the query
To answer your question directly, below is example syntax that accomplishes the subpartitioning. Notice the PRIMARY KEY must include all columns used for partitioning or subpartitioning. Read the manual on subpartitioning for more information: https://dev.mysql.com/doc/refman/8.0/en/partitioning-subpartitions.html
Schema (MySQL v8.0)
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`, `year`),
KEY `idx_month_year` (`month`,`year`, `score`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY LIST (`month`)
SUBPARTITION BY HASH(`year`)
SUBPARTITIONS 10 (
PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB
);
Using EXPLAIN on your query reveals that the query references only one subpartition.
Query #1
EXPLAIN
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12
AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4');
id
select_type
table
partitions
type
possible_keys
key
key_len
ref
rows
filtered
Extra
1
SIMPLE
table_1
p12_p12sp2
range
idx_month_year
idx_month_year
11
1
100
Using index condition
The partitions field of the EXPLAIN shows that it accesses only partition p12_p12sp2. The year the query references, 2022, modulus the number of subpartitions, 10, will read from the subpartition 2.
In addition to the partitioning by month and year, it is also helpful to use an index. In this case, I added score to the index so it would filter out rows where score <= 0. The note in the EXPLAIN "Using index condition" shows that it is delegating further filtering on entity_type to the storage engine. Though in your example, you said there are only four values for entity type, and all four are selected, so that condition won't filter out any rows anyway.
View on DB Fiddle
Re your questions in comments below:
a little bit confused on SUBPARTITIONS 10 , why 10
It's just an example. You can choose a different number of subpartitions. Whatever you feel is required to reduce the search as much as you want.
To be honest, I've never encountered a situation that required subpartitioning at all, if the search is also optimized with indexes. So I have no guidance on what is an appropriate number of subpartitions.
It's your responsibility to test performance until you are satisfied.
also bit confusd on the partition name p12_p12sp2 how do i know it selected the partition with year 2022 from looking at that?
The query has a condition year = 2022.
There are 10 subpartitions in my example.
Hash partitioning just uses the integer value to be partitioned, modulus the number of partitions.
2022 modulus 10 is 2. Hence the partition ending in ...sp2 is the one used.
I also came across this anothermysqldba.blogspot.com/2014/12/… do you know how yours differs from what it shown here ( bare in mind that blog is from 2014)
They chose to name the subpartitions. There's no need to do that.
would there be any performance difference in having a single date e.g (2022-12-21) instead of sepreate columns month and year.
That depends on the query, and I'll leave it to you to test. Any predictions I make won't be accurate with your data on your server.
i can also see that you partition by month and subpartition by year, as oppose to partition by year and subpartition by month. can you explain the reasoning?
Subpartitioning works only if the outer partitions are LIST or RANGE partitions, and the subpartitions are HASH or KEY partitions. This is in the manual page I linked to.
There are a finite number of months (12). This makes it easy to partition by LIST as you did. You won't ever need more partitions. If you had partitioned by YEAR as the outer partition, you would have needed to specify year values in the list, and this is a growing set, so you would periodically have to alter the table to extend the list or range to account for new years.
Whereas when partitioning by HASH for the subpartitioning, the new year values are mapped into the finite set of subpartitions, so it's okay that it's not a finite list. You won't have to alter table to repartition (unless you want to change the number of subpartitions).
Splitting a date into columns is usually counterproductive. It is much easier to split during SELECT.
PARTITIONing is usually useless for performance of any SELECT.
When partitioning (or unpartitioning), the indexes usually need changing.
For that query, I recommend a combined date column,
WHERE date >= '2022-01-01'
AND date < '2022-01-01' + INTERVAL 1 MONTH
and some INDEX starting with date.
(You probably have other queries; let's see some of them; they may need a different index.)
Covering index -- This is an index that contains all the columns found anywhere in the SELECT. It is may be better (faster) than having only the columns needed for WHERE or WHERE + GROUP BY + ORDER BY. It depends on a lot of variables.
Order of columns in an index (or PK): The leftmost column(s) have priority. That is the order of the index rows on disk. PK(id, date) is useful if looking up by id (in the WHERE), but not if you are just searching by date.
Sargable -- sargable -- Hiding a column in a function disables the use of an index. That is MONTH(date) cannot use INDEX(date).
Blogs -- Index Cookbook and Partition
Test plan
I recommend you time all your queries against a variety of Create Tables.
For the WHERE clause:
The order of ANDs does not matter.
When using IN, a single value os equivalent to = and optimizes better. Multiple values may optimize more poorly. As Bill hints at, when the IN list contains all the options, you should eliminate the clause since the Optimizer is not smart enough. So, be sure to test with 1 and/or many items, so as to be realistic to your app.
For the table
Try Partition BY year + Subpartition by month.
Try Partition by a column that is the combination of year and month.
Try without partitioning.
For indexes
Order of the columns (in a composite index) does matter, so try different orderings.
When partitioning, be sure to tack onto the end of the PK the partition key(s).
A partitioned table needs different indexes than a non-partitioned table. That is, what works well for one may work poorly for the other.
Simply use something like this pattern to test various layouts:
CREATE TABLE (( a new layout with or without partitioning and with indexes ))
INSERT INTO test_table SELECT ... FROM real_table;
Change the "..." to adapt to any extra/missing columns in test_table
SELECT ...
Run various 'real' queries
Run each query twice (caching sometimes messes with the timing)
Report the results -- If you provide sufficient info (CREATE TABLE and SELECT), I may have suggestions on further speeding up the test (whether it is partitioned or not).
The I have a table which will receive 45-60 million rows of IOT type data a year. The initial desire is to never delete data as we might use it for different types of "big data analysis". Today this table needs to support our online application. The app needs fast query times for data that is usually within the last 30 or 90 days. So I was thinking that partitioning might be a good idea.
Our current thinking is to use an 'aging' column, called partition_id in this case. Records within the last 30 days are partition_id = 0. Records 31 days to 90 days are partition_id = 1 and everything else is in partition_id = 2.
All queries will 'know' which partition_id they want to use. Within that, queries are always by sensor_id, badge_id, etc (see indexes) all the sensor_ids or badge_id within a group i.e. sensor_id in ( 3, 15, 35, 100, 1024) etc.
Here's the table definition
CREATE TABLE 'device_messages' (
'id' int(10) unsigned NOT NULL AUTO_INCREMENT,
'partition_id' tinyint(3) unsigned NOT NULL DEFAULT '0',
'customer_id' int(10) unsigned NOT NULL,
'unix_timestamp' double(12, 2) NOT NULL,
'timestamp' timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
'timezone_id' smallint(5) unsigned NOT NULL,
'event_date' date NOT NULL,
'is_day_shift' tinyint(1) unsigned NOT NULL,
'msg_id' tinyint(3) unsigned NOT NULL,
'sensor_id' int(10) unsigned NOT NULL,
'sensor_role_id' int(10) unsigned NOT NULL,
'sensor_box_build_id' int(10) unsigned NOT NULL,
'gateway_id' int(10) unsigned NOT NULL,
'location_hierarchy_id' int(10) unsigned NOT NULL,
'group_hierarchy_id' int(10) unsigned DEFAULT NULL,
'badge_id' int(10) unsigned NOT NULL,
'is_badge_deleted' tinyint(1) DEFAULT NULL,
'user_id' int(10) unsigned DEFAULT NULL,
'is_user_deleted' tinyint(1) DEFAULT NULL,
'badge_battery' double unsigned DEFAULT NULL,
'scan_duration' int(10) unsigned DEFAULT NULL,
'reading_count' tinyint(3) unsigned DEFAULT NULL,
'median_rssi_reading' tinyint(4) DEFAULT NULL,
'powerup_counter' int(10) unsigned DEFAULT NULL,
'tx_counter' int(10) unsigned DEFAULT NULL,
'activity_counter' int(10) unsigned DEFAULT NULL,
'still_counter' int(10) unsigned DEFAULT NULL,
'created_at' timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY ('id', 'partition_id', 'sensor_id', 'event_date'),
KEY 'sensor_id_query_index' ('partition_id', 'sensor_id', 'event_date'),
KEY 'badge_id_query_index' ('partition_id', 'badge_id', 'event_date'),
KEY 'location_hierarchy_id_query_index' ('partition_id', 'location_hierarchy_id', 'event_date'),
KEY 'group_hierarchy_id_query_index' ('partition_id', 'group_hierarchy_id', 'event_date')
) ENGINE = InnoDB AUTO_INCREMENT = 1 DEFAULT CHARSET = utf8 COLLATE = utf8_unicode_ci
PARTITION BY RANGE (partition_id)
SUBPARTITION BY HASH (sensor_id)
(PARTITION fresh VALUES LESS THAN (1)
(SUBPARTITION f0 ENGINE = InnoDB,
SUBPARTITION f1 ENGINE = InnoDB,
SUBPARTITION f2 ENGINE = InnoDB,
SUBPARTITION f3 ENGINE = InnoDB,
SUBPARTITION f4 ENGINE = InnoDB,
SUBPARTITION f5 ENGINE = InnoDB,
SUBPARTITION f6 ENGINE = InnoDB,
SUBPARTITION f7 ENGINE = InnoDB,
SUBPARTITION f8 ENGINE = InnoDB,
SUBPARTITION f9 ENGINE = InnoDB),
PARTITION archive VALUES LESS THAN (2)
(SUBPARTITION a0 ENGINE = InnoDB,
SUBPARTITION a1 ENGINE = InnoDB,
SUBPARTITION a2 ENGINE = InnoDB,
SUBPARTITION a3 ENGINE = InnoDB,
SUBPARTITION a4 ENGINE = InnoDB,
SUBPARTITION a5 ENGINE = InnoDB,
SUBPARTITION a6 ENGINE = InnoDB,
SUBPARTITION a7 ENGINE = InnoDB,
SUBPARTITION a8 ENGINE = InnoDB,
SUBPARTITION a9 ENGINE = InnoDB),
PARTITION deep_archive VALUES LESS THAN MAXVALUE
(SUBPARTITION C0 ENGINE = InnoDB,
SUBPARTITION C1 ENGINE = InnoDB,
SUBPARTITION C2 ENGINE = InnoDB,
SUBPARTITION C3 ENGINE = InnoDB,
SUBPARTITION C4 ENGINE = InnoDB,
SUBPARTITION C5 ENGINE = InnoDB,
SUBPARTITION C6 ENGINE = InnoDB,
SUBPARTITION C7 ENGINE = InnoDB,
SUBPARTITION C8 ENGINE = InnoDB,
SUBPARTITION C9 ENGINE = InnoDB)) ;
This table definition is currently working with 16 million rows of data and queries seem to be fast. However, I'm concerned about the long term sustainability of this implementation. Plus I now see that we are doing a lot of churn on the partitions as we 'age' the records by updating the partition_id of 10s of thousands of records per week.
The queries will almost always be a variant of this:
SELECT * FROM device_messages
WHERE partition_id = 0
AND 'event_date' BETWEEN '2019-08-07' AND '2019-08-13'
AND 'sensor_id' in ( 3317, 3322, 3323, 3327, 3328, 3329, 3331, 3332, 3333, 3334, 3335, 3336, 3337, 3338, 3339, 3340, 3341, 3342 )
ORDER BY 'unix_timestamp' asc
There could be as few as one sensor_id in the list but often will be several.
I've spent hours of time researching partitioning but haven't found an example or discussion of partitioning for exactly this use case. Since, we're using the artificial aging column of partition_id in this way I also realize that I can't do any true manipulation of the partitions, so I think I'm losing at least some of the value of partitioning.
Advice on partitioning schemes or even alternative approaches would be greatly appreciated.
PARTITIONing is not a performance panacea.
Not deleting? OK, the main use (DROP PARTITION is faster than DELETE) is not available.
Summary Tables is the answer to Data Warehouse performance problems. See http://mysql.rjweb.org/doc.php/summarytables
(Now I will read the Question in detail and any answers; maybe I will come back in have something to change.)
Schema critique
Since you anticipate millions of rows, shrinking datatypes is rather important.
customer_id is a 4-byte integer. If you don't anticipate more than a few thousand, use a 2-byte SMALLINT UNSIGNED. See also MEDIUMINT UNSIGNED. Ditto for all the other INTs.
'unix_timestamp' double(12, 2) is quite strange. What's wrong with TIMESTAMP(2), which would be smaller?
'badge_battery' double -- Excessive resolution? DOUBLE is 8 bytes; FLOAT is 4 and has ~7 signficant digits.
Most columns are NULLable. Are they really optional? (NULL has a tiny overhead; use NOT NULL where practical.)
When rows age out of being "fresh", will you do a massive UPDATE to change that column? Please consider the large impact that statement will have. It is better to create new partitions and change the queries. This works especially well if you have AND some_date > some_column and that column is PARTITION BY RANGE(TO_DAYS(..)).
I have yet to see a justification for SUBPARTITIONing.
Non-partition
Given that this is typical:
SELECT * FROM device_messages
WHERE partition_id = 0
AND 'event_date' BETWEEN '2019-08-07' AND '2019-08-13'
AND 'sensor_id' in ( 3317, 3322, 3323, 3327, 3328, 3329, 3331, 3332,
3333, 3334, 3335, 3336, 3337, 3338, 3339, 3340, 3341, 3342 )
ORDER BY 'unix_timestamp' asc
I would suggest the following:
No partitioning (and no partition_key)
Toss event_date; use unix_timestamp instead
Change the select as follows:
...
SELECT * FROM device_messages
WHERE `unix_timestamp` >= '2019-08-07'
AND `unix_timestamp` < '2019-08-07' + INTERVAL 1 WEEK
AND sensor_id in ( 3317, 3322, 3323, 3327, 3328, 3329, 3331, 3332,
3333, 3334, 3335, 3336, 3337, 3338, 3339, 3340, 3341, 3342 )
ORDER BY `unix_timestamp` asc
And add
INDEX(sensor_id, `unix_timestamp`)
The, I think the following will be the processing. (Note: It may be worse than this in some older versions of MySQL/MariaDB.)
Drill down the BTree for the new index to [3317, '2019-08-07']
Scan forward (collecting rows into a temp) for the week
Repeate 1,2 for each other sensor_id.
Sort the temp table (to satisfy the ORDER BY).
Deliver result rows.
The key point here is that it reads only exactly the rows that need to be delivered (plus one extra row per sensor to realize the week is over). Since this is a huge table, this is as good as it gets
The extra sort (cf Explain's "filesort") is necessary because there is no way to fetch the rows in ORDER BY order.
There is still another optimization...
In the above, the index was in order, but the data was not. We can fix that as follows:
PRIMARY KEY(sensor_id, `unix_timestamp`, id), -- (`id` adds uniqueness)
INDEX(id), -- to keep AUTO_INCREMENT happy
(and skip my previous index suggestion)
This modification will become especially beneficial if the table becomes bigger than the buffer_pool. This is because of the "clustering" provided by the revised PK.
More Normalization
I suspect that many of those ~30 columns are identical from row to row, especially for the same sensor (aka 'device'?). If I am correct, then you 'should' remove those columns from this huge table and put them into another table, de-dupped.
This would save even more space than tweaking INTs, etc.
Summary Table
Again, using your query, let's discuss what summary table would be useful. But first, I don't see what would be useful to summarize. I would expect to see a device_value FLOAT or something like that. I'll use that as a hypothetical example:
CREATE TABLE Summary (
event_date DATE NOT NULL, -- reconstructed from `unix_timestamp`
sensor_id ...,
ct SMALLINT UNSIGNED, -- number of readings for the day
sum_value FLOAT NOT NULL, -- SUM(device_value)
sum2 -- if you need standard deviation
min_value, etc -- if you want those
PRIMARY KEY(sensor_id, event_date)
) ENGINE=InnoDB;
Once a day:
INSERT INTO Summary (sensor_id, event_date, ct, sum_value, ...)
SELECT sensor_id, DATE(`unix_timestamp`),
COUNT(*), SUM(device_value), ...
FROM device_messages
WHERE `unix_timestamp` >= CURDATE() - INTERVAL 1 DAY
AND `unix_timestamp` < CURDATE()
GROUP BY sensor_id;
(There are more robust ways; there are more timely ways; etc.) Or you may want to summarize by hour instead of day. In any case, you can get arbitrary date range by summing the sums from daily summaries.
Average: SUM(sum_value) / SUM(ct)
Reduncancy?
unix_timestamp, timestamp, event_date, created_at -- all have the "same" value and meaning??
A note on DATE -- it is almost always easier to pick apart a DATETIME or TIMESTAMP than to have an extra column, and especially than having both DATE and TIME.
Without a date column, checking for all readings for one day needs to look something like:
WHERE `dt` >= '2019-08-07'
AND `dt` < '2019-08-07' + INTERVAL 1 DAY
I am optimizing a database with almost no knowledge for my bachelor thesis. In no way i want to let you do the work for me, but i have some questions which no one could answer so far.
Table Structure:
data_inc, CREATE TABLE 'data_inc' (
'id' bigint(20) NOT NULL AUTO_INCREMENT,
'id_para' int(10) unsigned NOT NULL DEFAULT '0',
't_s' int(11) unsigned NOT NULL DEFAULT '0',
't_ms' smallint(6) unsigned NOT NULL DEFAULT '0',
't_ns' bigint(20) unsigned NOT NULL DEFAULT '0',
'id_inst' smallint(6) NOT NULL DEFAULT '1',
'value' varchar(255) NOT NULL DEFAULT '',
'isanchor' tinyint(4) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY ('id','t_ns'),
KEY 't_s' ('t_s'),
KEY 't_ns' ('t_ns')
) ENGINE=MyISAM AUTO_INCREMENT=2128295174 DEFAULT CHARSET=latin1
/*
!50100 PARTITION BY RANGE (t_ns)
(PARTITION 19_02_2015_23_59 VALUES LESS THAN (1424386799000000000) ENGINE = MyISAM,
PARTITION 20_02_2015_23_59 VALUES LESS THAN (1424473199000000000) ENGINE = MyISAM,
PARTITION 21_02_2015_23_59 VALUES LESS THAN (1424559599000000000) ENGINE = MyISAM,
PARTITION 22_02_2015_23_59 VALUES LESS THAN (1424645999000000000) ENGINE = MyISAM,
PARTITION 23_02_2015_23_59 VALUES LESS THAN (1424732399000000000) ENGINE = MyISAM,
PARTITION 24_02_2015_23_59 VALUES LESS THAN (1424818799000000000) ENGINE = MyISAM,
PARTITION 25_02_2015_23_59 VALUES LESS THAN (1424905199000000000) ENGINE = MyISAM,
PARTITION 05_03_2015_23_59 VALUES LESS THAN (1425596399000000000) ENGINE = MyISAM,
PARTITION 13_03_2015_23_59 VALUES LESS THAN (1426287599000000000) ENGINE = MyISAM,
PARTITION 14_03_2015_23_59 VALUES LESS THAN (1426373999000000000) ENGINE = MyISAM,
PARTITION 15_03_2015_23_59 VALUES LESS THAN (1426460399000000000) ENGINE = MyISAM,
PARTITION 16_03_2015_23_59 VALUES LESS THAN (1426546799000000000) ENGINE = MyISAM,
PARTITION 17_03_2015_23_59 VALUES LESS THAN (1426633199000000000) ENGINE = MyISAM,
PARTITION 18_03_2015_23_59 VALUES LESS THAN (1426719599000000000) ENGINE = MyISAM)
*/
The system is currently logging up to 4000 Parameters per second into a database (differnet tables, which one is decided in stored procedures). Every 5 minutes, 1 hour and daily different scripts are called to analyse the logging data, during that time data is written to the tables. This results in some heavy loads right now. Is there a chance that switching from MyISAM to InnoDB (or others) that the performance improves?
Thanks for your help!
For logging quickly followed by analysis...
Gather the data into a MyISAM table with no indexes. After 5 min (1.2M rows!):
Analyze it into InnoDB "Summary Table(s)".
DROP TABLE or TRUNCATE TABLE.
The analysis would be put into other table(s). These would have summary information and be much smaller than 1.2M rows.
To get hourly data, summarize the summary table(s). But don't create "hourly" tables; simply fetch and recalculate as needed.
Here are some related discussions: High speed ingestion and Summary Tables.
I have a table storing weekly viewing statistic for around 40K businesses, the tables passed 2.2M records and is starting to slow things down, I'm looking at partitioning it to speed things up but I'm not sure how best to do it.
My ORM requires an id field as a primary key, but that field has no relevance to the data, I've been using a unique index on fields for year, week number and business ID.
As I need the primary key to be involved in the partition map, I'm not sure how best to organise this (I've never used partitioning before).
Currently I have...
CREATE TABLE `weekly_views` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`business_id` int(11) NOT NULL,
`year` smallint(4) UNSIGNED NOT NULL,
`week` tinyint(2) UNSIGNED NOT NULL,
`hits` int(5) NOT NULL,
`created` timestamp NOT NULL ON UPDATE CURRENT_TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
`updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE `search` USING BTREE (business_id, `year`, `week`),
UNIQUE `id` USING BTREE (id, `week`)
) ENGINE=`InnoDB` AUTO_INCREMENT=2287009 DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci ROW_FORMAT=COMPACT CHECKSUM=0 DELAY_KEY_WRITE=0 PARTITION BY LIST(week) PARTITIONS 52 (PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
(5 ... 51)
PARTITION p52 VALUES IN (52) ENGINE = InnoDB);
One partition per week seemed the only logical way to break them up. Am I right that when I search for a record for the current week/business using 'business_id = xx and week = xx and year = xx' it's going to know which partition to use without searching them all? But, when I get the result and save it via the ORM, it's going to use the id field and not know which partition to use?
I guess I could use a custom query to insert or update (I haven't originally done this as the ORM doesn't support it).
Am I going the right way about this, or is there a better way to partition a table like this?
Thanks for your help!
As long as the query has week column in WHERE clause, MySQL will look in correct partition. However, weeks repeat each year and you'll end up with data from different years in the same partition.
Also you need 53 not 52 partitions, as you'll need to deal with leap years.
I have the following table structure with live data in it:
CREATE TABLE IF NOT EXISTS `userstatistics` (
`user_id` int(10) unsigned NOT NULL,
`number_logons` int(7) unsigned NOT NULL DEFAULT '0',
`number_profileminiviews` int(7) unsigned NOT NULL DEFAULT '0',
`number_profilefullviews` int(7) unsigned NOT NULL DEFAULT '0',
`number_mailsreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_interestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_favouratesreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_friendshiprequestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_imchatrequestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`yearweek` int(6) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`user_id`,`yearweek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I want to convert this to a partitioned table with the following structure:
CREATE TABLE IF NOT EXISTS `userstatistics` (
`user_id` int(10) unsigned NOT NULL,
`number_logons` int(7) unsigned NOT NULL DEFAULT '0',
`number_profileminiviews` int(7) unsigned NOT NULL DEFAULT '0',
`number_profilefullviews` int(7) unsigned NOT NULL DEFAULT '0',
`number_mailsreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_interestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_favouratesreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_friendshiprequestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`number_imchatrequestreceived` int(7) unsigned NOT NULL DEFAULT '0',
`yearweek` int(6) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`user_id`,`yearweek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (yearweek)
(PARTITION userstats_201108 VALUES LESS THAN (201108) ENGINE = InnoDB,
PARTITION userstats_201109 VALUES LESS THAN (201109) ENGINE = InnoDB,
PARTITION userstats_201110 VALUES LESS THAN (201110) ENGINE = InnoDB,
PARTITION userstats_201111 VALUES LESS THAN (201111) ENGINE = InnoDB,
PARTITION userstats_201112 VALUES LESS THAN (201112) ENGINE = InnoDB,
PARTITION userstats_201113 VALUES LESS THAN (201113) ENGINE = InnoDB,
PARTITION userstats_201114 VALUES LESS THAN (201114) ENGINE = InnoDB,
PARTITION userstats_201115 VALUES LESS THAN (201115) ENGINE = InnoDB,
PARTITION userstats_201116 VALUES LESS THAN (201116) ENGINE = InnoDB,
PARTITION userstats_201117 VALUES LESS THAN (201117) ENGINE = InnoDB,
PARTITION userstats_201118 VALUES LESS THAN (201118) ENGINE = InnoDB,
PARTITION userstats_201119 VALUES LESS THAN (201119) ENGINE = InnoDB,
PARTITION userstats_201120 VALUES LESS THAN (201120) ENGINE = InnoDB,
PARTITION userstats_201121 VALUES LESS THAN (201121) ENGINE = InnoDB,
PARTITION userstats_max VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
How can I do this conversion?
Simply changing the first line of the second SQL statement to
ALTER TABLE 'userstatistics' (
Would this do it?
Going from MySQL 5.0 to 5.1.
First, you need to be running MySQL 5.1 or later. MySQL 5.0 does not support partitioning.
Second, please be aware of the difference between single-quotes (which delimit strings and dates) and back-ticks (which delimit table and column identifiers in MySQL). Use the correct type where appropriate. I mention this, because your example uses the wrong type of quotes:
ALTER TABLE 'userstatistics' (
That should be:
ALTER TABLE `userstatistics` (
Finally, yes, you can restructure a table into partitions with ALTER TABLE. Here's an exact copy & paste from a statement I tested on MySQL 5.1.57:
ALTER TABLE userstatistics PARTITION BY RANGE (yearweek) (
PARTITION userstats_201108 VALUES LESS THAN (201108) ENGINE = InnoDB,
PARTITION userstats_201109 VALUES LESS THAN (201109) ENGINE = InnoDB,
PARTITION userstats_201110 VALUES LESS THAN (201110) ENGINE = InnoDB,
PARTITION userstats_201111 VALUES LESS THAN (201111) ENGINE = InnoDB,
PARTITION userstats_201112 VALUES LESS THAN (201112) ENGINE = InnoDB,
PARTITION userstats_201113 VALUES LESS THAN (201113) ENGINE = InnoDB,
PARTITION userstats_201114 VALUES LESS THAN (201114) ENGINE = InnoDB,
PARTITION userstats_201115 VALUES LESS THAN (201115) ENGINE = InnoDB,
PARTITION userstats_201116 VALUES LESS THAN (201116) ENGINE = InnoDB,
PARTITION userstats_201117 VALUES LESS THAN (201117) ENGINE = InnoDB,
PARTITION userstats_201118 VALUES LESS THAN (201118) ENGINE = InnoDB,
PARTITION userstats_201119 VALUES LESS THAN (201119) ENGINE = InnoDB,
PARTITION userstats_201120 VALUES LESS THAN (201120) ENGINE = InnoDB,
PARTITION userstats_201121 VALUES LESS THAN (201121) ENGINE = InnoDB,
PARTITION userstats_max VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
Note that this causes a table restructure, so if you already have a lot of data in this table, it will take a while to run. Exactly how long depends on how much data you have, and your hardware speed, and other factors. Be aware that while the table is being restructured, it is locked and unavailable for reading and writing by other queries.
Look this
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html about the alter table.
Then in particular the alter table.
ADD/DROP/COALESCE/REORGANIZE partition sql provides almost all the functions to manage your partitions.
note that hash can be only used to integer.
ALTER TABLE ... ADD PARTITION creates no temporary table except when used with NDB tables. ADD or DROP operations for RANGE or LIST partitions are immediate operations or nearly so. ADD or COALESCE operations for HASH or KEY partitions copy data between changed partitions; unless LINEAR HASH or LINEAR KEY was used, this is much the same as creating a new table (although the operation is done partition by partition). REORGANIZE operations copy only changed partitions and do not touch unchanged ones.