I have a very large table on a MySQL 5.6.10 instance (roughly 480 million rows).
The storage engine is InnoDB. (Table and DB Default).
The table was partitioned by hash of merchantId (bigint: a kind of client identifier), which helped when queries related to a single merchant. Due to significant performance degradation when queries spanned multiple merchants, I decided to repartition the table by range on ACTION_DATE (the date on which an activity occurred). Thinking I was being clever, I also decided to add a few (5) new fields for future use (unused_varchar1 varchar(200), etc.). Since the table is so large, adding new fields essentially requires a rebuild anyway, so why not...
I created the new table structure as _new and dumped the existing table to a file on a secondary server using mysqldump. I then used an awk script to adjust the table name and a few other details to fit the new table (changing tableName to tableName_new), and started the load.
The existing table was approximately 430 GB. The text file was similar, at about 403 GB. I was therefore surprised that the new table ended up taking about 840 GB! (Based on the Linux file size of the .ibd files.)
So, I have 2 basic questions, which really amount to why and what now...
I imagine that the new table is larger because the dump file was in the order of the previous partitioning (merchantId), while the load inserted into the new partitioning (activity date), creating a semi-random insertion order. The randomness led MySQL to leave plenty of space (roughly 50%) in the pages for future insertions. (I'm a little fuzzy on the terminology here, having spent much more of my career with SQL Server databases than with MySQL ones...) I'm not able to find any internal statistics in MySQL for free space per page. The INFORMATION_SCHEMA.TABLES DATA_FREE stat is an unconvincing 68 MB.
If it helps these are the relevant stats from I_S.TABLES:
TABLE_TYPE: BASE TABLE
Engine: InnoDB
VERSION: 10
ROW_FORMAT: Compact
TABLE_ROWS: 488,094,271
AVG_ROW_LENGTH: 1,564
DATA_LENGTH: 763,509,358,592 (711 GB)
INDEX_LENGTH: 100,065,574,912 (93.19 GB)
DATA_FREE: 68,157,440 (0.06 GB)
I realize that that doesn't add up to 840 GB, but as I said, that was the size of the .ibd files which seems to be slightly different than the I_S.TABLES stats. Either way, it is significantly more than the text dump file.
I digress...
My question is whether my theory about the repartitioning explains the roughly doubled size, or whether there is another explanation. I think the extra columns (2 Bigint, 2 Varchar(200), 1 Date) are not the culprit since they are all null. My napkin calculation was that the additional columns would add < 9 GB. Likewise, one additional index on UID should be a relatively small addition.
The follow-up question is what I can do now if I want to compact the table. (The server now has only about 385 GB free...)
If I repeated the procedure (dump to file, reload), this time in the current partition order, would I end up with a table more like the size of my original table ~430 GB?
Following are relevant parts of DDL.
OLD TABLE:
CREATE TABLE table_name (
`AUTO_SEQ` bigint(20) NOT NULL,
`MERCHANT_ID` bigint(20) NOT NULL,
`AFFILIATE_ID` bigint(20) DEFAULT NULL,
`PROGRAM_ID` bigint(20) NOT NULL,
`ACTION_DATE` date DEFAULT NULL,
`UID` varchar(128) DEFAULT NULL,
... additional columns ...
PRIMARY KEY (`AUTO_SEQ`,`MERCHANT_ID`,`PROGRAM_ID`),
KEY `oc_rpt_mpad_idx` (`MERCHANT_ID`,`PROGRAM_ID`,`ACTION_DATE`,`AFFILIATE_ID`),
KEY `oc_rpt_mapd` (`MERCHANT_ID`,`ACTION_DATE`),
KEY `oc_rpt_apda_idx` (`AFFILIATE_ID`,`PROGRAM_ID`,`ACTION_DATE`,`MERCHANT_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (merchant_id)
PARTITIONS 16 */
NEW TABLE:
CREATE TABLE `tableName_new` (
`AUTO_SEQ` bigint(20) NOT NULL,
`MERCHANT_ID` bigint(20) NOT NULL,
`AFFILIATE_ID` bigint(20) DEFAULT NULL,
`PROGRAM_ID` bigint(20) NOT NULL,
`ACTION_DATE` date NOT NULL DEFAULT '0000-00-00',
`UID` varchar(128) DEFAULT NULL,
... additional columns...
# NEW COLUMNS (ALL NULL)
`UNUSED_BIGINT1` bigint(20) DEFAULT NULL,
`UNUSED_BIGINT2` bigint(20) DEFAULT NULL,
`UNUSED_VARCHAR1` varchar(200) DEFAULT NULL,
`UNUSED_VARCHAR2` varchar(200) DEFAULT NULL,
`UNUSED_DATE1` date DEFAULT NULL,
PRIMARY KEY (`AUTO_SEQ`,`ACTION_DATE`),
KEY `oc_rpt_mpad_idx` (`MERCHANT_ID`,`PROGRAM_ID`,`ACTION_DATE`,`AFFILIATE_ID`),
KEY `oc_rpt_mapd` (`ACTION_DATE`),
KEY `oc_rpt_apda_idx` (`AFFILIATE_ID`,`PROGRAM_ID`,`ACTION_DATE`,`MERCHANT_ID`),
KEY `oc_uid` (`UID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE COLUMNS(ACTION_DATE)
(PARTITION p01 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
PARTITION p02 VALUES LESS THAN ('2012-04-01') ENGINE = InnoDB,
PARTITION p03 VALUES LESS THAN ('2012-07-01') ENGINE = InnoDB,
PARTITION p04 VALUES LESS THAN ('2012-10-01') ENGINE = InnoDB,
PARTITION p05 VALUES LESS THAN ('2013-01-01') ENGINE = InnoDB,
PARTITION p06 VALUES LESS THAN ('2013-04-01') ENGINE = InnoDB,
PARTITION p07 VALUES LESS THAN ('2013-07-01') ENGINE = InnoDB,
PARTITION p08 VALUES LESS THAN ('2013-10-01') ENGINE = InnoDB,
PARTITION p09 VALUES LESS THAN ('2014-01-01') ENGINE = InnoDB,
PARTITION p10 VALUES LESS THAN ('2014-04-01') ENGINE = InnoDB,
PARTITION p11 VALUES LESS THAN ('2014-07-01') ENGINE = InnoDB,
PARTITION p12 VALUES LESS THAN ('2014-10-01') ENGINE = InnoDB,
PARTITION p13 VALUES LESS THAN ('2015-01-01') ENGINE = InnoDB,
PARTITION p14 VALUES LESS THAN ('2015-04-01') ENGINE = InnoDB,
PARTITION p15 VALUES LESS THAN ('2015-07-01') ENGINE = InnoDB,
PARTITION p16 VALUES LESS THAN ('2015-10-01') ENGINE = InnoDB,
PARTITION p17 VALUES LESS THAN ('2016-01-01') ENGINE = InnoDB,
PARTITION p18 VALUES LESS THAN ('2016-04-01') ENGINE = InnoDB,
PARTITION p19 VALUES LESS THAN ('2016-07-01') ENGINE = InnoDB,
PARTITION p20 VALUES LESS THAN ('2016-10-01') ENGINE = InnoDB,
PARTITION p21 VALUES LESS THAN ('2017-01-01') ENGINE = InnoDB,
PARTITION p22 VALUES LESS THAN ('2017-04-01') ENGINE = InnoDB,
PARTITION p23 VALUES LESS THAN ('2017-07-01') ENGINE = InnoDB,
PARTITION p24 VALUES LESS THAN ('2017-10-01') ENGINE = InnoDB,
PARTITION p25 VALUES LESS THAN ('2018-01-01') ENGINE = InnoDB,
PARTITION p26 VALUES LESS THAN ('2018-04-01') ENGINE = InnoDB,
PARTITION p27 VALUES LESS THAN ('2018-07-01') ENGINE = InnoDB,
PARTITION p28 VALUES LESS THAN ('2018-10-01') ENGINE = InnoDB,
PARTITION p29 VALUES LESS THAN ('2019-01-01') ENGINE = InnoDB,
PARTITION p30 VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */
"adding new fields essentially requires a rebuild anyway, so why not"
I predict you will regret it.
"The existing table was approximately 430 GB."
According to size of .ibd? Or SHOW TABLE STATUS? Or the dump size, which would be bogus (see below).
"it is significantly more than the text dump file"
The lengths in TABLE STATUS include several flavors of overhead (BTree, free space, extra extents, etc), plus the indexes (which are not in the dump file).
Also, think about a BIGINT that contains 1234. The .ibd will use 8 bytes plus some overhead; the dump will have 5 ('1234', plus a comma). That leads to my next point...
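To make that comparison concrete, here is a minimal Python sketch. It is an illustration rather than a measurement of the actual table: it assumes the dump writes each integer as decimal text followed by one separator character, it ignores InnoDB's per-row and per-page overhead, and `dump_bytes` is a hypothetical helper name.

```python
BIGINT_BYTES = 8  # a BIGINT occupies 8 bytes in the row, whatever value it holds

def dump_bytes(value: int) -> int:
    """Bytes the value takes in a text dump: its decimal digits plus one comma."""
    return len(str(value)) + 1

for value in (1234, 1234567890123):
    print(value, "dump:", dump_bytes(value), "bytes; .ibd:", BIGINT_BYTES, "bytes plus overhead")
```

Small values are cheaper in the dump than in the table, and only values with eight or more digits cost more as text, so the dump size is a poor predictor of the on-disk size.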
Are there really more than 4 billion merchants? merchant_id is BIGINT (8 bytes); INT UNSIGNED is only 4 bytes and allows 0..4 billion.
What's in uid? If it is some sort of UUID, it seems awfully long.
Do you happen to have the "stats from I_S.TABLES" from the old table?
So far, I have not addressed "whether the repartioning explains the roughly doubled size".
"extra columns (2 Bigint, 2 Varchar(200), 1 Date)"
That's about 29 bytes per row (15GB of Data_length), perhaps less since they are NULL.
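The ballpark can be checked with a few lines of Python. This is only a sketch: it takes the 29 bytes per row quoted above as an upper bound and uses the TABLE_ROWS and DATA_LENGTH values from the question (TABLE_ROWS is itself an estimate for InnoDB).

```python
TABLE_ROWS = 488_094_271        # TABLE_ROWS from the I_S.TABLES output in the question
DATA_LENGTH = 763_509_358_592   # DATA_LENGTH from the same output
EXTRA_BYTES_PER_ROW = 29        # upper-bound figure quoted above; NULL values take less

extra_bytes = TABLE_ROWS * EXTRA_BYTES_PER_ROW
extra_gib = extra_bytes / 2**30
share = extra_bytes / DATA_LENGTH

print(f"extra columns: at most {extra_gib:.1f} GiB, {share:.1%} of DATA_LENGTH")
```

Even at the upper bound, the extra columns account for under 2% of the data size, so they cannot explain a doubling.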
You seem to be using the default ROW_FORMAT. I suspect this did not change in the conversion.
It is usually unwise to start an index with the "partition key" (merchant_id or action_date). This is because you are already "pruning" on that key; you are better off starting the index with something else. (Caveat: There are exceptions.)
Check the CHARACTER SET and datatype of the "additional columns". If something changed, that could be significant.
"would I end up with a table more like the size of my original table ~430 GB?"
Alas, until we figure out why it grew, I can't answer that question.
"I'm more interested in whether random insertion vs. the partition (ACTION_DATE) would lead to wasted space / half empty pages."
I recommend you try the following experiment. Do not use OPTIMIZE PARTITION; see http://bugs.mysql.com/bug.php?id=42822 . Instead, do this to defragment one partition (such as p02):
ALTER TABLE table_name REBUILD PARTITION p02;
You could do this SELECT before and after in order to see the change(s) to the PARTITIONs:
SELECT *
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'dbname' -- change as needed
AND TABLE_NAME = 'table_name' -- change as needed
ORDER BY PARTITION_ORDINAL_POSITION,
SUBPARTITION_ORDINAL_POSITION;
It's a generic query to get the table-status-like info for the partitions of one table.
If the REBUILD cuts the partition by about 50%, then we have the answer.
Generally, randomly inserting into a BTree should leave you with about 69% (not 50%) of the "full" size. Hence, I'm not 'expecting' this to be the solution/answer.
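A quick Python check of that reasoning, using the sizes reported in the question. The sketch assumes (and this is only an assumption) that the old 430 GB table was fully packed; if it was not, the expected size after random insertion would be even smaller.

```python
OLD_SIZE_GB = 430    # size of the old table, as reported in the question
NEW_SIZE_GB = 840    # size of the new .ibd files, as reported in the question
RANDOM_FILL = 0.69   # typical BTree page fill after random inserts, per the text above

# Size the new table would have if random insertion were the only effect.
expected_gb = OLD_SIZE_GB / RANDOM_FILL

# Page fill that would be needed to explain the observed size on its own.
implied_fill = OLD_SIZE_GB / NEW_SIZE_GB

print(f"expected about {expected_gb:.0f} GB; observed size implies {implied_fill:.0%} fill")
```

Random insertion alone would predict a table of roughly 620 GB, well short of the observed 840 GB, which is why fragmentation is unlikely to be the whole answer.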
I have a table that contains a month and a year column.
I have a query which usually looks something like WHERE month=1 AND year=2022
Given how large this table is, I would like to make it more efficient using partitions and subpartitions.
table 1
Querying the data I need took around 2 minutes and 30 seconds.
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
)
Partitioning by "month"
Querying the data I need took around 21 seconds (a big improvement).
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`),
KEY `idx_month_year` (`month`,`year`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (`month`)
(PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB) */
I would like to see if I can improve the performance even further by partitioning by year and then subpartitioning by month. How can I do that?
I'm not sure whether the question "Partition by year and sub-partition by month mysql" is relevant: it has no accepted answer and appears to be specific to MySQL 5.x and PHP. I'm asking about MySQL 8. Have there been any changes since then regarding partitioning, subpartitioning, LIST COLUMNS, RANGE COLUMNS, etc. that could help me?
The broader query I'm making:
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12 AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4') # only ever 4 types usually all 4 are present in the query
To answer your question directly, below is example syntax that accomplishes the subpartitioning. Notice the PRIMARY KEY must include all columns used for partitioning or subpartitioning. Read the manual on subpartitioning for more information: https://dev.mysql.com/doc/refman/8.0/en/partitioning-subpartitions.html
Schema (MySQL v8.0)
CREATE TABLE `table_1` (
`id` int NOT NULL AUTO_INCREMENT,
`entity_id` varchar(36) NOT NULL,
`entity_type` varchar(36) NOT NULL,
`score` decimal(4,3) NOT NULL,
`month` int NOT NULL DEFAULT '0',
`year` int NOT NULL DEFAULT '0',
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`,`month`, `year`),
KEY `idx_month_year` (`month`,`year`, `score`, `entity_type`)
) ENGINE=InnoDB AUTO_INCREMENT=21000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY LIST (`month`)
SUBPARTITION BY HASH(`year`)
SUBPARTITIONS 10 (
PARTITION p0 VALUES IN (0) ENGINE = InnoDB,
PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
PARTITION p5 VALUES IN (5) ENGINE = InnoDB,
PARTITION p6 VALUES IN (6) ENGINE = InnoDB,
PARTITION p7 VALUES IN (7) ENGINE = InnoDB,
PARTITION p8 VALUES IN (8) ENGINE = InnoDB,
PARTITION p9 VALUES IN (9) ENGINE = InnoDB,
PARTITION p10 VALUES IN (10) ENGINE = InnoDB,
PARTITION p11 VALUES IN (11) ENGINE = InnoDB,
PARTITION p12 VALUES IN (12) ENGINE = InnoDB
);
Using EXPLAIN on your query reveals that the query references only one subpartition.
Query #1
EXPLAIN
SELECT
table_1.entity_id AS entity_id,
table_1.entity_type,
table_1.score
FROM table_1
WHERE table_1.month = 12
AND table_1.year = 2022
AND table_1.score > 0
AND table_1.entity_type IN ('type1', 'type2', 'type3', 'type4');
id: 1
select_type: SIMPLE
table: table_1
partitions: p12_p12sp2
type: range
possible_keys: idx_month_year
key: idx_month_year
key_len: 11
ref: NULL
rows: 1
filtered: 100
Extra: Using index condition
The partitions field of the EXPLAIN shows that it accesses only subpartition p12_p12sp2. The year the query references, 2022, modulo the number of subpartitions, 10, is 2, so the query reads from subpartition 2 of partition p12.
In addition to the partitioning by month and year, it is also helpful to use an index. In this case, I added score to the index so it would filter out rows where score <= 0. The note in the EXPLAIN "Using index condition" shows that it is delegating further filtering on entity_type to the storage engine. Though in your example, you said there are only four values for entity type, and all four are selected, so that condition won't filter out any rows anyway.
Re your questions in comments below:
"a little bit confused on SUBPARTITIONS 10 , why 10"
It's just an example. You can choose a different number of subpartitions. Whatever you feel is required to reduce the search as much as you want.
To be honest, I've never encountered a situation that required subpartitioning at all, if the search is also optimized with indexes. So I have no guidance on what is an appropriate number of subpartitions.
It's your responsibility to test performance until you are satisfied.
"also bit confusd on the partition name p12_p12sp2 how do i know it selected the partition with year 2022 from looking at that?"
The query has a condition year = 2022.
There are 10 subpartitions in my example.
Hash partitioning just takes the integer value being partitioned, modulo the number of partitions.
2022 modulo 10 is 2. Hence the partition ending in ...sp2 is the one used.
"I also came across this anothermysqldba.blogspot.com/2014/12/… do you know how yours differs from what it shown here ( bare in mind that blog is from 2014)"
They chose to name the subpartitions. There's no need to do that.
"would there be any performance difference in having a single date e.g (2022-12-21) instead of sepreate columns month and year."
That depends on the query, and I'll leave it to you to test. Any predictions I make won't be accurate with your data on your server.
"i can also see that you partition by month and subpartition by year, as oppose to partition by year and subpartition by month. can you explain the reasoning?"
Subpartitioning works only if the outer partitions are LIST or RANGE partitions, and the subpartitions are HASH or KEY partitions. This is in the manual page I linked to.
There are a finite number of months (12). This makes it easy to partition by LIST as you did. You won't ever need more partitions. If you had partitioned by YEAR as the outer partition, you would have needed to specify year values in the list, and this is a growing set, so you would periodically have to alter the table to extend the list or range to account for new years.
Whereas when partitioning by HASH for the subpartitioning, the new year values are mapped into the finite set of subpartitions, so it's okay that it's not a finite list. You won't have to alter table to repartition (unless you want to change the number of subpartitions).
Splitting a date into columns is usually counterproductive. It is much easier to split during SELECT.
PARTITIONing is usually useless for performance of any SELECT.
When partitioning (or unpartitioning), the indexes usually need changing.
For that query, I recommend a combined date column,
WHERE date >= '2022-01-01'
AND date < '2022-01-01' + INTERVAL 1 MONTH
and some INDEX starting with date.
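If the application builds that range itself, a sketch in Python could look like the following. The `%s` placeholder style is an assumption about the database driver in use, and `month_range` is a hypothetical helper; the column name `date` follows the text above.

```python
from datetime import date

def month_range(year: int, month: int) -> tuple[date, date]:
    """Return the half-open range [first day of month, first day of next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

start, end = month_range(2022, 12)
where = "WHERE date >= %s AND date < %s"   # sargable: no function wraps the column
params = (start.isoformat(), end.isoformat())
print(where, params)
```

Using a half-open range avoids having to know how many days the month has and keeps the condition usable by an index that starts with date.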
(You probably have other queries; let's see some of them; they may need a different index.)
Covering index -- This is an index that contains all the columns found anywhere in the SELECT. It may be better (faster) than an index having only the columns needed for WHERE or WHERE + GROUP BY + ORDER BY. It depends on a lot of variables.
Order of columns in an index (or PK): The leftmost column(s) have priority. That is the order of the index rows on disk. PK(id, date) is useful if looking up by id (in the WHERE), but not if you are just searching by date.
Sargable -- Hiding a column in a function call disables the use of an index. That is, MONTH(date) cannot use INDEX(date).
Blogs -- Index Cookbook and Partition
Test plan
I recommend you time all your queries against a variety of Create Tables.
For the WHERE clause:
The order of ANDs does not matter.
When using IN, a single value is equivalent to = and optimizes better. Multiple values may optimize more poorly. As Bill hints at, when the IN list contains all the options, you should eliminate the clause since the Optimizer is not smart enough to do so. So, be sure to test with one and/or many items, so as to be realistic to your app.
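One way an application might apply the IN advice when it builds the query is sketched below in Python. `ALL_TYPES` and `entity_type_clause` are hypothetical names, the `%s` placeholder style is an assumption about the driver, and dropping the clause is only safe if the column can never hold a value outside `ALL_TYPES`.

```python
ALL_TYPES = ("type1", "type2", "type3", "type4")  # hypothetical: the full set of entity types

def entity_type_clause(selected):
    """Return (sql_fragment, params) for the entity_type filter."""
    chosen = sorted(set(selected))
    if not chosen:
        raise ValueError("at least one entity type is required")
    if set(chosen) == set(ALL_TYPES):
        return "", ()                                  # every type wanted: omit the filter
    if len(chosen) == 1:
        return " AND entity_type = %s", (chosen[0],)   # single value: plain equality
    placeholders = ", ".join(["%s"] * len(chosen))
    return f" AND entity_type IN ({placeholders})", tuple(chosen)

print(entity_type_clause(["type1"]))
print(entity_type_clause(["type2", "type1"]))
print(entity_type_clause(ALL_TYPES))
```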
For the table
Try Partition BY year + Subpartition by month.
Try Partition by a column that is the combination of year and month.
Try without partitioning.
For indexes
Order of the columns (in a composite index) does matter, so try different orderings.
When partitioning, be sure to tack onto the end of the PK the partition key(s).
A partitioned table needs different indexes than a non-partitioned table. That is, what works well for one may work poorly for the other.
Simply use something like this pattern to test various layouts:
CREATE TABLE (( a new layout with or without partitioning and with indexes ))
INSERT INTO test_table SELECT ... FROM real_table;
Change the "..." to adapt to any extra/missing columns in test_table
SELECT ...
Run various 'real' queries
Run each query twice (caching sometimes messes with the timing)
Report the results -- If you provide sufficient info (CREATE TABLE and SELECT), I may have suggestions on further speeding up the test (whether it is partitioned or not).
I'm trying to figure out how long it will take to partition a large table. I'm about 2 weeks into partitioning this table and don't have a good feeling for how much longer it will take. Is there any way to calculate how long this query might take?
The following is the query in question.
ALTER TABLE pIndexData REORGANIZE PARTITION pMAX INTO (
PARTITION p2022 VALUES LESS THAN (UNIX_TIMESTAMP('2023-01-01 00:00:00 UTC')),
PARTITION pMAX VALUES LESS THAN (MAXVALUE)
)
For context, the pIndexData table has about 6 billion records and the pMAX partition has roughly 2 billion records. This is an Amazon Aurora instance and the server is running MySQL 5.7.12. The DB Engine is InnoDB. The following is the table syntax.
CREATE TABLE `pIndexData` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`DateTime-UNIX` bigint(20) NOT NULL DEFAULT '0',
`pkl_PPLT_00-PIndex` int(11) NOT NULL DEFAULT '0',
`DataValue` decimal(14,4) NOT NULL DEFAULT '0.0000',
PRIMARY KEY (`pkl_PPLT_00-PIndex`,`DateTime-UNIX`),
KEY `id` (`id`),
KEY `DateTime` (`DateTime-UNIX`) USING BTREE,
KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE,
KEY `DataIndex` (`DataValue`),
KEY `pIndex-Data` (`pkl_PPLT_00-PIndex`,`DataValue`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (`DateTime-UNIX`)
(PARTITION p2016 VALUES LESS THAN (1483246800) ENGINE = InnoDB,
PARTITION p2017 VALUES LESS THAN (1514782800) ENGINE = InnoDB,
PARTITION p2018 VALUES LESS THAN (1546318800) ENGINE = InnoDB,
PARTITION p2019 VALUES LESS THAN (1577854800) ENGINE = InnoDB,
PARTITION p2020 VALUES LESS THAN (1609477200) ENGINE = InnoDB,
PARTITION p2021 VALUES LESS THAN (1641013200) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
In researching this question, I found that using Performance Schema could provide the answer. However, Performance Schema is not enabled on this server and enabling it requires a reboot. Rebooting is not an option because doing so could corrupt the database while this query is processing.
To get some sense of how long this will take, I recreated the pIndexData table in a separate Aurora instance. I then imported a sample set of data (about 3 million records). The sample set had DateTime values spread out over 2021, 2022 and 2023, with the lion's share of data in 2022. I then ran the same REORGANIZE PARTITION query and clocked the time it took to complete. The partition query took 2 minutes, 29 seconds. If the time scaled linearly with the number of records, I estimate the query on the original table should take roughly 18 hours. It seems the relationship is not linear. Even with a large margin of error, this is way off. Clearly, there are factors (perhaps many) I'm missing.
I'm not sure what else to try other than run the sample data test again but with an even larger data sample. Before I do, I'm hoping someone might have some insight how to best calculate how long this might take to finish.
Adding (or removing) partitioning will necessarily copy all the data over and rebuild all the tables. So, if your table is large enough to warrant partitioning (over 1M rows), it will take a noticeable amount of time.
In the case of reorganizing one (or a few) partitions (e.g., pMAX) "INTO ...", the relevant metric is the number of rows in pMAX.
What you should have done is to create the LESS THAN 2022 late in 2021 when PMAX was empty.
Recommend you reorganize PMAX into 2022 and 2023 and PMAX now. Again, the time is proportional to the size of PMAX. Then be sure to create 2024 in Dec 2023, when PMAX is still empty.
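A sketch of how such a statement could be generated, in Python. It rests on one assumption read off the existing boundaries: values such as 1483246800 correspond to midnight on 1 January at UTC-5, so new boundaries are computed the same way. The helper names are hypothetical, and the script only prints the statement; it does not run it.

```python
from datetime import datetime, timedelta, timezone

# Assumption: existing boundaries (e.g. 1483246800 for p2016) are midnight, 1 January, at UTC-5.
BOUNDARY_TZ = timezone(timedelta(hours=-5))

def year_end_boundary(year: int) -> int:
    """Unix timestamp at which partition p<year> ends (the start of year + 1)."""
    return int(datetime(year + 1, 1, 1, tzinfo=BOUNDARY_TZ).timestamp())

def reorganize_pmax(table: str, years: list[int]) -> str:
    """Build an ALTER TABLE that splits pMAX into one partition per year plus a new pMAX."""
    parts = [f"  PARTITION p{y} VALUES LESS THAN ({year_end_boundary(y)})," for y in years]
    parts.append("  PARTITION pMAX VALUES LESS THAN (MAXVALUE)")
    return f"ALTER TABLE {table} REORGANIZE PARTITION pMAX INTO (\n" + "\n".join(parts) + "\n);"

assert year_end_boundary(2016) == 1483246800  # sanity check against the existing p2016 boundary
print(reorganize_pmax("pIndexData", [2022, 2023]))
```

Checking the computed value against an existing boundary before generating new ones guards against a time zone mismatch between the script and the table definition.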
What is the advantage of partitioning by Year? Will you be purging old data eventually? (That may be the only advantage.)
As for your test -- was there nothing in the other partitions when you measured 2m29s? That test would be about correct. There may be a small burden in adding the 2021 index rows.
A side note: The following is unnecessary since there are 2 other indexes handling it:
KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE,
However, I don't know if dropping it would be "instant".
I am optimizing a database with almost no prior knowledge for my bachelor thesis. I by no means want you to do the work for me, but I have some questions which no one could answer so far.
Table Structure:
CREATE TABLE `data_inc` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`id_para` int(10) unsigned NOT NULL DEFAULT '0',
`t_s` int(11) unsigned NOT NULL DEFAULT '0',
`t_ms` smallint(6) unsigned NOT NULL DEFAULT '0',
`t_ns` bigint(20) unsigned NOT NULL DEFAULT '0',
`id_inst` smallint(6) NOT NULL DEFAULT '1',
`value` varchar(255) NOT NULL DEFAULT '',
`isanchor` tinyint(4) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`,`t_ns`),
KEY `t_s` (`t_s`),
KEY `t_ns` (`t_ns`)
) ENGINE=MyISAM AUTO_INCREMENT=2128295174 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (t_ns)
(PARTITION 19_02_2015_23_59 VALUES LESS THAN (1424386799000000000) ENGINE = MyISAM,
PARTITION 20_02_2015_23_59 VALUES LESS THAN (1424473199000000000) ENGINE = MyISAM,
PARTITION 21_02_2015_23_59 VALUES LESS THAN (1424559599000000000) ENGINE = MyISAM,
PARTITION 22_02_2015_23_59 VALUES LESS THAN (1424645999000000000) ENGINE = MyISAM,
PARTITION 23_02_2015_23_59 VALUES LESS THAN (1424732399000000000) ENGINE = MyISAM,
PARTITION 24_02_2015_23_59 VALUES LESS THAN (1424818799000000000) ENGINE = MyISAM,
PARTITION 25_02_2015_23_59 VALUES LESS THAN (1424905199000000000) ENGINE = MyISAM,
PARTITION 05_03_2015_23_59 VALUES LESS THAN (1425596399000000000) ENGINE = MyISAM,
PARTITION 13_03_2015_23_59 VALUES LESS THAN (1426287599000000000) ENGINE = MyISAM,
PARTITION 14_03_2015_23_59 VALUES LESS THAN (1426373999000000000) ENGINE = MyISAM,
PARTITION 15_03_2015_23_59 VALUES LESS THAN (1426460399000000000) ENGINE = MyISAM,
PARTITION 16_03_2015_23_59 VALUES LESS THAN (1426546799000000000) ENGINE = MyISAM,
PARTITION 17_03_2015_23_59 VALUES LESS THAN (1426633199000000000) ENGINE = MyISAM,
PARTITION 18_03_2015_23_59 VALUES LESS THAN (1426719599000000000) ENGINE = MyISAM)
*/
The system is currently logging up to 4000 parameters per second into a database (different tables; which one is decided in stored procedures). Every 5 minutes, every hour and daily, different scripts are called to analyse the logging data, while data continues to be written to the tables. This results in some heavy loads right now. Is there a chance that switching from MyISAM to InnoDB (or another engine) would improve performance?
Thanks for your help!
For logging quickly followed by analysis...
Gather the data into a MyISAM table with no indexes. After 5 min (1.2M rows!):
Analyze it into InnoDB "Summary Table(s)".
DROP TABLE or TRUNCATE TABLE.
The analysis would be put into other table(s). These would have summary information and be much smaller than 1.2M rows.
To get hourly data, summarize the summary table(s). But don't create "hourly" tables; simply fetch and recalculate as needed.
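The idea of summarizing the summary can be sketched in plain Python. This is an illustration only: it assumes the logged values are numeric (the question stores them as varchar), keeps a count and a sum per bucket so that averages can be recomputed at any coarser level, and uses hypothetical function names.

```python
from collections import defaultdict

def summarize(rows, bucket_seconds=300):
    """rows: iterable of (t_s, id_para, value). Returns {(bucket_start, id_para): [count, total]}."""
    summary = defaultdict(lambda: [0, 0.0])
    for t_s, id_para, value in rows:
        bucket = t_s - t_s % bucket_seconds        # start of the 5-minute bucket
        entry = summary[(bucket, id_para)]
        entry[0] += 1
        entry[1] += value
    return dict(summary)

def rollup(summary, bucket_seconds=3600):
    """Derive coarser (e.g. hourly) figures from the summary, not from the raw rows."""
    coarse = defaultdict(lambda: [0, 0.0])
    for (bucket, id_para), (count, total) in summary.items():
        hour = bucket - bucket % bucket_seconds
        entry = coarse[(hour, id_para)]
        entry[0] += count
        entry[1] += total
    return dict(coarse)

raw = [(0, 1, 2.0), (10, 1, 4.0), (299, 2, 1.0), (300, 1, 6.0), (3600, 1, 8.0)]
five_minute = summarize(raw)
hourly = rollup(five_minute)
print(hourly)
```

Storing counts and sums rather than averages is what makes the hourly figures recomputable from the 5-minute rows without touching the raw log again.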
Here are some related discussions: High speed ingestion and Summary Tables.
I'm exporting a largeish table (1.5 billion rows) between servers. This is the table format.
CREATE TABLE IF NOT EXISTS `partitionedtable` (
`domainid` int(10) unsigned NOT NULL,
`instanceid` int(10) unsigned NOT NULL,
`urlid` int(10) unsigned NOT NULL,
`adjrankid` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`domainid`,`instanceid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (MOD(domainid,8192))
(PARTITION p0 VALUES LESS THAN (1) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (2) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (3) ENGINE = InnoDB
...
PARTITION p8191 VALUES LESS THAN (8192) ENGINE = InnoDB) */
The data was exported to the new server in PK order and resulted in 8192 text files... which equated to around 200K records per file.
I'm simply iterating from 0 to 8191 importing the files into the new table.
LOAD DATA INFILE '/home/backup/rc/$i.tsv' INTO TABLE partitionedtable PARTITION (p$i)
I'm thinking that each of these should only take a second to import, however they take around 6 seconds.
The spec of the server can be seen here.
http://www.ovh.co.uk/dedicated_servers/sp_32g.xml
There isn't much else going on in the server that'd bottleneck the process.
Could it be that partitioning by MOD() causes fragmentation? I was under the impression that there wouldn't be any fragmentation, as each partition would be considered a separate table, and since data is inserted in PK order there'd be no fragmentation.
Added - probably useful... these settings were applied at the start of the batch.
SET autocommit=0;
SET foreign_key_checks=0;
SET sql_log_bin=0;
SET unique_checks=0;
A COMMIT is applied after every file.
The thread seems to spend the majority of its time in a System lock state, during LOAD DATA INFILE.
When I set up the server I mistakenly thought the open files limit was higher, though in reality it's sitting at 1024.
I've upped it to 16000 and rebooted the server, and it's running slightly quicker, at around 3 seconds per file (I was assuming the file opening/closing was causing the System lock status).
I also purged the bin logs.
Still seems a bit slow though.
I have a log table that gets processed every night. Processing will be done on data that was logged yesterday. Once the processing is complete I want to delete the data for that day. At the same time, I have new data coming into the table for the current day. I partitioned the table based on day of week. My hope was that I could delete data and insert data at the same time without contention. There could be as many as 3 million rows of data a day being processed. I have searched for information but haven't found anything to confirm my assumption.
I don't want the hassle of writing a job that adds and drops partitions, as I have seen in other examples. I was hoping to implement a solution using seven partitions, e.g.:
CREATE TABLE `professional_scoring_log` (
`professional_id` int(11) NOT NULL,
`score_date` date NOT NULL,
`scoring_category_attribute_id` int(11) NOT NULL,
`displayable_score` decimal(7,3) NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`professional_id`,`score_date`,`scoring_category_attribute_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (DAYOFWEEK(`score_date`))
(PARTITION Sun VALUES LESS THAN (2) ENGINE = InnoDB,
PARTITION Mon VALUES LESS THAN (3) ENGINE = InnoDB,
PARTITION Tue VALUES LESS THAN (4) ENGINE = InnoDB,
PARTITION Wed VALUES LESS THAN (5) ENGINE = InnoDB,
PARTITION Thu VALUES LESS THAN (6) ENGINE = InnoDB,
PARTITION Fri VALUES LESS THAN (7) ENGINE = InnoDB,
PARTITION Sat VALUES LESS THAN (8) ENGINE = InnoDB) */
When my job that processes yesterday's data is complete, it would delete all records where score_date = current_date-1. At any one time, I am likely only going to have data in one or two partitions, depending on time of day.
Are there any holes in my assumptions?
Charlie, I don't see any holes in your logic/assumptions.
I guess my one comment would be: why not use the DROP/ADD PARTITION syntax? It has to be more efficient than DELETE FROM ... WHERE ...; and it's just two calls, no big deal. Store "prototype" statements and substitute for "Sun" and "2" as required for each day of the week. I often use sprintf for doing just that:
ALTER TABLE `professional_scoring_log` DROP PARTITION Sun;
ALTER TABLE `professional_scoring_log` ADD PARTITION (
PARTITION Sun VALUES LESS THAN (2)
);
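A sketch of the template-substitution idea in Python. One deliberate difference from the statements above: on a RANGE-partitioned table MySQL only accepts ADD PARTITION at the high end of the range, so re-adding a dropped partition such as Sun would be rejected. The sketch therefore emits ALTER TABLE ... TRUNCATE PARTITION (available since MySQL 5.5), which empties the partition in place. The helper and constant names are hypothetical.

```python
from datetime import date, timedelta

# Partition names from the CREATE TABLE in the question, indexed by Python's weekday() (Monday == 0).
PARTITIONS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
TEMPLATE = "ALTER TABLE `professional_scoring_log` TRUNCATE PARTITION {name};"

def purge_statement(day: date) -> str:
    """Statement that empties the partition holding rows whose score_date is `day`."""
    return TEMPLATE.format(name=PARTITIONS[day.weekday()])

yesterday = date.today() - timedelta(days=1)
print(purge_statement(yesterday))
```

Because each weekday maps to a fixed partition, the nightly job needs no partition bookkeeping: it only has to know which day it has finished processing.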