Greenplum: creating a partitioned table is very slow - create-table

I ran psql to create a partitioned table. Two hours have passed and it still hasn't completed. What's the problem, and how can I speed it up? Thanks very much.
My command is
DROP SCHEMA IF EXISTS people_hotel CASCADE;
CREATE SCHEMA people_hotel;
SET SEARCH_PATH TO people_hotel, public, pg_catalog, pg_toolkit;
ALTER ROLE user SET search_path TO people_hotel, public, pg_catalog, pg_toolkit;

DROP TABLE IF EXISTS people_hotel.event;
CREATE TABLE people_hotel.event (
    event_id bigint,
    time     bigint,
    object   bigint,
    loc_id   integer
)
DISTRIBUTED BY (event_id)
PARTITION BY RANGE (time)
SUBPARTITION BY RANGE (object)
SUBPARTITION TEMPLATE (
    START (0) INCLUSIVE END (100000000) EXCLUSIVE EVERY (10000),
    DEFAULT SUBPARTITION other_object
)
(
    START (bigint '1527811200000') INCLUSIVE
    END   (bigint '1544975684280') INCLUSIVE
    EVERY (bigint '2592000000'),
    DEFAULT PARTITION other_time
);
Greenplum runs with 1 master and 2 segment hosts; each host has 10 instances (10 primary directories).
Greenplum version: 5.9.0
OS version: ubuntu 16.04

Do not go beyond 3K partitions in total; I recommend no more than 1K.
Also, keep it to two-level partitioning: partition, then subpartition.
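A back-of-the-envelope count of the DDL above shows why creation is so slow (a sketch; it assumes Greenplum materializes one child table per range step, plus the DEFAULT partitions):

```python
# Estimate how many leaf tables the CREATE TABLE above asks for.
import math

time_span_ms = 1544975684280 - 1527811200000           # top-level RANGE span
time_parts = math.ceil(time_span_ms / 2592000000) + 1  # EVERY ~30 days + DEFAULT
sub_parts = 100_000_000 // 10_000 + 1                  # EVERY 10000 + DEFAULT

print(time_parts, sub_parts, time_parts * sub_parts)   # 8 10001 80008
```

That is roughly 80,000 leaf tables, far beyond the 1K-3K guideline above, and every one of them costs catalog work at CREATE time.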

Related

Can I kill a process in the "query end" state in Aurora MySql

I have a large table hosted on Amazon Aurora using MySQL 5.7.
Two days ago, I ran this command:
insert IGNORE into archiveDataNEW
(`DateTime-UNIX`,`pkl_PPLT_00-PIndex`,`DataValue`)
SELECT `DateTime-UNIX`,`pkl_PPLT_00-PIndex`,`DataValue`
FROM offlineData
order by id
limit 600000000, 200000000
Yesterday afternoon my computer crashed, so the connection to MySQL was severed.
Sometime last night the status of the query changed to "query end".
Today the status of the query is still "query end".
Questions:
Can I stop this process, or will that only make things worse?
Does MySQL InnoDB unwind a query when the connection to the server drops? Is there any way to tell it to proceed instead?
Will I need to re-run the command when it finally completes the "query end" phase?
Here is the table I am loading data into, any thoughts or suggestions will be appreciated.
CREATE TABLE `archiveDataNEW` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`DateTime-UNIX` bigint(20) NOT NULL DEFAULT '0',
`pkl_PPLT_00-PIndex` int(11) NOT NULL DEFAULT '0',
`DataValue` decimal(14,4) NOT NULL DEFAULT '0.0000',
PRIMARY KEY (`id`,`DateTime-UNIX`),
UNIQUE KEY `Unique2` (`pkl_PPLT_00-PIndex`,`DateTime-UNIX`) USING BTREE,
KEY `DateTime` (`DateTime-UNIX`) USING BTREE,
KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE,
KEY `DataIndex` (`DataValue`),
KEY `pIndex-Data` (`pkl_PPLT_00-PIndex`,`DataValue`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=736142506 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (`DateTime-UNIX`)
(PARTITION p2016 VALUES LESS THAN (1483246800) ENGINE = InnoDB,
PARTITION p2017 VALUES LESS THAN (1514782800) ENGINE = InnoDB,
PARTITION p2018 VALUES LESS THAN (1546318800) ENGINE = InnoDB,
PARTITION p2019 VALUES LESS THAN (1577854800) ENGINE = InnoDB,
PARTITION p2020 VALUES LESS THAN (1609477200) ENGINE = InnoDB,
PARTITION p2021 VALUES LESS THAN (1641013200) ENGINE = InnoDB,
PARTITION p2022 VALUES LESS THAN (1672549200) ENGINE = InnoDB,
PARTITION p2023 VALUES LESS THAN (1704085200) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
There's no way to complete that statement and commit the rows it inserted.
This is apparently a bug in MySQL 5.7 code, discussed here: https://bugs.mysql.com/bug.php?id=91078
The symptom is that a query is stuck in the "query end" state and there's no way to kill it or finish it except by restarting the MySQL Server, which you cannot do yourself on AWS Aurora.
There's some back and forth in that bug log about whether it's caused by the query cache. The query cache is deprecated, but in Aurora they have reenabled it and changed its implementation. They are convinced that their query cache code solves the disadvantages of MySQL's query cache implementation, so they leave it on in Aurora (this is one of many reasons you should think of Aurora as a fork of MySQL, not necessarily compatible with MySQL itself).
Kill it, if you can. It is either busy committing (which will take a long time) or busy undoing (which will take even longer). If it won't kill, you are stuck with waiting it out.
A better way:
Using an OFFSET, as in limit 600000000, 200000000, will only get slower and slower as you work through the chunks, because MySQL must step over the first 600M rows each time.
Also, INSERTing 200M rows at a time is quite inefficient: the system must prepare to UNDO the action in case of a crash.
So it is better to "remember where you left off", or to work in explicit chunks like WHERE id BETWEEN 12345000 AND 12345999. And do only about 1K rows at a time.
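Concretely, one chunk of the copy might look like this (a sketch using the question's tables; the id bounds are illustrative and would advance by 1000 on each iteration):

```sql
-- One 1K-row chunk of the copy, keyed on id; repeat with the next range.
INSERT IGNORE INTO archiveDataNEW
    (`DateTime-UNIX`, `pkl_PPLT_00-PIndex`, `DataValue`)
SELECT `DateTime-UNIX`, `pkl_PPLT_00-PIndex`, `DataValue`
FROM offlineData
WHERE id BETWEEN 12345000 AND 12345999;   -- next pass: 12346000 .. 12346999
```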
But what are you trying to do?
If you are adding Partitioning, let's discuss whether there will be any benefit. It looks like you are adding yearly partitioning. Possibly the only advantage is when you need to DROP PARTITION to get rid of "old" data. It is unlikely that any queries will run any faster.
Likely optimizations:
Shrink:
`DateTime-UNIX` bigint(20)
This seems to be a unix timestamp, that fits nicely in a 4-byte INT or a 5-byte TIMESTAMP; why use an 8-byte BIGINT? TIMESTAMP has the advantage of allowing lots of datetime functions. A 5-byte DATETIME or a 3-byte DATE will last until the end of the year 9999. We are 17 years from overflow of TIMESTAMP; what computer systems do you know of that have been around since 2004 (today - 17 years)? Caveat: There will be timezone issues to address (or ignore) if you switch from TIMESTAMP. (If you need the time part, do not split a DATETIME into two columns; it is likely to add complexity.)
Drop KEY `pIndex` (`pkl_PPLT_00-PIndex`) USING BTREE; it is redundant with two other indexes (`Unique2` and `pIndex-Data` both start with that same column).
Do not pre-build future partitions; it hurts performance (a small amount). At the end of the current year, build the next year's partition with REORGANIZE. Details here: http://mysql.rjweb.org/doc.php/partitionmaint
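For example, at the end of 2023 the next partition could be carved out of pMAX like this (a sketch; the boundary value 1735707600 follows the one-year step of the existing partition list and should be verified before use):

```sql
-- At year end, split next year's partition out of pMAX with REORGANIZE.
ALTER TABLE archiveDataNEW
REORGANIZE PARTITION pMAX INTO (
    PARTITION p2024 VALUES LESS THAN (1735707600),
    PARTITION pMAX  VALUES LESS THAN MAXVALUE
);
```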
This will improve performance in multiple ways:
PRIMARY KEY (`id`,`DateTime-UNIX`),
UNIQUE KEY `Unique2` (`pkl_PPLT_00-PIndex`,`DateTime-UNIX`) USING BTREE,
-->
PRIMARY KEY(`pkl_PPLT_00-PIndex`,`DateTime-UNIX`),
INDEX(id) -- sufficient for AUTO_INCREMENT
It may run faster if you leave off the non-UNIQUE indexes until the table is loaded. Then do ALTER(s) to add them.
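Under that approach, the post-load step could be a single ALTER that adds the secondary indexes back in one pass (a sketch reusing the index names from the question's schema):

```sql
-- Add the non-UNIQUE secondary indexes after the bulk load completes.
ALTER TABLE archiveDataNEW
    ADD KEY `DateTime`    (`DateTime-UNIX`) USING BTREE,
    ADD KEY `DataIndex`   (`DataValue`),
    ADD KEY `pIndex-Data` (`pkl_PPLT_00-PIndex`, `DataValue`) USING BTREE;
```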

UNIX_TIMESTAMP field partition for a whole year

I am quite new to partitioning, and the need has arisen due to a large accumulation of data.
Basically it is an access control system: there are currently 20 departments and each department has approximately 100 users. The system records the date and time of entries and exits (from_date / to_date). My intention is to divide by department and then by month throughout the year.
Plan:
Partition the table by [ dep_id and date (from_date and to_date) ]
Problem
I have the following table.
CREATE TABLE `employee` (
`employee_id` smallint(5) NOT NULL,
`dep_id` int(11) NOT NULL,
`from_date` int(11) NOT NULL,
`to_date` int(11) NOT NULL,
KEY `index1` (`employee_id`,`from_date`,`to_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have the dates (from_date and to_date) in UNIX_TIMESTAMP format (INT(11)).
I am looking to divide it across all the months of the year.
Is it possible?
MySQL 5.7
It is possible to use range partitioning on an integer column.
Assuming my_int_col is unix-style integer seconds since 1970-01-01
we could achieve monthly partitions with something like this:
PARTITION BY RANGE (my_int_col)
( PARTITION p20180101 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-01-01 00:00') )
, PARTITION p20180201 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-02-01 00:00') )
, PARTITION p20180301 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-03-01 00:00') )
, PARTITION p20180401 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-04-01 00:00') )
, PARTITION p20180501 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-05-01 00:00') )
, PARTITION p20180601 VALUES LESS THAN ( UNIX_TIMESTAMP('2018-06-01 00:00') )
, ...
);
Be careful of the time_zone setting of the session. Those date literals will be interpreted as values in the current time_zone... e.g. if you want those to be UTC datetime, time_zone should be +00:00.
Or, replace the UNIX_TIMESTAMP() expression with a literal integer value... that's what MySQL is going to do with the UNIX_TIMESTAMP() expressions.
Obviously, you can name the partitions whatever you want.
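As a sanity check on those literals, the integer MySQL would store for the first boundary can be computed outside the server (assuming the session time_zone is UTC):

```python
# UNIX_TIMESTAMP('2018-01-01 00:00') evaluated with time_zone = '+00:00'
from datetime import datetime, timezone

boundary = int(datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp())
print(boundary)  # 1514764800
```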
Note: applying partitioning to an existing table will require MySQL to create an entire copy of the table, holding an exclusive lock on the original table while the operation completes. So you will need sufficient storage (disk) space, and a window of time for the operation to complete.
It's possible to create a new table that is partitioned, and then copy the older data a chunk at a time. But make the chunks reasonably sized, to avoid ballooning the ibdata1 with large transactions. And then do some RENAME TABLE statements to move the old table out, and move the new table in.
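The final swap can then be done atomically with RENAME TABLE (a sketch; `employee_new` is a hypothetical name for the partitioned copy of the question's `employee` table):

```sql
-- Retire the old table and promote the partitioned copy in one step.
RENAME TABLE employee     TO employee_old,
             employee_new TO employee;
```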
Some caveats to note with partitioned tables: there's no foreign key support, and there's no guarantee that partitioned table will give better DML performance than a non-partitioned table.
Strategic indexes and carefully planned queries are the key to performance with "very large" tables, and this is true of partitioned tables as well.
Partitioning isn't a magic bullet for performance problems that some novices would like it to be.
As far as creating subpartitions within partitions, I wouldn't recommend it.

MySQL - Data Loading by Partitions, and Indexes

This is for MySQL 5.7 with InnoDB.
I have a partitioned table, and I'll be doing batch data loading (of a large amount of data) by partitions. i.e. I know that each batch of data I load will fall exclusively into one partition.
Now, the common way to handle indexes with data loading (as far as I know), would be to drop all indexes first, do the data loading, then re-create the indexes.
But I'm wondering, since I'm loading by partitions, is this still the most optimal way (dropping and then re-creating indexes) since it seems like I'm unnecessarily "touching" the non-updated partitions this way.
e.g.
Loading data into partition 1.
Drop all indexes - nothing happens, since no data yet.
Load data - all goes into partition 1.
Create indexes - only partition 1 is modified.
Loading data into partition 2.
Drop all indexes - all indexes in partition 1 dropped (unnecessary!)
Load data - all goes into partition 2.
Create indexes - partition 1 indexes re-created (unnecessary!) and partition 2 indexes created.
And hence, loading this second batch of data takes significantly longer than the first batch. And it will get worse for each batch!
In that case, should I just pre-create the indexes and leave them there when loading data?
(BTW, don't worry about queries. The database is "offline" when data loading takes place. The objective here is only to shorten the time for each batch of data loading.)
The table schema is as follows:
CREATE TABLE MYTABLE (
ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
YEAR SMALLINT UNSIGNED NOT NULL,
MONTH TINYINT UNSIGNED NOT NULL,
A CHAR(4),
B VARCHAR(127),
C VARCHAR(15),
D VARCHAR(511),
E TEXT,
F TEXT,
G VARCHAR(127),
H VARCHAR(127),
I VARCHAR(127),
J VARCHAR(511),
K VARCHAR(511),
L BIT(1),
CONSTRAINT PKEY PRIMARY KEY (ID, YEAR, MONTH)
)
PARTITION BY LIST COLUMNS(YEAR, MONTH) (
PARTITION PART1 VALUES IN ((2007, 1)),
PARTITION PART2 VALUES IN ((2007, 2)),
PARTITION PART3 VALUES IN ((2007, 3)),
...
);
And, of course, there are a bunch of indexes (14 in all), mostly involving 2 to 4 columns. Neither of the two TEXT columns is in any of the indexes.
If you are using InnoDB, do not drop the PRIMARY KEY.
All PARTITIONs always have the same indexes, so you cannot turn indexes on and off for individual partitions.
Please provide SHOW CREATE TABLE for further critique and advice. I may say that PARTITIONing is of no use here; there are very few use cases where it is worth using PARTITION. More info, and use cases

MySQL table partition strange behavior (slow query suddenly)

(MySQL version: 5.6.15)
I have a huge table (Table_A) with 10M rows, in entity-attribute-value model.
It has a compound unique key [Field_A + Element + DataTime].
CREATE TABLE TABLE_A
(
`Field_A` varchar(5) NOT NULL,
`Element` varchar(5) NOT NULL,
`DataTime` datetime NOT NULL,
`Value` decimal(10,2) DEFAULT NULL,
UNIQUE KEY `A_ELE_TIME` (`Field_A`,`Element`,`DataTime`),
KEY `DATATIME` (`DataTime`),
KEY `ELE` (`Element`),
KEY `ELE_TIME` (`Element`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Rows are inserted/updated in the table every minute, so the number of rows per [DataTime] value (i.e. per minute) is regular, around 3K rows.
I have a SELECT query on this table that runs right after the insert/update above.
The query selects one specified element within the most recent 25 hours (around 30K rows). This query usually completes within 3 seconds.
SELECT
Field_A, Element, DataTime, `Value`
FROM
Table_A
WHERE
Element="XX"
AND DataTime between [time] and [time].
The original housekeeping removed any rows older than 3 days, running every 5 minutes.
For better housekeeping, I tried partitioning the table based on [DataTime], every 6 hours (00, 06, 12, 18 local time).
PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime))
(PARTITION p2014103112 VALUES LESS THAN (73590212) ENGINE = InnoDB,
...
PARTITION p2014110506 VALUES LESS THAN (73590706) ENGINE = InnoDB,
PARTITION pFuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
My housekeeping script will drop the expired partition then create a new one
ALTER TABLE TABLE_A REORGANIZE PARTITION pFuture INTO (
PARTITION [new_partition_name] VALUES LESS THAN ([bound_value]),
PARTITION pFuture VALUES LESS THAN MAXVALUE
)
The new process seems to run smoothly.
However, the SELECT query now slows down suddenly (> 100 sec).
The query stays slow even with all other processes stopped. It isn't fixed until we run "analyze partitions" (which reads and stores the key distributions for the partitions).
It usually happens once a day.
It does not happen on a non-partitioned table.
Therefore, we think it is caused by corrupted indexing in a partitioned (huge) MySQL table.
Does anyone have any idea how to solve it?
Many thanks!
If you PARTITION BY RANGE (TO_DAYS(DataTime)*100+hour(DataTime)), then when you filter DataTime with a BETWEEN [from] AND [to] condition, MySQL will scan all partitions unless [from] equals [to], because it cannot prune ranges through that compound expression.
So it is no surprise that your query slows down suddenly.
My suggestion is to partition using TO_DAYS(DataTime) without the hour; then a query over the most recent 25 hours of data will scan at most 2 partitions.
I'm not a MySQL expert and can't explain it further; perhaps others can. But you can use EXPLAIN PARTITIONS to verify it. And here is the SQL Fiddle demo.
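To see the pruning (or lack of it) yourself, run the query under EXPLAIN PARTITIONS and inspect the partitions column (a sketch using the question's table; the literal times are illustrative):

```sql
-- MySQL 5.6 syntax; in 5.7+ plain EXPLAIN already shows a partitions column.
EXPLAIN PARTITIONS
SELECT Field_A, Element, DataTime, `Value`
FROM TABLE_A
WHERE Element = 'XX'
  AND DataTime BETWEEN '2014-11-04 06:00:00' AND '2014-11-05 07:00:00';
```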

MySQL table partition by month

I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually each month, is there a way to just tell MySQL to partition by month and it will do that automatically?
If not, what is the command to do it manually, considering my partitioning column is a DATETIME?
As explained in the manual (http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html), this is easily possible by hash partitioning on the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this partitions only by month and not by year; also, there are only 6 partitions (so 6 months) in this example.
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the per-row evaluation cost when partitioning by hash. As the docs say:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
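A quick check of what HASH with 6 partitions does to MONTH() values (MySQL applies MOD to the integer expression, so this mapping is exact):

```python
# MONTH() yields 1..12; PARTITION BY HASH(...) PARTITIONS 6 uses value MOD 6.
buckets = {}
for month in range(1, 13):
    buckets.setdefault(month % 6, []).append(month)
print(buckets)  # {1: [1, 7], 2: [2, 8], 3: [3, 9], 4: [4, 10], 5: [5, 11], 0: [6, 12]}
```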
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (HASH(MONTH(some_date)));
then start the PRIMARY KEY with the_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)