Vertica Hierarchical Partitioned Table creation throws error - partitioning

I am using Vertica Analytic Database v8.1.1-8.
I have created a table with simple partitioning clause as:
CREATE TABLE public.test
(
id timestamp NOT NULL,
cid numeric(37,15) NOT NULL DEFAULT 0
)
UNSEGMENTED ALL NODES PARTITION BY id::DATE;
The table was created successfully, and I inserted a few rows into it.
But when I execute the following SQL,
SELECT DUMP_PARTITION_KEYS();
I see the following:
Partition keys on node v_public_node0001
Projection 'test_super'
No of partition keys: 0
Partition keys on node v_public_node0003
Projection 'test_super'
No of partition keys: 0
I was expecting to see some valid partition keys here.
So I am wondering: have I missed a step?
How do I verify that my table really got partitioned?
2) Next I tried hierarchical partitioning with the CALENDAR_HIERARCHY_DAY meta-function, to leverage partition grouping.
But this time the table creation itself failed.
CREATE TABLE public.test
(
id timestamp NOT NULL,
cid numeric(37,15) NOT NULL DEFAULT 0
)
UNSEGMENTED ALL NODES PARTITION BY id::DATE
GROUP BY CALENDAR_HIERARCHY_DAY(id::DATE, 2, 2);
It failed with the following error:
16:45:14 [CREATE - 0 rows, 0.130 secs] [Code: 4856, SQL State: 42601] [Vertica][VJDBC](4856) ERROR: Syntax error at or near "GROUP"
... 1 statement(s) executed, 0 rows affected, exec/fetch time: 0.130/0.000 sec [0 successful, 1 errors]
Can anyone please suggest what I did wrong?
My goal is to create a table with Hierarchical Partitioning.
Many Thanks in advance,
- Kuntal

1) The reason you are not seeing partition keys right after an insert is that partitioning only happens on disk (per node, per projection). When you insert rows into a table, those rows are first written to the write-optimized store (WOS), i.e. held in memory. After a given interval, the data in the WOS is written to disk, the read-optimized store (ROS). At that point you will see the partition keys.
The process of data being copied from WOS to ROS is performed by the tuple mover (https://www.vertica.com/docs/latest/HTML/Content/Authoring/Glossary/TupleMover.htm).
In short, to see the partition keys, either wait five minutes or so for the tuple mover to initiate an automatic moveout, or force the data from WOS to ROS with a manual moveout:
SELECT DO_TM_TASK('moveout', 'public.test');
Then you should see the keys.
2) Hierarchical partitioning is a Vertica 9 feature. You will need to upgrade to at least Vertica 9.0 in order to use it.
https://www.vertica.com/blog/whats-new-vertica-9-0-hierarchical-partitioning/
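For reference, a quick way to confirm this: check the server version, and note that on Vertica 9.0+ the hierarchical DDL from the question should then be accepted as written (repeated here as a sketch):

```sql
-- Check the server version first:
SELECT version();

-- On Vertica >= 9.0 this DDL (from the question) should succeed;
-- on 8.1 it fails with the "Syntax error at or near GROUP" shown above.
CREATE TABLE public.test
(
    id  timestamp      NOT NULL,
    cid numeric(37,15) NOT NULL DEFAULT 0
)
UNSEGMENTED ALL NODES
PARTITION BY id::DATE
GROUP BY CALENDAR_HIERARCHY_DAY(id::DATE, 2, 2);
```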

Related

Determining partitioning key in range based partitioning of a MySQL Table

I've been researching database partitioning in MySQL for a while. Since I have one ever-growing table in my DB, I thought of using partitioning as an effective tool to optimize it. I'm only interested in retaining recent data (say, the last 6 months), and the table has a column named 'CREATED_AT' (TIMESTAMP, non-primary), so the approach that came to mind is as follows:
Create a time-based range partition on the table by using 'CREATED_AT' as the partition key.
Run a DB-level event periodically and drop partitions that are obsolete (older than 6 months).
However, the partition can only be created if I make the 'CREATED_AT' field part of the primary key. But doesn't that violate the primary-key principle? Since the same field is non-unique and can have tons of rows with the same value, doesn't marking it as primary turn out to be an anti-pattern? Is there any workaround to achieve time-based range partitioning in this scenario?
This is a problem that prevents many MySQL users from using partitioning.
The column you use for your partitioning key must be in every PRIMARY KEY or UNIQUE KEY of the table. It doesn't have to be the only column in those keys (because keys can be multi-column), but it has to be part of every unique key.
Still, in many tables it would violate the logical design of the table. So partitioning is not practical.
You could grit your teeth and design a table with partitions that has a compromised design:
create table mytable (
id bigint auto_increment not null,
created_at datetime not null,
primary key (id, created_at)
) partition by range columns (created_at) (
partition p20190101 values less than ('2019-01-01'),
partition p20190201 values less than ('2019-02-01'),
partition p20190301 values less than ('2019-03-01'),
partition p20190401 values less than ('2019-04-01'),
-- etc...
partition pMAX values less than (MAXVALUE)
);
I tested this table and there's no error when I define it. Even though this table technically allows multiple rows with the same id value if they have different timestamps, in practice you can code your application to just let id values be auto-incremented, and never change the id. As long as your code is the only application that inserts data, you can more or less have some assurance that the data doesn't contain multiple rows with the same id.
You might think you can add a secondary unique key constraint to enforce that id must be unique by itself. But this violates the partitioning rules:
mysql> alter table mytable add unique key (id);
ERROR 1503 (HY000): A UNIQUE INDEX must include all columns in the table's partitioning function
You just have to trust that your application won't insert invalid data.
Or else forget about using partitioning, and instead just add an index to the created_at column, and use incremental DELETE instead of using DROP PARTITION to prune old data.
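A minimal sketch of that incremental-DELETE approach (the batch size and scheduling are arbitrary choices of mine, not from the question):

```sql
-- Assumes an index on created_at. Delete in small batches so each
-- statement locks only a limited number of rows:
DELETE FROM mytable
WHERE created_at < NOW() - INTERVAL 6 MONTH
LIMIT 1000;
-- Re-run (e.g. from a cron job or a MySQL EVENT) until 0 rows are affected.
```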
The latter strategy is what I see used in almost every case. Usually, it's important to have the RDBMS enforce strict uniqueness on the id column. It's not safe to leave this uniqueness unenforced.
Re your comment:
Isn't dropping an entire partition a much cheaper operation than performing incremental deletes?
Yes and no.
DELETE can be rolled back, so it results in some overhead, like temporarily storing data in the rollback segment. On the other hand, it locks only the rows that match the index search.
Dropping a partition doesn't do rollback, so there are some steps it can skip. But it does an ALTER TABLE, so it needs to first acquire a metadata lock on the whole table. Any concurrent query, either read or write, will block that and be blocked by it.
Demo:
Open two MySQL client windows. In the first session do this:
mysql> START TRANSACTION;
mysql> SELECT * FROM mytable;
This holds a metadata lock on the table, which blocks things like ALTER TABLE.
In the second window:
mysql> ALTER TABLE mytable DROP PARTITION p20190101;
<pauses, waiting for the metadata lock held by the first session!>
You can even open a third session and do this:
mysql> SELECT * FROM mytable;
<also pauses>
The second SELECT is waiting behind the ALTER TABLE. They are both queued for the metadata lock.
If I commit the first SELECT, then the ALTER TABLE finally finishes:
mysql> ALTER TABLE mytable DROP PARTITION p20190101;
Query OK, 0 rows affected (6 min 25.25 sec)
That 6 min 25 sec isn't because it takes a long time to do the DROP PARTITION. It's because I had left my transaction uncommitted that long while writing this post.
Metadata lock waits don't time out like an InnoDB row lock, which times out after 50 seconds. The default metadata lock timeout is 1 year! See https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_lock_wait_timeout
Statements like ALTER TABLE, DROP TABLE, RENAME TABLE, and even things like CREATE TRIGGER need to acquire a metadata lock.
So in some cases, depending on if you have long-running transactions holding onto metadata locks, it could be better for your concurrent throughput to use DELETE to remove data incrementally, even if it takes longer.
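If you are on MySQL 5.7+, you can watch these metadata-lock waits directly via performance_schema (a sketch; the metadata-lock instrument is off by default in 5.7 and must be enabled first):

```sql
-- Enable the metadata-lock instrument (off by default in 5.7):
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';

-- While the ALTER TABLE is blocked, list who holds or waits on
-- the table's metadata lock (LOCK_STATUS: GRANTED vs PENDING):
SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_NAME = 'mytable';
```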

MySQL EXPLAIN query scans more rows than the query actually returns

I am using MySQL 5.6.22-log.
I am executing a query on the table aggr, with all the conditions in the WHERE clause.
Here are the details:
Table
CREATE TABLE aggr (
a_date DATE,
product_id INT(11),
data_point VARCHAR(16),
los INT(11),
hour_0 DOUBLE(4,2),
UNIQUE KEY `unique_row` (a_date,product_id,data_point,los),
INDEX product_id(product_id)
);
Insert queries
INSERT INTO aggr(a_date,product_id,data_point,los,hour_0)
VALUES
('2018-07-29',1,'arrivals',1,10),('2018-07-29',1,'departure',1,9),
('2018-07-29',1,'solds',1,12),('2018-07-29',1,'revenue',1,45.20),
('2018-07-30',1,'arrivals',2,10),('2018-07-30',1,'departure',2,9),
('2018-07-30',1,'solds',2,12),('2018-07-30',1,'revenue',2,45.20),
('2018-07-29',2,'arrivals',1,10),('2018-07-29',2,'departure',1,9),
('2018-07-29',2,'solds',1,12),('2018-07-29',2,'revenue',1,45.20),
('2018-07-30',2,'arrivals',2,10),('2018-07-30',2,'departure',2,9),
('2018-07-30',2,'solds',2,12),('2018-07-30',2,'revenue',2,45.20);
Query
EXPLAIN
SELECT * FROM aggr
WHERE a_date BETWEEN '2018-07-29' AND '2018-07-29'
AND product_id = 1
AND data_point IN('arrivals','departure' ,'solds','revenue')
AND los = 1 ;
Question
The above query scans 8 rows, while per the WHERE condition it should scan only 4.
Expected result:
It should scan only 4 rows instead of 8.
Can someone explain why MySQL scans 8 rows instead of 4?
Thanks
The EXPLAIN statement is used to obtain information about how a query is executed. The rows number is an approximation only, used by the query optimizer to make decisions when it builds an execution plan. It is a tool for getting diagnostic information, aimed at database administrators and developers.
What the result of EXPLAIN is actually showing you is that you have no usable index for your query (key is NULL). This is quite bad and can cause significant slowdowns for this query. Looking at your table definition, I would say that you need a separate index for data_point, or at least try to make it the last column of your unique key.
However, none of this is enough to explain the deadlock. I'm not even sure why you are showing us EXPLAIN here; it has nothing to do with it. To be able to diagnose a deadlock, you need to provide more information. Start with the type of your table (MyISAM, InnoDB, etc.) and SHOW FULL PROCESSLIST. Then, for each process, see what locks it's holding on each table.
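As a sketch of the indexing suggestion above (the index name and column order here are my own choices, not from the question):

```sql
-- A composite index with the single-value equality columns first,
-- so MySQL can narrow the scan instead of reporting key = NULL:
ALTER TABLE aggr
ADD INDEX idx_pid_los_date (product_id, los, a_date, data_point);
```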

Using MySQL partitioning by an AUTO INCREMENT field, how can I guarantee that INSERT/LOAD DATA statements are only accessing specified partitions?

General context
I want to be able to tell, when inserting into non-balanced RANGE-partitioned MySQL tables with AUTO INCREMENT primary keys, whether my inserts are causing MySQL to communicate in any way with partitions other than the ones I specify. This is useful for budgeting future capacity for large-scale data loading; with that assurance, I could much more accurately predict that performance and hardware resource cost of loading data into the database.
I am using MySQL 5.6.
Specific context
Say I have the following table in MySQL (5.6):
CREATE TABLE foo (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(6) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=9001 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
/*!12345 PARTITION BY RANGE (id)
(PARTITION cold VALUES LESS THAN (8000) ENGINE = InnoDB,
PARTITION hot VALUES LESS THAN (9000) ENGINE = InnoDB,
PARTITION overflow VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Assume the table is not sparse: no rows have been deleted, so count(*) = max(id) = 9001.
Questions
If I do INSERT INTO foo PARTITION (hot) (data) VALUES ('abc') or an equivalent LOAD DATA statement with the PARTITION clause included, are any partitions other than the selected hot partition being accessed?
How would I tell what partitions are being accessed by those DML statements?
What I've tried
The MySQL documentation on partition selection says:
REPLACE and INSERT now lock only those partitions having rows to be
inserted or replaced. However, if an AUTO_INCREMENT value is generated
for any partitioning column then all partitions are locked.
Additionally, it says:
Locks imposed by LOAD DATA statements on partitioned tables cannot be
pruned.
Those statements don't help clarify which partitions are being accessed by DML queries which explicitly specify the partition.
I've tried doing EXPLAIN PARTITIONS INSERT INTO foo ..., but the partitions column of the output is always NULL.
According to the documentation,
For statements that insert rows, the behavior differs in that failure to find a suitable partition causes the statement to fail. This is true for both INSERT and REPLACE statements
So when you try to insert a row that does not match your specified partition, you'll receive
Error Code: 1748. Found a row not matching the given partition set
This including statements where some rows match and some don't,
so you cannot use this to fill "hot" and throw away rows that would go into "overflow" (as the whole query will fail).
The EXPLAIN output for MySQL 5.6 does not include a separate row for the insert; the value in the partitions column relates to the source of the data you insert (in cases where you e.g. use INSERT ... SELECT ... PARTITION ...), even if you use VALUES() (then you use "no table", and the relevant partition is just NULL). For MySQL 5.7+, there is an "insert" type, and it would indeed list only your specified partition.
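On MySQL 5.7+ you could therefore check pruning directly with EXPLAIN on the INSERT itself (a sketch; the id value is hypothetical, chosen to fall in the "hot" range, and the exact output format varies by version):

```sql
-- Explicit id in the "hot" range (8000-8999), so no AUTO_INCREMENT
-- value needs to be generated:
EXPLAIN INSERT INTO foo PARTITION (hot) (id, data) VALUES (8500, 'abc');
-- On 5.7+ the plan row for the INSERT should list only "hot"
-- in the partitions column.
```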

MySQL EXPLAIN shows 1 in "rows" when there is no data in the table at all

I have a table:
CREATE TABLE `test` (
`pk` INT(11) NOT NULL AUTO_INCREMENT,
`index_col` INT(11) NULL DEFAULT '0',
PRIMARY KEY (`pk`)
)
ENGINE=InnoDB
AUTO_INCREMENT=5
;
There is no data in the table at all.
And I have a query:
explain select * from test;
It shows in explain:
rows: 1
filtered: 100
Where does it get that 1 row from, if there is no data in the table at all?
In MySQL, EXPLAIN returns the execution plan of the query, not the actual result. This is what the documentation says:
When EXPLAIN is used with an explainable statement, MySQL displays
information from the optimizer about the statement execution plan.
That is, MySQL explains how it would process the statement, including
information about how tables are joined and in which order.
The execution plan is represented in a tabular format, and the 1 row in the output is actually a row of the execution plan, not of the query output. EXPLAIN shows 1 row per table, so if you have more than one table in your query, you will see 2 rows in the output regardless of the result of the actual SELECT query.
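To illustrate the one-row-per-table behavior, a sketch using the test table from the question:

```sql
-- A self-join of the empty table: EXPLAIN produces two plan rows,
-- one per table reference, even though the query returns no data rows.
EXPLAIN SELECT * FROM test t1 JOIN test t2 ON t1.pk = t2.pk;
```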
Here's the detailed explanation of EXPLAIN output format.
OK, I've debugged the mysqld source code, and here is what I found:
n_rows = ib_table->stat_n_rows; // zero here: the table was truncated in my case; otherwise it equals the row count recorded in the table's statistics
Further below in the code:
/*
The MySQL optimizer seems to assume in a left join that n_rows
is an accurate estimate if it is zero. Of course, it is not,
since we do not have any locks on the rows yet at this phase.
Since SHOW TABLE STATUS seems to call this function with the
HA_STATUS_TIME flag set, while the left join optimizer does not
set that flag, we add one to a zero value if the flag is not
set. That way SHOW TABLE STATUS will show the best estimate,
while the optimizer never sees the table empty.
However, if it is internal temporary table used by optimizer,
the count should be accurate */
if (n_rows == 0 && !(flag & HA_STATUS_TIME) <-- this expression is true
&& table_share->table_category != TABLE_CATEGORY_TEMPORARY) {
n_rows++; <-- this code executes
}
Then, further below, this value is used everywhere:
stats.records = (ha_rows) n_rows;

Partition strategy for MySQL 5.5 (InnoDB)

I am trying to implement a partition strategy for a MySQL 5.5 (InnoDB) table, and I am not sure my understanding is right or whether I need to change the syntax for creating the partition.
Table "Apple" has 10 million rows, with columns "A" through "H".
PK is columns "A", "B" and "C"
Column "A" is a char column and can identify groups of 2 million rows.
I thought column "A" would be a nice candidate to try and implement a partition around since
I select and delete by this column and could really just truncate the partition when the data is no longer needed.
I issued this command:
ALTER TABLE Apple
PARTITION BY KEY (A);
After looking at the partition info using this command:
SELECT PARTITION_NAME, TABLE_ROWS FROM
INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'Apple';
I see all the data is in partition p0.
Am I wrong in thinking that MySQL was going to break out the partitions in groups of 2 million automagically?
Did I need to specify the number of partitions in the Alter command?
I was hoping this would create groups of 2 million rows in a partition and then create a new partition as new data comes in with a unique value for column "A".
Sorry if this was too wordy.
Thanks - JeffSpicoli
Yes, you need to specify the number of partitions (the default is to create 1 partition). PARTITION BY KEY uses an internal hashing function (http://dev.mysql.com/doc/refman/5.1/en/partitioning-key.html), so the partition is not selected based on the value of the column, but on a hash computed from it. Hashing functions return the same result for the same input, so yes, all rows having the same value will be in the same partition.
But maybe you want to partition by RANGE instead, if you want to be able to DROP PARTITION: with KEY partitioning you only know that the rows are spread evenly across the partitions, and many different values end up in the same partition.
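A sketch of what RANGE partitioning on column "A" might look like (the boundary values here are hypothetical; RANGE COLUMNS is needed because "A" is a char column, and it works here because "A" is part of the primary key):

```sql
ALTER TABLE Apple
PARTITION BY RANGE COLUMNS (A) (
    PARTITION p0   VALUES LESS THAN ('G'),
    PARTITION p1   VALUES LESS THAN ('N'),
    PARTITION p2   VALUES LESS THAN ('T'),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
);

-- Old data can then be removed cheaply, e.g.:
-- ALTER TABLE Apple DROP PARTITION p0;
```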