In SAP HANA 1.0 SPS 12 we want to partition a table by ValidationAreaID and by VersionValidTo.
This is no problem so far.
But since a comparison with NULL is supposed to be faster than a comparison with a timestamp, I want to partition by
} technical configuration {
partition by
range (ValidationAreaID) (
partition value = 1,
partition value = 2,
partition value = 3,
partition others
),
range (VersionValidTo) (
partition value = null,
partition others
)
;
instead of
} technical configuration {
partition by
range (ValidationAreaID) (
partition value = 1,
partition value = 2,
partition value = 3,
partition others
),
range (VersionValidTo) (
partition value = '9999-12-31',
partition others
)
;
However, trying to partition by a null value results in the error message: Syntax error: unexpected token "null"
To provide a closable answer:
The partition definition clauses don't allow an IS NULL check.
Each partition needs to be specified either by a single, uniquely identifiable value (a string or an (unsigned) numeral) or by a closed range of values.
This answers the first part of the question, whether it's possible to create a partition for records where the condition IS NULL evaluates to true: it's not.
The second part of the answer addresses the claim that a check for IS NULL is faster than a check for a specific value.
This is not generally true. While you may find data distributions for which checking for NULL entries in a specific column can be done quicker than scanning the entire main segment of that column, this is not specific to the NULL entry.
Depending on the overall distribution of distinct values in any column (and across all columns in the table), SAP HANA will sort and compress the value ID pointers in the main segment of the column.
If, for example, the majority of all entries in a column are currently NULL, this may well result in a compression that puts all NULL entries at the very top and compresses them with RLE.
A general search for IS NULL would be very fast in this case.
Likewise, the compression could change for other very prominent values of that column.
The only technical difference in the column store for NULLs, that I'm aware of, is that they have a hard-coded and fixed value ID so the lookup into the dictionary can be avoided and all columns share the same value ID for NULL entries. As the dictionary lookup is usually not the bottleneck in statement execution, it's fair to say that the "NULL is faster" idea is not true.
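To illustrate why a dominant value can be found quickly regardless of whether it is NULL, here is a small Python sketch of run-length encoding over a sorted vector of value IDs. It is a simplified model and not SAP HANA's actual implementation; the value ID 0 for NULL and both function names are assumptions made for the example.

```python
from itertools import groupby

NULL_ID = 0  # hypothetical fixed value ID for NULL, chosen for this sketch

def run_length_encode(value_ids):
    """Collapse consecutive equal value IDs into (value_id, run_length) pairs."""
    return [(vid, len(list(run))) for vid, run in groupby(value_ids)]

def count_value(runs, target):
    """Count occurrences of target by reading the runs, not every single row."""
    return sum(length for vid, length in runs if vid == target)

if __name__ == "__main__":
    # A sorted column in which most entries are NULL
    column = [NULL_ID] * 6 + [3, 3, 7]
    runs = run_length_encode(column)
    print(runs)
    print(count_value(runs, NULL_ID))
```

The same shortcut applies to any other value that dominates the column, which is the point made above: the speed comes from the data distribution, not from NULL itself.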
I have a huge table that stores many tracked events, such as a user click.
The table is already in the 10s of millions, and it's growing larger every day.
The queries are starting to get slower when I try to fetch events from a large timeframe, and after reading quite a bit on the subject I understand that partitioning the table may boost the performance.
What I want to do is partition the table on a per month basis.
I have only found guides that show how to partition manually for each month. Is there a way to tell MySQL to partition by month and have it do that automatically?
If not, what is the command to do it manually, considering the column I partition by is a datetime?
As explained by the manual: http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
This is easily possible by hash partitioning of the month output.
CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)
ENGINE=INNODB
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Do note that this only partitions by month and not by year. Also, there are only 6 partitions in this example, so each partition holds two calendar months (for example, January and July end up in the same partition).
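To see which months end up together, here is a small Python sketch. It assumes that, for PARTITION BY HASH, MySQL stores a row in partition number MOD(expression, number of partitions), as the manual describes; the helper name is made up for illustration.

```python
from collections import defaultdict

def months_per_partition(num_partitions):
    """Group calendar months by the partition HASH(MONTH(d)) would pick."""
    groups = defaultdict(list)
    for month in range(1, 13):
        # Assumed placement rule: partition number = expression MOD partitions
        groups[month % num_partitions].append(month)
    return dict(groups)

if __name__ == "__main__":
    for part, months in sorted(months_per_partition(6).items()):
        print(f"p{part}: months {months}")
```

With 12 partitions every month gets its own partition, but rows from the same month of different years still share one.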
And for partitioning an existing table (manual: https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html):
ALTER TABLE ti
PARTITION BY HASH( MONTH(tr_date) )
PARTITIONS 6;
Querying can be done both from the entire table:
SELECT * from ti;
Or from specific partitions:
SELECT * from ti PARTITION (p1);  -- PARTITION takes partition names; unnamed hash partitions are called p0 ... p5
CREATE TABLE `mytable` (
`post_id` int DEFAULT NULL,
`viewid` int DEFAULT NULL,
`user_id` int DEFAULT NULL,
`post_Date` datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (extract(year_month from `post_Date`))
(PARTITION P0 VALUES LESS THAN (202012) ENGINE = InnoDB,
PARTITION P1 VALUES LESS THAN (202104) ENGINE = InnoDB,
PARTITION P2 VALUES LESS THAN (202108) ENGINE = InnoDB,
PARTITION P3 VALUES LESS THAN (202112) ENGINE = InnoDB,
PARTITION P4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Be aware of the "lazy" effect when partitioning by hash.
As the docs say:
You should also keep in mind that this expression is evaluated each time a row is inserted or updated (or possibly deleted); this means that very complex expressions may give rise to performance issues, particularly when performing operations (such as batch inserts) that affect a great many rows at one time.
The most efficient hashing function is one which operates upon a single table column and whose value increases or decreases consistently with the column value, as this allows for “pruning” on ranges of partitions. That is, the more closely that the expression varies with the value of the column on which it is based, the more efficiently MySQL can use the expression for hash partitioning.
For example, where date_col is a column of type DATE, then the expression TO_DAYS(date_col) is said to vary directly with the value of date_col, because for every change in the value of date_col, the value of the expression changes in a consistent manner. The variance of the expression YEAR(date_col) with respect to date_col is not quite as direct as that of TO_DAYS(date_col), because not every possible change in date_col produces an equivalent change in YEAR(date_col).
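The difference the manual describes can be made concrete with a short Python sketch. It uses date.toordinal() as a stand-in for TO_DAYS() (both are day counts, though with a different starting offset) and the year attribute as a stand-in for YEAR(); the helper name is invented for this example.

```python
from datetime import date, timedelta

def changes(func, start, days):
    """Count how often func(d) changes across `days` consecutive day steps."""
    values = [func(start + timedelta(days=i)) for i in range(days + 1)]
    return sum(1 for a, b in zip(values, values[1:]) if a != b)

if __name__ == "__main__":
    start = date(2019, 1, 1)
    # A day count changes on every step; the year changes only at New Year
    print(changes(date.toordinal, start, 365))
    print(changes(lambda d: d.year, start, 365))
```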
HASHing by month with 6 partitions means that two months a year will land in the same partition. What good is that?
Don't bother partitioning, index the table.
Assuming these are the only two queries you use:
SELECT * from ti;
SELECT * from ti PARTITION (p1);  -- PARTITION takes partition names; unnamed hash partitions are called p0 ... p5
then start the PRIMARY KEY with the_date.
The first query simply reads the entire table; no change between partitioned and not.
The second query, assuming you want a single month, not all the months that map into the same partition, would need to be
SELECT * FROM ti WHERE the_date >= '2019-03-01'
AND the_date < '2019-03-01' + INTERVAL 1 MONTH;
If you have other queries, let's see them.
(I have not found any performance justification for ever using PARTITION BY HASH.)
General context
I want to be able to tell, when inserting into non-balanced RANGE-partitioned MySQL tables with AUTO_INCREMENT primary keys, whether my inserts are causing MySQL to communicate in any way with partitions other than the ones I specify. This is useful for budgeting future capacity for large-scale data loading; with that assurance, I could much more accurately predict the performance and hardware resource cost of loading data into the database.
I am using MySQL 5.6.
Specific context
Say I have the following table in MySQL (5.6):
CREATE TABLE foo (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(6) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9001 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
/*!12345 PARTITION BY RANGE (id)
(PARTITION cold VALUES LESS THAN (8000) ENGINE = InnoDB,
PARTITION hot VALUES LESS THAN (9000) ENGINE = InnoDB,
PARTITION overflow VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Assume the table is not sparse: no rows have been deleted, so count(*) = max(id) = 9000 (the next AUTO_INCREMENT value is 9001).
Questions
If I do INSERT INTO foo PARTITION (hot) (data) VALUES ('abc') or an equivalent LOAD DATA statement with the PARTITION clause included, are any partitions other than the selected hot partition being accessed?
How would I tell what partitions are being accessed by those DML statements?
What I've tried
The MySQL documentation on partition selection says:
REPLACE and INSERT now lock only those partitions having rows to be
inserted or replaced. However, if an AUTO_INCREMENT value is generated
for any partitioning column then all partitions are locked.
Additionally, it says:
Locks imposed by LOAD DATA statements on partitioned tables cannot be
pruned.
Those statements don't help clarify which partitions are being accessed by DML queries which explicitly specify the partition.
I've tried doing EXPLAIN PARTITIONS INSERT INTO foo ..., but the partitions column of the output is always NULL.
According to the documentation,
For statements that insert rows, the behavior differs in that failure to find a suitable partition causes the statement to fail. This is true for both INSERT and REPLACE statements
So when you try to insert a row that does not match your specified partition, you'll receive
Error Code: 1748. Found a row not matching the given partition set
This includes statements where some rows match and some don't,
so you cannot use this to fill "hot" and throw away rows that would go into "overflow" (as the whole query will fail).
The EXPLAIN output for MySQL 5.6 does not include a separate row for the insert; the value for partitions relates to the source of the data you insert (in cases where you e.g. use insert ... select ... partition ...), even if you use values() (then you use "no table", and the relevant partition is just null). For MySQL 5.7+, there is an "insert" type, and it would indeed list only your specified partition.
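The all-or-nothing behaviour described in this answer can be modelled with a few lines of Python. This is only a sketch of the documented rule, not of MySQL internals: the bounds mirror the CREATE TABLE from the question, and the function names are hypothetical.

```python
import bisect

# Exclusive upper bounds taken from the table definition in the question
BOUNDS = [8000, 9000]                 # cold, hot
NAMES = ["cold", "hot", "overflow"]   # overflow = VALUES LESS THAN MAXVALUE

def partition_for(row_id):
    """Return the name of the RANGE partition that row_id falls into."""
    return NAMES[bisect.bisect_right(BOUNDS, row_id)]

def check_insert(selected, row_ids):
    """Reject the whole statement if any row falls outside the selected partitions."""
    for row_id in row_ids:
        if partition_for(row_id) not in selected:
            # Models error 1748: the statement fails as a whole
            raise ValueError("Found a row not matching the given partition set")
    return len(row_ids)

if __name__ == "__main__":
    print(partition_for(8500))
    print(partition_for(9001))
    print(check_insert({"hot"}, [8500, 8999]))
```

Note that with the next AUTO_INCREMENT value at 9001, a generated id would map to overflow, so an insert restricted to hot would be rejected under this model.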
I have a simple history table that I am developing a new lookup for. I am wondering what is the best index (if any) to add to this table so that the lookups are as fast as possible.
The history table is a simple set of records of actions taken. Each action has a type and an action date (and some other attributes). Every day a new set of action records is generated by the system.
The relevant pseudo-schema is:
TABLE history
id int,
type int,
action_date date
...
INDEX
id
...
Note: the table is not indexed on type or action_date.
The new lookup function is intended to retrieve all the records of a specific type that occurred on a specific action date.
My initial inclination is to define a compound key consisting of both the type and the action_date.
However in my case there will be many actions with the same type and date. Further, the actions will be roughly evenly distributed in number each day.
Given all of the above: (a) is an index worthwhile; and (b) if so, what is the preferred index(es)?
I am using MySQL, but I think my question is not specific to this RDBMS.
The first field in the index should be the one that gives you the smallest dataset for the majority of queries after the condition is applied.
Depending on your business requirements, you may request a specific date or a specific date range (most likely a date range). So the date should be the last field in the index. Most likely you will always have the date condition.
A common answer is to have the (type,date) index, but you should consider just the date index if you ever query more than one type value in the query or if you have just a few types (say, fewer than 5) and they are not evenly distributed.
For example, if type 1 makes up 70% of the table, types 2, 3, 4, ... each make up only a few percent of the table, and you often query type 1, you are better off with a separate date index and a separate type index (for the cases when you query types 2, 3, 4, ...) than with a compound (type, date) index.
INDEX(type, action_date), regardless of cardinality or distribution of either column. Doing so will minimize the number of 'rows' of the index's BTree that need to be looked at. (Yes, I am disagreeing with Sergiy's Answer.)
Even WHERE type IN (2,3) AND action_date ... can use that index.
For checking against a date range of, say 2 weeks, I recommend this pattern:
AND action_date >= '2016-10-16'
AND action_date < '2016-10-16' + INTERVAL 2 WEEK
A way to see how much "work" is needed for a query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers presented will give you a feel for how many index (or data) rows need to be touched. This makes it easy to see which of two possible queries/indexes works better, even when the table is too small to get reliable timings.
Yes, an index is worthwhile. Especially if you search for a small subset of the table.
If your search would match 20% or more of the table (approximately), the MySQL optimizer decides that the index is more trouble than it's worth, and it'll do a table-scan even if the index is available.
If you search for one specific type value and one specific date value, an index on (type, date) or an index on (date, type) is a good choice. It doesn't matter much which column you list first.
If you search for multiple values of type or multiple dates, then the order of columns matters. Follow this guide:
The leftmost columns of the index should be the ones on which you do equality comparisons. An equality comparison is one that matches exactly one value (even if that value is found on many rows).
WHERE type = 2 AND date = '2016-10-19' -- both equality
The next column of the index can be part of a range comparison. A range comparison matches multiple values. For example, > or IN( ) or BETWEEN or !=.
WHERE type = 2 AND date > '2016-10-19' -- one equality, one range
Only one such column benefits from an index. If you have range comparisons on multiple columns, only the first of those columns in the index will use the index to support lookups. The subsequent column(s) will have to search through those matching rows "the hard way".
WHERE type IN (2, 3, 4) AND date > '2016-10-19' -- multiple range
If you sometimes search using a range condition on type and equality on date, you'll need to create a second index.
WHERE type IN (2, 3, 4) AND date = '2016-10-19' -- make index on (date, type)
The order of terms in your WHERE clause doesn't matter. The SQL query optimizer will figure that out and reorder them to match the right columns defined in an index.
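The "equality first, then one range" rule can be illustrated with a sorted list standing in for the index. This is only a rough Python model of a composite index on (type, action_date), not how MySQL stores a BTree; the row layout and function names are assumptions made for the example.

```python
from bisect import bisect_left
from datetime import date

def build_index(rows):
    """Model INDEX(type, action_date) as a sorted list of (type, date, id)."""
    return sorted((r["type"], r["action_date"], r["id"]) for r in rows)

def lookup_eq_range(index, type_value, start, end):
    """type = type_value AND start <= action_date < end.

    Because entries are sorted by type first and date second, all matches
    form one contiguous slice that two binary searches can locate.
    """
    lo = bisect_left(index, (type_value, start))
    hi = bisect_left(index, (type_value, end))
    return index[lo:hi]

if __name__ == "__main__":
    rows = [
        {"id": 1, "type": 2, "action_date": date(2016, 10, 18)},
        {"id": 2, "type": 2, "action_date": date(2016, 10, 19)},
        {"id": 3, "type": 3, "action_date": date(2016, 10, 19)},
        {"id": 4, "type": 2, "action_date": date(2016, 10, 20)},
    ]
    index = build_index(rows)
    print(lookup_eq_range(index, 2, date(2016, 10, 19), date(2016, 10, 21)))
```

A range on type followed by a range on date would not map to a single slice in this model, which mirrors why only one range column can be narrowed through the index.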
I'm new to partitions. I didn't know they existed, but became aware of them when I tried to make our new 'url_hash' column unique in a table in our database and got the error message:
A UNIQUE INDEX must include all columns in the table's partitioning function
This database was created by another person whom I don't know and who is no longer involved in the project.
I have tried to read the MySQL documentation and forum posts about partitioning: what it is and how it works. I understand the purpose, which is to "divide" a table into several "parts" so that it becomes faster to retrieve relevant data. A common example is to partition into intervals of years. But most examples show a manual method, where you decide on the boundaries yourself, for example "less than" three specific years:
PARTITION BY RANGE ( YEAR(separated) ) (
PARTITION p0 VALUES LESS THAN (1991),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (2001),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
But in our table, the partitions are created this way:
PARTITION BY HASH ( `feeditemsID` + YEAR(`feeddate`))
PARTITIONS 3;
What does that mean? How does our partition work?
feeditemsID is the unique ID for every row in our table.
When you use hash partitioning, the partition that contains each record is determined by calculating a hash code from the expression feeditemsID + YEAR(feeddate), and then finding the modulus of this code by the number of partitions. So if the hash code for a row is 123, it calculates 123 % 3, which is 0, so the record goes into partition 0.
This is explained in the MySQL documentation.
As stated there,
Note
If a table to be partitioned has a UNIQUE key, then any columns supplied as arguments to the HASH user function or to the KEY's column_list must be part of that key.
In your case, the table's primary key needs to be:
PRIMARY KEY (feeditemsID, feeddate)
Assuming feeditemsID is already unique (presumably it's an auto-increment column), adding feeddate to the primary is redundant as far as keeping the data unique is concerned, but it's needed to satisfy the partitioning requirement. Putting feeditemsID first in the composite key will allow it to be used by itself to optimize table lookup.
This requirement is probably because each partition has its own index. When inserting/updating a row and checking for uniqueness, it only checks the index of the partition where that row will be stored. So when it finds the partition using the hash function, it needs to be sure that this partition will uniquely contain the indexed columns.
For more information see
Partitioning Keys, Primary Keys, and Unique Keys
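The modulus step described in this answer can be sketched in Python. The sketch assumes the integer value of the expression is used directly as the hash code; the function name and sample values are made up for illustration.

```python
from datetime import date

def hash_partition(feeditems_id, feeddate, num_partitions=3):
    """Partition number for HASH(feeditemsID + YEAR(feeddate)) PARTITIONS n."""
    # Assumed rule: evaluate the integer expression, then take it modulo n
    return (feeditems_id + feeddate.year) % num_partitions

if __name__ == "__main__":
    # Same ID, different year: the row would land in a different partition
    print(hash_partition(123, date(2016, 10, 19)))
    print(hash_partition(123, date(2017, 1, 1)))
```

Because the same feeditemsID can map to different partitions depending on feeddate, a uniqueness check confined to one partition needs both columns in the key.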