How does hash(row id + year) partition work? - mysql

I'm new to this with partitions. Didn't knew it existed but came aware when I tried to make our new 'url_hash' column unique in a table in our database. And got the error message:
A UNIQUE INDEX must include all columns in the table's partitioning function
This is a database created by another person that I don't know and who are not involved in the project anymore.
I have tried to read mysql documentation and read on forums about Partition. What it is and how it works. Understand the purpose, to "divide" a table in to several "parts" so it becomes faster to retrieve relevant data. A common example is to partition in to years intervals. But most examples shows an manual method. Where you decide for example less than three specific years. For example:
PARTITION BY RANGE ( YEAR(separated) ) (
PARTITION p0 VALUES LESS THAN (1991),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (2001),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
But in our table, the partitions are created this way:
PARTITION BY HASH ( `feeditemsID` + YEAR(`feeddate`))
PARTITIONS 3;
What does that mean? How does our partition work?
feeditemsID is the unique ID for every row in our table.

When you use hash partitioning, the partition that contains each record is determined by calculating a hash code from the expression feaditemsID + YEAR(feeddate), and then finding the modulus of this code by the number of partitions. So if the hash code for a row is 123, it calculates 123 % 3, which is 0, so the record goes into partition 0.
This is explained inthe MySQL documentation.
As stated there,
Note
If a table to be partitioned has a UNIQUE key, then any columns supplied as arguments to the HASH user function or to the KEY's column_list must be part of that key.
In your case, the table's primary key needs to be:
PRIMARY KEY (feeditemsID, feeddate)
Assuming feeditemsID is already unique (presumably it's an auto-increment column), adding feeddate to the primary is redundant as far as keeping the data unique is concerned, but it's needed to satisfy the partitioning requirement. Putting feeditemsID first in the composite key will allow it to be used by itself to optimize table lookup.
This requirement is probably because each partition has its own index. When inserting/updating a row and checking for uniqueness, it only checks the index of the partition where that row will be stored. So when it finds the partition using the hash function, it needs to be sure that this partition will uniquely contain the indexed columns.
For more information see
Partitioning Keys, Primary Keys, and Unique Keys

Related

MySQL: How avoid all partitions scan (year-based) when doing ID lookup?

In case I have a table partitioned by year; how do I avoid the scanning of all partitions when I have to lookup a row by its ID and can't use partition pruning in the lookup query?
CREATE TABLE part_table (
id bigint NOT NULL auto_increment,
moment datetime NOT NULL,
KEY (id),
KEY (moment)
)-- partitioning information (in years)
PARTITION BY RANGE( YEAR(moment) ) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pFuture VALUES LESS THAN (maxvalue) )
;
With e.g. lookup query:
SELECT * FROM part_table WHERE ID = <nr>
Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
Indexes are partitioned. Each partition is essentially a "table". It has a `BTree for the data and PK, and a BTree for each secondary index.
So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
It is inefficient to pre-create future partitions (other than one).
Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:
Dropping 'old' data is much faster than DELETEing it.
`WHERE something-else AND moment between ..
Some cases
For this discussion, I assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc).
WHERE id BETWEEN 111 and 222
Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.
WHERE id BETWEEN 111 and 222
AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`
This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to lookup by id.
Now let be discuss two flavors if an index starting with id (and assuming either of the WHERE clauses, above:
PRIMARY KEY(id, moment)
The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.
INDEX(id)
is not "clustered". It is a secondary index. Secondary indexes take two steps. (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.
DROP PARTITION p2020
id much faster and less invasive than `DELETE .. WHERE moment < '2021-01-01'.
More
It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.
Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries. And what datatypes they are.
In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):
If the WHERE references only one column, the PARTITIONing is probably never beneficial.
If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.

Mysql select by auto increment primary key while partitioned by date

I was wondering how would mysql act if i partition a table by date and then have some select or update queries by primary key ?
is it going to search all partitions or query optimizer knows in which partition the row is saved ?
What about other unique and not-unique indexed columns ?
Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partitions. No harm is done (except performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows thing down due to things such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)

Is the partition key required when retrieving by the document ID

Is it possible to retrieve a document by its ID without specifying the partition key?
My understanding from reading the documentation is that the query will fan out across all partitions when the partition key is not specified:
The following query does not have a filter on the partition key
(DeviceId) and is fanned out to all partitions where it is executed
against the partition's index. Note that you have to specify the
EnableCrossPartitionQuery (x-ms-documentdb-query-enablecrosspartition
in the REST API) to have the SDK to execute a query across partitions.
This makes sense with non-key properties, but given the ID is treated specially, I'm hoping I won't need to enable cross partition queries for it.
If I do need to enable cross partition queries, would this be an expensive operation?
Query by just ID will be a cross partition operation. You should include the partition key in these queries in FeedOptions.PartitionKey, or as part of the filter.
In DocumentDB, ID is not unique across all documents within a collection. Instead, the combination of "partition key" and "id" is the primary key and uniquely identifies documents within a collection.
Some applications encode partition key as part of the ID, e.g. partition key would be customer ID, and ID = "customer_id.order_id", so you can extract the partition key from the ID value.

Error #1526 when partitioning table on mysql

Sorry, I don't know English, but I need help :(
I'm using partitioning by LIST COLUMNS by ALTER TABLE statement
My table :
table member_list:
id int,
name varchar(255),
company varchar(255),
cell_phone varchar(20)
It's haven't key
I have more than 900.000 records in the current. After inserting, I tried partitioning table by LIST COLUMNS :
alter table member_list
partition by list columns(company)(
partition p1 values in ('Lavasoft','Cakewalk','Lycos'),
partition p2 values in ('Adobe','Vivoo','Apple Systems','Sibelius'),
partition p3 values in ('Finale','Borland','Macromedia','FPT'),
partition p4 values in ('Chami','Yahoo','Google','Altavista')
)
After runned :
#1526 - Table has no partition for value from column_list
MySQL returned me this error, I can not find support from Oracle page. I hope you will help me. Thanks
#1526 - Table has no partition for value from column_list
The error message is telling you that there is a value in your data in one of the columns you have chosen for partitioning that is not accounted for in your defined partitions.
In this case, there is something in the "company" field that cannot be placed into any of the partitions. For instance, on some record, company="Blackberry." MySQL cannot put this row into any of your partitions.
LIST partitioning allow only Integer values. If you want to use columns with varchar partitioning use HASH or KEY PARTITIONS. Besides partition can only be used on columns that have primary or unique attribute.

Partition strategy for MySQL 5.5 (InnoDB)

Trying to implement a partition strategy for a MySQL 5.5 (InnoDB) table and I am not sure my understanding is right or if I need to change the syntax in creating the partition.
Table "Apple" has 10 mill rows...Columns "A" to "H"
PK is columns "A", "B" and "C"
Column "A" is a char column and can identify groups of 2 million rows.
I thought column "A" would be a nice candidate to try and implement a partition around since
I select and delete by this column and could really just truncate the partition when the data is no longer needed.
I issued this command:
ALTER TABLE Apple
PARTITION BY KEY (A);
After looking at the partition info using this command:
SELECT PARTITION_NAME, TABLE_ROWS FROM
INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'Apple';
I see all the data is on partition p0
I am wrong in thinking that MySQL was going to break out the partitions in groups of 2 million automagically?
Did I need to specify the number of partitions in the Alter command?
I was hoping this would create groups of 2 million rows in a partition and then create a new partition as new data comes in with a unique value for column "A".
Sorry if this was too wordy.
Thanks - JeffSpicoli
Yes, you need to specify the number of partitions (I assume the default was to create 1 partition). Partition by KEY uses internal hashing function http://dev.mysql.com/doc/refman/5.1/en/partitioning-key.html , so the partition is not selected based on the value of column, but on hash computed from it. Hashing functions return the same result for same input, so yes, all rows having the same value will be in the same partition.
But maybe you want to partition by RANGE if you want to be able to DROP PARTITION (because if partitioned by KEY, you only know that the rows are spaced evenly in the partitions, but you many different values end up in the same partition).