Is the partition key required when retrieving by the document ID - partitioning

Is it possible to retrieve a document by its ID without specifying the partition key?
My understanding from reading the documentation is that the query will fan out across all partitions when the partition key is not specified:
The following query does not have a filter on the partition key
(DeviceId) and is fanned out to all partitions where it is executed
against the partition's index. Note that you have to specify the
EnableCrossPartitionQuery (x-ms-documentdb-query-enablecrosspartition
in the REST API) to have the SDK execute a query across partitions.
This makes sense with non-key properties, but given the ID is treated specially, I'm hoping I won't need to enable cross partition queries for it.
If I do need to enable cross partition queries, would this be an expensive operation?

A query by just the ID will be a cross-partition operation. You should include the partition key in these queries, either in FeedOptions.PartitionKey or as part of the filter.
In DocumentDB, ID is not unique across all documents within a collection. Instead, the combination of "partition key" and "id" is the primary key and uniquely identifies documents within a collection.
Some applications encode the partition key as part of the ID, e.g. the partition key is the customer ID and ID = "customer_id.order_id", so you can extract the partition key from the ID value.
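As an illustration, a lookup that stays within a single partition can be expressed either through FeedOptions.PartitionKey in the SDK or as a filter in the query text itself. A sketch of the query form, assuming a collection partitioned on DeviceId as in the quoted documentation (the values are made up):
SELECT * FROM c WHERE c.DeviceId = 'device-42' AND c.id = 'order-1001'
Because the filter pins the partition key, the query is routed to a single partition and EnableCrossPartitionQuery is not needed.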

Related

MySQL: How avoid all partitions scan (year-based) when doing ID lookup?

In case I have a table partitioned by year: how do I avoid scanning all partitions when I have to look up a row by its ID and can't use partition pruning in the lookup query?
CREATE TABLE part_table (
id bigint NOT NULL auto_increment,
moment datetime NOT NULL,
KEY (id),
KEY (moment)
)
-- partitioning information (in years)
PARTITION BY RANGE( YEAR(moment) ) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pFuture VALUES LESS THAN (maxvalue) )
;
With e.g. lookup query:
SELECT * FROM part_table WHERE ID = <nr>
Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
Indexes are partitioned. Each partition is essentially a "table". It has a BTree for the data and PK, and a BTree for each secondary index.
So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
It is inefficient to pre-create future partitions (other than one).
Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:
Dropping 'old' data is much faster than DELETEing it.
WHERE something-else AND moment BETWEEN ..
Some cases
For this discussion, I am assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc.).
WHERE id BETWEEN 111 and 222
Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.
WHERE id BETWEEN 111 and 222
AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`
This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to look up by id.
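One way to verify the pruning is with EXPLAIN (MySQL 5.7+ shows a partitions column by default; older versions need EXPLAIN PARTITIONS). A sketch against the OP's table:
EXPLAIN SELECT * FROM part_table
WHERE id BETWEEN 111 AND 222
  AND moment > NOW() - INTERVAL 1 MONTH;
-- the partitions column should list only the partitions that can hold recent rows
-- (e.g. p2025 and pFuture) rather than all of them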
Now let me discuss two flavors of an index starting with id (assuming either of the WHERE clauses above):
PRIMARY KEY(id, moment)
The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.
INDEX(id)
is not "clustered". It is a secondary index. Secondary indexes take two steps. (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.
DROP PARTITION p2020
is much faster and less invasive than DELETE .. WHERE moment < '2021-01-01'.
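The full statement, using the OP's table name:
ALTER TABLE part_table DROP PARTITION p2020;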
More
It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.
Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries, and what datatypes they are.
In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):
If the WHERE references only one column, the PARTITIONing is probably never beneficial.
If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.

Mysql select by auto increment primary key while partitioned by date

I was wondering how MySQL would act if I partition a table by date and then have some select or update queries by primary key?
Is it going to search all partitions, or does the query optimizer know in which partition the row is saved?
What about other unique and non-unique indexed columns?
Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY, must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partition. No harm is done (except performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows things down due to issues such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)
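A sketch of that situation with a hypothetical table partitioned by year, where the AUTO_INCREMENT id leads the primary key:
CREATE TABLE events (
  id      BIGINT NOT NULL AUTO_INCREMENT,
  created DATETIME NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id, created)  -- id is first in an index, as AUTO_INCREMENT requires; created satisfies the partitioning rule
)
PARTITION BY RANGE ( YEAR(created) ) (
  PARTITION p2023 VALUES LESS THAN (2024),
  PARTITION p2024 VALUES LESS THAN (2025),
  PARTITION pMax  VALUES LESS THAN (MAXVALUE)
);
SELECT * FROM events WHERE id = 123;             -- no partition key in the WHERE: every partition's PK is probed
UPDATE events SET payload = 'x' WHERE id = 123;  -- same: one real update plus N-1 attempts that match nothing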

One composite index or many indexes for foreign keys?

What is the difference between creating a covering index on all the foreign keys of a relation table and creating one index for each column (foreign key) of the relation table?
For instance, I have the table sales(p_id, e_id, c_id, amount) where p_id is a foreign key (products table), e_id is a foreign key (employee table) and c_id a foreign key (customer table). The primary key of the table is {p_id, e_id, c_id}.
Which one is better?
CREATE INDEX cmpindex ON sales(p_id, e_id, c_id)
OR
CREATE INDEX pindex on sales(p_id)
CREATE INDEX eindex on sales(e_id)
CREATE INDEX cindex on sales(c_id)
I mostly run queries with joins on the relation table and the parent tables.
Which one is better depends on your actual queries.
One thing to understand is that when you join the table sales once in your query, it will only use one index for it (at the most). So you need to make sure an index is available that is most appropriate for the query.
If you always join the sales table to all three other tables (customer, product and employee), then a composite index is to be preferred, assuming that the engine will actually use it and not perform a table scan.
The order of the fields in the composite index is important when it comes to the order of the results. For instance, if your query is going to group the results by product (first), and then order the details per customer, you could benefit from an index that has the product id first, and the customer id as second.
But it may also be that the engine decides that it is better to start scanning the table sales first and then join in the other three tables using their respective primary key indexes. In that case no index is used that exists on the sales table.
The only way to find out is to get the execution plan of your query and see which indexes will be used when they are defined.
If you only have one query on the sales table, there is no need to have several indexes. But more likely you have several queries which output completely different results, with different field selections, filters, groupings, etc.
In that case you may need several indexes, some of which will serve for one type of query, and others for others. Note that what you propose is not mutually exclusive. You could maybe benefit from several composite indexes, which just have a different order of fields.
Obviously, a multitude of indexes will slow down data changes in those tables, so you need to consider that trade-off as well.
Note that an index on a compound key will only be used if you query on the first portion, the first and second portion, the first, second and third portion, etc., so querying on p_id, or on p_id and e_id, or even on e_id and p_id (the order in the WHERE clause doesn't matter) will utilize the index. Indeed, any query filtering on p_id can use this index.
However, if you query your sales table on e_id or c_id or any combination of these two, cmpindex will not be used and a full table scan will be performed.
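To illustrate with the OP's table (a sketch, assuming cmpindex as defined above):
SELECT * FROM sales WHERE p_id = 1;               -- can use cmpindex (leftmost column)
SELECT * FROM sales WHERE p_id = 1 AND e_id = 2;  -- can use cmpindex (leftmost two columns)
SELECT * FROM sales WHERE e_id = 2 AND c_id = 3;  -- cannot use cmpindex; falls back to a scan unless another index exists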
One benefit of having an index on each foreign key (a non-unique index, as there could be multiple sales of the same product, or by the same employee, or to the same customer, leading to duplicate entries in the index) is that the query optimizer has the option of using the index to reduce the number of rows returned, and then doing a sequential search through the result set.
E.g. if the query is a search on sales of a particular product to a particular customer (regardless of employee) and you have a million sales, the foreign key index cindex could be used to return 20 sales items to that particular customer, and that result set could be very efficiently searched sequentially to find which of these sales were for a particular product.
If the search was performed on Product and pindex was used, the result set may be 10,000 rows (all sales of that product), which would have to be sequentially searched to find the sales of that product to a particular customer, leading to a very inefficient query.
I believe that the statistics kept for a table (used by the optimizer) keep track of the average number of rows that will be returned for a query using each index, so the optimizer will be able to work out that cindex should be used rather than pindex in the examples above. Alternatively, you can give hints on your queries to specify that a particular index be used.
It is, obviously, important to run UPDATE STATISTICS on a regular basis, as the execution plan would use pindex in the example above if there were, on average, only 10 sales of each product.
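In MySQL, the statement that refreshes index statistics is ANALYZE TABLE:
ANALYZE TABLE sales;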
If your queries search sales on each of these columns independently, then you should create a separate index for each.
If that's not necessary, then you can go for the composite index.
As HoneyBadger commented, you already have a composite index, since your primary key is itself an index.
In general, you should use a single index for each column whenever you think you will have queries involving each field by itself.
As stated here, when you have a composite index, it can work with queries involving all fields, or with queries involving the first field (in order), the first and the second, or the first,second and third together. It won't be used in queries involving only the second and third field.
The other answers are missing an important point. When you declare a foreign key in MySQL, it creates an index on the column. This is not (necessarily) true in other databases, but it is true in MySQL.
So, the declaration automatically creates these indexes:
CREATE INDEX pindex on sales(p_id);
CREATE INDEX eindex on sales(e_id);
CREATE INDEX cindex on sales(c_id);
(These indexes are very handy for dealing with cascading constraints and maintaining the data integrity based on the foreign key.)
If you happen to have also declared an index on sales(p_id, e_id, c_id, amount), then the first of the indexes is not needed -- it is a subset of this index. However, the other two are needed.
Is this index needed? As mentioned in other questions, that depends on the queries you want to use the index for. I recommend starting with the documentation on this subject to understand how the indexes get used.

How does hash(row id + year) partition work?

I'm new to partitions. I didn't know they existed, but became aware of them when I tried to make our new 'url_hash' column unique in a table in our database and got the error message:
A UNIQUE INDEX must include all columns in the table's partitioning function
This is a database created by another person whom I don't know and who is no longer involved in the project.
I have tried to read the MySQL documentation and forum posts about partitioning: what it is and how it works. I understand the purpose, to "divide" a table into several "parts" so that relevant data becomes faster to retrieve. A common example is to partition into year intervals, but most examples show a manual method where you explicitly list boundaries for specific years, for example:
PARTITION BY RANGE ( YEAR(separated) ) (
PARTITION p0 VALUES LESS THAN (1991),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (2001),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
But in our table, the partitions are created this way:
PARTITION BY HASH ( `feeditemsID` + YEAR(`feeddate`))
PARTITIONS 3;
What does that mean? How does our partition work?
feeditemsID is the unique ID for every row in our table.
When you use hash partitioning, the partition that contains each record is determined by calculating a hash code from the expression feeditemsID + YEAR(feeddate), and then taking the modulus of this code by the number of partitions. So if the hash code for a row is 123, it calculates 123 % 3, which is 0, so the record goes into partition 0.
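A worked example with made-up values: for a row with feeditemsID = 100 and feeddate = '2023-05-01', the expression evaluates to 100 + 2023 = 2123, and MOD(2123, 3) = 2, so the row goes into partition p2. You can check the arithmetic directly:
SELECT MOD(100 + YEAR('2023-05-01'), 3);  -- returns 2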
This is explained in the MySQL documentation.
As stated there,
Note
If a table to be partitioned has a UNIQUE key, then any columns supplied as arguments to the HASH user function or to the KEY's column_list must be part of that key.
In your case, the table's primary key needs to be:
PRIMARY KEY (feeditemsID, feeddate)
Assuming feeditemsID is already unique (presumably it's an auto-increment column), adding feeddate to the primary key is redundant as far as keeping the data unique is concerned, but it's needed to satisfy the partitioning requirement. Putting feeditemsID first in the composite key will allow it to be used by itself to optimize table lookups.
This requirement is probably because each partition has its own index. When inserting/updating a row and checking for uniqueness, it only checks the index of the partition where that row will be stored. So when it finds the partition using the hash function, it needs to be sure that this partition will uniquely contain the indexed columns.
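This is also why the original attempt to make url_hash unique on its own failed. A sketch of how the rule plays out, assuming the table is named feeditems:
ALTER TABLE feeditems ADD UNIQUE KEY (url_hash);                         -- rejected: url_hash alone does not include the partitioning columns
ALTER TABLE feeditems ADD UNIQUE KEY (url_hash, feeditemsID, feeddate);  -- accepted, but no longer guarantees url_hash is unique by itself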
For more information see
Partitioning Keys, Primary Keys, and Unique Keys

Best way to index a table with a unique multi-column?

I am creating a table which will store around 100 million rows in MySQL 5.6 using the InnoDB storage engine. This table will have a foreign key that links to another table with around 5 million rows.
Current Table Structure:
`pid`: [Foreign key from another table]
`price`: [decimal(9,2)]
`date`: [date field]
and every pid should have only one record for a date
What is the best way to create indexes on this table?
Option #1: Create Primary index on two fields pid and date
Option #2: Add another column id with AUTO_INCREMENT and primary index and create a unique index on column pid and date
Or any other option?
The only select query I will be using on this table is:
SELECT pid,price,date FROM table WHERE pid = 123
Based on what you said (100M; the only query is...; InnoDB; etc):
PRIMARY KEY(pid, date);
and no other indexes
Some notes:
Since it is InnoDB, all the rest of the fields are "clustered" with the PK, so a lookup by pid acts as if price were part of the PK. Also, WHERE pid=123 ORDER BY date would be very efficient.
No need for INDEX(pid, date, price)
Adding an AUTO_INCREMENT gains nothing (except a hint of ordering). If you needed ordering, then an index starting with date might be best.
Extra indexes slow down inserts. Especially UNIQUE ones.
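A sketch of Option #1 under those assumptions (the column definitions follow the question where given; the table name and the pid type are made up):
CREATE TABLE price_history (
  pid    INT UNSIGNED NOT NULL,  -- foreign key to the ~5 million row parent table
  price  DECIMAL(9,2) NOT NULL,
  `date` DATE NOT NULL,
  PRIMARY KEY (pid, date)        -- enforces one price per pid per date and clusters rows by pid
) ENGINE=InnoDB;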
Either method is fine. I prefer having synthetic primary keys (that is, the auto-incremented version with the additional unique index). I find that this is useful for several reasons:
You can have a foreign key relationship to the table.
You have an indicator of the order of insertion.
You can change requirements, so if some pids allow two values per day or only one per week, then the table can support them.
That said, there is additional overhead for such a column. This overhead adds space and a small amount of time when you are accessing the data. You have a pretty large table, so you might want to avoid this additional effort.
I would try an index that attempts to cover the query, in the hope that MySQL has to access only the index in order to get the result set.
ALTER TABLE `table` ADD INDEX `pid_date_price` (`pid` , `date`, `price`);
or
ALTER TABLE `table` ADD INDEX `pid_price_date` (`pid` , `price`, `date`);
Choose the first one if you think you may need to select applying conditions over pid and date in the future, or the second one if you think the conditions will be most probable over pid and price.
This way, the index has all the data the query needs (pid, price and date) and it is indexed on the right column (pid).
By the way, always use EXPLAIN to see if the query planner will really use the whole index (take a look at the key and key_len outputs).
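For example, a sketch against the table above:
EXPLAIN SELECT pid, price, date FROM `table` WHERE pid = 123;
-- check key and key_len to confirm the composite index is chosen, and look for "Using index" in Extra to confirm it covers the query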