Mysql select by auto increment primary key while partitioned by date - mysql

I was wondering how would mysql act if i partition a table by date and then have some select or update queries by primary key ?
is it going to search all partitions or query optimizer knows in which partition the row is saved ?
What about other unique and not-unique indexed columns ?

Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partitions. No harm is done (except performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows thing down due to things such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)

Related

Clustered index on integer column surprisingly slow

I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. explain shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. explain shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK and where pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speedup SELECT COUNT(*) FROM tbl.
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicated this discussion. Especially if you have 2 parts to the WHERE, one with Spatial or Fulltext, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.

MySQL: How avoid all partitions scan (year-based) when doing ID lookup?

In case I have a table partitioned by year; how do I avoid the scanning of all partitions when I have to lookup a row by its ID and can't use partition pruning in the lookup query?
CREATE TABLE part_table (
id bigint NOT NULL auto_increment,
moment datetime NOT NULL,
KEY (id),
KEY (moment)
)-- partitioning information (in years)
PARTITION BY RANGE( YEAR(moment) ) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pFuture VALUES LESS THAN (maxvalue) )
;
With e.g. lookup query:
SELECT * FROM part_table WHERE ID = <nr>
Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
Indexes are partitioned. Each partition is essentially a "table". It has a `BTree for the data and PK, and a BTree for each secondary index.
So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
It is inefficient to pre-create future partitions (other than one).
Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:
Dropping 'old' data is much faster than DELETEing it.
`WHERE something-else AND moment between ..
Some cases
For this discussion, I assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc).
WHERE id BETWEEN 111 and 222
Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.
WHERE id BETWEEN 111 and 222
AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`
This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to lookup by id.
Now let be discuss two flavors if an index starting with id (and assuming either of the WHERE clauses, above:
PRIMARY KEY(id, moment)
The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.
INDEX(id)
is not "clustered". It is a secondary index. Secondary indexes take two steps. (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.
DROP PARTITION p2020
id much faster and less invasive than `DELETE .. WHERE moment < '2021-01-01'.
More
It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.
Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries. And what datatypes they are.
In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):
If the WHERE references only one column, the PARTITIONing is probably never beneficial.
If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.

Combined Index performance with optional where clause

I have a table with the following columns:
id-> PK
customer_id-> index
store_id-> index
order_date-> index
last_modified-> index
other_columns...
other_columns...
I have three single column index. I also have a customer_id_store_id index which is a foreign key constraint referencing other tables.
id, customer_id, store_id are char(36) which is UUID. order_date is datetime and last_modifed is UNIX timestamp.
I want to gain some performance by removing all index and adding one with (customer_id, store_id, order_date). Most queries will have these fields in the where clause. But sometimes the store_id will not be needed.
What is the best approach? to add "store_id IS NOT NULL" in the where clause or creating the index this way (customer_id, order_date, store_id).
I also frequently need to query the table by last_modified field (where clause includes customer_id=, store_id=, last_modified>).
As I only have a single column index on it and there are hundreds of customers who is insert/updating the tables, more often the index scans rows more than necessary. Is it better to create another index (customer_id, store_id, last_modified) or leave it as it is? Or add this column to the previous index making it four columns composite index. But then again the order_date is irrelevant here and omitting it might result the index not being used as intended.
The query works fast on customers that don't have many rows possibly using the customer_id index there. But for customers with large amount of data, this isn't optimal. More often I need only few days of data.
Can anyone please advise what's the best index in this scenario.
It is true that lots of single column indexes on a MySQL table are generally considered harmful.
A query with
WHERE customer_id=constant AND store_id=constant AND last_modified>=constant
will be accelerated by an index on (customer_id, store_id, last_modified). Why? The MySQL query planner can random-access the index to the first item it needs to retrieve, then scan the index sequentially. That same index works for
WHERE customer_id=constant AND store_id=constant
AND last_modified>=constant
AND last_modified< constant + INTERVAL 1 DAY
BUT, that index will not be useful for a query with just
WHERE store_id=constant AND last_modified>constant
or
WHERE customer_id=constant AND store_id IS NOT NULL AND last_modified>=constant
For the first of those query patterns you need (store_id, last_modified) to achieve the ability to sequentially scan the index.
The second of those query patterns requires two different range searches. One is something IS NOT NULL. That's a range search because it has to romp through all the non-null values in the column. The second range search is last_modified>=constant. That's a range search, because it starts with the first value of last_modified that meets the given criterion, and scans to the end of the index.
MySQL indexes are B-trees. That means, essentially, that they're sorted into a particular single order. So, an index is best for accelerating queries that require just one range search. So, the second query pattern is inherently hard to satisfy with an index.
A table can have multiple compound indexes designed to satisfy multiple different query patterns. That's usually the strategy to large tables work well in practical applications. Each index imposes a little bit of performance penalty on updates and inserts. Indexes also take storage space. But storage is very cheap these days.
If you want to use a compound index to search on multiple criteria, these things must be true:
all but one of the criteria must be equality criteria like store_id = constant.
one criterion can be a range-scan criterion like last_modified >= constant or something IS NOT NULL.
the columns in the index must be ordered so that the columns involved in equality criteria all appear, then the the column involved in the range-scan criterion.
you may mention other columns after the range scan criterion. But they make up part of a covering index strategy (beyond the scope of this post).
http://use-the-index-luke.com/ is a good basic intro to the black art of indexing.

Index on mysql partitioned tables

I have a table with two partitions. Partitions are pactive = 1 and pinactive = 0. I understand that two partitions does not make so much of a gain, but I have used it to truncate and load in one partition and plain inserts in another partition.
The problem comes when I create indexes.
Query goes this way
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
Created index -
create index idx_try on customformattributes(partitionflag,companyid,activityname,completiondate,attributename,isclosed)
there are around 200000 records that will be retreived from the above query. But the query along with the mentioned index takes 30+ seconds. What is the reason for such a long time? Also, if remove the partitionflag from the mentioned index, the index is not even used.
And is the understanding that,
Even with the partitions available, the optimizer needs to have the required partition mentioned in the index definition, so that it only hits the required partition ---- Correct?
Any ideas on understanding this would be very helpful
You can optimize your index by reordering the columns in it. Usually the columns in the index are ordered by its cardinality (starting from the highest and go down to the lowest). Cardinality is the uniqueness of data in the given column. So in your case I suppose there are many variations of companyid in customformattributes table while partitionflag will have cardinality of 2 (if all the options for this column are 1 and 0).
Your query will first filter all the rows with partitionflag=0, then it will filter by company id and so on.
When you remove partitionflag from the index the query did not used the index because may be the optimizer decides that it will be faster to make full table scan instead of using the index (in most of the cases the optimizer is right)
For the given query:
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
the following index may be would be better (but of course :
create index idx_try on customformattributes(companyid,activityname, completiondate,attributename, partitionflag, isclosed)
For the query to use index the following rule must be met - the left most column in the index should be present in the where clause ... and depending on the mysql version you are using additional query requirements may be needed. For example if you are using old version of mysql - you may need to order the columns in the where clause in the same order they are listed in the index. In the last versions of mysql the query optimizer is responsible for ordering the columns in the where clause in the correct order.
Your SELECT query took 30+ seconds because it returns 200k rows and because the index might not be the optimal for the given query.
For the second question about the partitioning: the common rule is that the column you are partitioning by must be part of all the UNIQUE keys in a table (Primary key is also unique key by definition so the column should be added to the PK also). If table structure and logic allows you to add the partitioning column to all the UNIQUE indexes in the table then you add it and partition the table.
When the partitioning is made correctly you can take the advantage of partitioning pruning - this is when SELECT query searches the data only in the partitions where given data is stored (otherwise it looks in all partitions)
You can read more about partitioning here:
https://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
The query is slow simply because disks are slow.
Cardinality is not important when designing an index.
The optimal index for that query is
INDEX(companyid, activityname, partitionflag) -- in any order
It is "covering" since it includes all the columns mentioned anywhere in the SELECT. This is indicated by "Using index" in the EXPLAIN.
Leaving off the other 3 columns makes the query faster because it will have to read less off the disk.
If you make any changes to the query (add columns, change from '=' to '>', add ORDER BY, etc), then the index may no longer be optimal.
"Also, if remove the partitionflag from the mentioned index, the index is not even used." -- That is because it was no longer "covering".
Keep in mind that there are two ways an index may be used -- "covering" versus being a way to look up the data. When you don't have a "covering" index, the optimizer chooses between using the index and bouncing between the index and the data versus simply ignoring the index and scanning the table.

Creating an index on a timestamp to optimize query

I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep an list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
...and If I'm not mistaken, the advantage of this would be to avoid the overhead of creating the timestamp index on each insertion -- in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Despite our knowledge that our insertions will respect the fact of have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with indexes on userId will reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
, ts INT NOT NULL
, oldPK
, ... other columns
, PRIMARY KEY(ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your Inserts will be a bit slower. Also, If you have other indices on the table, they will be using a bit more space (as they will include the 4-bytes wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.