Optimize index for multi field ordering with mixed direction - mysql

I'm trying to optimize a MySQL table for faster reads. The ratio of read to writes is about 100:1 so I'm disposed to sacrifice write performances with multi indexes.
Relevant fields for my table are the following and it contains about 200000 records
CREATE TABLE `publications` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
-- omitted fields
`publicaton_date` date NOT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
`position` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
-- these are just attempts, they are not production index
KEY `publication_date` (`publication_date`),
KEY `publication_date_2` (`publication_date`,`position`,`active`)
) ENGINE=MyISAM;`enter code here`
Since I'm using Ruby on Rails to access data in this table I've defined a default scope for this table which is
default_scope where(:active => true).order('publication_date DESC, position ASC')
i.e. every query in this table by default will be completed automatically with the following SQL fragment, so you can assume that almost all queries will have these conditions
WHERE `publications`.`active` = 1 ORDER BY publication_date DESC, position
So I'm mainly interested in optimize this kind of query, plus queries with publication_date in the WHERE condition.
I tried with the following indexes in various combinations (also with multiple of them at the same time)
`publication_date`
`publication_date`,`position`
`publication_date`,`position`,`active`
However a simple query as this one still doesn't use the index properly and uses filesort
SELECT `publications`.* FROM `publications`
WHERE `publications`.`active` = 1
AND (id NOT IN (35217,35216,35215,35218))
ORDER BY publication_date DESC, position
LIMIT 8 OFFSET 0
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: publications
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 34903
Extra: Using where; Using filesort
1 row in set (0.00 sec)
Some considerations on my issue:
According to MySQL documentation a composite index can't be used for ordering when you mix ASC and DESC in ORDER BY clause
active is a boolean flag, so put it in a standalone index make no sense (it has just 2 possible values) but it's always used in WHERE clause so it should appear somewhere in an index to avoid Using where in Extra
position is an integer with few possible values and it's always used scoped to publication_date so I think it's useless to have it in a standalone index
Lot of queries uses publication_date in the where part so it can be useful to have it also in a standalone index, even if redundant and it's the first column of the composite index.

One problem is that your are mixing sort orders in the order by clause. You could invert your position (inverted_position = max_position - position) so that you may also invert the sort order on that column.
You can then create a compound index on [publication_date, inverted_position] and change your order by clause to publication_date DESC, inverted_position DESC.
The active column should most likely not be part of the index as it has a very low selectivity.

Related

MySQL keys for GROUP BY

Table structure, indexes, and query are below. On a table with more than million records, this takes well over a minute to run. I guess mainly because of GROUP BY and / or DAY().
I tried creating a composite index with the draft column first, because that would allow faster querying of WHERE draft = 0. Unfortunately it doesn't seem to make a difference and I haven't been able to find much information at all on how to use indexes to optimise this kind of query with a GROUP BY.
How can I speed this up?
CREATE TABLE `table` (
`id` bigint(20) UNSIGNED NOT NULL,
`user_id` int(11) NOT NULL,
`coords` point NOT NULL,
`date` datetime NOT NULL,
`draft` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `table`
ADD PRIMARY KEY (`id`),
ADD KEY `user_id` (`user_id`),
ADD KEY `draft` (`draft`),
ADD KEY `date` (`date`),
ADD KEY `user_id_2` (`draft`,`user_id`,`date`) USING BTREE;
SELECT id, user_id, date, coords
FROM table
WHERE draft = 0
GROUP BY (user_id, DAY(date))
ORDER BY date ASC
EXPLAIN
select type: simple
table: table
type: ref
possible_keys: draft, user_id_2
key: draft
key_len: 1,
ref: const
rows: 3427592
extra: Using where; Using temporary; Using filesort
First of all, I don't see any reason why you are even using GROUP BY, given that your select clause does not include any aggregates. Perhaps this is what you are intending to run:
SELECT id, user_id, date, coords
FROM yourTable
WHERE draft = 0
ORDER BY date;
This query might benefit from the following index:
CREATE INDEX idx ON yourTable (draft, date);
If used, this index would let MySQL discard some records via the WHERE clause, and also would enable efficient sorting in the ORDER BY date clause. You could also go all out, and cover the select clause:
CREATE INDEX idx ON yourTable (draft, date, user_id);
This would cover the entire select clause, meaning that the index by itself would contain all information needed to complete the query (this assumes you are running InnoDB, in which case MySQL would automatically include id as the fourth column of the index).

Very slow when order by id, but fast when order by timestamp, id

I encountered a very puzzling optimization case. I'm no SQL expert but still this case seems to defy my understanding of clustered key principles.
I have the below table schema:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`chargeQuote` tinyint(1) NOT NULL,
`features` int(11) NOT NULL,
`sequenceIndex` int(11) NOT NULL,
`createdAt` bigint(20) NOT NULL,
`previousSeqId` bigint(20) NOT NULL,
`refOrderId` bigint(20) NOT NULL,
`refSeqId` bigint(20) NOT NULL,
`seqId` bigint(20) NOT NULL,
`updatedAt` bigint(20) NOT NULL,
`userId` bigint(20) NOT NULL,
`version` bigint(20) NOT NULL,
`amount` decimal(36,18) NOT NULL,
`fee` decimal(36,18) NOT NULL,
`filledAmount` decimal(36,18) NOT NULL,
`makerFeeRate` decimal(36,18) NOT NULL,
`price` decimal(36,18) NOT NULL,
`takerFeeRate` decimal(36,18) NOT NULL,
`triggerOn` decimal(36,18) NOT NULL,
`source` varchar(32) NOT NULL,
`status` varchar(50) NOT NULL,
`symbol` varchar(32) NOT NULL,
`type` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_STATUS` (`status`) USING BTREE,
KEY `IDX_USERID_SYMBOL_STATUS_TYPE` (`userId`,`symbol`,`status`,`type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=7937243 DEFAULT CHARSET=utf8mb4;
This is a big table. 100 million rows. It's already sharded by createdAt, so 100 million = 1 month worth of orders.
I have a below slow query. The query is pretty straight-forward:
select id,chargeQuote,features,sequenceIndex,createdAt,previousSeqId,refOrderId,refSeqId,seqId,updatedAt,userId,version,amount,fee,filledAmount,makerFeeRate,price,takerFeeRate,triggerOn,source,`status`,symbol,type
from orders where 1=1
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and symbol in ( 'BTC_USDT' )
and status in ( 'FULLY_FILLED' , 'PARTIAL_CANCELLED' , 'FULLY_CANCELLED' )
and type in ( 'BUY_LIMIT' , 'BUY_MARKET' , 'SELL_LIMIT' , 'SELL_MARKET' )
order by id desc limit 0,20;
This query takes 24 seconds. The number of rows that satisfy userId=100000 is very little, around 100. And the number of rows that satisfy this entire where clause is 0.
But when I did a small tweak, that is, I changed the order by clause:
order by id desc limit 0,20; -- before
order by createdAt desc, id desc limit 0,20; -- after
It became very fast, 0.03 seconds.
I can see it made a big difference in MySQL engine because explain gives that, before the change it was using key: PRIMARY and after it finally uses key: IDX_USERID_SYMBOL_STATUS_TYPE, as expected, and I guess therefore very fast. Here's the explain plan:
select_type table partitions type possible_keys key key_len ref rows filtered Extra
SIMPLE orders index IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE PRIMARY 8 20360 0.02 Using where
SIMPLE orders range IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE IDX_USERID_SYMBOL_STATUS_TYPE 542 26220 11.11 Using index condition; Using where; Using filesort
So what gives? Actually I was very surprised by the fact that it was not naturally sorted by id (which is the PRIMARY KEY). Isn't this the clustered key in MySQL? And why it chose to not to use index when it's sorted by id?
I'm very puzzled because a more demanding query (sort by 2 conditions) is super fast but a more lenient query is slow.
And no, I tried ANALYZE TABLE orders; and nothing happened.
MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:
Read all qualifying rows, sort them, and pick the n top rows.
Read the rows in sorted order and stop when n qualifying rows have been found.
In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straight-forward, especially for columns that are not indexed, or for columns where values are correlated. In your case, the MySQL optimizer evidently thinks that the second strategy is the best. Inn other words, it does not see that the WHERE clause will not be satisfied by any rows, but thinks that 2% of the rows will satisfy the WHERE clause, and that it will be able to find 20 rows by only scanning part of the table backwards in PRIMARY key order.
How the filtering effect of a WHERE clause is estimated varies quite a bit between 5.6, 5.7, and 8.0. If you are using MySQL 8.0, you can try to create histograms for the columns involved to see if that can improve the estimation. If not, I think your only option is to use a FORCE INDEX hint to make the optimizer choose the desired index.
For your fast query, the second strategy is not an option since there is no index on createdAt that can be used to avoid sorting.
Update:
Reading Rick's answer, I realized that an index on only userId should speed up your ORDER BY id query. In such an index, the entries for a given userId will be sorted on primary key. Hence, using this index will both make it possible to only access the rows of the requested userId, and access the rows in the requested sort order (by id).
The main filters works well with cardinality estimator. When order by uses limit, this is automatically another filter, as data needs to be filter further. This may redirect cardinality estimator to prone to inaccurate estimation which eventually result a poor plan to be selected. In order to prove this, run the 24sec query without the limit clause. It should also respond at 0.3 as your trick.
In order to solve this, if you have a standard very good performance just with the main filters, select this first, and filter at later 2nd time where the result set will be significantly smaller than the whole table. Use something like:
select * from (select ...main select statement)
order by x limit by y
...or...
insert into temp select ...main select statement
select from temp order by x limit by y
Given
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and ... -- I am not making use of the other items
order by createdAt DESC, id desc -- I am assuming this change
limit 0,20;
I would try
INDEX(userId, createdAt, id) -- in this order
userId is tested by = is first, thereby narrows down the part of the index to look at.
Leave out the columns tested by IN. If there are multiple values in a IN, we can't make use of step 4.
createdAt filters further by range.
createdAt and id are compared in the same direction (DESC). (Yes, I know 8.0 has an improvement, but I don't think you wanted (ASC, DESC)).

How to Optimize MYSQL in Extra :-Using where; Using temporary; Using filesort

What is the proper indexing for this query.
I tried given different combinations of indexes for this query but it is still using from using tempory , using filesort etc.
Total table data - 7,60,346
product= 'Dresses' - Total rows = 122 554
CREATE TABLE IF NOT EXISTS `product_data` (
`table_id` int(11) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`price` int(11) NOT NULL,
`store` varchar(255) NOT NULL,
`brand` varchar(255) DEFAULT NULL,
`product` varchar(255) NOT NULL,
`model` varchar(255) NOT NULL,
`size` varchar(50) NOT NULL,
`discount` varchar(255) NOT NULL,
`gender_id` int(11) NOT NULL,
`availability` int(11) NOT NULL,
PRIMARY KEY (`table_id`),
UNIQUE KEY `table_id` (`table_id`),
KEY `id` (`id`),
KEY `discount` (`discount`),
KEY `step_one` (`product`,`availability`),
KEY `step_two` (`product`,`availability`,`brand`,`store`),
KEY `step_three` (`product`,`availability`,`brand`,`store`,`id`),
KEY `step_four` (`brand`,`store`),
KEY `step_five` (`brand`,`store`,`id`)
) ENGINE=InnoDB ;
Query :
SELECT id ,store,brand FROM `product_data` WHERE product='dresses' and
availability='1' group by brand,store order by store limit 10;
excu..time :- (10 total, Query took 1.0941 sec)
EXPLAIN PLAN :
possible_keys :- step_one, step_two, step_three, step_four, step_five
key :- step_two
ref :- const,const
rows :- 229438
Extra :-Using where; Using temporary; Using filesort
I tried these indexes
Key step_one (product,availability)
Key step_two (product,availability,brand,store)
Key step_three (product,availability,brand,store,id)
Key step_four (brand,store)
Key step_five (brand,store,id)
The real problem is not the index, but the mismatch between GROUP BY and ORDER BY preventing taking advantage of LIMIT.
This
INDEX(product, availability, store, brand, id)
will be "covering" and in the right order. But note that I have swapped store and brand...
Change the query to
SELECT id ,store,brand
FROM `product_data`
WHERE product='dresses'
and availability='1'
GROUP BY store, brand -- change
ORDER BY store, brand -- change
limit 10;
That changes the GROUP BY to start with store, to reflect the ORDER BY ordering -- this avoid an extra sort. And it changes the ORDER BY to be identical to the GROUP BY so that the two can be combined.
Given those changes, the INDEX can now go all the way through to the LIMIT, thereby allowing the processing to look at only 10 rows, not a much larger set.
Anything less than all these changes will not be as efficient.
Further discussion:
INDEX(product, availability, -- these two can be in either order
store, brand, -- must match both `GROUP BY` and `ORDER BY`
id) -- tacked on (on the end) to make it "covering"
"Covering" means that all the columns for the SELECT are found in the INDEX, so no need to reach over into the data.
But... The whole query does not make sense because of the inclusion of id in the SELECT. If you want to find what stores have available dresses, then get rid of id. If you want to list all the available dresses, then change id to GROUP_CONCAT(id).
For the indexes, the best index is the step_two. The product field is used in where and has more variation than the availability-field.
Couple of notes about the query:
availability='1' should be availability=1 so that needless int->varchar conversion would be avoided.
"group by brand" should not be used as GROUP BY should only be used when you use aggregate functions as selected columns. What as it that you were trying to achieve with the group by?
Your group by clause doesn't really make sense without an aggregate function.
If you can re-write the query to
SELECT id ,store
FROM `product_data`
WHERE product='dresses'
and availability='1'
order by store limit 10;
Then an index on (product,availability,store) will remove all filesorts.
See SQLFiddle: http://sqlfiddle.com/#!9/60f33d/2
UPDATE:
The SQLFiddle makes your intention clear - you're using GROUP BY to simulate DISTINCT. I don't think you can get rid of the filesort and temporary table steps in your query if this is the case - but I also don't think those steps should be hugely expensive.

Optimize Indexes for Particular Query in mySQL

I have a fairly simple query that is taking about 14 seconds to complete and I would like to speed it up. I think I have the correct indexes in place, but I'm not sure...
Here is the query
SELECT *
FROM opportunities
WHERE cid = 7785
AND STATUS != 4
AND otype != 200
AND links > 0
AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;
Here is the table schema
CREATE TABLE `opportunities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`cid` int(11) NOT NULL,
`url` varchar(900) CHARACTER SET utf8 NOT NULL,
`status` tinyint(4) NOT NULL,
`links` int(11) NOT NULL,
`otype` int(11) NOT NULL,
`reserved` tinyint(4) NOT NULL,
`ontopic` varchar(3) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `cid` (`cid`,`url`),
KEY `cid1` (`cid`),
KEY `url` (`url`),
KEY `otype` (`otype`),
KEY `reserved` (`reserved`),
KEY `ontopic` (`ontopic`),
KEY `status` (`status`),
KEY `links` (`links`),
KEY `ontopic_links` (`ontopic`,`links`),
KEY `cid_status_otype_links_ontopic` (`cid`,`status`,`otype`,`links`,`ontopic`)
) ENGINE=InnoDB AUTO_INCREMENT=13022832 DEFAULT CHARSET=latin1
Here is the result of the EXPLAIN command
id: 1
select_type: Simple
table: opportunities
partitions: null
type: range
possible_keys: cid,cid1,otype,ontopic,status,links,ontopic_links,cid_status_otype_links_ontopic
key: links
keylen: 4
ref: null
rows: 1531552
filtered: 0.33
Extra: Using index condition; Using where
Thoughts / Questions
Am I reading it correctly that it is using the "links" key to do the query? Why wouldn't it use a more complete index, like the cid_status_otype_links_ontopic which covers all the conditions of my query?
Thanks in advance!
As requested
There are 30,961 results that match the query when you remove the LIMIT 0,100. Interestingly, the "count()" command returns almost instantaneously.
It's a funny thing about using inequality comparisons, that they count as range conditions.
That is, equality matches one value, but anything other than equality (!=, >, <, IN, BETWEEN).
By matching multiple values, it means that only the first column in an index used in a range condition is going to be optimized. You'd think that your index cid_status_otype_links_ontopic has all the columns mentioned in conditions of your query, but only the first two will be used. The first because you have an equality comparison for cid. The second because the next column is used in an inequality comparison, and then that's where it stops using columns from the index.*
Evidence: if you can force that index to be used, you should see the keylen field of the EXPLAIN result show only 5, which is the size of cid (4 bytes) + status (1 byte).
The MySQL optimizer apparently has predicted that it would be more beneficial to use your links index, because that allows it to access the rows in index order, which is the same as the sort order you requested with your ORDER BY.
Evidence: you don't see "Using filesort" in your EXPLAIN notes.
Is that really better than using one of the other indexes? Maybe, maybe not. The optimizer's predictions aren't always perfect.
You can use an index hint to override the optimizer's choice:
SELECT * FROM opportunities USE INDEX (cid_status_otype_links_ontopic) WHERE ...
Try that out, do the EXPLAIN of that query and compare it to your other EXPLAIN. Then execute both queries and see which is reliably faster.
(* Actually, I have to add a footnote about the index column usage. MySQL 5.6 and later can do a little bit better than just the two columns, when you see the note "Using Index Condition" in the EXPLAIN. But it's not quite the same. You can read more about that here: https://dev.mysql.com/doc/refman/5.6/en/index-condition-pushdown-optimization.html)
What you have must plow through all of the rows, using your 5-column index, then sort the results and deliver 100 rows.
The only index likely to be useful is INDEX(cid, links). This is because cid is the only column being tested with =, then having links might be useful for the ORDER BY and LIMIT. There is still the risk that the != tests will require filtering a lot of rows.
Are status and otype multi-valued? If either has only 2 values, then turning the != into = and adding it to the index would be beneficial.
Do you really need all the columns (SELECT *)? If not, and if you don't need any big columns (url), then you could go with a 'covering' index.
More on writing indexes .

MySQL seems to ignore index on datetime column

I am having a pretty severe performance problem when comparing or manipulating two datetime columns in a MySQL table. I have a table tbl that's got around 10 million rows. It looks something like this:
`id` int(11) NOT NULL AUTO_INCREMENT,
`last_interaction` datetime NOT NULL DEFAULT '1970-01-01 00:00:00',
`last_maintenence` datetime NOT NULL DEFAULT '1970-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `last_maintenence_idx` (`last_maintenence`),
KEY `last_interaction_idx` (`last_interaction`),
ENGINE=InnoDB AUTO_INCREMENT=12389814 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
When I run this query, it takes about a minute to execute:
SELECT id FROM tbl
WHERE last_maintenence < last_interaction
ORDER BY last_maintenence ASC LIMIT 200;
Describing the query renders this:
id: 1
select_type: SIMPLE
table: tbl
type: index
possible_keys: NULL
key: last_maintenence_idx
key_len: 5
ref: NULL
rows: 200
Extra: Using where
It looks like MySQL isn't finding/using the index on last_interaction. Does anybody know why that might be and how to trigger it?
That is because mysql normally uses only one index per table. Unfortunately your query though it looks a simple one. The solution would be to create a composite index involving both columns.
The query you provide needs a full table scan.
Each and every record has to be checked to match your criterion last_maintenence < last_interaction.
So, no index is used when searching.
The index last_maintenence_idx is used only because you want the results ordered by it.
Thats easy: Mistake No 1.
MySQL can only (mostly) use one index per query and the optimizer looks which one is better for using.
So, you must create a composite index with both fields.