MySQL keys for GROUP BY - mysql

Table structure, indexes, and query are below. On a table with more than a million records, this takes well over a minute to run. I guess that is mainly because of GROUP BY and/or DAY().
I tried creating a composite index with the draft column first, because that would allow faster querying of WHERE draft = 0. Unfortunately it doesn't seem to make a difference and I haven't been able to find much information at all on how to use indexes to optimise this kind of query with a GROUP BY.
How can I speed this up?
CREATE TABLE `table` (
`id` bigint(20) UNSIGNED NOT NULL,
`user_id` int(11) NOT NULL,
`coords` point NOT NULL,
`date` datetime NOT NULL,
`draft` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `table`
ADD PRIMARY KEY (`id`),
ADD KEY `user_id` (`user_id`),
ADD KEY `draft` (`draft`),
ADD KEY `date` (`date`),
ADD KEY `user_id_2` (`draft`,`user_id`,`date`) USING BTREE;
SELECT id, user_id, date, coords
FROM `table`
WHERE draft = 0
GROUP BY user_id, DAY(date)
ORDER BY date ASC;
EXPLAIN
select_type: SIMPLE
table: table
type: ref
possible_keys: draft, user_id_2
key: draft
key_len: 1
ref: const
rows: 3427592
extra: Using where; Using temporary; Using filesort

First of all, I don't see any reason why you are even using GROUP BY, given that your select clause does not include any aggregates. Perhaps this is what you are intending to run:
SELECT id, user_id, date, coords
FROM yourTable
WHERE draft = 0
ORDER BY date;
This query might benefit from the following index:
CREATE INDEX idx ON yourTable (draft, date);
If used, this index would let MySQL discard some records via the WHERE clause, and also would enable efficient sorting in the ORDER BY date clause. You could also go all out, and cover the select clause:
CREATE INDEX idx ON yourTable (draft, date, user_id);
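This would cover id, user_id, and date, meaning that the index by itself would contain those columns (this assumes you are running InnoDB, in which case MySQL would automatically include id as the fourth column of the index). Note that coords is not part of this index, so MySQL would still have to visit the table rows to fetch it; the query is only fully covered if coords is dropped from the select clause.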
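A minimal sketch of how one might check whether MySQL actually picks the suggested index, assuming the table is named yourTable as in the answer above. The index name idx_draft_date_user is made up for illustration.

```sql
-- Hypothetical index name; the column order follows the answer above.
CREATE INDEX idx_draft_date_user ON yourTable (draft, date, user_id);

-- If the index serves both the filter and the ordering, the plan should
-- list it in the `key` column and should no longer report "Using filesort".
-- Because coords is not in the index, "Using index" is not expected here.
EXPLAIN
SELECT id, user_id, date, coords
FROM yourTable
WHERE draft = 0
ORDER BY date;
```

If draft = 0 matches most of the table, the optimizer may still prefer a different plan, so the EXPLAIN output is worth checking rather than assuming.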

Related

Optimize MySQL query using range and group by on large table

My table structure:
CREATE TABLE `jobs_view_stats` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`job_id` int(11) NOT NULL,
`created_at` datetime NOT NULL,
`account_id` int(11) DEFAULT NULL,
`country` varchar(2) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDX_D05BC6799FDS15210` (`job_id`),
KEY `FK_YTGBC67994591257` (`account_id`),
KEY `jobs_view_stats_created_at_id_index` (`created_at`,`id`),
CONSTRAINT `FK_YTGBC67994591257` FOREIGN KEY (`account_id`) REFERENCES `accounts` (`id`) ON DELETE SET NULL,
CONSTRAINT `job_views_jobs_id_fk` FOREIGN KEY (`job_id`) REFERENCES `jobs` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=79976587 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='New jobs views system'
This is the query:
SELECT COUNT(id) as view, job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
GROUP BY jobs_view_stats.job_id
Execution plan:
id: 1
select_type: SIMPLE
table: jobs_view_stats
partitions: null
type: range
possible_keys: IDX_D05BC6799FDS15210, jobs_view_stats_created_at_id_index
key: jobs_view_stats_created_at_id_index
key_len: 5
ref: null
rows: 1584610
filtered: 100
Extra: Using index condition; Using MRR; Using temporary; Using filesort
This query takes 4 minutes to complete. I want to reduce that as much as possible.
In your execution plan, the WHERE clause matches an estimated 1,584,610 rows, which are then grouped using a temp table and a filesort (slow).
jobs_view_stats_created_at_id_index also contains id, which makes the key cardinality excessively large; you could try adding job_id to the key instead, as that is what you are grouping by.
I think the main issue is that your WHERE clause matches over 1.5 million rows, all of which have to be loaded into a temp table and then re-read in full to be grouped.
You need to bite off smaller chunks, I think.
I'm going to assume you are using a programming language to invoke the DB calls (like PHP); if so, you could try
SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
Then, when you have the job_id list, run smaller queries, looping over the results from the first:
SELECT count(*) FROM jobs_view_stats where job_id = *theid*
or, if there are many different job_ids, batch them:
SELECT count(*) FROM jobs_view_stats where job_id IN (id1, id2, id3, ...)
For a pure MySQL solution, I would first collect all the job_ids into a temporary MEMORY table:
CREATE TEMPORARY TABLE `temptable`
(
job_id INT PRIMARY KEY
) ENGINE=MEMORY;
INSERT INTO `temptable` SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59';
Then
SELECT count(*) as view, job_id FROM jobs_view_stats where job_id IN (SELECT job_id FROM `temptable`) GROUP BY job_id;
This is all totally untested, so there might be typos.
Change COUNT(id) to COUNT(*) since it does not need to check id for being NOT NULL
Add
INDEX(job_id, created_at)
INDEX(created_at, job_id)
Either of those will be "covering". (I don't know which is better; the Optimizer will know.)
Drop the index (job_id), as being redundant.
Unless you specifically need (created_at, id), drop it.
For even more performance, build and maintain a Summary Table. Then perform the query against it. Such a summary table might have PRIMARY KEY(job_id, created_date) and daily counts for each day for each job.
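The suggestions above can be sketched roughly as follows. The index names and the summary table name jobs_view_daily are hypothetical, and the nightly refresh is only one assumed way of maintaining the summary.

```sql
-- Add the composite indexes first, so the foreign key on job_id still
-- has an index to rely on when the old single-column index is dropped.
ALTER TABLE jobs_view_stats
  ADD INDEX idx_job_created (job_id, created_at),
  ADD INDEX idx_created_job (created_at, job_id);

ALTER TABLE jobs_view_stats
  DROP INDEX IDX_D05BC6799FDS15210,
  DROP INDEX jobs_view_stats_created_at_id_index;

-- Hypothetical summary table: one row per job per day.
CREATE TABLE jobs_view_daily (
  job_id       INT NOT NULL,
  created_date DATE NOT NULL,
  views        INT UNSIGNED NOT NULL,
  PRIMARY KEY (job_id, created_date)
) ENGINE=InnoDB;

-- Refresh the counts for a single day (assumed to run once per day).
INSERT INTO jobs_view_daily (job_id, created_date, views)
SELECT job_id, DATE(created_at), COUNT(*)
FROM jobs_view_stats
WHERE created_at >= '2022-11-30' AND created_at < '2022-12-01'
GROUP BY job_id, DATE(created_at)
ON DUPLICATE KEY UPDATE views = VALUES(views);

-- The monthly report then reads the much smaller summary table.
SELECT job_id, SUM(views) AS view
FROM jobs_view_daily
WHERE created_date BETWEEN '2022-11-01' AND '2022-11-30'
GROUP BY job_id;
```

After adding both composite indexes, EXPLAIN on the original query shows which one the optimizer prefers; the other can then be dropped.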

Very slow when order by id, but fast when order by timestamp, id

I encountered a very puzzling optimization case. I'm no SQL expert but still this case seems to defy my understanding of clustered key principles.
I have the below table schema:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`chargeQuote` tinyint(1) NOT NULL,
`features` int(11) NOT NULL,
`sequenceIndex` int(11) NOT NULL,
`createdAt` bigint(20) NOT NULL,
`previousSeqId` bigint(20) NOT NULL,
`refOrderId` bigint(20) NOT NULL,
`refSeqId` bigint(20) NOT NULL,
`seqId` bigint(20) NOT NULL,
`updatedAt` bigint(20) NOT NULL,
`userId` bigint(20) NOT NULL,
`version` bigint(20) NOT NULL,
`amount` decimal(36,18) NOT NULL,
`fee` decimal(36,18) NOT NULL,
`filledAmount` decimal(36,18) NOT NULL,
`makerFeeRate` decimal(36,18) NOT NULL,
`price` decimal(36,18) NOT NULL,
`takerFeeRate` decimal(36,18) NOT NULL,
`triggerOn` decimal(36,18) NOT NULL,
`source` varchar(32) NOT NULL,
`status` varchar(50) NOT NULL,
`symbol` varchar(32) NOT NULL,
`type` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_STATUS` (`status`) USING BTREE,
KEY `IDX_USERID_SYMBOL_STATUS_TYPE` (`userId`,`symbol`,`status`,`type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=7937243 DEFAULT CHARSET=utf8mb4;
This is a big table. 100 million rows. It's already sharded by createdAt, so 100 million = 1 month worth of orders.
I have the slow query below. The query is pretty straightforward:
select id,chargeQuote,features,sequenceIndex,createdAt,previousSeqId,refOrderId,refSeqId,seqId,updatedAt,userId,version,amount,fee,filledAmount,makerFeeRate,price,takerFeeRate,triggerOn,source,`status`,symbol,type
from orders where 1=1
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and symbol in ( 'BTC_USDT' )
and status in ( 'FULLY_FILLED' , 'PARTIAL_CANCELLED' , 'FULLY_CANCELLED' )
and type in ( 'BUY_LIMIT' , 'BUY_MARKET' , 'SELL_LIMIT' , 'SELL_MARKET' )
order by id desc limit 0,20;
This query takes 24 seconds. The number of rows that satisfy userId=100000 is very small, around 100. And the number of rows that satisfy the entire where clause is 0.
But when I did a small tweak, that is, I changed the order by clause:
order by id desc limit 0,20; -- before
order by createdAt desc, id desc limit 0,20; -- after
It became very fast, 0.03 seconds.
I can see it made a big difference in MySQL engine because explain gives that, before the change it was using key: PRIMARY and after it finally uses key: IDX_USERID_SYMBOL_STATUS_TYPE, as expected, and I guess therefore very fast. Here's the explain plan:
select_type table partitions type possible_keys key key_len ref rows filtered Extra
SIMPLE orders index IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE PRIMARY 8 20360 0.02 Using where
SIMPLE orders range IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE IDX_USERID_SYMBOL_STATUS_TYPE 542 26220 11.11 Using index condition; Using where; Using filesort
So what gives? Actually I was very surprised by the fact that it was not naturally sorted by id (which is the PRIMARY KEY). Isn't this the clustered key in MySQL? And why did it choose not to use the index when sorting by id?
I'm very puzzled because a more demanding query (sort by 2 conditions) is super fast but a more lenient query is slow.
And no, I tried ANALYZE TABLE orders; and nothing happened.
MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:
1. Read all qualifying rows, sort them, and pick the n top rows.
2. Read the rows in sorted order and stop when n qualifying rows have been found.
In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straightforward, especially for columns that are not indexed, or for columns where values are correlated. In your case, the MySQL optimizer evidently thinks that the second strategy is the best. In other words, it does not see that the WHERE clause will not be satisfied by any rows, but thinks that 2% of the rows will satisfy the WHERE clause, and that it will be able to find 20 rows by only scanning part of the table backwards in PRIMARY key order.
How the filtering effect of a WHERE clause is estimated varies quite a bit between 5.6, 5.7, and 8.0. If you are using MySQL 8.0, you can try to create histograms for the columns involved to see if that can improve the estimation. If not, I think your only option is to use a FORCE INDEX hint to make the optimizer choose the desired index.
For your fast query, the second strategy is not an option since there is no index on createdAt that can be used to avoid sorting.
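As a rough illustration of the two options mentioned above (histograms on MySQL 8.0, or an index hint), something like the following could be tried. The bucket count and the shortened column list are arbitrary choices for this sketch.

```sql
-- MySQL 8.0 only: build histograms so the optimizer can better estimate
-- how selective the filters on these columns are.
ANALYZE TABLE orders
  UPDATE HISTOGRAM ON `symbol`, `status`, `type`, `createdAt` WITH 64 BUCKETS;

-- Fallback: tell the optimizer which index to use.
SELECT id, userId, createdAt, symbol, `status`, `type`  -- column list shortened
FROM orders FORCE INDEX (IDX_USERID_SYMBOL_STATUS_TYPE)
WHERE userId = 100000
  AND createdAt >= 1567775174000 AND createdAt <= 1567947974000
  AND symbol IN ('BTC_USDT')
  AND `status` IN ('FULLY_FILLED', 'PARTIAL_CANCELLED', 'FULLY_CANCELLED')
  AND `type` IN ('BUY_LIMIT', 'BUY_MARKET', 'SELL_LIMIT', 'SELL_MARKET')
ORDER BY id DESC
LIMIT 20;
```

Running EXPLAIN on the original query before and after the histogram update shows whether the filtered estimate, and with it the chosen key, has changed.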
Update:
Reading Rick's answer, I realized that an index on only userId should speed up your ORDER BY id query. In such an index, the entries for a given userId will be sorted on primary key. Hence, using this index will both make it possible to only access the rows of the requested userId, and access the rows in the requested sort order (by id).
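A short sketch of that idea; the index name idx_userid is hypothetical. In InnoDB every secondary index entry carries the primary key, so this index behaves like (userId, id).

```sql
-- Hypothetical index name. Entries for one userId are stored in id order.
ALTER TABLE orders ADD INDEX idx_userid (userId);

-- Check whether the optimizer now picks it for the ORDER BY id query.
EXPLAIN
SELECT id, userId, createdAt, symbol, `status`, `type`  -- column list shortened
FROM orders
WHERE userId = 100000
  AND createdAt >= 1567775174000 AND createdAt <= 1567947974000
ORDER BY id DESC
LIMIT 20;
```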
The main filters work well with the cardinality estimator. When ORDER BY is combined with LIMIT, the limit acts as another filter, because the data has to be filtered further. This can push the cardinality estimator toward an inaccurate estimate, which eventually results in a poor plan being selected. To prove this, run the 24-second query without the LIMIT clause. It should also respond in about 0.03 seconds, like your modified query.
To solve this, if the main filters alone already perform very well, run that select first and apply the ordering and limit in a second step, where the result set is significantly smaller than the whole table. Use something like:
select * from (select ...main select statement) as t
order by x limit y
...or...
insert into temp select ...main select statement
select * from temp order by x limit y
Given
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and ... -- I am not making use of the other items
order by createdAt DESC, id desc -- I am assuming this change
limit 0,20;
I would try
INDEX(userId, createdAt, id) -- in this order
1. userId is tested by = and comes first, thereby narrowing down the part of the index to look at.
2. Leave out the columns tested by IN. If there are multiple values in an IN, we can't make use of step 4.
3. createdAt filters further by range.
4. createdAt and id are compared in the same direction (DESC). (Yes, I know 8.0 has an improvement, but I don't think you wanted (ASC, DESC)).
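Put together, the suggestion might look like the sketch below. The index name is made up, and the column list is shortened for readability.

```sql
-- Hypothetical index name; column order as recommended above.
ALTER TABLE orders ADD INDEX idx_user_created_id (userId, createdAt, id);

-- Both ORDER BY columns go in the same direction, so the index can be
-- read backwards and the scan can stop once 20 rows have qualified.
SELECT id, userId, createdAt, symbol, `status`, `type`  -- column list shortened
FROM orders
WHERE userId = 100000
  AND createdAt >= 1567775174000 AND createdAt <= 1567947974000
  AND symbol IN ('BTC_USDT')
  AND `status` IN ('FULLY_FILLED', 'PARTIAL_CANCELLED', 'FULLY_CANCELLED')
  AND `type` IN ('BUY_LIMIT', 'BUY_MARKET', 'SELL_LIMIT', 'SELL_MARKET')
ORDER BY createdAt DESC, id DESC
LIMIT 20;
```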

How to Optimize MYSQL in Extra :-Using where; Using temporary; Using filesort

What is the proper indexing for this query?
I tried different combinations of indexes for this query, but the plan still shows Using temporary, Using filesort, etc.
Total table data - 7,60,346
product= 'Dresses' - Total rows = 122 554
CREATE TABLE IF NOT EXISTS `product_data` (
`table_id` int(11) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`price` int(11) NOT NULL,
`store` varchar(255) NOT NULL,
`brand` varchar(255) DEFAULT NULL,
`product` varchar(255) NOT NULL,
`model` varchar(255) NOT NULL,
`size` varchar(50) NOT NULL,
`discount` varchar(255) NOT NULL,
`gender_id` int(11) NOT NULL,
`availability` int(11) NOT NULL,
PRIMARY KEY (`table_id`),
UNIQUE KEY `table_id` (`table_id`),
KEY `id` (`id`),
KEY `discount` (`discount`),
KEY `step_one` (`product`,`availability`),
KEY `step_two` (`product`,`availability`,`brand`,`store`),
KEY `step_three` (`product`,`availability`,`brand`,`store`,`id`),
KEY `step_four` (`brand`,`store`),
KEY `step_five` (`brand`,`store`,`id`)
) ENGINE=InnoDB ;
Query :
SELECT id ,store,brand FROM `product_data` WHERE product='dresses' and
availability='1' group by brand,store order by store limit 10;
Execution time :- (10 total, Query took 1.0941 sec)
EXPLAIN PLAN :
possible_keys :- step_one, step_two, step_three, step_four, step_five
key :- step_two
ref :- const,const
rows :- 229438
Extra :-Using where; Using temporary; Using filesort
I tried these indexes
Key step_one (product,availability)
Key step_two (product,availability,brand,store)
Key step_three (product,availability,brand,store,id)
Key step_four (brand,store)
Key step_five (brand,store,id)
The real problem is not the index, but the mismatch between GROUP BY and ORDER BY preventing taking advantage of LIMIT.
This
INDEX(product, availability, store, brand, id)
will be "covering" and in the right order. But note that I have swapped store and brand...
Change the query to
SELECT id ,store,brand
FROM `product_data`
WHERE product='dresses'
and availability='1'
GROUP BY store, brand -- change
ORDER BY store, brand -- change
limit 10;
That changes the GROUP BY to start with store, to reflect the ORDER BY ordering; this avoids an extra sort. And it changes the ORDER BY to be identical to the GROUP BY so that the two can be combined.
Given those changes, the INDEX can now go all the way through to the LIMIT, thereby allowing the processing to look at only 10 rows, not a much larger set.
Anything less than all these changes will not be as efficient.
Further discussion:
INDEX(product, availability, -- these two can be in either order
store, brand, -- must match both `GROUP BY` and `ORDER BY`
id) -- tacked on (on the end) to make it "covering"
"Covering" means that all the columns for the SELECT are found in the INDEX, so no need to reach over into the data.
But... The whole query does not make sense because of the inclusion of id in the SELECT. If you want to find what stores have available dresses, then get rid of id. If you want to list all the available dresses, then change id to GROUP_CONCAT(id).
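A sketch of the index and of the two query variants described above. The index name is hypothetical; its columns are the same as the existing step_three, with store and brand swapped.

```sql
-- Hypothetical index name; same columns as step_three, store before brand.
ALTER TABLE product_data
  ADD INDEX idx_prod_avail_store_brand_id (product, availability, store, brand, id);

-- Variant 1: which store/brand pairs have available dresses.
SELECT store, brand
FROM product_data
WHERE product = 'dresses' AND availability = 1
GROUP BY store, brand
ORDER BY store, brand
LIMIT 10;

-- Variant 2: the same pairs, with the ids of the matching rows listed.
SELECT store, brand, GROUP_CONCAT(id) AS ids
FROM product_data
WHERE product = 'dresses' AND availability = 1
GROUP BY store, brand
ORDER BY store, brand
LIMIT 10;
```

If the index is used as intended, EXPLAIN should no longer show Using temporary or Using filesort for either variant.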
For the indexes, the best index is the step_two. The product field is used in where and has more variation than the availability-field.
Couple of notes about the query:
availability='1' should be availability=1 so that needless int->varchar conversion would be avoided.
"group by brand" should not be used as GROUP BY should only be used when you use aggregate functions as selected columns. What is it that you were trying to achieve with the group by?
Your group by clause doesn't really make sense without an aggregate function.
If you can re-write the query to
SELECT id ,store
FROM `product_data`
WHERE product='dresses'
and availability='1'
order by store limit 10;
Then an index on (product,availability,store) will remove all filesorts.
See SQLFiddle: http://sqlfiddle.com/#!9/60f33d/2
UPDATE:
The SQLFiddle makes your intention clear - you're using GROUP BY to simulate DISTINCT. I don't think you can get rid of the filesort and temporary table steps in your query if this is the case - but I also don't think those steps should be hugely expensive.

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point in time will MySQL use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge, a good index for the above query is an index on table A (id, status, name, created_on, accessed_on) and another on B.id.
I also understand that SQL execution follows the order below, but I am not sure how the index selection and the ordering work.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Would it be better to start the index with the id column, or does it not matter in this case since WHERE is executed before the JOIN? or should it be
Second question: should the column accessed_on be at the beginning of the index combination, the end, or the middle? Or should the id column come after all the columns in the WHERE clause?
I appreciate a detailed answer so I can understand the execution level of MySQL/SQL
UPDATED
I added a few million records to both tables A and B, then added multiple indexes to see which would be the best. MySQL seems to like the index id_2 (i.e. (status, name, created_on, id, accessed_on)).
It seems to apply the WHERE first and figure out that it needs an index on status, name, created_on; then it applies the INNER JOIN and uses the id column after those first three. Finally, it looks for accessed_on as the last column. So the index (status, name, created_on, id, accessed_on) fits the same execution order.
Here is the tables structures
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query are: A(status, name, created_on) and B(id). These indexes will satisfy the where clause and use the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.
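In the posted schema, status_2 already has the recommended shape (status, name, created_on), and b is joined on its primary key. One way to compare plans is to run EXPLAIN with and without an index hint, as in this sketch:

```sql
-- Plan chosen by the optimizer on its own.
EXPLAIN
SELECT *
FROM a
INNER JOIN b ON b.id = a.id
WHERE a.status = 1 AND a.name = 'Mike'
  AND a.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY a.accessed_on DESC;

-- Same query, forced onto the (status, name, created_on) index.
EXPLAIN
SELECT *
FROM a FORCE INDEX (status_2)
INNER JOIN b ON b.id = a.id
WHERE a.status = 1 AND a.name = 'Mike'
  AND a.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY a.accessed_on DESC;
```

Per the answer above, Using filesort is expected in both plans, because the range condition on created_on prevents the index from also providing the accessed_on order.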

Optimize index for multi field ordering with mixed direction

I'm trying to optimize a MySQL table for faster reads. The ratio of read to writes is about 100:1 so I'm disposed to sacrifice write performances with multi indexes.
Relevant fields for my table are the following and it contains about 200000 records
CREATE TABLE `publications` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
-- omitted fields
`publication_date` date NOT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
`position` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
-- these are just attempts, they are not production index
KEY `publication_date` (`publication_date`),
KEY `publication_date_2` (`publication_date`,`position`,`active`)
) ENGINE=MyISAM;
Since I'm using Ruby on Rails to access data in this table I've defined a default scope for this table which is
default_scope where(:active => true).order('publication_date DESC, position ASC')
i.e. every query in this table by default will be completed automatically with the following SQL fragment, so you can assume that almost all queries will have these conditions
WHERE `publications`.`active` = 1 ORDER BY publication_date DESC, position
So I'm mainly interested in optimizing this kind of query, plus queries with publication_date in the WHERE condition.
I tried with the following indexes in various combinations (also with multiple of them at the same time)
`publication_date`
`publication_date`,`position`
`publication_date`,`position`,`active`
However a simple query as this one still doesn't use the index properly and uses filesort
SELECT `publications`.* FROM `publications`
WHERE `publications`.`active` = 1
AND (id NOT IN (35217,35216,35215,35218))
ORDER BY publication_date DESC, position
LIMIT 8 OFFSET 0
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: publications
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 34903
Extra: Using where; Using filesort
1 row in set (0.00 sec)
Some considerations on my issue:
According to MySQL documentation a composite index can't be used for ordering when you mix ASC and DESC in ORDER BY clause
active is a boolean flag, so putting it in a standalone index makes no sense (it has just 2 possible values), but it's always used in the WHERE clause, so it should appear somewhere in an index to avoid Using where in Extra
position is an integer with few possible values and it's always used scoped to publication_date, so I think it's useless to have it in a standalone index
Lots of queries use publication_date in the WHERE part, so it can be useful to have it in a standalone index too, even if that is redundant because it's the first column of the composite index.
One problem is that you are mixing sort orders in the order by clause. You could invert your position (inverted_position = max_position - position) so that you may also invert the sort order on that column.
You can then create a compound index on [publication_date, inverted_position] and change your order by clause to publication_date DESC, inverted_position DESC.
The active column should most likely not be part of the index as it has a very low selectivity.
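A minimal sketch of the inverted-position idea. The constant 1000000 stands in for max_position and is purely an assumption, as is the index name; the application (or a trigger) would have to keep inverted_position in sync whenever position changes.

```sql
-- New column holding the inverted value.
ALTER TABLE publications ADD COLUMN inverted_position INT DEFAULT NULL;

-- 1000000 is an assumed upper bound for position (max_position).
UPDATE publications SET inverted_position = 1000000 - position;

-- Hypothetical index name; both columns are now sorted in the same direction.
ALTER TABLE publications
  ADD INDEX idx_pubdate_invpos (publication_date, inverted_position);

SELECT publications.*
FROM publications
WHERE active = 1
ORDER BY publication_date DESC, inverted_position DESC
LIMIT 8;
```

One caveat: rows with a NULL position sort first under position ASC but last under inverted_position DESC, so the two orderings only match if position is never NULL or NULLs are mapped explicitly.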