Optimize MySQL query using range and group by on large table

My table structure:
CREATE TABLE `jobs_view_stats` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`job_id` int(11) NOT NULL,
`created_at` datetime NOT NULL,
`account_id` int(11) DEFAULT NULL,
`country` varchar(2) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDX_D05BC6799FDS15210` (`job_id`),
KEY `FK_YTGBC67994591257` (`account_id`),
KEY `jobs_view_stats_created_at_id_index` (`created_at`,`id`),
CONSTRAINT `FK_YTGBC67994591257` FOREIGN KEY (`account_id`) REFERENCES `accounts` (`id`) ON DELETE SET NULL,
CONSTRAINT `job_views_jobs_id_fk` FOREIGN KEY (`job_id`) REFERENCES `jobs` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=79976587 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='New jobs views system'
This is the query:
SELECT COUNT(id) as view, job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
GROUP BY jobs_view_stats.job_id
Execution plan:
id: 1
select_type: SIMPLE
table: jobs_view_stats
partitions: NULL
type: range
possible_keys: IDX_D05BC6799FDS15210, jobs_view_stats_created_at_id_index
key: jobs_view_stats_created_at_id_index
key_len: 5
ref: NULL
rows: 1584610
filtered: 100.00
Extra: Using index condition; Using MRR; Using temporary; Using filesort
This query takes 4 minutes to complete; I want to reduce that to the minimum time possible.

In your execution plan, you are returning 1,584,610 rows that are then grouped, which in turn uses a temporary table to sort and group (slow).
jobs_view_stats_created_at_id_index also contains id, which makes the key cardinality excessively large; you could instead try adding job_id to that key, since it is what you are grouping by.
I think the main issue is that your WHERE clause returns over 1.5 million rows, which all have to be loaded into a temporary table and then re-read in full to be grouped.
You need to bite off smaller chunks, I think.
I'm going to assume you are using a programming language (like PHP) to invoke the DB calls. If so, you could try:
SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
Then, when you have the job_id list, do smaller queries looping over the results from the first:
SELECT COUNT(*) FROM jobs_view_stats WHERE job_id = *theid*
Or, if there are many different job_ids, batch them:
SELECT COUNT(*) FROM jobs_view_stats WHERE job_id IN (id1, id2, id3, ...)
For a pure MySQL solution, I would create a temporary MEMORY table of all the job_ids using:
CREATE TEMPORARY TABLE `temptable`
(
job_id INT PRIMARY KEY
) ENGINE=MEMORY
INSERT INTO `temptable` SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
Then
SELECT job_id, COUNT(*) AS view
FROM jobs_view_stats
WHERE job_id IN (SELECT job_id FROM `temptable`)
GROUP BY job_id
This is all totally untested, so there may be typos.

Change COUNT(id) to COUNT(*) since it does not need to check id for being NOT NULL
Add
INDEX(job_id, created_at)
INDEX(created_at, job_id)
Either of those will be "covering". (I don't know which is better; the Optimizer will know.)
Drop the index (job_id), as being redundant.
Unless you specifically need (created_at, id), drop that index too.
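Taken together, a sketch of those changes (index names are illustrative, and the drops assume nothing else relies on the old indexes):
ALTER TABLE jobs_view_stats
  ADD INDEX idx_job_created (job_id, created_at),
  ADD INDEX idx_created_job (created_at, job_id);
-- idx_job_created starts with job_id, so it can back the job_id foreign key,
-- letting the now-redundant indexes go:
ALTER TABLE jobs_view_stats
  DROP INDEX `IDX_D05BC6799FDS15210`,
  DROP INDEX `jobs_view_stats_created_at_id_index`;
SELECT job_id, COUNT(*) AS view
FROM jobs_view_stats
WHERE created_at BETWEEN '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
GROUP BY job_id;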
For even more performance, build and maintain a Summary Table, then perform the query against it. Such a summary table might have PRIMARY KEY(job_id, created_date) and daily counts for each day for each job.
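A minimal sketch of such a summary table, assuming a daily rollup is acceptable (the table name and the maintenance statement are illustrative, not from the original answer):
CREATE TABLE jobs_view_stats_daily (
  job_id       INT NOT NULL,
  created_date DATE NOT NULL,
  views        INT UNSIGNED NOT NULL,
  PRIMARY KEY (job_id, created_date)
) ENGINE=InnoDB;
-- Roll up the previous day's raw rows once per day (e.g. from cron):
INSERT INTO jobs_view_stats_daily (job_id, created_date, views)
SELECT job_id, DATE(created_at), COUNT(*)
FROM jobs_view_stats
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY job_id, DATE(created_at)
ON DUPLICATE KEY UPDATE views = VALUES(views);
-- The monthly report then reads at most one row per job per day:
SELECT job_id, SUM(views) AS view
FROM jobs_view_stats_daily
WHERE created_date BETWEEN '2022-11-01' AND '2022-11-30'
GROUP BY job_id;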

Related

MySQL keys for GROUP BY

Table structure, indexes, and query are below. On a table with more than a million records, this takes well over a minute to run, I guess mainly because of the GROUP BY and/or DAY().
I tried creating a composite index with the draft column first, because that would allow faster querying of WHERE draft = 0. Unfortunately it doesn't seem to make a difference, and I haven't been able to find much information on how to use indexes to optimise this kind of query with a GROUP BY.
How can I speed this up?
CREATE TABLE `table` (
`id` bigint(20) UNSIGNED NOT NULL,
`user_id` int(11) NOT NULL,
`coords` point NOT NULL,
`date` datetime NOT NULL,
`draft` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `table`
ADD PRIMARY KEY (`id`),
ADD KEY `user_id` (`user_id`),
ADD KEY `draft` (`draft`),
ADD KEY `date` (`date`),
ADD KEY `user_id_2` (`draft`,`user_id`,`date`) USING BTREE;
SELECT id, user_id, date, coords
FROM table
WHERE draft = 0
GROUP BY (user_id, DAY(date))
ORDER BY date ASC
EXPLAIN
select_type: SIMPLE
table: table
type: ref
possible_keys: draft, user_id_2
key: draft
key_len: 1
ref: const
rows: 3427592
Extra: Using where; Using temporary; Using filesort
First of all, I don't see any reason why you are even using GROUP BY, given that your select clause does not include any aggregates. Perhaps this is what you are intending to run:
SELECT id, user_id, date, coords
FROM yourTable
WHERE draft = 0
ORDER BY date;
This query might benefit from the following index:
CREATE INDEX idx ON yourTable (draft, date);
If used, this index would let MySQL discard some records via the WHERE clause, and also would enable efficient sorting in the ORDER BY date clause. You could also go all out, and cover the select clause:
CREATE INDEX idx ON yourTable (draft, date, user_id);
This would cover the select clause apart from coords, meaning that the index by itself would contain almost all the information needed to complete the query (this assumes you are running InnoDB, in which case MySQL would automatically include id as the fourth column of the index). Since coords is a POINT, it cannot be added to a B-tree index, so it still has to be fetched from the clustered index.

Mysql GROUP BY really slow on a view

I have these tables.
CREATE TABLE `movements` (
`movementId` mediumint(8) UNSIGNED NOT NULL,
`movementType` tinyint(3) UNSIGNED NOT NULL,
`deleted` tinyint(1) UNSIGNED NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `movements`
ADD PRIMARY KEY (`movementId`),
ADD KEY `movementType` (`movementType`) USING BTREE,
ADD KEY `deleted` (`deleted`),
ADD KEY `movementId` (`movementId`,`deleted`);
CREATE TABLE `movements_items` (
`movementId` mediumint(8) UNSIGNED NOT NULL,
`itemId` mediumint(8) UNSIGNED NOT NULL,
`qty` decimal(10,3) UNSIGNED NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `movements_items`
ADD KEY `movementId` (`movementId`),
ADD KEY `itemId` (`itemId`),
ADD KEY `movementId_2` (`movementId`,`itemId`);
and this view called "movements_items_view":
SELECT
movements_items.itemId, movements_items.qty,
movements.movementId, movements.movementType
FROM movements_items
JOIN movements ON (movements.movementId=movements_items.movementId
AND movements.deleted=0)
The first table has 5,913 rows, the second one 144,992.
The view is very fast: it loads 20 results in phpMyAdmin in 0.0011s. But as soon as I ask for a GROUP BY on it (I need it to do statistics with SUM()), e.g.:
SELECT * FROM movements_items_view GROUP BY itemId LIMIT 0,20
the time jumps to 0.2s or more, and it causes "Using where; Using temporary; Using filesort" on the movements join.
Any help appreciated, thanks.
EDIT:
I also ran this query via phpMyAdmin to try to avoid using the view:
SELECT movements.movementId, movements.movementType, movements_items.qty
FROM movements_items
JOIN movements ON movements.movementId=movements_items.movementId
GROUP BY itemId LIMIT 0,20
And the performance is the same.
EDIT: Here is the EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE movements index PRIMARY,movementid movement_type 1 NULL 5913 Using index; Using temporary; Using filesort
1 SIMPLE movements_items ref movementId,itemId,movementId_2 movementId_2 3 movements.movementId 12 Using index
Turn it inside out. See if this works:
SELECT movementId, m.movementType, mi.qty
FROM
( SELECT movementId, qty
FROM movements_items
GROUP BY itemId
ORDER BY itemId
LIMIT 20
) AS mi
JOIN movements AS m USING(movementId)
The trick is to do the LIMIT sooner. The original way had all the data being hauled around, not just 20 rows.
In movements_items, is there no column or combination of columns that is 'unique'? If so, make it the PRIMARY KEY.
In movements, KEY movementId (movementId, deleted) is redundant and should be dropped.
In movements_items, KEY movementId (movementId) is redundant and should be dropped.
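A sketch of those cleanups (the PRIMARY KEY below assumes (movementId, itemId) is unique, which is an assumption):
ALTER TABLE movements DROP KEY movementId;
ALTER TABLE movements_items
  DROP KEY movementId,
  ADD PRIMARY KEY (movementId, itemId);
-- movementId_2 (movementId, itemId) would then duplicate the new PRIMARY KEY
-- and could be dropped as well.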

How to Optimize MYSQL in Extra :-Using where; Using temporary; Using filesort

What is the proper indexing for this query?
I tried different combinations of indexes for this query, but it still ends up using temporary, using filesort, etc.
Total table rows: 760,346
Rows with product = 'Dresses': 122,554
CREATE TABLE IF NOT EXISTS `product_data` (
`table_id` int(11) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`price` int(11) NOT NULL,
`store` varchar(255) NOT NULL,
`brand` varchar(255) DEFAULT NULL,
`product` varchar(255) NOT NULL,
`model` varchar(255) NOT NULL,
`size` varchar(50) NOT NULL,
`discount` varchar(255) NOT NULL,
`gender_id` int(11) NOT NULL,
`availability` int(11) NOT NULL,
PRIMARY KEY (`table_id`),
UNIQUE KEY `table_id` (`table_id`),
KEY `id` (`id`),
KEY `discount` (`discount`),
KEY `step_one` (`product`,`availability`),
KEY `step_two` (`product`,`availability`,`brand`,`store`),
KEY `step_three` (`product`,`availability`,`brand`,`store`,`id`),
KEY `step_four` (`brand`,`store`),
KEY `step_five` (`brand`,`store`,`id`)
) ENGINE=InnoDB ;
Query:
SELECT id ,store,brand FROM `product_data` WHERE product='dresses' and
availability='1' group by brand,store order by store limit 10;
Execution time: (10 total, Query took 1.0941 sec)
EXPLAIN PLAN:
possible_keys: step_one, step_two, step_three, step_four, step_five
key: step_two
ref: const, const
rows: 229438
Extra: Using where; Using temporary; Using filesort
I tried these indexes
Key step_one (product,availability)
Key step_two (product,availability,brand,store)
Key step_three (product,availability,brand,store,id)
Key step_four (brand,store)
Key step_five (brand,store,id)
The real problem is not the index, but the mismatch between the GROUP BY and ORDER BY, which prevents taking advantage of the LIMIT.
This
INDEX(product, availability, store, brand, id)
will be "covering" and in the right order. But note that I have swapped store and brand...
Change the query to
SELECT id ,store,brand
FROM `product_data`
WHERE product='dresses'
and availability='1'
GROUP BY store, brand -- change
ORDER BY store, brand -- change
limit 10;
That changes the GROUP BY to start with store, to reflect the ORDER BY ordering -- this avoids an extra sort. And it changes the ORDER BY to be identical to the GROUP BY so that the two can be combined.
Given those changes, the INDEX can now go all the way through to the LIMIT, thereby allowing the processing to look at only 10 rows, not a much larger set.
Anything less than all these changes will not be as efficient.
Further discussion:
INDEX(product, availability, -- these two can be in either order
store, brand, -- must match both `GROUP BY` and `ORDER BY`
id) -- tacked on (on the end) to make it "covering"
"Covering" means that all the columns for the SELECT are found in the INDEX, so no need to reach over into the data.
But... The whole query does not make sense because of the inclusion of id in the SELECT. If you want to find what stores have available dresses, then get rid of id. If you want to list all the available dresses, then change id to GROUP_CONCAT(id).
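For illustration, the GROUP_CONCAT variant would look like this (a sketch of the suggested change, assuming you want every matching id per store/brand):
SELECT store, brand, GROUP_CONCAT(id) AS ids
FROM `product_data`
WHERE product = 'dresses'
  AND availability = 1
GROUP BY store, brand
ORDER BY store, brand
LIMIT 10;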
For the indexes, the best index is step_two. The product field is used in the WHERE clause and has more variation than the availability field.
A couple of notes about the query:
availability='1' should be availability=1 so that a needless type conversion is avoided.
"group by brand" should not be used, as GROUP BY is only meaningful when you use aggregate functions in the selected columns. What was it that you were trying to achieve with the GROUP BY?
Your group by clause doesn't really make sense without an aggregate function.
If you can re-write the query to
SELECT id ,store
FROM `product_data`
WHERE product='dresses'
and availability='1'
order by store limit 10;
Then an index on (product,availability,store) will remove all filesorts.
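For concreteness, that index might be created as (name illustrative):
CREATE INDEX idx_product_availability_store ON `product_data` (product, availability, store);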
See SQLFiddle: http://sqlfiddle.com/#!9/60f33d/2
UPDATE:
The SQLFiddle makes your intention clear - you're using GROUP BY to simulate DISTINCT. I don't think you can get rid of the filesort and temporary table steps in your query if this is the case - but I also don't think those steps should be hugely expensive.

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point MySQL will use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge, a good index for the above query is an index on table A (id, status, name, created_on, accessed_on) and another on B.id.
I also understand that SQL execution follows the order below, but I am not sure how the index selection and ordering work.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Will it be better to start the index with the id column, or does it not matter in this case since the WHERE is executed before the JOIN?
Second question: should the column accessed_on be at the beginning, middle, or end of the index combination? Or should the id column come after all the columns in the WHERE clause?
I appreciate a detailed answer so I can understand the execution level of MySQL/SQL
UPDATED
I added a few million records to both tables A and B, then added multiple indexes to see which would be the best. MySQL seems to like the index id_2 (i.e. (status, name, created_on, id, accessed_on)).
It seems to apply the WHERE first, working out that it needs an index on status, name, created_on; then it applies the INNER JOIN, using the id column after the first three. Finally, it looks for accessed_on as the last column. So the index (status, name, created_on, id, accessed_on) fits the same execution order.
Here are the table structures:
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query are A(status, name, created_on) and B(id). These indexes satisfy the where clause and use the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.
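A minimal sketch of that recommendation (the index name is illustrative):
ALTER TABLE a ADD INDEX idx_status_name_created (status, name, created_on);
-- b.id is already the PRIMARY KEY, so the join side needs no extra index.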

MYSQL SUM() with GROUP BY and LIMIT

I got this table
CREATE TABLE `votes` (
`item_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`vote` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`item_id`,`user_id`),
KEY `FK_vote_user` (`user_id`),
KEY `vote` (`vote`),
KEY `item` (`item_id`),
CONSTRAINT `FK_vote_item` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON UPDATE CASCADE,
CONSTRAINT `FK_vote_user` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
And I got this simple select
SELECT
`a`.`item_id`, `a`.`sum`
FROM
(SELECT
`item_id`, SUM(vote) AS `sum`
FROM
`votes`
GROUP BY `item_id`) AS a
ORDER BY `a`.`sum` DESC
LIMIT 10
Right now, with only 250 rows, there isn't a problem, but it's using filesort. The vote column holds either -1, 0 or 1. But will this be performant when this table has millions of rows?
If I make it a simpler query without a subquery, then the using temporary table appears.
Explain gives (the query completes in 0.00170s):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 33 Using filesort
2 DERIVED votes index NULL PRIMARY 8 NULL 250
No, this won't be efficient with millions of rows.
You'll have to create a supporting aggregate table which would store votes per item:
CREATE TABLE item_votes
(
item_id INT NOT NULL PRIMARY KEY,
votes INT NOT NULL, -- signed: the running total can go negative
upvotes INT UNSIGNED NOT NULL,
downvotes INT UNSIGNED NOT NULL,
KEY (votes),
KEY (upvotes),
KEY (downvotes)
)
and update it each time a vote is cast:
INSERT
INTO item_votes (item_id, votes, upvotes, downvotes)
VALUES (
$item_id,
CASE WHEN $upvote THEN 1 ELSE -1 END,
CASE WHEN $upvote THEN 1 ELSE 0 END,
CASE WHEN $upvote THEN 0 ELSE 1 END
)
ON DUPLICATE KEY UPDATE
votes = votes + VALUES(upvotes) - VALUES(downvotes),
upvotes = upvotes + VALUES(upvotes),
downvotes = downvotes + VALUES(downvotes)
then select the top 10 by votes:
SELECT *
FROM item_votes
ORDER BY
votes DESC, item_id DESC
LIMIT 10
efficiently using an index.
But will this be performant when this table has millions of rows?
No, it won't.
If I make it a simpler query without a subquery, then the using temporary table appears.
Probably because the planner would turn it into the query you posted: it needs to calculate the sum to return the results in the correct order.
To quickly grab the top voted questions, you need to cache the result. Add a score field in your items table, and maintain it (e.g. using triggers). And index it. You'll then be able to grab the top 10 scores using an index scan.
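A sketch of that caching approach, assuming a score column maintained by an insert trigger (the column, key, and trigger names are illustrative, and you would also need UPDATE/DELETE triggers to keep the score accurate):
ALTER TABLE items
  ADD COLUMN score INT NOT NULL DEFAULT 0,
  ADD KEY idx_score (score);
CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;
-- Grabbing the top 10 then becomes an index scan:
SELECT id, score FROM items ORDER BY score DESC LIMIT 10;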
First, you don't need the subquery, so you can rewrite your query as:
SELECT `item_id`, SUM(vote) AS `sum`
FROM `votes`
GROUP BY `item_id`
ORDER BY `sum` DESC
LIMIT 10
Second, you can build an index on votes(item_id, vote). The group by will then be an index scan. This will take time as the table gets bigger, but it should be manageable for reasonable data sizes.
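For concreteness, that index (key name illustrative):
ALTER TABLE votes ADD KEY item_vote (item_id, vote);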
Finally, with this structure of a query, you need to do a file sort for the final order by. Whether this is efficient or not depends on the number of items you have. If each item has, on average, one or two votes, then this may take some time. If you have a fixed set of items and there are only a few hundred or thousand, then it should not be a performance bottleneck, even as the data size expands.
If this summary is really something you need quickly, then a trigger with a summary table (as explained in another answer) provides a faster retrieval method.