Improve performance of last-status retrieval from a history table - MySQL

I want to retrieve the latest status for an item from a history table. The history table holds a record of every status change for an item, and the query must be quick to run.
Below is the query I use to get the latest status per item:
SELECT item_history.*
FROM item_history
INNER JOIN (
    SELECT MAX(created_at) AS created_at, item_id
    FROM item_history
    GROUP BY item_id
) AS latest_status
    ON latest_status.item_id = item_history.item_id
    AND latest_status.created_at = item_history.created_at
WHERE item_history.status_id = 1
    AND item_history.created_at BETWEEN '2020-12-16' AND '2020-12-23'
I've tried putting the query above into another inner join to link the data with an item:
SELECT *
FROM `items`
INNER JOIN ( [query from above] )
WHERE items.category_id = 3
Notes about the item_history table: I have indexes on the following columns: status_id, created_at and listing_id. I have also combined those three into a compound primary key.
My issue is that MySQL keeps scanning the full table to grab MAX(created_at), which is very slow, even though I only have 3 million records in the history table.
Query plan as follows:
id  select_type  table         partitions  type   possible_keys                                            key                         key_len  ref                  rows     filtered  Extra
1   PRIMARY      items         NULL        ref    PRIMARY,district                                         district                    18       const                694      100.00    NULL
1   PRIMARY      item_history  NULL        ref    PRIMARY,status_id,created_at,item_history_item_id_index  PRIMARY                     9        main.items.id,const  1        100.00    Using where
1   PRIMARY      <derived2>    NULL        ref    <auto_key0>                                              <auto_key0>                 14       func,main.items.id   10       100.00    Using where; Using index
2   DERIVED      item_history  NULL        range  PRIMARY,status_id,created_at,item_history_item_id_index  item_history_item_id_index  8        NULL                 2751323  100.00    Using index

I want to retrieve the latest status for an item from a history table.
If you want the results for just one item, then use order by and limit:
select *
from item_history
where item_id = ? and created_at between '2020-12-16' and '2020-12-23'
order by created_at desc limit 1
This query would benefit from an index on (item_id, created_at).
If you want the latest status per item, I would recommend a correlated subquery:
select *
from item_history h
where created_at = (
select max(h1.created_at)
from item_history h1
where h1.item_id = h.item_id
and h1.created_at between '2020-12-16' and '2020-12-23'
)
The same index should be beneficial.
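For reference, a minimal sketch of creating that index (the index name here is illustrative, not from the original post):

create index idx_item_history_item_created on item_history (item_id, created_at);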

Using a window function (MySQL 8.0.14+):
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY created_at DESC) r
    FROM item_history
    WHERE item_history.status_id = 1
      AND item_history.created_at BETWEEN '2020-12-16' AND '2020-12-23'
)
SELECT *
FROM cte
WHERE r = 1;
An index on (item_id, created_at) will also help.
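One difference worth noting: ROW_NUMBER() returns exactly one row per item even when two history rows share the same created_at, whereas the MAX() join and the correlated subquery return every tied row.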


Error when deleting rows in chunks from mysql table with 1M rows [duplicate]

This question already has answers here:
MySql 8 delete subquery with limit
(3 answers)
Closed 3 months ago.
I have two tables, person and person_history, to keep versioned records of a person entity.
The person table always holds the latest version of the person entity, while the person_history table keeps all versions of the person.
The person_history table is growing rapidly, since every update of person adds a new record to the history table.
The primary key of the person table is referenced as person_id in the person_history table. The column version_num tracks versioning in the history table; with each update, version_num is bumped by 1.
I wish to keep only 5 records per person_id, and purge the older ones.
For this I've prepared the statement below:
DELETE
FROM person_history
WHERE id IN (SELECT p0.id
             FROM person_history p0
             WHERE (
                 SELECT COUNT(*)
                 FROM person_history p1
                 WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
             ) >= 5);
This statement works, but it is very slow and write operations are impacted while it runs.
I tried adding ORDER BY and LIMIT to the above query to delete in chunks, forming the query below:
DELETE
FROM person_history
WHERE id IN (SELECT p0.id
             FROM person_history p0
             WHERE (
                 SELECT COUNT(*)
                 FROM person_history p1
                 WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
             ) >= 5
             ORDER BY p0.id
             LIMIT 1000);
This query fails with the error: This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'.
I've also tried creating a procedure, but that threw an error too:
DROP PROCEDURE IF EXISTS purge_history;
DELIMITER $$
CREATE PROCEDURE purge_history()
BEGIN
    REPEAT
        DO SLEEP(1);
        SET @z := (SELECT p0.id
                   FROM person_history p0
                   WHERE (
                       SELECT COUNT(*)
                       FROM person_history p1
                       WHERE p0.person_id = p1.person_id AND p0.version_num < p1.version_num
                   ) >= 5 ORDER BY p0.id LIMIT 1000);
        DELETE
        FROM person_history
        WHERE id IN z;
    UNTIL ROW_COUNT() = 0 END REPEAT;
END$$
DELIMITER ;
This failed with ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'z;
UNTIL ROW_COUNT() = 0 END REPEAT;'
I've tried it on MySQL 8 and MariaDB 10.9.
Please suggest an alternative to the above chunked delete query so that writes are not impacted while the delete is in progress.
You could do it using ROW_NUMBER() -
SELECT person_id, version_num, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
FROM person_history
and for the delete -
DELETE ph
FROM person_history ph
JOIN (
SELECT person_id, version_num, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
FROM person_history
) t ON ph.person_id = t.person_id
AND ph.version_num = t.version_num
AND t.revs > 5;
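Running the SELECT version above first is a cheap way to verify exactly which rows the DELETE will remove.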
UPDATE
I have set up a test table with 1M rows. Running the SELECT version of the OP's query (retaining only the latest 3 versions, not the suggested 5) returns 267,432 rows, but the OP's data distribution is likely to be very different.
Query 1: Original correlated sub-query
SELECT id
FROM person_history
WHERE id in (
SELECT p0.id
FROM person_history p0
WHERE (
SELECT COUNT(*)
FROM person_history p1
WHERE p0.person_id = p1.person_id
AND p0.version_num < p1.version_num
) >= 3
);
id  select_type         table           partitions  type    possible_keys  key            key_len  ref                     rows    filtered  Extra
1   PRIMARY             person_history  NULL        index   PRIMARY        uq_person_ver  4        NULL                    998896  100.00    Using index
1   PRIMARY             p0              NULL        eq_ref  PRIMARY        PRIMARY        3        test.person_history.id  1       100.00    Using where
3   DEPENDENT SUBQUERY  p1              NULL        ref     uq_person_ver  uq_person_ver  3        test.p0.person_id       3       33.33     Using where; Using index
Query 2: Rewritten correlated sub-query
SELECT ph.id
FROM person_history ph
JOIN (
SELECT p0.id
FROM person_history p0
JOIN person_history p1
ON p0.person_id = p1.person_id
AND p0.version_num < p1.version_num
GROUP BY p0.id
HAVING COUNT(p1.id) >= 3
) t ON ph.id = t.id;
id  select_type  table       partitions  type    possible_keys          key            key_len  ref                rows     filtered  Extra
1   PRIMARY      <derived2>  NULL        ALL     NULL                   NULL           NULL     NULL               1148980  100.00    NULL
1   PRIMARY      ph          NULL        eq_ref  PRIMARY                PRIMARY        3        t.id               1        100.00    Using index
2   DERIVED      p0          NULL        index   PRIMARY,uq_person_ver  PRIMARY        3        NULL               998896   100.00    NULL
2   DERIVED      p1          NULL        ref     uq_person_ver          uq_person_ver  3        test.p0.person_id  3        33.33     Using where; Using index
Query 3: ROW_NUMBER() sub-query
SELECT ph.id
FROM person_history ph
JOIN (
SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
FROM person_history
) t ON ph.id = t.id
AND t.revs > 3;
id  select_type  table           partitions  type    possible_keys  key            key_len  ref   rows    filtered  Extra
1   PRIMARY      <derived2>      NULL        ALL     NULL           NULL           NULL     NULL  998896  33.33     Using where
1   PRIMARY      ph              NULL        eq_ref  PRIMARY        PRIMARY        3        t.id  1       100.00    Using index
2   DERIVED      person_history  NULL        index   NULL           uq_person_ver  4        NULL  998896  100.00    Using index; Using filesort
Observations
                      Query 1    Query 2  Query 3
Rows examined         6,378,752  534,864  1,267,432
Rows returned         267,432    267,432  267,432
Execution time (sec)  5.95       3.75     1.10
Combining this with batching on the person_id should significantly reduce the overhead -
DELETE ph
FROM person_history ph
JOIN (
SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
FROM person_history
WHERE person_id BETWEEN 1 AND 50000
) t ON ph.id = t.id
AND t.revs > 3;
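If it helps, here is a minimal sketch of driving those batches from a stored procedure; the procedure name, batch size, and the SLEEP pause between batches are illustrative assumptions, not from the original answer:

DROP PROCEDURE IF EXISTS purge_history_batched;
DELIMITER $$
CREATE PROCEDURE purge_history_batched()
BEGIN
    DECLARE batch_start INT DEFAULT 1;
    DECLARE batch_size INT DEFAULT 50000;
    DECLARE max_person INT;
    SELECT MAX(person_id) INTO max_person FROM person_history;
    WHILE batch_start <= max_person DO
        -- keep only the 5 newest versions within this person_id range
        DELETE ph
        FROM person_history ph
        JOIN (
            SELECT id, ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY version_num DESC) revs
            FROM person_history
            WHERE person_id BETWEEN batch_start AND batch_start + batch_size - 1
        ) t ON ph.id = t.id
        AND t.revs > 5;
        SET batch_start = batch_start + batch_size;
        DO SLEEP(1); -- brief pause so concurrent writes are not starved
    END WHILE;
END$$
DELIMITER ;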
I also tried these queries against the table with the surrogate PK replaced by PK on (person_id, version_num) but the improvement was negligible.

select last record in each group for large database

I want to fetch the last record in each group. I have used the following query with a very small database and it works perfectly:
SELECT * FROM logs
WHERE id IN (
SELECT max(id) FROM logs
WHERE id_search_option = 31
GROUP BY items_id
)
ORDER BY id DESC
But when it comes to the actual database with millions of rows (8,000,000+ rows), the system hangs.
I also tried another query, which returns results in 6.6 sec on average:
SELECT p1.id, p1.itemtype, p1.items_id, p1.date_mod
FROM logs p1
INNER JOIN (
SELECT max(id) as max_id, itemtype, items_id, date_mod
FROM logs
WHERE id_search_option = 31
GROUP BY items_id) p2
ON (p1.id = p2.max_id)
ORDER BY p1.items_id DESC;
Please help!
EDIT: EXPLAIN for the 2nd query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1177 Using temporary; Using filesort
1 PRIMARY p1 eq_ref PRIMARY PRIMARY 4 p2.max_id 1
2 DERIVED logs ALL NULL NULL NULL NULL 7930527 Using where; Using temporary; Using filesort
SELECT * FROM tablename ORDER BY unique_column DESC LIMIT 0,1;

Try it; it will work. Here 0 is the offset (start at the 0th record) and 1 is the number of records to return.
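Note, though, that this returns a single row for the whole table; to get the last record in each group it would have to be applied per items_id.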

Optimize joined order by query

I have the following query:
SELECT `p_products`.`id`, `p_products`.`name`, `p_products`.`date`,
`p_products`.`img`, `p_products`.`safe_name`, `p_products`.`sku`,
`p_products`.`productstatusid`, `op`.`quantity`
FROM `p_products`
INNER JOIN `p_product_p_category`
ON `p_products`.`id` = `p_product_p_category`.`p_product_id`
LEFT JOIN (SELECT `p_product_id`,`order_date`,SUM(`product_quantity`) as quantity
FROM `p_orderedproducts`
WHERE `order_date`>='2013-03-01 16:51:17'
GROUP BY `p_product_id`) AS op
ON `p_products`.`id` = `op`.`p_product_id`
WHERE `p_product_p_category`.`p_category_id` IN ('15','23','32')
AND `p_products`.`active` = '1'
GROUP BY `p_products`.`id`
ORDER BY `date` DESC
Explain says:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY p_product_p_category ref p_product_id,p_category_id,p_product_id_2 p_category_id 4 const 8239 Using temporary; Using filesort
1 PRIMARY p_products eq_ref PRIMARY PRIMARY 4 pdev.p_product_p_category.p_product_id 1 Using where
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 78
2 DERIVED p_orderedproducts index order_date p_product_id 4 NULL 201 Using where
And I have indexes on a number of columns including p_products.date.
The problem is speed when there are more than 5,000 products in a number of categories; 60,000 products take >1 second. Is there any way to speed things up?
This also holds true if I remove the left join, in which case the result is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p_product_p_category index p_product_id,p_category_id,p_product_id_2 p_product_id_2 8 NULL 91167 Using where; Using index; Using temporary; Using filesort
1 SIMPLE p_products eq_ref PRIMARY PRIMARY 4 pdev.p_product_p_category.p_product_id 1 Using where
The intermediate table p_product_p_category has indexes on both p_product_id and p_category_id, as well as a combined index on both.
Tried Ochi's suggestion and ended up with:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 62087 Using temporary; Using filesort
1 PRIMARY nr1media_products eq_ref PRIMARY PRIMARY 4 cats.nr1media_product_id 1 Using where
2 DERIVED nr1media_product_nr1media_category range nr1media_category_id nr1media_category_id 4 NULL 62066 Using where
I think I can simplify the question to: how can I join my products to the category intermediate table to fetch all unique products for the selected categories, sorted by date?
EDIT:
This gives me all unique products in the categories without using a temp table for ordering or grouping:
SELECT
`p_products`.`id`,
`p_products`.`name`,
`p_products`.`img`,
`p_products`.`safe_name`,
`p_products`.`sku`,
`p_products`.`productstatusid`
FROM
p_products
WHERE
EXISTS (
SELECT
1
FROM
p_product_p_category
WHERE
p_product_p_category.p_product_id = p_products.id
AND p_category_id IN ('15', '23', '32')
)
AND p_products.active = 1
ORDER BY
`date` DESC
The above query is very fast, much faster than the join using GROUP BY/ORDER BY (0.04 vs 0.7 sec), although I don't understand why it can run this query without temp tables.
I think I need to find another solution for the orderedproducts join; it still slows the query down to >1 sec. I might set up a cron job to update the ranking of products sold once every night and save that info to the p_products table.
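A minimal sketch of what such a nightly update might look like, assuming a quantity_sold column is added to p_products; the column name and the 30-day window are invented for the example:

UPDATE p_products p
LEFT JOIN (
    SELECT p_product_id, SUM(product_quantity) AS quantity
    FROM p_orderedproducts
    WHERE order_date >= NOW() - INTERVAL 30 DAY
    GROUP BY p_product_id
) op ON p.id = op.p_product_id
SET p.quantity_sold = COALESCE(op.quantity, 0);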
Unless someone has a definitive solution...
You are joining every category row to products; only then does it get filtered by category id.
Try to limit your query as early as possible, e.g. instead of
INNER JOIN `p_product_p_category`
do
INNER JOIN (
    SELECT * FROM `p_product_p_category` WHERE `p_category_id` IN ('15','23','32')
) AS cats ON `p_products`.`id` = cats.`p_product_id`

so that you will be working on a smaller subset of products right from the beginning.
One possible solution would be to remove the derived table and just do a single Group By:
Select P.id, P.name, P.date
, P.img, P.safe_name, P.sku
, P.productstatusid
, Sum( OP.product_quantity ) As quantity
From p_products As P
Join p_product_p_category As CAT
On P.id = CAT.p_product_id
Left Join p_orderedproducts As OP
On OP.p_product_id = P.id
And OP.order_date >= '2013-03-01 16:51:17'
Where CAT.p_category_id In ('15','23','32')
And P.active = '1'
Group By P.id, P.name, P.date
, P.img, P.safe_name, P.sku
, P.productstatusid
Order By P.date Desc
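One caveat worth checking with this rewrite: if a product belongs to more than one of the selected categories, the join against CAT repeats each p_orderedproducts row, so Sum(OP.product_quantity) can be inflated; pre-aggregating the order quantities (as in the original derived table) avoids that.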

MySQL: how to increase speed of a select query with 2 joins and 1 subquery

In a table 'ttraces' I have many records for different tasks (whose value is held in 'taskid' column and is a foreign key of a column 'id' in a table 'ttasks'). Each task inserts a record to 'ttraces' every 8-10 seconds, so caching data to increase performance is not a good idea. What I need is to select only the newest records for each task from 'ttraces', that means the records with the maximum value of the column 'time'. At the moment, I have over 500000 records in the table. The very simplified structure of these two tables looks as follows:
-----------------------
| ttasks |
-----------------------
| id | name | blocked |
-----------------------
---------------------
| ttraces |
---------------------
| id | taskid | time |
---------------------
And my query is shown below:
SELECT t.name, tr.time
FROM
    ttraces tr
JOIN
    ttasks t ON tr.taskid = t.id
JOIN (
    SELECT taskid, MAX(time) AS max_time
    FROM ttraces
    GROUP BY taskid
) x ON tr.taskid = x.taskid AND tr.time = x.max_time
WHERE t.blocked
All columns used in WHERE and JOIN clauses are indexed. As of now the query runs for ~1.5 seconds. It is crucial to increase its speed. Thanks for all suggestions. BTW: the database is running on a hosted, shared server and I can't move it anywhere else for the moment.
[EDIT]
EXPLAIN SELECT... results are:
--------------------------------------------------------------------------------------------------------------
id select_type table type possible_keys key key_len ref rows Extra
--------------------------------------------------------------------------------------------------------------
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 74
1 PRIMARY t eq_ref PRIMARY PRIMARY 4 x.taskid 1 Using where
1 PRIMARY tr ref taskid,time time 9 x.max_time 1 Using where
2 DERIVED ttraces index NULL itask 5 NULL 570853
--------------------------------------------------------------------------------------------------------------
The engine is InnoDB.
I may be having a bit of a moment, but is this query not logically the same, and (almost certainly) faster?
select t.id, t.name, max(tr.time)
from ttraces tr
join ttasks t on tr.taskid = t.id
where t.blocked
group by t.id, t.name
Here's my idea... You need one composite index on ttraces having the taskid and time columns (in that order). Then use this query:
SELECT t.name,
trm.mtime
FROM ttasks AS t
JOIN (SELECT taskid,
Max(time) AS mtime
FROM ttraces
GROUP BY taskid) AS trm
ON t.id = trm.taskid
WHERE t.blocked
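The composite index itself could be created like this (the index name is illustrative):

CREATE INDEX idx_ttraces_taskid_time ON ttraces (taskid, time);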
Does this code return the correct result? If so, how fast is it?
SELECT t.name, max_time
FROM ttasks t JOIN (
SELECT taskid, MAX(time) AS max_time
FROM ttraces
GROUP BY taskid
) x ON t.id = x.taskid
If there are many traces for each task then you can keep a table with only the newest traces. Whenever you insert into ttraces you also upsert into ttraces_newest:
insert into ttraces_newest (id, taskid, time) values
(3, 1, '2012-01-01 08:02:01')
on duplicate key update
`time` = current_timestamp
The primary key of ttraces_newest would be (id, taskid). Querying ttraces_newest would be cheaper. How much cheaper depends on how many traces there are for each task. Now the query is:
SELECT t.name,tr.time
FROM
ttraces_newest tr
JOIN
ttasks t ON tr.taskid = t.id
WHERE t.blocked
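For completeness, a minimal DDL sketch of ttraces_newest under the stated assumptions (the column types are guesses):

CREATE TABLE ttraces_newest (
    id     INT NOT NULL,
    taskid INT NOT NULL,
    time   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id, taskid)
) ENGINE=InnoDB;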

Mysql Optimize Query: Trying to Get Average of Subquery

I have the following query:
SELECT AVG(time) FROM
(SELECT UNIX_TIMESTAMP(max(datelast)) - UNIX_TIMESTAMP(min(datestart)) AS time
FROM table
WHERE id IN
(SELECT DISTINCT id
FROM table
WHERE product_id = 12394 AND datelast > '2011-04-13 00:26:59'
)
GROUP BY id
)
as T
The query takes the greatest datelast value and subtracts the smallest datestart value for every id (which gives the length of a user session), and then averages the results.
The outermost query is there only to average the resulting times. Is there any way to optimize this query?
Output from EXPLAIN:
id select_type table type possible_keys key key_len ref rows extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 7
2 DERIVED table index NULL id 16 NULL 26 Using where
3 DEPENDENT SUBQUERY table index_subquery id,product_id,datelast id 12 func 2 Using index; Using where
Is the IN subquery really necessary?
SELECT
    AVG(time)
FROM
    (
        SELECT
            UNIX_TIMESTAMP(max(datelast)) - UNIX_TIMESTAMP(min(datestart)) AS time
        FROM
            table
        WHERE
            product_id = 12394 AND datelast > '2011-04-13 00:26:59'
        GROUP BY
            id
    ) AS T
I can't test right now, but I think it would work too. Otherwise, your query looks good.
You can optimize the query by adding a (datelast, product_id) key (always put the most restrictive field first, to increase selectivity).
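For illustration, that key could be created like this (the index name is invented, and the placeholder table name is backticked because table is a reserved word):

CREATE INDEX idx_datelast_product ON `table` (datelast, product_id);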