Most efficient way to join "most recent row" - MySQL

I know this question has been asked 100 times, and this isn't a "how do I do it", but an efficiency question - a topic I don't know much about.
From my internet reading I have settled on one way of solving the most-recent-row problem that sounds pretty efficient: LEFT JOIN a "max" table (grouped by the matching conditions) and then LEFT JOIN the row that matches the grouped conditions. Something like this:
SELECT employee.*, evaluation.*
FROM employee
LEFT JOIN (
    SELECT MAX(report_date) report_date, employee_id
    FROM evaluation
    GROUP BY employee_id
) most_recent_eval
    ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation
    ON evaluation.employee_id = employee.id
    AND evaluation.report_date = most_recent_eval.report_date
Are there problems with this that I don't know about? Is this doing 2 table scans (one to find the max, and one to find the row)? Does it have to do 2 full scans for every employee?
The reason I'm asking is that I am now looking at joining on 3 tables where I need the most recent row (evaluations, security clearance, and project) and it seems like any inefficiencies are going to be massively multiplied.
Can anyone give me some advice on this?

You should be in pretty good shape with the query pattern you propose.
One possible suggestion that will help if your evaluation table has its own auto-incrementing id column: you may be able to find the latest evaluation for each employee with this subquery:
SELECT employee_id, MAX(id) id
FROM evaluation
GROUP BY employee_id
Then your join can look like this:
FROM employee
LEFT JOIN (
    SELECT employee_id, MAX(id) id
    FROM evaluation
    GROUP BY employee_id
) most_recent_eval ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation ON most_recent_eval.id = evaluation.id
This will work if your id values and your report_date values in your evaluation table are in the same order. Only you know if that's the case in your application. But if it is, this is a very helpful optimization.
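If you want to verify that before relying on it, a quick sanity check is possible. This sketch counts evaluation pairs whose id order disagrees with their report_date order within the same employee; a result of 0 means the optimization is safe:
-- Pairs where a larger id has an earlier report_date for the same employee:
SELECT COUNT(*) AS out_of_order
FROM evaluation e1
JOIN evaluation e2
    ON e1.employee_id = e2.employee_id
    AND e1.id < e2.id
    AND e1.report_date > e2.report_date;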
Other than that, you may need to add some compound indexes to some tables to speed up your queries. Get the queries working correctly first. Read http://use-the-index-luke.com/. Remember that lots of single-column indexes are generally harmful to MySQL query performance unless they're chosen to accelerate particular queries.
If you create a compound index on (employee_id, report_date), this subquery
SELECT MAX(report_date) report_date, employee_id
FROM evaluation
GROUP BY employee_id
can be satisfied with an astonishingly efficient loose index scan. Similarly, if you're using InnoDB, the query
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
can be satisfied by a loose index scan on a single-column index on employee_id, because InnoDB implicitly appends the primary key column to every secondary index. (If you're using MyISAM, you need an explicit compound index on (employee_id, id).)
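For reference, the indexes discussed above could be created like this (the index names are illustrative):
-- Enables the loose index scan for MAX(report_date) per employee:
ALTER TABLE evaluation ADD INDEX idx_employee_date (employee_id, report_date);
-- On InnoDB, the (employee_id) prefix of that index also serves the MAX(id)
-- subquery, since the primary key is appended implicitly. On MyISAM you
-- would instead need:
-- ALTER TABLE evaluation ADD INDEX idx_employee_id (employee_id, id);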

Related

Where to set IN criteria (Parent table or Child table) in mysql

We have two tables (Customer and Order). Is there any performance difference between the two queries given below?
The Customer table has customer details (customerId, customerdetails).
The Order table has the order details of the customer (orderId, customerId, orderdetails), i.e. customerId is duplicated here, is not nullable, and is a foreign key with ON DELETE CASCADE.
e.g.:
SELECT * FROM `Order` WHERE customerId IN (1, 2, 3, 4, ...)
or
SELECT `Order`.* FROM `Order`
INNER JOIN Customer ON `Order`.customerId = Customer.customerId
WHERE Customer.customerId IN (1, 2, 3, 4, ...)
The first one does less work: it only references one table. Therefore it's faster.
The second one has an INNER JOIN. That means it must check each Order row to ensure it has a matching Customer row. (If a Customer is DELETEd, her orders won't appear in the result set.) Rows that don't match must not appear in the result set. Doing that check takes at least some work. But if your tables are small you'll probably be unable to measure any significant difference.
You can investigate this yourself by prefixing EXPLAIN to each query. It tells you how the query planner module will satisfy the query. Your second query will have two EXPLAIN rows.
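For example (assuming the table names above; Order must be backtick-quoted because it is a reserved word):
EXPLAIN SELECT * FROM `Order` WHERE customerId IN (1, 2, 3, 4);
EXPLAIN SELECT `Order`.* FROM `Order`
    INNER JOIN Customer ON `Order`.customerId = Customer.customerId
    WHERE Customer.customerId IN (1, 2, 3, 4);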
Indexes will help. Your first query will benefit if you create this one:
ALTER TABLE `Order` ADD INDEX customerId (customerId);
Welcome to Stack Overflow. When you have other query-optimization questions, you probably should read this.
A General Rule: One query involving two tables and a JOIN will be faster than two queries where you are handing ids from the first to the second. And it is less code (once you are familiar with Joins).
Why? A roundtrip from the client to the server takes some effort. Two roundtrips is slower than one.
Does it matter? Not a lot. Usually, the difference is millisecond(s), maybe less.
Always? No. That is why I said "General Rule".
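To make the rule concrete, here is a sketch of the two approaches; the filter on customerdetails is a hypothetical example:
-- Two roundtrips: fetch the ids client-side, then hand them to a second query.
SELECT customerId FROM Customer WHERE customerdetails LIKE 'A%';
SELECT * FROM `Order` WHERE customerId IN (1, 2, 3, 4);
-- One roundtrip: a single JOIN does both steps on the server.
SELECT `Order`.*
FROM Customer
INNER JOIN `Order` ON `Order`.customerId = Customer.customerId
WHERE Customer.customerdetails LIKE 'A%';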

Is there a way to avoid sorting for a query with WHERE and ORDER BY?

I am looking to understand how a query with both WHERE and ORDER BY can be indexed properly. Say I have a query like:
SELECT *
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3
With an index on date_created, it seems like the execution plan will prefer to use the PRIMARY key and then sort the results itself. This seems to be very slow when it needs to sort a large amount of results.
I was reading through this guide on indexing for ordered queries which mentions an almost identical example and it mentions:
If the database uses a sort operation even though you expected a pipelined execution, it can have two reasons: (1) the execution plan with the explicit sort operation has a better cost value; (2) the index order in the scanned index range does not correspond to the order by clause.
This makes sense to me but I am unsure of a solution. Is there a way to index my particular query and avoid an explicit sort or should I rethink how I am approaching my query?
The Optimizer is caught between a rock and a hard place.
Plan A: Use an index starting with id; collect however many rows that is; sort them; then deliver only 3. The downside: If the list is large and the ids are scattered, it could take a long time to find all the candidates.
Plan B: Use an index starting with date_created filtering on id until it gets 3 items. The downside: What if it has to scan all the rows before it finds 3.
If you know that the query will always work better with one query plan than the other, you can use an "index hint". But, when you get it wrong, it will be a slow query.
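For the record, an index hint looks like this; the index name idx_date_created is an assumption, and whether forcing it helps is exactly the judgment call described above:
SELECT *
FROM users FORCE INDEX (idx_date_created)  -- push the optimizer toward Plan B
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3;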
A partial answer... If * contains bulky columns, both approaches may be hauling around stuff that will eventually be tossed. So, let's minimize that:
SELECT u.*
FROM (
    SELECT id
    FROM users
    WHERE id IN (1, 2, 3, 4)
    ORDER BY date_created
    LIMIT 3               -- not repeated
) AS x
JOIN users AS u USING(id)
ORDER BY date_created;    -- repeated
Together with
INDEX(date_created, id),
INDEX(id, date_created)
Hopefully, the Optimizer will pick one of those "covering" indexes to perform the "derived table" (subquery). If so, that will be performed fairly efficiently. Then the JOIN will look up the rest of the columns for the 3 desired rows.
If you want to discuss further, please provide
SHOW CREATE TABLE.
How many ids you are likely to have.
Why you are not already JOINing to another table to get the ids.
Approximately how many rows in the table.
Your best bet might be to write this in a more complicated way:
SELECT u.*
FROM ( (SELECT u.*
        FROM users u
        WHERE id = 1
        ORDER BY date_created
        LIMIT 3
       ) UNION ALL
       (SELECT u.*
        FROM users u
        WHERE id = 2
        ORDER BY date_created
        LIMIT 3
       ) UNION ALL
       (SELECT u.*
        FROM users u
        WHERE id = 3
        ORDER BY date_created
        LIMIT 3
       ) UNION ALL
       (SELECT u.*
        FROM users u
        WHERE id = 4
        ORDER BY date_created
        LIMIT 3
       )
     ) u
ORDER BY date_created
LIMIT 3;
Each of the subqueries will now use an index on users(id, date_created). The outer query is then sorting at most 12 rows, which should be trivial from a performance perspective.
You could create a composite index on (id, date_created) - that will give the engine the option of using an index for both steps - but the optimiser may still choose not to.
If there aren't many rows in your table, or it thinks the result set will be small, it's quicker to sort after the fact than it is to traverse the index tree.
If you really think you know better than the optimiser (which you don't), you can use index hints to tell it what to do, but this is almost always a bad idea.

Choosing the optimal index for SQL Query

I'm doing some problem sets in my database management course and I can't figure this specific problem out.
We have the following relation:
Emp (id, name, age, sal, ...)
And the following query:
SELECT id
FROM Emp
WHERE age > (select max(sal) from Emp);
We are then supposed to choose indexes that would help the query optimizer. My answer would be to just use Emp(age), but the solution to the question is
Emp(age)
&
Emp(sal)
How come there are 2 indices? I can't seem to wrap my head around why you would need more than the age attribute.
Of course, you realize that the query is nonsensical, comparing age to sal (which is presumably a salary). That said, two indexes are appropriate for:
SELECT e.id
FROM Emp e
WHERE e.age > (select max(e2.sal) from Emp e2);
I added table aliases to emphasize that the query is referring to the Emp table twice.
To get the maximum sal from the table, you want an index on emp(sal). The maximum is a simple index lookup operation.
Then you want to compare this to age. Well, for a comparison to age, you want an index on emp(age). This is an entirely separate reference to emp that has no reference to sal, so you cannot put the two columns in a single index.
The index on age may not be necessary. The query may be returning lots of rows, and queries that return lots of rows don't generally benefit from a secondary index. The one case where it can benefit from the index is if age is the clustered index (that is, typically the first column in the primary key). However, I wouldn't recommend such an indexing structure.
You need both indexes to get optimal performance:
1) The subquery (select max(sal) from Emp) will benefit from indexing Emp(sal), because with a tree index, retrieving the max is much quicker.
2) The outer query needs to run a filter on Emp(age), so that also benefits from a tree index.
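In SQL, the two indexes would be (names illustrative):
CREATE INDEX idx_emp_sal ON Emp (sal);  -- makes SELECT MAX(sal) a simple index lookup
CREATE INDEX idx_emp_age ON Emp (age);  -- supports the range filter on age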

Optimize MySQL indexes for a multiple-table JOIN

I have 5 tables in MySQL, and when I execute my query it takes too long.
Here is the structure of my tables:
Reciept (row count: 23,799,640)
reciept_goods (row count: 39,398,989)
good (row count: 17,514)
good_categories (row count: 121)
retail_category (row count: 10)
[screenshots of each table's structure]
My Indexes:
Date --> reciept.date (#1)
reciept_goods_index --> reciept_goods.recieptId (#1), reciept_goods.shopId (#2), reciept_goods.goodId (#3)
category_id --> good.category_id (#1)
Here is my SQL query:
SELECT R.shopId,
       sales,
       SUM(Amount) AS sum_amount,
       COUNT(DISTINCT R.id) AS count_reciept,
       RC.id,
       RC.name
FROM reciept R
JOIN reciept_goods RG ON R.id = RG.RecieptId
                     AND R.ShopID = RG.ShopId
JOIN good G ON RG.GoodId = G.id
JOIN good_categories GC ON G.category_id = GC.id
JOIN retail_category RC ON GC.retail_category_id = RC.id
WHERE R.date >= '2018-01-01 10:00:00'
GROUP BY R.shopId, R.sales, RC.id
EXPLAIN for this query gives the following result:
[EXPLAIN output screenshot]
and execution time = 236 sec.
If I use straight_join good ON (good.id = reciept_goods.GoodId), i.e.
SELECT STRAIGHT_JOIN ... rest of query
then EXPLAIN gives:
[EXPLAIN output screenshot]
and execution time = 31 sec.
I think the problem is in the indexes of my tables, but I don't understand how to fix them. Can someone help me?
With about 2% of the rows in reciept matching the date filter, the 2nd execution plan (with straight_join) seems to be the right join order. You should be able to optimize it by adding the following covering indexes:
reciept(date, sales)
reciept_goods(recieptId, shopId, goodId, amount)
I assume that the column order in your primary key for reciept_goods is currently (goodId, recieptId, shopId) (or (goodId, shopId, recieptId)). You could change that to (recieptId, shopId, goodId) - judging by the table name, you may have wanted that order anyway - in which case you do not need the 2nd index (at least for this query). I would assume that this primary key made MySQL take the slower execution plan (of course, assuming it expected that plan to be faster) - although sometimes it's just bad statistics, especially on a test server.
With those covering indexes, MySQL should take the faster explain plan even without straight_join, if it doesn't, just add it again (although I would like a look at both executions plans then). Also check that those two new indexes are used in the explain plan, otherwise I may have missed a column.
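As DDL, those two covering indexes would look like this (index names are illustrative):
ALTER TABLE reciept ADD INDEX idx_date_sales (date, sales);
ALTER TABLE reciept_goods ADD INDEX idx_reciept_shop_good_amount (recieptId, shopId, goodId, amount);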
It looks like you are depending on walking through a couple of many:many tables. Many people design them inefficiently.
Here I have compiled a list of 7 tips on making mapping tables more efficient. The most important is use of composite indexes.
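A minimal sketch of that composite-index pattern for a many:many mapping table, with illustrative names and types:
CREATE TABLE reciept_goods_map (
    recieptId INT UNSIGNED NOT NULL,
    goodId    SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (recieptId, goodId),  -- covers lookups from reciept to goods
    INDEX (goodId, recieptId)         -- covers lookups from good to reciepts
) ENGINE=InnoDB;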

Join vs subquery to count nested objects

Let's say my model contains 2 tables: persons and addresses. One person can have 0, 1 or more addresses. I'm trying to execute a query that lists all persons and includes the number of addresses they have. Here are the 2 queries that I have to achieve that:
SELECT
persons.*,
count(addresses.id) AS number_of_addresses
FROM `persons`
LEFT JOIN addresses ON persons.id = addresses.person_id
GROUP BY persons.id
and
SELECT
persons.*,
(SELECT COUNT(*)
FROM addresses
WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM `persons`
And I was wondering if one is better than the other in terms of performance.
The way to determine performance characteristics is to actually run the queries and see which is better.
If you have no indexes, then the first is probably better. If you have an index on addresses(person_id), then the second is probably better.
The reason is a little complicated. The basic reason is that GROUP BY (in MySQL) uses a sort, and sorts are O(n * log(n)) in complexity. So the time to do a sort grows faster than the data itself (not much faster, but a bit faster). The consequence is that a bunch of small aggregations, one per person, is faster than one aggregation by person over all the data.
That is conceptual. In fact, MySQL will use the index for the correlated subquery, so it is often faster than the overall group by, which does not make use of an index.
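That index, if it doesn't exist yet, is simply (name illustrative):
CREATE INDEX idx_addresses_person ON addresses (person_id);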
I think the first query is optimal, and more optimization can be achieved by changing the table structure. For example, define both the person_id and address_id fields (order is important) as the primary key in the addresses table, to make joins faster.
MySQL's table storage structure is an index-organized table (clustered index), so a primary key lookup is much faster than a normal index lookup, especially in a join operation.
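A sketch of that suggested layout, assuming addresses otherwise carries only a details column:
CREATE TABLE addresses (
    person_id  INT NOT NULL,
    address_id INT NOT NULL,
    details    VARCHAR(255),
    PRIMARY KEY (person_id, address_id)  -- clusters each person's addresses together
) ENGINE=InnoDB;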