MySql: order by along with group by - performance - mysql

I have the performance problem with query that have order by and group by. I have checked similar problems on SO but I did not find the solution to this:(
I have something like this in my db schema:
pattern has many pattern_file belongs to project_template which belongs to project
Now I want to get projects filtered by some data(additional tables that I join) and want to get the result ordered for example by projects.priority and grouped by patterns.id. I have tried many things and to get the desired result I've figured out this query:
SELECT DISTINCT `projects`.* FROM `projects`
INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
INNER JOIN (SELECT DISTINCT projects.id FROM `projects` INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
WHERE [here my conditions] ORDER BY [here my order]) P
ON P.id = projects.id
WHERE [here my conditions]
GROUP BY patterns.id
ORDER BY [here my order]
From my research I have to INNER JOIN with subquery to conquer the problem "ORDER BY before GROUPing BY" => then I have put the same conditions on the outer query for performance purpose. The order by I had to use again in the outer query too, otherwise the result will be sorted by default.
Now there is real performance problem as I have about 6k projects and when I run this query without any conditions it takes about 15s :/ When I narrow the result by specify the conditions the time drastically dropped down. I've found somewhere that the subquery is run for every outer query row result which could be true when you watch at the execution time :/
Could you please give some advice how I can optimize the query? I do not work much with sql so maybe I do it from the wrong side from the very beginning?
P.S. I have tried WHERE projects.id IN (Select project.id FROM projects ....) and that discarded the performance issue but also discarded the ORDER BY before GROUPing BY
EDIT.
I want to retrieve list of projects, but I want also to filter it and order, and finally I want to get patterns.id unique(that is why I use the group by).
order by in your inner query (p) doesn't make sense (any inner sort will only
have an arbitrary effect).
#Solarflare Unfortunately it does. group by will take first row from grouped result. It preserve the order for join. Well, I believe that it is specific to MySql. Furthermore to keep the order from subquery I could use ORDER BY NULL in outer query :-)
Also, select projects.* ... group by pattern.id is fishy (although MySQL, in contrast to every other dbms, allows you to do this)
so we can assume I retrieve only projects.id, but from docs:
MySQL extends the use of GROUP BY to permit selecting fields that are not mentioned in the GROUP BY clause

Related

MySQL Left-Join vs Join with subquery return different results

I wrote 2 queries expecting them to yield same results, yet they turned out to be different.
I would like to ask why they return different results?
I am more confident that the 1st query returns what I want, so how should I amend the 2nd query? Thx!
1st SQL query:
SELECT
Product.*,
Status.*,
Price.*
FROM Product
LEFT JOIN Status
ON Product.MarketplaceId = Status.ListingId
LEFT JOIN Price
ON Product.ProductId = Price.Id
LIMIT 15;
2nd SQL query:
SELECT
Product.*,
Status.*,
Price.*
FROM Product
LEFT JOIN Status
ON Product.MarketplaceId IN
(
SELECT ListingId FROM Status
)
LEFT JOIN Price
ON Product.ProductId IN
(
SELECT Id FROM Price
)
LIMIT 15;
Without seeing the data, I don't understand why a different result. However, if you intend to have only one per product, I would change the second query to use DISTINCT. If the subquery returns multiple rows for whatever the condition, it will return that many rows even if a single product.
Don't use IN ( SELECT ... ) if there is an alternative; it is often slower.
Don't use LEFT JOIN if the matching row in the 'right' table will always be there. It confuses readers.
The reason for different results is the lack of ORDER BY. (As Akina mentioned.) Removing the LIMIT would probably cause the two queries to deliver all the same rows, though probably in a different order.

mysql joining efficiency - join with where then join with something else

I have a query that looks like this:
select `adverts`.*
from `adverts`
inner join `advert_category` on `advert_category`.`advert_id` = `adverts`.`id`
inner join `advert_location` on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
and `advert_category`.`category_id` = ?
order by `updated_at` desc
The problem here is I have a huge database and this response is absolutely ravaging my database.
What I really need is to do the first join, and then do there where clause. This will whittle down my response from like 100k queries to less than 10k, then I want to do the other join, in order to whittle down the responses again so I can get the advert_location on the category items.
Doing it as is just isn't viable.
So, how do I go about using a join and a where condition, and then after getting that response doing a further join with a where condition?
Thanks
This is your query, written a bit simpler so I can read it:
select a.*
from adverts a inner join
advert_category ac
on ac.advert_id = a.id inner join
advert_location al
on al.advert_id = a.id
where al.location_id = ? and
ac.category_id = ?
order by a.updated_at desc;
I am speculating that advert_category and advert_locations have multiple rows per advert. In that case, you are getting a Cartesian product for each advert.
A better way to write the query uses exists:
select a.*
from adverts a
where exists (select 1
from advert_location al
where al.advert_id = a.id and al.location_id = ?
) and
exists (select 1
from advert_category ac
where ac.advert_id = a.id and ac.category_id = ?
)
order by a.updated_at desc;
For this version, you want indexes on advert_location(advert_id, location_id), advert_category(advert_id, category_id), and probably advert(updated_at, id).
You can write the 1st join in a Derived Table including a WHERE-condition and then do the 2nd join (but a decent optimizer might resolve the Derived Table again and do what he thinks is best based on statistics):
select adverts.*
from
(
select `adverts`.*
from `adverts`
inner join `advert_category`
on `advert_category`.`advert_id` =`adverts`.`id`
where `advert_category`.`category_id` = ?
) as adverts
inner join `advert_location`
on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
order by `updated_at` desc
MySQL will reorder inner joins for you during optimization, regardless of how you wrote them in your query. Inner join is the same in either direction (in algebra this is called commutative), so this is safe to do.
You can see the result of join reordering if you use EXPLAIN on your query.
If you don't like the order MySQL chose for your joins, you can override it with this kind of syntax:
from `adverts`
straight_join `advert_category` ...
https://dev.mysql.com/doc/refman/5.7/en/join.html says:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order.
Once the optimizer has decided on the join order, it always does one join at a time, in that order. This is called the nested join method.
There isn't really any way to "do the join then do the where clause". Conditions are combined together when looking up rows for joined tables. But this is a good thing, because you can then create a compound index that helps match rows based on both join conditions and where conditions.
PS: When asking query optimization question, you should include the EXPLAIN output, and also run SHOW CREATE TABLE <tablename> for each table, and include the result. Then we don't have to guess at the columns and indexes in your table.

Best way to write this query?

I am doing a sub-query join to another table as I wanted to be able to sort the results I got back with it, I only need the first row but I need them ordered in a certain way so I would get the lowest id.
I tried adding LIMIT 1 to this but then the full query returned 0 results; so now it has no limit and in the EXPLAIN I have two rows showing they are using the full 10k+ rows of the auction_media table.
I wrote it this way to avoid having to query the auction_media table for each row separately, but now I'm thinking that this way isn't that great if it has to use the whole auction_media table?
Which way is better? The way I have it or querying the auction_media table separately? ...or is there a better way!?
Here is the code:
SELECT
a.auction_id,
a.name,
media.media_url
FROM
auctions AS a
LEFT JOIN users AS u ON u.user_id=a.owner_id
INNER JOIN ( SELECT media_id,media_url,auction_id
FROM auction_media
WHERE media_type=1
AND upload_in_progress=0
ORDER BY media_id ASC
) AS media
ON a.auction_id=media.auction_id
WHERE a.hpfeat=1
AND a.active=1
AND a.approved=1
AND a.closed=0
AND a.creation_in_progress=0
AND a.deleted=0
AND (a.list_in='auction' OR u.shop_active='1')
GROUP BY a.auction_id;
Edit: Through my testing, using the above query seems like it would be the much faster method overall; however I worry if that will still be the case when the auction_media table grows to like 1M rows or something.
edit: As stated in the comments - DISTINCT is not required because the auctions table can only be associated with (at most) one user table row and one row in the inner query.
You may want to try this. The outer query's GROUP BY is replaced with DISTINCT since you don't have any aggregate function. The inner query, was replaced by a query to find the smallest media_id per auction_id, then JOINed back to get the media_url. (Since I didn't know if the media_id and auction_id were a composite unique key, I used the same WHERE clause to help eliminate potential duplicates.)
SELECT
a.auction_id,
a.name,
auction_media.media_url
FROM auctions AS a
LEFT JOIN users AS u
ON u.user_id=a.owner_id
INNER JOIN (SELECT auction_id, MIN(media_id) AS media_id
FROM auction_media
WHERE media_type=1
AND upload_in_progress=0
GROUP BY auction_id) AS media
ON a.auction_id=media.auction_id
INNER JOIN auction_media
ON auction_media.media_id = media.media_id
AND auction_media.auction_id = media.auction_id
AND auction_media.media_type=1
AND auction_media.upload_in_progress=0
WHERE a.hpfeat=1
AND a.active=1
AND a.approved=1
AND a.closed=0
AND a.creation_in_progress=0
AND a.deleted=0
AND (a.list_in='auction' OR u.shop_active='1');

Mysql subquery order when using IN

Does the order of the results inside a mysql subquery affect the order of the actual query?
I tried it but did not came to a real result cause sometimes it seemed so and sometimes it doesn't.
eg:
SELECT name FROM people WHERE pid IN (SELECT mid FROM member ORDER BY mdate)
Is the "order by"-clause going to affect the order of the results in this case?
Thanks.
No it cant and if you want to change the order as per your need then better use a JOIN
Something like this:-
select name
from people p inner join member m on p.pid = m.mid
order by p.name
Your outer query doesn't have ORDER BY; thus, order is not guaranteed.
I guess the only part which may be affected in this particular case is optimizer which might generate a different execution plan depends on how results of subquery are sorted...
Irrespective of whether or not the outer query results depend on the order by clause in the sub-query, one should never depend on the order. If you need any particular order of outer query results, you should explicitly use order by clause on the outer query. AFAK, it makes sense to use order by clause in a sub-query only if you have to use TOP clause in SELECT clause of the sub-query.
It really can't. The data is coming from the from clause. Your subquery is in the where clause. It is just used to filter the rows. If you want the ordering:
select p.name
from people p join
(select member, min(mdate) as minmdate
from member
group by member
) m
on p.pid = m.mid
order by minmdate;
That is, join in the results between the two tables. I am assuming that member could have duplicates, and you want the earliest date associated with each member.

SQL Nested query interpereted as correlated incorrectly

I've got a serious problem with a nested query, which I suspect MySQL is interpreting as a correlated subquery when in fact it should be uncorrelated. The query spans two tables, one being a list of products and the other being their price at various points in time. My aim is to return each price record for products that have a price range above a certain value for the whole time. My query looks like this:
SELECT oP.id, oP.title, oCR.price, oC.timestamp
FROM Crawl_Results AS oCR
JOIN Products AS oP
ON oCR.product = oP.id
JOIN Crawls AS oC
ON oCR.crawl = oC.id
WHERE oP.id
IN (
SELECT iP.id
FROM Products AS iP
JOIN Crawl_Results AS iCR
ON iP.id = iCR.product
WHERE iP.category =2
GROUP BY iP.id
HAVING (
MAX( iCR.price ) - MIN( iCR.price )
) >1
)
ORDER BY oP.id ASC
Taken alone, the inner query executes fine and returns a list of the id's of the products with a price range above the criterion. The outer query also works fine if I provide a simple list of ids in the IN clause. When I run them together however, the query takes ~3min to return ~1500 rows, so I think it's executing the inner query for every row of the outer, which is not ideal. I did have the columns aliased the same in the inner and outer queries, so I thought that aliasing them differently in the inner and outer as above would fix it, but it didn't.
Any ideas as to what's going on here?
MySQL might think it could use indexes to execute the query faster by running it once for every OP.id. The first thing to check is if your statistics are up to date.
You could rewrite the where ... in as a filtering inner join. This is less likely to be "optimized" for seeks:
SELECT *
FROM Crawl_Results AS oCR
JOIN Products AS oP
ON oCR.product = oP.id
JOIN Crawls AS oC
ON oCR.crawl = oC.id
JOIN (
SELECT iP.id
FROM Products AS iP
JOIN Crawl_Results AS iCR
ON iP.id = iCR.product
WHERE iP.category =2
GROUP BY
iP.id
HAVING (MAX(iCR.price) - MIN(iCR.price)) > 1
) filter
ON OP.id = filter.id
Another option is to use a temporary table. You store the result of the subquery in a temporary table and join on that. That really forces MySQL not to execute the subquery as a correlated query.