Is it true that SUBSELECTs are less performant than JOINs?
I got this query
SELECT categories_id,
products_id
FROM products_to_categories a
WHERE date_added = (
SELECT MIN(date_added)
FROM products_to_categories b
WHERE a.products_id = b.products_id
)
AND categories_id != 0
GROUP BY products_id
and would like to change it into a query with JOIN.
Is it true that SUBSELECTs are less performant than JOINs?
Possibly. This depends entirely on the query in question. Many constructs that are frequently implemented with a subquery, which can just as easily be achieved with a join, are actually executed as a join internally by the query optimizer... in database systems with an enterprise grade query optimizer, like SQL Server and Oracle. MySQL's query optimizer is notably worse at these kinds of optimizations, you'd have to look into the explain output to see whether it is smart enough for your specific case or not. It could even decide not to apply this optimization even if it sees it, just because system load is sufficiently low that optimization would be slower than just executing the slower version.
Even if it is executed as a subquery, it depends on the query itself and the system load. A subquery might cause a quicker lock escalation, potentially causing table locks and thus slower execution in the case of more simultaneous queries on the same table. Without concurrency, extra locks don't cause noticeable extra slowdowns.
In general, try to use joins whenever possible instead of subqueries, but don't overdo it - subqueries usually perform perfectly fine and the query optimizer will do a good job of keeping the server alive. But also keep in mind that MySQL isn't exactly an 'enterprise grade RDBMS' and as such can be rather dumb in its optimizations.
SELECT DISTINCT a.products_id,
b.MinDate
FROM products_to_categories a
JOIN (SELECT b.products_id,
MIN(b.date_added) AS MinDate
FROM products_to_categories b
GROUP BY b.products_id ) AS B
ON a.products_id = b.products_id
AND a.date_added = b.MinDate
WHERE a.categories_id != 0
Switching this to a join without a subquery or aggregation is not obvious.
The idea is to do a left outer join with a condition on the date_added condition. When this condition does not match, then you have the minimum:
SELECT categories_id, products_id
FROM products_to_categories a left outer join
products_to_categories b
on a.products_id = b.products_id and
b.date_added < a.date_added
WHERE b.date_added is null and a.categories_id != 0;
Select products_to_catergoriesa.categories_id,
products_to_catergoriesa.products_id, min(products_to_categories b.date_added)
from products_to_categories a
join products_to_categories b
on products_to_categories b.products_id = products_to_categories a.product_id
where [table_name_here].catergory_id !=0
Yes, subqueries are more process-intensive because every query around the subquery needs to wait until that subquery is finished processing. This is not necessarily the case with Joins.
Do you need help with the syntax of Joins? Or was my answer all you needed?
Here's what you're looking for:
SELECT a.categories_id,
a.products_id
FROM products_to_categories a
LEFT JOIN products_to_categories b
ON a.products_id = b.products_id
WHERE a.date_added = MIN(b.date_added)
AND a.categories_id != 0
GROUP BY a.products_id, a.categories_id
Related
I've a query like below,
SELECT
c.testID,
FROM a
INNER JOIN b ON a.id=b.ID
INNER JOIN c ON b.r_ID=c.id
WHERE c.test IS NOT NULL;
Can this query be optimized further?, I want inner join between three tables to happen only if it meets the where clause.
Where clause works as filter on the data what appears after all JOINs,
whereas if you use same restriction to JOIN clause itself then it will be optimized in sense of avoiding filter after join. That is, join on filtered data instead.
SELECT c.testID,
FROM a
INNER JOIN b ON a.id = b.ID
INNER JOIN c ON b.r_ID = c.id AND c.test IS NOT NULL;
Moreover, you must create an index for the column test in table c to speed up the query.
Also, learn EXPLAIN command to the queries for best results.
Try the following:
SELECT
c.testID
FROM c
INNER JOIN b ON c.test IS NOT NULL AND b.r_ID=c.testID
INNER JOIN a ON a.id=b.r_ID;
I changed the order of the joins and conditions so that the first statement to be evaluated is c.test IS NOT NULL
Disclaimer: You should use the explain command in order to see the execution.
I'm pretty sure that even the minor change I just did might have no difference due to the MySql optimizer that work on all queries.
See the MySQL Documentation: Optimizing Queries with EXPLAIN
Three queries Compared
Have a look at the following fiddle:
https://www.db-fiddle.com/f/fXsT8oMzJ1H31FwMHrxR3u/0
I ran three different queries and in the end, MySQL optimized and ran them the same way.
Three Queries:
EXPLAIN SELECT
c.testID
FROM c
INNER JOIN b ON c.test IS NOT NULL AND b.r_ID=c.testID
INNER JOIN a ON a.id=b.r_ID;
EXPLAIN SELECT c.testID
FROM a
INNER JOIN b ON a.id = b.r_id
INNER JOIN c ON b.r_ID = c.testID AND c.test IS NOT NULL;
EXPLAIN SELECT
c.testID
FROM a
INNER JOIN b ON a.id=b.r_ID
INNER JOIN c ON b.r_ID=c.testID
WHERE c.test IS NOT NULL;
All tables should have a PRIMARY KEY. Assuming that id is the PRIMARY KEY for the tables that it is in, then you need these secondary keys for maximal performance:
c: INDEX(test, test_id, id) -- `test` must be first
b: INDEX(r_ID)
Both of those are useful and "covering".
Another thing to note: b and a is virtually unused in the query, so you may as well write the query this way:
SELECT c.testID,
FROM c
WHERE c.test IS NOT NULL;
At that point, all you need is INDEX(test, testID).
I suspect you "simplified" your query by leaving out some uses of a and b. Well, I simplified it from there, just as the Optimizer should have done. (However, elimination of tables is an optimization that it does not do; it figures that is something the user would have done.)
On the other hand, b and a are not totally useless. The JOIN verify that there are corresponding rows, possibly many such rows, in those tables. Again, I think you had some other purpose.
I have a query that looks like this:
select `adverts`.*
from `adverts`
inner join `advert_category` on `advert_category`.`advert_id` = `adverts`.`id`
inner join `advert_location` on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
and `advert_category`.`category_id` = ?
order by `updated_at` desc
The problem here is I have a huge database and this response is absolutely ravaging my database.
What I really need is to do the first join, and then do there where clause. This will whittle down my response from like 100k queries to less than 10k, then I want to do the other join, in order to whittle down the responses again so I can get the advert_location on the category items.
Doing it as is just isn't viable.
So, how do I go about using a join and a where condition, and then after getting that response doing a further join with a where condition?
Thanks
This is your query, written a bit simpler so I can read it:
select a.*
from adverts a inner join
advert_category ac
on ac.advert_id = a.id inner join
advert_location al
on al.advert_id = a.id
where al.location_id = ? and
ac.category_id = ?
order by a.updated_at desc;
I am speculating that advert_category and advert_locations have multiple rows per advert. In that case, you are getting a Cartesian product for each advert.
A better way to write the query uses exists:
select a.*
from adverts a
where exists (select 1
from advert_location al
where al.advert_id = a.id and al.location_id = ?
) and
exists (select 1
from advert_category ac
where ac.advert_id = a.id and ac.category_id = ?
)
order by a.updated_at desc;
For this version, you want indexes on advert_location(advert_id, location_id), advert_category(advert_id, category_id), and probably advert(updated_at, id).
You can write the 1st join in a Derived Table including a WHERE-condition and then do the 2nd join (but a decent optimizer might resolve the Derived Table again and do what he thinks is best based on statistics):
select adverts.*
from
(
select `adverts`.*
from `adverts`
inner join `advert_category`
on `advert_category`.`advert_id` =`adverts`.`id`
where `advert_category`.`category_id` = ?
) as adverts
inner join `advert_location`
on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
order by `updated_at` desc
MySQL will reorder inner joins for you during optimization, regardless of how you wrote them in your query. Inner join is the same in either direction (in algebra this is called commutative), so this is safe to do.
You can see the result of join reordering if you use EXPLAIN on your query.
If you don't like the order MySQL chose for your joins, you can override it with this kind of syntax:
from `adverts`
straight_join `advert_category` ...
https://dev.mysql.com/doc/refman/5.7/en/join.html says:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order.
Once the optimizer has decided on the join order, it always does one join at a time, in that order. This is called the nested join method.
There isn't really any way to "do the join then do the where clause". Conditions are combined together when looking up rows for joined tables. But this is a good thing, because you can then create a compound index that helps match rows based on both join conditions and where conditions.
PS: When asking query optimization question, you should include the EXPLAIN output, and also run SHOW CREATE TABLE <tablename> for each table, and include the result. Then we don't have to guess at the columns and indexes in your table.
So let's say I have the following tables Person and Wage. It's a 1-N relation, where a person can have more then one wage.
**Person**
id
name
**Wage**
id
person_id
amount
effective_date
Now, I want to query a list of all persons and their latest wages. I can get the results by doing the following query:
SELECT
p.*,
( SELECT w.amount
FROM wages a w
WHERE w.person_id = p.id
ORDER BY w.effective_date
LIMIT 1
) as wage_amount,
( SELECT w.effective_date
FROM wages a w
WHERE w.person_id = p.id
ORDER BY w.effective_date
LIMIT 1
) as effective_date
FROM person as p
The problem is, my query will have multiple sub-queries from different tables. I want to make it as efficient as possible. Is there an alternative to using sub-queries that would be faster and give me the same results?
Proper indexing would probably make your version work efficiently (that is, an index on wages(person_id, effective_date)).
The following produces the same results with a single subquery:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
(select person_id, max(effective_date) as maxdate
from wages
group by personid
) maxw
on maxw.person_id = p.id left outer join
wages w
on w.person_id = p.id and w.effective_date = maxw.maxdate;
And this version might make better us of indexes than the above version:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
wages w
on w.person_id = p.id
where not exists (select * from wages w2 where w2.effective_date > w.effective_date);
Note that these version will return multiple rows for a single person, when there are two "wages" with the same maximum effective date.
Subqueries can be a good solution like Sam S mentioned in his answer but it really depends on the subquery, the dbms you are using, and your indexes. See this question and answers for a good discussion on the performance of subqueries vs. joins: Join vs. sub-query
If performance is an issue for you, you must consider using the EXPLAIN command of your dbms. It will show you how the query is being built and where the bottlenecks are. Based on its results, you might consider rewriting your query some other way.
For instance, it was usually the case that a join would yield better performance, so you could rewrite your query according to this answer: https://stackoverflow.com/a/2111420/362298 and compare their performance.
Note that creating the right indexes will also make a big difference.
Hope it helps.
Subqueries are very efficient as long as you make sure you use indexes. Try running EXPLAIN on your query and see if it uses correct indexes
SELECT p.name, w.amount, MAX(w.effective_date) FROM Person p LEFT JOIN Wage
w ON w.person_id = p.id GROUP BY p.name
I didn't test this query.
It is common to use SELECT within SELECT to reduce the number of queries; but as I examined this leads to slow query (which is obviously harmful for mysql performance). I had a simple query as
SELECT something
FROM posts
WHERE id IN (
SELECT tag_map.id
FROM tag_map
INNER JOIN tags
ON tags.tag_id=tag_map.tag_id
WHERE tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6')
)
This leads to slow queries of "query time 3-4s; lock time about 0.000090s; with about 200 rows examined".
If I split the SELECT queries, each of them will be quite fast; but this will increase the number of queries which is not good at high concurrency.
Is it the usual situation, or something is wrong with my coding?
In MySQL, doing a subquery like this is a "correlated query". This means that the results of the outer SELECT depend on the result of the inner SELECT. The outcome is that your inner query is executed once per row, which is very slow.
You should refactor this query; whether you join twice or use two queries is mostly irrelevant. Joining twice would give you:
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags ON tags.tag_id = tag_map.tag_id
WHERE tags.tag IN ('tag1', ...)
For more information, see the MySQL manual on converting subqueries to JOINs.
Tip: EXPLAIN SELECT will show you how the optimizer plans on handling your query. If you see DEPENDENT SUBQUERY you should refactor, these are mega-slow.
You could improve it by using the following:
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags
ON tags.tag_id=tag_map.tag_id
WHERE <tablename>.tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6')
Just make sure you only select what you need and do not use *; also state in which table you have the tag column so you can substitute <tablename>
Join does filtering of results. First join will keep results having 1st ON condition satisfied and then 2nd condition gives final result on 2nd ON condition.
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags ON tags.tag_id = tag_map.tag_id AND tags.tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6');
You can see these discussions on stack overflow :
question1
question2
Join helps to decrease time complexity and increases stability of server.
Information for converting sub queries to joins:
link1
link2
link3
When I run this query mysql server cpu usages stays at 100% and chokes the server. What am I doing wrong?
SELECT *
FROM projects p, orders o, invoices i
WHERE p.project_state = 'product'
AND (
p.status = 'expired'
OR p.status = 'finished'
OR p.status = 'open'
)
AND p.user_id = '12'
AND i.projectid =0
GROUP BY i.invoiceid
LIMIT 0 , 30
You are including the orders table but not joining to it. This will make a full cross join that can potentially produce millions of rows.
Use EXPLAIN to find out the query plan. From that you can work out what indexes will be required. Those indexes will vastly improve performance.
Also you are not limiting the orders in any way.
You didn't put any joins on the tables. I believe by default that will do a cross join. That means if you have 1000 projects, 100,000 orders and 100,000 invoices the resultset will be 1,000,000,000,000 (1 trillion) records.
You probably want to put some inner joins between those tables.