I have five tables in my database: members, items, comments, votes, and countries. I want to get 10 items. I want the count of comments and votes for each item. I also want the member who submitted each item, and the country they are from.
After posting here and elsewhere, I started using subselects to get the counts, but this query is taking 10 seconds or more!
SELECT `items_2`.*,
       (SELECT COUNT(*)
        FROM `comments`
        WHERE (comments.Script = items_2.Id)
          AND (comments.Active = 1)) AS `Comments`,
       (SELECT COUNT(votes.Member)
        FROM `votes`
        WHERE (votes.Script = items_2.Id)
          AND (votes.Active = 1)) AS `votes`,
       `countrys`.`Name` AS `Country`
FROM `items` AS `items_2`
INNER JOIN `members` ON items_2.Member = members.Id AND members.Active = 1
INNER JOIN `members` AS `members_2` ON items_2.Member = members_2.Id
LEFT JOIN `countrys` ON countrys.Id = members.Country
GROUP BY `items_2`.`Id`
ORDER BY `Created` DESC
LIMIT 10
My question is whether this is the right way to do this, whether there's a better way to write this statement, or whether a whole different approach would be better. Should I run the subselects separately and aggregate the information?
Yes, you can rewrite the subqueries as aggregate joins (see below), but I am almost certain that the slowness is due to missing indices rather than to the query itself. Use EXPLAIN to see what indices you can add to make your query run in a fraction of a second.
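For instance, given the column names in your query, indexes along these lines would let the COUNT(*) subqueries be satisfied from the index alone (index names are mine; compare EXPLAIN before and after):

-- Hypothetical index names; columns taken from the query above.
ALTER TABLE `comments` ADD INDEX idx_comments_script_active (Script, Active);
ALTER TABLE `votes` ADD INDEX idx_votes_script_active (Script, Active);
-- Helps the ORDER BY ... LIMIT 10 stop early:
ALTER TABLE `items` ADD INDEX idx_items_created (Created);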
For the record, here is the aggregate join equivalent.
SELECT `items_2`.*,
       c.cnt AS `Comments`,
       v.cnt AS `votes`,
       `countrys`.`Name` AS `Country`
FROM `items` AS `items_2`
INNER JOIN `members` ON items_2.Member = members.Id AND members.Active = 1
INNER JOIN `members` AS `members_2` ON items_2.Member = members_2.Id
LEFT JOIN (
    SELECT Script, COUNT(*) AS cnt
    FROM `comments`
    WHERE Active = 1
    GROUP BY Script
) AS c ON c.Script = items_2.Id
LEFT JOIN (
    SELECT votes.Script, COUNT(*) AS cnt
    FROM `votes`
    WHERE Active = 1
    GROUP BY Script
) AS v ON v.Script = items_2.Id
LEFT JOIN `countrys` ON countrys.Id = members.Country
GROUP BY `items_2`.`Id`
ORDER BY `Created` DESC
LIMIT 10
However, because you are using LIMIT 10, you are almost certainly as well off (or better off) with the subqueries that you currently have than with the aggregate join equivalent I provided above for reference.
This is because a bad optimizer (and MySQL's is far from stellar) could, in the case of the aggregate join query, end up performing the COUNT(*) aggregation over the full contents of the comments and votes tables, only to wastefully throw away everything but 10 values (your LIMIT). Your original query, by contrast, looks at the strict minimum of the comments and votes tables from the start.
More precisely, using subqueries the way your original query does typically results in what are called nested loops with index lookups. Using aggregate joins typically results in merge or hash joins with index scans or table scans. The former (nested loops) are more efficient than the latter (merge and hash joins) when the number of loops is small (10 in your case). The latter become more efficient when the former would require too many loops (tens or hundreds of thousands or more), especially on systems with slow disks but lots of memory.
I run this complicated query through a Spring JPA repository.
My goal is to get all info from the site table, ordered by the severity of the events at each site.
This is my query:
SELECT alls.*
FROM sites AS alls
JOIN (
    SELECT DISTINCT ets.id
    FROM (
        SELECT s.id, et.`type`, et.severity_level, COUNT(et.`type`)
        FROM sites AS s
        JOIN users_sites AS us ON (s.id = us.site_id)
        JOIN users AS u ON (us.user_id = u.user_id)
        JOIN areas AS a ON (s.id = a.site_id)
        JOIN panels AS p ON (a.id = p.area_id)
        JOIN events AS e ON (p.id = e.panel_id)
        JOIN event_types AS et ON (e.event_type_id = et.id)
        WHERE u.user_id = "98765432-123a-1a23-123b-11a1111b2cd3"
        GROUP BY s.id, et.`type`, et.severity_level
        ORDER BY et.severity_level, COUNT(et.`type`) DESC
    ) AS ets
) AS etsd ON alls.id = etsd.id
The second select (the one with "distinct") returns site_ids ordered correctly by severity.
Note that each site has different event types and severities, and I paginate the result, so I need the DISTINCT.
The problem is - the main select doesn't keep this order.
Is there any way to keep the order in one complicated query?
Another related question - one of my ideas was making two queries:
1. The "select distinct" query that returns the order --> saved in a list "order list".
2. The main "sites" query (which becomes very simple) with WHERE id IN ("order list").
3. Order the second query's results in code by "order list".
I use the query every 10 seconds, so it is very sensitive on performance.
Which seems faster in this case - the original complicated query, or those two?
Any insight will be appreciated.
Thanks a lot.
A quirk of SQL's declarative, set-oriented syntax for us procedural programmers: ORDER BY clauses in subqueries are not carried through to the outer query, except sometimes by accident. If you want ordering at any query level, you must specify it at that level, or you will get unpredictable results. Query optimizers are usually smart enough to avoid wasting sort operations.
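A minimal illustration (table and column names hypothetical):

-- The inner ORDER BY is not guaranteed to survive into the outer result:
SELECT t.* FROM (SELECT * FROM events ORDER BY severity_level) AS t;
-- Restate the ordering at the level where you need it:
SELECT t.* FROM (SELECT * FROM events) AS t ORDER BY t.severity_level;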
Your requirement: return at most one sites row for each sites.id value, ordered by the worst event. Worst means the lowest event severity; if more than one event has the lowest severity, the largest count wins.
Use this sort of thing to get the "worst" for each id, in place of DISTINCT.
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
    /* your inner query, with COUNT(et.`type`) aliased as num */
) ets
GROUP BY id
This gives at most one row per sites.id value. Then your outer query is
SELECT alls.*
FROM sites alls
JOIN (
    SELECT id, MIN(severity_level) severity_level, MAX(num) num
    FROM (
        /* your inner query, with COUNT(et.`type`) aliased as num */
    ) ets
    GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
Putting it all together:
SELECT alls.*
FROM sites alls
JOIN (
    SELECT id, MIN(severity_level) severity_level, MAX(num) num
    FROM (
        SELECT s.id, et.severity_level, COUNT(et.`type`) num
        FROM sites AS s
        JOIN users_sites AS us ON (s.id = us.site_id)
        JOIN users AS u ON (us.user_id = u.user_id)
        JOIN areas AS a ON (s.id = a.site_id)
        JOIN panels AS p ON (a.id = p.area_id)
        JOIN events AS e ON (p.id = e.panel_id)
        JOIN event_types AS et ON (e.event_type_id = et.id)
        WHERE u.user_id = "98765432-123a-1a23-123b-11a1111b2cd3"
        GROUP BY s.id, et.`type`, et.severity_level
    ) ets
    GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
An index on users.user_id will help performance for these single-user queries.
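For example (index names are mine; skip a statement if the column is already covered by a primary key):

CREATE INDEX idx_users_user_id ON users (user_id);
-- The junction table benefits from covering both join columns:
CREATE INDEX idx_users_sites_user ON users_sites (user_id, site_id);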
If you still have performance trouble, please read this and ask another question.
Considering the simple statement below: I am using SUM and COUNT values over a customer table, but I also want to use those values to calculate a third column, average_sale.
My first instinct was to simply reference the column aliases, which would 'appear' clearer; however, I had to repeat the SUM and COUNT expressions instead.
Is repeating the aggregates like this bad for performance?
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
SUM(payments.amount) / COUNT(payments.id) as average_sale,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
ORDER BY `total_sales` DESC
As a general answer I would say, no. However, only a SQL execution plan will tell.
In your case you are reusing the same aggregation expressions multiple times. Even a basic SQL optimizer should realize they are identical and compute each one a single time.
Since your query has no filtering conditions, it reads both tables in full. The biggest cost of your query is probably the join order: should it start with payments and then walk to users, or vice versa? The presence or absence of indexes can be decisive here.
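If you do want to give the optimizer a cheap path, a covering index over the join and aggregation columns is the usual lever. A sketch, assuming the column names from the query:

-- Lets MySQL read user_id and amount from the index alone,
-- without touching the payments rows:
ALTER TABLE payments ADD INDEX idx_payments_user_amount (user_id, amount);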
Edit:
Now, if you find out your optimizer is not that clever, you can make sure it computes each aggregation only once by using a subquery (or a CTE if using MySQL 8.x). For example, you could rephrase your query as:
SELECT
    total_sales,
    quantity,
    total_sales / quantity AS average_sale,
    `name`,
    `user_id`
FROM (
    SELECT
        SUM(payments.amount) AS total_sales,
        COUNT(payments.id) AS quantity,
        `users`.`name`,
        `payments`.`user_id`
    FROM `payments`
    INNER JOIN `users` ON `payments`.`user_id` = `users`.`id`
    GROUP BY `payments`.`user_id`
) x
ORDER BY `total_sales` DESC
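With MySQL 8.x, the same rephrasing can use a CTE. A sketch (note that users.name is added to the GROUP BY, since 8.x enables ONLY_FULL_GROUP_BY by default):

WITH totals AS (
    -- Compute each aggregate exactly once
    SELECT
        SUM(payments.amount) AS total_sales,
        COUNT(payments.id) AS quantity,
        `users`.`name`,
        `payments`.`user_id`
    FROM `payments`
    INNER JOIN `users` ON `payments`.`user_id` = `users`.`id`
    GROUP BY `payments`.`user_id`, `users`.`name`
)
SELECT
    total_sales,
    quantity,
    total_sales / quantity AS average_sale,
    `name`,
    `user_id`
FROM totals
ORDER BY total_sales DESC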
I have a query that looks like this:
select `adverts`.*
from `adverts`
inner join `advert_category` on `advert_category`.`advert_id` = `adverts`.`id`
inner join `advert_location` on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
and `advert_category`.`category_id` = ?
order by `updated_at` desc
The problem is that I have a huge database, and this query is absolutely ravaging it.
What I really need is to do the first join and then apply the WHERE clause. That would whittle the result down from around 100k rows to fewer than 10k; then I want to do the other join, to whittle down the results again, so I can match the advert_location to the category items.
Doing it as is just isn't viable.
So, how do I go about using a join and a where condition, and then after getting that response doing a further join with a where condition?
Thanks
This is your query, written a bit simpler so I can read it:
select a.*
from adverts a
inner join advert_category ac
    on ac.advert_id = a.id
inner join advert_location al
    on al.advert_id = a.id
where al.location_id = ? and
      ac.category_id = ?
order by a.updated_at desc;
I am speculating that advert_category and advert_location have multiple rows per advert. In that case, you are getting a Cartesian product for each advert.
A better way to write the query uses exists:
select a.*
from adverts a
where exists (select 1
from advert_location al
where al.advert_id = a.id and al.location_id = ?
) and
exists (select 1
from advert_category ac
where ac.advert_id = a.id and ac.category_id = ?
)
order by a.updated_at desc;
For this version, you want indexes on advert_location(advert_id, location_id), advert_category(advert_id, category_id), and probably adverts(updated_at, id).
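In DDL, those suggestions look like this (index names are mine):

ALTER TABLE advert_location ADD INDEX idx_al_advert_location (advert_id, location_id);
ALTER TABLE advert_category ADD INDEX idx_ac_advert_category (advert_id, category_id);
ALTER TABLE adverts ADD INDEX idx_adverts_updated_id (updated_at, id);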
You can write the first join in a derived table, including the WHERE condition, and then do the second join (but a decent optimizer might dissolve the derived table again and do what it thinks is best based on statistics):
select adverts.*
from
(
select `adverts`.*
from `adverts`
inner join `advert_category`
on `advert_category`.`advert_id` = `adverts`.`id`
where `advert_category`.`category_id` = ?
) as adverts
inner join `advert_location`
on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
order by `updated_at` desc
MySQL will reorder inner joins for you during optimization, regardless of how you wrote them in your query. Inner join is the same in either direction (in algebra this is called commutative), so this is safe to do.
You can see the result of join reordering if you use EXPLAIN on your query.
If you don't like the order MySQL chose for your joins, you can override it with this kind of syntax:
from `adverts`
straight_join `advert_category` ...
https://dev.mysql.com/doc/refman/5.7/en/join.html says:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order.
Once the optimizer has decided on the join order, it always does one join at a time, in that order. This is called the nested-loop join method.
There isn't really any way to "do the join then do the where clause". Conditions are combined together when looking up rows for joined tables. But this is a good thing, because you can then create a compound index that helps match rows based on both join conditions and where conditions.
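For example, for this query a compound index on each junction table could cover both the constant filter and the join column (index names are mine):

-- Filter by the constant first, then join on advert_id:
ALTER TABLE advert_category ADD INDEX idx_cat_advert (category_id, advert_id);
ALTER TABLE advert_location ADD INDEX idx_loc_advert (location_id, advert_id);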
PS: When asking query optimization question, you should include the EXPLAIN output, and also run SHOW CREATE TABLE <tablename> for each table, and include the result. Then we don't have to guess at the columns and indexes in your table.
I'm having trouble optimizing a query and could use some help. I'm currently pulling in events in a system that has to join several other tables to make sure the event is supposed to display, etc... The query was running smoothly (around 480ms) until I introduced another table in the mix. The query is as follows:
SELECT
keyword_terms,
`esf`.*,
`venue`.`name` AS venue_name,
...
`venue`.`zip`, ase.region_id,
(DATE(NOW()) BETWEEN...AND ase.region_id IS NULL) as featured,
getDistance(`venue`.`lat`, `venue`.`lng`, 36.073, -79.7903) as distance,
`network_exclusion`.`id` as net_exc_id
FROM (`event_search_flat` esf)
# Problematic part of query (pulling in the very next date for the event)
LEFT JOIN (
    SELECT event_id,
           MIN(TIMESTAMP(CONCAT(event_date.date, ' ', event_date.end_time))) AS next_date
    FROM event_date
    WHERE event_date.date >= CURDATE()
       OR (event_date.date = CURDATE() AND TIME(event_date.end_time) >= TIME(NOW()))
    GROUP BY event_id
) edate ON edate.event_id = esf.object_id
# Pull in associated ad space
LEFT JOIN `ad_space` ads ON `ads`.`data_type`=`esf`.`data_type` AND ads.object_id=esf.object_id
# and make sure it is featured within region
LEFT JOIN `ad_space_exclusion` ase ON ase.ad_space_id=ads.id AND region_id =5
# Get venue details
LEFT JOIN `venue` ON `esf`.`venue_id`=`venue`.`id`
# Make sure this event should be listed
LEFT JOIN `network_exclusion` ON network_exclusion.data_type=esf.data_type
AND network_exclusion.object_id=esf.object_id
AND network_exclusion.region_id=5
WHERE `esf`.`event_type` IN ('things to do')
AND (`edate`.`next_date` >= '2013-07-18 16:23:53')
GROUP BY `esf`.`esf_id`
HAVING `net_exc_id` IS NULL
AND `distance` <= 40
ORDER BY DATE(edate.next_date) asc,
`distance` asc
LIMIT 6
It seems that the issue lies with the event_date table, but I'm unsure how to optimize this query (I tried various views, indexes, etc... to no avail). I ran EXPLAIN and received the following: http://cl.ly/image/3r3u1o0n2A46 .
At the moment, the query is taking 6.6 seconds. Any help would be greatly appreciated.
You may be able to get Using index on the event_date subquery by creating a compound index over (event_id, date, end_time). That may turn the subquery into an index-only query, which should speed it up slightly.
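For example (index name is mine; `date` is backticked because it is also a keyword):

ALTER TABLE event_date ADD INDEX idx_event_date_next (event_id, `date`, end_time);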
The subquery might be better written as a correlated subquery, without GROUP BY (note the correlation on event_id; without it, LIMIT 1 would return a single row across all events):
SELECT TIMESTAMP(CONCAT(event_date.date, ' ', event_date.end_time)) AS next_date
FROM event_date
WHERE event_date.event_id = esf.object_id
  AND (event_date.date > CURDATE()
       OR (event_date.date = CURDATE() AND TIME(event_date.end_time) >= TIME(NOW())))
ORDER BY next_date
LIMIT 1
I'm more concerned that your EXPLAIN shows so many tables with type=ALL. That means it has to read every row from those tables and compare them to rows in other tables. You can get an idea of how much work it's doing by multiplying the values in the rows column. Basically, it's making billions of row comparisons to resolve the joins. As the tables grow, this query will get a lot worse.
Using LEFT [OUTER] JOIN has a specific purpose, and if you really mean to use INNER JOIN you should do that, because using an outer join where it doesn't belong can mess up the optimization. Use an outer join like A LEFT JOIN B only if you want rows in A that may not have matching rows in B.
For example, I assume based on column naming convention that LEFT JOIN venue ON esf.venue_id=venue.id should be an inner join, because there should always be a venue referenced by esf.venue_id (unless esf.venue_id is sometimes null).
event_search_flat should have a compound index with columns used in the WHERE clause first, then columns to join to other tables: (event_type, object_id, data_type, event_id)
ad_space should have a compound index for the join: (data_type, object_id). Does this need to be an inner join too?
ad_space_exclusion should have a compound index for the join: (ad_space_id, region_id)
network_exclusion should have a compound index for the join: (data_type, object_id, region_id)
venue is okay because it's doing a primary key lookup already.
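Expressed as DDL, the suggestions above (index names are mine; verify the column names against your schema):

ALTER TABLE event_search_flat ADD INDEX idx_esf_lookup (event_type, object_id, data_type, event_id);
ALTER TABLE ad_space ADD INDEX idx_ads_join (data_type, object_id);
ALTER TABLE ad_space_exclusion ADD INDEX idx_ase_join (ad_space_id, region_id);
ALTER TABLE network_exclusion ADD INDEX idx_ne_join (data_type, object_id, region_id);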
I'm a MySQL query newbie, so I'm sure this is a question with an obvious answer.
But I was looking at these two queries. Will they return different result sets? I understand that the processing order differs, but I believe they will return the same results, with the first query being slightly more efficient?
Query 1: HAVING, then AND
SELECT user_id
FROM forum_posts
GROUP BY user_id
HAVING COUNT(id) >= 100
AND user_id NOT IN (SELECT user_id FROM banned_users)
Query 2: WHERE, then HAVING
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN(SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100
Actually, the first query will be less efficient: its HAVING filter is applied only after grouping, whereas WHERE filters rows before grouping.
UPDATE
Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Group, count, etc.
4. Exclude records from the first result set if they are presented in the second
Second query
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Exclude records from the first result set if they are presented in the second
4. Group, count, etc.
The order of steps 1 and 2 is not important; MySQL can choose whichever it thinks is better. The important difference is in steps 3 and 4. HAVING is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considered a join operation in this case), so the fewer records it has to group, the better the performance.
You already have answers saying that the two queries return the same results, and various opinions about which one is more efficient.
My opinion is that there will be a difference in efficiency (speed) only if the optimizer produces different plans for the two queries. I think the latest MySQL versions' optimizers are smart enough to find the same plan for either query, so there will be no difference at all; but of course one can check the execution plans with EXPLAIN, or run the two queries against some test tables.
I would use the second version in any case, just to play it safe.
Let me add that:
COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.
Therefore, you can also use:
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN
( SELECT user_id FROM banned_users )
GROUP BY user_id
HAVING COUNT(*) >= 100
There are also other ways (equivalent to NOT IN) to achieve the same sub-results before applying GROUP BY.
Using LEFT JOIN / NULL :
SELECT fp.user_id
FROM forum_posts AS fp
LEFT JOIN banned_users AS bu
ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Using NOT EXISTS :
SELECT fp.user_id
FROM forum_posts AS fp
WHERE NOT EXISTS
( SELECT *
FROM banned_users AS bu
WHERE bu.user_id = fp.user_id
)
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.
HAVING conditions are applied to the grouped results, and since you group by user_id, all of its possible values will be present in the grouped result, so the placement of the user_id condition is not important.
To me, the second query is more efficient because it reduces the number of records going into GROUP BY and HAVING.
Alternatively, you may try the following query to avoid using IN:
SELECT `fp`.`user_id`
FROM `forum_posts` `fp`
LEFT JOIN `banned_users` `bu` ON `fp`.`user_id` = `bu`.`user_id`
WHERE `bu`.`user_id` IS NULL
GROUP BY `fp`.`user_id`
HAVING COUNT(`fp`.`id`) >= 100
Hope this helps.
No, it does not give the same results.
The first query filters records with the COUNT(id) condition, while the other query filters the records first and then applies the HAVING clause.
The second query is correctly written.