How to improve slow MySQL subquery

I know there's many questions/answers for slow queries, but I'm struggling to relate an existing answer to my example.
I have the following simple query which counts article views in a subquery:
SELECT
articles.id,
articles.views,
articles.title,
articles.slug,
articles.created_at,
(SELECT count(*) FROM tracking WHERE element_id = articles.id AND tracking_type = 'article_view') AS tracking_views
FROM articles
WHERE articles.company_id = 123
ORDER BY articles.created_at DESC
This particular company has ~250 articles, and the query takes over 12 seconds.
Is there a better/more efficient way I could be doing this?

Try joining to a GROUP BY. It's pretty hard to say without knowing how many articles, views and companies there are, though.
What you want is for SQL to be able to do the aggregation of tracking in one go, rather than individually for every row in the result, which is what the position of your tracking_views subselect implies.
If you're lucky (I didn't check), the join to the counts subselect will be smart enough to skip any articles that are not for the right company. If not, you can include the join back to company in the counts subselect.
E.g.:
select a.*, counts.count
from articles a
join (
    select count(*) as count, element_id
    from tracking
    where tracking_type = 'article_view'
    group by element_id
) as counts on counts.element_id = a.id
where a.company_id = 123
order by a.created_at desc
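Note one behavior change: the inner join drops articles that have no tracked views at all, whereas the original correlated subquery returned them with a count of 0. A sketch of a LEFT JOIN variant that keeps them (same tables and columns as above):
select a.id, a.views, a.title, a.slug, a.created_at,
       coalesce(counts.count, 0) as tracking_views  -- missing count becomes 0
from articles a
left join (
    select element_id, count(*) as count
    from tracking
    where tracking_type = 'article_view'
    group by element_id
) as counts on counts.element_id = a.id
where a.company_id = 123
order by a.created_at desc
A composite index on tracking (tracking_type, element_id), assuming one doesn't already exist, should let the aggregation be resolved from the index alone.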

Related

MySQL Spring complicated query - ways to order and query efficiency

I run this complicated query on a Spring JPA Repository.
My goal is to get all the info from the sites table, ordered by event severity on each site.
This is my query:
SELECT alls.* FROM sites AS alls JOIN
(
SELECT distinct ets.id FROM
(
SELECT s.id, et.`type`, et.severity_level, COUNT(et.`type`) FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
ORDER BY et.severity_level, COUNT(et.`type`) DESC
) AS ets
) as etsd ON alls.id = etsd.id
The second select (the one with "distinct") returns site_ids ordered correctly by severity.
Note that there are different event_types + severity levels on each site, and I use pagination on the result, so I need the DISTINCT.
The problem is - the main select doesn't keep this order.
Is there any way to keep the order in one complicated query?
Another related question - one of my ideas was making two queries:
1. The "select distinct" query that returns the order --> saved in a list, "order list".
2. The main "sites" query (which becomes very simple), with WHERE id IN ("order list").
3. Ordering the second query's results in code by "order list".
I run the query every 10 seconds, so it is very performance-sensitive.
Which seems likely to be faster in this case - the original complicated query or the two-query approach?
Any insight will be appreciated.
Thanks a lot.
A quirk of SQL's declarative set-oriented syntax for us procedural programmers: ORDER by clauses in subqueries are not carried through to the outer query, except sometimes by accident. If you want ordering at any query level, you must specify it at that level or you will get unpredictable results. The query optimizers are usually smart enough to avoid wasting sort operations.
Your requirement: return at most one sites row for each sites.id value, ordered by the worst event. Worst: the lowest event severity, and if more than one event shares the lowest severity, the largest count.
Use this sort of thing to get the "worst" for each id, in place of DISTINCT.
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
This gives at most one row per sites.id value. Then your outer query is
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
Putting it all together:
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
SELECT s.id, et.severity_level, COUNT(et.`type`) num
FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
An index on users.user_id will help performance for these single-user queries.
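If that column is not already the primary key, creating the index could look like this (hypothetical index name; skip it if the index already exists):
CREATE INDEX idx_users_user_id ON users (user_id);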
If you still have performance trouble, please read this and ask another question.

MySQL: Optimizing Sub-queries

I have this query I need to optimize further, since it requires too much CPU time, and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Subquery results are considered unordered sets, so an ORDER BY there is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM ( ... ) a GROUP BY fruit_id is invalid. If you group by fruit_id, which count_vote and which text_c do you expect to get for the ID? You don't tell the DBMS (which would be something like MAX(count_vote) and MIN(text_c), for instance). MySQL should throw an error, but silently replaces count_vote, text_c with ANY_VALUE(count_vote), ANY_VALUE(text_c) instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
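For example, if the intent of the ORDER BY ... GROUP BY construct was "the top-voted review per fruit", a deterministic way to express that is a greatest-n-per-group query. A sketch, assuming r_id is the primary key of Reviews (the Categories join is left out here because its relation to the other tables is unclear):
SELECT b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
WHERE r.r_id = (
    -- pick the top-voted review per fruit; ties broken by lowest r_id
    SELECT r2.r_id
    FROM Reviews r2
    WHERE r2.fruit_id = r.fruit_id
    ORDER BY r2.count_vote DESC, r2.r_id
    LIMIT 1
);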
Your Categories table does not seem to be joined/related to the others; this produces a Cartesian product between all the rows.
If you want distinct results, don't use GROUP BY but DISTINCT, so you can avoid an unnecessary subquery.
And you don't need an ORDER BY on a subquery:
SELECT c.category
     , b.fruit_name
     , u.name
     , r.count_vote
     , r.text_c
FROM Fruits b
INNER JOIN (
    SELECT DISTINCT fruit_id, count_vote, text_c, customer_id
    FROM Reviews
) r ON b.fruit_id = r.fruit_id
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /* your Categories table seems not joined/related to the others */
WHERE c.category = 'Fruits';
For better readability you should use explicit JOIN syntax and avoid the old join syntax based on comma-separated table names and WHERE conditions.
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate Cartesian join - it's also unclear whether any attributes from this table are actually required for the result, whether it is there to produce multiples of the same row in the output, or whether it is just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
b.fruit_name, u.name, r.count_vote, r.text_c -- category omitted: Categories is no longer joined here
FROM Fruits b
JOIN Reviews r ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
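For example (hypothetical index names; add only the ones your schema lacks):
CREATE INDEX idx_reviews_fruit ON Reviews (fruit_id);
CREATE INDEX idx_reviews_customer ON Reviews (customer_id);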
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.

SQL: Inner join on a Max Times Occurred Statement

Posed a question: What school was attended most frequently?
I came up with the following statement...
select unitid, count(unitid) as 'Times_Occurred'
from people
group by unitid
order by count(*) desc
limit 50;
I made it LIMIT 50 because multiple unitid's had a count of 3. I have two tables, a people table and a post table, connected by unitid. I am trying to figure out how to do an inner join to get not only the 50 unitid's with the highest counts and how often they occur, but also the colleges that go along with those IDs.
Any and all help is much appreciated!!
Perhaps your counts are being inflated and you don't know how to handle that... this is one approach, using an inline view. The reason this works is that the counts are calculated and retained prior to the join, so the 1-M cardinality doesn't negatively affect the counts.
SELECT *
FROM (
    SELECT unitid, COUNT(unitid) AS Times_Occurred
    FROM people
    GROUP BY unitid
    ORDER BY COUNT(*) DESC
    LIMIT 50
) A
INNER JOIN post B ON A.unitid = B.unitid;
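With the join in place you can pull the school columns directly from the post table instead of selecting everything. A sketch, with college_name standing in for whatever the real column is called:
SELECT A.unitid, A.Times_Occurred, B.college_name
FROM (
    SELECT unitid, COUNT(unitid) AS Times_Occurred
    FROM people
    GROUP BY unitid
    ORDER BY COUNT(*) DESC
    LIMIT 50
) A
INNER JOIN post B ON A.unitid = B.unitid;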

Why does a MySQL query with a Dependent Subquery take so much longer to execute than executing each statement individually?

I'm running two queries.
The first one gets unique IDs. This executes in ~350ms.
select parent_id
from duns_match_sealed_air_072815
group by duns_number
Then I paste those IDs into this second query. With >10k IDs pasted in, it also executes in about 350ms.
select term, count(*) as count
from companies, business_types, business_types_to_companies
where
business_types.id = business_types_to_companies.term_id
and companies.id = business_types_to_companies.company_id
and raw_score > 25
and diversity = 1
and company_id in (paste,ten,thousand,ids,here)
group by term
order by count desc;
When I combine these queries into one, it takes a long time to execute. I don't know how long, because I stopped it after a few minutes.
select term, count(*) as count
from companies, business_types, business_types_to_companies
where
business_types.id = business_types_to_companies.term_id
and companies.id = business_types_to_companies.company_id
and raw_score > 25
and diversity = 1
and company_id in (
select parent_id
from duns_match_sealed_air_072815
group by duns_number
)
group by term
order by count desc;
What is going on?
It's down to the way MySQL processes the query - I believe it has to run your embedded query once for each row, whereas using two queries allows you to store the result.
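One way to emulate the two-query approach without leaving MySQL is to materialize the ID list yourself first. A sketch (the temporary-table name is made up; DISTINCT reproduces the pasted-IN-list semantics and sidesteps the GROUP BY oddity discussed in the next answer):
CREATE TEMPORARY TABLE tmp_parent_ids AS
SELECT DISTINCT parent_id
FROM duns_match_sealed_air_072815;

-- optional, helps the join on large lists:
ALTER TABLE tmp_parent_ids ADD INDEX (parent_id);

SELECT term, COUNT(*) AS count
FROM companies
JOIN tmp_parent_ids ON companies.id = tmp_parent_ids.parent_id
JOIN business_types_to_companies ON companies.id = business_types_to_companies.company_id
JOIN business_types ON business_types.id = business_types_to_companies.term_id
WHERE raw_score > 25
AND diversity = 1
GROUP BY term
ORDER BY count DESC;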
Hope this helps!
The query has been re-written using JOINs; in particular, I've used EXISTS instead of IN. This is a shot in the dark: it is possible that the sub-query generates many values, causing the outer query to struggle while it matches each item returned from the sub-query.
select term, count(*) as count
from companies c
inner join business_types_to_companies bc on bc.company_id = c.id
inner join business_types b on b.id = bc.term_id
where
raw_score > 25
and diversity = 1
and exists (
select 1
from duns_match_sealed_air_072815
where parent_id = c.id
)
group by term
order by count desc;
First, with respect, your subquery doesn't use GROUP BY in a sensible way.
select parent_id /* wrong GROUP BY */
from duns_match_sealed_air_072815
group by duns_number
In fact, it misuses the pernicious MySQL extension to GROUP BY. Read this. http://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html . I can't tell what your application logic intends from this query, but I can tell you that it actually returns an unpredictably selected parent_id value associated with each distinct duns_number value.
Do you want
select MIN(parent_id) parent_id
from duns_match_sealed_air_072815
group by duns_number
or something like that? That one selects the lowest parent ID associated with each given number.
Sometimes MySQL has a hard time optimizing the WHERE .... IN () query pattern. Try a join instead. Like this:
select term, count(*) as count
from companies
join (
select MIN(parent_id) parent_id
from duns_match_sealed_air_072815
group by duns_number
) idlist ON companies.id = idlist.parent_id
join business_types_to_companies ON companies.id = business_types_to_companies.company_id
join business_types ON business_types.id = business_types_to_companies.term_id
where raw_score > 25
and diversity = 1
group by term
order by count desc
To optimize this further we'll need to see the table definitions and the output from EXPLAIN.
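Both are one-liners, e.g.:
SHOW CREATE TABLE companies;  -- prints the table definition, including its indexes
EXPLAIN
SELECT term, COUNT(*) AS count
FROM companies
JOIN business_types_to_companies ON companies.id = business_types_to_companies.company_id
JOIN business_types ON business_types.id = business_types_to_companies.term_id
WHERE raw_score > 25 AND diversity = 1
GROUP BY term
ORDER BY count DESC;  -- shows the join order, chosen keys and row estimates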

MySQL Group By and HAVING

I'm a MySQL query noobie so I'm sure this is a question with an obvious answer.
But, I was looking at these two queries. Will they return different result sets? I understand that the sorting process would commence differently, but I believe they will return the same results with the first query being slightly more efficient?
Query 1: HAVING, then AND
SELECT user_id
FROM forum_posts
GROUP BY user_id
HAVING COUNT(id) >= 100
AND user_id NOT IN (SELECT user_id FROM banned_users)
Query 2: WHERE, then HAVING
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN(SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100
Actually the first query will be less efficient (its NOT IN filter sits in HAVING, which is applied after grouping, whereas WHERE is applied before).
UPDATE
Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Group, count, etc.
4. Exclude records from the first result set if they are present in the second
Second query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Exclude records from the first result set if they are present in the second
4. Group, count, etc.
The order of steps 1 and 2 is not important; MySQL can choose whichever it thinks is better. The important difference is in steps 3 and 4. HAVING is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considered a join operation in this case), so the fewer records it has to group, the better the performance.
You already have answers saying that the two queries will return the same results, and various opinions on which one is more efficient.
My opinion is that there will be a difference in efficiency (speed) only if the optimizer yields different plans for the 2 queries. I think that for the latest MySQL versions the optimizers are smart enough to find the same plan for either query, so there will be no difference at all, but of course one can test and see either the execution plans with EXPLAIN or run the 2 queries against some test tables.
I would use the second version in any case, just to play it safe.
Let me add that:
COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.
Therefore, you can also use:
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN
( SELECT user_id FROM banned_users )
GROUP BY user_id
HAVING COUNT(*) >= 100
There are also other ways to achieve same (to NOT IN) sub-results before applying GROUP BY.
Using LEFT JOIN / NULL :
SELECT fp.user_id
FROM forum_posts AS fp
LEFT JOIN banned_users AS bu
ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Using NOT EXISTS :
SELECT fp.user_id
FROM forum_posts AS fp
WHERE NOT EXISTS
( SELECT *
FROM banned_users AS bu
WHERE bu.user_id = fp.user_id
)
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.
HAVING conditions are applied to the grouped results, and since you group by user_id, all of its possible values will be present in the grouped result, so the placement of the user_id condition is not important.
To me, the second query is more efficient because it lowers the number of records for GROUP BY and HAVING.
Alternatively, you may try the following query to avoid using IN:
SELECT `fp`.`user_id`
FROM `forum_posts` `fp`
LEFT JOIN `banned_users` `bu` ON `fp`.`user_id` = `bu`.`user_id`
WHERE `bu`.`user_id` IS NULL
GROUP BY `fp`.`user_id`
HAVING COUNT(`fp`.`id`) >= 100
Hope this helps.
No, it does not give the same results.
The first query filters records with the COUNT(id) condition, while the other query filters records first and then applies the HAVING clause.
The second query is correctly written.