Are multiple aggregates in a SELECT statement a performance concern? - mysql

Considering the simple statement below, I am using SUM and COUNT to build per-customer totals, but I also want to use those values to calculate a third column, average_sale.
My first instinct was to reuse the column aliases, which would appear clearer, but an alias cannot be referenced in the same select list, so I had to repeat the SUM and COUNT expressions.
Is this performant?
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
SUM(payments.amount) / COUNT(payments.id) as average_sale,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
ORDER BY `total_sales` DESC

As a general answer I would say no. However, only the SQL execution plan will tell for sure.
In your case you are reusing the same aggregation expressions multiple times. Even a basic SQL optimizer should recognize that they are identical and compute each one only once.
Since your query has no filtering conditions, it reads both tables in full. The biggest cost of your query is probably the join order: should it start with payments and then walk to users, or vice versa? The presence or absence of indexes can be decisive here.
Edit:
Now, if you find out your optimizer is not that clever, you can make sure it computes each aggregation only once by using a subquery (or a CTE if using MySQL 8.x). For example, you could rephrase your query as:
select
total_sales,
quantity,
total_sales / quantity as average_sale,
`name`,
`user_id`
from (
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
) x
ORDER BY `total_sales` DESC
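For completeness, here is the same rewrite as a CTE; a sketch that assumes MySQL 8.x or later, where WITH is available:
WITH sums AS (
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
)
select
total_sales,
quantity,
total_sales / quantity as average_sale,
`name`,
`user_id`
from sums
ORDER BY `total_sales` DESC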

Related

MySQL Spring complicated query - ways to order and query efficiency

I run this complicated query through a Spring JPA repository.
My goal is to get all the info from the site table, ordered by the severity of the events at each site.
This is my query:
SELECT alls.* FROM sites AS alls JOIN
(
SELECT distinct ets.id FROM
(
SELECT s.id, et.`type`, et.severity_level, COUNT(et.`type`) FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
ORDER BY et.severity_level, COUNT(et.`type`) DESC
) AS ets
) as etsd ON alls.id = etsd.id
The second select (the one with "distinct") returns site_ids ordered correctly by severity.
Note that each site has different event_types and severities, and I use pagination on the result, so I need the distinct.
The problem is - the main select doesn't keep this order.
Is there any way to keep the order in one complicated query?
Another related question - one of my ideas was making two queries:
1. The "select distinct" query that returns the order --> saved in a list "order list"
2. The main "sites" query (which becomes very simple) with "where id in {"order list"}"
3. Ordering the second query's results in code by "order list".
I run the query every 10 seconds, so it is very performance-sensitive.
Which seems faster in this case - the original complicated query or those two?
Any insight will be appreciated.
Thanks a lot.
A quirk of SQL's declarative, set-oriented syntax for us procedural programmers: ORDER BY clauses in subqueries are not carried through to the outer query, except sometimes by accident. If you want ordering at any query level, you must specify it at that level or you will get unpredictable results. Query optimizers are usually smart enough to avoid wasting sort operations.
Your requirement: return at most one sites row for each sites.id value, ordered by the worst event. Worst means the lowest event severity, and if more than one event has the lowest severity, the largest count.
Use this sort of thing to get the "worst" for each id, in place of DISTINCT.
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
This gives at most one row per sites.id value. Then your outer query is
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
Putting it all together:
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
SELECT s.id, et.severity_level, COUNT(et.`type`) num
FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
An index on users.user_id will help performance for these single-user queries.
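For instance (a sketch; the index name is arbitrary, and this is unnecessary if users.user_id is already the primary key or has a unique index):
CREATE INDEX idx_users_user_id ON users (user_id);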
If you still have performance trouble, please read this and ask another question.

Optimizing this query that contains multiple subqueries

I am having a hard time working this one out in my head and need some help to figure out if there is a better way to produce the same results.
We have 3 tables: stores, departments, sales.
Each store can have many departments.
Each department can have many sales.
I have a query that lists each store and counts and sums the relevant sales. But the operation uses 2 subqueries when I feel it should be possible with 1.
This is just a sample that runs pretty quickly. In the real-world report I am building, I find that each subsequent subquery I add significantly reduces the overall performance of the report.
SELECT `name`,
@totalvalue := ( SELECT SUM(`price`) FROM `sales` WHERE `sales`.`department` IN (
SELECT `id` FROM `departments` WHERE `departments`.`store`=`stores`.`id`
) ) AS `totalvalue`,
@totalsales := ( SELECT COUNT(*) FROM `sales` WHERE `sales`.`department` IN (
SELECT `id` FROM `departments` WHERE `departments`.`store`=`stores`.`id`
) ) AS `totalsales`,
ROUND(@totalvalue / @totalsales,2) AS `averagesale`
FROM `stores`;
How can I produce totalsales and totalvalue in the 1 subquery, or via a join?
Many thanks for any help you can provide.
You need to JOIN the tables together and use GROUP BY with aggregate functions:
SELECT st.name, SUM(s.price) totalvalue, COUNT(s.id) totalsales, ROUND(AVG(s.price),2) averagesale
FROM stores st
LEFT JOIN departments d ON d.store=st.id
LEFT JOIN sales s ON s.department=d.id
GROUP BY st.name
LEFT JOIN is used because you want to show all stores whether or not they have any data; otherwise you would use INNER JOIN.
Additionally, you do not have to calculate the average yourself - AVG() is a built-in function. As long as you do not need a weighted average, this is enough.
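One caveat worth flagging (an assumption about the schema, not something stated in the question): if two stores can share a name, grouping by st.name alone merges them. Grouping by the id as well avoids that and also satisfies MySQL's ONLY_FULL_GROUP_BY mode:
SELECT st.name, SUM(s.price) totalvalue, COUNT(s.id) totalsales, ROUND(AVG(s.price),2) averagesale
FROM stores st
LEFT JOIN departments d ON d.store=st.id
LEFT JOIN sales s ON s.department=d.id
GROUP BY st.id, st.name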

SQL SELECT query for 3 tables

I have 3 tables:
1. products(product_id,name)
2. orders(id,order_id,product_id)
3. factors(id,order_id,date)
I want to retrieve product names (products.name) where the last two tables share the same order_id on a given date.
I use this query for this purpose:
select products.name
from products
WHERE products.product_id IN
(
SELECT distinct orders.product_id FROM orders WHERE
order_id IN (select order_id FROM factors WHERE
factors.datex ='2017-04-29') GROUP BY product_id
)
But I get no results. Where is my mistake? How can I resolve it? Thanks.
Your query should be fine. I am rewriting it to make a few changes to the structure, but not the logic (this makes it easier for me to understand the query):
select p.name
from products p
where p.product_id in (select o.product_id
from orders o
where o.order_id in (select f.order_id
from factors f
where f.datex = '2017-04-29'
)
) ;
Notes on the changes:
When using multiple tables in a query, always qualify the column names.
Use table aliases. They make queries easier to write and to read.
SELECT DISTINCT and GROUP BY are unnecessary in IN subqueries. The logic of IN already handles (i.e. ignores) duplicates. And by explicitly including the operations, you run the risk of a less efficient query plan.
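As a side note, the same filter can also be phrased with EXISTS, which some optimizers handle better than nested IN; a minimal sketch against the same tables (keeping the datex column name from the original query):
select p.name
from products p
where exists (select 1
              from orders o
              join factors f on f.order_id = o.order_id
              where o.product_id = p.product_id and
                    f.datex = '2017-04-29'
             );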
Why might your query not work?
factors.datex may have a time component. If so, comparing with date(f.datex) = '2017-04-29' will work.
There are no factors on that date.
There are no orders that match factors on that date.
There are no products in the orders that match the factors on that date.
In the factors table the column name is date, so it should be -
factors.date ='2017-04-29'
You have written -
factors.datex ='2017-04-29'

MySQL GROUP BY and ORDER BY problem

Here is my data structure
(image of the payments table structure: http://luvboy.co.cc/images/db.JPG)
when i try this sql
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
group by dc_number
order by rec_id desc;
Something is wrong somewhere, I don't know what.
I need
rec_id  customer_id  dc_number  balance
2       IHS050018    DC3        -1
3       IHS050018    52         600
I want the most recent balance of the customer with respect to each dc_number.
Thanks.
There are essentially two ways to get this
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id='IHS050018' and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
(Note the = rather than IN: MySQL does not support LIMIT inside an IN/ALL/ANY/SOME subquery, but a scalar subquery with LIMIT 1 is fine.)
Also if you want to get the last balance for each customer you might do
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id=p.customer_id and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
What I consider essentially the other way uses the fact that select rec_id with order by ... desc and limit 1 is equivalent to select max(rec_id) with an appropriate group by. In full:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id IN (
select max(s.rec_id)
from payments s
group by s.customer_id, s.dc_number
);
This should be faster (if you want the last balance for every customer), since max is normally less expensive than a sort (with indexes it might be the same).
Also, when written like this the subquery is not correlated (it need not be run for every row of the outer query), which means it will be run only once, and the whole query can be rewritten as a join.
Also notice that it might be beneficial to write it as a correlated query (by adding where s.customer_id = p.customer_id and s.dc_number = p.dc_number in the inner query) depending on the selectivity of the outer query.
This might improve performance if you look up the last balance of only one or a few rows.
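If performance matters here, the usual enabler (an assumption, since the table definition isn't shown) is a composite index covering the lookup, grouping, and ordering columns; the name is arbitrary:
CREATE INDEX idx_payments_cust_dc_rec ON payments (customer_id, dc_number, rec_id);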
I don't think there is a good way to do this in SQL without having window functions (like those in Postgres 8.4). You probably have to iterate over the dataset in your code and get the recent balances that way.
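For reference, MySQL 8.0 and later do support window functions, so on a current server a sketch along these lines (using the same payments table) would avoid iterating in code:
select rec_id, customer_id, dc_number, balance
from (
select p.*,
       row_number() over (partition by dc_number order by rec_id desc) as rn
from payments p
where customer_id='IHS050018'
) ranked
where rn = 1;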
The ordering has to happen before the grouping, which MySQL does not allow in a single query block (GROUP BY must precede ORDER BY). The usual workaround is to push the ORDER BY into a subquery - though note that relying on GROUP BY to keep the first row of an ordered subquery is undocumented MySQL behavior and not guaranteed:
select rec_id, customer_id, dc_number, balance
from (
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
order by rec_id desc
) t
group by dc_number

A better way to build this MySQL statement with subselects

I have five tables in my database. Members, items, comments, votes and countries. I want to get 10 items. I want to get the count of comments and votes for each item. I also want the member that submitted each item, and the country they are from.
After posting here and elsewhere, I started using subselects to get the counts, but this query is taking 10 seconds or more!
SELECT `items_2`.*,
(SELECT COUNT(*)
FROM `comments`
WHERE (comments.Script = items_2.Id)
AND (comments.Active = 1))
AS `Comments`,
(SELECT COUNT(votes.Member)
FROM `votes`
WHERE (votes.Script = items_2.Id)
AND (votes.Active = 1))
AS `votes`,
`countrys`.`Name` AS `Country`
FROM `items` AS `items_2`
INNER JOIN `members` ON items_2.Member=members.Id AND members.Active = 1
INNER JOIN `members` AS `members_2` ON items_2.Member=members.Id
LEFT JOIN `countrys` ON countrys.Id = members.Country
GROUP BY `items_2`.`Id`
ORDER BY `Created` DESC
LIMIT 10
My question is whether this is the right way to do this, whether there's a better way to write this statement, OR whether a whole different approach would be better. Should I run the subselects separately and aggregate the information?
Yes, you can rewrite the subqueries as aggregate joins (see below), but I am almost certain that the slowness is due to missing indices rather than to the query itself. Use EXPLAIN to see what indices you can add to make your query run in a fraction of a second.
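As a sketch of what that usually looks like (the index names are hypothetical; verify with EXPLAIN which ones the optimizer actually uses):
-- Candidate indexes for the two COUNT subqueries and the ORDER BY:
CREATE INDEX idx_comments_script_active ON comments (Script, Active);
CREATE INDEX idx_votes_script_active ON votes (Script, Active);
CREATE INDEX idx_items_created ON items (Created);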
For the record, here is the aggregate join equivalent.
SELECT `items_2`.*,
c.cnt AS `Comments`,
v.cnt AS `votes`,
`countrys`.`Name` AS `Country`
FROM `items` AS `items_2`
INNER JOIN `members` ON items_2.Member=members.Id AND members.Active = 1
INNER JOIN `members` AS `members_2` ON items_2.Member=members.Id
LEFT JOIN (
SELECT Script, COUNT(*) AS cnt
FROM `comments`
WHERE Active = 1
GROUP BY Script
) AS c
ON c.Script = items_2.Id
LEFT JOIN (
SELECT votes.Script, COUNT(*) AS cnt
FROM `votes`
WHERE Active = 1
GROUP BY Script
) AS v
ON v.Script = items_2.Id
LEFT JOIN `countrys` ON countrys.Id = members.Country
GROUP BY `items_2`.`Id`
ORDER BY `Created` DESC
LIMIT 10
However, because you are using LIMIT 10, you are almost certainly as well off (or better off) with the subqueries that you currently have than with the aggregate join equivalent I provided above for reference.
This is because a bad optimizer (and MySQL's is far from stellar) could, in the case of the aggregate join query, end up performing the COUNT(*) aggregation work for the full contents of the Comments and Votes tables, only to throw everything but 10 values (your LIMIT) away. With your original query, by contrast, it will from the start only look at the strict minimum as far as the Comments and Votes tables are concerned.
More precisely, using subqueries the way your original query does typically results in what are called nested loops with index lookups. Using aggregate joins typically results in merge or hash joins with index scans or table scans. The former (nested loops) are more efficient than the latter (merge and hash joins) when the number of loops is small (10 in your case). The latter, however, get more efficient when the former would result in too many loops (tens or hundreds of thousands or more), especially on systems with slow disks but lots of memory.
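If you ever need to force the LIMIT-first behavior explicitly, one option (a common pattern, not something from the original query) is to pick the 10 items in a derived table and only then attach the counts. Note that if the newest items can belong to inactive members, this variant may return fewer than 10 rows:
SELECT t.*,
       (SELECT COUNT(*) FROM comments
        WHERE comments.Script = t.Id AND comments.Active = 1) AS Comments,
       (SELECT COUNT(*) FROM votes
        WHERE votes.Script = t.Id AND votes.Active = 1) AS votes,
       countrys.Name AS Country
FROM (SELECT * FROM items ORDER BY Created DESC LIMIT 10) AS t
INNER JOIN members ON t.Member = members.Id AND members.Active = 1
LEFT JOIN countrys ON countrys.Id = members.Country
ORDER BY t.Created DESC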