Optimizing this query that contains multiple subqueries - mysql

I am having a hard time working this one out in my head and need some help to figure out if there is a better way to produce the same results.
We have 3 tables: stores, departments, sales.
Each store can have many departments.
Each department can have many sales.
I have a query that lists each store and counts and sums the relevant sales, but the operation is done with two subqueries when I feel it should be possible in one.
This is just a sample that runs pretty quickly. In the real-world report I am building, each subsequent subquery I add significantly reduces the overall performance.
SQL Fiddle
SELECT `name`,
  @totalvalue := ( SELECT SUM(`price`) FROM `sales` WHERE `sales`.`department` IN (
    SELECT `id` FROM `departments` WHERE `departments`.`store` = `stores`.`id`
  ) ) AS `totalvalue`,
  @totalsales := ( SELECT COUNT(*) FROM `sales` WHERE `sales`.`department` IN (
    SELECT `id` FROM `departments` WHERE `departments`.`store` = `stores`.`id`
  ) ) AS `totalsales`,
  ROUND(@totalvalue / @totalsales, 2) AS `averagesale`
FROM `stores`;
How can I produce totalsales and totalvalue in one subquery, or via a join?
Many thanks for any help you can provide.

You have to JOIN the tables together and use GROUP BY with aggregate functions:
SELECT st.name,
       SUM(s.price) AS totalvalue,
       COUNT(s.id) AS totalsales,
       ROUND(AVG(s.price), 2) AS averagesale
FROM stores st
LEFT JOIN departments d ON d.store = st.id
LEFT JOIN sales s ON s.department = d.id
GROUP BY st.name
Use LEFT JOIN because you want to show all stores whether or not they have any data; otherwise you would use INNER JOIN.
Additionally, you do not have to calculate the average yourself: AVG() is a built-in function. As long as you do not need a weighted average, this is enough.
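One caveat worth adding (my note, not part of the original answer): with LEFT JOIN, a store with no sales gets NULL for totalvalue and averagesale. If you want zeros instead, COALESCE is a minimal fix, sketched against the same schema:
SELECT st.name,
       COALESCE(SUM(s.price), 0) AS totalvalue,
       COUNT(s.id) AS totalsales,                     # already 0 when there are no sales
       COALESCE(ROUND(AVG(s.price), 2), 0) AS averagesale
FROM stores st
LEFT JOIN departments d ON d.store = st.id
LEFT JOIN sales s ON s.department = d.id
GROUP BY st.name;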

Related

MySQL Spring complicated query - ways to order and query efficiency

I run this complicated query through a Spring JPA repository.
My goal is to get all the info from the sites table, ordered by event severity for each site.
This is my query:
SELECT alls.* FROM sites AS alls JOIN
(
SELECT distinct ets.id FROM
(
SELECT s.id, et.`type`, et.severity_level, COUNT(et.`type`) FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
ORDER BY et.severity_level, COUNT(et.`type`) DESC
) AS ets
) as etsd ON alls.id = etsd.id
The second select (the one with "distinct") returns site_ids ordered correctly by severity.
Note that there are different event_types + severity in each site, and I use pagination on the answer, so I need the distinct.
The problem is that the main select doesn't keep this order.
Is there any way to keep the order in one complicated query?
Another related question - one of my ideas was making two queries:
The "select distinct" query that will return me the order --> saved in a list "order list"
The main "sites" query (that becomes very simple) with "where id in {"order list"}
Order the second query in code by "order list".
I run the query every 10 seconds, so it is very performance-sensitive.
Which seems faster in this case: the original complicated query or those two?
Any insight will be appreciated.
Thanks a lot.
A quirk of SQL's declarative, set-oriented syntax for us procedural programmers: ORDER BY clauses in subqueries are not carried through to the outer query, except sometimes by accident. If you want ordering at any query level, you must specify it at that level or you will get unpredictable results. Query optimizers are usually smart enough to avoid wasting sort operations.
Your requirement: return at most one sites row for each sites.id value, ordered by the worst event. Worst means the lowest event severity, and if more than one event has the lowest severity, the largest count.
Use this sort of thing to get the "worst" for each id, in place of DISTINCT.
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
This gives at most one row per sites.id value. Then your outer query is
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
Putting it all together:
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
SELECT s.id, et.severity_level, COUNT(et.`type`) num
FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
An index on users.user_id will help performance for these single-user queries.
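A sketch of that recommendation in DDL (the index name is my invention; skip this if user_id is already the primary key of users):
CREATE INDEX idx_users_user_id ON users (user_id);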
If you still have performance trouble, please read this and ask another question.

Are multiple aggregates in a select statement a performance concern?

Considering the simple statement below, I am using the SUM and COUNT values in a customer table but I also want to use the values to calculate a third column, average_sale.
My first instinct was to just reuse the column aliases, which would appear to be clearer; however, SQL does not allow referencing an alias from the same SELECT list, so I had to use the SUM and COUNT again.
Is this performant?
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
SUM(payments.amount) / COUNT(payments.id) as average_sale,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
ORDER BY `total_sales` DESC
As a general answer I would say no; however, only the SQL execution plan will tell.
In your case you are reusing the same aggregation expressions multiple times. Even a basic SQL optimizer should recognize they are identical and compute each one a single time.
Since your query has no filtering conditions, it reads both tables in full. The biggest cost of your query is probably the join order: should it start from payments and then walk to users, or vice versa? The presence or absence of indexes can be decisive here.
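A quick way to check what the optimizer chose is to prefix the query with EXPLAIN (a sketch; the output columns vary by MySQL version):
EXPLAIN
SELECT SUM(p.amount) AS total_sales,
       COUNT(p.id) AS quantity,
       u.name,
       p.user_id
FROM payments p
INNER JOIN users u ON p.user_id = u.id
GROUP BY p.user_id;
The top row of the output is the table read first, and the key and rows columns show which index each table uses and roughly how many rows are examined.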
Edit:
Now, if you find out your optimizer is not that clever, you can make sure it computes each aggregation only once by using a subquery (or a CTE if using MySQL 8.x). For example, you could rephrase your query as:
select
total_sales,
quantity,
total_sales / quantity as average_sale,
`name`,
`user_id`
from (
SELECT
SUM(payments.amount) as total_sales,
COUNT(payments.id) as quantity,
`users`.`name`,
`payments`.`user_id`
FROM `payments`
INNER JOIN `users` on `payments`.`user_id` = `users`.`id`
GROUP BY `payments`.`user_id`
) x
ORDER BY `total_sales` DESC
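The CTE form mentioned above would look like this on MySQL 8.x (a sketch, not part of the original answer; I added users.name to the GROUP BY because recent MySQL versions enable ONLY_FULL_GROUP_BY by default):
WITH totals AS (
    SELECT SUM(payments.amount) AS total_sales,
           COUNT(payments.id) AS quantity,
           users.name,
           payments.user_id
    FROM payments
    INNER JOIN users ON payments.user_id = users.id
    GROUP BY payments.user_id, users.name
)
SELECT total_sales,
       quantity,
       total_sales / quantity AS average_sale,
       name,
       user_id
FROM totals
ORDER BY total_sales DESC;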

Is there a way to return a calculated value from a nested query?

OK, I've spent a week trying to figure this out but I can't seem to find what I'm looking for. There are two similar problems that, I think, are easier to explain after seeing the pseudo(ish) code. Here is the first one:
SELECT P_CODE, P_DESCRIPT, @diff
FROM PRODUCT
WHERE @diff :=
  (SELECT ROUND(ABS(P_PRICE - (SUM(P_PRICE) / COUNT(P_PRICE))), 2)
   FROM PRODUCT
   WHERE P_PRICE IN
     (SELECT P_PRICE
      FROM PRODUCT));
Basically, I have a product table where I'm trying to return the primary key, the description, and the difference between the product's price and the average price across all entries. On a similar note, here is the second problem:
SELECT *, @val
FROM LINE
WHERE P_CODE IN
  (SELECT P_CODE
   FROM LINE
   HAVING COUNT(P_CODE) >
     (@val = (SELECT COUNT(*) FROM LINE) / (SELECT COUNT(DISTINCT P_CODE)
                                            FROM LINE)));
Here I'm trying to return all fields from the line table (which is basically a receipt table) for rows whose product has more items sold than the average number sold.
Is it clear what I'm trying to do with these? I'm trying to return other values as well as a calculated value that I can only compute using values from another table (in list form, if that wasn't clear). I'm not too sure, but perhaps JOIN statements might work here? I'm new to MySQL and haven't quite understood how to best employ JOIN statements yet. If someone could show me how to approach problems like these, or at least point me to a link that describes how, I would appreciate it (I've had no luck finding one).
Join with a subquery that returns the average price:
SELECT p.p_code, p.p_descript, ROUND(ABS(p.p_price - x.avg_price), 2) AS price_diff
FROM product AS p
CROSS JOIN (
SELECT AVG(p_price) AS avg_price
FROM product
) AS x
Notice that there's a built-in AVG() function; you don't need to use SUM() and COUNT(). Also, there's no point in WHERE p_price IN (SELECT p_price FROM product): that test is always true.
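On MySQL 8+ you can also skip the join entirely with a window function (my addition, a sketch; the original answer predates window functions):
SELECT p_code,
       p_descript,
       ROUND(ABS(p_price - AVG(p_price) OVER ()), 2) AS price_diff
FROM product;
For the second problem, join the line table against its per-product counts and the overall average: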
SELECT l.*
FROM line AS l
JOIN (
SELECT p_code, COUNT(*) AS code_count
FROM line
GROUP BY p_code
) AS l1 ON l.p_code = l1.p_code
JOIN (
SELECT COUNT(*)/COUNT(DISTINCT p_code) AS avg_count
FROM line
) AS x ON l1.code_count > x.avg_count
Basically, I have a product table where I'm trying to return the primary key, description, and difference between the product's price and the average price for all entries.
Using MySQL's user variables here only makes the queries harder than they need to be.
There are many methods to get what you need.
Method one
Using an inner select (a subquery):
SELECT
PRODUCT.P_CODE
, PRODUCT.P_DESCRIPT
, ROUND(ABS((PRODUCT.P_PRICE) - (SELECT AVG(PRODUCT_INNER.P_PRICE) FROM PRODUCT AS PRODUCT_INNER)), 2) AS diff
FROM
PRODUCT
Method two
Using a CROSS JOIN
Query
SELECT
PRODUCT.P_CODE
, PRODUCT.P_DESCRIPT
, ROUND(ABS(PRODUCT.P_PRICE - product_avg.avg_product_price), 2) AS diff
FROM
PRODUCT
CROSS JOIN
(SELECT
AVG(PRODUCT.P_PRICE) AS avg_product_price
FROM
PRODUCT) AS product_avg

Using SQL to calculate average number of customers

(The question included a diagram of the data structure: customers, enrollments, and sessions tables.) I am trying to write an SQL query to get the average number of customers for each session.
My attempt:
select avg(A.NumberCustomer)
from(
select SessionName, count(distinct customers.Idcustomer) as NumberCustomer,
from customers, enrollments, sessions
where customers.Idcustomer=enrollments.Idcustomer and enrollments.Idsession=sessions.Idsession
group by sessions.SessionName
) A
But I seem to get an error on the "from customers, enrollments, sessions" line.
Not sure about this; any help appreciated.
Thanks.
You have an extra comma that you should delete:
select avg(A.NumberCustomer)
from(
select SessionName,
count(distinct customers.Idcustomer) as NumberCustomer, #<--- here
from customers, enrollments, sessions
where customers.Idcustomer=enrollments.Idcustomer
and enrollments.Idsession=sessions.Idsession
group by sessions.SessionName
) A
By the way, I suggest you move to explicit (SQL-92) JOIN syntax for readability:
SELECT
avg(A.NumberCustomer)
FROM (
select
SessionName,
count(distinct customers.Idcustomer) as NumberCustomer
from customers
inner join enrollments
on customers.Idcustomer=enrollments.Idcustomer
inner join sessions
on enrollments.Idsession=sessions.Idsession
group by sessions.SessionName
) A
Also, nice diagram on the question, and remember to include your error message next time.
For the average number of customers in each session, you should be able to use just the enrollments table. The average would be the number of enrollments divided by the number of sessions:
select count(*) / count(distinct idSession)
from enrollments e;
This makes the following assumptions:
All sessions have at least one customer (your original query had this assumption as well).
No customer signs up multiple times for the same session (see the sketch below for a variant that relaxes this).
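A hedged variant (my addition) that tolerates duplicate enrollments by counting distinct (session, customer) pairs; the column names follow the query above:
select count(distinct Idsession, Idcustomer) / count(distinct Idsession)
from enrollments;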

MySQL huge tables JOIN makes database collapse

Following my recent question, Select information from last item and join to the total amount, I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale = -1 in every row. A customer can be in neither, one, or both tables. Both tables have around 300,000 records and many more fields than shown here (around 50). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
            FROM sales x
            WHERE dates = (SELECT max(dates)
                           FROM sales y
                           WHERE x.customer = y.customer
                             AND y.sale > 0)
           ) b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I have to get the results, which I need for certain calculations, separated by date: information for 2012, information for 2013, but also information for all years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information for all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I EXPLAIN the query, almost all of the time goes to creating the temporary table.
I thought of splitting the data gathering into quarters: get the results for every three months, then join and sort them. But I guess this final join and sort would be too much for the database again.
So, what would you experts recommend to optimize these queries, given that I cannot change the table structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
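Putting the index recommendations above into DDL (a sketch; the index names are my invention):
CREATE INDEX idx_sales_customer_sale ON sales (customer, sale);
CREATE INDEX idx_sales_customer_dates_sale ON sales (customer, dates, sale);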
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT(expr) ignore NULLs (though COUNT(*) does not), so you don't have to use a WHERE clause to exclude rows with sale < 0.
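A quick illustration of that behavior (a sketch, assuming sale is NULL where the value is missing):
SELECT customer,
       SUM(sale)   AS total_purchases,   # NULL sales are skipped
       COUNT(sale) AS counted_sales,     # NULL sales are skipped
       COUNT(*)    AS all_rows           # NULLs are still counted here
FROM sales
GROUP BY customer;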
Re: your comment
What I have now is a table with fields year, quarter, total_sale (relating to the pair (year, quarter)) and sale. What I want to gather is information for a certain period: this quarter, several quarters, year 2011... The info has to be split into top customers, the ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, dates). The correlated subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, take the MAX (or MIN), then SUBSTRING and CAST the result back. You may need to adjust the date-conversion function (this one works in MS-SQL), but the idea will work anywhere:
SELECT customer, count(sale),
       max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
GROUP BY customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
More on this very common challenge here.
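Since these questions are tagged MySQL, here is a hedged translation of the same trick (my sketch, using DATE_FORMAT and LPAD in place of the MS-SQL convert and str functions):
SELECT customer,
       COUNT(sale) AS total_sales,
       CAST(SUBSTRING(MAX(CONCAT(DATE_FORMAT(dates, '%Y-%m-%d %H:%i:%s'),   # 19 sortable chars
                                 LPAD(sale, 12, '0'))),                     # zero-padded to a fixed width
                      20, 12) AS DECIMAL(12, 2)) AS max_sale
FROM sales
GROUP BY customer;
The zero-padding assumes non-negative sale values; negative numbers would not sort correctly as strings.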