While studying a SQL HAVING tutorial, I read that HAVING is the “clean” way to filter a query that has been aggregated, but that this is also commonly done using a subquery.
Sometimes a HAVING clause is equivalent to a subquery, as in these two queries:
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
having sum(total_amt_usd) >= 250000;

select *
from (
    select account_id, sum(total_amt_usd) as sum_amount
    from demo.orders
    group by account_id
) as subtable
where sum_amount >= 250000;
I want to know which one is recommended, and why it is faster or more efficient than the other.
As with any performance question, you should try it on your data. But, the two should be essentially equivalent. If you are interested in such questions, then you should learn how to read execution plans.
Just one note about MySQL. MySQL tends to materialize subqueries. This might incur a little extra overhead by writing the group by results before filtering them, but you probably would not notice the difference.
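A quick way to check the equivalence claim on a toy data set is sketched below. It runs both queries from the question through Python's built-in sqlite3 module, so it says nothing about MySQL's execution plan or materialization. The table is named orders rather than demo.orders, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 200000), (1, 100000), (2, 50000), (3, 250000), (4, 249999.99)],
)

# Filter the aggregate directly with HAVING.
having_sql = """
    SELECT account_id, SUM(total_amt_usd) AS sum_amount
    FROM orders
    GROUP BY account_id
    HAVING SUM(total_amt_usd) >= 250000
    ORDER BY account_id
"""

# Aggregate in a derived table, then filter with WHERE.
subquery_sql = """
    SELECT *
    FROM (
        SELECT account_id, SUM(total_amt_usd) AS sum_amount
        FROM orders
        GROUP BY account_id
    ) AS subtable
    WHERE sum_amount >= 250000
    ORDER BY account_id
"""

having_rows = conn.execute(having_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()

print(having_rows)
print(subquery_rows)
print("same result:", having_rows == subquery_rows)
```

Identical rows only show that the two forms agree on results. Whether one is faster on a given server is still a question for the execution plan.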
Considering the "SQL order of execution", how is it possible for the GROUP BY statement to work on a column that is to be made by the CASE WHEN operation in the SELECT statement?
example query:
SELECT
CASE
WHEN COM_SALES_PRC > 1000000 THEN "good job"
END AS Performance,
COUNT(CUST_SEQ_NO) AS NumberofEmployees
FROM PREP_MONTHLY_STAT
GROUP BY Performance
While the above query runs without an error, the one below gives me an error. That makes sense, since the WHERE clause would not know the alias that is assigned in the SELECT statement.
SELECT
CASE
WHEN COM_SALES_PRC > 1000000 THEN "good job"
END AS Performance,
COUNT(CUST_SEQ_NO) AS NumberofEmployees
FROM PREP_MONTHLY_STAT
WHERE Performance IS NOT NULL
SQL order of execution:
https://www.sisense.com/blog/sql-query-order-of-operations/
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
I'm pretty new to writing queries, and I would be very grateful if the answers were as detailed as possible. Thank you.
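First, be aware that the existence of COUNT(...) implies a GROUP BY. If you don't have one explicitly, then the entire table is one "group". COUNT, SUM, etc. are "aggregates".
WHERE cannot refer to expressions in the SELECT. HAVING and ORDER BY can refer to aggregates only after GROUP BY has aggregated them. MySQL also lets GROUP BY, HAVING and ORDER BY refer to a SELECT alias, which is why GROUP BY Performance works even though the alias is defined in the SELECT list.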
WHERE Performance IS NOT NULL could be changed to HAVING Performance IS NOT NULL
(I don't like having SELECT as 5.)
Some issues with your use of COUNT, plus more examples:
COUNT(x) checks x for being NOT NULL. Is that necessary in your case? If not, then simply say COUNT(*).
You are probably getting the total number of employees. If you just wanted the "good job" employees, then
SELECT
COUNT(*) AS GoodJobEmployees
FROM PREP_MONTHLY_STAT
WHERE COM_SALES_PRC > 1000000
If you wanted both the total and the "good job" counts:
SELECT
SUM(COM_SALES_PRC > 1000000) AS GoodJobEmployees,
COUNT(*) AS TotalEmployees
FROM PREP_MONTHLY_STAT
If your real goal is more complex, let's see it.
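The difference between COUNT(x), COUNT(*) and the SUM(condition) trick can be seen on a handful of rows. This is only a sketch: it uses Python's built-in sqlite3 module instead of MySQL (SQLite also evaluates a comparison to 0 or 1), and the sample rows, including the NULLs, are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PREP_MONTHLY_STAT (CUST_SEQ_NO INTEGER, COM_SALES_PRC INTEGER)"
)
conn.executemany(
    "INSERT INTO PREP_MONTHLY_STAT VALUES (?, ?)",
    [
        (1, 1500000),
        (2, 900000),
        (None, 2000000),  # NULL CUST_SEQ_NO: skipped by COUNT(CUST_SEQ_NO)
        (4, 1000000),     # exactly 1000000: not counted as "good job"
        (5, None),        # NULL sales: the comparison yields NULL, which SUM ignores
    ],
)

total_rows, non_null_ids, good_job = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(CUST_SEQ_NO),
           SUM(COM_SALES_PRC > 1000000)
    FROM PREP_MONTHLY_STAT
    """
).fetchone()

# Same "good job" count, obtained by filtering rows before aggregating.
(good_job_where,) = conn.execute(
    "SELECT COUNT(*) FROM PREP_MONTHLY_STAT WHERE COM_SALES_PRC > 1000000"
).fetchone()

print(total_rows, non_null_ids, good_job, good_job_where)  # 5 4 2 2
```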
I am running the query below to retrieve the unique latest result based on a date field within the same table. But this query takes too much time as the table grows. Any suggestions to improve it are welcome.
select
t2.*
from
(
select
(
select
id
from
ctc_pre_assets ti
where
ti.ctcassettag = t1.ctcassettag
order by
ti.createddate desc limit 1
) lid
from
(
select
distinct ctcassettag
from
ctc_pre_assets
) t1
) ro,
ctc_pre_assets t2
where
t2.id = ro.lid
order by
id
Our table may contain the same row multiple times, each with a different timestamp. My objective is, based on a single column (for example assettag), to retrieve a single row for each assettag with the latest timestamp.
It's simpler, and probably faster, to find the newest date for each ctcassettag and then join back to find the whole row that matches.
This does assume that no ctcassettag has multiple rows with the same createddate; if one does, you can get back more than one row for that ctcassettag.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
INNER JOIN
(
SELECT
ctcassettag,
MAX(createddate) AS createddate
FROM
ctc_pre_assets
GROUP BY
ctcassettag
)
newest
ON newest.ctcassettag = ctc_pre_assets.ctcassettag
AND newest.createddate = ctc_pre_assets.createddate
ORDER BY
ctc_pre_assets.id
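To see both the normal case and the duplicate-date caveat, here is a small sketch of the query above run through Python's built-in sqlite3 module (not MySQL). Only the three columns used by the query are created, and the rows are made up; tag B deliberately has two rows with the same newest createddate.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ctc_pre_assets ("
    "id INTEGER PRIMARY KEY, ctcassettag TEXT, createddate TEXT)"
)
conn.executemany(
    "INSERT INTO ctc_pre_assets VALUES (?, ?, ?)",
    [
        (1, "A", "2020-01-01"),
        (2, "A", "2020-02-01"),
        (3, "B", "2020-01-15"),
        (4, "B", "2020-03-01"),
        (5, "B", "2020-03-01"),  # same newest date as id 4
    ],
)

rows = conn.execute(
    """
    SELECT ctc_pre_assets.*
    FROM ctc_pre_assets
    INNER JOIN (
        SELECT ctcassettag, MAX(createddate) AS createddate
        FROM ctc_pre_assets
        GROUP BY ctcassettag
    ) newest
        ON  newest.ctcassettag = ctc_pre_assets.ctcassettag
        AND newest.createddate = ctc_pre_assets.createddate
    ORDER BY ctc_pre_assets.id
    """
).fetchall()

latest_ids = [row[0] for row in rows]
print(latest_ids)  # [2, 4, 5]: tag B comes back twice because of the tie
```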
EDIT: To deal with multiple rows with the same date.
You haven't actually said how to pick which row you want in the event that multiple rows are for the same ctcassettag on the same createddate. So, this solution just chooses the row with the lowest id from amongst those duplicates.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
WHERE
ctc_pre_assets.id
=
(
SELECT
lookup.id
FROM
ctc_pre_assets lookup
WHERE
lookup.ctcassettag = ctc_pre_assets.ctcassettag
ORDER BY
lookup.createddate DESC,
lookup.id ASC
LIMIT
1
)
This does still use a correlated sub-query, which is slower than a simple nested-sub-query (such as my first answer), but it does deal with the "duplicates".
You can change the rules on which row to pick by changing the ORDER BY in the correlated sub-query.
It's also very similar to your own query, but with one less join.
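A sketch of the correlated version on made-up rows, again using Python's built-in sqlite3 module rather than MySQL. Tag B has two rows sharing the newest createddate, and the ORDER BY inside the subquery picks the lower id. The outer ORDER BY is added here only to make the printed output deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ctc_pre_assets ("
    "id INTEGER PRIMARY KEY, ctcassettag TEXT, createddate TEXT)"
)
conn.executemany(
    "INSERT INTO ctc_pre_assets VALUES (?, ?, ?)",
    [
        (1, "A", "2020-01-01"),
        (2, "A", "2020-02-01"),
        (3, "B", "2020-01-15"),
        (4, "B", "2020-03-01"),
        (5, "B", "2020-03-01"),  # ties with id 4 on createddate
    ],
)

rows = conn.execute(
    """
    SELECT ctc_pre_assets.*
    FROM ctc_pre_assets
    WHERE ctc_pre_assets.id = (
        SELECT lookup.id
        FROM ctc_pre_assets lookup
        WHERE lookup.ctcassettag = ctc_pre_assets.ctcassettag
        ORDER BY lookup.createddate DESC, lookup.id ASC
        LIMIT 1
    )
    ORDER BY ctc_pre_assets.id
    """
).fetchall()

picked_ids = [row[0] for row in rows]
print(picked_ids)  # [2, 4]: exactly one row per tag, lowest id wins the tie
```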
Nested queries often take longer than an equivalent query without nesting. Can you put 'explain' at the start of the query and post the results here? That will help us analyse exactly which query or table is taking longer to respond.
Check if the table has indexes. Unindexed tables are not advisable (unless there is an obvious reason to leave them unindexed) and are alarmingly slow in executing queries.
Alternatively, I think the best approach is to avoid writing nested queries altogether. Better, run each of the queries separately and then use the results (in array or list format) in the second query.
First some questions that you should at least ask yourself, but maybe also give us an answer to improve the accuracy of our responses:
Is your data normalized? If yes, maybe you should make an exception to avoid this brutal subquery problem
Are you using indexes? If yes, which ones, and are you using them to the fullest?
Some suggestions to improve the readability and maybe performance of the query:
- Use joins
- Use group by
- Use aggregators
Example (untested, so might not work, but should give an impression):
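SELECT t2.*
FROM (
-- newest createddate for each ctcassettag
SELECT ctcassettag, MAX(createddate) AS createddate
FROM ctc_pre_assets
GROUP BY ctcassettag
) ro
-- join back on tag and date to fetch the full row
INNER JOIN ctc_pre_assets t2
ON t2.ctcassettag = ro.ctcassettag
AND t2.createddate = ro.createddate
ORDER BY t2.id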
Using normalization is great, but there are a few cases where normalization causes more harm than good. This looks like one of those situations, but without your tables in front of me, I can't tell for sure.
Using distinct the way you are doing, I can't help but get the feeling you might not get all relevant results - maybe someone else can confirm or deny this?
It's not that subqueries are all bad, but they tend to create massive scalability issues if written incorrectly. Make sure you use them the right way (google it?)
Indexes can potentially save you a bunch of time, if you actually use them. It's not enough to set them up; you have to write queries that actually use your indexes. Google this as well.
Let's consider the following table.
Table:
ID
epoch_time_in_millis
counter
Query #1:
SELECT
DATE_FORMAT(FROM_UNIXTIME(epoch_time_in_millis/1000),"%Y-%m-%d") date,
SUM(counter) totalCount
FROM my_table
GROUP BY date
Query #2:
SELECT
(epoch_time_in_millis DIV 86400000 ) * 86400000 ms,
SUM(counter) totalCount
FROM my_table
GROUP BY (epoch_time_in_millis DIV 86400000) * 86400000;
My question is:
Will the above two queries show any performance difference?
If yes please let me understand why.
If no let me understand why. :p
Thanks in advance.
The best way to check performance is on your hardware using your data.
But, MySQL implements group by using a file sort algorithm. This algorithm does not generally take advantage of indexes, and especially not in your case. Hence, the work for the two queries is going to be in processing the aggregation.
The other operations are trivial. So, whether the engine does the calculation once or twice really isn't going to be relevant for the overall computation -- unless you have just a handful of rows. And, in that case, performance isn't really an issue.
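Before comparing speed, it can help to confirm that the two groupings produce the same daily buckets. The sketch below does that with Python's built-in sqlite3 module, which is an assumption: SQLite's strftime(..., 'unixepoch') stands in for DATE_FORMAT(FROM_UNIXTIME(...)), and integer / stands in for DIV. Both sides work in UTC here, whereas MySQL's FROM_UNIXTIME follows the session time zone, so results on a real server may differ at day boundaries. The rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (ID INTEGER, epoch_time_in_millis INTEGER, counter INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, 0, 1),          # 1970-01-01 00:00:00.000 UTC
        (2, 86399999, 2),   # last millisecond of 1970-01-01
        (3, 86400000, 3),   # first millisecond of 1970-01-02
        (4, 172805000, 4),  # 1970-01-03 00:00:05 UTC
    ],
)

# Query #1 style: format the timestamp as a calendar date, then group.
by_date = conn.execute(
    """
    SELECT strftime('%Y-%m-%d', epoch_time_in_millis / 1000, 'unixepoch') AS day,
           SUM(counter) AS totalCount
    FROM my_table
    GROUP BY day
    ORDER BY day
    """
).fetchall()

# Query #2 style: truncate to the start of the day with integer arithmetic.
by_bucket = conn.execute(
    """
    SELECT (epoch_time_in_millis / 86400000) * 86400000 AS ms,
           SUM(counter) AS totalCount
    FROM my_table
    GROUP BY ms
    ORDER BY ms
    """
).fetchall()

print(by_date)
print(by_bucket)
```

Both result sets carry the same per-day totals, just labelled differently (a date string versus the day's starting millisecond).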
I have this MySQL query, and because I have lots of records it gets very slow. The computers that use the software (cash registers) aren't that powerful either.
Is there a way to get the same result, but faster? Would really appreciate help!
SELECT d.sifra, COUNT(d.sifra) AS pogosti, c.*, s.Stevilka as Stev_sk FROM Cenik c, dnevna d, Podskupina s
WHERE d.sifra = c.Sifra AND d.datum >= DATE(DATE_SUB(NOW(),INTERVAL 3 DAY))
GROUP BY d.sifra ORDER BY pogosti DESC limit 27
Have you tried indexing?
You are using c.Sifra in the WHERE, so you probably want
CREATE INDEX Cenik_Sifra ON Cenik(Sifra);
You also use datum and sifra from dnevna, with datum as the filter in your WHERE, so
CREATE INDEX dnevna_ndx ON dnevna(datum, sifra);
Finally there's no JOIN condition on Podskupina, whence you draw Stevilka. Is this a constant table? As it is, you're just counting rows in Podskupina and/or getting an unspecified value out of it, unless it only has the one row.
On some versions of MySQL you might also find benefit in pre-calculating the datum:
SELECT @datum := DATE(DATE_SUB(NOW(), INTERVAL 3 DAY))
and then use @datum in your query. This might improve its chances of a good indexed performance.
Without knowing more about the structure and cardinality of the involved tables, though, there's little that can be done.
At the very least you should post the result of
EXPLAIN SELECT...(your select)
in the question.
You don't have a condition to join Podskupina s, so you get a cross join (every row paired with every row): the x rows from the join "d.sifra = c.Sifra" are multiplied by the y rows of Podskupina s.
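The multiplication is easy to reproduce on tiny tables. This sketch uses Python's built-in sqlite3 module rather than MySQL, keeps only the columns named in the question, drops the date filter, and fills the tables with made-up rows.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cenik (Sifra INTEGER)")
conn.execute("CREATE TABLE dnevna (sifra INTEGER)")
conn.execute("CREATE TABLE Podskupina (Stevilka INTEGER)")
conn.executemany("INSERT INTO Cenik VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dnevna VALUES (?)", [(1,), (1,), (2,)])
conn.executemany("INSERT INTO Podskupina VALUES (?)", [(10,), (20,), (30,), (40,)])

# Two tables with a join condition: one row per matching pair.
(joined_rows,) = conn.execute(
    "SELECT COUNT(*) FROM Cenik c, dnevna d WHERE d.sifra = c.Sifra"
).fetchone()

# Adding Podskupina without a condition repeats every joined row
# once for each row of Podskupina.
(cross_rows,) = conn.execute(
    "SELECT COUNT(*) FROM Cenik c, dnevna d, Podskupina s WHERE d.sifra = c.Sifra"
).fetchone()

print(joined_rows, cross_rows)  # 3 12
```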
This looks like a very problematic query. Do you really need to return all of c.* ? And where's the join or filter on Podskupina? Once you tighten the query, make sure you've created good indexes on the tables. For example, presuming you've already got a clustered index on a unique ID as a primary key in dnevna, performance would typically benefit by putting a secondary index on the sifra and datum columns.
I have a query which actually runs two queries on a table. I query the whole table, a datediff and then a subquery which tells me the sum of hours each unit spent in certain operational steps. The main query limits the results to the REP depot so technically I don't need to put that same criteria on the subquery since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
*,
DATEDIFF(date_shipped, date_received) as htg_days,
(SELECT SUM(t3.total_days) FROM report_tables.cycle_time_days as t3 WHERE t1.repair_order=t3.repair_order AND (operation='MFG' OR operation='ENG' OR operation='ENGH' OR operation='HOLD') GROUP BY t3.repair_order) as subt_days
FROM
report_tables.cycle_time_days as t1
WHERE
YEAR(t1.date_shipped)=2010
AND t1.depot='REP'
GROUP BY
repair_order
ORDER BY
date_shipped;
I run into this with a lot of situations but I never know if it would be better to put the filter in the sub query, main query or both.
In this example, it would actually alter the query if you moved your WHERE clause to filter by REP into the subquery. So it wouldn't be about performance at that point, it would be about getting the same result set. In general, though, if you will get the same exact result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, ie, in the subquery. Then the subquery returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
When you want to optimize a query the absolute number one place to start is to use the EXPLAIN output to see what optimizations the query parser was able to figure out and check to see what the weakest link is in the query plan. Resolve that, rinse, repeat.
You can also use explain's "extended" keyword to see the actual query it built to run, which will reveal more about its usage of your criteria. In some cases, it will optimize away duplicate conditions between parent/subqueries. In other cases, it may push the conditions down from the parent into the subquery. In some cases, for (too) complex queries, I've seen it repeat the condition when it was only specified in the query once. Thankfully, you don't have to guess: mysql's explain plan will reveal all, albeit sometimes in cryptic ways.
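The "check the plan, fix the weakest link, repeat" loop can be sketched without a MySQL server. The example below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, which is only a stand-in for MySQL's EXPLAIN: the output format is different, and the index name idx_repair_order is hypothetical. It shows the plan for a lookup by repair_order before and after an index exists.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cycle_time_days ("
    "repair_order INTEGER, depot TEXT, operation TEXT, total_days INTEGER)"
)

query = "SELECT SUM(total_days) FROM cycle_time_days WHERE repair_order = 42"


def plan_details(sql):
    """Return the human-readable detail column of each plan step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, the only option is to scan the whole table.
plan_before = plan_details(query)

# Hypothetical index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_repair_order ON cycle_time_days (repair_order)")

# With the index in place, the plan should switch to an index search.
plan_after = plan_details(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of each plan step varies between SQLite versions; the point is only that the step changes from a full scan to a search that names the index.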
I usually use a derived table as a "driver or aggregating" query, then join that result back onto whatever table I want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;