Considering the "SQL order of execution", how is it possible for the GROUP BY clause to work on a column that is created by the CASE WHEN expression in the SELECT clause?
example query:
SELECT
CASE
WHEN COM_SALES_PRC > 1000000 THEN "good job"
END AS Performance,
COUNT(CUST_SEQ_NO) AS NumberofEmployees
FROM PREP_MONTHLY_STAT
GROUP BY Performance
While the above query runs without an error, the one below gives me an error. That makes sense, since the WHERE clause would not yet know the alias assigned in the SELECT statement.
SELECT
CASE
WHEN COM_SALES_PRC > 1000000 THEN "good job"
END AS Performance,
COUNT(CUST_SEQ_NO) AS NumberofEmployees
FROM PREP_MONTHLY_STAT
WHERE Performance IS NOT NULL
SQL order of execution:
https://www.sisense.com/blog/sql-query-order-of-operations/
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
I'm pretty new to writing queries, and I would be very grateful if the answers were as detailed as possible. Thank you.
First, be aware that the existence of COUNT(...) implies a GROUP BY. If you don't have one explicitly, then the entire table is one "group". COUNT, SUM, etc. are "aggregates".
WHERE cannot refer to expressions in the SELECT. HAVING and ORDER BY can refer to aggregates only after GROUP BY has aggregated them. As for your first query: MySQL extends standard SQL by letting GROUP BY (as well as HAVING and ORDER BY) refer to aliases defined in the SELECT list; that is why GROUP BY Performance works.
WHERE Performance IS NOT NULL could be changed to HAVING Performance IS NOT NULL
(I don't like having SELECT as 5.)
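The alias behavior is easy to verify in a scratch database. Here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL (SQLite shares the alias-in-GROUP-BY/HAVING extension); the table and column names follow the question, and the data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PREP_MONTHLY_STAT (CUST_SEQ_NO INTEGER, COM_SALES_PRC INTEGER)")
conn.executemany(
    "INSERT INTO PREP_MONTHLY_STAT VALUES (?, ?)",
    [(1, 1500000), (2, 2000000), (3, 500000)],
)

# GROUP BY and HAVING can see the alias because MySQL and SQLite resolve
# SELECT-list aliases for GROUP BY / HAVING / ORDER BY, though not for WHERE.
rows = conn.execute("""
    SELECT CASE WHEN COM_SALES_PRC > 1000000 THEN 'good job' END AS Performance,
           COUNT(CUST_SEQ_NO) AS NumberofEmployees
    FROM PREP_MONTHLY_STAT
    GROUP BY Performance
    HAVING Performance IS NOT NULL
""").fetchall()
print(rows)  # [('good job', 2)]
```

Replacing HAVING with WHERE Performance IS NOT NULL is exactly the change that fails in MySQL, because WHERE is evaluated before the SELECT list exists.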
Some issues with your use of COUNT, plus more examples:
COUNT(x) checks x for being NOT NULL. Is that necessary in your case? If not, then simply say COUNT(*).
You are probably getting the total number of employees. If you just wanted the "good job" employees, then
SELECT
COUNT(*) AS GoodJobEmployees
FROM PREP_MONTHLY_STAT
WHERE COM_SALES_PRC > 1000000
If you wanted both the total and the "good job" counts:
SELECT
SUM(COM_SALES_PRC > 1000000) AS GoodJobEmployees,
COUNT(*) AS TotalEmployees
FROM PREP_MONTHLY_STAT
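The SUM(condition) trick works because MySQL (and SQLite) evaluate a comparison to 1 or 0; portable SQL would spell it SUM(CASE WHEN ... THEN 1 ELSE 0 END). A small sketch with invented data, again using sqlite3 as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PREP_MONTHLY_STAT (CUST_SEQ_NO INTEGER, COM_SALES_PRC INTEGER)")
conn.executemany("INSERT INTO PREP_MONTHLY_STAT VALUES (?, ?)",
                 [(1, 1500000), (2, 2000000), (3, 500000)])

# SUM over a boolean expression counts the rows where it is true,
# while COUNT(*) counts every row in the group.
row = conn.execute("""
    SELECT SUM(COM_SALES_PRC > 1000000) AS GoodJobEmployees,
           COUNT(*) AS TotalEmployees
    FROM PREP_MONTHLY_STAT
""").fetchone()
print(row)  # (2, 3)
```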
If your real goal is more complex, let's see it.
Related
My requirement is: I have a table, and I need to group by one of the fields and get the latest record in each group. I found this scheme on the Internet:
SELECT *
FROM (
    SELECT *
    FROM record r
    WHERE r.id IN (xx, xx, xx)
    HAVING 1
    ORDER BY r.time DESC
) a
GROUP BY a.id
The result is correct, but I can't understand the meaning of "HAVING 1" after the WHERE clause. I hope someone can give me an answer. Thank you very much.
It does nothing, just like having true would. Presumably it is a placeholder where sometimes additional conditions are applied? But since there is no group by or use of aggregate functions in the subquery, any having conditions are going to be treated no differently than where conditions.
Normally you select rows and apply where conditions, then any grouping (explicit, or implicit as in select count(*)) occurs, and the having clause can specify further constraints after the grouping.
Note that your query is not guaranteed to give the results you want; the order by in the subquery in theory has no effect on the outer query and the optimizer may skip it. It is possible the presence of having makes a difference to the optimizer, but that is not something you should rely on, certainly from one version of mysql to another.
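A deterministic alternative is to keep only the rows whose time equals the group maximum, instead of relying on ORDER BY inside a subquery. A minimal sketch, using sqlite3 and invented data (table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER, time INTEGER, payload TEXT)")
conn.executemany("INSERT INTO record VALUES (?, ?, ?)", [
    (1, 10, 'old'), (1, 20, 'new'),
    (2, 5,  'old'), (2, 7,  'new'),
])

# Latest row per id: a correlated subquery picks the maximum time for each
# group, so the result does not depend on optimizer behavior.
rows = conn.execute("""
    SELECT r.id, r.time, r.payload
    FROM record r
    WHERE r.time = (SELECT MAX(r2.time) FROM record r2 WHERE r2.id = r.id)
    ORDER BY r.id
""").fetchall()
print(rows)  # [(1, 20, 'new'), (2, 7, 'new')]
```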
There are two samples.
In the first example, using ORDER BY, it gives faster results (according to the phpMyAdmin speed report).
In the other example, without ORDER BY, it gives slower results (according to the phpMyAdmin speed report).
Isn't it unreasonable that it is quicker with ORDER BY?
The ordering doesn't matter to me; it's the speed that matters.
select bayi,tutar
from siparisler
where durum='1' and MONTH(tarih) = MONTH(CURDATE()) and YEAR(tarih) = YEAR(CURRENT_DATE())
order by id desc
Speed: 0.0006
select bayi,tutar
from siparisler
where durum='1' and MONTH(tarih) = MONTH(CURDATE()) and YEAR(tarih) = YEAR(CURRENT_DATE())
Speed: 0.7785
An order by query will never execute faster than the same query without the order by clause. Sorting rows incurs more work for the database. In the best-case scenario, the sorting becomes a no-op because MySQL fetched the rows in the correct order in the first place: but that just makes the two queries equivalent in terms of performance (it does not make the query that sorts faster).
Possibly, the results of the order by were cached already, so MySQL gives you the result directly from the cache rather than actually executing the query.
If performance is what matters most to you, let me suggest changing the where predicate so it does not apply date functions to the tarih column: such a construct prevents the database from taking advantage of an index (we say the predicate is non-SARGable). Consider:
select bayi, tutar
from siparisler
where
durum = 1
and tarih >= date_format(current_date, '%Y-%m-01')
and tarih < date_format(current_date, '%Y-%m-01') + interval 1 month
order by id desc
For performance with this query, consider an index on (durum, tarih, id desc, bayi, tutar): it should behave as a covering index, which MySQL can use to execute the entire query without even looking at the actual data.
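The half-open month range can be checked with a small, deterministic sketch. This uses sqlite3 (SQLite spells the boundary date(..., 'start of month') rather than MySQL's date arithmetic), with "today" pinned so the result is reproducible; the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE siparisler (id INTEGER, durum INTEGER, tarih TEXT, bayi TEXT, tutar REAL)")
conn.executemany("INSERT INTO siparisler VALUES (?, ?, ?, ?, ?)", [
    (1, 1, '2024-02-03', 'a', 10.0),
    (2, 1, '2024-02-28', 'b', 20.0),
    (3, 1, '2024-01-31', 'c', 30.0),   # previous month: excluded
    (4, 1, '2024-03-01', 'd', 40.0),   # next month: excluded
])

# The half-open range [start of month, start of next month) leaves the
# tarih column bare in the predicate, so an index on it stays usable.
today = '2024-02-15'  # pinned for the example; use the current date in production
rows = conn.execute("""
    SELECT bayi, tutar
    FROM siparisler
    WHERE durum = 1
      AND tarih >= date(?, 'start of month')
      AND tarih <  date(?, 'start of month', '+1 month')
    ORDER BY id
""", (today, today)).fetchall()
print(rows)  # [('a', 10.0), ('b', 20.0)]
```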
At 0.0006s, you are almost certainly measuring the performance of the query_cache rather than the execution time. Try both queries again with SELECT SQL_NO_CACHE and see what the performance difference is.
First, I recommend writing the query as:
select bayi, tutar
from siparisler p
where durum = 1 and -- no quotes assuming this is an integer
      tarih >= curdate() - interval (day(curdate()) - 1) day;
This can take advantage of an index on (durum, tarih).
But that isn't your question. It is possible that the order by could result in a radically different execution plan. This is hypothetical, but the intention is to explain how this might occur.
Let me assume the following:
The table only has an index on (id desc, durum, tarih).
The where clause matches few rows.
The rows are quite wide.
The query without the order by would probably generate an execution plan that is a full table scan. Because the rows are wide, lots of unnecessary data would be read.
The query with the order by could read the data in order and then apply the where conditions. This would be faster than the other version, because only the rows that match the where conditions would be read in.
I cannot guarantee that this is happening. But there are some counterintuitive situations that arise with queries.
You can analyze it with the EXPLAIN command, and then check the value of the type field: index versus ALL.
Example:
EXPLAIN SELECT bayi,tutar
FROM siparisler
WHERE durum='1' AND MONTH(tarih) = MONTH(CURDATE()) AND YEAR(tarih) = YEAR(CURRENT_DATE())
ORDER BY id DESC;
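As a runnable stand-in for EXPLAIN, SQLite's EXPLAIN QUERY PLAN (via sqlite3) shows the same idea: whether an index is used or the whole table is scanned. The index name here is invented for the sketch, and the exact wording of the plan text varies by version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE siparisler (id INTEGER, durum INTEGER, tarih TEXT, bayi TEXT, tutar REAL)")
conn.execute("CREATE INDEX ix_durum_tarih ON siparisler (durum, tarih)")

# With a sargable predicate (bare tarih column), the plan reports a SEARCH
# using the index instead of a full SCAN of the table.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT bayi, tutar
    FROM siparisler
    WHERE durum = 1 AND tarih >= '2024-02-01' AND tarih < '2024-03-01'
""").fetchall()
for row in plan:
    print(row[-1])
```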
When I study the SQL HAVING tutorial, it says: HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery.
Sometimes a HAVING clause is equivalent to a subquery, as in these two queries:
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
having sum(total_amt_usd) >= 250000
select *
from (
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
) as subtable
where sum_amount >= 250000
I want to know which one is recommended and the reason why this one is faster or more efficient than the other.
As with any performance question, you should try it on your data. But, the two should be essentially equivalent. If you are interested in such questions, then you should learn how to read execution plans.
Just one note about MySQL. MySQL tends to materialize subqueries. This might incur a little extra overhead by writing the group by results before filtering them, but you probably would not notice the difference.
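The equivalence is easy to confirm on a toy data set. A minimal sketch using sqlite3 (the demo. schema prefix from the tutorial is dropped, and the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, 200000.0), (1, 100000.0),   # account 1 sums to 300000
    (2, 100000.0), (2, 50000.0),    # account 2 sums to 150000
])

# HAVING filters the groups after aggregation...
having_rows = conn.execute("""
    SELECT account_id, SUM(total_amt_usd) AS sum_amount
    FROM orders
    GROUP BY account_id
    HAVING SUM(total_amt_usd) >= 250000
""").fetchall()

# ...and a WHERE on the aggregated derived table does the same thing.
subquery_rows = conn.execute("""
    SELECT * FROM (
        SELECT account_id, SUM(total_amt_usd) AS sum_amount
        FROM orders
        GROUP BY account_id
    ) AS subtable
    WHERE sum_amount >= 250000
""").fetchall()

print(having_rows)    # [(1, 300000.0)]
print(subquery_rows)  # [(1, 300000.0)]
```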
I have a question about using "group by" in MySQL: does the order of the grouping columns affect query efficiency?
1.SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `appve`, `er_ct` ORDER BY `c79fd348-cc8e-41f2-ae93-0b2b2cde8a31` DESC limit 5;
2.SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `er_ct`,`appve` ORDER BY `c79fd348-cc8e-41f2-ae93-0b2b2cde8a31` DESC limit 5;
The difference between the two statements is "GROUP BY appve, er_ct" versus "GROUP BY er_ct, appve". There is no index (combined index) on appve and er_ct. The value of "SELECT COUNT(DISTINCT er_ct) FROM TBL_547" is 7000. The value of "SELECT COUNT(DISTINCT appve) FROM TBL_547" is 3.
Here is the screenshot: http://i.stack.imgur.com/AeQy2.png
The structure: http://i.stack.imgur.com/ewgAy.png
Thanks.
Creating an index on a column used in GROUP BY will not boost your results. When you run a query, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Then, on that output table, the SQL engine applies any other operations:
- aggregation (GROUP BY)
- sorting
So you can speed up your query mostly by:
- writing smart queries, e.g. filtering only on indexed columns.
- ensuring your result set is not huge, and not accessing all joined table columns; SELECT * is total overkill in production.
I would also recommend SQL Tuning as further reading. I hope my answer helps.
The first thing that pops into my mind is the number of distinct values in both columns; you mentioned 3 and 7000, and I assume that is the main factor.
When the query optimizer (they change all the time) sees that the first grouping column is small, it will just go with the flow; but if it sees that the first column is large (7000 distinct values), it may build a temporary index on it. That operation on a large column can be slow, and that's why you get two different times for the two queries.
The below statement does not work, but I can't seem to figure out why:
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the avg's of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR
If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
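The derived-table query can be exercised on a tiny data set. In this sketch, sqlite3 stands in for MySQL, the data is invented, and LIMIT 3 stands in for LIMIT 1000:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A_table (row_type TEXT, delay_in_seconds REAL, created_at TEXT)")
conn.executemany("INSERT INTO A_table VALUES (?, ?, ?)", [
    ('x', 10.0, '2024-01-01'),
    ('x', 20.0, '2024-01-03'),
    ('y', 30.0, '2024-01-02'),
    ('y', 90.0, '2023-12-01'),   # old row: falls outside the LIMIT
])

# Take the 3 most recent rows overall, then average per row_type.
rows = conn.execute("""
    SELECT t.row_type, AVG(t.delay_in_seconds)
    FROM (
        SELECT r.row_type, r.delay_in_seconds
        FROM A_table r
        ORDER BY r.created_at DESC
        LIMIT 3
    ) t
    GROUP BY t.row_type
    ORDER BY t.row_type
""").fetchall()
print(rows)  # [('x', 15.0), ('y', 30.0)]
```

Note how the old 'y' row is excluded before grouping, which is exactly why this answers "most recent 1000 overall" rather than "most recent 1000 per type".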
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL doesn't provide support for analytic functions. It is possible to emulate one in MySQL, but the syntax is a bit involved, and it is dependent on behavior that is not guaranteed.
As an example:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT @row_ := IF(@prev_row_type = t.row_type, @row_ + 1, 1) AS row_
     , @prev_row_type := t.row_type AS row_type
     , t.delay_in_seconds
  FROM A_table t
 CROSS
  JOIN (SELECT @prev_row_type := NULL, @row_ := NULL) i
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP
BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What that's effectively doing is assigning a row number to each row. The "order by" sorts the rows in descending sequence by created_at; what we want is for the most recent row to be assigned a value of 1, the next most recent 2, etc. This numbering of rows is repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type, created_at, delay_in_seconds) to avoid an expensive "Using filesort" operation. We need at least those first two columns for that; including delay_in_seconds makes it a covering index (the query can be satisfied entirely from the index.)
The outer query then runs against the resultset returned from the view query (a "derived table"). The predicate in the WHERE filters out all rows that were assigned a row number greater than 1000; the rest is a straightforward GROUP BY and an AVG aggregate.
A LIMIT clause is entirely unnecessary. It may be possible to incorporate some additional predicates for some additional performance enhancement... like, what if we specified the most recent 1000 rows, but only those with created_at within the past 30 or 90 days?
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another approach. It's possible to use a "join" operation (of A_table with A_table) to get a row number assigned, by getting a COUNT of the number of rows that are "more recent" than each row. With large sets, however, that can produce a humongous intermediate result if we aren't careful to limit it.
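For completeness: on MySQL 8.0+ the user-variable emulation is no longer needed, because ROW_NUMBER() is a built-in window function. A hedged sketch of that approach, run here under sqlite3 (which also supports window functions since 3.25) with invented data and rn <= 2 standing in for rn <= 1000:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A_table (row_type TEXT, delay_in_seconds REAL, created_at TEXT)")
conn.executemany("INSERT INTO A_table VALUES (?, ?, ?)", [
    ('x', 10.0, '2024-01-01'),
    ('x', 20.0, '2024-01-02'),
    ('x', 90.0, '2023-01-01'),   # 3rd most recent 'x': dropped by rn <= 2
    ('y', 40.0, '2024-01-01'),
])

# ROW_NUMBER() numbers the rows of each row_type, newest first; keeping
# rn <= 2 averages the 2 most recent rows per type (use 1000 in the real query).
rows = conn.execute("""
    SELECT s.row_type, AVG(s.delay_in_seconds)
    FROM (
        SELECT row_type, delay_in_seconds,
               ROW_NUMBER() OVER (PARTITION BY row_type
                                  ORDER BY created_at DESC) AS rn
        FROM A_table
    ) s
    WHERE s.rn <= 2
    GROUP BY s.row_type
    ORDER BY s.row_type
""").fetchall()
print(rows)  # [('x', 15.0), ('y', 40.0)]
```

Unlike the user-variable version, this does not depend on any unguaranteed evaluation order.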
Write the ORDER BY at the end of the statement.
SELECT AVG(delay_in_seconds) from A_TABLE GROUP BY row_type ORDER by created_at DESC limit 1000;
See the MySQL dev site for details.