Give me a simple answer because I can't understand big explain.
I want to know the main work of having vs where.
Example=select custname,salary,(salary*0.04) as EPF
from customer where EPF >1200;
select custname,salary,(salary*0.04) as EPF
from customer having EPF >1200;
select custname
from customer where salary >= 50000;
select custname
from customer having salary >= 50000;
select custname,salary
from customer where salary >= 50000;
In SQL, having is reserved exclusively for aggregation queries. It is used to filter after aggregation.
MySQL extends the use of having so it can be used in non-aggregation queries. The purpose is so column aliases defined in the SELECT can be used in the having.
Hence, both these queries are valid in MySQL:
select custname
from customer
where salary >= 50000;
select custname
from customer
having salary >= 50000;
The latter is not valid in other databases -- because most (all?) other databases conform more closely to the standard in the definition of having.
I would strongly recommend using WHERE in this case, because it is the SQL standard. It is also possible that using HAVING in MySQL -- under some circumstances -- would not make use of indexes or partitions as efficiently as WHERE.
Gordon's answer is precisely correct.
I'd like only give an example (that could probably be found in any SQL manual) where using HAVING makes sense:
SELECT
manager.id
, COUNT(*) AS workers_count
FROM
user AS manager
INNER JOIN user AS worker ON (
worker.manager_id = manager.id
)
WHERE
manager.type = 'manager'
HAVING
workers_count > 3
The query will return managers who has more than 3 workers.
You cannot use workers_count in WHERE clause as this is not defined at WHERE processing stage.
Related
I've written the following query that is supposed to find the max sum of all wages of every staff member. For some reason mysql doesn't seem to like the the sub query in the from clause.
select max(sumwages)
from
(select staff.name, sum(wages) as sumwages
from staff, schedule
where staff.ssn = schedule.ssn
group by staff.ssn)
I also wrote this simpler query to test my theory. This also gives a syntax error. Is what I'm trying to do even possible or do I have to find another way?
select *
from (select ssn from staff)
Yes. In MySQL, these are referred to as derived tables. And every derived table requires an alias:
select max(sumwages)
from (select staff.name, sum(wages) as sumwages
from staff join
schedule
on staff.ssn = schedule.ssn
group by staff.ssn
) ss;
Also note the use of proper, explicit, standard, readable JOIN syntax. Never use commas in the FROM clause.
Additionally - you may avoid the subquery by
select sum(wages) as sumwages
from staff, schedule
where staff.ssn = schedule.ssn
group by staff.ssn
ORDER BY sumwages DESC LIMIT 1
This will give one row with maximal sum too...
I have this query but apparently, the WITH statement has not been implemented in some database systems as yet. How can I rewrite this query to achieve the same result.
Basically what this query is supposed to do is to provide the branch names all of all the branches in a database whose deposit total is less than the average of all the branches put together.
WITH branch_total (branch_name, value) AS
SELECT branch_name, sum (balance) FROM account
GROUP BY branch_name
WITH branch_total_avg (value) AS SELECT avg(value)
FROM branch_total SELECT branch_name
FROM branch_total, branch_total_avg
WHERE branch_total.value < branch_total_avg.value;
Can this be written any other way without the WITH? Please help.
WITH syntax was introduced as a new feature of MySQL 8.0. You have noticed that it is not supported in earlier versions of MySQL. If you can't upgrade to MySQL 8.0, you'll have to rewrite the query using subqueries like the following:
SELECT branch_total.branch_name
FROM (
SELECT branch_name, SUM(balance) AS value FROM account
GROUP BY branch_name
) AS branch_total
CROSS JOIN (
SELECT AVG(value) AS value FROM (
SELECT SUM(balance) AS value FROM account GROUP BY branch_name
) AS sums
) AS branch_total_avg
WHERE branch_total.value < branch_total_avg.value;
In this case, the WITH syntax doesn't provide any advantage, so you might as well write it this way.
Another approach, which may be more efficient because it can probably avoid the use of temporary tables in the query, is to split it into two queries:
SELECT AVG(value) INTO #avg FROM (
SELECT SUM(balance) AS value FROM account GROUP BY branch_name
) AS sums;
SELECT branch_name, SUM(balance) AS value FROM account
GROUP BY branch_name
HAVING value < #avg;
This approach is certainly easier to read and debug, and there's some advantage to writing more straightforward code, to allow more developers to maintain it without having to post on Stack Overflow for help.
Another way to rewrite this query:
SELECT branch_name
FROM account
GROUP BY branch_name
HAVING SUM(balance) < (SELECT AVG(value)
FROM (SELECT branch_name, SUM(balance) AS value
FROM account
GROUP BY branch_name) t1)
As you can see from this code the account table has nearly the same aggregate query run against it twice, once at the outer level and again nested two levels deep.
The benefit of the WITH clause is that you can write that aggregate query once give it a name and use it as many times as needed. Additionally a smart DB engine will only run that subfactored query once but use the results as often as needed.
I have what should be a simple query for any database and which always runs in MySQL but not in SQL Server
select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
group by assetid
order by ts desc
The error is:
column tagalerts.id is invalid in the select list because it is not contained in either an aggregate function or the group by clause.
It is not a simple case of adding tagalerts.id to the group by clause because the error repeats for ts and for assetid etc, implying that all the selects need to be in a group or in aggregate functions... either of which will result in a meaningless and inaccurate result.
Splitting the select into a subquery to sort and group correctly (which again works fine with MySQL, as you would expect) makes matters worse
SELECT * from
(select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
order by ts desc
)T1
group by assetid
the order by clause is invalid in views, inline functions, derived tables and expressions unless TOP etc is used
the 'correct output' should be
id ts assetid zonename battlevel
1234 a datetime 1569 Reception 0
3182 another datetime 1572 Reception 0
Either I am reading SQL Server's rules entirely wrong or this is a major flaw with that database.
How can I write this to work on both systems?
In most databases you can't just include columns that aren't in the GROUP BY without using an aggregate function.
MySql is an exception to that. But MS SQL Server isn't.
So you could keep that GROUP BY with only the "assetid".
But then use the appropriate aggregate functions for all the other columns.
Also, use the JOIN syntax for heaven's pudding sake.
A SQL like select * from table1, table2 where table1.id2 = table2.id is using a syntax from the previous century.
SELECT
MAX(node.id) AS id,
MAX(ta.ts) AS ts,
ta.assetid,
MAX(node.zonename) AS zonename,
MAX(ta.battlevel) AS battlevel
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
GROUP BY ta.assetid
ORDER BY ta.ts DESC;
Another trick to use in MS SQL Server is the window function ROW_NUMBER.
But this is probably not what you need.
Example:
SELECT id, ts, assetid, zonename, battlevel
FROM
(
SELECT
node.id,
ta.ts,
ta.assetid,
node.zonename,
ta.battlevel,
ROW_NUMBER() OVER (PARTITION BY ta.assetid ORDER BY ta.ts DESC) AS rn
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
) q
WHERE rn = 1
ORDER BY ts DESC;
I strongly suspect this query is WRONG even in MySql.
We're missing a lot of details (sample data, and we don't know which table all of the columns belong to), but what I do know is you're grouping by assetid, where it looks like one assetid value could have more than one ts (timestamp) value in the group. It also looks like you're counting on the order by ts desc to ensure both that you see recent timestamps in the results first and that each assetid group uses the most recent possible ts timestamp for that group.
MySql only guarantees the former, not the latter. Nothing in this query guarantees that each assetid is using the most recent timestamp available. You could be seeing the wrong timestamps, and then also using those wrong timestamps for the order by. This is the problem the Sql Server rule is there to stop. MySql violates the SQL standard to allow you to write that wrong query.
Instead, you need to look at each column and either add it to the group by (best when all of the values are known to be the same, anyway) or wrap it in an aggregrate function like MAX(), MIN(), AVG(), etc, so there is a deterministic result for which value from the group is used.
If all of the values for a column in a group are the same, then there's no problem adding it to the group by. If the values are different, you want to be precise about which one is chosen for the result set.
While I'm here, the tagalerts, node join syntax has been obsolete for more than 20 years now. It's also good practice to use an alias with every table and prefix every column with the alias. I mention these to explain why I changed it for my code sample below, though I only prefix columns where I am confident in which table the column belongs to.
This query should run on both databases:
SELECT ta.assetid, MAX(ta.id) "id", MAX(ta.ts) "ts",
MAX(n.zonename) "zonename", MAX(battlevel) "battlevel"
FROM tagalerts ta
INNER JOIN node n ON ta.nodeid = n.id
WHERE ack = 0
GROUP BY ta.assetid
ORDER BY ts DESC
There is also a concern here the results may be choosing values from different records in the joined node table. So if battlevel is part of the node table, you might see a result that matches a zonename with a battlevel that never occurs in any record in the data. In Sql Server, this is easily fixed by using APPLY to match only one node record to each tagalert. MySql doesn't support this (APPLY or an equivalent has been in every other major database since at least 2012), but you can simulate with it in this case with two JOINs, where the first join is a subquery that uses GROUP BY to determine values will uniquely identify the needed node record, and second join is to the node table to actually produce that record. Unfortunately, we need to know more about the tables in question to actually write this code for you.
I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and its quite good despite its title. However the chapter on group by confuses me
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way:
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
This is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.
If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
GROUP BY in MySQL sorts results. You can even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sortes the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column.
Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...
BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.
So if you have something like:
// This will NOT return unique by [id], but unique by (id,name)
SELECT DISTINCT id, name FROM some_query_with_joins
// This will select unique by [id].
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that DISTINCT keyword distinguishes rows by the first column you specified, but the DISTINCT is a general keyword in this manner.
So people you have to be careful not to take the answers above as correct for all cases... You might get confused and get the wrong results while all you wanted was to optimize!
Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!
well distinct can be slower than group by on some occasions in postgres (dont know about other dbs).
tested example:
postgres=# select count(*) from (select distinct i from g) a;
count
10001
(1 row)
Time: 1563,109 ms
postgres=# select count(*) from (select i from g group by i) a;
count
10001
(1 row)
Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
Group by is expensive than Distinct since Group by does a sort on the result while distinct avoids it. But if you want to make group by yield the same result as distinct give order by null ..
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession order by null
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query gives additionally "Using filesort" in Extra.
In MySQL, "Group By" uses an extra step: filesort. I realize DISTINCT is faster than GROUP BY, and that was a surprise.
After heavy testing we came to the conclusion that GROUP BY is faster
SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 totaal 0.0944 seconds
Weergave van records 0 - 29 ( 635 totaal, query duurde 0.0484 sec)
SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 totaal 0.2117 seconds ( almost 100% slower )
Weergave van records 0 - 29 ( 635 totaal, query duurde 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
Here is a simple approach which will print the 2 different elapsed time for each query.
DECLARE #t1 DATETIME;
DECLARE #t2 DATETIME;
SET #t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET #t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, #t1, #t2) AS varchar);
SET #t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET #t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, #t1, #t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same, or faster, than a GROUP BY. On some systems (i.e. Oracle), it might be optimized to be the same as DISTINCT for most queries. On others (such as SQL Server), it can be considerably faster.
This is not a rule
For each query .... try separately distinct and then group by ... compare the time to complete each query and use the faster ....
In my project sometime I use group by and others distinct
If you don't have to do any group functions (sum, average etc in case you want to add numeric data to the table), use SELECT DISTINCT. I suspect it's faster, but i have nothing to show for it.
In any case, if you're worried about speed, create an index on the column.
If the problem allows it, try with EXISTS, since it's optimized to end as soon as a result is found (And don't buffer any response), so, if you are just trying to normalize data for a WHERE clause like this
SELECT FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep same cardinality
A faster response would be:
SELECT FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.
in mySQL i have found that GROUP BY will treat NULL as distinct, while DISTINCT does not.
Took the exact same DISTINCT query, removed the DISTINCT, and added the selected fields as the GROUP BY, and i got many more rows due to one of the fields being NULL.
So.. I tend to believe that there is more to the DISTINCT in mySQL.