I've written the following query that is supposed to find the max sum of all wages of every staff member. For some reason mysql doesn't seem to like the the sub query in the from clause.
select max(sumwages)
from
(select staff.name, sum(wages) as sumwages
from staff, schedule
where staff.ssn = schedule.ssn
group by staff.ssn)
I also wrote this simpler query to test my theory. This also gives a syntax error. Is what I'm trying to do even possible or do I have to find another way?
select *
from (select ssn from staff)
Yes. In MySQL, these are referred to as derived tables. And every derived table requires an alias:
select max(sumwages)
from (select staff.name, sum(wages) as sumwages
from staff join
schedule
on staff.ssn = schedule.ssn
group by staff.ssn
) ss;
Also note the use of proper, explicit, standard, readable JOIN syntax. Never use commas in the FROM clause.
Additionally - you may avoid the subquery by
select sum(wages) as sumwages
from staff, schedule
where staff.ssn = schedule.ssn
group by staff.ssn
ORDER BY sumwages DESC LIMIT 1
This will give one row with maximal sum too...
Related
Give me a simple answer because I can't understand big explain.
I want to know the main work of having vs where.
Example=select custname,salary,(salary*0.04) as EPF
from customer where EPF >1200;
select custname,salary,(salary*0.04) as EPF
from customer having EPF >1200;
select custname
from customer where salary >= 50000;
select custname
from customer having salary >= 50000;
select custname,salary
from customer where salary >= 50000;
In SQL, having is reserved exclusively for aggregation queries. It is used to filter after aggregation.
MySQL extends the use of having so it can be used in non-aggregation queries. The purpose is so column aliases defined in the SELECT can be used in the having.
Hence, both these queries are valid in MySQL:
select custname
from customer
where salary >= 50000;
select custname
from customer
having salary >= 50000;
The latter is not valid in other databases -- because most (all?) other databases conform more closely to the standard in the definition of having.
I would strongly recommend using WHERE in this case, because it is the SQL standard. It is also possible that using HAVING in MySQL -- under some circumstances -- would not make use of indexes or partitions as efficiently as WHERE.
Gordon's answer is precisely correct.
I'd like only give an example (that could probably be found in any SQL manual) where using HAVING makes sense:
SELECT
manager.id
, COUNT(*) AS workers_count
FROM
user AS manager
INNER JOIN user AS worker ON (
worker.manager_id = manager.id
)
WHERE
manager.type = 'manager'
HAVING
workers_count > 3
The query will return managers who has more than 3 workers.
You cannot use workers_count in WHERE clause as this is not defined at WHERE processing stage.
The following query works just fine. I am using a value from the outer select to filter inside the inner select.
SELECT
bk.ID,
(SELECT COUNT(*) FROM guests WHERE BookingID = bk.ID) as count
FROM
bookings bk;
However, the following select will not work:
SELECT
bk.ID,
(SELECT SUM(count) FROM (SELECT COUNT(*) AS count FROM guests WHERE BookingID = bk.ID GROUP BY RoomID) sub) as sumcount
FROM
bookings bk;
The error message is: Error Code: 1054. Unknown column 'bk.ID' in 'where clause'
Why is my alias bk known in the subselect, but not in the subselect of the subselect?
For the record, I am using MySQL 5.6.
This is called "scoping". I know that Oracle (for instance) only looks out one level for resolving table aliases. SQL Server is also consistent: it looks out more than one level.
Based on this example, MySQL clearly limits the scope of the identifier bk to the immediate subquery. There is a small hint in the documentation (emphasis mine):
A correlated subquery is a subquery that contains a reference to a
table that also appears in the outer query.
However, I have not found any other specific reference to the scoping rules in the documentation. There are other answers (here and here) that specify that the scope of a table alias is limited to one level of subquery.
You already know how to fix the problem (your two queries should produce identical results). Re-arranging the query to have joins and aggregations can also resolve this problem.
Correlated Scalar Subqueries in the SELECT list can usually be rewritten to a LEFT JOIN on a Derived Table (and in many cases they might perform better then):
SELECT
bk.ID,
dt.sumcount
FROM
bookings bk
LEFT JOIN
(SELECT BookingID,SUM(COUNT) AS sumcount
FROM
(
SELECT BookingID, RoomId, COUNT(*) AS COUNT
FROM guests
GROUP BY BookingID, RoomID
) sub
) AS dt
ON bk.BookingID = dt.BookingID
I have 2 tables called Orders and Salesperson shown below:
And I want to retrieve the names of all salespeople that have more than 1 order from the tables above.
Then firing following query shows an error:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The error is:
Column 'Name' is invalid in the select list because it is
not contained in either an aggregate function or
the GROUP BY clause.
From the error and searching it on google, I could understand that the error is because of Name column must be either a part of the group by statement or aggregate function.
Also I tried to understand why does the selected column have to be in the group by clause or art of an aggregate function? But didn't understand clearly.
So, how to fix this error?
SELECT max(Name) as Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The basic idea is that columns that are not in the group by clause need to be in an aggregate function now here due to the fact that the name is probably the same for every salesperson_id min or max make no real difference (the result is the same)
example
Looking at your data you have 3 entry's for Dan(7) now when a join is created the with row Dan (Name) gets multiplied by 3 (For every number 1 Dan) and then the server does not now witch "Dan" to pick cos to the server that are 3 lines even doh they are semantically the same
also try this so that you see what I am talking about:
SELECT Orders.Number, Salesperson.Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
As far as the query goes INNER JOIN is a better solution since its kinda the standard for this simple query it should not matter but in some cases can happen that INNER JOIN produces better results but as far as I know this is more of a legacy thing since this days the server should pretty much produce the same execution plan.
For code clarity I would stick with INNER JOIN
Assuming the name is unique to the salesperson.id then simply add it to your group by clause
GROUP BY salesperson_id, salesperson.Name
Otherwise use any Agg function
Select Min(Name)
The reason for this is that SQL doesn't know whether there are multiple name per salesperson.id
For readability and correctness, I usually split aggregate queries into two parts:
The aggregate query
Any additional queries to support fields not contained in aggregate functions
So:
1.Aggregate query - salespeople with more than 1 order
SELECT salesperson_id
FROM ORDERS
GROUP BY salespersonId
HAVING COUNT(Number) > 1
2.Use aggregate as subquery (basically a select joining onto another select) to join on any additional fields:
SELECT *
FROM Salesperson SP
INNER JOIN
(
SELECT salesperson_id
FROM ORDERS
GROUP BY salespersonId
HAVING COUNT(Number) > 1
) AGG_QUERY
ON AGG_QUERY.salesperson_id = SP.ID
There are other approaches, such as selecting the additional fields via aggregation functions (as shown by the other answers). These get the code written quickly so if you are writing the query under time pressure you may prefer that approach. If the query needs to be maintained (and hence readable) I would favour subqueries.
I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and its quite good despite its title. However the chapter on group by confuses me
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way:
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
This is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.
SELECT alert,
(select created_at from alerts
WHERE alert = #ALERT ORDER BY created_at desc LIMIT 1)
AS latest FROM alerts GROUP BY alert;
I am having an issue with the above query where I would like to pass in each alert into the subquery so that I have a column called latest which displays the latest alert for each group of alerts. How should I do this?
This is called a correlated subquery. To make it work, you need table aliases:
SELECT a1.alert,
(select a2.created_at
from alerts a2
WHERE a2.alert = a1.alert
ORDER BY a2.created_at desc
LIMIT 1
) AS latest
FROM alerts a1
GROUP BY a1.alert;
Table aliases are a good habit to get into, because they often make the SQL more readable. It is also a good idea to use table aliases with column references, so you easily know where the column is coming from.
EDIT:
If you really want the latest, you can get it by simply doing:
select alert, max(created_at)
from alerts
group by alert;
If you are trying to get the latest created_at date for each group of alerts, there is a simpler way.
SELECT
alert,
max (created_at) AS latest
FROM
alerts
GROUP BY
alert;
I would do the following
SELECT
alert_group_name,
MAX(created_at) AS latest
FROM
alerts A
GROUP BY
alert_group_name;
For a correlated subquery, you need to reference an expression from the outer query.
The best way to do that is to assign an alias to the table on the outer query, and then reference that in the inner query. Best practice is to assign an alias to EVERY table reference, and qualify EVERY column reference.
All that needs to be done to "fix" your query is to replace the reference to "#ALERT" with a reference to the alert column from the table on the outer query.
In our shop, that statement would be formatted something like this:
SELECT a.alert
, (SELECT l.created_at
FROM alerts l
WHERE l.alert = a.alert
ORDER BY l.created_at DESC
LIMIT 1
) AS latest
FROM alerts a
GROUP
BY a.alert
Not because that's easier to write that way, but more importantly it's easier to read and understand what the statement is doing.
The correlated subquery approach can be efficient for a small number of rows returned (a very restrictive WHERE clause on the outermost query.) But in general, correlated subqueries in the SELECT list can make for a (what we refer to in our shop) an LDQ "light dimming query".
In our shop, if we needed the resultset returned by that query, that statement would likely be rewritten as:
SELECT a.alert
, MAX(a.created_at) AS latest
FROM alerts a
GROUP
BY a.alert
And we'd definitely have an index defined ON alerts(alert,created_at) (or an index with additional columns after those first two.)
size, we
(I don't anticipate any cases where this statement would return a different result.)