Why does count(*) restrict my search to 1 row - mysql

Using the employees database (https://github.com/datacharmer/test_db) I have (as an example) the following statement:
select first_name, count(*) from employees;
Now this would give the following output:
first_name | count(*)
Georgi | 300024
While this statement:
select first_name, (select count(*) from employees) from employees;
would give the following output:
first_name | select count(*) ...
Georgi | 300024
Bezalel | 300024
Parto | 300024
etc... | 300024
To clarify my question: I don't understand why the first statement restricts the result to 1 row (only Georgi shows up), while the second statement (with the subquery) shows all names and so isn't restricted to 1 row like the first one.

Your first Query:
COUNT() is an aggregate function: it counts all rows in a dataset (or a group). You get only 1 row because, without a GROUP BY, the whole table forms a single group.
Your second Query:
You are doing exactly the same thing (see the first query), but now inside a subquery. Its value is used for every row of the outer query, and because it is again COUNT(*) over the whole table, it returns the same value each time.
Solution:
You need to use GROUP BY to split your data into groups by first_name, and then count the rows in every group:
Select first_name, count(first_name)
From employees
Group by first_name
It will give you the count for each first_name in your dataset.

The way COUNT works in combination with the GROUP BY clause can be a little bit confusing the first time you encounter it. I will try to explain how it behaves in each of the examples you provided.
select first_name, count(*) from employees;
COUNT is an aggregate function. Such a function performs its computation on each set of rows (called a group) returned by the SELECT statement. You can create multiple groups using the GROUP BY clause, but since you don't use it in this query, the whole dataset acts as one group. Therefore, the COUNT function counts every row in your table.
One way to put it is that the number of rows in the final result will be equal to the number of groups.
select first_name, count(*) from employees group by first_name;
Let's assume that your table contains 6 rows in total, from which three have the first_name field equal to Foo and the other three to Bar. You will end up having two groups, and the COUNT function will count the number of rows in each group. Therefore, the result will have two rows, looking like this:
-------------------------
| first_name | COUNT(*) |
-------------------------
| Foo | 3 |
| Bar | 3 |
-------------------------
select first_name, (select count(*) from employees) from employees;
In this example, having the second query in your SELECT statement is no different from having a constant. Consider the following query:
select first_name, 5 from employees;
Your result will have two columns. The second column will always contain 5. When you run the second query as part of your SELECT statement, the result of that query is used in exactly the same way.
I hope this makes it at least a little bit clearer.
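The three cases discussed above can be checked side by side. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and a hypothetical six-row employees table (three Foo, three Bar) instead of the real test database.

```python
import sqlite3

# Hypothetical miniature of the employees table: 6 rows, two distinct names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO employees (first_name) VALUES (?)",
    [("Foo",), ("Foo",), ("Foo",), ("Bar",), ("Bar",), ("Bar",)],
)

# No GROUP BY: the whole table is one group, so exactly one row comes back.
whole_table = conn.execute("SELECT count(*) FROM employees").fetchall()

# GROUP BY first_name: one result row per distinct name.
per_name = conn.execute(
    "SELECT first_name, count(*) FROM employees "
    "GROUP BY first_name ORDER BY first_name"
).fetchall()

# Scalar subquery: behaves like a constant, so no outer rows are collapsed.
with_subquery = conn.execute(
    "SELECT first_name, (SELECT count(*) FROM employees) FROM employees"
).fetchall()

print(whole_table)         # [(6,)]
print(per_name)            # [('Bar', 3), ('Foo', 3)]
print(len(with_subquery))  # 6
```

The number of result rows equals the number of groups in the first two queries, while the subquery version keeps one row per employee.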

No, the result is correct: count(*) counts all rows. Use GROUP BY to count the rows for each name.
Select first_name, count(*) from employees
Group by first_name
Or use count with distinct:
Select count(distinct first_name) from employees
This one counts the distinct values in the given column.

Related

Over() function does not cover all rows as expected

I have been practising SQL and came across behaviour I couldn't explain. (I am also the one who asked this question: "Over() function does not cover all rows in the table", but that is a different problem.)
Suppose I have a table like this:
MovieRating table:
movie_id | user_id | rating | created_at
1 | 1 | 3 | 2020-01-12
1 | 2 | 4 | 2020-02-11
1 | 3 | 2 | 2020-02-12
1 | 4 | 1 | 2020-01-01
2 | 1 | 5 | 2020-02-17
2 | 2 | 2 | 2020-02-01
2 | 3 | 2 | 2020-03-01
3 | 1 | 3 | 2020-02-22
3 | 2 | 4 | 2020-02-25
What I am trying to do is rank the movies by rating, for which I have this SQL query:
SELECT
movie_id,
rank() over(partition by movie_id order by avg(rating) desc) as rank_rate
FROM
MovieRating
From my previous question, I learnt that the over() function operates on the window selected by the query, basically the rows this query returns:
SELECT movie_id FROM MovieRating
So I would expect to see at least 3 rows here, for id 1, 2 and 3.
The result is however just one row:
{"headers": ["movie_id", "rank_rate"], "values": [[1, 1]]}
Why is that? Is something wrong with my understanding of how the over() function works?
You need an aggregation query and use RANK() window function on its results:
SELECT movie_id,
AVG(rating) AS average_rating, -- you may remove this line if you don't actually need the average rating
RANK() OVER (ORDER BY AVG(rating) DESC) AS rank_rate
FROM MovieRating
GROUP BY movie_id
ORDER BY rank_rate;
See the demo.
Your query is an aggregation query without a GROUP BY clause, which means it operates on the whole table and not on each movie_id. Such queries return only 1 row with the result of the aggregation.
When you apply the RANK() window function, it operates on that single row and not on the table.
I think you mean to get one row for each movie, with its average rating.
You should use GROUP BY, not a window function:
SELECT movie_id, AVG(rating) AS avg_rating
FROM MovieRating
GROUP BY movie_id
ORDER BY avg_rating DESC;
https://www.db-fiddle.com/f/o9qLFbJEwhaHDWoTS9Qfwp/1
The reason you only got one row is that when you use an aggregate function like AVG(), that implicitly makes the query into an aggregating query. The result of the query is one row per group.
https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html says:
If you use an aggregate function in a statement containing no GROUP BY clause, it is equivalent to grouping on all rows.
In other words, the whole table is considered one "group" if you use AVG() but don't specify a GROUP BY expression. Because the whole table is a single group, the result is one row.
Windows defined by windowing functions are not the same as groups defined by aggregate functions. The window functions are applied after the rows have been reduced by aggregation. Since there was only one group and therefore one row in your result, the rank was 1.
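To see the order of operations described above, here is a small sketch that loads the MovieRating rows from the question into an in-memory SQLite database (an assumption made for convenience; the question targets MySQL, and window functions need SQLite 3.25 or later). It runs the aggregate without GROUP BY and then the grouped query with RANK().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MovieRating "
    "(movie_id INTEGER, user_id INTEGER, rating INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO MovieRating VALUES (?, ?, ?, ?)",
    [
        (1, 1, 3, "2020-01-12"), (1, 2, 4, "2020-02-11"),
        (1, 3, 2, "2020-02-12"), (1, 4, 1, "2020-01-01"),
        (2, 1, 5, "2020-02-17"), (2, 2, 2, "2020-02-01"),
        (2, 3, 2, "2020-03-01"),
        (3, 1, 3, "2020-02-22"), (3, 2, 4, "2020-02-25"),
    ],
)

# Aggregate without GROUP BY: the whole table is one group, hence one row.
overall = conn.execute("SELECT AVG(rating) FROM MovieRating").fetchall()

# GROUP BY first reduces the table to one row per movie;
# RANK() is then applied to those three aggregated rows.
ranked = conn.execute(
    """
    SELECT movie_id,
           AVG(rating) AS average_rating,
           RANK() OVER (ORDER BY AVG(rating) DESC) AS rank_rate
    FROM MovieRating
    GROUP BY movie_id
    ORDER BY rank_rate
    """
).fetchall()

print(len(overall))  # 1
print(ranked)        # [(3, 3.5, 1), (2, 3.0, 2), (1, 2.5, 3)]
```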

SQL: Do all attributes in a GROUP BY clause need to be listed in the SELECT clause?

Say I have a table called Employee with attributes Name, Salary, Department.
I know that this will work:
SELECT Department, AVG(Salary)
FROM EMPLOYEE
GROUP BY Department;
Would it be incorrect to discard 'Department' from the SELECT clause like so:
SELECT AVG(Salary)
FROM EMPLOYEE
GROUP BY Department;
Or would it still work?
Both queries will work and produce output. However, the second query returns only one column (the average salary), so you won't be able to trace each value back to its department from the second query alone, e.g.:
Query 1 output:
dept | salary
1 | 5000
2 | 6000
Query 2 output:
salary
5000
6000
No, not necessarily. You can use the second query too, but you won't be able to see which department each average salary belongs to.
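A quick way to confirm this is to run both queries against a toy table. The sketch below assumes SQLite via Python's sqlite3 module and three made-up Employee rows; the names and salaries are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER, Department INTEGER)")
# Hypothetical sample rows: department 1 averages 5000, department 2 averages 6000.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("A", 4000, 1), ("B", 6000, 1), ("C", 6000, 2)],
)

# Grouping column included in the SELECT list.
with_dept = conn.execute(
    "SELECT Department, AVG(Salary) FROM Employee "
    "GROUP BY Department ORDER BY Department"
).fetchall()

# Grouping column omitted: still one row per department, just unlabeled.
without_dept = conn.execute(
    "SELECT AVG(Salary) FROM Employee "
    "GROUP BY Department ORDER BY Department"
).fetchall()

print(with_dept)     # [(1, 5000.0), (2, 6000.0)]
print(without_dept)  # [(5000.0,), (6000.0,)]
```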

sql min function and other column

I want to know whether it is possible to add another column to a select statement that contains an aggregate function like min, max, ...
Example:
SELECT user_id, MAX(salary) FROM users;
Is this statement correct according to the SQL standard?
It works in MySQL, but I think I read somewhere that if I put an aggregate function in the select clause, I can't put anything else there except other aggregate functions or, if there is a GROUP BY, the grouped columns.
EDIT :
User(user_id, name, last_name, salary)
I want to select the user_id, name, and the maximum salary from the User table. Is it possible to do this without a subquery?
User table:
user_id | name | last_name | salary
1 | user1 | last1 | 500
2 | user2 | last2 | 1000
3 | user3 | last3 | 750
The output must be the user_id, name, last_name, and salary of the user who has the max salary, so here the output must be:
2 user2 last2 1000
To start with: No,
SELECT user_id, MAX(salary) FROM users;
is not standard-compliant. You are using an aggregate function (MAX) without a GROUP BY clause. By doing so you tell the DBMS to aggregate all records into one single result row. Now what do you tell the DBMS to show in this result row? The maximum salary found in the table (MAX(salary)) and the user_id. However, there is no single user_id; there are possibly many different user_id values in the table. This violates the SQL standard. MySQL takes the liberty of interpreting the non-aggregated user_id as any user_id (arbitrarily picked).
So even though the query runs, its result is usually not the desired one.
This query:
SELECT user_id, name, MAX(salary) FROM users GROUP BY user_id;
on the other hand is standard-compliant. Let's see again what this query does: This time there is a GROUP BY clause telling the DBMS you want one result row per user_id. For each user_id you want to show: the user_id, the name, and the maximum salary. All these are valid expressions; the user_id is the user_id itself, the name is the one user name associated with the user_id, and the maximum salary is the user's maximum salary. The unaggregated column name is allowed, because it is functionally dependent on the grouped-by user_id. Many DBMS don't support this, though, because it can get extremely complicated to determine whether an expression is functionally dependent on the group or not.
As to how to show the user record with the maximum salary, you need a limiting clause. MySQL provides LIMIT for this, which can get you the first n rows. It doesn't deal with ties however.
SELECT * FROM users ORDER BY salary DESC LIMIT 1;
is
SELECT * FROM users ORDER BY salary DESC FETCH FIRST ROW ONLY;
in standard SQL.
In order to deal with ties, however, as in
SELECT * FROM users ORDER BY salary DESC FETCH FIRST ROW WITH TIES;
you need a subquery in MySQL, because LIMIT doesn't support this:
SELECT * FROM users WHERE salary = (SELECT MAX(salary) FROM users);
Told you there are different solutions depending on what you want....
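The difference between LIMIT 1 and the MAX subquery only shows up when there is a tie. The sketch below is an illustration under stated assumptions: SQLite via Python's sqlite3 stands in for MySQL, and the users rows are invented so that two users share the top salary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
# Hypothetical data: user2 and user3 are tied for the maximum salary.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "user1", 500), (2, "user2", 1000), (3, "user3", 1000)],
)

# LIMIT 1 returns a single row, so one of the tied users is silently dropped.
top_one = conn.execute(
    "SELECT * FROM users ORDER BY salary DESC LIMIT 1"
).fetchall()

# The subquery form returns every row whose salary equals the maximum.
all_top = conn.execute(
    "SELECT * FROM users "
    "WHERE salary = (SELECT MAX(salary) FROM users) "
    "ORDER BY user_id"
).fetchall()

print(len(top_one))  # 1
print(all_top)       # [(2, 'user2', 1000), (3, 'user3', 1000)]
```

Which of the tied rows LIMIT 1 returns is not something to rely on, which is why the sketch only prints how many rows came back.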
No GROUP BY, no subquery, piece of cake:
select *
from users
ORDER BY salary DESC
LIMIT 1
Let's look at an example:
mysql> select * from users;
+---------+----------+
| user_id | salary |
+---------+----------+
| 1 | 42000.00 |
| 2 | 39000.00 |
| 3 | 50000.00 |
+---------+----------+
mysql> SELECT user_id, MAX(salary) FROM users;
+---------+-------------+
| user_id | MAX(salary) |
+---------+-------------+
| 1 | 50000.00 |
+---------+-------------+
What's up with that? User 1 is not the user that has a salary of 50000.00.
mysql> SELECT user_id, MAX(salary), MIN(SALARY) FROM users;
+---------+-------------+-------------+
| user_id | MAX(salary) | MIN(SALARY) |
+---------+-------------+-------------+
| 1 | 50000.00 | 39000.00 |
+---------+-------------+-------------+
User 1 is also not the one with 39000.00. This is getting fishy, right?
When you use aggregate functions, they only apply to the column you use the function in. The user_id column doesn't magically know which row that max value came from, and show the corresponding user_id.
In that example, I query both the MAX and MIN salary. But these belong to different users! Which user_id should be shown, even if the user_id could automatically be from the row where the aggregate value comes from?
And what if two users have the same salary, which are tied for the max salary? Which user_id should be displayed?
And what if you use an aggregate function that doesn't return a value that exists on any single row?
mysql> SELECT user_id, AVG(salary) FROM users;
+---------+--------------+
| user_id | AVG(salary) |
+---------+--------------+
| 1 | 43666.666667 |
+---------+--------------+
Here's the explanation: an aggregate function causes the result to be reduced to one single row, after reading the whole group of rows. A column that is not inside an aggregate function (like user_id here) takes its value from some arbitrary row in the group of rows. Arbitrary does not mean random: in practice, it tends to be the first row MySQL reads in the group. But there's no guarantee that'll always be the case.
How useful is this? Not very. In other databases, it's not a valid query, and it will literally generate an error.
In fact, MySQL 5.7 changed the behavior, by enforcing a rule that disallows ambiguous queries. If you try to run the query above on MySQL 5.7, it'll generate an error:
ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'test.users.user_id'; this is incompatible with sql_mode=only_full_group_by
There's an option to make it act like earlier versions of MySQL. For more information on this, read: https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
As a matter of trivia, SQLite is another database that allows this kind of arbitrary result. Only in SQLite, the value of user_id would come from the last row read in the group. Go figure.
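Try this (window functions require MySQL 8.0 or later; your_tbl and dept are placeholders):
SELECT id,
salary
FROM (SELECT id,
salary,
MAX(salary) OVER (PARTITION BY dept) AS mx_sal -- drop PARTITION BY dept for the table-wide maximum
FROM your_tbl) t
WHERE salary = mx_sal;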

Mode calculation without a subquery field in MySQL?

In my application, each product group has many products, and each product has one manufacturer. These relations are stored by MySQL in InnoDB tables product_groups with an id field, and products with id, product_group and manufacturer fields.
Is there a way to find the most common manufacturer in each product group, without resorting to selecting subqueries?
This is how I'm doing it currently:
SELECT product_groups.id,
(
SELECT manufacturer FROM products
WHERE product_group = product_groups.id
GROUP BY manufacturer
ORDER BY count(*) DESC
LIMIT 1
) manufacturer_mode
FROM product_groups;
Try this solution:
SELECT
a.product_group,
SUBSTRING_INDEX(GROUP_CONCAT(a.manufacturer ORDER BY a.occurrences DESC SEPARATOR ':::'), ':::', 1) AS manufacturer_mode
FROM
(
SELECT
aa.product_group,
aa.manufacturer,
COUNT(*) AS occurrences
FROM
products aa
GROUP BY
aa.product_group,
aa.manufacturer
) a
GROUP BY
a.product_group
Explanation:
This still uses a form of subquery, but one which executes only once as opposed to one that executes on a row-by-row basis such as in your original example.
It works by first selecting the product_group id, the manufacturer, and the count of how many times the manufacturer appears for each particular group.
The FROM sub-select will look something like this after execution (just making up data here):
product_group | manufacturer | occurrences
---------------------------------------------------
1 | XYZ | 4
1 | Test | 2
1 | Singleton | 1
2 | Eloran | 2
2 | XYZ | 1
Now that we have the sub-select result, we need to pick out the row that has the maximum in the occurrences field for each product group.
In the outer query, we group the subselect once again by the product_group field, but this time, only the product_group field. Now when we do our GROUP BY here, we can use a really compelling function in MySQL called GROUP_CONCAT which we can use to concatenate the manufacturers together and in any order we want.
...GROUP_CONCAT(a.manufacturer ORDER BY a.occurrences DESC SEPARATOR ':::')...
What we are doing here is concatenating the manufacturers together that are grouped together per product_group id, the ORDER BY a.occurrences DESC makes sure that the manufacturer with the most appearances appears first in the concatenated list. Finally we are separating each manufacturer with :::. The result of this for product_group 1 will look like:
XYZ:::Test:::Singleton
XYZ appears first since it has the highest value in the occurrences field. We only want to select XYZ, so we wrap the concatenation in SUBSTRING_INDEX, which lets us pick only the first element of the list based on the ::: delimiter.
The end result will be:
product_group | manufacturer_mode
---------------------------------------
1 | XYZ
2 | Eloran
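The same two-step idea (count per group and manufacturer, then keep the top entry per group) can be tried out with the made-up data above. Note that this sketch swaps in a different technique for the second step: ROW_NUMBER() instead of the GROUP_CONCAT / SUBSTRING_INDEX trick, because it runs on SQLite (3.25 or later) through Python's sqlite3 module, which is an assumption chosen here for portability. The products schema is a guess based on the fields named in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(id INTEGER PRIMARY KEY, product_group INTEGER, manufacturer TEXT)"
)
# Rows chosen to reproduce the illustrative occurrence counts shown above.
sample = (
    [(1, "XYZ")] * 4 + [(1, "Test")] * 2 + [(1, "Singleton")]
    + [(2, "Eloran")] * 2 + [(2, "XYZ")]
)
conn.executemany(
    "INSERT INTO products (product_group, manufacturer) VALUES (?, ?)", sample
)

query = """
SELECT product_group, manufacturer AS manufacturer_mode
FROM (
    SELECT a.product_group,
           a.manufacturer,
           -- number the manufacturers within each group, most frequent first
           ROW_NUMBER() OVER (
               PARTITION BY a.product_group
               ORDER BY a.occurrences DESC
           ) AS rn
    FROM (
        -- step 1: occurrences per (product_group, manufacturer)
        SELECT product_group, manufacturer, COUNT(*) AS occurrences
        FROM products
        GROUP BY product_group, manufacturer
    ) AS a
) AS ranked
WHERE rn = 1
ORDER BY product_group
"""
modes = conn.execute(query).fetchall()
print(modes)  # [(1, 'XYZ'), (2, 'Eloran')]
```

As with the GROUP_CONCAT version, a tie between two manufacturers in the same group would be resolved arbitrarily unless a tie-breaking column is added to the ORDER BY.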

How does this count work?

My query is given below:
select vend_id,
COUNT(*) as num_prods
from Products
group by vend_id;
Please tell me how this part works: select vend_id, COUNT(vend_id) as opposed to select COUNT(vend_id)?
select COUNT(vend_id)
That will return the number of rows where the vendor ID is not null
select vend_id, COUNT(*) as num_prods
from Products
group by vend_id
That will group the rows by vendor ID and return, for each ID, how many rows it has.
An example:
ID name salary start_date city region
----------- ---------- ----------- ----------------------- ---------- ------
1 Jason 40420 1994-02-01 00:00:00.000 New York W
2 Robert 14420 1995-01-02 00:00:00.000 Vancouver N
3 Celia 24020 1996-12-03 00:00:00.000 Toronto W
4 Linda 40620 1997-11-04 00:00:00.000 New York N
5 David 80026 1998-10-05 00:00:00.000 Vancouver W
6 James 70060 1999-09-06 00:00:00.000 Toronto N
7 Alison 90620 2000-08-07 00:00:00.000 New York W
8 Chris 26020 2001-07-08 00:00:00.000 Vancouver N
If you run this query, you will get one row per city, and you can apply a function (in this case, count) to each group. So, for each city, you will get the count of rows. You can also use other aggregate functions.
SELECT City, COUNT(*) as Employees
FROM Employee
GROUP BY City
The result is:
City Employees
--------- ---------
New York 3
Toronto 2
Vancouver 3
so you can compare the number of rows for each city.
When you simply select COUNT(vend_id) with no GROUP BY clause, you get one row with the total count of rows with a non-NULL vendor ID - that last bit is important and is one reason why you may prefer COUNT(*) so as to avoid "missing" rows. Some people may argue that COUNT(*) is somehow less efficient but that's true in no DBMS I've used. In any case, if you are using a brain-dead DBMS, you can always try COUNT(1).
When you group by vend_id, you get one row per vendor ID with the count being the number of rows for that ID.
In step-by-step detail (conceptually, though there are almost certainly efficiencies to be gained by optimising), the first query:
SELECT COUNT(vend_id) AS num_prods FROM products
Get a list of all rows in products.
Count the rows where vend_id is not NULL, then deliver one row containing that count in the single num_prods column.
For the grouping one:
SELECT vend_id, COUNT(vend_id) AS num_prods FROM products GROUP BY vend_id
Get a list of all rows in products.
For each value of vend_id:
Count the rows matching that vend_id where vend_id is not NULL, then deliver one row containing the vend_id in the first column and that count in the second num_prods column.
Note that those rows with a null vend_id do not contribute to the aggregate function (count in this case).
In the first query, that simply means they don't appear in the overall total.
In the second case, it means that the output row still exists but the count will be zero. That's another good reason to use COUNT(*) or COUNT(1).
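The NULL handling described above can be verified with a few rows. This is a minimal sketch, assuming SQLite through Python's sqlite3 module and a hypothetical Products table in which one product has no vendor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (prod_id INTEGER PRIMARY KEY, vend_id INTEGER)")
# Hypothetical data: two products for vendor 1, one for vendor 2, one with NULL.
conn.executemany(
    "INSERT INTO Products (vend_id) VALUES (?)",
    [(1,), (1,), (2,), (None,)],
)

# Without GROUP BY: COUNT(*) counts every row, COUNT(vend_id) skips the NULL.
totals = conn.execute(
    "SELECT COUNT(*), COUNT(vend_id) FROM Products"
).fetchone()

# With GROUP BY: the NULL group still yields a row, but COUNT(vend_id) is 0.
per_vendor = {
    vend_id: (count_col, count_star)
    for vend_id, count_col, count_star in conn.execute(
        "SELECT vend_id, COUNT(vend_id), COUNT(*) FROM Products GROUP BY vend_id"
    )
}

print(totals)            # (4, 3)
print(per_vendor[None])  # (0, 1)
print(per_vendor[1])     # (2, 2)
```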
select vend_id selects only the vend_id field, whereas select * selects all the fields.
select vend_id, COUNT(vend_id) and select COUNT(vend_id) give the same result for the count column as long as you use group by vend_id. When you use select vend_id, COUNT(vend_id), you must group by vend_id.