Mode calculation without a subquery field in MySQL? - mysql

In my application, each product group has many products, and each product has one manufacturer. These relations are stored by MySQL in InnoDB tables product_groups with an id field, and products with id, product_group and manufacturer fields.
Is there a way to find the most common manufacturer in each product group, without resorting to selecting subqueries?
This is how I'm doing it currently:
SELECT product_groups.id,
(
SELECT manufacturer FROM products
WHERE product_group = product_groups.id
GROUP BY manufacturer
ORDER BY count(*) DESC
LIMIT 1
) manufacturer_mode
FROM product_groups;

Try this solution:
SELECT
a.product_group,
SUBSTRING_INDEX(GROUP_CONCAT(a.manufacturer ORDER BY a.occurrences DESC SEPARATOR ':::'), ':::', 1) AS manufacturer_mode
FROM
(
SELECT
aa.product_group,
aa.manufacturer,
COUNT(*) AS occurrences
FROM
products aa
GROUP BY
aa.product_group,
aa.manufacturer
) a
GROUP BY
a.product_group
Explanation:
This still uses a form of subquery, but one which executes only once as opposed to one that executes on a row-by-row basis such as in your original example.
It works by first selecting the product_group id, the manufacturer, and the count of how many times the manufacturer appears for each particular group.
The FROM sub-select will look something like this after execution (just making up data here):
product_group | manufacturer | occurrences
---------------------------------------------------
1 | XYZ | 4
1 | Test | 2
1 | Singleton | 1
2 | Eloran | 2
2 | XYZ | 1
Now that we have the sub-select result, we need to pick out the row with the maximum value in the occurrences field for each product group.
In the outer query, we group the subselect once again by the product_group field, but this time, only the product_group field. Now when we do our GROUP BY here, we can use a really compelling function in MySQL called GROUP_CONCAT which we can use to concatenate the manufacturers together and in any order we want.
...GROUP_CONCAT(a.manufacturer ORDER BY a.occurrences DESC SEPARATOR ':::')...
Here we concatenate the manufacturers grouped under each product_group id. The ORDER BY a.occurrences DESC ensures that the manufacturer with the most appearances comes first in the concatenated list, and each manufacturer is separated by :::. The result for product_group 1 will look like:
XYZ:::Test:::Singleton
XYZ appears first since it has the highest value in the occurrences field. We only want XYZ, so we wrap the concatenation in SUBSTRING_INDEX, which lets us pick only the first element of the list based on the ::: delimiter.
The end result will be:
product_group | manufacturer_mode
---------------------------------------
1 | XYZ
2 | Eloran
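For readers who want to sanity-check the two-step logic outside the database, here is a small Python sketch (not part of the original answer). It counts occurrences per (product_group, manufacturer) pair, then keeps the most frequent manufacturer per group. The rows are made up to match the illustrative counts above, and a tie is resolved in favour of whichever pair is seen first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

# Made-up rows (product_group, manufacturer), chosen to match the sample counts above.
products = [
    (1, "XYZ"), (1, "XYZ"), (1, "XYZ"), (1, "XYZ"),
    (1, "Test"), (1, "Test"),
    (1, "Singleton"),
    (2, "Eloran"), (2, "Eloran"),
    (2, "XYZ"),
]

# Step 1 (inner query): COUNT(*) grouped by (product_group, manufacturer).
occurrences = Counter(products)

# Step 2 (outer query): keep the manufacturer with the highest count per group.
manufacturer_mode = {}
best_count = {}
for (group, manufacturer), count in occurrences.items():
    if count > best_count.get(group, 0):
        best_count[group] = count
        manufacturer_mode[group] = manufacturer

print(manufacturer_mode)  # {1: 'XYZ', 2: 'Eloran'}
```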

Related

sql min function and other column

I want to know whether it is possible to add another column to a SELECT statement that contains an aggregate function such as MIN or MAX.
example :
SELECT user_id, MAX(salary) FROM users;
Is this statement correct according to the SQL standard? It works in MySQL, but I think I read somewhere that if I put an aggregate function in the SELECT clause, I can't put anything else there except other aggregate functions or, if there is a GROUP BY, the grouped columns.
EDIT :
User(user_id, name, last_name, salary)
I want to select the user_id, the name, and the maximum salary from the User table. Is it possible to do this without a subquery?
User table:
user_id | name | last_name | salary
1 | user1 | last1 | 500
2 | user2 | last2 | 1000
3 | user3 | last3 | 750
The output must be the user_id, name, last_name, and salary of the user who has the maximum salary, so here the output must be:
2 user2 last2 1000
To start with: No,
SELECT user_id, MAX(salary) FROM users;
is not standard-compliant. You are using an aggregate function (MAX) without a GROUP BY clause. By doing so you tell the DBMS to aggregate all records into one single result row. Now what do you tell the DBMS to show in this result row? The maximum salary found in the table (MAX(salary)) and the user_id. However, there is no single user_id to show; there are possibly many different user_id values in the table. This violates the SQL standard. MySQL takes the liberty of interpreting the non-aggregated user_id as any user_id (arbitrarily picked).
So even though the query runs, its result is usually not the desired one.
This query:
SELECT user_id, name, MAX(salary) FROM users GROUP BY user_id;
on the other hand is standard-compliant. Let's see again what this query does: This time there is a GROUP BY clause telling the DBMS you want one result row per user_id. For each user_id you want to show: the user_id, the name, and the maximum salary. All these are valid expressions; the user_id is the user_id itself, the name is the one user name associated with the user_id, and the maximum salary is the user's maximum salary. The unaggregated column name is allowed, because it is functionally dependent on the grouped-by user_id. Many DBMS don't support this, though, because it can get extremely complicated to determine whether an expression is functionally dependent on the group or not.
As to how to show the user record with the maximum salary, you need a limiting clause. MySQL provides LIMIT for this, which can get you the first n rows. It doesn't deal with ties however.
SELECT * FROM users ORDER BY salary DESC LIMIT 1;
is
SELECT * FROM users ORDER BY salary DESC FETCH FIRST ROW ONLY;
in standard SQL.
In order to deal with ties, however, as in
SELECT * FROM users ORDER BY salary DESC FETCH FIRST ROW WITH TIES;
you need a subquery in MySQL, because LIMIT doesn't support this:
SELECT * FROM users WHERE salary = (SELECT MAX(salary) FROM users);
Told you there are different solutions depending on what you want....
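To make the difference between LIMIT 1 and the MAX(salary) subquery concrete, here is a minimal sketch that runs both against a table with a tie. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable anywhere), and the rows are invented; both statements are also valid MySQL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL; sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "user1", 500), (2, "user2", 1000), (3, "user3", 1000)],
)

# LIMIT 1 returns a single row even though two users share the top salary.
top_one = conn.execute(
    "SELECT user_id FROM users ORDER BY salary DESC LIMIT 1"
).fetchall()

# Comparing against MAX(salary) returns every user tied for the maximum.
top_with_ties = conn.execute(
    "SELECT user_id FROM users "
    "WHERE salary = (SELECT MAX(salary) FROM users) "
    "ORDER BY user_id"
).fetchall()

print(len(top_one))    # 1
print(top_with_ties)   # [(2,), (3,)]
```

Which of the two tied users LIMIT 1 returns is not specified, so the sketch only checks how many rows come back.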
no group by, no subquery, Easy cake
select *
from users
ORDER BY salary DESC
LIMIT 1
Let's look at an example:
mysql> select * from users;
+---------+----------+
| user_id | salary |
+---------+----------+
| 1 | 42000.00 |
| 2 | 39000.00 |
| 3 | 50000.00 |
+---------+----------+
mysql> SELECT user_id, MAX(salary) FROM users;
+---------+-------------+
| user_id | MAX(salary) |
+---------+-------------+
| 1 | 50000.00 |
+---------+-------------+
What's up with that? User 1 is not the user that has a salary of 50000.00.
mysql> SELECT user_id, MAX(salary), MIN(SALARY) FROM users;
+---------+-------------+-------------+
| user_id | MAX(salary) | MIN(SALARY) |
+---------+-------------+-------------+
| 1 | 50000.00 | 39000.00 |
+---------+-------------+-------------+
User 1 is also not the one with 39000.00. This is getting fishy, right?
When you use aggregate functions, they only apply to the column you use the function in. The user_id column doesn't magically know which row that max value came from, and show the corresponding user_id.
In that example, I query both the MAX and MIN salary. But these belong to different users! Which user_id should be shown, even if the user_id could automatically be from the row where the aggregate value comes from?
And what if two users have the same salary, which are tied for the max salary? Which user_id should be displayed?
And what if you use an aggregate function that doesn't return a value that exists on any single row?
mysql> SELECT user_id, AVG(salary) FROM users;
+---------+--------------+
| user_id | AVG(salary) |
+---------+--------------+
| 1 | 43666.666667 |
+---------+--------------+
Here's the explanation: an aggregate function causes the result to be reduced to one single row, after reading the whole group of rows. A column that is not inside an aggregate function (like user_id here) takes its value from some arbitrary row in the group of rows. Arbitrary does not mean random; in practice, it tends to be the first row MySQL reads in the group. But there's no guarantee that'll always be the case.
How useful is this? Not very. In other databases, it's not a valid query, and it will literally generate an error.
In fact, MySQL 5.7 changed the behavior, by enforcing a rule that disallows ambiguous queries. If you try to run the query above on MySQL 5.7, it'll generate an error:
ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'test.users.user_id'; this is incompatible with sql_mode=only_full_group_by
There's an option to make it act like earlier versions of MySQL. For more information on this, read: https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
As a matter of trivia, SQLite is another database that allows this kind of arbitrary result. Only in SQLite, the value of user_id would come from the last row read in the group. Go figure.
Try to use this:
-- Window functions require MySQL 8.0 or later; dept is the column to partition by.
SELECT id,
salary
FROM (SELECT id,
salary,
MAX(salary) OVER (PARTITION BY dept) AS mx_sal
FROM your_tbl) t
WHERE salary = mx_sal;

Get the greatest Year value in mysql after grouping by a column

The below table contains an id and a Year and Groups
GroupingTable
id | Year | Groups
1 | 2000 | A
2 | 2001 | B
3 | 2001 | A
Now I want to select the greatest year for each group after grouping by the Groups column.
SELECT
id,
Year,
Groups
FROM
GroupingTable
GROUP BY
`Groups`
ORDER BY Year DESC
Below is what I am expecting, even though the query above doesn't work as expected:
id | Year | Groups
2 | 2001 | B
3 | 2001 | A
You need to learn how to use aggregate functions.
SELECT
MAX(Year) AS Year,
Groups
FROM
GroupingTable
GROUP BY
`Groups`
ORDER BY Year DESC
When using GROUP BY, only the column(s) you group by are unambiguous, because they have the same value on every row of the group.
Other columns return a value arbitrarily from one of the rows in the group. Actually, this is behavior of MySQL (and SQLite), but because of the ambiguity, it's an illegal query in standard SQL and all other brands of SQL implementations.
For more on this, see my answer to Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
Your query misuses the heinously confusing nonstandard extension to GROUP BY that's built in to MySQL. Read this and weep. https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
If all you want is the year it's a snap.
SELECT MAX(Year) Year, Groups
FROM GroupingTable
GROUP BY Groups
If you want the id of the row in question, you have to do a bunch of monkey business to retrieve the column id from the above query.
SELECT a.*
FROM GroupingTable a
JOIN (
SELECT MAX(Year) Year, Groups
FROM GroupingTable
GROUP BY Groups
) b ON a.Groups = b.Groups AND a.Year = b.Year
You have to do this because the GROUP BY query yields a summary result set, and you have to join that back to the detail result set to retrieve the ID.
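The join-back pattern can be tried out with the sample rows from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for runnability only). The column names are double-quoted because GROUPS is a keyword in newer SQL dialects, and the alias max_year is introduced here purely for readability.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL, loaded with the rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE GroupingTable (id INTEGER PRIMARY KEY, "Year" INTEGER, "Groups" TEXT)'
)
conn.executemany(
    "INSERT INTO GroupingTable VALUES (?, ?, ?)",
    [(1, 2000, "A"), (2, 2001, "B"), (3, 2001, "A")],
)

# Summary query (b) finds the greatest year per group; joining it back to the
# detail rows (a) recovers the id that belongs to that year.
rows = conn.execute(
    '''
    SELECT a.id, a."Year", a."Groups"
    FROM GroupingTable a
    JOIN (
        SELECT MAX("Year") AS max_year, "Groups"
        FROM GroupingTable
        GROUP BY "Groups"
    ) b ON a."Groups" = b."Groups" AND a."Year" = b.max_year
    ORDER BY a."Groups"
    '''
).fetchall()

print(rows)  # [(3, 2001, 'A'), (2, 2001, 'B')]
```

If two rows in the same group share the greatest year, this pattern returns both of them.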

Why does count(*) restrict my search to 1 row

Using the employees database (https://github.com/datacharmer/test_db) I have (as an example) the following statement:
select first_name, count(*) from employees;
Now this would give the following output:
first_name | count(*)
Georgi | 300024
While this statement:
select first_name, (select count(*) from employees) from employees;
would give the following output:
first_name | select count(*) ...
Georgi | 300024
Bezalel | 300024
Parto | 300024
etc... | 300024
To clarify my Question: I don't understand why the first statement restricts my search query to 1 row (only Georgi is showing up) while the last statement (with the subquery) shows all names and hence doesn't restrict my search query to 1 row like the first one.
Your first Query:
Count() is an aggregate function, it counts all rows in a dataset (or a group). That is why you get only 1 row because you have only one set here.
Your second Query:
You are doing exactly the same thing (refer to first query) but now you're doing it in subquery so it runs for every row and it is again Count(*) so it returns the same value.
Solution:
You need to use the Group By to make groups in your data by first_name. And then Count the rows in every group:
Select first_name, count(first_name)
From employees
Group by first_name
It will give you count for all first_name in your dataset.
The way COUNT works in combination with the GROUP BY clause can be a little bit confusing the first time you encounter it. I will try to explain how it behaves in each of the examples you provided.
select first_name, count(*) from employees;
COUNT is an aggregate function. Such a function performs its computation on each set of rows (called a group) returned by the SELECT statement. You can create multiple groups using the GROUP BY clause, but since you don't use it in this query, the whole dataset acts as one group. Therefore, the COUNT function counts all the rows in your table.
One way to put it is that the number of rows in the final result will be equal to the number of groups.
select first_name, count(*) from employees group by first_name;
Let's assume that your table contains 6 rows in total, from which three have the first_name field equal to Foo and the other three to Bar. You will end up having two groups, and the COUNT function will count the number of rows in each group. Therefore, the result will have two rows, looking like this:
-------------------------
| first_name | COUNT(*) |
-------------------------
| Foo | 3 |
| Bar | 3 |
-------------------------
select first_name, (select count(*) from employees) from employees;
In this example, having the second query in your SELECT statement is no different from having a constant. Consider the following query:
select first_name, 5 from employees;
Your result will have two columns. The second column will always contain 5. When you run the second query as part of your SELECT statement, the result of that query is used in exactly the same way.
I hope this makes it at least a little bit clearer.
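The three cases can be compared side by side with the six-row Foo/Bar example described above. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for runnability); the emp_no column is hypothetical and only serves as a key.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL with three Foo rows and three Bar rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO employees (first_name) VALUES (?)",
    [("Foo",)] * 3 + [("Bar",)] * 3,
)

# No GROUP BY: the whole table is one group, so one row comes back.
whole_table = conn.execute("SELECT COUNT(*) FROM employees").fetchall()

# GROUP BY first_name: one row per distinct name.
per_name = conn.execute(
    "SELECT first_name, COUNT(*) FROM employees "
    "GROUP BY first_name ORDER BY first_name"
).fetchall()

# Scalar subquery: behaves like a constant attached to every row.
per_row = conn.execute(
    "SELECT first_name, (SELECT COUNT(*) FROM employees) FROM employees"
).fetchall()

print(whole_table)   # [(6,)]
print(per_name)      # [('Bar', 3), ('Foo', 3)]
print(len(per_row))  # 6
```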
No. It's correct. It counts all. Use group-by to count distinct of name.
Select first_name, count(*) from employees
Group by first_name
Or use proper count with distinct
Select first_name, count(distinct first_name) from employees
This one counts distinct values on given column name.

Join Left or WHERE solution - Most efficient?

I am learning about databases at college, and have the assignment about finding the minimum avg exam grade for a college course. I have made two solutions, but I hope you experts in here can help me with:
What is the best/most effective solution?
Solution 1:
SELECT courses.name , MIN(avg_grade)
FROM (SELECT courseCode, AVG(grade) as avg_grade
FROM exams
GROUP BY courseCode) avg_grades, courses
WHERE courses.code = avg_grades.courseCode
Solution 2:
SELECT name, min(avg_grade)
FROM (SELECT courses.name, AVG(grade) as avg_grade
FROM courses
LEFT JOIN exams on exams.courseCode = courses.code
GROUP BY courseCode) mytable
And I have been thinking about if JOIN or LEFT JOIN is the correct to use here?
Your two queries are different, so you can't really compare their efficiency: your second query will return records for courses with no exam results.
Assuming that you switch the LEFT JOIN to an INNER JOIN to make the queries comparable, I would expect the first query to be slightly more efficient since it only has one derived table, and the second has two:
Solution 1:
ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | FILTERED | EXTRA
1 | PRIMARY | <derived2> | ALL | | | | | 5 | 100 |
1 | PRIMARY | courses | ALL | | | | | 5 | 100 | Using where; Using join buffer
2 | DERIVED | exams | ALL | | | | | 5 | 100 | Using temporary; Using filesort
Solution 2:
ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | FILTERED | EXTRA
1 | PRIMARY | <derived2> | ALL | | | | | 5 | 100 |
2 | DERIVED | courses | ALL | | | | | 5 | 100 | Using temporary; Using filesort
2 | DERIVED | exams | ALL | | | | | 5 | 100 | Using where; Using join buffer
I would however check this against your own execution plans as mine was just a quick example on SQL Fiddle.
I would like to take this chance to advise against using the ANSI-89 implicit join syntax, it was replaced over 20 years ago by the explicit join syntax in the ANSI-92 standard. Aaron Bertrand has written a great article on why to switch, I won't duplicate it here.
Another, much more important point though is that your queries are not deterministic, that is to say you could run the same query twice and get 2 different results even with no underlying change in the data.
Taking your second query as an example (although you will notice both queries are wrong on the SQL-Fiddle), you have a subquery MyTable like so:
SELECT courses.name, AVG(grade) as avg_grade
FROM courses
LEFT JOIN exams on exams.courseCode = courses.code
GROUP BY courseCode
This returned a table like so:
Name | avg_grade
--------+--------------
A | 10
B | 5
C | 6
D | 7
E | 2
You may expect the query as a whole to return:
Name | avg_grade
--------+--------------
E | 2
Since 2 is the lowest average grade, and E is the name that corresponds with that. You would be wrong though, as demonstrated here you can see this actually returns:
Name | avg_grade
--------+--------------
A | 2
What is essentially happening is that MySQL calculates the minimum avg_grade correctly, but since you have not added any columns to the GROUP BY, you have given MySQL carte blanche to return any value it likes for Name.
To get the output you want, I think you need:
SELECT courses.name , MIN(avg_grade)
FROM ( SELECT courseCode, AVG(grade) as avg_grade
FROM exams
GROUP BY courseCode
) avg_grades
INNER JOIN courses
ON courses.code = avg_grades.courseCode
GROUP BY courses.Name;
Or, if you only want the course with the lowest average grade, use:
SELECT courseCode, AVG(grade) as avg_grade
FROM exams
GROUP BY courseCode
ORDER BY avg_grade
LIMIT 1;
Examples on SQL Fiddle
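As a quick check of the "lowest average" approach, the sketch below combines the derived table of per-course averages with an explicit INNER JOIN to courses, so that the course name comes from the same row as the minimum average. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for runnability), and the course codes and grades are made up.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; course codes, names and grades are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (code TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE exams (courseCode TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?)",
    [("c1", "A"), ("c2", "B"), ("c3", "E")],
)
conn.executemany(
    "INSERT INTO exams VALUES (?, ?)",
    [("c1", 10), ("c1", 10), ("c2", 4), ("c2", 6), ("c3", 1), ("c3", 3)],
)

# Average per course first, then sort by that average and keep the first row.
# Because name and avg_grade come from the same joined row, they always match.
lowest = conn.execute(
    '''
    SELECT courses.name, avg_grades.avg_grade
    FROM (
        SELECT courseCode, AVG(grade) AS avg_grade
        FROM exams
        GROUP BY courseCode
    ) avg_grades
    INNER JOIN courses ON courses.code = avg_grades.courseCode
    ORDER BY avg_grades.avg_grade
    LIMIT 1
    '''
).fetchone()

print(lowest)  # ('E', 2.0)
```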
Please excuse the laziness of what I am about to do, but I have explained this problem a lot before, and now have a standard response that I post to explain the issue of MySQL grouping. It goes into more detail than the above, and hopefully explains it further.
MySQL Implicit Grouping
I would advise avoiding the implicit grouping offered by MySQL where possible. By this I mean including columns in the select list even though they are not contained in an aggregate function or the GROUP BY clause.
Imagine the following simple table (T):
ID | Column1 | Column2 |
----|---------+----------|
1 | A | X |
2 | A | Y |
In MySQL you can write
SELECT ID, Column1, Column2
FROM T
GROUP BY Column1;
This actually breaks the SQL Standard, but it works in MySQL, however the trouble is it is non-deterministic, the result:
ID | Column1 | Column2 |
----|---------+----------|
1 | A | X |
Is no more or less correct than
ID | Column1 | Column2 |
----|---------+----------|
2 | A | Y |
So what you are saying is "give me one row for each distinct value of Column1", which both result sets satisfy, so how do you know which one you will get? Well, you don't. It seems to be a fairly popular misconception that you can add an ORDER BY clause to influence the results, so that, for example, the following query:
SELECT ID, Column1, Column2
FROM T
GROUP BY Column1
ORDER BY ID DESC;
Would ensure that you get the following result:
ID | Column1 | Column2 |
----|---------+----------|
2 | A | Y |
because of the ORDER BY ID DESC, however this is not true (as demonstrated here).
The MySQL documents state:
The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause.
So even though you have an order by this does not apply until after one row per group has been selected, and this one row is non-deterministic.
The SQL standard does allow columns in the select list that are not contained in the GROUP BY or an aggregate function; however, these columns must be functionally dependent on a column in the GROUP BY. For example, ID in the sample table is the PRIMARY KEY, so we know it is unique in the table, so the following query conforms to the SQL standard and would run in MySQL but fail in many DBMS currently (at the time of writing, PostgreSQL is the closest DBMS I know of to correctly implementing the standard):
SELECT ID, Column1, Column2
FROM T
GROUP BY ID;
Since ID is unique for each row, there can only be one value of Column1 and one value of Column2 for each ID, so there is no ambiguity about what to return for each row.
EDIT
From the SQL-2003-Standard (5WD-02-Foundation-2003-09 - page 346) - http://www.wiscorp.com/sql_2003_standard.zip
If T is a grouped table, then let G be the set of grouping columns of T. In each <value expression> contained in <select list>, each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a <set function specification> whose aggregation query is QS.

specific sql query trouble

I'm having trouble writing this query. I have two tables, vote_table and click_table. In vote_table I have two fields, id and date; the format of the date is "12/30/11 : 14:28:36". In click_table I also have two fields, id and date; the format of the date is "12.30.11".
The ids occur multiple times in both tables. What I want to do is produce a result that contains three fields: id, votes, clicks. The id column should have distinct id values, the votes column should have the number of times that id has a date matching 12/30/11% in vote_table, and the clicks column should have the number of times that id has the date 12.30.11 in click_table, so something like this:
ID | VOTES | CLICKS
001 | 24 | 50
002 | 30 | 45
Assuming that the types of the 'date' columns are actually either DATE or DATETIME (rather than, say, VARCHAR), then the required operation is fairly straight-forward:
SELECT v.id, v.votes, c.clicks
FROM (SELECT id, COUNT(*) AS votes
FROM vote_table AS v1
WHERE DATE(v1.`date`) = TIMESTAMP('2011-12-30')
GROUP BY v1.id) AS v
JOIN (SELECT id, COUNT(*) AS clicks
FROM click_table AS c1
WHERE DATE(c1.`date`) = TIMESTAMP('2011-12-30')
GROUP BY c1.id) AS c
ON v.id = c.id
ORDER BY v.id;
Note that this only shows ID values for which there is at least one vote and at least one click on the given day. If you need to see all the ID values which either voted or clicked or both, then you have to do more work.
If you have to normalize the dates because they are VARCHAR columns, the WHERE clauses become correspondingly more complex.
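One possible shape for the "more work" case, where the dates are stored as strings in the formats given in the question and every id with a vote or a click should appear, is sketched below. It is not the answer's own query: it builds the list of ids with UNION, left-joins both counts, and matches the string dates with LIKE and equality. Python's built-in sqlite3 module stands in for MySQL (an assumption for runnability), and the rows are made up.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; dates are plain strings as in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vote_table (id TEXT, date TEXT)")
conn.execute("CREATE TABLE click_table (id TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO vote_table VALUES (?, ?)",
    [
        ("001", "12/30/11 : 14:28:36"),
        ("001", "12/30/11 : 15:00:00"),
        ("002", "12/29/11 : 09:00:00"),  # different day, must not be counted
    ],
)
conn.executemany(
    "INSERT INTO click_table VALUES (?, ?)",
    [("001", "12.30.11"), ("002", "12.30.11"), ("002", "12.30.11")],
)

# ids: every id seen on the day in either table.
# v / c: per-id counts, left-joined so that a missing side shows up as 0.
rows = conn.execute(
    '''
    SELECT ids.id,
           COALESCE(v.votes, 0) AS votes,
           COALESCE(c.clicks, 0) AS clicks
    FROM (
        SELECT id FROM vote_table WHERE date LIKE '12/30/11%'
        UNION
        SELECT id FROM click_table WHERE date = '12.30.11'
    ) ids
    LEFT JOIN (
        SELECT id, COUNT(*) AS votes
        FROM vote_table
        WHERE date LIKE '12/30/11%'
        GROUP BY id
    ) v ON v.id = ids.id
    LEFT JOIN (
        SELECT id, COUNT(*) AS clicks
        FROM click_table
        WHERE date = '12.30.11'
        GROUP BY id
    ) c ON c.id = ids.id
    ORDER BY ids.id
    '''
).fetchall()

print(rows)  # [('001', 2, 1), ('002', 0, 2)]
```

String matching like this cannot use date arithmetic or range comparisons, which is one reason to prefer real DATE or DATETIME columns where the schema can be changed.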