Let's say I have a plant table:
id fruit
1 banana
2 apple
3 orange
I can run these:
SELECT * FROM plant ORDER BY id;
SELECT * FROM plant ORDER BY fruit DESC;
which do the obvious thing.
But I was bitten by this: what do these do?
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
All of these return just one row (the one with id = 1).
What's happening under the hood?
In which scenarios does an aggregate function come in handy in ORDER BY?
Your results are more clear if you actually select the aggregate values instead of columns from the table:
SELECT SUM(id) FROM plant ORDER BY SUM(id)
This returns the sum of all ids. It is of course a useless example, because the aggregation always produces exactly one row, so there is nothing to order. The reason you get a row with columns in your query is that MySQL picks one row from the table; the choice is not random, but it is not deterministic either. It just so happens to be the first row in the table in your case, but others may get another row depending on storage engine, primary keys and so on. An aggregate that appears only in the ORDER BY clause is thus not very useful.
What you usually want to do is group by a certain field and then order the result set in some way:
SELECT fruit, COUNT(*)
FROM plant
GROUP BY fruit
ORDER BY COUNT(*)
Now that's a more interesting query! This will give you one row for each fruit together with the total count for that fruit. Try adding some more apples and the ordering will actually start making sense:
Complete table:
+----+--------+
| id | fruit |
+----+--------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
| 4 | apple |
| 5 | apple |
| 6 | banana |
+----+--------+
The query above:
+--------+----------+
| fruit | COUNT(*) |
+--------+----------+
| orange | 1 |
| banana | 2 |
| apple | 3 |
+--------+----------+
All of these queries raise an error on any SQL platform that complies with the SQL standard.
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
On PostgreSQL, for example, all those queries will raise the same error.
ERROR: column "plant.id" must appear in the GROUP BY clause or be
used in an aggregate function
That means you're combining an aggregate function with nonaggregated columns without using GROUP BY. SQL Server and Oracle return similar error messages.
MySQL's GROUP BY is known to be broken in several respects, at least as far as standard behavior is concerned. But the queries you posted were a new broken behavior to me, so +1 for that.
Instead of trying to understand what it's doing under the hood, you're probably better off learning to write standard GROUP BY queries. MySQL will process standard GROUP BY statements correctly, as far as I know.
Earlier versions of MySQL docs warned you about GROUP BY and hidden columns. (I don't have a reference, but this text is cited all over the place.)
Do not use this feature if the columns you omit from the GROUP BY part
are not constant in the group. The server is free to return any value
from the group, so the results are indeterminate unless all values are
the same.
More recent versions are a little different.
You can use this feature to get better performance by avoiding
unnecessary column sorting and grouping. However, this is useful
primarily when all values in each nonaggregated column not named in
the GROUP BY are the same for each group. The server is free to choose
any value from each group, so unless they are the same, the values
chosen are indeterminate.
Personally, I don't consider indeterminate a feature in SQL.
When you use an aggregate like that, the query gets an implicit group by where the entire result is a single group.
Using an aggregate in order by is only useful if you also have a group by, so that you can have more than one row in the result.
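As a sketch of that useful case, the aggregate below appears only in ORDER BY, while the select list contains just the grouping column. It assumes the six-row plant table shown earlier; the column types and the fruit tie-breaker are my own additions to make the example self-contained and the order deterministic.

```sql
-- Assumed setup: the six-row plant table from the example above.
CREATE TABLE plant (
    id    INT PRIMARY KEY,
    fruit VARCHAR(20) NOT NULL
);

INSERT INTO plant (id, fruit) VALUES
    (1, 'banana'), (2, 'apple'), (3, 'orange'),
    (4, 'apple'),  (5, 'apple'), (6, 'banana');

-- One row per fruit, most frequent first. The count is used for sorting
-- only and is not returned; fruit breaks ties between equal counts.
SELECT fruit
FROM plant
GROUP BY fruit
ORDER BY COUNT(*) DESC, fruit;
```

With this data the rows come back as apple, banana, orange. Because every selected column is also in the GROUP BY clause, the query is valid standard SQL and does not rely on any MySQL extension.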
I have the following table in MySQL:
Name Age Group
abel 7 A
joe 6 A
Rick 7 A
Diana 5 B
Billy 6 B
Pat 5 B
I want to randomize the rows, but they should still remain grouped by the Group column.
For example, I want my result to look something like this:
Name Age Group
joe 6 A
abel 7 A
Rick 7 A
Billy 6 B
Pat 5 B
Diana 5 B
What query should I use to get this result? The entire table should be randomized and then grouped by the "Group" column.
What you describe in your question as GROUPing is more correctly described as sorting. This is a particular issue when talking about SQL databases where "GROUP" means something quite different and determines the scope of aggregation operations.
Indeed, "group" is a reserved word in SQL, so although MySQL and some other SQL databases let you work around this by quoting it, it is a poor choice for an attribute name.
SELECT *
FROM yourtable
ORDER BY `group`
Using random values also has a lot of semantic confusion. A truly random number would have a different value every time it is retrieved - which would make any sorting impossible (and databases do a lot of sorting which is normally invisible to the user). As long as the implementation uses a finite time algorithm such as quicksort that shouldn't be a problem - but a bubble sort would never finish, and a merge sort could get very confused.
There are also degrees of randomness. There are different algorithms for generating random numbers. For encryption it's critical that the random numbers be evenly distributed and completely unpredictable. Often these will use hardware events (sometimes even dedicated hardware), but I don't expect you would need that. But do you want the ordering to be repeatable across invocations?
SELECT *
FROM yourtable
ORDER BY `group`, RAND()
...will give different results each time.
On the other hand,
SELECT *
FROM yourtable
ORDER BY `group`, MD5(CONCAT(age, name, `group`))
...would always give the results sorted in the same order, while
SELECT *
FROM yourtable
ORDER BY `group`, MD5(CONCAT(CURDATE(), age, name, `group`))
...will give different results on different days.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(name VARCHAR(12) NOT NULL
,age INT NOT NULL
,my_group CHAR(1) NOT NULL
);
INSERT INTO my_table VALUES
('Abel',7,'A'),
('Joe',6,'A'),
('Rick',7,'A'),
('Diana',5,'B'),
('Billy',6,'B'),
('Pat',5,'B');
SELECT * FROM my_table ORDER BY my_group,RAND();
+-------+-----+----------+
| name | age | my_group |
+-------+-----+----------+
| Joe | 6 | A |
| Abel | 7 | A |
| Rick | 7 | A |
| Pat | 5 | B |
| Diana | 5 | B |
| Billy | 6 | B |
+-------+-----+----------+
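Shuffle the rows first, then sort by the Group column. Note that Group is a reserved word and must be quoted, and that the outer sort is not guaranteed to preserve the random order produced by the subquery:
SELECT Name, Age, `Group`
FROM (
SELECT *
FROM yourtable
ORDER BY RAND()
) t
ORDER BY `Group`
Try this:
SELECT * FROM yourtable ORDER BY `Group`, RAND()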
I've found many similar questions but have not been able to understand / apply the answers; and I don't really know what to search for...
I have 2 tables (docs and words) which have a many-to-many relationship. I am trying to generate a list of the top 5 most frequently used words that DO NOT appear in a specified doc.
To this end I have 2 MySQL queries, each of which takes me part way to achieving my goal:
Query #1 - returns words sorted by frequency of use, falls short because it also returns ALL words (SQLFiddle.com)
SELECT `words_idwords` as wdID, COUNT(*) as freq
FROM docs_has_words
GROUP BY `words_idwords`
ORDER BY freq DESC, wdID ASC
Query #2 - returns words that are missing from specified document, falls short because it does not sort by frequency of use (SQLFiddle.com)
SELECT wordscol as wrd, idwords as wID
FROM `words` where NOT `idwords`
IN (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
But what I want the output to look like is:
| idwords | wordscol | freq |
|---------|----------|------|
| 8       | Dog      | 3    |
| 3       | Ape      | 2    |
| 4       | Bear     | 1    |
| 6       | Cat      | 1    |
| 7       | Cheetah  | 1    |
| 5       | Beaver   | 0    |
Note: `Dolphin`, one of the most frequently used words, is NOT in the
list because it is already in the document iddocs = 1
Note: `Beaver`, is a "never used word" BUT is in the list because it is
in the main word list
And the question is: how can I combine these two queries, or otherwise get my desired output?
Basic requirements:
- 3 column output
- results sorted by frequency of use, even if use is zero
Updates:
In light of some comments, the approach that I was thinking of when I came up with the 2 queries was:
Step 1 - find all the words that are in the main word list but not used in document 1
Step 2 - rank words from Step 1 according to how many documents use them
Once I had the 2 queries I thought it would be easy to combine them with a where clause, but I just can't get it working.
A hack solution could be based on adding a dummy document that contains all the words and then subtract 1 from freq (but I'm not that much of a hack!).
I see now what the problem is. I was misled by your statement regarding the results of the 1st query (emphasis is mine):
returns words sorted by frequency of use, falls short because it also returns ALL words
This query does not return all words, it only returns all used words.
So, you need to left join the words table on docs_has_words table to get all words and eliminate the words that are associated with doc 1:
SELECT w.idwords as wdID, w.wordscol, COUNT(d.words_idwords) as freq
FROM words w
LEFT JOIN `docs_has_words` d on w.idwords=d.words_idwords
WHERE w.idwords not in (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
GROUP BY w.idwords
ORDER BY freq DESC, wdID ASC;
See sqlfiddle
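For readers without access to the fiddle, here is a self-contained sketch of the same approach. The column types, the sample rows, and the id 9 for Dolphin are hypothetical, chosen only so that the counts match the desired output in the question. I have also added LIMIT 5 for the "top 5" requirement and listed wordscol in the GROUP BY so that the query stays valid under ONLY_FULL_GROUP_BY.

```sql
-- Hypothetical schema and data, chosen to reproduce the desired output.
CREATE TABLE words (
    idwords  INT PRIMARY KEY,
    wordscol VARCHAR(45) NOT NULL
);

CREATE TABLE docs_has_words (
    docs_iddocs   INT NOT NULL,
    words_idwords INT NOT NULL,
    PRIMARY KEY (docs_iddocs, words_idwords)
);

INSERT INTO words (idwords, wordscol) VALUES
    (3, 'Ape'), (4, 'Bear'), (5, 'Beaver'), (6, 'Cat'),
    (7, 'Cheetah'), (8, 'Dog'), (9, 'Dolphin');

INSERT INTO docs_has_words (docs_iddocs, words_idwords) VALUES
    (1, 9),
    (2, 3), (2, 4), (2, 8), (2, 9),
    (3, 3), (3, 6), (3, 8), (3, 9),
    (4, 7), (4, 8);

-- Start from words so that never-used words survive the LEFT JOIN with a
-- count of 0, exclude everything that appears in document 1, then rank.
SELECT w.idwords, w.wordscol, COUNT(d.words_idwords) AS freq
FROM words w
LEFT JOIN docs_has_words d ON d.words_idwords = w.idwords
WHERE w.idwords NOT IN (
    SELECT words_idwords
    FROM docs_has_words
    WHERE docs_iddocs = 1
)
GROUP BY w.idwords, w.wordscol
ORDER BY freq DESC, w.idwords ASC
LIMIT 5;
```

With this sample data the query returns Dog (3), Ape (2), Bear (1), Cat (1) and Cheetah (1). Beaver, with a count of 0, would be the sixth row if the LIMIT were removed, and Dolphin is excluded because it appears in document 1.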
I think @Shadow has it right in his comment: you just need to add the WHERE clause like this (sqlFiddle):
SELECT
`words_idwords` as wdID,
COUNT(*) as freq
FROM docs_has_words
WHERE NOT `words_idwords` IN (SELECT `words_idwords` FROM `docs_has_words` WHERE `docs_iddocs` = 1)
GROUP BY `words_idwords`
ORDER BY freq DESC, wdID ASC
Does this produce the output you need?
I have the following (intentionally denormalized for demonstrating purposes) sample CARS table:
| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
| 1 | 1 | John | White |
| 2 | 1 | John | Black |
| 3 | 2 | Mike | White |
| 4 | 2 | Mike | Black |
| 5 | 2 | Mike | Brown |
| 6 | 3 | Tony | White |
If I wanted to count the number of cars per owner and return this:
| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
| 1 | John | 2 |
| 2 | Mike | 3 |
| 3 | Tony | 1 |
I know I can write the following query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name
However, removing owner_name from the GROUP BY clause gives me the same results.
What is the difference between those 2 queries?
Under what circumstances should I group by all non-aggregated fields in the SELECT statement, and in which ones shouldn't I?
Can you give an example in which this grouping would return different results when removing a non-aggregated field and explain why?
The first thing to make clear is that SQL is not MySQL.
In standard SQL it is not allowed to group by a subset of the non-aggregated fields. The reason is very simple. Suppose I'm running this query:
SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color
That query would not make any sense. Even trying to explain it would be impossible. For sure it is selecting colors and counting the number of cars per color. However, it is also adding the owner_name field, and there can be many owners for a given color, as is the case for White. So if there can be many owner_name values for a single color, which happens to be the only field in the GROUP BY clause... then which owner_name will be returned?
If an owner_name has to be returned, then some criterion should be added to select only one of them, e.g., the first one alphabetically, which in this case would be John. That criterion amounts to adding the aggregate function MIN(owner_name), and then the query makes sense again, because it groups by, at least, all the non-aggregated fields in the select statement.
As you can see, there is a clear and practical reason for standard SQL to be inflexible in the grouping. If it wasn't, you could face awkward situations in which the value for a column will be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.
Having said that, why would MySQL allow queries that might not make sense? Even worse, the error in the query above could be detected just by inspecting the syntax! The short answer is: performance. The long answer is that there are certain situations in which, based on relations in the data, taking an unpredictable value from the group still yields a predictable result.
If you haven't figured it out yet, the only way in which you can predict the value you'll get from taking an unpredictable element from a group will be if all the elements in the group are the same. A clear example of this situation is in the sample query in your very same question. Look at how owner_id and owner_name relates in the table. It is clear that given any owner_id, e.g. 2, you can only have one distinct owner_name. Even having many rows, by choosing any, you will get Mike as the result. In formal database jargon this can be explained as owner_id functionally determines owner_name.
Let's take a closer look at that fully working MySQL query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id
Given any owner_id this would return the same owner_name, so adding it to the GROUP BY clause will not result in more rows returned. Even adding an aggregate function MAX(owner_name) will not result in fewer rows returned. The resulting data will be exactly the same. In both cases, the query is immediately turned into a legal standard SQL query, as at least all the non-aggregated fields are grouped by. So there are 3 approaches to get the same results.
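The three approaches can be sketched side by side as follows. The column types are assumptions, and the ORDER BY clauses are added only to make the row order deterministic; the data is the CARS table from the question.

```sql
-- Assumed column types; the rows are the sample CARS table from the question.
CREATE TABLE cars (
    car_id     INT PRIMARY KEY,
    owner_id   INT NOT NULL,
    owner_name VARCHAR(20) NOT NULL,
    color      VARCHAR(20) NOT NULL
);

INSERT INTO cars (car_id, owner_id, owner_name, color) VALUES
    (1, 1, 'John', 'White'), (2, 1, 'John', 'Black'),
    (3, 2, 'Mike', 'White'), (4, 2, 'Mike', 'Black'),
    (5, 2, 'Mike', 'Brown'), (6, 3, 'Tony', 'White');

-- 1. Standard SQL: group by every non-aggregated column.
SELECT owner_id, owner_name, COUNT(*) AS total
FROM cars
GROUP BY owner_id, owner_name
ORDER BY owner_id;

-- 2. Standard SQL: wrap the dependent column in an aggregate.
SELECT owner_id, MAX(owner_name) AS owner_name, COUNT(*) AS total
FROM cars
GROUP BY owner_id
ORDER BY owner_id;

-- 3. MySQL extension: owner_name is neither grouped nor aggregated.
--    This is rejected when the ONLY_FULL_GROUP_BY SQL mode is enabled,
--    because owner_id is not declared as a key of this table.
SELECT owner_id, owner_name, COUNT(*) AS total
FROM cars
GROUP BY owner_id
ORDER BY owner_id;
```

On this data all three return (1, John, 2), (2, Mike, 3) and (3, Tony, 1), but only the first two would keep returning well-defined results if two rows with the same owner_id ever carried different names.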
However, as I mentioned before, this non-standard grouping has a performance advantage. You can check this underrated link, which explains it in more detail, but I'm going to cite the most important part:
You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.
As you can see, it might be worth applying this MySQL extension to the GROUP BY clause. Anyway, if this is not 100% clear yet, there is a rule of thumb that will make sure your grouping is always correct: always group, at least, by all the non-aggregated fields in the select clause. You might be wasting a few CPU cycles in certain situations, but that is better than returning indeterminate results. If you're still terrified about not grouping correctly, then enabling the ONLY_FULL_GROUP_BY SQL mode could be a last resort :)
May your grouping be correct and performant... or at least correct.
In this book I'm currently reading while following a course on databases, the following example of an illegal query using an aggregate operator is given:
Find the name and age of the oldest sailor.
Consider the following attempt to answer this query:
SELECT S.sname, MAX(S.age)
FROM Sailors S
The intent is for this query to return not only the maximum age but
also the name of the sailors having that age. However, this query is
illegal in SQL--if the SELECT clause uses an aggregate operation, then
it must use only aggregate operations unless the query contains a GROUP BY clause!
Some time later while doing an exercise using MySQL, I faced a similar problem, and made a mistake similar to the one mentioned. However, MySQL didn't complain and just spit out some tables which later turned out not to be what I needed.
Is the query above really illegal in SQL, but legal in MySQL, and if so, why is that?
In what situation would one need to make such a query?
Further elaboration of the question:
The question isn't about whether or not all attributes mentioned in a SELECT should also be mentioned in a GROUP BY.
It's about why the above query, using attributes together with aggregate operations on attributes, without any GROUP BY, is legal in MySQL.
Let's say the Sailors table looked like this:
+----------+------+
| sname | age |
+----------+------+
| John Doe | 30 |
| Jane Doe | 50 |
+----------+------+
The query would then return:
+----------+------------+
| sname | MAX(S.age) |
+----------+------------+
| John Doe | 50 |
+----------+------------+
Now who would need that? John Doe ain't 50, he's 30!
As stated in the citation from the book, this is a first attempt to get the name and age of the oldest sailor, in this example, Jane Doe at the age of 50.
SQL would say this query is illegal, but MySQL just proceeds and spits out "garbage".
Who would need this kind of result?
Why does MySQL allow this little trap for newcomers?
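By the way, this is the default behavior in MySQL versions before 5.7.5; from 5.7.5 on, ONLY_FULL_GROUP_BY is part of the default SQL mode. In either case it can be changed by setting the ONLY_FULL_GROUP_BY server mode in the my.ini file or in the session: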
SET sql_mode = 'ONLY_FULL_GROUP_BY';
SELECT * FROM sakila.film_actor GROUP BY actor_id;
Error: 'sakila.film_actor.film_id' isn't in GROUP BY
ONLY_FULL_GROUP_BY - Do not permit queries for which the select list refers to nonaggregated columns that are not named in the GROUP BY clause.
Is the query above really illegal in SQL, but legal in MySQL
Yes
if so, why is that
I don't know the reasons for the design decisions made in MySQL, but considering that you can get the actual related data from the same row(s) as the aggregate came from (e.g., MAX or MIN) with only slightly more work, I don't see any advantage in returning additional column data from arbitrary rows.
I strongly dislike this "feature" in MySQL and it trips up many people who learn aggregates on MySQL and then move to a different dbms, and suddenly realize they never quite knew what they were doing.
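For completeness, here is a minimal sketch of the "slightly more work" version that returns the name and age of the oldest sailor in standard SQL. The column types are assumptions; the two rows are the ones from the question.

```sql
-- Assumed column types; the rows are the sample Sailors table from the question.
CREATE TABLE Sailors (
    sname VARCHAR(30) NOT NULL,
    age   INT NOT NULL
);

INSERT INTO Sailors (sname, age) VALUES
    ('John Doe', 30),
    ('Jane Doe', 50);

-- Compute the maximum age in a scalar subquery, then return every sailor
-- whose age equals it. Both columns now come from the same row.
SELECT S.sname, S.age
FROM Sailors S
WHERE S.age = (SELECT MAX(S2.age) FROM Sailors S2);
```

This returns Jane Doe with age 50. If several sailors share the maximum age, all of them are returned, which is usually what "the oldest sailor" should mean anyway.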
Based on a link which a_horse_with_no_name provided in a comment, I have arrived at my own answer:
It seems that the MySQL way of using GROUP BY differs from the SQL way in order to permit leaving columns out of the GROUP BY clause when they are functionally dependent on other included columns anyway.
Let's say we have a table displaying the activity of a bank account.
It's not a very thought-out table, but it's the only one we have, and that will have to do.
Instead of keeping track of an amount, we imagine an account starts at '0' and all transactions to it are recorded instead, so the amount is the sum of the transactions. The table could look like this:
+------------+----------+-------------+
| costumerID | name | transaction |
+------------+----------+-------------+
| 1337 | h4x0r | 101 |
| 42 | John Doe | 500 |
| 1337 | h4x0r | -101 |
| 42 | John Doe | -200 |
| 42 | John Doe | 500 |
| 42 | John Doe | -200 |
+------------+----------+-------------+
It is clear that the 'name' is functionally dependent on the 'costumerID'.
(The other way around would also be possible in this example.)
What if we wanted to know the costumerID, name and current amount of each customer?
In such a situation, two very similar queries would return the following right result:
+------------+----------+--------+
| costumerID | name | amount |
+------------+----------+--------+
| 42 | John Doe | 600 |
| 1337 | h4x0r | 0 |
+------------+----------+--------+
This query can be executed in MySQL, and is legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID, name
This query can be executed in MySQL, and is NOT legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID
The following line would make the query return an error instead, since it would now have to follow the SQL way of using aggregation operations and GROUP BY:
SET sql_mode = 'ONLY_FULL_GROUP_BY';
The argument for allowing the second query in MySQL seems to be that all columns mentioned in SELECT but not mentioned in GROUP BY are assumed to be either used inside an aggregate operation (the case with 'transaction') or functionally dependent on other included columns (the case with 'name'). In the case of 'name', we can be sure that the correct 'name' is chosen for all group entries, since it is functionally dependent on 'costumerID', and therefore there is only one possible name for each group of costumerIDs.
This way of using GROUP BY seems flawed though, since it doesn't do any further checks on what is left out of the GROUP BY clause. People can pick and choose columns from their SELECT statement to put in their GROUP BY clause as they see fit, even if it makes no sense to include or leave out any particular column.
The Sailor example illustrates this flaw very well.
When using aggregation operators (possibly in conjunction with GROUP BY), each group entry in the returned set has only one value for each of its columns. In the case of Sailors, since the GROUP BY clause is left out, the whole table is put into one single group entry. This entry needs a name and a maximum age. Choosing a maximum age for this entry is a no-brainer, since MAX(S.age) only returns one value. In the case of S.sname though, which is only mentioned in SELECT, there are now as many choices as there are unique snames in the whole Sailors table (in this case two, John and Jane Doe). MySQL doesn't have any clue which to choose, we didn't give it any, and it didn't hit the brakes in time, so it just has to pick whatever comes first (John Doe). If the two rows were switched, it would actually give "the right answer" by accident.
It just seems plain dumb that something like this is allowed in MySQL, that the result of a query using GROUP BY could potentially depend on the ordering of the table if something is left out of the GROUP BY clause. Apparently, that's just how MySQL rolls. But still, couldn't it at least have the courtesy of warning us when it has no clue what it's doing because of a "flawed" query? I mean, sure, if you give the wrong instructions to a program, it probably wouldn't (or shouldn't) do as you want, but if you give unclear instructions, I certainly wouldn't want it to just start guessing or pick whatever comes first... -_-'
MySQL allows this non-standard SQL syntax because there is at least one specific case in which it makes the SQL nominally easier to write. That case is when you're joining two tables which have a PRIMARY / FOREIGN KEY relationship (whether enforced by the database or not) and you want an aggregate value from the FOREIGN KEY side and multiple columns from the PRIMARY KEY side.
Consider a system with Customer and Orders tables. Imagine you want all the fields from the customer table along with the total of the Amount field from the Orders table. In standard SQL you would write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip
Notice the unwieldy GROUP BY clause, and imagine what it would look like if there were more columns you wanted from customer.
In MySQL, you could write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
or even (I think, I haven't tried it):
SELECT C.*, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
Much easier to write. In this particular case it's safe as well, since you know that only one row from the Customer table will contribute to each group (assuming CustomerID is PRIMARY or UNIQUE KEY).
Personally, I'm not a big fan of this exception to standard SQL syntax (since there are many cases where it's not safe to use this syntax and rely on getting values from any particular row in the group), but I can see where it makes certain kinds of queries easier and (in the case of my second MySQL example) possible.
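For comparison, standard SQL can also express the "all customer columns plus a total" query without the long GROUP BY, by aggregating the orders in a derived table first and joining afterwards. This is only a sketch: the column types and the alias names T and TotalAmount are my own assumptions, and the Customer columns are the ones listed in the queries above.

```sql
-- Assumed column types; the column names follow the queries above.
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    FirstName  VARCHAR(50),
    LastName   VARCHAR(50),
    Address    VARCHAR(100),
    City       VARCHAR(50),
    State      VARCHAR(20),
    Zip        VARCHAR(10)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT NOT NULL,
    Amount     DECIMAL(10, 2) NOT NULL
);

-- Aggregate per customer first, then join the one-row-per-customer result
-- back to Customer. No Customer column needs to appear in a GROUP BY.
SELECT C.*, T.TotalAmount
FROM Customer C
INNER JOIN (
    SELECT O.CustomerID, SUM(O.Amount) AS TotalAmount
    FROM Orders O
    GROUP BY O.CustomerID
) T ON T.CustomerID = C.CustomerID;
```

Like the INNER JOIN versions above, this omits customers without orders; a LEFT JOIN with COALESCE(T.TotalAmount, 0) would keep them.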
I have two simple MySQL tables:
SYMBOL
| id | symbol |
(INT(primary) - varchar)
PRICE
| id | id_symbol | date | price |
(INT(primary), INT(index), date, double)
I have to pass two symbols to get something like:
| DATE       | A      | B     |
|------------|--------|-------|
| 2001-01-01 | 100.25 | 25.26 |
| 2001-01-02 | 100.23 | 25.25 |
| 2001-01-03 | 100.24 | 25.24 |
| 2001-01-04 | 100.25 | 25.26 |
| 2001-01-05 | 100.26 | 25.28 |
| 2001-01-06 | 100.27 | 30.29 |
where A and B are the symbols I need to search for and DATE is the date of the prices (I need the same date for both so I can compare the symbols).
If one symbol has no price for a date on which the other does, that date must be skipped. I only need to retrieve the last N prices of those symbols.
ORDER: from the earliest date to the latest (for example, the last 100 prices of both).
How could I implement this query?
Thank you
Implementing these steps should bring you the desired result:
Get dates and prices for symbol A. (Inner join PRICE with SYMBOL to obtain the necessary rows.)
Similarly get dates and prices for symbol B.
Inner join the two result sets on the date column and pull the price from the first result set as the A column and the other one as B.
This should be simple if you know how to join tables.
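The steps above can be sketched as a single query. The column types are assumptions based on the description in the question, 'A' and 'B' stand in for the two symbols being compared, and 100 stands in for N. The sketch also assumes at most one price per symbol per date.

```sql
-- Assumed column types, following the description in the question.
CREATE TABLE SYMBOL (
    id     INT PRIMARY KEY,
    symbol VARCHAR(16) NOT NULL
);

CREATE TABLE PRICE (
    id        INT PRIMARY KEY,
    id_symbol INT NOT NULL,
    `date`    DATE NOT NULL,
    price     DOUBLE NOT NULL,
    INDEX (id_symbol)
);

-- Inner query: pair each price of symbol A with the price of symbol B on
-- the same date (dates missing for either symbol drop out of the inner
-- join) and keep the 100 most recent dates.
-- Outer query: present those rows from the earliest date to the latest.
SELECT t.`date`, t.A, t.B
FROM (
    SELECT pa.`date`, pa.price AS A, pb.price AS B
    FROM PRICE pa
    INNER JOIN SYMBOL sa ON sa.id = pa.id_symbol AND sa.symbol = 'A'
    INNER JOIN PRICE pb  ON pb.`date` = pa.`date`
    INNER JOIN SYMBOL sb ON sb.id = pb.id_symbol AND sb.symbol = 'B'
    ORDER BY pa.`date` DESC
    LIMIT 100
) AS t
ORDER BY t.`date` ASC;
```

A composite index on (id_symbol, date) would probably help this query more than the single-column index, but that is a guess without seeing the real data.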
I think you should update your question to fix the mistakes in how you represent your data; I'm having a hard time following the details. However, based on what I am seeing, there are four MySQL concepts you need to solve your problem.
The first is JOIN: you would use a join to put two tables together so you can select related data using the key that you describe as "id_symbol".
The second is LIMIT, which lets you specify the number of records to return: if you wanted one record you would use LIMIT 1, and if you wanted a hundred records, LIMIT 100.
The third is the WHERE clause, which lets you search for a specific value in one of the fields of the table you are querying.
The last is ORDER BY, which lets you specify a field to sort the returned records by and the direction in which to sort them, ASC or DESC.
An Example:
SELECT *
FROM table1
JOIN table2 ON table1.id = table2.table1_id
WHERE table1.searchfield = 'search string'
ORDER BY table1.orderfield DESC
LIMIT 100
(This is pseudocode with placeholder table and column names, so it will not run as is, but it should give you the correct idea.)
I suggest referencing the MySQL documentation found here it should provide everything you need to keep going.