SQL `group by` vs. `order by` Performance - mysql

tl;dr - lots of accepted Stack Overflow answers suggest using a subquery to affect the row returned by a GROUP BY clause. While this works, is it the best advice?
I understand there are many questions already about how to retrieve a specific row in a GROUP BY statement. Most of them revolve around using a subquery in the FROM clause. The subquery will order the table appropriately and the group by will be run against the now-ordered temporary table. Some examples,
MySQL order by before group by
MySQL "Group By" and "Order By"
PostgreSQL removes the need for the subquery with the distinct on() clause.
Postgresql DISTINCT ON with different ORDER BY
However, what I'm not understanding in any of these cases is how badly I'm shooting myself in the foot trying to do something the system may not have originally been designed for. Take the following two examples in PostgreSQL and MySQL,
http://sqlfiddle.com/#!15/3b0f2/1
http://sqlfiddle.com/#!2/6d337/1
In both cases I have a table of posts that contains multiple versions of the same post (identified by its UUID). I want to select the most recently published version of each post, as determined by its created_at field.
My biggest concern is that given the MySQL approach a temporary table is necessary. Ratchet this up to "web scale" (lolz) and I'm wondering if I'm in for a world of hurt. Should I rethink my schema or are there ways to optimize the subquery-parentquery relationship enough that it'll be alright?

It is definitely not the best advice. SQL itself (and the MySQL documentation as far as I can tell) has little to say about the results from a subquery with an order by. Although they may be ordered in practice, they are not guaranteed to be.
The more important issue is the use of "hidden columns" in the aggregation. Consider this basic query:
select t.*
from (select t.* from table t order by datecol) t
group by t.col;
Everything except t.col in the select comes from an indeterminate row. The specific documentation is (emphasis is mine):
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values within each group the server chooses.
A safe way to write such a query is:
select t.*
from table t
where not exists (select 1
from table t2
where t2.col = t.col and t2.datecol < t.datecol
);
This is not exactly the same, because it will return multiple rows if the minimum is not unique. The logic is "get me all rows in the table where there are no rows with the same col value and a smaller datecol value."
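To make the tie behaviour concrete, here is a minimal sketch that runs the same NOT EXISTS pattern against an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL only so the example can be run anywhere; the table t, its id column, and the sample rows are assumptions made up for illustration.

```python
import sqlite3


def earliest_rows_per_group(rows):
    """Return every row whose datecol is the minimum within its col group."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: id is only there to tell tied rows apart.
    conn.execute("CREATE TABLE t (id INTEGER, col TEXT, datecol TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        SELECT t.id, t.col, t.datecol
        FROM t
        WHERE NOT EXISTS (
            SELECT 1
            FROM t AS t2
            WHERE t2.col = t.col AND t2.datecol < t.datecol
        )
        ORDER BY t.col, t.id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [
    (1, "a", "2020-01-02"),
    (2, "a", "2020-01-01"),
    (3, "a", "2020-01-01"),  # ties with id 2 for the minimum in group "a"
    (4, "b", "2020-03-05"),
]
earliest = earliest_rows_per_group(sample)
```

Group "a" comes back with both tied rows (ids 2 and 3), which is the difference from the GROUP BY version. If exactly one row per group is required, the comparison needs a tie-breaker column.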
EDIT:
The question in your comment doesn't make sense, because nothing is discussing two queries. In MySQL you can use order by with variables to solve this:
select t.*
from (select t.*,
             @rn := if(@col = col, @rn := @rn + 1, 1) as rn,
             @col := col
      from table t cross join
           (select @col := '', @rn := 0) vars
      order by col, datecol) t
where rn = 1;
This method should be faster than the order by with group by.

Related

Order by showing bad results in MariaDB

In my query, I need to fetch the previous row alongside the current row and then join a few tables. I get the previous row using SQL variables. On my development server (MySQL 5.7) everything works fine, but on my production server (MariaDB 10) the previous-row values come back mixed up. Only the previous-row values from the SQL variables are wrong; the rest of the query works. At first I thought the problem was in the SQL variables, but I have now realized it is caused by the ORDER BY clause.
My query:
SELECT
customers.title,
calendar.start_time,
calendar.hours_per_time,
calendar.self_certification,
calendar.bulletin_certification,
calendar.extra,
DATE_FORMAT(calendar.date, '%d-%m') AS day_month,
TIME_FORMAT(calendar.start_time, '%H:%i') AS hours_min,
@previous_start AS previous_start,
@previous_start := calendar.start_time,
@previous_end AS previous_end,
@previous_end := calendar.hours_per_time
FROM
(SELECT @previous_start := '00:00', @previous_end := '0.00') AS calendar_prev, calendar
INNER JOIN relationships ON calendar.relation_id = relationships.relation_id
INNER JOIN customers ON customers.customer_id = relationships.customer_id
WHERE relationships.user_id = '$user_id'
AND DATE_FORMAT(calendar.date, '%m-%Y') = '$date'
ORDER BY calendar.date, hours_min ASC
If I remove hours_min from the ORDER BY, everything works fine on both servers, but then I lose my ordering.
This is my result from the development server; the green part comes from the SQL variables and is correct here:
And here is the result from the production server, with the bad values in the red part:
So how can I keep my order and still get good results? Ordering is only needed on the date (Dato) and hours_min (Fra kl.) columns.
Wrap the query as an inline view, and specify a different ORDER BY on the outer query.
As a simple demonstration of the pattern:
SELECT v.fee
, v.fo
, v.fi
FROM (
SELECT t.fee
, t.fi
, t.fo
FROM t
ORDER BY t.fi ASC, t.fo ASC
) v
ORDER
BY v.fee DESC
We can process rows "in order" in the query inside the inline view, using an ORDER BY clause on the inner SELECT statement.
The ORDER BY on the outer query can reorder the results returned by the inner query.
NOTES:
The MySQL Reference manual cautions that the behavior of user-defined variables that are set and read within the same statement is not guaranteed. With that said, we do observe a consistent behavior.
It's an order of operations issue. That is, we are carefully constructing our SQL in such a way that MySQL execution plan gets us a predictable order of operations.
What we have discovered is that the ORDER BY is being processed before the expressions in the SELECT list are evaluated.
So, if we need to process rows "in order" such that the user defined variables contain values from the "previous" row when the expressions in the SELECT list are evaluated, then we need to have an ORDER BY that gets the rows in the desired order.
If we want the resulting rows in a different order, we need another ORDER BY operation to be processed later. And we can get that using an inline view (what MySQL refers to as a "derived table"). That's because MySQL materializes that derived table "v" before the outer query is processed.
The SELECT list of the outer query can specify a different order of columns or omit columns. The order of expressions in the SELECT list of the inner query can be dictated by the order of operations required when working with the user-defined variables: the assignments that save the current row to the user-defined variables have to happen AFTER the user-defined variables are evaluated.
Also, I would recommend ditching the comma syntax for the join operation, and replace that with the JOIN keyword. The CROSS keyword is optional, but it does serve as an indication to the future reader that the omission of the ON clause is intentional, and not an oversight.
The INNER keyword is also optional; it has no effect, and my preference is to omit that.
FROM (SELECT @previous_start := '00:00', @previous_end := '0.00') calendar_prev
CROSS
JOIN calendar
JOIN relationships
ON calendar.relation_id = relationships.relation_id
Combining ORDER BY with variables in MariaDB is a bit tricky. Can you make a copy of the table already ordered by calendar.date, hours_min ASC and run the query on that copy without the ORDER BY?
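On servers that support window functions (MySQL 8.0+ and MariaDB 10.2+, so not the MySQL 5.7 development server in the question), LAG() is an alternative that avoids user-defined variables entirely. The sketch below is not the method from the answers above. It runs against in-memory SQLite through Python's sqlite3 purely so it can be executed as-is, and the trimmed-down calendar table and sample rows are invented for illustration.

```python
import sqlite3


def rows_with_previous_start(rows):
    """Pair each calendar row with the start_time of the row before it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, reduced version of the calendar table from the question.
    conn.execute(
        "CREATE TABLE calendar (date TEXT, start_time TEXT, hours_per_time REAL)"
    )
    conn.executemany("INSERT INTO calendar VALUES (?, ?, ?)", rows)
    # LAG needs window-function support (SQLite 3.25 or newer).
    query = """
        SELECT date,
               start_time,
               LAG(start_time, 1, '00:00')
                   OVER (ORDER BY date, start_time) AS previous_start
        FROM calendar
        ORDER BY date, start_time
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Rows are inserted out of order on purpose.
entries = [
    ("2020-01-02", "09:00", 2.0),
    ("2020-01-01", "13:00", 1.5),
    ("2020-01-01", "08:00", 3.0),
]
with_previous = rows_with_previous_start(entries)
```

The ordering used to find the "previous" row is written inside the OVER clause, so it does not depend on when the server evaluates the outer ORDER BY.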

Save order of SELECT result in complex query

I need to sort selected_booking by cost first and then assign the index i to every row. My variant doesn't work properly (outer SELECT breaks the order):
SELECT (@i:=@i + 1) AS i, selected_booking.*
FROM (SELECT * FROM booking ORDER BY cost DESC) AS selected_booking;
Is there any way to save the order of inner selection when doing outer one?
Q: Is there any way to save the order of inner selection when doing outer selection?
A: Absent an ORDER BY clause on the outer query, MySQL is free to return the rows in any order it chooses.
If you want rows from the inline view (derived table) returned in a specific order, you need to specify that in the outer query... you'd need to add an ORDER BY clause on the outer query.
NOTE: The behavior of user-defined variables as in your query is not guaranteed, the MySQL Reference Manual warns of this. But in spite of that warning, we do observe repeatable behavior in MySQL 5.1 and 5.5.
It's not at all clear why you need an inline view (aka a derived table, in the MySQL vernacular) in the example you give.
It seems like this query would return the result you seem to want:
SET @i = 0 ;
SELECT @i:=@i+1 AS i
, b.*
FROM booking b
ORDER BY b.cost DESC ;
Alternatively, you could do this in a single statement, and initialize @i within the query, rather than in a separate SET statement.
SELECT @i:=@i+1 AS i
, b.*
FROM booking b
JOIN (SELECT @i:=0) i
ORDER BY b.cost DESC
(This initialization works, again, because of the way MySQL processes inline views: the inline view query is run BEFORE the outer query. This isn't guaranteed behavior, and may change in a future release; it may have already changed in 5.6.)
NOTE: For improved performance of this query, if a suitable index is available with cost as the leading column, e.g.
... ON booking (cost)
that may allow MySQL to use that index to return rows in order and avoid a "Using filesort" operation.
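If the index only has to exist in the application, one way to sidestep the variable-evaluation question altogether is to let ORDER BY do the sorting and number the rows client-side. This is a different technique from the queries above, shown as a minimal sketch with Python's sqlite3 and an in-memory database standing in for MySQL; the booking columns (name, cost) and the sample rows are assumptions.

```python
import sqlite3


def ranked_bookings(rows):
    """Return (i, name, cost) tuples, numbered in descending cost order."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column booking table.
    conn.execute("CREATE TABLE booking (name TEXT, cost REAL)")
    conn.executemany("INSERT INTO booking VALUES (?, ?)", rows)
    # The ORDER BY sits on the outermost query, so the order is guaranteed.
    # name is added as a tie-breaker to keep the numbering deterministic.
    cursor = conn.execute(
        "SELECT name, cost FROM booking ORDER BY cost DESC, name ASC"
    )
    ranked = [(i, name, cost) for i, (name, cost) in enumerate(cursor, start=1)]
    conn.close()
    return ranked


bookings = [("a", 50.0), ("b", 120.0), ("c", 80.0)]
ranked = ranked_bookings(bookings)
```

The numbering happens after the rows have been fetched in their final order, so nothing depends on how the server interleaves variable assignment and sorting.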

MySQL GROUP BY behavior (when using a derived table with order by)

Since mysql does not enforce the Single-Value Rule (See: https://stackoverflow.com/a/1646121/1688441) does a derived table with an order by guarantee which row values will be displayed? This is for columns not in an aggregate function and not in the group by.
I was looking at the question (MySQL GROUP BY behavior) after having commented on and answered the question (https://stackoverflow.com/a/24653572/1688441) .
I don't agree with the accepted answer, but realized that a possible improvement on it would be:
SELECT * FROM
(SELECT * FROM tbl order by timestamp) as tb2
GROUP BY userID;
http://sqlfiddle.com/#!2/4b475/18
Is this correct though or will mysql still decide arbitrarily which row values will be displayed?
This query:
SELECT *
FROM (SELECT * FROM tbl order by timestamp) as tb2
GROUP BY userID;
Relies on a MySQL group by extension, which is documented here. You are specifically relying on the fact that all the columns come from the same row and the first one encountered. MySQL specifically warns against making this assumption:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
So, you cannot depend on this behavior. It is easy enough to work around. Here is an example query:
select t.*
from tbl t
where not exists (select 1 from tbl t2 where t2.userid = t.userid and t2.timestamp > t.timestamp)
With an index on tbl(userid, timestamp) this may even work faster. MySQL does a notoriously poor job of optimizing aggregations.

SQL select distinct but "keep first"?

According to another SO post (SQL: How to keep rows order with DISTINCT?), distinct has pretty undefined behavior as far as sorting.
I have a query:
select col_1 from table order by col_2
This can return values like
3
5
3
2
I need to then select a distinct on these that preserves ordering, meaning I want
select distinct(col_1) from table order by col_2
to return
3
5
2
but not
5
3
2
Here is what I am actually trying to do. Col_1 is a user id, and col_2 is a log in timestamp event by that user. So the same user (col_1) can have many login times. I am trying to build a historical list of users in which they were seen in the system. I would like to be able to say "our first user ever was, our second user ever was", and so on.
That post seems to suggest using a GROUP BY, but GROUP BY is not meant to return an ordering of rows, so I do not see how or why this would be applicable here, since it does not appear that GROUP BY will preserve any ordering. In fact, another SO post gives an example where GROUP BY will destroy the ordering I am looking for: see "Peter" in what is the difference between GROUP BY and ORDER BY in sql. Is there any way to guarantee the latter result? The strange thing is, if I were implementing the DISTINCT clause, I would surely do the ORDER BY first, then take the results, do a linear scan of the list and preserve the ordering naturally, so I am not sure why the behavior is so undefined.
EDIT:
Thank you all! I have accepted IMSoP's answer because not only was there an interactive example that I could play around with (thanks for turning me on to SQL Fiddle), but they also explained why several things worked the way they worked, instead of simply "do this". Specifically, it was unclear to me that GROUP BY does not destroy values in the other columns outside of the GROUP BY (rather, it keeps them in some sort of internal list), and that these values can still be examined in an ORDER BY clause.
This all has to do with the "logical ordering" of SQL statements. Although a DBMS might actually retrieve the data according to all sorts of clever strategies, it has to behave according to some predictable logic. As such, the different parts of an SQL query can be considered to be processed "before" or "after" one another in terms of how that logic behaves.
As it happens, the ORDER BY clause is the very last step in that logical sequence, so it can't change the behaviour of "earlier" steps.
If you use a GROUP BY, the rows have been bundled up into their groups by the time the SELECT clause is run, let alone the ORDER BY, so you can only look at columns which have been grouped by, or "aggregate" values calculated across all the values in a group. (MySQL implements a controversial extension to GROUP BY where you can mention a column in the SELECT that can't logically be there, and it will pick one from an arbitrary row in that group).
If you use a DISTINCT, it is logically processed after the SELECT, but the ORDER BY still comes afterwards. So only once the DISTINCT has thrown away the duplicates will the remaining results be put into a particular order - but the rows that have been thrown away can't be used to determine that order.
As for how to get the result you need, the key is to find a value to sort by which is valid after the GROUP BY/DISTINCT has (logically) been run. Remember that if you use a GROUP BY, any aggregated values are still valid - an aggregate function can look at all the values in a group. This includes MIN() and MAX(), which are ideal for ordering by, because "the lowest number" (MIN) is the same thing as "the first number if I sort them in ascending order", and vice versa for MAX.
So to order a set of distinct foo_number values based on the lowest applicable bar_number for each, you could use this:
SELECT foo_number
FROM some_table
GROUP BY foo_number
ORDER BY MIN(bar_number) ASC
Here's a live demo with some arbitrary data.
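For a version that can be run locally, here is a small sketch of the same query using Python's sqlite3 with an in-memory database. The sample rows are an assumption chosen to reproduce the 3, 5, 3, 2 sequence from the question, with foo_number playing the role of the user id and bar_number the login timestamp.

```python
import sqlite3


def distinct_in_first_seen_order(rows):
    """Return distinct foo_number values ordered by their lowest bar_number."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE some_table (foo_number INTEGER, bar_number INTEGER)")
    conn.executemany("INSERT INTO some_table VALUES (?, ?)", rows)
    query = """
        SELECT foo_number
        FROM some_table
        GROUP BY foo_number
        ORDER BY MIN(bar_number) ASC
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# (user id, login timestamp): ordered by timestamp this reads 3, 5, 3, 2.
logins = [(3, 100), (5, 200), (3, 300), (2, 400)]
first_seen = distinct_in_first_seen_order(logins)
```

User 3's second login at 300 does not move it down the list, because MIN(bar_number) for that group is still 100.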
EDIT: In the comments, it was discussed why, if an ordering is applied before the grouping / de-duplication takes place, that order is not applied to the groups. If that were the case, you would still need a strategy for which row was kept in each group: the first, or the last.
As an analogy, picture the original set of rows as a set of playing cards picked from a deck, and then sorted by their face value, low to high. Now go through the sorted deck and deal them into a separate pile for each suit. Which card should "represent" each pile?
If you deal the cards face up, the cards showing at the end will be the ones with the highest face value (a "keep last" strategy); if you deal them face down and then flip each pile, you will reveal the lowest face value (a "keep first" strategy). Both are obeying the original order of the cards, and the instruction to "deal the cards based on suit" doesn't automatically tell the dealer (who represents the DBMS) which strategy was intended.
If the final piles of cards are the groups from a GROUP BY, then MIN() and MAX() represent picking up each pile and looking for the lowest or highest value, regardless of the order they are in. But because you can look inside the groups, you can do other things too, like adding up the total value of each pile (SUM) or how many cards there are (COUNT) etc, making GROUP BY much more powerful than an "ordered DISTINCT" could be.
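The card analogy can be sketched in a few lines of Python. The function name, the keep parameter, and the five-card hand are all made up for illustration; the point is only that the same sorted input gives different "representative" cards depending on the strategy.

```python
def deal_by_suit(sorted_cards, keep="first"):
    """Deal (value, suit) cards into one pile per suit.

    Returns the value left showing on each pile under the chosen strategy.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    showing = {}
    for value, suit in sorted_cards:
        if keep == "first":
            # Face down, then flip the pile: the first card dealt is on top.
            showing.setdefault(suit, value)
        else:
            # Face up: each new card covers the previous one.
            showing[suit] = value
    return showing


# Sort the hand by face value, low to high, before dealing.
hand = sorted([(7, "hearts"), (2, "spades"), (10, "hearts"), (4, "spades"), (3, "hearts")])
keep_first = deal_by_suit(hand, keep="first")
keep_last = deal_by_suit(hand, keep="last")
```

With the hand sorted in ascending order, keep_first matches what MIN() would report per suit and keep_last matches MAX(), which is why the aggregate functions are the explicit way to state the strategy.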
I would go for something like
select col1
from (
select col1,
rank () over(order by col2) pos
from table
)
group by col1
order by min(pos)
In the subquery I calculate the position, then in the main query I do a group by on col1, using the smallest position to order.
Here is the demo in SQLFiddle (this was Oracle; the MySql info was added later).
Edit for MySql:
select col1
from (
select col1 col1,
@curRank := @curRank + 1 AS pos
from table1, (select @curRank := 0) p
) sub
group by col1
order by min(pos)
And here the demo for MySql.
The GROUP BY in the referenced answer isn't attempting to perform an ordering... it is simply picking a single associated value for the column that we want to be distinct.
Like @bluefeet states, if you want a guaranteed ordering, you must use ORDER BY.
Why can't we specify a value in the ORDER BY that isn't included in the SELECT DISTINCT?
Consider the following values for col1 and col2:
create table yourTable (
col_1 int,
col_2 int
);
insert into yourTable (col_1, col_2) values (1, 1);
insert into yourTable (col_1, col_2) values (1, 3);
insert into yourTable (col_1, col_2) values (2, 2);
insert into yourTable (col_1, col_2) values (2, 4);
With this data, what should SELECT DISTINCT col_1 FROM yourTable ORDER BY col_2 return?
That's why you need the GROUP BY and the aggregate function, to decide which of the multiple values for col_2 you should order by... could be MIN(), could be MAX(), maybe even some other function such as AVG() would make sense in some cases; it all depends on the specific scenario, which is why you need to be explicit:
select col_1
from yourTable
group by col_1
order by min(col_2)
SQL Fiddle Here
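To see why the choice of aggregate has to be explicit, here is a rough sketch using Python's sqlite3 and an in-memory database. Note that the sample rows are deliberately different from the inserts above (an assumption for illustration), chosen so that MIN() and MAX() disagree about the order.

```python
import sqlite3


def distinct_ordered_by(rows, aggregate):
    """Return distinct col_1 values ordered by MIN(col_2) or MAX(col_2)."""
    # Only these two names are ever interpolated into the SQL text.
    if aggregate not in ("MIN", "MAX"):
        raise ValueError("aggregate must be 'MIN' or 'MAX'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourTable (col_1 INTEGER, col_2 INTEGER)")
    conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
    query = (
        "SELECT col_1 FROM yourTable "
        f"GROUP BY col_1 ORDER BY {aggregate}(col_2)"
    )
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# Illustrative data: col_1 = 1 has both the lowest and the highest col_2.
data = [(1, 1), (1, 4), (2, 2), (2, 3)]
by_min = distinct_ordered_by(data, "MIN")
by_max = distinct_ordered_by(data, "MAX")
```

Here by_min puts 1 first while by_max puts 2 first, so a bare ORDER BY col_2 on the distinct values would have no single right answer.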
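For MySQL only: when you select columns that are not in the GROUP BY, in practice the server has tended to return values from the first record it encounters in each group, although the MySQL documentation says the values chosen are indeterminate. Where that holds, you can use this behavior to select which record is returned from each group like this: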
SELECT foo_number, bar_number
FROM
(
SELECT foo_number, bar_number
FROM some_table
ORDER BY bar_number
) AS t
GROUP BY foo_number
ORDER BY bar_number DESC;
This is more flexible because it allows you to order the records within each group using expressions that are not possible with aggregates - in my case I wanted to return the one with the shortest string in another column.
For completeness, my query looks like this:
SELECT
s.NamespaceId,
s.Symbol,
s.EntityName
FROM
(
SELECT
m.NamespaceId,
i.Symbol,
i.EntityName
FROM ImportedSymbols i
JOIN ExchangeMappings m ON i.ExchangeMappingId = m.ExchangeMappingId
WHERE
i.Symbol NOT IN
(
SELECT Symbol
FROM tmp_EntityNames
WHERE NamespaceId = m.NamespaceId
)
AND
i.EntityName IS NOT NULL
ORDER BY LENGTH(i.RawSymbol), i.RawSymbol
) AS s
GROUP BY s.NamespaceId, s.Symbol;
What this does is return a distinct list of symbols in each namespace, and for duplicated symbols it returns the one with the shortest RawSymbol. When the RawSymbol lengths are the same, it returns the one whose RawSymbol comes first alphabetically.

Selecting last row WITHOUT any kind of key

I need to get the last (newest) row in a table (using MySQL's natural order - i.e. what I get without any kind of ORDER BY clause), however there is no key I can ORDER BY on!
The only 'key' in the table is an indexed MD5 field, so I can't really ORDER BY on that. There's no timestamp, autoincrement value, or any other field that I could easily ORDER on either. This is why I'm left with only the natural sort order as my indicator of 'newest'.
And, unfortunately, changing the table structure to add a proper auto_increment is out of the question. :(
Anyone have any ideas on how this can be done w/ plain SQL, or am I SOL?
If it's MyISAM you can do it in two queries
SELECT COUNT(*) FROM yourTable;
SELECT * FROM yourTable LIMIT useTheCountHere - 1,1;
This is unreliable however because
It assumes rows are only added to this table and never deleted.
It assumes no other writes are performed to this table in the meantime (you can lock the table)
MyISAM tables can be reordered using ALTER TABLE, so that the insert order is no longer preserved.
It's not reliable at all in InnoDB, since this engine can reorder the table at will.
Can I ask why you need to do this?
In Oracle, and possibly in MySQL too, the optimiser will choose whichever access path and row order returns results quickest. So even if your data is static, there is potential to run the same query twice and get a different answer.
You can assign row numbers using the ROW_NUMBER function and then sort by this value using the ORDER BY clause.
SELECT *,
ROW_NUMBER() OVER() AS rn
FROM table
ORDER BY rn DESC
LIMIT 1;
Basically, you can't do that.
Normally I'd suggest adding a surrogate primary key with auto-increment and ORDER BY that:
SELECT *
FROM yourtable
ORDER BY id DESC
LIMIT 1
But in your question you write...
changing the table structure to add a proper auto_increment is out of the question.
So another less pleasant option I can think of is using a simulated ROW_NUMBER using variables:
SELECT * FROM
(
SELECT T1.*, @rownum := @rownum + 1 AS rn
FROM yourtable T1, (SELECT @rownum := 0) T2
) T3
ORDER BY rn DESC
LIMIT 1
Please note that this has serious performance implications: it requires a full scan, and the results are not guaranteed to be returned in any particular order in the subquery. You might get them in sort order, but then again you might not: when you don't specify the order, the server is free to choose any order it likes. It will probably choose the order in which they are stored on disk, in order to do as little work as possible, but relying on this is unwise.
Without an order by clause you have no guarantee of the order in which you will get your result. The SQL engine is free to choose any order.
But if for some reason you still want to rely on this order, then the following will indeed return the last record from the result (MySql only):
select *
from (select *,
@rn := @rn + 1 rn
from mytable,
(select @rn := 0) init
) numbered
where rn = @rn
In the sub query the records are retrieved without order by, and are given a sequential number. The outer query then selects only the one that got the last attributed number.
We can use HAVING for that kind of problem:
SELECT MAX(id) as last_id,column1,column2 FROM table HAVING id=last_id;