Order of columns in GROUP BY clause does affect index use - mysql

This is more of an academic question, because in my particular case I can create an easy workaround, but I would like to understand the reason behind this anyway.
Using an InnoDB table (MariaDB 10.0.31) with (among others) columns customer and uri, I wanted to select the distinct uris for a specific customer. Now, the table is quite large (around 50M entries), so there is a composite index on customer and uri.
Basically what I don't understand is why the order of the columns in the group by clause matters.
explain select customer, uri from `tableName` group by customer,uri;
tells me it will use the existing index for group by, but
explain select customer, uri from `tableName` group by uri,customer;
won't do so.
Could someone explain why this is the case? I always thought of the group by clause as declarative.
Maybe it's because it's Friday, but I can't think of a case where the order of the group by columns would affect the result.

Your observation is correct. The plans differ because the cost-based optimizer bases its decision on the "prefix" order of the columns in the composite index declaration. This behavior follows from the use of a B-tree index.
The GROUP BY clause is also used for ordering the result, and hence the index would be used if
the GROUP BY columns appear in the same order as in the index, or
only the leftmost columns of the index are used in GROUP BY, or
the leftmost column is used in the WHERE clause and the rest appear in index order in the GROUP BY clause.
More on this and topic of Loose/Tight Index Scan can be found here
https://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html

An index is basically an ordered table. In your case it is ordered as if by ORDER BY customer, uri (because this is how your index is defined).
MySQL executes group by by first ordering the result according to the group by clause and then collapsing the rows with the same values (that happen to follow each other after sorting).
Apparently, MySQL is not smart enough to recognize that the different group by clause could also be executed when the result is ordered the other way.
More about this:
http://use-the-index-luke.com/sql/sorting-grouping
In particular: http://use-the-index-luke.com/sql/sorting-grouping/indexed-group-by
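The point that both GROUP BY orders describe the same set of groups can be checked with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MariaDB (an assumption made only to keep the example self-contained); the table name `requests` and its rows are hypothetical.

```python
import sqlite3

def distinct_groups(conn, group_by):
    """Return the set of (customer, uri) groups for a given GROUP BY list."""
    # group_by is a hard-coded, trusted column list in this sketch.
    sql = f"SELECT customer, uri FROM requests GROUP BY {group_by}"
    return set(conn.execute(sql).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (customer TEXT, uri TEXT)")
conn.execute("CREATE INDEX idx_customer_uri ON requests (customer, uri)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("a", "/x"), ("a", "/x"), ("a", "/y"), ("b", "/x")],
)

groups_cu = distinct_groups(conn, "customer, uri")
groups_uc = distinct_groups(conn, "uri, customer")

# Same groups either way; only the order in which they come back may differ.
print(groups_cu == groups_uc)
```

Since the groups are identical, writing the GROUP BY columns in index order is a safe workaround whenever the output order does not matter.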

Write a feature request at bugs.mysql.com .
On the one hand, GROUP BY is (or was) defined to imply ORDER BY with the same columns in the same order.
On the other hand, if you ignore that non-standard feature, even by saying ORDER BY NULL, MySQL fails to shuffle the columns in order to use the index.
5.7 (and before) says
GROUP BY implicitly sorts by default (that is, in the absence of ASC
or DESC designators), but relying on implicit GROUP BY sorting is
deprecated. To produce a given sort order, use explicit ASC or DESC
designators for GROUP BY columns or provide an ORDER BY clause. GROUP
BY sorting is a MySQL extension that may change in a future release;
for example, to make it possible for the optimizer to order groupings
in whatever manner it deems most efficient and to avoid the sorting
overhead.
and
If a query includes GROUP BY but you want to avoid the overhead of
sorting the result, you can suppress sorting by specifying ORDER BY
NULL.
But, watch out; 8.0 says
Previously, relying on implicit GROUP BY sorting was deprecated but
GROUP BY did sort by default (that is, in the absence of ASC or DESC
designators). In MySQL 8.0, GROUP BY no longer sorts by default, so
query results may differ from previous MySQL versions. To produce a
given sort order, use explicit ASC or DESC designators for GROUP BY
columns or provide an ORDER BY clause.

Related

How do you order the indexing columns in MySQL if you are using order by in your query?

I am reading an article about how Pinterest shards their MySQL database: https://medium.com/#Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f
And here they have an example of a table:
CREATE TABLE board_has_pins (
board_id INT,
pin_id INT,
sequence INT,
INDEX(board_id, pin_id, sequence)
) ENGINE=InnoDB;
And they are showing how they query from that table:
SELECT pin_id FROM board_has_pins
WHERE board_id=241294561224164665 ORDER BY sequence
LIMIT 50 OFFSET 150
What I don't understand here is the ordering of the index. Would it not make more sense if the index was like this since they are ordering by sequence and filtering by board_id?
INDEX(board_id, sequence, pin_id)
Am I missing something here or have I misunderstood how indexing works?
You are correct. The better index for this query is:
INDEX(board_id, sequence, pin_id)
The columns should be in this order:
Column(s) involved in equality comparisons. If there are multiple columns, their order does not matter.
Column(s) involved in the ORDER BY clause, in the same order they appear in the ORDER BY.
Other columns used to fetch values, like pin_id.
Once the equality conditions find the subset of matching rows, they are all tied with respect to their order, because naturally they all have the same value for the column of the equality condition (board_id in this case).
The tie is resolved by the order of the next column in the index. If (and only if) the next column is the one used in the ORDER BY clause, then the rows can be read in index order, with no further work needed to sort them.
I don't know the explanation for the index in the Pinterest blog post you linked to. I guess it's a mistake, because that index is not optimal for the query they showed.
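The effect of the column order can be observed in a query plan. The following is a minimal sketch that uses SQLite (through Python's sqlite3 module) as a stand-in for MySQL, so the plan text is SQLite's and not what MySQL's EXPLAIN prints; the index name `idx` and the board id `1` are hypothetical. It checks whether the plan needs a separate sorting step for each index definition.

```python
import sqlite3

QUERY = (
    "SELECT pin_id FROM board_has_pins "
    "WHERE board_id = ? ORDER BY sequence LIMIT 50 OFFSET 150"
)

def plan_details(index_columns):
    """Create the table with one index and return the plan's detail strings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE board_has_pins (board_id INT, pin_id INT, sequence INT)"
    )
    conn.execute(f"CREATE INDEX idx ON board_has_pins ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    conn.close()
    # The human-readable description is the fourth column of each plan row.
    return [row[3] for row in rows]

plan_blog = plan_details("board_id, pin_id, sequence")
plan_better = plan_details("board_id, sequence, pin_id")

# SQLite reports an extra sort as a "TEMP B-TREE" step in the plan.
needs_sort_blog = any("TEMP B-TREE" in d for d in plan_blog)
needs_sort_better = any("TEMP B-TREE" in d for d in plan_better)

print(plan_blog, needs_sort_blog)
print(plan_better, needs_sort_better)
```

On MySQL the equivalent check would be to look for "Using filesort" in the Extra column of EXPLAIN.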

Exact same MYSQL query but different results on different servers

I have this query.
SELECT * FROM (SELECT * FROM private_messages ORDER BY id DESC) a
The "DESC" works on my local server but doesn't work on another server.
However if i just write it like this:
SELECT * FROM private_messages ORDER BY id DESC
it works on both servers. What would cause this?
This is going to be too long for a comment..
It's not a bug. It's documented here:
As of MySQL 5.7.6, the optimizer handles propagation of an ORDER BY
clause in a derived table or view reference to the outer query block
by propagating the ORDER BY clause if the following conditions apply:
The outer query is not grouped or aggregated; does not specify
DISTINCT, HAVING, or ORDER BY; and has this derived table or view
reference as the only source in the FROM clause. Otherwise, the
optimizer ignores the ORDER BY clause. Before MySQL 5.7.6, the
optimizer always propagated ORDER BY, even if it was irrelevant or
resulted in an invalid query.
However it's not the case for your query. So I guess your second server is running MariaDB, which seems to ignore any ORDER BY in a subquery without LIMIT:
A "table" (and subquery in the FROM clause too) is - according to the
SQL standard - an unordered set of rows. Rows in a table (or in a
subquery in the FROM clause) do not come in any specific order. That's
why the optimizer can ignore the ORDER BY clause that you have
specified. In fact, SQL standard does not even allow the ORDER BY
clause to appear in this subquery (we allow it, because ORDER BY ...
LIMIT ... changes the result, the set of rows, not only their order).
You need to treat the subquery in the FROM clause, as a set of rows in
some unspecified and undefined order, and put the ORDER BY on the
top-level SELECT.
Why is ORDER BY in a FROM Subquery Ignored?
So the best you can do is move the ORDER BY clause to the outer query. Or don't use a subquery at all.
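As a small illustration of that advice, the sketch below keeps the derived table but puts the ORDER BY on the top-level SELECT. It runs against SQLite via Python's sqlite3 module purely so that it is self-contained (an assumption; the servers in the question run MySQL or MariaDB), and the `body` column and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE private_messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO private_messages (id, body) VALUES (?, ?)",
    [(2, "second"), (1, "first"), (3, "third")],
)

# The ORDER BY sits on the outer query, where every engine must honor it.
rows = conn.execute(
    "SELECT * FROM (SELECT * FROM private_messages) a ORDER BY id DESC"
).fetchall()
ids = [row[0] for row in rows]
print(ids)
```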

Difference in select query result on MyISAM Vs InnoDB MySQL engines(especially for FULL TEXT SEARCHES)

I would like to know if there is any difference in the result output of the same select query on MyISAM Vs that on InnoDB for the same table.
What I am aware of is that MyISAM can do FULL TEXT searches. But will the order of the output differ?
The ordering of the output is determined by the order by clause. You have three possibilities.
First, there is no order by clause. Then the result set is in an indeterminate order. You cannot say that running the same query on the same data will produce results in the same order on multiple runs. You definitely cannot make any statement about runs on different databases.
Second, there is an order by clause and the sort keys uniquely identify each row (there are no ties). Then the results are specified by both the SQL standard and MySQL documentation. The result sets will be in the same order.
Third, there is an order by clause and there are ties. The keys will be in the same order in both result sets. However, because keys with ties can be in any order, the two result sets are not guaranteed to be in the same order.
Summary: if you want results in a particular order, use order by.
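One common way to move from the third case to the second is to append a unique column to the ORDER BY as a tiebreaker. The sketch below shows this with SQLite via Python's sqlite3 module (used here only as a runnable stand-in for MySQL); the `docs` table and its `score` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, score INT)")
conn.executemany(
    "INSERT INTO docs (id, score) VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 10), (4, 20)],
)

# score alone has ties; adding the unique id makes the order fully determined.
rows = conn.execute(
    "SELECT id, score FROM docs ORDER BY score DESC, id ASC"
).fetchall()
print(rows)
```

With the tiebreaker in place, the same data yields the same row order on any engine.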

Optimizing query instead of using order by

I want to run a simple query to get the "n" oldest records in the table. (It has a creation_date column).
How can I get that without using "order-by"? It is a very big table, and using order by on the entire table to get only "n" records does not seem efficient.
(Assume n << size of table)
When you are concerned about performance, you should probably not discard the use of order by too early.
Queries like that can be implemented as a Top-N query supported by an appropriate index. That runs very fast because it doesn't need to sort the entire table, not even the selected rows, because the data is already sorted in the index.
example:
select *
from table
where A = ?
order by creation_date
limit 10;
Without an appropriate index it will be slow if you have lots of data. However, if you create an index like this:
create index test on table (A, creation_date );
The query will be able to start fetching the rows in the correct order, without sorting, and stop when the limit is reached.
Recipe: put the where columns in the index, followed by the order by columns.
If there is no where clause, just put the order by into the index. The order by must match the index definition, especially if there are mixed asc/desc orders.
The indexed Top-N query is the performance king--make sure to use them.
A few links for further reading (all mine):
How to use index efficienty in mysql query
http://blog.fatalmind.com/2010/07/30/analytic-top-n-queries/ (Oracle centric)
http://Use-The-Index-Luke.com/ (not yet covering Top-N queries, but that's to come in 2011).
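The recipe (where columns first, then the order by columns) can be tried out with a minimal sketch. It uses SQLite via Python's sqlite3 module as a stand-in for MySQL, so the plan output is SQLite's; the table name `events`, the `payload` column, and the index name `test_idx` are hypothetical. The plan is inspected once before and once after the index exists.

```python
import sqlite3

QUERY = "SELECT * FROM events WHERE a = ? ORDER BY creation_date LIMIT 10"

def plan_details(conn):
    """Return the detail strings of the query plan for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INT, creation_date TEXT, payload TEXT)")

plan_before = plan_details(conn)

# Where column first, order by column second, as in the recipe.
conn.execute("CREATE INDEX test_idx ON events (a, creation_date)")
plan_after = plan_details(conn)

# SQLite reports an explicit sort as a "TEMP B-TREE" step.
sorts_before = any("TEMP B-TREE" in d for d in plan_before)
sorts_after = any("TEMP B-TREE" in d for d in plan_after)
uses_index = any("test_idx" in d for d in plan_after)

print(plan_before)
print(plan_after)
```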
I haven't tested this concept before, but try creating an index on the creation_date column, which will keep the entries sorted in ascending order. Then your select query can use order by creation_date desc with limit 20 to get the first 20 records in that order. The database engine should realize that the index has already done the sorting work and won't actually need to sort, because the index was sorted on save. All it needs to do is read the last 20 records from the index.
Worth a try.
Create an index on creation_date and query by using order by creation_date asc|desc limit n and the response will be very fast (in fact it cannot be faster). For the "latest n" scenario you need to use desc.
If you want more constraints on this query (e.g where state='LIVE') then the query may become very slow and you'll need to reconsider the indexing strategy.
You can use Group By if you're grouping some data, and then a Having clause to select specific records.

avoid Sorting by the MYSQL IN Keyword

When querying the db for a set of ids, MySQL does not return the results in the order in which the ids were specified. The query I am using is the following:
SELECT id ,title, date FROM Table WHERE id in (7,1,5,9,3)
in return the result provided is in the order 1,3,5,7,9.
How can I avoid this auto sorting?
If you want to order your result by id in the order specified in the in clause you can make use of FIND_IN_SET as:
SELECT id ,title, date
FROM Table
WHERE id in (7,1,5,9,3)
ORDER BY FIND_IN_SET(id,'7,1,5,9,3')
There is no auto-sorting or default sorting going on. The sorting you're seeing is most likely the natural sorting of rows within the table, ie. the order they were inserted. If you want the results sorted in some other way, specify it using an ORDER BY clause. There is no way in SQL to specify that a sort order should follow the ordering of items in an IN clause.
The WHERE clause in SQL does not affect the sort order; the ORDER BY clause does that.
If you don't specify a sort order using ORDER BY, SQL will pick its own order, which will typically be the order of the primary key, but could be anything.
If you want the records in a particular order, you need to specify an ORDER BY clause that tells SQL the order you want.
If the order you want is based solely on that odd sequence of IDs, then you'd need to specify that in the ORDER BY clause. It will be tricky to specify exactly that. It is possible, but will need some awkward SQL code, and will slow down the query significantly (due to it no longer using a key to find the records).
If your desired ID sequence is because of some other factor that is more predictable (say for example, you actually want the records in alphabetical name order), you can just do ORDER BY name (or whatever the field is).
If you really want to sort by the ID in an arbitrary sequence, you may need to generate a temporary field which you can use to sort by:
SELECT *,
CASE id
WHEN 7 THEN 1
WHEN 1 THEN 2
WHEN 5 THEN 3
WHEN 9 THEN 4
WHEN 3 THEN 5
END AS mysortorder
FROM mytable
WHERE id in (7,1,5,9,3)
ORDER BY mysortorder;
The behaviour you are seeing is a result of query optimisation. I expect that you have an index on id, so the IN condition will use the index to return records in the most efficient way. As an ORDER BY clause has not been specified, the database will assume that the order of the returned records is not important and will optimise for speed. (Check out "EXPLAIN SELECT".)
CodeAddicts or Spudley's answer will give the result you want. An alternative is assigning a priority to the id's in "mytable" (or another table) and using this to order the records as desired.
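When the id list comes from application code, the CASE expression can be generated instead of typed by hand. The following is a minimal sketch of that idea, run against SQLite through Python's sqlite3 module so that it is self-contained (an assumption; on MySQL the FIND_IN_SET variant is also available). The helper name `fetch_in_given_order` and the sample rows are hypothetical.

```python
import sqlite3

def fetch_in_given_order(conn, ids):
    """Fetch rows whose id is in ids, in the order of ids (ids must be non-empty)."""
    placeholders = ", ".join("?" for _ in ids)
    # One WHEN branch per id; its position in the list becomes the sort key.
    whens = " ".join("WHEN ? THEN ?" for _ in ids)
    sql = (
        f"SELECT id, title FROM mytable WHERE id IN ({placeholders}) "
        f"ORDER BY CASE id {whens} END"
    )
    # Parameters follow the order of the placeholders: IN list, then CASE pairs.
    params = list(ids)
    for position, id_ in enumerate(ids):
        params += [id_, position]
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, title) VALUES (?, ?)",
    [(i, f"title {i}") for i in range(1, 10)],
)

rows = fetch_in_given_order(conn, [7, 1, 5, 9, 3])
result_ids = [row[0] for row in rows]
print(result_ids)
```

Binding the ids as parameters, instead of concatenating them into the SQL string, keeps the generated statement safe even if the list originates from user input.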