ORDER BY / LIMIT execution in SQL - MySQL

There are lots of threads about this on the web already; I'm just trying to understand some nuances that had me confused!
Quoting the MySQL documentation:
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as
soon as it has found the first row_count rows of the sorted result,
rather than sorting the entire result. If ordering is done by using an
index, this is very fast.
and an SO thread:
It will order first, then get the first 20. A database will also
process anything in the WHERE clause before ORDER BY.
Taking the same query from the question:
SELECT article
FROM table1
ORDER BY publish_date
LIMIT 20
Let's say the table has 2000 rows, of which the query is expected to return 20. Now, looking at the MySQL reference, the phrase "...stops sorting as soon as it has found the first row_count rows..." confuses me, as I find it a little ambiguous!
Why does it say "stops sorting"? Isn't the LIMIT clause applied to already-sorted data returned by the ORDER BY clause (assuming it's a non-indexed column)? Or is my understanding wrong, and SQL is limiting first and then sorting?

The optimization mentioned in the documentation generally only works if there's an index on the publish_date column. The values are stored in the index in order, so the engine simply iterates through the index of the column, fetching the associated rows, until it has fetched 20 rows.
If the column isn't indexed, the engine will generally need to fetch all the rows, sort them, and then return the first 20 of these.
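A minimal sketch of giving the engine that first option, using the names from the question (the index name is my own):

CREATE INDEX idx_publish_date ON table1 (publish_date);

-- With the index in place, the engine can walk idx_publish_date in order and
-- stop after fetching 20 rows; without it, expect a full scan plus a sort.
SELECT article
FROM table1
ORDER BY publish_date
LIMIT 20;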
It's also useful to understand how this interacts with WHERE conditions. Suppose the query is:
SELECT article
FROM table1
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20
If publish_date is indexed and last_read_date is not, it will scan the publish_date index in order, test the associated last_read_date against the condition, and add article to the result set if the test succeeds. When there are 20 rows in the result set it will stop and return it.
If last_read_date is indexed and publish_date is not, it will use the last_read_date index to find the subset of all the rows that meet the condition. It will then sort these rows using the publish_date column, and return the first 20 rows from that.
If neither column is indexed it will do a full table scan to test last_read_date, sort all the rows that match the condition, and return the first 20 rows of this.
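Which of these three plans you actually get is visible in EXPLAIN (a sketch; the idx_* index names are hypothetical):

CREATE INDEX idx_read ON table1 (last_read_date);

EXPLAIN SELECT article
FROM table1
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20;

-- key: idx_publish_date, no "Using filesort" in Extra -> scan the sort index, filter, stop at 20
-- key: idx_read with "Using filesort" in Extra        -> filter via index, sort the matches, trim to 20
-- key: NULL, type: ALL, "Using filesort" in Extra     -> full table scan, sort, trim to 20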

MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result
This is actually a very sensible optimisation within MySQL. If you use LIMIT to return 20 rows and MySQL knows it has already found them, why would MySQL (or you) care how exactly the rest of the records are sorted? It does not matter, so MySQL stops sorting the rest of the rows.
If the ORDER BY is done on an indexed column, then MySQL can tell pretty quickly whether it has found the top n records.
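Even without an index, "stops sorting" has a concrete meaning: for a small enough LIMIT, MySQL's filesort can keep a priority queue of just row_count candidate rows instead of sorting everything. Whether that kicked in is visible in the optimizer trace (a sketch; the exact trace keys vary by version):

SET optimizer_trace = 'enabled=on';
SELECT article FROM table1 ORDER BY publish_date LIMIT 20;
SELECT trace FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
-- look for "filesort_priority_queue_optimization": { ..., "chosen": true }
SET optimizer_trace = 'enabled=off';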

Related

Optimize LIMIT the number of rows to be SELECT in SQL

Consider a table Test having 1000 rows
Test Table
id     name   desc
1      Adi    test1
2      Sam    test2
3      Kal    test3
...
1000   Jil    test1000
If I need to fetch only, say, 100 rows (i.e. a small subset), then I use the LIMIT clause in my query:
SELECT * FROM test LIMIT 100;
This query first fetches 1000 rows and then returns 100 of them.
Can this be optimised so that the DB engine queries only 100 rows and returns them
(instead of fetching all 1000 rows first and then returning 100)?
The reason for the above supposition is that the order of processing will be:
FROM
WHERE
SELECT
ORDER BY
LIMIT
You can combine LIMIT row_count with an ORDER BY. This causes MySQL to stop sorting as soon as it has found the first row_count rows of the sorted result.
Hope this helps. If you need any clarification, just drop a comment.
The query you wrote will fetch only 100 rows, not 1000. But, if you change that query in any way, my statement may be wrong.
GROUP BY and ORDER BY are likely to incur a sort, which is arguably even slower than a full table scan. And that sort must be done before seeing the LIMIT.
Well, not always...
SELECT ... FROM t ORDER BY x LIMIT 100;
together with INDEX(x) -- This may use the index and fetch only 100 rows from the index. BUT... then it has to reach into the data 100 times to find the other columns that you ask for. UNLESS you only ask for x.
Etc, etc.
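That "UNLESS you only ask for x" case is what's called a covering index: the index alone supplies every selected column, so the table rows are never fetched. Widening the index covers more queries (a sketch with hypothetical table/column names; EXPLAIN shows "Using index" when a query is covered):

CREATE INDEX ix_x ON t (x);
SELECT x FROM t ORDER BY x LIMIT 100;      -- covered: reads 100 index entries, no row lookups

CREATE INDEX ix_x_y ON t (x, y);
SELECT x, y FROM t ORDER BY x LIMIT 100;   -- also covered by the wider index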
And here's another wrinkle. A lot of questions on this forum are "Why isn't MySQL using my index?" Back to your query. If there are "only" 1000 rows in your table, my example with the ORDER BY x won't use the index, because it is faster to simply read through the table, tossing 90% of the rows. On the other hand, if there were 9999 rows, then it would use the index. (The transition is somewhere around 20%, but that is imprecise.)
Confused? Fine. Let's discuss one query at a time. I can [probably] discuss the what and why of each one you throw at me. Be sure to include SHOW CREATE TABLE, the full query, and EXPLAIN SELECT... That way, I can explain what EXPLAIN tells you (or does not).
Did you know that having both a GROUP BY and ORDER BY may cause the use of two sorts? EXPLAIN won't point that out. And sometimes there is a simple trick to get rid of one of the sorts.
There are a lot of tricks up MySQL's sleeve.

Does MySQL not use LIMIT to optimize query select functions?

I've got a complex query I have to run in an application that is giving me some performance trouble. I've simplified it here. The database is MySQL 5.6.35 on CentOS.
SELECT a.`po_num`,
       COUNT(*) AS item_count,
       SUM(b.`quantity`) AS total_quantity,
       GROUP_CONCAT(`web_sku` SEPARATOR ' ') AS web_skus
FROM `order` a
INNER JOIN `order_item` b
        ON a.`order_id` = b.`order_key`
WHERE `store` LIKE '%foobar%'
LIMIT 200 OFFSET 0;
The key part of this query is where I've placed "foobar" as a placeholder. If this value is something like big_store, the query takes much longer (roughly 0.4 seconds in the query provided here, much longer in the query I'm actually using) than if the value is small_store (roughly 0.1 seconds in the query provided). big_store would return significantly more results if there were no limit.
But there is a limit, and that's what surprises me. Both datasets have more than the LIMIT, which is only 200. It appears to me that MySQL is performing the select functions COUNT, SUM, and GROUP_CONCAT for all big_store/small_store rows and then applying the LIMIT retroactively. I would imagine that it'd be best to stop when you get to 200.
Could it not perform the select functions COUNT, SUM, and GROUP_CONCAT after grabbing the 200 rows it will use, making my query much quicker? This seems feasible to me except in cases where there's an ORDER BY on one of those columns.
Does MySQL not use LIMIT to optimize a query select functions? If not, is there a good reason for that? If so, did I make a mistake in my thinking above?
It can stop short due to the LIMIT, but that is not a reasonable query since there is no ORDER BY.
Without ORDER BY, it will pick whatever 200 rows it feels like and stop short.
With an ORDER BY, it will have to scan the entire table that contains store (please qualify columns with which table they come from!). This is because of the leading wildcard. Only then can it trim to 200 rows.
Another problem -- Without a GROUP BY, aggregates (SUM, etc) are performed across the entire table (or at least those that remain after filtering). The LIMIT does not apply until after that.
Perhaps what you are asking about is MariaDB 5.5.21's "LIMIT_ROWS_EXAMINED".
Think of it this way ... All of the components of a SELECT are done in the order specified by the syntax. Since LIMIT is last, it does not apply until after the other stuff is performed.
(There are a couple of exceptions: (1) SELECT col... must be done after FROM ..., since it would not know which table(s); (2) The optimizer readily reorders JOINed table and clauses in WHERE ... AND ....)
More details on that query.
The optimizer peeks ahead, and sees that the WHERE is filtering on order (that is where store is, yes?), so it decides to start with the table order.
It fetches all rows from order that match %foobar%.
For each such row, find the row(s) in order_item. Now it has some number of rows (possibly more than 200) with which to do the aggregates.
Perform the aggregates - COUNT, SUM, GROUP_CONCAT. (Actually this will probably be done as it gathers the rows -- another optimization.)
There is now 1 row (with an unpredictable value for a.po_num).
Skip 0 rows for the OFFSET part of the LIMIT. (OK, another out-of-order thingie.)
Deliver up to 200 rows. (There is only 1.)
Add ORDER BY (but no GROUP BY) -- big deal, sort the 1 row.
Add GROUP BY (but no ORDER BY) in, and now you may have more than 200 rows coming out; it can stop short (see the sketch after this list).
Add GROUP BY and ORDER BY and they are identical, then it may have to do a sort for the grouping, but not for the ordering, and it may stop at 200.
Add GROUP BY and ORDER BY and they are not identical, then it may have to do a sort for the grouping, and will have to re-sort for the ordering, and cannot stop at 200 until after the ORDER BY. That is, virtually all the work is performed on all the data.
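To make those GROUP BY cases concrete, here is a sketch against the question's schema (grouping by po_num, with columns qualified as requested above):

SELECT a.`po_num`,
       COUNT(*) AS item_count,
       SUM(b.`quantity`) AS total_quantity,
       GROUP_CONCAT(b.`web_sku` SEPARATOR ' ') AS web_skus
FROM `order` a
INNER JOIN `order_item` b ON a.`order_id` = b.`order_key`
WHERE a.`store` LIKE '%foobar%'
GROUP BY a.`po_num`
LIMIT 200;

With no ORDER BY (or one identical to the GROUP BY), the server may stop once 200 groups are complete; order by anything else and virtually all the work is done on all the data first.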
Oh, and all of this gets worse if you don't have the optimal index. Oh, did I fail to insist on providing SHOW CREATE TABLE?
I apologize for my tone. I have thrown quite a few tips in your direction; please learn from them.

Performance - MySQL default sorting order on inserting records

The default ordering of records in MySQL is by ID ASC (i.e. rows that I insert go to the bottom of the table), but we'll be using only the latest information from the table (i.e. the rows at the bottom).
Will there be any performance improvement if we change the default ordering to DESC (i.e. new records go to the top of the table), so that the frequently used information is queried from the top of the table?
I think it would be the opposite.
I'm basing this comment on how I understand indexes to work in SQL Server; I'll try to revise later if I get a chance to read up more on how they work in MySQL.
There could be a slight performance advantage to inserting your rows in the same order as your index is sorted, versus inserting them in the opposite order.
If you insert in the same order, and your next row to insert is always greater in sort order than the existing rows, then you will always find the next available empty spot (when one exists) in your last page of row data.
If you do the opposite, so that your next inserted row is always lesser in sort order than the existing rows, then you will probably always have a collision in your first page of row data, and the engine will do a tad more work to shift the position of rows if the page has room for it.
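Translating this to MySQL/InnoDB terms (my framing, as a sketch with made-up names): the table is clustered on the primary key, so ascending AUTO_INCREMENT inserts append to the last page, while descending-key inserts would keep splitting the first page. And newest-first reads don't require the rows to be stored "on top"; the engine can simply read the primary-key index backward:

CREATE TABLE events (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,   -- monotonically increasing: inserts append
    created_at DATETIME NOT NULL,
    payload VARCHAR(255)
);

-- Latest rows first: a backward scan of the clustered index; no DESC storage needed.
SELECT * FROM events ORDER BY id DESC LIMIT 100;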
As for your order by clause in the select statement:
1) There's nothing in the SQL standard about indexes, and nothing that guarantees your result-set ordering except the ORDER BY clause. Normally, queries in SQL Server that use just one index will see results returned in the order of the index. But if the isolation level changes to "read uncommitted" (chaos?), then it will more likely return rows in the order it finds them in memory or on disk, which is not necessarily the order you want.
2) If the ORDER BY in your SELECT statement is based on exactly the same column criteria as the index, then your database server should perform the same with either the index order or the opposite of the index order. This is pretty straightforward, except perhaps if you have a multi-column index with mixed ASC/DESC declarations for different columns. You get equal performance with an ORDER BY equal to the index order and with an ORDER BY equal to the inverse index order, where the inverse index order is determined by swapping the ASC and DESC declarations (explicit and implicit) of the index definition in the ORDER BY clause.
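A sketch of that inverse-order rule with a hypothetical mixed index (note that MySQL honors DESC index keys only from 8.0; earlier versions parse but ignore the keyword):

CREATE INDEX ix_ab ON t (a ASC, b DESC);

-- Served by scanning ix_ab forward:
SELECT * FROM t ORDER BY a ASC, b DESC;

-- Served by scanning the same index backward (the inverse order):
SELECT * FROM t ORDER BY a DESC, b ASC;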
Any performance change would be on querying the records, not inserting them.
For queries, I doubt this will have much effect, as database queries by keys usually have similar speeds.
It also depends on your data, so I would run some tests.

Retrieving only a fixed number of rows in MySQL

I am testing my database design under load, and I need to retrieve only a fixed number of rows (5000).
I can specify a LIMIT to achieve this; however, it seems that the query builds the result set of all rows that match and then returns only the number of rows specified in the limit. Is that how it is implemented?
Is there a way for MySQL to read one row, read another one, and basically stop when it retrieves the 5000th matching row?
MySQL is smart in that if you specify a LIMIT 5000 in your query, and it is possible to produce that result without generating the whole result set first, then it will not build the whole result.
For instance, the following query:
SELECT * FROM table ORDER BY column LIMIT 5000
This query will need to scan the whole table unless there is an index on column, in which case it does the smart thing and uses the index to find the rows with the smallest column.
SELECT * FROM `your_table` LIMIT 0, 5000
This will display the first 5000 results from the database.
SELECT * FROM `your_table` LIMIT 1001, 5000
This will show records from 1001 to 6000 (counting from 0).
The complexity of such a query is O(LIMIT) (unless you specify ORDER BY).
It means that if 10,000,000 rows match your query and you specify a limit of 5000, then the complexity will be O(5000).
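One caveat: with an offset, the server still reads and discards the skipped rows, so LIMIT 1001, 5000 costs O(offset + limit), and later pages get progressively slower. A common workaround (a sketch, assuming id is indexed) is to seek past the last row of the previous page instead of using an offset:

SELECT *
FROM `your_table`
WHERE id > 6000      -- placeholder: the largest id returned by the previous page
ORDER BY id
LIMIT 5000;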
@Jarosław Gomułka is right:
If you use LIMIT with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. In either case, after the initial rows have been found, there is no need to sort any remainder of the result set, and MySQL does not do so.
If the set is not sorted, it terminates the SELECT operation as soon as it has enough rows in the result set.
The exact plan the query optimizer uses depends on your query (which fields are being selected, the LIMIT amount, and whether there is an ORDER BY) and your table (keys, indexes, and number of rows in the table). Selecting an unindexed column and/or ordering by a non-key column is going to produce a different execution plan than selecting a column and ordering by the primary key column. The latter will not even touch the table, and will only process the number of rows specified in your LIMIT.
Each database defines its own way of limiting the result set size, so the syntax depends on the database you are using.
While the SQL:2008 specification defines a standard syntax for limiting a SQL query, MySQL 8 does not support it.
Therefore, on MySQL, you need to use the LIMIT clause to restrict the result set to the Top-N records:
SELECT title
FROM post
ORDER BY id DESC
LIMIT 50
Notice that we are using an ORDER BY clause since, otherwise, there is no guarantee which records will be the first ones included in the returned result set.

Optimizing query instead of using order by

I want to run a simple query to get the "n" oldest records in the table (it has a creation_date column).
How can I get that without using ORDER BY? It is a very big table, and using ORDER BY on the entire table to get only "n" records is not so convincing.
(Assume n << size of table.)
When you are concerned about performance, you should probably not discard the use of ORDER BY too early.
Queries like that can be implemented as a Top-N query supported by an appropriate index. Such a query runs very fast because it doesn't need to sort the entire table, not even the selected rows, because the data is already sorted in the index.
example:
select *
from table
where A = ?
order by creation_date
limit 10;
Without an appropriate index it will be slow if you have lots of data. However, if you create an index like this:
create index test on table (A, creation_date);
The query will be able to start fetching the rows in the correct order, without sorting, and stop when the limit is reached.
Recipe: put the WHERE columns in the index, followed by the ORDER BY columns.
If there is no WHERE clause, just put the ORDER BY columns into the index. The ORDER BY must match the index definition, especially if there are mixed ASC/DESC orders.
The indexed Top-N query is the performance king -- make sure to use it.
A few links for further reading (all mine):
How to use index efficiently in mysql query
http://blog.fatalmind.com/2010/07/30/analytic-top-n-queries/ (Oracle centric)
http://Use-The-Index-Luke.com/ (not yet covering Top-N queries, but that's to come in 2011).
I haven't tested this concept before, but try creating an index on the creation_date column, which will automatically sort the rows in ascending order. Then your SELECT query can use ORDER BY creation_date DESC with LIMIT 20 to get the first 20 records. The database engine should realize the index has already done the sorting work and won't actually need to sort, because the index has already sorted it on save. All it needs to do is read the last 20 records from the index.
Worth a try.
Create an index on creation_date and query using ORDER BY creation_date ASC|DESC LIMIT n, and the response will be very fast (in fact it cannot be faster). For the "latest n" scenario you need to use DESC.
If you want more constraints on this query (e.g. WHERE state = 'LIVE'), then the query may become very slow, and you'll need to reconsider the indexing strategy.
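A sketch of that reconsideration, following the earlier recipe (WHERE columns first, then ORDER BY columns; t stands in for the actual table):

create index test2 on t (state, creation_date);

select *
from t
where state = 'LIVE'
order by creation_date desc
limit 20;

With an equality filter on state, the creation_date entries for that state are contiguous and ordered within the index, so the engine can scan them backward and stop after 20 rows.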
You can use GROUP BY if you're grouping some data, and then a HAVING clause to select specific records.