I'm wondering what should be expected from an SQL query that involves the operators DISTINCT, ORDER BY and LIMIT. I think my question is rooted in a misunderstanding of the order in which operators in SQL should be applied
For instance,
CREATE TABLE test(id int)
SELECT DISTINCT id
FROM test
ORDER BY id
LIMIT 10
Based on my knowledge of SQL, I can't see which (if any) of the following should happen
The first 10 rows of test are sorted, then a list of distinct ids in that subset are returned
A list of all the distinct ids in test are sorted, then the top 10 are returned
A list of the distinct id's in the first 10 rows of test are sorted then returned
If it matters, I'm using MySQL (MyISAM)
SELECT first, then ORDER BY, then LIMIT. That's true except for a few databases (well, I know one) that has both the TOP and LIMIT keywords. In that engine, LIMIT is part of the WHERE clause (evaluated at the SELECT level) and TOP is applied after ORDERing.
Think of it in terms of "inside to out", so the following happens:
Get a list of distinct ids
Order them in ascending order
List the first 10
Related
Lots of thread already on web, just trying to understand some nuances which had me confused!
Quoting the doc reference
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as
soon as it has found the first row_count rows of the sorted result,
rather than sorting the entire result. If ordering is done by using an
index, this is very fast.
and a SO thread
It will order first, then get the first 20. A database will also
process anything in the WHERE clause before ORDER BY.
Taking the same query from the question :
SELECT article
FROM table1
ORDER BY publish_date
LIMIT 20
lets say table has 2000 rows, of which query is expected to return 20 rows, now, looking at mysql ref ....stops sorting as soon as it has found the first row_count rows.... confuses me as i find it little ambiguous!!
Why does it say stops sorting? isn't the limit clause being applied on an already sorted data returned via order by clause ( assuming its a non-indexed column ) or is my understanding wrong and SQL is limiting first and then sorting!!??
The optimization mentioned in the documentation generally only works if there's an index on the publish_date column. The values are stored in the index in order, so the engine simply iterates through the index of the column, fetching the associated rows, until it has fetched 20 rows.
If the column isn't indexed, the engine will generally need to fetch all the rows, sort them, and then return the first 20 of these.
It's also useful to understand how this interacts with WHERE conditions. Suppose the query is:
SELECT article
FROM table1
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20
If publish_date is indexed and last_read_date is not, it will scan the publish_date index in order, test the associated last_read_date against the condition, and add article to the result set if the test succeeds. When there are 20 rows in the result set it will stop and return it.
If last_read_date is indexed and publish_date is not, it will use the last_read_date index to find the subset of all the rows that meet the condition. It will then sort these rows using the publish_date column, and return the first 20 rows from that.
If neither column is indexed it will do a full table scan to test last_read_date, sort all the rows that match the condition, and return the first 20 rows of this.
MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result
This is actually a very sensible optimisation within mysql. If you use limit to return 20 rows and mysql knows it already found them, then why would mysql (or you) care how exactly the rest of the records are sorted? It does not matter, therefore mysql stops sorting the rest of the rows.
If the order by is done on an indexed column, then mysql can tell pretty quickly, if it found the top n records.
The bellow statement does not work but i cant seem to figure out why
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the avg's of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR
If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL doesn't provide support for analytic functions. But it is possible to emulate one in MySQL, but the syntax is a bit involved, and it is dependent on behavior that is not guaranteed.
As an example:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT #row_ := IF(#prev_row_type = t.row_type, #row_ + 1, 1) AS row_
, #prev_row_type := t.row_type AS row_type
, t.delay_in_seconds
FROM A_table t
CROSS
JOIN (SELECT #prev_row_type := NULL, #row_ := NULL) i
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP
BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What that's effectively doing is assigning a row number to each row. The "order by" is sorting the rows in descending sequence by created_at, what we want is for the most recent row to be assigned a value of 1, the next most recent 2, etc. This numbering of rows will be repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type,created_at,delay_seconds) to avoid an expensive "Using filesort" operation. We need at least those first two columns for that, including the delay_seconds makes it a covering index (the query can be satisfied entirely from the index.)
The outer query then runs against the resultset returned from the view query (a "derived table"). The predicate in the WHERE filters out all rows that were assigned a row number greater than 1000, the rest is a straighforward GROUP BY and and AVG aggregate.
A LIMIT clause is entirely unnecessary. It may be possible to incorporate some additional predicates for some additional performance enhancement... like, what if we specified the most recent 1000 rows, but only that were create_at within the past 30 or 90 days?
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another approach. It's possible to use a "join" operation (of A_table with A_table) to get a row number assigned (getting a COUNT of the number of rows that are "more recent" than each row. With large sets, however, that can produce a humongous intermediate result, if we aren't careful to limit it.
Write the ORDER BY at the last of the statement.
SELECT AVG(delay_in_seconds) from A_TABLE GROUP BY row_type ORDER by created_at DESC limit 1000;
read mysql dev site for details.
I have the following SQL query , it seems to run ok , but i am concerned as my site grows it may not perform as expected ,I would like some feeback as to how effective and efficient this query really is:
select * from articles where category_id=XX AND city_id=XXX GROUP BY user_id ORDER BY created_date DESC LIMIT 10;
Basically what i am trying to achieve - is to get the newest articles by created_date limited to 10 , articles must only be selected if the following criteria are met :
City ID must equal the given value
Category ID must equal the given value
Only one article per user must be returned
Articles must be sorted by date and only the top 10 latest articles must be returned
You've got a GROUP BY clause which only contains one column, but you are pulling all the columns there are without aggregating them. Do you realise that the values returned for the columns not specified in GROUP BY and not aggregated are not guaranteed?
You are also referencing such a column in the ORDER BY clause. Since the values of that column aren't guaranteed, you have no guarantee what rows are going to be returned with subsequent invocations of this script even in the absence of changes to the underlying table.
So, I would at least change the ORDER BY clause to something like this:
ORDER BY MAX(created_date)
or this:
ORDER BY MIN(created_date)
some potential improvements (for best query performance):
make sure you have an index on all columns you querynote: check if you really need an index on all columns because this has a negative performance when the BD has to build the index. -> for more details take a look here: http://dev.mysql.com/doc/refman/5.1/en/optimization-indexes.html
SELECT * would select all columns of the table. SELECT only the ones you really require...
When querying the db for a set of ids, mysql doesnot provide the results in the order by which the ids were specified. The query i am using is the following:
SELECT id ,title, date FROM Table WHERE id in (7,1,5,9,3)
in return the result provided is in the order 1,3,5,7,9.
How can i avoid this auto sorting
If you want to order your result by id in the order specified in the in clause you can make use of FIND_IN_SET as:
SELECT id ,title, date
FROM Table
WHERE id in (7,1,5,9,3)
ORDER BY FIND_IN_SET(id,'7,1,5,9,3')
There is no auto-sorting or default sorting going on. The sorting you're seeing is most likely the natural sorting of rows within the table, ie. the order they were inserted. If you want the results sorted in some other way, specify it using an ORDER BY clause. There is no way in SQL to specify that a sort order should follow the ordering of items in an IN clause.
The WHERE clause in SQL does not affect the sort order; the ORDER BY clause does that.
If you don't specify a sort order using ORDER BY, SQL will pick its own order, which will typically be the order of the primary key, but could be anything.
If you want the records in a particular order, you need to specify an ORDER BY clause that tells SQL the order you want.
If the order you want is based solely on that odd sequence of IDs, then you'd need to specify that in the ORDER BY clause. It will be tricky to specify exactly that. It is possible, but will need some awkward SQL code, and will slow down the query significantly (due to it no longer using a key to find the records).
If your desired ID sequence is because of some other factor that is more predictable (say for example, you actually want the records in alphabetical name order), you can just do ORDER BY name (or whatever the field is).
If you really want to sort by the ID in an arbitrary sequence, you may need to generate a temporary field which you can use to sort by:
SELECT *,
CASE id
WHEN 7 THEN 1
WHEN 1 THEN 2
WHEN 5 THEN 3
WHEN 3 THEN 4
WHEN 9 THEN 5
END AS mysortorder
FROM mytable
WHERE id in (7,1,5,9,3)
ORDER BY mysortorder;
The behaviour you are seeing is a result of query optimisation, I expect that you have an index on id so that the IN statement will use the index to return records in the most efficient way. As an ORDER BY statement has not been specified the database will assume that the order of the return records is not important and will optimise for speed. (Checkout "EXPLAIN SELECT")
CodeAddicts or Spudley's answer will give the result you want. An alternative is assigning a priority to the id's in "mytable" (or another table) and using this to order the records as desired.
Assuming the query
SELECT * FROM table WHERE a='1' and b='2' and c>'3' and d>'4' and e!='5' and f='6'
returns 10000 results.
My question is, let's say I limit the search to the first 10 results like this:
SELECT * FROM table WHERE a='1' and b='2' and c>'3' and d>'4' and e!='5' and f='6' LIMIT 10
Will mysql search through all the 10000 results or it will stop at the 10th result?
LIMIT will only display the rows specified, based on their position within the resultset. Without an ORDER BY, you're relying on the order the records were inserted.
You'll probably be interested to read about MySQL's ORDER BY/LIMIT performance...
Since there is no ORDER BY, it will stop at the 10th result (after going through as many non-matching rows as necessary). As OMG Ponies says, which 10 rows you get is unspecified.