I have a computationally expensive user-defined function that I need to run against a large dataset. I don't sort, and I don't ask for a row count (no FOUND_ROWS). If I specify LIMIT as part of the query, does the MySQL engine actually stop calling the function after it has found LIMIT rows, or does it run the function against the entire dataset regardless? Example:
SELECT cols, ... FROM tbl WHERE fingerprint_match(col, arg) > score LIMIT 5;
Ideally, fingerprint_match would be called as few as 5 times, if the first (arbitrary) rows examined produced a passing score.
As documented under Optimizing LIMIT Queries:
MySQL sometimes optimizes a query that has a LIMIT row_count clause and no HAVING clause:
[ deletia ]
As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS.
I believe the query will stop processing as soon as the specified number of matches has been found, but ONLY IF there is no ORDER BY clause. Otherwise it must find and sort all matching rows before applying the limit.
The only evidence I have for this is the statement in the docs that "LIMIT 0 quickly returns an empty set. This can be useful for checking the validity of a query." This suggests to me that MySQL doesn't bother applying the WHERE clause to any rows once the limit has already been satisfied.
http://dev.mysql.com/doc/refman/5.6/en/limit-optimization.html
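To illustrate the distinction, here is a sketch based on the question's query (the documents table, id column, and 0.8 threshold are hypothetical):

```sql
-- No ORDER BY: MySQL can abort the scan as soon as 5 matching rows are
-- found, so fingerprint_match() only runs for the rows read up to then.
SELECT id, col
FROM documents
WHERE fingerprint_match(col, 'needle') > 0.8
LIMIT 5;

-- ORDER BY a non-indexed expression: every row must be evaluated and
-- sorted before the LIMIT can be applied, so the UDF runs for all rows.
SELECT id, col
FROM documents
WHERE fingerprint_match(col, 'needle') > 0.8
ORDER BY fingerprint_match(col, 'needle') DESC
LIMIT 5;
```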
Related
I have a large database in which I use LIMIT so that I don't fetch all of the query's results every time (it isn't necessary). But I have an issue: I also need to count the total number of results. The dumbest solution is the following, and it works:
We just get the data that we need:
SELECT * FROM table_name WHERE param > 3 LIMIT 10
And then we find the length:
SELECT COUNT(1) FROM table_name WHERE param > 3 LIMIT 10
But this solution bugs me because, unlike the query in this question, the one I actually work with is complex, and I basically have to run it twice to get both results.
Another dumb solution for me was to do:
SELECT COUNT(1), param, anotherparam, additionalparam FROM table_name WHERE param > 3 LIMIT 10
But this returns only one row. At this point I'd even be OK if it just filled a count column with the same number in every row; I just need this information without wasting computation time.
Is there a better way to achieve this?
P.S. By the way, I am not looking to get 10 as the result of COUNT; I need the total count as if there were no LIMIT.
You should (probably) run the query twice.
MySQL does have a FOUND_ROWS() function (used together with the SQL_CALC_FOUND_ROWS modifier) that reports the number of rows matched before the LIMIT was applied. But using this feature is often worse for performance than running the query twice!
https://www.percona.com/blog/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
...when we have appropriate indexes for WHERE/ORDER clause in our query, it is much faster to use two separate queries instead of one with SQL_CALC_FOUND_ROWS.
There are exceptions to every rule, of course. If you don't have an appropriate index to optimize the query, it could be more costly to run the query twice. The only way to be sure is to repeat the tests shown in that blog, using your data and your query on your server.
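As a concrete sketch, the two-query pattern applied to the question's own example looks like this (SQL_CALC_FOUND_ROWS is shown only for comparison; it is deprecated as of MySQL 8.0.17):

```sql
-- Query 1: fetch the page of rows.
SELECT * FROM table_name WHERE param > 3 LIMIT 10;

-- Query 2: count all matching rows. No LIMIT is needed here, because
-- COUNT(*) returns exactly one row anyway.
SELECT COUNT(*) FROM table_name WHERE param > 3;

-- The single-query alternative discussed above, often slower in practice:
SELECT SQL_CALC_FOUND_ROWS * FROM table_name WHERE param > 3 LIMIT 10;
SELECT FOUND_ROWS();
```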
This question is very similar to: How can I count the numbers of rows that a MySQL query returned?
See also: https://mariadb.com/kb/en/found_rows/
Running the query twice is probably the most efficient solution to your problem, but it's best to test both approaches using EXPLAIN and a reasonably sized dataset.
I was recently trying to use a user-defined variable to capture some information from the last row returned in my result set.
What I mean is, for example if I have a list of names from 'Aaron' to 'Zzarx',
SELECT @n:=Name FROM people ORDER BY Name;
SELECT @n;
The second SELECT should return 'Zzarx'.
That's the simple case. It works as expected; variable assignment reliably occurs in the same order as rows are sent to the client, so the last assignment corresponds to the last returned row.
But strange things seem to happen when the query is more complicated:
SELECT DISTINCT IFNULL(@n:=Name,'unknown') FROM people ORDER BY <some non-indexed expression> LIMIT 10;
SELECT @n;
Executing something like this on MariaDB v10.3.16, I get a final value of @n (from the second SELECT) that doesn't correspond to any of the rows returned by the first SELECT! (Note that Name is a NOT NULL column, so the IFNULL() is actually redundant, but it is still necessary to trigger this behaviour.)
Note that it only seems to happen when ALL of the following hold:
SELECT DISTINCT
ORDER BY can't use an index
The variable assignment happens inside some expression
My theory is that:
SELECT DISTINCT forces early evaluation of the returned column expressions.
ORDER BY (non-indexed expression) forces an explicit sort operation after column data has been evaluated.
The SQL engine is smart enough to recognize the simple SELECT @var := (expression) pattern and evaluate @var only as the row is sent to the client, but it can't make that optimization if the @var:=... assignment is embedded inside a larger expression, as with the IFNULL() in my example.
However, this is all only guesswork.
The manual page on user-defined variables doesn't really say anything useful in this regard (neither MySQL's nor MariaDB's).
It seems to me that using a @variable to capture something from the last-returned row of a multi-row query is a useful and probably quite commonplace trick, but now I'm not sure whether or when I can rely on it. The same goes for the many row-numbering and other clever schemes I've seen that use @variables in the result-set part of a SELECT.
Does someone here on SO have any definitive information on how this is supposed to work, and specifically, under what conditions will the order of evaluations of row variable-assignment expressions be guaranteed to correspond to the actual order of rows returned?
...Because this seems to be quite an important thing to know!
Another, slightly less pathological example:
Say table t has 1000 rows:
SET @n:=0;
SELECT @n:=@n+1 FROM t ORDER BY 1 DESC LIMIT 5;
SELECT @n;
Returned result sets are:
1000
999
998
997
996
and
1000
Note that once again, the final value of @n does NOT correspond to the last row returned, and indeed, given the semantics of the query, in this case it can't.
Although you are not using 8.0.13, the following will be coming soon. You have found a reason why it is coming.
----- 2018-10-22 8.0.13 General Availability -- Important Change -----
Setting user variables in statements other than SET is now deprecated due to issues that included those listed here:

The order of evaluation for expressions involving user variables was undefined.

The default result type of a variable is based on its type at the beginning of the statement, which could have unintended effects when a variable holding a value of one type at the beginning of a statement was assigned a new value of a different type in the same statement.

HAVING, GROUP BY, and ORDER BY clauses, when referring to a variable that was assigned a value in the select expression list, did not work as expected because the expression was evaluated on the client and so it was possible for stale column values from a previous row to be used.
Syntax such as SELECT @var, @var:=@var+1 is still accepted in MySQL 8.0 for backward compatibility, but is subject to removal in a future release.
-- From the "change log".
Think of DISTINCT as similar to GROUP BY.
SELECT #v := ... FROM t ORDER BY x;
Case 1: INDEX(x) but the Optimizer may choose to fetch the rows, then sort them.
Case 2: INDEX(x) and the Optimizer chooses to fetch the rows based on the index.
SELECT #v := ... FROM t GROUP BY w ORDER BY x;
This almost certainly requires generating a temp table (for ordering), maybe two (one for grouping and one for ordering). The only rational way to run the query is to evaluate the expressions (including @v) in the SELECT, gather the results, then proceed to grouping and ordering. So the evaluation order is not likely to match x, but it might mimic w.
What about PARTITIONing? Currently there is no parallelism in MySQL's evaluation of a SELECT. But what if that came into existence? Take an 'obvious' case: separate threads working on separate PARTITIONs of the table. All bets are off on the order of evaluation.
Once that is implemented, how about splitting up even a non-partitioned SELECT to get some parallelism?
You are not going to win the argument.
Yes, it may stay "deprecated" for a long time. Or maybe there will be a sql_mode that runs queries the "old" way. Or the existence of @variables will inhibit certain optimizations (in favor of predictability). Etc.
May I suggest that you file a "feature request" at bugs.mysql.com, stating what you would like to see. (You could also do it at mariadb.com, but they look at the former.)
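For what it's worth, MySQL 8.0 and MariaDB 10.2+ offer window functions, which replace most @variable tricks with well-defined semantics. A sketch, using the people/Name example from the question above:

```sql
-- Row numbering without @variables:
SELECT ROW_NUMBER() OVER (ORDER BY Name) AS rn, Name
FROM people;

-- "Capture the last row" without @variables:
SELECT Name FROM people ORDER BY Name DESC LIMIT 1;
```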
When I add LIMIT 1 to a MySQL query, does it stop the search after it finds 1 result (thus making it faster) or does it still fetch all of the results and truncate at the end?
Depending on the query, adding a limit clause can have a huge effect on performance. If you want only one row (or know for a fact that only one row can satisfy the query), and are not sure about how the internal optimizer will execute it (for example, WHERE clause not hitting an index and so forth), then you should definitely add a LIMIT clause.
As for optimized queries (those using indexes on small tables) it probably won't matter much for performance; but again, if you are only interested in one row, then add a LIMIT clause regardless.
LIMIT can affect the performance of the query (see the comments and the link below), and it also reduces the size of the result set that MySQL sends back. For a query in which you expect a single result, there are benefits.
Moreover, limiting the result set can in fact speed up the total query time, since transferring large result sets uses memory and can create temporary tables on disk. I mention this because I recently saw an application that did not use LIMIT kill a server with huge result sets; with LIMIT in place, resource utilization dropped tremendously.
Check this page for more specifics: MySQL Documentation: LIMIT Optimization
The answer, in short, is yes. If you limit your result to 1, then even if you are "expecting" one result, the query will be faster because your database won't look through all of your records. It will simply stop once it finds a record that matches your query.
If there is only 1 result coming back, then no, LIMIT will not make it any faster. If there are a lot of results and you only need the first one, and there are no GROUP BY or ORDER BY clauses, then LIMIT will make it faster.
If you really expect only a single result, it makes sense to append LIMIT to your query. I don't know the inner workings of MySQL, but I'm sure it won't gather a result set of 100,000+ records just to truncate it back to 1 at the end.
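One way to see what the server plans to do, with and without the LIMIT, is to compare EXPLAIN output (hypothetical table with no index on the filtered column):

```sql
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com' LIMIT 1;

-- Both plans may estimate a full table scan, but with LIMIT 1 execution
-- stops at the first match; the EXPLAIN row estimate does not show this.
```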
I have only used SQL rarely until recently when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table, the rows returned appear to be in the same order as they appear when I select the whole table.
The order of rows returned by selecting from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySQL, Oracle, PostgreSQL, SQLite, SQL Server)? (I don't really even know whether one can truly count on it in SQLite.) If so, how strictly is it honoured (e.g. if one uses GROUP BY, would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules", of the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to return the rows in whatever manner it deems most efficient right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behaviour you observe: it can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this can happen: if the chosen plan involves a sequential scan of a table's heap and there is already a seqscan of that heap in progress, your query will start its scan at whatever point the other scan has reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
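In short, if the order matters, say so explicitly. A sketch with hypothetical table and column names:

```sql
-- Row order here is whatever the chosen plan happens to produce,
-- and may differ between runs:
SELECT name, SUM(amount) AS total FROM orders GROUP BY name;

-- Row order here is guaranteed; the planner may even pick an index
-- that provides it for free:
SELECT name, SUM(amount) AS total FROM orders GROUP BY name ORDER BY name;
```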
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true. On every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2, to name a few), a DISTINCT clause also has the same effect as an ORDER BY, since finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table
For DISTINCT and ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation being performed
And not surprisingly, the effect is thus the same
How is it that when I use LIMIT, MySQL checks the same number of rows? And how do I solve this?
The WHERE clause is processed first. Once the matches are found, the LIMIT is applied to the result set, so all of the rows have to be evaluated to determine whether they match the conditions before the limit can be applied.
The EXPLAIN output is misleading. MySQL will evaluate the query using the WHERE clause and so on, but it will stop after it finds LIMIT matching rows (1 in this case). This is a known issue with MySQL: http://bugs.mysql.com/bug.php?id=50168
To clarify: the LIMIT clause will work as expected; it's only the EXPLAIN output that's inaccurate.
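A minimal sketch of the discrepancy, assuming a table t with no index on the filtered column:

```sql
-- EXPLAIN may estimate that (nearly) all rows will be examined:
EXPLAIN SELECT * FROM t WHERE status = 'open' LIMIT 1;

-- At execution time, however, the scan stops at the first matching row,
-- so the real number of rows examined depends on where that row sits.
SELECT * FROM t WHERE status = 'open' LIMIT 1;
```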