Order of WHERE predicates and the SQL optimizer - mysql

When writing SQL queries with several WHERE predicates (I only work with MySQL and SQLite), I often wonder whether I should reorder them to put the "best" ones first (those that eliminate the most rows) and the "cosmetic" ones later (those that barely change the output). In other words, I'm unsure whether reordering predicates really helps the optimizer run faster (especially when indexes are in play), or whether this is another case of premature optimization. Optimizers are usually smarter than me.
For example:
select address.* from address inner join
user on address.user = user.id
where address.zip is not null and address.country == user.country
If we know that address.zip is usually not null, that check will be true perhaps 90% of the time, and if the written order were respected there would be a lot of wasted checks, which could be avoided by evaluating the country comparison first.
Should I worry about this? In other words, does the order of WHERE predicates matter?

The MySQL optimizer is well documented, and you can find many interesting considerations in the official documentation: http://dev.mysql.com/doc/refman/5.7/en/where-optimizations.html
Keep one very simple fact in mind: SQL is not a procedural language but a declarative one. That means the order in which the parts of a query are written does not matter; only what has been declared matters. This is evident in MySQL's optimization documentation, where the focus is entirely on the components of a query and how the optimizer transforms them into internal components.

The order is mostly irrelevant.
In MySQL, with WHERE ... AND ...,
The Optimizer will first look at which part can use an index. If one side can and the other can't, the optimizer will use the index; the order becomes irrelevant.
If both sides of the AND can use an index, MySQL will usually pick the 'better' one. (Sometimes it goofs.) Again, the order is ignored.
If neither side can use an index, the expression is evaluated left to right. But fetching rows is the bulk of the effort in performing a query, so if one side of the AND is a little slower than the other, you probably won't notice. (Sure, if one side does SLEEP(3), you will notice.)
There's another issue in your example query (aside from the syntax error): The Optimizer will make a conscious decision of which table to start with.
If it decides to start with user, address needs INDEX(user, country) in either order.
If it decides to start with address, user needs INDEX(id, country) in either order.
It is unclear whether the Optimizer will bother with the NOT NULL test, even if that column is indexed.
Bottom line: Spend your time focusing on optimal indexes.
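Since the OP also works with SQLite, here is a small sketch of that advice using Python's sqlite3 module and EXPLAIN QUERY PLAN (the table and index names are made up for illustration): before the index exists the plan is a full scan, after it the plan is an index search, regardless of how the WHERE is phrased.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address "
             "(id INTEGER PRIMARY KEY, user INT, country TEXT, zip TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports the human-readable step in column 3
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

sql = "SELECT * FROM address WHERE user = 42 AND country = 'NL'"
before = plan(sql)   # no index to use: a full table scan
conn.execute("CREATE INDEX idx_user_country ON address (user, country)")
after = plan(sql)    # now an index search
print(before)
print(after)
```

The exact plan text varies between SQLite versions, but the SCAN-versus-SEARCH distinction is what the "focus on indexes" advice is about.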

The answer is definitely maybe.
Mysterious are the ways of the optimizer.
Here is a demonstration based on exception caused due to division by zero.
create table t (i int);
insert into t (i) values (0);
The following query succeeds for Oracle, SQL Server, Postgres and Teradata (we'll skip the version information for now):
select 1 from t where i < 1 or 1/i < 1;
The following query fails for SQL Server and Postgres, but succeeds for Oracle and Teradata:
select 1 from t where 1/i < 1 or i < 1;
However, the following query does fail for Oracle and Teradata:
select 1 from t where 1/i < 1 or i/1 < 1;
What do we learn?
That some optimizers seem to respect the order of the predicates (at least to some extent), while others seem to reorder the predicates by their estimated cost (e.g., 1/i < 1 is costlier than i < 1, but not costlier than i/1 < 1).
For those that respect the order of the predicates, we can probably improve performance by putting the lightweight predicates first for OR operators, and the frequently false predicates first for AND operators.
That being said, since databases do not guarantee to preserve the order of predicates, even if some of them currently seem to do so, you definitely can't count on it.
MySQL 5.7.11
This query returns immediately:
select 1 from t where i < 1 or sleep(3);
This query returns after 3 seconds:
select 1 from t where sleep(3) or i < 1;
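For completeness, since the question mentions SQLite: SQLite takes yet another path in this demonstration. It does not raise on division by zero at all (1/0 evaluates to NULL), so both predicate orders succeed. A quick check with Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (i INT)")
conn.execute("INSERT INTO t (i) VALUES (0)")

# SQLite does not raise on division by zero: 1/0 evaluates to NULL,
# and NULL < 1 is not true, so both orders fall through to i < 1
# and return the same single row.
r1 = conn.execute("SELECT 1 FROM t WHERE i < 1 OR 1/i < 1").fetchall()
r2 = conn.execute("SELECT 1 FROM t WHERE 1/i < 1 OR i < 1").fetchall()
print(r1, r2)
```

So in SQLite the division-by-zero probe can't distinguish whether predicates are reordered at all.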

Related

is where (A and B) and where (B and A) in SQL the same? [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
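The same can be verified in SQLite (which other answers here also discuss) with EXPLAIN QUERY PLAN: both condition orders should compile to the same index search. A sketch with Python's sqlite3 module, using the question's table with a hypothetical index name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEOPLE "
             "(ID INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT)")
conn.execute("CREATE INDEX idx_last_first ON PEOPLE (LastName, FirstName)")

def plan(sql):
    # column 3 of EXPLAIN QUERY PLAN holds the human-readable step
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Same predicates, opposite order in the WHERE clause
p1 = plan("SELECT * FROM PEOPLE WHERE FirstName = 'F' AND LastName = 'L'")
p2 = plan("SELECT * FROM PEOPLE WHERE LastName = 'L' AND FirstName = 'F'")
print(p1)
print(p2)
```

Both plans come out identical and use the index, supporting the point that the optimizer, not the textual order, decides the access path.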
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No; all RDBMSs first analyse the query and optimize it, reordering your WHERE clause as needed.
Depending on which RDBMS you are using, you can display the result of that analysis (search for "explain plan" in Oracle, for instance).
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which order is best (which could differ every time), the DBMS would have to run a distinct-count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement:
"My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first sql."
I guess you are confusing this with choosing the order of columns when creating an index, where you do have to put the most selective column first, then the second most selective, and so on.
By the way, for the two queries above, the SQL Server optimizer will not do any optimization at all but will use a trivial plan, as long as the total cost of the plan is below the cost threshold for parallelism.

MySQL where on indexed column and not indexed behavior [duplicate]

Say that I have a long, expensive query, packed with conditions, searching a large number of rows. I also have one particular condition, like a company id, that will limit the number of rows that need to be searched considerably, narrowing it down to dozens from hundreds of thousands.
Does it make any difference to MySQL performance whether I do this:
SELECT * FROM clients WHERE
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar) AND
company = :ugh
or this:
SELECT * FROM clients WHERE
company = :ugh AND
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar)
Here is a demo showing the order of WHERE clause conditions can make a difference due to short-circuiting. It runs the following queries:
-- query #1
SELECT myint FROM mytable WHERE myint >= 3 OR myslowfunction('query #1', myint) = 1;
-- query #2
SELECT myint FROM mytable WHERE myslowfunction('query #2', myint) = 1 OR myint >= 3;
The only difference between these is the order of operands in the OR condition.
myslowfunction deliberately sleeps for a second and has the side effect of adding an entry to a log table each time it is run. Here are the results of what is logged when running the two queries:
myslowfunction called for query #1 with value 1
myslowfunction called for query #1 with value 2
myslowfunction called for query #2 with value 1
myslowfunction called for query #2 with value 2
myslowfunction called for query #2 with value 3
myslowfunction called for query #2 with value 4
The above shows that a slow function is executed more times when it appears on the left side of an OR condition when the other operand isn't always true.
So IMO the answer to the question:
Does the order of conditions in a WHERE clause affect MySQL performance?
is "Sometimes it can do."
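The demo above can be reproduced in SQLite with Python's sqlite3 module, replacing the sleeping function with one that merely logs its calls (this assumes, as the demo suggests, that the engine short-circuits OR left to right):

```python
import sqlite3

log = []

def myslowfunction(tag, value):
    # stand-in for the demo's slow UDF: record each call instead of sleeping
    log.append(tag)
    return 0

conn = sqlite3.connect(":memory:")
conn.create_function("myslowfunction", 2, myslowfunction)
conn.execute("CREATE TABLE mytable (myint INT)")
conn.executemany("INSERT INTO mytable (myint) VALUES (?)",
                 [(1,), (2,), (3,), (4,)])

conn.execute("SELECT myint FROM mytable WHERE myint >= 3 "
             "OR myslowfunction('query #1', myint) = 1").fetchall()
conn.execute("SELECT myint FROM mytable WHERE myslowfunction('query #2', myint) = 1 "
             "OR myint >= 3").fetchall()

calls_1 = log.count('query #1')  # skipped whenever myint >= 3 short-circuits
calls_2 = log.count('query #2')  # evaluated for every row
print(calls_1, calls_2)
```

The call counts mirror the logged output in the demo: the function on the right of the OR runs only for rows where the left operand is false.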
No, the order should not make a large difference. When finding which rows match the condition, the condition as a whole (all of the sub-conditions combined via boolean logic) is examined for each row.
Some intelligent DB engines will attempt to guess which parts of the condition can be evaluated faster (for instance, things that don't use built-in functions) and evaluate those first, leaving the (estimated) more complex elements for later. This is determined by the DB engine, though, not by the SQL.
The order of columns in your where clause shouldn't really matter, since MySQL will optimize the query before executing it. But I suggest you read the chapter on Optimization in the MySQL reference manual, to get a basic idea on how to analyze queries and tables, and optimize them if necessary. Personally though, I would always try to put indexed fields before non-indexed fields, and order them according to the number of rows that they should return (most restrictive conditions first, least restrictive last).
Mathematically, yes, it has an effect, and not only in SQL queries: it applies in all programming languages, whenever an expression combines and / or.
This is the theory of complete versus partial (short-circuit) evaluation.
If it is an AND expression and the first operand evaluates to false, the rest is not checked, since ANDing false with anything yields false.
Similarly, in an OR expression, if the first operand is true, the rest is not checked.
A sophisticated DBMS should be able to decide on its own which where condition to evaluate first. Some Databases provide tools to display the "strategy" how a query is executed. In MySQL, e.g. you can enter EXPLAIN in front of a query. The DBMS then prints the actions it performed for executing the query, as e.g. index or full-table scan. So you could see at a glance whether or not it uses the index for 'company' in both cases.
This shouldn't have any effect, but if you aren't sure, why don't you simply try it out? The order of WHERE clauses in a SELECT from a single table makes no difference, but if you join multiple tables, the order of the joins can affect performance (sometimes).
I don't think the order of the where clause has any impact. I think the MySQL query optimizer will reorganize where clauses as it sees fit so it filters away the largest subset first.
It's another deal when talking about joins. The optimizer tries to reorder here too, but it doesn't always find the best way and sometimes doesn't use indexes. STRAIGHT_JOIN and FORCE INDEX let you take charge of the query.
No, it doesn't: the required tables are selected and then evaluated row by row. The order can be arbitrary.

SQL row return order

I have only used SQL rarely until recently when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table the rows returned appear to be in the same order as they appear if I select the whole table
The order of rows returned by selecting from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySql, Oracle, PostgreSQL, Sqlite, Sql Server)? (I don't really even know whether one can truly count on it in sqlite). How strictly is it honored if so (e.g. if one uses "group by" would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules" of
the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of
the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server will decide how to return the rows in whatever manner it deems most efficient right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc. etc. etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the plan chosen involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in process, your query will start its scan at whatever point the other scan has already reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
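A minimal SQLite illustration of the point, via Python's sqlite3 module: only the ORDER BY query gives a guaranteed order; the unordered result just happens to come out in some plan-dependent order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INT)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(3,), (1,), (2,)])

# No ORDER BY: the rows come back in whatever order the plan produces.
# It often looks like insertion (rowid) order here, but nothing guarantees it.
unordered = conn.execute("SELECT x FROM t").fetchall()

# With ORDER BY the order is guaranteed.
ordered = conn.execute("SELECT x FROM t ORDER BY x").fetchall()
print(unordered, ordered)
```

The same rows come back either way; only the second query makes a promise about their sequence.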
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2 to name a few) a DISTINCT clause also has the same effect as an ORDER BY as finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table
For DISTINCT and ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation being performed
And not surprisingly, the effect is thus the same

What databases could run the following SQL?

I have constructed a query and I'm wondering if it would work on any database besides MySQL. I have never actually used another database so I'm not great with the differences.
UPDATE `locks` AS `l1`
CROSS JOIN (SELECT SUM(`value`) AS `sum` FROM `locks`
WHERE `key` IN ("key3","key2")) AS `l2`
SET `l1`.`value` = `l1`.`value` + 1
WHERE `l1`.`key` = "key1" AND (`l2`.`sum` < 1);
Here are the specific features I'm relying on (as I can think of them):
Update queries.
Joins in update queries.
Aggregate functions in non-explicitly-grouped queries.
WHERE...IN condition.
I'm sure people will be curious exactly what this does, and this may also include database features that might not be ubiquitous. This is an implementation of mutual exclusion using a database, intended for a web application. In my case I needed it because certain user actions cause tables to be dropped and recreated with different columns, and I want to avoid errors if other parts of the application try to insert data. The implementation, therefore, is specialized to solve the readers-writers problem.
This query assumes there exists a table locks with two fields: key (varchar) and value (int). It further assumes that the table contains a row such that key="key1". Then it tries to increment the value for "key1". It only does so if for every key in the list ("key2","key3"), the associated value is 0 (the WHERE condition for l2 is an approximation that assumes value is never negative). Therefore this query only "obtains a lock" if certain conditions are met, presumably in an atomic fashion. Then, the application checks if it received a lock by the return value of the query which presumably states how many rows were affected. If and only if no rows were affected, the application did not receive a lock.
So, here are the additional conditions not discernible from the query itself:
Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
Processing the query must return whether any values were affected.
As a secondary request, I would appreciate any resources on "standard SQL." I've heard about it but never been able to find any kind of definition, and I feel like I'm missing a lot of things when the MySQL documentation says "this feature is an extension of standard SQL."
Based on the responses, this query should work better across all systems:
UPDATE locks AS l1
CROSS JOIN (SELECT SUM(val) AS others FROM locks
WHERE keyname IN ('key3','key2')) AS l2
SET l1.val = l1.val + 1
WHERE l1.keyname = 'key1' AND (l2.others < 1);
Upvotes for everyone because of the good answers. The marked answer seeks to directly answer my question, even if just for one other DBMS, and even though there may be better solutions to my particular problem (or even the problem of cross-platform SQL in general).
This exact syntax would only work in MySQL.
It's an ugly workaround for this construct:
UPDATE locks
SET value = 1
WHERE key = 'key1'
AND NOT EXISTS
(
SELECT NULL
FROM locks li
WHERE li.key IN ('key2', 'key3')
AND li.value > 0
)
which works in all systems except MySQL, because the latter does not allow subqueries on the target table in UPDATE or DELETE statements.
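That NOT EXISTS construct can be tried out directly in SQLite, which does allow a subquery on the target table in an UPDATE, via Python's sqlite3 module (the key column is double-quoted since KEY is a keyword in some dialects):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE locks ("key" TEXT, value INT)')
conn.executemany('INSERT INTO locks ("key", value) VALUES (?, ?)',
                 [("key1", 0), ("key2", 0), ("key3", 0)])

ACQUIRE = """
UPDATE locks SET value = value + 1
WHERE "key" = 'key1'
  AND NOT EXISTS (SELECT NULL FROM locks li
                  WHERE li."key" IN ('key2', 'key3') AND li.value > 0)
"""

granted = conn.execute(ACQUIRE).rowcount   # 1: no conflicting lock is held
conn.execute("""UPDATE locks SET value = 1 WHERE "key" = 'key2'""")
refused = conn.execute(ACQUIRE).rowcount   # 0: key2 is now held
print(granted, refused)
```

The affected-row count is exactly the "did I get the lock?" signal the question relies on.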
For PostgreSQL
1) Update queries.
Can't imagine an RDBMS that has no UPDATE. (?)
2) Joins in update queries.
In PostgreSQL you would include additional tables with FROM from_list.
3) Aggregate functions in non-grouped queries.
Not possible in PostgreSQL. Use subqueries, CTE or Window functions for that.
But your query is grouped. The GROUP BY clause is just not spelled out. That works in PostgreSQL, too.
The presence of HAVING turns a query into a grouped query even if
there is no GROUP BY clause. This is the same as what happens when the
query contains aggregate functions but no GROUP BY clause.
(Quote from the manual).
4) WHERE...IN condition
Works in any RDBMS I know of.
"Additional conditions": Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
PostgreSQL's multiversion model MVCC (Multiversion Concurrency Control) is superior to MySQL for handling concurrency. Then again, most RDBMS are superior to MySQL in this respect.
Processing the query must return whether any values were affected.
Postgres does that, most every RDBMS does.
Furthermore, this query wouldn't run in PostgreSQL because:
no identifiers with backticks (that's MySQL slang).
values need to be single-quoted, not double-quoted.
See the list of reserved words in Postgres and SQL standards.
A combined list for various RDBMS.
This will only work in MySQL, simply because you use the backtick (`) identifier delimiter, which is MySQL-specific.
If you replace the delimiter with a more standard one, it will probably work in all modern DBMSs (Postgres, SQL Server, Oracle). Still, I would never write one general query for all of them; I'd rather write a specific query for each DBMS that is used (or might be used), taking advantage of its specific dialect for the best performance and query readability.
As for "I would appreciate any resources on standard SQL": have a look at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

What happens when limiting an oracle subquery

A friend of mine and I recently were having a debate about this, so I was wondering if anyone here might actually know the answer.
Generally, to emulate the LIMIT start_index,result_count syntax of MySQL and the LIMIT result_count OFFSET start_index functionality of MySQL and PostgreSQL in Oracle, we use:
SELECT P.* FROM
( SELECT COL1...COLN, ROW_NUMBER() OVER (ORDER BY ID_COLUMN) AS RN FROM MY_TABLE ) P
WHERE P.RN BETWEEN START_INDEX AND END_INDEX;
Instead of an explicit limit function, this alternate means needs to be used.
(If there is a better way, please let me know)
One of us argued that this means that Oracle actually fetches END_INDEX records and then only returns those records which have a rn over START_INDEX. This means that when looking for records 123,432-123,442 Oracle would retrieve 123,431 unnecessary records. It was then argued that it followed that the two open source DB's mentioned (MySQL & PgSQL) by implication have a means of shortcutting this.
The counter argument is that DBMS's are optimized to handle sub-queries, meaning that the syntax does not necessarily imply the behavior. Further, the LIMIT syntaxes are likely merely syntactic sugar which are really wrappers around what has to be stated explicitly in Oracle.
Is there any who can determine which of these is the correct interpretation? Perhaps both are correct in some way?
It is correct that Oracle processes END_INDEX rows and discards rows 1 to START_INDEX-1, but it only fetches rows START_INDEX to END_INDEX in the cursor.
I don't know how LIMIT is implemented in the other DBMSs, but I can't imagine how they could do otherwise: how would they know they are fetching the 123,432nd row of the result set without first finding and discarding the previous 123,431 rows?
In practice, if you find yourself applying a LIMIT clause (or Oracle equivalent) with a START_INDEX of more than a few hundreds, you really need to rethink your requirements.
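The two spellings of the pagination can be compared side by side in SQLite, which supports both window functions (SQLite 3.25 or later, bundled with recent Python builds) and LIMIT/OFFSET; they should return the same page:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO my_table (id) VALUES (?)",
                 [(i,) for i in range(1, 101)])

# Oracle-style ROW_NUMBER() pagination: rows 11 through 20
rn_rows = conn.execute("""
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM my_table
    ) WHERE rn BETWEEN 11 AND 20
""").fetchall()

# LIMIT/OFFSET spelling of the same page
lim_rows = conn.execute(
    "SELECT id FROM my_table ORDER BY id LIMIT 10 OFFSET 10").fetchall()

print(rn_rows == lim_rows)
```

Either way the engine has to work its way past the first 10 rows before it can emit the page, which is the crux of the argument above: the LIMIT syntax is sugar for the same underlying work.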