MySQL where on indexed column and not indexed behavior [duplicate]

Say that I have a long, expensive query, packed with conditions, searching a large number of rows. I also have one particular condition, like a company id, that will limit the number of rows that need to be searched considerably, narrowing it down to dozens from hundreds of thousands.
Does it make any difference to MySQL performance whether I do this:
SELECT * FROM clients WHERE
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar) AND
company = :ugh
or this:
SELECT * FROM clients WHERE
company = :ugh AND
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar)

Here is a demo showing that the order of WHERE clause conditions can make a difference due to short-circuiting. It runs the following queries:
-- query #1
SELECT myint FROM mytable WHERE myint >= 3 OR myslowfunction('query #1', myint) = 1;
-- query #2
SELECT myint FROM mytable WHERE myslowfunction('query #2', myint) = 1 OR myint >= 3;
The only difference between these is the order of operands in the OR condition.
myslowfunction deliberately sleeps for a second and has the side effect of adding an entry to a log table each time it is run. Here are the results of what is logged when running the two queries:
myslowfunction called for query #1 with value 1
myslowfunction called for query #1 with value 2
myslowfunction called for query #2 with value 1
myslowfunction called for query #2 with value 2
myslowfunction called for query #2 with value 3
myslowfunction called for query #2 with value 4
The above shows that a slow function is executed more times when it appears on the left side of an OR condition: on the left it runs for every row, while on the right it runs only for rows where the other operand isn't true.
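For reference, here is a minimal sketch of a setup that could reproduce this log (the function body and table contents are assumptions inferred from the description above):
CREATE TABLE mytable (myint INT);
INSERT INTO mytable (myint) VALUES (1), (2), (3), (4);
CREATE TABLE mylog (msg VARCHAR(100));
DELIMITER $$
CREATE FUNCTION myslowfunction(which VARCHAR(20), val INT) RETURNS INT
NOT DETERMINISTIC MODIFIES SQL DATA
BEGIN
  DO SLEEP(1);  -- deliberately slow
  INSERT INTO mylog (msg)
    VALUES (CONCAT('myslowfunction called for ', which, ' with value ', val));
  RETURN 0;  -- never 1, so the other OR operand decides whether the row matches
END$$
DELIMITER ;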
So IMO the answer to the question:
Does the order of conditions in a WHERE clause affect MySQL performance?
is "Sometimes it can do."

No, the order should not make a large difference. When finding which rows match the condition, the condition as a whole (all of the sub-conditions combined via boolean logic) is examined for each row.
Some intelligent DB engines will attempt to guess which parts of the condition can be evaluated faster (for instance, things that don't use built-in functions) and evaluate those first, while elements estimated to be more complex get evaluated later. This is determined by the DB engine, though, not by the SQL.

The order of columns in your where clause shouldn't really matter, since MySQL will optimize the query before executing it. But I suggest you read the chapter on Optimization in the MySQL reference manual, to get a basic idea on how to analyze queries and tables, and optimize them if necessary. Personally though, I would always try to put indexed fields before non-indexed fields, and order them according to the number of rows that they should return (most restrictive conditions first, least restrictive last).

Mathematically, yes, it has an effect, and not only in SQL queries: in virtually all programming languages, expressions combining and / or are subject to either complete evaluation or partial (short-circuit) evaluation.
In an AND expression, if the first operand evaluates to false, the rest need not be checked, since ANDing false with anything yields false.
Similarly, in an OR expression, if the first operand is true, the rest need not be checked.
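A hedged illustration in MySQL (these constant cases are short-circuited away during optimization, but whether an engine short-circuits in general is implementation-dependent):
SELECT 1 FROM DUAL WHERE 1 = 0 AND SLEEP(2) = 0;  -- returns immediately: FALSE AND anything is FALSE
SELECT 1 FROM DUAL WHERE 1 = 1 OR SLEEP(2) = 0;   -- returns immediately: TRUE OR anything is TRUE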

A sophisticated DBMS should be able to decide on its own which WHERE condition to evaluate first. Some databases provide tools to display the "strategy" used to execute a query. In MySQL, for example, you can prefix a query with EXPLAIN. The DBMS then prints the actions it performs when executing the query, such as an index or full-table scan. So you could see at a glance whether or not it uses the index for 'company' in both cases.
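For instance, a minimal sketch against the clients example from the question (the literal value and the existence of an index on company are assumptions):
EXPLAIN SELECT * FROM clients
WHERE company = 42
AND (firstname LIKE '%foo%' OR lastname LIKE '%foo%' OR phone LIKE '%foo%');
-- If the optimizer picks an index on company, the key column of the EXPLAIN
-- output names it, no matter where company = 42 appears in the WHERE clause.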

This shouldn't have any effect, but if you aren't sure, why don't you simply try it out? The order of WHERE clauses in a SELECT from a single table makes no difference, but if you join multiple tables, the order of the joins can affect performance (sometimes).

I don't think the order of the WHERE clause has any impact. I think the MySQL query optimizer will reorganize WHERE clauses as it sees fit so that it filters away the largest subset first.
It's another deal when talking about joins. The optimizer tries to reorder there too, but doesn't always find the best way and sometimes doesn't use indexes. STRAIGHT_JOIN and FORCE INDEX let you take charge of the query.
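Hypothetical usage of those hints (table and index names are assumptions):
SELECT STRAIGHT_JOIN c.*
FROM companies co
JOIN clients c ON c.company = co.id;  -- join order is fixed: companies first
SELECT * FROM clients FORCE INDEX (ix_clients_company)
WHERE company = 42;                   -- index choice is forced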

No it doesn't, the tables required are selected and then evaluated row by row. Order can be arbitrary.

Related

is where (A and B) and where (B and A) in SQL the same? [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
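As a concrete sketch of those options (the index names are assumptions):
CREATE INDEX ix_people_last_first ON PEOPLE (LastName, FirstName);
-- or:
CREATE INDEX ix_people_first_last ON PEOPLE (FirstName, LastName);
-- or single-column indexes:
CREATE INDEX ix_people_last ON PEOPLE (LastName);
CREATE INDEX ix_people_first ON PEOPLE (FirstName);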
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No: all RDBMSs first analyse the query and optimize it, reordering your WHERE clause in the process.
Depending on which RDBMS you are using, you can display the result of that analysis (search for "explain plan" in Oracle, for instance).
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which order is best (which could differ every time), the DBMS would have to run a distinct-count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement:
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL.
I guess you are confusing this with choosing the order of columns when creating an index, where you do have to put the most selective column first, the second most selective next, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is below the cost threshold for parallelism.

MySQL performance of LIKE (without wildcards) vs =

I'm going to start off by saying that I know you can't use indexes for LIKE queries if the value starts with a wildcard. This is NOT a question about that. I'm not using any wildcards.
In an application that allows users to pass wildcards into queries, the value is passed to a query's LIKE clause. I've done some testing and have come to the conclusion that when searching for an exact address (so no wildcards) the query runs slower than when I'm using an =. Take the following 2 queries:
SELECT id FROM users WHERE email LIKE 'user@host.tld'
vs
SELECT id FROM users WHERE email = 'user@host.tld'
Both queries will return the exact same records. When doing EXPLAIN on both, I can see that they are both using the index of the email field. The main difference is that the LIKE query is using a RANGE type, and the = query is using a REF type. Also, the RANGE query is examining some 1000 records where the = query is only examining 1 record (on 2 million records in the table).
The profiles of the query are the same, with the exception that the LIKE query uses significantly more time to process the "sending data" step, where it is actually examining the 1000 records. So basically, the query is slower because it is touching more data.
The thing I don't get is why it is doing that. Since the range query is using the exact same index, and exactly the same set of matches should be returned from the index, why is it examining more rows? This is probably a question about the internals of how a range query uses an index vs. how the ref query does, but I can't seem to find any detailed information about it.
Q: why ... is [MySQL Optimizer] doing that?
A:
The short answer is that the optimizer is not converting the LIKE with no wildcards into an = operation.
MySQL optimizer only uses ref access for = and <=> comparisons.
MySQL optimizer can use range access for a lot more operations including =, <=>, <, <=, >, >=, BETWEEN, ...
A predicate like col LIKE 'foo' is handled as equivalent to
col >= 'foo' AND col <= 'foo'
We look at that and say, that's the same as col = 'foo', but the optimizer doesn't see it that way. The optimizer approach probably makes more sense if we use a wildcard. For example
col LIKE 'foo%bar'
MySQL could use the foo portion for the "range" part of the scan, akin to this:
col >= 'foo' AND col < 'fop'
MySQL optimizer can use an index range scan to satisfy the >= and < comparison.
(I use fop here as a simplistic representation of the lowest "higher-weighted" string in the collating sequence. We don't need to dive into character sets and collating sequences; as a short justification of my use of 'fop' with the latin1_swedish_ci collating sequence, compare the weight strings:
SELECT HEX(WEIGHT_STRING(_latin1'foo' AS CHAR(3))) AS ws_foo
     , HEX(WEIGHT_STRING(_latin1'fop' AS CHAR(3))) AS ws_fop
which differ only in their last byte.)
And for the rows that are found by the index range scan, the rest of the matching can be performed, akin to
SUBSTR(col,4) LIKE '%bar'
I'm not saying that this is exactly how the MySQL optimizer is operating. (I haven't reviewed the source code.)
I'm suggesting that the MySQL optimizer is not treating col LIKE 'foo' the same as col = 'foo', and the primary reason for that is the potential for wildcard characters.
If we want col = 'foo' performance, we should write col = 'foo'.
We pay a price for a range scan when we opt for the flexibility of the LIKE comparison.
And we pay an even higher price (a full index scan, index operation in the EXPLAIN output), when we use a regular expression col REGEXP '^foo$'.
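To see the difference concretely, a sketch of the comparison (the row counts are assumptions based on the figures reported in the question):
EXPLAIN SELECT id FROM users WHERE email = 'user@host.tld';
-- type: ref,   rows: 1      (exact lookup on the email index)
EXPLAIN SELECT id FROM users WHERE email LIKE 'user@host.tld';
-- type: range, rows: ~1000  (range scan over the same index)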
EDIT
Even with the difference shown in the EXPLAIN plan, I wouldn't expect any measurable difference in performance of these two statements:
SELECT SQL_NO_CACHE id FROM users WHERE email LIKE 'user@host.tld'
SELECT SQL_NO_CACHE id FROM users WHERE email = 'user@host.tld'
For evaluating performance, I would run the statements four (or more) times in a row, capturing the execution time of each statement run, and throw out the result from the first run. Average the execution time of the runs except for the first. (We'd expect the execution times of the subsequent runs to be very close to each other.)
Note that other concurrent operations on the database could impact the performance of the statement we're measuring.
The Optimizer...
(effectively) turns a LIKE without any wild cards into =.
turns IN (one-item) into =.
turns LIKE with a trailing % (as the only wildcard) into a range test.
cannot optimize LIKE in most other situations involving wildcards.
These optimizations are useless without a relevant INDEX.
sending data is a useless metric.
Running a query the first time may have to load stuff from disk; the second time it will find stuff cached in RAM, hence be much faster.
EXPLAIN's "Rows" is an estimate; don't jump to any conclusions if the value varies by less than a factor of 2.
An = drills down the BTree to find the first matching row. Then it scans forward to find any more matching rows.
Ditto for a "range" (BETWEEN or LIKE 'foo%' or ...) -- drill down to find the first (or last) item in the range, then scan forward (or backward). Backward scanning happens if the Optimizer can use ORDER BY .. DESC at the same time.
(spencer7593's Answer goes into more detail.)

Order of WHERE predicates and the SQL optimizer

When writing SQL queries with various WHERE clauses (I only work with MySQL and SQLite), I am often unsure whether I should reorder the query's clauses to put the "best" ones first (those which will remove the biggest number of rows) and the merely "cosmetic" clauses later (those which barely change the output). In other words: will I really help the optimizer run faster by reordering clauses (especially when there are indexes in play), or is this another case of premature optimization? Optimizers are usually smarter than me.
For example:
select address.* from address inner join
user on address.user = user.id
where address.zip is not null and address.country == user.country
If we know that address.zip is usually not null, that check will be true 90% of the time, and if the query order is respected there will be a lot of pointless checks which could be avoided by placing the country check first.
Should I take care of that? In other words, does the order of WHERE clauses matter or not?
The MySQL optimizer is well documented and you can find many interesting considerations in the official documentation: http://dev.mysql.com/doc/refman/5.7/en/where-optimizations.html
One very simple fact should be taken into account: SQL is not a procedural language but a declarative one. This means the order in which the parts are written is not important; what matters is only which elements are declared. This is evident in the MySQL optimization documentation, where the focus is only on the components of a query and how they are transformed by the optimizer into internal components.
The order is mostly irrelevant.
In MySQL, with WHERE ... AND ...,
The Optimizer will first look for which part can use an index. If one can and one can't, the optimizer will use the index; the order becomes irrelevant
If both sides of the AND can use an index, MySQL will usually pick the 'better' one. (Sometimes it goofs.) Again, the order is ignored.
If neither side can use an index, the condition is evaluated left to right. But... fetching rows is the bulk of the effort in performing the query, so if one side of the AND is a little slower than the other, you probably won't notice. (Sure, if one side does SLEEP(3) you will notice.)
There's another issue in your example query (aside from the syntax error): The Optimizer will make a conscious decision of which table to start with.
If it decides to start with user, address needs INDEX(user, country) in either order.
If it decides to start with address, user needs INDEX(id, country) in either order.
It is unclear whether the Optimizer will bother with the NOT NULL test, even if that column is indexed.
Bottom line: Spend your time focusing on optimal indexes.
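Following that advice, a minimal sketch of the two indexing options mentioned above (the index names are assumptions):
CREATE INDEX ix_address_user_country ON address (user, country);  -- if the optimizer starts with user
CREATE INDEX ix_user_id_country ON user (id, country);            -- if the optimizer starts with address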
The answer is definitely maybe.
Mysterious are the ways of the optimizer.
Here is a demonstration based on exception caused due to division by zero.
create table t (i int);
insert into t (i) values (0);
The following query succeeds for Oracle, SQL Server, Postgres and Teradata (we'll skip the version information for now):
select 1 from t where i < 1 or 1/i < 1;
The following query fails for SQL Server and Postgres but succeeds for Oracle and Teradata
select 1 from t where 1/i < 1 or i < 1;
However, the following query does fail for Oracle and Teradata:
select 1 from t where 1/i < 1 or i/1 < 1;
What do we learn?
That some optimizers seem to respect the order of the predicates (or at least to some extent) and some seem to reorder the predicates by their estimated cost (e.g. 1/i < 1 is more costly than i < 1, but not more costly than i/1 < 1).
For those that respect the order of the predicates, we can probably improve performance by putting the lightweight predicates first for OR operators and the frequently false predicates first for AND operators.
That being said, since databases do not guarantee to preserve the order of the predicates, even if some of them currently seem to do so, you definitely can't count on it.
MySQL 5.7.11
This query returns immediately:
select 1 from t where i < 1 or sleep(3);
This query returns after 3 seconds:
select 1 from t where sleep(3) or i < 1;

What difference does it make which column SQL COUNT() is run on?

Firstly, this is not asking In SQL, what's the difference between count(column) and count(*)?.
Say I have a users table with a primary key user_id and another field logged_in which describes if the user is logged in right now.
Is there a difference between running
SELECT COUNT(user_id) FROM users WHERE logged_in=1
and
SELECT COUNT(logged_in) FROM users WHERE logged_in=1
to see how many users are marked as logged in? Maybe a difference with indexes?
I'm running MySQL if there are DB-specific nuances to this.
In MySQL, the count function will not count null expressions, so the results of your two queries may be different. As mentioned in the comments and Remus' answer, this is as a general rule for SQL and part of the spec.
For example, consider this data:
user_id  logged_in
1        1
null     1
SELECT COUNT(user_id) on this table will return 1, but SELECT COUNT(logged_in) will return 2.
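A minimal sketch reproducing this (the table name is an assumption; user_id is left nullable purely for illustration, since a real primary key cannot be NULL):
CREATE TABLE users_demo (user_id INT NULL, logged_in INT);
INSERT INTO users_demo VALUES (1, 1), (NULL, 1);
SELECT COUNT(user_id) FROM users_demo WHERE logged_in = 1;   -- returns 1
SELECT COUNT(logged_in) FROM users_demo WHERE logged_in = 1; -- returns 2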
As a practical matter, the results from the example in the question ought to always be the same, as long as the table is properly constructed, but the utilized indexes and query plans may differ, even though the results will be the same. Additionally, if that's a simplified example, counting on different columns may change the results as well.
See also this question: MySQL COUNT() and nulls
For the record: the two queries return different results. As the spec says:
Returns a count of the number of non-NULL values of expr in the rows
retrieved by a SELECT statement.
You may argue that given the condition logged_in=1 the NULL logged_in rows are filtered out anyway, and that user_id will not have NULLs in a users table. While this may be true, it does not change the fundamental fact that the queries are different. You are asking the query optimizer to make all the logical deductions above; they may be obvious to you, but maybe not to the optimizer.
Now, assuming that the results are in practice always identical between the two, the answer is simple: don't run such a query in production (and I mean either of them). It's a scan, no matter how you slice it. logged_in has too low a cardinality to matter. Keep a counter, update it at each log-in and each log-out event. It will drift over time; refresh it as often as needed (once a day, once an hour).
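A sketch of that counter approach (table and column names are assumptions):
CREATE TABLE login_stats (logged_in_count INT NOT NULL DEFAULT 0);
INSERT INTO login_stats VALUES (0);
-- at each log-in event:
UPDATE login_stats SET logged_in_count = logged_in_count + 1;
-- at each log-out event:
UPDATE login_stats SET logged_in_count = logged_in_count - 1;
-- periodic refresh to correct drift:
UPDATE login_stats
SET logged_in_count = (SELECT COUNT(*) FROM users WHERE logged_in = 1);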
As for the question itself: SELECT COUNT(somefield) FROM sometable can use a narrow index on somefield, resulting in less IO. The recommendation is to use * because this leaves room for the optimizer to use any index it sees fit (this will vary from product to product though, depending on how smart a query optimizer we are dealing with, YMMV). But as you start adding WHERE clauses, the possible alternatives (= indexes to use) quickly vanish.

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used but whatever I pass to explain, it is able to optimize the where clause to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell mysql to not do constant optimizations in explain?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that results from constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called) this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN on it for the full matrix of exist/non-exist cases. Really that isn't a very good solution as it would be both hard and error prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
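A minimal sketch of both behaviors (table and column names are assumptions; the "Impossible WHERE" message appears for const access, which requires a unique index over a NOT NULL column):
CREATE TABLE tbl (col INT NOT NULL, UNIQUE KEY ux_tbl_col (col));
INSERT INTO tbl (col) VALUES (1), (2);
EXPLAIN SELECT col FROM tbl WHERE col = 99;
-- Extra: Impossible WHERE noticed after reading const tables
EXPLAIN SELECT col FROM tbl WHERE col = col;
-- the predicate is not folded away, so a real plan is shown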
For example say my client side code is generating a query with a number in it's where clause.
Some times the number will result in an impossible where clause other times it won't.
How can I use explain to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read a list of when the MySQL optimizer does what:
Action                                       When
Query parse                                  PREPARE
Negation elimination                         PREPARE
Subquery re-writes                           PREPARE
Nested JOIN simplification                   First EXECUTE
OUTER->INNER JOIN conversions                First EXECUTE
Partition pruning                            Every EXECUTE
COUNT/MIN/MAX elimination                    Every EXECUTE
Constant subexpression removal               Every EXECUTE
Equality propagation                         Every EXECUTE
Constant table detection                     Every EXECUTE
ref access analysis                          Every EXECUTE
range/index_merge analysis and optimization  Every EXECUTE
Join optimization                            Every EXECUTE
There is one more thing missing from this list.
MySQL can rebuild a query plan on every JOIN iteration: the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
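When MySQL makes this per-record decision, EXPLAIN reports it in the Extra column; a sketch of what that looks like (the index-map bitmask is an assumption, since it depends on which indexes exist):
-- Extra column of the EXPLAIN row for t2:
-- Range checked for each record (index map: 0x1)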
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes being chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all the IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only within the query, but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on combination of columns if you always query the given columns together). If you have indexes, the query planner will potentially use them.
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (for example, asking for a numeric column to equal a text value)
PS. In general, creating the query plan is not expensive, and it is better to re-create plans than to re-use them, since the conditions might have changed since a plan was generated and a better plan might exist.