I'm interested in how the WHERE condition is evaluated. If I write:
Select * from table_name
where insert_date > '2010-01-03'
and text like '%friend%';
is it different from:
Select * from table_name
where text like '%friend%'
and insert_date > '2010-01-03';
I mean, if the table is very big and MySQL first takes the records that satisfy " where insert_date > '2010-01-03' " and only then searches those records for the word "friend", it can be much faster than first searching for "friend" rows and then checking the date field.
Is it important to write the WHERE condition smartly, or does MySQL analyze the condition and rewrite it in the best way?
thanks
No, the two where clauses should be equivalent. The optimizer should pick the same index whichever you use.
The order of columns in an index does matter though.
If you think the optimizer is using the wrong index, you could give it a hint. More often than not though, there's a good reason for using the index it has chosen to use, so unless you know exactly what you are doing, giving the optimizer hints will often make things worse not better.
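A minimal sketch of why index column order matters, using Python's built-in sqlite3 module as a stand-in (SQLite is an assumption here; MySQL's optimizer and EXPLAIN output differ in detail). The people table and the idx_last_first index are hypothetical names. A composite index can be searched by its leading column, but a filter on only the trailing column typically falls back to a scan:

```python
import sqlite3


def plan(conn, sql):
    """Return the detail strings from SQLite's EXPLAIN QUERY PLAN for sql."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, city TEXT)"
)
# Hypothetical composite index: last_name is the leading column.
conn.execute("CREATE INDEX idx_last_first ON people (last_name, first_name)")

# Filter on the leading column: the index can be searched.
leading = plan(conn, "SELECT * FROM people WHERE last_name = 'L'")
# Filter on the trailing column only: the index cannot be searched directly.
trailing = plan(conn, "SELECT * FROM people WHERE first_name = 'F'")

print(leading)
print(trailing)
```

The first plan should report a SEARCH using idx_last_first, and the second a plain SCAN of the table.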
I don't know about MySQL in particular, but typically this kind of optimization is left to the database engine, as which order is faster depends on indexes, cardinality of data, and quantity of data among other things.
I think it's true that both WHERE clauses are equivalent at the database abstraction level.
By definition, a logical conjunction (the AND operator) is commutative. This means that WHERE A AND B is equal to WHERE B AND A.
It makes no difference in which order you write your conditions.
However, what makes a difference is what indexes you have in place on your table. The query analyzer takes these into account. It is also smart enough to find the part of the condition that is easiest to check and apply that one first.
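One way to check this yourself is to compare the plans for both orderings. The sketch below uses Python's sqlite3 module as a stand-in (an assumption; in MySQL you would prefix each query with EXPLAIN instead), with a hypothetical index idx_insert_date on the date column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (id INTEGER PRIMARY KEY, insert_date TEXT, text TEXT)"
)
# Hypothetical index on the date column.
conn.execute("CREATE INDEX idx_insert_date ON table_name (insert_date)")
conn.executemany(
    "INSERT INTO table_name (insert_date, text) VALUES (?, ?)",
    [
        ("2009-12-31", "old friend"),
        ("2010-02-01", "my friend"),
        ("2010-03-01", "nobody"),
    ],
)

date_first = (
    "SELECT * FROM table_name "
    "WHERE insert_date > '2010-01-03' AND text LIKE '%friend%'"
)
like_first = (
    "SELECT * FROM table_name "
    "WHERE text LIKE '%friend%' AND insert_date > '2010-01-03'"
)


def plan(sql):
    """Return the detail strings from EXPLAIN QUERY PLAN for sql."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_date_first = plan(date_first)
plan_like_first = plan(like_first)
rows_date_first = sorted(conn.execute(date_first).fetchall())
rows_like_first = sorted(conn.execute(like_first).fetchall())

# Both orderings should yield the same plan and the same rows.
print(plan_date_first == plan_like_first)
print(rows_date_first == rows_like_first)
```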
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
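As a rough illustration of checking the plan, here is a sketch using Python's sqlite3 module rather than SQL Server (an assumption made so the example is self-contained; in SQL Server you would inspect the execution plan instead). The Name column is hypothetical; the index name is the one assumed above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER, Name TEXT)"
)
conn.execute("CREATE INDEX Index_EmployeeTypeId ON Employee (EmployeeTypeId)")

# Ask the engine how it intends to run the IN query.
in_plan = [
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT EmployeeId FROM Employee WHERE EmployeeTypeId IN (1, 2, 3)"
    )
]
for line in in_plan:
    print(line)

# True when the plan mentions the index on EmployeeTypeId.
uses_index = any("Index_EmployeeTypeId" in line for line in in_plan)
print(uses_index)
```

SQLite has no table statistics here, so this only shows that the index is eligible for an IN list. It does not show the selectivity-based switch to a table scan described above.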
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
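The logical equivalence between IN and a chain of OR-ed equalities is easy to confirm. A small sketch with Python's sqlite3 module (a stand-in engine, with made-up sample rows) compares the two result sets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER)"
)
# Made-up sample data: EmployeeId values 1..7 are assigned automatically.
conn.executemany(
    "INSERT INTO Employee (EmployeeTypeId) VALUES (?)",
    [(t,) for t in (1, 2, 3, 4, 5, 1, 4)],
)

in_rows = conn.execute(
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId IN (1, 2, 3) ORDER BY EmployeeId"
).fetchall()

or_rows = conn.execute(
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId = 1 OR EmployeeTypeId = 2 OR EmployeeTypeId = 3 "
    "ORDER BY EmployeeId"
).fetchall()

# Both forms should select exactly the same employees.
print(in_rows == or_rows)
```

This only shows that the two forms return the same rows; whether a given engine executes them the same way is a separate question for its execution plan.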
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
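To make the "restate it as a join" advice concrete, here is a sketch using Python's sqlite3 module. The EmployeeType table and its IsSalaried column are hypothetical, invented only for illustration. The two forms are equivalent here because EmployeeTypeId is unique in EmployeeType, so the join cannot duplicate employees:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical lookup table; IsSalaried is an invented flag.
conn.execute(
    "CREATE TABLE EmployeeType (EmployeeTypeId INTEGER PRIMARY KEY, IsSalaried INTEGER)"
)
conn.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER)"
)
conn.executemany("INSERT INTO EmployeeType VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
conn.executemany(
    "INSERT INTO Employee (EmployeeTypeId) VALUES (?)",
    [(t,) for t in (1, 2, 3, 2, 1)],
)

# Form 1: a subquery inside the IN clause.
in_rows = conn.execute(
    "SELECT e.EmployeeId FROM Employee e "
    "WHERE e.EmployeeTypeId IN ("
    "  SELECT t.EmployeeTypeId FROM EmployeeType t WHERE t.IsSalaried = 1"
    ") ORDER BY e.EmployeeId"
).fetchall()

# Form 2: the same requirement restated as a join.
join_rows = conn.execute(
    "SELECT e.EmployeeId FROM Employee e "
    "JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId "
    "WHERE t.IsSalaried = 1 ORDER BY e.EmployeeId"
).fetchall()

print(in_rows == join_rows)
```

If the joined column were not unique, the join could return duplicate employees and would need DISTINCT or a different formulation.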
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used depends less on the type of query than on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee WITH (INDEX(EmployeeTypeId))
This query will search using the index you have created. It works for me. Please give it a try.
Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of conditions in the WHERE clause.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No, every RDBMS starts by analysing the query and optimizes it, reordering the conditions in your WHERE clause as needed.
Depending on which RDBMS you are using, you can display the result of that analysis (search for EXPLAIN PLAN in Oracle, for instance).
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong though. To find out which order is best, which could differ every time, the DBMS would have to run a distinct count query for each column and compare the numbers. That would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first sql.
I guess you are confusing this with choosing the order of columns when creating an index, where you have to put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the cost threshold for parallelism.
I have a MySQL table with ~17M rows where I end up doing a lot of aggregation queries.
For this example let's say I have index_on_b, index_on_c, compound_index_on_a_b, compound_index_on_a_c
I run EXPLAIN on a query:
EXPLAIN SELECT SUM(revenue) FROM table WHERE a = some_value AND b = other_value
And I find that the selected index is index_on_b, but when I use a query hint
SELECT SUM(revenue) FROM table USE INDEX(compound_index_on_a_b)
The query runs way way faster. Is there anything I can do in MySQL config to make MySQL choose the compound indexes first?
There are 2 possible routes you can take:
A) When the optimizer considers all candidate indexes equal, it resolves the tie based on the order in which the indexes were created. You could drop index_on_b and recreate it, then check whether the optimizer had simply judged the indexes to be the same.
Or
B) Use optimizer_search_depth (see https://mariadb.com/blog/setting-optimizer-search-depth-mysql). By altering this parameter you determine how much effort the optimizer is allowed to spend on a query plan, and it might come up with the much better solution of using the combined index.
A possible explanation:
If a has the same value throughout the table, then INDEX(b) is actually better than INDEX(a,b). This is because the former is smaller, hence faster to work with. Note that both will return the same number of rows, even without further checking of a.
Please provide:
SHOW CREATE TABLE
SHOW INDEXES -- to see cardinality
EXPLAIN SELECT
I have a table with 510,085 rows, which is now pushing me to seek higher performance. One of the fields in this table is called 'photoStatus'.
In 'photoStatus', 510,045 rows contain the word 'Active' and the remaining 40 contain the word 'Suspended'.
Which of these two queries would be faster to search for 'Active' photos or doesn't it matter?
WHERE photoStatus = 'Active'
Or
WHERE photoStatus <> 'Suspended'
Obviously this is part of a massive query, it's not just one WHERE condition.
Database is MySQL (MyISAM)
Why not convert the column to a boolean or a numeric value? That would be much faster than a string compare, and then you could just do:
....
WHERE isActive;
If you have an index on that column WHERE photoStatus = 'Active' will be faster since the server can just scan the range in the index matching Active.
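A sketch of the difference, using Python's sqlite3 module as a stand-in for MySQL (an assumption; the photos table, the title column and the idx_photoStatus index are hypothetical, and the data is scaled down to ten rows). An equality predicate can be answered by an index search, while a "not equal" predicate generally cannot:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, photoStatus TEXT, title TEXT)"
)
conn.execute("CREATE INDEX idx_photoStatus ON photos (photoStatus)")
# Scaled-down data: 8 active rows and 2 suspended rows.
conn.executemany(
    "INSERT INTO photos (photoStatus, title) VALUES (?, ?)",
    [("Active", "a")] * 8 + [("Suspended", "s")] * 2,
)


def plan(sql):
    """Return the detail strings from EXPLAIN QUERY PLAN for sql."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


eq_plan = plan("SELECT * FROM photos WHERE photoStatus = 'Active'")
ne_plan = plan("SELECT * FROM photos WHERE photoStatus <> 'Suspended'")

eq_count = conn.execute(
    "SELECT COUNT(*) FROM photos WHERE photoStatus = 'Active'"
).fetchone()[0]
ne_count = conn.execute(
    "SELECT COUNT(*) FROM photos WHERE photoStatus <> 'Suspended'"
).fetchone()[0]

print(eq_plan)
print(ne_plan)
print(eq_count, ne_count)
```

SQLite has no statistics in this sketch, so it picks the index for the equality. A cost-based optimizer such as MySQL's may still prefer a full scan when nearly every row matches, as with 510,045 of 510,085 rows.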
The second will be "a little" faster, because it does not need to compare the whole string: given how the database's comparison algorithm works, comparing the first character is enough to decide whether to include a row.
Be sure to use an index on that field. And EXPLAIN the query to see how efficient your query is.
Other than that your query would filter out just 40, so the rest of the query has to be efficient.
It's going to have to index the table and rows either way.
Personally I would always match. Use equals.
WHERE photoStatus = 'Active'
I would always use int or boolean, better than matching a string..
A normal index won't help in this scenario since the percentage of rows actually returned is too large.
So the database will have to look at each row. There might be some difference, depending on how fast an equal vs. not-equal comparison is, but that should be negligible.
So I expect the result to be pretty much of same speed.
You have posted too few details to find a shortcut for your query.
As it appears you need a full scan. In this case you can try to read the table in parallel.
Don't know what DBMS you use, but in Oracle you can use a hint: select /*+ parallel(yourtable 8) */ * from yourtable
What are you trying to do with this data? What types of queries are slow? Can you give an example? There are many tricks you can use and many mistakes you can make. And not all queries need to be fast. If they are for a UI, they must respond in under 1 second. But an admin task may take 1 minute :)
WHERE photoStatus = 'Active' is better if you have an index on that column, based on a small test similar to your example.
I ran the queries in SQL Server. The shorter one belongs to the equality comparison, and it reports better performance. If you don't have an index, the query cost is similar.
Firstly, .5M rows is not a large table - by ANY means.
A column like "Active" / "Inactive" is likely to be pretty useless as an index by itself, because it doesn't have enough selectivity to make an index scan beneficial (in fact, if it's 50% of the rows in the table, a table scan would probably be better).
I suspect that in fact, "Active" has nothing to do with your problem - after all, you're not trying to return .5M rows to the client are you?
A query which returns .5M rows is not going to be fast, because just returning the rows takes a (relatively) long time.
Anyway my answer: It makes no difference, you need to check the other parts of your query. Post a question with the full query, table structure and explain output.
Suppose I have a MySQL query with two conditions:
SELECT * FROM `table` WHERE `field_1` = 1 AND `field_2` LIKE '%term%';
The first condition is obviously going to be a lot cheaper than the second, so I'd like to be sure that it runs first, limiting the pool of rows which will be compared with the LIKE clause. Do MySQL query conditions run in the order they're listed or, if not, is there a way to specify order?
The optimiser will evaluate the WHERE conditions in the order it sees fit.
SQL is declarative: you tell the optimiser what you want, not how to do it.
In a procedural/imperative language (.net, Java, php etc), you say how, and you would choose which condition is evaluated first.
Note: "left to right" does apply in certain expressions like (a+b)*c as you'd expect
MySQL has an internal query optimizer that takes care of such things in most cases. So, typically, you don't need to worry about it.
But, of course, the query optimizer is not foolproof. So...
Sorry to do this to you, but you'll want to get familiar with EXPLAIN if you suspect that a query may be running less efficiently than it should.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
If you have doubts about MySQL usage of index, you can suggest what index should be used.
http://dev.mysql.com/doc/refman/5.1/en/index-hints.html