Drawback of using multiple conditions in where clause - mysql

As we know, the WHERE clause of a SQL query is evaluated before the SELECT clause, and we put conditions in the WHERE clause to filter the result according to our requirements.
While writing some queries, I wondered whether there are any drawbacks to using multiple conditions in the WHERE clause, and in what order those conditions are applied to filter rows from the selected table.
For example, we have a table
Building(name,height,owner,builder_name,age)
and we have a query:
select * from Building
where height between X and Y and age between A and B
How will this query execute, and in what order are the conditions applied, i.e.
X<=height<=Y and A<=age<=B
Will it first search all records for a height within the given range, and then do the same for age?

The database server has several options for executing that query, and it will choose the one it estimates to be fastest.
The options I see are:
Scan the whole table and filter out rows that don't satisfy the where clause
Seek a height range on an index on height column, then filter out rows using age between A and B predicate.
Seek an age range on an index on age column, then filter out rows using height between X and Y predicate
Seek both indexes, then perform an index intersection
The database server does not always use an index that might be applicable. It considers some things before using it, such as:
The index selectivity.
The index coverage.
Highly selective indexes are more likely to be used.
Covering indexes are also likely to be used.

The query planner will try to find the optimal way to search the table and test the WHERE clause. It will start by trying to use an index if possible, which will narrow down the rows that it needs to search. If there are multiple potential indexes, it will try to use the one that it estimates will narrow it down best.
Then it will scan all these rows, and test each of them against all of the remaining conditions. It should never need to make multiple passes over the entire table.
If you want to see how a particular query will be executed, use the EXPLAIN command.
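As a minimal sketch of inspecting a plan, the snippet below uses Python's built-in sqlite3 module as a stand-in for MySQL (SQLite spells the command EXPLAIN QUERY PLAN, and its planner makes different choices). The sample rows, range bounds, and index names are assumptions made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; the planner differs, but the idea of
# asking the engine how it will run a query is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Building (name TEXT, height INTEGER, owner TEXT, "
    "builder_name TEXT, age INTEGER)"
)
# Hypothetical sample rows and index names, for illustration only.
conn.executemany(
    "INSERT INTO Building VALUES (?, ?, ?, ?, ?)",
    [(f"b{i}", 10 + i, "owner", "builder", i % 50) for i in range(200)],
)
conn.execute("CREATE INDEX idx_building_height ON Building (height)")
conn.execute("CREATE INDEX idx_building_age ON Building (age)")

query = (
    "SELECT name FROM Building "
    "WHERE height BETWEEN ? AND ? AND age BETWEEN ? AND ?"
)
params = (50, 100, 5, 20)

# The last column of each plan row is a readable description of one step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
for step in plan:
    print(step)

rows = conn.execute(query, params).fetchall()
print(len(rows), "rows satisfy both conditions")
```

The plan lists a single access step for the table; whichever index the engine picks, both range conditions are checked while that one set of candidate rows is read.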

Related

Can I use index in MySQL in this way? [duplicate]

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used depends less on the type of query than on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, a table scan may be faster.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
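To make "check the plan" concrete for the IN query, here is a small sketch using Python's sqlite3 module as a stand-in engine. SQL Server and MySQL have their own plan tools and may decide differently, and the row counts and type ids below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER)"
)
# Hypothetical data: 1,000 employees spread evenly over 100 type ids.
conn.executemany(
    "INSERT INTO Employee (EmployeeTypeId) VALUES (?)",
    [(i % 100,) for i in range(1000)],
)
conn.execute("CREATE INDEX Index_EmployeeTypeId ON Employee (EmployeeTypeId)")

query = "SELECT EmployeeId FROM Employee WHERE EmployeeTypeId IN (1, 2, 3)"

# Ask the engine how it intends to run the IN query.
in_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(in_plan)

matching_ids = [r[0] for r in conn.execute(query)]
print(len(matching_ids), "of 1000 rows match")
```

With only 3% of the rows matching, this engine searches the index. The answers above suggest that a much less selective list could tip a cost-based optimizer toward a table scan instead.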
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (Index_EmployeeTypeId) WHERE EmployeeTypeId IN (1,2,3)
In MySQL, this hint tells the optimizer to consider the named index (here assumed to be Index_EmployeeTypeId). It works for me; please give it a try.

Is a MySQL SELECT query faster with as much as possible keys in a table?

I was wondering if there is a difference in performance between a query that SELECTs on 4 keys and one that SELECTs on 5 keys.
If yes, why is this? What tests can I run for this?
For example
SELECT name FROM table where country='ES' AND page='ABC' AND color='BLUE' AND hobby='PAINTING' AND status='SINGLE'
versus
SELECT name FROM table where country='ES' AND page='ABC' AND color='BLUE' AND hobby='PAINTING'
Other things being equal testing for 5 columns (using AND) should be marginally slower. For example:
If both queries perform a full index scan (assuming there's no secondary index suitable for the query), then testing the condition with five columns could be a bit slower, but probably not really perceptible. This is because the engine would need to evaluate the extra predicate on all the rows of the table/primary index.
On the flip side, if both queries use an index (primary or secondary) then the performance could be almost identical, since it would require pretty much a single index seek.
In general, I would suggest the following in addition to indexing (you must of course have an index).
Because the conditions are joined with AND, a row is rejected as soon as one condition fails: with five ANDed conditions, if the first is not satisfied, the remaining ones need not be evaluated for that row.
So apart from indexing, check which conditions keep the query away from a full table scan. In your case, if status can only be single/married, it filters out few rows and should come last among the AND conditions.
color='BLUE': if only a few rows match a given color, put this condition first, because it filters the rows quickly and the remaining conditions are then checked against fewer rows.
In other words, place the condition that matches the fewest rows first. The optimizer may reorder conditions on its own, so verify the effect rather than assuming it.
Coming back to your query, you can use a composite index. But having 5 columns in the composite index will affect performance; expect it to be a bit slower.
The EXPLAIN plan will help best in your case.
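A rough way to run that comparison is sketched below with Python's sqlite3 module standing in for MySQL. The table name profiles, the sample rows, and the index name are hypothetical; the four-column composite index is just one possible design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "profiles" is a hypothetical stand-in for the table in the question.
conn.execute(
    "CREATE TABLE profiles (name TEXT, country TEXT, page TEXT, "
    "color TEXT, hobby TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("ana", "ES", "ABC", "BLUE", "PAINTING", "SINGLE"),
        ("luis", "ES", "ABC", "BLUE", "PAINTING", "MARRIED"),
        ("eva", "ES", "XYZ", "RED", "CHESS", "SINGLE"),
    ],
)
# Composite index over the four columns both queries filter on.
conn.execute(
    "CREATE INDEX idx_profiles_filter ON profiles (country, page, color, hobby)"
)

four = (
    "SELECT name FROM profiles WHERE country='ES' AND page='ABC' "
    "AND color='BLUE' AND hobby='PAINTING'"
)
five = four + " AND status='SINGLE'"


def plan_of(sql):
    """Return the readable plan steps the engine reports for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_four = plan_of(four)
plan_five = plan_of(five)
names_four = sorted(r[0] for r in conn.execute(four))
names_five = sorted(r[0] for r in conn.execute(five))
print(plan_four, names_four)
print(plan_five, names_five)
```

Both statements are answered through the same index seek here. The fifth condition is only an extra check on the rows that seek returns, which matches the "almost identical" expectation in the first answer.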

Queries executing in orders [duplicate]

My thinking is that if I put my ANDs that filter out a greater number of rows before those that filter out just a few, my query should run quicker since that selection set is much smaller between And statements.
But does the order of the ANDs in the WHERE clause of an SQL statement really affect performance that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
Most SQL engines are optimized to do this work for you. However, I have found situations in which trying to carve down the largest table first can make a big difference, and it doesn't hurt!
A lot depends how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter what order the conditions are listed.
If none of those scenarios applies (no usable index exists), the order in which the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query, the time spent fetching will likely dwarf the time spent evaluating the conditions.
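The "index on one condition but not the other" case can be checked with a small experiment. The sketch below uses Python's sqlite3 module as a stand-in engine, with a hypothetical orders table; other optimizers may behave differently, so treat it as a template for testing your own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: an index exists on customer but not on status.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, status) VALUES (?, ?)",
    [(i % 20, "open" if i % 3 == 0 else "closed") for i in range(300)],
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Same two conditions, written in opposite orders.
indexed_first = "SELECT id FROM orders WHERE customer = 7 AND status = 'open'"
indexed_last = "SELECT id FROM orders WHERE status = 'open' AND customer = 7"


def plan_of(sql):
    """Return the readable plan steps the engine reports for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_a = plan_of(indexed_first)
plan_b = plan_of(indexed_last)
ids_a = sorted(r[0] for r in conn.execute(indexed_first))
ids_b = sorted(r[0] for r in conn.execute(indexed_last))
print(plan_a == plan_b, ids_a == ids_b)
```

Both spellings produce the same plan and the same rows in this engine: the indexed condition drives the search regardless of where it appears in the WHERE clause.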

The process order of SQL order by, group by, distinct and aggregation function?

Query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT, and the aggregate function?
Different orders may produce the same result but perform differently. I need to merge multiple result sets; I have the SQL and have parsed it, so I want to know the order that standard SQL follows.
This is bigger than just group by/aggregation/order by. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well-designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the WHERE clause, in conjunction with indexes, to determine how much space from each table will be needed, and how much work it will take to extract the portions of each table that you need (how well the query matches up with your indexes or the table as stored on disk).
4. Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether or not a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, which is then ordered by the columns in the GROUP BY clause and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
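For the query in the question, the logical stages can be written out by hand and compared with what an engine returns. This is only a sketch of the logical order (grouping and aggregation, then DISTINCT, then ORDER BY), not of how any engine physically executes it. The sample rows are invented, and Python's sqlite3 module serves as a convenient reference engine.

```python
import sqlite3

# Hypothetical (area, age) rows standing in for T_USER.
users = [("north", 31), ("south", 45), ("north", 52), ("east", 28), ("south", 39)]


def max_age_per_area(rows):
    """Apply the logical stages by hand, one after another."""
    grouped = {}
    for area, age in rows:
        # GROUP BY area, updating max(age) as each record is discovered.
        grouped[area] = max(age, grouped.get(area, age))
    # SELECT DISTINCT max(age), area: project, then drop duplicate rows.
    projected = {(age, area) for area, age in grouped.items()}
    # ORDER BY area is applied last.
    return sorted(projected, key=lambda row: row[1])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_USER (area TEXT, age INTEGER)")
conn.executemany("INSERT INTO T_USER VALUES (?, ?)", users)
sql_result = conn.execute(
    "SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area"
).fetchall()

manual_result = max_age_per_area(users)
print(manual_result)
print(sql_result)
```

Because GROUP BY area already yields one row per area, the DISTINCT stage removes nothing in this particular query; it is kept in the sketch only to show where it sits in the sequence.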

MySQL: Optimize query with DISTINCT

In my Java application I have found a small performance issue, which is caused by this simple query:
SELECT DISTINCT a
FROM table
WHERE checked = 0
LIMIT 10000
I have an index on the checked column.
In the beginning, the query is very fast (i.e. when almost all rows have checked = 0). But as I mark more and more rows as checked, the query becomes very inefficient (taking up to several minutes).
How can I improve the performance of this query? Should I add a composite index on
a, checked
or rather
checked, a?
My table has many millions of rows, which is why I do not want to test this manually and hope for a lucky guess.
I would add an index on (checked, a). This means that the value you're returning has already been found in the index and there's no need to re-access the table to find it. Secondly, if you're doing lots of individual updates of the table, there's a good chance both the table and the index have become fragmented on disk. Rebuilding (compacting) a table and index can significantly increase performance.
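The "no need to re-access the table" effect can be observed in a query plan. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (where EXPLAIN would show "Using index" in the Extra column for a covering index). The table name items, the data, and the index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" is a hypothetical name for the table in the question.
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, a INTEGER, checked INTEGER)"
)
conn.executemany(
    "INSERT INTO items (a, checked) VALUES (?, ?)",
    [(i % 10, 0 if i % 4 == 0 else 1) for i in range(400)],
)
# Composite index with the filter column first and the returned column second.
conn.execute("CREATE INDEX idx_checked_a ON items (checked, a)")

query = "SELECT DISTINCT a FROM items WHERE checked = 0 LIMIT 10000"

distinct_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(distinct_plan)

distinct_values = sorted(r[0] for r in conn.execute(query))
print(distinct_values)
```

The plan reports a covering index: both the filter on checked and the values of a come straight from the index entries, so the table rows themselves are never read.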
You can also use the query rewritten as (just in case the optimizer does not understand that it's equivalent):
SELECT a
FROM table
WHERE checked = 0
GROUP BY a
LIMIT 10000
Add an index on the DISTINCT column (a in this case). MySQL is able to use this index for the DISTINCT.
MySQL may also take advantage of a compound index on (a, checked) (the order matters: the DISTINCT column has to be at the start of the index). Try both and compare the results with your data and your queries.
(After adding this index you should see Using index for group-by in the EXPLAIN output.)
See GROUP BY optimization in the manual. (A DISTINCT is very similar to a GROUP BY.)
The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions.
My table has a lot of millions of rows <...> where almost all rows have checked=0
In this case it seems that the best index would be a simple (a).
UPDATE:
It was not clear how many rows get checked. From your comment below the question:
At the beginning 0 is in 100% rows, but at the end of the day it will become 0%
This changes everything. So @Ben has the correct answer.
I have found a completely different solution which would do the trick. I will simply create a new table with all possible unique "a" values. This will allow me to avoid DISTINCT.
You don't state it, but are you refreshing the index statistics regularly? The index itself is kept in sync automatically, but as the underlying data changes, the statistics the optimizer relies on can become less and less accurate, and its plans get worse. If you have an index on checked, and checked is being updated over time, make sure those statistics are refreshed on a regular basis (in MySQL, for example, with ANALYZE TABLE).