I have a MySQL table with ~17M rows where I end up doing a lot of aggregation queries.
For this example, let's say I have index_on_b, index_on_c, compound_index_on_a_b, and compound_index_on_a_c.
I run EXPLAIN on a query:
EXPLAIN SELECT SUM(revenue) FROM table WHERE a = some_value AND b = other_value
and I find that the selected index is index_on_b. But when I use an index hint:
SELECT SUM(revenue) FROM table USE INDEX(compound_index_on_a_b) WHERE a = some_value AND b = other_value
the query runs much faster. Is there anything I can do in the MySQL config to make MySQL prefer the compound indexes?
There are 2 possible routes you can take:
A) When the optimizer considers two indexes equally good, it resolves the tie based on the order in which the indexes were created. You could drop index_on_b and recreate it, then check whether the optimizer was simply treating the two indexes as equivalent.
Or
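B) Use optimizer_search_depth (see https://mariadb.com/blog/setting-optimizer-search-depth-mysql). This parameter determines how much effort the optimizer is allowed to spend searching for a query plan, so a different setting might lead it to the much better plan that uses the compound index. Bear in mind that the setting mainly governs the search over join orders, so it may have little effect on a single-table query like this one.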
A possible explanation:
If a has the same value throughout the table, then INDEX(b) is actually better than INDEX(a,b). This is because the former is smaller, hence faster to work with. Note that both will return the same number of rows, even without further checking of a.
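As a rough illustration of that point, here is a minimal Python sketch that estimates how many rows an equality lookup touches. It is not MySQL's actual cost model: it assumes values are uniformly distributed and columns are independent, and the function name and the distinct-value counts are hypothetical.

```python
def estimated_rows(total_rows, distinct_counts):
    """Rough number of rows matched by equality conditions on indexed columns.

    Assumes uniform, independent value distributions (a simplification).
    """
    rows = float(total_rows)
    for distinct in distinct_counts:
        rows /= distinct
    return rows


TOTAL = 17_000_000   # table size from the question
B_DISTINCT = 1_000   # hypothetical number of distinct values in b

# If a holds a single value everywhere, adding it to the index filters nothing.
print(estimated_rows(TOTAL, [B_DISTINCT]))       # INDEX(b)
print(estimated_rows(TOTAL, [1, B_DISTINCT]))    # INDEX(a, b) with constant a
# If a had 50 distinct values, the compound index would narrow the search.
print(estimated_rows(TOTAL, [50, B_DISTINCT]))   # INDEX(a, b) with selective a
```

The real cardinalities reported by SHOW INDEXES are what would tell you which of these situations applies.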
Please provide:
SHOW CREATE TABLE
SHOW INDEXES -- to see cardinality
EXPLAIN SELECT
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
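The trade-off described above can be sketched with a toy cost comparison in Python. This is only an illustration of the reasoning, not how SQL Server actually costs plans; the function name and the per-row cost constants are made up.

```python
def choose_access_path(total_rows, matching_rows, seq_cost=1.0, random_cost=4.0):
    """Pick 'index' or 'scan' by comparing two crude cost estimates.

    seq_cost and random_cost are invented per-row costs, not real
    SQL Server parameters.
    """
    scan_cost = total_rows * seq_cost          # read every row once, sequentially
    index_cost = matching_rows * random_cost   # one random lookup per matching row
    return "index" if index_cost < scan_cost else "scan"


print(choose_access_path(10_000, 5))       # only a handful of rows match
print(choose_access_path(10_000, 9_000))   # most of the table matches
```

The exact crossover point depends on the constants, which is why the real decision has to come from the execution plan rather than from a rule of thumb.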
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
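The logical equivalence between IN and a chain of OR-ed equality conditions can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so it only demonstrates that the results match, not how SQL Server plans the query; the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER)"
)
conn.execute("CREATE INDEX Index_EmployeeTypeId ON Employee (EmployeeTypeId)")
# Invented data: 100 employees spread evenly over 10 type ids (0..9).
conn.executemany(
    "INSERT INTO Employee (EmployeeId, EmployeeTypeId) VALUES (?, ?)",
    [(i, i % 10) for i in range(1, 101)],
)

in_rows = conn.execute(
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId IN (1, 2, 3) ORDER BY EmployeeId"
).fetchall()
or_rows = conn.execute(
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId = 1 OR EmployeeTypeId = 2 OR EmployeeTypeId = 3 "
    "ORDER BY EmployeeId"
).fetchall()

print(in_rows == or_rows, len(in_rows))
```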
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used depends less on the type of query than on the type and distribution of data in the table(s), how up to date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will generally be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, but this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, it may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
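For comparison, this is roughly what building such an IN query by hand looks like. It is a hand-written Python analogue using sqlite3, not what NHibernate does internally; the helper name and the sample rows are hypothetical. The point is only that the list length, and therefore the final query, is known only at runtime.

```python
import sqlite3


def in_clause_query(ids):
    """Build a parameterized IN query with one placeholder per value."""
    if not ids:
        raise ValueError("IN list must not be empty")
    placeholders = ", ".join("?" for _ in ids)
    return f"SELECT * FROM employee WHERE employeeid IN ({placeholders})"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(1, "a"), (5, "b"), (7, "c"), (23463, "d")],  # invented rows
)

employee_ids = [1, 5, 23463, 32523]
sql = in_clause_query(employee_ids)
rows = conn.execute(sql, employee_ids).fetchall()
print(sql)
print(rows)
```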
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This query (written with MySQL's index-hint syntax) will search using the index you have created. It works for me. Please give it a try.
I am trying to optimize the query which we are using to generate reports.
I think I have managed to optimize it to some extent.
Below was the original query:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name, trat.name
order by field(trat.name,'create-account','change-attr',
'add-member-to-group',
'grant-access','disable-account','revoke-access',
'remove-member-from-group','update-license')
When I look at the execution plan, the Extra column says Using where; Using temporary; Using filesort.
So I optimized the query like this:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name,trat.name
order by null
This gives me an execution plan with Using where; Using temporary. So filesort is no longer used, and there is no extra overhead because the optimizer doesn't have to sort separately; that is taken care of during the GROUP BY.
I then went further and created an index on the GROUP BY columns, in the same order as they appear in the GROUP BY (this is important, or the optimization won't happen), i.e. an index on (trat.asset_name, trat.name).
With this index, the Extra column shows only Using where. The query execution time was also reduced by almost half (earlier it was 0.568 sec and now it is 0.345 sec; it varies from run to run, but stays more or less in this range).
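The effect of an index that matches the GROUP BY columns can be reproduced in a small sketch. It uses Python's sqlite3 module rather than MySQL, so the plan wording differs: SQLite's "USE TEMP B-TREE FOR GROUP BY" plays a role loosely similar to MySQL's "Using temporary". The query is a simplified version of the one above, and the index name is made up.

```python
import sqlite3


def plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_remote_agent_tasks (asset_name TEXT, name TEXT, status TEXT)"
)

query = (
    "SELECT asset_name, name, COUNT(*) FROM t_remote_agent_tasks "
    "GROUP BY asset_name, name"
)

# Without an index, SQLite has to build a temporary structure to group rows.
before = plan(conn, query)

# Index columns in the same order as the GROUP BY columns.
conn.execute(
    "CREATE INDEX idx_asset_name_name ON t_remote_agent_tasks (asset_name, name)"
)
after = plan(conn, query)

print(before)
print(after)
```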
Now I want to optimize the range conditions, i.e. this part of the query:
trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
I have been reading the MySQL reference manual on range optimization, which says that NOT IN is not a range condition, so I modified the query like this:
trat.status in ('completed','failedredundant')
and trat.name in ('add-member-to-group','change-attr','create-account',
'disable-account','grant-access', 'remove-member-from-group',
'update-license')
But it doesn't show any improvement in Extra (I expected Using index to appear, but it still shows Using where).
I also tried splitting the two range parts into unions (that changes the query result, but there was still no improvement in the execution plan).
I would like some help on how to optimize this query further, mostly the range part (the IN part).
Is there any other optimization I should make?
I appreciate your time. Thanks in advance.
EDIT 1: I forgot to mention that I also have an index on trat.status, so these are the indexes:
(trat.asset_name,trat.name)
(trat.status)
In virtually all cases, only one index is used in a SELECT. So the best possible index must be available.
Both of the first two queries will probably benefit most from the same 'composite' index:
INDEX(asset_name, name)
Normally, one would try to handle the WHERE conditions in the index, but they do not look amenable to an index. (More discussion below.) Second choice is the GROUP BY, which I am recommending. But, since (in the first case) the ORDER BY and the GROUP BY are different, there will necessarily be a tmp table created for the output of the GROUP BY so that it can be sorted according to the ORDER BY. (There may also be a tmp and sort for the GROUP BY; I cannot tell.)
"Using index" means that a "covering" index was used. A "covering" index is a composite index that includes all of the columns used anywhere in the SELECT. That would be about 5 columns, and probably not wise to attempt. (More below.)
Another thing to note is that even something this simple:
WHERE x IN (11,22)
GROUP BY y
cannot use any index to handle both the WHERE and GROUP BY. So, there is no way for your query to consume both (except by 'covering').
A covering index, when used, is only partially useful. It says that all the work is done just in the BTree of the index. But that could include a full index scan -- which is not that much faster than a full table scan. This is another argument against recommending 'covering'.
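To see what "covering" means in practice, here is a minimal sketch using Python's sqlite3 module. SQLite is only a stand-in: it reports "USING COVERING INDEX" in its plan where MySQL would show "Using index" in Extra. The table, columns, and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER, z INTEGER)")
conn.execute("CREATE INDEX idx_x_y ON t (x, y)")


def plan(sql):
    """Join the detail column of EXPLAIN QUERY PLAN into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Every column the query touches (x, y) is in the index: it is covering.
covered = plan("SELECT y FROM t WHERE x = 11")
# z lives only in the table, so the index alone cannot answer the query.
not_covered = plan("SELECT z FROM t WHERE x = 11")

print(covered)
print(not_covered)
```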
In some situations, IN or OR can be sped up by turning it into UNION:
( SELECT ... WHERE status in ('completed') )
UNION ALL
( SELECT ... WHERE status in ('failedredundant') )
but this will only cause you to stumble into the NOT IN(...) clause, which is worse than an IN.
The goal of finding the best index is to find one that has the rows (in the index and/or in the table) consecutively sitting in the BTree.
To make any further improvements on this query will probably require re-thinking the schema -- it seems to be forcing you to have IN, NOT IN, FIELD and other hard-to-optimize constructs.
I have a table t with columns a int, b int, c int; composite index i (b, c). I fetch some data with following query:
select * from t where c = 1 and b = 2;
So the question is: will MySQL and Postgres use the index i? And, more generally: does the order of the conditions in a composite WHERE clause affect whether an index can be used?
What you need to do is use EXPLAIN in both to see what's going on. If it says it's using an index, then it is. One caveat is that on a small table with minimal data, it's very likely that PostgreSQL (and probably MySQL) will ignore the indexes in favor of a scan. To get a realistic result, insert a fair amount of dummy data (at least 20 rows; I always use about 500) and be sure to analyze the table. Also, realize that if the search criteria return a large percentage of the table, the planner will likely not use the index either (as a scan will be faster).
create the table
generate data (perhaps using generate_series)
run explain select * from t where c=1 and b=2
create the index: create index on t(b,c)
analyze the table: analyze t
run explain select * from t where c=1 and b=2 again and compare with the first run
Hopefully this will help answer this and other questions you might have in the future about when indexes will be used. To answer your original question though: yes, in general PostgreSQL will use the index, regardless of the order of the conditions, if the optimizer determines that to be the best way to get your results. Remember to analyze your table though, so the optimizer has an idea of what data is in it, and analyze it again any time a large amount of data is added or deleted. Depending on your PG version and settings, some of this may be done automatically for you, but it won't hurt to analyze manually, especially when testing this kind of thing.
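The same experiment can be scripted. The sketch below follows the steps above but uses Python's built-in sqlite3 module instead of PostgreSQL or MySQL, so the plan text and planner behavior are SQLite's and the result only shows the general idea; the generated data is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
# Generate dummy data (about 500 rows).
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(n, n % 50, n % 7) for n in range(500)],
)


def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan("SELECT * FROM t WHERE c = 1 AND b = 2")

conn.execute("CREATE INDEX i ON t (b, c)")
conn.execute("ANALYZE")  # refresh statistics so the planner knows the data

plan_c_first = plan("SELECT * FROM t WHERE c = 1 AND b = 2")
plan_b_first = plan("SELECT * FROM t WHERE b = 2 AND c = 1")

print(plan_before)
print(plan_c_first)
print(plan_b_first)
```

Comparing the last two plans shows whether the order of the conditions made any difference to the chosen access path.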
Edit: the index order may (especially if you don't use an ORDER BY in your query and the optimizer uses the index) affect the order of the results of your query: the returned rows may come back in index order.
No, the order doesn't matter.
Optimizer does a lot of smart things to perform a query in the most efficient way.
I'm trying to understand indexes better for when I use MySQL. One issue is that I'm still having a hard time determining what type of index I should use: individual indexes, multi-column indexes, covering indexes, etc.
One question I have is: is there a general rule for deciding what type of index to use? When I design my database layout I don't know exactly what all the queries will be until the application is done being built. For one table, I could query on one or multiple fields, as well as query it for reporting. So if I query a table like so:
SELECT * FROM table1 WHERE field1 = this AND field2 = that GROUP BY field3 ORDER BY field4
Would I create a multiple column index on field1, field2, field3 and field4?
Also what if I have a different query on the same table like:
SELECT * FROM table1 WHERE field1 = this and field3 = that
If I had the multiple column index from the first query will that same index work for the second query since field1 is on the farthest left of the index?
Another question I had: is there a specific order in which MySQL looks for indexes? So for a multiple column or covering index, do I add columns in the order of the WHERE clause, then anything in the GROUP BY clause, then anything in the ORDER BY clause? Or does MySQL do this automatically?
Sorry for all the questions, just looking for help on this.
Engine
First you have to decide which engine you want to use for a given table
InnoDB is preferable (transactions...), but before MySQL 5.6 it does not offer fulltext indexes
If you need a fulltext index on such an older version, you have to choose MyISAM
(A fulltext index is an index built on the words in a column)
Tables
You have to know that MySQL generally uses at most one index per table in a join. So, don't count on MySQL combining two indexes of a given table.
Multi-columns
Choose the order of the columns based on the queries, keeping in mind that MySQL can use a leftmost prefix of the index if necessary
For instance
CREATE INDEX myindex ON mytable (col1,col2,col3)
MySQL can use (col1), (col1,col2) and (col1,col2,col3) as indexes. So to answer your question, your index should be created on
(field1,field3,field2,field4),
since your two queries need (field1,field3) and (field1,field2,field3,field4).
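The leftmost-prefix rule can be expressed as a tiny Python helper. This is a simplified sketch of the rule only (it ignores range conditions, sorting, and the optimizer's cost estimates), and the function name is made up for illustration.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality filters can use.

    Simplified: ignores ranges, sorting, and the optimizer's cost estimates.
    """
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break  # a gap in the filtered columns ends the usable prefix
        prefix.append(column)
    return prefix


index = ["col1", "col2", "col3"]
print(usable_prefix(index, {"col1"}))           # leading column only
print(usable_prefix(index, {"col1", "col3"}))   # col3 is cut off by the gap at col2
print(usable_prefix(index, {"col2", "col3"}))   # no leading column, nothing usable
```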
When I design my database layout I dont know exactly what all queries will be used until the application is done being built
Correct. Don't build indexes until you know all the queries. It's okay to add, change, alter and remove indexes. Indeed, good designers change the indexes as the use of the software changes.
Would I create a multiple column index on field1, field2, field3 and field4?
Rarely.
If I had the multiple column index from the first query will that same index work for the second query since field1 is on the farthest left of the index?
No.
And another question I had was is there a specific order mysql looks for indexes?
No.
So for multiple column or a covering index do I add indexes in order of the where clause?
No
Then anything in group clause then anything in order clause?
No.
Or does mysql automatically do this?
More-or-less.
Here's the rule.
Design the database.
Write the queries.
Find the most common queries. 20% of your queries do 80% of the work. Focus on the few, slow queries that need indexes.
Explain the query execution plans for only the most common queries. There's an EXPLAIN statement for this.
Measure the performance of those queries with realistic loads of data. You have to build fake data for this. Some queries will be slow. Indexes may help. Some queries will not be slow.
Now comes the hard part. Try different indexes until (a) the explain plan looks optimal and (b) the measured query performance meets your expectations.
You cannot get all queries to be fast.
You do not build indexes for all queries.
Focus on the 20% of the queries that cost 80% of the time.
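One way to find that 20% is to rank measured queries by total time and stop once most of the time is accounted for. The Python sketch below assumes you already have per-query timings from a realistic test run; the function name, the query names, and the numbers are all hypothetical.

```python
def queries_to_tune(total_time_by_query, share=0.8):
    """Return the slowest queries that together account for `share` of total time."""
    total = sum(total_time_by_query.values())
    selected, running = [], 0.0
    ranked = sorted(total_time_by_query.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in ranked:
        if running >= share * total:
            break  # the queries picked so far already cover the target share
        selected.append(name)
        running += seconds
    return selected


# Hypothetical measurements: total seconds spent per query over a test run.
measured = {
    "report_by_asset": 120.0,
    "login_lookup": 5.0,
    "daily_summary": 60.0,
    "audit_list": 10.0,
    "ping": 5.0,
}
print(queries_to_tune(measured))
```

Those are the queries worth running through EXPLAIN and experimenting on with different indexes.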
I'm interested in where condition; if I write:
Select * from table_name
where insert_date > '2010-01-03'
and text like '%friend%';
is it different from:
Select * from table_name
where text like '%friend%'
and insert_date > '2010-01-03';
I mean, if the table is very big with a lot of rows, and MySQL first takes the records satisfying the condition insert_date > '2010-01-03' and then searches those records for the word "friend", that can be much faster than first searching for "friend" rows and then looking at the date field.
Is it important to write the WHERE condition smartly, or does MySQL analyze the condition and rewrite it in the best way?
Thanks
No, the two where clauses should be equivalent. The optimizer should pick the same index whichever you use.
The order of columns in an index does matter though.
If you think the optimizer is using the wrong index, you could give it a hint. More often than not though, there's a good reason for using the index it has chosen to use, so unless you know exactly what you are doing, giving the optimizer hints will often make things worse not better.
I don't know about MySQL in particular, but typically this kind of optimization is left to the database engine, as which order is faster depends on indexes, cardinality of data, and quantity of data among other things.
I think it's true that both WHERE clauses are equivalent at the database abstraction level.
By definition, a logical conjunction (the AND operator) is commutative. This means that WHERE A AND B is equal to WHERE B AND A.
It makes no difference in which order you write your conditions.
However, what makes a difference is what indexes you have in place on your table. The query analyzer takes these into account. It is also smart enough to find the part of the condition that is easiest to check and apply that one first.
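A quick way to convince yourself of the commutativity is to run both forms against the same data. This sketch uses Python's sqlite3 module as a stand-in for MySQL, with invented rows and a hypothetical index name, and only checks that the results are identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (id INTEGER PRIMARY KEY, insert_date TEXT, text TEXT)"
)
conn.execute("CREATE INDEX idx_insert_date ON table_name (insert_date)")
conn.executemany(
    "INSERT INTO table_name (insert_date, text) VALUES (?, ?)",
    [
        ("2010-01-01", "old friend"),
        ("2010-01-05", "new friend"),
        ("2010-01-06", "stranger"),
        ("2010-02-01", "best friends"),
    ],
)

date_first = conn.execute(
    "SELECT id FROM table_name "
    "WHERE insert_date > '2010-01-03' AND text LIKE '%friend%' ORDER BY id"
).fetchall()
like_first = conn.execute(
    "SELECT id FROM table_name "
    "WHERE text LIKE '%friend%' AND insert_date > '2010-01-03' ORDER BY id"
).fetchall()

print(date_first, like_first)
```

To compare how each form is executed in MySQL itself, run EXPLAIN on both and check that the same index is chosen.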