I'm trying to understand indexes better for when I use MySQL. I'm still having a hard time determining what type of index I should use: individual indexes, multi-column indexes, covering indexes, etc.
Is there a general rule for deciding what type of index to use? When I design my database layout I don't know exactly which queries will be used until the application is done being built. For one table I could query on one or multiple fields as well as query it for reporting. So if I query a table like so:
SELECT * FROM table1 WHERE field1 = this AND field2 = that GROUP BY field3 ORDER BY field4
Would I create a multiple column index on field1, field2, field3 and field4?
Also what if I have a different query on the same table like:
SELECT * FROM table1 WHERE field1 = this and field3 = that
If I had the multiple column index from the first query will that same index work for the second query since field1 is on the farthest left of the index?
And another question I had: is there a specific order MySQL looks for indexes? So for a multiple column or a covering index, do I add columns in the order of the WHERE clause? Then anything in the GROUP BY clause, then anything in the ORDER BY clause? Or does MySQL automatically do this?
Sorry for all the questions, just looking for help on this.
Engine
First you have to decide which engine you want to use for a given table.
InnoDB is preferable (transactions...). Before MySQL 5.6 it did not offer fulltext indexes; from 5.6 onward it does.
If you need a fulltext index on an older version, you have to choose MyISAM.
(A fulltext index keeps an index based on the words in a column.)
Tables
You have to know that MySQL generally uses only one index per table in a join (index merge, available since MySQL 5.0, is the exception). So don't count on MySQL combining two indexes of a given table.
Multi-columns
Choose the order of the columns based on the queries, keeping in mind that MySQL can use a leftmost prefix of the index if necessary.
For instance
CREATE INDEX myindex ON mytable (col1,col2,col3)
MySQL can use (col1), (col1,col2) and (col1,col2,col3) as an index. So to answer your question, your index should be created on
(field1,field3,field2,field4)
since your two queries need (field1,field3) and (field1,field2,field3,field4).
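The leftmost-prefix behaviour described above can be checked with a small experiment. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in because it runs without a server, so the plan text is SQLite's, not MySQL's EXPLAIN output, and MySQL's optimizer may make different choices. The index name idx_multi and the extra payload column are made up for the example.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the human-readable detail lines of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description

conn = sqlite3.connect(":memory:")
# Hypothetical table; payload is an extra column that is not in the index.
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, "
    "field1, field2, field3, field4, payload)"
)
conn.execute("CREATE INDEX idx_multi ON table1 (field1, field3, field2, field4)")

# Filters on the first two index columns: the (field1, field3) prefix is usable.
both = plan(conn, "SELECT * FROM table1 WHERE field1 = ? AND field3 = ?", (1, 2))

# Filters only on field3: there is no usable leftmost prefix.
no_prefix = plan(conn, "SELECT * FROM table1 WHERE field3 = ?", (2,))

print(both)
print(no_prefix)
```

In MySQL the equivalent check is to run EXPLAIN on each query and look at the key and key_len columns.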
When I design my database layout I dont know exactly what all queries will be used until the application is done being built
Correct. Don't build indexes until you know all the queries. It's okay to add, change, alter and remove indexes. Indeed, good designers change the indexes as the use of the software changes.
Would I create a multiple column index on field1, field2, field3 and field4?
Rarely.
If I had the multiple column index from the first query will that same index work for the second query since field1 is on the farthest left of the index?
No.
And another question I had was is there a specific order mysql looks for indexes?
No.
So for multiple column or a covering index do I add indexes in order of the where clause?
No.
Then anything in group clause then anything in order clause?
No.
Or does mysql automatically do this?
More-or-less.
Here's the rule.
Design the database.
Write the queries.
Find the most common queries. 20% of your queries do 80% of the work. Focus on the few, slow queries that need indexes.
Explain the query execution plans for only the most common queries. There's an EXPLAIN statement for this.
Measure the performance of those queries with realistic loads of data. You have to build fake data for this. Some queries will be slow. Indexes may help. Some queries will not be slow.
Now comes the hard part. Try different indexes until (a) the explain plan looks optimal and (b) the measured query performance meets your expectations.
You cannot get all queries to be fast.
You do not build indexes for all queries.
Focus on the 20% of the queries that cost 80% of the time.
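The explain-then-measure loop above can be sketched on fake data. This is a minimal illustration under stated assumptions, not a benchmark: it uses Python's sqlite3 module as a stand-in for MySQL, and the orders table, its columns and the index name idx_orders_customer are hypothetical. Timings are printed but will vary from machine to machine.

```python
import random
import sqlite3
import time

def plan(conn, sql, params=()):
    """Return the human-readable detail lines of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

def timed(conn, sql, params=(), repeat=50):
    """Run the query `repeat` times and return the total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        conn.execute(sql, params).fetchall()
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# Build fake but reproducible data, as step 5 of the rule suggests.
rng = random.Random(0)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(rng.randrange(1000), rng.random() * 100) for _ in range(20000)],
)

query = "SELECT total FROM orders WHERE customer_id = ?"

before = plan(conn, query, (42,))
seconds_before = timed(conn, query, (42,))

# Try a candidate index, then explain and measure again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, query, (42,))
seconds_after = timed(conn, query, (42,))

print(before, round(seconds_before, 4))
print(after, round(seconds_after, 4))
```

If the plan and the timing do not improve, drop the candidate index and try another one.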
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (Index(Index_EmployeeTypeId )) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
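The selectivity argument in this answer (5 matching rows versus 9,000 out of 10,000) can be put into numbers with a toy cost model. This is purely illustrative and is not how SQL Server actually costs plans: the function name and the assumed factor of 4 for a random index lookup relative to a sequential row read are invented for the example.

```python
def cheaper_access_path(total_rows, matching_rows, random_read_cost=4.0):
    """Toy model: a table scan reads every row sequentially at cost 1.0 each,
    while an index lookup pays `random_read_cost` for each matching row."""
    scan_cost = total_rows * 1.0
    index_cost = matching_rows * random_read_cost
    return "index" if index_cost < scan_cost else "table scan"

# The two cases from the answer above.
print(cheaper_access_path(10_000, 5))      # few matches
print(cheaper_access_path(10_000, 9_000))  # most of the table matches
```

A real optimizer estimates matching_rows from table statistics, which is why stale statistics can lead it to the wrong choice.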
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each value in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
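To make the equivalences in this answer concrete, the sketch below runs the IN form, the OR form and a join form and checks that they return the same rows. It uses Python's sqlite3 module only so that it is runnable; the WantedType lookup table and the sample data are hypothetical, and whether a given optimizer plans the three forms identically is a separate question that only the execution plan can answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INTEGER)"
)
# 20 employees spread evenly over type ids 1..5.
conn.executemany(
    "INSERT INTO Employee (EmployeeId, EmployeeTypeId) VALUES (?, ?)",
    [(i, i % 5 + 1) for i in range(1, 21)],
)

in_rows = sorted(conn.execute(
    "SELECT EmployeeId FROM Employee WHERE EmployeeTypeId IN (1, 2, 3)"
).fetchall())

or_rows = sorted(conn.execute(
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId = 1 OR EmployeeTypeId = 2 OR EmployeeTypeId = 3"
).fetchall())

# Join form: the wanted type ids live in a (hypothetical) lookup table.
conn.execute("CREATE TABLE WantedType (EmployeeTypeId INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO WantedType VALUES (?)", [(1,), (2,), (3,)])
join_rows = sorted(conn.execute(
    "SELECT e.EmployeeId FROM Employee e "
    "JOIN WantedType w ON w.EmployeeTypeId = e.EmployeeTypeId"
).fetchall())

print(len(in_rows), in_rows == or_rows == join_rows)
```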
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used depends less on the type of query than on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, but this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, it may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee WITH (INDEX(EmployeeTypeId))
This query will search using the index you have created (here assumed to be named EmployeeTypeId). It works for me. Please give it a try.
I have a MySQL table with ~17M rows where I end up doing a lot of aggregation queries.
For this example let's say I have index_on_b, index_on_c, compound_index_on_a_b, compound_index_on_a_c
I run EXPLAIN on a query:
EXPLAIN SELECT SUM(revenue) FROM table WHERE a = some_value AND b = other_value
And I find that the selected index is index_on_b, but when I use a query hint
SELECT SUM(revenue) FROM table USE INDEX (compound_index_on_a_b) WHERE a = some_value AND b = other_value
The query runs way way faster. Is there anything I can do in MySQL config to make MySQL choose the compound indexes first?
There are 2 possible routes you can take:
A) When the optimizer considers several indexes to be equally good, it resolves the tie based on the order in which the indexes were created. You could drop index_on_b and recreate it, then check whether the optimizer had simply judged the two indexes to be equivalent.
Or
B) Use optimizer_search_depth (see https://mariadb.com/blog/setting-optimizer-search-depth-mysql). By altering this parameter you determine how much effort the optimizer is allowed to spend on a query plan, and it might come up with the much better solution of using the combined index.
A possible explanation:
If a has the same value throughout the table, then INDEX(b) is actually better than INDEX(a,b). This is because the former is smaller, hence faster to work with. Note that both will return the same number of rows, even without further checking of a.
Please provide:
SHOW CREATE TABLE
SHOW INDEXES -- to see cardinality
EXPLAIN SELECT
Consider fetching data with
SELECT * FROM table WHERE column1='XX' && column2='XX'
MySQL will filter the results matching the first part of the WHERE clause, then the second part. Am I right?
Imagine the first part matches 10 records, and adding the second part narrows that down to 5 records. Do I need to INDEX the second column too?
You are talking about short-circuit evaluation. A DBMS has a cost-based optimizer. There is no guarantee which of the two conditions will get evaluated first.
To apply that to your question: Yes, it might be beneficial to index your second column.
Is it used regularly in searches?
What does the execution plan tell you?
Is the access pattern going to change in the near future?
How many records does the table contain?
Would a covering index be a better choice?
...
Indexes are optional in MySQL, but they can increase performance.
Currently, MySQL can only use one index per table select, so with the given query, if you have an index on both column1 and column2, MySQL will try to determine the index that will be the most beneficial, and use only one.
The general solution, if the speed of the select is of utmost importance, is to create a multi-column index that includes both columns.
This way, even though MySQL could only use one index for the table, it would use the multi-column index that has both columns indexed, allowing MySQL to quickly filter on both criteria in the WHERE clause.
In the multi-column index, you would put the column with the highest cardinality (the highest number of distinct values) first.
For even further optimization, "covering" indexes can be applied in some cases.
Note that indexes can increase performance, but with some cost. Indexes increase memory and storage requirements. Also, when updating or inserting records into a table, the corresponding indexes require maintenance. All of these factors must be considered when implementing indexes.
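The effect of a multi-column index on a two-condition WHERE clause can be seen in a query plan. The sketch below is an illustration only: it runs on Python's sqlite3 module rather than MySQL, the table t and the index names idx_c1 and idx_c1_c2 are hypothetical, and MySQL's EXPLAIN output looks different.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the human-readable detail lines of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, other TEXT)"
)
query = "SELECT * FROM t WHERE column1 = ? AND column2 = ?"

# With an index on column1 only, the index narrows the search by column1
# and the column2 condition is checked against each row it finds.
conn.execute("CREATE INDEX idx_c1 ON t (column1)")
single = plan(conn, query, ("XX", "XX"))

# With a multi-column index, both conditions are resolved inside the index.
conn.execute("DROP INDEX idx_c1")
conn.execute("CREATE INDEX idx_c1_c2 ON t (column1, column2)")
combined = plan(conn, query, ("XX", "XX"))

print(single)
print(combined)
```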
Update: MySQL 5.0 can now use an index on more than one column by merging the results from each index, with a few caveats.
The following query is a good candidate for Index Merge Optimization:
SELECT * FROM t1 WHERE key1=1 AND key2=1
When processing such a query, the RDBMS will use only one index. Having separate indexes on both columns allows it to choose whichever one is faster.
Whether it's needed depends on your specific situation.
Is the query slow as it is now?
Would it be faster with index on another column?
Would it be faster with one index containing both columns?
You may need to try and measure several approaches.
You don't have to INDEX the second column but it may speed up your SELECT
I am designing an SQL database (accessed via PHP/MySQL) and have questions about designing the database in a way that helps the website run relatively quickly. My precise question pertains to speed when querying a table with many columns where one column is of type text. I am wondering, if
this table will be frequently queried,
only about 50% of the queries will include the text column,
and I specify the columns names so the text column is not returned in those 50% of queries,
will the presence of the text column in the table affect the query speed? As a follow-up, do text columns generally slow down database queries?
Any other general tips on database design to help boost query speed are appreciated, as are any suggestions for books or other references on this topic. Thanks!
AFAIK there is no difference if you add a text column to a table, as long as you do not use it in the WHERE clause.
If you use it in the WHERE clause it's definitely good to have an index on it. Avoid comparisons with LIKE, as they are slower.
I'm not convinced that text columns are much slower than the alternatives.
Specifying the columns to be returned is a good performance choice - as there is no need to move more data than is needed across the wires.
If you get your indexes right you will get far better performance improvements than using text columns will cost.
If you are doing many more database reads than writes, then indexes will improve read speed.
Regularly rebuilding indexes (dropping and re-adding them) can also help the optimizer, as the shape of the data in your tables will change over time.
The 'text' data type will only slow your queries down if you intend to filter using this column in the WHERE clause of your select statements.
SELECT textColumn
FROM table WHERE varcharColumn LIKE '%Spanner%'
can be optimised more easily than
SELECT varcharColumn
FROM table WHERE textColumn LIKE '%Spanner%'
however
SELECT textColumn
FROM table WHERE integerColumn = 1
performs just as well as
SELECT varcharColumn
FROM table WHERE integerColumn = 1
Some general tips:
As a general rule you should think about how your tables are going to be ordered in your output (by date or alphabetically?) and put an index on that column.
If you are starting out with DB design you should generally have all your tables using an INT Primary Key that is also your IDENTITY column AND Clustered Index. This means that the tables will be physically ordered (on disk) by that column (generally your ID such as PersonID etc), then use Non-clustered indexes on the columns that you are going to filter and order by.
At a later stage when you've built a few DBs I'd recommend you go further into optimising table design by setting your Clustered Index to be the unique column that is most frequently being used to order the table, including using multiple columns as your Clustered Index.
Does a sort use a MySQL index if there is an index on the sorting column? Also what other things is the index used for?
What difference does it make in a combined and separate indexes of the columns?
Yes, MySQL can use your index to sort the results when the ORDER BY is on the indexed column.
Also, if a single index contains all the columns referenced by the query (a covering index), MySQL will not load the data from the table itself but from the index, which is faster.
The difference between combined and separate indexes is that MySQL generally uses only one index per table in a query, so if your query filters by many columns and you would like it properly indexed, you will need to create a combined index on all of those columns.
But before adding lots of indexes to your tables, remember that each index makes insert/update/delete operations go slower.
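Both points from this answer, sorting through an index and reading from the index alone, show up in a query plan. The sketch below is a stand-in experiment using Python's sqlite3 module, not MySQL: the people table and the index names idx_age and idx_age_name are hypothetical. In MySQL the comparable signals in EXPLAIN are the absence of "Using filesort" and the presence of "Using index" in the Extra column.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the human-readable detail lines of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, age INTEGER, name TEXT, bio TEXT)"
)

# 1. Sorting: without an index the engine needs a separate sort step.
sort_query = "SELECT name FROM people ORDER BY age"
unindexed_sort = plan(conn, sort_query)

conn.execute("CREATE INDEX idx_age ON people (age)")
indexed_sort = plan(conn, sort_query)  # rows are read in index order

# 2. Covering index: every column the query touches is in one index,
#    so the table rows themselves never have to be read.
conn.execute("DROP INDEX idx_age")
conn.execute("CREATE INDEX idx_age_name ON people (age, name)")
covering = plan(conn, "SELECT age, name FROM people WHERE age = ?", (30,))

print(unindexed_sort)
print(indexed_sort)
print(covering)
```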
I would also highly recommend the High Performance MySQL book by O'Reilly that will cover in depth all of these issues and a lot of other hints you need to know to really be able to use MySQL to the limit.