Should "duplicate" indices in MySQL be deleted? - mysql

I am aware that in MySQL indices on (A,B,C) benefit ANDed WHERE clauses with |A|, |A,B|, |A,B,C|. This makes it seem that having the index (A,B,C) means that there is no point in having a single index on (A) or a composite on (A,B).
1. Is this true?
2. Is it just a waste maintaining an index on (A) when you already have an index on (A,B,C)?

I believe the answer to both your questions is the same: it's almost entirely true; it's almost always wasteful to have indexes on both (A, B, C) and (A).
As Danblack mentioned, the size could make a minor difference, although that's probably negligible.
More importantly, in my experience, note that (A) is actually (A, Primary), where Primary is those primary key columns that are not already explicitly included in the index. In practice, that often means (A, Id). The other index, then, is actually (A, B, C, Id). Note how this affects the order in which rows are encountered in the index.
Imagine doing this:
SELECT *
FROM MyTable
WHERE A = 'Whatever'
ORDER BY Id
Index (A), AKA (A, Id), is perfect for this. For any fixed value of A, corresponding rows are then ordered by Id. No sorting is needed - the results are in our desired order.
However, for index (A, B, C), AKA (A, B, C, Id), it's different. For any fixed value of A, corresponding rows are then ordered by B! This means that the above query will require sorting of the results.
EXPLAIN should confirm what I have described. A filesort will take place if only the (A, B, C) index is available, but not if (A) is available.
It should be easy to see that this matters very little if there are generally very few rows for a particular value of A. However, if there could be 100,000 rows for such a value, then the filesort starts to be impactful. In such a case, you might choose to have index (A) to optimize for this scenario.
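To make the ordering argument concrete, here is a toy model in Python rather than a real MySQL session. It assumes that an index behaves like a list of key tuples kept in sorted order, with the primary key appended as the last element; the rows and the helper name ids_in_index_order are made up for illustration.

```python
# Toy model: an index is a sorted list of key tuples.
# Hypothetical rows of MyTable as (Id, A, B, C).
rows = [
    (1, "Whatever", "z", 10),
    (2, "Other", "a", 20),
    (3, "Whatever", "a", 30),
    (4, "Whatever", "m", 40),
]

# INDEX(A) modelled as (A, Id), i.e. with the primary key appended.
index_a = sorted((a, id_) for id_, a, b, c in rows)
# INDEX(A, B, C) modelled as (A, B, C, Id).
index_abc = sorted((a, b, c, id_) for id_, a, b, c in rows)


def ids_in_index_order(index, a_value):
    """Ids of the entries with A = a_value, in the order the index yields them."""
    return [entry[-1] for entry in index if entry[0] == a_value]


ids_via_a = ids_in_index_order(index_a, "Whatever")
ids_via_abc = ids_in_index_order(index_abc, "Whatever")

print(ids_via_a)    # [1, 3, 4]: already in Id order, no sort needed
print(ids_via_abc)  # [3, 4, 1]: in B order, so ORDER BY Id needs a sort
```

With only three matching rows the extra sort is trivial; the point is just that the second index hands the rows back in B order.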
Generally speaking, such prefix indexes are superfluous. It's good to analyze your indexes and queries to identify these scenarios, though. In a rare case, one may be worth adding. In the more common case, at least you'll be able to weigh such effects into your overall index choices.

1. True.
2. Almost always.
There is a very rare case where, if:
A as a standalone index is used most frequently, and
queries that use A,B or A,B,C are very rare, and
sizeof(A) is significantly less than sizeof(A,B,C), and
you are memory constrained, such that use of index A,B,C normally takes up a significant share of the buffer pool/key cache to the detriment of other queries;
then there may be a small benefit in having a small duplicate index on just A.
Note: other conditions might also need to hold.

Related

Indexed field + limit 1 vs. Unique field comparison in terms of performance | MySQL

Example 1. Field c.alias is unique.
Select c.* from category c where c.alias = 'some-alias'
Example 2. Field c.alias is simply indexed.
Select c.* from category c where c.alias = 'some-alias' limit 1
Do these two examples differ in terms of query performance?
The answer to your question is to test and find out.
If there is a difference, however, I would expect it to be very, very minor. In both cases, the where clause should use an index. You are really asking whether MySQL "knows" to stop after the first match (because of the unique index). Or will MySQL need to look at the next value to see that it is different. This second step would have very little overhead.
I would also offer that if this is a serious question about optimizing code, then you are probably looking in the wrong place. Such micro-optimizations (perhaps nano- in this case) are not usually helpful.
I should note that if you were comparing a primary key index to a regular index, you might see a measurable difference. In MySQL's InnoDB engine, the primary key is automatically clustered, and that could make a (small but measurable) performance difference.
(Gordon's answer is good. But I have a few different things to say...)
SELECT ... WHERE unique_col = constant -- This will read one row (if any) from the table. LIMIT 1 has no impact other than the trivial parsing time.
SELECT ... WHERE non_uniq_col = constant -- This will read until it finds a row that does not match. That is, it will read N+1 rows to deliver N rows. For this case, LIMIT 1 would stop it short. Now the question is not about performance (not much difference if N is small), but about functionality (did you want 1 row or N).
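The two cases above can be sketched with a toy model in Python. This is not how MySQL is implemented; it assumes an index is a sorted list of (key, row_id) pairs, and the function index_scan and its unique/limit parameters are hypothetical names for illustration only.

```python
from bisect import bisect_left


def index_scan(index, value, unique=False, limit=None):
    """Equality lookup in a sorted list of (key, row_id) pairs.

    Returns (matching_row_ids, entries_examined).
    """
    # (value,) sorts before every (value, row_id), so this finds the first candidate.
    pos = bisect_left(index, (value,))
    matches, examined = [], 0
    while pos < len(index):
        if limit is not None and len(matches) >= limit:
            break  # LIMIT satisfied: stop without looking at the next entry
        key, row_id = index[pos]
        examined += 1
        if key != value:
            break  # first non-matching entry ends the scan (the "+1" read)
        matches.append(row_id)
        if unique:
            break  # a unique key cannot have a second match
        pos += 1
    return matches, examined


non_unique = [("a", 1), ("b", 2), ("b", 3), ("b", 4), ("c", 5)]

print(index_scan(non_unique, "b"))           # ([2, 3, 4], 4): N+1 entries for N rows
print(index_scan(non_unique, "b", limit=1))  # ([2], 1): LIMIT 1 stops it short
print(index_scan([("a", 1), ("b", 2), ("c", 3)], "b", unique=True))  # ([2], 1)
```

The difference in entries examined is one extra read in the non-unique case, which is why the performance question is minor and the real question is whether you wanted 1 row or N.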
In InnoDB, the PRIMARY KEY is "clustered" with the data, and is implicitly UNIQUE. So it is as described above.
With a secondary key, there is an extra step. First the matching entries are found (col=const) in the index BTree. At the leaf node of that BTree is the PK for the desired row. The extra step is to drill down the PK's BTree to find the row. This costs some extra, but is rarely worth worrying about for small-to-medium-sized tables. (The LIMIT 1 issues still apply, and are minor.)
If all the columns referenced in the SELECT are in the same secondary index (including the implicit copy of the PK), then the index is said to be "covering". Hence, the query can be performed entirely in the secondary index's BTree, thereby avoiding the extra step. Again, this is usually small, but may be worth doing. "Covering" is indicated in EXPLAIN by "Using index" (not "Using index condition", which refers to something different).
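As a rough illustration of the extra step and of "covering", here is a toy Python model. It assumes the clustered index maps the primary key to the full row and the secondary index maps the indexed column to the primary key; the table contents and the select helper are invented for the example and say nothing about actual costs.

```python
# Clustered index: primary key -> full row (hypothetical category table).
clustered = {
    1: {"id": 1, "alias": "news", "title": "News"},
    2: {"id": 2, "alias": "some-alias", "title": "Some category"},
}
# Secondary index on alias: alias -> primary key (the implicit copy of the PK).
secondary_alias = {"news": 1, "some-alias": 2}


def select(columns, alias):
    """Model SELECT columns ... WHERE alias = ?; returns (row, btree_lookups)."""
    lookups = 1  # descend the secondary index
    pk = secondary_alias.get(alias)
    if pk is None:
        return None, lookups
    if set(columns) <= {"alias", "id"}:
        # Covering: everything requested is already in the secondary index.
        row = {"alias": alias, "id": pk}
    else:
        lookups += 1  # the extra step: drill down the PK's BTree
        row = clustered[pk]
    return {col: row[col] for col in columns}, lookups


print(select(["id"], "some-alias"))           # ({'id': 2}, 1)
print(select(["id", "title"], "some-alias"))  # ({'id': 2, 'title': 'Some category'}, 2)
```

Selecting c.* as in the question always needs the second lookup, since the title is only in the clustered row.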
May I suggest you study my Indexing Cookbook rather than learn the gory details first. My document gives you the important things for performance, and leaves out some of the less important details, such as what this question is discussing.
For tiny tables in a low-volume project, none of this matters.
For huge tables, the number of disk hits is the main performance metric; minimizing that becomes the goal. In particular, the extra step for secondary indexes usually involves N disk hits (for N rows).
But... SELECT ... GROUP BY ... ORDER BY ... LIMIT 1 may involve fetching a bunch of rows, sorting them, and finally delivering 1 row. If the sorting cannot be consumed in the index, N rows are manipulated, perhaps multiple times. In this case, LIMIT 1 is acted on only after 99% of the effort was expended. Novices commonly misjudge LIMIT in this way.

How can I detect if an MySQL index is necessary or required?

We have the idea that some queries can be improved. And I know that I can dive in slow query logs ... but I ran across the post below for MS SQL and was wondering if there is an easy way of analyzing if an index is required (and will give immediate speed improvements) for the current MySQL database.
Help appreciated
Resource for MS SQL: https://dba.stackexchange.com/questions/56/how-to-determine-if-an-index-is-required-or-necessary
You can't.
There are ways to detect, over a period of time, whether an index is used. But there is no way to be sure that an index is not used. Let's say you have a once-a-month task that does some major maintenance on the table. And you really need a certain index to keep the task from locking the table and bringing down the application. If you checked for index usage for most of the month, but failed to include that usage, you might decide that you don't need the index. Then you would drop the index... and be sorry. (This is a real anecdote.)
Meanwhile, there are some simplistic rules about indexes...
INDEX(a) is unnecessary if you also have INDEX(a,b).
INDEX(id) is unnecessary if you also have PRIMARY KEY(id) or UNIQUE(id).
An index with 5 or more columns may be used, but is unlikely to be "useful". (Shorten it.)
INDEX(a), INDEX(b) is not the same as INDEX(a,b).
INDEX(b,a) is not the same as INDEX(a,b); you may need both.
INDEX(flag), where flag has a small number of distinct values, will probably never be used -- the optimizer will scan the table instead.
In many cases, "prefix" indexing (INDEX(foo(10))) is useless. (But there are many exceptions.)
"I indexed every column" -- a bad design pattern.
Often, but not always, having both a PRIMARY KEY and a UNIQUE key means that something is less than optimal.
InnoDB tables really should have an explicit PRIMARY KEY.
InnoDB implicitly includes the PK in any secondary key. So, given PRIMARY KEY(id), INDEX(foo) is really INDEX(foo, id).
Sometimes the Optimizer will ignore the WHERE clause and use an index for the ORDER BY.
Some queries have such skewed properties that the Optimizer will use a different index depending on different constants. (I have literally seen as many as 6 different explain plans for one query.)
"Index merge intersect" is almost always not as good as a composite index.
There are exceptions to most of these tips.
So, I prefer to take all the queries (SELECTs, UPDATEs, and DELETEs), decide on the optimal index for each, eliminate redundancies, etc, in order to find the "best" set of indexes. See my cookbook on creating an index, given a SELECT.
You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option. You scan the table checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is "M". For instance, you can perform a binary search! Whereas the table scan requires you to look at N rows (where N is the number of rows), the binary search only requires that you look at about log2(N) index entries to find the first match, even in the very worst case. Wow, that's sure a lot easier!
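If it helps, the CSV-and-spreadsheet picture can be written out as a small Python sketch. This is only the conceptual model from this answer (a list of rows plus a sorted list of (value, row_number) pairs), not how a real engine stores anything; the table contents and function names are made up.

```python
def table_scan(table, value):
    """Check the third column of every row; returns (row_numbers, rows_examined)."""
    hits = [n for n, row in enumerate(table) if row[2] == value]
    return hits, len(table)


def build_index(table):
    """Index on the third column: sorted list of (value, row_number)."""
    return sorted((row[2], n) for n, row in enumerate(table))


def index_lookup(index, value):
    """Binary search for the first match, then read the adjacent matches.

    Returns (row_numbers, probes), where probes counts binary-search steps.
    """
    lo, hi, probes = 0, len(index), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if index[mid][0] < value:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(index) and index[lo][0] == value:
        hits.append(index[lo][1])
        lo += 1
    return hits, probes


# 1000 hypothetical rows; the third column cycles through the letters A..Z.
table = [(n, "x", chr(ord("A") + n % 26)) for n in range(1000)]
index = build_index(table)

scan_hits, scanned = table_scan(table, "M")
index_hits, probes = index_lookup(index, "M")
print(scanned, probes)          # 1000 rows examined vs. at most 10 probes
print(scan_hits == index_hits)  # True: same rows either way
```

After the binary search lands on the first "M", the remaining matches sit right next to it in the index, so reading them is cheap.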
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.

Does the order of indexes matter?

Let's say you have a table with columns A and B, among others. You create a multi-column index (A, B) on the table.
Does your query have to take the order of indexes into account? For example,
select * from MyTable where B=? and A in (?, ?, ?);
In the query we put B first and A second. But the index is (A, B). Does the order matter?
Update: I do know that the order of indexes matters significantly in terms of the leftmost prefix rule. However, does it matter which column comes first in the query itself?
In this case, no, but I recommend running the query with EXPLAIN to see which optimizations MySQL will use (or not).
The order of columns in the index can affect the way the MySQL optimiser uses the index. Specifically, MySQL can use your compound index for queries on column A because it's the first part of the compound index.
However, your question refers to the order of column references in the query. Here, the optimiser will take care of the references appropriately, and the order is unimportant. The different clauses must come in a particular order to satisfy syntax rules, so you have little control anyway.
The MySQL reference on multi-column index optimisation is here
You can test out specific queries if you think they are problems, but otherwise I wouldn't worry about this optimization. Your query will most likely be mangled from its original form by the query plan. That is to say, MySQL should do a good job of planning how it will use the indices to optimize speed. This may require the conditions to be in a different order, but I doubt it. If MySQL actually did have to reorder the conditions for optimization, it would be a very minor cost relative to the execution of the query (at least if the result set is large).

Why having each column only in a single index in MySQL and does the order of columns in the query matter?

I very often search the table posts for values in the columns user+status and user+time.
SELECT * FROM `posts` WHERE `user`='xxx' and `status`='active'
SELECT * FROM `posts` WHERE `user`='xxx' and `time`>...
Thus I have set up two indices (user, status) and (user, time)
I'm aware that writes slow down as more indices need to be updated. But I think in this case it is useful to have both indices, since reading operations outnumber writing operations by far.
Anyway, phpMyAdmin gives a warning saying "More than one index has been created for the column user". Can I just ignore this warning? I checked the WordPress DB tables and saw that they put a column in the second position if it already had an index of its own.
comment_approved_date_gmt = INDEX(comment_approved, comment_date_gmt)
comment_date_gmt = INDEX(comment_date_gmt)
Why don't they use only one two-column index (INDEX(comment_date_gmt, comment_approved)), which would make INDEX(comment_date_gmt) unnecessary? And why is it disadvantageous to have two indices starting with the same column name?
Is there a general rule for which column should go first in my query? For example, the one with the lowest number of different entries (e.g. status) and afterwards the one with a higher number of different values (e.g. user names)?
Yes, the order of columns in an index matters.
Think of an analogy to a telephone book. It's like an index on (last_name, first_name). Looking up a person by last name, you use the sorted order of the phone book to help you find them quickly.
But if you only know the person's first name, they are scattered throughout the book. To find one, you'd have to search the book page by page.
Yes, indexes can be redundant.
Any query that is searching for last_name can use a single-column index on (last_name), or it can get the same benefit from a two-column index on (last_name, first_name). So why create both indexes?
There's a tool pt-duplicate-key-checker that can help you identify redundant indexes. I've never come across a database that didn't have at least a few such indexes.
phpMyAdmin is wrong.
If phpMyAdmin is warning about the indexes (user, status) and (user, time), then it's being over-zealous, because these indexes are not redundant with respect to each other. Basically, an index is redundant if its columns comprise a left-prefix of the columns in another index. So an index (A) is redundant with respect to an index (A, B), but an index (A, C) is distinct from (A, B) and both may be used by different queries.
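The left-prefix rule is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as stated here, not how pt-duplicate-key-checker works; the index names and the helper functions are hypothetical, and it deliberately ignores uniqueness constraints and exact duplicates.

```python
def is_left_prefix(shorter, longer):
    """True if the columns of `shorter` are a strict left-prefix of `longer`."""
    return len(shorter) < len(longer) and tuple(longer[:len(shorter)]) == tuple(shorter)


def redundant_indexes(indexes):
    """indexes: dict of index name -> tuple of columns.

    Returns the names of indexes whose columns are a left-prefix of another index.
    """
    return sorted(
        name
        for name, cols in indexes.items()
        if any(
            is_left_prefix(cols, other_cols)
            for other_name, other_cols in indexes.items()
            if other_name != name
        )
    )


# The two indexes from the question are distinct, so nothing is flagged.
print(redundant_indexes({
    "user_status": ("user", "status"),
    "user_time": ("user", "time"),
}))  # []

# Adding a single-column index on user would be flagged.
print(redundant_indexes({
    "user_status": ("user", "status"),
    "user_time": ("user", "time"),
    "user_only": ("user",),
}))  # ['user_only']
```

A flagged index is a candidate for review rather than something to drop blindly; check the queries that use it first.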
PS: I cover these points and more in my presentation How to Design Indexes, Really.
I feel that the ordering of columns in a SQL query is a premature optimisation, which according to Knuth, is the root of all evil. You should program for maintenance, not for optimisation and let the optimiser take care of the speed.

MySQL Indexes: Having multiple indexes vs. having one multi-field key?

When I manually create tables in MySQL, I add indexes one at a time for each field that I think I will use for queries.
When I use phpMyAdmin to create tables for me, and I select my indexes in the create-table form, I see that phpMyAdmin combines my indexes into 1 (plus my primary).
What's the difference? Is one better than the other? In which case?
Thanks!
Neither is a particularly good strategy, but if I had to choose I'd pick the multiple single indexes.
The reason is that an index can only be used if you use all the fields in any complete prefix of the index. If you have an index (a, b, c, d, e, f) then this works fine for a query that filters on a or a query that filters on both a and b, but it will be useless for a query filtering only on c.
There's no easy rule that always works for choosing the best indexes. You need to look at the types of queries you are making and choose the indexes that would speed up those particular queries. If you think carefully about the order of the columns you can find a small number of indexes that will be useful for multiple different queries. For example, if in one query you filter on both a and b, and in another query you filter on only b, then an index on (b, a) will be usable by both queries, but an index on (a, b) will be usable only by the first.
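A small Python sketch of the prefix rule may help when comparing candidate column orders. It is a simplification that assumes equality filters only and ignores range conditions and any other access paths the optimizer may have; usable_prefix is a made-up helper, not a MySQL function.

```python
def usable_prefix(index_cols, filtered_cols):
    """Leading index columns usable by a query with equality filters on filtered_cols."""
    used = []
    for col in index_cols:
        if col not in filtered_cols:
            break  # the prefix ends at the first column the query does not filter on
        used.append(col)
    return used


# Query 1 filters on a and b; query 2 filters on b only.
print(usable_prefix(("b", "a"), {"a", "b"}))  # ['b', 'a']
print(usable_prefix(("b", "a"), {"b"}))       # ['b']
print(usable_prefix(("a", "b"), {"a", "b"}))  # ['a', 'b']
print(usable_prefix(("a", "b"), {"b"}))       # []: index (a, b) does not help query 2

# The six-column index is no help for a filter on c alone.
print(usable_prefix(("a", "b", "c", "d", "e", "f"), {"c"}))  # []
```

The filtered columns are passed as a set because the order in which the conditions appear in the WHERE clause makes no difference; only the column order in the index does. In practice, EXPLAIN on the real query remains the way to confirm what actually gets used.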
This actually depends on your queries. Some queries make better use of multicolumn indexes, some not.
EXPLAIN is your friend.
http://dev.mysql.com/doc/refman/5.6/en/explain.html
Also a very good resource is here:
http://dev.mysql.com/doc/refman/5.6/en/optimization-indexes.html