MySQL EXPLAIN: "Using index" vs. "Using index condition"

The MySQL 5.4 documentation, on Optimizing Queries with EXPLAIN, says this about these Extra remarks:
Using index
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.
[...]
Using index condition
Tables are read by accessing index tuples and testing them first to determine whether to read full table rows. In this way, index information is used to defer (“push down”) reading full table rows unless it is necessary.
Am I missing something, or do these two mean the same thing (i.e. "didn't read the row, index was enough")?

An example explains it best:
SELECT Year, Make -- possibly more fields and/or fields from extra tables
FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > '2006'
Assuming the available indexes are:
CarId
VIN
Make
Make and Year
This query would EXPLAIN with 'Using index' because it doesn't need to "hit" the myUsedCarInventory table itself at all, since the "Make and Year" index covers its needs with regard to the elements of the WHERE clause that pertain to that table.
Now imagine we keep the query the same except for the addition of a condition on the color:
...
WHERE Make = 'Toyota' AND Year > '2006' AND Color = 'Red'
This query would likely EXPLAIN with 'Using index condition' (the 'likely' here covers the case where Make + Year is not estimated to be selective enough, in which case the optimizer may decide to just scan the table). This means that MySQL would FIRST use the index to resolve Make + Year, and only for the rows that satisfy those conditions would it look up the corresponding rows in the table. That's what is sometimes referred to as "push down optimization".
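For reference, a minimal sketch of the index and the two EXPLAIN calls described above (the table, columns, and values come from the example; the index name is made up for illustration, and the actual Extra output depends on your MySQL version and statistics):

CREATE INDEX idx_make_year ON myUsedCarInventory (Make, Year);

-- Only indexed columns are referenced, so Extra can show "Using index":
EXPLAIN SELECT Year, Make FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > '2006';

-- Color is not in the index, so the index filters Make + Year first and the full
-- rows are fetched only for matching entries ("Using index condition"):
EXPLAIN SELECT Year, Make FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > '2006' AND Color = 'Red';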

The difference is that "Using index" doesn't need a lookup from the index to the table, while "Using index condition" sometimes has to. I'll try to illustrate this with an example. Say you have this table:
id, name, location
With an index on
name, id
Then this query doesn't need the table for anything; it can retrieve all its information "Using index":
select id, name from table where name = 'Piskvor'
But this query needs a table lookup for all rows where name equals 'Piskvor', because it can't retrieve location from the index:
select id from table where name = 'Piskvor' and location = 'North Pole'
The query can still use the index to limit the results to the small set of rows with a particular name, but it has to look at those rows in the table to check whether the location matches too.
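A minimal schema matching this example (the columns and index order come from the answer; the table and index names are assumptions):

CREATE TABLE t (
  id INT NOT NULL,
  name VARCHAR(50),
  location VARCHAR(50),
  PRIMARY KEY (id),
  INDEX idx_name_id (name, id)
);

-- Fully covered by idx_name_id, so EXPLAIN can show "Using index":
SELECT id, name FROM t WHERE name = 'Piskvor';

-- location is not in the index, so matching rows must still be read from the table:
SELECT id FROM t WHERE name = 'Piskvor' AND location = 'North Pole';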

Related

MySQL query index

I am using MySQL 5.6 and am trying to optimize the following query:
SELECT t1.field1,
...
t1.field30,
t2.field1
FROM Table1 t1
JOIN Table2 t2 ON t1.fk_int = t2.pk_int
WHERE t1.int_field = ?
AND t1.enum_filed != 'value'
ORDER BY t1.created_datetime desc;
The result can contain millions of records, and every row consists of 31 columns.
Right now, EXPLAIN says in Extra that the planner uses 'Using where'.
I tried to add the following index:
create index test_idx ON Table1 (int_field, enum_filed, created_datetime, fk_int);
After that, EXPLAIN says in Extra that the planner uses "Using index condition; Using filesort".
The "rows" value from EXPLAIN with the index is lower than without it, but in practice the execution time is longer.
So, my questions are:
What is the best index for this query?
Why does EXPLAIN say that the 'key_len' of the query with the index is 5? Shouldn't it be 4+1+8+4=17?
Should the fields from ORDER BY be in index?
Should the fields from JOIN be in index?
Try refactoring your index this way: avoid the created_datetime column (or move it to the right, after fk_int), and move fk_int before the enum_filed column. That way the three columns used for filtering should be used more effectively:
create index test_idx ON Table1 (int_field, fk_int, enum_filed);
Also be sure you have a specific index on the Table2 column pk_int. If you don't have one, add it:
create index test_idx ON Table2 (pk_int);
What is the best index for this query?
Maybe (int_field, created_datetime) (See next Q&A for reason.)
Why EXPLAIN says that 'key_len' of query with index is 5. Shouldn't it be 4+1+8+4=17?
The enum_filed != test defeats the optimizer. If there is only one other value for that enum (and it is NOT NULL), then use = and the other value. And try INDEX(int_field, enum_filed, created_datetime). The Optimizer is much happier with = than with any inequality.
"5" could be indicating 2 columns, or it could be indicating one INT that is Nullable. If int_field can be NULL, consider changing it to NOT NULL; then the "5" would drop to "4".
Should the fields from ORDER BY be in index?
Only if the index can completely handle the WHERE. This usually occurs only if all the WHERE tests are =. (Hence, my previous answer.)
Another case for including those columns is "covering"; see next Q&A.
Should the fields from JOIN be in index?
It depends. One thing that gives some performance benefit is to include all columns mentioned anywhere in the SELECT. This is called a "covering" index and is indicated in EXPLAIN by Using index (not Using index condition). There are too many columns in t1 to add a "covering" index. I think the practical limit is about 5 columns.
My guess for your question № 1:
create index my_idx on Table1(int_field, created_datetime desc, fk_int)
or one of these (but neither will probably be worthwhile):
create index my_idx on Table1(int_field, created_datetime desc, enum_filed, fk_int)
create index my_idx on Table1(int_field, created_datetime desc, fk_int, enum_filed)
I'm supposing 3 things:
Table2.pk_int is already a primary key, judging by the name
The where condition on Table1.int_field is only satisfied by a small subset of Table1
The inequality on Table1.enum_filed (I would fix the typo, if I were you) only excludes a small subset of Table1
Question № 2: the key_len refers to the keys used. Don't forget that there is one extra byte for nullable keys. In your case, if int_field is nullable, it means that this is the only key used, otherwise both int_field and enum_filed are used.
As for questions 3 and 4: If, as I suppose, it's more efficient to start the query plan from the where condition on Table1.int_field, the composite index, in this case also with the correct sort order (desc), enables a scan of the index to get the output rows in the correct order, without an extra sort step. Furthermore, adding also fk_int to the index makes the retrieval of any record of Table1 unnecessary unless a corresponding record is present in Table2. For a similar reason you could also add enum_filed to the index, but, if this doesn't considerably reduce the output record count, the increase in index size will make things worse instead of better. In the end, you will have to try it out (with realistic data!).
Note that if you put another column between int_field and created_datetime in the index, the index won't provide the created_datetime (for a given int_field) in the desired output order.
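A minimal illustration of that last point, using the question's column names (the index names are made up):

-- created_datetime immediately follows int_field, so for a given int_field
-- the index already delivers rows in created_datetime order:
CREATE INDEX idx_ok ON Table1 (int_field, created_datetime, fk_int);

-- enum_filed sits between them, so rows for a given int_field are grouped by
-- enum_filed first and are no longer stored in created_datetime order:
CREATE INDEX idx_not_ok ON Table1 (int_field, enum_filed, created_datetime);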
The issue was fixed by adding more filters (in the WHERE clause) to the query.
Regarding indexes, two of the proposed indexes were helpful:
From @WalterTross, with the following index for the initial query:
(int_field, created_datetime desc, enum_filed, fk_int)
With my short comment: descending indexes are not supported in MySQL 5.6; the DESC keyword there is just parsed and ignored.
From @RickJames, with the following index for the modified query:
(int_field, created_datetime)
Thanks everyone who tried to help. I really appreciate it.

mysql index column order for join

I have two tables (requests, results):
requests:
email
results:
email, processed_at
I now want to get all results that have a request with the same email and that have not been processed:
SELECT * FROM results
INNER JOIN requests ON requests.email = results.email
AND results.processed_at IS NULL
I have an index on each individual column, but the query is very slow, so I assume I need a multi-column index on results.
I am just not sure which order the columns should be in:
ALTER TABLE results
ADD INDEX results_email_processed_at (email,processed_at)
ALGORITHM=INPLACE LOCK=NONE;
or
ALTER TABLE results
ADD INDEX results_processed_at_email (processed_at,email)
ALGORITHM=INPLACE LOCK=NONE;
Either composite index will be equally beneficial.
However, if you are fetching 40% of the table, then the Optimizer may choose to ignore any index and simply scan the table.
Is that SELECT the actual query? If not, please show us the actual query; a number of seemingly minor changes could make a big difference in optimization options.
Please provide EXPLAIN SELECT ... so we can see what it thinks with the current index(es). And please provide SHOW CREATE TABLE in case there are datatype issues that are relevant.
Notwithstanding any indexing issues, you explicitly asked for all results that were NOT processed. You have an INNER JOIN, which means "I want matching rows from BOTH sides", so your NULL check in the join would never qualify.
You need a LEFT JOIN to the results table.
As for the index, since the join is on email, I would have EMAIL as the primary (leftmost) component of the index. Making it a covering index by including the processed_at column would be faster, as the engine would not have to go to the raw data page to qualify the results. Have the index specifically ordered as (email, processed_at), so EMAIL is the first qualifier and processed_at comes along for the ride to complete the fields the query requires.
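A sketch of what this answer describes, keeping the question's table and column names (whether the LEFT JOIN matches the intended semantics is worth verifying against the expected results):

ALTER TABLE results
ADD INDEX results_email_processed_at (email, processed_at);

SELECT requests.email, results.processed_at
FROM requests
LEFT JOIN results ON results.email = requests.email
WHERE results.processed_at IS NULL;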

MySQL Index for ORDER BY with Date Range in WHERE

Suppose you have a table with the following columns:
id
date
col1
I would like to be able to query this table with a specific id and date, and also order by another column. For example,
SELECT * FROM TABLE WHERE id = ? AND date > ? ORDER BY col1 DESC
According to this range documentation, an index will stop being used after it hits the > operator. But according to this order by documentation, an index can only be used to optimize the order by clause if it is ordering by the last column in the index. Is it possible to get an indexed lookup on every part of this query, or can you only get 2 of the 3? Can I do any better than index (id, date)?
Plan A: INDEX(id, date) -- works best when the WHERE filters out a lot of rows, making the subsequent "filesort" not very costly.
Plan B: INDEX(col1), which may work best if very few rows are filtered by the WHERE clause. This avoids the filesort, but is not necessarily faster than the other choices here.
Plan C: INDEX(id, date, col1) -- This is a "covering" index if the query does not reference any other fields. The potential advantage here is to look only at the index, and not have to touch the data. If it applies, Plan C is better than Plan A.
You have not provided enough information to say which of these INDEXes will work best. I suggest you add C and B if "covering" applies; else add A and B (DDL for all three is sketched below). Then see which index the Optimizer picks. (There is still a chance that the Optimizer will not pick 'right'.)
(These three indexes are what my Index blog recommends.)
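For concreteness, the three candidate plans written out as DDL (the table and column names are the ones from the question; the index names are made up):

ALTER TABLE `TABLE` ADD INDEX idx_plan_a (id, date);        -- Plan A
ALTER TABLE `TABLE` ADD INDEX idx_plan_b (col1);            -- Plan B
ALTER TABLE `TABLE` ADD INDEX idx_plan_c (id, date, col1);  -- Plan C (covering, if no other columns are selected)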

How to optimize queries with user-defined parameters in the "where" clause?

I'm learning how to do proper query optimization using indexes. Let's say I have a huge table of products with all kinds of details for each product, e.g. price, category, number of purchases, review average, and more. When having multiple "where" conditions, I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.
However, I'm having difficulty figuring out how to scale it if there are so many queries for different purposes, and if users get to pick how to filter the products table. For example, a user can browse products WHERE rating > 4 AND purchases > 100, or it could be WHERE category = 'x' AND price < 100 AND price > 20. How would a proper multi-column index work if the columns chosen to be filtered are random?
I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.
You learned... not quite correctly.
The order of appearance in the WHERE clause is not meaningful, since the optimizer is free to evaluate the conditions in any logically valid way, subject of course to parentheses and logical operators (AND, OR, etc.) in the expression.
The order of columns in a multi-column index is important because, scanning from left to right, as soon as a column is encountered in the index that is not mentioned in the WHERE clause, nothing further toward the right side of that index can be used.
If 3 columns, (a,b,c) are indexed, and the query is WHERE a = 1 AND c = 6 then the optimizer will only be able to use the left-most "a" column values in that index, not "c".
In that case, it would likely still choose to use the index to find rows where a = 1, and then scan all of those identified rows for only those with c = 6.
You can visualize a multi-column index as a multidimensional array. Without a known value or range you need to match for the first column (a), the values for the second column (b) are a meaningless, unordered jumble of data, because they're sorted in "groups of 'a'"... you'd have to iterate through every "a" to find the matching "b" values, and iterate through every "a,b" to find the matching "c" values. Since, in the example above, the "b" value is "anything" since it isn't specified, the ordering of the "c" values is meaningless and inaccessible for optimizing the query (although when every column within the SELECT list is available within a single index, the optimizer may scan the index instead of scanning the whole table, treating it as a "covering index," which is generally better than a full table scan but still suboptimal).
If your WHERE clause includes two columns both of which are indexed individually, the optimizer will check the index statistics and try to use the one that is most likely to produce the fewest matches... if "a" and "c" each have an individual index, and the index stats indicate that there are many values for "c" (high cardinality) but only a few values for "a" (low cardinality) the optimizer will typically use the index on "c" to find matching rows, then scan all of those rows for the requested values of "a".
Or, it may try to use the union of the two indexes, to precisely identify which rows satisfy both conditions.
Neither of these strategies is optimal either, but both are still far better than a full table scan, so it does suggest that you should -- at a minimum -- have every independently-searchable column as the leftmost column in an index... that is, any column that can be queried on its own, with no other columns in the WHERE clause, and return a reasonably-sized result set. If the result set would not be reasonable in size, you may wish to restrict the user to searching on additional attributes, in the application.
In the case of WHERE category = 'x' AND price < 100 AND price > 20, the better index would be (category,price) and not (price,category), but this is not because of the ordering of expressions in the WHERE clause. It is because category is an equality test, but price is a range. WHERE price < 100 AND price > 20 AND category = 'x' is equivalent, and (category,price) is still the appropriate index -- because indexes are sorted by the first column, then within each value for the first column they are sorted by the values of the second column, then within each (first,second) pair they are sorted by the values in the third column, ad infinitum... so with (category,price) the server goes directly to all of the rows for category = 'x', and within that grouping in the index the referenced rows are already sorted by price, so it only has to select the range of price within the category 'x' portion of the index. Optimal. The (price,category) index requires checking all the prices in the range, and then checking the category value for all of those. The index could still be used, but depending on the criteria, the optimizer could still opt to scan the whole table.
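As a concrete sketch of that (category, price) example (the table name products and the index name are assumptions):

CREATE INDEX idx_category_price ON products (category, price);

-- Equality on the first column, range on the second: both index columns are usable,
-- and the matching entries form one contiguous range of the index.
SELECT * FROM products WHERE category = 'x' AND price < 100 AND price > 20;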
If you add a third criteria to the WHERE clause that isn't indexed, the same path will be followed, but the server will scan the identified rows for matches with the required value of the non-indexed column. Again, suboptimal, but often acceptable, depending on your business needs -- which play a role in determining the correct answer to this question.
Every index requires space, and resources, because every insert, update, and delete, requires that the server make the necessary changes -- right then -- to every index that is affected by the changes to the table.
Note also that if you have an index on (a,b) or (a,b,c), etc., then a separate index on (a) generally is considered a waste of space, since the index on (a,...anything-else...) will also serve as an index on (a).
Experimenting with EXPLAIN SELECT (which also supports INSERT/UPDATE/DELETE as of MySQL 5.6) and genuinely understanding its output is an indispensable tool for understanding how indexes work. MySQL 5.6 also supports optimizer tracing, which gives you detailed output of how the optimizer understood your query, the various plans it considered, the cost it estimated of each plan, and how it arrived at the decision of how to execute a particular query.
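A short sketch of turning on the optimizer trace mentioned above for a single query (MySQL 5.6+; the query and table name are just placeholders):

SET optimizer_trace = 'enabled=on';
SELECT * FROM products WHERE category = 'x' AND price BETWEEN 20 AND 100;
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';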

MySQL Secondary Indexes

If I'm trying to increase the performance of a query that uses 4 different columns from a specific table, should I create 4 different indexes (one with each column individually) or should I create 1 index with all columns included?
One index with all 4 columns is, in my experience, the fastest. If you use a WHERE clause, try to put the columns in an order that makes the index useful for the WHERE.
An index with all four columns; the columns used in the WHERE should go first, and those compared with equality (=) should go first of all.
Sometimes, giving priority to integer columns gives better results; YMMV.
So for example,
SELECT title, count(*) FROM table WHERE class = 'post' AND topic_id = 17
AND date > ##BeginDate and date < ##EndDate;
would have an index on: topic_id, class, date, and title, in this order.
The "title" in the index is only used so that the DB may find the value of "title" for those records matching the query, without the extra access to the data table.
The more balanced the distribution of the records on the first fields, the better results you will have (in this example, say 10% of the rows have topic_id = 17; you would discard the other 90% without ever having to run a string comparison with 'post' -- not that string comparisons are particularly costly). Depending on the data, you might find it better to index date first and class later, or even to use date as a MySQL PARTITION key.
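The index just described, written out as DDL (assuming the column tested by class = 'post' is named class and the table is named `table`, as in the query above; the index name is made up):

ALTER TABLE `table`
ADD INDEX idx_topic_class_date_title (topic_id, class, date, title);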
A single index is usually more effective than an index merge, so if you have a condition like f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4, a single composite index is the right decision (sketched below).
To achieve the best performance, list the index fields in descending order of cardinality (count of distinct values); this helps reduce the number of rows analyzed.
An index of fewer than 4 fields can be more effective, as it requires less memory.
http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/
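A sketch of the single composite index for that f1..f4 equality example (the table name t and the index name are assumptions):

CREATE INDEX idx_f1_f2_f3_f4 ON t (f1, f2, f3, f4);

-- All four equality conditions can then be resolved within this one index:
SELECT * FROM t WHERE f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4;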