I'm learning how to do proper query optimization using indexes. Let's say I have a huge table of products with all kinds of details for each product, e.g. price, category, number of purchases, review average, and more. For queries with multiple WHERE conditions, I learned that it's best to put a multi-column index on whatever columns your WHERE conditions use, in the order that they appear.
However, I'm having difficulty figuring out how to scale it if there are so many queries for different purposes, and if users get to pick how to filter the products table. For example, a user can browse products WHERE rating > 4 AND purchases > 100, or it could be WHERE category = 'x' AND price < 100 AND price > 20. How would a proper multi-column index work if the columns chosen to be filtered are random?
I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.
You learned... not quite correctly.
The order of appearance in the WHERE clause is not meaningful, since the optimizer is free to evaluate the conditions in any logically valid way, subject of course to parentheses and logical operators (AND, OR, etc.) in the expression.
The order of columns in a multi-column index is important because, reading from left to right, as soon as a column is encountered in the index that is not mentioned in the WHERE clause, nothing further to the right in that index can be used.
If 3 columns, (a,b,c) are indexed, and the query is WHERE a = 1 AND c = 6 then the optimizer will only be able to use the left-most "a" column values in that index, not "c".
In that case, it would likely still choose to use the index to find rows where a = 1, and then scan all of those identified rows for only those with c = 6.
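A minimal sketch of that situation (the table and index names here are illustrative):

CREATE TABLE t (
  id INT PRIMARY KEY,
  a INT,
  b INT,
  c INT,
  INDEX abc (a, b, c)
);

-- Only the leading column "a" of the index is usable here; rows with
-- a = 1 are located via the index, then each one is checked for c = 6.
EXPLAIN SELECT * FROM t WHERE a = 1 AND c = 6;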
You can visualize a multi-column index as a multidimensional array. Without a known value or range to match for the first column (a), the values for the second column (b) are a meaningless, unordered jumble of data, because they're sorted in "groups of 'a'"... you'd have to iterate through every "a" to find the matching "b" values, and through every "a,b" to find the matching "c" values. Since, in the example above, the "b" value is "anything" (it isn't specified), the ordering of the "c" values is meaningless and inaccessible for optimizing the query. (Although when every column in the SELECT list is available within a single index, the optimizer may scan the index instead of the whole table, treating it as a "covering index" -- generally better than a full table scan, but still suboptimal.)
If your WHERE clause includes two columns both of which are indexed individually, the optimizer will check the index statistics and try to use the one that is most likely to produce the fewest matches... if "a" and "c" each have an individual index, and the index stats indicate that there are many values for "c" (high cardinality) but only a few values for "a" (low cardinality) the optimizer will typically use the index on "c" to find matching rows, then scan all of those rows for the requested values of "a".
Or, it may try an "index merge": using the intersection of the two indexes to precisely identify which rows satisfy both conditions.
Neither of these strategies is optimal either, but both are still far better than a full table scan, so it does suggest that you should -- at a minimum -- have every independently-searchable column as the leftmost column in an index... that is, any column that can be queried on its own, with no other columns in the WHERE clause, and return a reasonably-sized result-set. If the result-set will not be reasonable in size, you may wish to restrict the user, in the application, to searching on additional attributes.
In the case of WHERE category = 'x' AND price < 100 AND price > 20 the better index would be (category,price) and not (price,category), but this is not because of the ordering of expressions in the WHERE clause. It is because category is an equality test, while price is a range. WHERE price < 100 AND price > 20 AND category = 'x' is equivalent, and (category,price) is still the appropriate index -- because indexes are sorted by the first column, then within each value of the first column they are sorted by the values of the second column, then within each (first,second) pair by the values of the third column, ad infinitum. So with (category,price) the server goes directly to all of the rows for category = 'x', and within that grouping of the index the referenced rows are already sorted by price, so it only has to select the price range within the category 'x' portion of the index. Optimal. The (price,category) index requires checking all the prices in the range, and then checking the category value for each of those. The index could still be used, but depending on the criteria, the optimizer could still opt to scan the whole table.
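As a sketch, assuming a products table shaped like the one in your question:

-- Equality column first, range column second.
ALTER TABLE products ADD INDEX cat_price (category, price);

-- EXPLAIN should show a "range" access confined to category = 'x'.
EXPLAIN SELECT * FROM products
WHERE category = 'x' AND price < 100 AND price > 20;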
If you add a third criterion to the WHERE clause that isn't indexed, the same path will be followed, but the server will scan the identified rows for matches with the required value of the non-indexed column. Again, suboptimal, but often acceptable, depending on your business needs -- which play a role in determining the correct answer to this question.
Every index requires space, and resources, because every insert, update, and delete, requires that the server make the necessary changes -- right then -- to every index that is affected by the changes to the table.
Note also that if you have an index on (a,b) or (a,b,c), etc., then a separate index on (a) generally is considered a waste of space, since the index on (a,...anything-else...) will also serve as an index on (a).
Experimenting with EXPLAIN SELECT (which also supports INSERT/UPDATE/DELETE as of MySQL 5.6) and genuinely understanding its output is an indispensable tool for understanding how indexes work. MySQL 5.6 also supports optimizer tracing, which gives you detailed output of how the optimizer understood your query, the various plans it considered, the cost it estimated of each plan, and how it arrived at the decision of how to execute a particular query.
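For example (the trace syntax below is stock MySQL 5.6+; the query itself is just a placeholder):

SET optimizer_trace = 'enabled=on';

SELECT * FROM products
WHERE category = 'x' AND price < 100 AND price > 20;

-- The trace of the preceding statement: plans considered, estimated
-- costs, and the final choice.
SELECT * FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;

SET optimizer_trace = 'enabled=off';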
Related
I have a large data table containing details by date and across 3 independent criteria, with around 12 discrete values for each criterion. That is, each criterion field in the table is defined as a 12-value ENUM. Users pull summary data by date and any filtering across the three criteria, including none at all. To make a single-criterion lookup efficient, 3 separate indexes are required: (date,CriteriaA), (date,CriteriaB), (date,CriteriaC). 4 indexes if you want to look up against any of the 3: (date,A,B,C), (date,A,C), (date,B,C), (date,C).
In an attempt to make the lookup more efficient, I built a SET column containing all 36 values from the 3 criteria. All values across the criteria are unique and none is a subset of any other. I added an index on this set: (date, set_col). Queries against this table using a set lookup fail to take advantage of the index, however. None of FIND_IN_SET('Value',set_col), set_col LIKE '%Value%', or set_col & [pos. in set] triggers the index (according to EXPLAIN and the overall resultset return speed).
Is there a trick to indexing SET columns?
I tried queries like
SELECT date, COUNT(*)
FROM tbl
WHERE date BETWEEN [Start] AND [End]
  AND FIND_IN_SET('Value', set_col)
GROUP BY date
I would expect it to run nearly as fast as a lookup against an individual criterion column that has an index on it. But instead it runs only as fast as it does when just an index on date exists: the same number of rows is processed according to EXPLAIN.
It's not possible to index SET columns for arbitrary queries.
A SET type is basically a bitfield, with one bit set for each of the values defined for your set. You could search for a specific bit pattern in such a bitfield, or you could search for a range of specific bit patterns, or an inequality, etc. But searching for rows where one specific bit is set in the bitfield is not going to be indexable.
FIND_IN_SET() is really searching for a specific bit set in the bitfield. It will not use an index for this predicate. The best you can hope to do for optimization is to have an index that narrows down the examined rows based on the other search term on date. Then among the rows matching the date range, the FIND_IN_SET() will be applied row-by-row.
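Applied to the query from your question (keeping its [Start]/[End] placeholders), the best available plan looks like this sketch:

ALTER TABLE tbl ADD INDEX (date);

-- The index narrows the scan to the date range; FIND_IN_SET() is then
-- evaluated against each of those rows individually.
SELECT date, COUNT(*)
FROM tbl
WHERE date BETWEEN [Start] AND [End]
  AND FIND_IN_SET('Value', set_col)
GROUP BY date;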
It's the same problem as searching for substrings. The following predicates will not use an index on the column:
SELECT ... WHERE SUBSTRING(mytext, 5, 4) = 'word'
SELECT ... WHERE LOCATE('word', mytext) > 0
SELECT ... WHERE mytext LIKE '%word%'
A conventional index on the data would be alphabetized from the start of the string, not from some arbitrary point in the middle of the string. This is why fulltext indexing was created as an alternative to a simple B-tree index on the whole string value. But there's no special index type for bitfields.
I don't think the SET data type is helping in your case.
You should use your multi-column indexes with permutations of the columns.
Go back to 3 ENUMs. Then have
INDEX(A, date),
INDEX(B, date),
INDEX(C, date)
Those should significantly help with queries like
WHERE A = 'foo' AND date BETWEEN...
and somewhat help for
WHERE A = 'foo' AND date BETWEEN...
AND B = 'bar'
If you will also have queries without A/B/C, then add
INDEX(date)
Note: INDEX(date, A) is no better than INDEX(date) when using a "range". That is, I recommend against the indexes you mentioned.
FIND_IN_SET(), like virtually all other function calls, is not sargable. However, enum=const is sargable, since it is implemented as a simple integer comparison.
You did not mention
WHERE A IN ('x', 'y') AND ...
That is virtually un-indexable. However, my suggestions are better than nothing.
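Putting those suggestions together (table and column names assumed from your question):

ALTER TABLE tbl
  ADD INDEX a_date (CriteriaA, date),
  ADD INDEX b_date (CriteriaB, date),
  ADD INDEX c_date (CriteriaC, date),
  ADD INDEX d_only (date);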
Let's say I have a transactions table in a MySQL database, and I want to create a multi-column index on 3 columns: reference, kind, and status.
I have this query that I am trying to speed up:
Transaction.where(parent_ref: merchant_ref, kind: 'OFFER', status: 1), which performs the following SQL:
SELECT `merchant_transactions`.* FROM `merchant_transactions`
WHERE `merchant_transactions`.`parent_ref` = '1-0001'
AND `merchant_transactions`.`kind` = 'BATCH_BET'
AND `merchant_transactions`.`status` = 1
The parent_ref column can take a really wide variety of values so if I have 1M records in that table I will have 500K different references. Status can only take 6 different values and kind only 3.
What will be the best order for the columns in my index for optimal performance?
Does the spread of values in my columns have an impact? Intuitively, I would say that I would need to start with the column with the lowest spread of values. In that example I would thus do index(kind, status, reference).
Are there any other factors related to the values in my tables to take into account when figuring out the order of columns for my index ?
Okay, now that you've shared the query, we can see that you reference all three columns in your WHERE clause, all three predicates are doing equality comparisons, and the expression in the WHERE clause uses only AND operations.
There are no more exotic parts of the query like JOIN, GROUP BY, ORDER BY, DISTINCT, etc. to complicate the optimization of this query.
Given these conditions, my experience is that the order of columns hardly matters. If there's any difference, it's barely perceptible.
I'd put the nearly-unique column (parent_ref) first, based on the assumption that it's most selective and therefore narrows down the search most effectively. But I'm not sure it would make any noticeable difference either way.
In your example, each of the 3 columns is tested with =, and the tests are ANDed together. So build a 3-column composite index with those 3 columns. The order of the columns will not matter for this query. Contrary to what others may say, the "cardinality" of the individual columns does not matter in a composite INDEX.
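Using the column names from the query above, that composite could be created as, for example:

ALTER TABLE merchant_transactions
  ADD INDEX ref_kind_status (parent_ref, kind, status);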
See my indexing cookbook
Given a 10+ million row table with three columns (one, two, three) and an SQL query like SELECT * FROM table ORDER BY one, two, three LIMIT 1 -- do I really need to create a multi-column index using all three columns?
I know for sure that if one and two matches, there would be max 10 rows with distinct three.
Is the following enough for fast SELECTs?
CREATE INDEX MY_INDEX ON table (one, two);
With INDEX(one, two, three), the query will go straight down the BTree to the one (LIMIT 1) desired row.
With INDEX(one, two), the query will go straight down the BTree to the first such row, then scan forward the up-to-10 rows, save them to a tmp table, sort them (ORDER BY includes three) (probably done in memory), and deliver the first one. Although this sounds more complex it will not (in this example) be much slower.
It will not be a "table scan" ("ALL"), but perhaps a "range" scan. Use EXPLAIN SELECT ... to see.
If three is a bulky string, then the 3-col index will be bulkier; this has some impact on disk space and performance.
If you need only (one, two) for some other queries, then either index works reasonably well (barring the "bulky" comment).
If you do SELECT one, two, three FROM ..., the 3-part index will be better because it will be "covering". SELECT * won't have such a bonus.
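For instance, with INDEX(one, two, three), the following can be answered entirely from the index; EXPLAIN will report "Using index" in the Extra column:

-- Covering: every referenced column lives in the 3-part index.
SELECT one, two, three FROM table ORDER BY one, two, three LIMIT 1;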
Bottom line: Either index is "OK"; many other factors come into play, making it hard to say for sure what to do.
You might think MySQL is clever enough to only read at most the first 10 rows using the index and then sort these. Unfortunately, it isn't (because the optimizer doesn't regard the limit at this point). You can verify that by using explain select ..., it will show that MySQL will do a full table scan ("ALL").
The documentation describes conditions to be able to use an index to optimize order by:
The index can also be used even if the ORDER BY does not match the index exactly, as long as all unused portions of the index and all extra ORDER BY columns are constants in the WHERE clause.
Your third column does not satisfy this, so this query will not use this index (which does not mean that the index might not be useful for something else).
Since MySQL 5.6, there is however the so-called filesort priority queue optimization to accommodate the LIMIT: while MySQL will still read the whole table, it will not sort the whole table (which would be a time-consuming process), but will stop when it knows what the first row will be, which makes your query acceptably fast.
But you can rewrite your query to do exactly what you are thinking of:
SELECT * FROM
  (SELECT * FROM table ORDER BY one, two LIMIT 10) sub
ORDER BY one, two, three LIMIT 1;
This will read the first 10 rows using that index, and then just sort these. It will of course only work correctly if you are absolutely sure you will only have at most 10 rows.
A more general way to optimize your query, independent of knowing the maximum number of possible rows, would be e.g.
SELECT * FROM table
WHERE one = (SELECT MIN(one) FROM table)
ORDER BY one, two, three LIMIT 1;
This will use the index to reduce the number of rows that have to be read and filesorted by looking up the lowest value for one first (using the index) and only considering these rows. You can similarly include a condition for two.
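A sketch of extending the same trick to two (still using the placeholder table name):

-- With one and two pinned to their smallest combination, only the
-- handful of remaining rows needs to be sorted by three.
SELECT * FROM table
WHERE one = (SELECT MIN(one) FROM table)
  AND two = (SELECT MIN(two) FROM table
             WHERE one = (SELECT MIN(one) FROM table))
ORDER BY three LIMIT 1;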
Or you can simply use all three columns in your index (although depending on the size of your third column, it can make sense not to). These kinds of optimizations tend to catch up with you at some point: if you use the first method, for example, and in 2 years 11 rows become possible, you (or your successor) will have to remember that this implied condition is in your code.
For example, if I have a table with a city and a state column, what is the best way to use the index?
Obviously city will have the higher cardinality, so should I put that column first in the index, should I put state first, or doesn't it matter much?
It does not matter in this case:
INDEX cs (city, state),
INDEX sc (state, city)
WHERE city = 'Atlanta'
AND state = 'Georgia'
With either index, the drill-down in the BTree will be the same effort, and you will get to the one row just as fast.
(The order of clauses in WHERE does not matter.)
(If you are using a "range" test instead of = test, well, that's a different Question.)
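To sketch why the range case differs (this example goes beyond the original question; the table name t is illustrative):

-- With a range on city, INDEX sc (state, city) drills down on
-- state = 'Georgia' and scans only the matching city range inside it;
-- INDEX cs (city, state) must scan the whole city range and check
-- state for every entry.
SELECT * FROM t
WHERE state = 'Georgia' AND city LIKE 'At%';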
MySQL composite index lookups must take place in the order in which the columns are defined within the index. Since you want MySQL to be able to discriminate between records by performing as few comparisons as possible, all other things being equal you will benefit most from a composite index in which the columns are ordered from highest to lowest cardinality.
That is, assuming comparisons must eventually be performed against the highest cardinality column in order to discriminate records, why force comparisons to take place first against the lowest cardinality column when ultimately that may be unnecessary?
If I'm trying to increase the performance of a query that uses 4 different columns from a specific table, should I create 4 different indexes (one with each column individually) or should I create 1 index with all columns included?
One index with all 4 values is, in my experience, the fastest. If you use a WHERE clause, try to put the columns in an order that makes the index useful for it.
An index with all four columns; the columns used in the WHERE should go first, and those compared with = should go first of all.
Sometimes, giving priority to integer columns gives better results; YMMV.
So for example,
SELECT title, COUNT(*) FROM table WHERE class = 'post' AND topic_id = 17
AND date > ##BeginDate AND date < ##EndDate;
would have an index on (topic_id, class, date, title), in this order.
The "title" in the index is only used so that the DB may find the value of "title" for those records matching the query, without the extra access to the data table.
The more balanced the distribution of the records on the first fields, the better the results you will have. In this example, say 10% of the rows have topic_id = 17: you would discard the other 90% without ever having to run a string comparison against 'post' (not that string comparisons are particularly costly). Depending on the data, you might find it better to index date first and class later, or even use date as a MySQL PARTITION key.
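The index described above, using the example's placeholder names:

ALTER TABLE `table` ADD INDEX (topic_id, class, date, title);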
A single index is usually more effective than an index merge, so if you have a condition like f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4, a single index would be the right decision.
To achieve the best performance, list the index fields in descending order of cardinality (count of distinct values); this will help reduce the number of rows analyzed.
An index of fewer than 4 fields can be more effective, as it requires less memory.
http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/