Which column to put first in index? Higher or lower cardinality? - mysql

For example, if I have a table with a city and a state column, what is the best way to use the index?
Obviously city will have the highest cardinality, so should I put that column first in the index, should I put state first, or doesn't it matter much?

It does not matter in this case:
INDEX cs (city, state),
INDEX sc (state, city)
WHERE city = 'Atlanta'
AND state = 'Georgia'
With either index, the drill-down in the BTree will be the same effort, and you will get to the one row just as fast.
(The order of clauses in WHERE does not matter.)
(If you are using a "range" test instead of = test, well, that's a different Question.)
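To check this on your own data, a minimal sketch along these lines could be used. The table name `places` and the column types are assumptions made for illustration; only the index definitions and the WHERE clause come from the answer above.

```sql
-- Hypothetical table: the name and column types are assumptions.
CREATE TABLE places (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  city  VARCHAR(100) NOT NULL,
  state VARCHAR(100) NOT NULL,
  INDEX cs (city, state),
  INDEX sc (state, city)
);

-- Force each index in turn and compare the two plans.
EXPLAIN SELECT * FROM places FORCE INDEX (cs)
WHERE city = 'Atlanta' AND state = 'Georgia';

EXPLAIN SELECT * FROM places FORCE INDEX (sc)
WHERE city = 'Atlanta' AND state = 'Georgia';
```

If the answer holds, both plans should report the same access type and the same key_len, since both columns are matched by equality in either order.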

MySQL composite index lookups must take place in the order in which the columns are defined within the index. Since you want MySQL to be able to discriminate between records by performing as few comparisons as possible, with all other things being equal you will benefit most from a composite index in which the columns are ordered from highest to lowest cardinality.
That is, assuming comparisons must eventually be performed against the highest cardinality column in order to discriminate records, why force comparisons to take place first against the lowest cardinality column when ultimately that may be unnecessary?

Related

how to order column in Multi-columns index for best performance in Mysql

Let's say I have transactions table in a mysql database, I want to create a multi-column index on 3 columns reference, kind and status.
I have this request that I am trying to speed up:
Transaction.where(parent_ref: merchant_ref, kind: 'OFFER', status: 1), which performs the following SQL:
SELECT `merchant_transactions`.* FROM `merchant_transactions`
WHERE `merchant_transactions`.`parent_ref` = '1-0001'
AND `merchant_transactions`.`kind` = 'BATCH_BET'
AND `merchant_transactions`.`status` = 1
The parent_ref column can take a really wide variety of values so if I have 1M records in that table I will have 500K different references. Status can only take 6 different values and kind only 3.
What will be the best order for the columns in my index for optimal performance?
Does the spread of values in my columns have an impact? Intuitively I would say that I would need to start with the column with the lowest spread of values. In that example I would thus do index(kind, status, reference).
Are there any other factors related to the values in my tables to take into account when figuring out the order of columns for my index?
Okay, now that you've shared the query, we can see that you reference all three columns in your WHERE clause, all three predicates are doing equality comparisons, and the expression in the WHERE clause uses only AND operations.
There are no more exotic parts of the query like JOIN, GROUP BY, ORDER BY, DISTINCT, etc. to complicate the optimization of this query.
Given these conditions, my experience is that the order of columns hardly matters. If there's any difference, it's barely perceptible.
I'd put the column that is unique first, based on some assumption that it's most selective and therefore narrows down the search most effectively. But I'm not sure it would make any noticeable difference either way.
In your example, each of the 3 columns is tested with =, and they are ANDed together. So build a 3-column composite index with those 3 columns. The order of the columns will not matter for this query. Contrary to what others may say, "cardinality" of the individual columns does not matter in a composite INDEX.
See my indexing cookbook
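As a rough sketch of what both answers suggest, the composite index could be created and checked as follows. The column types and the index name `idx_ref_kind_status` are assumptions; the table here is only a stand-in for the asker's real one.

```sql
-- Stand-in for the asker's table; column types are assumptions.
CREATE TABLE merchant_transactions (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parent_ref VARCHAR(32) NOT NULL,
  kind       VARCHAR(16) NOT NULL,
  status     TINYINT     NOT NULL
);

-- One composite index covering all three equality predicates.
-- The index name is hypothetical.
ALTER TABLE merchant_transactions
  ADD INDEX idx_ref_kind_status (parent_ref, kind, status);

-- The plan should show the composite index being used for all three columns.
EXPLAIN SELECT * FROM merchant_transactions
WHERE parent_ref = '1-0001'
  AND kind = 'BATCH_BET'
  AND status = 1;
```

Reordering the columns in the index definition and rerunning the EXPLAIN is a cheap way to confirm on real data that the order makes no visible difference for this query.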

Performance cost for using primary key in order by

Let's say I have a table with a three column primary key. If I select all from that table without any order by clause they are, to my understanding, ordered by these columns. The first one, and within that by the second column and within that by the third.
Is there any additional cost, or perhaps performance gain, by explicitly adding a order by clause with the three columns in the order they are part of the primary key?
If I select all from that table without any order by clause they are, to my understanding, ordered by these columns
Your understanding is incorrect. Without an ORDER BY, the database engine may output the results in any order it chooses. In fact, if you were to add a clustered index on another field, it would likely change the order of the result.
Is there any additional cost, or perhaps performance gain
There will never be a performance gain from ordering (unless you're doing something that affects branch prediction). There will be no cost if the database was going to order that way anyway. However, there may be a correctness issue if you're not specifying the order.
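A small sketch of how one might verify the "no cost" claim, assuming an InnoDB table where the primary key is the clustered index. The table `t`, its columns, and the `payload` column are hypothetical.

```sql
-- Hypothetical table with a three-column primary key.
CREATE TABLE t (
  a       INT NOT NULL,
  b       INT NOT NULL,
  c       INT NOT NULL,
  payload VARCHAR(50),
  PRIMARY KEY (a, b, c)
) ENGINE=InnoDB;

-- State the order explicitly instead of relying on storage order.
SELECT * FROM t ORDER BY a, b, c;

-- Check the Extra column of the plan: if "Using filesort" is absent,
-- the server is reading rows in primary-key order and the ORDER BY
-- adds no separate sorting step.
EXPLAIN SELECT * FROM t ORDER BY a, b, c;
```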

How do I create one MySQL index for 2 SQL queries?

SELECT * FROM messages_messages WHERE (from_user_id=? AND to_user_id=?) OR (from_user_id=? AND to_user_id=?) ORDER BY created_at DESC
I have another query, which is this:
SELECT COUNT(*) FROM messages_messages WHERE from_user_id=? AND to_user_id=? AND read_at IS NULL
I want to index both of these queries, but I don't want to create 2 separate indexes.
Right now, I'm using 2 indexes:
[from_user_id, to_user_id, created_at]
[from_user_id, to_user_id, read_at]
I was wondering if I could do this with one index instead of 2?
These are the only 2 queries I have for this table.
The docs explain fairly completely how MySQL uses indices. In particular, its optimizer can use any left prefix of a multi-column index. Therefore, you could drop either of your two existing indices, and the other would be eligible for use in both queries, though it would be more selective / useful for one than for the other.
In principle, it could be more beneficial to keep your first index, provided that the created_at column was indexed in descending order. In practice, MySQL allows you to specify index column order, but (before MySQL 8.0, which added descending indexes) in fact implements only ascending order. Therefore, having created_at in your index probably doesn't help very much.
No, you need both indexes for these two queries if you want to optimize fully.
Once you reach the column used for either sorting or range comparison (IS [NOT] NULL counts as a range predicate for this purpose), you don't get any benefit from putting more columns in the index. In other words, your index can have:
Some columns that are used in equality predicates
One column that is used either in a range predicate, or to avoid a filesort -- but not both.
Extra columns used in neither searching nor sorting, but only for the sake of a covering index.
So you cannot make a four-column index that serves both queries.
The only way you can reduce this to one index, as @JohnBollinger says, is to make an index that optimizes for one query, and uses a subset of the index for the second query. But that won't work as well.
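For reference, the two-index setup described in the question might look like the sketch below. Column types, the `body` column, the index names, and the literal user ids standing in for the `?` placeholders are all assumptions.

```sql
-- Hypothetical schema; only the column names used in the queries
-- come from the question.
CREATE TABLE messages_messages (
  id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  from_user_id INT UNSIGNED NOT NULL,
  to_user_id   INT UNSIGNED NOT NULL,
  body         TEXT,
  created_at   DATETIME NOT NULL,
  read_at      DATETIME NULL,
  -- equality columns first, then the sort column
  INDEX idx_pair_created (from_user_id, to_user_id, created_at),
  -- equality columns first, then the IS NULL (range-like) column
  INDEX idx_pair_read    (from_user_id, to_user_id, read_at)
);

-- Conversation listing, sorted by time.
EXPLAIN SELECT * FROM messages_messages
WHERE (from_user_id = 1 AND to_user_id = 2)
   OR (from_user_id = 2 AND to_user_id = 1)
ORDER BY created_at DESC;

-- Unread count.
EXPLAIN SELECT COUNT(*) FROM messages_messages
WHERE from_user_id = 1 AND to_user_id = 2 AND read_at IS NULL;
```

Dropping one of the two indexes and rerunning both EXPLAIN statements shows how much each query loses when it has to fall back to the shared (from_user_id, to_user_id) prefix.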

How to optimize queries with user-defined parameters in the "where" clause?

I'm learning how to do proper query optimization using indexes. Let's say I have a huge table of products with all kinds of details for each product, e.g. price, category, number of purchases, review average, and more. When having multiple "where" conditions, I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.
However, I'm having difficulty figuring out how to scale it if there are so many queries for different purposes, and if users get to pick how to filter the products table. For example, a user can browse products WHERE rating > 4 AND purchases > 100, or it could be WHERE category = 'x' AND price < 100 AND price > 20. How would a proper multi-column index work if the columns chosen to be filtered are random?
I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.
You learned... not quite correctly.
The order of appearance in the WHERE clause is not meaningful, since the optimizer is free to evaluate the conditions in any logically valid way, subject of course to parentheses and logical operators (AND, OR, etc.) in the expression.
The order of columns in a multi-column index is important because, from left to right, as soon as a column is encountered in an index that is not mentioned in the WHERE clause, nothing more toward the right side of that index can be used.
If 3 columns, (a,b,c) are indexed, and the query is WHERE a = 1 AND c = 6 then the optimizer will only be able to use the left-most "a" column values in that index, not "c".
In that case, it would likely still choose to use the index to find rows where a = 1, and then scan all of those identified rows for only those with c = 6.
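The leftmost-prefix behaviour described above can be observed with a sketch like this one. The table `demo`, its `payload` column, and the index name `idx_abc` are hypothetical.

```sql
-- Hypothetical table with a three-column composite index.
CREATE TABLE demo (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a       INT NOT NULL,
  b       INT NOT NULL,
  c       INT NOT NULL,
  payload VARCHAR(50),
  INDEX idx_abc (a, b, c)
);

-- "b" is missing from the WHERE clause, so only the "a" part of the
-- index can be used to locate rows; "c" is checked afterwards.
EXPLAIN SELECT * FROM demo WHERE a = 1 AND c = 6;

-- All three columns are present, so the whole index can be used.
EXPLAIN SELECT * FROM demo WHERE a = 1 AND b = 2 AND c = 6;
```

Comparing the key_len column of the two plans is the usual way to see how many index columns were actually used for the lookup: it should be smaller for the first query than for the second.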
You can visualize a multi-column index as a multidimensional array. Without a known value or range you need to match for the first column (a), the values for the second column (b) are a meaningless, unordered jumble of data, because they're sorted in "groups of 'a'"... you'd have to iterate through every "a" to find the matching "b" values, and iterate through every "a,b" to find the matching "c" values. Since, in the example above, the "b" value is "anything" since it isn't specified, the ordering of the "c" values is meaningless and inaccessible for optimizing the query (although when every column within the SELECT list is available within a single index, the optimizer may scan the index instead of scanning the whole table, treating it as a "covering index," which is generally better than a full table scan but still suboptimal).
If your WHERE clause includes two columns both of which are indexed individually, the optimizer will check the index statistics and try to use the one that is most likely to produce the fewest matches... if "a" and "c" each have an individual index, and the index stats indicate that there are many values for "c" (high cardinality) but only a few values for "a" (low cardinality) the optimizer will typically use the index on "c" to find matching rows, then scan all of those rows for the requested values of "a".
Or, it may try to use the union of the two indexes, to precisely identify which rows satisfy both conditions.
Neither of these strategies is optimal, either, but still far better than a full table scan, so it does suggest that you should -- at a minimum -- have every independently-searchable column as the leftmost column in an index... that is, any column that can be queried on its own, with no other columns in the WHERE clause, and return a reasonably-sized result-set. If the result-set will not be reasonable in size, you may wish to restrict the user to searching on additional attributes, in the application.
In the case of WHERE category = 'x' AND price < 100 AND price > 20 the better index would be (category,price) and not (price,category) but this is not because of the ordering of expressions in the WHERE clause. It is because category is an equality test, but price is a range. WHERE price < 100 AND price > 20 AND category ='x' is equivalent, and (category,price) is still the appropriate index -- because indexes are sorted by the first column, then within each value for the first column, they are sorted by the values of the second column, then within each (first,second) pair they are sorted by the values in the third column, ad infinitum... so with (category,price) the server goes directly to all of the rows for category = 'x' and within that grouping in the index, the referenced rows are already sorted by price, so it only has to select the range of price within the category 'x' of the index. Optimal. The (price,category) index requires checking all the prices in the range, and then checking the category value for all of those. The index could still be used, but depending on the criteria, the optimizer could still opt to scan the whole table.
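A minimal sketch of the equality-then-range index for this case follows. The `products` schema, its column types, and the index name are assumptions based on the columns mentioned in the question.

```sql
-- Hypothetical products table; types are assumptions.
CREATE TABLE products (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category  VARCHAR(50)   NOT NULL,
  price     DECIMAL(10,2) NOT NULL,
  rating    DECIMAL(3,2)  NOT NULL,
  purchases INT UNSIGNED  NOT NULL,
  -- equality column first, range column second
  INDEX idx_category_price (category, price)
);

-- Both spellings of the WHERE clause are equivalent to the optimizer,
-- so both should produce the same plan using idx_category_price.
EXPLAIN SELECT * FROM products
WHERE category = 'x' AND price < 100 AND price > 20;

EXPLAIN SELECT * FROM products
WHERE price < 100 AND price > 20 AND category = 'x';
```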
If you add a third criterion to the WHERE clause that isn't indexed, the same path will be followed, but the server will scan the identified rows for matches with the required value of the non-indexed column. Again, suboptimal, but often acceptable, depending on your business needs -- which play a role in determining the correct answer to this question.
Every index requires space, and resources, because every insert, update, and delete, requires that the server make the necessary changes -- right then -- to every index that is affected by the changes to the table.
Note also that if you have an index on (a,b) or (a,b,c), etc., then a separate index on (a) generally is considered a waste of space, since the index on (a,...anything-else...) will also serve as an index on (a).
Experimenting with EXPLAIN SELECT (which also supports INSERT/UPDATE/DELETE as of MySQL 5.6) and genuinely understanding its output is an indispensable tool for understanding how indexes work. MySQL 5.6 also supports optimizer tracing, which gives you detailed output of how the optimizer understood your query, the various plans it considered, the cost it estimated of each plan, and how it arrived at the decision of how to execute a particular query.
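As a starting point, the tools mentioned above can be tried with something like the following. The table `trace_demo` and its indexes are hypothetical; the `optimizer_trace` session variable and the `INFORMATION_SCHEMA.OPTIMIZER_TRACE` table are the standard MySQL 5.6+ interface.

```sql
-- Hypothetical table with two single-column indexes to choose from.
CREATE TABLE trace_demo (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a  INT NOT NULL,
  c  INT NOT NULL,
  INDEX idx_a (a),
  INDEX idx_c (c)
);

-- EXPLAIN works on data-modifying statements as of MySQL 5.6.
EXPLAIN UPDATE trace_demo SET c = 7 WHERE a = 1;

-- Turn tracing on for this session, run the query of interest,
-- then read back the trace the optimizer recorded for it.
SET optimizer_trace = 'enabled=on';

SELECT * FROM trace_demo WHERE a = 1 AND c = 6;

SELECT QUERY, TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;

SET optimizer_trace = 'enabled=off';
```

The TRACE column holds a JSON document; the sections on considered access paths and their estimated costs are usually the most informative part when two indexes compete.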

(Why) Can't MySQL use index in such cases?

1 - PRIMARY used in a secondary index, e.g. secondary index on (PRIMARY,column1)
2 - I'm aware MySQL cannot continue using the rest of an index as soon as one part was used for a range scan, however: IN (...,...,...) is not considered a range, is it? Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently than BETWEEN with regard to index use.
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So, the outer SELECT uses timestamp_lastonline for sorting, the inner either PK to connect with the outer or birthdate for filtering.
What other options rather than this query are there if MySQL cannot use index on a range scan and for sorting?
The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone.
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
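To compare the two composite-index candidates mentioned above, a sketch like this could be used. The column types and both index names are assumptions; the table here is a stand-in for the asker's `user_d1`.

```sql
-- Stand-in for the asker's table; column types are assumptions.
CREATE TABLE user_d1 (
  id                   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  birthdate            DATE     NOT NULL,
  timestamp_lastonline DATETIME NOT NULL,
  INDEX idx_birth_online (birthdate, timestamp_lastonline),
  INDEX idx_online_birth (timestamp_lastonline, birthdate)
);

-- Filter-first index: narrows by birthdate, then has to sort the matches.
EXPLAIN SELECT * FROM user_d1 FORCE INDEX (idx_birth_online)
WHERE birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY timestamp_lastonline DESC
LIMIT 20;

-- Sort-first index: reads in last-online order and filters as it goes,
-- stopping once 20 matching rows have been found.
EXPLAIN SELECT * FROM user_d1 FORCE INDEX (idx_online_birth)
WHERE birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY timestamp_lastonline DESC
LIMIT 20;
```

Which one wins depends on the data: the rows estimate and the presence of "Using filesort" in Extra are the two things worth comparing between the plans.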
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.
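One way to see how a particular server version treats the two forms is to compare their plans directly. This is only a sketch: the table `in_vs_between` and its columns are hypothetical.

```sql
-- Hypothetical table with a single indexed column to test against.
CREATE TABLE in_vs_between (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  n   INT NOT NULL,
  pad VARCHAR(50),
  INDEX idx_n (n)
);

-- A list of discrete values.
EXPLAIN SELECT * FROM in_vs_between WHERE n IN (10, 20, 30);

-- A contiguous range.
EXPLAIN SELECT * FROM in_vs_between WHERE n BETWEEN 10 AND 30;
```

The type, key, and rows columns of the two plans show whether the index is used in each case and how many rows the optimizer expects to examine.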
2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN only needs to find the starting point and then reads on from there. With IN, it looks up each value in the index separately, so it performs as many lookups as there are values, compared with BETWEEN's single lookup.
Yes, @Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than and less-than, and therefore absolutely loves indexes.
I honestly have no idea what you are talking about, but it does seem you are asking a good question; I just have no notion what it is :-). Are you saying that a primary key cannot contain a second index? Because it absolutely can. The primary key never needs to be indexed because it is ALWAYS indexed automatically, so if you are getting an error/warning (I assume you are?) about supplementary indices, then it's not the second or third index causing it; it's the PRIMARY KEY not needing it, and your mentioning it is probably the error. Having said that, I have no idea what question you asked; this is my answer to my best guess as to your actual question.