MySQL: required indexes for multi-column ordering

I have a table with 10+ million rows and three columns (one, two, three), and a query like SELECT * FROM table ORDER BY one, two, three LIMIT 1. Do I really need to create a multi-column index on all three columns?
I know for sure that when one and two match, there are at most 10 rows with distinct values of three.
Is this index enough for fast SELECTs?
CREATE INDEX MY_INDEX ON table (one, two);

With INDEX(one, two, three), the query will go straight down the BTree to the one (LIMIT 1) desired row.
With INDEX(one, two), the query will go straight down the BTree to the first such row, then scan forward the up-to-10 rows, save them to a tmp table, sort them (ORDER BY includes three) (probably done in memory), and deliver the first one. Although this sounds more complex it will not (in this example) be much slower.
It will not be a "table scan" ("ALL"), but perhaps a "range" scan. Use EXPLAIN SELECT ... to see.
If three is a bulky string, then the 3-col index will be bulkier; this has some impact on disk space and performance.
If you need only (one, two) for some other queries, then either index works reasonably well (barring the "bulky" comment).
If you do SELECT one, two, three FROM ..., the 3-part index will be better because it will be "covering". SELECT * won't have such a bonus.
Bottom line: either index is "OK". Many other factors come into play, which makes it hard to say for sure which one to choose.
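To compare the two candidates on real data, something like the following sketch can help. The table name t and the index names are placeholders (the question only calls the table "table", which is a reserved word in MySQL), so adjust them to your schema.

```sql
-- Placeholder table name t; columns one, two, three are from the question.
CREATE INDEX idx_one_two ON t (one, two);
CREATE INDEX idx_one_two_three ON t (one, two, three);

-- Check the type, key and Extra columns of the output.
-- "Using filesort" in Extra means MySQL sorts rows after reading them.
EXPLAIN SELECT * FROM t ORDER BY one, two, three LIMIT 1;

-- Selecting only indexed columns lets the 3-column index act as a covering index;
-- Extra then shows "Using index".
EXPLAIN SELECT one, two, three FROM t ORDER BY one, two, three LIMIT 1;
```

Drop whichever index EXPLAIN and your timings show you do not need, since each extra index also costs disk space and write time.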

You might think MySQL is clever enough to only read at most the first 10 rows using the index and then sort these. Unfortunately, it isn't (because the optimizer doesn't regard the limit at this point). You can verify that by using explain select ..., it will show that MySQL will do a full table scan ("ALL").
The documentation describes conditions to be able to use an index to optimize order by:
The index can also be used even if the ORDER BY does not match the index exactly, as long as all unused portions of the index and all extra ORDER BY columns are constants in the WHERE clause.
Your third column does not satisfy this, so this query will not use this index (which does not mean the index cannot be useful for something else).
Since MySQL 5.6, however, there is the so-called filesort priority queue optimization to accommodate the limit: MySQL will still read the whole table, but it will not sort the whole table (which would be time-consuming). It only keeps track of the rows needed to satisfy the limit, which makes your query acceptably fast.
But you can rewrite your query to do exactly what you are thinking of:
SELECT * FROM
(select * from table ORDER BY one, two LIMIT 10) sub
order by one, two, three limit 1;
This will read the first 10 rows using that index, and then just sort these. It will of course only work correctly if you are absolutely sure you will only have at most 10 rows.
A more general way to optimize your query independently from knowing the maximum number of possible rows would be e.g.
SELECT * FROM table
where one = (select min(one) from table)
order by one, two, three limit 1;
This will use the index to reduce the number of rows that have to be read and filesorted by looking up the lowest value for one first (using the index) and only considering these rows. You can similarly include a condition for two.
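A sketch of that extension to the second column might look like this. It assumes a placeholder table name t, an index starting with (one, two), and that one and two are declared NOT NULL (MIN() ignores NULLs, while ORDER BY in MySQL sorts them first, so the rewrite would not be equivalent otherwise).

```sql
-- Placeholder table name t; assumes one and two are NOT NULL.
SELECT *
FROM t
WHERE one = (SELECT MIN(one) FROM t)
  AND two = (SELECT MIN(two)
             FROM t
             WHERE one = (SELECT MIN(one) FROM t))
ORDER BY one, two, three
LIMIT 1;
```

Only the rows sharing the smallest (one, two) pair are left to sort by three, which in the question's case is at most 10 rows.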
Or you can simply use all three columns in your index (although, depending on the size of your third column, it can make sense not to do this). These kinds of optimizations tend to catch up with you at some point. If you use the first method, for example, and in 2 years 11 rows become possible, you (or your successor) will have to remember that you have this implied condition in your code.

Related

mysql query speed at table which has 1.5million rows

It takes around 5 seconds to get the result of a query on a table with 1.5 million rows. The query is "select * from table where code=x".
Is there a setting to increase speed? Or should I switch from MySQL to another database?

You could index the code column. Note that the trade off is that inserting new rows or updating the code column on existing rows will be slowed down a bit since the index also needs to be updated. In any event, you should benchmark the improvement to make sure it's worth it.
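As a rough sketch, with my_table and idx_code as placeholder names and 'x' standing in for the real value:

```sql
-- Placeholder names; only the code column comes from the question.
CREATE INDEX idx_code ON my_table (code);

-- After creating the index, the key column of EXPLAIN should show idx_code
-- instead of NULL (which would indicate no index is used).
EXPLAIN SELECT * FROM my_table WHERE code = 'x';
```

Timing the query before and after the CREATE INDEX is the simplest benchmark.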

WHERE code=x -- needs INDEX(code)
SELECT * when many of the columns are bulky: Large columns are stored "off-record". Hence they take longer to fetch. So, explicitly list the columns you really need, hoping to leave out some of the bulky columns.
When a GROUP BY or LIMIT is involved, it is sometimes best to do
SELECT lots of columns
FROM ( SELECT id FROM t WHERE ... group-by or limit ) AS x
JOIN t AS y USING(id)
etc.
That is, start by finding just the ids as simply as possible, then JOIN back to the original table and other table(s). (This is not the case you presented, but I worry that you over-simplified it.)
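A concrete version of that pattern could look like the following. The table t, its primary key id, the code filter and the limit of 20 are all made up for illustration.

```sql
-- Hypothetical schema: t has primary key id and an index on code.
SELECT y.*
FROM (
    -- Inner query touches only the narrow index, not the bulky columns.
    SELECT id
    FROM t
    WHERE code = 'x'
    ORDER BY id
    LIMIT 20
) AS x
JOIN t AS y USING (id)
ORDER BY y.id;  -- repeat the ordering; the join does not guarantee it
```

The bulky columns are then fetched only for the 20 ids that survive the LIMIT.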

how to order column in Multi-columns index for best performance in Mysql

Let's say I have a transactions table in a MySQL database, and I want to create a multi-column index on three columns: reference, kind and status.
I have this request that I am trying to speed up:
Transaction.where(parent_ref: merchant_ref, kind: 'OFFER', status: 1), which performs the following SQL:
SELECT `merchant_transactions`.* FROM `merchant_transactions`
WHERE `merchant_transactions`.`parent_ref` = '1-0001'
AND `merchant_transactions`.`kind` = 'BATCH_BET'
AND `merchant_transactions`.`status` = 1
The parent_ref column can take a really wide variety of values, so if I have 1M records in that table I will have 500K different references. Status can only take 6 different values and kind only 3.
What is the best order for the columns in my index for optimal performance?
Does the spread of values in my columns have an impact? Intuitively I would say that I need to start with the column with the lowest spread of values. In that example I would thus do index(kind, status, reference).
Are there any other factors related to the values in my tables to take into account when figuring out the order of columns for my index?

Okay, now that you've shared the query, we can see that you reference all three columns in your WHERE clause, all three predicates are doing equality comparisons, and the expression in the WHERE clause uses only AND operations.
There are no more exotic parts of the query like JOIN, GROUP BY, ORDER BY, DISTINCT, etc. to complicate the optimization of this query.
Given these conditions, my experience is that the order of columns hardly matters. If there's any difference, it's barely perceptible.
I'd put the column that is unique first, based on some assumption that it's most selective and therefore narrows down the search most effectively. But I'm not sure it would make any noticeable difference either way.

In your example, each of the 3 columns is tested with =, and they are ANDed together. So build a 3-column composite index with those 3 columns. The order of the columns will not matter for this query. Contrary to what others may say, "cardinality" of the individual columns does not matter in a composite INDEX.
See my indexing cookbook
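For the query in the question, such a composite index could be declared as below. The index name is a placeholder, and the column order shown is just one of the equivalent choices for this all-equality query.

```sql
-- Table and column names are from the question; the index name is made up.
CREATE INDEX idx_parent_kind_status
    ON merchant_transactions (parent_ref, kind, status);

-- The key column of EXPLAIN should then show idx_parent_kind_status.
EXPLAIN
SELECT merchant_transactions.*
FROM merchant_transactions
WHERE parent_ref = '1-0001'
  AND kind = 'BATCH_BET'
  AND status = 1;
```

If other queries filter on only some of these columns, put the columns those queries use first, since MySQL can use a left prefix of the index.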

MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value falls between the values of columns 1 and 2, in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.

If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
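Spelled out, that kludge might look like the sketch below. It is only valid under the assumption that the ranges never overlap, and 12345 is a stand-in for Test_Value.

```sql
-- Assumes non-overlapping ranges and an index on Col_1.
SELECT MyString
FROM (
    -- Jump to the single range whose start is closest below the test value.
    SELECT MyString, Col_2
    FROM MyTable
    WHERE Col_1 <= 12345
    ORDER BY Col_1 DESC
    LIMIT 1
) AS candidate
-- Discard the candidate if the test value lies past the end of that range.
WHERE Col_2 >= 12345;
```

The inner query reads one index entry instead of scanning half the table; the outer filter turns a miss into an empty result.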

Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
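Put together, the index and the bounded query could be sketched as follows. The numbers are invented: 5000 plays the role of v and 100 the role of maxrange, and the index assumes s is a reasonably short column that can be indexed in full.

```sql
-- Short names from this answer; index name and literal values are illustrative.
CREATE INDEX idx_low_high_s ON t (low, high, s);

SELECT s
FROM t
WHERE low BETWEEN 5000 - 100 AND 5000   -- v - maxrange .. v
  AND high >= 5000;                     -- v <= t.high
```

The BETWEEN clause already implies low <= v, so that condition does not need to be repeated.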
Read this. http://use-the-index-luke.com/

Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved index to select the correct partition table on which to perform the actual search.
(I guess it would be more elegant to use dynamic SQL, but I will take care of that later; for now I just wanted to test the approach.)
The result is that I get the response in well below a second (~0.08) and it is uniform regardless of the number being used for the test. This, by the way, was not the case with the previous approach: there, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

How do I create one MySQL index for 2 SQL queries?

SELECT * FROM messages_messages WHERE (from_user_id=? AND to_user_id=?) OR (from_user_id=? AND to_user_id=?) ORDER BY created_at DESC
I have another query, which is this:
SELECT COUNT(*) FROM messages_messages WHERE from_user_id=? AND to_user_id=? AND read_at IS NULL
I want to index both of these queries, but I don't want to create 2 separate indexes.
Right now, I'm using 2 indexes:
[from_user_id, to_user_id, created_at]
[from_user_id, to_user_id, read_at]
I was wondering if I could do this with one index instead of 2?
These are the only 2 queries I have for this table.

The docs explain fairly completely how MySQL uses indices. In particular, its optimizer can use any left prefix of a multi-column index. Therefore, you could drop either of your two existing indices, and the other would be eligible for use in both queries, though it would be more selective / useful for one than for the other.
In principle, it could be more beneficial to keep your first index, provided that the created_at column was indexed in descending order. In practice, MySQL versions before 8.0 allow you to specify index column order, but in fact implement only ascending order. Therefore, on those versions, having created_at in your index probably doesn't help very much.

No, you need both indexes for these two queries if you want to optimize fully.
Once you reach the column used for either sorting or range comparison (IS [NOT] NULL counts as a range predicate for this purpose), you don't get any benefit from putting more columns in the index. In other words, your index can have:
Some columns that are used in equality predicates
One column that is used either in a range predicate, or to avoid a filesort -- but not both.
Extra columns used in neither searching nor sorting, but only for the sake of a covering index.
So you cannot make a four-column index that serves both queries.
The only way you can reduce this to one index, as @JohnBollinger says, is to make an index that optimizes for one query, and uses a subset of the index for the second query. But that won't work as well.
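For reference, the two indexes from the question would be declared roughly like this (the index names are placeholders):

```sql
-- Equality columns first, then the one column used for sorting.
CREATE INDEX idx_from_to_created
    ON messages_messages (from_user_id, to_user_id, created_at);

-- Equality columns first, then the one column used in the IS NULL test.
CREATE INDEX idx_from_to_read
    ON messages_messages (from_user_id, to_user_id, read_at);
```

Both follow the pattern above: the shared equality columns lead, and each index ends with the single sort or range column its query needs.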

Mysql - sequence of multiple column indexes

Say if I have a query that look like this:
SELECT * FROM table WHERE category='5' and status='1' LIMIT 5
The table has 1 million rows.
To speed things up, I create index (status, category), i.e. multiple column index.
There are 600 categories but only 2 status (1 or 0). I'm wondering if there is any difference in performance if I create index (category, status) instead of index (status, category).

Status first.
The trick is that if you only need to query by category, you still can:
SELECT * from table where status in (1,0) and category = 'whatever'
and still get index support. Of course, if your queries all use both columns, it's the same either way. But with status first, queries on status alone are much better served, and queries on category alone are only slightly worse, if at all.
If you are looking at a lot of inserts as well, you want to minimize the number of indices, so this is your best bet rather than having multiple indices.

There shouldn't be any difference. The selectivity of the index is identical whether you order it (category, status) or (status, category).
By the way, using LIMIT is often meaningless without also using ORDER BY. The order of rows returned by an SQL query is arbitrary unless you specify an order.
Re your comment: Yes, it's common to need five random rows, but arbitrary is not the same as random. It's not common to need five arbitrary rows.
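A sketch of the index plus a deterministic version of the query follows. The table name items and the id column are assumptions, since the question does not name the table or its primary key.

```sql
-- Hypothetical table items with primary key id; category and status are from the question.
CREATE INDEX idx_category_status ON items (category, status);

-- ORDER BY makes the five returned rows well defined instead of arbitrary.
SELECT *
FROM items
WHERE category = '5'
  AND status = '1'
ORDER BY id
LIMIT 5;
```

Swapping the two columns in the index definition gives the (status, category) variant discussed above.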
