Optimizing a SELECT statement with ORDER BY by using indices

Optimizing a SELECT statement with ORDER BY by using indices - mysql

I have these queries:
1st query:
SELECT (..) FROM db WHERE A = const AND B > const AND C >= const ORDER BY B DESC LIMIT const
2nd query (different db):
SELECT (...) FROM db' WHERE A' = const ORDER BY X' DESC LIMIT const
Question about 1st query:
Is it sufficient to have a multiple row index (A, B, C) or do I need an additional single row index (B) (or a different one) because of the ORDER BYstatement?
Question about 2nd query: Do I need a multiple row index (A', X') or two single row indices (A'), (X') to make us of them in this query?

It is an important thing to know that MySQL will use at most one index (for searching, filtering and ordering) per table and subquery (so basically per row in explain), so you can use only one index here.
For your first query, an index (A,B) will allow MySQL to do a range scan and use the order. If you use (A,B,C), the column C cannot be used in the range condition (because B is already a range), but MySQL will save the time to read the actual tabledata to get the value for C to check the last condition. So (A,B,C) is in general the fastest choice here.
"In general", because you can of course have a data distribution where another index would be best: If you e.g. have only one or two rows that match C >= const and 10M+ rows with A = const, using an index on just C would be fastest. And if C is a very big column (e.g. varchar(700)), it could blow up the index and slow it down. But to estimate such exceptions would require deeper knowledge of your data.
For your second query, (A', X') will be the best choice. If you have the two indexes (A'), (X'), MySQL will in most cases (unless A' is unique, but then you wouldn't need an order by anyway) use the index on X' and hope it will find matching rows for A' soon. This will sometimes be unexpectedly and painfully slow if you only have some rows that match A' = const (because it has to jump back and forth in the table (that is ordered by the primary key) in the order of X' to find rows that match the condition for A').
You might get the same problem for your first query if you have the indexes (A) and (B) (but not (A,B) or (A,B,C)) there: MySQL will probably use (B) instead of (A) (but check the explain to make sure). Even if you just add one index now, this can e.g. happen when you add the index (B) to optimize a different query next week and forgot about this query, so I'd suggest to stick with (at least) (A,B)

Related

Issues with a MySQL Index due to a specific part

The query is
SELECT row
FROM `table`
USE INDEX(`indexName`)
WHERE row1 = '0'
AND row2 = '0'
AND row3 >= row4
AND (row5 = '0' OR row5 LIKE 'value')
I have the following MySQL Query which I've created a index for using;
CREATE INDEX indexName ON `table` (row1, row2, row3, row5);
However, the performance is not really good. It's extracting about 17,000+ rows out of a 5.9+ million row table in anywhere from 6-12 seconds.
It seems like the bottleneck is the row3 >= row4 - because without that part in the code it runs in 0.6-0.7 seconds.
(from Comment)
The row (placeholder column name) is actually the id (primary key, index) column in the table, which is the result set I'm outputting later on. I'm outputting an array of IDs that are matching the parameters in my query, and then selecting a random ID from that array to gather data through the final query on a specific row. This was done as a workaround for rand(). Any adjustments needed based on that knowledge?

17K rows is not a tiny result set. Large result sets often take time just because of the overhead of delivering the data from the MySQL server to the program requesting them.
The contents of the 'value' you use in row5 LIKE 'value' matter a great deal to query performance. If 'value' starts with a wildcard character like % your query will be slow.
That being said, you need a so-called covering index. You've tried to create one with the index you created. It's close but not perfect.
Your query filters on equality to constant values on row1, row2, and row5, so those columns should come first in your index. The query planner can random-access your index to the first matching entry, and then sequentially scan the index until it gets to the last matching entry. That's as fast as it gets.
Then you want to examine row3 and row4 (to compare them). Those columns should come next in the index. Finally, if your query's SELECT clause mentions a subset of the columns in your table you should put the rest of those columns in the index. So, based on the query in your question, your index should be
CREATE INDEX indexName ON `table` (row1, row2, row5, row3, row4, row);
The query planner will be able to satisfy the entire query by scanning through a subset of the index, using a so-called index range scan. That should be decently fast.
Pro tip: don't force the query planner's hand with USE INDEX(). Instead, structure your indexes to handle your queries efficiently.

An index can't be used to compare two columns in the same table (at best, it could be used for an index scan rather than a table scan if all output fields are contained in the index), so there basically is no "correct" way to do this.
If you have control over the structure AND the processes the fill the table, you could add a calculated field that holds the difference between the two fields. Then add that field to the index and adjust your query to use that field instead of the other 2.
It ain't pretty and doesn't offer a lot of flexibility (eg. if you want to compare another field, you need to add it as well etc), but it does get the job done.

(This is an adaptation of http://mysql.rjweb.org/doc.php/random )
Let's actually fold the randomization into the query. This will eliminate gathering a bunch of ids, processing them, and then reaching back into the table. It will also avoid the need for an extra index.
Find min and max id values.
Pick a random id between min and max.
Scan forward, looking for the first row with col1...col5 matching the criteria.
Something like...
SELECT b.* -- should replace with actual list of columns
FROM
( SELECT id
FROM tbl
WHERE id >= ( SELECT MIN(id) +
( MAX(id) - MIN(id)
- 22 -- somewhat avoids running off end
) * RAND()
FROM tbl )
AND col1 = 0 ... -- your various criteria
ORDER BY id
LIMIT 1
) AS a
JOIN tbl AS b USING(id);
Pros/cons:
Probably faster than anything else you can devise.
If there RAND() hits too late in the table, it will return nothing. In this (rare) case, run the query again, but starting at 0.
Big gaps in id will lead to a bias in which id is returned. (The link above discusses some kludges to handle such.)

MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value is within those of column 1 and two, in order words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.

If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.

Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
Read this. http://use-the-index-luke.com/

Well... I found a suitable solution for me (not sure your guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ,...).
I then user the following to get the index of the table that is supposed to contain the searched for range like:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table from perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for test. This, by the way, was not the case with the previous approach: There, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete).
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

SQL use a two column index or two single indexes ( x = ? and y > ?)

This question might seem very similar to previous posted ones but slightly different.
When your queries always contains both columns I know it is a good practice to create an index for both column but for this scenario the queries will be like checking for a column and then finding all records greater than a given value from the second column, e.g.
SELECT * FROM Data WHERE User_Id = 'Johndoe' AND Last_Query > 1001
So should I still use a single index for both Columns (User_id, Last_Query in the above example) or should I make different indices for each columns?

Multicolumn indexes are designed to handle exactly this case. When the first column in an index has a value that is known (User_Id = 'Johndoe' in your example), all the rows with that value are ordered by the second column in the index (Last_Query in your example). This allows for efficient seeking to where the range on the second column starts.
In general, this works when any prefix of the index columns have "=" filters on them followed by a range filter (> or <) on the next index column. i.e, an index on (col1, ..., colN+1) and a WHERE filter such as:
WHERE col1=a AND col2=b ... AND colN=c AND colN+1 > d

MySQL multiple index optimization

I have a question about optimizing sql queries with multiple index.
Imagine I have a table "TEST" with fields "A, B, C, D, E, F".
In my code (php), I use the following "WHERE" query :
Select (..) from TEST WHERE a = 'x' and B = 'y'
Select (..) from TEST WHERE a = 'x' and B = 'y' and F = 'z'
Select (..) from TEST WHERE a = 'x' and B = 'y' and (D = 'w' or F = 'z')
what is the best approach to get the best speed when running queries?
3 multiple Index like (A, B), (A, B, F) and (A, B, D, F)?
Or A single multiple index (A, B, D, F)?
I would tend to say that the 3 index would be best even if the space of index in the database will be larger.
In my problem, I search the best execution time not the space.
The database being of a reasonable size.

Multiple-column indexes:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
In other words, it is a waste of space an computing power to define an index that covers the same first N columns as another index and in the same order.

The best way to exam the index is to practice. Use "explain" in mysql, it will give you a query plan and tell you which index to use. In addition, it will give you an estimate time for your query to run. Here is an example
explain select * from TEST WHERE a = 'x' and B = 'y'

It is hard to give definitive answers without experiments.
BUT: ordinarily an index like (A,B,D) is considered to be superfluous if you have an index on (A,B,D,F). So, in my opinion you only need the one multicolumn index.
There is one other consideration. If your table has a lot of columns and a lot of rows and your SELECT list has a small subset of those columns, you might consider including those columns in your index. For example, if your query says SELECT D,F,G,H FROM ... you should try creating an index on
(A,B,D,F,G,H)
as it will allow the query to be satisfied from the index without having to refer back to the rows of the table. This can sometimes help performance a great deal.

It's hard to explain well, but generally you should use as few indexes as you can get away with, using as many columns of the common queries as you can, with the most commonly queried columns first.
In your example WHERE clauses, A and B are always included. These should thus be part of an index. If A is more commonly used in a search then list that first, if B is more commonly used then list that first. MySQL can partially use the index as long as each column (seen from the left) in the index is used in the WHERE clause. So if you have an index ( A, B, C ) then WHERE ( A = .. AND B = .. AND Z = .. ) can still use that index to narrow down the search. If you have a WHERE ( B = .. AND Z = .. ) clause then A isn't part of the search condition and it can't be used for that index.
You want the single multiple column index A, B, D, F OR A, B, F, D (only one of these at a time can be used), but which depends mostly on the number of times D or F are queried for, and the distribution of data. Say if most of the values in D are 0 but one in a hundred values are 1 then that column would have a poor key distribution and thus putting the index on that column wouldn't be all that useful.

The optimiser can use a composite index for where conditions that follow the order of the index with no gaps:
An index on (A,B,F) will cover the first two queries.
The last query is a bit trickier, because of the OR. I think only the A and B conditions will be covered by (A,B,F) but using a separate index (D) or index (F) may speed up the query depending on the cardinality of the rows.
I think an index on (A,B,D,F) can only be used for the A and B conditions on all three queries. Not the F condition on query two, because the D value in the index can be anything and not the D and F conditions because of the OR.
You may have to add hints to the query to get the optimiser to use the best index and you can see which indexes are being used by running an EXPLAIN ... on the query.
Also, adding indexes slows down DML statements and can cause locking issues, so it's best to avoid over-indexing where possible.

MySQL Secondary Indexes

If I'm trying to increase the performance of a query that uses 4 different columns from a specific table, should I create 4 different indexes (one with each column individually) or should I create 1 index with all columns included?

One index with all 4 values is by my experience the fastest. If you use a where, try to put the columns in an order that makes it useful for the where.

An index with all four columns; the columns used in the WHERE should go first, and those for which you do == compare should go first of all.
Sometimes, giving priority to integer columns gives better results; YMMV.
So for example,
SELECT title, count(*) FROM table WHERE class = 'post' AND topic_id = 17
AND date > ##BeginDate and date < ##EndDate;
would have an index on: topic_id, post, date, and title, in this order.
The "title" in the index is only used so that the DB may find the value of "title" for those records matching the query, without the extra access to the data table.
The more balanced the distribution of the records on the first fields, the best results you will have (in this example, say 10% of the rows have topic_id = 17, you would discard the other 90% without ever having to run a string comparison with 'post' -- not that string comparisons are particularly costly. Depending on the data, you might find it better to index date first and post later, or even use date first as a MySQL PARTITION.

Single index is usually more effective than index merge, so if you have condition like f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4 single index would right decision.
To achieve best performance enumerate index fields in descending order of cardinality (count of distinct values), this will help to reduce analyzed rows count.
Index of less than 4 fields can be more effective, as it requires less memory.
http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008