Optimize sql query by cover index - mysql

Query:
SELECT a, b, c FROM table WHERE a = .. and b like 'example%' and c = '..'
Does this query use index (a,b,c) or (a,b)?

For a covering index to even begin to help this query, it needs to be
a,c,b
That's because the query wants a specific single value for a and c and a range of values (LIKE 'string%') for b.
The compound BTREE index gets random-accessed to the specific a,c value and the starting b value. It scans (in a so-called tight scan) to the last eligible b value.
Note that
c,a,b
will also work.
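As a rough illustration of why the equality columns go first, here is a small Python sketch that models a composite index on (a, c, b) as a sorted list of tuples. This is only a toy model, not how MySQL stores or reads its indexes; the sample rows and the helper name prefix_range are made up for the example.

```python
from bisect import bisect_left

# Toy model: an index on (a, c, b) is a list of (a, c, b) tuples kept in sorted order.
rows = [
    (1, "x", "example1"),
    (1, "x", "other"),
    (1, "y", "example2"),
    (2, "x", "example3"),
    (1, "x", "exam"),
    (1, "x", "example0"),
]
index_acb = sorted(rows)


def prefix_range(index, a, c, b_prefix):
    """Return the contiguous slice for a = ?, c = ? and b LIKE 'b_prefix%'."""
    # Random access to the first entry at or after (a, c, b_prefix).
    start = bisect_left(index, (a, c, b_prefix))
    end = start
    # Tight scan: walk forward while the entry still matches all three conditions.
    while (
        end < len(index)
        and index[end][:2] == (a, c)
        and index[end][2].startswith(b_prefix)
    ):
        end += 1
    return index[start:end]


matches = prefix_range(index_acb, 1, "x", "example")
```

Because a and c are fixed, all matching b values sit next to each other in the sorted order, so a single seek plus a short forward scan is enough.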

Related

Optimizing a SELECT statement with ORDER BY by using indices

I have these queries:
1st query:
SELECT (..) FROM db WHERE A = const AND B > const AND C >= const ORDER BY B DESC LIMIT const
2nd query (different db):
SELECT (...) FROM db' WHERE A' = const ORDER BY X' DESC LIMIT const
Question about 1st query:
Is it sufficient to have a multiple-column index (A, B, C) or do I need an additional single-column index (B) (or a different one) because of the ORDER BY statement?
Question about 2nd query: Do I need a multiple-column index (A', X') or two single-column indices (A'), (X') to make use of them in this query?
It is an important thing to know that MySQL will use at most one index (for searching, filtering and ordering) per table and subquery (so basically per row in explain), so you can use only one index here.
For your first query, an index (A,B) will allow MySQL to do a range scan and use the order. If you use (A,B,C), the column C cannot be used in the range condition (because B is already a range), but MySQL will save the time to read the actual table data to get the value for C to check the last condition. So (A,B,C) is in general the fastest choice here.
"In general", because you can of course have a data distribution where another index would be best: If you e.g. have only one or two rows that match C >= const and 10M+ rows with A = const, using an index on just C would be fastest. And if C is a very big column (e.g. varchar(700)), it could blow up the index and slow it down. But to estimate such exceptions would require deeper knowledge of your data.
For your second query, (A', X') will be the best choice. If you have the two indexes (A'), (X'), MySQL will in most cases (unless A' is unique, but then you wouldn't need an order by anyway) use the index on X' and hope it will find matching rows for A' soon. This will sometimes be unexpectedly and painfully slow if you only have some rows that match A' = const (because it has to jump back and forth in the table (that is ordered by the primary key) in the order of X' to find rows that match the condition for A').
You might get the same problem for your first query if you have the indexes (A) and (B) (but not (A,B) or (A,B,C)) there: MySQL will probably use (B) instead of (A) (but check the explain to make sure). Even if you just add one index now, this can e.g. happen when you add the index (B) to optimize a different query next week and forget about this query, so I'd suggest sticking with (at least) (A,B).
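To make the first-query reasoning more concrete, here is a toy Python model of an (A, B, C) index as a sorted list of tuples. It is a sketch under simplifying assumptions (integer columns, made-up data, a hypothetical helper top_rows), not a description of MySQL internals.

```python
from bisect import bisect_left, bisect_right

# Toy (A, B, C) index: tuples kept in sorted order. All values are made up.
index_abc = sorted([
    (1, 5, 10), (1, 7, 3), (1, 9, 8), (1, 12, 1), (1, 15, 9),
    (2, 6, 4), (2, 20, 7),
])


def top_rows(index, a, b_min, c_min, limit):
    """Model of: WHERE A = a AND B > b_min AND C >= c_min ORDER BY B DESC LIMIT limit."""
    # Range on (A, B): first entry with A = a and B > b_min ...
    lo = bisect_right(index, (a, b_min, float("inf")))
    # ... up to the end of the A = a block (assumes A is an integer).
    hi = bisect_left(index, (a + 1,))
    result = []
    # Reading the range backwards yields ORDER BY B DESC without a separate sort.
    for entry in reversed(index[lo:hi]):
        # C cannot narrow the range, but it is checked from the index entry itself.
        if entry[2] >= c_min:
            result.append(entry)
            if len(result) == limit:
                break
    return result


top = top_rows(index_abc, a=1, b_min=7, c_min=5, limit=2)
```

In this model the scan stops as soon as the limit is reached, which is the benefit of having the ORDER BY column directly after the equality column in the index.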

Mysql index use

I have 2 tables with a common field. On one table the common field has an index
while on the other not. Running a query as the following :
SELECT *
FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
the query performs far worse than the opposite one:
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
Could anybody explain why, and the logic behind the use of indexes in this case?
You can prepend your queries with EXPLAIN to find out how MySQL will use the indexes and in which order it will join the tables.
Take a look at the documentation of the EXPLAIN output format to see how to interpret the result.
Because of the LEFT JOINs, the order of the tables cannot be changed. MySQL needs to include in the final result set all the rows from the left table, whether or not they have matches in the right table.
On INNER JOINs, MySQL usually swaps the tables and puts the table having fewer rows first because this way it has a smaller number of rows to analyze.
Let's take this query (it's your query with shorter names for the tables):
SELECT *
FROM a
LEFT JOIN b ON a.col = b.col
WHERE 1
How MySQL runs this query:
1. It gets the first row from table a that matches the query conditions. If there are conditions in the WHERE or join clauses that use only fields of table a and constant values, then an index that contains some or all of these fields is used to filter only the rows that match the conditions.
2. After a row from table a is selected, it goes to the next table in the execution plan (this is table b in our query). It has to select all the rows that match the WHERE condition(s) AND the JOIN condition(s). More specifically, the row(s) selected from table b must match the condition b.col = X, where X is the value of column col for the row currently selected from table a on step 1. It finds the first matching row, then goes to the next table. Since there is no "next table" in this query, it puts the pair of rows (from a and b) into the result set, then discards the row from b and searches for the next one, repeating this step until it finds all the rows from b that match the row currently selected from a (on step 1).
3. If on step 2 it cannot find any row from b that matches the row currently selected from a, the LEFT JOIN forces MySQL to make up a row (having the columns of b) full of NULLs; together with the current row from a, it creates a row and puts it into the result set.
4. After all the matching rows from b are processed, MySQL discards the current row from a, selects the next row from a that matches the WHERE and join conditions, and starts over with the selection of matching rows from b (step 2).
5. This process loops until all the rows from a are processed.
Remarks:
The meaning of "first row" on step 1 depends on a lot of factors. For example, if there is an index on table a that contains all the columns (of table a) specified in the query then MySQL will not read the table data but will use the index instead. In this case, the order of the rows is given by the index. In other cases the rows are read from the table data and the order is provided by the order they are stored on the storage medium.
This simple query doesn't have any WHERE condition (WHERE 1 is always TRUE) and also there is no condition in the JOIN clause that contains only columns from a. All the rows from table a are included in the result set and that leads to a full table scan or an index scan, if possible.
On step 2, if table b has an index on column col then MySQL uses the index to find the rows from b that have value X on column col. This is a fast operation. If table b does not have an index on column col then MySQL needs to perform a full table scan of table b. That means it has to read all the rows of table b in order to find those having values X on column col. This is a very slow and resource consuming operation.
Because there is no condition on rows of table a, MySQL cannot use an index of table a to filter the rows it selects. On the other hand, when it needs to select the rows from table b (on step 2), it has a condition to match (b.col = X) and it could use an index to speed up the selection, given such an index exists on table b.
This explains the big difference in performance between your two queries. Moreover, because of the LEFT JOIN, your two queries are not equivalent; they produce different results.
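The steps above can be sketched as a toy nested-loop LEFT JOIN in Python. Each table is reduced to a list of values of the join column, a dictionary stands in for an index on b.col, and rows_read counts how many rows of b are touched. The function names and data are invented for illustration, and the real executor is far more involved.

```python
from collections import defaultdict


def left_join_scan(left, right):
    """b has no index on col: every row of a triggers a full scan of b."""
    result, rows_read = [], 0
    for a_val in left:
        matched = False
        for b_val in right:
            rows_read += 1
            if b_val == a_val:
                result.append((a_val, b_val))
                matched = True
        if not matched:
            result.append((a_val, None))  # LEFT JOIN: pad with NULL
    return result, rows_read


def left_join_indexed(left, right):
    """b has an index on col: only the matching rows of b are read."""
    index = defaultdict(list)
    for b_val in right:
        index[b_val].append(b_val)
    result, rows_read = [], 0
    for a_val in left:
        matches = index.get(a_val, [])
        rows_read += len(matches)
        if matches:
            result.extend((a_val, b_val) for b_val in matches)
        else:
            result.append((a_val, None))
    return result, rows_read


table_a = [1, 2, 3, 4]
table_b = [2, 2, 4, 5]
scan_result, scan_reads = left_join_scan(table_a, table_b)
idx_result, idx_reads = left_join_indexed(table_a, table_b)
```

Both functions return the same rows; only the amount of work on table b differs, and the gap grows with the product of the two table sizes.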
Disclaimer: Please note that the above list of steps is an overly simplified explanation of how the execution of a query works. It attempts to put it in simple words and skip the many technical aspects of what happens behind the scene.
Hints about how to make your query run faster can be found on MySQL documentation, section 8. Optimization
To check what's going on with MySQL Query optimizer please show EXPLAIN plan of these two queries. Goes like this:
EXPLAIN
SELECT * FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
and
EXPLAIN
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1

MySQL multiple index optimization

I have a question about optimizing sql queries with multiple index.
Imagine I have a table "TEST" with fields "A, B, C, D, E, F".
In my code (php), I use the following "WHERE" query :
Select (..) from TEST WHERE a = 'x' and B = 'y'
Select (..) from TEST WHERE a = 'x' and B = 'y' and F = 'z'
Select (..) from TEST WHERE a = 'x' and B = 'y' and (D = 'w' or F = 'z')
what is the best approach to get the best speed when running queries?
3 multiple Index like (A, B), (A, B, F) and (A, B, D, F)?
Or A single multiple index (A, B, D, F)?
I would tend to say that the 3 index would be best even if the space of index in the database will be larger.
In my problem, I search the best execution time not the space.
The database being of a reasonable size.
Multiple-column indexes:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
In other words, it is a waste of space and computing power to define an index that covers the same first N columns as another index and in the same order.
The best way to examine the index is to practice. Use "explain" in MySQL; it will give you a query plan and tell you which index it will use. In addition, it will give you an estimate of the number of rows it expects to examine. Here is an example:
explain select * from TEST WHERE a = 'x' and B = 'y'
It is hard to give definitive answers without experiments.
BUT: ordinarily an index like (A,B,D) is considered to be superfluous if you have an index on (A,B,D,F). So, in my opinion you only need the one multicolumn index.
There is one other consideration. If your table has a lot of columns and a lot of rows and your SELECT list has a small subset of those columns, you might consider including those columns in your index. For example, if your query says SELECT D,F,G,H FROM ... you should try creating an index on
(A,B,D,F,G,H)
as it will allow the query to be satisfied from the index without having to refer back to the rows of the table. This can sometimes help performance a great deal.
It's hard to explain well, but generally you should use as few indexes as you can get away with, using as many columns of the common queries as you can, with the most commonly queried columns first.
In your example WHERE clauses, A and B are always included. These should thus be part of an index. If A is more commonly used in a search then list that first, if B is more commonly used then list that first. MySQL can partially use the index as long as each column (seen from the left) in the index is used in the WHERE clause. So if you have an index ( A, B, C ) then WHERE ( A = .. AND B = .. AND Z = .. ) can still use that index to narrow down the search. If you have a WHERE ( B = .. AND Z = .. ) clause then A isn't part of the search condition and that index can't be used.
You want the single multiple column index A, B, D, F OR A, B, F, D (only one of these at a time can be used), but which depends mostly on the number of times D or F are queried for, and the distribution of data. Say if most of the values in D are 0 but one in a hundred values are 1 then that column would have a poor key distribution and thus putting the index on that column wouldn't be all that useful.
The optimiser can use a composite index for where conditions that follow the order of the index with no gaps:
An index on (A,B,F) will cover the first two queries.
The last query is a bit trickier, because of the OR. I think only the A and B conditions will be covered by (A,B,F) but using a separate index (D) or index (F) may speed up the query depending on the cardinality of the rows.
I think an index on (A,B,D,F) can only be used for the A and B conditions on all three queries. Not the F condition on query two, because the D value in the index can be anything and not the D and F conditions because of the OR.
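The "no gaps" rule can be written down as a few lines of Python. This is a deliberately simplified sketch: it only considers equality conditions joined by AND, ignores ranges, OR, and any optimizer features beyond the plain leftmost-prefix rule, and the function name usable_prefix is made up.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that are all constrained by equality."""
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break  # the first gap ends the usable part of the index
        prefix.append(column)
    return tuple(prefix)


# Query 2 (A = .. AND B = .. AND F = ..) against the two candidate indexes.
with_abf = usable_prefix(("A", "B", "F"), {"A", "B", "F"})
with_abdf = usable_prefix(("A", "B", "D", "F"), {"A", "B", "F"})
# A WHERE clause that does not constrain the first index column at all.
no_leading = usable_prefix(("A", "B", "C"), {"B", "Z"})
```

Under this model, (A,B,F) serves all three conditions of the second query, while (A,B,D,F) stops at B because D is not constrained.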
You may have to add hints to the query to get the optimiser to use the best index and you can see which indexes are being used by running an EXPLAIN ... on the query.
Also, adding indexes slows down DML statements and can cause locking issues, so it's best to avoid over-indexing where possible.

Partial multi-field index usage in MySQL

I have a MyISAM table with almost 1 billion records, with say, three fields: a, b and c.
The table has a btree multi-field index on columns a, b and c in that order. Analyzing the index shows that the cardinalities for the fields in that index are:
a: 112 (int)
b: 2694 (int)
c: 936426795 (datetime)
Which means that there are around 100 different values for a, around 20 different values for b, and for each combination of a and b, a whole lot of values of c.
I want to perform a query over a specific value of a, and a range over c. Something like
select a, b, c from mytable where a=4 and c >= "2011-01-01 00:00:00" and c < "2011-01-02 00:00:00"
Getting the query explained shows me that it will indeed use the index, but I don't know if it will use only the first field of the index and then scan over the rest of the table, or if it will be smart enough to apply the third field index, for each value of b, which would be the same as executing 20 different queries, one for each different value of b.
Anybody who knows the internal working of mysql indices can answer this question?
Edit: I'm not asking whether or not I can have mysql to use the index over only a and c. I know how btrees work, and I know that you can only use it over a, a and b, or a and b and c. I would like to know if the mysql optimizer is smart enough to apply the index over all the values in b so it can use the a+b+c index, considering that the cardinality of b is extremely small.
Consider an even simpler example. A table with two columns: a and b, and the index has cardinality 1 over a and 10000000 over b. Mysql should be smart enough to know that there's only one value of a, therefore this index is equivalent to an index only over b, and should use this index when performing queries only over b.
MySQL Reference Manual :: How MySQL Uses Indexes
If the table has a multiple-column index, any leftmost prefix of the
index can be used by the optimizer to find rows. For example, if you
have a three-column index on (col1, col2, col3), you have indexed
search capabilities on (col1), (col1, col2), and (col1, col2, col3).
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.
a,c is not a leftmost prefix of the index a,b,c so the index cannot be used to resolve the search on c.
The question makes sense from the point of view that some database engines are smart enough to scan the index rather than scanning the table. (And they allow "data" to be stored in the index for this exact reason.) Scanning the index will be faster than joining the index to the base data, then limiting (excluding) returned rows based on the where clause.
It would make sense that only the rows in the index that meet the where condition (on columns in the index) are joined. Particularly if you are running a large key cache...
It would appear this doesn't happen in MySQL which is disappointing.
Therefore no.
Below are some facts related to B-tree index usage by MySQL, and one example to illustrate the logic.
a) If roughly 75% of the rows in a table hold the same value, the index will not be used; MySQL will do a table scan instead.
b) Normally MySQL uses only a single index per table.
c) Index ordering: MySQL uses the columns of an index in the order in which they are defined.
For example, suppose there is a combined index on the fields a, b and c: idx_a_b_c(a,b,c)
i. select a, b, c from mytable where a=4
This query will use the index, as column a is first in the index order.
ii. select a, b, c from mytable where a=4 and b=5
This query will use the combined index on a and b, as these columns are contiguous in the index order.
iii. select a, b, c from mytable where a=4 and b=5 and c >= "2011-01-01 00:00:00"
This query will use the combined index on a, b and c, as these columns are contiguous in the index order.
iv. select a, b, c from mytable where c >= "2011-01-01 00:00:00"
This query will not use the index to search, as MySQL reads an index from its leftmost column and c is not the leftmost column of the index.
v. select a, b, c from mytable where a=4 and c >= "2011-01-01 00:00:00" and c < "2011-01-02 00:00:00"
This query will use the index only for column a, not for column c, because the continuity from the left is broken (b is missing). So MySQL uses the index to find the rows with a=4 and then checks the condition on c for each of them.
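The difference between case v and what the question hopes for can be shown with a toy model in Python, again treating the (a, b, c) index as a sorted list of tuples with made-up integer data. The sketch only contrasts the two access strategies; it does not claim which one a particular MySQL version chooses.

```python
from bisect import bisect_left

# Toy (a, b, c) index: 3 values of a, 4 values of b, c = 0..99 for each (a, b) pair.
index_abc = sorted((a, b, c) for a in range(3) for b in range(4) for c in range(100))


def scan_a_prefix(index, a, c_lo, c_hi):
    """Use only the leftmost column: read every entry with this a, then filter on c."""
    start = bisect_left(index, (a,))
    end = bisect_left(index, (a + 1,))  # assumes a is an integer
    block = index[start:end]
    hits = [entry for entry in block if c_lo <= entry[2] < c_hi]
    return hits, len(block)


def seek_per_b(index, a, b_values, c_lo, c_hi):
    """One (a, b, c-range) seek per distinct b, as the question describes."""
    hits = []
    for b in b_values:
        start = bisect_left(index, (a, b, c_lo))
        end = bisect_left(index, (a, b, c_hi))
        hits.extend(index[start:end])
    return hits, len(hits)


hits_scan, read_scan = scan_a_prefix(index_abc, 1, 10, 20)
hits_seek, read_seek = seek_per_b(index_abc, 1, range(4), 10, 20)
```

Both strategies return the same 40 entries, but the prefix-only scan reads all 400 entries with a = 1, while the per-b seeks read only the matching ones.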

Is there a better way to index multiple columns than creating an index for each permutation?

Suppose I have a database table with columns a, b, and c. I plan on doing queries on all three columns, but I'm not sure which columns in particular I'm querying. There's enough rows in the table that an index immensely speeds up the search, but it feels wrong to make all the permutations of possible indexes (like this):
a
b
c
a, b
a, c
b, c
a, b, c
Is there a better way to handle this problem? (It's very possible that I'll be just fine indexing a, b, c alone, since this will cut down on the number of rows quickly, but I'm wondering if there's a better way.)
If you need more concrete examples, in the real-life data, the columns are city, state, and zip code. Also, I'm using a MySQL database.
In MS SQL the index "a, b, c" will cover you for scenarios "a"; "a, b"; and "a, b, c". So you would only need the following indexes:
a, b, c
b, c
c
Not sure if MySQL works the same way, but I would assume so.
To use indexes for all possible equality conditions on N columns, you will need C([N/2], N) indexes, that is N! / ([N/2]! * (N - [N/2])!)
See this article in my blog for detailed explanations:
Creating indexes
You can also read the strict mathematical proof by Russian mathematician Egor Timoshenko (update: now in English).
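To get a feel for how quickly the count C([N/2], N) given above grows, it can be evaluated with Python's math.comb (available from Python 3.8). The helper name indexes_needed is made up for this sketch.

```python
from math import comb


def indexes_needed(n):
    """C(n, floor(n/2)): the index count stated above for n columns."""
    return comb(n, n // 2)


counts = {n: indexes_needed(n) for n in range(1, 6)}
```

For three columns this gives 3 indexes, for four columns 6, and for five columns 10.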
One can, however, get decent performance with fewer indexes using the following techniques:
Index merging
If the columns col1, col2 and col3 are selective, then this query
SELECT *
FROM mytable
WHERE col1 = :value1
AND col2 = :value2
AND col3 = :value3
can use three separate indexes on col1, col2 and col3, select the ROWID's that match each condition separately and then find their intersection, like in:
SELECT *
FROM (
SELECT rowid
FROM mytable
WHERE col1 = :value1
INTERSECT
SELECT rowid
FROM mytable
WHERE col2 = :value2
INTERSECT
SELECT rowid
FROM mytable
WHERE col3 = :value3
) mo
JOIN mytable mi
ON mi.rowid = mo.rowid
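The same idea can be modelled in a few lines of Python, with one dictionary per column playing the role of a single-column index that maps a value to a set of rowids. The table contents and the helper names build_index and index_merge are assumptions made for the example.

```python
# Toy table: rowid -> (col1, col2, col3). All values are made up.
table = {
    1: ("x", "p", 10),
    2: ("x", "q", 10),
    3: ("y", "p", 10),
    4: ("x", "p", 20),
    5: ("x", "p", 10),
}


def build_index(rows, position):
    """Single-column index: value -> set of rowids holding that value."""
    index = {}
    for rowid, row in rows.items():
        index.setdefault(row[position], set()).add(rowid)
    return index


idx1, idx2, idx3 = (build_index(table, i) for i in range(3))


def index_merge(value1, value2, value3):
    """Look up each index separately, then intersect the rowid sets."""
    rowids = (
        idx1.get(value1, set())
        & idx2.get(value2, set())
        & idx3.get(value3, set())
    )
    return sorted(rowids)


matching = index_merge("x", "p", 10)
```

Only the rowids that survive the intersection need to be fetched from the table afterwards.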
Bitmap indexing
PostgreSQL can build temporary bitmap indexes in memory right during the query.
A bitmap index is quite a compact contiguous bit array.
Each bit set in the array tells that the corresponding tid should be selected from the table.
Such an index can take but 128M of temporary storage for a table with 1G rows.
The following query:
SELECT *
FROM mytable
WHERE col1 = :value1
AND col2 = :value2
AND col3 = :value3
will first allocate a zero-filled bitmap large enough to cover all possible tid's in the table (that is large enough to take all tid's from (0, 0) to the last tid, not taking missing tid's into account).
Then it will seek the first index, setting the bits to 1 if they satisfy the first condition.
Then it will scan the second index, AND'ing the bits that satisfy the second condition with a 1. This will leave 1 only for those bits that satisfy both conditions.
Same for the third index.
Finally, it will just select rows with the tid's corresponding to the bits set.
The tid's will be fetched sequentially, so it's very efficient.
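The bitmap procedure described above can be imitated in Python, using an arbitrary-size integer as the bit array and the row position as a stand-in for the tid. This follows the description in this answer only; it is not PostgreSQL's actual implementation, and the data and the name bitmap_for are invented.

```python
def bitmap_for(rows, column, value):
    """Bit i is set when row i (our stand-in for a tid) has the given value."""
    bits = 0
    for tid, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << tid
    return bits


rows = [
    ("x", "p", 10),
    ("x", "q", 10),
    ("y", "p", 10),
    ("x", "p", 20),
    ("x", "p", 10),
]

# AND the three per-condition bitmaps; a bit stays 1 only if all conditions hold.
combined = bitmap_for(rows, 0, "x") & bitmap_for(rows, 1, "p") & bitmap_for(rows, 2, 10)

# Read the surviving tids in ascending order, i.e. sequentially.
tids = [tid for tid in range(len(rows)) if (combined >> tid) & 1]

# Size check for the figure quoted above: one bit per row for 2**30 rows.
bitmap_bytes = 2**30 // 8
```

The last line reproduces the storage estimate: one bit per row for about a billion rows comes to 128 MiB.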
The more indexes you create, the more your performance will be hit during update and delete operations, because the indexes themselves may need to be updated.
Yes, you can use multiple-column indexes. Something like
CREATE TABLE temp (
id INT NOT NULL,
a INT NULL,
b INT NULL,
c INT NULL,
PRIMARY KEY (id),
INDEX ind1 (a,b,c),
INDEX ind2 (a,b)
);
This type of index i.e. ind1 will surely help you in queries like
SELECT * FROM temp WHERE a=2 AND b=3 AND c=4;
Similarly, ind2 will help you in queries like
SELECT * FROM temp WHERE a=2 AND b=3;
But these indexes won't be used if the query is something like
SELECT * FROM temp WHERE a=2 OR b=3 OR c=4;
Here you will need separate indexes on a, b, and c.
So instead of having so many indexes, I would agree with what John said i.e. have indexes on a,b,c and if you feel that your workload covers more multi-column queries then you can switch to multi-column indexes.
cheers
Given that your columns are actually City, State and Zip Code, I would suggest just the following indexes:
INDEX(ZipCode)
If I am correct, Zip Codes are not duplicated across the USA, so it's pointless adding City or State information to the index as well, because they will be the same value for any given Zip Code. E.g., 90210 is always Beverly Hills, CA.
INDEX(City(5)) or INDEX(City(5), State)
This is just an index on the first five letters of the city name. In many cases, this will be specific enough that having the State indexed wouldn't provide any useful filtering. E.g., 'Los A' will almost certainly be records from Los Angeles, CA. Maybe there is another small town in the USA starting with 'Los A', but there will be so few records it's not worth cluttering the index with State data as well. On the other hand, some city names appear in many states (Springfield comes to mind), so in those cases it is better to have the State indexed as well. You will need to figure out for yourself which index is most suited to your set of data. If in doubt, I would go with the second index (City and State).
INDEX(State, sort_field)
State is a pretty broad index (quite possibly NY and CA alone will have 30% of the records). If you plan on displaying this information to the user, say, 30 records at a time, then you would have a query ending in
... WHERE STATE = "NY"
ORDER BY <sort_field>
LIMIT <number>, 30
To make that query efficient, you need to include the sorting column in the State index. So if you're showing pages ordered by Last Name (presuming you have that column), then you would use INDEX(State, LastName(3)), otherwise MySQL has to sort all of the 'NY' records before it can give you the 30 you want.
It depends on your SQL query.
index (a, b, c) is different to index(b, c, a) or index(a, c, b)