Understanding the order of indexing in MySQL - mysql

There are 3 columns in a table with 10 million entries. col1, col2, col3. col1 stores numbers with at most 2 digits, col2 stores numbers with at most 9 digits and col3 stores either 0 or 1.
Now, when I create a compound index in the order (col1, col2, col3), some SELECT queries that involve all 3 columns in the WHERE condition (exact values for col1 and col3, a range for col2) return in around 0.5 seconds, while with the order (col3, col1, col2) the same query takes around 10 seconds.
From what I understand, indexing in MySQL concatenates the values of the 3 columns in the order I specify and, after an initial sort, runs a binary search when querying. By that understanding, putting col3 first should be equivalent, if not superior, to the order (col1, col2, col3), since specifying col3=1 or col3=0 narrows the search by half.
Please explain the anomaly!

Well, it's tough to make a decision like this, but personally I would go with this index:
INDEX `compound_index`(col1,col2,col3);
If I didn't have range scans to do, I would have created
INDEX `compound_index`(col2,col1,col3);
as col2 most likely has better cardinality.
Generally speaking, if you don't have a range scan on the table, the column with the best cardinality should be the first column of the index, and so on.
In case you have a range scan, a loose index scan works better than a covering index:
http://www.arubin.org/blog/2010/11/18/loose-index-scan-vs-covered-indexes-in-mysql/

If your WHERE clause gives a range of values for col2, then anything after col2 in the index is not very useful.
If that's not clear, suppose you index on (col1, col2, col3) and your where clause is "where col1=5 and col2 between 2 and 4 and col3=1". At best, the SQL engine can go to the place in the index beginning at col1=5, col2=2, and col3=1. Theoretically, when it gets to the end of col2=2 and sees the first col2=3, col3=0, it could skip ahead to col2=3, col3=1. Similarly, when it gets to col2=4, col3=0, it could skip ahead to col2=4, col3=1. But in practice skipping around in the index is relatively slow. The engine reads the index in blocks, so once it has a block, it can search that block sequentially entirely in memory. To skip, it may have to read another block, which means additional I/O operations. I think most SQL engines say that once you give a range, everything after that in the index is not used. So most likely the engine would scan all records from 5,2 to 5,4 and pick out the col3=1 as it went, rather than skipping around in the index.
You say that col3 is always 0 or 1; I take it that col1 and col2 have a wider range of values? Let's suppose for the sake of discussion that they each have 10 possible values, and that your range on col2 covers 3 values. And let's assume a relatively even distribution across all values -- there are just as many 1's as 2's, etc.
Then if you index on (col1, col2, col3), the engine can use col1 to immediately narrow the search to just 10% of the index, and col2 to narrow to 30% of that or 3% of the total.
If you index on (col3, col2, col1), then the engine can use col3 to narrow the search to 50% of the index, and col2 to 30% of that, or 15%.
The second index has the engine searching 5 times as much of the index as the first. So yes, it would be slower.
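The percentages above can be checked with a small model. This is only a sketch in Python, not how MySQL is implemented: a sorted list of tuples stands in for the index, and the name `scanned_entries`, the 200-row toy table and the rule that seeking stops after the first range column are assumptions taken from the explanation above.

```python
from bisect import bisect_left, bisect_right
from itertools import product

INF = float("inf")  # sorts after every real column value

# Toy table: all combinations of col1 (0-9), col2 (0-9), col3 (0-1) = 200 rows.
ROWS = [
    {"col1": a, "col2": b, "col3": c}
    for a, b, c in product(range(10), range(10), range(2))
]


def scanned_entries(index_cols, equals, range_col, lo, hi):
    """Count the index entries read for equality filters plus one range filter.

    Assumed model: equality columns are used for seeking only while they form
    a leading prefix of the index; the range column is used if it comes right
    after that prefix; any later column is checked entry by entry.
    """
    index = sorted(tuple(row[c] for c in index_cols) for row in ROWS)
    prefix = ()
    for col in index_cols:
        if col not in equals:
            break
        prefix += (equals[col],)
    rest = index_cols[len(prefix):]
    if rest and rest[0] == range_col:
        start = bisect_left(index, prefix + (lo,))
        end = bisect_right(index, prefix + (hi, INF))
    else:
        start = bisect_left(index, prefix)
        end = bisect_right(index, prefix + (INF,))
    return end - start


# where col1=5 and col2 between 2 and 4 and col3=1
filters = {"col1": 5, "col3": 1}
first = scanned_entries(("col1", "col2", "col3"), filters, "col2", 2, 4)
second = scanned_entries(("col3", "col2", "col1"), filters, "col2", 2, 4)
print(first, second, second / first)  # 6 30 5.0
```

Out of 200 entries, 6 is 3% and 30 is 15%, which matches the estimate under the even-distribution assumption.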

Related

MySQL queries - defining compound and single indexes across multiple queries. How to prevent the indexes from conflicting and creating slow queries?

I am struggling to work out which columns are best to put my indexes on, when it seems adding additional indexes can have a detrimental effect on the query performance.
For example, I have the following query on a table with around 5m rows;
SELECT col1, col2 FROM table WHERE col1 = 'a' AND col2 = 'b' AND col3 = 'c';
Running this with no indexes takes 12 seconds!
I add a compound index on all 3 columns - table_col1_col2_col3_index;
My query now drops down to 2 seconds - great!
I now have another query on the same table (with no indexes on any column):
SELECT col1, col2 FROM table WHERE col1 = 'a';
Running this on its own and the query takes 4 seconds - still pretty slow!
So now I add a single-column index on col1: table_col1_index
My query reduces down to 0.2 seconds. This is great, however I now run the original query again and notice that it is using this index as opposed to the one I specified earlier. The original query is now back up at 6 seconds.
I am unsure how to go about ensuring that both queries can be optimised at the same time.
You can create indexes taking care to put the most used or most selective column on the left, and, where you can, organize the indexes so that the same index serves several queries.
Furthermore, you can always point the optimizer at the index you think is best suited using FORCE INDEX (or IGNORE INDEX): https://dev.mysql.com/doc/refman/8.0/en/index-hints.html
SELECT * FROM table1 FORCE INDEX (col3_index)
WHERE col1=1 AND col2=2 AND col3=3;
Turn off the query cache, or use SELECT SQL_NO_CACHE ... when doing timing. (This applies to versions before MySQL 8.0, which removed the query cache.)
Run each timing test twice. The first may spend extra time fetching data and/or index blocks from disk. The second timing is better for comparisons. (And is closer to the way it would be in a "production" server.)
How many rows are being returned? That could have an impact. The 2nd query may be returning many times as many rows.
Please provide SHOW CREATE TABLE -- there could be subtle issues. (datatypes, column sizes, collations, who-knows-what)
Please provide EXPLAIN SELECT ... -- As written, each of your examples should say "Using index", meaning that the index "covers" the query, which means that all the columns in the SELECT exist in the INDEX being used.
Do not use "index hints" -- while it may help a query today, it may hurt it more tomorrow.
All of your examples would ('should') benefit from INDEX(col1, col2, col3), in that order; I would not add any others.
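What "covers" means can be spelled out with a tiny check. This is a sketch in Python under simplifying assumptions: the helper name `is_covering` is made up for illustration, and it ignores engine details such as InnoDB appending the primary key columns to every secondary index.

```python
def is_covering(index_cols, select_cols, where_cols):
    """True if every column the query touches is present in the index."""
    needed = set(select_cols) | set(where_cols)
    return needed <= set(index_cols)


composite = ("col1", "col2", "col3")
single = ("col1",)

# SELECT col1, col2 FROM table WHERE col1 = 'a' AND col2 = 'b' AND col3 = 'c';
print(is_covering(composite, ["col1", "col2"], ["col1", "col2", "col3"]))  # True
print(is_covering(single, ["col1", "col2"], ["col1", "col2", "col3"]))     # False

# SELECT col1, col2 FROM table WHERE col1 = 'a';
print(is_covering(composite, ["col1", "col2"], ["col1"]))  # True
print(is_covering(single, ["col1", "col2"], ["col1"]))     # False
```

In this model the composite index covers both queries, while the single-column index covers neither, so the row data still has to be read to fetch col2.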

single-column or composite index

For a table like this
[ col1 - col2 - col3 - col4 ]
[ 1 - 2 - 3 - 4 ]
I'm going to use two types of queries in two cases
One is SELECT * FROM table WHERE col1 = 1 AND col2 = 2 AND col3 = 3;
Another is SELECT * FROM table WHERE col1 = 1 AND col2 = 2 AND col4 = 4;
In this case, do I make a
composite index for col1 AND col2 only and a single-column index for col3 AND col4
or do I go
ALL columns single-column index
or put
ALL the columns in composite index
Side question: Do I have to name the Index? And what is the Index size?
Have these two:
INDEX col123 (col1, col2, col3),
INDEX col214 (col2, col1, col4)
Notes:
For the 2 queries given, it does not matter which order the 3 columns are in within the composite indexes.
I did col1 and col2 in different orders just in case some other query needs col2 without col1.
Having INDEX(col3) (single-column) is less useful.
With INDEX(col1, col2), INDEX(col3) -- The optimizer will pick one index and not use the other. This is less good than having an index with all three columns.
Luke is good; my index cookbook might be better?
The "rules of thumb" are aimed at Postgres. Do not use them; there are too many things that are incorrect for MySQL.
The "query tuning" link is aimed at DB2; it mostly applies to MySQL.
INSERTs do take a little time to update the index(es), but most of that work is delayed (see "Change buffering") for non-UNIQUE indexes. Don't let that stop you from adding an index. The benefit on a SELECT usually far outweighs the cost in INSERT.
Index names are optional in MySQL, but the default for a composite index can be misleading.
Another way to compare queries and/or indexes, even with too few rows to get reliable timings:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Big numbers = bad; small numbers = smile.
A compound index on col1 and col2, with single-column indexes on col3 and col4 is probably going to work best. But the way to tell for certain is to build a table for testing, and populate it with sample data. If possible, insert roughly the same amount of data you expect in production.
Then build indexes, run the queries, and read the execution plans. Drop the indexes, build them a different way, run the queries, and read the execution plans.
You should also think about what other queries need to use this table, and how indexes affect those queries. Think about INSERT and DELETE queries as well as SELECT statements.
Whether you have to name the index depends on the dbms. Most of them will supply a system-generated name if you leave it out. The size of the index depends on the dbms; there's usually a way to figure it out if the dbms doesn't supply explicit functions or stored procedures to do that.

In MySQL, is "WHERE column LIKE 'abc%'" considered a range scan, for the purpose of indexing?

Say I have a query like this:
SELECT column1, date FROM table WHERE column1 LIKE 'abc%' AND date > '2015-01-01'
If I have an index on (column1, date), will it use the index as a covering index? Usually, a range column must be last in a multi-column index, because the following columns can't be used in constraints.
I cannot find anything in the documentation regarding this. It seems to me that the explain plan shows far too many rows that need to be looked up, even though it should be a covering index.
EDIT: will show the real query:
SELECT
count(*) AS cnt,
`col1`
FROM table
USE INDEX (table_col2_col1_date_index)
WHERE `col1` IN ('25485') AND `col2` LIKE 'text-%' AND `date` > '2016-06-03 18:13:40'
GROUP BY `col1`;
As you can see, my index covers all three columns. Explain says:
Using where; Using index; Using temporary; Using filesort
Explain shows 38069776 rows to examine. Doing a count(*) for col2 like 'text-%' shows 20427133. So assuming they're just estimates, this is probably the issue: it's only considering the first column in the index. Even though it IS, in fact, using the covering index fully, it still has to do a ton of reads on data in the index.
Now I don't know how to optimize this query. This is a log-based table, so the date is crucial, but we have a lot of different values for column2, and there's nothing I can do to optimize for both?
Short answer to title question: Yes. But that is not your real question, so...
IN(single_value) is treated like = single_value, so
WHERE `col1` IN ('25485') AND `col2` LIKE 'text-%' AND `date` > '2016-06-03 18:13:40'
needs one of these:
INDEX(col1, col2)
INDEX(col1, date)
That is, col1 needs to be first.
Using where; Using index; Using temporary; Using filesort -- says that the index was 'covering', but it does not say whether the columns were in the optimal order. That is, one of these is the optimal 'covering' index:
INDEX(col1, col2, date)
INDEX(col1, date, col2)
I can't predict which will be better, and the Optimizer may or may not predict correctly from its statistics. But col1 needs to be first.
If you have
col1 IN (123, 234) AND ...
and you have a new enough version, the Optimizer will efficiently leapfrog through the index -- first do the 123 AND ..., then do 234 AND .... That is, in this case, IN works as efficiently as =, and can see past col1 to make use of a 'range' after it.
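The leapfrogging idea can be sketched as follows. This is a toy model in Python, not the Optimizer's real code: a sorted list of tuples stands in for an index on (col1, date), days 1 to 10 stand in for dates, and the name `leapfrog_scan` is made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy index on (col1, day): five col1 values with ten days each, 50 entries.
INDEX = sorted(
    (col1, day)
    for col1 in (100, 123, 200, 234, 300)
    for day in range(1, 11)
)


def leapfrog_scan(in_values, after_day):
    """Return entries with col1 in in_values and day > after_day.

    Does one short range scan per IN value instead of a single scan over
    everything between the smallest and the largest col1.
    Assumes integer col1 values, so (value + 1,) marks the end of a group.
    """
    hits = []
    for value in sorted(in_values):
        start = bisect_right(INDEX, (value, after_day))
        end = bisect_left(INDEX, (value + 1,))
        hits.extend(INDEX[start:end])
    return hits


# col1 IN (123, 234) AND day > 7
result = leapfrog_scan([123, 234], 7)
print(len(result))  # 6
print(result[0], result[-1])  # (123, 8) (234, 10)
```

Only 6 entries are read here, whereas a single scan from col1=123 through col1=234 would pass over 30 entries, including all of col1=200.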
In "Data Warehouse" type of tables, it is often very efficient to build and maintain Summary Tables. (Since I don't have a feel for the columns or the likely queries, I can't provide details at this time.)
If you are using InnoDB, do you have innodb_buffer_pool_size set to about 70% of RAM? If not, that may help.
Only one index is chosen to access a table.
If you want to make use of all three values, create an index on all three columns:
create index myindex on mytable(col1, col2, col3);
Try to put the "most specific" column first in the list of columns.

Unique first column in multi-column index

I have multi-column index for 2 columns. Can I make first column unique without making separate index for that?
If I understand correctly mysql can use only first column in this index for lookups, so can it use it to detect uniqueness?
The short answer is "No". Because it doesn't make much sense.
Indeed, MySQL is able to use a multiple-column index for operations that use only the leftmost "n" columns from the index definition.
Let's say you have an index on columns (col1, col2). MySQL can use it to find records matching conditions on both col1 and col2, GROUP BY col1, col2 or ORDER BY col1, col2. It is important to notice that col1 and col2 need to be used in this order in the GROUP BY or ORDER BY clause. Their order doesn't matter in WHERE or ON clauses as long as both are used.
MySQL can also use the same index for WHERE or ON conditions and GROUP BY or ORDER BY clauses that contain only col1. It cannot, however, use the index if col2 appears without col1.
What happens when you have an index on columns (col1, col2) and all the rows have distinct values in column col1?
Let's assume we have a table that has distinct values in column col1 and an index on columns (col1, col2). When MySQL needs to find the rows that match WHERE col1 = val1 AND col2 = val2, by consulting the index it can find the row that has col1 = val1. It doesn't need to use the index to refine the list of candidate rows because there is no list: there is at most one row having col1 = val1.
Sure, most of the time MySQL will use the index to check if col2 = val2, but having col2 in this index doesn't bring more useful information to the index. The storage space it takes and the processing power it uses on table data updates are too big for the tiny contribution it adds to row searching.
The whole purpose of having indexes on multiple columns is to help searching by shrinking the list of matching rows for a given set of values when the columns included in a multiple-column index cannot be used individually because they don't contain enough distinct values.
Technically speaking, there is no way to tell MySQL you want to have a multiple-column index on (col1, col2) that must have unique values on col1. Create a UNIQUE INDEX on col1 instead. Then think about the data you have in the table and the queries you run against it and decide if another index on col2 only isn't better than the multiple-column index on (col1, col2).
In order to decide, you can create the new indexes (UNIQUE on col1, INDEX on col2), put EXPLAIN in front of the most frequent queries you run on the table and check which index MySQL picks.
You need to have enough data (thousands of rows, at least, more is better) in the table to get accurate results.
You asked.
I have multi-column index for 2 columns. Can I make first column unique without making separate index for that?
The answer is no. You need a separate unique index on the first column to enforce a uniqueness constraint.

Are the MySQL query performance benefits of indices retained if a subset of the index columns are used in a query?

Do I get to keep the performance and efficiency advantages of having an index setup for multiple columns on a MySQL table if I run a SELECT statement that queries some subset of those columns in the index?
So, if I have an index set up on columns A, B and C but my statement only queries for columns A and B, is that the same as having no index set up at all? Do I need to have another index set up exclusively for A and B to gain any performance benefits with queries?
Short answer to a generic question: It depends
Long answer:
The DB builds the execution plan based on the statistics of the table. Basically, the DB engine estimates how much "effort" every operation takes. The two main factors in this case are the indexed data size and the distribution of the indexed data.
Data distribution
If the granularity of the data in the first two columns is low (only a few possible values; for example, column A stands for gender and column B for age), then there is a good chance that the optimizer will prefer to read the entire table rather than use the index. In this case, adding an index only on A, B won't be useful either.
Indexed data size
Another factor is the size of the data in column C, which directly affects the index size. Since reading the index tree also requires I/O, the bigger the index, the higher the cost.
Let's assume that the data in column C is a comment and the average comment size is 500 chars. The data may have lots of possible values, but the index is going to be very large. This may also cause the DB to prefer reading the entire table rather than using the index. In this case, adding an index on A, B is useful.
See this answer: https://stackoverflow.com/a/20939127/2520738
Basically:
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).
So basically, yes, if your index reads A, B, C from left to right, you can search on A, A and B, A and B and C. If you don't have single column indexes on B or C then no index will be used when they are searched individually.
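The leftmost-prefix rule quoted above can be written down as a few lines of Python. This is only a sketch of the rule for equality filters, with a made-up helper name `usable_prefix`; it does not model ranges or the optimizer's cost decisions.

```python
def usable_prefix(index_cols, filtered_cols):
    """Return the leading index columns that the filters can make use of."""
    prefix = []
    for col in index_cols:
        if col not in filtered_cols:
            break  # the first gap ends the usable part of the index
        prefix.append(col)
    return prefix


index = ("A", "B", "C")
print(usable_prefix(index, {"A", "B"}))       # ['A', 'B']
print(usable_prefix(index, {"A"}))            # ['A']
print(usable_prefix(index, {"A", "B", "C"}))  # ['A', 'B', 'C']
print(usable_prefix(index, {"B", "C"}))       # []
print(usable_prefix(index, {"A", "C"}))       # ['A']
```

An empty result means the index on (A, B, C) gives no lookup help for that filter set, which is the case for B or C searched without A.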