MySQL Hash Indexes for Optimization - mysql

So maybe this is noob, but I'm messing with a couple tables.
I have TABLE A roughly 45,000 records
I have TABLE B roughly 1.5 million records
I have a query:
update
schema1.tablea a
inner join (
SELECT DISTINCT
ID, Lookup,
IDpart1, IDpart2
FROM
schema1.tableb
WHERE
IDpart1 is not NULL
AND
Lookup is not NULL
ORDER BY
ID,Lookup
) b Using(ID,Lookup)
set
a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
where
a.ID is NOT NULL
AND
a.Elg_IDpart1 is NULL
So I am forcing the index on ID, Lookup. Each table does have an index on those columns as well, but because of the subquery I forced it.
It is taking FOR-EVER to run, and it really should take, I'd imagine, under 5 minutes...
My questions are in regards to the indexes, not the query.
I know that you can't use hash index in ordered index.
I currently have indexes on ID and Lookup separately, and on both as one index, and they are B-tree indexes. Based on my WHERE clause, would a hash index be a suitable optimization technique?
Can I have a single hash index, and have the rest of the indexes be B-tree indexes?
This is not a primary key field.
I would post my explain but i changed the name on these tables. Basically it is using the index only for ID...instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
The Lookup field maps the first and second part of the ID...
Any help or insight on this is appreciated.
 UPDATE
An EXPLAIN on the UPDATE after I took out sub-query.
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
| 1 | SIMPLE | m | ALL | Lookup_Idx,ID_Idx,ID_Lookup | | | | 44023 | Using where |
| 1 | SIMPLE | c | ref | ID_LookupIdx | ID_LookupIdx | 5 | schema1.tableb.ID | 4 | Using where |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
tablea relevant indexes:
ID_LookupIdx (ID, Lookup)
tableb relevant indexes:
ID (ID)
Lookup_Idx (Lookup)
ID_Lookup_Idx (ID, Lookup)
All of the indexes are normal B-trees.

Firstly, to deal with the specific questions that you raise:
I currently have indexes on ID and Lookup separately, and on both as one index, and they are B-tree indexes. Based on my WHERE clause, would a hash index be a suitable optimization technique?
As documented under CREATE INDEX Syntax:
+----------------+--------------------------------+
| Storage Engine | Permissible Index Types |
+----------------+--------------------------------+
| MyISAM | BTREE |
| InnoDB | BTREE |
| MEMORY/HEAP | HASH, BTREE |
| NDB | BTREE, HASH (see note in text) |
+----------------+--------------------------------+
Therefore, before even considering HASH indexing, one should be aware that it is only available in the MEMORY and NDB storage engines, so it may not even be an option for you.
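To make the engine restriction concrete, here is a minimal sketch. The table name lookup_cache, the index name idx_hash_id_lookup and the column types are hypothetical, not taken from the question. On a MEMORY table the USING HASH clause takes effect; on an InnoDB table the same clause is, as far as I know, accepted but a B-tree is built anyway, which the Index_type column of SHOW INDEX should reveal.

```sql
-- Hypothetical MEMORY table: HASH indexes are permitted here.
CREATE TABLE lookup_cache (
  ID      INT NOT NULL,
  Lookup  INT NOT NULL,
  IDpart1 INT,
  IDpart2 INT,
  INDEX idx_hash_id_lookup (ID, Lookup) USING HASH
) ENGINE = MEMORY;

-- Index_type should read HASH for idx_hash_id_lookup.
SHOW INDEX FROM lookup_cache;

-- A hash index only helps equality comparisons on the whole key,
-- such as this one; it cannot serve range scans or ORDER BY.
SELECT IDpart1, IDpart2
FROM lookup_cache
WHERE ID = 1 AND Lookup = 2;
```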
Furthermore, be aware that indexes on combinations of ID and Lookup alone may not be optimal, as your WHERE predicate also filters on tablea.Elg_IDpart1 and tableb.IDpart1—you may benefit from indexing on those columns too.
Can I have a single hash index, and have the rest of the indexes be B-tree indexes?
Provided that the desired index types are supported by the storage engine, you can mix them as you see fit.
instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
You could use an index hint to force MySQL to use different indexes to those that the optimiser would otherwise have selected.
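A sketch of what such a hint might look like, assuming the composite index on tableb is really named ID_Lookup_Idx as listed in the question (the EXPLAIN output uses slightly different names, so adjust to whatever SHOW INDEX reports). It is written as a SELECT so the plan can be inspected without modifying data.

```sql
-- FORCE INDEX goes directly after the table reference (and its alias).
EXPLAIN
SELECT a.ID, a.Lookup, b.IDpart1, b.IDpart2
FROM schema1.tablea a
JOIN schema1.tableb b FORCE INDEX (ID_Lookup_Idx) USING (ID, Lookup)
WHERE a.Elg_IDpart1 IS NULL
  AND b.IDpart1 IS NOT NULL;
```

If the forced plan is not faster than the one the optimiser chose on its own, the hint is best removed again.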
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
It is usually smart enough, but not always. In this case, however, it has probably determined that the cardinality of the indexes is such that it is better to use those that it has chosen.
Now, depending on the version of MySQL that you are using, tables derived from subqueries may not have any indexes upon them that can be used for further processing: consequently the join with b may require a full scan of that derived table (there's insufficient information in your question to determine exactly how much of a problem this might be, but schema1.tableb having 1.5 million records suggests it could be a significant factor).
See Subquery Optimization for more information.
One should therefore try to avoid using derived tables if at all possible. In this case, there does not appear to be any purpose to your derived table as one could simply join schema1.tablea and schema1.tableb directly:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND a.ID IS NOT NULL
AND b.IDpart1 IS NOT NULL
AND b.Lookup IS NOT NULL
The only thing that has been lost is the filter for DISTINCT records, but that filter was not needed: in a multiple-table UPDATE, each matching row of tablea is updated only once, even if it matches several rows of tableb, so the de-duplication merely added cost (especially with so many records in that table).
The use of ORDER BY in the derived table was pointless too, as it could not be relied upon to impose any particular order on the UPDATE. It cannot be carried over to the revised statement either: MySQL does not permit ORDER BY (or LIMIT) in a multiple-table UPDATE, so it has simply been dropped, which also saves a sorting operation.
One should check the predicates in the WHERE clause: are they all necessary (the NOT NULL checks on a.ID and b.Lookup, for example, are superfluous given that any such NULL records will be eliminated by the JOIN predicate)?
Altogether, this leaves us with:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND b.IDpart1 IS NOT NULL
Only if performance is still unsatisfactory should one look further at the indexing. Are the relevant columns (i.e. those used in the JOIN and WHERE predicates) indexed? Are those indexes being selected by MySQL? Bear in mind that it will generally use only one index per table for a given lookup, so a single index has to serve both the JOIN predicate and the filter predicates; perhaps you need an appropriate composite index. Check the query execution plan by using EXPLAIN to investigate such issues further.
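As one possible reading of "an appropriate composite index", the sketch below adds an index on tableb that starts with the join columns and also contains the filtered and copied columns. The index name ID_Lookup_Parts_Idx is made up for illustration, and whether it actually helps depends on your data and MySQL version, so compare the plans before and after.

```sql
-- Hypothetical index: join columns first, then the columns the query
-- filters on and reads, so tableb can be served from the index alone.
CREATE INDEX ID_Lookup_Parts_Idx
  ON schema1.tableb (ID, Lookup, IDpart1, IDpart2);

-- Inspect the plan through an equivalent SELECT.
EXPLAIN
SELECT a.ID, a.Lookup, b.IDpart1, b.IDpart2
FROM schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
WHERE a.Elg_IDpart1 IS NULL
  AND b.IDpart1 IS NOT NULL;
```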

Related

Why - or when - doesn't MySQL use indexes for OR conditions, if it does for AND conditions?

I have a table the_table with attributes the_table.id, the_table.firstVal and the_table.secondVal (the primary key is the_table.id, of course).
After defining an index over the first non-key attribute like this:
CREATE INDEX idx_firstval
ON the_table (firstVal);
The EXPLAIN result for the following disjunctive (OR) query
SELECT * FROM the_table WHERE the_table.firstVal = 'A' OR the_table.secondVal = 'B';
is
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
| 1 | SIMPLE | the_table | ALL | idx_firstval | NULL | NULL | NULL | 3436 | Using where
which shows that the index idx_firstval is not used. Now, the EXPLAIN result for the following conjunctive (AND) query
SELECT * FROM the_table WHERE the_table.firstVal = 'A' AND the_table.secondVal = 'B';
is
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
| 1 | SIMPLE | the_table | ref | idx_firstval | idx_firstval | 767 | const | 124 | Using index condition; Using where
which shows the index in use, this time around.
Why is MySQL choosing not to use indexes for the disjunctive query, but it is for the conjunctive one?
I've scoured SO, and as suggested by the answer in this thread, "using OR in a query will often cause the Query Optimizer to abandon use of index seeks and revert to scans". However, this doesn't answer why it happens, just that it does.
Another thread tries to answer why a disjunctive query doesn't use indexes, but I think it fails at doing so - it is merely concluded that the OP is using a small database. I'm wanting to know the difference between the disjunctive and the conjunctive case.
Because a MySQL execution plan generally uses only one index per table (index merge aside), and here only firstVal is indexed.
If MySQL uses range scan on idx_firstval to satisfy equality predicate on firstVal column, that leaves MySQL still needing to check the condition on secondVal column.
With the AND, MySQL only needs to check the rows returned from the range scan of the index. The set of rows that need to be checked is constrained by the condition.
With the OR, MySQL also needs to check the rows that were not returned by the index range scan, i.e. all the rest of the rows in the table. Without an index on secondVal, that means a full scan of the table. And if we're doing a full scan of the table to check secondVal, then it will be less expensive to check both conditions during that scan (i.e. a plan that includes an index access as well as a full scan will be more expensive).
(If a composite index containing both firstVal and secondVal is available, then for the OR query, it is conceivable that optimizer might think its less expensive to check all the rows in the table by doing a full index scan, and then looking up the data pages.)
When we understand what operations are available to the optimizer, that leads us to avoid the OR and to rewrite the query so that it returns an equivalent resultset, using a query pattern that more explicitly defines a combination of two sets:
SELECT a.*
FROM the_table a
WHERE a.firstVal = 'A'
UNION ALL
SELECT b.*
FROM the_table b
WHERE b.secondVal = 'B'
AND NOT ( b.firstVal <=> 'A' )
(Add an ORDER BY if we expect rows to be returned in a particular order)
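For the second branch of that UNION ALL to avoid a full scan, secondVal needs an index of its own. A minimal sketch, assuming an index name idx_secondval that does not appear in the question:

```sql
-- Assumed index name; the question only defines idx_firstval.
CREATE INDEX idx_secondval ON the_table (secondVal);

-- Each SELECT can now pick its own index; check the "key" column
-- of both rows in the EXPLAIN output.
EXPLAIN
SELECT a.*
FROM the_table a
WHERE a.firstVal = 'A'
UNION ALL
SELECT b.*
FROM the_table b
WHERE b.secondVal = 'B'
  AND NOT ( b.firstVal <=> 'A' );
```

With both single-column indexes in place, the optimizer may also be able to handle the original OR query through an index merge, so it is worth running EXPLAIN on that form as well.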
I am surprised that MySQL is using an index for either of the two queries. The correct index to use here would be a composite index which covers the two columns in the WHERE clause:
CREATE INDEX idx ON the_table (firstVal, secondVal);
As to why MySQL is using the index in the second case, one possibility might be if most of the records in the_table have firstVal values which are not A. In this case, simply knowing that the equality the_table.firstVal = 'A' is false would mean that the entire outcome of the WHERE clause would be known (as false). So, the answer as to why the index is being used could have something to do with the cardinality of your exact data. But in any case, consider using the composite index to cover all bases.

What is the "Default order by" for a mysql Innodb query that omits the Order by clause?

So i understand and found posts that indicates that it is not recommended to omit the order by clause in a SQL query when you are retrieving data from the DBMS.
Resources & Post consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions :
See logic of the question below if you want to know more.
My question is : under mysql with innoDB engine, does anyone know how the DBMS effectively gives us the results ?
I read that it is implementation dependent, ok, but is there a way to know it for my current implementation ?
Where is this defined exactly ?
Is it from MySQL, InnoDB , OS-Dependent ?
Isn't there some kind of list out there ?
Most importantly, if I omit the ORDER BY clause and get my result, I can't be sure that this code will still work with newer database versions, or that the DBMS will always give me the same result, can I?
Use case & Logic :
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (there is a PK though), so when I'm showing the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I mean, I could use the PK or any field that is never null, but the ordering wouldn't be meaningful. So I was wondering: as my CRUD is supposed to work for any table and I don't want to solve this problem by adding an exception for this specific table, could I simply omit the ORDER BY clause?
Final Note :
As i'm reading other posts, examples and code samples, i'm feeling like i want to go too far. I understand that it is common knowledge that it's just a bad practice to omit the Order By clause in a request and that there is no reliable default order clause, not to say that there is no order at all unless you specify it.
I'd just love to know where this is defined, and would love to learn how this works internally (DBMS / Storage Engine / OS-dependent / Other / Multiple criteria). I think it would also benefit other people to know it, and to understand the inner mechanisms in place here.
Thanks for taking the time to read anyway ! Have a nice day.
Without a clear ORDER BY, current versions of InnoDB return rows in the order of the index it reads from. Which index varies, but it always reads from some index. Even reading from the "table" is really an index—it's the primary key index.
As in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as a coincidental behavior, it is not documented and the makers of MySQL don't promise not to change it.
Even if their implementation doesn't change, reading in index order can cause some strange effects that you might not expect, and which won't give you query result sets that make sense to you.
For example, the default index is the clustered index, PRIMARY. It means index order is the same as the order of values in the primary key (not the order in which you insert them).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name |
+----+----------+
| 1 | Harry |
| 2 | Ron |
| 3 | Hermione |
+----+----------+
But if your query uses another index to read the table, like if you only access column(s) of a secondary index, you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name |
+----------+
| Harry |
| Hermione |
| Ron |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | mytable | index | NULL | name | 83 | NULL | 3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
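Applied to the example table above, that could look like the following sketch. Adding the primary key as a final sort column is my suggestion rather than something the question requires; it keeps the order stable when the first sort column contains duplicates.

```sql
-- Deterministic order regardless of which index InnoDB reads from.
SELECT * FROM mytable ORDER BY id;

-- Sorting on a non-unique column: add the primary key as a tie-breaker.
SELECT * FROM mytable ORDER BY name, id;
```

For a generic CRUD layer, ordering by the table's primary key columns is a reasonable default even when the key is not called "id".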
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to be an ordered list that propagates into the parent query's ordering. But recently, the ORDER BY is being thrown away in that subquery! (Unless there is a LIMIT.)
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.

Make MySQL read from multiple indexes?

Let's start off with a simple example:
CREATE TABLE `test` (
`id` INT UNSIGNED NOT NULL,
`value` CHAR(12) NOT NULL,
INDEX (`id`),
INDEX (`value`)
) ENGINE = InnoDB;
So 2 columns, both indexed. What I thought this meant was that MySQL would never have to read the actual table anymore, since all the data is stored in an index.
mysql> EXPLAIN SELECT id FROM test WHERE id = 1;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| 1 | SIMPLE | test | ref | id | id | 4 | const | 1 | Using index |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
"Using index", very nice. To my understanding this means that it is reading data from the index and not from the actual table. But what I really want is the "value" column.
mysql> EXPLAIN SELECT value FROM test WHERE id = 1;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
| 1 | SIMPLE | test | ref | id | id | 4 | const | 1 | |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
Hmm, no "using index" this time.
I thought it might help if I add an index that covers both columns.
ALTER TABLE `test` ADD INDEX `id_value` (`id`,`value`);
Now let's run that previous select-statement again and tell it to use the new index.
mysql> EXPLAIN SELECT id, value FROM test USE INDEX (id_value) WHERE id = 1;
+----+-------------+-------+------+---------------+----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | test | ref | id_value | id_value | 4 | const | 1 | Using index |
+----+-------------+-------+------+---------------+----------+---------+-------+------+-------------+
Praise the Lord, it's reading from the index.
But actually I don't really need the combined index for anything else. Is it possible to make MySQL read from 2 separate indexes?
Any insights would be greatly appreciated.
EDIT: Ok, yet another example. This one is with the original table definition (so an index on each column).
mysql> EXPLAIN SELECT t1.value
-> FROM test AS t1
-> INNER JOIN test AS t2
-> ON t1.id <> t2.id AND t1.value = t2.value
-> WHERE t1.id = 1;
+----+-------------+-------+------+---------------+-------+---------+----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-------+---------+----------+------+-------------+
| 1 | SIMPLE | t1 | ref | id,value | id | 4 | const | 1 | |
| 1 | SIMPLE | t2 | ref | value | value | 12 | t1.value | 1 | Using where |
+----+-------------+-------+------+---------------+-------+---------+----------+------+-------------+
This must certainly read from both indexes (since both fields are used in the join condition) yet it STILL reads the data from the actual record, right? Why doesn't it just use the data it has read from the index? Or does it actually use that data without saying "using index"?
Thanks again
The key, ref and rows columns are more telling for this purpose. In each case, they indicate that MySQL has selected an index, has a value to lookup in that index, and is retrieving only one row from the table as a result. This is what you were after.
In your second query, MySQL still needs to retrieve the value from the record even though it has located the record on id via an index. If your WHERE criterion looked up based on value, then that index would have been used and there would have been no need to retrieve the record.
The manual on Using index Extra information:
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being used to perform lookups of key values. Without Using where, the optimizer may be reading the index to avoid reading data rows but not using it for lookups. For example, if the index is a covering index for the query, the optimizer may scan it without using it for lookups.
For InnoDB tables that have a user-defined clustered index, that index can be used even when Using index is absent from the Extra column. This is the case if type is index and key is PRIMARY.
In your first query, MySQL says using index because it can answer your query by looking at the index and the index alone. It does not need to go to the table to look up the corresponding value for the id column, because that's actually the same thing it's already got in the index.
In the second query, MySQL does need to look at the table to fetch the correct value, but it's still using the index, as you can see in the key column of your EXPLAIN statement.
In the third query, MySQL again doesn't have to look at the table anymore, because all the information it needs to answer your query is right there in the multiple-column index.
Just think a bit about how indexes work.
Say, you have 10k records in your test table and index on the value column. While you're populating your table with data (or explicitly using ANALYZE command), database is keeping statistics on your table and all indexes.
At the moment you issue your query, there're several ways how to deliver you the data. In the very simplified case of test table and value column, something like:
SELECT * FROM test WHERE value = 'a string';
database query planner has 2 options:
performing a sequential scan on the whole table and filter the results or
performing index scan to lookup the desired data entries.
Querying indexes has some performance penalty, as database must seek for the value in the index. If we take that you have a B-tree index in a "good shape" (i.e. balanced), then you'll find your entry in at most 14 lookups in the index (as 2^14 > 10k, I hope I'm not mistaken here). So, in order to deliver you 1 row with a string value, database will have to perform up to 14 lookups in the index and 1 extra lookup in your table. In the unlucky case, this will mean system will perform 15 random I/O operations to read in custom data portions from your disk.
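The figure of 14 is simply the base-2 logarithm of the row count, rounded up, which treats the index as a balanced binary tree. MySQL can do the arithmetic itself:

```sql
-- ceil(log2(10000)) = ceil(13.29) = 14
SELECT CEIL(LOG2(10000)) AS max_binary_lookups;  -- returns 14
```

Real B-tree pages hold many keys each, so the number of levels actually traversed is usually much smaller than this binary-tree bound; the estimate is best read as a pessimistic upper limit.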
In the case there's only one value that requires lookup in the index and that your table is quite big in size, index operations will give you a significant performance boost.
But there's a point after which an index scan becomes more expensive than a straightforward sequential scan:
when your table occupies a really small amount of space on the disk;
when your query requires looking up around 10% of the total number of records in the test table (the 10% figure is very approximate, don't take it for granted).
Things to consider:
comparison operations for numeric data types are significantly cheaper than comparing strings;
statistics accuracy;
how often index / table is queried, or which probability it is to find needed data in the database's shared pool.
These all affect performance and also the plans that the database chooses to deliver the data.
So, indexes are not always good.
To answer your question about reading from 2 separate indexes: the feature you're looking for is called a bitmap index, and it is not available in MySQL as far as I know.
New with 5.0, MySQL can utilize more than one index on a table with Index merge, though they're not as speedy (by far) as multi-column covering indexes, so MySQL will only use them in special cases.
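A quick way to see whether index merge kicks in is to run EXPLAIN on a query whose two conditions are each served by a different single-column index. The sketch below uses the original test table (separate indexes on id and value); the literal 'abc' is just a placeholder value.

```sql
-- If the optimizer picks index merge, "type" shows index_merge and
-- "Extra" mentions something like "Using union(id,value)".
-- On a tiny table it may well prefer a full scan instead.
EXPLAIN
SELECT id, value
FROM test
WHERE id = 1 OR value = 'abc';
```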
So, other than the merge index case, MySQL only uses one index per table.
Don't be too afraid of covering indexes. They can serve double duty. Indexes are left most prefixed, so you can use a multi-column index for just the left most column, or the first and second, and so on.
For example, if you have the multi-column index id_value (id,value), you can delete the index id (id), since it's redundant. The id_value index can also be used for just the id column.
Also, with InnoDB, every secondary index automatically includes the primary key column(s), so if id were your primary key, an index on value would behave like an index on (value, id) and could cover a query that reads only those two columns.
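To illustrate the leftmost-prefix point with the table from the question, the following sketch assumes the id_value (id, value) index has already been added as shown there:

```sql
-- The single-column index on id is redundant once id_value exists.
ALTER TABLE test DROP INDEX id;

-- Both lookups can now be answered from id_value; with the index
-- covering the selected columns, Extra should report "Using index".
EXPLAIN SELECT id    FROM test WHERE id = 1;
EXPLAIN SELECT value FROM test WHERE id = 1;
```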
Every index does negatively affect inserts, and updates against the indexed columns. There's a trade off, and only you (and some testing) can decide if covering indexes are right for you.
Deletes don't have much impact on indexes because they're just "marked for deletion", and they only get purged when your system's load is low.
Indexes also use up memory. Given enough memory, a properly configured MySQL server will have every index loaded in memory. This makes selects that utilize a covering index super fast.

Index a query "WHERE a IN (1,2,3) AND b = 4"

I am attempting to apply an index that will speed up one of the slowest queries in my application:
SELECT * FROM orders WHERE product_id IN (1, 2, 3, 4) AND user_id = 5678;
I have an index on product_id, user_id, and the pair (product_id, user_id). However, the server does not use any of these indexes:
+----+-------------+--------+------+-------------------------------------------------------------------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+-------------------------------------------------------------------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | orders | ALL | index_orders_on_product_id,index_orders_on_user_id,index_orders_on_product_id_and_user_id | NULL | NULL | NULL | 6 | Using where |
+----+-------------+--------+------+-------------------------------------------------------------------------------------------+------+---------+------+------+-------------+
(There are only 6 rows on development, so whatever, but on production there are about 400k rows, so execution takes about 0.25s, and this query is fired pretty darn often.)
How can I avoid a simple WHERE here? I suppose I could send a query for each product_id, which would likely be faster than this version, but the number of products could be very high, so if it's doable in one query that would be significantly preferable. This query is generated by Rails, so I'm a bit limited in how much I can restructure the query itself. Thanks!
For optimal performance of this particular query on your production table (with 400k rows), you need a composite index on {user_id, product_id}, in that order.
Ideally, this would be the only index, and you would use InnoDB so the table is clustered. Every additional index incurs a penalty when modifying data, and on top of that secondary indexes in clustered tables are even more expensive than secondary indexes in heap-based tables.
To understand why user_id (and not product_id) should be at the leading edge of the index, please take a look at the Anatomy of an Index. Essentially, since WHERE searches for only one user_id, putting it first clusters the related product_id values closer together in the index.
(The {product_id, user_id} would also work, but would "scatter" the "target" index nodes less favorably.)
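In plain SQL the suggested index could be created as below. The index name follows the Rails-style naming visible in the EXPLAIN output but is my own choice; in a Rails project you would normally add it through a migration instead.

```sql
-- Equality column (user_id) first, IN-list column (product_id) second.
CREATE INDEX index_orders_on_user_id_and_product_id
  ON orders (user_id, product_id);

-- Re-check the plan against production-sized data.
EXPLAIN
SELECT *
FROM orders
WHERE product_id IN (1, 2, 3, 4)
  AND user_id = 5678;
```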
When there are so few rows in the table, MySQL does not use indexes, because it's cheaper to do a full scan. Try running EXPLAIN in your prod environment and see if it uses one of your indexes.
Also, note that you can eliminate one of your indexes, index_orders_on_product_id, because you already have another index (index_orders_on_product_id_and_user_id) that starts with the product_id field.

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| 1 | SIMPLE | accesslog | range | date,dateurl,dateaff | date | 3 | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| 1 | SIMPLE | accesslog | index | NULL | accessid | 257 | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing column width on accessid from varchar 255 to char 32 improved query time by ~75%.
Adding a date+accessid index had no effect on query time.
An index on (date,accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should be INT (SMALLINT, BIGINT, whatever fits your needs) and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32) for example) instead?
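If the IDs do turn out to be fixed-length, the change might look like this sketch. The CHAR(32) length and the NOT NULL attribute are assumptions; MODIFY replaces the whole column definition, so copy any other attributes from the existing definition first.

```sql
-- Inspect the current column definition before changing it.
SHOW CREATE TABLE accesslog;

-- Assumes every accessid is exactly 32 characters and never NULL.
ALTER TABLE accesslog
  MODIFY accessid CHAR(32) NOT NULL;
```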
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely won't help the situation, as MySQL can't use index columns after a range condition. In theory it should be able to use the index to cover the computation in this case, but this appears to be a shortcoming in MySQL: I've never gotten it to use a multi-column index in this situation successfully.
You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.
Edit:
My answer is courtesy of High Performance MySQL - Second Edition, worth its weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z, I don't know much about MySQL) would benefit from an index on date+accessid since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to efficiently use the where clause to whittle down the search space and to return distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because that's what you use in the where clause.
This is the only sensible option, if it used the access id index it would need to read all the accessid rows then check the date before it and only then decide if it was distinct.
If this is a really big table a compound index on date and accessid might help.
Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table - in fact, it would do better in this context to ignore the index, but I started off with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that accesses an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.
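Trying this out is cheap. A sketch, with date_accessid as an assumed index name:

```sql
-- Range column first, then the column the query returns.
CREATE INDEX date_accessid ON accesslog (`date`, accessid);

-- If the new index is used to cover the query, "key" shows
-- date_accessid and "Extra" includes "Using index".
EXPLAIN
SELECT DISTINCT accessid
FROM accesslog
WHERE `date` > '2009-09-01';
```

Note that the asker reported no change in query time from such an index in their case, so measure on your own data rather than assuming a gain.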
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMS behave differently, and sometimes you need to simply try (and fail) various combinations. I’m not saying it’s not possible to reason. It is in many cases, but up to a certain point. Often it’s simply faster and easier to follow your instinct.