Very simple AVG() aggregation query on MySQL server takes a ridiculously long time

I am using MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is an InnoDB table with about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN for this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.
I don't have an index on 01. Is that an issue? Why does AVG() take so long to compute? There must be something I didn't do properly.
Any suggestion is appreciated!

To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast; after all, that is what indexes are made for) and then read the subsequent index entries until it finds the next date. Depending on the datatype of esi, this adds up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It gets worse because "some bytes" is really a whole page of innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
A solution for this is to include all used columns in the index (a "covering index"), e.g. create an index on (date, 01). Then MySQL does not need to access the table itself and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the AVG operation), but it should still be a matter of seconds.
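A minimal sketch of such an index; the index name is just a placeholder, and the backticks are required because the column name 01 starts with a digit:
ALTER TABLE mytable ADD INDEX date_01 (`date`, `01`);
The query itself stays unchanged; EXPLAIN should then show "Using index" in the Extra column.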
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the averages for several columns at the same time, you would need a covering index on all of them, e.g. (date, 01, 02, ..., 24), to prevent table access. Be aware that an index that contains all columns requires about as much storage space as the table itself (and it will take a long time to create such an index), so whether it is worth those resources depends on how important this query is.
To avoid the MySQL limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes (date, 01, ..., 12) and (date, 13, ..., 24), then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
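For reference, the two covering indexes could be created with something like this (the index names are just placeholders); each has 13 columns, which stays under the 16-column limit:
ALTER TABLE mytable
  ADD INDEX date_01_12 (`date`, `01`, `02`, `03`, `04`, `05`, `06`, `07`, `08`, `09`, `10`, `11`, `12`),
  ADD INDEX date_13_24 (`date`, `13`, `14`, `15`, `16`, `17`, `18`, `19`, `20`, `21`, `22`, `23`, `24`);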
If you only ever average over a single column at a time, you could instead add 24 separate indexes ((date, 01), (date, 02), ...). In total they will require even more space, but each individual index is smaller, so the queries might be a little bit faster. However, the buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
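If you did go that route, the change would be roughly the following; note that it rebuilds the whole table (and all secondary indexes), which will take a very long time on a billion rows:
ALTER TABLE mytable DROP PRIMARY KEY, ADD PRIMARY KEY (`date`, ESI);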
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
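A rough sketch of such a summary table (the names are made up, and only 01 is shown; the other 23 averages would get analogous columns):
CREATE TABLE mytable_daily_avg (
  `date`    date         NOT NULL PRIMARY KEY,
  row_count int unsigned NOT NULL,
  avg_01    double       DEFAULT NULL
);

-- refresh one day at a time, e.g. nightly or from a trigger/cron job
INSERT INTO mytable_daily_avg (`date`, row_count, avg_01)
SELECT `date`, COUNT(*), AVG(`01`)
FROM mytable
WHERE `date` = '2017-11-01'
GROUP BY `date`
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count), avg_01 = VALUES(avg_01);
The slow query then becomes a single-row primary key lookup on the summary table.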

For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or something else, you lose this optimization.

Related

mysql not picking up the optimal index

Here's my table:
CREATE TABLE `idx_weight` (
`ID` bigint(20) NOT NULL AUTO_INCREMENT,
`SECURITY_ID` bigint(20) NOT NULL,
`CONS_ID` bigint(20) NOT NULL,
`EFF_DATE` date NOT NULL,
`WEIGHT` decimal(9,6) DEFAULT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `BPK_AK` (`SECURITY_ID`,`CONS_ID`,`EFF_DATE`),
KEY `idx_weight_ix` (`SECURITY_ID`,`EFF_DATE`)
) ENGINE=InnoDB AUTO_INCREMENT=75334536 DEFAULT CHARSET=utf8
For query 1:
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 1782:
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
| 1 | SIMPLE | idx_weight | ref | BPK_AK,idx_weight_ix | idx_weight_ix | 8 | const | 887856 | Using index |
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
This query runs fine.
Now Query 2 (the only thing changed is the security_id param):
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 26622:
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
| 1 | SIMPLE | idx_weight | ref | BPK_AK,idx_weight_ix | BPK_AK | 8 | const | 10700002 | Using index |
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
Notice that it picks up the index BPK_AK, and the actual query runs for over 1 minute.
This choice is incorrect. A second run still took over 10 seconds; I'm guessing that on the first run the index was not yet in the buffer pool.
I can get a workaround by appending group by security_id:
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 26622 group by security_id:
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
| 1 | SIMPLE | idx_weight | range | BPK_AK,idx_weight_ix | idx_weight_ix | 8 | NULL | 10314 | Using where; Using index for group-by |
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
But I still don't understand why MySQL would not pick idx_weight_ix for some security_id values, since it is a covering index for this query (and a lot cheaper). Any ideas?
=========================================================================
Update:
#oysteing
Learned a new trick, cool! :)
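For anyone else who wants to reproduce this: a trace like the ones below can be captured roughly like this (MySQL 5.6 or later):
SET optimizer_trace = 'enabled=on';
-- run the query to analyse, e.g.:
SELECT SECURITY_ID, MIN(EFF_DATE), MAX(EFF_DATE) FROM idx_weight WHERE security_id = 26622;
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';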
Here's the optimizer trace:
Query 1: https://gist.github.com/aping/c4388d49d666c43172a856d77001f4ce
Query 2: https://gist.github.com/aping/1af5504b428ca136a8b1c41c40d763e4
And some extra information that might be useful:
From INFORMATION_SCHEMA.STATISTICS:
+------------+---------------+--------------+-------------+-------------+
| NON_UNIQUE | INDEX_NAME | SEQ_IN_INDEX | COLUMN_NAME | CARDINALITY |
+------------+---------------+--------------+-------------+-------------+
| 0 | BPK_AK | 1 | SECURITY_ID | 74134 |
| 0 | BPK_AK | 2 | CONS_ID | 638381 |
| 0 | BPK_AK | 3 | EFF_DATE | 68945218 |
| 1 | idx_weight_ix | 1 | SECURITY_ID | 61393 |
| 1 | idx_weight_ix | 2 | EFF_DATE | 238564 |
+------------+---------------+--------------+-------------+-------------+
The CARDINALITY values for SECURITY_ID are different, but technically they should be exactly the same, am I right?
From this: https://dba.stackexchange.com/questions/49656/find-the-size-of-each-index-in-a-mysql-table
+---------------+-------------------+
| index_name | indexentry_length |
+---------------+-------------------+
| BPK_AK | 1376940279 |
| idx_weight_ix | 797175951 |
+---------------+-------------------+
The index size is about 800MB vs 1.3GB.
Running select count(*) from idx_weight where security_id = 1782 returns 509994
and select count(*) from idx_weight where security_id = 26622 returns 5828054
Then force using BPK_AK for query 1:
select SQL_NO_CACHE SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight use index (BPK_AK) where security_id = 1782 took 0.2 sec.
So basically, 26622 has 10 times more rows than 1782, but using the same index, it took 50 times more time.
PS: buffer pool size is 25GB.
The optimizer traces show that the difference in index selection is due to the estimates received from InnoDB. For each potential index, the optimizer asks the storage engine for an estimate of how many records are in the range. For the first query it gets the following estimates:
BPK_AK: 1031808
idx_weight_ix: 887856
So the estimated read cost is lowest for idx_weight_ix, and this index is chosen. For the second query the estimates are:
BPK_AK: 11092112
idx_weight_ix: 12003098
And the estimated read cost of BPK_AK is lowest due to the lower number of rows. You could say that MySQL should know that the real number of rows in the range is the same in both cases, but that logic has not been implemented.
I do not know the details of how InnoDB computes these estimates, but it basically does two "index dives" to find the first and last row in the range, and then somehow computes the "distance" between the two. It could be that the estimates are affected by unused space in index pages, and that OPTIMIZE TABLE could fix this, but running OPTIMIZE TABLE will probably take a very long time on such a large table.
The quickest way to solve this is to add a GROUP BY clause, as mentioned by a few other people here. Then MySQL will only need to read 2 rows per group (the first and the last), since the index is ordered by EFF_DATE for each value of security_id. Alternatively, you could use FORCE INDEX to force a particular index.
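A sketch of the FORCE INDEX variant for the second query (with ONLY_FULL_GROUP_BY enabled you would also add GROUP BY SECURITY_ID):
SELECT SECURITY_ID, MIN(EFF_DATE) AS startDate, MAX(EFF_DATE) AS endDate
FROM idx_weight FORCE INDEX (idx_weight_ix)
WHERE security_id = 26622;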
It may also be that MySQL 8.0 will handle this query better. The cost model has changed somewhat, and it puts a higher cost on "cold" indexes that are not cached in the buffer pool.
When you mix normal columns (SECURITY_ID) and aggregate functions (MIN and MAX in your case), you should use GROUP BY. If you do not, MySQL is free to give any result it pleases. With GROUP BY, you will get the correct result. Newer MySQL versions enforce this behavior by default.
The reason the second index is not selected when you leave out the GROUP BY is most likely that the aggregate functions are not limited to the same group (= security_id) and therefore cannot be used as a limiter.
I can get a workaround by appending group by security_id
Well, yes. I wouldn't do it any other way, since when you use aggregate functions you NEED to group by something. I didn't even know that MySQL allowed you to work around it.
I think #slaakso is right. Upvote him.

MariaDB - Should I add index to my table?

Recently I was checking my system logs and I noticed some of my queries are very slow.
I have a table that stores user activites. The table structure is id (int), user (int), type (int), object (varchar), extra (mediumtext) and date (timestamp).
Also I only have index for id (BTREE, unique).
I have performance issues for following query;
SELECT DISTINCT object as usrobj
from ".MV15_PREFIX."useractivities
WHERE user='".$user_id."'
and type = '3'
limit 0,1000000"
The question is: should I also index user, the same way as id? What best practices should I follow?
This table is actively used and has over 500k rows in it, and there are around 2k concurrent users online on the site on average.
The reason I am asking is that I am not really good at managing databases, and I also have a slow query issue on another table which has proper indexes.
Thanks in advance for suggestions.
Side note:
Result of mysqltuner
General recommendations:
Reduce or eliminate persistent connections to reduce connection usage
Adjust your join queries to always utilize indexes
Temporary table size is already large - reduce result set size
Reduce your SELECT DISTINCT queries without LIMIT clauses
Consider installing Sys schema from https://github.com/mysql/mysql-sys
Variables to adjust:
max_connections (> 768)
wait_timeout (< 28800)
interactive_timeout (< 28800)
join_buffer_size (> 64.0M, or always use indexes with joins)
(I will set max_connections > 768. I'm not really sure about the timeouts, and from what I've read on Stack Overflow I think I shouldn't increase join_buffer_size, but I'd really appreciate feedback on these variables too.)
EDIT - SHOW INDEX result;
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| ***_useractivities | 0 | PRIMARY | 1 | id | A | 434006 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_index | 1 | user | A | 13151 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_type_index | 1 | user | A | 10585 | NULL | NULL | | BTREE | | |
| ***_useractivities | 1 | user_type_index | 2 | type | A | 13562 | NULL | NULL | | BTREE | | |
+--------------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Most of these rules of thumb for PostgreSQL indexing apply to most SQL database management systems.
https://dba.stackexchange.com/a/31517/1064
So, yes, you will probably benefit from an index on user and an index on type. You might benefit more from an index on the pair user, type.
You'll benefit from learning how to read an execution plan, too.
For that query, either of these is optimal:
INDEX(user, type)
INDEX(type, user)
Separate indexes (INDEX(user), INDEX(type)) are likely to be not nearly as good.
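For reference, the composite index could be added with something like this (the real table name is masked in the question, so the one below is a placeholder):
ALTER TABLE useractivities ADD INDEX idx_user_type (user, type);
The SHOW INDEX output added in the edit suggests such a user_type_index on (user, type) has since been created.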
MySQL's InnoDB has only BTree, not Hash. Anyway, BTree is essentially as good as Hash for 'point queries', and immensely better for 'range' queries.
Indexing tips.
Indexes help SELECTs and UPDATEs, sometimes a lot. Use them. The side effects are minor -- such as extra disk space used.

Why an index can make a query really slow?

Some time ago I answered a question on SO (accepted as correct), but the answer left me with a big doubt.
In short, the user had a table with these fields:
id INT PRIMARY KEY
dt DATETIME (with an INDEX)
lt DOUBLE
The query SELECT DATE(dt),AVG(lt) FROM table GROUP BY DATE(dt) was really slow.
We told him that (part of) the problem was using DATE(dt) as the grouping expression, but the db was on a production server and it wasn't possible to split that field.
So another field, da DATE (with an INDEX), was added and filled automatically with DATE(dt) by a trigger. The query SELECT da, AVG(lt) FROM table GROUP BY da was a bit faster, but with about 8 million records it still took about 60s!!!
I tried it on my PC and discovered that, after removing the index on da, the query took only 7s, while the DATE(dt) version, also without an index, took 13s.
I've always thought an index on the column used for grouping could really speed the query up, not slow it down (8 times slower!!!).
Why? Which is the reason?
Thanks a lot.
Because you still need to read all the data from both the index and the data file. Since you're not using any WHERE condition, you will always get a query plan that accesses all the data, row by row, and there is nothing you can do about that.
If performance is important for this query and it is performed often, I'd suggest caching the results in some summary table and updating it hourly (daily, etc.).
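A rough sketch of that kind of caching, using the column names from the question (the source table is just called table here, as in the question):
CREATE TABLE daily_lt_avg (
  da     date   NOT NULL PRIMARY KEY,
  avg_lt double NOT NULL
);

-- refresh periodically (hourly, daily, ...), e.g. from cron or a scheduled event
REPLACE INTO daily_lt_avg (da, avg_lt)
SELECT da, AVG(lt) FROM `table` GROUP BY da;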
Why it becomes slower: because the data in the index is already sorted, and when MySQL calculates the cost of the query execution it thinks it will be better to use the already sorted data, then group it, then calculate the aggregates. But in this case it is not.
I think this is because of this or a similar MySQL bug: Index degrades sort performance and optimizer does not honor IGNORE INDEX
I remember the question, as I was going to answer it but got distracted by something else. The problem was that his table design wasn't taking advantage of a clustered primary key index.
I would have redesigned the table, creating a composite clustered primary key with the date as the leading part of the index. The sm_id field is still just a sequential unsigned int to guarantee uniqueness.
drop table if exists speed_monitor;
create table speed_monitor
(
  created_date   date not null,
  sm_id          int unsigned not null,
  load_time_secs double(10,4) not null default 0,
  primary key (created_date, sm_id)
)
engine=innodb;
+------+----------+
| year | count(*) |
+------+----------+
| 2009 | 22723200 | 22 million
| 2010 | 31536000 | 31 million
| 2011 | 5740800 | 5 million
+------+----------+
select
created_date,
count(*) as counter,
avg(load_time_secs) as avg_load_time_secs
from
speed_monitor
where
created_date between '2010-01-01' and '2010-12-31'
group by
created_date
order by
created_date
limit 7;
-- cold runtime
+--------------+---------+--------------------+
| created_date | counter | avg_load_time_secs |
+--------------+---------+--------------------+
| 2010-01-01 | 86400 | 1.66546802 |
| 2010-01-02 | 86400 | 1.66662466 |
| 2010-01-03 | 86400 | 1.66081309 |
| 2010-01-04 | 86400 | 1.66582251 |
| 2010-01-05 | 86400 | 1.66522316 |
| 2010-01-06 | 86400 | 1.66859480 |
| 2010-01-07 | 86400 | 1.67320440 |
+--------------+---------+--------------------+
7 rows in set (0.23 sec)

Scaling a High Score Database

I have a simple high score service for an online game, and it has become more popular than expected. The high score service is a web service which uses a MySQL backend with a simple table, shown below. Each high score record is stored as a row in this table. The problem is that, with >140k rows, I see certain key queries slowing down so much that the service will soon be too slow to handle requests.
The main table looks like this:
id is a unique key for each score record
game is the ID number of the game which submitted the score (currently always equal to "1", though it will soon have to support more games)
name is the display name for that player's submission
playerId is a unique ID for a given user
score is a numeric score representation ex 42,035
time is the submission time
rank is a large integer which uniquely sorts the score submissions for a given game. It is
common for people to tie at a certain score, in which case the tie is broken by who submitted first. Therefore this field's value is roughly equal to "score * 100000000 + (MAX_TIME - time)"
+----------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| game | int(11) | YES | MUL | NULL | |
| name | varchar(100) | YES | | NULL | |
| playerId | varchar(50) | YES | | NULL | |
| score | int(11) | YES | | NULL | |
| time | datetime | YES | | NULL | |
| rank | decimal(50,0) | YES | MUL | NULL | |
+----------+---------------+------+-----+---------+----------------+
The indexes look like this:
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pozscores | 0 | PRIMARY | 1 | id | A | 138296 | NULL | NULL | | BTREE | |
| pozscores | 0 | game | 1 | game | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 0 | game | 2 | rank | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 1 | rank | 1 | rank | A | 138296 | NULL | NULL | YES | BTREE | |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
When a user requests high scores, they typically request around 75 high scores from an arbitrary point in the "sorted by rank descending list". These requests are typically for "alltime" or just for scores in the past 7 days.
A typical query looks like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 0, 75;" and runs in 0.00 sec.
However, if you request towards the end of the list
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 10000, 75;" and runs in 0.06 sec.
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 100000, 75;" and runs in 0.58 sec.
It seems like this will quickly start taking way too long as several thousand new scores are submitted each day!
Additionally, there are two other types of queries, used to find a particular player by id in the rank ordered list.
They look like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? AND playerId=? ORDER BY rank DESC LIMIT 1"
followed by a
"SELECT count(id) as count FROM scoretable WHERE game=1 AND time>? AND rank>[rank returned from above]"
My question is: What can be done to make this a scalable system? I can see the number of rows growing to be several million very soon. I was hoping that choosing some smart indexes would help, but the improvement has only been marginal.
Update:
Here is an explain line:
mysql> explain SELECT * FROM scoretable WHERE game=1 AND time>0 ORDER BY rank DESC LIMIT 100000, 75;
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | scoretable| range | game | game | 5 | NULL | 138478 | Using where |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
Solution Found!
I have solved the problem thanks to some of the pointers from this thread. Doing a clustered index was exactly what I needed, so I converted the table to InnoDB in MySQL, which supports clustered indexes. Next, I removed the id field and set the primary key to (game ASC, rank DESC). Now all queries run super fast, no matter what offset I use. EXPLAIN shows that no additional sorting is being done, and it looks like it is easily handling all the traffic.
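Roughly, that conversion could look like this, assuming the table is pozscores as in the SHOW INDEX output and that game and rank contain no NULLs. Note that DESC in an index definition is parsed but ignored before MySQL 8.0, which is fine here because the rows can simply be read in descending order:
-- drop AUTO_INCREMENT before the primary key can be removed
ALTER TABLE pozscores MODIFY id int(11) NOT NULL;

ALTER TABLE pozscores
  DROP PRIMARY KEY,
  DROP COLUMN id,
  MODIFY game int(11) NOT NULL,          -- primary key columns must be NOT NULL
  MODIFY `rank` decimal(50,0) NOT NULL,
  ADD PRIMARY KEY (game, `rank`),
  ENGINE = InnoDB;
Each ALTER rebuilds the table, so this is best done during a maintenance window.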
Seeing as how there are no takers, I'll give it a shot. I am from an SQL Server background, but the same ideas apply.
Some general observations:
The ID column is pretty much pointless and should not participate in any indexes unless there are other tables/queries you're not telling us about. In fact, it doesn't even need to be in your last query; you can do COUNT(*) instead of COUNT(id).
Your clustered index should target your most common queries. Therefore, a clustered index on game ASC, time DESC, and rank DESC works well. Sorting by time DESC is usually a good idea for historical tables like this where you are usually interested in the most recent stuff. You may also try a separate index with the rank sorted the other direction, though I'm not sure how much of a benefit this will be.
Are you sure you need SELECT *? If you can select fewer columns, you may be able to create an index which contains all columns needed for your SELECT and WHERE.
1 million rows is really not that much. I created a table like yours with 1,000,000 rows of sample data, and even with the one index (game ASC, time DESC, and rank DESC), all queries ran in less than 1 second.
(The only part I'm not sure of is playerId. The queries performed so well that playerId didn't seem to be necessary. Perhaps you can add it at the end of your clustered index.)

Big SQL SELECT performance difference when using <= against using < on a DATETIME column

Given the following table:
desc exchange_rates;
+------------------+----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| time | datetime | NO | MUL | NULL | |
| base_currency | varchar(3) | NO | MUL | NULL | |
| counter_currency | varchar(3) | NO | MUL | NULL | |
| rate | decimal(32,16) | NO | | NULL | |
+------------------+----------------+------+-----+---------+----------------+
I have added indexes on time, base_currency and counter_currency, as well as a composite index on (time, base_currency, counter_currency), but I'm seeing a big performance difference when I perform a SELECT using <= against using <.
The first SELECT is:
ExchangeRate Load (95.5ms)
SELECT * FROM `exchange_rates` WHERE (time <= '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
As you can see this is taking 95ms.
If I change the query such that I compare time using < rather than <= I see this:
ExchangeRate Load (0.8ms)
SELECT * FROM `exchange_rates` WHERE (time < '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
Now it takes less than 1 millisecond, which sounds right to me. Is there a rational explanation for this behaviour?
The output from EXPLAIN provides further details, but I'm not 100% sure how to interpret it:
-- Output from the first, slow, select
id: 1  select_type: SIMPLE  table: exchange_rates  type: index_merge
possible_keys: index_exchange_rates_on_time, index_exchange_rates_on_base_currency, index_exchange_rates_on_counter_currency, time_and_currency
key: index_exchange_rates_on_counter_currency, index_exchange_rates_on_base_currency  key_len: 5,5  ref: NULL  rows: 813
Extra: Using intersect(index_exchange_rates_on_counter_currency, index_exchange_rates_on_base_currency); Using where
-- Output from the second, fast, select
id: 1  select_type: SIMPLE  table: exchange_rates  type: ref
possible_keys: index_exchange_rates_on_time, index_exchange_rates_on_base_currency, index_exchange_rates_on_counter_currency, time_and_currency
key: index_exchange_rates_on_counter_currency  key_len: 5  ref: const  rows: 4988
Extra: Using where
(Note: I'm producing these queries through ActiveRecord (in a Rails app) but these are ultimately the queries which are being executed)
In the first case, MySQL tries to combine results from two indexes. It fetches all matching records from both indexes and joins them on the value of the row pointer (the table offset in MyISAM, the PRIMARY KEY in InnoDB).
In the second case, it just uses a single index, which, considering LIMIT 1, is the best decision.
You need to create a composite index on (base_currency, counter_currency, time) (in this order) for this query to work as fast as possible.
The engine will use the index for filtering on the leading columns (base_currency, counter_currency) and for ordering on the trailing column (time).
It also seems you want to add something like ORDER BY time DESC to your query to get the last exchange rate.
In general, any LIMIT without an ORDER BY should ring a bell.
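Putting the two suggestions together, a sketch (the index name is a placeholder):
ALTER TABLE exchange_rates
  ADD INDEX base_counter_time (base_currency, counter_currency, time);

SELECT *
FROM exchange_rates
WHERE base_currency = 'GBP'
  AND counter_currency = 'USD'
  AND time <= '2009-12-30 14:42:02'
ORDER BY time DESC
LIMIT 1;
With this index, both the <= and the < versions should use a simple range scan on the composite index and return in about the same time.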