MySQL indexing with partitioning

We are currently evaluating MySQL for one of our analytics use cases.
The table schema is somewhat like this:
CREATE TABLE IF NOT EXISTS `analytics`(
`dt` DATE,
`dimension1` BIGINT UNSIGNED,
`dimension2` BIGINT UNSIGNED,
`metrics1` BIGINT UNSIGNED,
`metrics2` BIGINT UNSIGNED,
INDEX `baseindex` (`dimension1`,`dt`)
);
Since most queries will filter on dimension1 and the date, we felt that a composite index would be the best way to optimize lookups.
With this table schema in mind, we ran EXPLAIN on a typical query:
explain
select dimension2,dimension1
from analytics
where dimension1=1123 and dt between '2016-01-01' and '2016-01-30';
It returns the following plan:
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| 1 | SIMPLE | analytics | ref | baseindex | baseindex | 13 | const,const | 1 | Using index condition |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
This looks good so far: the plan indicates that the index is being used.
However, we wondered whether we could optimize this a bit further. Since most of our lookups will be for the current month or otherwise month-based, we felt that partitioning by date would further improve performance.
The table was later modified to add partitions by month:
ALTER TABLE analytics
PARTITION BY RANGE( TO_DAYS(`dt`))(
PARTITION JAN2016 VALUES LESS THAN (TO_DAYS('2016-02-01')),
PARTITION FEB2016 VALUES LESS THAN (TO_DAYS('2016-03-01')),
PARTITION MAR2016 VALUES LESS THAN (TO_DAYS('2016-04-01')),
PARTITION APR2016 VALUES LESS THAN (TO_DAYS('2016-05-01')),
PARTITION MAY2016 VALUES LESS THAN (TO_DAYS('2016-06-01')),
PARTITION JUN2016 VALUES LESS THAN (TO_DAYS('2016-07-01')),
PARTITION JUL2016 VALUES LESS THAN (TO_DAYS('2016-08-01')),
PARTITION AUG2016 VALUES LESS THAN (TO_DAYS('2016-09-01')),
PARTITION SEPT2016 VALUES LESS THAN (TO_DAYS('2016-10-01')),
PARTITION OCT2016 VALUES LESS THAN (TO_DAYS('2016-11-01')),
PARTITION NOV2016 VALUES LESS THAN (TO_DAYS('2016-12-01')),
PARTITION DEC2016 VALUES LESS THAN (TO_DAYS('2017-01-01'))
);
With the partition in place, the same query now returns the following results
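+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |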
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
| 1 | SIMPLE | analytics | range | baseindex | baseindex | 13 | NULL | 1 | Using where |
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
Now the "Extra" column shows "Using where" instead of "Using index condition".
We have not noticed any performance boost or degradation, so we are curious to know how adding a partition changes the value in the Extra column.

This is too long for a comment.
MySQL partitions both the data and the indexes. So, the result of your query is that the query is accessing a smaller index which refers to fewer data pages.
Why don't you see a performance boost? Well, looking up rows in a smaller index is negligible savings (although there might be some savings for the first query from a cold start because the index has to be loaded into memory).
I am guessing that the data you are looking for is relatively small -- say, the records come from a handful of data pages. Well, fetching a handful of data pages from a partition is pretty much the same thing as fetching a handful of data pages from the full table.
Does this mean that partitioning is useless? Not at all. For one thing, the partitioned data and index is much smaller than the overall table. So, you have a savings in memory on the server side -- and this can be a big win on a busy server.
In general, though, partitions really shine when you have queries that don't fully use indexes. The smaller data sizes in each partition often make such queries more efficient.
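To make the pruning idea concrete, here is a minimal Python sketch (not MySQL's actual implementation) that models which of the monthly partitions from the question a `dt BETWEEN` range can touch. The partition names and boundaries are copied from the ALTER TABLE above; the function name `partitions_for_range` is hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Partition names from the question's ALTER TABLE, in order.
NAMES = ["JAN2016", "FEB2016", "MAR2016", "APR2016", "MAY2016", "JUN2016",
         "JUL2016", "AUG2016", "SEPT2016", "OCT2016", "NOV2016", "DEC2016"]

# Exclusive upper bound of each partition (VALUES LESS THAN ...).
UPPER_BOUNDS = [date(2016, m + 1, 1) for m in range(1, 12)] + [date(2017, 1, 1)]


def partitions_for_range(start, end):
    """Return the partitions that can hold rows with start <= dt <= end."""
    # A date lands in the first partition whose upper bound is greater than it.
    first = bisect_right(UPPER_BOUNDS, start)
    last = bisect_right(UPPER_BOUNDS, end)
    return NAMES[first:last + 1]


one_month = partitions_for_range(date(2016, 1, 1), date(2016, 1, 30))
two_months = partitions_for_range(date(2016, 1, 15), date(2016, 2, 1))
print(one_month)
print(two_months)
```

The query from the question stays inside a single partition, which is why it only ever touches that partition's smaller index.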

Use NOT NULL (wherever appropriate).
Don't use BIGINT (8 bytes) unless you really need huge numbers. Dimension ids can usually fit in SMALLINT UNSIGNED (0..64K, 2 bytes) or MEDIUMINT UNSIGNED (0..16M, 3 bytes).
Yes, INDEX(dim1, dt) is optimal for that one SELECT.
No, PARTITIONing will not help for that SELECT.
PARTITION BY RANGE(TO_DAYS(..)) is excellent if you intend to delete old data. But there is rarely any other benefit.
Use InnoDB.
Explicitly specify the PRIMARY KEY. It will be important in the discussion below.
When working with huge databases, it is a good idea to "count the disk hits". So, let's analyze your query.
INDEX(dim1, dt) with WHERE dim1 = a AND dt BETWEEN x AND y will:
1. If partitioned, prune down to the partition(s) representing x..y.
2. Drill down in the index's BTree to [a,x]. With partitioning the BTree might be 1 level shallower, but that savings is lost to the pruning of step 1.
3. Scan forward until [a,y]. If only one partition is involved, this scan hits exactly the same number of blocks whether partitioned or not. If multiple partitions are needed, then there is some extra overhead.
4. For each row, use the PRIMARY KEY to reach over into the data to get dim2. Again, virtually the same amount of effort. Without knowing the Engine and PRIMARY KEY, I cannot finish discussing #4.
If (dim1, dim2, dt) is unique, make it the PK. In this case, INDEX(dim1, dt) is actually dim1, dt, dim2 since the PK is included in every secondary index. That says that #4 really involves a 'covering' index. That is, there is no extra work to reach for dim2 (zero disk hits).
If, on the other hand, you did SELECT metric..., then #4 does have the effort mentioned.
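The remark that a partitioned BTree "might be 1 level shallower" can be illustrated with a rough Python estimate. This is only a back-of-the-envelope sketch: the fanout of 100 entries per block and both row counts are assumptions chosen for illustration, not figures from the question.

```python
def btree_levels(rows, fanout=100):
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


whole_table = 120_000_000          # hypothetical total row count
one_partition = whole_table // 12  # hypothetical rows in one monthly partition

full_depth = btree_levels(whole_table)
partition_depth = btree_levels(one_partition)
print(full_depth, partition_depth)
```

Under these assumptions the per-partition index is one level shallower, so the drill-down saves at most one block read per lookup.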

Related

MariaDB 5.5.68, prevent toxic selects?

I know that this MariaDB version 5.5.68 is really out of date, but I have to stay with this old version for a while.
Is there a way to prevent toxic selects that may block MyISAM tables for a longer time (minutes)? The problem is that the select takes a read lock on the whole MyISAM table, and subsequent inserts wait until all such selects have finished. So the long-running select starts to block the system.
Take this example table:
CREATE TABLE `tbllog` (
`LOGID` bigint unsigned NOT NULL auto_increment,
`LOGSOURCE` smallint unsigned default NULL,
`USERID` int unsigned default NULL,
`LOGDATE` datetime default NULL,
`SUBPROVIDERID` int unsigned default NULL,
`ACTIONID` smallint unsigned default NULL,
`COMMENT` varchar(255) default NULL,
PRIMARY KEY (`LOGID`),
KEY `idx_LogDate` (`LOGDATE`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The following select works fine while the table has fewer than 1 million entries (the customers set the date range):
SELECT *
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500;
But it becomes toxic if there are 10 million entries or more in the table. Then it starts to run for minutes, consumes a lot of memory and starts blocking the app.
This is the query plan with ~600,000 entries in the table:
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | tbllog | index | idx_LogDate | PRIMARY | 8 | NULL | 624 | Using where |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
The thing is, that I need to know if this becomes toxic or not before execution. So maybe I can warn the user that this might block the system for a while or even deny execution.
I know that InnoDB might not have this issue, but I don't know the drawbacks of a switch yet and I think it might be best to stay for the moment.
I tried to do a simple SELECT COUNT(*) FROM tbllog WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 before (removing LIMIT and ORDER BY), but it is not really much faster than the real query and produces double the load in the worst case.
I also considered a worker thread (like mentioned here). But this is a significant change to the whole system, too. Switching to InnoDB would have less impact, I think.
Any ideas about this issue?
Your EXPLAIN report shows that it's doing an index-scan on the primary key index. I believe this is because the range of dates is too broad, so the optimizer thinks that it's not much help to use the index instead of simply reading the whole table. By doing an index-scan of the primary key (logid), the optimizer can at least ensure that the rows are read in the order you requested in your ORDER BY clause, so it can skip sorting.
If I test your query (I created the table and filled it with 1M rows of random data), but make it ignore the primary key index, I get this EXPLAIN report:
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate | idx_LogDate | 6 | NULL | 271471 | 10.00 | Using index condition; Using where; Using filesort |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
This makes it use the index on logdate, so it examines fewer rows, in proportion to how many match the date range condition. But the resulting rows must be sorted ("Using filesort" in the Extra column) before it can apply the LIMIT.
This won't help at all if your range of dates covers the whole table anyway. In fact, it will be worse, because it will access rows indirectly by the logdate index, and then it will have to sort rows. This solution helps only if the range of dates in the query matches a small portion of the table.
A somewhat better index is a compound index on (subproviderid, logdate).
mysql> alter table tbllog add index (subproviderid, logdate);
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate,SUBPROVIDERID | SUBPROVIDERID | 11 | NULL | 12767 | 100.00 | Using index condition; Using filesort |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
In my test, this helps the estimate of rows examined drop from 271471 to 12767 because they're restricted by subproviderid, then by logdate. How effective this is depends on how frequently subproviderid=1 is matched. If that's matched by virtually all of the rows anyway, then it won't be any help. If there are many different values of subproviderid and they each have a small fraction of rows, then it will help more to add this to the index.
In my test, I made an assumption that there are 20 different values of subproviderid with equal frequency. That is, my random data inserted round(rand()*20) as the value of subproviderid on each row. Thus it is expected that adding subproviderid resulted in 1/20th of the examined rows in my test.
To choose the order of columns listed in the index, columns referenced in equality conditions must be listed before the column referenced in range conditions.
There's no way to get a prediction of the runtime of a query. That's not something the optimizer can predict. You should block users from requesting a range of dates that will match too great a portion of the table.
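One way to block over-broad requests is to validate the date range in the application before the query is ever sent. The following is a minimal Python sketch under stated assumptions: the 31-day limit and the function name `check_date_range` are hypothetical, and the right limit depends on how many rows a day of logs contains.

```python
from datetime import datetime, timedelta

MAX_SPAN = timedelta(days=31)  # hypothetical limit; tune against your data volume


def check_date_range(start, end, max_span=MAX_SPAN):
    """Return (ok, message) for a user-supplied logdate range."""
    if end < start:
        return False, "end date is before start date"
    if end - start > max_span:
        days = (end - start).days
        return False, f"range of {days} days exceeds the {max_span.days}-day limit"
    return True, "ok"


# The range from the question is rejected before it can reach the database.
ok, msg = check_date_range(datetime(2021, 1, 1), datetime(2022, 10, 25))
print(ok, msg)
```

A rejected range can be turned into a warning for the user instead of a hard denial, depending on how strict the application needs to be.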
For this
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
Add both of these and hope that the Optimizer picks the better one:
INDEX(subproviderid, logdate, logid)
INDEX(subproviderid, logid)
Better yet would be to also change to this (assuming it is 'equivalent' for your purposes):
ORDER BY logdate, logid
Then that first index will probably work nicely.
You really should change to InnoDB. (Caution: the table is likely to triple in size.) With InnoDB, there would be another indexing option. And, with an updated version, you could do "instant" index adding. Meanwhile, MyISAM will take a lot of time to add those indexes.
Try creating a multi-column index specifically for your query.
CREATE INDEX sub_date_logid ON tbllog (subproviderid, logdate, logid);
This index should satisfy the WHERE filters in your query directly. Because it also contains logid, the ORDER BY ... LIMIT can be resolved from index entries alone, although the range condition on logdate means those entries are not in logid order, so a sort of the matching logid values may still be needed. Will this help on long-dead MariaDB 5.5 with MyISAM? Hard to say for sure.
If it doesn't solve your performance problem, keep the multicolumn index and try doing the ORDER BY...LIMIT on the logid values rather than all the rows.
SELECT *
FROM tbllog
WHERE logid IN (
SELECT logid
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500 )
ORDER BY logid;
This can speed things up because it lets MariaDB sort just the logid values to find the ones it wants. Then the outer query fetches only the 500 rows needed for your result set. Less data to sort = faster.
One of the options, although an external one, would be to use ProxySQL. It has capabilities to shape the traffic. You can create rules deciding how to process queries that match them. You could, for example, create a query rule that would check if a query is accessing a given table (you can use regular expressions to match the query) and, for example, block that query or introduce a delay in execution.
Another option could be to use pt-kill. It's a script that's part of the Percona Toolkit and it's intended to, well, kill queries. You can define which queries you want to kill (matching them by regular expressions, by how long they ran or in other ways).
Having said that, if SELECTs can be optimized by rewriting or adding proper indexes, that may be the best option to go for.

MySQL does not always use index

Very simple problem yet hard to find a solution.
The address table has 2,498,739 rows and includes min_ip and max_ip fields. These are the core anchors of the table for filtering.
The query is very simple.
SELECT *
FROM address a
WHERE min_ip < value
AND max_ip > value;
So it seemed logical to index min_ip and max_ip to make the query faster.
I created the following indexes:
CREATE INDEX ip_range ON address (min_ip, max_ip) USING BTREE;
CREATE INDEX min_ip ON address (min_ip ASC) USING BTREE;
CREATE INDEX max_ip ON address (max_ip DESC) USING BTREE;
I did try to create just the first option (combination of min_ip and max_ip) but it did not work so I prepared at least 3 indexes to give MySQL more options for index selection. (Note that this table is pretty much static and more of a lookup table)
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| network | varchar(20) | YES | | NULL | |
| min_ip | int(11) unsigned | NO | MUL | NULL | |
| max_ip | int(11) unsigned | NO | MUL | NULL | |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
Now, it should be straightforward to query the table with min_ip and max_ip as the filter criteria.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
The query takes around 0.120 to 0.200 seconds. However, under a load test, performance degrades rapidly.
CPU usage on the MySQL server skyrockets to 100% with just a few simultaneous queries, and throughput does not scale.
The slow query log was enabled with a threshold of 10 seconds, and the select query shows up in the log just a few seconds into the load test.
So I checked the query with EXPLAIN and found that it didn't use an index.
Explain plan result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- ------ ------- ------ ------- -------------
1 SIMPLE a ALL ip_range,min_ip,max_ip (NULL) (NULL) (NULL) 2417789 Using where
Interestingly, it identified ip_range, min_ip and max_ip as possible indexes but did not use any of them, as shown in the key column.
I know I can use FORCE INDEX and tried to use explain plan on it.
EXPLAIN
SELECT *
FROM address a
FORCE INDEX (ip_range)
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
Explain plan with FORCE INDEX result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ------------- -------- ------- ------ ------- -----------------------
1 SIMPLE a range ip_range ip_range 4 (NULL) 1208894 Using index condition
With FORCE INDEX, it does use the ip_range index as the key, and the rows estimate drops to 1,208,894, compared with 2,417,789 without FORCE INDEX.
So using the index should definitely perform better (unless I have misunderstood the EXPLAIN result).
More interestingly, after a couple of tests I found that in some cases MySQL does use the index even without FORCE INDEX. My observation is that it uses the index when the value is small.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 508496
AND max_ip > 508496;
Explain Result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- -------- ------- ------ ------ -----------------------
1 SIMPLE a range ip_range,min_ip,max_ip ip_range 4 (NULL) 1 Using index condition
It puzzles me that MySQL decides whether to use an index based on the value passed to the select query.
I can't imagine what the basis is for that decision. I understand that an index may not be used if no index matches the WHERE condition, but in this case the ip_range index on the min_ip and max_ip columns is clearly suitable.
The bigger problem is what happens with other queries. Do I have to test them all on a grand scale?
And even then, as the data grows, can I rely on MySQL to use the index?
Yes, I can always use FORCE INDEX to ensure it uses the index. But this is not standard SQL that works on all databases.
ORM frameworks may not support FORCE INDEX when they generate SQL, and it tightly couples the query to index names.
I am not sure if anyone else has encountered this issue, but it seems to be a very big problem for me.
Fully agree with Vatev and the others. Not only MySQL does that. Scanning the table is sometimes cheaper than looking at the index first then looking up corresponding entries on disk.
The only time it is sure to use the index is when it's a covering index, which means that every column the query references (for this particular table, of course) is present in the index. For example, if you need only the network column
SELECT network
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
then a covering index like
CREATE INDEX ip_range ON address (min_ip, max_ip, network) USING BTREE;
would only look at the index as there's no need to lookup additional data on disk at all. And the whole index could be kept in memory.
Ranges like that are nasty to optimize. But I have a technique. It requires non-overlapping ranges and stores only a start_ip, not the end_ip (which is effectively available from the 'next' record). It provides stored routines to hide the messy code, involving ORDER BY ... LIMIT 1 and other tricks. For most operations it won't hit more than one block of data, unlike the obvious approaches that tend to fetch half or all the table.
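The core of that start-only technique can be sketched in Python with a sorted list and a binary search, which plays the role of the ORDER BY ... LIMIT 1 lookup. This is an illustration under assumptions, not the answerer's stored routines: the sample ranges and owner labels are made up, and the lookup is assumed to mean "find the last row whose start_ip is at or below the address".

```python
from bisect import bisect_right

# Hypothetical non-overlapping ranges, sorted by start_ip. Each range extends
# to the next start_ip minus one; an owner of None marks an unassigned gap.
STARTS = [0, 1000, 5000, 9000]
OWNERS = [None, "net-a", "net-b", None]


def lookup(ip):
    """Return the owner of the range containing ip."""
    if ip < 0:
        raise ValueError("ip must be non-negative")
    # Last entry with start_ip <= ip, found with one binary search.
    i = bisect_right(STARTS, ip) - 1
    return OWNERS[i]


print(lookup(1000), lookup(4999), lookup(5000), lookup(9500))
```

Because only one boundary is stored per range, a lookup inspects a single neighbourhood of the sorted data instead of every row whose min_ip lies below the value.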
I agree with all the answers above, but you can try creating only one composite index like this:
create index ip_rang on address (min_ip ASC,max_ip DESC) using BTREE;
Keep in mind that every index also costs disk space, so consider which index is optimal before adding more.

Optimizing SQL Query from a Big Table Ordered by Timestamp

We have a big table with the following table structure:
CREATE TABLE `location_data` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` char(30) NOT NULL,
`data` char(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` double(30,10) DEFAULT NULL,
`lng` double(30,10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dt` (`dt`),
KEY `data` (`data`),
KEY `device_sn` (`device_sn`,`data`,`dt`),
KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We frequently run queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
OR
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
By adding an index and using FORCE INDEX on device_sn.
Separating the table into multiple tables based on the date (e.g. location_data_20140101), pre-checking whether there is data for a given date, and then querying only that particular table. This table is created by cron once a day, and the data for that date is then deleted from location_data.
The table location_data is HIGH WRITE and LOW READ.
However, the query occasionally runs really slowly. I wonder if there are other methods, or ways to restructure the data, that would allow us to read data in date order for a given device_sn.
Any tips are more than welcomed.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn,data,dt) is good. MySQL should use it without any need for FORCE INDEX. You can verify this by running "explain select ..."
However, your table is MyISAM, which only supports table-level locks. If the table is write-heavy, it may be slow. I would suggest converting it to InnoDB.
Ok, I'll provide info that I know and this might not answer your question but could provide some insight.
There exist certain differences between InnoDB and MyISAM. Forget about full text indexing or spatial indexes, the huge difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can store the data set it works with in RAM. This is why database servers come with a lot of RAM - so that I/O operations can be done quickly. For example, an index scan is faster if you have indexes in RAM rather than on HDD because finding data on HDD is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this when using InnoDB is called innodb_buffer_pool_size. Its default was 8 MB in older MySQL versions and is 128 MB from MySQL 5.5 onward. I personally set this value high, sometimes even up to 90% of available RAM. Usually, when this value is optimized, a lot of people experience incredible speed gains.
The other thing is that InnoDB is a transactional engine. That means it will tell you that a write to disk succeeded or failed and that will be 100% correct. MyISAM won't do that because it doesn't force OS to force HDD to commit data permanently. That's why sometimes records are lost when using MyISAM, it thinks data is written because OS said it was when in reality OS tried to optimize the write and HDD might lose buffer data, thus not writing it down. OS tries to optimize the write operation and uses HDD's buffers to store larger chunks of data and then it flushes it in a single I/O. What happens then is that you don't have control over how data is being written.
With InnoDB you can start a transaction, execute say 100 INSERT queries and then commit. That will effectively force the hard drive to flush all 100 queries at once, using 1 I/O. If each INSERT is 4 KB long, 100 of them is 400 KB. That means you'll utilize 400 KB of your disk's bandwidth with 1 I/O operation, and the remainder of the I/O capacity will be available for other uses. This is how inserts are optimized.
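The batching pattern looks like this in application code. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is an assumption that your MySQL driver offers the same shape (most expose `executemany` and an explicit commit), and the two-column table is a simplified, hypothetical version of location_data.

```python
import sqlite3

# SQLite stands in for MySQL/InnoDB here so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location_data (device_sn TEXT, data TEXT)")

rows = [(f"SN{i:03d}", "location") for i in range(100)]

# "with conn" wraps the block in one transaction: it commits once on success
# and rolls back on error, so the 100 INSERTs share a single commit.
with conn:
    conn.executemany("INSERT INTO location_data VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM location_data").fetchone()[0]
print(count)
```

The point is the single commit around many inserts, rather than one commit per row.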
Next are indexes with low cardinality. Cardinality is the number of unique values in an indexed column. For a primary key, the cardinality equals the row count, so its selectivity (unique values divided by rows) is 1, which is also the highest possible value. Indexes with low cardinality are on columns with only a few distinct values, such as yes or no. If an index's cardinality is too low, MySQL will prefer a full table scan, which is MUCH quicker. Also, forcing an index that MySQL doesn't want to use could (and probably will) slow things down. This is because with an indexed search MySQL processes records one by one, whereas with a table scan it can read multiple records at once and avoid processing them one by one. If those records were written sequentially on a mechanical disk, further optimizations are possible.
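A quick way to compare candidate columns is to divide the number of distinct values by the row count, often called selectivity. The Python sketch below uses made-up counts for a hypothetical one-million-row table; in practice the inputs would come from COUNT(DISTINCT col) and COUNT(*).

```python
def selectivity(distinct_values, total_rows):
    """Fraction of rows that are distinct in the column (1.0 is the maximum)."""
    return distinct_values / total_rows


total_rows = 1_000_000                          # hypothetical table size
pk_selectivity = selectivity(total_rows, total_rows)  # every value is unique
flag_selectivity = selectivity(2, total_rows)          # a yes/no column
print(pk_selectivity, flag_selectivity)
```

A value close to 1 means an index lookup narrows the search to very few rows, while a value close to 0 means each indexed value still matches a large share of the table.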
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set the value of innodb_buffer_pool_size large enough so you can allocate more resources for faster querying
use an SSD if possible
try to wrap multiple INSERTs into transactions so you can better utilize your hard drive's bandwidth and I/O
avoid indexing columns that have low unique value count compared to row count - they just waste space (though there are exceptions to this)

Improve performance of count and sum when already indexed

First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5 ms, the ones across 90k+ rows are taking 500-900 ms and I really can't afford anything much past 100 ms.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.
You could try an index that has created_at before version_hash. That might give a better shot at an index range scan. It is not clear how the non-equality predicate on version_hash affects the plan, but I suspect it disables a range scan on the created_at column.
Other than that, the query and the index look about as good as you are going to get, the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger size than I normally deal with.
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8 bytes) or timestamp (4 bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.
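The estimate can be checked with plain arithmetic. Note that key_len in EXPLAIN is the maximum possible length of the key parts used, so this is an upper bound on what is actually read rather than a measurement.

```sql
-- 95,000 index entries at the reported key_len of 1543 bytes each.
SELECT 95000 * 1543               AS total_bytes,  -- 146585000
       95000 * 1543 / 1024 / 1024 AS approx_mib;   -- about 139.8
```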

Why is MySQL with InnoDB doing a table scan when key exists and choosing to examine 70 times more rows?

I'm troubleshooting a query performance problem. Here's an expected query plan from explain:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | table1 | range | tdcol | tdcol | 8 | NULL | 5437848 | Using where |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.
However, if I query for just one more minute of data, we get this query plan:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| 1 | SIMPLE | table1 | ALL | tdcol | NULL | NULL | NULL | 381601300 | Using where |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)
The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.
Also, the 'USE KEY tdcol' syntax does not change the query plan.
Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.
5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).
This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.
In the absence of that order-correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach will involve reading most or nearly all of the pages in the underlying table anyway. In that case the scan will be much less expensive than the index approach: fewer reads outright, and sequential instead of random.
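One way to test this reasoning rather than argue about it is to look at both access paths side by side. The sketch below uses the table, index, and date range from the question. FORCE INDEX and IGNORE INDEX are standard MySQL index hints, and FORCE INDEX is stronger than the USE KEY hint already tried, because it tells the optimizer to treat a table scan as very expensive.

```sql
-- Fraction of the table the range is estimated to match,
-- using the two row estimates from the EXPLAIN output in the question.
SELECT 5437848 / 381601300 * 100 AS pct_of_table;  -- roughly 1.4

-- Plan when the optimizer is pushed onto the tdcol index.
EXPLAIN SELECT * FROM table1 FORCE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';

-- Plan when the index is taken away, for comparison.
EXPLAIN SELECT * FROM table1 IGNORE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';
```

Timing the two real queries (without EXPLAIN) on a quiet server is what settles which plan is cheaper for this data.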
MySQL's query optimizer has a cutoff when deciding whether to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?
I don't have an answer, but I have a suspicion MySQL doesn't have enough memory to explore the index. I would be looking at the server memory settings, particularly the InnoDB buffer pool and some of the other key storage pools.
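A quick way to look at those settings, as a sketch. Both statements are standard MySQL; what counts as "enough" memory depends on the data and index size, which are not known here.

```sql
-- Configured size of the InnoDB buffer pool, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Among the counters returned: Innodb_buffer_pool_read_requests (logical reads)
-- and Innodb_buffer_pool_reads (reads that had to go to disk).
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

If Innodb_buffer_pool_reads grows quickly relative to Innodb_buffer_pool_read_requests while the query runs, the working set probably does not fit in the pool.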
What's the distribution of your data like? Try running a min(), avg(), max() on it to see where it is. It's possible that that 1 minute makes the difference in how much information is contained in that range.
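A sketch of that check against the table from the question. The column alias names are made up, and the per-minute window around the 03:16/03:17 boundary is an assumption about where to look.

```sql
-- Overall spread of the column. MIN and MAX can be read from the ends of the
-- index, but AVG has to visit every entry, so this is slow on a table this size.
SELECT MIN(tdcol) AS earliest,
       MAX(tdcol) AS latest,
       FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(tdcol))) AS mean_time
FROM table1;

-- Rows per minute around the point where the plan flips.
SELECT DATE_FORMAT(tdcol, '%Y-%m-%d %H:%i') AS minute_bucket,
       COUNT(*) AS row_count
FROM table1
WHERE tdcol BETWEEN '2010-04-14 03:10' AND '2010-04-14 03:20'
GROUP BY minute_bucket
ORDER BY minute_bucket;
```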
It could also just be InnoDB's configuration. There are a few factors, such as page size and memory, as staticsan said. You may want to explicitly define a B+Tree index.
"so I have a hard time believing that the table scan is better."
True. YOU have a hard time believing it. But the optimizer seems not to.
I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.
That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?
If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently, compared to the case when it has to conclude that, "oh, there are definitely some rows that won't satisfy the upper limit either, so I'll use the index just to be on the safe side".
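A sketch of how to rule stale or misleading statistics in or out, assuming they are a factor at all. ANALYZE TABLE and SHOW INDEX are standard MySQL statements; whether refreshing the statistics changes this particular plan is not something the thread establishes.

```sql
-- Refresh the index statistics the optimizer works from.
ANALYZE TABLE table1;

-- Inspect the cardinality estimate recorded for the tdcol index.
SHOW INDEX FROM table1;

-- Then check whether the plan for the slightly wider range has changed.
EXPLAIN SELECT * FROM table1
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';
```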