Improve performance of count and sum when already indexed - mysql

First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time-sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5 ms; the ones across 90k+ rows are taking 500-900 ms, and I really can't afford anything much past 100 ms.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.

You could try an index that has created_at before version_hash (you might get a better shot at an index range scan). It's not clear how that non-equality predicate on version_hash affects the plan, but I suspect it disables a range scan on the created_at column.
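As a sketch only (column names are taken from the question's schema, the index name is made up), that reordered index might look like:
-- Hypothetical index with created_at ahead of version_hash, so the BETWEEN
-- predicate on created_at can bound the range scan while the index still
-- covers the SUM(amount).
ALTER TABLE `disbursements`
  ADD INDEX `disbursement_summation_index2`
    (`accumulation_hash`, `caller_id`, `active`, `created_at`, `version_hash`, `amount`);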
Other than that, the query and the index look about as good as you are going to get, the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger size than I normally deal with.
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8bytes) or timestamp (4bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.

Related

No further optimization for this query?

I have some tables I want to join, but the joined query cannot take dozens of seconds.
I want to go from this query that takes ~1s
SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 1.173s
to this joined query that is taking ~50s
SELECT COUNT(*) FROM business b
INNER JOIN business_group bg ON b.id=bg.business_id
WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 51.346s
Why does it take that long if the only thing it does differently is to join on the primary key of the business table (business.id)?
Besides this primary key index, I also have this one (group_id, business_id) on business_group (with (business_id, group_id) it took even longer).
Following is the execution plan:
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| 1 | SIMPLE | bg | <null> | ref | FKo2q0jurx07ein31bgmfvuk8gf,idx_bg_group_id_business_id | idx_bg_group_id_business_id | 9 | const | 2654528 | 100.0 | Using index |
| 1 | SIMPLE | b | <null> | eq_ref | PRIMARY | PRIMARY | 4 | database.bg.group_id | 1 | 100.0 | Using where; Using index |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
Is it possible to optimize the second query so it takes less time?
business table is ~45M rows while business_group is ~60M rows.
I'm writing this as someone who does a lot of indexing setups on SQL Server rather than MySQL. It is too long as a comment, and is based on what I believe are fundamentals, so hopefully it will help.
Why?
Firstly - why does it take so long for the second query to run? The answer is that it needs to do a lot more work in the second one.
To demonstrate, imagine the only non-clustered index you have is on business_group for group_id.
You run the first query SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040.
All the engine needs to do is to seek to the appropriate spot in the index (where group_id = 1040), then read/count rows from the index (which is just a series of ints) until it changes - because the non-clustered index is sorted by that column.
Note if you had a second field in the non-clustered index (e.g., group_id, business_id), it would be almost as fast because it's still sorted on group_id first. However, it will be a slightly larger read as each row is 2x the size of the single-column version (but would still be very small).
Imagine you then run a slightly different query, counting business_id instead of * e.g., SELECT COUNT(business_id) FROM business_group bg WHERE bg.group_id=1040.
Assuming business_id is not the PK (and is not in the non-clustered index), then for every row it finds in the index, it then needs to go back and read the business_id from the table to check it's not null (either via some sort of loop/lookup, or by reading the whole table - I'm not 100% sure how MySQL handles this). Either way, it is a lot more work than above.
If business_id was in the index (as above, for group_id, business_id), then it could read that data straight from the index and not need to refer back to the original table - which is good.
Now add the join (your second query) SELECT COUNT(*) FROM business b INNER JOIN business_group bg ON b.id=bg.business_id WHERE bg.group_id=1040. The engine needs to
Get each business_id as above
Potentially sort the business IDs to help with the join
Join it to the business table (to ensure it has a valid row in the business table)
... and to do so, it may need to read all the row's data in the business table
Suggestion #1 - Avoid going to the business table
If you set up foreign keys to ensure that business_id in business_group is valid - then do you need to run the version with the join? Just run the first version.
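If such a constraint is not already in place, a minimal sketch of adding it (the constraint name is made up) and then relying on the single-table count:
-- Hypothetical foreign key guaranteeing every business_group.business_id
-- references an existing business row, which makes the join redundant for a pure COUNT(*).
ALTER TABLE business_group
  ADD CONSTRAINT fk_business_group_business
  FOREIGN KEY (business_id) REFERENCES business (id);

-- With that guarantee, the fast single-table version gives the same count:
SELECT COUNT(*) FROM business_group bg WHERE bg.group_id = 1040;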
Suggestion #2 - Indexes
If this was SQL Server and you needed that second query to run as fast as possible, I would set up two non-clustered indexes
NONCLUSTERED INDEX ... ON business_group (group_id, business_id)
NONCLUSTERED INDEX ... ON business (id)
The first means the engine can seek directly to the specific group_id, and then get a sorted list of business_id.
The second provides a sorted list of id (business_id) from the business table. As it has the same sort as the results from the first index, it means the join is a lot less work.
However, the second one is controversial - many people would say 'no' to this as it overlaps your PK (or, more specifically, clustered index). It would also be sorted the same way. However, at least in SQL Server, this would include all the other data about the businesses e.g., the business name, address, etc. So to read the list of IDs from business, you'd also need to read the rest of the data - taking a lot more time.
However, if you put a non-clustered index just on ID, it will be a very narrow index (just the IDs) and therefore the amount of data to be read would be much less - and therefore often a lot faster.
Note though, that this is not as fast as if you could avoid doing the join altogether (e.g., Suggestion #1 above).
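For reference, a rough MySQL translation: the (group_id, business_id) index already exists per the question, so only the narrow index on business would be new (its name is made up). In InnoDB the primary key is the clustered index, so a secondary index on id alone effectively stores just the id values, much like the narrow non-clustered index described above.
-- Sketch: a narrow secondary index on business(id); whether the optimizer
-- actually prefers it over the clustered primary key is something to test, not a given.
CREATE INDEX idx_business_id_narrow ON business (id);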

DISTINCT COUNT with GROUP BY query is too slow despite indexes

I have the following query that counts the number of vessels in each zone for each week:
SELECT zone,
DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE zone IS NOT NULL
AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY zone, date;
The table has about 40 million rows. The execution plan for this is:
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| 1 | SIMPLE | vessel_position | NULL | range | creation_date,zone | zone | 5 | NULL | 21190904 | 50.00 | Using where; Using index; Using filesort |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
Columns vessel_imo, zone and creation_date are each indexed. The primary key is the composite key (vessel_imo, creation_date).
When I look at the query profile, I can see that a large amount of time is spent on Creating sort index.
Is there anything I can do to improve this query further?
Assuming the data, once inserted, does not change, then build and maintain a Summary Table.
The table would have three columns: the zone, the week, and the count-distinct for that week. At the start of each week, build only the rows for the previous week (one per zone; skip NULL). Then build a query to work against that table -- it will be extremely fast since it will be fetching far fewer rows.
Meanwhile, the INDEX(creation_date, zone, vessel_imo) as a secondary index, will make the weekly task reasonably efficient (~52 times as fast as your current query).
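A minimal sketch of such a summary table and its weekly refresh, using the table and column names from the question (the summary table's name, the zone datatype, and the exact week boundaries are assumptions to adapt):
-- Hypothetical weekly summary table.
CREATE TABLE vessel_zone_weekly (
  zone         INT          NOT NULL,
  yearweek     CHAR(6)      NOT NULL,   -- matches DATE_FORMAT(creation_date, '%Y%u')
  vessel_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (zone, yearweek)
);

-- Run once at the start of each week, covering only the week that just ended.
INSERT INTO vessel_zone_weekly (zone, yearweek, vessel_count)
SELECT zone,
       DATE_FORMAT(creation_date, '%Y%u'),
       COUNT(DISTINCT vessel_imo)
FROM vessel_position
WHERE zone IS NOT NULL
  AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  AND creation_date <  CURDATE()
GROUP BY zone, DATE_FORMAT(creation_date, '%Y%u');
The reporting query then reads from vessel_zone_weekly instead of scanning vessel_position.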
It depends on how selective your filtering condition is, and on your table structure. Does the filtering condition select 20% of the rows, 5%, 1%, 0.1%?
If your answer is less than 5% then the following index could help:
create index ix1_date_zone on vessel_position (creation_date, zone);
If your table has many and/or heavy columns, then this option could still be slow, depending on how selective your filtering condition is.
Otherwise, you could try using a more expensive index, to avoid using the table and do:
create index ix2_date_zone_imo on vessel_position
(creation_date, zone, vessel_imo);
This index is more expensive to maintain -- that is, on inserts, updates and deletes -- but it would be faster for your select.
Try both options and pick the best for your needs.
SET @mystartdate = DATE_SUB(CURDATE(), INTERVAL 12 MONTH);
SELECT zone, DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE creation_date >= @mystartdate
AND zone > 0
GROUP BY zone, date;
This may provide results in less time; please post your comparative times for the second run of each (old and suggested).
Please also post the new EXPLAIN SELECT … to confirm that the index on creation_date is now used.
Unless old data is allowed to change, why do you have to gather 12 months of history? The numbers from more than a month ago are NOT going to change.

using range with composite key

A MySQL database contains the following two tables (simplified):
   (~13000)               (~7000000 rows)
---------------          --------------------
| packages    |          | packages_prices  |
---------------          --------------------
| id (int)    |<-      ->| package_id (int) |
| state (int) |          | variant_id (int) |
- - - - - - - -          | for_date (date)  |
                         | price (float)    |
                         - - - - - - - - - -
Each package_id/for_date combination has only a few (average 3) variants.
And state is 0 (inactive) or 1 (active). Around 4000 of the 13000 are active.
First I just want to know which packages have a price set (regardless of variation), so I add a composite key covering (1) for_date and (2) package_id, and I query:
select distinct package_id from packages_prices where for_date > date(now())
This query takes 1 second to return 3500 rows, which is too much. An Explain tells me that the composite key is used with key_len 3, and 2000000 rows are examined with 100% filtered, type range, Using where; Using index; Using temporary. The distinct takes it back to 3500 rows.
If I take out distinct, the Using temporary is no longer mentioned, but the query then returns 1000000 rows and still takes 1 second.
Question 1: Why is this query so slow, and how do I speed it up without having to add or change the columns in the table? I would expect that, given the composite key, this query should cost less than 0.01s.
Now I want to know which active packages have a price set.
So I add a key on state and I add a new composite key just like above, but in reverse order. And I write my query like this:
select distinct packages.id from packages
inner join packages_prices on id = package_id and for_date > date(now())
where state = 1
The query now takes 2 seconds. An Explain tells me that for the packages table the key on state is used with key_len 4, examining 4000 rows and filtering 100%, type ref, Using index; Using temporary. And for the packages_prices table the new composite key is used with key_len 4, examining 1000 rows and filtering 33.33%, type ref, Using where; Using index; Distinct. The distinct takes it back to 3000 rows.
If I take out distinct, the Using temporary and Distinct are no longer mentioned, but the query returns 850000 rows and takes 3 seconds.
Question 2: Why is the query that much slower now? Why is range no longer being used according to the Explain? And why has filtering with the new composite key dropped to 33.33%? I expected the composite key to filter 100% again.
This all seems very basic and trivial, but it has been costing me hours and hours and I still don't understand what's really going on under the hood.
Your observations are consistent with the way MySQL works. For your first query, using the index (for_date, package_id), MySQL will start at the specified date (using the index to find that position), but then has to go to the end of the index, because every next entry can reveal a yet unknown package_id. A specific package_id could e.g. have just been used on the latest for_date. That search will add up to your 2000000 examined rows. The relevant data is retrieved from the index, but it will still take time.
What to do about that?
With some creative rewriting, you can transform your query to the following code:
select package_id from packages_prices
group by package_id
having max(for_date) > date(now());
It will give you the same result as your first query: if there is at least one for_date > date(now()) (which will make it part of your resultset), that will be true for max(for_date) too. But this will only have to check one row per package_id (the one having max(for_date)), all other rows with for_date > date(now()) can be skipped.
MySQL will do that by using index for group-by-optimization (that text should be displayed in your explain). It will require the index (package_id, for_date) (that you already have) and only has to examine 13000 rows: Since the list is ordered, MySQL can jump directly to the last entry for each package_id, which will have the value for max(for_date); and then continue with the next package_id.
Actually, MySQL can use this method to optimize a distinct too (and will probably do that if you remove the condition on for_date), but it is not always able to find a way; a really clever optimizer could have rewritten your query the same way I did, but we are not there yet.
And depending on your data distribution, that method could have been a bad idea: if you have e.g. 7000000 package_id, but only 20 of them in the future, checking each package_id for the maximum for_date will be much slower than just checking 20 rows that you can easily find by the index on for_date. So knowledge about your data will play an important role in choosing a better (and maybe optimal) strategy.
You can rewrite your second query in the same way. Unfortunately, such optimizations are not always easy to find and often specific to a specific query and situation. If you have a different distribution (as mentioned above) or if you e.g. slightly change your query and add an end-date, that method would not work anymore and you have to come up with another idea.
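As a sketch (untested against the real data), the second query could be rewritten along the same lines; the derived table already yields one row per package_id, so the DISTINCT is no longer needed:
-- Active packages that have at least one price row dated in the future.
SELECT p.id
FROM packages p
JOIN (
    SELECT package_id
    FROM packages_prices
    GROUP BY package_id
    HAVING MAX(for_date) > DATE(NOW())
) pp ON pp.package_id = p.id
WHERE p.state = 1;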

MYSQL indexing with partitioning

We are currently evaluating MySQL for one of our use cases related to analytics.
The table schema is somewhat like this:
CREATE TABLE IF NOT EXISTS `analytics`(
`dt` DATE,
`dimension1` BIGINT UNSIGNED,
`dimension2` BIGINT UNSIGNED,
`metrics1` BIGINT UNSIGNED,
`metrics2` BIGINT UNSIGNED,
INDEX `baseindex` (`dimension1`,`dt`)
);
Since most queries would filter on dimension1 and date, we felt that a combined index would be our best option for optimizing lookups.
With this table schema in mind, an EXPLAIN of the query returns the following results:
explain
select dimension2,dimension1
from analytics
where dimension1=1123 and dt between '2016-01-01' and '2016-01-30';
This returns:
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| 1 | SIMPLE | analytics | ref | baseindex | baseindex | 13 | const,const | 1 | Using index condition |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
This looks good so far, as it indicates the index is being used.
However, we thought we could optimize this a bit further: since most of our lookups will be for the current month, or month-based in general, we felt that partitioning by date would further improve performance.
The table was later modified to add partitions by month
ALTER TABLE analytics
PARTITION BY RANGE( TO_DAYS(`dt`))(
PARTITION JAN2016 VALUES LESS THAN (TO_DAYS('2016-02-01')),
PARTITION FEB2016 VALUES LESS THAN (TO_DAYS('2016-03-01')),
PARTITION MAR2016 VALUES LESS THAN (TO_DAYS('2016-04-01')),
PARTITION APR2016 VALUES LESS THAN (TO_DAYS('2016-05-01')),
PARTITION MAY2016 VALUES LESS THAN (TO_DAYS('2016-06-01')),
PARTITION JUN2016 VALUES LESS THAN (TO_DAYS('2016-07-01')),
PARTITION JUL2016 VALUES LESS THAN (TO_DAYS('2016-08-01')),
PARTITION AUG2016 VALUES LESS THAN (TO_DAYS('2016-09-01')),
PARTITION SEPT2016 VALUES LESS THAN (TO_DAYS('2016-10-01')),
PARTITION OCT2016 VALUES LESS THAN (TO_DAYS('2016-11-01')),
PARTITION NOV2016 VALUES LESS THAN (TO_DAYS('2016-12-01')),
PARTITION DEC2016 VALUES LESS THAN (TO_DAYS('2017-01-01'))
);
With the partition in place, the same query now returns the following results
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
| 1 | SIMPLE | analytics | range | baseindex | baseindex | 13 | NULL | 1 | Using where |
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
Now the "Extra" column show that its switching to where instead of using index condition.
We have not noticed any performance boost or degradation, so curios to know how does adding a partition changes the value inside the extra column
This is too long for a comment.
MySQL partitions both the data and the indexes. So, the result of your query is that the query is accessing a smaller index which refers to fewer data pages.
Why don't you see a performance boost? Well, looking up rows in a smaller index is negligible savings (although there might be some savings for the first query from a cold start because the index has to be loaded into memory).
I am guessing that the data you are looking for is relatively small -- say, the records come from a handful of data pages. Well, fetching a handful of data pages from a partition is pretty much the same thing as fetching a handful of data pages from the full table.
Does this mean that partitioning is useless? Not at all. For one thing, the partitioned data and index is much smaller than the overall table. So, you have a savings in memory on the server side -- and this can be a big win on a busy server.
In general, though, partitions really shine when you have queries that don't fully use indexes. The smaller data sizes in each partition often make such queries more efficient.
Use NOT NULL (wherever appropriate).
Don't use BIGINT (8 bytes) unless you really need huge numbers. Dimension ids can usually fit in SMALLINT UNSIGNED (0..64K, 2 bytes) or MEDIUMINT UNSIGNED (0..16M, 3 bytes).
Yes, INDEX(dim1, dt) is optimal for that one SELECT.
No, PARTITIONing will not help for that SELECT.
PARTITION BY RANGE(TO_DAYS(..)) is excellent if you intend to delete old data (see the DROP PARTITION sketch below), but there is rarely any other benefit.
Use InnoDB.
Explicitly specify the PRIMARY KEY. It will be important in the discussion below.
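As an illustration of the delete-old-data case mentioned above, removing a whole month becomes a quick metadata operation instead of a large DELETE (partition name as defined in the question's ALTER TABLE):
-- Drops every row in the JAN2016 partition in one step.
ALTER TABLE analytics DROP PARTITION JAN2016;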
When working with huge databases, it is a good idea to "count the disk hits". So, let's analyze your query.
INDEX(dim1, dt) with WHERE dim1 = a AND dt BETWEEN x and y will
1. If partitioned, prune down to the partition(s) representing x..y.
2. Drill down in the index's BTree to [a,x]. With partitioning the BTree might be 1 level shallower, but that savings is lost to the pruning of step 1.
3. Scan forward until [a,y]. If only one partition is involved, this scan hits exactly the same number of blocks whether partitioned or not. If multiple partitions are needed, then there is some extra overhead.
4. For each row, use the PRIMARY KEY to reach over into the data to get dim2. Again, virtually the same amount of effort. Without knowing the engine and PRIMARY KEY, I cannot finish discussing step #4.
If (dim1, dim2, dt) is unique, make it the PK. In this case, INDEX(dim1, dt) is actually (dim1, dt, dim2), since the PK is included in every secondary index. That means step #4 really involves a 'covering' index -- that is, no extra work to reach for dim2 (zero disk hits).
If, on the other hand, you did SELECT metric..., then step #4 does have the effort mentioned.
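Pulling those points together, a revised schema might look roughly like this; the MEDIUMINT sizes and the choice of (dimension1, dimension2, dt) as the primary key are assumptions that only hold if the values fit and that combination is unique:
-- Illustrative revision: InnoDB, NOT NULL everywhere, smaller dimension types,
-- an explicit PRIMARY KEY, and the secondary index discussed above
-- (which InnoDB extends with the PK columns, making it covering for the SELECT).
CREATE TABLE analytics (
  dt         DATE               NOT NULL,
  dimension1 MEDIUMINT UNSIGNED NOT NULL,
  dimension2 MEDIUMINT UNSIGNED NOT NULL,
  metrics1   BIGINT UNSIGNED    NOT NULL,
  metrics2   BIGINT UNSIGNED    NOT NULL,
  PRIMARY KEY (dimension1, dimension2, dt),
  INDEX baseindex (dimension1, dt)
) ENGINE=InnoDB;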

Why is MySQL with InnoDB doing a table scan when key exists and choosing to examine 70 times more rows?

I'm troubleshooting a query performance problem. Here's an expected query plan from explain:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | table1 | range | tdcol | tdcol | 8 | NULL | 5437848 | Using where |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.
However, if I query for just one more minute of data, we get this query plan:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| 1 | SIMPLE | table1 | ALL | tdcol | NULL | NULL | NULL | 381601300 | Using where |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)
The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.
Also, the 'USE KEY tdcol' syntax does not change the query plan.
Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.
5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).
This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But, unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.
In the absence of that order-correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach will involve reading most or nearly all of the pages in the underlying table anyway (in which case the scan will be much less expensive than the index approach: fewer reads outright, and sequential instead of random reads).
MySQL's query optimizer has a cutoff when figuring out how to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?
I don't have an answer, but I have a suspicion MySQL doesn't have enough memory to explore the index. I would be looking at the server memory settings, particularly the InnoDB buffer pool and some of the other key caches.
What's the distribution of your data like? Try running a min(), avg(), max() on it to see where it is. It's possible that that 1 minute makes the difference in how much information is contained in that range.
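A quick sketch of that check, using the table and column names from the question (AVG() of a datetime is only a rough indicator, so MIN/MAX plus row counts may be more telling):
-- How is tdcol distributed, and how many rows does each range actually hit?
SELECT MIN(tdcol), MAX(tdcol), COUNT(*) FROM table1;
SELECT COUNT(*) FROM table1 WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:16';
SELECT COUNT(*) FROM table1 WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';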
It can also just be the background settings of InnoDB. There are a few factors like page size, and memory, as staticsan said. You may want to explicitly define a B+Tree index.
"so I have a hard time believing that the table scan is better."
True. YOU have a hard time believing it. But the optimizer seems not to.
I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.
That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?
If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently, compared to the case when it has to conclude that "oh, there are definitely some rows that won't satisfy the upper limit, so I'll use the index just to be on the safe side".