Using range with composite key - MySQL

A MySQL database contains the following two tables (simplified):
(~13000 rows)       (~7000000 rows)
---------------     --------------------
| packages    |     | packages_prices  |
---------------     --------------------
| id (int)    |<- ->| package_id (int) |
| state (int) |     | variant_id (int) |
---------------     | for_date (date)  |
                    | price (float)    |
                    --------------------
Each package_id/for_date combination has only a few (average 3) variants.
And state is 0 (inactive) or 1 (active). Around 4000 of the 13000 are active.
First I just want to know which packages have a price set (regardless of variant), so I add a composite key covering (1) for_date and (2) package_id, and I query:
select distinct package_id from packages_prices where for_date > date(now())
This query takes 1 second to return 3500 rows, which is too much. An Explain tells me that the composite key is used with key_len 3, that 2000000 rows are examined with 100% filtered and type range, and Using where; Using index; Using temporary. The distinct takes it back to 3500 rows.
If I take out distinct, the Using temporary is no longer mentioned, but the query then returns 1000000 rows and still takes 1 second.
question 1: Why is this query so slow, and how do I speed it up without having to add or change the columns in the table? I would expect that, given the composite key, this query should cost less than 0.01s.
Now I want to know which active packages have a price set.
So I add a key on state and I add a new composite key just like above, but in reverse order. And I write my query like this:
select distinct packages.id from packages
inner join packages_prices on id = package_id and for_date > date(now())
where state = 1
The query now takes 2 seconds. An Explain tells me that for the packages table the key on state is used with key_len 4, 4000 rows are examined and 100% filtered with type ref, and Using index; Using temporary. For the packages_prices table the new composite key is used with key_len 4, 1000 rows are examined and 33.33% filtered with type ref, and Using where; Using index; Distinct. The distinct takes it back to 3000 rows.
If I take out distinct, the Using temporary and Distinct are no longer mentioned, but the query returns 850000 rows and takes 3 seconds.
question 2: Why is the query that much slower now? Why is range no longer being used according to the Explain? And why has filtering with the new composite key dropped to 33.33%? I expected the composite key to filter 100% again.
This all seems very basic and trivial, but it has been costing me hours and hours and I still don't understand what's really going on under the hood.

Your observations are consistent with the way MySQL works. For your first query, using the index (for_date, package_id), MySQL will start at the specified date (using the index to find that position), but then has to go to the end of the index, because every next entry can reveal a yet unknown package_id. A specific package_id could e.g. have just been used on the latest for_date. That search will add up to your 2000000 examined rows. The relevant data is retrieved from the index, but it will still take time.
What to do about that?
With some creative rewriting, you can transform your query to the following code:
select package_id from packages_prices
group by package_id
having max(for_date) > date(now());
It will give you the same result as your first query: if there is at least one for_date > date(now()) (which will make it part of your resultset), that will be true for max(for_date) too. But this will only have to check one row per package_id (the one having max(for_date)), all other rows with for_date > date(now()) can be skipped.
MySQL will do that by using the index for the group-by optimization ("Using index for group-by" should be displayed in your Explain). It requires the index (package_id, for_date) (that you already have) and only has to examine 13000 rows: since the list is ordered, MySQL can jump directly to the last entry for each package_id, which holds the value for max(for_date), and then continue with the next package_id.
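In DDL terms, that index (whatever it is actually named in your schema) would look something like the following sketch; the name here is illustrative, the column order is what matters:

-- illustrative name; the column order (package_id first, then for_date)
-- is what enables the group-by optimization described above
alter table packages_prices
    add index idx_package_for_date (package_id, for_date);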
Actually, MySQL can use this method to optimize a distinct too (and will probably do that if you remove the condition on for_date), but it is not always able to find a way; a really clever optimizer could have rewritten your query the same way I did, but we are not there yet.
And depending on your data distribution, that method could even be a bad idea: if you had e.g. 7000000 package_id, but only 20 of them with a for_date in the future, checking each package_id for the maximum for_date would be much slower than just checking the 20 rows that you can easily find by the index on for_date. So knowledge about your data will play an important role in choosing a better (and maybe optimal) strategy.
You can rewrite your second query in the same way. Unfortunately, such optimizations are not always easy to find and often specific to a specific query and situation. If you have a different distribution (as mentioned above) or if you e.g. slightly change your query and add an end-date, that method would not work anymore and you have to come up with another idea.
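To make that rewrite of the second query concrete, it could look along these lines (a sketch, not tested against your actual tables): reduce packages_prices to one row per package_id first, then join the much smaller derived table against packages:

select p.id
from packages p
inner join (
    -- one row per package_id, satisfied by the (package_id, for_date) index
    select package_id
    from packages_prices
    group by package_id
    having max(for_date) > date(now())
) pp on pp.package_id = p.id
where p.state = 1;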

Related

No further optimization for this query?

I have some tables I want to join, but the join cannot be allowed to take dozens of seconds.
I want to go from this query that takes ~1s
SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
|  1229380 |
+----------+
1 row in set
Time: 1.173s
to this joined query that is taking ~50s
SELECT COUNT(*) FROM business b
INNER JOIN business_group bg ON b.id=bg.business_id
WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
|  1229380 |
+----------+
1 row in set
Time: 51.346s
Why does it take that long if the only thing it does differently is to join on the primary key of the business table (business.id)?
Besides this primary key index, I also have this one (group_id, business_id) on business_group (with (business_id, group_id) it took even longer).
Following is the execution plan:
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| 1 | SIMPLE | bg | <null> | ref | FKo2q0jurx07ein31bgmfvuk8gf,idx_bg_group_id_business_id | idx_bg_group_id_business_id | 9 | const | 2654528 | 100.0 | Using index |
| 1 | SIMPLE | b | <null> | eq_ref | PRIMARY | PRIMARY | 4 | database.bg.group_id | 1 | 100.0 | Using where; Using index |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
Is it possible to optimize the second query so it takes less time?
business table is ~45M rows while business_group is ~60M rows.
I'm writing this as someone who does a lot of indexing setups on SQL Server rather than MySQL. It is too long as a comment, and is based on what I believe are fundamentals, so hopefully it will help.
Why?
Firstly - why does it take so long for the second query to run? The answer is that it needs to do a lot more work in the second one.
To demonstrate, imagine the only non-clustered index you have is on business_group for group_id.
You run the first query SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040.
All the engine needs to do is to seek to the appropriate spot in the index (where group_id = 1040), then read/count rows from the index (which is just a series of ints) until it changes - because the non-clustered index is sorted by that column.
Note if you had a second field in the non-clustered index (e.g., group_id, business_id), it would be almost as fast because it's still sorted on group_id first. However, it will be a slightly larger read as each row is 2x the size of the single-column version (but would still be very small).
Imagine you then run a slightly different query, counting business_id instead of * e.g., SELECT COUNT(business_id) FROM business_group bg WHERE bg.group_id=1040.
Assuming business_id is not the PK (and is not in the non-clustered index), then for every row it finds in the index, it needs to go back and read the business_id from the table to check it's not null (either via some sort of loop/lookup, or by reading the whole table - I'm not 100% sure how MySQL deals with these). Either way, it is a lot more work than above.
If business_id was in the index (as above, for group_id, business_id), then it could read that data straight from the index and not need to refer back to the original table - which is good.
Now add the join (your second query) SELECT COUNT(*) FROM business b INNER JOIN business_group bg ON b.id=bg.business_id WHERE bg.group_id=1040. The engine needs to
Get each business_id as above
Potentially sort the business IDs to help with the join
Join it to the business table (to ensure it has a valid row in the business table)
... and to do so, it may need to read all the row's data in the business table
Suggestion #1 - Avoid going to the business table
If you set up foreign keys to ensure that business_id in business_group is valid - then do you need to run the version with the join? Just run the first version.
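For example, a sketch of such a constraint (the constraint name here is made up); once it is in place, every business_id in business_group is guaranteed to reference an existing business row, so the join adds no filtering and the single-table count gives the same answer:

-- hypothetical constraint name
ALTER TABLE business_group
    ADD CONSTRAINT fk_bg_business
    FOREIGN KEY (business_id) REFERENCES business (id);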
Suggestion #2 - Indexes
If this was SQL Server and you needed that second query to run as fast as possible, I would set up two non-clustered indexes
NONCLUSTERED INDEX ... ON business_group (group_id, business_id)
NONCLUSTERED INDEX ... ON business (id)
The first means the engine can seek directly to the specific group_id, and then get a sorted list of business_id.
The second provides a sorted list of id (business_id) from the business table. As it has the same sort as the results from the first index, it means the join is a lot less work.
However, the second one is controversial - many people would say 'no' to this as it overlaps your PK (or, more specifically, clustered index). It would also be sorted the same way. However, at least in SQL Server, this would include all the other data about the businesses e.g., the business name, address, etc. So to read the list of IDs from business, you'd also need to read the rest of the data - taking a lot more time.
However, if you put a non-clustered index just on ID, it will be a very narrow index (just the IDs) and therefore the amount of data to be read would be much less - and therefore often a lot faster.
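Translated into MySQL terms, those two indexes would look roughly like this (index names are made up; note that in InnoDB the primary key is the clustered index, so the second one is a deliberately narrow duplicate of it):

-- MySQL equivalents of the two suggested non-clustered indexes
CREATE INDEX idx_bg_group_business ON business_group (group_id, business_id);
CREATE INDEX idx_b_id ON business (id);  -- narrow index: just the IDs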
Note though, that this is not as fast as if you could avoid doing the join altogether (e.g., Suggestion #1 above).

Efficient way to get last record from the database

I have a database with the following table structure:
id | entry | log_type | user_id | created_on    |
------------------------------------------------|
1  | a     | error    | 1       | 1433752884000 |
2  | b     | warn     | 2       | 1433752884001 |
3  | c     | error    | 2       | 1433752884002 |
4  | d     | warn     | 4       | 1433752884003 |
I want to obtain the last record from the table based on the created_on field. Currently I am using the following query to obtain the result list, and I pick the last record out of it in Java:
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc;
I am using JPA and execute the query using .getResultList() on the Query interface. Once I get the result list I do a get(0) to obtain the desired last record.
I have a large table with a lot of data, and the above query takes too long to execute and stalls the application. I cannot add an additional index on the existing data for now. Apart from adding an index, is there an alternate approach to avoid stalling on this query?
I was thinking of executing the following query instead:
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc limit 1;
Currently I cannot execute my second query on the database as it might cause my application to stall. Will execution of the second query be faster than the first?
I don't have a sufficiently large dataset available to reproduce the stalling problem on my local system. I tried executing the queries on my local database, but due to the lack of a large dataset I was unable to determine whether the second query would be faster with the addition of LIMIT.
If the second query isn't expected to perform better, what approach should I take to get an optimized query?
In case the second query is good enough to avoid stalling, is that because the DB fetches only one record instead of the entire set of records? Does the database handle looking up/fetching a single record differently from fetching many records (as in the first query), so that query timings improve?
The performance depends...
ORDER BY x LIMIT 1
is a common pattern. It may or may not be very efficient -- It depends on the query and the indexes.
In your case:
where l.user_id=2 and l.log_type = 'error' order by l.created_on desc
this would be optimal:
INDEX(user_id, log_type, created_on)
With that index, it will essentially do one probe to find the row you need. Without that index, it will scan much or all of the table, sort it descending (ORDER BY .. DESC) and deliver the first row (LIMIT 1).
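In DDL form that would be something like the following (the index name is illustrative). The two equality columns come first and the ORDER BY column last, so the newest matching row sits at one end of the index range:

-- illustrative name; equality columns first, ORDER BY column last
ALTER TABLE log_table
    ADD INDEX idx_user_type_created (user_id, log_type, created_on);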
Before you do your query.getResultList(), you need to query.setMaxResults(1). This is the equivalent of LIMIT 1.
But be aware that if your Entity has a Collection of related sub-objects JOINed to it in the query, the Entity Manager may still have to do an unbounded select to get all the data it needs to build the first Entity. See this question and answer for more information about that.
In your case, as you only need one Entity, I would recommend lazy-loading any attached Entities after you have done the initial query.

Is a MySQL primary key already in some sort of default order

I just stumbled upon a few lines of code in a system I just started working with that I don't really get. The system has a large table that saves lots of entities with unique IDs and removes them once they're no longer needed, but it never reuses them. So the table looks like this
------------------------
| id |info1|info2|info3|
------------------------
| 1  | foo1| foo2| foo3|
------------------------
| 17 | bar1| bar2| bar3|
------------------------
| 26 | bam1| bam2| bam3|
------------------------
| 328| baz1| baz2| baz3|
------------------------
etc.
In one place in the codebase there is a while loop whose purpose is to loop through all entities in the DB and do things to them. Right now this is solved like this
int lastId = fetchMaxId()
int id = 0
while (id = fetchNextId(id, lastId)){
    doStuffWith(id)
}
where fetchMaxId is straightforward
int fetchMaxId(){
    return sqlQuery("SELECT MAX(id) FROM Table")
}
but fetchNextId confuses me. It is implemented as
int fetchNextId(currentId, maxId){
    return sqlQuery("
        SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
    ")
}
This system has been in production for several years so it obviously works, but when I tried searching for an explanation of why it works I only found people saying the same thing I already thought I knew: the order in which a MySQL DB returns the result is not easily determined and should not be relied upon, so if you want a particular order, use an ORDER BY clause. But are there some times you can safely ignore the ORDER BY? This code has worked for 12 years and continued to work through several DB updates. Are we just lucky, or am I missing something here? Before I saw this code I would have said that if you called
fetchNextId(1, 328)
you could end up with either 17 or 26 as the answer.
Some clues to why this works may be that the id column is the primary key of the table in question and it's set to auto increment, but I can't find any documentation that would explain why
fetchNextId(1, 328)
should always return 17 when called on the table-snippet given above.
The short answer is yes, the primary key has an order, all indexes have an order, and a primary key is simply a unique index.
As you have rightly said, you should not rely on data being returned in the order the data is stored in, the optimiser is free to return it in any order it likes, and this will be dependent on the query plan. I will however attempt to explain why your query has worked for 12 years.
Your clustered index is just your table data, and your clustering key defines the order that it is stored in. The data is stored on the leaf, and the clustering key helps the root (and intermediate nodes) act as pointers to quickly get to the right leaf to retrieve the data. A nonclustered index is a very similar structure, but the lowest level simply contains a pointer to the correct position on the leaf of the clustered index.
In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. In other DBMS you can define both a primary key and a clustered index, when you do this your primary key becomes a unique nonclustered index with a pointer back to the clustered index.
In its simplest terms, you can imagine a table with an ID column that is the primary key and another column (A); your B-Tree structure for your clustered index would be something like:
Root Node
            +---+
            | 1 |
            +---+
Intermediate Nodes
  +---+     +---+     +---+
  | 1 |     | 4 |     | 7 |
  +---+     +---+     +---+
Leaf
       +-----------+    +-----------+    +-----------+
ID ->  | 1 | 2 | 3 |    | 4 | 5 | 6 |    | 7 | 8 | 9 |
A  ->  | A | B | C |    | D | E | F |    | G | H | I |
       +-----------+    +-----------+    +-----------+
In reality the leaf pages will be much bigger, but this is just a demo. Each page also has a pointer to the next page and the previous page for ease of traversing the tree. So when you do a query like:
SELECT ID, A
FROM T
WHERE ID > 5
LIMIT 1;
you are scanning a unique index so it is very likely this will be a sequential scan. Very likely is not guaranteed though.
MySQL will scan the Root node, if there is a potential match it will move on to the intermediate nodes, if the clause had been something like WHERE ID < 0 then MySQL would know that there were no results without going any further than the root node.
Once it moves on to the intermediate node it can identify that it needs to start on the second page (between 4 and 7) to start searching for an ID > 5. So it will sequentially scan the leaf starting on the second leaf page, having already identified the LIMIT 1 it will stop once it finds a match (in this case 6) and return this data from the leaf. In such a simple example this behaviour appears to be reliable and logical. I have tried to force exceptions by choosing an ID value I know is at the end of a leaf page to see if the leaf will be scanned in the reverse order, but as yet have been unable to produce this behaviour, this does not however mean it won't happen, or that future releases of MySQL won't do this in the scenarios I have tested.
In short, just add an order by, or use MIN(ID) and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan.
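For example, either of the following makes the intended order explicit instead of relying on the direction in which the clustered index happens to be scanned:

SELECT id FROM Table WHERE id > :currentId AND id <= :maxId ORDER BY id LIMIT 1;
-- or, equivalently for this purpose:
SELECT MIN(id) FROM Table WHERE id > :currentId AND id <= :maxId;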
The answer to your question is yes. If you look at the MySQL documentation you will see that whenever a table has a primary key it has an associated index.
When looking at the documentation for indexes you will see that primary keys are mentioned as a type of index.
So in case of your particular scenario:
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
The query will stop executing as soon as it has found a value because of the LIMIT 1.
Without the LIMIT 1 it would have returned 17, 26 and 328.
However, with all that said, I don't think you will run into any order problems when the primary key is auto-incrementing. But whenever the primary key is something like a unique employee number instead of an auto-incrementing field, I would not trust the order of the result, because the documentation also notes that MySQL reads sequentially, so the possibility is there that a primary key could fall out of the WHERE clause conditions and be skipped.

Improve performance of count and sum when already indexed

First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5 ms; the ones across 90k+ rows are taking 500-900 ms, and I really can't afford anything much past 100 ms.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.
You could try an index that has created_at before version_hash (you might get a better shot at an index range scan... it's not clear how that non-equality predicate on version_hash affects the plan, but I suspect it disables a range scan on the created_at column).
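As a sketch (the index name is hypothetical): keep the equality columns first, move created_at ahead of version_hash so the BETWEEN predicate can drive a range scan, and keep amount in the index so the query stays covered:

-- hypothetical name; created_at moved ahead of version_hash
CREATE INDEX disbursement_summation_index2
    ON disbursements (accumulation_hash, caller_id, active, created_at, version_hash, amount);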
Other than that, the query and the index look about as good as you are going to get, the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger size than I normally deal with.
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8bytes) or timestamp (4bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.

Performance difference between DISTINCT and GROUP BY

My understanding is that in (My)SQL a SELECT DISTINCT should do the same thing as a GROUP BY on all columns, except that GROUP BY does implicit sorting, so these two queries should be the same:
SELECT boardID,threadID FROM posts GROUP BY boardID,threadID ORDER BY NULL LIMIT 100;
SELECT DISTINCT boardID,threadID FROM posts LIMIT 100;
They're both giving me the same results, and they're giving identical output from EXPLAIN:
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
| 1 | SIMPLE | posts | ALL | NULL | NULL | NULL | NULL | 1263320 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+-----------------+
1 row in set
But on my table the query with DISTINCT consistently returns instantly and the one with GROUP BY takes about 4 seconds. I've disabled the query cache to test this.
There's 25 columns so I've also tried creating a separate table containing only the boardID and threadID columns, but the same problem and performance difference persists.
I have to use GROUP BY instead of DISTINCT so I can include additional columns without them being included in the evaluation of DISTINCT. So now I don't know how to proceed. Why is there a difference?
First of all, your queries are not quite the same - GROUP BY has ORDER BY, but DISTINCT does not.
Note that in either case no index is used, and that cannot be good for performance.
I would suggest creating a compound index on (boardID, threadID) - this should let both queries make use of the index, and both should start working much faster.
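For example (the index name is illustrative):

ALTER TABLE posts ADD INDEX idx_board_thread (boardID, threadID);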
EDIT: An explanation of why SELECT DISTINCT ... LIMIT 100 is faster than GROUP BY ... LIMIT 100 when you do not have indexes.
To execute the first statement (SELECT DISTINCT), the server only needs to fetch 100 (maybe slightly more) rows and can stop as soon as it has 100 distinct rows - no more work to do.
This is because the original SQL statement did not specify any order, so the server can deliver any 100 rows it pleases, as long as they are distinct. But if you were to impose any index-less ORDER BY on this before the LIMIT 100, the query would immediately become slow.
To execute the second statement (SELECT ... GROUP BY ... LIMIT 100), MySQL does an implicit ORDER BY on the same columns as were used in the GROUP BY. In other words, it cannot stop after fetching the first 100 or so rows; all records must be fetched, grouped and sorted. After that, it applies the ORDER BY NULL you added (which mainly tells MySQL not to sort the final result), and finally it takes the first 100 rows and throws away the remaining result. And of course, this is damn slow.
When you have the compound index, all these steps can be done very quickly in either case.