I have a very slow query on a MySQL server. Here is the query:
SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT, TERM, START_DT, END_DT, CTYPE, MW AS MW_AWARD,
Mark, SCID
FROM
( SELECT a.CRR_DT, a.TOU, a.SRCE, a.SINK, a.NAME, a.SEASON, a.SRCESUMCONG,
a.SINKSUMCONG, a.SRCEAVGCONG, a.SINKAVGCONG, a.SUMSINKMSRCE,
a.AVGSINKMSRCE, a.HOURCOUNT, b.TERM, b.CTYPE, b.START_DT,
b.END_DT, b.MW, b.SCID, b.Mark
FROM
( SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01'
) a
INNER JOIN
( SELECT MARKET, TERM, TOU, SRCE, SINK, NAME, SCID, CTYPE, START_DT,
END_DT, SUM(MW) AS MW, SUBSTR(MARKET, 1, 3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL' , 'LDES')
GROUP BY MARKET , TOU , SRCE , SINK , NAME , SCID , CTYPE ,
START_DT , END_DT
) b ON a.NAME = b.NAME
AND a.TOU = b.TOU
) c
WHERE c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7 )
ORDER BY NAME , CRR_DT , TOU ASC
Here is the EXPLAIN plan generated with MySQL Workbench:
I guess the red blocks are the dangerous ones. Can someone please help me understand this plan? A few hints on what I should check, given this execution plan, would be appreciated.
Edit: added the table layouts.
CREATE TABLE `CRR_CONGCALC` (
`CRR_DT` varchar(7) NOT NULL,
`TOU` varchar(50) NOT NULL,
`SRCE` varchar(50) NOT NULL,
`SINK` varchar(50) NOT NULL,
`SRCESUMCONG` decimal(12,6) DEFAULT NULL,
`SINKSUMCONG` decimal(12,6) DEFAULT NULL,
`SRCEAVGCONG` decimal(12,6) DEFAULT NULL,
`SINKAVGCONG` decimal(12,6) DEFAULT NULL,
`SUMSINKMSRCE` decimal(12,6) DEFAULT NULL,
`AVGSINKMSRCE` decimal(12,6) DEFAULT NULL,
`HOURCOUNT` int(11) NOT NULL DEFAULT '0',
`SEASON` char(1) NOT NULL DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`CRR_DT`,`SRCE`,`SINK`,`TOU`,`HOURCOUNT`),
KEY `srce_index` (`SRCE`),
KEY `srcesink` (`SRCE`,`SINK`)
)
CREATE TABLE `CRR_INVENTORY` (
`MARKET` varchar(50) NOT NULL,
`TERM` varchar(50) NOT NULL,
`TOU` varchar(50) NOT NULL,
`INVENTORY_DT` date NOT NULL,
`START_DT` datetime NOT NULL,
`END_DT` datetime NOT NULL,
`CRR_ID` varchar(50) NOT NULL,
`NSR_INDEX` tinyint(1) NOT NULL,
`SEGMENT` tinyint(1) NOT NULL,
`CTYPE` varchar(50) NOT NULL,
`CATEGORY` varchar(50) NOT NULL,
`COPTION` varchar(50) NOT NULL,
`SRCE` varchar(50) DEFAULT NULL,
`SINK` varchar(50) DEFAULT NULL,
`MW` decimal(8,4) NOT NULL,
`SCID` varchar(50) NOT NULL,
`SEASON` char(1) DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`MARKET`,`INVENTORY_DT`,`CRR_ID`),
KEY `srcesink` (`SRCE`,`SINK`)
)
Brings back memories. With a database, a "full table scan" means that there is nothing the database can use to speed up the query; it reads the entire table. The rows are stored in no sorted order, so there is no better way to search for the employee id you are looking for.
This is bad. Why?
If you have a table with a bunch of columns:
first_name, last_name, employee_id, ..., column50, and you do a search WHERE employee_id = 1234, then without an index on the employee_id column you're doing a sequential scan. It's even worse if you're doing a join table2 ON table1.employee_id = table2.eid, because the employee_id has to be matched against every record in the join table.
If you create an index, you greatly reduce the scan time to find the matches (or throw away the non-matches) because instead of doing a sequential scan you can search a sorted field. Much faster.
When you create an index on the employee_id field, you are creating a way to search for employee numbers that is much, much, much faster. When you create an index, you are saying "I am going to join based on this field or have a where clause based on this field". This speeds up your query at the cost of a little bit of disk space.
There are all kinds of tricks with indexes, you can create them so they are unique, not unique, composite (contain multiple columns) and all kinds of stuff. Post your query and we can tell you what you might look at indexing to speed this up.
A good rule of thumb is that you should create an index on your tables on fields that you use in a where clause, join criteria or order by. Picking the field depends on a few things that are beyond the scope of this discussion, but that should be a start.
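For illustration, here is what creating such indexes might look like, covering the "unique, not unique, composite" kinds mentioned above (the employees table and index names here are hypothetical, just matching the example columns):

CREATE INDEX idx_employee_id ON employees (employee_id);        -- plain index: speeds up WHERE employee_id = 1234 and the join
CREATE UNIQUE INDEX uq_employee_id ON employees (employee_id);  -- unique variant: also enforces one row per id
CREATE INDEX idx_name ON employees (last_name, first_name);     -- composite index spanning multiple columns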
The pattern FROM ( SELECT... ) JOIN ( SELECT... ) ON ... does not optimize well. See if you can go directly from one of the tables, not hide it in a subquery.
CRR_CONGCALC needs INDEX(CRR_DT). (Please provide SHOW CREATE TABLE.)
CRR_INVENTORY needs INDEX(COPTION, START_DT).
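In SQL, those recommendations look like this (the index names are my own):

ALTER TABLE CRR_CONGCALC ADD INDEX idx_crr_dt (CRR_DT);
ALTER TABLE CRR_INVENTORY ADD INDEX idx_coption_start_dt (COPTION, START_DT);

(Note: per the schema now shown above, CRR_DT already leads the PRIMARY KEY of CRR_CONGCALC, so the first index may turn out to be redundant.)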
Please make those changes, then come back for more advice, if needed.
According to your explain diagram, there are full table scans happening at each sub-query on CRR_CONGCALC and CRR_INVENTORY. Then when you join the sub-queries together, another full table scan, and finally, when the result set is ordered, one more full table scan.
A few tips to improve performance:
Use fields that are indexed as part of your join statement, where clause, group by clause & order by clause. If this query is used often, consider adding indexes to all relevant columns.
Avoid nested sub-queries with aggregate operations in joins as much as possible. The result sets returned by the sub-queries are not indexed, so joining on them ends up scanning the whole result set rather than just an index. The joins in this query could also result in weird and hard-to-detect fan-out issues, but that isn't the performance issue you're seeking a solution for.
Filter the result set as early as possible (i.e. in all the sub-queries at the innermost layer) to minimize the number of rows the database server has to subsequently deal with.
Unless the final ORDER BY is necessary, avoid it.
Use temporary (or materialized) tables to de-nest subqueries. On these tables you can add indexes, so further joining will be efficient. This assumes that you have the permissions to create and drop tables on the server.
That said, here's how I would refactor your query.
In generating the inner query b, the GROUP BY clause does not contain all of the non-aggregated columns. This is non-standard SQL, and it leads to malformed data. MySQL allows it, and for the love of god I don't know why. It is best to avoid this trap.
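As an aside, if you want MySQL itself to reject such queries, you can enable the ONLY_FULL_GROUP_BY SQL mode (it is part of the default mode from MySQL 5.7.5 onward):

SET SESSION sql_mode = CONCAT(@@SESSION.sql_mode, ',ONLY_FULL_GROUP_BY');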
The final wrapping query is unnecessary, as the where clause and group by clause can be applied to the unwrapped query.
This where clause seems fishy to me:
c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7)
START_DT & END_DT are DATETIME columns being implicitly cast to char. It would be better to extract the year-month explicitly with DATE_FORMAT, matching the 'YYYY-MM' layout of CRR_DT:
DATE_FORMAT(<FIELD>, '%Y-%m')
Even if the WHERE clause you used worked, it would omit records for which END_DT and CRR_DT fall in the same month. I'm not sure if that is the desired behaviour, but here's a query to illustrate what your boolean expression would evaluate to:
SELECT CAST('2015-07-05' AS DATETIME) between '2015-07' and '2015-07';
-- This query returns 0 == False.
Using CREATE TABLE ... AS SELECT syntax, first de-nest the sub-queries. Note: as I don't know the data, I'm not sure which indexes need to be unique. You can drop the tables once the result is consumed.
Table 1:
CREATE TABLE sub_a (KEY(CRR_DT), KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT CRR_DT,
TOU,
SRCE,
SINK,
NAME,
SEASON,
SRCESUMCONG,
SINKSUMCONG,
SRCEAVGCONG,
SINKAVGCONG,
SUMSINKMSRCE,
AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01';  -- CRR_DT is varchar(7) in 'YYYY-MM' form, so compare against that format
Table 2:
CREATE TABLE sub_b (KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT MARKET,
TERM,
TOU,
SRCE,
SINK,
NAME,
SCID,
CTYPE,
START_DT,
END_DT,
SUM(MW) AS MW_AWARD,
SUBSTR(MARKET,1,3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL','LDES')
GROUP BY MARKET, TERM, TOU,
SRCE, SINK, NAME, SCID,
         CTYPE, START_DT, END_DT, MARK;
-- note the two added columns (TERM and MARK) in the GROUP BY clause
After this, the final query would be simply:
SELECT a.CRR_DT,
a.TOU,
a.SRCE,
a.SINK,
a.NAME,
a.SEASON,
a.SRCESUMCONG,
a.SINKSUMCONG,
a.SRCEAVGCONG,
a.SINKAVGCONG,
a.SUMSINKMSRCE,
a.AVGSINKMSRCE,
a.HOURCOUNT,
b.TERM,
b.CTYPE,
b.START_DT,
b.END_DT,
b.MW_AWARD,
b.SCID,
b.Mark
FROM sub_a a
JOIN sub_b b ON a.NAME = b.NAME AND a.TOU = b.TOU
WHERE a.CRR_DT BETWEEN DATE_FORMAT(b.START_DT,'%Y-%m')
                   AND DATE_FORMAT(b.END_DT,'%Y-%m')
ORDER BY NAME,
CRR_DT,
TOU;
The above WHERE clause follows the same logic as your query, except that it produces the year-month string explicitly with DATE_FORMAT rather than relying on SUBSTR over an implicit cast. However, this WHERE clause may be more appropriate:
WHERE sub_a.CRR_DT BETWEEN DATE_FORMAT(sub_b.START_DT,'%Y-%m')
                       AND DATE_FORMAT(DATE_ADD(sub_b.END_DT, INTERVAL 1 MONTH),'%Y-%m')
Finally, both sub_a and sub_b seem to have the fields SRCE and SINK. Would the result change if you added them to the join? That could further reduce the processing time of the query (at this point, it's fair to say queries).
By doing the above, we hopefully avoid two full table scans, but I don't have your data set, so I'm only making an educated guess here.
If it's possible to express this logic without using intermediary tables, directly via joins to the actual underlying tables CRR_CONGCALC and CRR_INVENTORY, that would be even faster.
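And, as noted earlier, the scratch tables can be dropped once the result is consumed:

DROP TABLE sub_a;
DROP TABLE sub_b;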
Full table scan operations are not always bad, or necessarily evil. Sometimes a full scan is the most efficient way to satisfy a query. For example, the query SELECT * FROM mytable requires MySQL to return every row in the table and every column in each row. In this case, using an index would just add work; it's faster to do a full scan.
On the other hand, if you're retrieving a couple of rows out of a million, an access plan using a suitable index is very likely to be much faster than a full table scan. Effective use of a index can eliminate vast swaths of rows that would otherwise need to be checked; the index basically tells MySQL that the rows we're looking for cannot be in 99% of the blocks in the table, so those blocks don't need to be checked.
MySQL processes views (including inline views) differently than other databases. MySQL uses the term derived table for an inline view. In your query, a, b and c are all derived tables. MySQL runs the inner query to produce the rows, and then materializes the result into a table. Once that is completed, the outer query can run against the derived table. As of MySQL 5.5 (and, I think, 5.6), inline views are always materialized as derived tables, and that's a performance killer for large sets. (Some performance improvements are coming in newer versions of MySQL, including some automatic indexing of derived tables.)
Also, predicates in the outer query do not get pushed down into the view query. That is, if we run a query like this:
SELECT t.foo
FROM mytable t
WHERE t.foo = 'bar'
MySQL can make use of an index with a leading column of foo to efficiently locate the rows, even if mytable contains millions of rows. But if we write the query like this:
SELECT t.foo
FROM (SELECT * FROM mytable) t
WHERE t.foo = 'bar'
We're essentially forcing MySQL to make a copy of mytable: running the inline view query populates a derived table containing all rows from mytable. Once that operation is complete, the outer query can run. But now there's no index on the foo column in the derived table, so we're forcing MySQL to do a full scan of the derived table, looking at every row.
If we need an inline view, then relocating the predicate to the inline view query will result in a much smaller derived table.
SELECT t.foo
FROM (SELECT * FROM mytable WHERE foo = 'bar') t
With that, MySQL can make use of the index on foo to quickly locate the rows, and only those rows are materialized into the derived table. The full scan of the derived table isn't as painful now, because the outer query needs to return every row. In this example, it would also be much better to replace that * (representing every column) with just the columns we need to return.
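For example, if foo is the only column we need:

SELECT t.foo
  FROM (SELECT foo FROM mytable WHERE foo = 'bar') t;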
The resultset you specify could be returned without the unnecessary inline views. A query something like this:
SELECT c.crr_dt
, c.tou
, c.srce
, c.sink
, c.name
, c.season
, c.srcesumcong
, c.sinksumcong
, c.srceavgcong
, c.sinkavgcong
, c.sumsinkmsrce
, c.avgsinkmsrce
, c.hourcount
, b.term
, b.start_dt
, b.end_dt
, b.ctype
, b.mw AS mw_award
, b.scid
, b.mark
FROM CRR_CONGCALC c
JOIN ( SELECT i.market
, i.term
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
, SUM(i.mw) AS mw
, SUBSTR(i.market, 1, 3) AS mark
FROM CRR_INVENTORY i
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
GROUP
BY i.market
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
) b
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
ORDER
BY c.name
, c.crr_dt
, c.tou
NOTES: If start_dt and end_dt are defined as DATE, DATETIME or TIMESTAMP columns, then I'd prefer to write the predicate like this:
AND c.crr_dt BETWEEN DATE_FORMAT(b.start_dt,'%Y-%m') AND DATE_FORMAT(b.end_dt,'%Y-%m')
(I don't think there's any performance to be gained there; that just makes it more clear what we're doing.)
In terms of improving performance of that query...
If we're returning a small subset of rows from CRR_INVENTORY, based on the predicates:
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
Then MySQL would likely be able to make effective use of an index with leading columns of (coption, scid, start_dt). That's assuming this represents a relatively small subset of rows from the table. If those predicates are not very selective, if we're really getting 50% or 90% of the rows in the table, the index is likely going to be much less effective.
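That index could be added like so (the index name is mine):

ALTER TABLE CRR_INVENTORY ADD INDEX ix_coption_scid_start (COPTION, SCID, START_DT);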
We might be able to get MySQL to make use of an index to satisfy the GROUP BY clause, without requiring a sort operation. To get that, we'd need an index with leading columns that match the columns listed in the GROUP BY clause.
The derived table isn't going to have an index on it, so for best performance of the join operation, once the derived table b is materialized, we are going to want a suitable index on the other table, CRR_CONGCALC. We want the leading columns of that index to be usable for the lookup of the matching rows, given the predicates:
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
So, we want an index with leading columns of (name, tou, crr_dt) to be able to efficiently locate the matching rows.
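For example (again, the index name is illustrative):

ALTER TABLE CRR_CONGCALC ADD INDEX ix_name_tou_crrdt (NAME, TOU, CRR_DT);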
This feels like a "do my homework for me" kind of question, but I'm really stuck here trying to make this query run quickly against a table with many, many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will show all the required columns but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) NOT NULL,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using SELECT DISTINCT instead, which avoids the 'Using filesort'. That would probably be okay if there were a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, with the target1Name and target2Name columns missing) runs in about 5 seconds on ~24 million records. Our target is in the hundreds of millions of rows, and we'd like the query to continue to perform along those lines (hoping to keep it under 1-2 minutes, ideally much better). My fear is that once we hit that larger amount of data it won't; work to simulate the additional rows is underway.
I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.
As you probably understand, the problem here is the range condition ae.eventTime BETWEEN '2011-09-01 03:00:00' AND '2012-09-30 23:57:00', which (as a range condition always does) breaks efficient usage of the Transactions index: the index is actually used only for the clientId equality and the first part of the range condition, and it cannot be used for the grouping.
Most often, the solution is to replace the range condition with an equality check: in your case, introduce a period column, group eventTime into periods, and replace the BETWEEN clause with something like period IN (1,2,3,4,5), as sketched below. But this might become an overhead for your table.
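A minimal sketch of that idea (the period column, its YYYYMM integer encoding, and the index name are all my own assumptions; note the whole-month granularity only approximates your 03:00:00 / 23:57:00 boundaries, and the column would need to be kept in sync by the application or a trigger on later inserts):

ALTER TABLE AuditEvent
  ADD COLUMN period INT NOT NULL DEFAULT 0,
  ADD KEY PeriodTransactions (clientId, period, target1Id, type);

UPDATE AuditEvent SET period = EXTRACT(YEAR_MONTH FROM eventTime);  -- e.g. 201109

SELECT target1Id, type, COUNT(*)
FROM AuditEvent
WHERE clientId = 4
  AND period IN (201109, 201110, 201111, 201112, 201201, 201202,
                 201203, 201204, 201205, 201206, 201207, 201208, 201209)
GROUP BY target1Id, type;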
Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
    ) AS cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will: a) move the range condition to the end, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query does not need extra disk I/O to read the table rows).
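To check the effect, prefix the query with EXPLAIN and look for "Using index" in the Extra column (and for the absence of "Using temporary" / "Using filesort"); here in a simplified form with a plain COUNT(*):

EXPLAIN SELECT ae.target1Id, ae.type, COUNT(*)
FROM AuditEvent ae
WHERE ae.clientId = 4
GROUP BY ae.target1Id, ae.type;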
UPD1:
I am sorry, yesterday I did not read your post carefully and did not notice that your problem is to retrieve target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute a subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not show Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then just try the following:
SELECT a1.target1Id, a1.type, a1.cnt, a2.target1Name, a2.target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;
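The MAX(auditEventId) in the group picks the id of the latest matching event per (target1Id, type) group, and the join back to AuditEvent then fetches target1Name and target2Name from that single row, which matches your "the most recent will do" requirement.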