Optimize query for table with hundreds of millions of rows - mysql

This feels like a "do my homework for me" kind of question, but I'm really stuck here trying to make this query run quickly against a table with many, many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will show all the required columns but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) not null,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using select distinct instead, which avoids the 'Using filesort'. This would probably be okay if there were a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, without the target1Name and target2Name columns) runs in about 5 seconds against ~24 million records. Our target is in the hundreds of millions of rows, and we'd like the query to keep performing at that scale (hoping to stay under 1-2 minutes, though we'd like much better). My fear is that once we hit that larger amount of data it won't (work to simulate additional rows is underway).
I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.

As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as a range always does) breaks efficient usage of the Transactions index: the index is actually used only for the clientId equality check and the start of the range condition, and it cannot be used for the grouping.
Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods, and replace the BETWEEN clause with a period IN (1,2,3,4,5)). But this might add overhead to your table.
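A rough sketch of that period approach (the period column and index name here are hypothetical, and backfilling a table this size is itself a heavy operation):

```sql
-- Bucket eventTime into YYYYMM periods so the range becomes an IN list
-- of month values, leaving the rest of the index usable for the GROUP BY.
ALTER TABLE AuditEvent ADD COLUMN period INT NOT NULL DEFAULT 0;
UPDATE AuditEvent SET period = EXTRACT(YEAR_MONTH FROM eventTime);
ALTER TABLE AuditEvent ADD KEY PeriodIdx (clientId, period, target1Id, type);

SELECT target1Id, type, COUNT(*)
FROM AuditEvent
WHERE clientId = 4
  AND period IN (201109, 201110, 201111)  -- list every month in the range
GROUP BY target1Id, type;
```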
Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will a) move the range condition to the end, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query does not need disk I/O operations).
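As a sketch, the index change described above might look like this (the new index name is made up; only drop Transactions once nothing else relies on it):

```sql
-- New key order: equality column first, then the GROUP BY columns,
-- and the ranged eventTime column last, so the index covers the query.
ALTER TABLE AuditEvent
  DROP KEY Transactions,
  ADD KEY TransactionsGrouped (clientId, target1Id, type, eventTime);
```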
UPD1:
I am sorry, yesterday I did not read your post carefully and did not notice that your problem is retrieving target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute the (sub)query (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not have Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then try the following:
SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;

Related

MySQL 8 - Slow select when order by combined with limit

I'm having trouble understanding my options for optimizing this specific query. Looking online, I find various resources, but all for queries that don't match my particular one. From what I could gather, it's very hard to optimize a query when you have an ORDER BY combined with a LIMIT.
My use case is a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
& table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that displays the latest 10 records. I also have pagination that allows one to navigate pages (only able to go to the next page, not to the last or to make any big jumps).
To further help with getting the full picture I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - this fiddle does not show the same results as my table, but I suspect this is because of the data size in my table.
The table in question has 120GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' query without any extra prompting -- since it is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column, optimal, index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that at least one index is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
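That handler-count technique boils down to comparing the Handler_read_% session counters before and after the query; roughly:

```sql
FLUSH STATUS;                 -- zero this session's handler counters

SELECT `xyz`.*
FROM xyz
WHERE `xyz`.`fk_campaign_id` = 95870
  AND `xyz`.`voided` = 0
ORDER BY `registration_id` DESC
LIMIT 10 OFFSET 0;

-- Handler_read_* totals approximate the rows actually touched;
-- ~10 means the index worked, tens of thousands means a big scan.
SHOW SESSION STATUS LIKE 'Handler%';
```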

MySQL Optimization of a request

I have a join query on 2 tables. I run this query twice for pagination. Previously I did it with one SQL_CALC_FOUND_ROWS request.
Table A contains 145000 rows
Table B contains 91000 rows
These are MyIsam tables.
Table A is like this :
id MEDIUMINT(6) UNSIGNED PK
id_b MEDIUMINT(6) UNSIGNED INDEX
rights ENUM ('0', '1', '2', '3', '4', '5', '6', '7', '8') INDEX
avail ENUM ('0', '1', '2', '3') INDEX
del_date DATE INDEX
... and 40 others fields
And these indexes :
rights, avail, del_date
id_b, rights, avail, del_date
Table B is like this :
id MEDIUMINT(6) UNSIGNED PK
title VARCHAR(150) INDEX
.. and 47 others fields
So here are the requests :
SELECT COUNT(*)
FROM A
INNER JOIN B ON B.id = A.id_b
WHERE
A.rights NOT IN ('7')
AND (A.del_date IS NULL OR A.del_date > '2016-02-17')
AND A.avail NOT IN ('0')
ORDER BY B.title;
SELECT A.*, B.*
FROM A
INNER JOIN B ON B.id = A.id_b
WHERE
A.rights NOT IN ('7')
AND (A.del_date IS NULL OR A.del_date > '2016-02-17')
AND A.avail NOT IN ('0')
ORDER BY B.title
LIMIT 0, 20;
The first query currently takes 2 seconds and the 2nd 0.3 seconds. Before I optimized as much as I could, the SQL_CALC_FOUND_ROWS request sometimes took 10 seconds or more.
Optimization I have done :
suppressed the SQL_CALC_FOUND_ROWS and did it with 2 separate requests.
changed the fields rights and avail from TINYINT(4) UNSIGNED to ENUM.
added these 2 composite indexes, (rights, avail, del_date) and (id_b, rights, avail, del_date), to optimize the ORDER BY clause (very good gain).
But now, I notice that (A.del_date IS NULL OR A.del_date > '2016-02-17') is the cause of the slowness.
If I delete all this clause (A.del_date IS NULL OR A.del_date > '2016-02-17') or just A.del_date IS NULL, the request is very quick.
Any help will be very very appreciated !
The first SELECT, which has only a COUNT(*), is strange for several reasons...
The ORDER BY is useless, and possibly slows down the query.
What is the mapping between A and B? If it is 1:1, then there is no need to include anything about B. If there are 0 or 1 B rows for each A, then you need the JOIN. If it is 1:many, then the COUNT will be more than the number of rows from A; is that what you want? To fix that, change the JOIN into an EXISTS.
If you need to keep B in the query, add INDEX(rights, del_date, avail, id_b) (in any order).
If you do not need to keep B, then add INDEX(rights, del_date, avail) (in any order).
In either case, that will be a "covering index" (EXPLAIN will say "Using index") and it will run faster.
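As a sketch, those suggestions translate to something like (index names are made up):

```sql
-- If B must stay in the query: cover the WHERE columns plus the join
-- column, so the COUNT can be answered from the index of A alone.
ALTER TABLE A ADD INDEX cover_with_b (rights, del_date, avail, id_b);

-- If B can be dropped from the COUNT(*) query:
ALTER TABLE A ADD INDEX cover_no_b (rights, del_date, avail);
```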
OR is often a performance killer -- it usually eliminates the possibility of using an index. Please provide EXPLAIN SELECT to see if that is the case here.
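A common workaround when an OR on one column blocks index use is to split the query into a UNION ALL of two index-friendly halves; here the two branches (del_date IS NULL vs. del_date > the cutoff) are mutually exclusive, so the counts can simply be summed. A sketch for the COUNT query:

```sql
SELECT SUM(cnt) FROM (
  SELECT COUNT(*) AS cnt
  FROM A INNER JOIN B ON B.id = A.id_b
  WHERE A.rights NOT IN ('7') AND A.avail NOT IN ('0')
    AND A.del_date IS NULL            -- branch 1: never deleted
  UNION ALL
  SELECT COUNT(*)
  FROM A INNER JOIN B ON B.id = A.id_b
  WHERE A.rights NOT IN ('7') AND A.avail NOT IN ('0')
    AND A.del_date > '2016-02-17'     -- branch 2: deleted in the future
) t;
```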
Please do not present information in sentences; use the actual output or schema specifications. Things can get lost.
You should consider moving to InnoDB.
If necessary, you should get rid of the "out of nnn items", thereby eliminating the COUNT(*) or SQL_CALC_FOUND_ROWS.
For the main query, the problem is that the filtering is on one table, but the ORDER BY is on another table. This makes optimization either difficult or impossible.
Instead of having individual column indexes on your "A" table, I would recommend trying a composite index of multiple fields that match your criteria. That way the engine does not have to load the actual page data for every record to check the remaining conditions; it can confirm them from the index content alone, and then read the actual data pages only for the records that qualify.
Looking at your WHERE clause, and since you are using NOT IN for your rights and availability flags, I would have an index on ( del_date, avail, rights )
Your "B" table, I would also have an index on (id, title) for similar reasons.

Can the performance of this query be improved?

I have a very slow query on a MySQL server.
Here is the query:
SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT, TERM, START_DT, END_DT, CTYPE, MW AS MW_AWARD,
Mark, SCID
FROM
( SELECT a.CRR_DT, a.TOU, a.SRCE, a.SINK, a.NAME, a.SEASON, a.SRCESUMCONG,
a.SINKSUMCONG, a.SRCEAVGCONG, a.SINKAVGCONG, a.SUMSINKMSRCE,
a.AVGSINKMSRCE, a.HOURCOUNT, b.TERM, b.CTYPE, b.START_DT,
b.END_DT, b.MW, b.SCID, b.Mark
FROM
( SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01'
) a
INNER JOIN
( SELECT MARKET, TERM, TOU, SRCE, SINK, NAME, SCID, CTYPE, START_DT,
END_DT, SUM(MW) AS MW, SUBSTR(MARKET, 1, 3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL' , 'LDES')
GROUP BY MARKET , TOU , SRCE , SINK , NAME , SCID , CTYPE ,
START_DT , END_DT
) b ON a.NAME = b.NAME
AND a.TOU = b.TOU
) c
WHERE c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7 )
ORDER BY NAME , CRR_DT , TOU ASC
Here is the EXPLAIN plan, generated using MySQL Workbench:
I suspect the red blocks are the dangerous parts. Can someone please help me understand this plan? A few hints on what I should check given this execution plan would be appreciated.
Edit: added the table layouts
CREATE TABLE `CRR_CONGCALC` (
`CRR_DT` varchar(7) NOT NULL,
`TOU` varchar(50) NOT NULL,
`SRCE` varchar(50) NOT NULL,
`SINK` varchar(50) NOT NULL,
`SRCESUMCONG` decimal(12,6) DEFAULT NULL,
`SINKSUMCONG` decimal(12,6) DEFAULT NULL,
`SRCEAVGCONG` decimal(12,6) DEFAULT NULL,
`SINKAVGCONG` decimal(12,6) DEFAULT NULL,
`SUMSINKMSRCE` decimal(12,6) DEFAULT NULL,
`AVGSINKMSRCE` decimal(12,6) DEFAULT NULL,
`HOURCOUNT` int(11) NOT NULL DEFAULT '0',
`SEASON` char(1) NOT NULL DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`CRR_DT`,`SRCE`,`SINK`,`TOU`,`HOURCOUNT`),
KEY `srce_index` (`SRCE`),
KEY `srcesink` (`SRCE`,`SINK`)
)
CREATE TABLE `CRR_INVENTORY` (
`MARKET` varchar(50) NOT NULL,
`TERM` varchar(50) NOT NULL,
`TOU` varchar(50) NOT NULL,
`INVENTORY_DT` date NOT NULL,
`START_DT` datetime NOT NULL,
`END_DT` datetime NOT NULL,
`CRR_ID` varchar(50) NOT NULL,
`NSR_INDEX` tinyint(1) NOT NULL,
`SEGMENT` tinyint(1) NOT NULL,
`CTYPE` varchar(50) NOT NULL,
`CATEGORY` varchar(50) NOT NULL,
`COPTION` varchar(50) NOT NULL,
`SRCE` varchar(50) DEFAULT NULL,
`SINK` varchar(50) DEFAULT NULL,
`MW` decimal(8,4) NOT NULL,
`SCID` varchar(50) NOT NULL,
`SEASON` char(1) DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`MARKET`,`INVENTORY_DT`,`CRR_ID`),
KEY `srcesink` (`SRCE`,`SINK`)
)
Brings back memories. With a database, a "Full Table Scan" means that there is nothing that the database can use to speed up the query, it reads the entire table. The rows are stored in a non-sorted order, so there is no better way to "search" for the employee id you are looking for.
This is bad. Why?
If you have a table with a bunch of columns:
first_name, last_name, employee_id, ..., column50 and do a search where employee_id = 1234, if you don't have an index on the employee_id column, you're doing a sequential scan. Even worse if you're doing a join table2 on table1.employee_id = table2.eid, because it has to match the employee_id to every record in the join table.
If you create an index, you greatly reduce the scan time to find the matches (or throw away the non-matches) because instead of doing a sequential scan you can search a sorted field. Much faster.
When you create an index on the employee_id field, you are creating a way to search for employee numbers that is much, much, much faster. When you create an index, you are saying "I am going to join based on this field or have a where clause based on this field". This speeds up your query at the cost of a little bit of disk space.
There are all kinds of tricks with indexes, you can create them so they are unique, not unique, composite (contain multiple columns) and all kinds of stuff. Post your query and we can tell you what you might look at indexing to speed this up.
A good rule of thumb is that you should create an index on your tables on fields that you use in a where clause, join criteria or order by. Picking the field depends on a few things that are beyond the scope of this discussion, but that should be a start.
The pattern FROM ( SELECT... ) JOIN ( SELECT... ) ON ... does not optimize well. See if you can go directly from one of the tables, not hide it in a subquery.
CRR_CONGCALC needs INDEX(CRR_DT). (Please provide SHOW CREATE TABLE.)
CRR_INVENTORY needs INDEX(COPTION, START_DT).
Please make those changes, then come back for more advice, if needed.
According to your explain diagram, there are full table scans happening at each sub-query on CRR_CONGCALC and CRR_INVENTORY. Then when you join the sub-queries together, another full table scan, and finally, when the result set is ordered, one more full table scan.
A few tips to improve performance:
Use fields that are indexed as part of your join statement, where clause, group by clause & order by clause. If this query is used often, consider adding indexes to all relevant columns.
Avoid nested sub-queries with aggregate operations in joins as much as possible. The result sets returned by the sub-queries are not indexed, so joining on them ends up scanning the whole result set rather than just an index. The joins in this query could also cause weird and hard-to-detect fanning-out issues, but that isn't the performance issue you're seeking a solution for.
Filter the result set as early as possible (i.e. in all the sub-queries at the innermost layer) to minimize the number of rows the database server has to subsequently deal with.
Unless the final ORDER BY is necessary, avoid it.
Use temporary (or materialized) tables to de-nest subqueries. On these tables you can add indexes, so further joining will be efficient. This assumes that you have the permissions to create and drop tables on the server.
That said,
Here's how I would refactor your query.
In generating the inner query b, the GROUP BY clause does not contain all of the non-aggregate columns. This is non-standard SQL, which can lead to indeterminate results. MySQL allows it, and for the love of god I don't know why. It is best to avoid this trap.
The final wrapping query is unnecessary, as the where clause and group by clause can be applied to the unwrapped query.
This where clause seems fishy to me:
c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7)
START_DT & END_DT are datetime or timestamp columns being implicitly cast as char. It would be better to extract the year-month using the function DATE_FORMAT as:
DATE_FORMAT(<FIELD>, '%Y-%m-01')
Even if the where clause you used worked, it would omit records for which END_DT and CRR_DT fall in the same month. I'm not sure if that is the desired behaviour, but here's a query to illustrate what your boolean expression would evaluate to:
SELECT CAST('2015-07-05' AS DATETIME) between '2015-07' and '2015-07';
-- This query returns 0 == False.
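For contrast, once both bounds are real first-of-month dates spanning the month, the comparison behaves as intended:

```sql
-- '2015-07-05' falls inside [2015-07-01, 2015-08-01].
SELECT CAST('2015-07-05' AS DATETIME) BETWEEN '2015-07-01' AND '2015-08-01';
-- This query returns 1 == True.
```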
Using CREATE TABLE ... AS SELECT syntax, first de-nest the subqueries. Note: as I don't know the data, I'm not sure which indexes need to be unique. You can drop the tables once the result is consumed.
Table 1:
CREATE TABLE sub_a (KEY(CRR_DT), KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT CRR_DT,
TOU,
SRCE,
SINK,
NAME,
SEASON,
SRCESUMCONG,
SINKSUMCONG,
SRCEAVGCONG,
SINKAVGCONG,
SUMSINKMSRCE,
AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01';  -- CRR_DT is a varchar(7) 'YYYY-MM' value
Table 2:
CREATE TABLE sub_b (KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT MARKET,
TERM,
TOU,
SRCE,
SINK,
NAME,
SCID,
CTYPE,
START_DT,
END_DT,
SUM(MW) AS MW_AWARD,
SUBSTR(MARKET,1,3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL','LDES')
GROUP BY MARKET, TERM, TOU,
SRCE, SINK, NAME, SCID,
CTYPE, START_DT, END_DT, MARK
-- note the two added columns in the groupby clause.
After this, the final query would be simply:
SELECT a.CRR_DT,
a.TOU,
a.SRCE,
a.SINK,
a.NAME,
a.SEASON,
a.SRCESUMCONG,
a.SINKSUMCONG,
a.SRCEAVGCONG,
a.SINKAVGCONG,
a.SUMSINKMSRCE,
a.AVGSINKMSRCE,
a.HOURCOUNT,
b.TERM,
b.CTYPE,
b.START_DT,
b.END_DT,
b.MW_AWARD,
b.SCID,
b.Mark
FROM sub_a a
JOIN sub_b b ON a.NAME = b.NAME AND a.TOU = b.TOU
WHERE a.CRR_DT BETWEEN DATE_FORMAT(b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(b.END_DT,'%Y-%m-01')
ORDER BY NAME,
CRR_DT,
TOU;
The above where clause follows the same logic used in your query, except, it's not trying to cast to string. However, this WHERE clause may be more appropriate,
WHERE sub_a.CRR_DT BETWEEN DATE_FORMAT(sub_b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(DATE_ADD(sub_b.END_DT, INTERVAL 1 MONTH),'%Y-%m-01')
Finally, both sub_a and sub_b have the fields SRCE and SINK. Would the result change if you added them to the join? That could further optimize the query's (at this point, it's fair to say queries') processing time.
By doing the above, we hopefully avoid two full table scans, but I don't have your data set, so I'm only making an educated guess here.
If it's possible to express this logic without intermediary tables, directly via joins to the actual underlying tables CRR_CONGCALC and CRR_INVENTORY, that would be even faster.
Full table scans operations are not always bad, or necessarily evil. Sometimes, a full scan is the most efficient way to satisfy a query. For example, the query SELECT * FROM mytable requires MySQL to return every row in the table and every column in each row. And in this case, using an index would just make more work. It's faster just to do a full scan.
On the other hand, if you're retrieving a couple of rows out of a million, an access plan using a suitable index is very likely to be much faster than a full table scan. Effective use of a index can eliminate vast swaths of rows that would otherwise need to be checked; the index basically tells MySQL that the rows we're looking for cannot be in 99% of the blocks in the table, so those blocks don't need to be checked.
MySQL processes views (including inline views) differently than other databases. MySQL uses the term derived table for an inline view. In your query, a, b and c are all derived tables. MySQL runs the query to return the rows, and then materializes the view into a table. Once that is completed, the outer query can run against the derived table. As of MySQL 5.5 (and I believe 5.6), inline views are always materialized as derived tables, and that's a performance killer for large sets. (Some performance improvements arrive in newer versions of MySQL, including automatic keys on materialized derived tables.)
Also, predicates in the outer query do not get pushed down into the view query. That is, if we run a query like this:
SELECT t.foo
FROM mytable t
WHERE t.foo = 'bar'
MySQL can make use of an index with a leading column of foo to efficiently locate the rows, even if mytable contains millions of rows. But if we write the query like this:
SELECT t.foo
FROM (SELECT * FROM mytable) t
WHERE t.foo = 'bar'
We're essentially forcing MySQL to make a copy of mytable, running the inline view query, to populate a derived table, containing all rows from mytable. And once that operation is complete, the outer query can run. But now, there's no index on the foo column in the derived table. So we're forcing MySQL to do a full scan of the derived table, to look at every row.
If we need an inline view, then relocating the predicate to the inline view query will result in a much smaller derived table.
SELECT t.foo
FROM (SELECT * FROM mytable WHERE foo = 'bar') t
With that, MySQL can make use of the index on foo to quickly locate the rows, and only those rows are materialized into the derived table. The full scan of the derived table isn't as painful now, because the outer query needs to return every row. In this example, it would also be much better to replace that * (representing every column) with just the columns we need to return.
The resultset you specify could be returned without the unnecessary inline views. A query something like this:
SELECT c.crr_dt
, c.tou
, c.srce
, c.sink
, c.name
, c.season
, c.srcesumcong
, c.sinksumcong
, c.srceavgcong
, c.sinkavgcong
, c.sumsinkmsrce
, c.avgsinkmsrce
, c.hourcount
, b.term
, b.start_dt
, b.end_dt
, b.ctype
, b.mw AS mw_award
, b.scid
, b.mark
FROM CRR_CONGCALC c
JOIN ( SELECT i.market
, i.term
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
, SUM(i.mw) AS mw
, SUBSTR(i.market, 1, 3) AS mark
FROM CRR_INVENTORY i
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
GROUP
BY i.market
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
) b
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
ORDER
BY c.name
, c.crr_dt
, c.tou
NOTES: If start_dt and end_dt are defined as DATE, DATETIME or TIMESTAMP columns, then I'd prefer to write the predicate like this:
AND c.crr_dt BETWEEN DATE_FORMAT(b.start_dt,'%Y-%m') AND DATE_FORMAT(b.end_dt,'%Y-%m')
(I don't think there's any performance to be gained there; that just makes it more clear what we're doing.)
In terms of improving performance of that query...
If we're returning a small subset of rows from CRR_INVENTORY, based on the predicates:
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
Then MySQL would likely be able to make effective use of an index with leading columns of (coption,scid,start_dt). That's assuming that this represents a relatively small subset of rows from the table. If those predicates are not very selective, if we're really getting 50% or 90% of the rows in the table, the index is likely going to be much less effective.
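Such an index could be added like this (the index name is made up):

```sql
-- Equality columns (coption, scid) first, the range column start_dt last.
ALTER TABLE CRR_INVENTORY
  ADD INDEX ix_coption_scid_start (coption, scid, start_dt);
```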
We might be able to get MySQL to make use of an index to satisfy the GROUP BY clause, without requiring a sort operation. To get that, we'd need an index with leading columns that match the columns listed in the GROUP BY clause.
The derived table isn't going to have an index on it, so for the best performance of the join operation, once the derived table b is materialized, we want a suitable index on the other table, CRR_CONGCALC. We want the leading columns of that index to be usable for the lookup of the matching rows, given the predicates:
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
So, we want an index with leading columns of (name, tou, crr_dt) to be able to efficiently locate the matching rows.
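Again as a sketch (index name made up):

```sql
-- Join equality columns (name, tou) first, then the range column crr_dt.
ALTER TABLE CRR_CONGCALC
  ADD INDEX ix_name_tou_crrdt (name, tou, crr_dt);
```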

MySQL, return all measurements and results within X last hours

This question is very much related to my previous question, MySQL, return all results within X last hours, although with an additional significant constraint:
Now I have 2 tables, one for measurements and one for classified results for a subset of the measurements.
Measurements arrive constantly, so results are also constantly added, after classification of the new measurements.
Results will not necessarily be stored in the same order as the measurements' arrival and storage order!
I am interested in presenting only the latest results. By latest I mean: take the max time of the last available result (the time is part of the measurement structure), call it Y, and a range of X seconds, and present the measurements together with the available results in the range between Y and Y-X.
The following are the structure of 2 tables:
event table:
CREATE TABLE `event_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`Feature` char(256) NOT NULL,
`UnixTimeStamp` int(10) unsigned NOT NULL,
`Value` double NOT NULL,
KEY `ix_filter` (`Feature`),
KEY `ix_time` (`UnixTimeStamp`),
KEY `id_index` (`id`)
) ENGINE=MyISAM
classified results table:
CREATE TABLE `event_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`level` enum('NORMAL','SUSPICIOUS') DEFAULT NULL,
`score` double DEFAULT NULL,
`eventId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `eventId_index` (`eventId`)
) ENGINE=MyISAM
I can't query for the last measurement's timestamp first, since I want to present measurements for which there are currently results, and because measurements arrive constantly, results may not yet be available for them.
Therefore I thought of joining the two tables using
event_results.eventId=event_data.id and then selecting the max time of event_data.UnixTimeStamp as maxTime. After I have maxTime, I need to do the same operation again (joining the 2 tables) and adding a condition in the WHERE clause:
WHERE event_data.UnixTimeStamp >= maxTime + INTERVAL -X SECOND
It seems inefficient to execute 2 joins only to achieve this. Is there a more efficient way?
From my understanding, you are using an aggregate function, MAX. This produces a record set of size one (the highest time), which you then filter against. Therefore it needs to be broken out into a subquery (as you say, a nested select). You HAVE to do 2 queries at some point. (The answer to your last question has 2 queries in it, in the form of subqueries/nested selects.)
The main time subqueries cause problems is when you put the subquery in the SELECT part of the query, because it is then performed once for each row, which makes the query run dramatically slower as the result set grows. Let's take the answer to your last question and write it in a horrible, inefficient way:
SELECT timeStart,
(SELECT MAX(timeStart) FROM events) AS maxTime
FROM events
WHERE timeStart > ((SELECT MAX(timeStart) FROM events) + INTERVAL -1 SECOND)
This performs a select query for the max event time once for every row. It should produce the same result, but it is slow. This is where the fear of subqueries comes from.
It also performs the aggregate function MAX on each row, even though it returns the same answer each time. So, perform that subquery ONCE rather than on each row.
However, in the answer to your last question, the MAX subquery is run once, and its result is used to filter in the WHERE clause of the outer select, which is also run once. So, in total, 2 queries are run.
Two fast queries run one after the other are quicker than a single query that is super slow.
I'm not entirely sure what resultset you want returned, so I am going to make some assumptions. Please feel free to correct any assumptions I've made.
It sounds (to me) like you want ALL rows from event_data that are within an hour (or however many seconds) of the absolute "latest" timestamp, and along with those rows, you also want to return any related rows from event_results, if any matching rows are available.
If that's the case, then using an inline view to retrieve the maximum value of timestamp is the way to go. (That operation will be very efficient, since the query will be returning a single row, and it can be efficiently retrieved from an existing index.)
Since you want all rows from a specified period of time (from the "latest time" back to "latest time minus X seconds"), we can go ahead and calculate the starting timestamp of the period in that same query. Here we assume you want to "go back" one hour (=60*60 seconds):
SELECT MAX(UnixTimeStamp) - 3600 FROM event_data
NOTE: the expression in the SELECT list above is based on UnixTimeStamp column defined as integer type, rather than as a DATETIME or TIMESTAMP datatype. If the column were defined as DATETIME or TIMESTAMP datatype, we would likely express that with something like this:
SELECT MAX(mydatetime) + INTERVAL -3600 SECOND
(We could specify the interval units in minutes, hours, etc.)
We can use the result from that query in another query. To do that in the same query text, we simply wrap that query in parentheses, and reference it as a rowsource, as if that query were an actual table. This allows us to get all the rows from event_data that are within in the specified time period, like this:
SELECT d.id
, d.Feature
, d.UnixTimeStamp
, d.Value
FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
FROM event_data l
) m
JOIN event_data d
ON d.UnixTimeStamp >= m.from_unixtimestamp
In this particular case, there's no need for an upper bound predicate on UnixTimeStamp column in the outer query. This is because we already know there are no values of UnixTimeStamp that are greater than the MAX(UnixTimeStamp), which is the upper bound of the period we are interested in.
(We could add an expression to the SELECT list of the inline view, to return MAX(l.UnixTimeStamp) AS to_unixtimestamp, and then include a predicate like AND d.UnixTimeStamp <= m.to_unixtimestamp in the outer query, but that would be unnecessarily redundant.)
You also specified a requirement to return information from the event_results table.
I believe you said that you wanted any related rows that are "available". This suggests (to me) that if no matching row is "available" from event_results, you still want to return the row from the event_data table.
We can use a LEFT JOIN operation to get that to happen:
SELECT d.id
, d.Feature
, d.UnixTimeStamp
, d.Value
, r.id
, r.level
, r.score
, r.eventId
FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
FROM event_data l
) m
JOIN event_data d
ON d.UnixTimeStamp >= m.from_unixtimestamp
LEFT
JOIN event_results r
ON r.eventId = d.id
Since there is no unique constraint on the eventID column in the event_results table, there is a possibility that more than one "matching" row from event_results will be found. Whenever that happens, the row from event_data table will be repeated, once for each matching row from event_results.
If there is no matching row from event_results, then the row from event_data will still be returned, but with the columns from the event_results table set to NULL.
For performance, remove any columns from the SELECT list that you don't need returned, and be judicious in your choice of expressions in an ORDER BY clause. (The addition of a covering index may improve performance.)
For the statement as written above, MySQL is likely to use the ix_time index on the event_data table, and the eventId_index index on the event_results table.

1k entries query with multiple JOIN's takes up to 10 seconds

Here's a simplified version of the structure (left out some regular varchar cols):
CREATE TABLE `car` (
`reg_plate` varchar(16) NOT NULL default '',
`type` text NOT NULL,
`client` int(11) default NULL,
PRIMARY KEY (`reg_plate`)
)
And here's the query I'm trying to run:
SELECT * FROM (
SELECT
car.*,
tire.id as tire,
client.name as client_name
FROM
car
LEFT JOIN client ON car.client = client.id
LEFT JOIN tire ON tire.reg_plate = reg_plate
GROUP BY car.reg_plate
) t1
The nested query is necessary due to the framework sometimes adding WHERE / SORT clauses (which assume there are columns named client_name or tire).
Both the car and the tire tables have approx. 1.5K entries. client has no more than 500, and for some reason it still takes up to 10 seconds to complete (worse, the framework runs it twice: first to check how many rows there are, then to actually limit to the requested page).
I'm getting a feeling that this query is very inefficient, I just don't know how to optimize it.
Thanks in advance.
First, read up on MySQL's EXPLAIN syntax.
You probably need indexes on every column in the join clauses, and on every column that your framework uses in WHERE and SORT clauses. Sometimes multi-column indexes are better than single-column indexes.
Your framework probably doesn't require nested queries. Unnesting and creating a view or passing parameters to a stored procedure might give you better performance.
For better suggestions on SO, always include DDL and sample data (as INSERT statements) in your questions. You should probably include EXPLAIN output on performance questions, too.
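As a hedged sketch of the indexes the join columns in this query would want (the tire and client layouts aren't shown, so the column names here are inferred from the query):

```sql
-- car.client joins to client.id; tire.reg_plate joins to car.reg_plate.
ALTER TABLE car  ADD INDEX ix_car_client (client);
ALTER TABLE tire ADD INDEX ix_tire_reg_plate (reg_plate);
-- client.id is presumably already the PRIMARY KEY; if not, index it too.
```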