I have a very slow query in MySQL Server. Here is the query:
SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT, TERM, START_DT, END_DT, CTYPE, MW AS MW_AWARD,
Mark, SCID
FROM
( SELECT a.CRR_DT, a.TOU, a.SRCE, a.SINK, a.NAME, a.SEASON, a.SRCESUMCONG,
a.SINKSUMCONG, a.SRCEAVGCONG, a.SINKAVGCONG, a.SUMSINKMSRCE,
a.AVGSINKMSRCE, a.HOURCOUNT, b.TERM, b.CTYPE, b.START_DT,
b.END_DT, b.MW, b.SCID, b.Mark
FROM
( SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01'
) a
INNER JOIN
( SELECT MARKET, TERM, TOU, SRCE, SINK, NAME, SCID, CTYPE, START_DT,
END_DT, SUM(MW) AS MW, SUBSTR(MARKET, 1, 3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL' , 'LDES')
GROUP BY MARKET , TOU , SRCE , SINK , NAME , SCID , CTYPE ,
START_DT , END_DT
) b ON a.NAME = b.NAME
AND a.TOU = b.TOU
) c
WHERE c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7 )
ORDER BY NAME , CRR_DT , TOU ASC
Here is the EXPLAIN plan generated with MySQL Workbench.
I guess the red blocks are the dangerous parts. Can someone please help me understand this plan? A few hints on what I should check once I have this execution plan would help.
Edit: adding the table layouts.
CREATE TABLE `CRR_CONGCALC` (
`CRR_DT` varchar(7) NOT NULL,
`TOU` varchar(50) NOT NULL,
`SRCE` varchar(50) NOT NULL,
`SINK` varchar(50) NOT NULL,
`SRCESUMCONG` decimal(12,6) DEFAULT NULL,
`SINKSUMCONG` decimal(12,6) DEFAULT NULL,
`SRCEAVGCONG` decimal(12,6) DEFAULT NULL,
`SINKAVGCONG` decimal(12,6) DEFAULT NULL,
`SUMSINKMSRCE` decimal(12,6) DEFAULT NULL,
`AVGSINKMSRCE` decimal(12,6) DEFAULT NULL,
`HOURCOUNT` int(11) NOT NULL DEFAULT '0',
`SEASON` char(1) NOT NULL DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`CRR_DT`,`SRCE`,`SINK`,`TOU`,`HOURCOUNT`),
KEY `srce_index` (`SRCE`),
KEY `srcesink` (`SRCE`,`SINK`)
)
CREATE TABLE `CRR_INVENTORY` (
`MARKET` varchar(50) NOT NULL,
`TERM` varchar(50) NOT NULL,
`TOU` varchar(50) NOT NULL,
`INVENTORY_DT` date NOT NULL,
`START_DT` datetime NOT NULL,
`END_DT` datetime NOT NULL,
`CRR_ID` varchar(50) NOT NULL,
`NSR_INDEX` tinyint(1) NOT NULL,
`SEGMENT` tinyint(1) NOT NULL,
`CTYPE` varchar(50) NOT NULL,
`CATEGORY` varchar(50) NOT NULL,
`COPTION` varchar(50) NOT NULL,
`SRCE` varchar(50) DEFAULT NULL,
`SINK` varchar(50) DEFAULT NULL,
`MW` decimal(8,4) NOT NULL,
`SCID` varchar(50) NOT NULL,
`SEASON` char(1) DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`MARKET`,`INVENTORY_DT`,`CRR_ID`),
KEY `srcesink` (`SRCE`,`SINK`)
)
Brings back memories. With a database, a "Full Table Scan" means that there is nothing the database can use to speed up the query; it reads the entire table. The rows are stored in no particular order, so there is no better way to "search" for the employee id you are looking for.
This is bad. Why?
If you have a table with a bunch of columns:
first_name, last_name, employee_id, ..., column50, and you do a search WHERE employee_id = 1234 without an index on the employee_id column, you're doing a sequential scan. It's even worse if you're doing a join (table2 ON table1.employee_id = table2.eid), because the employee_id has to be matched against every record in the join table.
If you create an index, you greatly reduce the scan time to find the matches (or throw away the non-matches) because instead of doing a sequential scan you can search a sorted field. Much faster.
When you create an index on the employee_id field, you are creating a way to search for employee numbers that is much, much, much faster. When you create an index, you are saying "I am going to join based on this field or have a where clause based on this field". This speeds up your query at the cost of a little bit of disk space.
There are all kinds of tricks with indexes, you can create them so they are unique, not unique, composite (contain multiple columns) and all kinds of stuff. Post your query and we can tell you what you might look at indexing to speed this up.
A good rule of thumb is that you should create an index on your tables on fields that you use in a where clause, join criteria or order by. Picking the field depends on a few things that are beyond the scope of this discussion, but that should be a start.
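For example, a minimal sketch using the hypothetical employee tables above (the table and index names here are made up purely for illustration):
-- Hypothetical tables; shown only to illustrate the rule of thumb above
CREATE INDEX idx_employees_employee_id ON employees (employee_id);
CREATE INDEX idx_timesheets_eid ON timesheets (eid);
-- With those in place, both the WHERE lookup and the join condition
-- can use an index seek instead of a sequential scan:
SELECT e.first_name, e.last_name
FROM employees e
JOIN timesheets t ON t.eid = e.employee_id
WHERE e.employee_id = 1234;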
The pattern FROM ( SELECT... ) JOIN ( SELECT... ) ON ... does not optimize well. See if you can go directly from one of the tables, not hide it in a subquery.
CRR_CONGCALC needs INDEX(CRR_DT). (Please provide SHOW CREATE TABLE.)
CRR_INVENTORY needs INDEX(COPTION, START_DT).
Please make those changes, then come back for more advice, if needed.
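In MySQL syntax, those two suggestions would look something like this (index names left to the defaults):
ALTER TABLE CRR_CONGCALC ADD INDEX (CRR_DT);
-- Note: per the table layout added above, CRR_DT is already the leading column
-- of the PRIMARY KEY, so this first index may turn out to be redundant.
ALTER TABLE CRR_INVENTORY ADD INDEX (COPTION, START_DT);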
According to your explain diagram, there are full table scans happening at each sub-query on CRR_CONGCALC and CRR_INVENTORY. Then when you join the sub-queries together, another full table scan, and finally, when the result set is ordered, one more full table scan.
A few tips to improve performance:
Use fields that are indexed as part of your join statement, where clause, group by clause & order by clause. If this query is used often, consider adding indexes to all relevant columns.
Avoid nested sub-queries with aggregate operations in joins as much as possible. The result sets returned by the sub-queries are not indexed, so joining on them will end up scanning the whole result rather than just an index. The joins in this query could also result in weird and hard-to-detect fanning-out issues, but that isn't the performance issue you're seeking a solution for.
Filter the result set as early as possible (i.e., in all the sub-queries at the innermost layer) to minimize the number of rows the database server has to subsequently deal with.
Unless the final order by is necessary, avoid it.
Use temporary (or materialized) tables to de-nest subqueries. On these tables, you can add indexes, so further joining will be efficient. This assumes that you have the permissions to create and drop tables on the server.
That said, here's how I would refactor your query.
In generating the inner query b, the GROUP BY clause does not contain all of the non-aggregated columns. This is non-standard SQL, and it leads to malformed data. MySQL allows it, and for the love of god I don't know why. It is best to avoid this trap.
The final wrapping query is unnecessary, as the where clause and group by clause can be applied to the unwrapped query.
This where clause seems fishy to me:
c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7)
START_DT & END_DT are datetime or timestamp columns being implicitly cast as char. It would be better to extract the year-month explicitly with DATE_FORMAT:
DATE_FORMAT(<FIELD>, '%Y-%m')
Even if the where clause you used worked, it would omit records for which END_DT and CRR_DT fall in the same month. I'm not sure if that is the desired behaviour, but here's a query to illustrate what your boolean expression would evaluate to:
SELECT CAST('2015-07-05' AS DATETIME) between '2015-07' and '2015-07';
-- This query returns 0 == False.
Using CREATE TABLE ... AS SELECT syntax, first de-nest the sub-queries. Note: as I don't know the data, I'm not sure which indexes need to be unique. You can drop the tables once the result is consumed.
Table 1:
CREATE TABLE sub_a (KEY(CRR_DT), KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT CRR_DT,
TOU,
SRCE,
SINK,
NAME,
SEASON,
SRCESUMCONG,
SINKSUMCONG,
SRCEAVGCONG,
SINKAVGCONG,
SUMSINKMSRCE,
AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01';
Table 2:
CREATE TABLE sub_b (KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT MARKET,
TERM,
TOU,
SRCE,
SINK,
NAME,
SCID,
CTYPE,
START_DT,
END_DT,
SUM(MW) AS MW_AWARD,
SUBSTR(MARKET,1,3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL','LDES')
GROUP BY MARKET, TERM, TOU,
SRCE, SINK, NAME, SCID,
         CTYPE, START_DT, END_DT, MARK;
-- note the two added columns (TERM and MARK) in the GROUP BY clause.
After this, the final query would be simply:
SELECT a.CRR_DT,
a.TOU,
a.SRCE,
a.SINK,
a.NAME,
a.SEASON,
a.SRCESUMCONG,
a.SINKSUMCONG,
a.SRCEAVGCONG,
a.SINKAVGCONG,
a.SUMSINKMSRCE,
a.AVGSINKMSRCE,
a.HOURCOUNT,
b.TERM,
b.CTYPE,
b.START_DT,
b.END_DT,
b.MW_AWARD,
b.SCID,
b.Mark
FROM sub_a a
JOIN sub_b b ON a.NAME = b.NAME AND a.TOU = b.TOU
WHERE a.CRR_DT BETWEEN DATE_FORMAT(b.START_DT,'%Y-%m')
                   AND DATE_FORMAT(b.END_DT,'%Y-%m')
ORDER BY NAME,
CRR_DT,
TOU;
The above WHERE clause follows the same logic used in your query, except it uses an explicit DATE_FORMAT rather than relying on an implicit cast to string. However, this WHERE clause may be more appropriate:
WHERE sub_a.CRR_DT BETWEEN DATE_FORMAT(sub_b.START_DT,'%Y-%m')
                       AND DATE_FORMAT(DATE_ADD(sub_b.END_DT, INTERVAL 1 MONTH),'%Y-%m')
Finally, both sub_a and sub_b seem to have the fields SRCE and SINK. Would the result change if you added them to the join? That could further optimize the query's (at this point, it's fair to say queries') processing time.
By doing the above, we hopefully avoid two full table scans, but I don't have your data set, so I'm only making an educated guess here.
If it's possible to express this logic without using intermediary tables, and directly via joins to the actual underlying tables CRR_CONGCALC and CRR_INVENTORY, that would be even faster.
Full table scans operations are not always bad, or necessarily evil. Sometimes, a full scan is the most efficient way to satisfy a query. For example, the query SELECT * FROM mytable requires MySQL to return every row in the table and every column in each row. And in this case, using an index would just make more work. It's faster just to do a full scan.
On the other hand, if you're retrieving a couple of rows out of a million, an access plan using a suitable index is very likely to be much faster than a full table scan. Effective use of a index can eliminate vast swaths of rows that would otherwise need to be checked; the index basically tells MySQL that the rows we're looking for cannot be in 99% of the blocks in the table, so those blocks don't need to be checked.
MySQL processes views (including inline views) differently than other databases. MySQL uses the term derived table for an inline view. In your query a, b and c are all derived tables. MySQL runs the query to return the rows, and then materializes the view into a table. Once that is completed, the outer query can run against the derived table. But as of MySQL 5.5 (and I think 5.6), inline views are always materialized as derived tables. And that's a performance killer for large sets. (Some performance improvements are coming in newer versions of MySQL, some automatic indexing.)
Also, predicates in the outer query do not get pushed down into the view query. That is, if we run a query like this:
SELECT t.foo
FROM mytable t
WHERE t.foo = 'bar'
MySQL can make use of an index with a leading column of foo to efficiently locate the rows, even if mytable contains millions of rows. But if we write the query like this:
SELECT t.foo
FROM (SELECT * FROM mytable) t
WHERE t.foo = 'bar'
We're essentially forcing MySQL to make a copy of mytable, running the inline view query, to populate a derived table, containing all rows from mytable. And once that operation is complete, the outer query can run. But now, there's no index on the foo column in the derived table. So we're forcing MySQL to do a full scan of the derived table, to look at every row.
If we need an inline view, then relocating the predicate to the inline view query will result in a much smaller derived table.
SELECT t.foo
FROM (SELECT * FROM mytable WHERE foo = 'bar') t
With that, MySQL can make use of the index on foo to quickly locate the rows, and only those rows are materialized into the derived table. The full scan of the derived table isn't as painful now, because the outer query needs to return every row. In this example, it would also be much better to replace that * (representing every column) with just the columns we need to return.
The resultset you specify could be returned without the unnecessary inline views. A query something like this:
SELECT c.crr_dt
, c.tou
, c.srce
, c.sink
, c.name
, c.season
, c.srcesumcong
, c.sinksumcong
, c.srceavgcong
, c.sinkavgcong
, c.sumsinkmsrce
, c.avgsinkmsrce
, c.hourcount
, b.term
, b.start_dt
, b.end_dt
, b.ctype
, b.mw AS mw_award
, b.scid
, b.mark
FROM CRR_CONGCALC c
JOIN ( SELECT i.market
, i.term
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
, SUM(i.mw) AS mw
, SUBSTR(i.market, 1, 3) AS mark
FROM CRR_INVENTORY i
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
GROUP
BY i.market
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
) b
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
ORDER
BY c.name
, c.crr_dt
, c.tou
NOTES: If start_dt and end_dt are defined as DATE, DATETIME or TIMESTAMP columns, then I'd prefer to write the predicate like this:
AND c.crr_dt BETWEEN DATE_FORMAT(b.start_dt,'%Y-%m') AND DATE_FORMAT(b.end_dt,'%Y-%m')
(I don't think there's any performance to be gained there; that just makes it more clear what we're doing.)
In terms of improving performance of that query...
If we're returning a small subset of rows from CRR_INVENTORY, based on the predicates:
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
Then MySQL would likely be able to make effective use of an index with leading columns of (coption, scid, start_dt). That's assuming that this represents a relatively small subset of rows from the table. If those predicates are not very selective, if we're really getting 50% or 90% of the rows in the table, the index is likely going to be much less effective.
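For example (a sketch; the index name is arbitrary):
ALTER TABLE CRR_INVENTORY
  ADD INDEX CRR_INVENTORY_ix1 (COPTION, SCID, START_DT);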
We might be able to get MySQL to make use of an index to satisfy the GROUP BY clause, without requiring a sort operation. To get that, we'd need an index with leading columns that match the columns listed in the GROUP BY clause.
The derived table isn't going to have an index on it, so for best performance of the join operation, once the derived table b is materialized, we are going to want a suitable index on the other table, CRR_CONGCALC. We want the leading columns of that index to be usable for the lookup of the matching rows, given the predicates:
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
So, we want an index with leading columns of (name, tou, crr_dt) to be able to efficiently locate the matching rows.
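Again as a sketch, with an arbitrary index name:
ALTER TABLE CRR_CONGCALC
  ADD INDEX CRR_CONGCALC_ix1 (NAME, TOU, CRR_DT);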
Related
So I have the following code
CREATE TABLE dispenser(
id_disp int not null auto_increment,
location_disp varchar(20) not null,
full_capacity int not null,
primary key (id_disp)
);
CREATE TABLE records(
time_stamp DATETIME DEFAULT CURRENT_TIMESTAMP not null,
id_dispenser int not null,
nr_pumps int not null,
primary key (id_dispenser,time_stamp)
);
CREATE VIEW left_capacity AS
SELECT
max(time_stamp) AS 'calendar',
id_dispenser AS 'dispenser',
full_capacity AS 'capacity',
(full_capacity-(nr_pumps*3)) AS 'available'
FROM records r, dispenser d
WHERE r.id_dispenser=d.id_disp
GROUP by id_dispenser
ORDER by id_dispenser desc;
and I have a problem in my view with the expression (full_capacity-(nr_pumps*3)): if nr_pumps*3 is bigger than full_capacity, I will get a negative available value. Is there a way I can make the value 0 if the calculation result is negative?
You can use greatest(). It returns the greatest of its arguments.
CREATE VIEW left_capacity
AS
SELECT max(time_stamp) AS calendar,
id_dispenser AS dispenser,
full_capacity AS capacity,
greatest(full_capacity - nr_pumps * 3, 0) AS available
FROM records r
INNER JOIN dispenser d
ON d.id_disp = r.id_dispenser
GROUP by id_dispenser;
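As a quick illustration of the clamping (made-up numbers):
SELECT GREATEST(200 - 70 * 3, 0);  -- 200 - 210 = -10, clamped to 0
SELECT GREATEST(200 - 30 * 3, 0);  -- 200 - 90 = 110, returned as-is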
Some additional notes:
Do not enclose identifiers, such as column aliases, in single quotes. Single quotes in SQL are meant for string or date literals. MySQL might accept aliases in single quotes, but other DBMS do not. So don't even start getting used to it.
It is recommended to use the explicit JOIN syntax over the old comma separated list in the FROM clause. The former is easier to write without errors, to read and to understand.
Your use of GROUP BY makes the query malformed. Not every column in the select list is either an argument to an aggregate function, in the GROUP BY clause, or functionally dependent on a column in the GROUP BY clause. Sadly, older versions of MySQL, or badly configured ones, won't enforce that rule. But the results can be funny, so prepare to be in for a surprise one day.
I don't know if MySQL is an exception here, but in general an ORDER BY in a view is pointless (unless LIMIT is involved too). Unless the outermost query explicitly defines an order with an ORDER BY, SQL engines are generally allowed to return results in any order, and will do so if deemed beneficial. So the ORDER BY in a view might get optimized away.
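For what it's worth, here is one way the view could be rewritten so that the GROUP BY issue disappears entirely. This is only a sketch, and it assumes the intent is "the latest record per dispenser", which the original GROUP BY only delivers by accident:
CREATE VIEW left_capacity AS
SELECT r.time_stamp AS calendar,
       r.id_dispenser AS dispenser,
       d.full_capacity AS capacity,
       GREATEST(d.full_capacity - r.nr_pumps * 3, 0) AS available
FROM records r
INNER JOIN dispenser d
        ON d.id_disp = r.id_dispenser
WHERE r.time_stamp = (SELECT MAX(r2.time_stamp)
                      FROM records r2
                      WHERE r2.id_dispenser = r.id_dispenser);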
I've got a report in which I need to show the month and year profit from my transactions. The query I wrote works, but it is very slow, and I cannot figure out how to change it so that it takes less time to load.
SELECT MONTH(MT4_TRADES.CLOSE_TIME) as MONTH
, YEAR(MT4_TRADES.CLOSE_TIME) as YEAR
, SUM(MT4_TRADES.SWAPS) as SWAPS
, SUM(MT4_TRADES.VOLUME)/100 as VOLUME
, SUM(MT4_TRADES.PROFIT) AS PROFIT
FROM MT4_TRADES
JOIN MT4_USERS
ON MT4_TRADES.LOGIN = MT4_USERS.LOGIN
WHERE MT4_TRADES.CMD < 2
AND MT4_TRADES.CLOSE_TIME <> "1970-01-01 00:00:00"
AND MT4_USERS.AGENT_ACCOUNT <> "1"
GROUP
BY YEAR(MT4_TRADES.CLOSE_TIME)
, MONTH(MT4_TRADES.CLOSE_TIME)
ORDER
BY YEAR
This is the full query; any suggestions would be highly appreciated.
This is the result of explain:
Echoing the comment from @Barmar, look at the EXPLAIN output to see the query execution plan. Verify that suitable indexes are being used.
Likely the big rock in terms of performance is the "Using filesort" operation.
To get around that, we would need a suitable index available, and that would require some changes to the table. (The typical question on the "improve query performance" topic on SO comes with the restriction that we "can't add indexes or make any changes to the table".)
I'd be looking at a functional index (a feature added in MySQL 8.0); for MySQL 5.7, I'd be looking at adding generated columns and including those generated columns in a secondary index (a feature added in MySQL 5.7).
CREATE INDEX `MT4_TRADES_ix2` ON MT4_TRADES ((YEAR(close_time)),(MONTH(close_time)))
I'd be tempted to go with a covering index, and also change the grouping to a single expression e.g. DATE_FORMAT(close_time,'%Y-%m')
CREATE INDEX `MT4_TRADES_ix3` ON MT4_TRADES ((DATE_FORMAT(close_time,'%Y-%m'))
    ,swaps,volume,profit,login,cmd,close_time)
From the query, it looks like login is going to be UNIQUE in the MT4_USERS table; likely that's the PRIMARY KEY or a UNIQUE KEY, so an index is going to be available, but we're just guessing...
With suitable indexes available, we could so something like this:
SELECT DATE_FORMAT(close_time,'%Y-%m') AS close_year_mo
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.swaps ,NULL)) AS swaps
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.volume ,NULL))/100 AS volume
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.profit ,NULL)) AS profit
FROM MT4_TRADES t
JOIN MT4_USERS u
ON u.login = t.login
AND u.agent_account <> '1'
GROUP BY close_year_mo
ORDER BY close_year_mo
and we'd expect MySQL to do a loose index scan, with the EXPLAIN output showing "Using index for group-by" and not showing "Using filesort".
EDIT
For versions of MySQL before 5.7, we could create new columns, e.g. year_close and month_close, and populate them with the results of the expressions YEAR(close_time) and MONTH(close_time) (we could create BEFORE INSERT and BEFORE UPDATE triggers to handle that automatically for us; a sketch follows the example query below).
Then we could create index with those columns as the leading columns
CREATE INDEX ... ON MT4_TRADES ( year_close, month_close, ... )
And then reference the new columns in the query
SELECT t.year_close AS `YEAR`
, t.month_close AS `MONTH`
FROM MT4_TRADES t
JOIN ...
WHERE ...
GROUP
BY t.year_close
, t.month_close
Ideally include in the index all of referenced columns from MT4_TRADES, to make a covering index for the query.
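A rough sketch of that pre-5.7 approach (the column, trigger, and index names here are made up):
ALTER TABLE MT4_TRADES
  ADD COLUMN year_close SMALLINT NOT NULL DEFAULT 0,
  ADD COLUMN month_close TINYINT NOT NULL DEFAULT 0;

CREATE TRIGGER MT4_TRADES_bi BEFORE INSERT ON MT4_TRADES
FOR EACH ROW SET NEW.year_close = YEAR(NEW.close_time),
                 NEW.month_close = MONTH(NEW.close_time);

CREATE TRIGGER MT4_TRADES_bu BEFORE UPDATE ON MT4_TRADES
FOR EACH ROW SET NEW.year_close = YEAR(NEW.close_time),
                 NEW.month_close = MONTH(NEW.close_time);

-- Covering index: leading columns satisfy the GROUP BY, the rest cover the query
CREATE INDEX MT4_TRADES_ix4
  ON MT4_TRADES (year_close, month_close, cmd, login, close_time, swaps, volume, profit);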
I have a query which runs slowly, I've come up with a much faster alternative, but I'd like some help understanding why the original query is so slow.
A simplified version of my problem use two tables. A simplified version of the first table, called profiles, is
`profiles` (
`id` int(11),
`title` char(255),
`body` text,
`pin` int(11),
PRIMARY KEY (`id`),
UNIQUE KEY `pin` (`pin`)
)
The simplified version of my second table, calls, is
`calls` (
`id` int(11),
`pin` int(11),
`duration` int(11),
PRIMARY KEY (`id`),
KEY `ivr_id` (`pin`)
)
My query is supposed to get the full profiles, with the addition of the number of calls received by a profile. The query I was using, was
SELECT profiles.*, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls
ON profiles.pin = calls.pin
GROUP BY profiles.pin
With ~100 profiles and ~250,000 calls, this query takes about 10 seconds which is slow.
If I modify the query to just select the title from profiles, not all columns, the query is much faster. If I modify the query to remove the GROUP BY, it's also much faster. If I just select everything from the profiles table, then it's also a fast query.
My actual profile table has many more text and char fields. The query speed is worse the more text fields that are selected. Why are the text fields causing the query to be so slow, when they are not involved in the JOIN or the GROUP?
I came up with a slightly different query which is much faster, less than half a second. This query is:
SELECT profiles.*, temp.readings
FROM profiles
LEFT JOIN (
SELECT pin ,COUNT(*) AS readings
FROM calls
GROUP BY pin
) AS temp
ON temp.pin=profiles.pin
Whilst I think I've solved by speed problem, I'd like to understand what is causing the issue in the first query.
======== Update ========
I've just profiled both the queries and the entire speed difference is in the 'sending data' section. The slow query is about 10 seconds and the faster query is about 0.1 seconds
======== Update 2 ========
After discussing with @scaisEdge, I think I can rephrase my question. Consider a table T1 that has ~40 columns, of which 8 are of type TEXT, and ~100 rows, and a table T2 which has 5 columns of type INT and VARCHAR with ~250,000 rows. Why is it that:
SELECT T1.* FROM T1 is fast
SELECT T1.* FROM T1 JOIN T2 GROUP BY T1.joinfield is slow
SELECT T1.selectfield FROM T1 JOIN T2 GROUP BY T1.joinfield is fast if selectfield is an INT or VARCHAR
This should happen because:
The first query joins 100 profiles with 250,000 calls and only then reduces the returned rows with the GROUP BY. The SELECT profiles.* implies a full access to the profiles table data for each matching row.
The second query joins 100 profiles with the number of rows returned by the temp subquery (probably far fewer than 250,000), reducing the number of accesses to the profiles table data.
Instead of profiles.*, try accessing only the pin column:
SELECT profiles.pin, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls ON profiles.pin = calls.pin
GROUP BY profiles.pin
As a suggestion, you should take note that this use of GROUP BY in the first query is allowed only in MySQL versions earlier than 5.7, because selecting columns that are neither mentioned in the GROUP BY clause nor affected by an aggregate function is not allowed by default in later versions and produces an error.
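For example, a 5.7-friendly variant of the first query (only a sketch) groups by the primary key of profiles, which makes the remaining profiles columns functionally dependent and therefore acceptable under ONLY_FULL_GROUP_BY:
SELECT profiles.*, COUNT(calls.pin) AS num_calls
FROM profiles
LEFT JOIN calls ON profiles.pin = calls.pin
GROUP BY profiles.id;
-- COUNT(calls.pin) rather than COUNT(*) so that profiles with no calls report 0, not 1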
The SQL query is a fairly standard inner join.
For example comparing n tables to see which customerId's exist in all n tables would be a basic WHERE ... AND type query.
The problem is the size of the tables are > 10 million records. The database is denormalized. Normalization is not an option.
The query either takes too long to complete or never completes.
I'm not sure if it's relevant, but we are using Spring XD job modules for other types of queries.
I'm not sure how to partition this sort of job so that it can be run in parallel, so that it takes less time, and so that if a step/subsection fails it can continue from where it left off.
Other posts with a similar problem suggest using alternative methods besides the database engine, like implementing a loop join in code or using MapReduce or Hadoop; having never used either, I'm unsure whether they are worth looking into for this use case.
What is the standard approach to this sort of operation? I'd expect it to be fairly common. I might be using the wrong search terms to research approaches, because I haven't come across any stock standard solutions or clear directions.
The rather cryptic original requirement was:
Compare party_id column in the three very large tables to identify the customer available in three table
i.e if it is AND operation between three.
SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID AND SAMPLE3.PARTY_ID
If the operation is OR then pick all the customers available in the three tables.
SAMPLE1.PARTY_ID OR SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID
AND / OR are used between tables then performed the comparison as required. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID
I set up 4 test tables, each with this definition:
CREATE TABLE `TABLE1` (
`CREATED` datetime DEFAULT NULL,
`PARTY_ID` varchar(45) NOT NULL,
`GROUP_ID` varchar(45) NOT NULL,
`SEQUENCE_ID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`SEQUENCE_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=978536 DEFAULT CHARSET=latin1;
Then I added 1,000,000 records to each, just random numbers in a range that should result in joins.
I used the following test query:
SELECT `TABLE1`.`PARTY_ID` AS `pi1`,
       `TABLE2`.`PARTY_ID` AS `pi2`,
       `TABLE3`.`PARTY_ID` AS `pi3`,
       `TABLE4`.`PARTY_ID` AS `pi4`
FROM `devt1`.`TABLE2` AS `TABLE2`,
     `devt1`.`TABLE1` AS `TABLE1`,
     `devt1`.`TABLE3` AS `TABLE3`,
     `devt1`.`TABLE4` AS `TABLE4`
WHERE `TABLE2`.`PARTY_ID` = `TABLE1`.`PARTY_ID`
  AND `TABLE3`.`PARTY_ID` = `TABLE2`.`PARTY_ID`
  AND `TABLE4`.`PARTY_ID` = `TABLE3`.`PARTY_ID`
It's supposed to complete in under 10 minutes, even for table sizes 10x larger.
My test query still hasn't completed, and it has been running for 15 minutes.
The following may perform better than the existing join-based query:
select party_id from
(select distinct party_id from SAMPLE1 union all
select distinct party_id from SAMPLE2 union all
select distinct party_id from SAMPLE3) as ilv
group by party_id
having count(*) = 3
Amend the count(*) condition to match the number of tables being queried.
If you want to return party_id values that are present in any table rather than all, then omit the final having clause.
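If adding indexes is an option, an index on party_id in each table should let each SELECT DISTINCT branch walk an index instead of scanning and sorting the whole table. A sketch against the test tables above:
ALTER TABLE TABLE1 ADD INDEX (PARTY_ID);
ALTER TABLE TABLE2 ADD INDEX (PARTY_ID);
ALTER TABLE TABLE3 ADD INDEX (PARTY_ID);
ALTER TABLE TABLE4 ADD INDEX (PARTY_ID);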
this feels like a "do my homework for me" kind of question but I'm really stuck here trying to make this query run quickly against a table with many many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will cover all the required columns, but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) not null,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with a 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using select distinct instead, which doesn't cause the 'Using filesort'. This would probably be okay if there was a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, with the target1Name and target2Name columns missing) runs in about 5 seconds on ~24 million records. Our target is in the hundreds of millions, and we'd like the query to continue to perform along those lines (hoping to keep it under 1-2 minutes, but we'd really like it to be much better than that). My fear is that once we hit that larger amount of data it won't (work to simulate additional rows is underway).
I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.
As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as it always does) breaks efficient usage of the Transactions index (that is, the index is actually used only for the clientId equality and the first part of the range condition, and the index is not used for the grouping).
Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods and replace the BETWEEN clause with something like period IN (1,2,3,4,5); a sketch follows). But this might become an overhead for your table.
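A sketch of that period idea, with the period encoded as YYYYMM (the column and index names are made up, and the month granularity means the 03:00:00 / 23:57:00 cut-offs are lost):
ALTER TABLE AuditEvent ADD COLUMN `period` INT NOT NULL DEFAULT 0;
UPDATE AuditEvent SET `period` = EXTRACT(YEAR_MONTH FROM eventTime);
ALTER TABLE AuditEvent ADD KEY PeriodTransactions (clientId, `period`, target1Id, `type`);

SELECT ae.target1Id, ae.`type`, COUNT(*)
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.`period` IN (201109, 201110, 201111, 201112, 201201, 201202, 201203,
                      201204, 201205, 201206, 201207, 201208, 201209)
GROUP BY ae.target1Id, ae.`type`;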
Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
  ) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will a) move the range condition to the end, b) allow using the index for the grouping, and c) make the index a covering index for the query (that is, the query does not need to read the table rows at all).
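For reference, the index described above would be something like this (the name is arbitrary):
ALTER TABLE AuditEvent
  ADD KEY GroupedTransactions (clientId, target1Id, `type`, eventTime);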
UPD1:
I am sorry, yesterday I did not read your post carefully and did not notice that your problem is to retrieve target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute a subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not have Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then just try the following:
SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;