I have a pagination query which does a range index scan on a large table:
create table t_dummy (
id int not null auto_increment,
field1 varchar(255) not null,
updated_ts timestamp null default null,
primary key (id),
key idx_name (updated_ts)
);
The query looks like this:
select * from t_dummy a
where a.field1 = 'VALUE'
and (a.updated_ts > 'some time' or (a.updated_ts = 'some time' and a.id > x))
order by a.updated_ts, a.id
limit 100
The EXPLAIN plan shows a large cost, with the rows value being very high; however, it is using all the right indexes and the execution seems fast. Can someone please tell me whether this means the query is inefficient?
EXPLAIN can be misleading. It can report a high value for rows, despite the fact that MySQL optimizes LIMIT queries to stop once enough rows have been found that satisfy your requested LIMIT (100 in your case).
The problem is, at the time the query does the EXPLAIN, it doesn't necessarily know how many rows it will have to examine to find at least 100 rows that satisfy the conditions in your WHERE clause.
So you can usually ignore the rows field of the EXPLAIN when you have a LIMIT query. It probably won't really have to examine so many rows.
If the execution is fast enough, don't worry about it. If it is not, consider a (field1,updated_ts) index and/or changing your query to
and a.updated_ts >= 'some time' and (a.updated_ts > 'some time' or a.id > x)
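For reference, the suggested composite index could be added roughly like this (a sketch; the index name is arbitrary):
alter table t_dummy add index idx_field1_updated (field1, updated_ts);
With field1 as the leading column, the equality test on field1 and the range on updated_ts can both be satisfied from the index.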
As Bill says, Explain cannot be trusted to take LIMIT into account.
The following will confirm that the query is touching only 100 rows:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The Handler_read% values will probably add up to about 100. There will probably be no Handler_write% values -- they would indicate the creation of a temp table.
A tip: If you use LIMIT 101, you get the 100 rows to show, plus an indication of whether there are more rows. This, with very low cost, avoids having a [Next] button that sometimes brings up a blank page.
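For example, a sketch of that probe using the query from the question:
select * from t_dummy a
where a.field1 = 'VALUE'
and (a.updated_ts > 'some time' or (a.updated_ts = 'some time' and a.id > x))
order by a.updated_ts, a.id
limit 101;  -- if 101 rows come back, display the first 100 and enable [Next]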
My tips on the topic: http://mysql.rjweb.org/doc.php/pagination
I have a very slow query in MySql server.
Here is the query:
SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT, TERM, START_DT, END_DT, CTYPE, MW AS MW_AWARD,
Mark, SCID
FROM
( SELECT a.CRR_DT, a.TOU, a.SRCE, a.SINK, a.NAME, a.SEASON, a.SRCESUMCONG,
a.SINKSUMCONG, a.SRCEAVGCONG, a.SINKAVGCONG, a.SUMSINKMSRCE,
a.AVGSINKMSRCE, a.HOURCOUNT, b.TERM, b.CTYPE, b.START_DT,
b.END_DT, b.MW, b.SCID, b.Mark
FROM
( SELECT CRR_DT, TOU, SRCE, SINK, NAME, SEASON, SRCESUMCONG, SINKSUMCONG,
SRCEAVGCONG, SINKAVGCONG, SUMSINKMSRCE, AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01'
) a
INNER JOIN
( SELECT MARKET, TERM, TOU, SRCE, SINK, NAME, SCID, CTYPE, START_DT,
END_DT, SUM(MW) AS MW, SUBSTR(MARKET, 1, 3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL' , 'LDES')
GROUP BY MARKET , TOU , SRCE , SINK , NAME , SCID , CTYPE ,
START_DT , END_DT
) b ON a.NAME = b.NAME
AND a.TOU = b.TOU
) c
WHERE c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7 )
ORDER BY NAME , CRR_DT , TOU ASC
Here is the EXPLAIN plan generated using MySQL Workbench.
I guess the red blocks are the dangerous parts. Can someone please help me understand this plan? A few hints on what I should check once I have this execution plan would be appreciated.
Edit: added the table layouts.
CREATE TABLE `CRR_CONGCALC` (
`CRR_DT` varchar(7) NOT NULL,
`TOU` varchar(50) NOT NULL,
`SRCE` varchar(50) NOT NULL,
`SINK` varchar(50) NOT NULL,
`SRCESUMCONG` decimal(12,6) DEFAULT NULL,
`SINKSUMCONG` decimal(12,6) DEFAULT NULL,
`SRCEAVGCONG` decimal(12,6) DEFAULT NULL,
`SINKAVGCONG` decimal(12,6) DEFAULT NULL,
`SUMSINKMSRCE` decimal(12,6) DEFAULT NULL,
`AVGSINKMSRCE` decimal(12,6) DEFAULT NULL,
`HOURCOUNT` int(11) NOT NULL DEFAULT '0',
`SEASON` char(1) NOT NULL DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`CRR_DT`,`SRCE`,`SINK`,`TOU`,`HOURCOUNT`),
KEY `srce_index` (`SRCE`),
KEY `srcesink` (`SRCE`,`SINK`)
)
CREATE TABLE `CRR_INVENTORY` (
`MARKET` varchar(50) NOT NULL,
`TERM` varchar(50) NOT NULL,
`TOU` varchar(50) NOT NULL,
`INVENTORY_DT` date NOT NULL,
`START_DT` datetime NOT NULL,
`END_DT` datetime NOT NULL,
`CRR_ID` varchar(50) NOT NULL,
`NSR_INDEX` tinyint(1) NOT NULL,
`SEGMENT` tinyint(1) NOT NULL,
`CTYPE` varchar(50) NOT NULL,
`CATEGORY` varchar(50) NOT NULL,
`COPTION` varchar(50) NOT NULL,
`SRCE` varchar(50) DEFAULT NULL,
`SINK` varchar(50) DEFAULT NULL,
`MW` decimal(8,4) NOT NULL,
`SCID` varchar(50) NOT NULL,
`SEASON` char(1) DEFAULT '0',
`NAME` varchar(110) NOT NULL,
PRIMARY KEY (`MARKET`,`INVENTORY_DT`,`CRR_ID`),
KEY `srcesink` (`SRCE`,`SINK`)
)
Brings back memories. With a database, a "Full Table Scan" means that there is nothing that the database can use to speed up the query, it reads the entire table. The rows are stored in a non-sorted order, so there is no better way to "search" for the employee id you are looking for.
This is bad. Why?
If you have a table with a bunch of columns:
first_name, last_name, employee_id, ..., column50 and do a search where employee_id = 1234, if you don't have an index on the employee_id column, you're doing a sequential scan. Even worse if you're doing a join table2 on table1.employee_id = table2.eid, because it has to match the employee_id to every record in the join table.
If you create an index, you greatly reduce the scan time to find the matches (or throw away the non-matches) because instead of doing a sequential scan you can search a sorted field. Much faster.
When you create an index on the employee_id field, you are creating a way to search for employee numbers that is much, much, much faster. When you create an index, you are saying "I am going to join based on this field or have a where clause based on this field". This speeds up your query at the cost of a little bit of disk space.
There are all kinds of tricks with indexes, you can create them so they are unique, not unique, composite (contain multiple columns) and all kinds of stuff. Post your query and we can tell you what you might look at indexing to speed this up.
A good rule of thumb is that you should create an index on your tables on fields that you use in a where clause, join criteria or order by. Picking the field depends on a few things that are beyond the scope of this discussion, but that should be a start.
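For the hypothetical employee example above, creating such an index is a one-liner (the table, column, and index names here are purely illustrative):
CREATE INDEX idx_employee_id ON table1 (employee_id);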
The pattern FROM ( SELECT... ) JOIN ( SELECT... ) ON ... does not optimize well. See if you can go directly from one of the tables, not hide it in a subquery.
CRR_CONGCALC needs INDEX(CRR_DT). (Please provide SHOW CREATE TABLE.)
CRR_INVENTORY needs INDEX(COPTION, START_DT).
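In SQL, those would look something like this (the index names are arbitrary):
ALTER TABLE CRR_CONGCALC ADD INDEX ix_crr_dt (CRR_DT);
ALTER TABLE CRR_INVENTORY ADD INDEX ix_coption_start (COPTION, START_DT);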
Please make those changes, then come back for more advice, if needed.
According to your explain diagram, there are full table scans happening at each sub-query on CRR_CONGCALC and CRR_INVENTORY. Then when you join the sub-queries together, another full table scan, and finally, when the result set is ordered, one more full table scan.
A Few tips to improve performance
Use fields that are indexed as part of your join statement, where clause, group by clause & order by clause. If this query is used often, consider adding indexes to all relevant columns.
Avoid nested sub-queries with aggregate operations in joins as much as possible. The result sets returned by the sub-queries are not indexed, so joining on them ends up scanning the whole result set rather than just an index. The joins in this query could also cause weird and hard-to-detect fanning-out issues, but that isn't the performance problem you're asking about.
Filter the result set as early as possible (i.e., in the sub-queries at the innermost layer) to minimize the number of rows the database server has to deal with afterwards.
Unless the final order by is necessary, avoid it.
Use temporary (or materialized) tables to de-nest subqueries. You can add indexes to these tables, so further joining will be efficient. This assumes you have permission to create and drop tables on the server.
That said,
Here's how I would refactor your query.
In generating the inner query b, the GROUP BY clause does not contain all of the non-aggregate columns in the SELECT. This is non-standard SQL and can produce indeterminate results; MySQL allows it, and for the love of god I don't know why. It is best to avoid this trap.
The final wrapping query is unnecessary, as the where clause and group by clause can be applied to the unwrapped query.
This where clause seems fishy to me:
c.CRR_DT BETWEEN SUBSTR(c.START_DT, 1, 7) AND SUBSTR(c.END_DT, 1, 7)
START_DT & END_DT are datetime or timestamp columns being implicitly cast as char. It would be better to extract the year-month using the function DATE_FORMAT as:
DATE_FORMAT(<FIELD>, '%Y-%m-01')
Even if the WHERE clause you used worked, it would omit records for which END_DT and CRR_DT fall in the same month. I'm not sure if that is the desired behaviour, but here's a query to illustrate what your boolean expression would evaluate to:
SELECT CAST('2015-07-05' AS DATETIME) between '2015-07' and '2015-07';
-- This query returns 0 == False.
Using CREATE TABLE AS SELECT Syntax, first de-nest the sub queries. Note: as I don't know the data, I'm not sure which indexes need to be unique. You can delete the tables once the result is consumed.
Table 1:
CREATE TABLE sub_a (KEY(CRR_DT), KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT CRR_DT,
TOU,
SRCE,
SINK,
NAME,
SEASON,
SRCESUMCONG,
SINKSUMCONG,
SRCEAVGCONG,
SINKAVGCONG,
SUMSINKMSRCE,
AVGSINKMSRCE,
HOURCOUNT
FROM CRR_CONGCALC
WHERE CRR_DT >= '2015-01';  -- CRR_DT is varchar(7) in 'YYYY-MM' format, so keep the original query's comparison value
Table 2:
CREATE TABLE sub_b (KEY(NAME), KEY(TOU), KEY(NAME, TOU)) AS
SELECT MARKET,
TERM,
TOU,
SRCE,
SINK,
NAME,
SCID,
CTYPE,
START_DT,
END_DT,
SUM(MW) AS MW_AWARD,
SUBSTR(MARKET,1,3) AS MARK
FROM CRR_INVENTORY
WHERE COPTION = 'OBLIGATION'
AND START_DT >= '2015-01-01'
AND SCID IN ('EAGL','LDES')
GROUP BY MARKET, TERM, TOU,
SRCE, SINK, NAME, SCID,
CTYPE, START_DT, END_DT, MARK
-- note the two added columns in the groupby clause.
After this, the final query would be simply:
SELECT a.CRR_DT,
a.TOU,
a.SRCE,
a.SINK,
a.NAME,
a.SEASON,
a.SRCESUMCONG,
a.SINKSUMCONG,
a.SRCEAVGCONG,
a.SINKAVGCONG,
a.SUMSINKMSRCE,
a.AVGSINKMSRCE,
a.HOURCOUNT,
b.TERM,
b.CTYPE,
b.START_DT,
b.END_DT,
b.MW_AWARD,
b.SCID,
b.Mark
FROM sub_a a
JOIN sub_b b ON a.NAME = b.NAME AND a.TOU = b.TOU
WHERE a.CRR_DT BETWEEN DATE_FORMAT(b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(b.END_DT,'%Y-%m-01')
ORDER BY NAME,
CRR_DT,
TOU;
The above where clause follows the same logic used in your query, except, it's not trying to cast to string. However, this WHERE clause may be more appropriate,
WHERE sub_a.CRR_DT BETWEEN DATE_FORMAT(sub_b.START_DT,'%Y-%m-01')
AND DATE_FORMAT(DATE_ADD(sub_b.END_DT, INTERVAL 1 MONTH),'%Y-%m-01')
Finally, both sub_a and sub_b seem to have the fields SRCE and SINK. Would the result change if you added them to the join? That could further optimize the processing time of the query (at this point, it's fair to say queries).
By doing the above, we hopefully avoid two full table scans, but I don't have your data set, so I'm only making an educated guess here.
If it's possible to express this logic without the intermediary tables, directly via joins to the actual underlying tables CRR_CONGCALC and CRR_INVENTORY, that would be even faster.
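And, as noted earlier, the intermediate tables can be dropped once the result has been consumed, for example:
DROP TABLE IF EXISTS sub_a;
DROP TABLE IF EXISTS sub_b;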
Full table scan operations are not always bad, or necessarily evil. Sometimes, a full scan is the most efficient way to satisfy a query. For example, the query SELECT * FROM mytable requires MySQL to return every row in the table and every column in each row. In this case, using an index would just add work; it's faster to do a full scan.
On the other hand, if you're retrieving a couple of rows out of a million, an access plan using a suitable index is very likely to be much faster than a full table scan. Effective use of a index can eliminate vast swaths of rows that would otherwise need to be checked; the index basically tells MySQL that the rows we're looking for cannot be in 99% of the blocks in the table, so those blocks don't need to be checked.
MySQL processes views (including inline views) differently than other databases. MySQL uses the term derived table for an inline view. In your query a, b and c are all derived tables. MySQL runs the query to return the rows, and then materializes the view into a table. Once that is completed, the outer query can run against the derived table. But as of MySQL 5.5 (and I think 5.6), inline views are always materialized as derived tables. And that's a performance killer for large sets. (Some performance improvements are coming in newer versions of MySQL, some automatic indexing.)
Also, predicates in the outer query do not get pushed down into the view query. That is, if we run a query like this:
SELECT t.foo
FROM mytable t
WHERE t.foo = 'bar'
MySQL can make use of an index with a leading column of foo to efficiently locate the rows, even if mytable contains millions of rows. But if we write the query like this:
SELECT t.foo
FROM (SELECT * FROM mytable) t
WHERE t.foo = 'bar'
We're essentially forcing MySQL to make a copy of mytable, running the inline view query, to populate a derived table, containing all rows from mytable. And once that operation is complete, the outer query can run. But now, there's no index on the foo column in the derived table. So we're forcing MySQL to do a full scan of the derived table, to look at every row.
If we need an inline view, then relocating the predicate to the inline view query will result in a much smaller derived table.
SELECT t.foo
FROM (SELECT * FROM mytable WHERE foo = 'bar') t
With that, MySQL can make use of the index on foo to quickly locate the rows, and only those rows are materialized into the derived table. The full scan of the derived table isn't as painful now, because the outer query needs to return every row. In this example, it would also be much better to replace that * (representing every column) with just the columns we need to return.
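For instance, a sketch of the same inline view with the column list trimmed (foo stands in for whatever columns the outer query actually needs):
SELECT t.foo
FROM (SELECT foo FROM mytable WHERE foo = 'bar') t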
The resultset you specify could be returned without the unnecessary inline views. A query something like this:
SELECT c.crr_dt
, c.tou
, c.srce
, c.sink
, c.name
, c.season
, c.srcesumcong
, c.sinksumcong
, c.srceavgcong
, c.sinkavgcong
, c.sumsinkmsrce
, c.avgsinkmsrce
, c.hourcount
, b.term
, b.start_dt
, b.end_dt
, b.ctype
, b.mw AS mw_award
, b.scid
, b.mark
FROM CRR_CONGCALC c
JOIN ( SELECT i.market
, i.term
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
, SUM(i.mw) AS mw
, SUBSTR(i.market, 1, 3) AS mark
FROM CRR_INVENTORY i
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
GROUP
BY i.market
, i.tou
, i.srce
, i.sink
, i.name
, i.scid
, i.ctype
, i.start_dt
, i.end_dt
) b
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
ORDER
BY c.name
, c.crr_dt
, c.tou
NOTES: If start_dt and end_dt are defined as DATE, DATETIME or TIMESTAMP columns, then I'd prefer to write the predicate like this:
AND c.crr_dt BETWEEN DATE_FORMAT(b.start_dt,'%Y-%m') AND DATE_FORMAT(b.end_dt,'%Y-%m')
(I don't think there's any performance to be gained there; that just makes it more clear what we're doing.)
In terms of improving performance of that query...
If we're returning a small subset of rows from CRR_INVENTORY, based on the predicates:
WHERE i.coption = 'OBLIGATION'
AND i.start_dt >= '2015-01-01'
AND i.scid IN ('EAGL','LDES')
Then MySQL would likely be able to make effective use of an index with leading columns of (coption,scid,start_dt). That's assuming that this represents a relatively small subset of rows from the table. If those predicates are not very selective, if we're really getting 50% or 90% of the rows in the table, the index is likely going to much less effective.
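As a sketch, that index could be added like this (the name is arbitrary):
ALTER TABLE CRR_INVENTORY ADD INDEX ix_coption_scid_start (COPTION, SCID, START_DT);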
We might be able to get MySQL to make use of an index to satisfy the GROUP BY clause, without requiring a sort operation. To get that, we'd need an index with leading columns that match the columns listed in the GROUP BY clause.
The derived table isn't going to have an index on it, so for the best performance of the join, once the derived table (b) is materialized, we are going to want a suitable index on the other table, CRR_CONGCALC. We want the leading columns of that index to be usable for the lookup of the matching rows, given the predicates:
ON c.name = b.name
AND c.tou = b.tou
AND c.crr_dt >= '2015-01'
AND c.crr_dt BETWEEN SUBSTR(b.start_dt,1,7)
AND SUBSTR(b.end_dt,1,7)
So, we want an index with leading columns of (name, tou, crr_dt) to be able to efficiently locate the matching rows.
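Again as a sketch (the index name is arbitrary):
ALTER TABLE CRR_CONGCALC ADD INDEX ix_name_tou_crrdt (NAME, TOU, CRR_DT);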
This feels like a "do my homework for me" kind of question, but I'm really stuck here trying to make this query run quickly against a table with many, many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will show all the required columns but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) not null,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with a 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using select distinct instead, which doesn't cause the 'Using filesort'. This would probably be okay if there was a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, with missing target1Name and target2Name columns) runs in about 5 seconds on ~24million records. Our target is in the hundreds of millions and we'd like the query to continue to perform along those lines (hoping to keep it under 1-2 minutes, but we'd like to have it much better), but my fear is once we hit that larger amount of data it won't (work to simulate additional rows is underway).
I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.
As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as a range always does) breaks efficient usage of the Transactions index: the index is actually used only for the clientId equality and the first part of the range condition, and it is not used for the grouping.
Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods, and replace the BETWEEN clause with something like period IN (1,2,3,4,5)). But this might add overhead to your table.
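A rough sketch of that idea; the period column, its encoding, and the index name are all hypothetical here, and the column would have to be maintained by your application or a trigger:
ALTER TABLE AuditEvent
ADD COLUMN period INT NOT NULL DEFAULT 0,  -- e.g. YEAR(eventTime) * 100 + MONTH(eventTime)
ADD INDEX Periods (clientId, period, target1Id, type);
SELECT target1Id, type, COUNT(*)
FROM AuditEvent
WHERE clientId = 4
AND period IN (201109, 201110, 201111)  -- one value per month in the requested range
GROUP BY target1Id, type;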
Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will a) move the range condition to the end, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query does not need disk I/O operations to read the table).
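For reference, the suggested index could be created along these lines (the index name is arbitrary; drop Transactions separately if it is no longer needed):
ALTER TABLE AuditEvent ADD INDEX ClientTargetType (clientId, target1Id, type, eventTime);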
UPD1:
I am sorry, yesterday I did not carefully read your post and did not notice that your problem is retrieving target1Name and target2Name. First of all, I am not sure you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute the subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not have Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then just try the following:
SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;
I'm going through an application and trying to optimize some queries and I'm really struggling with a few of them. Here's an example:
SELECT `Item` . * , `Source` . * , `Keyword` . * , `Author` . *
FROM `items` AS `Item`
JOIN `sources` AS `Source` ON ( `Item`.`source_id` = `Source`.`id` )
JOIN `authors` AS `Author` ON ( `Item`.`author_id` = `Author`.`id` )
JOIN `items_keywords` AS `ItemsKeyword` ON ( `Item`.`id` = `ItemsKeyword`.`item_id` )
JOIN `keywords` AS `Keyword` ON ( `Keyword`.`id` = `ItemsKeyword`.`keyword_id` )
JOIN `keywords_profiles` AS `KeywordsProfile` ON ( `Keyword`.`id` = `KeywordsProfile`.`keyword_id` )
JOIN `profiles` AS `Profile` ON ( `Profile`.`id` = `KeywordsProfile`.`profile_id` )
WHERE `KeywordsProfile`.`profile_id` IN ( 17 )
GROUP BY `Item`.`id`
ORDER BY `Item`.`timestamp` DESC , `Item`.`id` DESC
LIMIT 0 , 20;
This one is taking 10-30 seconds...in the tables referenced, there are about 500k author rows, and about 750k items and items_keywords rows. Everything else is less than 500 rows.
Here's the explain output:
http://img.skitch.com/20090220-fb52wd7jf58x41ikfxaws96xjn.jpg
EXPLAIN is relatively new to me, but I went through this line by line and it all seems fine. Not sure what else I can do, as I've got indexes on everything...what am I missing?
The server this sits on is just a 256 slice over at slicehost, but there's nothing else running on it and the CPU is at 0% before its run. And yet still it cranks away on this query. Any ideas?
EDIT: Some further info; one of the things that makes this really frustrating is that if I repeatedly run this query, it takes less than .1 seconds. I'm assuming this is due to the query cache, but if I run RESET QUERY CACHE before it, it still runs extremely quickly. It's only after I wait a little while or run some other queries that the 10-30 second times return. All the tables are MyISAM...does this indicate that MySQL is loading stuff into memory and that's why it runs so much faster for awhile?
EDIT 2: Thanks so much to everyone for your help...an update...I cut everything down to this:
SELECT i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
Consistently took 5-6 seconds, despite there only being 750k records in the DB. Once I dropped the 2nd column on the ORDER BY clause, it was pretty much instant. There's obviously several things going on here, but when I cut the query down to this:
SELECT i.id
FROM items AS i
JOIN items_keywords AS ik ON ( i.id = ik.item_id )
JOIN keywords AS k ON ( k.id = ik.keyword_id )
JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id )
WHERE kp.profile_id IN (139)
ORDER BY i.timestamp DESC
LIMIT 20;
It's still taking 10+ seconds...what else can I do?
Minor curiosity: in the explain, the rows column for items_keywords is always 1544, regardless of what profile_id I'm using in the query. Shouldn't it change depending on the number of items associated with that profile?
EDIT 3: Ok, this is getting ridiculous :). If I drop the ORDER BY clause entirely, things are very speedy and the temp table / filesort disappears from explain. There's currently an index on the item.timestamp column, but is it not being used for some reason? I thought I remembered something about mysql only using one index per table or something? should I create a multi-column index over all the columns on the items table that this query references (source_id, author_id, timestamp, etc)?
Try this and see how it does:
SELECT i.*, s.*, k.*, a.*
FROM items AS i
JOIN sources AS s ON (i.source_id = s.id)
JOIN authors AS a ON (i.author_id = a.id)
JOIN items_keywords AS ik ON (i.id = ik.item_id)
JOIN keywords AS k ON (k.id = ik.keyword_id)
WHERE k.id IN (SELECT kp.keyword_id
FROM keywords_profiles AS kp
WHERE kp.profile_id IN (17))
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
I factored out a couple of the joins into a non-correlated subquery, so you wouldn't have to do a GROUP BY to map the result to distinct rows.
Actually, you may still get multiple rows per i.id in my example, depending on how many keywords map to a given item and also to profile_id 17.
The filesort reported in your EXPLAIN report is probably due to the combination of GROUP BY and ORDER BY using different fields.
I agree with #ʞɔıu's answer that the speedup is probably because of key caching.
It looks okay, every row in the explain is using an index of some sort. One possible worry is the filesort bit. Try running the query without the order by clause and see if that improves it.
Then, what I would do is gradually take out each join until you (hopefully) get that massive speed increase, then concentrate on why that's happening.
The reason I mention the filesort is because I can't see a mention of timestamp anywhere in the explain output (even though it's your primary sort criteria) - it might be requiring a full non-indexed sort.
UPDATE#1:
Based on edit#2, the query:
SELECT i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
takes 5-6 seconds. That's abhorrent. Try creating a composite index on both TIMESTAMP and ID and see if that improves it:
create index timestamp_id on items(timestamp,id);
select id from items order by timestamp desc,id desc limit 0,20;
select id from items order by timestamp,id limit 0,20;
select id from items order by timestamp desc,id desc;
select id from items order by timestamp,id;
On one of the tests, I've left off the descending bit (DB2 for one sometimes doesn't use indexes if they're in the opposite order). The other variation is to take off the limit in case that's affecting it.
For your query to run fast, you need:
Create an index: CREATE INDEX ix_timestamp_id ON items (timestamp, id)
Ensure that id's on sources, authors and keywords are primary keys.
Force MySQL to use this index for items, and perform NESTED LOOP joins against the other tables:
EXPLAIN EXTENDED
SELECT Item.*, Source . * , Keyword . * , Author . *
FROM items AS Item FORCE INDEX FOR ORDER BY (ix_timestamp_id)
JOIN items_keywords AS ItemsKeyword FORCE INDEX (ix_item_keyword) ON ( Item.id = ItemsKeyword.item_id AND ItemsKeyword.keyword_id IN
(
SELECT keyword_id
FROM keywords_profiles AS KeywordsProfile FORCE INDEX (ix_keyword_profile)
WHERE KeywordsProfile.profile_id = 17
)
)
JOIN sources AS Source FORCE INDEX (primary) ON ( Item.source_id = Source.id )
JOIN authors AS Author FORCE INDEX (primary) ON ( Item.author_id = Author.id )
JOIN keywords AS Keyword FORCE INDEX (primary) ON ( Keyword.id = ItemsKeyword.keyword_id )
ORDER BY Item.timestamp DESC, Item.id DESC
LIMIT 0, 20
As you can see, we get rid of GROUP BY, push the subquery into the JOIN condition and force PRIMARY KEYs to be used for joins.
That's how we ensure that NESTED LOOPS, with items as the leading table, will be used for all joins.
As a result:
1, 'PRIMARY', 'Item', 'index', '', 'ix_timestamp_id', '12', '', 20, 2622845.00, ''
1, 'PRIMARY', 'Author', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.author_id', 1, 100.00, ''
1, 'PRIMARY', 'Source', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.source_id', 1, 100.00, ''
1, 'PRIMARY', 'ItemsKeyword', 'ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.id', 1, 100.00, 'Using where; Using index'
1, 'PRIMARY', 'Keyword', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.ItemsKeyword.keyword_id', 1, 100.00, ''
2, 'DEPENDENT SUBQUERY', 'KeywordsProfile', 'unique_subquery', 'PRIMARY', 'PRIMARY', '8', 'func,const', 1, 100.00, 'Using index; Using where'
and when we run this, we get:
20 rows fetched in 0,0038s (0,0019s)
There are 500k in items, 600k in items_keywords, 512 values in keywords and 512 values in keywords_profiles (all with profile 17).
I would suggest you run a profiler on the query; then you can see how long each subquery took and where the time is being consumed. If you have phpMyAdmin, it's a simple checkbox you need to check to get this functionality, but my guess is you can get it manually from the mysql terminal app as well. I haven't seen this EXPLAIN thing before; if it is in fact the profiling I am used to in phpMyAdmin, I apologize for the nonsense.
What is the GROUP BY clause achieving? There are no aggregate functions in the SELECT so the GROUP BY should be unnecessary
Some things to try:
Try not selecting all columns from all tables; select only what you need. Pulling every column may preclude the use of covering indexes (look for "Using index" in the Extra column) and in general will soak up a lot of needless I/O.
That filesort looks a little troubling. Try removing the order by and replacing it with order by null -- group by implicitly sorts in mysql so you have to order by null to remove that implicit sort.
Try adding an index on item (timestamp, id) or (id, timestamp). Might do something about that filesort (you never know).
Why are you grouping by item id and not selecting any aggregate columns? If you group by a column and then select (much less order by) some other non-aggregate columns, the values of those columns are chosen more or less arbitrarily. Unless item id is always unique for this query, in which case the GROUP BY will not accomplish anything.
Lastly, in my experience, MySQL sometimes will just inexplicably freak out if you give it too many joins to try to optimize. Try to figure out whether there's some way you don't have to do so many joins all at once, i.e. split it up into multiple queries if you can.
one of the things that makes this really frustrating is that if I repeatedly run this query, it takes less than .1 seconds. I'm assuming this is due to the query cache — add SQL_NO_CACHE after the SELECT keyword to disable the use of the query cache per this query
All the tables are MyISAM...does this indicate that MySQL is loading stuff into memory and that's why it runs so much faster for awhile — MyISAM uses a key buffer and only caches index data in memory, and relies on the OS to hopefully cache non-index data. Unlike Innodb, which caches everything in the buffer pool.
Is it possible you're having issues because of filesystem I/O? The EXPLAIN shows that 1544 rows have to be fetched from the ItemsKeyword table. If you have to go to disk for each of those, you'll add about 10-15 seconds total to the run time (assuming a high-ish seek time because you're on a VM). Normally the tables are cached in RAM, or the data is stored close enough together on disk that reads can be combined. However, you're running on a VM with 256MB of RAM, so you may have no spare memory to cache into, and if your table file is fragmented enough, query performance could degrade this much.
You could probably get some idea of what's happening with I/O during the query by running something like pidstat -d 1 or iostat 1 in another shell on the server.
EDIT:
From looking at the query, adding an index on (ItemsKeyword.item_id, ItemsKeyword.keyword_id) should fix it, if my theory is right about it being a problem with the seeks on the ItemsKeyword table.
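Something like this (the index name is arbitrary):
CREATE INDEX ix_item_keyword ON items_keywords (item_id, keyword_id);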
MySQL loads a lot into different caches, including indexes and queries. In addition, your operating system will keep a file system cache that could speed up your query when executed repeatedly.
One thing to consider is how MySQL creates temporary tables during this type of query. As you can see in your explain, a temporary table is being created, probably for sorting of the results. Generally, MySQL will create these temporary tables in memory, except for 2 conditions. First, if they exceed the maximum size set in MySQL settings (max temp table size or heap size - check mysqlperformanceblogs.com for more info on these settings). The second and more important one is this:
Temporary tables will always be created on disk when TEXT or BLOB columns are selected in the query.
This can create a major performance hit, and even lead to an i/o bottleneck if your server is getting any amount of action.
Check to see if any of your columns are of this data type. If they are, you can try to rewrite the query so that a temporary table is not created (group by always causes them, I think), or try not selecting these out. Another strategy would be to break this up into several smaller queries that might execute in a fraction of the time.
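If you want to check the in-memory temporary table limits mentioned above, these are standard server variables (just a sketch of the check, not a fix by itself):
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';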
Good luck!
I may be completely wrong but what happens when you change
WHERE kp.profile_id IN (139)
to
WHERE kp.profile_id = 139
Try this:
SELECT i.id
FROM ((items AS i
INNER JOIN items_keywords AS ik ON ( i.id = ik.item_id ))
INNER JOIN keywords AS k ON ( k.id = ik.keyword_id ))
INNER JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id AND kp.profile_id = 139)
ORDER BY i.timestamp DESC
LIMIT 20;
Looking at the pastie.org link in the comments to the question:
you're joining items.source_id int(4) to sources.id int(16)
also items.id int(16) to itemskeywords.item_id int(11)
I can't see any good reason for the two fields to have different widths in these cases
I realise that these are just display widths and that the actual range of numbers which the column can store is determined solely by the INT part but the MySQL 6.0 reference manual says:
Note that if you store larger values than the display width in an integer column, you may experience problems when MySQL generates temporary tables for some complicated joins, because in these cases MySQL assumes that the data fits into the original column width.
From the rough figures you quoted, it doesn't look as though you are exceeding the display width on any of the ID columns. You may as well tidy up these inconsistencies though just to eliminate another possible bug.
You might as well remove the display widths altogether if you don't have a need for them.
edit:
I would hazard a guess that the original author of the database perhaps thought that int(4) meant "an integer with up to 4 digits" whereas it actually means "an integer between -2147483648 and 2147482647 displayed with at least 4 characters left-padded with spaces if need be"
Definitions like authors.refreshed int(20) or items.timestamp int(30) don't really make sense as there can only be 10 digits plus the sign in an int. Even a bigint can't exceed 20 characters. Perhaps the original author thought that int(4) was analogous to varchar(4)?
Try a backup copy of your tables. After that rename the original tables to something else, rename the new tables to the original and try again with your new-but-old-named tables...
Or you can try to repair the tables, but this doesn't always help.
Edit: Man, this was an old question...
The problem appears to be that it has to fully join every single table before it even applies the WHERE clause. At 500k rows per table, you're looking at millions-plus rows being populated in memory. I would try changing the JOINs to LEFT JOINs USING ().
First of all, this question regards MySQL 3.23.58, so be advised.
I have 2 tables with the following definition:
Table A: id INT (primary), customer_id INT, offlineid INT
Table B: id INT (primary), name VARCHAR(255)
Now, table A contains in the range of 65k+ records, while table B contains ~40 records. In addition to the 2 primary key indexes, there is also an index on the offlineid field in table A. There are more fields in each table, but they are not relevant (as I see it, ask if necessary) for this query.
I was first presented with the following query (query time: ~22 seconds):
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
Now, each id in medie is associated with a different name, meaning you could group by id as well as name. A bit of testing back and forth settled me on this (query time: ~6 seconds):
SELECT a.name, COUNT(*) AS orders, COUNT(DISTINCT(b.kundeid)) AS leads
FROM medie a
INNER JOIN katalogbestilling_katalog b ON a.id = b.offline
GROUP BY b.offline;
Is there any way to crank it down to "instant" time (max 1 second at worst)? I added the index on offlineid, but besides that and the re-arrangement of the query, I am at a loss for what to do. The EXPLAIN output shows me the query is using filesort (the original query also used temporary tables). All suggestions are welcome!
I'm going to guess that your main problem is that you are using such an old version of MySQL. Maybe MySQL 3 doesn't like the COUNT(DISTINCT()).
Alternately, it might just be system performance. How much memory do you have?
Still, MySQL 3 is really old. I would at least put together a test system to see if a newer version ran that query faster.
Unfortunately, mysql 3 doesn't support sub-queries. I suspect that the old version in general is what's causing the slow performance.
You could try making sure there are covering indexes defined on each table. A covering index is just an index where each column requested in the select or used in a join is included in the index. This way, the engine only has to read the index entry and doesn't have to also do the corresponding row lookup to get any requested columns not included in the index. I've used this technique with great success in Oracle and MS SqlServer.
Looking at your query, you could try:
one index for medie.id, medie.name
one index for katalogbestilling_katalog.offlineid, katalogbestilling_katalog.kundeid
The columns should be defined in these orders for the index. That makes a difference whether the index can be used or not.
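As a sketch, those two covering indexes could be created like this (the index names are arbitrary):
CREATE INDEX ix_medie_id_name ON medie (id, name);
CREATE INDEX ix_katalog_offline_kunde ON katalogbestilling_katalog (offlineid, kundeid);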
More info here:
Covering Index Info
You may get a small increase in performance if you remove the inner join and replace it with a nested select statement; also remove the COUNT(*) and replace it with the PK.
SELECT a.name, COUNT(*) AS orders, COUNT(DISTINCT(b.kundeid)) AS leads
FROM medie a
INNER JOIN katalogbestilling_katalog b ON a.id = b.offline
GROUP BY b.offline;
would be
SELECT a.name,
COUNT(a.id) AS orders,
(SELECT COUNT(kundeid) FROM katalogbestilling_katalog b WHERE b.offline = a.id) AS Leads
FROM medie a;
Well if the query is run often enough to warrant the overhead, create an index on table A containing the fields used in the query. Then all the results can be read from an index and it wont have to scan the table.
That said, all my experience is based on MSSQL, so might not work.
Your second query is fine and 65k+40k rows is not very large :)
Put a new index on the katalogbestilling_katalog.offline column and it will run faster for you.
How is kundeid defined? It would be helpful to see the full schema for both tables (as generated by MySQL, ie. with indexes) as well as the output of EXPLAIN with the queries above.
The easiest way to debug this and find out what is your bottleneck would be to start removing fields, one by one, from the query and measure how long does it take to run (remember to run RESET QUERY CACHE before running each query). At some point you'll see a significant drop in the execution time and then you've identified your bottleneck. For example:
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
may become
SELECT b.name, COUNT(DISTINCT(a.kundeid)) AS leads
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
to eliminate the possibility of "orders" being the bottleneck, or
SELECT b.name, COUNT(*) AS orders
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
to eliminate "leads" from the equasion. This will lead you in the right direction.
update: I'm not suggesting removing any of the data from the final query. Just remove them to reduce the number of variables while looking for the bottleneck. Given your comment, I understand
SELECT b.name
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
is still performing badly? This clearly means it's either the join that is not optimized or the GROUP BY (which you can test by removing the GROUP BY: either the JOIN will still be slow, in which case that's the problem you need to fix, or it won't, in which case it's obviously the GROUP BY). Can you post the output of
EXPLAIN SELECT b.name
FROM katalogbestilling_katalog a, medie b
WHERE a.offlineid = b.id
GROUP BY b.name
as well as the table schemas (to make it easier to debug)?
update #2
there's also a possibility that all of your indexes are created correctly but your MySQL installation is misconfigured when it comes to maximum memory usage or something along those lines, which forces it to sort on disk.
Try adding an index to (offlineid, kundeid)
I added 180,000 BS rows to katalog and 30,000 BS rows to medie (with katalog offlineid's corresponding to medie id's, and with a few overlapping kundeid's to make sure the distinct counts work). Mind you, this is on MySQL 5, so if you don't have similar results, MySQL 3 may be your culprit, but from what I recall MySQL 3 should be able to handle this just fine.
My tables:
CREATE TABLE `katalogbestilling_katalog` (
`id` int(11) NOT NULL auto_increment,
`offlineid` int(11) NOT NULL,
`kundeid` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `offline_id` (`offlineid`,`kundeid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=60001 ;
CREATE TABLE `medie` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=30001 ;
My query:
SELECT b.name, COUNT(*) AS orders, COUNT(DISTINCT(a.kundeid)) AS leads
FROM medie b
INNER JOIN katalogbestilling_katalog a ON b.id = a.offlineid
GROUP BY a.offlineid
LIMIT 0 , 30
"Showing rows 0 - 29 (30,000 total, Query took 0.0018 sec)"
And the explain:
id: 1
select_type: SIMPLE
table: a
type: index
possible_keys: NULL
key: offline_id
key_len: 8
ref: NULL
rows: 180000
Extra: Using index
id: 1
select_type: SIMPLE
table: b
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: test.a.offlineid
rows: 1
Extra:
Try optimizing the server itself. See this post by Peter Zaitsev for the most important variables. Some are InnoDB specific, while others are for MyISAM. You didn't mention which engine you were using, which might be relevant in this case (COUNT(*) is much faster in MyISAM than in InnoDB, for example).
Here is another post from same blog, and an article from MySQL Forge