mysql query optimization: select with counted subquery extremely slow

I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second table refers to the "id" column of the first (there is no foreign key between the two because the relationship is polymorphic, i.e. the second table may reference entities that aren't in the first table but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but it was actually slower than the query above. Even so, the query above still takes a substantial length of time to complete, and MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?

I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
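The existing key item_id (item_id, item_type) on station_cache already has item_id as its leading column, so it can serve the subquery as-is; only the second index would need to be created, roughly like this (the index name is illustrative):
CREATE INDEX idx_rsspodcastitems_published_id
    ON rsspodcastitems (published, id);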
Your query could be faster if it returns a significant number of rows. Your phrasing of the query allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the order by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().

You might want to detect and remove redundant indexes from your tables. Reviewing your CREATE TABLE output for both tables will help you discover several duplicates, including (podcast_id, guid), (Station_id, published), (item_id, item_type) and (podcast_id, published); there may be more.
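On these particular tables, the duplicated definitions could be dropped with statements along these lines (keeping one index from each duplicated pair; double-check against your live schema before running them):
ALTER TABLE rsspodcastitems
    DROP INDEX UKfb6nlyxvxf3i2ibwd8jx6k025,   -- duplicate of UNIQUE KEY podcast_id (podcast_id, guid)
    DROP INDEX IDXfb6nlyxvxf3i2ibwd8jx6k025;  -- non-unique copy of the same column pair
ALTER TABLE station_cache
    DROP INDEX IDX12n81jv8irarbtp8h2hl6k4q3,  -- duplicate of Station_id (Station_id, published)
    DROP INDEX IDXqw9yqpavo9fcduereqqij4c80,  -- duplicate of item_id (item_id, item_type)
    DROP INDEX IDXkp2ehbpmu41u1vhwt7qdl2fuf;  -- duplicate of podcast_id (podcast_id, published)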

My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.
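A minimal sketch of what such an external index table might look like (the table name, column sizes and query are assumptions; the stop-word filtering and stemming happen in application code before rows are inserted):
CREATE TABLE rsspodcastitem_words (
  word    VARCHAR(40) NOT NULL,  -- stemmed, lower-cased token
  item_id CHAR(20)    NOT NULL,  -- rsspodcastitems.id
  PRIMARY KEY (word, item_id)
);
-- A search then becomes an index lookup plus a join:
-- SELECT i.* FROM rsspodcastitems i
--   JOIN rsspodcastitem_words w ON w.item_id = i.id
--  WHERE w.word = 'podcast';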

Related

How can I optimize a query which depends on both COUNT and GROUP BY?

I have a query whose purpose is to generate statistics for how many musical works (tracks) have been downloaded from a site over different periods (by month, by quarter, by year etc). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to a specific album I would do the following query:
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album was downloaded, and entityusage_file then holds all the tracks that the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferably be able to finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
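Expressed as DDL, those suggestions would be roughly the following (index names are illustrative):
CREATE INDEX idx_track_album_id  ON track (album_id, id);
CREATE INDEX idx_euf_track_usage ON entityusage_file (track_id, entityusage_id);
CREATE INDEX idx_eu_id_type_act  ON entityusage (id, entitytype, action);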
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16, by going to BINARY(16) and using a compression function. Many exist (including a built-in pair in version 8.0); here's mine.
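As a rough sketch of that conversion for one column (MySQL 8.0 ships UUID_TO_BIN()/BIN_TO_UUID(); on older versions UNHEX(REPLACE(...)) does the same job):
-- Illustrative only: convert entityusage.id from CHAR(36) to BINARY(16).
-- Every referencing column needs the same treatment before the old
-- CHAR(36) columns and their indexes can be dropped.
ALTER TABLE entityusage ADD COLUMN id_bin BINARY(16);
UPDATE entityusage SET id_bin = UNHEX(REPLACE(id, '-', ''));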
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table, chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
-- Note: entityusage itself has no track_id column, so the mapping table
-- entityusage_file is joined in to get one; a WHERE eu.entitytype = 't'
-- filter can be added if only direct track downloads should be counted.
CREATE TABLE usage_track_by_day
( track_id    CHAR(36) NOT NULL
, updated_dt  DATE     NOT NULL
, cnt         BIGINT   NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT euf.track_id
     , DATE(eu.updated)             AS updated_dt
     , SUM(IF(eu.action = 1, 1, 0)) AS cnt
  FROM entityusage eu
  JOIN entityusage_file euf
    ON euf.entityusage_id = eu.id
 GROUP
    BY euf.track_id
     , DATE(eu.updated)
An index on entityusage (updated, action), plus one on entityusage_file (entityusage_id, track_id), may benefit the performance of building this table.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
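A periodic refresh could be as simple as truncating and repopulating the table, for example from a nightly job (a sketch that follows the table definition above):
TRUNCATE TABLE usage_track_by_day;
INSERT INTO usage_track_by_day (track_id, updated_dt, cnt)
SELECT euf.track_id
     , DATE(eu.updated)
     , SUM(IF(eu.action = 1, 1, 0))
  FROM entityusage eu
  JOIN entityusage_file euf ON euf.entityusage_id = eu.id
 GROUP BY euf.track_id, DATE(eu.updated);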
Can you try this? I can't really test it without some sample data from you.
In this case the query looks at the track table first and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND entitytype = 't'
AND action = 1
GROUP BY date_format(eu.updated, '%Y%m%d');

MySQL : Avoid Temporary/Filesort Caused by GROUP BY Clause

I've got a fairly simple query that seeks to display the number of email addresses that are subscribed along with the number unsubscribed, grouped by client.
The query:
SELECT
client_id,
COUNT(CASE WHEN subscribed = 1 THEN subscribed END) AS subs,
COUNT(CASE WHEN subscribed = 0 THEN subscribed END) AS unsubs
FROM
contacts_emailAddresses
LEFT JOIN contacts ON contacts.id = contacts_emailAddresses.contact_id
GROUP BY
client_id
Schema of relevant tables follows. contacts_emailAddresses is a junction table between contacts (which has the client_id) and emailAddresses (which is not actually used in this query).
CREATE TABLE `contacts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`firstname` varchar(255) NOT NULL DEFAULT '',
`middlename` varchar(255) NOT NULL DEFAULT '',
`lastname` varchar(255) NOT NULL DEFAULT '',
`gender` varchar(5) DEFAULT NULL,
`client_id` mediumint(10) unsigned DEFAULT NULL,
`datasource` varchar(10) DEFAULT NULL,
`external_id` int(10) unsigned DEFAULT NULL,
`created` timestamp NULL DEFAULT NULL,
`trash` tinyint(1) NOT NULL DEFAULT '0',
`updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `client_id` (`client_id`),
KEY `external_id combo` (`client_id`,`datasource`,`external_id`),
KEY `trash` (`trash`),
KEY `lastname` (`lastname`),
KEY `firstname` (`firstname`),
CONSTRAINT `contacts_ibfk_1` FOREIGN KEY (`client_id`) REFERENCES `clients` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14742974 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
CREATE TABLE `contacts_emailAddresses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`contact_id` int(10) unsigned NOT NULL,
`emailAddress_id` int(11) unsigned DEFAULT NULL,
`primary` tinyint(1) unsigned NOT NULL DEFAULT '0',
`subscribed` tinyint(1) unsigned NOT NULL DEFAULT '1',
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `contact_id` (`contact_id`),
KEY `subscribed` (`subscribed`),
KEY `combo` (`contact_id`,`emailAddress_id`) USING BTREE,
KEY `emailAddress_id` (`emailAddress_id`) USING BTREE,
CONSTRAINT `contacts_emailAddresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`id`),
CONSTRAINT `contacts_emailAddresses_ibfk_2` FOREIGN KEY (`emailAddress_id`) REFERENCES `emailAddresses` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24700918 DEFAULT CHARSET=utf8
Here's the EXPLAIN:
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | contacts_emailAddresses | ALL | NULL | NULL | NULL | NULL | 10176639 | Using temporary; Using filesort |
| 1 | SIMPLE | contacts | eq_ref | PRIMARY | PRIMARY | 4 | icarus.contacts_emailAddresses.contact_id | 1 | |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
2 rows in set (0.08 sec)
The problem here clearly is the GROUP BY clause, as I can remove the JOIN (and the items that depend on it) and the performance is still terrible (40+ seconds). There are 10m records in contacts_emailAddresses, some 12m records in contacts, and 10–15 client records for the grouping.
From the doc:
Temporary tables can be created under conditions such as these:
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
DISTINCT combined with ORDER BY may require a temporary table.
If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
I'm obviously not combining the GROUP BY with an ORDER BY, and I have tried multiple things to ensure that the GROUP BY is on a column that should be properly placed in the join queue (including rewriting the query to put contacts in the FROM and instead join to contacts_emailAddresses), all to no avail.
Any suggestions for performance tuning would be much appreciated!
I think the only real shot you have of getting away from a "Using temporary; Using filesort" operation (given the current schema, the current query, and the specified resultset) would be to use correlated subqueries in the SELECT list.
SELECT c.client_id
, (SELECT IFNULL(SUM(es.subscribed=1),0)
FROM contacts_emailAddresses es
JOIN contacts cs
ON cs.id = es.contact_id
WHERE cs.client_id = c.client_id
) AS subs
, (SELECT IFNULL(SUM(eu.subscribed=0),0)
FROM contacts_emailAddresses eu
JOIN contacts cu
ON cu.id = eu.contact_id
WHERE cu.client_id = c.client_id
) AS unsubs
FROM contacts c
GROUP BY c.client_id
This may run quicker than the original query, or it may not. Those correlated subqueries are going to get run for each row returned by the outer query. If that outer query returns a boatload of rows, that's a whole boatload of subquery executions.
Here's the output from an EXPLAIN:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo contact_id 4 cu.id Using where
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo contact_id 4 cs.id Using where
For optimum performance of this query, we'd really like to see "Using index" in the Extra column of the explain for the eu and es tables. But to get that, we'd need a suitable index, one with a leading column of contact_id and including the subscribed column. For example:
CREATE INDEX cemail_IX2 ON contacts_emailAddresses (contact_id, subscribed);
With the new index available, EXPLAIN output shows MySQL will use the new index:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo,cemail_IX2 cemail_IX2 4 cu.id Using where; Using index
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo,cemail_IX2 cemail_IX2 4 cs.id Using where; Using index
NOTES
This is the kind of problem where introducing a little redundancy can improve performance. (Just like we do in a traditional data warehouse.)
For optimum performance, what we'd really like is to have the client_id column available on the contacts_emailAddresses table, without a need to JOIN to the contacts table.
In the current schema, the foreign key relationship to contacts table gets us the client_id (rather, the JOIN operation in the original query is what gets it for us.) If we could avoid that JOIN operation entirely, we could satisfy the query entirely from a single index, using the index to do the aggregation, and avoiding the overhead of the "Using temporary; Using filesort" and JOIN operations...
With the client_id column available, we'd create a covering index like...
... ON contacts_emailAddresses (client_id, subscribed)
Then, we'd have a blazingly fast query...
SELECT e.client_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.client_id
That would get us a "Using index" in the query plan, and the query plan for this resultset just doesn't get any better than that.
But that would require a change to your schema, so it doesn't really answer your question.
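For completeness, that schema change could look roughly like this (the index name is illustrative; the UPDATE backfills the redundant column, which then has to be kept in sync by the application or by triggers):
ALTER TABLE contacts_emailAddresses
    ADD COLUMN client_id MEDIUMINT UNSIGNED NULL;
UPDATE contacts_emailAddresses e
  JOIN contacts c ON c.id = e.contact_id
   SET e.client_id = c.client_id;
ALTER TABLE contacts_emailAddresses
    ADD KEY cemail_client_sub (client_id, subscribed);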
Without the client_id column, the best we're likely to do is a query like the one Gordon posted in his answer (though you still need to add the GROUP BY c.client_id to get the specified result). The index Gordon recommended will be of benefit...
... ON contacts_emailAddresses(contact_id, subscribed)
With that index defined, the standalone index on contact_id is redundant. The new index will be a suitable replacement to support the existing foreign key constraint. (The index on just contact_id could be dropped.)
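With an index such as cemail_IX2 above in place, dropping the redundant key is a one-liner; InnoDB will use the composite index to back the foreign key instead:
ALTER TABLE contacts_emailAddresses DROP KEY contact_id;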
Another approach would be to do the aggregation on the "big" table first, before doing the JOIN, since it's the driving table for the outer join. Actually, since that foreign key column is defined as NOT NULL, and there's a foreign key, it's not really an "outer" join at all.
SELECT c.client_id
, SUM(s.subs) AS subs
, SUM(s.unsubs) AS unsubs
FROM ( SELECT e.contact_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.contact_id
) s
JOIN contacts c
ON c.id = s.contact_id
GROUP BY c.client_id
Again, we need an index with contact_id as the leading column and including the subscribed column, for best performance. (The plan for s should show "Using index".) Unfortunately, that's still going to materialize a fairly sizable resultset (derived table s) as a temporary MyISAM table, and the MyISAM table isn't going to be indexed.

mysql slow count in join query

So I have two tables that I need to be able to get counts for. One of them holds the content and the other holds the relationship between it and the categories table. Here is the DDL:
CREATE TABLE content_en (
id int(11) NOT NULL AUTO_INCREMENT,
title varchar(100) DEFAULT NULL,
uid int(11) DEFAULT NULL,
date_added int(11) DEFAULT NULL,
date_modified int(11) DEFAULT NULL,
active tinyint(1) DEFAULT NULL,
comment_count int(6) DEFAULT NULL,
orderby tinyint(4) DEFAULT NULL,
settings text,
permalink varchar(255) DEFAULT NULL,
code varchar(3) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY id (id),
UNIQUE KEY id_2 (id) USING BTREE,
UNIQUE KEY combo (id,active) USING HASH,
KEY code (code) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8;
and for the other table
CREATE TABLE content_page_categories (
catid int(11) unsigned NOT NULL,
itemid int(10) unsigned NOT NULL,
main tinyint(1) DEFAULT NULL,
KEY itemid (itemid),
KEY catid (catid),
KEY combo (catid,itemid) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The query i'm running is :
SELECT count(*)
FROM content_page_categories USE INDEX (combo)
INNER JOIN content_en USE INDEX (combo) ON (id = itemid)
WHERE catid = 1 AND active = 1 ;
Both tables have 125k rows and I can't get the count query to run fast enough. The best timing I get is 0.175s, which is horrible for this amount of rows. Selecting 100 rows is as fast as 0.01s. I have tried 3 or 4 variations of this query but in the end the timings are just about the same. Also, if I don't do USE INDEX the timing gets 3x slower.
Also tried the following :
SELECT COUNT( *) FROM content_page_categories
INNER JOIN content_en ON id=itemid
AND catid = 1 AND active = 1 WHERE 1
and :
SELECT SQL_CALC_FOUND_ROWS catid,content_en.* FROM content_page_categories
INNER JOIN content_en ON (id=itemid)
WHERE catid =1 AND active = 1 LIMIT 1;
SELECT FOUND_ROWS();
Index definitions :
content_en 0 PRIMARY 1 id A 125288 BTREE
content_en 0 id 1 id A 125288 BTREE
content_en 0 id_2 1 id A 125288 BTREE
content_en 0 combo 1 id A BTREE
content_en 0 combo 2 active A YES BTREE
content_en 1 code 1 code A 42 YES BTREE
content_page_categories 1 itemid 1 itemid A 96842 BTREE
content_page_categories 1 catid 1 catid A 10 BTREE
content_page_categories 1 combo 1 catid A 10 BTREE
content_page_categories 1 combo 2 itemid A 96842 BTREE
Any ideas?
[EDIT]
I have uploaded sample data for these tables here.
Result of EXPLAIN:
mysql> explain SELECT count(*) FROM content_page_categories USE INDEX (combo) INNER JOIN content_en USE INDEX (combo) ON (id = itemid) WHERE catid = 1 AND active = 1;
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
| 1 | SIMPLE | content_en | index | combo | combo | 6 | NULL | 125288 | Using where; Using index |
| 1 | SIMPLE | content_page_categories | ref | combo | combo | 8 | const,mcms.content_en.id | 1 | Using where; Using index |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
2 rows in set (0.00 sec)
I downloaded your data and tried a few experiments. I'm running MySQL 5.6.12 on a CentOS virtual machine on a Macbook Pro. The times I observed can be used for comparison, but your system may have different performance.
Base case
First I tried without the USE INDEX clauses, because I avoid optimizer overrides where possible. In most cases, a simple query like this should use the correct index if it's available. Hard-coding the index choice in a query makes it harder to use a better index later.
I also use correlation names (table aliases) to make the query more clear.
mysql> EXPLAIN SELECT COUNT(*) FROM content_en AS e
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: eq_ref
possible_keys: PRIMARY,combo2,combo
key: PRIMARY
key_len: 4
ref: test.c.itemid
rows: 1
Extra: Using where
This executed in 0.36 seconds.
Covering index
I'd like to get "Using index" on the second table as well, so I need an index on (active, id) in that order. I had to USE INDEX in this case to persuade the optimizer not to use the primary key.
mysql> ALTER TABLE content_en ADD KEY combo2 (active, id);
mysql> explain SELECT COUNT(*) FROM content_en AS e USE INDEX (combo2)
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: ref
possible_keys: combo2
key: combo2
key_len: 6
ref: const,test.c.itemid
rows: 1
Extra: Using where; Using index
The rows reported by EXPLAIN is an important indicator of how much work it's going to take to execute the query. Notice the rows in the above EXPLAIN is only 71k, much smaller than the 125k rows you got when you scanned the content_en table first.
This executed in 0.44 seconds. This is unexpected, because usually a query using a covering index is an improvement.
Convert tables to InnoDB
I tried the same covering index solution as above, but with InnoDB as the storage engine.
mysql> ALTER TABLE content_en ENGINE=InnoDB;
mysql> ALTER TABLE content_page_categories ENGINE=InnoDB;
This had the same EXPLAIN report. It took 1 or 2 iterations to warm the buffer pool, but then the performance of the query tripled.
This executed in 0.16 seconds.
I also tried removing the USE INDEX, and the time increased slightly, to 0.17 seconds.
@Matthew's solution with STRAIGHT_JOIN
mysql> SELECT straight_join count(*)
FROM content_en
INNER JOIN content_page_categories use index (combo)
ON (id = itemid)
WHERE catid = 1 AND active = 1;
This executed in 0.20 - 0.22 seconds.
@bobwienholt's solution, denormalization
I tried the solution proposed by @bobwienholt, using denormalization to copy the active attribute to the content_page_categories table.
mysql> ALTER TABLE content_page_categories ADD COLUMN active TINYINT(1);
mysql> UPDATE content_en JOIN content_page_categories ON id = itemid
SET content_page_categories.active = content_en.active;
mysql> ALTER TABLE content_page_categories ADD KEY combo3 (catid,active);
mysql> SELECT COUNT(*) FROM content_page_categories WHERE catid = 1 and active = 1;
This executed in 0.037 - 0.044 seconds. So this is better, if you can maintain the redundant active column in sync with the value in the content_en table.
@Quassnoi's solution, summary table
I tried the solution proposed by @Quassnoi: maintain a table with precomputed counts per catid and active. The table should have very few rows, and looking up the counts you need is a primary key lookup requiring no JOINs.
mysql> CREATE TABLE page_active_category (
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY (active, catid)
) ENGINE=InnoDB;
mysql> INSERT INTO page_active_category
SELECT e.active, c.catid, COUNT(*)
FROM content_en AS e
JOIN content_page_categories AS c ON c.itemid = e.id
GROUP BY e.active, c.catid
mysql> SELECT cnt FROM page_active_category WHERE active = 1 AND catid = 1
This executed in 0.0007 - 0.0017 seconds. So this is the best solution by an order of magnitude, if you can maintain the table with aggregate counts.
You can see from this that denormalization in its different forms (including a summary table) is an extremely powerful tool for performance, though it has drawbacks: maintaining the redundant data can be inconvenient and makes your application more complex.
There are too many records to count.
If you want a faster solution, you'll have to store aggregate data.
MySQL does not support materialized views (or indexed views in SQL Server's terms) so you would need to create and maintain them yourself.
Create a table:
CREATE TABLE
page_active_category
(
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY
(active, catid)
) ENGINE=InnoDB;
then populate it:
INSERT
INTO page_active_category
SELECT active, catid, COUNT(*)
FROM content_en
JOIN content_page_categories
ON itemid = id
GROUP BY
active, catid
Now, each time you insert, delete or update a record in either content_en or content_page_categories, you should update the appropriate record in page_active_category.
This is doable with simple triggers on both content_en and content_page_categories.
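As a rough sketch of one such trigger (the INSERT path on content_page_categories; the trigger name is illustrative, and the DELETE/UPDATE counterparts plus triggers on content_en are needed as well):
DELIMITER //
CREATE TRIGGER content_page_categories_ai
AFTER INSERT ON content_page_categories
FOR EACH ROW
BEGIN
    -- Bump (or create) the counter row for this item's active flag and category.
    -- Assumes content_en.active is not NULL for linked items.
    INSERT INTO page_active_category (active, catid, cnt)
    SELECT e.active, NEW.catid, 1
      FROM content_en e
     WHERE e.id = NEW.itemid
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;
END//
DELIMITER ;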
This way, your original query may be rewritten as simply:
SELECT cnt
FROM page_active_category
WHERE active = 1
AND catid = 1
which is a single primary key lookup and hence instant.
The problem is the "active" column in content_en. Obviously, if you just needed to know how many content records were related to a particular category (active or not) all you would have to do is:
SELECT count(1)
FROM content_page_categories
WHERE catid = 1;
Having to join back to every content_en record just to read the "active" flag is really what is slowing this query down.
I recommend adding "active" to content_page_categories and making it a copy of the related value in content_en... you can keep this column up to date using triggers or in your code. Then you can change the combo index to be:
KEY combo (catid,active,itemid)
and rewrite your query to:
SELECT count(1)
FROM content_page_categories USE INDEX (combo)
WHERE catid = 1 AND active = 1;
Also, you may have much better luck using InnoDB tables instead of MyISAM. Just be sure to tune your InnoDB settings: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
For me, with your data as the setup, the join query was taking about 50 times longer than just selecting from content_page_categories.
I was able to get performance about 10x slower than just selecting from the categories table by doing the following with your data:
I used straight_join
SELECT straight_join count(*)
FROM content_en
INNER JOIN content_page_categories use index (combo)
ON (id = itemid)
WHERE catid = 1 AND active = 1 ;
and the following table structure (slightly modified):
CREATE TABLE `content_en` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(100) DEFAULT NULL,
`uid` int(11) DEFAULT NULL,
`date_added` int(11) DEFAULT NULL,
`date_modified` int(11) DEFAULT NULL,
`active` tinyint(1) DEFAULT NULL,
`comment_count` int(6) DEFAULT NULL,
`orderby` tinyint(4) DEFAULT NULL,
`settings` text,
`permalink` varchar(255) DEFAULT NULL,
`code` varchar(3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id` (`id`),
KEY `test_con_1` (`active`) USING HASH,
KEY `combo` (`id`,`active`) USING HASH
) ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8;
And:
CREATE TABLE `content_page_categories` (
`catid` int(11) unsigned NOT NULL,
`itemid` int(10) unsigned NOT NULL,
`main` tinyint(1) DEFAULT NULL,
KEY `itemid` (`itemid`),
KEY `catid` (`catid`),
KEY `test_cat_1` (`catid`) USING HASH,
KEY `test_cat_2` (`itemid`) USING HASH,
KEY `combo` (`itemid`,`catid`) USING HASH
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
To do better than this I think you will need a view, a flattened structure, or another type of lookup field (such as the trigger-populated column in the other table discussed by another poster).
EDIT:
I should also point to this decent post on why/when to be careful with Straight_Join:
When to use STRAIGHT_JOIN with MySQL
If you use it, use it responsibly!
To speed up counting on MySQL joins, use subqueries. For example, getting cities with a placeCount, given a city table (id, title, ...) and a place table (id, city_id, title, ...):
SELECT city.title,subq.count as placeCount
FROM city
left join (
select city_id,count(*) as count from place
group by city_id
) subq
on city.id=subq.city_id

Simple SELECT mysql query very slow (using intersect)

A query that used to work just fine on a production server has started becoming extremely slow (in a matter of hours).
This is it:
SELECT * FROM news_articles WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This takes up to 20-30 seconds to execute (the table has ~200,000 rows)
This is the output of EXPLAIN:
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| 1 | SIMPLE | news_articles | index_merge | news_category_id,published | news_category_id,published | 5,5 | NULL | 8409 | Using intersect(news_category_id,published); Using where; Using filesort |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
Playing around with it, I found that hinting a specific index (date_edited) makes it much faster:
SELECT * FROM news_articles USE INDEX (date_edited) WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This one takes milliseconds to execute.
EXPLAIN output for this one is:
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| 1 | SIMPLE | news_articles | index | NULL | date_edited | 8 | NULL | 1 | Using where |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
Columns news_category_id, published and date_edited are all indexed.
The storage engine is InnoDB.
This is the table structure:
CREATE TABLE `news_articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` text NOT NULL,
`subtitle` text NOT NULL,
`summary` text NOT NULL,
`keywords` varchar(500) DEFAULT NULL,
`body` mediumtext NOT NULL,
`source` varchar(255) DEFAULT NULL,
`source_visible` int(11) DEFAULT NULL,
`author_information` enum('none','name','signature') NOT NULL DEFAULT 'name',
`date_added` datetime NOT NULL,
`date_edited` datetime NOT NULL,
`views` int(11) DEFAULT '0',
`news_category_id` int(11) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
`c_forwarded` int(11) DEFAULT '0',
`published` int(11) DEFAULT '0',
`deleted` int(11) DEFAULT '0',
`permalink` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `news_category_id` (`news_category_id`),
KEY `published` (`published`),
KEY `deleted` (`deleted`),
KEY `date_edited` (`date_edited`),
CONSTRAINT `news_articles_ibfk_3` FOREIGN KEY (`news_category_id`) REFERENCES `news_categories` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `news_articles_ibfk_4` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=192588 DEFAULT CHARSET=utf8
I could possibly change all the queries my web application does to hint at using that index, but this is considerable work.
Is there some way to tune MySQL so that the first query is made more efficient without actually rewriting all queries?
Just a few tips:
1 - It seems to me the fields published and news_category_id are INTEGER. If so, please remove the single quotes from your query. It can make a huge difference when it comes to performance;
2 - Also, I'd say that your field published does not have many different values (it is probably 1 - yes and 0 - no, or something like that). If I'm right, this is not a good field to index at all; the engine in this case still has to go through all the records to find what it is looking for. In this case, move news_category_id to be the first field in your WHERE clause.
3 - "Don't forget about the leftmost index". This holds for your SELECT, JOINs, WHERE and ORDER BY. Even the position of the columns in the table is important; keep the indexed ones at the top. Indexes are your friend as long as you know how to play with them.
Hope this helps somehow.
Original:
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
ORDER BY date_edited DESC LIMIT 1;
Since you have LIMIT 1, you're only selecting the latest row. ORDER BY date_edited tells MySQL to sort then take 1 row off the top. This is really slow, and why USE INDEX would help.
Try to match MAX(date_edited) in the WHERE clause instead. That should get the query planner to use its index automatically.
Choose MAX(date_edited):
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
AND date_edited = (select max(date_edited) from news_articles);
Please change your query to :
SELECT * FROM news_articles WHERE published = 1 AND news_category_id = 4 ORDER BY date_edited DESC LIMIT 1;
Please note that I have removed the quotes from the '1' and '4' values provided in the query.
The mismatch between the datatype passed and the column's datatype prevents MySQL from using the index on these 2 columns.

MySQL partial indexes on varchar fields and group by optimization

I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of sessions coming from each. I don't have a huge table, but it is big enough that full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but the query isn't using it. Even if I try USE INDEX or FORCE INDEX it doesn't help; the EXPLAIN is the same, as is the performance.
If I add a full index on referrer_host, instead of a 10 character partial index, everything works much better, if not instantly (350ms vs. 10 seconds).
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.
With the full index, the query will scan the entire index and return the number of records pointed to for each unique key; the table isn't touched.
With the partial index, the engine doesn't know the value of referrer_host until it looks at the record. It has to scan the whole table!
If most of the values for referrer_host are less than 10 chars then, in theory, the optimiser could use the index and then only check rows that have more than 10 chars. But because this is not a clustered index it would have to make many non-sequential disk reads to find those records. It could end up being even slower, because a table scan will at least be a sequential read. Instead of making assumptions, the optimiser just does a scan.
Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Now the count will be wrong for the group where referrer_host is NULL, but I'm uncertain if there's another way around this.
You're grouping on referrer_host for all the rows in the table. As your index doesn't include referrer_host (it contains the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, substring(referrer_host,1,10) AS referrer_host FROM `sessions` GROUP BY referrer_host;
If you need the full referrer, index it.
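In DDL terms that just means replacing the 10-character prefix index with one over the whole column; a utf8 VARCHAR(255) still fits within InnoDB's 767-byte key length limit:
ALTER TABLE sessions DROP INDEX index_sessions_on_referrer_host;
ALTER TABLE sessions ADD INDEX index_sessions_on_referrer_host (referrer_host);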