MySQL slow count in JOIN query

So I have two tables that I need to be able to get counts for. One of them holds the content and the other one holds the relationship between it and the categories table. Here is the DDL:
CREATE TABLE content_en (
id int(11) NOT NULL AUTO_INCREMENT,
title varchar(100) DEFAULT NULL,
uid int(11) DEFAULT NULL,
date_added int(11) DEFAULT NULL,
date_modified int(11) DEFAULT NULL,
active tinyint(1) DEFAULT NULL,
comment_count int(6) DEFAULT NULL,
orderby tinyint(4) DEFAULT NULL,
settings text,
permalink varchar(255) DEFAULT NULL,
code varchar(3) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY id (id),
UNIQUE KEY id_2 (id) USING BTREE,
UNIQUE KEY combo (id,active) USING HASH,
KEY code (code) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8;
and for the other table
CREATE TABLE content_page_categories (
catid int(11) unsigned NOT NULL,
itemid int(10) unsigned NOT NULL,
main tinyint(1) DEFAULT NULL,
KEY itemid (itemid),
KEY catid (catid),
KEY combo (catid,itemid) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The query I'm running is:
SELECT count(*)
FROM content_page_categories USE INDEX (combo)
INNER JOIN content_en USE INDEX (combo) ON (id = itemid)
WHERE catid = 1 AND active = 1 ;
Both tables have 125k rows and I can't get the count query to run fast enough. The best timing I get is 0.175 s, which is horrible for this amount of rows. Selecting 100 rows is as fast as 0.01 s. I have tried 3 or 4 variations of this query, but in the end the timings are just about the same. Also, if I don't use USE INDEX, the timing is 3x slower.
I also tried the following:
SELECT COUNT( *) FROM content_page_categories
INNER JOIN content_en ON id=itemid
AND catid = 1 AND active = 1 WHERE 1
and :
SELECT SQL_CALC_FOUND_ROWS catid,content_en.* FROM content_page_categories
INNER JOIN content_en ON (id=itemid)
WHERE catid =1 AND active = 1 LIMIT 1;
SELECT FOUND_ROWS();
Index definitions (from SHOW INDEX):
content_en 0 PRIMARY 1 id A 125288 BTREE
content_en 0 id 1 id A 125288 BTREE
content_en 0 id_2 1 id A 125288 BTREE
content_en 0 combo 1 id A BTREE
content_en 0 combo 2 active A YES BTREE
content_en 1 code 1 code A 42 YES BTREE
content_page_categories 1 itemid 1 itemid A 96842 BTREE
content_page_categories 1 catid 1 catid A 10 BTREE
content_page_categories 1 combo 1 catid A 10 BTREE
content_page_categories 1 combo 2 itemid A 96842 BTREE
Any ideas?
[EDIT]
I have uploaded sample data for these tables here.
Result of EXPLAIN:
mysql> explain SELECT count(*) FROM content_page_categories USE INDEX (combo) INNER JOIN content_en USE INDEX (combo) ON (id = itemid) WHERE catid = 1 AND active = 1 ;
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
| 1 | SIMPLE | content_en | index | combo | combo | 6 | NULL | 125288 | Using where; Using index |
| 1 | SIMPLE | content_page_categories | ref | combo | combo | 8 | const,mcms.content_en.id | 1 | Using where; Using index |
+----+-------------+-------------------------+-------+---------------+-------+---------+--------------------------+--------+--------------------------+
2 rows in set (0.00 sec)

I downloaded your data and tried a few experiments. I'm running MySQL 5.6.12 on a CentOS virtual machine on a Macbook Pro. The times I observed can be used for comparison, but your system may have different performance.
Base case
First I tried without the USE INDEX clauses, because I avoid optimizer overrides where possible. In most cases, a simple query like this should use the correct index if it's available. Hard-coding the index choice in a query makes it harder to use a better index later.
I also use correlation names (table aliases) to make the query more clear.
mysql> EXPLAIN SELECT COUNT(*) FROM content_en AS e
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: eq_ref
possible_keys: PRIMARY,combo2,combo
key: PRIMARY
key_len: 4
ref: test.c.itemid
rows: 1
Extra: Using where
This executed in 0.36 seconds.
Covering index
I'd like to get "Using index" on the second table as well, so I need an index on (active, id) in that order. I had to USE INDEX in this case to persuade the optimizer not to use the primary key.
mysql> ALTER TABLE content_en ADD KEY combo2 (active, id);
mysql> explain SELECT COUNT(*) FROM content_en AS e USE INDEX (combo2)
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: ref
possible_keys: combo2
key: combo2
key_len: 6
ref: const,test.c.itemid
rows: 1
Extra: Using where; Using index
The rows value reported by EXPLAIN is an important indicator of how much work it will take to execute the query. Notice that rows in the above EXPLAIN is only 71k, much smaller than the 125k rows you got when you scanned the content_en table first.
This executed in 0.44 seconds. This is unexpected, because usually a query using a covering index is an improvement.
Convert tables to InnoDB
I tried the same covering index solution as above, but with InnoDB as the storage engine.
mysql> ALTER TABLE content_en ENGINE=InnoDB;
mysql> ALTER TABLE content_page_categories ENGINE=InnoDB;
This had the same EXPLAIN report. It took 1 or 2 iterations to warm the buffer pool, but then the performance of the query tripled.
This executed in 0.16 seconds.
I also tried removing the USE INDEX, and the time increased slightly, to 0.17 seconds.
@Matthew's solution with STRAIGHT_JOIN
mysql> SELECT straight_join count(*)
FROM content_en
INNER JOIN content_page_categories use index (combo)
ON (id = itemid)
WHERE catid = 1 AND active = 1;
This executed in 0.20 - 0.22 seconds.
@bobwienholt's solution, denormalization
I tried the solution proposed by @bobwienholt, using denormalization to copy the active attribute into the content_page_categories table.
mysql> ALTER TABLE content_page_categories ADD COLUMN active TINYINT(1);
mysql> UPDATE content_en JOIN content_page_categories ON id = itemid
SET content_page_categories.active = content_en.active;
mysql> ALTER TABLE content_page_categories ADD KEY combo3 (catid,active);
mysql> SELECT COUNT(*) FROM content_page_categories WHERE catid = 1 and active = 1;
This executed in 0.037 - 0.044 seconds. So this is better, if you can maintain the redundant active column in sync with the value in the content_en table.
@Quassnoi's solution, summary table
I tried the solution proposed by @Quassnoi: maintain a table with precomputed counts per catid and active. The table should have very few rows, and looking up the counts you need is a primary-key lookup that requires no JOINs.
mysql> CREATE TABLE page_active_category (
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY (active, catid)
) ENGINE=InnoDB;
mysql> INSERT INTO page_active_category
SELECT e.active, c.catid, COUNT(*)
FROM content_en AS e
JOIN content_page_categories AS c ON c.itemid = e.id
GROUP BY e.active, c.catid
mysql> SELECT cnt FROM page_active_category WHERE active = 1 AND catid = 1
This executed in 0.0007 - 0.0017 seconds. So this is the best solution by an order of magnitude, if you can maintain the table with aggregate counts.
You can see from this that different types of denormalization (including a summary table) are an extremely powerful tool for performance, though they have drawbacks: maintaining the redundant data can be inconvenient and makes your application more complex.

There are too many records to count.
If you want a faster solution, you'll have to store aggregate data.
MySQL does not support materialized views (or indexed views in SQL Server's terms) so you would need to create and maintain them yourself.
Create a table:
CREATE TABLE
page_active_category
(
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY
(active, catid)
) ENGINE=InnoDB;
then populate it:
INSERT
INTO page_active_category
SELECT active, catid, COUNT(*)
FROM content_en
JOIN content_page_categories
ON itemid = id
GROUP BY
active, catid
Now, each time you insert, delete or update a record in either content_en or content_page_categories, you should update the appropriate record in page_active_category.
This is doable with two simple triggers on both content_en and content_page_categories.
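For illustration, here is a minimal sketch of the insert/delete pair on content_page_categories (trigger names are mine, it assumes active is never NULL, and a similar AFTER UPDATE trigger on content_en is needed for when a row's active flag flips):
DELIMITER $$
CREATE TRIGGER cpc_after_insert AFTER INSERT ON content_page_categories
FOR EACH ROW
BEGIN
    -- bump the counter for the (active, catid) combination of the new link row,
    -- creating the row if this combination has not been seen yet
    INSERT INTO page_active_category (active, catid, cnt)
    SELECT e.active, NEW.catid, 1
    FROM content_en AS e
    WHERE e.id = NEW.itemid
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;
END$$
CREATE TRIGGER cpc_after_delete AFTER DELETE ON content_page_categories
FOR EACH ROW
BEGIN
    -- decrement the counter for the combination the removed link row belonged to
    UPDATE page_active_category AS pac
    JOIN content_en AS e ON e.id = OLD.itemid
    SET pac.cnt = pac.cnt - 1
    WHERE pac.catid = OLD.catid
      AND pac.active = e.active;
END$$
DELIMITER ;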
This way, your original query may be rewritten as mere:
SELECT cnt
FROM page_active_category
WHERE active = 1
AND catid = 1
which is a single primary key lookup and hence instant.

The problem is the "active" column in content_en. Obviously, if you just needed to know how many content records were related to a particular category (active or not) all you would have to do is:
SELECT count(1)
FROM content_page_categories
WHERE catid = 1;
Having to join back to every content_en record just to read the "active" flag is really what is slowing this query down.
I recommend adding "active" to content_page_categories and making it a copy of the related value in content_en... you can keep this column up to date using triggers or in your code. Then you can change the combo index to be:
KEY combo (catid,active,itemid)
and rewrite your query to:
SELECT count(1)
FROM content_page_categories USE INDEX (combo)
WHERE catid = 1 AND active = 1;
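For reference, a minimal sketch of keeping the copied column in sync with a trigger (the trigger name is mine):
DELIMITER $$
CREATE TRIGGER content_en_after_update
AFTER UPDATE ON content_en
FOR EACH ROW
BEGIN
    -- push the (possibly changed) active flag down to the relationship rows
    UPDATE content_page_categories
    SET active = NEW.active
    WHERE itemid = NEW.id;
END$$
DELIMITER ;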
Also, you may have much better luck using InnoDB tables instead of MyISAM. Just be sure to tune your InnoDB settings: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/

For me, with your data as set up, the join query was taking ~50x longer than just selecting from content_page_categories.
I was able to achieve performance about 10x slower than just selecting from the categories table by doing the following with your data:
I used STRAIGHT_JOIN:
SELECT straight_join count(*)
FROM content_en
INNER JOIN content_page_categories use index (combo)
ON (id = itemid)
WHERE catid = 1 AND active = 1 ;
and the following table structure (slightly modified):
CREATE TABLE `content_en` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(100) DEFAULT NULL,
`uid` int(11) DEFAULT NULL,
`date_added` int(11) DEFAULT NULL,
`date_modified` int(11) DEFAULT NULL,
`active` tinyint(1) DEFAULT NULL,
`comment_count` int(6) DEFAULT NULL,
`orderby` tinyint(4) DEFAULT NULL,
`settings` text,
`permalink` varchar(255) DEFAULT NULL,
`code` varchar(3) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id` (`id`),
KEY `test_con_1` (`active`) USING HASH,
KEY `combo` (`id`,`active`) USING HASH
) ENGINE=MyISAM AUTO_INCREMENT=127126 DEFAULT CHARSET=utf8;
And:
CREATE TABLE `content_page_categories` (
`catid` int(11) unsigned NOT NULL,
`itemid` int(10) unsigned NOT NULL,
`main` tinyint(1) DEFAULT NULL,
KEY `itemid` (`itemid`),
KEY `catid` (`catid`),
KEY `test_cat_1` (`catid`) USING HASH,
KEY `test_cat_2` (`itemid`) USING HASH,
KEY `combo` (`itemid`,`catid`) USING HASH
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
To achieve better than this I think you will need a view, a flattened structure, or another type of look up field (as in the trigger to populate a row in the other table as discussed by another poster).
EDIT:
I should also point to this decent post on why/when to be careful with Straight_Join:
When to use STRAIGHT_JOIN with MySQL
If you use it, use it responsibly!

To speed up counting on MySQL joins, use subqueries.
For example, getting cities with a placeCount:
city table
id
title
......
place table
id
city_id
title
.....
SELECT city.title,subq.count as placeCount
FROM city
left join (
select city_id,count(*) as count from place
group by city_id
) subq
on city.id=subq.city_id
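Applied to the tables from the original question, the same pattern might look like this (a sketch; the alias names are mine):
SELECT e.id, e.title, COALESCE(subq.cnt, 0) AS categoryCount
FROM content_en e
LEFT JOIN (
    SELECT itemid, COUNT(*) AS cnt
    FROM content_page_categories
    GROUP BY itemid
) subq ON e.id = subq.itemid
WHERE e.active = 1;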

Related

Selecting distinct values from a join of two large tables

I have an animals table with about 3 million records. The table has, among a few other columns, an id, name, and owner_id column. I have an animal_breeds table with about 2.5 million records. The table only has an animal_id and breed column.
I'm trying to find the distinct breed values that are associated with a specific owner_id, but the query is taking 20 seconds or so. Here's the query:
SELECT DISTINCT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`owner_id` = ? ;
The tables have all appropriate indices. I can't denormalize the table by adding a breed column to the animals table because it is possible for animals to be assigned multiple breeds. I also have this problem with a few other large tables that have one-to-many relationships.
Is there a more performant way to achieve what I'm looking for? It seems like a pretty simple problem but I can't seem to figure out the best way to achieve this other than pre-calculating and caching the results.
Here is the EXPLAIN output from my query. Notice the "Using temporary":
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 "SIMPLE" "a" NULL "ref" "PRIMARY,animals_animal_id_index" "animals_animal_id_index" "153" "const" 1126303 100.00 "Using index; Using temporary"
1 "SIMPLE" "ab" NULL "ref" "animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index,animal_breeds_breed_index" "animal_breeds_animal_id_breed_unique" "5" "pedigreeonline.a.id" 1 100.00 "Using index"
And as requested, here are the create table statements (I left off a few unrelated columns and indices from the animals table). I believe the animal_breeds_animal_id_index index on animal_breeds table is redundant because of the unique key on the table, but we can ignore that for now as long as it's not causing the problem :)
CREATE TABLE `animals` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`owner_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_animal_id_index` (`owner_id`,`id`),
KEY `animals_name_index` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=2470843 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int(10) unsigned DEFAULT NULL,
`breed` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_animal_id_index` (`animal_id`),
KEY `animal_breeds_breed_index` (`breed`),
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Any help would be appreciated. Thanks!
With knowledge about your data you can try something like this:
SELECT
b.*
FROM
(
SELECT
DISTINCT `breed`
FROM
`animal_breeds`
) AS b
WHERE
EXISTS (
SELECT
*
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
b.breed = ab.breed
AND a.owner_id = ?
)
;
The idea is to get a short list of distinct breeds without any filtering (for a small list it would be quite fast) and then filter the list further with a correlated subquery. As the list is short, only a few subqueries are executed, and they only check for existence, which is much faster than any grouping (DISTINCT == grouping).
This will only work if your distinct list is quite short.
With randomly generated data based on your answers, the above query gave me the following execution plan:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 2 100.00
3 SUBQUERY a ref PRIMARY,animals_animal_id_index animals_animal_id_index 153 const 1011 100.00 Using index
3 SUBQUERY ab ref animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index animal_breeds_animal_id_index 5 test.a.id 2 100.00 Using index
2 DERIVED animal_breeds range animal_breeds_animal_id_breed_unique,animal_breeds_breed_index animal_breeds_breed_index 1022 2 100.00 Using index for group-by
Alternatively, you can try to create WHERE clause like this:
...
WHERE
b.breed IN (
SELECT
ab.breed
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
a.owner_id = ?
)
For this query:
SELECT DISTINCT ab.`breed`
FROM `animal_breeds` ab INNER JOIN
`animals` a
ON a.`id` = ab.`animal_id`
WHERE a.`owner_id` = ? ;
You want indexes on animals(owner_id, id) and animal_breeds(animal_id, breed). The order of the columns in the composite index is important.
With the right index, I imagine that this will be very fast.
EDIT:
According to the explain, there are 1,126,303 matches for the values you are using. The time is due to removing duplicates. Given the sizes of the tables, it is surprising that there would be so many matching one value.

mysql query optimization: select with counted subquery extremely slow

I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second refers to the "id" column of the former (there isn't a foreign key between the two because the relationship is polymorphic, i.e. the second table may have references to entities that aren't in the first but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but this was actually slower than the one I have above. But this is still taking a substantial length of time to complete. MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?
I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
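In CREATE INDEX form those would be something like the following (index names are mine; note the existing item_id key on station_cache already has item_id as its leading column, so the first statement may be unnecessary):
CREATE INDEX sc_item_id ON station_cache (item_id);
CREATE INDEX rpi_published_id ON rsspodcastitems (published, id);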
Your query could be faster if it returns a significant number of rows. Your phrasing of the query allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the order by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().
You might want to detect and remove redundant indexes from your tables. Reviewing your CREATE TABLE information for both tables will help you discover several, including (podcast_id, guid), (Station_id, published), (item_id, item_type), and (podcast_id, published); there may be more.
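For instance, based on the CREATE TABLE output above, the exact duplicates could be dropped like this (a sketch; check first that nothing references these index names, e.g. in index hints):
ALTER TABLE rsspodcastitems
    DROP INDEX `UKfb6nlyxvxf3i2ibwd8jx6k025`,
    DROP INDEX `IDXfb6nlyxvxf3i2ibwd8jx6k025`;
ALTER TABLE station_cache
    DROP INDEX `IDX12n81jv8irarbtp8h2hl6k4q3`,
    DROP INDEX `IDXqw9yqpavo9fcduereqqij4c80`,
    DROP INDEX `IDXkp2ehbpmu41u1vhwt7qdl2fuf`;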
My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.

How can I optimize a query which depends on both COUNT and GROUP BY?

I have a query whose purpose is to generate statistics for how many musical works (tracks) have been downloaded from a site in different periods (by month, by quarter, by year, etc). The query operates on the tables entityusage, entityusage_file, and track.
To get the number of downloads for tracks belonging to a specific album I would do the following query:
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferably finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
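In CREATE INDEX form (the index names are mine):
CREATE INDEX track_album_id ON track (album_id, id);
CREATE INDEX euf_track_usage ON entityusage_file (track_id, entityusage_id);
CREATE INDEX eu_id_type_action ON entityusage (id, entitytype, action);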
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
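Applied to the query above, that simplification reads:
select date(eu.updated) as p, count(eu.id) as c
from entityusage eu join
     entityusage_file euf
     on euf.entityusage_id = eu.id join
     track t
     on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
      eu.entitytype = 't' and
      eu.action = 1
group by date(eu.updated);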
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table, chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT euf.track_id
, DATE(eu.updated) AS updated_dt
, SUM(IF(eu.action = 1,1,0)) AS cnt
FROM entityusage eu
JOIN entityusage_file euf
ON euf.entityusage_id = eu.id
GROUP
BY euf.track_id
, DATE(eu.updated)
A covering index ON entityusage (id, updated, action) may benefit the performance of the summary build.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
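For illustration, a report query against the precomputed table might look like this (a sketch; it assumes the summary table above, and the entitytype filter from the original query would have to be baked into the summary build):
SELECT utd.updated_dt AS p, SUM(utd.cnt) AS c
FROM usage_track_by_day utd
JOIN track t ON t.id = utd.track_id
WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
GROUP BY utd.updated_dt;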
Can you try this? I can't really test it without some sample data from you.
In this case the query looks first in the track table and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND entitytype = 't'
AND ACTION = 1
GROUP BY date_format(eu.updated, '%Y%m%d');

MYSQL Left join extremely slow on indexed columns

Below are the table structures for the 4 tables:
Calendar:
CREATE TABLE `calender` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`HospitalID` int(11) NOT NULL,
`ColorCode` int(11) DEFAULT NULL,
`RecurrID` int(11) NOT NULL,
`IsActive` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`ID`),
UNIQUE KEY `ID_UNIQUE` (`ID`),
KEY `idxHospital` (`ID`,`StaffID`,`HospitalID`,`ColorCode`,`RecurrID`,`IsActive`)
) ENGINE=InnoDB AUTO_INCREMENT=4638 DEFAULT CHARSET=latin1;
CalendarAttendee:
CREATE TABLE `calenderattendee` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`CalenderID` int(11) NOT NULL,
`StaffID` int(11) NOT NULL,
`IsActive` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`ID`),
KEY `idxCalStaffID` (`StaffID`,`CalenderID`)
) ENGINE=InnoDB AUTO_INCREMENT=20436 DEFAULT CHARSET=latin1;
CallPlanStaff:
CREATE TABLE `callplanstaff` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Staffname` varchar(45) NOT NULL,
`IsActive` tinyint(4) NOT NULL DEFAULT '1',
PRIMARY KEY (`ID`),
UNIQUE KEY `ID_UNIQUE` (`ID`),
KEY `idx_IsActive` (`Staffname`,`IsActive`),
KEY `idx_staffName` (`Staffname`,`ID`) USING BTREE KEY_BLOCK_SIZE=100
) ENGINE=InnoDB AUTO_INCREMENT=13 DEFAULT CHARSET=latin1;
Users:
CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL DEFAULT '',
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index_users_on_email` (`email`),
UNIQUE KEY `index_users_on_name` (`name`),
KEY `idx_email` (`email`) USING BTREE KEY_BLOCK_SIZE=100
) ENGINE=InnoDB AUTO_INCREMENT=33 DEFAULT CHARSET=utf8;
What I'm trying to do is to fetch the calender.ID and Users.name using below query:
SELECT a.ID, h.name
FROM `stjude`.`calender` a
left join calenderattendee e on a.ID = e.calenderID
left join callplanstaff f on e.StaffID = f.ID
left join users h on f.Staffname = h.email
The relation between those tables is shown in the (omitted) diagram.
It took about 4 seconds to fetch 13000 records, which I bet could be faster.
When I look at the tabular EXPLAIN of the query (screenshot omitted), I wonder:
Why isn't MySQL using the indexes on the callplanstaff and users tables?
Also, in my case, should I use multiple indexes instead of a multi-column index?
And are there any indexes I'm missing that make my query slow?
=======================================================================
Updated:
As zedfoxus and spencer7593 recommended, I changed the column ordering of idxCalStaffID and idx_staffname; the new execution plan is in the (omitted) screenshot.
It took 0.063 seconds to fetch, far less time. How does the ordering of the index affect the fetch time?
You're misinterpreting the EXPLAIN report.
type: index is not such a good thing. It means it's doing an "index-scan" which examines every element of an index. It's almost as bad as a table-scan. Notice the column rows: 4562 and rows: 13451. This is the estimated number of index elements it will examine for each of those tables.
Having two tables doing an index-scan is even worse. The total number of rows examined for this join is 4562 x 13451 = 61,363,462.
Using join buffer is not a good thing. It's a thing the optimizer does as a consolation when it can't use an index for the join.
type: eq_ref is a good thing. It means it's using a PRIMARY KEY or UNIQUE KEY index to look up exactly one row. Notice the column rows: 1. So at least for each of the rows from the previous join, it only does one index lookup.
You should create an index on calenderattendee for columns (CalenderID, StaffID) in that order (@spencer7593 posted this suggestion while I was writing my post).
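For example (the key name here matches the one that shows up in the EXPLAIN below):
ALTER TABLE calenderattendee ADD KEY `CalenderID` (`CalenderID`, `StaffID`);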
By using LEFT [OUTER] JOIN in this query, you're preventing MySQL from optimizing the order of table joins. And since your query fetches h.name, I infer that you really just want results where the calendar event has an attendee and the attendee has a corresponding user record. It makes no sense that you're not using an INNER JOIN.
Here's the EXPLAIN with the new index and the joins changed to INNER JOIN (though my row counts are meaningless because I didn't create test data):
+----+-------------+-------+------------+--------+--------------------------------+----------------------+---------+----------------+------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+--------------------------------+----------------------+---------+----------------+------+----------+-----------------------+
| 1 | SIMPLE | a | NULL | index | PRIMARY,ID_UNIQUE,idxHospital | ID_UNIQUE | 4 | NULL | 1 | 100.00 | Using index |
| 1 | SIMPLE | e | NULL | ref | idxCalStaffID,CalenderID | CalenderID | 4 | test.a.ID | 1 | 100.00 | Using index |
| 1 | SIMPLE | f | NULL | eq_ref | PRIMARY,ID_UNIQUE | PRIMARY | 4 | test.e.StaffID | 1 | 100.00 | NULL |
| 1 | SIMPLE | h | NULL | eq_ref | index_users_on_email,idx_email | index_users_on_email | 767 | func | 1 | 100.00 | Using index condition |
+----+-------------+-------+------------+--------+--------------------------------+----------------------+---------+----------------+------+----------+-----------------------+
The type: index for the calenderattendee table has been changed to type: ref which means an index lookup against a non-unique index. And the note about Using join buffer is gone.
That should run better.
How does the ordering of the index affect the fetch time?
Think of a telephone book, which is ordered by last name first, then by first name. This helps you look up people by last name very quickly. But it does not help you look up people by first name.
The position of columns in an index matters!
You might like my presentation How to Design Indexes, Really.
Slides: http://www.slideshare.net/billkarwin/how-to-design-indexes-really
Video of me presenting this talk: https://www.youtube.com/watch?v=ELR7-RdU9XU
Q: Are there any indexes I'm missing that make my query slow?
A: Yes. A suitable index on calenderattendee is missing.
We probably want an index on calenderattendee with calenderid as the leading column, for example:
... ON calenderattendee (calenderid, staffid)
This seems like a situation where inner join might be a better option than a left join.
SELECT a.ID, h.name
FROM `stjude`.`calender` a
INNER JOIN calenderattendee e on a.ID = e.calenderID
INNER JOIN callplanstaff f on e.StaffID = f.ID
INNER JOIN users h on f.Staffname = h.email
Then let's get onto the indexes. The Calendar table has
PRIMARY KEY (`ID`),
UNIQUE KEY `ID_UNIQUE` (`ID`),
The second one, ID_UNIQUE, is redundant: a primary key is a unique index. Having too many indexes slows down insert/update/delete operations.
Then the users table has
UNIQUE KEY `index_users_on_email` (`email`),
UNIQUE KEY `index_users_on_name` (`name`),
KEY `idx_email` (`email`) USING BTREE KEY_BLOCK_SIZE=100
The idx_email index is redundant here. Other than that, there isn't much to do by way of tweaking the indexes. Your explain shows that an index is being used on each and every table.
Why isn't MySQL using the indexes on the callplanstaff and users tables?
Your explain shows that it does: it's using the primary key and the index_users_on_email indexes on those tables.
Also, in my case, should I use multiple indexes instead of a multi-column index?
As a rule of thumb, MySQL uses only one index per table in a given query, so a multi-column index is the way to go rather than multiple single-column indexes.
And are there any indexes I'm missing that make my query slow?
As I mentioned in the comments, you are fetching (and probably displaying) 13,000 records. That's where your bottleneck may be.

MySQL : Avoid Temporary/Filesort Caused by GROUP BY Clause

I've got a fairly simple query that seeks to display the number of email addresses that are subscribed along with the number unsubscribed, grouped by client.
The query:
SELECT
client_id,
COUNT(CASE WHEN subscribed = 1 THEN subscribed END) AS subs,
COUNT(CASE WHEN subscribed = 0 THEN subscribed END) AS unsubs
FROM
contacts_emailAddresses
LEFT JOIN contacts ON contacts.id = contacts_emailAddresses.contact_id
GROUP BY
client_id
Schema of relevant tables follows. contacts_emailAddresses is a junction table between contacts (which has the client_id) and emailAddresses (which is not actually used in this query).
CREATE TABLE `contacts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`firstname` varchar(255) NOT NULL DEFAULT '',
`middlename` varchar(255) NOT NULL DEFAULT '',
`lastname` varchar(255) NOT NULL DEFAULT '',
`gender` varchar(5) DEFAULT NULL,
`client_id` mediumint(10) unsigned DEFAULT NULL,
`datasource` varchar(10) DEFAULT NULL,
`external_id` int(10) unsigned DEFAULT NULL,
`created` timestamp NULL DEFAULT NULL,
`trash` tinyint(1) NOT NULL DEFAULT '0',
`updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `client_id` (`client_id`),
KEY `external_id combo` (`client_id`,`datasource`,`external_id`),
KEY `trash` (`trash`),
KEY `lastname` (`lastname`),
KEY `firstname` (`firstname`),
CONSTRAINT `contacts_ibfk_1` FOREIGN KEY (`client_id`) REFERENCES `clients` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14742974 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
CREATE TABLE `contacts_emailAddresses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`contact_id` int(10) unsigned NOT NULL,
`emailAddress_id` int(11) unsigned DEFAULT NULL,
`primary` tinyint(1) unsigned NOT NULL DEFAULT '0',
`subscribed` tinyint(1) unsigned NOT NULL DEFAULT '1',
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `contact_id` (`contact_id`),
KEY `subscribed` (`subscribed`),
KEY `combo` (`contact_id`,`emailAddress_id`) USING BTREE,
KEY `emailAddress_id` (`emailAddress_id`) USING BTREE,
CONSTRAINT `contacts_emailAddresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`id`),
CONSTRAINT `contacts_emailAddresses_ibfk_2` FOREIGN KEY (`emailAddress_id`) REFERENCES `emailAddresses` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24700918 DEFAULT CHARSET=utf8
Here's the EXPLAIN:
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | contacts_emailAddresses | ALL | NULL | NULL | NULL | NULL | 10176639 | Using temporary; Using filesort |
| 1 | SIMPLE | contacts | eq_ref | PRIMARY | PRIMARY | 4 | icarus.contacts_emailAddresses.contact_id | 1 | |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
2 rows in set (0.08 sec)
The problem here clearly is the GROUP BY clause, as I can remove the JOIN (and the items that depend on it) and the performance still is terrible (40+ seconds). There are 10m records in contacts_emailAddresses, 12m-some records in contacts, and 10–15 client records for the grouping.
From the doc:
Temporary tables can be created under conditions such as these:
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
DISTINCT combined with ORDER BY may require a temporary table.
If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
I'm obviously not combining the GROUP BY with an ORDER BY, and I have tried multiple things to ensure that the GROUP BY is on a column that should be properly placed in the join queue (including rewriting the query to put contacts in the FROM and instead join to contacts_emailAddresses), all to no avail.
Any suggestions for performance tuning would be much appreciated!
I think the only real shot you have of getting away from a "Using temporary; Using filesort" operation (given the current schema, the current query, and the specified resultset) would be to use correlated subqueries in the SELECT list.
SELECT c.client_id
, (SELECT IFNULL(SUM(es.subscribed=1),0)
FROM contacts_emailAddresses es
JOIN contacts cs
ON cs.id = es.contact_id
WHERE cs.client_id = c.client_id
) AS subs
, (SELECT IFNULL(SUM(eu.subscribed=0),0)
FROM contacts_emailAddresses eu
JOIN contacts cu
ON cu.id = eu.contact_id
WHERE cu.client_id = c.client_id
) AS unsubs
FROM contacts c
GROUP BY c.client_id
This may run quicker than the original query, or it may not. Those correlated subqueries are going to get run for each row returned by the outer query. If that outer query is returning a boatload of rows, that's a whole boatload of subquery executions.
Here's the output from an EXPLAIN:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo contact_id 4 cu.id Using where
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo contact_id 4 cs.id Using where
For optimum performance of this query, we'd really like to see "Using index" in the Extra column of the explain for the eu and es tables. But to get that, we'd need a suitable index, one with a leading column of contact_id and including the subscribed column. For example:
CREATE INDEX cemail_IX2 ON contacts_emailAddresses (contact_id, subscribed);
With the new index available, EXPLAIN output shows MySQL will use the new index:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo,cemail_IX2 cemail_IX2 4 cu.id Using where; Using index
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo,cemail_IX2 cemail_IX2 4 cs.id Using where; Using index
NOTES
This is the kind of problem where introducing a little redundancy can improve performance. (Just like we do in a traditional data warehouse.)
For optimum performance, what we'd really like is to have the client_id column available in the contacts_emailAddresses table, without a need to JOIN to the contacts table.
In the current schema, the foreign key relationship to contacts table gets us the client_id (rather, the JOIN operation in the original query is what gets it for us.) If we could avoid that JOIN operation entirely, we could satisfy the query entirely from a single index, using the index to do the aggregation, and avoiding the overhead of the "Using temporary; Using filesort" and JOIN operations...
With the client_id column available, we'd create a covering index like...
... ON contacts_emailAddresses (client_id, subscribed)
Then, we'd have a blazingly fast query...
SELECT e.client_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.client_id
That would get us a "Using index" in the query plan, and the query plan for this resultset just doesn't get any better than that.
But that would require a change to your schema, so it doesn't really answer your question.
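For completeness, that schema change might look roughly like this (a sketch; the copied column then has to be maintained by triggers or application code, like any denormalization):
ALTER TABLE contacts_emailAddresses
    ADD COLUMN client_id MEDIUMINT(10) UNSIGNED DEFAULT NULL,
    ADD KEY client_subscribed (client_id, subscribed);

-- one-time backfill from the contacts table
UPDATE contacts_emailAddresses ce
JOIN contacts c ON c.id = ce.contact_id
SET ce.client_id = c.client_id;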
Without the client_id column, the best we're likely to do is a query like the one Gordon posted in his answer (though you still need to add the GROUP BY c.client_id to get the specified result). The index Gordon recommended will be of benefit...
... ON contacts_emailAddresses(contact_id, subscribed)
With that index defined, the standalone index on contact_id is redundant. The new index will be a suitable replacement to support the existing foreign key constraint. (The index on just contact_id could be dropped.)
Another approach would be to do the aggregation on the "big" table first, before doing the JOIN, since it's the driving table for the outer join. Actually, since that foreign key column is defined as NOT NULL, and there's a foreign key, it's not really an "outer" join at all.
SELECT c.client_id
, SUM(s.subs) AS subs
, SUM(s.unsubs) AS unsubs
FROM ( SELECT e.contact_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.contact_id
) s
JOIN contacts c
ON c.id = s.contact_id
GROUP BY c.client_id
Again, we need an index with contact_id as the leading column and including the subscribed column, for best performance. (The plan for s should show "Using index".) Unfortunately, that's still going to materialize a fairly sizable resultset (derived table s) as a temporary MyISAM table, and the MyISAM table isn't going to be indexed.