MySQL count(*) , Group BY and INNER JOIN - mysql

I'm having a really hard time with a query on MySQL 5.1.
I simplified the two tables I'm joining:
CREATE TABLE `jobs` (
`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`title` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM ;
and
CREATE TABLE `jobsCategories` (
`jobID` int(11) NOT NULL,
`industryID` int(11) NOT NULL,
KEY `jobID` (`jobID`),
KEY `industryID` (`industryID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
The query is straightforward:
SELECT count(*) as nb,industryID
FROM jobs J
INNER JOIN jobsCategories C ON C.jobID=J.id
GROUP BY industryID
ORDER BY nb DESC;
I have around 150,000 records in the jobs table and 350,000 records in the jobsCategories table, and I have 30 industries.
The query takes approximately 50 seconds to execute!
Do you have any idea why it takes so long? How could I optimize the structure of this database? Profiling the query shows that 99% of the execution time is spent copying to tmp tables.
EXPLAIN <query> gives me:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: J
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 178950
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: C
type: ref
possible_keys: jobID
key: jobID
key_len: 8
ref: J.id
rows: 1
Extra: Using where
2 rows in set (0.00 sec)
About the memory :
free -m :
total used free shared buffers cached
Mem: 2011 1516 494 0 8 1075
-/+ buffers/cache: 433 1578
Swap: 5898 126 5772
With the FORCE INDEX suggested below
select count(*) as nb, industryID
from
jobs J
inner join jobsCategories C force index (industryID) on (C.jobID = J.id )
group by industryID
order by nb DESC;
SHOW PROFILE;
gives me :
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000095 |
| Opening tables | 0.000014 |
| System lock | 0.000008 |
| Table lock | 0.000007 |
| init | 0.000032 |
| optimizing | 0.000011 |
| statistics | 0.000032 |
| preparing | 0.000016 |
| Creating tmp table | 0.000031 |
| executing | 0.000003 |
| Copying to tmp table | 3.301305 |
| Sorting result | 0.000028 |
| Sending data | 0.000024 |
| end | 0.000003 |
| removing tmp table | 0.000009 |
| end | 0.000004 |
| query end | 0.000003 |
| freeing items | 0.000029 |
| logging slow query | 0.000003 |
| cleaning up | 0.000003 |
+----------------------+----------+
I guess my RAM (2 GB) is not large enough. How can I be certain this is the case?

Firstly, I think you don't need to join the jobs table to get the same result (unless you have some garbage data in the jobsCategories table):
select count(*) as nb, industryID
from jobsCategories
group by industryID
order by nb DESC;
Otherwise, you may try forcing the index on industryID:
select count(*) as nb, industryID
from
jobs J
inner join jobsCategories C force index (industryID) on (C.jobID = J.id )
group by industryID
order by nb DESC;
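The caveat about garbage data can be checked before dropping the join. This is a minimal sketch, assuming the jobs and jobsCategories tables from the question; the alias orphan_rows is only illustrative.

```sql
-- Count category rows whose jobID has no matching row in jobs.
-- jobs.id is the primary key, so the INNER JOIN can never duplicate
-- category rows; it can only drop orphaned ones.
SELECT COUNT(*) AS orphan_rows
FROM jobsCategories C
LEFT JOIN jobs J ON J.id = C.jobID
WHERE J.id IS NULL;
```

If this returns 0, the join-free query should produce the same counts per industry as the original query.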

Change your tables to InnoDB =) InnoDB is good at managing big tables and may make the COUNT(*) faster:
http://www.mysqlperformanceblog.com/2009/01/12/should-you-move-from-myisam-to-innodb/
Good luck.
EDIT:
After testing, it seems that MyISAM is faster than InnoDB for COUNT(*) when there is no WHERE clause:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
Anyway, I tested your exact query against simulated tables like yours (150k jobs and 300k jobsCategories) using MyISAM, and it took 1.5 seconds, so maybe your problem is elsewhere. That's all I can tell you =P

Hope I'm not misreading this, but from what I see, you don't need ANY join. Since you are counting how many jobs fall under each industry, it's all in your jobsCategories table. Why join to the jobs table for the job title when that isn't even being returned?
select industryID,
count(*) JobsPerIndustry
from jobsCategories
group by industryID;
EDIT PER COMMENT / FEEDBACK...
That definitely makes a difference: you are adding a criterion associated with a job. Ensure your jobs table has an index on the column you expect to filter on (here, countryID), then use a query similar to the one you originally had:
SELECT
count(*) as nb,
industryID
FROM jobs J
JOIN jobsCategories C
ON J.ID = C.jobID
WHERE
J.countryID=1234
GROUP BY
industryID
ORDER BY
nb DESC;
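The index this answer asks for could be created as sketched here. The countryID column is not part of the simplified schema in the question, so its existence is assumed, and the index name idx_jobs_countryID is hypothetical.

```sql
-- Assumes jobs has a countryID column (not shown in the simplified schema).
ALTER TABLE jobs ADD INDEX idx_jobs_countryID (countryID);

-- Check whether the optimizer picks the new index for the filtered query.
EXPLAIN
SELECT COUNT(*) AS nb, industryID
FROM jobs J
JOIN jobsCategories C ON J.id = C.jobID
WHERE J.countryID = 1234
GROUP BY industryID
ORDER BY nb DESC;
```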

Related

Fetch latest row for a specific identifier

I have a table that looks like this
ID | identifier | data | created_at
------------------------------------
1 | 500 | test1 | 2011-08-30 15:27:29
2 | 501 | test1 | 2011-08-30 15:27:29
3 | 500 | test2 | 2011-08-30 15:28:29
4 | 865 | test3 | 2011-08-30 15:29:29
5 | 501 | test2 | 2011-08-30 15:31:29
6 | 500 | test3 | 2011-08-30 15:31:29
What I need is the most up to date entry for each identifier, that could be decided by either the ID or the date in created_at. I assumed ID is the better choice due to the indexing.
I would expect this result set:
4 | 865 | test3 | 2011-08-30 15:29:29
5 | 501 | test2 | 2011-08-30 15:31:29
6 | 500 | test3 | 2011-08-30 15:31:29
The result should be ordered by either date or ID in ascending order.
It's important to note that this table contains ~8 million rows.
I have tried quite a few approaches with self joins and subqueries. Unfortunately, all of them either produced wrong results or took half a decade to run.
To provide an example:
SELECT lo1.*
FROM table lo1
INNER JOIN
(
SELECT MAX(id) MaxID, identifier, id
FROM table
GROUP BY identifier
) lo2
ON lo1.identifier= lo2.identifier
AND lo1.id = lo2.MaxID
ORDER BY lo1.id DESC
LIMIT 10
The above query takes very long and sometimes does not return the latest result for an identifier; I'm not quite sure why.
Does anyone have an approach that is able to fetch the required result sets and preferably does not take a decade?
As asked, here is the create code:
CREATE TABLE `table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`identifier` int(11) NOT NULL,
`data` varchar(200) COLLATE latin1_bin NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `identifier` (`identifier`),
KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
The following query gives the correct results, but it won't scale to larger tables.
Query
SELECT
`table`.*
FROM
`table`
INNER JOIN
(
SELECT
MAX(id) AS MaxID
, identifier
FROM
`table`
GROUP BY
identifier
#disables GROUP BY Sorting might make the query faster.
ORDER BY
NULL
) `table_group`
ON
`table`.ID = `table_group`.MaxID
ORDER BY
`table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/4
But when you check "View Execution Plan", you can see "Using where; Using temporary; Using filesort" in the Extra column. "Using filesort" means MySQL needs to run a quicksort-style sort pass, and "Using temporary" means the sort first runs on an in-memory temporary table.
If the in-memory temporary table becomes too large, it is converted to an on-disk MyISAM temporary table.
The quicksort then needs disk-based random I/O to sort, which is slow on disks.
So this method will not scale to a table with ~8 million rows.
The query below gives the same results, but it should be better optimized.
Query
SELECT
`table`.*
FROM
`table`
INNER JOIN (
SELECT
`table`.ID
FROM
`table`
INNER JOIN
(
SELECT
MAX(id) AS MaxID
, identifier
FROM
`table`
GROUP BY
identifier
#disables GROUP BY Sorting might make the query faster.
ORDER BY
NULL
)
AS `table_group`
ON
`table`.ID = `table_group`.MaxID
)
AS `table_group_max`
ON
`table`.ID = `table_group_max`.ID
ORDER BY
`table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/21
When you check "View Execution Plan", there is no more "Using temporary; Using filesort", meaning this query should be more efficient than the previous one and should, in theory, execute faster.
As explained, the combination "Using temporary; Using filesort" can really be a performance killer.
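To try either query outside SQL Fiddle, the sample data from the question can be loaded as sketched here. The table definition and rows are copied from the question (minus the trailing comma after the last KEY, which MySQL rejects); everything else is an assumption for local testing only.

```sql
-- `table` is a reserved word, so it must stay backtick-quoted.
CREATE TABLE `table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `identifier` int(11) NOT NULL,
  `data` varchar(200) COLLATE latin1_bin NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `identifier` (`identifier`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;

INSERT INTO `table` (`id`, `identifier`, `data`, `created_at`) VALUES
  (1, 500, 'test1', '2011-08-30 15:27:29'),
  (2, 501, 'test1', '2011-08-30 15:27:29'),
  (3, 500, 'test2', '2011-08-30 15:28:29'),
  (4, 865, 'test3', '2011-08-30 15:29:29'),
  (5, 501, 'test2', '2011-08-30 15:31:29'),
  (6, 500, 'test3', '2011-08-30 15:31:29');

-- Latest row per identifier: join back on the maximum id of each group.
SELECT t.*
FROM `table` t
INNER JOIN (
  SELECT MAX(id) AS MaxID
  FROM `table`
  GROUP BY identifier
) g ON t.id = g.MaxID
ORDER BY t.id DESC
LIMIT 10;
```

With these six rows the SELECT should return ids 6, 5 and 4, matching the Result tables in this answer. Prefixing it with EXPLAIN on a table of realistic size shows whether "Using temporary; Using filesort" appears on your own server.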

Need help understanding how mysql indexes work

I have a table that looks like this:
CREATE TABLE `metric` (
`metricid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`host` varchar(50) NOT NULL,
`userid` int(10) unsigned DEFAULT NULL,
`lastmetricvalue` double DEFAULT NULL,
`receivedat` int(10) unsigned DEFAULT NULL,
`name` varchar(255) NOT NULL,
`sampleid` tinyint(3) unsigned NOT NULL,
`type` tinyint(3) unsigned NOT NULL DEFAULT '0',
`lastrawvalue` double NOT NULL,
`priority` tinyint(3) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`metricid`),
UNIQUE KEY `unique-metric` (`userid`,`host`,`name`,`sampleid`)
) ENGINE=InnoDB AUTO_INCREMENT=1000000221496 DEFAULT CHARSET=utf8
It has 177,892 rows at the moment, and when I run the following query:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE m.userid = 8
AND (host, name, sampleid) IN (('localhost','0.4350799184758216cpu-3/cpu-nice',0),
('localhost','0.4350799184758216cpu-3/cpu-system',0),
('localhost','0.4350799184758216cpu-3/cpu-idle',0),
('localhost','0.4350799184758216cpu-3/cpu-wait',0),
('localhost','0.4350799184758216cpu-3/cpu-interrupt',0),
('localhost','0.4350799184758216cpu-3/cpu-softirq',0),
('localhost','0.4350799184758216cpu-3/cpu-steal',0),
('localhost','0.4350799184758216cpu-4/cpu-user',0),
('localhost','0.4350799184758216cpu-4/cpu-nice',0),
('localhost','0.4350799184758216cpu-4/cpu-system',0),
('localhost','0.4350799184758216cpu-4/cpu-idle',0),
('localhost','0.4350799184758216cpu-4/cpu-wait',0),
('localhost','0.4350799184758216cpu-4/cpu-interrupt',0),
('localhost','0.4350799184758216cpu-4/cpu-softirq',0),
('localhost','0.4350799184758216cpu-4/cpu-steal',0),
('localhost','_util/billing-bytes',0),('localhost','_util/billing-metrics',0));
it takes 0.87 seconds to return results, explain is:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: ref
possible_keys: unique-metric
key: unique-metric
key_len: 5
ref: const
rows: 85560
Extra: Using where
1 row in set (0.00 sec)
profile looks like this:
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000160 |
| checking permissions | 0.000010 |
| Opening tables | 0.000021 |
| exit open_tables() | 0.000008 |
| System lock | 0.000008 |
| mysql_lock_tables(): unlocking | 0.000005 |
| exit mysqld_lock_tables() | 0.000007 |
| init | 0.000068 |
| optimizing | 0.000018 |
| statistics | 0.000091 |
| preparing | 0.000042 |
| executing | 0.000005 |
| Sending data | 0.870180 |
| innobase_commit_low():trx_comm | 0.000012 |
| Sending data | 0.000111 |
| end | 0.000009 |
| query end | 0.000009 |
| ha_commit_one_phase(-1) | 0.000015 |
| innobase_commit_low():trx_comm | 0.000004 |
| ha_commit_one_phase(-1) | 0.000005 |
| query end | 0.000005 |
| closing tables | 0.000012 |
| freeing items | 0.000562 |
| logging slow query | 0.000005 |
| cleaning up | 0.000005 |
| sleeping | 0.000006 |
+--------------------------------+----------+
Which seems way too high to me. I tried replacing the userid = 8 AND (host, name, sampleid) IN part of the first query with (userid, host, name, sampleid) IN, and this query runs in about 0.5s, almost 2 times quicker. For reference, here's the query:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE (userid, host, name, sampleid) IN ((8,'localhost','0.4350799184758216cpu-3/cpu-nice',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-system',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-idle',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-wait',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-interrupt',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-softirq',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-steal',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-user',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-nice',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-system',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-idle',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-wait',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-interrupt',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-softirq',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-steal',0),
(8,'localhost','_util/billing-bytes',0),
(8,'localhost','_util/billing-metrics',0));
its explain looks like this:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 171121
Extra: Using where
1 row in set (0.00 sec)
Next, I updated the table to contain a single concatenated column:
alter table `metric` add `forindex` varchar(120) not null default '';
update metric set forindex = concat(userid,`host`,`name`,sampleid);
alter table metric add index `forindex` (`forindex`);
Updated the query to have only 1 string searched:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE (forindex) IN (('8localhost0.4350799184758216cpu-3/cpu-nice0'),
('8localhost0.4350799184758216cpu-3/cpu-system0'),
('8localhost0.4350799184758216cpu-3/cpu-idle0'),
('8localhost0.4350799184758216cpu-3/cpu-wait0'),
('8localhost0.4350799184758216cpu-3/cpu-interrupt0'),
('8localhost0.4350799184758216cpu-3/cpu-softirq0'),
('8localhost0.4350799184758216cpu-3/cpu-steal0'),
('8localhost0.4350799184758216cpu-4/cpu-user0'),
('8localhost0.4350799184758216cpu-4/cpu-nice0'),
('8localhost0.4350799184758216cpu-4/cpu-system0'),
('8localhost0.4350799184758216cpu-4/cpu-idle0'),
('8localhost0.4350799184758216cpu-4/cpu-wait0'),
('8localhost0.4350799184758216cpu-4/cpu-interrupt0'),
('8localhost0.4350799184758216cpu-4/cpu-softirq0'),
('8localhost0.4350799184758216cpu-4/cpu-steal0'),
('8localhost_util/billing-bytes0'),
('8localhost_util/billing-metrics0'));
And now I get the same results in 0.00 sec! Explain is:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: range
possible_keys: forindex
key: forindex
key_len: 362
ref: NULL
rows: 17
Extra: Using where
1 row in set (0.00 sec)
So to summarize, here are the results:
m.userid = X AND (host, name, sampleid) IN - index used, 85560 rows scanned, runs in 0.9s
(userid, host, name, sampleid) IN - index not used, 171121 rows scanned, runs in 0.5s
forindex IN, with the compound index replaced by an index over a concatenated utility column - index used, 17 rows scanned, runs in 0s
Why does the second query run faster than the first? And why is the third query so much faster than the rest? Should I keep such a column for the sole purpose of faster searching?
Mysql version is:
mysqld Ver 5.5.34-55 for Linux on x86_64 (Percona XtraDB Cluster (GPL), wsrep_25.9.r3928)
Indexes help your search terms in the WHERE clause by narrowing down the search as much as possible. You can see this happening...
The rows field of EXPLAIN gives an estimate of how many rows the query will have to examine to find the rows that match your query. By comparing the rows reported in each EXPLAIN, you can see how much better your better-optimized query is:
rows: 85560 -- first query
rows: 171121 -- second query examines 2x more rows, but it was probably
-- faster because the data was buffered after the first query
rows: 17 -- third query examines 5,000x fewer rows than first query
You would also notice in the SHOW PROFILE details if you ran that for the third query that "Sending data" is a lot faster for the quicker query. This process state indicates how long it took to copy rows from the storage engine up to the SQL layer of MySQL. Even when doing memory-to-memory copying, this takes a while for so many thousands of rows. This is why indexes are so beneficial.
For more useful explanation, see my presentation How to Design Indexes, Really.
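An alternative to the extra forindex column is to spell out each tuple as explicit equality conditions. As far as I know, MySQL only gained range optimization for row-constructor IN lists in the 5.7 series, so on the 5.5 server in the question the tuple form cannot drive a range scan, while the OR form can. This is a sketch under that assumption, shortened to three of the question's tuples:

```sql
-- Each OR branch fixes all four columns of the unique-metric index
-- (userid, host, name, sampleid), so the optimizer can look each one up directly.
SELECT metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE (m.userid = 8 AND m.host = 'localhost'
       AND m.name = '0.4350799184758216cpu-3/cpu-nice' AND m.sampleid = 0)
   OR (m.userid = 8 AND m.host = 'localhost'
       AND m.name = '0.4350799184758216cpu-3/cpu-system' AND m.sampleid = 0)
   OR (m.userid = 8 AND m.host = 'localhost'
       AND m.name = '_util/billing-bytes' AND m.sampleid = 0);
```

Running EXPLAIN on this form should show whether type changes to range with a rows estimate close to the number of tuples. It also avoids a weakness of the concatenated column: without a separator, different combinations can produce the same string (userid 1 with host '2x' and userid 12 with host 'x' both start with '12x').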

Adding limit clause to MySQL query slows it down dramatically

I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to astronomical (42 minutes).
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| 1002395119 | 2012-05-14 |
...
| 1002395157 | 2012-05-14 |
| 1002395187 | 2012-05-14 |
| 1002395475 | 2012-05-14 |
+------------+-------------+
105776 rows in set (2.19 sec)
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id limit 1;
+------------+-------------+
| player_id | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14 |
+------------+-------------+
1 row in set (42 min 23.26 sec)
mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id | int(10) unsigned | NO | PRI | NULL | |
| insert_date | date | NO | PRI | NULL | |
| xp | int(10) unsigned | YES | | NULL | |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)
There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.
CREATE temporary TABLE date_curr
(
player_id INT UNSIGNED NOT NULL,
insert_date DATE,
PRIMARY KEY player_id (player_id, insert_date)
) ENGINE=MEMORY;
INSERT into date_curr
SELECT player_id,
MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;
CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT into date_prev
SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER join date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id limit 0,20000;
date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.
These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.
Here is the explain output:
mysql> explain SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | pr | range | PRIMARY,insert_date | insert_date | 3 | NULL | 396828 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 216825 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)
This is on a system with 24G RAM dedicated to the database, and currently is pretty much idle. This specific database is the test so it is completely static. I did a mysql restart and it still has the same behavior.
Here is the 'show profile all' output, with most time being spent on copying to tmp table.
| Status | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function | Source_file | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 | 0.383941 | 110240 | 18983 | 16160 | 448 | 0 | 0 | 0 | 43 | 0 | exec | sql_select.cc | 1976 |
A bit of a long answer, but I hope you can learn something from this.
Based on the EXPLAIN output, there were two possible indexes that the MySQL query optimizer could have used:
possible_keys
PRIMARY,insert_date
However the MySQL query optimizer decided to use the following index:
key
insert_date
This is a rare occasion where the MySQL query optimizer used the wrong index. There is a probable cause for this: you are working on a static development database, which you probably restored from production to do development against.
When the MySQL optimizer needs to decide which index to use in a query, it looks at the statistics for all the possible indexes. You can read more about statistics here http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html for a start.
When you update, insert into, and delete from a table, you change the index statistics. It might be that, because of the static data, the MySQL server had the wrong statistics and chose the wrong index. This, however, is just a guess at a possible root cause.
Now let's dive into the indexes. There were two possible indexes to use: the primary key index and the index on insert_date. MySQL used the insert_date one. Remember that during query execution MySQL generally uses only one index per table. Let's look at the difference between the primary key index and the insert_date index.
A simple fact about a primary key index (aka clustered):
A primary key index is normally a B-tree structure that contains the data rows, i.e. it is the table, as it contains the data.
A simple fact about a secondary index (aka non-clustered):
A secondary index is normally a B-tree structure that contains the data being indexed (the columns in the index) and a pointer to the location of the data row in the primary key index.
This is a subtle but big difference.
Let me explain. When you read a primary key index, you are reading the table. The table is stored in the order of the primary index as well. Thus, to find a value, I search the index and read the data, which is 1 operation.
When you read a secondary index, you search the index, find the pointer, then read the primary key index to find the data based on the pointer. This is essentially 2 operations, making a read through a secondary index roughly twice as costly as a read through the primary key index.
In your case, since it chose insert_date as the index to use, it was doing double the work just to do the join. That is problem one.
Now, when you LIMIT a record set, it is the last step of query execution. MySQL has to take the entire record set, sort it (if not already sorted) based on the ORDER BY and GROUP BY conditions, then take the number of records you want and send them back based on the LIMIT clause. MySQL has to do a lot of work to keep track of which records to send and where it is in the record set. LIMIT does have a performance hit, but I suspect there might be a contributing factor; read on.
Look at your GROUP BY: it is by player_id. The index that is used is insert_date. GROUP BY essentially orders your record set, but here it had no index to use for that ordering (remember, an index is sorted in the order of the column(s) it contains). Essentially, you were asking for a sort on player_id, and the index used was sorted on insert_date.
This step caused the filesort problem: MySQL takes the data returned from reading the secondary index and the primary index (remember the 2 operations) and then has to sort it. Large sorts that do not fit in memory are done on disk, which is very expensive. Thus the entire query result was written to disk and sorted painfully slowly to get you your results.
By removing the insert_date index, MySQL will use the primary key index, which means the data is already ordered by player_id and insert_date. This eliminates the need to read the secondary index and then follow the pointer to the primary key index, i.e. the table. And since the data is already sorted, MySQL has very little work to do when applying the GROUP BY piece of the query.
The following is a bit of an educated guess again; if you could post the EXPLAIN output after the index is dropped, I would probably be able to confirm my thinking. By using the wrong index, the results were sorted on disk to apply the LIMIT properly. Removing the LIMIT probably allows MySQL to sort in memory, as it does not have to apply the LIMIT and keep track of what is being returned. The LIMIT probably caused the temporary table to be created. Once again, it is difficult to say without seeing the difference between the statements, i.e. the EXPLAIN output.
Hopefully this gives you a better understanding of indexes and why they are a double edged sword.
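Both ideas in this answer (stale statistics and the wrong index) can be tested without dropping the insert_date index. This is a sketch, assuming the player_record and date_curr tables from the question; whether it actually changes the plan on that server is not confirmed by the thread.

```sql
-- Refresh the index statistics the optimizer bases its choice on.
ANALYZE TABLE player_record;

-- Ask the optimizer to use the primary key (player_id, insert_date)
-- instead of the insert_date index, then compare the plan.
EXPLAIN
SELECT pr.player_id,
       MAX(pr.insert_date) AS insert_date
FROM player_record pr FORCE INDEX (PRIMARY)
INNER JOIN date_curr dc
        ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0, 20000;
```

If "Using temporary; Using filesort" disappears from the pr row, that supports the explanation above.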
Had the same problem. When I added FORCE INDEX (id) it went back to the few milliseconds of a query it was without the limit, while producing the same results.
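One more thing worth checking, as an assumption the thread does not confirm: the EXPLAIN in the question shows date_curr (dc) with type ALL even though PRIMARY is listed as a possible key. MEMORY tables use HASH indexes by default, and a hash index can only serve equality lookups on the whole key, so a join on player_id alone cannot use a hash key on (player_id, insert_date). A sketch of the same temporary table with a B-tree primary key:

```sql
CREATE TEMPORARY TABLE date_curr
(
  player_id   INT UNSIGNED NOT NULL,
  insert_date DATE NOT NULL,
  -- BTREE allows lookups on the leftmost column (player_id) alone.
  PRIMARY KEY (player_id, insert_date) USING BTREE
) ENGINE=MEMORY;
```

Re-running the EXPLAIN after this change would show whether dc switches from ALL to ref.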

mysql using temporary table with subqueries, but not group by and order by

I have the following mysql query which is taking about 3 minutes to run. It does have 2 sub queries, but the tables have very few rows. When doing an explain, it looks like the "using temporary" might be the culprit. Apparently, it looks like the database is creating a temporary table for all three queries as noted in the "using temporary" designation below.
What confused me is that the MySQL documentation says, that using temporary is generally caused by group by and order by, neither of which I'm using. Do the subqueries cause an implicit group by or order by? Are the sub-queries causing a temporary table to be necessary regardless of group by or order by? Any recommendations of how to restructure this query so MySQL can handle it more efficiently? Any other tuning ideas in the MySQL settings?
mysql> explain
SELECT DISTINCT COMPANY_ID, COMPANY_NAME
FROM COMPANY
WHERE ID IN (SELECT DISTINCT ID FROM CAMPAIGN WHERE CAMPAIGN_ID IN (SELECT
DISTINCT CAMPAIGN_ID FROM AD
WHERE ID=10 AND (AD_STATUS='R' OR AD_STATUS='T'))
AND (STATUS_CODE='L' OR STATUS_CODE='A' OR STATUS_CODE='C'));
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| 1 | PRIMARY | COMPANY | ALL | NULL | NULL | NULL | NULL | 1207 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | CAMPAIGN | ALL | NULL | NULL | NULL | NULL | 880 | Using where; Using temporary |
| 3 | DEPENDENT SUBQUERY | AD | ALL | NULL | NULL | NULL | NULL | 264 | Using where; Using temporary |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
thanks!
Phil
I don't know the structure of your schema, but I would try the following:
CREATE INDEX i_company_id ON company(id); -- should it be a Primary Key?..
CREATE INDEX i_campaign_id ON campaign(id); -- same, PK here?
CREATE INDEX i_ad_id ON ad(id); -- the same question applies
ANALYZE TABLE company, campaign, ad;
And your query can be simplified like this:
SELECT DISTINCT c.company_id, c.company_name
FROM company c
JOIN campaign cg ON c.id = cg.id
JOIN ad ON cg.campaign_id = ad.campaign_id
WHERE ad.id = 10
AND ad.ad_status IN ('R', 'T')
AND cg.status_code IN ('L', 'A', 'C'); -- STATUS_CODE is filtered in the CAMPAIGN subquery of the original query
The DISTINCT clauses in the subqueries are slowing things down significantly for you; the final one is sufficient.
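The rewritten query also joins on campaign_id, which the indexes suggested above do not cover. A possible addition, assuming the column names from the question; the index names are hypothetical and the real schema is unknown:

```sql
-- Supports the join cg.campaign_id = ad.campaign_id from the campaign side.
CREATE INDEX i_campaign_campaign_id ON campaign(campaign_id);

-- Supports the filter ad.id = 10 and then the join column from the ad side.
CREATE INDEX i_ad_id_campaign_id ON ad(id, campaign_id);

ANALYZE TABLE campaign, ad;
```

Re-running EXPLAIN on the rewritten query afterwards shows whether the ALL scans and "Using temporary" go away.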

MySQL performance with GROUP BY and JOIN

After spending a lot of time with variants to this question I'm wondering if someone can help me optimize this query or indexes.
I have three temp tables ref1, ref2, ref3 all defined as below, with ref1 and ref2 each having about 6000 rows and ref3 only 3 rows:
CREATE TEMPORARY TABLE ref1 (
id INT NOT NULL AUTO_INCREMENT,
val INT,
PRIMARY KEY (id)
)
ENGINE = MEMORY;
The slow query is against a table like so, with about 1M rows:
CREATE TABLE t1 (
d DATETIME NOT NULL,
id1 INT NOT NULL,
id2 INT NOT NULL,
id3 INT NOT NULL,
x INT NULL,
PRIMARY KEY (id1, d, id2, id3)
)
ENGINE = INNODB;
The query in question:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
The temp tables are used to filter the result set down to just the items a user is looking for.
EXPLAIN
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| 1 | SIMPLE | ref1 | ALL | PRIMARY | NULL | NULL | NULL | 6000 | Using temporary; Using filesort |
| 1 | SIMPLE | t1 | ref | PRIMARY | PRIMARY | 4 | med31new.ref1.id | 38 | Using where |
| 1 | SIMPLE | ref3 | ALL | PRIMARY | NULL | NULL | NULL | 3 | Using where; Using join buffer |
| 1 | SIMPLE | ref2 | eq_ref | PRIMARY | PRIMARY | 4 | med31new.t1.id2 | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
(on a different system with ~5M rows, EXPLAIN shows t1 first in the list, with "Using where; Using index; Using temporary; Using filesort")
Is there something obvious I'm missing that would prevent the temporary table from being used?
First, filesort does not mean a file is written to disk to perform the sort; it's the name of the quicksort algorithm in MySQL. See "What does Using filesort mean in MySQL?".
So the problematic keyword in your EXPLAIN is Using temporary, not Using filesort. For that, you can play with tmp_table_size and max_heap_table_size (set both to the same value) to allow more in-memory work and avoid on-disk temporary table creation; check this link on the subject, which includes remarks about documentation mistakes.
Then you could try a different indexing policy and see the results, but do not try to avoid filesort.
One last thing, unrelated: you compute SUM(x), but x can be NULL. SUM skips NULL values, but it returns NULL for a group in which every x is NULL; SUM(COALESCE(x, 0)) is maybe better if you do not want such a group to produce a NULL sum.
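To see whether the temporary table actually spills to disk, and to raise the limits for one session, something like the following can be used. The 64 MB value is an arbitrary example, not a recommendation.

```sql
-- Current limits; the smaller of the two caps in-memory temporary tables.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Note the counters, run the slow query, then check them again.
-- A rising Created_tmp_disk_tables means the temporary table went to disk.
SHOW SESSION STATUS LIKE 'Created_tmp%';

-- Raise both limits for this session only (64 MB, arbitrary example).
SET SESSION tmp_table_size = 67108864;
SET SESSION max_heap_table_size = 67108864;
```

If Created_tmp_disk_tables stops increasing after the change but the query is still slow, the temporary table size is probably not the bottleneck.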
Add an index on JUST the date column (d). Since that is the criterion on the first table, and the others are just joins, the query will be optimized against the date first; the joins are secondary.
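That suggestion could be tried as sketched here. The index name idx_t1_d is hypothetical, and whether the optimizer prefers it over the primary key depends on how selective the date range is.

```sql
ALTER TABLE t1 ADD INDEX idx_t1_d (d);

-- Compare this plan with the EXPLAIN output in the question.
EXPLAIN
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
```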
Isn't this:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
exactly equivalent to:
select id1, SUM(x)
FROM t1
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
group by id1;
What are the extra tables being used for? I think the temp table mentioned in another answer is referring to MySQL creating a temp table during query execution. If you're hoping to create a sub-query (or table) that will minimize number of operations required in a join, that might speed up the query, but I don't see joined data being selected.