MySQL avoid filesort in inner query

I am trying to avoid the filesort but am having no luck removing it from the inner query. If I move the condition to the outer query, the query returns nothing.
Create table articles (
article_id Int UNSIGNED NOT NULL AUTO_INCREMENT,
editor_id Int UNSIGNED NOT NULL,
published_date Datetime,
Primary Key (article_id)) ENGINE = InnoDB;
Create Index published_date_INX ON articles (published_date);
Create Index editor_id_INX ON articles (editor_id);
EXPLAIN SELECT article_id, published_date FROM articles AA INNER JOIN
(
SELECT article_id
FROM articles
WHERE editor_id=1
ORDER BY published_date DESC
LIMIT 100, 5
) ART USING (article_id);
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
| id | select_type | table      | type   | possible_keys | key       | key_len | ref            | rows   | Extra          |
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL      | NULL    | NULL           |      5 |                |
|  1 | PRIMARY     | AA         | eq_ref | PRIMARY       | PRIMARY   | 4       | ART.article_id |      1 |                |
|  2 | DERIVED     | articles   | ALL    | editor_id     | editor_id | 5       | NULL           | 114311 | Using filesort |
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
3 rows in set (30.31 sec)
Any suggestions on how to remove the filesort from this query?

Maybe you can try adding an index on editor_id, published_date.
create index edpub_INX on articles (editor_id, published_date);
On your inner query:
SELECT article_id
FROM articles
WHERE editor_id=1
ORDER BY published_date DESC
LIMIT 100, 5
MySQL's query planner is probably thinking that filtering by editor_id (using the index) and then ordering by published_date is better than using published_date_INX and then filtering by editor_id. And the query planner is probably right.
So, if you want to "help" on that specific query, create an index on editor_id, published_date and see if it helps your query run faster.
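To verify, you can re-run EXPLAIN on just the inner query once the composite index exists; a quick sketch of that check, assuming the edpub_INX index above has been created:
-- With (editor_id, published_date) indexed, MySQL can read the rows for editor_id = 1
-- already ordered by published_date, so "Using filesort" should drop out of the Extra column.
EXPLAIN
SELECT article_id
FROM articles
WHERE editor_id = 1
ORDER BY published_date DESC
LIMIT 100, 5;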

Related

MySQL select query gets quite slow with BOTH where and descending order

I have this select query, where ItemType is a varchar column and ItemComments is an int column:
select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1
You can see this query has 3 conditions:
where 'ItemType' equals a specific value;
order by 'ItemComments'
with descending order
The interesting thing is, when I select rows with all three conditions, the query is very slow. But if I drop any one of the three (except condition 2), it runs quite fast. See:
select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 16.318 sec. */
select * from ItemInfo where ItemType="item_type" order by ItemComments limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.140 sec. */
select * from ItemInfo order by ItemComments desc limit 1;
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.015 sec. */
Plus,
I'm using MySQL 5.7 with InnoDB engine.
I have created indexes on both ItemType and ItemComments and table ItemInfo contains 2 million rows.
I have searched for many possible explanations, like MySQL's support for descending indexes, composite indexes and so on. But these still can't explain why query #1 runs slowly while queries #2 and #3 run well.
It would be much appreciated if anyone could help me out.
Updates: create table and explain info
Create code:
CREATE TABLE `ItemInfo` (
`ItemID` VARCHAR(255) NOT NULL,
`ItemType` VARCHAR(255) NOT NULL,
`ItemPics` VARCHAR(255) NULL DEFAULT '0',
`ItemName` VARCHAR(255) NULL DEFAULT '0',
`ItemComments` INT(50) NULL DEFAULT '0',
`ItemScore` DECIMAL(10,1) NULL DEFAULT '0.0',
`ItemPrice` DECIMAL(20,2) NULL DEFAULT '0.00',
`ItemDate` DATETIME NULL DEFAULT '1971-01-01 00:00:00',
PRIMARY KEY (`ItemID`, `ItemType`),
INDEX `ItemDate` (`ItemDate`),
INDEX `ItemComments` (`ItemComments`),
INDEX `ItemType` (`ItemType`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
Explain result:
mysql> explain select * from ItemInfo where ItemType="item_type" order by ItemComments desc limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | i | NULL | index | ItemType | ItemComments | 5 | NULL | 83 | 1.20 | Using where |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
mysql> explain select * from ItemInfo where ItemType="item_type" order by ItemComments limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | i | NULL | index | ItemType | ItemComments | 5 | NULL | 83 | 1.20 | Using where |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
mysql> explain select * from ItemInfo order by ItemComments desc limit 1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
| 1 | SIMPLE | i | NULL | index | NULL | ItemComments | 5 | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------+
Query from O. Jones:
mysql> explain
-> SELECT a.*
-> FROM ItemInfo a
-> JOIN (
-> SELECT MAX(ItemComments) ItemComments, ItemType
-> FROM ItemInfo
-> GROUP BY ItemType
-> ) maxcomm ON a.ItemType = maxcomm.ItemType
-> AND a.ItemComments = maxcomm.ItemComments
-> WHERE a.ItemType = 'item_type';
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
| 1 | PRIMARY | a | NULL | ref | ItemComments,ItemType | ItemType | 767 | const | 27378 | 100.00 | Using where |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 772 | mydb.a.ItemComments,const | 10 | 100.00 | Using where; Using index |
| 2 | DERIVED | ItemInfo | NULL | index | PRIMARY,ItemDate,ItemComments,ItemType | ItemType | 767 | NULL | 2289466 | 100.00 | NULL |
+----+-------------+------------+------------+-------+----------------------------------------+-------------+---------+---------------------------+---------+----------+--------------------------+
I'm not sure if I executed this query correctly, but I couldn't get the results even after quite a long time.
Query from Vijay. But I added the ItemType join condition, because joining on max_comnt alone returned items of other ItemTypes:
SELECT ifo.* FROM ItemInfo ifo
JOIN (SELECT ItemType, MAX(ItemComments) AS max_comnt FROM ItemInfo WHERE ItemType="item_type") inn_ifo
ON ifo.ItemComments = inn_ifo.max_comnt and ifo.ItemType = inn_ifo.ItemType
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 7.441 sec. */
explain result:
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | ifo | NULL | index_merge | ItemComments,ItemType | ItemComments,ItemType | 5,767 | NULL | 88 | 100.00 | Using intersect(ItemComments,ItemType); Using where |
| 2 | DERIVED | ItemInfo | NULL | ref | ItemType | ItemType | 767 | const | 27378 | 100.00 | NULL |
+----+-------------+------------+------------+-------------+-----------------------+-----------------------+---------+-------+-------+----------+-----------------------------------------------------+
And I want to explain why I used ORDER BY with LIMIT in the first place: I was planning to fetch records from the table randomly with a specific probability. The random index was generated in Python and sent to MySQL as a variable. But then I found it cost so much time that I decided to just use the first record I got.
Inspired by O. Jones and Vijay, I tried using the MAX function, but it doesn't perform well:
select max(ItemComments) from ItemInfo where ItemType='item_type'
/* Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 6.225 sec. */
explain result:
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
| 1 | SIMPLE | ItemInfo | NULL | ref | ItemType | ItemType | 767 | const | 27378 | 100.00 | NULL |
+----+-------------+------------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
Thanks to everyone who contributed to this question. I hope you can bring more solutions based on the information above.
Please provide CURRENT SHOW CREATE TABLE ItemInfo.
For most of those queries, you need the composite index
INDEX(ItemType, ItemComments)
For the last one, you need
INDEX(ItemComments)
For that especially slow query, please provide EXPLAIN SELECT ....
Discussion - Why does INDEX(ItemType, ItemComments) help with where ItemType="item_type" order by ItemComments desc limit 1?
An index is structured in a BTree (see Wikipedia), thereby making searching for an individual item very fast, and making scanning in a particular order very fast.
where ItemType="item_type" says to filter on ItemType, but there are a lot of such rows in the index. In this index, they are ordered by ItemComments (for a given ItemType). The direction desc says to start with the highest value of ItemComments; that is the 'end' of the index entries. Finally limit 1 says to stop after one item is found. (Somewhat like finding the last "S" in your Rolodex.)
So the query is to 'drill down' the BTree to the end of the entries for ItemType in the composite INDEX(ItemType, ItemComments) and grab one entry -- a very efficient task.
Actually SELECT * implies that there is one more step, namely to get all the columns for that one row. That info is not in the index, but over in the BTree for ItemInfo -- which contains all the columns for all the rows, ordered by the PRIMARY KEY.
The "secondary index" (INDEX(ItemType, ItemComments)) implicitly contains a copy of the relevant PRIMARY KEY columns, so we now have the values of ItemID and ItemType. With those, we can drill down this other BTree to find the desired row and fetch all (*) the columns.
Your query with ascending order can take advantage of your index on ItemComments.
SELECT * ... ORDER BY ... LIMIT 1 is a notorious performance antipattern. Why? The server must sort a whole mess of rows, just to discard all but the first.
You might try this (for your descending order variant). It's a little more verbose but much more efficient.
SELECT a.*
FROM ItemInfo a
JOIN (
SELECT MAX(ItemComments) ItemComments, ItemType
FROM ItemInfo
GROUP BY ItemType
) maxcomm ON a.ItemType = maxcomm.ItemType
AND a.ItemComments = maxcomm.ItemComments
WHERE a.ItemType = 'item_type'
Why does this work? It uses GROUP BY / MAX() to find the maximum value rather than ORDER BY ... DESC LIMIT 1. The subquery does your search.
To make this work as efficiently as possible you need a compound (multicolumn) index on (ItemType, ItemComments). Create that with
ALTER TABLE ItemInfo ADD INDEX ItemTypeCommentIndex (ItemType, ItemComments);
When you create the new index, drop your index on ItemType, because the new index is redundant with that one.
MySQL's query planner is smart enough to see the outer WHERE clause before it runs the inner GROUP BY query, so it doesn't have to aggregate the whole table.
With that compound index MySQL can use a loose index scan to satisfy the subquery. Those are almost miraculously fast. You should read up on the topic.
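To confirm the loose index scan is actually being used, EXPLAIN should report it in the Extra column; a hedged check, assuming the compound index above has been created:
EXPLAIN
SELECT MAX(ItemComments) ItemComments, ItemType
FROM ItemInfo
GROUP BY ItemType;
-- Look for "Using index for group-by" in Extra -- that is MySQL's loose index scan.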
Your query will select all the rows based on the WHERE condition. After that it will sort the rows according to the ORDER BY clause, then it will select the first row. A better query would be something like
SELECT ifo.* FROM ItemInfo ifo
JOIN (SELECT MAX(ItemComments) AS max_comnt FROM ItemInfo WHERE ItemType="item_type") inn_ifo
ON ifo.ItemComments = inn_ifo.max_comnt
This query only finds the maximum value of the column. Finding MAX() is only O(n), but the fastest comparison sort is O(n log n). So if you avoid the ORDER BY statement the query will perform faster.
Hope this helped.

Is there any way to optimize this mysql query?

I have two tables called hg_questions and hg_tags, related through hg_question_tag, which has the following structure:
+---------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------------------+------+-----+---------+-------+
| qid | bigint(20) unsigned | YES | | NULL | |
| tagid | bigint(20) unsigned | YES | MUL | NULL | |
| tagname | varchar(50) | YES | | NULL | |
+---------+---------------------+------+-----+---------+-------+
I have only one index on this table, on the tagid column. The following query runs very slowly, since there are exactly 59440 rows for tag number 464 (that many questions have this tag):
SELECT hg_questions.qid,
hg_questions.question,
hg_questions.points,
hg_questions.reward,
hg_questions.answerscount,
hg_questions.created_at,
hg_questions.sections,
hg_questions.answered,
hg_questions.user_id
FROM hg_questions
INNER JOIN hg_question_tag ON hg_question_tag.qid = hg_questions.qid
WHERE hg_question_tag.tagid = 464
ORDER BY points DESC LIMIT 15
OFFSET 0;
When running EXPLAIN on this query I get:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+---------------+---------+---------+----------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | hg_question_tag | ref | tagid | tagid | 9 | const | 59440 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | hg_questions | eq_ref | PRIMARY | PRIMARY | 8 | ejaaba_bilal.hg_question_tag.qid | 1 | |
+----+-------------+-----------------+--------+---------------+---------+---------+----------------------------------+-------+----------------------------------------------+
Any ideas how I can optimize this query, or is there a way to make it run faster?
hg_questions has an index on the points column.
Removing ORDER BY points makes it roughly 80% faster.
Here is your query, simplified to some extent:
SELECT q.*
FROM hg_questions q INNER JOIN
hg_question_tag qt
ON qt.qid = q.qid
WHERE qt.tagid = 464
ORDER BY points DESC
LIMIT 15 OFFSET 0;
You are doing both a join and an order by. To optimize this query, try putting an index on hg_question_tag(tagid, qid). This will work if not many questions have the tag.
If many questions have the tag, it might be better to go through the questions and choose the appropriate ones. Try rewriting the query as:
select q.*
from hg_questions q
where exists (select 1 from hg_question_tag qt where qt.qid = q.qid and qt.tagid = 464)
order by points desc
limit 15 offset 0;
Keep the above index and put another index on hg_questions(points, qid).
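A minimal sketch of the two suggested indexes (the index names are illustrative):
-- For the join + filter version:
CREATE INDEX tagid_qid_idx ON hg_question_tag (tagid, qid);
-- For the EXISTS version, so questions can be read in points order:
CREATE INDEX points_qid_idx ON hg_questions (points, qid);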

Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns

I am attempting to speed up a query that takes around 60 seconds to complete on a table of ~20 million rows.
For this example, the table has three columns (id, dateAdded, name).
id is the primary key.
The indexes I have added to the table are:
(dateAdded)
(name)
(id, name)
(id, name, dateAdded)
The query I am trying to run is:
SELECT MAX(id) as id, name
FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;
The date is variable from query to query.
The objective of this is to get the most recent entry for each name at or before the date added.
When I use explain on the query it tells me that it is using the (id, name, dateAdded) index.
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | exampletable | index | date_added_index | id_element_name_date_added_index | 162 | NULL | 22016957 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
Edit:
Added two new indexes from comments:
(dateAdded, name, id)
(name, id)
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| 1 | SIMPLE | exampletable | index | date_added_index,date_added_name_id_index | id__name_date_added_index | 162 | NULL | 22040469 | Using where; Using index; Using temporary |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
Edit:
Added create table script.
CREATE TABLE `exampletable` (
`id` int(10) NOT NULL auto_increment,
`dateAdded` timestamp NULL default CURRENT_TIMESTAMP,
`name` varchar(50) character set utf8 default '',
PRIMARY KEY (`id`),
KEY `date_added_index` (`dateAdded`),
KEY `name_index` USING BTREE (`name`),
KEY `id_name_index` USING BTREE (`id`,`name`),
KEY `id_name_date_added_index` USING BTREE (`id`,`dateAdded`,`name`),
KEY `date_added_name_id_index` USING BTREE (`dateAdded`,`name`,`id`),
KEY `name_id_index` USING BTREE (`name`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=22046064 DEFAULT CHARSET=latin1
Edit:
Here is the Explain from the answer provided by HeavyE.
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1732 | Using temporary; Using filesort |
| 1 | PRIMARY | example1 | ref | date_added_index,name_index,date_added_name_id_index,name_id_index,name_date_added_index | date_added_name_id_index | 158 | maxDateByElement.dateAdded,maxDateByElement.name | 1 | Using where; Using index |
| 2 | DERIVED | exampletable | range | date_added_index,date_added_name_id_index | name_date_added_index | 158 | NULL | 1743 | Using where; Using index for group-by |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
There is a great Stack Overflow post on optimization of Selecting rows with the max value in a column: https://stackoverflow.com/a/7745635/633063
This seems a little messy but works very well:
SELECT example1.name, MAX(example1.id)
FROM exampletable example1
INNER JOIN (
select name, max(dateAdded) dateAdded
from exampletable
where dateAdded <= '2014-01-20 12:00:00'
group by name
) maxDateByElement on example1.name = maxDateByElement.name AND example1.dateAdded = maxDateByElement.dateAdded
GROUP BY name;
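Note that the EXPLAIN shown in the question's last edit uses an index named name_date_added_index, which is not in the posted CREATE TABLE, so presumably an index leading with (name, dateAdded) was added; a sketch of that assumption:
-- Lets the derived subquery be resolved with "Using index for group-by"
-- (a loose index scan over name, taking MAX(dateAdded) per name):
ALTER TABLE exampletable ADD INDEX name_date_added_index (name, dateAdded);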
Why are you using indexes on so many keys? If your WHERE clause contains only one column, then use that index only. Put indexes on dateAdded and on name separately, and then use them in the SQL statement like this:
SELECT MAX(id) as id, name
FROM exampletable
USE INDEX (date_added_index) USE INDEX FOR GROUP BY (name_index)
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;
Here is the link if you want to know more. Please let me know whether it gives some positive results or not.
If the WHERE clause makes no difference, then it's either the MAX(id) or the name.
I would test the indexes by eliminating MAX(id) completely, and see if the GROUP BY name is fast. Then I would add MIN(id) to see if it's any faster than MAX(id). (I have seen this make a difference.)
Also, you should test the ORDER BY NULL. Try ORDER BY name DESC, or ORDER BY name ASC. -- Clark Vera

Optimizing a large keyword table?

I have a huge table like
CREATE TABLE IF NOT EXISTS `object_search` (
`keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
with around 39 million rows (using over 1 GB of space) containing the indexed data for 1 million records in the object table (which object_id points at).
Now searching through this with a query like
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2
is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.
Its EXPLAIN looks like:
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| 1 | SIMPLE | search | ref | PRIMARY | PRIMARY | 42 | const | 345180 | 100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
The full EXPLAIN with the object, object_color and object_locale tables joined, with the above query run as a subquery to avoid overhead, looks like:
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 182544 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | search.object_id | 1 | 100.00 | |
| 2 | DERIVED | search | ref | PRIMARY | PRIMARY | 42 | | 345180 | 100.00 | Using where; Using index |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
My top goal would be to be able to scan through this within 1 or 2 seconds.
So, are there further techniques to improve search speed for keywords?
Update 2013-08-06:
Applying most of Neville K's suggestion I now have the following setup:
CREATE TABLE `object_search_keyword` (
`keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
CREATE TABLE `object_search` (
`keyword_id` int(10) unsigned NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword_id`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The new query's explain looks like this:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0 | | 1 | 100.00 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | object_search | ref | PRIMARY | PRIMARY | 4 | object_keyword.keyword_id | 2190225 | 100.00 | Using index |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
The many DERIVED rows come from the keyword-comparison subquery being nested inside another subquery, which does nothing but count the number of rows returned:
SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
SELECT *, @rn := @rn + 1
FROM (
SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
GROUP BY search.object_id HAVING hits = 2
) AS numrowswrapper
CROSS JOIN (SELECT @rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (turbo.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (turbo.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (turbo.object_id = locale.object_id)
ORDER BY timestamp_upload DESC
The above query actually runs within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search completes.
Any way to further optimize this?
Update 2013-08-07
The blocking thing seems almost certainly to be the appended ORDER BY statement. Without it, the query executes in less than a second.
So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.
Update 2013-08-07 later that day
Alright ladies and gentlemen, nesting the WHERE and ORDER BY statements in another layer of subquery, so that they don't touch tables they don't need, roughly doubled its performance again:
SELECT wowrapper.*, locale.title
FROM (
SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
SELECT *, @rn := @rn + 1
FROM (
SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
GROUP BY search.object_id HAVING hits = 1
) AS numrowswrapper
CROSS JOIN (SELECT @rn := 0) CONST
) AS search
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
WHERE 1
ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (wowrapper.object_id = locale.object_id)
LIMIT 0,48
Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.
Now this is already a massive improvement, but is there any chance to push this further?
Update 2013-08-08 early that day
Undid that last nested variation of the query, since it actually slowed down other variations of it...
I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).
Update 2013-08-08
Alright, a plain fulltext index doesn't really help.
Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to using a temporary table and filesort). Without it, a search completes in less than a second!
So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?
Full text search will be much faster than using the standard SQL string comparison features.
Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:
Keywords
--------
keyword_id
keyword
keyword_object
-------------
keyword_id
object_id
objects
-------
object_id
......
If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.
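A rough SQL sketch of that many-to-many layout (column sizes and index names are assumptions; the asker's 2013-08-06 update above follows the same idea):
CREATE TABLE keywords (
  keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  keyword VARCHAR(64) COLLATE latin1_german1_ci NOT NULL,
  PRIMARY KEY (keyword_id),
  UNIQUE KEY keyword_uq (keyword)       -- one row per distinct keyword (~100K rows)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE keyword_object (
  keyword_id INT UNSIGNED NOT NULL,
  object_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (keyword_id, object_id)   -- integer join keys instead of 39M string comparisons
) ENGINE=InnoDB;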
The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can set up a mirror table and update it with some events and triggers, or, if you have a slave replicating from your server, you can change its table to MyISAM and use it for searches.
For this query the only thing I can come up with is to rewrite it as:
SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.keyword = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.keyword = 'word3'
....
WHERE s1.keyword = 'word1'
and I'm not sure it will be faster this way.
Also you will need to have an index on object_id (assuming your PK is (keyword, object_id)).
If you have seldom INSERTs and often SELECTs, you could optimize your data for reads, i.e. recalculate the number of object_ids per keyword and store it directly in the database. The SELECTs would then be very fast; the INSERTs would take some seconds, though.
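A hedged sketch of that precomputation (the summary table name and refresh strategy are assumptions, not part of the original setup):
-- One row per keyword with its precomputed object count,
-- refreshed after inserts (e.g. by the application or a scheduled job):
CREATE TABLE keyword_counts (
  keyword VARCHAR(40) COLLATE latin1_german1_ci NOT NULL,
  object_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (keyword)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
REPLACE INTO keyword_counts (keyword, object_count)
SELECT keyword, COUNT(*) FROM object_search GROUP BY keyword;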

MySQL query to find items without matching record in JOIN is very slow

I've read a lot of questions about query optimization but none have helped me with this.
As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".
> show create table entries;
CREATE TABLE `entries` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
...
`name` varchar(255),
`updated_at` timestamp NOT NULL,
...
PRIMARY KEY (`id`),
KEY `name` (`name`)
) ENGINE=InnoDB
> show create table entry_categories;
CREATE TABLE `entry_categories` (
`ent_name` varchar(255),
`cat_id` int(11),
PRIMARY KEY (`ent_name`,`cat_id`),
KEY `names` (`ent_name`)
) ENGINE=InnoDB
(The actual "category" table doesn't come into the question.)
Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named Entries don't have a category, which seems really straightforward:
SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c
ON e.name=c.ent_name
WHERE c.ent_name IS NULL
GROUP BY e.name;
On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried
SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS(
SELECT ent_name
FROM entry_categories c
WHERE c.ent_name = e.name
)
GROUP BY e.name;
with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with
SELECT COUNT(*)
FROM entries e
JOIN (
SELECT ent_name as name
FROM entry_categories
WHERE cat_id = 123
)c
USING (name)
GROUP BY name;
runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?
I'll try to transcribe the EXPLAIN results for each query:
> EXPLAIN {no category query};
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | e | index | NULL | name | 767 | NULL | 6222 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | c | index | PRIMARY,names | names | 767 | NULL | 6906 | Using where; using index; Not exists |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
> EXPLAIN {single category query}
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key   | key_len | ref    | rows | Extra                           |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL  | NULL    | NULL   | 2850 | Using temporary; Using filesort |
|  1 | PRIMARY     | e          | ref   | name          | name  | 767     | c.name |    1 | Using where; Using index        |
|  2 | DERIVED     | c          | index | NULL          | names | NULL    | NULL   | 6906 | Using where; Using index        |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
Try:
select name, sum(e) count_entries from
(select name, 1 e, 0 c from entries
union all
select ent_name name, 0 e, 1 c from entry_categories) s
group by name
having sum(c) = 0
First: remove the names key, as it's the same as the primary key (the ent_name column is the left-most in the primary key, so the PK can be used to resolve the query); see the sketch after this answer. This should change the output of EXPLAIN so that the PK is used in the join.
The keys you are using to join are pretty large (a varchar(255) column). It is better if you can use integers for this, even if it means introducing one more table (with an id-to-name mapping).
For some reason the query uses filesort, even though you don't have an ORDER BY clause.
Can you show the explain results next to each query, and the single category query, for further diagnosis?
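A minimal sketch of the index cleanup suggested in the first point (check first that nothing else relies on the redundant index):
-- PRIMARY KEY (ent_name, cat_id) already covers lookups by ent_name,
-- so the separate `names` index is redundant:
ALTER TABLE entry_categories DROP INDEX names;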