Why does DISTINCT make this query take 10x longer than without? - mysql

I've got this mysql query:
SELECT DISTINCT post.postId,hash,previewUrl,lastRetrieved
FROM post INNER JOIN (tag as t1,taggedBy as tb1,tag as t2,taggedBy as tb2,tag as t3,taggedBy as tb3)
ON post.id=tb1.postId AND tb1.tagId=t1.id AND post.id=tb2.postId AND tb2.tagId=t2.id AND post.id=tb3.postId AND tb3.tagId=t3.id
WHERE ((t1.name="a" AND t2.name="b") OR t3.name="c")
ORDER BY post.postId DESC LIMIT 0,100;
it takes around 15 seconds to run that query, whereas the same query without DISTINCT takes less than a second.
EXPLAIN output for the query with DISTINCT:
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
| 1 | SIMPLE | post | index | PRIMARY | postId | 4 | NULL | 1 | Using temporary |
| 1 | SIMPLE | tb1 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb1.tagId | 1 | Distinct |
| 1 | SIMPLE | tb2 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb2.tagId | 1 | Distinct |
| 1 | SIMPLE | tb3 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb3.tagId | 1 | Using where; Distinct |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
7 rows in set (0.01 sec)
EXPLAIN output for the query without DISTINCT:
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
| 1 | SIMPLE | post | index | PRIMARY | postId | 4 | NULL | 1 | NULL |
| 1 | SIMPLE | tb1 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb1.tagId | 1 | NULL |
| 1 | SIMPLE | tb2 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb2.tagId | 1 | NULL |
| 1 | SIMPLE | tb3 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb3.tagId | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
CREATE TABLE `post` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`postId` int(11) NOT NULL,
`hash` varchar(32) COLLATE utf8_bin NOT NULL,
`previewUrl` varchar(512) COLLATE utf8_bin NOT NULL,
`lastRetrieved` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `postId` (`postId`),
UNIQUE KEY `hash` (`hash`),
KEY `postId_2` (`postId`),
KEY `postId_3` (`postId`)
) ENGINE=InnoDB AUTO_INCREMENT=692561 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `tag` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`),
KEY `name_2` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=157876 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `taggedBy` (
`postId` int(11) NOT NULL,
`tagId` int(11) NOT NULL,
PRIMARY KEY (`postId`,`tagId`),
KEY `tagId` (`tagId`),
CONSTRAINT `taggedBy_ibfk_1` FOREIGN KEY (`postId`) REFERENCES `post` (`id`) ON DELETE CASCADE,
CONSTRAINT `taggedBy_ibfk_2` FOREIGN KEY (`tagId`) REFERENCES `tag` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
what causes this query to be so slow? how can I speed it up?
I hope I've given enough information so you guys can give me some meaningful answers. if I've left something out I'll be happy to add it.

Several things are being discussed, even in #SlimGhost's reasonable (but deleted) answer.
DISTINCT vs GROUP BY
Although GROUP BY can sometimes be used to replace DISTINCT, don't do it; they are meant for different things.
They both require some form of extra effort. (I'll get to the 10x later.) Both have to discover common values -- either in the entire row (for DISTINCT) or for the grouped items. This can be done in one of at least two ways. (Probably most engines have these options built in.) Note that the DISTINCT or GROUP BY must logically come after WHERE, but before ORDER BY and LIMIT.
Keep some kind of internal associative array as the output is being generated. This is practical if the the optimizer can see that there won't be "too many" possible different values.
Sort the output; then dedup or group in a pass over the output. This works regardless of size.
ORDER BY + LIMIT
Notice that the query is doing DISTINCT over 4 columns: post.postId, hash, previewUrl, lastRetrieved. It is not obvious whether these are all in post or scattered across the 7 tables. (Please clarify by qualifying every column.)
Let's assume the JOINs need to be done to find the 4 columns.
Let's say there is no DISTINCT. Now, the operations are
Walk through post in ORDER BY post.postID order.
For each such row, do the JOINs and check the WHERE.
After 100 rows have passed the WHERE, stop.
But with DISTINCT, the optimizer can't make such a simplifying assumption in order to stop short. Instead:
Walk through post in ORDER BY post.postID order. (Starting with t1/t2/t3 is out of the question because of OR.) Actually, it is unclear whether the optimizer would bother going in this order.
For each such row, do the JOINs and check the WHERE.
Do something about DISTINCT.
After 100 rows have passed the WHERE, stop. Note: This may involve lots more rows from post (perhaps 10x?)
Keep in mind that the optimizer knows nothing about whether postId is 1:1 with hash, etc. So, it can't make simplifying assumptions. Suppose there were 200 rows in the JOIN with the smallest postId, and the hash happened to be in descending order. Smells like a need for a "sort".
EXPLAIN FORMAT=JSON SELECT ... might give you some of these details.
Ouch. You have both a id and UNIQUE(postid)? Get rid of id and turn the postId into the PRIMARY KEY. This, alone, may speed things up.
What is the hash a hash of?
Please use the JOIN ... ON ... syntax.
You have 3 indexes on postId; get rid of the extra two.
Why use DISTINCT?
Now that I see that all the SELECTed columns come from the one table, and that they will obviously be easily made distinct, why even consider using DISTINCT.
(updates)
JOIN ON
FROM post INNER JOIN (tag as t1,taggedBy as tb1,...
ON post.id=tb1.postId AND tb1.tagId=t1.id AND ...
-->
FROM post
JOIN tag AS t1 ON post.id = tb1.postId
JOIN taggedBy AS tb1 ON tb2.tagId = t2.id
... (each ON is next to the JOIN it applies to)
A speedup technique
SELECT p2.postId, p2.hash, p2.previewUrl, p2.lastRetrieved
FROM (
SELECT DISTINCT postId -- Only the PRIMARY KEY
FROM post
JOIN ... etc
WHERE ... ...
ORDER BY postId
LIMIT 100
) x
JOIN post AS p2 ON x.postId = p2.id -- self join for getting rest of fields
ORDER BY x.postId -- assuming you need the ordering
This puts the DISTINCT in the inner query, where you are fetching only the one column (postId). (I am not sure whether this technique will help much in your case.)

Related

Mysql "Using temporary" but no official reason matched in this query

If you look at the official documentation for MySql temporary tables:
http://dev.mysql.com/doc/refman/5.1/en/internal-temporary-tables.html
The reasons given are:
The server creates temporary tables under conditions such as these:
Evaluation of UNION statements.
Evaluation of some views, such those that use the TEMPTABLE algorithm,
UNION, or aggregation.
Evaluation of statements that contain an ORDER BY clause and a
different GROUP BY clause, or for which the ORDER BY or GROUP BY
contains columns from tables other than the first table in the join queue.
Evaluation of DISTINCT combined with ORDER BY may require a temporary table.
For queries that use the SQL_SMALL_RESULT option, MySQL uses an
in-memory temporary table, unless the query also contains elements
(described later) that require on-disk storage.
Evaluation of multiple-table UPDATE statements.
Evaluation of GROUP_CONCAT() or COUNT(DISTINCT) expressions.
None of these conditions are met in this query:
select ttl.id AS id,
ttl.name AS name,
ttl.updated_at AS last_update_on,
ttl.user_id AS list_creator,
ttl.retailer_nomination_list AS nomination_list,
ttl.created_at AS created_on,
tv.name AS venue_name,
from haha_title_lists ttl
left join haha_title_list_to_users tltu on ((ttl.id = tltu.title_list_id))
left join users u on ((tltu.user_id = u.id))
left join users u2 on ((tltu.user_id = u2.id))
left join haha_title_list_to_venues tlv on ((ttl.id = tlv.title_list))
left join haha_venue_properties tvp on ((tlv.venue_id = tvp.id))
left join haha_venues tv on ((tvp.venue_id = tv.id))
join haha_title_list_to_books tlb on ((ttl.id = tlb.title_list_id))
join wawa_title ot on ((tlb.title_id = ot.title_id))
join wawa_title_to_author ota on ((ot.title_id = ota.title_id))
join wawa_author oa on ((ota.author_id = oa.author_id))
group by ttl.id;
For this table:
CREATE TABLE haha_title_lists (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
name varchar(255) DEFAULT NULL,
isbn varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
created_at datetime NOT NULL,
updated_at datetime NOT NULL,
user_id int(11) DEFAULT NULL,
list_note text,
retailer_nomination_list int(11) DEFAULT NULL,
PRIMARY KEY ( id )
) ENGINE=InnoDB AUTO_INCREMENT=460 DEFAULT CHARSET=utf8
I would expect the PRIMARY KEY to be used, since this table only matches on id. What would cause the use of a temporary table?
If I run EXPLAIN on this query I get:
+----+-------------+-------+--------+------------------------------------------------------------------------+---------------------------------------+---------+---------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------------------------------------------+---------------------------------------+---------+---------------------------------------+------+---------------------------------+
| 1 | SIMPLE | ttl | ALL | PRIMARY | NULL | NULL | NULL | 307 | Using temporary; Using filesort |
| 1 | SIMPLE | tltu | ref | idx_title_list_to_user | idx_title_list_to_user | 4 | wawa_ripple_development.ttl.id | 1 | Using index |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.tltu.user_id | 1 | Using index |
| 1 | SIMPLE | u2 | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.tltu.user_id | 1 | Using index |
| 1 | SIMPLE | tlb | ref | idx_title_list_to_books_title_id,idx_title_list_to_books_title_list_id | idx_title_list_to_books_title_list_id | 4 | wawa_ripple_development.ttl.id | 49 | Using where |
| 1 | SIMPLE | ot | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.tlb.title_id | 1 | Using index |
| 1 | SIMPLE | ota | ref | PRIMARY,title_id | title_id | 4 | wawa_ripple_development.ot.title_id | 1 | Using where; Using index |
| 1 | SIMPLE | oa | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.ota.author_id | 1 | Using index |
| 1 | SIMPLE | tlv | ALL | NULL | NULL | NULL | NULL | 175 | |
| 1 | SIMPLE | tvp | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.tlv.venue_id | 1 | |
| 1 | SIMPLE | tv | eq_ref | PRIMARY | PRIMARY | 4 | wawa_ripple_development.tvp.venue_id | 1 | |
+----+-------------+-------+--------+------------------------------------------------------------------------+---------------------------------------+---------+---------------------------------------+------+---------------------------------+
Why do I get "Using temporary; Using filesort"?
First, some comments...
The "using temp, using filesort" is often on the first line of the EXPLAIN, but the actual position of them could be anywhere. Furthermore there could be multiple tmps and/or sorts, even for a 1-table query. For example: ... GROUP BY aaa ORDER BY bbb may use one tmp for grouping and another for sorting.
In newer versions, you can do EXPLAIN FORMAT=JSON SELECT... to get a blow-by-blow account -- it will be clear there how many tmps and sorts there are.
"Filesort" is a misnomer. In many cases, the data may actually be collected in memory and sorted there. That is, "no file is harmed in the filming of the query". There are many reasons for deciding (either up-front, or later) to use a disk-based sort; I won't give those details in this answer. One way to check is SHOW STATUS LIKE 'Created_tmp%tables';. Another is via the slowlog.
Only recently have some UNIONs been improved to avoid tmp tables -- in obvious cases where they aren't needed. Alas, unions are still single-threaded.
Back to your question... Yes, your GROUP BY applies to the first table. But, for whatever reason, the optimizer chose to gather the data, then sort. The other option would have been to use the PRIMARY KEY(id) for ordering and grouping. Hmmm... I wonder what would happen if you added ORDER BY ttl.id? I'm guessing that the Optimizer is focusing on how to do the GROUP BY -- either by filesort or by collecting a hash in ram, and it decided that all the JOINs were too much to think through.

Mysql optimizer chooses wrong table order in query

We have simple database with 4 tables: files, file_versions, users, organizations.
I do select all files which owned by some organization with some condition on trashing date by this query:
select * FROM organizations o
LEFT JOIN users u ON o.id=u.organization_id
LEFT JOIN files f ON u.user_identity=f.owner_identity
LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity
AND f.local_path=fv.local_path
WHERE o.id=2001237 AND o.trashed_file_age_limit>=1
AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
Explain select shows me that optimizer choose wrong table order, which is different from query order(organizations-> users->files->file_versions):
mysql> explain select * FROM organizations o LEFT JOIN users u ON o.id=u.organization_id LEFT JOIN files f ON u.user_identity=f.owner_identity LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | f | ALL | PRIMARY | NULL | NULL | NULL | 109615125 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY,identity,organization_id | identity | 36 | filemirror.f.owner_identity | 1 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
4 rows in set (0.01 sec)
Of couse this query is slow because of full scan by files table and I have to use STRAIGHT_JOIN(which is not equivalent to LEFT JOIN) to fix table order and make query faster.
mysql> explain select * FROM organizations o STRAIGHT_JOIN users u ON o.id=u.organization_id STRAIGHT_JOIN files f ON u.user_identity=f.owner_identity STRAIGHT_JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | u | ref | PRIMARY,identity,organization_id | PRIMARY | 4 | const | 36 | |
| 1 | SIMPLE | f | ref | PRIMARY | PRIMARY | 36 | filemirror.u.user_identity | 6089324 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
4 rows in set (0.00 sec)
My question is why mysql can change table order in not symmetric join operation?
Tables structure:
CREATE TABLE `file_versions` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`,`version_number`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `files` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
..
`trashing_date` int(11) default NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`),
KEY `trashing_date` (`trashing_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `organizations` (
`id` int(11) NOT NULL,
...
`trashed_file_age_limit` int(11) default NULL,
...
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `users` (
`organization_id` int(11) NOT NULL,
`id` int(11) NOT NULL,
`user_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
...
PRIMARY KEY (`organization_id`,`id`),
UNIQUE KEY `identity` (`user_identity`),
KEY `organization_id` (`organization_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
Mysql version 5.5
Look at the rows estimates, mysql thinks that it will need to read 109M rows of files table in first plan and 6M for each of 36 users = 216M rows for second plan. So it seems reasonable to read all 109M rows only once and in priamry key order instead reading them in separate blocks.. Those estimates does not seem very reasonable to me, so I would try running analyze table on files, but they are estimates so maybe you wont get better numbers.
Using LEFT join and then adding condition on the table to WHERE turns it into INNER join as Strawberry says in their comment - you have to have value for the where condition to ever be true, so mysql feels free to reorder those a bit, maybe even it seems better for optimizer to do "really-inner" joins first, so that may be second reason for that plan.
You can try using STRAIGHT_JOIN in different way - if you put it just once right after SELECT, then your join order is used by optimizer if possible (it usually is barring some weird right joins and other corner cases) without changing join type on specific tables (it is then used as sort of FLAG, in the way SQL_NO_CACHE is used to signalize something, instead of as special join type)
Then to make it even better, you may try adding index to files on (owner_identity, trashing_date) which should help in localizing specific files for each user and not globally as with current key on (trashing_date) only.

Optimizing a large keyword table?

I have a huge table like
CREATE TABLE IF NOT EXISTS `object_search` (
`keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
with around 39 million rows (using over 1 GB space) containing the indexed data for 1 million records in the object table (where object_id points at).
Now searching through this with a query like
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2
is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.
It's explain looks like:
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| 1 | SIMPLE | search | ref | PRIMARY | PRIMARY | 42 | const | 345180 | 100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
The full explain with joined object and object_color and object_locale table, while the above query is run in a subquery to avoid overhead, looks like:
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 182544 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | search.object_id | 1 | 100.00 | |
| 2 | DERIVED | search | ref | PRIMARY | PRIMARY | 42 | | 345180 | 100.00 | Using where; Using index |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
My top goal would be to be able to scan through this within 1 or 2 seconds.
So, are there further techniques to improve search speed for keywords?
Update 2013-08-06:
Applying most of Neville K's suggestion I now have the following setup:
CREATE TABLE `object_search_keyword` (
`keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
CREATE TABLE `object_search` (
`keyword_id` int(10) unsigned NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword_id`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The new query's explain looks like this:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0 | | 1 | 100.00 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | object_search | ref | PRIMARY | PRIMARY | 4 | object_keyword.keyword_id | 2190225 | 100.00 | Using index |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
The many derives are coming from the keyword comparing subquery being nested into another subquery which does nothing but count the amount of rows returned:
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.object_id, COUNT(turbo.object_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
GROUP BY search.object_id HAVING hits = 2
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC
The above query will actually run within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search goes down.
Any way to further optimize this?
Update 2013-08-07
The blocking thing seems almost certainly to be the appended ORDER BY statement. Without it, the query executes in less than a second.
So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.
Update 2013-08-07 later that day
Alright ladies and gentlemen, nesting the WHERE and ORDER BY statements in another layer of subquery to not let it bother with tables it doesn't need roughly doubled it's performance again:
SELECT wowrapper.*, locale.title
FROM (
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.media_id, COUNT(search.media_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
GROUP BY search.media_id HAVING hits = 1
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS search
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
WHERE 1
ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (jfwrapper.object_id = locale.object_id)
LIMIT 0,48
Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.
Now this is already a massive improvement, but is there any chance to push this further?
Update 2013-08-08 early that day
Undid that last nested variation of the query, since it actually slowed down other variations of it...
I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).
Update 2013-08-08
Alright, a plain fulltext index doesnt really help.
Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to using a temporary table and filesort). Without it a search is complete within less than a second!
So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?
Full text search will be much faster than using the standard SQL string comparison features.
Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:
Keywords
--------
keyword_id
keyword
keyword_object
-------------
keyword_id
object_id
objects
-------
object_id
......
If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.
The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can setup a mirror table and update it with some events and triggers or if you have a slave replicating from your server you can change its table to MyISAM and use it for searches.
For this query the only thing I can come up with is to rewrite it as:
SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'
and I'm not sure it will be faster this way.
Also you will need to have an index on object_id (assuming your PK is (key_word, object_id)).
If you have seldom INSERTs and often SELECTs you could optimize your data for the reads i.e. recalculate the number of object_ids per keyword and directly store it in the database. The SELECTs would then be very fast, the INSERTs would take some seconds though,.

MySQL query to find items without matching record in JOIN is very slow

I've read a lot of questions about query optimization but none have helped me with this.
As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".
> show create table entries;
CREATE TABLE `entries` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
...
`name` varchar(255),
`updated_at` timestamp NOT NULL,
...
PRIMARY KEY (`id`),
KEY `name` (`name`)
) ENGINE=InnoDB
> show create table entry_categories;
CREATE TABLE `entry_categories` (
`ent_name` varchar(255),
`cat_id` int(11),
PRIMARY KEY (`ent_name`,`cat_id`),
KEY `names` (`ent_name`)
) ENGINE=InnoDB
(The actual "category" table doesn't come into the question.)
Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named Entries don't have a category, which seems really straightforward:
SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c
ON e.name=c.ent_name
WHERE c.ent_name IS NUL
GROUP BY e.name;
On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried
SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS(
SELECT ent_name
FROM entry_categories c
WHERE c.ent_name = e.name
)
GROUP BY e.name;
with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with
SELECT COUNT(*)
FROM entries e
JOIN (
SELECT ent_name as name
FROM entry_categories
WHERE cat_id = 123
)c
USING (name)
GROUP BY name;
runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?
I'll try to transcribe the EXPLAIN results for each query:
> EXPLAIN {no category query};
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | e | index | NULL | name | 767 | NULL | 6222 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | c | index | PRIMARY,names | names | 767 | NULL | 6906 | Using where; using index; Not exists |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
> EXPLAIN {single category query}
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2850 | Using temporary; Using filesort |
| 1 | PRIMARY | e | ref | name | 767 | c.name | 1 | Using where; Using index | |
| 2 | DERIVED | c | index | NULL | names | NULL | 6906 | Using where; Using index | |
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
Try:
select name, sum(e) count_entries from
(select name, 1 e, 0 c from entries
union all
select ent_name name, 0 e, 1 c from entry_categories) s
group by name
having sum(c) = 0
First: remove the names key as it's the same as the primary key (as the ent_name column is the left-most in the primary key and the PK can be used to resolve the query). This should change the output of explain by using the PK in the join.
The keys you are using to join are pretty large (255 varchar column) - it is better if you can use integers for this, even if this mean introducing one more table (with the room_id, room_name mapping)
For some reason the query uses filesort, despite that you don't have an order by clause.
Can you show the explain results next to each query, and the single category query, for further diagnosis?

Inner join will not use index

Why would this query (and a number of similar variants) not use the index for ASIN on the 'tags' table? It insists on a full-table scan even when A contains just a few rows. As 'tags' table on production contains nearly a million entries, it's killing the query rather badly.
SELECT C.tag, count(C.tag) AS total
FROM
(
SELECT B.*
FROM
(
SELECT ASIN FROM requests WHERE user_id=9
) A
INNER JOIN tags B USING(ASIN)
) C
GROUP BY C.tag ORDER BY total DESC
EXPLAIN shows no index being used (run on test DB so rows in 'tags' is low, but still a full table scan):
| 1 | PRIMARY | <derived2> | system | NULL | NULL | NULL | NULL | 0 | const row not found |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 28 | |
| 2 | DERIVED | B | ALL | NULL | NULL | NULL | NULL | 2593 | Using where; Using join buffer |
| 3 | DERIVED | borrowing_requests | ref | idx_user_id | idx_user_id | 5 | | 27 | Using where
Indexes:
| book_tags | 1 | asin | 1 | ASIN | A | 432 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_tag | 1 | tag | A | 1296 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_updated_on | 1 | updated_on | A | 518 | NULL | NULL | | BTREE
The query was rewritten from an INNER JOIN which was having the same problem:
SELECT tag, count(tag) AS total
FROM tags
INNER JOIN requests ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC
EXPLAIN:
| 1 | SIMPLE | tags | ALL | NULL | NULL | NULL | NULL | 2593 | Using temporary; Using filesort |
| 1 | SIMPLE | requests | ref | idx_ASIN,idx_user_id | idx_ASIN | 33 | func | 3 | Using where
I get the idea this is a real basic point I'm missing, but about 4 hours work on it has got me nowhere. Any advice is welcome.
EDIT:
I can see that the first query using sub-queries won't use indexes thanks to some replies, but this was being used as it ran twice as quick as the bottom query with just the INNER JOIN.
As an example, there are 70k rows in requests (all with an indexed ASIN), and 700k rows in tags, with 95k different ASINs in tags, each with less than 10 different tag records.
If a user has 10 requests, I only want the tags from those 10 ASINs to be listed and counted. In my mind, this should use tags.idx_ASIN and should lookup 100 rows (10 ASINs, each with max of 10 tags) at most from the tags table.
I'm missing something...I just can't see what.
EDIT:
requests CREATE TABLE:
CREATE TABLE IF NOT EXISTS `requests` (
`bid` int(40) NOT NULL AUTO_INCREMENT,
`user_id` int(20) DEFAULT NULL,
`ASIN` varchar(10) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` enum('active','inactive','pending','deleted','completed') COLLATE utf8_unicode_ci NOT NULL,
`added_on` datetime NOT NULL,
`status_changed_on` datetime NOT NULL,
`last_emailed` datetime DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`bid`),
KEY `idx_ASIN` (`ASIN`),
KEY `idx_status` (`status`),
KEY `idx_added_on` (`added_on`),
KEY `idx_user_id` (`user_id`),
KEY `idx_status_changed_on` (`status_changed_on`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=149380 ;
tags CREATE TABLE
CREATE TABLE IF NOT EXISTS `tags` (
`ASIN` varchar(10) NOT NULL,
`tag` varchar(50) NOT NULL,
`updated_on` datetime NOT NULL,
KEY `idx_tag` (`tag`),
KEY `idx_updated_on` (`updated_on`),
KEY `idx_asin` (`ASIN`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
There is no primary key on tags. I don't usually have tables without primary keys, but didn't see the need on this one. Could this be an issue?
AHA! Different charsets and collations. I shall correct that and try again!
Later:
That got it. Query went down from 10secs to 0.006secs. Thanks to everyone for getting me to look at this differently.
MySQL doesn't index subqueries. If you want indexes to improve performance of your queries, rewrite them to not use subqueries.
Try reversing the order of the tables in your original query:
SELECT tag, count(tag) AS total
FROM requests
INNER JOIN tags ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC
AHA! Different charsets and collations. I shall correct that and try again!
Later:
That got it. Query went down from 10secs to 0.006secs. Thanks to everyone for getting me to look at this differently.