Why would this query (and a number of similar variants) not use the index on ASIN on the 'tags' table? It insists on a full-table scan even when derived table A contains just a few rows. As the 'tags' table in production contains nearly a million entries, this is killing the query rather badly.
SELECT C.tag, count(C.tag) AS total
FROM
(
SELECT B.*
FROM
(
SELECT ASIN FROM requests WHERE user_id=9
) A
INNER JOIN tags B USING(ASIN)
) C
GROUP BY C.tag ORDER BY total DESC
EXPLAIN shows no index being used (run on a test DB, so the row count in 'tags' is low, but it is still a full table scan):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | <derived2> | system | NULL | NULL | NULL | NULL | 0 | const row not found |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 28 | |
| 2 | DERIVED | B | ALL | NULL | NULL | NULL | NULL | 2593 | Using where; Using join buffer |
| 3 | DERIVED | borrowing_requests | ref | idx_user_id | idx_user_id | 5 | | 27 | Using where |
Indexes:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
| book_tags | 1 | asin | 1 | ASIN | A | 432 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_tag | 1 | tag | A | 1296 | NULL | NULL | | BTREE | |
| book_tags | 1 | idx_updated_on | 1 | updated_on | A | 518 | NULL | NULL | | BTREE | |
The query was rewritten from an INNER JOIN, which had the same problem:
SELECT tag, count(tag) AS total
FROM tags
INNER JOIN requests ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC
EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | tags | ALL | NULL | NULL | NULL | NULL | 2593 | Using temporary; Using filesort |
| 1 | SIMPLE | requests | ref | idx_ASIN,idx_user_id | idx_ASIN | 33 | func | 3 | Using where |
I get the idea this is a really basic point I'm missing, but about four hours' work on it has got me nowhere. Any advice is welcome.
EDIT:
Thanks to some replies, I can see that the first query, with its sub-queries, won't use indexes; but I was using it because it ran twice as fast as the plain INNER JOIN query at the bottom.
As an example, there are 70k rows in requests (all with an indexed ASIN), and 700k rows in tags, with 95k different ASINs in tags, each with less than 10 different tag records.
If a user has 10 requests, I only want the tags from those 10 ASINs to be listed and counted. In my mind, this should use tags.idx_ASIN and should look up at most 100 rows from the tags table (10 ASINs, each with a maximum of 10 tags).
I'm missing something...I just can't see what.
EDIT:
requests CREATE TABLE:
CREATE TABLE IF NOT EXISTS `requests` (
`bid` int(40) NOT NULL AUTO_INCREMENT,
`user_id` int(20) DEFAULT NULL,
`ASIN` varchar(10) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` enum('active','inactive','pending','deleted','completed') COLLATE utf8_unicode_ci NOT NULL,
`added_on` datetime NOT NULL,
`status_changed_on` datetime NOT NULL,
`last_emailed` datetime DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`bid`),
KEY `idx_ASIN` (`ASIN`),
KEY `idx_status` (`status`),
KEY `idx_added_on` (`added_on`),
KEY `idx_user_id` (`user_id`),
KEY `idx_status_changed_on` (`status_changed_on`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=149380 ;
tags CREATE TABLE:
CREATE TABLE IF NOT EXISTS `tags` (
`ASIN` varchar(10) NOT NULL,
`tag` varchar(50) NOT NULL,
`updated_on` datetime NOT NULL,
KEY `idx_tag` (`tag`),
KEY `idx_updated_on` (`updated_on`),
KEY `idx_asin` (`ASIN`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
There is no primary key on tags. I don't usually have tables without primary keys, but didn't see the need on this one. Could this be an issue?
AHA! Different charsets and collations. I shall correct that and try again!
Later:
That got it. Query went down from 10secs to 0.006secs. Thanks to everyone for getting me to look at this differently.
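For anyone hitting the same wall: requests.ASIN is utf8 (utf8_unicode_ci) while tags.ASIN is latin1, so every comparison forces a charset conversion on the tags side and idx_asin can't be used for the join. A minimal sketch of one way to align them, assuming utf8 stays the canonical charset:
ALTER TABLE tags CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Once both sides of USING(ASIN) share a charset and collation, the join can seek into idx_asin.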
MySQL doesn't index subqueries: a derived table (a subquery in the FROM clause) is materialized without any indexes. If you want indexes to improve the performance of your queries, rewrite them to avoid subqueries.
Try reversing the order of the tables in your original query:
SELECT tag, count(tag) AS total
FROM requests
INNER JOIN tags ON requests.ASIN=tags.ASIN
WHERE user_id=9
GROUP BY tag
ORDER BY total DESC
Related
I have a query with 2 INNER JOIN statements that fetches only a few columns, but it is very slow even though I have indexes on all the required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
In order to find out what's wrong and why it is so slow, I used the EXPLAIN statement; here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the dysfonctionnements JOIN is the culprit: it doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query, along with the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement I can see it is using keys as I expect it to:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions; I see no reason for it not to use the key when it does so on a very similar query, and using the key definitely makes it faster...
I had a similar experience when the MySQL optimiser selected a join order that was far from optimal. At the time, I used the MySQL-specific STRAIGHT_JOIN operator to override the default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
Also, one of the REGEXP conditions in your WHERE clause could probably be changed to an IN operator, which I assume can use an index.
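For example, lifting the values straight from the REGEXP above (only equivalent if pe_nom holds exactly these names, since REGEXP also matches substrings):
AND pe_nom IN ('bordeaux','chambéry-annecy','grenoble','lyon','marseille','metz','montpellier','nancy','nice','nimes','rouen','strasbourg','toulon','toulouse','vitry','vitry bis 1','vitry bis 2','vlg')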
Remove com.prestataireLAD REGEXP '.*'. The optimizer probably won't realize that this has no impact on the result set. If you are dynamically building the WHERE clause, eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help (a DDL sketch follows):
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)
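Expressed as DDL, those suggestions might look like this (the index names are illustrative):
ALTER TABLE dysfonctionnements DROP INDEX id_commande_2; -- redundant with the UNIQUE key
ALTER TABLE commandes ADD INDEX idx_livraison_commande_pe (date_livraison, id_commande, code_pe);
ALTER TABLE pe ADD INDEX idx_nom_id (pe_nom, pe_id);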
I'm faced with a MySQL database which contains an events table with ~70 million rows, which has foreign keys to other tables and is used to generate reports. Constructing a performant query that selects (while counting/summing values) and groups data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query is fine on an event table with a few thousand rows and demonstrates the desired outcome:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't well indexed, and although I've tried indexing with things like ALTER TABLE event ADD INDEX idx_test (venue_id, class_id, start_time); to improve performance, it's still incredibly slow (I tend to abort queries once they're past the 10-minute mark, so I don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide), so that I could hold a separate set of summarized data, broken down by day, with occurrences and total duration calculated/incremented on each addition to the table (IODKU). However, I'm then also up against creating the per-day rows in the summary table based on what the database considers a day (UTC), which may not match the application's "day" due to the timezone offset.
Short of converting the start_time column to a TIMESTAMP type (which would then be inconsistent with all the other date types in the database), is there any way around this, or is there any other optimization I could make to the original event table that would result in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
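That is 134217728 bytes = 128MB, the MySQL default, which is small for a table of ~60M rows. If the host has spare RAM, raising the buffer pool (dynamic since MySQL 5.7.5) may help; the figure below is purely illustrative and assumes several GB of free memory:
SET GLOBAL innodb_buffer_pool_size = 4294967296; -- hypothetical 4GB; size to ~70% of RAM on a dedicated DB host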
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various ordering of columns to test indexes and found idx_event_venueid_classid_starttime (below), to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, using JOIN syntax; it now uses CONVERT_TZ to convert from UTC to the timezone required for reporting, and then groups by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query now takes ~1m 20s to run, so I figured I could prepend it with an INSERT INTO to populate a summary table, with the dates being timezone-specific, and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
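A sketch of the nightly load, assuming the whole aggregate is recomputed each run (the IODKU clause overwrites rows whose totals changed); restricting the SELECT to recent start_time values would cut the nightly cost further, at the price of handling the timezone boundary carefully:
INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT
    DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
    class.name, client.name, venue.name,
    COUNT(class.id), SUM(class.duration)
FROM event
JOIN class ON class.id = event.class_id
JOIN venue ON venue.id = event.venue_id
JOIN client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name
ON DUPLICATE KEY UPDATE occurrences = VALUES(occurrences), duration = VALUES(duration);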
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. It needs to scan the entire event table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and the query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
I have a MySQL table where all status changes are recorded. I want to be able to query the status of all items as of a specific date, or the latest status for all items. The table I have now is:
CREATE TABLE `tra_rel_sta` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tra_id` int(11) DEFAULT NULL,
`sta_id` int(11) DEFAULT NULL,
`changed_on` datetime DEFAULT NULL,
`changed_by` int(11) DEFAULT NULL,
`comments` text,
PRIMARY KEY (`id`),
KEY `tra_id` (`tra_id`),
KEY `rel` (`tra_id`,`sta_id`,`changed_on`),
KEY `sta_id` (`sta_id`),
KEY `changed_on` (`changed_on`),
KEY `tra_changed` (`tra_id`,`changed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=51734 DEFAULT CHARSET=utf8;
(I know I'm probably overdoing the indexes, but I haven't exactly figured out how to optimize indexes yet).
The query I'm using now, which works is:
SELECT rel.changed_on, rel.changed_by, rel.tra_id, sta.id AS sta_id, sta.status, sta.description, sta.onHold, sta.awaitingApproval, sta.approved, sta.complete, sta.locked
FROM (
SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
GROUP BY tra_id
) AS rec
LEFT JOIN tra_rel_sta AS rel ON rel.changed_on = rec.lst AND rel.tra_id = rec.tra_id
LEFT JOIN tra_status AS sta ON sta.id = rel.sta_id
If I want to use a specific date, I insert a WHERE statement in the sub-query.
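For example, to get each item's last status as of a given (hypothetical) cut-off date:
SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
WHERE changed_on <= '2015-06-01 23:59:59'
GROUP BY tra_id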
This works, but it takes about 0.65 seconds to run from PHP with about 51,733 records in the table. This query is used as a sub-query in several others whenever I need to know the last status of an object, and as a result it is slowing down many parts of the application.
I've tried using a sub-query in the WHERE statement as described in MySQL: how to select record with latest date before a certain date, but it takes almost twice as long. I've tried using a JOIN as described in MySQL select of record with latest date, but I get about the same or slightly slower results.
How can I optimize this query or fix my indexes to make this more effective?
Thanks!!
As requested, EXPLAIN of query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---|-------------|-------------|--------|-----------------------------------|---------|---------|-------------------|-------|-------------
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 49931 | NULL
1 | PRIMARY | rel | ref | tra_id,rel,changed_on,tra_changed | tra_id | 5 | rec.tra_id | 1 | Using where
1 | PRIMARY | sta | eq_ref | PRIMARY | PRIMARY | 4 | csinfo.rel.sta_id | 1 | NULL
2 | DERIVED | tra_rel_sta | index | tra_id,rel,tra_changed | tra_id | 5 | NULL | 49931 | NULL
I've got this MySQL query:
SELECT DISTINCT post.postId,hash,previewUrl,lastRetrieved
FROM post INNER JOIN (tag as t1,taggedBy as tb1,tag as t2,taggedBy as tb2,tag as t3,taggedBy as tb3)
ON post.id=tb1.postId AND tb1.tagId=t1.id AND post.id=tb2.postId AND tb2.tagId=t2.id AND post.id=tb3.postId AND tb3.tagId=t3.id
WHERE ((t1.name="a" AND t2.name="b") OR t3.name="c")
ORDER BY post.postId DESC LIMIT 0,100;
It takes around 15 seconds to run that query, whereas the same query without DISTINCT takes less than a second.
EXPLAIN output for the query with DISTINCT:
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
| 1 | SIMPLE | post | index | PRIMARY | postId | 4 | NULL | 1 | Using temporary |
| 1 | SIMPLE | tb1 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb1.tagId | 1 | Distinct |
| 1 | SIMPLE | tb2 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb2.tagId | 1 | Distinct |
| 1 | SIMPLE | tb3 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index; Distinct |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb3.tagId | 1 | Using where; Distinct |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-----------------------+
7 rows in set (0.01 sec)
EXPLAIN output for the query without DISTINCT:
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
| 1 | SIMPLE | post | index | PRIMARY | postId | 4 | NULL | 1 | NULL |
| 1 | SIMPLE | tb1 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb1.tagId | 1 | NULL |
| 1 | SIMPLE | tb2 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb2.tagId | 1 | NULL |
| 1 | SIMPLE | tb3 | ref | PRIMARY,tagId | PRIMARY | 4 | e621datamirror.post.id | 13 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY,name,name_2 | PRIMARY | 4 | e621datamirror.tb3.tagId | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+------+-------------+
CREATE TABLE `post` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`postId` int(11) NOT NULL,
`hash` varchar(32) COLLATE utf8_bin NOT NULL,
`previewUrl` varchar(512) COLLATE utf8_bin NOT NULL,
`lastRetrieved` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `postId` (`postId`),
UNIQUE KEY `hash` (`hash`),
KEY `postId_2` (`postId`),
KEY `postId_3` (`postId`)
) ENGINE=InnoDB AUTO_INCREMENT=692561 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `tag` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`),
KEY `name_2` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=157876 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `taggedBy` (
`postId` int(11) NOT NULL,
`tagId` int(11) NOT NULL,
PRIMARY KEY (`postId`,`tagId`),
KEY `tagId` (`tagId`),
CONSTRAINT `taggedBy_ibfk_1` FOREIGN KEY (`postId`) REFERENCES `post` (`id`) ON DELETE CASCADE,
CONSTRAINT `taggedBy_ibfk_2` FOREIGN KEY (`tagId`) REFERENCES `tag` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
What causes this query to be so slow? How can I speed it up?
I hope I've given enough information for you to give me some meaningful answers; if I've left something out, I'll be happy to add it.
Several things are being discussed, even in #SlimGhost's reasonable (but deleted) answer.
DISTINCT vs GROUP BY
Although GROUP BY can sometimes be used to replace DISTINCT, don't do it; they are meant for different things.
They both require some form of extra effort. (I'll get to the 10x later.) Both have to discover common values -- either in the entire row (for DISTINCT) or for the grouped items. This can be done in one of at least two ways. (Probably most engines have these options built in.) Note that the DISTINCT or GROUP BY must logically come after WHERE, but before ORDER BY and LIMIT.
Keep some kind of internal associative array as the output is being generated. This is practical if the optimizer can see that there won't be "too many" possible different values.
Sort the output; then dedup or group in a pass over the output. This works regardless of size.
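As a tiny illustration of the distinction, using the tag table from the question (both return one row per name here, but only GROUP BY is meant to feed aggregates):
SELECT DISTINCT name FROM tag;            -- dedups the select list
SELECT name, COUNT(*) FROM tag GROUP BY name;  -- grouping exists to drive aggregates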
ORDER BY + LIMIT
Notice that the query is doing DISTINCT over 4 columns: post.postId, hash, previewUrl, lastRetrieved. It is not obvious whether these are all in post or scattered across the 7 tables. (Please clarify by qualifying every column.)
Let's assume the JOINs need to be done to find the 4 columns.
Let's say there is no DISTINCT. Now, the operations are
Walk through post in ORDER BY post.postID order.
For each such row, do the JOINs and check the WHERE.
After 100 rows have passed the WHERE, stop.
But with DISTINCT, the optimizer can't make such a simplifying assumption in order to stop short. Instead:
Walk through post in ORDER BY post.postID order. (Starting with t1/t2/t3 is out of the question because of OR.) Actually, it is unclear whether the optimizer would bother going in this order.
For each such row, do the JOINs and check the WHERE.
Do something about DISTINCT.
After 100 rows have passed the WHERE, stop. Note: This may involve lots more rows from post (perhaps 10x?)
Keep in mind that the optimizer knows nothing about whether postId is 1:1 with hash, etc. So, it can't make simplifying assumptions. Suppose there were 200 rows in the JOIN with the smallest postId, and the hash happened to be in descending order. Smells like a need for a "sort".
EXPLAIN FORMAT=JSON SELECT ... might give you some of these details.
Ouch. You have both an id and UNIQUE(postId)? Get rid of id and turn postId into the PRIMARY KEY. This, alone, may speed things up.
What is the hash a hash of?
Please use the JOIN ... ON ... syntax.
You have 3 indexes on postId; get rid of the extra two.
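Dropping the two duplicate indexes is a one-liner:
ALTER TABLE post DROP INDEX postId_2, DROP INDEX postId_3;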
Why use DISTINCT?
Now that I see that all the SELECTed columns come from the one table, and that they will obviously be easily made distinct, why even consider using DISTINCT?
(updates)
JOIN ON
FROM post INNER JOIN (tag as t1,taggedBy as tb1,...
ON post.id=tb1.postId AND tb1.tagId=t1.id AND ...
-->
FROM post
JOIN taggedBy AS tb1 ON tb1.postId = post.id
JOIN tag AS t1 ON t1.id = tb1.tagId
JOIN taggedBy AS tb2 ON tb2.postId = post.id
JOIN tag AS t2 ON t2.id = tb2.tagId
... (each ON is next to the JOIN it applies to, and each tag joins to its own taggedBy row)
A speedup technique
SELECT p2.postId, p2.hash, p2.previewUrl, p2.lastRetrieved
FROM (
SELECT DISTINCT postId -- Only the PRIMARY KEY
FROM post
JOIN ... etc
WHERE ... ...
ORDER BY postId
LIMIT 100
) x
JOIN post AS p2 ON x.postId = p2.postId -- self join for getting the rest of the fields
ORDER BY x.postId -- assuming you need the ordering
This puts the DISTINCT in the inner query, where you are fetching only the one column (postId). (I am not sure whether this technique will help much in your case.)
We have a simple database with 4 tables: files, file_versions, users, organizations.
I select all files owned by a given organization, with a condition on the trashing date, using this query:
select * FROM organizations o
LEFT JOIN users u ON o.id=u.organization_id
LEFT JOIN files f ON u.user_identity=f.owner_identity
LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity
AND f.local_path=fv.local_path
WHERE o.id=2001237 AND o.trashed_file_age_limit>=1
AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
EXPLAIN shows me that the optimizer chooses a table order different from the query order (organizations -> users -> files -> file_versions):
mysql> explain select * FROM organizations o LEFT JOIN users u ON o.id=u.organization_id LEFT JOIN files f ON u.user_identity=f.owner_identity LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | f | ALL | PRIMARY | NULL | NULL | NULL | 109615125 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY,identity,organization_id | identity | 36 | filemirror.f.owner_identity | 1 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+--------+----------------------------------+----------+---------+----------------------------------------------------+-----------+-------------+
4 rows in set (0.01 sec)
Of course this query is slow because of the full scan of the files table, and I have to use STRAIGHT_JOIN (which is not equivalent to LEFT JOIN) to fix the table order and make the query faster.
mysql> explain select * FROM organizations o STRAIGHT_JOIN users u ON o.id=u.organization_id STRAIGHT_JOIN files f ON u.user_identity=f.owner_identity STRAIGHT_JOIN file_versions fv ON f.owner_identity=fv.owner_identity AND f.local_path=fv.local_path WHERE o.id=2001237 AND o.trashed_file_age_limit>=1 AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
| 1 | SIMPLE | o | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | u | ref | PRIMARY,identity,organization_id | PRIMARY | 4 | const | 36 | |
| 1 | SIMPLE | f | ref | PRIMARY | PRIMARY | 36 | filemirror.u.user_identity | 6089324 | Using where |
| 1 | SIMPLE | fv | ref | PRIMARY | PRIMARY | 3035 | filemirror.u.user_identity,filemirror.f.local_path | 1 | |
+----+-------------+-------+-------+----------------------------------+---------+---------+----------------------------------------------------+---------+-------------+
4 rows in set (0.00 sec)
My question is: why can MySQL change the table order in a non-symmetric join operation like LEFT JOIN?
Tables structure:
CREATE TABLE `file_versions` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`,`version_number`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `files` (
`owner_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
`local_path` varchar(999) character set utf8 NOT NULL,
`version_number` int(11) unsigned NOT NULL,
..
`trashing_date` int(11) default NULL,
...
PRIMARY KEY (`owner_identity`,`local_path`),
KEY `trashing_date` (`trashing_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `organizations` (
`id` int(11) NOT NULL,
...
`trashed_file_age_limit` int(11) default NULL,
...
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
CREATE TABLE `users` (
`organization_id` int(11) NOT NULL,
`id` int(11) NOT NULL,
`user_identity` char(36) character set latin1 collate latin1_bin NOT NULL,
...
PRIMARY KEY (`organization_id`,`id`),
UNIQUE KEY `identity` (`user_identity`),
KEY `organization_id` (`organization_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
MySQL version 5.5.
Look at the rows estimates: MySQL thinks it will need to read 109M rows of the files table in the first plan, versus 6M rows for each of 36 users = ~216M rows in the second plan. So it seems reasonable to it to read all 109M rows only once, in primary key order, instead of reading them in separate blocks. Those estimates do not seem very reasonable to me, so I would try running ANALYZE TABLE on files; but they are only estimates, so maybe you won't get better numbers.
Using LEFT JOIN and then putting a condition on the joined table into the WHERE clause turns it into an INNER JOIN, as Strawberry says in their comment: the row has to have a value for the WHERE condition to ever be true. So MySQL feels free to reorder those joins a bit; it may even look better to the optimizer to do the "really inner" joins first, which may be a second reason for that plan.
You can try using STRAIGHT_JOIN in a different way: if you put it just once, right after SELECT, your join order is used by the optimizer where possible (it usually is, barring some weird RIGHT JOINs and other corner cases) without changing the join type of specific tables. Used that way it acts as a sort of flag (much as SQL_NO_CACHE signals something) rather than as a special join type.
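Applied to the query from the question, the flag form keeps every LEFT JOIN intact:
SELECT STRAIGHT_JOIN * FROM organizations o
LEFT JOIN users u ON o.id=u.organization_id
LEFT JOIN files f ON u.user_identity=f.owner_identity
LEFT JOIN file_versions fv ON f.owner_identity=fv.owner_identity
AND f.local_path=fv.local_path
WHERE o.id=2001237 AND o.trashed_file_age_limit>=1
AND f.trashing_date<(1433943058 - o.trashed_file_age_limit*24*60*60);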
Then, to make it even better, you may try adding an index to files on (owner_identity, trashing_date), which should help localize the relevant files for each user, rather than globally as with the current key on (trashing_date) alone.
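A sketch of that index (the name is illustrative):
ALTER TABLE files ADD INDEX idx_owner_trashing (owner_identity, trashing_date);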