select slow because of unused inner join - mysql

I have two tables:
CREATE TABLE `A` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
CREATE TABLE `B` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`a_id` int(11) NOT NULL,
`c_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `IX_a_id` (`a_id`),
KEY `IX_c_id` (`c_id`),
CONSTRAINT `a_id_ibfk_1` FOREIGN KEY (`a_id`) REFERENCES `A` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
They have a couple million rows each.
explain select count(*) FROM B inner join A on B.a_id = A.id WHERE B.c_id = 7;
+----+-------------+-------+--------+-----------------------+------------+---------+--------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------+------------+---------+--------------------+--------+-------------+
| 1 | SIMPLE | B | ref | IX_a_id,IX_c_id | IX_c_id | 4 | const | 116624 | Using where |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 4 | test1.B.a_id | 1 | Using index |
+----+-------------+-------+--------+-----------------------+------------+---------+--------------------+--------+-------------+
Now, I can't understand why MySQL is unable to ignore the unneeded inner join to A, which kills performance. That is, the following query is equivalent to the one above:
select count(*) from B where B.c_id = 7
which should be easy to infer, since B.a_id can't be NULL and has a foreign key constraint referencing the unique key A.id.
Is there a way to make mysql understand this ?

SQL statements that include a join may scan both tables and load them into temporary storage before doing the join. That is, both tables are loaded into a temporary table, sorted, unnecessary duplicates removed, and the results returned. You can speed this up by using sub-queries as the tables to join.
For example, if you did the following, you would be unlikely to hit that performance penalty:
select count(*) FROM
(select * from B where B.c_id = 7 and B.a_id is not Null) as sub_b
inner join
(select * from A where A.id is not Null) as sub_a
on (sub_b.a_id = sub_a.id);
This is because SQL is behaving as you would expect, pre-filtering the results before the join and thus loading less into its temporary table. You can speed this up further with indexes on A.id, and on B.c_id and B.a_id.
If you have a large number of columns, reducing the number of columns returned in the queries will also speed it up:
select count(sub_b.a_id) FROM
(select a_id from B where B.c_id = 7 and B.a_id is not Null) as sub_b
inner join
(select id from A where A.id is not Null) as sub_a
on (sub_b.a_id = sub_a.id);
----edit----
A good point was made below: the second sub-select is unnecessary. The first sub-select ensures that the WHERE clause we care about is applied first.
Try this:
select count(sub_b.a_id) FROM
(select a_id from B where B.c_id = 7) as sub_b
inner join A on (sub_b.a_id = A.id);
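To sanity-check the equivalence the question relies on: since B.a_id is NOT NULL and the foreign key guarantees exactly one matching row in A, the join can neither add nor drop rows, so both of these should return the same count:
-- Both counts should match; if they don't, the FK assumption is wrong.
select count(*) from B inner join A on B.a_id = A.id where B.c_id = 7;
select count(*) from B where B.c_id = 7;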

Now, I can't understand why MySQL is unable to ignore the unneeded inner join to A, which kills performance.
Addressing that implied question...
File a bug report at http://bugs.mysql.com to suggest such an optimization.

Related

Improving slow MariaDB query performance

I have what seems to be a fairly straightforward query, but it is super slow and I would like to improve its performance if I can.
SELECT `contacts`.`unit_id`, `contacts`.`owner_id`, `units`.`description`,
`units`.`address`, `owners`.`name`, `owners`.`email`, COUNT(*) AS contact_count
FROM `contacts`
LEFT JOIN `units` ON `contacts`.`unit_id` = `units`.`id`
LEFT JOIN `owners` ON `contacts`.`owner_id` = `owners`.`id`
WHERE `owners`.`group_id` = 6
AND `contacts`.`checkin` BETWEEN '2021-10-01 00:00:00' AND '2021-10-31 23:59:59'
GROUP BY `units`.`id`
ORDER BY `contact_count` DESC
LIMIT 20;
I'm just trying to get the units with the most contacts in a given date range, and belonging to a certain group of owners.
+------+-------------+----------+--------+--------------------------------------------------+---------------------------+---------+-------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+--------+--------------------------------------------------+---------------------------+---------+-------------------------+------+---------------------------------+
| 1 | SIMPLE | owners | ref | PRIMARY,owners_group_id_foreign | owners_group_id_foreign | 4 | const | 1133 | Using temporary; Using filesort |
| 1 | SIMPLE | contacts | ref | contacts_checkin_index,contacts_owner_id_foreign | contacts_owner_id_foreign | 4 | appdb.owners.id | 1145 | Using where |
| 1 | SIMPLE | units | eq_ref | PRIMARY | PRIMARY | 4 | appdb.contacts.unit_id | 1 | |
+------+-------------+----------+--------+--------------------------------------------------+---------------------------+---------+-------------------------+------+---------------------------------+
As near as I can tell, everything that should be indexed is:
CREATE TABLE `contacts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner_id` int(10) unsigned NOT NULL,
`unit_id` int(10) unsigned NOT NULL,
`terminal_id` int(10) unsigned NOT NULL,
`checkin` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `contacts_checkin_index` (`checkin`),
KEY `contacts_unit_id_foreign` (`unit_id`),
KEY `contacts_terminal_id_foreign` (`terminal_id`),
KEY `contacts_owner_id_foreign` (`owner_id`),
CONSTRAINT `contacts_unit_id_foreign` FOREIGN KEY (`unit_id`) REFERENCES `units` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `contacts_terminal_id_foreign` FOREIGN KEY (`terminal_id`) REFERENCES `terminals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `contacts_owner_id_foreign` FOREIGN KEY (`owner_id`) REFERENCES `owners` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=25528530 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The contacts table currently has about 10 million rows, and this query takes about 4 minutes to run. Is this anything that can be improved significantly or am I just bumping up against limitations of my hardware at this point?
SELECT sub.unit_id, sub.owner_id, u.`description`, u.`address`,
sub.name, sub.email,
sub.contact_count
FROM
( SELECT c.`unit_id`, c.`owner_id`,
o.`name`, o.`email`,
COUNT(*) AS contact_count
FROM `contacts` AS c
JOIN `owners` AS o ON c.`owner_id` = o.`id`
WHERE o.`group_id` = 6
AND c.`checkin` >= '2021-10-01'
AND c.`checkin` < '2021-10-01' + INTERVAL 1 MONTH
GROUP BY c.`unit_id`
ORDER BY `contact_count` DESC
LIMIT 20
) AS sub
LEFT JOIN `units` AS u ON sub.`unit_id` = u.`id`
ORDER BY `contact_count` DESC, sub.unit_id DESC;
Notes:
I turned it inside out in order to hit units only 20 times.
JOIN owners cannot be LEFT JOIN, so I changed that.
I changed the GROUP BY to avoid using units prematurely.
Possibly the GROUP BY is now redundant.
I changed the date range to make it easier to be generic.
I augmented the ORDER BY to make it deterministic in case of dup counts.
Notice, below, how "composite" indexes can be helpful.
Indexes that may help:
contacts: INDEX(checkin, unit_id, owner_id)
contacts: INDEX(owner_id, checkin, unit_id)
owners: INDEX(group_id, id, name, email)
When adding those, remove any INDEXes that start with the same columns. Example: contacts: INDEX(checkin)
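As an untested sketch, those suggestions might be applied like this (the index names are my own invention):
-- Sketch only: add the composite indexes, then drop the redundant prefix index
-- named in the example above. contacts_owner_id_foreign is also a redundant
-- prefix now, but the FOREIGN KEY needs some index starting with owner_id,
-- which the second new index provides.
ALTER TABLE contacts
ADD INDEX idx_checkin_unit_owner (checkin, unit_id, owner_id),
ADD INDEX idx_owner_checkin_unit (owner_id, checkin, unit_id),
DROP INDEX contacts_checkin_index;
ALTER TABLE owners
ADD INDEX idx_group_id_name_email (group_id, id, name, email);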
I haven't tested it, so I can't be sure exactly, but I think the main reason for the slow speed is the number of rows participating in the GROUP BY.
So, you can try the following method to reduce the number of rows.
(Since I can't test it, I'm not sure if this query will run correctly. I'm just trying to show you a way.)
SELECT B.*, `owners`.`name`, `owners`.`email`
FROM (
SELECT `units`.`id`, MAX(A.`owner_id`) AS owner_id, `units`.`description`, `units`.`address`, COUNT(*) AS contact_count
FROM (
SELECT *
FROM `contacts`
WHERE `contacts`.`checkin` BETWEEN '2021-10-01 00:00:00' AND '2021-10-31 23:59:59') as A
LEFT JOIN `units` ON A.`unit_id` = `units`.`id`
GROUP BY `units`.`id`) AS B
LEFT JOIN `owners` ON B.`owner_id` = `owners`.`id` AND `owners`.`group_id` = 6
ORDER BY `contact_count` DESC
LIMIT 20
I had a similar experience previously, where I had to count ad views and page visits by date and time range, and this approach reduced the query time.

Slow join with order query

I have a problem with the speed of a query. The question is similar to this one, but I can't find a solution. EXPLAIN says that MySQL is using: Using where; Using index; Using temporary; Using filesort
Slow query:
select
distinct(`books`.`id`)
from `books`
join `books_genres` on `books_genres`.`book_id` = `books`.`id`
where
`books`.`is_status` = 'active' and `books`.`master_book` = 'true'
and `books_genres`.`genre_id` in(380,381,384,385,1359)
order by
`books`.`livelib_read_num` DESC, `books`.`id` DESC
limit 0,25
#25 rows (0.319 s)
But if I remove order statement from query it is really fast:
select sql_no_cache
distinct(`books`.`id`)
from `books`
join `books_genres` on `books_genres`.`book_id` = `books`.`id`
where
`books`.`is_status` = 'active' and `books`.`master_book` = 'true'
and `books_genres`.`genre_id` in(380,381,384,385,1359)
limit 0,25
#25 rows (0.005 s)
Explain:
+------+-------------+--------------+--------+---------------------------------------------------------------------------------------------------------------------+------------------+---------+--------------------------------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------+--------+---------------------------------------------------------------------------------------------------------------------+------------------+---------+--------------------------------+--------+-----------------------------------------------------------+
| 1 | SIMPLE | books_genres | range | book_id,categorie_id,book_id2,genre_id_book_id | genre_id_book_id | 10 | NULL | 194890 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | books | eq_ref | PRIMARY,is_status,master_book,is_status_master_book,is_status_master_book_indexed,is_status_donor_no_ru_master_book | PRIMARY | 4 | knigogid3.books_genres.book_id | 1 | Using where |
+------+-------------+--------------+--------+---------------------------------------------------------------------------------------------------------------------+------------------+---------+--------------------------------+--------+-----------------------------------------------------------+
2 rows in set (0.00 sec)
My tables:
CREATE TABLE `books_genres` (
`book_id` int(11) DEFAULT NULL,
`genre_id` int(11) DEFAULT NULL,
`sort` tinyint(4) DEFAULT NULL,
UNIQUE KEY `book_id` (`book_id`,`genre_id`),
KEY `categorie_id` (`genre_id`),
KEY `sort` (`sort`),
KEY `book_id2` (`book_id`),
KEY `genre_id_book_id` (`genre_id`,`book_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `books` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`is_status` enum('active','parser','incorrect','extremist','delete','fulldeteled') NOT NULL DEFAULT 'active',
`livelib_book_id` int(11) DEFAULT NULL,
`master_book` enum('true','false') DEFAULT 'true',
PRIMARY KEY (`id`),
KEY `is_status` (`is_status`),
KEY `master_book` (`master_book`),
KEY `livelib_book_id` (`livelib_book_id`),
KEY `livelib_read_num` (`livelib_read_num`),
KEY `is_status_master_book` (`is_status`,`master_book`),
KEY `livelib_book_id_master_book` (`livelib_book_id`,`master_book`),
KEY `is_status_master_book_indexed` (`is_status`,`master_book`,`indexed`),
KEY `is_status_donor_no_ru_master_book` (`is_status`,`donor`,`no_ru`,`master_book`),
KEY `livelib_url_master_book_is_status` (`livelib_url`,`master_book`,`is_status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Problems with books_genres.
It has no PRIMARY KEY.
All columns are nullable. Will you ever insert a row with any NULLs?
Recommend (after saying NOT NULL on all columns):
PRIMARY KEY(`book_id`,`genre_id`)
INDEX(genre_id, book_id, sort)
and remove all the rest.
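An untested sketch of that cleanup, assuming the data contains no NULLs and no duplicate (book_id, genre_id) pairs:
-- Sketch only: make the columns NOT NULL, add the PK and the composite index,
-- then drop everything else.
ALTER TABLE books_genres
MODIFY `book_id` int(11) NOT NULL,
MODIFY `genre_id` int(11) NOT NULL,
MODIFY `sort` tinyint(4) NOT NULL,
ADD PRIMARY KEY (`book_id`, `genre_id`),
ADD INDEX `genre_book_sort` (`genre_id`, `book_id`, `sort`),
DROP INDEX `book_id`,  -- the old UNIQUE key, now covered by the PRIMARY KEY
DROP INDEX `categorie_id`,
DROP INDEX `sort`,
DROP INDEX `book_id2`,
DROP INDEX `genre_id_book_id`;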
I don't see livelib_read_num in the table???
In the other table, remove any indexes that are the exact prefix of some other index.
These might help with speed. (Again, remove prefix indexes that become redundant.) (These are "covering" indexes, which help a little.)
books: INDEX(is_status, master_book, livelib_read_num, id)
books: INDEX(livelib_read_num, id, is_status, master_book)
The second index may cause the Optimizer to give preference to ORDER BY. (That is a risky optimization, since it might have to scan the entire index without finding 25 relevant rows.)
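As an untested DDL sketch, with the first index named to match the use index hint in the query below:
-- Sketch only: add the two covering indexes, drop the redundant prefixes.
ALTER TABLE books
ADD INDEX `books_idx_is_stat_master_livelib_id` (`is_status`, `master_book`, `livelib_read_num`, `id`),
ADD INDEX `books_idx_livelib_id_is_stat_master` (`livelib_read_num`, `id`, `is_status`, `master_book`),
DROP INDEX `is_status`,              -- prefix of the first new index
DROP INDEX `is_status_master_book`,  -- also a prefix of the first new index
DROP INDEX `livelib_read_num`;       -- prefix of the second new index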
SELECT sql_no_cache
`books`.`id`
FROM
`books`
use index(books_idx_is_stat_master_livelib_id)
WHERE
(
1 = 1
AND `books`.`is_status` = 'active'
AND `books`.`master_book` = 'true'
)
AND (
EXISTS (
SELECT
1
FROM
`books_genres`
WHERE
(
`books_genres`.`book_id` = `books`.`id`
)
AND (
`books_genres`.`genre_id` IN (
380, 381, 384, 385, 1359
)
)
)
)
ORDER BY
`books`.`livelib_read_num` DESC,
`books`.`id` DESC
LIMIT 0, 25;
25 rows in set (0.07 sec)

How to optimize a join when searching in categories

I have a table with items:
CREATE TABLE `ost_content` (
`uid` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`type` enum('media','serial','season','series') NOT NULL,
`alias` varchar(200) NOT NULL,
`views` mediumint(7) NOT NULL DEFAULT '0',
`ratings_count` enum('0','1','2','4','5') NOT NULL DEFAULT '0',
`ratings_sum` mediumint(5) NOT NULL DEFAULT '0',
`upload_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`conversion_status` enum('converting','error','success','announcement') NOT NULL DEFAULT 'converting',
PRIMARY KEY (`uid`),
UNIQUE KEY `idx_uid_type` (`uid`,`type`),
KEY `idx_type` (`type`),
KEY `idx_upload_date DESC` (`upload_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And a table that connects items with categories:
CREATE TABLE `ost_categories2media` (
`categories2media_id` mediumint(6) unsigned NOT NULL AUTO_INCREMENT,
`categories2media_category_id` smallint(5) unsigned NOT NULL,
`categories2media_uid` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`categories2media_id`),
KEY `categories2media_media_id` (`categories2media_uid`),
KEY `categories2media_category_id` (`categories2media_category_id`)
) ENGINE=InnoDB AUTO_INCREMENT=501114 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Then I execute this query:
SELECT
c1.uid,
c1.alias,
c1.type,
c1.views,
c1.upload_date,
c1.ratings_sum,
c1.ratings_count,
c1.conversion_status
FROM
ost_content c1
LEFT JOIN ost_categories2media c2m ON c2m.categories2media_uid = c1.uid
WHERE
c2m.categories2media_category_id = '53'
AND c1.conversion_status IN ('success', 'announcement')
AND c1.type IN ('serial', 'media')
ORDER BY
c1.upload_date DESC
LIMIT 16, 16
It executes slowly; the lookup on categories2media_category_id has to check many rows:
+----+-------------+-------+--------+--------------------------------------------------------+------------------------------+---------+---------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------+------------------------------+---------+---------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | c2m | ref | categories2media_media_id,categories2media_category_id | categories2media_category_id | 2 | const | 32076 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c1 | eq_ref | PRIMARY,idx_uid_type,idx_type | PRIMARY | 3 | uakino.c2m.categories2media_uid | 1 | Using where |
+----+-------------+-------+--------+--------------------------------------------------------+------------------------------+---------+---------------------------------+-------+----------------------------------------------+
How can I optimize or rewrite this query?
MySQL indexes are like cooks: too many of them aren't very useful, because MySQL uses only one index per table. Looking at ost_categories2media, that's three separate indexes on three columns. You are better off with two indexes, like this:
PRIMARY KEY (`categories2media_id`),
KEY `categories2media_media_id` (`categories2media_uid`,`categories2media_category_id`)
Now MySQL no longer has to decide between an index on categories2media_uid and one on categories2media_category_id: it has an index that covers both!
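As a sketch (not tested), the consolidation can be done in one ALTER:
-- Drop the two single-column keys and replace them with one composite key.
ALTER TABLE ost_categories2media
DROP INDEX `categories2media_media_id`,
DROP INDEX `categories2media_category_id`,
ADD INDEX `categories2media_media_id` (`categories2media_uid`, `categories2media_category_id`);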
Looking at your ost_content table we see
PRIMARY KEY (`uid`),
UNIQUE KEY `idx_uid_type` (`uid`,`type`),
KEY `idx_type` (`type`),
KEY `idx_upload_date DESC` (`upload_date`)
Some of these indexes are a bit redundant. Any query that filters on the uid field can use the PK, while any query that filters on type can use idx_type; that means idx_uid_type is there just to enforce uniqueness. But we can make it more useful, like this:
PRIMARY KEY (`uid`),
UNIQUE KEY `idx_uid_type` (`type`,`uid`),
KEY `idx_upload_date DESC` (`upload_date`)
We've got rid of one index! That ought to make maintaining your indexes a lot faster. You still have an index on upload_date that isn't used in this particular query. So how about a composite index for that?
PRIMARY KEY (`uid`),
UNIQUE KEY `idx_uid_type` (`type`,`uid`),
KEY `idx_upload_date DESC` (`uid`,`upload_date`)
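Putting those changes into DDL might look like this (untested sketch; it reuses the original index names):
-- Sketch only: reorder the unique key, drop the redundant type index,
-- and rebuild the date index as suggested above.
ALTER TABLE ost_content
DROP INDEX `idx_type`,
DROP INDEX `idx_uid_type`,
ADD UNIQUE KEY `idx_uid_type` (`type`, `uid`),
DROP INDEX `idx_upload_date DESC`,
ADD KEY `idx_upload_date DESC` (`uid`, `upload_date`);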
First, the LEFT JOIN is not necessary. So, you can write the query as:
SELECT c.*
FROM ost_content c JOIN
ost_categories2media c2m
ON c2m.categories2media_uid = c.uid
WHERE c2m.categories2media_category_id = '53' AND
c.conversion_status IN ('success', 'announcement') AND
c.type IN ('serial', 'media')
ORDER BY c.upload_date DESC
LIMIT 16, 16;
Unfortunately, your conditions on the content table are not simple = conditions. If they were, an index on ost_content(conversion_status, type, uid) would be recommended. This might still be the better option.
Another option is to go the other way: An index on ost_categories2media(categories2media_category_id, categories2media_uid).
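In DDL, that alternative is simply (the index name is hypothetical):
-- With the constant category filter leading and the join column second,
-- the whole join condition can be resolved from the index alone.
ALTER TABLE ost_categories2media
ADD INDEX `idx_category_uid` (`categories2media_category_id`, `categories2media_uid`);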
You might find that the first composite index and this query work best:
SELECT c.*
FROM ((SELECT c.*
FROM ost_content c JOIN
ost_categories2media c2m
ON c2m.categories2media_uid = c.uid
WHERE c2m.categories2media_category_id = '53' AND
c.conversion_status = 'success' AND
c.type IN ('serial', 'media')
) UNION ALL
(SELECT c.*
FROM ost_content c JOIN
ost_categories2media c2m
ON c2m.categories2media_uid = c.uid
WHERE c2m.categories2media_category_id = '53' AND
c.conversion_status = 'announcement' AND
c.type IN ('serial', 'media')
)
) c
ORDER BY c.upload_date DESC
LIMIT 16, 16;
This looks more complicated, but each subquery can take advantage of the index, so it might improve performance.

Distinct vs Group By

I have two tables like this.
The 'order' table has 21886 rows.
CREATE TABLE `order` (
`id` bigint(20) unsigned NOT NULL,
`reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `order_detail_products` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` bigint(20) unsigned NOT NULL,
`order_detail_id` int(11) NOT NULL,
`prod_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
My question is here.
MariaDB [test]> explain
-> SELECT DISTINCT A.id
-> FROM `order` A
-> JOIN order_detail_products B ON A.id = B.order_id
-> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| 1 | SIMPLE | A | index | PRIMARY | idx_reg_date | 8 | NULL | 22151 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | B | ref | idx_order_id | idx_order_id | 8 | bom_20140804.A.id | 2 | Using index; Distinct |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
MariaDB [test]> explain
-> SELECT A.id
-> FROM `order` A
-> JOIN order_detail_products B ON A.id = B.order_id
-> GROUP BY A.id
-> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| 1 | SIMPLE | A | index | PRIMARY | idx_reg_date | 8 | NULL | 65 | Using index; Using temporary |
| 1 | SIMPLE | B | ref | idx_order_id | idx_order_id | 8 | bom_20140804.A.id | 2 | Using index |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
Listed above, the two queries return the same result, but the DISTINCT one is much slower (EXPLAIN estimates far more rows).
What's the difference?
It is usually advised to use DISTINCT instead of GROUP BY, since that is what you actually want, and to let the optimizer choose the "best" execution plan. However, no optimizer is perfect. With DISTINCT the optimizer can have more options for an execution plan, but that also means it has more options to choose a bad plan.
You write that the DISTINCT query is "slow", but you don't give any numbers. In my test (with 10 times as many rows, on MariaDB 10.0.19 and 10.3.13) the DISTINCT query is only about 25% slower (562 ms / 453 ms). The EXPLAIN result is no help at all. It's even "lying": with LIMIT 100, 30 it would need to read at least 130 rows (that's what my EXPLAIN actually shows for GROUP BY), but it shows you 65.
I can't explain the 25% difference in execution time, but it seems that the engine is doing a full table/index scan in any case, and sorts the result before it can skip 100 and select 30 rows.
The best plan would probably be:
Read rows from idx_reg_date index (table A) one by one in descending order
Look if there is a match in the idx_order_id index (table B)
Skip 100 matching rows
Send 30 matching rows
Exit
If there are like 10% of rows in A which have no match in B, this plan would read something like 143 rows from A.
Best I could do to somehow force this plan is:
SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100
This query returns the same result in 156 ms (3 times faster than GROUP BY). But that is still too slow, and it's probably still reading all rows in table A.
We can prove that a better plan can exist with a "little" subquery trick:
SELECT A.id
FROM (
SELECT id, reg_date
FROM `order`
ORDER BY reg_date DESC
LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100
This query executes in "no time" (~ 0 ms) and returns the same result on my test data. And though it's not 100% reliable, it shows that the optimizer is not doing a good job.
So what are my conclusions:
The optimizer does not always do the best job and sometimes needs help
Even when we know "the best plan", we cannot always enforce it
DISTINCT is not always faster than GROUP BY
When no index can be used for all clauses, things get quite tricky
Test schema and dummy data:
drop table if exists `order`;
CREATE TABLE `order` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into `order`(reg_date)
select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
from information_schema.COLUMNS a
, information_schema.COLUMNS b
limit 218860;
drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`order_id` bigint(20) unsigned NOT NULL,
`order_detail_id` int(11) NOT NULL,
`prod_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into order_detail_products(id, order_id, order_detail_id, prod_id)
select null as id
, floor(rand(2)*218860)+1 as order_id
, 0 as order_detail_id
, 0 as prod_id
from information_schema.COLUMNS a
, information_schema.COLUMNS b
limit 437320;
Queries:
SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms
SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms
SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms
SELECT A.id
FROM (
SELECT id, reg_date
FROM `order`
ORDER BY reg_date DESC
LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms
I believe your SELECT DISTINCT is slow because matching on another table keeps the index from being used effectively. In most cases SELECT DISTINCT will be faster. But in this case, since you are matching on columns of another table, the index is effectively broken and the query is much slower.

Optimizing the mysql query - Avoid creation of temporary table?

This is the query that I am using on the tables products, reviews, replies, and review_images.
Query :
SELECT products.id, reviews.*,
GROUP_CONCAT(DISTINCT CONCAT_WS('~',replies.reply, replies.time)) AS Replies,
GROUP_CONCAT(DISTINCT CONCAT_WS('~',review_images.image_title, review_images.image_location)) AS ReviewImages
FROM products
LEFT JOIN reviews on products.id = reviews.product_id
LEFT JOIN replies on reviews.id = replies.review_id
LEFT JOIN review_images on reviews.id = review_images.review_id
WHERE products.id = 1
GROUP BY products.id, reviews.id;
Schema :
Products :
id | name | product_details....
Reviews :
id | product_id | username | review | time | ...
Replies :
id | review_id | username | reply | time | ...
Review Images :
id | review_id | image_title | image_location | ...
Indexes:
Products :
PRIMARY KEY - id
Reviews :
PRIMARY KEY - id
FOREIGN KEY - product_id (id IN products table)
FOREIGN KEY - username (username IN users table)
Replies :
PRIMARY KEY - id
FOREIGN KEY - review_id (id IN reviews table)
FOREIGN KEY - username (username IN users table)
Review Images :
PRIMARY KEY - id
FOREIGN KEY - review_id (id IN reviews table)
Explain Query :
id | select_type | table | type | possible_keys | rows | extra
1 | SIMPLE | products | index | null | 1 | Using index; Using temporary; Using filesort
1 | SIMPLE | reviews | ALL | product_id | 4 | Using where; Using join buffer (Block Nested Loop)
1 | SIMPLE | replies | ref | review_id | 1 | Null
1 | SIMPLE | review_images | ALL | review_id | 5 | Using where; Using join buffer (Block Nested Loop)
I don't know what is wrong here that makes it need a filesort and a temporary table.
Here are a few profiling results:
Opening Tables 140 µs
Init 139 µs
System Lock 34 µs
Optimizing 21 µs
Statistics 106 µs
Preparing 146 µs
Creating Tmp Table 13.6 ms
Sorting Result 27 µs
Executing 11 µs
Sending Data 11.6 ms
Creating Sort Index 1.4 ms
End 89 µs
Removing Tmp Table 8.9 ms
End 34 µs
Query End 25 µs
Closing Tables 66 µs
Freeing Items 41 µs
Removing Tmp Table 1.4 ms
Freeing Items 46 µs
Removing Tmp Table 1.2 ms
Freeing Items 203 µs
Cleaning Up 55 µs
As the EXPLAIN and profiling results make clear, a temporary table is created to produce the results. How can I optimize this query to get the same results with better performance and avoid the creation of the temporary table?
Help would be appreciated. Thanks in advance.
EDIT
Create Tables
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
`description` varchar(100) NOT NULL,
`items` int(11) NOT NULL,
`price` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
CREATE TABLE `reviews` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`username` varchar(30) NOT NULL,
`product_id` int(11) NOT NULL,
`review` text NOT NULL,
`time` datetime NOT NULL,
`ratings` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `product_id` (`product_id`),
KEY `username` (`username`)
) ENGINE=InnoDB
CREATE TABLE `replies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`review_id` int(11) NOT NULL,
`username` varchar(30) NOT NULL,
`reply` text NOT NULL,
`time` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `review_id` (`review_id`)
) ENGINE=InnoDB
CREATE TABLE `review_images` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`review_id` int(11) NOT NULL,
`image_title` text NOT NULL,
`image_location` text NOT NULL,
PRIMARY KEY (`id`),
KEY `review_id` (`review_id`)
) ENGINE=InnoDB
EDIT:
I simplified the query above and now it does not create temporary tables. The only reason, as mentioned by @Bill Karwin, was that I was using GROUP BY on a second table in the joins.
Simplified query :
SELECT reviews. * ,
GROUP_CONCAT( DISTINCT CONCAT_WS( '~', replies.reply, replies.time ) ) AS Replies,
GROUP_CONCAT( DISTINCT CONCAT_WS( '~', review_images.image_title, review_images.image_location ) ) AS ReviewImages
FROM reviews
LEFT JOIN replies ON reviews.id = replies.review_id
LEFT JOIN review_images ON reviews.id = review_images.review_id
WHERE reviews.product_id = 1
GROUP BY reviews.id
Now the PROBLEM that I'm facing is:
Because I'm using GROUP_CONCAT, there is a limit on the data it can return, set by the variable group_concat_max_len. Since I'm concatenating the replies given by users, the result could get very long and possibly exceed that limit. I know I can change the value of group_concat_max_len for the current session, but there is still a cap, so at some point the query may fail or return incomplete results.
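For reference, raising the cap for the current session looks like this (group_concat_max_len is a standard MySQL system variable; the value chosen here is arbitrary):
-- Raise the limit from the 1024-byte default to 1 MB for this session only.
SET SESSION group_concat_max_len = 1048576;
SHOW VARIABLES LIKE 'group_concat_max_len';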
How can I modify my query so as not to use GROUP_CONCAT and still get the expected results?
POSSIBLE SOLUTION:
Simply using LEFT JOINs, which creates duplicate rows for every new value in the last column and makes the result hard to traverse in PHP? Any suggestions?
I see this question is not getting enough response from SO members, but I've been looking for a solution and reading up on the concepts for the last two weeks. Still no luck. Hope some of you PROs can help me out. Thanks in advance.
You can't avoid creating a temporary table when your GROUP BY clause references columns from two different tables.
The only way to avoid the temporary table in this query is to store a denormalized version of the data in one table, and index the two columns you're grouping by.
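A minimal sketch of that denormalized idea (all table and column names here are hypothetical): one row per joined (review, reply, image) combination, with the two grouped columns indexed together:
CREATE TABLE `review_flat` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`product_id` int(11) NOT NULL,
`review_id` int(11) NOT NULL,
`reply` text,
`image_title` text,
`image_location` text,
PRIMARY KEY (`id`),
KEY `idx_product_review` (`product_id`,`review_id`)
) ENGINE=InnoDB
With (product_id, review_id) indexed together, the GROUP BY can walk a single index instead of building a temporary table.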
Another way you can simplify and get results in a format that's easier to work with in PHP is to do multiple queries, without GROUP BY.
First get the reviews. The example is in PHP & PDO, but the principle applies to any language.
$review_stmt = $pdo->query("
SELECT reviews.*
FROM reviews
WHERE reviews.product_id = 1");
Arrange them in an associative array keyed by the review_id.
$reviews = array();
while ($row = $review_stmt->fetch(PDO::FETCH_ASSOC)) {
$reviews[$row['id']] = $row;
}
Then get the replies and append them to the array under the key 'replies'. Use INNER JOIN instead of LEFT JOIN, because it's okay if a review simply has no reply rows here.
$reply_stmt = $pdo->query("
SELECT replies.*
FROM reviews
INNER JOIN replies ON reviews.id = replies.review_id
WHERE reviews.product_id = 1");
while ($row = $reply_stmt->fetch(PDO::FETCH_ASSOC)) {
$reviews[$row['review_id']]['replies'][] = $row;
}
And do the same for review_images.
$reply_stmt = $pdo->query("
SELECT review_images.*
FROM reviews
INNER JOIN review_images ON reviews.id = review_images.review_id
WHERE reviews.product_id = 1");
while ($row = $reply_stmt->fetch(PDO::FETCH_ASSOC)) {
$reviews[$row['review_id']]['review_images'][] = $row;
}
The end result is an array of reviews, each of which contains nested arrays for its related replies and images.
The efficiency of running simpler queries can make up for the extra work of running three queries. Plus you don't have to write code to explode() the group-concatted strings.