SQL Query is slow and not using indexes - mysql

SELECT `productTitle`, `orderCnt`, `promPCPriceStr`,
`productImgUrl`, `oriPriceStr`, `detailUrl`,
(SELECT count(id) FROM orders t4
WHERE t4.productId = t1.productId
AND DATE( t4.`date`) > DATE_SUB(CURDATE(), INTERVAL 2 DAY)
) as ordertoday
FROM `products` t1
WHERE `orderCnt` > 0
AND `orderCnt` < 2000
AND `promPCPriceStr` > 0
AND `promPCPriceStr` < 2000
HAVING ordertoday > 5 AND ordertoday < 2000
order by ordertoday desc limit 150
This query takes 18 seconds to finish. When I run the EXPLAIN command on it, it shows this:
It does not use the index keys!
The tables used
Products Table
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`productId` bigint(20) NOT NULL,
`detailUrl` text CHARACTER SET utf32 NOT NULL,
`belongToDSStore` int(11) NOT NULL,
`promPCPriceStr` float NOT NULL DEFAULT '-1',
`oriPriceStr` float NOT NULL DEFAULT '-1',
`orderCnt` int(11) NOT NULL,
`productTitle` text CHARACTER SET utf32 NOT NULL,
`productImgUrl` text CHARACTER SET utf32 NOT NULL,
`created_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`cat` bigint(20) NOT NULL DEFAULT '-1',
PRIMARY KEY (`id`),
UNIQUE KEY `productId` (`productId`),
KEY `orderCnt` (`orderCnt`),
KEY `cat` (`cat`),
KEY `promPCPriceStr` (`promPCPriceStr`)
) ENGINE=InnoDB AUTO_INCREMENT=37773 DEFAULT CHARSET=latin1
Orders Table
CREATE TABLE `orders` (
`oid` int(11) NOT NULL AUTO_INCREMENT,
`countryCode` varchar(10) NOT NULL,
`date` datetime NOT NULL,
`id` bigint(20) NOT NULL,
`productId` bigint(20) NOT NULL,
PRIMARY KEY (`oid`),
UNIQUE KEY `id` (`id`),
KEY `date` (`date`),
KEY `productId` (`productId`)
) ENGINE=InnoDB AUTO_INCREMENT=9790205 DEFAULT CHARSET=latin1

MySQL won't use an index, even if one exists on a column you search, if the values you search for match a large enough subset of the rows.
I did a test with MySQL 5.6. I created a table with ~1,000,000 rows, with a column x of random values evenly distributed between 1 and 1000. There's an index on column x.
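For reference, a table like that can be thrown together with something along these lines (just a sketch; the foo/x names only exist to match the EXPLAIN output below, and the self-INSERT is one of several ways to generate rows):
CREATE TABLE foo (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  x INT NOT NULL,
  KEY x (x)
) ENGINE=InnoDB;
-- seed one row, then keep re-running the second INSERT to double the row count until ~1,000,000
INSERT INTO foo (x) VALUES (FLOOR(1 + RAND() * 1000));
INSERT INTO foo (x) SELECT FLOOR(1 + RAND() * 1000) FROM foo;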
Depending on my search terms, I see the index is used if I search for a range of values matching a small enough subset of rows, otherwise it decides using the index is too much trouble, and just does a table-scan:
mysql> explain select * from foo where x < 50;
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | foo | range | x | x | 4 | NULL | 102356 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
mysql> explain select * from foo where x < 100;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | foo | ALL | x | NULL | NULL | NULL | 1046904 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
I would infer that your query's search conditions match a pretty large portion of the rows, and MySQL decides the indexes on these columns are not worth using.
WHERE `orderCnt` > 0
AND `orderCnt` < 2000
AND `promPCPriceStr` > 0
AND `promPCPriceStr` < 2000
If you think MySQL is making the wrong choice, you can try to use an index hint to tell MySQL that a table-scan is prohibitively expensive. This will urge it to use the index (if the index is relevant to the search condition).
mysql> explain select * from foo force index (x) where x < 100;
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | foo | range | x | x | 4 | NULL | 216764 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
I would write the query this way, without any subquery:
SELECT t.productTitle, t.orderCnt, t.promPCPriceStr,
t.productImgUrl, t.oriPriceStr, t.detailUrl,
COUNT(o.id) AS orderToday
FROM products t
LEFT JOIN orders o ON t.productid = o.productid AND o.date > CURDATE() - INTERVAL 2 DAY
WHERE t.orderCnt > 0 AND t.orderCnt < 2000
AND t.promPCPriceStr > 0 AND t.promPCPriceStr < 2000
GROUP BY t.productid
HAVING ordertoday > 5 AND ordertoday < 2000
ORDER BY ordertoday DESC LIMIT 150
When I EXPLAIN the query, I get this report:
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
| 1 | SIMPLE | t | ALL | productId,orderCnt,promPCPriceStr | NULL | NULL | NULL | 9993 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | o | ref | date,productId | productId | 8 | test.t.productId | 1 | Using where |
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
It still does a table-scan for products but it joins the relevant matching rows in orders with an index lookup instead of a correlated subquery.
I filled my tables with random data, making 98,846 product rows and 215,508 orders rows. When I run the query it takes about 0.18 seconds.
Though when I run your query with the correlated subquery, it takes only 0.06 seconds. I don't know why your query is so slow; you could be running on an underpowered server.
I'm running my test on a Macbook Pro 2017 with an i7 CPU and 16GB of RAM.

In both tables, it is counterproductive to have both an AUTO_INCREMENT PRIMARY KEY and a BIGINT column that is UNIQUE. Get rid of the AI column and promote the other to PK. This may require changing some of your code, since the AI column is gone.
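For the products table that could look something like this (a sketch only; back up first, and the same idea applies to orders with its id column):
ALTER TABLE products
  DROP COLUMN id,                  -- dropping the column also drops the old PRIMARY KEY
  DROP INDEX productId,            -- the UNIQUE key becomes redundant...
  ADD PRIMARY KEY (productId);     -- ...once productId is the PK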
As for the subquery...
(SELECT count(id) FROM orders t4
WHERE t4.productId = t1.productId
AND DATE( t4.`date`) > DATE_SUB(CURDATE(), INTERVAL 2 DAY)
) as ordertoday
Change COUNT(id) to COUNT(*) unless you need to check id for being NOT NULL (which I doubt).
The date column is hidden inside a function call, so no index on it can be used. Change the date test to
AND t4.`date` > CURDATE() - INTERVAL 2 DAY
Then add this composite index. (It will help with Karwin's reformulation, too).
INDEX(productId, date)
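Putting those changes together, the index and the revised query might look like this (a sketch; the index name prod_date is arbitrary):
ALTER TABLE orders ADD INDEX prod_date (productId, `date`);

SELECT `productTitle`, `orderCnt`, `promPCPriceStr`,
       `productImgUrl`, `oriPriceStr`, `detailUrl`,
       (SELECT COUNT(*) FROM orders t4
         WHERE t4.productId = t1.productId
           AND t4.`date` > CURDATE() - INTERVAL 2 DAY
       ) AS ordertoday
FROM `products` t1
WHERE `orderCnt` > 0 AND `orderCnt` < 2000
  AND `promPCPriceStr` > 0 AND `promPCPriceStr` < 2000
HAVING ordertoday > 5 AND ordertoday < 2000
ORDER BY ordertoday DESC LIMIT 150;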

Related

Mysql count performance on very big rows between two dates conditions

I have an InnoDB table with more than 20 million rows.
The columns are
id, viewable_id, visitor, viewed_at
where viewable_id and viewed_at are indexed.
When I run the query below
SELECT COUNT(*)
FROM views_users
WHERE (viewable_id = 2)
and (viewed_at between '2021-04-19 01:38:37'
and '2021-06-30 01:38:37');
it takes 3 min 6.72 sec.
The EXPLAIN is:
+----+-------------+-------------+------------+------+-----------------------------------------------------------+-------------------------------+---------+-------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+-----------------------------------------------------------+-------------------------------+---------+-------+---------+----------+-------------+
| 1 | SIMPLE | views_users | NULL | ref | views_users_viewable_id_index,views_users_viewed_at_index | views_users_viewable_id_index | 8 | const | 9554594 | 50.00 | Using where |
+----+-------------+-------------+------------+------+-----------------------------------------------------------+-------------------------------+---------+-------+---------+----------+-------------+
How can I increase the performance to reach less than 4 seconds?
CREATE TABLE views_users (
id int unsigned NOT NULL AUTO_INCREMENT,
viewable_type varchar(255) NOT NULL,
viewable_id bigint unsigned NOT NULL,
visitor text,
collection varchar(255) DEFAULT NULL,
viewed_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
KEY user_id (viewable_id)
) ENGINE=InnoDB AUTO_INCREMENT=20995848
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
I increased the performance to less than 2 seconds by applying MySQL partitions.
I used partitioning by range on the viewed_at column, changed the viewed_at type from timestamp to datetime, and made it part of the primary key together with id.
I also set up a cron job that runs on the first day of each month and reorganizes the last partition into further partitions, and so on.
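A rough sketch of that layout (the partition names and boundaries here are illustrative, not necessarily the exact ones):
ALTER TABLE views_users
  MODIFY viewed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, viewed_at);

ALTER TABLE views_users
  PARTITION BY RANGE COLUMNS (viewed_at) (
    PARTITION p202105 VALUES LESS THAN ('2021-06-01'),
    PARTITION p202106 VALUES LESS THAN ('2021-07-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );

-- run monthly from the cron job to carve a new partition out of the catch-all one
ALTER TABLE views_users
  REORGANIZE PARTITION pmax INTO (
    PARTITION p202107 VALUES LESS THAN ('2021-08-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );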
For this query:
SELECT COUNT(*)
FROM views_users
WHERE viewable_id = 2 and
viewed_at between '2021-04-19 01:38:37' and '2021-06-30 01:38:37';
You can create an index:
CREATE INDEX idx_views_users_viewable_id_viewed_at ON views_users(viewable_id, viewed_at);

SELECT statement optimization MySQL

I am looking for a way to make my SELECT query even faster than it is now, because I have a feeling it should be possible.
Here is the query
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
Here is EXPLAIN SELECT
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY | PRIMARY | 4 | dev.r.id_pair | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
Now the tables are
CREATE TABLE `tag_product` (
`id_pair` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_product` int(10) unsigned NOT NULL,
`id_user_tag` int(10) unsigned NOT NULL,
`status` tinyint(3) NOT NULL,
`percentile` decimal(8,4) unsigned NOT NULL,
`percentile_weighted` decimal(8,4) unsigned NOT NULL,
`elo` int(10) unsigned NOT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_pair`),
UNIQUE KEY `id_product_id_user_tag` (`id_product`,`id_user_tag`),
KEY `status` (`status`),
KEY `id_user_tag` (`id_user_tag`),
CONSTRAINT `tag_product_ibfk_5` FOREIGN KEY (`id_user_tag`) REFERENCES `user_tag` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `tag_rating` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_customer` int(10) unsigned NOT NULL,
`id_pair` int(10) unsigned NOT NULL,
`id_duel` int(10) unsigned NOT NULL,
`value` tinyint(4) NOT NULL,
`date_add` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_duel_id_pair` (`id_duel`,`id_pair`),
KEY `id_pair_id_customer` (`id_pair`,`id_customer`),
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
KEY `id_customer_value_date_add` (`id_customer`,`value`,`date_add`),
CONSTRAINT `tag_rating_ibfk_3` FOREIGN KEY (`id_pair`) REFERENCES `tag_product` (`id_pair`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `tag_rating_ibfk_6` FOREIGN KEY (`id_duel`) REFERENCES `tag_rating_duel` (`id_duel`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The table tag_product has about 250k rows and the tag_rating has about 1M rows.
My issue is that the SQL query takes about 0.8s on average on my machine. I would like to get it ideally under 0.5s, while also assuming the tables can get about 10 times bigger. The number of rows taken into play should stay about the same because I have a date condition (I only want rows less than a month old).
Is it possible to make this faster just by some trick (i.e. without restructuring my tables)? When I slightly modify the statement (don't join the smaller table) as
SELECT r.id_customer, COUNT(*)
FROM tag_rating AS r USE INDEX (value_date_add)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer;
here is EXPLAIN SELECT
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
it takes about 0.25s, which is great. So the JOIN makes it 3x slower. Is that inevitable? I feel like joining via the primary key shouldn't make a query 3x slower.
---UPDATE---
This is actually my query. The number of different id_customer values is about 1 thousand and is expected to rise; the number of rows with value=1 is exactly half. So far the query performance seems to slow down linearly with the number of rows in the rating table.
Adding id_pair at the end of the id_customer_value_date_add or value_id_customer_date_add index doesn't help.
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (id_customer_value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.id_customer IN (2593179,1461878,2318871,2654090,2840415,2852531,2987432,3473275,3960453,3961798,4129734,4191571,4202912,4204817,4211263,4248789,765650,1341317,1430380,2116196,3367674,3701901,3995273,4118307,4136114,4236589,783262,913493,1034296,2626574,3574634,3785772,2825128,4157953,3331279,4180367,4208685,4287879,1038898,1445750,1975108,3658055,4185296,4276189,428693,4248631,1892448,3773855,2901524,3830868,3934786) AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
This is EXPLAIN SELECT
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | r | range | id_customer_value_date_add | id_customer_value_date_add | 10 | | 558906 | Using where; Using index |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY,status | PRIMARY | 4 | dev.r.id_pair | 1 | Using where |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
Any tips are appreciated. Thank you
INDEX(value, date_add, id_customer, id_pair)
Would be "covering", giving an extra boost on performance for both queries. And also for Gordon's formulation.
At the same time, get rid of these:
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
because they might get in the way of the Optimizer picking the new index. Any other queries that were using those indexes will easily use the new index.
If you are not otherwise using tag_rating.id, get rid of it and promote the UNIQUE to PRIMARY KEY.
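As a sketch, those changes could be done like this (the new index name is arbitrary):
ALTER TABLE tag_rating
  ADD INDEX value_date_add_customer_pair (value, date_add, id_customer, id_pair),
  DROP INDEX `value`,
  DROP INDEX value_date_add;

-- only if nothing else relies on tag_rating.id:
ALTER TABLE tag_rating
  DROP COLUMN id,
  DROP INDEX id_duel_id_pair,
  ADD PRIMARY KEY (id_duel, id_pair);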
Try writing the query using a correlated subquery:
SELECT r.id_customer,
(SELECT ROUND(AVG(tp.percentile_weighted), 2)
FROM tag_product tp
WHERE tp.id_pair = r.id_pair
) AS percentile
FROM tag_rating AS r
WHERE r.value = 1 AND
r.date_add > '2020-08-08 11:56:00';
This eliminates the outer aggregation which should be faster.

Slow inner join order query

I have a problem with the speed of a query. The question is similar to the one below, but I can't find a solution. EXPLAIN says that MySQL is using 'Using index condition; Using where; Using temporary; Using filesort' on the companies table.
Mysql slow query: INNER JOIN + ORDER BY causes filesort
Slow query:
SELECT * FROM companies
INNER JOIN post_indices
ON companies.post_index_id = post_indices.id
WHERE companies.deleted_at is NULL
ORDER BY post_indices.id
LIMIT 1;
# 1 row in set (5.62 sec)
But if I remove the WHERE clause from the query, it is really fast:
SELECT * FROM companies
INNER JOIN post_indices
ON companies.post_index_id = post_indices.id
ORDER BY post_indices.id
LIMIT 1;
# 1 row in set (0.00 sec)
I've tried using different indexes on the companies table:
index_companies_on_deleted_at
index_companies_on_post_index_id
index_companies_on_deleted_at_and_post_index_id
index_companies_on_post_index_id_and_deleted_at
The index_companies_on_deleted_at index is automatically selected by MySQL. Stats for the same query using the above indexes:
5.6 sec
3.4 sec
8.5 sec
3.5 sec
Any ideas how to improve my query speed? As I said, without the WHERE deleted_at IS NULL condition the query is instant.
The companies table has 1.3 million rows.
The post_indices table has 3k rows.
UPDATE 1:
ORDER BY post_indices.id is used for simplicity since it's indexed already, but the ordering will also be done on other columns of the joined table (post_indices). So sorting on companies.post_index_id won't solve this issue.
UPDATE 2: for Rick James
Your query takes only 0.04 sec. EXPLAIN says that the index_companies_on_deleted_at_and_post_index_id index is used. So yes, it works better, but this doesn't solve my problem: I need to order on post_indices columns; post_indices.id is only used for simplicity of the example. In the future it will be, for example, post_indices.city.
My query with the WHERE, but without the ORDER BY, is instant.
UPDATE 3:
EXPLAIN of the query. I also noticed that the order of the indexes matters: the index_companies_on_deleted_at index is automatically selected by MySQL if it is higher (created earlier) than index_companies_on_deleted_at_and_post_index_id; otherwise the later index is selected.
mysql> EXPLAIN SELECT * FROM companies INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | companies | NULL | ref | index_companies_on_post_index_id,index_companies_on_deleted_at,index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at | 6 | const | 638692 | 100.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM companies USE INDEX(index_companies_on_post_index_id) INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | companies | NULL | ALL | index_companies_on_post_index_id | NULL | NULL | NULL | 1277385 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM companies USE INDEX(index_companies_on_deleted_at_and_post_index_id) INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
| 1 | SIMPLE | companies | NULL | ref | index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at_and_post_index_id | 6 | const | 638692 | 100.00 | Using index condition; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
UPDATE 4:
I've removed unrelated columns:
CREATE TABLE `companies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`address` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`post_index_id` int(11) DEFAULT NULL,
`vat` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`note` text COLLATE utf8_unicode_ci,
`state` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'new',
`deleted_at` datetime DEFAULT NULL,
`lead_list_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index_companies_on_vat` (`vat`),
KEY `index_companies_on_post_index_id` (`post_index_id`),
KEY `index_companies_on_state` (`state`),
KEY `index_companies_on_deleted_at` (`deleted_at`),
KEY `index_companies_on_deleted_at_and_post_index_id` (`deleted_at`,`post_index_id`),
KEY `index_companies_on_lead_list_id` (`lead_list_id`),
CONSTRAINT `fk_rails_5fc7f5c6b9` FOREIGN KEY (`lead_list_id`) REFERENCES `lead_lists` (`id`),
CONSTRAINT `fk_rails_79719355c6` FOREIGN KEY (`post_index_id`) REFERENCES `post_indices` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2523518 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `post_indices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`county` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`postal_code` int(11) DEFAULT NULL,
`group_part` int(11) DEFAULT NULL,
`group_number` int(11) DEFAULT NULL,
`group_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`city` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3101 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
UPDATE 5:
Another developer tested the same query on his local machine with exactly the same data set (dump/restore), and he got a totally different EXPLAIN:
mysql> explain SELECT * FROM companies INNER JOIN post_indices ON companies.post_index_id = post_indices.id WHERE companies.deleted_at is NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
| 1 | SIMPLE | post_indices | index | PRIMARY | PRIMARY | 4 | NULL | 1 | NULL |
| 1 | SIMPLE | companies | ref | index_companies_on_post_index_id,index_companies_on_deleted_at,index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at_and_post_index_id | 11 | const,enbro_purecrm_eu_development.post_indices.id | 283 | Using index condition |
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
2 rows in set (0,00 sec)
The same query on his PC is instant. I have no idea why this is happening. I've also tried STRAIGHT_JOIN: when I force the post_indices table to be read first by MySQL, it is blazing fast too. But it is still a mystery to me why the same query is fast on another machine (MySQL 5.6.27) and slow on my machine (MySQL 5.7.10).
So it seems that the problem is MySQL picking the wrong table to read first.
Does this work better?
SELECT * FROM companies AS c
INNER JOIN post_indices AS pi
ON c.post_index_id = pi.id
WHERE c.deleted_at is NULL
ORDER BY c.post_index_id -- Note
LIMIT 1;
INDEX(deleted_at, post_index_id) -- note
For that matter, how fast does it run with the WHERE, but without the ORDER BY?
Using the following optimizer hints should force MySQL to use the plan that your colleague observed:
SELECT * FROM post_indices
STRAIGHT_JOIN companies FORCE INDEX(index_companies_on_deleted_at_and_post_index_id)
ON companies.post_index_id = post_indices.id
WHERE companies.deleted_at is NULL
ORDER BY post_indices.id
LIMIT 1;
If you will be sorting on other columns of post_indices, you will need an index on those columns to make this plan work well.
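For example, if the sort ends up being on post_indices.city (as mentioned in UPDATE 2), something like this would be needed for that plan (the index name is just illustrative):
ALTER TABLE post_indices ADD INDEX index_post_indices_on_city (city);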
Note that the most optimal plan will depend on how frequently deleted_at is NULL. If deleted_at is frequently NULL, the above plan will be fast. If not, the above plan will have to run through many rows of post_indices before a match is found. Note also that for queries with OFFSET, the same plan may not be the most effective.
I think the issue here is that MySQL decides the join order without considering the effects of ORDER BY and LIMIT. In other words, it will choose the join order that it thinks is fastest to execute the full join.
Since there is a restriction on the companies table (deleted_at is NULL), I am not surprised that it will start with this table.

Estimate/speedup huge table self-join on mysql

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has been running for more than a day already, which is too long. A few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the query on an indexed field:
WHERE messagedetails_id < 100000
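For instance, the batched version could look like this (the range boundaries are arbitrary; repeat with successive ranges over messagedetails_id):
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
WHERE messagedetails_id >= 0 AND messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;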
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into two steps:
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Then use a script to check each record of the duplicate_hashes table.
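If the end result still has to be the foundline table, one way to use duplicate_hashes for that is a sketch like this (assuming duplicate_hashes already narrows the hashes down to the ones you care about):
CREATE TABLE foundline AS
SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM messageline ml
JOIN duplicate_hashes dh ON dh.hash = ml.hash;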

How to make this complicated query faster [MySQL]?

I have the following query:
SELECT JL.j_id, COUNT(*) as total
FROM j_log JL
WHERE JL.log_time > '20120205164008'
AND JL.j_id IN (
SELECT j_id
FROM j
WHERE checked = '1'
AND expires >= '20120207164008'
) GROUP BY JL.j_id ORDER BY total DESC LIMIT 3
The j table has a big structure: 100 fields and 248,986 rows.
The following keys are present on it:
PRIMARY KEY (`j_id`),
KEY `expires` (`expires`),
KEY `checked` (`checked`),
KEY `checked_2` (`checked`,`expires`)
The j_log table has about 63,000,000 records and the following structure:
CREATE TABLE `j_log` (
`j_id` int(11) NOT NULL DEFAULT '0',
`member_id` int(11) DEFAULT NULL,
`ip` int(10) unsigned NOT NULL DEFAULT '0',
`log_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
KEY `j_id` (`j_id`),
KEY `log_time` (`log_time`),
KEY `ip` (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
So the query wants to get the top 3 most visited j_id instances.
This is the plan:
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
| 1 | PRIMARY | JL | index | log_time | j_id | 4 | NULL | 63914602 | 0.36 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | j | unique_subquery | PRIMARY,expires,checked,checked_2 | PRIMARY | 4 | func | 1 | 100.00 | Using where |
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
Sometimes it can take up to 15 (!) minutes.
Is there any way to make it faster?
SELECT JL.j_id, COUNT(*) as total
FROM j_log JL
INNER JOIN j
ON JL.j_id = j.j_id
AND j.checked = '1'
AND j.expires >= '20120207164008'
WHERE JL.log_time > '20120205164008'
GROUP BY JL.j_id
ORDER BY total
DESC LIMIT 3
Will this be faster?
Why do you use a subquery?
Why is checked a string? ('1' instead of just 1)
Why do you compare jl.log_time and j.expires differently (> vs >=)?
How about this query:
SELECT j.j_id, COUNT(jl.j_id) as total
FROM j
LEFT JOIN j_log jl ON (jl.j_id = j.j_id AND jl.log_time > '20120205164008')
WHERE j.checked = '1' AND j.expires >= '20120207164008'
GROUP BY j.j_id
ORDER BY total DESC
LIMIT 3
Make sure j_id is the PRIMARY KEY for both tables and put an index on j.expires, j.checked, and jl.log_time.
Also make sure the checked field is optimized. I'm not sure what the possible values can be, but I assume it's a boolean field, so rather make the field type BIT or use an ENUM.
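For example (a sketch, assuming checked only ever holds '0' or '1'):
ALTER TABLE j MODIFY `checked` ENUM('0','1') NOT NULL DEFAULT '0';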
Edit
You should also convert the j.expires and jl.log_time fields to better types. It looks like expires is just a varchar now, judging from the value you use: 20120205164008. Convert it into a DATETIME field (but don't just convert the column in place, because you will lose the data).
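One cautious way to do that conversion (a sketch, assuming the values are strings in YYYYMMDDHHMMSS format) is to add a new column and backfill it instead of altering the column in place:
ALTER TABLE j ADD COLUMN expires_dt DATETIME NULL;
UPDATE j SET expires_dt = STR_TO_DATE(expires, '%Y%m%d%H%i%s');
-- verify the values, switch the queries over to expires_dt, then drop the old column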