I have two tables. From these two tables I am trying to insert records into a third table using a SELECT query with a JOIN. However, I found that the SELECT query with the JOIN is not using indexes and takes a lot of time, so the insertion is very slow.
I tried creating multiple indexes as suggested in a few posts, but to no avail:
MySQL with JOIN not using index
MySQL query with JOIN not using INDEX
Here is my table structure:
CREATE TABLE master_table (
id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
field1 VARCHAR(50) DEFAULT NULL,
field2 VARCHAR(50) DEFAULT NULL,
field3 VARCHAR(50) DEFAULT NULL,
field4 VARCHAR(50) DEFAULT NULL,
PRIMARY KEY (id),
KEY mt_field1_index (field1)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
CREATE TABLE child_table (
c_id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
m_id BIGINT(20) UNSIGNED NOT NULL ,
group_id BIGINT(20) UNSIGNED NOT NULL ,
status ENUM('Status1','Status2','Status3') NOT NULL,
job_id VARCHAR(50) DEFAULT NULL,
PRIMARY KEY (c_id),
UNIQUE KEY ct_mid_gid (m_id,group_id),
KEY Index_ct_status (status),
KEY index_ct_jobid (job_id),
KEY index_ct_mid (m_id),
KEY index_ct_cid_sts_tsk (group_id,status,job_id)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
Query:
SELECT m.id
, NULLIF(TRIM(m.field1),'')
FROM master_table m
JOIN child_table c
ON m.id = c.m_id
WHERE c.group_id = 2
AND c.status = 'Status3'
AND c.job_id = 0
ORDER
BY m.id
LIMIT 0, 1000;
Explain:
+-------+-------------+-------+------------+----------+------------------------------------------------------------------------------+-----------------+---------+--------------+-------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+-------+-------------+-------+------------+----------+------------------------------------------------------------------------------+-----------------+---------+--------------+-------+----------+------------------------------------------------+
| 1 | SIMPLE | c | (NULL) | ref | ct_mid_gid,Index_ct_status,index_ct_jobid,index_ct_mid,index_ct_cid_sts_tsk | Index_ct_status | 1 | const | 65689 | 0.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m | (NULL) | eq_ref | PRIMARY | PRIMARY | 8 | r_n_d.c.m_id | 1 | 100.00 | (NULL) |
+-------+-------------+-------+------------+----------+------------------------------------------------------------------------------+-----------------+---------+--------------+-------+----------+------------------------------------------------+
WHERE c.group_id = 2
AND c.status = 'Status3'
AND c.job_id = 0
ORDER BY c.m_id -- Note the change
Needs
INDEX(group_id, status, job_id, -- in any order
m_id) -- last
What you have (separate indexes) is not the same.
For the LIMIT to take effect early, the index must carry you entirely past the WHERE and ORDER BY. This avoids computing all the matching rows (before the LIMIT), sorting them, and only then applying the LIMIT.
So, you get 4 speedups:
Index efficiently fetching the desired rows (from c)
No need for a sort pass (the index delivers the rows already in ORDER BY order)
The index is "covering" (hence, no bouncing back and forth between index BTree and data BTree for c)
The scan gets to stop after 1000 rows.
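A concrete way to add that index (a sketch; the index name is made up, adjust to your naming convention):
ALTER TABLE child_table
  ADD INDEX idx_gid_status_job_mid (group_id, status, job_id, m_id);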
While you are at it, consider getting rid of the AUTO_INCREMENT. Toss c_id and change
PRIMARY KEY (c_id),
UNIQUE KEY ct_mid_gid (m_id, group_id)
-->
PRIMARY KEY(m_id, group_id)
Coincidentally, if you had done this, your KEY index_ct_cid_sts_tsk (group_id,status,job_id) would have stumbled into the perfect index. This is because the PK is implicitly tacked onto any secondary index, but you need m_id, not c_id. Anyway, I prefer to be explicit.
And when making changes, drop any redundant indexes. For example, KEY index_ct_mid (m_id) is useless since m_id is already the leading column of another index (ct_mid_gid).
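Putting those suggestions together, the change might look like this untested sketch (it assumes nothing else, such as a foreign key or application code, references c_id):
ALTER TABLE child_table
  DROP PRIMARY KEY,
  DROP COLUMN c_id,
  DROP KEY ct_mid_gid,    -- superseded by the new PRIMARY KEY
  DROP KEY index_ct_mid,  -- redundant prefix
  ADD PRIMARY KEY (m_id, group_id);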
Have you created an index on the columns used in the WHERE clause?
Check the query after adding an index on those columns.
But remember: if you add an index, INSERT, UPDATE, and DELETE operations on the table may get slower.
I am looking for a way to make my SELECT query even faster, because I have a feeling it should be possible.
Here is the query:
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
Here is the EXPLAIN SELECT output:
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tp | eq_ref | PRIMARY | PRIMARY | 4 | dev.r.id_pair | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
The tables are:
CREATE TABLE `tag_product` (
`id_pair` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_product` int(10) unsigned NOT NULL,
`id_user_tag` int(10) unsigned NOT NULL,
`status` tinyint(3) NOT NULL,
`percentile` decimal(8,4) unsigned NOT NULL,
`percentile_weighted` decimal(8,4) unsigned NOT NULL,
`elo` int(10) unsigned NOT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_pair`),
UNIQUE KEY `id_product_id_user_tag` (`id_product`,`id_user_tag`),
KEY `status` (`status`),
KEY `id_user_tag` (`id_user_tag`),
CONSTRAINT `tag_product_ibfk_5` FOREIGN KEY (`id_user_tag`) REFERENCES `user_tag` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `tag_rating` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_customer` int(10) unsigned NOT NULL,
`id_pair` int(10) unsigned NOT NULL,
`id_duel` int(10) unsigned NOT NULL,
`value` tinyint(4) NOT NULL,
`date_add` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_duel_id_pair` (`id_duel`,`id_pair`),
KEY `id_pair_id_customer` (`id_pair`,`id_customer`),
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
KEY `id_customer_value_date_add` (`id_customer`,`value`,`date_add`),
CONSTRAINT `tag_rating_ibfk_3` FOREIGN KEY (`id_pair`) REFERENCES `tag_product` (`id_pair`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `tag_rating_ibfk_6` FOREIGN KEY (`id_duel`) REFERENCES `tag_rating_duel` (`id_duel`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The table tag_product has about 250k rows and the tag_rating has about 1M rows.
My issue is that the SQL query takes about 0.8s on average on my machine. I would like to get it ideally under 0.5s, while also assuming the tables can get about 10 times bigger. The number of rows in play should stay about the same, because I have a date condition (I only want rows less than a month old).
Is it possible to make this faster with some trick (i.e., without restructuring my tables)? When I slightly modify the statement (don't join the smaller table) as follows:
SELECT r.id_customer, COUNT(*)
FROM tag_rating AS r USE INDEX (value_date_add)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer;
here is the EXPLAIN SELECT output:
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
it takes about 0.25s, which is great. So the JOIN makes it 3x slower. Is that inevitable? Since I am joining via the primary key, I feel it shouldn't make the query 3x slower.
---UPDATE---
This is actually my query. The number of different id_customer values is about 1,000 and is expected to rise; the number of rows with value=1 is exactly half. So far, query performance seems to slow down linearly with the number of rows in the rating table.
Adding id_pair to the end of the id_customer_value_date_add (or value_id_customer_date_add) index doesn't help.
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (id_customer_value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.id_customer IN (2593179,1461878,2318871,2654090,2840415,2852531,2987432,3473275,3960453,3961798,4129734,4191571,4202912,4204817,4211263,4248789,765650,1341317,1430380,2116196,3367674,3701901,3995273,4118307,4136114,4236589,783262,913493,1034296,2626574,3574634,3785772,2825128,4157953,3331279,4180367,4208685,4287879,1038898,1445750,1975108,3658055,4185296,4276189,428693,4248631,1892448,3773855,2901524,3830868,3934786) AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
This is the EXPLAIN SELECT output:
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | r | range | id_customer_value_date_add | id_customer_value_date_add | 10 | | 558906 | Using where; Using index |
| 1 | SIMPLE | tp | eq_ref | PRIMARY,status | PRIMARY | 4 | dev.r.id_pair | 1 | Using where |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
Any tips are appreciated. Thank you.
INDEX(value, date_add, id_customer, id_pair)
Would be "covering", giving an extra boost on performance for both queries. And also for Gordon's formulation.
At the same time, get rid of these:
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
because they might get in the way of the Optimizer picking the new index. Any other queries that were using those indexes will easily use the new index.
If you are not otherwise using tag_rating.id, get rid of it and promote the UNIQUE to PRIMARY KEY.
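A sketch of those changes in DDL (the new index name is arbitrary; verify against your workload before dropping indexes in production):
ALTER TABLE tag_rating
  ADD INDEX value_date_customer_pair (value, date_add, id_customer, id_pair),
  DROP KEY `value`,
  DROP KEY value_date_add;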
Try writing the query using a correlated subquery:
SELECT r.id_customer,
(SELECT ROUND(AVG(tp.percentile_weighted), 2)
FROM tag_product tp
WHERE tp.id_pair = r.id_pair
) AS percentile
FROM tag_rating AS r
WHERE r.value = 1 AND
r.date_add > '2020-08-08 11:56:00';
This eliminates the outer aggregation, which should be faster.
I have a query that takes about 90 seconds to run even though the tables should have the right indexes. I don't understand why.
I am using MySQL and the tables are InnoDB.
This is the query:
SELECT count(*)
FROM `following_lists` fl INNER JOIN users u
ON fl.user_uuid = u.user_uuid
WHERE fl.following_query_id = 1000010 AND u.status <= 2
I expect this query to start on the table following_lists, grab about 4K records as per the WHERE condition, join these records to the table users by its primary key, check the value of a field in the users table, and return the count of the resulting records. Why does it take so long? Could it be because the two fields I'm joining the tables by are CHAR(40) and not integers?
These are the tables involved and their indexes:
CREATE TABLE `users` (
`user_uuid` CHAR(40) NOT NULL,
`status` TINYINT UNSIGNED NOT NULL,
...
PRIMARY KEY (`user_uuid`),
...
)
CREATE TABLE `following_lists` (
`following_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`following_query_id` INT UNSIGNED NOT NULL,
`user_uuid` CHAR(40) NOT NULL,
PRIMARY KEY (`following_id`),
KEY `query_id` (`following_query_id`),
KEY `user_uuid` (`user_uuid`)
)
And this is the output of the explain query:
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
| 1 | SIMPLE | fl | ref | query_id,user_uuid | query_id | 4 | const | 3718 | |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 160 | fl.user_uuid | 1 | Using index |
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
Further details:
The table following_lists has about 25k rows, but only 3718 have fl.following_query_id = 1000010.
The table users has about 160k rows, but only 3718 should be selected in the join. Only 40 records meet both conditions fl.following_query_id = 1000010 AND u.status <= 2.
The query is slow even if I remove the condition AND u.status <= 2.
"have the right indexes" -- dead give away.
If you are using MyISAM, don't. Instead, switch to InnoDB.
Do you need following_lists.following_id for anything? Is (following_query_id, user_uuid) unique? If so, make that pair the PRIMARY KEY.
If you can't do the above, change
KEY `query_id` (`following_query_id`)
to
INDEX(following_query_id, user_uuid)
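For example (a sketch; the new index name is arbitrary):
ALTER TABLE following_lists
  DROP KEY query_id,
  ADD INDEX query_id_user_uuid (following_query_id, user_uuid);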
UUIDs are terribly inefficient, especially when unnecessarily declared utf8mb4, or CHAR with a larger-than-necessary size. Change to CHAR(36) CHARACTER SET ascii. (Notice the key_len of 160 in the EXPLAIN shrink significantly.)
More on why UUIDs are bad for performance: http://mysql.rjweb.org/doc.php/uuid
How much RAM do you have? What is the setting for innodb_buffer_pool_size? (Sounds like it is too low.)
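You can check the current setting like this (a common rule of thumb for a dedicated database server is around 70% of RAM, but treat that only as a starting point):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SELECT @@innodb_buffer_pool_size;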
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Problem with MySQL version 5.7.18. Earlier versions of MySQL behave as expected.
Here are two tables. Table 1:
CREATE TABLE `test_events` (
`id` int(11) NOT NULL,
`event` int(11) DEFAULT '0',
`manager` int(11) DEFAULT '0',
`base_id` int(11) DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`client` int(11) DEFAULT '0',
`event_time` datetime DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_events`
ADD PRIMARY KEY (`id`),
ADD KEY `client` (`client`),
ADD KEY `event_time` (`event_time`),
ADD KEY `manager` (`manager`),
ADD KEY `base_id` (`base_id`),
ADD KEY `create_time` (`create_time`);
And the second table:
CREATE TABLE `test_event_types` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`base` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_event_types`
ADD PRIMARY KEY (`id`);
Let's try to select the last event from base "314":
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
WHERE base = 314
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | test_events | NULL | ALL | NULL | NULL | NULL | NULL | 434928 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | test_event_types | NULL | ALL | PRIMARY | NULL | NULL | NULL | 44 | 2.27 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
MySQL is not using an index and reads the whole table.
Without the WHERE clause:
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| 1 | SIMPLE | test_events | NULL | index | NULL | create_time | 4 | NULL | 1 | 100.00 | NULL |
| 1 | SIMPLE | test_event_types | NULL | eq_ref | PRIMARY | PRIMARY | 4 | m16.test_events.event | 1 | 100.00 | Using index |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
Now it uses the index.
MySQL 5.5.55 uses the index in both cases. Why is this, and what can be done about it?
I don't know the cause of the difference you are seeing between your previous and current installations, but the server's behaviour makes sense.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) ORDER BY test_events.create_time DESC LIMIT 1;
In this query you do not have a WHERE clause, and you are fetching one row only, after sorting by create_time, which happens to have an index. That index can be used for the sort. But let's look at the second query.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) WHERE base = 314 ORDER BY test_events.create_time DESC LIMIT 1
You don't have an index on the base column, so no index can be used for that condition. To find the relevant records, MySQL has to do a table scan. Having identified the relevant rows, they need to be sorted. But in this case the query planner has decided that it's just not worth it to use the index on create_time.
I see several problems with your setup, the first being the missing index on base, as already mentioned. But why is base VARCHAR? You appear to be storing integers in it.
ALTER TABLE test_events
ADD PRIMARY KEY (id),
ADD KEY client (client),
ADD KEY event_time (event_time),
ADD KEY manager (manager),
ADD KEY base_id (base_id),
ADD KEY create_time (create_time);
And making multiple single-column indexes like this doesn't make much sense in MySQL, because MySQL generally uses only one index per table in a query. You would be far better off with one or two indexes, possibly multi-column indexes.
I think your ideal index would contain both the create_time and event fields.
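A sketch of such an index (the name is illustrative; whether it helps depends on the join order the optimizer chooses):
ALTER TABLE test_events
  ADD INDEX idx_event_create_time (event, create_time);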
base = 314 with base VARCHAR... is a performance problem. Either put quotes around 314 or make base some integer type.
You appear not to need LEFT. If so, use a plain JOIN so that the optimizer has the freedom to start with an INDEX(base), which is currently missing but needed.
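Combining the two suggestions, a sketch might look like this (it assumes you do not need rows whose event has no match in test_event_types, which is what the LEFT JOIN would preserve; idx_base is a made-up name):
ALTER TABLE test_event_types ADD INDEX idx_base (base);

SELECT e.create_time
FROM test_events AS e
JOIN test_event_types AS t ON t.id = e.event
WHERE t.base = '314'
ORDER BY e.create_time DESC
LIMIT 1;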
As for the differences among 5.5, 5.6, and 5.7: there have been a number of optimizer changes, and you may have encountered a regression. But I don't want to chase that until you have improved the query and indexes.
I stumbled upon the same scenario, where MySQL was using a table scan instead of an index lookup.
This could be because of one of the reasons mentioned in the MySQL docs:
The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
mysql docs link
And when I checked the EXPLAIN of the query on a production server with a large number of rows, it used the index as expected.
It's one of MySQL's optimizations under the hood :)
I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has been running for more than a day already, which is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this:
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the request on an indexed field:
WHERE messagedetails_id < 100000
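For example, processing the table in chunks on that indexed field (the boundary value is illustrative):
SELECT messagedetails_id, COUNT(*) c
FROM messageline
WHERE messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;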
Is it required to do this solely with SQL? For this number of records you would be better off breaking it down into two steps:
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the GROUP BY will be relatively fast. Then, by counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Then use a script to check each record of the duplicate_hashes table.
I've been messing around all day trying to find out why my query performance is terrible. The query is extremely simple, yet can take over 15 minutes to execute (I abort it at that point). I am joining a table with over 2 million records.
This is the select:
SELECT
audit.MessageID, alerts.AlertCount
FROM
audit
LEFT JOIN (
SELECT MessageID, COUNT(ID) AS 'AlertCount'
FROM alerts
GROUP BY MessageID
) AS alerts ON alerts.MessageID = audit.MessageID
This is the EXPLAIN output:
+----+-------------+------------+-------+---------------+----------------------+---------+------+---------+----------+-------------+
| id | select_type | table      | type  | possible_keys | key                  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+------------+-------+---------------+----------------------+---------+------+---------+----------+-------------+
|  1 | PRIMARY     | AL         | index | NULL          | IDX_audit_MessageID  | 4       | NULL | 2330944 | 100.00   | Using index |
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL                 | NULL    | NULL | 124140  | 100.00   |             |
|  2 | DERIVED     | alerts     | index | NULL          | IDX_alerts_MessageID | 5       | NULL | 124675  | 100.00   | Using index |
+----+-------------+------------+-------+---------------+----------------------+---------+------+---------+----------+-------------+
This is the schema:
# Not joining, just showing types
CREATE TABLE messages (
ID int NOT NULL AUTO_INCREMENT,
MessageID varchar(255) NOT NULL,
PRIMARY KEY (ID),
INDEX IDX_messages_MessageID (MessageID)
);
# 2,324,931 records
CREATE TABLE audit (
ID int NOT NULL AUTO_INCREMENT,
MessageID int NOT NULL,
LogTimestamp timestamp NOT NULL,
PRIMARY KEY (ID),
INDEX IDX_audit_MessageID (MessageID),
CONSTRAINT FK_audit_MessageID FOREIGN KEY(MessageID) REFERENCES messages(ID)
);
# 124,140
CREATE TABLE alerts (
ID int NOT NULL AUTO_INCREMENT,
AlertLevel int NOT NULL,
Text nvarchar(4096) DEFAULT NULL,
MessageID int DEFAULT 0,
PRIMARY KEY (ID),
INDEX IDX_alert_MessageID (MessageID),
CONSTRAINT FK_alert_MessageID FOREIGN KEY(MessageID) REFERENCES messages(ID)
);
A few very important things to note: MessageID is not 1:1 in either 'audit' or 'alerts'; a MessageID can exist in one table but not the other, or in both (which is the purpose of my join); in my test DB, none of the MessageIDs exist in both. In other words, my query will return 2.3 million records with 0 as the count.
Another thing to note is that the 'audit' and 'alerts' tables used to store MessageID as varchar(255). I created the 'messages' table expecting that it would fix the join. It actually made it worse. Previously it would take 78 seconds; now it never returns.
What am I missing about MySQL?
Subqueries are very hard for the MySQL engine to optimize. Try:
SELECT
audit.MessageID, COUNT(alerts.ID) AS AlertCount
FROM
audit
LEFT JOIN alerts ON alerts.MessageID = audit.MessageID
GROUP BY audit.MessageID
You're joining to a subquery.
The subquery results are effectively a temporary table - note the <derived2> in the query execution plan. As you can see there, they're not indexed, since they're ephemeral.
You should execute the query as a single unit with a join, rather than joining to the results of a second query.
EDIT: Andrew has posted an answer with one example of how to do your work in a normal join query, instead of in two steps.