MySQL version is 5.5.40-0+wheezy1-log.
I have this query:
SELECT cycle_id, sum(fst_field) + sum(snd_field) AS tot_sum
FROM mytable WHERE parent_id IN (
SELECT id FROM mytable WHERE cycle_id = 2662
)
I have these indexes:
parent_id
parent_id, cycle_id, fst_field, snd_field
If I execute the command
EXPLAIN EXTENDED SELECT cycle_id, sum(fst_field) + sum(snd_field) AS tot_sum
FROM mytable WHERE parent_id IN (
SELECT id FROM mytable WHERE cycle_id = 2662
)
This is the result:
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | mytable | ALL | NULL | NULL | NULL | NULL | 185971 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | mytable | unique_subquery | PRIMARY,cycle_id_idx | PRIMARY | 4 | func | 1 | 100.00 | Using where |
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
The outer query does not use any index. I tried adding other composite indexes (I tried several), without success.
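MySQL 5.5 still has very crude handling of IN ( SELECT ... ): it turns the subquery into a dependent subquery that is re-evaluated for each row of the outer table, which is what the DEPENDENT SUBQUERY line in your EXPLAIN shows. That probably explains the problem.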
Consider upgrading to 5.6 or 5.7 or 8.0.
Convert the query to use a JOIN.
INDEX(cycle_id) is needed.
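A minimal sketch of the JOIN rewrite suggested above. It assumes id is the primary key of mytable (the EXPLAIN shows the subquery using PRIMARY), so each row matches at most one parent and the sums stay the same. The aliases parent and child are made up for illustration.

```sql
-- Same result as the IN ( SELECT ... ) version, written as a self-join.
-- parent: rows with cycle_id = 2662 (can use an index on cycle_id)
-- child:  rows whose parent_id points at one of those rows
--         (can use the index starting with parent_id)
SELECT child.cycle_id,
       SUM(child.fst_field) + SUM(child.snd_field) AS tot_sum
FROM mytable AS parent
JOIN mytable AS child
  ON child.parent_id = parent.id
WHERE parent.cycle_id = 2662;
```

As in the original query, cycle_id is selected without a GROUP BY, so the value returned for it is taken from an arbitrary matching row. Run EXPLAIN on the rewritten query to check which indexes are actually chosen on your data.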
I have two tables, ny_clean (3454602 rows) and pickup_0_ids_temp_table (2739268 rows). Both have an id CHAR(11) column that is the primary key, with a BTREE index on top of it (MySQL 5.7).
The "id" column in pickup_0_ids_temp_table is a subset of ny_clean and I want to get a result which is ny_clean without the id values from pickup_0_ids_temp_table.
Option 1:
EXPLAIN
SELECT *
FROM pickup_0_ids_temp_table as t
JOIN ny_clean as n
ON n.id != t.id;
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| 1 | SIMPLE | t | NULL | index | NULL | PRIMARY | 11 | NULL | 2734512 | 100.00 | Using index |
| 1 | SIMPLE | ny_clean | NULL | index | NULL | btree_pk_ny_clean | 11 | NULL | 3445904 | 90.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
Option 2:
EXPLAIN
SELECT *
FROM ny_clean as n
WHERE n.id NOT IN (
SELECT id
FROM pickup_0_ids_temp_table);
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| 1 | PRIMARY | n | NULL | ALL | NULL | NULL | NULL | NULL | 3445904 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | pickup_0_ids_temp_table | NULL | unique_subquery | PRIMARY,btree_pickup_0 | PRIMARY | 11 | func | 1 | 100.00 | Using index |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
I then use one of the options inside this larger query
EXPLAIN
INSERT INTO y
SELECT id, pickup_longitude, pickup_latitude
FROM x
JOIN
(OPTION 1 OR 2) as z
ON z.id = x.id;
When I used Option 1 inside the larger query, it ran for two days without finishing. Option 2, on the other hand, did the job in less than 30 minutes.
My question: why is that?
Following the MySQL documentation (https://dev.mysql.com/doc/refman/5.7/en/subquery-materialization.html) I suspect it is due to materialization of the subquery, but how would I check this?
And am I interpreting the EXPLAIN output wrong? Judging from it, I would expect Option 1 to be faster, since it uses an index on both tables.
Or does it have to do with the larger query?
Thanks in advance
Your Option 1 doesn't do what you think it does.
If you have two tables
n.id t.id
1 1
2 2
3 3
joined ON n.id != t.id;
You get:
1,2
1,3
2,1
2,3
3,1
3,2
That is almost a Cartesian product: 3.4 million x 2.7 million ≈ 9.18 x 10^12 (about 9 trillion) rows, not 9 million.
Then you JOIN against that result, and because the materialized table has no index, it takes a very long time.
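If you want a JOIN-style way of getting "ny_clean without the ids from pickup_0_ids_temp_table", the usual pattern is an anti-join. This is a sketch using the table names from the question. It is not something the original poster ran.

```sql
-- Anti-join: keep only the ny_clean rows that have no match in
-- pickup_0_ids_temp_table. The join is on equality, so each outer row
-- needs one primary-key lookup instead of a near-Cartesian product.
SELECT n.*
FROM ny_clean AS n
LEFT JOIN pickup_0_ids_temp_table AS t
  ON t.id = n.id
WHERE t.id IS NULL;
```

This returns the same rows as Option 2 (NOT IN), since id is a non-NULL primary key in both tables. Check with EXPLAIN which of the two your server executes better.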
My query example is :
select * from table1
where table1_seq = (select table2_seq from table2 where ~)
Does the subquery run for every row of table1, or just once?
If it runs just once, which performs better: the query above or a JOIN query?
Your query runs without an error only if the subquery returns at most one row. If it returns more than one row, MySQL raises an error.
You should run EXPLAIN on the query:
Explain select * from table1
where table1_seq = (select table2_seq from table2 where id2=X)
It will return something like this:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
==========================================================================================================================================================
1 | PRIMARY | Table1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL
2 | SUBQUERY | Table2 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index
An inner join, on the other hand, also accesses both tables:
EXPLAIN select table1.*
from table1
INNER JOIN table2 ON table1.table1_seq = table2.table2_seq
WHERE table2.id2=X
Result :
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
==========================================================================================================================================================
1 | SIMPLE | Table1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL
2 | SIMPLE | Table2 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL
So it is less a question of benchmarks and more a question of what you want to do. If you only need a single value from another table, the query you wrote is perfectly fine.
I have the following query:
select * from mapping_channel_fqdn_virtual_host where id in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
Running EXPLAIN on the above query gives me the following result:
explain select * from mapping_channel_fqdn_virtual_host where id in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
+----+-------------+-----------------------------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | mapping_channel_fqdn_virtual_host | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 10 | 100.00 | Using where |
+----+-------------+-----------------------------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0,00 sec)
id is the primary key of the table. It looks like this is a range type of query and it uses the Primary Key as index for this one.
When I try to explain the following query I get different results:
explain select * from mapping_channel_fqdn_virtual_host where id in (select max(id) from mapping_channel_fqdn_virtual_host group by channel_id, fqdn_virtual_host_id);
+----+-------------+-----------------------------------+------------+-------+------------------+------------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------------------+------------+-------+------------------+------------------+---------+------+------+----------+-------------+
| 1 | PRIMARY | mapping_channel_fqdn_virtual_host | NULL | ALL | NULL | NULL | NULL | NULL | 4849 | 100.00 | Using where |
| 2 | SUBQUERY | mapping_channel_fqdn_virtual_host | NULL | index | idx_channel_fqdn | idx_channel_fqdn | 8 | NULL | 4849 | 100.00 | Using index |
+----+-------------+-----------------------------------+------------+-------+------------------+------------------+---------+------+------+----------+-------------+
2 rows in set, 1 warning (0,00 sec)
idx_channel_fqdn is a composite index on the column pair used in the GROUP BY clause. But with the subquery, the primary query stops using the index it used before. Can you explain why this is happening?
Tried the JOIN query Anouar suggested:
explain select * from mapping_channel_fqdn_virtual_host as x join (select max(id) as ids from mapping_channel_fqdn_virtual_host group by channel_id, fqdn_virtual_host_id) as y on x.id=y.ids;
+----+-------------+-----------------------------------+------------+--------+------------------+------------------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------------------+------------+--------+------------------+------------------+---------+-------+------+----------+-------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 4849 | 100.00 | Using where |
| 1 | PRIMARY | x | NULL | eq_ref | PRIMARY | PRIMARY | 4 | y.ids | 1 | 100.00 | NULL |
| 2 | DERIVED | mapping_channel_fqdn_virtual_host | NULL | index | idx_channel_fqdn | idx_channel_fqdn | 8 | NULL | 4849 | 100.00 | Using index |
+----+-------------+-----------------------------------+------------+--------+------------------+------------------+---------+-------+------+----------+-------------+
3 rows in set, 1 warning (0,00 sec)
Judging by the index and eq_ref types, it looks like the JOIN is better than the subquery? Can you explain the output of the EXPLAIN for the join in a bit more detail?
You can look at the answers to this question; you will find ideas there.
I am quoting from some of the answers:
Sometimes MySQL does not use an index, even if one is available. One
circumstance under which this occurs is when the optimizer estimates
that using the index would require MySQL to access a very large
percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
In short, try forcing the index:
SELECT *
FROM mapping_channel_fqdn_virtual_host FORCE INDEX (name of the index you want to use)
WHERE (mapping_channel_fqdn_virtual_host.id IN (1,2,3,4,5,6,7,8,9,10));
Or use a JOIN instead and check the EXPLAIN:
SELECT * FROM mapping_channel_fqdn_virtual_host mcf
JOIN (select max(id) as ids from mapping_channel_fqdn_virtual_host group by channel_id, fqdn_virtual_host_id) AS mcfv
ON mcf.id = mcfv.ids;
I needed to get values from the "latest" (i.e. highest record id) record for each value of a field (server_name in this case).
I had already added a server_name_id index on server_name and id.
My first attempt took minutes to run.
SELECT server_name, state
FROM replication_client as a
WHERE id = (
SELECT MAX(id)
FROM replication_client
WHERE server_name = a.server_name)
ORDER BY server_name
My second attempt took 0.001s to run.
SELECT rep.server_name, state FROM (
SELECT server_name, MAX(id) AS max_id
FROM replication_client
GROUP BY server_name) AS newest,
replication_client AS rep
WHERE rep.id = newest.max_id
ORDER BY server_name
What is the principle behind this optimisation? (I'd like to be able to write optimised queries without trial and error.)
P.S. Explained below:
mysql> EXPLAIN
->
-> SELECT server_name, state
-> FROM replication_client as a
-> WHERE id = (SELECT MAX(id) FROM replication_client WHERE server_name = a.server_name)
-> ORDER BY server_name
-> ;
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
| 1 | PRIMARY | a | ALL | NULL | NULL | NULL | NULL | 630711 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | replication_client | ref | server_name_id | server_name_id | 18 | mrg.a.server_name | 45050 | Using index |
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
mysql> explain
-> SELECT rep.server_name, state FROM (
-> SELECT server_name, MAX(id) AS max_id
-> FROM replication_client
-> GROUP BY server_name) AS newest,
-> replication_client AS rep
-> WHERE rep.id = newest.max_id
-> ORDER BY server_name
-> ;
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using temporary; Using filesort |
| 1 | PRIMARY | rep | eq_ref | PRIMARY | PRIMARY | 4 | newest.max_id | 1 | |
| 2 | DERIVED | replication_client | range | NULL | server_name_id | 18 | NULL | 15 | Using index for group-by |
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
Well, the whole thing explains itself when you look at two words in your first EXPLAIN plan: DEPENDENT SUBQUERY
This means the subquery is executed once for every row that your WHERE condition examines (an estimated 630711 rows here). Of course this can be slow as hell.
Also note that there is a logical order in which the clauses of a query are evaluated:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
When you can filter in the FROM clause, it is better than filtering in the WHERE clause.
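For illustration, here is the fast query from the question again, written with explicit JOIN syntax. It is only a restyling of the poster's second attempt with the same tables and columns. Per the second EXPLAIN, the derived table newest is built once ("Using index for group-by" on server_name_id) and each of its rows then needs a single primary-key lookup (eq_ref) in rep.

```sql
-- The filtering happens in the FROM clause: the derived table already
-- holds one row (the highest id) per server_name before the join.
SELECT rep.server_name, rep.state
FROM (
    SELECT server_name, MAX(id) AS max_id
    FROM replication_client
    GROUP BY server_name
) AS newest
JOIN replication_client AS rep
  ON rep.id = newest.max_id
ORDER BY rep.server_name;
```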
The following queries run almost instantaneously on my MySQL server:
SELECT table_name.id
FROM table_name
WHERE table_name.id in (10000)
SELECT table_name.id
from table_name
where table_name.id = (SELECT table_name.id
FROM table_name
WHERE table_name.id in (10000)
);
But if I change the second query to the following, it takes more than 20 seconds:
SELECT table_name.id
from table_name
where table_name.id in (SELECT table_name.id
FROM table_name
WHERE table_name.id in (10000)
);
Running EXPLAIN gives the following output. There is clearly some issue with how MySQL uses its indexes together with the IN keyword.
For first query:
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | table_name | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
For second query:
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | table_name | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
| 2 | SUBQUERY | table_name | const | PRIMARY | PRIMARY | 4 | | 1 | Using index |
+----+-------------+---------------+-------+---------------+---------+---------+-------+------+-------------+
For third query:
+----+--------------------+------------+-------+---------------+---------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+---------------+---------+---------+-------+---------+--------------------------+
| 1 | PRIMARY | table_name | index | NULL | sentTo | 5 | NULL | 6250751 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | table_name | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
+----+--------------------+------------+-------+---------------+---------+---------+-------+---------+--------------------------+
I am using InnoDB and have tried changing the third query to forcibly use the index as indicated by the following category.
In the = case the subquery runs only once, because the equality is checked against a single value.
In the IN case you effectively get a Cartesian multiplication (each against each), because IN runs the subquery for each row of the outer table, as the DEPENDENT SUBQUERY line in the EXPLAIN shows. That is bad for performance.
Try to use joins in these cases.
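A sketch of what such a join could look like for the third query, assuming id is the primary key of table_name (as the EXPLAIN output suggests). The aliases t and picked are made up for illustration.

```sql
-- The inner SELECT is evaluated once as a derived table, and the outer
-- table is then matched against it on the primary key instead of
-- re-running the subquery for every outer row.
SELECT t.id
FROM table_name AS t
JOIN (
    SELECT id
    FROM table_name
    WHERE id IN (10000)
) AS picked
  ON picked.id = t.id;
```

Because id is unique, the join cannot produce duplicate rows, so the result matches the IN version. Run EXPLAIN on it to confirm that the DEPENDENT SUBQUERY step is gone on your server version.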