Experiment
First, I create two tables:
create database test;
use test;
create table user_purchase
(
order_id int primary key auto_increment,
user_id int,
amount int
);
create table users
(
user_id int primary key auto_increment,
name varchar(15),
age smallint(4)
);
alter table user_purchase
add foreign key (user_id) references users (user_id);
Second, I insert some random data:
wget https://github.com/Percona-Lab/mysql_random_data_load/releases/download/v0.1.12/mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz && tar -xvf mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz && chmod 744 mysql_random_data_load
./mysql_random_data_load test user_purchase 4000 --host 127.0.0.1 --password 123 --user root
./mysql_random_data_load test users 10000 --host 127.0.0.1 --password 123 --user root
Third, I log in and execute two queries:
select *
from users as u
where exists (select 1 from user_purchase as up
where up.user_id = u.user_id); ===> it takes about 0.05 sec
select *
from users
where user_id in (select user_id from user_purchase
group by user_id); ===> it takes about 0.02 sec
Question
When using the "IN" operator it stably takes 0.02 sec; however, when using "EXISTS" it stably takes 0.04 sec or even longer. Why is IN faster when it has to do much more row scanning?
Explain plans:
mysql> EXPLAIN Select * from users where user_id IN (select user_id from user_purchase group by user_id);
+----+--------------+---------------+------------+--------+---------------+------------+---------+--------------------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+---------------+------------+--------+---------------+------------+---------+--------------------+-------+----------+-------------+
| 1 | SIMPLE | users | NULL | ALL | PRIMARY | NULL | NULL | NULL | 11000 | 100.00 | Using where |
| 1 | SIMPLE | <subquery2> | NULL | eq_ref | <auto_key> | <auto_key> | 5 | test.users.user_id | 1 | 100.00 | NULL |
| 2 | MATERIALIZED | user_purchase | NULL | index | user_id | user_id | 5 | NULL | 5000 | 100.00 | Using index |
+----+--------------+---------------+------------+--------+---------------+------------+---------+--------------------+-------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
mysql> EXPlain Select * FROM users as u where exists (select 1 from user_purchase as up where up.user_id = u.user_id);
+----+--------------------+-------+------------+------+---------------+---------+---------+----------------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+---------+---------+----------------+-------+----------+-------------+
| 1 | PRIMARY | u | NULL | ALL | NULL | NULL | NULL | NULL | 11000 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | up | NULL | ref | user_id | user_id | 5 | test.u.user_id | 1 | 100.00 | Using index |
+----+--------------------+-------+------------+------+---------------+---------+---------+----------------+-------+----------+-------------+
2 rows in set, 2 warnings (0.00 sec)
For the "IN" case it executes the subquery first and keeps its result in memory. That's what MATERIALIZED stands for in the EXPLAIN output.
From the docs
Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory
The "EXISTS" version, on the other hand, performs the subquery for each set of distinct values from the parent query.
From the docs
For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context
Technically, the subqueries in the IN and EXISTS clauses are different.
In the "IN" version you have
select user_id from user_purchase group by user_id
so it's enough to execute it once and keep the result in memory,
but in the EXISTS version
select 1 from user_purchase as up where up.user_id = u.user_id
the WHERE clause says "execute this query for every distinct value of user_id from the parent query".
That's why EXISTS takes longer.
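If the EXISTS form turns out to be the bottleneck, one common rewrite (a sketch using the tables from the experiment above; not guaranteed to win on every MySQL version, so check EXPLAIN) expresses the same semi-join as a plain JOIN plus DISTINCT, which lets the optimizer choose the join order freely:

```sql
-- Equivalent semi-join as an explicit JOIN; DISTINCT collapses users
-- who appear in user_purchase more than once.
SELECT DISTINCT u.*
FROM users AS u
JOIN user_purchase AS up
  ON up.user_id = u.user_id;
```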
Related
Consider the following data in the table of books:
bId serial
1 123
2 234
5 445
9 556
There's another table of missing_books with a latest_known_serial whose values come from the following query:
UPDATE missing_books mb
SET latest_known_serial = (
SELECT serial FROM books b
WHERE b.bId < mb.bId
ORDER BY b.bId DESC LIMIT 1)
The aforementioned query produces the following:
bId latest_known_serial
3 234
4 234
6 445
7 445
8 445
It all works, but I was wondering if there's a more performant way to do this, as it actually hits big tables.
You can improve performance by using indexes to make your query faster. I tried to simulate your query:
mysql> EXPLAIN UPDATE missing_books mb
-> SET latest_known_serial = (
-> SELECT serial FROM books b
-> WHERE b.bId < mb.bId
-> ORDER BY b.bId DESC LIMIT 1);
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------------------+
| 1 | UPDATE | mb | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 100.00 | NULL |
| 2 | DEPENDENT SUBQUERY | b | NULL | ALL | bId | NULL | NULL | NULL | 5 | 33.33 | Range checked for each record (index map: 0x1); Using filesort |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------------------+
2 rows in set, 2 warnings (0.00 sec)
As you can see in the plan above, MySQL uses a full table scan (type: ALL) to perform the operation: the optimizer didn't choose to use the unique index defined on the bId column.
Now let's make bId a primary key instead of a unique index, then run EXPLAIN again to see the result:
Drop the unique index first:
mysql> ALTER TABLE books DROP INDEX bId;
Query OK, 0 rows affected (0.00 sec)
Records: 0 Duplicates: 0 Warnings: 0
Then define the primary key on the bId column:
mysql> ALTER TABLE books
ADD PRIMARY KEY (bId);
Now test again:
mysql> EXPLAIN UPDATE missing_books mb SET latest_known_serial = ( SELECT serial FROM books b WHERE b.bId < mb.bId ORDER BY b.bId DESC LIMIT 1);
+----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------+
| 1 | UPDATE | mb | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 100.00 | NULL |
| 2 | DEPENDENT SUBQUERY | b | NULL | index | PRIMARY | PRIMARY | 4 | NULL | 1 | 33.33 | Using where; Backward index scan |
+----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------+
2 rows in set, 2 warnings (0.00 sec)
As you can see in the key column, the optimizer now uses the PK index defined on the books table! You can verify the speed difference by making small adjustments and timing the query.
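One further adjustment worth trying (purely a sketch; whether it helps depends on the storage engine and data, so measure before and after): since the subquery reads only bId and serial, a composite secondary index can make the scan covering, which is sometimes cheaper to walk than the clustered primary key:

```sql
-- Hypothetical covering index: the subquery touches only bId and serial,
-- so an index containing both can satisfy it without reading full rows.
ALTER TABLE books ADD INDEX bId_serial (bId, serial);
```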
I have a working, nicely indexed SQL query aggregating notes (a sum of ints) for all my users, among other things. This is "query A".
I want to use these aggregated notes in other queries, say "query B".
If I create a view based on "query A", will the indexes of the original query be used when needed if I join it in "query B"?
Is that true for MySQL? For other flavors of SQL?
Thanks.
In MySQL, you cannot create an index on a view. MySQL uses indexes of the underlying tables when you query data against the views that use the merge algorithm. For the views that use the temptable algorithm, indexes are not utilized when you query data against the views.
https://www.percona.com/blog/2007/08/12/mysql-view-as-performance-troublemaker/
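As a sketch of that distinction (assuming a table t(userid, note) like the one in the demo below): you can request the merge algorithm explicitly, but MySQL silently falls back to a temporary table when the view's definition prevents merging (GROUP BY, aggregate functions, DISTINCT, UNION, and so on):

```sql
-- A view MySQL can merge: queries against it use the indexes on t directly.
CREATE ALGORITHM = MERGE VIEW v_merge AS
  SELECT userid, note FROM t WHERE note IS NOT NULL;

-- GROUP BY prevents merging: ALGORITHM = MERGE is ignored (MySQL raises a
-- warning) and the view is materialized into a temporary table instead.
CREATE ALGORITHM = MERGE VIEW v_temp AS
  SELECT userid, SUM(note) AS total FROM t GROUP BY userid;
```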
Here's a demo table. It has a userid attribute column and a note column.
mysql> create table t (id serial primary key, userid int not null, note int, key(userid,note));
If you do an aggregation to get the sum of note per userid, it does an index-scan on (userid, note).
mysql> explain select userid, sum(note) from t group by userid;
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
| 1 | SIMPLE | t | index | userid | userid | 9 | NULL | 1 | Using index |
+----+-------------+-------+-------+---------------+--------+---------+------+------+-------------+
1 row in set (0.00 sec)
If we create a view for the same query, then we can see that querying the view uses the same index on the underlying table. Views in MySQL are pretty much like macros — they just query the underlying table.
mysql> create view v as select userid, sum(note) from t group by userid;
Query OK, 0 rows affected (0.03 sec)
mysql> explain select * from v;
+----+-------------+------------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | NULL |
| 2 | DERIVED | t | index | userid | userid | 9 | NULL | 1 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+------+------+-------------+
2 rows in set (0.00 sec)
So far so good.
Now let's create a table to join with the view, and join to it.
mysql> create table u (userid int primary key, name text);
Query OK, 0 rows affected (0.09 sec)
mysql> explain select * from v join u using (userid);
+----+-------------+------------+-------+---------------+-------------+---------+---------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+-------------+---------+---------------+------+-------------+
| 1 | PRIMARY | u | ALL | PRIMARY | NULL | NULL | NULL | 1 | NULL |
| 1 | PRIMARY | <derived2> | ref | <auto_key0> | <auto_key0> | 4 | test.u.userid | 2 | NULL |
| 2 | DERIVED | t | index | userid | userid | 9 | NULL | 1 | Using index |
+----+-------------+------------+-------+---------------+-------------+---------+---------------+------+-------------+
3 rows in set (0.01 sec)
I tried to use hints like straight_join to force it to read v then join to u.
mysql> explain select * from v straight_join u on (v.userid=u.userid);
+----+-------------+------------+-------+---------------+--------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+------+------+----------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 7 | NULL |
| 1 | PRIMARY | u | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | t | index | userid | userid | 9 | NULL | 7 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+------+------+----------------------------------------------------+
"Using join buffer (Block Nested Loop)" is MySQL's terminology for "no index used for the join." It's just looping over the table the hard way -- by reading batches of rows from start to finish of the table.
I tried to use force index to tell MySQL that type=ALL is to be avoided.
mysql> explain select * from v straight_join u force index(PRIMARY) on (v.userid=u.userid);
+----+-------------+------------+--------+---------------+---------+---------+----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+----------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 7 | NULL |
| 1 | PRIMARY | u | eq_ref | PRIMARY | PRIMARY | 4 | v.userid | 1 | NULL |
| 2 | DERIVED | t | index | userid | userid | 9 | NULL | 7 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+----------+------+-------------+
Maybe this is using an index for the join? But it's weird that table u is before table t in the EXPLAIN. I'm frankly not sure how to understand what it's doing, given the order of rows in this EXPLAIN report. I would expect the joined table should come after the primary table of the query.
I only put a few rows of data into each table. One might get some different EXPLAIN results with a larger representative sample of test data. I'll leave that to you to try.
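If the optimizer keeps materializing the view and the join stays slow at scale, one workaround (purely a sketch; the table and column names here are hypothetical) is a real summary table that you refresh yourself, which gives later joins a genuine index to use:

```sql
-- Hypothetical summary table kept alongside t: precompute the per-user
-- sums so "query B" can join against an indexed, persistent table.
CREATE TABLE user_note_totals (
  userid INT PRIMARY KEY,
  total  BIGINT NOT NULL
);

-- Refresh (on a schedule, or from application code after writes):
INSERT INTO user_note_totals (userid, total)
SELECT userid, SUM(note) FROM t GROUP BY userid
ON DUPLICATE KEY UPDATE total = VALUES(total);
```

The trade-off is staleness between refreshes, in exchange for predictable, index-backed join performance.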
We have a small mobile app that keeps sending the locations of teams working in the field. We have a web-based admin panel to see the last location of each team in the field; there are 8-10 teams.
Now the table that saves the locations is getting big (around 800K records) and it is taking about 10 seconds to get the info from the db.
We cannot simply remove the old records, as we want to keep the history of the teams' visits to different locations.
In the view, we are using the following SQL query in our admin panel:
SELECT w.ID, w.DaynTime, team_Desc, co_Nome, w.team_Lat, w.team_Long
FROM ( SELECT MAX(ID) AS maxID FROM VlocationTab GROUP BY UserID) AS aux
INNER JOIN VlocationTab AS w ON w.ID = aux.maxID;
Here is the create statement
CREATE TABLE `TableName` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`UserID` varchar(15) NOT NULL,
`Lat` double(8,6) NOT NULL,
`Long` double(8,6) NOT NULL,
`DayTime` datetime NOT NULL,
`User` varchar(15) DEFAULT NULL,
`Date` datetime DEFAULT NULL,
`AUser` varchar(15) DEFAULT NULL,
`ADate` datetime DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `DataTime` (`DayTime`),
KEY `Coordenates` (`Lat`,`Long`)
) ENGINE=MyISAM AUTO_INCREMENT=1040384 DEFAULT CHARSET=utf8;
Is there anyway to optimize this query to minimize the execution time please ?
I populated a test table with 1,000,000 rows (1,000 users and 1,000 rows per user).
Here is the initial plan:
mysql> explain SELECT w.ID, w.DayTime, User, Lat, `Long` FROM ( SELECT MAX(ID) AS maxID FROM TableName GROUP BY UserID) AS aux INNER JOIN TableName AS w ON w.ID = aux.maxID;
+----+-------------+------------+------------+--------+---------------+---------+---------+-----------+---------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------+---------+---------+-----------+---------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 1000000 | 100.00 | Using where |
| 1 | PRIMARY | w | NULL | eq_ref | PRIMARY | PRIMARY | 4 | aux.maxID | 1 | 100.00 | NULL |
| 2 | DERIVED | TableName | NULL | ALL | NULL | NULL | NULL | NULL | 1000000 | 100.00 | Using temporary; Using filesort |
+----+-------------+------------+------------+--------+---------------+---------+---------+-----------+---------+----------+---------------------------------+
3 rows in set, 1 warning (0.00 sec)
mysql> SELECT count(*) FROM ( SELECT MAX(ID) AS maxID FROM TableName GROUP BY UserID) AS aux INNER JOIN TableName AS w ON w.ID = aux.maxID;
+----------+
| count(*) |
+----------+
| 1000 |
+----------+
1 row in set (1.07 sec)
Your subquery
SELECT MAX(ID) AS maxID FROM TableName GROUP BY UserID
cannot use any index, so MySQL does a full scan to find max(ID) per user, then joins on the primary key.
I added an index on the two columns UserID and ID. Since the index is ordered, it allows the server to read max(ID) per user directly:
mysql> alter table TableName add index UserID_ID(UserID,ID);
Query OK, 1000000 rows affected (10.60 sec)
Records: 1000000 Duplicates: 0 Warnings: 0
New plan and timing:
mysql> explain SELECT w.ID, w.DayTime, User, Lat, `Long` FROM ( SELECT MAX(ID) AS maxID FROM TableName GROUP BY UserID) AS aux INNER JOIN TableName AS w ON w.ID = aux.maxID;
+----+-------------+------------+------------+--------+---------------+-----------+---------+-----------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------+-----------+---------+-----------+------+----------+--------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 1001 | 100.00 | Using where |
| 1 | PRIMARY | w | NULL | eq_ref | PRIMARY | PRIMARY | 4 | aux.maxID | 1 | 100.00 | NULL |
| 2 | DERIVED | TableName | NULL | range | UserID_ID | UserID_ID | 47 | NULL | 1001 | 100.00 | Using index for group-by |
+----+-------------+------------+------------+--------+---------------+-----------+---------+-----------+------+----------+--------------------------+
3 rows in set, 1 warning (0.00 sec)
mysql> SELECT count(*) FROM ( SELECT MAX(ID) AS maxID FROM TableName GROUP BY UserID) AS aux INNER JOIN TableName AS w ON w.ID = aux.maxID;
+----------+
| count(*) |
+----------+
| 1000 |
+----------+
1 row in set (0.04 sec)
PS: But the best way is to rewrite your query to filter on date first, for example keeping only rows younger than one day.
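As a sketch of that suggestion (assuming every team reports at least once a day; column names taken from the test table above): restrict the derived table by DayTime before grouping, so the KEY `DataTime` (DayTime) index from the CREATE TABLE can prune the scan. Whether this beats the composite UserID_ID index depends on your data, so time both variants:

```sql
-- Only consider rows from the last 24 hours before taking the
-- per-user maximum; the DayTime index can narrow the scan first.
SELECT w.ID, w.DayTime, w.User, w.Lat, w.`Long`
FROM ( SELECT MAX(ID) AS maxID
       FROM TableName
       WHERE DayTime >= NOW() - INTERVAL 1 DAY
       GROUP BY UserID ) AS aux
INNER JOIN TableName AS w ON w.ID = aux.maxID;
```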
I needed to get values from the "latest" (i.e. highest record id) record for each value of a field (server_name in this case).
I had already added a server_name_id index on server_name and id.
My first attempt took minutes to run.
SELECT server_name, state
FROM replication_client as a
WHERE id = (
SELECT MAX(id)
FROM replication_client
WHERE server_name = a.server_name)
ORDER BY server_name
My second attempt took 0.001s to run.
SELECT rep.server_name, state FROM (
SELECT server_name, MAX(id) AS max_id
FROM replication_client
GROUP BY server_name) AS newest,
replication_client AS rep
WHERE rep.id = newest.max_id
ORDER BY server_name
What is the principle behind this optimisation? (I'd like to be able to write optimised queries without trial and error.)
P.S. Explained below:
mysql> EXPLAIN
->
-> SELECT server_name, state
-> FROM replication_client as a
-> WHERE id = (SELECT MAX(id) FROM replication_client WHERE server_name = a.server_name)
-> ORDER BY server_name
-> ;
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
| 1 | PRIMARY | a | ALL | NULL | NULL | NULL | NULL | 630711 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | replication_client | ref | server_name_id | server_name_id | 18 | mrg.a.server_name | 45050 | Using index |
+----+--------------------+--------------------+------+----------------+----------------+---------+-------------------+--------+-----------------------------+
mysql> explain
-> SELECT rep.server_name, state FROM (
-> SELECT server_name, MAX(id) AS max_id
-> FROM replication_client
-> GROUP BY server_name) AS newest,
-> replication_client AS rep
-> WHERE rep.id = newest.max_id
-> ORDER BY server_name
-> ;
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using temporary; Using filesort |
| 1 | PRIMARY | rep | eq_ref | PRIMARY | PRIMARY | 4 | newest.max_id | 1 | |
| 2 | DERIVED | replication_client | range | NULL | server_name_id | 18 | NULL | 15 | Using index for group-by |
+----+-------------+--------------------+--------+---------------+----------------+---------+---------------+------+---------------------------------+
Well, the whole thing is quite self-explanatory when you look at two words in your first explain plan: DEPENDENT SUBQUERY
This means that for every row your WHERE condition examines, the subquery is executed. Of course this can be slow as hell.
Also note that there's a logical order of operations when executing a query:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
ORDER BY clause
SELECT clause
When you can filter in the FROM clause, it's better than filtering in the WHERE clause...
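Concretely, in terms of the two queries above (nothing new here, just the shape of the principle): the slow version places the per-server MAX where it must be re-evaluated for every outer row, while the fast one hoists it into the FROM clause so it runs exactly once:

```sql
-- Slow shape: correlated subquery in WHERE, re-run per outer row.
--   WHERE id = (SELECT MAX(id) FROM replication_client
--               WHERE server_name = a.server_name)

-- Fast shape: aggregate once in FROM, then one indexed join on the PK.
SELECT rep.server_name, rep.state
FROM (SELECT server_name, MAX(id) AS max_id
      FROM replication_client
      GROUP BY server_name) AS newest
JOIN replication_client AS rep ON rep.id = newest.max_id
ORDER BY rep.server_name;
```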
My db tables are growing very fast (and will continue to), and at this time I have a problem with this query (well, with others too):
select user_id from post where user_id not in (select id from user)
What I need is the new ids that are in the post table but not in the user table.
Here is the explain:
> mysql> explain select user_id from post where user_id not in (select
> id from user);
>
+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref |rows | Extra |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+
| 1 | PRIMARY | post | ALL | NULL |NULL | NULL | NULL | 16076920 | Using where |
| 2 | DEPENDENT SUBQUERY | user | unique_subquery | PRIMARY | PRIMARY | 8 | func | 1 | Using index |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+----------+-------------+
I have also tried this:
SELECT p.user_id FROM post p LEFT JOIN user u ON p.user_id=u.id WHERE u.id IS NULL;
The explain:
mysql> EXPLAIN SELECT p.user_id FROM post p LEFT JOIN user u ON p.user_id=u.id WHERE u.id IS NULL;
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+
| 1 | SIMPLE | p | ALL | NULL | NULL | NULL | NULL | 14323335 | |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 8 | ghost.p.user_id | 1 | Using where; Using index; Not exists |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+----------+--------------------------------------+
Both queries have to scan the entire post table, and it is huge:
post table: 16077899 rows
user table: 9657158 rows
The query takes several minutes (more than 30 min) to run. Any tips?
Thanks!
I think you should do two things:
Make sure you have an index on post (user_id).
Add the DISTINCT keyword to your query, like this:
SELECT DISTINCT user_id
FROM post
WHERE user_id NOT IN (SELECT id FROM user)
Check the new EXPLAIN PLAN.
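A sketch of both suggestions together (the index name is chosen arbitrarily here); the LEFT JOIN variant from the question benefits from the same index, since either way the scan of post can then walk a narrow index instead of full rows:

```sql
-- Step 1: index the join column so the scan of post reads only the index.
ALTER TABLE post ADD INDEX idx_post_user_id (user_id);

-- Step 2: deduplicate while anti-joining against user.
SELECT DISTINCT p.user_id
FROM post AS p
LEFT JOIN user AS u ON u.id = p.user_id
WHERE u.id IS NULL;
```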
I am no expert in databases, but your first query's EXPLAIN didn't use an index on user_id. Do you have an index on the user_id field in the post table? If not, just create one. Also, you can try GROUP BY/DISTINCT to filter the users, as the first query will otherwise return duplicate user_ids from post. This should speed everything up.