SQL order of execution for correlated subquery - mysql

I have the following Personnel table:
+---------+----------+-------------+
| name    | dept_nbr | job_title   |
+---------+----------+-------------+
| Michael |       14 | Programmer  |
| Kumar   |       14 | Programmer  |
| Dave    |       14 | Programmer  |
| Jane    |       14 | Manager     |
| Carol   |       37 | Programmer  |
| Joe     |       37 | Programmer  |
| John    |       59 | CEO         |
+---------+----------+-------------+
Problem: Find all dept_nbr's (departments) that have fewer than 3 programmers.
Working query:
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE (SELECT COUNT(P2.dept_nbr)
       FROM Personnel AS P2
       WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer') < 3;
Result:
37
59
Notes:
Department 14 is correctly not included as it has 3 programmers (3 is equal to but not fewer than 3). Department 59 has zero programmers, and is also correctly included in the results.
My question:
When the above query executes, how does a generic SQL engine proceed? From what I have read, SQL execution order is (roughly): From, Where, Group By, Having, and Select. So, is the following correct?
1 - The Outer Query passes each row of the Personnel table as P1 into the Inner query.
2.a - The Inner Query scans the entire Personnel table as P2, row by row, looking for rows that satisfy the condition "P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'".
2.b – Once the Inner Query is done with the entire table, it COUNTs the matching dept_nbr values and returns it to the Outer Query.
3 – In the Outer Query, if the count returned from the Inner Query satisfies the condition "WHERE (Inner Query Count Result) < 3", the corresponding dept_nbr for the P1 row is SELECTed.
4 – Following all rows processed by the Outer Query, the Outer Query does a DISTINCT on the results and displays the unique dept_nbr values.
Is my understanding above correct? Specifically, does the outer query do the DISTINCT at the very end (step #4)? It seems that in this way, the inner query does redundant scanning (for example, it processes dept_nbr = 14 four times, when it really has the answer in the first pass).
I tested the above query on sqlfiddle.com w/ MySQL 5.6.

When the above query executes, how does a generic SQL engine proceed?
From what I have read, SQL execution order is (roughly): From, Where,
Group By, Having, and Select.
This statement is -- generally -- not correct. SQL is parsed in the order that you describe. However, the execution is determined by the optimizer and might have little to do with the original query. Remember: SQL is a descriptive language, not a procedural language. It describes the result set, not the specific steps for calculating it.
That said, MySQL's execution plan is much closer to the query than most other databases (particularly more advanced databases with better optimizers). And, almost any database is going to proceed in the steps you describe for this query. The aggregation in the subquery limits the choices for optimization.
If you want to eliminate the redundancy, then do the select distinct before the filtering:
SELECT dept_nbr
FROM (SELECT DISTINCT dept_nbr FROM Personnel P1) P1
WHERE (SELECT COUNT(P2.dept_nbr)
       FROM Personnel AS P2
       WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'
      ) < 3;
You can also do this more simply with just an aggregation:
select dept_nbr
from personnel
group by dept_nbr
having sum(job_title = 'Programmer') < 3;
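Note that sum(job_title = 'Programmer') relies on MySQL evaluating the boolean comparison as 1 or 0; it is not standard SQL. If you ever need a portable version, a sketch of the same idea using a conditional count (COUNT ignores NULLs, so only the 'Programmer' rows are counted):
select dept_nbr
from personnel
group by dept_nbr
having count(case when job_title = 'Programmer' then 1 end) < 3;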

Add EXPLAIN (or EXPLAIN EXTENDED) before your query and it should give you the explain plan which will detail exactly the steps in order of your query. This is a very useful tool when trying to optimize queries.
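For example, for the correlated-subquery version above (the plan output will vary with MySQL version, indexes, and data; the column to watch is select_type, which typically shows DEPENDENT SUBQUERY for a correlated subquery):
EXPLAIN
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE (SELECT COUNT(P2.dept_nbr)
       FROM Personnel AS P2
       WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer') < 3;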

Related

Duplicate date condition in MySQL Query with different date ranges [duplicate]

I have a set of conditions in my where clause like
WHERE
d.attribute3 = 'abcd*'
AND x.STATUS != 'P'
AND x.STATUS != 'J'
AND x.STATUS != 'X'
AND x.STATUS != 'S'
AND x.STATUS != 'D'
AND CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
Which of these conditions will be executed first? I am using Oracle.
Will I get these details in my execution plan?
(I do not have the authority to do that in the db here, else I would have tried)
Are you sure you "don't have the authority" to see an execution plan? What about using AUTOTRACE?
SQL> set autotrace on
SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where emp.ename like 'K%'
4 and dept.loc like 'l%'
5 /
no rows selected
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 62 | 4 (0)|
| 1 | NESTED LOOPS | | 1 | 62 | 4 (0)|
|* 2 | TABLE ACCESS FULL | EMP | 1 | 42 | 3 (0)|
|* 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)|
|* 4 | INDEX UNIQUE SCAN | SYS_C0042912 | 1 | | 0 (0)|
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")
As you can see, that gives quite a lot of detail about how the query will be executed. It tells me that:
the condition "emp.ename like 'K%'" will be applied first, on the full scan of EMP
then the matching DEPT records will be selected via the index on dept.deptno (via the NESTED LOOPS method)
finally the filter "dept.loc like 'l%'" will be applied.
This order of application has nothing to do with the way the predicates are ordered in the WHERE clause, as we can show with this re-ordered query:
SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where dept.loc like 'l%'
4 and emp.ename like 'K%';
no rows selected
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 62 | 4 (0)|
| 1 | NESTED LOOPS | | 1 | 62 | 4 (0)|
|* 2 | TABLE ACCESS FULL | EMP | 1 | 42 | 3 (0)|
|* 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)|
|* 4 | INDEX UNIQUE SCAN | SYS_C0042912 | 1 | | 0 (0)|
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")
The database will decide what order to execute the conditions in.
Normally (but not always) it will use an index first where possible.
As has been said, looking at the execution plan will give you some information. However, unless you use the plan stability feature, you can't rely on the execution plan always remaining the same.
In the case of the query you posted, it doesn't look like the order of evaluation will change the logic in any way, so I guess what you are thinking about is efficiency. It's fairly likely that the Oracle optimizer will choose a plan that is efficient.
There are tricks you can do to encourage a particular ordering if you want to compare the performance with base query. Say for instance that you wanted the timestamp condition to be executed first. You could do this:
WITH subset AS
     ( SELECT /*+ materialize */ x.*
       FROM my_table x
       WHERE CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
     )
SELECT *
FROM subset x
WHERE x.attribute3 = 'abcd*'
AND x.STATUS != 'P'
AND x.STATUS != 'J'
AND x.STATUS != 'X'
AND x.STATUS != 'S'
AND x.STATUS != 'D'
The "materialize" hint should cause the optimizer to execute the inline query first, then scan that result set for the other conditions.
I'm not advising you do this as a general habit. In most cases just writing the simple query will lead to the best execution plans.
To add to the other comments on execution plans: under the CPU-based costing model introduced in 9i and used by default in 10g+, Oracle will also assess which predicate evaluation order results in lower computational cost, even if that does not affect the table access order and method. If executing one predicate before another results in fewer predicate calculations being executed, then that optimisation can be applied.
See this article for more details: http://www.oracle.com/technology/pub/articles/lewis_cbo.html
Furthermore, Oracle doesn't even have to execute predicates where comparison with a check constraint or partition definitions indicates that no rows would be returned anyway.
Complex stuff.
Finally, relational database theory says that you can never depend on the order of execution of the query clauses, so best not to try. As others have said, the cost-based optimizer tries to choose what it thinks is best, but even viewing explain plan won't guarantee the actual order that's used. Explain plan just tells you what the CBO recommends, but that's still not 100%.
Maybe if you explain why you're trying to do this, some could suggest a plan?
Tricky question. I just faced the same dilemma. I need to call a function within a query. The function itself runs another query, so you can see how that affects performance in general. But in most of the cases we have, the function wouldn't be called as often if the rest of the conditions were evaluated first.
Well, I thought it would be useful to post another article on the topic here.
The following quote is copied from Donald Burleson's site (http://www.dba-oracle.com/t_where_clause.htm) .
The ordered_predicates hint is specified in the Oracle WHERE clause of
a query and is used to specify the order in which Boolean predicates
should be evaluated.
In the absence of ordered_predicates, Oracle uses
the following steps to evaluate the order of SQL predicates:
Subqueries are evaluated before the outer Boolean conditions in the WHERE clause.
All Boolean conditions without built-in functions or subqueries are evaluated in reverse from the order they are found in the WHERE
clause, with the last predicate being evaluated first.
Boolean predicates with built-in functions of each predicate are evaluated in increasing order of their estimated evaluation costs.
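As a concrete illustration of the quoted description only (I have not verified this against a live Oracle instance, and my_table plus its columns are placeholders borrowed from the question), the hint sits directly inside the WHERE clause:
SELECT *
FROM my_table x
WHERE /*+ ordered_predicates */
      CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
  AND x.STATUS != 'P'
  AND x.STATUS != 'J'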

what is the execution order of group by and aggregation function in case statement in mysql?

Here is the SQL problem.
Table: Countries
+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| country_id    | int     |
| country_name  | varchar |
+---------------+---------+
country_id is the primary key for this table.
Each row of this table contains the ID and the name of one country.
Table: Weather
+---------------+------+
| Column Name   | Type |
+---------------+------+
| country_id    | int  |
| weather_state | int  |
| day           | date |
+---------------+------+
(country_id, day) is the primary key for this table.
Each row of this table indicates the weather state in a country for one day.
Write an SQL query to find the type of weather in each country for November 2019.
The type of weather is:
Cold if the average weather_state is less than or equal to 15,
Hot if the average weather_state is greater than or equal to 25, and
Warm otherwise.
Return result table in any order.
One of the MySQL solutions is as follows:
SELECT country_name,
       CASE WHEN AVG(weather_state) <= 15 THEN 'Cold'
            WHEN AVG(weather_state) >= 25 THEN 'Hot'
            ELSE 'Warm'
       END AS weather_type
FROM Weather w
JOIN Countries c
  ON w.country_id = c.country_id
 AND LEFT(w.day, 7) = '2019-11'
GROUP BY w.country_id
How does the "case when AVG(weather_state)" get executed, if the group by gets executed after the select statement?
How does the "case when AVG(weather_state)" get executed, if the group by gets executed after the select statement?
AVG(weather_state) computes the per-group average of column weather_state. It and other aggregate functions can be used in a select clause, from which you can conclude that the grouping defined by a group by clause must be visible in the context where the select clause is evaluated. In this sense, at least, group by gets executed before select. Pretty much everything else does too.
It is possible for an aggregate query to be identifiable only from the select clause. In such cases, the select clause needs to be parsed before it is known that grouping (all rows into a single group) is to be performed. This is the closest I can think of to the execution-order claim you asserted, but it is not at all well characterized as group by being executed after select.
MySQL's implementation details surely present a more complicated picture, but the fact remains that MySQL does provide correct SQL semantics in this regard. Therefore, even if you look at the details, they cannot reasonably be characterized as executing the group by after the select. Whoever told you that was wrong, or at least their lesson was very misleading, or else you misunderstood them.
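One way to make that ordering visible in the query text itself is to compute the per-country average in a derived table first and apply the CASE only in the outer select; a sketch that should be equivalent to the posted solution:
SELECT c.country_name,
       CASE WHEN w.avg_state <= 15 THEN 'Cold'
            WHEN w.avg_state >= 25 THEN 'Hot'
            ELSE 'Warm'
       END AS weather_type
FROM (SELECT country_id, AVG(weather_state) AS avg_state
      FROM Weather
      WHERE LEFT(day, 7) = '2019-11'
      GROUP BY country_id) w
JOIN Countries c ON c.country_id = w.country_id;
Here the grouping and the AVG() are clearly finished before the outer SELECT ever runs, which mirrors the logical order described above.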

sql self join without equal operator on column

Post the problem and SQL solution which works. My confusion is, when I am doing self join in the past, there is always some equal value (equal operator) in columns to join, but in below example, it seems self join could work without using equal operator? In my below example, using minus operator and >, no equal operator to specify which columns used to join.
Wondering if no equal operator, how did underlying self join works in my example?
Problem,
Given a Weather table, write a SQL query to find all dates' Ids with higher temperature compared to its previous (yesterday's) dates.
+---------+------------+------------------+
| Id(INT) | Date(DATE) | Temperature(INT) |
+---------+------------+------------------+
|       1 | 2015-01-01 |               10 |
|       2 | 2015-01-02 |               25 |
|       3 | 2015-01-03 |               20 |
|       4 | 2015-01-04 |               30 |
+---------+------------+------------------+
For example, return the following Ids for the above Weather table:
+----+
| Id |
+----+
|  2 |
|  4 |
+----+
SQL solution,
select W1.Id
from Weather W1, Weather W2
where TO_DAYS(W1.Date)-TO_DAYS(W2.Date) = 1 and W1.Temperature > W2.Temperature
Writing it using an ANSI join, since they're a standard part of SQL:
select W1.Id
from Weather W1
     inner join
     Weather W2
       on TO_DAYS(W1.Date)-TO_DAYS(W2.Date) = 1 and
          W1.Temperature > W2.Temperature
(Should produce an identical result set)
A join is just the process of matching up two sets of rows - you have a row source on the "left" and a row source on the "right" of the join. In trivial cases, these row sources are tables, but a join may also join the results of any previous joins as the row sources.
In theory, in the join, the result would be a cartesian product - every row on the left would be matched with every row on the right. If this is what you want, you can indicate this with CROSS JOIN.
Usually, however, we want to restrict the result of the join to less than the cartesian product of the rows. And we express those restrictions by writing an ON clause (or in the WHERE clause in your example using the old-style comma join).
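For the question's query specifically, the old-style comma join is exactly that: a cross join whose cartesian product is then filtered by the WHERE clause. A sketch of the same statement written explicitly:
SELECT W1.Id
FROM Weather W1
CROSS JOIN Weather W2
WHERE TO_DAYS(W1.Date) - TO_DAYS(W2.Date) = 1
  AND W1.Temperature > W2.Temperature;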
The most common type of join is an equijoin, where one or more columns on each side are compared for equality. But that is by no means required. It can be any predicates that make sense. E.g. one form of join that I employ semi-regularly I described as a "triangle join" (by no means standard terminology) where every row is matched with every row that comes later:
SELECT *
FROM Table t1
     left join
     Table t2
       on t1.ID < t2.ID
And that's perfectly fine. The row with the lowest ID in Table will be matched with every other row in the table. The row with the highest ID value will not be matched with any other rows.
That's called an "implicit join" - I suggest you read up on SQL JOIN, for example at https://en.wikipedia.org/wiki/Join_(SQL).
In short: the tables are simply listed with commas, and the database matches rows using whatever join conditions you put in the WHERE clause, without requiring an explicit JOIN ... ON clause.

SQL vs MySQL: Rules about aggregate operations and GROUP BY

In this book I'm currently reading while following a course on databases, the following example of an illegal query using an aggregate operator is given:
Find the name and age of the oldest sailor.
Consider the following attempt to answer this query:
SELECT S.sname, MAX(S.age)
FROM Sailors S
The intent is for this query to return not only the maximum age but
also the name of the sailors having that age. However, this query is
illegal in SQL--if the SELECT clause uses an aggregate operation, then
it must use only aggregate operations unless the query contains a GROUP BY clause!
Some time later while doing an exercise using MySQL, I faced a similar problem, and made a mistake similar to the one mentioned. However, MySQL didn't complain and just spit out some tables which later turned out not to be what I needed.
Is the query above really illegal in SQL, but legal in MySQL, and if so, why is that?
In what situation would one need to make such a query?
Further elaboration of the question:
The question isn't about whether or not all attributes mentioned in a SELECT should also be mentioned in a GROUP BY.
It's about why the above query, using atributes together with aggregate operations on attributes, without any GROUP BY is legal in MySQL.
Let's say the Sailors table looked like this:
+----------+------+
| sname    | age  |
+----------+------+
| John Doe |   30 |
| Jane Doe |   50 |
+----------+------+
The query would then return:
+----------+------------+
| sname    | MAX(S.age) |
+----------+------------+
| John Doe |         50 |
+----------+------------+
Now who would need that? John Doe ain't 50, he's 30!
As stated in the citation from the book, this is a first attempt to get the name and age of the oldest sailor, in this example, Jane Doe at the age of 50.
SQL would say this query is illegal, but MySQL just proceeds and spits out "garbage".
Who would need this kind of result?
Why does MySQL allow this little trap for newcomers?
By the way, this is the default MySQL behavior. But it can be changed by setting the ONLY_FULL_GROUP_BY SQL mode in the my.ini file or in the session -
SET sql_mode = 'ONLY_FULL_GROUP_BY';
SELECT * FROM sakila.film_actor GROUP BY actor_id;
Error: 'sakila.film_actor.film_id' isn't in GROUP BY
ONLY_FULL_GROUP_BY - Do not permit queries for which the select list refers to nonaggregated columns that are not named in the GROUP BY clause.
Is the query above really illegal in SQL, but legal in MySQL
Yes
if so, why is that
I don't know the reasons for the design decisions made in MySQL, but considering that you can get the actual related data from the same row(s) as the aggregate came from (e.g., MAX or MIN) with only slightly more work, I don't see any advantage in returning additional column data from arbitrary rows.
I strongly dislike this "feature" in MySQL and it trips up many people who learn aggregates on MySQL and then move to a different dbms, and suddenly realize they never quite knew what they were doing.
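For the sailor example, a sketch of that "slightly more work" version, which ties the name to the row(s) the maximum actually came from:
SELECT S.sname, S.age
FROM Sailors S
WHERE S.age = (SELECT MAX(S2.age) FROM Sailors S2);
With the sample data above this returns Jane Doe, 50, and it would return every sailor tied for the maximum age.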
Based on a link which a_horse_with_no_name provided in a comment, I have arrived at my own answer:
It seems that the MySQL way of using GROUP BY differs from the SQL way in order to permit leaving columns out of the GROUP BY clause when they are functionally dependent on other included columns anyway.
Lets say we have a table displaying the activity of a bank account.
It's not a very thought-out table, but it's the only one we have, and that will have to do.
Instead of keeping track of an amount, we imagine an account starts at '0', and all transactions against it are recorded instead, so the amount is the sum of the transactions. The table could look like this:
+------------+----------+-------------+
| costumerID | name     | transaction |
+------------+----------+-------------+
|       1337 | h4x0r    |         101 |
|         42 | John Doe |         500 |
|       1337 | h4x0r    |        -101 |
|         42 | John Doe |        -200 |
|         42 | John Doe |         500 |
|         42 | John Doe |        -200 |
+------------+----------+-------------+
It is clear that the 'name' is functionally dependent on the 'costumerID'.
(The other way around would also be possible in this example.)
What if we wanted to know the costumerID, name and current amount of each customer?
In such a situation, two very similar queries would both return the following correct result:
+------------+----------+--------+
| costumerID | name     | amount |
+------------+----------+--------+
|         42 | John Doe |    600 |
|       1337 | h4x0r    |      0 |
+------------+----------+--------+
This query can be executed in MySQL, and is legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID, name
This query can be executed in MySQL, and is NOT legal according to SQL.
SELECT costumerID, name, SUM(transaction) AS amount
FROM Activity
GROUP BY costumerID
The following line would make the query return an error instead, since it would now have to follow the SQL way of using aggregation operations and GROUP BY:
SET sql_mode = 'ONLY_FULL_GROUP_BY';
The argument for allowing the second query in MySQL seems to be that all columns mentioned in SELECT but not in GROUP BY are assumed to be either used inside an aggregate operation (the case with 'transaction') or functionally dependent on other included columns (the case with 'name'). In the case of 'name', we can be sure that the correct 'name' is chosen for every group entry, since it is functionally dependent on 'costumerID', and therefore there is only one possible name for each group of costumerID's.
This way of using GROUP BY seems flawed, though, since it doesn't do any further checks on what is left out of the GROUP BY clause. People can pick and choose columns from their SELECT statement to put in their GROUP BY clause as they see fit, even if it makes no sense to include or leave out any particular column.
The Sailor example illustrates this flaw very well.
When using aggregation operators (possibly in conjunction with GROUP BY), each group entry in the returned set has only one value for each of its columns. In the case of Sailors, since the GROUP BY clause is left out, the whole table is put into one single group entry. This entry needs a name and a maximum age. Choosing a maximum age for this entry is a no-brainer, since MAX(S.age) only returns one value. In the case of S.sname though, which is only mentioned in SELECT, there are now as many choices as there are unique sname's in the whole Sailors table (in this case two: John and Jane Doe). MySQL doesn't have any clue which to choose, we didn't give it any, and it didn't hit the brakes in time, so it just picks whatever comes first (John Doe). If the two rows were switched, it would actually give "the right answer" by accident.
It just seems plain dumb that something like this is allowed in MySQL: the result of a query using GROUP BY can potentially depend on the ordering of the table if something is left out of the GROUP BY clause. Apparently, that's just how MySQL rolls. But still, couldn't it at least have the courtesy of warning us when it has no clue what it's doing because of a "flawed" query? I mean, sure, if you give the wrong instructions to a program, it probably wouldn't (or shouldn't) do as you want, but if you give unclear instructions, I certainly wouldn't want it to just start guessing or pick whatever comes first... -_-'
MySQL allows this non-standard SQL syntax because there is at least one specific case in which it makes the SQL nominally easier to write. That case is when you're joining two tables which have a PRIMARY / FOREIGN KEY relationship (whether enforced by the database or not) and you want an aggregate value from the FOREIGN KEY side and multiple columns from the PRIMARY KEY side.
Consider a system with Customer and Orders tables. Imagine you want all the fields from the customer table along with the total of the Amount field from the Orders table. In standard SQL you would write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip
Notice the unwieldy GROUP BY clause, and imagine what it would look like if there were more columns you wanted from customer.
In MySQL, you could write:
SELECT C.CustomerID, C.FirstName, C.LastName, C.Address, C.City, C.State, C.Zip, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
or even (I think, I haven't tried it):
SELECT C.*, SUM(O.Amount)
FROM Customer C INNER JOIN Orders O ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerID
Much easier to write. In this particular case it's safe as well, since you know that only one row from the Customer table will contribute to each group (assuming CustomerID is PRIMARY or UNIQUE KEY).
Personally, I'm not a big fan of this exception to standard SQL syntax (since there are many cases where it's not safe to use this syntax and rely on getting values from any particular row in the group), but I can see where it makes certain kinds of queries easier and (in the case of my second MySQL example) possible.
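For contrast, a sketch of a case where the shortcut is not safe (OrderDate is an assumed column here): grouping Orders by CustomerID while also selecting a nonaggregated column that genuinely varies within each group. Without ONLY_FULL_GROUP_BY, MySQL will happily return OrderDate from an arbitrary row of each group:
SELECT O.CustomerID,
       O.OrderDate,       -- varies within the group, so the returned value is indeterminate
       SUM(O.Amount)
FROM Orders O
GROUP BY O.CustomerID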

Why does mysql decide that this subquery is dependent?

On a MySQL 5.1.34 server, I have the following perplexing situation:
mysql> explain select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| 1 | PRIMARY | ObjectValue | range | IX_ObjectValue_Timestamp,IX_ObjectValue_Timestamp_EventName | IX_ObjectValue_Timestamp_EventName | 9 | NULL | 541944 | Using where |
| 2 | DEPENDENT SUBQUERY | ObjectValue | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using index |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
2 rows in set (0.00 sec)
mysql> select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
Empty set (2 min 48.79 sec)
mysql> select count(*) FROM master.ObjectValue;
+----------+
| count(*) |
+----------+
| 35928440 |
+----------+
1 row in set (2 min 18.96 sec)
How can it take 3 minutes to examine 500000 records when it only takes 2 minutes to visit all records?
How can a subquery on a separate database be classified dependent?
What can I do to speed up this query?
UPDATE:
The actual query that took a long time was a DELETE, but you can't run EXPLAIN on those; the DELETE is also why I used a subselect in the first place. I have now read the documentation and found out about the "DELETE FROM t USING ..." syntax. Rewriting the query from:
DELETE FROM master.ObjectValue
WHERE timestamp < '2008-06-26 11:21:59'
AND id IN ( SELECT id FROM backup.ObjectValue ) ;
into:
DELETE FROM m
USING master.ObjectValue m INNER JOIN backup.ObjectValue b ON m.id = b.id
WHERE m.timestamp < '2008-04-26 11:21:59';
This reduced the time from minutes to 0.01 seconds for an empty backup.ObjectValue.
Thank you all for good advise.
The dependent subquery slows your outer query down to a crawl (I suppose you know it means it's run once per row found in the dataset being looked at).
You don't need the subquery there, and not using one will speed up your query quite significantly:
SELECT m.*
FROM master.ObjectValue m
JOIN backup.ObjectValue USING (id)
WHERE m.timestamp < '2008-06-26 11:21:59'
MySQL frequently treats subqueries as dependent even though they are not. I've never really understood the exact reasons for that - maybe it's simply because the query optimizer fails to recognize it as independent. I never bothered looking into it in more detail because in these cases you can virtually always move the subquery to the FROM clause, which fixes it.
For example:
DELETE FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10)
-- vs
DELETE m FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10)
The former will produce a dependent subquery and can be very slow. The latter will tell the optimizer to isolate the subquery, which avoids a table scan and makes the query run much faster.
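As another sketch of "moving it to the FROM clause" with the same placeholder tables (m, r, rid, and xid are the names used above), you can join against a derived table instead of using IN:
DELETE m
FROM m
JOIN (SELECT id FROM r WHERE r.xid = 10) AS sub
  ON m.rid = sub.id;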
Notice how it says there is only 1 row for the subquery? There is obviously more than 1 row. That is an indication that MySQL is loading only 1 row at a time. What MySQL is probably trying to do is "optimize" the subquery so that it only loads records in the subquery that also exist in the master query, a dependent subquery. This is how a join works, but the way you phrased your query you have forced a reversal of the optimized logic of a join.
You've told MySQL to load the backup table (subquery) and then match it against the filtered result of the master table, "timestamp < '2008-04-26 11:21:59'". MySQL determined that loading the entire backup table is probably not a good idea. So MySQL decided to use the filtered result of the master to filter the backup query, but the master query hasn't completed yet when trying to filter the subquery. So it needs to check as it loads each record from the master query. Thus your dependent subquery.
As others mentioned, use a join, it's the right way to go. Join the crowd.
How can it take 3 minutes to examine 500000 records when it only takes 2 minutes to visit all records?
COUNT(*) is always transformed to COUNT(1) in MySQL. So it doesn't even have to enter each record, and also, I would imagine that it uses in-memory indexes which speeds things up. And in the long-running query, you use range (<) and IN operators, so for each record it visits, it has to do extra work, especially since it recognizes the subquery as dependent.
How can a subquery on a separate database be classified dependent?
Well, it doesn't matter if it's in a separate database. A subquery is dependent if it depends on values from the outer query, which you could still do in your case... but you don't, so it is, indeed, strange that it's classified as a dependent subquery. Maybe it is just a bug in MySQL, and that's why it's taking so long - it executes the inner query for every record selected by the outer query.
What can I do to speed up this query?
To start with, try using JOIN instead:
SELECT master.*
FROM master.ObjectValue master
JOIN backup.ObjectValue backup
ON master.id = backup.id
AND master.timestamp < '2008-04-26 11:21:59';
The real answer is, don't use MySQL, its optimizer is rubbish. Switch to Postgres, it will save you time in the long run.
To everyone saying "use JOIN", that's just a nonsense perpetuated by the MySQL crowd who have refused for 10 years to fix this glaringly horrible bug.