Duplicate date condition in MySQL Query with different date ranges [duplicate] - mysql

I have a set of conditions in my where clause like
WHERE
d.attribute3 = 'abcd*'
AND x.STATUS != 'P'
AND x.STATUS != 'J'
AND x.STATUS != 'X'
AND x.STATUS != 'S'
AND x.STATUS != 'D'
AND CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
Which of these conditions will be executed first? I am using Oracle.
Will I get these details in my execution plan?
(I do not have the authority to do that in the db here, else I would have tried)

Are you sure you "don't have the authority" to see an execution plan? What about using AUTOTRACE?
SQL> set autotrace on
SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where emp.ename like 'K%'
4 and dept.loc like 'l%'
5 /
no rows selected
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 62 | 4 (0)|
| 1 | NESTED LOOPS | | 1 | 62 | 4 (0)|
|* 2 | TABLE ACCESS FULL | EMP | 1 | 42 | 3 (0)|
|* 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)|
|* 4 | INDEX UNIQUE SCAN | SYS_C0042912 | 1 | | 0 (0)|
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")
As you can see, that gives quite a lot of detail about how the query will be executed. It tells me that:
the condition "emp.ename like 'K%'" will be applied first, on the full scan of EMP
then the matching DEPT records will be selected via the index on dept.deptno (via the NESTED LOOPS method)
finally the filter "dept.loc like 'l%'" will be applied.
This order of application has nothing to do with the way the predicates are ordered in the WHERE clause, as we can show with this re-ordered query:
SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where dept.loc like 'l%'
4 and emp.ename like 'K%';
no rows selected
Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 62 | 4 (0)|
| 1 | NESTED LOOPS | | 1 | 62 | 4 (0)|
|* 2 | TABLE ACCESS FULL | EMP | 1 | 42 | 3 (0)|
|* 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)|
|* 4 | INDEX UNIQUE SCAN | SYS_C0042912 | 1 | | 0 (0)|
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")

The database will decide what order to execute the conditions in.
Normally (but not always) it will use an index first where possible.
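If you can't run AUTOTRACE as suggested above, plain EXPLAIN PLAN usually needs fewer privileges, since it only writes to a plan table and never executes the statement. A minimal sketch, reusing the EMP/DEPT example:
EXPLAIN PLAN FOR
SELECT *
FROM emp
JOIN dept ON dept.deptno = emp.deptno
WHERE emp.ename LIKE 'K%'
AND dept.loc LIKE 'l%';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);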

As has been said, looking at the execution plan will give you some information. However, unless you use the plan stability feature, you can't rely on the execution plan always remaining the same.
In the case of the query you posted, it doesn't look like the order of evaluation will change the logic in any way, so I guess what you are thinking about is efficiency. It's fairly likely that the Oracle optimizer will choose a plan that is efficient.
There are tricks you can do to encourage a particular ordering if you want to compare the performance with the base query. Say for instance that you wanted the timestamp condition to be executed first. You could do this:
WITH subset AS
( SELECT /*+ materialize */ *
FROM my_table x
WHERE CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
)
SELECT *
FROM subset
WHERE attribute3 = 'abcd*'
AND STATUS != 'P'
AND STATUS != 'J'
AND STATUS != 'X'
AND STATUS != 'S'
AND STATUS != 'D'
The "materialize" hint should cause the optimizer to execute the inline query first, then scan that result set for the other conditions.
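A related trick (a sketch, not a recommendation) is the NO_MERGE hint, which tells the optimizer not to merge the inline view back into the outer query, so the view's predicate is at least kept logically separate:
SELECT /*+ NO_MERGE(subset) */ *
FROM ( SELECT *
FROM my_table x
WHERE CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP
) subset
WHERE subset.attribute3 = 'abcd*'
AND subset.STATUS NOT IN ('P', 'J', 'X', 'S', 'D');
Note that NO_MERGE does not force materialization the way the materialize hint does; it only prevents the view-merging transformation.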
I'm not advising you do this as a general habit. In most cases just writing the simple query will lead to the best execution plans.

To add to the other comments on execution plans: under the CPU-based costing model introduced in 9i and used by default in 10g and later, Oracle will also assess which predicate evaluation order results in lower computational cost, even when that does not affect the table access order and method. If executing one predicate before another results in fewer predicate calculations being executed, then that optimisation can be applied.
See this article for more details: http://www.oracle.com/technology/pub/articles/lewis_cbo.html
Furthermore, Oracle doesn't even have to execute predicates where comparison with a check constraint or partition definitions indicates that no rows would be returned anyway.
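For instance (a hypothetical table, just to illustrate the idea), given a validated check constraint:
ALTER TABLE my_table
ADD CONSTRAINT status_chk CHECK (STATUS IN ('P', 'J', 'X', 'S', 'D', 'A'));

-- The optimizer may then deduce that this returns no rows
-- without ever touching the table:
SELECT * FROM my_table WHERE STATUS = 'Z';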
Complex stuff.

Finally, relational database theory says that you can never depend on the order of execution of the query clauses, so it's best not to try. As others have said, the cost-based optimizer tries to choose what it thinks is best, but even viewing the explain plan won't guarantee the actual order that's used. Explain plan just tells you what the CBO recommends, but that's still not 100%.
Maybe if you explain why you're trying to do this, someone could suggest a plan?

Tricky question. I just faced the same dilemma. I need to call a function within a query. The function itself runs another query, so you can see how that affects performance in general. But in most of our cases, the function wouldn't be called so often if the rest of the conditions were evaluated first.
I thought it would be useful to post another article on the topic here.
The following quote is copied from Donald Burleson's site (http://www.dba-oracle.com/t_where_clause.htm).
The ordered_predicates hint is specified in the Oracle WHERE clause of
a query and is used to specify the order in which Boolean predicates
should be evaluated.
In the absence of ordered_predicates, Oracle uses
the following steps to evaluate the order of SQL predicates:
Subqueries are evaluated before the outer Boolean conditions in the WHERE clause.
All Boolean conditions without built-in functions or subqueries are evaluated in reverse from the order they are found in the WHERE
clause, with the last predicate being evaluated first.
Boolean predicates with built-in functions of each predicate are evaluated in increasing order of their estimated evaluation costs.
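For completeness, a sketch of the hint's placement; per the quote above it goes in the WHERE clause, and since ordered_predicates was deprecated in Oracle 10g this should be treated as historical:
SELECT *
FROM my_table x
WHERE /*+ ordered_predicates */
x.STATUS != 'P'
AND CURRENT_TIMESTAMP - 1 < x.CREATION_TIMESTAMP;
With the hint, the predicates are evaluated in the order they appear in the WHERE clause.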

Related

SQL order of execution for correlated subquery

I have the following Personnel table:
+---------+----------+-------------+
| name | dept_nbr | job_title |
+---------+----------+-------------+
| Michael | 14 | Programmer |
| Kumar | 14 | Programmer |
| Dave | 14 | Programmer |
| Jane | 14 | Manager |
| Carol | 37 | Programmer |
| Joe | 37 | Programmer |
| John | 59 | CEO |
+---------+----------+-------------+
Problem: Find all dept_nbr's (departments) that have fewer than 3 programmers.
Working query:
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE (SELECT COUNT(P2.dept_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer') < 3;
Result:
37
59
Notes:
Department 14 is correctly not included as it has 3 programmers (3 is equal to but not fewer than 3). Department 59 has zero programmers, and is also correctly included in the results.
My question:
When the above query executes, how does a generic SQL engine proceed? From what I have read, SQL execution order is (roughly): From, Where, Group By, Having, and Select. So, is the following correct?
1 - The Outer Query passes each row of the Personnel table as P1 into the Inner query.
2.a - The Inner Query scans the entire Personnel table as P2, row by row, looking for rows that satisfy the condition "P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'".
2.b - Once the Inner Query is done with the entire table, it COUNTs the matching dept_nbr values and returns the count to the Outer Query.
3 - In the Outer Query, if the count returned from the Inner Query satisfies the condition "WHERE (Inner Query Count Result) < 3", the corresponding dept_nbr for the P1 row is SELECTed.
4 - After all rows are processed by the Outer Query, the Outer Query does a DISTINCT on the results and displays the unique dept_nbr values.
Is my understanding above correct? Specifically, does the outer query do the DISTINCT at the very end (step #4)? It seems that in this way, the inner query does redundant scanning (for example, it processes dept_nbr = 14 four times, when it really has the answer in the first pass).
I tested the above query on sqlfiddle.com w/ MySQL 5.6.
When the above query executes, how does a generic SQL engine proceed?
From what I have read, SQL execution order is (roughly): From, Where,
Group By, Having, and Select.
This statement is -- generally -- not correct. SQL is parsed in the order that you describe. However, the execution is determined by the optimizer and might have little to do with the original query. Remember: SQL is a descriptive language, not a procedural language. It describes the result set, not the specific steps for calculating it.
That said, MySQL's execution plan is much closer to the query than most other databases (particularly more advanced databases with better optimizers). And, almost any database is going to proceed in the steps you describe for this query. The aggregation in the subquery limits the choices for optimization.
If you want to eliminate the redundancy, then do the select distinct before the filtering:
SELECT dept_nbr
FROM (SELECT DISTINCT dept_nbr FROM Personnel P1) P1
WHERE (SELECT COUNT(P2.dept_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer'
) < 3;
You can also do this more simply with just an aggregation:
select dept_nbr
from personnel
group by dept_nbr
having sum(job_title = 'Programmer') < 3;
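The sum(job_title = 'Programmer') form relies on MySQL treating boolean expressions as 0/1; a portable equivalent uses CASE:
SELECT dept_nbr
FROM Personnel
GROUP BY dept_nbr
HAVING SUM(CASE WHEN job_title = 'Programmer' THEN 1 ELSE 0 END) < 3;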
Add EXPLAIN (or EXPLAIN EXTENDED) before your query and it should give you the explain plan which will detail exactly the steps in order of your query. This is a very useful tool when trying to optimize queries.
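For example, on the original query:
EXPLAIN EXTENDED
SELECT DISTINCT dept_nbr
FROM Personnel AS P1
WHERE (SELECT COUNT(P2.dept_nbr)
FROM Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr AND P2.job_title = 'Programmer') < 3;

SHOW WARNINGS;
With EXTENDED, the subsequent SHOW WARNINGS prints the query as the optimizer rewrote it, which can reveal transformations you didn't expect.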

Why does the query execute so much slower when all the columns involved are the same and only the where condition changes?

I have this query:
SELECT 1 AS InputIndex,
IF(TRIM(DeviceInput1Name) = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';
This completes fairly quickly (in up to 0.01 seconds). However, running the same query with WHERE clause as such
WHERE DeviceIMEI = 'Some_Other_Search_Value';
makes it run for upwards of 14 seconds! Some search values finish very quickly, while others run way too long.
If I run EXPLAIN on either query, I get the following:
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | PRIMARY | devices | ref | DeviceIMEI | DeviceIMEI | 28 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
Also, here's the actual number of records, just so it's clear:
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Search_Value';
+----------+
| count(*) |
+----------+
| 1017946 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Other_Search_Value';
+----------+
| count(*) |
+----------+
| 306100 |
+----------+
1 row in set (0.04 sec)
Any ideas why changing a search value in the WHERE clause would cause the query to execute so slowly, even when the number of physical records to search through is lower?
Note there is no need for you to rewrite the query, just explain why the above happens.
UPDATE: I have tried running two separate queries instead of one with dependent subqueries to get the information I need (first I select DeviceID from devices by DeviceIMEI, then select from devicevalues by the DeviceID I got from the previous query), and all queries return instantly. I suppose the only solution is to run these queries in a transaction, so I'll be making a stored procedure to do this. This, however, still doesn't answer the question that puzzles me.
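For reference, the two-step approach described in the update might look like this (the literal 42 stands in for a DeviceID returned by the first statement):
SELECT DeviceID FROM devices WHERE DeviceIMEI = 'Some_Search_Value';

SELECT Value1_1, Value1_2
FROM devicevalues
WHERE DeviceID = 42
ORDER BY ValueTime DESC
LIMIT 1;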
I don't think that 1017946 is equivalent to the number of rows returned by your very first query. Your first query returns all rows from devices with some correlated subqueries; your count query returns all common rows between the two tables. If that is so, the problem might be a cardinality issue: 'Some_Other_Search_Value' constitutes a much larger proportion of the rows in your first query than 'Some_Search_Value', so MySQL chooses a table scan.
If I understand correctly, the query is the same, and only the searched value changes.
There are three real possibilities as far as I can see, the first much likelier than the others:
The fast query only appears to be fast, and that's because its result is already in the MySQL query cache. Try disabling the cache, running with SQL_NO_CACHE, or run the slow query twice. If the second run takes 0.01s instead of 14s, you'll know this is the case.
One query has to look at way more records than the other. One IMEI may have lots of rows in devicevalues, another might have next to none. Apparently you are in such a condition, but what makes this unlikely is (apart from the times involved) the fact that it is the slower IMEI that actually has fewer matches.
The slow query is indeed slow. This means that a particular subset of data is hard to locate or hard to retrieve. The former may be due to overdue reindexing or to filesystem fragmentation of very large indexes; the latter can also be due to fragmentation of the tablespace, or to some other condition that splits up records (for example, the database is partitioned). A search in a small partition tends to be faster than a search in a large partition.
But the time differences involved aren't equal in the three cases, and a 1400x difference seems to me an unlikely consequence of (2) or (3). The first possibility seems way more appealing.
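A quick way to test possibility 1, assuming the query cache is enabled, is the SQL_NO_CACHE modifier, e.g. on the count query from the question:
SELECT SQL_NO_CACHE COUNT(*)
FROM devicevalues
INNER JOIN devices USING (DeviceID)
WHERE devices.DeviceIMEI = 'Some_Search_Value';
If both IMEIs now take comparable time, the fast result was coming from the cache.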
Update: you don't seem to be using indexes suited to your subqueries. Do you have an index such as
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime);
If you can, you can try a covering index:
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime, Value1_1, Value1_2);

Fastest way to count the rows in any database table?

We assume that there is no primary key defined for a table T. In that case, how does one count all the rows in T quickly/efficiently for these databases - Oracle 11g, MySQL, MSSQL?
It seems that count(*) can be slow and count(column_name) inaccurate. The following seems to be the fastest and most reliable way to do it:
select count(rowid) from MySchema.TableInMySchema;
Can you tell me if the above statement also has any shortcomings? If it is good, then do we have similar statements for MySQL and MSSQL?
Thanks in advance.
Source -
http://www.thewellroundedgeek.com/2007/09/most-people-use-oracle-count-function.html
count(column_name) is not inaccurate, it's simply something completely different from count(*).
The SQL standard defines count(column_name) as equivalent to count(*) where column_name IS NOT NULL. So the result is bound to be different if column_name is nullable.
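In other words, these two statements are defined to return the same result:
SELECT COUNT(c1) FROM foo;
SELECT COUNT(*) FROM foo WHERE c1 IS NOT NULL;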
In Oracle (and possibly other DBMS as well), count(*) will use an available index on a not null column to count the rows (e.g. the PK index). So it will be just as fast.
Additionally there is nothing similar to the rowid in SQL Server or MySQL (in PostgreSQL it would be ctid).
Do use count(*). It's the best option to get the row count. Let the DBMS do any optimization in the background if adequate indexes are available.
Edit
A quick demo on how Oracle automatically uses an index if available and how that reduces the amount of work done by the database:
The setup of the test table:
create table foo (id integer not null, c1 varchar(2000), c2 varchar(2000));
insert into foo (id, c1, c2)
select lvl, c1, c1 from
(
select level as lvl, dbms_random.string('A', 2000) as c1
from dual
connect by level < 10000
);
That generates just under 10,000 rows, with each row filling up some space in order to make sure the table has a realistic size.
Now in SQL*Plus I run the following:
SQL> set autotrace traceonly explain statistics;
SQL> select count(*) from foo;
Execution Plan
----------------------------------------------------------
Plan hash value: 1342139204
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2740 (1)| 00:00:33 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| FOO | 9999 | 2740 (1)| 00:00:33 |
-------------------------------------------------------------------
Statistics
----------------------------------------------------------
181 recursive calls
0 db block gets
10130 consistent gets
0 physical reads
0 redo size
430 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
As you can see, a full table scan is done on the table, which requires 10130 "IO operations" (I know that that is not the right term, but for the sake of the demo it should be a good enough explanation for someone who has never seen this before).
Now I create an index on that column and run the count(*) again:
SQL> create index i1 on foo (id);
Index created.
SQL> select count(*) from foo;
Execution Plan
----------------------------------------------------------
Plan hash value: 129980005
----------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | INDEX FAST FULL SCAN| I1 | 9999 | 7 (0)| 00:00:01 |
----------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
27 consistent gets
21 physical reads
0 redo size
430 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
As you can see, Oracle did use the index on the (not null!) column and the amount of IO went down drastically (from 10130 to 27 - not something I'd call "grossly inefficient").
The "physical reads" stem from the fact that the index was just created and was not yet in the cache.
I would expect other DBMS to apply the same optimizations.
In Oracle, COUNT(*) is the most efficient. Realistically, COUNT(rowid), COUNT(1), or COUNT('fuzzy bunny') are likely to be equally efficient. But if there is a difference, COUNT(*) will be more efficient.
I always use SELECT COUNT(1) FROM anything; instead of the asterisk...
Some people are of the opinion that MySQL uses the asterisk to invoke the query optimizer and skips any optimization when "1" is used as a static scalar...
IMHO this is straightforward, because you don't use any variable and it's clear that you only count all rows.

Compound index required to speed up join-ed query?

A colleague asked me to explain how indexes (indices?) boost performance; I tried to do so, but got confused myself.
I used the model below for explanation (an error/diagnostics logging database). It consists of three tables:
List of business systems, table "System" containing their names
List of different types of traces, table "TraceTypes", defining what kinds of error messages can be logged
Actual trace messages, having foreign keys from System and TraceTypes tables
I used MySQL for the demo, however I don't recall the table types I used. I think it was InnoDB.
System TraceTypes
----------------------------- ------------------------------------------
| ID | Name | | ID | Code | Description |
----------------------------- ------------------------------------------
| 1 | billing | | 1 | Info | Informational message |
| 2 | hr | | 2 | Warning| Warning only |
----------------------------- | 3 | Error | Failure |
| ------------------------------------------
| ------------|
Traces | |
--------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message |
--------------------------------------------------
| 1 | 1 | 1 | Job starting |
| 2 | 1 | 3 | System.nullr..|
--------------------------------------------------
First, I added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet)
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on Traces.System_ID column and Traces.TraceTypes_ID column. Now this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query which joined all the three tables still took 8-10 seconds to complete.
Only when I created a compound index (both System_ID and TraceTypes_ID columns included in the index) did the speed go down to milliseconds.
The basic statement I was taught earlier is "all the columns you use for joining must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, yet MySQL didn't use them. The question is: why? My bet is that the item count ratio 100:10,000,000:50 makes the single-column indexes too large to be used. But is that true?
First, the correct, and easiest, way to analyze a slow SQL statement is to run EXPLAIN. Find out how the optimizer chose its plan and ponder why, and how to improve it. I'd suggest studying the EXPLAIN results with only the 2 separate indexes to see how MySQL executes your statement.
I'm not very familiar with MySQL, but it seems MySQL 4 was restricted to using only one index per table involved in a query. There have been improvements since MySQL 5 (index merge), but I'm not sure whether they apply to your case. Again, EXPLAIN should tell you the truth.
Even where using 2 indexes per table is allowed (MySQL 5), using 2 separate indexes is generally slower than a compound index: 2 separate indexes require an index merge step, compared to the single pass over a compound index.
Multi Column indexes vs Index Merge (which uses MySQL 5.4.2) might be helpful.
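For the schema above, the compound index that finally made the three-table join fast would look something like this (the index name is illustrative):
CREATE INDEX ix_traces_system_tracetype ON Traces (System_ID, TraceTypes_ID);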
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it would be using the index and then it might be using traditional look up to move to another index and then filter out. Please check the execution plan. So in short you might be looping through two indexes in nested loop. As per my understanding. We should try to make a composite index on column which are in filtering or in join and then we should use Include clause for the columns which are in select. I have never worked in MySql so my this understanding is based on SQL Server 2005.

Mysql: Optimizing Selecting rows from multiple ranges (using indexes?)

My table (projects):
id, lft, rgt
1, 1, 6
2, 2, 3
3, 4, 5
4, 7, 10
5, 8, 9
6, 11, 12
7, 13, 14
As you may have noticed, this is hierarchical data using the nested set model. Tree pretty-printed:
1
  2
  3
4
  5
6
7
I want to select all sub projects under project 1 and 4. I can do this with:
SELECT p.id
FROM projects AS p, projects AS ps
WHERE (ps.id = 1 OR ps.id = 4)
AND p.lft BETWEEN ps.lft AND ps.rgt
However, this is very slow with a large table; when running EXPLAIN on the query I get:
+----+-------------+-------+-------+------------------------+---------+---------+------+------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------+---------+---------+------+------+-------------------------------------------------+
| 1 | SIMPLE | ps | range | PRIMARY,lft,rgt,lftRgt | PRIMARY | 4 | NULL | 2 | Using where |
| 1 | SIMPLE | p | ALL | lft,lftRgt | NULL | NULL | NULL | 7040 | Range checked for each record (index map: 0x12) |
+----+-------------+-------+-------+------------------------+---------+---------+------+------+-------------------------------------------------+
(The projects table has indexes on lft, rgt, and lft-rgt. As you can see, MySQL does not use any index on p, and loops through the 7040 records.)
I have found that if I only select for one of the super projects, MySQL manages to use the indexes:
SELECT p.id
FROM projects AS p, projects AS ps
WHERE ps.id = 1
AND p.lft BETWEEN ps.lft AND ps.rgt
EXPLAINs to:
+----+-------------+-------+-------+------------------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | ps | const | PRIMARY,lft,rgt,lftRgt | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | p | range | lft,lftRgt | lft | 4 | NULL | 7 | Using where |
+----+-------------+-------+-------+------------------------+---------+---------+-------+------+-------------+
FINALLY, my question: is there any way I can SELECT rows matching multiple ranges, and still benefit from indexes?
From 7.2.5.1. The Range Access Method for Single-Part Indexes in the MySQL reference manual:
Currently, MySQL does not support merging multiple ranges for the range access method for spatial indexes. To work around this limitation, you can use a UNION with identical SELECT statements, except that you put each spatial predicate in a different SELECT.
So you need to have a union of two different selects.
Have you tried a union? Take your second example, add "union" underneath, and then repeat it matching id 4. I don't know if it would work, but it seems like an obvious thing to try.
edit:
SELECT p.id
FROM projects AS p, projects AS ps
WHERE ps.id = 1
AND p.lft BETWEEN ps.lft AND ps.rgt
UNION
SELECT p.id
FROM projects AS p, projects AS ps
WHERE ps.id = 4
AND p.lft BETWEEN ps.lft AND ps.rgt
Your query does merge the multiple ranges.
It uses a range access method to combine the multiple ranges on ps (which is leading in the join).
For each row returned from ps, it checks the best method to retrieve all rows from p for the given values of ps.lft and ps.rgt. Depending on the query selectivity, it may be either a fullscan over p or an index lookup over one of two possible indexes.
The number of rows shown in the EXPLAIN means little: EXPLAIN just shows the worst possible outcome. It doesn't necessarily mean that all these rows will be examined; whether they will or not, the optimizer can only tell at runtime.
The documentation snippet about the impossibility of merging multiple ranges is only valid for SPATIAL indexes (R-Tree indexes, the ones you create over GEOMETRY types). Those indexes are good for queries that search upwards (the ancestors of a given project) but not downwards.
A plain B-Tree index can combine the multiple ranges. From the documentation:
For all types of indexes, multiple range conditions combined with OR or AND form a range condition.
The real problem is that the optimizer in MySQL cannot make a single correct decision: either use a single fullscan (with ps leading), or make several range scans.
Say you have 10,000 rows and your project boundaries are 0-500 and 2000-2500. The optimizer will see that each boundary would benefit from the index, and the range check will result in two range accesses, while a single fullscan would have been better.
It may be even worse if your project boundaries are, say, 0-3000 and 5000-6000. In this case the optimizer will make two fullscans, while one would suffice.
To help the optimizer make the correct decision, you should create a covering index on (lft, id), in this order:
CREATE INDEX ix_lft_id ON projects (lft, id)
The tipping point for using a fullscan over a covering index rather than a range condition is 90%, which means you will never have more than one fullscan in your actual plan.
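With that index in place you can verify the plan change with EXPLAIN, and if the optimizer still insists on a fullscan you can test the range plan explicitly with FORCE INDEX (a diagnostic sketch, not a production recommendation):
EXPLAIN
SELECT p.id
FROM projects AS p FORCE INDEX (ix_lft_id), projects AS ps
WHERE (ps.id = 1 OR ps.id = 4)
AND p.lft BETWEEN ps.lft AND ps.rgt;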