Performance in MySQL joins with subquery and LIMIT

In a join between two subqueries (or a table and a subquery), is it better to put the LIMIT clause in an inner query rather than on the outer query (since the placement determines how many rows the DBMS has to iterate over to check the WHERE clause)? For example, I suspect that
((
SELECT id
FROM Table1
WHERE score>=100
LIMIT 10)
AS A NATURAL JOIN Table2
))
would be better than
((
SELECT id
FROM Table1
WHERE score>=100)
AS A NATURAL JOIN Table2
))
LIMIT 10
My thinking is that in the second query, the DBMS first has to iterate (over the full table or an index) through ALL rows in Table1 where score>=100 that can be mapped to Table2 on their common columns (which could be any number of rows), and only after that does it truncate the result to 10 rows. In the first query, it only scans until it has found 10 rows from Table1 that satisfy the WHERE clause, joins those to Table2, and then stops.

The two partial statements are not equivalent. With LIMIT, the position of the clause matters: if you place the limit on Table1, you might never see rows that you would have seen with the limit placed on the whole joined result. With that caveat, applying the limit first and then joining looks like it should be more efficient, but as a rule of thumb you should always measure.
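The non-equivalence is easy to reproduce. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL and invented sample data (the `label` column and the row contents are not from the question). An ORDER BY is added so that the LIMIT is deterministic.

```python
import sqlite3

# SQLite stands in for MySQL here; the table contents are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, score INTEGER);
    CREATE TABLE Table2 (id INTEGER PRIMARY KEY, label TEXT);
""")
# Ids 1..20 all satisfy score >= 100, but only ids 11..20 exist in Table2.
conn.executemany("INSERT INTO Table1 VALUES (?, ?)",
                 [(i, 100 + i) for i in range(1, 21)])
conn.executemany("INSERT INTO Table2 VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(11, 21)])

# LIMIT applied before the join: the 10 rows kept (ids 1..10) have no match in Table2.
limit_inside = conn.execute("""
    SELECT A.id
    FROM (SELECT id FROM Table1 WHERE score >= 100 ORDER BY id LIMIT 10) AS A
    NATURAL JOIN Table2
    ORDER BY A.id
""").fetchall()

# LIMIT applied after the join: the first 10 joined rows are returned.
limit_outside = conn.execute("""
    SELECT A.id
    FROM (SELECT id FROM Table1 WHERE score >= 100) AS A
    NATURAL JOIN Table2
    ORDER BY A.id
    LIMIT 10
""").fetchall()

print(len(limit_inside), len(limit_outside))
```

With this data the inner-LIMIT query returns no rows at all while the outer-LIMIT query returns ten, so which form is faster only matters once you have decided which result you actually want.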
Also consider that instead of joining the SELECT as a derived table, for which MySQL may have to build an internal temporary table, you could join the table itself, i.e.:
SELECT t0.col0, t1.col1
FROM
Table0 t0
JOIN Table1 t1 ON (t0.col0 = t1.col0 AND t1.score >= 100)
which might be even more efficient if you have good indexes and end up using them. But again, you should measure.

Related

In query with joins and multi-table/field ORDER BY, how to set LIMIT offset to start from a particular row identified by a unique id field?

Suppose I have four tables: tbl1 ... tbl4. Each has a unique numerical id field. tbl1, tbl2 and tbl3 each has a foreign key field for the next table in the sequence. E.g. tbl1 has a tbl2_id foreign key field, and so on. Each table also has a field order (and other fields not relevant to the question).
It is straightforward to join all four tables to return all rows of tbl1 together with corresponding fields from the other three fields. It is also easy to order this result set by a specific ORDER BY combination of the order fields. It is also easy to return just the row that corresponds to some particular id in tbl1, e.g. WHERE tbl1.id = 7777.
QUESTION: what query most efficiently returns (e.g.) 100 rows, starting from the row corresponding to id=7777, in the order determined by the specific combination of order fields?
One approach would be to use ROW_NUMBER (or an emulation of it in MySQL versions < 8) to get the position of the id=7777 row, and then use that position in a new version of the same query to set the offset in the LIMIT clause (with a read lock in between). But can it be done in a single query?
# FIRST QUERY: get row number of result row where tbl1.id = 7777
SELECT x.row_number
FROM
(SELECT @row_number:=@row_number+1 AS row_number, tbl1.id AS id
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <some conditions>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
WHERE id=7777;
Store the row number from the above query and use it to bind :offset in the following query.
# SECOND QUERY : Get 100 rows starting from the one with id=7777
SELECT x.field1, x.field2, <etc.>
FROM
(SELECT @row_number:=@row_number+1 AS row_number, field1, field2
FROM (SELECT @row_number:=0) AS t, tbl1
INNER JOIN tbl2 ON tbl2.id = tbl1.tbl2_id
INNER JOIN tbl3 ON tbl3.id = tbl2.tbl3_id
INNER JOIN tbl4 ON tbl4.id = tbl3.tbl4_id
WHERE <same conditions as before>
ORDER BY tbl4.order, tbl3.order, tbl2.order, tbl1.order
) AS x
LIMIT :offset, 100;
Clarify question
In the general case, you won't ask for WHERE id1 > 7777. Instead, you have a tuple of (11,22,33,44) and you want to "continue where you left off".
Two discussions, with
That is messy, but not impossible. See "Iterating through a compound key". It gives an example of doing it with 2 columns; 4 columns coming from 4 tables is an extension of the same idea.
A variation
Here is another discussion of such: https://dba.stackexchange.com/questions/164428/should-i-store-data-pre-ordered-rather-than-ordering-on-the-fly/164755#164755
In actually implementing this, I have found that letting the "100" (LIMIT) be flexible can be easier to think through. The idea is: reach forward 100 rows (with LIMIT 100,1). Let's say you get (111,222,333,444). If you are currently at (111, ...), then deal with id2/3/4. If you are instead looking at, say, (113, ...), then do WHERE id1 < 113 and leave off any specification of id2/3/4. This means fetching fewer than 100 rows, but it lands you just shy of starting id1=113.
That is, it involves constructing a WHERE clause with between 1 and 4 conditions.
In all cases, your query says ORDER BY id1, id2, id3, id4. And the only use for LIMIT is in the probe to figure out how far ahead the 100th row is (with LIMIT 100,1).
I think I can dig out some old Perl code for that.
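The "continue where you left off" idea can be sketched for the two-column case as follows. This is not the Perl code mentioned above, and it uses a fixed page size instead of the flexible LIMIT probe; it assumes Python's sqlite3 module as a stand-in for MySQL and a hypothetical single table `items` with compound key (id1, id2). Extending the WHERE clause to four columns follows the same pattern.

```python
import sqlite3

# SQLite stands in for MySQL; "items" and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(a, b) for a in range(1, 6) for b in range(1, 8)])  # 35 rows


def next_page(conn, last_key, page_size):
    """Return up to page_size rows that sort strictly after last_key = (id1, id2)."""
    a, b = last_key
    return conn.execute(
        """
        SELECT id1, id2 FROM items
        WHERE id1 > ? OR (id1 = ? AND id2 > ?)
        ORDER BY id1, id2
        LIMIT ?
        """,
        (a, a, b, page_size),
    ).fetchall()


pages = []
last_key = (0, 0)  # sorts before every key in this sample data
while True:
    page = next_page(conn, last_key, 10)
    if not page:
        break
    pages.append(page)
    last_key = page[-1]  # remember the tuple where this page ended

all_rows = [row for page in pages for row in page]
print([len(p) for p in pages])
```

Because each page is located by key rather than by offset, the query does not have to step over all the rows that precede the starting point.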

How to do a join on 2 tables, but only return the data for one table?

I am not sure if this is possible. But is it possible to do a join on 2 tables, but return the data for only one of the tables. I want to join the two tables based on a condition, but I only want the data for one of the tables. Is this possible with SQL, if so how? After reading the docs, it seems that when you do a join you get the data for both tables. Thanks for any help!
You get data from both tables because join is based on "Cartesian Product" + "Selection". But after the join, you can do a "Projection" with desired columns.
SQL has an easy syntax for this:
Select t1.* --taking data just from one table
from one_table t1
inner join other_table t2
on t1.pk = t2.fk
You can choose the table through the alias: t1.* or t2.*. The symbol * means "all fields".
Also you can include where clause, order by or other join types like outer join or cross join.
A typical SQL query has multiple clauses.
The SELECT clause mentions the columns you want in your result set.
The FROM clause, which includes JOIN operations, mentions the tables from which you want to retrieve those columns.
The WHERE clause filters the result set.
The ORDER BY clause specifies the order in which the rows in your result set are presented.
There are a few other clauses like GROUP BY and LIMIT. You can read about those.
To do what you ask, select the columns you want, then mention the tables you want. Something like this.
SELECT t1.id, t1.name, t1.address
FROM t1
JOIN t2 ON t2.t1_id = t1.id
This gives you data from t1 from rows that match t2.
Pro tip: Avoid the use of SELECT *. Instead, mention the columns you want.
This would typically be done using exists (or in) if you prefer:
select t1.*
from table1 t1
where exists (select 1 from table2 t2 where t2.x = t1.y);
Although you can use join, it runs the risk of multiplying the number of rows in the result set -- if there are duplicate matches in table2. There is no danger of such duplicates using exists (or in). I also find the logic to be more natural.
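The row-multiplication risk can be seen in a small experiment. This is a sketch that assumes Python's sqlite3 module as a stand-in for MySQL; the `name` column and the sample rows are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (y INTEGER, name TEXT);
    CREATE TABLE table2 (x INTEGER);
    INSERT INTO table1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
    -- y = 1 matches three rows in table2, y = 2 matches one, y = 3 matches none.
    INSERT INTO table2 VALUES (1), (1), (1), (2);
""")

# JOIN: each table1 row is repeated once per matching table2 row.
join_rows = conn.execute("""
    SELECT t1.y, t1.name
    FROM table1 t1
    JOIN table2 t2 ON t2.x = t1.y
    ORDER BY t1.y
""").fetchall()

# EXISTS: each table1 row appears at most once, however many matches it has.
exists_rows = conn.execute("""
    SELECT t1.y, t1.name
    FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.x = t1.y)
    ORDER BY t1.y
""").fetchall()

print(len(join_rows), len(exists_rows))
```

Here the join returns four rows (three of them copies of the same table1 row) while the EXISTS form returns two.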
When you join 2 tables, you can use the SELECT list to pick the data you want.
If you want data from only one table, select columns from just that table:
SELECT b.title
FROM blog b
JOIN type t ON b.type_id=t.id;
If you want data from both tables, select columns from both:
SELECT b.title,t.type_name
FROM blog b
JOIN type t ON b.type_id=t.id;

Mysql - Not In takes Time when nested select query is dynamic but not if it's constant

T1 contains about 30 million rows,
T2 contains about 100k rows
select a from T2 gives ('a1','a2','a3',...); (100k rows)
When I put the 100k constant values directly inside the IN list, the query returns its result in 80 ms. But when I use a nested select in the query, it takes forever.
select a,b from T1 where a in ('a1','a2','a3', ...); (Constant Values inside in block)
select a,b from T1 where a in (select a from T2); (Query instead of values)
Any idea why this is happening? Also, is there a better way to do it?
Since T1 contains 30 million rows, Left Join also takes a lot of time.
My Actual Query is :
select a,b from t1 where (a,b) not in (select a,b from t2) and a in (select a from t1);
There is a third, better, way:
SELECT ...
FROM T1
JOIN T2 ON T1.a = T2.a;
(And be sure that there is an index on a in each table.)
IN ( SELECT ... ) is notoriously slow; avoid it.
The subquery is more expensive in MySQL because MySQL hasn't optimized that type of query very well. MySQL doesn't notice that the subquery is invariant; in other words, the subquery on T2 has the same result regardless of which row of T1 is being searched. You and I can see clearly that MySQL should execute the subquery once and use its result to evaluate each row of T1.
But in fact, MySQL naively assumes that the subquery is a correlated subquery that may have a different result for each row in T1. So it is forced to run the subquery many times.
You clarified in a comment that your query is actually:
select a,b from t1
where (a,b) not in (select a,b from t2)
and a in (select a from t1);
You should also know that MySQL does not optimize tuple comparison at all. Even if it should use an index, it will do a table-scan. I think they're working on fixing that in MySQL 8.
Your second term is unnecessary, because the subquery is selecting from t1, so obviously any value of a in t1 exists in t1. Did you mean to put t2 in the subquery? I'll assume that you just made an error typing.
Here's how I would write your query:
select t1.a, t1.b from t1
left outer join (select distinct a, b from t2) as nt2
on t1.a = nt2.a and t1.b = nt2.b
where nt2.a is null;
In these cases, MySQL treats the subquery differently because it appears in the FROM clause. It runs each subquery once and stores the results in temp tables. Then it evaluates the rows of t1 against the data in the temp tables.
The use of distinct is to make sure to do the semi-join properly; if there are multiple matching rows in t2, we don't want multiple rows of output from the query.
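As a quick sanity check on the rewrite, the sketch below runs the LEFT OUTER JOIN form next to an equivalent NOT EXISTS form. It assumes Python's sqlite3 module as a stand-in for MySQL and invented sample rows, so it verifies the result set only and says nothing about MySQL's execution speed.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a TEXT, b INTEGER);
    CREATE TABLE t2 (a TEXT, b INTEGER);
    INSERT INTO t1 VALUES ('a1', 1), ('a1', 2), ('a2', 1), ('a3', 9);
    -- ('a1', 1) appears twice in t2 on purpose.
    INSERT INTO t2 VALUES ('a1', 1), ('a1', 1), ('a2', 1);
""")

# Rows of t1 whose (a, b) pair has no match in t2, via LEFT OUTER JOIN ... IS NULL.
left_join_rows = conn.execute("""
    SELECT t1.a, t1.b
    FROM t1
    LEFT OUTER JOIN (SELECT DISTINCT a, b FROM t2) AS nt2
      ON t1.a = nt2.a AND t1.b = nt2.b
    WHERE nt2.a IS NULL
    ORDER BY t1.a, t1.b
""").fetchall()

# The same question asked with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT t1.a, t1.b
    FROM t1
    WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a AND t2.b = t1.b)
    ORDER BY t1.a, t1.b
""").fetchall()

print(left_join_rows)
```

Both forms return only the t1 rows whose (a, b) pair is absent from t2.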

mysql limit with in clause

SELECT *
FROM restaurant_rate
WHERE
table_id IN (SELECT id_table FROM rTable WHERE restaurant_id = ?)
LIMIT 0, 10;
The number of rows returned by the inner select is not fixed. In this case, does MySQL scan rows one by one only until it has found 10 matches, or does it scan the whole table and then return the top 10 rows?
The id_table is an index column of rTable.
The way your query is written is quite inefficient, because it forces the IN clause to be evaluated for every row in your table. If there are many records in rTable that match the criteria (let's say 1000), every row in restaurant_rate will need to be compared with 1000 values before being accepted or rejected by the WHERE condition.
I would rewrite your query like this:
select rr.*
from
restaurant_rate as rr
inner join rTable as r on rr.table_id = r.id_table
where
r.restaurant_id = ?
limit 0, 10;
Things you must consider:
You need indexes for the columns involved in the join and in the where condition.
Using limit without order by does not make much sense, because which 10 rows you get is then unspecified; add an order by clause before the limit.
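The rewrite and the ORDER BY advice can be combined as in the sketch below. It assumes Python's sqlite3 module as a stand-in for MySQL; the `id` and `rate` columns, the index names, and the sample data are invented, and the restaurant id is bound to 1 as an example.

```python
import sqlite3

# SQLite stands in for MySQL; columns "id" and "rate" and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rTable (id_table INTEGER PRIMARY KEY, restaurant_id INTEGER);
    CREATE TABLE restaurant_rate (id INTEGER PRIMARY KEY, table_id INTEGER, rate INTEGER);
    -- Indexes on the columns used in the join and in the WHERE condition.
    CREATE INDEX idx_rtable_restaurant ON rTable (restaurant_id);
    CREATE INDEX idx_rate_table ON restaurant_rate (table_id);
""")
# Tables 1..6 alternate between restaurants 2 and 1.
conn.executemany("INSERT INTO rTable VALUES (?, ?)",
                 [(t, 1 + t % 2) for t in range(1, 7)])
# 60 ratings spread evenly over the six tables.
conn.executemany("INSERT INTO restaurant_rate VALUES (?, ?, ?)",
                 [(i, 1 + i % 6, i % 5) for i in range(1, 61)])

restaurant_id = 1  # example parameter value

in_rows = conn.execute("""
    SELECT *
    FROM restaurant_rate
    WHERE table_id IN (SELECT id_table FROM rTable WHERE restaurant_id = ?)
    ORDER BY id
    LIMIT 10
""", (restaurant_id,)).fetchall()

join_rows = conn.execute("""
    SELECT rr.*
    FROM restaurant_rate AS rr
    INNER JOIN rTable AS r ON rr.table_id = r.id_table
    WHERE r.restaurant_id = ?
    ORDER BY rr.id
    LIMIT 10
""", (restaurant_id,)).fetchall()

print(in_rows == join_rows)
```

With the ORDER BY in place both forms return the same ten rows. The join cannot duplicate rows here because id_table is unique in rTable.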

Which Query is faster if we put the "Where" inside the Join Table or put it at the end?

Ok, I am using Mysql DB. I have 2 simple tables.
Table1
ID-Text
12-txt1
13-txt2
42-txt3
.....
Table2
ID-Type-Text
13- 1 - MuTxt1
42- 1 - MuTxt2
12- 2 - Xnnn
Now I want to join these 2 tables to get all data for Type=1 in table 2
SQL1:
Select * from
Table1 t1
Join
(select * from Table2 where Type=1) t2
on t1.ID=t2.ID
SQL2:
Select * from
Table1 t1
Join
Table2 t2
on t1.ID=t2.ID
where t2.Type=1
These 2 queries give the same result, but which one is faster?
I don't know how MySQL performs the join (or how joins work in MySQL), which is why I am asking.
Extra info: if I don't want Type=1 but instead want t2.text='MuTxt1', SQL2 becomes
Select * from
Table1 t1
Join
Table2 t2
on t1.ID=t2.ID
where t2.text='MuTxt1'
I feel like this query is slower??
Sometimes the MySQL query optimizer does a pretty decent job and sometimes it does a poor one. That said, there are exceptions to my answer, where the optimizer handles one of the other forms better.
Subqueries are generally expensive because MySQL needs to execute them and store their results separately. Normally, if you could use either a subquery or a join, the join is faster, especially when the subquery is part of your WHERE clause and has no LIMIT.
Select *
from Table1 t1
Join Table2 t2 on t1.ID=t2.ID
where t2.Type=1
and
Select *
from Table1 t1
Join Table2 t2
where t1.ID =t2.ID AND t2.Type=1
should perform equally well, while
Select *
from Table1 t1
Join (select *
from Table2
where Type=1) t2
on t1.ID=t2.ID
most likely is a lot slower as MySQL stores the result of select * from Table2 where Type=1 into a temporary table.
Generally joins work by building a table comprised of all combinations of rows from both table and afterwards removing lines which do not match the conditions. MySQL of course will try to use indexes containing the columns compared in the on clause and specified in the where clause.
If you are interested in which indexes are used, write EXPLAIN in front of your query and execute.
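The three forms above can be checked for equivalence against the sample data from the question. This sketch assumes Python's sqlite3 module as a stand-in for MySQL, so it confirms that the results match but cannot show MySQL's timings or temporary-table behavior; for those, run EXPLAIN on your own server.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, Text TEXT);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, Type INTEGER, Text TEXT);
    INSERT INTO Table1 VALUES (12, 'txt1'), (13, 'txt2'), (42, 'txt3');
    INSERT INTO Table2 VALUES (13, 1, 'MuTxt1'), (42, 1, 'MuTxt2'), (12, 2, 'Xnnn');
""")

queries = {
    # Filter in the WHERE clause after an ON join.
    "join_on_where": """
        SELECT t1.ID, t1.Text, t2.Text
        FROM Table1 t1
        JOIN Table2 t2 ON t1.ID = t2.ID
        WHERE t2.Type = 1
        ORDER BY t1.ID
    """,
    # Join condition and filter both in the WHERE clause.
    "join_where_only": """
        SELECT t1.ID, t1.Text, t2.Text
        FROM Table1 t1
        JOIN Table2 t2
        WHERE t1.ID = t2.ID AND t2.Type = 1
        ORDER BY t1.ID
    """,
    # Filter inside a derived table.
    "derived_table": """
        SELECT t1.ID, t1.Text, t2.Text
        FROM Table1 t1
        JOIN (SELECT * FROM Table2 WHERE Type = 1) t2 ON t1.ID = t2.ID
        ORDER BY t1.ID
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

All three return the two Type=1 rows, so the choice between them comes down to readability and to what the optimizer does on your data.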
In my view, the 2nd query is better than the first in terms of both code readability and performance. You can also include the filter condition in the join clause, like this:
Select * from
Table1 t1
Join
Table2 t2 on t1.ID=t2.ID and t2.Type=1
You can compare execution time for all queries in SQL Fiddle here: Query 1, Query 2, My Query.
I think this question is hard to answer since we don't know the exact internals of the database's query optimizer. Usually these kinds of constructions are evaluated by the database in a similar way (it may recognize that the first and second queries are equivalent and plan them identically, or it may not).
I would write the second one since it is more clear what is happening.