I would like to know whether this two versions are equivalent in result and which is better for performance reasons and why?
Nested Select in Select version
select
t1.c1,
t1.c2,
(select Count(t2.c1) from t2 where t2.id = t1.id) as count_t
from
t1
VS
select t1.c1,t1.c2, Count(t2.c1)
from t1,t2
where t2.id= t1.id
The first query is analog of this query -
SELECT
t1.c1,
t1.c2,
COUNT(t2.c1)
FROM t1
LEFT JOIN t2
ON t2.id = t1.id;
It selects all records from first table, and all matched records from second table (it is LEFT JOIN condition).
The second is analog of this query -
SELECT
t1.c1,
t1.c2,
COUNT(t2.c1)
FROM t1
JOIN t2
ON t2.id = t1.id;
It selects only matched records in both tables (it is INNER JOIN condition).
Well they are different queries. The top one will select all rows from t1 returning 0 for the count if there is no matching id in table t2.
The second query will only return rows where t1 and t2 both have a row with the same id.
The first query will likely suffer from performance issues on large data sets. The second query will potentially have a Cartesian issue. I would go with a join or left join based on your intent to have records from table 1 if table 2 has no related records and then add a group by statement to control the Cartesian.
Related
There is a table where strings are stored by type.
How to write 1 query so that it counts and outputs the number of results of each type.
Tried to do this:
SELECT COUNT(t1.id), COUNT(t2.id)
FROM table t1
LEFT JOIN table t2
ON t2.type='n1'
WHERE t1.type='b2';
But nothing works.
There can be many types, how can this be done?
I think this is what you want:
SELECT t1.type, COUNT(DISTINCT t1.id), COUNT(DISTINCT t2.id)
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t1.type = t2.type
GROUP BY t1.type
Note that this won't show any types that are only in table2. For that you'd need a FULL OUTER JOIN, which MySQL doesn't have. See How can I do a FULL OUTER JOIN in MySQL? for how to emulate it.
I try to explain a very high level
I have two complex SELECT queries(for the sake of example I reduce the queries to the following):
SELECT id, t3_id FROM t1;
SELECT t3_id, MAX(added) as last FROM t2 GROUP BY t3_id;
query 1 returns 16k rows and query 2 returns 15k
each queries individually takes less than 1 second to compute
However what I need is to sort the results using column added of query 2, when I try to use LEFT join
SELECT
t1.id, t1.t3_Id
FROM
t1
LEFT JOIN
(SELECT t3_id, MAX(added) as last FROM t2 GROUP BY t3_id) AS t_t2
ON t_t2.t3_id = t1.t3_id
GROUP BY t1.t3_id
ORDER BY t_t2.last
However, the execution time goes up to over a 1 minute.
I like to understand the reason
what is the cause of such a huge explosion?
NOTE:
ALL the used columns on every table have been indexed
e.g. :
table t1 has index on id,t3_Id
table t2 has index on t3_id and added
EDIT1
after #Tim Biegeleisen suggestion, I change the query to the following now the query is executing in about 16 seconds. If I remove the ORDER BY it query gets executed in less than 1 seconds. The problem is that ORDER BY the sole reason for this.
SELECT
t1.id, t1.t3_Id
FROM
t1
LEFT JOIN
t2 ON t2.t3_id = t1.t3_id
GROUP BY t1.t3_id
ORDER BY MAX(t2.added)
Even though table t2 has an index on column t3_id, when you join t1 you are actually joining to a derived table, which either can't use the index, or can't use it completely effectively. Since t1 has 16K rows and you are doing a LEFT JOIN, this means the database engine will need to scan the entire derived table for each record in t1.
You should use MySQL's EXPLAIN to see what the exact execution strategy is, but my suspicion is that the derived table is what is slowing you down.
The correct query should be:
SELECT
t1.id,
t1.t3_Id,
MAX(t2.added) as last
FROM t1
LEFT JOIN t2 on t1.t3_Id = t2.t3_Id
GROUP BY t2.t3_id
ORDER BY last;
This is happen because a temp table is generating on each record.
I think you could try to order everything after the records are available. Maybe:
select * from (
select * from
(select t3_id,max(t1_id) from t1 group by t3_id) as t1
left join (select t3_id,max(added) as last from t2 group by t3_id) as t2
on t1.t3_id = t2.t3_id ) as xx
order by last
I'm trying to run a query that retrives all records in a table that exists in a subquery.
However, it is returning all records insteal of just the ones that I am expecting.
Here is the query:
SELECT DISTINCT x FROM T1 WHERE EXISTS
(SELECT * FROM T1 NATURAL JOIN T2 WHERE T2.y >= 3.0);
I've tried testing the subquery and it returns the correct number of records that meet my constraint.
But when I run the entire query it returns records that should not exists in the subquery.
Why is EXISTS evaluating true for all the records in T1?
You need a correlated subquery, not a join in the subquery. It is unclear what the right correlation clause is, but something like this:
SELECT DISTINCT x
FROM T1
WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.COL = T1.COL AND T2.y >= 3.0);
Your query has a regular subquery. Whenever it returns at least one row, then the exists is true. So, there must be at least one matching row. This version "logically" runs the subquery for each row in the outer T1.
Q: Why is EXISTS evaluating true for all the records in T1?
A: Because the subquery returns a row, entirely independent of anything in the outer query.
The EXISTS predicate is simply checking whether the subquery is returning a row or not, and returning a boolean TRUE or FALSE.
You'd get the same result with:
SELECT DISTINCT x FROM T1 WHERE EXISTS (SELECT 1)
(The only difference would be if that subquery didn't return at least one row, then you'd get no rows returned in the outer query.)
There's no correlation between the rows returned by the subquery and the rows in the outer query.
I expect that there's another question you want to ask. And the answer to that really depends on what result set you are wanting to return.
If you are wanting to return rows from T1 that have some "matching" row in T2, you could use either a NOT EXISTS (correlated subquery)
Or, you could also use a join operation to return an equivalent result, for example:
SELECT DISTINCT T1.x
FROM T1
NATURAL
JOIN T2
WHERE T2.y >= 3.0
It isn't working because there is no correlation between the outer query and the subquery being used. Below there is a correlation in the form of and T1.id = T2.id
SELECT DISTINCT x
FROM T1
WHERE EXISTS ( SELECT 1 FROM T2 WHERE T2.y >= 3.0 and T1.id = T2.id)
;
But, without knowing the data I'd hope you do NOT need to use "distinct" in that query, and this would produce the same result:
SELECT x
FROM T1
WHERE EXISTS ( SELECT 1 FROM T2 WHERE T2.y >= 3.0 and T1.id = T2.id)
;
An alternative, which probably would require distinct, is a variation ofh the second half of your second query
SELECT DISTINCT x FROM T1 NATURAL JOIN T2 WHERE T2.y >= 3.0
You can use an INNER JOIN to get where you're trying to go:
SELECT DISTINCT T1.X
FROM T1
INNER JOIN T2
ON T2.COL = T1.COL
WHERE T2.Y > 3.0
Share and enjoy.
I have a slow-running update statement, and I was curious if moving the where condition to the join clause would improve performance. Here's the query:
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID set T1.X=T2.X where T1.TYPE='something';
Now... for a very big table (millions of records), would it be faster to do this?
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID and T1.TYPE='something' set T1.X=T2.X;
The query is simple enough that both approaches should be optimized identically.
Both approaches might also be sub-optimal because the inner query isn't correlated to the outer query. Your query is creating an implicit temporary table containing all possible rows for derived table T2 -- exactly the same result as if you just ran the query select ID, GROUP_CONCAT(x) as X from T3 group by ID by itself -- and then the server is discarding the ones that can't be joined to T1 and using the rest to do the update.
This is more than likely not the optimum path.
Unless t1.TYPE = 'something' involves a large percentage of the rows in T1, it should be more efficient to do this:
UPDATE t1
SET t1.x = (SELECT GROUP_CONCAT(x) FROM T3 WHERE T3.id = T1.id GROUP BY T3.id)
WHERE t1.TYPE = 'something';
The inner subquery is correlated to the outer subquery, and only executed for the rows in T1 that are matched by the WHERE clause.
Let's say I have a mastertable (table1) with a detailtable (table2). There can be multiple detail records for each masterrecord. Now I want a query that counts all detailrecords for each masterrecord :
SELECT t1.id, count(t2.*)
FROM table1 as t1
LEFT JOIN table2 AS t2 ON t2.id=t1.id
GROUP BY t1.id
This gives me exactly the same number of records as table1 has.
But when I add a WHERE statement to only count the records that have a checkfield that's higher than 0, I don't get all records in table1 anymore! The ones with no matching detailrecords are now left out completely. Why is this happening?
SELECT t1.id, count(t2.*)
FROM table1 as t1
LEFT JOIN table2 AS t2 ON t2.id=t1.id
WHERE t2.checkfield != 0
GROUP BY t1.id
(Maybe something else is wrong in my real query, since I tried to simplify it for this example, but I think I got it right)
The WHERE clause restricts the joined results which are being aggregated over, so while you're trying do an outer join, only those rows with t2.checkfield != 0 survive, but that excludes all the unmatched rows!
On the other hand, when you change WHERE to AND, you now have tab1 LEFT OUTER JOIN tab2 ON(tab1.id = tab2.t1_id AND some_condition) -- but this is still an outer join, i.e. records on the left which have no match on the right will be included.