I'm trying to JOIN a Master Dataset, via a left join with 2 other Datasets, all of them have the same Key field. So nothing special there.
One of those secondary Datasets is the result of another Query and therefor might or might not exist. Obviously my JOIN statement fails when this table doesn't exist.
Below a really simplified version of the code, the JOIN is used to exclude rows from the table_a that exist in table b or c (if they exist).
SELECT a.id, a.name
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
LEFT JOIN table c c
ON a.id = c.id
WHERE b.id IS NULL
AND c.id IS NULL;
I am not sure that I understand your question well, but I think that you should better do:
SELECT a.id,a.name
FROM table_a a
WHERE a.id NOT IN
(SELECT id FROM table_b)
AND a.id NOT IN
(SELECT id FROM table_c)
Any query optimizer should have the exact same performance with this request, and I find it much more readable.
Related
I read a lot that using Dependent Query is bad and i should use JOINs, I have a code which JOINs a lot of tables together, like the following
SELECT a.a, b.b, c.c, d.d FROM
tablea a
JOIN tableb b ON a.id = b.id
JOIN tablec c ON a.id = c.id
JOIN tabled d ON a.id = d.id
WHERE a.id = 1
LIMIT 1
Right now i can use a Dependent Query and Select a value from another table with thea.a like this
SELECT a.a, b.b, c.c, d.d, (SELECT e FROM tablee WHERE id = a.a) AS e FROM
tablea a
JOIN tableb b ON a.id = b.id
JOIN tablec c ON a.id = c.id
JOIN tabled d ON a.id = d.id
WHERE a.id = 1
LIMIT 1
In this case, Do i use the easiest to read and edit for me Dependent Query, Or do i just stick with JOINs?
I read a lot that using Dependent Query is bad and i should use JOINs,
This is misguidance. Certainly, JOINs are very powerful and should not be discouraged. The SQL optimizer understands JOINs and they are quite powerful and easy to read.
The does not mean the correlated subqueries are wrong. For instance, yours is fine -- even from a performance perspective -- if you have an index on tablee(id, e). In fact, that might be the most optimal way of writing the query. Writing the query with a join would probably generate a similar execution plan, though.
If you did use a join, you would need a left join for the same semantics.
There are some situations where correlated subqueries are better in terms of performance. The exact specifics might vary by database, so there are no rules that are "general" across all databases.
My one very strong piece of advice is to qualify all column names when you reference more than on table in a query. So, you should be writing:
SELECT a.a, b.b, c.c, d.d,
(SELECT e.e FROM tablee e WHERE e.id = a.a) AS e
FROM tablea a . . .
This is not a duplicate of this Q&A because the question and answers here concerns the table mentioned in the FROM clause. Which mine doesn't.
Assuming the table in the FROM clause is always the same and I'm never going to change it. Does it matter which order I add my joins?
I am using an in-house built query builder. (Yes I know there are things out there already but that's out of scope for the question).
I want to be able to set some of the joins at the beginning of my script and some later based on conditionals, the query builder adds them to the query from the top down. Will the SQL engine optimize the order of the joins anyway, regardless of their order in the query?
example:
SELECT a.col1, d.col2, c.col1, b.col3
FROM table1 A
INNER JOIN table2 B
ON B.a_id = A.id
LEFT JOIN table3 C
ON C.id = A.c_id
LEFT JOIN table4 D
ON D.id = C.d_id;
SELECT a.col1, d.col2, c.col1, b.col3
FROM table1 A
LEFT JOIN table4 D
ON D.id = C.d_id
INNER JOIN table2 B
ON B.a_id = A.id
LEFT JOIN table3 C
ON C.id = A.c_id;
Here you can see that I have declared the join for table4 D before the join for it's dependent table is declared in the script (C). Does this matter?
Simple answer: No you can't reference a table object/alias before the object has been declared.
mySQL will throw an error on 2nd query. `Unknown column 'C.d_id' in 'on clause'
So yes... the compiler doesn't look ahead to see if it's been referenced later.. It only knows the order first then it tries to figure out which method of joining is best.
SQLFiddle
*To address question of: Will the SQL engine optimize the order of the joins anyway, regardless of their order in the query? *
Yes it would optimize the order; but the "FROM" order can't include a reference to a table before it's been declared or the query will not compile. (See error above and link for example)
Imagine the following scenario:
There are 3 tables A, B and C.
Table A has no knowledge of either table B and table C.
Table B has a foreign key to table A.
Table C has foreign key to table B.
In table B as well as in table C there can be multiple items sharing the same foreign key value.
As you can see, the items from C are indirectly referenced to A through B.
What I want is to get all entries from A that are referenced in C but without any information from B or C in my result tables and without duplicates.
Is this even possible?
I have tried this like so but have no idea if it is correct:
select tableA.*
from tableA,
(select distinct tableB.AId as Aid
from tableB left join tableC on tableC.BId = tableB.id
group by tableB.id)
as temp
where tableA.id = temp.Aid
I am not sure if I understand it correctly, but you can try this one:
SELECT DISTINCT `A`.`id`, `A`.`value1`, `A`.`value2` FROM `A`
INNER JOIN `B` ON `B`.`id-a` = `A`.`id`
INNER JOIN `C` ON `C`.`id-b` = `B`.`id`
It returns all values from table A if there is a key on Table C which is linked to Table B with corresponding foreign key on table A
An alternative approach to Masoud's good response would be to use an exists though a correlated subquery.
The below subquery joins B to C in a correlated fashion (notice the B.IDA to A.ID and A is outside the subquery).
If we assume good database design, then A will not have duplicate records, thus we can omit a distinct here since we are not joining A to the other tables. Instead we are simply checking for the existence of an "A" record in the B table which must have a record in the C table due to the inner join. This has two advantages for performance
It doesn't have to join all the records together which would then
necessitate a distinct; thus you don't have the performance hit on
the distinct.
It can early escape. once a key value of A is found in the
subquery (B to C join) , it can stop looking and thus don't have to join all of B to all of A.
We select "1" in the subquery as we don't care what we select as the value will not be used anywhere. We're just using the coloration of A to (B JOIN C) to determine what in A to display.
SELECT A.*
FROM A
WHERE EXISTS( SELECT 1
FROM C
INNER JOIN B
on C.IDB = B.ID)
AND B.IDA = A.ID)
Taking what you tried and reviewing it:
select tableA.*
from tableA,
(select distinct tableB.AId as Aid
from tableB left join tableC on tableC.BId = tableB.id
group by tableB.id)
as temp
where tableA.id = temp.Aid
Starting with the "FROM"
You have tableA, (subquery) temp. This is a CROSS JOIN meaning all records from A will be joined to ALL records of (B JOIN C) so if you have 1000 records in A and 1000 records in the temp result then you'd be telling the database engine to generate 1000*1000 records in your result set; which then gets filtered to only include records matching in temp and A. The engine may be smart enough to avoid the cross join and optimize the query, but I find it confusing to maintain. So I would rewrite as
SELECT tableA.*
FROM tableA
INNER JOIN (SELECT distinct tableB.AId as Aid
FROM tableB left join tableC on tableC.BId = tableB.id
GROUP BY tableB.id) as temp
ON tableA.id = temp.Aid
Looking at the subquery (temp)
We don't need a group by as we are not aggregating. The distinct does bring us down to 1 record but at a cost to execution time.
So I would re-write as this:
SELECT tableA.*
FROM tableA
INNER JOIN (SELECT distinct tableB.AId as Aid
FROM tableB
LEFT JOIN tableC
on tableC.BId = tableB.id) as temp
ON tableA.id = temp.Aid
Then looking at the whole, if we change the outer query join to temp and make it an exists... using coloration we don't have the performance hit of the join, nor the distinct. and I'd switch the left join to an inner as we only want records in C and B so we'd have null in B if we left it as a "LEFT JOIN" which serve no purpose for us.
This gets me to the answer I initially provided.
SELECT tableA.*
FROM tableA
WHERE EXISTS (SELECT 1
FROM tableB
INNER JOIN tableC
on tableC.BId = tableB.id
AND tableB.AID = A.ID) as temp
I'm trying to improve a (not so much) simple query:
I need to retrieve every row from Table A.
Then join Table A with Table B so I get all the data I need.
At the same time, I need to add an extra column with the count() from Table C.
Something like:
SELECT a.*,
(SELECT Count(*)
FROM table_c c
WHERE c.a_id = a.id) AS counter,
b.*
FROM table_a a
LEFT JOIN table_b b
ON b.a_id = a.id
This works, ok, but in reality, I'm just making 2 queries and I need to improve this so it only do one (if, its even possible).
Anyone knows how can I achive that?
The simplest approach is likely to just move the correlated sub-query into a sub-query.
NOTE: Many optimisers deal with correlated sub-queries extremely effectively. Your example query could be perfectly reasonable.
SELECT
a.*,
b.*,
c.row_count
FROM
table_a a
LEFT JOIN
table_b b
ON b.a_id = a.id
LEFT JOIN
(
SELECT
a_id,
Count(*) row_count
FROM
table_c
GROUP BY
a_id
)
c
ON c.a_id = a.id
Another Note: SQL is an expression, it is not executed directly, it is translated into a plan using nest loops, hash joins, etc. Do not assume that having two queries is a bad thing. In this case my example may significantly minimise the number of reads compared to a single query and then use of GROUP BY and COUNT(DISTINCT).
Try this:
SELECT
tmp.*,
SUM(IF(c.a_id IS NULL,0,1)) as counter,
FROM (
SELECT
a.id as aid,
b.id as bid,
a.*,
b.*
FROM
table_a a
LEFT JOIN table_b b
ON b.a_id = a.id
) as tmp
LEFT JOIN table_c c
ON c.a_id = tmp.id
GROUP BY
tmp.aid,
tmp.bid
SELECT * FROM A
JOIN B
ON B.ID = A.ID
AND B.Time = (SELECT max(Time)
FROM B B2
WHERE B2.ID = B.ID)
I am trying to join these two tables in MYSQL. Don't pay attention to that if the ID is unique then I wouldn't be trying to do this. I condensed the real solution to paint a simplified picture. I am trying to grab and join the table B on the max date for a certain record. This procedure is getting run by an SSIS package and is saying B2.ID is an unknown column. I do things like this frequently in MSSQL and am new to MYSQL. Anyone have any pointers or ideas?
I do this type of query differently, with an exclusion join instead of a subquery. You want to find the rows of B which have the max Time for a given ID; in other words, where no other row has a greater Time and the same ID.
SELECT A.*, B.*
FROM A JOIN B ON B.ID = A.ID
LEFT OUTER JOIN B AS B2 ON B.ID = B2.ID AND B.Time < B2.Time
WHERE B2.ID IS NULL
You can also use a derived table, which should perform better than using a correlated subquery.
SELECT A.*, B.*
FROM A JOIN B ON B.ID = A.ID
JOIN (SELECT ID, MAX(Time) AS Time FROM B GROUP BY ID) AS B2
ON (B.ID, B.Time) = (B2.ID, B2.Time)
P.S.: I've added the greatest-n-per-group tag. This type of SQL question comes up every week on Stack Overflow, so you can follow that tag to see dozens of similar questions and their answers.