Why is my SELECT with a subquery and JOINs so slow? - mysql

This query takes 10 seconds to complete. But when I manually perform the subquery and change the t1.id restriction to that list, it's done in 0.00 seconds. What can I do to let MySQL execute the query quicker?
SELECT t1.col1, t2.col2, t3.col3
FROM t1, t2, t3
WHERE t1.t2id = t2.id AND t1.t3id = t3.id
AND t1.id IN ( SELECT id FROM t4 WHERE blah = 123 )
Also, why is this happening? I suppose MySQL joins all three tables in some way before filtering on t1.id.
t1, t2 and t3 contain 3000, 15 and 80 rows, respectively. The subquery returns 2-10 rows.

Try using an INNER JOIN rather than an IN subquery.
Written this way, the query should perform better.
SELECT t1.col1, t2.col2, t3.col3
FROM ((t1 INNER JOIN t2 ON t1.t2id = t2.id) INNER JOIN t3 ON t1.t3id = t3.id) INNER JOIN t4 ON t1.id = t4.id
WHERE t4.blah = 123
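One caveat with swapping IN for a JOIN: the two only return the same rows if t4.id is unique among the rows where blah = 123. Here is a minimal sketch of that, using Python's sqlite3 module as a stand-in for MySQL (the table contents are invented for illustration, and t2/t3 are left out to keep it short):

```python
import sqlite3

# SQLite stands in for MySQL; the data below is made up for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE t4 (id INTEGER, blah INTEGER);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- id 2 appears twice in t4 on purpose
    INSERT INTO t4 VALUES (2, 123), (2, 123), (3, 123), (1, 999);
""")

# IN only tests membership, so each t1 row comes back at most once.
in_rows = conn.execute(
    "SELECT t1.col1 FROM t1 "
    "WHERE t1.id IN (SELECT id FROM t4 WHERE blah = 123) "
    "ORDER BY t1.col1"
).fetchall()

# A plain join repeats a t1 row once per matching t4 row.
join_rows = conn.execute(
    "SELECT t1.col1 FROM t1 "
    "INNER JOIN t4 ON t1.id = t4.id "
    "WHERE t4.blah = 123 "
    "ORDER BY t1.col1"
).fetchall()

# DISTINCT removes the duplicates the join introduced.
distinct_join_rows = conn.execute(
    "SELECT DISTINCT t1.col1 FROM t1 "
    "INNER JOIN t4 ON t1.id = t4.id "
    "WHERE t4.blah = 123 "
    "ORDER BY t1.col1"
).fetchall()

print(in_rows, join_rows, distinct_join_rows)
```

If t4.id is a primary key or otherwise unique, the plain join is enough and no DISTINCT is needed.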

Rewrite the query with the subquery as a derived table in the FROM clause instead of inside IN:
SELECT t1.col1, t2.col2, t3.col3
FROM t1, t2, t3, (SELECT id FROM t4 WHERE blah = 123) AS t4
WHERE t1.t2id = t2.id AND t1.t3id = t3.id
AND t1.id=t4.id
Be sure you have indexes on the fields used in the WHERE clauses.

If you run EXPLAIN over your statement, you may see that MySQL has created a temporary table on disk: this often happens if the data within an inline select (your IN term) is sufficiently large.
In a nutshell:
a) use EXPLAIN to see what's going on in the database (and to see how this behaviour changes with increased data)
b) avoid inline subqueries if you can
c) remember that MySQL has historically had only a nested-loop join algorithm at its disposal (other DBs use hash-join and merge-join algorithms too; MySQL itself gained a hash join in 8.0.18), so a query can look fine on smaller data sets and then drop off suddenly when you reach a tipping point.
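As a rough illustration of point a), here is a sketch that asks the database for its plan before running anything. It uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in; MySQL's EXPLAIN output looks different, and the explain helper is just a name made up for this example:

```python
import sqlite3

# SQLite stands in for MySQL; only the schema matters for looking at a plan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, t2id INTEGER, t3id INTEGER, col1 TEXT);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, col2 TEXT);
    CREATE TABLE t3 (id INTEGER PRIMARY KEY, col3 TEXT);
    CREATE TABLE t4 (id INTEGER, blah INTEGER);
""")

QUERY = """
    SELECT t1.col1, t2.col2, t3.col3
    FROM t1, t2, t3
    WHERE t1.t2id = t2.id AND t1.t3id = t3.id
      AND t1.id IN (SELECT id FROM t4 WHERE blah = 123)
"""


def explain(sql):
    """Return the human-readable 'detail' text of each plan step."""
    # In SQLite the detail text is the fourth column of each plan row.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan = explain(QUERY)
for step in plan:
    print(step)
```

Re-running the same check as the tables grow is what shows whether the plan changes with data volume.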

Related

How to join from one joined table to another table in snowflake

I'm trying to create one snowflake view using multiple tables.
I understand that FROM...JOIN statements can combine multiple tables.
When I would like to join from one table that already has a join from another table, what is the best way to write a script?
In this case, Table 4 and Table 5 are joined from Table 3, and Table 3 is joined from Table 1.
Your question is not entirely clear, but normally all the tables are joined as follows:
select *
from table1 t1
join table2 t2 on t1.id = t2.table1_ID
join table3 t3 on t1.id = t3.table1_ID
join table4 t4 on t3.id = t4.table3_ID
join table5 t5 on t3.id = t5.table3_ID
It is not clear from your question what data you need; the right shape depends on which information you need from which combination of tables. With CTEs it could look like this:
-- column1 and column2 are placeholders; pick the columns you actually need
with cte1 as (
select t1.column1, t3.id as table3_id
from table1 t1
join table2 t2 on t1.id = t2.table1_ID
join table3 t3 on t1.id = t3.table1_ID
),
cte2 as (
select t3.id as table3_id, t4.column2
from table3 t3
join table4 t4 on t3.id = t4.table3_ID
join table5 t5 on t3.id = t5.table3_ID)
select t123.column1, t345.column2
from cte1 t123
join cte2 t345 on t123.table3_id = t345.table3_id
You should be able to join exactly in the relation hierarchy you have listed such as
select
t1.*,
t2.whatever,
t3.whatever3,
t4.whatever4,
t5.whatever5
from
table1 t1
join table2 t2
on t1.t2id = t2.id
join table3 t3
on t1.t3id = t3.id
join table4 t4
on t3.t4id = t4.id
join table5 t5
on t3.t5id = t5.id
So, what is the confusion?
Meysam's answer is valid, but I see there are more questions at hand.
[Edit] This answer is mostly general, but also written from the perspective of the snowflake-cloud-data-platform tag.
Normally you have a single SELECT block, with all the tables in the FROM/JOIN section and whatever WHERE conditions you like. In modern style, conditions that belong to a join rather than acting as filters go in the ON clause, hence Meysam's answer.
SELECT
t1.thing,
t2.other_thing,
t4.extra_detail,
t5.one_last_thing
FROM table1 t1
JOIN table2 t2
ON t1.id = t2.table1_ID
JOIN table3 t3
ON t1.id = t3.table1_ID
JOIN table4 t4
ON t3.id = t4.table3_ID
JOIN table5 t5
ON t3.id = t5.table3_ID
Now, you mention a CTE. That could be done on the table3 chain, if there was merit, like so:
WITH table_3_sub_chain_cte_of_merit AS (
SELECT
t3.table1_ID,
t4.extra_detail,
t5.one_last_thing
FROM table3 t3
JOIN table4 t4
ON t3.id = t4.table3_ID
JOIN table5 t5
ON t3.id = t5.table3_ID
)
SELECT
t1.thing,
t2.other_thing,
cte3.extra_detail,
cte3.one_last_thing
FROM table1 t1
JOIN table2 t2
ON t1.id = t2.table1_ID
JOIN table_3_sub_chain_cte_of_merit cte3
ON t1.id = cte3.table1_ID
Or the CTE can be moved into a sub-select, if that had merit, like so:
SELECT
t1.thing,
t2.other_thing,
cte3.extra_detail,
cte3.one_last_thing
FROM table1 t1
JOIN table2 t2
ON t1.id = t2.table1_ID
JOIN (
SELECT
t3.table1_ID,
t4.extra_detail,
t5.one_last_thing
FROM table3 t3
JOIN table4 t4
ON t3.id = t4.table3_ID
JOIN table5 t5
ON t3.id = t5.table3_ID
)cte3
ON t1.id = cte3.table1_ID
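To check that the CTE form and the sub-select form really do return the same rows, here is a small sketch. It runs on SQLite through Python's sqlite3 module rather than on Snowflake, and the sample rows, the sub_chain CTE name, and the SUB_CHAIN / OUTER string names are all invented for the example:

```python
import sqlite3

# SQLite stands in for Snowflake; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, thing TEXT);
    CREATE TABLE table2 (table1_ID INTEGER, other_thing TEXT);
    CREATE TABLE table3 (id INTEGER PRIMARY KEY, table1_ID INTEGER);
    CREATE TABLE table4 (table3_ID INTEGER, extra_detail TEXT);
    CREATE TABLE table5 (table3_ID INTEGER, one_last_thing TEXT);
    INSERT INTO table1 VALUES (1, 'thing-1'), (2, 'thing-2');
    INSERT INTO table2 VALUES (1, 'other-1'), (2, 'other-2');
    INSERT INTO table3 VALUES (10, 1), (20, 2);
    INSERT INTO table4 VALUES (10, 'extra-10'), (20, 'extra-20');
    INSERT INTO table5 VALUES (10, 'last-10');
""")

# The table3 -> table4/table5 chain, written once and reused in both forms.
SUB_CHAIN = """
    SELECT t3.table1_ID AS table1_ID,
           t4.extra_detail AS extra_detail,
           t5.one_last_thing AS one_last_thing
    FROM table3 t3
    JOIN table4 t4 ON t3.id = t4.table3_ID
    JOIN table5 t5 ON t3.id = t5.table3_ID
"""

# The outer query; {source} is either the CTE name or the inline sub-select.
OUTER = """
    SELECT t1.thing, t2.other_thing, cte3.extra_detail, cte3.one_last_thing
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.table1_ID
    JOIN {source} cte3 ON t1.id = cte3.table1_ID
    ORDER BY t1.id
"""

cte_sql = "WITH sub_chain AS (" + SUB_CHAIN + ")" + OUTER.format(source="sub_chain")
subselect_sql = OUTER.format(source="(" + SUB_CHAIN + ")")

cte_rows = conn.execute(cte_sql).fetchall()
subselect_rows = conn.execute(subselect_sql).fetchall()
print(cte_rows)
print(subselect_rows)
```

Same rows either way, so the choice between the two comes down to the points about merit.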
Now the interesting part: merit. Why would we be doing these things?
The first version should meet your needs just fine if you just want to get some values off each table and move on. But if you are doing complex filtering on table3 and below, or some expensive aggregation on table3 and below, and those results are matched many times to table1, then doing the work in a CTE or sub-query makes sense.
Now, why might you use the CTE over the subquery? The simple answer is that for the code given they are the same. But if you joined table1 and table3 multiple times, because you are calculating daily costs, weekly costs, and monthly costs, then building the costs once (in a CTE) and joining those results can save a lot of time. At the same time, CTEs can sometimes slow things down, as what might seem the "expensive code" is mostly free once the other work is taken into account. I have seen code run faster on Snowflake doing a large aggregation three times in sub-selects, as that removes the synchronization cost between the data paths, and the remote data read was the same bottleneck under both.
On the other hand, sometimes CTEs make the code cleaner to read, as you get to name the expression something meaningful and then use an alias, so the SQL is more readable and the intent is captured. And the Snowflake optimizer rewrites the SQL anyway, so the two forms can be, and often are, executed the same way. So helping the humans is of more value.
On other databases the optimizer can be helped by the order of the joins and by nesting them (or so I have been told), but I have not read about or witnessed that on Snowflake, and I have spent days rewriting SQL only to get the "same execution plan" in the other form.
But where CTEs shine can be in really large queries (hundreds of lines of SQL), in pushing filters to where you want them, to avoid really large data reads and full table processing only to have the result pruned. This sort of thing is spottable in the query profile: 10 billion rows going between blocks for many steps, only to hit a filter later in the pipeline with 5 thousand coming out.

Does moving a where clause to the join clause improve performance?

I have a slow-running update statement, and I was curious if moving the where condition to the join clause would improve performance. Here's the query:
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID set T1.X=T2.X where T1.TYPE='something';
Now... for a very big table (millions of records), would it be faster to do this?
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID and T1.TYPE='something' set T1.X=T2.X;
The query is simple enough that both approaches should be optimized identically.
Both approaches might also be sub-optimal because the inner query isn't correlated to the outer query. Your query is creating an implicit temporary table containing all possible rows for derived table T2 -- exactly the same result as if you just ran the query select ID, GROUP_CONCAT(x) as X from T3 group by ID by itself -- and then the server is discarding the ones that can't be joined to T1 and using the rest to do the update.
This is more than likely not the optimum path.
Unless t1.TYPE = 'something' involves a large percentage of the rows in T1, it should be more efficient to do this:
UPDATE t1
SET t1.x = (SELECT GROUP_CONCAT(x) FROM T3 WHERE T3.id = T1.id GROUP BY T3.id)
WHERE t1.TYPE = 'something';
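The inner subquery is correlated to the outer query, and is only executed for the rows in T1 that are matched by the WHERE clause. Note that, unlike the inner-join versions, this sets X to NULL for matching T1 rows that have no rows in T3.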
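Here is a small sketch of the correlated form, run on SQLite through Python's sqlite3 module as a stand-in for MySQL (both have GROUP_CONCAT; the rows are invented for illustration). It also shows what happens to a T1 row of the right TYPE that has nothing in T3:

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (ID INTEGER PRIMARY KEY, TYPE TEXT, X TEXT);
    CREATE TABLE T3 (ID INTEGER, x TEXT);
    INSERT INTO T1 VALUES (1, 'something', 'old'),
                          (2, 'other',     'old'),
                          (3, 'something', 'old');
    -- ID 3 has no rows in T3 on purpose
    INSERT INTO T3 VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# The subquery runs once per T1 row that passes the outer WHERE.
conn.execute("""
    UPDATE T1
    SET X = (SELECT GROUP_CONCAT(T3.x)
             FROM T3
             WHERE T3.ID = T1.ID
             GROUP BY T3.ID)
    WHERE TYPE = 'something'
""")

rows = dict(conn.execute("SELECT ID, X FROM T1"))
print(rows)
```

Row 2 keeps its old value because its TYPE does not match, while row 3 ends up NULL (None in Python) because the subquery found nothing for it. If that NULL is not wanted, an extra EXISTS check in the outer WHERE is one way to skip such rows.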

Difference between the AND statement in an Inner Join or in a WHERE clause

Hello guys, I have a specific question about the AND clause in SQL.
The two following SQL statements provide the same output:
SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.id = t2.id AND t2.id = 0
SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.id = t2.id WHERE t2.id = 0
Notice the difference at the end of the query. In the first one, I use the AND clause (without using the WHERE clause before). In the second one, I use a WHERE to specify my id.
Is the first syntax correct?
If yes, is the first one better in terms of performance (not using WHERE clause for filtering after)?
Should I expect different outputs with different queries?
Thanks for your help.
Yes, no, and no.
To be specific:
Yes, the syntax is correct. Conceptually, the first query creates an inner join between t1 and t2 with the join condition t1.id = t2.id AND t2.id = 0, while the second creates an inner join on t1.id = t2.id and then filters the result using the condition t2.id = 0.
However, no SQL engine I know of would actually execute either query like that. Rather, in both cases, the engine will optimize both of them to something like t1.id = 0 AND t2.id = 0 and then do two single-row lookups.
No, pretty much any reasonable SQL engine should treat these two queries as effectively identical.
No, see above.
By the way, the following ways to write the same query are also valid (the first, an INNER JOIN without ON, is accepted by MySQL but is not standard SQL):
SELECT * FROM Table1 t1 INNER JOIN Table2 t2 WHERE t1.id = t2.id AND t2.id = 0
SELECT * FROM Table1 t1, Table2 t2 WHERE t1.id = t2.id AND t2.id = 0
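If you want to convince yourself that these spellings return the same rows, here is a quick sketch using Python's sqlite3 module (SQLite rather than whatever engine you are on, with made-up rows and a variants dict invented for the example):

```python
import sqlite3

# SQLite is used here only because it ships with Python; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, name TEXT);
    CREATE TABLE Table2 (id INTEGER, label TEXT);
    INSERT INTO Table1 VALUES (0, 'zero'), (1, 'one');
    INSERT INTO Table2 VALUES (0, 'nil'), (1, 'uno');
""")

variants = {
    # extra condition inside the ON clause
    "and_in_on": "SELECT * FROM Table1 t1 INNER JOIN Table2 t2 "
                 "ON t1.id = t2.id AND t2.id = 0",
    # extra condition in a WHERE clause
    "where": "SELECT * FROM Table1 t1 INNER JOIN Table2 t2 "
             "ON t1.id = t2.id WHERE t2.id = 0",
    # old-style comma join, everything in WHERE
    "comma_join": "SELECT * FROM Table1 t1, Table2 t2 "
                  "WHERE t1.id = t2.id AND t2.id = 0",
}

results = {name: conn.execute(sql).fetchall() for name, sql in variants.items()}
for name, rows in results.items():
    print(name, rows)
```

This only holds for inner joins; with a LEFT JOIN, moving a condition between ON and WHERE can change the result.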

Shorten a join query

I have a query with 3 joins:
SELECT t1.email, t2.firstname, t2.lastname, t4.value
FROM t1
left join t2 on t1.email = t2.email
Inner join t3 on t2.entity_id = t3.order_id
Inner join t4 on t3.product_id = t4.entity_id
WHERE t4.attribute_id = 126
I think my server just can't handle it :) The query times out and an error occurs!
Thanks a lot
Table structure:
T1:
email (which is the same as in t2)
T2:
email firstname lastname orderid (which is called entity id in t3)
T3:
entityid product id (which is called entity id in t4)
T4:
entityid attributeid value
Unless t2 links straight to t4 there is no way.
Also, do you need a left join between t1 and t2?
As @Sachin already stated, you can't "shorten" this query unless t2 links straight to t4 without requiring a comparison with t3. However, in order to speed up your query, you should have indexes on some or all of the columns referenced in your join conditions (i.e. t1.email, t2.email, t2.entity_id, etc.).
Having an index on each of these columns will give you much faster SELECT queries, but it will slow down your INSERT and UPDATE queries. So if you SELECT more often than you INSERT or UPDATE, then you should definitely be using indexes. If not, try to make indexes in wise places (tables that have INSERT or UPDATE statements run less often but still have a lot of rows, for instance).
For further clarification, see the following links:
More information on how indexes work
Syntax for creating indexes
Try your query this way:
SELECT t1.email, t2.firstname, t2.lastname, t4.value
FROM t4
INNER JOIN t3 ON t3.product_id = t4.entity_id
INNER JOIN t2 ON t2.entity_id = t3.order_id
INNER JOIN t1 ON t1.email = t2.email
WHERE t4.attribute_id = 126
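It's basically your query but "backwards". Written your original way, the DBMS may have to join t2 for ALL records in t1, then join t3 for ALL records found in t2, before it can even attempt to address your WHERE clause. (The LEFT JOIN is gone because the inner joins and the WHERE on t4 discard the unmatched t1 rows anyway.)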
My way, you're finding all the records in t4 where attribute_id = 126 first, THEN attempting to join other tables. It should be a lot quicker. You should then be able to speed things up even more by making sure the proper indexes exist on the tables involved. You can prepend the keyword EXPLAIN to your query to see how the DBMS attempts to seek data in your query.
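Here is a rough sketch of both points: the reordered query returning the same rows as the original, and an index showing up in the plan. It uses Python's sqlite3 module, so the plan text comes from SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, and the sample rows and the index name idx_t4_attribute are invented for the example:

```python
import sqlite3

# SQLite stands in for MySQL; schema follows the question, rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (email TEXT);
    CREATE TABLE t2 (email TEXT, firstname TEXT, lastname TEXT, entity_id INTEGER);
    CREATE TABLE t3 (order_id INTEGER, product_id INTEGER);
    CREATE TABLE t4 (entity_id INTEGER, attribute_id INTEGER, value TEXT);
    INSERT INTO t1 VALUES ('a@x.test'), ('b@x.test');
    INSERT INTO t2 VALUES ('a@x.test', 'Ann', 'Lee', 1);
    INSERT INTO t3 VALUES (1, 100);
    INSERT INTO t4 VALUES (100, 126, 'red'), (100, 5, 'ignored');
""")

original = conn.execute("""
    SELECT t1.email, t2.firstname, t2.lastname, t4.value
    FROM t1
    LEFT JOIN t2 ON t1.email = t2.email
    INNER JOIN t3 ON t2.entity_id = t3.order_id
    INNER JOIN t4 ON t3.product_id = t4.entity_id
    WHERE t4.attribute_id = 126
""").fetchall()

reordered = conn.execute("""
    SELECT t1.email, t2.firstname, t2.lastname, t4.value
    FROM t4
    INNER JOIN t3 ON t3.product_id = t4.entity_id
    INNER JOIN t2 ON t2.entity_id = t3.order_id
    INNER JOIN t1 ON t1.email = t2.email
    WHERE t4.attribute_id = 126
""").fetchall()

# Index the filter column, then look at the plan for the filter on its own.
conn.execute("CREATE INDEX idx_t4_attribute ON t4 (attribute_id)")
plan = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM t4 WHERE attribute_id = 126"
)]

print(original, reordered)
print(plan)
```

The same idea applies to the join columns (t2.entity_id, t3.order_id and so on); check the real plan on your own server, since the optimizer's choices depend on the engine and the data.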

which method is better to join mysql tables?

What is the difference between these two methods of selecting data from multiple tables? The first one does not use JOIN while the second does. Which one is the preferred method?
Method 1:
SELECT t1.a, t1.b, t2.c, t2.d, t3.e, t3.f
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id
AND t2.id = t3.id
AND t3.id = x
Method 2:
SELECT t1.a, t1.b, t2.c, t2.d, t3.e, t3.f
FROM `table1` t1
JOIN `table2` t2 ON t1.id = t2.id
JOIN `table3` t3 ON t1.id = t3.id
WHERE t1.id = x
For your simple case, they're equivalent. Even though the 'JOIN' keyword is not present in Method #1, it's still doing joins.
However, method #2 offers the flexibility of allowing extra conditions in the JOIN condition that can't be accomplished via WHERE clauses. Such as when you're doing aliased multi-joins against the same table.
select a.id, b.id, c.id
from sometable A
left join othertable as b on a.id=b.a_id and some_condition_in_othertable
left join othertable as c on a.id=c.a_id and other_condition_in_othertable
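Putting the two extra conditions in the WHERE clause would change the result: the WHERE filter runs after the outer joins, so rows of A with no matching row in b or c (where those columns are NULL) would be thrown away, which effectively turns the left joins into inner joins. In the join condition, those rows are kept.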
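To make the ON-versus-WHERE difference for left joins concrete, here is a small sketch on SQLite via Python's sqlite3 module. The kind column and its 'x' / 'y' values are stand-ins for some_condition_in_othertable and other_condition_in_othertable, and the rows are invented:

```python
import sqlite3

# SQLite stands in for MySQL; 'kind' is a hypothetical column for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sometable (id INTEGER PRIMARY KEY);
    CREATE TABLE othertable (a_id INTEGER, kind TEXT);
    INSERT INTO sometable VALUES (1), (2);
    -- id 1 has both kinds, id 2 only has kind 'x'
    INSERT INTO othertable VALUES (1, 'x'), (1, 'y'), (2, 'x');
""")

# Conditions in ON: every sometable row survives, missing matches become NULL.
on_rows = conn.execute("""
    SELECT a.id, b.kind, c.kind
    FROM sometable a
    LEFT JOIN othertable b ON a.id = b.a_id AND b.kind = 'x'
    LEFT JOIN othertable c ON a.id = c.a_id AND c.kind = 'y'
    ORDER BY a.id
""").fetchall()

# Conditions in WHERE: rows where b or c did not match are filtered out.
where_rows = conn.execute("""
    SELECT a.id, b.kind, c.kind
    FROM sometable a
    LEFT JOIN othertable b ON a.id = b.a_id
    LEFT JOIN othertable c ON a.id = c.a_id
    WHERE b.kind = 'x' AND c.kind = 'y'
    ORDER BY a.id
""").fetchall()

print(on_rows)
print(where_rows)
```

In this sketch the WHERE variant still returns the row where both aliases match, but it loses id 2, whose c side is NULL.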
The methods are apparently identical in performance; it's just new vs old syntax.
I don't think there is much of a difference. You could use the EXPLAIN statement to check if MySQL does anything differently. For this trivial example I doubt it matters.