How to optimize mysql on left join - mysql

I try to explain a very high level
I have two complex SELECT queries(for the sake of example I reduce the queries to the following):
SELECT id, t3_id FROM t1;
SELECT t3_id, MAX(added) as last FROM t2 GROUP BY t3_id;
query 1 returns 16k rows and query 2 returns 15k
each queries individually takes less than 1 second to compute
However what I need is to sort the results using column added of query 2, when I try to use LEFT join
SELECT
t1.id, t1.t3_Id
FROM
t1
LEFT JOIN
(SELECT t3_id, MAX(added) as last FROM t2 GROUP BY t3_id) AS t_t2
ON t_t2.t3_id = t1.t3_id
GROUP BY t1.t3_id
ORDER BY t_t2.last
However, the execution time goes up to over a 1 minute.
I like to understand the reason
what is the cause of such a huge explosion?
NOTE:
ALL the used columns on every table have been indexed
e.g. :
table t1 has index on id,t3_Id
table t2 has index on t3_id and added
EDIT1
after #Tim Biegeleisen suggestion, I change the query to the following now the query is executing in about 16 seconds. If I remove the ORDER BY it query gets executed in less than 1 seconds. The problem is that ORDER BY the sole reason for this.
SELECT
t1.id, t1.t3_Id
FROM
t1
LEFT JOIN
t2 ON t2.t3_id = t1.t3_id
GROUP BY t1.t3_id
ORDER BY MAX(t2.added)

Even though table t2 has an index on column t3_id, when you join t1 you are actually joining to a derived table, which either can't use the index, or can't use it completely effectively. Since t1 has 16K rows and you are doing a LEFT JOIN, this means the database engine will need to scan the entire derived table for each record in t1.
You should use MySQL's EXPLAIN to see what the exact execution strategy is, but my suspicion is that the derived table is what is slowing you down.

The correct query should be:
SELECT
t1.id,
t1.t3_Id,
MAX(t2.added) as last
FROM t1
LEFT JOIN t2 on t1.t3_Id = t2.t3_Id
GROUP BY t2.t3_id
ORDER BY last;
This is happen because a temp table is generating on each record.

I think you could try to order everything after the records are available. Maybe:
select * from (
select * from
(select t3_id,max(t1_id) from t1 group by t3_id) as t1
left join (select t3_id,max(added) as last from t2 group by t3_id) as t2
on t1.t3_id = t2.t3_id ) as xx
order by last

Related

MySQL: LIMIT parameter in JOIN producing unexpected results

I have read here that MySQL processes ordering before applying limits. However, I receive different results when applying a LIMIT parameter in conjunction with a JOIN subquery. Here is my query:
SELECT
t1.id,
(t2.counts / c.matches)
FROM
table_one t1
JOIN
table_two t2 ON t1.id = t2.id
JOIN
(
SELECT
t1.id, COUNT(DISTINCT t1.id) AS matches
FROM
table_one t1
JOIN table_two t2 ON t1.id = t2.id
WHERE
t1.id IN (3390 , 3236, 148, 2811, 829, 137)
AND t2.value_one <= 30
AND t2.value_two < 2
GROUP BY t1.id
ORDER BY (t2.counts / matches)
LIMIT 0, 50 -- PROBLEM IS HERE (I think)
) c ON c.id = t1.id
ORDER BY (t2.counts / c.matches), t1.id;
Here is a rough description of what I think is happening:
The sub-query selects a bunch of ids from table_one that meet the criteria
These are ordered by (t2.counts / matches)
The top 50 (in ascending order) are fashioned into a table
This resulting table is then joined on the the id column
Results are returned from the top level JOIN - without a GROUP BY clause this time. table_one is a reference table so this will return many rows with the same ID.
I appreciate that some of these joins don't make a lot of sense but I have stripped down my query for readability - it's normally quite chunky .
The problem is that when, I include the LIMIT parameter I get a different set of results and not just the top 50. What I want to do is get the top results from the subquery and use these to join onto a bunch of other tables based on the reference table.
Here is what I have tried so far:
LIMIT on the outer query (this is undesirable as this cuts off important information).
Trying different LIMIT tables and values.
Any idea what is going wrong, or what else I could try?
I have found a solution to my problem. It seems as if my matches column name does can't be used in my ORDER BY clause - which is weird since I don't get an error. Either way, this solves the problem:
SELECT
t1.id,
(t2.counts / c.matches)
FROM
table_one t1
JOIN
table_two t2 ON t1.id = t2.id
JOIN
(
SELECT
t1.id, COUNT(DISTINCT t1.id) AS matches
FROM
table_one t1
JOIN table_two t2 ON t1.id = t2.id
WHERE
t1.id IN (3390 , 3236, 148, 2811, 829, 137)
AND t2.value_one <= 30
AND t2.value_two < 2
GROUP BY t1.id
ORDER BY (t2.counts / COUNT(DISTINCT t1.id)) -- This line is changed
LIMIT 0, 50
) c ON c.id = t1.id
ORDER BY (t2.counts / c.matches), t1.id;

Difference between "INNER JOIN table" and "INNER JOIN (SELECT table)"?

I work on a query in mysql that spend 30 sec to execute. The format is like this :
SELECT id
FROM table1 t1
INNER JOIN table2 t2
ON t1.id = t2.idt2
The INNER JOIN take 25 of 30 sec. When I write this like this :
SELECT id
FROM table1 t1
INNER JOIN (
SELECT idt2,col1,col2,col3
FROM table2
) t2
ON t1.id = t2.idt2
It take only 8 sec! Why does it work? I'm afraid of losing data.
(obviously, my query is more complex than this one, it's just an exemple)
Well you haven't shown us the EXPLAIN output
EXPLAIN SELECT id
FROM table1 t1
INNER JOIN table2 t2
ON t1.id = t2.idt2
this would definitly give us some insights of your query and table sctructures.
Based on your scenario, 1st query seems like you have issues with indexing.
What happened in your 2nd query is the optimizer is creating a temporary set from your subquery furthering filtering your data. I dont recommend doing that in MOST cases.
Purpose of subquery is to solve complex logic, not an instant solution for everything.

How to find latest record from two different tables

There are 2 tables table1 and table 2
First column, foreign_id is the common column between both tables.
Data type of all the related columns are same.
Now, we need to find the latest record based on timestamp column, for each foreign_id taking from both the tables, for example as below, also an extra column from_table, which shows from which table this row is selected.
One method that I can think of is
Combine both the tables
then, find the latest for each foreign_id column
Any, better way to do this as there could be 5000+ rows in both the tables.
Try this:
SELECT
t1.foreign_id,
MAX(t1.timestamp) max_time_table1,
MAX(t2.timestamp) max_time_table2
FROM *table1* t1
LEFT JOIN *table2* t2 USING (foreign_id)
GROUP BY foreign_id;
Note: This can be a bit slow, if the number of records are quite large.
However you can also use this:
SELECT a.foreign_id,
IF(a.max_time_table1 > a.max_time_table2, a.max_time_table1, a.max_time_table2) latest_update
FROM(
SELECT
t1.foreign_id,
SUBSTRING_INDEX(GROUP_CONCAT(t1.timestamp ORDER BY t1.id DESC),',',1) max_time_table1,
SUBSTRING_INDEX(GROUP_CONCAT(t2.timestamp ORDER BY t2.id DESC),',',1) max_time_table2
FROM *table1* t1
LEFT JOIN *table2* t2 USING (foreign_id)
GROUP BY foreign_id) a;
Make sure the id columns in both tables are auto_increment.
From your explanation, this would do then:
SELECT
foreign_id,
CASE
WHEN max_time_table1 < max_time_table2 THEN max_time_table2
WHEN max_time_table2 < max_time_table1 THEN max_time_table1
END as timestamps
FROM(
SELECT
t1.foreign_id,
SUBSTRING_INDEX(GROUP_CONCAT(t1.timestamp ORDER BY t1.id DESC),',',1) max_time_table1,
SUBSTRING_INDEX(GROUP_CONCAT(t2.timestamp ORDER BY t2.id DESC),',',1) max_time_table2
FROM *table1* t1
LEFT JOIN *table2* t2 USING (foreign_id)
GROUP BY foreign_id) a;

Does moving a where clause to the join clause improve performance?

I have a slow-running update statement, and I was curious if moving the where condition to the join clause would improve performance. Here's the query:
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID set T1.X=T2.X where T1.TYPE='something';
Now... for a very big table (millions of records), would it be faster to do this?
update T1 inner join (select ID, GROUP_CONCAT(x) as X from T3 group by ID) as T2
on T1.ID=T2.ID and T1.TYPE='something' set T1.X=T2.X;
The query is simple enough that both approaches should be optimized identically.
Both approaches might also be sub-optimal because the inner query isn't correlated to the outer query. Your query is creating an implicit temporary table containing all possible rows for derived table T2 -- exactly the same result as if you just ran the query select ID, GROUP_CONCAT(x) as X from T3 group by ID by itself -- and then the server is discarding the ones that can't be joined to T1 and using the rest to do the update.
This is more than likely not the optimum path.
Unless t1.TYPE = 'something' involves a large percentage of the rows in T1, it should be more efficient to do this:
UPDATE t1
SET t1.x = (SELECT GROUP_CONCAT(x) FROM T3 WHERE T3.id = T1.id GROUP BY T3.id)
WHERE t1.TYPE = 'something';
The inner subquery is correlated to the outer subquery, and only executed for the rows in T1 that are matched by the WHERE clause.

Sql queries difference

I would like to know whether this two versions are equivalent in result and which is better for performance reasons and why?
Nested Select in Select version
select
t1.c1,
t1.c2,
(select Count(t2.c1) from t2 where t2.id = t1.id) as count_t
from
t1
VS
select t1.c1,t1.c2, Count(t2.c1)
from t1,t2
where t2.id= t1.id
The first query is analog of this query -
SELECT
t1.c1,
t1.c2,
COUNT(t2.c1)
FROM t1
LEFT JOIN t2
ON t2.id = t1.id;
It selects all records from first table, and all matched records from second table (it is LEFT JOIN condition).
The second is analog of this query -
SELECT
t1.c1,
t1.c2,
COUNT(t2.c1)
FROM t1
JOIN t2
ON t2.id = t1.id;
It selects only matched records in both tables (it is INNER JOIN condition).
Well they are different queries. The top one will select all rows from t1 returning 0 for the count if there is no matching id in table t2.
The second query will only return rows where t1 and t2 both have a row with the same id.
The first query will likely suffer from performance issues on large data sets. The second query will potentially have a Cartesian issue. I would go with a join or left join based on your intent to have records from table 1 if table 2 has no related records and then add a group by statement to control the Cartesian.