MySQL creating temp table then join faster than left join - mysql

I have a LEFT JOIN that is very expensive:
    select X.c1, COUNT(Y.c3) from X LEFT JOIN Y on X.c1=Y.c2 group by X.c1;
After several minutes (20+), it still does not finish. But I want all rows in X. So I really do need a LEFT JOIN at some point.
It appears that I can hack my way around this to return the result set I am looking for by using a temp table in less than two minutes. I first trim down table Y so that it only contains rows in the join.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS 
(select X.c1 as t, COUNT(Y.c2) as c from X
INNER JOIN Y ON X.c1=Y.c2 group by X.c1);
select X.c1, table2.c from X 
LEFT JOIN table2 on X.c1 = table2.t; 
This finishes in under two minutes.
My questions are:
1) Are they equivalent?
2) Why is the second so much faster (why doesn't MySQL do this kind of optimization itself)? In other words, do I need to do these kinds of optimizations by hand in MySQL?
EDIT: additional info: C1, C2 are BIGINTS. C1 is unique but there can be many C2s that all point to the same C1. As far as I know, I have not indexed any tables. X.C1 is an _id column that Y.c2 refers to.

Try indexing X.c1 and Y.c2 and running your original query.
Without indexes, it's hard to tell why your first query runs slower unless you compare the query plans of both queries (you can get a query plan by prefixing the query with EXPLAIN), but I suspect it's because the second table contains many rows that have no corresponding row in the first table.

If x.c1 is unique, then I would suggest writing the query as:
select X.c1,
(select COUNT(Y.c3)
from Y
where X.c1 = Y.c2
)
from X;
For this query, you want an index on Y(c2, c3).
The reason why a left join might take longer is if many rows do not match. In that case, the group by is aggregating by many more rows than it really needs to. And no, MySQL does not attempt this type of optimization.
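On question 1 (equivalence), a small experiment may help. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, with made-up rows for X and Y, so it only illustrates the logical results and says nothing about MySQL's timings. On this data the original LEFT JOIN and the correlated subquery agree, while the temp-table version returns NULL instead of 0 for rows of X with no match, unless the count is wrapped in COALESCE. Note also that the question counts Y.c3 in one query and Y.c2 in the other, which would differ if c3 can be NULL (assumed not to be the case here).

```python
import sqlite3

# SQLite in-memory database as a stand-in for MySQL (assumption: we only
# compare logical results, not performance).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (c1 INTEGER PRIMARY KEY);
    CREATE TABLE Y (c2 INTEGER, c3 INTEGER);
    CREATE INDEX y_c2_c3 ON Y (c2, c3);
    INSERT INTO X VALUES (1), (2), (3);
    -- c1 = 1 has two matches, c1 = 2 has one, c1 = 3 has none
    INSERT INTO Y VALUES (1, 10), (1, 11), (2, 20);
""")

# The original query from the question.
left_join = conn.execute("""
    SELECT X.c1, COUNT(Y.c3) FROM X
    LEFT JOIN Y ON X.c1 = Y.c2
    GROUP BY X.c1 ORDER BY X.c1
""").fetchall()

# The correlated-subquery form suggested in the answer.
correlated = conn.execute("""
    SELECT X.c1, (SELECT COUNT(Y.c3) FROM Y WHERE X.c1 = Y.c2)
    FROM X ORDER BY X.c1
""").fetchall()

# The temp-table workaround from the question.
conn.execute("""
    CREATE TEMPORARY TABLE table2 AS
    SELECT X.c1 AS t, COUNT(Y.c2) AS c FROM X
    INNER JOIN Y ON X.c1 = Y.c2 GROUP BY X.c1
""")
via_temp = conn.execute("""
    SELECT X.c1, table2.c FROM X
    LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()

# Same workaround, with the missing counts turned into 0.
via_temp_fixed = conn.execute("""
    SELECT X.c1, COALESCE(table2.c, 0) FROM X
    LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()

print(left_join, correlated, via_temp, via_temp_fixed, sep="\n")
```

So the two approaches in the question are close but not identical: they differ in how unmatched rows of X are reported.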

Related

Optimizing INNER JOIN across multiple tables

I have trawled many of the similar responses on this site and have improved my code at several stages along the way. Unfortunately, this 3-row query still won't run.
I have one table with 100k+ rows and about 30 columns of which I can filter down to 3-rows (in this example) and then perform INNER JOINs across 21 small lookup tables.
In my first attempt, I was lazy and used implicit joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `lookup_table` x 21
WHERE `master_table`.`indexed_col` = "value"
AND `lookup_table`.`id` = `lookup_col` x 21
The query looked to be timing out:
#2013 - Lost connection to MySQL server during query
Following this, I tried being explicit about the joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `master_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `master_table`.`lookup_col` x 21
WHERE `master_table`.`indexed_col` = "value"
Still got the same result. I then realised that the query was probably trying to perform the joins first, then filter down via the WHERE clause. So after a bit more research, I learned how I could apply a subquery to perform the filter first and then perform the joins on the newly created table. This is where I got to, and it still returns the same error. Is there any way I can improve this query further?
SELECT `temp_table`.*, `lookup_table`.`data_point` x 21
FROM (SELECT * FROM `master_table` WHERE `indexed_col` = "value") as `temp_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `temp_table`.`lookup_col` x 21
Is this the best way to write up this kind of query? I tested the subquery to ensure it only returns a small table and can confirm that it returns only three rows.
First, at its most simple aspect you are looking for
select
mt.*
from
Master_Table mt
where
mt.indexed_col = 'value'
That is probably instantaneous provided you have an index on your master table on the given indexed_col in the first position (in case you had a compound index of many fields)…
Now, if I am understanding you correctly, you simplified your 21 different lookup columns for brevity in this post, and you are actually doing something to the effect of
select
mt.*,
lt1.lookupDescription1,
lt2.lookupDescription2,
...
lt21.lookupDescription21
from
Master_Table mt
JOIN Lookup_Table1 lt1
on mt.lookup_col1 = lt1.pk_col1
JOIN Lookup_Table2 lt2
on mt.lookup_col2 = lt2.pk_col2
...
JOIN Lookup_Table21 lt21
on mt.lookup_col21 = lt21.pk_col21
where
mt.indexed_col = 'value'
I had a project well over a decade ago dealing with a similar situation... the Master table had about 21+ million records and had to join to about 30+ lookup tables. The system crawled, and the query died after running for more than 24 hrs.
This too was on a MySQL server and the fix was a single MySQL keyword...
Select STRAIGHT_JOIN mt.*, ...
By having your master table in the primary position, with the where clause and its criteria directly on the master table, you are good. You know the relationships of the tables. The keyword tells MySQL: do the query in the exact order I presented it to you. Don't try to think for me on this and try to optimize based on a subsidiary table that may have a smaller record count and somehow think that will help the query run faster... it won't.
Try the STRAIGHT_JOIN keyword. It took the query I was working on and finished it in about 1.5 hrs... it was returning all 21 million rows with all corresponding lookup key descriptions for final output, hence still needed a longer duration than just 3 records.
First, don't use a subquery. Write the query as:
SELECT mt.*, lt.`data_point`
FROM `master_table` mt INNER JOIN
`lookup_table` lt
ON lt.`id` = mt.`lookup_col`
WHERE mt.`indexed_col` = 'value';
The indexes that you want are master_table(indexed_col, lookup_col) and lookup_table(id, data_point).
If you are still having performance problems, then there are multiple possibilities. High among them is that the result set is simply too big to return in a reasonable amount of time. To see if that is the case, you can use select count(*) to count the number of returned rows.
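As a rough illustration of the advice above, the following sketch reduces the asker's "x 21" shorthand to two hypothetical lookup tables (lookup_table1, lookup_table2, and the colour/size aliases are invented for the example) and runs it on SQLite via Python's sqlite3 rather than MySQL. It shows that the plain explicit join and the filter-first subquery return the same rows, and how a COUNT(*) wrapper reports the result-set size. STRAIGHT_JOIN is MySQL-specific and is not exercised here.

```python
import sqlite3

# SQLite stand-in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_table (id INTEGER PRIMARY KEY, indexed_col TEXT,
                               lookup_col1 INTEGER, lookup_col2 INTEGER);
    CREATE TABLE lookup_table1 (id INTEGER PRIMARY KEY, data_point TEXT);
    CREATE TABLE lookup_table2 (id INTEGER PRIMARY KEY, data_point TEXT);
    CREATE INDEX master_indexed_col ON master_table (indexed_col);
    INSERT INTO lookup_table1 VALUES (1, 'red'), (2, 'blue');
    INSERT INTO lookup_table2 VALUES (1, 'small'), (2, 'large');
    INSERT INTO master_table VALUES
        (1, 'value', 1, 2), (2, 'value', 2, 2), (3, 'other', 1, 1);
""")

# Filter in the WHERE clause, joins written explicitly.
explicit_join = """
    SELECT mt.id, lt1.data_point AS colour, lt2.data_point AS size
    FROM master_table mt
    INNER JOIN lookup_table1 lt1 ON lt1.id = mt.lookup_col1
    INNER JOIN lookup_table2 lt2 ON lt2.id = mt.lookup_col2
    WHERE mt.indexed_col = 'value'
"""

# Filter first in a derived table, then join (the asker's last attempt).
filtered_subquery = """
    SELECT t.id, lt1.data_point AS colour, lt2.data_point AS size
    FROM (SELECT * FROM master_table WHERE indexed_col = 'value') AS t
    INNER JOIN lookup_table1 lt1 ON lt1.id = t.lookup_col1
    INNER JOIN lookup_table2 lt2 ON lt2.id = t.lookup_col2
"""

rows_join = sorted(conn.execute(explicit_join).fetchall())
rows_sub = sorted(conn.execute(filtered_subquery).fetchall())

# Count the rows the join would return without fetching them all.
row_count = conn.execute(
    "SELECT COUNT(*) FROM (" + explicit_join + ") AS q"
).fetchone()[0]

print(rows_join, rows_sub, row_count)
```

If the count is far larger than the number of filtered master rows, a lookup key is probably not unique and the joins are multiplying rows.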

Mysql: Why is WHERE IN much faster than JOIN in this case?

I have a query with a long list (> 2000 ids) in a WHERE IN clause in mysql (InnoDB):
SELECT id
FROM table
WHERE user_id IN ('list of >2000 ids')
I tried to optimize this by using an INNER JOIN instead of the WHERE IN, like this (both ids and the user_id use an index):
SELECT table.id
FROM table
INNER JOIN users ON table.user_id = users.id WHERE users.type = 1
Surprisingly, however, the first query is much faster (by a factor of 5 to 6). Why is this the case? Could it be that the second query outperforms the first one when the number of ids in the WHERE IN clause becomes much larger?
This is not an answer to your question, but as an alternative to your first query you can replace the IN clause with EXISTS, since EXISTS performs better than IN:
SELECT id
FROM table t
WHERE EXISTS (SELECT 1 FROM USERS WHERE t.user_id = users.id)
This is an unfair comparison between the 2 queries.
In the 1st query you provide a list of constants as the search criteria, so MySQL has to open and search only one table and/or one index file.
In the 2nd query you instruct MySQL to obtain the list dynamically from another table and join that list back to the main table. It is also not clear, if indexes were used to create a join or a full table scan was needed.
To have a fair comparison, time the query that you used to obtain the list in the 1st query along with the query itself. Or try
SELECT table.id FROM table WHERE user_id IN (SELECT users.id FROM users WHERE users.type = 1)
The above fetches the list of ids dynamically in a subquery.
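To make the comparison concrete, here is a minimal sketch using Python's sqlite3 module in place of MySQL, with a hypothetical `items` table standing in for the asker's `table` (a reserved word) and a handful of invented rows. It checks only which rows each form returns, not how fast it runs. One thing it makes visible: the EXISTS query above has no `users.type = 1` condition, so it matches the other forms only once that filter is added.

```python
import sqlite3

# SQLite stand-in for MySQL; `items` is a hypothetical name for `table`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, type INTEGER);
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE INDEX items_user_id ON items (user_id);
    INSERT INTO users VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO items VALUES (10, 1), (11, 2), (12, 3), (13, 3);
""")

def ids(sql, params=()):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql, params))

# Step 1 of a "fair" comparison: build the constant list first...
user_ids = ids("SELECT id FROM users WHERE type = 1")
# ...then step 2: use it as a literal IN list.
placeholders = ", ".join("?" for _ in user_ids)
in_list = ids(
    f"SELECT id FROM items WHERE user_id IN ({placeholders})", user_ids
)

inner_join = ids("""SELECT items.id FROM items
    INNER JOIN users ON items.user_id = users.id WHERE users.type = 1""")

in_subquery = ids("""SELECT id FROM items WHERE user_id IN
    (SELECT users.id FROM users WHERE users.type = 1)""")

# EXISTS as written in the answer: no filter on users.type.
exists_no_filter = ids("""SELECT id FROM items t WHERE EXISTS
    (SELECT 1 FROM users WHERE t.user_id = users.id)""")

# EXISTS with the type filter added.
exists_filtered = ids("""SELECT id FROM items t WHERE EXISTS
    (SELECT 1 FROM users WHERE t.user_id = users.id AND users.type = 1)""")

print(in_list, inner_join, in_subquery, exists_no_filter, exists_filtered)
```

When timing the real queries, the cost of building `user_ids` belongs to the IN-list variant as well.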

How to improve a MySQL query that have an internal join

I'm using a MySQL DB, and because of a 3rd-party client we have a query that takes a long time. The 'problem' is that there is an outer select over an internal join that does not filter results with 'where'; the 'where' is only on the "outer" section, which causes a join of 2 very big tables instead of a join of 2 much smaller subsets of the tables (I can't control it, this is the way it is done... I must define the join for them and they just add where clauses to it using this structure). Note that if the 'where' clauses were inside the internal join, the join would be much smaller and the whole query would be faster.
I've considered implementing the internal join using a view, but it results in the same performance. All fields compared by the join are indexed.
I was told that it can be improved with some DB configuration tweaking, but no one could say what exactly.
Here is a paraphrase of the query (takes from many seconds to a minute to execute):
SELECT a.*,
SUM(b.p1) p1
FROM
(SELECT a.*,
b.p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value)
WHERE a.some_value = 'x'
Just to explain, if I could write the query myself I would have written it like this (takes ~200ms to execute):
SELECT a.*,
SUM(b.p1) p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value
WHERE a.some_value = 'x'
Any idea how can I improve that?
Your personal rewrite would be ok; however, adding the AND b.y condition to the WHERE clause turns your LEFT JOIN into an INNER JOIN. The AND b.y condition should be part of the join's ON clause to retain the left-join behavior.
For indexes, table A should have index on (x, b_id) and table B have a covering index on (id, y, p1)

INNER JOIN condition in WHERE clause or ON clause?

I mistyped a query today, but it still worked and gave the intended result. I meant to run this query:
SELECT e.id FROM employees e JOIN users u ON u.email=e.email WHERE u.id='139840'
but I accidentally ran this query
SELECT e.id FROM employees e JOIN users u ON u.email=e.email AND u.id='139840'
(note the AND instead of WHERE in the last clause)
and both returned the correct employee id from the user id.
What is the difference between these 2 queries? Does the second form only join members of the 2 tables meeting the criteria, whereas the first one would join the entire table, and then run the query? Is one more or less efficient than the other? Is it something else I am missing?
Thanks!
For inner joins like this they are logically equivalent. However, you can run into situations where a condition in the join clause means something different than a condition in the where clause.
As a simple illustration, imagine you do a left join like so;
select x.id
from x
left join y
on x.id = y.id
;
Here we're taking all the rows from x, regardless of whether there is a matching id in y. Now let's say our join condition grows - we're not just looking for matches in y based on the id but also on id_type.
select x.id
from x
left join y
on x.id = y.id
and y.id_type = 'some type'
;
Again this gives all the rows in x regardless of whether there is a matching (id, id_type) in y.
This is very different, though:
select x.id
from x
left join y
on x.id = y.id
where y.id_type = 'some type'
;
In this situation, we're picking all the rows of x and trying to match them to rows from y. For rows that have no match in y, y.id_type will be null. Because of that, y.id_type = 'some type' isn't satisfied, so the rows with no match are discarded, which effectively turns this into an inner join.
Long story short: for inner joins it doesn't matter where the conditions go but for outer joins it can.
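The left-join variants above can be checked on a tiny data set. This is a sketch using Python's sqlite3 module rather than MySQL, with invented rows for x and y; the join semantics shown are standard SQL. With the condition in ON, every row of x survives; with the same condition in WHERE, only the matching row does. For the inner join, the placement makes no difference.

```python
import sqlite3

# SQLite stand-in; the rows are made up for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER PRIMARY KEY);
    CREATE TABLE y (id INTEGER, id_type TEXT);
    INSERT INTO x VALUES (1), (2), (3);
    -- id 1 matches with the wanted type, id 2 with another type,
    -- id 3 has no row in y at all
    INSERT INTO y VALUES (1, 'some type'), (2, 'other type');
""")

def ids(sql):
    """Run a query and return x.id values in order."""
    return [row[0] for row in conn.execute(sql + " ORDER BY x.id")]

# Extra condition in the ON clause: all rows of x are kept.
condition_in_on = ids("""SELECT x.id FROM x LEFT JOIN y
    ON x.id = y.id AND y.id_type = 'some type'""")

# Same condition in WHERE: unmatched rows have NULL id_type and are dropped.
condition_in_where = ids("""SELECT x.id FROM x LEFT JOIN y
    ON x.id = y.id WHERE y.id_type = 'some type'""")

# For an inner join, ON and WHERE placement give the same rows.
inner_on = ids("""SELECT x.id FROM x INNER JOIN y
    ON x.id = y.id AND y.id_type = 'some type'""")
inner_where = ids("""SELECT x.id FROM x INNER JOIN y
    ON x.id = y.id WHERE y.id_type = 'some type'""")

print(condition_in_on, condition_in_where, inner_on, inner_where)
```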
In the case of an INNER JOIN, the two queries are semantically the same, meaning they are guaranteed to have the same results. If you were using an OUTER join, the meaning of the two queries could be very different, with different results.
Performance-wise, I would expect that these two queries would result in the same execution plan. However, the query engine might surprise you. The only way to know is to view the execution plans for the two queries.
The optimizer will treat them the same. You can do an EXPLAIN to prove it to yourself.
Therefore, write the one that is clearer.
SELECT e.id
FROM employees e JOIN users u ON u.email=e.email
WHERE u.id='139840'
If it were an outer join instead of inner, you'd get unintended results, but when using an inner join it makes no real difference whether you use additional join criteria instead of a WHERE clause.
Performance-wise they are most likely identical, but can't be certain.
I brought this up with my colleagues on our team at work. This response is a bit SQL Server centered rather than MySQL. However, the optimizers of SQL Server and MySQL should operate similarly.
Some thoughts:
Essentially, if you have to add a WHERE, additional table scans are done to verify equality for each condition. (This goes up by orders of magnitude with an AND over the dataset; with an OR, the decision is made at the first true condition.) If you have a single id pointer, as in the example given, it is extremely quick. Conversely, if you have to find all of the records that belong to a company or department, it becomes less clear-cut, as you may have many matching records. If you can apply the equals condition, it is far more effective when working with an AuditLog or EventLog table that has zillions of rows. One would not really see large benefits from this on small tables (around 200,000 rows or so).
From: Allesandro Alpi
http://suxstellino.wordpress.com/2013/01/07/sql-server-logical-query-processing-summary/
From: Itzik Ben-Gan
http://tsql.solidq.com/books/insidetsql2008/Logical%20Query%20Processing%20Poster.pdf

Querying to find if some columns are in array

I have a complex nested-query which is inside a join, is it possible to find several columns that match that query instead of repeating the query in the Join? ie:
select * from
A left join B on a.xid=b.xid and
(a.userid or b.userid) in (select userid from A where..)
^^^ don't want to duplicate the nested-query...
There is a nested query that should match several columns from the parent-query (as seen in the example above). The simple way is to duplicate the nested query several times. ie-
select * from A
left join B
on a.xid=b.xid
and a.userid in (select userid from ...)
and b.userid in (Select userid from ....)
BUT - since the subquery is a bit complicated I don't want mysql to run it twice, but rather only once and then match it against several of the parent query columns.
If your subquery is working properly and you have the query cache turned on, you won't have to worry about performance. If it's a question of it being overly complex, then maybe you could use a proc for this query: put the results of the sub into a temp table and join to it.
There are lots of ways to approach this.
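One of those ways, the "store the subquery result once and match against it" idea suggested above, might look like the sketch below. It runs on SQLite through Python's sqlite3 module, and the `active = 1` filter is a hypothetical placeholder for the asker's elided "where.." condition. The result is stored in a regular scratch table named `wanted` (an invented name) rather than a TEMPORARY one, because MySQL documents a restriction on referring to a TEMPORARY table more than once in the same query, and this query refers to it twice.

```python
import sqlite3

# SQLite stand-in for MySQL; rows and the `active` column are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (xid INTEGER, userid INTEGER, active INTEGER);
    CREATE TABLE B (xid INTEGER, userid INTEGER);
    INSERT INTO A VALUES (1, 100, 1), (2, 200, 0), (3, 300, 1);
    INSERT INTO B VALUES (1, 300), (2, 100), (3, 999);

    -- Run the (hypothetical) expensive subquery once and keep its result.
    -- A regular scratch table is used so it can be referenced twice below.
    CREATE TABLE wanted AS
        SELECT DISTINCT userid FROM A WHERE active = 1;
""")

# Both columns are matched against the stored result instead of
# repeating the original subquery text.
rows = conn.execute("""
    SELECT A.xid, A.userid, B.userid
    FROM A
    LEFT JOIN B
      ON A.xid = B.xid
     AND A.userid IN (SELECT userid FROM wanted)
     AND B.userid IN (SELECT userid FROM wanted)
    ORDER BY A.xid
""").fetchall()

print(rows)

# Clean up the scratch table when done.
conn.execute("DROP TABLE wanted")
```

Only the first row of A finds a partner in B here, because it is the only pair where both userids are in the stored list.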