I'm using a MySQL database, and because of a third-party client we have a query that takes a long time. The problem is that the outer SELECT wraps an inner join that has no WHERE filtering of its own; the WHERE is applied only in the outer query. This causes a join of two very big tables instead of a join of two much smaller subsets of them. (I can't control this; that is the way it is done. I must define the join for them, and they just add WHERE clauses to it using this structure.) If the WHERE clauses were inside the inner join, the join would be much smaller and the whole query would be faster.
I've considered implementing the inner join as a view, but it gives the same performance. All fields compared by the join are indexed.
I was told that it can be improved with some tweaking of the database configuration, but no one could say what exactly.
Here is a paraphrase of the query (takes from many seconds up to a minute to execute):
SELECT a.*,
SUM(b.p1) p1
FROM
(SELECT a.*,
b.p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value)
WHERE a.some_value = 'x'
Just to explain, if I could write the query myself I would have written it like this (takes ~200ms to execute):
SELECT a.*,
SUM(b.p1) p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value
WHERE a.some_value = 'x'
Any idea how I can improve that?
Your own rewrite would be OK; however, adding the AND b.y to the WHERE clause turns your LEFT JOIN into an INNER JOIN. The AND b.y should be part of the join's ON clause to retain left-join qualification.
For indexes, table A should have an index on (x, b_id) and table B a covering index on (id, y, p1).
I am wondering how MySQL (or its underlying engine) processes queries.
There are two queries below (one uses LEFT JOIN and the other a cross join written with commas), which give the same result.
My question is: how come the processing times of the two queries are similar?
I expected the first query to run quicker, because with a left join the size of the intermediate "table" won't expand, while the second query makes the intermediate "table" relatively larger (I assume the engine needs to build the result of the cross join across the tables before it can apply the WHERE clause).
select s.*, a.score as score_01, b.score as score_02
from student s
left join (select * from sc where cid = '01') a using (sid)
left join (select * from sc where cid = '02') b using (sid)
where a.score > b.score;
select s.*, a.score as score_01, b.score as score_02
from student s
,(select * from sc where cid = '01') a
,(select * from sc where cid = '02') b
where a.score > b.score and a.sid = b.sid and s.sid = a.sid;
I tried both queries and expected the processing time of the first to be shorter, but that is not the case.
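The claim that both forms return the same rows can be checked on a toy dataset. The sketch below is an illustration only: it uses SQLite through Python's sqlite3 module as a stand-in for MySQL, and the sample rows are made up. It shows that the WHERE a.score > b.score condition discards every NULL-extended row, so the LEFT JOIN version behaves like an inner join here.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; only the result sets are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE sc (sid TEXT, cid TEXT, score INTEGER, PRIMARY KEY (sid, cid));
    INSERT INTO student VALUES ('s1', 'Ann'), ('s2', 'Bob'), ('s3', 'Cy');
    -- s1 scores higher in 01 than in 02, s2 does not, s3 has no 02 row.
    INSERT INTO sc VALUES
        ('s1', '01', 90), ('s1', '02', 70),
        ('s2', '01', 60), ('s2', '02', 80),
        ('s3', '01', 75);
""")

left_join_rows = conn.execute("""
    SELECT s.sid, a.score AS score_01, b.score AS score_02
    FROM student s
    LEFT JOIN (SELECT * FROM sc WHERE cid = '01') a USING (sid)
    LEFT JOIN (SELECT * FROM sc WHERE cid = '02') b USING (sid)
    WHERE a.score > b.score
    ORDER BY s.sid
""").fetchall()

comma_join_rows = conn.execute("""
    SELECT s.sid, a.score AS score_01, b.score AS score_02
    FROM student s,
         (SELECT * FROM sc WHERE cid = '01') a,
         (SELECT * FROM sc WHERE cid = '02') b
    WHERE a.score > b.score AND a.sid = b.sid AND s.sid = a.sid
    ORDER BY s.sid
""").fetchall()

print(left_join_rows)
print(comma_join_rows)
```

Student s3 has no 02 score, so b.score is NULL for that row and the comparison is not true; the row is dropped in both versions.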
Add this to sc:
INDEX(sid, cid, score)
Better yet, if you have a useless id on sc, replace it with
PRIMARY KEY(sid, cid)
(Assuming that pair is unique.)
With either of those fixes, I expect both of your queries to run at similar speed, and faster than currently.
For further discussion, please provide SHOW CREATE TABLE.
Addressing some of the Comments
MySQL ignores the keywords INNER, OUTER, and CROSS. So, it is up to the WHERE to figure out whether it is "inner" or "outer".
MySQL throws the ON and WHERE conditions together (except when it matters for LEFT), then decides what is used for filtering (WHERE) so it may be able to do that first. Then other conditions (which belonged in ON) help it get to the 'next' table.
So... Please use ON to say how the tables are related; use WHERE for filtering. (And don't use the old comma-join.)
That is, MySQL will [usually] look at one table at a time, doing a "Nested Loop Join" (NLJ) to get to the next.
There are many possible ways to evaluate a JOIN; MySQL ponders which one might be best, then uses that.
The order of non-LEFT JOINs does not matter, nor does the order of expressions AND'd together in WHERE.
In some situations, a HAVING expression can be (and is) moved to the WHERE clause.
Although FROM comes before WHERE, the two get somewhat tangled up together. But, in general, the clauses are required to be in a certain order, and that order is logically the order that things have to happen in.
It is up to the Optimizer to combine steps. For example
WHERE a = 1
ORDER BY b
and the table has INDEX(a,b) -- The index will be used to do both, essentially at the same time. Ditto for
SELECT a, MAX(b)
...
GROUP BY a
ORDER BY a
can hop through the BTree index on (a,b) and deliver the results without an extra sort pass for either the GROUP BY or ORDER BY.
SELECT x is executed after WHERE y = 'abc' -- Well, in some sense it is. But if you have INDEX(y,x), the Optimizer is smart enough to grab the x values while it is performing the WHERE.
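A small way to see an index serving both the WHERE and the ORDER BY is to compare query plans before and after the index exists. The sketch below is an illustration under a loud assumption: it uses SQLite (via Python's sqlite3) and its EXPLAIN QUERY PLAN output, not MySQL's optimizer. The table t, the index name idx_ab, and the data are hypothetical. In MySQL the analogous check would be to run EXPLAIN and look for "Using filesort" and "Using index" in the Extra column.

```python
import sqlite3

# Assumption: SQLite's planner is used purely to illustrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c TEXT)"
)
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(i % 5, i, "x") for i in range(100)],
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT a, b FROM t WHERE a = 1 ORDER BY b"

# Without an index: a full scan plus a separate sort step.
plan_without_index = plan(query)

# With INDEX(a, b): one index range read filters and delivers sorted rows.
conn.execute("CREATE INDEX idx_ab ON t (a, b)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

Because the query only needs columns a and b, the same index also covers the SELECT list, which is the INDEX(y,x) point made above.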
When a WHERE references more than one table of a JOIN, the Optimizer has a quandary. Which table should it start its NLJ with? It has some statistics to help make the decision, but it does not always get it right. It will usually
filter on one of the tables
NLJ to get to the next table, meanwhile throwing in any WHERE clauses for that table in with the ON clause.
Repeat for other tables.
When there is both a WHERE and an ORDER BY, the Optimizer will usually filter, then sort. But sometimes (not always correctly) it will decide to use an index for the ORDER BY (thereby eliminating the sort) and filter as it reads the table. LIMIT, which is logically done last, further muddies the decision.
MySQL does not have FULL OUTER JOIN. It can be simulated with two JOINs and a UNION. (It is only very rarely needed.)
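One common shape of that simulation is sketched below, as an illustration only: a LEFT JOIN in each direction, glued together with UNION. The tables l and r and their rows are hypothetical, and SQLite (via Python's sqlite3) stands in for MySQL; the SQL itself avoids FULL OUTER JOIN entirely.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; l and r are made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE l (k INTEGER PRIMARY KEY, lv TEXT);
    CREATE TABLE r (k INTEGER PRIMARY KEY, rv TEXT);
    INSERT INTO l VALUES (1, 'l1'), (2, 'l2');
    INSERT INTO r VALUES (2, 'r2'), (3, 'r3');
""")

full_outer = conn.execute("""
    -- Every row of l, with r attached where it matches.
    SELECT l.k AS lk, l.lv, r.k AS rk, r.rv
    FROM l LEFT JOIN r ON l.k = r.k
    UNION
    -- Every row of r, with l attached where it matches.
    SELECT l.k AS lk, l.lv, r.k AS rk, r.rv
    FROM r LEFT JOIN l ON l.k = r.k
""").fetchall()

for row in full_outer:
    print(row)
```

UNION (rather than UNION ALL) removes the matched rows that both halves produce. It also collapses genuinely duplicated rows, so this form is only safe when the joined rows are unique.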
I have multiple large tables (several million rows) of data that need to all be combined via inner joins in a single query and filtered. These tables are all large and some of them contain large text columns. However, I don't need all the large text columns in the result of my query. I could filter the tables incrementally as I join them in subqueries or I could skip the subqueries and just join all the tables and filter in the select clause. Which one of these would be faster, and why?
Example with filtering subquery:
select aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
from
(select a.col1, a.col2, b.col3, b.col4
from table_a a
join table_b b using(col1)
where a.col2 < 10 and b.col3 > 3)
as aa
join table_c c using(col1)
Example without subquery:
select a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
from table_a a
join table_b b using(col1)
join table_c c using(col1)
where a.col2 < 10 and b.col3 > 3
I've done a little bit of research and some people are saying that the filtering order doesn't matter and that the sql query optimizer will choose the most efficient route. However, I've also seen some answers saying to filter incrementally.
With my own experiments in MySQL, I've found that using subqueries speeds things up due to the large text field. The fetch time dominates the SQL execution time (I guess due to large text fields) and filtering the data before the second join cuts down on the fetch time considerably. However, I don't understand the underlying mechanism for this and don't know if it's a fluke of my particular setup or generally applicable. Are there general rules for this type of query in SQL? Is there a difference between these types of queries in Microsoft SQL Server vs MySQL? I primarily care about the speed of the entire query.
As per my study, the second query is faster, because the subquery takes time.
Suppose you have a query:
SELECT * FROM table WHERE id IN (SELECT id FROM table WHERE condition1 AND condition2)
In this query the subquery executes first; after that, the outer WHERE conditions are checked and the rows are selected.
With joins it is faster, because the tables are first joined on the common field, then the other conditions are checked, and then the data is selected.
Filtering in derived tables can indeed be faster, but it will depend specifically on the database design, the number of records filtered out, the indexes, and other local conditions. So it is best to write both queries and do performance testing on your own system. Look at the explain plan for both and test the actual timing for both (you may need to clear the cache between runs for a fair test).
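Before comparing timings, it helps to confirm that the two formulations from the question really return the same rows. The sketch below does only that, on made-up data, with SQLite (via Python's sqlite3) standing in for MySQL. It says nothing about which one is faster on a real system; that still needs EXPLAIN and timing on your own database.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the row contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (col1 INTEGER PRIMARY KEY, col2 INTEGER);
    CREATE TABLE table_b (col1 INTEGER PRIMARY KEY, col3 INTEGER, col4 TEXT);
    CREATE TABLE table_c (col1 INTEGER PRIMARY KEY, col5 TEXT, col6 TEXT);
""")
ids = range(1, 21)
conn.executemany("INSERT INTO table_a VALUES (?, ?)", [(i, i) for i in ids])
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?, ?)", [(i, i % 7, f"b{i}") for i in ids]
)
conn.executemany(
    "INSERT INTO table_c VALUES (?, ?, ?)", [(i, f"c{i}", "large text") for i in ids]
)

# Filter inside a derived table, then join the third table.
with_subquery = conn.execute("""
    SELECT aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
    FROM (SELECT a.col1, a.col2, b.col3, b.col4
          FROM table_a a
          JOIN table_b b USING (col1)
          WHERE a.col2 < 10 AND b.col3 > 3) AS aa
    JOIN table_c c USING (col1)
    ORDER BY aa.col1
""").fetchall()

# Join everything, then filter once in the outer WHERE.
without_subquery = conn.execute("""
    SELECT a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
    FROM table_a a
    JOIN table_b b USING (col1)
    JOIN table_c c USING (col1)
    WHERE a.col2 < 10 AND b.col3 > 3
    ORDER BY a.col1
""").fetchall()

print(with_subquery == without_subquery)
```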
I have a LEFT JOIN that is very expensive:
select X.c1, COUNT(Y.c3) from X LEFT JOIN Y on X.c1=Y.c2 group by X.c1;
After several minutes (20+), it still does not finish. But I want all rows in X. So I really do need a LEFT JOIN at some point.
It appears that I can hack my way around this to return the result set I am looking for by using a temp table in less than two minutes. I first trim down table Y so that it only contains rows in the join.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS
(select X.c1 as t, COUNT(Y.c2) as c from X
INNER JOIN Y where X.c1=Y.c2 group by X.c1);
select X.c1, table2.c from X
LEFT JOIN table2 on X.c1 = table2.t;
This finishes in under two minutes.
My questions are:
1) Are they equivalent?
2) Why is the second so much faster (why doesn't MySQL do this kind of optimization)? Meaning, do I need to do these kinds of rewrites myself in MySQL?
EDIT: additional info: C1, C2 are BIGINTS. C1 is unique but there can be many C2s that all point to the same C1. As far as I know, I have not indexed any tables. X.C1 is an _id column that Y.c2 refers to.
Try indexing X.c1 and Y.c2 and running your original query.
It's hard to tell why your 1st query runs slower without the indexes unless you compare the query plans of both queries (you can get the query plan by running your queries with EXPLAIN at the beginning), but I suspect it's because the 2nd table contains many rows that do not have a corresponding row in the 1st table.
If x.c1 is unique, then I would suggest writing the query as:
select X.c1,
(select COUNT(Y.c3)
from Y
where X.c1 = Y.c2
)
from X;
For this query, you want an index on Y(c2, c3).
The reason why a left join might take longer is if many rows do not match. In that case, the group by is aggregating by many more rows than it really needs to. And no, MySQL does not attempt this type of optimization.
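On the first question (are they equivalent?), a toy dataset shows one difference worth knowing. The sketch below is an illustration with made-up rows, using SQLite through Python's sqlite3 as a stand-in for MySQL, and it writes the INNER JOIN condition with ON rather than WHERE. The original LEFT JOIN and the correlated subquery report 0 for an X row with no match, while the temp-table version reports NULL unless it is wrapped in COALESCE.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (c1 INTEGER PRIMARY KEY);
    CREATE TABLE Y (c2 INTEGER, c3 INTEGER);
    CREATE INDEX y_c2_c3 ON Y (c2, c3);
    INSERT INTO X VALUES (1), (2), (3);
    -- c1 = 1 has two matches, c1 = 2 has one, c1 = 3 has none.
    INSERT INTO Y VALUES (1, 10), (1, 11), (2, 20);
""")

left_join = conn.execute("""
    SELECT X.c1, COUNT(Y.c3) FROM X
    LEFT JOIN Y ON X.c1 = Y.c2
    GROUP BY X.c1 ORDER BY X.c1
""").fetchall()

# The two-step version from the question.
conn.execute("""
    CREATE TEMPORARY TABLE table2 AS
    SELECT X.c1 AS t, COUNT(Y.c2) AS c FROM X
    INNER JOIN Y ON X.c1 = Y.c2
    GROUP BY X.c1
""")
two_step = conn.execute("""
    SELECT X.c1, table2.c FROM X
    LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()
two_step_coalesced = conn.execute("""
    SELECT X.c1, COALESCE(table2.c, 0) FROM X
    LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()

correlated = conn.execute("""
    SELECT X.c1, (SELECT COUNT(Y.c3) FROM Y WHERE X.c1 = Y.c2)
    FROM X ORDER BY X.c1
""").fetchall()

print(left_join, two_step, two_step_coalesced, correlated)
```

Note also that the question's two versions count different columns (Y.c3 versus Y.c2); COUNT skips NULLs, so the results would diverge further if Y.c3 can be NULL.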
I have multiple joins including left joins in mysql. There are two ways to do that.
I can put "ON" conditions right after each join:
select * from A join B ON(A.bid=B.ID) join C ON(B.cid=C.ID) join D ON(C.did=D.ID)
I can put them all in one "ON" clause:
select * from A join B join C join D ON(A.bid=B.ID AND B.cid=C.ID AND C.did=D.ID)
Which way is better?
Is it different if I need Left join or Right join in my query?
For simple uses MySQL will almost inevitably execute them in the same manner, so it is a matter of preference and readability (which is a great subject of debate).
However, with more complex queries, particularly aggregate queries with OUTER JOINs that have the potential to become disk- and I/O-bound, there may be performance and unseen implications in not using a WHERE clause with OUTER JOIN queries.
The difference between a query that runs for 8 minutes or .8 seconds may ultimately depend on the WHERE clause, particularly as it relates to indexes (How MySQL uses Indexes): the WHERE clause is a core part of providing the query optimizer the information it needs to do its job and tell the engine how to execute the query in the most efficient way.
From How MySQL Optimizes Queries using WHERE:
"This section discusses optimizations that can be made for processing
WHERE clauses...The best join combination for joining the tables is
found by trying all possibilities. If all columns in ORDER BY and
GROUP BY clauses come from the same table, that table is preferred
first when joining."
For each table in a join, a simpler WHERE is constructed to get a fast
WHERE evaluation for the table and also to skip rows as soon as
possible
Some examples:
Full table scans (type = ALL) with NO Using where in EXTRA
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id AND cr.role = cr2.role AND cr.util = 1000
[Err] Out of memory
Uses where to optimize results, with index (Using where,Using index):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
WHERE cr.role = cr2.role
AND cr.util = 1000
515661 rows in set (0.124s)
Combination of ON/WHERE (same result, same plan in EXPLAIN):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
AND cr.role = cr2.role
WHERE cr.util = 1000
515661 rows in set (0.121s)
MySQL is typically smart enough to figure out simple queries like the above and will execute them similarly but in certain cases it will not.
Outer Join Query Performance:
As both LEFT JOIN and RIGHT JOIN are OUTER JOINs (great in-depth review here), the issue of the Cartesian product arises. Table scans must be avoided, so that as many rows as possible that are not needed for the query are eliminated as fast as possible.
WHERE, indexes, and the query optimizer used together may completely eliminate the problems posed by Cartesian products when used carefully with aggregation and related operations such as AVG, SUM, GROUP BY, and DISTINCT. Run time can drop by orders of magnitude with proper indexing by the user and use of the WHERE clause.
Finally
Again, for the majority of queries, the query optimizer will execute these in the same manner, making it a matter of preference, but when query optimization becomes important, WHERE is a very important tool. I have seen some performance increase in certain cases with INNER JOIN by specifying an indexed column as an additional ON ... AND condition, but I could not tell you why.
Put the ON clause with the JOIN it applies to.
The reasons are:
readability: others can easily see how the tables are joined
performance: if you leave the conditions until later in the query, you'll get way more joins happening than needed; it's like putting the conditions in the WHERE clause
convention: by following normal style, your code will be more portable and less likely to encounter problems that may occur with unusual syntax - do what works
What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
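The inner-versus-outer difference described above can be reproduced on two tiny tables. This is a sketch with made-up rows, run through SQLite via Python's sqlite3 as a stand-in for MySQL; the helper rows() is hypothetical and simply builds the question's query with the member_id filter placed either in ON or in WHERE.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col2 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, member_id INTEGER);
    INSERT INTO table1 VALUES (1, 100), (2, 200);
    -- Row 100 belongs to member 4; row 200 belongs only to member 5.
    INSERT INTO table2 VALUES (100, 4), (200, 5);
""")

def rows(join_type, filter_in_on):
    """Run the query with the member_id = 4 filter in ON or in WHERE."""
    if filter_in_on:
        cond = "ON (table2.col1 = table1.col2 AND table2.member_id = 4)"
    else:
        cond = "ON (table2.col1 = table1.col2) WHERE table2.member_id = 4"
    sql = (
        "SELECT table1.id, table2.member_id FROM table1 "
        f"{join_type} JOIN table2 {cond} ORDER BY table1.id"
    )
    return conn.execute(sql).fetchall()

inner_on = rows("INNER", True)
inner_where = rows("INNER", False)
left_on = rows("LEFT", True)
left_where = rows("LEFT", False)

print(inner_on, inner_where)  # identical for INNER JOIN
print(left_on, left_where)    # differ for LEFT JOIN
```

With LEFT JOIN and the filter in ON, table1 row 2 survives with a NULL member_id; with the filter in WHERE, that NULL-extended row is removed and the query returns the same rows as the inner join.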
If you are trying to optimize and know your data, adding the "STRAIGHT_JOIN" keyword can tremendously improve performance. You have an inner join ON... So, just to confirm, you want only records where table1 and table2 are joined, but only for table2 member ID = some value, in this case 4.
I would change the query to have table 2 as the primary table of the select as it has an explicit "member_id" that could be optimized by an index to limit rows, then joining to table 1 like
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table 1 and 2. So if table 2 had 50,000 records and table 1 had 400,000 records, listing table2 first means it is processed first. Filtering on member_id = 4 limits the rows further, and the join to table1 limits them further still.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe; in a situation with multiple complex joins, my understanding is that the first is somewhat preferable, as it will reduce the complexity somewhat; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.