What's the difference between putting a condition in the ON clause and putting it in the WHERE clause, as in the two following queries?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
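To make the INNER vs LEFT point concrete, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL and uses made-up rows; the table and column names are the ones from the question. It only shows the result sets, not the query plans.

```python
import sqlite3

# Hypothetical toy data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col2 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, member_id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1, 4), (2, 5);
""")

def rows(join_type, filter_in_on):
    """Run the query with the member_id filter either in ON or in WHERE."""
    if filter_in_on:
        sql = f"""SELECT table1.col2, table2.member_id
                  FROM table1 {join_type} JOIN table2
                    ON table2.col1 = table1.col2 AND table2.member_id = 4
                  ORDER BY table1.col2"""
    else:
        sql = f"""SELECT table1.col2, table2.member_id
                  FROM table1 {join_type} JOIN table2
                    ON table2.col1 = table1.col2
                  WHERE table2.member_id = 4
                  ORDER BY table1.col2"""
    return conn.execute(sql).fetchall()

inner_on = rows("INNER", True)      # filter in ON
inner_where = rows("INNER", False)  # filter in WHERE
left_on = rows("LEFT", True)        # unmatched table1 rows survive with NULLs
left_where = rows("LEFT", False)    # WHERE removes the NULL-extended rows

print(inner_on, inner_where)
print(left_on)
print(left_where)
```

With the inner join both placements return the same single row. With the left join, the ON version keeps every table1 row and the WHERE version drops the ones whose member_id came back as NULL.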
If you are trying to optimize and know your data, adding the "STRAIGHT_JOIN" keyword can tremendously improve performance. You have an inner join ON... So, just to confirm, you want only records where table1 and table2 are joined, but only where table2's member_id equals some value, in this case 4.
I would change the query to have table 2 as the primary table of the select as it has an explicit "member_id" that could be optimized by an index to limit rows, then joining to table 1 like
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would first pre-qualify only the member_id = 4 records, then match between table2 and table1. STRAIGHT_JOIN makes MySQL read the tables in the order they are listed, so if table2 had 50,000 records and table1 had 400,000 records, listing table2 first means it is processed first. The member_id = 4 filter shrinks that set further, and the join to table1 shrinks it again.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe. In a situation with multiple complex joins, my understanding is that the first form is somewhat preferable, as it reduces the complexity a little; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.
I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether there are any further optimizations, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to the first thing I was doing, but maybe there is another solution I don't have in mind. So is there anything more to do with it, or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses an "Index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve as few rows as possible. They should also be analyzed in terms of selectivity, so that you can create indexes that produce the data with as little filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine can answer from a single index per table instead of reading the secondary index and then looking up the row in the primary key index.
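As a rough illustration of point 3, the sketch below assumes SQLite (through Python's sqlite3 module) rather than MySQL/InnoDB, with a hypothetical table1 layout and hypothetical index names. SQLite's EXPLAIN QUERY PLAN reports a covering index explicitly; in MySQL the comparable hint is "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3

# Hypothetical schema; SQLite is used only because it runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, c1 INTEGER, payload TEXT, note TEXT)"
)

QUERY = "SELECT payload FROM table1 WHERE c1 = 7"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Index on the filter column only: the engine still has to visit the table row
# to fetch payload.
conn.execute("CREATE INDEX idx_c1 ON table1 (c1)")
plan_plain = plan(QUERY)

# Index that also carries payload: the query can be answered from the index alone.
conn.execute("DROP INDEX idx_c1")
conn.execute("CREATE INDEX idx_c1_payload ON table1 (c1, payload)")
plan_covering = plan(QUERY)

print(plan_plain)
print(plan_covering)
```

The trade-off is a wider index that costs more to store and to maintain on writes, so it is worth doing only for the queries that actually hurt.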
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
OP shows query patterns only, so I cannot predict whether the technique will be effective. I suggest that OP test my variant on their own structure and data.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and avoids sorting the combined rowset to remove duplicates.
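Here is a small sketch comparing the OR join, the UNION split and the UNION ALL variant on made-up rows. It assumes SQLite (through Python's sqlite3 module) instead of MySQL, and the id columns are hypothetical additions used only to identify row pairs. It also shows one caveat: if c1 can be NULL, the plain != test silently drops rows, so a NULL-safe comparison is needed (IS NOT in SQLite; NOT (a <=> b) in MySQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, c1 INTEGER, c2 INTEGER);
    CREATE TABLE table2 (id INTEGER, c1 INTEGER, c2 INTEGER);
    INSERT INTO table1 VALUES (1, 10, 100), (2, 20, 200), (3, 30, 300), (4, NULL, 400);
    INSERT INTO table2 VALUES (1, 10, 100), (2, 20, 999), (3, 99, 300),
                              (4, 77, 888), (5, 50, 400);
""")

SELECT = "SELECT table1.id AS id1, table2.id AS id2 FROM table1 JOIN table2 ON "
BY_C1 = SELECT + "table1.c1 = table2.c1"
BY_C2 = SELECT + "table1.c2 = table2.c2"

def run(sql):
    # ORDER BY applies to the whole compound query.
    return conn.execute(sql + " ORDER BY 1, 2").fetchall()

# Original single query with OR in the join condition.
or_rows = run(SELECT + "(table1.c1 = table2.c1 OR table1.c2 = table2.c2)")
# Split into two queries; UNION removes the pair matched by both branches.
union_rows = run(BY_C1 + " UNION " + BY_C2)
# UNION ALL plus the extra condition from the answer above.
union_all_neq = run(BY_C1 + " UNION ALL " + BY_C2 + " WHERE table1.c1 != table2.c1")
# Same idea with a NULL-safe comparison.
union_all_nullsafe = run(
    BY_C1 + " UNION ALL " + BY_C2 + " WHERE table1.c1 IS NOT table2.c1"
)

print(or_rows)
print(union_all_neq)
print(union_all_nullsafe)
```

Note that UNION ALL also keeps any duplicates that exist inside a single branch, which UNION would have removed, so the two forms are interchangeable only when each branch is already duplicate-free.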
I have 3 tables in MySQL (table1, table2 and table3), and the data in all three tables is large (>100k rows).
My join condition is :
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly. I believe the issue is in the second join condition, because if I remove it, the query returns results instantly.
Can anyone please explain the reason of the query being slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join is over both tables (two joins) it creates more checks on evaluation. This is creating a triangle rather than a long joined line.
Also, since all three tables have ~100K rows, even with a clustered index on the given columns, it's bound to take a performance hit, partly because all columns are being retrieved.
At least, have the select statement as T1.col1, T1.col2...,T2.col1... and so on.
Also have distinct indexes on all columns used in join condition.
More so, do you really want a huge join without a where clause? Try adding restrictive conditions for each table and see the magic as it first filters out the available set of results from each table (100k may become 10k) and then the join is attempted.
Also check the EXPLAIN output (or SQL Profiler, if you are on SQL Server) to see whether a table scan is being used (most probably yes); if so, getting the query to use an index instead should improve the situation.
I have multiple large tables (several million rows) of data that need to all be combined via inner joins in a single query and filtered. These tables are all large and some of them contain large text columns. However, I don't need all the large text columns in the result of my query. I could filter the tables incrementally as I join them in subqueries or I could skip the subqueries and just join all the tables and filter in the select clause. Which one of these would be faster, and why?
Example with filtering subquery:
select aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
from
(select a.col1, a.col2, b.col3, b.col4
from table_a a
join table_b b using(col1)
where a.col2 < 10 and b.col3 > 3)
as aa
join table_c c using(col1)
Example without subquery:
select a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
from table_a a
join table_b b using(col1)
join table_c c using(col1)
where a.col2 < 10 and b.col3 > 3
I've done a little bit of research and some people are saying that the filtering order doesn't matter and that the sql query optimizer will choose the most efficient route. However, I've also seen some answers saying to filter incrementally.
With my own experiments in MySQL, I've found that using subqueries speeds things up due to the large text field. The fetch time dominates the SQL execution time (I guess due to large text fields), and filtering the data before the second join cuts down on the fetch time considerably. However, I don't understand the underlying mechanism for this and don't know if it's a fluke of my particular setup or generally applicable. Are there general rules for this type of query in SQL? Is there a difference between these types of queries in Microsoft SQL Server vs MySQL? I primarily care about the speed of the entire query.
As per my study, the second query is faster, because the subquery takes time.
Suppose you have a query:
SELECT * FROM table WHERE id IN (SELECT id FROM table WHERE condition1 AND condition2)
In this query the subquery executes first; after that, the outer WHERE conditions are checked and the rows are selected.
If you use joins instead, it is faster, because the tables are first joined on the common field, then the other conditions are checked, and then the data is selected.
Filtering in derived tables can indeed be faster, but... it will depend specifically on the database design, the number of records filtered out, the indexes and other local conditions. So it is best to write both queries and do performance testing on your own system. Look at the explain plan for both and test the actual timing for both (you may need to clear the cache between runs for a fair test).
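A minimal harness for that kind of side-by-side check might look like the sketch below. It assumes SQLite (through Python's sqlite3 module) and a handful of made-up rows, so it only confirms that the two forms return the same result; the timing and plan comparison has to be done on the real MySQL or SQL Server data. The big_text column is a hypothetical stand-in for the large text columns mentioned in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (col1 INTEGER, col2 INTEGER, big_text TEXT);
    CREATE TABLE table_b (col1 INTEGER, col3 INTEGER, col4 TEXT);
    CREATE TABLE table_c (col1 INTEGER, col5 TEXT, col6 TEXT);
    INSERT INTO table_a VALUES (1, 5, 'x'), (2, 50, 'y'), (3, 7, 'z');
    INSERT INTO table_b VALUES (1, 4, 'b1'), (2, 9, 'b2'), (3, 1, 'b3');
    INSERT INTO table_c VALUES (1, 'c5-1', 'c6-1'), (2, 'c5-2', 'c6-2'), (3, 'c5-3', 'c6-3');
""")

# Filter inside a derived table, then join the third table.
WITH_SUBQUERY = """
    select aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
    from (select a.col1, a.col2, b.col3, b.col4
          from table_a a
          join table_b b using(col1)
          where a.col2 < 10 and b.col3 > 3) as aa
    join table_c c using(col1)
    order by 1
"""

# Join everything, then filter once in the outer WHERE clause.
WITHOUT_SUBQUERY = """
    select a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
    from table_a a
    join table_b b using(col1)
    join table_c c using(col1)
    where a.col2 < 10 and b.col3 > 3
    order by 1
"""

with_subquery = conn.execute(WITH_SUBQUERY).fetchall()
without_subquery = conn.execute(WITHOUT_SUBQUERY).fetchall()
print(with_subquery == without_subquery, with_subquery)
```

Once the results are confirmed identical, any difference you measure is down to the plan the optimizer picks, which is exactly what EXPLAIN on your own system will show.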
I have a big DB, about 1 million rows. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it runs very slowly. I know that I can do it using "JOIN" syntax, but I can't work out how.
Try this way:
select *
from t1
left join t2 on t1.id1 = t2.id2
where t2.id2 is null
First of all you should optimize your indexes in both tables, and after that you should use the join.
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all t1 where id1 is not in that set. You suggest this using the IN clause.
It can select record by record from t1 and look for each record if it finds a match in t2. You would suggest this using the EXISTS clause.
You can outer join the table then throw away all matches and stay with the non-matching entries. This may look like a bad way, especially when there are many matches, because you would get big intermediate data and then throw most of it away. However, depending on how the dbms works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc. and on what the dbms makes of your query. There are dbms that are able to completely re-write your query to find the best execution plan.
Having said all this, you can just try different things:
the IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use an EXISTS clause and see if that is faster
the outer join suggested by Parado
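The three options can be put side by side as in the sketch below. It assumes SQLite (through Python's sqlite3 module) and a few made-up ids, so it says nothing about speed; what it does show is that the three forms agree only while t2.id2 contains no NULL. As soon as a NULL appears, NOT IN returns no rows at all, while NOT EXISTS and the outer join keep working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id1 INTEGER);
    CREATE TABLE t2 (id2 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (4);
""")

QUERIES = {
    "not_in": "SELECT id1 FROM t1 WHERE id1 NOT IN (SELECT id2 FROM t2) ORDER BY id1",
    "not_exists": "SELECT id1 FROM t1 WHERE NOT EXISTS "
                  "(SELECT 1 FROM t2 WHERE t2.id2 = t1.id1) ORDER BY id1",
    "left_join": "SELECT id1 FROM t1 LEFT JOIN t2 ON t1.id1 = t2.id2 "
                 "WHERE t2.id2 IS NULL ORDER BY id1",
}

def run_all():
    # Return the list of id1 values produced by each variant.
    return {name: [r[0] for r in conn.execute(sql)] for name, sql in QUERIES.items()}

without_null = run_all()

# A single NULL in t2.id2 changes what NOT IN returns.
conn.execute("INSERT INTO t2 VALUES (NULL)")
with_null = run_all()

print(without_null)
print(with_null)
```

So before swapping one form for another for speed, check whether id2 is declared NOT NULL, or add "WHERE id2 IS NOT NULL" to the NOT IN subquery.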
I have multiple queries (from different section of my site) i am executing
Some are like this:
SELECT field, field1
FROM table1, table2
WHERE table1.id = table2.id
AND ....
and some are like this:
SELECT field, field1
FROM table1
JOIN table2
USING (id)
WHERE ...
AND ....
and some are like this:
SELECT field, field1
FROM table1
LEFT JOIN table2
ON (table1.id = table2.id)
WHERE ...
AND ....
Which of these queries is better, or slower/faster or more standard?
The first two queries are equivalent; in the MySQL world the USING keyword is almost the same as writing table1.id = table2.id (see the documentation: USING is part of the SQL:2003 spec, and there are some differences in how NULL values are handled).
You could easily write them as:
SELECT field, field1
FROM table1
INNER JOIN table2 ON (table1.id = table2.id)
The third query is a LEFT JOIN. This will select all the matching rows in both tables, and will also return all the rows in table1 that have no matches in table2. For these rows, the columns in table2 will be represented by NULL values.
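A quick way to see all of this on real rows is the sketch below. It assumes SQLite (through Python's sqlite3 module) with made-up data, and it assumes that field lives in table1 and field1 in table2, which the question does not say. It also shows one visible difference between USING and ON: with SELECT *, USING returns the join column once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, field TEXT);
    CREATE TABLE table2 (id INTEGER, field1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (2, 'y');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# The three styles from the question, plus the explicit INNER JOIN rewrite.
comma_join = run("SELECT field, field1 FROM table1, table2 "
                 "WHERE table1.id = table2.id ORDER BY table1.id")
using_join = run("SELECT field, field1 FROM table1 JOIN table2 USING (id) "
                 "ORDER BY table1.id")
inner_on = run("SELECT field, field1 FROM table1 INNER JOIN table2 "
               "ON (table1.id = table2.id) ORDER BY table1.id")
left_join = run("SELECT field, field1 FROM table1 LEFT JOIN table2 "
                "ON (table1.id = table2.id) ORDER BY table1.id")

# With SELECT *, USING merges the join column; ON keeps both copies.
star_using = conn.execute("SELECT * FROM table1 JOIN table2 USING (id)").description
star_on = conn.execute(
    "SELECT * FROM table1 JOIN table2 ON table1.id = table2.id"
).description

print(comma_join, left_join)
print(len(star_using), len(star_on))
```

The first three forms return the same two rows, and the LEFT JOIN adds the unmatched table1 row with NULL for field1.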
I like Jeff Atwood's visual explanation of these
Now, on to what is better or worse. The answer is, it depends. They are for different things. If table1 has rows with no match in table2, then a left join will return more rows than an inner join. But the performance of the queries will be affected by many factors, like table size, the types of the columns, and what else the database is doing at the same time.
Your first concern should be to use the query you need to get the data out. You might honestly want to know what rows in table1 have no match in table2; in this case you'd use a LEFT JOIN. Or you might only want rows that match - the INNER JOIN.
As Krister points out, you can use the EXPLAIN keyword to tell you how the database will execute each kind of query. This is very useful when trying to figure out just why a query is slow, as you can see where the database spends all of its time.
Personally, I prefer using left joins in my queries, though you can run into issues with null records or duplicates; that can be resolved with a simple modification using an outer clause. It's my understanding that a join is a bit more resource intensive, but this is up for debate and might come down to personal preference.
Just my $.02.
The third example, using ON (field1=field2) is the more common, and seems to be the more commonly accepted standard.
I don't know about the performance difference, you would have to run some EXPLAIN queries to see what MySQL actually ends up doing with them all really.
I do know though that the first, with WHERE being used to join them all, is much less readable on anything other than trivial queries. Once you have some complex conditions in a query, it's confusing to have "join conditions" all muddled in with "selection conditions".