My simple question is: does joining multiple tables slow down MySQL performance?
I have a data set where I need to JOIN about 6 tables, on properly indexed columns.
I read the threads like
Join slows down sql
MySQL adding join slows down whole query
MySQL multiple table join query performance issue
But my question still stands.
Can someone who has experienced this reply?
MySQL, by default, uses the Block Nested-Loop join algorithm for joins (MySQL 8.0.18 and later can also use a hash join for equi-joins). A join such as:
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2
ON t2.id = t1.id
In effect, this yields the same performance as a subquery like the following:
SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
FROM table1 t1
Indexes are obviously important to satisfy the WHERE clause in the subquery, and are used in the same fashion for join operations.
The performance of a join, assuming proper indexes, amounts to the number of lookups that MySQL must perform. The more lookups, the longer it takes.
Hence, the more rows involved, the slower the join. Joins with small result sets (few rows) are fast and considered normal usage. Keep your result sets small and use proper indexes, and you'll be fine. Don't avoid the join.
Of course, sorting results from multiple tables can be a bit more complicated for MySQL, and any time you join text or blob columns MySQL requires a temporary table, and there are numerous other details.
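To make the join/subquery equivalence above concrete, here is a minimal sketch that runs both forms and compares the rows they return. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL, and the sample rows are made up; it only checks that the result sets match, and says nothing about MySQL's execution plan or timing.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; only result sets are compared,
# not execution plans or timings. The sample data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'z');
""")

# The LEFT JOIN form from the answer.
join_rows = conn.execute("""
    SELECT t1.*, t2.col1
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.id = t1.id
    ORDER BY t1.id
""").fetchall()

# The correlated scalar subquery form from the answer.
subquery_rows = conn.execute("""
    SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
    FROM table1 t1
    ORDER BY t1.id
""").fetchall()

print(join_rows == subquery_rows)
```

Note that the two forms only line up like this when table2.id is unique; with duplicate ids the join would return extra rows while the scalar subquery would not.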
I have 3 tables in MySQL (table1, table2 and table3), and all three hold a lot of data (>100k rows).
My join condition is :
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly. I believe the issue is the second join condition, because if I remove it the query returns results instantly.
Can anyone please explain why the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join condition references both of the other tables, each candidate row requires more checks. This creates a triangle of joins rather than a simple chain.
Also, since all three tables have ~100K rows, there is bound to be a performance hit even with a clustered index on the given columns, partly because all columns are being retrieved.
At the very least, list the columns explicitly in the select statement: T1.col1, T1.col2, ..., T2.col1, and so on.
Also have distinct indexes on all columns used in join condition.
Moreover, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table and see the magic: the engine first filters each table down (100k rows may become 10k) and only then attempts the join.
Also check the query plan (the SQL Profiler output, or EXPLAIN in MySQL) to see whether a table scan is being used (most probably yes). If so, getting the query to use an index instead should improve the situation.
I have multiple large tables (several million rows each) that all need to be combined via inner joins in a single query and filtered. Some of them contain large text columns, but I don't need all of those columns in the result of my query. I could filter the tables incrementally as I join them in subqueries, or I could skip the subqueries and just join all the tables and filter in the WHERE clause. Which of these would be faster, and why?
Example with filtering subquery:
select aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
from
(select a.col1, a.col2, b.col3, b.col4
from table_a a
join table_b b using(col1)
where a.col2 < 10 and b.col3 > 3)
as aa
join table_c c using(col1)
Example without subquery:
select a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
from table_a a
join table_b b using(col1)
join table_c c using(col1)
where a.col2 < 10 and b.col3 > 3
I've done a little bit of research and some people are saying that the filtering order doesn't matter and that the sql query optimizer will choose the most efficient route. However, I've also seen some answers saying to filter incrementally.
With my own experiments in MySQL, I've found that using subqueries speeds things up because of the large text field. The fetch time dominates the SQL execution time (I guess due to the large text fields), and filtering the data before the second join cuts down on the fetch time considerably. However, I don't understand the underlying mechanism for this and don't know if it's a fluke of my particular setup or generally applicable. Are there general rules for this type of query in SQL? Is there a difference between these types of queries in Microsoft SQL Server vs MySQL? I primarily care about the speed of the entire query.
In my experience the second query is faster, because the subquery takes extra time.
Suppose you have a query:
SELECT * FROM table WHERE id IN (SELECT id FROM table WHERE condition1 AND condition2)
In this query the subquery executes first; only after it has been evaluated are the outer WHERE conditions checked and the rows selected.
If you use joins instead, it is faster because the tables are first joined on the common field, then the other conditions are checked, and then the data is selected.
Filtering in derived tables can indeed be faster, but it will depend specifically on the database design, the number of records filtered out, the indexes and other local conditions. So it is best to write both queries and do performance testing on your own system. Look at the explain plan for both and test the actual timing for both (you may need to clear the cache between runs for a fair test).
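Before timing the two forms from the question, it is worth confirming that they return the same rows. The sketch below does that under stated assumptions: SQLite (via Python's sqlite3 module) stands in for MySQL, and the table contents are invented for illustration. It checks equivalence of results only; which form is faster still has to be measured on the real system.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL, and the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (col1 INTEGER PRIMARY KEY, col2 INTEGER);
    CREATE TABLE table_b (col1 INTEGER PRIMARY KEY, col3 INTEGER, col4 TEXT);
    CREATE TABLE table_c (col1 INTEGER PRIMARY KEY, col5 TEXT, col6 TEXT);
    INSERT INTO table_a VALUES (1, 5), (2, 50), (3, 7), (4, 9);
    INSERT INTO table_b VALUES (1, 4, 'b1'), (2, 9, 'b2'), (3, 1, 'b3'), (4, 8, 'b4');
    INSERT INTO table_c VALUES (1, 'c1', 'd1'), (2, 'c2', 'd2'), (4, 'c4', 'd4');
""")

# Form 1: filter inside a derived table, then join table_c.
with_subquery = conn.execute("""
    SELECT aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
    FROM (SELECT a.col1, a.col2, b.col3, b.col4
          FROM table_a a
          JOIN table_b b USING (col1)
          WHERE a.col2 < 10 AND b.col3 > 3) AS aa
    JOIN table_c c USING (col1)
    ORDER BY aa.col1
""").fetchall()

# Form 2: join everything, then filter in the WHERE clause.
without_subquery = conn.execute("""
    SELECT a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
    FROM table_a a
    JOIN table_b b USING (col1)
    JOIN table_c c USING (col1)
    WHERE a.col2 < 10 AND b.col3 > 3
    ORDER BY a.col1
""").fetchall()

print(with_subquery == without_subquery)
```

Once the results are known to match, the same two statements can be prefixed with EXPLAIN in MySQL and timed against real data.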
I have two tables, "users" and "temp_users". The "users" table contains millions of rows and "temp_users" contains thousands. Both tables hold the same sort of information, but sometimes a record might be missing.
So, the requirement is to compare these two tables and show the differences between them. I wrote the comparison query, but, maybe due to the huge volume of data (millions of rows), it takes more than 5 minutes to execute. Any suggestions?
The comparison query which I wrote is below:
SELECT
id,
dateTime,
phone,
address
FROM
tempUsers t1
WHERE NOT EXISTS (
SELECT id,dateTime
FROM users t2
WHERE t1.id = t2.id
OR t1.dateTime=t2.dateTime
)
The system is developed in JSP and MySQL and is deployed in Apache Tomcat
Thanks,
Two Observations:
Did you really intend to have an 'OR' in your WHERE clause? Shouldn't it be an 'AND'? An 'OR' can make a query run much slower if the query optimizer is unable to use indexes because of the 'OR' logic.
You are using a sub-select rather than a JOIN, and that can also cause a significant problem: a 'correlated subquery', where the sub-select has to execute for every row returned by the outer select.
The two issues above (a correlated subquery with an OR condition) are likely what is causing the problem.
Try the following query instead:
SELECT
t1.id,
t1.dateTime,
t1.phone,
t1.address
FROM
tempUsers t1
LEFT OUTER JOIN
users t2
ON
t1.id = t2.id
AND t1.dateTime=t2.dateTime
WHERE
t2.id IS NULL
The above query performs a 'LEFT OUTER JOIN' using ID and DATETIME to join the two tables, then filters the results to only those where there is no row in USERS. This should return what you want.
If the 'OR' condition really is the logic you need, then change it in the 'ON' clause, but be prepared that it could adversely affect the speed of the query.
For additional speed: ensure that there is an index on either 'id', 'dateTime', or both.
Hope this helps!
john...
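The choice between 'OR' and 'AND' discussed above changes which rows count as "missing", not just the speed. The following sketch illustrates that with four made-up tempUsers rows. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL, and the sample data is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, dateTime TEXT, phone TEXT, address TEXT);
    CREATE TABLE tempUsers (id INTEGER, dateTime TEXT, phone TEXT, address TEXT);
    INSERT INTO users VALUES
        (1, '2020-01-01', '111', 'A St'),
        (2, '2020-01-02', '222', 'B St');
    INSERT INTO tempUsers VALUES
        (1, '2020-01-01', '111', 'A St'),  -- matches a user on id and dateTime
        (2, '2020-02-02', '222', 'B St'),  -- id matches, dateTime differs
        (3, '2020-01-02', '333', 'C St'),  -- dateTime matches, id does not
        (4, '2020-03-03', '444', 'D St');  -- matches nothing
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# Original query shape: NOT EXISTS with OR.
not_exists_or = ids("""
    SELECT t1.id FROM tempUsers t1
    WHERE NOT EXISTS (SELECT 1 FROM users t2
                      WHERE t1.id = t2.id OR t1.dateTime = t2.dateTime)
    ORDER BY t1.id
""")

# Same query with AND instead of OR.
not_exists_and = ids("""
    SELECT t1.id FROM tempUsers t1
    WHERE NOT EXISTS (SELECT 1 FROM users t2
                      WHERE t1.id = t2.id AND t1.dateTime = t2.dateTime)
    ORDER BY t1.id
""")

# Suggested rewrite: LEFT OUTER JOIN on both columns, keep unmatched rows.
left_join_and = ids("""
    SELECT t1.id FROM tempUsers t1
    LEFT OUTER JOIN users t2
      ON t1.id = t2.id AND t1.dateTime = t2.dateTime
    WHERE t2.id IS NULL
    ORDER BY t1.id
""")

print(not_exists_or, not_exists_and, left_join_and)
```

With this data the OR form reports only the row that matches nothing, while the AND form and the LEFT OUTER JOIN rewrite agree with each other and also report the partial matches, so the rewrite is only a drop-in replacement if 'AND' is the intended logic.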
I was wondering which is faster an INNER JOIN or INNER SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is key, these queries mean the same thing, and a decent DBMS will execute them in the exact same way. Unfortunately MySQL doesn't, as can be seen by expanding the "View Execution Plan" link in this SQL Fiddle. Which one will be faster probably depends on the size of tables - if TABLE1 has very few rows, then IN has a chance for being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
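The "assuming id is key" condition above matters for correctness as well as for planning. Here is a small sketch of why, assuming SQLite (via Python's sqlite3 module) as a stand-in for MySQL; the table names test2_unique and test2_dupes and the payload column are hypothetical, introduced only to contrast a keyed and a non-keyed id.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL. test2_unique / test2_dupes and the
# payload column are hypothetical names used only for this comparison.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test1 (id TEXT PRIMARY KEY, payload TEXT);
    CREATE TABLE test2_unique (id TEXT PRIMARY KEY);
    CREATE TABLE test2_dupes (id TEXT);
    INSERT INTO test1 VALUES ('blah', 'row1'), ('other', 'row2');
    INSERT INTO test2_unique VALUES ('blah');
    INSERT INTO test2_dupes VALUES ('blah'), ('blah');
""")

def via_join(test2):
    """INNER JOIN form of the query against the given test2 table."""
    return conn.execute(f"""
        SELECT t1.* FROM test1 t1
        INNER JOIN {test2} t2 ON t1.id = t2.id
        WHERE t2.id = 'blah'
    """).fetchall()

def via_in(test2):
    """IN (SELECT ...) form of the query against the given test2 table."""
    return conn.execute(f"""
        SELECT t1.* FROM test1 t1
        WHERE t1.id IN (SELECT t2.id FROM {test2} t2 WHERE t2.id = 'blah')
    """).fetchall()

print(via_join("test2_unique") == via_in("test2_unique"))  # id is a key
print(via_join("test2_dupes") == via_in("test2_dupes"))    # id is not a key
```

When id is unique in the second table the two forms return the same rows; when it is not, the join repeats the test1 row once per match while IN returns it once, so the forms are no longer interchangeable.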
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
If all else is equal, use the INNER JOIN, mostly because it's more predictable and thus easier for a new developer coming in to understand. But of course if you see a real advantage to the IN (SELECT ...) form, use it!
Though you'd have to check the execution plan on whatever RDBMS you're inquiring about, I would guess the inner join would be faster or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario since it has statistics on each table, however at the very best case, the inner join would be the same.
In most cases a JOIN is much faster than a subquery, but a subquery is more readable than a JOIN.
The RDBMS creates an execution plan for a JOIN, so it can predict what data needs to be loaded for processing, which saves time. A subquery, on the other hand, runs all of its queries and loads all of their data before doing the processing.
What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
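The inner versus outer join behaviour described above can be checked directly. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module) as a stand-in for MySQL and using made-up rows; the helper rows() is a hypothetical convenience, not part of the original queries.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col2 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, member_id INTEGER);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO table2 VALUES (10, 4), (20, 5);
""")

def rows(join_type, filter_in_on):
    """Run the query with member_id = 4 placed in ON or in WHERE."""
    if filter_in_on:
        sql = f"""SELECT * FROM table1 {join_type} JOIN table2
                  ON (table2.col1 = table1.col2 AND table2.member_id = 4)
                  ORDER BY table1.id"""
    else:
        sql = f"""SELECT * FROM table1 {join_type} JOIN table2
                  ON (table2.col1 = table1.col2)
                  WHERE table2.member_id = 4
                  ORDER BY table1.id"""
    return conn.execute(sql).fetchall()

inner_on = rows("INNER", True)
inner_where = rows("INNER", False)
left_on = rows("LEFT", True)
left_where = rows("LEFT", False)

print(inner_on == inner_where)          # same rows for INNER JOIN
print(len(left_on), len(left_where))    # row counts differ for LEFT JOIN
```

With INNER both placements return the single matching row. With LEFT, the ON placement keeps every table1 row (with NULLs where nothing matched), while the WHERE placement discards those NULL-extended rows.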
If you are trying to optimize and know your data, adding the "STRAIGHT_JOIN" keyword can tremendously improve performance. You have an inner join ON... So, just to confirm, you want only records where table1 and table2 are joined, but only for table2 member_id = some value, in this case 4.
I would change the query to have table 2 as the primary table of the select as it has an explicit "member_id" that could be optimized by an index to limit rows, then joining to table 1 like
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table1 and table2. So if table2 had 50,000 records and table1 had 400,000 records, table2, being listed first, will be processed first. Filtering on member_id = 4 reduces that set further, and the join to table1 reduces it again.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe. In a situation with multiple complex joins, my understanding is that the first form is somewhat preferable, as it reduces the complexity somewhat; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.