I have a MySQL JOIN involving 4 tables, which I can write in two ways:
Direct chaining
SELECT col1, col2, col3... col12 FROM
(((tbl1 LEFT JOIN tbl2...) LEFT JOIN tbl3 ...) LEFT JOIN tbl4);
Sub-SELECT
(SELECT col10 .. col12 FROM
(SELECT col7 .. col9 FROM
(SELECT col1, ... col6 FROM tbl1
LEFT JOIN tbl2) AS J1
LEFT JOIN tbl3) AS J2
LEFT JOIN tbl4...)
Is there an efficiency difference between the two methods? My gut feeling is that sub-selects discard unnecessary rows and columns with the SELECT ... WHERE clause, which makes the JOINs faster and less memory intensive. Any advice? How about other databases?
It depends on the size of your tables and on how much data your queries filter out.
Condition 1:
If your tables are of moderate size (say about 5,000 rows each) and you are fetching data without any filtering, there should be no real difference between the two queries; the first may even perform slightly better.
Condition 2:
If your tables hold a lot of data (say billions of rows) but the filtered data set is small (say fewer than 100 rows), the second query can be better.
There is no hard and fast rule; you have to test query performance in various ways against your table sizes and requirements. The rule of thumb is that if you can reduce the size of the data going into the joins, performance will improve.
It will depend on the size of the tables. Normally the first query will be faster than the second because it takes less time to evaluate.
Related
I have 3 tables in MySQL (table1, table2 and table3), and all three hold a large amount of data (>100k rows).
My query is:
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly. I think the problem is the second join condition, because if I remove that condition the query returns results instantly.
Can anyone please explain why the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join has conditions against both of the other tables, each candidate row needs more checks during evaluation. You are creating a triangle rather than a long chain of joins.
Also, since all three tables have ~100K rows, even with a clustered index on the given columns there is bound to be a performance hit, made worse by retrieving all columns.
At least, have the select statement as T1.col1, T1.col2...,T2.col1... and so on.
Also have distinct indexes on all columns used in join condition.
More so, do you really want a huge join without a where clause? Try adding restrictive conditions for each table and see the magic as it first filters out the available set of results from each table (100k may become 10k) and then the join is attempted.
Also check the EXPLAIN output (or SQL Profiler, if you are on SQL Server) to see whether a full table scan is being used (most probably yes); if so, getting the query to use an index instead should improve the situation.
I have two tables containing 6M rows each. I'm trying to join the two using an inner join but the query ran for 2 days without finishing. The join is (note I've used count(*) just to enable me to run an explain, I'm actually using the join in a CTAS):
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
After a bit of investigation I've found the below query runs fine:
SELECT count(*)
FROM
(SELECT *
FROM table1) t1,
(SELECT *
FROM table2) t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
The only difference is that instead of the table itself, I use the sub-query SELECT * FROM table;
Running the explain plans shows that the latter query is building up an index when it selects table2. Whereas the first query is using a join buffer (Block Nested Loop).
Surely MySQL is clever enough to work out that the two queries are practically identical and do the same with both queries? I don't see why an index should be needed because a full scan is required for both tables anyway. These are temporary/transitory tables so if I did put an index on, it would literally be just to perform this join.
Is there a way to fix this via MySQL configuration?
You NEED an index on at least ONE of the tables, for example
create index Temp1 on Table2 ( colA, colB )
Your query goes from table 1 to table 2, so even if all of table 1 is scanned, the engine needs a way to quickly find the record(s) that match in table 2. If NEITHER has an index, think of it this way: for the first record in Table1, scan through ALL records in Table 2 and grab all records that match on ColA, ColB. Now go back to table 1 for the SECOND record... and go through ALL records in table 2 again looking for matches.
With 6M records, you could practically choke a cow (so to speak) on performance. With an index, even just on the SECOND table, when the query is on the first record it can jump immediately to the rows that match ColA, ColB, and as soon as those A/B records are done, it goes back to the first table.
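To put rough numbers on the difference described above, here is a minimal sketch in plain Python. It is a toy model, not how MySQL is implemented: lists of dicts stand in for the tables, a dictionary keyed on (colA, colB) stands in for the index (a real MySQL index is normally a B-tree), and the function names are made up for illustration.

```python
from collections import defaultdict


def nested_loop_join(table1, table2):
    """Join with no index: compare every row of table1 with every row of table2."""
    comparisons = 0
    matches = []
    for r1 in table1:
        for r2 in table2:
            comparisons += 1
            if (r1["col1"], r1["col2"]) == (r2["colA"], r2["colB"]):
                matches.append((r1, r2))
    return matches, comparisons


def indexed_join(table1, table2):
    """Join with a lookup structure on (colA, colB), a stand-in for an index."""
    index = defaultdict(list)
    for r2 in table2:
        index[(r2["colA"], r2["colB"])].append(r2)

    lookups = 0
    matches = []
    for r1 in table1:
        lookups += 1  # one index probe per row of table1
        for r2 in index.get((r1["col1"], r1["col2"]), []):
            matches.append((r1, r2))
    return matches, lookups


# Hypothetical data: 1,000 rows per table, exactly one match per row.
table1 = [{"col1": i, "col2": i % 7} for i in range(1000)]
table2 = [{"colA": i, "colB": i % 7} for i in range(1000)]

loop_matches, loop_cost = nested_loop_join(table1, table2)
index_matches, index_cost = indexed_join(table1, table2)
print(loop_cost, index_cost)
```

Both functions find the same matches, but the unindexed version does rows1 x rows2 comparisons while the indexed one does a single probe per row of table1. At 6M rows per table that gap is what separates days from minutes.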
Now, for other overhead efficiencies. If you have BOTH tables indexed on respective Col1, Col2 and ColA, ColB, then the engine will have in its memory / cache a whole block of records for each common area and doesn't have to keep going back to the raw data pages for other elements repeatedly.
So, even though you think it might not be practical, an index is still worth having for queries on large tables. Also, if you have multiple records in the first table with the same values for Col1, Col2 but different values in other columns, and similarly multiple records in the second table for the same ColA, ColB, you would get a Cartesian result. Consider the following scenario:
Table1
Col1  Col2  OtherColumn
X     Y     blah1
X     Y     blah2
X     Y     blah3

Table2
ColA  ColB  OtherColumn
X     Y     second blah1
X     Y     second blah2
X     Y     second blah3
A simple query like you have
SELECT count(*)
FROM table1 t1,
table2 t2
WHERE t1.col1 = t2.colA
AND t1.col2 = t2.colB;
would result in a count of 9. You have 6M records and a possible Cartesian result? Hopefully this clarifies some problems you may be encountering.
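The count of 9 is easy to check. The sketch below rebuilds the two three-row tables and runs the same query. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs without a server); the join semantics shown here are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 TEXT, col2 TEXT, other TEXT);
    CREATE TABLE table2 (colA TEXT, colB TEXT, other TEXT);
    INSERT INTO table1 VALUES
        ('X', 'Y', 'blah1'), ('X', 'Y', 'blah2'), ('X', 'Y', 'blah3');
    INSERT INTO table2 VALUES
        ('X', 'Y', 'second blah1'), ('X', 'Y', 'second blah2'),
        ('X', 'Y', 'second blah3');
""")

# Every table1 row matches every table2 row on (X, Y): 3 * 3 combinations.
join_count = conn.execute("""
    SELECT count(*)
    FROM table1 t1, table2 t2
    WHERE t1.col1 = t2.colA
      AND t1.col2 = t2.colB
""").fetchone()[0]
print(join_count)
```

If the join keys are far from unique in both 6M-row tables, the result set can be many times larger than either input.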
My simple question is: do multiple-table joins slow down MySQL performance?
I have a data set where I need to JOIN about 6 tables, on properly indexed columns.
I have read threads like:
Join slows down sql
MySQL adding join slows down whole query
MySQL multiple table join query performance issue
But the question still remains.
Can someone with experience of this reply?
MySQL executes joins with nested-loop algorithms; older versions use the Block Nested-Loop variant when no index can be used for the join. So a join like this:
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2
ON t2.id = t1.id
In effect, this join yields about the same performance as a correlated subquery like the following (assuming table2.id is unique, so the subquery returns at most one row):
SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
FROM table1 t1
Indexes are obviously important to satisfy the WHERE clause in the subquery, and are used in the same fashion for join operations.
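As a quick check that the two forms return the same rows, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table contents and the name column are made up, and id is declared as a primary key in both tables so the subquery can never return more than one row. This only demonstrates that the results match, not how the two forms perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'z');  -- no row for id 2
""")

# LEFT JOIN form.
join_rows = conn.execute("""
    SELECT t1.id, t1.name, t2.col1
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.id = t1.id
    ORDER BY t1.id
""").fetchall()

# Correlated scalar subquery form.
subquery_rows = conn.execute("""
    SELECT t1.id, t1.name,
           (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
    FROM table1 t1
    ORDER BY t1.id
""").fetchall()

print(join_rows)
print(subquery_rows)
```

In both cases the row with no match in table2 comes back with NULL (None in Python) in the last column.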
The performance of a join, assuming proper indexes, amounts to the number of lookups that MySQL must perform. The more lookups, the longer it takes.
Hence, the more rows involved, the slower the join. Joins with small result sets (few rows) are fast and considered normal usage. Keep your result sets small and use proper indexes, and you'll be fine. Don't avoid the join.
Of course, sorting results from multiple tables can be a bit more complicated for MySQL, and any time you join text or blob columns MySQL requires a temporary table, and there are numerous other details.
I'm working on a project involving joins between datasets, and we have a requirement to allow previews of arbitrary joins between arbitrary datasets. Which is crazy, but that's why it's fun. This is user-facing, so given a join I want to show ~10 rows of results quickly.
I've been experimenting with different ways to sub-sample the tables so that I get at least a few result rows, while keeping the samples small enough that the join is fast and the sampling itself is not expensive.
Here are the methods I've found that pass the smell test. I would like to know a few things about them:
What types of joins or datasets would these fail at?
How could I identify those datasets?
If both of these are bad at the same thing, how could they be improved?
Is there a type of sampling I have not put here that is better?
Subselect with a limit.
Takes an arbitrary sample of one dataset to reduce the overall size (LIMIT without ORDER BY returns whichever rows come first, not a truly random sample).
SELECT table1.col1, sample2.col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 LIMIT #) AS sample2
ON table1.col1 = sample2.col1
LIMIT 10;
I like this because it's easy and there is potential in the future to be smart about which table to sample from. However, it is possible to select a portion where table1.col1 never equals sample2.col1, so no results are returned.
Find equal values of col1 and sample them
More complicated, multi-query approach. Here I would do a distinct select of the columns to join on, compare the results to find common values and then do a subselect limiting the results to the common values.
SELECT DISTINCT col1 FROM table1;
SELECT DISTINCT col1 FROM table2;
commonVals = intersection of above results
SELECT table1.col1, sample2.col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 WHERE col1 IN(commonVals) LIMIT #) AS sample2
ON table1.col1 = sample2.col1
LIMIT 10;
This gets us a good sample of table2, but the select distinct queries may be more expensive than the join. I believe there may be a way to determine whether this method is faster if you knew something about how long the distinct calls would take, but at this point we don't have that much knowledge of the datasets.
Slap a LIMIT on the join
This is the easiest and the one I'm leaning towards.
SELECT table1.col1, table2.col2 FROM table1 JOIN table2 ON table1.col1 = table2.col1 LIMIT #
Assuming the join is good, this will always return data and for at least a large set of cases it will do it fast.
The problem with the first approach is that the rows in the first table might not have a match in the second table. Remember, inner joins not only do matching, they also do filtering.
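Here is a small sketch of that failure mode, using Python's built-in sqlite3 module as a stand-in for MySQL. The data is hypothetical and deliberately skewed so that the only matching keys sit at the far end of table2. An ORDER BY is added inside the sample subquery purely to make the example deterministic; without it, which rows LIMIT returns is up to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE table2 (col1 INTEGER, col2 TEXT);
""")
# table1 holds keys 900..999, table2 holds keys 0..999,
# so only the last 100 keys of table2 have a match.
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(k, f"left{k}") for k in range(900, 1000)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [(k, f"right{k}") for k in range(1000)])

# Approach 1: join against a 50-row sample of table2 (keys 0..49 here).
sampled = conn.execute("""
    SELECT table1.col1, sample2.col2 FROM table1 JOIN
    (SELECT col1, col2 FROM table2 ORDER BY col1 LIMIT 50) AS sample2
    ON table1.col1 = sample2.col1
    LIMIT 10
""").fetchall()

# Approach 3: put the LIMIT on the join itself.
direct = conn.execute("""
    SELECT table1.col1, table2.col2 FROM table1 JOIN table2
    ON table1.col1 = table2.col1
    LIMIT 10
""").fetchall()

print(len(sampled), len(direct))
```

The sampled join returns nothing because none of the sampled keys exist in table1, while the plain join with a LIMIT still finds 10 preview rows.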
The second approach could work, if all the columns used for joining have indexes on them. You can then get a list of matching ids by doing something like:
where id in (select id from table1) and id in (select id from table2) . . .
This gets rid of the initial code and should be pretty fast.
The third method is using the capabilities of the database most directly. You would be depending on the ability of MySQL to optimize according to the size of the result set. This is something that it does, at least in theory.
I would strongly recommend the third approach in conjunction with indexes on the columns used in the joins. This requires minimal changes to the query (just add a limit clause). It allows the database to pursue additional optimizations, if appropriate. It works on a more general set of queries.
What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
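A minimal sketch of that difference, using Python's built-in sqlite3 module as a stand-in for MySQL (the table contents are made up, and run is a hypothetical helper used only to build the four query variants):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col2 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, member_id INTEGER);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1, 4), (2, 5);  -- nothing matches table1.col2 = 3
""")


def run(join_type, filter_in_on):
    """Run the query with the member_id filter in ON or in WHERE."""
    if filter_in_on:
        sql = f"""
            SELECT table1.col2, table2.member_id
            FROM table1 {join_type} JOIN table2 ON (
                table2.col1 = table1.col2 AND
                table2.member_id = 4
            )
            ORDER BY table1.col2"""
    else:
        sql = f"""
            SELECT table1.col2, table2.member_id
            FROM table1 {join_type} JOIN table2 ON (
                table2.col1 = table1.col2
            )
            WHERE table2.member_id = 4
            ORDER BY table1.col2"""
    return conn.execute(sql).fetchall()


inner_on = run("INNER", True)
inner_where = run("INNER", False)
left_on = run("LEFT", True)
left_where = run("LEFT", False)
print(inner_on, inner_where, left_on, left_where)
```

With INNER JOIN both placements return only the member_id = 4 row. With LEFT JOIN, the ON version keeps every table1 row and fills in NULL where the condition fails, while the WHERE version filters those NULL rows back out.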
If you are trying to optimize and know your data, adding the "STRAIGHT_JOIN" keyword can tremendously improve performance. You have an inner join ON... So, just to confirm, you want only records where table1 and table2 are joined, but only for table 2 member ID = some value, in this case 4.
I would change the query to have table 2 as the primary table of the select as it has an explicit "member_id" that could be optimized by an index to limit rows, then joining to table 1 like
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table 1 and 2. So if table 2 had 50,000 records and table 1 had 400,000 records, table2, being listed first, will be processed first. Filtering on member_id = 4 leaves fewer rows, and the join to table1 leaves fewer still.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe; in a situation with multiple complex joins, my understanding is that the first is somewhat preferable, as it will reduce the complexity somewhat; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.