When doing a select query, how does the performance of a table self-join compare with a join between two different tables? For example, if tableA and tableB are exact duplicates (both structure and data), would it be preferable to:
select ... from tableA a inner join tableA b..., or
select ... from tableA a inner join tableB b...
Perhaps it's not a straightforward answer; in that case, are there any references on this topic?
I am using MySQL.
Thanks!
Assuming that table B is an exact copy of table A, and that all necessary indexes are created, a self-join of table A should be a bit faster than a join of B with A, simply because the data from table A and its indexes can be reused from the cache to perform the self-join (this also implicitly gives the self-join more memory, so more rows will fit into the working buffers).
If table B is not the same, then it is impossible to compare.
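One way to check this on your own data is to compare the plans the optimizer chooses for both forms (a sketch; the join column `id` here is a placeholder, substitute your real join column):

```sql
-- Compare the optimizer's plan for the self-join vs. the two-table join:
EXPLAIN SELECT a.*, b.* FROM tableA a INNER JOIN tableA b ON a.id = b.id;
EXPLAIN SELECT a.*, b.* FROM tableA a INNER JOIN tableB b ON a.id = b.id;
```

If both plans show the same join order, access types, and chosen indexes, any remaining difference comes down to caching, as described above.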
I am new to database indexes and have just read about what an index is, the differences between clustered and non-clustered indexes, and what a composite index is.
So for an inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA = table2.columnA;
In order to speed up the join, should I create two indexes, one on table1.columnA and the other on table2.columnA, or just one index on either table1 or table2?
Is one good enough? I don't get it. For example, if I select some data from table2 first and, based on the result, join on columnA, then I am looping through the results from table2 one by one, so an index on table2.columnA is totally useless here, because I don't need to look anything up in table2 anymore. What I need is an index on table1.columnA.
And vice versa: I need an index on table2.columnA if I select some results from table1 first and then join on columnA.
Well, I don't know what "select xxxx first, then join based on ..." actually looks like in reality, but that scenario just came to mind. It would be much appreciated if someone could give a simple example.
One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
    look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
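For this table order, the useful index could be created like so (a sketch; the index name is an arbitrary choice):

```sql
-- Index that serves the lookup side of the nested-loop join,
-- i.e. the table being probed inside the loop:
CREATE INDEX idx_t2_columnA ON table2 (columnA);
```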
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.
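For the query from the question, that check looks like this:

```sql
-- One output row per table reference, listed in join order;
-- the "key" column shows which index, if any, each lookup uses:
EXPLAIN
SELECT columnA
FROM table1
INNER JOIN table2
  ON table1.columnA = table2.columnA;
```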
90% of the time in a properly designed relational database, one of the two columns you join together is a primary key, and so should have a clustered index built for it.
So as long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indexes is if you're also further filtering the join with a WHERE clause at the end of your statement; in that case you need to make sure both the join columns and the filtered columns are together in a correct index (i.e. correct column order, etc.).
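As a sketch of that last case, assuming a hypothetical `status` column used in the WHERE clause:

```sql
-- Composite index: join column first, filter column second,
-- so one index serves both the lookup and the filter
-- (column and index names here are illustrative assumptions):
CREATE INDEX idx_join_filter ON table2 (columnA, status);
```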
I have two tables, a and b, with 2 million and 3.2 million records respectively. I am trying to get the ids from a that do not exist in b. I have written the query below:
select a.id from a where not exists (select b.id from b where a.id = b.id)
This is taking a long time. Is there a better way to get results faster?
Update: I just looked into the table structure for both tables and found that a.id has a DECIMAL datatype while b.id is a VARCHAR.
Will this difference in datatypes cause any issues?
Could you try a LEFT JOIN with an IS NULL check? It returns the ids that exist in TableA but not in TableB.
SELECT T1.Id
FROM TableA T1
LEFT JOIN TableB T2 ON T2.Id = T1.Id
WHERE T2.Id IS NULL
While you could write your query using an anti-join, it probably would not affect the performance much, and in fact the underlying execution plan could even be the same. The only way I can see to speed up your query would be to add an index to the b table:
CREATE INDEX idx ON b (id);
But if b.id is a primary key, then it is already indexed (it is part of the clustered index). In this case, your current performance might be as good as you can get.
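Given the DECIMAL/VARCHAR mismatch mentioned in the question's update, one more thing worth trying: cast the outer value to match b.id's type, so the comparison no longer forces a conversion on b's side and b's index stays usable. A sketch, assuming the string forms of the ids actually match:

```sql
-- Cast a.id (DECIMAL) to a string so it compares against
-- b.id (VARCHAR) without converting b.id row by row:
SELECT a.id
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = CAST(a.id AS CHAR));
```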
(this is mostly comment, but it's a bit long)
Please take some time to read some of the many questions about query optimization here on SO. The ones which are downvoted and closed omit table/index definitions and EXPLAIN plans. The ones which receive upvotes include these along with cardinality, performance, and result metrics.
The join to table a in your subquery is redundant. When you remove the second reference to that table you end up with a simpler query. Then you can use a NOT IN or a left join.
But the performance is still going to suck. Wherever possible you should try to avoid painting yourself into a corner like this in your data design.
Thanks for your valuable answers, I found the way. The problem was resolved after giving the lookup ids the same datatype; I now get results in 22 seconds.
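The fix described above could look something like this (a sketch; the DECIMAL precision is a hypothetical choice, pick one that fits your data):

```sql
-- Align the datatypes of the lookup ids so every comparison
-- no longer triggers an implicit conversion:
ALTER TABLE b MODIFY id DECIMAL(10,0);
```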
So, I have these two tables: tableA and tableB. Upon doing a simple inner join of these tables,
SELECT *
FROM tableA
JOIN tableB
ON tableA.columnA = tableB.id
Now, tableA contains 29000+ rows, whereas tableB contains just 11000+ rows. tableB.id is a primary key, hence clustered. And there exists a non-clustered index on columnA.
According to my thinking, the query optimizer should treat tableB as the inner table while performing the join, because it has fewer rows, and treat tableA as the outer table, as a lot of rows need to be filtered from tableA based on the value of the tableB.id column.
But, the exact opposite of this actually happens. For some reason, the query optimizer is treating tableA as the inner table and tableB as the outer table.
Can someone please explain why that happens and what error I am making in my thought process? Also, is there a way to forcefully override the query optimizer's decision and make it treat tableB as the inner table? I am just curious to see how the two different executions of the same query compare to each other. Thanks.
In InnoDB, primary key index lookups are marginally more efficient than secondary index lookups. The optimizer is probably preferring to run the join that does lookups against tableB.id because it uses the primary key index.
If you want to override the optimizer's ability to reorder tables, you can use an optimizer hint. The tables will be accessed in the order you specify them in your query.
SELECT *
FROM tableA
STRAIGHT_JOIN tableB
ON tableA.columnA = tableB.id
That syntax should work in any currently supported version of MySQL.
That will give you the opportunity to time query with either table order, and see which one in fact runs faster.
There's also new syntax in MySQL 8.0 to specify join order with greater control: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-join-order
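With that 8.0 hint syntax, the same ordering request looks like this (table names follow the question's example):

```sql
-- JOIN_ORDER lists the tables in the order the optimizer
-- should join them, without changing the query's semantics:
SELECT /*+ JOIN_ORDER(tableA, tableB) */ *
FROM tableA
JOIN tableB
  ON tableA.columnA = tableB.id;
```

Unlike STRAIGHT_JOIN, the hint leaves the rest of the query untouched, so it is easy to flip the order for timing comparisons.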
I have 3 tables in MySQL (table1, table2, and table3), and all three tables are large (>100k rows each).
My join condition is :
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly, and I believe the issue is in the second join condition, because if I remove that condition the query returns results instantly.
Can anyone please explain the reason of the query being slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
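If not, they could be created like this (index names are arbitrary choices):

```sql
-- Serves the t1 -> t2 lookup:
ALTER TABLE table2 ADD INDEX idx_col1 (col1);
-- Serves the lookup into t3; either column order works, per the note above:
ALTER TABLE table3 ADD INDEX idx_col2_col3 (col2, col3);
```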
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join condition references both of the other tables, it adds extra checks during evaluation: the join graph becomes a triangle rather than a straight chain.
Also, since all three tables have ~100K rows, even with a clustered index on the given columns there is bound to be a performance hit, partly because all columns are being retrieved.
At the very least, write the select list explicitly, as t1.col1, t1.col2, ..., t2.col1, ... and so on.
Also have indexes on all columns used in the join conditions.
More to the point, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table and see the magic: they first filter the available set of rows from each table (100k may become 10k) and only then is the join attempted.
Also check the SQL profiler (or EXPLAIN) output to see if a table scan is being used (most probably yes); if so, getting the query to use an index scan should improve the situation.
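Putting the advice above together, a sketch with an explicit select list and a hypothetical restrictive condition (the `created_at` filter is an assumption for illustration, not part of the question):

```sql
SELECT t1.col1, t1.col3, t2.col2, t3.col2
FROM table1 t1
JOIN table2 t2 ON t1.col1 = t2.col1
JOIN table3 t3 ON t3.col2 = t2.col2 AND t3.col3 = t1.col3
WHERE t1.created_at >= '2023-01-01';  -- hypothetical filter to shrink t1 first
```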
Simple question :D. I know how to do it, but I have to do it fast.
What’s the most time efficient method?
Scenario: two tables, tableA and tableB, update tableA.columnA from tableB.columnB, based on tableA.primarykey = tableB.primarykey.
Problem: tableA and tableB each have over 10,000,000 records.
update TableA as a
join TableB as b on
a.PrimaryKey = b.PrimaryKey
set a.ColumnA = b.ColumnB
Updating 10 million rows cannot be fast. Well... at least in comparison to the update of one row.
The best you can do:
indexes on the joining fields, but you've already got this, as these fields are primary keys
limit with a WHERE condition if applicable; an index covering the WHERE condition is needed to speed it up.
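One common way to apply that second point is to run the update in primary-key ranges, so each statement touches a bounded number of rows (a sketch; the batch size and the outer loop that advances the range are assumptions, not part of the question):

```sql
-- One batch: update only rows in this key range, then repeat
-- with the next range (100001..200000, and so on) until done.
UPDATE TableA a
JOIN TableB b ON a.PrimaryKey = b.PrimaryKey
SET a.ColumnA = b.ColumnB
WHERE a.PrimaryKey BETWEEN 1 AND 100000;
```

Smaller transactions keep undo/redo logs and lock durations manageable, at the cost of running several statements instead of one.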