Simple question :D. I know how to do it, but I have to do it fast.
What’s the most time-efficient method?
Scenario: two tables, tableA and tableB; update tableA.columnA from tableB.columnB, based on tableA.primarykey = tableB.primarykey.
Problem: tableA and tableB each have over 10,000,000 records.
UPDATE TableA AS a
JOIN TableB AS b
  ON a.PrimaryKey = b.PrimaryKey
SET a.ColumnA = b.ColumnB;
Updating 10 million rows cannot be fast. Well... at least not in comparison to updating one row.
The best you can do:
indexes on the joining fields, but you've already got this, as these fields are primary keys
limit the update with a WHERE condition if applicable; an index covering the WHERE condition is needed to speed it up
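For example, one common way to apply such a WHERE limit on a huge update is to process it in primary-key batches; a sketch using the names from the question (the range bounds are illustrative, not from the question):

```sql
-- Update in bounded chunks so each statement locks and touches fewer rows.
UPDATE TableA AS a
JOIN TableB AS b ON a.PrimaryKey = b.PrimaryKey
SET a.ColumnA = b.ColumnB
WHERE a.PrimaryKey BETWEEN 1 AND 100000;

-- Repeat with the next range (100001..200000, and so on) until done.
```

Each chunk commits separately, which keeps transactions and lock durations small.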
Related
I am new to database indexes, and I've just read about what an index is, the differences between clustered and non-clustered indexes, and what a composite index is.
So for an inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA= table2.columnA;
In order to speed up the join, should I create two indexes, one on table1.columnA and the other on table2.columnA, or just one index on either table1 or table2?
Is one good enough? I don't get it. For example, if I select some data from table2 first and, based on the result, join on columnA, then I am looping through the results from table2 one by one, so an index on table2.columnA is totally useless here, because I don't need to find anything in table2 anymore. What I need is an index on table1.columnA.
And vice versa: I need an index on table2.columnA if I select some results from table1 first and then want to join on columnA.
Well, I don't know what "select xxxx first, then join based on ..." actually looks like in reality, but that scenario just came to mind. It would be much appreciated if someone could also give a simple example.
One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
    look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.
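For example, with the query from the question:

```sql
EXPLAIN
SELECT columnA
FROM table1
INNER JOIN table2 ON table1.columnA = table2.columnA;

-- One output row per table, listed in join order: the first row is the
-- table MySQL scans (the outer loop), and the `key` column of the second
-- row shows which index, if any, is used for lookups into the inner table.
```

If the second row shows your new index in `key` and a `ref` access type, the nested-loop lookup is using it.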
90% of the time in a properly designed relational database, one of the two columns you join on is a primary key, and so already has a clustered index built on it.
So as long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indexes is if you're also filtering the join further with a WHERE clause at the end of your statement; then you need to make sure both the join columns and the filtered columns are together in a correct index (i.e. correct sort order, etc.).
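A sketch of that case, assuming a hypothetical status filter column that is not in the original question:

```sql
-- Composite index: the equality-filtered column first, the join column
-- second, so the filter and the join lookup can both use the index.
CREATE INDEX ix_status_columnA ON table2 (status, columnA);

SELECT t1.columnA
FROM table1 AS t1
JOIN table2 AS t2 ON t1.columnA = t2.columnA
WHERE t2.status = 'active';
```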
I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into two queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether there are any further optimizations, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to the first thing I was doing, but maybe there is another solution I don't have in mind. So is there anything more to do with it, or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses the "Index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve the fewest rows. They should also be analyzed in terms of selectivity, so you can create indexes that produce the data with as little filtering as possible (fewer reads).
2. Join access
Retrieving the related rows should be optimized as well. Based on selectivity, decide which table is more selective and use it as the driving table, treating the other one as the nested-loop table. Then, for the latter, create an index that will retrieve its rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine won't need to read two indexes per table, just a single one.
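As a sketch, assuming the query selects hypothetical columns col_x and col_y from table2 alongside the join on c2 (these column names are not from the question):

```sql
-- Including the selected columns after the join key makes the index
-- "covering" for table2: InnoDB can answer from this index alone,
-- without a second lookup into the clustered (primary key) index.
ALTER TABLE table2 ADD INDEX ix_c2_cover (c2, col_x, col_y);
```

EXPLAIN will show "Using index" in the Extra column when a covering index is in effect.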
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
The OP shows query patterns only, so I cannot predict whether the technique will be effective or not; I suggest the OP test my variant on his structure and data.
Nico Haase > what you've changed
I have added one more condition to the 2nd subquery; see the added comment in the code.
Nico Haase > and why?
It allows UNION DISTINCT to be replaced with UNION ALL, which eliminates sorting the combined rowset to remove duplicates.
So, I have these two tables: tableA and tableB. Upon doing a simple inner join of these tables,
SELECT *
FROM tableA
JOIN tableB
ON tableA.columnA = tableB.id
Now, tableA contains 29,000+ rows, whereas tableB contains just 11,000+ rows. tableB.id is a primary key, hence clustered, and there is a non-clustered index on tableA.columnA.
According to my thinking, the query optimizer should treat tableB as the inner table while performing the join, because it has fewer rows, and treat tableA as the outer table, since a lot of rows need to be filtered from tableA based on the value of the tableB.id column.
But, the exact opposite of this actually happens. For some reason, the query optimizer is treating tableA as the inner table and tableB as the outer table.
Can someone please explain why that happens and what error I am making in my thought process? Also, is there a way to forcefully override the query optimizer's decision and make it treat tableB as the inner table? I am just curious to see how the two different executions of the same query compare to each other. Thanks.
In InnoDB, primary key index lookups are marginally more efficient than secondary index lookups. The optimizer is probably preferring to run the join that does lookups against tableB.id because it uses the primary key index.
If you want to override the optimizer's ability to reorder tables, you can use an optimizer hint. The tables will be accessed in the order you specify them in your query.
SELECT *
FROM tableA
STRAIGHT_JOIN tableB
ON tableA.columnA = tableB.id
That syntax should work in any currently supported version of MySQL.
That will give you the opportunity to time query with either table order, and see which one in fact runs faster.
There's also new syntax in MySQL 8.0 to specify join order with greater control: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-join-order
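With those 8.0 hints, the same forced order would look roughly like this:

```sql
-- JOIN_ORDER is an optimizer hint available in MySQL 8.0;
-- table names as in the question.
SELECT /*+ JOIN_ORDER(tableA, tableB) */ *
FROM tableA
JOIN tableB ON tableA.columnA = tableB.id;
```

Unlike STRAIGHT_JOIN, the hint constrains only the tables it names and leaves the rest of the query untouched.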
I have one large Master Table, TableA with 12 Million Records. The core of it was:
TableA:
|--FieldA--|--FieldP--|--FieldS--|--FieldH--|--ValueField--|--FieldX--|
I created two subtables from that:
TableB, with about 5 million unique records based on FieldA, FieldP, FieldS, into which I also pushed ValueField since I didn't need FieldX.
TableB: |--FieldA--|--FieldP--|--FieldS--|--FieldH--|--ValueField--|
TableC, with about 200,000 records, which pulled the upper and lower value fields for each unique FieldH, FieldP, FieldS combination:
TableC: |--FieldP--|--FieldS--|--FieldH--|--MaxValue--|--MinValue--|
I neglected to push FieldH into TableB initially and have done a lot of work to it in the interim so cannot redo that step.
There is no way for me to test performance so just asking the following question hoping this is enough information:
To update TableB with the FieldH data it started with in TableA I have two choices:
Update TableB as T1
Inner Join TableA as T2
  On T1.FieldA = T2.FieldA
  And T1.FieldP = T2.FieldP
  And T1.FieldS = T2.FieldS
Set T1.FieldH = T2.FieldH
I have indexes on each of the select fields.
This seems like a massive join to me.
My other option is to use the ranges and do a smaller join with more calculations:
Update TableB as T1
Inner Join TableC as T2
  On T1.ValueField >= T2.MinValue
  And T1.ValueField <= T2.MaxValue
Set T1.FieldH = T2.FieldH
I have an index on the value field as well.
Clearly the advantage of the latter is that it is a far smaller join, but on the other hand I am adding numeric comparisons for each record. I don't know enough about the inner workings of indexes, joins, or calculations to even make an educated guess as to which is better.
I hope I provided a clear picture here. Trying not to add more and over-complicate the question, if any add'l data would help I am happy to provide/elaborate.
I'll refer you to this question (which I finally answered):
Faster Update in specific case: Join Large with equals or smaller with >= and <=
which was an outgrowth of this one. I got to the point where I realized the problem was not the >= / <= but my indexing itself, and specifically my need to create a compound index AND to get MySQL to choose it. So I tackled that question instead.
Long story short: since I needed to join two large tables on concatenated varchar fields, I made an intermediate table and used its IDs to join instead. Please see the link above if you need details.
In the end it worked great even using the >= / <=; in fact it ended up doing 52k updates on a join of one 5-million-record table to another in just over a minute (where it had been taking 10 seconds per record before).
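For reference, the intermediate-table idea can be sketched like this (table and column names are illustrative, not the actual schema from the linked question):

```sql
-- Map each distinct varchar key combination to a compact integer ID once:
CREATE TABLE KeyMap (
    ID     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    FieldA VARCHAR(64) NOT NULL,
    FieldP VARCHAR(64) NOT NULL,
    FieldS VARCHAR(64) NOT NULL,
    UNIQUE KEY ux_key (FieldA, FieldP, FieldS)
);

-- Then join the two large tables through the small integer ID instead of
-- the wide varchar columns, keeping the join keys short and index-friendly.
```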
When doing a select query, how does the performance of a table self-join compare with a join between two different tables? For example, if tableA and tableB are exact duplicates (both structure and data), would it be preferable to:
select ... from tableA a inner join tableA b..., or
select ... from tableA a inner join tableB b...
Perhaps it's not a straightforward answer, in which case, are there any references on this topic?
I am using MySQL.
Thanks!
Assuming that table B is an exact copy of table A, and that all necessary indexes are created, a self-join of table A should be a bit faster than a join of B with A, simply because the data and indexes of table A can be reused from cache to perform the self-join (this may also implicitly leave more memory available for the self-join, so more rows will fit into the working buffers).
If table B is not the same, then it is impossible to compare.