How to update millions of records in MySQL? - mysql

I have two tables, tableA and tableB. tableA has 2 million records and tableB has over 10 million. tableA has more than thirty columns whereas tableB has only two. I need to update a column in tableA from tableB by joining both tables.
UPDATE tableA a
INNER JOIN tableB b ON a.colA = b.colA
SET a.colB = b.colB
colA is indexed in both tables.
Now when I execute the query it takes hours. Honestly, I have never seen it complete; the longest I have waited is 5 hours. Is there any way to complete this query within 20-30 minutes? What approach should I take?
EXPLAIN on the SQL query:
id  select_type  table  type  possible_keys  key        key_len  ref         rows     Extra
1   SIMPLE       a      ALL   INDX_DESC      NULL       NULL     NULL        2392270  Using where
1   SIMPLE       b      ref   indx_desc      indx_desc  133      cis.a.desc  1        Using where

Your UPDATE operation runs as a single transaction over millions of rows of a large table. (The DBMS must hold enough data to roll back the entire UPDATE if it does not complete for any reason.) A transaction of that size is slow for your server to handle.
When you process entire tables, the operation can't use indexes as effectively as it can with highly selective WHERE clauses.
A few things to try:
1) Don't update rows unless they need it: skip the rows that already have the correct value. If most rows already have the correct value, this will make your update much faster.
UPDATE tableA a
INNER JOIN tableB b ON a.colA = b.colA
SET a.colB = b.colB
WHERE a.colB <> b.colB
2) Do the update in chunks of a few thousand rows, and repeat the update operation until the whole table is updated. I guess tableA contains an id column. You can use it to organize the chunks of rows to update.
UPDATE tableA a
INNER JOIN tableB b ON a.colA = b.colA
SET a.colB = b.colB
WHERE a.id IN (
    SELECT id FROM (
        SELECT a2.id
        FROM tableA a2
        INNER JOIN tableB b2 ON a2.colA = b2.colA
        WHERE a2.colB <> b2.colB
        LIMIT 5000
    ) AS next_chunk
)
The subquery finds the id values of 5000 rows that haven't yet been updated; the extra derived table works around MySQL's restriction against selecting from the table being updated in the same statement. The UPDATE then changes just those rows. Repeat this query until it changes no rows, and you're done. This makes things faster because the server only has to handle small transactions.
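If you want the repetition automated, here is a minimal sketch of a driver procedure (the procedure name is illustrative, not from the original answer); it uses ROW_COUNT() to stop once an iteration changes nothing:
DELIMITER $$
CREATE PROCEDURE update_in_chunks()
BEGIN
    DECLARE affected INT DEFAULT 1;
    WHILE affected > 0 DO
        UPDATE tableA a
        INNER JOIN tableB b ON a.colA = b.colA
        SET a.colB = b.colB
        WHERE a.id IN (
            SELECT id FROM (
                SELECT a2.id
                FROM tableA a2
                INNER JOIN tableB b2 ON a2.colA = b2.colA
                WHERE a2.colB <> b2.colB
                LIMIT 5000
            ) AS next_chunk
        );
        -- ROW_COUNT() reports how many rows the last statement changed
        SET affected = ROW_COUNT();
    END WHILE;
END$$
DELIMITER ;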
3) Don't do the update at all. Instead, whenever you need to retrieve your colB value, simply join to tableB in your select query.

Chunking is the right way to go. However, chunk on the PRIMARY KEY of tableA.
I suggest only 1000 rows at a time.
Follow the tips given here
Did you say that the PK of tableA is a varchar? No problem. See the second flavor of code in that link; it uses ORDER BY id LIMIT 1000,1 to find the end of the next chunk, regardless of the datatype of id (the PK).
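A minimal sketch of that chunk-walking pattern (variable names are my own; it works for numeric and varchar PKs alike):
SET @chunk_start = (SELECT MIN(id) FROM tableA);  -- initialize once

-- find where the next 1000-row chunk ends
SELECT id INTO @chunk_end
FROM tableA
WHERE id >= @chunk_start
ORDER BY id
LIMIT 1000, 1;

-- update only that slice
UPDATE tableA a
INNER JOIN tableB b ON a.colA = b.colA
SET a.colB = b.colB
WHERE a.id >= @chunk_start
  AND a.id < @chunk_end;

-- advance to the next chunk; when the SELECT above finds no row,
-- you are on the last chunk: run one final UPDATE with no upper bound
SET @chunk_start = @chunk_end;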

I am not sure, but you could do this with a cron job.
Process: add one more field to tableA, for example is_update, with a default value of 0, and set a cron job to run every minute. Each run picks 10,000 records that have is_update = 0, updates them, and sets is_update to 1; the next run picks the next 10,000 with is_update = 0, and so on, as sketched below.
Hope this will help you.
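A minimal sketch of that job body (the column, index, and batch-table names are my own assumptions, not from the original post):
-- one-time setup: add and index the flag column
ALTER TABLE tableA ADD COLUMN is_update TINYINT NOT NULL DEFAULT 0;
CREATE INDEX idx_tablea_is_update ON tableA (is_update);

-- job body, run every minute until no rows with is_update = 0 remain
CREATE TEMPORARY TABLE batch AS
    SELECT id FROM tableA WHERE is_update = 0 LIMIT 10000;

UPDATE tableA a
INNER JOIN batch ON batch.id = a.id
LEFT JOIN tableB b ON a.colA = b.colA
SET a.colB = IFNULL(b.colB, a.colB),  -- rows with no match keep their value
    a.is_update = 1;                  -- but are still flagged as processed

DROP TEMPORARY TABLE batch;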

For updating around 70 million records of a single MySQL table, I wrote a stored procedure to update the table in chunks of 5000. It took approximately 3 hours to complete.
DELIMITER $$
DROP PROCEDURE IF EXISTS update_multiple_example_proc$$
CREATE PROCEDURE update_multiple_example_proc()
BEGIN
    DECLARE x BIGINT;
    SET x = 1;
    WHILE x <= <MAX_PRIMARY_KEY_TO_REACH> DO
        UPDATE tableA A
        JOIN tableB B ON A.col1 = B.col1
        SET A.col2_to_be_updated = B.col2_to_be_updated
        WHERE A.id BETWEEN x AND x + 4999;  -- BETWEEN is inclusive, so step 5000 covers each id once
        SET x = x + 5000;
    END WHILE;
END$$
DELIMITER ;
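Then run it, once the placeholder has been replaced with the table's maximum id:
CALL update_multiple_example_proc();
DROP PROCEDURE update_multiple_example_proc;  -- optional cleanup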

Look at the oak-chunk-update tool. It is one of the best tools if you want to update a billion rows, too ;)

Related

Best way to write SQL delete statement, deleting pairs of records

I have a MySQL database with just 1 table:
Fields are: blocknr (not unique), btcaddress (not unique), txid (not unique), vin, vinvoutnr, netvalue.
Indexes exist on both btcaddress and txid.
Data in it looks like this (the original post shows sample rows as an image, with an example deletable pair highlighted in red).
I need to delete all "deletable" record pairs.
Conditions are:
txid must be the same (there can be more than 2 records with the same txid)
vinvoutnr must be the same
vin must be different (vin can only be 0 or 1, so one record of the pair must have 0 and the other 1)
In a table of 36M records, about 33M records will be deleted.
I've used this:
delete t1
from registration t1
inner join registration t2
on t1.txid = t2.txid and t1.vinvoutnr = t2.vinvoutnr and t1.vin <> t2.vin;
It works but takes 5 hours.
Maybe this would work too (not tested yet):
delete t1
from registration as t1, registration as t2
where t1.txid=t2.txid and t1.vinvoutnr=t2.vinvoutnr and t1.vin<>t2.vin;
Or should I forget about a delete query and instead make a new table with all the non-deletable rows in it, then drop the original?
Database can be offline for this delete query.
Based on your question, you are deleting most of the rows in the table. That is just really expensive. A better approach is to empty the table and re-populate it:
create table temp_registration as
<query for the rows to keep here>;
truncate table registration;
insert into registration
select *
from temp_registration;
Your logic is a bit hard to follow, but I think the logic on the rows to keep is:
select r.*
from registration r
where not exists (select 1
                  from registration r2
                  where r2.txid = r.txid and
                        r2.vinvoutnr = r.vinvoutnr and
                        r2.vin <> r.vin
                 );
For best performance, you want an index on registration(txid, vinvoutnr, vin).
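For example (the index name is my own):
create index idx_registration_txid_vinvoutnr_vin on registration (txid, vinvoutnr, vin);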
Given that you expect to remove the majority of your data, it does sound like the simplest approach would be to create a new table with the correct data and then drop the original table, as you suggest. Otherwise, ADyson's corrections to the JOIN query might help to alleviate the performance issue.

improve a SELECT SQL query

My data scheme is really simple, let's say it's about farms:
tableA is the main one, with an important field "is_active" indicating the farm is trusted (kind of)
tableB is a data store of serialized arrays of farm statistics
I want to retrieve all data about active farms, so I just do something like this:
SELECT * FROM tableA LEFT JOIN tableB ON id_tableA=id_tableB WHERE is_active=1 ORDER BY id_tableA DESC;
Right now the query takes 15 sec to execute straight from a SQL shell. For comparison, if I want to retrieve all data from tableB, like:
SELECT * FROM tableB ORDER BY id_tableB DESC;
it takes less than 1 sec (approx 1200 rows)...
Any ideas how to improve the original query?
Thanks
Create indexes on the keys joining the two tables.
Check this link on how to create indexes in MySQL:
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
You'll have to create an index.
You could create the following index:
mysql> create index ix_a_active_id on tableA (id_tableA, is_active);
mysql> create index ix_b_id on tableB (id_tableB);
The first creates an index on BOTH the id and is_active columns.
The second creates an index on the id of tableB.
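You can then check that the optimizer actually uses them (a quick verification, not part of the original answer):
EXPLAIN SELECT * FROM tableA
LEFT JOIN tableB ON id_tableA = id_tableB
WHERE is_active = 1
ORDER BY id_tableA DESC;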

Update and show mysql query

UPDATE myTable SET niceColumn=1 WHERE someVal=1;
SELECT * FROM myTable WHERE someVal=1;
Is there a way to combine these two queries into one? I mean, can I run an update query that also shows the rows it updated? As written, I repeat the same WHERE filter twice, which I don't want. Also, if someVal changes between the update and the select I will get the wrong rows back (e.g. the update changes a row, and right afterwards another script sets its someVal to 0).
Wrap the two queries in a transaction with the desired ISOLATION LEVEL, so that no other threads can affect the locked rows between the update and the select.
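A minimal sketch of that idea (assuming InnoDB, where the UPDATE holds row locks until COMMIT):
START TRANSACTION;
UPDATE myTable SET niceColumn = 1 WHERE someVal = 1;
-- the updated rows are locked: no other session can change them
-- until we commit, so the SELECT sees exactly what we updated
SELECT * FROM myTable WHERE someVal = 1;
COMMIT;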
Actually, even what you have done will not show exactly the rows it updated, because meanwhile (after the update) some process may add or change rows.
And the SELECT will show all matching records, including the ones updated yesterday :)
If I wanted to see exactly which rows were changed, I would go with a temp table: first select into a temp table the IDs of all rows to be updated, then perform the update based on the IDs in the temp table, and then return the rows listed in the temp table.
CREATE TEMPORARY TABLE to_be_updated
SELECT id
FROM myTable
WHERE someVal = 1;
UPDATE myTable
SET niceColumn = 1
WHERE id IN (SELECT * FROM to_be_updated);
SELECT *
FROM myTable
WHERE id IN (SELECT * FROM to_be_updated);
If, in your real code, the conditional part (the WHERE clause and so on) is too long to repeat, just put it in a variable that you use in both queries, as sketched below.
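In pure SQL that means building the statement dynamically (a sketch using MySQL prepared statements; in application code you would simply reuse a string variable):
SET @cond = 'someVal = 1';
SET @sql = CONCAT('UPDATE myTable SET niceColumn = 1 WHERE ', @cond);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;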
Unless you encounter a different problem, you shouldn't need these two combined.

Mysql Faster UPDATE

I have 2 InnoDB tables. In TableA I have a column (guidNew) whose values I want to assign to a column in TableB (owner), depending on the relation between TableA (guid) and TableB (owner).
Basically TableB (owner) has multiple entries that correspond to one TableA (guid); it is a many-to-one relation. I want to change the TableB (owner) values to the new TableA (guidNew) values.
This is an example of the query:
UPDATE `TableB`, `TableA`
SET
`TableB`.`owner` = `TableA`.`guidNew`
WHERE `TableB`.`guid` != 0
AND `TableB`.`owner` = `TableA`.`guid`;
Now, I do not know whether this is working or not because there are more than 2 million entries. Is there a way to know its progress AND, more importantly, a way to do it faster?
Make sure that you have indexed the guid and owner columns.
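For example (the index names here are illustrative):
CREATE INDEX idx_tableb_owner ON TableB (owner);
CREATE INDEX idx_tablea_guid ON TableA (guid);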
Try using the EXPLAIN command to see how the query is being performed:
EXPLAIN SELECT TableB.owner, TableA.guidNew
FROM TableB, TableA
WHERE TableB.guid != 0
AND TableB.owner = TableA.guid

SQL Server 2008 bulk update using stored procedure

I have 2 tables in the DB, each with a "Name" column and a "Count" column.
I would like to update the Count column in the second table from the Count in the first table, but only where the "Name" columns are equal.
Example:
First Table:
Name   Count
jack   25
mike   44

Second Table:
Name   Count
jack   23
mike   9
david  88

Result (the second table would look like this):
Name   Count
jack   25
mike   44
david  88
NOTES:
1. Both tables are huge (although the second table is bigger...).
2. The update must be as fast as possible (if there are options other than stored procedures, I would love to hear them).
3. "Count" is defined as bigint, while "Name" is nvarchar(100).
4. The "Count" field in the first table is always bigger than the equivalent in the second table.
I think there are more options (other than a stored procedure), maybe with MERGE or a TRANSACTION, as long as it is the fastest way...
Thanks!
The best way would be to keep it simple
UPDATE Table2
SET Count = t1.Count
FROM Table1 t1
WHERE Table2.Name = t1.Name
AND Table2.Count <> t1.Count
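Since you mention MERGE: an equivalent statement (a sketch, assuming the tables are literally named Table1 and Table2) would be:
MERGE Table2 AS t2
USING Table1 AS t1
    ON t2.Name = t1.Name
WHEN MATCHED AND t2.Count <> t1.Count THEN
    UPDATE SET t2.Count = t1.Count;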
If the performance of this query is not satisfactory due to the size of your tables, the best solution would be to partition the tables based on the Name field. The query can then be run from different threads at the same time, with an extra filter based on Name to satisfy the partition function.
For example: (assuming name is a varchar(20) column)
UPDATE Table2
SET Count = t1.Count
FROM Table1 t1
WHERE Table2.Name = t1.Name
AND Table2.Count <> t1.Count
AND Table2.Name between cast('Jack' as varchar(20))
and cast('Mike' as varchar(20))
(The cast on the strings is a big help for SQL Server to properly do partition elimination.)