I'm having some issues updating and inserting millions of rows in a MySQL database. I need to flag 50 million rows in Table A, insert some data from those flagged 50 million rows into Table B, then update those same 50 million rows in Table A again. There are about 130 million rows in Table A and 80 million in Table B.
This needs to happen on a live server without denying access to other queries from the website. The problem is that while this stored procedure is running, other queries from the website end up locked and the HTTP requests time out.
Here's the gist of the SP, simplified a little for illustration purposes:
DELIMITER $$
CREATE DEFINER=`user`@`localhost` PROCEDURE `MyProcedure`(
totalLimit int
)
BEGIN
SET @totalLimit = totalLimit;
/* Prepare new rows to be issued */
PREPARE STMT FROM 'UPDATE tableA SET `status` = "Being-Issued" WHERE `status` = "Available" LIMIT ?';
EXECUTE STMT USING @totalLimit;
/* Insert new rows for usage into tableB */
INSERT INTO tableB (/* my fields */)
SELECT /* some values from TableA */
FROM tableA
WHERE `status` = "Being-Issued";
/* Set rows as being issued */
UPDATE tableA SET `status` = 'Issued' WHERE `status` = 'Being-Issued';
END$$
DELIMITER ;
Processing 50M rows three times will be slow irrespective of what you're doing.
Make sure your updates affect smaller, disjoint sets, and execute them one by one rather than all of them within the same transaction.
If you're doing this already and MySQL is misbehaving, try this slight tweak to your code:
create a temporary table
begin
insert into tmp_table
select your stuff
limit ?
for update
do your update on A using tmp_table
commit
begin
do your insert on B using tmp_table
do your update on A using tmp_table
commit
This should keep the locks held for a minimal amount of time.
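For concreteness, a sketch of that pattern against the tables from the question could look like the following. The `id` key, the `a_id`/`some_value` columns and the batch size of 10,000 are placeholders, since the real column lists aren't shown:
/* Sketch only: id, a_id, some_value and the 10000 batch size are placeholders */
CREATE TEMPORARY TABLE tmp_batch (id BIGINT PRIMARY KEY);
/* Transaction 1: pick a small batch, lock it, and flag it */
START TRANSACTION;
INSERT INTO tmp_batch (id)
SELECT id
FROM tableA
WHERE `status` = 'Available'
LIMIT 10000
FOR UPDATE;
UPDATE tableA a
JOIN tmp_batch t ON t.id = a.id
SET a.`status` = 'Being-Issued';
COMMIT;
/* Transaction 2: copy the batch into tableB and mark it as issued */
START TRANSACTION;
INSERT INTO tableB (a_id, some_value)
SELECT a.id, a.some_value
FROM tableA a
JOIN tmp_batch t ON t.id = a.id;
UPDATE tableA a
JOIN tmp_batch t ON t.id = a.id
SET a.`status` = 'Issued';
COMMIT;
DROP TEMPORARY TABLE tmp_batch;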
What about this? It basically calls the original stored procedure in a loop until the total amount needed is reached, with a sleep period between calls (e.g. 2 seconds) to allow other queries to be processed.
increment is the amount to do at one time (using 10,000 in this case)
totalLimit is the total amount to be processed
sleepSec is the amount of time to rest between calls
BEGIN
SET @x = 0;
REPEAT
SELECT SLEEP(sleepSec);
SET @x = @x + increment;
CALL OriginalProcedure( increment );
UNTIL @x >= totalLimit
END REPEAT;
END$$
Obviously it could use a little math to make sure the increment doesn't go over the total limit if it's not evenly divisible, but it appears to work (by "work" I mean it allows other queries to still be processed from web requests), and it seems to be faster overall as well.
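For example, the last call could be capped with something like this (just a sketch, reordering the loop body so the remaining amount is computed before @x is incremented):
REPEAT
SELECT SLEEP(sleepSec);
/* never ask for more than what is left of totalLimit */
CALL OriginalProcedure( LEAST(increment, totalLimit - @x) );
SET @x = @x + increment;
UNTIL @x >= totalLimit
END REPEAT;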
Any insight here? Is this a good idea? Bad idea?
Related
I have a table which has 20 million records. I have recently added another column to that table.
I need to update the data in that column.
I'm using MySQL Community Edition. When I execute a direct update like this:
Update Employee SET Emp_mail = 'xyz_123@gmail.com'
the system hangs and I have to abort the execution.
But when I run the update with a filter condition, it executes fine:
Update Employee SET Emp_mail = 'xyz_123@gmail.com' where ID <= 10000;
Update Employee SET Emp_mail = 'xyz_123@gmail.com' where ID > 10000 AND ID <= 20000;
... and so on, repeated n number of times.
Now I'm looking for a looping script where I can execute the update chunk-wise.
For example, in SQL it would be something like this, but I'm not sure of the MySQL syntax:
BEGIN
I int = 0 ;
cnt = 0 ;
while 1 > cnt
SET i = i + 1;
Update Employee SET Emp_mail = 'xyz_123#gmail.com' where ID >= cnt AND ID <= I
END
Note: this is a rough script; syntax-wise there may be some errors, please ignore them.
I'm looking for looping in MySQL.
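Something along these lines is what I have in mind for MySQL (a rough sketch only: the procedure name and the 10,000 chunk size are placeholders, and ID is assumed to be an integer primary key):
DELIMITER $$
CREATE PROCEDURE update_emp_mail_in_chunks()
BEGIN
/* placeholders: procedure name and the 10000 chunk size are illustrative */
DECLARE from_id INT DEFAULT 0;
DECLARE max_id INT DEFAULT 0;
SELECT MAX(ID) INTO max_id FROM Employee;
WHILE from_id < max_id DO
/* each chunk is its own statement, so it commits on its own under autocommit */
UPDATE Employee SET Emp_mail = 'xyz_123@gmail.com'
WHERE ID > from_id AND ID <= from_id + 10000;
SET from_id = from_id + 10000;
END WHILE;
END$$
DELIMITER ;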
In a row-based database system such as MySQL, if you need to update each and every row, you should really explore a different approach:
ALTER TABLE original_table RENAME TO original_table_dropme;
CREATE TABLE original_table LIKE original_table_dropme;
ALTER TABLE original_table ADD emp_mail VARCHAR(128);
INSERT INTO original_table SELECT *,'xyz_123@gmail.com'
FROM original_table_dropme;
Then, maybe keep the original table for a while - especially to transfer any constraints, primary keys and grants from the old table to the new - and finally drop the %_dropme table.
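For example, something along these lines (a sketch only: the key column, index name, schema and grantee below are assumptions, since the real definitions aren't shown):
/* Placeholders: emp_id, idx_emp_mail, mydb and 'app_user'@'%' are assumptions */
ALTER TABLE original_table ADD PRIMARY KEY (emp_id);
ALTER TABLE original_table ADD INDEX idx_emp_mail (emp_mail);
GRANT SELECT, UPDATE ON mydb.original_table TO 'app_user'@'%';
DROP TABLE original_table_dropme;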
In a row-based database, updating a previously empty column to a value makes each row longer than it originally was and requires an internal reorganisation. If you do that for millions of rows, the total effort adds up considerably.
I have a procedure in a MySQL database that should collect some data from multiple tables and loop through it.
I have created a table to insert this data into; the table has a primary key and a non-unique index.
The inserted data is about 200,000 rows. The insert is done in a few seconds, but the while loop takes a very long time (about 30 minutes) to complete!
The while loop code is something like this:
SET @I = 0;
WhileLoop: WHILE (1=1) DO
SELECT KeyRW
INTO @I
FROM MyTable
WHERE KeyRW > @I
ORDER BY KeyRW
LIMIT 1;
IF @I IS NULL THEN
LEAVE WhileLoop;
END IF;
/* some simple calculation... */
END WHILE WhileLoop;
We moved the loop code into another procedure and executed it manually after the insert was done (there was a few minutes' delay), and it executed much faster!
Then we moved the loop back into the previous procedure and added a delay before it, and now it's working. It seems that MySQL is indexing the data asynchronously and we should wait for it!
Did I understand correctly?
If yes, then how long should we wait, based on the size of the data?
If not, then what is the problem, and why does a delay solve it?
I'm having issues where an update statement should (to my knowledge) update 5 rows (I have selected 5 rows into a temp table and used an INNER JOIN in the update statement).
However, when the update statement runs, it updates anything that could have been selected into the temp table, not just the joined contents of the temp table itself.
I'm using FOR UPDATE in the selection statement to lock the rows, as I'm expecting multiple queries to be aimed at this table at one time (note: removing it does not change the erroneous behaviour).
I've generalised the entire code base and it still has the same effect. I've been on this for the past few days and I'm sure it's just something silly I must be doing.
Code description
TABLE `data`.`data_table` - table to store data and to show it has been picked up by the external program.
Stored Procedure `admin`.`func_fill_table` - debug code to populate the above table.
Stored Procedure `data`.`func_get_data` - the actual code, designed to retrieve a batch-sized set of records, mark them as picked up and then return them to an external application.
Basic Setup Code
DROP TABLE IF EXISTS `data`.`data_table`;
DROP PROCEDURE IF EXISTS `admin`.`func_fill_table`;
DROP PROCEDURE IF EXISTS `data`.`func_get_data`;
DROP SCHEMA IF EXISTS `data`;
DROP SCHEMA IF EXISTS `admin`;
CREATE SCHEMA `admin`;
CREATE SCHEMA `data`;
CREATE TABLE `data`.`data_table` (
`identification_field_1` char(36) NOT NULL,
`identification_field_2` char(36) NOT NULL,
`identification_field_3` int(11) NOT NULL,
`information_field_1` int(11) NOT NULL,
`utc_action_time` datetime NOT NULL,
`utc_actioned_time` datetime DEFAULT NULL,
PRIMARY KEY (`identification_field_1`,`identification_field_2`,`identification_field_3`),
KEY `NC_IDX_data_table_action_time` (`utc_action_time`)
);
Procedure Creation
DELIMITER //
CREATE PROCEDURE `admin`.`func_fill_table`(
IN records int
)
BEGIN
IF records < 1
THEN SET records = 50;
END IF;
SET @processed = 0;
SET @action_time = NULL;
WHILE @processed < records
DO
SET @action_time = DATE_ADD(now(), INTERVAL FLOOR(RAND()*(45)-10) MINUTE); # time shorter for temp testing
SET @if_1 = UUID();
SET @if_2 = UUID();
INSERT INTO data.data_table(
identification_field_1
,identification_field_2
,identification_field_3
,information_field_1
,utc_action_time
,utc_actioned_time)
VALUES (
@if_1
,@if_2
,FLOOR(RAND()*5000+1)
,FLOOR(RAND()*5000+1)
,@action_time
,NULL);
SET @processed = @processed + 1;
END WHILE;
END
//
CREATE PROCEDURE `data`.`func_get_data`(
IN batch int
)
BEGIN
IF batch < 1
THEN SET batch = 1; /*Minimum Batch Size of 1 */
END IF;
DROP TABLE IF EXISTS `data_set`;
CREATE TEMPORARY TABLE `data_set`
SELECT
`identification_field_1` as `identification_field_1_local`
,`identification_field_2` as `identification_field_2_local`
,`identification_field_3` as `identification_field_3_local`
FROM `data`.`data_table`
LIMIT 0; /* Create a temp table using the same data format as the table but insert no data*/
SET SESSION sql_select_limit = batch;
INSERT INTO `data_set` (
`identification_field_1_local`
,`identification_field_2_local`
,`identification_field_3_local`)
SELECT
`identification_field_1`
,`identification_field_2`
,`identification_field_3`
FROM `data`.`data_table`
WHERE
`utc_actioned_time` IS NULL
AND `utc_action_time` < NOW()
FOR UPDATE; #Select out the rows to process (up to batch size (eg 5)) and lock those rows
UPDATE
`data`.`data_table` `dt`
INNER JOIN
`data_set` `ds`
ON (`ds`.`identification_field_1_local` = `dt`.`identification_field_1`
AND `ds`.`identification_field_2_local` = `dt`.`identification_field_2`
AND `ds`.`identification_field_3_local` = `dt`. `identification_field_3`)
SET `dt`.`utc_actioned_time` = NOW();
# Update the table to say these rows are being processed
select ROW_COUNT(),batch;
#Debug output for rows altered (should be maxed by batch number)
SELECT * FROM
`data`.`data_table` `dt`
INNER JOIN
`data_set` `ds`
ON (`ds`.`identification_field_1_local` = `dt`.`identification_field_1`
AND `ds`.`identification_field_2_local` = `dt`.`identification_field_2`
AND `ds`.`identification_field_3_local` = `dt`. `identification_field_3`);
# Debug output of the rows that should have been modified
SELECT
`identification_field_1_local`
,`identification_field_2_local`
,`identification_field_3_local`
FROM
`data_set`; /* Output data to external system*/
/* Commit the in-process field and allow other processes to access those rows again */
END;
//
Run Code
call `admin`.`func_fill_table`(5000);
call `data`.`func_get_data`(5);
You are misusing the sql_select_limit setting:
The maximum number of rows to return from SELECT statements.
It only applies to SELECT statements (to limit the results sent to the client), not to INSERT ... SELECT .... It is intended as a safeguard to prevent users from being accidentally flooded with millions of results, not as another LIMIT function.
While you generally cannot use a variable for LIMIT, you can do so in a stored procedure (MySQL 5.5+):
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. LIMIT takes one or two numeric arguments, which must both be nonnegative integer constants, with these exceptions: [...]
Within stored programs, LIMIT parameters can be specified using integer-valued routine parameters or local variables.
So in your case, you can simply use
...
FROM `data`.`data_table`
WHERE `utc_actioned_time` IS NULL AND `utc_action_time` < NOW()
LIMIT batch
FOR UPDATE;
I am trying to update a MySQL InnoDB table with c. 100 million rows. The query takes close to an hour, which is not a problem.
However, I'd like to split this update into smaller chunks in order not to block table access. This update does not have to be an isolated transaction.
At the same time, the splitting of the update should not be too expensive in terms of additional overhead.
I considered looping through the table in a procedure using:
UPDATE TABLENAME SET NEWVAR=<expression> LIMIT batchsize, offset,
But UPDATE does not have an offset option in MySQL.
I understand I could try to UPDATE ranges of data that are SELECTed on a key, together with the LIMIT option, but that seems rather complicated for that simple task.
I ended up with the procedure listed below. It works, but I am not sure whether it is efficient given all the queries needed to identify consecutive ranges. It can be called with the following arguments (example):
call chunkUpdate('SET var=0','someTable','theKey',500000);
Basically, the first argument is the update command (e.g. something like "set x = ..."), followed by the MySQL table name, followed by the name of a numeric (integer) key column that has to be unique, followed by the size of the chunks to be processed. The key should have an index for reasonable performance. The "n" variable and the "select" statements in the code below can be removed and are only for debugging.
delimiter //
CREATE PROCEDURE chunkUpdate (IN cmd VARCHAR(255), IN tab VARCHAR(255), IN ky VARCHAR(255),IN sz INT)
BEGIN
SET @sqlgetmin = CONCAT("SELECT MIN(",ky,")-1 INTO @minkey FROM ",tab);
SET @sqlgetmax = CONCAT("SELECT MAX(",ky,") INTO @maxkey FROM ( SELECT ",ky," FROM ",tab," WHERE ",ky,">@minkey ORDER BY ",ky," LIMIT ",sz,") AS TMP");
SET @sqlstatement = CONCAT("UPDATE ",tab," ",cmd," WHERE ",ky,">@minkey AND ",ky,"<=@maxkey");
SET @n=1;
PREPARE getmin from @sqlgetmin;
PREPARE getmax from @sqlgetmax;
PREPARE statement from @sqlstatement;
EXECUTE getmin;
REPEAT
EXECUTE getmax;
SELECT cmd,@n AS step, @minkey AS min, @maxkey AS max;
EXECUTE statement;
set @minkey=@maxkey;
set @n=@n+1;
UNTIL @maxkey IS NULL
END REPEAT;
select CONCAT(cmd, " EXECUTED IN ",@n," STEPS") AS MESSAGE;
END//
I have to read 460,000 records from one database and update those records in another database. Currently, I read all of the records in (select * from...) and then loop through them sending an update command to the second database for each record. This process is slower than I hoped and I was wondering if there is a faster way. I match up the records by the one column that is indexed (primary key) in the table.
Thanks.
I would probably optimize the fetch size for reads (e.g. setFetchSize(250)) and use JDBC batch processing for writes (e.g. a batch size of 250 records).
I am assuming your "other database" is on a separate server, so can't just be directly joined.
The key is to have fewer update statements. It can often be faster to insert your data into a new table like this:
create table updatevalues ( id int(11), a int(11), b int(11), c int(11) );
insert into updatevalues (id,a,b,c) values (1,1,2,3),(2,4,5,6),(3,7,8,9),...
update updatevalues u inner join targettable t using (id) set t.a=u.a,t.b=u.b,t.c=u.c;
drop table updatevalues;
(Batch the inserts into as many rows per statement as will fit under your configured maximum packet size, max_allowed_packet, which is usually a few megabytes.)
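For reference, you can check that limit like this (the 64 MB value below is only an example):
SHOW VARIABLES LIKE 'max_allowed_packet';
/* raising it requires the GLOBAL scope and takes effect for new connections, e.g.: */
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;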
Alternatively, find unique values and update them together:
update targettable set a=42 where id in (1,3,7);
update targettable set a=97 where id in (2,5);
...
update targettable set b=1 where id in (1,7);
...
1. USE MULTI QUERY
Ah, 'another db' means a remote database. In that case you SHOULD reduce the number of round trips to the remote DB. I suggest using MULTIPLE QUERIES, e.g. to execute 1,000 UPDATEs at once:
$cnt = 1;
$multi_query = "";
foreach ($rows as $row)
{
    $multi_query .= "UPDATE ..;";
    if ($cnt % 1000 == 0)
    {
        /* $link is the mysqli connection handle */
        mysqli_multi_query($link, $multi_query);
        /* drain the result statuses before sending the next batch */
        while (mysqli_more_results($link) && mysqli_next_result($link)) {}
        $cnt = 0;
        $multi_query = "";
    }
    ++$cnt;
}
if ($multi_query != "")
{
    /* flush whatever is left over after the loop */
    mysqli_multi_query($link, $multi_query);
}
Normally the multi-query feature is disabled (for security reasons). To use multi query:
PHP : http://www.php.net/manual/en/mysqli.quickstart.multiple-statement.php
C API : http://dev.mysql.com/doc/refman/5.0/en/c-api-multiple-queries.html
VB : http://www.devart.com/dotconnect/mysql/docs/MultiQuery.html (I'm not a VB user, so I'm not sure this covers multi query for VB)
2. USE Prepared Statement
(If you are already using prepared statements, skip this.)
You are running 460K identically structured queries, so if you use a PREPARED STATEMENT you gain two advantages.
Reduce query compile time
Without a prepared statement, every query is parsed and compiled; with a prepared statement this happens only once.
Reduce network cost
Assuming each UPDATE query is 100 bytes long and there are 4 parameters (each 4 bytes long):
without prepared statement : 100 bytes * 460K = 46M
with prepared statement : 16 bytes * 460K = 7.3M
so the reduction is not dramatic.
Here is how to use a prepared statement in VB.
What I ended up doing was using a loop to concatenate my queries together. So instead of sending one query at a time, I would send a group at a time separated by semicolons:
update sometable set x=1 where y =2; update sometable set x = 5 where y = 6; etc...
This ended up improving my time by about 40%. My update went from 3 min 23 secs to 2 min 1 second.
But there was a threshold where concatenating too many together started to slow it down again once the string got too long, so I had to tweak it until I found just the right mix. It ended up being 100 statements concatenated together that gave the best performance.
Thanks for the responses.