Update MySQL table in chunks - mysql

I am trying to update a MySQL InnoDB table with c. 100 million rows. The query takes close to an hour, which is not a problem.
However, I'd like to split this update into smaller chunks in order not to block table access. This update does not have to be an isolated transaction.
At the same time, the splitting of the update should not be too expensive in terms of additional overhead.
I considered looping through the table in a procedure using something like:
UPDATE TABLENAME SET NEWVAR=<expression> LIMIT offset, batchsize
but UPDATE does not support an offset in its LIMIT clause in MySQL.
I understand I could try to UPDATE ranges of data that are SELECTed on a key, together with the LIMIT option, but that seems rather complicated for that simple task.
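For reference, a hand-rolled version of that range idea (using the hypothetical table, key, and chunk size from the example call further below) would alternate between finding the upper key of the next chunk and updating only that key range:

```sql
-- find the upper bound of the next 500000 rows by key...
SELECT MAX(k) INTO @maxkey
  FROM (SELECT theKey AS k FROM someTable
         WHERE theKey > 0
         ORDER BY theKey LIMIT 500000) AS tmp;

-- ...then update only that key range, and repeat with @maxkey as the new lower bound
UPDATE someTable SET var = 0
 WHERE theKey > 0 AND theKey <= @maxkey;
```

The procedure below automates exactly this pattern.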

I ended up with the procedure listed below. It works but I am not sure whether it is efficient with all the queries to identify consecutive ranges. It can be called with the following arguments (example):
call chunkUpdate('SET var=0','someTable','theKey',500000);
Basically, the first argument is the update command (e.g. something like "SET x = ..."), followed by the MySQL table name, followed by the name of a numeric (integer) key column that has to be unique, followed by the size of the chunks to be processed. The key should have an index for reasonable performance. The "n" variable and the "SELECT" statements in the code below are only for debugging and can be removed.
delimiter //
CREATE PROCEDURE chunkUpdate (IN cmd VARCHAR(255), IN tab VARCHAR(255), IN ky VARCHAR(255), IN sz INT)
BEGIN
  SET @sqlgetmin = CONCAT("SELECT MIN(",ky,")-1 INTO @minkey FROM ",tab);
  SET @sqlgetmax = CONCAT("SELECT MAX(",ky,") INTO @maxkey FROM ( SELECT ",ky," FROM ",tab," WHERE ",ky,">@minkey ORDER BY ",ky," LIMIT ",sz,") AS TMP");
  SET @sqlstatement = CONCAT("UPDATE ",tab," ",cmd," WHERE ",ky,">@minkey AND ",ky,"<=@maxkey");
  SET @n=1;
  PREPARE getmin FROM @sqlgetmin;
  PREPARE getmax FROM @sqlgetmax;
  PREPARE statement FROM @sqlstatement;
  EXECUTE getmin;
  REPEAT
    EXECUTE getmax;
    SELECT cmd, @n AS step, @minkey AS min, @maxkey AS max;
    EXECUTE statement;
    SET @minkey=@maxkey;
    SET @n=@n+1;
  UNTIL @maxkey IS NULL
  END REPEAT;
  SELECT CONCAT(cmd, " EXECUTED IN ", @n, " STEPS") AS MESSAGE;
END//
delimiter ;

How to select all tables with column name and update that column

I want to find all the tables in my db that contain a column named Foo and update its value to 0. I was thinking of something like the query below, but I don't know how to place the UPDATE in that code. I plan on running this from an event inside the MySQL database (I'm using WAMP); the idea is basically to have an event run daily which sets all my 'Foo' columns to 0 without me having to do it manually:
SELECT TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE column_name LIKE 'Foo'
No, not in a single statement.
To get the names of all the tables that contain a column named Foo:
SELECT table_schema, table_name
FROM information_schema.columns
WHERE column_name = 'Foo'
Then, you'd need an UPDATE statement for each table. (It's possible to update multiple tables in a single statement, but that would require an (unnecessary) cross join.) It's better to do each table separately.
You could use dynamic SQL to execute the UPDATE statements in a MySQL stored program (e.g. PROCEDURE)
SET @sql = 'UPDATE db.tbl SET Foo = 0';
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
(Note that PREPARE ... FROM only accepts a string literal or a user variable such as @sql, not a locally DECLAREd variable, and that SQL is a reserved word in MySQL.)
If you declare a cursor for the select from information_schema.tables, you can use a cursor loop to process a dynamic UPDATE statement for each table_name returned.
DECLARE done TINYINT(1) DEFAULT FALSE;
DECLARE v_sql VARCHAR(2000);
DECLARE csr CURSOR FOR
  SELECT CONCAT('UPDATE `',c.table_schema,'`.`',c.table_name,'` SET `Foo` = 0')
    FROM information_schema.columns c
   WHERE c.column_name = 'Foo'
     AND c.table_schema NOT IN ('mysql','information_schema','performance_schema');
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

OPEN csr;
do_foo: LOOP
  FETCH csr INTO v_sql;
  IF done THEN
    LEAVE do_foo;
  END IF;
  SET @sql = v_sql;  -- PREPARE needs a user variable
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END LOOP do_foo;
CLOSE csr;
(This is just a rough outline of an example, not syntax checked or tested.)
FOLLOWUP
Some brief notes about some ideas that were probably glossed over in the answer above.
To get the names of the tables containing column Foo, we can run a query from the information_schema.columns table. (That's one of the tables provided in the MySQL information_schema database.)
Because we may have tables in multiple databases, the table_name is not sufficient to identify a table; we need to know what database the table is in. Rather than mucking with a "use db" statement before we run an UPDATE, we can just reference the table UPDATE db.mytable SET Foo....
We can use our query of information_schema.columns to go ahead and string together (concatenate) the parts we need for an UPDATE statement, and have the SELECT return the actual statements we'd need to run to update column Foo; basically this:
UPDATE `mydatabase`.`mytable` SET `Foo` = 0
But we want to substitute in the values from table_schema and table_name in place of mydatabase and mytable. If we run this SELECT
SELECT 'UPDATE `mydatabase`.`mytable` SET `Foo` = 0' AS `sql`
That returns us a single row, containing a single column (the column happens to be named sql, but the name of the column isn't important to us). The value of the column will just be a string. But the string we get back happens to be (we hope) a SQL statement that we could run.
We'd get the same thing if we broke that string up into pieces, and used CONCAT to string them back together for us, e.g.
SELECT CONCAT('UPDATE `','mydatabase','`.`','mytable','` SET `Foo` = 0') AS `sql`
We can use that query as a model for the statement we want to run against information_schema.columns. We'll replace 'mydatabase' and 'mytable' with references to columns from the information_schema.columns table that give us the database and table_name.
SELECT CONCAT('UPDATE `',c.table_schema,'`.`',c.table_name,'` SET `Foo` = 0') AS `sql`
FROM information_schema.columns c
WHERE c.column_name = 'Foo'
There are some databases we definitely do not want to update: mysql, information_schema, performance_schema. We either need to whitelist the databases containing the tables we want to update
AND c.table_schema IN ('mydatabase','anotherdatabase')
-or- we need to blacklist the databases we definitely do not want to update
AND c.table_schema NOT IN ('mysql','information_schema','performance_schema')
We can run that query (we could add an ORDER BY if we want the rows returned in a particular order) and what we get back is a list containing the statements we want to run. If we saved that set of strings as a plain text file (excluding the header row and extra formatting), adding a semicolon at the end of each line, we'd have a file we could execute from the mysql> command line client.
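As an aside, one way to produce such a file directly from the server (assuming the FILE privilege and a writable secure_file_priv location; the path here is just an example) is SELECT ... INTO OUTFILE:

```sql
SELECT CONCAT('UPDATE `',c.table_schema,'`.`',c.table_name,'` SET `Foo` = 0;')
  FROM information_schema.columns c
 WHERE c.column_name = 'Foo'
   AND c.table_schema NOT IN ('mysql','information_schema','performance_schema')
 ORDER BY c.table_schema, c.table_name
  INTO OUTFILE '/tmp/update_foo.sql';
```

The resulting file already has one semicolon-terminated statement per line, ready to feed back to the mysql client.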
(If any of the above is confusing, let me know.)
The next part is a little more complicated. The rest of this deals with an alternative to saving the output from the SELECT as a plain text file and executing the statements from the mysql command line client.
MySQL provides a facility that allows us to execute basically any string as a SQL statement, in the context of a MySQL stored program (for example, a stored procedure). The feature we're going to use is called dynamic SQL.
To use dynamic SQL, we use the statements PREPARE, EXECUTE and DEALLOCATE PREPARE. (The DEALLOCATE isn't strictly necessary; MySQL will clean up for us if we don't use it, but I think it's good practice to do it anyway.)
Again, dynamic SQL is available ONLY in the context of a MySQL stored program. To do this, we need to have a string containing the SQL statement we want to execute. As a simple example, let's say we had this:
DECLARE str VARCHAR(2000);
SET str = 'UPDATE mytable SET mycol = 0 WHERE mycol < 0';
To get the contents of str evaluated and executed as a SQL statement, we first copy it into a user variable (PREPARE can only read from a user variable or a string literal), and the basic outline is:
SET @str = str;
PREPARE stmt FROM @str;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
The next complicated part is putting that together with the query we are running to get the string values we want to execute as SQL statements. To do that, we put together a cursor loop. The basic outline is to take our SELECT statement:
SELECT bah FROM humbug
And turn that into a cursor definition:
DECLARE mycursor CURSOR FOR SELECT bah FROM humbug;
What we want to do is execute that and loop through the rows it returns. To execute the statement and prepare a resultset, we "open" the cursor:
OPEN mycursor;
When we're finished with it, we issue a "close" to release the resultset, so the MySQL server knows we don't need it anymore and can clean up and free the resources allocated to it.
CLOSE mycursor;
But, before we close the cursor, we want to "loop" through the resultset, fetching each row, and do something with the row. The statement we use to get the next row from the resultset into a procedure variable is:
FETCH mycursor INTO some_variable;
Before we can fetch rows into variables, we need to define the variables, e.g.
DECLARE some_variable VARCHAR(2000);
Since our cursor (SELECT statement) is returning only a single column, we only need one variable. If we had more columns, we'd need a variable for each column.
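For example, if the cursor instead returned both table_schema and table_name as separate columns, we'd need two variables (names illustrative):

```sql
DECLARE v_schema VARCHAR(64);
DECLARE v_table  VARCHAR(64);

-- later, inside the loop:
FETCH mycursor INTO v_schema, v_table;
```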
Eventually, we'll have fetched the last row from the result set. When we attempt to fetch the next one, MySQL is going to throw an error.
Other programming languages would let us just do a while loop, and let us fetch the rows and exit the loop when we've processed them all. MySQL is more arcane. To do a loop:
mylabel: LOOP
-- do something
END LOOP mylabel;
That by itself makes for a very fine infinite loop, because that loop doesn't have an "exit". Fortunately, MySQL gives us the LEAVE statement as a way to exit a loop. We typically don't want to exit the loop the first time we enter it, so there's usually some conditional test we use to determine whether we're done and should exit the loop, or we're not done and should go around the loop again.
mylabel: LOOP
-- do something useful
IF some_condition THEN
LEAVE mylabel;
END IF;
END LOOP mylabel;
In our case, we want to loop through all of the rows in the resultset, so we're going to put a FETCH as the first statement inside the loop (the "something useful" we want to do).
To get a linkage between the error that MySQL throws when we attempt to fetch past the last row in the resultset and the conditional test that determines whether we should leave, MySQL lets us define a CONTINUE HANDLER (some statement we want performed) for when the error is thrown:
DECLARE CONTINUE HANDLER FOR NOT FOUND
The action we want to perform is to set a variable to TRUE.
SET done = TRUE;
Before we can run the SET, we need to define the variable:
DECLARE done TINYINT(1) DEFAULT FALSE;
With that we, can change our LOOP to test whether the done variable is set to TRUE, as the exit condition, so our loop looks something like this:
mylabel: LOOP
FETCH mycursor INTO some_variable;
IF done THEN
LEAVE mylabel;
END IF;
-- do something with the row
END LOOP mylabel;
The "do something with the row" is where we want to take the contents of some_variable and do something useful with it. Our cursor is returning us a string that we want to execute as a SQL statement. And MySQL gives us the dynamic SQL feature we can use to do that.
NOTE: MySQL has rules about the order of the statements in the procedure. For example, the DECLARE statements have to come at the beginning, and I think the CONTINUE HANDLER has to be the last thing declared.
Again: The cursor and dynamic SQL features are available ONLY in the context of a MySQL stored program, such as a stored procedure. The example I gave above was only the example of the body of a procedure.
To get this created as a stored procedure, it would need to be incorporated as part of something like this:
DELIMITER $$
DROP PROCEDURE IF EXISTS myproc $$
CREATE PROCEDURE myproc()
NOT DETERMINISTIC
MODIFIES SQL DATA
BEGIN
-- procedure body goes here
END$$
DELIMITER ;
Hopefully, that explains the example I gave in a little more detail.
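Putting the loop body and the wrapper together, the whole thing might look like this (an untested sketch; the procedure name set_foo_to_zero is just an example):

```sql
DELIMITER $$

DROP PROCEDURE IF EXISTS set_foo_to_zero $$
CREATE PROCEDURE set_foo_to_zero()
NOT DETERMINISTIC
MODIFIES SQL DATA
BEGIN
  DECLARE done  TINYINT(1) DEFAULT FALSE;
  DECLARE v_sql VARCHAR(2000);
  DECLARE csr CURSOR FOR
    SELECT CONCAT('UPDATE `',c.table_schema,'`.`',c.table_name,'` SET `Foo` = 0')
      FROM information_schema.columns c
     WHERE c.column_name = 'Foo'
       AND c.table_schema NOT IN ('mysql','information_schema','performance_schema');
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

  OPEN csr;
  do_foo: LOOP
    FETCH csr INTO v_sql;
    IF done THEN
      LEAVE do_foo;
    END IF;
    -- PREPARE can only read a user variable or a literal, so copy first
    SET @sql = v_sql;
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
  END LOOP do_foo;
  CLOSE csr;
END $$

DELIMITER ;
```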
This should get all tables in your database and generate an UPDATE statement for each one that sets column foo. Copy and run it, then copy the output and run that as SQL:
SELECT CONCAT('UPDATE ', table_name, ' SET foo = 0;') FROM information_schema.tables
WHERE table_schema = 'Your database name here' AND table_type = 'BASE TABLE';

MySQL and JDBC caching (?) issue with procedure call in Scala

I'm working on a project that parses text units (referenced as "ngrams") from a string and stores them in a MySQL database.
In MySQL I have the following procedure that's supposed to store an ngram in a specific dataset (table) if it's not already there and return the id of the ngram:
CREATE PROCEDURE `add_ngram`(IN ngram VARCHAR(400), IN dataset VARCHAR(128), OUT ngramId INT)
BEGIN
  -- try to get the id of the ngram
  SET @s = CONCAT('SELECT `id` INTO @ngramId FROM `mdt_', dataset, '_b1` WHERE `ngram` = ''', ngram, ''' LIMIT 1');
  PREPARE stm FROM @s;
  EXECUTE stm;
  SET ngramId = @ngramId;
  -- if the id could not be retrieved
  IF ngramId IS NULL THEN BEGIN
    -- insert the ngram into the dataset
    SET @s = CONCAT('INSERT INTO `mdt_', dataset, '_b1`(`ngram`) VALUES (''', ngram, ''')');
    PREPARE stm FROM @s;
    EXECUTE stm;
    SET ngramId = LAST_INSERT_ID();
  END;
  END IF;
END$$
A dataset table has only two columns: id, an auto-incremented int that serves as the primary key, and ngram, a varchar(400) that serves as a unique index.
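For context, a dataset table of that shape would look something like this (the table name is hypothetical; the real tables follow the mdt_<dataset>_b1 naming used in the procedure):

```sql
CREATE TABLE `mdt_sample_b1` (
  `id`    INT NOT NULL AUTO_INCREMENT,
  `ngram` VARCHAR(400) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `ngram` (`ngram`)
) ENGINE=InnoDB;
```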
In my Scala app I have a method that takes in a string, splits it into ngrams and then returns a Seq of the ngrams' ids by passing the ngrams to the above procedure:
private def processNgrams(text: String, dataSet: String): Seq[Int] = {
  val ids = parser.parse(text).map(ngram => {
    val query = dbConn.prepareCall("CALL add_ngram(?,?,?)")
    query.setString(1, ngram)
    query.setString(2, dataSet)
    query.registerOutParameter(3, java.sql.Types.INTEGER)
    query.executeUpdate()
    dbConn.commit()
    val id = query.getInt(3)
    Debug(s"parsed ngram - id: $id ${" " * (3 - id.toString.length)}[ $ngram ]")
    id
  })
  ids
}
dbConn in the above code is an instance of java.sql.Connection and has auto commit set to false.
When I executed this I noticed that very few ngrams were stored in the database. The debug statement from the above method showed multiple ngrams that are clearly different from each other getting the same id returned from the procedure. If I look in the database table, I can see that, for example, the ngram "i" has id "1", but ngrams inserted immediately after also get returned an id of "1". This is true of the other ngrams I looked up in the table as well. This leads me to believe that perhaps the procedure call gets cached?
I've tried a number of things, such as creating the prepared call outside of the method and reusing it and calling clearParameters every time, creating a new call inside the method every time (as it is above), even sleeping the Thread between calls, but nothing seems to make a difference.
I've also tested the procedure by manually running queries in a MySQL client and it seems to run fine, though my program executes queries at a much faster speed than I can manually, so that might make a difference.
I'm not entirely sure if it's a JDBC issue with making the call or a MySQL issue with the procedure. I'm new to scala and MySQL procedures, so forgive me if this is something really simple that's escaped me.
Figured it out! Turns out, it was the stored procedure causing all the trouble.
In the procedure I check to see if an ngram exists by doing a dynamic SQL query (since the table name is passed in) that stores a value in @ngramId, which is a session variable. I store it in @ngramId and not in ngramId (a procedure output parameter) because prepared statements can only select into session variables (or so I've been told by an error when I originally created the procedure). Next I copy the value of @ngramId into ngramId and check if ngramId is null to determine whether the ngram exists in the table; if null, the ngram is inserted into the table and ngramId is set to the last inserted id.
The problem with this is that because @ngramId is a session variable, and because I used the same database connection for all procedure calls, the value of @ngramId persisted between calls. So, for example, if I made a call with the ngram "I" and it was found in the database with id 1, @ngramId now had the value 1. If I then tried to insert another ngram that did not exist in the table, the dynamic SELECT statement did not return anything, so the value of @ngramId remained 1. Since the output parameter ngramId is populated with the value of @ngramId and was no longer NULL, it bypassed the IF statement that inserts the ngram into the database, and the procedure returned the id of the last ngram found in the table, resulting in the seeming caching of ngram ids.
The solution to this was to add the following line as the very first statement in the procedure:
SET @ngramId = NULL;
which resets the value of @ngramId between calls to the procedure over the same session.

MySQL temporary tables do not clear

Background - I have a DB created from a single large flat file. Instead of creating a single large table with 106 columns, I created a "columns" table which stores the column names and the id of the table that holds that data, plus 106 other tables to store the data for each column. Since not all the records have data in all columns, I thought this might be a more efficient way to load the data (maybe a bad idea).
The difficulty with this was rebuilding a single record from this structure. To facilitate this I created the following procedure:
DROP PROCEDURE IF EXISTS `col_val`;
DELIMITER $$
CREATE PROCEDURE `col_val`(IN id INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS tmp_record;
  CREATE TEMPORARY TABLE tmp_record (id INT(11), val VARCHAR(100)) ENGINE=MEMORY;
  SET @ctr = 1;
  SET @valsql = '';
  WHILE (@ctr < 107) DO
    SET @valsql = CONCAT('INSERT INTO tmp_record SELECT ',@ctr,', value FROM col',@ctr,' WHERE recordID = ',@id,';');
    PREPARE s1 FROM @valsql;
    EXECUTE s1;
    DEALLOCATE PREPARE s1;
    SET @ctr = @ctr+1;
  END WHILE;
END$$
DELIMITER ;
Then I use the following SQL where the stored procedure parameter is the id of the record I want.
CALL col_val(10);
SELECT c.`name`, t.`val`
FROM `columns` c INNER JOIN tmp_record t ON c.ID = t.id
Problem - The first time I run this it works great. However, each subsequent run returns the exact same record even though the parameter is changed. How does this persist even when the stored procedure should be dropping and re-creating the temp table?
I might be re-thinking the whole design and going back to a single table, but the problem illustrates something I would like to understand.
Unsure if it matters but I'm running MySQL 5.6 (64 bit) on Windows 7 and executing the SQL via MySQL Workbench v5.2.47 CE.
Thanks,
In MySQL stored procedures, don't put an @ symbol in front of local variables (input parameters or locally declared variables). The @id you used refers to a user variable, which is kind of like a global variable for the session you're invoking the procedure from.
In other words, @id is a different variable from id.
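Applied to the procedure above, the fix is to concatenate the input parameter id itself rather than the (different and never-assigned) user variable; a minimal sketch of the corrected line:

```sql
SET @valsql = CONCAT('INSERT INTO tmp_record SELECT ', @ctr,
                     ', value FROM col', @ctr,
                     ' WHERE recordID = ', id, ';');
```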
That's the explanation of the immediate problem you're having. However, I would not design the tables as you have done.
Since not all the records have data in all columns, I thought this might be a more efficient way to load the data
I recommend using a conventional single table, and use NULL to signify missing data.
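That is, something along these lines (column names illustrative), where absent values are simply stored as NULL:

```sql
CREATE TABLE records (
  recordID INT NOT NULL PRIMARY KEY,
  col1 VARCHAR(100) NULL,
  col2 VARCHAR(100) NULL
  -- ... one nullable column per field, up to col106
);

-- rebuilding a record is then a single-row lookup, no temporary table needed
SELECT * FROM records WHERE recordID = 10;
```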

Populate until certain amount

I am trying to create a procedure which will fill up a table until it contains a certain number of elements.
At the current moment I have
CREATE PROCEDURE PopulateTable(
IN dbName tinytext,
IN tableName tinytext,
IN amount INT)
BEGIN
DECLARE current_amount INT DEFAULT 0;
SET current_amount = SELECT COUNT(*) FROM dbName,'.',tableName;
WHILE current_amount <= amount DO
set @ddl=CONCAT('INSERT INTO `',dbName,'`.`',tableName,'`(',
'`',tableName,'_name`) ',
'VALUES(\'',tableName,'_',current_amount,'\')');
prepare stmt from @ddl;
execute stmt;
SET current_amount = current_amount + 1;
END WHILE;
END;
So what I am trying to do is: once the user calls the procedure, it will check how many elements currently exist and fill up the remaining elements.
First problem I have is that I do not know how to count the elements, so my SELECT COUNT(*) FROM dbName,'.',tableName; does not work.
I would also like a suggestion, since I am kind of new to databases: is what I am doing correct, or is there a better way to do this?
Also, if it is of any help, the table I am trying to do this to only has 2 fields: one is id, which is auto-incremented and is the primary key, and the other is profile_name, which I am populating.
Thanks to anyone for their help!
Firstly, I think you'll have a delimiter problem if you try to execute the code you pasted. The procedure declaration delimiter has to be different from the one used inside the procedure code (here ';'). You will have to use the DELIMITER statement.
Your procedure belongs to a schema; I'm not sure you can query tables from other schemas (especially without a USE statement).
This is not a good idea if your database contains tables which are not supposed to be populated through your procedure.
If you have a limited number of concerned tables, I think it would be a better idea to define one procedure for each table. That way, the table name is explicit in each procedure's code and it avoids the use of prepared statements.
Be careful about your 'amount' parameter: are you sure your server can handle the request if someone passes the maximum value for INT as the amount?
I think you should use '<' instead of '<=' in your WHILE condition.
If you want to insert a large number of rows, you'll get better performance by performing "grouped" inserts, or by generating a temporary table (for example with the MEMORY engine) containing all your rows and performing a single INSERT that selects the temporary table's content.
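A grouped insert along those lines could be built with the question's own variable names something like this (untested sketch):

```sql
-- build one multi-row INSERT instead of one statement per row
SET @ddl = CONCAT('INSERT INTO `', dbName, '`.`', tableName, '`',
                  ' (`', tableName, '_name`) VALUES ');
WHILE current_amount < amount DO
  SET @ddl = CONCAT(@ddl, '(\'', tableName, '_', current_amount, '\'),');
  SET current_amount = current_amount + 1;
END WHILE;
SET @ddl = TRIM(TRAILING ',' FROM @ddl);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
```

For very large amounts, the concatenated statement can get long, so batching it into groups of a few thousand rows would be sensible.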

MySQL - Executing intensive queries on live server

I'm having some issues dealing with updating and inserting millions of row in a MySQL Database. I need to flag 50 million rows in Table A, insert some data from the marked 50 million rows into Table B, then update those same 50 million rows in Table A again. There are about 130 million rows in Table A and 80 million in Table B.
This needs to happen on a live server without denying access to other queries from the website. The problem is while this stored procedure is running, other queries from the website end up locked and the HTTP request times out.
Here's the gist of the SP, simplified a little for illustration purposes:
DELIMITER $$
CREATE DEFINER=`user`@`localhost` PROCEDURE `MyProcedure`(
  totalLimit INT
)
BEGIN
  SET @totalLimit = totalLimit;
  /* Prepare new rows to be issued */
  PREPARE STMT FROM 'UPDATE tableA SET `status` = "Being-Issued" WHERE `status` = "Available" LIMIT ?';
  EXECUTE STMT USING @totalLimit;
  /* Insert new rows for usage into tableB */
  INSERT INTO tableB (/* my fields */)
  SELECT /* some values from TableA */
  FROM tableA
  WHERE `status` = "Being-Issued";
  /* Set rows as being issued */
  UPDATE tableB SET `status` = 'Issued' WHERE `status` = 'Being-Issued';
END$$
DELIMITER ;
Processing 50M rows three times will be slow irrespective of what you're doing.
Make sure your updates affect smaller-sized, disjoint sets, and execute each of them one by one rather than all of them within the same transaction.
If you're doing this already and MySQL is misbehaving, try this slight tweak to your code:
create a temporary table

begin
  insert into tmp_table
    select your stuff
    limit ?
    for update
  do your update on A using tmp_table
commit

begin
  do your insert on B using tmp_table
  do your update on A using tmp_table
commit

This should keep locks for a minimal time.
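Concretely, with the question's tables that outline might look like this (a sketch; the id column and the batch size of 10000 are assumptions):

```sql
CREATE TEMPORARY TABLE tmp_batch (id INT PRIMARY KEY);

START TRANSACTION;
INSERT INTO tmp_batch (id)
  SELECT id FROM tableA
   WHERE `status` = 'Available'
   LIMIT 10000;
UPDATE tableA a JOIN tmp_batch t ON a.id = t.id
   SET a.`status` = 'Being-Issued';
COMMIT;

START TRANSACTION;
INSERT INTO tableB (/* my fields */)
  SELECT /* some values */ FROM tableA a JOIN tmp_batch t ON a.id = t.id;
UPDATE tableA a JOIN tmp_batch t ON a.id = t.id
   SET a.`status` = 'Issued';
COMMIT;
```

Each transaction then touches only the 10000 rows identified in tmp_batch, so row locks are held briefly instead of across all 50M rows.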
What about this? It basically calls the original stored procedure in a loop until the total amount needed is reached, with a sleep period in between calls (like 2 seconds) to allow other queries to be processed:
increment is the amount to do at one time (using 10,000 in this case)
totalLimit is the total amount to be processed
sleepSec is the amount of time to rest between calls
CREATE PROCEDURE BatchDriver(IN increment INT, IN totalLimit INT, IN sleepSec INT)  -- name illustrative
BEGIN
  SET @x = 0;
  REPEAT
    SELECT SLEEP(sleepSec);
    SET @x = @x + increment;
    CALL OriginalProcedure( increment );
  UNTIL @x >= totalLimit
  END REPEAT;
END$$
Obviously it could use a little math to make sure the increment doesn't go over the total limit if it's not evenly divisible, but it appears to work (by "work" I mean it allows other queries to still be processed from web requests), and it seems to be faster overall as well.
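The "little math" could be as simple as capping each call at the rows still remaining, e.g. inside the loop (sketch):

```sql
REPEAT
  SELECT SLEEP(sleepSec);
  -- never request more rows than the amount still remaining
  CALL OriginalProcedure( LEAST(increment, totalLimit - @x) );
  SET @x = @x + increment;
UNTIL @x >= totalLimit
END REPEAT;
```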
Any insight here? Is this a good idea? Bad idea?