I've created a service application that uses multi-threading to process data in an InnoDB table in parallel (about 2-3 million records; the application performs no other InnoDB-related queries). Each thread runs the following queries against that table:
START TRANSACTION
SELECT FOR UPDATE (SELECT pk FROM table WHERE status='new' LIMIT 100 FOR UPDATE)
UPDATE (UPDATE table SET status='locked' WHERE pk BETWEEN X AND Y)
COMMIT
DELETE (DELETE FROM table WHERE pk BETWEEN X AND Y)
The folks at forum.percona.com advised me not to use SELECT FOR UPDATE followed by UPDATE, because the transaction takes longer to execute (two queries) and lock wait timeouts result. Their suggested sequence was (with autocommit on):
UPDATE (UPDATE table SET status='locked', thread = Z LIMIT 100)
SELECT (SELECT pk FROM table WHERE thread = Z)
DELETE (DELETE FROM table WHERE pk BETWEEN X AND Y)
and it was supposed to improve performance. Instead, however, I got even more deadlocks and lock wait timeouts than before...
I have read a lot about optimizing InnoDB and tuned the server accordingly, so my InnoDB settings are 99% OK. This is also borne out by the first scenario working fine, and better than the second one. The my.cnf file:
innodb_buffer_pool_size = 512M
innodb_thread_concurrency = 16
innodb_thread_sleep_delay = 0
innodb_log_buffer_size = 4M
innodb_flush_log_at_trx_commit=2
Any ideas why the optimization had no success?
What I understand from the description of your process is:
You have a table with many rows that need to be processed.
You select a row from that table (using FOR UPDATE) so that other threads cannot access the same row.
When you are done you update the row and commit the transaction.
And then delete the row from the database.
If this is the case, then you are doing the right thing, as this approach takes fewer locks than the second one you mentioned.
You can decrease lock contention further by removing the DELETE statement, as it will lock the whole table. Instead, add a flag (a new column named processed) and update that, then delete the rows at the end once all the threads are done processing.
You can also make the work distribution smarter by batching the workload - in your case, the row range (perhaps by PK) that each thread is going to process. In that case a simple SELECT is enough, there is no need for the FOR UPDATE clause, and it will be fast. A sketch of this follows.
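A minimal sketch of that pre-partitioning, combined with the processed flag from the previous point; the MOD-based split, the worker count of 4, and the 'processed' status value are illustrative assumptions, not part of the original setup:
-- Each worker only touches rows whose pk falls in its own slice, so a plain
-- SELECT is enough and FOR UPDATE is not needed (worker 0 of 4 shown).
SELECT pk FROM `table`
WHERE status = 'new' AND MOD(pk, 4) = 0
LIMIT 100;
-- Flag the batch instead of deleting it immediately; no other worker's slice
-- overlaps, so there is nothing to fight over.
UPDATE `table`
SET status = 'processed'
WHERE status = 'new' AND MOD(pk, 4) = 0
LIMIT 100;
-- One cleanup pass at the very end, once all workers are done:
DELETE FROM `table` WHERE status = 'processed';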
I have a table EMPLOYEE with the following columns in my MySQL (InnoDB) database:
internal_employee_id (auto incrementing PK)
external_employee_id
name
gender
exported (boolean field)
In a distributed system I want to ensure that multiple processes in the cluster each read the top 100 distinct rows of the table for which the exported column is set to false. The rows read by a process should remain locked during its calculation, so that if process1 reads rows 1-100, process2 cannot see rows 1-100 and instead picks up the next available 100 rows.
For this, I tried using PESSIMISTIC_WRITE locks, but they don't seem to serve the purpose: they block multiple processes from updating at the same time, but multiple processes can still read the same locked rows.
I tried the following Java code:
Query query = entityManager.createNativeQuery("select * from employee " +
"where exported = 0 limit 100 for update");
List<Employee> employeeListLocked = query.getResultList();
You could add a new column to the table, for example a boolean column named 'processed', and initialize all the records to false:
update EMPLOYEE set processed = 0;
When a process starts, you can, in the same transaction, select these 100 rows FOR UPDATE and then update their processed column to 1.
Query query = entityManager.createNativeQuery("select * from employee " +
"where exported = 0 and processed = 0 " +
"order by internal_employee_id desc limit 100 for update");
List<Employee> employeeListLocked = query.getResultList();
Then make the update on these 100 rows:
UPDATE EMPLOYEE eUpdate INNER JOIN (select internal_employee_id
from EMPLOYEE where exported = 0 and processed = 0
order by internal_employee_id desc limit 100) e
ON eUpdate.internal_employee_id = e.internal_employee_id
SET eUpdate.processed = 1 ;
Then the next process will not pick up the same list.
There are a couple of ways to block reads:
The session that wants to update the tables first does:
LOCK TABLES employee WRITE;
This acquires an exclusive metadata lock on the table. Then other sessions are blocked, even if they only try to read that table. They must wait for a metadata lock. See https://dev.mysql.com/doc/refman/8.0/en/lock-tables.html for more information on this.
The downside of table locks is that they lock the whole table. There's no way to use this to lock individual rows or sets of rows.
Another solution is that you must code all reads to require a shared lock:
SELECT ... FROM employee WHERE ... LOCK IN SHARE MODE;
MySQL 8.0 changes the syntax, but it works the same way:
SELECT ... FROM employee WHERE ... FOR SHARE;
These are not metadata locks, they're row locks. So you can lock individual rows or sets of rows.
A request for a shared lock on some rows won't conflict with other shared locks on those rows, but if there's an exclusive lock on the rows, the SELECT ... FOR SHARE waits. The reverse is true too: if an uncommitted transaction holds a shared lock on the rows, a request for an exclusive lock waits.
The downside of this method is that it only works if all queries that read that table have the FOR SHARE option.
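To make the interaction concrete, here is a two-session sketch against the employee table from the question; the batch size is illustrative:
-- Session A: takes exclusive row locks on its batch.
START TRANSACTION;
SELECT * FROM employee WHERE exported = 0 LIMIT 100 FOR UPDATE;
-- ... work on those rows ...
COMMIT;
-- Session B: because it asks for a shared lock on the same rows, it blocks
-- until session A commits. A plain SELECT without FOR SHARE would not block,
-- which is exactly the downside described above.
START TRANSACTION;
SELECT * FROM employee WHERE exported = 0 LIMIT 100 FOR SHARE;
COMMIT;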
All that said, I post this just to answer your question directly. I do think that the system described in the answer from Perkilis is good. I implemented a system like that recently, and it works.
Sometimes the implementation you have in mind is not the best solution, and you need to consider another way to solve the problem.
-- In a transaction by itself:
UPDATE t
SET who_has = $me -- some identifier of the process locking the rows
WHERE who_has IS NULL
LIMIT 100;
-- Grab some or all rows that you have and process them.
-- You should not need to lock them further (at least not for queue management)
SELECT ... WHERE who_has = $me ...
-- Eventually, release them, either one at a time, or all at once.
-- Here's the bulk release:
UPDATE t SET who_has = NULL
WHERE who_has = $me
-- Again, this UPDATE is in its own transaction.
Note that this general mechanism has no limitations on how long it takes to "process" the items.
Also, that extra who_has column helps you if a process crashes without releasing its items. It should be augmented by a timestamp of when the items were grabbed, and a cron job (or equivalent) should look around for any unprocessed items that have been locked for "too long".
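A sketch of that crash-recovery idea, assuming an extra grabbed_at timestamp column and a ten-minute threshold; both are illustrative additions, not part of the answer above:
-- Claim a batch and record when it was grabbed (in its own transaction, as above).
UPDATE t
SET who_has = 'worker-42', grabbed_at = NOW()
WHERE who_has IS NULL
LIMIT 100;
-- Periodic cleanup (cron or the MySQL event scheduler): release anything held
-- "too long", so a crashed worker's rows become available again.
UPDATE t
SET who_has = NULL, grabbed_at = NULL
WHERE who_has IS NOT NULL
AND grabbed_at < NOW() - INTERVAL 10 MINUTE;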
FOUND THE ANSWER:
What I needed was to use the "Skip Locked" feature. So my updated code has become:
Query query = entityManager.createNativeQuery("select * from employee " +
"where exported = 0 limit 100 for update skip locked");
List<Employee> employeeListLocked = query.getResultList();
With the help of SKIP LOCKED, any rows that are in a locked state are simply skipped by the database engine when running the SELECT (note that SKIP LOCKED requires MySQL 8.0 or later). Hope this helps you all.
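For reference, a plain-SQL sketch of the full claim-and-mark cycle around that query; the explicit transaction boundaries and the exported update are assumptions about the surrounding workflow, not part of the original code:
START TRANSACTION;
-- Rows already locked by another worker are skipped rather than waited on.
SELECT internal_employee_id FROM employee
WHERE exported = 0
LIMIT 100
FOR UPDATE SKIP LOCKED;
-- ... export the selected employees ...
-- Mark the batch so it is not picked up again (the ids stand in for those fetched above).
UPDATE employee SET exported = 1
WHERE internal_employee_id IN (1, 2, 3);
COMMIT; -- releases the row locks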
If I have a query like:
UPDATE table_x SET a = 1 WHERE id = ? AND (
SELECT SUM(a) < 100 FROM table_x
)
And
hundreds of these queries could be made at exactly the same time
I need to be certain that a never gets to more than 100
Do I need to lock the table or will table_x be locked automatically as it's a subquery?
Assuming this is an InnoDB table, you will have row-level locking. So even if 100 of these are happening at a time, only ONE transaction will be able to acquire the lock on those rows and finish processing before the next transaction occurs. There is no difference between how the update and the subquery are handled: to the InnoDB engine this is all ONE transaction, not two separate transactions.
If you want to see what is going on behind the scenes while your query runs, execute SHOW ENGINE INNODB STATUS from the command line while the query is running.
Here is a great walkthrough of what all that output means.
If you want to read more about InnoDB and row-level locking, follow the link here.
I have a table with more than 40 million records. I want to delete about 150,000 records with a SQL query:
DELETE
FROM t
WHERE date="2013-11-24"
but I get error 1206 (The total number of locks exceeds the lock table size).
I searched a lot and changed the buffer pool size:
innodb_buffer_pool_size=3GB
but it didn't work.
I also tried locking the table, but that didn't work either:
Lock Tables t write;
DELETE
FROM t
WHERE date="2013-11-24";
unlock tables;
I know one solution is to split the deletion into chunks, but I want that to be my last option.
I am using MySQL; the server OS is CentOS and the server has 4 GB of RAM.
I'd appreciate any help.
You can use LIMIT on your DELETE and try deleting data in batches of, say, 10,000 records at a time:
DELETE
FROM t
WHERE date="2013-11-24"
LIMIT 10000
You can also include an ORDER BY clause so that rows are deleted in the order specified by the clause:
DELETE
FROM t
WHERE date="2013-11-24"
ORDER BY primary_key_column
LIMIT 10000
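If the batch has to be repeated until all of that day's rows are gone, one way to script it is a small stored procedure that loops until a batch deletes nothing; the procedure name is made up, and the batch size follows the example above:
DELIMITER //
CREATE PROCEDURE purge_day_2013_11_24()
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    DELETE FROM t
    WHERE date = "2013-11-24"
    ORDER BY primary_key_column
    LIMIT 10000;
    -- ROW_COUNT() reports how many rows the DELETE just removed.
    SET affected = ROW_COUNT();
  END WHILE;
END //
DELIMITER ;
CALL purge_day_2013_11_24();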
There are a lot of quirky ways this error can occur. I will try to list one or two and perhaps the analogy holds true for someone reading this at some point.
On larger datasets, even after raising innodb_buffer_pool_size, you can hit this error when no adequate index is in place to isolate the rows in the WHERE clause, or in some cases even with the primary index (see this) and the comment from Roger Gammans:
From the 5.0 documentation for InnoDB:
If you have no indexes suitable for your statement and MySQL must scan
the entire table to process the statement, every row of the table
becomes locked, which in turn blocks all inserts by other users to the
table. It is important to create good indexes so that your queries do
not unnecessarily scan many rows.
A visual of how this error can occur, and can be difficult to solve, starts with this simple schema:
CREATE TABLE `students` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
`campusId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ix_stu_cam` (`campusId`)
) ENGINE=InnoDB;
A table with 50 million rows. FKs are not shown; they are not the issue. This table was originally for showing query performance, which is also not important here. Yet, in initializing thing=id in blocks of 1M rows, I had to apply a LIMIT during each block update to prevent other problems, using:
update students
set thing=id
where thing!=id
order by id desc
limit 1000000 ; -- 1 Million
This was all well until it got down to, say, 600,000 rows left to update, as seen by
select count(*) from students where thing!=id;
The reason I was doing that count(*) was that I repeatedly hit
Error 1206: The total number of locks exceeds the lock table size
I could keep lowering the LIMIT in the above update, but in the end I would be left with, say, 1200 mismatches in the count, and the problem just continued.
Why did it continue? Because the system filled the lock table as it scanned this large table. Sure, within the implicit transaction it might have changed those last 1200 rows to be equal, but because the lock table filled up, in reality it would abort the transaction with nothing set, and the process would stalemate.
Illustration 2:
In this example, let's say 288 rows of the 50-million-row table shown above could still be updated. Due to the end-game problem described, I would often hit a problem running this query twice:
update students set thing=id where thing!=id order by id desc limit 200 ;
But I would not have a problem with these:
update students set thing=id where thing!=id order by id desc limit 200;
update students set thing=id where thing!=id order by id desc limit 88 ;
Solutions
There are many ways to solve this, including but not limited to:
A. Create another index on a column indicating whether the data has been updated, perhaps a boolean, and incorporate it into the WHERE clause. Yet on huge tables, creating somewhat temporary indexes may be out of the question.
B. Populate a second table with the ids that are yet to be cleaned, coupled with an update-with-join pattern (a sketch follows this list).
C. Dynamically change the LIMIT value so as not to overrun the lock table. The overrun can occur when there are simply no more rows to UPDATE or DELETE (your operation), the LIMIT has not been reached, and the lock table fills up during a fruitless scan for rows that simply don't exist (seen above in Illustration 2).
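A rough sketch of option B, using the students table from the illustration above; the staging table name is made up, and populating it may itself need batching on a very large table:
-- Collect the ids that still need fixing into a staging table.
CREATE TABLE todo_ids (id INT PRIMARY KEY);
INSERT INTO todo_ids (id)
SELECT id FROM students WHERE thing != id;
-- Update by joining on the staging table, so the UPDATE no longer scans the
-- whole 50M-row table looking for mismatches.
UPDATE students s
INNER JOIN todo_ids t ON t.id = s.id
SET s.thing = s.id;
DROP TABLE todo_ids;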
The main point of this answer is to offer an understanding of why this happens, so that any reader can craft an end-game solution that fits their needs (versus, at times, fruitless changes to system variables, reboots, and prayers).
The simplest way is to create an index on the date column. I had 170 million rows and was deleting 6.5 million. I ran into the same problem and solved it by creating a non-clustered index on the column I was using in the WHERE clause; then I executed the delete query and it worked.
Delete the index afterwards if you don't need it in the future.
It is unclear to me (from reading the MySQL docs) whether the following query, run against InnoDB tables on MySQL 5.1, would take a WRITE LOCK on each of the rows the database updates internally (5000 in total) or LOCK all the rows in the batch. As the database is under really heavy load, this is very important.
UPDATE `records`
INNER JOIN (
SELECT id, name FROM related LIMIT 0, 5000
) AS `j` ON `j`.`id` = `records`.`id`
SET `name` = `j`.`name`
I'd expect it to be per row, but as I do not know a way to verify this, I decided to ask someone with deeper knowledge. If this is not the case and the database would LOCK all the rows in the set, I'd be thankful if you could explain why.
The UPDATE runs in a transaction - it's an atomic operation, which means that if one of the rows fails (because of a unique constraint, for example) it won't update any of the 5000 rows. This is one of the ACID properties of a transactional database.
Because of this, the UPDATE holds a lock on all of the rows for the entire transaction. Otherwise another transaction could further update the value of a row based on its current value (say, update records set value = value * 2). That statement should produce a different result depending on whether the first transaction commits or rolls back, so it has to wait for the first transaction to complete all 5000 updates.
If you want to release the locks, just do the update in (smaller) batches.
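A sketch of what those smaller batches could look like for the query above; the batch size of 1000 and the ORDER BY (to keep the offsets stable) are illustrative:
-- With autocommit on, each statement commits on its own and releases its locks.
UPDATE `records`
INNER JOIN (
    SELECT id, name FROM related ORDER BY id LIMIT 0, 1000
) AS `j` ON `j`.`id` = `records`.`id`
SET `records`.`name` = `j`.`name`;
-- ...then repeat with LIMIT 1000, 1000, LIMIT 2000, 1000, and so on up to 5000.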
P.S. autocommit controls whether each statement is issued in its own transaction, but it does not affect the execution of a single query.
My question is similar to:
Ignoring locked row in a MySQL query
except that I have already implemented logic close to what's suggested in the accepted answer. My question is how to set the process id initially. All servers run a query like (the code is in Ruby on Rails, but the resulting MySQL query is):
UPDATE (some_table) SET process_id=(some process_id) WHERE (some condition on row_1) AND process_id is null ORDER BY (row_1) LIMIT 100
Now what happens is that all processes try to update the same rows, they get locked, and they time out waiting for the lock. I would like the servers to ignore the rows that are locked (because after the lock is released the process_id won't be null anymore, so there is no point in waiting for the lock here).
I could try to randomize the batch of records to update, but the problem is that I want to prioritize the update based on row_1, as in the query above.
So my question is: is there a way in MySQL to check whether a record is locked and ignore it if it is?
No, there is no way to ignore already-locked rows. Your best bet will be to ensure that nothing locks any row for any extended period of time. That will ensure that any lock conflicts are very short in duration. That will generally mean "advisory" locking of rows by locking them within a transaction (using FOR UPDATE) and updating the row to mark it as "locked".
For example, first you want to find your candidate row(s) without locking anything:
SELECT id FROM t WHERE lock_expires IS NULL AND lock_holder IS NULL <some other conditions>;
Now lock only the row you want, very quickly:
START TRANSACTION;
SELECT * FROM t WHERE id = <id> AND lock_expires IS NULL AND lock_holder IS NULL FOR UPDATE;
UPDATE t SET lock_expires = <some time>, lock_holder = <me> WHERE id = <id>;
COMMIT;
(Technical note: If you are planning to lock multiple rows, always lock them in a specific order. Ascending order by primary key is a decent choice. Locking out-of-order or in random order will subject your program to deadlocks from competing processes.)
Now you can take as long as you want (less than lock_expires) to process your row(s) without blocking any other process (they won't match the row during the non-locking select, so will always ignore it). Once the row is processed, you can UPDATE or DELETE it by id, also without blocking anything.
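For completeness, a sketch of that final step; the processed column is an illustrative stand-in for however your schema marks a row as done:
-- Mark the row as done and release the advisory lock columns...
UPDATE t SET processed = 1, lock_expires = NULL, lock_holder = NULL WHERE id = <id>;
-- ...or simply remove the row once it has been handled.
DELETE FROM t WHERE id = <id>;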