Atomic read and update in MySQL with concurrent workers

Say I have multiple workers that can concurrently read and write against a MySQL table (e.g. jobs). The task for each worker is:
1. Find the oldest QUEUED job
2. Set its status to RUNNING
3. Return the corresponding ID.
Note that there may not be any qualifying (i.e. QUEUED) jobs when a worker runs step #1.
I have the following pseudo-code so far. I believe I need to cancel (ROLLBACK) the transaction if step #1 returns no jobs. How would I do that in the code below?
BEGIN TRANSACTION;
# Update the status of jobs fetched by this query:
SELECT id from jobs WHERE status = "QUEUED"
ORDER BY created_at ASC LIMIT 1;
# Do the actual update, otherwise abort (i.e. ROLLBACK?)
UPDATE jobs
SET status="RUNNING"
# HERE: Not sure how to make this conditional on the previous ID
# WHERE id = <ID from the previous SELECT>
COMMIT;

I am implementing something very similar to your case this week. A number of workers, each grabbing the "next" row in a set of rows to work on.
The pseudocode is something like this:
BEGIN;
SELECT id INTO @id FROM mytable WHERE status = 'QUEUED' LIMIT 1 FOR UPDATE;
UPDATE mytable SET status = 'RUNNING' WHERE id = @id;
COMMIT;
Using FOR UPDATE is important to avoid race conditions, i.e. more than one worker trying to grab the same row.
See https://dev.mysql.com/doc/refman/8.0/en/select-into.html for information about SELECT ... INTO.
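Putting the answer's pieces together with the question's schema, a fuller sketch might look like this (SKIP LOCKED needs MySQL 8.0+; the ORDER BY, the `SET @id = NULL`, and the NULL check are my additions, not part of the original pseudocode):

```sql
SET @id = NULL;          -- reset, so an empty queue leaves @id NULL
BEGIN;
-- Lock the oldest QUEUED row; SKIP LOCKED makes concurrent workers
-- skip rows already locked by another transaction instead of waiting.
SELECT id INTO @id
  FROM jobs
 WHERE status = 'QUEUED'
 ORDER BY created_at ASC
 LIMIT 1
   FOR UPDATE SKIP LOCKED;
-- If no row qualified, @id stays NULL and this UPDATE affects nothing,
-- so the application can ROLLBACK instead of COMMIT.
UPDATE jobs SET status = 'RUNNING' WHERE id = @id;
COMMIT;
SELECT @id;              -- the claimed job id, or NULL if none was queued
```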

It's still not quite clear what you are after. But assuming your task is: find the next QUEUED job, set its status to RUNNING, and select the corresponding ID.
In a single-threaded environment, you can just use your code. Fetch the selected ID into a variable in your application code and pass it to the UPDATE query in the WHERE clause. You don't even need a transaction, since there is only one writing statement. You can mimic this in an SQL script.
Assuming this is your current state:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | QUEUED |
| 3 | 2020-06-15 12:00:20 | QUEUED |
| 4 | 2020-06-15 12:00:30 | QUEUED |
You want to start the next queued job (which has id=2).
SET @id_for_update = (
SELECT id
FROM jobs
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1
);
UPDATE jobs
SET status = 'RUNNING'
WHERE id = @id_for_update;
SELECT @id_for_update;
You will get
@id_for_update
2
from the last select. And the table will have this state:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | RUNNING |
| 3 | 2020-06-15 12:00:20 | QUEUED |
| 4 | 2020-06-15 12:00:30 | QUEUED |
If you have multiple processes which start jobs, you would need to lock the row with FOR UPDATE. But that can be avoided using LAST_INSERT_ID():
Starting from the state above, with job 2 already running:
UPDATE jobs
SET status = 'RUNNING',
id = LAST_INSERT_ID(id)
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1;
SELECT LAST_INSERT_ID(), ROW_COUNT();
You will get:
| LAST_INSERT_ID() | ROW_COUNT() |
| ---------------- | ----------- |
| 3 | 1 |
And the new state is:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | RUNNING |
| 3 | 2020-06-15 12:00:20 | RUNNING |
| 4 | 2020-06-15 12:00:30 | QUEUED |
If the UPDATE statement affected no row (there were no queued rows) ROW_COUNT() will be 0.
There might be some risks which I am not aware of - but this is also not really how I would approach it. I would rather store more information in the jobs table. A simple example:
CREATE TABLE jobs (
id INT auto_increment primary key,
created_at timestamp not null default now(),
updated_at timestamp not null default now() on update now(),
status varchar(50) not null default 'QUEUED',
process_id varchar(50) null default null
);
and
UPDATE jobs
SET status = 'RUNNING',
process_id = 'some_unique_pid'
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1;
Now a running job belongs to a specific process and you can just select it with
SELECT * FROM jobs WHERE process_id = 'some_unique_pid';
You might even like to have more information - eg. queued_at, started_at, finished_at.
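Putting that together, one worker round could look like this sketch ('worker_42' stands for whatever unique process id the application generates):

```sql
-- Claim the oldest queued job in a single atomic statement;
-- no explicit transaction or FOR UPDATE is needed for one statement.
UPDATE jobs
   SET status = 'RUNNING',
       process_id = 'worker_42'
 WHERE status = 'QUEUED'
 ORDER BY id
 LIMIT 1;

SELECT ROW_COUNT();      -- 0 means there was nothing to claim

-- Fetch the job this worker just claimed.
SELECT * FROM jobs WHERE process_id = 'worker_42' AND status = 'RUNNING';
```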

Adding SKIP LOCKED to the SELECT query, and wrapping the work in a SQL transaction that is committed only when the job is done, avoids jobs getting stuck in status RUNNING if a worker crashes (because the uncommitted transaction will roll back). SKIP LOCKED is now supported in recent versions of the most common DBMSs.
See:
Select only unlocked rows mysql
https://dev.mysql.com/doc/refman/8.0/en/innodb-locking-reads.html#innodb-locking-reads-nowait-skip-locked
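As a sketch of that pattern (MySQL 8.0+), the transaction stays open while the job runs, so a crashed worker's uncommitted status change rolls back to QUEUED automatically (the COMPLETED status is an assumed name for illustration):

```sql
BEGIN;
-- Claim the oldest queued job; concurrent workers skip locked rows.
SELECT id INTO @id
  FROM jobs
 WHERE status = 'QUEUED'
 ORDER BY created_at
 LIMIT 1
   FOR UPDATE SKIP LOCKED;
UPDATE jobs SET status = 'RUNNING' WHERE id = @id;

-- ... run the job in the application ...

UPDATE jobs SET status = 'COMPLETED' WHERE id = @id;
COMMIT;   -- a crash before this point releases the row and reverts its status
```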

(This is not an answer to the question, but a list of caveats that you need to be aware of when using any of the real Answers. Some of these have already been mentioned.)
Replication -- You must do all the locking on the Primary. If you are using a cluster with multiple writable nodes, be aware of the inter-node delays.
Backlog -- When something breaks, you could get a huge list of tasks in the queue. This may lead to some ugly messes.
Number of 'workers' -- Don't have more than a "few" workers. If you try to have, say, 100 concurrent workers, they will stumble over each other and cause nasty problems.
Reaper -- Since a worker may crash, the task assigned to it may never get cleared. Have a TIMESTAMP on the rows so a separate (cron/EVENT/whatever) job can discover what tasks are long overdue and clear them.
If the tasks are fast enough, then the overhead of the queue could be a burden. That is, "Don't queue it, just do it."
You are right to grab the task in one transaction, then later release the task in a separate transaction. Using InnoDB's locking to hold a task is folly for anything but trivially fast actions.
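The reaper from the list above could be as small as this sketch, assuming the updated_at and process_id columns from the CREATE TABLE earlier in the page (the 10-minute threshold is an arbitrary assumption; tune it to the expected job duration):

```sql
-- Requeue jobs whose worker apparently died; run periodically
-- from cron or a MySQL EVENT.
UPDATE jobs
   SET status = 'QUEUED',
       process_id = NULL
 WHERE status = 'RUNNING'
   AND updated_at < NOW() - INTERVAL 10 MINUTE;
```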

Related

Concurrent scripts pulling jobs to do from MySQL table [duplicate]


MySQL: How to make sure update is always executed before select?

I am creating a web app that lets N number of users to enter receipt data.
A set of scanned receipts is given to users, but no more than 2 users should work on the same receipt.
i.e. User A and User B can work on receipt-1, but User C cannot work on it (another receipt, say receipt-2, should be assigned to User C).
The table structure I am using looks similar to the following.
[User-Receipt Table]
+------------+--------------+
| user_id | receipt_id |
+------------+--------------+
| 000000001 | R0000000000 |
| 000000001 | R0000000001 |
| 000000001 | R0000000002 |
| 000000002 | R0000000000 |
| 000000002 | R0000000001 |
+------------+--------------+
[Receipt Table]
+-------------+--------+
| receipt_id | status |
+-------------+--------+
| R0000000000 | 0 |
| R0000000001 | 1 |
| R0000000002 | 0 |
| R0000000003 | 2 |
+-------------+--------+
★ status — 0: not assigned, 1: assigned to one user, 2: assigned to two users
1. Select receipts from the receipt table whose status is not equal to '2'.
2. Insert the receipts fetched in step 1 along with the user to whom they are assigned.
3. Update the receipt status (0->1 or 1->2).
This is how I plan to achieve the above requirement.
The problem with this approach is that there is a chance that the select (step 1) is executed right before the update (step 3) is executed.
If this happens, a receipt that is about to reach status 2 might be fetched and assigned to another user, which violates the requirement.
How can I make sure that this does not happen?
For all purposes, use transactions:
START TRANSACTION;
-- your SQL commands
COMMIT;
Transactions either execute all of your statements or none at all, and they implicitly lock the updated rows, which is more efficient than the second approach.
You can also do it using LOCK TABLE
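A sketch of the transactional assignment (the table names user_receipt and receipt, and the variables @uid/@rid, are assumptions for illustration; FOR UPDATE keeps a concurrent session from reading the row's status until this one commits):

```sql
START TRANSACTION;
-- Lock one receipt that still has capacity (status < 2).
SELECT receipt_id INTO @rid
  FROM receipt
 WHERE status < 2
 LIMIT 1
   FOR UPDATE;
INSERT INTO user_receipt (user_id, receipt_id) VALUES (@uid, @rid);
UPDATE receipt SET status = status + 1 WHERE receipt_id = @rid;
COMMIT;
```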

mysql query miss some rows occasionally

I'm having this problem; here is my SQL statement:
select * from tb1 where id > the_max_read order by id
The table tb1 is to monitor some other tables' changes, so it keeps growing.
Variable the_max_read is the max id that program already read.
I'm running this SQL from C++ via MySQL's mysql_query function and saving the result with mysql_store_result.
DB engine is innodb.
The problem is that it misses some rows sometimes; not always, but it keeps happening.
For example, say I have this table:
| id | name |
| ------ | ----- |
| 834370 | name1 |
| 834371 | name2 |
| 834372 | name3 |
| 834373 | name4 |
| 834374 | name5 |
| 834375 | name6 |
and the_max_read = 834371; when I run the above SQL, the result only contains 834374 and 834375.
Though other programs may insert new rows into this table, I still cannot understand why it misses some rows; it's almost the simplest possible SQL.
This sounds like it might be a transaction issue, where you read before some of the transactions are committed.
Try reading uncommitted data: http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html
e.g.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
select * from tb1 where id > the_max_read order by id;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Hope that helps.

Mysql concurrent SELECT/INSERT

I keep track of user credits in a creditlog table that looks like this:
| user | quantity | action | balance |
| ---- | -------- | ------ | ------- |
| 1001 | 20 | SEND | 550 |
| 1001 | 30 | SEND | 520 |
| 1001 | 5 | SEND | 515 |
Now at first I tried to use Active Record syntax to select the latest balance, then insert a new line that computed the new balance. Then I found myself in a race condition:
| user | quantity | action | balance |
| ---- | -------- | ------ | ------- |
| 1001 | 20 | SEND | 550 |
| 1001 | 30 | SEND | 520 |
| 1001 | 5 | SEND | 545 |
(the latest balance was not picked up because of a race condition)
Next solution was using a single query to do both:
INSERT INTO creditlog (action, quantity, balance, memberId)
VALUES (:action, :quantity, (SELECT tc.balance from creditlog tc where tc.memberId=:memberId ORDER by tc.id desc limit 1) - :quantity, :memberId);
My script which tests this with 10 reqs/second would throw the following error for 2/10 queries:
SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction. The SQL statement executed was:
INSERT INTO creditlog (action, quantity, reference, balance, memberId)
VALUES (:action, :quantity, :reference, (SELECT balance from (SELECT tm.* FROM creditlog tm where tm.memberId=:memberId) tc where tc.memberId=:memberId ORDER by tc.id desc limit 1) -:quantity, :memberId, :recipientId);.
Bound with :action='send', :quantity='10', :reference='Testing:10', :memberId='10001043'.
Shouldn't the engine wait for the first operation to release the table and then start on the second one?
Does my issue relate to: How to avoid mysql 'Deadlock found when trying to get lock; try restarting transaction' ?
How can I avoid this situation and turn the concurrent requests into sequential operations?
Here is a working solution; it may not be the best one, so please help improve it.
Since transactions don't block other sessions from SQL SELECTs, I used the following approach:
LOCK TABLES creditlog WRITE;
-- query 1: extract the old balance
SELECT balance FROM creditlog WHERE memberId = :memberId ORDER BY id DESC LIMIT 1;
-- do my thing with the balance (checks and whatever)
-- query 2: insert the new row using the fetched balance
INSERT INTO creditlog (action, quantity, balance, memberId)
VALUES (:action, :quantity, oldBalance - :quantity, :memberId);
UNLOCK TABLES;
Result:
mysql> select * from creditlog order by id desc limit 40;
+--------+----------+--------+----------+---------+---------------------+
| id     | memberId | action | quantity | balance | timeAdded           |
+--------+----------+--------+----------+---------+---------------------+
| 772449 | 10001043 | send   | 10.00    | 0.00    | 2013-12-23 16:21:50 |
| 772448 | 10001043 | send   | 10.00    | 10.00   | 2013-12-23 16:21:50 |
| 772447 | 10001043 | send   | 10.00    | 20.00   | 2013-12-23 16:21:50 |
| 772446 | 10001043 | send   | 10.00    | 30.00   | 2013-12-23 16:21:50 |
| 772445 | 10001043 | send   | 10.00    | 40.00   | 2013-12-23 16:21:50 |
Solution #2
Since I was already using Redis, I tried out a redis mutex wrapper:
https://github.com/phpnode/YiiRedis/blob/master/ARedisMutex.php
This gives me a lot more flexibility, allowing me to lock only specific segments (users) of the table. Also there is no risk of deadlocking since the mutex automatically expires after X (configurable) seconds.
Here's the final version:
$this->mutex = new ARedisMutex("balance:lock:".$this->memberId);
$this->mutex->block();
//execute credit transactions for this user
$this->mutex->unlock();
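For comparison, MySQL's built-in advisory locks can give the same per-member serialization without Redis; this is a sketch of an alternative, not part of the original solution:

```sql
-- Serialize balance changes per member with a named advisory lock;
-- locks for different members are independent. Second argument: timeout in seconds.
SELECT GET_LOCK(CONCAT('balance:lock:', :memberId), 5);

-- ... SELECT the latest balance, then INSERT the new creditlog row ...

SELECT RELEASE_LOCK(CONCAT('balance:lock:', :memberId));
```

Unlike LOCK TABLES, this never blocks readers of unrelated rows, but every writer must cooperate by taking the same named lock.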

Efficiently handle current balance of account

I've been developing an application that handles accounts and transactions made over these accounts.
Currently the tables the application uses are modelled the following way:
account
+----+-----------------+---------+-----+
| id | current_balance | version | ... |
+----+-----------------+---------+-----+
| 1 | 1000 | 48902 | ... |
| 2 | 2000 | 34933 | ... |
| 3 | 100 | 103 | ... |
+----+-----------------+---------+-----+
account_transaction
+------+-------------+----------------------+---------+------------------+-----+
| id | account_id | date | value | resulting_amount | ... |
+------+-------------+----------------------+---------+------------------+-----+
| 101 | 1 | 03/may/2012 10:13:33 | 1000 | 2000 | ... |
| 102 | 2 | 03/may/2012 10:13:33 | 500 | 1500 | ... |
| 103 | 1 | 03/may/2012 10:13:34 | -500 | 1500 | ... |
| 104 | 2 | 03/may/2012 10:13:35 | -50 | 1450 | ... |
| 105 | 2 | 03/may/2012 10:13:35 | 550 | 2000 | ... |
| 106 | 1 | 03/may/2012 10:13:35 | -500 | 1000 | ... |
+------+-------------+----------------------+---------+------------------+-----+
Whenever the application processes a new transaction, it inserts a new row into account_transaction and, at the account table, it updates the column current_balance that store the current balance for the account and the column version used for optimistic locking.
If the optimistic locking works, the transaction is commited, if it doesn't the transaction is rolled back.
As a rough example, when processing transaction 105, the application ran the following pseudo SQL/Java:
set autocommit = 0;
insert into account_transaction
(account_id, date, value, resulting_amount)
values
(2, sysdate(), 550, 2000);
update account set
current_balance = 2000,
version = 34933
where
id = 2 and
version = 34932;
if (ROW_COUNT() != 1) {
rollback;
}
else {
commit;
}
However certain accounts are very active and receive many simultaneous transactions which causes deadlocks at MySQL while updating the rows at account table. These deadlocks impose a serious performance penalty to the application since it causes the transactions to be reprocessed when deadlocks at the database occur.
How can I efficiently handle the current balance for the accounts? The current balance is needed to authorize/deny new transactions and is used in various reports.
How can I efficiently handle the current balance for the accounts?
I think this whole model is over-engineered.
Abandoning the optimistic locking through version and having a simple...
UPDATE account SET current_balance = current_balance + value WHERE id = ...
...at the end of the transaction that inserts a new account_transaction should be plenty fast. For data integrity, consider putting this into an AFTER INSERT trigger on account_transaction.[1]
First of all, you are doing it at the end of the transaction, so even if the transaction is long, the lock contention on this row should be short.
SQL guarantees a consistent data view within a single statement, so there is no need for a separate SELECT ... FOR UPDATE.
Also, since you are adding a value instead of directly setting the sum, it doesn't really matter in which order these operations are done - addition is commutative (so shorter transactions can "overtake" the longer ones).
[1] But be careful not to trigger it too early - only insert a new account_transaction row when it is completely "cooked"; don't (for example) insert early and update the resulting_amount later.
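The suggested AFTER INSERT trigger might look like the following sketch, using the column names from the tables above (the trigger name is an assumption):

```sql
DELIMITER //
CREATE TRIGGER account_transaction_ai
AFTER INSERT ON account_transaction
FOR EACH ROW
BEGIN
  -- Keep the denormalized balance in sync with every new transaction.
  UPDATE account
     SET current_balance = current_balance + NEW.value
   WHERE id = NEW.account_id;
END//
DELIMITER ;
```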