Mysql concurrent SELECT/INSERT - mysql

I keep track of user credits in a creditlog table that looks like this:
user quantity action balance
1001 20 SEND 550
1001 30 SEND 520
1001 5 SEND 515
At first I used Active Record syntax to select the latest balance and then insert a new row with the computed balance. That ran into a race condition:
user quantity action balance
1001 20 SEND 550
1001 30 SEND 520
1001 5 SEND 545 (the latest balance was not picked up because of a race condition)
Next solution was using a single query to do both:
INSERT INTO creditlog (action, quantity, balance, memberId)
VALUES (:action, :quantity, (SELECT tc.balance from creditlog tc where tc.memberId=:memberId ORDER by tc.id desc limit 1) - :quantity, :memberId);
My script which tests this with 10 reqs/second would throw the following error for 2/10 queries:
SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction. The SQL statement executed was:
INSERT INTO creditlog (action, quantity, reference, balance, memberId)
VALUES (:action, :quantity, :reference, (SELECT balance from (SELECT tm.* FROM creditlog tm where tm.memberId=:memberId) tc where tc.memberId=:memberId ORDER by tc.id desc limit 1) -:quantity, :memberId, :recipientId);.
Bound with :action='send', :quantity='10', :reference='Testing:10', :memberId='10001043'.
Shouldn't the engine wait for the first operation to release the table then start on the second one?
Does my issue relate to: How to avoid mysql 'Deadlock found when trying to get lock; try restarting transaction' ?
How can I avoid this situation and turn the concurrent requests into sequential operations?
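The error text itself suggests restarting the transaction. A minimal sketch of that idea in Python follows; `DeadlockError` and `run_with_retry` are hypothetical names, and a real driver would raise its own exception type for error 1213, which you would catch instead.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's error 1213 / SQLSTATE 40001."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.01):
    """Call transaction() and retry it when it reports a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Back off with jitter so competing sessions are less likely to collide again.
            time.sleep(base_delay * attempt * random.random())


# Usage: a fake transaction that deadlocks twice, then succeeds.
calls = {"count": 0}


def flaky_insert():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "inserted"


result = run_with_retry(flaky_insert)
```

Retrying does not turn the requests into sequential operations; it only makes the occasional deadlock victim succeed on a later attempt.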

Here is a working solution, may not be the best one so please help improve.
Since a transaction by itself does not block plain SELECTs from other sessions, I used the following approach:
LOCK TABLES creditlog WRITE;
// query 1 extracts oldBalance
"SELECT balance FROM creditlog WHERE memberId = :memberId ORDER BY id DESC LIMIT 1;"
// do my thing with the balance (checks and whatever)
// query 2
"INSERT INTO creditlog (action, quantity, balance, memberId)
VALUES (:action, :quantity, oldBalance - :quantity, :memberId);"
UNLOCK TABLES;
Result:
mysql> select * from creditlog order by id desc limit 40;
+--------+----------+--------+----------+---------+---------------------+
| id     | memberId | action | quantity | balance | timeAdded           |
+--------+----------+--------+----------+---------+---------------------+
| 772449 | 10001043 | send   |    10.00 |    0.00 | 2013-12-23 16:21:50 |
| 772448 | 10001043 | send   |    10.00 |   10.00 | 2013-12-23 16:21:50 |
| 772447 | 10001043 | send   |    10.00 |   20.00 | 2013-12-23 16:21:50 |
| 772446 | 10001043 | send   |    10.00 |   30.00 | 2013-12-23 16:21:50 |
| 772445 | 10001043 | send   |    10.00 |   40.00 | 2013-12-23 16:21:50 |

Solution #2
Since I was already using Redis, I tried out a redis mutex wrapper:
https://github.com/phpnode/YiiRedis/blob/master/ARedisMutex.php
This gives me a lot more flexibility, allowing me to lock only specific segments (users) of the table. Also there is no risk of deadlocking since the mutex automatically expires after X (configurable) seconds.
Here's the final version:
$this->mutex = new ARedisMutex("balance:lock:".$this->memberId);
$this->mutex->block();
//execute credit transactions for this user
$this->mutex->unlock();
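To illustrate the per-user locking idea without Redis, here is a minimal in-process sketch in Python. It is not the ARedisMutex API: `lock_for`, `send`, and the in-memory `balances` dict are assumptions made for illustration, the locks only work inside a single process, and they do not expire the way the Redis mutex does.

```python
import threading
from collections import defaultdict

# One lock per member id; the registry itself is guarded by its own lock.
_registry_guard = threading.Lock()
_member_locks = defaultdict(threading.Lock)


def lock_for(member_id):
    """Return the lock dedicated to this member, creating it on first use."""
    with _registry_guard:
        return _member_locks[member_id]


balances = {10001043: 100}  # stand-in for the latest balance in creditlog
log = []                    # stand-in for rows inserted into creditlog


def send(member_id, quantity):
    # Only requests for the same member wait on each other.
    with lock_for(member_id):
        old_balance = balances[member_id]     # read the latest balance
        new_balance = old_balance - quantity  # checks would go here
        balances[member_id] = new_balance     # write the new balance
        log.append((member_id, quantity, new_balance))


threads = [threading.Thread(target=send, args=(10001043, 10)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the read and the write happen under the same lock, every logged balance is exactly 10 lower than the previous one, which is the property the race condition broke.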

Related

Concurrent scripts pulling jobs to do from MySQL table [duplicate]

Say I have multiple workers that can concurrently read and write against a MySQL table (e.g. jobs). The task for each worker is:
Find the oldest QUEUED job
Set its status to RUNNING
Return the corresponding ID.
Note that there may not be any qualifying (i.e. QUEUED) jobs when a worker runs step #1.
I have the following pseudo-code so far. I believe I need to cancel (ROLLBACK) the transaction if step #1 returns no jobs. How would I do that in the code below?
BEGIN TRANSACTION;
# Update the status of jobs fetched by this query:
SELECT id from jobs WHERE status = "QUEUED"
ORDER BY created_at ASC LIMIT 1;
# Do the actual update, otherwise abort (i.e. ROLLBACK?)
UPDATE jobs
SET status="RUNNING"
# HERE: Not sure how to make this conditional on the previous ID
# WHERE id = <ID from the previous SELECT>
COMMIT;
I am implementing something very similar to your case this week. A number of workers, each grabbing the "next" row in a set of rows to work on.
The pseudocode is something like this:
BEGIN;
SELECT id INTO @id FROM mytable WHERE status = 'QUEUED' LIMIT 1 FOR UPDATE;
UPDATE mytable SET status = 'RUNNING' WHERE id = @id;
COMMIT;
Using FOR UPDATE is important to avoid race conditions, i.e. more than one worker trying to grab the same row.
See https://dev.mysql.com/doc/refman/8.0/en/select-into.html for information about SELECT ... INTO.
It's still not quite clear what you are after. But assuming your task is: find the next QUEUED job, set its status to RUNNING, and select the corresponding ID.
In a single-threaded environment, you can just use your code. Fetch the selected ID into a variable in your application code and pass it to the UPDATE query in the WHERE clause. You don't even need a transaction, since there is only one writing statement. You can mimic this in an SQL script.
Assuming this is your current state:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | QUEUED |
| 3 | 2020-06-15 12:00:20 | QUEUED |
| 4 | 2020-06-15 12:00:30 | QUEUED |
You want to start the next queued job (which has id=2).
SET @id_for_update = (
SELECT id
FROM jobs
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1
);
UPDATE jobs
SET status = 'RUNNING'
WHERE id = @id_for_update;
SELECT @id_for_update;
You will get
@id_for_update
2
from the last select. And the table will have this state:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | RUNNING |
| 3 | 2020-06-15 12:00:20 | QUEUED |
| 4 | 2020-06-15 12:00:30 | QUEUED |
If you have multiple processes, which start jobs, you would need to lock the row with FOR UPDATE. But that can be avoided using LAST_INSERT_ID():
Starting from the state above, with job 2 already running:
UPDATE jobs
SET status = 'RUNNING',
id = LAST_INSERT_ID(id)
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1;
SELECT LAST_INSERT_ID(), ROW_COUNT();
You will get:
| LAST_INSERT_ID() | ROW_COUNT() |
| ---------------- | ----------- |
| 3 | 1 |
And the new state is:
| id | created_at | status |
| --- | ------------------- | -------- |
| 1 | 2020-06-15 12:00:00 | COMPLETED |
| 2 | 2020-06-15 12:00:10 | RUNNING |
| 3 | 2020-06-15 12:00:20 | RUNNING |
| 4 | 2020-06-15 12:00:30 | QUEUED |
If the UPDATE statement affected no row (there were no queued rows) ROW_COUNT() will be 0.
There might be some risks, which I am not aware of - But this is also not really how I would approach this. I would rather store more information in the jobs table. Simple example:
CREATE TABLE jobs (
id INT auto_increment primary key,
created_at timestamp not null default now(),
updated_at timestamp not null default now() on update now(),
status varchar(50) not null default 'QUEUED',
process_id varchar(50) null default null
);
and
UPDATE jobs
SET status = 'RUNNING',
process_id = 'some_unique_pid'
WHERE status = 'QUEUED'
ORDER BY id
LIMIT 1;
Now a running job belongs to a specific process and you can just select it with
SELECT * FROM jobs WHERE process_id = 'some_unique_pid';
You might even like to have more information - eg. queued_at, started_at, finished_at.
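The claim-by-process_id idea can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for MySQL; `claim_next_job` is a hypothetical helper, and because SQLite builds usually lack UPDATE ... ORDER BY ... LIMIT, the oldest queued row is picked with a subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'QUEUED',"
    " process_id TEXT)"
)
conn.executemany(
    "INSERT INTO jobs (id, status) VALUES (?, ?)",
    [(1, "COMPLETED"), (2, "QUEUED"), (3, "QUEUED")],
)
conn.commit()


def claim_next_job(conn, process_id):
    """Mark the oldest QUEUED job as RUNNING for process_id; return its id or None."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE jobs SET status = 'RUNNING', process_id = ? "
            "WHERE id = (SELECT id FROM jobs WHERE status = 'QUEUED' "
            "            ORDER BY id LIMIT 1)",
            (process_id,),
        )
        if cur.rowcount == 0:
            return None  # nothing was queued
        row = conn.execute(
            "SELECT id FROM jobs WHERE process_id = ? AND status = 'RUNNING' "
            "ORDER BY id DESC LIMIT 1",
            (process_id,),
        ).fetchone()
        return row[0]
```

Checking the affected-row count plays the same role as ROW_COUNT() in the MySQL version: zero means there was no queued job to claim.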
Adding SKIP LOCKED to the SELECT query and running it inside a SQL transaction that is committed only when the job is done avoids jobs stuck in status RUNNING if a worker crashes (because the uncommitted transaction will roll back). SKIP LOCKED is now supported in recent versions of most common DBMSs.
See:
Select only unlocked rows mysql
https://dev.mysql.com/doc/refman/8.0/en/innodb-locking-reads.html#innodb-locking-reads-nowait-skip-locked
(This is not an answer to the question, but a list of caveats that you need to be aware of when using any of the real Answers. Some of these have already been mentioned.)
Replication -- You must do all the locking on the Primary. If you are using a cluster with multiple writable nodes, be aware of the inter-node delays.
Backlog -- When something breaks, you could get a huge list of tasks in the queue. This may lead to some ugly messes.
Number of 'workers' -- Don't have more than a "few" workers. If you try to have, say, 100 concurrent workers, they will stumble over each other and cause nasty problems.
Reaper -- Since a worker may crash, the task assigned to it may never get cleared. Have a TIMESTAMP on the rows so a separate (cron/EVENT/whatever) job can discover what tasks are long overdue and clear them.
If the tasks are fast enough, then the overhead of the queue could be a burden. That is, "Don't queue it, just do it."
You are right to grab the task in one transaction, then later release the task in a separate transaction. Using InnoDB's locking is folly for any but trivially fast actions.
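The Reaper caveat above could look roughly like this. It is a sketch under assumptions: sqlite3 stands in for MySQL, `updated_at` is stored as Unix seconds rather than a TIMESTAMP, `requeue_stale_jobs` is a hypothetical helper, and "clearing" an overdue task is interpreted here as putting it back in the queue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " updated_at INTEGER NOT NULL)"  # Unix seconds; MySQL would use TIMESTAMP
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [(1, "RUNNING", 1000), (2, "RUNNING", 1950), (3, "QUEUED", 900)],
)
conn.commit()


def requeue_stale_jobs(conn, now, timeout_seconds):
    """Put RUNNING jobs untouched for longer than timeout_seconds back in the queue."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'QUEUED', updated_at = ? "
            "WHERE status = 'RUNNING' AND updated_at < ?",
            (now, now - timeout_seconds),
        )
        return cur.rowcount


# Job 1 was last touched 1000 seconds ago, job 2 only 50 seconds ago.
requeued = requeue_stale_jobs(conn, now=2000, timeout_seconds=600)
```

A cron job or scheduled event would call something like this periodically, with a timeout comfortably longer than the slowest legitimate task.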

How can I select the last timestamp reading per day for multiple user id's and multiple days?

I have a database that contains user id, calories burned (value), and the timestamp at which those calories burned were recorded(reading_date). An individual could have multiple calorie readings for the same day, but I'm only interested in the last reading since it's a total of all the previous readings for that day.
IN:
SELECT
DISTINCT ON (date, user_contents.content_id)
date_trunc('day',reading_date + time '05:00') date,
user_id,
created_at,
value
FROM data
OUT:
date | user_id | created_at | value
2019-01-13 00:00:00 | 138 | 2019-01-18 06:07:52 | 81.0
2019-01-15 00:00:00 | 137 | 2019-01-15 15:43:25 | 87.0
2019-01-16T00:00:00 | 137 | 2019-01-18 04:22:11 | 143.0
2019-01-16T00:00:00 | 137 | 2019-01-18 06:12:11 | 230.0
additional values omitted
I want to be able to select the maximum reading value for each day per person. I've tried using DISTINCT statements such as:
SELECT
DISTINCT ON (date, user_contents.content_id)
date_trunc('day',reading_date + time '05:00') date,
Sometimes that results in an error message:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Sometimes it filters out some results, but isn't always giving me the last reading of the day or only one result per person per day.
My optimal end result would look like this (the third record having been removed):
date | user_id | created_at | value
2019-01-13 00:00:00 | 138 | 2019-01-18 06:07:52 | 81.0
2019-01-15 00:00:00 | 137 | 2019-01-15 15:43:25 | 87.0
2019-01-16T00:00:00 | 137 | 2019-01-18 06:12:11 | 230.0
additional values omitted
Ultimately, I'm going to use this data to sum up the value column and determine the total number of calories burned by everyone in the dataset over a time period.
You appear to be using Postgres.
Follow the instructions in the error message. You want something like this:
SELECT DISTINCT ON (user_id, reading_date::date)
date_trunc('day',reading_date + time '05:00') date,
user_id, created_at,value
FROM data
ORDER BY user_id, reading_date::date DESC, reading_date DESC
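What the DISTINCT ON query computes can be spelled out in plain Python: for each (user, day) pair, keep only the row with the latest reading time. The sample rows below are hypothetical, `last_reading_per_day` is an illustrative name, and the 05:00 offset from the question is ignored for brevity.

```python
from datetime import datetime

readings = [  # (user_id, reading_date, value): hypothetical sample rows
    (137, datetime(2019, 1, 15, 15, 43, 25), 87.0),
    (137, datetime(2019, 1, 16, 4, 22, 11), 143.0),
    (137, datetime(2019, 1, 16, 6, 12, 11), 230.0),
    (138, datetime(2019, 1, 13, 6, 7, 52), 81.0),
]


def last_reading_per_day(rows):
    """Return {(user_id, date): (reading_date, value)} keeping the latest reading."""
    latest = {}
    for user_id, reading_date, value in rows:
        key = (user_id, reading_date.date())
        if key not in latest or reading_date > latest[key][0]:
            latest[key] = (reading_date, value)
    return latest


latest = last_reading_per_day(readings)
# Sum of the final daily totals, as the question intends to do afterwards.
total = sum(value for _, value in latest.values())
```

The 143.0 reading is dropped because a later reading exists for the same user on the same day, which mirrors ordering by reading_date DESC inside each DISTINCT ON group.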

Mysql only keep the highest value per id per value

I have a database that looks like this:
post metrics minutes(There is only data for post id 1 in this example)
| post id | date updated local | reach |
|1 | 2018-01-01 01:00:00 | 10 |
|1 | 2018-01-01 01:05:00 | 20 |
|1 | 2018-01-01 01:15:00 | 22 |
|1 | 2018-01-01 16:05:00 | 100 |
|1 | 2018-01-02 03:00:00 | 121 |
|1 | 2018-01-02 21:00:00 | 140 |
|1 | 2018-01-04 01:00:00 | 147 |
My system is designed to fetch data for all posts every 5 minutes and put the results in the above table if the reach is not the same as the last time it was stored for that post (this to prevent getting a shitload of data that is exactly the same).
Now there are thousands of posts and the table is starting to grow out of control, making my website a lot slower when loading data from this table.
So I decided that I can decrease the data by only keeping the last row per post per day, so I want to delete all rows that are not the max date updated local for that post. The result would be:
| post id | date updated local | reach |
|1 | 2018-01-01 16:05:00 | 100 |
|1 | 2018-01-02 21:00:00 | 140 |
|1 | 2018-01-04 01:00:00 | 147 |
I came up with:
DELETE FROM `post metrics minutes`
WHERE EXISTS (
SELECT *
FROM `post metrics minutes` pmmtemp
WHERE pmmtemp.`post id` = `post metrics minutes`.`post id`
AND pmmtemp.`date updated local` > `post metrics minutes`.`date updated local`
AND DATE(pmmtemp.`date updated local`) = DATE(`post metrics minutes`.`date updated local`)
);
But this gives me the following error:
Error Code: 1093. Table 'post metrics minutes' is specified twice, both as a target for 'DELETE' and as a separate source for data
Hope anyone can help me out!
MySQL does not allow a DELETE or UPDATE on a table that is also read in a subquery of the same statement.
One could create a temp table of post ids to delete.
But marking the records first works too. This way the two queries do not interfere with each other.
For the nested table, instead of FROM tablename I use FROM (SELECT * FROM tablename), which acts as the temp table.
Here I abuse the reach column as a marker.
UPDATE `post metrics minutes` p
SET p.reach = -1
WHERE EXISTS (
SELECT *
FROM (SELECT * FROM `post metrics minutes`) pmmtemp
WHERE pmmtemp.`post id` = p.`post id`
AND pmmtemp.`date updated local` > p.`date updated local`
AND DATE(pmmtemp.`date updated local`) = DATE(p.`date updated local`)
);
DELETE FROM `post metrics minutes`
WHERE reach = -1;
As per my comment, it's often quicker to create a new table with the desired dates, then delete the old table and replace it with the newer one.
My column/table names may be very slightly different from yours, but something like...
CREATE TABLE my_new_table AS
SELECT x.*
FROM my_old_table x
JOIN
( SELECT post_id,MAX(dt) dt FROM my_old_table GROUP BY post_id,DATE(dt)) y ON y.post_id = x.post_id
AND y.dt = x.dt;
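To check that this join keeps exactly the last row per post per day, here is a small sketch that runs the same query shape against the sample rows from the question. It uses Python's sqlite3 module as a stand-in for MySQL, with timestamps stored as ISO text; the table and column names follow the answer's placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_old_table (post_id INTEGER, dt TEXT, reach INTEGER)")
conn.executemany(
    "INSERT INTO my_old_table VALUES (?, ?, ?)",
    [
        (1, "2018-01-01 01:00:00", 10),
        (1, "2018-01-01 01:05:00", 20),
        (1, "2018-01-01 01:15:00", 22),
        (1, "2018-01-01 16:05:00", 100),
        (1, "2018-01-02 03:00:00", 121),
        (1, "2018-01-02 21:00:00", 140),
        (1, "2018-01-04 01:00:00", 147),
    ],
)

# Same shape as the MySQL statement: join each row to the per-day maximum.
conn.execute(
    "CREATE TABLE my_new_table AS "
    "SELECT x.* FROM my_old_table x "
    "JOIN (SELECT post_id, MAX(dt) AS dt FROM my_old_table "
    "      GROUP BY post_id, DATE(dt)) y "
    "  ON y.post_id = x.post_id AND y.dt = x.dt"
)

kept = conn.execute(
    "SELECT post_id, dt, reach FROM my_new_table ORDER BY dt"
).fetchall()
```

The three surviving rows match the desired result listed in the question.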

Efficiently handle current balance of account

I've been developing an application that handles accounts and transactions made over these accounts.
Currently the tables the application uses are modelled the following way:
account
+----+-----------------+---------+-----+
| id | current_balance | version | ... |
+----+-----------------+---------+-----+
| 1 | 1000 | 48902 | ... |
| 2 | 2000 | 34933 | ... |
| 3 | 100 | 103 | ... |
+----+-----------------+---------+-----+
account_transaction
+------+-------------+----------------------+---------+------------------+-----+
| id | account_id | date | value | resulting_amount | ... |
+------+-------------+----------------------+---------+------------------+-----+
| 101 | 1 | 03/may/2012 10:13:33 | 1000 | 2000 | ... |
| 102 | 2 | 03/may/2012 10:13:33 | 500 | 1500 | ... |
| 103 | 1 | 03/may/2012 10:13:34 | -500 | 1500 | ... |
| 104 | 2 | 03/may/2012 10:13:35 | -50 | 1450 | ... |
| 105 | 2 | 03/may/2012 10:13:35 | 550 | 2000 | ... |
| 106 | 1 | 03/may/2012 10:13:35 | -500 | 1000 | ... |
+------+-------------+----------------------+---------+------------------+-----+
Whenever the application processes a new transaction, it inserts a new row into account_transaction and, in the account table, updates the column current_balance, which stores the current balance for the account, and the column version, which is used for optimistic locking.
If the optimistic locking check succeeds, the transaction is committed; if it doesn't, the transaction is rolled back.
As a rough example, when processing transaction 105, the application did the following pseudo SQL/Java:
set autocommit = 0;
insert into account_transaction
(account_id, date, value, resulting_amount)
values
(2, sysdate(), 550, 2000);
update account set
current_balance = 2000,
version = 34933
where
id = 2 and
version = 34932;
if (ROW_COUNT() != 1) {
rollback;
}
else {
commit;
}
However, certain accounts are very active and receive many simultaneous transactions, which causes deadlocks in MySQL while updating rows in the account table. These deadlocks impose a serious performance penalty on the application, since transactions have to be reprocessed whenever a deadlock occurs.
How can I efficiently handle the current balance for the accounts? The current balance is needed to authorize/deny new transactions and is used in various reports.
How can I efficiently handle the current balance for the accounts?
I think this whole model is over-engineered.
Abandoning the optimistic locking through version and having a simple...
UPDATE account SET current_balance = current_balance + value WHERE id = ...
...at the end of the transaction that inserts a new account_transaction should be plenty fast. For data integrity, consider putting this into an AFTER INSERT trigger on account_transaction [1].
First of all, you are doing it at the end of the transaction, so even if the transaction is long, the lock contention on this row should be short.
SQL guarantees a consistent data view within a single statement, so there is no need for a separate SELECT ... FOR UPDATE.
Also, since you are adding a value instead of directly setting the sum, it doesn't really matter in which order these operations are done: addition is commutative (so shorter transactions can "overtake" the longer ones).
[1] But be careful not to trigger it too early: only insert a new account_transaction when it is completely "cooked"; don't (for example) insert early but update the resulting_amount later.
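A minimal sketch of the relative-update approach, using Python's sqlite3 module as a stand-in for MySQL. The helper names `make_db`, `apply_transaction`, and `balance` are hypothetical, the schema is trimmed to the columns needed here (resulting_amount is left out), and the starting balance of 1500 is taken from the sample data for account 2.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, current_balance INTEGER)"
    )
    conn.execute(
        "CREATE TABLE account_transaction ("
        " id INTEGER PRIMARY KEY, account_id INTEGER, value INTEGER)"
    )
    conn.execute("INSERT INTO account VALUES (2, 1500)")
    conn.commit()
    return conn


def apply_transaction(conn, account_id, value):
    """Insert the transaction row and adjust the balance in one database transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO account_transaction (account_id, value) VALUES (?, ?)",
            (account_id, value),
        )
        # Relative update: no prior SELECT and no version column needed.
        conn.execute(
            "UPDATE account SET current_balance = current_balance + ? WHERE id = ?",
            (value, account_id),
        )


def balance(conn, account_id):
    return conn.execute(
        "SELECT current_balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()[0]


# Apply the same two transactions in opposite orders on two fresh databases.
a, b = make_db(), make_db()
for value in (-50, 550):
    apply_transaction(a, 2, value)
for value in (550, -50):
    apply_transaction(b, 2, value)
```

Both databases end with the same balance, which illustrates why the order of the updates does not matter when each one adds a delta instead of writing an absolute value.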