Is there any good and performant alternative to FOR UPDATE SKIP LOCKED in MariaDB? Or is there any good practice to achieve job queueing in MariaDB?
Instead of using a lock to indicate that a queue record is being processed, use an indexed processing column. Set it to 0 for new records and, in a separate transaction from any processing, select a single not-yet-processing record and update it to 1. Optionally also store the time, and the process or thread ID and server that is processing the record. Have a separate monitoring process detect jobs flagged as processing that did not complete within the expected time.
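A minimal sketch of the claim step described above; the jobs table and its id, processing, claimed_at and claimed_by columns are assumed names:
-- Claim one unprocessed job in a short transaction, separate from the actual processing.
START TRANSACTION;
SELECT id INTO @job_id
  FROM jobs
 WHERE processing = 0
 ORDER BY id
 LIMIT 1
   FOR UPDATE;
-- If no row was found, @job_id is left unchanged; the worker should handle that case.
UPDATE jobs
   SET processing = 1,
       claimed_at = NOW(),
       claimed_by = CONNECTION_ID()
 WHERE id = @job_id;
COMMIT;
-- Process the job outside this transaction; a monitor can later reset rows
-- where processing = 1 but claimed_at is older than the expected run time.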
An alternative that avoids even the temporary lock on a non-primary index needed to select a record is to use a separate, non-database message queue to notify you of new records available in the database queue. (Unless you genuinely don't care whether a unit of work is processed more than once, I would always keep a database table in addition to any non-database queue.)
DELETE FROM QUEUE_TABLE LIMIT 1 RETURNING *
for dequeue operations. Depending on your needs, it might work well enough.
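If you need FIFO behaviour, the same idea should also work combined with an ORDER BY; here created_at is an assumed column on the queue table:
DELETE FROM QUEUE_TABLE ORDER BY created_at LIMIT 1 RETURNING *;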
Update 2022-06-14:
MariaDB supports SKIP LOCKED now (since version 10.6).
Related
What is the exact difference between the two locking read clauses:
SELECT ... FOR UPDATE
and
SELECT ... LOCK IN SHARE MODE
And why would you need to use one over the other?
I have been trying to understand the difference between the two. I'll document what I have found in hopes it'll be useful to the next person.
Both LOCK IN SHARE MODE and FOR UPDATE ensure no other transaction can update the rows that are selected. The difference between the two is in how they treat locks while reading data.
LOCK IN SHARE MODE does not prevent another transaction from reading the same row that was locked.
FOR UPDATE prevents other locking reads of the same row (non-locking reads can still read that row; LOCK IN SHARE MODE and FOR UPDATE are locking reads).
This matters in cases like updating counters, where you read the value in one statement and update it in another. Here, using LOCK IN SHARE MODE allows two transactions to read the same initial value, so if both then increment the counter by 1, the ending count might increase only by 1, since both transactions started from the same value.
Using FOR UPDATE blocks the second transaction from reading the value until the first one is done, which ensures the counter is incremented by 2.
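A sketch of the counter case, assuming a counters table with name and value columns:
START TRANSACTION;
SELECT value INTO @v FROM counters WHERE name = 'hits' FOR UPDATE;  -- other locking reads of this row now wait
UPDATE counters SET value = @v + 1 WHERE name = 'hits';
COMMIT;
With LOCK IN SHARE MODE instead of FOR UPDATE, two transactions could both read the same @v; when both then try the UPDATE, one of them is rolled back with a deadlock and the survivor writes @v + 1, so the counter may go up by only 1.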
FOR UPDATE --- You're informing MySQL that the selected rows may be updated before the end of this transaction, so MySQL doesn't grant any locks on the same set of rows to another transaction in the meantime. Other transactions that want to lock those rows (whether with a locking read or a write) must wait until the first transaction is finished.
FOR SHARE --- Indicates to MySQL that you're selecting the rows only for reading and will not modify them before the end of the transaction. Any number of transactions can acquire a read lock on those rows.
Note: there is a chance of a deadlock if these statements (FOR UPDATE, FOR SHARE) are not used properly.
Either way the integrity of your data will be guaranteed; it's just a question of how the database guarantees it. Does it do so by raising runtime errors when transactions conflict with each other (i.e. FOR SHARE), or by serializing any transactions that would conflict with each other (i.e. FOR UPDATE)?
FOR SHARE (a.k.a. LOCK IN SHARE MODE): Transactions face a higher probability of failure due to deadlock, because they delay blocking until the moment an update statement is received (at which point they either block until all read locks are released, or fail with a deadlock if another write is in progress). Only one client blocks and eventually succeeds; the other clients fail with a deadlock if they try to update and have to retry their transactions.
FOR UPDATE: Transactions won't fail due to deadlock, because they won't be allowed to run concurrently. This may be desirable for example because it makes it easier to reason about multi-threading if all updates are serialized across all clients. However, it limits the concurrency you can achieve because all other transactions block until the first transaction is finished.
Pro-Tip: As an exercise I recommend taking some time to play with a local test database and a couple mysql clients on the command line to prove this behavior for yourself. That is how I eventually understood the difference myself, because it can be very abstract until you see it in action.
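For example, one such experiment might look like this (the accounts table and its rows are made up):
-- Session A:
START TRANSACTION;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- takes an exclusive row lock
-- Session B, while session A is still open:
START TRANSACTION;
SELECT * FROM accounts WHERE id = 1;              -- plain (non-locking) read: returns immediately
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;   -- locking read: blocks
-- Back in session A:
COMMIT;                                           -- session B's locking read now proceeds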
If two independent scripts call a database with update requests to the same field, but with different values, would they execute at the same time and one overwrite the other?
As an example, to help ensure clarity, imagine both of these statements being run at the same time, each by a different script, where Status = 2 is called microseconds after Status = 1 by coincidence.
UPDATE My_Table SET Status = 1 WHERE Status = 0;
UPDATE My_Table SET Status = 2 WHERE Status = 0;
What would my results be, and why? If other factors play a role, expand on them as much as you please; this is meant to be a general question.
Side Note:
Because I know people will still ask: my situation uses MySQL with Google App Engine, but I don't want to limit this question to just my case, should it be useful to others. I am using Status as an identifier for which script is working on the record. If Status is not 0, no other script is allowed to touch it.
This is what locking is for. All major SQL implementations lock DML statements by default so that one query won't overwrite another before the first is complete.
There are different levels of locking. If you've got row locking then your second update will run in parallel with the first, so at some point you'll have 1s and 2s in your table.
Table locking would force the second query to wait for the first query to completely finish and release its table lock.
You can usually turn off locking right in your SQL, but it's only ever done if you need a performance boost and you know you won't encounter race conditions like in your example.
Edits based on the new MySQL tag
If you're updating a table that uses the InnoDB engine, then you're working with row locking, and your query could yield a table with both 1s and 2s.
If you're working with a table that uses the MyISAM engine, then you're working with table locking, and your update statements would end up with a table that would either have all 1s or all 2s.
from https://dev.mysql.com/doc/refman/5.0/en/lock-tables-restrictions.html (MySQL)
Normally, you do not need to lock tables, because all single UPDATE statements are atomic; no other session can interfere with any other currently executing SQL statement. However, there are a few cases when locking tables may provide an advantage:
from https://msdn.microsoft.com/en-us/library/ms177523.aspx (SQL Server)
An UPDATE statement always acquires an exclusive (X) lock on the table it modifies, and holds that lock until the transaction completes. With an exclusive lock, no other transactions can modify data.
If you had two separate connections executing the two posted update statements, whichever statement started first would be the one that completed. The other statement would not update any data, as there would no longer be records with a status of 0.
The short answer is: it depends on which statement commits first. Just because one process started an update statement before another doesn't mean that it will complete before another. It might not get scheduled first, it might be blocked by another process, etc.
Ultimately, it's a race condition: the operation that completes (and commits) last, wins.
Since you have TWO scripts doing the same thing with different values for the UPDATE, they will NOT run at the same time; one of the scripts will run first even if you think you are calling them at the same time. You need to specify WHEN each script should run, otherwise the program will not know what should be 1 and what should be 2.
TL;DR - MySQL doesn't let you lock a table and use a transaction at the same time. Is there any way around this?
I have a MySQL table I am using to cache some data from a (slow) external system. The data is used to display web pages (written in PHP.) Every once in a while, when the cached data is deemed too old, one of the web connections should trigger an update of the cached data.
There are three issues I have to deal with:
Other clients will try to read the cache data while I am updating it
Multiple clients may decide the cache data is too old and try to update it at the same time
The PHP instance doing the work may be terminated unexpectedly at any time, and the data should not be corrupted
I can solve the first and last issues by using a transaction, so clients will be able to read the old data until the transaction is committed, when they will immediately see the new data. Any problems will simply cause the transaction to be rolled back.
I can solve the second problem by locking the tables, so that only one process gets a chance to perform the update. By the time any other processes get the lock they will realise they have been beaten to the punch and don't need to update anything.
This means I need to both lock the table and start a transaction. According to the MySQL manual, this is not possible. Starting a transaction releases the locks, and locking a table commits any active transaction.
Is there a way around this, or is there another way entirely to achieve my goal?
This means I need to both lock the table and start a transaction
This is how you can do it:
SET autocommit=0;
LOCK TABLES t1 WRITE, t2 READ, ...;
... do something with tables t1 and t2 here ...
COMMIT;
UNLOCK TABLES;
For more info, see the MySQL documentation.
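Applied to the cache table from the question, the sequence might look roughly like this (the table and column names are assumptions, and note that a WRITE lock also makes plain reads wait while it is held):
SET autocommit=0;
LOCK TABLES cache WRITE;
DELETE FROM cache WHERE cache_key = 'slow_report';
INSERT INTO cache (cache_key, payload, updated_at) VALUES ('slow_report', @fresh_payload, NOW());
COMMIT;
UNLOCK TABLES;
SET autocommit=1;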
If it were me, I'd use the advisory locking function within MySQL to implement a mutex for updating the cache, and a transaction for read isolation. e.g.
begin_transaction(); // although reading a single row doesn't really require this
$cached=runquery("SELECT * FROM cache WHERE `key`=$id");
end_transaction();
if (is_expired($cached)) {
$cached=refresh_data($cached, $id);
}
...
function refresh_data($cached, $id)
{
$lockname=some_deterministic_transform($id);
if (1==runquery("SELECT GET_LOCK('$lockname',0)")) {
$cached=fetch_source_data($id);
begin_transaction();
write_data($cached, $id);
end_transaction();
runquery("SELECT RELEASE_LOCK('$lockname')");
}
return $cached;
}
(BTW: bad things may happen if you try this with persistent connections)
I'd suggest solving the issue by removing the contention altogether.
Add a timestamp column to your cached data.
When you need to update the cached data:
Just add new cached data to your table using the current timestamp
Remove cached data older than, let's say, 24 hours.
When you need to serve the cached data:
Sort by timestamp (DESC) and return the newest cached data
At any given time your clients will retrieve records which are never deleted by any other process. Moreover, you don't care if a client gets cached data belonging to different writes (i.e. with different timestamps)
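A rough sketch, assuming a cache table with payload and created_at columns (@fresh_payload is a placeholder for the new data):
-- Writer: append fresh data, then prune stale rows.
INSERT INTO cache (payload, created_at) VALUES (@fresh_payload, NOW());
DELETE FROM cache WHERE created_at < NOW() - INTERVAL 24 HOUR;
-- Reader: always take the newest row.
SELECT payload FROM cache ORDER BY created_at DESC LIMIT 1;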
The second problem may be solved without involving the database at all. Have a lock file for the cache update procedure so that other clients know that someone is already on it. This may not catch each and every corner case, but is it that big of a deal if two clients are updating the cache at the same time? After all, they are doing the update in transactions, so the cache will still be consistent.
You may even implement the lock yourself by having the last cache update time stored in a table. When a client wants to update the cache, make it lock that table, check the last update time, and then update the field.
I.e., implement your own locking mechanism to prevent multiple clients from updating the cache. Transactions will take care of the rest.
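A sketch of such a home-grown lock, assuming a one-row cache_meta table with a last_update column and a 10-minute staleness threshold (both are assumptions):
LOCK TABLES cache_meta WRITE;
UPDATE cache_meta SET last_update = NOW()
 WHERE last_update < NOW() - INTERVAL 10 MINUTE;
-- Check the affected-row count here: 1 means this client won the right to rebuild the cache,
-- 0 means another client already refreshed it recently.
UNLOCK TABLES;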
I have a MySQL table in which I store jobs to be processed: mainly text fields of raw data that will take around a minute each to process.
I have 2 servers pulling data from that table, processing it, then deleting it.
To manage the job allocation between the 2 servers I am currently using Amazon SQS. I store all the row IDs that need processing in SQS, and the worker servers poll SQS to get new rows to work on.
The system currently works but SQS adds a layer of complexity and costs that I feel are overkill to achieve what I am doing.
I am trying to implement the same thing without SQS and was wondering if there is any way to read lock a row so that if one worker is working on one row, no other worker can select that row. Or if there's any better way to do it.
A simple workaround: add one more column to your jobs table, is_taken_by INT.
Then in your worker you do something like this:
select job_id from jobs where is_taken_by is null limit 1 for update;
update jobs set is_taken_by = worker_pid where id = job_id;
SELECT ... FOR UPDATE sets exclusive locks on rows it reads. This way you ensure that no other worker can take the same job.
Note: you have to run those two lines in an explicit transaction.
Locking of rows for update using SELECT FOR UPDATE only applies when autocommit is disabled (either by beginning a transaction with START TRANSACTION or by setting autocommit to 0). If autocommit is enabled, the rows matching the specification are not locked.
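A sketch of the claim wrapped in an explicit transaction; @job_id and @worker_pid stand in for the job_id and worker_pid placeholders above:
START TRANSACTION;
SELECT id INTO @job_id FROM jobs WHERE is_taken_by IS NULL LIMIT 1 FOR UPDATE;
UPDATE jobs SET is_taken_by = @worker_pid WHERE id = @job_id;
COMMIT;
-- After processing, the worker deletes the row (or clears is_taken_by if the job failed).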
I've got a theoretical question and can't find a good solution for this on the net:
For a tblA with 100,000 recs.
I want to have multiple processes/apps running, each of which accesses tblA.
I don't want the apps to access the same recs. I.e., I want appA to access the first 50 rows, with appB accessing the next 50, and appC accessing the next 50 after that.
So basically I want the apps to do a kind of fetch on the next "N" recs in the table. I'm looking for a way to access/process the row data as fast as possible, essentially running the apps in a simultaneous manner, but I don't want the apps to process the same rows.
So, just how should this kind of process be set up?
Is it simply doing a kind of:
SELECT * FROM tblA LIMIT 50
and doing some kind of row locking for each row (which requires InnoDB)?
Pointers/pseudo code would be useful.
Here are some posts from the DBA StackExchange on this:
https://dba.stackexchange.com/q/10017/877
https://dba.stackexchange.com/a/4470/877
They discuss SELECT ... LOCK IN SHARE MODE and the potential headaches that come with it.
Percona wrote a nice article on this, along with SELECT ... FOR UPDATE.
Your application should manage which data each process accesses: keep a pointer to the next unprocessed row. If you're using stored procedures, use another table to store the pointers. Each process would "reserve" a set of rows before beginning processing by advancing the pointer, and each process should check the current maximum pointer value and whether it exceeds the number of rows in the table.
If you are specifically looking to process the first set, the second set, and so on, then you can use LIMIT offset, count (e.g. LIMIT 0,50, LIMIT 50,50, LIMIT 100,50) with an ORDER BY, as sketched below. Locking is not necessary since the processes won't even try to access each other's record sets. But I can't imagine a scenario where that would be a good implementation.
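A sketch of that fixed-range split; ordering by the primary key id is an assumption:
-- appA:
SELECT * FROM tblA ORDER BY id LIMIT 0, 50;
-- appB:
SELECT * FROM tblA ORDER BY id LIMIT 50, 50;
-- appC:
SELECT * FROM tblA ORDER BY id LIMIT 100, 50;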
An alternative is to just use UPDATE with a LIMIT, then select the records that were updated. You can use the process ID, a random number, or something else that is almost guaranteed to be unique across processes. Add a "status" field to your table indicating whether the record is available for processing (i.e. value is NULL). Then each process would update the status field to "own" the record for processing.
UPDATE tblA SET status=1234567890 WHERE status IS NULL LIMIT 50;
SELECT * FROM tblA WHERE status=1234567890;
This would work for MyISAM or InnoDB. With InnoDB you would be able to have multiple updates running at once, improving performance.
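A sketch of a full worker cycle with this approach; using CONNECTION_ID() as the token is just one option (any value unique per process works):
SET @token = CONNECTION_ID();
UPDATE tblA SET status = @token WHERE status IS NULL LIMIT 50;
SELECT * FROM tblA WHERE status = @token;
-- ... process the rows ...
DELETE FROM tblA WHERE status = @token;   -- or set status to a 'done' value instead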
The problem with these solutions is lag time. If process A executes at 12:00:00 and process B executes at precisely the same time, and in an application there are several blocks of distinct code leading up to the locks/DML, the processing time for each would vary. So process A may complete first, or it may be process B. If process A is setting the lock and process B modifies the record first, you're in trouble. This is the trouble with forking.