How to achieve database-locking-based protection against duplicates - mysql

In a web application, using the InnoDB storage engine, I was unable to adequately utilise database locking in the following scenario.
There are 3 tables, I will call them aa, ar and ai.
aa holds the base records, let's say articles. ar holds information related to each aa record and the relation between aa and ar is 1:m.
Records in ar are stored when a record from aa is read for the first time. The problem is that when two requests are initiated at (nearly) the same time to read a record from aa (one which does not yet have its related records stored in ar), the ar records are duplicated.
Here is a pseudo code to help understand the situation:
Read the requested aa record.
Scan the ar table to find out if the given aa record has anything stored already. (Assume it has not.)
Consult ai to find out what is to be stored in ar for the given aa record. (ai seems somewhat irrelevant, but I found that it, too, may have to be involved in the locking… I may be wrong.)
Insert a few rows into ar.
Here is what I want to achieve:
Read the requested aa record.
WITH OR WITHOUT USING A TRANSACTION, LOCK ar, SO ANY SUBSEQUENT REQUEST ATTEMPTING TO READ FROM ar WILL WAIT AT THIS POINT UNTIL THIS ONE FINISHES.
Scan the ar table to find out if the given aa record has anything stored already. (Assume it has not.) The problem is that in the case of two simultaneous requests, both find there are no records in ar for the given aa record, and both proceed to insert the same rows, duplicating them. Otherwise, if matching records already exist, this sequence is interrupted and no INSERT occurs.
Consult ai to find out what is to be stored in ar for the given aa record. (ai seems somewhat irrelevant, but I found that it, too, may have to be involved in the locking… I may be wrong.)
Insert a few rows into ar.
RELEASE THE LOCK ON ar.
Seems simple enough, yet I was unsuccessful in avoiding the duplicates. I'm testing the simultaneous requests with a simple command in a Bash shell (using wget).
I have spent a while learning how exactly locking works with the InnoDB engine, here http://dev.mysql.com/doc/refman/5.5/en/innodb-lock-modes.html and here http://dev.mysql.com/doc/refman/5.5/en/innodb-locking-reads.html, and tried several ways to utilise the lock(s), still with no luck.
I want the entire ar table locked (since I want to prevent INSERTs to it from multiple requests), causing further attempts to interact with the table to wait until the first lock is released. But the only mention of an "entire table" being locked in the documentation is in the Intention Locks section of the first linked page, and it is not discussed further; at least I was unable to figure out how to achieve such a lock.
Could anyone point me in the right direction?

SET tx_isolation='READ-COMMITTED';
START TRANSACTION;
SELECT * FROM aa WHERE id = 1234 FOR UPDATE;
This ensures that only one thread gets access to a given row in aa at a time. There is no need to lock the ar table at all, because any other thread that wants to access row 1234 will wait.
Then query ar to find out what rows exist for the corresponding aa, and decide if you want to insert more rows to ar.
Remember that the row in aa is still locked. So be a good citizen by finishing your work quickly, and COMMIT promptly.
COMMIT;
This allows the next thread who has been waiting for the same row of aa to proceed. By using READ-COMMITTED, it will be able to see the just-committed new rows in ar.
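Putting the pieces together, the whole sequence might look like the sketch below. The ar.aa_id foreign-key column, the info column, and the exact shape of the INSERT ... SELECT are assumptions; adapt them to the real schema. The FOR UPDATE lock on the parent row is what turns the check-then-insert into a critical section.

SET tx_isolation='READ-COMMITTED';
START TRANSACTION;
-- serialize concurrent readers of article 1234 on its parent row
SELECT * FROM aa WHERE id = 1234 FOR UPDATE;
-- while holding the lock, check whether child rows already exist
SELECT COUNT(*) FROM ar WHERE aa_id = 1234;
-- application logic: only if the count was 0, derive the rows from ai and insert them
INSERT INTO ar (aa_id, info)
  SELECT 1234, ai.info FROM ai WHERE ai.aa_id = 1234;
COMMIT;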

Related

Concurrent writes to MySQL and testing solutions

I was practicing some "system design" coding questions and I was interested in how to solve a concurrency problem in MySQL. The problem was "design an inventory checkout system".
Let's say you are trying to check out a specific item from an inventory, a library book for instance.
If two people are on the website, looking to book it, is it possible that they both check it out? Let's assume the query updates the status of the row, marking a boolean checked_out as True.
Would transactions solve this issue? Would they cause the second query that runs to fail (assuming both requests issue the same query)?
Alternatively, we insert rows into a checkouts table. Since both queries read that the item is not checked out currently, they could both insert into the table. I don't think a transaction would solve this, unless the transaction includes reading the table to see if a checkout currently exists for this item that hasn't yet ended.
How would I simulate two writes at the exact same time to test this?
No, transactions alone do not address concurrency issues. Let's quickly revisit mysql's definition of transactions:
Transactions are atomic units of work that can be committed or rolled back. When a transaction makes multiple changes to the database, either all the changes succeed when the transaction is committed, or all the changes are undone when the transaction is rolled back.
To sum it up: transactions are a way to ensure data integrity.
RDBMSs use various types of locking, isolation levels, and storage-engine-level solutions to address concurrency. People often mistake transactions for a means of controlling concurrency, because transactions affect how long certain locks are held.
Focusing on InnoDB: when you issue an update statement, mysql places an exclusive lock on the record being updated. Only the transaction holding the exclusive lock can modify the given record, the others have to wait until the transaction is committed.
How does this help you prevent multiple users from checking out the same book? Let's say you have an id field uniquely identifying the books and a checked_out field indicating the status of the book.
You can use the following atomic update to check out a book:
update books set checked_out=1 where id=xxx and checked_out=0
The checked_out=0 criteria makes sure that the update only succeeds if the book is not checked out yet. So, if the above statement affects a row, then the current user checks out the book. If it does not affect any rows, then someone else has already checked out the book. The exclusive lock makes sure that only one transaction can update the record at any given time, thus serializing the access to that record.
If you want to use a separate checkouts table for reserving books, then you can use a unique index on book ids to prevent the same book being checked out more than once.
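For example, a minimal sketch (table and column names are invented, and it assumes an active checkout row is deleted when the book is returned):

-- at most one active checkout per book, enforced by the unique key
CREATE TABLE checkouts (
  book_id INT NOT NULL,
  user_id INT NOT NULL,
  checked_out_at DATETIME NOT NULL,
  UNIQUE KEY uq_book (book_id)
);
-- of two concurrent attempts, the second INSERT fails with a duplicate-key error
INSERT INTO checkouts (book_id, user_id, checked_out_at) VALUES (42, 7, NOW());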
Transactions don't cause updates to fail. They cause sequences of queries to be serialized. Only one accessor can run the sequence of queries; others wait.
Everything in SQL is a transaction, single-statement update operations included. The kind of transaction denoted by START TRANSACTION; ... COMMIT; bundles a series of queries together.
I don't think a transaction would solve this, unless the transaction includes reading the table to see if a checkout currently exists for this item.
That's generally correct. Checkout schemes must always read availability from the database. The purpose of the transaction is to avoid race conditions when multiple users attempt to check out the same item.
SQL doesn't have thread-safe atomic test-and-set instructions like multithreaded processor cores have. So you need to use transactions for this kind of thing.
The simplest form of checkout uses a transaction, something like this.
START TRANSACTION;
SELECT is_item_available, id FROM item WHERE catalog_number = whatever FOR UPDATE;
/* if the item is not available, tell the user and commit the transaction without updating */
UPDATE item SET is_item_available = 0 WHERE id = itemIdPreviouslySelected;
/* tell the user the checkout succeeded. */
COMMIT;
It's clearly possible for two or more users to attempt to check out the same item more-or-less simultaneously. But only one of them actually gets the item.
A more complex checkout scheme, not detailed here, uses a two-step system. First step: a transaction to reserve the item for a user, rejecting the reservation if someone else has it checked out or reserved. Second step: reservation holder has a fixed amount of time to accept the reservation and check out the item, or the reservation expires and some other user may reserve the item.
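A rough sketch of the first step, under assumed column names and an assumed 10-minute reservation window:

START TRANSACTION;
-- lock the item row and see whether it is free, or its reservation has expired
SELECT id FROM item
  WHERE id = 42 AND is_item_available = 1
    AND (reserved_until IS NULL OR reserved_until < NOW())
  FOR UPDATE;
-- if a row came back, reserve the item for this user
UPDATE item SET reserved_by = 7, reserved_until = NOW() + INTERVAL 10 MINUTE
  WHERE id = 42;
COMMIT;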

Mysql transactions happening at the same time

Can two transactions occur at the same time? Let's say you have transactions A and B, each of which will perform a read to get the max value of some column then a write to insert a new row with that max+1. Is it possible that A performs a read to get the max, then B performs a read before A writes, causing both transactions to write the same value to the column?
Doing this with the read-uncommitted isolation level disabled seems to prevent duplicates, but I can't wrap my head around why.
Can two transactions occur at the same time?
Yes, that is quite possible; in fact every RDBMS is required to support this out of the box to speed things up. Think of an application accessed by thousands of users simultaneously: if everything ran in sequence, users might have to wait days to get a response.
Let's say you have transactions A and B, each of which will perform a read to get the max value of some column then a write to insert a new row with that max+1. Is it possible that A performs a read to get the max, then B performs a read before A writes, causing both transactions to write the same value to the column?
If A and B happen in two different sessions, that is quite a possible case.
Doing this with the read-uncommitted isolation level disabled seems to prevent duplicates, but I can't wrap my head around why?
I think your requirement, getting the next increment number under an isolation block, is quite common. You need to instruct the database to make the read mutually exclusive with the write that follows; you can do that by setting the isolation level, or perhaps a 'temporary' isolation level for just this operation.
If getting the next number is the only problem and you have no other constraints, then MySQL's AUTO_INCREMENT would be the best-suited answer for you. But since you have asked this question specifically, you probably do have such constraints. Refer to my similar questions and answers.
Your solution should be something like the one below.
begin;
select last_number from TABLE1 ... FOR UPDATE;  -- read the result in the application
update TABLE1 set last_number=last_number+1 where ...;
commit;
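For completeness, regarding the AUTO_INCREMENT suggestion above: if it is an option after all, the read-then-write race disappears entirely, because the server assigns the number atomically at insert time (a sketch with an invented table):

CREATE TABLE entries (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255)
);
INSERT INTO entries (payload) VALUES ('some data');
SELECT LAST_INSERT_ID();  -- the id just assigned to this session's insert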

exclusive read locks in mysql

I have a table which maintains and assigns portions of input (from a big input table) to be worked on by multiple instances of a process. The table is organised as follows:
BlockInfo Table
---------------
BlockID int primary key
Status varchar
Every process queries for the block of input it should take, and processes that block.
I am expecting the query to be the following:
select BlockID
from BlockInfo
where Status='available'
order by BlockID
limit 1
For this to work, I would require the server to maintain exclusive read locks: if the read lock were shared, multiple instances might get the same block, which would cause duplication of effort and is undesirable.
I could get an exclusive write lock, but not actually write anything. But I want to know if mysql permits an exclusive read lock.
It would also help to hear about alternate ways of implementing this.
What you should do is:
Get an exclusive write lock
Select the row you want to process
Change its status to "processing" (or something other than "available")
Unlock the table
Do all your processing of the row
Update the row to change its status back to "available"
This will then allow other processes to work on other rows concurrently with this. It keeps the table locked for just enough time to keep them from trying to work on the same row.
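A sketch of the claim step under those assumptions; LOCK TABLES takes the table-level write lock the first step calls for, and the 'processing' status value is invented:

LOCK TABLES BlockInfo WRITE;
SELECT BlockID INTO @claimed FROM BlockInfo
  WHERE Status = 'available' ORDER BY BlockID LIMIT 1;
-- mark the block so no other process picks it up
UPDATE BlockInfo SET Status = 'processing' WHERE BlockID = @claimed;
UNLOCK TABLES;
-- ... process the block outside the lock, then release it ...
UPDATE BlockInfo SET Status = 'available' WHERE BlockID = @claimed;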
If you want to achieve this at the database level, a table-level lock is the way to go, as mentioned in the other answer. But it is a bad design if performance matters to your application: it will result in frequent table locking and waiting.
I would suggest you divide the work inside the application.
Let one process read the available rows from the database and fill a queue for the worker processes that will process them.

Producer/consumer pattern via mysql

I have 2 processes that act as a producer/consumer via a table.
One process does only INSERTs into the table, while the other process does a SELECT for new records and then an UPDATE of those records when it finishes processing them, to mark them as finished.
This keeps happening constantly.
As far as I can see, there is no need for any locking or transactions for this simple interaction. Am I right about this?
Am I overlooking something?
I would say the prime consideration to take into account is a scenario where multiple workers retrieve the same row.
The UPDATE and SELECT operations themselves should be fine, but if you have multiple workers consuming via SELECT on the same table, then you might get two workers simultaneously processing the same row.
If each worker is required to process separate rows, locking on SELECT may be required with careful consideration of deadlock if there's a significant unit of work associated with your process.
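For instance, each worker could atomically claim a row before processing it. A sketch, assuming a claimed flag on the table (FOR UPDATE alone works everywhere; SKIP LOCKED needs MySQL 8.0+):

START TRANSACTION;
SELECT id INTO @job FROM queue
  WHERE finished = 0 AND claimed = 0
  ORDER BY id LIMIT 1
  FOR UPDATE SKIP LOCKED;  -- other workers skip the row this one has locked
UPDATE queue SET claimed = 1 WHERE id = @job;
COMMIT;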

Using a table to keep the last used ID in a web server farm

I use a table with one row to keep the last used ID (I have my reasons not to use auto_increment). My app should work in a server farm, so I wonder how I can update the last inserted ID (i.e. increment it) and select the new ID in one step, to avoid thread-safety problems (a race condition between servers in the server farm).
You're going to use a server farm for the database? That doesn't sound "right".
You may want to consider using GUIDs for IDs. They may be big, but they don't have duplicates.
With a single "next id" value you will run into locking contention for that record. What I've done in the past is use a table of ID ranges (RangeId, RangeFrom, RangeTo). The range table has a primary key, RangeId, that is a simple number (e.g. 1 to 100). The "get next id" routine picks a random number from 1 to 100 and takes the first range record with an id lower than that number. This spreads the locks out across N records. You can use tens, hundreds, or thousands of range records. When a range is fully consumed, just delete the range record (see the sketch below).
If you're really using multiple databases then you can manually ensure each database's set of range records do not overlap.
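A sketch of that range table and of the pick; all names are invented, and computing the random bucket once up front is what spreads the contention:

CREATE TABLE IdRanges (
  RangeId INT PRIMARY KEY,    -- e.g. 1 to 100
  RangeFrom BIGINT NOT NULL,  -- next id this range will hand out
  RangeTo BIGINT NOT NULL
);
SET @pick = FLOOR(1 + RAND() * 100);
START TRANSACTION;
-- lock a random bucket so concurrent callers rarely collide on one row
SELECT RangeId, RangeFrom INTO @r, @next FROM IdRanges
  WHERE RangeId <= @pick
  ORDER BY RangeId DESC LIMIT 1
  FOR UPDATE;
UPDATE IdRanges SET RangeFrom = RangeFrom + 1 WHERE RangeId = @r;
COMMIT;
-- @next is the allocated id; delete the range row once RangeFrom exceeds RangeTo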
You need to make sure that your ID column is only ever accessed under a lock; then only one client can read the highest ID and set the new one.
You can do this in C# using a lock statement around the code that accesses the table, or in your database you can wrap the read/write in a transaction. I don't know the exact syntax for this on mysql.
Use a transactional database and control transactions manually. That way you can submit multiple queries without risking having something mixed up. Also, you may store the relevant query sets in stored procedures, so you can simply invoke these transactional queries.
If you have problems with performance, increment the ID by 100 and use a thread per "client" server. The thread does the increment and hands each interested party a new ID; this way, it needs to access the DB only once per 100 IDs.
If the thread crashes, you'll lose a couple of IDs, but if that doesn't happen all the time, you shouldn't need to worry about it.
AFAIK the only way to get nicely incrementing numbers out of a DB is transactional locking at the DB level, which is hideous performance-wise. You can get lockless behaviour using GUIDs, but frankly you're going to run into transaction requirements in every CRUD operation you can think of anyway.
Assuming that your database is configured to run with a transaction isolation of READ_COMMITTED or better, then use one SQL statement that updates the row, setting it to the old value selected from the row plus an increment. With lower levels of transaction isolation you might need to use INSERT combined with SELECT FOR UPDATE.
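In MySQL the single-statement version is commonly written with LAST_INSERT_ID(expr), which both increments the counter and remembers the new value for the current session (the name of the one-row sequence table is assumed):

UPDATE sequence SET last_id = LAST_INSERT_ID(last_id + 1);
SELECT LAST_INSERT_ID();  -- the value this session just wrote, safe from races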
As pointed out [by Aaron Digulla] it is better to allocate blocks of IDs, to reduce the number of queries and table locks.
The application must perform the ID acquisition in a separate transaction from any business logic, otherwise any transaction that needs an ID will end up waiting for every transaction that asks for an ID first to commit/rollback.
This article: http://www.ddj.com/architect/184415770 explains the HIGH-LOW strategy that allows your application to obtain IDs from multiple allocators. Multiple allocators improve concurrency, reliability and scalability.
There is also a long discussion here: http://www.theserverside.com/patterns/thread.tss?thread_id=4228 "HIGH/LOW Singleton+Session Bean Universal Object ID Generator"