In my code I need to do the following:
Check a MySQL table (InnoDB) if a particular row (matching some criteria) exists. If it does, return it. If it doesn't, create it and then return it.
The problem I seem to have is race conditions. Every now and then two processes run so closely together, that they both check the table at the same time, don't see the row, and both insert it - thus duplicate data.
I'm reading the MySQL documentation trying to come up with some way to prevent this. What I've come up with so far:
Unique indexes seem to be one option, but they're not universal (it only works when the criteria is something unique for all rows).
Transactions even at SERIALIZABLE level don't protect against INSERT, period.
Neither do SELECT ... LOCK IN SHARE MODE or SELECT ... FOR UPDATE.
A LOCK TABLE ... WRITE would do it, but it's a very drastic measure - other processes won't be able to read from the table, and I need to lock ALL tables that I intend to use until I unlock them.
Basically, I'd like to do either of the following:
Prevent all INSERT to the table from processes other than mine, while allowing SELECT/UPDATE (this is probably impossible because it makes so little sense most of the time).
Organize some sort of manual locking. The two processes would coordinate among themselves which one gets to do the select/insert dance, while the other waits. This needs some sort of operation that waits until the lock is released. I could probably implement a spin-lock (one process repeatedly checks if the other has released the lock), but I'm afraid that it would be too resource intensive.
I think I found an answer myself. Transactions + SELECT ... FOR UPDATE in an InnoDB table can provide a synchronization lock (aka mutex). Have all processes lock on a specific row in a specific table before they start their work. Then only one will be able to run at a time and the rest will wait until the first one finishes its transaction.
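The idea of funnelling every check-then-insert through one shared lock can be sketched in a runnable form. This is only an illustration under assumptions: SQLite and a threading.Lock stand in for MySQL and the locked row, and the items table, get_or_create and run_workers are hypothetical names that do not come from the question.

```python
import sqlite3
import threading

# Stand-in for the shared "mutex row". In MySQL each process would instead run
# something like: START TRANSACTION; SELECT ... FROM <lock table> WHERE ... FOR UPDATE;
# and hold that row lock until COMMIT. The lock table is hypothetical.
mutex = threading.Lock()

conn = sqlite3.connect(":memory:", check_same_thread=False)
# Deliberately no unique index on name: the lock alone prevents duplicates.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")


def get_or_create(name):
    """Return the id of the row with this name, inserting it first if missing."""
    with mutex:  # only one worker at a time performs the check-then-insert
        row = conn.execute(
            "SELECT id FROM items WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return row[0]
        cur = conn.execute("INSERT INTO items (name) VALUES (?)", (name,))
        conn.commit()
        return cur.lastrowid


def run_workers(name, count=8):
    """Call get_or_create from several threads and collect the returned ids."""
    ids = []

    def worker():
        ids.append(get_or_create(name))

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids


ids = run_workers("widget")
row_count = conn.execute(
    "SELECT COUNT(*) FROM items WHERE name = 'widget'"
).fetchone()[0]
```

Every worker gets the same id back and only one row is created. The cost is the one described above: all workers queue behind a single lock, even when they are looking for different rows.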
Let's say the isolation level is REPEATABLE READ, since that is the default for MySQL.
I have two inserts (no checking, no unique columns).
a) Let's say these two inserts happen at the same moment. What will happen? Will MySQL run the first insert and then the second, or will it run both of them in different threads?
b) Let's say I have insert statement and column called vehicle_id as unique, but before that, I check if it exists or not. If it doesn't exist, go on and insert. Let's say two threads in my code both come at the same moment. So they will both go into if statement since they happened at the same moment.
Now, they both have to do insert with the same vehicle_id. How does MySQL handle this? If it's asynchronous or something, maybe both inserts might happen so quickly that they will both get inserted even though vehicle_id was the same as unique field. If it's not asynchronous or something, one will get inserted first, second one waits. When one is done, second one goes and tries to insert, but it won't insert because of unique vehicle_id restriction. How does this situation work?
I am asking because locks in repeatable read for INSERT lose their essence. I know how it's going to work for Updating/Selecting.
As I understand it the situation is:
a) Threads are assigned per connection. If both inserts are received on the same connection, they will be executed in the same thread, one after the other, in the order in which they are received. If they're in different threads, it comes down to whichever thread is scheduled first, which is likely determined by the OS and non-deterministic from your point of view.
b) If a column is defined as UNIQUE at the server, then you cannot insert a second row with the same value, so the second insert must fail.
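Point b) can be tried out in a few lines. This is a sketch, not MySQL itself: SQLite stands in for the server, and the vehicles table and the insert_vehicle helper are made up for illustration. In MySQL the second insert would fail with a duplicate-key error (error 1062) rather than sqlite3.IntegrityError, but the shape of the handling is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# vehicle_id is declared UNIQUE, as in the question.
conn.execute("CREATE TABLE vehicles (vehicle_id INTEGER UNIQUE, label TEXT)")


def insert_vehicle(vehicle_id, label):
    """Try to insert; return True on success, False if vehicle_id already exists."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO vehicles (vehicle_id, label) VALUES (?, ?)",
                (vehicle_id, label),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected the duplicate value.
        return False


first = insert_vehicle(42, "first writer")
second = insert_vehicle(42, "second writer")
rows = conn.execute(
    "SELECT label FROM vehicles WHERE vehicle_id = 42"
).fetchall()
```

With the constraint in place, the "check if it exists" step in application code is only an optimisation; the insert itself is what decides which writer wins.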
Trying to use a conflicting index in the way you described appears to be an application logic problem, not a MySQL problem. Whatever entity is responsible for your unique IDs (which is your application in this case) needs to ensure that they are unique. One approach is to implement an application lock using MySQL (for example with the GET_LOCK() and RELEASE_LOCK() functions), which allows applications running in isolation from each other to share a lock at the server. Check the MySQL docs for how to use this. Such a lock is advisory: it is intended for coordination at the application level and is not enforced by the MySQL server on table access. Another approach would be to use UUIDs for unique keys and rely on their uniqueness when you need to create a new one.
I'm studying about MySQL and how it works, and something confuses me and I don't find any clear explanation on the web about this.
What exactly is the difference between row and table locks? One locks the row and the other locks the table. Correct?
So, in which sort of situations would you use a table lock and a row lock? Is it something the programmer or database manager can program in, or is it the engine that does it for you?
If there is any other information you think is good to know, feel free to add that to your answer.
I'm sorry for this possible noobish question, but I'm still learning.
While this is about SQL Server, it applies well to MySQL as well: What are row, page and table locks? And when they are acquired?
MySQL docs shows this:
Generally, table locks are superior to row-level locks in the following cases:
Most statements for the table are reads.
Statements for the table are a mix of reads and writes, where writes are updates or deletes for a single row that can be fetched with one key read:
SELECT combined with concurrent INSERT statements, and very few UPDATE or DELETE statements.
Many scans or GROUP BY operations on the entire table without any writers.
Now when to use: The infamous "It depends" applies here:
Ask yourself what is the use case for this transaction?
Typically row-level locking will be used when highly granular control is needed. In my opinion this should be used as the default. Say an orders or order details table where the order could be updated or deleted. Locking the whole table on a high transaction volume table makes no sense. I want users of individual orders to be able to update each order and not lock someone else out when I know the scope of their change is limited to a specific order.
Now if I needed to restore the orders and details table from backup for some reason, or make many updates to many records based on an external source, I may lock the whole table to ensure all the updates complete successfully and I can verify the load before I let anyone back in. I don't want any changes while I'm making the needed updates. But we have to consider whether locking the whole table will negatively impact user experience, or whether we have no other options available. Locking at the table level will prevent other users from changing any value. Is this really what we want?
Assume a MySQL table called, say, results. results is automatically updated via cron every day, around 11AM. However, results is also updated from a user-facing front-end, and around 11AM, there are a lot of users performing actions that also update the results table. What this means is that the automatic cron and the user updates often fail with 'deadlock' errors.
Our current solution:
We have implemented a try/catch that will repeat the attempt 10 times before moving on to the next row. I do not like this solution at all because, well, it isn't a solution, just a workaround, and a faulty one at that. There's still no guarantee that the update will work at all if the deadlock persists through 10 attempts, and the execution time is potentially multiplied by 10 (not as much of an issue on the cron side, but definitely on the user side).
Another change we are about to implement is moving the cron to a different time of day, so as to not have the automatic update running at the same time as heavy platform usage. This should alleviate many of the problems for now; however, I still don't like it, as it is still just a workaround. If the usage patterns of our users change and the platform sees heavy use during that period, then we'll encounter the same issue again.
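For reference, a bounded retry loop of the kind described might look roughly like the sketch below. Everything here is assumed for illustration: DeadlockError is a hypothetical stand-in for whatever exception the database driver raises on a deadlock (MySQL reports it as error 1213), and the randomised, growing delay between attempts is an addition of this sketch, not something the original setup is known to use.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock exception."""


def run_with_retry(operation, max_attempts=10, base_delay=0.01):
    """Run operation(), retrying on DeadlockError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            # operation should re-run the whole transaction, since the
            # deadlock victim's transaction has been rolled back.
            return operation()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            # Wait a random, growing amount so the competing sessions
            # are less likely to collide again on the next attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Demo: an operation that deadlocks twice and then succeeds.
calls = {"count": 0}


def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise DeadlockError("simulated deadlock")
    return "updated"


result = run_with_retry(flaky_update, base_delay=0.001)
```

As noted above, this only reduces how often a failure surfaces; it does not remove the cause of the deadlocks.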
Is there a solution, either technical (code) or architectural (database design) that can help me alleviate or eliminate altogether these deadlock errors?
Deadlocks happen when you have one transaction that is acquiring locks on multiple rows in a non-atomic fashion, i.e. updates row A, then a split-second later it updates row B.
But there's a chance other sessions can split in between these updates and lock row B first, then try to lock row A. It can't lock row A, because the first session has got it locked. And now the first session won't give up its lock on row A, because it's waiting on row B, which the second session has locked.
Solutions:
All sessions must lock rows in the same order. So either session 1 or 2 will lock row A, the other will wait for row A. Only after locking row A does any session proceed to request a lock for row B. If all sessions are locking rows in ascending order, then they will never deadlock (descending order works just as well, the point is that all sessions must do the same).
Make one atomic lock-acquiring operation per transaction. Then you can't get this kind of interleaving effect.
Use pessimistic locking. That is, lock all resources the session might need to update in one atomic lock request at the beginning of its work. One example of doing this broadly is the LOCK TABLES statement. But this is usually considered a hindrance to concurrent access to the tables.
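The first solution, consistent lock ordering, can be illustrated outside the database. The sketch below is an analogy under assumptions: Python threading.Lock objects stand in for InnoDB row locks, and row_locks, values and update_rows are hypothetical names. Two workers ask for rows A and B in opposite orders, but both acquire the locks in sorted order, so neither can end up holding one lock while waiting for the other.

```python
import threading

# One lock per "row"; these stand in for InnoDB row locks.
row_locks = {"A": threading.Lock(), "B": threading.Lock()}
values = {"A": 0, "B": 0}


def update_rows(keys, amount):
    """Add amount to each named row, locking rows in ascending key order."""
    ordered = sorted(keys)  # the same order for every caller
    for key in ordered:
        row_locks[key].acquire()
    try:
        for key in keys:
            values[key] += amount
    finally:
        for key in reversed(ordered):
            row_locks[key].release()


def worker(keys, rounds):
    for _ in range(rounds):
        update_rows(keys, 1)


threads = [
    threading.Thread(target=worker, args=(["A", "B"], 1000)),
    threading.Thread(target=worker, args=(["B", "A"], 1000)),  # opposite order
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If update_rows acquired the locks in the order given by keys instead of sorted order, the two workers could each grab one lock and wait forever for the other, which is the interleaving described above.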
You might like my presentation InnoDB Locking Explained with Stick Figures. The section on deadlocks starts on slide 68.
If two independent scripts call a database with update requests to the same field, but with different values, would they execute at the same time and one overwrite the other?
As an example to help ensure clarity, imagine both of these statements being requested to run at the same time, each by a different script, where Status = 2 is called microseconds after Status = 1 by coincidence.
Update My_Table SET Status = 1 WHERE Status= 0;
Update My_Table SET Status = 2 WHERE Status= 0;
What would my results be and why? If other factors play a role, expand on them as much as you please; this is meant to be a general idea.
Side Note:
Because I know people will still ask, my situation is using MySQL with Google App Engine, but I don't want to limit this question to just me should it be useful to others. I am using Status as an identifier for which script is working on the field. If Status is not 0, no other script is allowed to touch it.
This is what locking is for. All major SQL implementations lock DML statements by default so that one query won't overwrite another before the first is complete.
There are different levels of locking. If you've got row locking then your second update will run in parallel with the first, so at some point you'll have 1s and 2s in your table.
Table locking would force the second query to wait for the first query to completely finish and release its table lock.
You can usually turn off locking right in your SQL, but it's only ever done if you need a performance boost and you know you won't encounter race conditions like in your example.
Edits based on the new MySQL tag
If you're updating a table that uses the InnoDB engine, then you're working with row locking, and your query could yield a table with both 1s and 2s.
If you're working with a table that uses the MyISAM engine, then you're working with table locking, and your update statements would end up with a table that would either have all 1s or all 2s.
From https://dev.mysql.com/doc/refman/5.0/en/lock-tables-restrictions.html (MySQL):
Normally, you do not need to lock tables, because all single UPDATE statements are atomic; no other session can interfere with any other currently executing SQL statement. However, there are a few cases when locking tables may provide an advantage:
From https://msdn.microsoft.com/en-us/library/ms177523.aspx (SQL Server):
An UPDATE statement always acquires an exclusive (X) lock on the table it modifies, and holds that lock until the transaction completes. With an exclusive lock, no other transactions can modify data.
If you had two separate connections executing the two posted update statements, whichever statement started first would be the one that completed. The other statement would not update the data, as there would no longer be any records with a status of 0.
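The outcome described here, where the second statement finds nothing left to update, can be reproduced in a small sketch. Two assumptions apply: SQLite stands in for MySQL, and the two statements are simply run one after the other, which models the case where the server has serialized them. The id column and the three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_Table (id INTEGER PRIMARY KEY, Status INTEGER)")
conn.executemany("INSERT INTO My_Table (Status) VALUES (?)", [(0,), (0,), (0,)])
conn.commit()

# The first script's statement claims every row that still has Status = 0.
first_rows = conn.execute(
    "UPDATE My_Table SET Status = 1 WHERE Status = 0"
).rowcount
conn.commit()

# The second script's statement runs afterwards and matches no rows.
second_rows = conn.execute(
    "UPDATE My_Table SET Status = 2 WHERE Status = 0"
).rowcount
conn.commit()

statuses = [
    row[0] for row in conn.execute("SELECT Status FROM My_Table ORDER BY id")
]
```

Checking the affected-row count after the UPDATE is how a script can tell whether it was the one that claimed the rows.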
The short answer is: it depends on which statement commits first. Just because one process started an update statement before another doesn't mean that it will complete before another. It might not get scheduled first, it might be blocked by another process, etc.
Ultimately, it's a race condition: the operation that completes (and commits) last, wins.
Since you have TWO scripts doing the same thing and using different values for the UPDATE, they will NOT run at the same time; one of the scripts will run before the other, even if you think you are calling them at the same time. You need to specify WHEN each script should run, otherwise the program will not know what should be 1 and what should be 2.
When doing a transaction in a MySQL db, they talk about the ongoing transaction not being able to see any updates made by external sources until it commits. So does this mean that changes CAN be made but the transaction just will not be able to see them, or is it actually impossible to update the db while the transaction is going on?
I ask because I need it to be impossible for other queries to change anything about certain tables while the transaction is going on. Right now I write-lock all those tables, start a transaction for the atomicity, commit, and then unlock. Is this the way to do this?
From my testing it seems that setting the isolation level to SERIALIZABLE accomplishes the same as manual table locking and unlocking? Is this correct?
It's going to depend on the transaction isolation level you have set on your database. You can read more about the levels here. For example, for READ UNCOMMITTED, you can actually read rows that are uncommitted by another transaction. This is usually not what you want to happen.
Locking an entire table is a really extreme choice though, and should probably not be done unless there's no other choice. My recommendation would be to consider the rows you need to lock, and then you can lock those specific rows using a select for update statement.
For example, suppose you have a resources table and a schedules table that contains bookings for those resources. When booking a resource, you have to check the schedules table for that resource to make sure it's available for the desired time. However, you have to do this in a concurrency-safe way: between the time you check the schedules table for availability and the time you actually insert the row into the schedules table, you want to ensure that no other transaction books the resource for the same time (or an overlapping time).
You can accomplish this by using a select for update command:
select * from resources where resource_name='a' for update;
Assuming you're doing this in a stored procedure, if some other code fires the stored procedure for the same resource, it will block on that statement. This will ensure that resources don't get double booked.
We could also accomplish this by locking the entire resources table. However, there's no need to do that since we're only interested in booking a single resource. So it's good enough to just lock the resource row we care about.
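Note that for MySQL, you need to index the columns you use in the WHERE clause of the for update query; otherwise InnoDB has to scan, and therefore lock, every row it examines, which effectively locks the entire table.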
The point to all this is to always consider maximum concurrency. In other words, don't lock more than you need to. Otherwise, you make the application much less scalable and you inhibit concurrency.
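The "lock only the row you care about" idea can be sketched with one lock per resource instead of one lock for everything. This is an in-process analogy under assumptions: threading.Lock objects play the role of the row lock taken by select ... for update, an in-memory dict plays the role of the schedules table, and lock_for, book and try_book are hypothetical names.

```python
import threading

_registry_lock = threading.Lock()  # protects the lock registry itself
_resource_locks = {}               # one lock per resource, like one row lock
schedules = {}                     # resource_name -> list of (start, end)


def lock_for(resource_name):
    """Return the lock dedicated to this resource, creating it if needed."""
    with _registry_lock:
        return _resource_locks.setdefault(resource_name, threading.Lock())


def book(resource_name, start, end):
    """Book [start, end) unless it overlaps an existing booking."""
    with lock_for(resource_name):  # blocks only bookings of the same resource
        bookings = schedules.setdefault(resource_name, [])
        for s, e in bookings:
            if start < e and s < end:  # the two intervals overlap
                return False
        bookings.append((start, end))
        return True


results = []


def try_book(resource_name, start, end):
    results.append((resource_name, book(resource_name, start, end)))


threads = [
    threading.Thread(target=try_book, args=("a", 9, 11)),
    threading.Thread(target=try_book, args=("a", 10, 12)),  # overlaps the first
    threading.Thread(target=try_book, args=("b", 10, 12)),  # different resource
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the two overlapping bookings for resource a succeeds, whichever thread gets the lock first, while the booking for resource b never waits on either of them.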