Django MySQL REPEATABLE READ "data loss"

I'm looking for information about what is behind this entry in the Django 2.0 release notes:
MySQL’s default isolation level, repeatable read, may cause data loss in typical Django usage. To prevent that and for consistency with other databases, the default isolation level is now read committed. You can use the DATABASES setting to use a different isolation level, if needed.
As I understand it, repeatable read is "stricter" than read committed, so the question of what Django does to produce "data loss" has been bugging me for some time now.
Is it stuff like prefetch_related? Or, more generally, can making an UPDATE based on potentially stale data (SELECTed earlier in a thread) be considered data loss? Or, even better, is there something that only MySQL does, or a bug it has, that makes it dangerous under repeatable read?
Thank you.

There are actually two issues with REPEATABLE READ worth noting here.
One is that Django's internal code is written with the expectation that transactions obey the READ COMMITTED semantics. The fact that REPEATABLE READ is actually a higher isolation level doesn't matter; the point is that it violates the expectations of Django's code, leading to incorrect behavior. (In fact, adding a setting to change the isolation level was initially resisted because "it would imply that Django works correctly under any isolation level, and I don't think that's true".)
A straightforward example (first noted in the issue tracker 9 years ago) is the behavior of get_or_create(). It works by first trying to read the row; then, if that fails, trying to create it. If that creation operation fails it's presumably because some other transaction has created the row in the meantime. Therefore, it tries again to read and return it. That works as expected in READ COMMITTED, but in REPEATABLE READ that final read won't find anything because the read must return the same results (none) that it found the first time.
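In rough pseudocode, the pattern looks like this (a simplified sketch of the get-or-create idea, not Django's actual implementation):

from django.db import IntegrityError

def get_or_create(model, **kwargs):
    # Simplified sketch, not Django's real code.
    try:
        return model.objects.get(**kwargs), False
    except model.DoesNotExist:
        try:
            return model.objects.create(**kwargs), True
        except IntegrityError:
            # Someone else created the row in the meantime; re-read it.
            # Under READ COMMITTED this re-read finds the new row.
            # Under REPEATABLE READ it repeats the original (empty)
            # result and raises DoesNotExist again.
            return model.objects.get(**kwargs), False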
That's not data loss, though. The second issue is specific to MySQL and the non-standard way it defines the behavior of REPEATABLE READ. Roughly speaking, reads will behave as in REPEATABLE READ but writes will behave as in READ COMMITTED, and that combination can lead to data loss. This is best demonstrated with an example, so let me quote this one, provided by core contributor Shai Berger:
(1) BEGIN TRANSACTION
(2) SELECT ... FROM some_table WHERE some_field=some_value
(1 row returned)
(3) (some other transactions commit)
(4) SELECT ... FROM some_table WHERE some_field=some_value
(1 row returned, same as above)
(5) DELETE some_table WHERE some_field=some_value
(answer: 1 row deleted)
(6) SELECT ... FROM some_table WHERE some_field=some_value
(1 row returned, same as above)
(7) COMMIT
(the row that was returned earlier is no longer in the database)
Take a minute to read this. Up to step (5), everything is as you would expect;
you should find steps (6) and (7) quite surprising.
This happens because the other transactions in (3) deleted the row that is
returned in (2), (4) & (6), and inserted another one where
some_field=some_value; that other row is the row that was deleted in (5). The
row that this transaction selects was not seen by the DELETE, and hence not
changed by it, and hence continues to be visible by the SELECTs in our
transaction. But when we commit, the row (which has been deleted) no longer
exists.
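If you want to reproduce this yourself, here is a sketch of the same sequence using two MySQLdb connections; the table, column and credentials are placeholders, and the steps are numbered as in the quote:

import MySQLdb

# Placeholder schema: some_table(id INT PRIMARY KEY AUTO_INCREMENT, some_field INT)
conn1 = MySQLdb.connect(db="test", user="u", passwd="p")
conn2 = MySQLdb.connect(db="test", user="u", passwd="p")
cur1, cur2 = conn1.cursor(), conn2.cursor()

cur1.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")

# (1)-(2): the first transaction reads the row; this also takes the snapshot.
cur1.execute("SELECT id FROM some_table WHERE some_field = 1")
print(cur1.fetchall())                # 1 row

# (3): another transaction replaces the row and commits.
cur2.execute("DELETE FROM some_table WHERE some_field = 1")
cur2.execute("INSERT INTO some_table (some_field) VALUES (1)")
conn2.commit()

# (4): the repeatable-read snapshot still returns the old row.
cur1.execute("SELECT id FROM some_table WHERE some_field = 1")
print(cur1.fetchall())                # same row as before

# (5): the DELETE is a current read: it sees, and removes, the NEW committed row.
cur1.execute("DELETE FROM some_table WHERE some_field = 1")
print(cur1.rowcount)                  # 1 row deleted

# (6): the snapshot is untouched by that delete, so the old row is still "there".
cur1.execute("SELECT id FROM some_table WHERE some_field = 1")
print(cur1.fetchall())                # still 1 row

conn1.commit()

# (7): a fresh transaction shows that no matching row survived.
cur1.execute("SELECT id FROM some_table WHERE some_field = 1")
print(cur1.fetchall())                # empty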
The tracker issue that led to this change in the default isolation level gives additional detail on the problem, and it links to other discussions and issues if you want to read more.
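For completeness, this is roughly how the isolation level can be pinned per database via the DATABASES setting mentioned in the release notes (a settings.py sketch; the credentials are placeholders):

# settings.py (sketch)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydb",       # placeholder
        "USER": "myuser",     # placeholder
        "PASSWORD": "...",    # placeholder
        "OPTIONS": {
            # 'read committed' is Django's default since 2.0; set
            # 'repeatable read' only if you know your code is safe under it.
            "isolation_level": "read committed",
        },
    }
}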

Related

What are the consequences of accessing the same database from different programs at the same time?

I have a mysql database that is accessed using JDBC. If I access the database from two different programs at the same time then what effect will be there on the database?
Please explain for the cases when both programs are reading the database, when one is reading and the other is writing data, and when both are writing data.
I think that when both programs write data, that would definitely lead to loss of data. But what happens in the other scenarios?
MySQL works on an ACID basis: http://en.wikipedia.org/wiki/ACID
This means both clients can read the database as if they were the only clients.
For this to happen each client must start a transaction, which is a single logical unit of work. Within this transaction either all the operations done to the database must be committed or rolled back.
Different RDBMSs have different defaults for their transaction support. For MySQL, the isolation level is REPEATABLE READ, which means that SELECT statements within the same transaction are consistent with respect to each other.
How you can verify this:
Have program 1 start a transaction and go through every row, increasing a value, while the other program starts a transaction and goes through the database calculating the sum of the same value over all rows. When they are done, they close their transactions and print out the results. You will notice that both of them read the database as if they were isolated from each other.
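A sketch of that experiment in Python rather than JDBC (using MySQLdb; the items table and the credentials are placeholders, and the interleaving is stepped through manually to make it deterministic):

import MySQLdb

conn_writer = MySQLdb.connect(db="test", user="u", passwd="p")
conn_reader = MySQLdb.connect(db="test", user="u", passwd="p")
wcur, rcur = conn_writer.cursor(), conn_reader.cursor()

# Writer: increment a value in every row, inside one (uncommitted) transaction.
wcur.execute("UPDATE items SET value = value + 1")

# Reader: sum the same column inside its own transaction.
rcur.execute("SELECT SUM(value) FROM items")
print(rcur.fetchone())  # pre-update values: uncommitted changes are invisible

conn_writer.commit()

# Under REPEATABLE READ the reader's snapshot survives even the commit...
rcur.execute("SELECT SUM(value) FROM items")
print(rcur.fetchone())  # still the old sum

# ...until the reader ends its own transaction.
conn_reader.commit()
rcur.execute("SELECT SUM(value) FROM items")
print(rcur.fetchone())  # now reflects the committed increments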
There are whole books written about JDBC. Here are some links that can get you started:
JDBC Tutorial: http://docs.oracle.com/javase/tutorial/jdbc/
MySQL: http://dev.mysql.com/doc/refman/5.0/en/innodb-consistent-read.html
Fortunately, MySQL, like PostgreSQL, MariaDB and other major databases, is designed to be used by many programs, each of which may hold many connections. The database will not break even if multiple programs try to update the same row at the same time. But how to do that correctly is the client programs' problem, and the answer is transactions.
Welcome to the world of ACID transactions! Within a transaction, the database guarantees that the program sees a certain level of consistency. There are no problems with Atomicity, Consistency and Durability, but Isolation is a little more tedious. JDBC defines four levels of isolation, plus no transaction at all (the following is extracted from The Java Tutorials: Using Transactions):
The interface Connection includes five values that represent the transaction isolation levels you can use in JDBC:
Isolation Level                Transactions    Dirty Reads     Non-Repeatable Reads    Phantom Reads
TRANSACTION_NONE               Not supported   Not applicable  Not applicable          Not applicable
TRANSACTION_READ_COMMITTED     Supported       Prevented       Allowed                 Allowed
TRANSACTION_READ_UNCOMMITTED   Supported       Allowed         Allowed                 Allowed
TRANSACTION_REPEATABLE_READ    Supported       Prevented       Prevented               Allowed
TRANSACTION_SERIALIZABLE       Supported       Prevented       Prevented               Prevented
Accessing an updated value that has not been committed is considered a dirty read because it is possible for that value to be rolled back to its previous value.
A non-repeatable read occurs when transaction A retrieves a row, transaction B subsequently updates the row, and transaction A later retrieves the same row again. Transaction A retrieves the same row twice but sees different data.
A phantom read occurs when transaction A retrieves a set of rows satisfying a given condition, transaction B subsequently inserts or updates a row such that the row now meets the condition in transaction A, and transaction A later repeats the conditional retrieval. Transaction A now sees an additional row. This row is referred to as a phantom.
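Whatever the client library, the same levels can also be selected per session with plain SQL. A minimal Python sketch (the connection parameters are placeholders):

import MySQLdb

conn = MySQLdb.connect(db="test", user="u", passwd="p")
cur = conn.cursor()

# Applies to every transaction subsequently started on this session.
cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")

cur.execute("SELECT @@transaction_isolation")  # @@tx_isolation before MySQL 5.7.20
print(cur.fetchone())  # ('READ-COMMITTED',)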

What transactions does commit / rollback affect?

Does it only affect whatever commands were after the relevant BEGIN transaction?
For example:
BEGIN TRAN
UPDATE orders SET orderdate = '01-08-2013' WHERE orderno > '999'
Now, assume someone else performs a data import that inserts 10,000 new records into another table.
If I subsequently issue a ROLLBACK command, do those records get discarded or is it just the command above that gets rolled back?
Sorry if this is a stupid question; I'm only just starting to use COMMIT and ROLLBACK.
Any transaction is confined to the connection it was opened on.
One of the four ACID properties of any relational database management system is Isolation. That means your actions are isolated from other connections and vice versa until you commit. Any change you do is invisible to other connections and if you roll it back they will never know it happened. That means in turn that changes that happened somewhere else are invisible to you until they are committed. Particularly that means that you can't ROLLBACK anyone else's changes.
The Isolation is achieved in one of two ways. One way is to "lock" the resource (e.g. the row). If that happens any other connection trying to read from that row has to wait until you finish your transaction.
The other way is to create a copy of the row that contains the old values. In this case all other connections will see the old version until you commit your transaction.
SQL Server can use both isolation methods. Which one is used depends on the isolation level you choose. The two snapshot isolation levels use the "copy method"; the other four use the "lock method". The default isolation level, READ COMMITTED, is one of the "lock method" isolation levels.
Be aware however that the isolation level "READ UNCOMMITTED" basically circumvents these mechanisms and allows you to read changes that others started and have not yet committed. This is a special isolation level that can be helpful when diagnosing a problem but should be avoided in production code.
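To see this connection-scoped behaviour for yourself, here is a rough Python sketch (shown against MySQL for consistency with the rest of this page, though the same applies to SQL Server; table, values and credentials are placeholders):

import MySQLdb

conn_a = MySQLdb.connect(db="test", user="u", passwd="p")
conn_b = MySQLdb.connect(db="test", user="u", passwd="p")
cur_a, cur_b = conn_a.cursor(), conn_b.cursor()

# Connection A changes a row but does not commit yet.
cur_a.execute("UPDATE orders SET orderdate = '2013-08-01' WHERE orderno = 1000")

# Connection B still sees the old value: A's change is isolated.
cur_b.execute("SELECT orderdate FROM orders WHERE orderno = 1000")
print(cur_b.fetchone())  # the old date

# Rolling back A undoes only A's own pending change...
conn_a.rollback()

# ...and B can only ever roll back its own work, never A's.
conn_b.rollback()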

Should I commit after a single select

I am working with MySQL 5.0 from python using the MySQLdb module.
Consider a simple function to load and return the contents of an entire database table:
def load_items(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM MyTable")
    return cursor.fetchall()
This query is intended to be a simple data load and not have any transactional behaviour beyond that single SELECT statement.
After this query is run, it may be some time before the same connection is used again to perform other tasks, though other connections can still be operating on the database in the meantime.
Should I be calling connection.commit() soon after the cursor.execute(...) call to ensure that the operation hasn't left an unfinished transaction on the connection?
There are two things you need to take into account:
the isolation level in effect
what kind of state you want to "see" in your transaction
The default isolation level in MySQL is REPEATABLE READ which means that if you run a SELECT twice inside a transaction you will see exactly the same data even if other transactions have committed changes.
Most of the time people expect to see committed changes when running the second select statement - which is the behaviour of the READ COMMITTED isolation level.
If you did not change the default level in MySQL and you do expect to see committed changes when you run a SELECT twice, then you can't do both in the "same" transaction: you need to commit (thereby ending the transaction) after your first SELECT statement.
If you actually want a consistent view of the data throughout your transaction, then of course you should not commit.
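Concretely, if the connection may sit idle and later be reused, ending the read-only transaction right away is cheap and avoids serving a stale snapshot later. A sketch of the function above with that change:

def load_items(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM MyTable")
    rows = cursor.fetchall()
    # End the read-only transaction so this connection does not keep
    # serving the same old snapshot under REPEATABLE READ.
    connection.commit()
    return rows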
then after several minutes, the first process carries out an operation which is transactional and attempts to commit. Would this commit fail?
That totally depends on your definition of "is transactional". Anything you do in a relational database "is transactional" (That's not entirely true for MySQL actually, but for the sake of argumentation you can assume this if you are only using InnoDB as your storage engine).
If that "first process" only selects data (i.e. a "read only transaction"), then of course the commit will work. If it tried to modify data that another transaction has already committed and you are running with REPEATABLE READ you probably get an error (after waiting until any locks have been released). I'm not 100% about MySQL's behaviour in that case.
You should really try this manually with two different sessions using your favorite SQL client to understand the behaviour. Do change your isolation level as well to see the effects of the different levels too.

Locking mySQL tables/rows

Can someone explain the need to lock tables and/or rows in MySQL?
I am assuming that it is to prevent multiple writes to the same field. Is this the best practice?
First, let's look at a good document. It is not MySQL documentation, it's about PostgreSQL, but it's one of the simplest and clearest explanations of transactions I've read; you'll understand MySQL transactions better after reading it: http://www.postgresql.org/docs/8.4/static/mvcc.html
When you're running a transaction, 4 rules are applied (ACID):
Atomicity: all or nothing (rollback)
Consistency: coherent before, coherent after
Isolation: not impacted by others
Durability: once committed, it's really done
Of these rules, only one is problematic: Isolation. Using a transaction does not ensure a perfect isolation level. The link above explains better what phantom reads and other such isolation problems between concurrent transactions are. But to make it simple: you should really use row-level locks to prevent other transactions, running at the same time as yours (and maybe committing before yours), from altering the same records. But with locks come deadlocks...
Then, when you try using nice transactions with locks, you'll need to handle deadlocks, and you'll need to handle the fact that a transaction can fail and should be re-launched (simple for or while loops).
Edit:
Recent versions of InnoDB provide greater levels of isolation than previous ones. I've done some tests and I must admit that even the phantom reads that should happen are now difficult to reproduce.
MySQL is by default on level 3 of the 4 levels of isolation explained in the PostgreSQL document (whereas PostgreSQL defaults to level 2). This is REPEATABLE READ, which means you won't have dirty reads and you won't have non-repeatable reads. So someone modifying a row on which you based a SELECT in your transaction will run into an implicit lock (as if you had performed a SELECT ... FOR UPDATE).
Warning: if you work with an older version of MySQL, like 5.0, you may be at level 2, and you'll need to take the row lock yourself using FOR UPDATE!
You can always find some nice race conditions; when working with aggregate queries it can be safer to be at the 4th level of isolation (by using LOCK IN SHARE MODE at the end of your query) if you do not want people adding rows while you're performing some task. I've been able to reproduce one serializable-level problem, but I won't explain the complex example here; really tricky race conditions.
There is a very nice example of race conditions that even the serializable level cannot fix here: http://www.postgresql.org/docs/8.4/static/transaction-iso.html#MVCC-SERIALIZABILITY
When working with transactions, the most important things are:
data used in your transaction must always be read INSIDE the transaction (re-read it if you had fetched it before the BEGIN)
understand why a high isolation level sets implicit locks and may block some other queries (and make them time out)
try to avoid deadlocks (lock tables in the same order), but handle them (retry a transaction aborted by MySQL; see the sketch after this list)
try to freeze important source tables with the serializable isolation level (LOCK IN SHARE MODE) when your application code assumes that no insert or update will modify the dataset it is using (otherwise you will not get errors, but your results will have ignored the concurrent changes)
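As a sketch of the retry point above (assuming MySQLdb, where a deadlock surfaces as OperationalError with MySQL's error code 1213; the work callback is a hypothetical unit of work):

import MySQLdb

def run_with_retry(connection, work, max_attempts=3):
    # 'work' performs the queries of one transaction; a deadlock aborts
    # the transaction, so we roll back and run the whole unit again.
    for attempt in range(max_attempts):
        try:
            work(connection)
            connection.commit()
            return
        except MySQLdb.OperationalError as exc:
            connection.rollback()
            if exc.args[0] != 1213 or attempt == max_attempts - 1:
                raise  # not a deadlock, or out of retries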
It is not a best practice. Modern versions of MySQL support transactions with well defined semantics. Use transactions, and forget about locking stuff by hand.
The only new thing you'll have to deal with is that transaction commits may fail because of race conditions, but you'd be doing error checking with locks anyway, and it is easier to retry the logic that led to a transaction failure than to recover from errors in a non-transactional setup.
If you do get race conditions and failed commits, then you may want to fine-tune the isolation configuration for your transactions.
For example if you need to generate invoice numbers which are sequential and have no numbers missing - this is a requirement at least in the country I live in.
If you have a few web servers, then a few users might be buying stuff literally at the same time.
If you do select max(invoice_id)+1 from invoice to get the new invoice number, two web servers might do that at the same time (before the new invoice has been added), and get the same invoice number for the invoices they're trying to create.
If you use a mechanism such as auto_increment, it is just meant to generate unique values, and makes no guarantees about not missing out numbers (if one transaction tries to insert a row and then rolls back, the number is "lost").
So the solution is to (a) lock the table (b) select max(invoice_id)+1 from invoice (c) do the insert (d) commit + unlock the table.
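Here is a sketch of that sequence in Python. It swaps the explicit table lock for a locking read (SELECT ... FOR UPDATE), in line with the row-level-locking advice below; the invoice table and invoice_id column are placeholders, and invoice_id is assumed to be indexed:

def next_invoice_number(connection):
    cursor = connection.cursor()
    # The locking read makes concurrent callers queue up here until we
    # commit, so no two transactions can compute the same number.
    cursor.execute("SELECT COALESCE(MAX(invoice_id), 0) + 1 FROM invoice FOR UPDATE")
    new_id = cursor.fetchone()[0]
    cursor.execute("INSERT INTO invoice (invoice_id) VALUES (%s)", (new_id,))
    connection.commit()  # releases the lock
    return new_id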
On another note, in MySQL you're best off using InnoDB and row-level locking. Issuing a LOCK TABLES command can implicitly commit the transaction you're working on.
Take a look here for general introduction to what transactions are and how to use them.
Databases are designed to work in concurrent environments, so locking the tables and/or records helps to keep the transactions consistent.
So a record affected by one transaction should not be altered until this transaction commits or rolls back.

mysql isolation levels

I'm a bit confused by the documentation here. I have a transaction that does the following:
start transaction
does some updates
does some selects
does some more updates
commit
I want my selects at step 3 to see the results of updates in step 2 but I want to be able to roll back the whole thing.
read committed seems to imply that selects only show data that's been committed, and repeatable read seems to imply that all subsequent selects will see the same data as existed at the time of the 1st select - thus ignoring my updates. read uncommitted seems to do the right thing, but: "but a possible earlier version of a row might be used" -- this is also not acceptable, as my selects MUST see the result of my updates.
is serializable really my only hope here?
I'm working off the documentation here
Transaction isolation levels describe only the interaction between concurrent transactions. With any isolation level, rows that you have updated within a transaction will show as updated when you re-select them from within that same transaction.
The right isolation level in your case seems to be read committed, so you can roll back at any point and uncommitted data is not visible to other transactions.
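A minimal demonstration of both points (Python sketch with placeholder table and credentials; the isolation level does not matter here):

import MySQLdb

conn = MySQLdb.connect(db="test", user="u", passwd="p")
cur = conn.cursor()

cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")

# Same transaction: this SELECT sees the uncommitted UPDATE above,
# at any isolation level.
cur.execute("SELECT balance FROM accounts WHERE id = 1")
print(cur.fetchone())  # reflects the -100

# And the whole thing can still be undone.
conn.rollback()

cur.execute("SELECT balance FROM accounts WHERE id = 1")
print(cur.fetchone())  # the original balance again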