How do databases maintain results if another query edits those results - mysql

I was just wondering how most relational databases handle maintaining your set of results if another query has edited the rows that you were working on. For instance, if I do a select of like 100k rows and, while I am still fetching those, another query comes in and does an update on 1 of the rows that hasn't been read yet, the update is not going to be seen in the fetching of those rows, and I was wondering how the database engine handles that. If you only have specifics for one type of database that's fine, I would like to hear it anyway.

Please look up Multi Version Concurrency Control. Different databases have different approaches to managing this. For MySQL, InnoDB, you can try http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html. PostgreSQL - https://wiki.postgresql.org/wiki/MVCC. A great presentation here - http://momjian.us/main/writings/pgsql/mvcc.pdf. It is also explained on Stack Overflow in the thread "Database: What is Multiversion Concurrency Control (MVCC) and who supports it?"

The general goal you are describing in concurrent programming (Wikipedia concurrency control) is serialization (Wikipedia serializability): an implementation manages the database as if transactions occurred without overlap in some order.
This matters because only then does the system behave the way the code describes as we normally read it. Otherwise the results of operations are a combination of all processes acting concurrently. Nevertheless, transaction throughput can be increased by allowing limited categories of non-isolated, so-called anomalous, behaviours to arise. So those implementation techniques are also apropos. (Eg MVCC.) But understand: such non-serialized behaviour is not isolating one transaction from another. (Ie so-called "isolation" levels are actually non-isolation levels.)
Isolation is managed by breaking transaction implementations into pieces based on reading and writing shared resources and executing them interlaced with pieces from other transactions in such a way that the effect is the same as some sequence of non-overlapped execution. Roughly speaking, one can "pessimistically" "lock" out other processes from changed resources and have them wait or "optimistically" "version" the database and "roll back" (throw away) some processes' work when changes are unreconcilable (unserializable).
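To make the "optimistic" half of that concrete, here is a minimal sketch (not any particular product's implementation). The names `VersionedCell` and `ConflictError` are made up for illustration, and the class is single-threaded on purpose: it only shows the idea of validating a version at commit time and throwing away work that no longer fits.

```python
class ConflictError(Exception):
    """Raised when a commit is based on a stale read (hypothetical name)."""


class VersionedCell:
    """A single shared value with a version counter (illustrative only)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # A transaction remembers which version its work is based on.
        return self.value, self.version

    def commit(self, new_value, expected_version):
        # Optimistic validation: refuse the write if someone else committed
        # after we read. The caller must "roll back" and redo its work.
        if self.version != expected_version:
            raise ConflictError("value changed since it was read")
        self.value = new_value
        self.version += 1


cell = VersionedCell(100)
a_value, a_version = cell.read()      # transaction A reads 100
b_value, b_version = cell.read()      # transaction B reads 100 too

cell.commit(a_value + 10, a_version)  # A commits first: value is now 110

try:
    cell.commit(b_value - 30, b_version)   # B's work is based on stale data
    b_conflicted = False
except ConflictError:
    b_conflicted = True
    b_value, b_version = cell.read()       # B re-reads and retries
    cell.commit(b_value - 30, b_version)
```

A pessimistic implementation would instead make B wait at `read()` until A had finished, so B never has to redo anything but may sit idle.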
Some techniques based on an understanding of serializability by an implementer for a major product are in this answer. For relevant notions and techniques, see the Wikipedia articles or a database textbook. Eg Fundamentals of database systems by Ramez Elmasri & Shamkant B. Navathe. (Many textbooks, slides and courses are free online.)
(Two answers and a comment to your question mention MVCC. As I commented, not only is MVCC just one implementation technique, it doesn't even support transaction serialization, ie actually isolating transactions as if each was done all at once. It allows certain kinds of errors (aka anomalies). It must be mixed with other techniques for isolation. The MVCC answers, comments and upvoting reflect a confusion between a popular and valuable technique for a useful and limited failure to isolate per your question vs the actual core issues and means.)

As Jayadevan notes, the general principle used by most widely used databases that permit you to modify values while they're being read is multi-version concurrency control or MVCC. All widely used modern RDBMS implementations that support reading rows that're being updated rely on some concept of row versioning.
The details differ between implementations. (I'll explain PostgreSQL's a little more here, but you should accept Jayadevan's answer not mine).
PostgreSQL uses transaction ID ranges to track row visibility. So there'll be multiple copies of a tuple in a table, but any given transaction can only "see" one of them. Each transaction has a unique ID, with newer transactions having newer IDs. Each tuple has hidden xmin and xmax fields that track which transactions can "see" the tuple. Insertion is implemented by setting the tuple's xmin so that transactions with lower xids know to ignore the tuple when reading the table. Deletion is implemented by setting the tuple's xmax so that transactions with higher xids know to ignore the tuple when reading the table. Updates are implemented by effectively deleting the old tuple (setting xmax) then inserting a new one (setting xmin) so that old transactions still see the old tuple, but new transactions see the new tuple. When no running transaction can still see a deleted tuple, vacuum marks it as free space to be overwritten.
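A rough sketch of that visibility rule, as a toy model only. Real PostgreSQL also checks whether the inserting and deleting transactions committed and which transactions were in progress when the snapshot was taken; all of that is left out here, and the names `TupleVersion`, `visible`, `update`, `scan` and `vacuum` are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    value: str
    xmin: int                   # xid of the transaction that inserted this version
    xmax: Optional[int] = None  # xid of the transaction that deleted it, if any


def visible(t, xid):
    # Simplified rule from the text: ignore tuples inserted by a later
    # transaction, and tuples deleted by this or an earlier transaction.
    return t.xmin <= xid and (t.xmax is None or xid < t.xmax)


def update(heap, old, new_value, xid):
    # UPDATE = mark the old version deleted, then insert a new version.
    old.xmax = xid
    heap.append(TupleVersion(new_value, xmin=xid))


def scan(heap, xid):
    return [t.value for t in heap if visible(t, xid)]


def vacuum(heap, oldest_running_xid):
    # Keep a deleted version only while some running transaction can see it.
    return [t for t in heap if t.xmax is None or t.xmax > oldest_running_xid]


heap = [TupleVersion("balance=10", xmin=100)]
update(heap, heap[0], "balance=20", xid=105)

old_reader = scan(heap, 102)   # started before the update
new_reader = scan(heap, 110)   # started after the update
```

Both versions sit in the table at once; which one a reader gets depends only on its own transaction ID, which is why the long-running fetch in the question is not disturbed by the later update.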
Oracle uses undo and redo logs, so there's only one copy of the tuple in the main table, and transactions that want old versions have to go and find them in the logs. Like PostgreSQL it uses row versioning, though I'm less sure of the details.
Pretty much every other DB uses a similar approach these days. Those that used to rely on locking, like older MS-SQL versions, have moved to MVCC.
MySQL uses MVCC with InnoDB tables, which are now the default. MyISAM tables still rely on table locking (but they'll also eat your data, so don't use them for anything you care about).
A few embedded DBs, like SQLite, still rely on coarse locking (SQLite locks the whole database file for writes, although its optional WAL mode lets readers run alongside a single writer) - which tends to require less wasted disk space and I/O overhead, at the cost of greatly reduced concurrency. Some databases let you bypass MVCC if you take an exclusive lock on a table.
(Marked community wiki, since I also close-voted this question).
You should also read the PostgreSQL docs on transaction isolation and locking, and similar documentation for other DBs you use. See the Wikipedia article on isolation.

Snapshot isolation solves the problem you are describing. If you use locking instead, you can see the same record twice as you iterate through the database, because unlocked records can change underneath your feet while you're doing the scan.
Read committed isolation level with locking suffers from this problem.
Depending on the granularity of the lock, the WHERE predicate may lock the matching pages and tuples so that the running read query doesn't see phantom data appearing (phantom reads).
I implemented multiversion concurrency control in my Java project. A transaction is given a monotonically increasing timestamp which starts at 0 and goes up by 1 each time the transaction is aborted. Later transactions have higher timestamps. When a transaction goes to read, it can only see data that has a timestamp that is less than or equal to itself and is committed for that key (or column of that tuple). (Equal to so it can see its own writes)
When a transaction writes, it updates the committed timestamp for that key to that transaction's timestamp.
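The read rule described above could be sketched like this. It is written in Python rather than Java, it is not the code from that project, and it makes two simplifying assumptions: every `begin()` hands out the next timestamp, and every write counts as committed immediately. `VersionedStore` and its methods are hypothetical names.

```python
class VersionedStore:
    """Toy multiversion key-value store (illustrative, single-threaded)."""

    def __init__(self):
        self._versions = {}   # key -> list of (commit_timestamp, value)
        self._next_ts = 0

    def begin(self):
        # Assumption: later transactions simply get higher timestamps.
        ts = self._next_ts
        self._next_ts += 1
        return ts

    def write(self, ts, key, value):
        # Assumption: the write is committed as soon as it is made.
        self._versions.setdefault(key, []).append((ts, value))

    def read(self, ts, key):
        # Newest version whose timestamp is <= the reader's timestamp.
        # "<=" (not "<") lets a transaction see its own writes.
        best = None
        for commit_ts, value in self._versions.get(key, []):
            if commit_ts <= ts and (best is None or commit_ts >= best[0]):
                best = (commit_ts, value)
        return None if best is None else best[1]


store = VersionedStore()
setup = store.begin()
store.write(setup, "row", "old")

reader = store.begin()          # long-running reader starts here
writer = store.begin()
store.write(writer, "row", "new")

seen_by_reader = store.read(reader, "row")
seen_by_writer = store.read(writer, "row")
seen_by_later = store.read(store.begin(), "row")
```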

Related

Can I INSERT into table while UPDATING multiple different rows with MariaDB or MySQL?

I am creating a custom analytics system and currently in the database designing process. I'm planning to use MariaDB with the InnoDB engine to be able to handle big loads.
The data I'm expecting could be around 500k clicks/day. I will need to insert these rows into the database, which means that I'll have around 5.8 inserts/sec on average. However, at the same time, I want to record if someone visited a page associated with that click. (basically to record funnels)
So what I'm planning to do is to create additional columns and search for the ID of the specific row then update that column with the exact time of the visit.
My first question: is this generally a recommended approach to design the database like that? If not, how else is it worth to design the database?
My only concern is that while rows are being updated the table will be locked and inserts will be blocked, slowing down the user experience.
My second question: is this something I should worry about, that the table gets locked while updating, and thus slowing down inserts? Does it hurt performance?
InnoDB doesn't lock the table for insert if you're performing the update. Your users won't experience any weird hanging.
It's an MVCC compliant engine, designed to handle concurrent access to underlying tables.
You can control the engine's behavior by choosing an appropriate isolation level, however the default (REPEATABLE READ) is excellent and does the job more than well.
If a table is being modified by multiple users (not users that connect to your site but connections established towards MySQL via a scripting language or some other service) and there's many inserts/updates/deletes - MySQL can throw an error saying a deadlock occurred.
A deadlock is reported as an error (error 1213), but a recoverable one: two or more transactions each hold a lock that another one needs (for example two threads updating the same rows in opposite order), so InnoDB rolls one of them back to let the others proceed. It's an indication you should repeat the transaction.
I'm suggesting that you take care of all possible scenarios in the language of your choice when it comes to handling MySQL that's under heavier I/O.
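One way to handle the deadlock case in application code is a small retry wrapper. This is only a sketch: `DeadlockError` is a stand-in for whatever exception your driver raises for MySQL error 1213, and `run_with_retry` is a made-up helper, not part of any connector.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver exception for MySQL error 1213."""


def run_with_retry(transaction, max_attempts=3, base_delay=0.01):
    """Run `transaction` (a callable that does BEGIN ... COMMIT) and retry
    it from the start if the database picked it as the deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            # Small randomised pause so the competing transactions
            # are less likely to collide again in lockstep.
            time.sleep(base_delay * attempt * random.random())


# Simulated transaction that is rolled back twice before succeeding.
calls = []

def flaky_transaction():
    calls.append(1)
    if len(calls) < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"


result = run_with_retry(flaky_transaction, base_delay=0)
```

The whole transaction is retried, not just the last statement, because InnoDB rolls back everything the victim transaction did.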
~6 inserts a second isn't a lot, make sure you're allowing MySQL to access sufficient system resources. For InnoDB, check the value of innodb_buffer_pool_size or google a bit to see what it is and how to use it to make your database run fast.
Good luck!
At a mere 5.8/second, there won't be much problem.
I do, however, suggest vertical partitioning for "Likes", "Upvotes", "Clicks", and similar things. These tend to have a lot of UPDATEs of random single rows, and may interfere with other activity.
That is, have a separate table with (perhaps) just 2 columns:
The id of the item being Liked/Clicked/etc.
A counter.
It is simple enough (and fast enough) to JOIN via that id when you want to display info including the counter.
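A small sketch of that two-table layout. It uses Python's built-in sqlite3 module with an in-memory database purely so the example runs anywhere; the same DDL idea carries over to MariaDB/InnoDB. The table and column names (`item`, `item_clicks`, `clicks`) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Wide, rarely-updated data lives in the main table.
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- The hot counter lives in its own narrow table.
    CREATE TABLE item_clicks (
        item_id INTEGER PRIMARY KEY REFERENCES item(id),
        clicks  INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO item (id, title) VALUES (1, 'landing page'), (2, 'pricing');
    INSERT INTO item_clicks (item_id, clicks) VALUES (1, 0), (2, 0);
""")


def record_click(item_id):
    # Single-row UPDATE on the narrow table; the main table is untouched.
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE item_clicks SET clicks = clicks + 1 WHERE item_id = ?",
            (item_id,),
        )


for _ in range(3):
    record_click(1)
record_click(2)

# JOIN via the id when the counter has to be displayed.
rows = conn.execute("""
    SELECT i.title, c.clicks
    FROM item AS i
    JOIN item_clicks AS c ON c.item_id = i.id
    ORDER BY i.id
""").fetchall()
```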
As already pointed out, the row is locked, not the table.

How does MySQL handle concurrent inserts?

I know there is one issue in MySQL with concurrent SELECT and INSERT. However, my question is: if I open up two connections to MySQL and keep loading data using both of them, does MySQL take the data concurrently, or does it wait for one load to finish before starting the other?
I’d like to know how MySQL behaves in both cases. Like when I am trying to load data in the same table or different tables concurrently when opening separate connections.
If you create a second connection to the database and perform inserts over both connections, then from the database's perspective it will still be sequential.
The documentation of Concurrent Inserts for MyISAM on the MySQL's documentation page says something like this:
If MyISAM storage is used and table has no holes, multiple INSERT statements are queued and performed in sequence, concurrently with the SELECT statements.
Mind that there is no control over the order in which two concurrent inserts will take place. The order in this concurrency is at the mercy of a lot of different factors. To ensure order, by default you will have to sacrifice concurrency.
MySQL does support parallel data inserts into the same table.
But the approach to concurrent reads/writes depends on the storage engine you use.
InnoDB
MySQL uses row-level locking for InnoDB tables to support simultaneous write access by multiple sessions, making them suitable for multi-user, highly concurrent, and OLTP applications.
MyISAM
MySQL uses table-level locking for MyISAM, MEMORY, and MERGE tables, allowing only one session to update those tables at a time, making them more suitable for read-only, read-mostly, or single-user applications
But, the above mentioned behavior of MyISAM tables can be altered by concurrent_insert system variable in order to achieve concurrent write. Kindly refer to this link for details.
Hence, as a matter of fact, MySQL does support concurrent insert for InnoDB and MyISAM storage engine.
You ask about deadlock detection, ACID and particularly MVCC, locking and transactions:
Deadlock Detection and Rollback
InnoDB automatically detects transaction deadlocks and rolls back a
transaction or transactions to break the deadlock. InnoDB tries to
pick small transactions to roll back, where the size of a transaction
is determined by the number of rows inserted, updated, or deleted.
When InnoDB performs a complete rollback of a transaction, all locks
set by the transaction are released. However, if just a single SQL
statement is rolled back as a result of an error, some of the locks
set by the statement may be preserved. This happens because InnoDB
stores row locks in a format such that it cannot know afterward which
lock was set by which statement.
https://dev.mysql.com/doc/refman/5.6/en/innodb-deadlock-detection.html
Locking
The system of protecting a transaction from seeing or changing data
that is being queried or changed by other transactions. The locking
strategy must balance reliability and consistency of database
operations (the principles of the ACID philosophy) against the
performance needed for good concurrency. Fine-tuning the locking
strategy often involves choosing an isolation level and ensuring all
your database operations are safe and reliable for that isolation
level.
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_locking
ACID
An acronym standing for atomicity, consistency, isolation, and
durability. These properties are all desirable in a database system,
and are all closely tied to the notion of a transaction. The
transactional features of InnoDB adhere to the ACID principles.
Transactions are atomic units of work that can be committed or rolled
back. When a transaction makes multiple changes to the database,
either all the changes succeed when the transaction is committed, or
all the changes are undone when the transaction is rolled back. The
database remains in a consistent state at all times -- after each
commit or rollback, and while transactions are in progress. If related
data is being updated across multiple tables, queries see either all
old values or all new values, not a mix of old and new values.
Transactions are protected (isolated) from each other while they are
in progress; they cannot interfere with each other or see each other's
uncommitted data. This isolation is achieved through the locking
mechanism. Experienced users can adjust the isolation level, trading
off less protection in favor of increased performance and concurrency,
when they can be sure that the transactions really do not interfere
with each other.
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_acid
MVCC
InnoDB is a multiversion concurrency control (MVCC) storage engine
which means many versions of the single row can exist at the same
time. In fact there can be a huge amount of such row versions.
Depending on the isolation mode you have chosen, InnoDB might have to
keep all row versions going back to the earliest active read view, but
at the very least it will have to keep all versions going back to the
start of SELECT query which is currently running
https://www.percona.com/blog/2014/12/17/innodbs-multi-versioning-handling-can-be-achilles-heel/
It depends.
It depends on the client -- some clients allow concurrent access; some will serialize access, thereby losing the expected gain. You have not even specified PHP vs Java vs ... or Apache vs ... or Windows vs ... Many combinations simply do not provide any parallelism.
If different tables, there is only general contention for I/O, CPU, Mutexes on the buffer_pool, etc. A reasonable amount of parallelism is possible.
If same table, it depends on the indexes and access patterns. In some cases the threads will block each other. In some cases it will even "deadlock" and rollback one of the transactions. Deadlocks not only slow you down, but make you retry the inserts.
If you are looking for high speed ingestion of a lot of rows, see my blog. It lays out techniques, and points out several of the ramifications, such as replication, Engine choice, multi-threading.
Multiple threads inserting into the same tables -- It depends a lot on the values you are providing for any PRIMARY or UNIQUE keys. It depends on whether other actions are taken in the same transaction. It depends on how much I/O is involved. It depends on whether you are doing single-row inserts, or batching. It depends on ... (Sorry to be vague, but your question is not very specific.)
If you would like to present specifics on two or three designs, we can discuss the specifics.

InnoDB Isolation Level for single SELECT query

I know that every single query sent to MySQL (with InnoDB as engine) is run as a separate transaction. However, my concern is about the default isolation level (Repeatable Read).
My question is: as SELECT queries are sent one by one, what is the need to run the transaction in repeatable read? In this case, doesn't InnoDB add overhead for nothing?
For instance, in my Web Application, I have a lot of single read queries where accuracy doesn't matter: as an example, I can retrieve the number of books at a given time, even if some modifications are being processed, because I know full well that such a number can change after my HTTP request.
In this case READ UNCOMMITTED seems appropriate. Do I need to switch every such single-request transaction to that isolation level, or does InnoDB handle it automatically?
Thanks.
First of all, your question is part of the wider topic of performance tuning. It is hard to answer knowing only this much, but I will try to give you at least an overview.
The fact that Repeatable Read is good enough for most databases does not mean it is also best for you! That's wholly true!
BTW, I think only MySQL uses this level by default. Most databases default to Read Committed (e.g. Oracle). In my opinion that is enough for most cases.
My question is: as SELECT queries are sent one by one, what is the need
to run the transaction in repeatable read?
Basically there is no need. REPEATABLE READ protects a transaction against dirty reads and non-repeatable reads (and, in InnoDB, phantom rows, though that is a slightly different story). Those anomalies only matter when a transaction reads the same data more than once or mixes reads with DML. When you only run standalone SELECTs one by one, none of this applies.
In this case, doesn't InnoDB add overhead for nothing?
Not for nothing. The ACID model in InnoDB gives you data that is stored consistently, without doubts about its reliability, and that is not free of charge. It is simply a trade-off between performance on one side and data consistency and reliability on the other.
In more detail, MySQL uses special segments to store snapshots and old row values for rollback purposes, and refers to them when necessary. As I said, it costs.
It is also worth mentioning that the performance difference is much more visible with INSERT, UPDATE and DELETE. SELECT does not cost as much. But still.
If you do not need it, dropping it is in theory an obvious benefit. How big? You need to assess that yourself by measuring your query performance in your environment.
Much also depends on individual factors, including scale, how many reads/writes there are, how often they happen, application design, the database and much, much more.
And for the same problem in different environments the answer could simply be different.
Another alternative you could consider is to simply change the engine to MyISAM (if you do not need foreign keys, for example). In my experience it is a very good choice for read-heavy workloads. Again it all depends, but in many cases it is faster than InnoDB. Of course it is less safe, but if you are aware of the possible threats it is a good solution.
In this case READ UNCOMMITTED seems appropriate. Do I need to switch
every such single-request transaction to that isolation level,
or does InnoDB handle it automatically?
You can set the isolation level globally, for the current session, or for the next transaction.
Set your transaction level globally for subsequent sessions (on MySQL 8.0 the variable is named transaction_isolation; tx_isolation was removed there).
SET GLOBAL tx_isolation = 'READ-UNCOMMITTED';
http://dev.mysql.com/doc/refman/5.0/en/set-transaction.html

Is there ever a reason to use a database transaction for read only sql statements?

As the question says, is there ever a reason to wrap read-only sql statements in a transaction? Obviously updates require transactions.
You still need a read-lock on the objects you operate on. You want to have consistent reads, so writing the same records shouldn't be possible while you're reading them...
If you issue several SELECT statements in a single transaction, you will also produce several read-locks.
SQL Server has some good documentation on this (the "read-lock" is called shared lock, there):
http://msdn.microsoft.com/en-us/library/aa213039%28v=sql.80%29.aspx
I'm sure MySQL works in similar ways
Yes, if it's important that the data is consistent across the select statements run. For instance if you were getting the balance of several bank accounts for a user, you wouldn't want the balance values read to be inconsistent. Eg if this happened:
With balance values B1=10 and B2=20:
1. Your code reads B1 = 10.
2. Transaction TA1 starts on another DB client.
3. TA1 writes B1 to 20, B2 to 10.
4. TA1 commits.
5. Your code reads B2 = 10.
So you now think that B1 is 10 and B2 is 10, which could be displayed to the user and that says that $10 has disappeared!
Transactions for reading will prevent this, since we would read B2 as 20 in step 5 (assuming a multiversioning concurrency control DB, which mysql+innodb is).
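The difference can be simulated without a real database. The sketch below is a toy model, not how InnoDB is implemented: `Database`, `begin_snapshot` and the two `total_*` functions are invented names, and a "snapshot" is just a reference to the committed state at the moment the reading transaction starts.

```python
class Database:
    """Toy store: every commit installs a brand-new dict of balances."""

    def __init__(self, balances):
        self.committed = dict(balances)

    def commit(self, changes):
        # Build a new dict instead of mutating, so older snapshots stay intact.
        self.committed = {**self.committed, **changes}

    def begin_snapshot(self):
        # A reading transaction keeps the state as of its start.
        return self.committed


def total_without_transaction(db, concurrent_commit):
    b1 = db.committed["B1"]   # step 1: read B1
    concurrent_commit()       # steps 2-4: TA1 writes and commits
    b2 = db.committed["B2"]   # step 5: read B2 from the new state
    return b1 + b2


def total_in_transaction(db, concurrent_commit):
    snapshot = db.begin_snapshot()
    b1 = snapshot["B1"]
    concurrent_commit()       # TA1 commits, but our snapshot is unchanged
    b2 = snapshot["B2"]
    return b1 + b2


def swap_balances(db):
    return lambda: db.commit({"B1": 20, "B2": 10})


db = Database({"B1": 10, "B2": 20})
torn_total = total_without_transaction(db, swap_balances(db))

db = Database({"B1": 10, "B2": 20})
consistent_total = total_in_transaction(db, swap_balances(db))
```

Here `torn_total` comes out as 20 (the missing $10 from the example) while `consistent_total` is 30.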
MySQL 5.1 with the InnoDB engine has a default transaction isolation level of REPEATABLE READ. So if you perform your SELECTs inside a transaction, no dirty reads or non-repeatable reads can happen. That means that even with transactions committing between two of your queries you'll always get a consistent view of the database. In theory, under REPEATABLE READ you would only have to fear phantom reads, but with InnoDB this cannot even occur. So by simply opening a transaction you can assume database consistency (coherence) and perform as many SELECTs as you want without fearing write transactions running and finishing in parallel.
Do you have any interest in having such a big consistency constraint? Well, it depends on what you're doing with your queries. Having inconsistent reads means that if one of your queries is based on a result from a previous one you may have problems:
if you're performing only one query you do not care, at all
if none of your queries assumes a result from a previous one, do not care
if you never re-read a record in the same session, same thing
if you always read dependencies of your main record in the same query and do not use lazy loading, no problem
if a small inconsistency between your first and last query will not break your code, then forget about it. But be careful, this can produce an application bug that is very hard to debug (and hard to reproduce). So write robust application code, something which can handle database errors and crash nicely (or not even crash) when this occurs (2 times in one year?).
if you show critical data (I mean bank accounts and not blogs or chats), then you should maybe care about it
if you have a lot of write operations, then you increase the risk of inconsistent reads, and you may need to add transactions at least at some key points
you may need to test the impact on performance: having all read requests in transactions while several write transactions are really altering the data certainly slows the engine, since it needs to handle several versions of the data. So you should check that the impact is not too big for your application

Locking mySQL tables/rows

can someone explain the need to lock tables and/or rows in mysql?
I am assuming that it to prevent multiple writes to the same field, is this the best practise?
First let's look at a good document. It is not MySQL documentation, it's about PostgreSQL, but it's one of the simplest and clearest docs I've read on transactions. You'll understand MySQL transactions better after reading this link: http://www.postgresql.org/docs/8.4/static/mvcc.html
When you're running a transaction 4 rules are applied (ACID):
Atomicity: all or nothing (rollback)
Consistency: consistent before, consistent after
Isolation: not impacted by others?
Durability: commit, if it's done, it's really done
Among these rules only one is problematic: Isolation. Using a transaction does not ensure a perfect isolation level. The previous link explains phantom reads and similar isolation problems between concurrent transactions better than I can. But to make it simple, you should really use row-level locks to prevent other transactions, running at the same time as you (and maybe committing before you), from altering the same records. But with locks come deadlocks...
Then when you'll try using nice transactions with locks you'll need to handle deadlocks and you'll need to handle the fact that transaction can fail and should be re-launched (simple for or while loops).
Edit:------------
Recent versions of InnoDB provide greater levels of isolation than previous ones. I've done some tests and I must admit that even the phantom reads that should happen are now difficult to reproduce.
MySQL is on level 3 by default of the 4 levels of isolation explained in the PostgreSQL document (whereas PostgreSQL is on level 2 by default). This is REPEATABLE READ. That means you won't have dirty reads and you won't have non-repeatable reads: a plain SELECT inside your transaction reads from a consistent snapshot, so a row modified by someone else still looks unchanged to you. Note that this snapshot read takes no lock; to actually block concurrent writers you still need SELECT ... FOR UPDATE or LOCK IN SHARE MODE.
Warning: If you work with an older version of MySQL like 5.0 you're maybe in level 2, you'll need to perform the row lock using the 'FOR UPDATE' words!
We can always find some nice race conditions, working with aggregate queries it could be safer to be in the 4th level of isolation (by using LOCK IN SHARE MODE at the end of your query) if you do not want people adding rows while you're performing some tasks. I've been able to reproduce one serializable level problem but I won't explain here the complex example, really tricky race conditions.
There is a very nice example of race conditions that even serializable level cannot fix here : http://www.postgresql.org/docs/8.4/static/transaction-iso.html#MVCC-SERIALIZABILITY
When working with transactions the more important things are:
data used in your transaction must always be read INSIDE the transaction (re-read it if you had data from before the BEGIN)
understand why the high isolation level set implicit locks and may block some other queries ( and make them timeout)
try to avoid deadlocks (try to lock tables in the same order) but handle them (retry a transaction aborted by MySQL)
try to freeze important source tables with the serializable isolation level (LOCK IN SHARE MODE) when your application code assumes that no insert or update should modify the dataset it's using (if not, you will not have problems but your result will have ignored the concurrent changes)
It is not a best practice. Modern versions of MySQL support transactions with well defined semantics. Use transactions, and forget about locking stuff by hand.
The only new thing you'll have to deal with is that transaction commits may fail because of race conditions, but you'd be doing error checking with locks anyway, and it is easier to retry the logic that led to a transaction failure than to recover from errors in a non-transactional setup.
If you do get race conditions and failed commits, then you may want to fine-tune the isolation configuration for your transactions.
For example if you need to generate invoice numbers which are sequential and have no numbers missing - this is a requirement at least in the country I live in.
If you have a few web servers, then a few users might be buying stuff literally at the same time.
If you do select max(invoice_id)+1 from invoice to get the new invoice number, two web servers might do that at the same time (before the new invoice has been added), and get the same invoice number for the invoices they're trying to create.
If you use a mechanism such as "auto_increment", this is just meant to generate unique values, and makes no guarantees about not missing out numbers (if one transaction tries to insert a row, then does a rollback, the number is "lost"),
So the solution is to (a) lock the table (b) select max(invoice_id)+1 from invoice (c) do the insert (d) commit + unlock the table.
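The (a) to (d) sequence can be illustrated with an in-process model, where a `threading.Lock` plays the role of the table lock and a Python list plays the role of the invoice table. This is only an analogy for the database behaviour, and `InvoiceTable`, `create_invoice` and `web_server` are hypothetical names.

```python
import threading


class InvoiceTable:
    """Toy invoice table guarded by one lock (stand-in for LOCK TABLES)."""

    def __init__(self):
        self.rows = []                       # list of (invoice_id, customer)
        self._table_lock = threading.Lock()

    def create_invoice(self, customer):
        with self._table_lock:                                # (a) lock
            next_id = max(
                (invoice_id for invoice_id, _ in self.rows), default=0
            ) + 1                                             # (b) max + 1
            self.rows.append((next_id, customer))             # (c) insert
        return next_id                                        # (d) unlocked


table = InvoiceTable()


def web_server(name):
    # Each "web server" creates 25 invoices as fast as it can.
    for _ in range(25):
        table.create_invoice(name)


threads = [
    threading.Thread(target=web_server, args=(f"server-{i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

invoice_ids = sorted(invoice_id for invoice_id, _ in table.rows)
```

Without the lock, two threads could both compute the same `max + 1` before either inserted, which is exactly the duplicate invoice number described above.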
On another note, in MySQL you're best using InnoDB and using row-level locking. Doing a lock table command can implicitly commit the transaction you're working on.
Take a look here for general introduction to what transactions are and how to use them.
Databases are designed to work in concurrent environments, so locking the tables and/or records helps to keep the transactions consistent.
So a record affected by one transaction should not be altered until this transaction commits or rolls back.