I know there is an issue in MySQL with concurrent SELECT and INSERT. However, my question is: if I open two connections to MySQL and keep loading data through both of them, does MySQL load the data concurrently, or does it wait for one load to finish before starting the other?
I'd like to know how MySQL behaves in both cases: loading data into the same table and loading data into different tables, concurrently over separate connections.
If you create a new connection to the database and perform inserts over both connections, then from the database's perspective the inserts will still be sequential.
The MySQL documentation on Concurrent Inserts for MyISAM says something like this:
If MyISAM storage is used and the table has no holes, multiple INSERT statements are queued and performed in sequence, concurrently with the SELECT statements.
Mind that there is no control over the order in which two concurrent inserts take place; the order is at the mercy of many different factors. To guarantee an order, you will have to sacrifice concurrency.
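For what it's worth, you can check whether a MyISAM table currently has such holes; logs is just an example table name here.

    -- A non-zero Data_free in the output means deleted rows have left
    -- gaps in the data file, which blocks concurrent inserts under the
    -- default setting.
    SHOW TABLE STATUS LIKE 'logs';

    -- Defragmenting removes the holes (note: this takes a table lock).
    OPTIMIZE TABLE logs;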
MySQL does support parallel data inserts into the same table.
But the approach to concurrent reads and writes depends on the storage engine you use.
InnoDB
MySQL uses row-level locking for InnoDB tables to support simultaneous write access by multiple sessions, making them suitable for multi-user, highly concurrent, and OLTP applications.
MyISAM
MySQL uses table-level locking for MyISAM, MEMORY, and MERGE tables, allowing only one session to update those tables at a time, making them more suitable for read-only, read-mostly, or single-user applications.
However, this behavior of MyISAM tables can be altered with the concurrent_insert system variable in order to achieve concurrent writes. Kindly refer to this link for details.
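As a sketch, checking and changing the variable looks like this; 2 tells MyISAM to allow concurrent inserts even when the table has holes (MySQL 5.5 and later also accept the names NEVER/AUTO/ALWAYS).

    SHOW VARIABLES LIKE 'concurrent_insert';

    -- 0: never allow concurrent inserts
    -- 1: allow them only for tables without holes (the default)
    -- 2: allow them even for tables with holes
    SET GLOBAL concurrent_insert = 2;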
Hence, as a matter of fact, MySQL does support concurrent inserts for both the InnoDB and MyISAM storage engines.
You ask about deadlock detection, ACID and particularly MVCC, locking and transactions:
Deadlock Detection and Rollback
InnoDB automatically detects transaction deadlocks and rolls back a transaction or transactions to break the deadlock. InnoDB tries to pick small transactions to roll back, where the size of a transaction is determined by the number of rows inserted, updated, or deleted.

When InnoDB performs a complete rollback of a transaction, all locks set by the transaction are released. However, if just a single SQL statement is rolled back as a result of an error, some of the locks set by the statement may be preserved. This happens because InnoDB stores row locks in a format such that it cannot know afterward which lock was set by which statement.
https://dev.mysql.com/doc/refman/5.6/en/innodb-deadlock-detection.html
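As an illustration, the classic way to trigger this detection is two sessions that lock the same rows in opposite order. The accounts table and the values here are hypothetical, and the steps are meant to be run from two separate connections.

    -- Hypothetical schema: accounts(id INT PRIMARY KEY, balance INT), InnoDB.

    -- Session A
    START TRANSACTION;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

    -- Session B
    START TRANSACTION;
    UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2

    -- Session A (blocks, waiting for row 2)
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;

    -- Session B (would wait for row 1, completing the cycle; InnoDB rolls
    -- one transaction back with ERROR 1213 (40001): Deadlock found when
    -- trying to get lock; try restarting transaction)
    UPDATE accounts SET balance = balance + 10 WHERE id = 1;

    -- The survivor can COMMIT; the victim should simply retry its transaction.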
Locking
The system of protecting a transaction from seeing or changing data that is being queried or changed by other transactions. The locking strategy must balance reliability and consistency of database operations (the principles of the ACID philosophy) against the performance needed for good concurrency. Fine-tuning the locking strategy often involves choosing an isolation level and ensuring all your database operations are safe and reliable for that isolation level.
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_locking
ACID
An acronym standing for atomicity, consistency, isolation, and durability. These properties are all desirable in a database system, and are all closely tied to the notion of a transaction. The transactional features of InnoDB adhere to the ACID principles.

Transactions are atomic units of work that can be committed or rolled back. When a transaction makes multiple changes to the database, either all the changes succeed when the transaction is committed, or all the changes are undone when the transaction is rolled back. The database remains in a consistent state at all times -- after each commit or rollback, and while transactions are in progress. If related data is being updated across multiple tables, queries see either all old values or all new values, not a mix of old and new values.

Transactions are protected (isolated) from each other while they are in progress; they cannot interfere with each other or see each other's uncommitted data. This isolation is achieved through the locking mechanism. Experienced users can adjust the isolation level, trading off less protection in favor of increased performance and concurrency, when they can be sure that the transactions really do not interfere with each other.
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_acid
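To make the atomicity part concrete, here is a minimal sketch; orders and order_items are made-up tables, not anything from the manual.

    START TRANSACTION;
    INSERT INTO orders (id, customer_id) VALUES (42, 7);
    INSERT INTO order_items (order_id, sku, qty) VALUES (42, 'ABC-1', 3);
    COMMIT;      -- both rows become visible together
    -- ROLLBACK; -- ...or, on error, neither change is kept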
MVCC
InnoDB is a multiversion concurrency control (MVCC) storage engine which means many versions of the single row can exist at the same time. In fact there can be a huge amount of such row versions. Depending on the isolation mode you have chosen, InnoDB might have to keep all row versions going back to the earliest active read view, but at the very least it will have to keep all versions going back to the start of the SELECT query which is currently running.
https://www.percona.com/blog/2014/12/17/innodbs-multi-versioning-handling-can-be-achilles-heel/
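A quick way to watch this multi-versioning at work under InnoDB's default REPEATABLE READ level; accounts is a hypothetical table, and the two sessions are separate connections.

    -- Session A: the first read establishes a consistent snapshot.
    START TRANSACTION;
    SELECT balance FROM accounts WHERE id = 1;   -- say it returns 100

    -- Session B: modify and commit the same row (autocommit).
    UPDATE accounts SET balance = 0 WHERE id = 1;

    -- Session A: still sees the old version, rebuilt from the undo log.
    SELECT balance FROM accounts WHERE id = 1;   -- still 100
    COMMIT;

    -- A new transaction in session A now sees the committed value 0.
    SELECT balance FROM accounts WHERE id = 1;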
It depends.
It depends on the client -- some clients allow concurrent access; some will serialize access, thereby losing the expected gain. You have not even specified PHP vs Java vs ... or Apache vs ... or Windows vs ... Many combinations simply do not provide any parallelism.
If different tables, there is only general contention for I/O, CPU, mutexes on the buffer pool, etc. A reasonable amount of parallelism is possible.
If same table, it depends on the indexes and access patterns. In some cases the threads will block each other. In some cases it will even "deadlock" and roll back one of the transactions. Deadlocks not only slow you down, but also force you to retry the inserts.
If you are looking for high-speed ingestion of a lot of rows, see my blog. It lays out techniques, and points out several of the ramifications, such as replication, engine choice, and multi-threading.
Multiple threads inserting into the same tables -- it depends a lot on the values you are providing for any PRIMARY or UNIQUE keys. It depends on whether other actions are taken in the same transaction. It depends on how much I/O is involved. It depends on whether you are doing single-row inserts or batching. It depends on ... (Sorry to be vague, but your question is not very specific.)
If you would like to present specifics on two or three designs, we can discuss the specifics.
Related
I was just wondering how most relational databases handle maintaining your set of results if another query has edited the rows that you were working on. For instance, suppose I do a SELECT of around 100k rows, and while I am still fetching them, another query comes in and updates one of the rows that hasn't been read yet. The update is not going to be seen while fetching those rows, and I was wondering how the database engine handles that. If you only have specifics for one type of database, that's fine; I would like to hear it anyway.
Please look up Multi Version Concurrency Control. Different databases have different approaches to managing this. For MySQL/InnoDB, you can try http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html. PostgreSQL - https://wiki.postgresql.org/wiki/MVCC. A great presentation here - http://momjian.us/main/writings/pgsql/mvcc.pdf. It is also explained on Stack Overflow in the thread "Database: What is Multiversion Concurrency Control (MVCC) and who supports it?"
The general goal you are describing in concurrent programming (Wikipedia concurrency control) is serialization (Wikipedia serializability): an implementation manages the database as if transactions occurred without overlap in some order.
The importance of that is that only then does the system act in the way described by the code as we normally interpret it. Otherwise the results of operations are a combination of all processes acting concurrently. Nevertheless, by allowing limited categories of non-normal, non-isolated, so-called anomalous behaviours to arise, transaction throughput can be increased. So those implementation techniques are also apropos (e.g. MVCC). But understand: such non-serialized behaviour does not isolate one transaction from another. (I.e. the so-called "isolation" levels are actually non-isolation levels.)
Isolation is managed by breaking transaction implementations into pieces based on reading and writing shared resources and executing them interleaved with pieces from other transactions in such a way that the effect is the same as some sequence of non-overlapped executions. Roughly speaking, one can "pessimistically" "lock" other processes out of changed resources and make them wait, or "optimistically" "version" the database and "roll back" (throw away) some processes' work when changes are irreconcilable (unserializable).
Some techniques based on an understanding of serializability by an implementer for a major product are in this answer. For relevant notions and techniques, see the Wikipedia articles or a database textbook. Eg Fundamentals of database systems by Ramez Elmasri & Shamkant B. Navathe. (Many textbooks, slides and courses are free online.)
(Two answers and a comment on your question mention MVCC. As I commented, not only is MVCC just one implementation technique, it doesn't even support transaction serialization, i.e. actually isolating transactions as if each were done all at once. It allows certain kinds of errors (aka anomalies). It must be mixed with other techniques for full isolation. The MVCC answers, comments and upvoting reflect a confusion between a popular and valuable technique (which provides a useful but limited, non-isolating behaviour relevant to your question) and the actual core issues and means.)
As Jayadevan notes, the general principle used by most widely used databases that permit you to modify values while they're being read is multi-version concurrency control, or MVCC. All widely used modern RDBMS implementations that support reading rows that are being updated rely on some concept of row versioning.
The details differ between implementations. (I'll explain PostgreSQL's a little more here, but you should accept Jayadevan's answer not mine).
PostgreSQL uses transaction ID ranges to track row visibility. So there'll be multiple copies of a tuple in a table, but any given transaction can only "see" one of them. Each transaction has a unique ID, with newer transactions having newer IDs. Each tuple has hidden xmin and xmax fields that track which transactions can "see" the tuple. Insertion is implemented by setting the tuple's xmin so that transactions with lower xids know to ignore the tuple when reading the table. Deletion is implemented by setting the tuple's xmax so that transactions with higher xids know to ignore the tuple when reading the table. Updates are implemented by effectively deleting the old tuple (setting xmax) then inserting a new one (setting xmin) so that old transactions still see the old tuple, but new transactions see the new tuple. When no running transaction can still see a deleted tuple, vacuum marks it as free space to be overwritten.
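You can inspect these hidden columns yourself in PostgreSQL; t here is just a throwaway example table.

    -- xmin/xmax are system columns available on every PostgreSQL table.
    SELECT xmin, xmax, id, val FROM t;

    -- In a second session, update a row inside an open transaction:
    BEGIN;
    UPDATE t SET val = 'new' WHERE id = 1;
    SELECT xmin, xmax, id, val FROM t WHERE id = 1;  -- new version, new xmin
    COMMIT;

    -- Until that COMMIT, the first session keeps seeing the old tuple.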
Oracle uses undo and redo logs, so there's only one copy of the tuple in the main table, and transactions that want to find old versions have to go and find it in the logs. Like PostgreSQL it uses row versioning, though I'm less sure of the details.
Pretty much every other DB uses a similar approach these days. Those that used to rely on locking, like older MS-SQL versions, have moved to MVCC.
MySQL uses MVCC with InnoDB tables, which are now the default. MyISAM tables still rely on table locking (but they'll also eat your data, so don't use them for anything you care about).
A few embedded DBs, like SQLite, still rely only on table locking - which tends to require less wasted disk space and I/O overhead, at the cost of greatly reduced concurrency. Some databases let you bypass MVCC if you take an exclusive lock on a table.
(Marked community wiki, since I also close-voted this question).
You should also read the PostgreSQL docs on transaction isolation and locking, and similar documentation for other DBs you use. See the Wikipedia article on isolation.
Snapshot isolation solves the problem you are describing. If you use locking, you can see the same record twice as you iterate through the database, because the unlocked records change underneath your feet while you're doing the scan.
Read committed isolation level with locking suffers from this problem.
Depending on the granularity of the lock, the WHERE predicate may lock the matching pages and tuples so that the running read query doesn't see phantom data appearing (phantom reads).
I implemented multiversion concurrency control in my Java project. A transaction is given a monotonically increasing timestamp, which starts at 0 and goes up by 1 each time the transaction is aborted. Later transactions have higher timestamps. When a transaction goes to read, it can only see data that has a timestamp less than or equal to its own and that is committed for that key (or column of that tuple). (Equal to, so it can see its own writes.)
When a transaction writes, it updates the committed timestamp for that key to that transaction's timestamp.
When running queries with the MyISAM engine, because it's not transactional, long queries (as far as I understand) don't affect the data seen by other queries.
In InnoDB, one of the things it warns about is to avoid long queries. When InnoDB snapshots, is it snapshotting everything?
The reason I am asking this is: say a query for whatever reason takes a longer time than normal and eventually rolls back. Meanwhile, 200 other users have updated or inserted rows into the database. When the long query rolls back, does it also remove the updates/inserts that were made by the other users? or are the rows that involved the other users safe, unless they crossed over with the one that gets rolled back?
Firstly, I think that it would be useful as background to read up on multi-version concurrency control (MVCC) as a background to this answer.
InnoDB implements MVCC, which means it can use non-locking reads for regular SELECT. This does not require creating a "snapshot" and in fact InnoDB doesn't have any real concept of a snapshot as an object. Instead, every record in the database keeps track of its own version number and maintains a "roll pointer" to an "undo log" record (which may or may not still exist) which modifies the row to its previous version. If an older version of a record is needed, the current version is read and those roll pointers are followed and undo records applied until a sufficiently old version of the record is produced.
Normally the system is constantly cleaning up those undo logs and re-using the space they consume.
Any time any long-running transaction (note, not necessarily a single query) is present, the undo logs must be kept (not purged) in order to sufficiently recreate old enough versions of all records to satisfy that transaction. In a very busy system, those undo logs can very quickly accumulate to consume gigabytes of space. Additionally if specific individual records are very frequently modified, reverting that record to an old enough version to satisfy the query could take very many undo log applications (thousands).
That is what makes "long-running queries" expensive and frowned upon. They will increase disk space consumption for keeping the undo logs in the system tablespace, and they will perform poorly due to undo log record application to revert row versions upon read.
Some databases implement a maximum amount of undo log space that can be consumed, and once they have reached that limit they start throwing away older undo log records and invalidating running transactions. This generates a "snapshot too old" error message to the user. InnoDB has no such limit, and allows accumulation indefinitely.
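If you suspect this is happening on your server, two quick checks are sketched below; the INNODB_TRX table is available in MySQL 5.5 and later, and \G is just the mysql client's vertical-output terminator.

    -- Transactions that have been open the longest (and may be pinning
    -- old row versions in the undo logs):
    SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
      FROM information_schema.INNODB_TRX
     ORDER BY trx_started
     LIMIT 10;

    -- "History list length" in the TRANSACTIONS section is the count of
    -- undo log units that the purge thread has not yet been able to remove:
    SHOW ENGINE INNODB STATUS\G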
Whether your queries affect concurrency or not has to do with the types of queries. Having many read queries won't affect concurrency in MyISAM or InnoDB (besides performance issues).
Inserts (to the end of an index with InnoDB, or the end of a table with MyISAM) also don't impact concurrency.
However, as soon as you have an update query, rows get locked in InnoDB, and with MyISAM, it's the entire table that gets write locked. When you try to update a record (or table) that has a write lock, you must wait until the lock is released before you can proceed. In MyISAM, updates are served before reads, so you have to wait until the updates are processed.
MyISAM can be more performant because table locks are faster than record locks (though record locks are fast). However, when you start making a significant number of updates, InnoDB is generally preferred because different users are generally not likely to contend for the same records. So, with InnoDB, many users can work in parallel without affecting each other too much, thanks to the record level locking (rather than table locks).
Not to mention the benefit of full ACID compliance that you get with InnoDB, enforcement of foreign key constraints, and the speed of clustered indexes.
Snapshots (undo log entries) are kept long enough to complete the current transaction and are discarded once it is committed or rolled back. The longer a transaction runs, the more likely it is that other updates will occur, which grows the number of log entries required to roll back.
There will be no "cross-over" due to locking. When there is write contention for the same records, one user must wait until the other commits or rolls back.
You can read more about The InnoDB Transaction Model and Locking.
I am developing an application that will run from multiple computers. I want to lock MySQL tables so there won't be concurrency issues between processes, like one process writing while another process is reading at the same time, or, even worse, both processes simultaneously writing (updating) different values. MySQL provides locks, but the documentation says that we should avoid using locks with InnoDB. Read here. Please provide some advice on what to do in this situation. Thanks everyone.
InnoDB is a transactional storage engine with full ACID support. One of the properties of InnoDB is that it handles concurrent updates. How exactly depends on the isolation level, but generally InnoDB prevents two transactions from modifying the same row by locking the row. It does not lock the whole table, so other records can still be modified by other transactions.
If you set the isolation level to SERIALIZABLE, the application will behave as if there were no concurrency at all, while still allowing some concurrency.
The higher the isolation level, the less concurrency you get, but you still get more than if you lock the whole table.
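For example, rather than LOCK TABLES you can raise the isolation level for the session; stock is a hypothetical table here.

    SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    START TRANSACTION;
    -- Under SERIALIZABLE, plain SELECTs inside the transaction are run as
    -- locking reads, so other sessions cannot change this row until we
    -- commit or roll back.
    SELECT qty FROM stock WHERE item_id = 1;
    UPDATE stock SET qty = qty - 1 WHERE item_id = 1;
    COMMIT;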
After noticing that our database has become a major bottleneck on our live production systems, I decided to construct a simple benchmark to get to the bottom of the issue.
The benchmark: I time how long it takes to increment the same row in an InnoDB table 3000 times, where the row is indexed by its primary key, and the column being updated is not part of any index. I perform these 3000 updates using 20 concurrent clients running on a remote machine, each with its own separate connection to the DB.
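For concreteness, the workload has roughly the following shape (the table and column names here are illustrative, not my real schema); the ENGINE clause is swapped for each storage engine tested.

    CREATE TABLE counters (
        id   INT NOT NULL PRIMARY KEY,
        hits INT NOT NULL
    ) ENGINE=InnoDB;   -- also run with ENGINE=MyISAM and ENGINE=MEMORY

    -- Each of the 20 clients repeatedly runs this against the same row:
    UPDATE counters SET hits = hits + 1 WHERE id = 1;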
I'm interested in learning why the different storage engines I benchmarked, InnoDB, MyISAM, and MEMORY, have the profiles that they do. I'm also hoping to understand why InnoDB fares so poorly in comparison.
InnoDB (20 concurrent clients):
Each update takes 0.175s.
All updates are done after 6.68s.
MyISAM (20 concurrent clients):
Each update takes 0.003s.
All updates are done after 0.85s.
Memory (20 concurrent clients):
Each update takes 0.0019s.
All updates are done after 0.80s.
Thinking that the concurrency could be causing this behavior, I also benchmarked a single client doing 100 updates sequentially.
InnoDB:
Each update takes 0.0026s.
MyISAM:
Each update takes 0.0006s.
MEMORY:
Each update takes 0.0005s.
The actual machine is an Amazon RDS instance (http://aws.amazon.com/rds/) with mostly default configurations.
I'm guessing that the answer will be along the following lines: InnoDB fsyncs after each update (since each update is an ACID-compliant transaction), whereas MyISAM does not, since it doesn't even support transactions. MyISAM is probably performing all updates in memory and regularly flushing to disk, which is how its speed approaches that of the MEMORY storage engine. If this is so, is there a way to use InnoDB for its transaction support, but perhaps relax some constraints (via configuration) so that writes are done faster at the cost of some durability?
Also, any suggestions on how to improve InnoDB's performance as the number of clients increases? It is clearly scaling worse than the other storage engines.
Update
I found https://blogs.oracle.com/MySQL/entry/comparing_innodb_to_myisam_performance, which is precisely what I was looking for. Setting innodb-flush-log-at-trx-commit=2 allows us to relax ACID constraints (flushing to disk happens once per second) for the case where a power failure or server crash occurs. This gives us a similar behavior to MyISAM, but we still get to benefit from the transaction features available in InnoDB.
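For reference, the setting can be changed at runtime as well as in the configuration (on RDS this normally goes through the DB parameter group):

    -- 1 (default): write and fsync the redo log at every commit (full durability)
    -- 2          : write at every commit, but fsync roughly once per second
    -- 0          : write and fsync roughly once per second
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;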
Running the same benchmarks, we see a 10x improvement in write performance.
InnoDB (20 concurrent clients):
Each update takes 0.017s.
All updates are done after 0.98s.
Any other suggestions?
We have done some similar tests in our application and we noticed that if no transaction is explicitly opened, each single SQL instruction is treated inside a transaction, which takes much more time to execute. If your business logic allows, you can put several SQL commands inside a transaction block, reducing overall ACID overhead. In our case, we had great performance improvement with this approach.
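Something along these lines; audit_log is just a placeholder table name.

    -- With autocommit, each statement below is its own transaction and
    -- pays the full commit overhead:
    INSERT INTO audit_log (msg) VALUES ('a');
    INSERT INTO audit_log (msg) VALUES ('b');

    -- Grouping related statements into one explicit transaction pays
    -- that cost only once:
    START TRANSACTION;
    INSERT INTO audit_log (msg) VALUES ('a');
    INSERT INTO audit_log (msg) VALUES ('b');
    INSERT INTO audit_log (msg) VALUES ('c');
    COMMIT;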
I was wondering if anyone has a suggestion for what kind of storage engine to use. The program needs to perform a lot of writes to the database but very few reads.
[edit] No foreign keys necessary. The data is simple, but it needs to perform the writes very fast.
From jpipes:
MyISAM and Table-Level Locks
Unlike InnoDB, which employs row-level locking, MyISAM uses a much coarser-grained locking system to ensure that data is written to the data file in a protected manner. Table-level locking is the only level of lock for MyISAM, and this has a couple of consequences:
Any connection issuing an UPDATE or DELETE against a MyISAM table will request an exclusive write lock on the MyISAM table. If no other locks (read or write) are currently placed on the table, the exclusive write lock is granted and all other connections issuing requests of any kind (DDL, SELECT, UPDATE, INSERT, DELETE) must wait until the thread with the exclusive write lock updates the record(s) it needs to and then releases the write lock.
Since there are only table-level locks, there is no ability (like there is with InnoDB) to lock only one or a small set of records, allowing other threads to SELECT from other parts of the table data.
The point is, for writing, InnoDB is better as it will lock less of the resource and enable more parallel actions/requests to occur.
"It needs to perform the writes very fast" is a vague requirement. Whatever you do, writes may be delayed by contention in the database. If your application needs to not block when it's writing audit records to the database, you should make the audit writing asynchronous and keep your own queue of audit data on disc or in memory (so you don't block the main worker thread/process)
InnoDB may allow concurrent inserts, but that doesn't mean they won't be blocked by contention for resources or internal locks for things like index pages.
MyISAM allows one inserter and several readers ("Concurrent inserts") under the following circumstances:
The table has no "holes" in it
There are no threads trying to do an UPDATE or DELETE
If you have an append-only table, which you recreate each day (or create a new partition every day if you use 5.1 partitioning), you may get away with this.
MyISAM concurrent inserts are mostly very good, IF you can use them.
When writing audit records, do several at a time if possible - this applies whichever storage engine you use. It is a good idea for the audit process to "batch up" records and do an insert of several at once.
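For instance, a single multi-row INSERT instead of several single-row ones; the audit table and its columns are made up for the example.

    INSERT INTO audit_log (logged_at, user_id, action)
    VALUES
        (NOW(), 17, 'login'),
        (NOW(), 23, 'update-profile'),
        (NOW(), 17, 'logout');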
You've not really given us enough information to make a considered suggestion - are you wanting to use foreign keys? Row-level locking? Page-level locking? Transactions?
As a general rule, if you want to use transactions, InnoDB/BerkeleyDB. If you don't, MyISAM.
In my experience, MyISAM is great for fast writes as long as, after insertion, it's read-only. It'll keep happily appending faster than any other option I'm familiar with (including supporting indexes).
But as soon as you start deleting records or updating index keys, and it needs to refill emptied holes (in tables or indexes) the discussion gets a lot more complicated.
For classic log-type or journal-type tables, though, it's very happy.