XA vs. Non-XA JDBC Driver Performance? - mysql

We are using an XA JDBC driver in a case where it is not required (read-only work that doesn't participate in a distributed transaction).
Just wondering if there are any known performance gains to be had by switching to the non-XA JDBC driver - if not, it's probably not worth switching.
FWIW we are using MySQL 5.1

As with all things performance related, the answer is: it depends. Specifically, it depends on exactly how you are using the driver.
The cost of interacting transactionally with a database is divided roughly into: code complexity overhead, communication overhead, sql processing and disk I/O.
Communication overhead differs somewhat between the XA and non-XA cases. All else being equal, an XA transaction carries a little more cost here as it requires more round trips to the db. For a non-XA transaction in manual commit mode, the cost is at least two calls: the sql operation(s) and the commit. In the XA case it's start, sql operation(s), end, prepare and commit. For your specific use case that will automatically optimize to start, sql operation(s), end, prepare. Not all the calls are of equal cost: the data moved in the result set will usually dominate. On a LAN the cost of the additional round trips is not usually significant.
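To make those sequences concrete, here is a small model of the round trips described above. The class and names are illustrative only; the counts model driver-to-server calls, not measured latencies:

```java
import java.util.List;

// Model of the per-transaction call sequences described above. Each entry is
// one driver-to-server round trip; real cost also depends on result set size.
public class RoundTrips {
    static final List<String> NON_XA      = List.of("sql", "commit");
    static final List<String> XA          = List.of("start", "sql", "end", "prepare", "commit");
    // read-only: the prepare phase votes XA_RDONLY, so the commit call is skipped
    static final List<String> XA_READONLY = List.of("start", "sql", "end", "prepare");

    public static void main(String[] args) {
        System.out.println("non-XA round trips:       " + NON_XA.size());
        System.out.println("XA round trips:           " + XA.size());
        System.out.println("XA read-only round trips: " + XA_READONLY.size());
    }
}
```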
Note however that there are some interesting gotchas lurking in wait for the unwary. For example, some drivers don't support prepared statement caching in XA mode, which means that XA usage carries the added overhead of re-parsing the SQL on every call, or requires you to use a separate statement pool on top of the driver. Whilst on the topic of pools, correctly pooling XA connections is a little more complex than pooling non-XA ones, so depending on the connection pool implementation you may see a slight hit there too. Some ORM frameworks are particularly vulnerable to connection pooling overhead if they use aggressive connection release and reacquire within transaction scope. If possible, configure to grab and hold a connection for the lifetime of the tx instead of hitting the pool multiple times.
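If your driver does not cache prepared statements in XA mode, the "separate statement pool on top of the driver" mentioned above can be as simple as an LRU map keyed by SQL text. A minimal sketch, where the parser function stands in for the driver's (expensive) prepareStatement parse step and all names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a statement cache layered over a driver that re-parses SQL in XA
// mode. Already-parsed statements are kept keyed by SQL text, so each distinct
// statement is parsed only once until evicted.
public class StatementCache<S> {
    private final Map<String, S> cache;
    private final Function<String, S> parser;
    int parses = 0; // how many times the SQL actually had to be parsed

    public StatementCache(int capacity, Function<String, S> parser) {
        this.parser = parser;
        // access-ordered LinkedHashMap gives us LRU eviction for free
        this.cache = new LinkedHashMap<String, S>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
                return size() > capacity; // evict least-recently-used statement
            }
        };
    }

    public S prepare(String sql) {
        return cache.computeIfAbsent(sql, s -> { parses++; return parser.apply(s); });
    }
}
```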
With the caveat mentioned previously regarding the caching of prepared statements, there is no material difference in the cost of the sql handling between XA and non-XA tx. There is however a small difference to resource usage on the db server: in some cases it may be possible for the server to release resources sooner in the non-XA case. However, transactions are normally short enough that this is not a significant consideration.
Now we consider disk I/O overhead. Here we are concerned with I/O occasioned by the XA protocol rather than the SQL used for the business logic, as the latter is unchanged in either case. For read-only transactions the situation is simple: a sensible db and tx manager won't do any log writes, so there is no overhead. For write cases the same is true where the db is the only resource involved, due to XA's one phase commit optimization. For the 2PC case each db server or other resource manager needs two disk writes instead of the one used in non-XA cases, and the tx manager likewise needs two. Thanks to the slowness of disk storage this is the dominant source of performance overhead in XA vs. non-XA.
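A back-of-envelope model of the forced log writes just described, assuming each forced write is one disk sync and the read-only/one-phase optimizations apply as stated:

```java
// Rough model of forced log writes ("syncs") per committed write transaction.
// n is the number of resource managers (db servers etc.) enlisted in the tx.
public class ForcedWrites {
    static int forcedWrites(boolean xa, int n) {
        if (!xa || n == 1) return n;  // non-XA, or XA one-phase commit optimization
        return 2 * n + 2;             // prepare + commit sync per resource, plus two at the tx manager
    }

    public static void main(String[] args) {
        System.out.println("non-XA, 1 db:  " + forcedWrites(false, 1));
        System.out.println("XA, 1 db:      " + forcedWrites(true, 1));
        System.out.println("XA 2PC, 2 dbs: " + forcedWrites(true, 2));
    }
}
```

This is why the 2PC case dominates the overhead: each extra resource adds two syncs rather than one.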
Several paragraphs back I mentioned code complexity. XA requires slightly more code execution than non-XA. In most cases the complexity is buried in the transaction manager, although you can of course drive XA directly if you prefer. The cost is mostly trivial, subject to the caveats already mentioned, unless you are using a particularly poor transaction manager, in which case you may have a problem. The read-only case is particularly interesting: transaction manager providers usually put their optimization effort into the disk I/O code, whereas lock contention is a more significant issue for read-only use cases, particularly on highly concurrent systems.
Note also that code complexity in the tx manager is something of a red herring in architectures featuring an app server or other standard transaction manager provider, as these usually use much the same code for XA and non-XA transaction coordination. In non-XA cases, to bypass the tx manager entirely you typically have to tell the app server / framework to treat the connection as non-transactional and then drive the commit directly using JDBC.
So the summary is: the cost of your SQL queries is going to dominate the read-only transaction time regardless of the XA/non-XA choice, unless you mess up something in the configuration or do particularly trivial SQL operations in each tx. The latter is a sign that your business logic could probably use some restructuring to improve the ratio of business logic to tx management overhead in each tx.
For read-only cases the usual protocol-agnostic advice therefore applies: consider a transaction-aware second-level cache in your ORM solution rather than hitting the DB each time. Failing that, tune the SQL, then increase the db's buffer cache until you see a 90%+ hit rate or you max out the server's RAM slots, whichever comes first. Only worry about XA vs. non-XA once you've done that and found things are still too slow.

To explain this briefly,
An XA transaction is a "global transaction".
A non-XA transaction is a "local transaction".
An XA transaction involves a coordinating transaction manager, with one or more databases (or other resources, like JMS) all involved in a single global transaction.
Non-XA transactions have no transaction coordinator, and a single resource is doing all its transaction work itself.

Related

How does MySQL InnoDB implement Read Uncommitted isolation level

Oracle doesn't allow dirty reads, so Read Uncommitted is not even allowed to be set from JDBC.
PostgreSQL also falls back to Read Committed, when choosing Read Uncommitted.
SQL Server defines a Read Uncommitted isolation level, because its concurrency control model is based on locking (unless switching to the two snapshot isolation levels), so it's probably the only database which can see some performance advantage from avoiding locking for reports that don't really need strict consistency.
InnoDB also uses MVCC but, unlike Oracle and PostgreSQL, it allows dirty reads. Why is that? Is there any performance advantage in going directly to the latest version, instead of rebuilding the previous version from the rollback segments? Is the query-time restoring of rollback segments such an intensive process that it would call for allowing dirty reads?
The main advantage I'm aware of, is that if all your sessions are READ-UNCOMMITTED then house-keeping (cleaning up UNDO) will never be blocked waiting for old sessions.
There may be some other performance gains if read-view structures (example) do not need to be created for READ-UNCOMMITTED transactions themselves, but I have not confirmed this myself. Generally speaking, this is not an isolation level that the InnoDB team targets optimizations for.
Edit: In terms of the performance of unrolling rollback segments: yes, it can be slow when there are many revisions. AFAIK it is a simple linked list, and many traversals could be required. The comparison to PostgreSQL is a difficult one to make here, because the architecture (MySQL features UNDO) is quite different. Generally speaking I would say that UNDO works well when the reconstruction is "logical only + fits in the working set"; i.e. it is performed in memory, and cleaned up before physical IO is required.
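To illustrate why the traversal matters, here is a toy model of an undo chain (not InnoDB's actual structures; the visibility rule is deliberately simplified to "written by a transaction no newer than the read view"):

```java
// Toy model of an undo chain: each row version links to the previous one. A
// consistent read walks the list until it reaches a version visible to its
// read view, while READ UNCOMMITTED just takes the head with zero traversals.
public class UndoChain {
    static final class Version {
        final long trxId;     // transaction that wrote this version
        final String value;
        final Version prev;   // older version in the rollback segment, or null
        Version(long trxId, String value, Version prev) {
            this.trxId = trxId; this.value = value; this.prev = prev;
        }
    }

    static String consistentRead(Version head, long readView) {
        for (Version v = head; v != null; v = v.prev)
            if (v.trxId <= readView) return v.value; // first visible version
        return null; // no version visible to this read view
    }

    static String dirtyRead(Version head) {
        return head.value; // READ UNCOMMITTED: latest version, no chain walk
    }
}
```

With a long chain of uncommitted revisions, the consistent read pays one pointer hop per revision; the dirty read pays none.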

MongoDB write concern sync level

I am trying to understand exactly what the limitations are of using MongoDB as the primary database for a project I am working on. It can be hard to wade through the crap online to properly understand how it compares to a more traditional database choice of, say, MySQL.
From what I understand from reading about the HADR configuration of
IBM DB2 (http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.ha.doc%2Fdoc%2Fc0011724.html),
MySQL (http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html),
and MongoDB (http://docs.mongodb.org/manual/core/write-concern/),
it seems that Replica Acknowledged (http://docs.mongodb.org/manual/core/replica-set-write-concern/) is the highest level of write concern in a replica set.
Is replica acknowledged the equivalent to the synchronous level in DB2 and Semisynchronous level in MySQL?
No they are not.
IBM DB2 provides a way to make sure that all members of a replica set are up to date at the same time; it is the same as MySQL's own synchronous replication. It ensures full consistency at all times throughout the slave set.
Semisynchronous replication again is not replica set majority either; from the documentation page:
The master waits after commit only until at least one slave has received and logged the events.
But then:
It does not wait for all slaves to acknowledge receipt, and it requires only receipt, not that the events have been fully executed and committed on the slave side.
In other words, you only know that at least one slave received and logged the events; you have no idea whether any slave has actually executed and committed them. That guarantee covers receipt only, which is weaker than MongoDB's replica acknowledged writes.
With majority you know that a majority of the members have actually performed your command, as can be seen in the diagram in the documentation: http://docs.mongodb.org/manual/core/replica-set-write-concern/#verify-write-operations
and if that doesn't convince you, then this quote from the next paragraph should:
The following sequence of commands creates a configuration that waits for the write operation to complete on a majority of the set members before returning:
So MySQL semisynchronous is similar to majority but it isn't the same. DB2 is totally different.
The IBM documentation sums up the differences in replica/slave write concern quite well:
The more strict the synchronization mode configuration parameter value, the more protection your database solution has against transaction data loss, but the slower your transaction processing performance. You must balance the need for protection against transaction loss with the need for performance.
This applies to DB2, MySQL and MongoDB alike. You must choose.

HandlerSocket transactions

In Redis you can complete a transaction this way:
redis.watch('powerlevel')
current = redis.get('powerlevel')
redis.multi()
redis.set('powerlevel', current + 1)
redis.exec()
Is it possible to perform this operation using the HandlerSocket?
What transaction-related features does HandlerSocket provide in general?
Comparing Redis "transactions" to a general purpose transactional engine is always a bit misleading. A Redis WATCH/MULTI/EXEC block is:
Not atomic (no rollback in case of error)
Consistent (there are not many consistency rules anyway with Redis)
Fully isolated (everything is serialized)
Possibly durable if AOF+fsync strategy is selected
So the full ACID properties which are commonly used to define a transaction are not completely provided by Redis. Contrary to most transactional engines, Redis provides very strong isolation, and does not attempt to provide any rollback capabilities.
The example provided in the question is not really representative IMO, since the same behavior can be achieved in a simpler way by just using:
redis.incr( "powerlevel" )
because Redis single operations are always atomic and isolated.
WATCH/MULTI/EXEC blocks are typically used when consistency between various keys must be enforced, or to implement optimistic locking patterns. In other words, if your purpose is just to increment isolated counters, there is no need to use a WATCH/MULTI/EXEC block.
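The optimistic locking pattern that WATCH/MULTI/EXEC implements has the same shape as a compare-and-set retry loop. A sketch in Java (using an in-process AtomicLong as a stand-in for the Redis key, so this illustrates the pattern rather than any Redis client API):

```java
import java.util.concurrent.atomic.AtomicLong;

// WATCH/MULTI/EXEC is optimistic locking: read the value, compute, then apply
// only if the watched value is unchanged, retrying on conflict.
public class OptimisticIncrement {
    static long increment(AtomicLong counter) {
        while (true) {
            long seen = counter.get();               // WATCH + GET
            long next = seen + 1;                    // compute outside the "transaction"
            if (counter.compareAndSet(seen, next))   // EXEC: applies only if unchanged
                return next;
            // another writer got there first; retry, like a client re-issuing WATCH
        }
    }
}
```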
HandlerSocket is a completely different beast. It is built on top of the generic handler interface of MySQL, and the transactional behavior depends on the underlying storage engine. For instance, when it is used with MyISAM there are no ACID transactions, but consistency is ensured by an R/W lock at the table level. With InnoDB, ACID transactions will be used with the default isolation level (which can be set in the InnoDB configuration AFAIK). InnoDB implements MVCC (multi-versioning concurrency control), so locking is much more complex than with MyISAM.
HandlerSocket works with two pools of worker threads (one for read-only connections, one for write-oriented connections). People are supposed to use several read worker threads, but only one write thread (probably to decrease locking contention). So in the base configuration, write operations are serialized, but not read operations. AFAIK, the only way to get the same isolation semantics as Redis is to use only the write-oriented socket for both read and write operations, and to keep only one write thread (full serialization of all operations). That will impact scalability though.
The HandlerSocket protocol itself exposes no transactional capabilities. At each event loop iteration, it collects all pending operations (coming from all the sockets) and performs them in a single transaction (only relevant with InnoDB). AFAIK, the user has no way to alter the scope of this transaction.
The conclusion is it is not generally possible to emulate the behavior of a Redis WATCH/MULTI/EXEC block with HandlerSocket.
Now, back to the example, if the purpose is just to increment counters in a consistent way, this is fully supported by the HandlerSocket protocol. For instance, the +/- (increment/decrement) operations are available, and also the U? operation (similar to Redis GETSET command), or +?/-? (increment/decrement, returning the previous value).

SQL Azure performance considerations

What performance considerations should I keep in mind when I'm planning an SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they all use one database... it looks like the bottleneck.
I was trying to find numbers about:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application that does a very high volume of inserts, but each time I need to return the result of an aggregate function (e.g. the sum of all records with the same key in a column), so I cannot go with table storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with lots of connections.
Sharding is another option, but even though the number of inserts is massive, the amount of data is very small: 4 to 6 columns with one PK and no FK. So even a 1 GB DB would be overkill (and an overpayment :D) for a partition.
What performance considerations should I keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of roundtrips to the database and optimize connection pooling make sure to batch your inserts as well. So you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (multiple active recordsets) so that you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
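The batching idea can be sketched as follows. This is illustrative logic only, not a SQL Azure or JDBC API: the flush consumer stands in for a single multi-row INSERT round trip:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batcher for the "send 100 inserts in one shot" suggestion:
// rows accumulate locally and go to the database as one round trip per batch.
public class InsertBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushFn; // stands in for one batched INSERT
    private final List<T> pending = new ArrayList<>();
    int flushes = 0; // round trips actually made

    public InsertBatcher(int batchSize, Consumer<List<T>> flushFn) {
        this.batchSize = batchSize;
        this.flushFn = flushFn;
    }

    public void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) flush();
    }

    public void flush() { // call once more when the stream of inserts ends
        if (pending.isEmpty()) return;
        flushFn.accept(List.copyOf(pending));
        pending.clear();
        flushes++;
    }
}
```

250 inserts at a batch size of 100 cost three round trips instead of 250.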
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when this happens. When throttling, SQL Azure will drop your connection, requiring you to perform a retry. Number of connections supported and bandwidth is not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has happened. Another way to handle this, is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of automatically engaging if throttling occurs.
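The overflow mechanism can be sketched like this. All names are illustrative; the consumer stands in for the direct SQL insert, and a thrown exception simulates SQL Azure dropping a throttled connection:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Sketch of the overflow mechanism: attempt the direct write, and only when
// the database throttles us divert the row to a queue for a worker role to
// drain later. The in-memory queue stands in for an Azure Queue.
public class OverflowWriter<T> {
    private final Consumer<T> directWrite;
    final Queue<T> overflow = new ArrayDeque<>();

    public OverflowWriter(Consumer<T> directWrite) {
        this.directWrite = directWrite;
    }

    public void write(T row) {
        try {
            directWrite.accept(row);            // normal path: straight to the db
        } catch (RuntimeException throttled) {
            overflow.add(row);                  // engage only when throttling occurs
        }
    }
}
```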
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
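The running-totals idea amounts to updating an aggregate at write time so reads return a stored number instead of scanning rows. A sketch of the idea only, not the Table Storage API (which would also need its own concurrency handling, as noted):

```java
import java.util.HashMap;
import java.util.Map;

// Maintain a per-key running total at insert time; the aggregate the
// application needs is then a single lookup rather than a scan.
public class RunningTotals {
    final Map<String, Long> totals = new HashMap<>();

    void record(String key, long amount) {
        totals.merge(key, amount, Long::sum); // one cheap upsert per insert
    }

    long sum(String key) {
        return totals.getOrDefault(key, 0L); // the aggregate is precomputed
    }
}
```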
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.

MySQL: Transactions across multiple threads

Preliminary:
I have an application which maintains a thread pool of about 100 threads. Each thread can last about 1-30 seconds before a new task replaces it. When a thread ends, it will almost always result in inserting 1-3 records into a table; this table is used by all of the threads. Right now, no transactional support exists, but I am trying to add that now. Also, the table in question is InnoDB. So...
Goal
I want to implement a transaction for this. The rules for whether or not this transaction commits or rollback reside in the main thread. Basically there is a simple function that will return a boolean.
Can I implement a transaction across multiple connections?
If not, can multiple threads share the same connection? (Note: there are a LOT of inserts going on here, and that is a requirement).
1) No, a transaction is limited to a single DB connection.
2) Yes, a connection (and transaction) can be shared across multiple threads.
Well, as stated in another answer, you can't create a transaction across multiple connections. You can share a single connection across threads, but you need to be very careful with that: you must make sure that only one thread is writing to the connection at a time. You can't just have multiple threads talking across the same connection without synchronizing their activities in some way; bad things will likely happen if you allow two threads to talk at once (memory corruption in the client library, etc.). Using a mutex or critical section to protect the connection conversations is probably the way to go.
-Don
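The mutex advice above can be sketched like this. The class is illustrative: the list stands in for the client library's per-connection state, and the lock ensures only one thread talks on the connection at a time:

```java
import java.util.ArrayList;
import java.util.List;

// Route every conversation on a shared connection through a single lock so
// only one thread is mid-conversation at any moment.
public class GuardedConnection {
    private final Object lock = new Object();
    private final List<String> sent = new ArrayList<>(); // stand-in for wire/client state

    public void execute(String sql) {
        synchronized (lock) { // without this, concurrent calls can corrupt client state
            sent.add(sql);
        }
    }

    public int sentCount() {
        synchronized (lock) { return sent.size(); }
    }
}
```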
Sharing connections between lots of threads is usually implemented by using a connection pool. Every thread can request a connection from the pool, use it for its purposes (one or more transactions, committed or rolled back) and hand it back to the pool once the task is finished.
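A minimal pool of the kind described can be sketched with a blocking queue. This is a sketch, not a production pool (no validation, timeouts, or leak detection); C is a placeholder for a real connection type:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Minimal blocking connection pool: a thread borrows a connection for the
// whole transaction and returns it afterwards.
public class Pool<C> {
    private final BlockingQueue<C> idle;

    public Pool(int size, Supplier<C> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    public C borrow() { // blocks when the pool is exhausted
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for a connection", e);
        }
    }

    public void release(C conn) {
        idle.offer(conn); // hand the connection back once the task is finished
    }
}
```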
This is what application servers offer you. They will take care of transactions, too, i. e. when the method that requested the transaction finishes normally, changes are committed, if it throws an exception, the database transaction is rolled back.
I suggest you have a look at Java EE 5 or 6 - it is very easy to use and can even be employed in embedded systems. For easy start, have a look at Netbeans and the Glassfish application server. However the general concepts apply to all application servers alike.
As for InnoDB, it will not have any problems handling lots of transactions. Under the supervision of the app server you can concentrate on the business logic and do not have to worry about half-written updates or anyone seeing updates/inserts before the transaction they originate from has been committed.
InnoDB uses MVCC (multi version concurrency control), effectively presenting each transaction with a snapshot of the whole database as of the time when it was started. You can read more about MVCC here in a related question: Question 812512