Spring Transaction Performance Multiple Vs Single - mysql

So I am curious to know what the performance impact is if we have a single transaction with multiple updates in one flow, versus a separate transaction for each of the updates.
If the application can sustain both patterns, which one is the better one to opt for? And if the application can only go with the second option, which is to have a separate transaction for each of the updates, how much of a performance hit do we take?
@Transactional
public void updateXYZ() {
    updateX();
    updateY();
    updateZ();
}
VS
public void updateXYZ() {
    updateSeprateTransactionX();
    updateSeprateTransactionY();
    updateSeprateTransactionZ();
}

Related

how bad is it to have "extra" database queries?

I come from the front-end world in web development where we try really hard to limit the number of HTTP requests issued (by consolidating css, js files, images, etc.).
With db connections (MySQL), obviously you don't want to have unnecessary connections, but as a general rule, how bad is it to have multiple small queries? (they execute quickly)
I ask because I'm moving my application to a clustered environment and where before I was caching some stuff in server memory (as I was running on a single server), I am now trying to make my app "stateless" and in my current implementation that means more small db calls. This will help me with load balancing (avoiding sticky sessions) and also keep server memory usage down.
We're not talking a ton of queries, maybe 6-8 db calls instead of 2-4, returning anywhere from a handful of records to a few thousand records. Each of them executes quickly, less than 30ms (some much less), but I don't know if there is some "connection latency" I should be concerned about.
Thanks for your insight.
Short answer: (1) make sure you're staying at the same big-O level, reuse connections, measure performance; (2) think about how much you care about data consistency.
Long answer:
Performance
Strictly from a performance perspective, and generally speaking, unless you are already close to maxing out your database resources, such as max connections, this is not likely to have a major impact. But there are certain things you should keep in mind:
do the "6-8" queries that replace the "2-4" queries stay at the same complexity? e.g. if the current database interaction is O(1), is it going to change to O(n)? Or is a current O(n) going to change to O(n^2)? If yes, you should think about what that means for your application
most application servers can reuse existing database connections, or have persistent database connection pools; make sure your application does not establish a new connection for every query, otherwise this is going to make it even more inefficient (see the pooling sketch below)
in many common cases, mainly on larger tables with complex indexes and joins, doing a few queries by primary key may be more efficient than joining those tables in a single query; this would be the case if, while doing such joins, the server not only takes longer to perform the complex query, but also blocks other queries against the affected tables
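To illustrate the connection-reuse point, a minimal pooling sketch assuming HikariCP (any pool will do; the URL, credentials and pool size here are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/app");   // placeholder URL
config.setUsername("app");
config.setPassword("secret");
config.setMaximumPoolSize(10);   // reuse at most 10 physical connections
HikariDataSource dataSource = new HikariDataSource(config);
// Hand this DataSource to your framework; each query then borrows and returns
// a pooled connection instead of opening a fresh one.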
Generally speaking about performance, the rule of thumb is - always measure.
Consistency
Performance is not the only aspect to consider, however. Also think about how much you care about data consistency in your application.
For example, consider a simple case - tables A and B that have one-to-one relationship and you are querying for a single record using a primary key. If you join those tables and retrieve result using a single query, you'll either get a record from both A and B, or no records from either, which is what your application expects too. Now consider if you split that up into 2 queries (and you're not using transactions with preferred isolation levels) - you get a record from table A, but before you could grab the matching record from table B, it is deleted/updated by another process. Now your application has a record from A but none from B.
General question here is - do you care about ACID compliance of your relational data as it pertains to the queries you are breaking apart? If the answer is yes, you must think about how your application logic will react in these specific cases.
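If you do care, one option is to wrap the split-up reads in a single transaction so both see the same snapshot. A minimal sketch using plain JDBC against hypothetical tables a and b (InnoDB at REPEATABLE READ gives both SELECTs a consistent snapshot):

import java.sql.*;
import javax.sql.DataSource;

// Reads the A row and its matching B row inside one transaction, so both
// reads see the same snapshot (InnoDB, REPEATABLE READ).
void readAandB(DataSource dataSource, long id) throws SQLException {
    try (Connection con = dataSource.getConnection()) {
        con.setAutoCommit(false);
        con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        try (PreparedStatement psA = con.prepareStatement("SELECT * FROM a WHERE id = ?");
             PreparedStatement psB = con.prepareStatement("SELECT * FROM b WHERE a_id = ?")) {
            psA.setLong(1, id);
            try (ResultSet rsA = psA.executeQuery()) { /* row from A (or none) */ }
            psB.setLong(1, id);
            try (ResultSet rsB = psB.executeQuery()) { /* matching row from B, from the same snapshot */ }
        }
        con.commit();
    }
}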
6-8 queries for one web page? Usually this is fine. I do it all the time.
Thousands of rows returned? Choke! What is the client going to do with that many? Can the SQL do more processing, then return fewer rows?
With rare exceptions, only 1 connection per web page.
Each query has a lot of overhead. For example, when INSERTing 100 rows into a table, 100 single-row INSERT statements will take about 10 times as long as a single 100-row INSERT. So, when practical, use fewer round trips to the server. This becomes very important if the network is a WAN. The other side of the globe is 250ms away, just for latency. A server in the same datacenter is probably so close that latency can be ignored. Over a WAN, use stored routines to minimize round trips.
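To make the round-trip point concrete, here is a rough JDBC sketch (hypothetical table t with a single column x) contrasting 100 single-row INSERTs with one multi-row INSERT:

import java.sql.*;

void insertHundredRows(Connection con) throws SQLException {
    // Many round trips: one statement execution per row.
    try (PreparedStatement ps = con.prepareStatement("INSERT INTO t (x) VALUES (?)")) {
        for (int i = 0; i < 100; i++) {
            ps.setInt(1, i);
            ps.executeUpdate();          // each call is a separate round trip
        }
    }

    // One round trip: a single 100-row INSERT built with multiple VALUES tuples.
    StringBuilder sql = new StringBuilder("INSERT INTO t (x) VALUES ");
    for (int i = 0; i < 100; i++) {
        sql.append(i == 0 ? "(?)" : ",(?)");
    }
    try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
        for (int i = 0; i < 100; i++) {
            ps.setInt(1 + i, i);         // parameters 1..100
        }
        ps.executeUpdate();
    }
}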
I like to time each query actively in the code. Then, if I perceive a performance problem, I look at those timings to see which query to work on first. Or use the slow query log.

Performance of JPA mappings

Right now I am using JPA. What I would like to know is the following:
I have a lot of tables mapped to each other. When I look at the log, I see that a lot of information is being pulled out of the database after a simple query. What will happen if there are a lot of queries at a time? Or will it work fine? How can I increase performance?
There is an overhead that an ORM framework comes with. Because it's fairly high level, it sometimes needs to generate a lot of native SQL queries to get you what you want in one or two lines of JPQL or pure EntityManager operations.
However, JPA uses two caches - L1 and L2. One operates at the persistence context (EntityManager) level, the other at the PersistenceUnit level. Therefore, you might see a lot of SQL queries generated, but after some time, you should have some of the data cached.
If you're unhappy with the performance, you could try using lazy loaded collections or fetching the required data by yourself (you might be interested in Bozho's post regarding this matter).
Finally, if you see that the cache hasn't improved your performance and that the hand-made JPQL queries are not doing the job right - you can always revert to plain SQL queries. Beware that those queries bypass the JPA caches and might require you to do some flushes before you execute the native query (or to invoke it at the beginning of the active transaction).
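A minimal sketch of that last route, assuming it runs inside an active transaction; the table and column names are made up for illustration:

import java.util.List;
import javax.persistence.EntityManager;

// Runs a native query; flush() first so pending managed-entity changes are
// visible to it (the JPA caches are bypassed, as noted above).
List<?> loadRecentOrders(EntityManager em) {
    em.flush();
    return em.createNativeQuery(
            "SELECT o.id, o.total FROM orders o WHERE o.created_at > NOW() - INTERVAL 1 DAY")
        .getResultList();
}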
Whichever optimisation route you choose - first test it in your environment and ask yourself whether you need the optimisation at all. Do some heavy testing, performance tests and so on.
"Premature optimization is the root of all evil." D. Knuth
If you really have a LOT of entities mapped together, this could indeed lead to a performance problem. This will usually be the case if you have a lot of @OneToMany or @ManyToMany mappings:
@Entity
public class A {
    @OneToMany
    private List<B> listB;
    @ManyToMany
    private List<C> listC;
    @OneToMany
    private List<D> listD;
    ...
}
However, one thing you could do is use lazy fetching. This means that the loading of a field is delayed until it is accessed for the first time. You can achieve this by using the fetch attribute:
@Entity
public class A {
    @OneToMany(fetch=FetchType.LAZY)
    private List<B> listB;
    @ManyToMany(fetch=FetchType.LAZY)
    private List<C> listC;
    @OneToMany(fetch=FetchType.LAZY)
    private List<D> listD;
    ...
}
In the above sample it means that listB, listC and listD will not be fetched from the DB until the first access to each list.
The concrete implementation of the lazy fetching depends on your JPA provider.
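For illustration, this is roughly how lazy fetching behaves at runtime, assuming the entity A above has a Long id and the usual getters, and that the persistence context is still open when the collection is touched:

import javax.persistence.EntityManager;

void showLazyLoading(EntityManager em, Long id) {
    A a = em.find(A.class, id);      // loads A; listB/listC/listD are not fetched yet
    int n = a.getListB().size();     // first access: the provider issues the SELECT for listB now
    System.out.println("B count: " + n);
}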

MySQL performance: views vs. functions vs. stored procedures

I have a table that contains some statistic data which is collected per hour.
Now I want to be able to quickly get statistics per day / week / month / year / total.
What is the best way to do so performance-wise? Creating views? Functions? Stored procedures? Or normal tables that I have to write to simultaneously when updating data? (I would like to avoid the latter.)
My current idea would be to create a view_day which sums up the hours, then a view_week and view_month and view_year which sum up data from view_day, and view_total which sums up view_year. Is it good or bad?
You essentially have two systems here: One that collects data and one that reports on that data.
Running reports against your frequently-updated, transactional tables will likely result in read-locks that block writes from completing as quickly as they can and therefore possibly degrade performance.
It is generally HIGHLY advisable to run a periodic "gathering" task that gathers information from your (probably highly normalized) transactional tables and stuffs that data into denormalized reporting tables, forming a "data warehouse". You then point your reporting engine / tools at the denormalized "data warehouse", which can be queried without impacting the live transactional database.
This gathering task should only run as often as your reports need to be "accurate". If you can get away with once a day, great. If you need to do this once an hour or more, then go ahead, but monitor the performance impact on your writing tasks when you do.
Remember, if the performance of your transactional system is important (and it generally is), avoid running reports against it at all costs.
Yes, having tables that store already aggregated data is a good practice.
Whereas views, as well as SPs and functions, will just perform queries over the big tables, which is not that efficient.
The only really fast and scalable solution is, as you put it, "normal tables where you have to write to simultaneously when updating data", with proper indexes. You can automate updating of such a table using triggers.
My view is that complex calculations should only happen once, as the data changes, not every time you query. Create an aggregate table and populate it either through a trigger (if no lag is acceptable) or through a job that runs once a day or once an hour or whatever lag time is acceptable for reporting. If you go the trigger route, test, test, test. Make sure it can handle multiple-row inserts/updates/deletes as well as the more common single-row ones. Make sure it is as fast as possible and has no bugs whatsoever. Triggers will add a bit of processing to every data action; you have to make sure they add the smallest possible bit and that no bugs will ever happen that prevent users from inserting/updating/deleting data.
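If you go the job route rather than the trigger route, a rough sketch of such an aggregation job (Spring flavour, with @EnableScheduling assumed and hypothetical stats_hourly / stats_daily tables) could look like this:

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class DailyStatsAggregator {

    private final JdbcTemplate jdbc;

    public DailyStatsAggregator(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Runs at 00:15 every day and rolls yesterday's hourly rows into daily rows.
    @Scheduled(cron = "0 15 0 * * *")
    public void rollUpYesterday() {
        jdbc.update(
            "INSERT INTO stats_daily (day, hits) " +
            "SELECT DATE(hour_ts), SUM(hits) FROM stats_hourly " +
            "WHERE hour_ts >= CURDATE() - INTERVAL 1 DAY AND hour_ts < CURDATE() " +
            "GROUP BY DATE(hour_ts)");
    }
}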
We have a similar problem, and what we do is utilize a master/slave relationship. We do the transactional work (both reads and writes, since in our case some reads need to be ultra fast and can't wait for replication) on the master. The slave quickly replicates the data, and we then run every non-transactional query off that, including reporting.
I highly suggest this method as it's simple to put into place as a quick and dirty data warehouse if your data is granular enough to be useful in the reporting layers/apps.

Alternatives to LINQ To SQL on high loaded pages

To begin with, I LOVE LINQ TO SQL. It's so much easier to use than direct querying.
But there's one big problem: it doesn't work well on heavily loaded pages. I have some actions in my ASP.NET MVC project that are called hundreds of times every minute.
I used to have LINQ to SQL there, but since the number of requests is gigantic, LINQ TO SQL almost always returned "Row not found or changed" or "X of X updates failed". And it's understandable. For instance, I have to increase some value by one with every request.
var stat = DB.Stats.First();
stat.Visits++;
// ....
DB.SubmitChanges();
But while ASP.NET was working on those // ... instructions, the stat.Visits value stored in the table got changed.
I found a solution, I created a stored procedure
UPDATE Stats SET Visits=Visits+1
It works well.
Unfortunately now I'm getting more and more moments like that. And it sucks to create stored procedures for all cases.
So my question is, how to solve this problem? Are there any alternatives that can work here?
I hear that Stackoverflow works with LINQ to SQL. And it's more loaded than my site.
This isn't exactly a problem with Linq to SQL, per se, it's an expected result with optimistic concurrency, which Linq to SQL uses by default.
Optimistic concurrency means that when you update a record, you check the current version in the database against the copy that was originally retrieved before making any offline updates; if they don't match, report a concurrency violation ("row not found or changed").
There's a more detailed explanation of this here. There's also a fairly sizable guide on handling concurrency errors. Typically the solution involves simply catching ChangeConflictException and picking a resolution, such as:
try
{
// Make changes
db.SubmitChanges();
}
catch (ChangeConflictException)
{
foreach (var conflict in db.ChangeConflicts)
{
conflict.Resolve(RefreshMode.KeepCurrentValues);
}
}
The above version will overwrite whatever is in the database with the current values, regardless of what other changes were made. For other possibilities, see the RefreshMode enumeration.
Your other option is to disable optimistic concurrency entirely for fields that you expect might be updated. You do this by setting the UpdateCheck option to UpdateCheck.Never. This has to be done at the field level; you can't do it at the entity level or globally at the context level.
Maybe I should also mention that you haven't picked a very good design for the specific problem you're trying to solve. Incrementing a "counter" by repeatedly updating a single column of a single row is not a very good/appropriate use of a relational database. What you should be doing is actually maintaining a history table - such as Visits - and if you really need to denormalize the count, implement that with a trigger in the database itself. Trying to implement a site counter at the application level without any data to back it up is just asking for trouble.
Use your application to put actual data in your database, and let the database handle aggregates - that's one of the things databases are good at.
Use a producer/consumer or message queue model for updates that don't absolutely have to happen immediately, particularly status updates. Instead of trying to update the database immediately, keep a queue of updates that the ASP.NET threads can push to, and then have a writer process/thread that writes the queue to the database. Since only one thread is writing, there will be much less contention on the relevant tables/rows.
For reads, use caching. For high volume sites even caching data for a few seconds can make a difference.
Firstly, you could call DB.SubmitChanges() right after stats.Visits++, and that would greatly reduce the problem.
However, that still is not going to save you from the concurrency violation (that is, simultaneously modifying a piece of data by two concurrent processes). To fight that, you may use the standard mechanism of transactions. With LINQ-to-SQL, you use transactions by instantiating a TransactionScope class, thusly:
using( TransactionScope t = new TransactionScope() )
{
var stats = DB.Stats.First();
stats.Visits++;
DB.SubmitChanges();
}
Update: as Aaronaught correctly pointed out, TransactionScope is not going to help here, actually. Sorry. But read on.
Be careful, though, not to make the body of a transaction too long, as it will block other concurrent processes, and thus, significantly reduce your overall performance.
And that brings me to the next point: your very design is probably flawed.
The core principle in dealing with highly shared data is to design your application in such way that the operations on that data are quick, simple, and semantically clear, and they must be performed one after another, not simultaneously.
The one operation that you're describing - counting visits - is pretty clear and simple, so it should be no problem once you add the transaction. I must add, however, that while this will be clear, type-safe and otherwise "good", the solution with the stored procedure is actually a much preferred one. This is actually exactly the way database applications were designed in ye olden days. Think about it: why would you need to fetch the counter all the way from the database to your application (potentially over the network!) if there is no business logic involved in processing it? The database server may increment it just as well, without even sending anything back to the application.
Now, as for other operations, that are hidden behind // ..., it seems (by your description) that they're somewhat heavy and long. I can't tell for sure, because I don't see what's there, but if that's the case, you probably want to separate them into smaller and quicker ones, or otherwise rethink your design. I really can't tell anything else with this little information.

How does transaction suspension work in MySQL?

In the Spring Framework manual they state that for a PROPAGATION_REQUIRES_NEW the current transaction will be suspended.
What does that "suspended transaction"?
The timer for the timeout stops counting on the current transaction?
What are the actual implication of such suspension?
Thank you,
Asaf
It doesn't mean anything special. A suspended transaction is just a transaction that is temporarily not used for inserts, updates, commits or rollbacks, because a new transaction has to be created due to the specified propagation behaviour, and only one transaction can be active at a time.
Basically there are two transaction models: the nested and the flat model. In the nested model, if you start a transaction and you need another one, the first one remains active; that is, the second one will be nested inside its parent, and so on. On the other hand, in the flat model, the first transaction will be suspended; that is, we won't use it until the new one has completed.
AFAIK the flat model is used almost exclusively (including by Spring and the EJB spec), since it's much easier to implement: there is only one active transaction at any given time, so it's easy to decide what to do in case of a rollback, say, because of an exception. More importantly, the underlying database has to support nesting if you need the nested model, so the flat model is just the common denominator in this case.
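To make the suspend/resume behaviour concrete, here is a minimal Spring sketch; the class and method names are made up for illustration. When outer() calls audit() on another bean, the outer transaction is suspended, audit() commits in its own transaction, and the outer transaction is then resumed:

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OuterService {

    private final AuditService auditService;

    public OuterService(AuditService auditService) {
        this.auditService = auditService;
    }

    @Transactional   // PROPAGATION_REQUIRED: this is the "current" transaction
    public void outer() {
        // ... updates in the outer transaction ...
        auditService.audit("something happened");   // outer transaction suspended here
        // ... outer transaction is active again from this point ...
    }
}

@Service
class AuditService {
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void audit(String message) {
        // runs and commits independently of the suspended outer transaction
    }
}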