Spring Framework @Async method + MySQL Performance Degradation - Scalability Problem

I have an API, notifyCustomers(), implemented on my batch server that gets called from my application server. It can send notifications via three channels: SMS, push, and email. I have separate helper classes for each channel, and they all execute in async mode.
I have around 30k users, and I usually send notifications to a subset of them ranging from 3k to 20k. The issue I face is that whenever I call that API, MySQL performance goes for a toss, particularly CPU: utilisation sits around 100% for a very long period, around 30 minutes.
I've figured out a workaround by doing the following things, and it helps keep things under control:
Using projections instead of domain objects
Fetching data in batches of 500 per call
Adding indexes based on the criteria I query on
Making no database calls from the async methods for SMS, email, and push
Calling Thread.sleep(10 mins) between each subsequent batch fetch <== This is the dirty hack that's bothering me a lot
If I remove Thread.sleep(), everything goes haywire, because the batch server just calls the async methods and then fires the DB call to fetch the next batch of 500 users in very quick succession, until the DB server stops responding.
What should I be doing to get rid of the fifth point while keeping things under control? I'm running MySQL on RDS with 300 IOPS and 4 GB RAM (db.t3.medium).
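One way to drop that Thread.sleep() is to let the sender pool push back on the fetch loop: run the notification batches on a bounded executor whose rejection policy makes the submitting thread run the work itself, so the next 500-row query can't be issued until the senders catch up. A minimal sketch in plain Java, where fetchBatch, sendAll, and CustomerProjection are hypothetical stand-ins for the projection query and the SMS/push/email helpers, and the pool and queue sizes are arbitrary numbers to tune:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NotificationDispatcher {

    // Bounded pool: when the queue of pending batches is full, CallerRunsPolicy
    // makes the fetch loop execute the batch itself, which paces the DB reads.
    private final ExecutorService pool = new ThreadPoolExecutor(
            8, 8,
            0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void notifyCustomers() {
        int offset = 0;
        List<CustomerProjection> batch;
        while (!(batch = fetchBatch(offset, 500)).isEmpty()) {  // projection, 500 rows per call
            final List<CustomerProjection> current = batch;
            pool.submit(() -> sendAll(current));                // SMS/push/email, no DB access inside
            offset += 500;
        }
    }

    // Placeholders for the real repository call and channel helpers
    private List<CustomerProjection> fetchBatch(int offset, int size) { return List.of(); }
    private void sendAll(List<CustomerProjection> customers) { }

    record CustomerProjection(long id, String email, String phone) {}
}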

Related

ServiceStack Register web service slow performance

We noticed some performance bottlenecks in ServiceStack web services, especially the ones that come out of the box, like the Register web service.
We ran a load test using Visual Studio Load Test with the following parameters:
1K concurrent users.
1-minute duration.
5-second think time between test iterations.
5-second sample rate.
The results are so bad that they are actually preventing us from going live with a customer:
19-second average response time.
Environment specs are:
2 web front ends (IIS) hosted in the AWS Europe region on c4.8xlarge EC2 instances (16 GB RAM & 8 vCPU) behind a publicly exposed load balancer.
MySQL database hosted in AWS RDS on a db.m4.4xlarge instance (64 GB RAM, high network traffic, 16 vCPU).
We don't have any special code or special global request filters, only the default configuration. We even tried connection pooling, but that didn't help much.
What could be the reason behind such slow performance? We'd appreciate your support, as we are at a point where the customer is questioning the ServiceStack framework itself, and we are starting to have doubts about it as well, even though we loved every aspect of it.
We got to the bottom of this. After hours of debugging and profiling the register/login web services, we found that the register code executes duplicate queries against the DB (the check-existing-user validation logic, etc.), which was even highlighted by Mini Profiler. But that still wasn't the reason for failing at only 1K concurrent users, which is a very low number compared to the environment specs we ran on.
The real reason was the following code being called in both register and login:
private static TUserAuth GetUserAuthByUserName(IDbConnection db, string userNameOrEmail)
{
    // Treat anything containing "@" as an email address
    var isEmail = userNameOrEmail.Contains("@");

    // Both branches call .ToLower() on the column, which OrmLite translates to SQL LOWER(...)
    var userAuth = isEmail
        ? db.Select<TUserAuth>(q => q.Email.ToLower() == userNameOrEmail.ToLower()).FirstOrDefault()
        : db.Select<TUserAuth>(q => q.UserName.ToLower() == userNameOrEmail.ToLower()).FirstOrDefault();
    return userAuth;
}
The calls to .ToLower() were translated into the SQL LOWER() function. When that runs concurrently against a table with hundreds of thousands of rows, it causes huge CPU spikes on the DB server, which was the source of all the bottlenecks.
The fix was as simple as adding dedicated lowercased username and email columns in the database, updating the UserAuth POCO to reflect them, adding an index on the new columns, and adjusting the OrmLite WHERE condition to use them.
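The same idea, sketched in plain JDBC rather than OrmLite (the user_auth table and email_lower column names are assumptions): lowercase the parameter in application code and compare it against a pre-lowercased, indexed column, so MySQL never has to evaluate LOWER() per row.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class UserAuthLookup {
    // Query the pre-lowercased, indexed column instead of LOWER(Email)
    static ResultSet findByEmail(Connection conn, String email) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM user_auth WHERE email_lower = ?");
        ps.setString(1, email.toLowerCase());
        return ps.executeQuery();
    }
}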

Better scaling Spring application using @Async for large number of database insert or update queries

I have a Spring REST controller whose sole purpose is to create or update a record every time the mobile client launches or boots the app. This URL is hit only when the user launches the app or when it comes to the foreground after a resume (i.e., when the user presses the device home button to go to something else and, after a while, taps the app icon to bring it back to the foreground from memory).
The expected number of requests for this URL is around 600 requests per minute.
To scale this application, is it better to put the database (MySQL) create/update logic of the Spring controller in a separate thread, or to use Spring's @Async feature?
The idea is that the request won't hold the system port for a very long time, and one machine can handle a large number of requests before my web server (GlassFish) pushes requests to the waiting queue.
Also, the expected table size, i.e. the number of records in this table, is around 10M-30M.
I personally wouldn't bother with an async call, at least to start with. Create a JMeter script, fire some load at it, and see how it performs.
If you start to see slowdown, using @Async with a ThreadPoolExecutor behind it (which you can easily configure) is certainly a valid option, as sketched below. With this type of thing, configuring the queue size and the number of threads (both for your thread pool executor and your web container) is a bit of a black art, which is where something like JMeter and a good profiling tool such as YourKit come into their own.
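A minimal sketch of that option, with @Async backed by an explicitly configured executor; the bean name and the pool/queue numbers are placeholders to be tuned under a JMeter run, not recommendations:

import java.util.concurrent.Executor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.stereotype.Service;

@Configuration
@EnableAsync
class AsyncConfig {

    @Bean("recordExecutor")
    public Executor recordExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);       // threads kept alive
        executor.setMaxPoolSize(20);        // upper bound under load
        executor.setQueueCapacity(500);     // pending create/update tasks before rejection
        executor.setThreadNamePrefix("record-");
        executor.initialize();
        return executor;
    }
}

@Service
class RecordService {

    @Async("recordExecutor")                // runs off the request thread
    public void createOrUpdate(long userId) {
        // MySQL insert/update goes here (repository call omitted)
    }
}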

Slick with Hikari doesn't use more connections when needed

I'm trying to understand how Slick with Hikari works. I've read a lot of documentation, but I have a use case whose behavior I don't understand.
I'm using Slick 3 with Hikari, with the default configuration. I already have a production app with ~1000 users connected concurrently. My app uses WebSockets, and when I deploy a new release all clients are reconnected. (I know it's not the best way to handle a deploy, but I don't have clustering at the moment.) When all these users reconnect, they all start running queries to get their user state (dog-pile effect). When that happens, Slick starts throwing a lot of errors like:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@4dbbd9d1 rejected from java.util.concurrent.ThreadPoolExecutor@a3b8495[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 23740]
What I think is happening is that the Slick queue for pending queries is full because it can't handle all the clients requesting information from the database. But if I look at the metrics that Dropwizard provides, I see the following:
Near 16:45 we see a deploy. Until the old instance is terminated, we can see that the number of connections goes from 20 to 40. I think that's normal, given how the deploy process is done.
But if Slick's query queue becomes full because of the dog-pile effect, why is it not using more than 3-5 connections when it has 20 connections available? The database is performing really well, so I think the bottleneck is in Slick.
Do you have any advice for improving this deploy process? I only have 1000 users now, but I'll have a lot more in a few weeks.
Based on the "rejected" exception, I think many slick actions were submitted to slick concurrently, which exceeded the default size(1000) of the queue embedded in slick.
So I think you should:
increase queue size(queueSize) to hold more unprocessed actions.
increase number of thread(numThreads) in slick to process more actions concurrently.
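Both settings live in the database block of the Typesafe config that Database.forConfig reads. A sketch of what that might look like, where the connection details are placeholders and the numbers are examples to tune rather than recommendations:

mydb {
  url = "jdbc:mysql://localhost:3306/app"   # placeholder
  driver = "com.mysql.jdbc.Driver"          # placeholder
  user = "app"
  password = "secret"
  numThreads = 20     # threads executing DB actions (Slick's default is 20)
  queueSize = 5000    # pending actions before RejectedExecutionException (default is 1000)
}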
You can get more tips here

MySQL jobs stuck in sidekiq queue

My Rails application takes a JSON blob of ~500 entries (from an API endpoint) and throws it into a Sidekiq/Redis background queue. The background job parses the blob, then loops through the entries to perform a basic Rails Model.find_or_initialize_by_field_and_field() and model.update_attributes().
If this job ran in the foreground, it would take a matter of seconds (if that long). Instead, I'm seeing these jobs remain in the Sidekiq queue for 8 hours. Obviously, something's not right.
I recently re-tuned the MySQL database to use 75% of available RAM as the buffer_pool_size and divided that amongst 3 buffer pools. I originally thought that might be part of the deadlock, but the load average on the box is still well below any problematic level (5 CPUs and a load of ~2.5). At this point I'm not convinced the DB is the problem, though of course I can't rule it out.
I'm sure, at this point, that I need to scale back the Sidekiq worker instances. In anticipation of the added load, I increased the concurrency to 300 per worker (I have 2 active workers on different servers). Under a relatively small amount of load the queues operate as expected; even the problematic jobs complete in ~1 minute. Although the Sidekiq documentation says >50 concurrent workers is a bad idea, I wasn't having any stability issues at 150 workers per instance. The problem has been this newly introduced job that performs ~500 MySQL finds and updates.
If this were a database timeout issue, the background job should have failed and been moved from the active (busy) queue to the failed queue. That's not the case; the jobs are just getting stuck in the queue.
What other MySQL or Rails/Sidekiq tuning parameters should I be examining to ensure these jobs succeed, fail, or properly time out?

Most efficient method of logging data to MySQL

We have a service that sees several hundred simultaneous connections throughout the day, peaking at about 2000, for about 3 million hits a day, and growing. With each request I need to log 4 or 5 pieces of data to MySQL. We originally used the logging that came with the app we were using; however, it was terribly inefficient, ran my DB server at >3x the average CPU load, and would eventually bring the server to its knees.
At this point we are going to add our own logging to the application (PHP). The only option I have for logging data is the MySQL DB, as this is the only common resource available to all of the HTTP servers. The traffic will be mostly writes; however, every day we generate reports based on the data, then crunch and archive the old data.
What recommendations can be made to ensure that I don't take down our services with logging data?
The solution we took was to create an archive table, then regularly (every 15 minutes, on an app server) crunch the data and put it into the tables used to generate reports. The archive table of course did not have any indices; the tables the reports are generated from have several.
Some stats on this approach:
Short Version: >360 times faster
Long Version:
The original code/model did direct inserts into the indexed table, and the average insert took .036 seconds. Using the new code/model, inserts took less than .0001 seconds (I was not able to get an accurate fix on the insert time; I had to measure 100,000 inserts and average to estimate a single insert). The post-processing (crunch) took an average of 12 seconds for several tens of thousands of records. Overall we were greatly pleased with this approach, and so far it has worked incredibly well for us.
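The shape of that setup, sketched in JDBC rather than the original PHP (the table and column names, and the auto-increment id on the archive table, are assumptions): requests do cheap appends into the unindexed archive table, and a 15-minute job folds them into the indexed report table.

import java.sql.*;

class RequestLogCrunch {

    // Hot path: a plain append into the unindexed archive table
    static void log(Connection conn, int userId, String action) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO log_archive (user_id, action, created_at) VALUES (?, ?, NOW())")) {
            ps.setInt(1, userId);
            ps.setString(2, action);
            ps.executeUpdate();
        }
    }

    // Every 15 minutes: aggregate into the indexed report table, then clear the rows consumed
    static void crunch(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            long maxId;   // bound the batch so rows logged mid-crunch are left for the next run
            try (ResultSet rs = st.executeQuery("SELECT COALESCE(MAX(id), 0) FROM log_archive")) {
                rs.next();
                maxId = rs.getLong(1);
            }
            st.executeUpdate(
                "INSERT INTO report_daily (day, action, hits) "
                + "SELECT DATE(created_at), action, COUNT(*) FROM log_archive "
                + "WHERE id <= " + maxId + " GROUP BY DATE(created_at), action "
                + "ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)");
            st.executeUpdate("DELETE FROM log_archive WHERE id <= " + maxId);
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}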
Based on what you describe, I recommend you leverage the fact that you don't need to read this data immediately and pursue a "periodic bulk commit" route. That is, buffer the logging data in RAM on the app servers and do periodic bulk commits. If you have multiple application nodes, some sort of randomized approach would help even more (e.g., commit updated info every 5 +/- 2 minutes).
The main drawback of this approach is that if an app server fails, you lose the buffered data. However, that's only bad if (a) you absolutely need all of the data and (b) your app servers crash regularly. There's a small chance both are true, but if they are, you can simply persist your buffer to local disk (temporarily) on the app server if that's really a concern.
The main idea is:
buffer the data in memory
do periodic bulk commits (leveraging some sort of randomization in a distributed system would help), as in the sketch below
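A minimal sketch of that buffering idea in JDBC (the table name, columns, and the 5 +/- 2 minute jitter are assumptions taken from the suggestion above): requests append to an in-memory queue, and a scheduled task flushes everything in one batched insert.

import java.sql.*;
import java.util.concurrent.*;

class BufferedLogger {

    private final ConcurrentLinkedQueue<String[]> buffer = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    BufferedLogger(Connection conn) {
        // Randomized delay (5 +/- 2 minutes) so multiple app nodes don't all commit at once
        long delayMs = TimeUnit.MINUTES.toMillis(5)
                + ThreadLocalRandom.current().nextLong(-2 * 60_000, 2 * 60_000);
        scheduler.scheduleWithFixedDelay(() -> flush(conn), delayMs, delayMs, TimeUnit.MILLISECONDS);
    }

    // Hot path: append to RAM only, no DB round trip per request
    void log(String userId, String action) {
        buffer.add(new String[] { userId, action });
    }

    // Periodic bulk commit: one batched insert for everything collected so far
    void flush(Connection conn) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO request_log (user_id, action) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            String[] row;
            while ((row = buffer.poll()) != null) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            e.printStackTrace();   // rows polled for this batch are lost on failure
        }
    }
}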
Another approach is to stop opening and closing connections if possible (e.g., keep longer-lived connections open). While that's likely a good first step, it may require a fair amount of work on a part of the system that you may not have control over. If you do control it, though, it's worth exploring.