My Rails application takes a JSON blob of ~500 entries (from an API endpoint), throws it into a sidekiq/ redis background queue. The background job parses the blob then loops through the entries to perform a basic Rails Model.find_or_initialize_by_field_and_field() and model.update_attributes().
If this job were in the foreground, it would take a matter of seconds (if that long). I'm seeing these jobs remain in the sidekiq queue for 8 hours. Obviously, something's not right.
I've recently re-tuned the MySQL database to use 75% of available RAM as the buffer_pool_size and divided that amongst 3 buffer pools. I originally thought that might be part of the deadlock but the load avg on the box is still well below any problematic status ( 5 CPU and a load of ~ 2.5 ) At this point, I'm not convinced the DB is the problem though, of course, I can't rule it out.
I'm sure, at this point, I need to scale back the sidekiq worker instances. In anticipation of the added load I increased the concurrency to 300 per worker (I have 2 active workers on different servers.) Under a relatively small amount of load there queues operate as expected; even the problematic jobs are completed in ~1 minute. Though, per the sidekiq documentation >50 concurrent workers is a bad idea. I wasn't having any stability issues at 150 workers per instance. The problem has been this newly introduced job that performs ~500 MySQL finds and updates.
If this were a database timeout issue, the background job should have failed and been moved from the active (busy) queue to the failed queue. That's not the case. They're just getting stuck in the queue.
What other either MySQL or Rails/ sidekiq tuning parameters should I be examining to ensure these jobs succeed, fail, or properly time out?
Related
I am running MySQL 5.7.24 on AWS RDS, I have an InnoDB type table and working fine with normal traffic but when send push notification to 50k user the problem happend.
server features 32 GB RAM, 8vCPU, and my AWS RDS server is db.m5.2xlarge.
the wait/synch/sxlock/innodb/btr_search_latch take resources greater than wait/io/table/sql/handler like below image
the innodb_adaptive_hash_index enabled now
You're trying to send 50,000 push notifications in five minutes?
50,000 / 300 seconds means you're pushing 167 notifications per second and I assume then updating the database to record the result of the push. Probably you are doing this in many concurrent threads so you can do the pushes in parallel.
Have you considered doing these push notifications more gradually, like over 10 or 15 minutes?
Or updating the database in batches?
Or using fewer threads to avoid the high contention on the database?
I used to work for SchoolMessenger, a company that provide notification services for the majority of public schools in the USA. We sent millions of notifications, SMS messages, and phone calls every day. The way we did it was to have a very complex Java application queue up the notifications, and then post them gradually. Then as the results of the pushes came in, these also queued up, and updated the database gradually.
We used MySQL, but we also used it together with ActiveMQ as a persistent queue. Push all the tasks to be done into the queue, then a pool of worker threads would act on the tasks, and push the results back into another queue. Then a result-reading thread would read batches of results from the queue and update the database in bulk updates.
When you are designing a back-end system to do large-scale work to do, you have to think of new ways to architect your application to avoid choke-points.
As a database performance and scaling consultant, I have observed this rule many times:
For every 10x growth in data or traffic, you should reevaluate your software architecture. You may have to redesign some parts of it to work at the larger scale.
I have a database with around a million job records that are executed in a queue. Everyday, about 50 000 new jobs are created and executed. I start dozens of aws spot instances and jobs are executed with supervisor workers (6 at a time).
The total database connection is 260 at the same time and the rds mysql instance type is t2.medium. When there are no jobs to do or if jobs are not created yet, workers will exit after a few seconds and a a new workers will check again if jobs are available and so on.
I noticed that there is a major slowdown when they all connect at the same time i.e. A query can take 8 seconds instead of 20ms. Then as soon as all instances are connected to the DB, everything seems fine again. So the questions is, how do I handle this so that the database is always super fast?
Should I try to start the workers not at the same time and add random sleeps before workers exit?
Also since there is a lot of Read/Write operations, should I scale my database with load balancers and read repblicas?
The slowdown is probably caused by the instance type that you are using. T2 family instances are burstable, which means that under sustained load for long periods of time, the CPU credits balance will drain and the instance will become slower.
You should definitely look to upgrade your instance type to another class (e.g. m5 family) so the instance's performance will be stable under sustained loads.
I'm trying to understand how Slick-Hikari works, I've read a lot of documentation but I've a use case whose behavior I don't understand.
I'm using Slick 3 with Hikari, with the default configuration. I already have a production app with ~1000 users connected concurrently. My app works with websockets and when I deploy a new release all clients are reconnected. (I know it's not the best way to handle a deploy but I don't have clustering at the moment.) When all these users reconnect, they all starts doing queries to get their user state (dog-pile effect). When it happens Slick starts to throw a lot of errors like:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2#4dbbd9d1 rejected from java.util.concurrent.ThreadPoolExecutor#a3b8495[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 23740]
What I think it's happening is that the slick queue for pending queries is full because it can't handle all the clients requesting information from the database. But if I see the metrics that Dropwizard provides me I see the following:
Near 16:45 we se a deploy. Until old instance is terminated we can see that the number of connections goes from 20 to 40. I think that's normal, given how the deploy process is done.
But, if the query queue of Slick becomes full because of the dog-pile effect, why is it not using more than 3-5 connections if it has 20 connections available? The database is performing really well, so I think the bottleneck is in Slick.
Do you have any advice for improving this deploy process? I have only 1000 users now, but I'll have a lot more in few weeks.
Based on the "rejected" exception, I think many slick actions were submitted to slick concurrently, which exceeded the default size(1000) of the queue embedded in slick.
So I think you should:
increase queue size(queueSize) to hold more unprocessed actions.
increase number of thread(numThreads) in slick to process more actions concurrently.
You can get more tips here
Recently I have started getting mySQL "too many connection" errors at times of high traffic. My rails app runs on a mongrel cluster with 2 instances on a shared host. Some recent changes that might be driving it:
Traffic to my site has increased. I
am now averaging about 4K pages a
day.
Database size has increased. My largest table has ~ 100K rows.
Some associations could return
several hundred instances in the
worst case, though most are far less.
I have added some features that
increased the number and size of
database calls in some actions.
I have done a code review to reduce database calls, optimize SQL queries, add missing indexes, and use :include for eager loading. However, many of my methods still make 5-10 separate SQL calls. Most of my actions have a response time of around 100ms, but one of my most common actions averages 300-400ms, and some actions randomly peak at over 1000ms.
The logs are of little help, as the errors seem to occur randomly, or at least the pattern does not appear related to the actions being called or data being accessed.
Could I alleviate the error by adding additional mongrel instances? Or are the mySQL connections limited by the server, and thus unrelated to the number of processes I divide my traffic across?
Is this most likely a problem with my coding, or should I be pressing my host for more capacity/less load on the shared server?
ActiveRecord has pooled database connections since Rails 2.2, and it's likely that that's what's causing your excess connections here. Try turning down the value of pool in your database.yml for that environment (it defaults to 5).
Docs can be found here.
Are you caching anything? It's an important part of alleviating application and database load. The Rails Guides have a section on caching.
Something is wrong. A Mongrel instance processes 1 request at a time so if you have 2 Mongrel instances then you should not be seeing more than 2 active MySQL connections (from the mongrels at least)
You could log or graph the output of SHOW STATUS LIKE 'Threads_connected' over time.
PS: this is not very many Mongrels. if you want to be able to service more than 2 simultaneous requests then you'll want more. ...if memory is tight, you can switch to Phusion Passenger and REE.
We have a service which sees several hundred simultaneous connections throughout the day, peeking at about 2000, for about 3 million hits a day, and growing. With each request I need to log 4 or 5 pieces of data to MySQL, we originally used the logging that came with the app were using however it was terribly inefficient and would run my db server at >3x the average cpu load, and would eventually bring the server to it knees.
At this point we are going to add our own logging to the application (php), the only option I have for logging data is the MySQL db, as this is the only common resource available to all of the http servers. This data will be mostly writes however everyday we generate reports based on the data, then crunch and archive the old data.
What recommendations can be made to ensure that I don't take down our services with logging data?
The solution we took with this problem was to create an archive table then regularly ( every 15 minutes, on an app server) crunch the data and put it back into the tables that were used to generate reports. The archive table of course did not have any indices, the tables which the reports are generated from have several indices.
Some stats on this approach:
Short Version: >360 times faster
Long Version:
The original code/model did direct inserts into the indexed table, and the average insert took .036 seconds, using the new code/model inserts took less than .0001 seconds (I was not able to get an accurate fix on the insert time I had to measure 100,000 inserts and average for the insert time). The post-processing (crunch) took an average 12 seconds for several tens-of-thousands records. Overall we were greatly pleased with this approach and so far it has worked incredibly well for us.
Based on what you describe, I recommend you try to leverage the fact that you don't need to read this data immediately and pursue a "periodic bulk commit route". That is, buffer the logging data in RAM on the app servers and doing periodic bulk commits. If you have multiple application nodes, some sort of randomized approach would help even more (e.g., commit updated info every 5 +/- 2 minutes).
The main drawback with this approach is that if an app server fails, you lose the buffered data. However, that's only bad if (a) you absolutely need all of the data and (b) your app servers crash regularly. Small chance that both are true, but in the event they are, you can simply persist your buffer to local disk (temporarily) on an app server if that's really a concern.
The main idea is:
buffering the data
periodic bulk commits (leveraging some sort of randomization in a distributed system would help)
Another approach is to stop opening and closing connections if possible (e.g., keep longer lived connections open). While that's likely a good first step, it may require a fair amount of work on your part on a part of the system that you may not have control over. But if you do, it's worth exploring.