I have a database with around a million job records that are executed from a queue. Every day, about 50,000 new jobs are created and executed. I start dozens of AWS spot instances, and jobs are executed by Supervisor workers (6 at a time).
There are 260 database connections open at the same time, and the RDS MySQL instance type is t2.medium. When there are no jobs to do, or the jobs have not been created yet, a worker exits after a few seconds and a new worker checks again whether jobs are available, and so on.
I noticed that there is a major slowdown when they all connect at the same time: a query can take 8 seconds instead of 20 ms. Then, as soon as all instances are connected to the DB, everything seems fine again. So the question is: how do I keep the database consistently fast?
Should I stagger the workers' start times and add random sleeps before they exit?
Also, since there are a lot of read/write operations, should I scale my database with load balancers and read replicas?
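The staggered-start idea from the question could be sketched like this (a minimal Python sketch; `fetch_job` and `handle_job` are hypothetical stand-ins for the real queue logic):

```python
import random
import time

def run_worker(fetch_job, handle_job, max_jitter=5.0, idle_exit_after=3):
    """Worker loop with randomized sleeps, so a fleet of workers does not
    hit the database in lockstep. fetch_job and handle_job are hypothetical
    stand-ins for the real queue logic."""
    # Stagger startup: dozens of instances won't all connect at the same instant.
    time.sleep(random.uniform(0, max_jitter))
    idle_checks = 0
    while idle_checks < idle_exit_after:
        job = fetch_job()
        if job is None:
            idle_checks += 1
            # Randomized backoff before polling again.
            time.sleep(random.uniform(0, max_jitter))
        else:
            idle_checks = 0
            handle_job(job)
    # Worker exits here; the supervisor is assumed to start a fresh one later.
```

With each instance sleeping a different random amount before connecting and before each poll, the connection spikes spread out instead of landing on the database simultaneously.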
The slowdown is probably caused by the instance type you are using. T2-family instances are burstable, which means that under sustained load for long periods, the CPU credit balance drains and the instance is throttled.
You should definitely look at upgrading to another instance class (e.g., the m5 family) so that performance stays stable under sustained load.
I am running MySQL 5.7.24 on AWS RDS with an InnoDB table. It works fine under normal traffic, but the problem happens when I send push notifications to 50k users.
The server has 32 GB RAM and 8 vCPUs; the RDS instance type is db.m5.2xlarge.
The wait/synch/sxlock/innodb/btr_search_latch wait event consumes far more resources than wait/io/table/sql/handler, according to the Performance Schema statistics.
innodb_adaptive_hash_index is currently enabled.
You're trying to send 50,000 push notifications in five minutes?
50,000 / 300 seconds means you're pushing about 167 notifications per second, and I assume you're then updating the database to record the result of each push. You are probably doing this in many concurrent threads so the pushes run in parallel.
Have you considered doing these push notifications more gradually, like over 10 or 15 minutes?
Or updating the database in batches?
Or using fewer threads to avoid the high contention on the database?
I used to work for SchoolMessenger, a company that provides notification services for the majority of public schools in the USA. We sent millions of notifications, SMS messages, and phone calls every day. The way we did it was to have a (fairly complex) Java application queue up the notifications and then post them gradually. As the results of the pushes came in, they were also queued and used to update the database gradually.
We used MySQL, but we also used it together with ActiveMQ as a persistent queue. Push all the tasks to be done into the queue, then a pool of worker threads would act on the tasks, and push the results back into another queue. Then a result-reading thread would read batches of results from the queue and update the database in bulk updates.
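In Python terms, a rough sketch of that pattern might look like this (in-process queues stand in for ActiveMQ; `send_push` and `write_batch` are hypothetical):

```python
import queue
import threading

def process_notifications(tasks, send_push, write_batch, workers=4, batch_size=100):
    """Queue-in / queue-out sketch: a pool of threads sends pushes in parallel,
    results collect in a second queue, and a single reader writes them to the
    database in bulk batches."""
    task_q, result_q = queue.Queue(), queue.Queue()
    for t in tasks:
        task_q.put(t)

    def worker():
        while True:
            try:
                task = task_q.get_nowait()
            except queue.Empty:
                return
            result_q.put(send_push(task))  # result of the push, e.g. a status row

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    # Single reader: drain results and update the DB in bulk, not row-by-row.
    batch = []
    while not result_q.empty():
        batch.append(result_q.get())
        if len(batch) >= batch_size:
            write_batch(batch)
            batch = []
    if batch:
        write_batch(batch)
```

The key point is that only one reader touches the database, and it does so in batches, so the row-by-row contention from ~167 concurrent updates per second disappears.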
When you are designing a back-end system to do large-scale work, you have to think of new ways to architect your application to avoid choke points.
As a database performance and scaling consultant, I have observed this rule many times:
For every 10x growth in data or traffic, you should reevaluate your software architecture. You may have to redesign some parts of it to work at the larger scale.
I've set up a single VPC on AWS and launched 6 MySQL databases in it, and for each one I've created a read replica, so that I can always run queries quickly against the read replicas.
Most of the day, my writer instances (the original instances) are fully loaded, with CPU mostly at 99%. The read replicas, however, show only about 7-10% CPU usage, yet sometimes a service connecting to a read replica fails with a "Too many connections" error.
I'm not that expert with AWS, but is this happening because the writer instances are fully loaded and they're on the same VPC?
Is this happening because the writer instances are fully loaded and they're on the same VPC?
No, it isn't. This is unrelated to replication. The replica counts as exactly one connection on the master, and replication does not consume any client connections on the replica itself. The intensity of the replication workload has no impact on the replica's connection count.
This error simply means you have more clients connecting to the replica than the parameter group for your RDS instance type allows. Use SELECT @@MAX_CONNECTIONS; to see what this limit is, use SHOW STATUS LIKE 'Threads_connected'; to see how many connections currently exist, and use SHOW PROCESSLIST; (as the administrative user, or any user holding the PROCESS privilege) to see what all of these connections are doing.
If many of them show Sleep and have large values in Time (seconds spent in the current state), then the problem is that your application is abandoning connections rather than closing them properly after use or when they are no longer needed.
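As an illustration, a small Python sketch that flags likely-abandoned connections from SHOW PROCESSLIST output (rows modeled as dicts; the 300-second threshold is an arbitrary assumption):

```python
def find_abandoned(processlist_rows, max_sleep_seconds=300):
    """Given SHOW PROCESSLIST rows (dicts with 'Command' and 'Time' keys),
    return those that look abandoned: sleeping far longer than any sane
    connection-pool idle timeout would allow."""
    return [
        row for row in processlist_rows
        if row["Command"] == "Sleep" and row["Time"] > max_sleep_seconds
    ]
```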
My Rails application takes a JSON blob of ~500 entries (from an API endpoint) and throws it into a Sidekiq/Redis background queue. The background job parses the blob, then loops through the entries to perform a basic Model.find_or_initialize_by_field_and_field() and model.update_attributes().
If this job ran in the foreground, it would take a matter of seconds (if that long). I'm seeing these jobs sit in the Sidekiq queue for 8 hours. Obviously, something's not right.
I recently re-tuned the MySQL database to use 75% of available RAM as the buffer pool size, divided among 3 buffer pool instances. I originally thought that might be part of the deadlock, but the load average on the box is still well below any problematic level (5 CPUs and a load of ~2.5). At this point I'm not convinced the DB is the problem, though of course I can't rule it out.
I'm sure, at this point, that I need to scale back the Sidekiq worker instances. In anticipation of the added load, I had increased the concurrency to 300 per worker (I have 2 active workers on different servers). Under a relatively small load the queues operate as expected; even the problematic jobs complete in ~1 minute. However, per the Sidekiq documentation, more than 50 concurrent workers is a bad idea. I wasn't having any stability issues at 150 workers per instance. The problem has been this newly introduced job that performs ~500 MySQL finds and updates.
If this were a database timeout issue, the background jobs should have failed and been moved from the active (busy) queue to the failed queue. That's not the case; they're just getting stuck in the queue.
What other MySQL or Rails/Sidekiq tuning parameters should I examine to ensure these jobs succeed, fail, or properly time out?
I have an application server (WildFly) connected to a database (MySQL) that needs to handle requests from 10,000 players simultaneously.
Clients send their positions to the server using a call like : https://mytestserver.utd/update?x=7&y=8
To handle that quickly and store this information both in memory and in the database, I wonder what the best approach is. I came up with the following solutions:
Each time a call from a client is made, update the in-memory values, establish a DB connection, and store the new location.
Each time a call from a client is made, update the in-memory values and queue a task to update the values in the DB (using a ThreadPoolExecutor).
Each time a call from a client is made, only update the in-memory values. Then, every 5 to 10 seconds, a worker thread stores everything that's in memory.
Option 1 is not really an option, as it's very slow and resource-intensive.
Option 2 looks OK to me, but it's more complicated and consumes more resources (in terms of database connections) than Option 3.
Option 3 seems the best solution to me, as it uses only one DB connection, and updating the information in memory is very fast. Also, if the server crashes, I would only lose 5 to 10 seconds of game state, which is acceptable in my case.
Am I heading in the right direction with Option 3, or do you see a pitfall there?
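For what it's worth, Option 3 can be sketched in a few lines (Python used purely for illustration; `write_snapshot` is a hypothetical stand-in for the single-connection DB write):

```python
import threading

class PositionStore:
    """Option 3 sketch: client updates go to memory only; a periodic flush
    writes a snapshot through a single DB connection (write_snapshot is a
    hypothetical stand-in for that write)."""

    def __init__(self, write_snapshot, interval=5.0):
        self._positions = {}
        self._lock = threading.Lock()
        self._write_snapshot = write_snapshot
        self._interval = interval

    def update(self, player_id, x, y):
        # Hot path: no DB work, just a dict write under a lock.
        with self._lock:
            self._positions[player_id] = (x, y)

    def flush(self):
        # Copy under the lock, then persist outside it so updates aren't blocked.
        with self._lock:
            snapshot = dict(self._positions)
        self._write_snapshot(snapshot)

    def start(self):
        # Fire flush() every `interval` seconds on a background timer.
        def tick():
            self.flush()
            t = threading.Timer(self._interval, tick)
            t.daemon = True
            t.start()
        tick()
```

Note that flush() copies the map under the lock and persists outside it, so the hot update path is never blocked by the database write.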
We have a service that sees several hundred simultaneous connections throughout the day, peaking at about 2,000, for about 3 million hits a day and growing. With each request I need to log 4 or 5 pieces of data to MySQL. We originally used the logging that came with the app we were using, but it was terribly inefficient: it ran the DB server at more than 3x the average CPU load and would eventually bring the server to its knees.
At this point we are going to add our own logging to the application (PHP). The only option I have for storing the log data is the MySQL DB, as this is the only common resource available to all of the HTTP servers. The workload will be mostly writes, but every day we generate reports from the data, then crunch and archive the old data.
What recommendations can be made to ensure that I don't take down our services with logging data?
The solution we chose was to create an archive table, then regularly (every 15 minutes, on an app server) crunch the data and move it into the tables used to generate reports. The archive table, of course, has no indexes; the tables the reports are generated from have several.
Some stats on this approach:
Short Version: >360 times faster
Long Version:
The original code/model did direct inserts into the indexed table, and the average insert took 0.036 seconds; with the new code/model, inserts took less than 0.0001 seconds (I could not get an accurate fix on the insert time, so I measured 100,000 inserts and averaged). The post-processing (crunch) took an average of 12 seconds for several tens of thousands of records. Overall, we were greatly pleased with this approach, and so far it has worked incredibly well for us.
Based on what you describe, I recommend leveraging the fact that you don't need to read this data immediately and pursuing a "periodic bulk commit" route: buffer the logging data in RAM on the app servers and do periodic bulk commits. If you have multiple application nodes, some randomization helps even more (e.g., commit the buffered data every 5 +/- 2 minutes).
The main drawback of this approach is that if an app server fails, you lose the buffered data. However, that's only bad if (a) you absolutely need all of the data and (b) your app servers crash regularly. There is only a small chance that both are true, but if they are, you can persist the buffer to local disk (temporarily) on the app server.
The main idea is:
buffering the data
periodic bulk commits (leveraging some sort of randomization in a distributed system would help)
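A minimal sketch of this buffer-and-flush idea (Python for illustration; `bulk_insert` stands in for a multi-row INSERT, and the jittered interval mirrors the "5 +/- 2 minutes" suggestion above):

```python
import random
import time

class LogBuffer:
    """Buffer log rows in memory and flush them as one batch when a
    jittered interval elapses. bulk_insert is a hypothetical stand-in
    for a single multi-row INSERT."""

    def __init__(self, bulk_insert, base_interval=300.0, jitter=120.0):
        self._bulk_insert = bulk_insert
        self._base, self._jitter = base_interval, jitter
        self._rows = []
        self._deadline = time.monotonic() + self._next_interval()

    def _next_interval(self):
        # 5 minutes +/- 2 minutes by default, so app servers don't all
        # commit to the database at the same moment.
        return self._base + random.uniform(-self._jitter, self._jitter)

    def log(self, row):
        self._rows.append(row)
        if time.monotonic() >= self._deadline:
            self.flush()

    def flush(self):
        if self._rows:
            self._bulk_insert(self._rows)  # one bulk write instead of N inserts
            self._rows = []
        self._deadline = time.monotonic() + self._next_interval()
```

On app-server shutdown you would call flush() explicitly so the remaining buffered rows are not lost.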
Another approach is to stop opening and closing connections where possible (e.g., keep longer-lived connections open). While that's likely a good first step, it may require a fair amount of work on a part of the system that you may not control. If you do control it, it's worth exploring.