Performance issue (Nginx, NodeJs, Mysql) - mysql

I have the following problem.
Using REST, I am getting binary content (BLOBs) from a MySql database via a NodeJS Express app.
All works fine, but I am having issues scaling the solution.
I increased the number of NodeJS instances to 3: they are running on ports 4000, 4001, and 4002.
On the same machine I have Nginx installed and configured to load-balance across my 3 instances.
I am using Apache Bench to do some perf testing.
Please see attached pic.
Assuming I have a dummy GET REST that goes to the db, reads the blob (roughly 600KB in size) and returns it back (all http), I am making 300 simultaneous calls. I would have thought that using nginx to distribute the requests would make it faster, but it does not.
Why is this happening?
I am assuming it has to do with MySql?
My NodeJs app is using a connection pool with a limit set to 100 connections. What should be the relation between this value and the max connection value in Mysql? If I increase the connection pool to a higher number of connections, I get worse results.
Any suggestion on how to scale?
Thanks!

"300 simultaneous" is folly. No one (today) has the resources to effectively do more than a few dozen of anything.
4 CPU cores -- If you go much beyond 4 threads, they will be stumbling over each other trying to get CPU time.
1 network -- Have you checked to see whether your big blobs are using all the bandwidth, thereby being the bottleneck?
1 I/O channel -- Again, lots of data could be filling up the pathway to disk.
(This math is not quite right, but it makes a point...) You cannot effectively run any faster than what you can get from 4+1+1 "simultaneous" connections. (In reality, you may be able to, but not 300!)
The typical benchmarks try to find how many "connections" (or whatever) leads to the system keeling over. Those hard-to-read screenshots say about 7 per second is the limit.
I also quibble with the word "simultaneous". The only thing close to "simultaneous" (in your system) is the ability to use 4 cores "simultaneously". Every other metric involves sharing of resources. Based on what you say, ...
If you start about 7 each second, some resource will be topped out, but each request will be fast (perhaps less than a second)
If you start 300 all at once, they will stumble over each other, some of them taking perhaps minutes to finish.
There are two interesting metrics:
How many per second you can sustain. (Perhaps 7/sec)
How long the average (and, perhaps, the 95th percentile) takes.
Try 10 "simultaneous" connections and report back. Try 7. Try some other small numbers like those.
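To see why starting 300 at once hurts latency without improving throughput, here is a rough sketch in Python. It assumes a deliberately simplified model that is not from the original post: the server finishes requests at a fixed rate (the ~7/sec read off the screenshots) no matter how many are in flight. Real contention usually makes the large-burst case worse than this.

```python
def burst_latency(n_requests, rate_per_sec):
    """Latency when n_requests all start at t=0 and the server
    completes them one after another at a fixed rate (assumed model)."""
    # The i-th request to finish completes at (i + 1) / rate seconds.
    finish_times = [(i + 1) / rate_per_sec for i in range(n_requests)]
    average = sum(finish_times) / n_requests
    worst = finish_times[-1]
    return average, worst


if __name__ == "__main__":
    for n in (7, 10, 300):
        avg, worst = burst_latency(n, rate_per_sec=7)
        print(f"{n:>3} at once: average {avg:.1f}s, last one done after {worst:.1f}s")
```

In this model the throughput is 7/sec in every case. Only the waiting time changes: with 7 at once the last request finishes after about a second, while with 300 at once the last one waits roughly 43 seconds.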

Related

Having a hard time to correctly set the RAILS_MAX_THREADS based on the max number of allowed connections

I'm having a hard time understanding the math that I have to do to find the correct number of RAILS_MAX_THREADS based on my infrastructure.
I'm using multi containers to host one copy of my API that accepts HTTP requests and one copy of my API that runs sidekiq (job processing). The database that I'm using has a max_connections of 45. With that being said, what should be the number of RAILS_MAX_THREADS? I'm using 9 for RAILS_MAX_THREADS AND WEB_CONCURRENCY. I read a few articles about it but I haven't been able to fully wrap my head around it.
Heroku's docs on puma sizing are some of the best even if you aren't using heroku.
https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server
Where they say "dyno", you can just read "host", "virtual machine", or "container" for a non-heroku deploy.
If you are using 9 for RAILS_MAX_THREADS and WEB_CONCURRENCY, and your heroku config is set to use these settings in the normal way -- then every host will have 9 puma workers running (WEB_CONCURRENCY), and each worker will be running 9 threads (RAILS_MAX_THREADS), for a total of 9*9=81 threads.
You really need enough database connections for each thread, so you are already over your 45 database connections by almost 2x. And that's only on ONE container -- if you are running multiple containers, each with these settings, then multiply the 81 by the number of containers -- so that's far too many for your database connections!
So if you are unable to change the max database connections, that is a hard limit and you need to reduce your numbers.
Otherwise, the main limiting factor is how much RAM you have available in each container, and how many vCPUs. Ideally you'd run at least as many workers on a container (WEB_CONCURRENCY) as you have vCPUs -- if you have enough RAM to do it. Workers take a lot of RAM. There is usually no reason to run MORE workers than vCPUs, so whether 9 makes sense or is larger than needed depends on your infrastructure.
How many threads per worker (RAILS_MAX_THREADS) is optimal can depend on exactly what your app is doing, but as a good rule of thumb you can start at 5. 9 is probably more than is useful, generally.
So I'd try RAILS_MAX_THREADS of 3-5. Then use as much WEB_CONCURRENCY as you can without running out of RAM (to see how much RAM the app takes, you might need to leave it up under load for a while). All of this holds as long as containers * RAILS_MAX_THREADS * WEB_CONCURRENCY is less than your database max connections -- if it's not, either reduce your values until it is, or increase your database max connections.
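The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the sketch only counts Puma threads: the sidekiq container from the question opens its own connections, which would have to be subtracted from the limit first.

```python
def total_db_connections(containers, web_concurrency, max_threads):
    """Worst-case connections if every Puma thread holds one."""
    return containers * web_concurrency * max_threads


def max_workers_per_container(db_max_connections, containers, max_threads):
    """Largest WEB_CONCURRENCY that keeps the total strictly below
    the database limit (RAM and vCPU limits are not modeled)."""
    return (db_max_connections - 1) // (containers * max_threads)


if __name__ == "__main__":
    # Settings from the question: 9 workers x 9 threads on one container.
    print(total_db_connections(containers=1, web_concurrency=9, max_threads=9))
    # With 5 threads per worker and a limit of 45 connections:
    print(max_workers_per_container(db_max_connections=45, containers=1, max_threads=5))
```

The result from max_workers_per_container is only the database-side ceiling. The RAM and vCPU considerations above may call for a lower WEB_CONCURRENCY.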

AWS RDS MySQL performance drop after random timespan

QUESTION OUTLINE
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage. (see below the question update for detailed problem description)
Question Update
So after more than one month of investigating and some developer support by AWS, I am not exactly closer to a solution.
Here are a couple of steps which I checked off the list, more or less without any further hint of the problem:
Index / Fragmentation (all tables have correct indexes/keys and have no fragmentation)
MySQL Stats Update (manually updating stats source)
Thread Concurrency (changing innodb_thread_concurrency to various different parameters)
Query Cache Hit Ratio doesn't show problems
EXPLAIN to see if any SELECTs are actually slow or not using indexes/keys
SLOW QUERY LOG (returns no results because, as explained in the paragraph below, it's a number of prepared SELECTs)
RDS and EC2 are within one VPC
To explain: the PlayFramework version in use (2.3.8) comes with BoneCP, and we are using eBeans to select our data. So basically I am running through a nested object and all of its child objects, which produces a couple of hundred prepared SELECTs for the API call in question. This should basically also be fine for the hardware in use; neither CPU nor RAM is heavily used by these operations.
I also included NewRelic for more insights on this issue and did some JVM profiling. Obviously, most of the time is consumed by NETTY/eBeans?
Is anyone able to make sense of this?
ORIGINAL QUESTION: Problem Outline
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage.
Infrastructure
We run a PlayFramework backend for a mobile app on AWS EC2 instances, connected to AWS RDS MySQL instances, one PROD environment, one DEV environment. Usually the PROD EC2 instance is pointing to the PROD RDS instance, and the DEV EC2 points to the DEV RDS (hi from captain obvious!); however sometimes we also let the DEV EC2 point to the PROD DB for some testing purposes. The PlayFramework in use is working with BoneCP.
Detailed Problem Description
In a quite essential sync process, our app is making a certain API call many times a day per user. I discussed the backgrounds of the functionality in this SO question, where, thanks to comments, I could nail the problem down to be a MySQL issue of some kind.
In short, the API call is loading a set of data, the maximum is about 1MB of json data, which currently takes about 18s to load. When things are running perfectly fine, this takes about 4s to load.
Curious enough, what "solved" the problem last time was upgrading the RDS instance to another instance type (from db.m3.large to db.m4.large, which is just a very marginal step). Now, after about 2-3 weeks, the RDS instance is once again performing slow as before. Rebooting the RDS instance showed no effect. Also re-launching the EC2 instance shows no effect.
I also checked whether the indices of the affected MySQL tables are set properly, which is the case. The API call itself is not eager-loading any BLOB fields or similar; I double-checked this. The CPU usage of the RDS instances is below 1% most of the time. When I stress tested it with 100 simultaneous API calls, it went to ~5%, so this is not the bottleneck. Memory is fine too, so I guess the RDS instance doesn't start swapping, which could slow down the whole process.
As hard evidence, a (smaller) public API call on the DEV environment currently takes 2.30s to load; on the PROD environment it takes 4.86s. This is interesting, because the DEV environment has a much smaller instance type for both EC2 and RDS. So basically the turtle wins the race here. (If you are interested in this API call I am happy to share it with you via PM, but I don't really want to post links to API calls, even if they are basically public.)
Conclusion
Concluding, it feels (I wittingly say 'feels') like the DB is clogged after x days of usage / after a certain amount of API calls. Not sure if this is an RDS-specific issue; once I 'largely' reset the DB instance by changing the instance type, things run fast and smooth. But re-creating my DB instance from a snapshot every 2 weeks is not an option, especially if I don't understand why this is happening.
Do you have any ideas what further steps I could take to investigate this matter?
(Too long for just a comment) I know you have checked a lot of things, but I would like to look at them with a different set of eyes...
Please provide
SHOW VARIABLES; (probably need post.it or something, due to size)
SHOW GLOBAL STATUS;
how much RAM? Sounds like 7.5G
The query. -- Unclear what query/queries you are using
SHOW CREATE TABLE for the table(s) in the query -- indexes, datatypes, etc
(Some of the above may help with "clogging over time" question.)
Meanwhile, here are some guesses/questions/etc...
Some other customer sharing the hardware is busy.
It could be a network problem?
Shrink long_query_time to 1 so you can catch slow queries.
When are backups done on your instance?
4s-18s to load a megabyte -- what percentage of that is SQL statements?
Do you "batch" the inserts? Is it a single transaction? Are lengthy queries going on at the same time?
What, if any, MySQL tunables did you change from the AWS defaults?
6GB buffer_pool on a 7.5GB partition? That sounds dangerously tight. Can you see if there was any swapping?
Any PARTITIONing involved? (Of course the CREATE will answer that.)
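One of the questions above, what percentage of the 4s-18s is spent in SQL statements, can be answered with a small timing wrapper. Below is a minimal sketch in Python; the application in question is Java/Play, so treat it as an outline of the idea only. SqlTimer and sql_share are hypothetical names, and time.sleep stands in for a real query.

```python
import time
from contextlib import contextmanager


class SqlTimer:
    """Accumulates wall-clock time spent inside database calls."""

    def __init__(self):
        self.sql_seconds = 0.0
        self.statements = 0

    @contextmanager
    def track(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Count the statement even if it raised.
            self.sql_seconds += time.perf_counter() - start
            self.statements += 1


def sql_share(timer, total_seconds):
    """Fraction of the whole request that was spent in SQL."""
    return timer.sql_seconds / total_seconds if total_seconds > 0 else 0.0


if __name__ == "__main__":
    timer = SqlTimer()
    request_start = time.perf_counter()
    for _ in range(5):
        with timer.track():
            time.sleep(0.01)   # stand-in for one prepared SELECT
    time.sleep(0.05)           # stand-in for JSON serialization etc.
    total = time.perf_counter() - request_start
    print(f"{timer.statements} statements, {sql_share(timer, total):.0%} of time in SQL")
```

If the share stays small even when the call takes 18s, the slowdown is more likely in the application or the network than in MySQL itself.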
There is one very important bit of information missing from your description: the total allocated space for the database. Baseline I/O for RDS is around 3 IOPS per GB of allocated space, so for a 100GB allocation, you should get around 300 IOPS. That allocated space also includes logs.
Since you don't really know what's going on, the first step should be to turn on detailed monitoring, which will give you more idea of what is happening on the instance.
Until you have additional stats gathered during a slowdown, you can try increasing the allocated space, which will increase the IOPS available.
Also, check the events for the db - are logs getting purged on a regular basis? That might indicate that there's not enough space.
Finally, you can try going with PIOPS (provisioned IOPS) if you have an idea of what the application needs, though at this point it sounds like that would be a guess.
Maybe your burst credit balance is (slowly) being depleted? Eventually you end up with baseline performance, which may appear "too slow".
This would also explain why the upgrade to another instance type did help, as you then start with a full burst balance again.
I would suggest increasing the size of the volume, even if you don't need the extra space, as the baseline performance grows linearly with volume size.
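The burst-credit idea can be put into rough numbers. The sketch below is in Python and assumes the instance uses general-purpose (gp2) storage with the commonly documented figures: 3 IOPS per GiB baseline with a floor of 100, bursts up to 3000 IOPS, and an initial bucket of 5.4 million I/O credits. None of these values come from the question, so check them against your actual storage type.

```python
# Assumed gp2 figures; verify against the storage type actually in use.
BASELINE_IOPS_PER_GIB = 3
MIN_BASELINE_IOPS = 100
BURST_IOPS = 3000
INITIAL_CREDITS = 5_400_000


def baseline_iops(volume_gib):
    """Baseline grows linearly with volume size, with a floor."""
    return max(MIN_BASELINE_IOPS, BASELINE_IOPS_PER_GIB * volume_gib)


def seconds_until_depleted(volume_gib, sustained_iops):
    """Time until a full credit bucket is empty under a steady load.
    Returns None if the load never exceeds the baseline."""
    base = baseline_iops(volume_gib)
    demand = min(sustained_iops, BURST_IOPS)
    if demand <= base:
        return None
    return INITIAL_CREDITS / (demand - base)


if __name__ == "__main__":
    for size in (100, 200, 500):
        t = seconds_until_depleted(size, sustained_iops=650)
        label = "never" if t is None else f"{t / 3600:.1f} hours"
        print(f"{size} GiB: baseline {baseline_iops(size)} IOPS, depleted after {label}")
```

Under these assumptions, a load only slightly above the baseline drains the bucket over days rather than minutes, which is the kind of slow degradation described in the question. A larger volume raises the baseline and can stop the drain entirely.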

Amazon RDS - max_connections

My simple question:
How can I increase the possible number of connections of my Amazon RDS Database?
I used a parameter group where I set
max_connections = 30000
which seems to work on the first hand, as
SHOW VARIABLES LIKE 'max_connections';
returns the expected.
But when I run a stress test the monitoring metrics always show a maximum number of 1200 connections.
So obviously there have to be other limiting factors, I just don't know.
Any help would be highly appreciated.
My test setup:
1 Load Balancer
8 fat EC2 instances (m4.4xlarge) (which is a bit overdimensioned, but I'm still testing)
1 DB: r3.4xlarge with 140 GB memory, 1 TB storage and 10.000 provisioned IOPS
Test: 30.000 virtual users in 10 minutes making 4 requests each (2 reading the DB, 1 writing it, 1 not using the DB).
Fails after about two minutes because of too many errors (caused by DB timeouts).
Concerning the hardware this setup should be able to handle the test requests, shouldn't it?
So I hope I'm just missing the obvious and there's a parameter which has to be adapted to make everything working.
I would strongly suggest that the first problem is not with the configuration of the server, but with your test methodology and interpretation of what you are seeing.
Hitting max_connections does not initially cause "db timeouts." It causes connection errors, because the server actively rejects excessive connection attempts, with a refusal to negotiate further. This is not the same thing as a timeout.
At what point, during what operation, are the timeouts occurring? Initial connection phase? That's not going to be related to max_connections, at least not directly.
The maximum connections you observe seems like a suspiciously round number and potentially is even derivable from your test parameters... You mentioned 30000 users and 10 minutes and 4 requests... and 30000 × 4 ÷ 10 ÷ 10 = 1200. Yes, I threw in the "10" twice for no particular reason other than 1200 just seems very suspicious. I wonder whether, if you used 15000 users, the number would drop from 1200 to 600. That would be worth investigating.
Importantly, to serve 30000 concurrent users, your application does not need 30000 database connections. If it does, it's written very, very badly. I don't know how you're testing this, but only a naive implementation given the stated parameters would assume 30000 connections should be established.
Equally important, 30000 connections to a single MySQL server regardless of size seems completely detached from reality, except maybe with thread pooling, which isn't available in the version of MySQL used in RDS. If you were to successfully create that many connections, on a cold server or one without a massive thread cache already warmed up, it would likely take several minutes just for the OS to allow MySQL to create that many new threads. You would indeed see timeouts here, because the OS would not let the server keep up with the incoming demand, but it would be unrelated to max_connections.
Your most promising path at this point is to stop assuming that max_connections isn't actually set to the value it claims, scale down your test parameters, see how the behavior changes, and go from there in an effort to understand what is actually happening. Your test parameters also need to be meaningfully related to the actual workload you're trying to test against.
Thanks to Michael and the hints of a colleague I was finally able to solve this problem:
As Michael already supposed it wasn't caused by the DB.
The answer was hidden in the Apache configuration, which I examined once DB problems (finally) seemed to be out of the question.
All my eight EC2 instances were limited by MaxRequestWorkers=150 (-> 8*150=1200).
What is obvious for every holiday admin took me days.
At least everything's working now.
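The ceiling found here is just the smaller of two limits. Here is a tiny Python sketch, assuming each Apache worker holds at most one database connection (an assumption about the application, not something stated above):

```python
def connection_ceiling(instances, workers_per_instance, db_max_connections):
    """Most connections the web tier can open at once, capped by the DB limit.
    Assumes one DB connection per web worker."""
    return min(instances * workers_per_instance, db_max_connections)


if __name__ == "__main__":
    # 8 EC2 instances with MaxRequestWorkers=150, max_connections=30000
    print(connection_ceiling(8, 150, 30000))
```

Raising max_connections changes nothing as long as the web tier is the smaller term.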

MySQL active connections at once, Windows Server

I have read every possible answer to this question and searched via Google in order to find the correct answer to the following question, but I am rather a novice and don't seem to get a clear understanding.
A lot I've read has to do with web servers, but I don't have a web server, but an intranet database.
I have a MySQL database on a Windows server at work.
I will have many users accessing this database constantly to perform simple queries and write new records back to it.
The read/write will not be that heavy (chances are 50-100 users will do so exactly at the same time, even if 1000's could be connected).
The GUI will be either via Excel forms and/or Access.
What I need to know is the maximum number of active connections I can have at any given time to the database.
I know I can change the number on Mysql Admin however I really need to know what will really work...
I don't want to put 1000 users if the system will really handle 100 correctly (after that, although connected, the performance will be too slow, for example)
Any ideas or own experiences will be appreciated
This depends mainly on your server hardware (RAM, cpu, networking) and server load for other processes if not dedicated to the database. I think you won't have an absolute answer and the best way is testing.
I think something like 1000 should work OK, as long as you use a 64-bit MySQL server. With 32-bit, too many connections may create virtual memory pressure: each connection has its own thread, and every thread needs a stack, so the stack memory reduces the possible size of the buffer pool and other buffers.
MySQL generally does not slow down if you have many idle connections; however, special commands that enumerate every connection, e.g. "show processlist" or "kill", will be somewhat slower.
If a connection stays idle for too long (idle time exceeds the wait_timeout parameter), it is dropped by the server. If that could happen in your scenario, you might want to increase wait_timeout (its default value is 8 hours).

Does MySQL packet size cause slowdown?

I have written a program which uses a MySQL database, and the transactions between the database server (a very powerful one) and the client happen over an ADSL connection (1 Mbit/s).
But I have a very, very slow connection between each client and the server. Only approximately 3-4 KB/s of data is sent through the server. Neither the server nor the clients use the Internet for other purposes; only my program uses the Internet.
I can't figure out why. Is the MySQL server packet size the reason?
Any suggestions?
Try using mytop to identify the cause of the server's low performance.
Another one: you may be using SELECT COUNT(*) FROM .. for large InnoDB tables which causes a table scan.
And can you test with some other service whether the data exchange rate between the machines is OK? Even if the outbound bandwidth is lower for ADSL users, 3-4 kB/s might not be the reason for the low performance.
The effective transfer rate is often heavily limited by the number of roundtrips between client and server. Without seeing your code it is sort of difficult to tell, but you should check the number of requests happening.
If you have a single request that results in many records being returned, you should see a better usage of bandwidth than with a higher number of requests which only deliver a few rows each.
In the latter case the actual result transfer is probably quite fast, but the latencies involved in the "control communications" (i. e. the statements themselves, login requests etc.) will add up, effectively lowering overall throughput.
As for the packet size: When it is very small, there is more overhead in the communications, increasing the aforementioned effect. The server's default max_allowed_packet size is 1MB if memory serves, but that should be fine with your connection.
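The round-trip effect is easy to put into numbers. The Python sketch below uses a simple model (one round trip per request plus the raw transfer time) with made-up figures: a 50 ms round-trip time and 100 KB of result data. Only the 1 Mbit/s line speed comes from the question.

```python
def transfer_time(total_bytes, n_requests, rtt_seconds, bandwidth_bytes_per_sec):
    """One round trip of latency per request, plus time on the wire."""
    return n_requests * rtt_seconds + total_bytes / bandwidth_bytes_per_sec


def effective_rate(total_bytes, n_requests, rtt_seconds, bandwidth_bytes_per_sec):
    """Observed throughput in bytes per second."""
    return total_bytes / transfer_time(
        total_bytes, n_requests, rtt_seconds, bandwidth_bytes_per_sec
    )


if __name__ == "__main__":
    line = 1_000_000 / 8   # 1 Mbit/s expressed in bytes per second
    data = 100_000         # hypothetical: 100 KB of rows in total
    rtt = 0.05             # hypothetical: 50 ms round trip
    for n in (1, 100, 1000):
        rate = effective_rate(data, n, rtt, line)
        print(f"{n:>4} requests: {rate / 1000:.1f} KB/s")
```

With these assumed figures, fetching the same data in 1000 small requests lands in the low single-digit KB/s range, even though the line could carry far more.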
You first have to debug both connections.
What is your upload speed if you upload a file with WinSCP or equivalent to the MySQL server? It should be near 90 KB/s with ADSL 1 Mbit/s.