So after 3 months of hard work developing and switching the company API from PHP to Go, I found out that our Go server can't handle more than 20 requests/second.
So basically, this is how our API works (a rough sketch in code follows the list):
takes in a request
validates the request
fetches the data from the DB using MySQL
puts the data in a map
sends it back to the client as JSON
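To make the flow concrete, here is a minimal sketch of what one of these handlers might look like; the route, DSN, table, and column names are all invented, since the actual code isn't shown:

package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/go-sql-driver/mysql"
	"github.com/gorilla/mux"
)

var db *sql.DB

// salesReport follows the steps above: validate the request, query
// MySQL, collect the rows into maps, and reply with JSON.
func salesReport(w http.ResponseWriter, r *http.Request) {
	from := r.URL.Query().Get("from")
	if from == "" { // validate the request
		http.Error(w, "missing 'from' parameter", http.StatusBadRequest)
		return
	}

	rows, err := db.Query("SELECT id, total FROM sales WHERE day >= ?", from)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer rows.Close()

	var report []map[string]interface{} // put the data in maps
	for rows.Next() {
		var id int64
		var total float64
		if err := rows.Scan(&id, &total); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		report = append(report, map[string]interface{}{"id": id, "total": total})
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(report) // send it back as JSON
}

func main() {
	var err error
	db, err = sql.Open("mysql", "user:pass@unix(/var/run/mysqld/mysqld.sock)/sales")
	if err != nil {
		log.Fatal(err)
	}
	router := mux.NewRouter()
	router.HandleFunc("/sales/report", salesReport).Methods("GET")
	log.Fatal(http.ListenAndServe(":8000", router))
}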
So after writing about 30 APIs I decided to take it for a spin and see how it performs under a load test.
Test 1: ab -n 1 -c 1 http://localhost:8000/sales/report. The result is "Time per request: 72.623 [ms] (mean)".
Test 2: ab -n 100 -c 100 http://localhost:8000/sales/report. The result is "Time per request: 4548.155 [ms] (mean)" (no MySQL errors).
How did the number suddenly spike from 72.623 to 4548 ms in the second test? We expect thousands of requests per day, so I need to solve this issue before we finally release it. I was surprised when I saw the numbers; I couldn't believe it. I know Go can do much better.
So basic info about the server and settings:
Using Go 1.5
16 GB RAM
GOMAXPROCS is using all 8 cores
db.SetMaxIdleConns(1000)
db.SetMaxOpenConns(1000) (also made sure we are using a pool of connections)
Connecting to MySQL through a Unix socket
System is running under Ubuntu
External libraries that we are using:
github.com/go-sql-driver/mysql
github.com/gorilla/mux
github.com/elgs/gosqljson
Any ideas what might be causing this? I took a look at this post, but it didn't help; as I mentioned above, I never got any MySQL errors. Thank you in advance for any help you can provide.
Your post doesn't have enough information to address why your program is not performing as you expect, but I think this question alone is worth an answer:
How did the number suddenly spike from 72.623 to 4548 ms in the second test?
In your first test, you did one single request (-n 1). In your second test, you did 100 requests in flight simultaneously (-c 100 -n 100).
You mention that your program communicates with an external database, which means your program has to wait for that resource to respond. Do you understand how your database performs when you send it 1,000 requests simultaneously? You made no mention of this. Go can certainly handle many hundreds of concurrent requests a second without breaking a sweat, but it depends on what you're doing and how you're doing it. If your program can't complete requests as fast as they are coming in, they will pile up, leading to high latency.
Neither of the tests you told us about is useful for understanding how your server performs under "normal" circumstances, which you said would be "thousands of requests per day" (that isn't very specific, but I'll take it to mean "a few per second"). It would be much more interesting to look at -c 4 -n 1000, or something that exercises the server over a longer period of time with a number of concurrent requests closer to what you actually expect.
I'm not familiar with the gosqljson package, but you say your "query by itself is not really complicated" and you're running it against "a well structured DB table," so I'd suggest dropping gosqljson and binding your query results to structs, then marshalling those structs to JSON. That should be faster and incur less memory thrashing than using a map[string]interface{} for everything.
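Something along these lines, with invented struct fields and query since I don't know your schema (only database/sql and encoding/json are needed):

// SaleRow mirrors the (hypothetical) columns the report query returns.
type SaleRow struct {
	ID     int64   `json:"id"`
	Region string  `json:"region"`
	Total  float64 `json:"total"`
}

func fetchReport(db *sql.DB) ([]byte, error) {
	rows, err := db.Query("SELECT id, region, total FROM sales")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	sales := make([]SaleRow, 0, 64)
	for rows.Next() {
		var s SaleRow
		// Scanning into typed fields avoids the per-value boxing that a
		// map[string]interface{} forces on every row.
		if err := rows.Scan(&s.ID, &s.Region, &s.Total); err != nil {
			return nil, err
		}
		sales = append(sales, s)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	return json.Marshal(sales)
}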
But I don't think gosqljson could possibly be that slow, so it's likely not the main culprit.
The way you're doing your second benchmark is not helpful.
Test 2: ab -n 100 -c 100 http://localhost:8000/sales/report
That's not testing how fast you can handle concurrent requests so much as it's testing how fast you can make connections to MySQL. You're only doing 100 queries, all of them concurrent, which means Go is probably spending most of its time making up to 100 connections to MySQL. Go probably doesn't even have time to reuse any of the DB connections, considering all the other stuff it's doing to satisfy each query, and then, boom, the test is over. You would need to set the max connections to something like 50 and run 10,000 queries to see how long concurrent requests take once a pool of DB connections is already established; right now you're basically testing how long it takes Go to build up a pool of DB connections.
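Concretely, something like this (the numbers are illustrative, not tuned for your hardware):

db.SetMaxOpenConns(50) // cap the pool at what MySQL can realistically serve
db.SetMaxIdleConns(50) // keep finished connections around for reuse

Then benchmark long enough for the pool to warm up and actually be reused, e.g. ab -n 10000 -c 50 http://localhost:8000/sales/report.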
Related
When load testing my application, the AWS RDS CPU hits 100% and the corresponding requests error out. The RDS instance is an m4.2xlarge. With the same configuration, things were fine until 2 weeks back; there have been no infrastructure changes to the environment, nor any application-level changes. The whole load test used to run smoothly for the complete 2 hours until 2 weeks back. There is no specific exception apart from GenericJDBCException.
All other necessary services are up and running on their respective instances.
We are using SQL as the database management system.
Is there any reason this would happen suddenly? How can we resolve it? Suggestions are much appreciated; this has created many problems.
Monitoring the slow logs and resolving them did not solve the problem.
Should we upgrade the RDS instance to the next version?
Does more data in the DB slow the database down?
We have also tried modifying the connection pool parameters.
With "load testing", are you able to finish one day's work in one hour? That sounds great! Or what do you mean by "load testing"?
Or are you trying to launch 200 threads in one second and they are stumbling over each other? That's to be expected. Do you really get 200 new connections in a single second? Or is it spread out?
1 million queries per day is no problem. A million queries all at once will fail.
Do not let your "load test" launch more threads than you can reasonably expect. They will all pile up, and latency will suffer while the server is giving each thread an equal chance.
Meanwhile, use the slowlog to find the "worst" queries in production. Then let's discuss the worst one or two -- Often an improved index makes that query work much faster, thereby no longer contributing to the train wreck.
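For example (the table and column names here are hypothetical), you would turn on the slowlog, let the hourly traffic hit, and then index whatever the worst query filters on:

SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;   -- log anything slower than 1 second

-- after the slowlog identifies the offender, e.g. a scan on user_id:
ALTER TABLE orders ADD INDEX idx_user_created (user_id, created_at);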
I have the following problem.
Using REST, I am getting binary content (BLOBs) from a MySQL database via a NodeJS Express app.
All works fine, but I am having issues scaling the solution.
I increased the number of NodeJS instances to 3; they are running on ports 4000, 4001, and 4002.
On the same machine I have Nginx installed and configured to do a load balancing between my 3 instances.
I am using Apache Bench to do some perf testing.
Please see the attached pic.
Assuming I have a dummy GET REST endpoint that goes to the DB, reads the blob (roughly 600 KB in size) and returns it (all over HTTP), I am making 300 simultaneous calls. I would have thought that using Nginx to distribute the requests would make it faster, but it does not.
Why is this happening?
I am assuming it has to do with MySQL?
My NodeJS app is using a connection pool with a limit set to 100 connections. What should be the relation between this value and max_connections in MySQL? If I increase the connection pool to a higher number of connections, I get worse results.
Any suggestion on how to scale?
Thanks!
"300 simultaneous" is folly. No one (today) has the resources to effectively do more than a few dozen of anything.
4 CPU cores -- If you go much beyond 4 threads, they will be stumbling over each other trying to get CPU time.
1 network -- Have you checked to see whether your big blobs are using all the bandwidth, thereby being the bottleneck?
1 I/O channel -- Again, lots of data could be filling up the pathway to disk.
(This math is not quite right, but it makes a point...) You cannot effectively run any faster than what you can get from 4+1+1 "simultaneous" connections. (In reality, you may be able to, but not 300!)
The typical benchmarks try to find how many "connections" (or whatever) leads to the system keeling over. Those hard-to-read screenshots say about 7 per second is the limit.
I also quibble with the word "simultaneous". The only thing close to "simultaneous" (in your system) is the ability to use 4 cores "simultaneously". Every other metric involves sharing of resources. Based on what you say, ...
If you start about 7 each second, some resource will be topped out, but each request will be fast (perhaps less than a second)
If you start 300 all at once, they will stumble over each other, some of them taking perhaps minutes to finish.
There are two interesting metrics:
How many per second you can sustain. (Perhaps 7/sec)
How long the average (and, perhaps, the 95th percentile) takes.
Try 10 "simultaneous" connections and report back. Try 7. Try some other small numbers like those.
My simple question:
How can I increase the possible number of connections of my Amazon RDS Database?
I used a parameter group where I set
max_connections = 30000
which seems to work on the first hand, as
SHOW VARIABLES LIKE 'max_connections';
returns the expected.
But when I run a stress test the monitoring metrics always show a maximum number of 1200 connections.
So obviously there have to be other limiting factors; I just don't know which.
Any help would be highly appreciated.
My test setup:
1 Load Balancer
8 fat EC2 instances (m4.4xlarge) (which is a bit overprovisioned, but I'm still testing)
1 DB: r3.4xlarge with 140 GB memory, 1 TB storage and 10,000 provisioned IOPS
Test: 30,000 virtual users in 10 minutes making 4 requests each (2 reading the DB, 1 writing to it, 1 not using the DB).
Fails after about two minutes because of too many errors (caused by DB timeouts).
Concerning the hardware, this setup should be able to handle the test requests, shouldn't it?
So I hope I'm just missing the obvious and there's a parameter which has to be adapted to make everything work.
I would strongly suggest that the first problem is not with the configuration of the server, but with your test methodology and interpretation of what you are seeing.
Hitting max_connections does not initially cause "db timeouts." It causes connection errors, because the server actively rejects excessive connection attempts, with a refusal to negotiate further. This is not the same thing as a timeout.
At what point, during what operation, are the timeouts occurring? Initial connection phase? That's not going to be related to max_connections, at least not directly.
The maximum connections you observe seems like a suspiciously round number and potentially is even derivable from your test parameters... You mentioned 30000 users and 10 minutes and 4 requests... and 30000 × 4 ÷ 10 ÷ 10 = 1200. Yes, I threw in the "10" twice for no particular reason other than 1200 just seems very suspicious. I wonder whether, if you used 15000 users, the number would drop from 1200 to 600. That would be worth investigating.
Importantly, to serve 30000 concurrent users, your application does not need 30000 database connections. If it does, it's written very, very badly. I don't know how you're testing this, but only a naive implementation given the stated parameters would assume 30000 connections should be established.
Equally important, 30000 connections to a single MySQL server regardless of size seems completely detached from reality, except maybe with thread pooling, which isn't available in the version of MySQL used in RDS. If you were to successfully create that many connections, on a cold server or one without a massive thread cache already warmed up, it would likely take several minutes just for the OS to allow MySQL to create that many new threads. You would indeed see timeouts here, because the OS would not let the server keep up with the incoming demand, but it would be unrelated to max_connections.
It would seem like your most likely path at this point would be to assume that max_connections actually is set to the value it claims, scale down your test parameters, see how the behavior changes, and go from there in an effort to understand what is actually happening. Your test parameters also need to be meaningfully related to the actual workload you're trying to test against.
Thanks to Michael and the hints of a colleague, I was finally able to solve this problem:
As Michael already supposed it wasn't caused by the DB.
The answer was hidden in the Apache configuration, which I examined once DB problems finally seemed to be out of the question.
All my eight EC2 instances were limited by MaxRequestWorkers 150 (8 × 150 = 1200).
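For anyone else hitting the same ceiling, the limit lives in the MPM configuration; something like this (the numbers are only illustrative, and MaxRequestWorkers must not exceed ServerLimit × ThreadsPerChild):

<IfModule mpm_event_module>
    ServerLimit          16
    ThreadsPerChild      25
    MaxRequestWorkers    400
</IfModule>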
What would be obvious to any hobbyist admin took me a day.
At least everything's working now.
I wrote a web application using Python and the Flask framework, and set it up on Apache with mod_wsgi.
Today I used JMeter to perform some load testing on this application.
For one web URL:
when I set only 1 thread to send requests, the response time is 200 ms
when I set 20 concurrent threads to send requests, the response time increases to more than 4000 ms (4 s). THIS IS UNACCEPTABLE!
I am trying to find the problem, so I recorded the time in the before_request and teardown_request methods of Flask. It turns out the time taken to process a request is just over 10 ms.
In this URL handler, the app just performs some SQL queries (about 10) against a MySQL database, nothing special.
To test whether the problem is with the web server or the framework configuration, I wrote another method, Hello, in the same Flask application, which just returns a string. It performs perfectly under load; the response time is 13 ms with 20-thread concurrency.
And when doing the load test, I executed 'top' on my server; there are about 10 Apache threads, but the CPU is mostly idle.
I am at my wit's end now. Even if the requests were performed serially, the performance should not drop so drastically... My guess is that there is some queuing somewhere that I am unaware of, and there must be overhead besides handling the request.
If you have experience in tuning performance of web applications, please help!
EDIT
About the Apache configuration: I used the worker MPM, configured as follows:
<IfModule mpm_worker_module>
StartServers 4
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 50
MaxClients 200
MaxRequestsPerChild 0
</IfModule>
As for mod_wsgi, I tried turning WSGIDaemonProcess on and off (by commenting the following line out); the performance looks the same.
# WSGIDaemonProcess tqt processes=3 threads=15 display-name=TQTSERVER
Congratulations! You found the performance problem - not your users!
Analysing performance problems on web applications is usually hard, because there are so many moving parts, and it's hard to see inside the application while it's running.
The behaviour you describe is usually associated with a bottleneck resource - this happens when there's a particular resource that can't keep up, so queues requests, which tends to lead to a "hockey stick" curve with response times - once you hit the point where this resource can't keep up, the response time goes up very quickly.
20 concurrent threads seems low for that to happen, unless you're doing a lot of very heavy lifting on the page.
The first place to start is top - while CPU is low, what are memory, disk access, etc. doing? Is your database running on the same machine? If not, what does top say on the database server?
Assuming it's not some silly hardware thing, the next most likely problem is the database access on that page. It may be that one query is returning literally the entire database when all you want is one record (this is a fairly common anti-pattern with ORM solutions); that could lead to the behaviour you describe. I would use the Flask logging framework to record your database calls (start, end, number of records returned) and look for anomalies there.
If the database is performing well under load, it's either the framework or the application code. Again, use logging statements in the code to trace the execution time of individual blocks of code, and keep hunting...
It's not glamorous, and can be really tedious - but it's a lot better that you found this before going live!
Look at using New Relic to identify where the bottleneck is. See an overview of it and a discussion of identifying bottlenecks in my talk:
http://lanyrd.com/2012/pycon/spcdg/
Also, edit your original question and add the mod_wsgi configuration you are using, plus whether you are using the Apache prefork or worker MPM, as you could be doing something non-optimal there.
This is not a typical question, but I'm out of ideas and don't know where else to go. If there is a better place to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses the Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to take, and sends this information to an (external) notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application really started to grow. In the last few days we encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower, and often takes 20 minutes to recover, until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16-core Intel Xeon (8 cores with hyperthreading, I think) with 12 GB RAM, running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
(Screenshots were attached here.) info: phpinfo(), APC status, memcache status. collectd: processes, CPU, Apache, load, MySQL, vmem, disk. New Relic: application performance, server overview, processes, network, disks.
(Sorry the graphs are GIFs and not from the same time period, but I think the most important info is in there.)
The problem is almost certainly MySQL-based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you could just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued, with only one process currently executing. That will be the most likely culprit.
Having identified the query causing the problems you can attack your code. Without understanding how your application is actually working my best guess would be that using an explicit transaction around the problem query(ies) will probably solve the problem.
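To sketch the idea (these statements are invented; the actual queries will differ), instead of letting each statement auto-commit on its own, the hourly job would group its work:

START TRANSACTION;
-- the statements the job currently runs one at a time, e.g.:
UPDATE users SET notified_at = NOW() WHERE id = 123;
INSERT INTO notification_log (user_id, sent_at) VALUES (123, NOW());
COMMIT;

One commit instead of many means fewer lock acquisitions and disk flushes, which can matter when hundreds of requests fire at the same instant.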
Good luck!