How to find out what is causing a slowdown of the application? - mysql

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a unique usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to do and sends this information to an (external) notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application has really started to grow. In the last few days we have encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower and often takes 20 minutes to recover - until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16-core Intel Xeon (8 cores with hyperthreading, I think) with 12GB RAM, running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
[Screenshots attached - info: phpinfo(), APC status, memcache status; collectd: processes, CPU, Apache, load, MySQL, vmem, disk; New Relic: application performance, server overview, processes, network, disks.]
(Sorry the graphs are GIFs and not all from the same time period, but I think the most important info is in there.)

The problem is almost certainly MySQL-based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you can just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued with only one process currently executing. That will be the most likely culprit.
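For illustration, a minimal set of statements (all standard MySQL; the 200-connection ceiling is my assumption from your graph) to confirm the limit and watch the queue from a connection opened in advance:

-- Confirm the configured ceiling and current usage
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';

-- Just before the hour, watch what every thread is doing;
-- FULL keeps long query text from being truncated
SHOW FULL PROCESSLIST;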
Having identified the query causing the problems, you can attack your code. Without understanding how your application actually works, my best guess is that wrapping the problem query (or queries) in an explicit transaction will probably solve the problem.
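As a rough sketch only (PDO shown for brevity; your Zend_Db code will differ, and the table, column and data here are hypothetical), an explicit transaction looks like this:

<?php
// Hypothetical example: batch the hourly notification updates in one transaction.
$userIds = array(1, 2, 3);  // placeholder: users with pending notifications
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->beginTransaction();
try {
    $stmt = $pdo->prepare('UPDATE notifications SET sent = 1 WHERE user_id = ?');
    foreach ($userIds as $userId) {
        $stmt->execute(array($userId));
    }
    $pdo->commit();  // one commit instead of one per statement
} catch (Exception $e) {
    $pdo->rollBack();  // undo everything if any statement fails
    throw $e;
}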
Good luck!

Related

Does Symfony3 / Doctrine open one MySQL connection per visitor?

So I have developed this website with Symfony3 and Doctrine. I have one major concern about performance with MySQL, and more specifically the number of simultaneously open connections.
For the moment, one to five users are online on the website. What happens if, let's say, 1,500 users connect within one minute? Does Symfony3 or Doctrine handle this kind of situation? How can I be sure the website doesn't go down with the Too many connections MySQL error?
And if I go up to 5,000? And 10,000? The server has 4GB of RAM and a 2.40GHz single-core processor, but I wouldn't worry about the hardware, as I'm more concerned about MySQL.
These situations have already happened in the past, but I was running the website on WordPress with the W3 Total Cache plugin. Should I consider using a cache manager such as memcached or similar?
In short, I'm concerned about the website becoming unavailable in case of sudden high traffic (I thought of the MySQL Too many connections error first, but I might be missing something even more important).
Thanks for enlightening me on this one, as I'm not fully aware of the performance issues with Symfony.
I believe it does open one connection per visitor. Regardless of whether it does or not, however, neither Symfony nor Doctrine has a magic bullet to handle every load/connection scenario.
Why don't you use a load testing tool (there are many) and see how it actually pans out? In my experience, predicting a bottleneck is useless, as bottlenecks always crop up where you least expect them.
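For example, ApacheBench (which ships with Apache) gives a quick first impression; the URL and numbers below are placeholders:

# 1,000 requests total, 100 concurrent, against a hypothetical page
ab -n 1000 -c 100 https://www.example.com/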
For example, the MySQL connection limit is only one part of the optimisation puzzle. It's no good just worrying about connection limits; you need to respond to web requests as quickly and efficiently as possible to free up MySQL connection resources (and the other resources your app is using). So if your server is slow, you will run out of connections (or some other resource) almost immediately under significant load, regardless of MySQL connection limits.
That said, those server specifications seem a little low for 5-10k users per minute. I wouldn't expect a machine like that to handle that kind of load without some serious optimisation/caching/etc.
The Symfony performance page is a good starting point, and there is also a good article on caching - there's a ton of material available on the subject. Good luck! :)
If you use php-fpm, it depends on pm.max_children in fpm/pool.d/www.conf.
pm.max_children refers to the maximum number of concurrent PHP-FPM processes allowed to exist in such a pool. If the volume of incoming requests requires the creation of more PHP-FPM processes than the number allowed by the max_children limit, those additional requests are backlogged in a queue to await service.
So when pm.max_children > max_connections (in my.cnf) and the number of active users > max_connections, you will get "Too many connections".
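As an illustration (the values below are made up, not recommendations), you want the MySQL ceiling to stay above the PHP-FPM one, so that even with every worker holding a connection MySQL still has headroom:

; fpm/pool.d/www.conf
pm.max_children = 120

# my.cnf
[mysqld]
max_connections = 150  # headroom above pm.max_children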

Nginx vs Apache to solve load issue on website

So we have a web application that has 10-12 pages with many POST/GET DB calls. We usually get an Apache crash or other problems when site traffic reaches 1,000 or so concurrent users, which is a very small number; we have upgraded the server with good RAM and resources. Our sysadmin did load testing with blitz.io and other custom scripts and is suggesting we move away from Apache. Some things don't make sense to me: Apache shouldn't be too bad at handling a few thousand concurrent users, considering we have Cloudflare for caching. Here is what he suggested:
1. Replace Apache+mod_fcgi with Nginx+php-fpm, which can make the server handle many more users, and then test it.
or
2. For testing: we need 10-20 servers to run a scenario from. Basically, what is needed is a more complex blitz.io analogue: create one server (which takes all those hours), then just clone it in the cloud and pay for about one hour of testing multiplied by the number of servers needed.
Once again, there are many DB calls and .htaccess usage. Also, what makes Nginx better than Apache in this case?
I would check this comparison first. Basically, nginx is event-based, so it's able to handle more requests concurrently. However, as the MySQL DB seems to be the choke point here, it's very possible that nginx wouldn't solve all your problems. Perhaps moving to a NoSQL kind of database that's better at scaling horizontally would help (if that's feasible).
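If you do end up trying nginx+php-fpm, the usual shape of the config is roughly this (the paths, socket name and domain are assumptions, not your setup). Note also that nginx has no .htaccess equivalent, so those rules would have to move into the server block:

server {
    listen 80;
    server_name example.com;
    root /var/www/app;
    index index.php;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    # Hand PHP off to a php-fpm pool over a Unix socket
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php/php-fpm.sock;
    }
}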

Web request performance really bad under stress

I wrote a web application using Python and the Flask framework, and set it up on Apache with mod_wsgi.
Today I used JMeter to perform some load testing on this application.
For one web URL:
when I set only 1 thread to send requests, the response time is 200ms
when I set 20 concurrent threads to send requests, the response time increases to more than 4000ms (4s). THIS IS UNACCEPTABLE!
I am trying to find the problem, so I recorded the time in the before_request and teardown_request methods of Flask. It turns out the time taken to process the request is just over 10ms.
In this URL handler, the app just performs some SQL queries (about 10) against a MySQL database, nothing special.
To test whether the problem is with the web server or the framework configuration, I wrote another method, Hello, in the same Flask application, which just returns a string. It performs perfectly under load; the response time is 13ms with 20-thread concurrency.
And when doing the load test, I ran 'top' on my server; there are about 10 Apache threads, but the CPU is mostly idle.
I am at my wit's end now. Even if the requests were performed serially, the performance should not drop so drastically... My guess is that there is some queueing somewhere that I am unaware of, and there must be overhead besides handling the request.
If you have experience in tuning performance of web applications, please help!
EDIT
As for the Apache configuration, I used the worker MPM:
<IfModule mpm_worker_module>
StartServers 4
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 50
MaxClients 200
MaxRequestsPerChild 0
</IfModule>
As for mod_wsgi, I tried turning WSGIDaemonProcess on and off (by commenting the following line out), and the performance looks the same.
# WSGIDaemonProcess tqt processes=3 threads=15 display-name=TQTSERVER
Congratulations! You found the performance problem - not your users!
Analysing performance problems on web applications is usually hard, because there are so many moving parts, and it's hard to see inside the application while it's running.
The behaviour you describe is usually associated with a bottleneck resource - this happens when there's a particular resource that can't keep up, so queues requests, which tends to lead to a "hockey stick" curve with response times - once you hit the point where this resource can't keep up, the response time goes up very quickly.
20 concurrent threads seems low for that to happen, unless you're doing a lot of very heavy lifting on the page.
The first place to start is top - while CPU is low, what are memory, disk access, etc. doing? Is your database running on the same machine? If not, what does top say on the database server?
Assuming it's not some silly hardware thing, the next most likely problem is the database access on that page. It may be that one query is returning literally the entire database when all you want is one record (this is a fairly common anti-pattern with ORM solutions); that could lead to the behaviour you describe. I would use the Flask logging framework to record your database calls (start, end, number of records returned) and look for anomalies there.
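A minimal sketch of what that logging could look like (the helper is hypothetical; any DB-API-style connection, e.g. MySQLdb, works the same way):

import logging
import time

log = logging.getLogger('db')

def timed_query(conn, sql, params=()):
    # For SELECT queries: log elapsed time and row count to spot the outlier.
    start = time.time()
    cur = conn.cursor()
    cur.execute(sql, params)
    rows = cur.fetchall()
    log.info('%.3fs, %d rows: %s', time.time() - start, len(rows), sql)
    return rows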
If the database is performing well under load, it's either the framework or the application code. Again, use logging statements in the code to trace the execution time of individual blocks of code, and keep hunting...
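Similarly, a small helper for timing arbitrary blocks of code (sketch only):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print how long the wrapped block took.
    start = time.time()
    yield
    print('%s: %.3fs' % (label, time.time() - start))

# usage: with timed('render'): render_the_page()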
It's not glamorous, and can be really tedious - but it's a lot better that you found this before going live!
Look at using New Relic to identify where the bottleneck is. See an overview of it and a discussion of identifying bottlenecks in my talk:
http://lanyrd.com/2012/pycon/spcdg/
Also edit your original question and add the mod_wsgi configuration you are using, plus whether you are using the Apache prefork or worker MPM, as you could be doing something non-optimal there.

Service deployed on Tomcat crashing under heavy load

I'm having trouble with a web service deployed on Tomcat. During peak traffic times the server becomes unresponsive and forces me to restart it entirely in order to get it working again.
First of all, I'm pretty new to all this. I built the server myself using various guides and blogs. Everything has been working great, but due to the increased traffic load, I'm now getting a little out of my league. So, I need clear instructions on what to do, or to be pointed towards exactly what I need to read up on.
I'm currently monitoring the service using JavaMelody, so I can see the spikes occurring, but I am unaware how to get more detailed information than this as to possible causes/solutions.
The server itself is quad-core with 16GB RAM, so the issue doesn't lie there; more likely it's that I need to properly configure Tomcat to be able to use this (or set up a cluster...?).
JavaMelody shows the service crashing when CPU usage only gets to about 20%, at about 300 hits a minute. Are there any max connection limits or memory settings that I should be configuring?
I also only have a single instance of the service deployed. I understand I can simply rename the war file and Tomcat deploys a second instance. Will doing this help?
Each request also opens (and immediately closes) a connection to MySQL to retrieve data, so I should probably make sure it's not getting throttled there either.
Sorry this is so long winded and has multiple questions. I can give more information as needed, I am just not certain what needs to be given at this time!
The server has 16GB of RAM, but how much memory do you have dedicated to Tomcat (-Xms and -Xmx)?
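For example (the sizes below are placeholders, not recommendations), the heap is usually set via CATALINA_OPTS in a setenv.sh next to Tomcat's startup scripts:

# $CATALINA_HOME/bin/setenv.sh (create it if it doesn't exist)
# Start with a 2 GB heap and allow it to grow to 8 GB
export CATALINA_OPTS="-Xms2g -Xmx8g"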

How to benchmark and optimize a really database-intensive Rails action?

There is an action in the admin section of a client's site, say Admin::Analytics (which I did not build but have to maintain), that compiles site usage analytics by performing a couple dozen rather intensive database queries. This functionality has always been a bottleneck to application performance whenever the analytics report is being compiled. But the bottleneck has become so bad lately that, when accessed, the site comes to a screeching halt and hangs indefinitely. Until yesterday I never had a reason to run the "top" command on the server, but doing so I realized that Admin::Analytics#index causes mysqld to spin at upwards of 350+% CPU on the quad-core production VPS.
I have downloaded fresh copies of the production data and the production log. However, when I access Admin::Analytics#index locally on my development box while using the production data, it loads in about 10-12 seconds (and utilizes ~150+% of my dual-core CPU), which sadly is normal. I suppose there could be a discrepancy in MySQL settings that has suddenly come into play. Also, a mysqldump of the database is now 531 MB, when it was only 336 MB 28 days ago. Anyway, I do not have root access on the VPS, so tweaking mysqld performance would be cumbersome, and I would really like to get to the exact cause of this problem. However, the production logs don't contain info on the queries; they merely report how long these requests took, which averages out to a few minutes apiece (although they seem to have caused mysqld to stall for much longer than this, prompting me in one instance to ask our host to reboot mysqld just to get our site back up).
I suppose I could try upping the log level in production to get info on the database queries being performed by Admin::Analytics#index, but at the same time I'm afraid to replicate this behavior in production, because I don't feel like calling our host up to restart mysqld again! This action contains a single database request in its controller, and a couple dozen prepared statements embedded in its view!
How would you proceed to benchmark/diagnose and optimize/fix this action?!
(Aside: Obviously I would like to completely replace this functionality with Google Analytics or a similar solution, but I need to fix this problem before proceeding.)
I'd recommend taking a look at this article:
http://axonflux.com/building-and-scaling-a-startup
Particularly, query_reviewer and newrelic have been life-savers for me.
I appreciate all the help with this, but what turned out to be the fix was to add a couple of indexes to the Analytics table to cater to the queries in this action. A simple Rails migration to add the indexes, and the action now loads in less than a second, both on my dev box and in prod!
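For reference, the migration was essentially along these lines (the index columns here are hypothetical; match them to the columns your queries filter and sort on):

class AddIndexesToAnalytics < ActiveRecord::Migration
  def self.up
    add_index :analytics, :user_id                     # hypothetical column
    add_index :analytics, [:account_id, :created_at]   # composite index for a common filter+sort
  end

  def self.down
    remove_index :analytics, :column => :user_id
    remove_index :analytics, :column => [:account_id, :created_at]
  end
end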