I was benchmarking my production server (it's still in beta) and the results were poor, to say the least. On pages without any dynamic content, 1000 requests with a concurrency of 1 returned 73 requests/sec.
When I start to add MySQL queries to the equation, things quickly spiral out of control. The same 1000 requests on my homepage produced the following results:
CPU spikes to 50%
Load spikes to 3.7 (though that doesn't always happen)
Complete requests: 1000
Failed requests: 0
Write errors: 0
Requests per second: 2.44
Transfer rate: 113.26 [Kbytes/sec]
90% of requests are served within 142ms.
95% of requests are served within 3531ms (it just keeps getting worse after that).
Taking a look at top while I run the benchmark:
mysqld is consuming roughly 7% of memory and 2.5% CPU
Apache seems to spawn 7 concurrent processes at times
At other points, Apache does not show up in top at all
I'm running prefork Apache on a Micro AWS instance (Ubuntu). I'll upgrade to a larger instance, but I worry that there is an underlying problem here with the code or my Apache setup.
I am deploying Django with mod_wsgi, and I set KeepAliveTimeout to 3 just in case a couple of slow processes were screwing me up.
My code for the homepage is seemingly straightforward, though it does require joins.
def index(request):
    # Seven posts that have a photo, ordered by date
    posts = Post.objects.filter(photo__isnull=False).order_by('date').distinct()[0:7]
    # Four open houses whose post has a photo, ordered by day
    ohouses = Open_House.objects.filter(post__photo__isnull=False).order_by('day').distinct()[0:4]
    return render_to_response("index.html", {'posts': posts, 'ohouses': ohouses}, context_instance=RequestContext(request))
I have left the default configuration in place for MySQL.
Could this all be attributable to running a Micro Instance? Could my instance be somewhat corrupted? Any other plausible explanations?
There's a ton that goes into quick response times. Django is pretty optimized for what it is, but relying on a framework alone will never get you where you want to be.
If you're going to use Apache, use the prefork MPM, and even then disable all the modules you don't absolutely need. Apache can be made to run fast, but it's not the fastest horse out there. You'll do better with something like Nginx or (cringe) Cherokee. Cherokee is a good webserver, but its usability is pretty much zero.
Any static resources should be served directly by your webserver or better yet, off a CDN.
Assuming you've optimized your own code so it doesn't make inefficient use of queries, Django's built-in, automatic query caching will help reduce the overall number of queries hitting the database. After that, you need to employ something like memcached.
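For example, a minimal memcached setup in Django looks something like this (assuming Django 1.3+ style settings and a local memcached daemon on the default port; the five-minute timeout is just an illustrative value):

# settings.py - point Django's cache framework at memcached
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'TIMEOUT': 300,  # cache entries live for five minutes
    }
}

# views.py - cache the rendered homepage per URL with the per-view decorator
from django.views.decorators.cache import cache_page

@cache_page(60 * 5)
def index(request):
    ...

That way repeat hits on the homepage never touch MySQL until the cached copy expires.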
Then, there's the server itself. Depending on the size of your site, you may not need much RAM and CPU, but it's always better to have too much than not enough. It might be beneficial to put some artificial load on your server (automated testing, spidering your site, etc), and see how your system resources hold up. If you get anywhere near capping out (I'd say over 50% with simple tests like that), you need to add some more into your instance's pool.
Search online for articles on how to optimize MySQL. Out of the box, it tends to use a lot more resources than it actually needs; there's lots of room for improvement there. And, if it's not already on its own server, strongly consider offloading it to its own server. If you're anticipating a lot of traffic, the same server responding to web requests and fetching data from a database will become a bottleneck quickly.
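As a rough illustration of the kind of knobs involved (the values below are placeholders, not recommendations - tune them against your own workload and MySQL version):

# my.cnf - illustrative starting points only
[mysqld]
innodb_buffer_pool_size = 256M   # keep the working set of your data in memory
query_cache_size        = 16M    # query cache (MySQL 5.x; removed in 8.0)
max_connections         = 100    # roughly match your web server's worker count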
Could this all be attributable to running a Micro Instance?
Micro instances can burst to roughly 2 ECUs of CPU for a short period of time, after which they are severely throttled for several minutes. I wouldn't trust any benchmarks done on a Micro EC2 instance for that reason.
So I have developed this website with Symfony3 and Doctrine. I have one major concern about performance with MySQL, and more specifically the number of simultaneous open connections.
For the moment, one to five users are online on the website. What happens if, let's say, 1,500 users connect within one minute? Does Symfony3 or Doctrine handle this kind of situation? How can I be sure the website won't go down with the Too many connections MySQL error?
And if I go up to 5,000? And 10,000? The server has 4GB of RAM and a 2.40 GHz single-core processor, but I wouldn't worry about the hardware as I'm more concerned about MySQL.
These situations have already happened in the past, but at the time I was running the website on WordPress with the W3 Total Cache plugin. Should I consider using a cache manager such as memcached or something similar?
In short, I'm concerned about the website becoming unavailable in case of sudden high traffic (I first thought of the MySQL Too many connections error, but I might be missing something even more important).
Thanks for enlightening me on this one, as I'm not fully up to speed on performance issues with Symfony.
I believe it does open one connection per visitor. Regardless of whether it does or not, however, neither Symfony nor Doctrine has a magic bullet to handle every load/connection scenario.
Why don't you use a load testing tool (there are many) and see how it actually pans out? In my experience predicting a bottleneck is useless, as they will always crop up where you least expect it.
For example, the MySQL connection limit is only one part of the optimisation puzzle. It's no good just worrying about connection limits; you need to respond to web requests as quickly and efficiently as possible to free up MySQL connection resources (and other resources your app is using). So if your server is slow, you will run out of connections (or some other resource) almost immediately under significant load, regardless of MySQL connection limits.
That said, those server specifications seem a little low for 5-10k users per minute. I wouldn't expect a machine like that to handle that kind of load without some serious optimisation/caching/etc.
The Symfony performance page is a good starting point, and there is also a good article on caching - there's a ton of material available on the subject. Good luck! :)
If you use PHP-FPM, it depends on pm.max_children in fpm/pool.d/www.conf.
pm.max_children refers to the maximum number of concurrent PHP-FPM processes allowed to exist in such a pool. If the volume of incoming requests requires the creation of more PHP-FPM processes than the number allowed by the max_children limit, those additional requests are backlogged in a queue to await service.
So when pm.max_children > max_connections (my.cnf) and active users > max_connections, you will get "Too many connections".
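As a purely hypothetical illustration of that mismatch (the pool size is made up; 151 is the default connection limit in recent MySQL 5.x versions):

; fpm/pool.d/www.conf
pm.max_children = 200   ; up to 200 PHP workers, each potentially holding its own DB connection

# my.cnf
max_connections = 151   # the next simultaneous connection attempt is refused

In that situation you either lower pm.max_children, raise max_connections, or (better) make each request finish and release its connection sooner.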
Apologies for the fairly generic nature of the question - I'm simply hoping someone can contribute some suggestions and/or ideas as I'm out of both!
The background:
We run a fairly large (35M hits/month, peak around 170 connections/sec) site which offers free software downloads (strictly legal) and which is written in ASP.NET 2 (VB.NET :( ). We have 2 web servers sat behind a dedicated hardware load balancer, and both servers are fairly chunky machines running Windows Server 2012 Pro 64-bit and IIS 8. We serve extensionless URLs by using a custom 404 page which parses out the requested URL and Server.Transfers appropriately. Because of this particular component, we have to run in classic pipeline mode.
DB-wise we use MySQL, and have two replicated DBs; reads are mainly done from the slave. DB access is via a DevArt library and is extensively cached.
The Problem:
We recently (in the past few months) moved from older servers running Windows Server 2003 and IIS 6. In the process, we also upgraded the DevArt component and MySQL (5.1). Since then, we have suffered intermittent scalability issues, which have become significantly worse as we have added more content. We recently increased the number of programs from 2000 to 4000, and this caused response times to increase from <300ms to over 3000ms (measured with New Relic). This to my mind points to either a bottleneck in the DB (relatively unlikely, given the extensive caching and what we see from DB monitoring) or a badly written query or code problem.
We also regularly see spikes which seem to coincide with cache refreshes which could support the badly written query argument - unfortunately all caching is done for x minutes from retrieval so it can't always be pinpointed accurately.
All our caching uses locks (along the lines of "What is the best way to lock cache in asp.net?"), so it could be that one specific operation is taking a while and backing up requests behind it.
The problem is... I can't find it!! Can anyone suggest from experience some tools or methods? I've tried to load test, I've profiled the code, I've read through it line by line... NewRelic Pro was doing a good job for us, but the trial expired and for political reasons we haven't purchased a full licence yet. Maybe WinDbg is the way forward?
Looking forward to any insight anyone can add :)
It is not a good idea to guess on a solution. Things could get painful or expensive quickly. You really should start with some standard/common triage techniques and make an educated decision.
The standard process for troubleshooting performance problems on a data-driven app goes like this:
Review DB indexes (unlikely) and tune as needed.
Check resource utilization: CPU, RAM. If your CPU is maxed-out, then consider adding/upgrading CPU or optimize code or split your tiers. If your RAM is maxed-out, then consider adding RAM or split your tiers. I realize that you just bought new hardware, but you also changed OS and IIS. So, all bets are off. Take the 10 minutes to confirm that you have enough CPU and RAM, so you can confidently eliminate those from the list.
Check HDD usage: if your disk queue length goes above 1 very often (more than once per 10 seconds), upgrade disk bandwidth or scale out your disk (RAID, multiple MDF/LDFs, DB partitioning). Check this on each MySQL box.
Check network bandwidth (very unlikely, but check it anyway)
Code: a) Consider upgrading to .NET 3.5 (or above). It was designed for better scalability and has much better options for caching. b) Use newer/improved caching. c) Pick through the code for harsh queries and DB usage. I have had really good experiences with Red Gate ANTS, but equivalent products work well too.
And then things get more specific to your architecture, code and platform.
There are also some locking mechanisms for the Application variable, but they are rarely the cause of lockups.
You might want to keep an eye on your app pool recycle statistics. If you have a memory leak (or connection leak, etc.), IIS might seem to freeze when the pool tops out and restarts.
So we have a web application that has 10-12 pages with many POST/GET DB calls. We usually get an Apache crash or some other problem when site traffic reaches 1000 or so concurrent users, which is a very small number; we have upgraded the server with plenty of RAM and resources. Our system admin has done load testing with blitz and other custom scripts and is suggesting we move away from Apache. Some things don't make sense to me: Apache shouldn't be too bad at handling a few thousand concurrent users, considering we have Cloudflare for caching. Here is what he suggested:
1. Replace Apache+mod_fcgi with Nginx+php-fpm, which can make the server handle many more users, and then test it.
or
2. For testing: we need 10-20 servers to run a scenario from. Basically, what is needed is a more complex blitz.io analogue: create one server (which takes all those hours), then just clone it in the cloud and pay for about one hour of testing multiplied by the number of servers needed.
Once again, there are many DB calls and heavy .htaccess use. Also, what makes Nginx better than Apache in this case?
I would check this comparison first. Basically, nginx is event based, so it's able to handle more requests concurrently. However, as the MySQL DB seems to be the choke point here, it's very possible that nginx wouldn't solve all your problems. Perhaps moving to a NoSQL kind of database, that's better at scaling horizontally, would help (if that's feasible).
I wrote a web application using Python and the Flask framework, and set it up on Apache with mod_wsgi.
Today I used JMeter to perform some load testing on this application.
For one web URL:
when I set only 1 thread to send request, the response time is 200ms
when I set 20 concurrent threads to send requests, the response time increases to more than 4000ms(4s). THIS IS UNACCEPTABLE!
I am trying to find the problem, so I recorded the time in before_request and teardown_request methods of flask. And it turns out the time taken to process the request is just over 10ms.
In this URL handler, the app just performs some SQL queries (about 10) against a MySQL database, nothing special.
To test whether the problem is with the web server or the framework configuration, I wrote another method, Hello, in the same Flask application, which just returns a string. It performs perfectly under load: the response time is 13ms with 20-thread concurrency.
And when doing the load test, I executed 'top' on my server; there are about 10 Apache threads, but the CPU is mostly idle.
I am at my wit's end now. Even if the requests were performed serially, the performance should not drop so drastically... My guess is that there is some queuing somewhere that I am unaware of, and there must be overhead besides handling the request.
If you have experience in tuning performance of web applications, please help!
EDIT
Regarding the Apache configuration, I used the worker MPM with the following settings:
<IfModule mpm_worker_module>
StartServers 4
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 50
MaxClients 200
MaxRequestsPerChild 0
</IfModule>
As for mod_wsgi, I tried turning WSGIDaemonProcess on and off (by commenting the following line out); the performance looks the same.
# WSGIDaemonProcess tqt processes=3 threads=15 display-name=TQTSERVER
Congratulations! You found the performance problem - not your users!
Analysing performance problems on web applications is usually hard, because there are so many moving parts, and it's hard to see inside the application while it's running.
The behaviour you describe is usually associated with a bottleneck resource - this happens when there's a particular resource that can't keep up, so queues requests, which tends to lead to a "hockey stick" curve with response times - once you hit the point where this resource can't keep up, the response time goes up very quickly.
20 concurrent threads seems low for that to happen, unless you're doing a lot of very heavy lifting on the page.
The first place to start is top - while CPU is low, what are memory, disk access, etc. doing? Is your database running on the same machine? If not, what does top say on the database server?
Assuming it's not some silly hardware thing, the next most likely problem is the database access on that page. It may be that one query is returning literally the entire database when all you want is one record (this is a fairly common anti-pattern with ORM solutions); that could lead to the behaviour you describe. I would use the Flask logging framework to record your database calls (start, end, number of records returned), and look for anomalies there.
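For instance, a rough sketch of that kind of instrumentation (plain stdlib logging is used here, and the cursor argument stands in for whatever MySQL client your handlers already use, so treat it as a template rather than drop-in code):

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("db_timing")

def timed_query(cursor, sql, params=()):
    """Run one query, log its duration and row count, and return the rows."""
    start = time.time()
    cursor.execute(sql, params)
    rows = cursor.fetchall()
    log.info("query %r: %.1f ms, %d rows",
             sql, (time.time() - start) * 1000.0, len(rows))
    return rows

Routing each of the ~10 queries through a helper like this makes the slow one (or the one returning thousands of rows) stand out immediately in the log.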
If the database is performing well under load, it's either the framework or the application code. Again, use logging statements in the code to trace the execution time of individual blocks of code, and keep hunting...
It's not glamorous, and can be really tedious - but it's a lot better that you found this before going live!
Look at using New Relic to identify where the bottleneck is. See an overview of it and a discussion of identifying bottlenecks in my talk:
http://lanyrd.com/2012/pycon/spcdg/
Also edit your original question and add the mod_wsgi configuration you are using, plus whether you are using Apache's prefork or worker MPM, as you could be doing something non-optimal there.
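For reference, a typical daemon-mode layout looks something like the following (the process/thread counts and the script path are placeholders, not a recommendation for this particular app):

WSGIDaemonProcess tqt processes=2 threads=15 display-name=TQTSERVER
WSGIProcessGroup tqt
WSGIScriptAlias / /path/to/app.wsgi

Note that without a matching WSGIProcessGroup (or an equivalent process-group option), the application keeps running inside Apache's own embedded processes even if a WSGIDaemonProcess line is present, so commenting that line in and out may not have been changing anything.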
We are a Web 2.0 company that built a hosted content management solution from the ground up using LAMP. In short, people log into our backend to manage their website content and then use our API to extract that content. This API gets plugged into templates that can be hosted anywhere on the interwebs.
Scaling for us has progressed as follows:
Shared hosting (1and1)
Dedicated single server hosting (Rackspace)
1 Web Server, 1 DB Server (Rackspace)
1 Backend Web Server, 1 API Web Server, 1 DB Server
Memcache, caching, caching, caching.
The question is, what's next for us? Every time one of our sites is dugg or mentioned on a popular website, our API server gets crushed with too many connections. Or every time our DB server gets overrun with queries, our web server requests back up.
This is obviously the 'next problem' for any company like ours and I was wondering if you could point me in some directions.
I am currently attracted to the virtualization solutions (like EC2) but need some pointers on what to consider.
What/where/how to scale is dependent on what your issues are. Since you've been hit a few times, and you know it's the API server, you need to identify what's actually causing the issue.
Is it DB lookup times?
A volume of requests that the web server just can't handle, even though they're short-lived?
API requests taking too long to process (independent of DB lookups, e.g., does the code itself take a while to run)?
Once you identify WHAT the problem is, you should have a pretty clear picture of what you need to do. If it's just the volume of requests, and it's the API server, you just need more web servers (and code changes to allow horizontal scaling) or a beefier web server. If it's API requests taking too long, you're looking at code optimizations. There's never a one-shot fix when it comes to scalability.
The most common scaling issues have to do with slow (2-3 second) execution of the actual code for each request, which in turn leads to more web servers, which leads to more database interactions (for cross-server sessions, etc.), which leads to database performance issues. The fix is high-performance, server-independent code backed by memcache. I actually prefer a wrapper around memcache, so the application doesn't know or care where it gets the data from, just that it gets it; the translation layer handles the DB/memcache lookups as well as populating memcache.
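A bare-bones sketch of such a wrapper (assuming the python-memcached client; load_from_db stands for whatever function your app already uses to hit the database):

import memcache

client = memcache.Client(['127.0.0.1:11211'])

def get_cached(key, load_from_db, ttl=300):
    """Cache-aside lookup: callers never know whether the value came from
    memcache or from the database."""
    value = client.get(key)
    if value is None:
        value = load_from_db(key)    # miss: fall through to the database
        client.set(key, value, ttl)  # populate the cache for next time
    return value

A real version also needs to handle values that are legitimately None or empty, but the point is that the rest of the application only ever calls get_cached().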
Depends really if your bottleneck is reads or writes. Scaling writes is much harder than reads.
It also depends on how much data you have in the database.
If your database is small but cannot cope with the read load, you can add enough RAM that it fits entirely in RAM. If it still cannot cope, you can add read replicas, possibly on the same boxes as your web servers; this will give you good read scalability - the number of slaves you can run from one MySQL master is quite high and will depend chiefly on the write workload.
If you need to scale writes, that's a totally different game. To do that you'll need to split your data out, either horizontally (partitioning / sharding) or vertically (functional partitioning etc) so that you can spread the workload over several write servers which do not need to do each others' work.
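As a toy illustration of the horizontal case (the shard list and the user-id key are hypothetical, and real sharding also has to deal with resharding, cross-shard queries, and so on):

# Each entry stands in for one MySQL write master owning a slice of the data.
WRITE_SHARDS = [
    "db-write-1.internal",
    "db-write-2.internal",
    "db-write-3.internal",
]

def shard_for(user_id):
    """Route a write to the server responsible for this user."""
    return WRITE_SHARDS[user_id % len(WRITE_SHARDS)]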
I'm not sure what EC2 can do for you, it essentially offers slow, high latency machines with nonpersistent discs and low IO performance on the end of a more-or-less nonexistent SLA. I guess it might be useful in your case as you can provision them relatively quickly - provided you're just using them as read-replicas and you don't have too much data (remember they have nonpersistent discs and sucky IO)
What is the level of scaling you are looking for? Is it a stop-gap solution e.g. scale vertically? If it is a more strategic scaling project, does your current architecture support scaling horizontally?