I have a weekly backup job that runs against the MySQL database for one of my websites (ccms). The backup is about 1.2 GB and takes about 30 minutes to run.
While this backup runs, all my other Railo websites cannot connect to their databases and go "down" for the duration of the backup.
One of the errors I have managed to catch was:
"[show] railo.runtime.exp.RequestTimeoutException: request (:119) is run into a
timeout (1200 seconds) and has been stopped. open locks at this time (c:/railo/webapps/root/ccms/parsed/photo.view.cfm,
c:/railo/webapps/root/ccms/parsed/profile.view.cfm, c:/railo/webapps/root/ccms/parsed/album.view.cfm,
c:/railo/webapps/root/ccms/parsed/public.dologin.cfm)."
What I believe is happening is that the tables required for those pages (the "ccms" website) are being locked due to the backup, which is fair enough.
But why is that causing the other Railo websites to time out? For example, the error I pasted above was actually taken from a different website, not the "ccms" website it references. Any website I try to run fails and throws an error that references the "ccms" website, which is the one being backed up. How do I avoid this?
Any insight would be greatly appreciated.
Thanks
One possibility: because your timeout appears to be 20 minutes, each time a request comes in to the site which IS being backed up, that thread blocks waiting on the DB.
Railo has a pool of worker threads to handle requests and now one of them is tied up. As requests continue to come in, any requests to the affected site tie up another thread. Eventually there are no more workers in the pool and all subsequent requests are queued up to be processed once workers become available.
I'm not an expert on debugging Railo, but the above seems plausible to me. You could consider running separate Railo processes for the different sites, which would isolate them, or drastically lowering your request timeout (if that's acceptable).
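If the ccms tables are InnoDB, another angle (an assumption on my part, since you haven't said how the backup is taken) is to make the backup itself non-blocking by dumping inside a consistent snapshot instead of taking table locks:

    mysqldump --single-transaction --quick ccms > ccms_backup.sql

Note that --single-transaction only avoids locking for transactional engines; any MyISAM tables would still be locked for the duration of the dump.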
When doing load testing on my application, the AWS RDS CPU is hitting 100% and the corresponding requests are erroring out. The RDS instance is an m4.2xlarge. With the same configuration, things were fine until 2 weeks ago; there have been no infrastructure changes to the environment and no application-level changes. The whole load test used to run smoothly for the full 2 hours. There is no specific exception apart from GENERICJDBCEXCEPTION.
All other necessary services are up and running on their respective instances.
We are using SQL as the database management system.
Is there any chance of this happening suddenly like that? How can we resolve it? Suggestions are much appreciated; this has created many problems.
Monitoring the slow logs and resolving those queries did not solve the problem.
Should we upgrade RDS to the next version?
Does more data in the DB slow the database down?
We have also modified the connection pool parameters and tried that.
With "load testing", are you able to finish one day's work in one hour? That sounds great! Or what do you mean by "load testing"?
Or are you trying to launch 200 threads in one second and they are stumbling over each other? That's to be expected. Do you really get 200 new connections in a single second? Or is it spread out?
1 million queries per day is no problem. A million queries all at once will fail.
Do not let your "load test" launch more threads than you can reasonably expect. They will all pile up, and latency will suffer while the server is giving each thread an equal chance.
Meanwhile, use the slow log to find the "worst" queries in production. Then let's discuss the worst one or two -- often an improved index makes that query run much faster, so it no longer contributes to the train wreck.
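If the slow log isn't enabled yet, you can usually switch it on at runtime (MySQL 5.1 and later) without a restart; these are the stock variable names:

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;  -- log anything slower than 1 second
    SHOW VARIABLES LIKE 'slow_query_log_file';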
This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have a web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unusual usage and load pattern. It is a mobile web application. Every full hour, a cronjob looks through the database for users that have information waiting or an action to take, and sends this information to an (external) notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application has really started to grow. In the last few days we have encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower, and often takes 20 minutes to recover - until the same thing starts again at the full hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16 core Intel Xeon (8 cores with hyperthreading, I think) and 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is Version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
[Linked graph images: phpinfo(), APC status, and memcached status pages; collectd graphs of processes, CPU, Apache, load, MySQL, vmem, and disk; New Relic graphs of application performance, server overview, processes, network, and disks.]
(Sorry the graphs are GIFs and not from the same time period, but I think the most important info is in there.)
The problem is almost certainly MySQL-based. If you look at the final graph, mysql/mysql_threads, you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you can just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of queued processes with only one currently executing. That one is the most likely culprit.
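For example, from a session opened before the top of the hour:

    SHOW VARIABLES LIKE 'max_connections';  -- confirm the 200 ceiling
    SHOW FULL PROCESSLIST;                  -- see what every connection is doing

SHOW FULL PROCESSLIST avoids truncating the query text, so you can see exactly which statement the one running process is stuck on.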
Having identified the query causing the problems, you can attack your code. Without understanding how your application actually works, my best guess is that wrapping the problem query(ies) in an explicit transaction will probably solve the problem.
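As a sketch only (I'm inventing the table and column names, since I don't know your schema), the idea is to have the hourly job claim its rows in one transaction rather than in many separate statements:

    START TRANSACTION;
    SELECT id FROM users WHERE notify_at <= NOW() FOR UPDATE;
    -- ... queue the notifications for the selected users ...
    UPDATE users SET notify_at = notify_at + INTERVAL 1 HOUR WHERE notify_at <= NOW();
    COMMIT;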
Good luck!
Recently I have started getting MySQL "too many connections" errors at times of high traffic. My Rails app runs on a Mongrel cluster with 2 instances on a shared host. Some recent changes that might be driving it:
- Traffic to my site has increased. I am now averaging about 4K pages a day.
- Database size has increased. My largest table has ~100K rows.
- Some associations could return several hundred instances in the worst case, though most are far less.
- I have added some features that increased the number and size of database calls in some actions.
I have done a code review to reduce database calls, optimize SQL queries, add missing indexes, and use :include for eager loading. However, many of my methods still make 5-10 separate SQL calls. Most of my actions have a response time of around 100ms, but one of my most common actions averages 300-400ms, and some actions randomly peak at over 1000ms.
The logs are of little help, as the errors seem to occur randomly, or at least the pattern does not appear related to the actions being called or data being accessed.
Could I alleviate the error by adding more Mongrel instances? Or are the MySQL connections limited by the server, and thus unrelated to the number of processes I divide my traffic across?
Is this most likely a problem with my coding, or should I be pressing my host for more capacity/less load on the shared server?
ActiveRecord has pooled database connections since Rails 2.2, and it's likely that that's what's causing your excess connections here. Try turning down the value of pool in your database.yml for that environment (it defaults to 5).
Docs can be found here.
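For example, in config/database.yml (the adapter and database names here are placeholders):

    production:
      adapter: mysql
      database: myapp_production
      pool: 2  # down from the default of 5; a single-threaded Mongrel needs very few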
Are you caching anything? It's an important part of alleviating application and database load. The Rails Guides have a section on caching.
Something is wrong. A Mongrel instance processes one request at a time, so if you have 2 Mongrel instances then you should not be seeing more than 2 active MySQL connections (from the Mongrels, at least).
You could log or graph the output of SHOW STATUS LIKE 'Threads_connected' over time.
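A crude way to capture that over time (a sketch; it assumes the mysql client can log in without prompting, e.g. via ~/.my.cnf):

    # append a timestamped connection count every 10 seconds
    while true; do
      { date; mysql -N -e "SHOW STATUS LIKE 'Threads_connected';"; } >> threads.log
      sleep 10
    done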
PS: this is not very many Mongrels. If you want to be able to service more than 2 simultaneous requests, you'll want more. ...if memory is tight, you can switch to Phusion Passenger and REE.
Is there any software that does "lazy" deletion of rows from a table? I would like to do maintenance on my tables when my server is idle, and ideally I should be able to define what "idle" means (number of database connections / system load / requests per second). Is there anything remotely similar to this?
If you are on a Linux server, you can make your table cleanup scripts run only based on the output of the command "w", which shows the system load average. If the load is under, say, 0.25, you can run your script. Do this with shell scripting, along the lines of the sketch below.
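A minimal sketch of that gate (the threshold, database, and table names are made up for illustration):

    #!/bin/sh
    # take the 1-minute load average from w's header line
    LOAD=$(w | head -1 | awk -F'load average: ' '{print $2}' | cut -d, -f1)
    # only run the cleanup when the load is below 0.25
    if [ "$(echo "$LOAD < 0.25" | bc)" -eq 1 ]; then
        mysql mydb -e "DELETE FROM my_table WHERE is_deleted = 1 LIMIT 1000;"
    fi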
To some degree, from an internal perspective InnoDB already does this. Rows are initially marked as deleted, but only made free as part of a background operation.
My advice: you can get into needlessly complicated problems if you try to check whether the server is idle first, i.e.:
What if it was idle, but the cleanup takes 2 minutes? During those 2 minutes the server load could peak.
What if the server never becomes idle enough? Now you just have an unlimited backlog.
If you just run the task in the background you might improve performance enough, since at least no users will be sitting in front of web pages waiting for it to complete. Look at activity graphs to find the best time to schedule it (3am, 5am, etc.), as in the cron example below.
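An illustrative crontab entry (the script path is hypothetical) for a 3am run:

    0 3 * * * /usr/local/bin/table_cleanup.sh >> /var/log/table_cleanup.log 2>&1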
There is an action in the admin section of a client's site, say Admin::Analytics (which I did not build but have to maintain), that compiles site usage analytics by performing a couple dozen rather intensive database queries. This functionality has always been a bottleneck to application performance whenever the analytics report is compiled. But the bottleneck has become so bad lately that, when accessed, the site comes to a screeching halt and hangs indefinitely. Until yesterday I never had a reason to run the "top" command on the server, but doing so I realized that Admin::Analytics#index causes mysqld to spin at upwards of 350% CPU on the quad-core production VPS.
I have downloaded fresh copies of the production data and the production log. However, when I access Admin::Analytics#index locally on my development box using the production data, it loads in about 10-12 seconds (and utilizes upwards of 150% of my dual-core CPU), which sadly is normal. I suppose there could be a discrepancy in MySQL settings that has suddenly come into play. Also, a mysqldump of the database is now 531 MB, when it was only 336 MB 28 days ago. Anyway, I do not have root access on the VPS, so tweaking mysqld's performance would be cumbersome, and I would really like to get to the exact cause of this problem. However, the production logs don't contain info on the queries; they merely report how long these requests took, which averages out to a few minutes apiece (although they seem to have caused mysqld to stall for much longer than that; in one instance I had to ask our host to reboot mysqld just to get our site back up).
I suppose I could try upping the log level in production to get info on the database queries performed by Admin::Analytics#index, but I'm afraid to replicate this behavior in production because I don't feel like calling our host up to restart mysqld again! This action contains a single database request in its controller, and a couple dozen prepared statements embedded in its view!
How would you proceed to benchmark/diagnose and optimize/fix this action?!
(Aside: Obviously I would like to completely replace this functionality with Google Analytics or a similar solution, but I need to fix this problem before proceeding.)
I'd recommend taking a look at this article:
http://axonflux.com/building-and-scaling-a-startup
Particularly, query_reviewer and newrelic have been life-savers for me.
I appreciate all the help with this, but the fix turned out to be adding a couple of indexes on the analytics table to support the queries in this action. A simple Rails migration added the indexes, and the action now loads in less than a second, both on my dev box and on prod!
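A sketch of that kind of migration (the column names here are illustrative guesses, not the actual schema):

    class AddIndexesToAnalytics < ActiveRecord::Migration
      def self.up
        add_index :analytics, :user_id
        add_index :analytics, :created_at
      end

      def self.down
        remove_index :analytics, :user_id
        remove_index :analytics, :created_at
      end
    end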