I've written a PHP script that runs via SSH and nohup, meant to process records from a database and do stuff with them (e.g. process some images, update some rows).
It works fine with small loads, up to maybe 10k records. I have some larger datasets that process around 40k records (not a lot, I realize, but it adds up to a lot of work when each record requires the download and processing of up to 50 images).
The larger datasets can take days to process. Sometimes I'll see memory errors in my debug logs, which are clear enough-- but sometimes the script just appears to "die" or go zombie on me. The tail of the debug log just stops with no error messages, the tail of the nohup log ends with no error, and the process still shows up in a ps listing, looking like this--
26075 pts/0 S 745:01 /usr/bin/php ./import.php
but no work is getting done.
Can anyone give me some ideas on why a process would just quit? The obvious things (like a PHP script timeout and memory issues) are not a factor, as far as I can tell.
Thanks for any tips
PS-- this is hosted on a GoDaddy VDS (not my choice). I suspect GoDaddy has some kind of limits that kick in on me despite the overrides I put in the code (such as set_time_limit(0);).
Very likely the OOM killer. If you really, really want to stay out of its reach, have your process (as root) write -17 to /proc/self/oom_adj. Caution: the kernel usually knows better. Evading the OOM killer can actually cripple the same RDBMS that you are trying to query. What a vicious cycle that would be :)
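A minimal sketch of that tweak in PHP, assuming the script runs as root. Note that oom_adj is the legacy interface; newer kernels expect /proc/self/oom_score_adj, where -1000 is the equivalent "never kill me" value:

<?php
// Ask the kernel's OOM killer to spare this process. Requires root.
// -1000 in oom_score_adj (newer kernels) or -17 in oom_adj (legacy) means "never kill".
if (@file_put_contents('/proc/self/oom_score_adj', '-1000') === false &&
    @file_put_contents('/proc/self/oom_adj', '-17') === false) {
    error_log('Could not adjust OOM score; not running as root?');
}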
You probably (instead) want to stagger queries based on what you read from /proc/loadavg and /proc/meminfo. If load or swap usage keeps climbing, you need to back off (exponentially), especially as a background process :)
Additionally, monitor iowait while you run. It can be derived from /proc/stat (the counters there are cumulative since the system booted). Note it when you start and as you progress.
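A rough sketch of that kind of self-throttling in PHP; the thresholds and sampling interval are made-up numbers you would tune for your own box:

<?php
// Back off while the box is under pressure. All thresholds are illustrative only.
function systemIsBusy()
{
    // 1-minute load average is the first field of /proc/loadavg.
    $load = (float) strtok(file_get_contents('/proc/loadavg'), ' ');

    // MemAvailable (kB) from /proc/meminfo; treat it as "plenty" if the field is missing.
    preg_match('/^MemAvailable:\s+(\d+)/m', file_get_contents('/proc/meminfo'), $m);
    $availKb = isset($m[1]) ? (int) $m[1] : PHP_INT_MAX;

    return $load > 4.0 || $availKb < 200 * 1024; // busy if load > 4 or under ~200 MB free
}

// iowait percentage between two samples of the cumulative counters in /proc/stat.
function ioWaitPercent($seconds = 5)
{
    $sample = function () {
        // First line looks like: "cpu  user nice system idle iowait irq softirq steal ..."
        $fields = preg_split('/\s+/', trim(strtok(file_get_contents('/proc/stat'), "\n")));
        array_shift($fields); // drop the "cpu" label
        return array_map('intval', $fields);
    };
    $a = $sample();
    sleep($seconds);
    $b = $sample();
    $total = array_sum($b) - array_sum($a);
    return $total > 0 ? 100.0 * ($b[4] - $a[4]) / $total : 0.0; // index 4 = iowait
}

$delay = 1;
while (systemIsBusy() || ioWaitPercent() > 25.0) {
    sleep($delay);
    $delay = min($delay * 2, 300); // exponential back-off, capped at 5 minutes
}
// ...safe to process the next batch of records here...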
Unfortunately, the serial killer known as the OOM killer does not maintain a body count that is accessible beyond parsing kernel messages.
Or, your cron job keeps hitting its ulimit'ed amount of allocated heap. Either way, your job needs to back off when appropriate, or prevent its own demise (as noted above) before doing any work.
As a side note, you probably should not be doing what you are doing on shared hosting. If it's that big, it's time to get a VPS (at least) where you have some control over which process gets to do what.
Related
I've got a weird situation. The first time I hit an embedded web server (uClinux/Boa) at 10.1.10.29, I get a 10-second delay in the browser window before things start happening. "First time" means I haven't hit the machine in a few days. Browser type/OS doesn't matter (the source is 10.1.10.20).
I've got a Wireshark capture of it happening.
And here is the detail of frame 296:
Note that packet 374 doesn't pop out until around 10 seconds after 296. The packets between those two aren't from the machine in question. It just sits there for 10 seconds and then decides to retransmit. How's it supposed to work?
The main reason is most certainly that the code was swapped out of memory.
MS-Windows is really bad in that regard. If a program does not get used for "too long", it gets swapped out of memory. Period. When you come back to it, it has to be re-read from the hard drive.
The one good thing about Windows doing that (and the main reason it does) is that it keeps kernel memory defragmented. For that, it is good.
You have similar problems under Linux, but only if your server actually needs the memory. In other words, if you have tons of processes and they all fight for as much memory as possible, then the least-used software is likely to be swapped out. Otherwise it will stay in place.
If you were to use the Cassandra database system, you would notice this on any computer that runs anything other than Cassandra. If you run just Cassandra, it remains fast all the time. If you run other software that uses a lot of memory, Cassandra is slow on first access. This is particularly noticeable.
I want to add the answer that solved our problem, which was the same: a 10-second delay on first access, then everything working, and after 5 minutes of inactivity another 10-second delay.
First of all, we wiresharked everything and tried to find some kind of error in the code, or in the way the computer or server handled the network traffic. We found nothing out of the ordinary.
After much searching we found it was a DNS "problem". In the DNS server that the client computer used, there were dual entries for the domain name of the server. One was correct and one (the first one in the list) was wrong.
So removing the wrong DNS pointer solved the problem.
This means the problem was that the computer tried the first address it got, waited 10 seconds for a reply, didn't get one, and moved on to the second address in line. This creates no error messages because this is how DNS is supposed to work. And that is why all our Wireshark logs showed nothing but a 10-second wait with no error and no reason, then the connection springing to life and working for as long as the DNS record is valid (5 minutes in our case), after which the whole procedure repeats.
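For anyone hitting the same thing, a quick way to check for dual entries without firing up Wireshark is to list every A record the resolver returns; the hostname below is just a placeholder:

<?php
// List every A record the resolver hands back for a name.
// Two addresses where you expect one is exactly the dual-entry situation described above.
$host = 'server.example.local'; // placeholder: use your server's hostname
foreach (dns_get_record($host, DNS_A) as $record) {
    echo $record['ip'], PHP_EOL;
}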
Hope this helps someone who has a similar problem.
Can someone explain this Joomla error to me and what to do to fix it? It shows up with debug on. We seem to have a memory leak; MySQL runs slowly, then the site crashes.
Function Location
JSite->dispatch() /home/greatfam/public_html/index.php:52
JComponentHelper::renderComponent() /home/greatfam/public_html/includes/application.php:197
JComponentHelper::executeComponent() /home/greatfam/public_html/libraries/joomla/application/component/helper.php:351
require_once() /home/greatfam/public_html/libraries/joomla/application/component/helper.php:383
JController->execute() /home/greatfam/public_html/components/com_content/content.php:16
ContentController->display() /home/greatfam/public_html/libraries/joomla/application/component/controller.php:761
JController->display() /home/greatfam/public_html/components/com_content/controller.php:74
ContentViewArticle->display() /home/greatfam/public_html/libraries/joomla/application/component/controller.php:722
JView->get() /home/greatfam/public_html/components/com_content/views/article/view.html.php:32
ContentModelArticle->getItem() /home/greatfam/public_html/plugins/system/jat3/jat3/core/joomla/view.php:348
JError::raiseError() /home/greatfam/public_html/components/com_content/models/article.php:172
JError::raise() /home/greatfam/public_html/libraries/joomla/error/error.php:251
A nice and easy way to get started is to turn Joomla debug on in the Global Configuration.
Then reload the frontpage, and examine closely the output at the bottom of the page.
There you will find the details of the memory used by each module, and the list of queries run. This will give you a head start and limit the number of items you need to debug (there will be a single module eating up all your memory).
If "after dispatch" is taking too long, then it could be either a plugin or the component being shown on the page.
If nothing "notable" shows up here (a lot of queries, more than 50, or high memory consumption, or long time for a single item, you might want to look at the apache error_log and mysql log and verify system limits.
In my Magento store there is a piece of code that is not closing the DB connection (it seems). Normally I don't see any problems. However, the problem arises when I enable caching. After a few hours the site grinds to a halt and eventually dies with the MySQL error "Too many connections".
It seems that the "bad" piece of code is cached and reused and therefore gets worse and worse until...death.
I'm scratching my head to find out where this rogue piece of code is being called from (of course there could be more than one problem).
I doubt it's a core problem, otherwise I probably would have heard of it whilst Googling the issue. So that just leaves 3rd-party modules and code I have written.
One option would be to disable all modules and re-enable them one by one until the problem occurs. But of course it might not happen for several hours (and when it does, you just know it will be in the middle of the night ;)). Then again, it might not be a 3rd-party issue but something I have done. And I also need certain modules for the store to function properly (payment gateways etc.).
So I'm after suggestions on how to track this down...
I've enabled MySQL logging but it doesn't really tell me all that much.
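For reference, the kind of tracking I have in mind is logging the connection count and process list every minute from cron, so it can be lined up against the access log afterwards; a rough sketch (credentials and the log path are placeholders):

<?php
// Rough sketch: append the current MySQL connection count and process list to a log,
// so spikes can be correlated with the Apache access log later. Credentials are placeholders.
$pdo = new PDO('mysql:host=localhost', 'monitor_user', 'secret');

$status = $pdo->query("SHOW GLOBAL STATUS LIKE 'Threads_connected'")->fetch(PDO::FETCH_ASSOC);
$line   = date('c') . ' connections=' . $status['Value'] . "\n";

foreach ($pdo->query('SHOW FULL PROCESSLIST', PDO::FETCH_ASSOC) as $p) {
    // Long-lived "Sleep" connections that never go away usually point at the unclosed handle.
    $line .= "  {$p['Id']} {$p['User']} {$p['db']} {$p['Command']} {$p['Time']}s\n";
}
file_put_contents('/tmp/mysql-connections.log', $line, FILE_APPEND);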
Any ideas?
Magento 1.7.0.2 and APC with Apache 2
A Django site (hosted on Webfaction) that serves around 950k pageviews a month is experiencing crashes that I haven't been able to figure out how to debug. At unpredictable intervals (averaging about once per day, but not at the same time each day), all requests to the site start to hang/timeout, making the site totally inaccessible until we restart Apache. These requests appear in the frontend access logs as 499s, but do not appear in our application's logs at all.
In poring over the server logs (including those generated by django-timelog) I can't seem to find any pattern in which pages are hit right before the site goes down. For the most recent crash, all the pages that were hit right before the site went down seem to be standard render-to-response operations using templates that look pretty straightforward and work well the rest of the time. The requests right before the crash do not seem to take longer according to timelog, and I haven't been able to replicate the crashes intentionally via load testing.
Webfaction says it isn't a case of overrunning our allowed memory usage, or else they would notify us. One thing to note is that the database is not being restarted (just the app/Apache) when we bring the site back up.
How would you go about investigating this type of recurring issue? It seems like there must be a line of code somewhere that's hanging - do you have any suggestions about a process for finding it?
I once had some issues like this, and it basically boiled down to my misunderstanding of thread-safety within Django middleware. Django middleware is, I believe, a singleton shared among all threads, and those threads were thrashing the values set on a custom middleware class I had. My solution was to rewrite my middleware so that it did not use instance or class attributes that changed, and to switch the critical parts of my application away from threads entirely in my uWSGI server, since threads seemed to be an overall performance downside for my app. Threaded uWSGI setups seem to work best when you have views that complete at different intervals (some long-running views and some fast ones).
Since you can't really describe what the failure conditions are until you can replicate the crash, you may need to force the situation with ab (ApacheBench). If you don't want to do this against your production site, you might replicate the site in a subdomain. Warning: ab can beat the ever-loving crap out of a server, so RTM. You might also want to give the WF admins a heads-up about what you are going to do.
Update for comment:
I was suggesting using the exact same machine so that the subdomain name was the only difference. Given that you used a different machine, there are a large number of subtle (and not-so-subtle) environmental differences that could keep the error from manifesting. If the new machine is OK, and if you are willing to walk away from the problem without actually solving it, you might simply make it your production machine and be happy. Personally I tend to obsess about stuff like this, but then again I'm also retired and have plenty of time to play with my toes. :-)
We have an MS Access 2003 ADP application with SQL Server. Sometimes, without any apparent reason, this application starts consuming 100% of CPU time (50% on a dual-core CPU system). This is what Windows Task Manager and other process monitoring/analysis tools are showing, anyway. Usually, the only way to stop such CPU thrashing is to restart the application.
We still don't know how to trigger this problem at will. But I have a feeling that it usually happens when some of the forms get closed by a user.
NB: Recently we noticed that one of the forms consistently makes CPU usage rise to 100% whenever it gets minimized. Most of the time CPU usage goes back to normal when that form is "un-minimized". Perhaps it's a different problem, but we'd like to uncover this mystery, too. :)
Googling for a solution to this problem didn't yield very good results. The most frequent theory is that MS Access gets into some sort of waiting-for-events loop which is practically harmless, performance-wise, because the thread running that loop has very low priority. This doesn't seem to help us because in our case (a) it certainly does hurt the system's performance and (b) it's still unclear what exactly makes Access get into such a "bad state" and how to avoid it.
I've run into this CPU usage problem in the past, but I don't remember whether we ever discovered a solution or it just went away at some point.
In your post, you didn't mention reviewing the VBA. I'd recommend looking for a loop that, under certain conditions, becomes an endless loop.
I wonder if it is a hangover from this problem that Access used to have in the "old" days:
http://support.microsoft.com/kb/160819
Whilst the article does say it is fixed in versions >= 2000, it still might be something.