I have a PHP script that runs once a day, and it takes a good 30 minutes to run (I think). Everything it does is a safe and secure operation. I keep getting a 500 error after about 10-15 minutes. However, I can't see anything in the logs, so I'm a bit confused.
So far the things I set up as "unlimited" are:
max_execution_time
max_input_time
default_socket_timeout
I've also set these to obscenely high numbers, just for this section (the folder in which the script runs):
memory_limit
post_max_size
The script is a SOAP-type API import: it pulls thousands of rows of data from a third-party URL, puts them into a local MySQL table, and then downloads the images attached to each and every row, so the amount of data is significant.
I'm trying to figure out what other PHP variables etc. I'm missing in order to get this to run all the way through. Other PHP vars I have set:
display_errors = On
log_errors = On
error_reporting = E_ALL & ~E_NOTICE & ~E_WARNING
error_log = "error_log"
There are three timeouts:
PHP level: set_time_limit
Apache level: Timeout
MySQL level: MySQL options
In your case it seems like Apache reached its timeout. In such a situation it is better to use the PHP CLI. But if you really need to do this operation in real time, you can make use of Gearman, which will give you true parallelism in PHP.
If you need a simple solution that triggers your script from a normal HTTP request (Browser -> Apache), you can run your back-end (CLI) script as a shell command from PHP, but 'asynchronously'. More info can be found in Asynchronous shell exec in PHP.
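Something along these lines, where the script and log paths are just placeholders: redirecting output and appending & lets exec() return immediately instead of waiting for the command to finish.

<?php
// Fire off the long-running CLI script in the background and return immediately.
// The script and log paths are hypothetical; adjust them to your setup.
$cmd = '/usr/bin/php /var/www/scripts/daily_import.php';
exec(sprintf('%s >> /var/log/daily_import.log 2>&1 &', $cmd));
echo "Import started in the background.\n";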
Try using the PHP command-line interface (php-cli) for lengthy tasks. Execution time is unlimited on the command line unless you set a limit or terminate the script yourself. You can also schedule it with a cron job.
Run it from the command line with PHP (e.g. php yourscript.php) and this error shouldn't occur. It's also not a good idea to use set_time_limit(0); you should at most use set_time_limit(86400). You can set a cron job to do this once per day (a sample crontab line is shown below). Just make sure that all file paths in the script are absolute and not relative so it doesn't get confused.
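A sample crontab line for a daily 03:00 run (the script and log paths are placeholders, and the PHP binary may live elsewhere on your system):

0 3 * * * /usr/bin/php /path/to/yourscript.php >> /var/log/yourscript.log 2>&1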
Compiling the script might also help. HipHop is a great PHP compiler; your script will run faster, use less memory, and can use as many resources as it likes. HipHop is just very difficult to install.
If the execution time is the problem, then maybe you should raise max_execution_time using the set_time_limit function inside the script:
set_time_limit(0);
I would also invoke the script on the command line using php directly, instead of through Apache. In addition, print out some status messages and pipe them into a log.
I suspect that your actual problem is that the script chokes on bad data somewhere along the line.
I have to execute a task twice a week. The task consists of fetching a 1.4 GB CSV file from a public FTP server. Then I have to process it (apply some filters, discard some rows, make some calculations) and sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry from the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is memory. I have searched for solutions such as https://github.com/goodby/csv, which handle this issue by fetching row by row instead of pulling everything into memory, but they do not work if I try to read the .csv directly from the FTP server.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the CSV size and the memory limitations?
You won't be able to stream the file directly from FTP; instead, you are going to have to copy the entire file and store it locally. Using the curl or ftp command is likely the most efficient way to do this.
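For example, a single curl invocation can pull the whole file down before processing starts (the host, path, and credentials below are placeholders):

curl -o /tmp/feed.csv --user username:password ftp://ftp.example.com/exports/feed.csv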
Once you do that, you will need to write some kind of program that reads the file a line at a time, or several lines at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work and your parallelization options are limited.
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
The FTP protocol doesn't do "streaming". You cannot read the file from FTP chunk by chunk.
Honestly, downloading the file and spinning up a bigger instance is not a big deal if you only run it twice a week. Just choose an r3.large (it costs less than $0.20/hour), run the job as soon as possible, and stop the instance. The local instance-store SSD should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code-deployment mechanism). And you must make sure RDS can handle the burst I/O, otherwise it will become the bottleneck.
Even better, with an r3.large instance you can split the CSV file into smaller chunks, load them in parallel, and shut down the instance after everything finishes. You just need to pay the minimal root EBS storage cost afterwards.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short, fast processing (it will terminate after 300 seconds).
(update):
If you open a file, the simple way to parse it is to read it sequentially, which may not put the whole CPU to full use. You can split up the CSV file following the answer referenced here.
Then, using the same script, you can process the pieces simultaneously by sending some of them to background processes; the example below shows putting Python processes in the background under Linux.
parse_csvfile.py csv1 &
parse_csvfile.py csv2 &
parse_csvfile.py csv3 &
So instead of sequential I/O on a single file, it will make use of multiple files. In addition, splitting the file should be a snap on an SSD.
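parse_csvfile.py above is just a placeholder name; as a rough sketch under assumed details (a hypothetical products table keyed by sku, psycopg2 as the Postgres driver, placeholder connection settings), each worker might look something like this:

#!/usr/bin/env python
import csv
import sys

import psycopg2  # assumed driver for the RDS Postgres instance


def main(chunk_path):
    # Connection details are placeholders for the poster's RDS instance.
    conn = psycopg2.connect(host="your-rds-endpoint", dbname="shop",
                            user="app", password="secret")
    cur = conn.cursor()
    with open(chunk_path, newline="") as fh:
        # Assumes every chunk file keeps the CSV header row.
        for row in csv.DictReader(fh):
            cur.execute("SELECT 1 FROM products WHERE sku = %s", (row["sku"],))
            if cur.fetchone():
                cur.execute("UPDATE products SET price = %s WHERE sku = %s",
                            (row["price"], row["sku"]))
            else:
                cur.execute("INSERT INTO products (sku, price) VALUES (%s, %s)",
                            (row["sku"], row["price"]))
    conn.commit()
    conn.close()


if __name__ == "__main__":
    main(sys.argv[1])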
So I made it work like this.
I used Python and two great libraries. First of all, I wrote Python code to request and download the CSV file from the FTP server so I could load it into memory. The first package is Pandas, a tool for analysing large amounts of data that includes methods for reading CSV files easily; I used its built-in features to filter and sort. I filtered the large CSV by a field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue. I also used Eloquent, a library inspired by Laravel's ORM, which lets you create a connection using the AWS public DNS, database name, username, and password, and make queries using simple methods, without writing a single Postgres query. Finally I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.
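A rough sketch of that kind of chunked filtering with Pandas (not the poster's actual code; the file name and the category column it splits on are illustrative only):

import os
import pandas as pd

# Read the large file in manageable pieces instead of all at once.
for chunk in pd.read_csv("feed.csv", chunksize=100000):
    # Hypothetical split column; append each group to its own smaller CSV.
    for category, rows in chunk.groupby("category"):
        out = "feed_{}.csv".format(category)
        rows.to_csv(out, mode="a", index=False, header=not os.path.exists(out))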
I have a rather complex SQL script that does several things:
Drops an existing database
Reloads the database from a backup
Sets permissions
A few other miscellaneous db-specific tasks
I've saved this as a .sql file for mobility and ease of use, but I'd like to incorporate this into a PowerShell script to make it even simpler to run (and to encapsulate a few other tasks that need to be done around the same process). Here are the bits of code I have in PowerShell for the script:
add-pssnapin sqlservercmdletsnapin100
invoke-sqlcmd -inputfile "D:\Scripts\loaddatabase31_Loadtest.sql" -serverinstance "PerfSQL02" -hostname "PerfSQL02"
Assume that the serverinstance, inputfile, and hostname exist and are put in correctly.
The database I am restoring is a couple hundred gigabytes, so the actual drop and restore process takes around 20 minutes. Right now, I'm getting the following error when I try to run that script (it's being run from a workstation within the same network but on a different domain):
invoke-sqlcmd : Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
At line:2 char:1
I've tried using the -ConnectionTimeout switch on Invoke-Sqlcmd, but that doesn't seem to resolve this issue (so it's probably not timing out on connecting). Most of the examples I've seen of running SQL via PowerShell don't use a file, but instead define the SQL within the script. I'd prefer not to do this due to the complexity of the SQL that needs to run (and since it's dropping and restoring a database, I can't really wrap all of this up in a stored procedure unless I use one of the system DBs).
Anyone have some ideas as to what could be causing that timeout? Does invoke-sqlcmd not work gracefully with files?
You don't want -ConnectionTimeout; you likely want -QueryTimeout.
http://technet.microsoft.com/en-us/library/cc281720.aspx
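Applied to the script in the question, that might look something like the following; the 7200-second value is an arbitrary figure sized well above the ~20-minute restore, since some versions of Invoke-Sqlcmd cap -QueryTimeout at 65535 and don't treat 0 as unlimited:

Add-PSSnapin SqlServerCmdletSnapin100
Invoke-Sqlcmd -InputFile "D:\Scripts\loaddatabase31_Loadtest.sql" `
    -ServerInstance "PerfSQL02" -HostName "PerfSQL02" -QueryTimeout 7200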
I have a Perl script running on a FreeBSD/Apache system, which makes some simple queries to a MySQL database via DBI. The server is fairly active (150k pages a day) and every once in a while (as much as once a minute) something is causing a process to hang. I've suspected a file lock might be holding up a read, or maybe it's a SQL call, but I have not been able to figure out how to get information on the hanging process.
Per Practical mod_perl, it sounds like the way to identify the operation giving me the headache is either a system trace, a Perl trace, or the interactive debugger. I gather the system trace is ktrace on FreeBSD, but when I attach to one of the hanging processes in top, the only output after the process is killed is:
50904 perl5.8.9 PSIG SIGTERM SIG_DFL
That isn't very helpful to me. Can anyone suggest a more meaningful approach on this? I am not terribly advanced in Unix admin, so your patience if I sound stupid is greatly appreciated.... :o)
If I understood correctly, your Perl process is hanging while querying MySQL, which, by itself, is still operational. The MySQL server has an embedded troubleshooting feature for that, the log_slow_queries option. Putting the following lines in your my.cnf enables the trick:
[mysqld]
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 10
After that, restart or reload the MySQL daemon. Let the server run for a while to collect the stats and analyse what's going on:
mysqldumpslow -s at /var/log/mysql/mysql-slow.log | less
On one server of mine, the top record (-s at orders by average query time, BTW) is:
Count: 286 Time=101.26s (28960s) Lock=14.74s (4214s) Rows=0.0 (0), iwatcher[iwatcher]#localhost
INSERT INTO `wp_posts` (`post_author`,`post_date`,`post_date_gmt`,`post_content`,`post_content_filtered`,`post_title`,`post_excerpt`,`post_status`,`post_type`,`comment_status`,`ping_status`,`post_password`,`post_name`,`to_ping`,`pinged`,`post_modified`,`post_modified_gmt`,`post_parent`,`menu_order`,`guid`) VALUES ('S','S','S','S','S','S','S','S','S','S','S','S','S','S','S','S','S','S','S','S')
FWIW, it is a WordPress with over 30K posts.
Ktracing only gives you system calls, signals, I/O and namei processing, and it generates a lot of data very quickly, so it might not be ideal for fishing out trouble spots.
If you can see the standard output for your script, put some strategically placed print statements in your code around suspected trouble spots. Then running the program should show you where the hang occurs:
print "Before query X"
$dbh->do($statement)
print "After query X".
If you cannot see the standard output, either use e.g. the Sys::Syslog Perl module, or call FreeBSD's logger(1) program to write the debugging info to a logfile. It is probably easiest to encapsulate that into a debug() function and use that instead of print statements.
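A minimal sketch of such a debug() helper with Sys::Syslog, reusing the $dbh call from the snippet above (the program name and facility are arbitrary choices):

use Sys::Syslog qw(:standard :macros);

openlog('myscript', 'pid', LOG_LOCAL0);

sub debug {
    my ($msg) = @_;
    # Messages end up wherever syslogd routes local0, e.g. /var/log/messages.
    syslog(LOG_INFO, '%s', $msg);
}

debug('Before query X');
$dbh->do($statement);
debug('After query X');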
Edit: If you don't want a lot of logging on disk, write the logging info to a socket (Sys::Syslog supports that with setlogsock()), and write another script to read from that socket and dump the debug text to a terminal, prefixed with the time the data was received. Once the program hangs, you can see what it was doing.
I have just reloaded my laptop and am now trying to setup my localhost again.
Currently I am trying to re-setup the database.
The issue is that the script is 169,328 KB.
It crashes whatever I use to run the query, and I get the error: MySQL server has gone away.
It seems everyone is suggesting splitting the script. Instead, as this is only me setting my localhost back up, I have temporarily increased max_allowed_packet.
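For reference, that change goes in my.cnf under the [mysqld] section and needs a server restart; the 256M below is just an example sized above the largest statement in the dump:

[mysqld]
max_allowed_packet = 256M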
This would explain the error.
Perhaps you should open the script and see if you can chunk it into smaller, more manageable pieces.
I don't know what you're doing about transactions, but perhaps the rollback segment (or its MySQL equivalent) is getting too large. If that's the case, break the script into several transactions that you can safely commit individually.
If you're looking to avoid the error message, consider one of these remedies:
ensure that your environment or commands aren't causing this issue; see Causes for MySQL gone-away.
split your large script into smaller scripts. You could then run those in sequence.
I'm new to databases and web servers and that kind of thing. So I am looking for information so I can begin to figure out a starting point and options open to me.
I need to have a database that can be accessed by an iPhone app. So logically it will be hosted on a webserver somewhere.
To get/insert the data from/into the database, the app would make an HTTP connection to a PHP file on the same server as the DB, which would then insert/return the relevant data. To stop random hackers messing with the DB, the app would have some validation code inside it to send to the PHP file to check that it's not a hacker trying to mess with the database. Is this all making sense, or will that not be secure enough?
Now the most confusing part to get my head around is :
I need to check every minute whether any data in the database has become too old, and remove it if so. So something needs to be running on the server constantly checking/managing the database. What would this be? What is commonly used to do this kind of thing? Is there some keyword I can start searching and reading about to see what options there are?
Thanks for your advice,
-Code
One way to do this is to have a purge script run via crontab. The script can run every minute and check for old data and remove it.
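A hypothetical crontab entry for that (the PHP script path and log file are placeholders; the purge query itself lives inside the script):

* * * * * /usr/bin/php /var/www/cron/purge_old_rows.php >> /var/log/purge.log 2>&1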
MySQL 5.1.6 and later have a built-in event scheduler that can be used to schedule periodic jobs inside the MySQL server itself (an example follows the link below).
http://dev.mysql.com/doc/refman/5.1/en/events.html
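A sketch of such an event, assuming a hypothetical sessions table with a created_at column (the scheduler also has to be switched on):

SET GLOBAL event_scheduler = ON;  -- or set event_scheduler=ON in my.cnf

CREATE EVENT purge_old_rows
    ON SCHEDULE EVERY 1 MINUTE
    DO
        DELETE FROM sessions WHERE created_at < NOW() - INTERVAL 1 HOUR;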
Sounds to me like you need a cron job. Cron is the standard task scheduler on Unix-type systems.
You would have some sort of script that connects to the database and performs a cleanup query, and you would schedule that script via cron.
http://en.wikipedia.org/wiki/Cron