I have a perl script which takes in unique parameters (one of the parameters being --user=username_here). Users can start these processes using a web interface I am developing.
A MySQL table, transactions, keeps track of users that run the perl script
id user script_parameters execute last_modified
23 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
24 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
25 alex --user=alex --keywords=lg 0 2014-05-06 05:49:01
26 alex --user=alex --keywords=lg 0 2014-04-30 04:31:39
The execute value for a given row will be "1" if the process should be running. It is set to "0" if the process should be ended.
My perl script constantly checks this value to make sure it's not "0" and if it is, the perl script terminates.
However, I need to manage these process to protect against this problem:
What if my server abruptly crashes and restarts, OR the script crashes? I will need something running in the background, reading the transactions table and make sure it restarts the perl script as many times as needed using the appropriate parameters.
And so, I'm having trouble figuring out how to balance giving control to the user to manage his/her own transaction(s), while I also make sure that the transactions that SHOULD be running, ARE running, and those that AREN'T, AREN'T.
Hope that makes sense and I appreciate any help!
It seems you're trying to launch long-running processes from a web server and then track those processes in a database. That's not impossible, but not a recommended practice.
The main problem is that an HTTP request needs to be currently being handled in your web server for you do actually do anything (including track processes running on the system) -- you need something that can run all the time...
Instead, a better idea would be to have another daemonized "manager" process (as you mention perl, that'd be a good language to write it in) spawn & track the long running tasks (by PID and signals), and for that process to update your SQL database.
You can then have your "manager" process listen for requests to start a new process from your web server. There are various IPC mechanisms you could use. (e.g: signals, SysV shm, unix domain sockets, in-process queues like ZeroMQ, etc).
This has multiple benefits:
If your spawned scripts need to run with user/group based isolation (either from the system or each other), then your webserver doesn't need to run as root, nor be setgid.
If a spawned process "crashes", a signal will be delivered to the "manager" process, so it can track mis-executiions without issues.
If you use in-process queues (e.g: ZeroMQ) to deliver requests to the "manager" process, it can "throttle" requests from the web server (so that users cannot intentionally or accidentally cause D.O.S).
Whether or not the spawned process ends well, you don't need an 'active' HTTP request to the web server in order to update your tracking database.
As to whether something that should be running is running, that's really up to your semantics. (i.e: is it based on a known run time? based on data consumed? etc).
The check as to whether it is running can be two-fold:
The "manager" process updates the database as appropriate, including the spawned PID.
Your web server hosted code can actually list processes to determine if the PID in the database is actually running, and even how much time it's been doing something useful!
The check for whether it is not running would have to be based on convention:
Name the spawned processes something you can predict.
Get a process list to determine what's still running (defunct?) that shouldn't be.
In either case, you could either inform the users who requested the processes be spawned and/or actually do something about it.
One approach might be to have a CRON job which reads from the SQL database and does ps to determine which spawned processes need to be restarted, and then re-requests that the "manager" process does so using the same IPC mechanism used by the web server. How you differentiate starts vs. restarts in your tracking/monitoring/logging is up to you.
If the server itself loses power or crashes, then you could have the "manager" process perform cleanup when it first runs, e.g:
Look for entries in the database for spawned processes that were alegedly running before the server was shut down.
Check for those processes by PID and run time (this is important).
Either re-spawn the spawned proceses that didn't complete, or store something in the database to indicate to the web server that this was the case.
Update #1
Per your comment, here are some pointers to get started:
You mentioned perl, so presuming you have some proficiency there -- here are some perl modules to help you on your way to writing the "manager" process script:
If you're not already familiar with it CPAN is the repository for perl modules that do basically anything.
Daemon::Daemonize - To daemonize process so that it will continue running after you log out. Also provides methods for writing scripts to start/stop/restart the daemon.
Proc::Spawn - Helps with 'spawning' child scripts. Basically does fork() then exec(), but also handles STDIN/STDOUT/STDERR (or even tty) of child process. You could use this to launch your long-running perl scripts.
If your web server front-end code is not already written in perl, you'll need something that's pretty portable for inter-process message-passing and queuing; I'd probably make your web server front end in something easy to deploy (like PHP).
Here are two possibilities (there are many more):
Perl and PHP implementations for the Spread Toolkit.
Perl and PHP implementations for the ZeroMQ library.
Proc::ProcessTable - You can use this check on running processes (and get all sorts of stats as discussed above).
Time::HiRes - Use the high-granularity time functions from this package to implement your 'throttling' framework. Basically just limit the number of requests you de-queue per unit of time.
DBI (with mysql) - Update your MySQL database from the "manager" process.
Related
I'm finishing a system at work that makes calls to mysql server. Those calls' arguments reveal information that I need to keep private, like vote(idUser, idCandidate). There's no information in the db that relates those two of course, nor in "the visible part" of the back end, but even though I think this can't be done, I wanted to make sure that it is impossible to trace this sort of calls, with a log or something (calls that were made, or calls being made at the moment), as it is impossible in most languages, unless you specifically "debug" in a certain way, while the system is in production and being used. I hope the questions is clear enough. Thanks.
How do I log thee? Let me count the ways.
MySQL query log. I can enable this per-session and send everything to a log file.
I can set up a slave server and have insertions sent to me by the master. This is a significant intervention and would leave a wide trace.
On the server, unbeknownst to either Web app and MySQL log, I can intercept communications between the two. I need administrative access to the machine, of course.
On the server, again with administrative access, I can both log the query calls and inject a logging instrumentation into the SQL interface (the legitimate one is the MySQL Audit Plugin, but there are several alternatives, developed for various purposes by developers over the years)
What can you do? You can have the applications use a secure protocol, just for starters.
Then, you need to secure your machine so that administrator tricks do not work, and even if the logs are activated, nobody can read them and you can be advised of any new and modified file to delete it promptly.
We have got 3 REST-Applications within a cluster.
So each application server can receive requests from "outside".
Now we got timed events, which are analysing the database and add/remove rows from the database, send emails, etc.
The problem is, that each application server does start this timed events and it happens that 2 application server are starting this analysing job at the same time.
We got a sql table in the back.
Our idea was to lock a table within the sql database, when starting the job. If the table is locked, we exit the job, because an other application just started to analyse.
What's a good practice to insert some kind of semaphore ?
Any ideas ?
Don't use semaphores, you are over complicating things, just use message queueing, where you queue your tasks and get them executed in row.
Make ONLY one separate node/process/child_process to consume from the queue and get your task done.
We (at a previous employer) used a database-based semaphore. Each of several (for redundancy and load sharing) servers had the same set of cron jobs. The first thing in each was a custom library call that did:
Connect to the database and check for (or insert) "I'm working on X".
If the flag was already set, then the cron job silently exited.
When finished, the flag was cleared.
The table included a timestamp and a host name -- for debugging and recovering from cron jobs that fail to finish gracefully.
I forget how the "test and set" was done. Possibly an optimistic INSERT, then check for "duplicate key".
RapGenius posted this article about how they checked all 170k urls that pointed to them by parellizing the scraping task across worker dynos on Heroku using the Ruby library Typhoeus.
I've been working on a project that involves scraping (getting the source) for 1.5 million URLs, and I've been trying to speed it up. Being more comfortable with Python, I've managed to whip up a scraper that parallelizes across my desktop using redis and python multiprocessing. Where I'm confused is how I would modify it to work on Heroku.
Here's how my program is designed right now:
1) An initializer script runs, that stores all the URLs ahead of time in a Redis queue
2) A script, run_workers.py, runs, that starts all the processes like such:
workers = []
q = get_redis_queue(name)
for i in xrange(num_workers):
p = multiprocessing.Process(target=worker.scraper_worker, args=(i, q))
p.start()
workers.append(p)
for w in workers:
w.join()
3) Workers, in worker.py, that do a scraping task like this:
def scraper_worker(worker_id, queue):
#consumes URL from redis queue, visits using python requests
#stores result into MySQL
Can my current program structure be ported directly onto Heroku? What would I put in the Procfile? My first guess would be
scrape: python init_scrape.py
Where init_scrape.py first initializes the queue, then runs the workers. But I have no experience actually distributing a python task on the cloud, and I want to avoid costly mistakes.
Running this locally, I find that storing the results directly into the database (which has 1.5 million rows, for each URL, and an empty space for where the caches will go), each UPDATE query is slow (takes minutes). Is it better to store results in a temporary table, and then merge the two tables afterward?
What technologies am I not using, that I should be? For example, I've seen Celery and Twisted both mentioned as suitable candidates for this kind of thing. I am not familiar with either but I've seen both as suggested alternatives in peripheral googling.
First off, if this "project" is short-lived, or generally won't be run in production, I suggest you don't start looking into "better technologies" until you really see that you need to. If you only ever are going to run this 3 times, it's a waste of time.
To your last question: Twisted is an async framework, much like "node", that will allow a higher concurrency factor on a single machine. Celery is distributed tasks, is very cool, and both are generally worth learning and suit you fine. (I wouldn't bother with Twisted unless the scale was huge). Instead of celery, for your simple case, you might consider "RedisQ", a Python module that does something similar (and has very concise documentation) in Redis.
To your MySQL question: that shouldn't be the case. A 1.5M rows table is not small, but inserts and updates should definitely not take minutes. Begin investigation by turning off any keys, indexes and primary keys you have.
To your Heroku architecture question: you would have 2 types of processes: a "web" process (which is your init_scrape.py), of which you will have 1 (heroku ps:scale web=1), and a "worker" process (of which you can have as many as you'd like, and is that increases your scale).
Your procfile will look something like:
web: python init_scrape.py
worker: python worker.py
Note that if you want to communicate with your init_scrape.py process, you must call it "web" in the Procfile. Note also that in that case you must bind a TCP listener (basically: spin up a simple http server) to the port os.environ['PORT']. Only "web" processes get routed HTTP requests from "outside" of Heroku.
Also, note that all your processes should never really "exit" (Or Heroku will simple re-spin them). When they have nothing to do, they should simple wait/poll for tasks. You can then increase or decrease the number of workers by using heroku ps:scale.
The main issue here, with regards to what you write, is that your master will not spin up workers. The worker processes will be in entirely different dynos. The worker will simply initialize the redis queue (as you menion), and maybe spin up a simple http server to communicate with, and then sit idly by.
The workers will need to be passed the redis queue name, and each worker will be in a dyno of its own.
Good luck!
I am re-developing a system that will send messages via http to one of a number of suppliers. The original is perl scripts and it's likely that the re-development will also use perl.
In the old system, there were a number of perl scripts all running at the same time, five for each supplier. When a message was put into the database, a random thread number (1-5) and the supplier was chosen to ensure that no message was processed twice while avoiding having to lock the table/row. Additionally there was a "Fair Queue Position" field in the database to ensure that a large message send didn't delay small sends that happened while the large one was being sent.
At some times there would be just a couple of messages per minute, but at other times there would be a dump of potentially hundreds of thousands of messages. It seems to me like a resource waste to have all the scripts running and checking for messages all of the time so I am trying to work out if there is a better way to do it, or if the old way is acceptable.
My thoughts right now lie with the idea of having one script that runs and forks as many child processes as are needed (up to a limit) depending on how much traffic there is, but I am not sure how best to implement it such that each message is processed only once, while the fair queuing is maintained.
My best guess right now is that the parent script updates the DB to indicate which child process should handle it, however I am concerned that this will end up being less efficient than the original method. I have little experience of writing forking code (last time I did it was about 15 years ago).
Any thoughts or links to guides on how best to process message queues appreciated!
You could use Thread::Queue or any other from this: Is there a multiprocessing module for Perl?
If the old system was written in Perl this way you could reuse most part of it.
Non working example:
use strict;
use warnings;
use threads;
use Thread::Queue;
my $q = Thread::Queue->new(); # A new empty queue
# Worker thread
my #thrs = threads->create(sub {
while (my $item = $q->dequeue()) {
# Do work on $item
}
})->detach() for 1..10;#for 10 threads
my $dbh = ...
while (1){
#get items from db
my #items = get_items_from_db($dbh);
# Send work to the thread
$q->enqueue(#items);
print "Pending items: "$q->pending()."\n";
sleep 15;#check DB in every 15 secs
}
I would suggest using a message queue server like RabbitMQ.
One process feeds work into the queue, and you can have multiple worker processes consume the queue.
Advantages of this approach:
workers block when waiting for work (no busy waiting)
more worker processes can be started up manually if needed
worker processes don't have to be a child of a special parent process
RabbitMQ will distribute the work among all workers which are ready to accept work
RabbitMQ will put work back into the queue if the worker doesn't return an ACK
you don't have to assign work in the database
every "agent" (worker, producer, etc.) is an independent process which means you can kill it or restart it without affecting other processes
To dynamically scale-up or down the number workers, you can implement something like:
have workers automatically die if they don't get work for a specified amount of time
have another process monitor the length of the queue and spawn more workers if the queue is getting too big
I would recommend using beanstalkd for a dedicated job server, and Beanstalk::Client in your perl scripts for adding jobs to the queue and removing them.
You should find beanstalkd easier to install and set up compared to RabbitMQ. It will also take care of distributing jobs among available workers, burying any failed jobs so they can be retried later, scheduling jobs to be done at a later date, and many more basic features. For your worker, you don't have to worry about forking or threading; just start up as many workers as you need, on as many servers as you have available.
Either RabbitMQ or Beanstalk would be better than rolling your own db-backed solution. These projects have already worked out many of the details needed for queueing, and implemented features you may not realize yet that you want. They should also handle polling for new jobs more efficiently, compared to sleeping and selecting from your database to see if there's more work to do.
I'm new to databases and web servers and that kind of thing. So I am looking for information so I can begin to figure out a starting point and options open to me.
I need to have a database that can be accessed by an iPhone app. So logically it will be hosted on a webserver somewhere.
To get/insert the data from/into the database the app would make a HTTP connection to a php file on the same server as the DB which would then insert/return the relevant data. To stop random hackers messing with the DB the app would have some validation code inside it to send to the php file to check that its not a hacker trying to mess with the database. This all making sense or will that not be secure enough.
Now the most confusing part to get my head around is :
I need check every minute has any data in the database become to old and remove it if so. So something needs to be running on the server constantly checking/manageing the database. What would this be? What is commonly used to do this kinda of thing? Is there somekey word for it that i can start searching and reading about to see what options there are?
Thanks for your advise,
-Code
One way to do this is to have a purge script run via crontab. The script can run every minute and check for old data and remove it.
MySQL version greater than 5.1.6 has inbuilt event scheduler which can be used to schedule periodic jobs inside mysql server itself.
http://dev.mysql.com/doc/refman/5.1/en/events.html
Sounds to me like you need a cron job. Cron is the standard scheduling task application for Unix type systems.
You would have some sort of script that connects to the database and performs a cleanup query, and you would schedule that script via cron.
http://en.wikipedia.org/wiki/Cron