To fork or not to fork? - mysql

I am re-developing a system that will send messages via HTTP to one of a number of suppliers. The original system consists of Perl scripts, and it's likely that the re-development will also use Perl.
In the old system, a number of Perl scripts ran at the same time, five for each supplier. When a message was put into the database, it was assigned a random thread number (1-5) and a supplier, so that no message was processed twice while avoiding having to lock the table/row. Additionally, a "Fair Queue Position" field in the database ensured that a large message send didn't delay small sends that happened while the large one was in progress.
At some times there would be just a couple of messages per minute, but at other times there would be a dump of potentially hundreds of thousands of messages. It seems to me like a resource waste to have all the scripts running and checking for messages all of the time so I am trying to work out if there is a better way to do it, or if the old way is acceptable.
My thoughts right now lie with the idea of having one script that runs and forks as many child processes as are needed (up to a limit) depending on how much traffic there is, but I am not sure how best to implement it such that each message is processed only once, while the fair queuing is maintained.
My best guess right now is that the parent script updates the DB to indicate which child process should handle each message; however, I am concerned that this will end up being less efficient than the original method. I have little experience writing forking code (the last time I did was about 15 years ago).
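For illustration, the sort of atomic claim I have in mind looks roughly like this (sketched in Python for brevity; the table and column names are just made up):

# db is a DB-API connection (e.g. MySQLdb); "messages" is a hypothetical table
def claim_message(db, worker_id):
    cur = db.cursor()
    # A single UPDATE is atomic in MySQL, so only one worker can claim a row;
    # ordering by the fair-queue position keeps big dumps from starving small sends.
    cur.execute(
        """UPDATE messages
              SET status = 'processing', worker_id = %s
            WHERE status = 'pending'
         ORDER BY fair_queue_position
            LIMIT 1""",
        (worker_id,))
    db.commit()
    if cur.rowcount == 0:
        return None  # nothing pending right now
    cur.execute(
        "SELECT id, body FROM messages WHERE worker_id = %s AND status = 'processing'",
        (worker_id,))
    return cur.fetchone()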
Any thoughts or links to guides on how best to process message queues appreciated!

You could use Thread::Queue or any of the other options from this question: Is there a multiprocessing module for Perl?
If the old system was written in Perl, this way you could reuse most of it.
An untested example:
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();    # A new empty queue

# Worker threads: each blocks in dequeue() until an item arrives
for (1 .. 10) {                  # 10 worker threads
    threads->create(sub {
        while (defined(my $item = $q->dequeue())) {
            # Do work on $item
        }
    })->detach();
}

my $dbh = ...;                   # connect to your database here
while (1) {
    # Get pending items from the DB and hand them to the workers
    my @items = get_items_from_db($dbh);
    $q->enqueue(@items);
    print "Pending items: " . $q->pending() . "\n";
    sleep 15;                    # check the DB every 15 seconds
}

I would suggest using a message queue server like RabbitMQ.
One process feeds work into the queue, and you can have multiple worker processes consume the queue.
Advantages of this approach:
workers block when waiting for work (no busy waiting)
more worker processes can be started up manually if needed
worker processes don't have to be a child of a special parent process
RabbitMQ will distribute the work among all workers which are ready to accept work
RabbitMQ will put work back into the queue if the worker doesn't return an ACK
you don't have to assign work in the database
every "agent" (worker, producer, etc.) is an independent process which means you can kill it or restart it without affecting other processes
To dynamically scale the number of workers up or down, you can implement something like:
have workers automatically die if they don't get work for a specified amount of time
have another process monitor the length of the queue and spawn more workers if the queue is getting too big
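To make the consumer side concrete, here is a minimal sketch of a blocking RabbitMQ worker using Python and the pika client (the queue name and process_message function are placeholders):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='messages', durable=True)   # survive broker restarts

def handle(ch, method, properties, body):
    process_message(body)                                # placeholder for real work
    ch.basic_ack(delivery_tag=method.delivery_tag)       # no ACK => RabbitMQ re-queues

channel.basic_qos(prefetch_count=1)   # hand each worker one job at a time
channel.basic_consume(queue='messages', on_message_callback=handle)
channel.start_consuming()             # blocks waiting for work (no busy waiting)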

I would recommend using beanstalkd for a dedicated job server, and Beanstalk::Client in your perl scripts for adding jobs to the queue and removing them.
You should find beanstalkd easier to install and set up compared to RabbitMQ. It will also take care of distributing jobs among available workers, burying any failed jobs so they can be retried later, scheduling jobs to be done at a later date, and many more basic features. For your worker, you don't have to worry about forking or threading; just start up as many workers as you need, on as many servers as you have available.
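Beanstalk::Client is the Perl interface; purely to illustrate the put/reserve/delete flow, here is the same shape sketched in Python with the beanstalkc client (tube name and payload are placeholders):

import beanstalkc

beanstalk = beanstalkc.Connection(host='localhost', port=11300)

# Producer side: put a job into a tube
beanstalk.use('messages')
beanstalk.put('{"msg_id": 42}')        # hypothetical payload

# Worker side: reserve, work, delete (or bury on failure)
beanstalk.watch('messages')
job = beanstalk.reserve()              # blocks until a job is available
try:
    process(job.body)                  # placeholder for real work
    job.delete()                       # done: remove it from the queue
except Exception:
    job.bury()                         # failed: bury it for later retry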
Either RabbitMQ or Beanstalk would be better than rolling your own DB-backed solution. These projects have already worked out many of the details needed for queueing, and have implemented features you may not yet realize you want. They should also handle polling for new jobs more efficiently than sleeping and selecting from your database to see if there's more work to do.

Related

How to manage server-side processes using MySQL

I have a perl script which takes in unique parameters (one of the parameters being --user=username_here). Users can start these processes using a web interface I am developing.
A MySQL table, transactions, keeps track of the script runs users have started:
id   user   script_parameters                 execute   last_modified
23   alex   --user=alex --keywords=thisthat   0         2014-05-06 05:49:01
24   alex   --user=alex --keywords=thisthat   0         2014-05-06 05:49:01
25   alex   --user=alex --keywords=lg         0         2014-05-06 05:49:01
26   alex   --user=alex --keywords=lg         0         2014-04-30 04:31:39
The execute value for a given row will be "1" if the process should be running. It is set to "0" if the process should be ended.
My perl script constantly checks this value to make sure it's not "0" and if it is, the perl script terminates.
However, I need to manage these processes to protect against this problem:
What if my server abruptly crashes and restarts, OR the script crashes? I will need something running in the background, reading the transactions table and making sure the perl script is restarted as many times as needed with the appropriate parameters.
And so, I'm having trouble figuring out how to balance giving users control to manage their own transaction(s), while also making sure that the transactions that SHOULD be running, ARE running, and those that SHOULDN'T, AREN'T.
Hope that makes sense and I appreciate any help!
It seems you're trying to launch long-running processes from a web server and then track those processes in a database. That's not impossible, but not a recommended practice.
The main problem is that an HTTP request needs to be actively in progress in your web server for you to actually do anything (including tracking processes running on the system) -- you need something that can run all the time...
Instead, a better idea would be to have another daemonized "manager" process (as you mention perl, that'd be a good language to write it in) spawn & track the long running tasks (by PID and signals), and for that process to update your SQL database.
You can then have your "manager" process listen for requests to start a new process from your web server. There are various IPC mechanisms you could use. (e.g: signals, SysV shm, unix domain sockets, in-process queues like ZeroMQ, etc).
This has multiple benefits:
If your spawned scripts need to run with user/group based isolation (either from the system or each other), then your webserver doesn't need to run as root, nor be setgid.
If a spawned process "crashes", a signal will be delivered to the "manager" process, so it can track mis-executions without issues.
If you use in-process queues (e.g: ZeroMQ) to deliver requests to the "manager" process, it can "throttle" requests from the web server (so that users cannot intentionally or accidentally cause a DoS).
Whether or not the spawned process ends well, you don't need an 'active' HTTP request to the web server in order to update your tracking database.
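As a sketch of that request path (Python for brevity; the socket path and helper functions are hypothetical), the manager's receive loop over ZeroMQ might look like:

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)              # the web frontend PUSHes spawn requests
sock.bind('ipc:///tmp/manager.sock')     # hypothetical unix-domain endpoint

while True:
    request = sock.recv_json()           # e.g. {"user": "alex", "keywords": "lg"}
    pid = spawn_worker(request)          # hypothetical: fork/exec the long-running script
    record_in_db(request, pid)           # hypothetical: update the transactions table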
As to whether something that should be running is running, that's really up to your semantics. (i.e: is it based on a known run time? based on data consumed? etc).
The check as to whether it is running can be two-fold:
The "manager" process updates the database as appropriate, including the spawned PID.
Your web server hosted code can actually list processes to determine if the PID in the database is actually running, and even how much time it's been doing something useful!
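Checking whether a recorded PID is alive is a one-liner on POSIX systems (sketched here in Python; in perl the equivalent is kill(0, $pid)):

import os

def pid_is_running(pid):
    # Signal 0 checks existence/permissions without actually sending a signal.
    # NB: PIDs get reused, so also compare the process run time (see below).
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False            # no such process
    except PermissionError:
        return True             # exists, but owned by another user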
The check for whether it is not running would have to be based on convention:
Name the spawned processes something you can predict.
Get a process list to determine what's still running (defunct?) that shouldn't be.
In either case, you could either inform the users who requested the processes be spawned and/or actually do something about it.
One approach might be to have a CRON job which reads from the SQL database and does ps to determine which spawned processes need to be restarted, and then re-requests that the "manager" process does so using the same IPC mechanism used by the web server. How you differentiate starts vs. restarts in your tracking/monitoring/logging is up to you.
If the server itself loses power or crashes, then you could have the "manager" process perform cleanup when it first runs, e.g:
Look for entries in the database for spawned processes that were allegedly running before the server was shut down.
Check for those processes by PID and run time (this is important).
Either re-spawn the spawned processes that didn't complete, or store something in the database to indicate to the web server that this was the case.
Update #1
Per your comment, here are some pointers to get started:
You mentioned perl, so presuming you have some proficiency there -- here are some perl modules to help you on your way to writing the "manager" process script:
If you're not already familiar with it, CPAN is the repository for perl modules that do basically anything.
Daemon::Daemonize - To daemonize your process so that it will continue running after you log out. Also provides methods for writing scripts to start/stop/restart the daemon. (A sketch of what this does under the hood appears after this list.)
Proc::Spawn - Helps with 'spawning' child scripts. Basically does fork() then exec(), but also handles STDIN/STDOUT/STDERR (or even tty) of child process. You could use this to launch your long-running perl scripts.
If your web server front-end code is not already written in perl, you'll need something that's pretty portable for inter-process message-passing and queuing; I'd probably make your web server front end in something easy to deploy (like PHP).
Here are two possibilities (there are many more):
Perl and PHP implementations for the Spread Toolkit.
Perl and PHP implementations for the ZeroMQ library.
Proc::ProcessTable - You can use this check on running processes (and get all sorts of stats as discussed above).
Time::HiRes - Use the high-granularity time functions from this package to implement your 'throttling' framework. Basically just limit the number of requests you de-queue per unit of time.
DBI (with mysql) - Update your MySQL database from the "manager" process.
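For the curious, what Daemon::Daemonize does under the hood is essentially the classic Unix double fork, sketched here in Python:

import os, sys

def daemonize():
    if os.fork() > 0:
        sys.exit(0)           # exit the first parent
    os.setsid()               # become session leader, detach from the tty
    if os.fork() > 0:
        sys.exit(0)           # exit the second parent; init adopts the daemon
    os.chdir('/')
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):      # point stdin/stdout/stderr at /dev/null
        os.dup2(devnull, fd)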

Parallelized Python Scraping into Database Across Heroku

RapGenius posted this article about how they checked all 170k urls that pointed to them by parallelizing the scraping task across worker dynos on Heroku using the Ruby library Typhoeus.
I've been working on a project that involves scraping (getting the source) for 1.5 million URLs, and I've been trying to speed it up. Being more comfortable with Python, I've managed to whip up a scraper that parallelizes across my desktop using redis and python multiprocessing. Where I'm confused is how I would modify it to work on Heroku.
Here's how my program is designed right now:
1) An initializer script runs, that stores all the URLs ahead of time in a Redis queue
2) A script, run_workers.py, runs and starts all the worker processes like so:
import multiprocessing
import worker                        # module containing scraper_worker

workers = []
q = get_redis_queue(name)            # helper returning the shared Redis queue
for i in xrange(num_workers):
    p = multiprocessing.Process(target=worker.scraper_worker, args=(i, q))
    p.start()
    workers.append(p)
for w in workers:
    w.join()                         # wait for all workers to finish
3) Workers, in worker.py, that do a scraping task like this:
def scraper_worker(worker_id, queue):
    # consumes a URL from the redis queue, visits it using python requests,
    # and stores the result into MySQL
Can my current program structure be ported directly onto Heroku? What would I put in the Procfile? My first guess would be
scrape: python init_scrape.py
Where init_scrape.py first initializes the queue, then runs the workers. But I have no experience actually distributing a python task on the cloud, and I want to avoid costly mistakes.
Running this locally, I find that when storing the results directly into the database (which has 1.5 million rows, one for each URL, with an empty column where the cached source will go), each UPDATE query is slow (it can take minutes). Is it better to store results in a temporary table, and then merge the two tables afterward?
What technologies am I not using, that I should be? For example, I've seen Celery and Twisted both mentioned as suitable candidates for this kind of thing. I am not familiar with either but I've seen both as suggested alternatives in peripheral googling.
First off, if this "project" is short-lived, or generally won't be run in production, I suggest you don't start looking into "better technologies" until you really see that you need to. If you only ever are going to run this 3 times, it's a waste of time.
To your last question: Twisted is an async framework, much like "node", that allows a higher concurrency factor on a single machine. Celery is for distributed tasks, is very cool, and both are generally worth learning and would suit you fine. (I wouldn't bother with Twisted unless the scale was huge.) Instead of Celery, for your simple case, you might consider "RedisQ", a Python module that does something similar (and has very concise documentation) on top of Redis.
To your MySQL question: that shouldn't be the case. A 1.5M-row table is not small, but inserts and updates should definitely not take minutes. Begin investigating by turning off any keys, indexes and primary keys you have.
To your Heroku architecture question: you would have two types of processes: a "web" process (which is your init_scrape.py), of which you will have one (heroku ps:scale web=1), and a "worker" process (of which you can have as many as you'd like; the workers are what you scale).
Your procfile will look something like:
web: python init_scrape.py
worker: python worker.py
Note that if you want to communicate with your init_scrape.py process, you must call it "web" in the Procfile. Note also that in that case you must bind a TCP listener (basically: spin up a simple http server) to the port os.environ['PORT']. Only "web" processes get routed HTTP requests from "outside" of Heroku.
Also, note that all your processes should never really "exit" (or Heroku will simply re-spin them). When they have nothing to do, they should simply wait/poll for tasks. You can then increase or decrease the number of workers by using heroku ps:scale.
The main issue here, with regard to what you wrote, is that your master will not spin up workers. The worker processes will live in entirely different dynos. The web process will simply initialize the redis queue (as you mention), maybe spin up a simple http server to communicate with, and then sit idly by.
The workers will need to be passed the redis queue name, and each worker will be in a dyno of its own.
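A minimal sketch of such a worker dyno (the queue name, env vars, and scrape_and_store helper are hypothetical):

# worker.py
import os
import redis

r = redis.from_url(os.environ.get('REDIS_URL', 'redis://localhost:6379'))
QUEUE = os.environ.get('QUEUE_NAME', 'urls')

while True:                             # never exit; just wait for more work
    item = r.blpop(QUEUE, timeout=30)   # block until a URL is available
    if item is None:
        continue                        # queue empty; keep waiting
    _key, url = item
    scrape_and_store(url)               # hypothetical: fetch the page, write to MySQL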
Good luck!

How can I use Gearman for File Processing Without Killing the DB?

I'm currently designing a system for processing uploaded files.
The files are uploaded through a LAMP web frontend and must be processed through several stages some of which are sequential and others which may run in parallel.
A few key points:
The clients uploading the files only care about safely delivering the files, not the results of the processing, so it can be completely asynchronous.
The files are max 50kb in size
The system must scale up to processing over a million files a day
It is critical that no file is lost or goes unprocessed
My assumption is MySQL, but I have no issue with NoSQL if this could offer an advantage.
My initial idea was to have the front end put the files straight into a MySQL DB and then have a number of worker processes poll the database setting flags as they completed each step. After some rough calculations I realised that this wouldn't scale as the workers polling would start to cause locking problems on the upload table.
After some research it looks like Gearman might be the solution to the problem. The workers can register with the Gearman server and can poll for jobs without crippling the DB.
What I am currently puzzling over is how to dispatch jobs in the most efficient manner. There are three ways I can see to do this:
Write a single dispatcher to poll the database and then send jobs to Gearman
Have the upload process fire off an asynchronous Gearman job when it receives a file
Use the Gearman MySQL UDF extension to make the DB fire off jobs when files are inserted
The first approach will still hammer the DB somewhat but it could trivially recover from a failure.
The other two approaches would seem to require enabling Gearman queue persistence to recover from faults, but I am concerned that if I enable this I will lose the raw speed that attracts me to Gearman and shift the DB bottleneck downstream.
Any advice on which of these approaches would be the most efficient (or even better real world examples) would be much appreciated.
Also feel free to pitch in if you think I'm going about the whole thing the wrong way.
This has been open for a little while now so I thought I would provide some information on the approach that I took.
Every time a file is uploaded, I create a gearman job for a "dispatch" worker which understands the sequence of processing steps required for each file. The dispatcher queues gearman jobs for each of the processing steps.
Any job that completes writes a completion timestamp back to the DB and calls the dispatcher, which can then queue any follow-on tasks.
Writing a timestamp for each job completion means the system can recover its queues if processing is missed or fails, without the burden of persistent queues.
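As an illustration of that dispatch pattern (using the python gearman client; the task and helper names are hypothetical):

import gearman

client = gearman.GearmanClient(['localhost:4730'])

def on_upload(file_id):
    # fire-and-forget: queue a dispatch job as soon as the file is stored
    client.submit_job('dispatch', str(file_id), background=True)

def on_step_complete(file_id, step):
    record_completion_timestamp(file_id, step)    # hypothetical DB write
    # re-invoke the dispatcher so it can queue the next step in the sequence
    client.submit_job('dispatch', str(file_id), background=True)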
I would save the files to disk, then send the filename to Gearman. As each part of the process completes, it generates another message for the next part of the process; you could move the file into a new work-in-progress directory for the next stage to work on it.

Using MySQL as a job queue

I'd like to use MySQL as a job queue. Multiple machines will be producing and consuming jobs. Jobs need to be scheduled; some may run every hour, some every day, etc.
It seems fairly straightforward: for each job, have a "nextFireTime" column, and have worker machines search for the job with the earliest nextFireTime, change the status of the record to "inProcess", and then update the nextFireTime when the job ends.
The problem comes in when a worker dies silently. It won't be able to update the nextFireTime or set the status back to "idle".
Unfortunately, jobs can be long-running, so a reaper thread that looks for jobs that have been inProcess too long isn't an option. There's no timeout value that would work.
Can anyone suggest a design pattern that would properly handle unreliable worker machines?
Maybe like this:
When a worker fetches a job, it can add its process-id or another unique id to a field on the job.
Then, in another table, every worker keeps updating a value to show that it is alive. When updating its "I'm alive" field, each worker checks the "last time this worker showed signs of life" value of every other worker. If one worker is over the limit, find all the jobs it was working on and reset them.
So, in other words, the watchdog works on the worker processes and not on the jobs themselves.
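A sketch of that watchdog, assuming hypothetical jobs and worker_heartbeat tables (db is a DB-API connection):

HEARTBEAT_LIMIT = 60   # seconds without a heartbeat before a worker is presumed dead

def heartbeat(db, worker_id):
    cur = db.cursor()
    # Record that this worker is alive
    cur.execute(
        "REPLACE INTO worker_heartbeat (worker_id, last_seen) VALUES (%s, NOW())",
        (worker_id,))
    # While we're here, reset jobs held by any worker that stopped reporting
    cur.execute(
        """UPDATE jobs
              SET status = 'idle', worker_id = NULL
            WHERE worker_id IN (
                  SELECT worker_id FROM worker_heartbeat
                   WHERE last_seen < NOW() - INTERVAL %s SECOND)""",
        (HEARTBEAT_LIMIT,))
    db.commit()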
Using MySQL as a job queue generally ends in pain, as it's a very poor fit for the usual goals of an RDBMS. User 'toong' already linked to https://www.engineyard.com/blog/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you, which has a lot of interesting stuff to say about it. Unreliable workers are only one of the complications.
There are many, many systems for handling job distribution, mostly distinguished by the sophistication of their queueing and scheduling capabilities. On the simple FIFO end are things like Resque, Celery, Beanstalkd, and Gearman; on the sophisticated end are things like GridEngine, Torque/Maui, and PBS Pro. I highly recommend the new Amazon Simple Workflow system, if you can tolerate reliance on an Amazon service (I believe it does not require that you be in EC2).
To your original question: right now we're implementing a per-node supervisor that can tell whether the node's jobs are still active, and sends a heartbeat back to a job monitor if so. It's a pain, but as you are discovering and will continue to discover, there are a lot of details and error cases to manage. Mostly, though, I have to encourage you to do yourself a favor by learning about this domain and building the system properly from the start.
One option is to make sure that jobs are idempotent, and allow more than one worker to start a given job. It doesn't matter which worker completes the job, or if more than one worker completes it, since the jobs are designed in such a way that multiple completions are handled gracefully. Perhaps workers race to supply the result, and the losers find that the slot that will hold the result is already full, so they just drop theirs.
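A sketch of that race, assuming a hypothetical job_results table whose primary key is job_id:

def store_result(db, job_id, result):
    cur = db.cursor()
    # The PRIMARY KEY on job_id means only the first writer succeeds;
    # losers of the race see rowcount == 0 and just drop their copy.
    cur.execute(
        "INSERT IGNORE INTO job_results (job_id, result) VALUES (%s, %s)",
        (job_id, result))
    db.commit()
    return cur.rowcount == 1   # True if this worker won the race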
Another option is to not have big jobs. Break long-running jobs into intermediate steps: if the job takes longer than (say) one minute, store the intermediate results as a new job (with a link to the old job in some way), so that the new job can be queued again to do another minute of work.

Message Queues Vs DB Table Queue via CRON

We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output etc., the sort of stuff we'd normally put into a table called "email_queue" and use a cron to run a script that processes the queue in the table.
I have been reading a lot on Message Queue systems like beanstalkd, and have even set it up. It was easy and nice to use, the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and a cron? I really can't seem to see what they are.
Thanks
Differences:
Once a message is put on the queue it can be immediately delivered. So if your cron normally ran every 5 minutes, you could process faster with the queuing.
If your queueing system supports transactions, then it will automatically re-deliver a message if the processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (sql).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is only delivered to one of them. With a DB table you need to handle this via application code (locking, flags, etc ...)
A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It ultimately comes down to your requirements, I guess. You can achieve a more manageable solution at scale by using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/db table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.
First, queues are often backed by actual DB tables and can maintain message durability. That aside, a queue is a natural way to shove off work that needs to be done asynchronously, which is very powerful if you design on that principle from the start.
Other than the fact that a table (entity) has a set of hard columns (attributes), a table composed of a set of records and a queue are both nothing more than lists of stuff. You are already employing the queue-as-a-table as a formal queue; you're just polling it on a regular (cron) basis.
MQs add another nifty feature, though: they generally synchronize access to the message itself (you may or may not be doing this in your SQL when grabbing the next item).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
The benefit of a queue, in my opinion, is that it takes care of the synchronizing and status updating for you. MQs can be set up to "broadcast" (topic) or to make the message available to a group of consumers or listeners.
MQs, being asynchronous, also operate outside your cron window. With a cron-driven table, how do you know that the messages you process can be finished before the next cron job runs and tries to step on the previous job?
Multiple consumers for the MQ allow you to scale the work as you see fit. In the example above, if you see that your load (much like the OS's process queue) is higher than you like, you can provision another consumer to handle it, bringing consumers on- and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside, as already mentioned, is that the queue can sometimes be hard to query and to obtain metrics from. I always favor MQ systems that have a DB backing store so that I can watch the queue myself with SQL.
This gets asked fairly frequently, and there's usually not a compelling reason to go MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).