Parallelized Python Scraping into Database Across Heroku - mysql

RapGenius posted this article about how they checked all 170k URLs that pointed to them by parallelizing the scraping task across worker dynos on Heroku, using the Ruby library Typhoeus.
I've been working on a project that involves scraping (getting the source) for 1.5 million URLs, and I've been trying to speed it up. Being more comfortable with Python, I've managed to whip up a scraper that parallelizes across my desktop using redis and python multiprocessing. Where I'm confused is how I would modify it to work on Heroku.
Here's how my program is designed right now:
1) An initializer script runs that stores all the URLs ahead of time in a Redis queue
2) A script, run_workers.py, runs and starts all the worker processes, like so:
import multiprocessing
import worker  # worker.py, shown below

workers = []
q = get_redis_queue(name)  # name and num_workers are defined elsewhere
for i in xrange(num_workers):
    p = multiprocessing.Process(target=worker.scraper_worker, args=(i, q))
    p.start()
    workers.append(p)
for w in workers:
    w.join()
3) Workers, in worker.py, do a scraping task like this:
def scraper_worker(worker_id, queue):
    # consumes URLs from the redis queue, visits them using python requests,
    # and stores the result into MySQL
    pass
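For context, the initializer and the get_redis_queue helper look roughly like this (a simplified sketch; the RedisQueue wrapper, the queue name, and the urls.txt input all stand in for the real details):

import redis

class RedisQueue(object):
    # A minimal list-backed queue; stands in for whatever get_redis_queue returns.
    def __init__(self, name, host="localhost", port=6379):
        self.name = name
        self.conn = redis.Redis(host=host, port=port)

    def push(self, item):
        self.conn.rpush(self.name, item)

    def pop(self, timeout=30):
        item = self.conn.blpop(self.name, timeout=timeout)
        return item[1] if item else None

def get_redis_queue(name):
    return RedisQueue(name)

# Initializer: load every URL ahead of time.
q = get_redis_queue("urls")
with open("urls.txt") as f:  # one URL per line
    for line in f:
        q.push(line.strip())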
Can my current program structure be ported directly onto Heroku? What would I put in the Procfile? My first guess would be
scrape: python init_scrape.py
Where init_scrape.py first initializes the queue, then runs the workers. But I have no experience actually distributing a python task on the cloud, and I want to avoid costly mistakes.
Running this locally, I find that when storing results directly into the database (which has 1.5 million rows, one per URL, with an empty column where the scraped content will go), each UPDATE query is slow, taking minutes. Is it better to store results in a temporary table and then merge the two tables afterward?
What technologies am I not using, that I should be? For example, I've seen Celery and Twisted both mentioned as suitable candidates for this kind of thing. I am not familiar with either but I've seen both as suggested alternatives in peripheral googling.

First off, if this "project" is short-lived, or generally won't be run in production, I suggest you don't start looking into "better technologies" until you really see that you need to. If you're only ever going to run this three times, it's a waste of time.
To your last question: Twisted is an async framework, much like "node", that allows a higher concurrency factor on a single machine. Celery is distributed tasks; both are very cool, generally worth learning, and would suit you fine (though I wouldn't bother with Twisted unless the scale were huge). Instead of Celery, for your simple case, you might consider RQ ("Redis Queue"), a Python module that does something similar on top of Redis and has very concise documentation.
To your MySQL question: that shouldn't be the case. A 1.5M-row table is not small, but inserts and updates should definitely not take minutes. Begin your investigation by turning off any keys, indexes, and primary keys you have.
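For illustration, here is roughly what that advice means in code (the pages table, its columns, and the pymysql driver are assumptions, and note that ALTER TABLE ... DISABLE KEYS only affects MyISAM's secondary indexes):

import pymysql  # assumed driver; MySQLdb behaves the same way here

rows = [("<html>...</html>", 42)]  # (cache, id) pairs; illustrative data
conn = pymysql.connect(host="localhost", user="scraper",
                       password="secret", db="scrape")
with conn.cursor() as cur:
    # Stop maintaining secondary indexes during the bulk load...
    cur.execute("ALTER TABLE pages DISABLE KEYS")
    cur.executemany("UPDATE pages SET cache = %s WHERE id = %s", rows)
    # ...then rebuild them once, at the end.
    cur.execute("ALTER TABLE pages ENABLE KEYS")
conn.commit()

If a single UPDATE keyed on the primary key still takes minutes, run it through EXPLAIN to confirm the WHERE clause is actually hitting an index.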
To your Heroku architecture question: you would have two types of processes: a "web" process (your init_scrape.py), of which you will have one (heroku ps:scale web=1), and a "worker" process, of which you can have as many as you'd like; that is what increases your scale.
Your procfile will look something like:
web: python init_scrape.py
worker: python worker.py
Note that if you want to communicate with your init_scrape.py process, you must call it "web" in the Procfile. Note also that, in that case, you must bind a TCP listener (basically: spin up a simple HTTP server) to the port in os.environ['PORT']. Only "web" processes get HTTP requests routed from outside Heroku.
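A minimal sketch of such a listener, using only the standard library (what the handler returns doesn't matter; all Heroku needs is something bound to the port in PORT):

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # Trivial response so Heroku's router sees a live web process.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

port = int(os.environ.get("PORT", "8000"))  # Heroku injects PORT
HTTPServer(("0.0.0.0", port), Health).serve_forever()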
Also note that none of your processes should ever really "exit" (or Heroku will simply re-spin them). When they have nothing to do, they should simply wait/poll for tasks. You can then increase or decrease the number of workers using heroku ps:scale.
The main issue here, with regard to what you wrote, is that your master will not spin up the workers. The worker processes will live in entirely different dynos. The web process will simply initialize the redis queue (as you mention), maybe spin up a simple HTTP server to communicate with, and then sit idly by.
The workers will need to be passed the redis queue name, and each worker will be in a dyno of its own.
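Put together, a worker dyno might look roughly like this (the REDIS_URL/REDIS_QUEUE config vars and the scrape stub are assumptions standing in for your actual logic):

import os
import redis
import requests

def scrape(url):
    # Stand-in for the real work: fetch the page, store the result in MySQL.
    return requests.get(url, timeout=30).text

r = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))
queue_name = os.environ.get("REDIS_QUEUE", "urls")

while True:
    # BLPOP blocks until a URL arrives, so the dyno neither exits
    # (which Heroku would treat as a crash) nor busy-waits.
    item = r.blpop(queue_name, timeout=30)
    if item:
        _key, url = item
        scrape(url.decode())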
Good luck!

Related

How to manage server-side processes using MySQL

I have a perl script which takes in unique parameters (one of the parameters being --user=username_here). Users can start these processes using a web interface I am developing.
A MySQL table, transactions, keeps track of the users that run the perl script:
id   user   script_parameters                 execute   last_modified
23   alex   --user=alex --keywords=thisthat   0         2014-05-06 05:49:01
24   alex   --user=alex --keywords=thisthat   0         2014-05-06 05:49:01
25   alex   --user=alex --keywords=lg         0         2014-05-06 05:49:01
26   alex   --user=alex --keywords=lg         0         2014-04-30 04:31:39
The execute value for a given row will be "1" if the process should be running. It is set to "0" if the process should be ended.
My perl script constantly checks this value to make sure it's not "0"; if it is, the script terminates.
However, I need to manage these processes to protect against this problem:
What if my server abruptly crashes and restarts, OR the script crashes? I will need something running in the background, reading the transactions table and making sure it restarts the perl script as many times as needed, with the appropriate parameters.
And so, I'm having trouble figuring out how to balance giving control to the user to manage his/her own transaction(s), while I also make sure that the transactions that SHOULD be running, ARE running, and those that AREN'T, AREN'T.
Hope that makes sense and I appreciate any help!
It seems you're trying to launch long-running processes from a web server and then track those processes in a database. That's not impossible, but not a recommended practice.
The main problem is that an HTTP request has to be actively being handled by your web server for you to actually do anything (including tracking processes running on the system) -- you need something that can run all the time...
Instead, a better idea would be to have another daemonized "manager" process (as you mention perl, that'd be a good language to write it in) spawn & track the long running tasks (by PID and signals), and for that process to update your SQL database.
You can then have your "manager" process listen for requests to start a new process from your web server. There are various IPC mechanisms you could use. (e.g: signals, SysV shm, unix domain sockets, in-process queues like ZeroMQ, etc).
This has multiple benefits:
If your spawned scripts need to run with user/group based isolation (either from the system or each other), then your webserver doesn't need to run as root, nor be setgid.
If a spawned process "crashes", a signal will be delivered to the "manager" process, so it can track mis-executions without issues.
If you use in-process queues (e.g. ZeroMQ) to deliver requests to the "manager" process, it can "throttle" requests from the web server (so that users cannot, intentionally or accidentally, cause a denial of service); see the sketch after this list.
Whether or not the spawned process ends well, you don't need an 'active' HTTP request to the web server in order to update your tracking database.
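To make the hand-off concrete, a minimal sketch of the manager's receiving end using pyzmq over a Unix-domain endpoint (the endpoint path and message shape are assumptions; any of the IPC mechanisms above would do):

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)           # the web server holds the PUSH side
sock.bind("ipc:///tmp/manager.sock")  # assumed unix-domain endpoint

while True:
    req = sock.recv_json()            # e.g. {"user": "alex", "keywords": "lg"}
    print("spawn request:", req)
    # spawn the long-running script here, record its PID in the
    # transactions table, and throttle if requests arrive too quickly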
As to whether something that should be running is running, that's really up to your semantics. (i.e: is it based on a known run time? based on data consumed? etc).
The check as to whether it is running can be two-fold:
The "manager" process updates the database as appropriate, including the spawned PID.
Your web server hosted code can actually list processes to determine if the PID in the database is actually running, and even how much time it's been doing something useful!
The check for whether it is not running would have to be based on convention:
Name the spawned processes something you can predict.
Get a process list to determine what's still running (defunct?) that shouldn't be.
In either case, you could either inform the users who requested the processes be spawned and/or actually do something about it.
One approach might be to have a CRON job which reads from the SQL database and does ps to determine which spawned processes need to be restarted, and then re-requests that the "manager" process does so using the same IPC mechanism used by the web server. How you differentiate starts vs. restarts in your tracking/monitoring/logging is up to you.
If the server itself loses power or crashes, then you could have the "manager" process perform cleanup when it first runs, e.g:
Look for entries in the database for spawned processes that were allegedly running before the server was shut down.
Check for those processes by PID and run time (this is important; see the sketch after this list).
Either re-spawn the spawned processes that didn't complete, or store something in the database to indicate to the web server that this was the case.
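For the "check by PID and run time" step, the usual technique looks like this (signal 0 tests existence without disturbing the process; comparing the recorded start time guards against PID reuse):

import os

def pid_is_running(pid):
    # Signal 0 performs the existence/permission check without
    # actually delivering a signal.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # exists, but owned by another user
    return True

On Linux, the process start time can then be read from /proc/<pid>/stat and compared against the time recorded in the transactions table.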
Update #1
Per your comment, here are some pointers to get started:
You mentioned perl, so presuming you have some proficiency there -- here are some perl modules to help you on your way to writing the "manager" process script:
If you're not already familiar with it, CPAN is the repository for perl modules that do basically anything.
Daemon::Daemonize - To daemonize a process so that it continues running after you log out. Also provides methods for writing scripts to start/stop/restart the daemon.
Proc::Spawn - Helps with 'spawning' child scripts. Basically does fork() then exec(), but also handles STDIN/STDOUT/STDERR (or even tty) of child process. You could use this to launch your long-running perl scripts.
If your web server front-end code is not already written in perl, you'll need something that's pretty portable for inter-process message-passing and queuing; I'd probably make your web server front end in something easy to deploy (like PHP).
Here are two possibilities (there are many more):
Perl and PHP implementations for the Spread Toolkit.
Perl and PHP implementations for the ZeroMQ library.
Proc::ProcessTable - You can use this to check on running processes (and get all sorts of stats, as discussed above).
Time::HiRes - Use the high-granularity time functions from this package to implement your 'throttling' framework. Basically, just limit the number of requests you de-queue per unit of time; see the sketch after this list.
DBI (with mysql) - Update your MySQL database from the "manager" process.
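As promised above, the throttling idea in sketch form (shown in Python for brevity; the Time::HiRes version has the same shape): cap the rate at which requests are de-queued.

import time

def throttled(items, per_second):
    # Yield at most `per_second` items each second.
    interval = 1.0 / per_second
    last = 0.0
    for item in items:
        wait = interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item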

Nginx vs Apache to solve load issue on website

So we have a web application with 10-12 pages that make many POST/GET DB calls. We usually get an Apache crash or other problem when site traffic reaches 1000 or so concurrent users, which is a very small number; we have upgraded the server with good RAM and resources. Our system admin did load testing with blitz.io and other custom scripts, and he is suggesting we move away from Apache. Some of this does not make sense to me: Apache should not be too bad at handling a few thousand concurrent users, considering we have CloudFlare for caching. Here is what he suggested:
1. Replace Apache+mod_fcgi with Nginx+php-fpm, which can make the server handle many more users, and then test it.
or
2. For testing: we need 10-20 servers to run a scenario from. Basically, what is needed is a more complex blitz.io analogue: create one server (which takes all those hours), then just clone it in the cloud and pay for about one hour of testing multiplied by the number of servers needed.
Once again, there are many DB calls and .htaccess processing. Also, what makes Nginx better than Apache in this case?
I would check this comparison first. Basically, nginx is event-based, so it's able to handle more requests concurrently. However, as the MySQL DB seems to be the choke point here, it's very possible that nginx wouldn't solve all your problems. Perhaps moving to a NoSQL kind of database, one that's better at scaling horizontally, would help (if that's feasible).

To fork or not to fork?

I am re-developing a system that will send messages via HTTP to one of a number of suppliers. The original is perl scripts, and it's likely that the re-development will also use perl.
In the old system, there were a number of perl scripts all running at the same time, five for each supplier. When a message was put into the database, a random thread number (1-5) and the supplier were chosen, to ensure that no message was processed twice while avoiding having to lock the table/row. Additionally, there was a "Fair Queue Position" field in the database to ensure that a large message send didn't delay small sends that happened while the large one was being sent.
At some times there would be just a couple of messages per minute, but at other times there would be a dump of potentially hundreds of thousands of messages. It seems like a waste of resources to have all the scripts running and checking for messages all of the time, so I am trying to work out whether there is a better way to do it, or whether the old way is acceptable.
My thoughts right now lie with the idea of having one script that runs and forks as many child processes as are needed (up to a limit) depending on how much traffic there is, but I am not sure how best to implement it such that each message is processed only once, while the fair queuing is maintained.
My best guess right now is that the parent script updates the DB to indicate which child process should handle each message; however, I am concerned that this will end up being less efficient than the original method. I have little experience writing forking code (the last time I did it was about 15 years ago).
Any thoughts or links to guides on how best to process message queues appreciated!
You could use Thread::Queue or any of the other modules from this question: Is there a multiprocessing module for Perl?
If the old system was written in Perl this way, you could reuse most of it.
Non working example:
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();    # a new empty queue

# Worker threads (10 of them)
for (1 .. 10) {
    threads->create(sub {
        while (my $item = $q->dequeue()) {
            # do work on $item
        }
    })->detach();
}

my $dbh = ...;    # connect to your database here
while (1) {
    # get items from the DB and hand them to the threads
    my @items = get_items_from_db($dbh);
    $q->enqueue(@items);
    print "Pending items: " . $q->pending() . "\n";
    sleep 15;    # check the DB every 15 seconds
}
I would suggest using a message queue server like RabbitMQ.
One process feeds work into the queue, and you can have multiple worker processes consume the queue.
Advantages of this approach:
workers block when waiting for work (no busy waiting)
more worker processes can be started up manually if needed
worker processes don't have to be a child of a special parent process
RabbitMQ will distribute the work among all workers which are ready to accept work
RabbitMQ will put work back into the queue if the worker doesn't return an ACK
you don't have to assign work in the database
every "agent" (worker, producer, etc.) is an independent process which means you can kill it or restart it without affecting other processes
To dynamically scale the number of workers up or down, you can implement something like:
have workers automatically die if they don't get work for a specified amount of time
have another process monitor the length of the queue and spawn more workers if the queue is getting too big
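A minimal sketch of the consumer side with the Python pika client (the queue name and the work function are assumptions), showing the blocking wait and the explicit ACK the list above relies on:

import pika

def send_to_supplier(body):
    # Stand-in for the real HTTP delivery to a supplier.
    print("delivering", body)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="messages", durable=True)  # survives broker restarts

def handle(ch, method, properties, body):
    send_to_supplier(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # no ACK => re-queued

ch.basic_qos(prefetch_count=1)  # hand each worker one message at a time
ch.basic_consume(queue="messages", on_message_callback=handle)
ch.start_consuming()            # blocks; no busy waiting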
I would recommend using beanstalkd for a dedicated job server, and Beanstalk::Client in your perl scripts for adding jobs to the queue and removing them.
You should find beanstalkd easier to install and set up compared to RabbitMQ. It will also take care of distributing jobs among available workers, burying any failed jobs so they can be retried later, scheduling jobs to be done at a later date, and many more basic features. For your worker, you don't have to worry about forking or threading; just start up as many workers as you need, on as many servers as you have available.
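The same worker shape against beanstalkd, sketched with the Python beanstalkc client for consistency with the other examples here (the answer itself suggests Beanstalk::Client for Perl; the process function is a stand-in):

import beanstalkc

def process(body):
    # Stand-in for the real message delivery.
    print("sending", body)

beanstalk = beanstalkc.Connection(host="localhost", port=11300)
while True:
    job = beanstalk.reserve()  # blocks until a job is available
    try:
        process(job.body)
        job.delete()           # done; remove it from the queue
    except Exception:
        job.bury()             # keep it around for inspection/retry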
Either RabbitMQ or Beanstalk would be better than rolling your own db-backed solution. These projects have already worked out many of the details needed for queueing, and implemented features you may not realize yet that you want. They should also handle polling for new jobs more efficiently, compared to sleeping and selecting from your database to see if there's more work to do.

Web request performance really bad under stress

I wrote a web application using python and Flask framework, and set it up on Apache with mod_wsgi.
Today I use JMeter to perform some load testing on this application.
For one web URL:
when I set only 1 thread to send requests, the response time is 200ms
when I set 20 concurrent threads to send requests, the response time increases to more than 4000ms (4s). THIS IS UNACCEPTABLE!
I am trying to find the problem, so I recorded the time in before_request and teardown_request methods of flask. And it turns out the time taken to process the request is just over 10ms.
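The instrumentation is roughly this (a simplified sketch, not the exact code):

import time
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def start_timer():
    g.start = time.time()

@app.teardown_request
def log_elapsed(exc):
    elapsed_ms = (time.time() - g.start) * 1000
    app.logger.info("%s handled in %.1f ms", request.path, elapsed_ms)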
In this URL handler, the app just performs some SQL queries (about 10) against a MySQL database, nothing special.
To test whether the problem is with the web server or the framework configuration, I wrote another method, Hello, in the same Flask application, which just returns a string. It performs perfectly under load; the response time is 13ms with 20-thread concurrency.
And when doing the load test, I execute 'top' on my server; there are about 10 Apache threads, but the CPU is mostly idle.
I am at my wit's end now. Even if the requests were performed serially, the performance should not drop so drastically... My guess is that there is some queuing somewhere that I am unaware of, and there must be overhead besides handling the request.
If you have experience in tuning performance of web applications, please help!
EDIT
About the Apache configuration: I used the worker MPM, with this configuration:
<IfModule mpm_worker_module>
    StartServers          4
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadLimit          64
    ThreadsPerChild      50
    MaxClients          200
    MaxRequestsPerChild   0
</IfModule>
As for mod_wsgi, I tried turning WSGIDaemonProcess on and off (by commenting out the following line); the performance looks the same.
# WSGIDaemonProcess tqt processes=3 threads=15 display-name=TQTSERVER
Congratulations! You found the performance problem - not your users!
Analysing performance problems on web applications is usually hard, because there are so many moving parts, and it's hard to see inside the application while it's running.
The behaviour you describe is usually associated with a bottleneck resource - this happens when there's a particular resource that can't keep up, so queues requests, which tends to lead to a "hockey stick" curve with response times - once you hit the point where this resource can't keep up, the response time goes up very quickly.
20 concurrent threads seems low for that to happen, unless you're doing a lot of very heavy lifting on the page.
The first place to start is top - while CPU is low, what are memory, disk access, etc. doing? Is your database running on the same machine? If not, what does top say on the database server?
Assuming it's not some silly hardware thing, the next most likely problem is the database access on that page. It may be that one query is returning literally the entire database when all you want is one record (this is a fairly common anti-pattern with ORM solutions); that could lead to the behaviour you describe. I would use the Flask logging framework to record your database calls (start, end, number of records returned), and look for anomalies there.
If the database is performing well under load, it's either the framework or the application code. Again, use logging statements in the code to trace the execution time of individual blocks of code, and keep hunting...
It's not glamorous, and can be really tedious - but it's a lot better that you found this before going live!
Look at using New Relic to identify where the bottleneck is. See an overview of it, and a discussion of identifying bottlenecks, in my talk:
http://lanyrd.com/2012/pycon/spcdg/
Also, edit your original question and add the mod_wsgi configuration you are using, plus whether you are using Apache's prefork or worker MPM, as you could be doing something non-optimal there.

How to Guarantee Message delivery with Celery?

I have a python application where I want to start doing more work in the background so that it will scale better as it gets busier. In the past I have used Celery for doing normal background tasks, and this has worked well.
The only difference between this application and the others I have done in the past is that I need to guarantee that these messages are processed; they can't be lost.
For this application I'm not too concerned about the speed of my message queue; I need reliability and durability first and foremost. To be safe, I want to have two queue servers, both in different data centers in case something goes wrong, one a backup of the other.
Looking at Celery, it looks like it supports a bunch of different backends, some with more features than others. The two most popular look like Redis and RabbitMQ, so I took some time to examine them further.
RabbitMQ:
Supports durable queues and clustering, but the problem with the way they do clustering today is that if you lose a node in the cluster, all messages on that node are unavailable until you bring the node back online. It doesn't replicate the messages between the different nodes in the cluster; it just replicates the metadata about the message and then goes back to the originating node to fetch it. If the node isn't running, you are out of luck. Not ideal.
The way they recommend getting around this is to set up a second server, replicate the file system using DRBD, and then run something like Pacemaker to switch clients to the backup server when needed. This seems pretty complicated; I'm not sure if there is a better way. Anyone know of one?
Redis:
Supports a read slave, which would allow me to have a backup in case of emergencies, but it doesn't support a master-master setup, and I'm not sure whether it handles active failover between master and slave. It doesn't have the same features as RabbitMQ, but it looks much easier to set up and maintain.
Questions:
What is the best way to set up Celery so that it will guarantee message processing?
Has anyone done this before? If so, would you mind sharing what you did?
A lot has changed since the OP! There is now an option for high availability, aka "mirrored" queues. This goes pretty far toward solving the problem you described. See http://www.rabbitmq.com/ha.html.
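On the Celery side, the settings that usually accompany mirrored/durable queues look roughly like this (old-style setting names; verify them against your Celery version):

# celeryconfig.py -- sketch of durability-oriented settings
CELERY_ACKS_LATE = True                      # ack only after the task completes
CELERYD_PREFETCH_MULTIPLIER = 1              # don't pre-fetch a batch into one worker
CELERY_DEFAULT_DELIVERY_MODE = "persistent"  # write messages to disk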
You might want to check out IronMQ; it covers your requirements (durable, highly available, etc.) and, being a cloud-native solution, requires zero maintenance. And there's a Celery broker for it: https://github.com/iron-io/iron_celery, so you can start using it just by changing your Celery config.
I suspect that Celery bound to existing backends is the wrong solution for the reliability guarantees you need.
Given that you want a distributed queueing system with strong durability and reliability guarantees, I'd start by looking for such a system (they do exist) and then figuring out the best way to bind to it in Python. That may be via Celery & a new backend, or not.
I've used Amazon SQS for this purpose and got good results. A message is retained until you delete it from the queue, and SQS allows you to grow your app as much as you need.
Is using a distributed rendering system an option? These are normally reserved for HPC, but a lot of the concepts are the same. Check out Qube or Deadline Render. There are other, open-source solutions as well. All have failover in mind, given the high degree of complexity and risk of failure in renders that can take hours per frame.