How to Guarantee Message delivery with Celery?

I have a python application where I want to start doing more work in the background so that it will scale better as it gets busier. In the past I have used Celery for doing normal background tasks, and this has worked well.
The only difference between this application and the others I have done in the past is that I need to guarantee that these messages are processed; they can't be lost.
For this application I'm not too concerned about speed for my message queue; I need reliability and durability first and foremost. To be safe I want to have two queue servers in different data centers, one a backup of the other, in case something goes wrong.
Looking at Celery, it supports a bunch of different backends, some with more features than others. The two most popular look like Redis and RabbitMQ, so I took some time to examine them further.
RabbitMQ:
Supports durable queues and clustering, but the problem with the way they have clustering today is that if you lose a node in the cluster, all messages on that node are unavailable until you bring that node back online. It doesn't replicate the messages between the different nodes in the cluster; it only replicates the metadata about each message and then goes back to the originating node to get the message. If that node isn't running, you are S.O.L. Not ideal.
The way they recommend to get around this is to set up a second server and replicate the file system using DRBD, and then run something like Pacemaker to switch the clients to the backup server when needed. This seems pretty complicated, and I'm not sure if there is a better way. Does anyone know of one?
Redis:
Supports a read slave, which would give me a backup in case of emergencies, but it doesn't support a master-master setup, and I'm not sure if it handles active failover between master and slave. It doesn't have the same features as RabbitMQ, but it looks much easier to set up and maintain.
Questions:
What is the best way to set up Celery so that it will guarantee message processing?
Has anyone done this before? If so, would you mind sharing what you did?

A lot has changed since the OP! There is now an option for high-availability aka "mirrored" queues. This goes pretty far toward solving the problem you described. See http://www.rabbitmq.com/ha.html.
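A durable broker only covers half of the problem; the worker side also has to avoid acknowledging a message before the work is done. As a rough sketch, a Celery configuration module aimed at that could look like the following. The broker URL is a made-up placeholder, and the setting names are the lowercase ones used by Celery 4 and later as far as I know, so check them against the version you run.

```python
# celeryconfig.py: a sketch of reliability-oriented settings, not a vetted config.
# The broker host and credentials below are hypothetical placeholders.
broker_url = "amqp://user:password@rabbit.example.com:5672//"

# Acknowledge a message only after the task has finished, so a worker crash
# leads to redelivery instead of a lost message. Tasks may then run more
# than once and should be written to be idempotent.
task_acks_late = True

# Requeue the message if the process executing the task dies abruptly.
task_reject_on_worker_lost = True

# Reserve one message at a time per worker process, so few messages sit
# unacknowledged in a worker that might fail.
worker_prefetch_multiplier = 1

# Ask the broker to write messages to disk (this is Celery's default).
task_default_delivery_mode = "persistent"

# Wait for the broker to confirm each publish (publisher confirms on the
# py-amqp transport), so a failed send surfaces as an error in the producer.
broker_transport_options = {"confirm_publish": True}
```

With late acknowledgement the guarantee you get is at-least-once processing, not exactly-once, so the tasks themselves need to tolerate duplicates.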

You might want to check out IronMQ, it covers your requirements (durable, highly available, etc) and is a cloud native solution so zero maintenance. And there's a Celery broker for it: https://github.com/iron-io/iron_celery so you can start using it just by changing your Celery config.

I suspect that Celery bound to existing backends is the wrong solution for the reliability guarantees you need.
Given that you want a distributed queueing system with strong durability and reliability guarantees, I'd start by looking for such a system (they do exist) and then figuring out the best way to bind to it in Python. That may be via Celery & a new backend, or not.

I've used Amazon SQS for this purpose and got good results. You will keep receiving a message until you delete it from the queue, and it lets your app grow as large as you need.
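To illustrate the receive-until-deleted behaviour, here is a small in-memory model of it. It is not the SQS API; `VisibilityQueue` and its methods are names I made up, and the injectable clock exists only so the behaviour can be shown without waiting.

```python
import time
import uuid


class VisibilityQueue:
    """In-memory stand-in for a queue with receive/delete semantics."""

    def __init__(self, visibility_timeout=30.0, clock=time.monotonic):
        self._timeout = visibility_timeout
        self._clock = clock
        # message id -> (body, time at which the message becomes visible again)
        self._messages = {}

    def send(self, body):
        msg_id = str(uuid.uuid4())
        self._messages[msg_id] = (body, 0.0)
        return msg_id

    def receive(self):
        """Return (id, body) of a visible message and hide it, or None."""
        now = self._clock()
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                # Hide the message for a while instead of removing it.
                self._messages[msg_id] = (body, now + self._timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the consumer only after the work has succeeded."""
        self._messages.pop(msg_id, None)
```

A consumer that crashes between `receive` and `delete` simply never deletes the message, so it becomes visible again after the timeout and another consumer picks it up.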

Is using a distributed rendering system an option? These are normally reserved for HPC, but a lot of the concepts are the same. Check out Qube or Deadline Render. There are other, open source solutions as well. All have failover in mind, given the high degree of complexity and risk of failure in renders that can take hours per image sequence frame.

Related

MySQL Replication: Question about a fallback-system

I want to set up a complete server (Apache, MySQL 5.7) as a fallback for a production server.
The synchronization on file level using rsync and cronjob is already done.
The mysql-replication is currently the problem. More precisely: the choice of the right replica method.
Multi primary group replication seemed to be the most suitable method so far.
In case of a longer production downtime, it is possible to switch to the fallback server quickly via DNS change.
Write accesses to the database are possible immediately without adjustments.
So far so good. But if the fallback server fails, it is in unreachable status and the production server switches to read-only, since its group no longer has a quorum. This is of course a no-go.
I thought it might be possible using different replica variables: If the fallback-server is in unreachable state for a certain time (~5 minutes), the production-server should stop the group_replication and start a new group_replication. This has to happen automatically to keep the read-only time relatively low. When the fallback-server is back online, it should be manually added to the newly started group. But if I read the various forum posts and documentation correctly, it's not possible that way. And running a Group_Replication with only two nodes is the wrong decision anyway.
https://forums.mysql.com/read.php?177,657333,657343#msg-657343
Is the master - slave replication the only one that can be considered for such a fallback system? https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-switch.html
Or does Group_Replication offer possibilities after all, if you can react suitably to the quorum problem? Possibilities that I have overlooked so far.
Many thanks and best regards
Short Answer: You must have [at least] 3 nodes.
Long Answer:
Split brain with only two nodes:
Write only to the surviving node, but only if you can conclude that it is the only surviving node, else...
The network died and both Primaries are accepting writes. This leads to them disagreeing with each other. You may have no clean way to repair the mess.
Go into readonly mode with surviving node. (The only safe and sane approach.)
The problem is that the automated system cannot tell the difference between a dead Primary and a dead network.
So... You must have 3 nodes to safely avoid "split-brain" and have a good chance of an automated failover. This also implies that no two nodes should be in the same tornado path, flood range, volcano path, earthquake fault, etc.
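The arithmetic behind the 3-node rule is just a strict-majority check. The helper names below are mine, not anything from MySQL or Galera:

```python
def has_quorum(reachable_members: int, group_size: int) -> bool:
    """True if the reachable members (including this node) are a strict majority."""
    return reachable_members > group_size // 2


def tolerated_failures(group_size: int) -> int:
    """How many members can be lost while the rest still hold a majority."""
    return (group_size - 1) // 2


for size in (2, 3, 5):
    print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```

With two nodes, the survivor sees 1 of 2 members, which is not a majority, so it cannot safely keep accepting writes. With three nodes, any two that can still see each other can.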
You picked Group Replication (InnoDB Cluster). That is an excellent offering from MySQL. Galera with MariaDB is an equally good offering -- there are a lot of differences in the details, but it boils down to needing 3, preferably dispersed, nodes.
DNS changes take some time, due to the TTL. A proxy server may help with this.
Galera can run in a "Primary + Replicas" mode, but it also allows you to run with all nodes being read-write. This leads to a slightly different set of steps necessary for a client to take to stop writing to one node and start writing to another. There are "Proxys" to help with such.
FailBack
Are you trying to always use a certain Primary except when it is down? Or can you accept letting any node be the 'current' Primary?
I think of "fallback" as simply a "failover" that goes back to the original Primary. That implies a second outage (possibly briefer). However, I understand geographic considerations. You may want your main Primary to be 'near' most of your customers.
I recommend using the Galera MySQL cluster with HAProxy as a load balancer and automatic failover solution. We have used it in production for a long time now and never had serious problems. The most important thing to consider is monitoring the replication sync status between nodes. Also, make sure your storage engine is InnoDB, because Galera doesn't work with MyISAM.
Check this link on how to set it up:
https://medium.com/platformer-blog/highly-available-mysql-with-galera-and-haproxy-e9b55b839fe0
But in these kinds of situations, the main problem is not the failover mechanism, because there are many solutions out of the box. Rather, you have to check your read/write ratio and transactional services and make sure replication delays won't affect them. Sometimes vertically scalable solutions with master-slave replication are more suitable for transaction-sensitive financial systems; it really depends on the service you're providing.

What's the most efficient architecture for this system? (push or pull)

All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySql.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
1 seems inefficient. 2 seems better. 3 reduces TCP traffic, but that isn't much. Anyway, just a few bytes for each 2. However, it has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
I guess there are two things to consider. Most importantly, reduce network traffic and, less important, make my code simpler.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled & can't find an answer. I am hoping for the benefit of your experience. Can you help me reword this to be acceptable? Or maybe give a URL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenarios, clients can also use the middleware message broker to send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
"Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?"
If this is a real requirement (reports are usually an immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while being watched, similar to a stock ticker), it is easy to implement: a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved flexibly and with few resources using existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
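To make the subscribe-with-wildcards idea concrete, here is a toy in-process version in Python. A real broker delivers over the network and has its own wildcard syntax; the `Broker` class, the topic names and the shell-style patterns here are illustrative assumptions only.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class Broker:
    """Toy in-process publish/subscribe hub with wildcard destinations."""

    def __init__(self):
        # destination pattern -> list of callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, pattern, callback):
        self._subscribers[pattern].append(callback)

    def publish(self, topic, message):
        """Deliver message to every subscriber whose pattern matches topic."""
        delivered = 0
        for pattern, callbacks in self._subscribers.items():
            if fnmatchcase(topic, pattern):
                for callback in callbacks:
                    callback(topic, message)
                    delivered += 1
        return delivered


broker = Broker()
# A history view only cares about changes to submissions.
broker.subscribe("submissions.*", lambda topic, msg: print("refresh", topic, msg))
broker.publish("submissions.42", {"result": "pass"})
```

The publisher never needs to know who is listening, which is the decoupling described above.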
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions is viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient
Is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to document a tech-debt-inducing fix when it is checked in, and/or to tell the team members who could be affected by the change.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (The client provides the time, so the service does not have to store and retrieve session state.)
If nothing has changed, the server can provide a time when the client should poll again. A server that is aware of excessive load can then back off compliant clients by instructing them to wait longer before retrying, which gives it control over the polling rate.
After considering that, ask "Is the API as simple as it can possibly be?"
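A minimal sketch of such a polling API, in Python for brevity. It swaps the client-supplied time for a client-supplied version counter, which sidesteps clock differences between machines; `ChangeLog`, the field names and the delay values are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Hypothetical server-side store: every update gets an increasing version."""

    version: int = 0
    rows: dict = field(default_factory=dict)  # row id -> (version, value)

    def update(self, row_id, value):
        self.version += 1
        self.rows[row_id] = (self.version, value)

    def changes_since(self, client_version, busy=False):
        """Return rows changed after client_version and a suggested poll delay."""
        changed = {
            row_id: value
            for row_id, (version, value) in self.rows.items()
            if version > client_version
        }
        # The server controls the polling rate: back off when idle or loaded.
        if busy:
            retry_after = 60
        elif changed:
            retry_after = 5
        else:
            retry_after = 15
        return {
            "version": self.version,
            "changed": changed,
            "retry_after_seconds": retry_after,
        }
```

The client keeps the last `version` it saw, sends it with the next call, and sleeps for `retry_after_seconds` in between, so the server holds no per-client state.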
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about- publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push: timeout scenarios, number of clients, system resource consumption, etc.
Consider that the "pusher" has to become aware of all consuming applications. If you use an industry-standard message queueing system (RabbitMQ, MS MQ, MQ Series, etc., all of which support publish/subscribe via JMS topics or an equivalent), this problem is abstracted away, at the cost of some added complexity in your application.
Consider the scenarios where clients suddenly become unavailable. Hypothesize failure modes and test the robustness of your system so you have confidence that it is able to recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?

Scaling up a ruby, activerecord, mysql app

I have an app...
The app does a market comparison for a financial product - for a given quote request, it contacts several other sites for their quotes. It then gives the user the results - several quotes for their details.
To manage these requests they get saved to MySQL and then my app kicks in, picking up the pending quotes and farming these out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby as I had thread/db related issues. Using Java threadpools to control the number of threads. With the current hardware/VPS - it can handle around 200 threads. A lot of the limitations seem to relate to each thread grabbing their own MySQL connection - grabbing the quote details and saving back the results. We want to handle more concurrent threads and so looking for ways to scale up.
Wondering which way to go ...
Bigger hardware...
More machines and use some kind of queueing mechanism (with priorities) to share the load across the machines - so the threads don't touch the db, all the details/responses go via the queue - so the DB hit is less, but then maybe I am just pushing the problem into the queue. Thinking of using something like MongoDB for the queue, but open to suggestions - something easy to use with Ruby :)
Some kind of remote/RPC mechanism, e.g. DRb - theoretically this seems like a good option, but I have not done anything with this yet to know how complex it will make things.
Something else...?
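The queue-with-priorities option can be sketched in a single process like this (Python here rather than Ruby, and with an in-memory queue standing in for whatever shared broker the machines would actually use). The function and parameter names are made up for the example.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=4):
    """Process (priority, payload) jobs, lowest priority number first."""
    pending = queue.PriorityQueue()
    results = queue.Queue()
    for seq, (priority, payload) in enumerate(jobs):
        # seq breaks ties so payloads are never compared with each other.
        pending.put((priority, seq, payload))

    def worker():
        # Workers only talk to the queues, never to the database.
        while True:
            try:
                _, _, payload = pending.get_nowait()
            except queue.Empty:
                return
            results.put(handler(payload))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [results.get() for _ in range(results.qsize())]
```

A single process would then drain the results and write them back to MySQL over one connection, instead of every thread holding its own.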
From this link Reasons for NOT scaling-up vs. -out? - it would seem this problem is suited to running more machines to solve it.
So, any thoughts on which way to go...
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you're making and tune them aggressively. Retrieve only what you need, skipping columns that aren't explicitly used, and be very careful about eager loading things you don't need in their entirety.
You'll often find you can get significant speed gains by adding indexes, or strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: the fastest database call is the one that's never made. It's not hard to use something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it's even easy to invalidate and expire these entries, provided you channel your updates through a few methods.
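As a rough illustration of channelling updates through a few methods, here is a read-through cache sketched in Python, with a plain dict standing in for Memcached. The class name and the two callables are hypothetical.

```python
class CachedRepository:
    """Read-through cache in front of a slow lookup; a dict stands in for Memcached."""

    def __init__(self, load_from_db, save_to_db):
        self._load = load_from_db
        self._save = save_to_db
        self._cache = {}

    def get(self, key):
        # Only hit the database on a cache miss.
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def update(self, key, value):
        # Every write goes through here, so invalidation cannot be forgotten.
        self._save(key, value)
        self._cache.pop(key, None)
```

Because `update` is the only write path, a stale entry can only exist if something bypasses this class and writes to the database directly.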
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
A cache like Memcached can handle an extremely high amount of traffic, so whenever possible, cache against this to avoid hitting your database for every last thing.
If you've exhausted these options, it's time for more front-end servers and even more database capacity, but only then.
Queueing is the easiest thing for you to implement. Use something like this: http://beanstalkd.github.com/beaneater/
Basically you can prepend your methods with async. which will put them into the queue and execute them. The queue and workers can be on the same server or a different one.

economical way of scaling a php+mysql website

My partner and I are trying to start a website hosted in the cloud. It has pretty heavy AJAX traffic and the backend handles money transactions, so we need ACID in some of the DB tables.
Currently everything is running off a single server. Some of the AJAX traffic is cached in text files.
Question:
What's the best way to scale the database server? I thought about moving MySQL to separate instances and doing master-master replication. However, this seems tough, and I heard I might lose ACID properties even with InnoDB. Is Amazon RDS a good solution?
The web server is relatively stateless except for some custom log files and the AJAX cache files. What's a good way to scale to multiple web servers? I guess the custom log files can be moved to a reliable shared file system or DB, but I'm not sure what to do about the AJAX cache file coherency across multiple servers. (I don't care about losing /var/log/* if a web server dies.)
For performance it might be cheaper to go with larger instance with more cores and memory but eventually I would need redundancy so wondering what's the best way to do this cheaply.
thanks
Take a look at this post. There are plenty of presentations on the net discussing scalability. A few things I suggest keeping in mind:
Plan early for data sharding [even if you are not going to do it immediately].
Try using mechanisms like memcached to limit the number of queries sent to the database.
Prepare to serve static content from another domain; in the longer run, from an nginx-like server and later a CDN.
Redundancy depends on your needs. Is 'read-only' mode acceptable for your site? If so, go with MySQL replication + rsync of static files, and in case of failover have your site work in that mode until you recover the master node. If you need high availability, take a look either at DRBD replication [at least for MySQL] or at a setup with automated promotion of a slave server to master.
you might find following interesting:
http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html
http://mysqlperformanceblog.com
http://highscalability.com
http://google.com - search for scalability, lamp, failover... there are tons of case studies and horror stories from the trenches :-]
Another option is using a scalable platform such as Amazon Web Services. You can start out with a micro instance and configure load balancing to fire up more instances as needed.
Once you determine average resource requirements you can then resize your image to larger or smaller depending on your needs.
http://aws.amazon.com
http://tuts.pinehead.tv/2011/06/26/creating-an-amazon-ec2-instance-with-linux-lamp-stack/
http://tuts.pinehead.tv/2011/09/11/how-to-use-amazon-rds-relation-database-service-to-host-mysql/
Amazon allows you to either load balance or change instance size based on demand.

Implementing cross-thread/process queues in Perl

What is the most efficient way of implementing queues to be read by another thread/process?
I'm thinking of using a basic MySQL table with polling on sleep. This sounds like the most scalable option (it doesn't even have to be on the same server) but might potentially result in too many queries to the DB.
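For reference, the table-plus-polling idea looks roughly like this. The sketch is in Python with SQLite from the standard library so that it runs anywhere; the table layout and function names are invented, and on MySQL/InnoDB the claim step would more likely use SELECT ... FOR UPDATE inside a transaction.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    # isolation_level=None means autocommit; transactions are started explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))


def claim_next(conn):
    """Atomically mark the oldest pending job as running and return it, or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row


def drain(conn, handle, idle_sleep=1.0, max_idle_polls=3):
    """Poll for jobs, sleeping when idle; stop after max_idle_polls empty polls."""
    idle = 0
    while idle < max_idle_polls:
        job = claim_next(conn)
        if job is None:
            idle += 1
            time.sleep(idle_sleep)
            continue
        idle = 0
        handle(job[1])
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
```

The query load is one small indexed lookup per consumer per `idle_sleep` interval, so the sleep length is the knob that trades latency against database traffic.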
You have several options, and it really depends on what you are trying to get the system to do.
Fork child processes, and communicate with them through their stdin/stdout pipes.
Create a named pipe or a Unix domain socket on the file system (MySQL's /tmp/mysql.sock is an example of the latter) and use it to communicate across processes.
Set up a message broker. I'd recommend giving ActiveMQ a try, and using the Stomp client for Perl. This is probably your most scalable solution.
This is one of those things that is simple to write yourself to your exact specifications. I wrote a toy one here:
http://github.com/jrockway/app-queue
I am not sure it compiles anymore, as AnyEvent::Subprocess has changed significantly since I wrote it. But you can steal the ideas.
Basically, I think an RPC-style infrastructure is the best. You have a server that handles keeping the data. Then clients connect and add data or remove data via RPC calls. This gives you ultimate flexibility with the semantics. You can be "transactional" so that if a client takes data and then never says "hey, I am done with it", you can assume the client died and give the job to another client. You can also ensure that each job is only run once.
Anyway, making a queue work with a relational database table involves a bit of effort. You should use something like KiokuDB for the persistence. (You can physically store the data in MySQL if you desire, but this provides a nicer Perl API to that.)
In PostgreSQL you could use the NOTIFY/LISTEN combination; the consumer would then only need to wait on the PG connection socket after running LISTEN(s).