Push Data onto Queue vs Pull Data by Workers - message-queue

I am building a web site backend that involves a client submitting a request to perform some expensive (in time) operation. The expensive operation also involves gathering some set of information for it to complete.
The work that the client submits can be fully described by a uuid. I am hoping to use a service-oriented architecture (SOA), i.e. multiple microservices.
The client communicates with the backend using RESTful communication over HTTP. I plan to use a queue that the workers performing the expensive operation can poll for work. The queue has persistence and offers decent reliability semantics.
One consideration is whether I gather all of the data needed for the expensive operation upstream and then enqueue all of that data or whether I just enqueue the uuid and let the worker fetch the data.
Here are the two architectures under consideration:
Push-based (i.e. gather the data upstream)
Pull-based (i.e. the worker gathers the data)
Some things that I have thought of:
In the push-based case, I would likely be blocking while I gathered the needed data, so the client's HTTP request would not be responded to until the data was gathered and enqueued. From a UI standpoint, the request would be pending until the response comes back.
In the pull-based scenario, only the worker needs to know what data is required for the work. That means I can have multiple types of clients talking to various backends. If the data needs change, I update just the workers and not each of the upstream services.
Anything else that I am missing here?

Another benefit of the pull-based approach is that you don't have to worry about the data going stale while it sits in the queue.

I think you already pretty much explained that the second (pull-based) approach is better.
If a user's request is going to be processed asynchronously anyway, why wait for the data to be gathered before returning a response? Just enqueue a work item and return the HTTP response.
Passing data via the queue is not a good option. If you gather the data upstream, you will have to pass it to the worker somehow other than via the queue (usually BLOB storage). That is additional work that is not really needed in your case.
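For illustration, the enqueue-just-the-uuid flow might look like this in Python with the pika RabbitMQ client. This is a minimal sketch; the queue name and the worker-side helpers (fetch_data_for, run_expensive_operation) are assumptions, not anything from the question:

```python
import json
import uuid

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="work", durable=True)

def handle_request():
    """HTTP handler: enqueue just the uuid and respond right away."""
    job_id = str(uuid.uuid4())
    channel.basic_publish(
        exchange="",
        routing_key="work",
        body=json.dumps({"job_id": job_id}),
        properties=pika.BasicProperties(delivery_mode=2),  # survive broker restarts
    )
    return {"status": "accepted", "job_id": job_id}  # 202-style response

def on_work(ch, method, properties, body):
    """Worker callback: only here is the data actually gathered."""
    job_id = json.loads(body)["job_id"]
    data = fetch_data_for(job_id)      # hypothetical: the worker pulls its own data
    run_expensive_operation(data)      # hypothetical
    ch.basic_ack(delivery_tag=method.delivery_tag)
```

Note that the client gets its response as soon as basic_publish returns; the expensive gathering and processing both happen on the worker's schedule.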

I would recommend Cadence Workflow instead of queues, as it supports long-running operations and state management out of the box.
Cadence offers a lot of other advantages over using queues for task processing.
Built-in exponential retries with an unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates could not succeed within a configured interval.
Support for long-running, heartbeating operations
Ability to implement complex task dependencies. For example, chaining of calls, or compensation logic in case of unrecoverable failures (SAGA)
Gives complete visibility into the current state of the update. For example, when using queues, all you know is whether there are messages in a queue, and you need an additional DB to track the overall progress. With Cadence, every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.

Related

SSIS - Script Component pulling information from RabbitMQ

A question that might be mostly theoretical, but I'd love to have my concerns put to rest (or confirmed).
I built a Script Component to pull data from RabbitMQ. On RabbitMQ, we basically set up a durable queue, which means messages keep accumulating in the queue and survive a server reboot. This construction allows us to periodically execute the package and grab all "new" messages since the last time we did so.
(We know RabbitMQ isn't set up to accommodate this kind of scenario, but rather expects a constant listener to process messages. However, we are not comfortable having some task start when SQL Server starts and run pretty much 24/7 to handle that, so we built something we can schedule to run every n minutes and empty the queue that way. If we are unable to run the task, we are most likely dealing with a failed SQL Server and have different priorities.)
The component sets up a connection, and then connects to the specific exchange + queue we are pulling messages from. Messages are in JSON format, so we deserialize the message into a class we defined in the script component.
For every message found, we disable auto-acknowledge so that we can process it and only acknowledge it once we're done with it (which ensures the message will be processed and doesn't slip through). Then we deserialize the message and push it onto the output buffer of the script component.
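For illustration, the drain-and-ack pattern amounts to roughly the following in Python with pika (the actual component is C# inside SSIS; the queue name and the output hand-off are placeholders):

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

while True:
    # basic_get returns (None, None, None) once the queue is empty
    method, properties, body = channel.basic_get(queue="events", auto_ack=False)
    if method is None:
        break  # drained; this scheduled run is done
    record = json.loads(body)       # deserialize the JSON message
    emit_to_output(record)          # placeholder for the output-buffer hand-off
    # ack only after the hand-off, so a crash before this point leaves
    # the message on the queue for the next scheduled run
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection.close()
```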
There are a few places things can go wrong, so we built a bunch of Try/Catch blocks into the code. However, since we're dealing with the queue aspect and need the information available to us, I'm wondering if someone can explain how/when a message that is sent to the output buffer is processed.
Is it batched up and then pushed? Is it sent straight away, and is the SSIS component perhaps not updating information back to SSIS in a timely fashion?
Would there be a chance for us to acknowledge a message, but that it somehow ends up not getting committed to our database, yet popped from the queue (as I think happens once a message is acknowledged)?

Integrating systems and snapshotting data exchange

I wanted to know if, in general, when integrating 2 or more systems via whatever means (i.e. web service, MQ, etc.), it is a best practice or a standard for your system to capture a snapshot of the data that you are exchanging with the other system. I am thinking of this as insurance for when reconciliation is required in scenarios such as production incidents.
Secondly, I would think this data snapshot is different from an audit trail, in that the data being sent is itself saved (e.g. XML data, a CSV file) as a LOB column in a snapshot table. Is this redundant with the audit trail?
For your first question ...
I've done many, many integrations using queues, web services, etc. and I will usually store an audit trail (a high level set of data telling me what happened), but I've never actually stored the payload itself for each call.
A few reasons for that:
The storage of the payloads being sent back and forth can get quite large.
I can usually reconstruct the payload using the audit trail. "Oh entity XYZ with ID 123 was sent yesterday. Let's take a look at what that entity looks like."
If you do the integration really well and have good testing around it, having copies of the payloads becomes unnecessary.
Instead of storing a copy of the payload I would focus on these things for integration:
Good unit tests on both sides and integration testing for the entire process.
Audit logs as you mentioned.
Good retry policies when a message fails (specifically for queues and topics).
Focus on idempotent messages, so if something fails you just process it again and everything is OK (see the sketch below).
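As a sketch of that last point, a processed-IDs table is one common way to make redelivery harmless. The table and field names here are assumptions:

```python
import sqlite3

db = sqlite3.connect("processed.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")

def handle(message_id, payload):
    """Process a message at most once, no matter how often it is delivered."""
    try:
        # the primary key makes a second delivery fail fast
        db.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
    except sqlite3.IntegrityError:
        return  # already processed; safe to ack again and move on
    apply_side_effects(payload)  # hypothetical business logic
    db.commit()
```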

How to retract a message in RabbitMQ?

I've got something like a job queue over RabbitMQ and, upon a request to cancel a job, I'd like to retract the tasks that have not yet started processing (their messages have not been ack'd), which corresponds to retracting these messages from the queues that they've been routed to.
I haven't found this functionality in AMQP or in the RabbitMQ API; perhaps I haven't searched well enough? Or will I have to use a workaround (it's not hard, but still)?
I would solve this scenario by having the worker check some sort of authoritative data source to determine whether the job should proceed. For example, the worker would check the job's status in a database to see if the job was canceled already.
For scenarios where the speed of processing jobs may outpace the speed with which the authoritative store can be updated and read, a store that trades some guarantees for speed may be useful.
An example of this would be to use Redis as the store for canceling the processing of a message instead of a relational DB like MySQL. Redis is very fast but makes fewer guarantees about the data it holds, whereas MySQL is much slower but offers stronger guarantees.
In the end, the concept of checking with another source for whether or not to process a message is the same, but the way you implement that depends on your particular scenario.
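A minimal sketch of that check, assuming Redis via redis-py and a pika consumer callback (the key format and process_job are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def on_message(ch, method, properties, body):
    job_id = body.decode()
    if r.get(f"job:{job_id}:cancelled"):  # authoritative cancel flag
        ch.basic_ack(delivery_tag=method.delivery_tag)  # drop the job silently
        return
    process_job(job_id)  # hypothetical expensive work
    ch.basic_ack(delivery_tag=method.delivery_tag)
```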
RabbitMQ doesn't let you modify or delete messages after they've been enqueued. For that, you want some kind of database to hold the state of each job, and to use RabbitMQ to notify interested parties of changes in that state.
For lowish volumes, you can kludge it together with a queue per job. Create the queue, post the job description to the queue, and announce the name of the queue to the workers. If the job needs to be cancelled before it is processed, delete the job's queue; when the workers come to fetch the job description, they'll notice the queue has vanished.
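Roughly, with pika 1.x (names are illustrative; the 404 channel error the broker raises for a missing queue is how the worker notices the cancellation):

```python
import pika

params = pika.ConnectionParameters("localhost")

def submit(job_id, description):
    channel = pika.BlockingConnection(params).channel()
    channel.queue_declare(queue=f"job-{job_id}")
    channel.basic_publish(exchange="", routing_key=f"job-{job_id}", body=description)

def cancel(job_id):
    channel = pika.BlockingConnection(params).channel()
    channel.queue_delete(queue=f"job-{job_id}")  # workers will find it gone

def fetch(job_id):
    channel = pika.BlockingConnection(params).channel()
    try:
        method, properties, body = channel.basic_get(
            queue=f"job-{job_id}", auto_ack=True
        )
        return body
    except pika.exceptions.ChannelClosedByBroker:
        return None  # queue vanished: the job was cancelled
```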
Lighter-weight and generally better would be to use Redis or another key/value store to hold the job state (with a deleted or absent record meaning a cancelled or nonexistent job) and to use RabbitMQ to notify about new/removed/changed records in the key/value store.
There are at least two ways to achieve your goal:
basic.reject will requeue the message if requeue=true is set (otherwise the message is discarded).
(supported since RabbitMQ 2.0.0; see http://www.rabbitmq.com/blog/2010/08/03/well-ill-let-you-go-basicreject-in-rabbitmq/).
basic.recover asks the broker to redeliver all unacked messages on the channel.
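In pika terms, the two calls look roughly like this (a fragment, not a complete consumer; the callback name is illustrative):

```python
def on_message(ch, method, properties, body):
    # 1) put this one message back on the queue for another consumer
    ch.basic_reject(delivery_tag=method.delivery_tag, requeue=True)

# 2) or ask the broker to redeliver every unacked message on this channel:
# channel.basic_recover(requeue=True)
```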
You need to subscribe to all the queues to which the messages have been routed and consume them with ack.
For instance, if you publish to a topic exchange with "test" as the routing key, and there are 3 persistent queues that subscribe to "test", you would need to consume all three. It might be better to add another queue that your consumer processes also listen to, and tell them to ignore those messages.
An alternative, since you are using RabbitMQ, is to write a custom exchange plugin that will accept some out-of-band instruction to clear all queues. For instance, you might have that exchange read a special message header that tells it to clear all queues to which the message is destined. This does require writing Erlang code, but there are 4 different exchange types already implemented, so you would only need to copy the most similar one and write the code for the new behaviours. If you only use custom headers for this, then the body of the message can be a normal message for the consumers.
To sum up:
1) the publisher needs to consume the messages itself
2) the publisher can send a special message in a special queue to tell consumers to ignore the message
3) the publisher can send a special message to a custom exchange that will clear any existing messages from the queues before sending this special message to consumers.

Message Queues Vs DB Table Queue via CRON

We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output, etc.; the sort of stuff we'd normally put into a table called "email_queue" and process with a cron-driven script.
I have been reading a lot about message queue systems like beanstalkd, and have even set one up. It was easy and nice to use; the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and a cron? I really can't see what they are.
Thanks
Differences:
Once a message is put on the queue, it can be delivered immediately. So if your cron normally ran every 5 minutes, you could process faster with queuing.
If your queueing system supports transactions, it will automatically re-deliver a message if processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (SQL).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is delivered to only one of them. With a DB table you need to handle this in application code (locking, flags, etc.); see the sketch below.
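For example, the locking that the DB-table approach leaves to you might look like this, assuming PostgreSQL via psycopg2 (the table and column names are illustrative, borrowing "email_queue" from the question):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")

def claim_next_job():
    """Atomically claim one pending row so concurrent workers don't collide."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE email_queue SET status = 'processing'
            WHERE id = (
                SELECT id FROM email_queue
                WHERE status = 'pending'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED  -- concurrent workers skip claimed rows
            )
            RETURNING id, payload
            """
        )
        return cur.fetchone()  # None when the queue is empty
```

A message broker gives you this claim-one-consumer behaviour for free; with a table you have to write and test it yourself.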
A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It ultimately comes down to your requirements, I guess. You can achieve a more manageable solution at scale by using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/DB table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.
First, queues are often backed by actual DB tables and can maintain message durability. That aside, the queue is a natural way to shove off work that needs to be done asynchronously, which is very powerful if you design on that principle from the start.
Other than the fact that a table (entity) has a set of hard columns (attributes), a table composed of a set of records and a queue are both nothing more than lists of stuff. You are employing the table as a formal queue, except that you are polling it on a regular (cron) basis.
MQs add another nifty feature, though: they generally synchronize access to the message itself (you may or may not be doing this in your SQL to grab the next item).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
The benefit of a queue, in my opinion, is that it takes care of the synchronizing and status updating. MQs can be set up to "broadcast" (topic) or make the message available to a group of consumers or listeners.
An MQ, being asynchronous and event-driven, processes work continuously rather than waiting for a cron window. How do you know that the messages in your table can all be processed before the next cron job runs and tries to step on the previous job?
Multiple consumers for the MQ allow you to scale the work as you see fit. In the example above, if you saw that your load average (much like the OS's process queue) is higher than you'd like, you can provision another consumer to handle that load, bringing consumers online and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside is (as already mentioned) that the queue can sometimes be hard to query and to obtain metrics from. I always favor MQ systems that have a DB backing store, so that I can watch the queue myself with SQL.
This gets asked fairly frequently, and there's usually not a compelling reason to go with an MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).

A backup persistence store. What are my options?

I have a service that accepts callbacks from a provider.
Motivation: I do not want to EVER lose any callbacks (unless, of course, my network becomes unreachable).
Let's suppose the impossible happens and my MySQL server becomes unreachable for some time.
I want to fall back to a secondary persistence store once I've retried several times and failed.
What are my options? Queues, an in-memory cache?
You say you're receiving "callbacks", but you've not made clear what they are. What is the protocol? Is it over a network?
If it were HTTP, then I would say the best way is this: if your application is unable to write the data into permanent storage, it should return an error to the caller ("try again later", if that exists in the protocol), who should retry later.
An asynchronous process like a callback should always be able to cope with failures downstream and queue its requests.
I've worked with a payment provider (PayPal) where this was the case. If you're unable to completely process the request, just send an error back to the caller.
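A minimal sketch of that idea, using Flask (the endpoint, helper, and exception names are assumptions):

```python
from flask import Flask, request

app = Flask(__name__)

class DatabaseUnavailable(Exception):
    """Hypothetical: raised when the permanent store can't be reached."""

def store_event(payload):
    ...  # hypothetical: INSERT into MySQL, raising DatabaseUnavailable on failure

@app.route("/callback", methods=["POST"])
def callback():
    try:
        store_event(request.get_json())
    except DatabaseUnavailable:
        # refuse the callback and let the provider retry later
        return "try again later", 503, {"Retry-After": "60"}
    return "", 204
```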
I recommend some sort of job queue server. I personally use Starling and have had great results with it. It speaks the memcache protocol so it is easy to use as a persistent queue.
Starling on Github
I've put a queue in SQLite for this before. Though, in my case, it was to protect against loss of the network link to the MySQL server; the data was locally generated.
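A sketch of that fallback, with the schema and the MySQL writer (write_to_mysql) as assumptions:

```python
import json
import sqlite3

local = sqlite3.connect("fallback.db")
local.execute(
    "CREATE TABLE IF NOT EXISTS backlog (id INTEGER PRIMARY KEY, payload TEXT)"
)

def save(payload):
    try:
        write_to_mysql(payload)  # hypothetical primary store
    except Exception:
        # primary store unreachable: park the callback locally for replay
        local.execute(
            "INSERT INTO backlog (payload) VALUES (?)", (json.dumps(payload),)
        )
        local.commit()

def replay():
    """Push parked callbacks to MySQL once it is reachable again."""
    rows = local.execute("SELECT id, payload FROM backlog").fetchall()
    for row_id, payload in rows:
        write_to_mysql(json.loads(payload))
        local.execute("DELETE FROM backlog WHERE id = ?", (row_id,))
    local.commit()
```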
You can have a backup MySQL server and switch your connection to it in case the primary one breaks down. If it's going to be only a fail-over store, you could probably run it locally on the application server.