Message Queues Vs DB Table Queue via CRON - message-queue

We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output etc., the sort of stuff we'd normally put into a table called "email_queue" and use a cron job to run a script that processes the queue in the table.
I have been reading a lot on Message Queue systems like beanstalkd, and have even set it up. It was easy and nice to use, the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and a CRON? I really can't seem to see what they are.
Thanks

Differences:
Once a message is put on the queue it can be immediately delivered. So if your cron normally ran every 5 minutes, you could process faster with the queuing.
If your queueing system supports transactions, then it will automatically re-deliver a message if the processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (sql).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is only delivered to one of them. With a DB table you need to handle this via application code (locking, flags, etc ...)
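To make the last point concrete, here is a minimal sketch of what "handle this via application code" can look like for a table-based queue. It uses Python's built-in sqlite3 so it runs anywhere; the `email_queue` columns, the `claim_token` column and the function names are assumptions for illustration, not something from the question. On a server database such as MySQL you would more likely rely on row locking, but the idea is the same: the claim must be a single atomic statement.

```python
import sqlite3
import uuid


def make_queue(conn):
    # Hypothetical schema: one row per message, with a status flag.
    conn.execute(
        "CREATE TABLE email_queue ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " claimed_by TEXT,"
        " claim_token TEXT)"
    )


def claim_next(conn, worker_id):
    """Atomically claim the oldest pending row; return (id, payload) or None."""
    token = uuid.uuid4().hex  # unique per claim, so we can find "our" row again
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE email_queue "
            "SET status = 'processing', claimed_by = ?, claim_token = ? "
            "WHERE id = (SELECT id FROM email_queue "
            "            WHERE status = 'pending' ORDER BY id LIMIT 1)",
            (worker_id, token),
        )
        if cur.rowcount == 0:
            return None  # nothing pending
        return conn.execute(
            "SELECT id, payload FROM email_queue WHERE claim_token = ?",
            (token,),
        ).fetchone()
```

Each worker calls `claim_next` in its loop; because the status change happens in one UPDATE, two workers cannot both walk away with the same row. A message queue gives you this behaviour without writing it yourself.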

A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It gets down ultimately to your requirements I guess. You can achieve a more manageable solution at scale with using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/db table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.

First, queues are often backed by actual DB tables and can maintain message durability. That aside, the queue is a natural way to shove off work that needs to be done asynchronously, which, if you design on that principle from the start, is very powerful.
Other than the fact that a table (entity) has a fixed set of columns (attributes), a table composed of records and a queue are both nothing more than lists of stuff. You are already employing the table as a formal queue; you are just polling it on a regular (cron) basis.
MQs add another nifty feature though of generally synchronizing access to the message itself (you may or may not be doing this in your SQL to get the next thing).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
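A small sketch of that distinction, using only Python's standard library as a stand-in for a real broker (the function names and the upper-casing "work" are made up for illustration). The event-based worker blocks until a message arrives, while the poll-based function does what a cron run does: it drains whatever happens to be pending at the moment it wakes up.

```python
import queue
import threading


def event_worker(q, results):
    """EVENT-based: block until a message arrives; no polling interval."""
    while True:
        msg = q.get()      # sleeps until a producer calls put()
        if msg is None:    # sentinel value: stop the worker
            break
        results.append(msg.upper())


def poll_once(table, results):
    """POLL-based: one cron run drains whatever is pending right now."""
    while table:
        results.append(table.pop(0).upper())


# The message is handled as soon as it is put on the queue.
q = queue.Queue()
event_results = []
worker = threading.Thread(target=event_worker, args=(q, event_results))
worker.start()
q.put("resize image")
q.put(None)
worker.join()
```

With the polling version, anything added just after a run waits until the next run, which is where the latency of the cron interval comes from.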
Benefit of a queue in my opinion is that it takes care of the sync'ing, status updating. MQs can be set up to "broadcast" (topic) or make available the message to a group of consumers or listeners.
MQs, though asynchronous, would likely operate between your cron windows. How do you know that all the messages in your table can be processed before the next cron job runs and tries to step on the previous job?
Multiple consumers for the MQ allows you to scale the work as you see fit. In the example above if you saw that your load average (just the same in the OS' process queue) is greater than you like, you can provision another consumer to handle said load, bringing it on and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside (as already mentioned) is that the queue can sometimes be hard to query and to obtain metrics from. I always try to find MQ systems that have a DB backing store so that I can watch the queue myself with SQL.

This gets asked fairly frequently, and there's usually not a compelling reason to go MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).

Related

Couchbase: Is it possible to only invoke "SELECT" queries on the "master" node?

I am having several race conditions in my app where I was able to "SELECT" a document that was previously deleted by another thread 1-2 secs ago. I added ScanConsistency.REQUEST_PLUS to my "SELECTs" but it takes too long...
I am planning to add PersistTo.ONE param to the "DELETEs" however, I am not sure if the succeeding "SELECT" will still see the deleted document or not because I think that it might invoke "SELECT" on one of the non-master nodes which still has the deleted document in-memory.
Will it be possible to "SELECT" only on the master node?
I could use PersistTo.FOUR but I think that would also affect performance greatly.
From what I could tell from reading the documentation on this feature (this is a fairly recent addition to Couchbase), the fact that the edit takes place on another thread is significant. Each thread is going to have to have its own session with the database, and the consistency level could in theory be pulled from that thread, but you would need to have that thread communicate with your thread (probably a non-starter).
Therefore, going back to basics, it's important to realize that the database itself is an eventually-consistent data store. This means that, given the CAP-theorem, the data store sacrifices consistency for availability and partition tolerance. This is true in all cases, but the N1QL mechanism attempts to compensate a little bit for "your own writes." Being a software architect myself, I would not depend upon this except if I needed it as a temporary workaround, but rather keep the prevailing design principles in mind when designing the application data store.
Bottom line is that I believe this behavior is expected, and your design should be tolerant of it. If your application requires immediate consistency across sessions, then you should use a different data store.

What's the most efficient architecture for this system? (push or pull)

All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySql.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
Option 1 seems inefficient. Option 2 seems better. Option 3 reduces TCP traffic, but not by much, since each poll in option 2 is only a few bytes. However, option 3 has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
I guess there are two things to consider. Most importantly, reduce network traffic and, less important, make my code simpler.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled & can't find an answer. I am hoping for the benefit of your experience. Can you help me reword this to be acceptable? Or maybe give a URL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenarios, clients can also use the middleware message broker to send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
"Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?"
If this is a real requirement (reports are usually an immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while being watched, similar to a stock ticker), it is easy to implement: a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexibly and resource-efficiently with existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions is viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient.
It is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to document a tech-debt-inducing fix when it is checked in, and/or to share it with the team members who could be affected by the change.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (The client provides the time, so the service does not have to store and retrieve session state.)
If nothing has changed, the server can provide a time when the client should poll again. A system under excessive load can then back off clients: if the server application is aware of such conditions, it can control the polling rate of compliant clients by instructing them to wait longer before retrying.
After considering that, ask "Is the API as simple as it can possibly be?"
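As a rough sketch of such a pull API (in Python rather than Delphi, and with every name here being hypothetical): the client passes in the last position it has seen, and the server answers with the newer changes plus a hint for when to poll again. Note one deliberate substitution: the sketch uses a monotonically increasing version number instead of a timestamp, which sidesteps clock differences between client and server; a timestamp would work the same way in principle.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeFeed:
    """Hypothetical server side: an append-only list of (version, record)."""
    changes: list = field(default_factory=list)
    busy: bool = False  # set by the server when it is under heavy load

    def add(self, record):
        self.changes.append((len(self.changes) + 1, record))

    def changes_since(self, since_version):
        """Return everything newer than since_version plus a polling hint."""
        new = [c for c in self.changes if c[0] > since_version]
        latest = self.changes[-1][0] if self.changes else since_version
        # Back clients off when busy; the 5 and 60 second values are arbitrary.
        retry_after = 60 if self.busy else 5
        return {"changes": new, "version": latest, "retry_after": retry_after}
```

The client stores `version` from each response and sends it back on the next call, so the server keeps no per-client state.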
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about: publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push: timeout scenarios, number of clients, system resource consumption, etc.
Consider that the "pusher" has to become aware of all consuming applications. If you use an industry-standard message queueing system (RabbitMQ, MS MQ, MQ Series, etc., all naturally supporting publish/subscribe, JMS topics or equivalent), this problem is abstracted away, at the cost of some added complexity in your application.
Consider the scenarios where clients suddenly become unavailable; hypothesize failure modes and test the robustness of your system so you have confidence that it can recover properly from failure and remain stable.
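To illustrate the push model and one of its failure modes, here is a tiny in-process publish/subscribe sketch in Python. The `Broker` class is purely illustrative and is not the API of RabbitMQ or any other real broker; it only shows that the publisher talks to a topic rather than to individual clients, and that a subscriber which fails has to be dealt with somewhere.

```python
from collections import defaultdict


class Broker:
    """Illustrative in-process broker: topics map to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver to every subscriber; return how many deliveries succeeded."""
        delivered = 0
        for callback in list(self._subscribers[topic]):
            try:
                callback(message)
                delivered += 1
            except Exception:
                # The client went away or failed: drop it so it cannot
                # block or slow down the remaining subscribers.
                self._subscribers[topic].remove(callback)
        return delivered
```

A real broker makes a more careful choice than silently dropping the subscriber (retries, dead-letter queues, durable subscriptions), which is part of the complexity you take on.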
So, what do you think the right approach is now?

Scaling up a ruby, activerecord, mysql app

I have an app...
The app does a market comparison for a financial product - for a given quote request, it contacts several other sites for their quotes. It then gives the user the results - several quotes for their details.
To manage these requests they get saved to MySQL and then my app kicks in, picking up the pending quotes and farming these out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby as I had thread/db related issues. Using Java threadpools to control the number of threads. With the current hardware/VPS - it can handle around 200 threads. A lot of the limitations seem to relate to each thread grabbing their own MySQL connection - grabbing the quote details and saving back the results. We want to handle more concurrent threads and so looking for ways to scale up.
Wondering which way to go ...
Bigger hardware...
More machines and use some kind of queueing mechanism (with priorities) to share the load across the machines - so the threads don't touch the db, all the details/responses go via the queue - so the DB hit is less, but then maybe I am just pushing the problem into the queue. Thinking of using something like MongoDB for the queue, but open to suggestions - something easy to use with Ruby :)
Some kind of remote/RPC mechanism, e.g. dRb - theoretically this seems like a good option, but not done anything with this yet to know how complex it will make things.
Something else...?
From this link Reasons for NOT scaling-up vs. -out? - it would seem this problem is suited to running more machines to solve it.
So, any thoughts on which way to go...
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you're making and tune them aggressively. Retrieve only what you need, skipping columns that aren't explicitly used, and be very careful about eager loading things you don't need in their entirety.
You'll often find you can get significant speed gains by adding indexes, or strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: the fastest database call is the one that's never made. It's not hard to use something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it's even easy to invalidate and expire these entries, provided you channel your updates through a few methods.
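The "channel your updates through a few methods" idea can be sketched like this. The example is in Python rather than Ruby, and a plain dict stands in for Memcached so that it runs on its own; `QuoteStore`, `load_from_db` and `save_to_db` are hypothetical names. The point is only the shape: reads go through the cache, and the single update method is the one place that invalidates it.

```python
class QuoteStore:
    """Cache-aside over a slow lookup; a dict stands in for Memcached."""

    def __init__(self, load_from_db, save_to_db):
        self._load = load_from_db   # slow read, e.g. a SQL query
        self._save = save_to_db     # write to the database
        self._cache = {}

    def get(self, key):
        # Only hit the database on a cache miss.
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def update(self, key, value):
        # All writes go through here, so invalidation cannot be forgotten.
        self._save(key, value)
        self._cache.pop(key, None)
```

If some other code path writes to the database directly, the cache goes stale, which is why keeping the number of write methods small matters.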
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
A cache like Memcached can handle an extremely high amount of traffic, so whenever possible, cache against this to avoid hitting your database for every last thing.
If you've exhausted these options, it's time for more front-end servers and even more database capacity, but only then.
Queueing is the easiest thing for you to implement. Use something like this: http://beanstalkd.github.com/beaneater/
Basically you can prefix your method calls with async., which puts them into the queue for execution. The queue and workers can be on the same server or on different ones.

Using MySQL as a job queue

I'd like to use MySQL as a job queue. Multiple machines will be producing and consuming jobs. Jobs need to be scheduled; some may run every hour, some every day, etc.
It seems fairly straightforward: for each job, have a "nextFireTime" column, and have worker machines search for the job with the nextFireTime, change the status of the record to "inProcess", and then update the nextFireTime when the job ends.
The problem comes in when a worker dies silently. It won't be able to update the nextFireTime or set the status back to "idle".
Unfortunately, jobs can be long-running, so a reaper thread that looks for jobs that have been inProcess too long isn't an option. There's no timeout value that would work.
Can anyone suggest a design pattern that would properly handle unreliable worker machines?
Maybe like this
When a worker fetches a job, it can add its process id or another unique id to a field in the job.
Then, in another table, every worker keeps updating a value showing that it is alive. When updating the "I'm alive" field, you check the last time every other worker showed a sign of life. If one worker is over a limit, find all the jobs it is working on and reset them.
So in other words the watchdog works on the worker-processes and not the jobs themselves.
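A minimal sketch of that watchdog, using Python's sqlite3 so it is self-contained. The table and column names, the `heartbeat` function and the idea of passing `now` in explicitly are all assumptions made for illustration; with MySQL the SQL would differ slightly (for example `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`).

```python
import sqlite3


def create_tables(conn):
    conn.executescript("""
        CREATE TABLE workers (
            worker_id TEXT PRIMARY KEY,
            last_seen REAL NOT NULL
        );
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'idle',
            worker_id TEXT
        );
    """)


def heartbeat(conn, worker_id, now, timeout):
    """Record that worker_id is alive, then reset jobs of silent workers.

    Returns the number of jobs that were handed back to the queue.
    """
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO workers (worker_id, last_seen) VALUES (?, ?)",
            (worker_id, now),
        )
        cur = conn.execute(
            "UPDATE jobs SET status = 'idle', worker_id = NULL "
            "WHERE status = 'inProcess' AND worker_id IN "
            "  (SELECT worker_id FROM workers WHERE last_seen < ?)",
            (now - timeout,),
        )
        return cur.rowcount
```

The timeout here applies to the heartbeat interval, not to the job duration, so long-running jobs are fine as long as the worker process keeps reporting in.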
Using MySQL as a job queue generally ends in pain, as it's a very poor fit for the usual goals of an RDBMS. User 'toong' already linked to https://www.engineyard.com/blog/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you, which has a lot of interesting stuff to say about it. Unreliable workers are only one of the complications.
There are many, many systems for handling job distribution, mostly distinguished by the sophistication of their queueing and scheduling capabilities. On the simple FIFO end are things like Resque, Celery, Beanstalkd, and Gearman; on the sophisticated end are things like GridEngine, Torque/Maui, and PBS Pro. I highly recommend the new Amazon Simple Workflow system, if you can tolerate reliance on an Amazon service (I believe it does not require that you be in EC2).
To your original question: right now we're implementing a per-node supervisor that can tell if the node's jobs are still active, and sending a heartbeat back to a job monitor if so. It's a pain, but as you are discovering and will continue to discover, there are a lot of details and error cases to manage. Mostly, though, I have to encourage you to do yourself a favor by learning about this domain and building the system properly from the start.
One option is to make sure that jobs are idempotent, and allow more than one worker to start a given job. It doesn't matter which worker completes the job, or if more than one worker completes the job, since the jobs are designed in such a way that multiple completions are handled gracefully. Perhaps workers race to supply the result, and the losers find that the slot that will hold the result is already full, so they just drop theirs.
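The "slot is already full" race can be expressed as a single conditional UPDATE. This is a sketch with Python's sqlite3 and a hypothetical `jobs` table with a nullable `result` column; the same statement shape should carry over to MySQL.

```python
import sqlite3


def create_jobs_table(conn):
    # Hypothetical schema: result stays NULL until some worker finishes.
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT)")


def submit_result(conn, job_id, result):
    """Store the result only if the slot is still empty.

    Returns True if this worker won the race, False if it should drop
    its result because another worker got there first.
    """
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET result = ? WHERE id = ? AND result IS NULL",
            (result, job_id),
        )
        return cur.rowcount == 1
```

Because the check and the write happen in one statement, a worker that was presumed dead and later wakes up cannot overwrite a result that a replacement worker has already stored.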
Another option is to not have big jobs. Break long-running jobs into intermediate steps: if the job takes longer than (say) 1 minute, store the intermediate results as a new job (with a link to the old job in some way), so that the new job can be queued again to do another minute of work.
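A sketch of that checkpointing idea in Python. To keep it deterministic, the "one minute of work" is replaced by a fixed number of items per step (`budget`), and the job is just summing a list; both are stand-ins chosen for illustration, as are the function names and the dict layout of a job.

```python
from collections import deque


def run_step(job, budget):
    """Do at most `budget` items of work, then checkpoint.

    Returns either a follow-up job carrying the intermediate result,
    or a record marking the work as done.
    """
    items, start = job["items"], job["offset"]
    end = min(start + budget, len(items))
    total = job["total"] + sum(items[start:end])
    if end < len(items):
        # Not finished: the intermediate state becomes a new queued job.
        return {"items": items, "offset": end, "total": total}
    return {"done": True, "total": total}


def run_to_completion(items, budget):
    """Drive the steps through a queue; return (result, number_of_steps)."""
    pending = deque([{"items": items, "offset": 0, "total": 0}])
    steps = 0
    while pending:
        outcome = run_step(pending.popleft(), budget)
        steps += 1
        if outcome.get("done"):
            return outcome["total"], steps
        pending.append(outcome)
```

If a worker dies mid-step, only that step's work is lost, so a short timeout on individual steps becomes workable even when the overall job runs for hours.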

Database strategy for synchronization based on changes

I have a Spring+Hibernate+MySQL backend that exposes my model (8 different entities) to a desktop client. To keep synchronized, I want the client to regularly ask the server for recent changes. The process may be as follows:
Point A: The client connects for the first time and retrieves all the model from the server.
Point B: The client asks the server for all changes since Point A.
Point C: The client asks the server for all changes since Point B.
To retrieve the changes (point B&C) I could create a HQL query that returns all rows in all my tables that have been modified since my previous retrieval. However, I'm afraid this can be a heavy query and degrade my performance if executed often.
For this reason I was considering other alternatives, such as keeping a separate table with recent updates for fast access. I have looked into using the L2 query cache but it doesn't seem to serve my purpose.
Does someone know a good strategy for my purpose? My initial thought is to keep control of synchronization and avoid using "automatic" synchronization tools.
Many thanks
You can store changes in a queue table. Triggers can populate the queue on insert, update and delete. This preserves the order of the changes (e.g. insert, update, update, delete). Empty the queue after download.
Emptying the queue would cause issues if you have multiple clients... you may need to think about a design to handle that case.
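One possible design for the multi-client case, sketched with Python's sqlite3 so it is runnable on its own: the triggers append to a change log with an increasing sequence number, and each client remembers the last sequence number it downloaded instead of the server emptying the queue. The `item` and `change_log` tables and the function names are hypothetical, and MySQL trigger syntax differs somewhat from what is shown here.

```python
import sqlite3


def setup_change_log(conn):
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE change_log (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            op TEXT NOT NULL,
            item_id INTEGER NOT NULL
        );
        CREATE TRIGGER item_ins AFTER INSERT ON item BEGIN
            INSERT INTO change_log (op, item_id) VALUES ('insert', NEW.id);
        END;
        CREATE TRIGGER item_upd AFTER UPDATE ON item BEGIN
            INSERT INTO change_log (op, item_id) VALUES ('update', NEW.id);
        END;
        CREATE TRIGGER item_del AFTER DELETE ON item BEGIN
            INSERT INTO change_log (op, item_id) VALUES ('delete', OLD.id);
        END;
    """)


def changes_since(conn, last_seq):
    """Return (seq, op, item_id) rows newer than the client's last_seq."""
    return conn.execute(
        "SELECT seq, op, item_id FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
```

A client starts with `last_seq = 0` and afterwards passes the highest `seq` it received. Old log rows can be pruned separately once every known client has moved past them, which is its own trade-off.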
There are several designs you can go with, all with trade-offs. I have used the queue design before, but it was only copying data to a single destination, not multiple.