Using MySQL as a job queue - mysql

I'd like to use MySQL as a job queue. Multiple machines will be producing and consuming jobs. Jobs need to be scheduled; some may run every hour, some every day, etc.
It seems fairly straightforward: for each job, have a "nextFireTime" column, and have worker machines search for a job whose nextFireTime is due, change the status of the record to "inProcess", and then update the nextFireTime when the job ends.
The problem comes in when a worker dies silently. It won't be able to update the nextFireTime or set the status back to "idle".
Unfortunately, jobs can be long-running, so a reaper thread that looks for jobs that have been inProcess too long isn't an option. There's no timeout value that would work.
Can anyone suggest a design pattern that would properly handle unreliable worker machines?

Maybe like this
When a worker fetches a job, it can add its process ID or another unique ID to a field in the job.
Then in another table every worker keeps updating a value that they are alive. When updating the "i'm alive" field you check all other "last time worker showed sign of life". If one worker is over a limit, find all the jobs it is working on and reset them.
So in other words the watchdog works on the worker-processes and not the jobs themselves.
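A minimal sketch of that worker-level watchdog, assuming a `jobs` table with a `worker_id` column and a separate `workers` heartbeat table. The table names, column names, and function names are all made up for illustration, and `sqlite3` stands in for MySQL only so the example is self-contained:

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: jobs carry the id of the worker that claimed them,
    # and every worker has one heartbeat row.
    conn.executescript("""
        CREATE TABLE jobs (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'idle',
            worker_id TEXT
        );
        CREATE TABLE workers (
            worker_id TEXT PRIMARY KEY,
            last_seen REAL NOT NULL
        );
    """)


def heartbeat(conn, worker_id, now):
    # "I'm alive": insert or overwrite this worker's last_seen timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO workers (worker_id, last_seen) VALUES (?, ?)",
        (worker_id, now),
    )
    conn.commit()


def reap_dead_workers(conn, now, limit_seconds):
    # Find workers that have been silent for longer than the limit and
    # reset every job they were still holding.
    cutoff = now - limit_seconds
    dead = [row[0] for row in conn.execute(
        "SELECT worker_id FROM workers WHERE last_seen < ? ORDER BY worker_id",
        (cutoff,),
    )]
    for worker_id in dead:
        conn.execute(
            "UPDATE jobs SET status = 'idle', worker_id = NULL "
            "WHERE worker_id = ? AND status = 'inProcess'",
            (worker_id,),
        )
        conn.execute("DELETE FROM workers WHERE worker_id = ?", (worker_id,))
    conn.commit()
    return dead
```

Each worker would call `heartbeat` on a timer and then `reap_dead_workers`. The limit only has to exceed the heartbeat interval, so it is independent of how long the jobs themselves run.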

Using MySQL as a job queue generally ends in pain, as it's a very poor fit for the usual goals of an RDBMS. User 'toong' already linked to https://www.engineyard.com/blog/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you, which has a lot of interesting stuff to say about it. Unreliable workers are only one of the complications.
There are many, many systems for handling job distribution, mostly distinguished by the sophistication of their queueing and scheduling capabilities. On the simple FIFO end are things like Resque, Celery, Beanstalkd, and Gearman; on the sophisticated end are things like GridEngine, Torque/Maui, and PBS Pro. I highly recommend the new Amazon Simple Workflow system, if you can tolerate reliance on an Amazon service (I believe it does not require that you be in EC2).
To your original question: right now we're implementing a per-node supervisor that can tell if the node's jobs are still active, and sending a heartbeat back to a job monitor if so. It's a pain, but as you are discovering and will continue to discover, there are a lot of details and error cases to manage. Mostly, though, I have to encourage you to do yourself a favor by learning about this domain and build the system properly from the start.

One option is to make sure that jobs are idempotent, and allow more than one worker to start a given job. It doesn't matter which worker completes the job, or if more than one worker completes it, since the jobs are designed so that multiple completions are handled gracefully. Perhaps workers race to supply the result, and the losers find that the slot that will hold the result is already full, so they just drop theirs.
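As a rough illustration of the "losers find the slot already full" idea, here is a sketch that uses a conditional UPDATE so only the first result sticks. The `results` table and `submit_result` function are hypothetical names, and `sqlite3` is used in place of MySQL purely to keep the example runnable:

```python
import sqlite3


def create_result_slot(conn, job_id):
    # Hypothetical table: one empty result slot per job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (job_id INTEGER PRIMARY KEY, value TEXT)"
    )
    conn.execute("INSERT INTO results (job_id, value) VALUES (?, NULL)", (job_id,))
    conn.commit()


def submit_result(conn, job_id, value):
    # Only fill the slot if it is still empty; a worker that loses the race
    # changes zero rows and simply drops its result.
    cur = conn.execute(
        "UPDATE results SET value = ? WHERE job_id = ? AND value IS NULL",
        (value, job_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

The first caller gets `True`; any later caller for the same job gets `False`, and the stored value is untouched.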
Another option is to not have big jobs. Break long-running jobs into intermediate steps: if the job takes longer than (say) 1 minute, store the intermediate results as a new job (linked to the old job in some way), so that the new job can be queued again to do another minute of work.

Related

how to implement saved-searches scenario

what is saved-search?
A saved search is a mechanism for users who don't find what they want in advanced search: they push the "Save My Search Criteria" button, we save the search criteria, and when matching data is posted to the website we inform the user: "hey user, the item(s) you were looking for exist now, come and visit".
Saved Searches is useful for sites with complex search options, or sites where users may want to revisit or share dynamic sets of search results.
We already have advanced search and don't need to implement a new search; what we need is a well-performing way to implement the saved-search mechanism.
We have a website where users publish about 120,000 posts per day, and we are going to implement a saved-search scenario (something like what https://www.gumtree.com/ does): users run an advanced search, don't find the content they want, and save the search criteria; if any results appear on the website later, we send them a notification.
We are using Elasticsearch and MySQL on our website. We haven't implemented anything yet; we are still looking for a good solution that can handle a high rate of data. The problem is the scale of the work: we have a lot of posts per day and we expect users to use this feature a lot, so we are looking for an approach that can handle this scale easily and with high performance.
Suggested solutions (but not the best):
One quick solution: save the saved searches in a saved-search index in Elasticsearch, then run a cron job that, for every saved search, fetches results from the posts index and, if there is any result, pushes a record into RabbitMQ to notify the corresponding user.
Alternatively, when a user posts an item to the website, we check it against the existing saved searches in the saved-search index and, if it matches, put a record into RabbitMQ. (The main problem with this method is that every post inserted into the website could match a huge number of saved searches.)
My big concern is scale and performance; I'd appreciate it if you shared your experiences and ideas about this problem.
My estimation about the scale
Saved searches expire after three months
at least 200,000 saved searches per day
So we have 9,000,000 active records
I'd appreciate it if you shared your thoughts with me.
Just FYI:
- we also have RabbitMQ for our queue jobs
- our ES servers are good enough with 64GB RAM
Cron job - No. Continual job - yes.
Why? As things scale, or as activity spikes, cron jobs become problematical. If the cron job for 09:00 runs too long, it will compete for resources with the 10:00 instance; this can cascade into a disaster.
On the other hand, if a cron job finishes 'early', then the activity oscillates between "busy" (the cron job is doing stuff) and "not busy" (cron has finished, and it is not yet time for the next invocation).
So, instead, I suggest a job that continually runs through all the "stored queries", doing them one at a time. When it finishes the list, it simply starts over. This completely eliminates my complaints about cron, and provides an automatic "elasticity" to handle busy/not-busy times -- the scan will slow down or speed up accordingly.
That is, the job runs 'forever'. (You could use a simple cron job as a 'keep-alive' monitor that restarts it if it crashes.)
OK, "one job" re-searching "one at a time" is probably not best. But I disagree with using a queuing mechanism. Instead, I would have a small number of processes, each acting on some chunk of the stored queries. There are many ways: grab-and-lock; gimme a hundred to work on; modulo N; etc. Each has pros and cons.
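To make the "modulo N" option concrete, here is a small sketch: each of N processes takes the stored queries whose id falls in its residue class and loops over them continually. The function names and the `run_query` callback are assumptions for illustration, and the loop is bounded by `passes` only so the example terminates (a real worker would loop forever):

```python
def my_share(query_ids, worker_index, worker_count):
    # Worker k of N handles every stored query whose id % N == k.
    return [q for q in query_ids if q % worker_count == worker_index]


def run_passes(query_ids, worker_index, worker_count, run_query, passes):
    # Continual scan: finish the list, then start over.
    # `passes` is a stand-in for "forever" so the sketch terminates.
    executed = 0
    for _ in range(passes):
        for query_id in my_share(query_ids, worker_index, worker_count):
            run_query(query_id)
            executed += 1
    return executed
```

Modulo needs no locking or coordination table, at the cost of uneven load if some residue classes happen to hold the expensive queries; grab-and-lock balances better but needs extra bookkeeping.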
Because you are already using Elasticsearch and you have confirmed that you are creating something like Google Alerts, the most straightforward solution would be Elasticsearch Percolator.
From the official documentation, Percolator is useful when:
You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
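For orientation, here is roughly what the request bodies could look like, written as Python dicts rather than calls to any particular client library. The field names (`query`, `title`, `price`) and the example criteria are made up; the mapping is shown in the typeless form used by recent Elasticsearch versions, and older versions nest it under a type name, so check the documentation for your version:

```python
import json

# Index that stores saved searches: one field of type "percolator" holds the
# query, and the other fields mirror the fields of the posts being matched.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "title": {"type": "text"},
            "price": {"type": "integer"},
        }
    }
}

# One saved search, indexed as an ordinary document whose "query" field
# contains the user's search criteria.
SAVED_SEARCH = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "bicycle"}}],
            "filter": [{"range": {"price": {"lte": 200}}}],
        }
    }
}


def percolate_request(post):
    # Search body that asks: which stored queries match this new post?
    return {"query": {"percolate": {"field": "query", "document": post}}}


if __name__ == "__main__":
    print(json.dumps(percolate_request({"title": "red bicycle", "price": 150}), indent=2))
```

Each hit returned for the percolate request is a saved search that matches the new post, which is the reverse of the cron approach of re-running every saved search.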
I can't say much when it comes to performance, because you did not provide any example of your queries but mostly because my findings are inconsistent.
According to this post (https://www.elastic.co/blog/elasticsearch-queries-or-term-queries-are-really-fast), Elasticsearch queries should be capable of reaching 30,000 queries/second.
However, this unanswered question (Elasticsearch percolate performance) reported a painfully slow 200 queries/second on a 16 CPU server.
With no additional information I can only guess that the cause is configuration problems, so I think you'll have to try a bunch of different configurations to get the best possible performance.
Good luck!
This answer was written without a true understanding of the implications of a "saved search". I leave it here as discussion of a related problem, but not as a "saved search" solution. -- Rick James
If you are saving only the "query", I don't see a problem. I will assume you are saving both the query and the "resultset"...
One "saved search" per second? 2.4M rows? Simply rerun the search when needed. The system should be able to handle that small a load.
Since the data is changing, the resultset will become outdated soon? How soon? That is, saving the resultset needs to be purged rather quickly. Surely the data is not so static that you can wait a month. Maybe an hour?
Actually saving the resultset and being able to replay it involves (1) complexity in your code, (2) overhead in caching, I/O, etc, etc.
What is the average number of times that the user will look at the same search? Because of the overhead I just mentioned, I suspect the average number of times needs to be more than 2 to justify the overhead.
Bottomline... This smells like "premature optimization". I recommend
Build the site without saving resultsets.
Stress test it to see when it will break.
Work on optimizing the slow parts.
As for RabbitMQ -- "Don't queue it, just do it". The cost of queuing and dequeuing is (1) increased latency for the user and (2) increased overhead on system. The benefit (at your medium scale) is minimal.
If you do hit scaling problems, consider
Move clients off to another server -- away from the database. This will give you some scaling, but not 2x. To go farther...
Use replication: One Master + many readonly Slaves -- and do the queries on the Slaves. This gives you virtually unlimited scaling in the database.
Have multiple web servers -- virtually unlimited scaling in this part.
I don't understand why you want to use saved searches... First: you should optimize the service, so that saved searches are needed as little as possible.
Have you done anything with the ES server (within what you can afford)? So:
Have you optimized the Elasticsearch server? By default, it uses 1GB of RAM. The best solution is to give it half the machine's RAM, but no more than 16GB (if I remember correctly; check the docs).
How powerful is the ES machine? It prefers more cores over more MHz.
How many ES nodes do you have? You can always add another machine to get the results faster.
In my case (ES 2.4), the server slows down after a few days, so I restart it once a day.
And next:
Why do you want to fire up tasks every half hour? If you already use cron, fire them every minute, and flag that the query is running. The other post has a better solution and an explanation.
Why do you separate the result from the query?
Remember to normalize the query, so that changing the order of the parameters does not force a new query.
Why do you want to use MySQL to store results? A document-type database, like Elasticsearch, is a better fit xD.
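One possible way to do the query normalization mentioned above is to serialize the criteria with sorted keys and hash the result, so the same criteria always map to the same key regardless of parameter order. The function name and the shape of the criteria dict are assumptions for illustration:

```python
import hashlib
import json


def canonical_key(criteria):
    # Serialize with sorted keys and fixed separators so that parameter
    # order does not change the output, then hash it into a compact key.
    normalized = json.dumps(criteria, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two users who save the same criteria then share one key, so the search only has to be stored and evaluated once. Note that this sketch does not normalize values (for example letter case or list order); that would need rules specific to your search form.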
I propose:
Optimize ES structure - choose right tokenisers for fields.
Use asynchronous loading of results - eg WebSocket + Node.js

To fork or not to fork?

I am re-developing a system that will send messages via http to one of a number of suppliers. The original is perl scripts and it's likely that the re-development will also use perl.
In the old system, there were a number of perl scripts all running at the same time, five for each supplier. When a message was put into the database, a random thread number (1-5) and the supplier was chosen to ensure that no message was processed twice while avoiding having to lock the table/row. Additionally there was a "Fair Queue Position" field in the database to ensure that a large message send didn't delay small sends that happened while the large one was being sent.
At some times there would be just a couple of messages per minute, but at other times there would be a dump of potentially hundreds of thousands of messages. It seems to me like a resource waste to have all the scripts running and checking for messages all of the time so I am trying to work out if there is a better way to do it, or if the old way is acceptable.
My thoughts right now lie with the idea of having one script that runs and forks as many child processes as are needed (up to a limit) depending on how much traffic there is, but I am not sure how best to implement it such that each message is processed only once, while the fair queuing is maintained.
My best guess right now is that the parent script updates the DB to indicate which child process should handle it, however I am concerned that this will end up being less efficient than the original method. I have little experience of writing forking code (last time I did it was about 15 years ago).
Any thoughts or links to guides on how best to process message queues appreciated!
You could use Thread::Queue or any other from this: Is there a multiprocessing module for Perl?
Since the old system was written in Perl, this way you could reuse most of it.
Non working example:
use strict;
use warnings;
use threads;
use Thread::Queue;
my $q = Thread::Queue->new(); # A new empty queue
# Worker threads: start 10 of them and detach each one
for (1..10) {
    threads->create(sub {
        # dequeue() blocks until an item is available
        while (defined(my $item = $q->dequeue())) {
            # Do work on $item
        }
    })->detach();
}
my $dbh = ...
while (1) {
    # Get items from the DB
    my @items = get_items_from_db($dbh);
    # Send work to the threads
    $q->enqueue(@items);
    print "Pending items: " . $q->pending() . "\n";
    sleep 15; # Check the DB every 15 seconds
}
I would suggest using a message queue server like RabbitMQ.
One process feeds work into the queue, and you can have multiple worker processes consume the queue.
Advantages of this approach:
workers block when waiting for work (no busy waiting)
more worker processes can be started up manually if needed
worker processes don't have to be a child of a special parent process
RabbitMQ will distribute the work among all workers which are ready to accept work
RabbitMQ will put work back into the queue if the worker doesn't return an ACK
you don't have to assign work in the database
every "agent" (worker, producer, etc.) is an independent process which means you can kill it or restart it without affecting other processes
To dynamically scale the number of workers up or down, you can implement something like:
have workers automatically die if they don't get work for a specified amount of time
have another process monitor the length of the queue and spawn more workers if the queue is getting too big
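The two scaling rules above can be sketched in-process with Python's standard `queue` module standing in for the message broker. This is not RabbitMQ client code, and the function names and thresholds are assumptions for illustration:

```python
import queue
import threading


def worker(jobs, handle, idle_timeout):
    # Block waiting for work; exit once nothing arrives within idle_timeout.
    while True:
        try:
            item = jobs.get(timeout=idle_timeout)
        except queue.Empty:
            return
        handle(item)
        jobs.task_done()


def workers_needed(queue_length, per_worker, max_workers):
    # Monitor rule: one worker per `per_worker` queued items (rounded up),
    # capped at max_workers.
    return min(max_workers, -(-queue_length // per_worker))


if __name__ == "__main__":
    jobs = queue.Queue()
    for n in range(5):
        jobs.put(n)
    t = threading.Thread(target=worker, args=(jobs, print, 0.5))
    t.start()
    t.join()  # returns after the queue has been idle for 0.5 seconds
```

With a real broker the monitor would read the queue depth from the broker's own management interface and start or stop worker processes instead of threads.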
I would recommend using beanstalkd for a dedicated job server, and Beanstalk::Client in your perl scripts for adding jobs to the queue and removing them.
You should find beanstalkd easier to install and set up compared to RabbitMQ. It will also take care of distributing jobs among available workers, burying any failed jobs so they can be retried later, scheduling jobs to be done at a later date, and many more basic features. For your worker, you don't have to worry about forking or threading; just start up as many workers as you need, on as many servers as you have available.
Either RabbitMQ or Beanstalk would be better than rolling your own db-backed solution. These projects have already worked out many of the details needed for queueing, and implemented features you may not realize yet that you want. They should also handle polling for new jobs more efficiently, compared to sleeping and selecting from your database to see if there's more work to do.

How expensive are MySQL events?

In my web app I use two recurring events that "clean up" one of the tables in the database, both executed every 15 minutes or so.
My question is, could this lead to performance problems in the future? I've read somewhere (I don't recall where exactly) that MySQL events are supposed to be scheduled to run once a month or so. Thing is, these same events keep the table at a pretty small size (as they delete records older than ~15 minutes); maybe this compensates for the frequency of their execution, right?
Also, is it better to have one big MySQL event or many small ones if they are called at the same frequency?
I don't think the monthly schedule is a performance recommendation; it's more of a suggestion of what to use events for. So I think you're OK doing your cleanup using events.
In the end, the documentation describes events like this:
Conceptually, this is similar to the idea of the Unix crontab (also known as a “cron job”) or the Windows Task Scheduler.
And the concept for those is that you can run a task every minute if you wish to do so.
On the second part of that question:
Serialize or spread them out. If you split them up into many events that all run at the same time, you will create spikes of possibly very high CPU usage that might slow down the application while the events are being processed.
So either pack everything into one event so it runs in succession, or spread the individual events out so they execute at different times during the 15-minute timeframe. Personally I think the first one is to be preferred: pack them into a single event, as then they are guaranteed to run in succession, even if one of them runs longer than usual.
The same goes for cron jobs. If you schedule 30 long-running exports at the same time, your application is going to fail miserably during that timeslot (learned that the hard way).
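A tiny sketch of the two options, with hypothetical helper names: either compute evenly spaced start offsets inside the window, or run the tasks strictly one after another from a single entry point:

```python
def staggered_offsets(task_count, window_seconds):
    # Spread task start times evenly across the window so they do not
    # all fire at the same moment.
    step = window_seconds // task_count
    return [i * step for i in range(task_count)]


def run_in_succession(tasks):
    # "One big event": each task starts only after the previous one has
    # finished, however long that takes.
    return [task() for task in tasks]
```

For three cleanup tasks in a 15-minute (900-second) window, `staggered_offsets(3, 900)` gives starts at 0, 300 and 600 seconds. Staggering still allows overlap if one task overruns its slot, which is the reason given above for preferring a single serialized event.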

is it possible to use cron too much?

I run a game statistics site. Its MySQL database is small potatoes compared to most of the things people work on around here, but shared hosting does necessitate an eye on query optimization, particularly when performing lots of joins and sub-queries.
Earlier this week I moved a rather slow (~0.5s) query that grouped, counted, averaged, and sorted the ratings of members to a nightly cron job. Results are stored in a table.
Because we average about one new rating per day, the change does not cause any perceptible data inaccuracy to my users, AND the new query which just grabs rows from the table runs in the ~0.000X range, so all pages pulling that data are noticeably faster.
Clearly this is a good thing!
And as I sat there basking in the glow of my cron job, my mind started running through other aspects of the site and mentally tagging those that could be cron'd... (many)
Which leads me to wonder - is it possible to use cron too much?
Because my site's database changes about once a day, I could conceivably run ALL complex queries (there are many) through nightly cron jobs and store the results in tables.
Is there ever a downside? (apart from data occasionally not being up-to-the-second accurate?)
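For reference, the kind of nightly refresh described in this question might look like the sketch below. The `ratings` and `rating_summary` tables and their columns are invented for illustration, and `sqlite3` replaces MySQL only to make the example self-contained:

```python
import sqlite3


def create_tables(conn):
    # Hypothetical schema: raw ratings plus a precomputed summary table.
    conn.executescript("""
        CREATE TABLE ratings (member_id INTEGER NOT NULL, score INTEGER NOT NULL);
        CREATE TABLE rating_summary (
            member_id INTEGER PRIMARY KEY,
            rating_count INTEGER NOT NULL,
            avg_score REAL NOT NULL
        );
    """)


def refresh_summary(conn):
    # Rebuild the summary in one transaction, so readers never observe a
    # half-filled table and a failed run leaves the old data in place.
    with conn:
        conn.execute("DELETE FROM rating_summary")
        conn.execute(
            "INSERT INTO rating_summary (member_id, rating_count, avg_score) "
            "SELECT member_id, COUNT(*), AVG(score) FROM ratings GROUP BY member_id"
        )
```

Pages then read from `rating_summary` with a plain primary-key lookup, and the nightly cron entry only has to call `refresh_summary`.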
Cron is great; it's usually a good thing to refrain from reinventing wheels. Some applications have more precise needs than cron can accommodate, so that's one reason not to use it. Also, distributing and managing cronjobs that are to form an integral part of your app can be difficult and error-prone, especially absent a competent package manager from the OS. Troubleshooting can be a little bit of a pain, particularly when there's one server missing one of its 100 cronjobs or something, but that can be managed with an OS package manager or with something like puppet.
But my opinion is to use cron whenever you can and it makes sense, rather than rolling your own.
You're not beginning to approach the limits of what amount of jobs can (or should) be scheduled with cron. You'll be just fine. :)
You might want to consider a worker-message queue like gearman to trigger jobs that should be run 'after the fact', but not necessarily on a fixed schedule.
how about one cron job that runs all your procedures?
I once worked on a Unix system that failed pretty miserably after the cron job queue exceeded 20 entries. The queue did not execute in any predictable order (FILO, FIFO, LIFO, etc.); it was simply randomized.
You might consider using triggers to keep your summary statistics up to date. There's also an event scheduler in MySQL 5.1+ if you like running queries periodically.
http://dev.mysql.com/doc/refman/5.0/en/triggers.html
http://dev.mysql.com/doc/refman/5.1/en/events.html

Message Queues Vs DB Table Queue via CRON

We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output etc., the sort of stuff we'd normally put into a table called "email_queue" and process with a script run by cron.
I have been reading a lot about message queue systems like beanstalkd, and have even set it up. It was easy and nice to use; the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and cron? I really can't seem to see what they are.
Thanks
Differences:
Once a message is put on the queue it can be immediately delivered. So if your cron normally ran every 5 minutes, you could process faster with the queuing.
If your queueing system supports transactions, then it will automatically re-deliver a message if the processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (sql).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is only delivered to one of them. With a DB table you need to handle this via application code (locking, flags, etc ...)
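As an illustration of the "locking, flags" work a DB table pushes onto the application, here is one common flag-based pattern: pick a candidate row, then claim it with a conditional UPDATE and check how many rows changed. The `status` and `worker_id` columns and the function names are assumptions (only the `email_queue` table name comes from the question), and `sqlite3` stands in for MySQL:

```python
import sqlite3


def create_queue(conn):
    # Hypothetical layout for the email_queue table.
    conn.execute(
        "CREATE TABLE email_queue ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'new', worker_id TEXT)"
    )
    conn.commit()


def claim_next(conn, worker_id):
    # Step 1: pick a candidate row.
    row = conn.execute(
        "SELECT id FROM email_queue WHERE status = 'new' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Step 2: claim it only if it is still 'new'. If another worker got
    # there first, zero rows change and this worker backs off.
    cur = conn.execute(
        "UPDATE email_queue SET status = 'processing', worker_id = ? "
        "WHERE id = ? AND status = 'new'",
        (worker_id, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```

A worker that gets `None` because it lost the race can simply try again. In MySQL the same effect is often achieved with `SELECT ... FOR UPDATE` inside a transaction; either way it is bookkeeping that a message queue does for you.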
A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It gets down ultimately to your requirements I guess. You can achieve a more manageable solution at scale with using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/DB table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.
First, queues are often backed by actual DB tables and can maintain message durability. That aside, the queue is a natural way to shove off work that needs to be done asynchronously, which is very powerful if you design on that principle from the start.
Other than the fact that a table (entity) has a set of hard columns (attributes), both a table composed of a set of records and a queue are nothing more than lists of stuff. You are employing the table as a formal queue; you are just polling it on a regular (cron) basis.
MQs add another nifty feature though of generally synchronizing access to the message itself (you may or may not be doing this in your SQL to get the next thing).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
Benefit of a queue in my opinion is that it takes care of the sync'ing, status updating. MQs can be set up to "broadcast" (topic) or make available the message to a group of consumers or listeners.
MQs, being asynchronous, would keep working between your cron windows. How do you know that the number of messages you process in your table can be handled before the next cron job runs and tries to step on the previous job?
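If you stay with cron and a table, one common guard against a run stepping on the previous one is a non-blocking file lock taken at the start of the script. A minimal sketch, assuming a Unix-like system (`fcntl` is not available on Windows); the lock file name and function name are arbitrary:

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "queue_runner.lock")


def try_lock(path):
    # Take an exclusive, non-blocking lock. Returns the open file handle on
    # success, or None if another run still holds the lock.
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    lock = try_lock(LOCK_PATH)
    if lock is None:
        print("previous run still active, skipping this one")
    else:
        try:
            pass  # process the queue table here
        finally:
            lock.close()  # closing the file releases the lock
```

The lock is released automatically if the process dies, so a crashed run does not block later ones. It only prevents overlap; it does nothing about a backlog that grows faster than one run can drain it.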
Multiple consumers for the MQ allows you to scale the work as you see fit. In the example above if you saw that your load average (just the same in the OS' process queue) is greater than you like, you can provision another consumer to handle said load, bringing it on and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside (as already mentioned) is that the queue can sometimes be hard to query and hard to obtain metrics from. I always look for MQ systems that have a DB backing store so that I can watch the queue myself with SQL.
This gets asked fairly frequently, and there's usually not a compelling reason to go MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).