I am thinking about using Quartz, and I would like each group to execute one job at a time while all groups fire in parallel. The reason is that I am connecting to physical devices that can each handle only one socket at a time.
As far as I can tell, I can use stateful jobs, which allow only one job of a given type to execute at a time. I can also set the scheduler thread count to 1 to ensure that the scheduler executes only one job at a time. Neither is what I want, but I might be able to achieve it by creating a scheduler for each group.
Is there a better solution than Quartz for queuing?
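For what it's worth, the "one worker per group" idea can be sketched without Quartz at all. The following is a minimal Python sketch, not Quartz code: it assumes one single-threaded executor per group, so jobs in the same group run one at a time while different groups run in parallel. The class name `GroupSerialScheduler`, the group names, and the `talk_to_device` function are all made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GroupSerialScheduler:
    """Runs at most one job per group at a time; groups run in parallel."""

    def __init__(self):
        self._executors = {}
        self._lock = threading.Lock()

    def submit(self, group, fn, *args):
        # Lazily create one single-threaded executor per group (e.g. per device).
        with self._lock:
            executor = self._executors.get(group)
            if executor is None:
                executor = ThreadPoolExecutor(max_workers=1)
                self._executors[group] = executor
        return executor.submit(fn, *args)

    def shutdown(self):
        for executor in self._executors.values():
            executor.shutdown(wait=True)


def run_demo():
    """Returns the peak number of jobs seen running at once, per group and overall."""
    state_lock = threading.Lock()
    active = {"device-a": 0, "device-b": 0}
    peak = {"device-a": 0, "device-b": 0, "total": 0}

    def talk_to_device(group):
        with state_lock:
            active[group] += 1
            peak[group] = max(peak[group], active[group])
            peak["total"] = max(peak["total"], sum(active.values()))
        time.sleep(0.05)  # stands in for the socket conversation (hypothetical)
        with state_lock:
            active[group] -= 1

    scheduler = GroupSerialScheduler()
    groups = ["device-a", "device-b"] * 3
    futures = [scheduler.submit(g, talk_to_device, g) for g in groups]
    for future in futures:
        future.result()
    scheduler.shutdown()
    return peak
```

Calling `run_demo()` should report a per-group peak of 1, since each group has exactly one worker thread. The trade-off is one thread per group, which is fine for a handful of devices but may not scale to thousands.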
I'm using a UserParameter item with a curl shell command to check whether a website is live. When the number of site domains exceeds 1000, I set the check interval to 5 minutes (Zabbix agent in active mode), but I found that many check items sit in the queue for more than 10 minutes. How can I improve the check speed?
Indeed, Zabbix agent processes active items in a serial fashion [per server].
Possible solution - have an item that schedules an atd job, then sends data to trapper item[s].
In the future, the agent rewrite in Go will have parallel active check processing.
I'm trying to understand how Slick-Hikari works. I've read a lot of documentation, but I have a use case whose behavior I don't understand.
I'm using Slick 3 with Hikari, with the default configuration. I already have a production app with ~1000 users connected concurrently. My app works with websockets, and when I deploy a new release all clients are reconnected. (I know it's not the best way to handle a deploy, but I don't have clustering at the moment.) When all these users reconnect, they all start running queries to get their user state (dog-pile effect). When this happens, Slick starts to throw a lot of errors like:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@4dbbd9d1 rejected from java.util.concurrent.ThreadPoolExecutor@a3b8495[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 23740]
What I think is happening is that the Slick queue for pending queries is full, because it can't handle all the clients requesting information from the database. But when I look at the metrics that Dropwizard provides, I see the following:
Near 16:45 we see a deploy. Until the old instance is terminated, we can see that the number of connections goes from 20 to 40. I think that's normal, given how the deploy process is done.
But, if the query queue of Slick becomes full because of the dog-pile effect, why is it not using more than 3-5 connections if it has 20 connections available? The database is performing really well, so I think the bottleneck is in Slick.
Do you have any advice for improving this deploy process? I have only 1000 users now, but I'll have a lot more in a few weeks.
Based on the "rejected" exception, I think many Slick actions were submitted concurrently, which exceeded the default size (1000) of Slick's internal queue.
So I think you should:
Increase the queue size (queueSize) to hold more unprocessed actions.
Increase the number of threads (numThreads) in Slick to process more actions concurrently.
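To make the mechanism behind the exception concrete, here is a rough Python model (not Slick or Scala code) of a fixed thread pool with a bounded queue. It assumes the simplest possible policy: a submission is rejected once every thread is busy and every queue slot is taken. `BoundedExecutor`, `RejectedExecution`, and `count_rejections` are names invented for this sketch, and Slick's real executor may differ in its details.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RejectedExecution(Exception):
    """Raised when neither a thread nor a queue slot is free."""


class BoundedExecutor:
    def __init__(self, num_threads, queue_size):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)
        # One slot per running task plus one slot per queued task.
        self._slots = threading.BoundedSemaphore(num_threads + queue_size)

    def submit(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            raise RejectedExecution("all threads busy and queue full")
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the task has finished (or failed).
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)


def count_rejections(num_threads, queue_size, burst):
    """Submit `burst` tasks that all block, and count how many are rejected."""
    gate = threading.Event()
    executor = BoundedExecutor(num_threads, queue_size)
    rejected = 0
    for _ in range(burst):
        try:
            executor.submit(gate.wait)  # simulates a slow query
        except RejectedExecution:
            rejected += 1
    gate.set()  # let the accepted tasks finish
    executor.shutdown()
    return rejected
```

In this model, a pool with 2 threads and a queue of 3 accepts 5 pending tasks and rejects the rest of a burst. With the numbers from the exception (20 threads, 1000 queued tasks), anything beyond 1020 simultaneously pending actions would be rejected, which is why raising queueSize or numThreads moves the limit.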
You can get more tips here
We have 3 REST applications within a cluster.
So each application server can receive requests from "outside".
Now we have timed events, which analyse the database, add and remove rows, send emails, etc.
The problem is that each application server starts these timed events, and it happens that 2 application servers start the analysis job at the same time.
We have a SQL table in the back.
Our idea was to lock a table within the SQL database when starting the job. If the table is locked, we exit the job, because another application has just started the analysis.
What's a good practice for adding some kind of semaphore?
Any ideas?
Don't use semaphores; you are overcomplicating things. Just use message queueing, where you queue your tasks and have them executed one after another.
Have ONLY one separate node/process/child_process consume from the queue and get your tasks done.
We (at a previous employer) used a database-based semaphore. Each of several (for redundancy and load sharing) servers had the same set of cron jobs. The first thing in each was a custom library call that did:
Connect to the database and check for (or insert) "I'm working on X".
If the flag was already set, then the cron job silently exited.
When finished, the flag was cleared.
The table included a timestamp and a host name -- for debugging and recovering from cron jobs that fail to finish gracefully.
I forget how the "test and set" was done. Possibly an optimistic INSERT, then check for "duplicate key".
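The "optimistic INSERT, then check for duplicate key" approach could look roughly like the sketch below. It uses Python with SQLite only so that it is self-contained; the table name `job_locks`, its columns, and the function names are assumptions for illustration, not the original library. The primary key on the job name is what makes the insert an atomic test-and-set.

```python
import socket
import sqlite3
import time


def init_locks(conn):
    # Hypothetical schema: one row per job that is currently running.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_locks ("
        " job_name TEXT PRIMARY KEY,"
        " host TEXT NOT NULL,"
        " started_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, job_name):
    """Optimistic INSERT: succeeds only if nobody else holds the flag."""
    try:
        conn.execute(
            "INSERT INTO job_locks (job_name, host, started_at) VALUES (?, ?, ?)",
            (job_name, socket.gethostname(), time.time()),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Duplicate key: another server is already working on this job.
        conn.rollback()
        return False


def release(conn, job_name):
    conn.execute("DELETE FROM job_locks WHERE job_name = ?", (job_name,))
    conn.commit()


def run_exclusively(conn, job_name, job):
    """Run job() only if the flag could be set; exit silently otherwise."""
    if not try_acquire(conn, job_name):
        return False
    try:
        job()
    finally:
        release(conn, job_name)
    return True
```

The host and timestamp columns play the same role as described above: if a job crashes without clearing its flag, the stale row shows who held it and since when, so it can be removed by hand or by a cleanup task.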
I'd like to use MySQL as a job queue. Multiple machines will be producing and consuming jobs. Jobs need to be scheduled; some may run every hour, some every day, etc.
It seems fairly straightforward: for each job, have a "nextFireTime" column, and have worker machines search for a job whose nextFireTime has arrived, change the status of the record to "inProcess", and then update the nextFireTime when the job ends.
The problem comes in when a worker dies silently. It won't be able to update the nextFireTime or set the status back to "idle".
Unfortunately, jobs can be long-running, so a reaper thread that looks for jobs that have been inProcess too long isn't an option. There's no timeout value that would work.
Can anyone suggest a design pattern that would properly handle unreliable worker machines?
Maybe like this
When a worker fetches a job, it can add its process ID or another unique ID to a field in the job.
Then, in another table, every worker keeps updating a value to show that it is alive. When updating its own "I'm alive" field, a worker also checks the "last sign of life" time of all other workers. If one worker is over a limit, find all the jobs it is working on and reset them.
So in other words the watchdog works on the worker-processes and not the jobs themselves.
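A rough sketch of that worker-level watchdog, in Python with SQLite so that it runs standalone. The `workers` and `jobs` tables, the column names, and the `limit_seconds` parameter are assumptions made up for this example; the current time is passed in explicitly to keep the logic easy to test.

```python
import sqlite3


def init_schema(conn):
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS workers (
            worker_id TEXT PRIMARY KEY,
            last_seen REAL NOT NULL
        );
        CREATE TABLE IF NOT EXISTS jobs (
            job_id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'idle',
            worker_id TEXT
        );
        """
    )


def claim_job(conn, worker_id, job_id):
    """Mark an idle job as inProcess and stamp it with the worker's id."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'inProcess', worker_id = ? "
        "WHERE job_id = ? AND status = 'idle'",
        (worker_id, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


def heartbeat(conn, worker_id, now, limit_seconds):
    """Record that worker_id is alive, then reset jobs held by silent workers.

    Returns the number of jobs that were put back to 'idle'.
    """
    conn.execute(
        "INSERT OR REPLACE INTO workers (worker_id, last_seen) VALUES (?, ?)",
        (worker_id, now),
    )
    cur = conn.execute(
        "UPDATE jobs SET status = 'idle', worker_id = NULL "
        "WHERE status = 'inProcess' AND worker_id IN "
        "(SELECT worker_id FROM workers WHERE last_seen < ?)",
        (now - limit_seconds,),
    )
    conn.commit()
    return cur.rowcount
```

Because the timeout applies to the worker's heartbeat and not to the job's duration, a job can run for hours as long as its worker keeps checking in. The heartbeat should come from a thread or process that dies with the worker, otherwise a hung job could keep looking alive.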
Using MySQL as a job queue generally ends in pain, as it's a very poor fit for the usual goals of an RDBMS. User 'toong' already linked to https://www.engineyard.com/blog/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you, which has a lot of interesting stuff to say about it. Unreliable workers are only one of the complications.
There are many, many systems for handling job distribution, mostly distinguished by the sophistication of their queueing and scheduling capabilities. On the simple FIFO end are things like Resque, Celery, Beanstalkd, and Gearman; on the sophisticated end are things like GridEngine, Torque/Maui, and PBS Pro. I highly recommend the new Amazon Simple Workflow system, if you can tolerate reliance on an Amazon service (I believe it does not require that you be in EC2).
To your original question: right now we're implementing a per-node supervisor that can tell whether the node's jobs are still active, and sends a heartbeat back to a job monitor if so. It's a pain, but as you are discovering and will continue to discover, there are a lot of details and error cases to manage. Mostly, though, I have to encourage you to do yourself a favor by learning about this domain and building the system properly from the start.
One option is to make sure that jobs are idempotent, and allow more than one worker to start a given job. It doesn't matter which worker completes the job, or whether more than one worker completes it, since the jobs are designed so that multiple completions are handled gracefully. Perhaps workers race to supply the result, and the losers find that the slot that will hold the result is already full, so they just drop theirs.
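The "race to fill the result slot" idea might be sketched like this, again in Python with SQLite for a self-contained example. The `results` table and the function names are hypothetical; the point is only that the primary key lets the first result in and makes later ones a harmless no-op.

```python
import sqlite3


def init_results(conn):
    # Hypothetical schema: at most one result row per job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " job_id INTEGER PRIMARY KEY,"
        " worker_id TEXT NOT NULL,"
        " result TEXT NOT NULL)"
    )
    conn.commit()


def submit_result(conn, job_id, worker_id, result):
    """Store the result if the slot is empty.

    Returns True if this worker's result was stored, False if another
    worker had already filled the slot and this result was dropped.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO results (job_id, worker_id, result) VALUES (?, ?, ?)",
        (job_id, worker_id, result),
    )
    conn.commit()
    return cur.rowcount == 1
```

This only works if running the job twice has no harmful side effects of its own (for example, sending the same email twice), so the idempotence has to cover the work itself and not just the final write.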
Another option is to not have big jobs. Break long-running jobs into intermediate steps: if a job takes longer than (say) 1 minute, store the intermediate results as a new job (linked to the old job in some way), so that the new job can be queued again to do another minute of work.
I want to write an Ant script that is distributed over multiple slaves. I don't understand exactly how the Hudson system works, but it seems to simply run the whole of one type of build on a single slave. I would like multiple slaves to run in parallel to do my testing. How can I accomplish this?
You split your testing job into several jobs. 1 job per slave. Your build will then trigger all testing jobs at the same time. If you need to run an additional job, you can use the join trigger plugin.
The release notes for Hudson 1.377 list a new feature:
"Queue/execution model is extended to allow jobs that consume multiple executors on different nodes"
I don't know exactly what that means, but I will definitely have a look.