how to tell NServiceBus is using MaximumConcurrencyLevel? - message-queue

I'm trying to validate our company's code works when NServiceBus v4.3 is using the MaximumConcurrencyLevel value setup in the config.
The problem is, when I try to process 12k+ of queued entries, I cannot tell any difference in times between the five different max concur levels I change. I set it to 1 and I can process the queue in 8m, then I put it to 2 and I get 9m, seems interesting (I was expecting more, but it's still going in the right direction), but then I put 3, 4, 5 and the timings stay at around 8m. I was expecting a much better throughput.
My question is, how can I verify that NServiceBus is actually indeed using five threads to process entries on the queue?
PS I've tried setting the MaximumConcurrencyLevel="1" and the MaximumMessageThroughputPerSecond along with logging the Thread.CurrentThread.ManagedThreadId thinking\hoping I was ONLY going to see one ThreadID value, but I'm seeing quite a few of different ones, which surprised me. My plan was to see one, then bump the max concur level to 5 and hopefully see five different values.
What am I missing? Thank you in advance.

There can be multiple reasons why you don't see faster processing times when increasing the concurrency setting described on the official documentation page: http://docs.particular.net/nservicebus/operations/tuning
You mentioned you're using the MaximumMessageThroughputPerSecond which will negate any performance gains my parallel message processing if a low value has been configured. Try removing this setting if possible.
Maybe you're accessing a resource in your handlers which isn't supporting/optimized for parallel access.
NServiceBus internally schedules the processing logic on the threadpool. This means that even with a MaximumConcurrencyLevel of 1, you will most likely see a different thread processing each message since there is no thread affinity. But the configuration values work as expected, if your queue contains 5 messages:
it will process these messages one by one if you configured MaximumConcurrencyLevel to 1
it will process all messages in parallel if you configured MaximumConcurrencyLevel to 5.
Depending on your handlers it can of course happen that the first message is already processed at the time the fifth message is read from the queue.

Related

EsRejectedExecutionException in elasticsearch for parallel search

I am querying elasticsearch for multiple parallel requests using single transport client instance in my application.
I got the below exception for the parallel execution. How to overcome the issue.
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#5f804c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Elasticsearch has a thread pool and a queue for search per node.
A thread pool will have N number of workers ready to handle the requests. When a request comes and if a worker is free , this is handled by the worker. Now by default the number of workers is equal to the number of cores on that CPU.
When the workers are full and there are more search requests, the request will go to queue. The size of queue is also limited. If by default size is, say, 100 and if there happens more parallel requests than this, then those requests would be rejected as you can see in the error log.
Solutions:
The immediate solution for this would be to increase the size of
the search queue. We can also increase the size of threadpool,
but then that might badly affect the performance of individual
queries. So, increasing the queue might be a good idea. But then
remember that this queue is memory residential and increasing the
queue size too much can result in Out Of Memory issues. (more
info)
Increase number of nodes and replicas - Remember each node has its
own search threadpool/queue. Also, search can happen on primary
shard OR replica.
Maybe it sounds strange, but you need to lower the parallel searches count. With that exception, Elasticsearch tells you that you are overloading it. There are some limits (at thread count level) that are set in Elasticsearch and, most of the times, the defaults for these limits are the best option. So, if you are testing your cluster to see how much load it can hold, this would be an indicator that some limits have been reached.
Alternatively, if you really want to change the default you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue size, the more pressure you put on your cluster that, in the end, will cause instability.
I saw this same error because I was sending lots of indexing requests in to ES in parallel. Since I'm writing a data migration, it was easy enough to make them serial, and that resolved the issue.
I don't know what was your node configuration but your queue size (1000) is already on a higher side. As others have explained already, your search requests are queued in the Elasticsearch thread pool queue. Even after such a high queue size, if you are getting rejections, that gives some hint that you need to revisit your query pattern.
Like many other designs, even in this case, there is no one-size-fits-all solution. I found this is a very good post about how this queue works and different ways to do a performance test to find out what suits best for your use case.
HTH!

Replication Factor to use for system_auth

When using internal security with Cassandra, what replication factor do you use for system_auth?
The older docs seem to suggest it should be N, where N is the number of nodes, while the newer ones suggest we should set it to a number greater than 1. I can understand why it makes sense for it to be higher - if a partition occurs and one section doesn't have a replica, nobody can log in.
However, does it need to be all nodes? What are the downsides of setting it to all ndoes?
Let me answer this question by posing another:
If (due to some unforseen event) all of your nodes went down, except for one; would you still want to be able to log into (and use) that node?
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes. You can't predict node failure, and in the interests of keeping your application running, it's better safe than sorry.
I don't see any glaring downsides in doing so. The system_auth keyspace isn't very big (mine is 20kb) so it doesn't take up a lot of space. The only possible scenario, would be if one of the nodes is down, and a write operation is made to a column family in system_auth (in which case, I think the write gets rejected...depending on your write consistency). Either way system_auth isn't a very write-heavy keyspace. So you'll be ok as long as you don't plan on performing user maintenance during a node failure.
Setting the replication factor of system_auth to the number of nodes in the cluster should be ok. At the very least, I would say you should make sure it replicates to all of your data centers.
In case you were still wondering about this part of your question:
The older docs seem to suggest it should be N where n is the number of nodes, while the newer ones suggest we should set it to a number greater than 1."
I stumbled across this today in the 2.1 documentation Configuring Authentication:
Increase the replication factor for the system_auth keyspace to N
(number of nodes).
Just making sure that recommendation was clear.
Addendum 20181108
So I originally answered this back when the largest cluster I had ever managed had 10 nodes. Four years later, after spending three of those managing large (100+) node clusters for a major retailer, my opinions on this have changed somewhat. I can say that I no longer agree with this statement of mine from four years ago.
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes.
A few times on mind-to-large (20-50 nodes) clusters , we have deployed system_auth with RFs as high as 8. It works as long as you're not moving nodes around, and assumes that the default cassandra/cassandra user is no longer in-play.
The drawbacks were seen on clusters which have a tendency to fluctuate in size. Of course, clusters which change in size usually do so because of high throughput requirements across multiple providers, further complicating things. We noticed that occasionally, the application teams would report auth failures on such clusters. We were able to quickly rectify these situation, by running a SELECT COUNT(*) on all system_auth tables, thus forcing a read repair. But the issue would tend to resurface the next time we added/removed several nodes.
Due to issues that can happen with larger clusters which fluctuate in size, we now treat system_auth like we do any other keyspace. That is, we set system_auth's RF to 3 in each DC.
That seems to work really well. It gets you around the problems that come with having too many replicas to manage in a high-throughput, dynamic environment. After all, if RF=3 is good enough for your application's data, it's probably good enough for system_auth.
The reason the recommendation changed was that a quorum query would require responses from over half of your nodes to fullfill. So if you accidentally left Cassandra user active and you have 80 nodes - we need 41 responses.
Whilst it's good practice to avoid using the super user like that - you'd be surprised how often it's still out there.

Associative cache simulation - Dealing with a Faulty Scheme

While working on simulating a fully associative cache (in MIPS assembly), a couple of questions came to mind based on some information read online;
According to some notes from the University of Maryland
Finding a slot: At most, one slot should match. If
there is more than one slot that
matches, then you have a faulty
fully-associative cache scheme. You
should never have more than one copy
of the cache line in any slot of a
fully-associative cache. It's hard to
maintain multiple copies, and doesn't
make sense. The slots could be used
for other cache lines.
Does that mean that I should check all the time the whole tag list in order to check for a second match? After all if I don't, i will never "realize" about the fault with the cache, yet, checking every single time seems quite inefficient.
In the case I do check, and somehow I manage to find a second match, meaning faulty cache scheme, what shall I do then? Although the best answer would be to fix my implementation, yet Im interested on how to handle it during execution if this situation should arise.
If more than one valid slot matches an address, then that means that when a previous search for the same address was executed, either a valid slot that should have matched the address was not used (perhaps because it was not checked in the first place) or more than one invalid slot was used to store the line that wasn't in the cache at all.
Without a doubt, this should be considered a bug.
But if we've just decided not to fix the bug (maybe we'd rather not commit that much hardware to a better implementation) the most obvious option is to pick one of the slots to invalidate. It will then be available for other cache lines.
As for how to pick which one to invalidate, if one of the duplicate lines is clean, invalidate that one in preference to a dirty cache line. If more than cache line is dirty and they disagree you have an even bigger bug to fix, but at any rate your cache is out of sync and it probably doesn't matter which you pick.
Edit: here's how I might implement hardware to do this:
First off, it doesn't make a whole lot of sense to start with the assumption of duplicates, rather we'll work around that at the appropriate time later. There are a few possibilities of what must happen when caching a new line.
The line is already in the cache, no action is needed
The line is not in the cache but there are invalid slots available: Place the new line into one of the available slots
The line is not in the cache but there are no invalid slots available. Another valid line must be evicted and the new line takes its place.
Picking an eviction candidate has performance consequences. Clean cache lines can be evicted for free, but if chosen poorly, it can cause another cache miss in the near future. Consider if all but one cache line is dirty. If only the clean cache line is evicted, then many sequential reads alternating between two addresses will cause a cache miss on every read. Cache invalidation is among the two hard problems in Comp Sci (the other being 'naming things') and out of the scope of this exact question.
I would probably implement a search that checks for the correct slot to act on for each of these. Then another block would pick the first from that list and act on it.
Now, getting back to the question. What are the conditions under which duplicates could possibly enter the cache. If memory accesses are strictly ordered, and the implementation (as above) is correct, I don't think duplicates are possible at all. And thus there's no need to check for them.
Now lets consider a more implausible case where A single cache is shared across two CPU cores. We're going to just do the simplest thing that could work and duplicate everything except the cache memory itself for each core. Thus the slot searching hardware is not shared. To support this, an extra bit per slot is used as a mutex. search hardware cannot use a slot that is locked by the other core. specifically,
If the address is in the cache, try to lock the slot and return that slot. If the slot is already locked, stall until it is free.
If the address is not in the cache, find an unlocked slot that is invalid or valid but evictable.
in this case we actually can end up in a position where two slots share the same address. If both cores try to write to an address that is not in the cache, they will end up getting different slots, and a duplicate line will occur. First lets think about what could happen:
Both lines were reads from main memory. They will be the same value and they will both be clean. It is correct to evict either.
Both lines were writes. Both will be dirty, but probably not be equal. This is a race condition that should have been resolved by the application by issuing memory fences or some other memory ordering instructions. We cannot guess which one should be used, if there was no cache the race condition would persist into RAM. It is correct to evict either.
One line was a read and one was a write. The write is dirty but the read is clean. Once again this race condition would have persisted into RAM if there was no intervening cache, but the reader could have seen a different value. evicting the clean line is right by RAM, and also has the side effect of always favoring read then write ordering.
So now we know what to do about it, but where does this logic belong. First lets think about what could happen if we don't do anything. A subsequent cache access for the same address on either core could return either line. Even if neither core is issuing writes, reads could keep coming up different, alternating between the two values. This breaks every conceivable idea about memory ordering.
one solution might be to just say that dirty lines belong to one core only, the line is not dirty, but dirty and owned by another core.
In the case of two concurrent reads, both lines are identical, unlocked and interchangeable. It doesn't matter which line a core gets for subsequent operations.
in the case of concurrent writes, both lines are out of sync, but mutually invisible. Although the race condition that this creates is unfortunate, it still leads to a reasonable memory ordering, as if all of the operations that happen on the discarded line happened before any of the operations on the cleaned line.
If a read and a write happen concurrently, the dirty line is invisible to the reading core. However, the clean line is visible to both cores, and would cause memory ordering to break down for the writer. future writes could even cause it to lock both (because both would be dirty).
That last case pretty much militates that dirty lines be preferred to clean ones. This forces at least some extra hardware to look for dirty lines first and clean lines only if no dirty lines were found. So now we have a new concurrent cache implementation:
If the address is in the cache and dirty and owned by the requesting core, use that slot
if the address is in the cache but clean
for reads, just use that slot
for writes, mark the slot as dirty and use that slot
if the address is not in the cache and there are invalid slots, use an invalid slot
if there are no invalid slots, evict a line and use that slot.
We're getting closer, there's still a hole in the implementation. What if both cores access the same address but not concurrently. The simplest thing is probably to just say that dirty lines are really invisible to other cores. In cache but dirty is the same as not being in the cache at all.
Now all we have to think about is actually providing the tool for applications to synchronize. I'd probably do a tool that just explicitly flushes a line if it is dirty. This would just invoke the same hardware that is used during eviction, but marks the line as clean instead of invalid.
To make a long post short, the idea is to deal with the duplicates not by removing them, but by making sure they cannot lead to further memory ordering issues, and leaving the deduplication work to the application or eventual eviction.

How can I make an SQL query thread start, then do other work before getting results?

I have a program that does a limited form of multithreading. It is written in Delphi, and uses libmysql.dll (the C API) to access a MySQL server. The program must process a long list of records, taking ~0.1s per record. Think of it as one big loop. All database access is done by worker threads which either prefetch the next records or write results, so the main thread doesn't have to wait.
At the top of this loop, we first wait for the prefetch thread, get the results, then have the prefetch thread execute the query for the next record. The idea being that the prefetch thread will send the query immediately, and wait for results while the main thread completes the loop.
It often does work that way. But note there's nothing to ensure that the prefetch thread runs right away. I found that often the query was not sent until the main thread looped around and started waiting for the prefetch.
I sort-of fixed that by calling sleep(0) right after launching the prefetch thread. This way the main thread surrenders the remainder of it's time slice, hoping that the prefetch thread will now run, sending the query. Then that thread will sleep while waiting, which allows the main thread to run again.
Of course, there's plenty more threads running in the OS, but this did actually work to some extent.
What I really want to happen is for the main thread to send the query, and then have the worker thread wait for the results. Using libmysql.dll I call
result := mysql_query(p.SqlCon,pChar(p.query));
in the worker thread. Instead, I'd like to have the main thread call something like
mysql_threadedquery(p.SqlCon,pChar(p.query),thread);
which would hand off the task as soon as the data went out.
Anybody know of anything like that?
This is really a scheduling problem, so I could try is lauching the prefetch thread at a higher priority, then have it reduce its priority after the query is sent. But again, I don't have any mysql call that separates sending the query from receiving the results.
Maybe it's in there and I just don't know about it. Enlighten me, please.
Added Question:
Does anyone think this problem would be solved by running the prefetch thread at a higher priority than the main thread? The idea is that the prefetch would immediately preempt the main thread and send the query. Then it would sleep waiting for the server reply. Meanwhile the main thread would run.
Added: Details of current implementation
This program performs calculations on data contained in a MySQL DB. There are 33M items with more added every second. The program runs continuously, processing new items, and sometimes re-analyzing old items. It gets a list of items to analyze from a table, so at the beginning of a pass (current item) it knows the next item ID it will need.
As each item is independent, this is a perfect target for multiprocessing. The easiest way to do this is to run multiple instances of the program on multiple machines. The program is highly optimized via profiling, rewrites, and algorithm redesign. Still, a single instance utilizes 100% of a CPU core when not data-starved. I run 4-8 copies on two quad-core workstations. But at this rate they must spend time waiting on the MySQL server. (Optimization of the Server/DB schema is another topic.)
I implemented multi-threading in the process solely to avoid blocking on the SQL calls. That's why I called this "limited multi-threading". A worker thread has one task: send a command and wait for results. (OK, two tasks.)
It turns out there are 6 blocking tasks associated with 6 tables. Two of these read data and the other 4 write results. These are similar enough to be defined by a common Task structure. A pointer to this Task is passed to a threadpool manager which assigns a thread to do the work. The main thread can check the task status through the Task structure.
This makes the main thread code very simple. When it needs to perform Task1, it waits for Task1 to be not busy, puts the SQL command in Task1 and hands it off. When Task1 is no longer busy, it contains the results (if any).
The 4 tasks that write results are trivial. The main thread has a Task write records while it goes on to the next item. When done with that item it makes sure the previous write finished before starting another.
The 2 reading threads are less trivial. Nothing would be gained by passing the read to a thread and then waiting for the results. Instead, these tasks prefetch data for the next item. So the main thread, coming to this blocking tasks, checks if the prefetch is done; Waits if necessary for the prefetch to finish, then takes the data from the Task. Finally, it reissues the Task with the NEXT Item ID.
The idea is for the prefetch task to immediately issue the query and wait for the MySQL server. Then the main thread can process the current Item and by the time it starts on the next Item the data it needs is in the prefetch Task.
So the threading, a thread pool, the synchronization, data structures, etc. are all done. And that all works. What I'm left with is a Scheduling Problem.
The Scheduling Problem is this: All the speed gain is in processing the current Item while the server is fetching the next Item. We issue the prefetch task before processing the current item, but how do we guarantee that it starts? The OS scheduler does not know that it's important for the prefetch task to issue the query right away, and then it will do nothing but wait.
The OS scheduler is trying to be "fair" and allow each task to run for an assigned time slice. My worst case is this: The main thread receives its slice and issues a prefetch, then finishes the current item and must wait for the next item. Waiting releases the rest of its time slice, so the scheduler starts the prefetch thread, which issues the query and then waits. Now both threads are waiting. When the server signals the query is done the prefetch thread restarts, and requests the Results (dataset) then sleeps. When the server provides the results the prefetch thread awakes, marks the Task Done and terminates. Finally, the main thread restarts and takes the data from the finished Task.
To avoid this worst-case scheduling I need some way to ensure that the prefetch query is issued before the main thread goes on with the current item. So far I've thought of three ways to do that:
Right after issuing the prefetch task, the main thread calls Sleep(0). This should relinquish the rest of its time slice. I then hope that the scheduler runs the prefetch thread, which will issue the query and then wait. Then the scheduler should restart the main thread (I hope.) As bad as it sounds, this actually works better than nothing.
I could possibly issue the prefetch thread at a higher priority than the main thread. That should cause the scheduler to run it right away, even if it must preempt the main thread. It may also have undesirable effects. It seems unnatural for a background worker thread to get a higher priority.
I could possibly issue the query asynchronously. That is, separate sending the query from receiving the results. That way I could have the main thread send the prefetch using mysql_send_query (non blocking) and go on with the current item. Then when it needed the next item it would call mysql_read_query, which would block until the data is available.
Note that solution 3 does not even use a worker thread. This looks like the best answer, but requires a rewrite of some low-level code. I'm currently looking for examples of such asynchronous client-server access.
I'd also like any experienced opinions on these approaches. Have I missed anything, or am I doing anything wrong? Please note that this is all working code. I'm not asking how to do it, but how to do it better/faster.
Still, a single instance utilizes 100% of a CPU core when not data-starved. I run 4-8 copies on two quad-core workstations.
I have a conceptual problem here. In your situation I would either create a multi-process solution, with each process doing everything in its single thread, or I would create a multi-threaded solution that is limited to a single instance on any particular machine. Once you decide to work with multiple threads and accept the added complexity and probability of hard-to-fix bugs, then you should make maximum use of them. Using a single process with multiple threads allows you to employ varying numbers of threads for reading from and writing to the database and to process your data. The number of threads may even change during the runtime of your program, and the ratio of database and processing threads may too. This kind of dynamic partitioning of the work is only possible if you can control all threads from a single point in the program, which isn't possible with multiple processes.
I implemented multi-threading in the process solely to avoid blocking on the SQL calls.
With multiple processes there wouldn't be a real need to do so. If your processes are I/O-bound some of the time they don't consume CPU resources, so you probably simply need to run more of them than your machine has cores. But then you have the problem to know how many processes to spawn, and that may again change over time if the machine does other work too. A threaded solution in a single process can be made adaptable to a changing environment in a relatively simple way.
So the threading, a thread pool, the synchronization, data structures, etc. are all done. And that all works. What I'm left with is a Scheduling Problem.
Which you should leave to the OS. Simply have a single process with the necessary pooled threads. Something like the following:
A number of threads reads records from the database and adds them to a producer-consumer queue with an upper bound, which is somewhere between N and 2*N where N is the number of processor cores in the system. These threads will block on the full queue, and they can have increased priority, so that they will be scheduled to run as soon as the queue has more room and they become unblocked. Since they will be blocked on I/O most of the time their higher priority shouldn't be a problem.
I don't know what that number of threads is, you would need to measure.
A number of processing threads, probably one per processor core in the system. They will take work items from the queue mentioned in the previous point, on block on that queue if it's empty. Processed work items should go to another queue.
A number of threads that take processed work items from the second queue and write data back to the database. There should probably an upper bound for the second queue as well, to make it so that a failure to write processed data back to the database will not cause processed data to pile up and fill all your process memory space.
The number of threads needs to be determined, but all scheduling will be performed by the OS scheduler. The key is to have enough threads to utilise all CPU cores, and the necessary number of auxiliary threads to keep them busy and deal with their outputs. If these threads come from pools you are free to adjust their numbers at runtime too.
The Omni Thread Library has a solution for tasks, task pools, producer consumer queues and everything else you would need to implement this. Otherwise you can write your own queues using mutexes.
The Scheduling Problem is this: All the speed gain is in processing the current Item while the server is fetching the next Item. We issue the prefetch task before processing the current item, but how do we guarantee that it starts?
By giving it a higher priority.
The OS scheduler does not know that it's important for the prefetch task to issue the query right away
It will know if the thread has a higher priority.
The OS scheduler is trying to be "fair" and allow each task to run for an assigned time slice.
Only for threads of the same priority. No lower priority thread will get any slice of CPU while a higher priority thread in the same process is runnable.
[Edit: That's not completely true, more information at the end. However, it is close enough to the truth to ensure that your higher priority network threads send and receive data as soon as possible.]
Right after issuing the prefetch task, the main thread calls Sleep(0).
Calling Sleep() is a bad way to force threads to execute in a certain order. Set the thread priority according to the priority of the work they perform, and use OS primitives to block higher priority threads if they should not run.
I could possibly issue the prefetch thread at a higher priority than the main thread. That should cause the scheduler to run it right away, even if it must preempt the main thread. It may also have undesirable effects. It seems unnatural for a background worker thread to get a higher priority.
There is nothing unnatural about this. It is the intended way to use threads. You only must make sure that higher priority threads block sooner or later, and any thread that goes to the OS for I/O (file or network) does block. In the scheme I sketched above the high priority threads will also block on the queues.
I could possibly issue the query asynchronously.
I wouldn't go there. This technique may be necessary when you write a server for many simultaneous connections and a thread per connection is prohibitively expensive, but otherwise blocking network access in a threaded solution should work fine.
Edit:
Thanks to Jeroen Pluimers for the poke to look closer into this. As the information in the links he gave in his comment shows my statement
No lower priority thread will get any slice of CPU while a higher priority thread in the same process is runnable.
is not true. Lower priority threads that haven't been running for a long time get a random priority boost and will indeed sooner or later get a share of CPU, even though higher priority threads are runnable. For more information about this see in particular "Priority Inversion and Windows NT Scheduler".
To test this out I created a simple demo with Delphi:
type
TForm1 = class(TForm)
Label1: TLabel;
Label2: TLabel;
Label3: TLabel;
Label4: TLabel;
Label5: TLabel;
Label6: TLabel;
Timer1: TTimer;
procedure FormCreate(Sender: TObject);
procedure FormDestroy(Sender: TObject);
procedure Timer1Timer(Sender: TObject);
private
fLoopCounters: array[0..5] of LongWord;
fThreads: array[0..5] of TThread;
end;
var
Form1: TForm1;
implementation
{$R *.DFM}
// TTestThread
type
TTestThread = class(TThread)
private
fLoopCounterPtr: PLongWord;
protected
procedure Execute; override;
public
constructor Create(ALowerPriority: boolean; ALoopCounterPtr: PLongWord);
end;
constructor TTestThread.Create(ALowerPriority: boolean;
ALoopCounterPtr: PLongWord);
begin
inherited Create(True);
if ALowerPriority then
Priority := tpLower;
fLoopCounterPtr := ALoopCounterPtr;
Resume;
end;
procedure TTestThread.Execute;
begin
while not Terminated do
InterlockedIncrement(PInteger(fLoopCounterPtr)^);
end;
// TForm1
procedure TForm1.FormCreate(Sender: TObject);
var
i: integer;
begin
for i := Low(fThreads) to High(fThreads) do
// fThreads[i] := TTestThread.Create(True, #fLoopCounters[i]);
fThreads[i] := TTestThread.Create(i >= 4, #fLoopCounters[i]);
end;
procedure TForm1.FormDestroy(Sender: TObject);
var
i: integer;
begin
for i := Low(fThreads) to High(fThreads) do begin
if fThreads[i] <> nil then
fThreads[i].Terminate;
end;
for i := Low(fThreads) to High(fThreads) do
fThreads[i].Free;
end;
procedure TForm1.Timer1Timer(Sender: TObject);
begin
Label1.Caption := IntToStr(fLoopCounters[0]);
Label2.Caption := IntToStr(fLoopCounters[1]);
Label3.Caption := IntToStr(fLoopCounters[2]);
Label4.Caption := IntToStr(fLoopCounters[3]);
Label5.Caption := IntToStr(fLoopCounters[4]);
Label6.Caption := IntToStr(fLoopCounters[5]);
end;
This creates 6 threads (on my 4 core machine), either all with lower priority, or 4 with normal and 2 with lower priority. In the first case all 6 threads run, but with wildly different shares of CPU time:
In the second case 4 threads run with roughly equal share of CPU time, but the other two threads get a little share of the CPU as well:
But the share of CPU time is very very small, way below a percent of what the other threads receive.
And to get back to your question: A program using multiple threads with custom priority, coupled via producer-consumer queues, should be a viable solution. In the normal case the database threads will block most of the time, either on the network operations or on the queues. And the Windows scheduler will make sure that even a lower priority thread will not completely starve to death.
I don't know any database access layer that permits this.
The reason is that each thread has its own "thread local storage" (The threadvar keyword in Delphi, other languages have equivalents, it is used in a lot of frameworks).
When you start things on one thread, and continue it on another, then you get these local storages mixed up causing all sorts of havoc.
The best you can do is this:
pass the query and parameters to the thread that will handle this (use the standard Delphi thread synchronization mechanisms for this)
have the actual query thread perform the query
return the results to the main thread (use the standard Delphi thread synchronization mechanisms for this)
The answers to this question explains thread synchronization in more detail.
Edit: (on presumed slowness of starting something in an other thread)
"Right away" is a relative term: it depends in how you do your thread synchronization and can be very very fast (i.e. less than a millisecond).
Creating a new thread might take some time.
The solution is to have a threadpool of worker threads that is big enough to service a reasonable amount of requests in an efficient manner.
That way, if the system is not yet too busy, you will have a worker thread ready to start servicing your request almost immediately.
I have done this (even cross process) in a big audio application that required low latency response, and it works like a charm.
The audio server process runs at high priority waiting for requests. When it is idle, it doesn't consume CPU, but when it receives a request it responds really fast.
The answers to this question on changes with big improvements and this question on cross thread communication provide some interesting tips on how to get this asynchronous behaviour working.
Look for the words AsyncCalls, OmniThread and thread.
--jeroen
I'm putting in a second answer, for your second part of the question: your Scheduling Problem
This makes it easier to distinguish both answers.
First of all, you should read Consequences of the scheduling algorithm: Sleeping doesn't always help which is part of Raymond Chen's blog "The Old New Thing".
Sleeping versus polling is also good reading.
Basically all these make good reading.
If I understand your Scheduling Problem correctly, you have 3 kinds of threads:
Main Thread: makes sure the Fetch Threads always have work to do
Fetch Threads: (database bound) fetch data for the Processing Threads
Processing Threads: (CPU bound) process fetched data
The only way to keep 3 running is to have 2 fetch as much data as they can.
The only way to keep 2 fetching, is to have 1 provide them enough entries to fetch.
You can use queues to communicate data between 1 and 2 and between 2 and 3.
Your problem now is two-fold:
finding the balance between the number of threads in category 2 and 3
making sure that 2 always have work to do
I think you have solved the former.
The latter comes down to making sure the queue between 1 and 2 is never empty.
A few tricks:
You can use Sleep(1) (see the blog article) as a simple way to "force" 2 to run
Never let the treads exit their execute: creating and destroying threads is expensive
choose your synchronization objects (often called IPC objects) carefully (Kudzu has a nice article on them)
--jeroen
You just have to use the standard Thread synchronization mechanism of the Delphi threading.
Check your IDE help for TEvent class and its associated methods.

Solutions for handling millions of timed (scheduled) messages?

I'm evaluating possible solutions for handling a large quantity of queued messages, which must be delivered to workers at a certain date and time. The result of executing them is mostly updates to stored data, and they may or may not be originally triggered by user action.
For example, think of what you'd implement in a hypothetical large-scale StarCraft game server for storing and executing users' actions, like upgrading a building, hatching a soldier, all of which requires to be applied to the game state after several seconds or minutes after the player initiates them.
The problem is I can't seem to find the right term to name this problem area. There are several that looks similar, but different:
cron/task/job scheduler
The content of the queue is not dynamic, it's predefined.
Each task is scheduled.
message queue
The content of the queue is dynamic.
Each task is intended to be delivered immediately.
???
The content of the queue is dynamic.
Each task is scheduled.
If there are message queues that allow conditional delivery of messages, that might be it.
Summary:
What are these kind of technology called?
What are some of the solutions out there?
This just sounds like a trivial priority queue on the surface. The priority in this case is the time of completion, and you check the front of the queue to see when the next event is due. Pretty much every language comes with a priority queue or something that can easily be used as one, so I'm not sure what the actual problem is here.
Is it that you're worried about scalability, when it comes to millions of messages? Obviously 'millions' is a meaningless term - if that's millions per day, it's a trivial problem. If it's millions per second, then you can just scale horizontally, splitting the queue across multiple processes. (And the benefit of such a queue system is that this parallelization is really simple.)
I would bet that when implementing a large scale real-time strategy game server you would hit networking problems long before you start hitting problems with the message queue.
Have you tried looking at push queues by Iron.io? The content of the queue can be anything you like, and you specify a webhook to where the messages will be pushed to. You can also set a delay for each of the messages.
The webhook is static though for each queue and delay isn't always exactly on time (could be up to a minute off). If timing is more important or the ability of providing a different webhook per message is important, try looking at boomerang.io.
They say they are pretty accurate on the timing, you can provide a delay or unix timestamp for the webhook to return and that is per message. Sounds like either of those might work for you.
For StarCraft, I would use the Red Dwarf server.
For a Java EE app, I would use Quartz Scheduler.
It seems to me that a queue-based solution would be best in this case for a number of reasons:
Management. Most queuing solutions provide support for inspecting the content of queues which makes it easier to debug, easier to take action when certain threshold are exceeded, ...
Performance. You can divide workload by having multiple enqueue/dequeue processes (gives you the ability to scale out).
Prioritizing. Most queues support prioritizing of messages (probably not all messages are equally important).
...
Remaining problem is the immediate delivery of messages in the queue. You have two ways to solve this: either delay enqueuing of messages or delay execution of dequeued messages. I would go with the first approach, delayed enqueuing.
A message then has two properties: (content, delay). You provide the message to a component in your system that queues the message at the appropriate time.
I'm not sure what programming language you're using, but the MS .NET 4 framework has support for such a scenario (delayed execution of tasks).