what is Counting Semaphore?

what is Counting Semaphore? - terminology

Hi I did get how the Counting Semaphore works? Please help me in understanding.
As per my understanding if we set count as 3, then process can use 3 threads to access the resource. so, here just 3 threads have access on the resource. When 1 thread leaves the other waiting thread comes in. If my understanding is correct, these 3 thread can corrupt shared data too. Then what is use of it?

Your observations are correct; typically a resource either needs to be restricted to one thread (e.g. it is being written to), or is safe to use with an unlimited number of threads (e.g. it is read-only). Restricting a resource to be used by say 5 threads is rarely useful.
Thus a counting semaphore with count N is most often used to restrict access to a pool of N resources...when the count reaches zero the next thread has to wait to obtain a resource from the pool.
However, I don't commonly find this useful in practice because simply controlling the number of threads accessing a pool of resources isn't sufficient, you need to manage the resources themselves as well. So I typically end up with a blocking queue containing the managed resources that threads can take from. When a thread is done with a resource, it returns that resource (e.g. an object) to the queue so that a waiting thread can take it.
The queue might internally use a semaphore to control access to the internal buffer, but that is usually encapsulated from the user of the queue.
See also
Wikipedia: Semaphore - Important Observations

Related

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.

First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
Well according to the article here, Memory accesses can be coalesced if data locality is present and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.

Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated on global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that for random access the the same behavior is likely.

Unless you are not changing the structure while reading, which I assume you do (if it's a scene you probably render each frame?) then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem. Using atomic operations guarantees that the race conditions don't happen.
You can try, stuffing the 'scene' to the shared memory if you can fit it.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.

EsRejectedExecutionException in elasticsearch for parallel search

I am querying elasticsearch for multiple parallel requests using single transport client instance in my application.
I got the below exception for the parallel execution. How to overcome the issue.
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#5f804c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Elasticsearch has a thread pool and a queue for search per node.
A thread pool will have N number of workers ready to handle the requests. When a request comes and if a worker is free , this is handled by the worker. Now by default the number of workers is equal to the number of cores on that CPU.
When the workers are full and there are more search requests, the request will go to queue. The size of queue is also limited. If by default size is, say, 100 and if there happens more parallel requests than this, then those requests would be rejected as you can see in the error log.
Solutions:
The immediate solution for this would be to increase the size of
the search queue. We can also increase the size of threadpool,
but then that might badly affect the performance of individual
queries. So, increasing the queue might be a good idea. But then
remember that this queue is memory residential and increasing the
queue size too much can result in Out Of Memory issues. (more
info)
Increase number of nodes and replicas - Remember each node has its
own search threadpool/queue. Also, search can happen on primary
shard OR replica.

Maybe it sounds strange, but you need to lower the parallel searches count. With that exception, Elasticsearch tells you that you are overloading it. There are some limits (at thread count level) that are set in Elasticsearch and, most of the times, the defaults for these limits are the best option. So, if you are testing your cluster to see how much load it can hold, this would be an indicator that some limits have been reached.
Alternatively, if you really want to change the default you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue size, the more pressure you put on your cluster that, in the end, will cause instability.

I saw this same error because I was sending lots of indexing requests in to ES in parallel. Since I'm writing a data migration, it was easy enough to make them serial, and that resolved the issue.

I don't know what was your node configuration but your queue size (1000) is already on a higher side. As others have explained already, your search requests are queued in the Elasticsearch thread pool queue. Even after such a high queue size, if you are getting rejections, that gives some hint that you need to revisit your query pattern.
Like many other designs, even in this case, there is no one-size-fits-all solution. I found this is a very good post about how this queue works and different ways to do a performance test to find out what suits best for your use case.
HTH!

Does serializing of simultaneous global memory accesses to one address happen when there are L1 and L2 cache levels?

Based on what I know, when threads of a warp access the same address in global memory, requests get serialized so it's better to use constant memory. Does serializing of simultaneous global memory accesses happen when GPU is equipped with L1 and L2 cache levels (in Fermi and Kepler architecture)? In other words, when threads of a warp access the same global memory address, do 31 threads of a warp benefit from cache existence because 1 thread has already requested that address? What happens when the access is a read and also when access is a write?

Simultaneous global accesses to the same address by threads in the same warp in Fermi and Kepler do not get serialized. The warp read has a broadcast mechanism which satisfies all such reads from a single cacheline read with no performance impact. The performance is the same as if it were a fully coalesced read. This is true regardless of cache specifics, for example it is true even if L1 caching is disabled.
The performance of simultaneous writes is not specified (AFAIK) but behaviorally, simultaneous writes always get serialized, and the order is undefined.
EDIT responding to additional questions below:
Even if all threads in the warp write the same value into the same address, does it get serialized? Isn't there a write broadcast mechanism that recognizes such situation?
There is not a write broadcast mechanism that looks at all the simultaneous writes to see if they are all the same, and then take some action based on that. The correct answer is that the writes happen in unspecified order, and the performance characteristics are undefined. Obviously, if all the values being written are the same, you can be assured that the value that ends up in the location will be that value. But if you're asking whether the write activity is collapsed to a single cycle or requires multiple cycles to complete, that actual behavior is undefined (undocumented) and in fact may vary from one architecture to the next (for example, cc1.x may serialize in such all way that all the writes are performed, whereas cc2.x may "serialize" in such a way that one write "wins" and all the others are discarded, not consuming actual cycles.) Again, the performance is undocumented/unspecified, but the program-observable behavior is defined.
2 With this broadcast mechanism you explained, the only difference between constant memory broadcast access and global memory broadcast access is that the first one may route the access all the way to the global memory but the latter has a dedicated hardware and is faster, right?
__constant__ memory uses the constant cache, which is a dedicated piece of hardware that is available on a per-SM basis, and caches a particular section of global memory in a read-only fashion. This HW cache is physically and logically separate from L1 cache (if it exists and is enabled) and L2 cache. For Fermi and beyond, both mechanisms support broadcast on read, and for constant cache, this is the preferred access pattern, because the constant cache can only service one read access per cycle (i.e. does not support a whole cacheline read by a warp.) Either mechanism may "hit" in the cache (if present) or "miss" and trigger a global read. On the first read of a given location (or cacheline), niether cache will have the requested data, and it will therefore "miss" and trigger a global memory read, to service the access. Thereafter, in either case, subsequent reads will be serviced out of the cache, assuming the relevant data is not evicted in the interim. For early cc1.x devices, the constant memory cache was pretty valuable since those early devices did not have a L1 cache. For Fermi and beyond the principal reason to use the constant cache would be if identifiable data(i.e. read-only) and access patterns (same address per warp) are available, then using the constant cache will prevent those reads from travelling through L1 and possibly evicting other data. In effect you are increasing the cacheable footprint somewhat, over just what the L1 can support alone.

Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

I wrote some lock-free code that works fine with local
reads, under most conditions.
Does local spinning on a memory read necessarily imply I
have to ALWAYS insert a memory barrier before the spinning
read?
(To validate this, I managed to produce a reader/writer
combination which results in a reader never seeing the
written value, under certain very specific
conditions--dedicated CPU, process attached to CPU,
optimizer turned all the way up, no other work done in the
loop--so the arrows do point in that direction, but I'm not
entirely sure about the cost of spinning through a memory
barrier.)
What is the cost of spinning through a memory barrier if
there is nothing to be flushed in the cache's store buffer?
i.e., all the process is doing (in C) is
while ( 1 ) {
__sync_synchronize();
v = value;
if ( v != 0 ) {
... something ...
}
}
Am I correct to assume that it's free and it won't encumber
the memory bus with any traffic?
Another way to put this is to ask: does a memory barrier do
anything more than: flush the store buffer, apply the
invalidations to it, and prevent the compiler from
reordering reads/writes across its location?
Disassembling, __sync_synchronize() appears to translate into:
lock orl
From the Intel manual (similarly nebulous for the neophyte):
Volume 3A: System Programming Guide, Part 1 -- 8.1.2
Bus Locking
Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.
[...]
For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).
My translation: "when you say LOCK, this would be expensive, but we're
only doing it where necessary."
#BlankXavier:
I did test that if the writer does not explicitly push out the write from the store buffer and it is the only process running on that CPU, the reader may never see the effect of the writer (I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, with specific compilation options and dedicated core assignments--my algorithm works fine, it's only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road).
I think by default simple writes are WB writes (Write Back), which means they don't get flushed out immediately, but reads will take their most recent value (I think they call that "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual all these different types of write implementations (UC, WC, WT, WB, WP), Intel vol 3A chap 11-10, still learning about them.
My uncertainty is on the reader's side: I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to insure the optimizer leaves the read once compiled) will check into the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough (i.e. could read an old cache line which still appears valid pending a queued invalidation (that sounds a bit incoherent to me too, but how do invalidation queues work then?)), then an atomic read would be necessary and my question is: in this case, will this have any impact on the bus? (I think probably not.)
I'm still reading my way through the Intel manual and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment, I think this is the best way to really get a feel for how this works.

The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i e one that has ben synchronized between the cores.
The xchg instruction is all that is neccessary to implement an atomic lock. If the lock itself is successful you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible such that other code will be hindered as little as possible from acquiring access to the resource.
Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.
"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.

I may well not properly have understood the question, but...
If you're spinning, one problem is the compiler optimizing your spin away. Volatile solves this.
The memory barrier, if you have one, will be issued by the writer to the spin lock, not the reader. The writer doesn't actually have to use one - doing so ensures the write is pushed out immediately, but it'll go out pretty soon anyway.
The barrier prevents for a thread executing that code re-ordering across it's location, which is its other cost.

Keep in mind that barriers typically are used to order sets of memory accesses, so your code could very likely also need barriers in other places. For example, it wouldn't be uncommon for the barrier requirement to look like this instead:
while ( 1 ) {
v = pShared->value;
__acquire_barrier() ;
if ( v != 0 ) {
foo( pShared->something ) ;
}
}
This barrier would prevent loads and stores in the if block (ie: pShared->something) from executing before the value load is complete. A typical example is that you have some "producer" that used a store of v != 0 to flag that some other memory (pShared->something) is in some other expected state, as in:
pShared->something = 1 ; // was 0
__release_barrier() ;
pShared->value = 1 ; // was 0
In this typical producer consumer scenario, you'll almost always need paired barriers, one for the store that flags that the auxiliary memory is visible (so that the effects of the value store aren't seen before the something store), and one barrier for the consumer (so that the something load isn't started before the value load is complete).
Those barriers are also platform specific. For example, on powerpc (using the xlC compiler), you'd use __isync() and __lwsync() for the consumer and producer respectively. What barriers are required may also depend on the mechanism that you use for the store and load of value. If you've used an atomic intrinsic that results in an intel LOCK (perhaps implicit), then this will introduce an implicit barrier, so you may not need anything. Additionally, you'll likely also need to judicious use of volatile (or preferably use an atomic implementation that does so under the covers) in order to get the compiler to do what you want.

How can I make an SQL query thread start, then do other work before getting results?

I have a program that does a limited form of multithreading. It is written in Delphi, and uses libmysql.dll (the C API) to access a MySQL server. The program must process a long list of records, taking ~0.1s per record. Think of it as one big loop. All database access is done by worker threads which either prefetch the next records or write results, so the main thread doesn't have to wait.
At the top of this loop, we first wait for the prefetch thread, get the results, then have the prefetch thread execute the query for the next record. The idea being that the prefetch thread will send the query immediately, and wait for results while the main thread completes the loop.
It often does work that way. But note there's nothing to ensure that the prefetch thread runs right away. I found that often the query was not sent until the main thread looped around and started waiting for the prefetch.
I sort-of fixed that by calling sleep(0) right after launching the prefetch thread. This way the main thread surrenders the remainder of it's time slice, hoping that the prefetch thread will now run, sending the query. Then that thread will sleep while waiting, which allows the main thread to run again.
Of course, there's plenty more threads running in the OS, but this did actually work to some extent.
What I really want to happen is for the main thread to send the query, and then have the worker thread wait for the results. Using libmysql.dll I call
result := mysql_query(p.SqlCon,pChar(p.query));
in the worker thread. Instead, I'd like to have the main thread call something like
mysql_threadedquery(p.SqlCon,pChar(p.query),thread);
which would hand off the task as soon as the data went out.
Anybody know of anything like that?
This is really a scheduling problem, so I could try is lauching the prefetch thread at a higher priority, then have it reduce its priority after the query is sent. But again, I don't have any mysql call that separates sending the query from receiving the results.
Maybe it's in there and I just don't know about it. Enlighten me, please.
Added Question:
Does anyone think this problem would be solved by running the prefetch thread at a higher priority than the main thread? The idea is that the prefetch would immediately preempt the main thread and send the query. Then it would sleep waiting for the server reply. Meanwhile the main thread would run.
Added: Details of current implementation
This program performs calculations on data contained in a MySQL DB. There are 33M items with more added every second. The program runs continuously, processing new items, and sometimes re-analyzing old items. It gets a list of items to analyze from a table, so at the beginning of a pass (current item) it knows the next item ID it will need.
As each item is independent, this is a perfect target for multiprocessing. The easiest way to do this is to run multiple instances of the program on multiple machines. The program is highly optimized via profiling, rewrites, and algorithm redesign. Still, a single instance utilizes 100% of a CPU core when not data-starved. I run 4-8 copies on two quad-core workstations. But at this rate they must spend time waiting on the MySQL server. (Optimization of the Server/DB schema is another topic.)
I implemented multi-threading in the process solely to avoid blocking on the SQL calls. That's why I called this "limited multi-threading". A worker thread has one task: send a command and wait for results. (OK, two tasks.)
It turns out there are 6 blocking tasks associated with 6 tables. Two of these read data and the other 4 write results. These are similar enough to be defined by a common Task structure. A pointer to this Task is passed to a threadpool manager which assigns a thread to do the work. The main thread can check the task status through the Task structure.
This makes the main thread code very simple. When it needs to perform Task1, it waits for Task1 to be not busy, puts the SQL command in Task1 and hands it off. When Task1 is no longer busy, it contains the results (if any).
The 4 tasks that write results are trivial. The main thread has a Task write records while it goes on to the next item. When done with that item it makes sure the previous write finished before starting another.
The 2 reading threads are less trivial. Nothing would be gained by passing the read to a thread and then waiting for the results. Instead, these tasks prefetch data for the next item. So the main thread, coming to this blocking tasks, checks if the prefetch is done; Waits if necessary for the prefetch to finish, then takes the data from the Task. Finally, it reissues the Task with the NEXT Item ID.
The idea is for the prefetch task to immediately issue the query and wait for the MySQL server. Then the main thread can process the current Item and by the time it starts on the next Item the data it needs is in the prefetch Task.
So the threading, a thread pool, the synchronization, data structures, etc. are all done. And that all works. What I'm left with is a Scheduling Problem.
The Scheduling Problem is this: All the speed gain is in processing the current Item while the server is fetching the next Item. We issue the prefetch task before processing the current item, but how do we guarantee that it starts? The OS scheduler does not know that it's important for the prefetch task to issue the query right away, and then it will do nothing but wait.
The OS scheduler is trying to be "fair" and allow each task to run for an assigned time slice. My worst case is this: The main thread receives its slice and issues a prefetch, then finishes the current item and must wait for the next item. Waiting releases the rest of its time slice, so the scheduler starts the prefetch thread, which issues the query and then waits. Now both threads are waiting. When the server signals the query is done the prefetch thread restarts, and requests the Results (dataset) then sleeps. When the server provides the results the prefetch thread awakes, marks the Task Done and terminates. Finally, the main thread restarts and takes the data from the finished Task.
To avoid this worst-case scheduling I need some way to ensure that the prefetch query is issued before the main thread goes on with the current item. So far I've thought of three ways to do that:
Right after issuing the prefetch task, the main thread calls Sleep(0). This should relinquish the rest of its time slice. I then hope that the scheduler runs the prefetch thread, which will issue the query and then wait. Then the scheduler should restart the main thread (I hope.) As bad as it sounds, this actually works better than nothing.
I could possibly issue the prefetch thread at a higher priority than the main thread. That should cause the scheduler to run it right away, even if it must preempt the main thread. It may also have undesirable effects. It seems unnatural for a background worker thread to get a higher priority.
I could possibly issue the query asynchronously. That is, separate sending the query from receiving the results. That way I could have the main thread send the prefetch using mysql_send_query (non blocking) and go on with the current item. Then when it needed the next item it would call mysql_read_query, which would block until the data is available.
Note that solution 3 does not even use a worker thread. This looks like the best answer, but requires a rewrite of some low-level code. I'm currently looking for examples of such asynchronous client-server access.
I'd also like any experienced opinions on these approaches. Have I missed anything, or am I doing anything wrong? Please note that this is all working code. I'm not asking how to do it, but how to do it better/faster.

Still, a single instance utilizes 100% of a CPU core when not data-starved. I run 4-8 copies on two quad-core workstations.
I have a conceptual problem here. In your situation I would either create a multi-process solution, with each process doing everything in its single thread, or I would create a multi-threaded solution that is limited to a single instance on any particular machine. Once you decide to work with multiple threads and accept the added complexity and probability of hard-to-fix bugs, then you should make maximum use of them. Using a single process with multiple threads allows you to employ varying numbers of threads for reading from and writing to the database and to process your data. The number of threads may even change during the runtime of your program, and the ratio of database and processing threads may too. This kind of dynamic partitioning of the work is only possible if you can control all threads from a single point in the program, which isn't possible with multiple processes.
I implemented multi-threading in the process solely to avoid blocking on the SQL calls.
With multiple processes there wouldn't be a real need to do so. If your processes are I/O-bound some of the time they don't consume CPU resources, so you probably simply need to run more of them than your machine has cores. But then you have the problem to know how many processes to spawn, and that may again change over time if the machine does other work too. A threaded solution in a single process can be made adaptable to a changing environment in a relatively simple way.
So the threading, a thread pool, the synchronization, data structures, etc. are all done. And that all works. What I'm left with is a Scheduling Problem.
Which you should leave to the OS. Simply have a single process with the necessary pooled threads. Something like the following:
A number of threads reads records from the database and adds them to a producer-consumer queue with an upper bound, which is somewhere between N and 2*N where N is the number of processor cores in the system. These threads will block on the full queue, and they can have increased priority, so that they will be scheduled to run as soon as the queue has more room and they become unblocked. Since they will be blocked on I/O most of the time their higher priority shouldn't be a problem.
I don't know what that number of threads is, you would need to measure.
A number of processing threads, probably one per processor core in the system. They will take work items from the queue mentioned in the previous point, on block on that queue if it's empty. Processed work items should go to another queue.
A number of threads that take processed work items from the second queue and write data back to the database. There should probably an upper bound for the second queue as well, to make it so that a failure to write processed data back to the database will not cause processed data to pile up and fill all your process memory space.
The number of threads needs to be determined, but all scheduling will be performed by the OS scheduler. The key is to have enough threads to utilise all CPU cores, and the necessary number of auxiliary threads to keep them busy and deal with their outputs. If these threads come from pools you are free to adjust their numbers at runtime too.
The Omni Thread Library has a solution for tasks, task pools, producer consumer queues and everything else you would need to implement this. Otherwise you can write your own queues using mutexes.
The Scheduling Problem is this: All the speed gain is in processing the current Item while the server is fetching the next Item. We issue the prefetch task before processing the current item, but how do we guarantee that it starts?
By giving it a higher priority.
The OS scheduler does not know that it's important for the prefetch task to issue the query right away
It will know if the thread has a higher priority.
The OS scheduler is trying to be "fair" and allow each task to run for an assigned time slice.
Only for threads of the same priority. No lower priority thread will get any slice of CPU while a higher priority thread in the same process is runnable.
[Edit: That's not completely true, more information at the end. However, it is close enough to the truth to ensure that your higher priority network threads send and receive data as soon as possible.]
Right after issuing the prefetch task, the main thread calls Sleep(0).
Calling Sleep() is a bad way to force threads to execute in a certain order. Set the thread priority according to the priority of the work they perform, and use OS primitives to block higher priority threads if they should not run.
I could possibly issue the prefetch thread at a higher priority than the main thread. That should cause the scheduler to run it right away, even if it must preempt the main thread. It may also have undesirable effects. It seems unnatural for a background worker thread to get a higher priority.
There is nothing unnatural about this. It is the intended way to use threads. You only must make sure that higher priority threads block sooner or later, and any thread that goes to the OS for I/O (file or network) does block. In the scheme I sketched above the high priority threads will also block on the queues.
I could possibly issue the query asynchronously.
I wouldn't go there. This technique may be necessary when you write a server for many simultaneous connections and a thread per connection is prohibitively expensive, but otherwise blocking network access in a threaded solution should work fine.
Edit:
Thanks to Jeroen Pluimers for the poke to look closer into this. As the information in the links he gave in his comment shows my statement
No lower priority thread will get any slice of CPU while a higher priority thread in the same process is runnable.
is not true. Lower priority threads that haven't been running for a long time get a random priority boost and will indeed sooner or later get a share of CPU, even though higher priority threads are runnable. For more information about this see in particular "Priority Inversion and Windows NT Scheduler".
To test this out I created a simple demo with Delphi:
type
TForm1 = class(TForm)
Label1: TLabel;
Label2: TLabel;
Label3: TLabel;
Label4: TLabel;
Label5: TLabel;
Label6: TLabel;
Timer1: TTimer;
procedure FormCreate(Sender: TObject);
procedure FormDestroy(Sender: TObject);
procedure Timer1Timer(Sender: TObject);
private
fLoopCounters: array[0..5] of LongWord;
fThreads: array[0..5] of TThread;
end;
var
Form1: TForm1;
implementation
{$R *.DFM}
// TTestThread
type
TTestThread = class(TThread)
private
fLoopCounterPtr: PLongWord;
protected
procedure Execute; override;
public
constructor Create(ALowerPriority: boolean; ALoopCounterPtr: PLongWord);
end;
constructor TTestThread.Create(ALowerPriority: boolean;
ALoopCounterPtr: PLongWord);
begin
inherited Create(True);
if ALowerPriority then
Priority := tpLower;
fLoopCounterPtr := ALoopCounterPtr;
Resume;
end;
procedure TTestThread.Execute;
begin
while not Terminated do
InterlockedIncrement(PInteger(fLoopCounterPtr)^);
end;
// TForm1
procedure TForm1.FormCreate(Sender: TObject);
var
i: integer;
begin
for i := Low(fThreads) to High(fThreads) do
// fThreads[i] := TTestThread.Create(True, #fLoopCounters[i]);
fThreads[i] := TTestThread.Create(i >= 4, #fLoopCounters[i]);
end;
procedure TForm1.FormDestroy(Sender: TObject);
var
i: integer;
begin
for i := Low(fThreads) to High(fThreads) do begin
if fThreads[i] <> nil then
fThreads[i].Terminate;
end;
for i := Low(fThreads) to High(fThreads) do
fThreads[i].Free;
end;
procedure TForm1.Timer1Timer(Sender: TObject);
begin
Label1.Caption := IntToStr(fLoopCounters[0]);
Label2.Caption := IntToStr(fLoopCounters[1]);
Label3.Caption := IntToStr(fLoopCounters[2]);
Label4.Caption := IntToStr(fLoopCounters[3]);
Label5.Caption := IntToStr(fLoopCounters[4]);
Label6.Caption := IntToStr(fLoopCounters[5]);
end;
This creates 6 threads (on my 4 core machine), either all with lower priority, or 4 with normal and 2 with lower priority. In the first case all 6 threads run, but with wildly different shares of CPU time:
In the second case 4 threads run with roughly equal share of CPU time, but the other two threads get a little share of the CPU as well:
But the share of CPU time is very very small, way below a percent of what the other threads receive.
And to get back to your question: A program using multiple threads with custom priority, coupled via producer-consumer queues, should be a viable solution. In the normal case the database threads will block most of the time, either on the network operations or on the queues. And the Windows scheduler will make sure that even a lower priority thread will not completely starve to death.

I don't know any database access layer that permits this.
The reason is that each thread has its own "thread local storage" (The threadvar keyword in Delphi, other languages have equivalents, it is used in a lot of frameworks).
When you start things on one thread, and continue it on another, then you get these local storages mixed up causing all sorts of havoc.
The best you can do is this:
pass the query and parameters to the thread that will handle this (use the standard Delphi thread synchronization mechanisms for this)
have the actual query thread perform the query
return the results to the main thread (use the standard Delphi thread synchronization mechanisms for this)
The answers to this question explains thread synchronization in more detail.
Edit: (on presumed slowness of starting something in an other thread)
"Right away" is a relative term: it depends in how you do your thread synchronization and can be very very fast (i.e. less than a millisecond).
Creating a new thread might take some time.
The solution is to have a threadpool of worker threads that is big enough to service a reasonable amount of requests in an efficient manner.
That way, if the system is not yet too busy, you will have a worker thread ready to start servicing your request almost immediately.
I have done this (even cross process) in a big audio application that required low latency response, and it works like a charm.
The audio server process runs at high priority waiting for requests. When it is idle, it doesn't consume CPU, but when it receives a request it responds really fast.
The answers to this question on changes with big improvements and this question on cross thread communication provide some interesting tips on how to get this asynchronous behaviour working.
Look for the words AsyncCalls, OmniThread and thread.
--jeroen

I'm putting in a second answer, for your second part of the question: your Scheduling Problem
This makes it easier to distinguish both answers.
First of all, you should read Consequences of the scheduling algorithm: Sleeping doesn't always help which is part of Raymond Chen's blog "The Old New Thing".
Sleeping versus polling is also good reading.
Basically all these make good reading.
If I understand your Scheduling Problem correctly, you have 3 kinds of threads:
Main Thread: makes sure the Fetch Threads always have work to do
Fetch Threads: (database bound) fetch data for the Processing Threads
Processing Threads: (CPU bound) process fetched data
The only way to keep 3 running is to have 2 fetch as much data as they can.
The only way to keep 2 fetching, is to have 1 provide them enough entries to fetch.
You can use queues to communicate data between 1 and 2 and between 2 and 3.
Your problem now is two-fold:
finding the balance between the number of threads in category 2 and 3
making sure that 2 always have work to do
I think you have solved the former.
The latter comes down to making sure the queue between 1 and 2 is never empty.
A few tricks:
You can use Sleep(1) (see the blog article) as a simple way to "force" 2 to run
Never let the treads exit their execute: creating and destroying threads is expensive
choose your synchronization objects (often called IPC objects) carefully (Kudzu has a nice article on them)
--jeroen

You just have to use the standard Thread synchronization mechanism of the Delphi threading.
Check your IDE help for TEvent class and its associated methods.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008