EsRejectedExecutionException in Elasticsearch for parallel search

I am issuing multiple parallel search requests to Elasticsearch using a single transport client instance in my application.
I got the exception below during parallel execution. How can I overcome this issue?
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#5f804c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Elasticsearch has a thread pool and a queue for search on each node.
A thread pool has N workers ready to handle requests. When a request comes in and a worker is free, the worker handles it. By default the number of workers equals the number of cores on that CPU.
When all workers are busy and more search requests arrive, the requests go into the queue. The queue's size is also limited (1000 in your case, as the error message shows); if more parallel requests arrive than the queue can hold, those requests are rejected, as you can see in the error log.
Solutions:
The immediate fix would be to increase the size of the search queue. You can also increase the size of the thread pool, but that might badly affect the performance of individual queries, so increasing the queue is the better first step. Remember, though, that this queue is held in memory, and increasing its size too much can result in out-of-memory issues.
Increase the number of nodes and replicas. Remember that each node has its own search thread pool and queue, and that a search can be served by either a primary shard or a replica.
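On Elasticsearch 1.x (the version this stack trace comes from), the search queue is controlled per node in elasticsearch.yml. A minimal sketch; the value 2000 is purely illustrative, not a recommendation:

    # elasticsearch.yml -- set on every data node; requires a node restart
    threadpool.search.queue_size: 2000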

Maybe it sounds strange, but you need to lower the number of parallel searches. With that exception, Elasticsearch is telling you that you are overloading it. There are limits (at the thread-count level) set in Elasticsearch, and most of the time the defaults for these limits are the best option. So if you are testing your cluster to see how much load it can hold, this is an indicator that some limits have been reached.
Alternatively, if you really want to change the defaults, you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue, the more pressure you put on your cluster, which in the end will cause instability.
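If you control the client, one way to lower the number of parallel searches without restructuring the application is to cap in-flight requests with a semaphore around the async call. A minimal sketch against the 1.x TransportClient API; the permit count and the index parameter are made-up illustrations, not tuned values:

    import java.util.concurrent.Semaphore;

    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;

    public class ThrottledSearcher {
        // Illustrative cap on in-flight searches; tune it to stay below the
        // cluster's search thread pool plus queue capacity.
        private static final int SEARCH_PERMITS = 50;

        private final Client client;
        private final Semaphore permits = new Semaphore(SEARCH_PERMITS);

        public ThrottledSearcher(Client client) { this.client = client; }

        public void search(String index) throws InterruptedException {
            permits.acquire(); // blocks callers once the cap is reached
            client.prepareSearch(index).execute(new ActionListener<SearchResponse>() {
                @Override public void onResponse(SearchResponse response) {
                    permits.release();
                    // consume hits here
                }
                @Override public void onFailure(Throwable t) {
                    permits.release();
                    // log and optionally retry with backoff
                }
            });
        }
    }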

I saw this same error because I was sending lots of indexing requests into ES in parallel. Since I was writing a data migration, it was easy enough to make them serial, and that resolved the issue.

I don't know what your node configuration was, but your queue size (1000) is already on the higher side. As others have explained, your search requests are queued in the Elasticsearch thread pool queue. If you are still getting rejections with such a large queue, that is a hint that you need to revisit your query pattern.
Like many other designs, there is no one-size-fits-all solution here. I found a very good post about how this queue works and the different ways to run a performance test to find out what suits your use case best.
HTH!

Related

How to tell whether NServiceBus is using MaximumConcurrencyLevel?

I'm trying to validate that our company's code works when NServiceBus v4.3 is using the MaximumConcurrencyLevel value set up in the config.
The problem is, when I try to process 12k+ queued entries, I cannot tell any difference in timings between the five different maximum concurrency levels I tried. I set it to 1 and can process the queue in 8m, then I set it to 2 and get 9m, which seems interesting (I was expecting more, but it's still going in the right direction), but then I set 3, 4, 5 and the timings stay at around 8m. I was expecting much better throughput.
My question is: how can I verify that NServiceBus is actually using five threads to process entries on the queue?
PS: I've tried setting MaximumConcurrencyLevel="1" and MaximumMessageThroughputPerSecond along with logging Thread.CurrentThread.ManagedThreadId, thinking/hoping I was ONLY going to see one ThreadId value, but I'm seeing quite a few different ones, which surprised me. My plan was to see one, then bump the maximum concurrency level to 5 and hopefully see five different values.
What am I missing? Thank you in advance.
There can be multiple reasons why you don't see faster processing times when increasing the concurrency setting; these are described on the official documentation page: http://docs.particular.net/nservicebus/operations/tuning
You mentioned you're using MaximumMessageThroughputPerSecond, which will negate any performance gains from parallel message processing if a low value has been configured. Try removing this setting if possible.
Maybe you're accessing a resource in your handlers which doesn't support, or isn't optimized for, parallel access.
NServiceBus internally schedules the processing logic on the thread pool. This means that even with a MaximumConcurrencyLevel of 1, you will most likely see a different thread processing each message, since there is no thread affinity. But the configuration values work as expected; if your queue contains 5 messages:
it will process these messages one by one if you configured MaximumConcurrencyLevel to 1
it will process all messages in parallel if you configured MaximumConcurrencyLevel to 5.
Depending on your handlers it can of course happen that the first message is already processed at the time the fifth message is read from the queue.
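Since a thread pool gives no thread affinity, counting distinct thread IDs proves little; counting how many handlers run at the same moment does. The Java sketch below (an analogy, not NServiceBus code) wraps a hypothetical handler with an in-flight counter and reports the peak concurrency actually achieved:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ConcurrencyProbe {
        private static final AtomicInteger inFlight = new AtomicInteger();
        private static final AtomicInteger peak = new AtomicInteger();

        // Wrap the real message handler with this probe.
        static void handle(int messageId) {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // record peak concurrency
            try {
                Thread.sleep(50); // stand-in for real processing work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                inFlight.decrementAndGet();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(5); // the "MaximumConcurrencyLevel"
            for (int i = 0; i < 100; i++) {
                final int id = i;
                pool.submit(() -> handle(id));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("peak concurrent handlers: " + peak.get());
        }
    }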

Handling lots of req / sec in go or nodejs

I'm developing a web app that needs to handle bursts of very high load:
once per minute I get a burst of requests within a few seconds (~1M-3M/sec), and then for the rest of the minute I get nothing.
What's my best strategy to handle as many req/sec as possible at each front server: just send a reply and store the request in memory somehow, to be processed in the background by the DB-writer worker later?
The aim is to do as little as possible during the burst, and to write the requests to the DB as soon as possible after the burst.
Edit: the order of transactions is not important,
we can lose some transactions, but 99% need to be recorded,
and the latency of getting all requests to the DB can be a few seconds after the last request has been received. Let's say no more than 15 seconds.
This question is kind of vague. But I'll take a stab at it.
1) You need limits. A naive implementation will open millions of connections to the DB, which will obviously perform badly. At the very least, each connection eats megabytes of RAM on the DB. Even with connection pooling, each 'thread' could take a lot of RAM to record its (incoming) state.
If your app server had a limited number of processing threads, you can use HAProxy to "pick up the phone" and buffer the request in a queue for a few seconds until there is a free thread on your app server to handle the request.
In fact, you could just use a web server like nginx to take the request and say "200 OK". Then later, a simple app reads the web log and inserts into DB. This will scale pretty well, although you probably want one thread reading the log and several threads inserting.
2) If your language has coroutines, it may be better to handle the buffering yourself. You should measure the overhead of relying on your language runtime for scheduling.
For example, if each HTTP request is 1K of headers + data, you want to parse it and throw away everything but the one or two pieces of data that you actually need (i.e. the DB ID). If you rely on your language's coroutines as an 'implicit' queue, it will hold a 1K buffer for each coroutine while it is being parsed. In some cases, it's more efficient/faster to have a finite number of workers and manage the queue explicitly. When you have a million things to do, small overheads add up quickly, and the language runtime won't always be optimized for your app.
Also, Go will give you far better control over your memory than Node.js. (Structs are much smaller than objects. The 'overhead' of the keys to your struct is a compile-time thing for Go, but a run-time thing for Node.js.)
3) How do you know it's working? You want to be able to know exactly how you are doing. When you rely on the language co-routines, it's not easy to ask "how many threads of execution do I have and what's the oldest one?" If you make an explicit queue, those questions are much easier to ask. (Imagine a handful of workers putting stuff in the queue, and a handful of workers pulling stuff out. There is a little uncertainty around the edges, but the queue in the middle very explicitly captures your backlog. You can easily calculate things like "drain rate" and "max memory usage" which are very important to knowing how overloaded you are.)
My advice: Go with Go. Long term, Go will be a much better choice. The Go runtime is a bit immature right now, but every release is getting better. Node.js is probably slightly ahead in a few areas (maturity, size of community, libraries, etc.)
How about a channel with a buffer size equal to what the DB writer can handle in 15 seconds? When the request comes in, it is sent on the channel. If the channel is full, give some sort of "System Overloaded" error response.
Then the DB writer reads from the channel and writes to the database.
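For comparison across languages, the same bounded-buffer idea in Java: an ArrayBlockingQueue plays the role of the Go channel, a failed offer() is the "System Overloaded" reply, and a single background thread drains batches to the DB. The queue capacity and writeBatchToDb are made-up placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BurstBuffer {
        // Sized to roughly what the DB writer can drain in 15 seconds (assumption).
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(500_000);

        // Called from the HTTP handler. Returns false -> reply "System Overloaded".
        public boolean accept(String request) {
            return queue.offer(request); // never blocks the request thread
        }

        // Single background writer draining the queue in batches.
        public void startWriter() {
            Thread writer = new Thread(() -> {
                List<String> batch = new ArrayList<>();
                while (true) {
                    try {
                        batch.add(queue.take());   // wait for at least one item
                        queue.drainTo(batch, 999); // then grab up to a full batch
                        writeBatchToDb(batch);     // hypothetical bulk insert
                        batch.clear();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        private void writeBatchToDb(List<String> batch) {
            // placeholder for a real bulk INSERT via JDBC
        }
    }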

Performance comparison between ZeroMQ, RabbitMQ and Apache Qpid

I need a high-performance message bus for my application, so I am evaluating the performance of ZeroMQ, RabbitMQ and Apache Qpid. To measure the performance, I am running a test program that publishes, say, 10,000 messages using one of the message queue implementations, and running another process on the same machine to consume these 10,000 messages. Then I record the time difference between the first message published and the last message received.
Following are the settings I used for the comparison.
RabbitMQ: I used a "fanout" type exchange and a queue with default configuration. I used the RabbitMQ C client library.
ZeroMQ: My publisher publishes to tcp://localhost:port1 with a ZMQ_PUSH socket, my broker listens on tcp://localhost:port1 and resends the messages to tcp://localhost:port2, and my consumer listens on tcp://localhost:port2 using a ZMQ_PULL socket. I am using a broker instead of peer-to-peer communication in ZeroMQ to make the performance comparison fair to the other message queue implementations, which use brokers.
Qpid C++ message broker: I used a "fanout" type exchange and a queue with default configuration. I used the Qpid C++ client library.
Following is the performance result:
RabbitMQ: it takes about 1 second to receive 10,000 messages.
ZeroMQ: It takes about 15 milliseconds to receive 10,000 messages.
Qpid: It takes about 4 seconds to receive 10,000 messages.
Questions:
Has anyone run a similar performance comparison between these message queues? I would like to compare my results with yours.
Is there any way I could tune RabbitMQ or Qpid to get better performance?
Note:
The tests were done on a virtual machine with two allocated processors. The results may vary on different hardware; however, I am mainly interested in the relative performance of the MQ products.
RabbitMQ is probably persisting those messages. I think you need to set the message delivery mode (or another option on the messages) so that they are not persisted. Performance will improve 10x then. You should expect at least 100K messages/second through an AMQP broker. In OpenAMQ we got performance up to 300K messages/second.
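For reference, in the RabbitMQ Java client (the question used the C client, so take this as an illustration of the same knob), persistence is controlled by the message's delivery mode: 1 is transient, 2 is persistent. The exchange name here is made up:

    import java.io.IOException;

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;

    public class TransientPublish {
        static void publish(Channel channel, byte[] body) throws IOException {
            AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                    .deliveryMode(1) // 1 = transient (no disk write), 2 = persistent
                    .build();
            channel.basicPublish("bench-exchange", "", props, body); // made-up exchange
        }
    }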
AMQP was designed for speed (e.g. it does not unpack messages in order to route them) but ZeroMQ is simply better designed in key ways. E.g. it removes a hop by connecting nodes without a broker; it does better asynchronous I/O than any of the AMQP client stacks; it does more aggressive message batching. Perhaps 60% of the time spent building ZeroMQ went into performance tuning. It was very hard work. It's not faster by accident.
One thing I'd like to do, but am too busy, is to recreate an AMQP-like broker on top of ZeroMQ. There is a first layer here: http://rfc.zeromq.org/spec:15. The whole stack would work somewhat like RestMS, with transport and semantics separated into two layers. It would provide much the same functionality as AMQP/0.9.1 (and be semantically interoperable) but significantly faster.
Hmm, of course ZeroMQ will be faster; it is designed to be, and it does not carry a lot of the broker-based functionality that the other two provide. The ZeroMQ site has a wonderful comparison of broker vs. brokerless messaging, and the drawbacks and advantages of both.
RabbitMQ Blog:
RabbitMQ and 0MQ are focusing on different aspects of messaging. 0MQ puts much more focus on how the messages are transferred over the wire. RabbitMQ, on the other hand, focuses on how messages are stored, filtered and monitored.
(I also like the RabbitMQ post above, as it talks about using ZeroMQ with RabbitMQ.)
So, what I'm trying to say is that you should decide on the tech that best fits your requirements. If the only requirement is speed, ZeroMQ. But if you need other aspects, such as persistence of messages, filtering, monitoring or failover, then you need to start considering RabbitMQ and Qpid.
I am using a broker instead of peer-to-peer communication in ZeroMQ to make the performance comparison fair to the other message queue implementations, which use brokers.
Not sure why you want to do that -- if the only thing you care about is performance, there is no need to make the playing field level. If you don't care about persistence, filtering, etc. then why pay the price?
I'm also very leery of running benchmarks on VMs -- there are a lot of extra layers that can affect the results in ways that are not obvious. (Unless you're planning to run the real system on VMs, of course, in which case that is a very valid method.)
I've tested C++/Qpid.
I sent 50,000 messages per second between two different machines for a long time with no queuing.
I didn't use a fanout, just a simple exchange (non-persistent messages).
Are you using persistent messages?
Are you parsing the messages?
I suppose not, since 0MQ doesn't have message structs.
If the broker is mainly idle, you probably haven't configured prefetch on the sender and receiver. This is very important when sending many messages.
We have compared RabbitMQ with our SocketPro (http://www.udaparts.com/) persistent message queue at http://www.udaparts.com/document/articles/fastsocketpro.htm, with all source code. Here are the results we obtained for RabbitMQ:
Same machine enqueue and dequeue:
"Hello world" --
Enqueue: 30000 messages per second;
Dequeue: 7000 messages per second.
Text with 1024 bytes --
Enqueue: 11000 messages per second;
Dequeue: 7000 messages per second.
Text with 10 * 1024 bytes --
Enqueue: 4000 messages per second;
Dequeue: 4000 messages per second.
Cross-machine enqueue and dequeue with network bandwidth 100 mbps:
"Hello world" --
Enqueue: 28000 messages per second;
Dequeue: 1900 messages per second.
Text with 1024 bytes --
Enqueue: 8000 messages per second;
Dequeue: 1000 messages per second.
Text with 10 * 1024 bytes --
Enqueue: 800 messages per second;
Dequeue: 700 messages per second.
Try to configure prefetch on the sender and receiver with a value like 100. Prefetching on just the sender is not enough.
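In the RabbitMQ Java client, consumer prefetch is set with basicQos; Qpid's C++ client has its own capacity settings, so treat this as a sketch of the idea rather than a recipe for either benchmark:

    import java.io.IOException;

    import com.rabbitmq.client.Channel;

    public class PrefetchConfig {
        static void configure(Channel channel) throws IOException {
            // Allow up to 100 unacknowledged messages in flight to this consumer;
            // without prefetch, the consumer waits on the broker after every message.
            channel.basicQos(100);
        }
    }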
We've developed an open-source message bus built on top of ZeroMQ - we initially did this to replace Qpid. It's brokerless, so it's not a totally fair comparison, but it provides the same functionality as brokered solutions.
Our headline performance figure is 140K msgs per second between two machines but you can see more detail here: https://github.com/Abc-Arbitrage/Zebus/wiki/Performance

Reliably monitor a serial port (Nortel CS1000)

I have a custom python script that monitors the call logs from a Nortel phone system. This phone system is under extremely high volume throughout the day and it's starting to appear that some records may be getting lost.
Some of you may dislike this, but I'm not interested in sharing the source code or current method in any way. I would rather consider this from a "new project" approach.
I'm looking for insight into the easiest and safest way to reliably monitor heavy data output through a serial port on Linux. I'm not limiting this to any particular set of tools or languages; I want to find out what works best for this one critical job. I'm comfortable enough parsing the data and inserting it into MySQL that we can just assume the data could be dropped to a text file.
Thank you
Well, the way I would approach this is to have two threads (or processes) working.
Thread 1: The read thread
This thread does nothing but read data from the raw serial port and put the data into a local buffer/queue (in-memory is preferred, for speed). It should do nothing else. Depending on the clock speed of the serial connection, this should be pretty easy to do.
Thread 2: The processing thread
This thread just sleeps until there is data in the local buffer to process, then reads and processes it. That's it.
The reason for splitting the work in two is so that if one part is busy (say, the processing thread blocking on MySQL), it won't affect the other. After all, while the serial port is buffered by the OS, that buffer's size is limited.
Then again, any local program is likely to be far faster than the serial port can send data. Serial transfer is quite slow relative to the clock speed of the processor (115.2 kbps is about the limit on standard hardware). So unless you're CPU-bound (such as on an Arduino), I can't see normal conditions affecting it much, and your choice of language really shouldn't be of too much concern (assuming modern hardware). Stick to what you know.
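A minimal Java sketch of this two-thread split. The serial port is abstracted as an InputStream (a library such as jSerialComm or RXTX would supply the real one), and the parsing/MySQL work is left as a placeholder:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SerialMonitor {
        private final BlockingQueue<byte[]> buffer = new LinkedBlockingQueue<>();

        // Thread 1: read raw bytes from the port and enqueue them. Nothing else.
        public Thread readerThread(InputStream serial) {
            return new Thread(() -> {
                byte[] chunk = new byte[1024];
                try {
                    int n;
                    while ((n = serial.read(chunk)) != -1) {
                        byte[] copy = new byte[n];
                        System.arraycopy(chunk, 0, copy, 0, n);
                        buffer.put(copy); // in-memory queue between the threads
                    }
                } catch (IOException e) {
                    // port error: log it and let the reader exit
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "serial-reader");
        }

        // Thread 2: sleep until data is available, then parse and store it.
        public Thread processorThread() {
            return new Thread(() -> {
                try {
                    while (true) {
                        byte[] data = buffer.take(); // blocks until the reader enqueues
                        process(data);               // parse the record, insert into MySQL
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "record-processor");
        }

        private void process(byte[] data) {
            // placeholder: parse the call log record and write it to the database
        }
    }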

Solutions for handling millions of timed (scheduled) messages?

I'm evaluating possible solutions for handling a large quantity of queued messages that must be delivered to workers at a certain date and time. The result of executing them is mostly updates to stored data, and they may or may not be originally triggered by user action.
For example, think of what you'd implement in a hypothetical large-scale StarCraft game server for storing and executing users' actions, like upgrading a building or hatching a soldier, all of which must be applied to the game state seconds or minutes after the player initiates them.
The problem is that I can't seem to find the right term for this problem area. There are several that look similar but are different:
cron/task/job scheduler
The content of the queue is not dynamic, it's predefined.
Each task is scheduled.
message queue
The content of the queue is dynamic.
Each task is intended to be delivered immediately.
???
The content of the queue is dynamic.
Each task is scheduled.
If there are message queues that allow conditional delivery of messages, that might be it.
Summary:
What are these kind of technology called?
What are some of the solutions out there?
This just sounds like a trivial priority queue on the surface. The priority in this case is the time of completion, and you check the front of the queue to see when the next event is due. Pretty much every language comes with a priority queue or something that can easily be used as one, so I'm not sure what the actual problem is here.
Is it that you're worried about scalability, when it comes to millions of messages? Obviously 'millions' is a meaningless term - if that's millions per day, it's a trivial problem. If it's millions per second, then you can just scale horizontally, splitting the queue across multiple processes. (And the benefit of such a queue system is that this parallelization is really simple.)
I would bet that when implementing a large scale real-time strategy game server you would hit networking problems long before you start hitting problems with the message queue.
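For what it's worth, Java ships this exact structure as java.util.concurrent.DelayQueue: a priority queue keyed on remaining delay whose take() blocks until the head element is due. A minimal sketch; TimedMessage and the payloads are made up:

    import java.util.concurrent.DelayQueue;
    import java.util.concurrent.Delayed;
    import java.util.concurrent.TimeUnit;

    public class Scheduler {
        // A message that becomes available once its due time has passed.
        static class TimedMessage implements Delayed {
            final String payload;
            final long dueAtMillis;

            TimedMessage(String payload, long delayMillis) {
                this.payload = payload;
                this.dueAtMillis = System.currentTimeMillis() + delayMillis;
            }

            @Override
            public long getDelay(TimeUnit unit) {
                return unit.convert(dueAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
            }

            @Override
            public int compareTo(Delayed other) {
                return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                    other.getDelay(TimeUnit.MILLISECONDS));
            }
        }

        public static void main(String[] args) throws InterruptedException {
            DelayQueue<TimedMessage> queue = new DelayQueue<>();
            queue.put(new TimedMessage("hatch soldier", 2000));
            queue.put(new TimedMessage("upgrade building", 500));

            // take() blocks until the earliest message is actually due.
            while (!queue.isEmpty()) {
                System.out.println("execute: " + queue.take().payload);
            }
        }
    }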
Have you tried looking at push queues from Iron.io? The content of the queue can be anything you like, and you specify a webhook to which the messages will be pushed. You can also set a delay for each message.
The webhook is static for each queue, though, and the delay isn't always exactly on time (it can be up to a minute off). If timing is more important, or the ability to provide a different webhook per message matters, try looking at boomerang.io.
They say they are pretty accurate on timing; you can provide a delay or a Unix timestamp for the webhook to return, and that is per message. It sounds like either of those might work for you.
For StarCraft, I would use the Red Dwarf server.
For a Java EE app, I would use Quartz Scheduler.
It seems to me that a queue-based solution would be best in this case for a number of reasons:
Management. Most queuing solutions provide support for inspecting the content of queues, which makes it easier to debug and easier to take action when certain thresholds are exceeded, ...
Performance. You can divide the workload by having multiple enqueue/dequeue processes (this gives you the ability to scale out).
Prioritizing. Most queues support prioritizing messages (probably not all messages are equally important).
...
The remaining problem is the immediate delivery of messages in the queue. You have two ways to solve this: either delay the enqueuing of messages or delay the execution of dequeued messages. I would go with the first approach, delayed enqueuing.
A message then has two properties: (content, delay). You provide the message to a component in your system that enqueues the message at the appropriate time.
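A sketch of that delayed-enqueuing component in Java, using a ScheduledExecutorService to hand the message to the real queue only when it becomes due (the BlockingQueue here is a stand-in for whatever queue you actually use):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DelayedEnqueuer {
        private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
        private final BlockingQueue<String> workQueue = new LinkedBlockingQueue<>(); // stand-in for the real MQ

        // The message is (content, delay): it only reaches the queue when due.
        public void submit(String content, long delayMillis) {
            timer.schedule(() -> workQueue.offer(content), delayMillis, TimeUnit.MILLISECONDS);
        }
    }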
I'm not sure what programming language you're using, but the MS .NET 4 framework has support for such a scenario (delayed execution of tasks).