Single events channel with multiple consumer groups (with competing members) - message-queue

I have a scenario where I want to publish actions in a single channel and want multiple groups of consumers compete for actions they are specialized to process.
E.g. 2 types of actions with an attribute indicating their type:
Complex action
Simple action
Then, I need 2 groups of consumers (that I can scale in/out based on the load) to process 2 types of actions.
The solution should be as simple/cheap as possible.
Options I considered so far: Azure Service Bus, RabbitMQ, Kafka.
Azure SB approach I came up with seems a bit complex:
Single topic with subscribers sending filtered actions to 2 separate queues and consumers processing actions.
Is there a simpler/cheaper ootb approach in azure?
Suggestions?

Related

Message bus vs. Service bus vs. Event hub vs Event grid

I'm learning the messaging system and got confused by those terminology.
All the messaging system below provides loose coupling between services with different sets of features.
queue - FIFO, pulling mechanism, 1 consumer each queue but any number of producers?
message bus - pub/sub model, any number of consumers with any number of producers processing messages? Is Azure Service Bus an implementation of message bus?
event bus - pub/sub model, any number of consumers with any number of producers processing events?
Do people use message bus and event bus interchangeably as far as terminology goes?
What are the difference between events and messages? Are those just synonyms in this context?
event hub - pub/sub model, partition, replay, consumers can store events in external storage or close to real-time data analysis. What exactly is event hub?
event grid - it can be used as a downstream service of event hub. What does it exactly do that event hub doesn't do?
Can someone provide some historical context as how each technology evolve to another each tied with some practical use cases?
I've found message bus vs. message queue helpful
Even thou all these services deal with the transfer of data from source to target and might seem similar falling under the umbrella messaging services they do differ in their intent.
High-level definition:
Azure Event Grids – Event-driven publish-subscribe model (think reactive programming)
Azure Event Hubs – Multiple source big data streaming pipeline (think telemetry data)
Azure Service Bus- Traditional enterprise broker messaging system (Similar to Azure Queue but provide many advanced features depending on use case full comparison)
Difference between Event Grids & Event Hubs
Event Grids doesn’t guarantee the order of the events, but Event Hubs use partitions which are ordered sequences, so it can maintain the order of the events in the same partition.
Event Hubs are accepting only endpoints for the ingestion of data and they don’t provide a mechanism for sending data back to publishers. On the other hand, Event Grids sends HTTP requests to notify events that happen in publishers.
Event Grid can trigger an Azure Function. In the case of Event Hubs, the Azure Function needs to pull and process an event.
Event Grids is a distribution system, not a queueing mechanism. If an event is pushed in, it gets pushed out immediately and if it doesn’t get handled, it’s gone forever. Unless we send the undelivered events to a storage account. This process is known as dead-lettering.
In Event Hubs the data can be kept for up to seven days and then replayed. This gives us the ability to resume from a certain point or to restart from an older point in time and reprocess events when we need it.
Difference between Event Hubs & Service Bus
To the external publisher or the receiver Service Bus and Event Hubs can look very similar and this is what makes it difficult to understand the differences between the two and when to use what.
Event Hubs focuses on event streaming where Service Bus is more of a traditional messaging broker.
Service Bus is used as the backbone to connects applications running in the cloud to other applications or services and transfers data between them whereas Event Hubs is more concerned about receiving massive volume of data with high throughout and low latency.
Event Hubs decouples multiple event-producers from event-receivers whereas Service Bus aims to decouple applications.
Service Bus messaging supports a message property ‘Time to Live’ whereas Event Hubs has a default retention period of 7 days.
Service Bus has the concept of message session. It allows relating messages based on their session-id property whereas Event Hubs does not.
Service Bus the messages are pulled out by the receiver & cannot be processed again whereas Event Hubs message can be ingested by multiple receivers.
Service Bus uses the terminology of queues and topics whereas Event Hubs partitions terminology is used.
Use this loose general rule of thumb.
SOMETHING HAS HAPPENED – Event Hubs
DO SOMETHING or GIVE ME SOMETHING – Service Bus
As #Louie Almeda stated you may find this link to the official Azure documentation useful.
I found this comparison from Azure docs extremely helpful. Here's the key distinction between events and messages.
Event vs. message services
There's an important distinction to note
between services that deliver an event and services that deliver a
message.
Event
An event is a lightweight notification of a condition or a state
change. The publisher of the event has no expectation about how the
event is handled. The consumer of the event decides what to do with
the notification. Events can be discrete units or part of a series.
Discrete events report state change and are actionable. To take the
next step, the consumer only needs to know that something happened.
The event data has information about what happened but doesn't have
the data that triggered the event. For example, an event notifies
consumers that a file was created. It may have general information
about the file, but it doesn't have the file itself. Discrete events
are ideal for serverless solutions that need to scale.
Series events
report a condition and are analyzable. The events are time-ordered and
interrelated. The consumer needs the sequenced series of events to
analyze what happened.
Message
A message is raw data produced by a service to be consumed or stored
elsewhere. The message contains the data that triggered the message
pipeline. The publisher of the message has an expectation about how
the consumer handles the message. A contract exists between the two
sides. For example, the publisher sends a message with the raw data,
and expects the consumer to create a file from that data and send a
response when the work is done.
Comparison of those different services were also discussed, so be sure to check it out.
I agree with your remarks about overloaded terms, especially with cloud-service marketing jargon....
Historically, I events and messages had more distinct meanings
- event was term used to refer to communication within the same process whereas
- message referred to communication across different processes.
Regarding the "bus", I can give you some "historical" information, as I learned to be a sound engineer. In a music mixer, you also have a "bus" and "routing" for mixing signals. In the case of a mixer, we are talking about electrical signals, either being in the mix or not!
Regarding the messaging system, think of "bus", "hub" and "grid" to be synonyms! They are all fancy words for the same thing. They are trying to express some kind of transportation system that includes some kind of routing, because you always have producers and consumers - and this can be an N:M relation. Depending on the use case.
A queue is typically a bit different, but its effect can be the same. A queue means something where things are in line, like a queue of people to buy something! (Theatre tickets....)
Nowadays, everything is digital, which in its essence means it can be countable. That's how "messages" came into existance! A music mixer would traditionally mix analogous signals, which are not countable but continuous, so the information would be f.ex. spoken voices or any kind of sound. Today, a "message" means some kind of information package, which is unique and countable. So it is a "thing" you can add to and remove from a queue, or send it to a hub for consumers to consume it.
Don't worry, you'll get used to those terms! I hope I was able to give you an idea.

Electing a new leader in distributed systems

I have the following problem:
I have a distributed system where I need to reach a consensus in one way or another when choosing a leader.
I have a group of players that communicate with each other via messages. In order for these players to progress from a stage to another someone has to keep track of their progress. Currently, there are 2 types of players:
leader---when he receives N-1 done messages (for N-1 players) he is responsible for broadcasting to all other users state change
follower ---he is responsible for getting the messages of the leader and updating his internal state-machine.
Each player receives messages from 2 pipelines:
-Status pipeline - He receives an array of type
[user1,user2,user3...userN] where each element is the user that is online.
-Message pipeline -Push based notification. Follower users will post here messages that they are ready for the next step. The leader will keep track of the DONE counter and when the threshold is reached he will broadcast ADVANCE to next step.
For a better idea i included a picture:
I do not know how to deal with leader reelection. In case the leader disconnects (this can be implemented with a timeout), how can the other nodes decide who is the next leader and if they pick randomly, should the current leader be stored in the database? I mean they only exchange messages there's nothing on the server, like a global variable or something.
What you basically need is to implement both 2 phase commit and a leader election recipe. Now, either you can implement them on your own (2 phase commit is well documented, and yes, you would need a shared storage), or if you have the flexibility to use a distributed open source co-ordination service, zookeeper would be your best bet. Have a look at the below article on apache zookeeper's page where they discuss both the recipes which you need. Hope this helps.
https://zookeeper.apache.org/doc/current/recipes.html#sc_recipes_twoPhasedCommit

Message queuing solution for millions of topics

I'm thinking about system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case but I'm not sure how to handle the fact that I'll have millions of the objects - using separate topic for every of the objects does not sound good [or is it just fine?].
Can you please suggest approach I should should take and maybe even some open source message queuing system that would be reasonable?
Few more details:
there will be thousands of subscribers [meaning not plenty of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million of the objects,
events themselves dont have to carry any message. just information that that object was changed is enough,
vast majority of objects will never be subscribed to,
events occur at the maximum rate of few hundreds per second,
ideally the server should run under linux, be able to integrate with the rest of the ecosystem via http long-poll [using node js? continuations under jetty?].
Thanks in advance for your feedback and sorry for somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and from my experience, I think it is very reliable and offers a wide range of configuraions. Basically, RabbitMQ is an open-source ( Mozilla Public License (MPL) ) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user-interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have been working with, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ but I ended up being most fan of Burrow.NET as it is easier and more obvious to work with ( doesn't do a lot of magic under the hood ) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the amount of amount of objects that you are going to work with - I have worked with thousands ( not millions ). However, no matter how many objects I have been playing around with, RabbitMQ has always worked really stable and has never been the source to errors in the system.
So to sum up - RabbitMQ is simple to use and setup, supports AMQP, can be managed via HTTP and what I like the most - it's rock solid.
Break up the topics to carry specific events for e.g. "Object updated topic" "Object deleted"...So clients need to only have to subscribe to the "finite no:" of event based topics they are interested in.
Inject headers into your messages when you publish them and put intelligence into the clients to use these headers as message selectors. For eg, client knows the list of objects he is interested in - and say you identify the object by an "id" - the id can be the header, and the client will use the "id header" to determine if he is interested in the message.
Depending on whether you want, you may also want to consider ensuring guaranteed delivery to make sure that the client will receive the message even if it goes off-line and comes back later.
The options that I would recommend top of the head are ActiveMQ, RabbitMQ and Redis PUB SUB ( Havent really worked on redis pub-sub, please use your due diligance)
Finally here are some performance benchmarks for RabbitMQ and Redis
Just saw that you only have few 100 messages getting pushed out / sec, this is not a big deal for activemq, I have been using Amq on a system that processes 240 messages per second , and it just works fine. I use a thread pool of workers to asynchronously process the messages though . Look at a framework like akka if you are in the java land, if not stick with nodejs and the cool Eco system around it.
If it has to be open source i'd go for ActiveMQ, and an application server to provide the JMS functionality for topics and it has Ajax Support so you can access them from your client
So, you would use the JMS infrastructure to publish the topics for the objects, and you can create topis as you need them
Besides, by using an java application server you may be able to take advantages from clustering, load balancing and other high availability features (obviously based on the selected product)
Hope that helps!!!
Since your messages are very small might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. Key consideration is the low overhead - basically a 2 byte header for a small message. You probably can't use any simple or open source MQTT server, due to your volume. You probably need a heavy duty dedicated appliance like a MessageSight to handle your volume.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
Though not sure about your work environment but here are my bits. Can you identify each object with unique ID in your system. If so, you can have a topic per each event type. for e.g. you want to track object deletion event, object updation event and so on. So you can have topic for each event type. These topics would be published with Ids of object whenever corresponding event happened to the object. This will limit the no of topics you needed.
Second part of your problem is different subscribers want to subscribe to different objects. So not all subscribers are interested in knowing events of all objects. This problem statement scoped to message selector(filtering) mechanism provided by messaging framework. So basically you need to seek on what basis a subscriber interested in particular object. Have that basis as a message filtering mechanism. It could be anything: object type, object state etc. So ultimately your system would consists of one topic for each event type with someone publishing event messages : {object-type:object-id} information. Subscribers could subscribe to any topic and with an filtering criteria.
If above solution satisfy, you can use any messaging solution: activeMQ, WMQ, RabbitMQ.

RabbitMQ: routing on multiple criteria but hitting only one consumer

How can we distribute work via RabbitMQ such that worker pools can subscribe to work messages based on differing (but frequently overlapping) criteria from each other but such that when a message is routed that matches to multiple worker pools, only one worker will pick up the job?
Simplified example:
We have host1 and host2.
Host1 handles jobs of classA and classB; host2 handles jobs of classB and classC.
If we route a job of
classA, only host1 will pick it up; if we route a job of classB,
either host1 or host2 will pick it up (based on their current load /
first available) but never both.
It would seem that we need to use a topic exchange, as our routing criteria is complex and using wildcards gives us the type of flexible matching we want.
However:
If we use the same name for the worker pool queue (say “worker-jobs”) we get the desired work splitting out to arbitrary matching workers, but every worker subscribing to the named queue seems to infect the other workers with each other’s routing criteria as they bind it. I.e. the binding of the routing key seems to be at the central queue name level not on a connection-to-queue basis.
If we use different queue names for each worker pool connection (say “poolA-jobs” and “poolB-jobs”) to the same exchange then we get the desired behavior with the different routing criteria maintained between pools but a job coming in that can match to both poolA and poolB gets routed to both of them (albeit only to one worker in each).
Notes:
I’ve spared you the details of why but suffice to say we have an existing multi-petabyte distributed search application that needs response times < 50ms. We already achieve this with our own custom routing hub but we’d like to replace this with RabbitMQ as its performance is attractive (as is retiring homemade code that overlaps with general purpose community projects) if we can get the sophisticated routing we need.
We use Python
Disco isn’t viable for many reasons, too numerous to go into.
It doesn’t have to be RabbitMQ but the performance needs to be as good. ØMQ looks very interesting and like it might provide both the flexibility and the performance but we’re already using RabbitMQ and after wading through the first half of the colorfully written ØMQ guide I’m still not sure if it will support the routing we need but it does look like we’ll have to pretty much write a broker to do it.
We actually have the luxury of knowing which hosts are capable of serving which jobs, so we can do something like have host1 subscribe to #.host1.# and host2 to #.host2.#. Then when we route a classB message, we can give it a key of host1.host2 to indicate which backends are acceptable for service. This simplifies the routing rules but still doesn’t overcome the problem described.

Locks and batch fetch messages with RabbitMq

I'm trying to use RabbitMq in a more unconventional way (though at this point i can pick any other message queue implementation if needed). Instead of leaving Rabbit push messages to my consumers, the consumer connects to a queue and fetches a batch of N messages (during which it consumes some and possible rejects some), after which it jumps to another queue and so on. This is done for redundancy. If some consumers crash all messages are guaranteed to be consumed by some other consumer.
The problem is that I have multiple consumers and I don't want them to compete over the same queue. Is there a way to guarantee a lock on a queue? If not, can I at least make sure that if 2 consumers are connected to the same queue they don't read the same message? Transactions might help me to some degree but I've heard talk that they'll get removed from RabbitMQ.
Other architectural suggestions are welcomed too.
Thanks!
EDIT:
As pointed in the comment there's an a particularity in how I need to process the messages. They only make sense taken in groups and there's a high probability that related messages are clumped together in a queue. If for example I pull a batch of 100 messages, there's a high probability that I'll be able to do something with messages 1-3, 4-5,6-10 etc. If I fail to find a group for some messages I'll resubmit them to the queue. WorkQueue wouldn't work because it would spread messages from the same group to multiple workers that wouldn't know what to do with them.
Have you had a look at this free online book on Enterprise Integration Patterns?
It sounds like you really need a workflow where you have a batcher component before the messages get to your workers. With RabbitMQ there are two ways to do that. Either use an exchange type (and message format) that can do the batching for you, or have one queue, and a worker that sorts out batches and places each batch on its own queue. The batcher should probably also send a "batch ready" message to a control queue so that a worker can discover the existence of the new batch queue. Once the batch is processed the worker could delete the batch queue.
If you have control over the message format, you might be able to get RabbitMQ to do the batching implicitly in a couple of ways. With a topic exchange, you could make sure that the routing key on each message is of the format work.batchid.something and then a worker that learns of the existence of batch xxyzz would use a binding key like #.xxyzz.# to only consume those messages. No republishing needed.
The other way is to include a batch id in a header and use the newer headers exchange type. Of course you can also implement your own custom exchange types if you are willing to write a small amount of Erlang code.
I do recommend checking the book though, because it gives a better overview of messaging architecture than the typical worker queue concept that most people start with.
Have your consumers pull from just one queue. They will be guaranteed not to share messages (Rabbit will round-robin the messages among the currently-connected consumers) and it's heavily optimized for that exact usage pattern.
It's ready-to-use, out of the box. In the RabbitMQ docs it's called the Work Queue model. One queue, multiple consumers, with none of them sharing anything. It sounds like what you need.
You can set a channel/consumer level prefetch count to consume messages in batches. In order to re-submit messages, you should use the basic.reject AMQP method and those messages can be chosen to be requeued or forwarded to a dead letter queue. Multiple consumers trying to pull messages from the same queue is not an issue asthe AMQP basic.get method will be synchronized to handle concurrent consumers.
https://groups.google.com/forum/#!topic/rabbitmq-users/hJ8f5du-GCA