How do we maintain consistent read promise to clients when using a fallback queue? - exception

In my company, we are using Event Sourcing pattern to implement a storage for all changes to the price of a booking. Across the company, different services might try to append events to a booking identified by a booking code.
We use DynamoDB to store the event and it does support consistent read. The thing is in the case when a booking is initially made and the very 1st event is created for a booking code, if we fail to save into DynamoDB for whatever reasons, we put the event into a fallback queue and simply return a success to the client to acknowledge that we already received the event. Client can then move on with their business logic flow and in turn, show a success message to end users. The goal is to not block booking creation at all costs.
The problem is, for a very short period of time, when the event is still in the fallback queue, if clients try to fetch the event using the booking code, they will get back an error although we told them that the write on the 1st event was a success earlier. In a way, we're breaking the consistent read promise here.
I'm trying to find a way where we can improve the design and keep this promise while remaining out of the way of the main booking flow (i.e. not blocking the booking on failure).
I'd be very grateful if someone could throw me an idea to look into.

One solution might be to have the fallback queue be durable (i.e. reading from it doesn't remove elements from the queue) up to some retention period (broadly: the maximum allowable time between initial booking and persisting the creation event to DynamoDB) and instead of being a fallback queue be the actual source of truth for which bookings have been created.
Services can then consume this queue: one of these services is responsible for writing the initial creation event to DynamoDB (which is longer-lived than the queue). If that service is falling behind and approaching the retention limit, that's an operational emergency, but you're buying yourself time. Another of these services maintains an in-memory view based on the queue of created bookings which haven't yet made it to Dynamo.

Related

Replacing items in message queue

Our system requirements say that we need to build a slightly unusual producer-consumer processing system. Imagine we have multiple data streams and we take a snapshot each X seconds and put it into the queue for processing. The streams count is not constant. The more clients we have, the more streams we need to process. At the same time, we don't need to process ALL taken snapshots. If we have too many clients and we are not able to process all items in real-time, we would prefer to skip old snapshots and process only the latest ones.
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key.
Is it possible to implement such behavior by Service Bus queue or something similar? Or maybe it makes sense to look into some other solutions like Redis?
So as I see, the requirements can be met by keeping only one item in a
queue for each stream. If there is a new snapshot, while the previous
is still there, we need to REPLACE it using stream id as a key. Is it
possible to implement such behavior by Service Bus queue or something
similar?
To the best of my knowledge, Azure Service Bus does not support this scenario. Through it's duplicate detection functionality, in fact it supports exact opposite of that. You would need to use some other mechanism (like Redis Cache you mentioned) to accomplish this.

Message bus vs. Service bus vs. Event hub vs Event grid

I'm learning the messaging system and got confused by those terminology.
All the messaging system below provides loose coupling between services with different sets of features.
queue - FIFO, pulling mechanism, 1 consumer each queue but any number of producers?
message bus - pub/sub model, any number of consumers with any number of producers processing messages? Is Azure Service Bus an implementation of message bus?
event bus - pub/sub model, any number of consumers with any number of producers processing events?
Do people use message bus and event bus interchangeably as far as terminology goes?
What are the difference between events and messages? Are those just synonyms in this context?
event hub - pub/sub model, partition, replay, consumers can store events in external storage or close to real-time data analysis. What exactly is event hub?
event grid - it can be used as a downstream service of event hub. What does it exactly do that event hub doesn't do?
Can someone provide some historical context as how each technology evolve to another each tied with some practical use cases?
I've found message bus vs. message queue helpful
Even thou all these services deal with the transfer of data from source to target and might seem similar falling under the umbrella messaging services they do differ in their intent.
High-level definition:
Azure Event Grids – Event-driven publish-subscribe model (think reactive programming)
Azure Event Hubs – Multiple source big data streaming pipeline (think telemetry data)
Azure Service Bus- Traditional enterprise broker messaging system (Similar to Azure Queue but provide many advanced features depending on use case full comparison)
Difference between Event Grids & Event Hubs
Event Grids doesn’t guarantee the order of the events, but Event Hubs use partitions which are ordered sequences, so it can maintain the order of the events in the same partition.
Event Hubs are accepting only endpoints for the ingestion of data and they don’t provide a mechanism for sending data back to publishers. On the other hand, Event Grids sends HTTP requests to notify events that happen in publishers.
Event Grid can trigger an Azure Function. In the case of Event Hubs, the Azure Function needs to pull and process an event.
Event Grids is a distribution system, not a queueing mechanism. If an event is pushed in, it gets pushed out immediately and if it doesn’t get handled, it’s gone forever. Unless we send the undelivered events to a storage account. This process is known as dead-lettering.
In Event Hubs the data can be kept for up to seven days and then replayed. This gives us the ability to resume from a certain point or to restart from an older point in time and reprocess events when we need it.
Difference between Event Hubs & Service Bus
To the external publisher or the receiver Service Bus and Event Hubs can look very similar and this is what makes it difficult to understand the differences between the two and when to use what.
Event Hubs focuses on event streaming where Service Bus is more of a traditional messaging broker.
Service Bus is used as the backbone to connects applications running in the cloud to other applications or services and transfers data between them whereas Event Hubs is more concerned about receiving massive volume of data with high throughout and low latency.
Event Hubs decouples multiple event-producers from event-receivers whereas Service Bus aims to decouple applications.
Service Bus messaging supports a message property ‘Time to Live’ whereas Event Hubs has a default retention period of 7 days.
Service Bus has the concept of message session. It allows relating messages based on their session-id property whereas Event Hubs does not.
Service Bus the messages are pulled out by the receiver & cannot be processed again whereas Event Hubs message can be ingested by multiple receivers.
Service Bus uses the terminology of queues and topics whereas Event Hubs partitions terminology is used.
Use this loose general rule of thumb.
SOMETHING HAS HAPPENED – Event Hubs
DO SOMETHING or GIVE ME SOMETHING – Service Bus
As #Louie Almeda stated you may find this link to the official Azure documentation useful.
I found this comparison from Azure docs extremely helpful. Here's the key distinction between events and messages.
Event vs. message services
There's an important distinction to note
between services that deliver an event and services that deliver a
message.
Event
An event is a lightweight notification of a condition or a state
change. The publisher of the event has no expectation about how the
event is handled. The consumer of the event decides what to do with
the notification. Events can be discrete units or part of a series.
Discrete events report state change and are actionable. To take the
next step, the consumer only needs to know that something happened.
The event data has information about what happened but doesn't have
the data that triggered the event. For example, an event notifies
consumers that a file was created. It may have general information
about the file, but it doesn't have the file itself. Discrete events
are ideal for serverless solutions that need to scale.
Series events
report a condition and are analyzable. The events are time-ordered and
interrelated. The consumer needs the sequenced series of events to
analyze what happened.
Message
A message is raw data produced by a service to be consumed or stored
elsewhere. The message contains the data that triggered the message
pipeline. The publisher of the message has an expectation about how
the consumer handles the message. A contract exists between the two
sides. For example, the publisher sends a message with the raw data,
and expects the consumer to create a file from that data and send a
response when the work is done.
Comparison of those different services were also discussed, so be sure to check it out.
I agree with your remarks about overloaded terms, especially with cloud-service marketing jargon....
Historically, I events and messages had more distinct meanings
- event was term used to refer to communication within the same process whereas
- message referred to communication across different processes.
Regarding the "bus", I can give you some "historical" information, as I learned to be a sound engineer. In a music mixer, you also have a "bus" and "routing" for mixing signals. In the case of a mixer, we are talking about electrical signals, either being in the mix or not!
Regarding the messaging system, think of "bus", "hub" and "grid" to be synonyms! They are all fancy words for the same thing. They are trying to express some kind of transportation system that includes some kind of routing, because you always have producers and consumers - and this can be an N:M relation. Depending on the use case.
A queue is typically a bit different, but its effect can be the same. A queue means something where things are in line, like a queue of people to buy something! (Theatre tickets....)
Nowadays, everything is digital, which in its essence means it can be countable. That's how "messages" came into existance! A music mixer would traditionally mix analogous signals, which are not countable but continuous, so the information would be f.ex. spoken voices or any kind of sound. Today, a "message" means some kind of information package, which is unique and countable. So it is a "thing" you can add to and remove from a queue, or send it to a hub for consumers to consume it.
Don't worry, you'll get used to those terms! I hope I was able to give you an idea.

Dealing with exceptions in an event driven world

I'm trying to understand how exceptions are handled in an event driven world using micro-services (using apache kafka). For example, if you take the following order scenario whereby the following actions need to happen before the order can be completed.
1) Authorise the payment with the payment service provider
2) Reserve the item from stock
3.1) Capture the payment with the payment service provider
3.2) Order the item
4) Send a email notification accepting the order with a receipt
At any stage in this scenario, there could be a failure such as:
The item is no longer in stock
The payment information was incorrect
The account the payee is using doesn't have the funds available
External calls such as those to the payment service provider fail, such as downtime
How do you track that each stage has been called for and/or completed?
How do you deal with issues that arise? How would you notify the frontend of the failure?
Some of the things you describe are not errors or exceptions, but alternative flows that you should consider in your distributed architecture.
For example, that an item is out of stock is a perfectly valid alternative flow in your business process. One that possibly requires human intervention. You could move the message to a separate queue and provide some UI where a human operator can deal with the problem, solve it and cause the flow of events to continue.
A similar thing could be said of the payment problems you describe. If an order cannot successfully be settled, a human operator will need to investigate the case and solve it. For that matter, your design must contemplate that alternative flow as part of it, and make it so a human can intervene somehow when the messages end up in a queue that requires a person to review them.
Those cases should be differentiated from errors or exceptions being thrown by the program. Those cases, depending on the circumstance, might in fact require to move the message to a dead letter queue (DLQ) for an engineer to take a look at them.
This is a very broad topic and entire books could written about this.
I believe you could probably benefit from gaining more understanding of concepts like:
Compensating Transactions Pattern
Try/Cancel/Confirm Pattern
Long Running Transactions
Sagas
The idea behind compensating transactions is that every ying has its yang: if you have one transaction that can place an order, then you could undo that with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can trace back your steps and compensate every successful transaction you did and, as a result, revert their side effects.
I particularly liked a chapter in the book REST from Research to Practice. Its chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) goes deep in explaining the Try/Cancel/Confirm pattern.
In general terms it implies that when you do a group of transactions, their side effects are not effective until a transaction coordinator gets a confirmation that they all were successful. For example, if you make a reservation in Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another one would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that, you want to avoid that the first reservation is effective until you have been able to confirm both. So, initial transaction makes the reservation but keeps its side effects pending to confirm. And the second reservation would do the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties such that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see process manager pattern and compare with routing slip pattern which are similar ideas to orchestration vs choreography in the Microservices world).
As you can see, being able to compensate transactions might be complicated depending on how complex is your distributed workflow. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much that idea of Sagas in the Microservices world.
The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that delves in detail on how to implement this type of solution.
A few other aspects I also typically consider are the following:
Idempotency
I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
Transient vs Persistent Errors
When it comes to retrying a service transaction, you shouldn't just retry because it failed. You must first know why it failed and depending on the error it might make sense to retry or not. Some types of errors are transient, for example, if one transaction fails due to a query timeout, that's probably fine to retry and most likely it will succeed the second time; but if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try it will fail.
Embrace Error as an Alternative Flow
As mentioned at the beginning of my answer, not everything is an error. Some things are just alternative flows.
In those cases of interservice communication (computer-to-computer interactions) , when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can just embrace error as part of you workflow. Catalog the possible causes of error and make them an alternative flow of events that simply requires human intervention. It is just another step in the full orchestration that requires a person to intervene to make a decision, resolve an inconsistency with the data or just approve which way to go.
For example, maybe when you're processing an order, the payment service fails because you don't have enough funds. So, there is no point in undoing everything else. All we need is to put the order in a state that some problem solver can address it in the system and, once fixed, you can continue with the rest of the workflow.
Transaction and Data Model State are Key
I have discovered that this type of transactional workflows require a good design of the different states your model has to go through. As in the case of Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.
For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.
The difficulty here is discovering how to design transaction granularity in way that even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
Designing for Distributed Transactional Workflows
So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in a inconsistent state. And retrying the transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.
That‘s why the whole thing may need to be designed in a different way than you would typically do in a monolithic client–server application. Your users may now be part of the designed solution when it comes to solving conflicts, and contemplate that transactional orchestrations could potentially take hours or even days to complete depending on how their conflicts are resolved.
As I was originally saying, the topic is way too broad and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.
At any rate, I hope this somehow helped you with your investigation.

CQRS / communication between contexts / eventstore / push or pull?

Communications between bounded context in CQRS/ES architecture is achieved through events; context A generates events as response to commands, and these events is then forwarded to context B through event bus (message queue).
Or... you can store the events in eventstore (that belongs to context A).
Or... both (store and forward).
My question is: from context B, should I pull the events from the context store? or simply consume the events pushed through the event bus?
I'm leaning toward the pulling approach. Because then we can do some catching up in context B. In contrast, in the push approach, context B might be unaware of events that were delivered while B is experiencing downtime.
So... does it mean... when we have eventstore, we can simply forget about the message queue (seems redundant)?
Or am I missing something here?
You'll want to review Consume event stream without Pub/Sub
At the DDD Europe conference, I realized that the speakers I talked with where (sic) avoiding Pub/Sub whenever possible.
The discussion that follows may have value. TL;DR: not many fans of pub/sub there.
Konrad Garus on Push or Pull?, describing the Pull design:
In the latter (and simpler) design, they only spread the information that a new event has been saved, along with its sequential ID (so that all projections can estimate how much behind they are). When awakened, the executor can continue along its normal path, starting with querying the event store.
Why? Because handling events coming from a single source is easier, but more importantly because a DB-backed event store trivially guarantees ordering and has no issues with lost or duplicate messages. Querying the database is very fast, given that we’re reading a single table sequentially by primary key, and most of the time the data is in RAM cache anyway. The bottleneck is in the projection thread updating its read model database.
In the large, it comes down to this: when people are thinking about event sourcing, they are really thinking about histories, rather than events in isolation. If what you really want is an ordered sequence of events with no gaps, querying the authority for that sequence is much better than trying to reconstruct if from a bunch of disjoint event messages.
But - once you decide to do that, then suddenly the history, and all of the events that appear within it, becomes part of the api of context A. What happens when team A decides that a different event store implementation is more suitable? Can they just roll out a new version of their own services, or do we need a grand outage because every consumer also has to get updated?
Similarly, what happens if we decide to refactor context A into context C and context D? Again, do we have to screw around in context B to get the data we need?
Maybe the real problem is that context B is coupled to the histories in context A, and those histories should really be private? Should context B be accessing context A's data, or should it instead be delegating that work to context A's capabilities?
Udi Dahan essays on SOA may jump start your thinking in that direction.

Failures in eventual consistent system and user experience [duplicate]

When using distributed and scalable architecture, eventual consistency is often a requirement.
Graphically, how to deal with this eventual consistency?
Users are used to click save, and see the result instantaneously... with eventual consistency it's not possible.
How to deal with the GUI for such scenarios?
Please note the question applies both for desktop applications and web applications.
PS: I'm working with the Microsoft platform, but I imagine the question applies to any technology...
A Task Based UI fits this model great. You create and execute tasks from the UI. You can also have something like a task status monitor to show the user when a task has executed.
Another option is to use some kind of pooling from the client. You send the command, and pool from the client until the command completed and the new data is available. You will have a delay in some cases from when the user presses save to when he will see the new record, but in most cases it should be almost synchronous.
Another (good?) option is to assume/design commands that don't fail. This is not trivial but you can have a cache on the client and add the data from the command to that cache and display it to the user even before the command has been executed. If the command fails for some unexpected situation, well then just design a good "we are sorry" message for misleading the user for a few seconds.
You can also combine the methods above.
Usually eventual consistency is more of a business/domain problem, and you should have your domain experts handle it.
I think that other answers mix together CQRS in general and eventual consistency in particular. Task-based UI is very suitable for CQRS but it does not resolve the issue with eventually consistent read model.
First, I would like to challenge your statement:
Users are used to click save, and see the result instantaneously... with eventual consistency it's not possible.
What do you by this? Why is it not possible to see the result immediately? I think the issue here is your definition of result.
The result of any action is that that action has been performed. There are numerous of ways to show this! It depends on what kind of action do you want to complete. Examples:
Send an email: if user has entered a correct email address, it is almost guaranteed that the action will complete successfully. To prevent unexpected failures one might use durable queues since this kind of actions do not need to be done synchronously. So you just say "email sent". Typically you see this kind of response when you ask to reset your password.
Update some information in a user profile: after you have validated the new data on the client, most probably the command will succeed too since the only thing that could happen is the database error (if you use database). Again, even this can be mitigated by using durable queues. In this case you just show the updated field in the same form. The good practice for SPA is to have a comprehensive data store on the client side, like Redux does. In this case you can safely update the server by sending a command and also updating the client-side store, which will result in UI to shows the latest data. Disclaimer: some answers refer to this technique as "tricking the user", but I disagree with this definition.
If you have commands that are prone to error, you can use techniques that are already described in other answers like Websockets or Server-side events to communicate errors back. This requires quite a lot of additional work. You can also send a command and wait for reply or execute commands synchronously. Some would say "this is not CQRS" but this would be just another dogma to be challenged. Ensuring the command has completed the execution in combination with the previous point (client-side data store) will be a good solution.
I am not sure if there is any 100% bullet proof technique that allows you to always show non-stale data from the read model. I think it goes against the principles of CQRS. Even with real-time events you will only get events that indicate that you write model has been updated. Still, your projections could have failed and reacting on this is a whole other story.
However, I would not concentrate that much on this issue. The fact is that well-tested projections and almost-guaranteed commands will work very well. For error handling in 90% of situations it is enough to have some manual or half-manual process to recover from those errors. For the last 10% you can combine generic "error" messages pushed from the server saying "sorry, your action XXX has failed to execute" and the top priority actions could have some creative process behind them but in reality those situations would be very very rare.
There are 2 ways:
To trick a user (just to show that things has happened then they
really hasn't happened yet)
Show that system is processing request
and use polling in background (not good) or just timer with value of
your SLA.
I prefer the 1st option.
As someone has already mentioned, task based UI's fit well for this, and what I would do is employ a technique that 'buys you time' for the command to propagate.
For example, imagine we are on a list screen, where the user can perform various actions, one of which being to add a new item to the list. After choosing to add an item you could display a "What would you like to do next?" which could have 'Add another item', 'Do this task', 'Do some other task', 'Go back to list'.
By the time they have clicked on an option, the data would have hopefully been refreshed.
Also, if you're using a task based UI, you can analyse the patterns of task execution and use these "what would you like to do next" screens to streamline the UI. Similar to amazon's "other people also bought these items".
As previously stated, it is fine to tell the user that the request (command) has been acknowledged (successfully issued). In case of some failure, the system should communicate this to the requester, by means of:
email;
SMS;
custom inbox (e.g. like the SO inbox);
whatever.
E.g., mail client / service:
I am sending a mail to a wrong address;
the mail service says: "email sent successfully :)";
after few minutes, I receive a mail from the service: "email could not be delivered".
I believe a great way to inform the user about a recent failure is to present him an error panel while he's navigating through the application. A user gesture might be required in order to dismiss that alert etc.
For example:
I wouldn't go with tricking the user or blocking him from committing some other actions. I would rather go for streaming data toward UI after they are being acknowledged by a read side. Let's consider these two cases:
Users saves data and expects result. Connection is established toward server. After they are being acknowledged by a read side, they are streamed toward UI and UI is being updated.
User saves data and refreshes web page. Upon reload, data are being fetched from data store and connection for streaming is established. If read side didn't update the data store in the meantime, there's still an opened stream and UI should be updated after data reaches the read side.
Why streaming from read side and not directly from write side? Simply, that would be a confirmation that read side has been reached.
From technical aspect, Server-Sent Events could be used.
Disadvantage:
Results will still not be reflected immediately by a read side. But at least, in most cases, user will be able to continue with his work without being blocked by a UI.
There are several ways to handle eventual consistency. All of them are really to occupy the time from the User's action until the backend refresh.
User Reads A given user can only read from the same database node that they write to. Other users read from the replicated nodes. PROS: UI is quick enough, and application stays in sync. CONS: Your service architecture has to track and route Users to specific database nodes.
Disable the UI until the action has completed, and refresh it. Java Server Faces has a classic example of this. One could create a modal with a loading spinner to cover the UI until the refresh was completed. PROS: UI stays in sync with application state. CONS: Most every action creates a blocked UI. Users get very frustrated by the restricted UI, and will complain of application slowness.
Confirmation Immediately thank the user for their submission. Then let them know later (email, SMS, in-app notification) whether or not the action was completed. PROS: It's fast up front. CONS: UI lags behind system until refresh. Even with a notice, the User may get confused that they don't see the updates. It also requires integration of various communication channels. Users won't see their changes right away. If the action fails, they may not know until it's too late.
Fake it Optimistically assume that the action will complete. Show the User the resulting UI (upvote, comment, credit card confirmation, etc) and allow them to continue as if it succeeded. If there were failures, immediately show them as contextual errors: alerts next to the undone upvotes, in-app alert on the post with the failed comment, email for the declined credit card. PROS: UI feels much faster. CONS: UI is temporarily out of sync with application state, and you must resolve that. One case: you might fake creation of content with temp IDs. But after content is created, then the temp IDs will be wrong until the refresh. Second case, you might need to store all state changes on the UI after the action until the refresh. Then you need some Resolver to apply all the local state changes since the action was issued. This is resolution is non-trivial.
Web Sockets Subscribe the UI to an event stream so that when the action is completed on the backend, it is pushed to the front end. Is it one-way or two-way streaming? PROS: UI feels fast, and it's in sync with the application state. CONS: Consistent browser support, need a backend source of streaming events, and socket server scalability.