Why can putting a queue in between solve data inconsistency? - message-queue

I am looking into a tool called Cadence, which can be used to lower the complexity of developing distributed systems.
I came across this video, https://youtu.be/llmsBGKOuWI?t=108.
Starting at 1:40, he mentions that when processing a transaction that includes a debit and a credit, a consistency issue will occur if either operation fails, and that this can be solved by putting a queue in between.
The speaker did not mention the reason. Is it because a queue enables message replay, or is there some other reason I missed?
Any answers or opinions are appreciated!

Queues can persist messages for some amount of time, so if one of your servers fails, you can still get the message from the queue and retry. I guess this is what he means.
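For example, here is a minimal sketch of that retry behaviour using JMS with ActiveMQ (the broker URL and queue name are made up): with client acknowledgement, the broker keeps the message until the consumer explicitly acknowledges it, so a consumer that crashes mid-transaction gets the message redelivered instead of losing it.

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class TransactionConsumer {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            // CLIENT_ACKNOWLEDGE: the broker keeps the message until we ack it.
            Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
            MessageConsumer consumer =
                    session.createConsumer(session.createQueue("transactions"));

            TextMessage message = (TextMessage) consumer.receive();
            // ... apply the debit and the credit here ...
            // If the process dies before the next line, the broker redelivers
            // the message, so the transaction is retried rather than lost.
            message.acknowledge();
            connection.close();
        }
    }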

Related

How do multiple developers use the same queue for development?

We use SQS for queueing use cases in our company. All developers connect to the same queue for local development. If we're producing some messages for testing in local development, it can happen that a message is consumed by another person's locally running consumer, if that person has the app running at the same time.
How do you make sure that messages produced by one person don't end up getting lost by being consumed on another person's locally running consumer? Is using different queues for each person the only solution? I am wondering what the standard practice is to avoid this in the industry.
This is very open-ended IMO. Would recommend adding some context as to how you're using SQS.
But from what I could understand:
Yes, I would recommend creating queues per "developer"
OR
Although not elegant, you can maybe add an SQS message attribute (this is metadata separate from the message body) with the developer's username.
Each developer should then only process a message if it's meant for them. Arguably, you could also add a flag in the message body itself, but I am not sure about the constraints on your message format. Message attributes are meant for these situations, where you want to know whether you really need to process a message before even parsing the message body.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#sqs-message-attributes
But you'll have to increase maxReceives to a high number (so that the message does not move to the dead-letter queue, if you have configured one). This is not exhaustive; it just decreases the chances of your messages being removed by someone else's consumer. Because if, say, 10 people read the message and did not delete it because the username was not theirs, and maxReceives is 8, it will still move to the DLQ and cause unnecessary confusion.
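For what it's worth, here is a rough sketch of the message-attribute idea with the AWS SDK for Java v2 (the queue URL and the "developer" attribute name are just placeholders for illustration):

    import java.util.Map;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.*;

    public class DevTaggedQueue {
        public static void main(String[] args) {
            SqsClient sqs = SqsClient.create();
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/shared-dev-queue";

            // Producer: tag each message with the developer's username.
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody("{\"orderId\": 42}")
                    .messageAttributes(Map.of("developer", MessageAttributeValue.builder()
                            .dataType("String").stringValue("alice").build()))
                    .build());

            // Consumer: only process (and delete) messages tagged with our username.
            ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageAttributeNames("developer")
                    .build());
            for (Message message : response.messages()) {
                MessageAttributeValue dev = message.messageAttributes().get("developer");
                if (dev != null && "alice".equals(dev.stringValue())) {
                    // ... process the message body here ...
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(message.receiptHandle())
                            .build());
                }
                // Otherwise do nothing: the message reappears on the queue once
                // its visibility timeout expires (and counts against maxReceives).
            }
        }
    }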

Dealing with exceptions in an event driven world

I'm trying to understand how exceptions are handled in an event-driven world using microservices (with Apache Kafka). For example, take the following order scenario, in which these actions need to happen before the order can be completed:
1) Authorise the payment with the payment service provider
2) Reserve the item from stock
3.1) Capture the payment with the payment service provider
3.2) Order the item
4) Send an email notification accepting the order with a receipt
At any stage in this scenario, there could be a failure such as:
The item is no longer in stock
The payment information was incorrect
The account the payer is using doesn't have the funds available
External calls, such as those to the payment service provider, fail (for example due to downtime)
How do you track that each stage has been called for and/or completed?
How do you deal with issues that arise? How would you notify the frontend of the failure?
Some of the things you describe are not errors or exceptions, but alternative flows that you should consider in your distributed architecture.
For example, that an item is out of stock is a perfectly valid alternative flow in your business process. One that possibly requires human intervention. You could move the message to a separate queue and provide some UI where a human operator can deal with the problem, solve it and cause the flow of events to continue.
A similar thing could be said of the payment problems you describe. If an order cannot be settled successfully, a human operator will need to investigate the case and solve it. As such, your design must contemplate that alternative flow as part of the process, making it possible for a human to intervene when messages end up in a queue that requires a person to review them.
Those cases should be differentiated from errors or exceptions thrown by the program. Those, depending on the circumstance, might in fact require moving the message to a dead letter queue (DLQ) for an engineer to take a look at.
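To illustrate the distinction, a minimal sketch (OrderEvent, OutOfStockException and the queue names are hypothetical stand-ins): the expected business outcome is routed to a queue backing an operator UI, while an unexpected failure goes to the DLQ.

    // All names below are made up for illustration.
    record OrderEvent(String orderId) {}

    class OutOfStockException extends RuntimeException {}

    class StockHandler {
        void handle(OrderEvent event) {
            try {
                reserveStock(event);
            } catch (OutOfStockException e) {
                // Alternative flow, not an error: a human operator resolves it.
                publish("orders.manual-review", event);
            } catch (RuntimeException e) {
                // Genuine error: an engineer investigates via the DLQ.
                publish("orders.dead-letter", event);
            }
        }

        void reserveStock(OrderEvent event) { /* call the stock service */ }

        void publish(String queue, OrderEvent event) { /* send to the named queue */ }
    }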
This is a very broad topic and entire books could be written about it.
I believe you could probably benefit from gaining more understanding of concepts like:
Compensating Transactions Pattern
Try/Cancel/Confirm Pattern
Long Running Transactions
Sagas
The idea behind compensating transactions is that every yin has its yang: if you have one transaction that can place an order, then you can undo it with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of transactions and then one of them fails, you can trace back your steps, compensate every transaction that had succeeded and, as a result, revert their side effects.
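As a minimal sketch of that trace-back loop (SagaStep and runSaga are made-up names, not from any particular framework): each step carries its own compensation, and on failure the completed steps are undone in reverse order.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Each step pairs an action with the transaction that undoes it,
    // e.g. new SagaStep("place order", orders::place, orders::cancel).
    record SagaStep(String name, Runnable action, Runnable compensation) {}

    class SagaRunner {
        static void runSaga(List<SagaStep> steps) {
            Deque<SagaStep> completed = new ArrayDeque<>();
            for (SagaStep step : steps) {
                try {
                    step.action().run();
                    completed.push(step);
                } catch (RuntimeException e) {
                    // Trace back: compensate every successful step, newest first.
                    while (!completed.isEmpty()) {
                        completed.pop().compensation().run();
                    }
                    throw e;
                }
            }
        }
    }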
I particularly liked a chapter in the book REST: From Research to Practice. Its chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) goes into depth explaining the Try/Cancel/Confirm pattern.
In general terms it implies that when you do a group of transactions, their side effects are not effective until a transaction coordinator gets a confirmation that they were all successful. For example, if you make a reservation on Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that: you want to prevent the first reservation from becoming effective until you have been able to confirm both. So the initial transaction makes the reservation but keeps its side effects pending confirmation, and the second reservation does the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties so that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
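A sketch of what that contract could look like (TccParticipant and its method names are my own invention; a real coordinator would also need persistence and the timeout-based auto-cancel described above):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Each party (e.g. an airline) exposes a tentative reserve plus
    // confirm/cancel operations on the resulting reservation id.
    interface TccParticipant {
        String tryReserve();
        void confirm(String reservationId);
        void cancel(String reservationId);
    }

    class TccCoordinator {
        static void book(List<TccParticipant> participants) {
            Map<TccParticipant, String> reservations = new LinkedHashMap<>();
            try {
                for (TccParticipant p : participants) {
                    reservations.put(p, p.tryReserve()); // side effects stay pending
                }
                // Everything reserved: tell all parties to make it effective.
                reservations.forEach(TccParticipant::confirm);
            } catch (RuntimeException e) {
                // One leg failed: cancel the tentative reservations made so far.
                reservations.forEach(TccParticipant::cancel);
                throw e;
            }
        }
    }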
The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see process manager pattern and compare with routing slip pattern which are similar ideas to orchestration vs choreography in the Microservices world).
As you can see, being able to compensate transactions can be complicated depending on how complex your distributed workflow is. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much the idea of Sagas in the Microservices world.
The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that explains in detail how to implement this type of solution.
A few other aspects I also typically consider are the following:
Idempotency
I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
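A bare-bones sketch of an idempotent consumer (in a real system the set of processed ids would be a database table with a unique constraint, not in-memory state):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class IdempotentHandler {
        private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

        void handle(String messageId, Runnable sideEffect) {
            // add() returns false if the id was already present, so a
            // redelivered message is skipped instead of re-applied.
            if (processedIds.add(messageId)) {
                sideEffect.run();
            }
        }
    }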
Transient vs Persistent Errors
When it comes to retrying a service transaction, you shouldn't just retry because it failed. You must first know why it failed and depending on the error it might make sense to retry or not. Some types of errors are transient, for example, if one transaction fails due to a query timeout, that's probably fine to retry and most likely it will succeed the second time; but if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try it will fail.
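In code, that classification might look something like this sketch, where only errors known to be transient are retried (here using java.sql's SQLTransientException as the example; the backoff policy is arbitrary):

    import java.sql.SQLTransientException;
    import java.util.concurrent.Callable;

    class RetryPolicy {
        static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
            for (int attempt = 1; ; attempt++) {
                try {
                    return call.call();
                } catch (SQLTransientException e) {   // e.g. a query timeout
                    if (attempt >= maxAttempts) throw e;
                    Thread.sleep(1000L * attempt);    // simple linear backoff
                }
                // Any other exception (e.g. a constraint violation) propagates
                // immediately: no number of retries can make it succeed.
            }
        }
    }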
Embrace Error as an Alternative Flow
As mentioned at the beginning of my answer, not everything is an error. Some things are just alternative flows.
In those cases of interservice communication (computer-to-computer interactions), when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can just embrace error as part of your workflow. Catalog the possible causes of error and make them an alternative flow of events that simply requires human intervention. It is just another step in the full orchestration that requires a person to intervene to make a decision, resolve a data inconsistency or just approve which way to go.
For example, maybe when you're processing an order, the payment service fails because you don't have enough funds. There is no point in undoing everything else. All you need is to put the order into a state from which a problem solver can address it in the system and, once fixed, continue with the rest of the workflow.
Transaction and Data Model State are Key
I have discovered that this type of transactional workflow requires a good design of the different states your model has to go through. As in the Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.
For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.
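A small sketch of such a lifecycle (status names invented for the example): encoding the legal transitions makes it explicit which states a failed step may leave the order in, and the warehouse UI would simply not query orders still in PENDING.

    // Hypothetical order lifecycle for the example above.
    enum OrderStatus {
        PENDING, PAYMENT_CONFIRMED, SHIPPED, CANCELLED;

        // Only legal transitions are allowed, so a failed step always
        // leaves the order in a state the workflow can resume from.
        boolean canTransitionTo(OrderStatus next) {
            return switch (this) {
                case PENDING -> next == PAYMENT_CONFIRMED || next == CANCELLED;
                case PAYMENT_CONFIRMED -> next == SHIPPED || next == CANCELLED;
                case SHIPPED, CANCELLED -> false;
            };
        }
    }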
The difficulty here is discovering how to design transaction granularity in such a way that, even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
Designing for Distributed Transactional Workflows
So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in an inconsistent state. Retrying a transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.
That's why the whole thing may need to be designed differently than you typically would in a monolithic client-server application. Your users may now be part of the designed solution when it comes to solving conflicts, and you must contemplate that transactional orchestrations could potentially take hours or even days to complete depending on how their conflicts are resolved.
As I was originally saying, the topic is way too broad and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.
At any rate, I hope this somehow helped you with your investigation.

Kafka - How to commit offset after every message using High-Level consumer?

I'm using Kafka's high-level consumer. Because I'm using Kafka as a 'queue of transactions' for my application, I need to make absolutely sure I don't miss or re-read any messages. I have 2 questions regarding this:
How do I commit the offset to zookeeper? I will turn off auto-commit and commit the offset after every message is successfully consumed. I can't seem to find actual code examples of how to do this using the high-level consumer. Can anyone help me with this?
On the other hand, I've heard committing to zookeeper might be slow, so another way may be to locally keep track of the offsets? Is this alternative method advisable? If yes, how would you approach it?
You could first disable auto commit: auto.commit.enable=false
Then commit after processing each message: consumer.commitOffsets(true)
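Fleshed out, that could look roughly like this with the 0.8-era high-level consumer API the question is about (group and topic names are placeholders; note that committing to ZooKeeper per message is slow, as discussed below):

    import java.util.Collections;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class PerMessageCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "transactions-app");
            props.put("auto.commit.enable", "false");   // no timer-based commits

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream = connector
                    .createMessageStreams(Collections.singletonMap("transactions", 1))
                    .get("transactions").get(0);

            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                // ... process record.message() here ...
                connector.commitOffsets(true);   // commit after each message
            }
        }
    }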
There are two relevant settings from http://kafka.apache.org/documentation.html#consumerconfigs.
auto.commit.enable
and
auto.commit.interval.ms
If you want the consumer to commit the offset after each message, that will be difficult with these settings, since commits happen on a timer interval, not after each message. You would have to do some rate prediction of the incoming messages and set the interval accordingly.
In general, it is not recommended to keep this interval too small, because it vastly increases the read/write rate on ZooKeeper, and ZooKeeper gets slowed down because it's strongly consistent across its quorum.
I've solved my problem with the Confluent.Kafka client for C#. First disable auto-commit in the consumer configuration:
    consumerConfig.EnableAutoCommit = false;
Then, after each
    var consumeResult = consumer.Consume(cancelToken.Token);
commit that message's offset explicitly:
    consumer.Commit(consumeResult);
Here consumer is the IConsumer built by the ConsumerBuilder, and consumeResult is the ConsumeResult returned by Consume.

FreePBX Queues Issue

I'm experiencing some issues with FreePBX queues.
The longest-waiting calls don't always seem to have priority: in various cases we've experienced instances where a call was on hold for ten minutes, another call came in, and the new call was sent to the next available agent first.
Anyone have any experience with this?
According to what I've been told by some developers, this seems to be a bug in Asterisk's queue application. The queue app doesn't seem to share the longest wait times across queues, so if an agent is a member of multiple queues, problems like the one we've experienced can occur.
I have come to accept that and moved on to a commercial grade Call Center solution.

Should I move this task to a message queue?

I'm a big fan of using message queue systems (like Apache ActiveMQ) for tasks which are rather slow and do not require instant feedback in the user interface.
The question is: should I use it for other tasks as well, which are pretty fast and do not require instant feedback in the user interface?
Or does it involve another level of complexity without much benefit?
Well, if you think about it, Win32 is more or less built around message queues. That said, just because you have a hammer doesn't make every problem a nail. Frankly, though, it depends. Queues, for example, do not work well with multiple receivers.
If you're already using a MQ system in your app, I would consider moving most non-synchronous tasks there. Any fire-and-forget or event driven task is a good candidate. But then, don't overdo it either, and I certainly wouldn't consider adding the dependency if you have no other use for MQ in the application.
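For instance, a fire-and-forget task like sending an email could be offloaded like this (a rough JMS/ActiveMQ sketch with made-up names): the request handler enqueues the work and returns immediately, and a separate consumer does the slow part.

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class EmailTaskProducer {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("email-tasks"));

            // Fire and forget: enqueue the task and return; a background
            // consumer picks it up and actually sends the email.
            producer.send(session.createTextMessage("{\"to\":\"user@example.com\"}"));
            connection.close();
        }
    }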
Is it necessary for requests to be processed in the background, independent of other components in your application? If the answer is yes, I think it warrants a queue of some kind.
If you're processing them inline in your application, that just means your application has to be running to process them. If you want them to run on their own, then you need something else to process them. I don't think adding the queue is enough extra work to justify avoiding it if your message processing needs to always be running.
The right tool for the job. Is a message queue the right tool for the job you have in mind?