Kafka - How to commit offset after every message using High-Level consumer? - message-queue

I'm using Kafka's high-level consumer. Because I'm using Kafka as a 'queue of transactions' for my application, I need to make absolutely sure I don't miss or re-read any messages. I have 2 questions regarding this:
How do I commit the offset to ZooKeeper? I will turn off auto-commit and commit the offset after every message is successfully consumed. I can't seem to find actual code examples of how to do this with the high-level consumer. Can anyone help me with this?
On the other hand, I've heard committing to ZooKeeper might be slow, so another option may be to keep track of the offsets locally. Is this alternative advisable? If so, how would you approach it?

You could first disable auto commit: auto.commit.enable=false
Then commit after each message has been processed: consumer.commitOffsets(true)
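A minimal Java sketch of that approach with the old high-level consumer (ConsumerConnector, Kafka 0.8.x); the topic name, group id and ZooKeeper address are assumptions of mine:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class PerMessageCommitExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // assumed ZooKeeper address
            props.put("group.id", "transaction-consumer");    // assumed group id
            props.put("auto.commit.enable", "false");         // turn off auto commit

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // One stream for the (assumed) "transactions" topic
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(Collections.singletonMap("transactions", 1));
            KafkaStream<byte[], byte[]> stream = streams.get("transactions").get(0);

            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                process(record.message());     // your processing logic goes here
                connector.commitOffsets(true); // commit to ZooKeeper after each message
            }
        }

        private static void process(byte[] message) {
            // placeholder for real work
        }
    }

Keep in mind that commitOffsets commits the consumer's current position for all partitions it owns, so committing per message means a ZooKeeper write for every message.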

There are two relevant settings from http://kafka.apache.org/documentation.html#consumerconfigs.
auto.commit.enable
and
auto.commit.interval.ms
If you want the consumer to commit the offset after each message, that is difficult with these settings alone, since the only control is a timer interval, not a per-message commit. You would have to predict the rate of incoming messages and set the interval accordingly.
In general, it is not recommended to make this interval too small, because it vastly increases the read/write rate on ZooKeeper, and ZooKeeper slows down because it is strongly consistent across its quorum.

I've solved my problem by setting:
consumerConfig.EnableAutoCommit = false;
and then, after each
var consumeResult = consumer.Consume(cancelToken.Token);
committing explicitly with
consumer.Commit(consumeResult);
I'm using the Confluent.Kafka package for my C# client.

Related

How do multiple developers use the same queue for development?

We use SQS for queueing use cases in our company. All developers connect to the same queue for local development. If we're producing some messages for testing in local development, it can happen that a message is consumed by another person's locally running consumer, if that person has the app running at the same time.
How do you make sure that messages produced by one person don't end up getting lost by being consumed on another person's locally running consumer? Is using a different queue for each person the only solution? I'm wondering what the standard practice in the industry is to avoid this.
This is very open-ended IMO. Would recommend adding some context as to how you're using SQS.
But from what I could understand:
Yes, I would recommend creating queues per "developer"
OR
Although not elegant, you can add an SQS message attribute (metadata separate from the message body) with the developer's username.
Each developer should then only process a message if it's meant for them. Arguably, you could also add a flag in the message body itself, but I'm not sure about the constraints on your message format. Message attributes are meant for exactly these situations, where you want to know whether you really need to process a message before even parsing the message body.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#sqs-message-attributes
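A rough Java sketch of that idea using the AWS SDK for Java v1; the queue URL and the attribute name "developer" are assumptions of mine:

    import java.util.HashMap;
    import java.util.Map;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.MessageAttributeValue;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class DeveloperTaggedSqs {
        // Assumed queue URL; replace with your shared development queue
        private static final String QUEUE_URL =
                "https://sqs.us-east-1.amazonaws.com/123456789012/dev-queue";
        private static final String ME = System.getProperty("user.name");

        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

            // Producer: tag the message with the developer's username
            Map<String, MessageAttributeValue> attrs = new HashMap<>();
            attrs.put("developer", new MessageAttributeValue()
                    .withDataType("String")
                    .withStringValue(ME));
            sqs.sendMessage(new SendMessageRequest()
                    .withQueueUrl(QUEUE_URL)
                    .withMessageBody("{\"hello\":\"world\"}")
                    .withMessageAttributes(attrs));

            // Consumer: only process (and delete) messages tagged with your own username
            ReceiveMessageRequest receive = new ReceiveMessageRequest(QUEUE_URL)
                    .withMessageAttributeNames("developer");
            for (Message m : sqs.receiveMessage(receive).getMessages()) {
                MessageAttributeValue dev = m.getMessageAttributes().get("developer");
                if (dev != null && ME.equals(dev.getStringValue())) {
                    // process the body, then delete the message
                    sqs.deleteMessage(QUEUE_URL, m.getReceiptHandle());
                }
                // otherwise leave it alone; it becomes visible again after the visibility timeout
            }
        }
    }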
But you'll have to increase maxReceiveCount to a high number (so that the message does not move to the dead-letter queue, if you have one configured). This is not foolproof; it only decreases the chances of your messages being deleted by someone else. If, say, 10 people receive the message and don't delete it because the username isn't theirs, and maxReceiveCount is 8, it will still move to the DLQ and cause unnecessary confusion.

Replacing items in message queue

Our system requirements say that we need to build a slightly unusual producer-consumer processing system. Imagine we have multiple data streams and we take a snapshot of each every X seconds and put it into the queue for processing. The number of streams is not constant: the more clients we have, the more streams we need to process. At the same time, we don't need to process ALL taken snapshots. If we have too many clients and we are not able to process all items in real time, we would prefer to skip old snapshots and process only the latest ones.
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key.
Is it possible to implement such behavior by Service Bus queue or something similar? Or maybe it makes sense to look into some other solutions like Redis?
So as I see, the requirements can be met by keeping only one item in a queue for each stream. If there is a new snapshot, while the previous is still there, we need to REPLACE it using stream id as a key. Is it possible to implement such behavior by Service Bus queue or something similar?
To the best of my knowledge, Azure Service Bus does not support this scenario. In fact, through its duplicate-detection functionality, it supports the exact opposite of that. You would need to use some other mechanism (like the Redis cache you mentioned) to accomplish this.
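If you do go the Redis route, a minimal Java sketch (using the Jedis client, with a hash name and stream id I made up) of keeping at most one snapshot per stream:

    import redis.clients.jedis.Jedis;

    public class LatestSnapshotStore {
        private static final String KEY = "latest-snapshots"; // assumed hash name

        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Producer side: HSET overwrites any previous value for the same stream id,
                // so the hash always holds at most one (the latest) snapshot per stream.
                jedis.hset(KEY, "stream-42", "{\"takenAt\":1700000000,\"data\":\"...\"}");
                jedis.hset(KEY, "stream-42", "{\"takenAt\":1700000005,\"data\":\"...\"}"); // replaces the previous snapshot

                // Consumer side: read the latest snapshot and remove it.
                // Note: HGET followed by HDEL is not atomic; with several competing
                // consumers you would wrap this in a Lua script or MULTI/EXEC transaction.
                String snapshot = jedis.hget(KEY, "stream-42");
                if (snapshot != null) {
                    jedis.hdel(KEY, "stream-42");
                    // process(snapshot);
                }
            }
        }
    }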

Why does putting in a queue solve data inconsistency?

I am looking into a tool called Cadence, which can be used to lower the complexity of developing distributed systems.
I came across this video, https://youtu.be/llmsBGKOuWI?t=108.
Starting at 1:40, he mentions that when sending a transaction that includes a debit and a credit, if either operation fails, a consistency issue will occur, and we can solve it by putting in a queue.
The speaker did not mention the reason for this. Is it because a queue enables message replay, or is there some other reason that I missed?
Any answers or opinions are appreciated!
Queues can persist messages for some amount of time, so if any of your servers fails, you can still get the message from the queue and retry; I guess this is what he means.
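In other words, the consumer only acknowledges (removes) the message after both the debit and the credit have succeeded, so a crash in the middle means the message is redelivered and retried rather than half-applied and forgotten. A rough Java sketch of that acknowledge-after-success pattern with the RabbitMQ client; the queue name and the debit/credit handlers are assumptions of mine, and in practice they would need to be idempotent so retries are safe:

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.DeliverCallback;

    public class TransferConsumer {
        public static void main(String[] args) throws Exception {
            Connection conn = new ConnectionFactory().newConnection();
            Channel channel = conn.createChannel();
            channel.queueDeclare("transfers", true, false, false, null); // assumed queue name

            DeliverCallback onMessage = (tag, delivery) -> {
                try {
                    String transfer = new String(delivery.getBody(), "UTF-8");
                    debit(transfer);  // hypothetical handler
                    credit(transfer); // hypothetical handler
                    // Acknowledge only after BOTH operations succeeded.
                    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
                } catch (Exception e) {
                    // Not acknowledged: the broker requeues the message, so the
                    // transfer is retried instead of being silently lost.
                    channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
                }
            };
            channel.basicConsume("transfers", false /* manual ack */, onMessage, consumerTag -> { });
        }

        private static void debit(String transfer)  { /* ... */ }
        private static void credit(String transfer) { /* ... */ }
    }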

Should I move this task to a message queue?

I'm a big fan of using message queue systems (like Apache ActiveMQ) for tasks which are rather slow and do not require instant feedback in the user interface.
The question is: should I also use it for other tasks which are pretty fast but likewise do not require instant feedback in the user interface?
Or does that involve another level of complexity without much benefit?
Well, if you think about it, Win32 is more or less built around message queues. That said, just because you have a hammer doesn't make every problem a nail. Frankly though, it depends. Queues, for example, do not work well when the same message needs to reach multiple receivers.
If you're already using an MQ system in your app, I would consider moving most non-synchronous tasks there. Any fire-and-forget or event-driven task is a good candidate. But then, don't overdo it either, and I certainly wouldn't add the dependency if you have no other use for MQ in the application.
Is it necessary for requests to be processed in the background, independent of other components in your application? If the answer is yes, I think it warrants a queue of some kind.
If you're processing them inline in your application, that just means your application has to be running to process them. If you want them to run on their own, then you need something else to process them. I don't think adding a queue is enough extra work to justify avoiding it if your message processing needs to always be running.
The right tool for the job. Is a message queue the right tool for the job you have in mind?

Simple scalable work/message queue with delay

I need to set up a job/message queue with the option to set a delay for a task so that it's not picked up immediately by a free worker, but only after a certain time (which can vary from task to task). I looked into a couple of Linux queue solutions (RabbitMQ, Gearman, MemcacheQ), but none of them seem to offer this feature out of the box.
Any ideas on how I could achieve this?
Thanks!
I've used Beanstalkd to great effect, using the delay option when inserting a new job to wait several seconds until the item becomes available to be reserved.
If you are doing longer-term delays (more than, say, 30 seconds), or the jobs are somewhat important to perform (albeit later), it also has a binary logging system so that a daemon crash would still leave a record of the job. That said, I've put hundreds of thousands of live jobs through Beanstalkd instances, and the workers that I wrote were always more problematic than the server.
You could use an AMQP broker (such as RabbitMQ) with an "agent" (e.g. a Python process built using python-amqplib) that sits on an exchange and intercepts specific messages (by routing_key); once a timer has elapsed, it sends the message back on the exchange with a different routing_key.
I realize this means "translating/mapping" routing keys, but it works. Working with RabbitMQ and python-amqplib is very straightforward.
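If you end up on RabbitMQ, another way to get the delay without writing your own agent is the per-queue TTL plus dead-letter-exchange trick: publish into a "wait" queue whose expired messages are dead-lettered into the real work queue. A rough Java sketch; the queue names and the 30-second delay are my own choices:

    import java.util.HashMap;
    import java.util.Map;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class DelayedJobPublisher {
        public static void main(String[] args) throws Exception {
            try (Connection conn = new ConnectionFactory().newConnection();
                 Channel channel = conn.createChannel()) {

                // The queue workers actually consume from.
                channel.queueDeclare("jobs", true, false, false, null);

                // A "wait" queue: messages sit here until their TTL expires, then get
                // dead-lettered to the default exchange and routed to "jobs".
                Map<String, Object> waitArgs = new HashMap<>();
                waitArgs.put("x-message-ttl", 30_000);          // 30-second delay (assumed)
                waitArgs.put("x-dead-letter-exchange", "");     // default exchange
                waitArgs.put("x-dead-letter-routing-key", "jobs");
                channel.queueDeclare("jobs.wait", true, false, false, waitArgs);

                // Publishing to "jobs.wait" means the job shows up on "jobs" roughly 30 s later.
                channel.basicPublish("", "jobs.wait", null, "do-the-thing".getBytes("UTF-8"));
            }
        }
    }

One caveat: a queue-level TTL gives every message in that queue the same delay, so if the delay really varies per task you end up with one wait queue per delay bucket (per-message TTLs only expire at the head of the queue), or you reach for the rabbitmq-delayed-message-exchange plugin.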