Can I achieve ordered processing with multiple consumers in Kafka? - message-queue

In Kafka, I have a producer queuing up work for clients. Each piece of work has a client ID on it. Work for different clients can be processed out of order, but work for one client must be processed in order.
To do this, I intend to have (for example) 20 topics to achieve parallelism. The producer will queue up a client's work into topic[client ID mod 20]. I then intend to have many consumers, each capable of processing work for any client, but I still want the work processed in order. This means that the next piece of work in a topic can't begin to be processed before the previous piece has completed. In case of consumer failure it's OK to process work twice, but it means that the offset of that topic can't progress to the next piece of work.
Note: the number of messages per second is rather small (10s-100s messages).
To sum up:
'At least once' processing of every message (=work)
In order processing of work for one topic
Multiple consumers for every topic to support consumer failure
Can this be done using Kafka?

Yes, you can do this with Kafka. But you shouldn't do it quite the way you've described. Kafka already supports semantic partitioning within a topic if you provide a key with each message. In this case you'd create a topic with 20 partitions, then make the key for each message the client ID. This guarantees all messages with the same key end up in the same partition, i.e. it will do the partitioning that you were going to do manually.
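For illustration, here is a minimal sketch with the standard Java producer, assuming a topic named "work" created with 20 partitions, a local broker and String serializers (all of those are assumptions, not part of your setup); the message key is what drives the partitioning:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedWorkProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String clientId = "client-42";        // hypothetical client
                String workPayload = "piece-of-work"; // hypothetical work item
                // The key (client ID) picks the partition, so all work for one
                // client lands in the same partition and stays in order.
                producer.send(new ProducerRecord<>("work", clientId, workPayload));
            }
        }
    }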
When consuming, use the high level consumer, which automatically balances partitions across available consumers. If you want to absolutely guarantee at least once processing, you should commit the offsets manually and make sure you have fully processed messages you have consumed before committing them. Beware that consumers joining or leaving the group will cause rebalancing of partitions across the instances, and you'll have to make sure you handle that correctly (e.g. if your processing is stateful, you'll have to make sure that state can be moved between consumers upon rebalancing).
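And a rough consumer sketch under the same assumptions (topic "work", a group called "workers", the modern Java consumer API rather than the old high-level consumer): auto-commit is disabled and offsets are committed only after the polled records have been fully processed, which gives at-least-once behaviour.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class WorkConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "workers");                 // assumed group name
            props.put("enable.auto.commit", "false");         // we commit manually
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("work"));
                while (true) {
                    // On pre-2.0 clients use poll(long) instead of poll(Duration).
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record.key(), record.value());
                    }
                    // Commit only after everything polled has been processed:
                    // a crash before this line means the batch is consumed again.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String clientId, String work) {
            System.out.println("Processing " + work + " for client " + clientId);
        }
    }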

Related

Shard key with mostly even distribution. How to handle outliers?

I'm learning about sharding approaches: how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be relatively common in the wild; however, I was unable to find much info on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard providing best performance. Writes would be evenly distributed as long as clients produce relatively the same load.
But what to do if there is a very small subset of clients that produce so much IO load that a single shard would have trouble handling it?
If we change the sharding key for a random record ID then writes for all clients would be distributed across all shards. But reads would have to hit all shards which is not efficient, especially when there are a lot of them.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
I'd suggest adding a new attribute to the client's records, for example we could call it part. Assign a single value to simple clients, and store the same value in part for all their records.
But heavy clients would be assigned multiple values for part, up to the number of shards. Every record for that client would set its part to one of these values. Assign them either randomly or round-robin, however you think is most efficient. The point being to use each part with approximately even frequency.
Your hashing algorithm for mapping clients to a shard would then use the client id + the part attribute. So each simple client would still store all their data on a single shard. But heavy clients will distribute their data over multiple shards.
This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the part values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each part value associated with that client.
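As a rough illustration only (the hash function, the shard count and how parts get assigned are all assumptions, not a prescription), the routing could look something like this:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        // A record's shard depends on client id + part, not on client id alone.
        public int shardFor(String clientId, int part) {
            return Math.floorMod((clientId + ":" + part).hashCode(), numShards);
        }

        // Reads have to cover every part assigned to the client:
        // one part (one shard) for simple clients, several for heavy ones.
        public Set<Integer> shardsForRead(String clientId, List<Integer> parts) {
            Set<Integer> shards = new HashSet<>();
            for (int part : parts) {
                shards.add(shardFor(clientId, part));
            }
            return shards;
        }
    }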
To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
You've tagged your question with cockroachdb so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore be generally stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior there are tools to do so such as SHOW RANGES.

Using table locking to prevent multiple users from updating at a given time

I am building a simple shopping cart. Currently, to ensure that a customer can never purchase a product that is out of stock, when processing the order I have a loop for each product in their cart:
-- Begin a transaction --
Loop through each product in the cart:
    Select the stock count from the products table
    If it is in stock:
        Reduce the stock count for the product
        Add the product to the order items table
    Otherwise:
        Roll back and return an error
-- If nothing triggered a rollback, everything ends with a commit --
However, if the stock count for a product is updated after it has been checked for that particular product, there may be inconsistencies.
Question: would it be a good idea to lock the table from writes whenever I am processing an order? So that when the 'loop' above occurs, I can be assured that no one else is able to alter the product count and it will always be accurate.
The idea is that the product count/availability will always be consistent, and there will never be an instance where the stock count goes to -1 (which would be unfulfillable).
However, I have seen so many posts on locks being inefficient/having bad effects. If so, what is the best way to accomplish this?
I have seen alternatives like handling it in an update + select query, but have seen that it may also not be suitable in some cases.
You have at least three strategies:
1. Pessimistic Locking
If your application will experience low activity then you can lock the tables (or single rows) to make sure no other thread changes the values during the processing of a purchase. It works, but it has performance limitations.
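A rough JDBC sketch of the row-level variant, assuming a products table with id and stock columns (invented names) and MySQL/PostgreSQL-style SELECT ... FOR UPDATE, which holds the row lock until the transaction commits or rolls back:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PessimisticCheckout {
        // Returns true if the stock was reserved, false if out of stock.
        public static boolean reserve(String jdbcUrl, long productId, int qty) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                conn.setAutoCommit(false);
                try {
                    // Lock just this product's row, not the whole table.
                    PreparedStatement lock = conn.prepareStatement(
                        "SELECT stock FROM products WHERE id = ? FOR UPDATE");
                    lock.setLong(1, productId);
                    ResultSet rs = lock.executeQuery();
                    if (!rs.next() || rs.getInt("stock") < qty) {
                        conn.rollback();            // out of stock: abort the order
                        return false;
                    }
                    PreparedStatement update = conn.prepareStatement(
                        "UPDATE products SET stock = stock - ? WHERE id = ?");
                    update.setInt(1, qty);
                    update.setLong(2, productId);
                    update.executeUpdate();
                    // ... insert into the order items table here ...
                    conn.commit();                  // commit releases the row lock
                    return true;
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }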
2. Optimistic Locking
If your application/web site must serve a high load then you can opt for the "optimistic locking" strategy. In this case you add a version number column to your critical tables and then you use it when reading/writing it.
When updating stock, you check that the version number is still the same one you read. If it isn't (another thread modified the row), you roll back the transaction and can retry a couple of times until you succeed.
It requires more development effort, since you need to detect the conflict case and implement retry logic (if you want to).
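A sketch of the same operation with optimistic locking, assuming a version column on the same hypothetical products table; the UPDATE only succeeds if nobody changed the row since it was read, otherwise we re-read and retry:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OptimisticCheckout {
        // Returns true if the stock was decremented, false if out of stock.
        public static boolean reserve(Connection conn, long productId, int qty) throws Exception {
            for (int attempt = 0; attempt < 3; attempt++) {    // bounded retries
                PreparedStatement read = conn.prepareStatement(
                    "SELECT stock, version FROM products WHERE id = ?");
                read.setLong(1, productId);
                ResultSet rs = read.executeQuery();
                if (!rs.next() || rs.getInt("stock") < qty) {
                    return false;                              // out of stock
                }
                int version = rs.getInt("version");

                PreparedStatement write = conn.prepareStatement(
                    "UPDATE products SET stock = stock - ?, version = version + 1 "
                    + "WHERE id = ? AND version = ?");
                write.setInt(1, qty);
                write.setLong(2, productId);
                write.setInt(3, version);
                if (write.executeUpdate() == 1) {
                    return true;                               // no one beat us to it
                }
                // 0 rows updated: another transaction changed the row first; retry.
            }
            throw new IllegalStateException("Too much contention, giving up");
        }
    }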
3. Processing Queues
You can implement processing queues. When a thread wants to "purchase an order" it can submit it to a processing queue for purchase orders. This queue can be implemented by one or more threads dedicated to this task; if you choose multiple threads they can be divided by order types, regions, categories, etc. to distribute the load.
This requires more programming effort since you need to manage asynchronous processing, but can sustain much higher levels of load.
You can use this strategy for multiple different tasks: purchasing orders, refilling stock, sending notifications, processing promotions, etc.
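A minimal in-process sketch of the queue idea with a single dedicated worker thread; in a real system the queue would more likely be a message broker or a job table, but the shape is the same:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class OrderQueue {
        private final BlockingQueue<String> orders = new LinkedBlockingQueue<>();

        public OrderQueue() {
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        String orderId = orders.take(); // blocks until an order arrives
                        processOrder(orderId);          // only this thread touches stock
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // Callers enqueue and return immediately (asynchronous processing).
        public void submit(String orderId) {
            orders.add(orderId);
        }

        private void processOrder(String orderId) {
            // Check stock, decrement it, write order items - no other thread competes.
            System.out.println("Processing order " + orderId);
        }
    }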

Understanding the max.inflight property of kafka producer

I'm benchmarking my Kafka cluster, version 1.0.0-cp1.
For the part of my benchmark that focuses on the maximum possible throughput with an ordering guarantee and no data loss (a topic with only one partition), do I need to set the max.in.flight.requests.per.connection property to 1?
I've read this article
And I understand that I only have to set max.in.flight to 1 if I enable retries on my producer via the retries property.
Another way to ask my question: Only one partition + retries=0 (producer props) is sufficient to guarantee the ordering in Kafka?
I need to know because increasing max.in.flight drastically increases the throughput.
Your use case is slightly unclear. You mention ordering and no data loss but don't specify whether you tolerate duplicate messages, so it's unclear if you want At least Once (QoS 1) or Exactly Once.
Either way, as you're using 1.0.0 and only a single partition, you should have a look at the Idempotent Producer instead of tweaking the producer configs. It allows you to properly and efficiently guarantee ordering and no data loss.
From the documentation:
Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer.
The early Idempotent Producer was forcing max.in.flight.requests.per.connection to 1 (for the same reasons you mentioned) but in the latest releases it can now be used with max.in.flight.requests.per.connection set to up to 5 and still keep its guarantees.
Using the Idempotent Producer you'll not only get stronger delivery semantics (Exactly Once instead of At least Once) but it might even perform better!
I recommend you check the delivery semantics in the docs: http://kafka.apache.org/documentation/#semantics
Back to your question
Yes: without the idempotent (or transactional) producer, if you want to avoid data loss (QoS 1) and preserve ordering, you have to set max.in.flight.requests.per.connection to 1, allow retries and use acks=all. As you saw, this comes at a significant performance cost.
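For reference, the two producer configurations side by side as a sketch (the property names are the standard Java producer ones; the exact retries value is illustrative):

    import java.util.Properties;

    public class ProducerConfigs {
        // Option A: ordering + no data loss without idempotence
        // (works on older brokers, but costs throughput).
        static Properties strictOrdering() {
            Properties p = new Properties();
            p.put("acks", "all");
            p.put("retries", Integer.MAX_VALUE);                  // illustrative value
            p.put("max.in.flight.requests.per.connection", "1");  // one request at a time
            return p;
        }

        // Option B: idempotent producer - ordering and no duplicates are kept
        // even with several in-flight requests.
        static Properties idempotent() {
            Properties p = new Properties();
            p.put("enable.idempotence", "true"); // needs acks=all and retries > 0;
                                                 // the client defaults them when unset
            p.put("max.in.flight.requests.per.connection", "5");
            return p;
        }
    }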
Yes, you must set the max.in.flight.requests.per.connection property to 1.
In the article you read there was an initial mistake (since corrected) where the author wrote:
max.in.flights.requests.per.session
which doesn't exist in the Kafka documentation.
This erratum probably comes from the book "Kafka: The Definitive Guide" (1st edition), where on page 52 you can read:
"...so if guaranteeing order is critical, we recommend setting in.flight.requests.per.session=1 to make sure that while a batch of messages is retrying, additional messages will not be sent..."
IMO, it is also invaluable to know about the following behaviour, which makes things far more interesting and slightly more complicated.
When you enable enable.idempotence=true, every time you send a message to the broker you also send a sequence number, starting from zero. The broker stores that sequence number on its side too. Let's say the broker's last stored sequence number is 3. When the next request arrives, the broker compares its sequence number and says:
if it's 4 - good, it's a new batch of records
if it's 3 - it's a duplicate
if it's 5 (or higher) - it means messages were lost
And now max.in.flight.requests.per.connection: a producer can have up to this many concurrent requests outstanding without actually waiting for an answer from the broker. When we reach 3 (let's say max.in.flight.requests.per.connection=3), we start asking the broker for the previous results (and in the meantime we can't send any more batches, even if they are ready).
Now, for the sake of the example, let's say the broker replies: "1 was OK, I stored it", "2 has failed", and now the important part: because 2 failed, the only possible response for 3 is "out of order", which means it was not stored. The client now knows that it needs to resend 2 and 3, so it builds a list and resends them in that exact order, provided retries are enabled.
This explanation is probably oversimplified, but it is my basic understanding after reading the source code a bit.

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create duplicate entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this happens, but I would like to make sure that every item in the queue is persisted to AR/the DB one after another, in sequence, and that there is never more than one write to this DB table at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT query to the API. Sometimes queries for the same SID arrive at exactly the same time, so I need a way to make sure they aren't all persisted at once, but one after the other, or that simply the last one sent by AJAX request to the API is what ends up persisted.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
If you are worried about multiple processes writing to the 'same' rows - as in multiple users updating the same order_header row - I'd suggest you set a marker bound to the current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by cancelling the edit.
Your use-case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB: with a fairly recent MySQL (as in post-5.1) you'd add a trigger/function which does the actual update, and there you could implement logic similar to the above - some marker bound to the sequenced job id of sorts.

MySql / MSSQL - Checking out Records for Processing - Scaling?

I'm trying to figure out the most efficient and scalable way to implement a processing queue mechanism in a sql database. The short of it is, I have a bunch of 'Domain' objects with associated 'Backlink' statistics. I want to figure out efficiently which Domains need to have their Backlinks processed.
Domain table: id, domainName
Backlinks table: id, domainId, count, checkedTime
The Backlinks table has many records (to keep a history) to one Domain record.
I need to efficiently select domains that are due to have their Backlinks processed. This could mean that the Backlinks record with the most recent checkedTime is far enough in the past, or that there is no Backlinks record at all for a domain record. Domains will need to be ordered for processing by a number of factors, including ordering by the oldest checkedTime first.
There are multiple ‘readers’ processing domains. If the same domain gets processed twice it’s not a huge deal, but it is a waste of cpu cycles.
The worker takes an indeterminate amount of time to process a domain. I would prefer to have some backup in the sense that a checkout would 'expire' rather than require the worker process to explicitly 'checkin' a record when it's finished, in case the worker fails for some reason.
The big issue here is scaling. From the start I’ll easily have about 2 million domains, and that number will keep growing daily. This means my Backlinks history will grow quickly too, as I expect to process in some cases daily, and other cases weekly for each domain.
The question becomes, what is the most efficient way to find domains that require backlinks processing?
Thanks for your help!
I decided to structure things a bit differently. Instead of finding domains that need to be processed based on criteria spread across several tables, I'm assigning a date at which each metric needs to be processed for a given domain. This makes the query for finding domains that need processing much simpler.
I ended up using the idea of batches: I find domains to process, mark them as being processed by a batch id, then return those domains to the worker. When the worker is done, it returns the results, the batch is deleted, and the domains naturally become ready for processing again in the future.
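A rough sketch of that checkout pattern in MySQL-flavoured SQL through JDBC; the table and column names (domains, next_process_at, batch_id, checked_out_at) and the 30-minute expiry are invented for illustration, and the expiry check is what makes a checkout "expire" instead of requiring an explicit check-in:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    public class DomainCheckout {
        // Claim up to 'limit' domains that are due for processing, tagging them
        // with a fresh batch id. Rows whose previous checkout is older than
        // 30 minutes count as expired and can be claimed again.
        public static List<Long> checkout(Connection conn, int limit) throws Exception {
            String batchId = UUID.randomUUID().toString();

            PreparedStatement claim = conn.prepareStatement(
                "UPDATE domains "
                + "SET batch_id = ?, checked_out_at = NOW() "
                + "WHERE next_process_at <= NOW() "
                + "  AND (batch_id IS NULL OR checked_out_at < NOW() - INTERVAL 30 MINUTE) "
                + "ORDER BY next_process_at "
                + "LIMIT ?");
            claim.setString(1, batchId);
            claim.setInt(2, limit);
            claim.executeUpdate();

            PreparedStatement fetch = conn.prepareStatement(
                "SELECT id FROM domains WHERE batch_id = ?");
            fetch.setString(1, batchId);
            ResultSet rs = fetch.executeQuery();
            List<Long> ids = new ArrayList<>();
            while (rs.next()) {
                ids.add(rs.getLong("id"));
            }
            return ids;
        }
    }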