How does consumer rebalancing work in Kafka? - message-queue

When a new consumer/brorker is added or goes down, Kafka triggers a rebalance operation. Is Kafka Rebalancing a blocking operation. Are Kafka consumers blocked while a rebalancing operation is in progress?

Depends on what you mean by "blocked". If you mean "are existing connections closed when rebalance is triggered" then the answer is yes. The current Kafka's rebalancing algorithm is unfortunately imperfect.
Here is what is happening during consumer rebalance.
Assume we have a topic with 10 partitions (0-9), and one consumer (lets name it consumer1) consuming it. When a second consumer appears (consumer2) the rebalance task triggers for both of them (consumer1 gets an event, consumer2 does the initial rebalance). Now consumer1 closes all the existing connections (even those that will be reopened soon) and releases the partition ownership in Zookeeper for all 10 partitions.
Then it runs the partition assignment algorithm and decides what partitions should be claimed and claims the partition ownership in Zookeeper again. If the claim was successful consumer1 starts fetching his new partitions.
Meanwhile consumer2 runs the partition assignment algorithm as well and tries to claim his partitions in Zookeeper as well. Claim will succeed only when consumer1 releases the ownership on these partitions. When the claim is successful consumer2 starts fetching, or if it fails to claim partitions within a given amount of retries you get a rebalance failed after n retries exception.
As you noticed instead of just closing connections and releasing ownership for partitions consumer1 does not own anymore, it unnecessarily closes ALL his connections and restarts with just a lower amount of partitions. The same story with adding partitions (when we consume by a wildcard filter and new topic appears) - ALL connections are closed and then opened again instead of just opening new ones.
So I hope this answers your question - fetching stops when rebalance kicks in.

The accepted response above (from serejja) was correct in the past. Kafka has implemented "Incremental Cooperative Rebalancing" from version 2.3 (release date June 2019) and above. So now there is no need for all consumers to stop the processing ("stop the world event") to rebalance work in group fe. when new consumer appears in group or some consumer goes offline.
For more info see: From Eager to Smarter in Apache Kafka Consumer Rebalances

Related

Is AWS Aurora Multi-Master cluster suitable for WordPress

I have a situation where during peak moments my writer database even on the largest 96 core AWS instance becomes maxed (due to limited edition promotions where we process hundreds of orders per second).
I have seen that Aurora offer a multi-master setup where all nodes of the cluster are able to write - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html
In the docs they mention:
If two DB instances attempt to modify the same data page at almost the same instant, a write conflict occurs. The earliest change request is approved using a quorum voting mechanism. That change is saved to permanent storage. The DB instance whose change isn't approved rolls back the entire transaction containing the attempted change. Rolling back the transaction ensures that data is kept in a consistent state, and applications always see a predictable view of the data. Your application can detect the deadlock condition and retry the entire transaction.
I am not really sure what they mean here by "data page". I am pretty sure WordPress doesn't use transactions at all but when thousands of orders are coming in and being pushed into the same table will this cause write errors that will cause orders to fail?
I have looked online and cannot find anyone talking about using WordPress with Aurora multi-master cluster. Is it compatible?

MySQL jobs stuck in sidekiq queue

My Rails application takes a JSON blob of ~500 entries (from an API endpoint), throws it into a sidekiq/ redis background queue. The background job parses the blob then loops through the entries to perform a basic Rails Model.find_or_initialize_by_field_and_field() and model.update_attributes().
If this job were in the foreground, it would take a matter of seconds (if that long). I'm seeing these jobs remain in the sidekiq queue for 8 hours. Obviously, something's not right.
I've recently re-tuned the MySQL database to use 75% of available RAM as the buffer_pool_size and divided that amongst 3 buffer pools. I originally thought that might be part of the deadlock but the load avg on the box is still well below any problematic status ( 5 CPU and a load of ~ 2.5 ) At this point, I'm not convinced the DB is the problem though, of course, I can't rule it out.
I'm sure, at this point, I need to scale back the sidekiq worker instances. In anticipation of the added load I increased the concurrency to 300 per worker (I have 2 active workers on different servers.) Under a relatively small amount of load there queues operate as expected; even the problematic jobs are completed in ~1 minute. Though, per the sidekiq documentation >50 concurrent workers is a bad idea. I wasn't having any stability issues at 150 workers per instance. The problem has been this newly introduced job that performs ~500 MySQL finds and updates.
If this were a database timeout issue, the background job should have failed and been moved from the active (busy) queue to the failed queue. That's not the case. They're just getting stuck in the queue.
What other either MySQL or Rails/ sidekiq tuning parameters should I be examining to ensure these jobs succeed, fail, or properly time out?

How to upgrade our short/long memory term for real time processing

Our mobile app track user events (Events can have many types)
Each mobile reporting the user event and later on can retrieve it.
I thought of writing to Redis and Mysql.
When user request:
1. Find on Redis
2. If not on Redis find on Mysql
3. Return the value
4. Keep Redis modified in case value wasnt existed.
5. set expiry policy to each key on redis to avoid out of mem.
Problem:
1. Reads: If many users at once requesting information which not existed at Redis mysql going to be overloaded with Reads (latency).
2. Writes: I am going to have lots of writes into Mysql since every event going to be written to both datasources.
Facts:
1. Expecting 10m concurrect users which writes and reads.
2. Need to serv each request with max latency of one second.
3. expecting to have couple of thousands requests per sec.
Any solutions for that kind of mechanism to have good qos?
3. Is that in any way Lambda architecture solution ?
Thank you.
Sorry, but such issues (complex) rarely have a ready answer here. Too many unknowns. What is your budget and how much hardware you have. Since 10 million clients are concurrent use your service your question is about hardware, not the software.
Here is no any words about several important requirements:
What is more important - consistency vs availability?
What is the read/write ratio?
Read/write ratio requirement
If you have 10,000,000 concurrent users this is problem in itself. But if you have much of reads it's not so terrible as it may seem. In this case you should take care about right indexes in mysql. Also buy servers with lot of RAM to keep at least index data in RAM. So one server can hold 3000-5000 concurrent select queries without any problems with latency requirement in 1 second (one of our statistic project hold up to 7,000 select rps per server on 4 years old ordinary harware).
If you have much of writes - all becomes more complicated. And consistency becomes main question.
Consistency vs availability
If consistency is important - go to the store for new servers with SSD drives and moder CPU. Do not forget to buy much RAM as possible. Why? If you have much of write requests your sql server would rebuild index with every write. And you can't do not use indexes because of your read requests do not to keep in latency requirement. Under consistency i mean - if you write something, you should do this in 1 second and if you read this data right after write - you get actual written information in 1 second.
Your problem 1:
Reads: If many users at once requesting information which not existed at Redis mysql going to be overloaded with Reads (latency).
Or well known "cache miss" problem. And it has just some solutions - horizontal scaling (buy more hardware) or precaching. Precaching in this case may be done in at least 3 scenarios:
Using non blocking read and wait up to one second while data wont be queried from SQL server. If it not, return data from Redis. Update in Redis immediately or throw queue - as you want.
Using blocking/non blocking read and return data from Redis as fast as possible, but with every ready query push jub to queue about update cache data in Redis (also may inform app it should requery data after some time).
Always read/write from Redis, but register job in queue every write request to update data in SQL.
Every of them is compromise:
High availability but consistency suffers, Redis is LRU cache.
High availability but consistency suffers, Redis is LRU cache.
High availability and consistency but requires lot of RAM for Redis.
Writes: I am going to have lots of writes into Mysql since every event going to be written to both datasources.
The filed of compromise again. Lot's of writes rests to hardware. So buy more or use queues for pending writes. So availability vs consistency again.
Event tracking means (usualy) you can return data close to real time but not in real time. For example have 1-10 seconds latency to update data on disk (mysql) keeping 1 second latency for write/read serving requests.
So, it's combination of 1/2/3 (or some other) techniques for data provessing:
Use LRU in Redis and do not use expire. Lot's of expire keys - problem as is. So we can't use to be sure we save RAM.
Use queue to warm up missing keys in Redis.
Use queue to write data into mysql server from Redis server.
Use additional requests to update data from client size of cache missing situation accures.

Writing into multiple MySQL databases async

I am using AWS RDS so database replication between regions are impossible.
My application written in PHP and deployed on all regions, i am looking for a fast and reliable way to achieve that.
I am going to make MySQL connections :
SET ##auto_increment_increment= NUMBER_OF_WRITEABLE_DATABASES;
SET ##auto_increment_offset = REGION_ID ;
so AI pk's will be unique all over regions.
And my current plan is keeping a query log table with fields => id,queries,status,user_id. It will log all insert,update,delete queries into queries field in same page load.
Status Codes:
Status 0 => not executed
Status 1 => successfully executed on all regions
Status 2 => failed
Status 3 => failed with affected rows not match
Example Row:
id=>1
queries=>
INSERT INTO PROFILES VALUES (1,{USER_ID},'Username','Email')##SEPERATOR##AFFECTED_COUNT
UPDATE USERS SET last_modified='2012-12...' where id={USER_ID}##SEPERATOR##AFFECTED_COUNT
status=0
user_id=>{USER_ID}
and there will be a daemon which reads records which status != 1 and will process them on all regions without commit , once all run without error it will commit or roll back in case of error.
That is what i thought and going to use.
My question is there any more decent/tested approach to that scenario or is there any problem about my approach.
thanks in advance
My initial thought is that you are going down the wrong path if you are trying to use RDS as a solution to enforce unique record ID's across multiple regions. I would think you might want to rethink your actual need for uniqueness across regions or enforce uniqueness using multiple columns (i.e. an autoincrement plus a region identifier). That could be read and put into some eventually consistent data store for read purposes.
You're making a commendable effort, but as the other commenters have stated, your solution isn't viable, for a number of reasons.
You don't really want to use auto_increment_offset and auto_increment_increment at the session level. You want to set those at the server level. If RDS won't let you do that, this is another reason why RDS is probably not the best solution.
If I came out and suggested that you deploy a global network of MySQL servers (EC2, not RDS) in a multi-master ring, where data replicates 1 => 2 => 3 => 4 => 1 and each server ignores incoming replication messages with its own server id, my fellow MySQL DBAs would accuse me of having lost my mind and setting you up for a difficult-to-manage situation; however, I am convinced that this would be a much easier solution than what you have proposed, because at least, then, the data would be changing around the world in pretty much the same order in which it actually changed -- which would reduce the likelihood of conflicting updates originating from multiple locations. MySQL replication is asynchronous, in the sense that server 1 does not wait for a transaction to be committed on server 2 before returning success to the client (indicating that the transaction has committed), but don't confuse that fact with the fact that it is sequential -- transactions are replicated on each server in the order in which they were committed. (New options in MySQL 5.6 allow some exceptions to this by with parallel replication threads, but that isn't significant to this discussion).
Since you have devised a scheme for avoiding conflicting auto-increment values, your bigger problems are likely to come from updates and deletes. In the scenario I just described, if server 2 deleted a record and server 4 deleted the same record at the same time, then server 4 would stop replicating incoming events when it received the delete from server 2, because the "rows affected" would have been different. Your scenario would similarly fail. The difference is that using actual MySQL replication, nothing happening after the conflicting event happened, so until you resolved that conflict, at least your data would not diverge any further into inconsistency because of the sequential nature discussed above and the fact that MySQL replication completely stops whenever a conflict is encountered. In a ring of master servers, the server that has stopped replicating continues collecting a log of replication events from the upstream systems, but execution halts and the data on that server is frozen unless changed locally until the conflict is resolved and replication restarted.
Note also that in your scenario, you need to preserve "from" and "to" values for each column on updates, because you can't roll anything back unless you know that it rolls back to.
That being noted, a rollback needs to occur in real-time, not later. If I transfer money between two bank accounts, and for some reason that transfer needs to roll back, I need to see that while I'm using the bank's web site -- the bank can't roll that transaction back in the middle of the night just because one of their servers has a different balance in my bank account.
Here's a thought: In your scenario, it the account I was transferring "to" was consistent among all the servers, but the account I was transferring "from" was not, then I wonder... would your setup roll back the withdrawal from the "from" account, but leave the deposit in the "to" account? I think it might.
Keep in mind that you are limited by the CAP theorem. No system can be globally consistent, available, and tolerate isolation among the nodes. At best, you can pick any two.
With that thought, the question I have is this: why do all of the nodes in your global system need to be synchronized? If the main reason is performance, consider the possibility of deploying a single global master server, with read replicas distributed among the regions. Write your application with two pools of database connection threads so that most SELECT queries go to the local read replica, while INSERT, DELETE, UPDATE, and CALL (stored procedures that update data), are sent to the global master server. Your biggest worry, then, becomes the fact that you only have eventual consistency on the read replicas. With properly-sized servers and well-written queries, this is very fast (subject to the laws of physics for global travel of optical and electrical signals) but it is not instantaneous. What you have to do to accomplish this is for sessions that have recently made changes to the database, their reads may need to hit the global master -- if you place an order, you need to see the order immediately, so the master might be the best place to look, right away. Later, looking at the local replica will work. You're still out of scope for RDS with this, because of the cross-regional issue... but MySQL on EC2 is a good fit.
Read replicas impose a very small load on the master, but even this load can be mitigated by connecting a single read replica to the master and then connecting the downstream read replicas to that intermediate server.
Setting slave_compressed_protocol = 1 on the masters and the replicas will enable the machines to use compressed connections for transferring the replication events. I have found this to be anywhere from 3:1 to 10:1 depending on the nature of the data being replicated and the delay of compressing and decompressing the data seems insignificant.
Additionally, you could set up a second master, adjacent to the primary master (perhaps in a different A/Z), link those two servers with master-master replciation, chain the read replicas to the 2nd master, use auto increment increment and offsets appropriately, but do not write to or read from to the second master under normal conditions. Why would you do this? This way, you have a 2nd global master that could be placed into service immediately in case of failure of the primary master by redirecting your application to access it.
Of course, the nature of your application plays a large factor in how much global integration is actually required. Solving this problem will require you to rethink how the application works, to determine whether architectural changes are needed.
As a DBA, I don't like some of the restrictions and flexibility constraints that RDS imposes on me. All I really get in return for the loss-of-control is a relative ease of backups and point-in-time restoration... which I like... but, to me, these don't make up for the restrictions.
Footnote: In the 3rd paragraph, I said "transactions are replicated on each server in the order in which they were committed." But that doesn't necessarily mean in the real-world wall-clock actual-order in which they were committed... it actually means the order in which they were committed to each server relative to the other transactions being committed by that server... so a transaction on Server #1 that actually committed before a different transaction on Server #3 might arrive at server #4 after the transaction from #3 instead of before it, depending on how long the transaction took to propagate through server #2 and be committed on server #3. However, this is still "true enough" in principle, because if the transaction on #1 is perceived at server #3 as conflicting with whatever happened on #3, it will not actually replicate to #4 because #3 will stop replicating.

Implementing message priority in AMQP

I'm intending to use AMQP to allow a distributed collection of machines to report to a central location asynchronously. The idea is to drop messages into the queue and allow the central logging entity to process the queue in a decoupled fashion; the 'process' is simply to create or update a row in a database table.
A problem that I'm anticipating is the effect of network jitter in the message queuing process - what happens if an update accidentally gets in front of an insert because the time between the two messages being issued is less than the network jitter?
Reading the AMQP spec, it seems that I could just apply a higher priority to inserts so they skip the queue and get processed first. But presumably this only applies if a queue actually exists at the broker to be skipped. Is there a way to impose a buffer or delay at the broker to absorb this jitter and allow priority to be enacted before the messages are passed on to the consumer(s)?
Or do I have to go down the route of a resequencer as ActiveMQ suggests?
The lack of ordering between multiple publishers has nothing to do with network jitter, it's a completely natural thing in distributed applications. Messages from the same publisher will always be ordered. If you really need causal ordering of actions performed by different nodes then either a resequencer or a global sequence numbering scheme are your only options. Note that you cannot use sender timestamps for this, which is what everyone seems to try first..