We have a Hive installation that uses MariaDB as its metastore database. MariaDB holds around 250 GB of metadata, with roughly 100 GB of indexes. It becomes terribly slow during peak load of 40-60K QPS.
I'm looking for the community to share similar experiences, if any, and what they did to scale out or fix the metastore.
Some of the ideas I am currently looking at:
Application caching at the HMS level: I didn't find an out-of-the-box capability in my current v2.0.1. Is there support for it in higher versions?
Read replicas and routing SELECTs to them: I'm facing failures when there is replication lag and I try to read back a value I just wrote.
Horizontal sharding of MySQL: I'm finding it very complex. I saw some recommendations for TiDB but am not sure about others' experience with it.
An answer to this would greatly depend on things such as the type of workload (percentage of reads vs. writes, etc.) as well as your hardware. From your opening question it seems the data is time-critical (once data is committed, it should be readable).
There are several approaches you can take including:
Application-level caching, or a caching layer such as Memcached / Redis (see the sketch after this list)
MariaDB Galera Cluster with a load balancer for the read load
Sharding, e.g. with the Spider storage engine
Tuning your data / schemas / database for the workload
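To make the first option concrete, here is a minimal cache-aside sketch in Python (redis-py + PyMySQL). The hostnames, TTL, and the `get_table_params` helper are assumptions for illustration; `TABLE_PARAMS` and `TBL_ID` do exist in the Hive metastore schema, but verify against your schema version:

```python
# Hypothetical cache-aside layer in front of the metastore DB.
import json

import pymysql
import redis

cache = redis.Redis(host="cache-host", port=6379)           # placeholder host
db = pymysql.connect(host="metastore-db", user="hive",
                     password="...", database="metastore")  # placeholder credentials

def get_table_params(table_id: int) -> dict:
    key = f"table_params:{table_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)        # cache hit: the DB is not touched at all
    with db.cursor() as cur:
        cur.execute(
            "SELECT PARAM_KEY, PARAM_VALUE FROM TABLE_PARAMS WHERE TBL_ID = %s",
            (table_id,),
        )
        rows = dict(cur.fetchall())
    cache.setex(key, 60, json.dumps(rows))   # short TTL bounds staleness
    return rows
```

A short TTL keeps reads mostly off MariaDB during QPS spikes while bounding how stale the metadata can get; invalidating the key on writes would tighten that further.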
If you don't have the resources to try these things out on a simulated identical setup then I recommend getting some kind of MariaDB consulting services involved. These services will help you make the best decisions for your specific workload. There are many companies out there who will provide this (including MariaDB Corporation themselves).
Edit: Declaration requested... I declare that I am not affiliated with MariaDB Corporation. I work for the MariaDB Foundation which is an independent non-profit entity.
I have been trying to read data out of MySQL with Kafka Connect, using a JDBC source connector to query the database and the Debezium connector to read the binlogs. I am trying to understand which would be the better way to pull change data. Binlogs have the overhead of writing to the logs, while reading from the database has the overhead of querying it. What other major advantages and disadvantages are associated with these two approaches? What would be a better way of capturing change data? Also, starting with MySQL 8, binlogs are enabled by default. Does this mean the binlog approach could be the better way of doing things?
This question can be summarized as follows:
What are the pros and cons of a log-based CDC (represented by Debezium Connector) versus a polling-based CDC (represented by JDBC Source Connector)?
Query-based CDC:
✓ Usually easier to set up
✓ Requires fewer permissions
✗ Impact of polling the DB
✗ Needs specific columns in the source schema to track changes
✗ Can't track deletes
✗ Can't track multiple changes within one polling interval
Log-based CDC:
✓ All data changes are captured
✓ Low event latency while avoiding increased CPU load
✓ No impact on the data model
✓ Can capture deletes
✓ Can capture the old record state and additional metadata
✗ More setup steps
✗ Higher system privileges required
✗ Can be expensive for some proprietary DBs
References:
"Five Advantages of Log-Based Change Data Capture" by Gunnar Morling
"No More Silos: How to Integrate Your Databases with Apache Kafka and CDC" by Robin Moffatt
Stack Overflow: "Kafka Connect JDBC vs Debezium CDC"
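To make the comparison concrete, here is a hedged sketch of how the two connector types might be registered through the Kafka Connect REST API. All hostnames, credentials, and table/column names are invented, and the exact Debezium property names vary across connector versions, so treat these configs as illustrative rather than copy-paste ready:

```python
# Registering a query-based and a log-based source connector with Kafka Connect.
import requests

CONNECT_URL = "http://connect-host:8083/connectors"   # placeholder Connect worker

# Query-based CDC: polls the table; needs an incrementing id + timestamp column.
jdbc_source = {
    "name": "inventory-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://mysql-host:3306/inventory",
        "connection.user": "connect",
        "connection.password": "...",
        "mode": "timestamp+incrementing",        # change tracking via these columns
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "poll.interval.ms": "5000",              # the polling cost discussed above
        "topic.prefix": "jdbc-",
    },
}

# Log-based CDC: reads the binlog; needs REPLICATION privileges on MySQL.
debezium_source = {
    "name": "inventory-debezium-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "...",
        "database.server.id": "184054",          # unique id, acts as a replication client
        "topic.prefix": "dbz",                   # "database.server.name" in older versions
        "table.include.list": "inventory.customers",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.inventory",
    },
}

for connector in (jdbc_source, debezium_source):
    requests.post(CONNECT_URL, json=connector).raise_for_status()
```

Note how the trade-offs above surface directly in the configs: `mode`, `incrementing.column.name`, and `timestamp.column.name` are the query-based schema requirements, while `database.server.id` and the schema-history topic reflect the log-based approach's extra privileges and setup steps.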
The list given by @Iskuskov Alexander is great. I'd add a few more points:
Log-based CDC also requires writing to logs (you mentioned this in your question). This has overhead not only for performance, but also storage space.
Log-based CDC requires a continuous stream of logs. If the CDC misses a log, then the replica cannot be kept in sync, and the whole replica must be replaced by a new replica initialized by a new snapshot of the database.
If your CDC process is offline periodically, you need to keep the logs until it runs again, and how long that will be can be hard to predict. This leads to needing more storage space.
That said, query-based CDC has its own drawbacks. At my company, we have used a query-based CDC, but we found it inconvenient, and we're working on replacing it with a Debezium log-based solution, for many of the reasons in the other answer, and also:
Query-based CDC makes it hard to keep schema changes in sync with the replica, so if a schema change occurs in the source database, it may require the replica be trashed and replaced with a fresh snapshot.
The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot, and users don't like this downtime. Also snapshot transfers increase the network bandwidth requirements.
Neither solution is "better" than the other. Both have pros and cons. Your job as an engineer is to select the option that fits your project's requirements the best. In other words, choose the one whose disadvantages are least bad for your needs.
We can't make that choice for you, because you know your project better than we do.
Re your comments:
Enabling binary logs has no overhead for read queries, but significant overhead for write queries. The overhead became greater in MySQL 8.0, as measured by Percona CTO Vadim Tkachenko and reported here: https://www.percona.com/blog/2018/05/04/how-binary-logs-affect-mysql-8-0-performance/
He concludes the overhead of binary logs is about 13% for MySQL 5.7, and up to 30% for MySQL 8.0.
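If you want to see what your own server is doing, here is a quick sketch (connection details are placeholders) that inspects the relevant variables; note that Debezium additionally expects `binlog_format=ROW`:

```python
# Check whether binary logging is on and how it is configured.
import pymysql

conn = pymysql.connect(host="mysql-host", user="admin", password="...")
with conn.cursor() as cur:
    cur.execute(
        "SHOW VARIABLES WHERE Variable_name IN "
        "('log_bin', 'binlog_format', 'binlog_row_image')"
    )
    for name, value in cur.fetchall():
        print(name, "=", value)   # e.g. log_bin = ON, binlog_format = ROW
```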
Can you also explain "The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot"? Do you mean building a replication database?
Yes, if you need to build a new replica, you acquire a snapshot of the source database and import it to the replica. Every step of this takes time:
Create the snapshot of the source
Transfer the snapshot to the host where the replica lives
Import the snapshot into the replica instance
How long this takes depends on the size of the database, but it can be hours or even days. While waiting, users can't use the replica database, at least not if they want their queries to analyze a complete copy of the source data. They have to wait for the import to finish.
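For illustration, here is a rough sketch of those three steps with standard MySQL tooling driven from Python. Hostnames and paths are placeholders, and recent mysqldump versions use `--source-data` where older ones use `--master-data`:

```python
# Sketch of rebuilding a replica from a fresh snapshot of the source.
import subprocess

# 1. Create the snapshot of the source (consistent, with binlog coordinates).
subprocess.run(
    "mysqldump --single-transaction --source-data=2 --all-databases "
    "-h source-host -u repl -p... > /backups/snapshot.sql",
    shell=True, check=True,
)

# 2. Transfer the snapshot to the host where the replica lives.
subprocess.run(["scp", "/backups/snapshot.sql", "replica-host:/restore/"], check=True)

# 3. Import the snapshot into the replica instance.
subprocess.run(
    "mysql -h replica-host -u admin -p... < /restore/snapshot.sql",
    shell=True, check=True,
)
```

Each step's duration scales with database size, which is why the "rebuilding" window can stretch to hours or days.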
Hi, I am pretty new to Apache Kafka, so I don't know how much sense this will make.
I did a lot of research and couldn't find what the advantage of multiple brokers is.
I went through the whole Kafka documentation and couldn't find an answer to this.
Say, for example, I am receiving data from two different sets of devices, which I have to manipulate and store. The consumer differs depending on which set of devices the data arrives from.
Should I go with multi-broker / single topic / multiple partitions, OR single broker / single topic / multiple partitions, OR some other approach?
Any help or guide is appreciated.
As with pretty much any distributed system: scalability and resiliency. One broker goes down - no problem if you have replication set up. You suddenly get a traffic spike which would be too much for a single machine to handle - no problem if you have a cluster of machines to handle the traffic.
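As a concrete illustration, here is a minimal kafka-python sketch (broker addresses and the topic name are placeholders) that creates a topic spread over several brokers; with `replication_factor=3`, any single broker can fail without data loss or downtime:

```python
# Create a topic whose 6 partitions are each replicated across 3 brokers.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"]
)
admin.create_topics([
    NewTopic(name="device-events", num_partitions=6, replication_factor=3)
])
```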
Current state: we are sharing ZooKeeper between Kafka and several other services that use ZooKeeper for coordination. Each service operates nicely in its own ZooKeeper subcontext. It looks like this:
/
/service1/...
/service2/...
/brokers/...
/consumers/...
My question is: is it possible to set up Kafka to use a subcontext, so that the other services can't accidentally modify another service's subcontext? It would be:
/
/service1/...
/service2/...
/kafka/brokers/...
/kafka/consumers/...
I saw this syntax in other projects:
zk://10.0.0.1,10.0.0.2/kafka
let's say. So Kafka would see only its own brokers and consumers paths, and there would be no way to mess with the other subcontexts.
I'm afraid Kafka just doesn't support this format at the moment. My other question is: is there a workaround, like wrapping ZooKeeper somehow? Any ideas? Or is Kafka supposed to use ZooKeeper exclusively? Is it best practice to spawn a ZooKeeper ensemble per project? That seems like overkill, since a ZooKeeper ensemble needs at least 3 nodes.
Thanks for your answer!
You can use the zk chroot syntax with Kafka, as detailed in the Kafka configuration documentation.
Zookeeper also allows you to add a "chroot" path which will make all kafka data for this cluster appear under a particular path. This is a way to setup multiple Kafka clusters or other applications on the same zookeeper cluster. To do this give a connection string in the form hostname1:port1,hostname2:port2,hostname3:port3/chroot/path which would put all this cluster's data under the path /chroot/path. Note that you must create this path yourself prior to starting the broker and consumers must use the same connection string.
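Since the quoted docs say the chroot path must exist before the brokers start, here is a small sketch that creates it using the kazoo client (the ZooKeeper addresses are placeholders):

```python
# Create the /kafka chroot path before starting any broker.
from kazoo.client import KazooClient

zk = KazooClient(hosts="10.0.0.1:2181,10.0.0.2:2181")
zk.start()
zk.ensure_path("/kafka")   # creates /kafka if it does not exist yet
zk.stop()
```

After that, point every broker and client at `10.0.0.1:2181,10.0.0.2:2181/kafka` in `zookeeper.connect`, and Kafka will keep all its znodes under `/kafka`.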
The best practice is to maintain a single ZooKeeper cluster (at least that is what I've seen). Otherwise you are creating more operational workload for maintaining a good ZK ensemble.
The Kafka documentation on operationalizing ZK sort of recommends having multiple ZKs though:
Application segregation: Unless you really understand the application patterns of other apps that you want to install on the same box, it can be a good idea to run ZooKeeper in isolation (though this can be a balancing act with the capabilities of the hardware).
In Kafka, I would like to use only a single broker, a single topic, and a single partition, with one producer and multiple consumers (each consumer getting its own copy of the data from the broker). Given this, I do not want the overhead of using ZooKeeper; can I not just use the broker only? Why is ZooKeeper a must?
Yes, Zookeeper is required for running Kafka. From the Kafka Getting Started documentation:
Step 2: Start the server
Kafka uses zookeeper so you need to first start a zookeeper server if you don't already have one. You can use the convenience script packaged with kafka to get a quick-and-dirty single-node zookeeper instance.
As to why: people long ago discovered that you need some way to coordinate tasks, manage state, handle configuration, etc. across a distributed system. Some projects have built their own mechanisms (think of the config server in a MongoDB sharded cluster, or a master node in an Elasticsearch cluster). Others have chosen to take advantage of ZooKeeper as a general-purpose distributed process coordination system. So Kafka, Storm, HBase, and SolrCloud, to name just a few, all use ZooKeeper to help manage and coordinate.
Kafka is a distributed system and is built to use Zookeeper. The fact that you are not using any of the distributed features of Kafka does not change how it was built. In any event there should not be much overhead from using Zookeeper. A bigger question is why you would use this particular design pattern -- a single broker implementation of Kafka misses out on all of the reliability features of a multi-broker cluster along with its ability to scale.
As explained by others, Kafka (even in most recent version) will not work without Zookeeper.
Kafka uses Zookeeper for the following:
Electing a controller. The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all partitions. When a node shuts down, the controller tells other replicas to become partition leaders in place of the partition leaders on the node that is going away. ZooKeeper is used to elect the controller, make sure there is only one, and elect a new one if it crashes.
Cluster membership: which brokers are alive and part of the cluster? This is also managed through ZooKeeper.
Topic configuration: which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
(0.9.0) - Quotas - how much data is each client allowed to read and write
(0.9.0) - ACLs - who is allowed to read and write to which topic
(old high level consumer) - Which consumer groups exist, who are their members and what is the latest offset each group got from each partition.
[from https://www.quora.com/What-is-the-actual-role-of-ZooKeeper-in-Kafka/answer/Gwen-Shapira]
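If you want to see this state for yourself on a ZooKeeper-based cluster, a small kazoo sketch (the address is a placeholder) can read the znodes Kafka registers:

```python
# Peek at the znodes where a ZooKeeper-based Kafka cluster keeps its state.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()
print("live brokers:", zk.get_children("/brokers/ids"))    # cluster membership
print("topics:", zk.get_children("/brokers/topics"))       # topic configuration
data, _ = zk.get("/controller")                            # current controller
print("controller:", data.decode())
zk.stop()
```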
Regarding your scenario (only one broker instance and one producer with multiple consumers), you could use Pusher to create a channel and push events to that channel, which consumers can subscribe to and handle.
https://pusher.com/
Important update - August 2019:
ZooKeeper dependency will be removed from Apache Kafka. See the high-level discussion in KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum.
These efforts will take a few Kafka releases and additional KIPs. Kafka controllers will take over the tasks currently handled by ZooKeeper. The controllers will leverage the benefits of the event log, which is a core concept of Kafka.
Some benefits of the new Kafka architecture are a simpler architecture, ease of operations, and better scalability e.g. allow "unlimited partitions".
Update - October 2022:
As of the 3.3 release, you can run new Apache Kafka clusters without ZooKeeper (in the new mode, called KRaft mode) in production.
Apache Kafka Raft (KRaft) is the consensus protocol that was introduced to remove Apache Kafka's dependency on ZooKeeper for metadata management. The development progress is tracked in KIP-500.
KRaft mode was released as early access in Kafka 2.8. It was not suitable for production before version 3.3 (see details in KIP-833: Mark KRaft as Production Ready).
1. Benefits of Kafka’s new quorum controller
Enables Kafka clusters to scale to millions of partitions through improved control plane performance with the new metadata management
Improves stability, simplifies the software, and makes it easier to monitor, administer, and support Kafka.
Allows Kafka to have a single security model for the whole system
Provides a lightweight, single process way to get started with Kafka
Makes controller failover near-instantaneous
2. Timeline
Note: this timeline is very rough and subject to change.
2022/10: KRaft mode declared production-ready in Kafka 3.3
2023/02: Upgrade from ZK mode supported in Kafka 3.4 as early access.
2023/04: Kafka 3.5 released with both KRaft and ZK support. Upgrade from ZK goes production. ZooKeeper mode deprecated.
2023/10: Kafka 4.0 released with only KRaft mode supported.
References:
KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency
Preparing Your Clients and Tools for KIP-500: ZooKeeper Removal from Apache Kafka
KRaft: Apache Kafka Without ZooKeeper
Kafka is built to use Zookeeper. There is no escaping from that.
Kafka is a distributed system and uses Zookeeper to track status of kafka cluster nodes. It also keeps track of Kafka topics, partitions etc.
Looking at your question, it seems you do not need Kafka. You can use any application that supports pub-sub, such as Redis or RabbitMQ, or hosted solutions such as PubNub.
IMHO Zookeeper is not an overhead but makes your life a lot easier.
It is basically used to maintain coordination between different nodes in a cluster. One of the most important things for Kafka is that it uses ZooKeeper to periodically commit offsets, so that in case of node failure it can resume from the previously committed offset (imagine taking care of all this on your own).
ZooKeeper also plays a vital role in serving many other purposes, such as leader detection, configuration management, synchronization, and detecting when a new node joins or leaves the cluster.
Future Kafka releases are planning to remove the zookeeper dependency but as of now it is an integral part of it.
Here are a few lines taken from their FAQ page:
Once the ZooKeeper quorum is down, brokers could end up in a bad state and be unable to serve client requests normally. Although the Kafka brokers should be able to resume their normal state automatically when the ZooKeeper quorum recovers, there are still a few corner cases where they cannot, and a hard kill-and-recovery is required to bring them back to normal. Hence it is recommended to closely monitor your ZooKeeper cluster and provision it so that it is performant.
For more details check here
ZooKeeper is a centralized coordination and management system for any kind of distributed system. A distributed system consists of different software modules running on different nodes/clusters (possibly at geographically distant locations) but operating as one system. ZooKeeper facilitates communication between the nodes, shares configuration among them, and keeps track of which node is the leader, which nodes join or leave, etc. ZooKeeper is what keeps distributed systems sane and maintains consistency. ZooKeeper is basically an orchestration platform.
Kafka is a distributed system. And hence it needs some kind of orchestration for its nodes that might be geographically distant (or not).
Apache Kafka v2.8.0 gives you early access to KIP-500 that removes the Zookeeper dependency on Kafka which means it no longer requires Apache Zookeeper.
Instead, Kafka can now run in Kafka Raft metadata mode (KRaft mode) which enables an internal Raft quorum. When Kafka runs in KRaft mode its metadata is no longer stored on ZooKeeper but on this internal quorum of controller nodes instead. This means that you don't even have to run ZooKeeper at all any longer.
Note however that v2.8.0 is currently early access and you should not use Zookeeper-less Kafka in production for the time being.
A few benefits of removing ZooKeeper dependency and replacing it with an internal quorum:
More efficient as controllers no longer need to communicate with ZooKeeper to fetch cluster state metadata every time the cluster is starting up or when a controller election is being made
More scalable as the new implementation will be able to support many more topics and partitions in KRaft mode
Easier cluster management and configuration as you don't have to manage two distinct services any longer
A single-process Kafka cluster (see the sketch below)
For more details you can read the article Kafka No Longer Requires ZooKeeper
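As a rough sketch of that single-process experience, here is what the KRaft quickstart looks like driven from Python (paths assume a Kafka distribution with KRaft support unpacked in the current directory; adjust for your install):

```python
# Format storage for KRaft mode and start a single combined broker+controller.
import subprocess

# Generate a cluster id and format the storage directory for KRaft mode.
uuid = subprocess.run(
    ["bin/kafka-storage.sh", "random-uuid"],
    capture_output=True, text=True, check=True,
).stdout.strip()
subprocess.run(
    ["bin/kafka-storage.sh", "format", "-t", uuid,
     "-c", "config/kraft/server.properties"],
    check=True,
)

# Start the broker and controller as one process; no ZooKeeper anywhere.
subprocess.run(
    ["bin/kafka-server-start.sh", "config/kraft/server.properties"],
    check=True,
)
```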
Yes, ZooKeeper is a must by design for Kafka, because ZooKeeper has the responsibility of, in a sense, managing the Kafka cluster. It has the list of all Kafka brokers. It notifies Kafka if any broker or partition goes down, or if a new broker or partition comes up. In short, ZooKeeper keeps every Kafka broker updated about the current state of the Kafka cluster.
Then, all a Kafka client (producer/consumer) needs to do is connect to any single broker; that broker has all the metadata, kept up to date by ZooKeeper, so the client need not bother with the headache of broker discovery (see the sketch after the list below).
Other than the usual payload message transfer, there are many other kinds of communication that happen in Kafka, such as:
Events related to brokers requesting the cluster membership.
Events related to Brokers becoming available.
Getting bootstrap config setups.
Events related to controller and leader updates.
Health status updates, such as heartbeats.
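As mentioned above, bootstrap discovery means a client only needs one reachable broker. A minimal kafka-python sketch (the address and topic name are placeholders):

```python
# The client is given one bootstrap broker, fetches cluster metadata from it,
# and then talks directly to whichever brokers lead the partitions it needs.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["broker1:9092"])  # any single broker will do
producer.send("events", b"hello")   # routed to the partition leader via fetched metadata
producer.flush()
```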
ZooKeeper itself is a distributed system, consisting of multiple nodes in an ensemble. It is a centralized service for maintaining such metadata.
This article explains the role of ZooKeeper in Kafka. It explains how Kafka is stateless and how ZooKeeper plays an important role in the distributed nature of Kafka (and many other distributed systems).
The request to run Kafka without Zookeeper seems to be quite common. The library Charlatan addresses this.
According to its description, Charlatan is more or less a mock for ZooKeeper, providing the ZooKeeper services either backed by other tools or by a database.
I encountered that library when dealing with the main product of Charlatan's authors; there it works fine.
Firstly
Apache ZooKeeper is a distributed store which is used to provide configuration and synchronization services in a high available way.
In more recent versions of Kafka, work was done so that client consumers no longer store information about how far they have consumed messages (called offsets) in ZooKeeper. This reduced usage did not, however, get rid of the need for consensus and coordination in distributed systems. While Kafka provides fault tolerance and resilience, something is needed to provide the coordination required, and ZooKeeper enables that piece of the overall system.
Secondly
Agreeing on who the leader of a partition is, is one example of the practical application of ZooKeeper within the Kafka ecosystem.
ZooKeeper is needed even if there is only a single broker.
These are from Kafka In Action book.