Kafka Cluster - Producer - configuration

I have several questions about Kafka. If someone can help me by responding to any one of them, I will be very thankful.
Thank you in advance :)
Q1) I know that partitions are split across Kafka brokers, but what is the split based on? For instance, if I have 3 brokers and 6 partitions, how do I ensure that each broker gets 2 partitions? How is this split currently made in Kafka?
Q2) When a producer sends a new message, what is the default format of the message? Avro format? How can I change this default format to another format that may be more suitable, for example?
Q3) I know that to configure the maximum size of a file (log segment) within a partition, I have to change the following configuration: log.segment.bytes (1 GB by default). But which configuration parameter do I have to change to increase/decrease the maximum size of a directory (i.e. a partition)?
Q4) If the partition considered as the leader is dead, one of the follower partitions will take the lead. What are the steps to elect the new leader? (i.e.) How is the election of a new leader made?
Q5) What is the configuration parameter that allows me to change the time between two persists to disk? (persisting data to disk - sequential writes)
Q6) How is a message sent from the hard disk of a Kafka broker to a Kafka consumer? What is the interaction between the Kafka broker and ZooKeeper?
Is it ZooKeeper that sends the message to the consumer, or the Kafka broker?
Thank you in advance,

Q1: see How Partitions are split into Kafka Broker?
Q2: Brokers are agnostic to the message format -- they treat messages as plain byte arrays. Thus, a broker can handle any message format you want to have. The format is determined in your own code -- choose whatever you want and just provide the corresponding de/serializer to the producer/consumer.
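For example, here is a minimal producer sketch (broker address and topic name are placeholders) that sends plain UTF-8 strings; switching to Avro, JSON, Protobuf, etc. only means swapping the serializer classes:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class StringProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            // The format is decided here: the broker only ever sees the resulting byte arrays.
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
            }
        }
    }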
Q3: Topics, and thus partitions, are truncated either after a configurable retention time has passed (log.retention.ms) or if they grow beyond log.retention.bytes. Furthermore, topics can be compacted to avoid infinite growth (cf. log.cleanup.policy).
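As a sketch (assuming the Java AdminClient and placeholder names): the per-topic retention.* settings override the broker-wide log.retention.* defaults, and retention.bytes is the closest thing to a cap on a partition's directory size:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("my-topic", 6, (short) 3)
                        .configs(Map.of(
                                "retention.ms", "604800000",      // delete segments older than 7 days
                                "retention.bytes", "1073741824",  // ~1 GB cap per partition (not per directory tree)
                                "cleanup.policy", "delete"));     // or "compact" to avoid unbounded growth
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }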
Q4: For leader election, Apache ZooKeeper is used: one of the brokers acts as the cluster controller and, when a leader dies, picks the new leader from the partition's in-sync replicas.
Q5: I don't understand the question.
Q6: ZK is only used to maintain metadata (which topics exist, for example). ZK is not involved in any actual data transfer or in client-broker communication. Kafka uses its own network protocol. See the Kafka wiki for more details: https://cwiki.apache.org/confluence/display/KAFKA/Index
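To illustrate: a consumer is configured only with broker addresses (bootstrap.servers), never with a ZooKeeper address. A minimal sketch with placeholder names:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class StringConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Only brokers are contacted for data; ZooKeeper holds cluster metadata, not messages.
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            props.put("group.id", "example-group");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }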

Related

Debezium transforms 1GB database to 100GBs topic storage

I have Debezium in a container, capturing all changes of PostgreSQL database records.
The PostgreSQL database is around 1 GB with about a thousand tables. On the other side, Debezium is configured to capture all table changes, and its storage is around 100 GB after the initial load.
I understand that there will be an overhead from the conversion to JSON, but the difference is many times bigger.
Is there anything that can be configured to reduce Kafka topic storage?
You can consider a single message transformation (SMT) to reduce the size of topic messages, by adding the SMT configuration details to your connector's configuration:
transforms=unwrap,...
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
See the documentation:
A Debezium data change event has a complex structure that provides a
wealth of information. Kafka records that convey Debezium change
events contain all of this information. However, parts of a Kafka
ecosystem might expect Kafka records that provide a flat structure of
field names and values. To provide this kind of record, Debezium
provides the event flattening single message transformation (SMT).
Configure this transformation when consumers need Kafka records that
have a format that is simpler than Kafka records that contain Debezium
change events.
At the same time, Kafka supports compression at the topic level, so you can specify the default topic compression in the connector's configuration as part of the default topic creation group.
See the documentation:
topic.creation.default.compression.type is mapped to the
compression.type property of the topic level configuration parameters
and defines how messages are compressed on hard disk.
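If the topics already exist (so the topic.creation.default.* settings no longer apply on their own), the same compression.type property can be set directly on a topic, for example with the Java AdminClient; a sketch, where the topic name stands in for one of your Debezium change topics:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetTopicCompression {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Placeholder Debezium topic name; adjust to your connector's topic naming.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "dbserver1.public.my_table");
                AlterConfigOp setCompression = new AlterConfigOp(
                        new ConfigEntry("compression.type", "lz4"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(setCompression)))
                     .all().get();
            }
        }
    }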

Extract daily data changes from mysql and rollout to timeseries DB

In MySQL, using the binlog, we can extract the data changes. But I need only the latest changes made during that time/day and need to feed that data into a time-series DB (planning to go with Druid).
While reading the binlog, is there any mechanism to avoid duplicates and keep only the latest changes?
My intention is to get the entire MySQL DB backed up every day in a time-series DB. It helps to debug my application for past dates by referring to the actual data present on that day.
Kafka, by design, is an append-only log (no updates).
A Kafka Connect source connector will continuously capture all the changes from the binlog into the Kafka topic. The connector stores its position in the binlog and will only write new changes into Kafka as they become available in MySQL.
For consuming from Kafka, as one option, you can use a sink connector that will write all the changes to your target. Or, instead of the Kafka Connect sink connector, some independent process that will read (consume) from Kafka. For Druid specifically, you may look at https://www.confluent.io/hub/imply/druid-kafka-indexing-service.
The consumer (either a connector or some independent process) will store its position (offset) in the Kafka topic and will only write new changes into the target (Druid) as they become available in Kafka.
The processes described above capture all the changes and allow you to view source (MySQL) data at any point in time in the target (Druid). It is best practice to have all the changes available in the target. Use your target's functionality to limit the view of the data to a certain time of day, if needed.
If, for example, there is a huge number of daily changes to a record in MySQL and you'd like to write only the latest state as of a specific time of day to the target, you'll still need to read all the changes from MySQL. Create some additional daily process that reads all the changes since the prior run, filters only the latest records, and writes them to the target; a sketch of such a process follows.
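A minimal sketch of such a daily process, assuming string keys/values, a placeholder topic name, and a hypothetical writeToTarget() step; it reads all new changes and keeps only the last record seen per key:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DailyLatestStateJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            props.put("group.id", "daily-latest-state");        // committed offsets mark "changes since prior run"
            props.put("auto.offset.reset", "earliest");         // start from the beginning on the very first run
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            Map<String, String> latestByKey = new HashMap<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("mysql.changes"));  // placeholder topic name
                ConsumerRecords<String, String> records;
                // Keep polling until no new changes arrive, remembering only the last value per key.
                while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                    for (ConsumerRecord<String, String> record : records) {
                        latestByKey.put(record.key(), record.value());
                    }
                }
                latestByKey.forEach(DailyLatestStateJob::writeToTarget);
                consumer.commitSync();   // the next run resumes from here
            }
        }

        // Hypothetical sink: replace with an insert into Druid (or any other target).
        static void writeToTarget(String key, String latestValue) {
            System.out.printf("%s -> %s%n", key, latestValue);
        }
    }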

Kafka producer vs Kafka connect to read MySQL Datasource

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also just connect Kafka to the MySQL data source using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and do analysis. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a Kafka service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not to re-invent the wheel and to use Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, can do initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer; it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)

Advantages of kafka multi broker

Hi, I am pretty new to Apache Kafka; I don't know how much sense this will make.
I did a lot of research and couldn't find what the advantage of multiple brokers is.
I went through the whole Kafka documentation and couldn't find an answer for this.
Say, for example, I am receiving data from two different sets of devices which I should manipulate and store. Depending on which set of devices the data arrives from, the consumer will change.
Should I go with multi broker - single topic - multi partition, OR single broker - single topic - multi partition, OR some other approach?
Any help or guidance is appreciated.
As with pretty much any distributed system: scalability and resiliency. One broker goes down - no problem if you have replication set up. You suddenly get a traffic spike which would be too much for a single machine to handle - no problem if you have a cluster of machines to handle the traffic.
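For example, with three brokers you could create a topic whose six partitions are spread across the brokers and replicated three times, so any single broker can fail without losing data or availability; a sketch with placeholder broker addresses and topic name:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // List more than one broker so the client can still bootstrap if one is down.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions for parallelism (scalability), replication factor 3 for resiliency.
                NewTopic topic = new NewTopic("device-events", 6, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }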

spark vs flink vs m/r for batch processing

I was wondering if you could help me decide which one is best suited to my case.
Use case:
I want to batch process ~200M events stored in Apache Kafka and ~20M rows in different SQL tables per day. Data in the rows represents states of users, and events in Kafka change these states. Events in Kafka are well partitioned (all events for one user are stored in exactly one Kafka segment), but still, there are more users than Kafka segments.
(EDIT)
State updates can't be handled in real time, as events come from different sources at different times. All events have timestamps with a proper timezone, but events might be observed late, which results in a shifted timestamp. There are business rules for how to handle these.
I know how to compute the user state for any given time if all events and the starting state are available.
Output:
consistent final user states are stored in MySQL
writes during computation to other sources (Kafka, text files, etc.) can occur based on the current state
All of them are able to read and group data so I can process it, but as far as I know:
Spark and Flink can work without Hadoop (so far I don't have any stable cluster)
Spark has problems dealing with more data than the available RAM (?)
with Flink I'm not sure if I can combine data from a data stream (Kafka) and a table (SQL)
with M/R I need to set up a Hadoop cluster
Also, in the future there might be 100M events per hour, and there will be a functional Hadoop cluster.