Debezium transforms a 1 GB database into 100 GB of topic storage - JSON

I have Debezium running in a container, capturing all changes to records in a PostgreSQL database.
The PostgreSQL database is around 1 GB and has about 1,000 tables. On the other side, Debezium is configured to capture all table changes, and its topic storage is around 100 GB after the initial load.
I understand that there will be some overhead from the conversion to JSON, but the difference is many times larger than that overhead would explain.
Is there anything that can be configured to reduce the Kafka topic storage?

You can consider a single message transformation (SMT) to reduce the size of the topic messages by adding the SMT configuration details to your connector's configuration:
transforms=unwrap,...
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
See the documentation:
A Debezium data change event has a complex structure that provides a
wealth of information. Kafka records that convey Debezium change
events contain all of this information. However, parts of a Kafka
ecosystem might expect Kafka records that provide a flat structure of
field names and values. To provide this kind of record, Debezium
provides the event flattening single message transformation (SMT).
Configure this transformation when consumers need Kafka records that
have a format that is simpler than Kafka records that contain Debezium
change events.
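For context, here is a minimal sketch of how those SMT lines fit into a Debezium PostgreSQL connector configuration; the hostname, credentials, and database name are placeholders, and some property names (for example topic.prefix vs. database.server.name) differ between Debezium versions:
connector.class=io.debezium.connector.postgresql.PostgresConnector
# connection details (placeholders)
database.hostname=postgres
database.port=5432
database.user=debezium
database.password=dbz
database.dbname=inventory
# logical name used as the prefix for topic names (database.server.name in older Debezium versions)
topic.prefix=dbserver1
# flatten the change events
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState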
At the same time, Kafka supports compression at the topic level, so you can set the default topic compression in the connector's configuration as part of the default topic creation group.
See the documentation:
topic.creation.default.compression.type is mapped to the
compression.type property of the topic level configuration parameters
and defines how messages are compressed on hard disk.
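As a sketch, assuming topic creation by source connectors is enabled on the Connect worker (topic.creation.enable=true), the connector configuration could include a default topic creation group like this; the replication factor, partition count, and compression codec are placeholders to adjust for your cluster:
# applied to every topic the connector creates
topic.creation.default.replication.factor=3
topic.creation.default.partitions=1
# maps to the topic-level compression.type setting
topic.creation.default.compression.type=lz4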

Related

How can we use streaming in Spark from multiple sources? E.g. first take data from HDFS and then consume a stream from Kafka

The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought about dumping the data into HDFS.
New data also arrives in the database every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL over JDBC rather than needing HDFS exports.
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that is a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (possibly more than a few minutes).
However, if you use Debezium to get the data from MySQL into Kafka, then you have the data centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases and query patterns.

Merge multiple data sources into a sink and keep it up to date

My business has many tables in different MySQL instances, and now we want to build a search platform.
So we have to extract and join them into a wide-column table, insert that into Elasticsearch, and keep it up to date in ES.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?
have an application that converts the MySQL binlog into change messages and delivers them to Kafka
That's a start.
You would then need to do the same with your "wide-column table", and try to put that into a Kafka KTable, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work here as well.
After you have the joined stream+table written into a new Kafka topic, you can use the Kafka Connect Elasticsearch connector, or again try Flink, Logstash, etc., to ingest it into the search indices.
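For example, here is a minimal sketch of a Confluent Elasticsearch sink connector configuration; the topic name and connection URL are placeholders, and the ignore flags depend on how your joined records are keyed and serialized:
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
# topic holding the joined, flattened records (placeholder name)
topics=joined-wide-table
# Elasticsearch endpoint (placeholder)
connection.url=http://elasticsearch:9200
# use the Kafka record key as the document id, so updates overwrite the same document
key.ignore=false
# let Elasticsearch infer mappings instead of deriving them from the record schema
schema.ignore=true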
Once everything is running, any new inserts to either table will be reflected in Elasticsearch.

Extract daily data changes from MySQL and roll them out to a time-series DB

In MySQL, using the binlog, we can extract the data changes. But I need only the latest changes made during that time/day, and I need to feed that data into a time-series DB (planning to go with Druid).
While reading the binlog, is there any mechanism to avoid duplicates and keep only the latest changes?
My intention is to get the entire MySQL DB backed up every day into a time-series DB. It helps me debug my application for past dates by referring to the actual data present on that day.
Kafka, by design, is an append-only log (no updates).
A Kafka Connect source connector will continuously capture all the changes from the binlog into a Kafka topic. The connector stores its position in the binlog and only writes new changes into Kafka as they become available in MySQL.
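As an illustration, here is a minimal sketch of a Debezium MySQL source connector configuration; the connection details, server id, and included database are placeholders, and some property names (for example the schema history settings) differ between Debezium versions:
connector.class=io.debezium.connector.mysql.MySqlConnector
# connection details (placeholders)
database.hostname=mysql
database.port=3306
database.user=debezium
database.password=dbz
# unique numeric id this connector presents to MySQL when reading the binlog
database.server.id=184054
# logical name used as the prefix for topic names (topic.prefix in newer Debezium versions)
database.server.name=dbserver1
# which databases to capture (placeholder)
database.include.list=inventory
# where the connector stores the captured table schema history
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.inventory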
For consuming from Kafka, as one option, you can use a sink connector that writes all the changes to your target. Or, instead of a Kafka Connect sink connector, you can run some independent process that reads (consumes) from Kafka. For Druid specifically, you may look at https://www.confluent.io/hub/imply/druid-kafka-indexing-service.
The consumer (either a connector or an independent process) stores its position (offset) in the Kafka topic and only writes new changes into the target (Druid) as they become available in Kafka.
The processes described above capture all the changes and allow you to view the source (MySQL) data as of any point in time in the target (Druid). It is best practice to have all the changes available in the target; use your target's functionality to limit the view of the data to a certain time of day, if needed.
If, for example, there is a huge number of daily changes to a record in MySQL and you'd like to write only the latest state as of a specific time of day to the target, you'll still need to read all the changes from MySQL. Create an additional daily process that reads all the changes since the prior run, filters out everything but the latest record per key, and writes those records to the target.

Kafka producer vs Kafka Connect to read a MySQL data source

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also connect Kafka to the MySQL data source using Kafka Connect or Debezium. My goal is to ingest the data using Kafka and send it to Storm to consume and analyze. It looks like both approaches can achieve this, but using a Kafka producer may require me to build a service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not reinventing the wheel and using Debezium instead (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might be doing the same in your producer; it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)

Spark vs Flink vs MapReduce for batch processing

I was wondering if you can help me decide which one is the most suitable for my case.
Use case:
I want to batch process ~200M events per day that are stored in Apache Kafka, plus ~20M rows in different SQL tables. The data in the rows represents user states, and the events in Kafka change those states. The events in Kafka are well partitioned (all events for one user are stored in exactly one Kafka partition), but still, there are more users than Kafka partitions.
(EDIT)
State updates can't be handled in real time, as events come from different sources at different times. All events have timestamps with proper time zones, but events might be observed late, which results in shifted timestamps. There are business rules for how to handle these.
I know how to compute the user state for any given time if all the events and the starting state are available.
Output:
consistent final user states are stored in MySQL
writes to other sinks (Kafka, text files, etc.) can occur during the computation, based on the current state
All of them are able to read and group the data so I can process it, but as far as I know:
Spark and Flink can work without Hadoop (so far I don't have any stable cluster)
Spark has problems dealing with more data than the available RAM (?)
with Flink, I'm not sure if I can combine data from a data stream (Kafka) and a table (SQL)
with M/R, I need to set up a Hadoop cluster
Also, in the future there might be 100M events per hour, and there will be a functional Hadoop cluster.