Kafka producer vs Kafka Connect to read a MySQL data source

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also connect Kafka to the MySQL data source using Kafka Connect or Debezium. My goal is to ingest the data with Kafka and send it to Storm for consumption and analysis. It looks like both approaches can achieve this, but using a Kafka producer may require me to build a service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?

I'd advise not re-inventing the wheel and using Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, can do initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium reads the DB's log instead of polling (you might do the same in your producer; it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
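For reference, a minimal Debezium MySQL connector configuration could look roughly like this (a sketch assuming Debezium 1.x property names; the host, credentials, server name, and table list are placeholders for your own click-stream tables):
name=clicks-connector
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=mysql.example.com
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=184054
database.server.name=clickstream
table.include.list=webshop.click_events
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.clickstream
Kafka Connect runs the connector and publishes one topic per captured table (e.g. clickstream.webshop.click_events), which Storm can then consume.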

Related

Debezium transforms 1GB database to 100GBs topic storage

I have Debezium in a container, capturing all changes to the records of a PostgreSQL database.
The PostgreSQL database is around 1 GB and has about 1,000 tables. On the other side, Debezium is configured to capture all table changes, and the topic storage is around 100 GB after the initial load.
I understand that there will be overhead from the conversion to JSON, but the difference is many times bigger.
Is there anything that can be configured to reduce the Kafka topic storage?
You can consider a single message transformation (SMT) to reduce the size of the topic messages by adding the SMT configuration details to your connector's configuration:
transforms=unwrap,...
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
See the documentation:
A Debezium data change event has a complex structure that provides a wealth of information. Kafka records that convey Debezium change events contain all of this information. However, parts of a Kafka ecosystem might expect Kafka records that provide a flat structure of field names and values. To provide this kind of record, Debezium provides the event flattening single message transformation (SMT). Configure this transformation when consumers need Kafka records that have a format that is simpler than Kafka records that contain Debezium change events.
At the same time, Kafka supports compression at the topic level, so you can set the default topic compression in the connector's configuration as part of the default topic creation group.
See the documentation:
topic.creation.default.compression.type is mapped to the compression.type property of the topic-level configuration parameters and defines how messages are compressed on hard disk.
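Putting the two together, the relevant part of a connector configuration could look roughly like this (a sketch; topic.creation.* only takes effect if topic creation is enabled on the Connect worker via topic.creation.enable=true, and the partition, replication, and compression values are placeholders):
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
topic.creation.default.replication.factor=3
topic.creation.default.partitions=10
topic.creation.default.compression.type=lz4
topic.creation.default.cleanup.policy=compact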

How can we use streaming in Spark from multiple sources? E.g. first take data from HDFS and then consume a stream from Kafka

The problem arises because I already have a system, and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought about dumping the data into HDFS.
New data also arrives in the database every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL directly over JDBC rather than needing HDFS exports, as sketched below.
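To illustrate the JDBC tip, a minimal Spark job reading the MySQL table directly could look roughly like this (a sketch assuming the MySQL JDBC driver is on the classpath; the URL, table name, and credentials are placeholders):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MysqlJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mysql-reporting")
                .getOrCreate();

        // Read the transactional table straight from MySQL instead of an HDFS dump
        Dataset<Row> transactions = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/shop")
                .option("dbtable", "transactions")
                .option("user", "report_user")
                .option("password", "secret")
                .load();

        // Run the reporting query against the freshly read data
        transactions.createOrReplaceTempView("transactions");
        spark.sql("SELECT COUNT(*) FROM transactions").show();
    }
}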
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could take more than a few minutes).
However, if you use Debezium to get data into Kafka from MySQL, then you have the data centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.

Merge multi data sources to sink and keep up to date

My business has many tables in different MySQL instances, and now we want to build a search platform.
So we have to extract and join them into a wide-column table, insert it into Elasticsearch, and keep it up to date in ES.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?
We have an application that converts the MySQL binlog into change messages and delivers them to Kafka
That's a start.
You would then need to do the same with your "wide-column table", and try to put that into a Kafka KTable, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work here as well.
After you have the joined stream+table written into a new Kafka topic, you can use the Kafka Connect Elasticsearch connector, or again try Flink, Logstash, etc., to ingest into the search indices.
Once everything is running, any new inserts to either table will be reflected in Elasticsearch.
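As an illustration of the stream+table join, a minimal Kafka Streams sketch could look roughly like this (it assumes both topics are keyed by the same id and uses plain String values just to keep the example short; the topic names and the join logic are placeholders for your binlog change topics):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class WideRowJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wide-row-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Change events for the "orders" table, keyed by customer id
        KStream<String, String> orders = builder.stream("mysql.shop.orders");
        // Latest state of the "customers" table, also keyed by customer id
        KTable<String, String> customers = builder.table("mysql.shop.customers");

        // Enrich each order change with the current customer record and
        // write the joined "wide" row to a new topic for the Elasticsearch sink
        orders.join(customers, (order, customer) -> order + "," + customer)
              .to("wide-rows");

        new KafkaStreams(builder.build(), props).start();
    }
}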

How to stream sharded MySQL Tables CDC to Kafka Cluster

I have new sharded MySQL tables being created every day, with names like "output_table_[date]".
I want to capture CDC from them and stream it to my Kafka cluster, so what can I use for that?
I explored the Debezium connector for MySQL, but couldn't find any configuration that can take into account the new tables created every day. E.g., when creating the connector we have to provide "table.whitelist" for the tables we would like to capture CDC for, but with new tables being created every day, we would have to update that manually every day.
So can anyone help me with possible solutions to my use case?
You can capture all tables by simply omitting table.include.list and related options (by the way, your use of table.whitelist indicates you're on an older version of Debezium). Note that for Debezium 1.6 we're also planning to do a PoC for supporting changes to the filter configuration.
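For example, the filter part of the connector configuration could look roughly like this (a sketch; the database name and the table pattern are placeholders, and the include.list properties replace the deprecated whitelist ones in newer Debezium versions):
# capture every table in the database by listing only the database:
database.include.list=reporting
# or restrict capture to the daily tables with a regular expression:
# table.include.list=reporting.output_table_.*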

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipeline to warm up caches and update search engines.
Currently we do that with a rudimentary polling architecture, basically using SQL, but ideally we'd have a push architecture reading the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've already worked with some customers doing that, in Oracle. It also seems that LinkedIn makes heavy use of this technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish that in an agnostic way: https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that is already streaming the binlog from MySQL to Kinesis Streams: https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka: https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for that: your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services public documentation: "Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations."
Hope this helps.
The AWS DMS service offers data migration from a SQL database to Kinesis.
See: Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/