Is there a sink connector for Debezium to a MySQL database?
I'm able to connect to the source and can watch the topic with the schema and all the changes on the source database. A sample message from the schema topic for one table is shown below. Please help me understand how to proceed from here. Why am I converting every change to JSON format if the CDC queries are available on the source DB?
{"schema":{"type":"struct","fields":[{"type":"int32","optional":false,"field":"ID"}],"optional":false,"name":"SourceDBChandra.world_x.city.Key"},"payload":{"ID":5}} {"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"ID"},{"type":"string","optional":false,"default":"","field":"Name"},{"type":"string","optional":false,"default":"","field":"CountryCode"},{"type":"string","optional":false,"default":"","field":"District"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"Info"}],"optional":true,"name":"SourceDBChandra.world_x.city.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"ID"},{"type":"string","optional":false,"default":"","field":"Name"},{"type":"string","optional":false,"default":"","field":"CountryCode"},{"type":"string","optional":false,"default":"","field":"District"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"Info"}],"optional":true,"name":"SourceDBChandra.world_x.city.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":true,"field":"table"},{"type":"int64","optional":false,"field":"server_id"},{"type":"string","optional":true,"field":"gtid"},{"type":"string","optional":false,"field":"file"},{"type":"int64","optional":false,"field":"pos"},{"type":"int32","optional":false,"field":"row"},{"type":"int64","optional":true,"field":"thread"},{"type":"string","optional":true,"field":"query"}],"optional":false,"name":"io.debezium.connector.mysql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"field":"transaction"}],"optional":false,"name":"SourceDBChandra.world_x.city.Envelope"},"payload":{"before":{"ID":5,"Name":"Amsterdam","CountryCode":"NLA","District":"Noord-Holland","Info":"{"Population":731200}"},"after":{"ID":5,"Name":"Amsterdam","CountryCode":"NLD","District":"Noord-Holland","Info":"{"Population":731200}"},"source":{"version":"1.9.5.Final","connector":"mysql","name":"SourceDBChandra","ts_ms":1658210588000,"snapshot":"false","db":"world_x","sequence":null,"table":"city","server_id":1,"gtid":null,"file":"binlog.000002","pos":958398,"row":0,"thread":49,"query":null},"op":"u","ts_ms":1658210588985,"transaction":null}}
My company generates around 6 million records per day, and I have seen that Hadoop is a good solution for handling large amounts of data. I found how to load data from MySQL, but it exports the full database. Is there a way to keep the data in sync between my operational MySQL DB and Hadoop?
One of the best solutions you can use is Debezium. Debezium is built on top of the Apache Kafka Connect API and provides connectors that monitor specific databases.
It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
The architecture will look something like this:
MySQL --> Debezium(Kafka Connect Plugin) --> Kafka Topic --> HDFS Sink
You can find more information and documentation about Debezium at https://debezium.io/.
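As a rough illustration, registering a Debezium MySQL source connector with Kafka Connect could look something like the sketch below. This assumes a Debezium 1.x-style configuration posted to the Kafka Connect REST API; the hostname, credentials, server id, logical server name ("dbserver1"), and database name ("inventory") are placeholders to replace with your own values:

{
  "name": "mysql-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz-password",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

With this in place, each captured table gets its own topic (for example dbserver1.inventory.customers), which a sink connector such as the HDFS Sink can then consume.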
Furthermore, Apache NiFi has a processor named CaptureChangeMySQL. You can design a NiFi flow like the one below to do this:
MySQL --> CaptureChangeMySQL(Processor) --> PutHDFS(Processor)
You can read more about CaptureChangeMySQL in the Apache NiFi documentation.
There are multiple solutions available, and you may need to choose one as per your architectural requirements or deployment setup.
Debezium:
Debezium is a distributed platform deployed via Apache Kafka Connect that can help in continuously monitoring the various databases of your system and lets applications stream every row-level change in the same order it was committed to the database. It turns your existing databases into event streams, whereby applications can see and accordingly respond to each row-level change in the databases.
Kafka Connect is a framework and runtime for implementing and operating source connectors such as Debezium, which ingest data into Kafka, and sink connectors, which propagate data from Kafka topics into other systems.
For MySQL, Debezium's MySQL connector can help in monitoring and recording all of the row-level changes in the databases on a MySQL server. All of the events for each table are recorded in a separate Kafka topic, and a client application can read the Kafka topics that correspond to the database tables it is interested in following and react to every row-level event it sees in those topics as per the requirement.
Once the data of interest is available in topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files in a variety of formats as per your use case or requirement, and it integrates with Hive when Hive integration is enabled. This connector periodically polls data from Apache Kafka and writes it to HDFS. It can also automatically create an external Hive partitioned table for each Kafka topic and update that table according to the data available in HDFS.
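For illustration, a minimal HDFS Sink connector configuration with Hive integration enabled might look like the sketch below. This assumes Confluent's kafka-connect-hdfs plugin is installed on the Connect workers; the topic name, HDFS URL, and Hive metastore URI are placeholders:

{
  "name": "hdfs-sink-connector",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "dbserver1.inventory.customers",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore:9083",
    "schema.compatibility": "BACKWARD"
  }
}

Here flush.size controls how many records are written per HDFS file, while hive.integration and hive.metastore.uris turn on the automatic creation of the external Hive table mentioned above.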
Maxwell's daemon:
Maxwell's daemon is a CDC (Change Data Capture) application that reads MySQL binlogs (events from the MySQL database) and writes row updates as JSON to Kafka or other streaming platforms. Once the data of interest is available in Kafka topics, the Kafka Connect HDFS Sink connector can again be used to export the data from those topics to HDFS files.
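To give an idea of the output, Maxwell publishes each row change as a JSON message roughly of the following shape (the database, table, and values below are purely illustrative):

{
  "database": "inventory",
  "table": "customers",
  "type": "update",
  "ts": 1658210588,
  "data": { "id": 5, "name": "Amsterdam Branch", "country_code": "NLD" },
  "old": { "country_code": "NLA" }
}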
NiFi:
Apache NiFi helps in automating the flow of data between systems. An Apache NiFi CDC (Change Data Capture) flow also uses MySQL binlogs (via CaptureChangeMySQL) to create a copy of a table and ensure that it stays in sync with row-level changes to the source. This in turn can be fed to NiFi's PutHDFS processor for writing the data to HDFS.
I have to put my CDC changes from an Aurora MySQL DB into an MSK Kafka topic.
I think there is no built-in mechanism in AWS, like DMS, that will sink all CDC from Aurora into an MSK topic, and in their docs they have not mentioned any support for that.
So I am only left with using something open source.
Before using that, I have a few questions; please help me with them.
I was doing some research and everywhere I see kafka-connect-jdbc for this. So first, is this open source with a free license?
I have seen the Debezium one as well, which connects MySQL to MSK.
The Aurora record will be text and I need to put the record into MSK as JSON, so do I have to use a schema registry? Is the Schema Registry free-licensed or open source, or does it come with MSK Kafka?
If I have to use Kafka Connect from Confluent, or Debezium, I need an EC2 instance. What do I need to install on it? Only Kafka, or Confluent and Debezium along with Kafka as well?
Please suggest something that is free-licensed and open source.
1) If you want to use Confluent Platform components beyond Zookeeper, the Apache Kafka brokers, and base Kafka Connect (for example, the JDBC connector plugin), please read https://www.confluent.io/confluent-community-license-faq/
2) Debezium should work fine. It's under the Apache 2.0 License; you can use it with plain Apache Kafka, without the rest of the Confluent Platform.
3a) Schema Registry stores Avro schemas, not plain JSON, so you don't require it here; Kafka Connect's built-in JsonConverter can write plain JSON to MSK (see the sketch after this list). 3b) See 1) for Schema Registry licensing.
4) You will need EC2, or you can run Docker Kafka Connect / Debezium containers via ECS / EKS.
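Regarding point 3, plain JSON records (with no embedded schema and no Schema Registry) can be produced with Kafka Connect's built-in JsonConverter. The sketch below shows the relevant properties as they would appear inside a connector's "config" block; the same keys also work in the Connect worker properties file:

{
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter.schemas.enable": "false",
  "value.converter.schemas.enable": "false"
}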
There are still other options capable of CDC into Kafka, some of which I know are open source or even commercially supported, but listing them here is too broad.
Regarding built-in AWS services, you can trigger Lambda functions from Aurora tables to do whatever you want:
https://aws.amazon.com/blogs/database/capturing-data-changes-in-amazon-aurora-using-aws-lambda/
It is possible that Lambda has, or will eventually have, MSK integration.
I'm looking for an open source tool to replicate MySQL to Hadoop. I found two options, but:
Sqoop, Flume: do not support real-time UPDATE and DELETE
Tungsten: closed source and paid
So what other tools meet that requirement?
To the best of my knowledge, Kafka can be useful for your case.
kafka-mysql-connector is a plugin that allows you to easily replicate MySQL changes to Apache Kafka, and from Kafka you can load into HDFS or Hive.
For a MySQL->Kafka solution based on Kafka Connect, check out the excellent Debezium project.
http://debezium.io/
For a MySQL->Kafka solution that is a standalone application, check out the excellent Maxwell project, upon which kafka-mysql-connector was based.
http://maxwells-daemon.io/
Hope this helps.
(Note: I have not used this solution, but you can give it a try.)