How to connect Aurora MySQL to MSK using open-source Kafka Connect

I have to push CDC changes from an Aurora MySQL database to an MSK Kafka topic.
As far as I can tell there is no built-in AWS mechanism, like DMS, that will sink all CDC from Aurora to an MSK topic, and the docs don't mention any support for that.
So I am only left with using something open source.
I have a few questions around that, please help me with this.
I was doing some searching and everywhere I see kafka-connect-jdbc for this. So first, is this open source / free to license?
I have seen the Debezium one as well, which connects MySQL to MSK.
The Aurora records will be text and I need to put the records into MSK as JSON, so do I have to use a Schema Registry? Is Schema Registry free / open source, or does it come with MSK Kafka?
If I have to use Kafka Connect from Confluent or Debezium, I need an EC2 instance. What do I need to install on it? Only Kafka, or Kafka plus Confluent and Debezium as well?
Please suggest something that is free to license and open source.

1) If you want to use Confluent Platform components beyond ZooKeeper, the Apache Kafka brokers, and base Kafka Connect (for example the JDBC connector plugin), please read https://www.confluent.io/confluent-community-license-faq/
2) Debezium should work fine. It's under the Apache 2.0 License; you can use Apache Kafka with it and skip the rest of the Confluent Platform.
3a) Schema Registry only stores Avro schemas, not JSON, therefore you don't require it (see the worker config sketch after this list). 3b) See 1 for Schema Registry licensing.
4) You will need EC2, or run Kafka Connect / Debezium Docker containers via ECS / EKS.
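To expand on 3a, here is a minimal sketch of a Kafka Connect distributed worker configuration that writes plain JSON using the built-in JsonConverter, with no Schema Registry involved. The broker addresses, group id, and plugin path are placeholders, not values from any real cluster:

    # connect-distributed.properties (sketch, placeholder values)
    # MSK bootstrap brokers (placeholders)
    bootstrap.servers=b-1.example.kafka.us-east-1.amazonaws.com:9092,b-2.example.kafka.us-east-1.amazonaws.com:9092
    group.id=cdc-connect-cluster
    # Built-in JSON converters; no Schema Registry required
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    # Set to true if you want each JSON record to embed its schema
    key.converter.schemas.enable=false
    value.converter.schemas.enable=false
    # Internal topics Connect uses to store its own state
    offset.storage.topic=connect-offsets
    config.storage.topic=connect-configs
    status.storage.topic=connect-status
    # Directory where the Debezium connector jars are unpacked
    plugin.path=/opt/connect-plugins

With this in place, the only things to install on the EC2 instance from point 4 are Apache Kafka (for the Connect scripts) and the Debezium connector plugin.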
There are still other options capable of CDC into Kafka, some of which I know are open source or even commercially supported, but listing them here is too broad.
Regarding built-in AWS services, you can trigger Lambda functions on Aurora tables to do whatever you want:
https://aws.amazon.com/blogs/database/capturing-data-changes-in-amazon-aurora-using-aws-lambda/
It is also possible that Lambda has, or will eventually have, an MSK integration.

Related

How to connect Hive to MySQL using Kafka connectors (i.e. source and sink)?

I have to make Kafka source and sink connectors for connecting Hive to MySQL.
I am not able to find anything on the above question. I have also looked at the Confluent website.
Hive has a JDBC driver, therefore try using the JDBC Source and Sink connectors for both sides; a rough sketch follows.
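For example, a JDBC source reading a Hive table over HiveServer2 might look roughly like the sketch below. Treat it as an assumption-laden starting point: the host, port, database, and table names are placeholders, and the generic JDBC connector may not handle every corner of Hive's SQL dialect:

    name=hive-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    # HiveServer2 JDBC URL (placeholder host, port, and database)
    connection.url=jdbc:hive2://hive-host:10000/default
    # Re-read the whole table on each poll; Hive tables rarely have an auto-increment key
    mode=bulk
    table.whitelist=my_hive_table
    topic.prefix=hive-
    poll.interval.ms=60000

The sink side would use io.confluent.connect.jdbc.JdbcSinkConnector with a jdbc:mysql://... connection.url instead.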
Alternatively, Spark can easily read and write to both locations; Kafka is not necessary.

MySQL CDC using Apache Kafka and Kafka Connect locally, without Docker or Confluent

What is the best way to run a Kafka cluster on a server for CDC, without Docker and Confluent Cloud?
How do I connect a MySQL DB to a Kafka cluster running locally and create a sink connection to another MySQL DB?
I see the REST API is the best way, but I am not sure how to do it.
Do I need to run the configuration on the producer end? I need some idea of how to implement this.
Reading from MySQL does not require Confluent products.
Debezium is open source under the Apache 2.0 License. So are Kafka and Kafka Connect (which Debezium runs on).
Kafka Connect has a REST API, and yes, that is the way to use it; a couple of example calls are shown below.
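For instance, assuming a Connect worker running in distributed mode on its default REST port 8083 (the host and connector name here are placeholders):

    # List the connector plugins installed on the worker and the connectors currently running
    curl http://localhost:8083/connector-plugins
    curl http://localhost:8083/connectors
    # Check the status of a (hypothetical) connector named "inventory-connector"
    curl http://localhost:8083/connectors/inventory-connector/status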
The Debezium tutorial uses MySQL and Docker, but all steps can be repeated without Docker:
Download Kafka
Download Debezium Connectors and MySQL drivers and setup Connect classpath + plugins (follow the Debezium installation docs)
Start Zookeeper and Kafka and Connect Distributed Server (follow the Kafka quickstart and Connect docs above, and use distributed mode)
Reconfigure the database address & HTTP-POST the JSON mentioned in the tutorial to start the CDC connector (a sketch of such a request follows these steps)
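As a minimal sketch of that last step, the request looks roughly like this. The hostname, credentials, and names are placeholders, and the exact property list should be taken from the Debezium MySQL connector documentation for the version you download:

    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
      "name": "inventory-connector",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz-password",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",
        "database.include.list": "inventory",
        "database.history.kafka.bootstrap.servers": "localhost:9092",
        "database.history.kafka.topic": "schema-changes.inventory"
      }
    }'

This asks Debezium to read the MySQL binlog for the inventory database and publish one topic per table (e.g. dbserver1.inventory.customers).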
To sink the data to another database, you will want to try Confluent's JDBC Sink connector; a sketch follows.
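A minimal sketch of such a sink, with placeholder target host, credentials, and topic name; note that Debezium's change-event envelope usually needs to be flattened (for example with Debezium's ExtractNewRecordState transform) before a plain JDBC sink can write it:

    name=mysql-jdbc-sink
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    # Target MySQL database (placeholders)
    connection.url=jdbc:mysql://target-mysql:3306/targetdb
    connection.user=sink_user
    connection.password=sink_password
    topics=dbserver1.inventory.customers
    auto.create=true
    insert.mode=upsert
    pk.mode=record_key
    # Flatten the Debezium change-event envelope into a plain row
    transforms=unwrap
    transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState

Keep in mind that the JDBC Sink connector itself falls under the Confluent Community License discussed in the first answer.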

How to connect MySQL to Kafka cluster

I am trying to learn Kafka and I want to set up a small pipeline on my PC that will take data from MySQL and feed it into a Kafka cluster.
From the small amount of research that I did, I saw that the best way to do this is a Kafka connector.
Can someone please help with tips or information about how to implement the above logic?
The easiest way would be to use the JDBC Source connector to pull data from your MySQL server into a Kafka topic for further processing. See the Confluent JDBC source connector documentation, and review the various configuration options the connector offers to better align your data for subsequent processing.
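For example, a minimal incrementing-id configuration might look like the sketch below; the connection details, table, and column names are placeholders:

    name=mysql-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    # Local MySQL instance (placeholder credentials)
    connection.url=jdbc:mysql://localhost:3306/mydb
    connection.user=source_user
    connection.password=source_password
    table.whitelist=orders
    # Pull only new rows, tracked by an auto-increment column
    mode=incrementing
    incrementing.column.name=id
    topic.prefix=mysql-
    poll.interval.ms=5000

Posting this configuration to the Connect REST API (or loading it with the standalone worker) produces a topic named mysql-orders containing one record per new row.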

How to feed hadoop periodically

My company generates around 6 million records per day, and I have seen that Hadoop is a good solution for handling large amounts of data. I found out how to load data from MySQL, but it exports the full database. Is there a way to keep data in sync between my operational MySQL DB and Hadoop?
One of the best solutions you can use is Debezium. Debezium is built on top of the Apache Kafka Connect API and provides connectors that monitor specific databases.
It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
The architecture will be something like this:
MySQL --> Debezium(Kafka Connect Plugin) --> Kafka Topic --> HDFS Sink
You can find more information and documentation about Debezium Here.
Furthermore, Apache NiFi has a processor named CaptureChangeMySQL. You can design a NiFi flow like the one below to do this:
MySQL --> CaptureChangeMySQL(Processor) --> PutHDFS(Processor)
You can read more about CaptureChangeMySQL Here.
There are multiple solutions available; which one to choose depends on your architectural requirements and deployment setup.
Debezium:
Debezium is a distributed platform deployed via Apache Kafka Connect that continuously monitors the databases of your system and lets applications stream every row-level change in the same order in which it was committed to the database. It turns your existing databases into event streams, so applications can see and respond to each row-level change in the databases.
Kafka Connect is a framework and runtime for implementing and operating source connectors such as Debezium, which ingest data into Kafka, and sink connectors, which propagate data from Kafka topics into other systems.
In the case of MySQL, Debezium's MySQL connector monitors and records all of the row-level changes in the databases on a MySQL server. The events for each table are recorded in a separate Kafka topic, and client applications read the topics that correspond to the tables they are interested in and react to every row-level event they see, as the requirement dictates.
Once the data of interest is available in topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files in a variety of formats, and it integrates with Hive when that integration is enabled. The connector periodically polls data from Kafka and writes it to HDFS; it can also automatically create an external Hive partitioned table for each Kafka topic and update the table as new data arrives in HDFS. A sketch of such a sink follows.
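A minimal sketch of an HDFS sink with Hive integration enabled; the namenode address, Hive metastore URI, and topic names below are placeholders:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    # One Debezium topic per table, e.g. dbserver1.inventory.customers
    topics=dbserver1.inventory.customers,dbserver1.inventory.orders
    # Placeholder namenode and Hive metastore addresses
    hdfs.url=hdfs://namenode:8020
    hive.integration=true
    hive.metastore.uris=thrift://hive-metastore:9083
    # Write a file to HDFS after this many records per topic partition
    flush.size=1000
    schema.compatibility=BACKWARD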
Maxwell's daemon:
Maxwell's daemon is a CDC (Change Data Capture) application that reads MySQL binlogs (events from the MySQL database) and writes row updates as JSON to Kafka or other streaming platforms. Once the data of interest is available in Kafka topics, the Kafka Connect HDFS Sink connector can again be used to export the data from Kafka topics to HDFS files. A sketch of running Maxwell follows.
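Running Maxwell against a local MySQL server and Kafka broker might look roughly like this; the credentials and addresses are placeholders, and the exact flags should be checked against the Maxwell documentation:

    # Stream MySQL binlog events as JSON into the "maxwell" Kafka topic
    bin/maxwell \
      --user=maxwell \
      --password=maxwell-password \
      --host=mysql-host \
      --producer=kafka \
      --kafka.bootstrap.servers=localhost:9092 \
      --kafka_topic=maxwell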
NiFi:
Apache NiFi helps in automating the flow of data between systems. An Apache NiFi CDC (Change Data Capture) flow also uses the MySQL binlog (via CaptureChangeMySQL) to create a copy of a table and ensure that it is in sync with row-level changes to the source. The output can in turn be passed to NiFi's PutHDFS processor to write the data to HDFS.

How the Kafka connector works for PostgreSQL and MySQL databases

I'm following the quick-start tutorial from here: quick-start-kafka-connect
The tutorial shows how to stream MySQL database table changes into a Kafka topic.
The only steps are to download everything, add a /tmp/kafka-connect-jdbc-source.json file with some config properties, and start it.
How does this work in the background?
1: Does it create a connection with the database and monitor tables at specific intervals of time? OR
2: Does it use the replication log? (I don't know how this works.)
3: Is the mechanism the same for MySQL and PostgreSQL?
Debezium monitors the database's transaction log (the binlog for MySQL, the WAL for PostgreSQL, the oplog for MongoDB), so it does not poll the tables.
Kafka Connect JDBC by Confluent (which you've linked to) can use a time-interval, and that configuration is shared by all JDBC-compliant connections, MySQL and Postgres included.
For incremental query modes that use timestamps, the source connector uses a configuration timestamp.delay.interval.ms ...
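For example, a timestamp-based (query-based, not log-based) source configuration might look like this sketch; the connection details, table, and column names are placeholders:

    name=jdbc-source-timestamp
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    # Works the same way for MySQL; only the JDBC URL changes
    connection.url=jdbc:postgresql://localhost:5432/mydb
    connection.user=source_user
    connection.password=source_password
    table.whitelist=orders
    # Query-based CDC: re-query rows whose updated_at timestamp or id advanced
    mode=timestamp+incrementing
    timestamp.column.name=updated_at
    incrementing.column.name=id
    # Wait this long after a row's timestamp before picking it up
    timestamp.delay.interval.ms=10000
    poll.interval.ms=5000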
Regarding "the replication log? (I don't know how this works)": you can find the Debezium guide here, but this mechanism differs for MongoDB, PostgreSQL, MySQL, SQL Server, Oracle, etc.