My company generates around 6 million records per day, and I have read that Hadoop is a good solution for handling large amounts of data. I found out how to load data from MySQL, but that exports the full database. Is there a way to keep data in sync between my operational MySQL DB and Hadoop?
One of the best solutions you can use is Debezium. Debezium is built on top of the Apache Kafka Connect API and provides connectors that monitor specific databases.
It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
The architecture will look something like this:
MySQL --> Debezium(Kafka Connect Plugin) --> Kafka Topic --> HDFS Sink
You can find more information and documentation about Debezium Here.
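As a rough sketch of the first hop in that pipeline, the snippet below builds the JSON body you would POST to a Kafka Connect worker to register a Debezium MySQL connector. The property names are the ones used by Debezium 1.x releases (2.x renames some of them, for example `topic.prefix` in place of `database.server.name`); the host names, credentials, server id and table list are placeholders assumed for illustration only.

```python
import json


def debezium_mysql_connector(name, host, user, password, server_name, tables):
    """Build the JSON body for registering a Debezium MySQL source connector."""
    config = {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        # The MySQL connector reads a single binlog, so it runs as one task.
        "tasks.max": "1",
        "database.hostname": host,
        "database.port": "3306",
        "database.user": user,
        "database.password": password,
        # Placeholder value; must be unique among the server's replication clients.
        "database.server.id": "184054",
        # Logical server name, used as the prefix of every topic name.
        "database.server.name": server_name,
        "table.include.list": ",".join(tables),
        # Debezium stores the schema history in a Kafka topic of its own.
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes." + server_name,
    }
    return {"name": name, "config": config}


# All values below are hypothetical.
payload = debezium_mysql_connector(
    name="mysql-cdc",
    host="mysql.internal",
    user="debezium",
    password="change-me",
    server_name="dbserver1",
    tables=["shop.orders", "shop.customers"],
)
print(json.dumps(payload, indent=2))
```

The printed document is what you would send to the Connect REST endpoint (by default `POST http://<connect-host>:8083/connectors`). Check the property names against the documentation for the Debezium version you actually install.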
Furthermore, Apache NiFi has a processor named CaptureChangeMySQL. You can design a NiFi flow like the one below to do this:
MySQL --> CaptureChangeMySQL(Processor) --> PutHDFS(Processor)
You can read more about CaptureChangeMySQL Here.
There are multiple solutions available; which one to choose depends on your architectural requirements and deployment setup.
Debezium:
Debezium is a distributed platform deployed via Apache Kafka Connect that can help in continuously monitoring the various databases of your system and let the applications stream every row-level change in the same order they were committed to the database. It turns your existing databases into event streams, whereby the applications can see and accordingly respond to each row-level change in the databases.
Kafka Connect is a framework and runtime for implementing and operating source connectors such as Debezium, which ingest data into Kafka, and sink connectors, which propagate data from Kafka topics into other systems.
For MySQL, Debezium's MySQL connector monitors and records all row-level changes in the databases on a MySQL server. The events for each table are recorded in a separate Kafka topic, so a client application can read the Kafka topics that correspond to the tables it is interested in and react to every row-level event it sees in those topics as required.
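To make the per-table topics and row-level events concrete, here is a minimal sketch of a consumer-side handler. It assumes the default Debezium topic naming (`<server>.<database>.<table>`) and the standard change-event envelope with `op`, `before` and `after` fields; the events are hand-written samples rather than real connector output, and a real consumer would read them from Kafka instead of a list.

```python
def topic_for(server_name, database, table):
    """Default Debezium MySQL topic name for one table."""
    return f"{server_name}.{database}.{table}"


def apply_change(replica, event, key="id"):
    """Apply one change event to a dict that mirrors the table by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, update: "after" holds the new row state
        row = event["after"]
        replica[row[key]] = row
    elif op == "d":
        # delete: only "before" is populated
        replica.pop(event["before"][key], None)
    return replica


# Hand-written sample events, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(topic_for("dbserver1", "shop", "orders"))
print(replica)
```

Because events arrive in the order they were committed, replaying them one by one is enough to keep the copy consistent with the source table.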
Once the data of interest is available in topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files in a variety of formats, depending on your use case. The connector periodically polls data from Apache Kafka and writes it to HDFS. It also integrates with Hive: when Hive integration is enabled, the connector automatically creates an external partitioned Hive table for each Kafka topic and updates the table according to the data available in HDFS.
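A sink configuration along these lines could be generated as follows. This is a sketch only: the property names are those of the Confluent HDFS 2 sink connector, while the topic, NameNode URL and metastore URI are placeholder values assumed for the example.

```python
import json


def hdfs_sink_connector(name, topics, hdfs_url, metastore_uri, flush_size=1000):
    """Build the JSON body for registering an HDFS sink with Hive integration."""
    config = {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        "topics": ",".join(topics),
        "hdfs.url": hdfs_url,
        # Number of records written to a file before it is committed.
        "flush.size": str(flush_size),
        # Let the connector create and update external Hive tables.
        "hive.integration": "true",
        "hive.metastore.uris": metastore_uri,
        # Hive integration needs a schema compatibility mode other than NONE.
        "schema.compatibility": "BACKWARD",
    }
    return {"name": name, "config": config}


# All values below are hypothetical.
sink = hdfs_sink_connector(
    name="orders-to-hdfs",
    topics=["dbserver1.shop.orders"],
    hdfs_url="hdfs://namenode:8020",
    metastore_uri="thrift://metastore:9083",
)
print(json.dumps(sink, indent=2))
```

A smaller `flush.size` makes new rows visible in HDFS sooner at the cost of producing more, smaller files.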
Maxwell's daemon:
Maxwell's daemon is a CDC (Change Data Capture) application that reads MySQL binlogs (events from the MySQL database) and writes row updates as JSON to Kafka or other streaming platforms. Once the data of interest is available in Kafka topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files.
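For a feel of what a downstream consumer sees, the sketch below parses two hand-written messages in Maxwell's JSON shape (`database`, `table`, `type`, `data`, and `old` for updates). The database, table and values are made up for illustration.

```python
import json


def describe(raw):
    """Summarize one Maxwell-style JSON message."""
    event = json.loads(raw)
    kind = event["type"]
    target = f'{event["database"]}.{event["table"]}'
    if kind == "update":
        # "old" carries the previous values of the columns that changed.
        changed = sorted(event.get("old", {}))
        return f"update {target} changed={changed}"
    return f"{kind} {target}"


# Hand-written sample messages.
insert_msg = json.dumps({
    "database": "shop", "table": "orders", "type": "insert",
    "ts": 1449786310, "data": {"id": 1, "status": "new"},
})
update_msg = json.dumps({
    "database": "shop", "table": "orders", "type": "update",
    "ts": 1449786341, "data": {"id": 1, "status": "paid"},
    "old": {"status": "new"},
})

print(describe(insert_msg))
print(describe(update_msg))
```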
NiFi:
Apache NiFi helps automate the flow of data between systems. An Apache NiFi CDC (Change Data Capture) flow also uses the MySQL binlogs (via CaptureChangeMySQL) to create a copy of a table and keep it in sync with row-level changes to the source. The output can in turn be handed to NiFi's PutHDFS processor to write the data to HDFS.
Related
I have developed a billing application using Grails 3. The requirement is to use this application both online and offline, for when the internet connection is interrupted. While offline, data is stored in a local MySQL DB, and when I switch back to online, the offline MySQL data should be pushed to my remote MySQL DB. What is the best strategy to implement this use case? Is there any Grails plugin available for this scenario?
One of my friends suggested the following use-cases to consider:
If you assume the amount of time the system is going to be offline is fairly low, you can hold the data in-memory and push when the system is back online.
If you think the data can get large with offline storage, you can store it in a flat file, then read that file and do the writes to your remote MySQL DB.
You can also use MySQL's replication capabilities to replicate your data between the local and remote MySQL instances. You will have to invest some time in learning how to set this up.
Use a message queue mechanism to queue your queries when offline, and then write them to the remote DB in the same sequence when online.
Use an in-memory or local H2 database if your SQL queries are not too complex. Write a job that periodically writes to your remote instance.
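The queue-based options above can be sketched as a small local "outbox". This is an illustration in Python using SQLite as the local store, not a Grails plugin; the `Outbox` class and the `send` callback are hypothetical names, and `send` stands in for whatever writes one record to the remote database.

```python
import json
import sqlite3


class Outbox:
    """Durable local queue: records are replayed in insertion order."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )

    def enqueue(self, record):
        with self.db:  # commits on success
            self.db.execute(
                "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),)
            )

    def pending(self):
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self, send):
        """Send queued records oldest first; stop at the first connection error."""
        sent = 0
        rows = self.db.execute(
            "SELECT seq, payload FROM outbox ORDER BY seq"
        ).fetchall()
        for seq, payload in rows:
            try:
                send(json.loads(payload))
            except ConnectionError:
                break  # still offline: keep this record and the rest queued
            with self.db:
                self.db.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
            sent += 1
        return sent


outbox = Outbox()
outbox.enqueue({"invoice": 1, "total": 40})
outbox.enqueue({"invoice": 2, "total": 15})

delivered = []
sent_count = outbox.flush(delivered.append)  # pretend the remote is reachable
print(sent_count, outbox.pending())
```

A record is deleted only after `send` returns, so a crash between the two steps can deliver it twice; the remote write should therefore be idempotent (for example, keyed on the invoice number).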
I have installed CDH 5.16 on a RHEL 7 server and installed Kafka separately.
I am trying to load data from MySQL into HDFS or a Hive table in real time (a CDC approach). That is, if some data is updated or added in a MySQL table, it should be immediately reflected in HDFS or the Hive table.
The approach I have come up with:
Use Kafka Connect to connect to the MySQL server and push table data to a Kafka topic,
and write consumer code in Spark Streaming which reads the data from the topic
and stores it in HDFS.
One problem with this approach is that the Hive table on top of these files has to
be refreshed periodically for the updates to be reflected.
I also came to know of the Kafka-Hive integration in HDP 3.1. Unfortunately I am using Hadoop 2.6.0, so I can't leverage this feature.
Is there any better way to achieve this?
I am using Hadoop 2.6.0 and CDH 5.16.1
I'm following the quick start tutorials from here quick-start-kafka-connect
This tutorial shows how to stream MySQL database table changes into a Kafka topic.
The only steps are to download everything, add a /tmp/kafka-connect-jdbc-source.json file with some config properties, and start it.
How does this work in the background?
1 : Does it create a connection to the database and monitor tables at specific intervals of time? OR
2 : Does it use a replication log? (I don't know how this works)
3 : Is this the same mechanism for MySQL and PostgreSQL?
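Debezium reads the database's replication log rather than polling tables: the binlog for MySQL, the write-ahead log for Postgres, and the oplog for MongoDB.
Kafka Connect JDBC by Confluent (which you've linked to) instead polls the tables with queries at a configurable time interval, and that configuration is shared by all JDBC-compliant connections, MySQL and Postgres included.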
For incremental query modes that use timestamps, the source connector uses a configuration timestamp.delay.interval.ms ...
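To illustrate the polling approach, the sketch below simulates a timestamp-based incremental query against an in-memory SQLite table. It is not the connector's actual code: `poll_changes`, the `orders` table and the integer timestamps are assumptions made for the example, and the `delay` argument plays the role that `timestamp.delay.interval.ms` plays in the connector.

```python
import sqlite3


def poll_changes(conn, last_ts, now, delay=0):
    """Return rows modified after last_ts and before (now - delay), oldest first."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at < ? ORDER BY updated_at",
        (last_ts, now - delay),
    ).fetchall()
    # The largest timestamp seen becomes the offset for the next poll.
    new_offset = rows[-1][2] if rows else last_ts
    return rows, new_offset


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 100), (2, "new", 105)]
)

first, offset = poll_changes(conn, last_ts=0, now=200)

# Between polls: one row is updated, another is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 210 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

second, offset = poll_changes(conn, last_ts=offset, now=300)
print(first)
print(second)
```

The second poll returns the updated row but nothing for the deleted one, which is the main limitation of query-based polling compared with reading the replication log.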
replication log? (i don't know how this works)
You can find the Debezium guide here, but this mechanism differs for Mongo, Postgres, MySQL, MSSQL, Oracle, etc.
We have a working application with one application server and a 3-node Cassandra cluster. Recently we got a new requirement to import large CSV files into our existing database. Rows in the CSV need to be transformed before being saved in Cassandra. Our infrastructure is deployed on Amazon AWS.
We have a couple of questions:
It looks to us like Spark is the right tool for the job, since it has the Spark Cassandra Connector and the Spark CSV plugin. Are we correct?
Maybe a newbie Spark question, but in our deployment scenario, where should the importer app be deployed? Our idea is to have the Spark Master on one of the DB nodes, Spark workers spread across the 3 database nodes, and the importer application on the same node as the master. It would be perfect to have some command line interface to import CSV, which can later evolve into an API/web interface.
Can we put the importer application on the application server, and what will the network penalty be?
Can we use Spark in this scenario for Cassandra JOINs as well, and how can we integrate it with the existing application, which already uses the regular DataStax Java driver along with application-side joins where needed?
First of all, keep in mind that the Spark Cassandra Connector will only be useful for data locality if you're loading your data from Cassandra, not from an external source. So, for loading a CSV file, you'll have to transport it to your Spark workers, using shared storage, HDFS, etc. This means that wherever you place your importer application, it will stream the data to your Spark workers.
Now to address your points:
You're correct about Spark, but incorrect about Spark Cassandra Connector, as it's only useful if you're loading data from Cassandra (which might be the case for #4 when you need to perform Joins between external data and Cassandra data), otherwise it won't give you any significant help.
Your importer application will be deployed to your cluster. In the scenario you described, this is a stand-alone Spark Cluster. So you'll need to package your application, then use the spark-submit command on your master node to deploy your application. Using a command line parameter for your CSV file location, you can deploy and run your application as a normal command line tool.
As described in #2, your importer application will be deployed from your master node to all your workers. What matters here is where your CSV file is. A simple way to deploy it is by splitting the file across your worker nodes (using the same local file path), and load it as a local file. But be aware that you'd lose your local CSV part if the node dies. For more reliable distribution you can place your CSV file on an HDFS cluster then read from there.
Using Spark Cassandra Connector, you can load your data from Cassandra into RDDs on the corresponding local nodes, then using the RDDs you created by loading your CSV data, you can perform Joins and of course write the result back to Cassandra if you need to. You can use the Spark Cassandra Connector as a higher level tool to perform both the reading and writing, you wouldn't need to use the Java Driver directly (as the connector is built on top of it anyway).
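A command-line importer along the lines of points 2 to 4 might look like the sketch below. It uses PySpark rather than the Java or Scala API purely for brevity, and it assumes the Spark Cassandra Connector is on the classpath; the `name` column and the trim step are hypothetical stand-ins for your real transformation.

```python
import argparse


def parse_args(argv):
    """Command-line interface: where the CSV lives and where the rows go."""
    parser = argparse.ArgumentParser(description="Import a CSV file into Cassandra")
    parser.add_argument("--csv-path", required=True,
                        help="location readable by every worker, e.g. an HDFS path")
    parser.add_argument("--keyspace", required=True)
    parser.add_argument("--table", required=True)
    return parser.parse_args(argv)


def run(args):
    # Imported here so that argument parsing works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-importer").getOrCreate()
    df = spark.read.option("header", "true").csv(args.csv_path)

    # Placeholder transformation on a hypothetical "name" column.
    df = df.withColumn("name", F.trim(F.col("name")))

    # Write through the Spark Cassandra Connector's DataFrame source.
    (df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace=args.keyspace, table=args.table)
        .mode("append")
        .save())
    spark.stop()


# Example invocation with made-up values; pass the result to run() under spark-submit.
args = parse_args(["--csv-path", "hdfs:///imports/customers.csv",
                   "--keyspace", "shop", "--table", "customers"])
print(args.csv_path, args.keyspace, args.table)
```

You would submit such a script with spark-submit from the master node, passing the CSV location as an argument and pointing `spark.cassandra.connection.host` at one of your Cassandra nodes.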
I need to migrate Couchbase data into HDFS, but the DB and Hadoop clusters are not accessible to each other, so I cannot use Sqoop in the recommended way. Is there a way to import Couchbase data into local files (instead of HDFS) using Sqoop? If so, I can do that, transfer the local files using FTP, and then use Sqoop again to load them into HDFS.
If that's a bad solution, is there any other way I can export all the Couchbase data to local files? Creating views on this Couchbase cluster is a difficult task and I would like to avoid it.
Alternative solution (perhaps not as elegant, but it works):
Use the Couchbase backup utility, cbbackup, to save all data locally.
Transfer the backup files to a network host from which HDFS is reachable.
Install Couchbase in the network segment where HDFS is reachable and use the Couchbase restore-from-backup procedure to populate that instance.
Use Sqoop (in the recommended way) against that Couchbase instance, which has access to HDFS.
You can use the cbbackup utility that comes with the Couchbase installation to export all data to backup files. By default the backups are actually stored in SQLite format, so you can move them to your Hadoop cluster and then use any JDBC SQLite driver to import the data from each *.cbb file individually with Sqoop. I actually wrote a blog about this a while ago, you can check it out.
To get you started, here's one of the many JDBC SQLite drivers out there.
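Before wiring up Sqoop, it can help to look inside a backup file and see which tables it contains. The sketch below does that with Python's built-in SQLite module and makes no assumption about the .cbb schema; the demo file and its `docs` table are a made-up stand-in created only so the example runs on its own.

```python
import os
import sqlite3
import tempfile


def summarize_sqlite(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Stand-in file for the demo; a real run would point at an actual *.cbb file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.cbb")
demo = sqlite3.connect(demo_path)
demo.execute("CREATE TABLE docs (key TEXT, val BLOB)")  # hypothetical table
demo.execute("INSERT INTO docs VALUES ('user::1', x'7b7d')")
demo.commit()
demo.close()

summary = summarize_sqlite(demo_path)
print(summary)
```

The table and column names found this way are what you would reference in the Sqoop import query.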
You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store it in any file system you like. The CouchbaseKafka adapter uses the TAP protocol to push data to Kafka.
https://github.com/paypal/couchbasekafka