Real-time update of data (CDC approach) from MySQL to HDFS or Hive table

I have installed CDH 5.16 on a RHEL 7 server and installed Kafka separately.
I am trying to load data from MySQL into HDFS or a Hive table on a real-time basis (CDC approach). That is, if some data is updated or added in a MySQL table, it should be reflected in HDFS or the Hive table immediately.
The approach I have come up with:
Use Kafka Connect to connect to the MySQL server and push table data to a Kafka topic, and write a consumer in Spark Streaming which reads the data from the topic and stores it in HDFS.
One problem with this approach is that the Hive table on top of these files has to be refreshed periodically for the updates to be reflected.
I also came to know of the Kafka-Hive integration in HDP 3.1. Unfortunately I am using Hadoop 2.6.0, so I can't leverage this feature.
Is there any other, better way to achieve this?
I am using Hadoop 2.6.0 and CDH 5.16.1.

Related

How to feed Hadoop periodically

My company generates around 6 million records per day, and I have seen that Hadoop is a good solution for handling large amounts of data. I found how to load data from MySQL, but it exports the full database. Is there a way to keep data in sync between my operational MySQL DB and Hadoop?
One of the best solutions you can use is Debezium. Debezium is built on top of the Apache Kafka Connect API and provides connectors that monitor specific databases.
It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
The architecture will look something like this:
MySQL --> Debezium(Kafka Connect Plugin) --> Kafka Topic --> HDFS Sink
You can find more information and documentation about Debezium here.
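For illustration, here is a minimal sketch of registering a Debezium MySQL source connector through the Kafka Connect REST API. All host names, ports, credentials, IDs and database names below are placeholders, and some property names differ between Debezium versions, so check the documentation for the release you deploy:
# sketch only -- all values are placeholders; POST to your Kafka Connect REST endpoint
curl -X POST -H "Content-Type: application/json" http://connect-host:8083/connectors -d '{
  "name": "mysql-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.server.name": "mysqlserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka-host:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}'
With a configuration like this, each table's changes land in a topic such as mysqlserver1.inventory.customers, which a sink connector or a Spark Streaming job can then consume.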
Furthermore, Apache NiFi has a processor named CaptureChangeMySQL. You can design a NiFi flow like the one below to do this:
MySQL --> CaptureChangeMySQL(Processor) --> PutHDFS(Processor)
You can read more about CaptureChangeMySQL here.
There are multiple solutions available; which one to choose depends on your architectural requirements and deployment setup.
Debezium:
Debezium is a distributed platform deployed via Apache Kafka Connect that continuously monitors the databases of your system and lets applications stream every row-level change in the same order in which it was committed to the database. It turns your existing databases into event streams, so applications can see and respond to each row-level change in the databases.
Kafka Connect is a framework and runtime for implementing and operating source connectors such as Debezium, which ingest data into Kafka, and sink connectors, which propagate data from Kafka topics into other systems.
In the case of MySQL, Debezium's MySQL connector monitors and records all of the row-level changes in the databases on a MySQL server. The events for each table are recorded in a separate Kafka topic, and client applications can read the Kafka topics that correspond to the database tables they are interested in and react to every row-level event they see in those topics as required.
Once the data of interest is available in topics, the Kafka Connect HDFS Sink connector can be used to export the data from Kafka topics to HDFS files in a variety of formats as per your use case, and it can integrate with Hive when that integration is enabled. The connector periodically polls data from Apache Kafka and writes it to HDFS. With Hive integration enabled, it also automatically creates an external partitioned Hive table for each Kafka topic and updates the table according to the data available in HDFS.
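As a rough sketch (not a drop-in configuration), the Confluent HDFS Sink connector with Hive integration could be registered along these lines. The topic name, URLs and metastore URI are placeholders, with the topic taken from the hypothetical Debezium example above:
# sketch only -- topic, URLs and metastore URI are placeholders
curl -X POST -H "Content-Type: application/json" http://connect-host:8083/connectors -d '{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "mysqlserver1.inventory.customers",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://metastore-host:9083",
    "schema.compatibility": "BACKWARD"
  }
}'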
Maxwell's daemon:
Maxwell's daemon is a CDC (change data capture) application that reads MySQL binlogs (events from the MySQL database) and writes row updates as JSON to Kafka or other streaming platforms. Once the data of interest is available in Kafka topics, the Kafka Connect HDFS Sink connector can be used to export the data from the Kafka topics to HDFS files.
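For example, a hedged sketch of starting Maxwell's daemon against a MySQL host and producing change events to Kafka; the hosts, credentials and topic name are placeholders, and flag names may vary across Maxwell releases:
# sketch only -- hosts, credentials and topic are placeholders
bin/maxwell --host=mysql-host --user=maxwell --password=secret \
  --producer=kafka \
  --kafka.bootstrap.servers=kafka-host:9092 \
  --kafka_topic=maxwell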
NiFi:
Apache NiFi helps in automating the flow of data between systems. The Apache NiFi CDC (change data capture) flow also uses MySQL binlogs (via CaptureChangeMySQL) to create a copy of a table and ensures that it is in sync with row-level changes to the source. This in turn can be fed into the NiFi PutHDFS processor to write the data to HDFS.

Adding Hive SerDe jar on SparkSQL Thrift Server

I have Hive tables that point to JSON files as their contents, and these tables need a JSON SerDe jar (from here) in order to be queried. On the machine (or VM) hosting my Hadoop distro, I can simply execute the following in the Hive or Beeline CLI:
ADD JAR /<local-path>/json-serde-1.0.jar;
and then I am able to perform SELECT queries on my Hive tables.
I need to use these Hive tables as data sources for Tableau (installed on Windows, my host machine), so I start the Thrift server in Spark.
For Hive tables that do not contain JSON (and do not require the SerDe), Tableau can connect and read the tables easily.
When it comes to the Hive tables that contain JSON data, however, it looks like Tableau cannot find the Hive JSON SerDe jar, and I get the following error:
'java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
How do I add the Hive JSON SerDe jar so that Tableau can read the Hive JSON tables?
I am guessing you're using JDBC to connect Tableau to Hive.
When using the Hive shell, Hive bundles all the needed libraries (including the SerDe) from the Hive client and builds a jar that is distributed and executed on the cluster. Unfortunately, the JDBC server does not do that, so you'll have to manually install and configure the SerDe on all the nodes and put it on the classpath of all the map/reduce nodes as well (copy the jar to all the nodes and add something like HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/location/of/your/serde.jar).
It may be necessary to restart YARN as well after that.
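A rough sketch of what that looks like in practice, assuming hypothetical node names and an /opt/serde path:
# copy the SerDe jar to every worker node (node names and paths are assumptions)
for host in node1 node2 node3; do
  scp /local/path/json-serde-1.0.jar $host:/opt/serde/json-serde-1.0.jar
done

# on each node, e.g. in hadoop-env.sh, extend the classpath
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/serde/json-serde-1.0.jar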
It is quite inconvenient, but that's how the JDBC driver works.
See https://issues.apache.org/jira/browse/HIVE-5275

Couchbase to local files export

I need to migrate Couchbase data into HDFS, but the DB and Hadoop clusters are not accessible to each other, so I cannot use Sqoop in the recommended way. Is there a way to import Couchbase data into local files (instead of HDFS) using Sqoop? If that is possible, I can do it, then transfer the local files using FTP and then use Sqoop again to transfer them to HDFS.
If that's a bad solution, is there any other way I can transfer all the Couchbase data to local files? Creating views on this Couchbase cluster is a difficult task and I would like to avoid it.
An alternative solution (perhaps not as elegant, but it works; a command-line sketch follows these steps):
Use the Couchbase backup utility, cbbackup, to save all data locally.
Transfer the backup files to a network host from which HDFS is reachable.
Install Couchbase in the network segment where HDFS is reachable and use the Couchbase restore-from-backup procedure to populate that instance.
Use Sqoop (in the recommended way) against that Couchbase instance, which has access to HDFS.
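A hedged sketch of the backup and restore steps above; the hosts, credentials, paths and bucket name are placeholders, and the exact cbbackup/cbrestore options depend on your Couchbase version:
# back up all data from the source cluster to a local directory
cbbackup http://source-couchbase:8091 /backups/cb -u Administrator -p password

# transfer /backups/cb to a host in the HDFS-reachable network segment (scp, FTP, ...)

# restore the backup into the Couchbase instance installed in that segment
cbrestore /backups/cb http://target-couchbase:8091 -u Administrator -p password -b mybucket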
You can use the cbbackup utility that comes with the Couchbase installation to export all data to backup files. By default the backups are actually stored in SQLite format, so you can move them to your Hadoop cluster and then use any SQLite JDBC driver to import the data from each *.cbb file individually with Sqoop. I actually wrote a blog post about this a while ago; you can check it out.
To get you started, here's one of the many JDBC SQLite drivers out there.
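For example, a hedged sketch of importing one backup file with Sqoop through a SQLite JDBC driver; the internal table name, file names and paths are assumptions for illustration, and the SQLite JDBC jar has to be on Sqoop's classpath:
# sketch only -- the table name inside the .cbb file and all paths are assumptions
sqoop import \
  --connect jdbc:sqlite:/backups/cb/bucket-data-0000.cbb \
  --driver org.sqlite.JDBC \
  --table cbb_msg \
  --target-dir /user/etl/couchbase/part0000 \
  -m 1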
You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store the data in any file system you like. The Couchbase Kafka adapter uses the TAP protocol to push data to Kafka.
https://github.com/paypal/couchbasekafka

Export Data Dictionary of my database using MySQL Workbench CE?

I have a database on a server with around 60 tables, and now I want to export the data dictionary of the database (including table structures).
I can do that on my local machine, which has phpMyAdmin; however, I am not able to find a way to export it on the server using Workbench.
Can anyone help?
You may install db_doc.lua, a Lua script plugin for MySQL Workbench that generates data dictionaries, similar to those generated by DBDoc on MySQL Workbench Enterprise.
Download it from:
https://docs.google.com/file/d/0BxXM2ftdUPGeNkM4OGpiYmFxdFk/edit?pli=1
Plugin developer's website
http://tmsanchezdev.blogspot.mx/2013/11/reporte-actualizado-del-modelo-de-la.html
[EDITED]
It seems that Lua plugin support has been discontinued.
So I wrote a plugin in Python to generate data dictionaries.
It is available at: https://github.com/rsn86/MWB-DBDocPy

Does Sqoop export from HDFS to MySQL preserve partitions?

I have created a multi-node Hadoop cluster and installed Hive on it. On another remote machine I have installed MySQL.
I intend to export data stored in HDFS into the relational database MySQL. I researched how this can be done using Sqoop and found that I need to create a table in MySQL with the target columns in the same order as in Hive, and with the appropriate SQL types, and then use the sqoop export command.
My question is:
If the table is partitioned in Hive, and if while creating the table in MySQL I partition it accordingly, will the sqoop export command preserve the partitions?
My question is similar to sqoop export mysql partition. I want to know whether partitioning support has been added to Sqoop.
This will help me decide whether to go ahead and install Sqoop for the task or to use some custom Python scripts that I have written for it.
Thank you.
Sqoop works at the JDBC layer when talking to MySQL. It won't be aware of the underlying partitioning; MySQL will handle this as the records are inserted or updated.
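For illustration, a hedged sketch of such an export (the host, database, credentials, table and directory are placeholders); Sqoop simply issues inserts over JDBC, and MySQL routes each row to the appropriate partition of the target table:
# sketch only -- connection details, table and directory are placeholders
sqoop export \
  --connect jdbc:mysql://mysql-host/salesdb \
  --username etl --password secret \
  --table sales \
  --export-dir /user/hive/warehouse/salesdb.db/sales/dt=2019-01-01 \
  --input-fields-terminated-by '\001' \
  -m 4
One caveat to keep in mind: Hive stores partition column values in the directory path rather than in the data files, so make sure the rows you export actually contain the column that MySQL partitions on (for example by exporting one partition directory at a time and loading that column separately, or by materializing it in the data).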