Near real-time sync from MySQL to HBase

Currently I am facing an issue syncing data from MySQL to HBase. I need near real-time data sync from MySQL to HBase, and I need to merge multiple MySQL tables into one HBase table during the sync.
I tried Sqoop, but it looks like it cannot meet our requirements.
So are there any existing tools/libraries that can be used for my case, or any other solutions I can try with Spark?

Consider using Apache Phoenix on HBase. It will give you low-latency SQL queries (so it is suitable for OLTP and easy to use for OLAP) on data stored in HBase, so you don't have to worry about syncing. It also has NoSQL features such as the ability to dynamically add columns at query time.
To satisfy your use case, you could run Phoenix for OLTP, and a second instance of Phoenix on a read replica to run table joins for OLAP.
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_read_replicas.html
Secondary replicas are refreshed at intervals controlled by a timer (hbase.regionserver.storefile.refresh.period), and so are guaranteed to be at most that interval of milliseconds behind the primary RegionServer.
This solution satisfies your requirements for OLTP, OLAP, and near real-time syncing while giving your transactional database scalability that you would not easily have with MySQL. Apache Phoenix also offers full integration with the Hadoop ecosystem so it will integrate well with your current analytics stack.
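For illustration, here is a minimal sketch of querying an HBase-backed Phoenix table from Python, assuming a Phoenix Query Server is running and the phoenixdb package is installed; the table and column names are made up for this example:

```python
# Minimal sketch: query an HBase-backed Phoenix table from Python.
# Assumes a Phoenix Query Server is reachable on port 8765 and that a table
# named ORDERS_MERGED (hypothetical) was created through Phoenix.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cur = conn.cursor()

# Low-latency point query (OLTP-style access).
cur.execute(
    "SELECT order_id, customer_id, total FROM ORDERS_MERGED WHERE order_id = ?",
    (42,),
)
print(cur.fetchone())

# Phoenix dynamic columns can be referenced at query time, e.g.:
#   SELECT order_id, promo_code FROM ORDERS_MERGED (promo_code VARCHAR) ...

conn.close()
```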

Related

Query data from databases on 2 different servers

I want to query data from 2 different database servers using MySQL. Is there a way to do that without having to create a federated database, as Google Cloud Platform does not support the Federated Engine?
Thanks!
In addition to @MontyPython's excellent response, there is a third, albeit a bit cumbersome, way to do this if by any chance you cannot use the Federated Engine and you also cannot manage replication between your databases.
Use an ETL tool to do the work
Back in the day, I faced a very similar problem: I had to join data from two separate database servers, neither of which I had any administrative access to. I ended up setting up Pentaho's ETL suite of tools to Extract data from both databases, Transform it (basically having Pentaho do a lot of work with both datasets) and Load it into my very own local database engine, where I ended up with exactly the merged and processed data I needed.
Be advised, this IS a lot of work (you have to "teach" your ETL tool what you need and, depending on what tool you use, it may involve quite some coding), but once you're done, you can schedule the work to happen automatically at regular intervals so you always have your local processed/merged data readily accessible.
FWIW, I used Pentaho's Community Edition, so it's free as in beer.
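If you end up scripting the ETL yourself instead of using a full suite, the same Extract/Transform/Load idea can be sketched in Python with pandas and SQLAlchemy; the hosts, tables, and join key below are placeholders:

```python
# Rough ETL sketch: extract from two MySQL servers, transform/join locally,
# and load the merged result into a local database. All connection details
# and table/column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

src_a = create_engine("mysql+pymysql://user:pass@server-a/db_a")
src_b = create_engine("mysql+pymysql://user:pass@server-b/db_b")
dst   = create_engine("mysql+pymysql://user:pass@localhost/merged")

# Extract
customers = pd.read_sql("SELECT id, name FROM customers", src_a)
orders    = pd.read_sql("SELECT id, customer_id, total FROM orders", src_b)

# Transform: join the two datasets on the customer id
merged = orders.merge(customers, left_on="customer_id", right_on="id",
                      suffixes=("_order", "_customer"))

# Load into the local database; schedule this script (e.g. with cron)
# to keep the merged copy reasonably fresh.
merged.to_sql("customer_orders", dst, if_exists="replace", index=False)
```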
You can achieve this in two ways, one you have already mentioned:
1. Use Federated Engine
You can see how it is done here - Join tables from two different server. This is a MySQL specific answer.
2. Set up Multi-source Replication on another server and query that server
You can easily set up Multi-source Replication using Replication channels
Check out their official documentation here - https://dev.mysql.com/doc/refman/8.0/en/replication-multi-source-tutorials.html
If you have an older version of MySQL where Replication channels are not available, you may use one of the many third-party replicators like Tungsten Replicator.
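As a rough sketch of what the channel setup looks like on the replica (MySQL 5.7+ syntax; GTID-based positioning is assumed, and all hosts and credentials are placeholders), the configuration could be driven from a small script:

```python
# Sketch: configure two replication channels on one replica so data from two
# source servers can be queried on a single box. Hosts/credentials are
# placeholders; you can run the equivalent statements in the mysql client.
# MASTER_AUTO_POSITION=1 assumes GTIDs are enabled on the sources.
# (Newer MySQL versions use CHANGE REPLICATION SOURCE TO / START REPLICA.)
import mysql.connector

replica = mysql.connector.connect(host="replica-host", user="admin", password="secret")
cur = replica.cursor()

for channel, source in (("channel_a", "server-a"), ("channel_b", "server-b")):
    cur.execute(
        f"CHANGE MASTER TO MASTER_HOST='{source}', MASTER_USER='repl', "
        f"MASTER_PASSWORD='replpass', MASTER_AUTO_POSITION=1 "
        f"FOR CHANNEL '{channel}'"
    )
    cur.execute(f"START SLAVE FOR CHANNEL '{channel}'")

replica.close()
```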
P.S. - There is nothing in MySQL like PostgreSQL's foreign data wrappers (FDW). Joins across servers are easily possible in other database management systems, but not in MySQL.

Cassandra + Spark vs MySQL + Spark

I have to design a piece of software on a three layer architecture:
A process periodically polling a data source, such as an FTP server, to inject data into a database
A database
Spark for the processing of the data
My data is simple and perfectly suitable for being stored in a single RDBMS table, or I can store it in Cassandra; then periodically I would need Spark to run some machine learning algorithms on the whole set of data.
Which of the databases better suits my use case? In detail, I do not need to scale across multiple nodes, and I think the main underlying questions are:
Is simple querying (SELECT) faster on Cassandra or MySQL on a simple table?
Does the Spark connector for Cassandra benefit from features of Cassandra that will make it faster than a plain SQL/JDBC connector?
You can use MySQL if the data size is less than 2 TB. SELECT queries on a MySQL table will be more flexible than in Cassandra.
You should use Cassandra when your data storage requirement grows beyond a single machine. Cassandra needs careful data modeling for each lookup/SELECT scenario.
You can use the approach suggested in this question for MySQL-Spark integration:
How to work with MySQL and Apache Spark?
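At its core that approach uses Spark's JDBC data source; here is a minimal PySpark sketch, where the URL, table, and credentials are placeholders and the MySQL JDBC driver jar has to be on the Spark classpath:

```python
# Minimal PySpark sketch: load a MySQL table over JDBC and query it.
# Connection details and the table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-spark-example").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "events")
      .option("user", "spark")
      .option("password", "secret")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())

df.groupBy("event_type").count().show()
```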
It all depends on the data: size, integrity, scale, flexible schema, sharding, etc.
Use MySQL if:
Data size is small (in single-digit TBs)
Strong consistency (Atomicity, Consistency, Isolation & Durability) is required
Use Cassandra if:
Data size is huge and horizontal scalability is required
Eventual consistency (Basically Available, Soft-state, Eventually consistent) is acceptable
Flexible schema
Distributed application.
Have a look at this benchmarking article and this pdf
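On the connector question above: the Spark Cassandra Connector exposes Cassandra tables as DataFrames and understands Cassandra's partitioning, which helps with data locality and predicate pushdown. A minimal sketch, with the keyspace and table names as placeholders and the connector package supplied via --packages:

```python
# Minimal PySpark sketch: read a Cassandra table through the
# Spark Cassandra Connector. Keyspace/table names are placeholders; the
# connector must be supplied via --packages or spark.jars.packages.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-spark-example")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="events")
      .load())

# Filters on partition key columns can be pushed down to Cassandra.
print(df.filter(df.event_type == "click").count())
```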
I think it's better to use an SQL database such as MySQL; Cassandra should only be used if you need to scale your data to much bigger proportions and across many datacenters. The Java Cassandra JDBC driver is just a normal driver to connect to Cassandra; it doesn't have any special advantages over other database drivers.

Replicating data from MySQL to HBase using Flume: how?

I have a large MySQL database under heavy load and would like to replicate the data in this database to HBase in order to do analytical work on it.
Edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.)
I've read that this can be done using Flume, with MySQL as a source, possibly via the MySQL binlogs, and HBase as a sink, but I haven't found any detail (high or low level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
You are better off using Sqoop for this, IMHO. It was developed for exactly this purpose. Flume was made for a rather different purpose, like aggregating log data, data generated from sensors, etc.
See this for more details.
So far there are three options worth considering:
Sqoop: After the initial bulk import, it supports two types of incremental import: append and lastmodified. That being said, it won't give you real-time or even near real-time replication. It's not because Sqoop can't run that fast; it's because you don't want to plug a Sqoop pipe into your MySQL server and pull data every 1 or 2 minutes.
Trigger: This is a quick-and-dirty solution: add triggers to the source RDBMS and update your HBase accordingly. This one gives you real-time satisfaction, but you have to mess with the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: This one will need the most development effort. It doesn't need to touch the DB, and it doesn't add read traffic to the DB either (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, you can also do something with the data while it is streaming through your Flume pipe (e.g. transformation, notification, alerting, etc.).
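To make the log-tailing idea concrete outside of Flume, the same change-data-capture pattern can be sketched in Python with the python-mysql-replication and happybase packages; the hosts, table names, column family, and row-key scheme below are illustrative assumptions, and MySQL must run with row-based binlogging:

```python
# Sketch of the binlog-tailing (CDC) idea: stream MySQL row events and write
# them into HBase. Requires binlog_format=ROW on the MySQL side and an HBase
# Thrift server for happybase. The target table "orders_merged" with column
# family "d" and the id-based row key are illustrative assumptions.
import happybase
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent, UpdateRowsEvent

hbase = happybase.Connection("hbase-thrift-host")
table = hbase.table("orders_merged")

stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,                       # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    blocking=True,                       # keep tailing as new events arrive
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Inserts carry "values"; updates carry "after_values".
        values = row.get("values") or row.get("after_values")
        row_key = str(values["id"]).encode()
        table.put(row_key, {
            b"d:" + col.encode(): str(val).encode()
            for col, val in values.items()
        })
```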

Hive layer on top of MySQL Cluster

Disclaimer: I am a newbie w.r.t Hadoop and Hive.
We have set up a MySQL Cluster (version 7.2.5) which stores huge amounts of data. The rows run into the millions and are partitioned based on MySQL's auto-sharding logic. Even though we are leveraging Adaptive Query Localization (AQL) of Cluster 7.2, some of our queries have multiple joins and run for quite a few minutes, sometimes hours.
In this scenario, can I use Hive along with Hadoop to query the DB and retrieve the data? Will it make the querying faster? Does it duplicate the data in its file system? What are the pros and cons of this type of approach?
My intent is to use Hive as a layer on top of MySQL Cluster and use it for read/write from and to MySQL Cluster DB. I do not have any transactions in my application. So is this really possible?
I think it is possible. The closest solution in this direction known to me is Hadapt (http://www.hadapt.com/) by Daniel Abadi.
The idea of that solution is to have a local RDBMS on each node and run the usual Hadoop MR, and Hive on top of it, on these nodes.
In principle, if you do smart Hive integration and push down predicates to the MySQL instances, it can give you some performance gains.
At the same time, you would have to do some serious hacking to make Hadoop aware of your sharding placement in order to preserve data locality.
Summarizing all of the above: it should be possible, but it will require serious development.
At the same time, I am not aware of an out-of-the-box solution to run Hive over MySQL Cluster as-is.

MySQL Cluster is a NoSQL technology?

Is MySQL Cluster a NoSQL technology? Or is it another way to use a relational database?
MySQL Cluster uses MySQL Servers as API nodes to provide SQL access/a relational view to the data. The data itself is stored in the data nodes - which are separate processes. The fastest way to access the data is through the C++ API (NDB API) - in fact that is how the MySQL Server gets to the data.
There are a number of NoSQL access methods for getting to the data (that avoid going through the MySQL Server/relational view) including REST, Java, JPA, LDAP and, most recently, the Memcached key-value store API.
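As a tiny illustration of that key-value path, an ordinary memcached client can talk to the cluster's Memcached API once that access method is configured; the host and key below are placeholders:

```python
# Sketch: key-value access to MySQL Cluster data through its Memcached API,
# using an ordinary memcached client. Assumes the memcached API node is set up
# and mapped to the relevant table; host and key names are placeholders.
from pymemcache.client.base import Client

kv = Client(("ndb-memcached-host", 11211))
kv.set("user:42", b"alice")          # write goes into the data nodes
print(kv.get("user:42"))             # read comes straight from NDB, no SQL layer
```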
It is another way to use the database, spreading it across multiple machines and allowing a simplified concurrent-master setup. It comes with a bit of a cost in that your indexes cannot exceed the amount of RAM available to hold them. To your application, it looks no different than regular MySQL.
Perhaps take a look at Can MySQL Cluster handle a terabyte database.