Disclaimer: I am a newbie with respect to Hadoop and Hive.
We have set up a MySQL Cluster (version 7.2.5) which stores huge amounts of data. The row counts run into the millions, and the data is partitioned based on MySQL Cluster's auto-sharding logic. Even though we are leveraging Adaptive Query Localization (AQL) in Cluster 7.2, some of our queries have multiple joins and run for quite a few minutes, sometimes hours.
In this scenario, can I use Hive along with Hadoop to query the DB and retrieve the data? Will it make the querying faster? Does it duplicate the data in its file system? What are the pros and cons of this type of approach?
My intent is to use Hive as a layer on top of MySQL Cluster and use it for reads and writes to the MySQL Cluster DB. I do not have any transactions in my application. So is this really possible?
I think it is possible. The closest solution in this direction that I know of is http://www.hadapt.com/ by Daniel Abadi.
The idea of that solution is to have a local RDBMS on each node and to run the usual Hadoop MapReduce, with Hive on top, against those nodes.
In principle, if you do a smart Hive integration and push predicates down to the MySQL instances, it can give you some performance gains.
At the same time, you would have to do some serious hacking to make Hadoop aware of your shard placement in order to preserve data locality.
Summarizing the above: it should be possible, but it will require serious development.
At the same time, I am not aware of any out-of-the-box solution for running Hive over MySQL Cluster as-is.
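As a point of reference, recent Hive releases (which arrived well after this answer was written) ship a JDBC storage handler that can expose a MySQL table as an external Hive table, with simple predicates pushed down to the database. A sketch in which the connection details and table schema are all placeholders:

```sql
-- Hypothetical mapping of a MySQL table into Hive via the JDBC storage handler.
-- Host, database, credentials and columns below are illustrative only.
CREATE EXTERNAL TABLE orders_mysql (
  order_id    BIGINT,
  customer_id BIGINT,
  total       DECIMAL(10,2)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url"      = "jdbc:mysql://mysqld-host:3306/shop",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table"         = "orders"
);
```

This does not solve the shard-placement/data-locality problem described above; it only gives Hive a SQL path into one MySQL endpoint.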
Related
Currently I am facing an issue syncing data from MySQL to HBase. I need near-real-time data sync from MySQL to HBase, and I need to merge multiple MySQL tables into one HBase table during the sync.
I tried Sqoop, but it looks like it cannot meet our requirements.
So are there any existing tools/libraries that can be used for my case, or any other solutions I could try with Spark?
Consider using Apache Phoenix on HBase. It will give you low-latency SQL queries (so it is suitable for OLTP and easy to use for OLAP) on data stored in HBase, so you don't have to worry about syncing. It also has NoSQL features such as the ability to add columns dynamically at query time.
To satisfy your use case, you could run Phoenix for OLTP, and a second instance of Phoenix on a read replica to run table joins for OLAP.
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_read_replicas.html
Secondary replicas are refreshed at intervals controlled by a timer (hbase.regionserver.storefile.refresh.period), and so are guaranteed to be at most that interval of milliseconds behind the primary RegionServer.
This solution satisfies your requirements for OLTP, OLAP, and near real-time syncing while giving your transactional database scalability that you would not easily have with MySQL. Apache Phoenix also offers full integration with the Hadoop ecosystem so it will integrate well with your current analytics stack.
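To make the dynamic-columns point concrete, Phoenix lets you declare a column inline in an UPSERT or SELECT without altering the table. The table and column names below are made up:

```sql
-- A Phoenix table with only the fixed columns declared up front.
CREATE TABLE event_log (
  event_id   VARCHAR PRIMARY KEY,
  event_type VARCHAR
);

-- Write a row with an extra, undeclared column supplied at write time.
UPSERT INTO event_log (event_id, event_type, gc_pause_ms INTEGER)
VALUES ('e1', 'gc', 47);

-- Read it back, declaring the dynamic column at query time.
SELECT event_id, gc_pause_ms
FROM event_log (gc_pause_ms INTEGER)
WHERE event_type = 'gc';
```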
I have to design a piece of software on a three layer architecture:
A process that periodically polls a data source, such as an FTP server, to inject data into a database
A database
Spark for the processing of the data
My data is simple and perfectly suitable for being stored in a single RDBMS table, or I could store it in Cassandra; periodically I would then need Spark to run some machine-learning algorithms on the whole data set.
Which database better suits my use case? In detail, I do not need to scale across multiple nodes, and I think the main underlying questions are:
Is simple querying (SELECT) faster on Cassandra or MySQL on a simple table?
Does the Spark connector for Cassandra benefit from features of Cassandra that make it faster than a SQL connector?
You can use MySQL if your data size is less than about 2 TB. SELECTs on a MySQL table are more flexible than in Cassandra.
You should use Cassandra when your data storage requirement exceeds what a single machine can hold. Cassandra needs careful data modeling for each lookup or SELECT scenario.
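To illustrate what query-driven data modeling means in practice: in Cassandra you typically create one table per access pattern and duplicate the data, rather than joining. The schema below is purely illustrative:

```sql
-- CQL: the same sensor readings stored twice, once per query pattern.

-- Pattern 1: all readings for one sensor, newest first.
CREATE TABLE readings_by_sensor (
  sensor_id  text,
  reading_ts timestamp,
  value      double,
  PRIMARY KEY (sensor_id, reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- Pattern 2: all readings for one day, across sensors.
CREATE TABLE readings_by_day (
  day        date,
  reading_ts timestamp,
  sensor_id  text,
  value      double,
  PRIMARY KEY (day, reading_ts, sensor_id)
);
```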
You can use the approach suggested below for MySQL/Spark integration:
How to work with MySQL and Apache Spark?
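The linked approach boils down to Spark's JDBC data source, which parallelizes a read by splitting a numeric column into one WHERE clause per partition. A simplified sketch of that partitioning logic in plain Python (the function name is mine; Spark's real implementation lives inside its JDBC relation provider):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] on `column` into WHERE clauses, one per partition.

    Mimics, in simplified form, how Spark's JDBC source turns
    partitionColumn/lowerBound/upperBound/numPartitions into parallel reads.
    """
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        if i == 0:
            # The first partition also catches values below lowerBound.
            pred = f"{column} < {start + stride}"
        elif i == num_partitions - 1:
            # The last partition catches everything from its start upward.
            pred = f"{column} >= {start}"
        else:
            pred = f"{column} >= {start} AND {column} < {start + stride}"
        predicates.append(pred)
    return predicates

preds = jdbc_partition_predicates("id", 0, 1000, 4)
for p in preds:
    print(p)
```

In PySpark you would hand such a list to `spark.read.jdbc(url, table, predicates=preds, properties=...)`, or let Spark derive it for you by passing `column`, `lowerBound`, `upperBound` and `numPartitions` directly.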
It all depends on the data: size, integrity, scale, schema flexibility, sharding, etc.
Use MySQL if:
Data size is small (single-digit TBs)
Strong consistency (atomicity, consistency, isolation, and durability) is required
Use Cassandra if:
Data size is huge and horizontal scalability is required
Eventual consistency is acceptable (BASE: basically available, soft state, eventual consistency)
A flexible schema is needed
The application is distributed.
Have a look at this benchmarking article and this pdf
I think it is better to use a SQL database such as MySQL; Cassandra should only be used if you need to scale your data to much larger volumes and across many datacenters. The Java Cassandra JDBC driver is just a normal driver for connecting to Cassandra; it doesn't have any special advantages over other database drivers.
I have a large MySQL database under heavy load and would like to replicate the data in this database to HBase in order to do analytical work on it.
edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.)
I've read that this can be done using Flume, with MySQL as a source (possibly the MySQL binlogs) and HBase as a sink, but I haven't found any detail, high- or low-level. What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how, or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
You are better off using Sqoop for this, IMHO. It was developed for exactly this purpose. Flume was made for rather different use cases, such as aggregating log data or data generated by sensors.
See this for more details.
So far there are three options worth considering:
Sqoop: After the initial bulk import, it supports two types of incremental import: append and lastmodified. That said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every one or two minutes.
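For concreteness, an incremental Sqoop import is driven by the `--incremental`, `--check-column` and `--last-value` flags. A small Python helper that assembles such a command line (the connection details, table and merge key are placeholders; Sqoop itself is not invoked here):

```python
def sqoop_incremental_cmd(jdbc_url, table, check_column, last_value):
    """Build the argv for a Sqoop `lastmodified` incremental import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--incremental", "lastmodified",  # re-import rows modified since last run
        "--check-column", check_column,   # timestamp column Sqoop compares against
        "--last-value", last_value,       # watermark saved from the previous run
        "--merge-key", "id",              # assumed primary key for merging updates
    ]

cmd = sqoop_incremental_cmd(
    "jdbc:mysql://db-host/app", "orders", "updated_at", "2015-01-01 00:00:00")
print(" ".join(cmd))
```

A scheduler (e.g. cron or Oozie) would run this periodically and persist the new `--last-value` between runs, which is exactly why it is periodic batch, not near-real-time.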
Trigger: This is a quick-and-dirty solution: add triggers to the source RDBMS and update your HBase tables accordingly. This one gives you real-time satisfaction, but you have to clutter the source DB with triggers. It might be OK as a temporary solution, but long term it just won't do.
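In practice the trigger approach usually means writing every change into a change-log table that a separate process ships to HBase. A MySQL sketch, with made-up table and column names:

```sql
-- Change-log table polled by the process that writes to HBase.
CREATE TABLE orders_changelog (
  change_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
  order_id   BIGINT NOT NULL,
  op         CHAR(1) NOT NULL,          -- 'I', 'U' or 'D'
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER orders_after_update
AFTER UPDATE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_changelog (order_id, op) VALUES (NEW.id, 'U');
END//
DELIMITER ;
```

Similar AFTER INSERT and AFTER DELETE triggers would be needed, which is exactly the clutter this option adds to the source DB.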
Flume: This one requires the most development effort. It doesn't need to touch the DB, and it doesn't add read traffic to the DB either (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to HBase, it also lets you do something with the data while it is streaming through the Flume pipe (e.g. transformation, notification, alerting, etc.).
We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself while running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This also scales easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:
Slice your dataset into different partitions.
Run a MapReduce job where each mapper is responsible for transferring one slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
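The slicing step can be illustrated with Cassandra's Murmur3 token range: split the full signed 64-bit range into one contiguous slice per mapper. A simplified sketch (real connectors also account for vnode ownership so that each slice is read from a local replica):

```python
# Bounds of Cassandra's Murmur3 partitioner token range (signed 64-bit).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_slices(num_mappers):
    """Split the full token range into contiguous slices, one per mapper."""
    total = MAX_TOKEN - MIN_TOKEN + 1
    step = total // num_mappers
    slices = []
    start = MIN_TOKEN
    for i in range(num_mappers):
        # The last slice absorbs any rounding remainder.
        end = MAX_TOKEN if i == num_mappers - 1 else start + step - 1
        slices.append((start, end))
        start = end + 1
    return slices

for lo, hi in token_slices(4):
    print(lo, hi)
```

Each mapper then issues range queries of the form `WHERE token(pk) >= lo AND token(pk) <= hi` against its slice, so doubling the mappers roughly doubles the parallel read throughput, resources permitting.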
As far as the load on the Cassandra cluster, I am not certain, since I have not used the Cassandra connector with Sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, do it once a day at a time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command and CQL queries; all of these methods time out regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.
Is MySQL Cluster a NoSQL technology? Or is it another way to use a relational database?
MySQL Cluster uses MySQL Servers as API nodes to provide SQL access and a relational view of the data. The data itself is stored in the data nodes, which are separate processes. The fastest way to access the data is through the C++ API (the NDB API); in fact, that is how the MySQL Server gets to the data.
There are a number of NoSQL access methods for getting at the data (which avoid going through the MySQL Server/relational view), including REST, Java, JPA, LDAP and, most recently, the Memcached key-value store API.
It is another way to use the database, spreading it across multiple machines and allowing a simplified multi-master setup. It comes with a bit of a cost: your indexes cannot exceed the amount of RAM available to hold them. To your application, it looks no different from regular MySQL.
Perhaps take a look at Can MySQL Cluster handle a terabyte database.