Cassandra + Spark vs MySQL + Spark

I have to design a piece of software with a three-layer architecture:
A process that periodically polls a data source (such as an FTP server) and ingests the data into a database
A database
Spark for processing the data
My data is simple and fits perfectly in a single RDBMS table, but I could also store it in Cassandra. Periodically, I would need Spark to run some machine learning algorithms on the whole set of data.
Which database better suits my use case? In detail, I do not need to scale across multiple nodes, and I think the main underlying questions are:
Is simple querying (SELECT) on a simple table faster in Cassandra or in MySQL?
Does the Spark Cassandra Connector benefit from features of Cassandra that would make it faster than a JDBC connector to MySQL?

You can use MySQL if your data size is less than 2 TB. A SELECT on a MySQL table is more flexible than in Cassandra.
You should use Cassandra when your data storage requirement exceeds what a single machine can hold. Cassandra needs careful data modeling for each lookup or SELECT scenario.
You can use the approach suggested below for MySQL-Spark integration:
How to work with MySQL and Apache Spark?
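For reference, here is a minimal sketch of that kind of integration, reading a MySQL table into Spark through the built-in JDBC data source. The URL, table name, credentials, and partitioning bounds are placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    object MySqlSparkRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mysql-spark-example")
          .master("local[*]")
          .getOrCreate()

        // Read the table through Spark's built-in JDBC data source.
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/mydb")
          .option("dbtable", "measurements")
          .option("user", "spark")
          .option("password", "secret")
          // Partition the read on a numeric column so it is
          // parallelized across executors instead of one connection.
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load()

        df.printSchema()
        spark.stop()
      }
    }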

It all depends on the data: size, integrity, scale, schema flexibility, sharding, etc.
Use MySQL if:
Data size is small (single-digit TBs)
Strong consistency (ACID: Atomicity, Consistency, Isolation, Durability) is required
Use Cassandra if:
Data size is huge and horizontal scalability is required
Eventual consistency is acceptable (BASE: Basically Available, Soft state, Eventually consistent)
Flexible schema is needed
The application is distributed.
Have a look at this benchmarking article and this PDF.

I think it's better to use a SQL database such as MySQL; Cassandra should only be used if you need to scale your data to much bigger proportions and across many datacenters. The Java Cassandra JDBC driver is just a normal driver for connecting to Cassandra; it doesn't have any special advantages over other database drivers.
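Since the question specifically asks about the Spark Cassandra Connector (a separate component from the JDBC driver), here is a minimal sketch of reading a Cassandra table into a Spark DataFrame with it. The keyspace, table, and contact point are placeholders, and the spark-cassandra-connector package is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    object CassandraSparkRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cassandra-spark-example")
          .master("local[*]")
          // Contact point of the Cassandra cluster (placeholder).
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Read through the connector's data source; the resulting
        // DataFrame can feed the same ML pipeline as a JDBC read.
        val df = spark.read
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "mykeyspace")
          .option("table", "measurements")
          .load()

        println(df.count())
        spark.stop()
      }
    }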

Related

Near real-time sync from MySQL to HBase

I am currently facing an issue syncing data from MySQL to HBase. I need near real-time data sync from MySQL to HBase, and I need to merge multiple MySQL tables into one HBase table during the sync.
I tried Sqoop, but it looks like it cannot meet our requirements.
So are there any existing tools/libraries that can be used for my case, or any other solutions I can try with Spark?
Consider using Apache Phoenix on HBase. It will give you low-latency SQL queries (so it is suitable for OLTP and easy to use for OLAP) on data stored in HBase, so you don't have to worry about syncing. It also has NoSQL features, such as the ability to dynamically add columns at query time.
To satisfy your use case, you could run Phoenix for OLTP, and a second instance of Phoenix on a read replica to run table joins for OLAP.
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_read_replicas.html
Secondary replicas are refreshed at intervals controlled by a timer (hbase.regionserver.storefile.refresh.period), and so are guaranteed to be at most that interval of milliseconds behind the primary RegionServer.
This solution satisfies your requirements for OLTP, OLAP, and near real-time syncing while giving your transactional database scalability that you would not easily have with MySQL. Apache Phoenix also offers full integration with the Hadoop ecosystem so it will integrate well with your current analytics stack.
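To make the Phoenix side concrete, here is a minimal sketch of querying Phoenix over its JDBC driver. The ZooKeeper quorum and the table are hypothetical, and the Phoenix client JAR is assumed to be on the classpath:

    import java.sql.DriverManager

    object PhoenixQuery {
      def main(args: Array[String]): Unit = {
        // Phoenix connects through the ZooKeeper quorum of the HBase cluster.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
        val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
        try {
          val stmt = conn.createStatement()
          // Standard SQL over data stored in HBase.
          val rs = stmt.executeQuery(
            "SELECT id, payload FROM merged_events WHERE id < 100")
          while (rs.next()) {
            println(s"${rs.getLong("id")} -> ${rs.getString("payload")}")
          }
        } finally {
          conn.close()
        }
      }
    }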

Mechanism for extracting data out of Cassandra for loading into relational databases

We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale and how much load they would put on the Cassandra cluster itself while running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper is responsible for transferring one slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep growing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
As for the load on the Cassandra cluster, I am not certain, since I have not used the Cassandra connector with Sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, do it once a day at the time when traffic is lowest, so that if your Cassandra availability drops the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case Sqoop works too, because it can export to Hive directly. And once the data is in Hive, you can use the same cluster Sqoop runs on for your analytics jobs.
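If Spark is already part of your stack, another option worth sketching (hedged: the keyspace, tables, connection settings, and write mode here are placeholders, not a tested pipeline) is to read from Cassandra with the Spark Cassandra Connector and write to MySQL through Spark's JDBC sink:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object CassandraToMySql {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cassandra-to-mysql-export")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Read the source table from Cassandra via the connector.
        val source = spark.read
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "mykeyspace")
          .option("table", "events")
          .load()

        // Append to MySQL; restricting the read to new data (e.g. by a
        // date column) would keep a daily export incremental.
        source.write
          .mode(SaveMode.Append)
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/warehouse")
          .option("dbtable", "events")
          .option("user", "etl")
          .option("password", "secret")
          .save()

        spark.stop()
      }
    }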
In my experience there is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or CQL queries; all of them time out regardless of the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.

Hive layer on top of MySQL Cluster

Disclaimer: I am a newbie with respect to Hadoop and Hive.
We have set up a MySQL Cluster (version 7.2.5) which stores huge amounts of data. The rows run into millions and are partitioned based on MySQL's auto-sharding logic. Even though we are leveraging the Adaptive Query Localization (AQL) of Cluster 7.2, some of our queries have multiple joins and run for quite a few minutes, sometimes hours.
In this scenario, can I use Hive along with Hadoop to query the DB and retrieve the data? Will it make the querying faster? Does it duplicate the data in its file system? What are the pros and cons of this approach?
My intent is to use Hive as a layer on top of MySQL Cluster and use it for reads and writes from and to the MySQL Cluster DB. I do not have any transactions in my application. So is this really possible?
I think it is possible. The closest solution in this direction that I know of is http://www.hadapt.com/ by Daniel Abadi.
The idea of that solution is to have a local RDBMS on each node and to run the usual Hadoop MapReduce, with Hive on top of it, across these nodes.
In principle, if you do a smart Hive integration and push predicates down to the MySQL instances, it can give you some performance gains.
At the same time, you would have to do some serious hacking to make Hadoop aware of your sharding placement in order to preserve data locality.
Summarizing all of the above: it should be possible, but will require serious development.
At the same time, I am not aware of an out-of-the-box solution to run Hive over MySQL Cluster as-is.

Converting MySQL to NoSQL databases

I have a production database server running MySQL 5.1. We now need to build a reporting app that fetches data from the production database server. Since reporting queries across the entire database may slow it down, we are planning to switch to NoSQL. The whole system runs on the AWS stack, and we are planning to use DynamoDB. Kindly suggest ways to sync data from the production MySQL server to the NoSQL database server.
Just remember the simple fact that a NoSQL database of this kind is essentially a document database; it's really difficult to automatically convert a typical relational schema in MySQL to a good document design.
In NoSQL you have a single collection of documents, and each document will probably contain data that would sit in related rows across multiple tables. The advantage of a NoSQL redesign is that most data access becomes simpler and faster without requiring you to write complex join statements.
If you mechanically convert each MySQL table to a corresponding NoSQL collection, you won't really be taking advantage of a NoSQL DB: you'll end up loading many more documents, and thus making many more calls to the database than needed, losing the simplicity and speed of the NoSQL DB.
Perhaps a better approach is to look at how your applications use the MySQL database and go from there. You might then consider writing a simple utility script, knowing your MySQL database design well.
As the data in a NoSQL database like MongoDB, Riak, or CouchDB has a very different structure from a relational database like MySQL, the only way to migrate/synchronise the data is to write a job that copies the data from MySQL to the NoSQL database using SELECT queries, as stated on the MongoDB website:
Migrate the data from the database to MongoDB, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your MongoDB model using the language of your choice.
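A minimal sketch of such a job, assuming hypothetical orders/customers tables and document shape, with the MySQL JDBC driver and the MongoDB Java driver on the classpath:

    import java.sql.DriverManager
    import com.mongodb.client.MongoClients
    import org.bson.Document

    object MySqlToMongo {
      def main(args: Array[String]): Unit = {
        // Source: the relational database (connection details are placeholders).
        val mysql = DriverManager.getConnection(
          "jdbc:mysql://localhost:3306/shop", "etl", "secret")
        // Target: MongoDB (connection string is a placeholder).
        val mongo = MongoClients.create("mongodb://localhost:27017")
        val orders = mongo.getDatabase("shop").getCollection("orders")

        try {
          val stmt = mysql.createStatement()
          // Denormalize related rows into one document per order:
          // a single join flattens order and customer into each document.
          val rs = stmt.executeQuery(
            "SELECT o.id, o.total, c.name, c.email " +
            "FROM orders o JOIN customers c ON o.customer_id = c.id")
          while (rs.next()) {
            val doc = new Document("_id", rs.getLong("id"))
              .append("total", rs.getBigDecimal("total"))
              .append("customer", new Document("name", rs.getString("name"))
                .append("email", rs.getString("email")))
            orders.insertOne(doc)
          }
        } finally {
          mysql.close()
          mongo.close()
        }
      }
    }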
Depending on the quantity of your data, this could take a while to process.
If you have any other questions, don't hesitate to ask.

MySQL Cluster is a NoSQL technology?

Is MySQL Cluster a NoSQL technology? Or is it another way to use a relational database?
MySQL Cluster uses MySQL Servers as API nodes to provide SQL access and a relational view of the data. The data itself is stored in the data nodes, which are separate processes. The fastest way to access the data is through the C++ API (NDB API); in fact, that is how the MySQL Server gets to the data.
There are a number of NoSQL access methods for getting to the data (that avoid going through the MySQL Server/relational view), including REST, Java, JPA, LDAP and, most recently, the Memcached key-value store API.
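As a hedged illustration of the key-value path, here is a sketch using a standard memcached client (spymemcached) against the Memcached API of MySQL Cluster. The host, port, and key layout are placeholders, and it assumes the NDB-memcached bridge has been configured to map keys onto a cluster table:

    import java.net.InetSocketAddress
    import net.spy.memcached.MemcachedClient

    object ClusterMemcachedAccess {
      def main(args: Array[String]): Unit = {
        // Connect to the memcached endpoint exposed by MySQL Cluster
        // (host/port are placeholders for your deployment).
        val client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211))

        // With the NDB-memcached bridge configured, this set/get pair
        // reads and writes the same rows the SQL API nodes see.
        client.set("user:42", 0, "alice")
        println(client.get("user:42"))

        client.shutdown()
      }
    }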
It is another way to use the database, by spreading it across multiple machines and allowing a simplified multi-master setup. It comes with a bit of a cost in that your indexes cannot exceed the amount of RAM available to hold them. To your application, it looks no different from regular MySQL.
Perhaps take a look at Can MySQL Cluster handle a terabyte database.