How to set up a Hadoop multi-node environment? - hadoop2

I want to create a Hadoop multi-node environment. I have 3 Ubuntu systems; one will be the master and the other 2 will be slaves. Do I have to configure Hadoop on all 3 nodes?
Later on, if I want to add components like Hive, Spark, or HBase, do I need to configure those components on all 3 nodes as well?

I am following the official docs:
Hadoop Cluster Setup
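For orientation, here is a minimal sketch of what a Hadoop 2.x multi-node configuration usually looks like. The hostnames (master, slave1, slave2) and the NameNode port are assumptions for illustration, not taken from the question:

```
# Minimal sketch, Hadoop 2.x. Hadoop is installed at the same path on all
# three nodes, and the same config files are copied to every node.

# etc/hadoop/core-site.xml -- every node points at the master's NameNode
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

# etc/hadoop/yarn-site.xml -- every node points at the master's ResourceManager
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>

# etc/hadoop/slaves -- on the master, lists the worker nodes (one per line)
slave1
slave2

# Then, on the master only:
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh    # starts DataNodes on the slaves over ssh
$HADOOP_HOME/sbin/start-yarn.sh   # starts NodeManagers on the slaves
```

As for the follow-up question, this pattern generally carries over: HBase runs a RegionServer daemon on every worker node, whereas Hive is a client plus a metastore rather than a per-node daemon and can live on a single node, and Spark on YARN only needs to be installed where jobs are submitted.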

Related

Convert single MariaDB to a Galera Cluster

I have a question that I don't know how to solve.
I currently have a server running MariaDB 10.1 on Debian 9.8. This server is for test environments, so it is not a real production environment.
Now I need to create a two-node Galera cluster. I already have the second server set up, but I don't know the steps to configure a Galera cluster starting from one node that already has databases with data in them and another node that is completely from scratch. What are the steps and what should be done?
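For reference, a hedged sketch of the usual procedure: bootstrap the cluster from the node that already holds the data, then start the empty node so it pulls a full state snapshot transfer (SST). The IP addresses, cluster name, and config path below are placeholders:

```
# Sketch for MariaDB 10.1 with its built-in Galera support.

# /etc/mysql/mariadb.conf.d/galera.cnf -- same on BOTH nodes except wsrep_node_address
[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = "test_cluster"
wsrep_cluster_address    = "gcomm://192.168.1.10,192.168.1.11"
wsrep_node_address       = "192.168.1.10"    # use .11 on the second node

# On the node that already holds the databases: bootstrap a new cluster
galera_new_cluster

# On the empty node: start MariaDB normally; it joins the cluster and
# receives a full state snapshot transfer (SST) of the existing data
systemctl start mariadb
```

Once the second node reports Synced (check SHOW STATUS LIKE 'wsrep_local_state_comment';), both nodes should hold the same data.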

Real-time update of data (CDC approach) from MySQL to HDFS or Hive table

I have installed CDH 5.16 on a RHEL 7 server and installed Kafka separately.
I am trying to load data from MySQL into HDFS or a Hive table in real time (a CDC approach). That is, if some data is updated or added in a MySQL table, it should be immediately reflected in HDFS or the Hive table.
The approach I have come up with:
Use Kafka Connect to connect to the MySQL server and push table data to a Kafka topic, then write consumer code in Spark Streaming that reads the data from the topic and stores it in HDFS (a sketch of the Kafka Connect step is below).
One problem with this approach is that the Hive table on top of these files has to be refreshed periodically for the updates to become visible.
I also came to know of the Kafka-Hive integration in HDP 3.1. Unfortunately, I am using Hadoop 2.6.0, so I can't leverage that feature.
Is there any better way to achieve this?
I am using Hadoop 2.6.0 and CDH 5.16.1.
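As a sketch of the first step, here is how a CDC source connector could be registered with Kafka Connect over its REST API. This uses the Debezium MySQL connector (not named in the question, but a common choice for true log-based CDC rather than polling); all hostnames, credentials, and table/topic names are placeholders:

```
# Sketch: register a Debezium MySQL source connector with Kafka Connect.
curl -X POST -H "Content-Type: application/json" \
  http://connect-host:8083/connectors -d '{
    "name": "mysql-cdc",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql-host",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "secret",
      "database.server.id": "184054",
      "database.server.name": "dbserver1",
      "table.whitelist": "inventory.customers",
      "database.history.kafka.bootstrap.servers": "kafka-host:9092",
      "database.history.kafka.topic": "schema-changes.inventory"
    }
  }'
# Row changes then appear on the topic dbserver1.inventory.customers,
# which the Spark Streaming consumer can read and write out to HDFS.
```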

Couchbase to local files export

I need to migrate Couchbase data into HDFS, but the DB and Hadoop clusters are not accessible to each other, so I cannot use Sqoop in the recommended way. Is there a way to import Couchbase data into local files (instead of HDFS) using Sqoop? If that is possible, I can do it, transfer the local files using FTP, and then use Sqoop again to move them into HDFS.
If that's a bad solution, is there any other way I can transfer all the Couchbase data into local files? Creating views on this Couchbase cluster is a difficult task and I would like to avoid it.
An alternative solution (perhaps not as elegant, but it works; see the sketch after these steps):
1. Use the Couchbase backup utility, cbbackup, to save all data locally.
2. Transfer the backup files to a network host that can reach HDFS.
3. Install Couchbase in the network segment where HDFS is reachable and use the Couchbase restore-from-backup procedure to populate that instance.
4. Use Sqoop (in the recommended way) against that Couchbase instance, which has access to HDFS.
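A hedged sketch of those four steps; hosts, credentials, and the bucket name are placeholders, and exact flags can differ between Couchbase versions:

```
# Steps 1-2: dump all buckets to local files, then move them across
cbbackup http://cb-host:8091 /tmp/cb-backup -u Administrator -p password
scp -r /tmp/cb-backup user@hdfs-side-host:/tmp/cb-backup

# Step 3: restore into a Couchbase instance that can reach the Hadoop cluster
cbrestore /tmp/cb-backup http://cb-hdfs-side:8091 -u Administrator -p password -b default

# Step 4: run Sqoop (with the Couchbase Hadoop connector installed) against it;
# the connector's DUMP pseudo-table exports every item in the bucket
sqoop import --connect http://cb-hdfs-side:8091/pools --table DUMP \
  --target-dir /user/hadoop/couchbase-dump
```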
You can use the cbbackup utility that comes with the Couchbase installation to export all data to backup files. By default, the backups are actually stored in SQLite format, so you can move them to your Hadoop cluster and then use any SQLite JDBC driver to import the data from each *.cbb file individually with Sqoop. I actually wrote a blog post about this a while ago; you can check it out.
To get you started, here's one of the many SQLite JDBC drivers out there.
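A sketch of what that Sqoop invocation might look like. Assumptions: the SQLite JDBC jar (e.g., the common xerial sqlite-jdbc driver, class org.sqlite.JDBC) has been dropped into Sqoop's lib/ directory, and "cbb_msg" is the table inside the backup file; confirm the actual table name first with the sqlite3 shell and .tables:

```
sqoop import \
  --driver org.sqlite.JDBC \
  --connect jdbc:sqlite:/backups/bucket-0000.cbb \
  --table cbb_msg \
  --target-dir /user/hadoop/couchbase/bucket-0000 \
  -m 1    # SQLite can't serve parallel splits; use a single mapper
```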
You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store the data in any file system you like. The Couchbase Kafka adapter uses the TAP protocol to push data to Kafka.
https://github.com/paypal/couchbasekafka

MySQL installation procedure in a Hadoop cluster

How do I install and configure MySQL in a Hadoop cluster (OS: CentOS) so that a database created on one node can be accessed from another node? Can anyone post the step-by-step installation and configuration procedure?
If you're looking for a distributed SQL solution, I would suggest you look at Cloudera Impala, Apache Spark SQL, or Amazon's Redshift.
MySQL is not designed to scale horizontally, but rather vertically.
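That said, if you just need a single MySQL instance reachable from every node (e.g., for a Hive metastore), here is a minimal sketch for CentOS. Package and service names vary between CentOS 6 (mysql-server/mysqld) and CentOS 7 (mariadb-server/mariadb), and the database, user, and host names below are placeholders:

```
# On the node that will host the database:
sudo yum install -y mysql-server
sudo service mysqld start
# (open TCP port 3306 in the firewall if one is enabled)

# Create a database and a user that may connect from the other nodes:
mysql -u root -p -e "
  CREATE DATABASE appdb;
  GRANT ALL PRIVILEGES ON appdb.* TO 'hadoop'@'%' IDENTIFIED BY 'secret';
  FLUSH PRIVILEGES;"

# From any other node in the cluster, verify remote access:
mysql -h db-node -u hadoop -psecret appdb -e "SHOW TABLES;"
```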

MySQL Cluster + Manager and NDB/J

I've been trying to set up a MySQL Cluster for a few days using the MySQL Cluster Manager on 3 Ubuntu nodes (3 identical VM instances with 1GB RAM each).
I've followed the MySQL Cluster Manager video on the MySQL site. There isn't much other documentation or many tutorials on it (probably because it's a commercial product).
I start the cluster and show the status, but the mysqld nodes never start; they just remain as "added". If I install mysql-server using "sudo apt-get install mysql-server", then I get the normal local server running and the nodes register as "started", but I can't see how to connect to the cluster rather than to the individual MySQL servers running on the mysqld nodes.
I'm also at a loss as to how the Java connector for MySQL Cluster is organised; there appear to be multiple libraries, so I don't even know which library I need or how to get them (are some created when compiling MySQL Cluster?). Could someone please explain how the connectors work to interact with NDB from Java and how to get them?
Thanks for any answers.
First of all, the official documentation for MySQL Cluster Manager can be found by navigating to the Cluster documentation on dev.mysql.com (called "MySQL Cluster Manager"). You are correct that MySQL Cluster Manager is commercial software although MySQL Cluster itself is available under a commercial or GPL license.
It sounds as though you've already configured the agents and have them running, so if you want to get a cluster up and running quickly, refer to this simple worked example of using MySQL Cluster Manager.
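For context, a condensed sketch of that worked example's flow in the mcm client; the host names, base directory, and package/cluster names are placeholders:

```
mcm> create site --hosts=host1,host2,host3 mysite;
mcm> add package --basedir=/opt/mysql/cluster cluster_pkg;
mcm> create cluster --package=cluster_pkg
       --processhosts=ndb_mgmd@host1,ndbd@host2,ndbd@host3,mysqld@host1,mysqld@host2
       mycluster;
mcm> start cluster mycluster;
mcm> show status -r mycluster;   -- per-process status: should end up "running"
```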
In terms of understanding why the MySQL Servers (mysqlds) are not starting up, there aren't many clues in your question, so we need to narrow it down (one reason could be that you have multiple mysqlds defined on the same host, all trying to use the default port, 3306).
To check what the manager has been doing, take a look in the file called mysql-cluster-manager.log. You can adjust the level of logging using the cluster manager configuration file.
To see what MySQL Cluster itself thinks has happened, check the directories storing the cluster data files (if you haven't overridden the defaults, this will be under /clusters/, and there you'll see a directory for each node in the cluster). The first one to check is ndb__cluster.log, along with the other logs you'll find in the "data" sub-directory of the id associated with the ndb_mgmd node. There will also be per-node log files, so check the mysqld_out.err and mysqld_out.log files stored in the data directories associated with the mysqld node-ids as well.
The most important point: do not use the mysqld that gets installed with "sudo apt-get install mysql-server", as that version will not be compatible with MySQL Cluster; always use the binaries that come with the MySQL Cluster tarball (if you're using Cluster Manager, that should be transparent to you anyway).
Note that if you want to get MySQL Cluster up and running on a single host without MySQL Cluster Manager, refer to the quick-start guide located on the MySQL Cluster download site (on mysql.com rather than E-Delivery).
For the Java access, try out this MySQL Cluster ClusterJ tutorial.
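As a hedged pointer for the library question: the ClusterJ jars ship inside the MySQL Cluster distribution itself, so nothing has to be compiled separately. A sketch, with install paths as assumptions (they depend on where the Cluster tarball was unpacked):

```
ls /usr/local/mysql/share/mysql/java/
#   clusterj.jar  clusterj-api.jar  ...

# Compile against the API jar; run with the runtime jar plus the directory
# holding the native NDB client library (libndbclient):
javac -classpath /usr/local/mysql/share/mysql/java/clusterj-api.jar:. Main.java
java -classpath /usr/local/mysql/share/mysql/java/clusterj.jar:. \
     -Djava.library.path=/usr/local/mysql/lib Main

# The application itself points at the management node, e.g. via the
# property com.mysql.clusterj.connectstring=host1:1186
```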