Spark built-in Hive MySQL metastore isn't being used

I'm using Apache Spark 2.1.1 and I have put the following hive-site.xml in the $SPARK_HOME/conf folder:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql_server:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>disable metastore schema version verification</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>${test.tmp.dir}/hadoop-tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://hadoop_namenode:9000/value_iq/hive_warehouse/</value>
    <description>Warehouse Location</description>
  </property>
</configuration>
When I start the Thrift server, the metastore schema is created in my MySQL DB, but it is not used; Derby is used instead.
I could not find any error in the Thrift server log file. The only thing that catches my attention is that it attempts to use MySQL at first (INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL) but then, without any error, uses Derby instead (INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY). This is the Thrift server log: https://www.dropbox.com/s/rxfwgjm9bdccaju/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-s-master.value-iq.com.out?dl=0
I don't have Hive installed on my system; I only intend to use the built-in Hive of Apache Spark.
I'm using mysql-connector-java-5.1.23-bin.jar, which is located in the $SPARK_HOME/jars folder.

As your hive-site.xml shows, you have not set a metastore service to connect to, so Spark uses the default: a local metastore service with a Derby DB backend.
In order to use a metastore service that has a MySQL DB as its backend, you have to:
Start the metastore service. You can have a look at the Hive metastore admin manual for how to start the service. Start your metastore service with the MySQL DB backend, using the same hive-site.xml, and add the following lines to make it listen on METASTORESERVER on port XXXX:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://METASTORESERVER:XXXX</value>
</property>
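For example, starting the service could look like the following (a minimal sketch, assuming a Hive distribution is available under $HIVE_HOME with this same hive-site.xml in its conf folder; note that Spark alone does not ship the standalone metastore service):
# Initialize the MySQL metastore schema once (Hive 2.x ships schematool)
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
# Start the metastore service on the chosen port (9083 is the conventional default)
$HIVE_HOME/bin/hive --service metastore -p 9083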
Let Spark know where the metastore service has started. That can be done using the same hive-site.xml you used when starting the metastore service (with the lines above added to it): copy this file into the configuration path of Spark, then restart your Spark Thrift server.
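A rough sketch of that second step (paths assume a standard Spark layout):
# Copy the updated hive-site.xml into Spark's conf folder
cp hive-site.xml $SPARK_HOME/conf/
# Restart the Thrift server so it picks up the new metastore URI
$SPARK_HOME/sbin/stop-thriftserver.sh
$SPARK_HOME/sbin/start-thriftserver.sh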

Related

MySQL metastore not working for Hive in Hadoop pseudo cluster

I want to execute SQL queries in Hive, and hence I used MySQL as the metastore. But while executing I get this error:
SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.
I looked around but can't find the solution.
The Hive shell is running, but while executing SQL queries I get the error.
I followed this tutorial: http://hadooptutorials.info/2017/09/15/part-2-install-hive/
hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveUser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hiveUser</value>
</property>
Just add the following property to resolve this issue:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
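This assumes a metastore service is actually listening on that port; a quick sanity check could look like this (commands are illustrative, not part of the original answer):
# Start the standalone metastore if it is not running yet
hive --service metastore &
# Confirm something is listening on the default metastore port
netstat -tlnp | grep 9083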

Hive: why is metastore_db created in my project folder?

I have put hive-site.xml in my spark/conf dir and configured it to connect to thrift://<user>:9083, and I am not using Derby. I have mysql-connector-jar inside the hive/lib folder. Still, every time I create a Hive table and store data, all the data is stored in metastore_db in my project directory instead of in hdfs://<user>:9000/user/hive/warehouse, so if I delete metastore_db the data is lost.
conf/hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://saurab:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>metadata is stored in a MySQL server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>/home/saurab/hadoopec/hive/lib/hive-serde-2.1.1.jar</value>
</property>
<property>
  <name>spark.sql.warehouse.dir</name>
  <value>hdfs://saurab:9000/user/hive/warehouse</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://saurab:9083</value>
  <description>URI for client to contact metastore server</description>
</property>
This is my thriftserver log. The MySQL server is running. So why is it still creating metastore_db and storing data there?
I would say you have made those changes in the Spark conf folder but not in the server one (at least not all of them).
Notice on the server log:
"metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY"
A common practice, instead of copying the config under spark/conf, is to add a symlink from there to /etc/hive/conf/hive-site.xml, to make sure client and server are using the same configuration.
My advice is to set up the server side correctly first (you also have a port conflict), test it with beeline, and only then start using it from Spark.
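A sketch of the symlink approach plus a beeline smoke test (paths and the JDBC URL are illustrative):
# Link Spark's conf to the server's hive-site.xml so both stay in sync
ln -s /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml
# Verify the server side works before involving Spark
beeline -u jdbc:hive2://saurab:10000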

Problems importing table data from MySQL to Hadoop HDFS on Ubuntu Server via Sqoop

I am testing importing data from MySQL to Hadoop running in pseudo-distributed mode under Ubuntu Server. It looks like the jobs are being submitted nicely, but at some point execution crashes and, strangely, the underlying user is logged out. After this, pretty much a full restart is required, in addition to some cleanup in HDFS. What I end up seeing in the namenode and datanode logs looks like:
2016-10-05 12:10:33,688 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
2016-10-05 12:10:33,681 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
I am using the following versions:
Ubuntu Server 16.04
Hadoop 2.7.3
Sqoop 1.4.6 (sqoop-1.4.6.bin__hadoop-2.0.4-alpha)
MySQL Server 5.7.15
MySQL Connector/J 5.1.39
The configurations under Hadoop are mostly defaults; for example, I haven't yet tuned any memory-, disk-, or CPU-related parameters.
Running the same exact scenario on Ubuntu Server 14.04 works fine.
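For reference, the import being tested has roughly this shape (database, table, and target path are illustrative; the exact original command is not shown here):
# Illustrative Sqoop import from MySQL into HDFS, single mapper
sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username sqoopuser -P \
  --table mytable \
  --target-dir /user/hadoop/mytable \
  -m 1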
Update
Hadoop configurations below.
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
hadoop-env.sh
The only thing changed is:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Get java.net.UnknownHostException when use hadoop-ha?

I got an exception when I execute the command sudo -u hdfs hdfs balancer -threshold 5.
Here is the exception:
RuntimeException: java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
Here is my core-site.xml.
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nameservice1</value>
</property>
Here is my hdfs-site.xml.
<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>nn1,nn2</value>
</property>
Can someone help me?
I ran into this problem when setting up HA. The problem is that I set dfs.client.failover.proxy.provider.mycluster based on the reference documentation. When I replaced mycluster with my nameservice name, everything worked!
Reference: https://issues.apache.org/jira/browse/HDFS-12109
You can try putting the port number in core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nameservice1:9000</value>
</property>
And make sure your machine's /etc/hosts file has an entry for nameservice1.
For example (say your machine's IP is 192.168.30.102):
127.0.0.1 localhost
192.168.30.102 nameservice1
Also set the failover proxy provider for your nameservice:
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

How to run HBase shell against a remote cluster

I'm running HBase in pseudo-distributed mode on my workstation. We also have HBase running on a cluster. Using the HBase shell, I'd like to access the HBase instance that's running on the cluster from my workstation. I would like to do this without logging into one of the cluster machines.
With Hadoop, you can run jobs on a remote cluster by specifying the -conf parameter and supplying an alternate version of hadoop-site.xml. Is there an equivalent for the HBase shell?
I'm running Cloudera CDH3u3 on my workstation and on the cluster machines.
Make changes to the following conf files.
For Hadoop: core-site.xml and mapred-site.xml.
For HBase: hbase-site.xml.
You could create multiple versions of these files and switch between them as needed.
I'm using the following command:
hbase --config "path to folder with config files" shell
The folder with the configuration should contain at least an hbase-site.xml with content such as:
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1,zk2,zk3</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <!-- or /hbase -->
    <value>/hbase-unsecure</value>
  </property>
</configuration>
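With that in place, launching the shell against the remote cluster might look like this (paths and folder names are illustrative):
# Keep one config folder per cluster and switch by pointing --config at it
mkdir -p ~/hbase-conf/remote-cluster
cp hbase-site.xml ~/hbase-conf/remote-cluster/
hbase --config ~/hbase-conf/remote-cluster shell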
Change hbase-site.xml to add the ZooKeeper host and port of the remote HBase server:
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk.hostname</value>
  </property>
</configuration>