I have Hive tables backed by JSON files, and these tables need the JSON SerDe jar (from here) in order to be queried. On the machine (or VM) hosting my Hadoop distro, I can simply execute this in the Hive or Beeline CLI:
ADD JAR /<local-path>/json-serde-1.0.jar;
and then I am able to perform SELECT queries on my Hive tables.
I need to use these Hive tables as data sources for Tableau (installed on Windows, my host machine), so I start the Thrift server in Spark.
For Hive tables that do not contain JSON (and do not require the SerDe), Tableau can connect and read the tables easily.
When it comes to the Hive tables that contain JSON data, however, it looks like Tableau cannot find the Hive JSON SerDe jar, and I get the following error:
'java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found)'.
How do I add the Hive JSON SerDe jar so that Tableau can read the Hive JSON tables?
I am guessing you're using JDBC to connect Tableau to Hive.
When you use the Hive shell, Hive bundles all the needed libraries (including the SerDe) from the Hive client, builds a jar, and distributes and executes it on the cluster. Unfortunately, the JDBC server does not do that, so you'll have to manually install and configure the SerDe on all the nodes and put it on the classpath of all the map/reduce nodes as well (copy the jar to all the nodes and add something like HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/location/of/your/serde.jar).
It may be necessary to restart YARN as well after that.
It is quite inconvenient, but that's how the JDBC driver works.
See https://issues.apache.org/jira/browse/HIVE-5275
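Once the jar is in place on the server nodes, a quick way to verify the setup over JDBC (before pointing Tableau at the Thrift server) is a plain Hive JDBC query. This is only a sketch, assuming org.apache.hive.jdbc.HiveDriver is on the classpath; the host, credentials, table name and jar path are hypothetical placeholders. Depending on your setup you may also be able to register the jar per session with ADD JAR, although the cluster-wide classpath above is the more reliable fix.

import java.sql.DriverManager

object ThriftServerSerDeCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical Thrift server host/port and credentials -- adjust for your cluster's auth.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "hive", "")
    try {
      val stmt = conn.createStatement()
      // Optional: register the SerDe jar for this session; the path refers to the server host.
      stmt.execute("ADD JAR /location/of/your/serde.jar")
      // If the SerDe is resolvable, this should no longer throw ClassNotFoundException.
      val rs = stmt.executeQuery("SELECT * FROM my_json_table LIMIT 5")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}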
Related
I have to make Kafka source and sink connectors for connecting Hive to MySQL.
I am not able to find anything on the above question. I have also looked at the Confluent website.
Hive has a JDBC driver, so try using the JDBC Source and Sink connectors for both ends.
Alternatively, Spark can easily read and write to both locations; Kafka is not necessary.
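For example, a minimal Spark job that copies a Hive table into MySQL over JDBC could look roughly like this (a sketch only: the table names, connection URL and credentials are hypothetical, and it assumes the MySQL JDBC driver and Spark's Hive support are on the classpath):

import org.apache.spark.sql.SparkSession

object HiveToMySql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-mysql")
      .enableHiveSupport()                 // requires a reachable Hive metastore
      .getOrCreate()

    // Hypothetical Hive table -- replace with your own.
    val df = spark.table("default.my_hive_table")

    // Hypothetical MySQL connection details -- replace with your own.
    df.write
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "my_mysql_table")
      .option("user", "user")
      .option("password", "password")
      .mode("append")
      .save()

    spark.stop()
  }
}

Reading from MySQL and writing to Hive is just the reverse: spark.read.format("jdbc")... followed by a write to a Hive table.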
I need a MySQL sink connector jar file, one which contains the SinkConnector.class file, to pull data from Kafka.
There's no such thing as a "MySQL sink connector jar".
There's Confluent's JDBC sink connector, which can be found on the Confluent Hub website. It can be used with most databases offering JDBC drivers, and it can be used in any Kafka environment, not only Confluent Platform/Cloud.
Then there's the MySQL JDBC driver JAR, which can be downloaded from the MySQL website... but different drivers exist for different server versions, so make sure you get the correct one.
I have installed CDH 5.16 on a RHEL 7 server and installed Kafka separately.
I am trying to load data from MySQL to HDFS or a Hive table in real time (a CDC approach). That is, if some data is updated or added in a MySQL table, it should be immediately reflected in HDFS or the Hive table.
The approach I have come up with: use Kafka Connect to connect to the MySQL server and push table data to a Kafka topic, then write consumer code in Spark Streaming that reads the data from the topic and stores it in HDFS (see the sketch below).
One problem with this approach is that the Hive table on top of these files has to be refreshed periodically for the updates to be reflected.
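A rough sketch of the consumer side, assuming Spark 2.x Structured Streaming with the spark-sql-kafka-0-10 package; the broker, topic and HDFS paths are hypothetical, and the value parsing will depend on what your source connector actually emits:

import org.apache.spark.sql.SparkSession

object MysqlCdcToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mysql-cdc-to-hdfs")
      .getOrCreate()

    // Hypothetical broker and topic -- replace with your own.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "mysql.mydb.mytable")
      .option("startingOffsets", "latest")
      .load()

    // Kafka values arrive as bytes; keep them as strings here and parse them
    // (JSON/Avro) according to whatever the source connector produces.
    val records = raw.selectExpr("CAST(value AS STRING) AS value")

    // Hypothetical HDFS output and checkpoint paths.
    val query = records.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/mytable")
      .option("checkpointLocation", "hdfs:///checkpoints/mytable")
      .start()

    query.awaitTermination()
  }
}

An external Hive table over hdfs:///data/mytable would still need the periodic refresh mentioned above to see newly written files.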
I also came to know of the Kafka-Hive integration in HDP 3.1. Unfortunately, I am using Hadoop 2.6.0, so I can't leverage this feature.
Is there any better way to achieve this?
I am using Hadoop 2.6.0 and CDH 5.16.1.
We have a working application with one application server and a 3-node Cassandra cluster. Recently we got a new requirement to import large CSV files into our existing database. Rows in the CSV need to be transformed before being saved in Cassandra. Our infrastructure is deployed on Amazon AWS.
We have a couple of questions:
It looks to us like Spark is the right tool for the job since it has the Spark Cassandra Connector and a Spark CSV plugin. Are we correct?
Maybe a newbie Spark question, but in our deployment scenario, where should the importer app be deployed? Our idea is to have the Spark master on one of the DB nodes, Spark workers spread across the 3 database nodes, and the importer application on the same node as the master. It would be perfect to have some command-line interface to import CSV files, which could later evolve into an API/web interface.
Can we put the importer application on the application server, and what would the network penalty be?
Can we use Spark in this scenario for Cassandra joins as well, and how can we integrate it with our existing application, which already uses the regular DataStax Java driver along with application-side joins where needed?
First of all, keep in mind that the Spark Cassandra Connector is only useful for data locality if you're loading your data from Cassandra, not from an external source. So, to load a CSV file, you'll have to transport it to your Spark workers using shared storage, HDFS, etc. This means that wherever you place your importer application, it will stream the data to your Spark workers.
Now to address your points:
You're correct about Spark, but incorrect about Spark Cassandra Connector, as it's only useful if you're loading data from Cassandra (which might be the case for #4 when you need to perform Joins between external data and Cassandra data), otherwise it won't give you any significant help.
Your importer application will be deployed to your cluster. In the scenario you described, this is a standalone Spark cluster, so you'll need to package your application and then use the spark-submit command on your master node to deploy it. With a command-line parameter for your CSV file location, you can deploy and run your application as a normal command-line tool (see the sketch after these points).
As described in #2, your importer application will be deployed from your master node to all your workers. What matters here is where your CSV file is. A simple way to deploy it is to split the file across your worker nodes (using the same local file path) and load it as a local file, but be aware that you'd lose the local CSV part if a node dies. For more reliable distribution, you can place your CSV file on an HDFS cluster and read it from there.
Using the Spark Cassandra Connector, you can load your data from Cassandra into RDDs on the corresponding local nodes, then join them with the RDDs you created by loading your CSV data, and of course write the result back to Cassandra if you need to. You can use the Spark Cassandra Connector as a higher-level tool for both the reading and the writing; you wouldn't need to use the Java driver directly (the connector is built on top of it anyway).
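To make points 2-4 concrete, here is a rough Scala sketch of such an importer. All names are hypothetical placeholders (the Cassandra host, keyspace/table names, and the join key), and it assumes the spark-cassandra-connector dependency is packaged with the application:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

object CsvImporter {
  def main(args: Array[String]): Unit = {
    // CSV location passed on the command line (local path on every worker, or an HDFS path).
    val csvPath = args(0)

    val spark = SparkSession.builder()
      .appName("csv-importer")
      .config("spark.cassandra.connection.host", "cassandra-host") // hypothetical host
      .getOrCreate()

    // 1. Load the CSV.
    val rows = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvPath)

    // 2. Apply whatever per-row transformation you need (placeholder shown here).
    val transformed = rows.withColumn("imported_at", current_timestamp())

    // 3. Write to an existing Cassandra table (hypothetical keyspace/table names).
    transformed.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .mode("append")
      .save()

    // 4. For joins against data already in Cassandra, load that table the same
    //    way and join it with the CSV-derived DataFrame (hypothetical join key "id").
    val existing = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "other_table"))
      .load()
    val joined = transformed.join(existing, Seq("id"))
    // ... write `joined` back with the same cassandra format if needed.

    spark.stop()
  }
}

You would then run it from the master node with something like spark-submit --master spark://<master>:7077 --class CsvImporter importer.jar /path/to/file.csv, which is the command-line workflow described in #2.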
I need to migrate Couchbase data into HDFS, but the database and Hadoop clusters are not accessible to each other, so I cannot use Sqoop in the recommended way. Is there a way to import Couchbase data into local files (instead of HDFS) using Sqoop? If that is possible, I can do that, transfer the local files using FTP, and then use Sqoop again to transfer them to HDFS.
If that's a bad solution, is there any other way I can dump all the Couchbase data into local files? Creating views on this Couchbase cluster is a difficult task, and I would like to avoid it.
Alternative solution (perhaps not as elegant, but it works):
Use the Couchbase backup utility, cbbackup, to save all data locally.
Transfer the backup files to a network host that can reach HDFS.
Install Couchbase in the network segment where HDFS is reachable and use the Couchbase restore-from-backup procedure to populate that instance.
Use Sqoop (in the recommended way) against that Couchbase instance, which has access to HDFS.
You can use the cbbackup utility that comes with the Couchbase installation to export all data to backup files. By default, the backups are actually stored in SQLite format, so you can move them to your Hadoop cluster and then use any SQLite JDBC driver to import the data from each *.cbb file individually with Sqoop. I actually wrote a blog post about this a while ago; you can check it out.
To get you started, here's one of the many JDBC SQLite drivers out there.
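If you want to sanity-check a backup file before writing the Sqoop job, a tiny JDBC probe can list the tables inside a *.cbb file. This is just a sketch and assumes an SQLite JDBC driver such as org.xerial:sqlite-jdbc on the classpath; the file path comes from the command line:

import java.sql.DriverManager

object InspectCbb {
  def main(args: Array[String]): Unit = {
    // Path to one *.cbb backup file, passed as the first argument.
    val conn = DriverManager.getConnection(s"jdbc:sqlite:${args(0)}")
    try {
      // sqlite_master is SQLite's built-in catalog of the tables in the file.
      val rs = conn.createStatement()
        .executeQuery("SELECT name FROM sqlite_master WHERE type = 'table'")
      while (rs.next()) println(rs.getString("name"))
    } finally {
      conn.close()
    }
  }
}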
You can use the Couchbase Kafka adapter to stream data from Couchbase to Kafka, and from Kafka you can store the data in any file system you like. The Couchbase Kafka adapter uses the TAP protocol to push data to Kafka.
https://github.com/paypal/couchbasekafka
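On the Kafka side, draining such a topic into a local file can be done with a plain consumer. This is only a sketch with a hypothetical broker, topic, and output file, using the standard kafka-clients API (2.x); how you handle the values depends on the format the adapter pushes:

import java.nio.file.{Files, Paths, StandardOpenOption}
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object TopicToFile {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // hypothetical broker
    props.put("group.id", "couchbase-dump")
    props.put("auto.offset.reset", "earliest")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("couchbase-topic"))  // hypothetical topic
    val out = Paths.get("couchbase-dump.jsonl")

    try {
      while (true) {
        // Append each record's value to the local file, one line per message.
        val records = consumer.poll(Duration.ofMillis(500)).asScala
        records.foreach { r =>
          Files.write(out, (r.value() + "\n").getBytes("UTF-8"),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)
        }
      }
    } finally {
      consumer.close()
    }
  }
}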