How to use Spark DataFrame with MySQL

OK, I know that I can use the JDBC connector to create a DataFrame with this command:
val jdbcDF = sqlContext.load("jdbc",
  Map("url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
      "dbtable" -> "video"))
But I got this error: java.sql.SQLException: No suitable driver found for ...
I have also tried adding the JDBC jar to the Spark classpath with both of the following commands, but neither worked:
spark-shell --jars mysql-connector-java-5.0.8-bin.jar
SPARK_CLASSPATH=mysql-connector-java-5.0.8-bin.jar spark-shell
My Spark version is 1.3.0, and Class.forName("com.mysql.jdbc.Driver").newInstance works.

This happens because the DataFrame cannot find the MySQL Connector jar on the classpath. It can be resolved by adding the jar to the Spark classpath as below:
Edit /spark/bin/compute-classpath.sh as follows:
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR:yourPathToJar/mysql-connector-java-5.0.8-bin.jar"
Save the file and restart Spark.
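For reference, here is a minimal sketch of the same load once the connector jar is visible to the driver JVM. The URL, credentials, and table are the ones from the question; the explicit "driver" option is an assumption and may not be needed on every Spark 1.x release:

// Sketch only: start the shell with the connector on the driver classpath, e.g.
//   spark-shell --driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
// then the Spark 1.3-style load from the question should find the driver:
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
  "dbtable" -> "video",
  "driver" -> "com.mysql.jdbc.Driver"  // assumption: explicit driver class name
))
jdbcDF.printSchema()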

You might want to try mysql-connector-java-5.1.29-bin.jar

Related

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, hence I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
It gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help me sort this out? Thank you very much. Any clue is appreciated.
The correct way to do this would be to add the options (say, if you are starting a spark shell):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the Databricks csv jar directly, but I used a Netezza connector for Spark where they mention using this option:
https://github.com/SparkTC/spark-netezza
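For what it's worth, a hedged sketch of the same read from a Scala spark-shell started with the --packages option shown above (the HDFS path is the one from the question; the Scala version suffix of the package must match your Spark build):

// Assumes the shell was started with:
//   spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://dev-icg/user/spark/routes.dat")
df.printSchema()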

Adding Spark CSV dependency to Zeppelin

I'm running an EMR cluster with Spark on AWS.
Spark version is 1.6.
When running the following command:
proxy = sqlContext.read.load("/user/zeppelin/ProxyRaw.csv",
format="com.databricks.spark.csv",
header="true",
inferSchema="true")
I get the following error:
Py4JJavaError: An error occurred while calling o162.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at
http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
How can I solve this? I assume I should add a package but how do I install it and where?
There are many ways to add packages in Zeppelin:
One of them is to change the conf/zeppelin-env.sh configuration file, adding the package you need (e.g. com.databricks:spark-csv_2.10:1.4.0 in your case) to the submit options, since Zeppelin uses the spark-submit command under the hood:
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.4.0"
But let's say that you don't actually have access to that configuration. You can then use dynamic dependency loading via the %dep interpreter (deprecated):
%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")
This will require that you load the dependencies before launching or restarting the interpreter.
Another way is to add the dependency you need via the interpreter dependency manager, as described in the following link: Dependency Management for Interpreter.
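As an illustration (a sketch, not verified on EMR): once the dependency is loaded by one of the methods above, a Scala paragraph in the Zeppelin notebook can read the same file roughly like this (the path is the one from the question):

// In a %spark paragraph, after com.databricks:spark-csv_2.10:1.4.0 is on the interpreter's classpath
val proxy = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/user/zeppelin/ProxyRaw.csv")
proxy.show(5)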
Well,
First you need to download the CSV lib from the Maven repository:
https://mvnrepository.com/artifact/com.databricks/spark-csv_2.10/1.5.0
Check the Scala version that you are using, whether it is 2.10 or 2.11.
When you call spark-shell, spark-submit, or pyspark (or even Zeppelin), you need to add the --jars option with the path to your lib.
Like this:
pyspark --jars /path/to/jar/spark-csv_2.10-1.5.0.jar
Then you can call it as you did above.
You can see another closely related issue here: How to add third party java jars for use in pyspark

Hive mysql connector error

I have installed Hive and MySQL successfully, and I did the configuration for Hive as suggested in the link. But I see an error as below:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
..
..
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
So I added mysql-connector-java.jar to Hive's lib directory. Now Hive just hangs; I don't get the shell at all.
Kindly suggest how I can resolve it.
You need to add the MySQL connector to Hive's classpath. Hive is looking for that connector in your classpath and is not able to find it. Download the MySQL connector and put it in the following path:
/usr/lib/hive/apache-hive-0.13.0-bin/lib

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

I am using a shared node cluster:
Hadoop 2.5.0-cdh5.3.2
Please share the names of all compatible versions of the MySQL jar files to be loaded, and all the folder paths, for a successful import and export between HDFS and MySQL.
I am currently getting the below error message:
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
So far, I have loaded the MySQL connector and Hadoop jar files in /usr/lib/sqoop/lib:
mysql-connector-java-5.0.8-bin
hadoop-core-1.0.3
Please let me know if I need to add more files and specify the path.
If it is a version incompatibility issue, then you can try mysql-connector-java-5.1.31.jar; I am using mysql-connector-java-5.1.31.jar with Sqoop version 1.4.5, and for me it works for both data import and export use cases.

ClassNotFoundException: com.mysql.jdbc.Driver

I wrote a Java class to connect to a MySQL database and insert data into it on my CentOS PC (this Java file is called by the Asterisk program using AGI), but I got the below exception at runtime:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
How can I solve this?
OK, I solved my question. When I run the Asterisk program with the AGI server, I added the MySQL connector to the classpath as below:
java -classpath asterisk-java-0.3.jar:mysql-connector-java-5.1.18-bin.jar:. org.asteriskjava.fastagi.DefaultAgiServer
You need this jar file: mysql-connector-java-{version}.jar
e.g.
http://findjar.com/jar/mysql/mysql-connector-java/5.1.9/mysql-connector-java-5.1.9.jar.html